Basic functionality

The URL Shares dataset is accessible via VPN through a dedicated Jupyter server environment (the Researcher Platform). In this environment, you can run queries using the fbri.public interface. This library exposes an execute() method through the sql.query sub-module, which primarily allows you to write SQL queries against the fbri_prod_public database, and retain the result in memory or write it out to a .tsv file. The output files reside on the Jupyter server.

See Researcher Platform for information about the many features available to you. The ability to upload files to an Amazon S3 bucket might be of particular interest.

In collaborative settings, you may wish to share query or analytical results. It is important that you review the Data sharing for reproducibility and publication policy guide to ensure you only download and share data in compliance with the Research Data Agreement. Data sharing is only allowed in the public environment.

Querying the database

The sample SQL queries included here use the erc_condor_url_attributes_dp_final_public_v3 table. For a list of available data tables, refer to Public environment data tables.

Executing SQL queries

Beginning in your Jupyter environment, click the blue + button from the left menu and select Python3 under Notebook. This creates a new Jupyter notebook in a new browser tab. You can optionally rename the notebook.

If an empty cell is not already available, create a new cell by clicking + in the top menu bar of the notebook.

Insert the following code into the empty cell:

from fbri.public.sql.query import execute

database = "fbri_prod_public"
table = "erc_condor_url_attributes_dp_final_public_v3"

sql = f"""
SELECT *
FROM {database}.{table}
LIMIT 20
"""

result = execute(sql)

Click the play button in the notebook's top menu bar to run the query.

While the raw result is still in memory, you might find it more manageable to load the file as a Pandas DataFrame and save it to a tab-separated values (.tsv) file by modifying the code as shown below:

from fbri.public.sql.query import execute
import pandas as pd
      
database = "fbri_prod_public"
table = "erc_condor_url_attributes_dp_final_public_v3"

sql = f"""
SELECT *
FROM {database}.{table}
LIMIT 20
"""
      
result = execute(sql, "attributes.tsv")
df = pd.read_csv('attributes.tsv', delimiter = '\t')
df

After clicking the play button, the newly created table (attributes.tsv) appears in the left pane:

Each record of the table represents one line of the text file as shown in the example below. The width of columns is adjustable.

Negative numeric values sometimes appear in the data.

This is the result of the Gaussian noise that we have added to the original values in the data for the purpose of preserving privacy. Specifically, it's an implementation of user-level zero-concentrated differential privacy (zCDP). See Learn more for some references about differential privacy.

Further analysis

column_name.value_counts()

We can use our targeted data df for further analysis. For example, we can further analyze the value of a column by displaying its count. In this example we'll use the parent_domain column. Insert the following code in a blank cell:

df.parent_domain.value_counts()

Running this query results in something similar to the example below, showing the count of each parent domain found within df.

The next example displays the count for the public_shares_top_country column. Insert the following code in a blank cell:

df.public_shares_top_country.value_counts()

Running this query results in something similar to the example below. The count of public shares by top country are separated by country and ordered from highest to lowest.

[['column_name']].describe()

We can further analyze the value of a column by using describe(). In this example we are analyzing the clean_url and parent_domain columns. Insert the following code in a blank cell:

df[['clean_url', 'parent_domain']].describe()

Running this query results in something similar to the example below showing count, unique count, top value, and frequency of the clean_url and parent_domain columns.

You can change column names to suit your analysis.