The URL Shares dataset is accessible via VPN through a dedicated Jupyter server environment (the Researcher Platform). In this environment, you can execute queries using the fbri.private interface. This library exposes an execute() method through the sql.query sub-module, which allows you to write SQL queries against the fbri_prod_private database and either retain the result in memory or write it out to a .tsv file. Output files remain on the Jupyter server.
See Researcher Platform for information about the many features available to you.
The private environment does not support the downloading of files from JupyterHub.
For a list of available database tables, refer to the Private environment data tables section.
We'll use the erc_condor_url_attributes_dp_final_v3 and erc_condor_url_breakdowns_dp_clean_partitioned_v2 tables to execute simple SQL queries. This tutorial is designed for the private environment.
Starting in your Jupyter environment, click the blue + button from the left menu and select Python3 under Notebook. This will create a new Jupyter notebook in a new browser tab. You can optionally rename the notebook.
If an empty cell is not already available, create a new cell by clicking + in the notebook's top menu bar.
Insert the code below into the newly created cell:
from fbri.private.sql.query import execute

database = "fbri_prod_private"
table = "erc_condor_url_attributes_dp_final_v3"

sql = f"""
SELECT *
FROM {database}.{table}
LIMIT 20
"""

result = execute(sql)
Click the play button to run the query.
Holding the raw result in memory works, but you might find the data easier to manipulate if you also write the result to a .tsv file and load that file as a pandas DataFrame.
Modify the code as shown below:
from fbri.private.sql.query import execute
import pandas as pd

database = "fbri_prod_private"
table = "erc_condor_url_attributes_dp_final_v3"

sql = f"""
SELECT *
FROM {database}.{table}
LIMIT 20
"""

result = execute(sql, "attributes.tsv")
df = pd.read_csv('attributes.tsv', delimiter='\t')
df
After clicking the play button, you will notice a newly created file, attributes.tsv, in the left pane. You might also notice that some rows contain negative values in the numeric columns. This is the result of the Gaussian noise that we have added to the original values in the data for the purposes of preserving privacy. Specifically, it's an implementation of user-level zero-concentrated differential privacy (zCDP). See Learn more for some references about differential privacy.
Simulated data
The values shown in the example outputs in this tutorial are simulated. They do not represent actual data.
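To build intuition for why noisy values can dip below zero, here is a minimal illustrative sketch of Gaussian noise addition. This is not the platform's actual zCDP mechanism; the counts and the noise scale below are made-up assumptions for illustration only.

import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical un-noised values (made up for this sketch)
true_counts = np.array([3, 12, 0, 7, 1])

# Assumed noise scale; a real mechanism calibrates this to a privacy budget
noise = rng.normal(loc=0.0, scale=5.0, size=true_counts.shape)
noisy_counts = true_counts + noise

print(noisy_counts)  # small true values can come out negative after noise is added

Because the noise is symmetric around zero, values that are small in the original data can easily become negative once noise is added.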
column_name.value_counts()
Now that we have our targeted data in df, we can use it for further analysis, such as counting the occurrences of each value in a column. In this example we are analyzing the parent_domain column. Insert the following code into a blank cell:
df.parent_domain.value_counts()
When run, the result should show the count of each parent domain found within df, ordered from most to least frequent.
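If proportions are more useful to your analysis than raw counts, value_counts() also accepts a normalize=True argument; for example:

df.parent_domain.value_counts(normalize=True)

This returns the fraction of rows belonging to each parent domain rather than the absolute count.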
We can also analyze other columns. In this example we analyze the public_shares_top_country column. Insert the following code into a blank cell:
df.public_shares_top_country.value_counts()
When run, the result should show how many rows of df fall under each value of public_shares_top_country, ordered from highest to lowest count.
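By default, value_counts() excludes missing values. If the column contains rows with no recorded top country, you can count those as well by passing dropna=False:

df.public_shares_top_country.value_counts(dropna=False)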
[['column_name']].describe()
We can further analyze the values of a column by using describe(). In this example we are analyzing the clean_url and parent_domain columns. Insert the following code into a blank cell:
df[['clean_url', 'parent_domain']].describe()
When you run this code, the result should be a summary of df showing, for clean_url and parent_domain, the count of non-null values, the number of unique values, the top (most frequent) value, and that value's frequency.
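When called on a whole DataFrame, describe() summarizes only the numeric columns by default. To summarize every column of df at once, numeric and non-numeric alike, you can pass include='all':

df.describe(include='all')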
You can change the column names to suit your analysis.
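For example, to run the same summary over a different column, such as public_shares_top_country, substitute the column name:

df[['public_shares_top_country']].describe()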