You can invoke aggregate functions. This example gets a count of the number of rows where the country (c) is US.
Run the following code to retrieve a part of the breakdown table:
from fbri.private.sql.query import execute import pandas as pd database = "fbri_prod_private" table = "erc_condor_url_breakdowns_dp_clean_partitioned_v2" sql = f""" SELECT * FROM {database}.{table} LIMIT 20; """ result = execute(sql, "breakdowns.tsv") df = pd.read_csv('breakdowns.tsv', delimiter = '\t') df
The results of running this code should look similar to the following example.
Simulated data
The values in the example table below are simulated. They do not represent actual data.
You can process and further analyze this table by running the following code in a new cell:
df.describe()
The result contains the count, mean, std, min, 25%, 50%, 75%, and max values derived from the values in the df
table's columns:
You can also process the table by retrieving value counts of specific columns. The query uses the format table.column_name.value_counts()
. Below are three examples for the year_month, age_bracket, and gender columns.
df.year_month.value_counts()
df.age_bracket.value_counts()
df.gender.value_counts()
The querying library uses the AWS Athena API. Some performance optimizations such as partitioning have been applied and you should leverage them when querying.
For example, "c" is a partition of the "country" attribute. So be sure to code WHERE c='US'
instead of WHERE country='US'
as shown here:
from fbri.private.sql.query import execute import pandas as pd database = "fbri_prod_private" table = "erc_condor_url_breakdowns_dp_clean_partitioned_v2" sql = f""" SELECT c AS country, COUNT(*) as count FROM {database}.{table} WHERE c = 'US' GROUP BY c""" result = execute(sql, "us_counts.tsv") df = pd.read_csv("us_counts.tsv", delimiter = '\t') df
This produces an aggregate statistic. You can also take a look at some of the rows that make up this population.
from fbri.private.sql.query import execute import pandas as pd database = "fbri_prod_private" table = "erc_condor_url_breakdowns_dp_clean_partitioned_v2" sql = f""" SELECT * FROM {database}.{table} WHERE c = 'US' LIMIT 20 """ result = execute(sql, "us_breakdowns.tsv") df = pd.read_csv("us_breakdowns.tsv", delimiter = '\t') df
Simulated data
The values in the example table below are simulated. They do not represent actual data.
You can perform more complex queries, including the combination of breakdown and attribute data on URLs. The following query retrieves 10,000 URLs that were most shared in the US, come from infowars.com or nytimes.com, and were reported as false news more than once:
from fbri.private.sql.query import execute import pandas as pd database = "fbri_prod_private" table = "erc_condor_url_breakdowns_dp_clean_partitioned_v2" sql = f""" SELECT a.false_news_usr_feedback, a.clean_url, b.views, b.clicks, b.shares, b.age_bracket, b.gender, b.political_page_affinity, b.year_month FROM {database}.{'ERC_CONDOR_URL_BREAKDOWNS_DP_CLEAN_PARTITIONED'} b JOIN {database}.{'ERC_CONDOR_URL_ATTRIBUTES_DP_FINAL'} a ON ( a.url_rid = b.url_rid AND a.public_shares_top_country = 'US' AND b.c = 'US' AND a.false_news_usr_feedback > 1 AND a.parent_domain in ('infowars.com', 'nytimes.com') ) WHERE a.public_shares_top_country = 'US' LIMIT 10000; """ result = execute(sql, "results.tsv") df = pd.read_csv('results.tsv', delimiter = '\t') df
Your results should look similar to the following example.
Simulated data
The values in the example table below are simulated. They do not represent actual data.
You can create a pair plot of the newly created table by running the following code in a blank cell:
sns.pairplot(df)
The resulting graphs should look similar to this:
To plot the number of views by date in a visual format we can use the pandas and matplotlib libraries. The example below first use pandas to convert year_month to pd.datetime and then increases the plot size. Finally, it creates a line plot.
# Convert year_month to pd.datetime from pandas.plotting import register_matplotlib_converters register_matplotlib_converters() df['date_time'] = pd.to_datetime(df['year_month']) # Increase plot size from matplotlib import pyplot dims = (12, 7) fig, ax = pyplot.subplots(figsize=dims) sns.lineplot(x="date_time", y="views", data=df, ax=ax)
Running this code should produce results similar to: