This tutorial demonstrates how query character limits can be reached. Armed with this knowledge, you will be better able to avoid getting blocked by these limits.
Suppose you would like to retrieve a large list of URLs. This example attempts to extract 20,000 URLs that belong to bbc.com.
Enter the following code in a new cell:
from fbri.private.sql.query import execute
import pandas as pd
import csv

database = "fbri_prod_private"
attributes_table = "erc_condor_url_attributes_dp_final_v3"
breakdowns_table = "erc_condor_url_breakdowns_dp_clean_partitioned_v2"
This code imports the execute function along with the pandas and csv libraries, and defines the database, attributes_table, and breakdowns_table variables. Running this code does not return any results.
Enter the following code in the next cell:
sql = f""" SELECT * FROM {database}.{attributes_table} WHERE parent_domain = 'bbc.com' LIMIT 20000 """ result = execute(sql, '20k_bbc_url_rids.tsv') df = pd.read_csv('20k_bbc_url_rids.tsv', delimiter = '\t', error_bad_lines=False) df
This code selects up to 20,000 rows from the database and attributes_table you specified in the first cell, keeping only rows that have bbc.com as the parent_domain, and saves them to a temporary file called 20k_bbc_url_rids.tsv. It also creates a dataframe df from that file with all "bad" lines skipped. In this example, some lines are considered "bad" because their number of columns does not match the number of columns in the header line.
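Note that error_bad_lines is deprecated in newer versions of pandas (1.3 and later) and removed in 2.0, where on_bad_lines replaces it. If your environment ships a newer pandas, an equivalent read would look like the following sketch:

# Equivalent read for newer pandas versions, where error_bad_lines has been replaced
# by on_bad_lines; 'skip' drops malformed lines, similar to error_bad_lines=False.
df = pd.read_csv('20k_bbc_url_rids.tsv', delimiter='\t', on_bad_lines='skip')
df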
The result of this code should look similar to the following example.
Simulated data
The values in the example table below are simulated. They do not represent actual data.
Enter the following code in the next cell:
url_rids = df.url_rid
url_rids
This code creates a Series called url_rids, derived from the dataframe df that you created earlier. The difference is that url_rids contains only the url_rid column.
At this point you have a list of 19485 url_rids. You might want to use this exact list to query matching records in the database.
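If you want to confirm the count in your own environment (the exact number depends on the data returned), a quick check in a new cell:

# Count how many url_rids were retrieved; the exact number depends on your data.
len(url_rids)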
If you run:
sql = f""" SELECT * FROM {database}.{attributes_table} WHERE url_rid IN {tuple(url_rids)} """ result = execute(sql, "query_overflow.tsv") df = pd.read_csv('chunk.tsv', delimiter = '\t', error_bad_lines=False, encoding='utf-8', engine='python', quoting=csv.QUOTE_NONE) df
The query would fail and you would see a result similar to the following example (truncated for brevity):
The query fails with the error: queryString failed to satisfy constraint: Member must have length less than or equal to 262144. This error message appears after a long block of text and is not shown in the image above.
The message explicitly says that the character limit for query strings is 262144, so you'll need to modify the query to execute it successfully.
Each url_rid is 15 characters long. In the IN list, each one is also wrapped in quotes ('...') and followed by a comma delimiter, so a single url_rid consumes about 15 + 2 + 1 = 18 characters. With a character limit of 262144, the number of url_rids that a single query string can contain is roughly:

262144 / (15 + 2 + 1) ≈ 14,500

Of course, a query string does not consist of this list alone. You must also account for the characters in the actual SQL clauses, so it's safer to round this upper bound down to ~13,000.
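If it helps, you can reproduce this back-of-the-envelope calculation in a cell. The constant names below are just for illustration:

# Rough bound on how many url_rids fit in one query string.
# 262144 comes from the error message above; 15 + 2 + 1 accounts for each url_rid,
# its surrounding quotes, and the comma delimiter in the IN (...) list.
QUERY_CHAR_LIMIT = 262144
CHARS_PER_URL_RID = 15 + 2 + 1

QUERY_CHAR_LIMIT // CHARS_PER_URL_RID  # roughly 14,500; round down to ~13,000 for the SQL clauses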
You can modify the code to stay under this limit by running:
sql = f""" SELECT * FROM {database}.{attributes_table} WHERE url_rid IN {tuple(url_rids[:13000])} """ result = execute(sql, "query_overflow.tsv") df = pd.read_csv('query_overflow.tsv', delimiter = '\t', encoding='utf-8', error_bad_lines=False) df
The result should look similar to the following example.
Simulated data
The values in the example table below are simulated. They do not represent actual data.
We have two primary recommendations to work around this constraint, discussed in the following sections.
To retrieve breakdown data for specific URLs, try designing a query that leverages the available country and date partitions to keep the SQL query efficient, then subset the resulting dataframe on your specific list of URLs. The following code joins on url_rid so that the subset is possible:
sql = f""" SELECT parent_domain, {breakdowns_table}.* FROM {database}.{breakdowns_table} JOIN {database}.{attributes_table} ON {database}.{breakdowns_table}.url_rid = {database}.{attributes_table}.url_rid WHERE c='GB' AND year_month = '2019-04' AND parent_domain = 'bbc.com' """ result = execute(sql, "bbc_breakdowns.tsv") df = pd.read_csv('bbc_breakdowns.tsv', delimiter = '\t') df
The results of running this query should look similar to the following example.
Simulated data
The values in the example table below are simulated. They do not represent actual data.
As you can see, running this code joins the breakdowns_table and attributes_table tables and saves the results as a .tsv file.
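Before subsetting, you may want to confirm that the join returned what you expect. A quick optional inspection using standard pandas calls (not part of the original tutorial steps):

# Optional inspection of the joined result before subsetting.
print(df.shape)                  # rows and columns returned for this partition
print(df['url_rid'].nunique())   # distinct url_rids available to match against your list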
Enter the following code in the next cell:
subset_df = df[df['url_rid'].isin(url_rids)]
subset_df
The result of running this code is shown in the following example.
Simulated data
The values in the example table below are simulated. They do not represent actual data.
unique_subset_url_rids = len(subset_df['url_rid'].unique())
unique_subset_url_rids
When you run this code, you obtain the resulting value 6924, which is saved to unique_subset_url_rids.
For the last cell in the workaround, enter:
unique_subset_url_rids/len(url_rids)
The result of running this code is the value 0.5445536767597327.
This approach captured a little over half of the URLs in one pass, and you can tailor the approach as needed.
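For example, one way to tailor it (a sketch only; the extra country code and month below are illustrative, and which partitions exist depends on your data) is to query several partitions, concatenate the results, and then subset on your url_rid list:

# Illustrative sketch: broaden coverage by querying several country/month partitions,
# concatenating them, and then subsetting on the url_rid list from earlier cells.
# The partition values below are examples only.
frames = []
for country in ['GB', 'US']:
    for month in ['2019-03', '2019-04']:
        sql = f"""
        SELECT parent_domain, {breakdowns_table}.*
        FROM {database}.{breakdowns_table}
        JOIN {database}.{attributes_table}
        ON {database}.{breakdowns_table}.url_rid = {database}.{attributes_table}.url_rid
        WHERE c='{country}'
        AND year_month = '{month}'
        AND parent_domain = 'bbc.com'
        """
        execute(sql, "bbc_breakdowns_partition.tsv")
        frames.append(pd.read_csv('bbc_breakdowns_partition.tsv', delimiter='\t'))

combined_df = pd.concat(frames)
combined_subset_df = combined_df[combined_df['url_rid'].isin(url_rids)]
combined_subset_df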
Alternatively, you can create a function to invoke the same query in chunks:
def query_by_url_rids(table, url_rids, chunk_size=10000):
    # Guard against chunk sizes that are too small to be efficient or too large
    # to stay under the 262144-character query string limit.
    if not 1000 < chunk_size < 13000:
        return "Chunk size out of bounds."

    # Sort the url_rids and split them into chunks of at most chunk_size elements.
    url_rids = sorted(url_rids)
    url_rid_chunks = [url_rids[i * chunk_size:(i + 1) * chunk_size]
                      for i in range((len(url_rids) + chunk_size - 1) // chunk_size)]

    chunk_dfs = []
    for page, url_rid_chunk in enumerate(url_rid_chunks):
        # Run one query per chunk and collect the results.
        sql = f"""
        SELECT *
        FROM fbri_prod_private.{table}
        WHERE url_rid IN {tuple(url_rid_chunk)}
        ORDER BY url_rid
        """
        result = execute(sql, "chunk.tsv")
        chunk_df = pd.read_csv('chunk.tsv', delimiter='\t', error_bad_lines=False,
                               encoding='utf-8', engine='python', quoting=csv.QUOTE_NONE)
        chunk_dfs.append(chunk_df)

    # Combine the per-chunk results into a single dataframe.
    df = pd.concat(chunk_dfs)
    return df
This code defines a function that sorts the url_rids, splits them into chunks, runs one query per chunk, and concatenates the per-chunk results into a single dataframe. Running this cell only defines the function; it does not return anything by itself.
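If you want to confirm that a given chunk_size keeps each query string under the limit, one optional check (not part of the function above) is to render the SQL for a single chunk and measure its length before running the full loop:

# Optional sanity check: render the SQL for the first chunk and measure its length.
# This reuses database, attributes_table, and url_rids from earlier cells; 262144 is
# the limit reported in the error message above.
first_chunk = tuple(sorted(url_rids)[:10000])
sql_preview = f"""
SELECT *
FROM {database}.{attributes_table}
WHERE url_rid IN {first_chunk}
ORDER BY url_rid
"""
len(sql_preview)  # should be comfortably below 262144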
To retrieve the records that match your url_rids, run the code below in a separate cell:
df = query_by_url_rids(attributes_table, url_rids)
df
The results should look similar to the following example.
Simulated data
The values in the example table below are simulated. They do not represent actual data.