This tutorial demonstrates how query character limits can be reached. Armed with this knowledge, you will be better able to avoid getting blocked by these limits.
Suppose you would like to retrieve a large list of URLs. This example attempts to extract 20,000 URLs that belong to bbc.com.
Enter the following code in a new cell:
from fbri.private.sql.query import execute
import pandas as pd
import csv

database = "fbri_prod_private"
attributes_table = "erc_condor_url_attributes_dp_final_v3"
breakdowns_table = "erc_condor_url_breakdowns_dp_clean_partitioned_v2"
This code imports the execute function along with the pandas and csv libraries, and defines the database, attributes_table, and breakdowns_table variables. Running this code does not return any results.
Enter the following code in the next cell:
sql = f""" SELECT * FROM {database}.{attributes_table} WHERE parent_domain = 'bbc.com' LIMIT 20000 """ result = execute(sql, '20k_bbc_url_rids.tsv') df = pd.read_csv('20k_bbc_url_rids.tsv', delimiter = '\t', error_bad_lines=False) df
This code selects up to 20,000 rows from the database and attributes_table you specified in the first cell, keeping only rows that have bbc.com as the parent_domain, and saves them to a temporary file called 20k_bbc_url_rids.tsv. It also creates a dataframe df from that file with all "bad" lines skipped. In this example, some lines are considered "bad" because their number of columns does not match the number of columns in the header line.
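Note that error_bad_lines is deprecated in newer versions of pandas (1.3 and later) and removed in 2.0, where on_bad_lines replaces it. If your environment ships a newer pandas, an equivalent read would look like the following sketch:

# Equivalent read for newer pandas versions, where error_bad_lines has been replaced
# by on_bad_lines; 'skip' drops malformed lines, similar to error_bad_lines=False.
df = pd.read_csv('20k_bbc_url_rids.tsv', delimiter='\t', on_bad_lines='skip')
df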
The result of this code should look similar to the following example.
Simulated data
The values in the example table below are simulated. They do not represent actual data.
Enter the following code in the next cell:
url_rids = df.url_rid
url_rids
This code creates a Series called url_rids, derived from the dataframe df that you created earlier. The difference is that url_rids contains only the url_rid column.
At this point you have a list of 19485 url_rids. You might want to use this exact list to query matching records in the database.
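If you want to confirm the count in your own environment (the exact number depends on the data returned), a quick check in a new cell:

# Count how many url_rids were retrieved; the exact number depends on your data.
len(url_rids)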
If you run:
sql = f""" SELECT * FROM {database}.{attributes_table} WHERE url_rid IN {tuple(url_rids)} """ result = execute(sql, "query_overflow.tsv") df = pd.read_csv('chunk.tsv', delimiter = '\t', error_bad_lines=False, encoding='utf-8', engine='python', quoting=csv.QUOTE_NONE) df
The query would fail and you would see a result similar to the following example (truncated for brevity):
The query fails with the error: queryString failed to satisfy constraint: Member must have length less than or equal to 262144. This error message appears after a long block of text and is not shown in the image above.
The message explicitly says that the character limit for query strings is 262144, so you'll need to modify the query to execute it successfully.
Each url_rid is 15 characters long. In the IN list, each one is also wrapped in quotes ('...') and followed by a comma delimiter, so a single url_rid consumes about 15 + 2 + 1 = 18 characters. With a character limit of 262144, the number of url_rids that a single query string can contain is roughly:

262144 / (15 + 2 + 1) ≈ 14,500

Of course, a query string does not consist of this list alone. You must also account for the characters in the actual SQL clauses, so it's safer to round this upper bound down to ~13,000.
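If it helps, you can reproduce this back-of-the-envelope calculation in a cell. The constant names below are just for illustration:

# Rough bound on how many url_rids fit in one query string.
# 262144 comes from the error message above; 15 + 2 + 1 accounts for each url_rid,
# its surrounding quotes, and the comma delimiter in the IN (...) list.
QUERY_CHAR_LIMIT = 262144
CHARS_PER_URL_RID = 15 + 2 + 1

QUERY_CHAR_LIMIT // CHARS_PER_URL_RID  # roughly 14,500; round down to ~13,000 for the SQL clauses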
You can modify the code to stay under this limit by running:
sql = f""" SELECT * FROM {database}.{attributes_table} WHERE url_rid IN {tuple(url_rids[:13000])} """ result = execute(sql, "query_overflow.tsv") df = pd.read_csv('query_overflow.tsv', delimiter = '\t', encoding='utf-8', error_bad_lines=False) df
The result should look similar to the following example.
Simulated data
The values in the example table below are simulated. They do not represent actual data.
We have two primary recommendations to work around this constraint, discussed in the following sections.
To retrieve breakdown data for specific URLs, try designing a query that leverages the available country and date partitions to keep the SQL query efficient, then subset the resulting dataframe on your specific list of URLs. The following code joins on url_rid so that the subset is possible:
sql = f""" SELECT parent_domain, {breakdowns_table}.* FROM {database}.{breakdowns_table} JOIN {database}.{attributes_table} ON {database}.{breakdowns_table}.url_rid = {database}.{attributes_table}.url_rid WHERE c='GB' AND year_month = '2019-04' AND parent_domain = 'bbc.com' """ result = execute(sql, "bbc_breakdowns.tsv") df = pd.read_csv('bbc_breakdowns.tsv', delimiter = '\t') df
The results of running this query should look similar to the following example.
Simulated data
The values in the example table below are simulated. They do not represent actual data.
As you can see, running this code joins the breakdowns_table and attributes_table tables and saves the results as a .tsv file.
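Before subsetting, you may want to confirm that the join returned what you expect. A quick optional inspection using standard pandas calls (not part of the original tutorial steps):

# Optional inspection of the joined result before subsetting.
print(df.shape)                  # rows and columns returned for this partition
print(df['url_rid'].nunique())   # distinct url_rids available to match against your list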
Enter the following code in the next cell:
subset_df = df[df['url_rid'].isin(url_rids)]
subset_df
The result of running this code is shown in the following example.
Simulated data
The values in the example table below are simulated. They do not represent actual data.
unique_subset_url_rids = len(subset_df['url_rid'].unique())
unique_subset_url_rids
When you run this code, you obtain the resulting value 6924, which is saved to unique_subset_url_rids.
For the last cell in the workaround, enter:
unique_subset_url_rids/len(url_rids)
The result of running this code is the value 0.5445536767597327.
This approach captured a little over half of the URLs in one pass, and you can tailor the approach as needed.
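For example, one way to tailor it (a sketch only; the extra country code and month below are illustrative, and which partitions exist depends on your data) is to query several partitions, concatenate the results, and then subset on your url_rid list:

# Illustrative sketch: broaden coverage by querying several country/month partitions,
# concatenating them, and then subsetting on the url_rid list from earlier cells.
# The partition values below are examples only.
frames = []
for country in ['GB', 'US']:
    for month in ['2019-03', '2019-04']:
        sql = f"""
        SELECT parent_domain, {breakdowns_table}.*
        FROM {database}.{breakdowns_table}
        JOIN {database}.{attributes_table}
        ON {database}.{breakdowns_table}.url_rid = {database}.{attributes_table}.url_rid
        WHERE c='{country}'
        AND year_month = '{month}'
        AND parent_domain = 'bbc.com'
        """
        execute(sql, "bbc_breakdowns_partition.tsv")
        frames.append(pd.read_csv('bbc_breakdowns_partition.tsv', delimiter='\t'))

combined_df = pd.concat(frames)
combined_subset_df = combined_df[combined_df['url_rid'].isin(url_rids)]
combined_subset_df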
Alternatively, you can create a function to invoke the same query in chunks:
def query_by_url_rids(table, url_rids, chunk_size=10000):
    # Guard against chunk sizes that are too small to be efficient or too large
    # to stay under the 262144-character query string limit.
    if not 1000 < chunk_size < 13000:
        return "Chunk size out of bounds."

    # Sort the url_rids and split them into chunks of at most chunk_size elements.
    url_rids = sorted(url_rids)
    url_rid_chunks = [url_rids[i * chunk_size:(i + 1) * chunk_size]
                      for i in range((len(url_rids) + chunk_size - 1) // chunk_size)]

    chunk_dfs = []
    for page, url_rid_chunk in enumerate(url_rid_chunks):
        # Run one query per chunk and collect the results.
        sql = f"""
        SELECT *
        FROM fbri_prod_private.{table}
        WHERE url_rid IN {tuple(url_rid_chunk)}
        ORDER BY url_rid
        """
        result = execute(sql, "chunk.tsv")
        chunk_df = pd.read_csv('chunk.tsv', delimiter='\t', error_bad_lines=False,
                               encoding='utf-8', engine='python', quoting=csv.QUOTE_NONE)
        chunk_dfs.append(chunk_df)

    # Combine the per-chunk results into a single dataframe.
    df = pd.concat(chunk_dfs)
    return df
This code defines a function that sorts the url_rids, splits them into chunks, runs one query per chunk, and concatenates the per-chunk results into a single dataframe. Running this cell only defines the function; it does not return anything by itself.
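If you want to confirm that a given chunk_size keeps each query string under the limit, one optional check (not part of the function above) is to render the SQL for a single chunk and measure its length before running the full loop:

# Optional sanity check: render the SQL for the first chunk and measure its length.
# This reuses database, attributes_table, and url_rids from earlier cells; 262144 is
# the limit reported in the error message above.
first_chunk = tuple(sorted(url_rids)[:10000])
sql_preview = f"""
SELECT *
FROM {database}.{attributes_table}
WHERE url_rid IN {first_chunk}
ORDER BY url_rid
"""
len(sql_preview)  # should be comfortably below 262144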
To retrieve the records that match your url_rids, run the code below in a separate cell:
df = query_by_url_rids(attributes_table, url_rids)
df
The results should look similar to the following example.
Simulated data
The values in the example table below are simulated. They do not represent actual data.