Query limits

This tutorial demonstrates how queries can run up against the query string character limit. Armed with this knowledge, you will be able to avoid getting blocked by this limit.

Querying table limit

Suppose you want to retrieve a large list of URLs. This example attempts to extract 20,000 URLs that belong to bbc.com.

The following code imports the execute function along with the pandas and csv libraries. It also defines the database as fbri_prod_public and the table as erc_condor_url_attributes_dp_final_public_v3.

from fbri.public.sql.query import execute
import pandas as pd
import csv

database = "fbri_prod_public"
table = "erc_condor_url_attributes_dp_final_public_v3"

Running this cell does not return any output; it only imports the libraries and defines the database and table names.

The next section of code does the following:

  • Builds a SQL query against the database and table specified in the first cell.

  • Limits the result to 20,000 rows whose parent_domain is bbc.com.

  • Executes the query and saves the results to a temporary file, 20k_bbc_url_rids.tsv.

  • Reads that file into a DataFrame df, skipping any "bad" lines.

sql = f"""
SELECT *
FROM {database}.{table}
WHERE parent_domain = 'bbc.com'
LIMIT 20000
"""

result = execute(sql, '20k_bbc_url_rids.tsv')
df = pd.read_csv('20k_bbc_url_rids.tsv', delimiter = '\t', error_bad_lines=False)
df

The result of running this code should look similar to the example below. Some lines were considered "bad" because their number of columns did not match the number of columns in the header line (see the red shaded text in the result).
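If you want to see how many lines were dropped, one option (a rough sketch, assuming the file was saved to 20k_bbc_url_rids.tsv as above) is to compare the number of data lines in the file with the number of rows pandas kept:

# Rough check: count data lines in the saved TSV and compare with the rows
# pandas kept, to see how many "bad" lines were skipped
with open('20k_bbc_url_rids.tsv') as f:
    total_lines = sum(1 for _ in f) - 1  # subtract the header line

print(f"{total_lines} lines in file, {len(df)} rows loaded, {total_lines - len(df)} skipped")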

The next step creates a Series called url_rids derived from the DataFrame df created in the second cell. It contains the same rows, but only the url_rid column.

url_rids = df.url_rid
url_rids
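As an optional sanity check, you can confirm how many URL RIDs you extracted and how many of them are distinct:

# Optional: how many url_rids were extracted, and how many are unique
print(len(url_rids), url_rids.nunique())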

If you wanted to use this exact list of URL RIDs to query matching records in the database, you might try running this code:

sql = f"""
SELECT *
FROM {database}.{table}
WHERE url_rid IN {tuple(url_rids)}
"""

result = execute(sql, "query_overflow.tsv")
df = pd.read_csv('query_overflow.tsv', delimiter = '\t', error_bad_lines=False, encoding='utf-8', engine='python', quoting=csv.QUOTE_NONE)
df

The query would fail and you would see a result similar to this example (truncated for brevity):

The query fails with the error: queryString failed to satisfy constraint: Member must have length less than or equal to 262144. The error message appears after a long block of text, so it is not visible in the truncated example above.

The message explicitly indicates that the character limit for query strings is 262144, so you'll need to modify the query to execute it successfully.
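Before submitting a query like this, you can check how large the formatted query string actually is. This is only a quick sketch; it simply compares len(sql) with the 262144-character limit reported in the error:

# Rough check: is the formatted query string within the 262144-character limit?
print(len(sql), len(sql) <= 262144)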

Each url_rid is 15 characters long. With a character limit of 262144, the maximum number of url_rids a single query string can contain is the limit divided by the cost of each list element: the 15 characters of the url_rid itself, plus 1 character for the comma delimiter and 2 characters for the quotes ('...') around it.

262144/(15 + 1 + 2)

Of course, a query string would not consist of this list alone; you also have to account for the characters in the actual SQL clauses. So it's safer to round this upper bound down from 14,563 to about 13,000.
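If you prefer to measure rather than estimate, you can compute the exact length of the formatted IN list for a given number of url_rids (a quick sketch using the url_rids Series from earlier; the value of n is just an example):

# How long is the formatted IN list for the first n url_rids?
n = 13000
in_list = str(tuple(url_rids[:n]))
print(len(in_list))  # must stay safely below 262144 once the SQL clauses are added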

You can modify the code to satisfy this by running:

sql = f"""
SELECT *
FROM {database}.{table}
WHERE url_rid IN {tuple(url_rids[:13000])}
"""

result = execute(sql, "query_overflow.tsv")
df = pd.read_csv('query_overflow.tsv', delimiter = '\t', encoding='utf-8', error_bad_lines=False)
df

The result should look like this:

Workaround by chunking

You can create a function to invoke the same query in chunks. For example:

def query_by_url_rids(table, url_rids, chunk_size = 10000):
    # Keep each chunk comfortably below the ~13,000 url_rid bound derived above
    if not 1000 < chunk_size < 13000:
        return "Chunk size out of bounds."
    url_rids = sorted(url_rids)
    # Split the sorted url_rids into consecutive chunks of chunk_size
    url_rid_chunks = [url_rids[i * chunk_size:(i + 1) * chunk_size] for i in range((len(url_rids) + chunk_size - 1) // chunk_size)]
    chunk_dfs = []
    for page, url_rid_chunk in enumerate(url_rid_chunks):
        # Query only the url_rids in the current chunk
        sql = f"""
        SELECT *
        FROM fbri_prod_public.{table}
        WHERE url_rid IN {tuple(url_rid_chunk)}
        ORDER BY url_rid
        """
        result = execute(sql, "chunk.tsv")
        chunk_df = pd.read_csv('chunk.tsv', delimiter = '\t', error_bad_lines=False, encoding='utf-8', engine='python', quoting=csv.QUOTE_NONE)
        chunk_dfs.append(chunk_df)
    # Combine the per-chunk results into a single DataFrame
    df = pd.concat(chunk_dfs)
    return df

This function runs the same query in a loop, working through the supplied url_rids one chunk at a time and collecting the results. Defining the function does not produce any output by itself.

To retrieve the records that match your url_rids, you must call the function in a separate cell:

df = query_by_url_rids(table, url_rids)
df

The results should look similar to this example:
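If you want to keep the combined results for later analysis, you could write them back out to a TSV; the file name here is just an illustration:

# Optional: persist the combined chunked results (example file name)
df.to_csv('bbc_url_attributes.tsv', sep='\t', index=False)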