Upload files to an S3 bucket

When you run queries in your Jupyter notebook, you can specify that results be saved as a .tsv or .csv file by including code like the following ("to_upload.tsv" is the file name for the saved results in this example).

from fbri.public.sql.query import execute
import pandas as pd

result = execute(sql, "to_upload.tsv")
df = pd.read_csv('to_upload.tsv', delimiter='\t')
df

Once saved, your files appear in the left pane of your notebook as shown below.

You can upload these saved files to your own database within the Researcher Platform-owned Amazon Simple Storage Service (Amazon S3) bucket right from a Jupyter notebook. Because you can't query your saved notebook files directly, uploading them is a good way to store your query results where you can perform further analysis, save subsets of data, and join them with other tables. You can also execute commands from your Jupyter notebook that upload files saved on your physical computer to an Amazon S3 bucket (this is not available for all products).
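For example, before uploading you might save just a subset of your results to a new file. Here is a minimal sketch, assuming the DataFrame df from the earlier example; the "score" column name is hypothetical and stands in for a column in your own data.

# Keep only the rows of interest and write them to a new .csv for upload.
# "score" is a hypothetical column name; substitute a column from your data.
subset = df[df["score"] > 0.5]
subset.to_csv("subset_to_upload.csv", index=False)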

Whether you upload a file saved in your Jupyter notebook or one saved on your physical computer, the process involves:

  • Agreeing to the terms and conditions for using the upload function.

  • Completing the file upload.

  • Creating a table in the S3 bucket and mapping the uploaded file to that table. You can then query and perform other operations on the table.

There is a 5 GB limit per upload.

A few notes about file formats:

  • Only .csv and .tsv files are supported.

  • When creating a .csv or .tsv file, the first row consists of the column names. This row must remain intact for the upload function to recognize the .csv or .tsv format.

  • Column data types are inferred from the first 100 rows of the file.
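For example, if you build an upload file yourself with pandas, write the header row so that the column names appear on the first line. Here is a minimal sketch; the file name, column names, and values are illustrative only.

import pandas as pd

# The first row of the file must contain the column names.
# index=False keeps pandas from writing an extra, unnamed index column.
df = pd.DataFrame({"user_id": [1, 2], "score": [0.25, 0.75]})
df.to_csv("my_data.tsv", sep="\t", index=False)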

Before you begin

Run the code shown below to import the UserDataManager library and create a new instance for the purpose of uploading your files.

from fbri.common.user_data_manager import UserDataManager
user = UserDataManager.get_instance()

"user" is a variable identifying the instance in this example. You can use any name you like.

Running this code does not return any results.

Step 1: Run the upload command

The upload command and output are slightly different depending on whether you are uploading a file saved in your Jupyter notebook or uploading a file saved on your physical computer.

Upload a file saved in your Jupyter notebook

Run the upload command shown below, replacing "user" with the library instance variable name you specified previously and "a.csv" with your file name.

user.get_uploads_file_cmd('a.csv')

The Upload Agreement is displayed, followed by a prompt (shown below) to either agree to or reject the agreement. To continue, type Agree (case sensitive) and press Return.

The system returns a message indicating that you have accepted the agreement and presents a code block for you to copy and paste into a new notebook cell. Copy the entire block (several lines) beginning with "curl", and preface it with an exclamation point (!) so Jupyter runs it as a shell command. Run the code.
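For illustration only, the pasted cell might look like the sketch below. The actual command, including the upload URL and any headers, is generated for you; the values here are placeholders.

# Hypothetical placeholder -- paste the exact block that Researcher Platform generates.
# Note the leading "!" so Jupyter runs the line as a shell command.
!curl --upload-file 'a.csv' 'https://example-presigned-upload-url'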

There is no feedback to tell you that the command was successful, but in the next step you will see how to verify it.

Upload a file saved on your physical computer

When you upload a file saved on your physical computer, you receive a command by email that you run on that computer to upload the file.

Start by running the upload command shown below, replacing "user" with the library instance variable name you specified previously and "a.csv" with your file name:

user.prompt_uploads_file("a.csv")

The Upload Agreement is displayed, followed by a prompt to either agree to or reject the agreement. Type Agree (case sensitive) and press Return.

Researcher Platform alerts you that an email has been sent to you, and indicates the email address to which it has been sent. On your local computer:

  1. Open the email and copy the command.
  2. Navigate to the directory where the file to be uploaded is located.
  3. Run the command in that directory to upload the file to the S3 bucket.

Step 2: Confirm your files have been uploaded

When your file has been successfully uploaded, it is listed as an Unmapped File until you create a table and map the file to that table.

To list uploaded files (mapped and unmapped), run the following command in a new Jupyter notebook cell:

user.list_database_file_mappings()

The output includes all unmapped files followed by mapped files, similar to the example shown below:

Notice that each file now has a unique number appended to the file name for identification purposes.

Step 3: Create a table and map it to an uploaded file

Use the user.create_table function to create a mapping using a table name that you specify and a filename from the Unmapped Files section of the list function output. The first parameter is the name of the table you want to create, and the second parameter is the name of the unmapped .csv or .tsv file, including the appended unique identifier.

In the example shown below, we create a table named "A1" and map it to an unmapped .csv file shown in the list function output (a.csv_1667943293):

user.create_table('A1', 'a.csv_1667943293')

The output confirms that the new table was successfully created:

Run the list command again to see the updated list of mappings:

user.list_database_file_mappings()

Here is the updated mappings output:

You can see a new entry for the table-to-file mapping you just created. Notice that the table name is not case sensitive ("A1" is the same as "a1").

In this sample output, public_user_10159518015719333 uniquely identifies the researcher. Your number will be unique to you, but it will be the same for all the files that you upload. This ensures that each researcher's files are maintained in a separate database in the S3 bucket.

Step 4: Query the new table

Now that the mapping is complete, you can run SQL queries on the table.

In this example, public_user_10159518015719333 identifies the researcher's database within the S3 bucket, and A1 is the name of the specific table being queried.

from fbri.public.sql.query import execute

database = "public_user_10159518015719333"
table = "A1"

sql = f"""
SELECT *
FROM {database}.{table} LIMIT 2
"""

result = execute(sql)
result
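Because the uploaded table lives in your database, you can also join it with other tables in your queries, as mentioned earlier. Here is a minimal sketch, assuming a second, hypothetical table named other_table exists in the same database and shares an id column with A1.

# "other_table" and the "id" join key are hypothetical; substitute real names.
sql = f"""
SELECT a.*, b.*
FROM {database}.{table} a
JOIN {database}.other_table b
  ON a.id = b.id
LIMIT 10
"""

result = execute(sql)
result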

Learn more

  • Amazon S3