If you have a question not addressed here, please submit a Direct Support question. See Use the Direct Support tool to get help for information and instructions.
For important publication review process guidelines, see:
Data sharing for reproducibility and publication policy guide
You can reset your OpenVPN credentials using the credential manager.
Researcher Platform (the research tool) is a Jupyter notebook server configured to use specially developed Python libraries to access the URL Shares dataset. These libraries expose methods that allow you to explore the dataset through SQL queries run against Amazon Web Services' Athena API, where the complete differentially private dataset is stored.
Query results can be saved to the Jupyter notebook server, or further analyzed using popular data analysis libraries such as Pandas.
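For example, if you've saved a query result to a CSV file on the notebook server (the exact export mechanism may differ), a minimal sketch of loading it with Pandas might look like this:

import pandas as pd

# Hypothetical file name; use the path where your query results were saved
results = pd.read_csv('query_results.csv')
print(results.describe())  # quick summary statistics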
A complete tutorial is located here.
Researcher Platform uses Presto, a high-performance, distributed SQL query engine for big data. It is used extensively within Meta.
Because SQL syntax varies among engines, queries that you've executed against other databases may not run in this environment. You can refer to Presto's documentation for examples.
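For instance, Presto uses the || operator for string concatenation and offers functions such as approx_distinct() that some other engines lack. A sketch (the table name is a placeholder, following the examples elsewhere in this guide):

-- approx_distinct() is a Presto idiom that is much cheaper than
-- COUNT(DISTINCT ...) at this scale
SELECT c, approx_distinct(url_rid) AS approx_urls
FROM mytable
WHERE year_month = '2018-08'
GROUP BY c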
Some queries can take a long time to complete. Because the dataset contains over 10 trillion cells, computationally expensive queries are prone to long execution times. This guide highlights best practices for using the Athena API (part of the AWS backend) with data of this order of magnitude: Top 10 Performance Tuning Tips for Amazon Athena.
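As one illustration of those tips, projecting only the columns you need, filtering early, and limiting output while exploring all reduce the amount of data scanned (names below are placeholders):

SELECT url_rid, views          -- project only the columns you need
FROM mytable
WHERE c = 'US'                 -- filter to one country
  AND year_month = '2018-08'   -- and one month before scanning everything
LIMIT 1000                     -- cap output while exploring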
No, you may not scrape the downloaded URLs. Attempting to do so would violate the terms of service in the data use policy.
There are a number of popular pre-installed libraries for common tasks such as data processing, data analysis, machine learning, and data visualization, including dask, numpy, scikit-learn, pandas, and seaborn. You can also install Python packages using our Pip package manager class, and R packages using our Conda and CRAN package manager classes. See the Researcher platform features documentation for more information and instructions.
Differential privacy is a privacy-preserving technology we employ when releasing this dataset. Datasets are differentially private if no statistical result would change in a material way if any one person or action were included in or excluded from the data. Differential privacy provides a specific, quantifiable mathematical guarantee of privacy while still making it possible for researchers to learn from the data.
Ensuring differential privacy means adding noise to statistics or machine learning (ML) models in order to provide mathematical guarantees about how well any individual has been obscured in the data. This enables analysis of otherwise sensitive datasets without compromising the privacy of individuals. For general overviews of differential privacy, see Differential Privacy: A Primer for a Non-technical Audience and The Algorithmic Foundations of Differential Privacy.
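As a toy illustration only (not the exact mechanism used for this dataset), the classic Laplace mechanism adds noise with scale sensitivity/epsilon to a count:

import numpy as np

def laplace_count(true_count, epsilon, sensitivity=1.0):
    # Laplace mechanism: noise scale = sensitivity / epsilon;
    # smaller epsilon means more noise and stronger privacy
    return true_count + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)

print(laplace_count(120, epsilon=0.5))  # a noisy version of the count 120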
We are providing action-level privacy in this release, which guarantees users plausible deniability that they took any action summarized in the dataset. This means reidentification attacks that would reveal anything about the information users shared will be thwarted. The action-level differential privacy we impose also ensures a degree of user-level differential privacy.
The full URLs data is privacy protected by adding specially calibrated types of random noise to the data. From a statistical point of view, this is equivalent to data with measurement error. Measurement error induces biases in statistical analysis, including attenuation, exaggeration, switched signs, and incorrect uncertainty estimates in different situations. In most situations, ignoring the noise and pretending it won't matter is a mistake. We are implementing statistical methods developed especially to analyze these data without bias (see Georgina Evans and Gary King's Statistically Valid Inferences from Differentially Private Data Releases, with Application to the Facebook URLs dataset), and will make them available to all on the Facebook research tool.
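As a small, self-contained illustration of attenuation (with an assumed noise level, not the dataset's actual noise), adding measurement error to a regressor shrinks the naive slope estimate toward zero:

import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)                       # true covariate, variance 1
y = 2.0 * x + rng.normal(size=n)             # true slope is 2
x_noisy = x + rng.normal(scale=1.0, size=n)  # measurement error, variance 1

# OLS slope = cov(x, y) / var(x); with equal signal and noise variance,
# the naive estimate is attenuated by roughly a factor of 2
print(np.cov(x, y)[0, 1] / np.var(x))              # ~2.0
print(np.cov(x_noisy, y)[0, 1] / np.var(x_noisy))  # ~1.0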
To construct this dataset, we processed approximately an exabyte (1 quintillion bytes, or a billion gigabytes) of raw data from the Facebook platform -- including more than 50 terabytes per day of interaction metrics and more than 1 petabyte per day of exposure data (views). The dataset includes two tables: the URL Attributes table contains 37,983,353 URLs, and the Breakdowns table contains a total of 634,717,588,407 rows. There are over 10 trillion cell values in total.
Many researchers have requested descriptive statistics such as interquartile ranges, data summaries, deciles for each variable (ideally for each country), and the like. Unfortunately, the nature of differential privacy means that the number of queries must be limited and so we cannot satisfy all such requests and also guarantee privacy. We are instead looking into supplementing the dataset with some descriptive statistics that may be useful for all. We are also offering methods and software to enable the calculation of many of these statistics in unbiased ways from these privacy protected data.
A core resource is the working paper by Georgina Evans and Gary King titled Statistically Valid Inferences from Differentially Private Data Releases, with Application to the Facebook URLs Dataset. The paper's abstract describes "an easy-to-use, computationally efficient statistical approach for analyzing differentially private data that corrects for these biases. Researchers can use this approach as they would linear regression and receive statistically consistent and approximately unbiased estimates and standard errors." Additionally, libraries are being developed for direct use through the research tool.
This question comes up because aggregating an attribute at the level of the domain means summing, for each URL, over the gender [3] x age [7] x year-month [31] breakdown cells, and then over all the URLs belonging to that domain. It is reasonable to wonder whether the signal would still be meaningful, and whether you should consider using the average instead of the sum.
Since n (the sample size) is being disclosed, it doesn't matter whether you compute the variance of the sum or the variance of the mean (the mean, i.e., the sum divided by n, doesn't suffer from the problem above because the denominator has no noise). The two will be different, of course, but you can move from one to the other easily and, most importantly, one won't avoid the noise any more than the other.
So yes, the variance of the noise is larger when noisy sums are added (since the noise is independent: if X and Y are sums, Var(X+Y) = Var(X) + Var(Y)), but that does not mean that X+Y is less informative, since the true values of the variables are also larger. The uncertainty around a typical college professor's wealth is far smaller than the uncertainty around Bill Gates' wealth (he probably can't estimate his own wealth on any given day to within +/- 10 million dollars), but we still know that he's quite wealthy. The variances of large numbers of all kinds tend to be larger than the variances of small ones; that doesn't have to be true mathematically, but it is typically true empirically. Whether you should add variables together therefore depends on the signal more than on the noise: if all the variables you're adding are measures of the same thing, then adding them will strengthen the signal; if they are measures of different things, some of the signals will cancel out and you'll be able to see less clearly.
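A quick numerical check of Var(X+Y) = Var(X) + Var(Y) for independent noise (the noise scale here is arbitrary):

import numpy as np

rng = np.random.default_rng(0)
noise_x = rng.laplace(scale=10.0, size=1_000_000)  # Var = 2 * 10^2 = 200
noise_y = rng.laplace(scale=10.0, size=1_000_000)

# Independent noise adds in variance: expect roughly 200, 200, 400
print(np.var(noise_x), np.var(noise_y), np.var(noise_x + noise_y))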
Normally, you might do something like this:
sql = f"""
SELECT *
FROM (
    SELECT url_rid, SUM(views) AS sum_views
    FROM {database}.{table}
    WHERE c = 'US' AND year_month = '2018-08'
    GROUP BY url_rid, c
)
ORDER BY LN(1 - RAND()) / sum_views DESC
LIMIT 10000
"""
However, you have to take a different approach when views can be negative, as they can here after noise is added: a negative sum_views flips the sign of the sampling key, and a value of zero would divide by zero.
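One possibility (a sketch of a workaround, not necessarily the recommended pattern) is to floor the noisy weight at a small positive value with GREATEST so the sampling key stays well defined:

sql = f"""
SELECT *
FROM (
    SELECT url_rid, SUM(views) AS sum_views
    FROM {database}.{table}
    WHERE c = 'US' AND year_month = '2018-08'
    GROUP BY url_rid, c
)
ORDER BY LN(1 - RAND()) / GREATEST(sum_views, 1) DESC
LIMIT 10000
"""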
Georgina Evans and Gary King's paper provides a good primer on statistical analysis of differentially private datasets: Statistically Valid Inferences from Differentially Private Data Releases, with Application to the Facebook URLs Dataset.
Conducting inference on the dataset should not require sampling the data. The query system is designed to handle analysis on the full dataset.
If you need to do sampling (say for exploratory analysis), you can use the TABLESAMPLE operator to get a random sample.
For example, if you want a 20% sample of your table, your query will be something like:
SELECT id FROM mytable TABLESAMPLE BERNOULLI(20)
You can use the BERNOULLI or SYSTEM sampling versions with TABLESAMPLE.
This technique is typically discouraged, as we are not aware of any datasets where such a comparison would result in valid inferences. For example, it may seem sound to use CrowdTangle (CT) to extrapolate the true count of total public shares. That’s not possible however, because CT only includes posts from profile accounts if the account is verified AND following is turned on. It would almost certainly underestimate the true number of public shares.
Similar limitations exist in other datasets in aspects including (but not limited to) mobile versus desktop data, domain level fact checking, and so on.
The true value of shares is, of course, at least as large as shares without clicks. But after adding noise, the reported value of shares might end up less than the reported value of shares without clicks. We add noise to the data to maintain differential privacy; the noise helps protect individual users or groups from being identified through the data.
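A tiny simulation (the noise scale is arbitrary) shows how often independent noise can flip the ordering of two counts that truly differ by 10:

import numpy as np

rng = np.random.default_rng(0)
n = 100_000
noisy_shares = 50 + rng.laplace(scale=15.0, size=n)
noisy_shares_no_click = 40 + rng.laplace(scale=15.0, size=n)

# Fraction of draws in which the noisy ordering is reversed
print((noisy_shares < noisy_shares_no_click).mean())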
We have to include all possible rows for subsequent year-months, even if the true number of URL views or interactions was zero. We suggest focusing the analysis of a given URL on year-months where the number of interactions is likely to be non-zero. Most engagement with URLs occurs within the first month following the first post time, so a simple processing rule could be to include only year-months within the first month after each URL's first post time.
It should be possible to determine an ideal cutoff point by doing an analysis across all URLs in your dataset (something like URL engagement as a function of months since the first post time). This would avoid the need to select year-months to include for each URL individually and would help minimize the amount of statistical noise in the resulting analysis.
You can calculate the time difference between year_month and first_post_time as follows:
DATE_DIFF('day', DATE_PARSE(year_month, '%Y-%m'), FROM_UNIXTIME(first_post_time_unix))
You can then choose an appropriate number of days to use as a cutoff. For example, to use 31 days, put the following in the WHERE clause:
DATE_DIFF('day', DATE_PARSE(year_month, '%Y-%m'), FROM_UNIXTIME(first_post_time_unix)) BETWEEN 0 AND 31
There are still going to be some boundary issues regardless of the number of days used, but this would help with including data from the subsequent month for URLs first posted late in the month.
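Putting the pieces together, here is a sketch of a query that applies this cutoff (it assumes the table being queried, or a join, exposes both year_month and first_post_time_unix; names are placeholders):

sql = f"""
SELECT url_rid, SUM(views) AS sum_views
FROM {database}.{table}
WHERE DATE_DIFF('day',
                DATE_PARSE(year_month, '%Y-%m'),
                FROM_UNIXTIME(first_post_time_unix)) BETWEEN 0 AND 31
GROUP BY url_rid
"""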
For a global analysis of URLs, you can join the breakdowns table to the attributes table on url_rid, then filter to rows where c (breakdowns) is equal to public_shares_top_country (attributes).
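A sketch of that join (the table names here are placeholders; the actual names in your environment may differ):

sql = f"""
SELECT b.*
FROM {database}.breakdowns AS b
JOIN {database}.attributes AS a
    ON b.url_rid = a.url_rid
WHERE b.c = a.public_shares_top_country
"""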
You can use Direct Support to request that a new country be added to the URL Shares dataset. However, please note that adding a new country can take a month or more.