svinfer
We use a differential privacy framework to calibrate the level of noise that needs to be added into a dataset in order to protect users’ privacy. When you analyze privacy protected data, you must properly account for the added noise or your models could return biased results.
Standard statistical models usually assume there are no measurement errors in predictors, which makes them less appropriate for handling privacy protected data. We have made the Statistical Valid Inference (svinfer
) Python package available in the research tool for use in accounting for the measurement errors.
The svinfer
Python package provides a statistical model for privacy protected data in the form of estimates of the regression coefficients and of the variance-covariance matrix. The svinfer
package can handle big data, can process data in either memory or database, and does not need to load data into memory. It currently supports linear regression models and logistic regression (a type of generalized linear model), and we plan to add more models in the future.
There are some limitations to the svinfer
package. It currently does not support:
Nonlinear terms such as quadratic terms, ratios of two predictors, or a logarithm of predictors
Missing data
Categorical variables
Generalized linear models (except logistic regression)
Interactions
There are two versions of the svinfer
package. The open source version is called svinfer
and the JupyterHub version is called svinfer_ss1
. Both versions contain LinearRegression, DataFrameProcessor, and DatabaseProcessor. The only difference is that the svinfer_ss1
package contains an additional special processor designed for JupyterHub.
svinfer
The following example models the relationship between the number of shares and the number of likes in the URL data stored in the breakdowns table erc_condor_url_breakdowns_dp_clean_partitioned_v2
.
In the first cell, import LinearRegression and AthenaProcessor from the svinfer
libraries. Running this code alone does not return any results.
from svinfer.linear_model import LinearRegression from svinfer_ss1.processor import AthenaProcessor
The next section of code prepares for the linear regression model by getting the data from the database and table while also setting the x and y variables. This code can take over 20 minutes to complete.
database = "fbri_prod_private" table = "ERC_CONDOR_URL_BREAKDOWNS_DP_CLEAN_PARTITIONED_V2" data = AthenaProcessor(database, table) X = ['likes'] y = 'shares' X_s2 = [22.0 ** 2] model = LinearRegression(X, y, X_s2) model.fit(data)
Your results should look similar to:
Run the final section of code in a third cell. This code uses the statistical model created in the previous step and gathers Values for Beta
, Sigma squared
, Variance-Covarience Matrix
, and Standard Errors
values.
print('Values for Beta: {}'.format(model.beta)) print('Sigma squared: {}'.format(model.sigma_sq)) print('Variance-Covariance Matrix: {}'.format(model.beta_vcov)) print('Standard Errors: {}'.format(model.beta_standarderror))
The results should look like this:
This section presents a tutorial on how to use svinfer
for linear regression.
class SVILinearRegression( x_columns, y_column, x_s2, fit_intercept=True, df_corrected=False, n_replications=500, random_state=None)
SVILinearRegression fits a linear regression where some variables in the training data are subject to naturally occurring measurement error, and implements a computationally efficient algorithm for statistical inference in large datasets.
x_columns: list of strings: Specifies the column names for the features in the data.
y_column: string: Specifies the column name for the response in the data.
x_s2: list of doubles: Specifies the variance of noise that was added to each of the features. It’s length should be equal to that of x_columns. If no noise is added to a feature, use 0.0 in the corresponding place.
fit_intercept: bool, optional, default to True: Specifies whether to include the intercept into the model. If set to False, the design matrix will only consist of the features specified. If set to True, a column of 1 will be added to the design matrix besides features specified.
df_corrected: bool, optional, default to True: Specifies whether to adjust the degree of freedom when estimating the error variance. If set to False, the degree of freedom is n, where n is the sample size. If set to True, the degree of freedom is n - p, where p is the number of columns in the design matrix.
n_replications: int, optional, default to 500: Number of simulation replicates. Should be a positive integer.
random_state: int or None, optional, default to None: Determines random number generation for dataset creation. Pass an integer for reproducible output across multiple function calls.
beta: array of shape (n_features, ): Vector of regression coefficients. If fit_intercept is set to True, the first element of beta is the intercept, followed by other regression coefficients.
sigma_sq: double: Estimated error variance for the error term.
beta_vcov: array of shape (n_features, n_features): Estimated variance-covariance matrix of the regression coefficients.
beta_standarderror: array of the shape (n_features,): Standard errors of regression coefficients.
fit(self, data)
Fit the linear regression.
Parameter: data
The parameter data can be one of the following formats:
Return: self
import numpy as np import pandas as pd # generate independent variables n = 100000 np.random.seed(0) z1 = np.random.poisson(lam=7, size=n) z2 = np.random.poisson(lam=9, size=n) + 2 * z1 beta_true = [10, 12, -3] # generate y based on z1, z2 # add noise ~ N(0, 2^2) to z1 # add noise ~ N(0, 1^2) to z2 # generate training data x_s2 = [2 ** 2, 1 ** 2] np.random.seed(100) data = pd.DataFrame( { "y": 10 + 12 * z1 - 3 * z2 + 2 * np.random.standard_normal(size=n), "x1": z1 + np.random.standard_normal(size=n) * np.sqrt(x_s2[0]), "x2": z2 + np.random.standard_normal(size=n) * np.sqrt(x_s2[1]), } ) # fit y ~ x1 + x2 model = SVILinearRegression( ["x1", "x2"], # column names for explanatory variables "y", # column name for the response x_s2, # variances of the designed noises added to each explanatory variable fit_intercept=True, # include intercept in the model df_corrected=True, # use corrected degree freedom when estimating residual variance n_replications=10000, # replicate 10000 times when estimating vcov(beta) random_state=123, # set seed when simulating vcov(beta) ).fit(data) # check result print("beta_tilde is: \n{}".format(model.beta)) print("beta_tilde's standard error is:\n{}".format(model.beta_standarderror)) print("beta_tile's variance-covariance matrix:\n{}".format(model.beta_vcov)) print("estimated residual variance is {}".format(model.sigma_sq))
PySpark linear regression:
sklearn.linear_model. linearRegression | pyspark.ml.regression. LinearRegression | fort. SVILinearRegression | |
---|---|---|---|
Parameters | fit_intercept=True, normalize=False, copy_X=True, n_jobs=None | featureCol=’feature’, labelCol=’label’, predictionCol, maxIter=100, regParam=0.0, elesticNetParam=0.0, tol=1e-6, fitIntercept=True, standardization=True , solver=’auto’, weightCol=None, aggregationDepth=2, loss=’squaredError’, epsilon=1.35 | x_columns, y_column, x_s2, fit_intercept=True, df_corrected=False, n_replications=500, random_state-None |
Attributes | coef_, rank_, singular_, intercept_, | pyspark.ml.regressio n.LinearRegressionM odel coefficients, intercept, numFeatures, params, ... | beta, sigma_sq, beta_vcov, beta_standarderror |
Methods | fit(self, X, y[, sample_weight]) get_params(self[,dee p]) score(self, X, y[, sample_weight]) set_params(self) | fit(dataset, params=None) getFeatureCol() ... | fit(self, data) |
Example | reg = LinearRegression().it(X, y) print(reg.coef_) | model = LinearRegression(s ).fit(df) print(model.coeffici ents) | svi = SVILinearRegressi on([“x1”, “x2], “y”, [1, 1]).fit(data) print(svi.beta) print(svi.beta_vcov) |
This section is a sample Jupyter notebook showing how to use the logistic regression functions in svinfer
.
import numpy as np import pandas as pd from scipy import stats, special from fbri.private.sql.query import execute from svinfer_ss1.linear_model import LogisticRegression from svinfer_ss1.processor import AthenaProcessor, DataFrameProcessor
def simulate_data_category(n=10000, seed=123): np.random.seed(seed) z1 = np.random.normal(loc=1, scale=0.5, size=n) ** 2 z2 = np.random.normal(loc=0.5, scale=1, size=n) ** 2 return pd.DataFrame( { "x1": z1 + np.random.standard_normal(size=n) * 1, "x2": z2 + np.random.standard_normal(size=n) * 2, "y_binary": stats.bernoulli.rvs(special.expit(1 + 2 * z1 - 0.5 * z2)), } ) print("beta true is (1, 2, 0.5) in this setting. ")
## generate some in-memory data
data = simulate_data_category() print(data)
# prepare the in-memory data and fit the model df_data = DataFrameProcessor(data) df_model = LogisticRegression( ["x1", "x2"], "y_binary", [1.0 ** 2, 2.0 ** 2] ).fit(df_data)
# check the result print("Does the optimization success?:\n{}".format(df_model.success)) print("beta hat is: \n{}".format(df_model.beta)) print("standard error of the beta hat is: \n{}".format(df_model.beta_standarderror)) print("variance-covariance matrix of the beta hat: \n{}".format(df_model.beta_vcov))
## upload the same training data into Athena as table svinfer_test4 ## you can following the user uploader instruction here ## https://github.com/fb-researchtool-ss1/docs/blob/master/Resources/User%20Upload%20Documentation.pdf
print(execute("select * from private_user_3557305751000411.svinfer_test4"))
# prepare the athena data and fit the model athena_data = AthenaProcessor( "private_user_3557305751000411", "svinfer_test4" ) athena_model = LogisticRegression( ["x1", "x2"], "y_binary", [1.0 ** 2, 2.0 ** 2] ).fit(athena_data)
# check result print("Does the optimization success?:\n{}".format(athena_model.success)) print("beta hat is: \n{}".format(athena_model.beta)) print("standard error of the beta hat is: \n{}".format(athena_model.beta_standarderror)) print("variance-covariance matrix of the beta hat: \n{}".format(athena_model.beta_vcov))
This section is a sample Jupyter notebook showing how to use the ordinary least squares regression functions in svinfer
.
from svinfer_ss1.linear_model import LinearRegression from svinfer_ss1.processor import AthenaProcessor, DataFrameProcessor from fbri.private.sql.query import execute
# research goal: shares ~ likes; # fit model with 3 different datasets x_columns = ["likes"] y_column = "shares" x_s2 = [22.0 ** 2]
# Example 1: with all URL Shares data in the private environment;
data1 = AthenaProcessor( "fbri_prod_private", "erc_condor_url_breakdowns_dp_clean_partitioned_v2")
my_model1 = LinearRegression(x_columns, y_column, x_s2, random_state=123) my_model1.fit(data1)
print(f"beta_tilde is: \n{my_model1.beta}") print(f"beta_tilde's standard error is: \n{my_model1.beta_standarderror}") print(f"beta_tile's variance-covariance matrix: \n{my_model1.beta_vcov}") print(f"estimated residual variance is {my_model1.sigma_sq}")
# Example 2: with subset of URL Shares data in the private environment
data2 = AthenaProcessor( database="fbri_prod_private", table_name="erc_condor_url_breakdowns_dp_clean_partitioned_v2", filters={"c": ["US"], "year_month": ["2019-01", "2019-02", "2019-03"]})
my_model2 = LinearRegression( x_columns, y_column, x_s2, fit_intercept=True, df_corrected=True, n_replications=50000, random_state=123) my_model2.fit(data2)
print(f"beta_tilde is: \n{my_model2.beta}") print(f"beta_tilde's standard error is: \n{my_model2.beta_standarderror}") print(f"beta_tile's variance-covariance matrix: \n{my_model2.beta_vcov}") print(f"estimated residual variance is {my_model2.sigma_sq}")
# example 3: with in-memory data
pd_data = execute( """ SELECT likes, shares FROM fbri_prod_private.erc_condor_url_breakdowns_dp_clean_partitioned_v2 WHERE c = 'US' AND year_month IN ('2019-01', '2019-02', '2019-03') LIMIT 100000 """ ) pd_data.head()
data3 = DataFrameProcessor(pd_data)
my_model3 = LinearRegression(x_columns, y_column, x_s2, n_replications=50000, random_state=123) my_model3.fit(data3)
print(f"beta_tilde is: \n{my_model3.beta}") print(f"beta_tilde's standard error is: \n{my_model3.beta_standarderror}") print(f"beta_tile's variance-covariance matrix: \n{my_model3.beta_vcov}") print(f"estimated residual variance is {my_model3.sigma_sq}")