Statistical modeling using svinfer

We use a differential privacy framework to calibrate the level of noise that needs to be added to a dataset in order to protect users’ privacy. When you analyze privacy-protected data, you must properly account for the added noise, or your models could return biased results.

Standard statistical models usually assume that the predictors are measured without error, which makes them less appropriate for privacy-protected data. To account for these measurement errors, we have made the Statistically Valid Inference (svinfer) Python package available in the research tool.
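
To see why this matters, consider the classic attenuation effect: adding independent noise to a predictor shrinks a naive regression slope toward zero. Below is a minimal, self-contained sketch of the effect on simulated data; it does not use svinfer, and all numbers are purely illustrative.

import numpy as np

np.random.seed(0)
n = 100000
x = np.random.standard_normal(n)            # true predictor, variance 1
y = 3.0 * x + np.random.standard_normal(n)  # true slope is 3

# add designed noise with variance 2^2 to the predictor
x_noisy = x + 2.0 * np.random.standard_normal(n)

print(np.polyfit(x, y, 1)[0])        # close to 3
print(np.polyfit(x_noisy, y, 1)[0])  # close to 3 * 1 / (1 + 4) = 0.6: strongly attenuated

svinfer corrects for exactly this kind of bias when the variance of the added noise is known.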

The svinfer Python package provides statistical models for privacy-protected data, returning estimates of the regression coefficients and of their variance-covariance matrix. The package scales to big data: it can process data either in memory or directly in a database, without loading the full dataset into memory. It currently supports linear regression and logistic regression (a type of generalized linear model), and we plan to add more models in the future.
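
For instance, the same model can be fitted against an in-memory pandas DataFrame or against a table that stays in the database. Here is a minimal sketch using the processors from the JupyterHub version (svinfer_ss1), which appear in the demos later in this section:

from svinfer_ss1.processor import AthenaProcessor, DataFrameProcessor

# in-memory: wrap an existing pandas DataFrame df
data = DataFrameProcessor(df)

# in-database: point at an Athena table; the data is processed in the
# database rather than loaded into memory
data = AthenaProcessor(
    "fbri_prod_private",
    "erc_condor_url_breakdowns_dp_clean_partitioned_v2"
)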

Known limitations

There are some limitations to the svinfer package. It currently does not support:

  • Nonlinear terms, such as quadratic terms, ratios of two predictors, or log transformations of predictors

  • Missing data

  • Categorical variables

  • Generalized linear models (except logistic regression)

  • Interactions

Package differences

There are two versions of the svinfer package. The open source version is called svinfer and the JupyterHub version is called svinfer_ss1. Both versions contain LinearRegression, DataFrameProcessor, and DatabaseProcessor. The only difference is that the svinfer_ss1 package contains an additional processor, AthenaProcessor, designed for JupyterHub.
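
In practice, code written against one version ports to the other by changing the package name. Here is a sketch of the parallel imports (the svinfer.processor module path is assumed to mirror svinfer_ss1.processor):

# open source version
from svinfer.linear_model import LinearRegression
from svinfer.processor import DataFrameProcessor, DatabaseProcessor

# JupyterHub version, which adds the Athena processor
from svinfer_ss1.linear_model import LinearRegression
from svinfer_ss1.processor import DataFrameProcessor, DatabaseProcessor, AthenaProcessor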

Modeling with svinfer

The following example models the relationship between the number of shares and the number of likes in the URL data stored in the breakdowns table erc_condor_url_breakdowns_dp_clean_partitioned_v2.

In the first cell, import LinearRegression from svinfer.linear_model and AthenaProcessor from svinfer_ss1.processor. Running this code alone does not return any results.

from svinfer.linear_model import LinearRegression
from svinfer_ss1.processor import AthenaProcessor

The next section of code, run in a second cell, prepares the linear regression model: it points to the database and table, sets the x and y variables along with the variance of the noise added to the features, and fits the model. This code can take over 20 minutes to complete.

database = "fbri_prod_private"
table = "erc_condor_url_breakdowns_dp_clean_partitioned_v2"
data = AthenaProcessor(database, table)  # process the table in Athena without loading it into memory

X = ['likes']       # column name for the feature
y = 'shares'        # column name for the response
X_s2 = [22.0 ** 2]  # variance of the noise added to likes (standard deviation 22)

model = LinearRegression(X, y, X_s2)
model.fit(data)


Run the final section of code in a third cell. This code uses the model fitted in the previous step and prints the values for beta, sigma squared, the variance-covariance matrix, and the standard errors.

print('Values for Beta: {}'.format(model.beta))
print('Sigma squared: {}'.format(model.sigma_sq))
print('Variance-Covariance Matrix: {}'.format(model.beta_vcov))
print('Standard Errors: {}'.format(model.beta_standarderror))
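
The returned estimates are standard Python and NumPy objects, so you can post-process them as usual. For example, here is a sketch of approximate 95% confidence intervals built from beta and beta_standarderror, assuming the model object fitted above:

import numpy as np

beta = np.asarray(model.beta)
se = np.asarray(model.beta_standarderror)

# approximate 95% confidence interval: estimate +/- 1.96 standard errors
for b, lo, hi in zip(beta, beta - 1.96 * se, beta + 1.96 * se):
    print("{:.4f} [{:.4f}, {:.4f}]".format(b, lo, hi))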


API reference

This section describes the svinfer API for linear regression.

class SVILinearRegression(
    x_columns, y_column, x_s2, fit_intercept=True, df_corrected=False, n_replications=500, random_state=None)

SVILinearRegression fits a linear regression in which some variables in the training data are subject to measurement error with known variance, and implements a computationally efficient algorithm for statistical inference in large datasets.

Parameters

x_columns: list of strings: Specifies the column names for the features in the data.

y_column: string: Specifies the column name for the response in the data.

x_s2: list of doubles: Specifies the variance of the noise that was added to each of the features. Its length should be equal to that of x_columns. If no noise is added to a feature, use 0.0 in the corresponding place.
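
For example, if noise with standard deviation 22 was added to likes, while a second, hypothetical feature clicks was released without noise, x_s2 would be:

x_columns = ["likes", "clicks"]  # "clicks" is hypothetical, for illustration only
x_s2 = [22.0 ** 2, 0.0]          # 22^2 for the noisy feature, 0.0 for the noise-free one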

fit_intercept: bool, optional, default to True: Specifies whether to include the intercept in the model. If set to False, the design matrix will consist only of the features specified. If set to True, a column of ones will be added to the design matrix in addition to the features specified.

df_corrected: bool, optional, default to False: Specifies whether to adjust the degrees of freedom when estimating the error variance. If set to False, the degrees of freedom is n, where n is the sample size. If set to True, the degrees of freedom is n - p, where p is the number of columns in the design matrix.

n_replications: int, optional, default to 500: Number of simulation replicates. Should be a positive integer.

random_state: int or None, optional, default to None: Determines random number generation for dataset creation. Pass an integer for reproducible output across multiple function calls.

Attributes

beta: array of shape (p,): Vector of regression coefficients, where p is the number of columns in the design matrix. If fit_intercept is set to True, the first element of beta is the intercept, followed by the other regression coefficients.

sigma_sq: double: Estimated variance of the error term.

beta_vcov: array of shape (p, p): Estimated variance-covariance matrix of the regression coefficients.

beta_standarderror: array of shape (p,): Standard errors of the regression coefficients.

Method

fit(self, data)

Fit the linear regression.

Parameter: data

The parameter data can be one of the following formats:

  • A Pandas DataFrame
  • An instance of class SVIInputDataPresto, which specifies a Hive table in the FB Infra
  • An instance of class SVIInputDataAthena, which specifies a data table in Amazon Athena

Return: self

Example

import numpy as np
import pandas as pd

# generate the independent variables
n = 100000
np.random.seed(0)
z1 = np.random.poisson(lam=7, size=n)
z2 = np.random.poisson(lam=9, size=n) + 2 * z1
beta_true = [10, 12, -3]

# generate y based on z1, z2,
# add noise ~ N(0, 2^2) to z1,
# add noise ~ N(0, 1^2) to z2,
# and assemble the training data
x_s2 = [2 ** 2, 1 ** 2]
np.random.seed(100)
data = pd.DataFrame(
    {
        "y": 10 + 12 * z1 - 3 * z2 + 2 * np.random.standard_normal(size=n),
        "x1": z1 + np.random.standard_normal(size=n) * np.sqrt(x_s2[0]),
        "x2": z2 + np.random.standard_normal(size=n) * np.sqrt(x_s2[1]),
    }
)

# fit y ~ x1 + x2
# (assumes SVILinearRegression has been imported from the svinfer package)
model = SVILinearRegression(
    ["x1", "x2"],          # column names for the explanatory variables
    "y",                   # column name for the response
    x_s2,                  # variances of the designed noise added to each explanatory variable
    fit_intercept=True,    # include an intercept in the model
    df_corrected=True,     # use the corrected degrees of freedom when estimating the residual variance
    n_replications=10000,  # replicate 10000 times when estimating vcov(beta)
    random_state=123,      # set the seed when simulating vcov(beta)
).fit(data)

# check the results
print("beta_tilde is: \n{}".format(model.beta))
print("beta_tilde's standard error is:\n{}".format(model.beta_standarderror))
print("beta_tilde's variance-covariance matrix:\n{}".format(model.beta_vcov))
print("estimated residual variance is {}".format(model.sigma_sq))

References

For reference, here is how SVILinearRegression compares with two other popular linear regression implementations.

Sklearn linear regression:

  • Popular ML package (focuses on fit, prediction, and score, but provides no inference)
  • In-memory model

PySpark linear regression:

  • Supports both in-memory and database versions

sklearn.linear_model.LinearRegression

  Parameters: fit_intercept=True, normalize=False, copy_X=True, n_jobs=None

  Attributes: coef_, rank_, singular_, intercept_

  Methods: fit(self, X, y[, sample_weight]), get_params(self[, deep]), score(self, X, y[, sample_weight]), set_params(self)

  Example:

      reg = LinearRegression().fit(X, y)
      print(reg.coef_)

pyspark.ml.regression.LinearRegression

  Parameters: featuresCol='features', labelCol='label', predictionCol='prediction', maxIter=100, regParam=0.0, elasticNetParam=0.0, tol=1e-6, fitIntercept=True, standardization=True, solver='auto', weightCol=None, aggregationDepth=2, loss='squaredError', epsilon=1.35

  Attributes (on the fitted pyspark.ml.regression.LinearRegressionModel): coefficients, intercept, numFeatures, params, ...

  Methods: fit(dataset, params=None), getFeaturesCol(), ...

  Example:

      model = LinearRegression().fit(df)
      print(model.coefficients)

fort.SVILinearRegression

  Parameters: x_columns, y_column, x_s2, fit_intercept=True, df_corrected=False, n_replications=500, random_state=None

  Attributes: beta, sigma_sq, beta_vcov, beta_standarderror

  Methods: fit(self, data)

  Example:

      svi = SVILinearRegression(["x1", "x2"], "y", [1, 1]).fit(data)
      print(svi.beta)
      print(svi.beta_vcov)

Logistic regression demo

This section is a sample Jupyter notebook showing how to use the logistic regression functions in svinfer.

import numpy as np
import pandas as pd
from scipy import stats, special

from fbri.private.sql.query import execute

from svinfer_ss1.linear_model import LogisticRegression
from svinfer_ss1.processor import AthenaProcessor, DataFrameProcessor

# simulate a binary response driven by z1 and z2, observing only
# noisy versions of the predictors (noise sd 1 for x1, sd 2 for x2)
def simulate_data_category(n=10000, seed=123):
    np.random.seed(seed)
    z1 = np.random.normal(loc=1, scale=0.5, size=n) ** 2
    z2 = np.random.normal(loc=0.5, scale=1, size=n) ** 2
    return pd.DataFrame(
        {
            "x1": z1 + np.random.standard_normal(size=n) * 1,
            "x2": z2 + np.random.standard_normal(size=n) * 2,
            "y_binary": stats.bernoulli.rvs(special.expit(1 + 2 * z1 - 0.5 * z2)),
        }
    )

print("beta true is (1, 2, -0.5) in this setting.")

## generate some in-memory data
data = simulate_data_category()
print(data)

# prepare the in-memory data and fit the model
df_data = DataFrameProcessor(data)
df_model = LogisticRegression(
    ["x1", "x2"],
    "y_binary",
    [1.0 ** 2, 2.0 ** 2]
).fit(df_data)

# check the result
print("Did the optimization succeed?:\n{}".format(df_model.success))
print("beta hat is: \n{}".format(df_model.beta))
print("standard error of the beta hat is: \n{}".format(df_model.beta_standarderror))
print("variance-covariance matrix of the beta hat: \n{}".format(df_model.beta_vcov))

## upload the same training data into Athena as table svinfer_test4;
## you can follow the user upload instructions here:
## https://github.com/fb-researchtool-ss1/docs/blob/master/Resources/User%20Upload%20Documentation.pdf
print(execute("select * from private_user_3557305751000411.svinfer_test4"))

# prepare the Athena data and fit the model
athena_data = AthenaProcessor(
    "private_user_3557305751000411",
    "svinfer_test4"
)
athena_model = LogisticRegression(
    ["x1", "x2"],
    "y_binary",
    [1.0 ** 2, 2.0 ** 2]
).fit(athena_data)

# check the result
print("Did the optimization succeed?:\n{}".format(athena_model.success))
print("beta hat is: \n{}".format(athena_model.beta))
print("standard error of the beta hat is: \n{}".format(athena_model.beta_standarderror))
print("variance-covariance matrix of the beta hat: \n{}".format(athena_model.beta_vcov))

Ordinary least squares regression demo

This section is a sample Jupyter notebook showing how to use the ordinary least squares regression functions in svinfer.

from svinfer_ss1.linear_model import LinearRegression
from svinfer_ss1.processor import AthenaProcessor, DataFrameProcessor
from fbri.private.sql.query import execute
# research goal: shares ~ likes; 
# fit model with 3 different datasets

x_columns = ["likes"]
y_column = "shares"
x_s2 = [22.0 ** 2]
# Example 1: with all URL Shares data in the private environment; 
data1 = AthenaProcessor(
    "fbri_prod_private", 
    "erc_condor_url_breakdowns_dp_clean_partitioned_v2")
my_model1 = LinearRegression(x_columns, y_column, x_s2, random_state=123)
my_model1.fit(data1)
print(f"beta_tilde is: \n{my_model1.beta}")
print(f"beta_tilde's standard error is: \n{my_model1.beta_standarderror}")
print(f"beta_tile's variance-covariance matrix: \n{my_model1.beta_vcov}")
print(f"estimated residual variance is {my_model1.sigma_sq}")
# Example 2: with subset of URL Shares data in the private environment
data2 = AthenaProcessor(
    database="fbri_prod_private", 
    table_name="erc_condor_url_breakdowns_dp_clean_partitioned_v2", 
    filters={"c": ["US"], "year_month": ["2019-01", "2019-02", "2019-03"]})
my_model2 = LinearRegression(
    x_columns, 
    y_column, 
    x_s2, 
    fit_intercept=True, 
    df_corrected=True, 
    n_replications=50000, 
    random_state=123)
my_model2.fit(data2)
print(f"beta_tilde is: \n{my_model2.beta}")
print(f"beta_tilde's standard error is: \n{my_model2.beta_standarderror}")
print(f"beta_tile's variance-covariance matrix: \n{my_model2.beta_vcov}")
print(f"estimated residual variance is {my_model2.sigma_sq}")
# Example 3: with in-memory data
pd_data = execute(
    """
    SELECT likes, shares 
    FROM fbri_prod_private.erc_condor_url_breakdowns_dp_clean_partitioned_v2
    WHERE c = 'US' AND year_month IN ('2019-01', '2019-02', '2019-03')
    LIMIT 100000
    """
)
pd_data.head()
data3 = DataFrameProcessor(pd_data)
my_model3 = LinearRegression(x_columns, y_column, x_s2, n_replications=50000, random_state=123)
my_model3.fit(data3)
print(f"beta_tilde is: \n{my_model3.beta}")
print(f"beta_tilde's standard error is: \n{my_model3.beta_standarderror}")
print(f"beta_tile's variance-covariance matrix: \n{my_model3.beta_vcov}")
print(f"estimated residual variance is {my_model3.sigma_sq}")