Statistical modeling using svinfer

We use a differential privacy framework to calibrate the level of noise that needs to be added to a dataset in order to protect users’ privacy. When you analyze privacy-protected data, you must properly account for the added noise, or your models could return biased results.

Standard statistical models usually assume that the predictors are measured without error, which makes them less appropriate for privacy-protected data. To account for these measurement errors, we have made the Statistically Valid Inference (svinfer) Python package available in the research tool.
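
To see why this matters, consider the classic attenuation effect: adding independent noise to a predictor shrinks a naive regression slope toward zero. Below is a minimal, self-contained sketch of the effect on simulated data; it does not use svinfer, and all numbers are purely illustrative.

import numpy as np

np.random.seed(0)
n = 100000
x = np.random.standard_normal(n)            # true predictor, variance 1
y = 3.0 * x + np.random.standard_normal(n)  # true slope is 3

# add designed noise with variance 2^2 to the predictor
x_noisy = x + 2.0 * np.random.standard_normal(n)

print(np.polyfit(x, y, 1)[0])        # close to 3
print(np.polyfit(x_noisy, y, 1)[0])  # close to 3 * 1 / (1 + 4) = 0.6: strongly attenuated

svinfer corrects for exactly this kind of bias when the variance of the added noise is known.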

The svinfer Python package provides statistical models for privacy-protected data, returning estimates of the regression coefficients and of their variance-covariance matrix. The package scales to big data: it can process data either in memory or directly in a database, without loading the full dataset into memory. It currently supports linear regression and logistic regression (a type of generalized linear model), and we plan to add more models in the future.
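
For instance, the same model can be fitted against an in-memory pandas DataFrame or against a table that stays in the database. Here is a minimal sketch using the processors from the JupyterHub version (svinfer_ss1), which appear in the demos later in this section:

from svinfer_ss1.processor import AthenaProcessor, DataFrameProcessor

# in-memory: wrap an existing pandas DataFrame df
data = DataFrameProcessor(df)

# in-database: point at an Athena table; the data is processed in the
# database rather than loaded into memory
data = AthenaProcessor(
    "fbri_prod_private",
    "erc_condor_url_breakdowns_dp_clean_partitioned_v2"
)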

Known limitations

There are some limitations to the svinfer package. It currently does not support:

  • Nonlinear terms, such as quadratic terms, ratios of two predictors, or log transformations of predictors

  • Missing data

  • Categorical variables

  • Generalized linear models (except logistic regression)

  • Interactions

Package differences

There are two versions of the svinfer package. The open source version is called svinfer and the JupyterHub version is called svinfer_ss1. Both versions contain LinearRegression, DataFrameProcessor, and DatabaseProcessor. The only difference is that the svinfer_ss1 package contains an additional processor, AthenaProcessor, designed for JupyterHub.
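
In practice, code written against one version ports to the other by changing the package name. Here is a sketch of the parallel imports (the svinfer.processor module path is assumed to mirror svinfer_ss1.processor):

# open source version
from svinfer.linear_model import LinearRegression
from svinfer.processor import DataFrameProcessor, DatabaseProcessor

# JupyterHub version, which adds the Athena processor
from svinfer_ss1.linear_model import LinearRegression
from svinfer_ss1.processor import DataFrameProcessor, DatabaseProcessor, AthenaProcessor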

Modeling with svinfer

The following example models the relationship between the number of shares and the number of likes in the URL data stored in the breakdowns table erc_condor_url_breakdowns_dp_clean_partitioned_v2.

In the first cell, import LinearRegression from svinfer.linear_model and AthenaProcessor from svinfer_ss1.processor. Running this code alone does not return any results.

from svinfer.linear_model import LinearRegression
from svinfer_ss1.processor import AthenaProcessor

The next section of code, run in a second cell, prepares the linear regression model: it points to the database and table, sets the x and y variables along with the variance of the noise added to the features, and fits the model. This code can take over 20 minutes to complete.

database = "fbri_prod_private"
table = "erc_condor_url_breakdowns_dp_clean_partitioned_v2"
data = AthenaProcessor(database, table)  # process the table in Athena without loading it into memory

X = ['likes']       # column name for the feature
y = 'shares'        # column name for the response
X_s2 = [22.0 ** 2]  # variance of the noise added to likes (standard deviation 22)

model = LinearRegression(X, y, X_s2)
model.fit(data)


Run the final section of code in a third cell. This code uses the model fitted in the previous step and prints the values for beta, sigma squared, the variance-covariance matrix, and the standard errors.

print('Values for Beta: {}'.format(model.beta))
print('Sigma squared: {}'.format(model.sigma_sq))
print('Variance-Covariance Matrix: {}'.format(model.beta_vcov))
print('Standard Errors: {}'.format(model.beta_standarderror))
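
The returned estimates are standard Python and NumPy objects, so you can post-process them as usual. For example, here is a sketch of approximate 95% confidence intervals built from beta and beta_standarderror, assuming the model object fitted above:

import numpy as np

beta = np.asarray(model.beta)
se = np.asarray(model.beta_standarderror)

# approximate 95% confidence interval: estimate +/- 1.96 standard errors
for b, lo, hi in zip(beta, beta - 1.96 * se, beta + 1.96 * se):
    print("{:.4f} [{:.4f}, {:.4f}]".format(b, lo, hi))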


API reference

This section describes the svinfer API for linear regression.

class SVILinearRegression(
    x_columns, y_column, x_s2, fit_intercept=True, df_corrected=False, n_replications=500, random_state=None)

SVILinearRegression fits a linear regression in which some variables in the training data are subject to measurement error with known variance, and implements a computationally efficient algorithm for statistical inference in large datasets.

Parameters

x_columns: list of strings: Specifies the column names for the features in the data.

y_column: string: Specifies the column name for the response in the data.

x_s2: list of doubles: Specifies the variance of the noise that was added to each of the features. Its length should be equal to that of x_columns. If no noise is added to a feature, use 0.0 in the corresponding place.
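
For example, if noise with standard deviation 22 was added to likes, while a second, hypothetical feature clicks was released without noise, x_s2 would be:

x_columns = ["likes", "clicks"]  # "clicks" is hypothetical, for illustration only
x_s2 = [22.0 ** 2, 0.0]          # 22^2 for the noisy feature, 0.0 for the noise-free one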

fit_intercept: bool, optional, default to True: Specifies whether to include the intercept in the model. If set to False, the design matrix will consist only of the features specified. If set to True, a column of ones will be added to the design matrix in addition to the features specified.

df_corrected: bool, optional, default to False: Specifies whether to adjust the degrees of freedom when estimating the error variance. If set to False, the degrees of freedom is n, where n is the sample size. If set to True, the degrees of freedom is n - p, where p is the number of columns in the design matrix.

n_replications: int, optional, default to 500: Number of simulation replicates. Should be a positive integer.

random_state: int or None, optional, default to None: Determines random number generation for dataset creation. Pass an integer for reproducible output across multiple function calls.

Attributes

beta: array of shape (p,): Vector of regression coefficients, where p is the number of columns in the design matrix. If fit_intercept is set to True, the first element of beta is the intercept, followed by the other regression coefficients.

sigma_sq: double: Estimated variance of the error term.

beta_vcov: array of shape (p, p): Estimated variance-covariance matrix of the regression coefficients.

beta_standarderror: array of shape (p,): Standard errors of the regression coefficients.

Method

fit(self, data)

Fit the linear regression.

Parameter: data

The parameter data can be one of the following formats:

  • A Pandas DataFrame
  • An instance of class SVIInputDataPresto, which specifies a Hive table in the FB Infra
  • An instance of class SVIInputDataAthena, which specifies a data table in Amazon Athena

Return: self

Example

import numpy as np
import pandas as pd

# generate the independent variables
n = 100000
np.random.seed(0)
z1 = np.random.poisson(lam=7, size=n)
z2 = np.random.poisson(lam=9, size=n) + 2 * z1
beta_true = [10, 12, -3]

# generate y based on z1, z2,
# add noise ~ N(0, 2^2) to z1,
# add noise ~ N(0, 1^2) to z2,
# and assemble the training data
x_s2 = [2 ** 2, 1 ** 2]
np.random.seed(100)
data = pd.DataFrame(
    {
        "y": 10 + 12 * z1 - 3 * z2 + 2 * np.random.standard_normal(size=n),
        "x1": z1 + np.random.standard_normal(size=n) * np.sqrt(x_s2[0]),
        "x2": z2 + np.random.standard_normal(size=n) * np.sqrt(x_s2[1]),
    }
)

# fit y ~ x1 + x2
# (assumes SVILinearRegression has been imported from the svinfer package)
model = SVILinearRegression(
    ["x1", "x2"],          # column names for the explanatory variables
    "y",                   # column name for the response
    x_s2,                  # variances of the designed noise added to each explanatory variable
    fit_intercept=True,    # include an intercept in the model
    df_corrected=True,     # use the corrected degrees of freedom when estimating the residual variance
    n_replications=10000,  # replicate 10000 times when estimating vcov(beta)
    random_state=123,      # set the seed when simulating vcov(beta)
).fit(data)

# check the results
print("beta_tilde is: \n{}".format(model.beta))
print("beta_tilde's standard error is:\n{}".format(model.beta_standarderror))
print("beta_tilde's variance-covariance matrix:\n{}".format(model.beta_vcov))
print("estimated residual variance is {}".format(model.sigma_sq))

References

For reference, here is how SVILinearRegression compares with two other popular linear regression implementations.

Sklearn linear regression:

  • Popular ML package (focuses on fit, prediction, and score, but provides no inference)
  • In-memory model

PySpark linear regression:

  • Supports both in-memory and database versions

sklearn.linear_model.LinearRegression

  Parameters: fit_intercept=True, normalize=False, copy_X=True, n_jobs=None

  Attributes: coef_, rank_, singular_, intercept_

  Methods: fit(self, X, y[, sample_weight]), get_params(self[, deep]), score(self, X, y[, sample_weight]), set_params(self)

  Example:

      reg = LinearRegression().fit(X, y)
      print(reg.coef_)

pyspark.ml.regression.LinearRegression

  Parameters: featuresCol='features', labelCol='label', predictionCol='prediction', maxIter=100, regParam=0.0, elasticNetParam=0.0, tol=1e-6, fitIntercept=True, standardization=True, solver='auto', weightCol=None, aggregationDepth=2, loss='squaredError', epsilon=1.35

  Attributes (on the fitted pyspark.ml.regression.LinearRegressionModel): coefficients, intercept, numFeatures, params, ...

  Methods: fit(dataset, params=None), getFeaturesCol(), ...

  Example:

      model = LinearRegression().fit(df)
      print(model.coefficients)

fort.SVILinearRegression

  Parameters: x_columns, y_column, x_s2, fit_intercept=True, df_corrected=False, n_replications=500, random_state=None

  Attributes: beta, sigma_sq, beta_vcov, beta_standarderror

  Methods: fit(self, data)

  Example:

      svi = SVILinearRegression(["x1", "x2"], "y", [1, 1]).fit(data)
      print(svi.beta)
      print(svi.beta_vcov)

Logistic regression demo

This section is a sample Jupyter notebook showing how to use the logistic regression functions in svinfer.

import numpy as np
import pandas as pd
from scipy import stats, special

from fbri.private.sql.query import execute

from svinfer_ss1.linear_model import LogisticRegression
from svinfer_ss1.processor import AthenaProcessor, DataFrameProcessor

# simulate a binary response driven by z1 and z2, observing only
# noisy versions of the predictors (noise sd 1 for x1, sd 2 for x2)
def simulate_data_category(n=10000, seed=123):
    np.random.seed(seed)
    z1 = np.random.normal(loc=1, scale=0.5, size=n) ** 2
    z2 = np.random.normal(loc=0.5, scale=1, size=n) ** 2
    return pd.DataFrame(
        {
            "x1": z1 + np.random.standard_normal(size=n) * 1,
            "x2": z2 + np.random.standard_normal(size=n) * 2,
            "y_binary": stats.bernoulli.rvs(special.expit(1 + 2 * z1 - 0.5 * z2)),
        }
    )

print("beta true is (1, 2, -0.5) in this setting.")

## generate some in-memory data
data = simulate_data_category()
print(data)

# prepare the in-memory data and fit the model
df_data = DataFrameProcessor(data)
df_model = LogisticRegression(
    ["x1", "x2"],
    "y_binary",
    [1.0 ** 2, 2.0 ** 2]
).fit(df_data)

# check the result
print("Did the optimization succeed?:\n{}".format(df_model.success))
print("beta hat is: \n{}".format(df_model.beta))
print("standard error of the beta hat is: \n{}".format(df_model.beta_standarderror))
print("variance-covariance matrix of the beta hat: \n{}".format(df_model.beta_vcov))

## upload the same training data into Athena as table svinfer_test4;
## you can follow the user upload instructions here:
## https://github.com/fb-researchtool-ss1/docs/blob/master/Resources/User%20Upload%20Documentation.pdf
print(execute("select * from private_user_3557305751000411.svinfer_test4"))

# prepare the Athena data and fit the model
athena_data = AthenaProcessor(
    "private_user_3557305751000411",
    "svinfer_test4"
)
athena_model = LogisticRegression(
    ["x1", "x2"],
    "y_binary",
    [1.0 ** 2, 2.0 ** 2]
).fit(athena_data)

# check the result
print("Did the optimization succeed?:\n{}".format(athena_model.success))
print("beta hat is: \n{}".format(athena_model.beta))
print("standard error of the beta hat is: \n{}".format(athena_model.beta_standarderror))
print("variance-covariance matrix of the beta hat: \n{}".format(athena_model.beta_vcov))

Ordinary least squares regression demo

This section is a sample Jupyter notebook showing how to use the ordinary least squares regression functions in svinfer.

from svinfer_ss1.linear_model import LinearRegression
from svinfer_ss1.processor import AthenaProcessor, DataFrameProcessor
from fbri.private.sql.query import execute
# research goal: shares ~ likes; 
# fit model with 3 different datasets

x_columns = ["likes"]
y_column = "shares"
x_s2 = [22.0 ** 2]
# Example 1: with all URL Shares data in the private environment; 
data1 = AthenaProcessor(
    "fbri_prod_private", 
    "erc_condor_url_breakdowns_dp_clean_partitioned_v2")
my_model1 = LinearRegression(x_columns, y_column, x_s2, random_state=123)
my_model1.fit(data1)
print(f"beta_tilde is: \n{my_model1.beta}")
print(f"beta_tilde's standard error is: \n{my_model1.beta_standarderror}")
print(f"beta_tile's variance-covariance matrix: \n{my_model1.beta_vcov}")
print(f"estimated residual variance is {my_model1.sigma_sq}")
# Example 2: with subset of URL Shares data in the private environment
data2 = AthenaProcessor(
    database="fbri_prod_private", 
    table_name="erc_condor_url_breakdowns_dp_clean_partitioned_v2", 
    filters={"c": ["US"], "year_month": ["2019-01", "2019-02", "2019-03"]})
my_model2 = LinearRegression(
    x_columns, 
    y_column, 
    x_s2, 
    fit_intercept=True, 
    df_corrected=True, 
    n_replications=50000, 
    random_state=123)
my_model2.fit(data2)
print(f"beta_tilde is: \n{my_model2.beta}")
print(f"beta_tilde's standard error is: \n{my_model2.beta_standarderror}")
print(f"beta_tile's variance-covariance matrix: \n{my_model2.beta_vcov}")
print(f"estimated residual variance is {my_model2.sigma_sq}")
# Example 3: with in-memory data
pd_data = execute(
    """
    SELECT likes, shares 
    FROM fbri_prod_private.erc_condor_url_breakdowns_dp_clean_partitioned_v2
    WHERE c = 'US' AND year_month IN ('2019-01', '2019-02', '2019-03')
    LIMIT 100000
    """
)
pd_data.head()
data3 = DataFrameProcessor(pd_data)
my_model3 = LinearRegression(x_columns, y_column, x_s2, n_replications=50000, random_state=123)
my_model3.fit(data3)
print(f"beta_tilde is: \n{my_model3.beta}")
print(f"beta_tilde's standard error is: \n{my_model3.beta_standarderror}")
print(f"beta_tile's variance-covariance matrix: \n{my_model3.beta_vcov}")
print(f"estimated residual variance is {my_model3.sigma_sq}")