Python statsmodels GSOC: coding structure first pass

Me and Kerby have gotten the structure of MICE to a reasonable point; it's a good time to make and update! Here is the current user interface:

>>> import pandas as pd (1)
>>> import statsmodels.api as sm (2)
>>> from statsmodels.sandbox.mice import mice (3)
>>> data = pd.read_csv('directory_here') (4)
>>> impdata = mice.ImputedData(data) (5)
>>> m1 = impdata.new_imputer("x2") (6)
>>> m2 = impdata.new_imputer("x3") (7)
>>> m3 = impdata.new_imputer("x1", model_class=sm.Logit) (8)
>>> impcomb = mice.MICE("x1 ~ x2 + x3", sm.Logit, [m1,m2,m3]) (9)
>>> p1 = impcomb.combine(iternum=20, skipnum=10) (10)

Now here is what's going on, step by step:

1) Our base data type is going to be a pandas DataFrame. Currently, the data must be in a form that supports pd.DataFrame(data).

2) Import our statsmodels API.

3) Import our mice module.

4) Read in our data.

5) mice.ImputedData is a class that stores both the underlying data and its missing data attributes. Right now the key attribute is which indices for each variable are missing. Later we will be making changes directly to the underlying dataset, so we don't want to lose this information as we start filling in values. As soon as these indices are saved, we modify the underlying data by filling in as an initial imputation all the column-wise means for the missing values. ImputedData also contains a helper function that allows us to fill in values and a function that allows us to construct Imputers (described below). Note that changes to the data are within the scope of ImputedData, so your actual data will be safely preserved after all the MICE carnage is done :)

6-8) For each variable with missing values that we want to impute, we initialize an Imputer. These Imputers contain two simulation methods that will help us impute the specific variable of interest: impute_asymptotic_bayes (described in my last post) and impute_pmm (a new one, predictive mean matching). There will be more simulation methods later on. Each Imputer's job is to impute one variable given a formula and model, specified by the user. ImputedData.new_imputer defaults to OLS and all other variables as predictors. Note that here we are implicitly assuming Missing At Random since, conditional on the predictors, the missing value is completely random (and asymptotically Gaussian).

9) Initialize a MICE instance by specifying the model and formula that we are really interested in. The previous imputation models are simply conditional models that we think will do a good job at predicting the variable's missing values; this new model is an analysis model that we want to fit to the already-imputed datasets. We pass in a list containing all the Imputers from steps 6-8; these Imputers will do the work of imputation for us.

10) Here's where we specify an iteration number (iternum) and number of imputations to skip between iterations (skipnum). Mechanically, what is happening is that once MICE.combine is called, we start stringing imputations together, with the variable with the least number of missing observations being imputed first. ImputerChain just makes sure every Imputer in the list is used once to make one fully-imputed dataset. However, one imputation for each variable may not be enough to ensure a stable conditional distribution, so we want to skip a number of datasets between actually fitting the analysis model. So we run through all the Imputers ad nauseum until we have skipped the set number of fully imputed datasets, then fit the model. This is one iteration; we repeat until have iternum number of fitted models. To be clear: if we specify iternum=10 and skipnum=5, we will go through a total of 50 imputation iterations (one iteration is a series of imputations for all specified Imputers) and only fit the analysis model for imputation number 5, 10, 15, etc.

All this skipping and fitting happens in AnalysisChain. MICE.combine then takes all these fitted models (actually, only the parameters we care about: params, cov_params, and scale) and combines them using Rubin's rule, and finally the combined parameters are stuffed into a fitted analysis model instance.

Whew, that's a mouthful! That's the gist of it, if you'd like to get under the hood (and I hope people do!) the Github is here. There are some minor things I did to statsmodels.api and the working directory just so I could work from my local Github folder/machine, so don't try to install it directly before changing those back.

Next step is making the user interface simpler (hopefully he will just pass the data object directly into MICE and not have to deal with initializing Imputers and ImputerData) and also adding logic in that lets the user specify which simulation method he wants to use for imputation. Hopefully get some feedback and make more improvements before my next post!

Python statsmodels GSOC

Friday, May 23, 2014

coding structure first pass

No comments:

Post a Comment