Python statsmodels GSOC

Monday, June 16, 2014

How to test MICE?

MICE is a method that imputes missing data using simulation from an inferred posterior distribution. See the Appendix of http://www.statcan.gc.ca/ads-annonces/12-001-x/5857-eng.pdf to see the exact way we simulated data.

With this in mind, MICE is a random method that will yield differing results every time it is run. Therefore, a typical testing procedure where we try to get exact agreement between our implementation and an existing, mature package is not possible. However, we should get roughly the same results for any given instance, and more importantly, the distribution of results should be very similar. So what we will aim to do is really a comparison of simulations between our implementation and those of R and Stata. Right now, predictive mean matching is directly comparable between all implementations, but there are some details in our other asymptotically Gaussian approach that are different in the other packages. I will write a follow up post shortly describing the different approaches taken by different packages. For now, the takeaway for testing is that we will perform simulation studies to assess the similarity between several MICE implementations.

Friday, May 23, 2014

coding structure first pass

Me and Kerby have gotten the structure of MICE to a reasonable point; it's a good time to make and update! Here is the current user interface:

>>> import pandas as pd (1)
>>> import statsmodels.api as sm (2)
>>> from statsmodels.sandbox.mice import mice (3)
>>> data = pd.read_csv('directory_here') (4)
>>> impdata = mice.ImputedData(data) (5)
>>> m1 = impdata.new_imputer("x2") (6)
>>> m2 = impdata.new_imputer("x3") (7)
>>> m3 = impdata.new_imputer("x1", model_class=sm.Logit) (8)
>>> impcomb = mice.MICE("x1 ~ x2 + x3", sm.Logit, [m1,m2,m3]) (9)
>>> p1 = impcomb.combine(iternum=20, skipnum=10) (10)

Now here is what's going on, step by step:

1) Our base data type is going to be a pandas DataFrame. Currently, the data must be in a form that supports pd.DataFrame(data).

2) Import our statsmodels API.

3) Import our mice module.

4) Read in our data.

5) mice.ImputedData is a class that stores both the underlying data and its missing data attributes. Right now the key attribute is which indices for each variable are missing. Later we will be making changes directly to the underlying dataset, so we don't want to lose this information as we start filling in values. As soon as these indices are saved, we modify the underlying data by filling in as an initial imputation all the column-wise means for the missing values. ImputedData also contains a helper function that allows us to fill in values and a function that allows us to construct Imputers (described below). Note that changes to the data are within the scope of ImputedData, so your actual data will be safely preserved after all the MICE carnage is done :)

6-8) For each variable with missing values that we want to impute, we initialize an Imputer. These Imputers contain two simulation methods that will help us impute the specific variable of interest: impute_asymptotic_bayes (described in my last post) and impute_pmm (a new one, predictive mean matching). There will be more simulation methods later on. Each Imputer's job is to impute one variable given a formula and model, specified by the user. ImputedData.new_imputer defaults to OLS and all other variables as predictors. Note that here we are implicitly assuming Missing At Random since, conditional on the predictors, the missing value is completely random (and asymptotically Gaussian).

9) Initialize a MICE instance by specifying the model and formula that we are really interested in. The previous imputation models are simply conditional models that we think will do a good job at predicting the variable's missing values; this new model is an analysis model that we want to fit to the already-imputed datasets. We pass in a list containing all the Imputers from steps 6-8; these Imputers will do the work of imputation for us.

10) Here's where we specify an iteration number (iternum) and number of imputations to skip between iterations (skipnum). Mechanically, what is happening is that once MICE.combine is called, we start stringing imputations together, with the variable with the least number of missing observations being imputed first. ImputerChain just makes sure every Imputer in the list is used once to make one fully-imputed dataset. However, one imputation for each variable may not be enough to ensure a stable conditional distribution, so we want to skip a number of datasets between actually fitting the analysis model. So we run through all the Imputers ad nauseum until we have skipped the set number of fully imputed datasets, then fit the model. This is one iteration; we repeat until have iternum number of fitted models. To be clear: if we specify iternum=10 and skipnum=5, we will go through a total of 50 imputation iterations (one iteration is a series of imputations for all specified Imputers) and only fit the analysis model for imputation number 5, 10, 15, etc.

All this skipping and fitting happens in AnalysisChain. MICE.combine then takes all these fitted models (actually, only the parameters we care about: params, cov_params, and scale) and combines them using Rubin's rule, and finally the combined parameters are stuffed into a fitted analysis model instance.

Whew, that's a mouthful! That's the gist of it, if you'd like to get under the hood (and I hope people do!) the Github is here. There are some minor things I did to statsmodels.api and the working directory just so I could work from my local Github folder/machine, so don't try to install it directly before changing those back.

Next step is making the user interface simpler (hopefully he will just pass the data object directly into MICE and not have to deal with initializing Imputers and ImputerData) and also adding logic in that lets the user specify which simulation method he wants to use for imputation. Hopefully get some feedback and make more improvements before my next post!

Friday, May 16, 2014

I've made some good progress with the help of my fantastic mentor, Kerby Shedden, as well as the statsmodels guru Josef Perktold. In general terms, we have a working implementation of MICE for two parametric models that have good theoretical properties: logistic regression and OLS.

On simulation:

Our implementation uses a Gaussian approximation to the posterior distribution. We draw a parameter from this distribution and add noise according to the estimated quantities. The center of this Gaussian is the maximum likelihood estimator and the covariance matrix is the inverse Fisher information matrix. So, the distribution of the parameters for OLS is just approximately normal with the estimated conditional mean function and covariance, with an (optional) perturbation in the variance parameters given by a chi-square draw. Similarly, for logistic regression the draw is from a Bernoulli. These are asymptotically Bayesian because the real Bayesian posterior mean is a weighted sum of some prior and this estimated mean (as are the scale parameters for OLS), but as the sample size grows large, the prior is discounted. The approximation is most accurate when the prior distribution is uniform. Note that IVEware, a well respected MICE package in SAS, uses this methodology.

On combining:

The combining rules are given by Rubin's rules, which are described in detail here. This is a very intuitive set of rules that are not trivial to derive, but for our purposes we can take them for granted. The estimated parameters are just the mean of the parameters fit on each imputed dataset, and the standard errors are a weighted sum of the within-imputed-data variance and the between-imputed-data variance.

And that's it for the actual statistics! A lot of time was spent figuring out how this can fit into the general statsmodels framework. Next step for now is going to be implementing predictive mean matching. This is important because, although it is potentially biased, it is an imputation method that is robust to choice of model. So when we get into other parametric families that are resistant to our attempts to define a posterior, PMM will become a viable backup.

Cool graph on a bunch of imputations run on a single dataset:

Notice that the green dots fulfill our missing at random assumption: conditional on the horizontal axis variable, the value of the vertical axis variable is random. The reverse is also true, but harder to see on this graph. More in a week or so.

Friday, March 21, 2014

GSOC Blog initialized

Today was the deadline for GSOC submission, and I have completed my application. The small fix I made to statsmodels was also done, it was an M-Dependent covariance structure for Generalized Estimating Equations. It can be found here.