
Addressing Missing Data in Study Outcomes

Author: Madeline M. Roberts, MPH, PhD

Missing data is something of an inconvenient truth in epidemiological studies. Virtually every investigator and every study is subject to missing data to some degree. Non-response, loss to follow-up, and errors in data entry are just a few of the possible contributing factors. Left unaddressed, missing data leads to imprecise results and can introduce substantial bias into study conclusions. Because some amount of missing data is a virtual certainty in any study, how investigators deal with it is critically important.

A recent study in the American Journal of Epidemiology by Cole et al. implores epidemiologists to account for missing data rather than simply ignoring it. The authors demonstrate the impact of missing data through several simulations built on a randomized trial data set that was not subject to missing data, onto which different missing data conditions were artificially imposed. For each missing data mechanism, two approaches were compared: one naïve (complete-case analysis) and one principled (g-computation). The simulated missing data mechanisms and study findings were as follows:

Missing completely at random (MCAR): in this case, 25% of participant outcomes were set to missing independent of treatment status, the covariate of interest, and the value of the outcome.

Simulation results: complete-case analysis produced no bias but lost precision (the standard error was 1.16 times larger than with the complete data set). Accounting for the MCAR data did not improve precision.
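To make this mechanism concrete, the sketch below imposes MCAR missingness on a simulated data set in Python. The data-generating model (sample size, covariate prevalence, effect sizes) is invented for illustration and is not the trial data the authors used; only the 25% completely-at-random missingness follows the description above.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(42)
    n = 100_000

    # Hypothetical complete data set (illustrative only, not the authors' trial):
    # randomized treatment A, binary covariate W, binary outcome Y.
    A = rng.binomial(1, 0.5, n)
    W = rng.binomial(1, 0.4, n)
    p_y = 1 / (1 + np.exp(-(-1.0 + 0.8 * A + 0.5 * W)))  # logistic outcome model
    Y = rng.binomial(1, p_y).astype(float)
    df = pd.DataFrame({"A": A, "W": W, "Y": Y})

    def risk_difference(d):
        """Complete-case risk difference for treatment A on outcome Y."""
        cc = d.dropna(subset=["Y"])
        return cc.loc[cc.A == 1, "Y"].mean() - cc.loc[cc.A == 0, "Y"].mean()

    # MCAR: 25% of outcomes set to missing, independent of A, W, and Y.
    mcar = df.copy()
    mcar.loc[rng.random(n) < 0.25, "Y"] = np.nan

    print(f"full-data RD:          {risk_difference(df):.3f}")
    print(f"MCAR complete-case RD: {risk_difference(mcar):.3f}")

The two estimates should agree up to random noise; only the standard error suffers from discarding a quarter of the outcomes.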

Missing at random (MAR) with positivity: the authors define positivity as the condition that each study participant has a positive probability of having observed data given measured covariates. In this case, half of the participants who both received treatment and had the covariate of interest, and half of the participants who neither received treatment nor had the covariate of interest, had their outcomes set to missing.

Simulation results: complete-case analysis introduced substantive bias, which was ameliorated when the missing data were accounted for, though at some cost to precision (the standard error was 1.07 times larger than without the bias correction).
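Continuing the sketch above (reusing df, A, W, rng, and risk_difference), the following imposes the MAR-with-positivity mechanism and then corrects it. Inverse-probability-of-missingness weighting stands in here for the principled correction; the paper itself uses g-computation, but with the 0.5 missingness probabilities known from the design, the weighted estimate makes the repair easy to see.

    # MAR with positivity: outcomes missing with probability 0.5 in the
    # (A=1, W=1) and (A=0, W=0) cells, fully observed elsewhere --
    # missingness depends on A and W but not on Y itself.
    p_miss = np.where(((A == 1) & (W == 1)) | ((A == 0) & (W == 0)), 0.5, 0.0)
    mar = df.copy()
    mar.loc[rng.random(n) < p_miss, "Y"] = np.nan

    print(f"MAR complete-case RD: {risk_difference(mar):.3f}")  # biased

    # Principled correction: weight each observed outcome by 1 / P(observed | A, W).
    obs = mar.dropna(subset=["Y"]).copy()
    obs["wt"] = 1.0 / (1.0 - p_miss[obs.index.to_numpy()])
    rd = (np.average(obs.loc[obs.A == 1, "Y"], weights=obs.loc[obs.A == 1, "wt"])
          - np.average(obs.loc[obs.A == 0, "Y"], weights=obs.loc[obs.A == 0, "wt"]))
    print(f"MAR weighted RD:      {rd:.3f}")  # bias removed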

Missing at random (MAR) without positivity: all participants who both received the treatment and had the covariate of interest had their outcomes set to missing (the probability of being observed for this group was zero, violating positivity).

Simulation results: substantive bias was also introduced under complete-case estimation; here, however, it was not removed by accounting for the missing data.
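In the same sketch, the positivity failure is easy to see: one cell has no observed outcomes at all, so its inverse-probability weight would be infinite.

    # MAR without positivity: every outcome in the (A=1, W=1) cell is missing,
    # so P(observed | A=1, W=1) = 0.
    nopos = df.copy()
    nopos.loc[(A == 1) & (W == 1), "Y"] = np.nan

    print(f"no-positivity complete-case RD: {risk_difference(nopos):.3f}")  # biased

    # The weighting correction breaks down: the (A=1, W=1) cell would need a
    # weight of 1 / 0. With no observed outcomes in that cell, any "correction"
    # rests entirely on extrapolation, which is why the bias persists.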

Missing not at random (MNAR): among participants who did not have the outcome of interest, half of those who both received treatment and had the covariate of interest, as well as half of those who neither received treatment nor had the covariate of interest, had their outcomes set to missing. Additionally, half of the participants who did have the outcome of interest and were in the treatment arm had their outcomes set to missing.

Simulation results: bias was again introduced using complete-case estimation. Accounting for the missing data mitigated but did not fully eliminate this bias.
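The MNAR mechanism can be written out the same way (again reusing the objects from the first sketch); the key difference is that the missingness probabilities now reference Y itself.

    # MNAR: missingness depends on the outcome Y as well as on A and W.
    p_miss = np.zeros(n)
    p_miss[(Y == 0) & (((A == 1) & (W == 1)) | ((A == 0) & (W == 0)))] = 0.5
    p_miss[(Y == 1) & (A == 1)] = 0.5
    mnar = df.copy()
    mnar.loc[rng.random(n) < p_miss, "Y"] = np.nan

    print(f"MNAR complete-case RD: {risk_difference(mnar):.3f}")  # biased

    # A correction that conditions only on A and W (as in the MAR case) can
    # shrink this bias but cannot remove it, because Y itself drives missingness.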

These types of data simulations have been done before, and missing data mechanisms have been thoroughly discussed in the work of Drs. Paul Allison, Roderick Little, and Donald Rubin, among others. In this study, the two scenarios in which accounting for missing data offered no improvement were MCAR (where complete-case analysis was already unbiased) and MAR without positivity (where the bias could not be removed). MCAR is a stringent assumption and not commonly encountered in practice. Many researchers find that their data reasonably fall under the MAR assumption, though it may be difficult to discern whether positivity holds. It is helpful to note that under the more common assumption of MAR with positivity, bias can be ameliorated when a principled approach is applied; this example also demonstrates that, without positivity, an unbiased parameter estimate may not be achievable.

This particular demonstration dealt exclusively with missing outcome data, but it serves as an important reminder for epidemiology as a field to do our due diligence in applying appropriate principled approaches to missing data. It is true that one cannot empirically verify missing data assumptions, since to do so would require information on the very data that are missing; we can never be certain, given only the observed data, that our assumptions hold. It is possible, however, to thoughtfully and methodically evaluate the data that are available. Principled approaches such as maximum likelihood and multiple imputation are available in standard statistical packages and, when appropriately applied, make the most of the available data while mitigating bias. An excellent article on approaches to missing data in a primary care setting can be found here.
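As one concrete example of such a principled approach in a standard package, the sketch below runs multiple imputation by chained equations with statsmodels on the MAR data set from the earlier sketches. The linear probability model (sm.OLS) is chosen only to keep the example short; a logistic model would be the more natural choice for a binary outcome, and the variable names come from the illustrative simulation, not from the paper.

    import statsmodels.api as sm
    from statsmodels.imputation.mice import MICE, MICEData

    # Multiple imputation by chained equations on the MAR data set from above.
    imp = MICEData(mar)                    # imputes the missing Y from A and W
    mice = MICE("Y ~ A + W", sm.OLS, imp)  # analysis model refit on each imputed copy
    results = mice.fit(n_burnin=10, n_imputations=20)
    print(results.summary())               # estimates pooled across imputations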

Paul Allison commented in one of his seminars, “It’s sort of like linear regression where we standardly assume that the error term is uncorrelated with the predictors. We never know for sure if that’s true and, in fact, it’s probably always false to some degree. The idea is to control for everything that we can observe and assume that whatever we can’t observe is completely random.”  

As the study’s authors aptly state, “Missing data are arguably the central analytic problem for epidemiology, because confounding and measurement error may be framed as implicit missing-data problems.” May we as epidemiologists double down on our efforts to understand missing data mechanisms and utilize the optimal statistical tools for analysis.  
 
