Missing Data

Data that we plan to analyze are often incomplete. Ideally, the study design secures complete data in the first place through questionnaire design, interviewer training, study protocol development, real-time data checking, or re-contacting participants. When obtaining complete data is not feasible, proxy reports or the collection of characteristics associated with the missing values can help.

Missing data can be categorized in multiple ways. Perhaps the most troubling are data missing on entire observations (e.g., due to selection bias) or on entire variables that have been omitted from the study design. Somewhat more tractable, but still potentially problematic, are data missing on a subset of variables for a subset of the observations. In this case, it can be useful to label observations without missing data as “complete cases” and those with some missing data as “partial cases.” Ideally, the amount of missing data is limited, in which case the results rely less heavily on our assumptions about the pattern of missing data.

Missing data can bias study results by distorting the effect estimate of interest (e.g., β). Missing data are also problematic if they decrease statistical power by effectively decreasing the sample size, or if they complicate comparisons across models that differ in both the analysis strategy and the number of included observations.
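The distinction between complete and partial cases, and the loss of effective sample size that comes with restricting an analysis to complete cases, can be illustrated with a minimal sketch. The records, variable names, and the `is_complete` helper below are hypothetical and serve only as an illustration; `None` is assumed to mark a missing value.

```python
# Hypothetical toy records; None marks a missing value.
records = [
    {"id": 1, "age": 34, "income": 52000, "outcome": 1},
    {"id": 2, "age": 41, "income": None, "outcome": 0},
    {"id": 3, "age": None, "income": 38000, "outcome": 1},
    {"id": 4, "age": 29, "income": 61000, "outcome": 0},
    {"id": 5, "age": 55, "income": None, "outcome": None},
]


def is_complete(record):
    """Return True when no field in the record is missing."""
    return all(value is not None for value in record.values())


# Split observations into complete cases and partial cases.
complete_cases = [r for r in records if is_complete(r)]
partial_cases = [r for r in records if not is_complete(r)]

# Fraction of observations missing each analysis variable.
variables = ["age", "income", "outcome"]
missing_fraction = {
    v: sum(r[v] is None for r in records) / len(records) for v in variables
}

print("complete cases:", len(complete_cases))
print("partial cases:", len(partial_cases))
print("fraction missing by variable:", missing_fraction)
```

In this toy example, a complete case analysis would keep only two of the five observations, even though no single variable is missing for more than two of them. This is one way in which modest per-variable missingness can translate into a substantial drop in effective sample size.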
