Missing data mechanisms and how to handle them

  • Study design & methods

Missing data may severely impact the statistical analyses and the validity of the data in a study. Mechanisms for missing data should be identified as early as possible so that measures to prevent or compensate for missing data can be implemented.

Mechanisms for missing data are best identified during data collection, while there is still time to act on them. Often the likely weaknesses can be traced to the study design or to shortcomings in the data sources. Other risk factors for missing data or bias are associated with the study subjects themselves. It may be possible to implement measures to compensate for these, but there is no universally applicable method for handling missing values. Different approaches may lead to different results, so the methods should be pre-specified in the study protocol to avoid concerns over data-driven selection of methods.

Missing data mechanisms

An important consideration in choosing a missing data approach is the missing data mechanism, because different approaches make different assumptions about it. All missing data reduce statistical power, but the mechanism also determines how much bias the missing data introduce, which may influence the final results and the certainty of the conclusions.

Broadly speaking, there are three groups of missing data:

  1. Missing Completely At Random (MCAR). The missingness has no relationship with any values in the dataset, whether observed or missing. For example, if the result of a biochemical test (Xi) is missing because the lab technician mishandled the blood samples, the data are likely to be missing completely at random: it happened by bad luck. Equipment malfunction or data entry errors are other possible reasons. MCAR means that the probability that observation Xi is missing is unrelated to the value of Xi or to the value of any other variable in the dataset.

  2. Missing At Random (MAR). The term is confusing, because in this case the missingness is systematic: the probability that a variable is missing is related to the observed values of another variable in the dataset. However, once that observed variable is taken into account, the missingness is not related to the value of the variable itself. To continue with the previous example, if men are less likely to show up for blood sampling, more biochemical results will be missing among men. The missingness of Xi is not associated with the result of Xi, but it is correlated with another variable in the dataset, namely gender.

  3. Missing Not At Random (MNAR). Here the probability that a value is missing depends on the value itself, whether or not it was observed. If there are more missing biochemical values among patients with severe disease, and the disease severity is monitored by biochemical tests, we would suspect a direct association. Perhaps the patients with more severe disease are less likely to come for scheduled visits. The data will then be biased, because they contain more complete records from the patients with less severe disease.
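The three mechanisms can be made concrete with a small simulation. The sketch below is illustrative only: the function names, the 50/50 sex split, the distribution of Xi and all missingness probabilities are assumptions chosen for the example, not values from any study.

```python
import random


def simulate(mechanism, n=20000, seed=0):
    """Simulate (male, x, observed) records under one missingness mechanism.

    All probabilities are hypothetical values chosen for illustration.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        male = rng.random() < 0.5
        x = rng.gauss(5.0, 1.0)  # hypothetical biochemical result Xi
        if mechanism == "MCAR":
            p_missing = 0.3                      # same for every record
        elif mechanism == "MAR":
            p_missing = 0.5 if male else 0.1     # depends on observed sex only
        elif mechanism == "MNAR":
            p_missing = 0.5 if x > 5.0 else 0.1  # depends on Xi itself
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        observed = rng.random() >= p_missing
        rows.append((male, x, observed))
    return rows


def missing_rate(rows, keep):
    """Fraction of the records selected by keep(row) whose Xi is missing."""
    selected = [row for row in rows if keep(row)]
    return sum(1 for row in selected if not row[2]) / len(selected)


for mechanism in ("MCAR", "MAR", "MNAR"):
    rows = simulate(mechanism)
    print(
        mechanism,
        "men: %.2f" % missing_rate(rows, lambda r: r[0]),
        "women: %.2f" % missing_rate(rows, lambda r: not r[0]),
        "high Xi: %.2f" % missing_rate(rows, lambda r: r[1] > 5.0),
        "low Xi: %.2f" % missing_rate(rows, lambda r: r[1] <= 5.0),
    )
```

The split by high and low Xi is only possible here because the simulation knows the true value of every record. In a real dataset the missing values are unknown, which is why MNAR cannot be confirmed or ruled out from the observed data alone.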

The impact of missing data

Missing data negatively impacts the statistical power of a study by reducing the sample size. However, missing data may also affect the variability. If the missing data mechanism is not random, the missing data might represent more extreme values, for example from patients with more severe disease. The observed data may then show less variability than the full study population, which makes estimates look more precise than they are and artificially inflates the apparent statistical power of the study.

The most concerning effect of missing data is bias, which is dependent on the mechanism of missingness. There is no universal solution for handling missing data, and different approaches may lead to different results. However, there is an important distinction between whether a missing observation is ignorable or non-ignorable.

“Missing Completely at Random” and “Missing at Random” are both considered ‘ignorable’ because we don’t have to include any information about the missing data itself when we deal with the missing data. However, the selection of the most appropriate analytical strategy for handling missing data may differ between MCAR and MAR. MNAR is called “non-ignorable” because the missing data mechanism itself has to be modeled and we have to establish a model for why the data are missing and what the likely values are.

Complete-case analysis

If the probability of being missing is the same for all cases and the mechanism is unrelated to the data (MCAR), we will not introduce any bias in our dataset, and it will only be a loss of information and reduced statistical power. In this case, a complete-case analysis (listwise deletion) will give unbiased estimates, but standard errors and confidence intervals will reflect the smaller subset of data with complete records. Listwise deletion means that an entire record is deleted from an analysis if it is missing a single value. However, this may be a wasteful approach and can severely reduce the statistical power if the percentage of incomplete records is high.

If there is more missing data within some groups defined by the observed data (MAR), a complete-case analysis may lead to bias. In particular, if the missing data affects the predictors or variables related to the outcome of the study, listwise deletion may lead to critical biases. For example, if men in general are less likely to show up at follow-ups, and men also happen to be more severely affected by the disease being studied, simply excluding incomplete records from the analysis may bias the results toward those of women, who as a group have less severe disease.
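The contrast between MCAR and MAR under listwise deletion can be sketched with simulated data. Everything in the example is an assumption made for illustration: the severity scores, the sex-specific means and the missingness probabilities are hypothetical.

```python
import random
import statistics


def simulate_severity(mechanism, n=20000, seed=1):
    """Return (male, true_severity, recorded_value) records.

    recorded_value is None when the measurement is missing. Means and
    missingness probabilities are hypothetical.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        male = rng.random() < 0.5
        # Assumption: men are on average more severely affected.
        severity = rng.gauss(6.0 if male else 4.0, 1.0)
        if mechanism == "MCAR":
            p_missing = 0.35                  # identical for everyone
        elif mechanism == "MAR":
            p_missing = 0.6 if male else 0.1  # men miss more follow-ups
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        value = None if rng.random() < p_missing else severity
        rows.append((male, severity, value))
    return rows


def complete_case_mean(rows):
    """Listwise deletion: drop records with a missing value, then average."""
    return statistics.mean([v for _, _, v in rows if v is not None])


for mechanism in ("MCAR", "MAR"):
    rows = simulate_severity(mechanism)
    true_mean = statistics.mean([s for _, s, _ in rows])
    print(mechanism,
          "true mean: %.2f" % true_mean,
          "complete-case mean: %.2f" % complete_case_mean(rows))
```

Under MCAR the complete-case mean stays close to the true mean and only the sample size shrinks. Under MAR the complete cases contain proportionally more women, so the estimated mean severity is pulled toward the women's lower average.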

Imputation methods

There are different techniques for retaining records with missing data in the analysis. For MCAR and MAR, multiple imputation and maximum likelihood methods can produce unbiased estimates while using the information in the incomplete records, so less statistical power is lost than with listwise deletion. Imputation methods replace missing observations with values that are predicted in some manner, often from a model.

In single imputation, the missing observation may be replaced with the sample mean or median, with a predicted value of the variable (e.g., from a regression model, a bootstrap sample, or one randomly chosen dataset from a multiple imputation), or with the value from a study patient who matches the patient with the missing data on a set of selected covariates. In regression imputation, the missing value is estimated from the regression of the target variable on all or a subset of the other variables. Another common form of single imputation is carry-forward in longitudinal data: if a patient has an observed lab value at time one and is missing that lab value at time two, the value at time two is assumed to equal the value at time one, i.e., the time one value is carried forward.
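Two of the single imputation strategies described above, mean imputation and carry-forward, can be sketched in a few lines. The function names and the example values are hypothetical, and missing observations are represented as `None`.

```python
import statistics


def mean_impute(values):
    """Replace each missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]


def carry_forward(series):
    """Replace each missing value with the most recent observed value.

    Values missing before the first observation stay None, because there
    is nothing to carry forward yet.
    """
    filled, last = [], None
    for v in series:
        if v is not None:
            last = v
        filled.append(last)
    return filled


lab_values = [1.0, None, 3.0, None, 5.0]  # hypothetical measurements
observed_only = [v for v in lab_values if v is not None]

print(mean_impute(lab_values))
print(carry_forward(lab_values))
# Mean imputation leaves the mean unchanged but shrinks the sample variance.
print(statistics.variance(observed_only),
      statistics.variance(mean_impute(lab_values)))
```

The last line compares the sample variance before and after mean imputation. The filled-in values sit exactly at the mean, so the imputed dataset looks less variable than the observed data.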

One of the problems with single imputation methods is that the replaced observations are treated as actual observations in subsequent analysis. However, the filled-in values are estimates, which have standard errors. Multiple imputation can be used to take the uncertainty about these estimates into account.

In multiple imputation, several data sets are produced, each with different values imputed for the missing observations, thus reflecting the uncertainty around their true values. Each imputed data set is then analysed using standard procedures for complete data, and the results from these analyses are combined. If data are MAR, multiple imputation will generally produce unbiased results if the imputation model includes the correct set of covariates. However, in the presence of MNAR data, multiple imputation in general cannot fully correct the bias due to missing data.
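The article does not say how the results are combined. A common choice is the set of formulas known as Rubin's rules, sketched below on the assumption that each imputed data set has already been analysed and has yielded one point estimate and one squared standard error. The function name and the input numbers are hypothetical.

```python
import statistics


def pool_estimates(estimates, variances):
    """Combine results from m imputed data sets using Rubin's rules.

    estimates: point estimate from each imputed data set
    variances: squared standard error from each imputed data set
    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching results from at least two data sets")
    pooled = statistics.mean(estimates)
    within = statistics.mean(variances)       # average sampling variance
    between = statistics.variance(estimates)  # spread across imputations
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical results from m = 3 imputed data sets.
pooled, total = pool_estimates([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(pooled, total ** 0.5)  # pooled estimate and its standard error
```

The between-imputation term is what single imputation leaves out: the more the estimates differ between imputed data sets, the larger the pooled standard error becomes.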

Maximum likelihood methods

Maximum likelihood estimation (MLE) is a procedure that finds the values of the model parameters that maximize the sample likelihood, i.e., the values that make the observed data “most probable”. This method does not impute any data, but rather uses the available data from each record to compute the maximum likelihood estimates. Like multiple imputation, this method gives unbiased parameter estimates and standard errors for MCAR and MAR.
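The sketch below shows one simple case of using the available data from each record. It assumes two variables that are jointly normal, with x observed for everyone and y missing for some records depending only on x (MAR). In that setting the maximum likelihood estimate of the mean of y has a closed form: the complete-case mean of y, adjusted by the regression slope times the difference between the mean of x in all records and in the complete cases. The simulated values and function name are hypothetical.

```python
import random
import statistics


def ml_mean_y(pairs):
    """ML estimate of the mean of y; x is complete, y may be None.

    Assumes (x, y) are bivariate normal and that missingness in y
    depends only on the observed x.
    """
    x_all = [x for x, _ in pairs]
    complete = [(x, y) for x, y in pairs if y is not None]
    mean_x_cc = statistics.mean([x for x, _ in complete])
    mean_y_cc = statistics.mean([y for _, y in complete])
    sxy = sum((x - mean_x_cc) * (y - mean_y_cc) for x, y in complete)
    sxx = sum((x - mean_x_cc) ** 2 for x, _ in complete)
    slope = sxy / sxx  # regression of y on x in the complete cases
    # Records without y still contribute through the mean of x.
    return mean_y_cc + slope * (statistics.mean(x_all) - mean_x_cc)


rng = random.Random(2)
pairs = []
for _ in range(10000):
    x = rng.gauss(0.0, 1.0)
    y = 2.0 + 1.5 * x + rng.gauss(0.0, 1.0)  # true mean of y is 2.0
    p_missing = 0.7 if x > 0 else 0.1        # hypothetical MAR mechanism
    pairs.append((x, None if rng.random() < p_missing else y))

complete_case = statistics.mean([y for _, y in pairs if y is not None])
print("complete-case mean: %.2f" % complete_case)
print("ML estimate:        %.2f" % ml_mean_y(pairs))
```

No value of y is filled in at any point. The records with missing y still inform the estimate through their observed x, which is what corrects the bias of the complete-case mean in this example.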