Original Article
Bias arising from missing data in predictive models

https://doi.org/10.1016/j.jclinepi.2004.11.029

Abstract

Objective

The purpose of this study is to determine the effect of three common approaches to handling missing data on the results of a predictive model.

Study Design and Setting

A Monte Carlo simulation study using simulated data was performed. A baseline logistic regression using complete data predicted hospital admission based on the white blood cell count (WBC; dichotomized as normal or high), presence of fever, and procedures performed (PROC). A series of simulations was then performed in which WBC data were deleted for varying proportions (15–85%) of patients under various patterns of missingness. Three analytic approaches were used: complete-case (CC) analysis, missing data assumed to be normal (MAN), and use of imputed values.

Results

In the baseline analysis, all three predictors were significantly associated with admission. Using either the MAN approach or imputation, the odds ratio (OR) for WBC was substantially over- or underestimated depending on the missingness pattern, and there was considerable bias toward the null in the OR estimates for fever. In the CC analyses, the OR for WBC was consistently biased toward the null, the OR for PROC was biased away from the null, and the OR for fever was biased toward or away from the null. Estimates of overall model discrimination were substantially biased under all analytic approaches.

Conclusions

All three methods of handling large amounts of missing data can lead to biased estimates of the OR and of model performance in predictive models. Predictor variables that are measured inconsistently can affect the validity of such models.

Introduction

Multivariable regression modeling is a commonly used means of predicting outcomes in clinical research. The resulting predictive models may be used for risk adjustment in comparing outcomes across settings and populations [1]. Some examples relevant to acute care include the APACHE (Acute Physiology and Chronic Health Evaluation) [2] and PRISM (Pediatric Risk of Mortality) [3] scores, which predict mortality for critically ill patients, and two recently published models predicting admission (Pediatric Risk of Admission [PRISA]; [4], [5]) and resource intensity (Pediatric Emergency Assessment Tool [PEAT]; [6]) for pediatric emergency patients. Such observational studies of health care outcomes frequently rely on existing data sources, such as paper charts, electronic medical records, or other clinical or administrative databases. Most datasets include at least some variables with incomplete data, even when prospectively collected. Some variables may be measured but not recorded; others may not be measured in the first place. This is particularly true in the emergency department (ED) setting where, because of the broad spectrum of patient types and clinical conditions encountered, few physiologic parameters are assessed as a matter of routine. For example, in the PRISA study, two of the predictor covariates were platelet count and serum glucose, which were measured in only 26% and 14% of the patients, respectively [4]. Similarly, the PEAT model included pulse oximetry, which was not measured for 75% of the patients. Statistical methods that aim to predict outcomes across broad ranges of ED patients must therefore address the likelihood that data for some predictors of interest may be missing in a substantial proportion of cases.

Three primary mechanisms of missing data have been described [7, p. 14]. Data are said to be “missing completely at random” (MCAR), when the probability of a given variable having a missing value is independent of the actual value of that variable and of all other variables. This is the case where records may simply be lost. “Missing at random” (MAR) indicates that the probability of being missing is independent of the value of the variable subject to missingness, although it may be related to the values of other variables. For example, patients with fever may be more likely to have a blood count performed, and those without a fever are therefore more likely to have blood count information that is missing. Finally, “missing not at random” (MNAR), or nonignorable missingness, describes a variable where the probability of the value being missing is dependent on the value of that variable itself. For instance, blood count values in the normal range may be less likely to be recorded than those that are clearly abnormal.
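The three mechanisms can be made concrete with a short simulation sketch in Python. This is illustrative only, not from the article: the deletion probabilities and the distribution of the hypothetical WBC variable are assumptions.

import numpy as np

rng = np.random.default_rng(0)
n = 10_000
fever = rng.random(n) < 0.2            # observed covariate
wbc = rng.normal(10.0, 4.0, n)         # value subject to missingness

# MCAR: missingness is independent of every variable (e.g., lost records).
mcar = rng.random(n) < 0.40

# MAR: missingness depends only on observed covariates -- afebrile patients
# are less likely to have a blood count performed at all.
mar = rng.random(n) < np.where(fever, 0.10, 0.60)

# MNAR: missingness depends on the unobserved value itself -- normal-range
# counts are less likely to be recorded than clearly abnormal ones.
mnar = rng.random(n) < np.where(wbc < 11.0, 0.60, 0.10)

# The dataset an analyst would actually see, here under the MAR mechanism.
wbc_observed = np.where(mar, np.nan, wbc)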

To deal with missing data, a variety of analytic approaches have been developed [1], [7]. One is complete-case (CC) analysis, which includes only those subjects who have complete data; this is the default for most statistical software packages. However, even if data for a given variable are complete in the great majority of cases, the number of subjects with no missing values for any variable may be quite small, leading to a large number of cases being discarded. A second approach is to assume that if a variable was not measured for a given patient, it adds no predictive information; in this case, a normal value is substituted for the missing one. This was the approach used in the development of the PRISM [3], PRISA [4], [5], and PEAT [6] scores. Finally, one may use one of several statistical techniques to impute the missing values, an approach that has been used less commonly in the emergency medicine literature. These three approaches are sketched below.
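A minimal sketch in Python of the three approaches on a small simulated dataset. The variable names (high_wbc, fever, proc, admit), effect sizes, and 50% missingness fraction are illustrative assumptions, and the naive unconditional fill in the last line is only a stand-in for a true imputation procedure, which the article does not detail in this excerpt.

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5_000

# Complete ("ideal") dataset with three dichotomous predictors and an outcome.
full = pd.DataFrame({
    "fever": rng.integers(0, 2, n).astype(float),
    "proc": rng.integers(0, 2, n).astype(float),
})
full["high_wbc"] = (rng.random(n) < 0.10 + 0.30 * full["fever"]).astype(float)
logit = -3.0 + 0.8 * full["high_wbc"] + 0.7 * full["fever"] + 1.0 * full["proc"]
full["admit"] = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(float)

# Delete half of the WBC values to mimic a variable measured inconsistently.
df = full.copy()
df.loc[rng.random(n) < 0.5, "high_wbc"] = np.nan

def fit_or(d):
    # Odds ratios from a logistic regression of admission on the predictors.
    X = sm.add_constant(d[["high_wbc", "fever", "proc"]])
    return np.exp(sm.Logit(d["admit"], X).fit(disp=0).params)

or_cc  = fit_or(df.dropna(subset=["high_wbc"]))       # 1. complete-case (CC)
or_man = fit_or(df.fillna({"high_wbc": 0.0}))         # 2. missing assumed normal (MAN)
or_imp = fit_or(df.fillna({"high_wbc": float(round(df["high_wbc"].mean()))}))  # 3. naive fill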

A key consideration in choosing a method is the validity of the proposed approach. Several studies have compared these relatively simple approaches to handling missing data [1], [7], [8]. However, none of the available studies have examined these analytic methods when the proportion of missing data is very high, as in the ED-based examples cited above. The goal of this study was to evaluate the validity of several commonly used approaches to dealing with missing data under a series of scenarios in which data are missing with high frequency. We used a fabricated dataset on which an “ideal” analysis (i.e., one with complete information for all subjects) could be performed, and then ran a series of simulations in which the analyses were repeated on subsets of the data with varying degrees and patterns of missing data. The scenarios were selected to mimic actual ED test-ordering practice, and the results were compared with the baseline (ideal) analysis to evaluate the magnitude of bias in the resulting models.

Section snippets

Study design

This is a simulation study. Monte Carlo simulation is a method of repeating a statistical analysis using multiple iterations, with a subset of data sampled, changed, or deleted at random with each iteration. For example, in this study, the basic analysis was a logistic regression with six dichotomous predictor variables. In each iteration of the simulation, a portion of the data are set to missing, and the odds ratio (OR) and 95% confidence interval (CI) are calculated for each of the predictor
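A minimal sketch in Python of one such iteration, reusing the complete simulated dataset `full` from the sketch above. The MCAR deletion rule, the 50% missingness fraction, the use of three predictors (rather than the six in the study), and the 1,000 iterations are all illustrative assumptions, not the authors' specification.

import numpy as np
import statsmodels.api as sm

def one_iteration(data, frac_missing, rng):
    # Delete a fraction of WBC values at random (MCAR here, for brevity),
    # refit the model on complete cases, and return ORs with 95% CIs.
    d = data.copy()
    d.loc[rng.random(len(d)) < frac_missing, "high_wbc"] = np.nan
    d = d.dropna(subset=["high_wbc"])
    X = sm.add_constant(d[["high_wbc", "fever", "proc"]])
    fit = sm.Logit(d["admit"], X).fit(disp=0)
    return np.exp(fit.params), np.exp(fit.conf_int())

rng = np.random.default_rng(1)
ors = [one_iteration(full, 0.50, rng)[0]["high_wbc"] for _ in range(1000)]
print(f"Mean OR for high WBC over 1,000 iterations: {np.mean(ors):.2f}")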

Description of variables

Fever was present in 19.4% of subjects, and procedures were performed in 12.4%. The outcome, admission, was relatively uncommon, occurring in 6.7% of cases.

The mean ± SD of the sham WBC counts that were generated was 8.6 ± 2.6 for children without complaint of fever, and 12.0 ± 6.0 for patients with fever as a presenting complaint. Among subjects admitted to the hospital, the mean WBC was 11.0 ± 5.4, compared with 9.2 ± 3.6 for those discharged to home. The prevalence of high WBC as defined above was
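The generation of such sham counts can be sketched as follows (illustrative Python, not the authors' code). Normal draws are an assumption the article does not specify, and the cutoff of 15.0 used to define a high WBC is a hypothetical stand-in, since the article's definition is truncated in this excerpt.

import numpy as np

rng = np.random.default_rng(2)
n = 10_000
fever = rng.random(n) < 0.194          # 19.4% of subjects febrile, as reported

# Normal draws with the reported means and SDs by fever status.
wbc = rng.normal(np.where(fever, 12.0, 8.6),
                 np.where(fever, 6.0, 2.6))
wbc = np.clip(wbc, 0.1, None)          # truncate implausible negative counts

high_wbc = wbc > 15.0                  # hypothetical cutoff for "high" WBC
print(f"Prevalence of high WBC: {high_wbc.mean():.1%}")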

Discussion

The prediction models developed using data subject to substantial rates of missingness showed various, and at times striking, levels of bias in the estimates both of the measures of association for the individual predictor variables and of the discrimination of the model as a whole. These biases were manifest with any of the analytic approaches, although the patterns of the bias were different. Moreover, the degree of bias was dependent not only on the analytic method used, but on the specific

Acknowledgments

This work was supported by grant R03-HS11395 from the Agency for Healthcare Research and Quality.

The author acknowledges Molly Stevens, MD, MSCE and David Brousseau, MD, MS for their constructive review of the manuscript.


Cited by (61)

  • Development and external validation of predictive algorithms for six-week mortality in spinal metastasis using 4,304 patients from five institutions

    2022, Spine Journal
    Citation Excerpt:

    Rates of missing data for the development and external validation cohorts, respectively, were: ECOG in 22.4% (672/3001) and 0.5% (7/1303); ASIA in 5.2% (157/3001) and 0.3% (4/1303); hemoglobin in 18.6% (557/3001) and 0.1% (1/1303); white blood cell in 18.6% (557/3001) and 0.1% (1/1303); platelet in 18.6% (558/3001) and 0.1% (1/1303); absolute lymphocyte in 29.9% (897/3001) and 6.6% (86/1303); absolute neutrophil in 27.3% (818/3001) and 6.5% (85/1303); albumin in 25.0% (750/3001) and 6.1% (80/1303); alkaline phosphatase in 25.5% (766/3001) and 6.1% (80/1303); and creatinine in 18.0% (540/3001) and 0.1% (1/1303). No complete case analysis was performed as this introduces bias and is not recommended in the development of predictive models [17]. Five algorithms (stochastic gradient boosting, random forest, support vector machine, neural network, and elastic-net penalized logistic regression) were developed for prediction of 6-week mortality in the developmental cohort [18].

  • Population median imputation was noninferior to complex approaches for imputing missing values in cardiovascular prediction models in clinical practice

    2022, Journal of Clinical Epidemiology
    Citation Excerpt:

    One of seven is the ability of the prediction model to offer features that enable dealing with missing or unavailable values, such as replacing the missing value with the median value for a given population [9]. While extensive research has been undertaken into the handling of missing data in the development and validation of risk prediction models, there is limited evidence on methods to deal with missing patient characteristics in the implementation stage [10–12]. Of the 23 prediction models discussed by Tsvetvanova et al., less than half provided methods and guidance for approaches to estimating patient CVD risk in the absence of patient characteristics included in the model.

  • Prediction models in gynaecology: Transparent reporting needed for clinical application

    2021, European Journal of Obstetrics and Gynecology and Reproductive Biology
    Citation Excerpt:

    A possible consequence of this cut-off is that important predictors can be eliminated [11]. According to Steyerberg, the incorrect exclusion of a predictor does more harm than adding a less useful predictor [45,46]. Nonetheless, a relatively high significance level (P < 0.20 – P < 0.50) will let irrelevant predictors enter the model [6].


Presented in part at the Annual Meetings of the Pediatric Academic Societies, Baltimore, MD, May 2002 and the Society for Academic Emergency Medicine, St. Louis, MO, May 2002.
