
## Abstract

**Objective:** To determine whether receiving test characteristics influences physicians’ post-test probability estimates, and whether those estimates are associated with their subsequent clinical decisions.

**Design:** Questionnaire based randomised controlled trial.

**Setting:** Mailed survey with a vignette describing an infant whose pretest likelihood of pertussis was 30% and direct fluorescent-antibody (DFA) test was negative for pertussis.

**Subjects:** Nationally representative sample of US paediatricians (n = 1502).

**Interventions:** Random receipt of no additional information (controls), the DFA’s sensitivity and specificity (TC group) or the test’s sensitivity and specificity with their definitions (TCD group).

**Main outcome measures:** Estimated post-test probability (PTP) of pertussis, PTP of 0.50, “nearly correct” PTP (±5%), intended erythromycin management and intended hospital disposition.

**Analyses:** χ² and t tests.

**Results:** Despite the negative DFA result, 56% of the 635 (49.7%) participants who responded estimated a PTP higher than the pretest probability of 30%; the overall mean estimated PTP was 0.41 (SD 0.26) (correct answer: 0.18). The TCD group’s mean PTP was significantly higher than controls’ mean PTP (0.45 vs 0.38, p<0.001), while the TC and control groups’ mean PTP did not differ significantly (0.41 vs 0.38, p = 0.16). With decision support, significantly more TC and TCD participants than controls estimated the PTP as 0.50 (38% vs 17%, p<0.001; 41% vs 17%, p<0.001, respectively); they also tended to estimate a nearly correct PTP more often (20% vs 13%, p = 0.06; 19% vs 13%, p = 0.08, respectively). The mean PTP of participants intending to discontinue erythromycin therapy or discharge the patient home was significantly *lower* than that of participants who intended continuing erythromycin or hospitalisation (0.20 vs 0.43, p<0.001; 0.40 vs 0.49, p = 0.005, respectively).

**Conclusions:** Paediatricians differed in their response to information about test characteristics. For many, it increased errors in estimating post-test probability; for others, it reduced errors. Estimated post-test probability was logically associated with intended clinical management.


When making diagnostic and management decisions while caring for ill patients, physicians need to integrate information from their history and examinations with diagnostic test results. Given a patient’s pretest probability of disease and a test result, Bayes’ theorem provides a framework for calculating the patient’s post-test probability of disease.1 Although evidence based medicine courses teach the skills of test result interpretation and post-test probability calculation,2 few physicians use the recommended formal Bayesian calculations in clinical practice.3

Many tests are ordered inappropriately,4 so improving test utilisation is important. Decision support linked to computer provider order entry can reduce unnecessary serological5 6 and radiological7 test ordering, but it has been infrequently used to improve physicians’ interpretation of test results. As electronic medical records are being implemented widely, installing computer based decision support that improves clinicians’ interpretation of test results has important implications. Presenting test performance information in self-administered questionnaires has been shown to influence the post-test probability estimates of general practitioners attending continuing medical education courses in Switzerland.8 9 To our knowledge, no population based study has been conducted to investigate ways of improving physicians’ interpretation of post-test probability estimates while studying the relationships between post-test probability and intended clinical management decisions.

As a component of an experiment we conducted with a nationally-representative sample of paediatricians practising in the United States, this study had two objectives: (1) to determine whether being presented with the sensitivity and specificity of a diagnostic test affects paediatricians’ estimates of the post-test probability of disease; and (2) to find out if their post-test probabilities of disease were associated with their subsequent patient management.

## METHODS

In previous publications we have described the subject selection, questionnaire used10 and design of this randomised controlled trial,11 which we will now review briefly.

### Subjects

Our target study population was a random sample of general paediatricians practising in the USA. We randomly selected 1502 of the 44 561 paediatricians with no secondary specialty in the 2002 American Medical Association (AMA) masterfile of physicians licensed in the USA. To be powered to detect a 15% difference in management rates between intervention groups while assuming a 50% response rate, we chose this sample size to ensure there were at least 175 participants in each randomised group.

### Study design

We implemented a questionnaire based double-blinded randomised controlled trial in a mailed survey that included two clinical vignettes; this manuscript is focused on post-test probability, which was one of the outcomes of one vignette. Using a computerised random number generator, we randomised subjects into one of three groups who received different levels of test performance information. Starting in February 2002, we mailed potential subjects a four page questionnaire. We sent non-responders a replacement survey at 4-week intervals, for a maximum of four mailings per subject; we also sent non-responders a reminder postcard after the second mailing. The University of Washington Institutional Review Board approved this study.

### Questionnaire

The questionnaire described a clinical vignette of a 5-month-old girl with perioral cyanosis and a hacking cough (see online supplementary file); subjects were asked to assume that the pre-test likelihood she had pertussis was 30%. The patient was hospitalised, a direct fluorescent antibody (DFA) test for *Bordetella pertussis* was carried out on her sputum, and she was started on erythromycin. Three days later, her condition was somewhat improved and the DFA result was finalised as negative for *B pertussis*.

Along with the DFA result, subjects received their randomly assigned test performance information. The control group received no additional information. The “test characteristics alone” (TC) group was given the following information: “the pertussis DFA has a sensitivity of 50% and a specificity of 95%”. The “test characteristics defined” (TCD) group was given the information: “the pertussis DFA has a sensitivity of 50%, meaning that if 100 patients infected with pertussis were tested with the pertussis DFA, 50 would test positive and 50 would test negative by DFA. Similarly, the pertussis DFA has a specificity of 95%, meaning that if 100 patients who are not infected with pertussis were tested with the DFA for pertussis, 5 would test positive and 95 would test negative”. We derived the sensitivity and specificity of the DFA presented in this vignette from the medical literature12 13; these test characteristics are not widely known by general paediatricians.

After the vignette presented the test result, subjects were asked: “What is your estimate of the likelihood that this patient has pertussis? ____ (0–100%)?” This question was immediately followed by two clinical management questions regarding whether subjects would choose to: (1) continue the erythromycin therapy, or not, and (2) discharge the patient home, or not. We chose these outcomes because erythromycin was considered the drug of choice for the treatment of pertussis when the survey was conducted and “infants younger than 6 months of age often require hospitalisation for supportive care”.14

### Additional demographic data

To collect information about medical education and training, we asked subjects if they were board certified in paediatrics, a US medical school graduate, or currently in a paediatric residency program.

To collect information about clinical practice, we asked subjects to report the percentage of clinical time spent in general paediatrics, whether they worked in a practice that had one or two paediatricians, or more, and whether they were affiliated with an academic medical centre. We defined “general paediatricians” as those who reported spending more than 80% of their clinical time in general paediatrics. We defined participants with “small practices” as those whose primary practices included one or two paediatricians. We categorised academic affiliation as resident, academic (non-resident) or no academic affiliation.

### Outcome measures

The primary outcome was participants’ estimate of the patient’s post-test probability of pertussis. As probability expresses a physician’s opinion of the chance that an event will occur on a scale from 0 to 1,15 we divided participants’ estimated likelihoods of pertussis by 100 to obtain each subject’s estimate of the post-test probability of pertussis. In this case, using Bayes’ theorem to calculate the post-test probability of pertussis after a negative DFA test result, the correct answer is 0.18, using the following equation2 16:
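For a negative test result, Bayes’ theorem takes the standard form below. The symbols are notation assumed here rather than taken from the questionnaire: P(D) is the pretest probability (0.30), Se the sensitivity (0.50) and Sp the specificity (0.95).

```latex
\[
P(D \mid T^{-})
  = \frac{P(D)\,(1-\mathrm{Se})}{P(D)\,(1-\mathrm{Se}) + \bigl(1-P(D)\bigr)\,\mathrm{Sp}}
  = \frac{0.30 \times 0.50}{0.30 \times 0.50 + 0.70 \times 0.95}
  = \frac{0.150}{0.815} \approx 0.184
\]
```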

This randomised controlled trial was also designed to investigate how post-test probability estimates relate to erythromycin treatment and hospital disposition plans (continue hospitalisation or discharge the patient home). The lack of a direct effect of receiving test characteristics on these two management outcomes has been published separately.11 To validate post-test probability as clinically meaningful, we present our previously unpublished assessment of relationships between participants’ post-test probability estimates and these intended clinical management measures.

After examining the distribution of participants’ post-test probabilities by intervention group, we defined two secondary outcome measures: (1) an estimated post-test probability of 0.50 (the TC and TCD groups were both told of the DFA’s sensitivity of 50%), and (2) a “nearly correct” post-test probability. We presumed that a post-test probability estimate of 0.50 represented either a “50/50” guess17 or base rate neglect.18 We defined a “nearly correct post-test probability” to be an answer within 1 standard deviation of the correct answer (SD 0.05), that is, a post-test probability between 0.13 and 0.23.
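The two secondary outcomes amount to a simple classification of each response, which might be sketched as below. The function name `classify_estimate` is hypothetical, and treating the 0.13 and 0.23 bounds as inclusive is an assumption, since the text does not say how boundary values were handled.

```python
def classify_estimate(likelihood_percent):
    """Classify one 0-100% questionnaire response.

    Returns (ptp, is_fifty, is_nearly_correct).
    """
    # Convert the percentage to a probability; rounding guards against
    # floating-point noise when comparing with the cut-offs.
    ptp = round(likelihood_percent / 100, 4)
    is_fifty = ptp == 0.50
    # Assumption: the "nearly correct" bounds 0.13 and 0.23 are inclusive.
    is_nearly_correct = 0.13 <= ptp <= 0.23
    return ptp, is_fifty, is_nearly_correct


for response in (18, 30, 50):
    print(response, classify_estimate(response))
```

A response of 18% would count as nearly correct, 50% as a “0.50” estimate, and 30% (the unchanged pretest probability) as neither.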

### Statistical analysis

For each outcome, we performed bivariate comparisons between the control group and either the TC group or the TCD group. We conducted t tests to compare the effect of receiving test characteristics on post-test probability estimates, and χ² tests to compare its effect on the two secondary dichotomous outcome measures. As we found that post-test probability estimates were non-normally distributed, we conducted Wilcoxon rank sum tests to confirm the t test analyses.

To examine the relationship between post-test probability estimates and subsequent patient management, we conducted t tests to compare the post-test probability estimates between the two groups of our two dichotomous management outcomes (erythromycin therapy and hospital disposition plans). We also conducted χ² tests to assess for differences in the two management outcomes by the two dichotomous secondary outcomes (post-test probability = 0.50, and nearly correct post-test probability).
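As an illustration of the χ² comparisons described above, the following is a minimal sketch of a Pearson χ² test for a 2×2 table using only the Python standard library. It is not the authors’ analysis code: the function name `chi_square_2x2` is hypothetical, no continuity correction is applied (the text does not say whether one was used), and the counts in the usage lines are rough back-calculations from reported percentages and group sizes, assumed for illustration rather than taken from the study’s data.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for [[a, b], [c, d]].

    Returns (statistic, p_value).
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    observed = [a, b, c, d]
    expected = [
        row1 * col1 / n,
        row1 * col2 / n,
        row2 * col1 / n,
        row2 * col2 / n,
    ]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # With 1 degree of freedom, the upper-tail p value is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Illustrative counts only (assumed, not the study's cell counts):
# rows are groups, columns are "estimated 0.50" yes / no.
statistic, p_value = chi_square_2x2(85, 139, 33, 161)
print(f"chi-square = {statistic:.2f}, p = {p_value:.2g}")
```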

## RESULTS

### Response to survey

We mailed questionnaires to 1502 physicians identified as paediatricians with no subspecialty. Of these, 106 questionnaires (7%) were returned by the post office without forwarding addresses and 43 physicians (3%) did not meet inclusion criteria for the survey (eg, were not currently practising paediatrics). Of the 1353 potentially eligible subjects, 59 (4%) replied with a refusal to participate and 653 returned completed surveys, leaving 641 potentially eligible non-responders (fig 1); therefore, the estimated response rate19 was 49.7% (653/1353). There were no statistically significant differences between survey responders and non-responders by gender, age or intervention group (control, TC and TCD).

### Study participation

Of the 653 completed surveys we received, 18 did not include an estimate of the post-test probability of pertussis (although they did include responses to most questions). As a result, these 18 survey responders were excluded from this study of post-test probability estimates. There were no statistically significant differences by gender, age or intervention group between survey responders who estimated a post-test probability of pertussis and those who did not.

Therefore, this study had 635 participants: 194 in the control group, 224 in the TC group, and 217 in the TCD group. Study participants had graduated from medical school a mean of 16.2 years before responding to the questionnaire (SD 11.7, median 14, range 1–55); other characteristics of the 635 participants are presented in table 1. For the characteristics of all 653 survey respondents, please see two previous publications based on analyses of all survey respondents.10 11

### Post-test probability estimates

Overall, study participants’ estimates of the post-test probability of pertussis had a non-normal distribution (fig 2). While the correct post-test probability was 0.18, the mean post-test probability of all participants was 0.41 (SD 0.26, median 0.50, range 0.0–1.0, interquartile range 0.15–0.50). Although the patient’s DFA test result was negative for pertussis, 56% of participants estimated a post-test probability of pertussis higher than the pretest probability of 0.30, and 11% estimated a post-test probability of 0.30.

The distributions of the three intervention groups’ post-test probability estimates were different (fig 3). The mean post-test probability of the TCD group was significantly *higher* than that of controls, while those of the TC and control groups were not significantly different (table 2). Making these comparisons using the non-parametric Wilcoxon rank sum test and the same data yielded similar results. Of 635 study participants, three estimated the post-test probability to be 0.18 and two estimated it to be 0.19 (the correct answer was 0.184); all five of these participants were in the TCD group.

### Secondary outcomes

Overall, 32% of participants estimated the post-test probability to be 0.50, and 17% estimated a nearly correct probability. Compared to controls, more than twice as many TC and TCD participants (who were presented with the sensitivity of 50% and specificity of 95%) estimated the post-test probability to be 0.50. At the same time, there was a trend towards more TC and TCD participants estimating a nearly correct probability than controls (table 2).

A greater proportion of residents (29%) estimated a nearly correct probability than participants with (15%) or without an academic affiliation (15%, p = 0.003). Similarly, significantly more US medical school graduates (20%) estimated a nearly correct probability than did foreign medical school graduates (5%, p<0.001), and more paediatricians practising with multiple other paediatricians (19%) estimated a nearly correct probability than did those working in a small practice (9%, p = 0.008).

### Relationships between probability estimates and intended clinical management

Although being presented with test characteristics did not directly affect their intended clinical management,11 participants’ post-test probability estimates were associated with their subsequent intended management. The mean post-test probability of pertussis of participants who intended to stop erythromycin therapy was significantly lower than that of participants who intended to continue treatment. Similarly, the mean post-test probability of participants who intended to discharge the patient from hospital was significantly lower than that of those who intended to keep the patient in hospital (table 3).

Participants who incorrectly estimated the post-test probability of pertussis to be 0.50 were more likely to intend to continue the patient’s erythromycin treatment than those who did not estimate it to be 0.50, and they trended towards being more likely to intend to continue hospitalisation as well. In contrast, participants who estimated a nearly correct probability trended towards being more likely to intend to discharge the patient home from hospital than those who did not make a nearly correct estimate, but there was no significant relationship between estimating a nearly correct probability and hospital disposition plan (table 3).

## DISCUSSION

In this randomised controlled trial, we identified two distinct populations of paediatricians: a small cluster who responded to receiving test characteristics by making post-test probability estimates that were close to correct and a larger group whose estimates were worsened by the same information. This demonstration that presenting test characteristics to a nationally representative group of US paediatricians can influence their post-test probability estimates in different ways suggests that some paediatricians understand how to calculate post-test probabilities while others do not. We found that being told that a test’s sensitivity was 50% and specificity was 95% resulted in some paediatricians incorrectly estimating the post-test probability to be 0.50, while influencing others to estimate a nearly correct probability. We had hypothesised that augmenting the decision support concerning the test’s sensitivity and specificity with brief definitions would improve subjects’ post-test probability estimates, but found that presenting the additional information resulted in probability estimates even further from the correct post-test probability than those of subjects who received no information about test characteristics. We were pleased to find that physicians’ post-test probability estimates were logically associated with their subsequent clinical decisions, as the mean estimated post-test probability of pertussis by paediatricians who continued erythromycin treatment or hospitalisation was significantly higher than that of participants who did not continue antibiotics or hospitalisation.

While some of our study’s findings were consistent with results from two recently published questionnaire based trials conducted in Switzerland, others contradict their results.8 9 In the Swiss controlled trial8 and the Swiss randomised trial,9 participants read clinical vignettes with either (1) the test’s sensitivity and specificity, (2) its positive or negative likelihood ratio, or either (3a) no further information (in the controlled trial) or (3b) a graphical representation of the same likelihood ratio (in the randomised trial). Both our randomised controlled trial and the Swiss controlled trial used a single vignette, while the Swiss randomised trial used six different vignettes. Both of the Swiss trials found that receiving the sensitivity and specificity data was associated with more accurate post-test probability estimates,8 9 while we found that this information actually worsened post-test probability estimates. The Swiss trials did not study the relationship between post-test probability and clinical decisions, while our study demonstrated that paediatricians’ post-test probabilities were logically associated with their intended clinical management.
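Because the Swiss trials presented likelihood ratios, it may help to see the same Bayesian update in odds form. The sketch below applies the standard negative likelihood ratio to this study’s vignette values (pretest probability 0.30, sensitivity 0.50, specificity 0.95), not to the Swiss vignettes; the symbols are notation assumed here.

```latex
\[
\mathrm{LR}^{-} = \frac{1-\mathrm{Se}}{\mathrm{Sp}} = \frac{0.50}{0.95} \approx 0.526,
\qquad
\text{pretest odds} = \frac{0.30}{0.70} \approx 0.429
\]
\[
\text{post-test odds} = 0.429 \times 0.526 \approx 0.226,
\qquad
\text{post-test probability} = \frac{0.226}{1 + 0.226} \approx 0.184
\]
```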

There are many potential explanations why three similar studies had contradictory results; perhaps the most notable is the difference in the generalisability of the three study populations. Our study approached a random sample of general paediatricians practising in the USA, while the Swiss trials approached attendees at two continuing medical education (CME) courses in Switzerland. Perhaps attendees of an evidence based medicine CME course are more familiar with calculating post-test probabilities than the average US paediatrician. Our finding that post-test probability estimates were often in the “wrong direction” may have been related to the poor accuracy of the DFA test for pertussis, as the Swiss randomised trial found that such estimates occurred most often in vignettes with low accuracy tests.9 If we had presented multiple vignettes with different pretest probabilities or test accuracies, perhaps receiving test characteristics would have caused paediatricians to make more accurate probability estimates.

We identified two distinct populations of participants: a small cluster who responded to receiving test characteristics by making post-test probability estimates that were close to correct and a larger group whose estimates were worsened by the same information. Participants who estimated a nearly correct post-test probability tended to be residents and US medical school graduates. Presenting the test’s sensitivity of 50% doubled the number of participants who estimated the post-test probability to be 0.50, suggesting that they may have ignored base rate information, as frequently occurs.18 Some estimates of 50% may have resulted from using an abbreviation of the phrase “fifty-fifty” as an expression of uncertainty rather than an actual numeric estimate17 or from the representativeness heuristic leading “to confusion of post-test probability with test sensitivity”.20 Finally, the open-ended fashion in which we collected probability estimates may have been influential, as the use of an open-ended probability scale has resulted in a “seemingly inappropriate blip” in estimates of 50%.17 21

Several study limitations deserve comment. First, while our results might have differed if the vignette had used a different pre-test probability or a more accurate test, we do believe that our findings are both important and relevant to clinical practice. Second, while response bias may have affected our results, we found no obvious differences in subject characteristics between participants and non-responders. Third, while only half of eligible subjects participated in the study, our response rate was similar to the mean response rates in published surveys of physicians.22 23 Fourth, as this study was not powered to detect differences in our secondary outcomes stratified by subject characteristics, caution should be used when interpreting these results. Finally, participants’ probability estimates and clinical decisions in this hypothetical vignette may not be related to how they actually practise clinical medicine. However, using clinical vignettes to measure quality of care has been validated.24

#### What is already known on this topic

Our study was similar to two Swiss questionnaire based trials of test characteristics conducted among physicians attending evidence based medicine continuing medical education courses.

Both trials found that the presentation of test characteristics to subjects assessing clinical vignettes improved their post-test probability estimates.

#### What this study adds

Conducted in a representative sample of US paediatricians, our randomised controlled trial demonstrates that presenting test characteristics can adversely affect physicians’ post-test probability estimates.

Test characteristics presented with their definitions led to larger post-test probability overestimates than when test characteristics were presented alone or not at all.

Our findings suggest that monitoring the impact of decision support linked to computer provider order entry on clinical care may be warranted.

This randomised controlled trial demonstrates that presenting the sensitivity and specificity of a diagnostic test to a nationally representative group of paediatricians can have two distinct effects: this information adversely affects post-test probability estimates of the majority but improves the estimates of a small minority. While the majority of subjects incorrectly interpreted the test result, we found that presenting test characteristics with their definitions significantly increased post-test probability overestimates, while presenting the sensitivity and specificity alone was no different than presenting no information. In addition, both interventions doubled the frequency of post-test probability estimates of 50%, while simultaneously increasing the proportion of participants who estimated a nearly correct probability. Finally, we documented a logical association between post-test probability estimates and intended clinical management, which reassured us that post-test probability estimation is not just an academic exercise. The association of probability estimation with clinical management plans emphasises the importance of closely monitoring for unexpected effects of presentation of test characteristics during clinical care. As the USA and UK attempt to implement electronic medical records nationally, the effect on clinical care of diagnostic test decision support linked to computer provider order entry should be carefully monitored. Our findings imply that educating physicians about the application of test characteristics to the interpretation of diagnostic test results may be warranted. On the other hand, empowering the computer to calculate the revised post-test probabilities might generate more effective decision support than merely presenting the physician with a test’s sensitivity and specificity. 
As training and clinical practice in paediatrics and other medical specialties do not differ dramatically, similar studies should be carried out with different groups of physicians.
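The suggestion that the computer calculate the revised post-test probability can be sketched as a small function. This is a minimal illustration under stated assumptions, not a description of any existing decision support system: the function name `post_test_probability` and its parameters are hypothetical, and it handles only a dichotomous (positive or negative) test result.

```python
def post_test_probability(pretest, sensitivity, specificity, test_positive):
    """Apply Bayes' theorem to a dichotomous test result.

    All probabilities are on a 0-1 scale.
    """
    for name, value in (
        ("pretest", pretest),
        ("sensitivity", sensitivity),
        ("specificity", specificity),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")

    if test_positive:
        diseased = pretest * sensitivity                 # true positives
        not_diseased = (1 - pretest) * (1 - specificity)  # false positives
    else:
        diseased = pretest * (1 - sensitivity)           # false negatives
        not_diseased = (1 - pretest) * specificity       # true negatives

    total = diseased + not_diseased
    if total == 0:
        raise ValueError("this test result is impossible under the given inputs")
    return diseased / total


# The vignette: pretest probability 0.30, sensitivity 0.50,
# specificity 0.95, negative DFA result.
vignette_ptp = post_test_probability(0.30, 0.50, 0.95, test_positive=False)
print(round(vignette_ptp, 2))  # 0.18
```

Displaying the computed probability alongside the result would spare the clinician the calculation, although whether that improves decisions is exactly the kind of effect that would need monitoring.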

## Acknowledgments

We acknowledge the helpful feedback received on this project during Robert Wood Johnson Clinical Scholar Program work-in-progress seminars. In addition, we appreciate feedback given by reviewers of *Medical Decision Making*, and from Alison Galbraith, MD, MPH, and Hal Sox, MD.

## REFERENCES

## Footnotes

▸ Additional information is published online only at http://adc.bmj.com/content/vol94/issue3

**Funding:** This work was conducted while CMS was a Robert Wood Johnson Clinical Scholar at the University of Washington. The Robert Wood Johnson Generalist Physician Faculty Scholar program supported DAC. The opinions are those of the authors and not the Robert Wood Johnson Foundation (Princeton, New Jersey, USA). This study was supported in part by the Nesholm Family Foundation (Seattle, Washington, USA). The authors’ work was independent of the funding organisations.

**Competing interests:** None.

**Ethics approval:** The University of Washington Institutional Review Board approved this study.

Some of the results of this study were presented at the 2004 annual meetings of the Society for Medical Decision Making and the Pediatric Academic Society, as well as the 2003 Robert Wood Johnson Clinical Scholars Program national meeting.

**Contributions**: All authors participated in the study design and interpretation of the data, and contributed to revision of the manuscript. CMS and DAC came up with the idea for the study. CMS collected and analysed the data, wrote the first draft and finalised the manuscript, and is guarantor.
