
Large observer variation of clinical assessment of dyspnoeic wheezing children
  1. Jolita Bekhof1,
  2. Roelien Reimink1,
  3. Ine-Marije Bartels1,2,
  4. Hendriekje Eggink1,2,
  5. Paul L P Brand1
  1. 1Princess Amalia Children's Clinic, Isala, Zwolle, the Netherlands
  2. 2University Medical Center Groningen, Groningen, the Netherlands
  1. Correspondence to Dr J Bekhof, Princess Amalia Children's Clinic, Isala klinieken, Dr van Heesweg 2, PO Box 10400, Zwolle 8000 GK, the Netherlands; j.bekhof{at}


Background In children with acute dyspnoea, the assessment of severity of dyspnoea and response to treatment is often performed by different professionals, implying that knowledge of the interobserver variation of this clinical assessment is important.

Objective To determine intraobserver and interobserver variation in clinical assessment of children with dyspnoea.

Methods From September 2009 to September 2010, we video-recorded a convenience sample of 27 acutely wheezing children (aged 3 months–7 years) in the emergency department of a general teaching hospital in the Netherlands, before and after treatment with inhaled bronchodilators. These video recordings were independently assessed by nine observers scoring wheeze, prolonged expiratory phase, retractions, nasal flaring and a general assessment of dyspnoea on a Likert scale (0–10). Assessment was repeated after 2 weeks to evaluate intraobserver variation.

Results We analysed 972 observations. Intraobserver reliability was highest for supraclavicular retractions (κ 0.84) and moderate-to-substantial for the other items (κ 0.49–0.65). Interobserver reliability was considerably worse, with κ<0.46 for all items. The smallest detectable change of the dyspnoea score (>3 points) was larger than the minimal important change (<1 point), meaning that in 69% of observations a clinically important change after treatment cannot be distinguished from measurement error.

Conclusions Intraobserver variation is modest, and interobserver variation is large for most clinical findings in children with dyspnoea. The measurement error induced by this variation is too large to distinguish potentially clinically relevant changes in dyspnoea after treatment in two-thirds of observations. The poor interobserver reliability of clinical dyspnoea assessment in children limits its usefulness in clinical practice and research, and highlights the need to use more objective measurements in these patients.

  • Paediatric Practice
  • Respiratory
  • Measurement


What is already known on this topic?

  • In children with acute dyspnoea, assessment of response to treatment and follow-up of severity are often performed by different professionals.

  • This implies that knowledge of the interobserver variation of the clinical assessment of children with dyspnoea is important.

  • The few studies that assessed interobserver variation in the assessment of children with dyspnoea used small numbers of observers and showed substantial variation. No data on intraobserver variation are available.

What this study adds

  • This study shows fair-to-good intraobserver reliability, but poor interobserver reliability of clinical assessment of children with dyspnoea.

  • This large interobserver variation obscures the detection of clinically important improvement and limits the usefulness of clinical dyspnoea assessment in clinical practice and research.


Acute dyspnoea, commonly accompanied by wheeze, is one of the major reasons for emergency room visits and hospitalisations of young children.1,2 Evaluating the severity of dyspnoea in these children is important in clinical decision-making and evaluation of treatment. The severity of dyspnoea is primarily assessed by clinical findings, because pulmonary function tests are not readily available for young children.3

Like any clinical measurement, the usefulness of the clinical assessment of dyspnoea is strongly determined by its reliability.4–6 Since assessment of response to treatment and follow-up of severity of dyspnoea are often performed by different professionals, knowledge of interobserver reliability is important.5 The few reported studies demonstrate substantial interobserver variability for clinical findings in children with dyspnoea (see online supplement 1). The small number of observers in these studies limits their applicability in clinical practice, where the number of healthcare professionals involved in assessing dyspnoea in children may be considerably larger.7–14 Intraobserver variation has never been studied for these clinical findings.15 Although observer variation in dyspnoea assessment may hinder identification of improvement in dyspnoea after (bronchodilator) therapy, the extent to which this occurs has not been studied to date.

The aim of this study was to determine the intraobserver and interobserver variation of common clinical findings in children with acute severe dyspnoea and wheeze and to compare the variability of clinical dyspnoea assessment with its change after bronchodilator therapy.


Study design and setting

We performed an observational study using a crossed design. We consecutively enrolled a convenience sample of children aged 0–8 years presenting to the emergency department with acute dyspnoea and wheeze between September 2009 and September 2010. Patients were selected when parents gave permission to record their child on video and when the research nurse who recorded the videos was available (during office hours on approximately 3 days per week). After undressing, the patients were recorded on digital video before and 15–30 min after inhaling nebulised salbutamol. The videos, including sound, were recorded in a quiet single-bed room in the emergency department or on the paediatric ward, free of disturbing environmental noise, while the child was not crying. Respiratory rate was counted over 1 min by the nurse who assessed the patient. Oxygen saturation and heart rate were measured by pulse oximetry, which was visible on the video.

The study was approved by the hospital's ethical review board (09.0536n), and written informed consent was obtained from the parents.

Assessment of video recordings

All video recordings were assessed independently by five experienced consultant paediatricians and four paediatric nurses (all had at least 5 years of experience as consultant paediatricians or paediatric nurses). All assessments were repeated in random order by the same observers after an interval of at least 2 weeks. This means that all videos were presented in a random order; pretreatment and post-treatment videos of the same patient were not presented together. Thus, observers did not know whether the video was recorded before or after bronchodilator treatment. Each observer recorded the presence of the following clinical signs on a structured form: wheeze, prolonged expiratory phase, subcostal, intercostal, jugular (ie, suprasternal) and supraclavicular retractions, nasal flaring and mental status. Observers were also requested to give an overall assessment of the degree of dyspnoea on a Likert scale ranging from 0 (no dyspnoea) to 10 (very severe dyspnoea), which we will call the dyspnoea score.

The assessments were performed in group sessions, scheduled in the afternoon, each lasting a maximum of 1 h to avoid diminished concentration and attention.

Terminology and definitions

To avoid confusion in the terminology of clinical measurement analysis, two terms should be explained: measurement error and reliability. Repeated measurements show variation, such as variation within and between assessors and spontaneous variation within patients. The standard error of measurement (SEM) represents the magnitude of this measurement error.6 The SEM is calculated as SEM=SD×√(1−reliability), and it is interpreted as the SE around a single measurement. This is in contrast to the commonly used standard error of the mean, which estimates the dispersion of sampling error when a population mean is estimated from a sample mean and is calculated as SE=SD/√sample size. Reliability is the degree to which the measurement is free from measurement error.16 Intraobserver reliability refers to the variation within one observer, and interobserver reliability to the variation between observers. κ and the intraclass correlation coefficient (ICC) are reliability parameters, both ranging from 0 (totally unreliable) to 1 (perfect reliability). κ can be calculated for dichotomous measures and the ICC for continuous measures.
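The distinction between the two quantities can be illustrated with a short Python sketch (our illustration, not part of the study; the SD, reliability and sample size below are invented numbers):

```python
import math

def sem_measurement(sd, reliability):
    """Standard error of measurement: the spread of repeated
    measurements around a single patient's score."""
    return sd * math.sqrt(1 - reliability)

def se_mean(sd, n):
    """Standard error of the mean: the precision of a sample mean
    as an estimate of the population mean."""
    return sd / math.sqrt(n)

# Invented example: scores with SD 2.0
print(round(sem_measurement(2.0, 0.36), 2))  # SEM for reliability 0.36 -> 1.6
print(round(se_mean(2.0, 25), 2))            # SE of the mean for n=25 -> 0.4
```

Note how the SEM shrinks only as reliability improves, whereas the SE of the mean shrinks with a larger sample: two different questions answered by two different formulas.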

In addition to reliability, clinical measurements of dyspnoea should also be examined on their ability to identify change in the degree of dyspnoea in the individual patient, either spontaneously or as a result of treatment. This is expressed as the smallest detectable change (SDC), the smallest within-person change which can be interpreted as real change above measurement error. The SDC should be compared with the minimal important change (MIC),5 the smallest change in the measurement which the clinician or patient perceives as important.6 When the MIC exceeds the SDC, the measurement has good clinical value, because clinically relevant changes can be distinguished from measurement error.6 Conversely, if the MIC is smaller than the SDC, the clinical usefulness of the measurement is limited because changes larger than the MIC but smaller than the SDC cannot be distinguished from measurement error.

Statistical analysis

Quantifying the measurement error in the units of measurement is only possible for continuous variables. We calculated the SEM due to variation within observers using the formula SEM=SDdifference/√2.6 The SEM due to variation between observers was calculated from the pooled SD of the mean scores of the different observers using the formula SEM=SDpooled×√(1−ICCagreement), where SDpooled=√((SD²observer1+SD²observer2+…)/n).6
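These two SEM formulas can be sketched in Python (rather than the SPSS used in the study; the per-observer SDs and ICC below are invented for illustration):

```python
import math

def sem_within(sd_difference):
    """SEM from within-observer variation: SD of the
    test-retest difference scores divided by sqrt(2)."""
    return sd_difference / math.sqrt(2)

def pooled_sd(sds):
    """Pooled SD across observers: root mean square of the
    per-observer SDs."""
    return math.sqrt(sum(sd ** 2 for sd in sds) / len(sds))

def sem_between(sds, icc_agreement):
    """SEM from between-observer variation: pooled SD scaled
    by sqrt(1 - ICC_agreement)."""
    return pooled_sd(sds) * math.sqrt(1 - icc_agreement)

# Invented per-observer SDs of the mean dyspnoea score:
print(round(sem_within(1.4), 2))
print(round(sem_between([1.8, 2.2, 2.0], 0.5), 2))
```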

For the dichotomous clinical findings, we calculated Cohen's κ for intraobserver reliability,17,18 Light's multirater κ for interobserver reliability19 and the percentage agreement. Values of κ were categorised as follows: <0, poor; 0–0.2, slight; 0.2–0.4, fair; 0.4–0.6, moderate; 0.6–0.8, substantial; and 0.8–1.0, almost perfect agreement.20
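For two raters, Cohen's κ compares the observed agreement with the agreement expected by chance from each rater's marginal frequencies. A minimal sketch (the wheeze ratings below are invented, not study data):

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Cohen's kappa for two raters scoring the same binary items."""
    n = len(rater1)
    # Observed proportion of agreement
    p_obs = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Chance agreement from each rater's marginal frequencies
    c1, c2 = Counter(rater1), Counter(rater2)
    p_exp = sum(c1[k] * c2[k] for k in c1) / n ** 2
    return (p_obs - p_exp) / (1 - p_exp)

# Invented wheeze ratings (1 = present) for 10 videos:
r1 = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
r2 = [1, 0, 0, 1, 0, 1, 1, 1, 0, 1]
print(round(cohens_kappa(r1, r2), 2))  # 0.58: 'moderate' agreement
```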

For the continuous variable (dyspnoea score), we calculated the ICC in SPSS, using a two-way mixed model with absolute agreement and single-measure calculations. In general, an ICC >0.7 is considered adequate.6

We used the visual anchor-based MIC distribution to calculate the MIC of the dyspnoea score in our study population.6 This approach uses an external criterion, or ‘anchor’, to determine what patients or their clinicians consider important improvement. We used two anchors: (1) the clinical judgement of the consultant paediatrician who had assessed the patient in the emergency department and (2) the difference in respiratory rate—assessed by the nurse in the emergency department—before and after bronchodilator.21 The SDC was calculated by multiplying the SD of the change in the dyspnoea score after treatment in the stable group of patients (defined by the anchor as ‘not importantly changed’ after treatment with bronchodilators) by a factor of 1.96.6
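Under these definitions, the SDC reduces to a single multiplication; a sketch with invented inputs (the SD of 1.63 and the MIC of 1.0 are illustrative values, not the study's data):

```python
def smallest_detectable_change(sd_change_stable):
    """SDC at the 95% level: 1.96 times the SD of the change score
    in patients the anchor classified as 'not importantly changed'."""
    return 1.96 * sd_change_stable

# Invented inputs for illustration:
mic = 1.0                               # anchor-based minimal important change
sdc = smallest_detectable_change(1.63)  # about 3.19
# If SDC > MIC, changes between the two thresholds cannot be
# told apart from measurement error:
print(sdc > mic)  # True
```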

Detailed explanation and calculation of the MIC and SDC are given in the online supplement 2.

All analyses were performed in SPSS, V.20.0.


We included 27 patients and video-recorded each twice, before and after bronchodilator therapy. Each of these 54 recordings was assessed by nine observers on two occasions, resulting in a total of 972 assessments. Characteristics of included patients are listed in table 1. Overall, patients had mild-to-moderate dyspnoea. None of the patients required intensive care or mechanical ventilation.

Table 1

Patient characteristics (n=27)

Table 2 shows that intraobserver reliability of the dyspnoea score was adequate, but interobserver reliability was much worse, as depicted by a low ICC.

Table 2

Intraobserver and interobserver variation of continuous measures of dyspnoea in children (dyspnoea score (0–10))

Table 3 shows the reliability of the different binary clinical findings. Reliability within observers was moderate to almost perfect, and reliability between observers was slight to moderate.

Table 3

Intraobserver and interobserver variations of categorical measures of dyspnoea in children

The interobserver and intraobserver variability was comparable for nurses and paediatricians (see table 1 and online supplement 3).

The MIC and SDC for the dyspnoea score were 1 (0.5) and 3 (3.2), respectively, for both anchors (see online supplement 2), meaning that the SDC was considerably larger than the MIC, the consequences of which are expressed in figure 1. In only 5.8% of observations in our study, the change in dyspnoea score was both statistically significant and clinically relevant.

Figure 1

Interpretation of change in dyspnoea score after treatment, explaining the relevance of the minimal important change (MIC) being smaller than the smallest detectable change (SDC).


This study is unique because it examines both within-observer and between-observer variation among more than two observers of clinical findings in children with dyspnoea which are used in all published composite dyspnoea severity scoring systems. The results of our study show moderate-to-good intraobserver reliability and poor interobserver reliability of the clinical assessment of dyspnoea. Subcostal retractions and wheeze showed the best interobserver agreement and mental state the least; the other signs were broadly comparable.

Because of this variation within and between observers, a clinically important change after treatment could not be distinguished from measurement error in 69.4% of observations in our study, obscuring the detection of clinically important improvement in dyspnoea after treatment.

Our findings imply that, in clinical practice, assessment of the severity of dyspnoea in children is not interchangeable between professionals. The results of our study therefore argue for great caution in interpreting the effect of a trial of treatment with a bronchodilator, as recommended in clinical guidelines for young children with acute severe wheeze,22 in particular when the assessment of dyspnoea before and after bronchodilator is performed by different observers. However, even when the same professional assesses the degree of dyspnoea before and after bronchodilator, the considerable intraobserver variation (table 3) should be taken into account. If clinical dyspnoea scoring is used in clinical trials in young children, the number of different observers should be reported and discussed, because of the large variation between observers (table 3). Our results suggest that clinical dyspnoea scoring systems require further validation, including assessment of variation between and within observers. We postulate that the use of more objective parameters, such as oxygen saturation and lung function assessments with acceptably small measurement error, will provide less variable and thus more reliable assessments of dyspnoea in children. Furthermore, it remains important to assess children with dyspnoea together with the colleague who will take over the care of the patient in the next shift.

Strengths and limitations

The major strengths of our study include the measurement of intraobserver and interobserver reliability, the use of a large group of observers in a crossed design and the assessment of the clinical impact of reliability of these clinical signs of dyspnoea by computing measurement error.

We acknowledge the following weaknesses of our study. The use of video recordings has limitations. The recordings were relatively short (2–3 min), which may have led to less accurate ratings or missed observations and may have decreased the likelihood of detecting subtle signs on physical examination. For our study purposes, however, video recordings were considered the only feasible method. The lack of chest auscultation could also be viewed as a weakness; however, previous studies have shown a poor association between wheeze severity on auscultation and the degree of airway obstruction and hypoxaemia.4,23 Furthermore, leaving out auscultation in the assessment of dyspnoea severity in children reflects clinical practice, where many assessments are made by healthcare professionals who have not been trained in chest auscultation.

We examined a limited number of patients for feasibility reasons, to avoid observer fatigue and boredom while assessing the videos. One could argue that the number of observers is also relatively small, although it is considerably larger than the two to four observers7,15 used in previous studies (see online supplement 1). The reliability of the dyspnoea score and the individual items might have been greater had we included more children with (very) severe dyspnoea. It would also have been interesting to evaluate the relation between observer variation and the severity of dyspnoea, but our study group was too small to compare observer variation between such subgroups. Additionally, in our general practice, children with very severe dyspnoea needing mechanical ventilation comprise only a small minority (1%–2%) of all children with dyspnoea presenting to our clinic. Thus, a clinical dyspnoea scoring system is potentially most useful in mild-to-moderate dyspnoea, which is what our study population represents.

Furthermore, one could hypothesise that the age of patients might influence observer variation. The median age of patients in our sample was 19 months, but the sample also included a patient aged 7 years. Our sample size was too small to divide the patients into different age categories, as most (>90%) patients were <4 years old. On the other hand, acute wheeze occurs most commonly in this preschool age group. We therefore feel that our sample is sufficiently representative for this purpose.

Future perspectives

Variation between observers may be reduced by formal standardisation of the assessment and by training. Only a few examples are available in the literature; however, all point towards a positive effect of training and/or standardisation.24–26 In clinical practice, it is uncommon to (re)train basic skills after graduation, apart from newly developed tools or resuscitation skills. This study may help increase awareness that evaluating (and perhaps training) commonly used day-to-day skills is valuable. Further studies are needed to assess whether, and with what kind of training, professionals can reduce the amount of variation we observed in this study.

Another aspect that might be worth taking into account in future studies aiming to improve the assessment of dyspnoea in children is parental judgement. In this study, we asked both the treating doctor and the parent to rate the effect of treatment with bronchodilators. In 67% (18/27) of patients there was full agreement between parents and doctors. In the nine patients without agreement, disagreement occurred in both directions (in four patients doctors rated improvement while parents rated no change or slight improvement, and in five patients doctors rated no effect while parents rated improvement). This may suggest that parents take other aspects into consideration than medical professionals do, which may be of importance when refining the way we assess children with dyspnoea.

Finally, we feel that it might be useful to further evaluate the assessment of 'mental state'. Mental state, sometimes described as 'general condition' or 'cerebral function', is included in many dyspnoea scores.6 In the present study, the prevalence of an affected mental state was much lower (4.4%) than the prevalence of the other clinical findings (21.2%–68.0%), biasing the κ value.18 This low prevalence may explain the low κ for mental state despite the high percentage agreement. Mental state in our study was rated as 'affected' when the observer assessed it either as 'hyperalert or anxious' or as 'decreased consciousness'. Possibly, a more fine-grained description of mental state would improve the precision of dyspnoea assessment, especially when evaluating responsiveness over time, which may manifest as only small differences that are difficult to describe. The assessment of mental state would typically be an item where involving the parents might improve accuracy and utility.
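The prevalence effect described above can be illustrated numerically: with identical percentage agreement, κ drops sharply when one category is rare. A sketch with invented ratings (reusing a basic two-rater κ; not the study data):

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Cohen's kappa for two raters scoring the same binary items."""
    n = len(rater1)
    p_obs = sum(a == b for a, b in zip(rater1, rater2)) / n
    c1, c2 = Counter(rater1), Counter(rater2)
    p_exp = sum(c1[k] * c2[k] for k in c1) / n ** 2
    return (p_obs - p_exp) / (1 - p_exp)

# Both scenarios have 96% raw agreement on 100 items.
# Balanced prevalence (about 50% rated 'affected'):
bal1 = [1] * 50 + [0] * 50
bal2 = [1] * 48 + [0] * 2 + [0] * 48 + [1] * 2
# Rare finding (about 4% rated 'affected'):
rare1 = [1] * 4 + [0] * 96
rare2 = [1] * 2 + [0] * 2 + [1] * 2 + [0] * 94

print(round(cohens_kappa(bal1, bal2), 2))    # 0.92
print(round(cohens_kappa(rare1, rare2), 2))  # 0.48
```

The raw agreement is 96% in both scenarios, yet κ falls from almost perfect to moderate simply because chance agreement is much higher when nearly all ratings fall in one category.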


The measurement error induced by interobserver variability of clinical signs of dyspnoea in children is considerable and cannot be distinguished from a possibly relevant effect of therapy in two-thirds of patients. The poor interobserver reliability of clinical dyspnoea assessment in children limits its usefulness in clinical practice and research and highlights the need to use more objective measurements in these patients.


Supplementary materials

  • Supplementary Data



  • Contributors JB: designed the study, supervised data collection and analysis, and drafted the article; she is guarantor for the study. RR: participated in the design of the study, the acquisition of patients and data collection. I-MB: responsible for the acquisition of patients, data collection and data management. HE: helped with data management and data analysis. PLPB: critically reviewed the study design, supervised data analysis and was involved in the interpretation of the data. All authors contributed to, revised and commented on the various drafts of the article, and read and approved the final draft.

  • Competing interests None.

  • Patient consent Obtained.

  • Ethics approval Medical Ethics Committee Zwolle.

  • Provenance and peer review Not commissioned; externally peer reviewed.
