- ICC, intraclass correlation
- ICU, intensive care unit
- PICU, paediatric intensive care unit
- PIM, Paediatric Index of Mortality
- PRISM, Paediatric Risk of Mortality
- clinical scoring systems
- inter-observer variability
- outcome assessment
- multicentre study
- risk adjustment systems
Scoring systems such as the Paediatric Risk of Mortality (PRISM) score and Paediatric Index of Mortality (PIM) are widely used in paediatric intensive care. These are third generation scoring systems that allow assessment of the severity of illness and mortality risk adjustment in heterogeneous groups of patients in an objective manner, enabling conversion of these numbers into a numerical mortality risk based on logistic regression analysis. The purpose of their usage varies, and may include comparison of severity of illness between different treatment arms in clinical trials and comparison of quality of care between paediatric intensive care units (PICUs) using standardised (that is, severity of illness adjusted) mortality rates. Both the PRISM and PIM scoring systems have been developed and carefully validated in tertiary PICUs.1–3 In some centres that were closely involved in developing these scoring systems, preliminary data have indicated that the degree of inter-observer reliability was acceptable.4,5 As was the case in these centres, patients are preferably scored by a small number of dedicated, thoroughly trained professionals (for example, research nurses). In principle, this form of organisation can be expected to have a high inter-observer agreement. However, the practical situation in numerous ICUs and PICUs throughout Europe is that severity scoring is performed by a varying number of residents, fellows, (paediatric) intensivists, paediatricians, or nurses, with varying degrees of PICU experience and training with the PRISM and PIM score. This may imply that these scoring systems will not achieve the same degree of accuracy and reliability in everyday clinical practice.
We previously showed significant degrees of inter-observer and intra-observer variability in the use of the APACHE II scoring system, both in the research setting and in everyday clinical practice.7–11 To our knowledge no studies have been carried out to systematically address this issue in paediatric ICUs. We therefore decided to assess the accuracy and reliability of PRISM and PIM scoring in everyday clinical practice, stratified by level of experience.
Physicians from eight academic paediatric ICUs (tertiary referral centres) with residency and fellowship training programmes were asked to participate in our study. Physicians were divided into three categories: residents (n = 9) with limited experience in paediatric intensive care (average 3 months, range 6 weeks–6 months); PICU fellows (n = 6, average experience 18 months, range 6–30 months), and paediatric intensivists (n = 12) with at least three years of full time PICU experience. For the Dutch situation this means that all major paediatric centres were represented, and that around 50% of all paediatric intensivists and PICU fellows participated in our study.
The charts of 10 patients who had been admitted to a single PICU in the course of a one year period were selected for scoring. The charts were selected to reflect typical PICU patients, and were not chosen for difficulty of scoring. Relevant data from the medical charts and copies of blank data collection sheets from the PRISM and PIM scores were provided to all participating physicians. No specific additional instructions were given to those who scored the patients, and no time limit was given in order to mimic the routine working situation as closely as possible. All physicians assessed PRISM and PIM scores, which were noted on data collection sheets similar to those used for routine assessment. Mean (SD) and range of the PRISM and PIM scores were calculated for each individual patient for the overall group of physicians and for each of the three categories of physicians, according to methods described previously.8 Weighted kappa scores and intraclass correlations (ICC) were also determined to assess inter-observer agreement. A deviation of ±1 point (PRISM score) or ±1 item (PIM score) was designated as mild disagreement, ±2 or 3 points (items) as moderate disagreement, and >3 points (items) as severe disagreement. Statistical analysis was performed using Student’s t test for unpaired variables for paired groups and by analysis of variance (ANOVA). Statistical significance was accepted for p < 0.05. Excel (Microsoft Inc.) and SPSS 9.0 (SPSS Inc.) software was used to perform the necessary calculations.
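The disagreement categories defined above can be illustrated with a minimal Python sketch. It assumes integer scores and a single reference score per patient against which each observer is compared; the function names, the reference value, and the example scores are hypothetical and are not study data.

```python
from collections import Counter


def classify_disagreement(score, reference):
    """Grade the deviation of one observer's score from a reference score."""
    deviation = abs(score - reference)
    if deviation == 0:
        return "exact"
    if deviation == 1:
        return "mild"      # +/-1 point (or item)
    if deviation <= 3:
        return "moderate"  # +/-2 or 3 points (or items)
    return "severe"        # more than 3 points (or items)


def summarise(scores, reference):
    """Count how many observers fall into each disagreement category."""
    return Counter(classify_disagreement(s, reference) for s in scores)


# Hypothetical scores from six observers for one patient (not study data).
observer_scores = [12, 12, 13, 10, 17, 12]
reference_score = 12  # assumed reference value, for illustration only
summary = summarise(observer_scores, reference_score)
print(dict(summary))
```

The percentage of exact agreement for a patient would then be the "exact" count divided by the number of observers.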
Tables 1–3 present the results. Table 1 depicts overall inter-observer agreement as the percentage of exact agreement, kappa scores, and ICC for the PRISM and PIM based mortality risks, respectively. Tables 2 and 3 show the inter-observer agreement for the PRISM and PIM score derived mortality risk.
As is evident from the tables, a wide variation in both PRISM and PIM based mortality risk assessment was found, regardless of the level of expertise. Indeed, the differences between the three groups were small, with statistically significant differences observed only between residents and intensivists for the PRISM based mortality risk (p < 0.01), and for intensivists versus fellows and residents versus fellows for the PIM based mortality risk (p < 0.05).
On analysis, the most frequent problems and sources of error in determining PRISM scores were:
- The PaO2/FiO2 determination. An arterial PaO2 and a measured FiO2 are mandatory, so it is only possible to include this measurement if oxygen is given through an endotracheal tube or by a blender, or when FiO2 is measured by a “head” or “Gardner” box. In the absence of an arterial PaO2, the score should be missing.
- Assessing the Glasgow Coma Score (GCS) when sedatives have been given. It is only permitted to assess GCS before administration of sedatives or after their effect has worn off.
- In some observers there were difficulties in determining the appropriate time window for PRISM calculation. This should include the worst values in the first 24 hours after PICU admission; patients admitted for less than 2 hours should be excluded from scoring.1–3
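The PRISM time window rule described above can be sketched as follows. This is an illustration only: it assumes that measurements are stored as (hours since PICU admission, value) pairs and that the caller states which direction counts as "worst" for the variable concerned. The function names and example values are hypothetical.

```python
def eligible_for_prism(length_of_stay_hours):
    """Patients admitted for less than 2 hours are excluded from scoring."""
    return length_of_stay_hours >= 2.0


def worst_value(measurements, worst, window_hours=24.0):
    """Return the worst value in the first 24 hours after PICU admission.

    measurements: list of (hours_since_picu_admission, value) pairs.
    worst: function selecting the worst value, e.g. min for a variable
           that is abnormal when low, max for one that is abnormal when high.
    Returns None when nothing was measured inside the window.
    """
    in_window = [v for t, v in measurements if 0.0 <= t <= window_hours]
    return worst(in_window) if in_window else None


# Hypothetical systolic blood pressures (mm Hg); the 30 hour value is ignored.
systolic = [(1.0, 85), (6.0, 70), (30.0, 55)]
print(worst_value(systolic, worst=min))
```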
The most frequent problems in determining PIM scores were:
- PIM scores should be determined by taking the first value (instead of inappropriately using the worst value) during the first hour after starting intensive care treatment. This treatment may start outside the PICU, for example, in the emergency room, by a retrieval service or a trauma team, so that (in contrast to PRISM scores) in some cases values obtained before PICU admission can be used to assess PIM scores.
- Misinterpretation of underlying conditions. For example, cardiac failure during sepsis was inappropriately scored as cardiomyopathy, and muscular dystrophy was inappropriately counted as a neurodegenerative illness.
- Use of an FiO2 value measured at a different moment than PaO2.
- Adequate definition of a booked admission to the ICU: this should be a pre-arranged admission, for example after elective surgery.
- Pupils can only be scored as fixed when the diameter of both pupils is >3 mm and when there is no reaction to light. In addition, the possibility of a medication effect or a peripheral cause of absent or abnormal pupil reactions as a direct result of injury to the eyes should be taken into account. This was not always correctly interpreted.
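Two of the PIM rules above lend themselves to a short sketch. The code below assumes that measurements are stored as (hours since the start of intensive care treatment, value) pairs, so that time zero may precede PICU admission. It also assumes that pupils with a suspected medication effect or local eye injury are simply not scored as fixed; the text only says that these possibilities should be taken into account. All names and example values are hypothetical.

```python
def first_value(measurements, window_hours=1.0):
    """Return the first value recorded in the first hour of intensive care.

    measurements: list of (hours_since_start_of_intensive_care, value) pairs.
    Returns None when nothing was recorded inside the window.
    """
    in_window = sorted((t, v) for t, v in measurements if 0.0 <= t <= window_hours)
    return in_window[0][1] if in_window else None


def pupils_fixed(left_mm, right_mm, left_reacts, right_reacts, confounded=False):
    """Score pupils as fixed only if both are >3 mm and neither reacts to light.

    confounded: True when a medication effect or eye injury may explain the
    finding (treating this as "not fixed" is an assumption of this sketch).
    """
    if confounded:
        return False
    return left_mm > 3 and right_mm > 3 and not left_reacts and not right_reacts


# Hypothetical systolic blood pressures (mm Hg): the first value is used,
# not the worst one, and the 3 hour value lies outside the window.
systolic = [(0.8, 60), (0.2, 95), (3.0, 50)]
print(first_value(systolic))
```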
The results of our study show that substantial inter-observer variability exists in both PRISM and PIM scoring. We had expected to find some variability on the basis of our experience in previous studies dealing with the use of the APACHE II score.7–11 In these studies inter-observer variability had ranged from 15% to 30%. However, we were quite surprised by the degree of inconsistency in the use of the PIM and PRISM scores. For the PRISM score the average ICC was 0.51 (range 0.32–0.78) and the average kappa score 0.6 (range 0.28–0.87); for the PIM score the average ICC was 0.18 (range 0.08–0.46) and the average kappa score 0.53 (range 0.32–0.88). Inter-observer agreement tended to be better in less severely ill patients. In this category, the a priori chance of disagreement between observers is lower because there are fewer abnormal variables.
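For readers less familiar with the kappa statistic, the following minimal Python sketch computes a weighted kappa for two observers. It is a simplification of the analysis reported here: it handles only two raters, it assumes ratings are integer ordinal categories, and it uses linear disagreement weights, a weighting scheme that is an assumption of this sketch. The example ratings are hypothetical.

```python
from collections import Counter


def weighted_kappa(rater_a, rater_b):
    """Linearly weighted kappa for two raters using integer ordinal categories.

    kappa_w = 1 - (observed weighted disagreement / expected weighted
    disagreement), with disagreement weight |i - j| between categories i and j.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed mean disagreement across the rated patients.
    observed = sum(abs(a - b) for a, b in zip(rater_a, rater_b)) / n
    # Disagreement expected by chance, from each rater's marginal counts.
    marg_a, marg_b = Counter(rater_a), Counter(rater_b)
    expected = sum(
        abs(i - j) * marg_a[i] * marg_b[j] for i in marg_a for j in marg_b
    ) / (n * n)
    if expected == 0:
        return 1.0  # both raters used one and the same category throughout
    return 1.0 - observed / expected


# Hypothetical severity bands assigned by two observers to six patients.
observer_1 = [0, 1, 2, 3, 2, 1]
observer_2 = [0, 2, 2, 3, 1, 1]
print(round(weighted_kappa(observer_1, observer_2), 2))
```

A value of 1 indicates perfect agreement and a value of 0 indicates agreement no better than chance.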
Our results contrast with a report by Pollack and colleagues,6 who studied the effects of variation in frequency of variable measurements on PRISM scores and found these effects to be relatively small. This study did not deal with the issue of inter-observer reliability of the measured variables themselves, which might subsequently affect the scores. Thus the degree of variability may even have been underestimated in our study. In addition, errors may have occurred in different areas of assessment which subsequently “cancelled each other out”; for example, if one observer erroneously attributed points for chronic diseases while another erroneously attributed points for low blood pressure, these two observers will agree on the total number of points and predicted mortality; seemingly they are in concurrence, while in reality their agreement is poor.
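The "cancelling out" of errors described above can be made concrete with a small Python sketch that compares two observers both on their total score and on the individual items. The item names follow the example in the text, and the point values are hypothetical.

```python
def compare_observers(items_a, items_b):
    """Compare two observers on total score and on individual items.

    items_a, items_b: dicts mapping item name to points attributed.
    Returns (totals_agree, sorted list of items on which they disagree).
    """
    totals_agree = sum(items_a.values()) == sum(items_b.values())
    all_items = set(items_a) | set(items_b)
    discordant = sorted(
        item for item in all_items if items_a.get(item, 0) != items_b.get(item, 0)
    )
    return totals_agree, discordant


# Hypothetical point values: each observer makes a different error.
observer_a = {"chronic disease": 4, "low blood pressure": 0, "heart rate": 2}
observer_b = {"chronic disease": 0, "low blood pressure": 4, "heart rate": 2}
totals_agree, discordant = compare_observers(observer_a, observer_b)
print(totals_agree, discordant)
```

Here the totals are identical although the observers disagree on two of the three items, which is why agreement measured on totals alone may overstate true agreement.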
Inter-observer variability in PIM scoring appeared higher than with PRISM scoring, despite the fact that the PIM score comprises fewer variables. There are several potential explanations for this observation. Firstly, all variables of the PIM score are entered directly into the logit equation from which the mortality risks are calculated, meaning that differences in the final risk are directly proportionate to differences in the underlying variable. In addition some variables of the PIM score carry a relatively high weight, hence an error in one of these variables will lead to a relatively large change in the overall score and in predicted mortality (for example, an erroneous “specified diagnosis”). In PRISM score based mortality risk calculation, all underlying variables of acute physiology are added to a single summary score, which does not reflect that more points that are attributed to some underlying variables might have been cancelled out by fewer points that were attributed to other variables. Only this total number of points (the PRISM score) is entered in the regression equation (with age and operative status) to calculate mortality risk, and consequently underlying variation in PRISM score variables might be obscured.
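The contrast between the two approaches can be sketched in Python. The coefficients below are invented for illustration and are not the published PRISM or PIM coefficients; the summary score model also omits age and operative status for brevity. The sketch shows only the mechanism: item errors that cancel in a summed score leave the risk unchanged, whereas an error in a heavily weighted variable that enters the equation directly changes the risk substantially.

```python
import math


def risk_from_logit(logit):
    """Convert a logit into a probability with the logistic function."""
    return 1.0 / (1.0 + math.exp(-logit))


def summary_score_risk(item_points, slope, intercept):
    """PRISM-like approach: items are summed and only the total is entered."""
    return risk_from_logit(intercept + slope * sum(item_points))


def direct_entry_risk(values, coefficients, intercept):
    """PIM-like approach: each variable enters with its own coefficient."""
    logit = intercept + sum(coefficients[k] * values[k] for k in coefficients)
    return risk_from_logit(logit)


# Hypothetical coefficients, chosen only to illustrate the mechanism.
SLOPE, INTERCEPT = 0.25, -5.0
COEFFICIENTS = {"specified_diagnosis": 1.5, "base_excess": 0.1}

# Two observers attribute points to different items; the totals are equal.
risk_sum_a = summary_score_risk([4, 0, 2], SLOPE, INTERCEPT)
risk_sum_b = summary_score_risk([0, 4, 2], SLOPE, INTERCEPT)

# Two observers differ only on one heavily weighted binary variable.
risk_direct_a = direct_entry_risk(
    {"specified_diagnosis": 0, "base_excess": 5}, COEFFICIENTS, INTERCEPT
)
risk_direct_b = direct_entry_risk(
    {"specified_diagnosis": 1, "base_excess": 5}, COEFFICIENTS, INTERCEPT
)
print(risk_sum_a == risk_sum_b, risk_direct_b / risk_direct_a)
```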
Poor agreement in scoring was found in intensivists, PICU fellows, and residents. The differences between the groups were unexpectedly small, with experienced physicians performing only marginally better than inexperienced ones. In this respect it should be realised that the number of years of experience in the PICU and/or the age of the physicians does not necessarily imply dedicated training and regular checking of the appropriateness of the application of scoring systems.
Our findings have significant implications. PRISM and PIM scores are widely used in clinical trials to compare severity of illness, and for benchmarking and comparison of quality of care between different PICUs. Our observations show that inter-observer variability in both scores is of a magnitude that makes it questionable whether they can, and should, be used for these purposes when collected in this way. The results of scoring system assessments, especially when used in research protocols and as benchmarks for quality of care, should be unequivocal; this applies especially when they are used not just by health care providers but also by hospital managers, insurance companies, and other financial decision makers, as well as by patient interest organisations. At the very least a rigorous training programme is needed; ideally, patients should probably be scored by a small number of dedicated, well trained, and regularly audited staff members.
Our study has a number of potential limitations. Patients were scored mostly by physicians who were not specifically trained for scoring, and who worked in different hospitals. The patient information was presented in the form of copies of all relevant parts of the medical and nursing charts from one medical centre; the lack of familiarity with charts from another hospital might have caused some bias against physicians working in other centres with different charts. The situation in daily clinical practice in the participating centres was as follows: in five participating tertiary PICUs, patients are normally scored by intensivists and fellows only; in two centres, residents and nurses also participate in scoring; and in one centre two dedicated students do the scoring. Training and auditing are not formally regulated in any of the eight centres. Although we have not performed a formal study to assess the way severity of illness scores are collected in most European PICUs, informal inquiries indicate that circumstances similar to those in the Netherlands exist in many other European countries. Therefore, it is highly likely that our results reflect the average situation in Europe, although further studies will be required to fully address this issue.
What is already known on this topic
Severity of illness scoring is used for various purposes in (paediatric) intensive care
Reliability has been assessed only in centres closely involved in their development
For the APACHE score, used in adult ICUs, substantial inter-observer variation in routine assessment has been reported
The situation may be very different in centres where scores are assessed exclusively by a restricted number of dedicated staff members (Frank Shann, personal communication). This method of score collection can be expected to have less inter-observer variability, although we are not aware of published studies in which this issue has been addressed. Scores are collected in this way in many centres in the United States and Australia, and in some centres in Europe. If this method is indeed superior, the problem of score collection could in theory be solved by allowing two or three specially trained and frequently monitored staff to score patients. However, this should first be shown in future studies.
Thus, although limiting score collection to a few dedicated individuals should probably be our long term goal, at this moment we urgently need to improve the performance of those currently assessing PIM and PRISM scores. The most frequent problems in scoring should not be too difficult to resolve by adequate training and guidelines. The points to address include using the appropriate time windows, using the appropriate values (worst in the first 24 hours and first value in the first hour, respectively), appropriate assessment of GCS and pupils during use of sedatives, avoiding misinterpretation of underlying conditions, and correct interpretation of the use of oxygen. Methods through which such a training programme and strict guidelines can be implemented have been described previously, in studies where a marked decrease in inter-observer variability was achieved after a training programme was implemented.9 With such a programme, severity of illness scoring could still be performed by a relatively large and varying number of (rotating) physicians, or by a small number of well trained and dedicated (for example, nursing) staff members. The results of our study suggest that the latter option may be preferable, but it remains to be determined whether this can be achieved in all PICUs that use these scoring systems.
In conclusion, we observed wide inter-observer variability in assessment of PRISM and PIM scores in routine clinical practice. These scoring systems are widely used in clinical trials and as instruments to assess and compare quality of care between different PICUs. In our opinion, based on these results and in the current situation, the results of these score assessments should be interpreted with great caution. Reliability of the scores may be higher if scoring is performed by a limited number of dedicated individuals, and may be improved by strict guidelines and regular training. Further studies are required to address these issues.
What this study adds
In this multicentre study, wide inter-observer variation in PRISM and PIM based mortality risk was found in routine clinical practice
Inter-observer variability could not be explained by the physicians’ levels of qualification
Education, training, and guidelines for score assessment are needed, and perhaps severity of illness scoring in PICUs should be performed only by a limited number of well trained professionals
The authors thank all participating physicians from the Dutch PICUs (Academic Medical Centre Amsterdam; University Medical Centre, Utrecht; Leiden University Medical Centre; Academic Hospital Groningen; Academic Hospital Maastricht; Erasmus University Medical Centre, Rotterdam; VU University Medical Centre) without whose most appreciated contributions this study would not have been possible. We also thank Dr E de Lange, Department of Statistics and Epidemiology, VU University Medical Center, for help with the statistical analysis, and Prof. AJ van Vught, Wilhelmina Children’s Hospital and University Medical Center Utrecht, for critically reading the manuscript.
Competing interests: none declared