Objective To report the evidence for and challenges to the validity of the Sheffield Peer Review Assessment Tool (SPRAT) with paediatric Specialist Registrars (SpRs) across the UK as part of the Royal College of Paediatrics and Child Health workplace-based assessment programme.
Design Quality assurance analysis, including generalisability, of a multisource feedback questionnaire study.
Setting All UK Deaneries between August 2005 and May 2006.
Participants 577 year 2 and year 4 paediatric SpRs.
Interventions Trainees were evaluated using SPRAT sent to clinical colleagues of their choosing. Data were analysed reporting totals, means and SD, and year groups were compared using independent t tests. A factor analysis was undertaken. Reliability was estimated using generalisability theory. Trainee and assessor demographic details were explored to try to explain variability in scores.
Main outcome measures 4770 SPRAT assessments were provided about 577 paediatric SpRs. The mean scores between years were significantly different (Year 2 mean=5.08, SD=0.34, Year 4 mean=5.18, SD=0.34). A factor analysis returned a two-factor solution, clinical care and psychosocial skills. The 95% CI showed that trainees scoring ≥4.3 with nine assessors can be seen as achieving satisfactory performance with statistical confidence. Consultants marked trainees significantly lower (t=−4.52) whereas Senior House Officers and Foundation doctors scored their SpRs significantly higher (SHO t=2.06, Foundation t=2.77).
Conclusions There is increasing evidence that multisource feedback (MSF) assesses two generic traits, clinical care and psychosocial skills. The validity of MSF is threatened by systematic bias, namely leniency bias and the seniority of assessors. Unregulated self-selection of assessors needs to end.
In September 2004, the Royal College of Paediatrics and Child Health (RCPCH) introduced the Sheffield Peer Review Assessment Tool (SPRAT)1 as part of their workplace-based assessment (WBA) programme.2 The Postgraduate Medical Education and Training Board (PMETB) and Modernising Medical Careers require WBA as part of an integrated overall assessment programme. SPRAT is a multisource feedback (MSF) instrument containing 24 questions rated against a six-point scale. Scale descriptors range from very poor to very good, with 4=satisfactory. SPRAT was developed to assess the generic competencies across the six domains of Good Medical Practice (GMP). MSF, also known as 360° feedback, is the collection of views from colleagues and, on occasion, patients, which are collated and used to make a judgement about performance. MSF is now increasingly being used to assess doctors,3 including in recent proposals for revalidation,4 despite concerns about a paucity of evidence.5
What is already known on this topic
Multisource feedback (MSF) is being advocated for the assessment of all doctors in the UK.
There is a paucity of evidence to support robust implementation.
What this study adds
There is increasing evidence for the utility of MSF to assess doctors while achieving reliable results at feasible levels.
MSF is reliant on subjective judgements which, if systematically biased, can undermine validity. There is a growing case for an end to self-selection by trainees of their own assessors.
Evaluation of assessment methods is good practice and a requirement of the PMETB. PMETB principles of assessment6 require evidence for the utility of programmes. Many questions remain about the validity of WBA when implemented in the naturalistic clinical environment. This is important if MSF is to be used to assess doctors, leading potentially to further assessment burden, remediation and questions being raised about fitness to practise. This paper discusses the evidence for, and the challenges to, validity, including reliability, in the national implementation of MSF for paediatric Specialist Registrars (SpRs).7
This first section describes the process by which SPRAT was implemented. The methods section that follows describes the analysis of the centrally collated data.
All year 2 and 4 SpRs were identified in all UK Deaneries between August 2005 and May 2006. These training years were chosen as SPRAT informed the trainees' completion of core training and their penultimate year of training.
Identified trainees were sent a self-SPRAT, demographic data and assessor nomination forms. Based on current best evidence,8 trainees were asked to nominate their own assessors with whom they worked clinically. However, trainees were advised that their supervising consultant should be included and that administrative staff were not suitable. Previous validation work showed that SPRAT is primarily a clinically orientated instrument and not suitable for completion by administrative staff who are unable to answer much of the questionnaire.9 The forms were returned to an independent administrative team, scanned and verified. The nominated assessors were then contacted directly by letter and sent the SPRAT form by the independent administrative team. Trainees were encouraged to ask for consent from their proposed assessors in advance of submitting their names.
All assessors' identities were anonymised to the trainee as supported by the literature.10 Educational supervisors, programme directors and other relevant parties were not made aware of the identities of assessors unless serious concerns arose.
Training and feedback
Literature was produced to highlight the proposed plans, and written guidance was provided with the questionnaires. Guidance and an online slideshow available for download were provided to support feedback interpretation.
Feedback was generated for each trainee to provide evidence for the record of in-training assessment process and to support personal development planning. Free text comments were provided verbatim but anonymised.
All data manipulation and analyses were undertaken in SPSS v.14 (SPSS, Chicago, Illinois) and Microsoft Excel 2003 (Microsoft, Redmond, Washington). Missing scores or other parameters, for example ethnicity, were not replaced, as replacement causes an artificial increase in reliability and produces artefacts in factor analyses.
The mean score per SPRAT form was used for all analyses. Data were analysed, reporting totals, means and SD. Assessor and trainee demographic data were collected to explore potential bias and therefore challenges to the validity of MSF.
The content validity of SPRAT was assured during its development1 including the mapping to GMP11 and a review of other professional standards frameworks.12–17 Extensive piloting has previously been undertaken1 9 but the internal structure of the instrument has not been explored fully before. Factor analyses are used to study the patterns of relationships among dependent variables, with the goal of discovering something about the nature of the independent variables that affect them. In the context of SPRAT, do assessors independently assess trainees on each of the items, or are the items related in some way? Not all datasets are suitable for factor analysis. SPRAT's suitability was explored using the Kaiser–Meyer–Olkin (KMO) and Bartlett tests. A principal-component factor analysis with Kaiser Normalisation was then undertaken with the default threshold of eigenvalue >1.
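The extraction step can be illustrated in miniature. The sketch below is not the study's SPSS analysis; it uses synthetic item scores with a built-in two-factor structure to show how principal components and the Kaiser criterion (retain eigenvalues >1) recover latent traits:

```python
import numpy as np

rng = np.random.default_rng(0)
n_forms = 500

# Synthetic data: six items driven by two latent traits
# (stand-ins for "clinical care" and "psychosocial skills").
clinical = rng.normal(size=n_forms)
psychosocial = rng.normal(size=n_forms)
items = np.column_stack(
    [clinical + rng.normal(scale=0.3, size=n_forms) for _ in range(3)]
    + [psychosocial + rng.normal(scale=0.3, size=n_forms) for _ in range(3)]
)

# Principal components are the eigendecomposition of the correlation matrix.
corr = np.corrcoef(items, rowvar=False)
eigenvalues = np.linalg.eigvalsh(corr)[::-1]  # descending order

# Kaiser criterion: retain components with eigenvalue > 1.
n_factors = int(np.sum(eigenvalues > 1))
print(n_factors)  # the two latent traits are recovered: 2
```

With a genuine two-trait structure, exactly two eigenvalues exceed 1 and the remaining components represent item-level noise, mirroring the two-factor solution reported for SPRAT.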
Generalisability theory systematically quantifies errors of measurement in educational tests. The design of the analysis is dictated by the design of the study. Collecting data, while implementing WBA, led to a naturalistic dataset. Assessors were not uniquely coded in the dataset, and so it had to be assumed that each assessor was unique to each trainee. This produces a fully nested design; assessors ‘nested’ within trainee. Items were not considered as a facet for analysis in this study, as the content validity of SPRAT was assured by the development process. Items should not be added or removed on the basis of reliability. The ‘nested’ model allowed the estimation of two variance components, true (attributable to the trainee) and residual (all other variance) using VARCOMP (Minimum Norm Quadratic Unbiased Estimation—the MINQUE procedure). The procedure was undertaken to generate variance components for each cohort and within each training year. The procedure was also repeated at the level of the factors identified in the factor analysis.
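For a balanced, fully nested design, the two variance components can be recovered from ANOVA mean squares. This is a textbook estimator sketched on simulated data with illustrative effect sizes, not the MINQUE procedure or the values reported in table 2:

```python
import numpy as np

rng = np.random.default_rng(1)
n_trainees, n_assessors = 200, 9   # assessors fully nested within trainee

# Simulate mean scores: true trainee effect plus assessor-level noise.
true_sd, residual_sd = 0.3, 0.4    # illustrative values, not the study's
trainee_effect = rng.normal(scale=true_sd, size=n_trainees)
scores = trainee_effect[:, None] + rng.normal(
    scale=residual_sd, size=(n_trainees, n_assessors)
)

# ANOVA estimators for the balanced one-way nested design.
group_means = scores.mean(axis=1)
msw = ((scores - group_means[:, None]) ** 2).sum() / (n_trainees * (n_assessors - 1))
msb = n_assessors * ((group_means - scores.mean()) ** 2).sum() / (n_trainees - 1)

var_residual = msw                        # residual component (all other variance)
var_trainee = (msb - msw) / n_assessors   # true component (attributable to trainee)
print(round(var_trainee, 3), round(var_residual, 3))
```

The estimates land close to the simulated 0.09 (true) and 0.16 (residual) variances; as noted in the text, an unbalanced naturalistic dataset instead requires a procedure such as MINQUE.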
Reliability can be presented in a number of ways. Cronbach and Shavelson advocate that demonstrating the precision and spread of scores is most useful and faithful to reliability analyses.18 To obtain a measure of precision, the 95% CI around each mean rating can be generated as follows. The standard error of measurement (SEM) is the square root of the error variance divided by the number of assessors (√(error/number of assessors)); this can be calculated for one to 10 assessors. The 95% CI is the SEM multiplied by 1.96, added to and subtracted from a mean rating. The CI, generated for the number of assessors that contributed to the individual trainee mean score, can then be placed around that score. This provides a measure of precision and, therefore, of the reliability that can be attributed to each mean score based on the number of individual scores contributing to it. To reach a reliable decision about a trainee requires that the CI does not cross the criterion standard, in this case 4.0 on a six-point scale. One can then be 95% confident that the trainee has, or has not, achieved a satisfactory performance in their MSF.
Demographic data analysis: trainee
The frequencies, means and SD were calculated for each gender, place of graduation (UK and International Medical Graduates (IMGs)), working pattern (full and part-time) and post-type (stand-alone vs part of rotation). In each case, the two groups were compared using an independent t test. A one-way analysis of variance was used to determine significant differences across Deaneries and clinical environments (for example, the emergency department). Time in post, in the hospital and in the Deanery was correlated against performance on SPRAT using Pearson correlations.
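These three comparisons can be sketched with standard routines. The group sizes and means below echo the figures reported later in the paper, but the data are simulated for illustration only:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Independent t test on per-trainee mean scores for two groups
# (illustrative sizes/means in the spirit of the male vs female comparison).
group_a = rng.normal(loc=5.10, scale=0.33, size=242)
group_b = rng.normal(loc=5.12, scale=0.35, size=307)
t, p = stats.ttest_ind(group_a, group_b)

# One-way ANOVA across several groups (here: four simulated Deaneries).
deaneries = [rng.normal(loc=5.11, scale=0.34, size=60) for _ in range(4)]
f, p_anova = stats.f_oneway(*deaneries)

# Pearson correlation of time in post against mean score.
months_in_post = rng.uniform(1, 24, size=500)
scores = rng.normal(loc=5.11, scale=0.34, size=500)  # unrelated by construction
r, p_corr = stats.pearsonr(months_in_post, scores)

print(round(t, 2), round(f, 2), round(r, 2))
```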
Demographic data analysis: assessor
Demographic data collected on assessors were analysed using a hierarchical regression to calculate potential variability attributable to them. This was undertaken, controlling for the level of training of the doctor, as it was accepted that training would affect performance. Terms entered second were assessor gender and ethnicity, length of working relationship in quartiles, location of the working relationship and their occupation. Taking the t statistic as a measure of the relative importance of each potential confounder, those above +2 or below −2 with significance (p<0.05) are reported.
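A two-block hierarchical regression of this kind can be sketched with ordinary least squares: fit the control block first, add the assessor characteristics, and inspect the R² increment and the t statistic of each added term. The data and effect sizes below are simulated for illustration, not the study's:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000  # one row per returned assessment form

# Simulated data: training year genuinely affects scores; a "consultant
# assessor" flag adds a negative bias (illustrative effect sizes only).
year4 = rng.integers(0, 2, size=n)        # 0 = year 2, 1 = year 4
consultant = rng.integers(0, 2, size=n)   # assessor occupation dummy
score = 5.08 + 0.10 * year4 - 0.08 * consultant + rng.normal(scale=0.34, size=n)

def ols(X, y):
    """OLS fit returning coefficients, their t statistics and R^2."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return beta, beta / se, r2

ones = np.ones(n)
# Block 1: control for level of training only.
_, _, r2_block1 = ols(np.column_stack([ones, year4]), score)
# Block 2: add the assessor characteristic.
beta, t_stats, r2_block2 = ols(np.column_stack([ones, year4, consultant]), score)

print(round(r2_block2 - r2_block1, 3), round(t_stats[2], 2))
```

The simulated consultant term comes out with a clearly negative t statistic, the same pattern as the t=−4.52 reported for consultant assessors in the results.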
The ability to discriminate is fundamental to the success of any assessment instrument. The placing of doctors around the expected or criterion standard is reported. The range of mean scores at each level of training is also reported. The number of doctors falling short of the standard is stated, together with the SEM around each doctor's mean. A content analysis of the free text comments was undertaken using a simple word-frequency analysis to identify any overarching themes.
Five hundred and seventy-seven trainees identified 5770 potential assessors who were each sent questionnaires. Of these, 4770 completed forms were returned for an 83% response rate. Three hundred and forty-three SpRs were in year 2, and 201 SpRs were in year 4. In addition, 10 trainees stated that they were in years 1, 3, 5 or 6.
Supportive evidence for validity
The whole instrument was found to be suitable for factor analysis (KMO=0.976; Bartlett test significant, p<0.001), which returned a two-factor solution accounting for 76.5% of the variance (table 1). One factor considers questions surrounding the clinical care components of medical practice, and the other considers psychosocial skills. The overall mean score achieved by the trainees on SPRAT was 5.11 (SD=0.34, Skewness=−0.514) (figure 1). Year 4 trainees scored significantly higher than those in Year 2 (year 2 n=343, mean 5.08, SD=0.34, year 4 n=201, mean 5.18, SD=0.34, t=3.50, df 552, p<0.01).
Little difference was found between the variance components for the whole cohort, the two training years (2 and 4) or the two factors (table 2). Table 3 shows 95% CI generated from the residual component dependent on the number of assessors contributing to that score.
Stability across demographics
The majority of participant demographic factors did not cause significant variability in means. Male trainees did not score significantly differently from female trainees (male n=242, mean=5.10, SD=0.33; female n=307, mean=5.12, SD=0.35; t=0.67, df=572, p>0.05). Working patterns did not affect scores (full time n=488, mean=5.11, SD=0.35; part time n=77, mean=5.15, SD=0.29; t=−0.91, df=563, p>0.05). No significant difference in performance could be attributed to Deanery (df=23, F=0.85, p>0.05), clinical environment (df=8, F=1.16, p>0.05), post-type (F=0.02, t=1.23, df=470, p>0.05) or time in post, hospital or region (time in post correlation 0.05, p>0.05; time in hospital correlation 0.04, p>0.05; time in region correlation −0.09, p>0.05).
However, a number of characteristics did contribute independently to variability in scores.
Challenges to validity
There was a significant difference between the means achieved by UK and non-UK graduates (UK graduates n=333, mean=5.17, SD=0.35; non-UK graduates n=233, mean=5.04, SD=0.31; t=4.74, df=564, p<0.01).
Seven per cent of the variation in means could be attributed to confounding variables when controlling for year of training. Consultants marked trainees significantly lower (t=−4.52, p<0.01), whereas SHOs and Foundation doctors scored their SpRs significantly higher (SHO t=2.06, p<0.05; Foundation t=2.77, p<0.05) (figure 2). The length of the working relationship also confounded scores (t=10.90, p<0.01): trainees received higher scores from assessors they had known for longer.
Mean scores for Year 2 trainees ranged from 3.49 to 5.74, and for Year 4 trainees from 3.49 to 5.86. Five hundred and fifty-seven trainees achieved scores that could be placed with 95% confidence above the 4.0 expected standard. Only three doctors (<1%) scored an aggregate below 4.0, two of whom could be placed below it with 95% confidence. A further 17 doctors were highlighted as potentially in difficulty, with 95% CI crossing the expected standard. All these trainees were more than 2 SD below the cohort mean. Twenty-one assessors raised concerns about issues of probity or health. Free text comments raising concerns could be summarised into two allied themes: coping with stress and concerns around illness (high levels of sick leave).
This paper reports the findings of a large study into the validity of MSF. The process was supported in the RCPCH literature, as organisational support is fundamental to success,19 20 and guidance was provided as endorsed by best evidence.21 Independent third-party involvement is advocated to help support the implementation of MSF.22 This allows for communication with all involved, clear ownership for administration, helpline facilities, deadlines and monitoring systems.
SPRAT was constructed to assess the contents of GMP, and so it might be expected that scores would ‘factor’ into its domains. However, a two-factor solution was returned, more in line with the medical23–26 and management27 literatures. MSF instruments appear overwhelmingly to assess two generic factors, often labelled clinical/cognitive and psychosocial/humanistic, regardless of context or setting. There are a number of potential explanations for this. The halo effect,28 which is ‘the tendency to give global impressions,’29 and stereotyping30 lead to global judgements. According to implicit personality theory, we perceive personality traits in clusters.30 We see one trait and assume others. However, we would argue that content and context specificity must be upheld to support content validity and the educational potential of feedback. Therefore, it is important that the items remain rather than being condensed into two ‘factor’ items. It should be remembered, however, when giving feedback or using SPRAT to support decision-making, that the validity evidence supports feedback about each factor (is a doctor satisfactory in their clinical skills and/or their psychosocial approach) but not about each item within SPRAT (for example, at managing complex patients). Therefore, a trainee scoring 5.2 on item 1 and 4.8 on item 13 is not ‘better’ at diagnosing problems than at their commitment to learning.
The two year groups were assessed as being statistically different. This is supportive validity evidence and identifies an inherent standard-setting ability in assessors. Further criterion validity evidence will be sought by comparing individual performance across a number of paediatric pilot instruments, including the mini-Clinical Evaluation Exercise, Case-based Discussion, the Paediatric Consultation Assessment Tool31 and patient feedback.32
Variance components are consistent with the published literature1 3 although they may be overestimated due to the inability from this dataset to explore assessor stringency between ‘nests.’ Assessors ‘grouped in nests’ (each nest unique to each trainee) are likely in themselves to vary in their stringency independently of any true performance difference between trainees. Further work is needed.
Trainees doing well require less evidence to assure reliability; 95% CIs thus provide a pragmatic approach to balancing sampling with feasibility. In this study, 557 trainees achieved scores whose 95% CI remained above the 4.0 expected standard when related to the number of assessors. Two trainees scored significantly below it. More assessments are required about those who raise concerns. This is beneficial, as more assessment data improve reliability and validity. Increased data support the educational feedback process and/or remediation. The process remains feasible by prioritising resources away from the majority towards doctors potentially in difficulty.
Challenges to validity
UK-trained doctors scored systematically higher than IMGs. It could be argued that IMGs do less well due to differences in ability and/or knowledge. However, poor integration into the National Health Service and cultural differences33 are likely to be important. Some commentators may reach the conclusion that this is evidence of institutional racism. Ethnic background may explain these results, with evidence that assessors rate higher those who are similar to them34 including ethnicity.35 Regardless of the underlying cause, the perception of racial bias could potentially undermine any process.35
Two demographic features of assessors contribute to variability in mean scores, occupation of the assessor and the length of working relationship. Consultants marked trainees significantly lower than any other occupational group of assessors, while SHOs and Foundation doctors scored their SpRs significantly higher. This is in line with the majority of the literature.36 37 This might be explained by peers paying more attention to interpersonal skills rather than clinical care.38 Assessor confidence in their own skills may alter the ability to score others39 or experience supports more sophisticated evaluative categories improving assimilation. There is some evidence that non-assimilation may help preserve assessor feelings or relationships.40 41 Assumptions are often based on stereotypes42 and first impressions.29 30 The defensive nature of stereotyping means that it is applied differently depending on the level of perceived threat and insecurity.37 This might explain why trainees are influenced more than senior doctors.
Only two trainees failed to meet the expected standard of 4.0 with confidence. There were a further 17 trainees who could not be placed with confidence either side of the standard. A negative skew, with relatively small numbers of trainees ‘failing,’ may correctly represent the standard of UK paediatric SpRs. However, there is evidence that assessors are often lenient,43 influenced inversely by assessor seniority.44 Even perceived leniency, it is argued, is an important phenomenon, as the resulting elevated ratings make it difficult to substantiate important decisions. In addition, participants may question the fairness and validity of performance ratings. Assessor training may reduce bias, but there is varying evidence to support this,45 except with outliers.46 The impact of these biases on scores is important and indicates that self-selection of assessors is no longer supportable, as those who choose more junior members of medical staff are potentially at an unfair advantage.
Free text comments may provide a way forward for a combined quantitative/qualitative approach in decision-making.
Large-scale implementation of MSF is feasible, and 95% CIs provide a pragmatic way of focusing resources on trainees in difficulty while reducing the assessment burden for the majority. There is increasing evidence that MSF assesses two generic traits, clinical care and psychosocial skills, but content specificity must be assured with the use of explicit mapped items. MSF is unlikely to be successful without robust regular quality assurance to establish and maintain validity including reliability.
Subjective decisions are at the centre of any MSF process, so exploration of sources of bias may simply represent an artificial subdivision, but even the perception of bias through poor implementation can undermine MSF. Certainly, leniency bias and the impact of assessor seniority appear important, and unregulated self-selection of assessors should end.
The authors thank the members of the research team, Healthcare Assessment and Training (HcAT), based in Sheffield Children's Hospital Foundation NHS Trust.
Funding JA was funded by Clinical Research Fellowships through the University of Sheffield, Western Bank, Sheffield, S10 2HT supported by Doncaster and Bassetlaw Foundation NHS Trust and Barnsley NHS Trust.
Competing interests None.
Provenance and peer review Not commissioned; externally peer reviewed.