Introduction

Many children’s lives are troubled. Psychosocial childhood problems are common; research has shown that between 3 and 18% of all children suffer from some sort of psychopathology (Bourdon et al. 2005; Costello et al. 2003; Egger and Angold 2006; Ford et al. 2003; Meltzer et al. 2003; Zwirs et al. 2007). Behavioral disorders, such as oppositional defiant disorder (ODD), conduct disorder, and attention-deficit/hyperactivity disorder (ADHD), and emotional disorders, such as anxiety and depressive disorders, are diagnosed most frequently in children (Canino et al. 2004; Egger and Angold 2006; Ford et al. 2003).

A substantial discrepancy has been found between the prevalence of psychosocial problems in childhood and the number of problems being treated (for a review, see Costello et al. 2005). One cause of this discrepancy may be the stigma associated with mental health care (Corrigan 2004) or limited access to care (Kataoka et al. 2002). Another explanation might be that psychosocial problems in the community are often not recognized or diagnosed (Costello et al. 2005). This is worrisome given that problems in young children show relative stability over time (Caspi et al. 1996) and can potentially escalate or progress into psychiatric disorders. Thus, screening children at an early age for mental health problems and delivering early interventions, which might prevent these childhood problems from developing into more severe psychiatric disorders, is of great importance (Harrington et al. 1996). Though many instruments are available for screening children, the Child Behavior Checklist (CBCL; Achenbach 1991) has long been viewed as the “gold standard” in assessing childhood problems. Recently, attention to early and quick detection of childhood psychopathology has increased. This has created room for questionnaires other than the CBCL to be used as screening instruments. The launch of the Strengths and Difficulties Questionnaire (SDQ; Goodman 1997) has enabled researchers and clinicians to increase acceptability among respondents by offering a short and partly positively worded questionnaire (Goodman and Scott 1999). Whereas the CBCL is a very solid instrument for in-depth assessment, the SDQ may be more suitable for screening purposes. The SDQ thus does not replace the CBCL as the new gold standard; rather, it complements the field of childhood psychological assessment by adding a questionnaire that is shorter and quicker to complete than the CBCL.
The SDQ has quickly become one of the most utilized screening instruments because it is able to measure both problem behavior and competencies at an early age. In the current study, we reviewed studies examining the psychometric properties of the parent and teacher versions of the SDQ.

The SDQ is a relatively short, user-friendly screening instrument for psychosocial problems in children, and it is worded more positively than other common questionnaires. Specifically, the SDQ has relatively few items compared to the Child Behavior Checklist (25 vs. 118; CBCL; Achenbach 1991). Another advantage of the SDQ is that it is free of charge and available online (www.sdqinfo.com). The SDQ fits the current paradigm in the assessment of psychosocial problems, wherein the focus is expanded to include competencies or strengths in addition to assessing problems (Carr 2000; Rhee et al. 2001). The SDQ is based on the Rutter Questionnaires, which were developed in the 1960s (Rutter 1967). Goodman updated the items of the Rutter Questionnaires according to the current focus in child psychopathology, for example by adding items on concentration, peer relations, and social competence (Goodman 1994, 1997). The update is based on criteria from the Diagnostic and Statistical Manual of Mental Disorders, fourth edition (American Psychiatric Association 1994) and the International Classification of Diseases, tenth edition (World Health Organization 1992). Additionally, the instrument includes a prosocial scale, which was added to make the assessment more acceptable to respondents. Goodman (1994) devised the items of the parent version of the prosocial scale, while the teacher version items were based on the Prosocial Behavior Questionnaire (PBQ; Weir and Duveen 1981). An impact supplement was added to the SDQ, enabling the informants to report on possible burden and distress (Goodman 1999).

The SDQ intends to measure both psychosocial problems and strengths (for example prosocial behavior) in children and youths aged 3–16 through a multi-informant approach. Parents and teachers can report difficulties and strengths among 3- to 16-year-olds, whereas youths aged 11–16 can report on their difficulties and strengths themselves. The questionnaire consists of 25 items equally divided across five scales measuring emotional symptoms, conduct problems, hyperactivity-inattention, peer problems, and prosocial behavior. Except for the prosocial scale, the combined scale score reflects total difficulties, indicating the severity and the content of the psychosocial problems. The prosocial scale indicates the amount of prosocial characteristics a child shows (Goodman 1997).
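The scoring rule described above can be sketched as follows. This is a minimal illustration, not the official scoring procedure: the function and key names are hypothetical, and the assumed 0–10 range per scale (from items scored 0–2) is not stated in the text above.

```python
# The four problem scales that are combined into the total difficulties score.
PROBLEM_SCALES = (
    "emotional_symptoms",
    "conduct_problems",
    "hyperactivity_inattention",
    "peer_problems",
)

def total_difficulties(scale_scores):
    """Sum the four problem scales; the prosocial scale is deliberately excluded."""
    return sum(scale_scores[name] for name in PROBLEM_SCALES)

# Hypothetical scale scores for one child (assumed range 0-10 per scale).
child = {
    "emotional_symptoms": 3,
    "conduct_problems": 2,
    "hyperactivity_inattention": 6,
    "peer_problems": 1,
    "prosocial_behavior": 8,  # reported separately, not part of the total
}
score = total_difficulties(child)
```

Keeping the prosocial score out of the sum mirrors the text: it describes strengths, so a high value should not raise the difficulties total.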

The impact supplement comprises eight questions. The first question asks whether the informant thinks the child has a problem; the remaining questions assess chronicity, distress, social impairment, and burden for others. From these questions, three dimensions can be inferred: perceived difficulties (is there a problem), impact score (distress and social incapacity of the child), and a burden rating (do symptoms impose a burden) (Goodman 1999).

As the SDQ has been translated into over 60 languages, it has been widely used as a screening and research tool, a treatment-outcome measure, and a part of clinical assessment. In line with the increasing use of the SDQ, the body of research on the psychometric properties of the instrument is also growing substantially. Therefore, an overview of the results on its psychometric properties, that is, reliability and validity, would be very useful for researchers and practitioners.

This review examines the psychometric properties of the parent and teacher versions of the SDQ for children aged 4–12 (primary school-aged children). Most research on the SDQ has focused on upper primary school-aged children and youngsters attending secondary school. Psychometric properties of the SDQ in these older children have been found sufficient in community (e.g., Koskelainen et al. 2001) and clinical samples (e.g., Becker et al. 2004), but research conducted on lower primary school-aged children shows mixed findings. Thus, it is important to review findings for primary school-aged children in order to draw conclusions about the suitability of the SDQ for younger children.

Having multiple informants reporting on the SDQ is valuable because psychosocial problems may be highly situational (Achenbach et al. 1987; Goodman et al. 2000c). Thus, the rater’s perception of the situation may influence the ratings. Therefore, we have to investigate whether the psychometric properties of the SDQ in these informants differ and, based on the findings, examine possible implications for the use of the SDQ. Further, the utility of the SDQ is different in clinical versus community populations. In a clinical population, we assume the presence of psychosocial problems. Therefore, the SDQ should inform us about types of psychosocial problems, the duration, and perception of these problems. In a community population of children, we assume the presence of some but not all psychosocial problems; hence, the SDQ should be very sensitive in detecting those children in the community who suffer from (developing) psychosocial problems. The aim of the SDQ is thus slightly different in clinical and community populations.

Specifically, we report results on internal consistency, test–retest reliability, and inter-rater agreement. As for validity, results on construct validity, concurrent validity, capacity to discriminate, and predictive validity are reported.

Methods

Search Strategy and Selection for Identification of Studies

The electronic databases PsycINFO, PubMed, and ERIC were searched in March 2010 using the search terms “strengths and difficulties questionnaire,” “validity,” and “reliability.” Neither books nor unpublished articles were retrieved from the references.

Abstracts of selected studies were thoroughly read in order to determine whether they were potentially eligible for the inclusion in this review. Inclusion criteria were as follows:

  • The target population had to be 4–12 years of age. The upper age limit was exceeded in 27 of the 48 studies. Of those 27 studies, 3.7% exceeded the age limit by 1 year, 7.4% by 2 years, 22.2% by 3 years, 25.9% by 4 years, 29.6% by 5 years, 7.4% by 6 years, and 3.7% by 7 years. Still, we included these studies in our review because their results for younger children are relevant. Whenever possible, only the results from primary school-aged children were extracted, and the results from secondary school-aged children were omitted.

  • Studies had to assess the psychometric properties.

  • Studies had to use the parent and/or teacher SDQ version but not self-report.

  • Reports had to be available in English.
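As a consistency check on the first inclusion criterion, the reported percentages can be converted back into study counts. The sketch below is only a reconstruction from rounded percentages; the counts themselves are not reported in the review, and the variable names are illustrative.

```python
# Number of studies whose samples exceeded the 4-12 age range (from the text).
n_over_range = 27

# Reported percentage of those studies, keyed by years over the age limit.
percent_by_years_over = {1: 3.7, 2: 7.4, 3: 22.2, 4: 25.9, 5: 29.6, 6: 7.4, 7: 3.7}

# Convert each percentage to the nearest whole number of studies.
counts = {
    years: round(pct / 100 * n_over_range)
    for years, pct in percent_by_years_over.items()
}
total = sum(counts.values())
```

The reconstructed counts add up to the 27 studies mentioned, which suggests the percentages are internally consistent.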

Eventually, k = 48 studies were eligible for our review. All studies were published as articles in scientific journals. The publication dates of the 48 articles ranged from 1997 to March 2010. Methodological characteristics of each study are summarized in Table 1. The studies that were selected for this review are indicated with an asterisk in the reference list.

Table 1 Summary of studies included in the review

Strategy for Analysis

The available results on internal consistency (the extent to which items produce similar scores) (Cronbach 1951), test–retest reliability (the extent to which a questionnaire yields similar results at different time points), and inter-rater agreement (the consensus between different raters) allowed us to report these outcomes systematically. In addition, a systematic comparison of the results on construct validity, concurrent validity, and capacity to discriminate was feasible. Construct validity, one of the most important assets of a questionnaire, here refers to the degree to which the SDQ is similar to other theoretical constructs of child psychopathology (Campbell and Fiske 1959). Concurrent validity is defined as the degree to which the SDQ scores relate to a theoretically similar construct, represented in a questionnaire. Capacity to discriminate refers to the ability of the SDQ to distinguish between groups that it should theoretically be able to distinguish between. Predictive validity is defined as the ability of the SDQ to predict scores on another criterion measure. As the method of examining predictive validity differs greatly with respect to research design, the results on predictive validity were not reviewed systematically but descriptively.
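To make the notion of internal consistency concrete, the following minimal sketch computes Cronbach's alpha from a small respondents-by-items table. The response data are hypothetical and the function name is illustrative; the reviewed studies reported alpha values directly, so this only shows what the coefficient measures.

```python
from statistics import variance

def cronbach_alpha(item_scores):
    """Cronbach's alpha for a list of respondent rows, one score per item."""
    k = len(item_scores[0])  # number of items in the scale
    # Sample variance of each item across respondents.
    item_vars = [variance([row[j] for row in item_scores]) for j in range(k)]
    # Sample variance of the summed scale score across respondents.
    total_var = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical responses: 5 respondents x 5 items, each item scored 0-2.
responses = [
    [0, 0, 1, 0, 0],
    [1, 1, 1, 0, 1],
    [2, 1, 2, 1, 2],
    [0, 1, 0, 0, 1],
    [2, 2, 2, 1, 2],
]
alpha = cronbach_alpha(responses)
```

Alpha rises when the items vary together, that is, when the variance of the summed score is large relative to the sum of the item variances.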

Reliability results were reported for each subscale as well as for the impact and total difficulties scales. Correlations were first transformed into Fisher’s Z-scores in order to enable the calculation of weighted correlations. The normally distributed Fisher’s Z-scores were weighted according to their sample size minus 3, and a weighted mean Fisher’s Z-score was computed by dividing the sum of the weighted Fisher’s Z-scores by the sum of their weights. The weighted mean Z-score was transformed back to a correlation coefficient r (Field 2001). Weighted mean correlations were reported separately for each type of informant (parent and teacher). Internal consistency values below α = 0.70 are generally considered low, values between α = 0.70 and α = 0.80 acceptable, and values of α = 0.80 and above good (Cohen 1977). Time intervals of test–retest reliability varied between 2 weeks and 6 months. Generally, test–retest correlations of r = 0.70 and above are considered acceptable. Inter-rater agreement between parents and teachers was reported for each subscale and for the total difficulties scale. No results on the impact scale were reported in the reviewed studies. As a rule of thumb, the meta-analytic mean of inter-rater agreement between parents and teachers (r = 0.27) (Achenbach et al. 1987) is used as a benchmark of agreement or data quality (Goodman 2001). This meta-analytic mean was computed by extracting inter-rater agreement results from 41 studies on the CBCL. As the Achenbach et al. (1987) study is known as a landmark paper on inter-rater agreement, the use of 0.27 as a benchmark seems justified.
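The weighting procedure described above can be sketched in a few lines. This is an illustration of the stated steps (Fisher's Z transformation, weights of n − 3, back-transformation), not the authors' actual analysis code; the function name and the example correlations and sample sizes are hypothetical.

```python
import math

def weighted_mean_correlation(correlations, sample_sizes):
    """Pool correlations via Fisher's Z, weighting each study by n - 3."""
    weights = [n - 3 for n in sample_sizes]
    # Fisher's Z transformation is the inverse hyperbolic tangent of r.
    z_scores = [math.atanh(r) for r in correlations]
    # Weighted mean Z: sum of weighted Z-scores divided by the sum of weights.
    mean_z = sum(w * z for w, z in zip(weights, z_scores)) / sum(weights)
    # Transform the weighted mean Z back to a correlation coefficient.
    return math.tanh(mean_z)

# Hypothetical test-retest correlations from three studies.
pooled_r = weighted_mean_correlation([0.60, 0.70, 0.80], [103, 203, 53])
```

Averaging in the Z metric rather than averaging r directly avoids the bias that arises because the sampling distribution of r is skewed for values far from zero.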

Item-level factor loadings were extracted from studies assessing construct validity. Factor loadings were not fully comparable due to the application of different extraction methods (such as principal component analysis and principal axis factoring) and rotation methods (orthogonal or oblique) in studies using exploratory factor analysis. In studies using confirmatory factor analysis, the estimation methods differed (maximum likelihood or weighted least squares). To gain insight into the quality of the measurement model of the SDQ, loadings were categorized as low (<0.40), medium (≥0.40 and ≤0.70), or high (>0.70). Also, weighted mean factor loadings were calculated at the item level.
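The categorization of loadings can be expressed as a simple rule. The sketch below uses the cut-offs given above; taking the absolute value of a loading is an added assumption, and the function name and example loadings are hypothetical.

```python
from collections import Counter

def categorize_loading(loading):
    """Classify a factor loading as low (<0.40), medium (0.40-0.70), or high (>0.70)."""
    magnitude = abs(loading)  # assumption: the sign of the loading is ignored
    if magnitude < 0.40:
        return "low"
    if magnitude <= 0.70:
        return "medium"
    return "high"

# Hypothetical loadings for one item across several studies.
loadings = [0.35, 0.40, 0.55, 0.70, 0.82]
frequencies = Counter(categorize_loading(x) for x in loadings)
```

Tallying categories per item in this way yields frequency counts of the kind summarized in a table of loadings by informant.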

Concurrent validity was reported mainly as the correlation of SDQ measures with other measures of psychopathology, such as the CBCL. In the reviewed studies that examined capacity to discriminate, receiver operating characteristic (ROC) analyses were conducted to distinguish between high- and low-risk samples, generating the area under the curve (AUC). An AUC of 1 indicates perfect capacity to discriminate, and an AUC of 0.5 indicates no capacity to discriminate. Sensitivity (i.e., the proportion of children who are correctly identified by the SDQ as having psychosocial problems) and specificity (i.e., the proportion of children who are correctly identified by the SDQ as not having psychosocial problems) results were extracted and summarized. Again, the results were weighted according to their sample size.
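The three indices defined above can be illustrated with a small worked example. The scores, diagnoses, and cut-off below are hypothetical, the function names are illustrative, and the AUC is computed with the rank-based (Mann-Whitney) formulation rather than by tracing a ROC curve, which gives the same area.

```python
def sensitivity_specificity(has_disorder, screen_positive):
    """Return (sensitivity, specificity) from paired true status and screen result."""
    pairs = list(zip(has_disorder, screen_positive))
    true_pos = sum(1 for d, s in pairs if d and s)
    false_neg = sum(1 for d, s in pairs if d and not s)
    true_neg = sum(1 for d, s in pairs if not d and not s)
    false_pos = sum(1 for d, s in pairs if not d and s)
    return true_pos / (true_pos + false_neg), true_neg / (true_neg + false_pos)

def auc(case_scores, noncase_scores):
    """Probability that a random case outscores a random non-case (ties count half)."""
    wins = 0.0
    for c in case_scores:
        for n in noncase_scores:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5
    return wins / (len(case_scores) * len(noncase_scores))

# Hypothetical total difficulties scores for children with and without a diagnosis.
cases = [20, 18, 15, 9]
noncases = [16, 10, 8, 7, 5, 3]
cutoff = 15  # hypothetical screening cut-off, not taken from the review

has_disorder = [True] * len(cases) + [False] * len(noncases)
screen_positive = [score >= cutoff for score in cases + noncases]
sens, spec = sensitivity_specificity(has_disorder, screen_positive)
area = auc(cases, noncases)
```

Sensitivity and specificity depend on the chosen cut-off, whereas the AUC summarizes discrimination across all possible cut-offs.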

Due to unique research designs in some studies, not all results could be captured in tables. Results from these studies are reported descriptively, as are the results on predictive validity.

Results

Internal Consistency

Weighted mean and the range of unweighted internal consistency reliability estimates by type of informant are presented in Table 2, as extracted from 26 studies. Prosocial behavior, emotional symptoms, conduct problems, and peer problems showed internal consistencies below 0.70 for parents. Teacher ratings showed higher internal consistencies with only peer problems having a value below 0.70.

Table 2 Weighted mean internal consistency results on the SDQ specified by informant

Test–Retest Reliability

Weighted correlations and the range of unweighted correlations from six studies are presented in Table 3. At the subscale level as well as for the impact scale, parent ratings tended to be less reliable over time compared to teacher ratings.

Table 3 Weighted mean test–retest correlations on the SDQ specified by informant

Inter-Rater Agreement

The results of parent and teacher inter-rater agreement correlations from eight studies by weighted mean correlations and by the range of unweighted correlations are presented in Table 4. The weighted mean correlations varied between 0.26 and 0.47. All subscales, except the prosocial scale, had a higher mean than the meta-analytic mean of 0.27.

Table 4 Weighted parent and teacher inter-rater agreement correlations on the SDQ

Construct Validity

A review of the results of the five-factor structure for children aged 4–12 is presented in Table 5. In the parent version, the number of factor loadings was summed across 13 studies. Of these 13 studies, six also examined the teacher version. It should be noted that Smedje et al. (1999) and Hawes and Dadds (2004) split their samples into boys and girls, with each study generating two sets of factor loadings. Sanne et al. (2009) applied both EFA and CFA, which also generated two sets of factor loadings. Therefore, the sets of factor loadings for the parent version summed to 16.

Table 5 Frequencies of factor loadings on item level of the SDQ specified by informant

For the parent and teacher versions, most items showed satisfactory factor loadings between 0.40 and 0.70. For the parent version, the highest loadings were found on the hyperactivity-inattention subscale and the lowest on the conduct problems subscale. For teachers, the highest loadings were found on the prosocial subscale and the lowest on the peer problems subscale. However, in 11 out of 14 studies, the results of these factor analyses were obtained by conducting exploratory factor analysis (EFA).

Eight studies applied confirmatory factor analysis; however, only four are presented in Table 5 (Palmieri and Smith 2007; Van Leeuwen et al. 2006; Van Roy et al. 2008; Sanne et al. 2009) because four out of the total of eight studies did not report factor loadings (Becker et al. 2004; Dickey and Blumberg 2004; Hill and Hughes 2007; Mellor and Stokes 2007).

These eight studies are discussed below. Dickey and Blumberg (2004) found support for a three-factor structure of prosocial, externalizing, and internalizing problems. Van Leeuwen et al. (2006) examined a five-factor model and a three-factor model in two samples. Support was found for the five-factor model for the parent and teacher versions. The three-factor model for the parent and teacher versions revealed a worse model fit. The findings of Becker et al. (2004), Van Roy et al. (2008), and Sanne et al. (2009) provided support for the five-factor model for both the parent and teacher versions, but this factor structure was not found by Mellor and Stokes (2007) and was only marginally adequate in Hill and Hughes’ (2007) study. Palmieri and Smith (2007) confirmed the five-factor structure for custodial grandparents.

Concurrent Validity

Regarding concurrent validity, weighted SDQ-CBCL correlations and the range of unweighted correlations are presented in Table 6. The presented correlations do not include all CBCL subscales. In the majority of the reviewed studies, SDQ problem scales correlated with the CBCL subscales that covered similar concepts in general, that is, externalizing, attention problems, internalizing, and social problems. Weighted correlations of 0.76 for both parent (range of unweighted r = 0.70–0.87) and teacher ratings (range of unweighted r = 0.68–0.87) were found between the SDQ total difficulties and CBCL total scales. At the subscale level, conduct problems with externalizing, and hyperactivity-inattention with attention problems, correlated sufficiently, whereas emotional symptoms with internalizing, and peer problems with social problems, showed correlations below 0.70. The SDQ impact scale and CBCL total scale correlated below 0.70.

Table 6 Concurrent validity: weighted SDQ-CBCL correlations specified by informant

SDQ Correlations with Measures of General Psychopathology

The SDQ has correlated with other measures of general psychopathology. High correlations have been found between SDQ total difficulties and Rutter total deviance scales for parent (r = 0.88) and teacher (r = 0.92) ratings (Goodman 1997). Another study replicated the correlation between SDQ total difficulties and Rutter total deviance scales for parent ratings (r = 0.76) (Goodman et al. 2007). Somewhat lower correlations were found between the parent-rated SDQ and the Chinese version of the parent-rated Conners’ Parent Symptom Questionnaire (PSQ; Du et al. 1995): SDQ total difficulties and PSQ total score had r = 0.63, conduct problems and conduct problems had r = 0.53, hyperactivity-inattention and impulsivity-hyperactivity had r = 0.56, hyperactivity-inattention and hyperactivity index score had r = 0.61, and hyperactivity-inattention and learning problems had r = 0.58 (Du et al. 2008). The Health of the Nation Outcome Scales for Children and Adolescents (HoNOSCA; Gowers et al. 1999), a clinician-based mental health assessment tool, has been correlated with the SDQ total difficulties, resulting in moderate correlations for parent (r = 0.38) and teacher (r = 0.46) ratings. At the subscale level, correlations between HoNOSCA and the hyperactivity-inattention scales of r = 0.33 for parent and r = 0.41 for teacher ratings have been reported (Mathai et al. 2002).

SDQ Correlations with Measures of Specific Psychopathology

The parent-rated SDQ correlated with the clinician-rated ADHD-RS-IV (DuPaul et al. 1998) in that total difficulties and total score had r = 0.50. At the subscale level, hyperactivity-inattention and hyperactivity-impulsivity had r = 0.54. The SDQ prosocial scale correlated with the parent-rated Child Health and Illness Profile-Child Edition (CHIP-CE; Riley et al. 2004) on the subscales of resilience (r = 0.41) and risk avoidance (r = 0.40) (Becker et al. 2006). The SDQ also correlated with the parent-rated ADHDQ-P (Scholte and Van der Ploeg 1998) on total difficulties with total score (r = 0.67), on hyperactivity-inattention with total score (r = 0.73), and at the subscale level on hyperactivity-inattention with attention-deficit (r = 0.65) and with hyperactivity (r = 0.72). Correlations have been found between the parent-rated SDQ and the parent-rated Child Depression Inventory (CDI-P; Kovacs 1981), in that total difficulties and total score had r = 0.73 and emotional symptoms and total score had r = 0.67. The parent-rated SDQ correlated with the parent-rated Revised Children’s Manifest Anxiety Scale (RCMAS-P; Reynolds and Richmond 1978), in that total difficulties and total anxiety score had r = 0.72, and emotional symptoms and total anxiety score had r = 0.73 (Muris et al. 2003).

Associations of the SDQ with the DAWBA, DSM-IV Diagnoses, and Risk Factors in Community Samples

An SDQ algorithm was developed in order to predict whether any psychiatric disorder is “unlikely,” “possible,” or “probable” (Goodman et al. 2000b). With this algorithm, children with a psychiatric diagnosis, as identified by the Development and Well-Being Assessment (DAWBA; Goodman et al. 2000b), were correctly classified as probably having a disorder in 77.3% of the cases. Using the SDQ algorithm, out of the children who were identified as having hyperactivity or conduct-oppositional or emotional disorder diagnosis according to DAWBA, 91% were rated as probable for a hyperactivity disorder, 60% were rated as probable for a conduct-oppositional disorder, and 44% were rated as probable for an emotional disorder (Hysing et al. 2007).

The SDQ algorithm was used in a study to generate diagnoses from SDQ scores. These diagnoses were compared with diagnoses given by independent clinicians or clinical teams based on DSM-IV (1994) criteria. Agreement (expressed in the rank-order correlation tau) between SDQ generated and clinical team diagnoses was found for hyperactivity (τ = 0.44), and conduct (τ = 0.56) and emotional (τ = 0.39) disorders. Reasonable correlations were found between SDQ generated and independent clinician diagnoses for hyperactivity (τ = 0.43), and conduct (τ = 0.30) and emotional (τ = 0.26) disorders (Mathai et al. 2004).

Prevalence of DSM-IV (1994) diagnoses of high- (extreme 10% of sample) versus low-risk (90% of sample) groups based on parent- and teacher-rated SDQ scores differed. SDQ scores were compared with clinical diagnoses, which were assigned based on the DAWBA. Differences in prevalence between high- and low-risk groups showed that all (sub)scales were associated with DSM-IV diagnoses. The odds ratio (OR) for having a psychiatric disorder in the high-risk group was 15.7 for parent- and 15.2 for teacher-rated SDQs, across the total difficulties scale and the subscales (Goodman 2001).

A similar study assessed children with the Diagnostic Interview Schedule for Children, Adolescents, and Parents (DISCAP; Holland and Dadds 1995) and subsequently assigned DSM-IV diagnoses. Significant differences were found between high- and low-risk groups on each SDQ subscale and the total difficulties scale, indicating that higher scores are associated with a greater probability of being assigned a DSM-IV diagnosis. The odds ratio for having a psychiatric disorder in the high-risk group was 11.7 based on total difficulties and 14.9 based on the impact scale. In addition, severity of psychosocial problems was rated by clinicians and correlated with parent-rated SDQ scores for total difficulties (r = 0.47) and the impact scale (r = 0.57) (Hawes and Dadds 2004).
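For readers less familiar with odds ratios of the kind reported in the two studies above, the sketch below shows how one is computed from a 2 × 2 table of risk group by diagnosis. The counts are hypothetical and chosen only for illustration; they are not taken from Goodman (2001) or Hawes and Dadds (2004), and the function name is illustrative.

```python
def odds_ratio(high_with, high_without, low_with, low_without):
    """Odds of a diagnosis in the high-risk group divided by the odds in the low-risk group."""
    odds_high = high_with / high_without
    odds_low = low_with / low_without
    return odds_high / odds_low

# Hypothetical sample of 1000 children: the top 10% on the SDQ form the high-risk group.
or_value = odds_ratio(
    high_with=30, high_without=70,   # high-risk group (n = 100)
    low_with=27, low_without=873,    # low-risk group (n = 900)
)
```

An odds ratio well above 1 indicates that a diagnosis is considerably more likely among children scoring in the extreme range of the SDQ.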

Risk factors such as having contact with a mental health professional or general practitioner (GP), attending special education, or wanting to use these types of services but not being able to afford them have been shown to be associated with high parent-rated SDQ scores. Learning disability, ADHD, declining health, and demographic variables, such as living below the poverty line or living in single-parent or reconstituted families, were significantly associated with high parent-rated SDQ scores (Bourdon et al. 2005). For 26 children, parent-rated SDQ total difficulties were associated with (consideration of) service use (OR = 8.7) (Koskelainen et al. 2001). Parent-rated SDQ total difficulties (r = 0.16), emotional symptoms (r = 0.15), and peer problems (r = 0.15) were associated with additional service use in 68 children receiving care in a welfare institution. Further, the need for additional help was predicted by the impact score of parents (OR = 1.37) and caregivers (OR = 1.50) but not by their total difficulties scores (OR = 1.07, OR = 1.03) (Janssens and Deboutte 2009).

Capacity to Discriminate

In Table 7, weighted AUC values are presented by informant. The combined AUC represents a weighted average of the AUC in each study. The AUCs were weighted by their standard error. For the prosocial behavior and peer problems subscales, the AUC values for teacher ratings were just above 0.5, indicating that the ability of these subscales to distinguish between children with and without diagnoses is just above chance level. For the remaining scales, AUC values are satisfactory.
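The text states only that the AUCs were weighted by their standard error. One common way to do this is inverse-variance weighting, in which each study receives a weight of 1/SE². The sketch below assumes that scheme, which may differ from the procedure actually used; the function name and the example values are hypothetical.

```python
import math

def pooled_auc(aucs, standard_errors):
    """Inverse-variance weighted mean AUC and its standard error (assumed scheme)."""
    # Studies with smaller standard errors receive larger weights.
    weights = [1.0 / se ** 2 for se in standard_errors]
    pooled = sum(w * a for w, a in zip(weights, aucs)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se

# Hypothetical AUCs and standard errors from three studies.
combined, combined_se = pooled_auc([0.78, 0.85, 0.70], [0.03, 0.05, 0.04])
```

Under this scheme, a study without a reported standard error cannot be assigned a weight, which is consistent with such studies being reported separately.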

Table 7 Weighted area under curves (by SE) on the SDQ specified by informant

Two studies could not be incorporated in Table 7 because standard errors or upper bounds were not given. Becker et al. (2004) report AUCs for the total difficulties (0.77, 0.75), emotional symptoms (0.69, 0.65), conduct problems (0.81, 0.82), and hyperactivity-inattention (0.77, 0.80) scales for the parent and teacher versions, respectively. So, except for the emotional symptoms scale, the SDQ is adequately able to differentiate between children with and without clinical diagnoses. In a study by Lai et al. (2009), AUC values were reported for emotional symptoms (0.79, 0.70), conduct problems (0.89, 0.86), hyperactivity-inattention (0.86, 0.85), peer problems (0.71, 0.69), prosocial behavior (0.60, 0.69), and total difficulties (0.84, 0.78), for the parent and teacher versions.

Samad et al. (2005) and Malmberg et al. (2003) assessed sensitivity and specificity of the parent-rated total difficulties and impact scales. The percentages of children identified by the SDQ as having a psychiatric disorder and who did have a disorder (true positives) were 69 and 82.4% for total difficulties and 66 and 82.7% for the impact scale, respectively. Children who did not have a psychiatric disorder were correctly identified as such (true negatives) 71 and 85.4% of the time by total difficulties, and 86 and 87.8% of the time by the impact scale. At the subscale level, sensitivity ranged from 56.6 to 75% and specificity from 66 to 88.1%. Two other studies assessed sensitivity and specificity by combining parent and teacher reports only for the hyperactivity-inattention, emotional symptoms, and conduct problems subscales. Goodman et al. (2000b) found sensitivity to be 89, 81, and 90%, respectively, on the aforementioned subscales in a London sample and 89, 86, and 86% in a Dhaka sample. Reported specificity values were 78, 80, and 47% in the London sample and 81, 84, and 82% in the Dhaka sample. Mathai et al. (2004) reported sensitivity of 44% for the hyperactivity-inattention scale, indicating that 44% of children with ADHD symptoms were correctly identified as such by the scale. Children presenting with emotional symptoms were correctly identified as having emotional symptoms in 36% of the cases. The conduct problems scale correctly identified 93% of the children showing conduct problems. Thus, sensitivity was higher for the conduct problems scale than for the hyperactivity-inattention and emotional symptoms scales.

Goodman et al. (2000a) and Goodman et al. (2004) tested sensitivity in community and clinical samples. Combined parent (or caregiver) and teacher reports yielded sensitivity of 62.1 and 82.2% in detecting any psychiatric disorder in the community and clinical samples, respectively. When only parent report was used, sensitivity dropped to 29.8% in the community sample and to 51.4% in the clinical sample. For teacher reports only, sensitivity dropped to 34.5 and 59.8% in the community and clinical samples, respectively. Sensitivity for detecting conduct-oppositional, hyperkinetic, ADHD, anxiety, depressive, as well as less common disorders was also assessed. Results were comparable to the sensitivity found in detecting any psychiatric disorder, except for detecting anxiety disorder in the community. Sensitivity was only 45.5% for parent and teacher reports combined and even lower for teacher report only, with a detection rate of 15.9%. Parent report correctly identified anxiety disorders 33.8% of the time, which differed significantly from teacher report.

When comparing children with and without intellectual disability (ID), 60.9% with ID were found to have an elevated SDQ score compared to 9.8% of children without ID (Kaptein et al. 2008). A somewhat similar result was obtained for children with chronic illness (CI); 20% of them scored high based on parent-rated SDQ total difficulties, while 11% of children who did not have CI scored high (Hysing et al. 2007). Children attending pediatric outpatient clinics were more than twice as likely to score in the abnormal SDQ range compared to children from the community (OR = 2.33). The chance of scoring in the abnormal range was even greater for children attending a pediatric clinic for brain disorder (OR = 5.8) compared to community children (Glazebrook et al. 2002).

Goodman (1999) directed special attention to the impact scale of the SDQ. The three concepts of the impact scale (perceived difficulties, impact score, and burden rating) showed a different distribution in community and clinical samples (χ² = 67.8), confirming the idea that problems of children in the community sample are not perceived as being as severe as problems of children in the clinical sample. Lastly, SDQ scores differed according to treatment status. Children currently receiving treatment for psychosocial problems had higher SDQ scores (M = 15.0) compared to children not receiving treatment (M = 8.0) (Hawes and Dadds 2004).

Predictive Validity

Evidence for the predictive validity of the SDQ has been found in three studies. The first focused on the stability of parent ratings, the second on help-seeking behaviors, and the third on prosocial behavior. Hawes and Dadds (2004) found that SDQ scores remained relatively stable over a 12-month period for the total difficulties (r = 0.77) and impact (r = 0.63) scales. For the subscales, comparable correlations were found for hyperactivity-inattention (r = 0.77), prosocial (r = 0.64), conduct (r = 0.65), emotional (r = 0.71), and peer problems (r = 0.61).

Sharp et al. (2005) found that, over 1 year, parent- and teacher-rated SDQ scores predicted parental help-seeking behaviors and worry about the child. Over three time points (6-month intervals), parent-rated emotional problems were associated with seeking help from family (OR = 1.09). Parent-rated total difficulties at 12 months were associated with worries (OR = 1.06). Emotional problems rated by parents at baseline and at 6 months predicted worries (OR = 0.85; OR = 1.33). Teacher-rated baseline total difficulties scores were associated with seeking help from a GP (OR = 0.17) and from a friend (OR = 14.88). The rate of change in total difficulties rated by teachers was associated with seeking help from school (OR = 1.13) and from a GP (OR = 1.25). Teacher-rated total difficulties at 6 months were associated with parental worry (OR = 1.12). Peer problems rated by teachers were associated with parental worry 6 months later (OR = 1.57).

Perren et al. (2007) examined the role of prosocial behavior in kindergarten longitudinally. In addition to parent and teacher SDQ ratings, children themselves served as informants on their problems through the Berkeley Puppet Interview (BPI; Measelle et al. 1998). Emotional symptoms, conduct problems, and hyperactivity-inattention at age five predicted subsequent emotional symptoms, conduct problems, and hyperactivity-inattention, as rated by multiple informants (i.e., parents, teachers, and children) at age six (β = 0.530; β = 0.500; β = 0.667, respectively). The level of prosocial behavior, in combination with the level of emotional symptoms at age five, predicted emotional symptoms at age six. Children showing high levels of prosocial behavior and high levels of emotional problems at age five showed the highest level of emotional symptoms at age six, whereas children exhibiting high levels of prosocial behavior and low levels of emotional symptoms at age five showed the lowest levels of emotional symptoms at age six.

Discussion

The aim of this review was to contribute to a better understanding of the psychometric properties of the SDQ. A total of 48 studies were reviewed. Several indications for research and practice regarding reliability and validity of the SDQ follow from this review.

Internal Consistency

Results from an impressive number of studies show acceptable internal consistency for the total difficulties and impact scales for both parent and teacher ratings. At the subscale level, we found differences between parent and teacher ratings. For parent ratings, only the hyperactivity-inattention scale had adequate internal consistency; the prosocial, emotional, conduct, and peer problems scales showed only moderate internal consistency. For teacher ratings, the peer problems scale showed a moderate alpha, while the remaining scales showed adequate internal consistency. The items of the peer problems scale may not reflect a single construct, as alphas for this scale are the lowest for both parent and teacher versions. The only item measuring problem behavior is, in our opinion, “picked on or bullied by other children”. The remaining items seem to reflect loneliness on the one hand (rather solitary, tends to play alone; has at least one good friend) and sociability on the other (generally liked by other children; gets on better with adults than with other children).

An explanation for the difference in internal consistency between parents and teachers is that, for parents, the items within the subscales may be less one-dimensional than for teachers, which may point to a halo effect in teacher ratings (Abikoff et al. 1993; Nisbett and Wilson 1977). Halo effects occur when one class of behavior influences the perception, and thus the rating, of other behaviors. Specifically, halo effects have been found to influence ratings of ADHD and ODD (Abikoff et al. 1993; Jackson and King 2004).
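The internal consistency values discussed here are Cronbach's alpha coefficients. As a minimal sketch of how such a coefficient is obtained, the code below applies the standard formula, alpha = k/(k − 1) × (1 − sum of item variances / variance of the total score), to a small set of ratings. The function name and the ratings are hypothetical; the data are invented for a five-item subscale scored 0 to 2 and do not come from any reviewed study.

```python
from statistics import variance


def cronbach_alpha(ratings):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(ratings[0])
    # Sample variance of each item (column) across respondents.
    item_variances = [variance(item) for item in zip(*ratings)]
    # Sample variance of the summed scale score across respondents.
    total_variance = variance([sum(row) for row in ratings])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical ratings (assumption): six children, five items scored 0, 1, or 2.
example_ratings = [
    [0, 1, 0, 1, 0],
    [1, 1, 0, 2, 1],
    [2, 2, 1, 2, 1],
    [0, 0, 0, 1, 0],
    [1, 2, 1, 2, 2],
    [2, 1, 2, 2, 1],
]
alpha = cronbach_alpha(example_ratings)
print(f"alpha = {alpha:.2f}")
```

Alpha rises when items covary strongly relative to their individual variances, which is why a scale whose items tap more than one construct, as suggested above for peer problems, tends to yield a lower value.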

Test–Retest Reliability

The parent version of the SDQ had lower reliability over time than the teacher version, specifically at the subscale level. All parent-rated subscales, except the hyperactivity-inattention subscale, showed correlations below r = 0.70, whereas teacher subscales were all above r = 0.70. The total difficulties scales for parent and teacher ratings showed good test–retest reliability. Only the impact scale proved to be less reliable over time. The moderate over-time correlation for the impact scale may be due to the time interval of 4–6 months that was used in the study assessing the impact scale (Goodman 2001), in contrast to the time interval of 2 weeks to 6 months used in studies assessing the total difficulties scale (Du et al. 2008; Goodman 1999; Goodman 2001; Lai et al. 2009; Mellor 2004; Muris et al. 2003). The difference between parent and teacher ratings at the subscale level may arise because parents are more prone to detect changes in their child’s mood, as they usually spend more time with the child than the teacher does. This may have caused the correlations to be lower for parent than for teacher ratings.
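Test–retest reliability as discussed here is a correlation between scores from two administrations of the same scale. The sketch below computes a Pearson correlation for two sets of total difficulties scores. The function name and the scores are hypothetical illustrations, and the reviewed studies may have used other coefficients (for example, intraclass correlations).

```python
from math import sqrt


def pearson_r(first, second):
    """Pearson correlation between two equally long lists of scores."""
    mean_first = sum(first) / len(first)
    mean_second = sum(second) / len(second)
    # Sum of cross-products of deviations from the two means.
    covariation = sum((a - mean_first) * (b - mean_second) for a, b in zip(first, second))
    ss_first = sum((a - mean_first) ** 2 for a in first)
    ss_second = sum((b - mean_second) ** 2 for b in second)
    return covariation / sqrt(ss_first * ss_second)


# Hypothetical total difficulties scores (assumption) for six children at two time points.
scores_time1 = [8, 12, 15, 5, 20, 10]
scores_time2 = [9, 11, 17, 6, 18, 12]
retest_r = pearson_r(scores_time1, scores_time2)
print(f"test-retest r = {retest_r:.2f}")
```

With real data, a longer interval between administrations leaves more room for genuine change in the child, so a lower r does not by itself show that the scale is unreliable.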

Inter-Rater Agreement

Compared to the average inter-rater agreement reported for other measures of child psychopathology, the inter-rater agreement between parent and teacher ratings for total scales and subscales was predominantly better (Achenbach et al. 1987). However, agreement remains modest, which is a well-known phenomenon in psychological assessment. Although inter-rater agreement is valuable for testing whether children behave similarly across situations, it may be less valuable as a psychometric property.

Construct Validity

In five studies, the proposed five-factor structure was supported for both parent and teacher versions using confirmatory factor analysis. Recently, support was found for the five-factor model for the parent and teacher versions in a very large sample (Sanne et al. 2009). Only one study (Dickey and Blumberg 2004) found more support for a three-factor structure (internalizing, externalizing, and prosocial behavior) for the parent version. An explanation for the difference in factor structure between the studies of Dickey and Blumberg and Becker et al. (2004), which tested only the parent version using CFA, might be cross-cultural inequivalence (Berry et al. 2002). Parents from the United States may perceive problems differently than German parents do, which could lead to inconsistencies in factor structures.

In this review, most evidence was thus found for the original five-factor structure of prosocial behavior, hyperactivity/inattention, conduct, emotional, and peer problems. An important methodological aspect of construct validity needs to be highlighted. Despite the theoretical foundation for a five-factor structure, the non-normal distribution of scores, and the three-category response format, most studies reported results of exploratory factor analysis and principal component analysis. Neither technique is suited to testing the underlying structure of the SDQ. Because the SDQ is based on theoretical constructs concerning child psychopathology (Goodman 1997), its scores are non-normally distributed, and its response categories are limited, confirmatory factor analysis (CFA) should be the first method of choice when investigating factor structure (Sanne et al. 2009).

Concurrent Validity

Many studies, comparable in some but not all cases, have validated the SDQ. Summarizing and interpreting the results from these studies is therefore complex. Correlations between SDQ and CBCL scales were high for both parent and teacher ratings at the total scales level. The SDQ is thought to measure the same constructs as the CBCL, and these high correlations support that notion. However, at the subscale level, evidence for concurrent validity is less clear. The SDQ emotional and peer problems scales correlated moderately with the CBCL internalizing and social problems scales for both parent and teacher ratings. Further inspection of the CBCL internalizing subscales showed that the CBCL Anxious/Depressed subscale is well represented: three of the five items of the SDQ Emotional Symptoms subscale are very comparable to items from this CBCL subscale. However, no items from the CBCL Withdrawn subscale and only one from the Somatic Complaints and Emotionally Reactive subscales are represented in the SDQ Emotional Symptoms subscale. The Withdrawn subscale consists of items that reflect autism spectrum disorders (ASD), which are not included in the SDQ. The overlap between the CBCL internalizing subscales and the SDQ Emotional Symptoms scale is thus quite small, which may explain the moderate correlation found in our review.

The SDQ impact scale correlated moderately with the CBCL total problems scale for both parent and teacher ratings. Experience of social impairment and substantial distress caused by psychiatric symptoms is nowadays a part of the diagnostic criteria for a psychiatric disorder (American Psychiatric Association 1994; World Health Organization 1992). The CBCL does not contain social impairment and distress items that would be similar to the SDQ impact supplement. Hence, the moderate correlation between the SDQ impact and CBCL total problems scales may indicate that these scales are conceptually different. The impact scale also correlated with a parental burden scale (r = 0.74; Goodman 1999). This parental burden scale is thought to be more comparable to the impact scale than is the CBCL total scale. The CBCL total scale focuses on symptoms of psychosocial problems, whereas the impact and parental burden scales focus on the perception of the consequences of psychosocial symptoms.

In addition to the CBCL, the SDQ had a moderate to high correlation with measures of general and specific psychopathology. High correlations were found specifically for the Rutter scales, on which the SDQ is partly based (Goodman 1997), and for measures of depression and anxiety. This contrasts with the merely moderate correlations found between the SDQ emotional and peer problems scales and the CBCL internalizing and social problems scales. However, as the SDQ was compared here with specific measures of depression and anxiety, the overlap between symptoms may have been greater and the correlations thus higher. Further, in community samples, SDQ scores also detect psychiatric diagnoses assigned by clinicians. Risk factors for developing psychosocial problems, such as poor health, seem to be associated with higher SDQ scores. This indicates that concurrent validity of the SDQ in comparison with different measures of psychopathology, psychiatric diagnoses, and risk factors is well established.

Capacity to Discriminate

The SDQ proves to be a good screening instrument, with high sensitivity and specificity for the total difficulties and impact scales. The percentage of children correctly identified by the SDQ as having a disorder is high, as is the percentage of children correctly identified by the SDQ as not having a disorder. A more detailed insight into the ability of the subscales to distinguish between community and clinical samples is reflected in the AUC values. Weighted AUC values indicate that, for teacher ratings only, the prosocial behavior and peer problems subscales distinguish between children with diagnoses, and those without, at the chance level. Prosocial behavior does not reflect child psychopathology, so it is not expected to distinguish between community and clinical samples. The peer problems scale again showed some inadequacy here.

However, we cannot infer from the sensitivity and specificity values which proportion of children with abnormal test results are truly abnormal (Altman and Bland 1994). When using the SDQ, we should therefore always consider the context, i.e., clinical versus community samples. If used in a community sample, quite a few children with clinical-range SDQ results will actually be typically developing, i.e., false positives, due to low prevalence rates in the general population. In contrast, when the SDQ is used in a clinical sample, where prevalence rates are higher, fewer children will be false positives, but more will be false negatives. It is thus important to consider that the accuracy of the SDQ as a screening instrument varies with the prevalence rate in a given population. This underscores the need for using multiple diagnostic instruments in clinical or at-risk settings, such as pediatric clinics.
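The dependence on prevalence described above follows directly from Bayes' rule and can be illustrated with a short sketch. The sensitivity, specificity, and prevalence values below are hypothetical round numbers chosen for illustration, not SDQ estimates from the reviewed studies, and the function name is likewise an assumption.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a screening test at a given prevalence."""
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)  # share of positive screens that are true cases
    npv = true_neg / (true_neg + false_neg)  # share of negative screens that are true non-cases
    return ppv, npv


# Hypothetical test characteristics (assumption): sensitivity 0.80, specificity 0.90.
ppv_community, npv_community = predictive_values(0.80, 0.90, prevalence=0.10)
ppv_clinical, npv_clinical = predictive_values(0.80, 0.90, prevalence=0.50)

print(f"community: PPV = {ppv_community:.2f}, NPV = {npv_community:.2f}")
print(f"clinical:  PPV = {ppv_clinical:.2f}, NPV = {npv_clinical:.2f}")
```

With identical sensitivity and specificity, the positive predictive value is much lower in the low-prevalence setting (more false positives among positive screens), while the negative predictive value is lower in the high-prevalence setting (more false negatives among negative screens).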

Predictive Validity

Only three studies assessed the predictive validity using a longitudinal design. The results showed evidence of predictive validity, as SDQ scores predicted help seeking for psychosocial problems over a year. Two studies found evidence for SDQ scores predicting similar SDQ scores over a year. In addition, they clarified the role of prosocial behavior in the development of psychosocial problems. Prosocial behavior has not been found to be compatible with high levels of internalizing behavior and thus is not beneficial to children showing highly internalizing behaviors, which concurs with the literature (Hay 1994).

Conclusion

Overall, the 25-item SDQ shows strong psychometric properties. Shorter scales are usually less reliable compared to longer scales, which means they also tend to attenuate the validity (Streiner and Norman 1989). However, the SDQ’s brevity did not substantially influence its psychometric properties. As for reliability, internal consistency of the total scales was satisfactory. Ratings showed sufficient reliability over time, and agreement between parents and teachers was relatively high. We should note here that these conclusions are stronger for teachers. Results concerning validity are less straightforward, but in general, we may state that the five-factor structure was confirmed by most studies, correlations with other measures of child psychopathology were high, and evidence for the screening ability of the SDQ was convincing. Predictive validity has not been studied extensively yet, so these findings should be interpreted with caution.

Additional attention should be directed to the need for longitudinal studies that examine the predictive validity of the SDQ and to the validation of the prosocial scale. Overall, the peer problems scale showed the weakest reliability and validity results, which were most salient for parent ratings. The prosocial scale also showed some weaknesses, especially concerning internal consistency and capacity to discriminate. This should be familiar to researchers, as these findings on the peer problems and prosocial behavior scales were extracted from previous studies. However, no interpretation of these findings has been proposed yet. A possible explanation lies in the concepts of prosocial behavior and peer problems.

In contrast to studies focusing on deviant behavior, studies assessing competence behaviors are relatively rare (Goodman 1994; Tremblay et al. 1992). As a consequence, the competence, or prosocial, construct has not been developed well in terms of what behaviors should be measured. One distinction within prosocial behavior is between the Prosocial Orientation and the Social Initiative dimensions (Rydell et al. 1997). SDQ items are most comparable to the Prosocial Orientation dimension, which can be summarized as behaving smoothly in normal social interactions. In the Rydell et al. study, parent and teacher agreement was lower for the Prosocial Orientation than for the Social Initiative dimension. Possibly, the Social Initiative dimension consists of behaviors that are more easily observed (e.g., shy/hesitant with unfamiliar adults) than those of the Prosocial Orientation dimension (e.g., has ability to decode peers’ feelings), and thus than those of the comparable SDQ prosocial scale (e.g., considerate of other people’s feelings). Behavior that is more difficult to observe may be more susceptible to inferences from raters, for example according to the relationship of the rater with the child (e.g., Ladd and Profilet 1996). Inferences may be stronger for parents than for teachers in rating prosocial behavior, as internal consistency is lower for the former raters. This may be explained by the nature of the relationship with the child, which differs clearly for parents versus teachers.

The peer problems scale showed low internal consistency values for both parent and teacher ratings. Peer problems are most often assessed via reports by children themselves (i.e., sociometrics) because children are regarded as “insiders”, whereas parents and teachers are regarded as “outsiders” of the peer group. Judgments of peers are based on many and varied social interactions with those being assessed, which may be unknown to “outsiders” (Rubin et al. 2005). Assessment of peer problems by parents and teachers is further impeded by the adult perspective used to interpret children’s social interactions, the relationship with the child, and the child’s gender (Ladd and Mars 1986; Ladd and Profilet 1996; Rubin and Coplan 1992). The outsider view combined with the mentioned rater biases may be responsible for the low internal consistency values for the peer problems scale found in our review.

Further, parents and teachers observe children in differing contexts, where different behaviors are shown. This may lead to lower values of internal consistency for both the peer problems and the prosocial subscales. Regardless of whether rater bias contributes to the weak performance of these subscales, it is important to be aware of rater bias when working with screening instruments. The application of screening instruments like the SDQ can be meaningful, in the sense that children are screened before psychosocial problems worsen, only if the instruments are used appropriately.

Finally, it is important to note that results from this review are only applicable to the parent and teacher versions of the SDQ. The SDQ self-report version was not included in this review because it was neither developed nor intended for use with children younger than 12 years of age. From a developmental perspective, the use of traditional self-report questionnaires in children younger than 12 years of age has been questioned, and in children younger than 8 years discouraged. Due to limited linguistic, cognitive, and social-emotional abilities, children were not thought to provide reliable self-reports (Edelbrock et al. 1985; Fallon and Schwab-Stone 1994).

Recently, tests of a puppet interview and a computerized pictorial questionnaire have yielded promising psychometric results in children as young as 5–7 and 6–11 years (Measelle et al. 1998; Valla et al. 1994). However, the SDQ and these interview methods differ greatly in the extent to which they take into account the developmental level of the elementary school child. The interview methods do so by giving both visual (graphics) and auditory stimuli. The cognitive abilities of children below age 12 may not be sufficiently developed to respond adequately to the SDQ questions, which are presented only as written verbal information (Edelbrock et al. 1985; Fallon and Schwab-Stone 1994). Therefore, we have focused on the parent and teacher versions of the SDQ in this review.

Limitations

Some limitations of this review should be noted. First, the methodologies varied across the reviewed studies, making it sometimes impossible to extract data from those studies. Comparing these studies with each other was therefore difficult, and conducting a meta-analysis on the data was not possible. Second, many studies did not state which parent was used as a rater, making it hard to draw specific conclusions concerning rater bias. In addition, it was beyond our scope to consider rater psychopathology. Third, few studies were conducted using a longitudinal design, making it hard to draw robust conclusions regarding predictive validity. In addition, the reviewed studies did not give sufficient attention to validation of the prosocial scale. Future research should reveal whether the SDQ predicts psychosocial problems and whether the prosocial scale correlates with other measures of prosocial behavior.

Implications

With these limitations in mind, the implications of these results for practice and research can be noted. This review offers researchers and clinicians a clear overview of the psychometric properties of the parent and teacher versions of the SDQ for 4- to 12-year-olds. Reliability and validity results at the subscale level were found to be weaker than the results for the total scales. Therefore, caution is warranted when using and interpreting the subscales of the SDQ separately. Sanne et al. (2009) argued that the distinctiveness of the subscales is not convincing. An explanation for this may be the high comorbidity of psychosocial problems (Ford et al. 2003). Moreover, caution is warranted if a single informant reports on the SDQ, as results may not generalize to other contexts. The use of multiple informants should always be a priority when using the SDQ. Most studies used parents and teachers, but possibilities of using other informants should be explored. For example, neighbors, daycare workers, or sports club coaches might be able to report on children’s psychosocial problems. Future research should reveal whether these informants are able to assess psychosocial problems reliably.

For clinical practice in particular, the SDQ is a useful instrument for quickly assessing possible psychosocial problems. The results found in this review give rise to some specific implications at the subscale level.

First, the prosocial subscale shows some weaknesses in its psychometric properties, especially for the parent version. Low levels of prosocial behavior and high levels of aggression have been shown to increase the risk for future social adjustment difficulties (Coie et al. 1982; Crick 1996; Romano et al. 2005). Excessively high levels of prosocial behavior are also a risk factor for psychopathology (Hay 1994; Perren et al. 2007), underscoring the importance of assessing prosocial behavior. Therefore, when assessing prosocial behavior, teacher ratings should always be included in addition to parent ratings. Further assessment of the child, for example by observing the child in the classroom or in a naturalistic play situation, should reveal whether the reported lack of prosocial behavior is confirmed by a mental health specialist. When a child is referred for treatment, interventions target an increase in prosocial behavior rather than a decrease in aversive behaviors (Coie and Koeppl 1990). This emphasizes the importance of assessing prosocial behavior adequately.

Second, the psychometric properties of the hyperactivity/inattention scale are adequate, and the SDQ should thus provide a reliable and valid report as to whether ADHD symptoms are present. However, when an ADHD diagnosis is suspected, identification of one of the subtypes Inattentive, Hyperactive-Impulsive, or Combined is required (American Psychiatric Association 1994). Further assessment may be done by using one of the many ADHD rating scales available, such as the SNAP-IV (Swanson 1992) or the SWAN (Swanson et al. 2001).

For the emotional symptoms subscale, psychometric properties are also adequate. However, in contrast to externalizing problems, internalizing problems are reported more accurately by children themselves than by their parents and teachers (Edelbrock et al. 1985; Ederer 2004). Gaining insight into the child’s subjective experience of its emotional symptoms is thus highly relevant and advisable in clinical settings.

The conduct problems subscale shows adequate reliability and validity. In order to assess whether a diagnosis of Oppositional Defiant Disorder or Conduct Disorder would be justified, additional assessment is indicated. Because children themselves tend to underestimate their externalizing problems, parents and teachers are particularly important in the further assessment of children presenting with conduct problems (Loeber et al. 1991).

Finally, the psychometric properties of the peer problems scale are quite weak in some respects. Assessing peer problems is complicated because children are considered as “insiders” who contribute unique information about their peer group. Possibly, it is difficult for parents and teachers to estimate the problems children experience in their peer group because they are “outsiders”. Because of the difficulties with assessing peer problems, additional assessment is essential. The Perceived Competence Scale for Children (Harter 1982) is a very suitable measure for this purpose. Further, classroom observation is recommended (Wragg 1994).

The SDQ is not intended to be used as a psychiatric diagnostic instrument and therefore should not be utilized as such. As a screening instrument, the SDQ performs very well and adds to the field of early detection of child psychopathology. The SDQ has been translated into over 60 languages, which is a great benefit. However, norms are available only for six countries. Culture plays a role in the distribution and expression of psychosocial problems in society, and thus norms for every culture should be established. Results from studies assessing capacity to discriminate showed that the SDQ distinguishes well between children with and those without diagnoses. In populations at risk for psychosocial problems, such as children attending pediatric clinics, we recommend screening of all children referred to specialist services.

For research purposes, longitudinal designs should be employed in order to assess predictive validity more thoroughly. The SDQ is a promising instrument for researching developmental pathways, as it seems to be well validated, short, and acceptable. The teacher version shows strong psychometric properties, but our review shows that the parent version is the focus of research (17 out of 48 studies examined only the parent version of the SDQ). Researchers thus do not make full use of a multi-informant approach. We do argue for such an approach, as it is essential for children, their parents, and society when psychosocial problems are found at a young age.