Table 2

Basic psychometric properties used to evaluate child development assessment tools (CDATs)

Reliability

 Internal consistency
   Relevance/Importance: Evaluates the similarity of test items assessed within one domain. One measure is split-half reliability, which compares the scores on two halves of a test in a single domain.
   Comment: High internal consistency suggests that some items are too similar, so no additional information is gained from assessing them all. Low internal consistency suggests the items may not be assessing the same domain.

 Interobserver
   Relevance/Importance: Evaluates variability between different assessors scoring the same subject.
   Comment: There may be systematic errors specific to a particular group of assessors, so this parameter may not be generalisable when the tool is used by a different group of assessors.

 Intraobserver
   Relevance/Importance: Evaluates variability within a single assessor scoring a single subject.
   Comment: Commonly evaluated by the same assessor scoring video recordings of their own assessments. This is not essential unless interobserver reliability is low.

 Test-retest
   Relevance/Importance: Evaluates variability within the subject (influenced by random factors such as familiarity with items and mood).
   Comment: Difficult to interpret in early childhood, when changes in development occur over a short time. The repeat assessment should usually be carried out within 2 weeks of the first test.

Validity

 Content
   Relevance/Importance: Experts in the field reach consensus agreement on whether the individual items, and the range of items, adequately sample and represent the domain of interest.
   Comment: A subjective measure that cannot be used in isolation to evaluate validity.

 Criterion
   Relevance/Importance: Ideally assessed by comparison with an established 'gold standard' test assessing the same construct.
   Comment: 'Gold standard' tests are usually unavailable, so the comparison is typically against another recognised test regularly used in the same population and thought to measure the same domain.

 Discriminant/convergent
   Relevance/Importance: Evaluates expected positive and negative correlations between scores in different domains, or between different tests of the same or differing underlying constructs.
   Comment: Scores from two independent tests of one domain (eg, one using a report method, the other a direct test) should correlate well where neither test is considered a 'gold standard'. To ensure the test does not overlap with constructs not of interest, scores evaluating different constructs should correlate poorly; for example, scores on 'fine motor' should correlate poorly with scores on 'social emotional'.

 Construct
   Relevance/Importance: Statistical evaluation of whether observed data fit a theoretical model of the constructs (confirmatory), or exploration of a possible model of the 'underlying traits' being measured (exploratory).
   Comment: Large numbers of assessments are required to evaluate this.
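The split-half reliability mentioned under internal consistency can be made concrete. A minimal Python sketch, using invented item scores for illustration: it correlates odd- and even-numbered item totals across subjects, then applies the standard Spearman-Brown correction to estimate the reliability of the full-length test.

```python
def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def split_half_reliability(item_scores):
    """Correlate odd- and even-numbered item totals across subjects,
    then step up to full test length with the Spearman-Brown formula."""
    odd = [sum(row[0::2]) for row in item_scores]
    even = [sum(row[1::2]) for row in item_scores]
    r = pearson(odd, even)
    return 2 * r / (1 + r)  # Spearman-Brown prophecy formula

# Hypothetical data: four children, six items within one domain
scores = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [2, 2, 2, 2, 2, 2],
    [1, 0, 1, 0, 1, 0],
]
print(round(split_half_reliability(scores), 3))  # prints 0.921
```

A value near 1 would be read, per the table's comment, as possible item redundancy rather than as unqualified good news.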
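Interobserver reliability for categorical item scores is commonly summarised with Cohen's kappa, which corrects raw agreement for agreement expected by chance. A sketch assuming two raters coding the same six children as 'pass' or 'fail' (all data hypothetical):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on categorical codes."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    ca, cb = Counter(rater_a), Counter(rater_b)
    # Agreement expected if both raters coded independently at their base rates
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)

rater_a = ["pass", "pass", "fail", "pass", "fail", "pass"]
rater_b = ["pass", "fail", "fail", "pass", "fail", "pass"]
print(round(cohens_kappa(rater_a, rater_b), 3))  # prints 0.667
```

As the table's comment notes, a kappa obtained from one group of assessors may not generalise to a differently trained group.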
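The discriminant/convergent logic in the table can also be sketched numerically: two tests of the same construct should correlate well, while scores on unrelated constructs should correlate poorly. The scores below are invented standardised values for five children, purely for illustration.

```python
def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical standardised scores for five children
fine_motor_report = [48, 52, 60, 41, 55]   # parent-report measure
fine_motor_direct = [50, 51, 62, 40, 57]   # same construct, direct testing
social_emotional  = [52, 45, 51, 50, 52]   # different construct

convergent = pearson(fine_motor_report, fine_motor_direct)
discriminant = pearson(fine_motor_report, social_emotional)
print(round(convergent, 3), round(discriminant, 3))  # prints 0.987 0.072
```

The high first value supports convergent validity of the two fine motor measures; the near-zero second value supports discrimination from the social-emotional domain.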