The validity of a test concerns what the test measures and how well it does so. Whether a test is valid depends, in part, on its specific purpose.  For example, is the test valid for this particular purpose, in this particular situation, for these particular participants? Only after a test's reliability (consistency) has been established can researchers consider a test's validity.

Construct validity is the extent to which a test measures a theoretical concept or trait, such as a personality characteristic.  Construct validity can include measures of content and criterion-related validity.  

  • Discriminant validity is shown when an assessment does NOT correlate highly with another assessment that measures an unrelated concept. For example, we would NOT expect Graduate Record Examination (GRE) scores to be highly correlated with self-esteem or shyness measures.

  • Convergent validity is a form of construct validity that refers to the degree to which actual test results correspond to the results we would expect. For example, if a test of leadership skills were high in convergent validity, we would expect people in company management to score higher on it than regular employees.

  • Content validity refers to two questions: 

                      (1) Does the test cover the content of interest? For example, are the items on an achievement test for statistics based on statistical concepts? 

                      (2) Is the test appropriate for your participants? For example, are the items geared toward college math majors or psychology majors?

          Content validity is evaluated in one of two ways: subjectively or empirically.

                      (1) Subjective methods involve asking experts to judge the relevance of  the test items.

                      (2) Empirical methods identify which test items can be grouped or categorized together.

          Researchers use factor analysis to investigate actual patterns of participants’ performance on a test, such as the Wechsler Intelligence Scale for Children – Fourth Edition (WISC-IV). Such studies give empirical evidence for organizing subtests into broader categories, called factors or “indices”.
          Factor analysis is a detailed empirical method that can be used to either:
                                        (a) find a pattern of test item performance, or
                                        (b) confirm that test items fit a performance pattern predicted by theory.

         Example: In a study (Williams, Weiss, & Rolfus, 2003) of 1,525 children, the 15 subtests of the WISC-IV formed four factors. Children tended to perform similarly on the subtests within each factor; in other words, the subtests within each factor seemed to be measuring similar abilities. For example, the similarities, vocabulary, comprehension, information, and word reasoning subtests formed one factor. The researchers looked at the content of these subtests and decided to name the factor Verbal Comprehension. So if children do very well at defining vocabulary words, we would expect them to show strong social comprehension skills as well. This gives evidence for convergent validity within a factor.

     The remaining 10 subtests did not fall into the verbal comprehension index; they clustered into three other indices. The fact that there are different factors or indices (i.e., verbal comprehension, perceptual reasoning, working memory, processing speed) gives evidence that children’s performance may differ from one index to another. For example, children who do very well at defining vocabulary words will not necessarily do well at copying designs with colored blocks, and children who do very well on verbal comprehension subtests might do very poorly on working memory subtests. The fact that certain subtests fit better into one factor and not another gives evidence for discriminant validity.
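
     The pattern described in this example can be illustrated with a small sketch. The scores below are invented values for six hypothetical children on three subtests (they are not data from Williams, Weiss, & Rolfus, 2003), and the Pearson correlation is an assumed choice of statistic. Two subtests from the same factor should correlate highly (convergent evidence), while subtests from different factors should correlate weakly (discriminant evidence).

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical scores for six children (illustration only, not real data).
vocabulary    = [12, 8, 15, 10, 6, 13]   # verbal comprehension subtest
comprehension = [11, 9, 14, 10, 7, 12]   # verbal comprehension subtest
block_design  = [10, 12, 9, 7, 11, 13]   # subtest from a different factor

within_factor = pearson_r(vocabulary, comprehension)  # about 0.99
across_factor = pearson_r(vocabulary, block_design)   # about -0.15

print(f"Same factor:      r = {within_factor:.2f}")
print(f"Different factor: r = {across_factor:.2f}")
```

     A full factor analysis works on the correlations among all subtests at once; this sketch only shows the kind of correlation pattern that such an analysis picks up.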

  • Criterion-related validity has two types: concurrent validity and predictive validity. Both types are based on correlation.

               (1)  Concurrent validity: If a test is said to measure intelligence, we must show that scores on the test are highly correlated with performance on an established test of intelligence (the standard or criterion for intelligence). To establish concurrent validity, researchers administer the test to a group of participants and compare the scores with a criterion measure, a standard that reflects the variable being tested. Concurrent validity is also used when a test is given to people in various categories (e.g., clinical diagnostic groups or socioeconomic levels) to help determine whether the test scores of people in one category differ significantly from those of people in another category.

               (2)  Predictive validity: This is a type of criterion-related validity in which the criterion measures are obtained in the future, usually months or years after the test scores are obtained. An example is using entrance exam scores to predict which students will graduate from college. The ideal situation for this type of validity is to administer the test during a time of open enrollment, hiring, etc., so that a full range of results is possible on the outcome measures.
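
               Both types of criterion-related validity come down to a correlation between test scores and a criterion. As a sketch, assuming the Pearson product-moment correlation is the statistic used (the text says only "correlation"), with x_i standing for participant i's test score, y_i for that participant's criterion score, and n for the number of participants (all symbols introduced here for illustration):

```latex
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,
               \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

               For concurrent validity, the y_i are collected at about the same time as the test scores; for predictive validity, they are collected later. Values of r_xy close to 1 indicate that the test tracks the criterion closely.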

                Example: The Pre-Kindergarten Screen (PKS) is a standardized screening measure that assesses school readiness in children between 4 years, 0 months and 5 years, 11 months of age. There are 10 scores: gross motor skills, fine motor skills, following directions, block tapping, visual matching, visual memory, imitation, basic academic skills, delayed gratification, and total score. PKS predictive validity was examined by comparing children’s pre-kindergarten PKS scores with both their kindergarten outcomes and their teachers’ identification of the highest- and lowest-performing students. For the comparison with kindergarten outcomes, the PKS accurately classified 98.7% of a group of 392 children. For the comparison with high- and low-performing children, it accurately classified 91.2% of 125 children (Chittobran, 2003).
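
                A classification percentage of this kind is the share of children whose predicted category matches the category they were later observed to fall into. The sketch below shows the calculation. The function name, the category labels, and the six cases are all hypothetical and are not PKS data; only the final two lines use the figures reported above, to convert the percentages back into approximate head counts.

```python
def classification_accuracy(predicted, observed):
    """Return the percentage of cases where the prediction matches the outcome."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    hits = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * hits / len(observed)

# Invented example: screening prediction vs. later kindergarten outcome.
predicted = ["ready", "ready", "at risk", "ready", "at risk", "ready"]
observed  = ["ready", "ready", "at risk", "at risk", "at risk", "ready"]

accuracy = classification_accuracy(predicted, observed)  # 5 of 6 cases match
print(f"Correctly classified: {accuracy:.1f}%")

# Reported percentages expressed as approximate numbers of children.
correct_of_392 = round(0.987 * 392)  # about 387 of 392 children
correct_of_125 = round(0.912 * 125)  # 114 of 125 children
```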