IV. Reliability 

Test reliability (consistency) is an essential requirement for test validity. Test validity is the degree to which a test measures what it is designed to measure.

Four Types of Reliability

       Researchers use four methods to check the reliability of a test: the test-retest method, alternate forms, internal consistency, and inter-scorer reliability. Not all of these methods are used for all tests. Each method provides research evidence that the responses are consistent under certain circumstances.

(1.) Test-Retest- a method of estimating test reliability in which a test developer or researcher gives the same test to the same group of research participants on two different occasions. The scores from the two administrations are then correlated to produce a stability coefficient. Studying the coefficients for a particular test allows the assessor to see how stable the test is over time.

Example: The test-retest reliability of the WISC-IV was evaluated with data from 243 children. The WISC-IV was administered on two separate occasions, with a mean test-retest interval of 32 days. The average corrected Full Scale IQ stability coefficient was .93.

(See Table 2 of WISC-IV Technical Report 2.)

 Computing a (Test-Retest) Stability Coefficient
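A stability coefficient is the correlation between scores from the two administrations. The sketch below assumes the Pearson correlation is the statistic being used and works on made-up scores for five participants; the function name pearson_r and the data are illustrative only. It does not apply the variability correction implied by the "corrected" WISC-IV coefficient above.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products of deviations from each mean
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)

# Hypothetical scores for the same five participants on two occasions
first_testing = [98, 105, 110, 92, 120]
second_testing = [100, 103, 112, 95, 118]

stability = pearson_r(first_testing, second_testing)
print(round(stability, 2))
```

A coefficient close to 1.0 means participants kept nearly the same rank order and spacing across the two occasions.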

 

(2.) Alternate Forms- This method uses a second form of a test made up of similar, but not identical, items. Researchers administer this second “parallel” form after having already administered the first form. This allows researchers to determine a reliability coefficient that reflects error due to different times and different items while controlling for test form. By administering form A to one group and form B to another group, and then, at the next administration, form B to the first group and form A to the second group, researchers are able to find a coefficient of stability and equivalence. This is the correlation between scores on the two forms, and it takes into account error due to different times and forms.

Example: The ACT is an academic test used in the college admission process. There are four academic subtests: English, mathematics, reading, and natural science reading. A standard score scale is used to report scores on the four academic tests. There is also a composite score, the average of the standard scores on the four subtests. Scaled score equivalents are provided for each form of the test by the equipercentile method, based on the score distribution of an anchor form of the ACT. New forms of the test are equated to older forms by giving both forms to parallel samples of students and then equating the forms by the equipercentile method (Aiken, 1985).

(3.) Internal Consistency- two ways to measure the consistency of a test with only one form.

        (a)  Split-Half Reliability

Split-half reliability is one way to estimate the reliability of an instrument without running into the practice effects that can occur with the test-retest method. The process begins by “splitting in half” all items of a test that are intended to probe the same area of knowledge, forming two “sets” of items. Scores on the two halves are then correlated. If the two halves give consistent results, the experimenter can infer that the items are most likely measuring the same thing. Because each half is shorter than the full test, the Spearman-Brown Prophecy Formula is used to estimate the reliability of the full-length test from the correlation between the halves.

*Simply put: If items on a test can be divided into two halves and give the same results, your test is reliable.

Example: The Beck Depression Inventory (BDI) has 21 questions. Its split-half reliability coefficient is .93.
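The split-half procedure can be sketched in a few lines. The example below assumes an odd/even split of the items (one common choice, not the only one) and uses the Spearman-Brown formula for a test of doubled length, r_full = 2r / (1 + r). The responses are invented for illustration and are not BDI data; the function names are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)

def spearman_brown(r_half):
    """Step a half-test correlation up to the full test length."""
    return 2 * r_half / (1 + r_half)

def split_half_reliability(item_scores):
    """item_scores holds one list of item scores per respondent."""
    odd_totals = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    return spearman_brown(pearson_r(odd_totals, even_totals))

# Hypothetical responses: five respondents, six items scored 0 to 3
responses = [
    [3, 2, 3, 3, 2, 3],
    [1, 1, 0, 1, 1, 0],
    [2, 2, 2, 1, 2, 2],
    [0, 1, 0, 0, 1, 1],
    [3, 3, 2, 3, 3, 3],
]

reliability = split_half_reliability(responses)
print(round(reliability, 2))
```

Different ways of splitting the items can give somewhat different coefficients, which is one reason the formulas in (b) are often preferred.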

(b) If dividing a test is impractical, researchers use either the Kuder-Richardson or the Cronbach alpha formula to measure a test's internal consistency.

Example: Glenn Walters's "Psychological Inventory of Criminal Thinking Styles" (PICTS) (2004) assesses criminal thinking styles. It is commonly used to evaluate treatment programs in jails and prisons. Cronbach's alpha is a statistic used to evaluate the scale's internal consistency. Click here to see an example of Cronbach's alpha used to evaluate the internal consistency of the PICTS.
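For readers who want to see the arithmetic, here is a minimal sketch of Cronbach's alpha using the standard formula alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores), where k is the number of items. The data are invented for illustration and are not PICTS responses; the function names are hypothetical.

```python
def sample_variance(values):
    """Unbiased sample variance (divides by n - 1)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

def cronbach_alpha(item_scores):
    """item_scores holds one list of item scores per respondent."""
    k = len(item_scores[0])  # number of items
    item_variances = [
        sample_variance([row[i] for row in item_scores]) for i in range(k)
    ]
    total_variance = sample_variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

# Hypothetical responses: five respondents, six items scored 0 to 3
responses = [
    [3, 2, 3, 3, 2, 3],
    [1, 1, 0, 1, 1, 0],
    [2, 2, 2, 1, 2, 2],
    [0, 1, 0, 0, 1, 1],
    [3, 3, 2, 3, 3, 3],
]

alpha = cronbach_alpha(responses)
print(round(alpha, 2))
```

Alpha rises when the items vary together, so that the variance of the total score is large relative to the sum of the separate item variances.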

 

(4.) Inter-scorer reliability- measures the degree of agreement between persons scoring a subjective test (like an essay exam) or rating an individual. With regard to the latter, this type of reliability is most often used when scorers have to observe and rate the actions of participants in a study. It reveals how well the scorers agreed when rating the same set of things. Other names for this type of reliability are inter-rater reliability and inter-observer reliability.

Example

 

 

A Hypothetical Example

 

Example I: The hypothetical example that follows was intentionally kept very small (i.e., small number of participants), so that the calculations could be replicated with relative ease. (In "real life," the number of participants would certainly be greater than 10, and the numbers of the levels of the other facets--such as number of raters--might also be greater.) In this hypothetical example, 10 children (participants) were each rated by two different raters, independently, on the "quality of perceived physical activity" demonstrated on the playground at school. The ratings occurred on six different occasions (days). At the end of each observation session (approximately 1/2 hr long on each day), each rater assigned an overall quality rating, using a 7-point anchored scale ranging from 1 (low) to 7 (high). Higher ratings, therefore, meant that the observed activities were of higher quality (e.g., included more mobility, were age-appropriate, etc.). The hypothetical data are presented in Table 1. For purposes of illustrating the different results that can be obtained with different reliability estimation techniques, the data are assumed to be interval-level.

Estimates of Inter-rater Agreement and Reliability

Simple percentages of agreement and kappa. To estimate interrater reliability of observational data, percentages of agreement are often calculated--especially if the number of scale points is small. Percentages of agreement can be calculated in a number of different ways, depending on the definition of agreement. In Table 2, the percentages of agreement between the raters for each occasion (day) are presented two ways: first, for the case in which agreement meant an exact match between raters in their assigned ratings; second, for the case in which agreement was defined more leniently as either exact agreement, or differences between the two raters' scores of not more than one point in either direction. (This latter definition of agreement has been used fairly often in the estimation of interrater agreement of some types of measures, such as parent-infant interaction scales [Goodwin & Sandall, 1988].) As would be expected, percentages of agreement are lower when agreement is defined in the more conservative way (exact match). The results shown in Table 2 demonstrate that the median percentage of agreement for the 6 days, when agreement was defined as exact match, was 20%; the median percentage of agreement for the 6 days, when the more liberal definition of agreement was used, was 80%. (Percentages of agreement were not calculated for the total scores in Table 1 because this approach to reliability estimation is rarely used if the range of scores is large; here, the total scores could range from 6 to 42.)
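The two definitions of agreement described above can be sketched as a single function with a tolerance parameter. The ratings below are invented for one hypothetical day and are not the data from Table 1; the function name percent_agreement and the tolerance parameter are illustrative assumptions.

```python
def percent_agreement(rater_a, rater_b, tolerance=0):
    """Percentage of cases where the two ratings differ by at most `tolerance`.

    tolerance=0 counts exact matches only; tolerance=1 also counts
    ratings that differ by one point in either direction.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty, equal-length lists of ratings")
    agreements = sum(
        1 for a, b in zip(rater_a, rater_b) if abs(a - b) <= tolerance
    )
    return 100.0 * agreements / len(rater_a)

# Hypothetical ratings of 10 children by two raters on a 7-point scale
rater_1 = [4, 5, 3, 6, 2, 5, 4, 7, 3, 5]
rater_2 = [4, 4, 3, 5, 4, 6, 4, 6, 2, 5]

exact = percent_agreement(rater_1, rater_2)                 # exact match
lenient = percent_agreement(rater_1, rater_2, tolerance=1)  # within one point
print(exact, lenient)  # 40.0 90.0
```

As in the passage above, the lenient definition gives a higher percentage than the exact-match definition for the same ratings.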

In addition to simple percentages of agreement, Cohen's kappa (KAPPA) was also calculated for the exact match percentages of agreement, and the results are shown in Table 2. The KAPPA coefficients indicate the extent of agreement between the raters, after removing that part of their agreement that is attributable to chance. As can be seen, the values of the KAPPA statistic are much lower than the simple percentages of agreement (Goodwin, 2001).
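To show how chance agreement is removed, here is a minimal sketch of unweighted Cohen's kappa, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of exact matches and p_e is the proportion expected by chance from each rater's own distribution of ratings. The ratings are invented and are not the data from Tables 1 and 2; the function name is hypothetical.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters rating the same cases."""
    n = len(rater_a)
    # Observed proportion of exact agreement
    p_observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    # Chance agreement: product of the two raters' marginal proportions,
    # summed over rating categories (Counter returns 0 for unused categories)
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    p_chance = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    # Undefined (division by zero) if both raters use one identical category
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical ratings of 10 children by two raters on a 7-point scale
rater_1 = [4, 5, 3, 6, 2, 5, 4, 7, 3, 5]
rater_2 = [4, 4, 3, 5, 4, 6, 4, 6, 2, 5]

kappa = cohens_kappa(rater_1, rater_2)
print(round(kappa, 2))
```

For these invented ratings the exact-match agreement is 40%, and kappa comes out lower than that proportion because part of the agreement is attributed to chance.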

Click here to see the tables

The Reliability Coefficient

WHAT DIFFERENT "RELIABILITY" COEFFICIENTS ASSESS

Standard Error of Measurement

 
