Test reliability (consistency) is an essential requirement for test validity. Test validity is the degree to which a test measures what it is designed to measure.

Researchers use four methods to check the reliability of a test: the test-retest method, alternate forms, internal consistency, and inter-scorer reliability. Not all of these methods are used for all tests. Each method provides research evidence that the responses are consistent under certain circumstances.

(1.) Test-Retest- a method of estimating test reliability in which a test developer or researcher gives the same test to the same group of research participants on two different occasions. The results from the two tests are then correlated to produce a stability coefficient. Studying the coefficients for a particular test allows the assessor to see how stable the test is over time.

Example: The information obtained for test-retest reliability of the WISC-IV was evaluated with information from 243 children. The WISC-IV was administered two separate times with the test-retest mean interval of 32 days. The average corrected Full Scale IQ stability coefficient was .93.

(See Table 2 of WISC-IV Technical Report 2.)

 Computing a (Test-Retest) Stability Coefficient
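As a rough illustration, a stability coefficient is simply the Pearson correlation between the scores from the two administrations. The sketch below assumes five examinees with made-up scores (these are not WISC-IV data), and the helper name `pearson_r` is introduced only for this example. It does not apply the variability correction behind the "corrected" coefficient reported for the WISC-IV.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares of the deviation scores.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)

# Hypothetical scores for the same five examinees on two occasions.
first_testing = [98, 105, 110, 121, 87]
second_testing = [101, 103, 114, 119, 90]

stability = pearson_r(first_testing, second_testing)
print(f"stability coefficient: {stability:.2f}")
```

A coefficient close to 1.0 means examinees kept nearly the same rank order across the two testings.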

(2.) Alternate Forms- This method uses a second form of a test consisting of similar, but not identical, items. Researchers administer this second “parallel” form after having already administered the first form. This allows researchers to determine a reliability coefficient that reflects error due to different times and different items while controlling for test form. By administering form A to one group and form B to another group, and then form B to the first group and form A to the second group at the next administration, researchers are able to find a coefficient of stability and equivalence. This is the correlation between scores on the two forms, and it takes into account error due to different times and forms.

Example: The ACT is an academic test used in the college admission process. There are four academic subtests: English, mathematics, reading, and natural science reading. A standard score scale is used to report scores on the four academic tests. There is also a composite score – the average of standard scores on the four subtests. Scaled score equivalents are provided for each form of the test by the equipercentile method, based on the score distribution of an anchor form of the ACT. New forms of the test are equated to older forms by giving both forms to parallel samples of students and then equating the forms by the equipercentile method (Aiken, 1985).
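The idea behind equipercentile equating can be shown with a deliberately crude sketch: a score on a new form is mapped to the anchor-form score that has the same percentile rank. The function names `percentile_rank` and `equipercentile_equate` and all scores below are hypothetical. Operational equating uses smoothing and interpolation that are omitted here, so this should not be read as the ACT's actual procedure.

```python
def percentile_rank(scores, x):
    """Percent of scores below x, counting half of the scores equal to x."""
    below = sum(s < x for s in scores)
    equal = sum(s == x for s in scores)
    return 100.0 * (below + 0.5 * equal) / len(scores)

def equipercentile_equate(new_scores, anchor_scores, x):
    """Return the observed anchor-form score whose percentile rank is closest
    to the percentile rank of x on the new form (no smoothing, no interpolation)."""
    target = percentile_rank(new_scores, x)
    candidates = sorted(set(anchor_scores))
    return min(candidates, key=lambda y: abs(percentile_rank(anchor_scores, y) - target))

# Hypothetical raw scores from two parallel samples of students.
new_form_scores = [8, 10, 12, 14, 16]
anchor_form_scores = [10, 12, 14, 16, 18]

equated = equipercentile_equate(new_form_scores, anchor_form_scores, 12)
print(f"a raw score of 12 on the new form corresponds to {equated} on the anchor form")
```

In this toy data the new form is two points harder at every rank, so each new-form score is equated to an anchor score two points higher.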


(3.) Internal Consistency- three ways to measure the consistency of a test with only one form.

           (A)  Split-Half Reliability

What is Split-Half Reliability?

The test is given once and divided into halves that are scored separately; the scores on one half are then compared with the scores on the other half to estimate reliability (Kaplan & Saccuzzo, 2001).

Why use Split-Half?

Split-half reliability is a useful measure when it is impractical or undesirable to assess reliability with two tests or to have two test administrations (because of limited time or money) (Cohen & Swerdlik, 2001).

How do I use Split-Half?

1st- Divide the test into halves. The most common way to do this is to assign odd-numbered items to one half of the test and even-numbered items to the other; this is called odd-even reliability.

2nd- Find the correlation of scores between the two halves by using the Pearson r formula.

3rd- Adjust the correlation using the Spearman-Brown formula, which raises the reliability estimate to the level expected for the full-length test. Longer tests tend to be more reliable, so the correlation between two half-length tests underestimates the reliability of the whole test; the Spearman-Brown formula is therefore applied to a test that has been shortened, as in split-half reliability (Kaplan & Saccuzzo, 2001).

 Spearman-Brown formula

$$r_{SB} = \frac{2r}{1 + r}$$

$r_{SB}$ = estimated reliability of the full-length test

r = correlation between the two halves (Pearson r) (Kaplan & Saccuzzo, 2001).
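The three steps can be sketched in a few lines of Python. This is a minimal illustration, assuming each examinee's responses are stored as a list of item scores; the data are made up, and the names `pearson_r` and `split_half_reliability` are introduced only for this example.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)

def split_half_reliability(item_scores):
    """Odd-even split-half reliability.

    item_scores holds one list of item scores per examinee.
    Returns (half-test correlation, Spearman-Brown corrected estimate).
    """
    # Step 1: odd-numbered items (1, 3, 5, ...) form one half, even-numbered the other.
    odd_totals = [sum(row[0::2]) for row in item_scores]
    even_totals = [sum(row[1::2]) for row in item_scores]
    # Step 2: correlate the two half scores.
    r_half = pearson_r(odd_totals, even_totals)
    # Step 3: Spearman-Brown correction for the full-length test.
    r_full = 2 * r_half / (1 + r_half)
    return r_half, r_full

# Hypothetical right/wrong (1/0) responses: five examinees, six items.
responses = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

r_half, r_full = split_half_reliability(responses)
print(f"half-test r = {r_half:.2f}, corrected estimate = {r_full:.2f}")
```

For any positive half-test correlation below 1, the corrected estimate is higher than the raw correlation, which reflects the point that a longer test is more reliable.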

          (B)  Kuder-Richardson Formula

Another way to internally evaluate a test would be to use the Kuder-Richardson 20. This is only advisable if the test has dichotomous items (usually scored as right or wrong).

$$KR_{20} = r = \frac{N}{N-1}\left(\frac{S^2 - \sum pq}{S^2}\right)$$

KR20 = reliability estimate (r)

N = the number of items on the test

$S^2$ = the variance of the total test score

p = proportion of people getting each item correct (this is found separately for each item)

q = the proportion of people getting each item incorrect. For each item q equals 1-p.

$\sum pq$ = the sum of the products of p times q for each item on the test.

(Kaplan & Saccuzzo, 2001)
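A small worked sketch may make the terms concrete. It assumes right/wrong items scored 1/0 and uses the population variance (dividing by the number of examinees) for the total score, so that it is on the same footing as the p times q item variances; that choice is an assumption of this example. The data and the function name `kr20` are hypothetical.

```python
def kr20(item_scores):
    """Kuder-Richardson 20 for dichotomous (1/0) items.

    item_scores holds one list of 1/0 item scores per examinee.
    """
    n_people = len(item_scores)
    n_items = len(item_scores[0])  # N in the formula

    # S^2: variance of the total test scores (population form, an assumption here).
    totals = [sum(row) for row in item_scores]
    mean_total = sum(totals) / n_people
    s2 = sum((t - mean_total) ** 2 for t in totals) / n_people

    # Sum of p*q over items, where p is the proportion correct and q = 1 - p.
    sum_pq = 0.0
    for i in range(n_items):
        p = sum(row[i] for row in item_scores) / n_people
        sum_pq += p * (1 - p)

    return (n_items / (n_items - 1)) * ((s2 - sum_pq) / s2)

# Hypothetical right/wrong (1/0) responses: five examinees, six items.
responses = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

print(f"KR20 = {kr20(responses):.2f}")
```

Here the total scores are 6, 4, 3, 1, and 0, giving a variance of 4.56, and the item p times q products sum to 1.2, so KR20 = (6/5)(3.36/4.56), or about .88.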

         (C)  Cronbach Alpha/Coefficient Alpha

The Cronbach Alpha/Coefficient Alpha formula is a general formula for estimating the reliability of a test consisting of items on which different scoring weights may be assigned to different responses.

$$\alpha = \frac{k}{k-1}\left(1 - \frac{\sum s_i^2}{s_t^2}\right)$$

k = the number of items

$s_i^2$ = the variance of scores on item i

$s_t^2$ = the variance of total test scores

(Aiken, 2003)
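Coefficient alpha can be computed directly from the item variances and the variance of the total scores. The sketch below assumes ratings on a 1-to-5 scale; the data and the function name `cronbach_alpha` are made up for illustration. It uses `statistics.pvariance` for every variance; as long as the same variance formula is used for items and totals, the ratio in the formula is unaffected by the choice.

```python
from statistics import pvariance

def cronbach_alpha(item_scores):
    """Coefficient alpha; item_scores holds one list of item scores per respondent."""
    k = len(item_scores[0])  # number of items
    # Variance of scores on each item i.
    item_variances = [pvariance([row[i] for row in item_scores]) for i in range(k)]
    # Variance of the total test scores.
    total_variance = pvariance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

# Hypothetical ratings (1 to 5) from five respondents on four items.
ratings = [
    [4, 5, 4, 5],
    [3, 3, 2, 3],
    [5, 5, 5, 4],
    [2, 1, 2, 2],
    [3, 4, 3, 3],
]

alpha = cronbach_alpha(ratings)
print(f"coefficient alpha = {alpha:.2f}")
```

When every item is scored 1 or 0, each item variance equals p times q, so the same function gives the KR20 value.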

(4.) Inter-scorer reliability- measures the degree of agreement between persons scoring a subjective test (like an essay exam) or rating an individual. With regard to the latter, this type of reliability is most often used when scorers have to observe and rate the actions of participants in a study. This method reveals how well the scorers agreed when rating the same set of things. Other names for this type of reliability are inter-rater reliability and inter-observer reliability.

Estimates of Inter-rater Agreement and Reliability

In addition to simple percentages of agreement, Cohen's kappa (KAPPA) was also calculated for the exact match percentages of agreement, and the results are shown in Table 2. The KAPPA coefficients indicate the extent of agreement between the raters, after removing that part of their agreement that is attributable to chance. As can be seen, the values of the KAPPA statistic are much lower than the simple percentages of agreement (Goodwin, 2001).

Simple percentages of agreement and kappa. To estimate inter-rater reliability of observational data, percentages of agreement are often calculated--especially if the number of scale points is small. Percentages of agreement can be calculated in a number of different ways, depending on the definition of agreement.

For Example:

In Table 2, the percentages of agreement between the raters for each occasion (day) are presented two ways: first, for the case in which agreement meant an exact match between raters in their assigned ratings; second, for the case in which agreement was defined more leniently as either exact agreement, or differences between the two raters' scores of not more than one point in either direction. (This latter definition of agreement has been used fairly often in the estimation of interrater agreement of some types of measures, such as parent-infant interaction scales [Goodwin & Sandall, 1988].) As would be expected, percentages of agreement are lower when agreement is defined in the more conservative way (exact match). The results shown in Table 2 demonstrate that the median percentage of agreement for the 6 days, when agreement was defined as exact match, was 20%; the median percentage of agreement for the 6 days, when the more liberal definition of agreement was used, was 80%. (Percentages of agreement were not calculated for the total scores in Table 1 because this approach to reliability estimation is rarely used if the range of scores is large; here, the total scores could range from 6 to 42.)
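The two definitions of agreement, and the chance correction that kappa applies, can be sketched as follows. The ratings are invented for illustration (they are not the data from Goodwin's tables), and the function names `percent_agreement` and `cohens_kappa` are introduced only for this example. Kappa here is the unweighted form computed on exact matches.

```python
from collections import Counter

def percent_agreement(rater_1, rater_2, tolerance=0):
    """Percent of cases where the two ratings differ by at most `tolerance` points."""
    hits = sum(abs(a - b) <= tolerance for a, b in zip(rater_1, rater_2))
    return 100.0 * hits / len(rater_1)

def cohens_kappa(rater_1, rater_2):
    """Unweighted Cohen's kappa: exact agreement corrected for chance agreement."""
    n = len(rater_1)
    observed = sum(a == b for a, b in zip(rater_1, rater_2)) / n
    counts_1, counts_2 = Counter(rater_1), Counter(rater_2)
    # Chance agreement: product of the two raters' marginal proportions, summed over categories.
    expected = sum(counts_1[c] * counts_2[c] for c in counts_1) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical ratings of the same ten cases by two raters on a 7-point scale.
rater_a = [3, 4, 5, 2, 6, 4, 3, 5, 4, 2]
rater_b = [3, 5, 5, 3, 6, 3, 4, 5, 6, 2]

exact = percent_agreement(rater_a, rater_b)                  # exact match
lenient = percent_agreement(rater_a, rater_b, tolerance=1)   # within one point
kappa = cohens_kappa(rater_a, rater_b)
print(f"exact: {exact:.0f}%, within one point: {lenient:.0f}%, kappa: {kappa:.2f}")
```

With these ratings, exact agreement is 50%, agreement within one point is 90%, and kappa is about .38, which shows the same pattern described above: the lenient definition yields the highest figure and the chance-corrected statistic the lowest.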


The Reliability Coefficient


Standard Error of Measurement