Today

  • Classical Test Theory
  • Reliability

Perfect Measurements?

Metric Ruler

  • Recall: No instrument is perfect and no measurement is without error
  • Quantifying that error is more difficult in psychology than with physical instruments like a ruler

Dealing with Imperfect Measurements

  • Error is part of life
  • Minimize error
  • Partition out error

Classical Test Theory Definition

Rodriguez (2015):

  1. No single approach to the measurement of any construct is universally accepted.
  2. Psychological measurements are usually based on limited samples of behavior.
  3. The measurement obtained is always subject to error.
  4. The lack of well-defined units on the measurement scales poses still another problem.
  5. Psychological constructs cannot be defined only in terms of operational definitions but must also have demonstrated relationships to other constructs or observable phenomena.

Classical Test Theory Model

$$X = T + E$$

  • Observed Score = True Score + Error
  • Compare simple linear regression (SLR): observed Height = Height predicted from Weight + Residual

Variance

$$\sigma^2 = \frac{\sum(X - \bar{X})^2}{N}$$

  • You administer a test and the students get the following grades: 76, 87, 88, 80, 77. Calculate the variance.
  • You administer the same test again to other students and they get the following grades: 107, 78, 78, 94, 77. Do you expect that the variance was greater? Why?
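Both exercises checked in base R (note that R's built-in var() uses the sample N - 1 denominator, so the population formula above is coded directly):

      g1 <- c(76, 87, 88, 80, 77)
      g2 <- c(107, 78, 78, 94, 77)

      # population variance: divide by N, matching the formula above
      pop_var <- function(x) sum((x - mean(x))^2) / length(x)
      pop_var(g1)  # 25.04
      pop_var(g2)  # 142.16 -- much larger, driven by the extreme score of 107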

Partition Observed Score Variance

$$\sigma^2_X = \sigma^2_T + \sigma^2_E$$
  • Total Observed Variance = True Score Variance + Error Variance
  • True Score Variance is considered to be stable
  • How does Error Variance affect the consistency and usefulness of our test?
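A short simulation of this partition (hypothetical values; error is generated independently of the true score, as CTT assumes):

      set.seed(42)
      true_score <- rnorm(10000, mean = 80, sd = 5)  # stable true scores
      error      <- rnorm(10000, mean = 0,  sd = 3)  # random error, independent of T
      observed   <- true_score + error

      var(observed)                    # about 25 + 9 = 34
      var(true_score) + var(error)     # the two components sum to the total
      var(true_score) / var(observed)  # about 0.74 -- a preview of reliability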

What Goes into Error Variance

Measurement Error

  • Random Error
    • Unpredictable and inconsistent sources of error
  • Systematic Error
    • Constant and predictable source of error
  • Examples?
  • Which poses a bigger threat to a consistent measure?

Error Sources

  • Constructing tests
  • Administering tests
  • Scoring tests
  • Interpreting tests

Reliability

$$r_{xx} = \frac{\sigma^2_T}{\sigma^2_X}$$
  • Proportion of observed variance attributed to true variance
  • What does it mean?

Types of Reliability

  • Test-Retest
    • Coefficient of stability
  • Parallel Forms
    • Coefficient of equivalence
    • Parallel forms reliability
  • Alternate Forms (constructed to approximate parallel forms)
    • Alternate forms reliability
  • Measures of Internal Consistency

Different types of associations

  • Pearson's product-moment correlation is only appropriate when both variables are continuous and measured on interval/ratio scales
  • Alternatives exist when variables are dichotomous or ordinal, either naturally or artificially (assumed to have a continuous underlying scale)
    • Phi coefficient, equivalent to Pearson's correlation but for two dichotomous variables
    • Polychoric coefficient, an index of association between two ordinal variables assumed to reflect underlying continuous variables
    • Tetrachoric coefficient, an index of association between two artificially dichotomized variables
    • Point-biserial coefficient, an index of association between a dichotomous and a continuous variable
    • Biserial coefficient, an index of association between an artificially dichotomized and a continuous variable
    • Spearman rank-order coefficient, an index of association where at least one variable is ordinal
    • Kendall's tau, an alternative to Spearman's coefficient

Internal consistency

  • Measures the level of consistency or agreement between items
  • A test is unidimensional when all of its items measure the same latent construct
  • The more closely the items measure just one construct, the higher the inter-item consistency
  • Is it always possible or desirable to have a test that measures just one thing?

Split-Half Reliability

  • Obtained by correlating two pairs of scores from equivalent halves of a single test
  • "Creating two equivalent forms of a test"
  • What are the steps to calculate this? (see the R sketch after this list)
  • How might we consider making splits?
  • What should we be careful of?
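A minimal sketch of the usual steps, with simulated 0/1 responses and an odd-even split (hypothetical data; items are generated from a common ability so the halves correlate):

      # Simulated 0/1 responses: 20 examinees x 10 items
      set.seed(1)
      ability <- rnorm(20)
      X <- sapply(1:10, function(i) rbinom(20, 1, plogis(ability)))

      # Step 1: split the test into two equivalent halves (odd vs. even items)
      odd  <- rowSums(X[, seq(1, 10, by = 2)])
      even <- rowSums(X[, seq(2, 10, by = 2)])

      # Step 2: correlate the half-test scores; this estimates the reliability
      # of a HALF-length test, so a correction (next slides) is still needed
      cor(odd, even)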

Corrections

  • Why do we need a correction?
    • Correlation is affected by measurement error
    • Correlation is affected by range restriction
  • All things considered, which test would be more reliable?
    • Test A: 5 items or Test B: 10 items?
    • Tests A and B: 7 items each?
    • Reliability is affected by test length

Spearman-Brown Correction

$$r_{SB} = \frac{Nr}{1 + (N - 1)r} $$

  • r is the correlation of the original two splits
  • N is the number of "tests", i.e. the factor by which the test length is changed (N = 2 doubles the number of items)

Uses of Spearman-Brown

$$N = \frac{r_{SB}(1 - r)}{r(1 - r_{SB})}$$
  • Determine length of test (add or delete items)
  • Test modification

Working with Spearman-Brown

Consider the following scenarios:

N     r     SB's r
2     .8    ?
3     .8    ?
3     .6    ?
?     .6    .8

Spearman Brown in R

      # Install and load the package
      install.packages("CTT")
      library("CTT")

      # old reliability is 0.6; if the measure is lengthened
      # by a factor of 2, the reliability of the new test is:
      spearman.brown(0.6, 2, "n")

      # old reliability is 0.5; if we want the new measure's
      # reliability to be 0.8, the required lengthening factor is:
      spearman.brown(0.5, 0.8, "r")
      

Kuder-Richardson 20

$$r_{KR} = \frac{k}{k-1}\left(1 - \frac{\sum pq}{\sigma^2}\right) $$

  • De facto statistic for dichotomously scored items
  • k is the number of items; p is the proportion passing each item; q = 1 - p; the denominator is the total test score variance
  • KR-20 is typically more conservative than split-half

Calculating KR-20
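A minimal sketch with simulated 0/1 responses (hypothetical data; items are generated from a common ability so they intercorrelate):

      # Simulated 0/1 responses: 20 examinees x 10 items
      set.seed(2)
      ability <- rnorm(20)
      X <- sapply(1:10, function(i) rbinom(20, 1, plogis(ability)))

      k      <- ncol(X)
      p      <- colMeans(X)      # proportion passing each item
      q      <- 1 - p
      sigma2 <- var(rowSums(X))  # total score variance (sample variance here)

      (k / (k - 1)) * (1 - sum(p * q) / sigma2)  # KR-20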

Coefficient alpha

$$r_{\alpha} = \left(\frac{k}{k - 1}\right)\left(1 - \frac{\sum \sigma_i^2}{\sigma^2}\right)$$
  • May be conceptualized as the mean of all possible split-half coefficients
  • Can be used on ordinal scales or scales that aren't scored dichotomously
  • Ranges from 0 to 1
  • Higher values are wanted for high-stakes decisions; lower values are acceptable for research
  • An overused statistic with known problems (Sijtsma, 2009)
  • Always consider reporting 95% confidence intervals!
  • An alternative is the proportional distance statistic, which doesn't depend on the number of items on the instrument
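A minimal sketch applying the alpha formula directly, with simulated responses in which each item reflects a single latent construct plus noise (hypothetical data, not from the slides):

      # Simulated responses: 20 examinees x 5 items
      set.seed(3)
      latent <- rnorm(20)
      X <- sapply(1:5, function(i) latent + rnorm(20, sd = 0.8))

      k         <- ncol(X)
      item_var  <- apply(X, 2, var)  # per-item variances
      total_var <- var(rowSums(X))   # total score variance

      (k / (k - 1)) * (1 - sum(item_var) / total_var)  # coefficient alpha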

Inter-rater reliability

  • Two raters measure the same behavior
  • For example: Number of aggressive behaviors
  • The degree to which the raters report the same incidence of aggressive behaviors is a measure of reliability
  • Correlate the scores from the two raters
  • Test scores have reliability, NOT the test itself

IRR: Example

Two parents each complete the CBCL (an instrument that identifies problem behaviors in children) for their four children. How well do their scores for the Aggressive Behavior section agree (i.e. what is their inter-rater reliability)?

Child   Parent 1   Parent 2
1       5.5        6.0
2       5.2        5.2
3       4.6        4.0
4       6.6        5.6
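Worked in R as a simple Pearson correlation between the two raters' scores:

      # Aggressive Behavior scores reported by each parent
      parent1 <- c(5.5, 5.2, 4.6, 6.6)
      parent2 <- c(6.0, 5.2, 4.0, 5.6)

      cor(parent1, parent2)  # inter-rater reliability, about 0.70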

Test characteristics on reliability

  • The more homogeneous the test items, the higher the reliability
  • The more static the measured characteristic, the higher the reliability
  • Restriction of range lowers reliability
  • Power vs. speed tests
    • If speed, reliability estimates may be too high
    • Use test-retest, alternate forms, or split halves from two independently timed half-tests
  • Criterion-referenced tests: lower variability, so lower reliability

Calculating True Score

  • Anna takes 3 tests (parallel forms) in math
  • She gets an 8, 7, and 7.5
  • What should we estimate as her true score/ability in math?
  • Do you think that estimate is her true score?
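Under CTT the mean across parallel forms is the natural estimate of the true score:

      # Anna's scores on three parallel forms
      mean(c(8, 7, 7.5))  # estimated true score: 7.5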

Quantifying the Uncertainty - SEM

$$\sigma_{SEM} = \sigma_X\sqrt{1 - r_{xx}} $$
  • Standard Error of Measurement = standard deviation of observed scores * square root of (1 - reliability coefficient)
  • Can be used to build confidence intervals, assuming scores over a large number of parallel tests are normally distributed
  • Gives plausible values for a person's true score

SEM: Example

  • A math test is administered. The test scores have a reliability coefficient of 0.80 and a standard deviation of 0.5
  • What is the standard error of measurement?
  • If Anna scored a 7.5, what range of values can we be 95% confident that her true score lies between? 99% confident?
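The example worked in base R, using standard normal critical values:

      sd_x <- 0.5
      r_xx <- 0.80
      sem  <- sd_x * sqrt(1 - r_xx)  # 0.5 * sqrt(0.2), about 0.224

      score <- 7.5
      score + c(-1, 1) * qnorm(0.975) * sem  # 95% CI: about (7.06, 7.94)
      score + c(-1, 1) * qnorm(0.995) * sem  # 99% CI: about (6.92, 8.08)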

Standard Error of the Difference

$$\sigma_D = \sqrt{\sigma^2_{SEM_1} + \sigma^2_{SEM_2}}$$ $$\sigma_D = \sigma\sqrt{2 - r_1 - r_2}$$

  • σ is the standard deviation of the scores on both tests; r_1 and r_2 are their reliability coefficients

SED: Example

  • Sigrun takes the same test as Anna and scores a 6.5. Did Anna perform significantly better on the test?
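A worked check in R (both scores come from the same test, so r1 = r2 = 0.80 and σ = 0.5 from the SEM example):

      sigma <- 0.5
      r1 <- r2 <- 0.80
      sed <- sigma * sqrt(2 - r1 - r2)  # 0.5 * sqrt(0.4), about 0.316

      score_diff <- 7.5 - 6.5
      score_diff / sed  # about 3.16 -- beyond 1.96, so Anna's score is
                        # significantly higher at the .05 level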
Next time

  • Quiz and validity (chapter 6)