E-411 PRMA

Lecture 11 - Item Response Theory & Generalizability Theory

Christopher David Desjardins

Properties of IRT

  1. Manifest variables differentiate among persons at different locations on the latent scale
  2. Items are characterized by location and ability to discriminate among persons
  3. Items and persons are on the same scale
  4. Parameters estimated in a sample are linearly transformable to estimates of those parameters from another sample
  5. Yields scores that are independent of number of items, item difficulty, and the individuals it is measured on, and are placed on a real-number scale

Assumptions of IRT

Response of a person to an item can be modeled with the a specific item reponse function

IRT conceptually

Item Response Function (IRF)

https://lundinn.shinyapps.io/rasch Error: Embedded data could not be displayed.

Getting item correct recap

For the 1-PL and the Rasch, the probability of getting an item correct is a function of the distance an item is located from a person.

For the 2-PL, this is also a function of how well the item differentiates among people at different locations.

For the 3-PL, include item difficulty, item discrimination, and guessing

Guessing Parameter

Standard Error of Estimate and Information

  • Similar to the SEM, the standard error of estimate (SEE) allows us to quantify uncertainty about score of a person within IRT

  • Information is the inverse of the SEE and tells us how precise our estimates

  • We can use this to select items and develop tests!

  • See can also create 95% confidence intervals with this information

Item Information Function

http://130.208.71.121:3838/rasch_information/ Error: Embedded data could not be displayed.
http://130.208.71.121:3838/twopl_information/ Error: Embedded data could not be displayed.
http://130.208.71.121:3838/threepl_information/ Error: Embedded data could not be displayed.

Test Information Function

Generalizability theory

Generalizability Theory

Reliability is essentially a measure of consistency in your scores (think of a dart board)

Error doesn't necessarily mean mistake, it can just refer to the measurement procedure/conditions

Generalizability theory, the child of CTT and ANOVA, allows a researcher to quantify and distangle the different sources of error in observed scores

What are we trying to generalize over

The CTT model is: \(X = T + E\)

The G-Theory model is: \(X = \mu_p + E_1 + E_2 + \dots + E_H\)

\(\mu_p\) - universe score and \(E_h\) - are sources of error

G studies

  • Suppose we develop a test to measure your writing abilities
  • We could have various ...
    • Items
    • Raters
  • These are referred to collectively as facets
  • Facets - Conditions of measurement
  • If we administered the test on multiple occasions, occasions could also be a facet
  • Let each rater rate each item and assume there are an essentially limitless pool of items and raters could draw from - universe of admissible observations

Fixed Design

Random Design

Nested Design

More terminology

  • universe - conditions of the measurement
  • population - objects of measurement
    • This is our typical notion of a population
  • G-study - Set up study design and estimate the variances
  • Universe of generalizations - What are we trying to generalize to? Just these items and raters? Or are these items and raters a sample from all items and raters?

Our model

Recall, each rater rates each item

\(X_{pir} = \mu + v_p + v_i + v_r + v_{pi} + v_{pr} + v_{ir} + v_{pir}\)

If we assume that that these effects are uncorrelated then

\(\sigma^2(X_{pir}) = \sigma^2_p + \sigma^2_i + \sigma^2_r + \sigma^2_{pi} + \sigma^2_{pr} + \sigma^2_{ir} + \sigma^2_{pir}\)

These are our variance components

In a G study, we estimate each of these variance components

They can be estimated using aov() or lme4::lmer() functions in R

This forms the basis of our D study, which is used to investigate different scenarios and allow us to calculate different reliability estimates based on our use

D study

  • We need to decide if our facets should be considered fixed or random
  • We need to know if they are nested within one another
  • This will determine our universe of generalization and has implications for our reliability estimates!

What does the D study give us?

  • It tells us what effect changing the number of ...
    • Items
    • Raters
    • Testing Occasions
    • Whatever
  • ... affects reliability

Consider Raters and Items crossed (p x R x I design)

We need to derive universe score, relative error, and absolute error variances

\(\sigma^2(X_{pir}) = \sigma^2_p + \sigma^2_i + \sigma^2_r + \sigma^2_{pi} + \sigma^2_{pr} + \sigma^2_{ir} + \sigma^2_{pir}\)

The variances based on our design

universe-score variance $$\sigma_{\tau}^2 = \sigma_p^2$$

relative error variance$$\sigma_{\delta}^2 = \frac{\sigma_{pi}^2}{n_i^`} + \frac{\sigma_{pr}^2}{n_r^`} + \frac{\sigma_{pir}^2}{n_i^`n_r^`} $$

absolute error variance$$\sigma_{\Delta}^2 = \frac{\sigma_{i}^2}{n_i^`} + \frac{\sigma_{r}^2}{n_r^`} + \frac{\sigma_{ir}^2}{n_i^`n_r^`} + \frac{\sigma_{pr}^2}{n_r^`} + \frac{\sigma_{pi}^2}{n_i^`} + \frac{\sigma_{pir}^2}{n_i^`n_r^`} $$

IMPORTANT: What we consider fixed or random determines what goes where!

D study estimates

Now that we've partititioned our variance into 3 components: universe score, relative error, and absolute error variance.

Relative error and the generalizability coefficient, are analagous to \(\sigma^2_E\) and reliability in CTT, and is based on comparing examinees

\(E\rho^2 = \frac{\sigma^2_\tau}{\sigma^2_\tau + \sigma^2_\delta}\)

Absolute error variance is for making absolute decisions about examinees

Dependability coefficient, \(\phi = \frac{\sigma^2_\tau}{\sigma^2_\tau + \sigma^2_\Delta}\)

Icelandic writing test

Again, consider our G-study in which Icelanders answer items on a writing test that were scored by multiple raters.

SourceVariance componentEstimateTotal variability (%)
Person (p) $$\sigma_p^2$$ 1.376 32
Item (i)$$\sigma_i^2$$ 0.215 05
Rater (r)$$\sigma_r^2$$0.04301
p × i$$\sigma_{pi}^2$$0.86020
p × r$$\sigma_{pr}^2$$0.25806
i × r$$\sigma_{ir}^2$$0.00100
p × r × i$$\sigma_{pir}^2$$1.54836

What do those numbers actually mean?

  • The large person variation (32%) means there was a lot of between person variability even after accounting for items and raters
    • This is our universe score variance in our example
  • 5% of the variation was associated with items (i.e. items were of varied difficulty)
  • Only 1% of the variation was associated with raters
  • 20% of the variation for p x i - means that person relative standings differed by items
  • 6% of the variation for p x r - means that person relative standing differed somewhat by raters
  • 0% of the variation for i x r - means the ordering of the item's difficulty did not change by raters
  • 36% of the variation for p x i x r - means relative standing varied by item and rater and other sources of error not controlled for in the study

Do you think changing the numbers of items or the number of occasions would have the biggest affect on reliability?

D-study 1 - 20 items and 3 raters

$$\sigma_{\delta}^2 = \frac{\sigma_{pi}^2}{n_i^`} + \frac{\sigma_{pr}^2}{n_r^`} + \frac{\sigma_{pir}^2}{n_i^`n_r^`} = \frac{0.86}{20} + \frac{0.258}{3} + \frac{1.548}{3*20} = 0.1548$$

$$\sigma_{\Delta}^2 = \frac{\sigma_{i}^2}{n_i^`} + \frac{\sigma_{r}^2}{n_r^`} + \frac{\sigma_{ir}^2}{n_i^`n_r^`} + \frac{\sigma_{pi}^2}{n_i^`} + \frac{\sigma_{pr}^2}{n_r^`} + \frac{\sigma_{pir}^2}{n_i^`n_r^`} = \frac{0.215}{20} + \frac{0.043}{3} + \frac{.001}{3*20} + \frac{0.86}{20} + \frac{0.258}{3} + \frac{1.548}{3*20} = 0.1799$$

\(E\rho^2 = \frac{\sigma^2_\tau}{\sigma^2_\tau + \sigma^2_\delta} = \frac{1.376}{1.376 + 0.1548} = 0.899\)

\(\phi = \frac{\sigma^2_\tau}{\sigma^2_\tau + \sigma^2_\Delta} = \frac{1.376}{1.376 + 0.1799} = 0.884\)

What if we used just 10 items and 2 raters?

\(E\rho^2 = 0.824\) and \(\phi = 0.803\)

So reliabilities decrease!

Comparing G-Theory to CTT

  • CTT reliability estimates are often incorrect
  • When we have more than 1 random facet, CTT reliabilities are too high
  • Different D-study scenarios allow you to investigate what-ifs based on number of items, raters, occasions, test forms, etc
  • Some take homes from CTT
    • Universe score variance gets smaller if we consider a facet fixed instead of random bc we reduce our universe of generalization!
    • lead to smaller error variances
    • Nested D study designs usually lead to small error variances and larger reliability coefficients