E-411 PRMA

Lecture 11 - Item Response Theory & Generalizability Theory

Christopher David Desjardins

Properties of IRT

Manifest variables differentiate among persons at different locations on the latent scale
Items are characterized by location and ability to discriminate among persons
Items and persons are on the same scale
Parameters estimated in a sample are linearly transformable to estimates of those parameters from another sample
Yields scores that are independent of number of items, item difficulty, and the individuals it is measured on, and are placed on a real-number scale

Assumptions of IRT

Response of a person to an item can be modeled with the a specific item reponse function

IRT conceptually

Item Response Function (IRF)

https://lundinn.shinyapps.io/rasch

Getting item correct recap

For the 1-PL and the Rasch, the probability of getting an item correct is a function of the distance an item is located from a person.

For the 2-PL, this is also a function of how well the item differentiates among people at different locations.

For the 3-PL, include item difficulty, item discrimination, and guessing

Guessing Parameter

Standard Error of Estimate and Information

Similar to the SEM, the standard error of estimate (SEE) allows us to quantify uncertainty about score of a person within IRT
Information is the inverse of the SEE and tells us how precise our estimates
We can use this to select items and develop tests!
See can also create 95% confidence intervals with this information

Item Information Function

http://130.208.71.121:3838/rasch_information/

http://130.208.71.121:3838/twopl_information/

http://130.208.71.121:3838/threepl_information/

Test Information Function

Generalizability theory

Generalizability Theory

Reliability is essentially a measure of consistency in your scores (think of a dart board)

Error doesn't necessarily mean mistake, it can just refer to the measurement procedure/conditions

Generalizability theory, the child of CTT and ANOVA, allows a researcher to quantify and distangle the different sources of error in observed scores

What are we trying to generalize over

The CTT model is: $X = T + E$

The G-Theory model is: $X = \mu_p + E_1 + E_2 + \dots + E_H$

$\mu_p$ - universe score and $E_h$ - are sources of error

G studies

Suppose we develop a test to measure your writing abilities
We could have various ...

Items
Raters

These are referred to collectively as facets
Facets - Conditions of measurement
If we administered the test on multiple occasions, occasions could also be a facet
Let each rater rate each item and assume there are an essentially limitless pool of items and raters could draw from - universe of admissible observations

Fixed Design

Random Design

Nested Design

More terminology

universe - conditions of the measurement
population - objects of measurement

This is our typical notion of a population

G-study - Set up study design and estimate the variances
Universe of generalizations - What are we trying to generalize to? Just these items and raters? Or are these items and raters a sample from all items and raters?

Our model

Recall, each rater rates each item

$X_{pir} = \mu + v_p + v_i + v_r + v_{pi} + v_{pr} + v_{ir} + v_{pir}$

If we assume that that these effects are uncorrelated then

$\sigma^2(X_{pir}) = \sigma^2_p + \sigma^2_i + \sigma^2_r + \sigma^2_{pi} + \sigma^2_{pr} + \sigma^2_{ir} + \sigma^2_{pir}$

These are our variance components

In a G study, we estimate each of these variance components

They can be estimated using aov() or lme4::lmer() functions in R

This forms the basis of our D study, which is used to investigate different scenarios and allow us to calculate different reliability estimates based on our use

D study

We need to decide if our facets should be considered fixed or random
We need to know if they are nested within one another
This will determine our universe of generalization and has implications for our reliability estimates!

What does the D study give us?

It tells us what effect changing the number of ...

Items
Raters
Testing Occasions
Whatever

... affects reliability

Consider Raters and Items crossed (p x R x I design)

We need to derive universe score, relative error, and absolute error variances

$\sigma^2(X_{pir}) = \sigma^2_p + \sigma^2_i + \sigma^2_r + \sigma^2_{pi} + \sigma^2_{pr} + \sigma^2_{ir} + \sigma^2_{pir}$

The variances based on our design

universe-score variance $$\sigma_{\tau}^2 = \sigma_p^2$$

relative error variance$$\sigma_{\delta}^2 = \frac{\sigma_{pi}^2}{n_i^`} + \frac{\sigma_{pr}^2}{n_r^`} + \frac{\sigma_{pir}^2}{n_i^`n_r^`} $$

absolute error variance$$\sigma_{\Delta}^2 = \frac{\sigma_{i}^2}{n_i^`} + \frac{\sigma_{r}^2}{n_r^`} + \frac{\sigma_{ir}^2}{n_i^`n_r^`} + \frac{\sigma_{pr}^2}{n_r^`} + \frac{\sigma_{pi}^2}{n_i^`} + \frac{\sigma_{pir}^2}{n_i^`n_r^`} $$

IMPORTANT: What we consider fixed or random determines what goes where!

D study estimates

Now that we've partititioned our variance into 3 components: universe score, relative error, and absolute error variance.

Relative error and the generalizability coefficient, are analagous to $\sigma^2_E$ and reliability in CTT, and is based on comparing examinees

$E\rho^2 = \frac{\sigma^2_\tau}{\sigma^2_\tau + \sigma^2_\delta}$

Absolute error variance is for making absolute decisions about examinees

Dependability coefficient, $\phi = \frac{\sigma^2_\tau}{\sigma^2_\tau + \sigma^2_\Delta}$

Icelandic writing test

Again, consider our G-study in which Icelanders answer items on a writing test that were scored by multiple raters.

Source	Variance component	Estimate	Total variability (%)
Person (p)	$$\sigma_p^2$$	1.376	32
Item (i)	$$\sigma_i^2$$	0.215	05
Rater (r)	$$\sigma_r^2$$	0.043	01
p × i	$$\sigma_{pi}^2$$	0.860	20
p × r	$$\sigma_{pr}^2$$	0.258	06
i × r	$$\sigma_{ir}^2$$	0.001	00
p × r × i	$$\sigma_{pir}^2$$	1.548	36

What do those numbers actually mean?

The large person variation (32%) means there was a lot of between person variability even after accounting for items and raters

This is our universe score variance in our example

5% of the variation was associated with items (i.e. items were of varied difficulty)
Only 1% of the variation was associated with raters
20% of the variation for p x i - means that person relative standings differed by items
6% of the variation for p x r - means that person relative standing differed somewhat by raters
0% of the variation for i x r - means the ordering of the item's difficulty did not change by raters
36% of the variation for p x i x r - means relative standing varied by item and rater and other sources of error not controlled for in the study

Do you think changing the numbers of items or the number of occasions would have the biggest affect on reliability?

D-study 1 - 20 items and 3 raters

$$\sigma_{\delta}^2 = \frac{\sigma_{pi}^2}{n_i^`} + \frac{\sigma_{pr}^2}{n_r^`} + \frac{\sigma_{pir}^2}{n_i^`n_r^`} = \frac{0.86}{20} + \frac{0.258}{3} + \frac{1.548}{3*20} = 0.1548$$

$$\sigma_{\Delta}^2 = \frac{\sigma_{i}^2}{n_i^`} + \frac{\sigma_{r}^2}{n_r^`} + \frac{\sigma_{ir}^2}{n_i^`n_r^`} + \frac{\sigma_{pi}^2}{n_i^`} + \frac{\sigma_{pr}^2}{n_r^`} + \frac{\sigma_{pir}^2}{n_i^`n_r^`} = \frac{0.215}{20} + \frac{0.043}{3} + \frac{.001}{3*20} + \frac{0.86}{20} + \frac{0.258}{3} + \frac{1.548}{3*20} = 0.1799$$

$E\rho^2 = \frac{\sigma^2_\tau}{\sigma^2_\tau + \sigma^2_\delta} = \frac{1.376}{1.376 + 0.1548} = 0.899$

$\phi = \frac{\sigma^2_\tau}{\sigma^2_\tau + \sigma^2_\Delta} = \frac{1.376}{1.376 + 0.1799} = 0.884$

What if we used just 10 items and 2 raters?

$E\rho^2 = 0.824$ and $\phi = 0.803$

So reliabilities decrease!

Comparing G-Theory to CTT

CTT reliability estimates are often incorrect
When we have more than 1 random facet, CTT reliabilities are too high
Different D-study scenarios allow you to investigate what-ifs based on number of items, raters, occasions, test forms, etc
Some take homes from CTT

Universe score variance gets smaller if we consider a facet fixed instead of random bc we reduce our universe of generalization!
lead to smaller error variances

Nested D study designs usually lead to small error variances and larger reliability coefficients