A person's response to an item can be modeled with a specific item response function.
For the 1-PL and Rasch models, the probability of answering an item correctly is a function of the distance between the item's location and the person's location.
For the 2-PL, this probability also depends on how well the item differentiates among people at different locations (item discrimination).
For the 3-PL, the probability depends on item difficulty, item discrimination, and guessing.
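The three models above can be sketched with a single logistic function. This is a minimal illustration, assuming the plain logistic metric (no 1.7 scaling constant); the function name `irf` and the parameter values are hypothetical and chosen only for demonstration.

```python
import math

def irf(theta, b, a=1.0, c=0.0):
    """Probability of a correct response under a logistic IRT model.

    theta: person location; b: item difficulty (item location);
    a: item discrimination (a=1.0 gives the 1-PL form);
    c: lower asymptote for guessing (c=0.0 gives the 1-PL/2-PL form).
    """
    logistic = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return c + (1.0 - c) * logistic

# Hypothetical person and item values, for illustration only.
p_1pl = irf(theta=0.0, b=0.0)                   # person located exactly at the item
p_2pl = irf(theta=1.0, b=0.0, a=2.0)            # more discriminating item
p_3pl = irf(theta=-10.0, b=0.0, a=1.0, c=0.2)   # very low ability, guessing floor of 0.2
print(round(p_1pl, 3), round(p_2pl, 3), round(p_3pl, 3))
```

When the person and item locations coincide, the 1-PL probability is 0.5; with a nonzero guessing parameter, the probability never drops below `c` no matter how far the person sits below the item.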
Similar to the SEM, the standard error of estimate (SEE) allows us to quantify uncertainty about a person's score within IRT.
Information is the inverse of the squared SEE, \(I(\theta) = 1 / \text{SEE}(\theta)^2\), and tells us how precise our estimates are.
We can use this to select items and develop tests!
We can also create 95% confidence intervals with this information.
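As a rough sketch of how information, the SEE, and a 95% confidence interval fit together, the example below assumes a 2-PL model, where item information is \(a^2 P (1 - P)\), test information is the sum over items, and \(\text{SEE}(\theta) = 1 / \sqrt{I(\theta)}\). The item parameters and all function names are hypothetical.

```python
import math

def p_2pl(theta, a, b):
    """2-PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Item information for a 2-PL item: a^2 * P * (1 - P)."""
    p = p_2pl(theta, a, b)
    return a ** 2 * p * (1.0 - p)

def total_information(theta, items):
    """Test information: the sum of the item informations at theta."""
    return sum(item_information(theta, a, b) for a, b in items)

def see(theta, items):
    """Standard error of estimate: 1 / sqrt(test information)."""
    return 1.0 / math.sqrt(total_information(theta, items))

def confidence_interval(theta_hat, items, z=1.96):
    """Approximate 95% interval: theta_hat +/- z * SEE."""
    half_width = z * see(theta_hat, items)
    return theta_hat - half_width, theta_hat + half_width

# Hypothetical (a, b) item parameters, made up for illustration only.
items = [(1.0, -1.0), (1.5, 0.0), (2.0, 0.0), (1.2, 1.0)]
theta_hat = 0.0
lower, upper = confidence_interval(theta_hat, items)
print(f"information={total_information(theta_hat, items):.3f}, "
      f"SEE={see(theta_hat, items):.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
```

Adding items that are informative near a person's location raises the test information there, which shrinks the SEE and narrows the interval. This is the logic behind using information to select items.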
Reliability is essentially a measure of consistency in your scores (think of a dart board)
Error doesn't necessarily mean a mistake; it can simply refer to the measurement procedure or conditions.
Generalizability theory, the child of CTT and ANOVA, allows a researcher to quantify and disentangle the different sources of error in observed scores.
What are we trying to generalize over?
The CTT model is: \(X = T + E\)
The G-Theory model is: \(X = \mu_p + E_1 + E_2 + \dots + E_H\)
\(\mu_p\) is the universe score and the \(E_h\) are the sources of error.
Recall that each rater rates each item for every person, so the design is fully crossed (p × i × r).
\(X_{pir} = \mu + v_p + v_i + v_r + v_{pi} + v_{pr} + v_{ir} + v_{pir}\)
If we assume that these effects are uncorrelated, then:
\(\sigma^2(X_{pir}) = \sigma^2_p + \sigma^2_i + \sigma^2_r + \sigma^2_{pi} + \sigma^2_{pr} + \sigma^2_{ir} + \sigma^2_{pir}\)
These are our variance components
In a G study, we estimate each of these variance components
They can be estimated using the aov() or lme4::lmer() functions in R.
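To make the estimation step concrete without relying on R, here is a minimal pure-Python sketch of the classical ANOVA (expected mean squares) approach for a fully crossed p × i × r design with one score per cell. It is a swapped-in illustration, not what aov() or lme4::lmer() do internally; the function name `g_study` and the toy scores are hypothetical.

```python
from itertools import product
from statistics import mean, variance

def g_study(x):
    """Estimate the seven variance components from x[p][i][r] (one score per cell)."""
    n_p, n_i, n_r = len(x), len(x[0]), len(x[0][0])
    P, I, R = range(n_p), range(n_i), range(n_r)

    # Grand mean and marginal means for each effect.
    grand = mean(x[p][i][r] for p, i, r in product(P, I, R))
    m_p = [mean(x[p][i][r] for i, r in product(I, R)) for p in P]
    m_i = [mean(x[p][i][r] for p, r in product(P, R)) for i in I]
    m_r = [mean(x[p][i][r] for p, i in product(P, I)) for r in R]
    m_pi = [[mean(x[p][i][r] for r in R) for i in I] for p in P]
    m_pr = [[mean(x[p][i][r] for i in I) for r in R] for p in P]
    m_ir = [[mean(x[p][i][r] for p in P) for r in R] for i in I]

    # Mean squares: sum of squares divided by degrees of freedom.
    ms = {
        "p": n_i * n_r * sum((m_p[p] - grand) ** 2 for p in P) / (n_p - 1),
        "i": n_p * n_r * sum((m_i[i] - grand) ** 2 for i in I) / (n_i - 1),
        "r": n_p * n_i * sum((m_r[r] - grand) ** 2 for r in R) / (n_r - 1),
        "pi": n_r * sum((m_pi[p][i] - m_p[p] - m_i[i] + grand) ** 2
                        for p, i in product(P, I)) / ((n_p - 1) * (n_i - 1)),
        "pr": n_i * sum((m_pr[p][r] - m_p[p] - m_r[r] + grand) ** 2
                        for p, r in product(P, R)) / ((n_p - 1) * (n_r - 1)),
        "ir": n_p * sum((m_ir[i][r] - m_i[i] - m_r[r] + grand) ** 2
                        for i, r in product(I, R)) / ((n_i - 1) * (n_r - 1)),
        "pir": sum((x[p][i][r] - m_pi[p][i] - m_pr[p][r] - m_ir[i][r]
                    + m_p[p] + m_i[i] + m_r[r] - grand) ** 2
                   for p, i, r in product(P, I, R))
               / ((n_p - 1) * (n_i - 1) * (n_r - 1)),
    }

    # Solve the expected-mean-square equations for the variance components.
    return {
        "p": (ms["p"] - ms["pi"] - ms["pr"] + ms["pir"]) / (n_i * n_r),
        "i": (ms["i"] - ms["pi"] - ms["ir"] + ms["pir"]) / (n_p * n_r),
        "r": (ms["r"] - ms["pr"] - ms["ir"] + ms["pir"]) / (n_p * n_i),
        "pi": (ms["pi"] - ms["pir"]) / n_r,
        "pr": (ms["pr"] - ms["pir"]) / n_i,
        "ir": (ms["ir"] - ms["pir"]) / n_p,
        "pir": ms["pir"],
    }

# Hypothetical, noise-free toy scores: purely additive person, item, and rater effects.
person_fx, item_fx, rater_fx = [0, 1, 2, 3], [0, 2, 4], [0, 1]
scores = [[[pe + ie + re for re in rater_fx] for ie in item_fx] for pe in person_fx]
estimates = g_study(scores)
for source, value in estimates.items():
    print(f"{source:>3}: {value:.3f}")
```

Because the toy scores contain no interactions, the interaction components come out as zero and each main-effect component equals the sample variance of the corresponding effects. With real data, this method can return small negative estimates, which are commonly set to zero.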
This forms the basis of our D study, which is used to investigate different measurement scenarios and allows us to calculate different reliability estimates depending on the intended use of the scores.
We need to derive universe score, relative error, and absolute error variances
\(\sigma^2(X_{pir}) = \sigma^2_p + \sigma^2_i + \sigma^2_r + \sigma^2_{pi} + \sigma^2_{pr} + \sigma^2_{ir} + \sigma^2_{pir}\)
Universe-score variance: $$\sigma_{\tau}^2 = \sigma_p^2$$
Relative error variance: $$\sigma_{\delta}^2 = \frac{\sigma_{pi}^2}{n_i'} + \frac{\sigma_{pr}^2}{n_r'} + \frac{\sigma_{pir}^2}{n_i' n_r'}$$
Absolute error variance: $$\sigma_{\Delta}^2 = \frac{\sigma_{i}^2}{n_i'} + \frac{\sigma_{r}^2}{n_r'} + \frac{\sigma_{ir}^2}{n_i' n_r'} + \frac{\sigma_{pi}^2}{n_i'} + \frac{\sigma_{pr}^2}{n_r'} + \frac{\sigma_{pir}^2}{n_i' n_r'}$$
IMPORTANT: What we consider fixed or random determines what goes where!
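As a hedged illustration of this point, suppose items stay random but raters are treated as fixed (a scenario not worked through in these notes). Under the usual mixed-model rules in the G-theory literature, components involving persons and only the fixed facet move into the universe score, and components involving only the fixed facet drop out of the error variances:

$$\sigma_{\tau}^2 = \sigma_p^2 + \frac{\sigma_{pr}^2}{n_r'}$$

$$\sigma_{\delta}^2 = \frac{\sigma_{pi}^2}{n_i'} + \frac{\sigma_{pir}^2}{n_i' n_r'}$$

$$\sigma_{\Delta}^2 = \frac{\sigma_{i}^2}{n_i'} + \frac{\sigma_{ir}^2}{n_i' n_r'} + \frac{\sigma_{pi}^2}{n_i'} + \frac{\sigma_{pir}^2}{n_i' n_r'}$$

Compared with the fully random case, the universe-score variance grows and both error variances shrink, so the resulting coefficients are higher, but they only generalize to the specific raters used.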
We have now combined our variance components into three quantities: universe-score variance, relative error variance, and absolute error variance.
Relative error variance and the generalizability coefficient are analogous to \(\sigma^2_E\) and reliability in CTT, and are used for relative decisions, that is, comparing examinees.
\(E\rho^2 = \frac{\sigma^2_\tau}{\sigma^2_\tau + \sigma^2_\delta}\)
Absolute error variance is for making absolute decisions about examinees
Dependability coefficient, \(\phi = \frac{\sigma^2_\tau}{\sigma^2_\tau + \sigma^2_\Delta}\)
Again, consider our G-study in which Icelanders answer items on a writing test that were scored by multiple raters.
| Source | Variance component | Estimate | Total variability (%) |
|---|---|---|---|
| Person (p) | \(\sigma_p^2\) | 1.376 | 32 |
| Item (i) | \(\sigma_i^2\) | 0.215 | 5 |
| Rater (r) | \(\sigma_r^2\) | 0.043 | 1 |
| p × i | \(\sigma_{pi}^2\) | 0.860 | 20 |
| p × r | \(\sigma_{pr}^2\) | 0.258 | 6 |
| i × r | \(\sigma_{ir}^2\) | 0.001 | 0 |
| p × i × r | \(\sigma_{pir}^2\) | 1.548 | 36 |
$$\sigma_{\delta}^2 = \frac{\sigma_{pi}^2}{n_i'} + \frac{\sigma_{pr}^2}{n_r'} + \frac{\sigma_{pir}^2}{n_i' n_r'} = \frac{0.860}{20} + \frac{0.258}{3} + \frac{1.548}{20 \times 3} = 0.1548$$
$$\sigma_{\Delta}^2 = \frac{\sigma_{i}^2}{n_i'} + \frac{\sigma_{r}^2}{n_r'} + \frac{\sigma_{ir}^2}{n_i' n_r'} + \frac{\sigma_{pi}^2}{n_i'} + \frac{\sigma_{pr}^2}{n_r'} + \frac{\sigma_{pir}^2}{n_i' n_r'} = \frac{0.215}{20} + \frac{0.043}{3} + \frac{0.001}{20 \times 3} + \frac{0.860}{20} + \frac{0.258}{3} + \frac{1.548}{20 \times 3} = 0.1799$$
\(E\rho^2 = \frac{\sigma^2_\tau}{\sigma^2_\tau + \sigma^2_\delta} = \frac{1.376}{1.376 + 0.1548} = 0.899\)
\(\phi = \frac{\sigma^2_\tau}{\sigma^2_\tau + \sigma^2_\Delta} = \frac{1.376}{1.376 + 0.1799} = 0.884\)
What if we used just 10 items and 2 raters?
\(E\rho^2 \approx 0.825\) and \(\phi \approx 0.804\)
So reliabilities decrease!
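The D-study arithmetic above is easy to script, which makes it simple to try other designs. The sketch below assumes items and raters are both random and fully crossed, and uses the rounded estimates from the G-study table. The function name `d_study` is hypothetical, and the (20, 2) and (10, 3) designs are extra scenarios added only for illustration. Because the table estimates are rounded, results computed this way can differ from hand-reported figures in the last decimal place.

```python
# Variance component estimates copied from the G-study table.
components = {
    "p": 1.376, "i": 0.215, "r": 0.043,
    "pi": 0.860, "pr": 0.258, "ir": 0.001, "pir": 1.548,
}

def d_study(vc, n_items, n_raters):
    """D study for a crossed p x i x r design with items and raters random."""
    universe = vc["p"]
    # Relative error: interactions involving persons, averaged over the design.
    relative = (vc["pi"] / n_items
                + vc["pr"] / n_raters
                + vc["pir"] / (n_items * n_raters))
    # Absolute error: relative error plus item, rater, and item-by-rater effects.
    absolute = (relative
                + vc["i"] / n_items
                + vc["r"] / n_raters
                + vc["ir"] / (n_items * n_raters))
    return {
        "relative_error": relative,
        "absolute_error": absolute,
        "g_coef": universe / (universe + relative),
        "phi": universe / (universe + absolute),
    }

# (items, raters) designs; the last two are hypothetical additions.
for n_items, n_raters in [(20, 3), (10, 2), (20, 2), (10, 3)]:
    result = d_study(components, n_items, n_raters)
    print(f"items={n_items:2d} raters={n_raters}: "
          f"E_rho2={result['g_coef']:.3f}, phi={result['phi']:.3f}")
```

Looping over designs like this shows the trade-off directly: dropping items or raters inflates both error variances, so both coefficients fall.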