Many different ways to analyze items
Can focus on
Difficulty of item
Reliability of item
Validity of item
Discrimination of item
Guessing
Bias in favor of one group - differential item functioning
Test length and duration of testing session
Think Alouds
Expert Panels
Interviews
Qualitative Methods
On what basis should we revise our items?
Too easy or too hard items?
Items with similar difficulty that are measuring the same concept?
Items with negative point-biserial correlations?
Items that on a second/third read through seem unrelated to the construct?
Items with low factor loadings?
Based on IRT?
We settle on our revisions
Administer revised version to new sample
This becomes our comparison group, our standardization sample
Tests need to be revised when the domain has significantly changed
Content of the items is not understood or changed
Test norms are no longer adequate
Theory underlying the domain has changed
Reliability and validity of the instrument can be improved
Cross-validation - revalidation of a test on a separate, independent sample of test takers
Item validities tend to shrink during this process (validity shrinkage)
Co-validation - test validation conducted on two or more tests with the same sample of test takers
Creating norms, co-norming
Cheaper, reduces sampling error by norming on the same sample
Item Response Theory
Classical Test Theory
X = T + E
\(\sigma^2_X = \sigma^2_T + \sigma^2_E\)
\(\sigma_{\text{SEM}} = \sigma \sqrt{1 - r_{xx}}\)
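As a quick numeric illustration of the SEM formula above (the SD and reliability values here are made up):

```r
# SEM = sigma * sqrt(1 - r_xx)
# Hypothetical values: test SD of 15, reliability of .90
sigma <- 15
r_xx  <- 0.90
sem   <- sigma * sqrt(1 - r_xx)
sem
# [1] 4.743416
```

So even a fairly reliable test carries several points of measurement error around any observed score.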
In a nutshell, IRT is able to address all of these criticisms
BUT, makes stronger assumptions and requires a larger sample size
A measurement perspective
A series of non-linear models
Links manifest variables with latent variables
Latent characteristics of individuals and items are predictors of observed responses
Not a "how" or "why" theory
Anxiety could be loosely defined as feelings that range from general uneasiness to incapacitating attacks of terror
Is anxiety latent and is it continuous, categorical, or both?
Categorical - Individuals can be placed into a high anxiety latent class and a low anxiety latent class
Continuous - Individuals fall along an anxiety continuum
Both - Given a latent class (e.g. the high anxiety latent class), within this class there is a continuum of even greater anxiety.
Response of a person to an item can be modeled with a specific item response function
IRT: Item parameters estimated in one sample from a population are linearly transformable to estimates of those parameters on another sample from the same population. This makes it possible to create large pools of items that have been linked by this transformation process onto a common scale.
Unlike CTT, equating occurs automatically as a result of linking, without assumption of score distributions. This makes it possible to compare on a common scale persons measured in different groups and with different items.
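A minimal sketch of the linking idea for Rasch difficulties. The estimates below are hypothetical; real linking uses dedicated methods (e.g. mean/mean or Stocking-Lord), but under the Rasch model the transformation reduces to an additive shift:

```r
# Difficulty estimates for the same anchor items from two samples
b_sample1 <- c(-1.2, 0.0, 0.8, 1.5)
b_sample2 <- c(-0.9, 0.3, 1.1, 1.8)   # same items, scale shifted by a constant

# Under the Rasch model the two scales differ by an additive constant,
# so linking is just shifting by the mean difference
shift <- mean(b_sample1) - mean(b_sample2)
b_sample2_linked <- b_sample2 + shift
b_sample2_linked
# [1] -1.2  0.0  0.8  1.5
```

After linking, item parameters from both samples sit on a common scale, which is what makes large linked item pools possible.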
The logistic model
\(p(x = 1 | z) = \frac{e^z}{1 + e^z}\)
The logistic regression model
\(p(x = 1 | g) = \frac{e^{\beta_0 + \beta_1g}}{1 + e^{\beta_0 + \beta_1g}}\)
The Rasch model
\(p(x_j = 1 | \theta, b_j) = \frac{e^{\theta - b_j}}{1 + e^{\theta - b_j}}\)
So, the Rasch model is just the logistic regression model in disguise
rasch <- function(person, item) {
  # P(x = 1) = exp(theta - b) / (1 + exp(theta - b))
  exp(person - item) / (1 + exp(person - item))
}
rasch(person = 1, item = 1.5)
# [1] 0.3775407
rasch(person = 1, item = 1)
# [1] 0.5
For the 1-PL and the Rasch model, the probability of getting an item correct is a function of the distance between the item's location and the person's location.
For the 2-PL, this is also a function of how well the item differentiates among people at different locations.
For the 3-PL, the probability is a function of item difficulty, item discrimination, and a guessing (lower-asymptote) parameter
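Extending the rasch() function above, hedged sketches of the 2-PL and 3-PL response functions (a = discrimination, b = difficulty, c = guessing; the parameter values below are made up for illustration):

```r
# 2-PL: discrimination a scales the person-item distance
pl2 <- function(theta, a, b) {
  exp(a * (theta - b)) / (1 + exp(a * (theta - b)))
}

# 3-PL: lower asymptote c allows for guessing on the item
pl3 <- function(theta, a, b, c) {
  c + (1 - c) * pl2(theta, a, b)
}

pl2(theta = 1, a = 1, b = 1)          # with a = 1, reduces to the Rasch model
# [1] 0.5
pl3(theta = 1, a = 1, b = 1, c = 0.2) # 0.2 + 0.8 * 0.5
# [1] 0.6
```

Note that with a = 1 and c = 0 both functions collapse back to the Rasch model, which is why the models are often described as nested.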
Similar to the SEM, the standard error of estimate (SEE) allows us to quantify uncertainty about a person's score within IRT
Information is inversely related to the SEE (SEE = 1/√information) and tells us how precise our estimates are
We can use this to select items and develop tests!
We can also create 95% confidence intervals with this information
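A sketch of how information yields the SEE and a 95% interval under the Rasch model. The person location and item difficulties are hypothetical, and rasch() is redefined here so the snippet is self-contained:

```r
rasch <- function(person, item) {
  exp(person - item) / (1 + exp(person - item))
}

theta_hat <- 0.5                       # estimated person location
b <- c(-1.5, -0.5, 0.0, 0.5, 1.0)      # hypothetical item difficulties

p    <- rasch(theta_hat, b)            # response probabilities at theta_hat
info <- sum(p * (1 - p))               # Rasch test information: sum of p(1 - p)
see  <- 1 / sqrt(info)                 # standard error of estimate
ci   <- theta_hat + c(-1.96, 1.96) * see

see
ci
```

With only five items the interval is wide; items maximize information where p is near .5, which is why item selection targets difficulties near a person's location.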