E-411 PRMA

Lecture 13 - Equating

Christopher David Desjardins

Motivation

Consider the salary of a teacher at a school now and in 1950

Is it fair to compare their salaries?

Their salaries will almost certainly not be the same

The krona has changed a lot, right?

How can we most fairly compare these salaries?

One possibility would be to compare the salaries against a set of comparable goods (e.g. the price of milk, a liter of gas, a stamp, etc.)

How should we compare test scores on different forms of a test?

Equating in testing

  • We often have multiple forms of a test
    • Parallel or alternate form
  • How might we compare these forms?
    • Sum up number correct
    • Calculate percent correct
    • Raw Scores
  • Problem: These tests are composed of different questions with differences in item difficulties
  • Why is this a problem?

Scaled Scores

Tests need to be comparable across forms

We need to scale our scores to adjust for different difficulty

For each possible raw score, we will come up with a scaled score based on the difficulty of the questions

Review: What measurement framework do you think we are using?

Scaled Scores
Raw Score   Form A   Form B   Form C
50          130      130      130
49          130      130      128
48          129      130      126
47          127      130      124
46          126      130      122
45          124      129      120
44          121      128      118
43          119      127      115
42          118      126      114
41          117      125      113
40          116      124      110

Equating Process

  • The first form that is used to derive scale scores is the base form
  • After the raw-to-scale conversion has occurred this form is on scale
  • We then equate a new form to a form that is already on scale
  • The form already on scale is our reference form and the form that is not yet equated is our new form
  • Once our new form is equated, we convert each new test taker's raw score to its equivalent raw score on the reference form and then use the reference form's raw-to-scale conversion to derive scaled scores
  • An issue: the equated reference-form scores can be values that are not actually attainable (e.g. 42.80), because raw scores are discrete
New Form Raw-to-Raw        Reference Form Raw-to-Scale
New     Reference          Reference     Scaled
...     ...                ...           ...
39      43.25              44            109.765
38      42.80              43            107.643
37      41.75              42            106.902
36      41                 41            103.853

What should someone with a 38 on the new form get for a scaled score?

Test takers with a 38 on the new form

38's reference test score was 42.80

This is 80% of the way between 42 and 43


# 80% of the scaled-score difference between reference scores 42 and 43
(107.643 - 106.902) * .80
[1] 0.5928

# Add this to the scaled score for 42
106.902 + 0.5928
[1] 107.4948

They should get a scaled score of 107.4948
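
The same interpolation can be done in one step with base R's approx(), which interpolates linearly between known points. This is only a sketch using the four reference-form rows shown in the table; the object names ref_raw, ref_scaled and scaled_38 are made up for illustration.

```r
# Reference form raw-to-scale conversion (rows taken from the table)
ref_raw    <- c(41, 42, 43, 44)
ref_scaled <- c(103.853, 106.902, 107.643, 109.765)

# A 38 on the new form equates to 42.80 on the reference form;
# approx() interpolates linearly between the neighbouring raw scores
scaled_38 <- approx(x = ref_raw, y = ref_scaled, xout = 42.80)$y
scaled_38
```

This should agree with the hand calculation of 107.4948.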

Scale Decisions in equating

  • Choosing the range of scale scores
    • Don't want scale to look like total or percent correct
  • How fine should our scale be?
    • Usually, want each raw score to correspond to a unique scaled score
    • Need to be careful to not exaggerate precision
  • Often truncate the scaled scores at both ends of the scale
    • Truncating at the upper end allows test takers on a form easier than the reference form to get the highest possible scaled score
    • Truncating at the lower end avoids meaningless distinctions among scores that could be obtained by chance alone

How to create the raw-to-scale conversion

Either decide on the scaled-score mean and standard deviation for a group of test takers

Or choose two raw scores, specify their scaled scores, then linearly interpolate the other scores
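
Both approaches lead to a linear raw-to-scale conversion. The sketch below assumes that "deciding on the mean and standard deviation" means fixing the scaled-score mean and SD for a group; the function names and all numbers are hypothetical and only meant to illustrate the arithmetic.

```r
# Approach 1: fix the scaled-score mean and SD for a group of test takers
scale_by_moments <- function(raw, raw_mean, raw_sd, scaled_mean, scaled_sd) {
  scaled_mean + scaled_sd * (raw - raw_mean) / raw_sd
}

# Approach 2: fix the scaled scores of two raw scores, interpolate the rest
scale_by_points <- function(raw, raw_pts, scaled_pts) {
  slope <- diff(scaled_pts) / diff(raw_pts)
  scaled_pts[1] + slope * (raw - raw_pts[1])
}

# Hypothetical numbers, for illustration only
s1 <- scale_by_moments(raw = 38, raw_mean = 30, raw_sd = 8,
                       scaled_mean = 100, scaled_sd = 10)
s2 <- scale_by_points(raw = 35, raw_pts = c(20, 50), scaled_pts = c(100, 130))
c(s1, s2)
```

Here a raw score one SD above the group mean lands one scaled SD above the scaled mean (110), and a raw score halfway between the two anchor points lands halfway between their scaled scores (115).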

General limitations of equating

A test taker may know more answers on one form of a test

Equating is unable to adjust scores correctly for every test taker!

We strive to be approximately correct for our target population

Two groups could differ based on emphasized material (e.g. a teacher effect)

Equating results in discrete scores (well, we report them that way)

Symmetry of Equating

A score of 20 on test form A corresponds to a score of 25 on test form B

A score of 25 on test form B corresponds to a score of 20 on test form A

This is known as symmetry

Statistical prediction isn't like this!

Cars again


mod1 <- lm(speed ~ dist, cars)
mod2 <- lm(dist ~ speed, cars)
predict(mod1, newdata = list(dist = 100))
       1 
24.84066 
predict(mod2, newdata = list(speed = 24.84066))
       1 
80.10453 
 					

Equating Designs

  • To make scores comparable you need something similar across the forms
  • This could involve ...
    • Same group
      • Differences in score distributions are a function of form difficulty
    • Equivalent groups
      • Two random samples from the same population
      • Group ability, again, assumed constant and differences in score distributions are a function of form difficulty
    • Nonequivalent group
      • Two random samples from two populations
      • Common anchor items are necessary
      • Equating methods more complex

Our first def'n of Equating

“A score on the new form and a score on the reference form are equivalent in a group of test takers if they represent the same relative position in the group.”

Mean Equating

The simplest form of equating involves adjusting the scores by the difference in means between the reference and new forms

Subtraction of the difference if the new form is easier

Addition of the difference if the new form is harder

Example

Suppose the target population's mean on the reference form was 80 and their mean on the new form was 85.

  1. Which form was harder?
  2. What should someone with a 90 on the new form get on the reference form if we were using mean equating?
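
One way to check your answer is a small R sketch of mean equating. The function name mean_equate is made up for illustration; it simply shifts a new form score by the difference between the two form means.

```r
# Mean equating: shift every new-form score by the difference in form means
mean_equate <- function(nf, nf_mean, rf_mean) {
  nf + (rf_mean - nf_mean)
}

# Means from the example: reference form 80, new form 85
adjusted_90 <- mean_equate(nf = 90, nf_mean = 85, rf_mean = 80)
adjusted_90
```

Because the same group scored 5 points higher on the new form, the new form was easier, so 5 points are subtracted and a 90 becomes an 85.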

Problem with mean equating (Livingston, 2014)

Linear equating

We need to adjust based on how high or low a test taker's score is from the mean

What might we consider doing?

A better def'n of equating

“A score on the new form and a score on the reference form are equivalent in a group of test takers if they are the same number of standard deviations above or below the mean of the group.”

Linear equating a harder new form (Livingston, 2014)

Linear equating conceptually

  • Make the adjusted new form mean equal to the reference form mean
  • Do the same for scores any given number of standard deviations above or below the mean
  • Do this for every possible score
  • This results in a linear relationship between the new form raw scores and the new form adjusted scores

Doing the maths!

Let NF stand for a score on the new form and RF a score on the reference form

$$ \frac{RF - \bar{RF}}{sd(RF)} = \frac{NF - \bar{NF}}{sd(NF)} $$

What do these formulas look like?

Our new adjusted score

$$ RF = \frac{sd(RF)}{sd(NF)}NF + \bar{RF} - \frac{sd(RF)}{sd(NF)}\bar{NF} = \text{adjusted } NF $$

Note, the adjusted NF score is very unlikely to ever be a whole number
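
The formula can be wrapped in a small R function. This is a sketch only: the name linear_equate and the form statistics below are hypothetical. Running the function in both directions also illustrates the symmetry property, since equating back from the reference form recovers the original new form score.

```r
# Linear equating: match z-scores on the two forms
linear_equate <- function(x, from_mean, from_sd, to_mean, to_sd) {
  to_mean + (to_sd / from_sd) * (x - from_mean)
}

# Hypothetical form statistics, for illustration only
nf_mean <- 60; nf_sd <- 10
rf_mean <- 65; rf_sd <- 12

# New form -> reference form (70 is one SD above the new form mean)
rf_equiv <- linear_equate(70, nf_mean, nf_sd, rf_mean, rf_sd)
rf_equiv

# Reference form -> new form gives back the original score (symmetry)
nf_back <- linear_equate(rf_equiv, rf_mean, rf_sd, nf_mean, nf_sd)
nf_back
```

A 70 on the new form maps to one SD above the reference mean (77), and mapping 77 back returns 70, unlike the two regression predictions in the cars example.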

Example

Form        Mean   Standard Deviation
Reference   82     15
New         79     14

If someone scored an 80 on the new form, what should their reference form score be?

$$RF = \frac{15}{14}80 + 82 - \frac{15}{14}79$$

# Do the math in R and save it as RF
RF <- (15 / 14) * 80 + 82 - (15 / 14) * 79

# Print RF
RF
[1] 83.07143

Does 83.07143 seem sensible?

Problems with linear equating

A very high or very low score can equate to a score outside of the range on the reference form

Depends heavily on the group of test takers (e.g. are they strong test takers? weak test takers?)
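
The out-of-range problem is easy to reproduce. The sketch below reuses the form statistics from the example (reference: mean 82, SD 15; new: mean 79, SD 14) and assumes, purely for illustration, that both forms have a maximum raw score of 100; the name linear_equate is also made up.

```r
# Linear equating of a new form (NF) score onto the reference form (RF)
linear_equate <- function(nf, nf_mean, nf_sd, rf_mean, rf_sd) {
  rf_mean + (rf_sd / nf_sd) * (nf - nf_mean)
}

max_score <- 100  # assumed maximum raw score on both forms

# Equate the highest possible new form score
top <- linear_equate(max_score, nf_mean = 79, nf_sd = 14,
                     rf_mean = 82, rf_sd = 15)
top

# Is the equated score higher than anything attainable on the reference form?
top > max_score
```

Under these assumptions a perfect score on the new form equates to 104.5, which is not a possible score on the reference form.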

Equipercentile equating

“To equate scores on the new form to scores on the reference form in a group of test takers, transform each score on the new form to the score on the reference form that has the same percentile rank in that group.”

Equipercentile equating with a harder new form

Equipercentile equating

The 15th percentile of the adjusted new form scores corresponds (as closely as possible) to the 15th percentile on the reference form, and so on

Adjusted scores will all fall within the range of possible scores on the reference form

The steepness of the slope of the curve can vary

Will result in the adjusted test scores having a similar distribution to the reference form

Will be identical to linear equating when the distribution of scores on the new form has the same shape as the distribution of the scores on the reference form
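
A rough sketch of the idea in base R follows. It is not the procedure of any particular package: the scores are simulated, percentile ranks use a mid-percentile convention, the reference form scores are obtained with quantile() and its default interpolation, and no smoothing is applied. All object and function names are made up for illustration.

```r
set.seed(411)
n_items <- 50

# Simulated scores for one group: the new form is harder than the reference
ref_scores <- rbinom(2000, n_items, 0.70)
new_scores <- rbinom(2000, n_items, 0.60)

# Percentile rank (as a proportion): share below x plus half the share at x
percentile_rank <- function(x, scores) {
  sapply(x, function(v) mean(scores < v) + 0.5 * mean(scores == v))
}

# Map each new form score to the reference score with the same percentile rank
equipercentile <- function(x, new, ref) {
  p <- percentile_rank(x, new)
  as.numeric(quantile(ref, probs = p))
}

adjusted <- equipercentile(0:n_items, new_scores, ref_scores)

# Adjusted scores stay within the observed range of reference form scores
range(adjusted)
range(ref_scores)
```

Because the adjusted scores are quantiles of the observed reference scores, they can never fall outside the lowest and highest reference scores actually observed.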

Smoothing

Limitations of equipercentile

Equating relationship is bounded by the highest and lowest observed scores

On a difficult test, the highest possible raw score might not be observed

Future administration could result in a higher score being observed

Smoothing may help with this

Again, the discreteness problem (Livingston, 2014)

Can use interpolation to calculate scaled scores for unobserved raw scores

Concluding remarks on equating

  • Lots of other equating methods exist beyond these three
  • Lots of other equating designs exist beyond those (briefly) introduced here
  • Non-equivalent group designs are tricky
  • The equate package in R does all of this (and more)