E-411 PRMA

Lecture 7 - Test Development

Christopher David Desjardins

Review

Classical Test Theory & Reliability

Test-Retest

Parallel Form

Internal Consistency

Quantifying uncertainty (i.e. standard error of measurement)

Review

Validity

Content

Criterion-Related

Construct

Tests, Tests, Tests

Test Development

  • Conceptualization
  • Construction
  • Piloting
  • Item Analysis
  • Revision

Test Conceptualization

  • Identify a need for this test

  • Identify a purpose for this test

  • No test exists

  • People aren't static, so tests need to be revised

  • What else should we be thinking about when we're conceptualizing a test?

Norm or Criterion-Referenced

Does it matter?

  • Norm-Referenced
    • Want testtakers with high test scores to get the item correct and testtakers with low test scores to get the item incorrect
    • Want to spread out testtakers

  • Criterion-Referenced
    • Want testtakers with high test scores to get the item correct and testtakers with low test scores to get the item incorrect
    • However, each item needs to measure whether a criterion is met

Test Construction

  • Now that we know why, we have to know how

  • We need a set of rules for assigning numbers in measurement - scaling

  • In psychology, scales are instruments used to measure traits, states, or abilities

Scales

Types of Scales

  • Nominal, ordinal, interval, or ratio

  • Examples?

Rating Scales

  • Testtaker indicates their response to an item by selecting among options of varying strength

  • Examples: Stealing

  • Likert-Type are common rating scales

  • Scores from the test could be summed directly (summative scoring, as sketched below); factor analysis or item response theory could also be used
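
A minimal sketch of summative scoring in R; the data frame resp and the 1-5 response coding are made up for illustration.

    # Hypothetical responses from 4 testtakers to 3 Likert-type items
    # (1 = Strongly Disagree, ..., 5 = Strongly Agree)
    resp <- data.frame(item1 = c(4, 2, 5, 3),
                       item2 = c(5, 1, 4, 3),
                       item3 = c(4, 2, 5, 2))

    # Summative (cumulative) scoring: sum the item responses for each testtaker
    rowSums(resp)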

Scale Issues

“Downloading movies is the same as stealing”
Strongly Agree / Agree / Neither Agree nor Disagree / Disagree / Strongly Disagree

Are the distances the same between the choices?

What might affect our choices?

More Scales

  • Paired comparisons - testtaker chooses between two options; the choice is scored based on some criterion

  • Comparative scaling - items are arranged (e.g., ranked) relative to one another based on some criterion

  • Categorical scaling - items are placed into two or more categories

  • Guttman scale - items written in a sequential manner such that someone higher on the trait will agree with everything from the strongest statements through the mildest statements

Let's write some items that use these scales

Let's brainstorm 10 nouns

Interval Scales

  • Could use Thurstone's equal-appearing intervals method (pp. 250-251)

  • I am skeptical ... what do you think?

Let's write a test!

Writing Items

  • What content should the items cover?
  • What should the format of the items be?
  • How many items should be written and for each content area?
  • Book recommends writing 2x the number of items for the item bank/pool . . . seems a bit excessive

Types of items

Selected-response vs. constructed-response

Selected-Response

  • Types

    • Multiple-choice

    • Binary-choice

    • Matching

  • Each item will have a stem, correct choice, and distractors

  • A good multiple-choice item on an achievement test
    • Has only one correct choice

    • Has grammatically parallel alternatives

    • Has alternatives of similar length

    • Has alternatives that fit grammatically with the stem

    • Includes as much information in the stem as possible to avoid repetition

    • Avoids ridiculous distractors

    • Is not excessively long

Final thoughts on selected-response

  • There are more binary-choice formats than just true/false

  • The matching bank should have more answer choices than items and/or allow choices to be used more than once

  • Guessing is a problem in an achievement setting

  • Forcing a choice is an issue in a non-achievement setting

Constructed response

  • Completion items are fill-in-the-blank responses

  • Short-answer items require a response of a few sentences

  • Essay items are longer than short-answer items and demonstrate deeper, more thorough knowledge

  • Constructed-response items probe a specific portion of a construct more deeply but require more time

  • Subjectivity in scoring essays

  • What reliability statistic would we report here?

Scoring the items

  • Cumulative model - sum up the items on the test

  • Class scoring - based on their pattern of responses, testtakers are placed into a class with similar testtakers

  • Ipsative scoring - score on a scale within a test compared to score on another scale on same test

    • Edwards Personal Preference Schedule - measures relative strength of different psychological needs

  • Could look at both the cumulative scores on separate scales and the pattern of these scores (profile analysis)

Piloting the test

Item analysis

  • Many different ways to analyze items

  • Can focus on

    • Difficulty of item

    • Reliability of item

    • Validity of item

    • Discrimination of item

Item Difficulty

  • Proportion of testtakers that get the item correct

  • The higher the item difficulty, the easier the item

    • item-endorsement index

  • Can calculate average item difficulty for the test

  • Optimal value = $\frac{\text{Pr(Guess)} + 1}{2}$

Item Difficulty - Example

Administer an item to 10 students and 4 students get the item correct

What is the item's difficulty?

If the item was a multiple choice with 5 distractors, what is the optimal item difficulty?
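
A worked version of this example in R, assuming "5 distractors" means 5 distractors plus 1 correct choice (6 options), so Pr(Guess) = 1/6.

    # Item difficulty: proportion of testtakers answering correctly
    p <- 4 / 10                # 0.4

    # Optimal difficulty = (Pr(Guess) + 1) / 2
    optimal <- (1/6 + 1) / 2   # about 0.58
    c(difficulty = p, optimal = optimal)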

Item Reliability

  • Internal consistency of the test

  • Software often calculates changes in a reliability index (e.g. coefficient alpha) when item is deleted

  • Examine factor loadings

  • Calculate item-reliability index = $s_i * r_{i,\text{ttscore}}$

    • $s_i$, the standard deviation of item i
    • $r_{i,\text{ttscore}}$, correlation between item i and total test score

Item Reliability Index - Example

Responses to Item 1 from five testtakers: 1, 0, 0, 1, 1

Assume correlation between item 1 and total test score is 0.7
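
A sketch of the calculation in R; note that sd() uses the n - 1 denominator, so a population standard deviation would give a slightly smaller index.

    item1 <- c(1, 0, 0, 1, 1)

    # Item-reliability index = s_i * r_(i, total test score)
    s_i <- sd(item1)   # about 0.55
    r_i <- 0.7         # given correlation with total test score
    s_i * r_i          # about 0.38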

Item Validity

  • Item-validity index = $s_i * r_{i,\text{crit}}$

    • $s_i$, the standard deviation of item i
    • $r_{i,\text{crit}}$, correlation between item i and criterion measure
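
Same calculation as the item-reliability index, but correlating with an external criterion; a sketch in R with made-up criterion scores.

    item1 <- c(1, 0, 0, 1, 1)
    crit  <- c(12, 7, 9, 14, 11)   # hypothetical criterion scores

    # Item-validity index = s_i * r_(i, criterion)
    sd(item1) * cor(item1, crit)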

Item Discrimination

  • Point-biserial correlations - Are testtakers with higher abilities more likely to get the item correct?

  • IRT's discrimination parameter

  • Item discrimination index

    1. Discretize total test scores into upper and lower 27%

    2. Calculate number of "high" scores that got item correct and number of "low" scores that got item correct

    3. Calculate difference

  • Examine distractor functioning

Example in R
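
A minimal sketch in R, with made-up 0/1 response data, of the item statistics discussed above (difficulty, point-biserial discrimination, and the upper-lower 27% discrimination index).

    set.seed(1)
    # Made-up 0/1 responses: 100 testtakers x 10 items
    resp <- as.data.frame(matrix(rbinom(1000, 1, 0.6), nrow = 100))
    total <- rowSums(resp)

    # Item difficulty: proportion correct for each item
    colMeans(resp)

    # Point-biserial discrimination: correlation of each item with the total score
    sapply(resp, function(item) cor(item, total))

    # Discrimination index for item 1: difference in proportion correct
    # between (approximately) the top 27% and bottom 27% of total scores
    upper <- resp[total >= quantile(total, 0.73), 1]
    lower <- resp[total <= quantile(total, 0.27), 1]
    mean(upper) - mean(lower)

In practice, a package such as psych (e.g., psych::alpha, which also reports reliability with each item dropped) can compute these and related item statistics.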

Issues in Test Development

Guessing

Bias in favor of one group - differential item functioning

Test length and duration of testing session

Alternatives to item analysis

Think Alouds

Expert Panels

Interviews

Qualitative Methods

Test Revision

  • On what basis should we revise our items?

  • Too easy or too hard items?

  • Items with similar difficulty that are measuring the same concept?

  • Items with negative point-biserial correlations?

  • Items that on a second/third read through seem unrelated to the construct?

  • Items with low factor loadings?

  • Based on IRT?

Standardization

  • We settle on our revisions

  • Administer revised version to new sample

  • This becomes our comparison group, our standardization sample

Revising old tests

  • Tests need to be revised when the domain has significantly changed

  • Content of the items is not understood or has changed

  • Test norms are no longer adequate

  • Theory underlying the domain has changed

  • Reliability and validity of the instrument can be improved

Cross- and co-validation

  • Cross-validation - revalidation of a test on a separate, independent sample of testtakers

  • Item validities should shrink during this process (validity shrinkage)

  • Co-validation - test validation conducted on two or more tests with the same sample of testtakers

  • Creating norms, co-norming

  • Cheaper, reduces sampling error by norming on the same sample