Chapter 15 Introduction to Item Response Theory
LEARNING OUTCOMES
- Identify the key assumptions of Classical Test Theory (CTT) and Item Response Theory (IRT), critically evaluating CTT’s limitations.
- Distinguish the relevant features of the different IRT models for dichotomous and polytomous data.
- Explain the parameters bj, aj, and cj used in logistic models and interpret how these parameters are displayed in Item Characteristic Curves (ICCs).
- Evaluate the reliability of items and tests by inspecting the Item Information Functions (IIFs) and the Test Information Functions (TIFs).
In Chapter 12 (Scale development), we emphasized the importance of inspecting the psychometric quality of the scale once we have pre-tested our initial pool of items. To do so, we rely on four key psychometric principles: validity, reliability, fairness, and comparability (Mislevy, Wilson, Ercikan, & Chudowsky, 2003). Guided by these principles, we evaluate the original scale and usually reduce the initial pool of items, retaining the best items that enhance the psychometric quality of the scale. After this quality-control process, the remaining items are field tested, with their own iterative psychometric quality checks.
Classical Test Theory (CTT) has been (and still is) widely applied in scale development. However, in some fields (e.g., educational assessment) this psychometric model has proven problematic, prompting much-needed alternatives in the form of the various models developed under the framework of Item Response Theory (IRT). Although the antecedents of IRT can be traced back to Thurstone's measurement model, its development and expansion flourished in the 1960s thanks to pioneers such as Frederic Lord, Melvin Novick, and Georg Rasch (Bock, 1997).
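The IRT models introduced here, and the parameters bj (difficulty), aj (discrimination), and cj (pseudo-guessing) named in the learning outcomes, can be previewed with a short sketch of the three-parameter logistic (3PL) model. The item parameter values below are hypothetical, chosen only for illustration.

```python
import math

def icc_3pl(theta, a, b, c):
    """Item Characteristic Curve under the three-parameter logistic (3PL)
    model: P(theta) = c + (1 - c) / (1 + exp(-a * (theta - b)))."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical item: discrimination a = 1.5, difficulty b = 0.0,
# pseudo-guessing c = 0.2 (the lower asymptote of the curve).
for theta in (-3, -1, 0, 1, 3):
    p = icc_3pl(theta, a=1.5, b=0.0, c=0.2)
    print(f"theta = {theta:+d}  P(correct) = {p:.2f}")
```

At theta = b the curve passes through (1 + c)/2 (0.60 here), and as theta decreases the probability approaches the guessing floor c; evaluating P over a range of theta values traces the S-shaped ICC discussed in this chapter.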
Limitations of Classical Test Theory (CTT)
- CTT assumes a linear relationship between the latent variable and the observed scores, a linear trend that is rarely observed in practice
- Test statistics (e.g., reliability, standard error of measurement) depend on the sample and population being measured. Global test statistics are sensitive to the variability of the participants' responses
- CTT does not provide a theoretical model for the responses given to the items. We cannot evaluate the responses of one respondent as a function of the level of ability or trait for a given item. Consequently, the level of analysis in CTT is the overall test, not the individual items
- CTT recommends developing scales with a large number of items to sample the universe of items measuring a psychological construct because, under its assumptions, longer scales are more reliable than shorter ones
- Item statistics (e.g., difficulty, discrimination) depend on the sample and population being measured. For example, item difficulty indices (the proportion of correct responses) will be higher when respondents are above average in the trait being measured. Likewise, discrimination indices tend to be higher in samples with greater variability because they are estimated using correlation coefficients
- In CTT, a test's standard error of measurement is assumed to be constant across an entire sample or population. Thus, regardless of the raw test score (e.g., high or low), the standard error of measurement attached to each score remains the same
- CTT is well suited for test scores produced by respondents with an average level of the ability or trait. However, for extreme scores, CTT is problematic
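The constant standard error of measurement criticized in the last two points follows directly from the classical formula SEM = SD × sqrt(1 − rxx), which uses only group-level statistics. A minimal sketch, with hypothetical values for the score standard deviation and the reliability coefficient:

```python
import math

def ctt_sem(sd, reliability):
    """Classical standard error of measurement: SEM = SD * sqrt(1 - r_xx).
    Because it depends only on group-level statistics, CTT attaches the
    same error to every raw score, average or extreme."""
    return sd * math.sqrt(1.0 - reliability)

# Hypothetical test: score SD = 10, reliability (e.g., alpha) = 0.84
sem = ctt_sem(sd=10.0, reliability=0.84)
print(f"SEM = {sem:.1f} for every respondent, high or low scoring")
```

By contrast, the Item and Test Information Functions covered in this chapter yield a standard error that varies with the respondent's level of the trait, which is precisely what the CTT framework cannot provide.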