Beyond Additional Human Scores: Automated Scoring Reliability Estimation in ILSAs
May 13 @ 4:00 pm – 5:00 pm EDT
This webinar is presented by the Artificial Intelligence in Measurement and Education SIGIMIE.
Presenter: Jenny (Ji Yoon) Jung, TIMSS & PIRLS International Study Center
Scoring reliability has traditionally depended on inter-rater agreement, which requires additional human scores that are costly, time-consuming, and vulnerable to rater effects. These challenges are especially amplified in international large-scale assessments (ILSAs), which involve massive volumes of multilingual responses across more than 60 countries. To address this, we propose the Linguistic-integrated Reliability Audit (LiRA), a novel framework that automatically generates data-driven benchmark scores without requiring additional human raters. Applying LiRA to the Progress in International Reading Literacy Study (PIRLS), we estimated cross-country scoring reliability (CCSR) at the item, country, and language levels. This approach offers significantly broader coverage than conventional CCSR methods, which rely on small subsets of English-language responses. Furthermore, LiRA enables direct, human-score-independent comparisons between human and AI scoring reliability. Our findings position LiRA as a scalable and reproducible framework for scoring reliability estimation in ILSAs.
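To make the core idea concrete, the sketch below shows one way agreement with automatically generated benchmark scores could be summarized at the item, country, or language level. It is a hypothetical illustration only, not the LiRA implementation: the column names (human_score, benchmark_score, item, country, language), the input file, and the choice of quadratic weighted kappa as the agreement statistic are all assumptions.

```python
# Hypothetical sketch: reliability estimation against benchmark scores
# in place of a second human rater. Assumes one human score and one
# automatically generated benchmark score per response.
import pandas as pd
from sklearn.metrics import cohen_kappa_score


def agreement(group: pd.DataFrame) -> float:
    """Quadratic weighted kappa between human and benchmark scores."""
    return cohen_kappa_score(
        group["human_score"], group["benchmark_score"], weights="quadratic"
    )


def reliability_by(df: pd.DataFrame, level: str) -> pd.Series:
    """Summarize agreement at a chosen level: 'item', 'country', or 'language'."""
    return df.groupby(level).apply(agreement)


# Example usage with hypothetical scored-response data:
# df = pd.read_csv("scored_responses.csv")  # columns assumed as noted above
# print(reliability_by(df, "country"))
```

Grouping the same agreement statistic by item, country, or language would yield the kind of multi-level reliability estimates the abstract describes, without requiring any additional human scores.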

