Educational justice. Reliability and consistency of large language models for automated essay scoring and its implications

Authors

DOI:

https://doi.org/10.37074/jalt.2025.8.1.21

Abstract

Consistency in automated essay scoring is essential for fair and dependable assessment. This study investigates scoring consistency and provides a comparative analysis of open-source and proprietary large language models (LLMs) for automated essay scoring (AES). Student essays were each assessed five times to measure both intrarater reliability (using the intraclass correlation coefficient and the repeatability coefficient) and interrater reliability (using the concordance correlation coefficient, CCC) across several models: GPT-4, GPT-4o, GPT-4o mini, GPT-3.5 Turbo, Gemini 1.5 Flash, and LLaMa 3.1 70B. Prompts were constructed from the essays and marking criteria and sent to each LLM to obtain score outputs. Results indicate that the scores generated by GPT-4o closely align with human assessments, demonstrating fair agreement across repeated measures. Specifically, GPT-4o exhibits a slightly higher CCC than GPT-4o mini, indicating closer agreement with human scores. Qualitatively, however, the LLMs are less consistent in their scoring rationale and evaluation. Our results indicate that the challenges currently faced in automated essay scoring with LLMs need to be analyzed not only quantitatively but also qualitatively. Additionally, we apply more sophisticated prompting methods to address the inconsistencies observed in the initial measurements. Despite the apparent reliability of some models in our study, the choice of LLM should be considered carefully in any practical implementation of AES.
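As an illustration of two of the reliability measures named in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes Lin's standard definition of the concordance correlation coefficient (with population variances) and the common Bland-Altman form of the repeatability coefficient, 1.96 × √2 × within-essay standard deviation. The function names and the idea of passing scores as plain lists are hypothetical choices made for this example; the study's exact computation is not given in the abstract.

```python
import math
from statistics import fmean, pvariance, variance


def concordance_ccc(x, y):
    """Lin's concordance correlation coefficient between two score lists.

    Assumption: x holds human scores and y holds model scores for the
    same essays, in the same order.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long lists with at least 2 scores")
    mean_x, mean_y = fmean(x), fmean(y)
    # Population covariance (divide by n), matching pvariance below.
    cov = fmean([(a - mean_x) * (b - mean_y) for a, b in zip(x, y)])
    denom = pvariance(x) + pvariance(y) + (mean_x - mean_y) ** 2
    if denom == 0:
        raise ValueError("CCC is undefined when both lists are constant and equal")
    return 2 * cov / denom


def repeatability_coefficient(runs_per_essay):
    """Repeatability coefficient from repeated scores of each essay.

    Assumption: runs_per_essay is a list of lists, one inner list per
    essay, each holding the repeated scores from one model.
    """
    # Within-essay variance: mean of the per-essay sample variances
    # (valid as written when every essay has the same number of runs).
    within_var = fmean([variance(runs) for runs in runs_per_essay])
    return 1.96 * math.sqrt(2) * math.sqrt(within_var)
```

A CCC near 1 means model scores track human scores in both correlation and absolute level, while a smaller repeatability coefficient means repeated runs on the same essay stay closer together. For example, `concordance_ccc(human_scores, model_scores)` could be called once per model, and `repeatability_coefficient` on the five repeated scores per essay.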

Author Biographies

  • Siti Bealinda Qinthara Rony, Raffles University

    Research assistant at the Faculty of AI and Robotics, Raffles University, Malaysia.

  • Tan Xin Fei, Raffles University

    Undergraduate student at the Faculty of AI and Robotics, Raffles University, Malaysia.

  • Sasa Arsovski, Raffles University

    Professor and Dean of the Faculty of AI and Robotics, Raffles University, Malaysia.

Published

2025-01-20

How to Cite

Educational justice. Reliability and consistency of large language models for automated essay scoring and its implications. (2025). Journal of Applied Learning and Teaching, 8(1), 67-77. https://doi.org/10.37074/jalt.2025.8.1.21