Educational justice. Reliability and consistency of large language models for automated essay scoring and its implications

Siti Bealinda Qinthara Rony; Tan Xin Fei; Sasa Arsovski

doi:10.37074/jalt.2025.8.1.21

Authors

Siti Bealinda Qinthara Rony Raffles University https://orcid.org/0009-0006-8007-2793
Tan Xin Fei Raffles University https://orcid.org/0009-0007-3823-2404
Sasa Arsovski Raffles University https://orcid.org/0000-0001-5981-9473

DOI:

https://doi.org/10.37074/jalt.2025.8.1.21

Abstract

Maintaining consistency in automated essay scoring is essential to guarantee fair and dependable assessments. This study investigates consistency and provides a comparative analysis of open-source and proprietary large language models (LLMs) for automated essay scoring (AES). Each student essay was assessed five times to measure both intrarater reliability (intraclass correlation coefficient and repeatability coefficient) and interrater reliability (concordance correlation coefficient) across several models: GPT-4, GPT-4o, GPT-4o mini, GPT-3.5 Turbo, Gemini 1.5 Flash, and LLaMa 3.1 70B. Prompts were constructed from the essays and marking criteria and sent to each LLM to obtain score outputs. Results indicate that the scores generated by GPT-4o closely align with human assessments, demonstrating fair agreement across repeated measures. Specifically, GPT-4o exhibits slightly higher concordance correlation coefficients (CCC) than GPT-4o mini, indicating closer agreement with human scores. Qualitatively, however, all of the LLMs are less consistent in their scoring rationale and evaluation. Our results indicate that the challenges currently faced in automated essay scoring with LLMs need to be analyzed not only quantitatively but also qualitatively. Additionally, we utilize more sophisticated prompting methods and address the inconsistencies observed in the initial measurements. Although some models in our study appear reliable, the selection of an LLM should be considered thoroughly in any practical AES implementation.
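As an illustration of two of the statistics named in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes Lin's standard definition of the concordance correlation coefficient and the Bland-Altman style repeatability coefficient (1.96 × √2 × within-essay standard deviation), since the abstract does not state which formulas were used. The function names and the input layout (one list of repeated scores per essay) are hypothetical.

```python
from math import sqrt
from statistics import fmean, pvariance, variance


def concordance_ccc(x, y):
    """Lin's concordance correlation coefficient between two score lists.

    Assumed use: x holds human scores, y holds model scores for the same essays.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mx, my = fmean(x), fmean(y)
    # Population (1/n) covariance and variances, as in Lin's definition.
    cov = fmean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return 2 * cov / (pvariance(x) + pvariance(y) + (mx - my) ** 2)


def repeatability_coefficient(repeated_scores):
    """Repeatability coefficient from repeated scores of the same essays.

    repeated_scores: one list per essay, each holding the scores from
    repeated runs of the same model (assumed equal length, e.g. five runs).
    """
    # Pool the within-essay sample variances by averaging them.
    within_var = fmean([variance(runs) for runs in repeated_scores])
    return 1.96 * sqrt(2) * sqrt(within_var)
```

Under these assumptions, a CCC near 1 means the model scores track the human scores in both correlation and absolute level, while a smaller repeatability coefficient means repeated runs on the same essay stay closer together.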


Author Biographies

Siti Bealinda Qinthara Rony, Raffles University

Research assistant at the Faculty of AI and Robotics, Raffles University, Malaysia.

Tan Xin Fei, Raffles University

Undergraduate student at the Faculty of AI and Robotics, Raffles University, Malaysia.

Sasa Arsovski, Raffles University

Professor and dean of the Faculty of AI and Robotics, Raffles University, Malaysia.
