Swapna Haresh Teckwani*, Amanda Huee-Ping Wong, and Ivan Cherh Chiet Low*
Department of Physiology,
Yong Loo Lin School of Medicine, NUS
*swapnaht@nus.edu.sg; phsilcc@nus.edu.sg
Teckwani, S. H., Wong, A. H.-P., & Low, I. C. C. (2024). Accuracy and reliability of large language models in assessing learning outcomes achievement across cognitive domains [Paper presentation]. Higher Education Conference in Singapore (HECS) 2024, 3 December, National University of Singapore. https://blog.nus.edu.sg/hecs/hecs2024-teckwani-et-al/
SUB-THEME
Opportunities from Generative AI
KEYWORDS
ChatGPT, large language model, grading, assessment, Bloom’s taxonomy
CATEGORY
Paper Presentation
INTRODUCTION
Rapid advancements in artificial intelligence (AI) have significantly impacted various sectors, notably education. AI, particularly through large language models (LLMs) such as ChatGPT and Gemini, has introduced new opportunities in higher education, offering personalised feedback, supporting the development of problem-solving skills, and enhancing learning experiences (Kasneci et al., 2023; Moorhouse et al., 2023; Yan et al., 2024). However, the integration of AI into educational assessment, especially the grading of written assignments, remains controversial. This study evaluated the accuracy and reliability of LLMs, relative to human graders, in assessing learning outcomes in a scientific inquiry course on sports physiology. The efficacy of LLMs in providing feedback on the graded assignments was also evaluated.
METHODS
This study involved 40 undergraduate students enrolled in the HSI2002 course “Inquiry into Current Sporting Beliefs and Practices”. Students attended three tutorial sessions, each focusing on a different topic in sports physiology. After each tutorial, students submitted a one-page written assignment that was assessed on the “Understand”, “Analyse”, and “Evaluate” domains of the revised Bloom’s taxonomy, as well as on “Scientific Inquiry Competency”.
A total of 117 assignments were independently scored by two human graders and three LLMs: GPT-3.5, GPT-4o, and Gemini. The assessment rubrics, aligned with the revised Bloom’s taxonomy, were engineered into language prompts for the LLMs. Each LLM graded the assignments twice to assess scoring reliability. Paired t-tests and Pearson correlation coefficients were used to compare mean scores and to assess inter-rater reliability (IRR).
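For illustration, the sketch below shows how the reliability statistics described above might be computed for one pair of raters in Python with SciPy. The score arrays and the ±1-point tolerance used to define “agreement” are hypothetical placeholders; the study’s actual data and its exact agreement criterion are not reproduced here.

```python
# Illustrative sketch of the reliability analysis for one pair of raters.
# The scores and the +/-1-point "agreement" tolerance are hypothetical;
# the study's actual data and agreement criterion are not reproduced here.
import numpy as np
from scipy import stats

rater_1 = np.array([14, 12, 16, 10, 15, 13, 11, 17])  # hypothetical overall scores
rater_2 = np.array([13, 12, 15, 11, 16, 12, 10, 17])  # same assignments, second rater

# Paired t-test: do the two raters differ systematically in mean score?
t_stat, t_p = stats.ttest_rel(rater_1, rater_2)

# Pearson correlation: do the two raters rank the assignments similarly?
r, r_p = stats.pearsonr(rater_1, rater_2)

# Percentage agreement: share of assignments scored within the assumed tolerance.
agreement = np.mean(np.abs(rater_1 - rater_2) <= 1) * 100

print(f"Paired t-test: t = {t_stat:.2f}, p = {t_p:.3f}")
print(f"Pearson correlation: r = {r:.3f}, p = {r_p:.3f}")
print(f"Agreement (within 1 point): {agreement:.0f}%")
```

The same comparisons can be run between the two human graders, between a human grader and an LLM, or between an LLM’s two grading rounds.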
RESULTS
Mean overall scores and mean scores for each learning domain were comparable between the first and second raters for both human and LLM graders. However, GPT-3.5 consistently scored lower, and GPT-4o higher, than the human graders, while Gemini’s scores were similar to those of the human graders.
IRR analysis of the overall assignment scores revealed excellent agreement (80%) and correlation (r = 0.936) between the human raters. In contrast, Gemini showed good agreement (71%) and correlation (r = 0.672) between its two grading rounds, whereas GPT-3.5 and GPT-4o showed only moderate agreement (40% and 49%, respectively) and no significant correlation between rounds. When human and LLM raters were compared, all LLMs were only in moderate agreement with the human graders, and a weak correlation (r = 0.271) was observed only between GPT-4o and the human graders.
Human graders exhibited excellent inter-rater agreement (≥ 80%) in the “Understand”, “Evaluate”, and “Scientific Inquiry Competency” categories, with slightly lower agreement (72%) in the “Analyse” category. LLMs demonstrated poorer inter-rater agreement than human graders. Among the LLMs, Gemini showed the highest inter-rater agreement, with good agreement (50-79%) in three categories and excellent agreement (80%) in the “Analyse” category. GPT-3.5 exhibited the lowest inter-rater agreement, with only moderate agreement (30-49%) across all categories. GPT-4o showed slightly better inter-rater agreement than GPT-3.5, with good agreement (56%) in “Scientific Inquiry Competency” and moderate agreement (45-47%) in the other categories. All LLMs showed only moderate agreement (30-49%) with human graders across all learning categories.
Correlation analysis revealed that Gemini had strong correlations between its two grading rounds for the “Understand” and “Analyse” categories, but only moderate correlations for “Evaluate” and “Scientific Inquiry Competency”. GPT-3.5 and GPT-4o showed no significant correlations between their grading rounds in any category. In contrast, the human graders displayed strong correlations across all categories. Comparing LLM scores with human scores likewise revealed no significant correlations, highlighting the current limitations of LLMs in achieving human-level grading reliability.
CONCLUSION
While LLMs demonstrated potential in grading written assignments, they do not yet match the assessment standards of human graders. The study highlighted the superior consistency of human graders and only moderate concordance between human and LLM graders. These findings underscore the need for continued improvement in LLM technologies, and for adaptation by educators, to fully harness AI’s potential in educational assessment. LLMs nonetheless exhibited promising capabilities in providing personalised and constructive feedback.
REFERENCES
Kasneci, E., Sessler, K., Küchemann, S., Bannert, M., Dementieva, D., Fischer, F., Gasser, U., Groh, G., Günnemann, S., Hüllermeier, E., Krusche, S., Kutyniok, G., Michaeli, T., Nerdel, C., Pfeffer, J., Poquet, O., Sailer, M., Schmidt, A., Seidel, T., Stadler, M., Weller, J., Kuhn, J., & Kasneci, G. (2023). ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences, 103, 102274. https://doi.org/10.1016/j.lindif.2023.102274
Moorhouse, B. L., Yeo, M. A., & Wan, Y. (2023). Generative AI tools and assessment: Guidelines of the world’s top-ranking universities. Computers and Education Open, 5, 100151. https://doi.org/10.1016/j.caeo.2023.100151
Yan, L., Sha, L., Zhao, L., Li, Y., Martinez-Maldonado, R., Chen, G., Li, X., Jin, Y., & Gašević, D. (2024). Practical and ethical challenges of large language models in education: A systematic scoping review. British Journal of Educational Technology, 55(1), 90-112. https://doi.org/10.1111/bjet.13370