Accuracy and Reliability of Large Language Models in Assessing Learning Outcomes Achievement Across Cognitive Domains

Swapna Haresh Teckwani*, Amanda Huee-Ping WONG, and Ivan Cherh Chiet LOW*

Department of Physiology,
Yong Loo Lin School of Medicine, NUS 

*swapnaht@nus.edu.sg; phsilcc@nus.edu.sg  

Teckwani, S. H., Wong, A. H.-P., & Low, I. C. C. (2024). Accuracy and reliability of large language models in assessing learning outcomes achievement across cognitive domains [Paper presentation]. In Higher Education Conference in Singapore (HECS) 2024, 3 December, National University of Singapore. https://blog.nus.edu.sg/hecs/hecs2024-teckwani-et-al/

SUB-THEME

Opportunities from Generative AI

KEYWORDS

ChatGPT, large language model, grading, assessment, Bloom’s taxonomy

CATEGORY

Paper Presentation 

 

INTRODUCTION

Rapid advancements in artificial intelligence (AI) have significantly impacted various sectors, notably education. AI, particularly through Large Language Models (LLMs) such as ChatGPT and Gemini, has introduced new opportunities in higher education, offering personalised feedback, developing problem-solving skills, and enhancing learning experiences (Kasneci et al., 2023; Moorhouse et al., 2023; Yan et al., 2024). However, the integration of AI in educational assessment, especially in grading written assignments, remains controversial. This study evaluates the accuracy and reliability of LLMs compared to human graders in assessing learning outcomes in a scientific inquiry course on sports physiology. The efficacy of LLMs in providing feedback on the graded assignments was also evaluated.

 

METHODS

This study involved 40 undergraduate students enrolled in the HSI2002 course “Inquiry into Current Sporting Beliefs and Practices”. Students attended three tutorial sessions, each focusing on a different topic in sports physiology. After each tutorial, students submitted a one-page written assignment evaluated on the abilities to “Understand”, “Analyse”, and “Evaluate” from the revised Bloom’s taxonomy, as well as on “Scientific inquiry competency”.

 

A total of 117 assignments were independently scored by two human graders and three LLMs: GPT-3.5, GPT-4o, and Gemini. The assessment rubrics, aligned with the revised Bloom’s taxonomy, were engineered into language prompts for the LLMs. Each LLM graded the assignments twice to assess scoring reliability. Paired t-tests and Pearson correlation coefficients were used to compare mean scores and assess inter-rater reliability (IRR).
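The statistical comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the study’s analysis code: the score lists are hypothetical, and the two functions are bare-bones stand-ins for standard paired t-test and Pearson correlation routines.

```python
from math import sqrt
from statistics import mean, stdev

def pearson_r(x, y):
    """Pearson correlation coefficient between two lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) *
                      sum((b - my) ** 2 for b in y))

def paired_t(x, y):
    """Paired t statistic computed from per-assignment score differences."""
    d = [a - b for a, b in zip(x, y)]
    return mean(d) / (stdev(d) / sqrt(len(d)))

# Hypothetical overall scores for the same eight assignments from two raters
rater_1 = [14, 12, 16, 11, 15, 13, 17, 10]
rater_2 = [13, 11, 15, 10, 14, 13, 16, 10]

print(f"Pearson r = {pearson_r(rater_1, rater_2):.3f}")
print(f"paired t  = {paired_t(rater_1, rater_2):.3f}")
```

A high r with a non-significant t would indicate that the two raters rank assignments consistently even if one scores systematically higher, which is why both tests are reported together.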

 

RESULTS

Mean overall scores and mean scores for each taxonomy category were comparable between the first and second ratings for both human and LLM graders. However, GPT-3.5 consistently scored lower and GPT-4o scored higher than human graders, while Gemini’s scores were similar to human graders.

 

IRR analysis of overall assignment scores revealed excellent agreement (80%) and correlation (r = 0.936) between human raters. In contrast, Gemini showed good agreement (71%) and correlation (r = 0.672), whereas GPT-3.5 and GPT-4o showed only moderate agreement (40% and 49%, respectively) and no correlation between raters. When comparing human with LLM raters, all LLMs were only in moderate agreement with human raters, with a weak correlation (r = 0.271) observed only between GPT-4o and human graders.

 

Human graders exhibited excellent inter-rater agreement (≥ 80%) in the “Understand,” “Evaluate,” and “Scientific Inquiry Competency” categories, with slightly lower agreement (72%) in the “Analyse” category. LLMs demonstrated poorer inter-rater agreement compared to human graders. Among the LLMs, Gemini showed the highest inter-rater agreement, with good agreement (50-79%) in three categories and excellent agreement (80%) in the “Analyse” category. GPT-3.5 exhibited the lowest inter-rater agreement, with moderate agreement (30-49%) across all categories. GPT-4o showed slightly better inter-rater agreement than GPT-3.5, with good agreement (56%) in “Scientific Inquiry Competency” and moderate agreement (45-47%) in the other categories. All LLMs showed only moderate agreement (30-49%) with human graders across all learning categories.
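The agreement bands used above (moderate 30–49%, good 50–79%, excellent ≥ 80%) can be made concrete with a short sketch. The exact-match criterion for “agreement” and the scores below are assumptions for illustration only; the study’s own agreement criterion may differ (e.g. scores within one rubric point).

```python
def percent_agreement(scores_a, scores_b):
    """Share of assignments on which two ratings gave the same score.

    Exact-match agreement is assumed here for illustration; the study's
    own criterion may differ (e.g. within one rubric point).
    """
    matches = sum(a == b for a, b in zip(scores_a, scores_b))
    return 100 * matches / len(scores_a)

def agreement_band(pct):
    """Map a percent-agreement value to the bands used in the text."""
    if pct >= 80:
        return "excellent"
    if pct >= 50:
        return "good"
    if pct >= 30:
        return "moderate"
    return "poor"

# Hypothetical category scores from two grading rounds of one LLM
round_1 = [3, 4, 2, 4, 3, 5, 4, 2, 3, 4]
round_2 = [3, 4, 3, 4, 3, 5, 4, 3, 2, 4]
pct = percent_agreement(round_1, round_2)
print(f"{pct:.0f}% agreement -> {agreement_band(pct)}")  # 70% agreement -> good
```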

 

Correlation analysis revealed that Gemini had strong correlations between its two grading rounds for the “Understand” and “Analyse” categories but only moderate correlations for “Evaluate” and “Scientific Inquiry Competency.” GPT-3.5 and GPT-4o showed no significant correlation between their grading rounds. In contrast, human graders displayed strong correlations across all categories. Comparing LLM scores with human scores revealed no significant correlation, highlighting the current limitations of LLMs in achieving human-level grading reliability.

 

CONCLUSION

While LLMs demonstrated potential in grading written assignments, they do not yet match the assessment standards of human graders. The study highlighted superior consistency among human graders and only moderate concordance between human and LLM graders. These findings underscore the need for continued improvement in LLM technologies, and for educators to adapt, in order to fully harness AI’s potential in educational assessment. Nonetheless, LLMs exhibited promising capabilities in providing personalised and constructive feedback.

 

REFERENCES

Kasneci, E., Sessler, K., Küchemann, S., Bannert, M., Dementieva, D., Fischer, F., Gasser, U., Groh, G., Günnemann, S., Hüllermeier, E., Krusche, S., Kutyniok, G., Michaeli, T., Nerdel, C., Pfeffer, J., Poquet, O., Sailer, M., Schmidt, A., Seidel, T., Stadler, M., Weller, J., Kuhn, J., & Kasneci, G. (2023). ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences, 103, 102274. https://doi.org/10.1016/j.lindif.2023.102274

Moorhouse, B. L., Yeo, M. A., & Wan, Y. (2023). Generative AI tools and assessment: Guidelines of the world’s top-ranking universities. Computers and Education Open, 5. https://doi.org/10.1016/j.caeo.2023.100151

Yan, L., Sha, L., Zhao, L., Li, Y., Martinez-Maldonado, R., Chen, G., Li, X., Jin, Y., & Gašević, D. (2024). Practical and ethical challenges of large language models in education: A systematic scoping review. British Journal of Educational Technology, 55(1), 90–112. https://doi.org/10.1111/bjet.13370

Planting the Seeds for Meaningful and Effective Community Engagement Experiences through University Overseas Study Trips

Corinne ONG*, WONG Soon Fen, Eunice NG, and LIM Cheng Puay
Ridge View Residential College (RVRC)

*corinne@nus.edu.sg 

Ong, C. P. P., Wong, S. F., Ng, E. S. Q., & Lim, C. P. (2024). Planting the seeds for meaningful and effective community engagement experiences through university overseas study trips [Paper presentation]. In Higher Education Conference in Singapore (HECS) 2024, 3 December, National University of Singapore. https://blog.nus.edu.sg/hecs/hecs2024-ong-et-al/

SUB-THEME

Opportunities from Engaging Communities

KEYWORDS

Overseas study trips, high-impact educational practice, deep learning, community engagement, course design

CATEGORY

Paper Presentation 

 

INTRODUCTION

This paper documents the reflective experiences of the authors in designing a new undergraduate course involving a 10-day overseas learning component in a public university in Singapore. We illustrate how community engagement can be integrated into a course which focuses on culture and sustainability in Southeast Asia. The benefits of community-based learning experiences are increasingly well-established in the higher education landscape, constituting a form of high-impact educational practice, especially when facilitated by deep learning teaching strategies (Laird, 2008). Its benefits include greater civic interest and engagement, increased social capital, competency development, personal growth, and improved academic achievement among students (O’Brien & Sarkis, 2014). Deep learning, which furthers the impact of community engagement experiences, is enabled through integrative learning experiences (e.g. perspective-taking, interdisciplinary problem-solving), higher-order learning experiences (e.g. theoretical applications, idea analyses, and synthesis), and reflective learning experiences (Warburton, 2003).

 

PURPOSE/SIGNIFICANCE OF STUDY

Planning a study trip that integrates community engagement opportunities is a manifold process that this paper seeks to demystify. For instance, such engagements can exist in (a) multiple forms (between educators and the partners, between organisations, between students and community partners), and are (b) managed and enacted at various temporal junctures (course design to implementation and post-trip). Designing community engagement encounters also involves the deliberate introduction of (c) student learning objectives as guided by certain principles and values (e.g. social equity), and (d) intentional learning activities/assessments (e.g. reflections, stakeholder interviews, awareness-building projects) capable of maximising benefits for all stakeholders.

 

By documenting, conceptualising, and evaluating community engagement in the above ways, this paper aims to offer educators who are keen to introduce community engagement opportunities in undergraduate overseas study trips practical considerations for integrating such activities in impactful ways. The following research questions (RQs) are examined:

 

1. How can community engagements for overseas study trips be designed to maximise their benefits for all stakeholders, including students?

Through this research question, we discuss the importance of context in shaping the design of these engagements, such as the choice of issues covered and partners engaged in order to meet course learning objectives. For instance, Southeast Asia, with its cultural diversity, natural resource endowments, and economic potential, offers significant scope for learning about sustainability (and its tensions) and the Sustainable Development Goals (SDGs). Partners actively contributing to cultural and/or environmental sustainability in local communities in East Malaysia (e.g. WWF-Sarawak, Shell Sabah, Borneo Marine Institute, Sabah, Sarawak Biodiversity Centre) were identified and engaged to create learning content and insight-sharing opportunities with students.

The interdisciplinary nature of sustainability further lends itself to learning and inquiry from multiple disciplines. We share examples of how students from different disciplines were engaged in cross-disciplinary learning in the process of community engagement, and how course activities (e.g. pre-seminar activities ranging from videos to case analyses), in-trip post-engagement reflections, and post-trip activities (video documentaries) were designed to help students make critical culture and sustainability connections while leveraging their engagement experiences. These aspects of course design are expected to be instructive to educators of diverse disciplines.

 

2. What are the benefits of learning activities facilitated around community engagement encounters for students?

This includes a discussion of how community engagement skills (e.g. cultural sensitivity, interview skills), acquired through experiences from these study trips, could be applied to contexts beyond Malaysia and to different disciplines or topics of study.

 

METHODS

The findings of this paper are derived from the triangulation of multiple data sources: the authors’ reflections on engagement efforts and encounters from course design to implementation, observations of student learning, and students’ work and course feedback.

 

PRELIMINARY FINDINGS

In response to RQ1, we outline key phases of the engagement planning process and accompanying considerations in three phases, namely pre-trip, in-trip, and post-trip:

Table 1
Conceptualisation of phases, actions/activities, and considerations involved in community engagement planning

HECS2024-a89-Table1

 

In response to RQ2, final course evaluations showed that nearly all responding students (at least 90%, n = 12) agreed that the learning outcomes were achieved (Figure 1) and expressed satisfaction with the course’s design and structure (Figure 2).

HECS2024-a89-Fig1

Figure 1: Students’ self-reported evaluation of the extent to which course learning outcomes were achieved.

 

HECS2024-a89-Fig2

Figure 2: Students’ evaluation of the effectiveness of the course structure and design.

 

Finally, students’ qualitative course feedback (examples of anonymous feedback shared below) reinforced the value of the learning activities, particularly the planned community engagements and instructor-facilitated class debriefs:


“The most effective learning strategy was definitely interacting with the locals and the people working in the NGO’s since they do not necessarily have the same views as the organisations they are working for/the views that are prevalent in academic literature. It was really eye opening how many of the social issues faced by the people and the challenges faced by organisations were not readily available or easy to find solely through research…”

 

Another student shared how the community interactions and reflections proved transformative, offering them new insights on privilege and the value of context in perspective-making:


“I think what was most effective was interacting with different stakeholders, ranging from students to villagers, and experiencing the homestays, especially Kampung Menuang…It also reminded me of how small we are compared to the world. Through daily reflections from the trip, I really feel and learned a lot from our peers, professors and our partners as we all have different perspectives due to different backgrounds.”

 

These findings validate the effectiveness of community engagement encounters in promoting meaningful, deep, and transformational learning for students.

 

REFERENCES

Grauerholz, L. (2001). Teaching holistically to achieve deep learning. College Teaching, 49(2), 44–50. http://www.jstor.org/stable/27559032

Laird, N. et al. (2008). The effects of discipline on deep approaches to student learning and college outcomes. Research in Higher Education, 49, 469–494. https://doi.org/10.1007/s11162-008-9088-5

Mezirow, J. (2003). Transformative learning as discourse. Journal of Transformative Education, 1(1), 58–63. https://doi.org/10.1177/1541344603252172

Roberts, J. W. (2012). Beyond learning by doing: Theoretical currents in experiential education (1st ed.). Routledge.

O’Brien, W., & Sarkis, J. (2014). The potential of community-based sustainability projects for deep learning initiatives. Journal of Cleaner Production, 62, 48-61. https://doi.org/10.1016/j.jclepro.2013.07.001

Warburton, K. (2003). Deep learning and education for sustainability. International Journal of Sustainability in Higher Education, 4(1), 44-56. https://doi.org/10.1108/14676370310455332
