Per Arne Godejord
Nord University Business School
Norway
Abstract
(Draft)
Background: Large language models (LLMs) such as ChatGPT, Bing Chat, Claude, and others are often claimed to be capable of producing academic responses to university-level tasks. However, these models are prone to "hallucinations," generating outputs that appear convincing but are factually incorrect. This study evaluates the performance of various LLMs in completing higher-order academic assignments within the field of Social Informatics.
Objective: To assess the capabilities and limitations of several popular chatbots in addressing complex academic tasks within the field of Social Informatics, focusing on their alignment with the higher cognitive levels defined by Bloom's taxonomy.
Methods: A field-based evaluation experiment was conducted over two years, employing a quasi-experimental case study design. The study tested the performance of ChatGPT, GPT UiO, Sikt KI-Chat, GPT-3 Playground, Chatsonic, Bing Chat (Copilot), Jenni, Claude, llama70b-v2-chat, Perplexity.ai, Gemini Pro, and others, primarily using their free versions.
The tools were tested in realistic teaching settings, using authentic coursework requirements from courses in two online study programmes and one online MBA course. The evaluations were based on discipline-specific criteria, including qualitative descriptions of academic texts, alignment with Bloom's taxonomy, and authentic portfolio assessment.
Results: The findings indicate that none of the chatbots was capable of reliably producing high-quality academic output beyond simple fact repetition. The responses often lacked analytical depth, critical reflection, and adherence to academic standards, and many included fabricated information and non-existent sources.
Conclusions: The study concludes that current chatbots fall short of delivering the level of academic rigor required by advanced university education. While these models can assist with simple content generation, they cannot replace the intellectual engagement and analytical reasoning necessary for higher-order academic tasks. The findings suggest that concerns about chatbots undermining academic integrity in higher education are unfounded, as these tools are not yet capable of meeting the demands of complex academic assessments.

This excerpt from an upcoming journal paper is derived from two years of testing involving 14 chatbots. It also incorporates insights from my blog book, "ChatGPT – A Talkative Example of Artificial Intelligence, or…?".
The comprehensive findings and analyses will be fully detailed in the complete paper, scheduled for publication in 2025/2026.