Per Arne Godejord
Nord University Business School
Norway
Abstract
This paper critically evaluates the ability of fourteen popular chatbots, tested over a two-year period (December 2022 to January 2025), to address higher-order academic assignments within the field of Social Informatics, focusing specifically on assignments that require complex cognitive skills as defined by Bloom’s taxonomy.
The chatbots examined include ChatGPT, GPT UiO, Sikt KI-Chat, GPT-3 Playground, Chatsonic, Bing Chat (Copilot), Jenni, Claude, llama70b-v2-chat, Perplexity.ai, Gemini Pro, and others, primarily in their free versions. The tools were tasked with producing fact-based essays, responding to course assignments in various sub-fields (e.g., computer security, law, game creation, and work in virtual teams), and addressing complex academic inquiries.
Results indicated that none of the chatbots were capable of reliably producing high-quality academic outputs beyond simple fact repetition. A substantial number of responses involved fabricated information, including non-existent sources.
These findings challenge media claims that chatbots such as ChatGPT can effectively meet the demands of higher education assessments and thereby render portfolio assessment in online courses unusable as an examination method.
Furthermore, this paper strongly suggests that concerns about chatbots undermining academic integrity in Norwegian bachelor and master’s theses within the Social Sciences and the Humanities are unfounded. In conclusion, current AI tools are far from being true artificial intelligence, and they fall short of delivering the level of academic rigor required by advanced education in the Social Sciences and the Humanities.

This excerpt from an upcoming journal paper is derived from two years of testing involving fourteen chatbots. It also incorporates insights from my blog book, “ChatGPT – A Talkative Example of Artificial Intelligence, or…?”.
The comprehensive findings and analyses will be fully detailed in the complete paper, scheduled for publication in 2025.