Abstract
Multiple-choice items form the basis of many assessments in medical education, but writing them by traditional methods is time-consuming and demanding. This study therefore aimed to generate case-based multiple-choice items on preventive medicine with the artificial intelligence model ChatGPT-4 and to examine the psychometric properties of the generated items. Of the 25 items produced by ChatGPT-4, 20 were removed after review by field experts because they did not meet the requirements of a high-quality multiple-choice item. The remaining five items were administered to 110 family medicine residents, and item statistics were computed under classical test theory. The expert review found that the stems and options of the generated items followed item-writing standards, but that the distractors needed improvement. Item statistics based on the residents' responses showed that the first item was too easy and did not discriminate, while, of the remaining four items, one was easy and the other three were of moderate difficulty and discriminating. Distractor analyses showed that none of the distractors were effective for the item answered correctly by 97.3% of the residents, whereas for the other four items, one or two distractors were selected by fewer than 5% of the residents. In conclusion, ChatGPT can assist field experts in creating case-based multiple-choice items for medical education; however, the generated items must be reviewed by experts before use.
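The classical test theory statistics mentioned above can be illustrated with a minimal Python sketch. It rests on assumptions, since the abstract does not state which indices the study computed: difficulty is taken as the proportion of correct responses, discrimination as the point-biserial correlation between the item score and the rest-of-test score, and a distractor is flagged as non-functional when fewer than 5% of examinees select it (the threshold the abstract reports). The function names and the toy responses are hypothetical and are not study data.

```python
from statistics import mean, pstdev


def item_difficulty(responses, key):
    """Proportion of examinees who chose the keyed option (CTT p-value)."""
    return sum(r == key for r in responses) / len(responses)


def point_biserial(item_scores, rest_scores):
    """Correlation between a 0/1 item score and the rest-of-test score.

    Assumed discrimination index; the study may have used another one.
    """
    sx, sy = pstdev(item_scores), pstdev(rest_scores)
    if sx == 0 or sy == 0:
        return 0.0  # no variance: the index is undefined, reported as 0 here
    mx, my = mean(item_scores), mean(rest_scores)
    cov = mean((x - mx) * (y - my) for x, y in zip(item_scores, rest_scores))
    return cov / (sx * sy)


def nonfunctional_distractors(responses, options, key, threshold=0.05):
    """Distractors selected by fewer than `threshold` of the examinees."""
    n = len(responses)
    return [o for o in options
            if o != key and responses.count(o) / n < threshold]


# Hypothetical toy data: 20 examinees answering one four-option item.
options = ["A", "B", "C", "D"]
key = "A"
responses = ["A"] * 14 + ["B"] * 4 + ["C"] * 2

print("difficulty:", item_difficulty(responses, key))
print("non-functional:", nonfunctional_distractors(responses, options, key))
```

Under this sketch, an item answered correctly by nearly all examinees yields a difficulty close to 1 and almost no variance in the item score, so its discrimination is necessarily low and most of its distractors fall under the 5% threshold.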