Deep learning in the marking of medical student short answer question examinations: Student perceptions and pilot accuracy assessment

Lily Hollis-Sando; Charlotte Pugh; Kyle Franke; Toby Zerner; Yiran Tan; Gustavo Carneiro; Anton van den Hengel; Ian Symonds; Paul Duggan; Stephen Bacchi

doi:10.11157/fohpe.v24i1.531

Authors

Lily Hollis-Sando University of Adelaide
Charlotte Pugh University of Adelaide
Kyle Franke University of Adelaide
Toby Zerner University of Adelaide
Yiran Tan Royal Adelaide Hospital
Gustavo Carneiro University of Adelaide
Anton van den Hengel University of Adelaide
Ian Symonds University of Adelaide
Paul Duggan University of Adelaide
Stephen Bacchi


Keywords:

deep learning, natural language processing, automation, medical education.

Abstract

Introduction: Machine learning has previously been applied to text analysis. There are limited data regarding the acceptability or accuracy of such applications in medical education. This project examined medical student opinion regarding computer-based marking and evaluated the accuracy of deep learning (DL), a subtype of machine learning, in the scoring of medical short answer questions (SAQs).

Methods: Fourth- and fifth-year medical students undertook an anonymised online examination. Prior to the examination, students completed a survey gauging their opinion on computer-based marking. Questions were marked by humans, and then a DL analysis was conducted using convolutional neural networks. In the DL analysis, following preprocessing, data were split into a training dataset (on which models were developed using 10-fold cross-validation) and a test dataset (on which performance analysis was conducted).
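
The data split described in the Methods can be sketched as follows. This is a minimal illustration only, not the authors' pipeline: the 20% test fraction, the random seed, the function names and the placeholder responses are all assumptions, since the abstract does not report them. It shows a held-out test set being set aside first, with 10-fold cross-validation indices then generated over the training portion alone.

```python
import random


def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle items and return (train, test).

    The 0.2 test fraction and the seed are assumptions for illustration;
    the abstract does not state the split that was used.
    """
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def k_fold_indices(n_items, k=10):
    """Yield (train_idx, val_idx) index lists for k-fold cross-validation."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_items))
        yield train_idx, val_idx
        start += size


# Hypothetical placeholders: one response per student for a single question.
responses = [f"response_{i}" for i in range(181)]

train, test = train_test_split(responses)
folds = list(k_fold_indices(len(train), k=10))

print(len(train), len(test), len(folds))
```

Each response in the training portion appears in exactly one validation fold, and the test portion is never touched during model development, which is what allows it to serve for the final performance analysis.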

Results: One hundred and eighty-one students completed the examination (participation rate 59.0%). While students expressed concern regarding the accuracy of computer-based marking, the majority of students agreed that computer marking would be more objective than human marking (67.0%) and reported they would not object to computer-based marking (55.5%). Regarding automated marking of SAQs, classification accuracy was consistently high for 1-mark questions (mean accuracy 0.98). For more complex 2-mark and 3-mark SAQs, in which multiclass classification was required, accuracy was lower (mean 0.65 and 0.59, respectively).
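
As a hedged illustration of the accuracy figures above, the sketch below computes classification accuracy as the proportion of responses for which the model's mark equals the human mark. The abstract does not define the metric, so this definition is an assumption, as are the function name and the made-up marks; whole-number marks are also assumed, which would give two classes (0 or 1) for a 1-mark question and three classes (0, 1 or 2) for a 2-mark question.

```python
def accuracy(human_marks, model_marks):
    """Return the fraction of responses where the model mark equals the human mark."""
    if len(human_marks) != len(model_marks):
        raise ValueError("mark lists must have the same length")
    if not human_marks:
        raise ValueError("at least one response is required")
    matches = sum(h == m for h, m in zip(human_marks, model_marks))
    return matches / len(human_marks)


# Made-up marks for a hypothetical 2-mark question (classes 0, 1 and 2).
human = [0, 1, 2, 2, 1, 0, 2, 1]
model = [0, 1, 2, 1, 1, 0, 0, 1]

result = accuracy(human, model)
print(result)  # 6 of the 8 marks agree, so this prints 0.75
```

Under this definition a model that is one mark away from the human marker counts as wrong in the same way as one that is two marks away, which may be one reason multiclass questions score lower than 1-mark questions.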

Conclusions: Medical students may be supportive of computer-based marking due to its objectivity. DL has the potential to provide accurate marking of written questions; however, further research into DL marking of medical examinations is required.
