Javid K, Driessche A, Clymer C, Abbas MJ, Pantuso A, Maier LM, Hoegler J, Hakeos WM, Guthrie ST. OpenEvidence performs at similar levels compared to current and previous GPT models on orthopedic training and education questions. World J Orthop 2026; 17(6): 118593 [DOI: 10.5312/wjo.v17.i6.118593]
Corresponding Author of This Article
Kashif Javid, Department of Orthopaedic Surgery, Henry Ford Health System, 2799 W. Grand Blvd, Detroit, MI 48202, United States. kjavid1@hfhs.org
Research Domain of This Article
Orthopedics
Article-Type of This Article
research-article
Open-Access Policy of This Article
This article is an open-access article which was selected by an in-house editor and fully peer-reviewed by external reviewers. It is distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/
Baishideng Publishing Group Inc, 7041 Koll Center Parkway, Suite 160, Pleasanton, CA 94566, USA
World J Orthop. Jun 18, 2026; 17(6): 118593 Published online Jun 18, 2026. doi: 10.5312/wjo.v17.i6.118593
OpenEvidence performs at similar levels compared to current and previous GPT models on orthopedic training and education questions
Kashif Javid, Alexander Driessche, Colton Clymer, Muhammad J Abbas, Annamarie Pantuso, Lindsay M Maier, Joseph Hoegler, William M Hakeos, Stuart T Guthrie
Kashif Javid, Alexander Driessche, Colton Clymer, Muhammad J Abbas, Annamarie Pantuso, Lindsay M Maier, Joseph Hoegler, William M Hakeos, Stuart T Guthrie, Department of Orthopaedic Surgery, Henry Ford Health System, Detroit, MI 48202, United States
Author contributions: All authors contributed to the study conception and design. Javid K and Driessche A contributed to material preparation, data collection, and analysis; Javid K, Driessche A, and Clymer C wrote the first draft of the manuscript; Abbas MJ, Pantuso A, Maier LM, Hoegler J, Hakeos WM, and Guthrie ST contributed to revisions; all authors commented on previous versions of the manuscript and read and approved the final manuscript.
Institutional review board statement: This study does not involve human or animal experiments and thus does not require an ethical document.
Informed consent statement: This study does not require an informed consent form.
Conflict-of-interest statement: All the authors report no relevant conflicts of interest for this article.
STROBE statement: The authors have read the STROBE Statement checklist of items, and the manuscript was prepared and revised according to the STROBE Statement checklist of items.
Data sharing statement: Not applicable.
Corresponding author: Kashif Javid, Department of Orthopaedic Surgery, Henry Ford Health System, 2799 W. Grand Blvd, Detroit, MI 48202, United States. kjavid1@hfhs.org
Received: January 7, 2026 Revised: February 6, 2026 Accepted: March 30, 2026 Published online: June 18, 2026 Processing time: 162 Days and 3.4 Hours
Abstract
BACKGROUND
The role and utility of large language models within medical practice, education, and patient interaction are still being defined as medical learners increasingly turn to AI chatbots for educational aid and clinical information. As these tools become more commonplace and more heavily relied upon, it becomes crucial to evaluate their accuracy and reliability.
AIM
To evaluate the accuracy of two newer models, OpenEvidence (OE) and GPT-5, against the well-studied GPT-4 on orthopedic training exam questions. We hypothesized that OE and GPT-5 would outperform GPT-4 on these questions.
METHODS
We analyzed orthopedic board-style questions obtained from Orthobullets, a widely used educational platform for orthopedic resident education and board preparation. The primary outcome was accuracy, defined as the proportion of questions answered correctly. Subgroup analyses and statistical comparisons used Pearson’s χ² or Fisher’s exact tests, with P < 0.05 considered statistically significant.
RESULTS
A total of 140 orthopedic board-style questions were tested, of which 94 were text-only and 46 included images. Questions with images were answered correctly less often than those without images (71% vs 83%, P = 0.001). GPT-5 achieved the highest overall accuracy (85%), followed by GPT-4 (76%) and OE (73%). For questions with images, the models differed significantly (P = 0.045): GPT-5 (78%) outperformed GPT-4 (67%) and OE (57%). For text-only questions, GPT-5 was again the highest performing compared with OE and GPT-4 (88% vs 81%).
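The kind of accuracy comparison named in the Methods can be illustrated with a minimal Python sketch of Pearson's χ² test on a three-model table of correct and incorrect answers. The counts below are assumptions: they are back-calculated from the rounded percentages reported for the 46 questions with images (78%, 67%, 57%), not taken from the study data, and the function names are hypothetical. The sketch is not the authors' analysis and is not expected to reproduce the reported P values. It does not implement Fisher's exact test, which is the usual alternative when expected cell counts are small.

```python
import math


def pearson_chi_square(correct, totals):
    """Pearson chi-square statistic for a k x 2 table (correct vs incorrect)."""
    incorrect = [n - c for c, n in zip(correct, totals)]
    pooled_correct = sum(correct) / sum(totals)  # accuracy if all models were equal
    stat = 0.0
    for c, w, n in zip(correct, incorrect, totals):
        expected_correct = n * pooled_correct
        expected_incorrect = n * (1 - pooled_correct)
        stat += (c - expected_correct) ** 2 / expected_correct
        stat += (w - expected_incorrect) ** 2 / expected_incorrect
    return stat


def p_value_df2(stat):
    """Upper-tail p-value of a chi-square statistic with exactly 2 degrees of freedom.

    For 2 degrees of freedom (three groups, two outcomes) the survival
    function has the closed form exp(-x / 2).
    """
    return math.exp(-stat / 2)


# ASSUMED counts, reconstructed from rounded percentages of 46 image questions.
models = ["GPT-5", "GPT-4", "OE"]
correct = [36, 31, 26]
totals = [46, 46, 46]

stat = pearson_chi_square(correct, totals)
p = p_value_df2(stat)

for name, c, n in zip(models, correct, totals):
    print(f"{name}: {c}/{n} = {c / n:.1%}")
print(f"chi-square = {stat:.3f}, df = 2, p = {p:.4f}")
```

Because the inputs are rounded reconstructions, any gap between this output and the published statistics reflects the assumptions above rather than an error in either calculation.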
CONCLUSION
OE performed at a level similar to GPT-4 on orthopedic surgery training questions; comparison with previous studies places this performance at the level of senior orthopedic surgery residents. GPT-5 trended toward superior performance compared with both its predecessor and OE across all subfields and question types, but the differences were not statistically significant. All models were less accurate on questions that required analysis of visual media.
Core Tip: We evaluated the performance of contemporary large language models on orthopedic board-style questions, comparing GPT-5 and OpenEvidence (OE) with the established GPT-4. Using a standardized orthopedic training exam question set, we found that GPT-5 achieved the highest overall accuracy and consistently outperformed prior models across subspecialties and question formats. OE performed comparably to GPT-4 across multiple fields. All models demonstrated reduced accuracy on image-based questions, highlighting persistent limitations in visual interpretation. We assert that OE is a reputable addition to the tools available to orthopedists. Its training on peer-reviewed literature adds to its potential value.