OpenEvidence performs at similar levels compared to current and previous GPT models on orthopedic training and education questions

doi:10.5312/wjo.v17.i6.118593

Journal Information of This Article

Publication Name: World Journal of Orthopedics

ISSN: 2218-5836

Publisher of This Article: Baishideng Publishing Group Inc, 7041 Koll Center Parkway, Suite 160, Pleasanton, CA 94566, USA

Observational Study

Copyright: ©Author(s) 2026. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution-NonCommercial (CC BY-NC 4.0) license. No commercial re-use. See permissions. Published by Baishideng Publishing Group Inc.

World J Orthop. Jun 18, 2026; 17(6): 118593
Published online Jun 18, 2026. doi: 10.5312/wjo.v17.i6.118593

Kashif Javid, Alexander Driessche, Colton Clymer, Muhammad J Abbas, Annamarie Pantuso, Lindsay M Maier, Joseph Hoegler, William M Hakeos, Stuart T Guthrie

Kashif Javid, Alexander Driessche, Colton Clymer, Muhammad J Abbas, Annamarie Pantuso, Lindsay M Maier, Joseph Hoegler, William M Hakeos, Stuart T Guthrie, Department of Orthopaedic Surgery, Henry Ford Health System, Detroit, MI 48202, United States

Author contributions: All authors contributed to the study conception and design. Javid K and Driessche A contributed to material preparation, data collection and analysis; Javid K, Driessche A, and Clymer C wrote the first draft of the manuscript; Abbas M, Pantuso A, Maier LM, Hoegler J, Hakeos WM, and Guthrie ST contributed to revisions; all authors commented on previous versions of the manuscript, read and approved the final manuscript.

Institutional review board statement: This study does not involve human or animal experiments and thus does not require an ethical document.

Informed consent statement: This study does not require an informed consent form.

Conflict-of-interest statement: All the authors report no relevant conflicts of interest for this article.

STROBE statement: The authors have read the STROBE Statement-checklist of items, and the manuscript was prepared and revised according to the STROBE Statement-checklist of items.

Data sharing statement: Not applicable.

Corresponding author: Kashif Javid, Department of Orthopaedic Surgery, Henry Ford Health System, 2799 W. Grand Blvd, Detroit, MI 48202, United States. kjavid1@hfhs.org

Received: January 7, 2026
Revised: February 6, 2026
Accepted: March 30, 2026
Published online: June 18, 2026
Processing time: 162 Days and 3.4 Hours

Abstract

BACKGROUND

The role and utility of large language models in medical practice, education, and patient interaction are still being defined, even as medical learners increasingly turn to AI chatbots for educational aid and clinical information. As these tools become more common and more prominent, evaluating their accuracy and reliability becomes crucial.

AIM

To evaluate the accuracy of the newer models OpenEvidence (OE) and GPT-5 against the well-studied GPT-4 on orthopedic training exam questions. We hypothesized that OE and GPT-5 would outperform GPT-4 on these questions.

METHODS

We analyzed orthopedic board-style questions obtained from Orthobullets, a widely used educational platform for orthopedic resident education and board preparation. The primary outcome was accuracy, defined as the proportion of questions answered correctly. Subgroup analyses and statistical comparisons used Pearson’s χ² or Fisher’s exact tests, with significance set at P < 0.05.
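The outcome and test described above can be illustrated with a minimal sketch. It assumes a table of [correct, incorrect] counts per model; the model names and counts below are hypothetical and are not the study's data, and the function names are introduced here for illustration only. The closed-form P value applies only to the three-group, two-outcome case (2 degrees of freedom).

```python
import math

def accuracy(correct, total):
    """Proportion of questions answered correctly."""
    return correct / total

def pearson_chi2(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of model and outcome
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return stat, dof

def chi2_pvalue_2dof(stat):
    """Upper-tail P value; this closed form holds only for 2 degrees of freedom."""
    return math.exp(-stat / 2)

# Hypothetical counts (NOT the study's data): [correct, incorrect] per model
hypothetical = {
    "model_a": [45, 5],
    "model_b": [40, 10],
    "model_c": [35, 15],
}

table = list(hypothetical.values())
stat, dof = pearson_chi2(table)
p = chi2_pvalue_2dof(stat)

for name, (right, wrong) in hypothetical.items():
    print(f"{name}: accuracy = {accuracy(right, right + wrong):.0%}")
print(f"chi2 = {stat:.2f}, dof = {dof}, P = {p:.3f}, significant = {p < 0.05}")
```

In practice, a statistics package would normally be used for this, and Fisher’s exact test is the usual choice when expected cell counts are small. The hand-rolled version is shown only to make the arithmetic visible.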

RESULTS

A total of 140 orthopedic board-style questions were tested, of which 94 were text-only and 46 included images. Questions with images were answered correctly less often than those without images (71% vs 83%, P = 0.001). GPT-5 achieved the highest overall accuracy (85%), followed by GPT-4 (76%) and OE (73%). For questions with images, the models differed significantly (P = 0.045): GPT-5 (78%) outperformed GPT-4 (67%) and OE (57%). For text-only questions, GPT-5 was again the highest performing compared with OE and GPT-4 (88% vs 81%).

CONCLUSION

OE performed at a level similar to GPT-4 on orthopedic surgery training questions; comparison with previous studies places this performance at the level of senior orthopedic surgery residents. GPT-5 trended toward superior performance compared with both its predecessor and OE across all subfields and question types, but the differences were not statistically significant. All models were less accurate on questions that required analysis of visual media.

Keywords: Artificial intelligence; Medical education; Orthopedic training; Large language models; Chatbot

Core Tip: We evaluated the performance of contemporary large language models on orthopedic board-style questions, comparing GPT-5 and OpenEvidence (OE) with the established GPT-4. Using a standardized orthopedic training exam question set, we found that GPT-5 achieved the highest overall accuracy and consistently outperformed prior models across subspecialties and question formats. OE performed comparably to GPT-4 across multiple fields. All models demonstrated reduced accuracy on image-based questions, highlighting persistent limitations in visual interpretation. We assert that OE is a reputable addition to the tools available to orthopedists. That its training draws on peer-reviewed literature adds to its potential value.