OpenEvidence performs at similar levels compared to current and previous GPT models on orthopedic training and education questions

doi:10.5312/wjo.v17.i6.118593

Journal Information of This Article

Publication Name

World Journal of Orthopedics

ISSN

2218-5836

Publisher of This Article

Baishideng Publishing Group Inc, 7041 Koll Center Parkway, Suite 160, Pleasanton, CA 94566, USA

Observational Study Open Access

Copyright: ©Author(s) 2026. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution-NonCommercial (CC BY-NC 4.0) license. No commercial re-use. See permissions. Published by Baishideng Publishing Group Inc.

World J Orthop. Jun 18, 2026; 17(6): 118593
Published online Jun 18, 2026. doi: 10.5312/wjo.v17.i6.118593

Kashif Javid, Alexander Driessche, Colton Clymer, Muhammad J Abbas, Annamarie Pantuso, Lindsay M Maier, Joseph Hoegler, William M Hakeos, Stuart T Guthrie

Kashif Javid, Alexander Driessche, Colton Clymer, Muhammad J Abbas, Annamarie Pantuso, Lindsay M Maier, Joseph Hoegler, William M Hakeos, Stuart T Guthrie, Department of Orthopaedic Surgery, Henry Ford Health System, Detroit, MI 48202, United States

ORCID number: Kashif Javid (0009-0008-4576-6006).

Author contributions: All authors contributed to the study conception and design. Javid K and Driessche A contributed to material preparation, data collection and analysis; Javid K, Driessche A, and Clymer C wrote the first draft of the manuscript; Abbas M, Pantuso A, Maier LM, Hoegler J, Hakeos WM, and Guthrie ST contributed to revisions; all authors commented on previous versions of the manuscript, read and approved the final manuscript.

Institutional review board statement: This study does not involve human or animal experiments and thus does not require an ethical document.

Informed consent statement: This study does not require an informed consent form.

Conflict-of-interest statement: All the authors report no relevant conflicts of interest for this article.

STROBE statement: The authors have read the STROBE Statement checklist of items, and the manuscript was prepared and revised according to the STROBE Statement checklist of items.

Data sharing statement: Not applicable.

Corresponding author: Kashif Javid, Department of Orthopaedic Surgery, Henry Ford Health System, 2799 W. Grand Blvd, Detroit, MI 48202, United States. kjavid1@hfhs.org

Received: January 7, 2026
Revised: February 6, 2026
Accepted: March 30, 2026
Published online: June 18, 2026
Processing time: 162 Days and 3.4 Hours

Abstract

BACKGROUND

The role and utility of large language models within medical practice, education, and patient interaction are still being defined as medical learners increasingly turn to AI chatbots for educational aid and clinical information. As these tools become more commonplace and emphasized, it becomes crucial to evaluate their accuracy and reliability.

AIM

To evaluate the accuracy of new models OpenEvidence (OE) and GPT-5 in comparison to the well-studied model GPT-4 in their performance on orthopedic training exam questions. We hypothesize that OE and GPT-5 will provide superior results on orthopedic training questions when compared to GPT-4.

METHODS

We conducted an analysis of orthopedic board-style questions obtained from Orthobullets, a widely used educational platform for orthopedic resident education and board preparation. The primary outcome was accuracy, defined as the proportion of questions answered correctly, with subgroup analyses and statistical comparisons using Pearson’s χ² or Fisher’s exact tests (significance set at P < 0.05).

RESULTS

A total of 140 orthopedic board-style questions were tested, of which 94 were text-only and 46 included images. Questions with images were answered correctly less often than those without images (71% vs 83%, P = 0.001). GPT-5 achieved the highest overall accuracy (85%), followed by GPT-4 (76%) and OE (73%). For image-based questions, accuracy differed significantly between models (P = 0.045): GPT-5 (78%) outperformed GPT-4 (67%) and OE (57%). For text-only questions, GPT-5 likewise had the highest accuracy (88%), compared with 81% for both GPT-4 and OE.

CONCLUSION

OE performed at a similar level as ChatGPT-4 on orthopedic surgery training questions, with comparison to previous studies placing this performance at the level of senior orthopedic surgery residents. GPT-5 trended towards superior performance compared to both its previous model and OE across all subfields and question types but still showed no significant differences. All models had lower accuracy with questions that required analysis of visual media.

Key Words: Artificial intelligence; Medical education; Orthopedic training; Large language models; Chatbot

Core Tip: We evaluated the performance of contemporary large language models on orthopedic board-style questions, comparing ChatGPT-5 and OpenEvidence (OE) with the established GPT-4. Using a standardized orthopedic training exam question set, we found that ChatGPT-5 achieved the highest overall accuracy and consistently outperformed prior models across subspecialties and question formats. OE performed comparably to GPT-4 across multiple fields. All models demonstrated reduced accuracy on image-based questions, highlighting persistent limitations in visual interpretation. We assert that OE is a reputable addition to the tools available to orthopedists. Its training on peer-reviewed literature adds to its potential value.

Citation: Javid K, Driessche A, Clymer C, Abbas MJ, Pantuso A, Maier LM, Hoegler J, Hakeos WM, Guthrie ST. OpenEvidence performs at similar levels compared to current and previous GPT models on orthopedic training and education questions. World J Orthop 2026; 17(6): 118593
URL: https://www.wjgnet.com/2218-5836/full/v17/i6/118593.htm
DOI: https://dx.doi.org/10.5312/wjo.v17.i6.118593

INTRODUCTION

Artificial intelligence (AI) continues to be an expanding field, with the role and utility of large language models (LLMs) within medical practice, education, and patient interaction still being defined and understood. Medical learners increasingly turn to AI chatbots for educational aid and clinical information. These learners further assert that understanding the utility and function of AI in the clinical setting should be encouraged as a part of medical education[1]. Thus, as these tools become more commonplace and emphasized, it becomes crucial to evaluate their accuracy and reliability. This is even more important in a highly specialized field such as orthopedics. While other fields, including neurosurgery and radiology, have seen LLMs achieve success with their in-training exams, orthopedic surgery has yet to have similar results[2].

OpenEvidence (OE) is a newer LLM that provides similar chatbot functionality and assistance, but it is specifically trained on medical literature and clinical guidelines[3]. OE is an AI-powered chatbot that can only be accessed by medical professionals. Founded on a Mayo Clinic platform, the model was created from a recent partnership with major medical journals such as Journal of the American Medical Association and the New England Journal of Medicine, along with a HIPAA-compliant function, which further increases its clinical utility. It differs from ChatGPT due to its predominantly medical-focused algorithm that can aid clinicians with diagnosis, treatments, plans, medical knowledge, and clinical interpretation. The use of OE is rapidly expanding in clinical practice, and it has performed better than any other chatbot in standardized medical examination testing, with a recent score of 100% on United States Medical Licensing Examination Step 1, 2, and 3, relative to 97% with GPT-5[4]. Many studies have analyzed the usage and accuracy of previous iterations of ChatGPT, but none have evaluated OE or the performance of the newest version of ChatGPT, GPT-5, on orthopedic information.

Previously, many different learning platforms and clinical databases have been tested in the accuracy of answering orthopedic board-specific questions relative to orthopedic residents. ChatGPT has provided mixed results in terms of its clinical accuracy on tests such as the Orthopaedic In-Training Examination (OITE). Findings from Hayes et al[5] demonstrated that ChatGPT consistently underperformed relative to each class of orthopedic residents on previous OITE forms. However, the rapid evolution of AI has demonstrated that each new iteration of ChatGPT performs significantly better than its predecessor on orthopedic training examinations[6]. As LLMs continue to be developed and trained, their capabilities to aid medical professionals and learners also increase. Each generation potentially offers a more polished and valuable tool. The goal of our study was to evaluate the accuracy of new models OE and GPT-5 in comparison to the well-studied model GPT-4 in their performance on orthopedic training exam questions. We hypothesize that OE and GPT-5 will provide superior results on orthopedic training questions when compared to GPT-4.

MATERIALS AND METHODS

We conducted an analysis of orthopedic board-style questions obtained from Orthobullets, a widely used educational platform for orthopedic resident education and board preparation[7]. Each question was categorized by subspecialty (basic science, foot and ankle, hand, knee and sports, pathology, pediatrics, reconstruction, shoulder and elbow, spine, trauma), perceived importance (A-E), complexity (1-5), and whether the item included an image. Perceived importance was defined on a spectrum of “critical” to “community”, and complexity was defined as 1 being least complex and 5 being the most complex. The complexity, perceived importance, and subspecialty categorizations were assigned to each question by Orthobullets.

Across the three tested models, four model conditions were evaluated: GPT-4, GPT-5, OE, and OE (described). GPT-4 and GPT-5 were multimodal models with the ability to process both text and images directly. GPT-5’s initial publicly accessible release version was utilized, with responses queried between August 19, 2025, and August 21, 2025. OE was a text-only model and did not receive image content when present. To provide a comparator, the OE (described) condition supplied the same model with a textual summary of the image content (as written in the Orthobullets question bank) instead of the actual image itself. Questions were retrieved and input into each model individually. To decrease potential confounding factors, a new chat was created for every question within each model. Additionally, all chats were run in a new Google Chrome Incognito browser window (Version 140.0.7339.133, Mountain View, CA, United States). For each model-question pairing, the selected answer was recorded as correct or incorrect.

The primary outcome was accuracy, defined as the proportion of questions answered correctly. Subgroup analyses compared performance across models, by image inclusion, subspecialty, and question complexity. Statistical comparisons were performed using Pearson’s χ² or Fisher’s exact tests, with significance set at P < 0.05. All analyses were conducted in RStudio (R Foundation for Statistical Computing, Vienna, Austria).
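As an illustration of the exact test named above, the following Python sketch computes a two-sided Fisher’s exact P value for a 2 × 2 table using only the standard library. It is a sketch under stated assumptions, not the authors’ code: the analysis was run in R, the function name `fisher_exact_2x2` is hypothetical, and the example reuses the pooled image versus no-image counts from Table 1, for which the article reports a Pearson’s χ² P value rather than an exact one.

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact P value for the 2 x 2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum over every table that is no more likely than the observed one
    # (small relative tolerance guards against floating-point ties)
    return sum(p for p in map(prob, range(lowest, highest + 1))
               if p <= p_observed * (1 + 1e-9))

# Rows: image, no image; columns: incorrect, correct (Table 1)
p_value = fisher_exact_2x2(54, 130, 47, 235)
accuracy_image = 130 / (54 + 130)
accuracy_text = 235 / (47 + 235)
print(f"{accuracy_image:.0%} vs {accuracy_text:.0%}, exact P = {p_value:.4f}")
```

The subgroup tables in this article have more than two rows, so the equivalent analysis would need an r × 2 exact test (for example R’s `fisher.test`), which this 2 × 2 sketch does not cover.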

RESULTS

A total of 140 orthopedic board-style questions were tested, of which 94 were text-only and 46 included images. From these 140 questions, 466 question instances across all models were analyzed, of which 184 (39%) included images and 282 (61%) were text-only (Table 1). Across all questions, 365 responses (78%) were correct. Questions with images were answered correctly less often than those without images (71% vs 83%, P = 0.001) (Figure 1). Accuracy also differed across subspecialties, ranging from 100% in pathology to 62% in foot and ankle (P < 0.001). Higher complexity questions (level 4-5) were associated with lower accuracy compared with lower complexity questions (P = 0.005) (Figure 2).
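The 466 question instances follow from the design described in the Methods: GPT-4, GPT-5, and OE each answered all 140 questions, while OE (described) answered only the 46 image-based questions. A minimal Python sketch of this bookkeeping, using only counts reported in the text and Table 1 (the variable names are illustrative):

```python
# Counts reported in the article
n_text, n_image = 94, 46
n_questions = n_text + n_image             # 140 questions

full_conditions = 3        # GPT-4, GPT-5, and OE answered every question
image_only_conditions = 1  # OE (described) answered image questions only

image_instances = n_image * (full_conditions + image_only_conditions)
text_instances = n_text * full_conditions
total_instances = image_instances + text_instances

# Correct responses per model condition (Table 1)
correct = {"GPT-4": 107, "GPT-5": 119, "OE": 102, "OE (described)": 37}
total_correct = sum(correct.values())
pooled_accuracy = total_correct / total_instances

print(image_instances, text_instances, total_instances)
print(total_correct, round(100 * pooled_accuracy))
```

Because OE (described) contributes only to the image-based group, image questions make up 39% of instances even though they are 33% of the question set.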

Figure 1 Accuracy by model and question type. OE: OpenEvidence.

Figure 2 Pooled accuracy. A: By perceived question importance; B: By question complexity.

Table 1 Comparison of question characteristics stratified by answer correctness, n (%).

Variable	Overall	Incorrect	Correct	P value¹
Image included				0.001
Image	184	54 (29)	130 (71)
No image	282	47 (17)	235 (83)
Model				0.087
GPT-4	140	33 (24)	107 (76)
GPT-5	140	21 (15)	119 (85)
OE	140	38 (27)	102 (73)
OE (described)	46	9 (20)	37 (80)
Specialty				< 0.001
Basic science	62	5 (8.1)	57 (92)
Foot and ankle	29	11 (38)	18 (62)
Hand	30	4 (13)	26 (87)
Knee and sports	53	14 (26)	39 (74)
Pathology	28	0 (0)	28 (100)
Pediatrics	52	10 (19)	42 (81)
Recon	55	16 (29)	39 (71)
Shoulder and elbow	39	8 (21)	31 (79)
Spine	69	16 (23)	53 (77)
Trauma	49	17 (35)	32 (65)
Perceived importance				0.4
A	83	15 (18)	68 (82)
B	83	21 (25)	62 (75)
C	100	18 (18)	82 (82)
D	100	20 (20)	80 (80)
E	100	27 (27)	73 (73)
Question complexity				0.005
1	100	14 (14)	86 (86)
2	100	16 (16)	84 (84)
3	100	19 (19)	81 (81)
4	100	32 (32)	68 (68)
5	66	20 (30)	46 (70)

¹Pearson’s χ² test; Wilcoxon rank sum test.

OE: OpenEvidence.

When stratified by model, GPT-5 achieved the highest overall accuracy (85%), followed by OE (described) (80%), GPT-4 (76%), and OE (73%), although this difference did not reach statistical significance in the pooled analysis (P = 0.087, Table 1). Table 2 provides intra-model results for image-based questions. GPT-5 answered 36 of 46 image questions correctly (78%), compared with 67% for GPT-4, 57% for OE, and 80% for OE (described); accuracy on these questions differed significantly between models (P = 0.045). Accuracy varied within subspecialties, with consistently high performance in pathology (100% across all models) but lower results in foot and ankle and trauma. Table 3 shows intra-model results for text-only questions. Performance was uniformly high, with no significant differences between models (GPT-5, 88%; GPT-4, 81%; OE, 81%; P = 0.286). Again, subspecialty breakdowns revealed excellent performance in pathology and hand (100% across models), while trauma and knee and sports questions demonstrated lower overall accuracy.
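The between-model comparisons above can be re-derived from the correct counts in Tables 1 to 3. The sketch below is a reconstruction under stated assumptions rather than the authors’ R code: it applies an uncorrected Pearson’s χ² test to each table of model conditions × (correct, incorrect), and the helper names `chi2_sf` and `pearson_chi2` are hypothetical.

```python
from math import erfc, exp, pi, sqrt

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return exp(-half) * total
    # Odd df: start from erfc(sqrt(x/2)) and add (x/2)^(k-1/2) e^(-x/2) / Gamma(k+1/2)
    total = erfc(sqrt(half))
    term = sqrt(half) / (sqrt(pi) / 2.0)  # (x/2)^(1/2) / Gamma(3/2)
    for k in range(1, (df - 1) // 2 + 1):
        total += exp(-half) * term
        term *= half / (k + 0.5)
    return total

def pearson_chi2(table):
    """Return (statistic, df, P value) for a table given as rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(col_totals) - 1)
    return stat, df, chi2_sf(stat, df)

# Rows are (correct, incorrect) per model condition
pooled = [(107, 33), (119, 21), (102, 38), (37, 9)]    # Table 1, reported P = 0.087
image = [(31, 15), (36, 10), (26, 20), (37, 9)]        # Table 2, reported P = 0.045
text = [(76, 18), (83, 11), (76, 18)]                  # Table 3, reported P = 0.286

for label, table in [("pooled", pooled), ("image", image), ("text", text)]:
    stat, df, p = pearson_chi2(table)
    print(f"{label}: chi2 = {stat:.2f}, df = {df}, P = {p:.3f}")
```

Each of these is an omnibus test across all conditions in the table, so a small P value indicates that accuracy differs somewhere among the models, not that any specific pair differs.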

Table 2 Image-based question characteristics by language model and response accuracy (n = 46), n (%).

	GPT-4		GPT-5		OE		OE (described)
	Correct	P value¹	Correct	P value¹	Correct	P value¹	Correct	P value¹
Total correct	31 (67)		36 (78)		26 (57)		37 (80)
Specialty		0.3		0.3		0.5		0.2
Basic science	2 (100)		2 (100)		2 (100)		2 (100)
Foot and ankle	1 (25)		2 (50)		1 (25)		2 (50)
Hand	6 (100)		5 (83)		4 (67)		5 (83)
Knee and sports	3 (60)		4 (80)		3 (60)		4 (80)
Pathology	4 (100)		4 (100)		4 (100)		4 (100)
Pediatrics	5 (71)		6 (86)		3 (43)		7 (100)
Recon	2 (50)		1 (25)		1 (25)		2 (50)
Shoulder and elbow	3 (75)		4 (100)		3 (75)		4 (100)
Spine	3 (50)		5 (83)		3 (50)		5 (83)
Trauma	2 (50)		3 (75)		2 (50)		2 (50)
Perceived importance		0.3		0.5		0.5		0.6
A	5 (63)		7 (88)		5 (63)		7 (88)
B	6 (75)		7 (88)		5 (63)		7 (88)
C	8 (80)		9 (90)		7 (70)		8 (80)
D	8 (80)		7 (70)		6 (60)		9 (90)
E	4 (40)		6 (60)		3 (30)		6 (60)
Question complexity		0.2		0.3		0.6		> 0.9
1	8 (80)		10 (100)		7 (70)		8 (80)
2	8 (80)		7 (70)		5 (50)		9 (90)
3	8 (80)		8 (80)		7 (70)		7 (70)
4	4 (40)		6 (60)		4 (40)		8 (80)
5	3 (50)		5 (83)		3 (50)		5 (83)

¹Fisher’s exact test; Wilcoxon rank sum test.

OE: OpenEvidence.

Table 3 Text-only question characteristics by language model and response accuracy (n = 94), n (%).

	GPT-4		GPT-5		OE
	Correct	P value¹	Correct	P value¹	Correct	P value¹
Total correct	76 (81)		83 (88)		76 (81)
Specialty		0.6		> 0.9		0.8
Basic science	16 (89)		17 (94)		16 (89)
Foot and ankle	4 (80)		4 (100)		4 (100)
Hand	2 (100)		2 (100)		2 (100)
Knee and sports	8 (73)		9 (82)		8 (73)
Pathology	4 (100)		4 (100)		4 (100)
Pediatrics	7 (88)		7 (88)		7 (88)
Recon	11 (85)		11 (85)		11 (85)
Shoulder and elbow	5 (71)		7 (88)		5 (63)
Spine	13 (87)		13 (87)		11 (73)
Trauma	6 (55)		9 (82)		8 (73)
Perceived importance		0.3		0.3		0.5
A	13 (76)		17 (100)		14 (82)
B	12 (71)		14 (82)		11 (65)
C	15 (75)		18 (90)		17 (85)
D	17 (85)		16 (80)		17 (85)
E	19 (95)		18 (90)		17 (85)
Question complexity		0.025		0.3		0.9
1	19 (95)		17 (85)		17 (85)
2	19 (95)		20 (100)		16 (80)
3	16 (80)		18 (90)		17 (85)
4	13 (65)		17 (85)		16 (80)
5	9 (64)		11 (79)		10 (71)

¹Fisher’s exact test; Wilcoxon rank sum test.

OE: OpenEvidence.

DISCUSSION

OE and ChatGPT-4 as a whole performed similarly. Across all questions, the models scored with 73% and 76% accuracy, respectively. Across text-based questions alone, the models both scored 81% accuracy. These values build upon previous investigations into ChatGPT’s performance on similar orthopedic question banks. Hayes et al[5], when studying an earlier GPT-4, found that the model had an average correct answer choice rate of 49% with images and 48% with text-only questions, performing worse than any resident level on a previous OITE test form[5]. Guerra et al[8] reported an increased capability of ChatGPT in a more recent study, with GPT having a 49.1% correct answer rate[8]. This score put the model at the approximate level of a PGY-1 orthopedic surgery resident. While we were unable to directly test past OITE question sets, our dataset, based on questions from the Orthobullets question bank, represents another commonly utilized education source for orthopedic trainees[9,10]. Sparks et al[2] specifically tested an earlier model of ChatGPT on different orthopedic question sets, including Orthobullets and the 2022 and 2021 OITE forms, reporting that the model scored a 54.8% on the Orthobullets set, a 54.1% on the 2022 OITE, and 55.9% on the 2021 OITE[2]. With this equivalent performance between question sets, we can infer from our results that OE, the most updated version of GPT-4, and GPT-5 all would score similarly on the previous OITE forms as they had on our dataset, performing above the level of a PGY-5 orthopedic surgery resident (72%)[5,11]. Our findings thus support previously established literature on the validation of ChatGPT’s efficacy as an orthopedic informational tool for providers, while being the first potential instance of any LLM being capable of passing the American Board of Orthopaedic Surgery Part I written exam. Additionally, we assert that OE is a reputable addition to the LLM tools available to orthopedic surgeons. 
The added benefit of training and data being drawn from peer-reviewed literature that may otherwise be limited behind the paywall of subscription adds to the potential value of OE over other chatbots. While not statistically significant, we observed higher accuracy in OE compared to GPT-4 when answering more complex questions, potentially reflecting the deeper access to medical literature that OE possesses.

OpenAI’s release of GPT-5 occurred with emphasis on the model’s increased utility in medical contexts relative to previous models[12]. While our results showed a trend towards higher accuracy, there remained no statistically significant difference in performance between models. GPT-5 trended towards higher accuracy compared to its previous model as well as OE across all question types and subjects. These advancements over its predecessor within our study may indicate increased capability when compared to other LLMs. Guerra et al[8] reported that an earlier version of ChatGPT-4 received a 49.1% correct answer rate on a set of past OITE form-based questions, while Microsoft’s BingChat scored 52.4% and Google’s Bard 51.4%[8]. More recent studies comparing the performance of ChatGPT-4 and Google’s Gemini found near equivalent performance on orthopedic questions[13,14]. While all three tested models performed worse on more complex questions, GPT-5 had the least difficulty with level 5 questions and the smallest decrease in accuracy relative to level 1 questions. When stratifying by commonality of question topic, GPT-5 performed better on critical question topics but was still nearly equivalent to OE and GPT-4 at the community level. Thus, while GPT-5 demonstrated advanced performance for many fields, the ability to answer less commonly tested orthopedic knowledge has not seen any growth. Additionally, when incorrect answers were selected, across all models they frequently appeared to result from common pitfalls with the associated question. The question bank provides information on the frequency of selection by all users for each answer choice. Given that a model’s incorrect answer tended to be a commonly selected incorrect answer, it may be that some mistakes arise from a lack of ability to pick apart nuance in certain pathologies or presentations.
This reflects the training and process of these models as generating responses based on the likelihood of a correct connection calculated from the broad variety of information, valid or not, that each model has been trained on. The chatbots, while appearing conversational and thinking, still lack the ability to reason through nuance or identify common misconceptions in their thinking.

All three models performed better with text-based questions compared to the question set with visual media, although accuracy on the latter differed significantly between model conditions, with GPT-5 outperforming its past iteration (78% vs 67%). This finding was in concordance with previous studies that reported decreased accuracy across multiple LLMs when comparing image-including and text-only questions[8]. Our study grouped our standardized question set by the inclusion of visual media in each question. Since OE was not capable of image analysis, questions with visual media were run through OE twice: once with the question text alone and once with a description of the images associated with the question, taken directly from the question bank’s written explanation of the question, image, and answer choices. While this allowed us to make rough comparisons between model performance on these questions, multiple confounders are potentially introduced with this strategy. In the image-described instances, a bias is created in the accuracy of information provided to the OE model compared to the GPT models that needed to interpret the image independently. In the non-described instances, key information that the image provided and that may have been essential to a correct answer would be absent, deflating scores heavily. With these factors in mind, we found that OE (described) and GPT-5 performed similarly on the image-based question set (80% and 78%, respectively), and both outperformed GPT-4 (67%), with accuracy differing significantly across model conditions (P = 0.045). This suggests that OE is superior to GPT-4 for medical and orthopedic questions when given additional context. Moreover, our findings suggest that GPT-5 potentially represents a significant improvement in the utility of LLMs for the analysis and inquiry of medical visual material. Drouaud et al[15] specifically evaluated the image interpretation ability of ChatGPT-4, finding the chatbot’s performance neutral, with weakness in radiographic accuracy.
Our study shows improved performance in GPT-5 that could further encourage the usage of these chatbots by medical professionals and learners. The utility of these tools may lie in providing a supplement to the learning process for orthopaedic trainees and professionals. Navigating orthopaedic questions in clinical and educational settings can be challenging. These tools have the potential to provide support through the provision of focused discussion, conversational teaching and reasoning, and most importantly constant accessibility, all benefits that traditional teaching methods and informational sources may be limited by in one aspect or another.

Within our own study, our findings are limited by our question sample size and the singular question bank from which we drew our set. A more robust question set and utilization of other commonly used orthopedic training question banks would lend further impact to our analysis. This pilot study may be underpowered to detect small differences in accuracy between high-performing models. For example, detecting an absolute 5% difference in accuracy (e.g., 80% vs 85%) with 80% power would require substantially more questions per model, and therefore non-significant comparisons in this study should not be interpreted as evidence of equivalence. A larger question set could also increase the power of our sub-analysis by question subject. As previously mentioned, the inability of OE to analyze images makes direct comparison of the image-based questions set difficult between it and our other two models. The practice of describing images allowed us to make some comparisons, but still does not account for the additional opportunity of error in understanding an image accurately before utilizing it.
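To give a sense of scale for the power statement above, the following Python sketch applies the standard normal-approximation sample size formula for comparing two independent proportions. It is an illustrative calculation and not part of the authors’ analysis: the function name `questions_per_model` is hypothetical, a two-sided α of 0.05 is assumed, and the two models are treated as independent samples even though this study posed the same questions to every model (a paired design would call for a different calculation).

```python
from math import ceil, sqrt
from statistics import NormalDist

def questions_per_model(p1, p2, alpha=0.05, power=0.80):
    """Questions needed per model to detect p1 vs p2 (two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p1 - p2) ** 2)

n = questions_per_model(0.80, 0.85)
print(n)
```

Under these assumptions the requirement is on the order of 900 questions per model, several times the 140 questions used here.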

CONCLUSION

In conclusion, our study found that OE performed at a level similar to GPT-4 on orthopedic surgery training questions, with comparison to previous studies placing this performance at the level of senior orthopedic surgery residents. GPT-5 trended towards superior performance compared to both its previous model and OE across all subfields and question types but still showed no significant differences. All models had lower accuracy with questions that required analysis of visual media, identifying a persistent weakness with the use of LLMs in medical settings. However, these lower scores still represent increased performance for OE and GPT-5 compared to the established literature. OE is further affected by this factor through its inability to process and analyze images, an ability ChatGPT and other commonly used LLMs possess. While these drawbacks and the overall performance emphasize that these tools cannot replace direct clinical expertise, OE and GPT-5 still present a valuable option in a supportive capacity for medical providers seeking assistance, education, or patient communication, with performance that we have found to be on par with or superior to previously investigated LLMs.

References

1.	Civaner MM, Uncu Y, Bulut F, Chalil EG, Tatli A. Artificial intelligence in medical education: a cross-sectional needs assessment. BMC Med Educ. 2022;22:772.

2.	Sparks CA, Kraeutler MJ, Chester GA, Contrada EV, Zhu E, Fasulo SM, Scillia AJ. Inadequate Performance of ChatGPT on Orthopedic Board-Style Written Exams. Cureus. 2024;16:e62643.

3.	OpenEvidence is the leading medical information platform. [cited 11 March 2026]. Available from: https://www.openevidence.com/about.

4.	OpenEvidence Creates the First AI in History to Score a Perfect 100% on the United States Medical Licensing Examination (USMLE). [cited 11 March 2026]. Available from: https://www.openevidence.com/announcements/openevidence-creates-the-first-ai-in-history-to-score-a-perfect-100percent-on-the-united-states-medical-licensing-examination-usmle.

5.	Hayes DS, Foster BK, Makar G, Manzar S, Ozdag Y, Shultz M, Klena JC, Grandizio LC. Artificial Intelligence in Orthopaedics: Performance of ChatGPT on Text and Image Questions on a Complete AAOS Orthopaedic In-Training Examination (OITE). J Surg Educ. 2024;81:1645-1649.

6.	Hofmann HL, Guerra GA, Le JL, Wong AM, Hofmann GH, Mayfield CK, Petrigliano FA, Liu JN. The Rapid Development of Artificial Intelligence: GPT-4's Performance on Orthopedic Surgery Board Questions. Orthopedics. 2024;47:e85-e89.

7.	OrthoBullets. [cited 11 March 2026]. Available from: https://www.orthobullets.com/.

8.	Guerra GA, Hofmann HL, Le JL, Wong AM, Fathi A, Mayfield CK, Petrigliano FA, Liu JN. ChatGPT, Bard, and Bing Chat Are Large Language Processing Models That Answered Orthopaedic In-Training Examination Questions With Similar Accuracy to First-Year Orthopaedic Surgery Residents. Arthroscopy. 2025;41:557-562.

9.	Rowe N, Familia MC, Brown SM, Mulcahey MK. Orthopaedic In-training Exam Preparation among Orthopaedic Surgery Residency Programs. J Surg Educ. 2021;78:2146-2151.

10.	Margalit A, Mixa P, Day L, Marrache M, Mitchell S, Suresh KV, Wang K, Sabharwal S, Li TP, Loeb A, Naziri Q, Henn RF, Laporte D. Top Three Learning Platforms for Orthopaedic In-Training Knowledge Produce Different Results. JAAOS Glob Res Rev. 2021;5:e21.00148.

11.	Fritz E, Bednar M, Harrast J, Marsh JL, Martin D, Swanson D, Tornetta P, Van Heest A. Do Orthopaedic In-Training Examination Scores Predict the Likelihood of Passing the American Board of Orthopaedic Surgery Part I Examination? An Update With 2014 to 2018 Data. J Am Acad Orthop Surg. 2021;29:e1370-e1377.

12.	Introducing GPT-5. [cited 11 March 2026]. Available from: https://openai.com/index/introducing-gpt-5/.

13.	Quinn M, Milner JD, Schmitt P, Morrissey P, Lemme N, Marcaccio S, DeFroda S, Tabaddor R, Owens BD. Artificial Intelligence Large Language Models Address Anterior Cruciate Ligament Reconstruction: Superior Clarity and Completeness by Gemini Compared With ChatGPT-4 in Response to American Academy of Orthopaedic Surgeons Clinical Practice Guidelines. Arthroscopy. 2025;41:2002-2008.

14.	Tong L, Zhang C, Liu R, Yang J, Sun Z. Comparative performance analysis of large language models: ChatGPT-3.5, ChatGPT-4 and Google Gemini in glucocorticoid-induced osteoporosis. J Orthop Surg Res. 2024;19:574.

15.	Drouaud A, Stocchi C, Tang J, Gonsalves G, Cheung Z, Szatkowski J, Forsh D. Exploring the Performance of ChatGPT in an Orthopaedic Setting and Its Potential Use as an Educational Tool. JB JS Open Access. 2024;9:e24.00081.

Footnotes

Peer review: Externally peer reviewed.

Peer-review model: Single blind

Specialty type: Orthopedics

Country of origin: United States

Peer-review report’s classification

Scientific quality: Grade C

Novelty: Grade A

Creativity or innovation: Grade A

Scientific significance: Grade C

P-Reviewer: Bulgurcu A, PhD, Lecturer, Türkiye; S-Editor: Liu H; L-Editor: A; P-Editor: Zhao YQ