OpenEvidence performs at similar levels compared to current and previous GPT models on orthopedic training and education questions

doi:10.5312/wjo.v17.i6.118593

Publication Name: World Journal of Orthopedics
ISSN: 2218-5836
Publisher: Baishideng Publishing Group Inc, 7041 Koll Center Parkway, Suite 160, Pleasanton, CA 94566, USA

Observational Study

Copyright: ©Author(s) 2026. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution-NonCommercial (CC BY-NC 4.0) license. No commercial re-use. See permissions. Published by Baishideng Publishing Group Inc.

World J Orthop. Jun 18, 2026; 17(6): 118593
Published online Jun 18, 2026. doi: 10.5312/wjo.v17.i6.118593

Table 1 Comparison of question characteristics stratified by answer correctness, n (%)

Variable	Overall	Incorrect	Correct	P value¹
Image included				0.001
Image	184	54 (29)	130 (71)
No image	282	47 (17)	235 (83)
Model				0.087
GPT-4	140	33 (24)	107 (76)
GPT-5	140	21 (15)	119 (85)
OE	140	38 (27)	102 (73)
OE (described)	46	9 (20)	37 (80)
Specialty				< 0.001
Basic science	62	5 (8.1)	57 (92)
Foot and ankle	29	11 (38)	18 (62)
Hand	30	4 (13)	26 (87)
Knee and sports	53	14 (26)	39 (74)
Pathology	28	0 (0)	28 (100)
Pediatrics	52	10 (19)	42 (81)
Recon	55	16 (29)	39 (71)
Shoulder and elbow	39	8 (21)	31 (79)
Spine	69	16 (23)	53 (77)
Trauma	49	17 (35)	32 (65)
Perceived importance				0.4
A	83	15 (18)	68 (82)
B	83	21 (25)	62 (75)
C	100	18 (18)	82 (82)
D	100	20 (20)	80 (80)
E	100	27 (27)	73 (73)
Question complexity				0.005
1	100	14 (14)	86 (86)
2	100	16 (16)	84 (84)
3	100	19 (19)	81 (81)
4	100	32 (32)	68 (68)
5	66	20 (30)	46 (70)

¹Pearson’s χ² test; Wilcoxon rank sum test.

OE: OpenEvidence.
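
As a rough check on the "Image included" rows of Table 1, the Python sketch below recomputes a Pearson χ² test from the reported counts. It assumes no continuity correction was applied, which the footnote does not state, and the function names `pearson_chi2` and `chi2_sf_1df` are illustrative rather than taken from the article.

```python
import math


def pearson_chi2(table):
    """Return (statistic, degrees of freedom) for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2))


# Rows: image, no image. Columns: incorrect, correct (counts from Table 1).
image_table = [[54, 130], [47, 235]]
stat, dof = pearson_chi2(image_table)
p_value = chi2_sf_1df(stat)
print(f"chi2 = {stat:.2f}, dof = {dof}, P = {p_value:.4f}")
```

Under these assumptions the statistic is about 10.5 on 1 degree of freedom, giving a P value close to 0.001, in line with the tabulated value.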

Table 2 Image-based question characteristics by language model and response accuracy (n = 46), n (%)

	GPT-4		GPT-5		OE		OE (described)
	Correct	P value¹	Correct	P value¹	Correct	P value¹	Correct	P value¹
Total correct	31 (67)		36 (78)		26 (57)		37 (80)
Specialty		0.3		0.3		0.5		0.2
Basic science	2 (100)		2 (100)		2 (100)		2 (100)
Foot and ankle	1 (25)		2 (50)		1 (25)		2 (50)
Hand	6 (100)		5 (83)		4 (67)		5 (83)
Knee and sports	3 (60)		4 (80)		3 (60)		4 (80)
Pathology	4 (100)		4 (100)		4 (100)		4 (100)
Pediatrics	5 (71)		6 (86)		3 (43)		7 (100)
Recon	2 (50)		1 (25)		1 (25)		2 (50)
Shoulder and elbow	3 (75)		4 (100)		3 (75)		4 (100)
Spine	3 (50)		5 (83)		3 (50)		5 (83)
Trauma	2 (50)		3 (75)		2 (50)		2 (50)
Perceived importance		0.3		0.5		0.5		0.6
A	5 (63)		7 (88)		5 (63)		7 (88)
B	6 (75)		7 (88)		5 (63)		7 (88)
C	8 (80)		9 (90)		7 (70)		8 (80)
D	8 (80)		7 (70)		6 (60)		9 (90)
E	4 (40)		6 (60)		3 (30)		6 (60)
Question complexity		0.2		0.3		0.6		> 0.9
1	8 (80)		10 (100)		7 (70)		8 (80)
2	8 (80)		7 (70)		5 (50)		9 (90)
3	8 (80)		8 (80)		7 (70)		7 (70)
4	4 (40)		6 (60)		4 (40)		8 (80)
5	3 (50)		5 (83)		3 (50)		5 (83)

¹Fisher’s exact test; Wilcoxon rank sum test.

OE: OpenEvidence.

Table 3 Text-only question characteristics by language model and response accuracy (n = 94), n (%)

	GPT-4		GPT-5		OE
	Correct	P value¹	Correct	P value¹	Correct	P value¹
Total correct	76 (81)		83 (88)		76 (81)
Specialty		0.6		> 0.9		0.8
Basic science	16 (89)		17 (94)		16 (89)
Foot and ankle	4 (80)		4 (100)		4 (100)
Hand	2 (100)		2 (100)		2 (100)
Knee and sports	8 (73)		9 (82)		8 (73)
Pathology	4 (100)		4 (100)		4 (100)
Pediatrics	7 (88)		7 (88)		7 (88)
Recon	11 (85)		11 (85)		11 (85)
Shoulder and elbow	5 (71)		7 (88)		5 (63)
Spine	13 (87)		13 (87)		11 (73)
Trauma	6 (55)		9 (82)		8 (73)
Perceived importance		0.3		0.3		0.5
A	13 (76)		17 (100)		14 (82)
B	12 (71)		14 (82)		11 (65)
C	15 (75)		18 (90)		17 (85)
D	17 (85)		16 (80)		17 (85)
E	19 (95)		18 (90)		17 (85)
Question complexity		0.025		0.3		0.9
1	19 (95)		17 (85)		17 (85)
2	19 (95)		20 (100)		16 (80)
3	16 (80)		18 (90)		17 (85)
4	13 (65)		17 (85)		16 (80)
5	9 (64)		11 (79)		10 (71)

¹Fisher’s exact test; Wilcoxon rank sum test.

OE: OpenEvidence.
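
The per-model totals in Tables 2 and 3 can be cross-checked against the Model rows of Table 1. The Python sketch below pools the image-based (n = 46) and text-only (n = 94) correct counts for each model. The counts are copied from the tables and the variable names are illustrative. "OE (described)" is left out because it covers only the 46 image-based questions.

```python
# Correct answers per model, copied from the "Total correct" rows.
image_correct = {"GPT-4": 31, "GPT-5": 36, "OE": 26}  # Table 2, n = 46
text_correct = {"GPT-4": 76, "GPT-5": 83, "OE": 76}   # Table 3, n = 94
N_IMAGE, N_TEXT = 46, 94

# Correct counts reported in the Model rows of Table 1 (140 questions each).
table1_correct = {"GPT-4": 107, "GPT-5": 119, "OE": 102}

pooled = {}
for model in image_correct:
    total = N_IMAGE + N_TEXT
    correct = image_correct[model] + text_correct[model]
    pooled[model] = (correct, total, round(100 * correct / total))
    print(f"{model}: {correct}/{total} correct ({pooled[model][2]}%)")
```

The pooled counts (107, 119, and 102 of 140) and rounded percentages (76, 85, and 73) agree with Table 1.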

Citation: Javid K, Driessche A, Clymer C, Abbas MJ, Pantuso A, Maier LM, Hoegler J, Hakeos WM, Guthrie ST. OpenEvidence performs at similar levels compared to current and previous GPT models on orthopedic training and education questions. World J Orthop 2026; 17(6): 118593
URL: https://www.wjgnet.com/2218-5836/full/v17/i6/118593.htm
DOI: https://dx.doi.org/10.5312/wjo.v17.i6.118593
