Observational Study
Copyright: ©Author(s) 2026.
World J Orthop. Jun 18, 2026; 17(6): 118593
Published online Jun 18, 2026. doi: 10.5312/wjo.v17.i6.118593
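Table 1 Comparison of question characteristics stratified by answer correctness, n (%)

| Variable | Overall | Incorrect | Correct | P value¹ |
| --- | --- | --- | --- | --- |
| Image included | | | | 0.001 |
| Image | 184 | 54 (29) | 130 (71) | |
| No image | 282 | 47 (17) | 235 (83) | |
| Model | | | | 0.087 |
| GPT-4 | 140 | 33 (24) | 107 (76) | |
| GPT-5 | 140 | 21 (15) | 119 (85) | |
| OE | 140 | 38 (27) | 102 (73) | |
| OE (described) | 46 | 9 (20) | 37 (80) | |
| Specialty | | | | < 0.001 |
| Basic science | 62 | 5 (8.1) | 57 (92) | |
| Foot and ankle | 29 | 11 (38) | 18 (62) | |
| Hand | 30 | 4 (13) | 26 (87) | |
| Knee and sports | 53 | 14 (26) | 39 (74) | |
| Pathology | 28 | 0 (0) | 28 (100) | |
| Pediatrics | 52 | 10 (19) | 42 (81) | |
| Recon | 55 | 16 (29) | 39 (71) | |
| Shoulder and elbow | 39 | 8 (21) | 31 (79) | |
| Spine | 69 | 16 (23) | 53 (77) | |
| Trauma | 49 | 17 (35) | 32 (65) | |
| Perceived importance | | | | 0.4 |
| A | 83 | 15 (18) | 68 (82) | |
| B | 83 | 21 (25) | 62 (75) | |
| C | 100 | 18 (18) | 82 (82) | |
| D | 100 | 20 (20) | 80 (80) | |
| E | 100 | 27 (27) | 73 (73) | |
| Question complexity | | | | 0.005 |
| 1 | 100 | 14 (14) | 86 (86) | |
| 2 | 100 | 16 (16) | 84 (84) | |
| 3 | 100 | 19 (19) | 81 (81) | |
| 4 | 100 | 32 (32) | 68 (68) | |
| 5 | 66 | 20 (30) | 46 (70) | |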
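
As an illustration of how the image-related comparison in Table 1 could be checked, the following is a minimal sketch that applies Pearson's chi-squared test to the 2 × 2 counts (image vs no image, incorrect vs correct). The choice of test is an assumption: the footnote defining the P value is not reproduced in this excerpt, so the authors may have used a different test (for example, Fisher's exact test). The function name `chi_squared_2x2` is hypothetical and used only for this example.

```python
import math


def chi_squared_2x2(table):
    """Pearson chi-squared statistic and p-value for a 2 x 2 table.

    Assumption: no continuity correction; this may differ from the
    test actually used in the study.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected

    # With 1 degree of freedom, the chi-squared survival function
    # reduces to erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Counts from Table 1: rows are image / no image,
# columns are incorrect / correct.
image_table = [[54, 130], [47, 235]]
statistic, p_value = chi_squared_2x2(image_table)
print(f"chi-squared = {statistic:.2f}, p = {p_value:.4f}")
```

Under these assumptions the resulting P value is of the same order as the 0.001 reported for "Image included"; an exact match is not expected if a different test or correction was applied.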
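Table 2 Image-based question characteristics by language model and response accuracy (n = 46), n (%)

| Variable | GPT-4 correct | GPT-4 P value¹ | GPT-5 correct | GPT-5 P value¹ | OE correct | OE P value¹ | OE (described) correct | OE (described) P value¹ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Total correct | 31 (67) | | 36 (78) | | 26 (57) | | 37 (80) | |
| Specialty | | 0.3 | | 0.3 | | 0.5 | | 0.2 |
| Basic science | 2 (100) | | 2 (100) | | 2 (100) | | 2 (100) | |
| Foot and ankle | 1 (25) | | 2 (50) | | 1 (25) | | 2 (50) | |
| Hand | 6 (100) | | 5 (83) | | 4 (67) | | 5 (83) | |
| Knee and sports | 3 (60) | | 4 (80) | | 3 (60) | | 4 (80) | |
| Pathology | 4 (100) | | 4 (100) | | 4 (100) | | 4 (100) | |
| Pediatrics | 5 (71) | | 6 (86) | | 3 (43) | | 7 (100) | |
| Recon | 2 (50) | | 1 (25) | | 1 (25) | | 2 (50) | |
| Shoulder and elbow | 3 (75) | | 4 (100) | | 3 (75) | | 4 (100) | |
| Spine | 3 (50) | | 5 (83) | | 3 (50) | | 5 (83) | |
| Trauma | 2 (50) | | 3 (75) | | 2 (50) | | 2 (50) | |
| Perceived importance | | 0.3 | | 0.5 | | 0.5 | | 0.6 |
| A | 5 (63) | | 7 (88) | | 5 (63) | | 7 (88) | |
| B | 6 (75) | | 7 (88) | | 5 (63) | | 7 (88) | |
| C | 8 (80) | | 9 (90) | | 7 (70) | | 8 (80) | |
| D | 8 (80) | | 7 (70) | | 6 (60) | | 9 (90) | |
| E | 4 (40) | | 6 (60) | | 3 (30) | | 6 (60) | |
| Question complexity | | 0.2 | | 0.3 | | 0.6 | | > 0.9 |
| 1 | 8 (80) | | 10 (100) | | 7 (70) | | 8 (80) | |
| 2 | 8 (80) | | 7 (70) | | 5 (50) | | 9 (90) | |
| 3 | 8 (80) | | 8 (80) | | 7 (70) | | 7 (70) | |
| 4 | 4 (40) | | 6 (60) | | 4 (40) | | 8 (80) | |
| 5 | 3 (50) | | 5 (83) | | 3 (50) | | 5 (83) | |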
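Table 3 Text-only question characteristics by language model and response accuracy (n = 94), n (%)

| Variable | GPT-4 correct | GPT-4 P value¹ | GPT-5 correct | GPT-5 P value¹ | OE correct | OE P value¹ |
| --- | --- | --- | --- | --- | --- | --- |
| Total correct | 76 (81) | | 83 (88) | | 76 (81) | |
| Specialty | | 0.6 | | > 0.9 | | 0.8 |
| Basic science | 16 (89) | | 17 (94) | | 16 (89) | |
| Foot and ankle | 4 (80) | | 4 (100) | | 4 (100) | |
| Hand | 2 (100) | | 2 (100) | | 2 (100) | |
| Knee and sports | 8 (73) | | 9 (82) | | 8 (73) | |
| Pathology | 4 (100) | | 4 (100) | | 4 (100) | |
| Pediatrics | 7 (88) | | 7 (88) | | 7 (88) | |
| Recon | 11 (85) | | 11 (85) | | 11 (85) | |
| Shoulder and elbow | 5 (71) | | 7 (88) | | 5 (63) | |
| Spine | 13 (87) | | 13 (87) | | 11 (73) | |
| Trauma | 6 (55) | | 9 (82) | | 8 (73) | |
| Perceived importance | | 0.3 | | 0.3 | | 0.5 |
| A | 13 (76) | | 17 (100) | | 14 (82) | |
| B | 12 (71) | | 14 (82) | | 11 (65) | |
| C | 15 (75) | | 18 (90) | | 17 (85) | |
| D | 17 (85) | | 16 (80) | | 17 (85) | |
| E | 19 (95) | | 18 (90) | | 17 (85) | |
| Question complexity | | 0.025 | | 0.3 | | 0.9 |
| 1 | 19 (95) | | 17 (85) | | 17 (85) | |
| 2 | 19 (95) | | 20 (100) | | 16 (80) | |
| 3 | 16 (80) | | 18 (90) | | 17 (85) | |
| 4 | 13 (65) | | 17 (85) | | 16 (80) | |
| 5 | 9 (64) | | 11 (79) | | 10 (71) | |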

