Copyright: ©Author(s) 2026.
World J Orthop. Jun 18, 2026; 17(6): 118593
Published online Jun 18, 2026. doi: 10.5312/wjo.v17.i6.118593
Table 1 Comparison of question characteristics stratified by answer correctness, n (%)
| Variable | Overall | Incorrect | Correct | P value¹ |
| --- | --- | --- | --- | --- |
| Image included | | | | 0.001 |
| Image | 184 | 54 (29) | 130 (71) | |
| No image | 282 | 47 (17) | 235 (83) | |
| Model | | | | 0.087 |
| GPT-4 | 140 | 33 (24) | 107 (76) | |
| GPT-5 | 140 | 21 (15) | 119 (85) | |
| OE | 140 | 38 (27) | 102 (73) | |
| OE (described) | 46 | 9 (20) | 37 (80) | |
| Specialty | | | | < 0.001 |
| Basic science | 62 | 5 (8.1) | 57 (92) | |
| Foot and ankle | 29 | 11 (38) | 18 (62) | |
| Hand | 30 | 4 (13) | 26 (87) | |
| Knee and sports | 53 | 14 (26) | 39 (74) | |
| Pathology | 28 | 0 (0) | 28 (100) | |
| Pediatrics | 52 | 10 (19) | 42 (81) | |
| Recon | 55 | 16 (29) | 39 (71) | |
| Shoulder and elbow | 39 | 8 (21) | 31 (79) | |
| Spine | 69 | 16 (23) | 53 (77) | |
| Trauma | 49 | 17 (35) | 32 (65) | |
| Perceived importance | | | | 0.4 |
| A | 83 | 15 (18) | 68 (82) | |
| B | 83 | 21 (25) | 62 (75) | |
| C | 100 | 18 (18) | 82 (82) | |
| D | 100 | 20 (20) | 80 (80) | |
| E | 100 | 27 (27) | 73 (73) | |
| Question complexity | | | | 0.005 |
| 1 | 100 | 14 (14) | 86 (86) | |
| 2 | 100 | 16 (16) | 84 (84) | |
| 3 | 100 | 19 (19) | 81 (81) | |
| 4 | 100 | 32 (32) | 68 (68) | |
| 5 | 66 | 20 (30) | 46 (70) | |
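The P values in Table 1 can be sanity-checked from the reported counts. The source shown here does not include the footnote that names the test behind each P value, so the following is only a minimal sketch under an assumption: a Pearson chi-squared test without continuity correction, applied to the "Image included" and "Model" rows. The helper names `pearson_chi2` and `chi2_sf` are illustrative, not taken from the article, and the closed-form tail probability is implemented only for 1 and 3 degrees of freedom.

```python
import math


def pearson_chi2(table):
    """Return (statistic, degrees of freedom) for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variable (df = 1 or 3 only)."""
    tail = math.erfc(math.sqrt(x / 2))
    if df == 1:
        return tail
    if df == 3:
        return tail + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)
    raise ValueError("this sketch only handles df = 1 or df = 3")


# Counts copied from Table 1 as [incorrect, correct] per row.
image_table = [[54, 130],   # Image
               [47, 235]]   # No image
model_table = [[33, 107],   # GPT-4
               [21, 119],   # GPT-5
               [38, 102],   # OE
               [9, 37]]     # OE (described)

chi2_image, df_image = pearson_chi2(image_table)
p_image = chi2_sf(chi2_image, df_image)

chi2_model, df_model = pearson_chi2(model_table)
p_model = chi2_sf(chi2_model, df_model)

print(f"Image included: chi2 = {chi2_image:.2f}, df = {df_image}, p = {p_image:.3f}")
print(f"Model:          chi2 = {chi2_model:.2f}, df = {df_model}, p = {p_model:.3f}")
```

Under this assumption the two results round to the tabulated 0.001 and 0.087. That agreement is consistent with, but does not confirm, the test the authors used. Rows with sparse cells, such as Specialty (0 incorrect for Pathology), may have been analyzed with a different test.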
Table 2 Image-based question characteristics by language model and response accuracy (n = 46), n (%)
| Variable | GPT-4 correct | GPT-4 P value¹ | GPT-5 correct | GPT-5 P value¹ | OE correct | OE P value¹ | OE (described) correct | OE (described) P value¹ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Total correct | 31 (67) | | 36 (78) | | 26 (57) | | 37 (80) | |
| Specialty | | 0.3 | | 0.3 | | 0.5 | | 0.2 |
| Basic science | 2 (100) | | 2 (100) | | 2 (100) | | 2 (100) | |
| Foot and ankle | 1 (25) | | 2 (50) | | 1 (25) | | 2 (50) | |
| Hand | 6 (100) | | 5 (83) | | 4 (67) | | 5 (83) | |
| Knee and sports | 3 (60) | | 4 (80) | | 3 (60) | | 4 (80) | |
| Pathology | 4 (100) | | 4 (100) | | 4 (100) | | 4 (100) | |
| Pediatrics | 5 (71) | | 6 (86) | | 3 (43) | | 7 (100) | |
| Recon | 2 (50) | | 1 (25) | | 1 (25) | | 2 (50) | |
| Shoulder and elbow | 3 (75) | | 4 (100) | | 3 (75) | | 4 (100) | |
| Spine | 3 (50) | | 5 (83) | | 3 (50) | | 5 (83) | |
| Trauma | 2 (50) | | 3 (75) | | 2 (50) | | 2 (50) | |
| Perceived importance | | 0.3 | | 0.5 | | 0.5 | | 0.6 |
| A | 5 (63) | | 7 (88) | | 5 (63) | | 7 (88) | |
| B | 6 (75) | | 7 (88) | | 5 (63) | | 7 (88) | |
| C | 8 (80) | | 9 (90) | | 7 (70) | | 8 (80) | |
| D | 8 (80) | | 7 (70) | | 6 (60) | | 9 (90) | |
| E | 4 (40) | | 6 (60) | | 3 (30) | | 6 (60) | |
| Question complexity | | 0.2 | | 0.3 | | 0.6 | | > 0.9 |
| 1 | 8 (80) | | 10 (100) | | 7 (70) | | 8 (80) | |
| 2 | 8 (80) | | 7 (70) | | 5 (50) | | 9 (90) | |
| 3 | 8 (80) | | 8 (80) | | 7 (70) | | 7 (70) | |
| 4 | 4 (40) | | 6 (60) | | 4 (40) | | 8 (80) | |
| 5 | 3 (50) | | 5 (83) | | 3 (50) | | 5 (83) | |
Table 3 Text-only question characteristics by language model and response accuracy (n = 94), n (%)
| Variable | GPT-4 correct | GPT-4 P value¹ | GPT-5 correct | GPT-5 P value¹ | OE correct | OE P value¹ |
| --- | --- | --- | --- | --- | --- | --- |
| Total correct | 76 (81) | | 83 (88) | | 76 (81) | |
| Specialty | | 0.6 | | > 0.9 | | 0.8 |
| Basic science | 16 (89) | | 17 (94) | | 16 (89) | |
| Foot and ankle | 4 (80) | | 4 (100) | | 4 (100) | |
| Hand | 2 (100) | | 2 (100) | | 2 (100) | |
| Knee and sports | 8 (73) | | 9 (82) | | 8 (73) | |
| Pathology | 4 (100) | | 4 (100) | | 4 (100) | |
| Pediatrics | 7 (88) | | 7 (88) | | 7 (88) | |
| Recon | 11 (85) | | 11 (85) | | 11 (85) | |
| Shoulder and elbow | 5 (71) | | 7 (88) | | 5 (63) | |
| Spine | 13 (87) | | 13 (87) | | 11 (73) | |
| Trauma | 6 (55) | | 9 (82) | | 8 (73) | |
| Perceived importance | | 0.3 | | 0.3 | | 0.5 |
| A | 13 (76) | | 17 (100) | | 14 (82) | |
| B | 12 (71) | | 14 (82) | | 11 (65) | |
| C | 15 (75) | | 18 (90) | | 17 (85) | |
| D | 17 (85) | | 16 (80) | | 17 (85) | |
| E | 19 (95) | | 18 (90) | | 17 (85) | |
| Question complexity | | 0.025 | | 0.3 | | 0.9 |
| 1 | 19 (95) | | 17 (85) | | 17 (85) | |
| 2 | 19 (95) | | 20 (100) | | 16 (80) | |
| 3 | 16 (80) | | 18 (90) | | 17 (85) | |
| 4 | 13 (65) | | 17 (85) | | 16 (80) | |
| 5 | 9 (64) | | 11 (79) | | 10 (71) | |
- Citation: Javid K, Driessche A, Clymer C, Abbas MJ, Pantuso A, Maier LM, Hoegler J, Hakeos WM, Guthrie ST. OpenEvidence performs at similar levels compared to current and previous GPT models on orthopedic training and education questions. World J Orthop 2026; 17(6): 118593
- URL: https://www.wjgnet.com/2218-5836/full/v17/i6/118593.htm
- DOI: https://dx.doi.org/10.5312/wjo.v17.i6.118593