Li Y, Huang CK, Hu Y, Zhou XD, He C, Zhong JW. Exploring the performance of large language models on hepatitis B infection-related questions: A comparative study. World J Gastroenterol 2025; 31(3): 101092 [DOI: 10.3748/wjg.v31.i3.101092]
Corresponding Author of This Article
Cong He, Associate Chief Physician, MD, Department of Gastroenterology, Jiangxi Provincial Key Laboratory of Digestive Diseases, Jiangxi Clinical Research Center for Gastroenterology, Digestive Disease Hospital, The First Affiliated Hospital, Jiangxi Medical College, Nanchang University, No. 17 Yong Waizheng Street, Nanchang 330006, Jiangxi Province, China. hecong.1987@163.com
Research Domain of This Article
Gastroenterology & Hepatology
Article-Type of This Article
Basic Study
Open-Access Policy of This Article
This article is an open-access article which was selected by an in-house editor and fully peer-reviewed by external reviewers. It is distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/
World J Gastroenterol. Jan 21, 2025; 31(3): 101092. Published online Jan 21, 2025. doi: 10.3748/wjg.v31.i3.101092
Table 1 Quality indicators (scientific adequacy) for answers from ChatGPT-3.5, ChatGPT-4.0, and Google Gemini

| Common questions | Sources of answers | Answer length, 1st run | Answer length, 2nd run | Answer length, 3rd run | Grade, mean | Grade, P value |
| --- | --- | --- | --- | --- | --- | --- |
| Overall (mean) | ChatGPT-3.5 | 275 | 366 | 352 | 3.50 | 0.2 |
| | ChatGPT-4.0 | 274 | 252 | 238 | 3.69 | |
| | Google Gemini | 307 | 322 | 325 | 3.53 | |
| Risk factors | | | | | | |
| What are the transmission modes of hepatitis B virus? | ChatGPT-3.5 | 189 | 316 | 400 | 3.67 | 0.296 |
| | ChatGPT-4.0 | 358 | 241 | 220 | 4.00 | |
| | Google Gemini | 264 | 291 | 291 | 3.33 | |
| Clinical manifestation | | | | | | |
| What are the symptoms of hepatitis B infection? | ChatGPT-3.5 | 247 | 333 | 356 | 3.67 | 0.216 |
| | ChatGPT-4.0 | 269 | 276 | 295 | 3.67 | |
| | Google Gemini | 226 | 349 | 352 | 3.00 | |
| Diagnosis | | | | | | |
| What is the most accurate test for diagnosing Hepatitis B infection? | ChatGPT-3.5 | 223 | 341 | 348 | 3.67 | 0.027 |
| | ChatGPT-4.0 | 307 | 349 | 280 | 4.00 | |
| | Google Gemini | 281 | 281 | 281 | 3.00 | |
| Treatment | | | | | | |
| Can hepatitis B infection be cured clinically? | ChatGPT-3.5 | 334 | 357 | 395 | 3.67 | 0.216 |
| | ChatGPT-4.0 | 271 | 324 | 264 | 3.67 | |
| | Google Gemini | 268 | 360 | 359 | 3.00 | |
| What are the indications of antiviral therapy for patients infected with hepatitis B virus? | ChatGPT-3.5 | 368 | 367 | 402 | 3.67 | 0.296 |
| | ChatGPT-4.0 | 351 | 334 | 296 | 3.33 | |
| | Google Gemini | 385 | 384 | 392 | 3.00 | |
| Can patients infected with hepatitis B virus be pregnant during antiviral treatment? | ChatGPT-3.5 | 341 | 392 | 383 | 3.33 | 0.079 |
| | ChatGPT-4.0 | 319 | 247 | 242 | 4.00 | |
| | Google Gemini | 369 | 352 | 351 | 4.00 | |
| Do patients diagnosed with chronic hepatitis B during pregnancy need antiviral therapy? | ChatGPT-3.5 | 366 | 416 | 383 | 3.00 | 0.296 |
| | ChatGPT-4.0 | 230 | 313 | 256 | 3.33 | |
| | Google Gemini | 325 | 330 | 375 | 3.67 | |
| Can patients diagnosed with chronic hepatitis B during lactation be treated with antiviral therapy? | ChatGPT-3.5 | 366 | 419 | 391 | 3.33 | 0.296 |
| | ChatGPT-4.0 | 245 | 190 | 218 | 3.67 | |
| | Google Gemini | 362 | 328 | 330 | 4.00 | |
| Prevention | | | | | | |
| How long should a newborn receive the first dose of hepatitis B vaccine after birth? | ChatGPT-3.5 | 133 | 392 | 185 | 3.67 | 0.296 |
| | ChatGPT-4.0 | 182 | 146 | 146 | 3.33 | |
| | Google Gemini | 193 | 207 | 201 | 4.00 | |
| Can pregnant women receive hepatitis B vaccine? | ChatGPT-3.5 | 181 | 397 | 338 | 4.00 | |
| | ChatGPT-4.0 | 179 | 183 | 149 | 4.00 | |
| | Google Gemini | 277 | 318 | 318 | 4.00 | |
| How often should patients with hepatitis B virus infection be reexamined? | ChatGPT-3.5 | 209 | 421 | 328 | 3.00 | 0.027 |
| | ChatGPT-4.0 | 275 | 171 | 205 | 3.33 | |
| | Google Gemini | 330 | 334 | 334 | 4.00 | |
| Prognosis | | | | | | |
| What are the complications of hepatitis B infection? | ChatGPT-3.5 | 343 | 235 | 305 | 3.33 | 0.216 |
| | ChatGPT-4.0 | 300 | 245 | 280 | 4.00 | |
| | Google Gemini | 405 | 326 | 318 | 3.33 | |

Table 2 Performance of ChatGPT-3.5, ChatGPT-4.0 and Google Gemini on hepatitis B infection test questions by different subfields, n (%)

| Test questions by subfields | ChatGPT-3.5, correct | ChatGPT-3.5, incorrect | ChatGPT-4.0, correct | ChatGPT-4.0, incorrect | Google Gemini, correct | Google Gemini, incorrect |
| --- | --- | --- | --- | --- | --- | --- |
| All test questions (n) | 52 | | 52 | | 52 | |
| 1st run | 34 (65.4) | 18 (34.6) | 43 (82.7) | 9 (17.3) | 37 (71.1) | 15 (28.9) |
| 2nd run | 30 (57.7) | 22 (42.3) | 41 (78.9) | 11 (21.1) | 38 (73.1) | 14 (26.9) |
| 3rd run | 34 (65.4) | 18 (34.6) | 42 (80.8) | 10 (19.2) | 39 (75) | 13 (25) |
| Concordance among 3 runs | 41 (78.9) | | 46 (88.4) | | 50 (96.2) | |
| Total accuracy (%) | 62.9 | | 80.8 | | 73.1 | |
| Risk factors (n) | 5 | | 5 | | 5 | |
| 1st run | 5 (100) | 0 (0) | 5 (100) | 0 (0) | 5 (100) | 0 (0) |
| 2nd run | 5 (100) | 0 (0) | 5 (100) | 0 (0) | 5 (100) | 0 (0) |
| 3rd run | 5 (100) | 0 (0) | 5 (100) | 0 (0) | 5 (100) | 0 (0) |
| Concordance among 3 runs | 5 (100) | | 5 (100) | | 5 (100) | |
| Total accuracy (%) | 100 | | 100 | | 100 | |
| Clinical manifestation (n) | 7 | | 7 | | 7 | |
| 1st run | 2 (28.6) | 5 (71.4) | 4 (57.1) | 3 (42.9) | 5 (71.4) | 2 (28.6) |
| 2nd run | 2 (28.6) | 5 (71.4) | 4 (57.1) | 3 (42.9) | 5 (71.4) | 2 (28.6) |
| 3rd run | 3 (42.9) | 4 (57.1) | 4 (57.1) | 3 (42.9) | 5 (71.4) | 2 (28.6) |
| Concordance among 3 runs | 5 (71.4) | | 6 (85.7) | | 7 (100) | |
| Total accuracy (%) | 33.3 | | 57.1 | | 71.4 | |
| Diagnosis (n) | 18 | | 18 | | 18 | |
| 1st run | 9 (50) | 9 (50) | 15 (83.3) | 3 (16.7) | 13 (72.2) | 5 (27.8) |
| 2nd run | 8 (44.4) | 10 (55.6) | 15 (83.3) | 3 (16.7) | 14 (77.8) | 4 (22.2) |
| 3rd run | 11 (61.1) | 7 (38.9) | 15 (83.3) | 3 (16.7) | 15 (83.3) | 3 (16.7) |
| Concordance among 3 runs | 12 (66.7) | | 16 (88.9) | | 16 (88.9) | |
| Total accuracy (%) | 51.9 | | 83.3 | | 77.8 | |
| Treatment (n) | 11 | | 11 | | 11 | |
| 1st run | 11 (100) | 0 (0) | 10 (90.9) | 1 (9.1) | 9 (81.9) | 2 (18.1) |
| 2nd run | 10 (90.9) | 1 (9.1) | 10 (90.9) | 1 (9.1) | 9 (81.9) | 2 (18.1) |
| 3rd run | 10 (90.9) | 1 (9.1) | 11 (100) | 0 (0) | 9 (81.9) | 2 (18.1) |
| Concordance among 3 runs | 10 (90.9) | | 10 (90.9) | | 11 (100) | |
| Total accuracy (%) | 93.9 | | 93.9 | | 81.9 | |
| Prevention (n) | 7 | | 7 | | 7 | |
| 1st run | 4 (57.1) | 3 (42.9) | 6 (85.7) | 1 (14.3) | 3 (42.9) | 4 (57.1) |
| 2nd run | 3 (42.9) | 4 (57.1) | 4 (57.1) | 3 (42.9) | 3 (42.9) | 4 (57.1) |
| 3rd run | 3 (42.9) | 4 (57.1) | 4 (57.1) | 3 (42.9) | 3 (42.9) | 4 (57.1) |
| Concordance among 3 runs | 6 (85.7) | | 5 (71.4) | | 7 (100) | |
| Total accuracy (%) | 47.6 | | 66.7 | | 42.9 | |
| Prognosis (n) | 4 | | 4 | | 4 | |
| 1st run | 3 (75) | 1 (25) | 3 (75) | 1 (25) | 2 (50) | 2 (50) |
| 2nd run | 2 (50) | 2 (50) | 3 (75) | 1 (25) | 2 (50) | 2 (50) |
| 3rd run | 2 (50) | 2 (50) | 3 (75) | 1 (25) | 2 (50) | 2 (50) |
| Concordance among 3 runs | 3 (75) | | 4 (100) | | 4 (100) | |
| Total accuracy (%) | 58.3 | | 75 | | 50 | |
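
The "Total accuracy (%)" rows in Table 2 are consistent with pooling the correct answers from all three runs and dividing by the total number of questions asked across those runs. The Python sketch below illustrates that reading. The function name `pooled_accuracy` is hypothetical, and the pooling rule is an assumption inferred from the table values, not a method stated in this section; results may differ from the published figures by 0.1 through rounding.

```python
def pooled_accuracy(correct_per_run, n_questions):
    """Percent correct pooled over all runs, rounded to one decimal.

    Assumption (not stated in the source): every run asks the same
    n_questions, and accuracy is total correct / total asked.
    """
    if n_questions <= 0 or not correct_per_run:
        raise ValueError("need at least one run and a positive question count")
    total_correct = sum(correct_per_run)
    total_asked = n_questions * len(correct_per_run)
    return round(100 * total_correct / total_asked, 1)


# Correct-answer counts per run for "All test questions" (52 questions).
runs = {
    "ChatGPT-4.0": [43, 41, 42],
    "Google Gemini": [37, 38, 39],
}

for model, counts in runs.items():
    print(f"{model}: {pooled_accuracy(counts, 52)}%")
```

The same function can be applied to any subfield block by passing that subfield's question count, for example 18 for Diagnosis.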

Table 3 Comparison of readability of answers from ChatGPT-3.5 with the 8th grade reading level, mean ± SD

| Subfield | GFI | P value (GFI) | FKGL | P value (FKGL) |
| --- | --- | --- | --- | --- |
| Risk factors | 16.73 ± 1.77 | 0.013 | 12.60 ± 1.72 | 0.043 |
| Clinical manifestation | 13.68 ± 2.12 | 0.043 | 10.75 ± 1.72 | 0.109 |
| Diagnosis | 15.46 ± 1.65 | 0.016 | 12.12 ± 1.61 | 0.048 |
| Treatment | 21.22 ± 1.99 | < 0.001 | 17.22 ± 1.47 | < 0.001 |
| Prevention | 18.89 ± 1.80 | < 0.001 | 15.53 ± 1.72 | < 0.001 |
| Prognosis | 18.52 ± 1.85 | 0.010 | 15.51 ± 2.17 | 0.027 |
| Overall | 18.93 ± 3.03 | < 0.001 | 15.31 ± 2.67 | < 0.001 |

Table 4 Comparison of readability of answers from ChatGPT-4.0 with the 8th grade reading level, mean ± SD

| Subfield | GFI | P value (GFI) | FKGL | P value (FKGL) |
| --- | --- | --- | --- | --- |
| Risk factors | 14.79 ± 0.24 | < 0.001 | 11.45 ± 0.35 | 0.003 |
| Clinical manifestation | 11.05 ± 0.89 | 0.027 | 9.06 ± 0.73 | 0.130 |
| Diagnosis | 14.40 ± 0.42 | 0.001 | 11.28 ± 0.47 | < 0.001 |
| Treatment | 18.18 ± 1.45 | < 0.001 | 14.57 ± 1.27 | < 0.001 |
| Prevention | 16.49 ± 1.27 | < 0.001 | 13.49 ± 1.09 | < 0.001 |
| Prognosis | 16.10 ± 0.52 | 0.001 | 13.18 ± 0.05 | < 0.001 |
| Overall | 16.39 ± 2.38 | < 0.001 | 13.19 ± 1.96 | < 0.001 |

Table 5 Comparison of readability of answers from Google Gemini with the 8th grade reading level, mean ± SD

| Subfield | GFI | P value (GFI) | FKGL | P value (FKGL) |
| --- | --- | --- | --- | --- |
| Risk factors | 14.54 ± 0.46 | 0.002 | 10.73 ± 0.16 | 0.001 |
| Clinical manifestation | 13.06 ± 0.42 | 0.002 | 9.81 ± 0.68 | 0.043 |
| Diagnosis | 17.71 ± 0.30 | < 0.001 | 13.54 ± 0.24 | < 0.001 |
| Treatment | 19.93 ± 1.44 | < 0.001 | 15.65 ± 1.06 | < 0.001 |
| Prevention | 15.63 ± 1.96 | < 0.001 | 11.71 ± 1.82 | < 0.001 |
| Prognosis | 14.81 ± 0.62 | 0.003 | 12.37 ± 0.27 | 0.001 |
| Overall | 17.22 ± 2.86 | < 0.001 | 13.32 ± 2.44 | < 0.001 |
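
For reference, GFI (Gunning Fog Index) and FKGL (Flesch-Kincaid Grade Level) both estimate the US school grade needed to read a text, which is why the readability tables compare them with grade 8. The Python sketch below applies the standard published formulas to pre-counted text statistics. The function names and the example counts are hypothetical, and the tool the authors used to obtain their scores is not stated in this section.

```python
def fkgl(words, sentences, syllables):
    """Flesch-Kincaid Grade Level from raw counts (standard formula)."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def gunning_fog(words, sentences, complex_words):
    """Gunning Fog Index from raw counts (standard formula).

    complex_words counts words of three or more syllables.
    """
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 0.4 * ((words / sentences) + 100 * (complex_words / words))


# Hypothetical counts for one answer, for illustration only.
words, sentences, syllables, complex_words = 100, 5, 170, 20

print(f"FKGL: {fkgl(words, sentences, syllables):.2f}")
print(f"GFI:  {gunning_fog(words, sentences, complex_words):.2f}")
```

Both scores rise with longer sentences; FKGL also rises with syllables per word, and GFI with the share of polysyllabic words, so answers dense in medical terminology tend to score well above grade 8.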