Observational Study Open Access
Copyright ©The Author(s) 2025. Published by Baishideng Publishing Group Inc. All rights reserved.
World J Gastrointest Oncol. Oct 15, 2025; 17(10): 109792
Published online Oct 15, 2025. doi: 10.4251/wjgo.v17.i10.109792
Evaluating chat generative pretrained transformer in answering questions on endoscopic mucosal resection and endoscopic submucosal dissection
Shi-Song Wang, Hui Gao, Tian-Chen Qian, Ying Du, Lei Xu, Department of Gastroenterology, The First Affiliated Hospital of Ningbo University, Ningbo 315010, Zhejiang Province, China
Shi-Song Wang, Peng-Yao Lin, Ying Du, Health Science Center, Ningbo University, Ningbo 315010, Zhejiang Province, China
Tian-Chen Qian, Department of Gastroenterology, The First Affiliated Hospital, College of Medicine, Zhejiang University, Hangzhou 310003, Zhejiang Province, China
ORCID number: Hui Gao (0000-0001-5899-2784); Lei Xu (0000-0001-6017-3745).
Author contributions: Xu L conceived the study design; Wang SS and Gao H performed the statistical analysis; Wang SS and Du Y wrote the manuscript; Qian TC and Lin PY reviewed the manuscript; All authors approved the submitted draft.
Supported by Ningbo Top Medical and Health Research Program, No. 2023020612; the Ningbo Leading Medical & Healthy Discipline Project, No. 2022-S04; the Medical Health Science and Technology Project of Zhejiang Provincial Health Commission, No. 2022KY315; and Ningbo Science and Technology Public Welfare Project, No. 2023S133.
Institutional review board statement: Since the study did not involve human or animal data and all ChatGPT answers were public, there was no need for Ethics Committee approval.
Informed consent statement: As this study does not involve human or animal data and all ChatGPT responses are publicly accessible, informed consent was not required.
Conflict-of-interest statement: All the authors report no relevant conflicts of interest for this article.
STROBE statement: The authors have read the STROBE Statement—checklist of items, and the manuscript was prepared and revised according to the STROBE Statement—checklist of items.
Data sharing statement: Technical appendix, statistical code, and dataset available from the corresponding author.
Open Access: This article is an open-access article that was selected by an in-house editor and fully peer-reviewed by external reviewers. It is distributed in accordance with the Creative Commons Attribution NonCommercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: https://creativecommons.org/Licenses/by-nc/4.0/
Corresponding author: Lei Xu, MD, PhD, Department of Gastroenterology, The First Affiliated Hospital of Ningbo University, No. 59 Liuting Street, Ningbo 315010, Zhejiang Province, China. xulei22@163.com
Received: May 22, 2025
Revised: June 17, 2025
Accepted: August 27, 2025
Published online: October 15, 2025
Processing time: 145 Days and 19.4 Hours

Abstract
BACKGROUND

With the rising use of endoscopic submucosal dissection (ESD) and endoscopic mucosal resection (EMR), patients increasingly have questions about various aspects of these endoscopic procedures. At the same time, conversational artificial intelligence (AI) tools such as chat generative pretrained transformer (ChatGPT) are rapidly emerging as sources of medical information.

AIM

To evaluate ChatGPT’s reliability and usefulness regarding ESD and EMR for patients and healthcare professionals.

METHODS

In this study, 30 specific questions related to ESD and EMR were identified. These questions were repeatedly entered into ChatGPT, with two independent answers generated for each question. A Likert scale was used to rate the accuracy, completeness, and comprehensibility of the responses. In addition, a binary rating (high/low) was used to evaluate each aspect of the two responses generated by ChatGPT and of the response retrieved from Google.

RESULTS

By analyzing the average scores of the three raters, our findings indicated that the responses generated by ChatGPT received high ratings for accuracy (mean score of 5.14 out of 6), completeness (mean score of 2.34 out of 3), and comprehensibility (mean score of 2.96 out of 3). Kendall’s coefficients of concordance indicated good agreement among raters (all P < 0.05). For the responses generated by Google, more than half were classified by experts as having low accuracy and low completeness.

CONCLUSION

ChatGPT provided accurate and reliable answers to questions about ESD and EMR. Future studies should address ChatGPT’s current limitations by incorporating more detailed and up-to-date medical information. This could establish AI chatbots as a significant resource for both patients and healthcare professionals.

Key Words: Endoscopic submucosal dissection; Endoscopic mucosal resection; Artificial intelligence; Chat generative pretrained transformer; Patient education; Google

Core Tip: This study evaluated the reliability and usefulness of chat generative pretrained transformer in addressing questions related to endoscopic submucosal dissection and endoscopic mucosal resection. A set of thirty targeted questions was repeatedly entered, and responses were independently rated for accuracy, completeness, and comprehensibility. Compared with Google, chat generative pretrained transformer produced more accurate, detailed, and easier-to-understand answers, with consistent agreement among evaluators. The findings indicate that chat generative pretrained transformer may serve as a valuable and accessible source of medical information for both patients and healthcare professionals.



INTRODUCTION

Colorectal cancer is the third most prevalent cancer worldwide, with most cases developing from colorectal polyps[1]. Endoscopic mucosal resection (EMR) is an established treatment for these polyps[2]. The effectiveness of endoscopic screening and therapy depends not only on the accurate detection of adenomas but also on their complete removal[3]. However, a key limitation of EMR is that it is suitable only for smaller lesions, up to approximately 20 mm in diameter, which limits the possibility of complete resection of larger lesions. In contrast, endoscopic submucosal dissection (ESD) allows for the removal of a wider range of lesions[4], but the technique is technically demanding, time-consuming, and costly[5]. Despite these challenges, ESD has gradually gained popularity over EMR for the endoscopic treatment of early gastric cancer[6]. During preprocedural counseling for ESD or EMR, patients frequently ask numerous questions about the procedure itself, as well as about postoperative care and lifestyle considerations.

One promising tool for addressing patient questions is the artificial intelligence (AI)-driven chatbot. In recent years, such chatbots have proven effective in providing personalized support and patient education, indicating their potential as a supplementary resource in healthcare[7]. In particular, advances in natural language processing have enabled large language models, such as chat generative pretrained transformer (ChatGPT)[8], to perform well across various fields, including medicine[9]. These models draw on extensive knowledge bases and a deep understanding of complex language patterns to deliver customized and informative responses[10]. Given the rapid rise in ChatGPT's popularity, it is likely that more patients will turn to this tool for information about ESD and EMR. Therefore, it is essential to evaluate whether these AI tools provide information that is accurate, complete, and easily understood.

This study evaluated ChatGPT 4.0’s responses to questions related to ESD and EMR across three domains: Accuracy, completeness, and comprehensibility. Specifically, we assessed the tool’s ability to answer common patient questions regarding the preoperative, intraoperative, and postoperative aspects of these procedures, and explored its potential role in patient education.

MATERIALS AND METHODS

We conducted our queries using ChatGPT 4.0, an updated model reportedly offering more advanced reasoning capabilities and a broader knowledge base than ChatGPT 3.5[9,11,12]. In this study, we compiled, with expert input, common patient questions regarding ESD and EMR encountered in clinical practice. An expert independent of the study excluded questions with overlapping meanings or duplicates and revised the wording and grammar of certain items to ensure clarity and precision. Questions were posed in English, and to eliminate potential bias from previous conversations and ensure the relevance of responses, the “New Chat” reset function was used before every query. To evaluate the temporal accuracy and reproducibility of ChatGPT, each question was re-entered one week after the initial query, and both responses were documented for comparison. The first set of answers to the 30 questions was designated the “discovery phase”, and the second set was referred to as the “replication phase” for analysis.
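Although all queries in this study were entered manually through the ChatGPT web interface, the same stateless, repeated-query protocol could in principle be scripted. The sketch below is a minimal illustration using the OpenAI Python client, not the study’s actual workflow; the model name, file names, and rate-limiting delay are assumptions for illustration only.

```python
# Illustrative sketch only: the study used the ChatGPT 4.0 web interface with the
# "New Chat" reset before each question; model name and file paths are assumptions.
import csv
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_once(question: str) -> str:
    """Send a single question in a fresh conversation (no prior context carried over)."""
    response = client.chat.completions.create(
        model="gpt-4",  # assumed stand-in for "ChatGPT 4.0"
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

def run_phase(questions: list[str], outfile: str) -> None:
    """Query each question independently and save the answers for later expert rating."""
    with open(outfile, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["question", "answer"])
        for q in questions:
            writer.writerow([q, ask_once(q)])
            time.sleep(1)  # simple rate limiting between independent queries

questions = ["What are the contraindications for ESD/EMR?"]  # the full 30-item list is in Table 1
run_phase(questions, "discovery_phase.csv")
# One week later, the identical list would be resubmitted:
# run_phase(questions, "replication_phase.csv")
```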

Subsequently, each ChatGPT response was evaluated by three gastroenterologists, three non-expert reviewers, and three patients. Each expert had over 20 years of professional experience in gastrointestinal endoscopy and had published extensively in the field. The non-expert reviewers possessed a fundamental understanding of endoscopic procedures, and the three patients, aged between 40 and 50 years, were scheduled to undergo endoscopic procedures involving ESD or EMR. Table 1 presents the research questions. The experts independently assessed each answer using a Likert scale (Supplementary Table 1) across three dimensions: Accuracy was rated from 1 to 6 (with 6 being the most accurate), and completeness and comprehensibility from 1 to 3 (with 3 indicating the most complete or easiest to understand)[13]. Additionally, to further evaluate the linguistic quality of ChatGPT responses compared with traditional web searches and to minimize the potential impact of subtle rating differences, the experts also reassessed and compared the quality of ChatGPT and Google responses using binary categories (low/high). Non-expert reviewers and patients evaluated the comprehensibility of each ChatGPT response using binary categorical ratings only. All reviewers were blinded to each other's scores to avoid potential bias.

Table 1 Questions posed to chat generative pretrained transformer.
1. What is the anatomy of the gastrointestinal tract?
2. What are the surgical indications/indications for ESD/EMR?
3. What is the specific surgical process for ESD/EMR?
4. What are the contraindications for ESD/EMR?
5. What should patients do before ESD/EMR?
6. What preoperative measures for ESD/EMR can help reduce surgical risks?
7. What are the possible problems and solutions that may be encountered during the ESD/EMR process?
8. Will sedation or anesthesia be used during ESD/EMR, and will the procedure cause pain or discomfort? How long does the procedure usually take?
9. What cooperation and precautions are required from the patient during an ESD/EMR procedure?
10. What are the factors influencing the safety and success rate of ESD/EMR procedures?
11. How is the resected tissue handled after an ESD/EMR procedure?
12. What is the expected timeframe and method for accessing pathology results after an ESD/EMR procedure?
13. Which terms in the pathology report after an ESD/EMR procedure should be given special attention?
14. What are the relevant definition standards and classifications for postoperative complications of ESD/EMR?
15. What are the common postoperative complications and related treatments of ESD/EMR?
16. What are the influencing factors of postoperative complications in ESD/EMR?
17. What are the observation indicators for the therapeutic effect of ESD/EMR?
18. What postoperative symptoms are considered normal after an ESD/EMR procedure?
19. Do patients need family accompaniment after an ESD/EMR procedure, and for how long is it recommended?
20. What are the postoperative care precautions for ESD/EMR?
21. What are the application and precautions of drugs and food after ESD/EMR surgery?
22. What are the daily life precautions for postoperative ESD/EMR?
23. How soon after an ESD/EMR procedure can a patient return to work, engage in physical activity, or take a shower?
24. What are the follow-up appointments and response methods for postoperative adverse events in ESD/EMR?
25. What is the likelihood of recurrence after an ESD/EMR procedure, and what are the related influencing factors?
26. What postoperative signs may indicate incomplete lesion removal or potential recurrence after an ESD/EMR procedure?
27. If follow-up after ESD/EMR suggests recurrence, what should be done next?
28. In endoscopic therapy, how are ESD and EMR selected?
29. What is the typical cost of an ESD/EMR procedure, and how is it covered by medical insurance?
30. How can the psychological state of patients be managed after an ESD/EMR procedure to reduce anxiety and concerns?

The study adhered to the ethical standards outlined in the Helsinki Declaration. Since the study did not involve human or animal data and all ChatGPT answers were public, there was no need for ethics committee approval.

Statistical analysis

In this study, we used the mean, standard deviation, and median for descriptive statistical analysis. Expert ratings were also visualized with radar charts: when the curves representing the ratings of different reviewers lay closer together within the circles, a greater degree of agreement was indicated, and the farther a curve extended toward the outer edge of the graph, the higher the score given by that reviewer. To evaluate the reproducibility of ChatGPT’s answers, we dichotomized the accuracy ratings into two categories: Scores 1-3 vs scores 4-6. If the two responses to the same question fell into different categories, they were classified as significantly different, indicating low reproducibility for that item. To evaluate the reliability and consistency of the rating process, Kendall’s coefficient of concordance was used[14]. This nonparametric statistic quantifies the level of agreement among evaluators, where a coefficient of one indicates perfect agreement and a coefficient of zero reflects no agreement beyond chance. The coefficient of variation was calculated to assess the variability among the three experts’ ratings for each response. Each rater’s set of scores was treated as an independent sample for this analysis. Data analysis was performed using IBM SPSS Statistics (version 29), with statistical significance set at P < 0.05.
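The descriptive statistics, Kendall’s coefficient of concordance, coefficient of variation, and dichotomized reproducibility check described above can be reproduced with standard scientific Python tools. The sketch below is an illustration under stated assumptions (made-up example scores, population SD as reported in Table 2, and Kendall’s W computed without a tie correction); the study itself performed these analyses in IBM SPSS Statistics.

```python
# Illustrative sketch of the statistical workflow; scores below are invented examples.
import numpy as np
from scipy import stats

# ratings[i, j] = accuracy score (1-6 Likert) given by expert i to response j
ratings = np.array([
    [5, 4, 5, 5, 6],   # expert 1 (illustrative values only)
    [6, 4, 5, 6, 5],   # expert 2
    [5, 4, 4, 5, 6],   # expert 3
])
m, n = ratings.shape

# Descriptive statistics per response (population SD, matching the "mean ± SD" style of Table 2)
means = ratings.mean(axis=0)
sds = ratings.std(axis=0)
cv = sds / means                               # coefficient of variation per response

# Kendall's coefficient of concordance W: rank responses within each expert, then
# measure how strongly the experts' rankings agree (1 = perfect, 0 = none; no tie correction here).
ranks = np.apply_along_axis(stats.rankdata, 1, ratings)
col_sums = ranks.sum(axis=0)
S = ((col_sums - col_sums.mean()) ** 2).sum()
W = 12 * S / (m ** 2 * (n ** 3 - n))
chi2 = m * (n - 1) * W                          # chi-square approximation, df = n - 1
p_value = stats.chi2.sf(chi2, df=n - 1)

# Reproducibility check: dichotomize accuracy into low (1-3) vs high (4-6) and flag
# questions whose discovery and replication answers fall into different bands.
discovery = np.array([5, 4, 5, 5, 6])
replication = np.array([5, 4, 3, 6, 6])
band = lambda x: np.where(x >= 4, "high", "low")
low_reproducibility = band(discovery) != band(replication)

print(f"Kendall's W = {W:.3f}, P = {p_value:.3f}")
print("Questions with low reproducibility:", np.flatnonzero(low_reproducibility) + 1)
```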

RESULTS

Initially, 71 questions were included. After excluding 41 similar or duplicate items, a final set of 30 relevant questions was retained (Figure 1). Each of the 30 distinct questions was submitted to both ChatGPT and Google. For each question, ChatGPT generated two responses (Supplementary Table 2), while Google provided a single response (Supplementary Table 3). Three experts evaluated each response in terms of accuracy, completeness, and comprehensibility (Table 2). Overall, the two responses for each question were largely consistent, indicating good reproducibility of ChatGPT’s answers.

Figure 1
Figure 1 Flow chart of question selection for endoscopic submucosal dissection and endoscopic mucosal resection. ESD: Endoscopic submucosal dissection; EMR: Endoscopic mucosal resection.
Table 2 Likert scale scores for questions generated by chat generative pretrained transformer.
Question | Parameter | Rater 1 | Rater 2 | Rater 3 | Mean ± SD
Q1 | Accuracy | 5 | 6 | 5 | 5.33 ± 0.47
Q1 | Completeness | 2 | 2 | 2 | 2 ± 0
Q1 | Comprehensibility | 3 | 3 | 3 | 3 ± 0
RQ1 | Accuracy | 5 | 6 | 5 | 5.33 ± 0.47
RQ1 | Completeness | 2 | 2 | 2 | 2 ± 0
RQ1 | Comprehensibility | 3 | 3 | 3 | 3 ± 0
Q2 | Accuracy | 4 | 4 | 4 | 4 ± 0
Q2 | Completeness | 2 | 2 | 2 | 2 ± 0
Q2 | Comprehensibility | 3 | 3 | 3 | 3 ± 0
RQ2 | Accuracy | 4 | 5 | 4 | 4.33 ± 0.47
RQ2 | Completeness | 2 | 3 | 2 | 2.33 ± 0.47
RQ2 | Comprehensibility | 3 | 3 | 3 | 3 ± 0
Q3 | Accuracy | 5 | 5 | 4 | 4.67 ± 0.47
Q3 | Completeness | 3 | 3 | 3 | 3 ± 0
Q3 | Comprehensibility | 3 | 3 | 3 | 3 ± 0
RQ3 | Accuracy | 5 | 5 | 5 | 5 ± 0
RQ3 | Completeness | 3 | 3 | 2 | 2.67 ± 0.47
RQ3 | Comprehensibility | 3 | 3 | 3 | 3 ± 0
Q4 | Accuracy | 5 | 6 | 5 | 5.33 ± 0.47
Q4 | Completeness | 2 | 3 | 2 | 2.33 ± 0.47
Q4 | Comprehensibility | 3 | 3 | 3 | 3 ± 0
RQ4 | Accuracy | 5 | 6 | 5 | 5.33 ± 0.47
RQ4 | Completeness | 2 | 2 | 2 | 2 ± 0
RQ4 | Comprehensibility | 3 | 3 | 3 | 3 ± 0
Q5 | Accuracy | 5 | 6 | 6 | 5.67 ± 0.47
Q5 | Completeness | 2 | 2 | 2 | 2 ± 0
Q5 | Comprehensibility | 3 | 3 | 3 | 3 ± 0
RQ5 | Accuracy | 5 | 6 | 6 | 5.67 ± 0.47
RQ5 | Completeness | 2 | 2 | 2 | 2 ± 0
RQ5 | Comprehensibility | 3 | 3 | 3 | 3 ± 0
Q6 | Accuracy | 5 | 5 | 5 | 5 ± 0
Q6 | Completeness | 3 | 3 | 3 | 3 ± 0
Q6 | Comprehensibility | 3 | 3 | 3 | 3 ± 0
RQ6 | Accuracy | 5 | 5 | 5 | 5 ± 0
RQ6 | Completeness | 2 | 2 | 3 | 2.33 ± 0.47
RQ6 | Comprehensibility | 3 | 3 | 3 | 3 ± 0
Q7 | Accuracy | 4 | 5 | 4 | 4.33 ± 0.47
Q7 | Completeness | 2 | 2 | 2 | 2 ± 0
Q7 | Comprehensibility | 2 | 3 | 2 | 2.33 ± 0.47
RQ7 | Accuracy | 4 | 5 | 4 | 4.33 ± 0.47
RQ7 | Completeness | 2 | 2 | 2 | 2 ± 0
RQ7 | Comprehensibility | 2 | 3 | 2 | 2.33 ± 0.47
Q8 | Accuracy | 5 | 5 | 5 | 5 ± 0
Q8 | Completeness | 2 | 2 | 2 | 2 ± 0
Q8 | Comprehensibility | 3 | 3 | 3 | 3 ± 0
RQ8 | Accuracy | 5 | 5 | 5 | 5 ± 0
RQ8 | Completeness | 2 | 2 | 2 | 2 ± 0
RQ8 | Comprehensibility | 3 | 3 | 3 | 3 ± 0
Q9 | Accuracy | 5 | 5 | 5 | 5 ± 0
Q9 | Completeness | 2 | 2 | 3 | 2.33 ± 0.47
Q9 | Comprehensibility | 3 | 3 | 3 | 3 ± 0
RQ9 | Accuracy | 5 | 5 | 5 | 5 ± 0
RQ9 | Completeness | 2 | 2 | 3 | 2.33 ± 0.47
RQ9 | Comprehensibility | 3 | 3 | 3 | 3 ± 0
Q10 | Accuracy | 5 | 5 | 5 | 5 ± 0
Q10 | Completeness | 2 | 2 | 2 | 2 ± 0
Q10 | Comprehensibility | 3 | 3 | 3 | 3 ± 0
RQ10 | Accuracy | 5 | 5 | 5 | 5 ± 0
RQ10 | Completeness | 2 | 2 | 2 | 2 ± 0
RQ10 | Comprehensibility | 3 | 3 | 3 | 3 ± 0
Q11 | Accuracy | 5 | 6 | 5 | 5.33 ± 0.47
Q11 | Completeness | 2 | 2 | 3 | 2.33 ± 0.47
Q11 | Comprehensibility | 3 | 3 | 3 | 3 ± 0
RQ11 | Accuracy | 6 | 5 | 6 | 5.67 ± 0.47
RQ11 | Completeness | 2 | 3 | 2 | 2.33 ± 0.47
RQ11 | Comprehensibility | 3 | 3 | 3 | 3 ± 0
Q12 | Accuracy | 6 | 5 | 6 | 5.67 ± 0.47
Q12 | Completeness | 3 | 2 | 3 | 2.67 ± 0.47
Q12 | Comprehensibility | 3 | 3 | 3 | 3 ± 0
RQ12 | Accuracy | 6 | 6 | 5 | 5.67 ± 0.47
RQ12 | Completeness | 3 | 3 | 2 | 2.67 ± 0.47
RQ12 | Comprehensibility | 3 | 3 | 3 | 3 ± 0
Q13 | Accuracy | 5 | 5 | 5 | 5 ± 0
Q13 | Completeness | 2 | 2 | 2 | 2 ± 0
Q13 | Comprehensibility | 3 | 3 | 3 | 3 ± 0
RQ13 | Accuracy | 5 | 5 | 5 | 5 ± 0
RQ13 | Completeness | 2 | 2 | 2 | 2 ± 0
RQ13 | Comprehensibility | 3 | 3 | 3 | 3 ± 0
Q14 | Accuracy | 5 | 6 | 5 | 5.33 ± 0.47
Q14 | Completeness | 3 | 3 | 2 | 2.67 ± 0.47
Q14 | Comprehensibility | 3 | 3 | 2 | 2.67 ± 0.47
RQ14 | Accuracy | 5 | 5 | 5 | 5 ± 0
RQ14 | Completeness | 3 | 2 | 3 | 2.67 ± 0.47
RQ14 | Comprehensibility | 3 | 3 | 3 | 3 ± 0
Q15 | Accuracy | 5 | 5 | 5 | 5 ± 0
Q15 | Completeness | 2 | 3 | 3 | 2.67 ± 0.47
Q15 | Comprehensibility | 3 | 3 | 3 | 3 ± 0
RQ15 | Accuracy | 5 | 6 | 5 | 5.33 ± 0.47
RQ15 | Completeness | 2 | 3 | 3 | 2.67 ± 0.47
RQ15 | Comprehensibility | 3 | 3 | 3 | 3 ± 0
Q16 | Accuracy | 5 | 5 | 5 | 5 ± 0
Q16 | Completeness | 2 | 2 | 3 | 2.33 ± 0.47
Q16 | Comprehensibility | 3 | 3 | 3 | 3 ± 0
RQ16 | Accuracy | 5 | 6 | 6 | 5.67 ± 0.47
RQ16 | Completeness | 3 | 3 | 3 | 3 ± 0
RQ16 | Comprehensibility | 3 | 3 | 3 | 3 ± 0
Q17 | Accuracy | 5 | 5 | 5 | 5 ± 0
Q17 | Completeness | 2 | 2 | 3 | 2.33 ± 0.47
Q17 | Comprehensibility | 3 | 3 | 3 | 3 ± 0
RQ17 | Accuracy | 5 | 5 | 5 | 5 ± 0
RQ17 | Completeness | 2 | 3 | 3 | 2.67 ± 0.47
RQ17 | Comprehensibility | 3 | 3 | 3 | 3 ± 0
Q18 | Accuracy | 6 | 5 | 6 | 5.67 ± 0.47
Q18 | Completeness | 2 | 2 | 2 | 2 ± 0
Q18 | Comprehensibility | 3 | 3 | 3 | 3 ± 0
RQ18 | Accuracy | 5 | 6 | 5 | 5.33 ± 0.47
RQ18 | Completeness | 2 | 2 | 2 | 2 ± 0
RQ18 | Comprehensibility | 3 | 3 | 3 | 3 ± 0
Q19 | Accuracy | 6 | 6 | 5 | 5.67 ± 0.47
Q19 | Completeness | 3 | 3 | 3 | 3 ± 0
Q19 | Comprehensibility | 3 | 3 | 3 | 3 ± 0
RQ19 | Accuracy | 6 | 5 | 6 | 5.67 ± 0.47
RQ19 | Completeness | 3 | 3 | 3 | 3 ± 0
RQ19 | Comprehensibility | 3 | 3 | 3 | 3 ± 0
Q20 | Accuracy | 5 | 5 | 5 | 5 ± 0
Q20 | Completeness | 3 | 2 | 3 | 2.67 ± 0.47
Q20 | Comprehensibility | 3 | 3 | 3 | 3 ± 0
RQ20 | Accuracy | 5 | 6 | 5 | 5.33 ± 0.47
RQ20 | Completeness | 3 | 3 | 3 | 3 ± 0
RQ20 | Comprehensibility | 3 | 3 | 3 | 3 ± 0
Q21 | Accuracy | 5 | 5 | 5 | 5 ± 0
Q21 | Completeness | 2 | 3 | 2 | 2.33 ± 0.47
Q21 | Comprehensibility | 3 | 3 | 3 | 3 ± 0
RQ21 | Accuracy | 5 | 6 | 5 | 5.33 ± 0.47
RQ21 | Completeness | 2 | 3 | 2 | 2.33 ± 0.47
RQ21 | Comprehensibility | 3 | 3 | 3 | 3 ± 0
Q22 | Accuracy | 5 | 5 | 5 | 5 ± 0
Q22 | Completeness | 2 | 2 | 2 | 2 ± 0
Q22 | Comprehensibility | 3 | 3 | 3 | 3 ± 0
RQ22 | Accuracy | 5 | 6 | 5 | 5.33 ± 0.47
RQ22 | Completeness | 2 | 2 | 2 | 2 ± 0
RQ22 | Comprehensibility | 3 | 3 | 3 | 3 ± 0
Q23 | Accuracy | 5 | 5 | 5 | 5 ± 0
Q23 | Completeness | 3 | 3 | 3 | 3 ± 0
Q23 | Comprehensibility | 3 | 3 | 3 | 3 ± 0
RQ23 | Accuracy | 5 | 5 | 5 | 5 ± 0
RQ23 | Completeness | 3 | 3 | 3 | 3 ± 0
RQ23 | Comprehensibility | 3 | 3 | 3 | 3 ± 0
Q24 | Accuracy | 5 | 5 | 5 | 5 ± 0
Q24 | Completeness | 2 | 2 | 2 | 2 ± 0
Q24 | Comprehensibility | 2 | 3 | 2 | 2.33 ± 0.47
RQ24 | Accuracy | 5 | 6 | 5 | 5.33 ± 0.47
RQ24 | Completeness | 2 | 3 | 2 | 2.33 ± 0.47
RQ24 | Comprehensibility | 2 | 3 | 2 | 2.33 ± 0.47
Q25 | Accuracy | 5 | 5 | 5 | 5 ± 0
Q25 | Completeness | 2 | 2 | 2 | 2 ± 0
Q25 | Comprehensibility | 3 | 3 | 3 | 3 ± 0
RQ25 | Accuracy | 5 | 5 | 5 | 5 ± 0
RQ25 | Completeness | 2 | 2 | 2 | 2 ± 0
RQ25 | Comprehensibility | 3 | 3 | 3 | 3 ± 0
Q26 | Accuracy | 5 | 5 | 6 | 5.33 ± 0.47
Q26 | Completeness | 2 | 2 | 2 | 2 ± 0
Q26 | Comprehensibility | 3 | 3 | 3 | 3 ± 0
RQ26 | Accuracy | 5 | 6 | 5 | 5.33 ± 0.47
RQ26 | Completeness | 2 | 2 | 2 | 2 ± 0
RQ26 | Comprehensibility | 3 | 3 | 3 | 3 ± 0
Q27 | Accuracy | 5 | 5 | 5 | 5 ± 0
Q27 | Completeness | 2 | 2 | 2 | 2 ± 0
Q27 | Comprehensibility | 3 | 3 | 3 | 3 ± 0
RQ27 | Accuracy | 5 | 5 | 5 | 5 ± 0
RQ27 | Completeness | 2 | 2 | 2 | 2 ± 0
RQ27 | Comprehensibility | 3 | 3 | 3 | 3 ± 0
Q28 | Accuracy | 4 | 5 | 6 | 5 ± 1
Q28 | Completeness | 2 | 2 | 3 | 2.33 ± 0.47
Q28 | Comprehensibility | 3 | 3 | 3 | 3 ± 0
RQ28 | Accuracy | 5 | 6 | 5 | 5.33 ± 0.47
RQ28 | Completeness | 2 | 2 | 3 | 2.33 ± 0.47
RQ28 | Comprehensibility | 3 | 3 | 3 | 3 ± 0
Q29 | Accuracy | 5 | 4 | 4 | 4.33 ± 0.47
Q29 | Completeness | 2 | 2 | 2 | 2 ± 0
Q29 | Comprehensibility | 3 | 3 | 3 | 3 ± 0
RQ29 | Accuracy | 5 | 4 | 4 | 4.33 ± 0.47
RQ29 | Completeness | 2 | 2 | 2 | 2 ± 0
RQ29 | Comprehensibility | 3 | 3 | 3 | 3 ± 0
Q30 | Accuracy | 5 | 6 | 5 | 5.33 ± 0.47
Q30 | Completeness | 2 | 3 | 3 | 2.67 ± 0.47
Q30 | Comprehensibility | 3 | 3 | 3 | 3 ± 0
RQ30 | Accuracy | 6 | 6 | 5 | 5.67 ± 0.47
RQ30 | Completeness | 3 | 3 | 3 | 3 ± 0
RQ30 | Comprehensibility | 3 | 3 | 3 | 3 ± 0

In terms of accuracy, ChatGPT’s responses were evaluated by the three experts using a 6-point Likert scale, yielding a mean score of 5.14 ± 0.54 and a median score of 5.00 (Figure 2). Multiple experts assigned a score of 4 to both responses for questions 2, 7, and 29. The lowest average score was observed for question 2 in the discovery phase, at 4.00 ± 0, with no variance among raters. Additionally, each of questions 3 and 28 in the discovery phase received a score of 4 from one of the experts. Most of the ratings provided by the experts were 5 (almost all correct) or 6 (correct) (Table 2). Questions 5, 12, and 19 achieved the highest mean accuracy score, averaging 5.67 ± 0.47 (Figure 3). Kendall’s coefficient of concordance was 0.538, which was statistically significant (P = 0.002), indicating moderate agreement among the experts (Supplementary Table 4). The coefficients of variation for the expert-assigned Likert scores on ChatGPT-generated responses are reported in Supplementary Table 5. When the scores were converted into binary categories, multiple experts classified both responses to questions 2, 7, and 29 as low-level. In the discovery phase, one expert also categorized the responses to questions 3 and 28 as low-level, while all other responses were classified as high-level. In the binary classification of Google responses, a majority were rated as low-level by one or more experts (Supplementary Table 6).

Figure 2
Figure 2 Mean, median, and SD of each score among the expert raters.
Figure 3
Figure 3 Distribution of expert ratings by topic for each question during the discovery and replication phases, presented as radar charts. A-C: Discovery phase; D-F: Replication phase. The closer the curves representing the ratings of different reviewers lay within the circles, the greater the degree of agreement; the farther a curve extended toward the outer edge of the graph, the higher the score given by that reviewer. Q: Question; RQ: Replication question.

In terms of completeness, expert ratings of ChatGPT’s responses on a 3-point Likert scale yielded a mean score of 2.34 ± 0.47 and a median of 2.00. All questions received scores of 2 or 3. Further analysis revealed that only for question 19 did all experts assign a score of 3 to both responses (Table 2). Moreover, Kendall's coefficient of concordance was 0.602, which was statistically significant (P < 0.001), indicating that there was a strong consistency among the experts. In the binary classification of ChatGPT scores, only both responses to question 7 and the discovery phase response to question 2 were classified as low-level by one expert; all other responses were categorized as high-level. In contrast, for Google, more than half of the responses were classified as low-level by at least one expert (Supplementary Table 6).

In terms of comprehensibility, ChatGPT’s responses received relatively high scores, with a mean of 2.96 ± 0.19 and a median of 3.00 on a 3-point Likert scale. Two of the three experts gave a score of 2 for questions 7 and 24. Additionally, for question 14 (discovery phase), one expert gave a 2 while the others gave 3, whereas all experts rated the remaining questions as 3 (easy to understand) (Figure 3). Moreover, Kendall's coefficient of concordance was 0.617, which was statistically significant (P < 0.001), indicating relatively strong consistency among the experts. In the binary classification, all responses from both ChatGPT and Google were categorized as high-level by all three experts. Three non-expert reviewers and three patients also evaluated the comprehensibility of ChatGPT’s responses using a binary classification method. All responses were classified as high-level by the three non-expert reviewers. From the patient perspective, only one patient classified both responses to questions 2, 13, 26, and 29 as low-level (Supplementary Table 7).

DISCUSSION

In this study, we evaluated the effectiveness of ChatGPT in responding to 30 questions related to ESD and EMR. Our findings indicate that OpenAI's ChatGPT chatbot could answer ESD- and EMR-related questions with high accuracy (mean score 5.14/6), substantial completeness (2.34/3), and strong comprehensibility (2.96/3). Moreover, it outperformed a traditional search engine (Google) and presented information in a format that was more comprehensible to patients.

Before undergoing any diagnostic or therapeutic endoscopic procedure, patients should have a clear understanding of the expected benefits, potential risks, and available alternatives. Presenting this information in clear and accessible language is essential to support informed decision-making regarding perioperative management. The European Society of Gastrointestinal Endoscopy also emphasizes that patient preferences should be central to the informed consent process[15]. Evidence suggests a strong association between patients’ awareness of their condition and adherence to prescribed treatment plans[16,17]. Furthermore, structured surveillance following ESD and EMR has been indicated to facilitate early detection of recurrence and improve long-term survival outcomes[18].

Despite the recognized importance of health education, patients often encounter barriers to accessing accurate, individualized information pertinent to their clinical circumstances. One study reported that approximately 70%-80% of internet users seek health information online[19]. This enables patients to quickly access up-to-date information on disease prevention, evaluation, and treatment. As a convenient and cost-effective tool, the Internet plays a key role in enhancing patients' health literacy. However, because of the complexity and variability of online content, it is not always a reliable source of information[20-22]. ChatGPT offers a potential solution to this problem. It produces human-like responses optimized through reinforcement learning from human feedback[23]. In our study, ChatGPT demonstrated superior performance compared with traditional internet search engines such as Google. In particular, the ChatGPT 4.0 model features enhanced reasoning capabilities and a broader knowledge base, allowing it to solve complex problems more accurately[24]. However, the reading level of its responses exceeds the fifth-to-sixth-grade level recommended by the American Medical Association, instead approaching the average college reading level[25]. The model's training process involves human feedback to guide the generation of clear, relevant, and user-aligned responses[26]. Moreover, previous studies have demonstrated that ChatGPT's responses to cardiology-related questions were appropriate in most cases[27], and the model also performed well in addressing questions related to cirrhosis and hepatocellular carcinoma[28].

Additionally, ChatGPT can generate a structured framework in response to questions posed by patients and healthcare providers, facilitating better understanding and problem-solving. While many of its responses are comprehensive and accurate, they are occasionally insufficient. However, given the expected ongoing improvements in the model, physicians can enhance patient communication by simply refining ChatGPT’s initial responses[28]. Importantly, it demonstrated high reproducibility for certain questions, suggesting that its ability to generate clinically appropriate responses is not heavily dependent on the initial prompt. This approach not only increases physician efficiency but also reduces the overall cost and burden on the healthcare system. Moreover, ChatGPT empowers patients to better understand their care, promoting patient-centered approaches and supporting effective, shared decision-making by providing an additional source of reliable information.

Notably, the answers to question 5 (preoperative preparation), question 12 (postoperative pathology results), and question 19 (postoperative family support) received the highest accuracy scores among all items, suggesting that ChatGPT's responses on these topics may be directly useful for patient instruction. However, there may be shortcomings in the answers to more complex issues, such as indications for endoscopic surgery, intraoperative precautions, and detailed insurance-related policies. Several factors may account for this. ChatGPT is primarily trained on large volumes of historical text data (books, articles, and websites)[29], and its knowledge may not reflect the most up-to-date medical guidelines. Consequently, some answers lagged behind current clinical standards, which likely contributed to the lower scores observed for specific questions. In addition, substantial disagreement remained among experts regarding certain responses, which may be attributable to linguistic and cultural differences across regions or countries. ChatGPT currently lacks the ability to tailor its responses to the user's geographic context, a challenge that has also been noted in prior studies[28,30]. Future research may address this limitation by involving experts from multiple regions and conducting cross-cultural evaluations, thereby exploring the potential for developing regionally adaptive capabilities in ChatGPT.

ChatGPT’s responses can vary depending on its training data, contextual differences, and linguistic nuances. The same question posed at different times or in different contexts may yield different responses, potentially affecting the accuracy and completeness of the information provided. In our study, each question was submitted to ChatGPT in English within separate chat sessions. To assess the model’s stability, the same questions were resubmitted under identical conditions after a defined time interval. This approach aimed to minimize variability; however, the potential inconsistency of ChatGPT-generated responses should still be acknowledged. Therefore, extra caution is warranted when using ChatGPT as a stand-alone tool for patient counseling. While it can offer helpful information and guidance, it is not a substitute for the clinical expertise of a well-trained physician[31,32].

In addition, the application of AI to medical decision-making requires careful consideration of a variety of safety, legal, and ethical issues. ChatGPT, in the absence of effective fact-checking mechanisms, is prone to generating inaccurate or misleading information. In medical contexts, its “hallucinations” can exacerbate public health misinformation, contributing to an AI-driven infodemic[33]. Moreover, the lack of data minimization and protection measures raises concerns over sensitive information leakage[34]. Misuse of ChatGPT also poses legal risks. When users rely on AI-generated medical or legal advice, accountability becomes unclear[33]. In unregulated settings, generating legal documents may constitute unauthorized practice of law, violating professional ethics. Additionally, unsupervised use may reinforce biases, leading to discriminatory or harmful outputs[35]. To mitigate these risks, integrating real-time knowledge verification, promoting human-AI collaboration, and establishing clear ethical and legal guidelines for AI deployment are essential.

Overall, our findings suggest that relying solely on AI is insufficient. Medical decisions and patient counseling should always involve qualified healthcare professionals who can offer personalized advice based on an individual patient’s condition and needs. This ensures that patients receive accurate and comprehensive information, thereby improving their prognosis.

This study has several key strengths. First, to ensure the overall quality of ChatGPT’s responses, three independent gastroenterology experts reviewed and evaluated the answers. Moreover, to the best of our knowledge, this is the first study to assess the accuracy, reliability, and comprehensibility of ChatGPT in addressing questions related to ESD and EMR.

We must also consider the limitations of this study. It focused on evaluating the performance of ChatGPT 4.0; therefore, the results may not be applicable to other AI models, particularly those used in medical training. Additionally, our evaluation was based on a small group of experts. Their subjective judgments, although informed, may be biased and may not accurately reflect the broader range of opinions in the medical community or among patients. Furthermore, the design of the study and the choice of questions may be influenced by human moral concepts, social influences, and personal beliefs, which are often difficult to quantify and depend on subjective assessment. Our research questions were grounded in clinical practice, reviewed for comprehensibility by patients, and demonstrated strong inter-expert agreement with low variability, helping to minimize potential bias. Meanwhile, we focused only on ChatGPT’s performance in answering questions related to ESD/EMR, without addressing its potential in other gastrointestinal or surgical domains. This limitation is also consistent with those reported in previous studies[26,36,37]. Future research should aim to improve its multi-source integration and knowledge transfer for broader clinical application.

Furthermore, patient counseling for ESD and EMR requires not only the transmission of information but also a wide range of skills. These include understanding the patient's health and lifestyle, identifying personal needs, building and maintaining good relationships and trust, motivating patients, and supporting their rehabilitation. Such duties require empathy, emotional intelligence, and interpersonal skills, none of which are currently available in ChatGPT or other AI technologies. Moreover, AI models often experience performance drift or degradation after deployment owing to updates in training data, architectural optimizations, or changes in the usage environment; establishing a long-term evaluation and adaptive updating framework is therefore essential[38-40]. Future research should consider dynamically comparing the performance of different model versions (e.g., ChatGPT 4.0 vs 4.5) to capture improvements introduced by updates. Additionally, constructing a behavioral timeline of model outputs, combined with online performance tracking and periodic assessments, can enable continuous monitoring. Furthermore, the use of multicenter datasets for cross-validation may help assess model generalizability and support external validation.

CONCLUSION

Although ChatGPT demonstrates potential in providing information related to ESD and EMR management, it should be used with caution and not relied upon as a universal tool for patient counseling. Future studies should aim to address current limitations and improve the reliability and practical application of AI models in delivering accurate and comprehensive medical information.

ACKNOWLEDGEMENTS

The authors would like to thank all participants and their families.

Footnotes

Provenance and peer review: Unsolicited article; Externally peer reviewed.

Peer-review model: Single blind

Specialty type: Gastroenterology and hepatology

Country of origin: China

Peer-review report’s classification

Scientific Quality: Grade B, Grade B, Grade C

Novelty: Grade B, Grade C, Grade C

Creativity or Innovation: Grade B, Grade C, Grade C

Scientific Significance: Grade B, Grade B, Grade C

P-Reviewer: Mukundan A, PhD, Assistant Professor, Taiwan; Ren S, MD, PhD, Assistant Professor, Chief Physician, China. S-Editor: Li L; L-Editor: A; P-Editor: Zhao S

References
1. Bray F, Ferlay J, Soerjomataram I, Siegel RL, Torre LA, Jemal A. Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin. 2018;68:394-424.
2. Tanaka S, Kashida H, Saito Y, Yahagi N, Yamano H, Saito S, Hisabe T, Yao T, Watanabe M, Yoshida M, Kudo SE, Tsuruta O, Sugihara KI, Watanabe T, Saitoh Y, Igarashi M, Toyonaga T, Ajioka Y, Ichinose M, Matsui T, Sugita A, Sugano K, Fujimoto K, Tajiri H. JGES guidelines for colorectal endoscopic submucosal dissection/endoscopic mucosal resection. Dig Endosc. 2015;27:417-434.
3. Martínez ME, Baron JA, Lieberman DA, Schatzkin A, Lanza E, Winawer SJ, Zauber AG, Jiang R, Ahnen DJ, Bond JH, Church TR, Robertson DJ, Smith-Warner SA, Jacobs ET, Alberts DS, Greenberg ER. A pooled analysis of advanced colorectal neoplasia diagnoses after colonoscopic polypectomy. Gastroenterology. 2009;136:832-841.
4. Saito Y, Uraoka T, Matsuda T, Emura F, Ikehara H, Mashimo Y, Kikuchi T, Fu KI, Sano Y, Saito D. Endoscopic treatment of large superficial colorectal tumors: a case series of 200 endoscopic submucosal dissections (with video). Gastrointest Endosc. 2007;66:966-973.
5. Wang J, Zhang XH, Ge J, Yang CM, Liu JY, Zhao SL. Endoscopic submucosal dissection vs endoscopic mucosal resection for colorectal tumors: a meta-analysis. World J Gastroenterol. 2014;20:8282-8287.
6. Nakamoto S, Sakai Y, Kasanuki J, Kondo F, Ooka Y, Kato K, Arai M, Suzuki T, Matsumura T, Bekku D, Ito K, Tanaka T, Yokosuka O. Indications for the use of endoscopic mucosal resection for early gastric cancer in Japan: a comparative study with endoscopic submucosal dissection. Endoscopy. 2009;41:746-750.
7. Gordijn B, Have HT. ChatGPT: evolution or revolution? Med Health Care Philos. 2023;26:1-2.
8. Milne-Ives M, de Cock C, Lim E, Shehadeh MH, de Pennington N, Mole G, Normando E, Meinert E. The Effectiveness of Artificial Intelligence Conversational Agents in Health Care: Systematic Review. J Med Internet Res. 2020;22:e20346.
9. Lim ZW, Pushpanathan K, Yew SME, Lai Y, Sun CH, Lam JSH, Chen DZ, Goh JHL, Tan MCJ, Sheng B, Cheng CY, Koh VTC, Tham YC. Benchmarking large language models' performances for myopia care: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, and Google Bard. EBioMedicine. 2023;95:104770.
10. Kung TH, Cheatham M, Medenilla A, Sillos C, De Leon L, Elepaño C, Madriaga M, Aggabao R, Diaz-Candido G, Maningo J, Tseng V. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2:e0000198.
11. Wang G, Gao K, Liu Q, Wu Y, Zhang K, Zhou W, Guo C. Potential and Limitations of ChatGPT 3.5 and 4.0 as a Source of COVID-19 Information: Comprehensive Comparative Analysis of Generative and Authoritative Information. J Med Internet Res. 2023;25:e49771.
12. Deng L, Wang T, Yangzhang, Zhai Z, Tao W, Li J, Zhao Y, Luo S, Xu J. Evaluation of large language models in breast cancer clinical scenarios: a comparative analysis based on ChatGPT-3.5, ChatGPT-4.0, and Claude2. Int J Surg. 2024;110:1941-1950.
13. Pugliese N, Wai-Sun Wong V, Schattenberg JM, Romero-Gomez M, Sebastiani G; NAFLD Expert Chatbot Working Group, Aghemo A. Accuracy, Reliability, and Comprehensibility of ChatGPT-Generated Medical Responses for Patients With Nonalcoholic Fatty Liver Disease. Clin Gastroenterol Hepatol. 2024;22:886-889.e5.
14. Mukaka MM. Statistics corner: A guide to appropriate use of correlation coefficient in medical research. Malawi Med J. 2012;24:69-71.
15. Everett SM, Triantafyllou K, Hassan C, Mergener K, Tham TC, Almeida N, Antonelli G, Axon A, Bisschops R, Bretthauer M, Costil V, Foroutan F, Gauci J, Hritz I, Messmann H, Pellisé M, Roelandt P, Seicean A, Tziatzios G, Voiosu A, Gralnek IM. Informed consent for endoscopic procedures: European Society of Gastrointestinal Endoscopy (ESGE) Position Statement. Endoscopy. 2023;55:952-966.
16. Heydari A, Ziaee ES, Gazrani A. Relationship between Awareness of Disease and Adherence to Therapeutic Regimen among Cardiac Patients. Int J Community Based Nurs Midwifery. 2015;3:23-30.
17. Farvardin S, Patel J, Khambaty M, Yerokun OA, Mok H, Tiro JA, Yopp AC, Parikh ND, Marrero JA, Singal AG. Patient-reported barriers are associated with lower hepatocellular carcinoma surveillance rates in patients with cirrhosis. Hepatology. 2017;65:875-884.
18. Pimentel-Nunes P, Dinis-Ribeiro M, Ponchon T, Repici A, Vieth M, De Ceglie A, Amato A, Berr F, Bhandari P, Bialek A, Conio M, Haringsma J, Langner C, Meisner S, Messmann H, Morino M, Neuhaus H, Piessevaux H, Rugge M, Saunders BP, Robaszkiewicz M, Seewald S, Kashin S, Dumonceau JM, Hassan C, Deprez PH. Endoscopic submucosal dissection: European Society of Gastrointestinal Endoscopy (ESGE) Guideline. Endoscopy. 2015;47:829-854.
19. Prestin A, Vieux SN, Chou WY. Is Online Health Activity Alive and Well or Flatlining? Findings From 10 Years of the Health Information National Trends Survey. J Health Commun. 2015;20:790-798.
20. He Z, Wang Z, Song Y, Liu Y, Kang L, Fang X, Wang T, Fan X, Li Z, Wang S, Bai Y. The Reliability and Quality of Short Videos as a Source of Dietary Guidance for Inflammatory Bowel Disease: Cross-sectional Study. J Med Internet Res. 2023;25:e41518.
21. Lai Y, He Z, Liu Y, Yin X, Fan X, Rao Z, Fu H, Gu L, Xia T. The quality and reliability of TikTok videos on non-alcoholic fatty liver disease: a propensity score matching analysis. Front Public Health. 2023;11:1231240.
22. Kong W, Song S, Zhao YC, Zhu Q, Sha L. TikTok as a Health Information Source: Assessment of the Quality of Information in Diabetes-Related Videos. J Med Internet Res. 2021;23:e30409.
23. Egli A. ChatGPT, GPT-4, and Other Large Language Models: The Next Revolution for Clinical Microbiology? Clin Infect Dis. 2023;77:1322-1328.
24. Massey PA, Montgomery C, Zhang AS. Comparison of ChatGPT-3.5, ChatGPT-4, and Orthopaedic Resident Performance on Orthopaedic Assessment Examinations. J Am Acad Orthop Surg. 2023;31:1173-1179.
25. King RC, Samaan JS, Yeo YH, Peng Y, Kunkel DC, Habib AA, Ghashghaei R. A Multidisciplinary Assessment of ChatGPT's Knowledge of Amyloidosis: Observational Study. JMIR Cardio. 2024;8:e53421.
26. Lahat A, Shachar E, Avidan B, Shatz Z, Glicksberg BS, Klang E. Evaluating the use of large language model in identifying top research questions in gastroenterology. Sci Rep. 2023;13:4164.
27. Sarraju A, Bruemmer D, Van Iterson E, Cho L, Rodriguez F, Laffin L. Appropriateness of Cardiovascular Disease Prevention Recommendations Obtained From a Popular Online Chat-Based Artificial Intelligence Model. JAMA. 2023;329:842-844.
28. Yeo YH, Samaan JS, Ng WH, Ting PS, Trivedi H, Vipani A, Ayoub W, Yang JD, Liran O, Spiegel B, Kuo A. Assessing the performance of ChatGPT in answering questions regarding cirrhosis and hepatocellular carcinoma. Clin Mol Hepatol. 2023;29:721-732.
29. Lee PY, Salim H, Abdullah A, Teo CH. Use of ChatGPT in medical research and scientific writing. Malays Fam Physician. 2023;18:58.
30. Samaan JS, Yeo YH, Rajeev N, Hawley L, Abel S, Ng WH, Srinivasan N, Park J, Burch M, Watson R, Liran O, Samakar K. Assessing the Accuracy of Responses by the Language Model ChatGPT to Questions Regarding Bariatric Surgery. Obes Surg. 2023;33:1790-1796.
31. Vaishya R, Misra A, Vaish A. ChatGPT: Is this version good for healthcare and research? Diabetes Metab Syndr. 2023;17:102744.
32. Ismail A, Ghorashi NS, Javan R. New Horizons: The Potential Role of OpenAI's ChatGPT in Clinical Radiology. J Am Coll Radiol. 2023;20:696-698.
33. De Angelis L, Baglivo F, Arzilli G, Privitera GP, Ferragina P, Tozzi AE, Rizzo C. ChatGPT and the rise of large language models: the new AI-driven infodemic threat in public health. Front Public Health. 2023;11:1166120.
34. Chen Y, Esmaeilzadeh P. Generative AI in Medical Practice: In-Depth Exploration of Privacy and Security Challenges. J Med Internet Res. 2024;26:e53008.
35. Meskó B, Topol EJ. The imperative for regulatory oversight of large language models (or generative AI) in healthcare. NPJ Digit Med. 2023;6:120.
36. Gan W, Ouyang J, Li H, Xue Z, Zhang Y, Dong Q, Huang J, Zheng X, Zhang Y. Integrating ChatGPT in Orthopedic Education for Medical Undergraduates: Randomized Controlled Trial. J Med Internet Res. 2024;26:e57037.
37. Lai Y, Liao F, Zhao J, Zhu C, Hu Y, Li Z. Exploring the capacities of ChatGPT: A comprehensive evaluation of its accuracy and repeatability in addressing Helicobacter pylori-related queries. Helicobacter. 2024;29:e13078.
38. Andersen ES, Birk-Korch JB, Hansen RS, Fly LH, Röttger R, Arcani DMC, Brasen CL, Brandslund I, Madsen JS. Monitoring performance of clinical artificial intelligence in health care: a scoping review. JBI Evid Synth. 2024;22:2423-2446.
39. Rosenthal JT, Beecy A, Sabuncu MR. Rethinking clinical trials for medical AI with dynamic deployments of adaptive systems. NPJ Digit Med. 2025;8:252.
40. Gallifant J, Bitterman DS, Celi LA, Gichoya JW, Matos J, McCoy LG, Pierce RL. Ethical debates amidst flawed healthcare artificial intelligence metrics. NPJ Digit Med. 2024;7:243.