Accuracy and reproducibility of ChatGPT responses to parent and patient inquiries on attention-deficit/hyperactivity disorder

doi:10.5498/wjp.v16.i6.119773

Advanced Search

BPG is committed to discovery and dissemination of knowledge

Home / Archive / Volume 16, Issue 6

This Article

(15)

(11)

(0)

(21)

(219)

Table of Contents

Peer-Review Report of This Article

CrossCheck and Google Search of This Article

Academic Rules and Norms of This Article

Citation of this article

Corresponding Author of This Article

Research Domain of This Article

Article-Type of This Article

Open-Access Policy of This Article

Times Cited Counts in Google of This Article

Journal Information of This Article

Publication Name

World Journal of Psychiatry

ISSN

2220-3206

Publisher of This Article

Baishideng Publishing Group Inc, 7041 Koll Center Parkway, Suite 160, Pleasanton, CA 94566, USA

Case Control Study Open Access

Copyright: ©Author(s) 2026. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution-NonCommercial (CC BY-NC 4.0) license. No commercial re-use. See permissions. Published by Baishideng Publishing Group Inc.

World J Psychiatry. Jun 19, 2026; 16(6): 119773
Published online Jun 19, 2026. doi: 10.5498/wjp.v16.i6.119773

Accuracy and reproducibility of ChatGPT responses to parent and patient inquiries on attention-deficit/hyperactivity disorder

Berrin Bilgiç, Serkan Turan

Berrin Bilgiç, Department of Child and Adolescent Psychiatry, Adnan Menderes University Faculty of Medicine, Aydın 09100, Türkiye

Serkan Turan, Department of Child and Adolescent Psychiatry, Uludag University Faculty of Medicine, Bursa 16059, Türkiye

ORCID number: Berrin Bilgiç (0000-0002-0180-5797); Serkan Turan (0000-0002-6548-0629).

Author contributions: Bilgiç B conceptualized and designed the study, drafted the manuscript; Turan S supervised the research, performed the statistical analysis, critically revised the manuscript for important intellectual content; Bilgiç B and Turan S collected the data; all authors reviewed and approved the final manuscript.

AI contribution statement: AI-based tools (e.g., ChatGPT and language editing tools) were used in a limited manner to support language refinement and clarity.

Institutional review board statement: This study did not require institutional review board approval, as no patient data, clinical records, or personally identifiable information were used. The study was based exclusively on publicly available questions and artificial intelligence-generated responses. The authors declare no affiliation with OpenAI, the developer of ChatGPT.

Informed consent statement: Informed consent was not required because this study did not involve human participants or patient data.

Conflict-of-interest statement: Conflict of interests: The authors declare that there are no conflicts of interest regarding the publication of this article.

STROBE statement: The authors have read the STROBE Statement-checklist of items, and the manuscript was prepared and revised according to the STROBE Statement-checklist of items.

Data sharing statement: The datasets generated and/or analyzed during the current study are available from the corresponding author upon reasonable request.

Corresponding author: Serkan Turan, MD, PhD, Associate Professor, Department of Child and Adolescent Psychiatry, Uludag University Faculty of Medicine, Gorukle Campus, Bursa 16059, Türkiye. serkanturan@uludag.edu.tr

Received: February 10, 2026
Revised: February 23, 2026
Accepted: March 12, 2026
Published online: June 19, 2026
Processing time: 108 Days and 1 Hours

Abstract

BACKGROUND

Children and adolescents with attention-deficit/hyperactivity disorder (ADHD) and their caregivers increasingly turn to artificial intelligence-based chatbots for information about symptoms, functional difficulties, and treatment-related concerns. Large language models (LLMs), such as ChatGPT, are of particular interest due to their ability to generate fluent, natural-language responses. However, empirical evidence regarding their clinical performance in child and adolescent psychiatry remains limited, especially with respect to the reproducibility and clinical reliability of ADHD-related responses.

AIM

To systematically evaluate the accuracy and reproducibility of ChatGPT (GPT-4o)-generated responses to commonly asked ADHD-related questions from parents and patients.

METHODS

In this cross-sectional study, 88 frequently asked ADHD-related questions were identified through internet search engines, parent-oriented forums, and professional organization websites. Questions were categorized into four domains: Basic knowledge (n = 30), diagnosis and assessment (n = 22), treatment and medication use (n = 21), and long-term outcomes and psychosocial impact (n = 15). Each question was submitted twice to the subscription-based version of ChatGPT (GPT-4o) in separate chat sessions. Two blinded child and adolescent psychiatrists independently evaluated responses for accuracy (comprehensive/correct, incomplete, mixed or potentially misleading, or inaccurate) and reproducibility. Inter-rater agreement and domain-specific differences were analyzed statistically.

RESULTS

Overall, 59.1% (52/88) of responses were rated as comprehensive/correct, 27.3% (24/88) as incomplete, and 13.6% (12/88) as mixed or potentially misleading; no inaccurate or irrelevant responses were identified. Accuracy was highest for basic knowledge questions (66.7%) and lowest for treatment and medication-related questions (47.6%). Overall reproducibility was 87.5% (77/88), with no significant differences across domains (χ², P = 0.61). Inter-rater reliability was moderate (Cohen’s κ = 0.52).

CONCLUSION

ChatGPT (GPT-4o) demonstrated relatively higher accuracy and reproducibility overall, with stronger performance in basic informational and diagnostic domains, but greater variability observed in clinically sensitive areas such as treatment, medication use, and long-term outcomes. These findings highlight both the potential utility and important limitations of LLM-based chatbots in ADHD-related information-seeking, underscoring the need for cautious interpretation-particularly in treatment-related contexts where responses may require professional clinical guidance.

Key Words: Attention-deficit/hyperactivity disorder; Large language models; ChatGPT; Accuracy; Reproducibility

Core Tip: The clinical reliability of large language models (LLMs) in addressing attention-deficit/hyperactivity disorder (ADHD)-related questions from patients and caregivers has not been sufficiently characterized. This study systematically evaluates the accuracy and reproducibility of ChatGPT (GPT-4o) across clinically relevant domains in child and adolescent psychiatry. The findings indicate stronger and more consistent performance in basic informational and diagnostic domains, whereas greater variability was observed in clinically sensitive areas such as treatment, medication use, and long-term outcomes. These results highlight both the potential utility and the limitations of LLM-based tools in ADHD-related information-seeking, emphasizing the need for cautious, developmentally informed interpretation in higher-risk clinical contexts.

Citation: Bilgiç B, Turan S. Accuracy and reproducibility of ChatGPT responses to parent and patient inquiries on attention-deficit/hyperactivity disorder. World J Psychiatry 2026; 16(6): 119773
URL: https://www.wjgnet.com/2220-3206/full/v16/i6/119773.htm
DOI: https://dx.doi.org/10.5498/wjp.v16.i6.119773

INTRODUCTION

Attention-deficit/hyperactivity disorder (ADHD) is one of the most common chronic and heterogeneous neurodevelopmental disorders in child and adolescent mental health, typically emerging in early childhood and frequently persisting into adulthood. Epidemiological studies estimate that ADHD affects approximately 2% to 7% of children and adolescents across populations[1,2]. Beyond its core symptoms of inattention, hyperactivity, and impulsivity, ADHD is associated with clinically significant impairments in academic achievement, peer relationships, family functioning, and daily life organization, and it can adversely influence developmental trajectories and long-term outcomes[3]. Given its chronic course and heterogeneous presentation, ADHD often requires long-term, multidimensional monitoring and support rather than symptom-focused management alone. The clinical management of ADHD involves a broad range of interventions, including psychoeducation, behavioral strategies, school-based supports, and pharmacological treatments. Although pharmacological treatments are supported by strong evidence for efficacy[4], long-term medication use, concerns about adverse effects, and variability in treatment response may complicate follow-up and raise parental uncertainty. In this context, families often seek additional guidance regarding diagnosis, treatment options, prognosis, and practical strategies for daily functioning, increasingly turning to digital platforms for accessible information.

The integration of the internet and social media into everyday life has contributed to the widespread use of online environments for health-related information seeking. In recent years, large language models (LLMs), trained on large-scale datasets using deep learning techniques, have demonstrated substantial capabilities in understanding and generating natural language and have attracted attention in mental health as rapidly accessible tools for providing user-oriented information[5-7]. Both individuals and healthcare professionals increasingly rely on these systems, highlighting the need to systematically evaluate their clinical relevance, reliability, and potential impact in psychiatry. However, emerging literature has highlighted important limitations of LLMs in psychiatric contexts, including variability in clinical accuracy, limited contextual sensitivity, ethical concerns related to vulnerable populations, and the risk of over-reliance despite incomplete clinical framing. These challenges underscore the need for cautious evaluation of LLM outputs within mental health settings[8].

Recent years have witnessed not only a quantitative increase in studies examining artificial intelligence (AI) applications in health communication and patient education, but also a conceptual shift in how these technologies are evaluated. Bibliometric knowledge-mapping analyses indicate that early research primarily focused on assessing technical feasibility and accuracy, whereas subsequent phases increasingly emphasized systematic evaluation of content quality, reliability, readability, and user-centered communication features[9]. More recent literature has positioned AI systems within a broader patient education ecosystem, highlighting patient-centered outcomes such as health literacy, patient engagement, treatment adherence, and shared decision-making[10]. This evolving research landscape suggests that AI applications should be evaluated not only from a technological validation perspective but also in terms of clinically meaningful dimensions, including safety, ethical challenges, integration into healthcare workflows, and real-world impact. Within this framework, systematic evaluation of fundamental performance metrics-particularly accuracy and consistency-represents an essential foundation for more advanced clinical evaluations and the safe integration of AI systems into clinical practice.

Within mental health services, LLMs are being explored for clinical and educational applications, including supporting medical consultations and health education. Evidence suggests that LLM-based systems can generate health-related explanations and provide interactive responses that may be useful for patient education and caregiver support[11]. Natural language processing-based chatbots can produce human-like dialogue and have therefore raised interest in their potential use for psychoeducation and supportive interventions. Some studies have suggested that these systems may contribute to symptom-related support by facilitating evidence-based elements such as psychoeducation, emotional support, and cognitive-behavioral therapy techniques[12-14]. Among general-purpose LLMs, ChatGPT has received increasing attention in mental health care due to its ability to generate coherent, human-like responses[15]. However, unlike regulated medical devices, LLMs currently lack an established clinical approval framework, and their outputs may influence clinical decision-making despite concerns regarding accuracy, completeness, and patient safety[16]. These issues are particularly relevant in child and adolescent psychiatry, where developmental considerations, ethical responsibilities, and family-centered decision-making increase the potential impact of misleading or poorly contextualized information[17].

In recent years, the growing use of general-purpose LLMs such as ChatGPT for mental health support has led to an increasing number of studies systematically assessing the accuracy, consistency, and clinical appropriateness of responses to user-generated mental health questions[18]. In the existing literature, the questions posed to LLMs-based systems are often defined by the authors or derived from clinical guidelines, and the generated responses are evaluated within these pre-defined frameworks[19]. A study conducted in the context of autism spectrum disorder also reported that LLMs can provide understandable responses to frequently asked parent questions and may have the potential to meet informational needs[20]. However, evidence remains limited regarding the accuracy-and particularly the reproducibility-of LLM responses in common neurodevelopmental disorders such as ADHD.

Importantly, ADHD represents a particularly challenging clinical context for evaluating LLM performance, given its heterogeneous presentation, developmentally dynamic symptom profile, and the high clinical sensitivity of treatment-related decision-making in pediatric populations. In this context, systematic evaluation of response accuracy and reproducibility becomes particularly relevant for supporting the safe and responsible use of AI in patient education. Despite the high prevalence of ADHD and the frequency with which parents and patients seek online information about symptoms, diagnosis, and treatment options, systematic evaluations of LLM-generated responses to real-world ADHD-related inquiries remain scarce.

Nevertheless, several recent studies have suggested that ChatGPT, a general-purpose LLMs, may have potential not only as an informational tool but also as a supportive system in the design of structured neuropsychological rehabilitation and executive function intervention programs for ADHD. In this context, the structural coherence of intervention content generated by ChatGPT in response to specific prompts, as well as its adaptability to developmental levels and different ADHD profiles, has been evaluated[21]. However, the safe and responsible implementation of such advanced clinical applications first requires establishing whether LLMs can generate accurate, consistent, and reproducible responses to basic ADHD-related questions that are frequently asked by parents and patients in real-world settings. Accordingly, the present study aims to systematically evaluate ChatGPT’s responses to a standardized set of ADHD-related questions derived from common parent and patient inquiries, with a particular focus on accuracy and reproducibility.

MATERIALS AND METHODS

Ethical considerations

This study did not require institutional review board approval, as no patient data, clinical records, or personally identifiable information were used. The study was based exclusively on publicly available questions and AI-generated responses. The authors declare no affiliation with OpenAI, the developer of ChatGPT.

Data acquisition and selection of ADHD-related questions

A structured, multi-step question selection process was employed to enhance transparency and minimize selection bias, following principles analogous to a PRISMA-inspired framework. Initially, a broad pool of ADHD-related questions was identified through systematic Google searches using predefined keywords (e.g., “questions parents ask about ADHD”, “ADHD medication concerns”, “ADHD diagnosis in children”), as well as through parent-oriented online forums, educational platforms, and the websites of professional pediatric and psychiatric organizations. This initial search yielded approximately 160 candidate questions.

In the first screening step, duplicate and near-duplicate questions were removed. In the second step, two authors independently reviewed all remaining questions for eligibility based on predefined inclusion and exclusion criteria. Inclusion criteria comprised questions that: (1) Reflected commonly asked parent or patient concerns; (2) Addressed general aspects of ADHD (symptoms, diagnosis, treatment, or long-term outcomes); and (3) Were phrased in a manner suitable for a general informational context. Exclusion criteria included questions that were highly individualized or case-specific (e.g., referring to a particular child’s age, comorbidities, or treatment history), non-medical or administrative in nature, legally focused, or primarily opinion-based.

Following independent screening, disagreements regarding eligibility were resolved through discussion and consensus. After this process, 88 questions met the inclusion criteria and were retained for final analysis. The complete workflow from initial question identification to final inclusion, ChatGPT querying, expert evaluation, and statistical analysis-is summarized in Figure 1, providing a transparent overview of the study process. The study evaluated ChatGPT using the GPT-4o model (OpenAI). The question set was intentionally derived from publicly available parent- and patient-oriented sources to reflect real-world information-seeking behaviors outside clinical settings. While this approach may not fully capture highly specialized or clinician-driven scenarios, it enhances ecological validity by representing the types of questions most likely to be posed to LLMs by lay users. All queries were conducted using the subscription-based version of ChatGPT. Data collection was performed on (exact date, 15-17 December 2025) between (time: 09:00 and 13:00 UTC). All questions and repeated queries were submitted within this fixed time window to reduce potential variability related to model updates or system-level changes.

Open in New Tab Full Size Figure Download Figure

Figure 1 Study flowchart. Flowchart illustrating attention-deficit/hyperactivity disorder-related question selection, ChatGPT querying procedures, expert evaluation, and statistical analysis workflow. ADHD: Attention-deficit/hyperactivity disorder.

Categorization of questions

The selected questions were systematically grouped into four predefined categories based on their thematic content: (1) Basic knowledge (e.g., etiology, core symptoms, common myths); (2) Diagnosis and assessment (e.g., age of diagnosis, diagnostic criteria, rating scales); (3) Treatment and medication (e.g., stimulant safety, treatment duration, side effects); and (4) Long-term outcomes and psychosocial impact (e.g., academic functioning, emotional regulation, persistence into adulthood).

The final distribution of questions was as follows: Basic knowledge (n = 30), diagnosis and assessment (n = 22), treatment and medication (n = 21), and long-term outcomes and psychosocial impact (n = 15).

LLMs selection and querying procedure

The study utilized the most recent subscription-based version of ChatGPT (GPT-4o). All questions were entered in English. To minimize the influence of contextual memory or sequential learning, each question was submitted in a new chat session, with the browser refreshed prior to each query. All questions were submitted to ChatGPT as standalone prompts without providing any explicit instruction regarding the identity or role of either the user or the AI system (e.g., no prompts such as “answer as a doctor” or “answer for a parent” were used). Questions were entered verbatim as identified during the selection process. As a result, some questions naturally reflected a parental perspective (e.g., phrasing such as “my child”), whereas others were framed in a more general manner. No attempts were made to rephrase or standardize questions to a single user identity in order to preserve the real-world variability of commonly asked parent and patient inquiries.

To evaluate reproducibility, each question was submitted twice, and both responses were recorded verbatim in a separate document for subsequent evaluation. All queries were conducted using the same device, web browser, and internet connection, and through the same subscription-based ChatGPT account. The system was accessed in a logged-in state for all interactions. To minimize temporal variability, all questions and repeated queries for reproducibility assessment were submitted on the same calendar day. Prior to each query, the browser was refreshed and a new chat session was initiated to reduce the influence of contextual memory or sequential learning effects.

Assessment of accuracy and reproducibility

Two board-certified child and adolescent psychiatrists independently assessed all responses generated by ChatGPT. Accuracy was graded using a four-level classification adapted from previous AI evaluation studies[22]. During evaluation process, raters were instructed to assess responses with reference to established evidence-based clinical principles, including consistency with widely accepted guidelines for ADHD assessment and management (e.g., American Academy of Child and Adolescent Psychiatry, National Institute for Health and Clinical Excellence), rather than relying solely on subjective clinical impressions. Responses were therefore judged not only for factual correctness but also for alignment with guideline-concordant clinical reasoning and safety framing. The grading framework and operational definitions used for accuracy classification are presented in Table 1. To clarify the thematic framework used in the analysis, the predefined question categories are outlined below.

Table 1 Grading criteria for accuracy of ChatGPT responses.

Grading category	Definition
Comprehensive/correct	The response is accurate, clinically appropriate, and sufficiently comprehensive such that no additional explanation would be required in routine clinical practice
Incomplete/partially correct	The response is generally accurate but lacks relevant clinical details, developmental nuance, or contextual clarification
Mixed/potentially misleading	The response contains both accurate and inaccurate elements or statements that could lead to misunderstanding in a clinical or developmental context.
Inaccurate/irrelevant	The response is incorrect or unrelated to the question asked

Responses were independently rated by two blinded child and adolescent psychiatrists using the above accuracy framework. Ratings were assigned based on clinical appropriateness, completeness, and developmental contextualization. Disagreements were resolved by consensus.

Open in New Tab Full Size Table

Comprehensive/correct: Responses that were accurate, clinically appropriate, and sufficiently comprehensive such that no additional clarification would be required in routine clinical practice.

Incomplete/partially correct: Responses that were generally accurate but lacked relevant clinical details or contextual nuance.

Mixed or potentially misleading: Responses containing both accurate and inaccurate elements, or statements that could lead to misunderstanding in a clinical or developmental context.

Inaccurate/irrelevant: Responses that were incorrect or unrelated to the question.

The four-level accuracy classification was used as a categorical and descriptive framework rather than a numerical scoring system. No numerical weights or point values were assigned to the accuracy categories. Responses were classified based on factual accuracy, completeness, and potential to mislead in a clinical context, allowing appropriate use of non-parametric statistical analyses.

Reproducibility was evaluated by comparing the two responses generated for each question. Responses were considered reproducible if the core clinical content, recommendations, and implications were consistent across both answers, regardless of minor differences in wording or structure. Core clinical content consistency was defined as the preservation of the primary diagnostic, therapeutic, or safety-related messages across repeated responses. Differences in wording or level of detail were considered acceptable as long as the central clinical meaning remained unchanged. Responses were classified as non-reproducible when differences involved changes in clinical recommendations, omission or addition of critical safety information, or alterations in therapeutic guidance that could influence clinical interpretation. For example, responses were considered reproducible when both recommended professional evaluation prior to medication decisions despite differences in phrasing, whereas more directive treatment advice without appropriate clinical caution was classified as non-reproducible. The evaluations were performed by two board-certified child and adolescent psychiatrists, both of whom are among the study authors. Each evaluator has more than 10 years of clinical experience in the diagnosis and management of ADHD and is actively involved in academic teaching and research in child and adolescent mental health. Their professional background includes routine clinical care, psychoeducational guidance for families, and research on neurodevelopmental disorders. All evaluations were conducted independently and in a blinded manner with respect to each other’s ratings.

Resolution of discrepancies

When discrepancies arose between the two evaluators regarding accuracy or reproducibility, the responses were reviewed by a senior child and adolescent psychiatrist with extensive clinical experience. This reviewer was blinded to the initial ratings and provided the final decision. Prior to evaluation, both raters reviewed the predefined scoring criteria and agreed on the operational definitions of each rating category to ensure consistency in assessment. Inter-rater agreement and domain-specific differences were analyzed statistically.

Statistical analysis

Statistical analyses were performed using IBM SPSS Statistics version 28.0 (IBM Corp., Armonk, NY, United States). Descriptive statistics were reported as n (%). Inter-rater agreement between the two primary evaluators was assessed using Cohen’s kappa coefficient. Differences in accuracy and reproducibility across question categories were analyzed using the χ² test. Statistical significance was set at a two-tailed P value of < 0.05.

RESULTS

Distribution of questions and inter-rater agreement

A total of 88 ADHD-related questions were included in the analysis. Of these, 30 questions (34.1%) were categorized as basic knowledge, 22 (25.0%) as diagnosis and assessment, 21 (23.9%) as treatment and medication, and 15 (17.0%) as long-term outcomes and psychosocial impact. Inter-rater agreement between the two child and adolescent psychiatrists who evaluated the responses was moderate to good, with a Cohen’s kappa coefficient of 0.52 (P < 0.001), indicating acceptable consistency in accuracy ratings. The moderate agreement likely reflects variability in judgments of response completeness and contextual nuance rather than disagreement regarding clearly inaccurate content. Discrepancies predominantly involved adjacent rating categories, a common finding in qualitative expert-based evaluations. Given that disagreements rarely involved clinically unsafe classifications, the overall conclusions are unlikely to be substantially affected. Although the overall inter-rater agreement was moderate (Cohen’s κ = 0.52), disagreements between evaluators most frequently involved distinctions between “comprehensive/correct” and “incomplete/partially correct” ratings, rather than disagreements involving clearly inaccurate or misleading responses. Discrepancies were more common in clinically nuanced domains, particularly treatment and medication-related questions, where clinical judgment and emphasis on contextual detail may vary between clinicians.

Accuracy of ChatGPT responses across all ADHD-related questions

Across all 88 questions, ChatGPT responses were rated as comprehensive and correct in 59.1% (52/88) of cases. Incomplete or partially correct responses were identified in 27.3% (24/88) of questions, while mixed or potentially misleading responses were observed in 13.6% (12/88). No responses were classified as completely inaccurate or irrelevant.

Accuracy by question category

When accuracy was analyzed according to question category, the highest proportion of comprehensive and correct responses was observed in the basic knowledge domain (66.7%), followed by diagnosis and assessment (59.1%), long-term outcomes and psychosocial impact (53.3%), and treatment and medication (47.6%). The treatment and medication category had the highest proportion of mixed or potentially misleading responses, particularly in questions about stimulant safety, treatment duration, and long-term medication effects. A χ² analysis revealed no statistically significant difference in the distribution of overall accuracy across categories (χ², P = 0.08). Detailed accuracy distributions across categories are presented in Table 2 and illustrated in Figure 2.

Open in New Tab Full Size Figure Download Figure

Figure 2 Accuracy ratings across attention-deficit/hyperactivity disorder question categories. Distribution of ChatGPT response accuracy by clinical domain. Comprehensive/correct responses were most frequent in basic knowledge questions, whereas mixed or potentially misleading responses were more common in treatment and medication-related queries.

Table 2 Distribution of accuracy ratings across attention-deficit/hyperactivity disorder question categories (n = 88), n (%).

Question category	Comprehensive/correct	Incomplete/partially correct	Mixed/misleading
Basic knowledge (n = 30)	20 (66.7)	7 (23.3)	3 (10.0)
Diagnosis and assessment (n = 22)	13 (59.1)	6 (27.3)	3 (13.6)
Treatment and medication (n = 21)	10 (47.6)	7 (33.3)	4 (19.1)
Long-term outcomes and psychosocial impact (n = 15)	8 (53.3)	4 (26.7)	3 (20.0)
Total (n = 88)	52 (59.1)	24 (27.3)	12 (13.6)

Percentages were calculated within each question category. No responses were rated as inaccurate/irrelevant.

Open in New Tab Full Size Table

Reproducibility of ChatGPT responses

Overall reproducibility across all ADHD-related questions was 87.5% (77/88). Reproducibility rates by category were 90.0% for basic knowledge, 88.0% for diagnosis and assessment, 81.0% for treatment and medication, and 85.0% for long-term outcomes and psychosocial impact. Although reproducibility was slightly lower in the treatment and medication category, differences between categories were not statistically significant (χ², P = 0.61). Reproducibility results are summarized in Figure 3.

Open in New Tab Full Size Figure Download Figure

Figure 3 Reproducibility of ChatGPT responses. Classification of responses as reproducible or non-reproducible based on consistency between two independent outputs generated for each question.

Summary of incomplete and mixed responses

Mixed or potentially misleading responses most frequently occurred in questions addressing pharmacological treatment, long-term prognosis, and functional outcomes. In contrast, questions related to symptom definition and the general characteristics of ADHD were answered more consistently with comprehensive and correct information. Examples of incomplete or mixed responses, along with expert commentary, are provided in Table 3.

Table 3 Examples of ChatGPT responses rated as incomplete or potentially misleading, with expert commentary.

Parent/patient question	ChatGPT response summary	Reason for rating	Expert clinical commentary
Can ADHD medications cause addiction?	Indicates that stimulant medications may lead to dependence if misused, while generally being safe when prescribed	The response oversimplifies addiction risk and does not clearly differentiate therapeutic use from substance misuse	Clinically, stimulant treatment for ADHD is not equivalent to substance use disorder. When appropriately prescribed and monitored, evidence suggests a low risk of addiction, and this distinction should be explicitly communicated to prevent unnecessary parental anxiety
Will my child outgrow ADHD?	Suggests that symptoms often improve with age and that some individuals may no longer meet diagnostic criteria in adulthood	The response inadequately addresses symptom persistence and functional impairment over time	Although symptom severity may decrease, ADHD frequently persists into adulthood, often with ongoing academic, occupational, or emotional difficulties. Families should be informed about the developmental trajectory rather than reassured by a simplistic “outgrowing” narrative
Does ADHD medication affect brain development?	States that there is no clear evidence of harmful effects on brain development	The answer lacks reference to current neurodevelopmental and longitudinal research	Contemporary neuroimaging and longitudinal studies suggest possible normalization of certain neural networks with treatment, while uncertainties remain. A balanced discussion of current evidence and ongoing research is essential in developmental psychiatry
How long should my child take ADHD medication?	Notes that treatment duration varies depending on individual needs	The response provides vague guidance and omits principles of treatment planning	In clinical practice, medication use involves regular monitoring, periodic reassessment, and shared decision-making with families, rather than indefinite or open-ended treatment. This nuance is critical for parental understanding
Does ADHD cause emotional problems later in life?	Associates ADHD with emotional regulation difficulties	The framing risks a deterministic interpretation of long-term outcomes	Emotional difficulties are influenced by multiple factors, including comorbidities, environmental supports, and early intervention. Overly deterministic statements may contribute to stigma or parental pessimism
Are stimulant medications safe for long-term use?	States that stimulants are commonly used long term and are considered safe	The response does not emphasize monitoring or potential side effects	Long-term stimulant use requires systematic monitoring of growth, cardiovascular parameters, sleep, and emotional functioning. Omitting these elements may give families a false sense of unconditional safety

This table presents representative examples of ChatGPT responses rated as incomplete/partially correct or mixed/potentially misleading, along with expert clinical commentary highlighting missing developmental context and potential patient-safety implications. Response summaries were condensed for clarity. ADHD: Attention-deficit/hyperactivity disorder.

Open in New Tab Full Size Table

DISCUSSION

This study evaluated the accuracy and reproducibility of ChatGPT (GPT-4o) responses to 88 frequently asked questions posed by parents and patients regarding ADHD, a common neurodevelopmental condition in child and adolescent mental health. Overall, 59.1% of responses were rated as comprehensive and correct, while 27.3% were classified as incomplete or partially correct and 13.6% as mixed or potentially misleading; no responses were deemed completely inaccurate or irrelevant. The overall reproducibility rate was 87.5%. Higher accuracy and consistency were observed for basic informational and diagnostic or assessment-related questions, whereas responses addressing treatment, medication use, and long-term outcomes showed greater variability in contextual adequacy and clinical nuance. The absence of statistically significant differences across question categories suggests that this variability reflects content-related response characteristics rather than clear domain-specific superiority.

Current literature indicates that discussions surrounding the clinical use of LLM-based systems generally converge on three main dimensions: (1) Factual accuracy; (2) Contextual and developmental appropriateness; and (3) Clinical risk framing[23]. In child and adolescent mental health-particularly for heterogeneous and developmentally dynamic conditions such as ADHD-the latter two dimensions may be as consequential for patient safety as factual correctness.

Although LLM-based systems such as ChatGPT hold promise for patient education, health literacy enhancement, and the provision of preliminary information, their performance appears to vary depending on the language of the query, the depth of domain-specific knowledge required, and the nuances of the medical context. Studies examining ChatGPT responses to physician-generated medical questions have generally reported accurate and understandable outputs across many topics, while also noting limitations in comprehensiveness, specificity, and reliability for more complex questions that could influence clinical decision-making[24]. Similarly, evaluations based on parent-focused question sets suggest that AI-based chatbots can deliver rapid and accessible information, but that content quality may vary across domains[20].

When considered alongside existing literature, the present findings align with broader evaluations of LLM performance in healthcare settings. Across diverse clinical contexts, including patient education and pediatric health domains, AI-based chatbots have been shown to generate accessible and understandable responses, while highlighting the importance of clinician oversight for clinically nuanced or decision-relevant content[11,25-27]. Taken together, these findings suggest that variations in response quality may be better explained by the interaction between content complexity and contextual requirements rather than by domain-specific superiority alone.

By focusing on ADHD-a condition characterized by marked developmental variability and clinical heterogeneity-this study extends prior work by evaluating LLM-generated responses not only in terms of factual accuracy, but also with respect to reproducibility and clinical risk framing. Within this framework, descriptively higher proportions of “mixed or potentially misleading” responses were observed for treatment- and medication-related questions, with greater variability in response consistency in this domain. However, as no statistically significant differences emerged across question categories, these findings should be interpreted cautiously and not as evidence of definitive domain-specific inadequacy. Rather, they suggest that response characteristics may be influenced by content type and contextual complexity. In line with prior clinical AI evaluations, even non-significant numerical differences in clinically sensitive domains warrant careful consideration[25].

Questions concerning pharmacological treatment involve multiple interacting clinical variables (e.g., dosing considerations, adverse effects, comorbidities, and individual risk factors), which may increase susceptibility to generalized or context-limited responses and contribute to variability in clinical adequacy. Consistent with prior literature, LLM-based systems may generate cautious or insufficiently directive responses in clinically sensitive contexts, highlighting the need for expert oversight[28]. In addition to clinical complexity, methodological factors related to how prompts are structured may also contribute to variability in response characteristics. In the present study, standalone prompts without explicit role-based instructions or structured prompt engineering strategies were used. However, existing literature suggests that prompt structure-particularly role-based prompting-may improve the accuracy and acceptability of responses generated by LLMs[26]. Therefore, the variability observed in treatment- and medication-related questions may reflect not only clinical complexity but also the influence of prompt design. Future studies could investigate whether structured prompt engineering approaches may help improve response quality, particularly in clinically sensitive domains such as pharmacotherapy.

As the present study evaluated only the GPT-4o model, caution is warranted when generalizing the observed domain-specific performance differences to other LLMs. Comparative studies have shown that different LLMs may exhibit distinct performance profiles in terms of accuracy, contextual appropriateness, and response characteristics across similar tasks; therefore, the variability observed particularly in treatment- and medication-related domains may partly reflect model-specific features rather than solely clinical complexity[29]. This consideration further highlights the importance of interpreting current findings within the broader context of ongoing model evolution. Accordingly, the present findings should be interpreted as a snapshot evaluation within a rapidly evolving AI landscape, where continuous model updates and architectural differences may influence performance over time. Ongoing comparative and longitudinal assessments will therefore be essential to better understand the stability, generalizability, and clinical applicability of LLM-generated responses.

Beyond model-specific variability, the observed patterns in contextual omission and framing inconsistencies may also be conceptually linked to phenomena described in the literature as “AI hallucination”, referring to the confident presentation of incomplete, decontextualized, or clinically unverified information[30]. Some responses categorized as “mixed or potentially misleading” reflected patterns such as overgeneralization of treatment advice or insufficient emphasis on clinical context. For example, certain answers provided broadly accurate information but lacked adequate developmental framing or individualized cautions, which could increase the risk of misinterpretation in real-world settings. Although overtly incorrect or fabricated information was uncommon in the present study, some responses failed to adequately emphasize critical clinical warnings or relied on overly generalized language. Overall, these observations suggest that the primary vulnerability lies not in explicit factual inaccuracy but in contextual omission and insufficient clinical framing.

Notably, although reproducibility did not differ significantly across domains (χ², P = 0.61), treatment-related questions showed numerically greater variability and a higher proportion of potentially misleading elements. This finding suggests that clinically sensitive topics may be more prone to inconsistent framing across repeated outputs. Minor variations in advisory tone were occasionally observed across repeated responses. These differences generally involved emphasis rather than changes in clinical meaning and therefore did not affect reproducibility classification. Although the observed reproducibility rate of 87.5% appears relatively high, there is currently no established benchmark defining acceptable reproducibility thresholds for LLMs in clinical contexts. Given the probabilistic and non-deterministic nature of generative AI systems, some degree of variability across repeated outputs is expected. Previous research examining repeated medical queries has similarly demonstrated that response consistency may vary even when identical prompts are used[31]. Therefore, reproducibility should be interpreted as a contextual performance indicator rather than an absolute standard, with particular attention to clinically sensitive domains such as treatment-related information.

A key contribution of this study lies in its evaluation framework, which extends beyond factual accuracy by incorporating clinical risk framing and developmental context. By emphasizing whether responses adequately address safety considerations and age-appropriate contextualization, this approach highlights that, particularly in treatment-related domains within child and adolescent mental health, factual correctness alone may not be sufficient to ensure patient safety.

In addition, cultural and contextual sensitivity limitations of LLM-based systems warrant consideration, as models trained on large and often culturally homogeneous datasets may reproduce generalized patterns that do not fully align with diverse contexts[32]. Furthermore, although all queries were conducted within a fixed time window to minimize variability related to model updates, previous research has shown that the behavior and performance of LLMs may change over time due to ongoing system-level adjustments and non-transparent server-side updates[31]. Therefore, the reproducibility findings of the present study should be interpreted with consideration that variability may occur across different time points or following model updates. This limitation reflects broader methodological challenges in the evaluation of rapidly evolving LLM systems.

In ADHD, parental attitudes, expectations regarding treatment, and levels of health literacy are influenced by cultural factors, suggesting that LLM-generated responses may be interpreted differently across settings. Furthermore, the reliance on evaluations conducted by two expert clinicians represents a potential limitation, particularly for judgments involving clinical risk and developmental framing, and may constrain the generalizability of the findings.

Despite these limitations, ChatGPT’s ability to translate medical terminology into accessible, patient-friendly language may support patient education and health literacy[33]. In underserved settings, such tools may help caregivers obtain preliminary information and prepare for clinical consultations. However, LLM-based systems should not replace clinical decision-making, particularly for treatment-related inquiries[28]. Accordingly, ChatGPT may serve as a complementary informational resource for ADHD when used cautiously in clinically sensitive contexts.

Beyond their current role as informational tools, LLM-based systems may also represent an early component of broader computational psychiatry frameworks aiming to integrate diverse data modalities for clinical decision support. Within this evolving perspective, future developments in computational psychiatry suggest that advanced diagnostic support systems may increasingly integrate multimodal data sources, combining not only clinical text but also neuroimaging biomarkers and genetic susceptibility profiles. Neuroimaging-based machine learning approaches have already demonstrated promising accuracy for classification and prognosis prediction in psychiatric disorders[34], while recent research indicates that genetic risk factors may influence neural phenotypes in ADHD[35]. Within this broader trajectory, LLMs or their successors could potentially serve as integrative frameworks capable of synthesizing clinical narrative data with biological and neurodevelopmental markers, highlighting a future direction for multimodal computational psychiatry.

The strengths of this study include the systematic evaluation of ChatGPT (GPT-4o) responses across both accuracy and reproducibility dimensions, as well as the use of a question set derived from real-world inquiries commonly posed by parents and patients, enhancing ecological validity. However, several methodological limitations should be acknowledged.

Although the present study focused on accuracy and reproducibility, readability and comprehensibility represent important dimensions of quality in the use of AI for patient education. Individual differences in health literacy constitute a significant barrier to patients’ understanding of medical information; therefore, patient education materials are recommended to be written in accessible language that can be understood by a broad audience[36]. Recent studies suggest that even when models demonstrate comparable levels of factual accuracy, they may differ substantially in readability and ease of understanding[37]. However, readability was not systematically evaluated in the present study, which should be considered a limitation. Future research should incorporate multidimensional evaluation approaches that assess not only accuracy and reliability but also readability and patient comprehension.

All questions were posed in English, and analyses were conducted at a single time point using one LLM model (ChatGPT, GPT-4o), which may limit generalizability across linguistic, cultural, and evolving model contexts. Clinical guidelines, treatment practices, and parental expectations regarding ADHD vary across healthcare systems, and reliance on a single language and model context may therefore constrain broader applicability. Additionally, the evaluation focused exclusively on textual outputs rather than real-world clinical implementation or user interaction processes. Accordingly, the present findings should be interpreted within these methodological constraints, and future studies should incorporate broader linguistic, cultural, and multi-model comparisons.

Although the present study focused on response accuracy and reproducibility, it did not directly assess how parents or patients might interpret or act upon LLM-generated information. Importantly, prior work in psychiatry has highlighted that factual correctness alone does not necessarily ensure clinical safety, as responses may still be misinterpreted or lack sufficient contextual framing[38]. In real-world settings, incomplete or insufficiently contextualized responses may therefore influence decision-making processes or delay appropriate help-seeking. Future research should examine downstream effects in real-world clinical contexts, including caregiver interpretation, behavioral responses, and clinical outcomes.

Future directions

Future research should focus on evaluating the safe use of LLM-based tools in the context of ADHD by including samples that represent diverse age groups, clinical subtypes, and linguistic contexts. Examining multi-step and interactive question-answer scenarios that more closely reflect real-world interactions will be important for assessing the contextual adaptability of these systems. In addition, the development of standardized evaluation frameworks that prioritize clinical risk framing may contribute to the responsible use of LLMs in child and adolescent mental health.

CONCLUSION

This study demonstrates that ChatGPT (GPT-4o) responses to ADHD-related questions vary across content domains in terms of accuracy, reproducibility, and clinical framing. While basic informational and diagnostic content showed higher consistency, greater variability was observed in clinically sensitive areas such as pharmacological treatment and long-term outcomes, likely reflecting the complexity and individual variability inherent to ADHD.

Given that families often encounter fragmented or conflicting information, LLM-based tools may support general information-seeking between clinical visits by providing accessible explanations. However, responses that insufficiently account for developmental context or individual differences may pose risks of misinterpretation and raise patient-safety concerns.

Accordingly, LLM-based systems should be viewed as complementary informational resources rather than substitutes for professional clinical judgment, particularly for treatment-related decisions[39]. From a practical perspective, a domain-sensitive “triage” approach may help balance accessibility with safety, whereby LLMs support psychoeducation and general understanding, while individualized treatment decisions remain guided by professional clinical evaluation. Future research should focus on developing practical safety evaluation frameworks and examining real-world downstream effects, including caregiver interpretation, behavioral responses, and clinical outcomes, to better define the responsible integration of LLM-based tools into child and adolescent mental health care.

ACKNOWLEDGEMENTS

The authors thank the peer reviewers and editorial team for their constructive feedback during the revision process.

References

Sayal K, Prasad V, Daley D, Ford T, Coghill D. ADHD in children and young people: prevalence, care pathways, and service provision. Lancet Psychiatry. 2018;5:175-186. [RCA] [PubMed] [DOI] [Full Text] [Cited by in Crossref: 977] [Cited by in RCA: 758] [Article Influence: 94.8] [Reference Citation Analysis (0)]

Salari N, Ghasemi H, Abdoli N, Rahmani A, Shiri MH, Hashemian AH, Akbari H, Mohammadi M. The global prevalence of ADHD in children and adolescents: a systematic review and meta-analysis. Ital J Pediatr. 2023;49:48. [RCA] [PubMed] [DOI] [Full Text] [Full Text (PDF)] [Cited by in Crossref: 465] [Cited by in RCA: 329] [Article Influence: 109.7] [Reference Citation Analysis (0)]

Di Lorenzo R, Balducci J, Poppi C, Arcolin E, Cutino A, Ferri P, D'Amico R, Filippini T. Children and adolescents with ADHD followed up to adulthood: a systematic review of long-term outcomes. Acta Neuropsychiatr. 2021;33:283-298. [RCA] [PubMed] [DOI] [Full Text] [Cited by in Crossref: 15] [Cited by in RCA: 97] [Article Influence: 19.4] [Reference Citation Analysis (0)]

Cortese S, Adamo N, Del Giovane C, Mohr-Jensen C, Hayes AJ, Carucci S, Atkinson LZ, Tessari L, Banaschewski T, Coghill D, Hollis C, Simonoff E, Zuddas A, Barbui C, Purgato M, Steinhausen HC, Shokraneh F, Xia J, Cipriani A. Comparative efficacy and tolerability of medications for attention-deficit hyperactivity disorder in children, adolescents, and adults: a systematic review and network meta-analysis. Lancet Psychiatry. 2018;5:727-738. [RCA] [PubMed] [DOI] [Full Text] [Full Text (PDF)] [Cited by in Crossref: 923] [Cited by in RCA: 874] [Article Influence: 109.3] [Reference Citation Analysis (1)]

Levkovich I. Evaluating Diagnostic Accuracy and Treatment Efficacy in Mental Health: A Comparative Analysis of Large Language Model Tools and Mental Health Professionals. Eur J Investig Health Psychol Educ. 2025;15:9. [RCA] [PubMed] [DOI] [Full Text] [Full Text (PDF)] [Cited by in Crossref: 21] [Cited by in RCA: 12] [Article Influence: 12.0] [Reference Citation Analysis (0)]

6.	Lawson McLean A. Constructing knowledge: the role of AI in medical learning. J Am Med Inform Assoc. 2024;31:1797-1798. [RCA] [PubMed] [DOI] [Full Text] [Full Text (PDF)] [Cited by in Crossref: 12] [Cited by in RCA: 6] [Article Influence: 3.0] [Reference Citation Analysis (0)]

Yang R, Tan TF, Lu W, Thirunavukarasu AJ, Ting DSW, Liu N. Large language models in health care: Development, applications, and challenges. Health Care Sci. 2023;2:255-263. [RCA] [PubMed] [DOI] [Full Text] [Full Text (PDF)] [Cited by in Crossref: 69] [Cited by in RCA: 127] [Article Influence: 42.3] [Reference Citation Analysis (0)]

8.	Volkmer S, Meyer-Lindenberg A, Schwarz E. Large language models in psychiatry: Opportunities and challenges. Psychiatry Res. 2024;339:116026. [RCA] [PubMed] [DOI] [Full Text] [Cited by in Crossref: 44] [Cited by in RCA: 34] [Article Influence: 17.0] [Reference Citation Analysis (0)]

9.	Zhou J, Zhang W, Liu S. The evolving landscape of artificial intelligence in patient education: A bibliometric knowledge mapping study. Digit Health. 2026;12:20552076251406653. [RCA] [PubMed] [DOI] [Full Text] [Full Text (PDF)] [Cited by in RCA: 1] [Reference Citation Analysis (1)]

10.

Malerbi FK, Nakayama LF, Gayle Dychiao R, Zago Ribeiro L, Villanueva C, Celi LA, Regatieri CV. Digital Education for the Deployment of Artificial Intelligence in Health Care. J Med Internet Res. 2023;25:e43333. [RCA] [PubMed] [DOI] [Full Text] [Cited by in RCA: 28] [Reference Citation Analysis (0)]

11.	Sathe A, Chikanna H. Short Research Article: Evaluation of an artificial intelligence language model in psychiatric patient education. Child Adolesc Ment Health. 2025;30:265-271. [RCA] [PubMed] [DOI] [Full Text] [Cited by in RCA: 2] [Reference Citation Analysis (0)]

12.

Ray A, Bhardwaj A, Malik YK, Singh S, Gupta R. Artificial intelligence and Psychiatry: An overview. Asian J Psychiatr. 2022;70:103021. [RCA] [PubMed] [DOI] [Full Text] [Full Text (PDF)] [Cited by in Crossref: 4] [Cited by in RCA: 68] [Article Influence: 17.0] [Reference Citation Analysis (1)]

13.	Olawade DB, Wada OZ, Odetayo A, David-Olawade AC, Asaolu F, Eberhardt J. Enhancing mental health with Artificial Intelligence: Current trends and future prospects. J Med Surg Public Health. 2024;3:100099. [PubMed] [DOI] [Full Text]

14.	Xu Z, Lee YC, Stasiak K, Warren J, Lottridge D. The Digital Therapeutic Alliance With Mental Health Chatbots: Diary Study and Thematic Analysis. JMIR Ment Health. 2025;12:e76642. [RCA] [PubMed] [DOI] [Full Text] [Full Text (PDF)] [Cited by in RCA: 4] [Reference Citation Analysis (0)]

15.	Omar M, Soffer S, Charney AW, Landi I, Nadkarni GN, Klang E. Applications of large language models in psychiatry: a systematic review. Front Psychiatry. 2024;15:1422807. [RCA] [PubMed] [DOI] [Full Text] [Cited by in RCA: 38] [Reference Citation Analysis (0)]

16.

Weissman GE, Mankowitz T, Kanter GP. Unregulated large language models produce medical device-like output. NPJ Digit Med. 2025;8:148. [RCA] [PubMed] [DOI] [Full Text] [Full Text (PDF)] [Cited by in Crossref: 35] [Cited by in RCA: 33] [Article Influence: 33.0] [Reference Citation Analysis (0)]

17.

Dergaa I, Fekih-Romdhane F, Hallit S, Loch AA, Glenn JM, Fessi MS, Ben Aissa M, Souissi N, Guelmami N, Swed S, El Omri A, Bragazzi NL, Ben Saad H. ChatGPT is not ready yet for use in providing mental health assessment and interventions. Front Psychiatry. 2023;14:1277756. [RCA] [PubMed] [DOI] [Full Text] [Full Text (PDF)] [Cited by in RCA: 53] [Reference Citation Analysis (0)]

18.

Ferrario A, Sedlakova J, Trachsel M. The Role of Humanization and Robustness of Large Language Models in Conversational Artificial Intelligence for Individuals With Depression: A Critical Analysis. JMIR Ment Health. 2024;11:e56569. [RCA] [PubMed] [DOI] [Full Text] [Full Text (PDF)] [Cited by in Crossref: 36] [Cited by in RCA: 17] [Article Influence: 8.5] [Reference Citation Analysis (0)]

19.

Yin Y, Zeng M, Wang H, Yang H, Zhou C, Jiang F, Wu S, Huang T, Yuan S, Lin J, Tang M, Chen J, Dong B, Yuan J, Xie D. A clinician-based comparative study of large language models in answering medical questions: the case of asthma. Front Pediatr. 2025;13:1461026. [RCA] [PubMed] [DOI] [Full Text] [Full Text (PDF)] [Cited by in RCA: 4] [Reference Citation Analysis (0)]

20.

Almulla AA, Khasawneh MAS. Assessing AI-Based Large Language Models (ChatGPT, Google Gemini, and DeepSeek) for Common Parent Questions about Autism: Acceptability, Readability, and Accuracy. Psychiatr Q. 2025. [RCA] [PubMed] [DOI] [Full Text] [Cited by in RCA: 3] [Reference Citation Analysis (0)]

21.

Dahò M, Caci B. Exploring AI-assisted design of executive function rehabilitation programs for individuals with ADHD: a mixed-methods evaluation of prompts and chatgpt outputs. BMC Psychol. 2025;14:25. [RCA] [PubMed] [DOI] [Full Text] [Full Text (PDF)] [Cited by in RCA: 1] [Reference Citation Analysis (0)]

22.

Mete U, Özmen ÖA. Assessing the accuracy and reproducibility of ChatGPT for responding to patient inquiries about otosclerosis. Eur Arch Otorhinolaryngol. 2025;282:1567-1575. [RCA] [PubMed] [DOI] [Full Text] [Cited by in Crossref: 1] [Cited by in RCA: 7] [Article Influence: 7.0] [Reference Citation Analysis (0)]

23.

Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, Scales N, Tanwani A, Cole-Lewis H, Pfohl S, Payne P, Seneviratne M, Gamble P, Kelly C, Babiker A, Schärli N, Chowdhery A, Mansfield P, Demner-Fushman D, Agüera Y Arcas B, Webster D, Corrado GS, Matias Y, Chou K, Gottweis J, Tomasev N, Liu Y, Rajkomar A, Barral J, Semturs C, Karthikesalingam A, Natarajan V. Publisher Correction: Large language models encode clinical knowledge. Nature. 2023;620:E19. [RCA] [PubMed] [DOI] [Full Text] [Full Text (PDF)] [Cited by in Crossref: 61] [Cited by in RCA: 25] [Article Influence: 8.3] [Reference Citation Analysis (0)]

24.

Gilson A, Safranek CW, Huang T, Socrates V, Chi L, Taylor RA, Chartash D. How Does ChatGPT Perform on the United States Medical Licensing Examination (USMLE)? The Implications of Large Language Models for Medical Education and Knowledge Assessment. JMIR Med Educ. 2023;9:e45312. [RCA] [PubMed] [DOI] [Full Text] [Full Text (PDF)] [Cited by in Crossref: 1746] [Cited by in RCA: 1058] [Article Influence: 352.7] [Reference Citation Analysis (2)]

25.

Latt PM, Aung ET, Htaik K, Soe NN, Lee D, King AJ, Fortune R, Ong JJ, Chow EPF, Bradshaw CS, Rahman R, Deneen M, Dobinson S, Randall C, Zhang L, Fairley CK. Evaluation of artificial intelligence (AI) chatbots for providing sexual health information: a consensus study using real-world clinical queries. BMC Public Health. 2025;25:1788. [RCA] [PubMed] [DOI] [Full Text] [Full Text (PDF)] [Cited by in RCA: 8] [Reference Citation Analysis (0)]

26.

Chen YC, Lee SH, Sheu H, Lin SH, Hu CC, Fu SC, Yang CP, Lin YC. Enhancing responses from large language models with role-playing prompts: a comparative study on answering frequently asked questions about total knee arthroplasty. BMC Med Inform Decis Mak. 2025;25:196. [RCA] [PubMed] [DOI] [Full Text] [Full Text (PDF)] [Cited by in RCA: 10] [Reference Citation Analysis (0)]

27.	Senbaykal Yigit E, Taskirdi I, Haci IA, Tuncel T. Artificial intelligence performance in pediatric asthma. J Asthma. 2025;62:1926-1932. [RCA] [PubMed] [DOI] [Full Text] [Cited by in Crossref: 1] [Cited by in RCA: 3] [Article Influence: 3.0] [Reference Citation Analysis (0)]

28.

Wei Q, Wang Y, Yao Z, Cui Y, Wei B, Li T, Xu X. Evaluation of ChatGPT's performance in providing treatment recommendations for pediatric diseases. Pediatr Discov. 2023;1:e42. [RCA] [PubMed] [DOI] [Full Text] [Full Text (PDF)] [Cited by in Crossref: 16] [Cited by in RCA: 10] [Article Influence: 3.3] [Reference Citation Analysis (0)]

29.	Neubauer JC, Kaiser A, Lettermann L, Volkert T, Häge A. Performance of large language models ChatGPT and Gemini in child and adolescent psychiatry knowledge assessment. PLoS One. 2025;20:e0332917. [RCA] [PubMed] [DOI] [Full Text] [Cited by in RCA: 4] [Reference Citation Analysis (0)]

30.	Ji Z, Lee N, Frieske R, Yu T, Su D, Xu Y, Ishii E, Bang YJ, Madotto A, Fung P. Survey of Hallucination in Natural Language Generation. ACM Comput Surv. 2023;55:1-38. [PubMed] [DOI] [Full Text]

31.

Funk PF, Hoch CC, Knoedler S, Knoedler L, Cotofana S, Sofo G, Bashiri Dezfouli A, Wollenberg B, Guntinas-Lichius O, Alfertshofer M. ChatGPT's Response Consistency: A Study on Repeated Queries of Medical Examination Questions. Eur J Investig Health Psychol Educ. 2024;14:657-668. [RCA] [PubMed] [DOI] [Full Text] [Full Text (PDF)] [Cited by in Crossref: 12] [Cited by in RCA: 22] [Article Influence: 11.0] [Reference Citation Analysis (0)]

32.	Tao Y, Viberg O, Baker RS, Kizilcec RF. Cultural bias and cultural alignment of large language models. PNAS Nexus. 2024;3:pgae346. [RCA] [PubMed] [DOI] [Full Text] [Full Text (PDF)] [Cited by in RCA: 38] [Reference Citation Analysis (0)]

33.	Oztermeli AD. Is ChatGPT a Reliable Tool for Explaining Medical Terms? Cureus. 2025;17:e77258. [RCA] [PubMed] [DOI] [Full Text] [Cited by in RCA: 4] [Reference Citation Analysis (0)]

34.

You W, Luo L, Yao L, Zhao Y, Li Q, Wang Y, Wang Y, Zhang Q, Long F, Sweeney JA, Gong Q, Li F. Impaired dynamic functional brain properties and their relationship to symptoms in never treated first-episode patients with schizophrenia. Schizophrenia (Heidelb). 2022;8:90. [RCA] [PubMed] [DOI] [Full Text] [Full Text (PDF)] [Cited by in Crossref: 30] [Cited by in RCA: 60] [Article Influence: 15.0] [Reference Citation Analysis (0)]

35.

Li Q, Qin K, Lei D, Li W, Tallman MJ, Patino LR, Sweeney JA, Gong Q, Li F, DelBello MP, McNamara RK. Reduced emotion-generated frontolimbic functional connectivity in psychostimulant-free ADHD youth with and without familial risk for bipolar I disorder. Eur Child Adolesc Psychiatry. 2025. [RCA] [PubMed] [DOI] [Full Text] [Cited by in Crossref: 7] [Cited by in RCA: 6] [Article Influence: 6.0] [Reference Citation Analysis (0)]

36.	Ozduran E, Büyükçoban S. Evaluating the readability, quality and reliability of online patient education materials on post-covid pain. PeerJ. 2022;10:e13686. [RCA] [PubMed] [DOI] [Full Text] [Full Text (PDF)] [Cited by in RCA: 25] [Reference Citation Analysis (1)]

37.

Liu S, Su L, He Q, Qiu M, Liang R. Comparative evaluation of ChatGPT and Gemini in brain-computer interfaces patient education: A multi-dimensional analysis of reliability, accuracy, comprehensibility, and readability. Int J Med Inform. 2026;206:106164. [RCA] [PubMed] [DOI] [Full Text] [Cited by in Crossref: 1] [Cited by in RCA: 3] [Article Influence: 3.0] [Reference Citation Analysis (0)]

38.

Luykx JJ, Gerritse F, Habets PC, Vinkers CH. The performance of ChatGPT in generating answers to clinical questions in psychiatry: a two-layer assessment. World Psychiatry. 2023;22:479-480. [RCA] [PubMed] [DOI] [Full Text] [Cited by in Crossref: 19] [Cited by in RCA: 24] [Article Influence: 8.0] [Reference Citation Analysis (1)]

39.

Zhang K, Meng X, Yan X, Ji J, Liu J, Xu H, Zhang H, Liu D, Wang J, Wang X, Gao J, Wang YG, Shao C, Wang W, Li J, Zheng MQ, Yang Y, Tang YD. Revolutionizing Health Care: The Transformative Impact of Large Language Models in Medicine. J Med Internet Res. 2025;27:e59069. [RCA] [PubMed] [DOI] [Full Text] [Cited by in RCA: 53] [Reference Citation Analysis (0)]

Footnotes

Peer review: Externally peer reviewed.

Peer-review model: Single blind

Specialty type: Psychiatry

Country of origin: Türkiye

Peer-review report’s classification

Scientific quality: Grade A, Grade B, Grade B, Grade B, Grade B, Grade C

Novelty: Grade A, Grade B, Grade B, Grade B, Grade C, Grade C

Creativity or innovation: Grade B, Grade B, Grade B, Grade B, Grade B, Grade B

Scientific significance: Grade A, Grade B, Grade B, Grade B, Grade B, Grade C

P-Reviewer: Li F, MD, PhD, Associate Professor, China; Liu SC, MD, China; Xiao Y, MD, PhD, Assistant Professor, China S-Editor: Liu H L-Editor: A P-Editor: Zhang L