Published online Jun 19, 2026. doi: 10.5498/wjp.v16.i6.119773
Revised: February 23, 2026
Accepted: March 12, 2026
Published online: June 19, 2026
Processing time: 108 days and 1 hour
Children and adolescents with attention-deficit/hyperactivity disorder (ADHD) and their caregivers increasingly turn to artificial intelligence-based chatbots for information about symptoms, functional difficulties, and treatment-related concerns. Large language models (LLMs), such as ChatGPT, are of particular in
To systematically evaluate the accuracy and reproducibility of ChatGPT (GPT-4o)-generated responses to commonly asked ADHD-related questions from parents and patients.
In this cross-sectional study, 88 frequently asked ADHD-related questions were identified through internet search engines, parent-oriented forums, and professional organization websites. Questions were categorized into four domains: basic knowledge (n = 30), diagnosis and assessment (n = 22), treatment and medication use (n = 21), and long-term outcomes and psychosocial impact (n = 15). Each question was submitted twice to the subscription-based version of ChatGPT (GPT-4o) in separate chat sessions. Two blinded child and adolescent psychiatrists independently rated each response for accuracy (comprehensive/correct, incomplete, mixed or potentially misleading, or inaccurate) and for reproducibility across the two submissions. Inter-rater agreement was assessed with Cohen’s κ, and differences across domains were tested with the χ² test.
Overall, 59.1% (52/88) of responses were rated as comprehensive/correct, 27.3% (24/88) as incomplete, and 13.6% (12/88) as mixed or potentially misleading; no inaccurate or irrelevant responses were identified. Accuracy was highest for basic knowledge questions (66.7%) and lowest for treatment and medication-related questions (47.6%). Overall reproducibility was 87.5% (77/88), with no significant differences across domains (χ², P = 0.61). Inter-rater reliability was moderate (Cohen’s κ = 0.52).
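The proportions reported above follow directly from the stated counts, and Cohen’s κ has a simple closed form. The sketch below is illustrative only and is not the authors’ analysis code: it recomputes the percentages from the counts given in the text and shows how κ is obtained from two raters’ labels. The helper names `percent` and `cohen_kappa` and the four-item `rater_a`/`rater_b` lists are hypothetical; the study’s item-level ratings are not available here.

```python
from collections import Counter


def percent(count, total):
    """Return count/total as a percentage rounded to one decimal place."""
    return round(100 * count / total, 1)


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each rater's
    marginal label frequencies.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    freq_a = Counter(ratings_a)
    freq_b = Counter(ratings_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Counts as reported in the text (88 questions in total).
TOTAL = 88
reported_counts = {
    "comprehensive/correct": 52,
    "incomplete": 24,
    "mixed or potentially misleading": 12,
}
reported_percentages = {
    label: percent(count, TOTAL) for label, count in reported_counts.items()
}
reproducibility = percent(77, TOTAL)

# Hypothetical toy ratings, NOT study data: C = correct, I = incomplete, M = mixed.
rater_a = ["C", "C", "I", "M"]
rater_b = ["C", "I", "I", "M"]
toy_kappa = cohen_kappa(rater_a, rater_b)

if __name__ == "__main__":
    print(reported_percentages, reproducibility, round(toy_kappa, 3))
```

The three accuracy counts sum to 88, so the categories cover every question. The toy κ only demonstrates the formula and should not be compared with the reported value of 0.52.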
ChatGPT (GPT-4o) demonstrated relatively high accuracy and reproducibility overall, with stronger performance in the basic informational and diagnostic domains but greater variability in clinically sensitive areas such as treatment, medication use, and long-term outcomes. These findings highlight both the potential utility and the important limitations of LLM-based chatbots in ADHD-related information-seeking, underscoring the need for cautious interpretation, particularly in treatment-related contexts where responses may require professional clinical guidance.
Core Tip: The clinical reliability of large language models (LLMs) in addressing attention-deficit/hyperactivity disorder (ADHD)-related questions from patients and caregivers has not been sufficiently characterized. This study systematically evaluates the accuracy and reproducibility of ChatGPT (GPT-4o) across clinically relevant domains in child and adolescent psychiatry. The findings indicate stronger and more consistent performance in basic informational and diagnostic domains, whereas greater variability was observed in clinically sensitive areas such as treatment, medication use, and long-term outcomes. These results highlight both the potential utility and the limitations of LLM-based tools in ADHD-related in