Published online Apr 19, 2026. doi: 10.5498/wjp.v16.i4.116428
Revised: December 7, 2025
Accepted: January 14, 2026
Processing time: 139 days and 4.6 hours
Adolescent depression is a pressing global public health challenge. Current screening largely depends on self-reported questionnaires, which are vulnerable to response biases and underreporting. Integrating objective behavioral signals with validated scales may bridge this subjective-objective gap and improve detection.
To develop a novel multimodal protocol integrating video-recorded facial expressions, reading-aloud voice recordings, and depression scale data for adolescent depression screening.
A total of 771 adolescents (aged 12-18 years; mean age 15.23 ± 1.68 years) were recruited. Facial expressions, reading-aloud voice recordings, and Chinese Secondary School Students Depression Scale (CSSSDS) data were collected from all participants. Five machine learning algorithms, including extreme gradient boosting (XGBoost) and an artificial neural network, were trained and compared under multimodal and bimodal protocols.
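To illustrate how such a multimodal pipeline can be assembled, the sketch below fuses three modality feature blocks by simple concatenation and evaluates a gradient-boosting classifier with cross-validated AUC-ROC. This is a minimal, hypothetical example: the data are synthetic, the feature dimensions are illustrative, and scikit-learn's `GradientBoostingClassifier` stands in for XGBoost; it does not reproduce the study's actual features or models.

```python
# Hypothetical sketch of early fusion for a multimodal screening model.
# All data are synthetic; feature counts are illustrative assumptions.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier  # stand-in for XGBoost
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 200
face = rng.normal(size=(n, 10))   # e.g., facial-expression statistics
voice = rng.normal(size=(n, 8))   # e.g., acoustic features from reading aloud
scale = rng.normal(size=(n, 5))   # e.g., CSSSDS item or subscale scores

# Early fusion: concatenate modality blocks into one feature matrix.
X = np.hstack([face, voice, scale])
y = rng.integers(0, 2, size=n)    # synthetic binary screening labels

clf = GradientBoostingClassifier(random_state=0)
auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
print(f"mean AUC-ROC over 5 folds: {auc.mean():.3f}")
```

With random labels the cross-validated AUC hovers near chance; in practice the per-fold scores from such a loop are what feed the statistical model comparisons reported below.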
Statistical analysis confirmed XGBoost as the preferred algorithm in both multimodal and bimodal protocols, showing statistically significant superiority (P < 0.05) across several key metrics (multimodal recall and F1 score; bimodal AUC-ROC, AUC-PR, and F1 score). In stark contrast, the artificial neural network exhibited high volatility and low precision despite achieving perfect recall in both protocols (all P < 0.001). Statistical comparisons further confirmed the superiority of the multimodal XGBoost over its bimodal counterpart, demonstrating higher AUC-ROC (t = 4.52, P < 0.001) and AUC-PR (t = 3.87, P < 0.001), both with large effect sizes (Cohen’s d > 1.0). The multimodal model also demonstrated significantly greater stability in core discriminative metrics (AUC-ROC, AUC-PR, and recall; all P < 0.05).
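The paired comparisons above (t statistics with Cohen's d effect sizes) can be sketched as follows. The per-fold AUC values here are invented for illustration only and do not come from the study; the point is the shape of the computation, not the numbers.

```python
# Illustrative paired comparison of two models' per-fold AUC-ROC scores,
# mirroring the reported paired t-test plus Cohen's d. Values are synthetic.
import numpy as np
from scipy import stats

multimodal = np.array([0.91, 0.93, 0.92, 0.94, 0.90])  # hypothetical fold AUCs
bimodal = np.array([0.88, 0.90, 0.87, 0.91, 0.86])     # hypothetical fold AUCs

t, p = stats.ttest_rel(multimodal, bimodal)  # paired t-test across folds
diff = multimodal - bimodal
d = diff.mean() / diff.std(ddof=1)           # Cohen's d for paired samples
print(f"t = {t:.2f}, P = {p:.4f}, Cohen's d = {d:.2f}")
```

A Cohen's d above 1.0, as reported for the multimodal-versus-bimodal contrast, indicates that the mean per-fold advantage exceeds one standard deviation of the paired differences.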
The XGBoost-driven multimodal model demonstrated superior discriminative power, greater stability, and a balanced precision-recall profile compared with bimodal models and other algorithms. Nevertheless, limitations related to sample size, use of a region-specific scale, and task-driven data collection mean that further validation in larger, more diverse, and ecologically valid settings is warranted.
Core Tip: Current screening for adolescent depression relies heavily on subjective questionnaires. Therefore, we developed a multimodal protocol combining the Chinese Secondary School Students Depression Scale with objective facial and vocal data to improve detection. Our analysis showed that extreme gradient boosting outperformed other machine learning models under multimodal and bimodal settings, achieving superior performance across multiple metrics. Statistical comparisons confirmed that the multimodal extreme gradient boosting model significantly surpassed its bimodal counterpart, demonstrating higher discriminative power and greater stability.
