Published online Feb 15, 2026. doi: 10.4251/wjgo.v18.i2.113959
Revised: November 9, 2025
Accepted: December 11, 2025
Published online: February 15, 2026
Processing time: 148 Days and 13.3 Hours
Chronic atrophic gastritis (CAG) is a significant precancerous condition of gastric cancer (GC). CAG often lacks typical symptoms in its early stages, and clinical di
To construct and validate a CAG risk prediction model to achieve noninvasive and accurate identification of high-risk patients.
This study included 1268 subjects from a GC screening program. Multimodal data, including serological marker, demographic, lifestyle, and family history data, were collected. Subjects were grouped by pathological biopsy results. Least absolute shrinkage and selection operator regression was used for feature selection. A model was constructed using the random forest algorithm, evaluated with metrics such as the area under the curve (AUC), and interpreted using the SHapley Additive exPlanation (SHAP) method. The model was validated in an independent external cohort, and a web-based prediction platform was developed using Shiny.
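As a minimal illustration of the AUC metric named above, the following sketch computes it from predicted risk scores using the rank-based (Mann-Whitney) formulation: the AUC is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The function name `auc_score` and the toy labels and scores are assumptions made for illustration only; they are not code or data from the study.

```python
def auc_score(labels, scores):
    """Rank-based AUC: fraction of (positive, negative) pairs in which
    the positive case has the higher score; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = biopsy-confirmed CAG, 0 = no CAG.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
toy_auc = auc_score(labels, scores)
print(round(toy_auc, 4))
```

In this toy example, eight of the nine positive-negative pairs are ranked correctly, so the AUC is 8/9. The pairwise loop is quadratic in the number of subjects, which is acceptable for a sketch; a sort-based implementation would typically be preferred for a large cohort.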
Six key features were ultimately included: age, Helicobacter pylori (H. pylori) infection status, pepsinogen I/II ratio (PGR), smoking history, alcohol consumption history, and family history of GC. The model achieved AUCs of 0.8542 and 0.8073 in the training and testing sets, respectively, and an AUC of 0.8505 in the external validation cohort, demonstrating good generalizability and stability. SHAP analysis indicated that H. pylori infection, age, and PGR were the most important variables influencing CAG risk. The final model was successfully embedded into a web-based platform for convenient clinical application.
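To make the idea behind the SHAP ranking concrete, the sketch below computes exact Shapley values for a toy risk function by averaging each feature's marginal contribution over all subsets of the other features. Everything in it is an assumption for illustration: the function names `shapley_values` and `toy_risk`, the feature labels, and the risk increments are invented and are not study results. It also differs from the study's pipeline, which applied the SHAP method to a random forest model; exact enumeration as shown here is feasible only for a handful of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, features):
    """Exact Shapley values: weighted average of each feature's marginal
    contribution over every subset of the remaining features."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):
            # Weight for subsets of size k in the Shapley formula.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value_fn(s | {f}) - value_fn(s))
        phi[f] = total
    return phi


def toy_risk(present):
    """Hypothetical risk score; the increments are invented, not fitted."""
    risk = 0.10  # baseline risk with no risk factor present
    if "hp_infection" in present:
        risk += 0.30
    if "age" in present:
        risk += 0.20
    if "low_pgr" in present:
        risk += 0.10
    if "hp_infection" in present and "low_pgr" in present:
        risk += 0.10  # interaction term, split equally by Shapley values
    return risk


features = ["hp_infection", "age", "low_pgr"]
phi = shapley_values(toy_risk, features)
for name in sorted(phi, key=phi.get, reverse=True):
    print(name, round(phi[name], 4))
```

The attributions sum to the difference between the risk with all features present and the baseline risk, which is the additivity property that makes Shapley-based rankings interpretable.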
The random forest-based CAG prediction model is a highly accurate and interpretable tool with significant clinical utility in early screening and identifying high-risk patients.
Core Tip: This study addresses the need for a noninvasive method to screen for chronic atrophic gastritis (CAG), a key precancerous condition of gastric cancer (GC). We developed and validated a random forest machine learning model using data from 1268 subjects. The model predicts CAG risk using six easily obtainable features: Helicobacter pylori infection status, age, pepsinogen I/II ratio, smoking history, alcohol consumption history, and family history of GC. The model demonstrated good accuracy and generalizability (area under the curve of 0.8073 in the testing set and 0.8505 in the external validation cohort). A user-friendly web calculator was created for clinical application, providing a practical tool for the early identification of high-risk individuals.
