Copyright
©The Author(s) 2025.
World J Gastroenterol. Dec 21, 2025; 31(47): 112921
Published online Dec 21, 2025. doi: 10.3748/wjg.v31.i47.112921
Table 1 Summary of common general-purpose foundation models used in gastrointestinal cancer
| Name | Type | Creator | Year | Architecture | Parameters | Modality | OSS | GI cancer applications |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BERT | LLM | Google | 2018 | Encoder-only transformer | 110M (base), 340M (large) | Text | Yes | NLP, Radio, MLLM |
| GPT-3 | LLM | OpenAI | 2020 | Decoder-only transformer | 175B | Text | No | NLP |
| ViT | Vision | Google | 2020 | Encoder-only transformer | 86M (base), 307M (large), 632M (huge) | Image | Yes | Endo, Radio, PA, MLLM |
| DINOv1 | Vision | Meta | 2021 | Encoder-only transformer | 22M, 86M | Image | Yes | Endo, PA |
| CLIP | MM | OpenAI | 2021 | Encoder-encoder | 120-580M | Text, Image | Yes | Endo, Radio, MLLM, directly1 |
| GLM-130B | LLM | Tsinghua | 2022 | Encoder-decoder | 130B | Text | Yes | NLP |
| Stable Diffusion | MM | Stability AI | 2022 | Diffusion model | 1.45B | Text, Image | Yes | NLP, Endo, MLLM, directly |
| BLIP | MM | Salesforce | 2022 | Encoder-decoder | 120M (base), 340M (large) | Text, Image | Yes | Radio, MLLM, directly |
| YouChat | LLM | You.com | 2022 | Fine-tuned LLMs | Unknown | Text | No | NLP |
| Bard | MM | Google | 2023 | Based on PaLM 2 | 340B estimated | Text, Image, Audio, Code | No | NLP |
| Bing Chat | MM | Microsoft | 2023 | Fine-tuned GPT-4 | Unknown | Text, Image | No | NLP |
| Mixtral 8x7B | LLM | Mistral AI | 2023 | Decoder-only, Mixture-of-Experts (MoE) | 46.7B total (12.9B active per token) | Text | Yes | NLP |
| LLaVA | MM | Microsoft | 2023 | Vision encoder, LLM | 7B, 13B | Text, Image | Yes | PA, MLLM |
| DINOv2 | Vision | Meta | 2023 | Encoder-only transformer | 86M to 1.1B | Image | Yes | Endo, Radio, PA, MLLM, directly |
| Claude 2 | LLM | Anthropic | 2023 | Decoder-only transformer | Unknown | Text | No | NLP |
| GPT-4 | MM | OpenAI | 2023 | Decoder-only transformer | 1.8T (Estimated) | Text, Image | No | NLP, Endo, MLLM, directly |
| LLaMa 2 | LLM | Meta | 2023 | Decoder-only transformer | 7B, 13B, 34B, 70B | Text | Yes | NLP, Endo, MLLM, directly |
| SAM | Vision | Meta | 2023 | Encoder-decoder | 91M (base), 308M (large), 636M (huge) | Image | Yes | Endo, directly |
| GPT-4V | MM | OpenAI | 2023 | MM transformer | 1.8T | Text, Image | No | Endo, MLLM |
| Qwen | LLM | Alibaba | 2023 | Decoder-only transformer | 1.8B, 7B, 72B | Text | Yes | NLP, MLLM |
| GPT-4o | MM | OpenAI | 2024 | MM transformer | Unknown (Larger than GPT-4) | Text, Image, Video | No | NLP |
| LLaMa 3 | LLM | Meta | 2024 | Decoder-only transformer | 8B, 70B, 400B | Text | Yes | NLP, directly |
| Gemini 1.5 | MM | Google | 2024 | MM transformer | 1.6T | Text, Image, Video, Audio | No | NLP, Radio, directly |
| Claude 3.7 | MM | Anthropic | 2024 | Decoder-only transformer | Unknown | Text, Image | No | NLP, directly |
| YOLO-World | Vision | Tencent AI Lab | 2024 | CNN + RepVL-PAN vision-language fusion | 13-110M (depending on scale) | Text, Image | Yes | Endo, directly |
| DeepSeek | LLM | DeepSeek | 2025 | Decoder-only transformer | 671B | Text | Yes | NLP |
| Phi-4 | LLM | Microsoft | 2025 | Decoder-only transformer | 14B (plus), 7B (mini) | Text | Yes | Endo |
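
Several models above are listed as applied "directly" (i.e., zero-shot) to GI imaging, with CLIP as the prototypical case. As a minimal sketch of this pattern, the snippet below scores an endoscopy frame against candidate text labels with an off-the-shelf CLIP checkpoint via the Hugging Face transformers library; the checkpoint name, placeholder image, and label prompts are illustrative assumptions, not taken from any cited study.

```python
# Minimal sketch: zero-shot classification of an endoscopy frame with CLIP.
# The checkpoint, placeholder image, and label wording are assumptions.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224), "gray")  # stand-in for a real frame
labels = [
    "an endoscopic image of a colorectal polyp",
    "an endoscopic image of normal colonic mucosa",
]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # image-to-text similarity
probs = logits.softmax(dim=-1).squeeze().tolist()
print(dict(zip(labels, probs)))
```

Prompt wording strongly affects zero-shot accuracy, which is one reason several models in Table 3 fine-tune or prompt-tune CLIP rather than using it as-is.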
Table 2 Summary of key studies of large language models in the field of gastrointestinal cancer
| Ref. | Year | Models | Objectives | Datasets | Performance | Evaluation |
| --- | --- | --- | --- | --- | --- | --- |
| Syed et al[29] | 2022 | BERTi | Developed fine-tuned BERTi for integrated colonoscopy reports | 34165 reports | F1-scores of 91.76%, 92.25%, 88.55% for colonoscopy, pathology, and radiology | Manual chart review by 4 expert-guided reviewers |
| Lahat et al[30] | 2023 | GPT | Assessed GPT performance in addressing 110 real-world gastrointestinal inquiries | 110 real-life questions | Moderate accuracy (3.4-3.9/5) for treatment and diagnostic queries | Assessed by three gastroenterologists using a 1-5 scale for accuracy etc. |
| Lee et al[31] | 2023 | GPT-3.5 | Examined GPT-3.5’s responses to eight frequently asked colonoscopy questions | 8 colonoscopy-related questions | GPT answers had extremely low text similarity (0%-16%) | Four gastroenterologists rated the answers on a 7-point Likert scale |
| Emile et al[32] | 2023 | GPT-3.5 | Analyzed GPT-3.5’s ability to generate appropriate responses to CRC questions | 38 CRC questions | 86.8% of responses deemed appropriate, with 95% concordance with 2022 ASCRS guidelines | Three surgery experts assessed answers using ASCRS guidelines |
| Moazzam et al[33] | 2023 | GPT | Investigated the quality of GPT’s responses to pancreatic cancer-related questions | 30 pancreatic cancer questions | 80% of responses were “very good” or “excellent” | Responses were graded by 20 experts against a clinical benchmark |
| Yeo et al[34] | 2023 | GPT | Assessed GPT’s performance in answering questions regarding cirrhosis and HCC | 164 questions about cirrhosis and HCC | 79.1% correctness for cirrhosis and 74% for HCC, but only 47.3% comprehensiveness | Responses were reviewed by two hepatologists, with disagreements resolved by a third reviewer |
| Cao et al[35] | 2023 | GPT-3.5 | Examined GPT-3.5’s capacity to answer questions on liver cancer screening and diagnosis | 20 questions | 48% of answers were accurate, with frequent errors in LI-RADS categories | Six fellowship-trained physicians from three centers assessed answers |
| Gorelik et al[36] | 2024 | GPT-4 | Evaluated GPT-4’s ability to provide guideline-aligned recommendations | 275 colonoscopy reports | Aligned with experts in 87% of scenarios, showing no significant accuracy gap | Advice assessed by consensus review with multiple experts |
| Gorelik et al[37] | 2023 | GPT-4 | Analyzed GPT-4’s effectiveness in post-colonoscopy management guidance | 20 clinical scenarios | 90% followed guidelines, with 85% correctness and strong agreement (κ = 0.84) | Assessed by two senior gastroenterologists for guideline compliance |
| Zhou et al[38] | 2023 | GPT-3.5 and GPT-4 | Developed a gastric cancer consultation system and automated report generator | 23 medical knowledge questions | 91.3% appropriate gastric cancer advice (GPT-4), 73.9% for GPT-3.5 | Reviewers evaluated responses against medical standards |
| Yang et al[39] | 2025 | RECOVER (LLM) | Designed an LLM-based remote patient monitoring system for postoperative care | 7 design sessions, 5 interviews | Six major design strategies for integrating clinical guidelines and information | Clinical staff reviewed and provided feedback on the design and functionality |
| Kerbage et al[40] | 2024 | GPT-4 | Evaluated GPT-4’s accuracy in responding to IBS, IBD, and CRC screening | 65 questions (45 patients, 20 doctors) | 84% of answers were accurate | Assessed independently by three senior gastroenterologists |
| Tariq et al[41] | 2024 | GPT-3.5, GPT-4, and Bard | Compared the efficacy of GPT-3.5, GPT-4, and Bard (July 2023 version) in answering 47 common colonoscopy patient queries | 47 queries | GPT-4 outperformed GPT-3.5 and Bard, with 91.4% fully accurate responses vs 6.4% and 14.9%, respectively | Responses were scored by two specialists on a 0-2 point scale, with disagreements resolved by a third reviewer |
| Maida et al[42] | 2025 | GPT-4 | Evaluated GPT-4’s suitability in addressing screening, diagnostic, and therapeutic inquiries | 15 CRC screening inquiries | 4.8/6 for CRC screening accuracy, 2.1/3 for completeness | Assessment involved 20 experts and 20 non-experts rating the answers |
| Atarere et al[43] | 2024 | BingChat, GPT, YouChat | Tested the appropriateness of GPT, BingChat, and YouChat in patient education and patient-physician communication | 20 questions (15 on CRC screening and 5 patient-related) | GPT and YouChat provided more reliable answers than BingChat, but all models had occasional inaccuracies | Two board-certified physicians and one gastroenterologist graded the responses |
| Chang et al[44] | 2024 | GPT-4 | Compared GPT-4’s accuracy, reliability, and alignment of colonoscopy recommendations | 505 colonoscopy reports | 85.7% of cases matched USMSTF guidelines | Assessment was conducted by an expert panel under USMSTF guidelines |
| Lim et al[45] | 2024 | GPT-4 | Compared a contextualized GPT model with standard GPT in colonoscopy screening | 62 example use cases | Contextualized GPT-4 outperformed standard GPT-4 | Standard GPT-4 was compared against a model supplied with relevant screening guidelines |
| Munir et al[46] | 2024 | GPT | Evaluated the quality and utility of responses for three GI surgeries | 24 research questions | Modest quality, varying significantly by type of procedure | Responses were graded by 45 expert surgeons |
| Truhn et al[47] | 2024 | GPT-4 | Created a structured data parsing module with GPT-4 for clinical text processing | 100 CRC reports | 99% accuracy for T-stage extraction, 96% for N-stage, and 94% for M-stage | Accuracy of GPT-4 was compared with manually extracted data by experts |
| Choo et al[48] | 2024 | GPT | Designed a clinical decision-support system to generate personalized management plans | 30 stage III recurrent CRC patients | 86.7% agreement with tumor board decisions, 100% for second-line therapies | The recommendations were compared with the decision plans made by the MDT |
| Huo et al[49] | 2024 | GPT, BingChat, Bard, Claude 2 | Established a multi-AI platform framework to optimize CRC screening recommendations | Responses for 3 patient cases | GPT aligned with guidelines in 66.7% of cases, while other AIs showed greater divergence | Clinician and patient advice was compared to guidelines |
| Pereyra et al[50] | 2024 | GPT-3.5 | Optimized GPT-3.5 for personalized CRC screening recommendations | 238 physicians | GPT scored 4.57/10 for CRC screening, vs 7.72/10 for physicians | GPT answers were compared against those of a group of physicians |
| Peng et al[51] | 2024 | GPT-3.5 | Built a GPT-3.5-powered system for answering CRC-related queries | 131 CRC questions | Mean accuracy of 63.01, but low comprehensiveness scores (0.73-0.83) | Two physicians reviewed each response, with a third consulted for discrepancies |
| Ma et al[52] | 2024 | GPT-3.5 | Established GPT-3.5-based quality control for esophageal ESD procedures | 165 esophageal ESD cases | 92.5%-100% accuracy across post-ESD quality metrics | Two QC members and a senior supervisor conducted assessment |
| Cohen et al[53] | 2025 | LLaMA-2, Mistral-v0.1 | Explored the ability of LLMs to extract PD-L1 biomarker details for research purposes | 232 EHRs from 10 cancer types | Fine-tuned LLMs outperformed LSTM trained on > 10000 examples | Assessed by 3 clinical experts against manually curated answers |
| Scherbakov et al[54] | 2025 | Mixtral 8x7B | Assessed an LLM's ability to extract stressful events from the social history of clinical notes | 109556 patients, 375334 notes | Arrest or incarceration (OR = 0.26, 95%CI: 0.06-0.77) | One human reviewer assessed the precision and recall of extracted events |
| Chatziisaak et al[55] | 2025 | GPT-4 | Evaluated the concordance of therapeutic recommendations generated by GPT | 100 consecutive CRC patients | 72.5% complete concordance, 10.2% partial concordance, and 17.3% discordance | Three reviewers independently assessed concordance with MDT |
| Saraiva et al[56] | 2025 | GPT-4 | Assessed GPT-4’s performance in interpreting images in gastroenterology | 740 images | Capsule endoscopy: Accuracies 50.0%-90.0% (AUCs 0.50-0.90) | Three experts reviewed and labeled images for CE |
| Siu et al[57] | 2025 | GPT-4 | Evaluated the efficacy, quality, and readability of GPT-4’s responses | 8 patient-style questions | Accurate (4.0), safe (4.25), appropriate (4.00), actionable (4.00), effective (4.00) | Evaluated by 8 colorectal surgeons |
| Horesh et al[58] | 2025 | GPT-3.5 | Evaluated management recommendations of GPT in clinical settings | 15 colorectal or anal cancer patients | Rating of 4.8 for GPT recommendations, 4.11 for decision justification | Evaluated by 3 experienced colorectal surgeons |
| Ellison et al[59] | 2025 | GPT-3.5, Perplexity | Compared readability using different prompts | 52 colorectal surgery materials | Average 7.0-9.8, Ease 53.1-65.0, Modified 9.6-11.5 | Compared mean scores between baseline and documents generated by AI |
| Ramchandani et al[60] | 2025 | GPT-4 | Validated the use of GPT-4 for identifying articles discussing perioperative and preoperative risk factors for esophagectomy | 1967 studies for title and abstract screening | Perioperative: Agreement rate = 85.58%, AUC = 0.87. Preoperative: Agreement rate = 78.75%, AUC = 0.75 | Decisions were compared with those of three independent human reviewers |
| Zhang et al[61] | 2025 | GPT-4, DeepSeek, GLM-4, Qwen, LLaMa3 | Evaluated the consistency of LLMs in generating diagnostic records for hepatobiliary cases using the HepatoAudit dataset | 684 medical records covering 20 hepatobiliary diseases | Precision: GPT-4 reached a maximum of 93.42%. Recall: Generally below 70%, with some diseases below 40% | Professional physicians manually verified and corrected all the data |
| Spitzl et al[62] | 2025 | Claude-3.5, GPT-4o, DeepSeekV3, Gemini 2 | Assessed the capability of state-of-the-art LLMs to classify liver lesions based solely on textual descriptions from MRI reports | 88 fictitious MRI reports designed to resemble real clinical documentation | Micro F1-score and macro F1-score: Claude 3.5 Sonnet 0.91 and 0.78, GPT-4o 0.76 and 0.63, DeepSeekV3 0.84 and 0.70, Gemini 2.0 Flash 0.69 and 0.55 | Model performance was assessed using micro and macro F1-scores benchmarked against ground truth labels |
| Sheng et al[63] | 2025 | GPT-4o and Gemini | Investigated the diagnostic accuracies for focal liver lesions | 228 adult patients with CT/MRI reports | Two-step GPT-4o, single-step GPT-4o and single-step Gemini (78.9%, 68.0%, 73.2%) | Six radiologists reviewed the images and clinical information in two rounds (alone, with LLM) |
| Williams et al[64] | 2025 | GPT-4-32K | Determined whether an LLM can extract reasons for a lack of follow-up colonoscopy | 846 patients' clinical notes | Overall accuracy: 89.3%; most common reason: Refused/not interested (35.2%) | A physician reviewer checked 10% of LLM-generated labels |
| Lu et al[65] | 2025 | MoE-HRS | Used a novel MoE combined with LLMs for risk prediction and personalized healthcare recommendations | SNPs, medical and lifestyle data from United Kingdom Biobank | MoE-HRS outperformed state-of-the-art cancer risk prediction models in terms of ROC-AUC, precision, recall, and F1 score | LLM-generated advice was validated by clinical medical staff |
| Yang et al[66] | 2025 | GPT-4 | Explored the use of LLMs to enhance doctor-patient communication | 698 pathology reports of tumors | Average communication time decreased by over 70%, from 35 to 10 min (P < 0.001) | Pathologists evaluated the consistency between original and AI reports |
| Jain et al[67] | 2025 | GPT-4, GPT-3.5, Gemini | Studied the performance of LLMs across 20 clinicopathologic scenarios in gastrointestinal pathology | 20 clinicopathologic scenarios in GI | Diagnostic accuracy: Gemini Advanced (95%, P = 0.01), GPT-4 (90%, P = 0.05), GPT-3.5 (65%) | Two fellowship-trained pathologists independently assessed the responses of the models |
| Xu et al[68] | 2025 | GPT-4, GPT-4o, Gemini | Assessed the performance of LLMs in predicting immunotherapy response in unresectable HCC | Multimodal data from 186 patients | Accuracy and sensitivity: GPT-4o (65% and 47%); Gemini-GPT (68% and 58%); physicians (72% and 70%) | Six physicians (three radiologists and three oncologists) independently assessed the same dataset |
| Deroy et al[69] | 2025 | GPT-3.5 Turbo | Explored the potential of LLMs as a question-answering (QA) tool | 30 training and 50 testing queries | A1: 0.546 (maximum value); A2: 0.881 (maximum value across three runs) | Model-generated answers were compared to the gold standard |
| Ye et al[70] | 2025 | BioBERT-based | Proposed a novel framework that incorporates clinical features to enhance multi-omics clustering for cancer subtyping | Six cancer datasets across three omics levels | Mean survival score of 2.20, significantly higher than other methods | Three independent clinical experts reviewed and validated the clustering results |
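
Several of the studies above (e.g., Truhn et al[47] for TNM staging, Williams et al[64] for follow-up reasons) use an LLM as a structured-information extractor over free-text reports. The sketch below shows the general pattern with the OpenAI Python client; the prompt, JSON schema, and model name are illustrative assumptions, not the published pipelines.

```python
# Minimal sketch of LLM-based structured extraction from a free-text CRC
# pathology report, in the spirit of Truhn et al. Prompt, schema, and model
# name are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def extract_tnm(report_text: str) -> dict:
    """Ask the model to return TNM staging as strict JSON."""
    prompt = (
        "Extract the TNM stage from this colorectal cancer pathology report. "
        'Reply with JSON only, e.g. {"T": "T3", "N": "N1", "M": "M0"}.\n\n'
        + report_text
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # hypothetical choice; any capable chat model works
        messages=[{"role": "user", "content": prompt}],
        temperature=0,   # deterministic output for information extraction
    )
    # Assumes the model complied with the JSON-only instruction
    return json.loads(resp.choices[0].message.content)

print(extract_tnm("Adenocarcinoma invading the subserosa (pT3 pN1 cM0)."))
```

Deterministic, schema-constrained output is what makes the direct accuracy comparisons against expert-extracted labels in these studies straightforward.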
Table 3 Summary of key studies of vision foundation models-assisted endoscopy in the field of gastrointestinal cancer
| Model | Year | Architecture | Training algorithm | Parameters | Datasets | Disease studied | Model type | Source code link |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Surgical-DINO[76] | 2023 | DINOv2 | LoRA layers added to DINOv2, optimizing the LoRA layers | 86.72M | SCARED, Hamlyn | Endoscopic Surgery | Vision | https://github.com/BeileiCui/SurgicalDINO |
| ProMISe[77] | 2023 | SAM (ViT-B) | APM and IPS modules are trained while keeping SAM frozen | 1.3-45.6M | EndoScene, ColonDB etc. | Polyps, Skin Cancer | Vision | NA |
| Polyp-SAM[78] | 2023 | SAM | Fine-tunes only the mask decoder while freezing all encoders | NA | CVC-ColonDB, Kvasir etc. | Colon Polyps | Vision | https://github.com/ricklisz/Polyp-SAM |
| Endo-FM[79] | 2023 | ViT B/16 | Pretrained using a self-supervised teacher-student framework, and fine-tuned on downstream tasks | 121M | Colonoscopic, LDPolyp etc. | Polyps, erosion, etc. | Vision | https://github.com/med-air/Endo-FM |
| ColonGPT[80] | 2024 | SigLIP-SO, Phi1.5 | Pre-alignment with image-caption pairs, followed by supervised fine-tuning using LoRA | 0.4-1.3B | ColonINST (30k+ images) | Colorectal polyps | Vision | https://github.com/ColonGPT/ColonGPT |
| DeepCPD[81] | 2024 | ViT | Hyperparameters optimized for colonoscopy datasets, trained with the Adam optimizer | NA | PolypsSet, CP-CHILD-A etc. | CRC | Vision | https://github.com/Zhang-CV/DeepCPD |
| OneSLAM[82] | 2024 | Transformer (CoTracker) | Zero-shot adaptation using TAP + Local Bundle Adjustment | NA | SAGE-SLAM, C3VD etc. | Laparoscopy, Colon | Vision | https://github.com/arcadelab/OneSLAM |
| EIVS[83] | 2024 | Vision Mamba, CLIP | Unsupervised cycle-consistency | 63.41M | 613 WLE, 637 images | Gastrointestinal | Vision | NA |
| APT[84] | 2024 | SAM | Parameter-efficient fine-tuning | NA | Kvasir-SEG, EndoTect etc. | CRC | Vision | NA |
| FCSAM[85] | 2024 | SAM | LayerNorm LoRA fine-tuning strategy | 1.2M | Gastric cancer (630 pairs) etc. | GC, Colon Polyps | Vision | NA |
| DuaPSNet[86] | 2024 | PVTv2-B3 | Transfer learning with pre-trained PVTv2-B3 on ImageNet | NA | LaribPolypDB, ColonDB etc. | CRC | Vision | https://github.com/Zachary-Hwang/Dua-PSNet |
| EndoDINO[87] | 2025 | ViT (B, L, g) | DINOv2 methodology, hyperparameters tuning | 86M to 1B | HyperKvasir, LIMUC | GI Endoscopy | Vision | https://github.com/ZHANGBowen0208/EndoDINO/ |
| PolypSegTrack[88] | 2025 | DINOv2 | One-step fine-tuning on colonoscopic videos without first pre-training | NA | ETIS, CVC-ColonDB etc. | Colon polyps | Vision | NA |
| AiLES[89] | 2025 | RF-Net | Not fine-tuned from an external model | NA | 100 GC patients | Gastric cancer | Vision | https://github.com/CalvinSMU/AiLES |
| PPSAM[90] | 2025 | SAM | Fine-tuning with variable bounding box prompt perturbations | NA | EndoScene, ColonDB etc. | Colon polyps | Vision | https://github.com/SLDGroup/PP-SAM |
| SPHINX-Co[91] | 2024 | LLaMA-2 + SPHINX-X | Fine-tuned SPHINX-X on CoPESD with cosine learning rate scheduler | 7B, 13B | CoPESD | Gastric cancer | Multimodal | https://github.com/gkw0010/CoPESD |
| LLaVA-Co[91] | 2024 | LLaVA-1.5 (CLIP-ViT-L) | Fine-tuned LLaVA-1.5 on CoPESD with cosine learning rate scheduler | 7B, 13B | CoPESD | Gastric cancer | Multimodal | https://github.com/gkw0010/CoPESD |
| ColonCLIP[92] | 2025 | CLIP | Prompt tuning with frozen CLIP, then encoder fine-tuning with frozen prompts | 57M, 86M | OpenColonDB | CRC | Multimodal | https://github.com/Zoe-TAN/ColonCLIP-OpenColonDB |
| PSDM[93] | 2025 | Stable Diffusion + CLIP | Continual learning with prompt replay to incrementally train on multiple datasets | NA | PolypGen, ColonDB, Polyplus etc. | CRC | Vision, Generative | The original paper reported a GitHub link for this model, but it is currently unavailable |
| PathoPolypDiff[94] | 2025 | Stable Diffusion v1-4 | Fine-tuned Stable Diffusion v1-4 and locked first U-Net block, fine-tuned remaining blocks | NA | ISIT-UMR Colonoscopy Dataset | CRC | Generative | https://github.com/Vanshali/PathoPolyp-Diff |
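
A recurring recipe in Table 3 (Surgical-DINO, APT, FCSAM) is parameter-efficient fine-tuning: the foundation backbone stays frozen while small LoRA adapters are trained. Below is a minimal sketch of that idea using the transformers and peft libraries; the checkpoint, target modules, and hyperparameters are illustrative assumptions, not the published configurations.

```python
# Minimal sketch of LoRA adaptation of a frozen DINOv2 backbone, the pattern
# used by Surgical-DINO and related models. All hyperparameters are assumptions.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModel

backbone = AutoModel.from_pretrained("facebook/dinov2-base")
config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.1,
    target_modules=["query", "value"],  # attention projections in DINOv2
)
model = get_peft_model(backbone, config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable

# A lightweight task head (e.g., a polyp segmentation decoder) would be
# trained jointly on endoscopy data while the backbone weights stay frozen.
features = model(pixel_values=torch.randn(1, 3, 224, 224)).last_hidden_state
print(features.shape)
```

Because well under 1% of the backbone's weights receive gradients, this recipe suits the modest endoscopy datasets listed above.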
Table 4 Summary of key studies of vision foundation models-assisted radiology in the field of gastrointestinal cancer
| Model | Year | Architecture | Training algorithm | Parameters | Datasets | Disease studied | Model type | Source code link |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PubMedCLIP[98] | 2021 | CLIP | Fine-tuned on ROCO dataset for 50 epochs with Adam optimizer | NA | ROCO, VQA-RAD, SLAKE | Abdomen samples | Multimodal | https://github.com/sarahESL/PubMedCLIP |
| RadFM[97] | 2023 | MedLLaMA-13B | Pre-trained on MedMD and fine-tuned on RadMD | 14B | MedMD, RadMD etc. | Over 5000 diseases | Multimodal | https://github.com/chaoyi-wu/RadFM |
| Merlin[99] | 2024 | I3D-ResNet152 | Multi-task learning with EHR and radiology reports and fine-tuning for specific tasks | NA | 6M images, 6M codes and reports | Multiple diseases, Abdominal | Multimodal | NA |
| MedGemini[100] | 2024 | Gemini | Fine-tuning Gemini 1.0/1.5 on medical QA, multimodal and long-context corpora | 1.5B | MedQA, NEJM, GeneTuring | Various | Multimodal | https://github.com/Google-Health/med-gemini-medqa-relabelling |
| HAIDEF[101] | 2024 | VideoCoCa | Fine-tuning on downstream tasks with limited labeled data | NA | CT volumes and reports | Various | Vision | https://huggingface.co/collections/google/ |
| CTFM[102] | 2024 | Vision Model1 | Trained using a self-supervised learning strategy, employing a SegResNet encoder for the pre-training phase | NA | 26298 CT scans | CT scans (stomach, colon) | Vision | https://aim.hms.harvard.edu/ct-fm |
| MedVersa[103] | 2024 | Vision Model1 | Trained from scratch on the MedInterp dataset and adapted to various medical imaging tasks | NA | MedInterp | Various | Vision | https://github.com/3clyp50/MedVersa_Internal |
| iMD4GC[104] | 2024 | Transformer-based2 | A novel multimodal fusion architecture with cross-modal interaction and knowledge distillation | NA | GastricRes/Sur, TCGA etc. | Gastric cancer | Multimodal | https://github.com/FT-ZHOU-ZZZ/iMD4GC/ |
| Yasaka et al[105] | 2025 | BLIP-2 | LoRA with specific fine-tuning of the fc1 layer in the vision and Q-Former models | NA | 5777 CT scans | Esophageal cancer via chest CT | Multimodal | NA |
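
As a concrete illustration of the vision-language pattern in Table 4, the sketch below queries a public BLIP-2 checkpoint (the architecture Yasaka et al[105] adapted) about a single CT slice; the checkpoint, prompt, and blank placeholder image are assumptions for illustration, and the study's LoRA fine-tuning is not reproduced.

```python
# Minimal sketch: visual question answering on a CT slice with public BLIP-2.
# Checkpoint, prompt, and placeholder image are illustrative assumptions.
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

slice_img = Image.new("RGB", (364, 364), "black")  # stand-in for a CT slice
prompt = "Question: is an esophageal mass visible? Answer:"
inputs = processor(images=slice_img, text=prompt, return_tensors="pt")

out = model.generate(**inputs, max_new_tokens=20)
print(processor.decode(out[0], skip_special_tokens=True))
```

In the published workflow, LoRA adapters on the fc1 layers of the vision encoder and Q-Former (as noted in the table) specialize this generic pipeline to chest CT.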
Table 5 Summary of key studies of vision foundation models-assisted pathology in the field of gastrointestinal cancer
| Model | Year | Architecture | Training algorithm | Parameters | WSIs | Tissues | Open source link |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LUNIT-SSL[110] | 2021 | ViT-S | DINO; full fine-tuning and linear evaluation on downstream tasks | 22M | 3.7K | 32 | https://lunit-io.github.io/research/publications/pathology_ssl |
| CTransPath[111] | 2022 | Swin Transformer | MoCoV3 (SRCL); frozen backbone with linear classifier fine-tuning | 28M | 32K | 32 | https://github.com/Xiyue-Wang/TransPath |
| Phikon[112] | 2023 | ViT-B | iBOT (Masked Image Modeling); fine-tuned with ABMIL/TransMIL on frozen features | 86M | 6K | 16 | https://github.com/owkin/HistoSSLscaling |
| REMEDIS[113] | 2023 | BiT-L (ResNet-152) | SimCLR (contrastive learning); end-to-end fine-tuning on labeled ID/OOD data | 232M | 29K | 32 | https://github.com/google-research/simclr |
| Virchow[114] | 2024 | ViT-H, DINOv2 | DINOv2 (SSL); used frozen embeddings with simple aggregators | 632M | 1.5M | 17 | https://huggingface.co/paige-ai/Virchow |
| Virchow2[115] | 2024 | ViT-H | DINOv2 (SSL); fine-tuned with linear probes or full-tuning on downstream tasks | 632M | 3.1M | 25 | https://huggingface.co/paige-ai/Virchow2 |
| Virchow2G[115] | 2024 | ViT-G | DINOv2 (SSL); fine-tuned with linear probes or full fine-tuning | 1.9B | 3.1M | 25 | https://huggingface.co/paige-ai/Virchow2 |
| Virchow2G mini[115]1 | 2024 | ViT-S, Virchow2G | DINOv2 (SSL); distilled from Virchow2G, then fine-tuned on downstream tasks | 22M | 3.2M | 25 | https://huggingface.co/paige-ai/Virchow2 |
| UNI[9] | 2024 | ViT-L | DINOv2 (SSL); used frozen features with linear probes or few-shot learning | 307M | 100K | 20 | https://github.com/mahmoodlab/UNI |
| Phikon-v2[116] | 2024 | ViT-L | DINOv2 (SSL); frozen ViT and ABMIL ensemble fine-tuning | 307M | 58K | 30 | https://huggingface.co/owkin/phikon-v2 |
| RudolfV[117] | 2024 | ViT-L | DINOv2 (SSL); fine-tuned with optimizing linear classification layer and adapting encoder weights | 304M | 103K | 58 | https://github.com/rudolfv |
| HIBOU-B[118] | 2024 | ViT-B | DINOv2 (SSL); frozen feature extractor, trained linear classifier or attention pooling | 86M | 1.1M | 12 | https://github.com/HistAI/hibou |
| HIBOU-L[118]2 | 2024 | ViT-L | DINOv2 (SSL); frozen feature extractor, trained linear classifier or attention pooling | 307M | 1.1M | 12 | https://github.com/HistAI/hibou |
| H-Optimus-03 | 2024 | ViT-G | DINOv2 (SSL); linear probe and ABMIL on frozen features | 1.1B | > 500K | 32 | https://github.com/bioptimus/releases/ |
| Madeleine[119] | 2024 | CONCH | MAD-MIL; linear probing, prototyping, and full fine-tuning for downstream tasks | 86M | 23K | 2 | https://github.com/mahmoodlab/MADELEINE |
| COBRA[120] | 2024 | Mamba-2 | Self-supervised contrastive pretraining with multiple FMs and Mamba2 architecture | 15M | 3K | 6 | https://github.com/KatherLab/COBRA |
| PLUTO[121] | 2024 | FlexiVit-S | DINOv2; frozen backbone with task-specific heads for fine-tuning | 22M | 158K | 28 | NA |
| HIPT[122] | 2025 | ViT-HIPT | DINO (SSL); fine-tune with gradient accumulation | 10M | 11K | 33 | https://github.com/mahmoodlab/HIPT |
| PathoDuet[123] | 2025 | ViT-B | MoCoV3; fine-tuned using standard supervised learning on labeled downstream task data | 86M | 11K | 32 | https://github.com/openmedlab/PathoDuet |
| Kaiko[124] | 2025 | ViT-L | DINOv2 (SSL); linear probing with frozen encoder on downstream tasks | 303M | 29K | 32 | https://github.com/kaiko-ai/towards_large_pathology_fms |
| PathOrchestra[125] | 2025 | ViT-L | DINOv2; ABMIL, linear probing, weakly supervised classification | 304M | 300K | 20 | https://github.com/yanfang-research/PathOrchestra |
| THREADS[126] | 2025 | ViT-L, CONCHv1.5 | Fine-tune gene encoder, initialize patch encoder randomly | 16M | 47K | 39 | https://github.com/mahmoodlab/trident |
| H0-mini[127] | 2025 | ViT | Using knowledge distillation from H-Optimus-0 | 86M | 6K | 16 | https://huggingface.co/bioptimus/H0-mini |
| TissueConcepts[128] | 2025 | Swin Transformer | Frozen encoder with linear probe for downstream tasks | 27.5M | 7K | 14 | https://github.com/FraunhoferMEVIS/MedicalMultitaskModeling |
| OmniScreen[129] | 2025 | Virchow2 | Attention-aggregated Virchow2 embeddings fine-tuning | 632M | 48K | 27 | https://github.com/OmniScreen |
| BROW[130] | 2025 | ViT-B | DINO (SSL); self-distillation with multi-scale and augmented views | 86M | 11K | 6 | NA |
| BEPH[131] | 2025 | BEiTv2 | BEiTv2 (SSL); supervised fine-tuning on clinical tasks with labeled data | 86M | 11K | 32 | https://github.com/Zhcyoung/BEPH |
| Atlas[132] | 2025 | ViT-H, RudolfV | DINOv2; linear probing with frozen backbone on downstream tasks | 632M | 1.2M | 70 | NA |
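
Most Table 5 entries are evaluated by linear probing: the pathology encoder stays frozen and only a linear classifier is fitted on tile embeddings. The sketch below illustrates that protocol with the publicly released Phikon encoder[112]; the placeholder tiles and labels are illustrative assumptions standing in for labeled H&E patches.

```python
# Minimal sketch of linear probing on frozen pathology-FM tile embeddings.
# "owkin/phikon" is the public Phikon checkpoint cited above; the tiles and
# labels below are placeholders for real tumor/normal H&E patches.
import numpy as np
import torch
from PIL import Image
from sklearn.linear_model import LogisticRegression
from transformers import AutoImageProcessor, AutoModel

processor = AutoImageProcessor.from_pretrained("owkin/phikon")
encoder = AutoModel.from_pretrained("owkin/phikon").eval()

@torch.no_grad()
def embed(tiles):
    """Return one CLS-token embedding per input tile."""
    inputs = processor(images=tiles, return_tensors="pt")
    return encoder(**inputs).last_hidden_state[:, 0].numpy()

tiles = [Image.new("RGB", (224, 224), c) for c in ("white", "gray") * 4]
labels = np.array([0, 1] * 4)  # hypothetical tumor/normal labels

probe = LogisticRegression(max_iter=1000).fit(embed(tiles), labels)
print("training accuracy:", probe.score(embed(tiles), labels))
```

The same frozen features feed the attention-based slide-level aggregators used throughout the table (see the sketch after Table 6).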
Table 6 Summary of key studies of multimodal large language models in the field of gastrointestinal cancer
| Model | Year | Vision architecture | Vision dataset | WSIs | Text model | Text dataset | Parameters | Tissues | Generative | Open source link |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PLIP[136] | 2023 | CLIP | OpenPath | 28K | CLIP | OpenPath | NA | 32 | Captioning | https://github.com/PathologyFoundation/plip |
| HistGen[137] | 2023 | DINOv2, ViT-L | Multiple | 55K | LGH Module | TCGA | Approximately 100M | 32 | Report generation | https://github.com/dddavid4real/HistGen |
| PathAlign[138] | 2023 | PathSSL | Custom | 350K | BLIP-2 | Diagnostic reports | Approximately 100M | 32 | Report generation | https://github.com/elonybear/PathAlign |
| CHIEF[139] | 2024 | CTransPath | 14 Sources | 60K | CLIP | Anatomical information | 27.5M, 63M | 19 | No | https://github.com/hms-dbmi/CHIEF |
| PathGen[140] | 2024 | LLaVA, CLIP | TCGA | 7K | CLIP | 1.6M pairs | 13B | 32 | WSI assistant | https://github.com/PathFoundation/PathGen-1.6M |
| PathChat[141] | 2024 | UNI | Multiple | 999K | LLaMa 2 | Pathology instructions | 13B | 20 | AI assistant | https://github.com/fedshyvana/pathology_mllm_training |
| PathAsst[142] | 2024 | PathCLIP | PathCap | 207K | Vicuna-13B | Pathology instructions | 13B | 32 | AI assistant | https://github.com/superjamessyx/Generative-Foundation-AI-Assistant-for-Pathology |
| ProvGigaPath[143] | 2024 | ViT | Prov-Path | 171K | OpenCLIP | 17K Reports | 1.13B | 31 | No | https://github.com/prov-gigapath/prov-gigapath |
| TITAN[144] | 2024 | ViT | Mass340K | 336K | CoCa | Medical reports | Approximately 5B | 20 | Report generation | https://github.com/mahmoodlab/TITAN |
| CONCH[145] | 2024 | ViT | Multiple | 21K | GPT-style | 1.17M pairs | NA | 19 | Captioning | http://github.com/mahmoodlab/CONCH |
| SlideChat[146] | 2024 | CONCH, LongNet | TCGA | 4915 | Qwen2.5-7B | Slide Instructions | 7B | 10 | WSI assistant | https://github.com/uni-medical/SlideChat |
| PMPRG[147] | 2024 | MR-ViT | Custom | 7422 | GPT-2 | Pathology Reports | NA | 2 | Multi-organ report | https://github.com/hvcl/Clinical-grade-PathologyReport-Generation |
| MuMo[148] | 2024 | MnasNet | Custom | 429 | Transformer | PathoRadio Reports | NA | 1 | No | https://github.com/czifan/MuMo |
| ConcepPath[149] | 2024 | ViT-B, CONCH | Quilt-1M | 2243 | CLIP, GPT | PubMed | Approximately 187M | 3 | No | https://github.com/HKU-MedAI/ConcepPath |
| GPT-4V[150] | 2024 | Phikon ViT-B | CRC-7K, MHIST etc. | 338K | GPT-4 | NA | 40M | 3 | Report generation | https://github.com/Dyke-F/GPT-4V-In-Context-Learning |
| MINIM[151] | 2024 | Stable Diffusion | Multiple | NA | BERT, CLIP | Multiple | NA | 6 | Report generation | https://github.com/WithStomach/MINIM |
| PathM3[152] | 2024 | ViT-g/14 | PatchGastric | 991 | FlanT5XL | PatchGastric | NA | 1 | WSI assistant | NA |
| FGCR[153] | 2024 | ResNet50 | Custom, GastrADC | 3598, 991 | BERT | NA | 9.21M | 6 | Report generation | https://github.com/hudingyi/FGCR |
| PromptBio[154] | 2024 | PLIP | TCGA, CPTAC | 482, 105 | GPT-4 | NA | NA | 1 | Report generation | https://github.com/DeepMed-Lab-ECNU/PromptBio |
| HistoCap[155] | 2024 | ViT | NA | 10K | BERT, BioBERT | GTEx datasets | NA | 40 | Report generation | https://github.com/ssen7/histo_cap_transformers |
| mSTAR[156] | 2024 | UNI | TCGA | 10K | BioBERT | Pathology Reports 11K | NA | 32 | Report generation | https://github.com/Innse/mSTAR |
| GPT-4 Enhanced[157] | 2025 | CTransPath | TCGA | NA | GPT-4 | ASCO, ESMO, Onkopedia | NA | 4 | Recommendation generation | https://github.com/Dyke-F/LLM_RAG_Agent |
| PRISM[158] | 2025 | Virchow, ViT-H | Virchow dataset | 587K | BioGPT | 195K Reports | 632M | 17 | Report generation | NA |
| HistoGPT[159] | 2025 | CTransPath, UNI | Custom | 15K | BioGPT | Pathology Reports | 30M to 1.5B | 1 | WSI assistant | https://github.com/marrlab/HistoGPT |
| PathologyVLM[160] | 2025 | PLIP, CLIP | PCaption-0.8M | NA | LLaVA | PCaption-0.5M | NA | Multi | Report generation | https://github.com/ddw2AIGROUP2CQUP/PA-LLaVA |
| MUSK[161] | 2025 | Transformer | TCGA | 33K | Transformer | PubMed Central | 675M | 33 | Question answering | https://github.com/Lilab-stanford/MUSK |
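
Many slide-level systems in Tables 5 and 6 aggregate frozen tile embeddings with gated attention-based multiple-instance learning (ABMIL) before classification or report generation. The self-contained sketch below shows that pooling step; the embedding dimension and the random toy input are illustrative assumptions.

```python
# Minimal sketch of gated attention-based MIL pooling (ABMIL), the slide-level
# aggregation step behind several systems in Tables 5 and 6. Dimensions and
# the random toy input are illustrative assumptions.
import torch
import torch.nn as nn

class GatedAttentionPool(nn.Module):
    """Pool N tile embeddings into one slide embedding with learned attention."""
    def __init__(self, dim=768, hidden=256):
        super().__init__()
        self.V = nn.Linear(dim, hidden)  # tanh branch
        self.U = nn.Linear(dim, hidden)  # sigmoid gate branch
        self.w = nn.Linear(hidden, 1)    # scalar attention score per tile

    def forward(self, tiles):  # tiles: (N, dim), e.g., from a frozen FM
        a = self.w(torch.tanh(self.V(tiles)) * torch.sigmoid(self.U(tiles)))
        weights = torch.softmax(a, dim=0)        # (N, 1) attention over tiles
        return (weights * tiles).sum(dim=0), weights

pool = GatedAttentionPool()
slide_vec, attn = pool(torch.randn(500, 768))    # 500 toy tile embeddings
print(slide_vec.shape, attn.shape)               # (768,) and (500, 1)
```

The attention weights double as a tile-level heat map, which is how several of these systems surface supporting evidence for their generated reports.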
- Citation: Shi L, Huang R, Zhao LL, Guo AJ. Foundation models: Insights and implications for gastrointestinal cancer. World J Gastroenterol 2025; 31(47): 112921
- URL: https://www.wjgnet.com/1007-9327/full/v31/i47/112921.htm
- DOI: https://dx.doi.org/10.3748/wjg.v31.i47.112921
