Akbulut S, Colak C. Pioneering efficient deep learning architectures for enhanced hepatocellular carcinoma prediction and clinical translation. World J Gastrointest Oncol 2026; 18(2): 113870 [DOI: 10.4251/wjgo.v18.i2.113870]
Corresponding Author of This Article
Sami Akbulut, MD, FACS, Professor, Surgery and Liver Transplantation, Inonu University Faculty of Medicine, Elazig Yolu 10 Km, Malatya 44280, Türkiye. akbulutsami@gmail.com
Research Domain of This Article
Gastroenterology & Hepatology
Article-Type of This Article
Systematic Reviews
Open-Access Policy of This Article
This article is an open-access article which was selected by an in-house editor and fully peer-reviewed by external reviewers. It is distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/
Author contributions: Akbulut S and Colak C conceived the project and designed the research; Colak C performed the literature analysis and prepared the tables. Both authors contributed to writing the manuscript and approved the final version.
Conflict-of-interest statement: All the authors report no relevant conflicts of interest for this article.
PRISMA 2009 Checklist statement: This is a narrative review and does not require the full PRISMA protocol of a systematic review. A reproducible search string and a summary SANRA table were added.
Received: September 5, 2025 Revised: September 11, 2025 Accepted: November 26, 2025 Published online: February 15, 2026 Processing time: 151 Days and 8.4 Hours
Abstract
BACKGROUND
Hepatocellular carcinoma (HCC) is a major cause of cancer-related mortality worldwide and is often diagnosed at advanced stages, reducing opportunities for curative treatment. Current screening tools, including ultrasonography with or without alpha-fetoprotein, lack sufficient sensitivity for early detection. Deep learning (DL) has emerged as a transformative approach, capable of detecting subtle, high-dimensional patterns in ultrasonography, computed tomography, magnetic resonance imaging, histopathological whole-slide images, and electronic health records. Convolutional neural networks, recurrent neural networks, and Transformer-based models have achieved strong performance in classification, segmentation, and risk prediction tasks, with sensitivities and specificities frequently above 89% and 90%. In some applications, DL has matched or even exceeded expert interpretation. However, the high computational cost and limited feasibility in real-time, resource-constrained settings remain major barriers to adoption.
AIM
To synthesize recent efficiency-oriented strategies that address these computational barriers and to provide guidance for the clinical translation of DL in HCC prediction.
METHODS
Lightweight architectures such as MobileNet and EfficientNet, model compression through pruning and quantization, and data-efficient methods like self-supervised pretraining and targeted augmentation enable smaller and faster models without major loss of accuracy. Hybrid or pseudo-3D approaches that summarize volumetric information from sequential slices further reduce computational load, while multimodal fusion of imaging, clinical, and omics data extends applications beyond detection toward personalized prognostication and treatment guidance. These developments highlight that efficiency is essential for real-world deployment, not merely a technical refinement. Nonetheless, significant gaps remain.
RESULTS
Most studies are retrospective, single-center, and limited in sample size, underscoring the need for rigorous external validation across multicenter cohorts and prospective trials assessing patient-relevant outcomes. Bias and fairness audits are critical to ensure equitable performance across demographic and etiological groups, while privacy-preserving strategies such as federated learning are required to harness diverse datasets securely. Seamless integration into hospital workflows, including Picture Archiving and Communication Systems and Electronic Medical Records using standards such as substitutable medical applications reusable technologies on fast healthcare interoperability resources, together with clear regulatory frameworks and post-market monitoring, will be essential for safe and scalable clinical translation. In conclusion, efficient and explainable DL offers a promising path to earlier detection and more personalized therapy in HCC. Achieving this potential will require not only technical innovation but also disciplined validation, thoughtful design for resource-limited contexts, and strong collaboration between clinicians, engineers, and regulators.
CONCLUSION
This review synthesizes current advances, identifies persistent challenges, and provides guidance for developing efficient DL systems that are both clinically relevant and broadly deployable.
Core Tip: Hepatocellular carcinoma remains a major cause of cancer-related mortality worldwide, with limited sensitivity of current screening tools for early detection. Deep learning (DL) offers great promise, but computational demands often hinder its clinical use. This review highlights advances in efficiency-oriented strategies, including lightweight architectures, pruning, quantization, and multimodal integration, which enable smaller and faster models without major loss of accuracy. By emphasizing validation, fairness audits, regulatory alignment, and workflow integration, we provide guidance for developing explainable and efficient DL solutions that are clinically deployable and impactful.
Citation: Akbulut S, Colak C. Pioneering efficient deep learning architectures for enhanced hepatocellular carcinoma prediction and clinical translation. World J Gastrointest Oncol 2026; 18(2): 113870
Liver cancers are divided into two groups: Primary tumors, arising from hepatocytes, cholangiocytes, or other hepatic parenchymal cells, and secondary (metastatic) tumors[1]. Hepatocellular carcinoma (HCC) constitutes the vast majority of primary liver cancers, accounting for roughly 75%-85% of cases[2]. According to GLOBOCAN database estimates, liver cancer ranks sixth in global cancer incidence but third in cancer-related mortality, with 865269 new cases and 757948 deaths annually, underscoring both its high incidence and mortality[3,4]. Established etiological factors include chronic hepatitis B virus (HBV) and hepatitis C virus (HCV) infections, prolonged heavy alcohol consumption, and, increasingly, metabolic dysfunction-associated steatotic liver disease (MASLD, previously known as non-alcoholic fatty liver disease). Of particular importance is its progressive inflammatory form, metabolic dysfunction-associated steatohepatitis (previously non-alcoholic steatohepatitis), which is fueled by the global rise in metabolic syndrome[5-8]. Other important risk factors include environmental exposures such as aflatoxin contamination and smoking, as well as genetic and epigenetic alterations involving TP53 mutations and aberrant Wnt/β-catenin signaling pathways[5,9-11]. Less common but clinically relevant conditions, including autoimmune hepatitis, hereditary hemochromatosis, and alpha-1 antitrypsin deficiency, have also been linked to an increased risk of HCC[5,12,13]. Carcinogenesis typically evolves through a stepwise progression from inflammation and fibrosis to cirrhosis, which serves as the highest-risk premalignant state, although HCC can also develop in non-cirrhotic livers, notably in chronic HBV infection and advanced MASLD[14-17].
A critical challenge in HCC management is its frequently asymptomatic presentation during early, potentially curable stages. Consequently, approximately 60%-70% of patients are diagnosed at intermediate or advanced stages [Barcelona Clinic Liver Cancer (BCLC) stages B or C], where curative options are severely limited[18-22]. This diagnostic delay stems from the non-specific nature of early symptoms (fatigue, vague abdominal discomfort) and the inherent limitations of current surveillance strategies. Although decades of research have established standardized surveillance and diagnostic protocols, outcomes remain suboptimal. The current approach relies mainly on abdominal ultrasonography (US), with or without alpha-fetoprotein (AFP), every 6 months for high-risk individuals, most commonly patients with cirrhosis, but also certain non-cirrhotic groups such as those with chronic HBV infection or advanced fibrosis[8,23-25]. When nodules are detected, they are subsequently characterized using multiphasic contrast-enhanced computed tomography (CT) or magnetic resonance imaging (MRI)[23,24]. The sensitivity of US for detecting early-stage HCC is variable, reported as low as 47%-63%, and is highly operator-dependent. Diagnostic accuracy is particularly limited in patients with obesity or nodular cirrhotic livers, making consistent early detection challenging[26-30]. Contrast-enhanced CT has demonstrated higher sensitivity for early-stage HCC. In a prospective intra-individual study in high-risk patients, biannual low-dose CT achieved 83.3% sensitivity[31]. MRI demonstrates excellent performance in HCC surveillance, particularly when using gadoxetic acid-enhanced full-sequence protocols, which achieved 84.8% sensitivity for very early-stage HCC[31]. Moreover, abbreviated MRI protocols with gadoxetic acid have reported promising results, with sensitivity ranging from 80%-90% and specificity from 91%-98%, while significantly shortening scan times[31,32]. 
Non-contrast MRI, based on diffusion-weighted and T2-weighted imaging, has also been investigated and demonstrated sensitivity of 83%-87% and specificity of 87%-91%. However, the high cost, long acquisition times, and limited availability of MRI remain barriers to its widespread use in routine surveillance[31].
Although curative-intent therapies - including surgical resection, radiofrequency ablation, and liver transplantation - achieve favorable long-term outcomes in early-stage disease[33], the overall 5-year survival rate for all patients with HCC remains only 18%-22% in population-based analyses, largely because most are diagnosed at intermediate or advanced stages according to the BCLC classification[34-38]. In contrast, stage- and treatment-specific outcomes are substantially higher: 5-year survival exceeds 70% after liver transplantation[39-42], reaches 30%-70% following surgical resection, and ranges between 40%-70% after locoregional therapies[41,43-46]. This stark reality underscores a persistent and critical unmet medical need, namely the development of novel and improved strategies for early detection and accurate risk stratification to enable timely therapeutic intervention. To effectively address this urgent need, artificial intelligence (AI)[47], including its subfields of machine learning (ML)[47-49] and deep learning (DL), has recently gained increasing attention as a transformative solution in HCC management[48,50-54].
AI is increasingly applied in HCC management. Its roles include imaging-based detection and classification, survival prediction, and treatment-response assessment[55,56]. DL systems show high diagnostic accuracy with US, CT, and MRI, in some cases matching expert readers. Radiomics combined with DL has also been used to predict response to locoregional treatment modalities[54,57,58]. ML models trained on clinical and biochemical variables provide personalized risk stratification and survival prediction[59,60]. Evidence for HCC screening remains limited, but AI can support imaging workflows such as triage, segmentation, and reporting, indirectly improving surveillance efficiency[30,61]. ML refers to algorithms that learn from structured clinical or laboratory datasets, whereas DL uses multi-layer neural networks (NNs) capable of handling high-dimensional data, including US, CT, MRI, and digitized histopathology (whole-slide images)[62-65].
Emerging computational approaches, particularly AI and ML, offer transformative potential to address current limitations by harnessing the vast amounts of heterogeneous clinical data generated during patient care. Conventional statistical models, such as logistic regression and Cox proportional-hazards models, still dominate many high-impact HCC risk calculators used in everyday clinics due to their transparency and modest data requirements, which allow easier validation, especially in small or heterogeneous cohorts. DL, however, promises unprecedented accuracy. It enhances imaging analysis by improving small nodule detection in cirrhotic livers and integrates multimodal data, including imaging and biomarkers, for personalized risk stratification. While DL holds profound promise for revolutionizing HCC diagnosis, prognosis, and treatment response prediction by integrating diverse data streams such as medical imaging, clinical records, and genomic data, its widespread clinical integration is hindered by computational complexity, model size, and inference costs. DL, as a subset of AI utilizing multi-layered NNs, can integrate diverse and complex multimodal data streams relevant to HCC. 
The scope of integration encompasses: (1) Demographic and clinical data: Age, sex, etiology of chronic liver disease, comorbidities; (2) Biochemical analysis: Bilirubin, albumin, gamma-glutamyl transferase, international normalized ratio, AFP, lens culinaris agglutinin-reactive fraction of AFP (AFP-L3), des-gamma-carboxy prothrombin, and platelet count; (3) Multi-omics data: Genomic mutations and gene expression, transcriptomic, proteomic, and metabolomic profiles, providing insights into molecular drivers and tumor biology; (4) Medical imaging: US, CT, and MRI, which provide structural and functional information for tumor detection and characterization; and (5) Histopathological data: Features from biopsy specimens when available, such as tumor grade, differentiation, microvascular invasion, stromal characteristics, mitotic activity, and immunohistochemical markers (Ki-67, glypican-3, cytokeratin profiles).
Modern DL architectures, such as convolutional NNs (CNNs) for image analysis and transformers or recurrent NNs (RNNs) for sequential or tabular data, can process and identify subtle, complex, and non-linear patterns within this heterogeneous data that are often imperceptible to human clinicians or remain undetected by conventional statistical models[66]. Potential applications are broad: (1) Enhanced early detection and screening: Developing risk prediction models using longitudinal electronic health record (EHR) data to identify high-risk subpopulations needing intensified surveillance, or analyzing US images with AI to improve sensitivity for small nodules; (2) Non-invasive diagnosis and characterization: Refining diagnostic accuracy by analyzing multi-phase CT and MRI scans beyond standard radiological criteria [Liver Imaging Reporting and Data System (LI-RADS)], potentially reducing the need for biopsy, and predicting molecular subtypes from imaging ("radiomics") or circulating biomarkers; (3) Precision risk stratification: Predicting individual patient risk of HCC development (primary prevention), recurrence after curative therapy (secondary prevention), or progression on systemic therapy, enabling personalized surveillance and treatment plans; and (4) Treatment planning and response prediction: Assisting in treatment selection (predicting response to locoregional or systemic therapies, etc.) and planning interventions like surgery or radiation[67,68].
While significant challenges remain regarding data standardization, model robustness, generalizability across diverse populations, interpretability ("black box" problem), and clinical integration, the strategic application of DL holds immense promise for translating complex, high-dimensional clinical data into actionable insights[51]. These advances may improve both the timeliness and precision of HCC diagnosis and management[69].
Checklist for clinical deployment of DL models in HCC
To facilitate the transition from research to real-world application, we propose the following step-by-step checklist for clinicians and engineers.
Data quality and standardization
Curate multi-institutional, diverse datasets covering ethnic, etiological, and clinical variability. Standardize imaging protocols [digital imaging and communications in medicine (DICOM) conformity, etc.] and clinical data formats [fast healthcare interoperability resources (FHIR), etc.]. Annotate data under expert consensus and report inter-rater reliability.
Model validation and generalizability
Perform rigorous external validation on unseen, multicenter data. Report subgroup performance (age, sex, etiology, disease stage) and calibration metrics. Compare against clinical baselines (LI-RADS, AFP) and simple models.
Interpretability and trust
Incorporate explainability tools [gradient-weighted class activation mapping (Grad-CAM), attention maps, etc.] for clinician review. Validate AI explanations against radiological/pathological ground truth. Develop user-friendly interfaces showing confidence scores and decision rationale.
Clinical workflow integration
Ensure compatibility with hospital information technology [picture archiving and communication systems (PACS), electronic medical records (EMR) via substitutable medical applications reusable technologies (SMART) on FHIR]. Optimize inference time for real-time use (< 1 second per image for US or CT). Design human-in-the-loop protocols requiring clinician confirmation.
Regulatory and ethical readiness
Document model design, training data, and performance as per Food and Drug Administration (FDA) and European Medicines Agency guidelines. Implement privacy-preserving techniques (federated learning, de-identification). Conduct bias audits and ensure fairness across subgroups.
Performance monitoring and maintenance
Establish continuous monitoring for model drift and performance degradation. Plan for periodic retraining with new data. Create feedback mechanisms for clinicians to report errors or edge cases.
MATERIALS AND METHODS
Review design and rationale
This narrative review was systematically conducted to synthesize current advancements in efficient DL architectures for HCC prediction and clinical translation. The primary objectives were to: (1) Identify and categorize state-of-the-art DL models applied to HCC across various data; (2) Critically evaluate the integration and impact of efficiency-enhancing techniques within these models; and (3) Pinpoint current limitations, research gaps, and future directions essential for accelerating clinical adoption. This review design was chosen to provide a comprehensive and interpretive overview of a rapidly evolving field, focusing on conceptual frameworks and practical implications relevant to clinical translation.
Literature search strategy
A systematic search was conducted in PubMed, IEEE Xplore, ACM Digital Library, and Google Scholar for studies published between January 2015 and March 2025. The search strategy was designed to combine terms related to HCC, AI/ML, efficiency, and applications. A sample reproducible search string for PubMed is provided below.
("hepatocellular carcinoma"[Title/Abstract] OR "HCC"[Title/Abstract] OR "liver cancer"[Title/Abstract]) AND ("deep learning"[Title/Abstract] OR "convolutional neural network"[Title/Abstract] OR "CNN"[Title/Abstract] OR "transformer"[Title/Abstract] OR "recurrent neural network"[Title/Abstract] OR "RNN"[Title/Abstract]) AND ("model compression"[Title/Abstract] OR pruning[Title/Abstract] OR quantization[Title/Abstract] OR "lightweight architecture"[Title/Abstract] OR "edge AI"[Title/Abstract] OR efficient[Title/Abstract]) AND ("medical imaging"[Title/Abstract] OR radiomics[Title/Abstract] OR "early detection"[Title/Abstract] OR "risk stratification"[Title/Abstract]).
This strategy was adapted for syntax in the other databases. The reference lists of relevant retrieved articles were also manually screened to identify additional studies.
Inclusion and exclusion criteria
Studies were included if they focused on DL applications for HCC prediction, diagnosis, prognosis, or treatment response, were published in English, and addressed methodologies or explicitly discussed aspects of model efficiency, clinical translation, or practical implementation. Exclusion criteria were defined as follows: Studies solely using traditional ML methods without DL components; duplicate publications; conference abstracts or preprints not subsequently published in peer-reviewed journals (unless considered highly relevant and foundational); and articles not directly related to HCC or DL efficiency.
Study robustness was evaluated using: (1) Benchmark consistency: Comparison against standard metrics such as the LI-RADS for imaging; (2) Clinical validation rigor: Prioritizing prospective/multicenter studies; and (3) Efficiency-impact balance: Assessing trade-offs between model size and accuracy.
The quality of the included studies was assessed using elements adapted from the Scale for the Assessment of Narrative Review Articles (SANRA) criteria[70], specifically focusing on: (1) Comprehensiveness of literature search: Evaluation of the reported search strategy and its ability to identify relevant studies; (2) Clarity of objectives and scope: Assessment of whether the review's aims were clearly stated and consistently addressed; (3) Scientific accuracy and critical appraisal: Evaluation of the factual correctness of information and the depth of critical analysis presented; (4) Interpretation and synthesis: Assessment of how well the findings from individual studies were integrated and interpreted to form cohesive conclusions; and (5) Relevance to current knowledge: Evaluation of the contribution to the existing body of literature.
The overall quality of the included studies was assessed using the six SANRA criteria. The results of this assessment are summarized in Table 1. While most studies were scientifically sound and relevant, a common limitation was the lack of rigorous clinical validation, with the majority being retrospective and single-center. The clarity of objectives and reporting of efficiency metrics were variable across studies.
Table 1 Summary of quality assessment based on Scale for the Assessment of Narrative Review Articles criteria.
SANRA criterion | Summary of findings
Justification of article importance | High: Most studies clearly stated the clinical problem of HCC detection and the potential of DL
Statement of concrete aims | Medium/high: Modeling aims (e.g., classification accuracy) were usually clear; efficiency aims were sometimes less explicitly stated
Description of literature search | Low: A significant weakness across almost all primary studies; search strategies were rarely reported
Scientific accuracy and rigor | Medium: Methods were mostly sound technically, but clinical validation rigor (prospective/multicenter) was often low
Discussion of limitations | Medium: Common limitations like small sample size were often acknowledged; discussion of bias or generalizability was less frequent
Quality of illustrations/reporting | High: Most studies included high-quality figures of models, results, and attention maps; efficiency metrics (e.g., model size, inference time) were not consistently reported
Additionally, particular attention was paid to the rigor of clinical validation (use of external datasets, prospective studies, etc.) and the reported balance between model performance and efficiency, a critical aspect of this review.
Definition of model efficiency
In the framework of this review, efficient DL models are those optimized not only for predictive accuracy but also for practical deployment constraints. This includes: (1) Computational efficiency: Fewer parameters and floating-point operations (FLOPs), resulting in faster inference times, which are important for real-time analysis (e.g., interpretation of US images during examination); (2) Memory efficiency: Reduced graphics processing unit (GPU) memory and storage requirements, allowing models to run on a typical hospital workstation or edge device; and (3) Data efficiency: The capacity to achieve high performance despite limited annotated datasets, which alleviates a major bottleneck in medical AI. The key focus of this efficiency-oriented synthesis is techniques that advance these objectives: Lightweight architectures, pruning, quantization, and self-supervised learning. Such deployment gains must be distinguished from diagnostic accuracy gains, but the ultimate aim is to realize both without major trade-offs.
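To make the two compression techniques named above concrete, the following is a minimal, illustrative sketch (not drawn from any of the reviewed studies) of magnitude-based pruning and symmetric 8-bit quantization applied to a toy weight matrix; all values, function names, and the 50% sparsity target are arbitrary placeholders.

```python
# Illustrative sketch of two model-compression steps: magnitude pruning
# (zeroing the smallest weights) and symmetric int8 quantization
# (storing weights as 8-bit integers plus one float scale).

def prune_by_magnitude(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of weights."""
    flat = sorted(abs(w) for row in weights for w in row)
    k = int(len(flat) * sparsity)
    threshold = flat[k - 1] if k > 0 else -1.0
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]

def quantize_int8(weights):
    """Map float weights to int8 using a single symmetric scale factor."""
    max_abs = max(abs(w) for row in weights for w in row) or 1.0
    scale = max_abs / 127.0
    q = [[round(w / scale) for w in row] for row in weights]
    return q, scale

def dequantize(q, scale):
    """Approximate reconstruction of the float weights."""
    return [[v * scale for v in row] for row in q]

W = [[0.80, -0.05, 0.30],
     [0.02, -0.90, 0.10]]

pruned = prune_by_magnitude(W, sparsity=0.5)   # half of the weights become zero
q, scale = quantize_int8(pruned)               # stored as int8 + one float scale
restored = dequantize(q, scale)                # small, bounded reconstruction error
```

In practice, frameworks provide these operations directly; the sketch only shows why the compressed model needs roughly a quarter of the original storage (8-bit instead of 32-bit weights) with a bounded approximation error.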
Recommendations for reproducibility
To ensure reproducibility and facilitate clinical replication, future HCC DL studies should consistently report the following minimal set of details: (1) Dataset splits: Ratios or cross-validation folds, sample counts per split, and stratification criteria; (2) Random seeds: Values used for splitting data and initializing models; (3) Optimizer: Type, learning rate, batch size, scheduler, and number of epochs; (4) Training time: Total duration and hardware specifications; (5) Inference time: Average time per sample/image on deployment-level hardware; (6) Memory use: Peak GPU memory consumption during training and inference; (7) Software: Framework, version, and critical libraries; (8) Data augmentation: Specific techniques applied; (9) Model size: Parameter count and computational complexity [e.g., giga floating-point operations (GFLOPs)]; and (10) Performance metrics: Key metrics [area under the curve (AUC), sensitivity, etc.] with confidence intervals.
Such structured reporting will enhance transparency, comparability, and validation of results across studies.
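The ten reporting items above can be captured as a simple structured record; the sketch below is a hypothetical template (field names and example values are illustrative, not a published standard) showing how completeness could be checked mechanically before submission.

```python
# Hypothetical minimal reporting record for the ten items listed above.
# Field names and example values are illustrative placeholders only.
from dataclasses import dataclass, asdict

@dataclass
class ReproducibilityReport:
    dataset_splits: str            # ratios/folds, counts, stratification
    random_seed: int               # seed for splitting and initialization
    optimizer: str                 # type, learning rate, batch size, epochs
    training_time_h: float         # total duration (hardware in `software`)
    inference_ms_per_image: float  # on deployment-level hardware
    peak_gpu_memory_gb: float      # during training and inference
    software: str                  # framework, version, key libraries
    augmentation: str              # specific techniques applied
    parameters_m: float            # parameter count, millions
    gflops: float                  # computational complexity
    auc: float                     # key performance metric
    auc_ci: tuple                  # 95% confidence interval

report = ReproducibilityReport(
    dataset_splits="70/15/15, stratified by etiology", random_seed=42,
    optimizer="AdamW, lr=1e-4, batch=16, 50 epochs", training_time_h=6.5,
    inference_ms_per_image=38.0, peak_gpu_memory_gb=7.2,
    software="PyTorch 2.x", augmentation="flips, rotations, intensity jitter",
    parameters_m=5.3, gflops=0.39, auc=0.91, auc_ci=(0.88, 0.94))

# Mechanical completeness check: no field may be left empty.
missing = [k for k, v in asdict(report).items() if v in (None, "")]
assert not missing
```

A journal or repository could require such a record alongside the manuscript, making cross-study comparison a matter of tabulating fields rather than re-reading methods sections.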
RESULTS
DL architectures in HCC prediction and classification
DL uses multi-layer NNs to learn complex features directly from raw data. In HCC, several DL architectures have been applied:
CNNs: By far the most common approach for image analysis, CNNs use convolutional filters to detect spatial patterns. CNNs have been trained on US, CT, and MRI scans to detect and classify liver lesions. For example, CNNs have shown significant promise in the diagnosis, segmentation, prognosis, and treatment response prediction of HCC using medical imaging. CNN-based models can accurately classify liver lesions on multi-phasic MRI and dynamic contrast-enhanced CT, often outperforming or matching experienced radiologists, with reported accuracies and sensitivities for HCC detection and classification as high as 92% and 90%, respectively, and AUC values up to 0.992[71,72]. Foundational CNN studies continue to demonstrate clinical versatility across diverse HCC diagnostic scenarios. One study pioneered CNN-based differentiation of liver masses on dynamic contrast-enhanced CT, achieving radiologist-level accuracy and establishing an early benchmark for AI in hepatic imaging[68]. A later study addressed resource limitations by developing a lightweight CNN that maintains high accuracy on non-contrast CT, which is critical for low-infrastructure settings where contrast agents are unavailable[73]. Beyond single-class classification tasks, another group advanced differential diagnosis with a Siamese cross-contrast NN, enabling precise discrimination between HCC and cholangiocarcinoma (AUC > 0.95) through multi-branch feature fusion. These innovations collectively highlight CNN architectures' adaptability to both technical constraints and complex diagnostic challenges in hepatology. These results show CNNs can accurately distinguish HCC from benign or other lesions on various image modalities, often matching expert performance[74].
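The convolutional filtering that underlies these models can be illustrated with a minimal sketch: a single hand-crafted vertical-edge filter slid over a toy binary "image" (real CNNs learn thousands of such filters from data; the filter and image here are illustrative only).

```python
# Minimal sketch of the convolution operation at the heart of a CNN.
# Illustrative only: real models learn many such filters from data.

def conv2d_valid(image, kernel):
    """2D cross-correlation with no padding ('valid' mode)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0.0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out

# A vertical-edge filter responds where intensity changes left-to-right:
# the kind of local spatial pattern a CNN filter detects, e.g. at a
# lesion boundary.
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
edge = [[-1, 0, 1],
        [-1, 0, 1],
        [-1, 0, 1]]
response = conv2d_valid(image, edge)  # strong positive response at the edge
```

Stacking many learned filters, nonlinearities, and pooling layers turns this elementary operation into the hierarchical feature extractors reported in the studies above.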
RNNs: RNNs, especially long short-term memory (LSTM) and gated recurrent unit variants, are designed to model sequential or longitudinal data. In HCC, they have been applied to time-series risk prediction. For instance, one study developed an LSTM-based RNN using serial laboratory tests and vital signs from patients with HCV-related cirrhosis. The RNN achieved an AUC ≥ 0.76 for predicting 3-year HCC risk, outperforming logistic regression on the same dataset[75]. Multi-phase imaging and follow-up scans can likewise be treated as sequences. One study built a CNN-RNN hybrid (CRNN) that analyzed sequences of CT slices to predict survival in immunotherapy-treated HCC. Thus, RNNs capture temporal trends in clinical and imaging data, thereby contributing to prognosis assessment and screening strategies[75,76].
Transformer networks and attention: Transformer models, originally developed in natural language processing, use self-attention mechanisms to model long-range dependencies. Recent research demonstrates that transformer networks and attention mechanisms are significantly advancing the detection, segmentation, and characterization of HCC in medical imaging. Transformer-based models such as TDCenterNet and HyborNet leverage multi-head self-attention to capture global contextual information and long-range dependencies, which improves the identification of small or heterogeneous liver tumors and refines lesion boundaries, outperforming traditional convolutional approaches in both detection and segmentation tasks[48,77]. Furthermore, hybrid models incorporating efficient channel-spatial attention mechanisms, such as those based on Gabor filters, have been shown to enhance the precision of boundary delineation in liver tumor segmentation while maintaining a compact and computationally efficient model architecture. Hybrid attention structures, such as those in the HDU-Net and modified U-Net models, likewise combine channel and spatial attention to filter and integrate low-resolution information, leading to improved feature extraction and segmentation accuracy[78]. In imaging, vision transformers or hybrid CNN-transformer networks can exploit global context in MRI and CT. Emerging research shows promise for transformer-based liver lesion classification, though additional studies are needed[77,79]. Additionally, one article proposed a lightweight network based on multi-scale context fusion and dual self-attention, concluding that multi-organ segmentation of abdominal CT images is essential for medical image analysis and can provide quantitative information for applications such as surgical navigation, organ transplantation, and radiation therapy[80,81].
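The self-attention mechanism these models rely on reduces to a short formula, attention(Q, K, V) = softmax(QKᵀ/√d)V; the sketch below implements one head without learned projections over three toy "tokens" (e.g., image patches), purely for illustration of how every output position mixes information from all positions.

```python
# Toy scaled dot-product self-attention: the core of a transformer layer.
# Illustrative only; real vision transformers use multiple heads and
# learned query/key/value projections.
import math

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(Q, K, V):
    """attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V, single head."""
    d = len(Q[0])
    out = []
    for q in Q:
        # Similarity of this token's query to every token's key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in K]
        weights = softmax(scores)  # each token attends to all positions
        # Output is a weighted mix of all value vectors: long-range context.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Three "tokens" (e.g., flattened image patches), dimension 2.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y = self_attention(X, X, X)  # every output row blends all three inputs
```

Because each output row is a weighted combination of every input row, distant patches can influence one another in a single layer, which is the long-range-dependency property credited above for improving detection of small or heterogeneous tumors.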
Hybrid and multi-input models: Many systems combine architectures to leverage different data types. A common hybrid is CNN + RNN for spatial-temporal fusion. Others fuse image features with clinical vectors: For example, some models concatenate CNN-derived radiomic features with tabular patient data in a dense network. Ensemble approaches or multi-branch networks can also integrate multi-modal inputs (dual CNNs on two phases, etc.). These hybrid models often yield better performance than single-modality models, including superior accuracy in HCC grade classification. Overall, leveraging hybrid models for spatial-temporal fusion and multi-modal data integration leads to more robust and clinically valuable tools for HCC management[82-84].
In addition to these architectures, other DL variants have also been explored. Autoencoders and variational autoencoders have been applied to unsupervised HCC subtyping, particularly for integrating high-dimensional gene expression profiles with clinical metadata[85-87]. Graph NNs are emerging as a promising approach to model complex biological relationships, such as tumor-microenvironment interaction graphs; however, their application in HCC remains relatively nascent[88,89]. Table 2 synthesizes key advancements in DL architectures for HCC research from 2019-2025, revealing three critical trends. First, architectural diversity is expanding beyond conventional CNNs to include transformers, hybrid models, and multimodal systems, enabling nuanced applications from histopathology grading to survival prediction. Second, efficiency innovations - such as lightweight designs, self-supervised pretraining, and synthetic data augmentation - consistently achieve > 90% accuracy/AUC while overcoming resource constraints. Third, studies increasingly address clinical translation barriers: Interpretability (attention mechanisms, etc.), bias mitigation, and data heterogeneity. Nevertheless, gaps persist in generalizability across protocols and multi-class discrimination, underscoring the need for standardized multicenter validation. Collectively, these works highlight DL’s evolution from single-modality tools toward integrated, clinically actionable systems[72].
Table 2 Comprehensive summary of deep learning architectures for hepatocellular carcinoma studies.
US: US is a common screening tool; its lower cost makes it attractive, but images are noisier than those of CT or MRI. CNNs have nonetheless achieved impressive results: studies have reported classification accuracies exceeding 95% using CNNs for HCC detection from US images, and combining CNNs with traditional image analysis methods has further improved diagnostic performance, achieving accuracies above 98%. Ongoing research focuses on refining CNN algorithms to enhance early detection rates, which are crucial given the low five-year survival rates for HCC[90-92].
Multiphase contrast-enhanced CT: Contrast-enhanced CT, especially with multiphase protocols, is a workhorse for HCC diagnosis, and CNNs have become highly effective tools for analyzing these images. Studies have applied CNNs for HCC detection/localization or LI-RADS grading on CT, often achieving high sensitivity and specificity; advanced models include 3D CNNs and region-based CNNs for analyzing volumetric data. CNNs trained on CT volumes excel at lesion classification and segmentation, achieving high accuracy, precision, and recall in distinguishing HCC from other liver lesions and tumors, even on non-contrast images[73,74].
Multiphase contrast-enhanced MRI: DL applied to MRI has achieved similarly high performance in HCC diagnosis, leveraging MRI’s superior soft tissue contrast and advanced imaging sequences such as hepatobiliary-phase contrast and diffusion-weighted imaging. CNNs trained on multiphasic or dynamic contrast-enhanced MRI can accurately classify and segment HCC lesions, often surpassing radiologist performance, with reported sensitivities and specificities above 90% and AUCs approaching 0.99 for lesion classification and subtyping[93-95].
Histopathology: Digital pathology (whole-slide images) is fertile ground for CNNs. Here, DL performs patch-wise classification or whole-slide analysis. Models like LiverNet use attention modules to focus on cancerous regions. An article showed that incorporating multi-scale context (via atrous spatial pyramid pooling) improved CNN classification of HCC subtypes. More recently, a paper used extensive data augmentation to distinguish normal liver, HCC, and cholangiocarcinoma on histology, achieving > 99% AUC and 97% accuracy. These studies indicate DL can match or exceed pathologist accuracy in distinguishing HCC histology subtypes[96,97].
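The multi-scale context idea behind atrous spatial pyramid pooling rests on dilated ("atrous") convolutions, which widen the receptive field without adding parameters. The following 1D NumPy sketch is illustrative only; real ASPP modules apply several 2D dilation rates in parallel over feature maps.

```python
import numpy as np

def dilated_conv1d(signal, kernel, dilation):
    """'Atrous' (dilated) 1D convolution: kernel taps are spaced `dilation`
    apart, so the effective receptive field grows while the parameter
    count (kernel length) stays fixed. Returns (output, receptive field)."""
    k = len(kernel)
    span = (k - 1) * dilation + 1            # effective receptive field
    out = np.empty(len(signal) - span + 1)
    for i in range(len(out)):
        taps = signal[i : i + span : dilation]
        out[i] = float(np.dot(taps, kernel))
    return out, span

signal = np.arange(16, dtype=float)
kernel = np.ones(3)                          # 3 parameters in every case
for d in (1, 2, 4):                          # parallel rates, as in ASPP
    out, span = dilated_conv1d(signal, kernel, d)
    print(f"dilation={d}: receptive field={span}, outputs={len(out)}")
```

With three parallel rates, the same 3-tap kernel sees neighborhoods of 3, 5, and 9 samples, which is how ASPP captures both fine nuclear detail and broader tissue architecture.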
Structured clinical and omics data: Beyond images, deep models have been applied to tabular and molecular data. For example, RNNs and fully-connected networks analyze longitudinal laboratory values, demographics, and risk factors to predict HCC risk. In genomics, autoencoder-based models have been trained on multi-omics data (gene expression, microRNA, methylation) to stratify tumor aggressiveness. Integrative "multi-omics" DL can identify molecular HCC subtypes or prognostic signatures that may elude traditional statistics, offering a more holistic view of HCC through unique molecular signatures and tumor microenvironment features that inform prognosis and treatment strategies. The insights gained from these models support the development of targeted therapies, enhancing personalized treatment approaches for HCC patients[98].
Multi-modal fusion: Recognizing that no single data type is sufficient, recent works fuse images with clinical variables. A study developed a "CT-CRNN + clinical" model: a CNN-RNN processes serial CT scans to yield a radiomic score, which is then combined with a Cox model of seven clinical covariates; this combined model stratified survival better than imaging alone. Similarly, other studies concatenate CNN features (radiomics) with biochemical markers or whole-slide image features in a single NN. Meta-analyses suggest that multi-modal DL often outperforms single-mode models for HCC outcomes (e.g., integrating EHR and imaging data typically boosts AUC)[76].
Model performance, interpretability, and efficiency
DL models have demonstrated remarkable diagnostic performance in the detection of HCC, achieving sensitivity and specificity rates that rival those of expert radiologists. A recent meta-analysis of 30 studies reported a pooled sensitivity of approximately 89% and specificity of 90%, with an AUC of 0.95 for DL-based HCC diagnosis from imaging. This performance is corroborated by individual studies in which DL models achieved AUCs exceeding 0.90 and accuracies above 90% in lesion classification[99,100].
However, performance can vary by task. CNNs excel at imaging tasks (detection/segmentation) but may overfit small datasets. RNNs on large EHR cohorts gave moderate AUC (approximately 0.76) for risk prediction, reflecting inherent difficulty in long-term prediction. Hybrid models and ensembling often improve results. For example, HTRecNet’s ensemble-like augmentation produced near-perfect discrimination of tissue types[75,97].
Interpretability: Explainability is a central concern in applying DL to HCC. Deep NNs are "black boxes" with millions of parameters, so trust hinges on showing why a prediction was made, and this opacity can hinder clinical adoption. To address it, researchers increasingly use attention maps and saliency methods - such as Grad-CAM heatmaps - to visually highlight the image regions most influential in a model’s prediction, helping clinicians understand and verify the AI’s focus during HCC detection or classification. For example, DL systems for HCC diagnosis on CT and MRI have incorporated saliency heatmaps, with radiologists confirming that these visualizations accurately localize true lesions in over 90% of cases[51,101,102].
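The Grad-CAM computation itself is simple: channel-wise gradient averages weight the last convolutional feature maps, and a ReLU keeps only positively contributing regions. A minimal NumPy sketch, with toy activations and gradients standing in for a real network's tensors:

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Grad-CAM heatmap: weight each feature map by the spatial mean of
    the gradient of the target class score w.r.t. that map, sum over
    channels, then apply ReLU and normalize.

    feature_maps, gradients: (channels, H, W) arrays from the last conv
    layer. Returns an (H, W) saliency map in [0, 1] highlighting the
    regions most influential for the prediction."""
    weights = gradients.mean(axis=(1, 2))              # one weight per channel
    cam = np.tensordot(weights, feature_maps, axes=1)  # weighted channel sum
    cam = np.maximum(cam, 0.0)                         # keep positive evidence
    if cam.max() > 0:
        cam /= cam.max()                               # normalize to [0, 1]
    return cam

rng = np.random.default_rng(2)
fmaps = rng.random((8, 7, 7))       # toy last-layer activations
grads = rng.normal(size=(8, 7, 7))  # toy gradients of the HCC score
heatmap = grad_cam(fmaps, grads)
print(heatmap.shape)
```

In a deployed system the low-resolution map is upsampled to the original image size and overlaid on the CT, MRI, or US frame for radiologist review.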
Efficiency: Efficient DL refers to models optimized for speed, size, and resource use. This is crucial for clinical deployment, where computational resources are often constrained by limited GPU capacity and the need for real-time use. Several approaches improve efficiency: (1) Lightweight architectures: Models like MobileNet or EfficientNet, combined with pruning techniques, reduce parameter counts. LiverNet achieved a diagnostic accuracy of 90.9% for subtype classification while remaining highly efficient, utilizing only 0.57M parameters (approximately 1/50th of ResNet-50); this drastic reduction in model size directly translates to a lower memory footprint and faster inference, facilitating integration into pathology workflows. Similarly, incorporating spatial pyramid pooling (as in LiverNet) captures multi-scale tumor features without huge networks[96]; (2) Quantization and pruning: Pruning deep NNs is an effective strategy to reduce model size and computational demands, and it has been successfully applied in DL studies of the liver and HCC. By identifying and removing redundant nodes or connections - using magnitude-based, structured, or evolutionary pruning - networks can often be compressed by 50%-75% or more with minimal or no loss of accuracy, and in some cases even with slight performance gains[103,104]; (3) Hybrid or local computation: A combined CNN-RNN can reduce 3D processing by slicing volumes into 2D segments and summarizing features over time; this "pseudo-3D" approach uses less memory[76]; and (4) Data efficiency: Augmentation and self-supervised pretraining (on large liver datasets, ImageNet, etc.) help DL models learn from fewer labeled examples. HTRecNet used aggressive augmentation to compensate for limited cholangiocarcinoma examples[97].
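A minimal sketch of magnitude-based pruning, one of the techniques listed in point (2), shows how a sparsity target translates into a weight mask; the weights here are toy random values, not taken from any cited model.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Unstructured magnitude pruning: zero out the `sparsity` fraction
    of weights with the smallest absolute value, on the premise that
    small weights contribute least to the output. Returns (pruned, mask)."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    threshold = np.partition(flat, k)[k] if k > 0 else 0.0
    mask = np.abs(weights) >= threshold    # True where weights survive
    return weights * mask, mask

rng = np.random.default_rng(3)
w = rng.normal(size=(64, 64))              # a toy dense layer
pruned, mask = magnitude_prune(w, 0.75)    # 75% sparsity, as in the text
print(f"kept {mask.mean():.0%} of weights")
```

After pruning, networks are typically fine-tuned for a few epochs to recover accuracy; structured variants remove whole channels or filters so that standard hardware actually realizes the speedup.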
Efficiency must be balanced with accuracy. In clinical settings, faster models facilitate real-time use, for example ultrasound scanning with on-the-fly analysis. Light models also lower costs and enable edge computing on scanners or mobile devices. Ultimately, efficient DL refers to models that are both computationally lean and clinically robust.
Table 3 expands the architectural taxonomy by linking common HCC tasks and data modalities to suitable DL designs, summarizing their typical efficiency trade-offs. Pareto-efficient designs offer a favorable balance between accuracy and computational cost for clinical deployment.
Table 3 Taxonomy of deep learning architectures for hepatocellular carcinoma tasks: Mapping to data types and efficiency profiles.
Primary task | Data modality | Suitable DL architectures | Efficiency profile & trade-offs
Lesion detection & diagnosis | Static US/CT/MRI slice | Lightweight CNNs (MobileNet, EfficientNet), standard CNNs | Pareto-efficient: Lightweight CNNs are optimized for high speed and low cost with minimal accuracy loss. Standard CNNs offer a balance
Lesion detection & diagnosis | Volumetric CT/MRI | 3D CNNs, 2.5D CNNs (processing slices sequentially) | Trades speed for accuracy: 3D CNNs are accurate but computationally heavy. 2.5D/pseudo-3D approaches are a more efficient compromise
Lesion detection & diagnosis | Multiphase CT/MRI | Multi-input CNNs, CNN-transformer hybrids | Pareto-efficient: Multi-input CNNs efficiently fuse phase data. Hybrids use transformers to capture long-range dependencies between phases without a full transformer's cost
Segmentation | CT/MRI volumes | U-Net variants, transformer-based (e.g., UNETR) | Trades speed for accuracy: U-Nets are relatively efficient. Pure transformers (e.g., UNETR) offer superior accuracy for complex shapes but are computationally intensive
Longitudinal risk prediction | Tabular EHR time-series | RNNs (LSTM/GRU), transformers, lightweight ML | Context-dependent: RNNs/transformers model time well but can be heavy. For simpler tasks, logistic regression/gradient boosting (non-DL) are often more efficient and perform similarly
Histopathology classification | WSI | MIL + CNN, ViT | Pareto-efficient: MIL frameworks are inherently efficient, processing bags of image patches. New efficient ViT variants are emerging for WSI analysis
Multimodal fusion | Fused imaging + clinical | Hybrid architectures (e.g., CNN branch for images, DNN for tabular data) | Pareto-efficient: Hybrid designs effectively integrate data types. The efficiency is determined by the choice of the image backbone (e.g., using a lightweight CNN)
High-precision detection | CT/MRI (small lesions) | Dense detectors, ViT | Trades speed for accuracy: Models like ViT and complex detectors excel at finding small lesions due to global attention but have high computational demands
Synthesis of ablation studies on efficiency techniques
The reviewed literature often includes ablation studies to validate the contribution of specific efficiency components. A synthesis of these studies reveals consistent trends.
Structured pruning: Pruning levels of 50%-75% frequently result in a > 70% reduction in parameters and FLOPs, with an accuracy drop often limited to < 2%. Pruning beyond this point typically leads to significant performance degradation.
Quantization: Moving from FP32 to INT8 precision consistently reduces model size by approximately 75% and improves inference speed by 2-4 ×, with a negligible accuracy loss (< 1%) in most imaging tasks.
Attention mechanisms: While transformers and attention modules improve accuracy for small lesion detection by 3%-5%, they increase computational cost by 20%-50% compared to standard CNNs. Efficient attention variants (e.g., linear attention) can mitigate this cost.
Lightweight architectures: Models like MobileNet and EfficientNet reduce parameters by 10-50 × compared to ResNet-50, with a typical accuracy trade-off of 1%-3%, making them highly suitable for deployment.
Knowledge distillation: Using a larger teacher network to train a compact student model often allows the student to achieve within 1%-2% of the teacher's accuracy while being 5-10 × faster.
These ablation results indicate that pruning and quantization are the most reliable 'knobs' for achieving immediate compute savings with minimal performance loss. In contrast, the use of attention mechanisms involves a clearer trade-off between accuracy and computational cost, which must be justified by the clinical task (small lesion detection, etc.).
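The FP32-to-INT8 conversion behind the approximately 75% size reduction cited above can be sketched as simple symmetric quantization. This is an illustrative post-training scheme with toy weights, not the calibration procedure of any particular toolkit.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric FP32 -> INT8 quantization: map [-max|x|, +max|x|] onto
    [-127, 127] with a single scale factor, as in basic post-training
    quantization. Returns the int8 tensor and the scale for dequantizing."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(4)
w = rng.normal(scale=0.05, size=10_000).astype(np.float32)  # toy weights
q, scale = quantize_int8(w)
err = np.abs(dequantize(q, scale) - w).max()
print(f"size ratio: {q.nbytes / w.nbytes:.2f}")   # 1 byte vs 4 bytes per weight
print(f"max round-trip error: {err:.6f} (<= scale/2 = {scale / 2:.6f})")
```

The 4:1 byte ratio is exactly the approximately 75% size reduction reported in the ablation studies; the rounding error is bounded by half the scale step, which explains why accuracy loss is usually negligible.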
Figure 1 illustrates the efficiency-performance landscape, revealing that architectural choice alone does not determine superiority. When efficiency is held constant, lightweight CNNs often match or exceed transformer performance for standard HCC detection tasks, while transformers show advantages primarily in complex spatial reasoning tasks requiring global context.
Figure 1 Efficiency-performance trade-offs in deep learning models for hepatocellular carcinoma.
A: Performance vs model parameters; B: Performance vs computational cost. CNN: Convolutional neural network; RNN: Recurrent neural network; AUC: Area under curve; GFLOPs: Giga floating point operations.
External validation and generalizability
We reviewed the validation strategy reported in each included study and categorized them by whether they reported: (1) A true multicenter external test (model trained on data from one or more sites and evaluated on independent data from different institutions); (2) A single-center external test (model evaluated on held-out data from the same institution but collected at a different time or with a different protocol); or (3) Internal cross-validation only (k-fold or split-sample validation without out-of-institution testing). Notably, the majority of studies relied on internal cross-validation or single-center cohorts; only a minority performed true multicenter external testing. Because performance estimates from single-center and internally validated studies are susceptible to optimistic bias from site-specific imaging protocols, annotation practices, and population composition, we advise caution when interpreting pooled AUCs and sensitivity/specificity figures that are dominated by these designs.
DISCUSSION
The rapid evolution of DL approaches for HCC prediction
In recent years, rapid developments in DL techniques have achieved remarkable success in the prediction and diagnosis of HCC. In particular, CNN- and transformer-based models have provided sensitivity and specificity rates equal to or above radiologist performance on multiphase MRI and CT data. For example, one study achieved a 99.2% AUC on multiphase MRI data using a CNN architecture, exceeding the performance of radiologists, who are considered the gold standard in the classification of HCC lesions[71]. Similarly, another article developed a transformer-based model called TDCenterNet and reported superior results compared with traditional CNNs in the localization of small tumors. These developments play a critical role especially in early diagnosis and treatment planning[77].
Clinical potential of hybrid and multi-modal models
Recently, hybrid architectures (e.g., CNN-RNN or CNN-transformer combinations) have shown great potential for HCC management. A paper developed a hybrid CNN-RNN model (HTRecNet) for HCC grading in histopathological whole-slide images and achieved 97% accuracy; this model offers more comprehensive analysis by integrating both spatial and temporal features[83,97]. Furthermore, combining multimodal data (e.g., imaging data plus clinical laboratory results) has enabled DL models to make more robust predictions in clinical applications. Another work performed survival prediction in advanced HCC patients responding to immunotherapy with a CRNN model combining CT data and clinical variables, improving accuracy through temporal-spatial analysis[76].
Efficiency and clinical applicability
Computational efficiency is critical for the widespread use of DL models in clinical settings. Techniques such as network pruning, quantization, and lightweight architectures (MobileNet, EfficientNet, etc.) enable models to run at smaller sizes and on low-resource devices. A study developed a model called LiverNet that works with only approximately 0.57 million parameters and achieved 90.9% accuracy, providing an alternative to large models such as ResNet-50 for low-resource systems[96]. Furthermore, another article reported both reduced memory usage and minimal losses in accuracy after reducing network size by 50%-75% with structured pruning techniques. These approaches are vital for real-time analysis, especially in resource-limited environments (mobile US devices, etc.)[103]. The reviewed efficiency techniques provide tangible deployment gains that are distinct from, yet complementary to, accuracy. A model's high AUC is only a theoretical advantage if it requires a specialized high-end GPU and several seconds to process a single image, making it unsuitable for a busy radiology department. In contrast, a model with a marginally lower AUC that can deliver results in milliseconds on a standard workstation, or even embedded within an ultrasound machine, offers a direct clinical utility gain. This shift in focus from pure performance to performance-per-computational-cost is essential for moving from research prototypes to clinically translatable tools.
Advantages and limitations compared to traditional models
Despite the complexity of DL models, traditional statistical models (e.g., logistic regression or penalized regression) may perform better in some scenarios. A work showed that logistic regression provided a higher area under the receiver operating characteristic curve (AUROC) and better calibration, owing to overparameterization of DL models in small datasets (n < 500)[105]. These findings emphasize that DL is not an optimal solution in all cases, and simpler models should be preferred, especially for small datasets.
Limitations and gaps
Despite promise, many challenges remain. Data quality and heterogeneity limit generalizability. Most studies are retrospective and single-center; models trained on one hospital’s data may not perform well elsewhere. For example, variations in MRI protocols or CT contrast timing can confuse a CNN. The meta-review noted high heterogeneity across studies, cautioning that pooled performance likely overestimates real-world accuracy[100]. Additionally, while DL offers superior feature extraction capabilities, some clinical experts argue that traditional statistical models may still outperform DL in settings with low sample sizes or well-structured tabular data, due to their interpretability and simplicity. We emphasize that many high-performance reports in the HCC DL literature derive from single-center cohorts or internal cross-validation rather than independent multicenter external tests. Such designs increase the risk of overoptimistic performance estimates because of site-specific protocol and population effects. Accordingly, generalization claims in the literature should be tempered until reinforced by multicenter external validation or prospective evaluation. We therefore recommend that future HCC DL studies provide explicit external test sets (ideally from geographically and technologically diverse centers) or adopt federated approaches to better estimate real-world performance.
Sample size
DL research in HCC is often limited by modest dataset sizes, typically fewer than 1000 cases, due to privacy concerns and the high cost of expert annotation, especially for rare or complex tasks like predicting microvascular invasion from imaging. This scarcity of data increases the risk of overfitting, where models perform well on training data but fail to generalize to new patients. While techniques such as cross-validation are commonly used to maximize the utility of available data and assess model robustness, true external validation - testing models on independent datasets from different institutions or populations - remains rare in HCC DL studies[106].
Interpretability and trust
As noted, the "black box" nature of DL remains a significant barrier to clinical adoption: clinicians may be reluctant to act on AI output without understanding its basis, and regulatory bodies require transparency to trust and act on AI-generated outputs. Without clear explanations, there is a risk of misapplication - such as using a model outside its intended domain - that can lead to erroneous results and potential patient harm[51,107,108].
Clinical integration
Few DL tools have reached clinical practice for HCC, largely due to several persistent barriers. Most studies to date are retrospective, with only a handful including prospective or multicenter validation, which limits confidence in real-world performance and generalizability. Unlike FDA-approved AI devices in fields such as retinal disease or cardiovascular imaging, there are currently no AI tools specifically cleared for HCC screening or management as of 2025. Integration into clinical workflows remains a challenge, as most DL models are not designed for seamless use within existing hospital systems or radiology platforms[100,109,110].
Bias and disparities
DL models can unintentionally perpetuate and even amplify biases present in their training data, leading to disparities in performance across different ethnic groups, geographic regions, or other subpopulations. For example, a model trained primarily on data from one demographic may underperform or make inaccurate predictions when applied to others, raising concerns about fairness and equity in medical AI. These biases can stem from imbalanced datasets, unrepresentative sampling, or societal factors embedded in the data itself. To address this issue, careful cohort design - ensuring diverse and representative data - and systematic fairness auditing are essential steps in both the development and deployment of deep models. Techniques such as bias detection, data augmentation (including the use of synthetic data), and fairness-aware algorithms can help mitigate these issues, but ongoing vigilance and transparent reporting are required to ensure equitable outcomes. Ultimately, building trustworthy AI systems for healthcare demands not only technical solutions but also ethical oversight and inclusive practices throughout the model lifecycle[111-113]. To ensure equitable performance, future studies should move beyond aggregate metrics and report subgroup-specific performance across racial and ethnic groups, etiologies (HBV, HCV, MASLD, etc.), sex, and liver disease stages. We recommend a minimal reporting set for HCC AI studies that includes: Stratified performance metrics (sensitivity, specificity, AUC, etc.) for key demographic and clinical subgroups; Calibration measures, such as calibration plots or expected calibration error, to assess prediction reliability across risk groups; Explicit decision thresholds and their clinical rationale to ensure consistent and fair application across populations.
Regulatory and ethical issues
Patient privacy concerns are a major barrier to data sharing in HCC research, limiting the creation of the large, diverse datasets needed for robust DL models. Federated learning offers a promising solution by enabling multiple institutions to collaboratively train AI models without sharing raw patient data, thus preserving privacy and data ownership; however, it is technically complex and not yet widely used in HCC. Studies show that federated models can achieve accuracy and generalizability comparable to those trained on centralized data, while providing stronger privacy protections[114-116]. Moreover, liability for AI "errors" remains unclear. Beyond federated learning, ethical AI requires explicit informed consent for AI-driven diagnostics; clinicians must explain AI’s role in decision-making to patients, ensuring transparency[107].
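At its core, the federated averaging step is a sample-size-weighted mean of locally trained model weights; the sketch below (toy arrays and hypothetical cohort sizes) illustrates why raw patient data never needs to leave a site.

```python
import numpy as np

def fedavg(site_weights, site_sizes):
    """Federated averaging: each institution trains locally and shares
    only its model weights; the server returns the sample-size-weighted
    average. Raw patient data stays at each site throughout."""
    total = sum(site_sizes)
    return sum(w * (n / total) for w, n in zip(site_weights, site_sizes))

# Three hypothetical hospitals with different cohort sizes
rng = np.random.default_rng(5)
local = [rng.normal(size=4) for _ in range(3)]   # toy local weight vectors
sizes = [120, 300, 80]
global_w = fedavg(local, sizes)
print(global_w)
```

Real deployments repeat this aggregation over many communication rounds and often add secure aggregation or differential privacy on top, but the privacy argument rests on this basic pattern: only parameters, never records, are exchanged.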
Traditional models may outperform under data scarcity
Several comparative studies have shown that when sample sizes fall below approximately 500 patients, or when predictor variables are low-dimensional and well-structured, penalized logistic regression or random forests can outperform CNN- or transformer-based pipelines in both AUROC and calibration. Bailly et al[117] demonstrated that under conditions of limited data and low feature complexity, penalized logistic regression achieved AUC values around 0.80, while comparable NN models reached approximately 0.77-0.79. Similarly, John et al[118] found that in smaller datasets, logistic regression and gradient boosting consistently achieved higher AUROC and better calibration than DL models, whereas in larger cohorts performance differences narrowed. These findings emphasize that DL is not a one-size-fits-all solution and should always be evaluated against simpler baselines before implementation.
When DL is NOT the optimal tool
Small datasets (< 300 cases): Penalized logistic regression or gradient-boosted trees are often better suited in such contexts.
Regulatory approval: Current pathways demand transparent model logic; interpretable algorithms accelerate approval.
Clinical usability: Clinicians find odds ratios and hazard ratios directly actionable. In contrast, DL feature maps rarely translate into bedside decisions.
Resource limitations: In low-resource environments, real-time DL inference may be unfeasible, whereas tabular algorithms can run efficiently on standard hospital personal computers[105,119].
Although DL is powerful for high-dimensional imaging and multi-omics tasks, it is not always the best choice. In small, well-curated tabular cohorts (typically < 300-500 patients), or when the predictor set is low-dimensional (< approximately 20 features), penalized regression (least absolute shrinkage and selection operator/elastic-net), Cox models for time-to-event outcomes, or gradient-boosted trees often achieve equal or better discrimination and substantially superior calibration and interpretability. Similarly, when regulatory approval, clinician interpretability, rapid prototyping, or edge deployment are the priority, simpler transparent models reduce validation burden and speed implementation. We therefore recommend that researchers first benchmark simple, well-regularized statistical or classical ML baselines and report calibration and subgroup performance; only when these baselines are clearly outperformed on robust external validation should DL be pursued.
To provide practical guidance, we summarize below a simple decision checklist that can assist practitioners in choosing between DL and alternative approaches: (1) Sample size: < 300 → prefer penalized regression/gradient boosting; 300-1000 → compare baselines vs lightweight DL; > 1000 → DL more justifiable; (2) Feature type: Low-dimensional tabular → prefer interpretable/tabular methods; high-dimensional images/omics → DL recommended; (3) Interpretability/regulatory need: High → choose transparent models (penalized regression, Cox); include explainability reports if DL is used; (4) Resource & deployment constraints: Edge/low compute → prefer compact non-DL models or distilled/lightweight models; and (5) Validation rule: If a simple model’s AUC is within approximately 1%-2% of DL and calibration is better, select the simple model.
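The checklist above can be expressed as a small decision helper. The function below is a hypothetical encoding of those thresholds for illustration, not a validated triage rule; the returned strings are coarse recommendations only.

```python
def choose_model(n_samples, high_dim_features, needs_interpretability,
                 low_compute):
    """Hypothetical encoding of the decision checklist: thresholds mirror
    the 300 / 1000 sample-size cut-offs given in the text. Interpretability
    and compute constraints take precedence, as in items (3) and (4)."""
    if needs_interpretability or low_compute:
        # items (3) and (4): transparency or edge deployment required
        return "penalized regression / gradient boosting"
    if n_samples < 300:
        # item (1): small cohorts favor simpler, well-regularized models
        return "penalized regression / gradient boosting"
    if not high_dim_features:
        # item (2): low-dimensional tabular data
        return "interpretable tabular methods"
    if n_samples < 1000:
        return "compare lightweight DL against strong baselines"
    return "deep learning justifiable"

print(choose_model(250, True, False, False))
print(choose_model(5000, True, False, False))
```

Item (5), the validation rule, still applies at the end: whatever this triage suggests, a simple model within approximately 1%-2% AUC with better calibration should win.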
In summary, DL research in HCC is strong in innovation and retrospective accuracy but remains limited in clinical validation and real-world implementation. Bridging these gaps will require multicenter cohorts, external validation, prospective trials, explainable architectures, and alignment with regulatory pathways to achieve meaningful clinical impact.
Future directions
Future research will likely pursue several directions.
Multi-omics and precision diagnostics: Integrating genomics, for example mutation profiles, together with proteomics and radiomics into unified DL models could reveal novel biomarkers. Self-supervised learning on unlabeled liver scans could leverage large datasets.
Clinical trials: Prospective trials, such as randomized controlled trials or observational studies evaluating AI-assisted HCC screening or diagnosis, will be needed to demonstrate outcome benefits. For example, a trial could randomize patients with cirrhosis to receive either standard imaging or AI-augmented surveillance.
Federated and privacy-preserving AI: Implementing federated learning or differential privacy will allow institutions to collaboratively train models on diverse populations without compromising patient data.
Real-time and point-of-care tools: Deploying efficient DL on portable US devices or endoscopic scanners could enable real-time lesion detection. This could be facilitated by embedded DL chips or cloud-based services.
Explainable AI: Development of inherently interpretable architectures, for example attention gates that map to known liver anatomy, will build trust. Standardizing reporting, such as a tumor heatmap, will become routine. Equally important is developing a patient-facing narrative. Surveys indicate that individuals are more willing to accept AI-assisted care when the underlying reasoning is clear. This aligns with the broader shift toward shared decision-making in hepatology[47,120].
Workflow integration: For successful clinical adoption, DL tools must integrate into radiology and pathology PACS and EMR systems. Standards such as SMART on FHIR could help but would require close collaboration with IT vendors. Practical workflow integration requires more than theoretical compatibility. For radiology workflows, DL models can be containerized (via Docker, etc.) and linked to PACS using DICOM routing. Input is provided as standard DICOM CT, MRI, or US images, and outputs are returned as DICOM-SEG overlays or annotated probability maps. With lightweight CNNs such as EfficientNet, inference times typically remain below 400 milliseconds per slice on standard hospital workstations, enabling near-real-time triage. Importantly, predictions are reviewed by radiologists before being finalized in the clinical report, ensuring a human-in-the-loop safeguard. For EHRs, risk prediction tools can be implemented as SMART on FHIR applications. These apps ingest structured FHIR resources (Observation, Condition, ImagingStudy, etc.) and return risk scores or decision-support alerts via FHIR DiagnosticReport resources. Because such tasks are not time-critical, models often run asynchronously in daily or weekly batches, minimizing workflow disruption. Together, these examples highlight how AI tools can be technically embedded into PACS and EHR systems while respecting runtime constraints and maintaining clinician oversight[121].
Regulatory and guideline development: As evidence accumulates, professional societies such as the American Association for the Study of Liver Diseases, the European Association for the Study of the Liver, and the Asian Pacific Association for the Study of the Liver may include AI tools in HCC guidelines. Regulatory frameworks will need updating to assess and monitor AI devices for HCC[53,99,122,123].
CONCLUSION
This review provides a systematic, efficiency-focused synthesis of DL for HCC and offers concrete guidance for clinical translation. Across imaging modalities such as ultrasound, CT, and MRI, whole-slide histology, and structured clinical and omics data, modern DL consistently achieves high diagnostic performance when paired with efficiency techniques that enable real-time deployment in resource-constrained settings.
What works in practice: lightweight architectures including MobileNet and EfficientNet; model compression through pruning, quantization, and knowledge distillation; data-efficient learning using self-supervised pretraining and targeted augmentation; and computationally frugal designs such as pseudo-3D CRNNs that summarize volumes slice-wise. Together, these techniques can reduce parameters substantially with minimal loss in accuracy, enabling edge or near-edge inference. Multi-modal fusion that combines imaging with clinical and omics features further improves risk stratification and treatment-response prediction, while attention and transformer mechanisms enhance small-lesion detection and global context modeling.
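Two of the compression techniques named above, unstructured pruning and post-training quantization, can be sketched in a few lines. This is a conceptual illustration on a random weight matrix, not a deployment recipe; production pipelines (e.g., `torch.nn.utils.prune` or ONNX Runtime quantization) also fine-tune or calibrate after compression.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero the smallest-magnitude fraction of weights (unstructured pruning)."""
    k = int(weights.size * sparsity)
    if k == 0:
        return weights.copy()
    # Threshold below which weights are dropped; ties are rare for float weights.
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

def quantize_int8(weights: np.ndarray):
    """Symmetric post-training quantization: int8 weights plus one float scale."""
    scale = np.abs(weights).max() / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.normal(size=(32, 32)).astype(np.float32)

pruned = magnitude_prune(w, 0.9)   # ~90% of weights set to zero
q, scale = quantize_int8(w)        # 4x smaller storage than float32
reconstruction_error = np.abs(q * scale - w).max()
```

The int8 representation cuts memory fourfold while the dequantization error stays bounded by half the quantization step, which is why accuracy losses are typically small for well-conditioned layers.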
What is still missing for real-world impact is rigorous external validation on multicenter cohorts; prospective clinical trials that measure patient-relevant outcomes; standardized reporting and calibration with strong baselines; bias and fairness audits with subgroup performance reporting; privacy-preserving learning approaches such as federated learning to unlock diverse data; and seamless workflow integration into PACS and EMR systems, for example through SMART on FHIR, with clear regulatory pathways and post-market monitoring.
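The federated learning approach mentioned among these gaps can be sketched with its core aggregation step, federated averaging (FedAvg): each site trains locally and shares only model weights, which a coordinator combines weighted by cohort size. The two-hospital example below is hypothetical and omits the local training loop and secure communication a real deployment requires.

```python
import numpy as np

def federated_average(site_weights, site_sizes):
    """FedAvg: merge locally trained weights, weighted by cohort size.

    Only model weights cross institutional boundaries; patient
    records never leave the originating site.
    """
    total = sum(site_sizes)
    return sum(w * (n / total) for w, n in zip(site_weights, site_sizes))

# Hypothetical round: hospital A (n = 300) and hospital B (n = 100).
w_a = np.array([1.0, 2.0])
w_b = np.array([3.0, 4.0])
global_w = federated_average([w_a, w_b], [300, 100])
```

Weighting by cohort size keeps the global model closer to the larger site's optimum while still incorporating the smaller, possibly more diverse cohort, which is the mechanism that lets federated training unlock multicenter data without centralizing it.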
The bottom line: efficient, explainable DL that is validated externally, audited for fairness, and embedded into clinical workflows can enable earlier detection and more personalized therapy in HCC, with a credible path to improving survival. Closing the remaining gaps now depends less on inventing ever-larger networks and more on disciplined validation, thoughtful design for constrained settings, and close collaboration between clinicians, engineers, and regulators.
ACKNOWLEDGEMENTS
ChatGPT (GPT-5, OpenAI, San Francisco, CA, United States) and QuillBot (QuillBot, Chicago, IL, United States) were used to assist with language editing during the preparation of this manuscript. All AI-assisted content was carefully reviewed, edited, and validated by the authors. The authors take full responsibility for the accuracy, originality, and integrity of the scientific content presented in this publication.
Footnotes
Provenance and peer review: Invited article; Externally peer reviewed.
Peer-review model: Single blind
Specialty type: Gastroenterology and hepatology
Country of origin: Türkiye
Peer-review report’s classification
Scientific Quality: Grade B
Novelty: Grade B
Creativity or Innovation: Grade B
Scientific Significance: Grade B
P-Reviewer: Hayat M, PhD, Academic Fellow, Postdoc, Canada S-Editor: Li L L-Editor: A P-Editor: Zhang L
Vitellius C, Desjonqueres E, Lequoy M, Amaddeo G, Fouchard I, N'Kontchou G, Canivet CM, Ziol M, Regnault H, Lannes A, Oberti F, Boursier J, Ganne-Carrie N. MASLD-related HCC: Multicenter study comparing patients with and without cirrhosis. JHEP Rep. 2024;6:101160.
Yang Y, Sun J, Cai J, Chen M, Dai C, Wen T, Xia J, Ying M, Zhang Z, Zhang X, Fang C, Shen F, An P, Cai Q, Cao J, Zeng Z, Chen G, Chen J, Chen P, Chen Y, Shan Y, Dang S, Guo WX, He J, Hu H, Huang B, Jia W, Jiang K, Jin Y, Jin Y, Jin Y, Li G, Liang Y, Liu E, Liu H, Peng W, Peng Z, Peng Z, Qian Y, Ren W, Shi J, Song Y, Tao M, Tie J, Wan X, Wang B, Wang J, Wang K, Wang K, Wang X, Wei W, Wu FX, Xiang B, Xie L, Xu J, Yan ML, Ye Y, Yue J, Zhang X, Zhang Y, Zhang A, Zhao H, Zhao W, Zheng X, Zhou H, Zhou H, Zhou J, Zhou X, Cheng SQ, Li Q; Chinese Association of Liver Cancer and Chinese Medical Doctor Association. Chinese Expert Consensus on the Whole-Course Management of Hepatocellular Carcinoma (2023 Edition). Liver Cancer. 2025;14:311-333.
Drefs M, Schoenberg MB, Börner N, Koliogiannis D, Koch DT, Schirren MJ, Andrassy J, Bazhin AV, Werner J, Guba MO. Changes of long-term survival of resection and liver transplantation in hepatocellular carcinoma throughout the years: A meta-analysis. Eur J Surg Oncol. 2024;50:107952.
Mohamed-Chairi MH, Vico-Arias AB, Zambudio-Carroll N, Villegas-Herrera MT, Villar-Del-Moral JM. Comparative analysis of patients transplanted due to hepatocellular carcinoma. Are there survival differences between those who meet the Milan criteria and those who exceed them? Rev Gastroenterol Mex (Engl Ed). 2025;90:36-43.
Yu PLH, Chiu KW, Lu J, Lui GCS, Zhou J, Cheng HM, Mao X, Wu J, Shen XP, Kwok KM, Kan WK, Ho YC, Chan HT, Xiao P, Mak LY, Tsui VWM, Hui C, Lam PM, Deng Z, Guo J, Ni L, Huang J, Yu S, Peng C, Li WK, Yuen MF, Seto WK. Application of a deep learning algorithm for the diagnosis of HCC. JHEP Rep. 2025;7:101219.
Bartnik K, Krzyziński M, Bartczak T, Korzeniowski K, Lamparski K, Wróblewski T, Grąt M, Hołówko W, Mech K, Lisowska J, Januszewicz M, Biecek P. A novel radiomics approach for predicting TACE outcomes in hepatocellular carcinoma patients using deep learning for multi-organ segmentation. Sci Rep. 2024;14:14779.
Xu Y, Zhang B, Zhou F, Yi YP, Yang XL, Ouyang X, Hu H. Development of machine learning-based personalized predictive models for risk evaluation of hepatocellular carcinoma in hepatitis B virus-related cirrhosis patients with low levels of serum alpha-fetoprotein. Ann Hepatol. 2024;29:101540.
Wang Q, Wang Z, Sun Y, Zhang X, Li W, Ge Y, Huang X, Liu Y, Chen Y. SCCNN: A Diagnosis Method for Hepatocellular Carcinoma and Intrahepatic Cholangiocarcinoma Based on Siamese Cross Contrast Neural Network. IEEE Access. 2020;8:85271-85283.
Wang Z, Fu S, Fu S, Li D, Liu D, Yao Y, Yin H, Bai L. Hybrid gabor attention convolution and transformer interaction network with hierarchical monitoring mechanism for liver and tumor segmentation. Sci Rep. 2025;15:8318.
Liao M, Tang H, Li X, Vijayakumar P, Arya V, Gupta BB. A lightweight network for abdominal multi-organ segmentation based on multi-scale context fusion and dual self-attention. Inf Fusion. 2024;108:102401.
Murugesan GK, Mccrumb D, Brunner E, Kumar J, Soni R, Grigorash V, Chang A, Peck A, Vanoss J, Moore S. Automatic Abdominal Multi Organ Segmentation using Residual UNet. 2023 Preprint. Available from: bioRxiv:528755.
Deshpande A, Gupta D, Bhurane A, Meshram N, Singh S, Radeva P. Hybrid deep learning-based strategy for the hepatocellular carcinoma cancer grade classification of H&E stained liver histopathology images. 2024 Preprint. Available from: arXiv:2412.03084.
Liu Z, Fu DGJR, Tu H. Using a Hybrid Pso-Bp Model to Predict Survival Rate after Partial Hepatectomy. J Investig Med. 2016;64(8 Suppl 1):A8.
Jain R, Mungamuri SK, Garg P. Redefining precision medicine in hepatocellular carcinoma through omics, translational, and AI-based innovations. J Precis Med. 2025;1:100003.
Govardhanam V, Zeghal M, Cheung A. A287 Leveraging Machine Learning To Improve The Diagnostic Accuracy Of Ultrasound Screening For Hepatocellular Carcinoma. J Can Assoc Gastroenterol. 2024;7(Suppl 1):231-232.
Mitrea D, Brehar R, Nedevschi S, Socaciu M, Badea R. Hepatocellular Carcinoma Recognition from Ultrasound Images Through Convolutional Neural Networks and Their Combinations. In: Vlad S, Roman NM, editors. 8th International Conference on Advancements of Medicine and Health Care Through Technology. MEDITECH 2022. IFMBE Proceedings, vol 102. Cham: Springer, 2024.
Oestmann PM, Wang CJ, Savic LJ, Hamm CA, Stark S, Schobert I, Gebauer B, Schlachter T, Lin M, Weinreb JC, Batra R, Mulligan D, Zhang X, Duncan JS, Chapiro J. Deep learning-assisted differentiation of pathologically proven atypical and typical hepatocellular carcinoma (HCC) versus non-HCC on contrast-enhanced MRI of the liver. Eur Radiol. 2021;31:4981-4990.
Qu H, Zhang S, Guo M, Miao Y, Han Y, Ju R, Cui X, Li Y. Deep Learning Model for Predicting Proliferative Hepatocellular Carcinoma Using Dynamic Contrast-Enhanced MRI: Implications for Early Recurrence Prediction Following Radical Resection. Acad Radiol. 2024;31:4445-4455.
He X, Xu Y, Zhou C, Song R, Liu Y, Zhang H, Wang Y, Fan Q, Wang D, Chen W, Wang J, Guo D. Prediction of microvascular invasion and pathological differentiation of hepatocellular carcinoma based on a deep learning model. Eur J Radiol. 2024;172:111348.
Boche H, Fono A, Kutyniok G. Mathematical algorithm design for deep learning under societal and judicial constraints: The algorithmic transparency requirement. Appl Comput Harmon Anal. 2025;77:101763.
Paproki A, Salvado O, Fookes C. Synthetic Data for Deep Learning in Computer Vision & Medical Imaging: A Means to Reduce Data Bias. ACM Comput Surv. 2024;56:1-37.
Shah M, Sureja N. A Comprehensive Review of Bias in Deep Learning Models: Methods, Impacts, and Future Directions. Arch Computat Methods Eng. 2025;32:255-267.
Whang SE, Roh Y, Song H, Lee J. Data collection and quality challenges in deep learning: a data-centric AI perspective. VLDB J. 2023;32:791-813.
John LH, Kim C, Kors JA, Chang J, Morgan-Cooper H, Desai P, Pang C, Rijnbeek PR, Reps JM, Fridgeirsson EA. Comparison of deep learning and conventional methods for disease onset prediction. 2024 Preprint. Available from: arXiv:2410.10505.