Revised: April 14, 2025
Accepted: May 26, 2025
Published online: June 8, 2025
Processing time: 83 Days and 16.5 Hours
Artificial intelligence (AI) assisted ultrasound report generation is a technology that converts ultrasound imaging analysis results into structured diagnostic reports. By integrating image recognition with natural language generation, it aims to improve the efficiency, accuracy, and consistency of ultrasound reporting.
Core Tip: This article investigates artificial intelligence assisted ultrasound report generation using vision-language models, addressing challenges unique to ultrasound imaging, such as numerical measurement accuracy, multi-image correlation, and template integration. Unlike standardized radiological imaging, ultrasound variability stems from operator-dependent acquisition and image noise, complicating automated analysis. The framework integrates Transformer-based Optical Character Recognition for measurement extraction, pseudo-case synthesis for data augmentation, and cross-modal alignment to improve report precision. Innovations include leveraging historical reports, video data, and clinical expertise to enhance diagnostic outputs. Ethical protocols ensure data privacy, while template-driven workflows enhance clinical relevance. Future advancements focus on real-time reporting, personalized diagnostics, and multimodal models like GPT-4 vision. This article bridges artificial intelligence capabilities with clinical demands to standardize reports, reduce workloads, and support ultrasound decision-making.
- Citation: Zeng JH, Zhao KK, Zhao NB. Artificial intelligence assisted ultrasound report generation. Artif Intell Med Imaging 2025; 6(1): 107069
- URL: https://www.wjgnet.com/2644-3260/full/v6/i1/107069.htm
- DOI: https://dx.doi.org/10.35711/aimi.v6.i1.107069
Manually generating medical reports from medical images is a time-consuming process, and the results may vary depending on the clinician. Automated medical report generation holds significant value for improving efficiency and diagnostic accuracy in the medical field. As highlighted in Table 1, various artificial intelligence (AI) models, such as convolutional neural networks (CNN)-long short-term memory (LSTM), Transformer-based models, and visual language models (VLMs), each have distinct architectural features and clinical relevance.
Method | Architectural features | Clinical relevance |
CNN-LSTM | Combines CNN and LSTM, suitable for processing sequential data | Performs well in handling image and sequence information, applicable for ultrasound image analysis |
Transformer-based models | Based on self-attention mechanisms, capable of capturing long-range dependencies, suitable for parallel processing | Excels in generating natural language reports, suitable for complex ultrasound report generation |
VLMs | Integrates visual and linguistic information, capable of understanding image content and generating related text | Outstanding performance in multimodal learning, enhances the accuracy and clinical relevance of ultrasound reports |
Early research primarily relied on the combination of CNN and LSTM networks. This structure extracts image features using CNN and generates textual descriptions using LSTM, often combined with attention mechanisms[1], treating the task as image captioning (generating a textual description that reflects the content of an image). The attention mechanism significantly improves the model's ability to focus on important image regions, thereby enhancing the accuracy of report generation.
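To make this paradigm concrete, the sketch below shows a minimal PyTorch captioner in the CNN-LSTM-with-attention style described above. It is an illustrative toy rather than any published ultrasound model; the class name, backbone choice, and hyperparameters are assumptions.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class AttnCaptioner(nn.Module):
    """Toy CNN encoder + additive attention + LSTM decoder for image captioning."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, feat_dim=512):
        super().__init__()
        backbone = models.resnet18(weights=None)                        # CNN feature extractor
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])   # keep the spatial grid
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attn = nn.Linear(feat_dim + hidden_dim, 1)                 # additive attention score
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, vocab_size)
        self.hidden_dim = hidden_dim

    def forward(self, images, captions):
        feats = self.encoder(images)                                    # (B, 512, H, W)
        B, C = feats.size(0), feats.size(1)
        HW = feats.size(2) * feats.size(3)
        feats = feats.view(B, C, HW).transpose(1, 2)                    # (B, HW, 512) region features
        h = feats.new_zeros(B, self.hidden_dim)
        c = feats.new_zeros(B, self.hidden_dim)
        logits = []
        for t in range(captions.size(1)):
            # attend over image regions conditioned on the current decoder state
            h_exp = h.unsqueeze(1).expand(-1, HW, -1)
            alpha = self.attn(torch.cat([feats, h_exp], dim=-1)).softmax(dim=1)
            context = (alpha * feats).sum(dim=1)                        # weighted image context
            x = torch.cat([self.embed(captions[:, t]), context], dim=-1)
            h, c = self.lstm(x, (h, c))
            logits.append(self.fc(h))
        return torch.stack(logits, dim=1)                               # (B, T, vocab_size)
```

Teacher forcing is assumed during training; at inference the previously generated token would replace captions[:, t].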
The introduction of the Transformer model[2] has further advanced this field. With its powerful parallel processing capabilities and effective modeling of long-sequence dependencies, the Transformer has gradually replaced traditional recurrent neural network (RNN) and LSTM structures. This allows medical report generation models to understand image information more efficiently and accurately and to generate corresponding textual descriptions. In particular, the rise of large language models (LLMs) built on the Transformer architecture has further improved the quality and linguistic expressiveness of generated text. LLMs are pre-trained on extensive language corpora and therefore produce more coherent and natural text. Models based on LLMs can not only handle more complex semantic relationships but also adapt better to different task requirements[3]. However, original LLMs primarily process text data and cannot handle multimodal datasets[4].
VLMs are a class of deep learning models capable of simultaneously understanding images and text. They combine visual information (images or videos) with linguistic information (text descriptions) to process and reason about multimodal data.
Typical VLMs include: (1) Contrastive language-image pretraining (CLIP)[5]: Through large-scale image-text contrastive learning, the model learns to associate images with text and can perform a wide range of vision-language tasks without task-specific training; (2) ALIGN[6]: Similar to CLIP, ALIGN uses contrastive learning to align image and text features but trains on a much larger, noisier dataset, which gives the model strong generalization in understanding the semantic relationship between images and text; and (3) Flamingo[7]: Focuses on multimodal in-context understanding, performing cross-modal reasoning over images or videos (e.g., answering questions or generating descriptions) while maintaining contextual coherence across multiple rounds of conversation. The latest generation of large multimodal models, including GPT-4o[8], Grok-3[9], and Claude 3[10], represents a further leap forward, offering stronger reasoning, improved efficiency, and expanded capabilities for processing information across text, vision, and, in the case of GPT-4o, audio, moving towards more seamless, real-time human-computer interaction.
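The contrastive objective behind CLIP and ALIGN can be summarized in a few lines of PyTorch. The sketch below assumes precomputed, batch-aligned image and text embeddings and a fixed temperature, so it is a simplified illustration rather than the exact training recipe of either model.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched image-text pairs."""
    image_emb = F.normalize(image_emb, dim=-1)         # (B, D) unit-norm image features
    text_emb = F.normalize(text_emb, dim=-1)           # (B, D) unit-norm text features
    logits = image_emb @ text_emb.t() / temperature    # (B, B) pairwise similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)        # match each image to its caption
    loss_t2i = F.cross_entropy(logits.t(), targets)    # and each caption to its image
    return (loss_i2t + loss_t2i) / 2
```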
The transition to VLMs represents a significant advancement in the field of ultrasound report generation. Unlike traditional CNN-RNN methods, which may struggle to effectively capture complex patterns in noisy ultrasound images, VLMs leverage self-attention mechanisms that enable them to focus on relevant features across both visual and textual modalities[5]. This enhances their ability to discern critical diagnostic information from images that might otherwise be obscured by noise or operator-dependent variability.
For ultrasound report generation, VLMs not only tightly integrate imaging information with text generation but also precisely capture key information in complex medical images, making the generated reports more professional and detailed and marking the latest progress in this field[11,12].
Current research on report generation mainly focuses on radiological imaging [e.g., chest X-rays, computed tomography (CT), and magnetic resonance imaging (MRI)], with comparatively little work on ultrasound. The main differences between radiological and ultrasound report generation are as follows:
(1) Image stability and consistency: Radiological images such as X-rays, CT, and MRI typically have high image quality and consistency. Their acquisition process is standardized, with minimal influence from factors like patient positioning and equipment operation, resulting in relatively clear and stable images that facilitate model learning. In contrast, the quality and clarity of ultrasound images are often affected by the operator's skill, equipment angle, imaging depth, and other factors, leading to noise interference and difficulty in standardization, posing higher demands on the model's learning and generalization capabilities; (2) Difficulty in data annotation: Annotating radiological images is relatively straightforward, as these images are usually two-dimensional, allowing doctors to directly annotate regions of interest and diagnostic information. In contrast, ultrasound images are more complex, requiring experienced professionals to interpret image features, and the presence of noise and artifacts increases the difficulty of annotation, making the construction of large-scale, high-quality ultrasound image datasets more challenging; and (3) Resolution of anatomical structures: Radiological images such as CT and MRI can clearly display the structure and layers of internal tissues, aiding the model in accurately identifying lesion areas. Ultrasound images have lower resolution and are typically used for imaging soft tissues and fluids, making it difficult to distinguish certain fine structures. Additionally, the resolution and contrast of ultrasound images depend on the acquisition angle and depth, making it challenging to clearly display anatomical structures of different regions in the same image.
In this paper, we conduct a survey on visual language models for ultrasound diagnostic report generation. We focus solely on the AI technology aspect, and the discussion of clinical impacts and integration is beyond the scope of this paper. In addition, Table 2 outlines key concepts in ultrasound report generation, providing definitions and significance to aid in understanding the challenges and advancements in this area. To further elucidate the challenges faced in VLM-based ultrasound report generation and the proposed solutions, Table 3 provides a comprehensive overview of these aspects.
Concept | Description | Significance |
AI-assisted ultrasound report generation | Technology using AI to convert ultrasound imaging into structured diagnostic reports | Enhances efficiency, accuracy, and consistency of diagnosis |
VLMs | AI models that integrate visual (images) and linguistic (text) information | Enable understanding of image content and generation of descriptive text |
Image encoder | A component of VLMs that encodes image information | Transforms images into a format that the model can process |
Text encoder | A component of VLMs that encodes text information | Transforms text into a format that the model can process |
Attention mechanism | A technique that allows the model to focus on specific parts of the input (image or text) | Improves the model's ability to focus on important image regions and text |
LLMs | Transformer-based models pre-trained on large text corpora | Enhance the quality and fluency of generated text |
Challenge | Proposed solution |
Poor accuracy in text generation related to measurement results | Extract numerical values from ultrasound images using tools like TrOCR[18] and insert them into the report |
Suboptimal handling of correspondence between text and images | Annotate the correspondence between text and images and design mechanisms to learn these relationships |
Ineffective utilization of report templates | Use report templates as input, treat template prediction as an intermediate task, or have the model learn to modify templates |
Issues with training data volume | Split existing reports into text-image pairs and reassemble them to create pseudo-cases for training |
Ineffective utilization of historical reports | Use historical reports along with current ultrasound images as input |
Neglect of image selection task | Explicitly model the image selection process to choose representative images for the report |
Lack of utilization of ultrasound-related expertise | Fine-tune LLM models to learn this prior knowledge |
Lack of exploration of predictive tasks | Conduct in-depth research on ultrasound examination scenarios to define effective predictive tasks |
Ultrasound reports often contain a large amount of text related to measurements and locations, such as "the maximum oblique diameter of the right liver lobe is 120 mm", "multiple hyperechoic and hypoechoic nodules are visible in the liver, the largest measuring approximately 11 mm × 7 mm", and "the diameter of the portal vein is 11 mm". A simple approach is to treat the prediction of numerical values as ordinary text prediction without special processing[11,13]. However, current generation models are not adept at predicting numerical values (e.g., "120", "11 × 7", "11")[12]. Li et al[12] replace text related to numerical predictions with special tokens. For text with two numerical values in the format of "x cm × y cm", they uniformly replace it with "2DS". These tokens serve as placeholders, indicating the format of the measurement results, but they do not predict specific measurement values. Wang et al[14] replace text related to numerical values with simple textual descriptions, such as replacing "approximately 1.83 mm × 1.48 mm × 2.79 mm" with "a hypoechoic cystic-solid lesion in the nasal quadrant". While this cleverly avoids the issue of numerical prediction, the replacement is less natural and loses the simplicity and intuitiveness of the report.
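A minimal sketch of the placeholder idea is shown below: measurement strings are masked with format tokens during training and the raw values are kept aside for later reinsertion. The regular expressions and token names (<2DS>, <1DS>) are illustrative and not the exact ones used by Li et al[12].

```python
import re

# Illustrative patterns: two-dimensional measurements must be matched before one-dimensional ones
PATTERNS = [
    (re.compile(r"\d+(?:\.\d+)?\s*mm\s*[×x]\s*\d+(?:\.\d+)?\s*mm"), "<2DS>"),
    (re.compile(r"\d+(?:\.\d+)?\s*mm"), "<1DS>"),
]

def mask_measurements(sentence):
    """Replace numeric measurements with placeholder tokens and return the extracted values."""
    extracted = []
    for pattern, token in PATTERNS:
        extracted.extend(pattern.findall(sentence))
        sentence = pattern.sub(token, sentence)
    return sentence, extracted

masked, values = mask_measurements(
    "multiple nodules are visible in the liver, the largest measuring approximately 11 mm × 7 mm")
print(masked)   # ... the largest measuring approximately <2DS>
print(values)   # ['11 mm × 7 mm']
```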
A case typically involves multiple images (e.g., 8 images) during an ultrasound scan, and the generation of ultrasound reports is primarily based on these images. Accurately understanding the correspondence between images and text statements in reports is a core issue in report generation. A simple approach is to have doctors select the most representative[14] or highest-quality[15] image for generating the report. Li et al[12] use images selected by doctors (usually 2 to 3 images) to generate reports. Guo et al[11] first extract features from each image using an image encoder, then score each image through a scoring module, selecting the highest-scoring image to generate the report. They also calculate the cross-attention between the "ultrasound findings" in the report and the features of the image encoder to obtain a relevance score between each image and the ultrasound findings, using this as a supervisory signal for knowledge distillation in the scoring module to assist in learning how to automatically select images relevant to the text. However, these approaches do not effectively utilize information from other images. Huh et al[16] instead use specially designed sub-models to analyze each image and then combine their outputs through a LangChain-based pipeline to compose the final report.
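In the same spirit as the scoring module of Guo et al[11], a minimal sketch might score per-image features and select the top candidates; the architecture and the distillation loss shown in the comment are illustrative assumptions, not their implementation.

```python
import torch
import torch.nn as nn

class ImageScorer(nn.Module):
    """Score each image feature vector and select the most report-relevant images."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.score_head = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, image_feats, top_k=1):
        # image_feats: (B, N, D) -- N candidate images per examination
        scores = self.score_head(image_feats).squeeze(-1)    # (B, N) relevance logits
        weights = scores.softmax(dim=-1)                     # soft selection distribution
        top_idx = scores.topk(top_k, dim=-1).indices         # indices of selected images
        return weights, top_idx

# During training, a soft target derived from report-image cross-attention could
# supervise `weights` with a distillation-style loss, e.g.:
# loss = nn.functional.kl_div(weights.log(), attn_relevance, reduction="batchmean")
```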
In the manual report generation process, doctors find relevant images for different diagnostic statements and then make a comprehensive judgment to determine the final text. Current research lacks effective mechanisms to extract relevant information from all images.
When doctors issue reports, they often start by finding the report template for the examination item, then edit the template by deleting statements irrelevant to the case, filling in measurement data (e.g., cyst size, blood flow, blood pressure), and modifying individual statements. Some template text may remain unchanged. Report templates provide a lot of prior information for report generation (we can conduct a test: For each case, ignore the image information and use only the template as the final report; based on current text-matching evaluation metrics for report generation, this might yield a good score). Current report generation methods primarily generate reports directly from images, without effectively exploiting the prior information contained in report templates.
Liu et al[17] proposed a method for generating thyroid reports using voice input. The authors first analyzed 40000 thyroid reports and manually constructed a semantic tree for thyroid reports, consisting of three subtrees: The thyroid subtree, the parathyroid subtree, and the cervical lymph node subtree. The nodes of the tree are labeled with organ/region (examined area), attribute name (examined content), and attribute value (observed content). The edges of the tree represent "part of" relationships, region-attribute relationships, and attribute value relationships. Each subtree of the thyroid ultrasound semantic tree contains a region layer, an attribute name layer, and an attribute value layer. After completing the examination, doctors verbally report key parts, and the model automatically matches the corresponding semantic tree and modifies the attribute values. While the voice input method reduces the time doctors spend entering reports, it lacks the ability to automatically generate reports from images.
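A toy version of such a region-attribute-value structure, with a keyword-driven update and a simple linearization into report sentences, is sketched below. The organs, attributes, and default values are illustrative and far simpler than the semantic tree built by Liu et al[17].

```python
# A toy region -> attribute -> value structure in the spirit of the thyroid semantic tree.
semantic_tree = {
    "thyroid": {"size": "normal", "echogenicity": "homogeneous", "nodule": "none"},
    "parathyroid": {"visibility": "not visualized"},
    "cervical lymph nodes": {"enlargement": "none"},
}

def update_tree(tree, region, attribute, value):
    """Set an attribute value reported verbally by the examiner."""
    if region in tree and attribute in tree[region]:
        tree[region][attribute] = value
    return tree

def render_report(tree):
    """Linearize the tree into template-style report sentences."""
    lines = []
    for region, attrs in tree.items():
        desc = ", ".join(f"{name}: {value}" for name, value in attrs.items())
        lines.append(f"{region.capitalize()}: {desc}.")
    return "\n".join(lines)

update_tree(semantic_tree, "thyroid", "nodule", "hypoechoic nodule, 11 mm × 7 mm, right lobe")
print(render_report(semantic_tree))
```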
The training data in current research typically includes only a few hundred[16] or a few thousand[14,15] cases, with a maximum of 20000 cases[12]. The lack of sufficient data to train models is a significant constraint on report generation, and accumulating data and collecting reports is time-consuming. We need to propose methods to artificially increase data volume using limited data.
When patients undergo hospital examinations, they often refer to historical reports to understand more about their medical history, enabling more comprehensive and continuous assessments through comparative analysis. In other words, generating the current report may require referencing a variable number of historical reports (if available). However, current ultrasound diagnostic models do not consider utilizing historical reports. Exploring how to use historical reports to improve report generation is therefore a promising direction.
When doctors capture images, they sometimes record dynamic video clips (e.g., 10 seconds) in addition to static images. These video clips contain a lot of diagnostic information, but current diagnostic models do not consider utilizing video information.
For ultrasound report generation, it is not only about generating textual descriptions but also typically requires selecting 2 or 3 representative images to include in the report as evidence for communication with patients or other doctors. Current report generation models do not treat this task as a necessary subtask.
Ultrasound diagnosis often requires extensive professional knowledge, such as normal ranges for measurement items, to determine normal/abnormal conditions. This professional knowledge is an important reference.
The success of large models (language, vision, language + vision) largely benefits from predictive training tasks on massive datasets, such as predicting the next token in large language models or predicting occluded image regions in vision models.
We observe that numerical measurement results are already stored on ultrasound images after the examination. By extracting this information and placing it in the appropriate diagnostic text, we can obtain precise textual descriptions related to measurements. We can use TrOCR[18] as a tool, following the LangChain approach[16], to extract text results.
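A minimal sketch of this extraction step using the publicly released TrOCR checkpoints from Hugging Face is shown below; the crop box for the on-screen measurement overlay and the file names are assumptions that would need to be adapted to each scanner's display layout.

```python
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

# Public printed-text TrOCR checkpoint; the crop box for the on-screen
# measurement overlay is machine-specific and illustrative here.
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed")

def read_measurement_text(frame_path, box):
    """OCR the burned-in measurement annotations from one ultrasound frame."""
    region = Image.open(frame_path).convert("RGB").crop(box)
    pixel_values = processor(images=region, return_tensors="pt").pixel_values
    generated_ids = model.generate(pixel_values, max_new_tokens=32)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# Example: the lower-right corner where many scanners print caliper readouts
text = read_measurement_text("frame_01.png", box=(600, 500, 960, 600))
print(text)  # e.g. "D1 11.0 mm  D2 7.0 mm"
```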
To better exploit the correspondence between text and images, we need to work with medical professionals to annotate which images support each statement in a report. After thorough research, we will design an effective mechanism for learning the relationship between text and images. Since annotation is labor-intensive, we can start with 50 cases to evaluate the feasibility of this approach.
For each case, the examination item is known, and the report template is also known. Possible approaches to utilizing templates include: (1) Using the report template as input and having the model learn how to modify the template based on images; and (2) Treating template prediction as an intermediate task, where the model automatically infers the examination and applicable report template for each case's images.
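For approach (1), a training example can simply pair the examination images and the known template with the physician's final report, with a prompt that asks the model to edit rather than write from scratch. The data class, file names, and prompt wording below are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TemplateEditingExample:
    """One training example for approach (1): edit a known template given the images."""
    image_paths: List[str]      # ultrasound frames acquired during the examination
    template: str               # standard template for the known examination item
    target_report: str          # the report the physician actually issued

def build_prompt(example: TemplateEditingExample) -> str:
    # The model is asked to delete irrelevant sentences, fill in measurements,
    # and revise statements, rather than to compose the report from scratch.
    return (
        "You are given an ultrasound report template and the examination images.\n"
        f"Template:\n{example.template}\n"
        "Revise the template so that it describes the findings in the images."
    )

example = TemplateEditingExample(
    image_paths=["frame_01.png", "frame_02.png"],
    template="Liver: normal size and shape, homogeneous echotexture, no focal lesion.",
    target_report="Liver: enlarged, a hyperechoic nodule of <2DS> in the right lobe.",
)
print(build_prompt(example))
```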
We need to propose methods to artificially increase data volume using limited data. We can first split existing reports into interrelated text statements and determine the corresponding images for these statements (through doctor annotation), generating (text fragment, image) pairs. We can then use these pairs as basic units to split and reassemble examination reports from different cases, creating a large number of pseudo-cases for model training. The training process can be divided into two stages: Pre-training on pseudo-cases to improve the model's generalization ability, followed by fine-tuning on real cases.
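A minimal sketch of the reassembly step is given below: annotated (sentence, image) units are grouped, here by organ, an assumption made to keep fragments clinically coherent, and randomly sampled into pseudo-cases for pre-training. All contents and grouping rules are illustrative.

```python
import random
from typing import Dict, List, Tuple

# Each unit is a (report sentence, associated image path) pair obtained from
# physician annotation of real cases; the examples here are illustrative.
Unit = Tuple[str, str]

def make_pseudo_cases(units_by_organ: Dict[str, List[Unit]], n_cases: int,
                      fragments_per_case: int = 4, seed: int = 0) -> List[List[Unit]]:
    """Reassemble annotated (sentence, image) units into synthetic training cases."""
    rng = random.Random(seed)
    organs = list(units_by_organ)
    pseudo_cases = []
    for _ in range(n_cases):
        organ = rng.choice(organs)                 # keep fragments anatomically consistent
        pool = units_by_organ[organ]
        k = min(fragments_per_case, len(pool))
        pseudo_cases.append(rng.sample(pool, k))
    return pseudo_cases

units_by_organ = {
    "liver": [("The liver is enlarged.", "liver_01.png"),
              ("A hyperechoic nodule of <2DS> is seen in the right lobe.", "liver_02.png")],
    "thyroid": [("Both thyroid lobes are normal in size.", "thyroid_01.png")],
}
cases = make_pseudo_cases(units_by_organ, n_cases=2)
```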
We can use historical reports along with current ultrasound images as input to improve the quality of report generation. This requires adjustments during training, which need to be explored.
We can explicitly model the image selection process, allowing the model to automatically select 2 or 3 images to include in the report, making the report generation task more aligned with real-world applications. This can also serve as a supervised prediction task to improve the identification of important information and the pairing of images and text.
We can fine-tune LLM models to learn this prior knowledge.
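One way to package such prior knowledge (e.g., the normal measurement ranges mentioned earlier) for fine-tuning is to convert it into instruction-response records. The reference ranges below are illustrative placeholders and must be taken from clinical guidelines in practice.

```python
import json

# Illustrative reference ranges (mm); real values must come from clinical guidelines.
NORMAL_RANGES_MM = {"portal vein diameter": (0, 13), "common bile duct diameter": (0, 6)}

def make_instruction_example(item, value_mm):
    """Turn one reference-range fact into an instruction-tuning record."""
    low, high = NORMAL_RANGES_MM[item]
    verdict = "within the normal range" if low <= value_mm <= high else "abnormal"
    return {
        "instruction": f"The measured {item} is {value_mm} mm. Is this normal?",
        "output": f"A {item} of {value_mm} mm is {verdict} (reference: up to {high} mm).",
    }

records = [make_instruction_example("portal vein diameter", 11),
           make_instruction_example("common bile duct diameter", 8)]
print(json.dumps(records, indent=2))
```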
We need to conduct in-depth research on real-world ultrasound examination scenarios and the useful information contained in images (e.g., ultrasound parameter settings, measurement text, and even timestamps to determine the sequence of image acquisition) to explore how to define effective predictive tasks.
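As one example of a candidate predictive task, the on-image timestamps mentioned above could supervise a frame-order objective: per-examination image features are shuffled and the model is trained to recover the acquisition order. The sketch below only prepares such a batch and is an illustrative assumption rather than a validated pretext task.

```python
import torch

def make_order_prediction_batch(frame_feats):
    """Shuffle one examination's frame features and return the permutation targets.

    frame_feats: (N, D) features for the N frames of a single examination,
    ordered by their on-image acquisition timestamps.
    """
    n = frame_feats.size(0)
    perm = torch.randperm(n)
    shuffled = frame_feats[perm]   # model input: frames in random order
    target = perm                  # original (chronological) index of each shuffled frame
    return shuffled, target

# A sequence model would read `shuffled` and predict `target`, e.g. with an
# N-way classification head per position trained with cross-entropy.
```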
This paper acknowledges several limitations that may impact the comprehensiveness of our analysis:
Lack of experimental comparisons: A significant limitation of this study is the absence of experimental comparisons of the performance of the various methods discussed, such as CNN-LSTM, Transformer-based models, and VLMs. This lack of empirical evaluation restricts our ability to draw definitive conclusions about the relative effectiveness of these approaches.
Ethical considerations: The paper does not adequately address various ethical considerations, such as algorithmic bias and the implications of AI errors regarding accountability. These issues warrant deeper exploration, especially as such systems move closer to routine clinical use.
Clinical application discussion: The clinical application of the discussed AI models is not thoroughly explored. There is a pressing need to consider existing commercial ultrasound AI tools, such as those developed by Butterfly Network and Caption Health, along with the real-world challenges of adoption, including physician trust, training requirements, and reimbursement models. Integrating insights from industry trends and the perspectives of clinicians would provide a more holistic view of the current state of the field and the translational barriers that need to be addressed.
By acknowledging these limitations, we aim to highlight areas for future research and discussion that could enhance the understanding and application of AI in ultrasound report generation.
AI-assisted ultrasound report generation has evolved from early-stage simple image descriptions to complex tasks involving multimodal fusion. However, breakthroughs are still needed in the following areas: (1) Technological advancement, such as real-time report generation and stronger multimodal models (e.g., GPT-4 vision); (2) Personalized diagnostics that tailor reports to individual patients; and (3) Clinical integration that standardizes reports, reduces physician workloads, and supports ultrasound decision-making.
An important aspect that must also be addressed is model interpretability and reliability. Given the high-risk environment of clinical decision-making, clinicians must be able to understand and verify why a model produces a given finding before relying on it.
The authors acknowledge the valuable input of Dr. Kai-Kai Zhao during the manuscript review process.
1. | Jing BY, Xie PT, Xing E. On the Automatic Generation of Medical Imaging Reports. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2018; 2577–2586. [DOI] [Full Text] |
2. | Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I. Attention is all you need. Advances in neural information processing systems, 2017. Preprint. [DOI] [Full Text] |
3. | Brown TB, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A, Agarwal S, Herbert-Voss A, Krueger G, Henighan T, Child R, Ramesh A, Ziegler DM, Wu J, Winter C, Hesse C, Chen M, Sigler E, Litwin M, Gray S, Chess B, Clark J, Berner C, McCandlish S, Radford A, Sutskever I, Amodei D. Language models are few-shot learners; 2020. Preprint. [DOI] [Full Text] |
4. | Liu HT, Li CY, Wu QY, Lee YJ. Visual instruction tuning. Advances in neural information processing systems, 2024; 36. [DOI] [Full Text] |
5. | Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G, Askell A, Mishkin P, Clark J, Krueger G, Sutskever I. Learning transferable visual models from natural language supervision; 2021. Preprint. [DOI] [Full Text] |
6. | Jia C, Yang YF, Xia Y, Chen YT, Parekh Z, Pham H, Le Q, Sung YH, Li Z, Duerig T. Scaling up visual and vision-language representation learning with noisy text supervision. Proceedings of the 38th International Conference on Machine Learning, 2021; 139: 4904-4916. Available from: https://proceedings.mlr.press/v139/jia21b.html. |
7. | Alayrac JB, Donahue J, Luc P, Miech A, Barr I, Hasson Y, Lenc K, Mensch A, Millican K, Reynolds M, Ring R, Rutherford E, Cabi S, Han T, Gong Z, Samangooei S, Monteiro M, Menick J, Borgeaud S, Brock A, Nematzadeh A, Sharifzadeh S, Binkowski M, Barreira R, Vinyals O, Zisserman A, Simonyan K. Flamingo: a visual language model for few-shot learning. In Proceedings of Neural Information Processing Systems (NeurIPS), 2022. [DOI] [Full Text] |
8. | OpenAI. Hello GPT-4o. OpenAI Blog. [Internet] – [cited 2025 April 14]. Available from: https://openai.com/index/hello-gpt-4o/. |
9. | xAI. Grok 3 Beta — The Age of Reasoning Agents. xAI Blog. [Internet] – [cited 2025 April 14]. Available from: https://x.ai/news/grok-3. |
10. | Anthropic. Claude 3 haiku: our fastest model yet. 2024. [Internet] – [cited 2025 April 14]. Available from: https://www.anthropic.com/news/claude-3-haiku. |
11. | Guo C, Li M, Xu J, Bai L. Ultrasonic characterization of small defects based on Res-ViT and unsupervised domain adaptation. Ultrasonics. 2024;137:107194. [PubMed] [DOI] [Full Text] |
12. | Li J, Su T, Zhao B, Lv F, Wang Q, Navab N, Hu Y, Jiang Z. Ultrasound Report Generation With Cross-Modality Feature Alignment via Unsupervised Guidance. IEEE Trans Med Imaging. 2025;44:19-30. [PubMed] [DOI] [Full Text] |
13. | Guo XQ, Men QH, Noble JA. MMSummary: Multimodal Summary Generation for Fetal Ultrasound Video. In: Linguraru MG, editors. Medical Image Computing and Computer Assisted Intervention – MICCAI 2024. Cham: Springer, 2024. [DOI] [Full Text] |
14. | Wang J, Fan JY, Zhou M, Zhang YZ, Shi MY. A labeled ophthalmic ultrasound dataset with medical report generation based on cross-modal deep learning; 2024. Preprint. [DOI] [Full Text] |
15. | Yang S, Niu J, Wu J, Wang Y, Liu X, Li Q. Automatic ultrasound image report generation with adaptive multimodal attention mechanism. Neurocomputing. 2021;427:40-49. [DOI] [Full Text] |
16. | Huh J, Park HJ, Ye JC. Breast ultrasound report generation using LangChain; 2023. Preprint. [DOI] [Full Text] |
17. | Liu LH, Wang M, Dong YJ, Zhao WL, Yang J, Su JW. Semantic Tree Driven Thyroid Ultrasound Report Generation by Voice Input. In: Arabnia HR, Deligiannidis L, Shouno H, Tinetti FG, Tran QN, editors. Advances in Computer Vision and Computational Biology. Transactions on Computational Science and Computational Intelligence. Cham: Springer, 2021. [DOI] [Full Text] |
18. | Li MH, Lv TC, Chen JY, Cui L, Lu YJ, Florencio D, Zhang C, Li ZJ, Wei FR. TrOCR: Transformer-Based Optical Character Recognition with Pre-trained Models. AAAI. 2023;37:13094-13102. [DOI] [Full Text] |