Revised: April 14, 2025
Accepted: May 26, 2025
Published online: June 8, 2025
Processing time: 83 Days and 16.5 Hours
Artificial intelligence (AI) assisted ultrasound report generation is a technology that converts ultrasound imaging analysis results into structured diagnostic reports. By integrating image recognition with natural language generation, it aims to improve the efficiency, accuracy, and consistency of ultrasound reporting.
Core Tip: This article investigates artificial intelligence assisted ultrasound report generation using vision-language models, addressing challenges unique to ultrasound imaging, such as numerical measurement accuracy, multi-image correlation, and template integration. Unlike standardized radiological imaging, ultrasound variability stems from operator-dependent acquisition and image noise, complicating automated analysis. The framework integrates Transformer-based Optical Character Recognition for measurement extraction, pseudo-case synthesis for data augmentation, and cross-modal alignment to improve report precision. Innovations include leveraging historical reports, video data, and clinical expertise to enhance diagnostic outputs. Ethical protocols ensure data privacy, while template-driven workflows enhance clinical relevance. Future advancements focus on real-time reporting, personalized diagnostics, and multimodal models like GPT-4 vision. This article bridges artificial intelligence capabilities with clinical demands to standardize reports, reduce workloads, and support ultrasound decision-making.
- Citation: Zeng JH, Zhao KK, Zhao NB. Artificial intelligence assisted ultrasound report generation. Artif Intell Med Imaging 2025; 6(1): 107069
- URL: https://www.wjgnet.com/2644-3260/full/v6/i1/107069.htm
- DOI: https://dx.doi.org/10.35711/aimi.v6.i1.107069
Manually generating medical reports from medical images is a time-consuming process, and the results may vary depending on the clinician. Automated medical report generation holds significant value for improving efficiency and diagnostic accuracy in the medical field. As highlighted in Table 1, various artificial intelligence (AI) models, such as convolutional neural networks (CNN)-long short-term memory (LSTM), Transformer-based models, and visual language models (VLMs), each have distinct architectural features and clinical relevance.
Method | Architectural features | Clinical relevance |
CNN-LSTM | Combines CNN and LSTM, suitable for processing sequential data | Performs well in handling image and sequence information, applicable for ultrasound image analysis |
Transformer-based models | Based on self-attention mechanisms, capable of capturing long-range dependencies, suitable for parallel processing | Excels in generating natural language reports, suitable for complex ultrasound report generation |
VLMs | Integrates visual and linguistic information, capable of understanding image content and generating related text | Outstanding performance in multimodal learning, enhances the accuracy and clinical relevance of ultrasound reports |
Early research primarily relied on the combination of CNN and LSTM networks. This structure extracts image features using CNN and generates textual descriptions using LSTM, often combined with attention mechanisms[1], treating the task as image captioning (generating a textual description that reflects the content of an image). The attention mechanism significantly improves the model's ability to focus on important image regions, thereby enhancing the accuracy of report generation.
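To make this paradigm concrete, the sketch below shows a minimal PyTorch captioner in the CNN-LSTM-with-attention style described above. It is an illustrative toy rather than any published ultrasound model; the class name, backbone choice, and hyperparameters are assumptions.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class AttnCaptioner(nn.Module):
    """Toy CNN encoder + additive attention + LSTM decoder for image captioning."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, feat_dim=512):
        super().__init__()
        backbone = models.resnet18(weights=None)                        # CNN feature extractor
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])   # keep the spatial grid
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attn = nn.Linear(feat_dim + hidden_dim, 1)                 # additive attention score
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, vocab_size)
        self.hidden_dim = hidden_dim

    def forward(self, images, captions):
        feats = self.encoder(images)                                    # (B, 512, H, W)
        B, C = feats.size(0), feats.size(1)
        HW = feats.size(2) * feats.size(3)
        feats = feats.view(B, C, HW).transpose(1, 2)                    # (B, HW, 512) region features
        h = feats.new_zeros(B, self.hidden_dim)
        c = feats.new_zeros(B, self.hidden_dim)
        logits = []
        for t in range(captions.size(1)):
            # attend over image regions conditioned on the current decoder state
            h_exp = h.unsqueeze(1).expand(-1, HW, -1)
            alpha = self.attn(torch.cat([feats, h_exp], dim=-1)).softmax(dim=1)
            context = (alpha * feats).sum(dim=1)                        # weighted image context
            x = torch.cat([self.embed(captions[:, t]), context], dim=-1)
            h, c = self.lstm(x, (h, c))
            logits.append(self.fc(h))
        return torch.stack(logits, dim=1)                               # (B, T, vocab_size)
```

Teacher forcing is assumed during training; at inference the previously generated token would replace captions[:, t].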
The introduction of the Transformer model[2] has further advanced this field. With its powerful parallel processing capabilities and effective modeling of long-sequence dependencies, the Transformer has gradually replaced traditional recurrent neural network (RNN) and LSTM structures. This allows medical report generation models to understand image information more efficiently and accurately and to generate corresponding textual descriptions. In particular, the rise of large language models (LLMs) built on the Transformer architecture has further improved the quality and linguistic expressiveness of generated text. LLMs are pre-trained on extensive language corpora and therefore produce more coherent and natural text. Models based on LLMs can not only handle more complex semantic relationships but also adapt better to different task requirements[3]. However, original LLMs primarily process text data and cannot handle multimodal datasets[4].
VLMs are a class of deep learning models capable of simultaneously understanding images and text. They combine visual information (images or videos) with linguistic information (text descriptions) to process and reason about multimodal data.
Typical VLMs include: (1) Contrastive language-image pretraining (CLIP)[5]: Through large-scale image-text contrastive learning, the model learns to associate images with text and can perform a wide range of vision-language tasks without task-specific training; (2) ALIGN[6]: Similar to CLIP, ALIGN uses contrastive learning to align image and text features but trains on a much larger, noisier dataset, which gives the model strong generalization in understanding the semantic relationship between images and text; and (3) Flamingo[7]: Focuses on multimodal in-context understanding, performing cross-modal reasoning over images or videos (e.g., answering questions or generating descriptions) while maintaining contextual coherence across multiple rounds of conversation. The latest generation of large multimodal models, including GPT-4o[8], Grok-3[9], and Claude 3[10], represents a further leap forward, offering stronger reasoning, improved efficiency, and expanded capabilities for processing information across text, vision, and, in the case of GPT-4o, audio, moving towards more seamless, real-time human-computer interaction.
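The contrastive objective behind CLIP and ALIGN can be summarized in a few lines of PyTorch. The sketch below assumes precomputed, batch-aligned image and text embeddings and a fixed temperature, so it is a simplified illustration rather than the exact training recipe of either model.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched image-text pairs."""
    image_emb = F.normalize(image_emb, dim=-1)         # (B, D) unit-norm image features
    text_emb = F.normalize(text_emb, dim=-1)           # (B, D) unit-norm text features
    logits = image_emb @ text_emb.t() / temperature    # (B, B) pairwise similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)        # match each image to its caption
    loss_t2i = F.cross_entropy(logits.t(), targets)    # and each caption to its image
    return (loss_i2t + loss_t2i) / 2
```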
The transition to VLMs represents a significant advancement in the field of ultrasound report generation. Unlike traditional CNN-RNN methods, which may struggle to effectively capture complex patterns in noisy ultrasound images, VLMs leverage self-attention mechanisms that enable them to focus on relevant features across both visual and textual modalities[5]. This enhances their ability to discern critical diagnostic information from images that might otherwise be obscured by noise or operator-dependent variability.
For ultrasound report generation, VLMs not only tightly integrate imaging information with text generation but also precisely capture key information in complex medical images, making the generated reports more professional and detailed and marking the latest progress in this field[11,12].
Current research on report generation mainly focuses on radiological imaging [e.g., chest X-rays, computed tomography (CT), and magnetic resonance imaging (MRI)], with comparatively little work on ultrasound. The main differences between radiological and ultrasound report generation are as follows:
(1) Image stability and consistency: Radiological images such as X-rays, CT, and MRI typically have high image quality and consistency. Their acquisition process is standardized, with minimal influence from factors like patient positioning and equipment operation, resulting in relatively clear and stable images that facilitate model learning. In contrast, the quality and clarity of ultrasound images are often affected by the operator's skill, equipment angle, imaging depth, and other factors, leading to noise interference and difficulty in standardization, posing higher demands on the model's learning and generalization capabilities; (2) Difficulty in data annotation: Annotating radiological images is relatively straightforward, as these images are usually two-dimensional, allowing doctors to directly annotate regions of interest and diagnostic information. In contrast, ultrasound images are more complex, requiring experienced professionals to interpret image features, and the presence of noise and artifacts increases the difficulty of annotation, making the construction of large-scale, high-quality ultrasound image datasets more challenging; and (3) Resolution of anatomical structures: Radiological images such as CT and MRI can clearly display the structure and layers of internal tissues, aiding the model in accurately identifying lesion areas. Ultrasound images have lower resolution and are typically used for imaging soft tissues and fluids, making it difficult to distinguish certain fine structures. Additionally, the resolution and contrast of ultrasound images depend on the acquisition angle and depth, making it challenging to clearly display anatomical structures of different regions in the same image.
In this paper, we conduct a survey on visual language models for ultrasound diagnostic report generation. We focus solely on the AI technology aspect, and the discussion of clinical impacts and integration is beyond the scope of this paper. In addition, Table 2 outlines key concepts in ultrasound report generation, providing definitions and significance to aid in understanding the challenges and advancements in this area. To further elucidate the challenges faced in VLM-based ultrasound report generation and the proposed solutions, Table 3 provides a comprehensive overview of these aspects.
Concept | Description | Significance |
AI-assisted ultrasound report generation | Technology using AI to convert ultrasound imaging into structured diagnostic reports | Enhances efficiency, accuracy, and consistency of diagnosis |
VLMs | AI models that integrate visual (images) and linguistic (text) information | Enable understanding of image content and generation of descriptive text |
Image encoder | A component of VLMs that encodes image information | Transforms images into a format that the model can process |
Text encoder | A component of VLMs that encodes text information | Transforms text into a format that the model can process |
Attention mechanism | A technique that allows the model to focus on specific parts of the input (image or text) | Improves the model's ability to focus on important image regions and text |
LLMs | Transformer-based models pre-trained on large text corpora | Enhance the quality and fluency of generated text |
Challenge | Proposed solution |
Poor accuracy in text generation related to measurement results | Extract numerical values from ultrasound images using tools like TrOCR[18] and insert them into the report |
Suboptimal handling of correspondence between text and images | Annotate the correspondence between text and images and design mechanisms to learn these relationships |
Ineffective utilization of report templates | Use report templates as input, treat template prediction as an intermediate task, or have the model learn to modify templates |
Issues with training data volume | Split existing reports into text-image pairs and reassemble them to create pseudo-cases for training |
Ineffective utilization of historical reports | Use historical reports along with current ultrasound images as input |
Neglect of image selection task | Explicitly model the image selection process to choose representative images for the report |
Lack of utilization of ultrasound-related expertise | Fine-tune LLM models to learn this prior knowledge |
Lack of exploration of predictive tasks | Conduct in-depth research on ultrasound examination scenarios to define effective predictive tasks |
Ultrasound reports often contain a large amount of text related to measurements and locations, such as "the maximum oblique diameter of the right liver lobe is 120 mm", "multiple hyperechoic and hypoechoic nodules are visible in the liver, the largest measuring approximately 11 mm × 7 mm", and "the diameter of the portal vein is 11 mm". A simple approach is to treat the prediction of numerical values as ordinary text prediction without special processing[11,13]. However, current generation models are not adept at predicting numerical values (e.g., "120", "11 × 7", "11")[12]. Li et al[12] replace text related to numerical predictions with special tokens. For text with two numerical values in the format of "x cm × y cm", they uniformly replace it with "2DS". These tokens serve as placeholders, indicating the format of the measurement results, but they do not predict specific measurement values. Wang et al[14] replace text related to numerical values with simple textual descriptions, such as replacing "approximately 1.83 mm × 1.48 mm × 2.79 mm" with "a hypoechoic cystic-solid lesion in the nasal quadrant". While this cleverly avoids the issue of numerical prediction, the replacement is less natural and loses the simplicity and intuitiveness of the report.
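A minimal sketch of the placeholder idea is shown below: measurement strings are masked with format tokens during training and the raw values are kept aside for later reinsertion. The regular expressions and token names (<2DS>, <1DS>) are illustrative and not the exact ones used by Li et al[12].

```python
import re

# Illustrative patterns: two-dimensional measurements must be matched before one-dimensional ones
PATTERNS = [
    (re.compile(r"\d+(?:\.\d+)?\s*mm\s*[×x]\s*\d+(?:\.\d+)?\s*mm"), "<2DS>"),
    (re.compile(r"\d+(?:\.\d+)?\s*mm"), "<1DS>"),
]

def mask_measurements(sentence):
    """Replace numeric measurements with placeholder tokens and return the extracted values."""
    extracted = []
    for pattern, token in PATTERNS:
        extracted.extend(pattern.findall(sentence))
        sentence = pattern.sub(token, sentence)
    return sentence, extracted

masked, values = mask_measurements(
    "multiple nodules are visible in the liver, the largest measuring approximately 11 mm × 7 mm")
print(masked)   # ... the largest measuring approximately <2DS>
print(values)   # ['11 mm × 7 mm']
```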
A case typically involves multiple images (e.g., 8 images) during an ultrasound scan, and the generation of ultrasound reports is primarily based on these images. Accurately understanding the correspondence between images and text statements in reports is a core issue in report generation. A simple approach is to have doctors select the most representative[14] or highest-quality[15] image for generating the report. Li et al[12] use images selected by doctors (usually 2 to 3 images) to generate reports. Guo et al[11] first extract features from each image using an image encoder, then score each image through a scoring module, selecting the highest-scoring image to generate the report. They also calculate the cross-attention between the "ultrasound findings" in the report and the features of the image encoder to obtain a relevance score between each image and the ultrasound findings, using this as a supervisory signal for knowledge distillation in the scoring module to assist in learning how to automatically select images relevant to the text. However, these approaches do not effectively utilize information from other images. Huh et al[16] instead use specially designed sub-models to analyze each image and then combine their outputs through a LangChain-based pipeline to compose the final report.
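In the same spirit as the scoring module of Guo et al[11], a minimal sketch might score per-image features and select the top candidates; the architecture and the distillation loss shown in the comment are illustrative assumptions, not their implementation.

```python
import torch
import torch.nn as nn

class ImageScorer(nn.Module):
    """Score each image feature vector and select the most report-relevant images."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.score_head = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, image_feats, top_k=1):
        # image_feats: (B, N, D) -- N candidate images per examination
        scores = self.score_head(image_feats).squeeze(-1)    # (B, N) relevance logits
        weights = scores.softmax(dim=-1)                     # soft selection distribution
        top_idx = scores.topk(top_k, dim=-1).indices         # indices of selected images
        return weights, top_idx

# During training, a soft target derived from report-image cross-attention could
# supervise `weights` with a distillation-style loss, e.g.:
# loss = nn.functional.kl_div(weights.log(), attn_relevance, reduction="batchmean")
```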
In the manual report generation process, doctors find relevant images for different diagnostic statements and then make a comprehensive judgment to determine the final text. Current research lacks effective mechanisms to extract relevant information from all images.
When doctors issue reports, they often start by finding the report template for the examination item, then edit the template by deleting statements irrelevant to the case, filling in measurement data (e.g., cyst size, blood flow, blood pressure), and modifying individual statements. Some template text may remain unchanged. Report templates provide a lot of prior information for report generation (we can conduct a test: For each case, ignore the image information and use only the template as the final report; based on current text-matching evaluation metrics for report generation, this might yield a good score). Current report generation methods primarily generate reports directly from images, without effectively exploiting the prior information contained in report templates.
Liu et al[17] proposed a method for generating thyroid reports using voice input. The authors first analyzed 40000 thyroid reports and manually constructed a semantic tree for thyroid reports, consisting of three subtrees: The thyroid subtree, the parathyroid subtree, and the cervical lymph node subtree. The nodes of the tree are labeled with organ/region (examined area), attribute name (examined content), and attribute value (observed content). The edges of the tree represent "part of" relationships, region-attribute relationships, and attribute value relationships. Each subtree of the thyroid ultrasound semantic tree contains a region layer, an attribute name layer, and an attribute value layer. After completing the examination, doctors verbally report key parts, and the model automatically matches the corresponding semantic tree and modifies the attribute values. While the voice input method reduces the time doctors spend entering reports, it lacks the ability to automatically generate reports from images.
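A toy version of such a region-attribute-value structure, with a keyword-driven update and a simple linearization into report sentences, is sketched below. The organs, attributes, and default values are illustrative and far simpler than the semantic tree built by Liu et al[17].

```python
# A toy region -> attribute -> value structure in the spirit of the thyroid semantic tree.
semantic_tree = {
    "thyroid": {"size": "normal", "echogenicity": "homogeneous", "nodule": "none"},
    "parathyroid": {"visibility": "not visualized"},
    "cervical lymph nodes": {"enlargement": "none"},
}

def update_tree(tree, region, attribute, value):
    """Set an attribute value reported verbally by the examiner."""
    if region in tree and attribute in tree[region]:
        tree[region][attribute] = value
    return tree

def render_report(tree):
    """Linearize the tree into template-style report sentences."""
    lines = []
    for region, attrs in tree.items():
        desc = ", ".join(f"{name}: {value}" for name, value in attrs.items())
        lines.append(f"{region.capitalize()}: {desc}.")
    return "\n".join(lines)

update_tree(semantic_tree, "thyroid", "nodule", "hypoechoic nodule, 11 mm × 7 mm, right lobe")
print(render_report(semantic_tree))
```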
The training data in current research typically includes only a few hundred[16] or a few thousand[14,15] cases, with a maximum of 20000 cases[12]. The lack of sufficient data to train models is a significant constraint on report generation, and accumulating data and collecting reports is time-consuming. We need to propose methods to artificially increase data volume using limited data.
When patients undergo hospital examinations, they often refer to historical reports to understand more about their medical history, enabling more comprehensive and continuous assessments through comparative analysis. In other words, generating the current report may require referencing a variable number of historical reports (if available). However, current ultrasound diagnostic models do not consider utilizing historical reports. Exploring how to use historical reports to improve report generation is therefore a promising direction.
When doctors capture images, they sometimes record dynamic video clips (e.g., 10 seconds) in addition to static images. These video clips contain a lot of diagnostic information, but current diagnostic models do not consider utilizing video information.
For ultrasound report generation, it is not only about generating textual descriptions but also typically requires selecting 2 or 3 representative images to include in the report as evidence for communication with patients or other doctors. Current report generation models do not treat this task as a necessary subtask.
Ultrasound diagnosis often requires extensive professional knowledge, such as normal ranges for measurement items, to determine normal/abnormal conditions. This professional knowledge is an important reference.
The success of large models (language, vision, language + vision) largely benefits from predictive training tasks on massive datasets, such as predicting the next token in large language models or predicting occluded image regions in vision models.
We observe that numerical measurement results are already stored on ultrasound images after the examination. By extracting this information and placing it in the appropriate diagnostic text, we can obtain precise textual descriptions related to measurements. We can use TrOCR[18] as a tool, following the LangChain approach[16], to extract text results.
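A minimal sketch of this extraction step using the publicly released TrOCR checkpoints from Hugging Face is shown below; the crop box for the on-screen measurement overlay and the file names are assumptions that would need to be adapted to each scanner's display layout.

```python
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

# Public printed-text TrOCR checkpoint; the crop box for the on-screen
# measurement overlay is machine-specific and illustrative here.
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed")

def read_measurement_text(frame_path, box):
    """OCR the burned-in measurement annotations from one ultrasound frame."""
    region = Image.open(frame_path).convert("RGB").crop(box)
    pixel_values = processor(images=region, return_tensors="pt").pixel_values
    generated_ids = model.generate(pixel_values, max_new_tokens=32)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# Example: the lower-right corner where many scanners print caliper readouts
text = read_measurement_text("frame_01.png", box=(600, 500, 960, 600))
print(text)  # e.g. "D1 11.0 mm  D2 7.0 mm"
```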
To better exploit the correspondence between text and images, we need to work with medical professionals to annotate which images support each statement in a report. After thorough research, we will design an effective mechanism for learning the relationship between text and images. Since annotation is labor-intensive, we can start with 50 cases to evaluate the feasibility of this approach.
For each case, the examination item is known, and the report template is also known. Possible approaches to utilizing templates include: (1) Using the report template as input and having the model learn how to modify the template based on images; and (2) Treating template prediction as an intermediate task, where the model automatically infers the examination and applicable report template for each case's images.
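For approach (1), a training example can simply pair the examination images and the known template with the physician's final report, with a prompt that asks the model to edit rather than write from scratch. The data class, file names, and prompt wording below are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TemplateEditingExample:
    """One training example for approach (1): edit a known template given the images."""
    image_paths: List[str]      # ultrasound frames acquired during the examination
    template: str               # standard template for the known examination item
    target_report: str          # the report the physician actually issued

def build_prompt(example: TemplateEditingExample) -> str:
    # The model is asked to delete irrelevant sentences, fill in measurements,
    # and revise statements, rather than to compose the report from scratch.
    return (
        "You are given an ultrasound report template and the examination images.\n"
        f"Template:\n{example.template}\n"
        "Revise the template so that it describes the findings in the images."
    )

example = TemplateEditingExample(
    image_paths=["frame_01.png", "frame_02.png"],
    template="Liver: normal size and shape, homogeneous echotexture, no focal lesion.",
    target_report="Liver: enlarged, a hyperechoic nodule of <2DS> in the right lobe.",
)
print(build_prompt(example))
```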
We need to propose methods to artificially increase data volume using limited data. We can first split existing reports into interrelated text statements and determine the corresponding images for these statements (through doctor annotation), generating (text fragment, image) pairs. We can then use these pairs as basic units to split and reassemble examination reports from different cases, creating a large number of pseudo-cases for model training. The training process can be divided into two stages: Pre-training on pseudo-cases to improve the model's generalization ability, followed by fine-tuning on real cases.
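A minimal sketch of the reassembly step is given below: annotated (sentence, image) units are grouped, here by organ, an assumption made to keep fragments clinically coherent, and randomly sampled into pseudo-cases for pre-training. All contents and grouping rules are illustrative.

```python
import random
from typing import Dict, List, Tuple

# Each unit is a (report sentence, associated image path) pair obtained from
# physician annotation of real cases; the examples here are illustrative.
Unit = Tuple[str, str]

def make_pseudo_cases(units_by_organ: Dict[str, List[Unit]], n_cases: int,
                      fragments_per_case: int = 4, seed: int = 0) -> List[List[Unit]]:
    """Reassemble annotated (sentence, image) units into synthetic training cases."""
    rng = random.Random(seed)
    organs = list(units_by_organ)
    pseudo_cases = []
    for _ in range(n_cases):
        organ = rng.choice(organs)                 # keep fragments anatomically consistent
        pool = units_by_organ[organ]
        k = min(fragments_per_case, len(pool))
        pseudo_cases.append(rng.sample(pool, k))
    return pseudo_cases

units_by_organ = {
    "liver": [("The liver is enlarged.", "liver_01.png"),
              ("A hyperechoic nodule of <2DS> is seen in the right lobe.", "liver_02.png")],
    "thyroid": [("Both thyroid lobes are normal in size.", "thyroid_01.png")],
}
cases = make_pseudo_cases(units_by_organ, n_cases=2)
```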
We can use historical reports along with current ultrasound images as input to improve the quality of report generation. This requires adjustments during training, which need to be explored.
We can explicitly model the image selection process, allowing the model to automatically select 2 or 3 images to include in the report, making the report generation task more aligned with real-world applications. This can also serve as a supervised prediction task to improve the identification of important information and the pairing of images and text.
We can fine-tune LLM models to learn this prior knowledge.
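One way to package such prior knowledge (e.g., the normal measurement ranges mentioned earlier) for fine-tuning is to convert it into instruction-response records. The reference ranges below are illustrative placeholders and must be taken from clinical guidelines in practice.

```python
import json

# Illustrative reference ranges (mm); real values must come from clinical guidelines.
NORMAL_RANGES_MM = {"portal vein diameter": (0, 13), "common bile duct diameter": (0, 6)}

def make_instruction_example(item, value_mm):
    """Turn one reference-range fact into an instruction-tuning record."""
    low, high = NORMAL_RANGES_MM[item]
    verdict = "within the normal range" if low <= value_mm <= high else "abnormal"
    return {
        "instruction": f"The measured {item} is {value_mm} mm. Is this normal?",
        "output": f"A {item} of {value_mm} mm is {verdict} (reference: up to {high} mm).",
    }

records = [make_instruction_example("portal vein diameter", 11),
           make_instruction_example("common bile duct diameter", 8)]
print(json.dumps(records, indent=2))
```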
We need to conduct in-depth research on real-world ultrasound examination scenarios and the useful information contained in images (e.g., ultrasound parameter settings, measurement text, and even timestamps to determine the sequence of image acquisition) to explore how to define effective predictive tasks.
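As one example of a candidate predictive task, the on-image timestamps mentioned above could supervise a frame-order objective: per-examination image features are shuffled and the model is trained to recover the acquisition order. The sketch below only prepares such a batch and is an illustrative assumption rather than a validated pretext task.

```python
import torch

def make_order_prediction_batch(frame_feats):
    """Shuffle one examination's frame features and return the permutation targets.

    frame_feats: (N, D) features for the N frames of a single examination,
    ordered by their on-image acquisition timestamps.
    """
    n = frame_feats.size(0)
    perm = torch.randperm(n)
    shuffled = frame_feats[perm]   # model input: frames in random order
    target = perm                  # original (chronological) index of each shuffled frame
    return shuffled, target

# A sequence model would read `shuffled` and predict `target`, e.g. with an
# N-way classification head per position trained with cross-entropy.
```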
This paper acknowledges several limitations that may impact the comprehensiveness of our analysis:
Lack of experimental comparisons: A significant limitation of this study is the absence of experimental comparisons of the performance of the various methods discussed, such as CNN-LSTM, Transformer-based models, and VLMs. This lack of empirical evaluation restricts our ability to draw definitive conclusions about the relative effectiveness of these approaches.
Ethical considerations: The paper does not adequately address various ethical considerations, such as algorithmic bias and the implications of AI errors regarding accountability. These issues warrant deeper exploration, especially as such systems move closer to routine clinical use.
Clinical application discussion: The clinical application of the discussed AI models is not thoroughly explored. There is a pressing need to consider existing commercial ultrasound AI tools, such as those developed by Butterfly Network and Caption Health, along with the real-world challenges of adoption, including physician trust, training requirements, and reimbursement models. Integrating insights from industry trends and the perspectives of clinicians would provide a more holistic view of the current state of the field and the translational barriers that need to be addressed.
By acknowledging these limitations, we aim to highlight areas for future research and discussion that could enhance the understanding and application of AI in ultrasound report generation.
AI-assisted ultrasound report generation has evolved from early-stage simple image descriptions to complex tasks involving multimodal fusion. However, breakthroughs are still needed in the following areas: (1) Technological advancement, such as real-time report generation and stronger multimodal models (e.g., GPT-4 vision); (2) Personalized diagnostics that tailor reports to individual patients; and (3) Clinical integration that standardizes reports, reduces physician workloads, and supports ultrasound decision-making.
An important aspect that must also be addressed is model interpretability and reliability. Given the high-risk environment of clinical decision-making, clinicians must be able to understand and verify why a model produces a given finding before relying on it.
The authors acknowledge the valuable input of Dr. Kai-Kai Zhao during the manuscript review process.
1. | Jing BY, Xie PT, Xing E. On the Automatic Generation of Medical Imaging Reports. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2018; 2577–2586. [DOI] [Full Text] |
2. | Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I. Attention is all you need. Advances in neural information processing systems, 2017. Preprint. [DOI] [Full Text] |
3. | Brown TB, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A, Agarwal S, Herbert-Voss A, Krueger G, Henighan T, Child R, Ramesh A, Ziegler DM, Wu J, Winter C, Hesse C, Chen M, Sigler E, Litwin M, Gray S, Chess B, Clark J, Berner C, McCandlish S, Radford A, Sutskever I, Amodei D. Language models are few-shot learners; 2020. Preprint. [DOI] [Full Text] |
4. | Liu HT, Li CY, Wu QY, Lee YJ. Visual instruction tuning. Advances in neural information processing systems, 2024; 36. [DOI] [Full Text] |
5. | Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G, Askell A, Mishkin P, Clark J, Krueger G, Sutskever I. Learning transferable visual models from natural language supervision; 2021. Preprint. [DOI] [Full Text] |
6. | Jia C, Yang YF, Xia Y, Chen YT, Parekh Z, Pham H, Le Q, Sung YH, Li Z, Duerig T. Scaling up visual and vision-language representation learning with noisy text supervision. Proceedings of the 38th International Conference on Machine Learning, 2021; 139: 4904-4916. Available from: https://proceedings.mlr.press/v139/jia21b.html. |
7. | Alayrac JB, Donahue J, Luc P, Miech A, Barr I, Hasson Y, Lenc K, Mensch A, Millican K, Reynolds M, Ring R, Rutherford E, Cabi S, Han T, Gong Z, Samangooei S, Monteiro M, Menick J, Borgeaud S, Brock A, Nematzadeh A, Sharifzadeh S, Binkowski M, Barreira R, Vinyals O, Zisserman A, Simonyan K. Flamingo: a visual language model for few-shot learning. In Proceedings of Neural Information Processing Systems (NeurIPS), 2022. [DOI] [Full Text] |
8. | OpenAI. Hello GPT-4o. OpenAI Blog. [Internet] – [cited 2025 April 14]. Available from: https://openai.com/index/hello-gpt-4o/. |
9. | xAI. Grok 3 Beta — The Age of Reasoning Agents. xAI Blog. [Internet] – [cited 2025 April 14]. Available from: https://x.ai/news/grok-3. |
10. | Anthropic. Claude 3 haiku: our fastest model yet. 2024. [Internet] – [cited 2025 April 14]. Available from: https://www.anthropic.com/news/claude-3-haiku. |
11. | Guo C, Li M, Xu J, Bai L. Ultrasonic characterization of small defects based on Res-ViT and unsupervised domain adaptation. Ultrasonics. 2024;137:107194. [PubMed] [DOI] [Full Text] |
12. | Li J, Su T, Zhao B, Lv F, Wang Q, Navab N, Hu Y, Jiang Z. Ultrasound Report Generation With Cross-Modality Feature Alignment via Unsupervised Guidance. IEEE Trans Med Imaging. 2025;44:19-30. [PubMed] [DOI] [Full Text] |
13. | Guo XQ, Men QH, Noble JA. MMSummary: Multimodal Summary Generation for Fetal Ultrasound Video. In: Linguraru MG, editors. Medical Image Computing and Computer Assisted Intervention – MICCAI 2024. Cham: Springer, 2024. [DOI] [Full Text] |
14. | Wang J, Fan JY, Zhou M, Zhang YZ, Shi MY. A labeled ophthalmic ultrasound dataset with medical report generation based on cross-modal deep learning; 2024. Preprint. [DOI] [Full Text] |
15. | Yang S, Niu J, Wu J, Wang Y, Liu X, Li Q. Automatic ultrasound image report generation with adaptive multimodal attention mechanism. Neurocomputing. 2021;427:40-49. [DOI] [Full Text] |
16. | Huh J, Park HJ, Ye JC. Breast ultrasound report generation using LangChain; 2023. Preprint. [DOI] [Full Text] |
17. | Liu LH, Wang M, Dong YJ, Zhao WL, Yang J, Su JW. Semantic Tree Driven Thyroid Ultrasound Report Generation by Voice Input. In: Arabnia HR, Deligiannidis L, Shouno H, Tinetti FG, Tran QN, editors. Advances in Computer Vision and Computational Biology. Transactions on Computational Science and Computational Intelligence. Cham: Springer, 2021. [DOI] [Full Text] |
18. | Li MH, Lv TC, Chen JY, Cui L, Lu YJ, Florencio D, Zhang C, Li ZJ, Wei FR. TrOCR: Transformer-Based Optical Character Recognition with Pre-trained Models. AAAI. 2023;37:13094-13102. [DOI] [Full Text] |