1. Zhang H, Hu H, Zhou D, Zhang X, Cao B. Compact CNN module balancing between feature diversity and redundancy. Neural Netw 2025;188:107456. PMID: 40220561. DOI: 10.1016/j.neunet.2025.107456.
Abstract
Feature diversity and redundancy play a crucial role in enhancing a model's performance, although their effect on network design remains underexplored. Herein, we introduce BDRConv, a compact convolutional neural network (CNN) module that establishes a balance between feature diversity and redundancy to generate and retain features with moderate redundancy and high diversity while reducing computational costs. Specifically, input features are divided into a main part and an expansion part. The main part extracts intrinsic and diverse features, while the expansion part enhances diverse information extraction. Experiments on the CIFAR-10, ImageNet, and MS COCO datasets demonstrate that BDRConv-equipped networks outperform state-of-the-art methods in accuracy, with significantly fewer floating-point operations (FLOPs) and parameters. In addition, the BDRConv module is a plug-and-play component that can easily replace existing convolution modules, offering potential for broader CNN applications.
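The channel-split idea in this abstract (a main part plus an expansion part of the input features) can be illustrated with a short PyTorch sketch; the split ratio and the specific convolutions below are assumptions for illustration, not the published BDRConv design.

```python
# Minimal sketch of a channel-split convolution block in the spirit of BDRConv.
import torch
import torch.nn as nn

class SplitConvBlock(nn.Module):
    def __init__(self, channels: int, main_ratio: float = 0.5):
        super().__init__()
        self.main_ch = int(channels * main_ratio)       # "main part" channels
        self.exp_ch = channels - self.main_ch           # "expansion part" channels
        # Main part: a standard 3x3 conv extracting intrinsic, diverse features.
        self.main_conv = nn.Conv2d(self.main_ch, self.main_ch, 3, padding=1, bias=False)
        # Expansion part: a cheap depthwise 3x3 conv adding diverse information at low cost.
        self.exp_conv = nn.Conv2d(self.exp_ch, self.exp_ch, 3, padding=1,
                                  groups=self.exp_ch, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        main, exp = torch.split(x, [self.main_ch, self.exp_ch], dim=1)
        out = torch.cat([self.main_conv(main), self.exp_conv(exp)], dim=1)
        return self.act(self.bn(out))

if __name__ == "__main__":
    block = SplitConvBlock(64)
    y = block(torch.randn(1, 64, 32, 32))
    print(y.shape)  # torch.Size([1, 64, 32, 32])
```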
Affiliation(s)
- Huihuang Zhang: College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou, 310023, China; Key Laboratory of Visual Media Intelligent Processing Technology of Zhejiang Province, Hangzhou, 310023, China
- Haigen Hu: College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou, 310023, China; Key Laboratory of Visual Media Intelligent Processing Technology of Zhejiang Province, Hangzhou, 310023, China
- Deming Zhou: College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou, 310023, China; Key Laboratory of Visual Media Intelligent Processing Technology of Zhejiang Province, Hangzhou, 310023, China
- Xiaoqin Zhang: College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou, 310023, China; Key Laboratory of Visual Media Intelligent Processing Technology of Zhejiang Province, Hangzhou, 310023, China
- Bin Cao: College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou, 310023, China; Key Laboratory of Visual Media Intelligent Processing Technology of Zhejiang Province, Hangzhou, 310023, China
2. Ahsan MJ, Abdel-Aty M, Abdelrahman AS. Can mid-block pedestrian signals (MPS) provide greater safety benefits than other mid-block pedestrian crossings? Accid Anal Prev 2025;218:108105. PMID: 40373590. DOI: 10.1016/j.aap.2025.108105.
Abstract
The Florida Department of Transportation (FDOT) has recently implemented a new midblock signal system known as the Midblock Pedestrian Signal (MPS) to enhance pedestrian safety. This study evaluates the effectiveness of MPSs by comparing their safety performance with other existing midblock crossing treatments. Portable CCTV video data were collected from 14 MPS-equipped locations, in addition to five reference sites, to calculate Conflict Modification Factors (CoMFs) using vehicle-pedestrian conflict data. Advanced computer vision techniques, specifically the RT-DETR model for object detection and the ByteTrack algorithm for tracking, were utilized to process the video data. The study employed both Cross-Sectional (CS) and Before-After methods, incorporating the Comparison Group (CG) and Empirical Bayesian (EB) approaches to evaluate the safety impacts of MPSs. To address repeated observations at the same locations and minimize bias, Safety Performance Functions (SPFs) were developed using Generalized Estimating Equations (GEE) with a Negative Binomial distribution, which proved more robust than traditional Generalized Linear Models (GLMs). The results demonstrate that MPS systems outperform Rectangular Rapid Flashing Beacons (RRFBs) and Flashing Beacons in reducing pedestrian-vehicle conflicts. Furthermore, when compared to Pedestrian Hybrid Beacons (PHBs), which share similar functionalities but differ in signal phase management, MPS systems provided additional safety benefits. Compared to PHBs, MPS systems reduced serious conflicts by 26-33% and all conflicts by 31-33% across the EB, CG, and CS methods. These reductions highlight the superior safety performance of MPS systems compared to PHBs and other midblock crossing treatments. With their adaptability, cost-effectiveness, and enhanced safety benefits, MPS systems are a promising alternative for upgrading existing pedestrian crossings or installing new signal systems to improve pedestrian safety at midblock locations.
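For readers unfamiliar with the modeling step, a minimal sketch of fitting a negative binomial Safety Performance Function with GEE (to handle repeated observations per site) is shown below using statsmodels; the variable names and synthetic data are placeholders, not the study's data.

```python
# Illustrative GEE / negative binomial SPF fit; synthetic data for demonstration only.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n_sites, n_periods = 19, 6   # e.g., 14 treated sites + 5 reference sites, repeated periods
df = pd.DataFrame({
    "site": np.repeat(np.arange(n_sites), n_periods),
    "log_aadt": rng.normal(9.5, 0.4, n_sites * n_periods),        # exposure proxy
    "ped_volume": rng.uniform(100, 300, n_sites * n_periods),
    "mps_installed": np.repeat(rng.integers(0, 2, n_sites), n_periods),
})
df["conflicts"] = rng.poisson(2 + 0.01 * df["ped_volume"])        # toy conflict counts

exog = sm.add_constant(df[["log_aadt", "ped_volume", "mps_installed"]])
model = sm.GEE(
    df["conflicts"], exog, groups=df["site"],          # repeated observations per site
    family=sm.families.NegativeBinomial(alpha=1.0),    # over-dispersed count outcome
    cov_struct=sm.cov_struct.Exchangeable(),           # within-site correlation
)
result = model.fit()
print(result.summary())
```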
Affiliation(s)
- Md Jamil Ahsan: Department of Civil, Environmental and Construction Engineering, University of Central Florida, Orlando, FL 32816, USA
- Mohamed Abdel-Aty: Department of Civil, Environmental and Construction Engineering, University of Central Florida, Orlando, FL 32816, USA
- Ahmed S Abdelrahman: Department of Civil, Environmental and Construction Engineering, University of Central Florida, Orlando, FL 32816, USA
3. Pawar P, McManus B, Anthony T, Yang J, Kerwin T, Stavrinos D. Artificial intelligence automated solution for hazard annotation and eye tracking in a simulated environment. Accid Anal Prev 2025;218:108075. PMID: 40339543. PMCID: PMC12123859. DOI: 10.1016/j.aap.2025.108075.
Abstract
High-fidelity simulators and sensors are commonly used in research to create immersive environments for studying real-world problems. This setup records detailed data, generating large datasets. In driving research, a full-scale car model repurposed as a driving simulator allows human subjects to navigate realistic driving scenarios. Data from these experiments are collected in raw form, requiring extensive manual annotation of roadway elements such as hazards and distractions. This process is often time-consuming, labor-intensive, and repetitive, causing delays in research progress. This paper proposes an AI-driven solution to automate these tasks, enabling researchers to focus on analysis and advance their studies efficiently. The solution builds on previous driving behavior research using a high-fidelity full-cab simulator equipped with gaze-tracking cameras. It extends the capabilities of the earlier system described in Pawar's (2021) "Hazard Detection in Driving Simulation using Deep Learning", which performed only hazard detection. The enhanced system now integrates both hazard annotation and gaze-tracking data. By combining vehicle handling parameters with drivers' visual attention data, the proposed method provides a unified, detailed view of participants' driving behavior across various simulated scenarios. This approach streamlines data analysis, accelerates research timelines, and enhances understanding of driving behavior.
Affiliation(s)
- Piyush Pawar: Institute for Social Science Research, University of Alabama, 306 Paul W. Bryant Drive East, Tuscaloosa, AL 35401, USA
- Benjamin McManus: Institute for Social Science Research, University of Alabama, 306 Paul W. Bryant Drive East, Tuscaloosa, AL 35401, USA
- Thomas Anthony: Analytical AI, 1500 1st Ave. N, Birmingham, AL 35022, USA
- Jingzhen Yang: Department of Pediatrics, College of Medicine, The Ohio State University, Center for Injury Research and Policy, Abigail Wexner Research Institute at Nationwide Children's Hospital, 700 Children's Dr. RBIII-WB5403, Columbus, OH 43205, USA
- Thomas Kerwin: Driving Simulation Laboratory, The Ohio State University, 1305 Kinnear Road, Suite 194, Columbus, OH 43212, USA
- Despina Stavrinos: Institute for Social Science Research, University of Alabama, 306 Paul W. Bryant Drive East, Tuscaloosa, AL 35401, USA
4. Chen H, Wang Z, Tao R, Wei H, Xie X, Sugiyama M, Raj B, Wang J. Impact of Noisy Supervision in Foundation Model Learning. IEEE Trans Pattern Anal Mach Intell 2025;47:5690-5707. PMID: 40117144. DOI: 10.1109/tpami.2025.3552309.
Abstract
Foundation models are usually pre-trained on large-scale datasets and then adapted to different downstream tasks through tuning. This pre-training and then fine-tuning paradigm has become a standard practice in deep learning. However, the large-scale pre-training datasets, often inaccessible or too expensive to handle, can contain label noise that may adversely affect the generalization of the model and pose unexpected risks. This paper stands out as the first work to comprehensively understand and analyze the nature of noise in pre-training datasets and then effectively mitigate its impacts on downstream tasks. Specifically, through extensive experiments of fully-supervised and image-text contrastive pre-training on synthetic noisy ImageNet-1K, YFCC15M, and CC12M datasets, we demonstrate that, while slight noise in pre-training can benefit in-domain (ID) performance, where the training and testing data share a similar distribution, it always deteriorates out-of-domain (OOD) performance, where training and testing distributions are significantly different. These observations are agnostic to scales of pre-training datasets, pre-training noise types, model architectures, pre-training objectives, downstream tuning methods, and downstream applications. We empirically ascertain that the reason behind this is that the pre-training noise shapes the feature space differently. We then propose a tuning method (NMTune) that applies an affine transformation to the feature space to mitigate the malignant effect of noise and improve generalization; it is applicable in both parameter-efficient and black-box tuning manners, considering that one may not be able to access or fully fine-tune the pre-trained models. We additionally conduct extensive experiments on popular vision and language models, including APIs, which are supervised and self-supervised pre-trained on realistic noisy data for evaluation. Our analysis and results demonstrate the importance of this novel and fundamental research direction, which we term Noisy Model Transfer Learning.
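As a rough illustration of black-box feature-space tuning of the kind described here, the sketch below trains a small affine head on frozen pre-trained features; the layer sizes and objective are assumptions and do not reproduce the actual NMTune method.

```python
# Sketch: lightweight affine head on frozen (or API-extracted) features.
import torch
import torch.nn as nn

class AffineFeatureHead(nn.Module):
    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.affine = nn.Linear(feat_dim, feat_dim)   # reshapes the (noisy) feature space
        self.norm = nn.LayerNorm(feat_dim)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, frozen_feats: torch.Tensor) -> torch.Tensor:
        z = self.norm(self.affine(frozen_feats))
        return self.classifier(z)

# Usage: features extracted once from a frozen pre-trained model, then only the
# small head is optimized on the downstream task.
feats = torch.randn(8, 768)                 # e.g., ViT-base pooled features (placeholder)
head = AffineFeatureHead(768, num_classes=10)
logits = head(feats)
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 10, (8,)))
loss.backward()
```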
5. Zhou W, Lin K, Zheng Z, Chen D, Su T, Hu H. DRTN: Dual Relation Transformer Network with feature erasure and contrastive learning for multi-label image classification. Neural Netw 2025;187:107309. PMID: 40048756. DOI: 10.1016/j.neunet.2025.107309.
Abstract
The objective of the multi-label image classification (MLIC) task is to simultaneously identify multiple objects present in an image. Several researchers directly flatten 2D feature maps into 1D grid feature sequences and utilize a Transformer encoder to capture the correlations of grid features to learn object relationships. Although these Transformer-based methods obtain promising results, they lose spatial information. In addition, current attention-based models often focus only on salient feature regions but ignore other potentially useful features that contribute to the MLIC task. To tackle these problems, we present a novel Dual Relation Transformer Network (DRTN) for the MLIC task, which can be trained in an end-to-end manner. Concretely, to compensate for the loss of spatial information of grid features resulting from the flattening operation, we adopt a grid aggregation scheme to generate pseudo-region features, which does not require expensive additional annotations to train an object detector. Then, a new dual relation enhancement (DRE) module is proposed to capture correlations between objects using two different visual features, thereby complementing the advantages provided by both grid and pseudo-region features. After that, we design a new feature enhancement and erasure (FEE) module to learn discriminative features and mine additional potentially valuable features. By using an attention mechanism to discover the most salient feature regions and removing them with a region-level erasure strategy, our FEE module is able to mine other potentially useful features from the remaining parts. Further, we devise a novel contrastive learning (CL) module to encourage the foregrounds of salient and potential features to be closer, while pushing their foregrounds further away from background features. This compels our model to learn discriminative and valuable features more comprehensively. Extensive experiments demonstrate that the DRTN method surpasses current MLIC models on three challenging benchmarks, i.e., the MS-COCO 2014, PASCAL VOC 2007, and NUS-WIDE datasets.
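The region-level erasure idea (find the most attended region, zero it out, and mine the remainder) can be sketched as follows; the window size and the way the attention map is produced are assumptions for illustration, not the DRTN implementation.

```python
# Illustrative attention-guided region erasure in the spirit of the FEE module.
import torch

def erase_most_salient_region(feat: torch.Tensor, attn: torch.Tensor, k: int = 3) -> torch.Tensor:
    """feat: (B, C, H, W) feature maps; attn: (B, 1, H, W) spatial attention."""
    B, _, H, W = attn.shape
    flat_idx = attn.view(B, -1).argmax(dim=1)          # peak attention location per sample
    mask = torch.ones_like(attn)
    for b in range(B):
        y, x = int(flat_idx[b]) // W, int(flat_idx[b]) % W
        y0, y1 = max(y - k, 0), min(y + k + 1, H)
        x0, x1 = max(x - k, 0), min(x + k + 1, W)
        mask[b, :, y0:y1, x0:x1] = 0.0                 # erase a (2k+1) x (2k+1) window
    return feat * mask                                  # remaining features for further mining

feat = torch.randn(2, 256, 14, 14)
attn = feat.mean(dim=1, keepdim=True)                  # toy attention from channel pooling
erased = erase_most_salient_region(feat, attn)
print(erased.shape)
```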
Affiliation(s)
- Wei Zhou: School of Electronics and Information Technology, Sun Yat-sen University, Guangzhou, 510006, Guangdong, China
- Kang Lin: School of Electronics and Information Technology, Sun Yat-sen University, Guangzhou, 510006, Guangdong, China
- Zhijie Zheng: School of Electronics and Information Technology, Sun Yat-sen University, Guangzhou, 510006, Guangdong, China
- Dihu Chen: School of Electronics and Information Technology, Sun Yat-sen University, Guangzhou, 510006, Guangdong, China
- Tao Su: School of Electronics and Information Technology, Sun Yat-sen University, Guangzhou, 510006, Guangdong, China
- Haifeng Hu: School of Electronics and Information Technology, Sun Yat-sen University, Guangzhou, 510006, Guangdong, China
6. Wu X, Wang L, Huang J. AnimalRTPose: Faster cross-species real-time animal pose estimation. Neural Netw 2025;190:107685. PMID: 40516380. DOI: 10.1016/j.neunet.2025.107685.
Abstract
Recent advancements in computer vision have facilitated the development of sophisticated tools for analyzing complex animal behaviors, yet the diversity of animal morphology and environmental complexities present significant challenges to real-time animal pose estimation. To address these challenges, we introduce AnimalRTPose, a one-stage model designed for cross-species real-time animal pose estimation. At its core, AnimalRTPose leverages CSPNeXt†, a novel backbone network that integrates depthwise separable convolution with skip connections for high-frequency feature extraction, a channel attention mechanism (CAM) to enhance the fusion of high-frequency and low-frequency features, and spatial pyramid pooling (SPP) to capture multi-scale contextual information. This architecture enables robust feature representation across varying spatial resolutions, enhancing adaptability to diverse species and environments. Additionally, AnimalRTPose incorporates an efficient multi-scale feature fusion module that dynamically balances local detail and global structural consistency, ensuring high accuracy and robustness in pose estimation. Designed for scalability and versatility, AnimalRTPose supports single-animal, multi-animal, cross-species, and few-shot scenarios. Specifically, AnimalRTPose-N achieves 476 FPS on NVIDIA RTX 2080Ti, 769 FPS on NVIDIA RTX 3090, and 1111 FPS on NVIDIA A800, while demonstrating high throughput on edge devices with 196 FPS on the NVIDIA Jetson™ AGX Orin Developer Kit (275 TOPS, 15 W to 60 W), 77 FPS on the Raspberry Pi 5 with AI HAT+ (26 TOPS, 25 W), and 64 FPS on the Atlas 200I Developer Kit A2 (8 TOPS, 24 W), all with a 640 × 640 input resolution. These results surpass all existing one-stage models, showcasing its superior performance in real-time animal pose estimation. AnimalRTPose is thus highly applicable for scenarios requiring real-time animal behavior monitoring. Further details on the model configuration and dataset are available on the AnimalRTPose project website.
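Two of the building blocks named in this abstract, a depthwise separable convolution with a skip connection and a channel attention mechanism, can be sketched as follows; this is an illustrative approximation rather than the released CSPNeXt† code, and the reduction ratio and activation are assumptions.

```python
# Sketch: depthwise-separable block with skip connection and squeeze-excite-style channel attention.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))                 # global average pool -> channel weights
        return x * w[:, :, None, None]

class DWSeparableBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, 3, padding=1, groups=channels, bias=False)
        self.pointwise = nn.Conv2d(channels, channels, 1, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.SiLU(inplace=True)
        self.cam = ChannelAttention(channels)

    def forward(self, x):
        out = self.act(self.bn(self.pointwise(self.depthwise(x))))
        return x + self.cam(out)                        # skip connection preserves input content

x = torch.randn(1, 64, 80, 80)
print(DWSeparableBlock(64)(x).shape)                    # torch.Size([1, 64, 80, 80])
```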
Affiliation(s)
- Xin Wu: School of Physics, Northeast Normal University, Renmin Street 5268, Changchun, Jilin, 130024, China
- Lianming Wang: Yazhou Bay Innovation Institute of Hainan Tropical Ocean University, Yucai Road 1, Sanya, 572022, Hainan, China
- Jipeng Huang: School of Physics, Northeast Normal University, Renmin Street 5268, Changchun, Jilin, 130024, China
7. Vogg R, Lüddecke T, Henrich J, Dey S, Nuske M, Hassler V, Murphy D, Fischer J, Ostner J, Schülke O, Kappeler PM, Fichtel C, Gail A, Treue S, Scherberger H, Wörgötter F, Ecker AS. Computer vision for primate behavior analysis in the wild. Nat Methods 2025;22:1154-1166. PMID: 40211003. DOI: 10.1038/s41592-025-02653-y.
Abstract
Advances in computer vision and increasingly widespread video-based behavioral monitoring are currently transforming how we study animal behavior. However, there is still a gap between the prospects and practical application, especially in videos from the wild. In this Perspective, we aim to present the capabilities of current methods for behavioral analysis, while at the same time highlighting unsolved computer vision problems that are relevant to the study of animal behavior. We survey state-of-the-art methods for computer vision problems relevant to the video-based study of individualized animal behavior, including object detection, multi-animal tracking, individual identification and (inter)action understanding. We then review methods for effort-efficient learning, one of the challenges from a practical perspective. In our outlook on the emerging field of computer vision for animal behavior, we argue that the field should develop approaches to unify detection, tracking, identification and (inter)action understanding in a single, video-based framework.
Affiliation(s)
- Richard Vogg: Institute of Computer Science and Campus Institute Data Science, University of Göttingen, Göttingen, Germany
- Timo Lüddecke: Institute of Computer Science and Campus Institute Data Science, University of Göttingen, Göttingen, Germany
- Jonathan Henrich: Chairs of Statistics and Econometrics and Campus Institute Data Science, University of Göttingen, Göttingen, Germany
- Sharmita Dey: Institute of Computer Science and Campus Institute Data Science, University of Göttingen, Göttingen, Germany
- Matthias Nuske: Department for Computational Neuroscience, Third Physics Institute, University of Göttingen, Göttingen, Germany
- Valentin Hassler: Institute of Computer Science and Campus Institute Data Science, University of Göttingen, Göttingen, Germany
- Derek Murphy: Cognitive Ethology Laboratory, German Primate Center, Leibniz Institute for Primate Research, Göttingen, Germany; Department for Primate Cognition, Johann-Friedrich-Blumenbach Institute, University of Göttingen, Göttingen, Germany
- Julia Fischer: Cognitive Ethology Laboratory, German Primate Center, Leibniz Institute for Primate Research, Göttingen, Germany; Department for Primate Cognition, Johann-Friedrich-Blumenbach Institute, University of Göttingen, Göttingen, Germany; Leibniz ScienceCampus, German Primate Center, Leibniz Institute for Primate Research, Göttingen, Germany; Bernstein Center for Computational Neuroscience, University of Göttingen, Göttingen, Germany
- Julia Ostner: Leibniz ScienceCampus, German Primate Center, Leibniz Institute for Primate Research, Göttingen, Germany; Behavioral Ecology Department, University of Göttingen, Göttingen, Germany; Social Evolution in Primates Group, German Primate Center, Leibniz Institute for Primate Research, Göttingen, Germany
- Oliver Schülke: Leibniz ScienceCampus, German Primate Center, Leibniz Institute for Primate Research, Göttingen, Germany; Behavioral Ecology Department, University of Göttingen, Göttingen, Germany; Social Evolution in Primates Group, German Primate Center, Leibniz Institute for Primate Research, Göttingen, Germany
- Peter M Kappeler: Leibniz ScienceCampus, German Primate Center, Leibniz Institute for Primate Research, Göttingen, Germany; Behavioral Ecology & Sociobiology Unit, German Primate Center, Leibniz Institute for Primate Research, Göttingen, Germany; Department of Sociobiology/Anthropology, University of Göttingen, Göttingen, Germany
- Claudia Fichtel: Leibniz ScienceCampus, German Primate Center, Leibniz Institute for Primate Research, Göttingen, Germany; Behavioral Ecology & Sociobiology Unit, German Primate Center, Leibniz Institute for Primate Research, Göttingen, Germany
- Alexander Gail: Leibniz ScienceCampus, German Primate Center, Leibniz Institute for Primate Research, Göttingen, Germany; Bernstein Center for Computational Neuroscience, University of Göttingen, Göttingen, Germany; Sensorimotor Group, German Primate Center, Leibniz Institute for Primate Research, Göttingen, Germany; Sensorimotor Neuroscience and Neuroprosthetics, Georg-Elias-Müller Institute, University of Göttingen, Göttingen, Germany
- Stefan Treue: Leibniz ScienceCampus, German Primate Center, Leibniz Institute for Primate Research, Göttingen, Germany; Bernstein Center for Computational Neuroscience, University of Göttingen, Göttingen, Germany; Cognitive Neuroscience Laboratory, German Primate Center, Leibniz Institute for Primate Research, Göttingen, Germany; Biological Psychology & Cognitive Neuroscience, Georg-Elias-Müller-Institute of Psychology, University of Göttingen, Göttingen, Germany
- Hansjörg Scherberger: Leibniz ScienceCampus, German Primate Center, Leibniz Institute for Primate Research, Göttingen, Germany; Bernstein Center for Computational Neuroscience, University of Göttingen, Göttingen, Germany; Primate Neurobiology, Johann-Friedrich-Blumenbach-Institute for Zoology & Anthropology, University of Göttingen, Göttingen, Germany; Neurobiology Laboratory, German Primate Center, Leibniz Institute for Primate Research, Göttingen, Germany
- Florentin Wörgötter: Department for Computational Neuroscience, Third Physics Institute, University of Göttingen, Göttingen, Germany; Leibniz ScienceCampus, German Primate Center, Leibniz Institute for Primate Research, Göttingen, Germany; Bernstein Center for Computational Neuroscience, University of Göttingen, Göttingen, Germany
- Alexander S Ecker: Institute of Computer Science and Campus Institute Data Science, University of Göttingen, Göttingen, Germany; Leibniz ScienceCampus, German Primate Center, Leibniz Institute for Primate Research, Göttingen, Germany; Bernstein Center for Computational Neuroscience, University of Göttingen, Göttingen, Germany; Max Planck Institute for Dynamics and Self-Organization, Göttingen, Germany
8. Li H, Lo JTY. A review on the use of top-view surveillance videos for pedestrian detection, tracking and behavior recognition across public spaces. Accid Anal Prev 2025;215:107986. PMID: 40081266. DOI: 10.1016/j.aap.2025.107986.
Abstract
Top-view surveillance cameras have been adopted in public buildings such as stations and transport hubs because they maintain an unobstructed view and help protect privacy. This study provides a comprehensive review of recent developments and challenges related to the use of top-view surveillance videos in public places. The techniques using top-view images in pedestrian detection, tracking and behavior recognition are reviewed, specifically focusing on their influence on crowd control and safety management. The setup of top-view cameras and the characteristics of several available datasets are introduced. The methodologies, field of view, extracted features, region of interest, color space and used datasets for key literature are consolidated. This study contributes by identifying key advantages of top-view cameras, such as their ability to reduce occlusions and preserve privacy, while also addressing limitations, including restricted field of view and the challenges of adapting algorithms to this unique perspective. We highlight knowledge gaps in leveraging top-view cameras for transport hubs, such as the need for advanced algorithms and the lack of standardized datasets for dynamic crowd scenarios. Through this review, we aim to provide actionable insights for improving crowd management and safety measures in public buildings, especially transport hubs.
Affiliation(s)
- Hongliu Li: Department of Civil and Environmental Engineering, The Hong Kong Polytechnic University, PR China
- Jacqueline Tsz Yin Lo: Department of Civil and Environmental Engineering, The Hong Kong Polytechnic University, PR China
9. Gao G, Lv Z, Zhang Y, Qin AK. Advertising or adversarial? AdvSign: Artistic advertising sign camouflage for target physical attacking to object detector. Neural Netw 2025;186:107271. PMID: 40010291. DOI: 10.1016/j.neunet.2025.107271.
Abstract
Deep learning models are often vulnerable to adversarial attacks in both digital and physical environments. Particularly challenging are physical attacks that involve subtle, unobtrusive modifications to objects, such as patch-sticking or light-shooting, designed to maliciously alter the model's output when the scene is captured and fed into the model. Developing physical adversarial attacks that are robust, flexible, inconspicuous, and difficult to trace remains a significant challenge. To address this issue, we propose an artistic-based camouflage named Adversarial Advertising Sign (AdvSign) for the object detection task, especially in autonomous driving scenarios. Artistic patterns, such as brand logos and advertisement signs, have a high tolerance for visual incongruity and are so commonplace that they attract little suspicion. We design these patterns into advertising signs that can be attached to various mobile carriers, such as carry-bags and vehicle stickers, to create adversarial camouflage with strong untraceability. This method is particularly effective at misleading self-driving cars, for instance, causing them to misidentify these signs as 'stop' signs. Our approach combines a trainable adversarial patch with various signs of artistic patterns to create advertising patches. By leveraging the diversity and flexibility of these patterns, we draw attention away from the conspicuous adversarial elements, enhancing the effectiveness and subtlety of our attacks. We then use the CARLA autonomous-driving simulator to place these synthesized patches onto 3D flat surfaces in different traffic scenes, rendering 2D composite scene images from various perspectives. These varied scene images are then input into the target detector for adversarial training, resulting in the final trained adversarial patch. In particular, we introduce a novel loss with artistic pattern constraints, designed to differentially adjust pixels within and outside the advertising sign during training. Extensive experiments in both simulated (composite scene images with AdvSign) and real-world (printed AdvSign images) environments demonstrate the effectiveness of AdvSign in executing physical attacks on state-of-the-art object detectors, such as YOLOv5. Our training strategy, leveraging diverse scene images and varied artistic transformations to adversarial patches, enables seamless integration with multiple patterns. This enhances attack effectiveness across various physical settings and allows easy adaptation to new environments and artistic patterns.
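A heavily simplified sketch of the kind of constrained patch optimization described here follows; the loss weights, the mask-based pattern constraint, and the stand-in detector score are assumptions and do not reproduce the AdvSign loss or its CARLA rendering pipeline.

```python
# Sketch: one optimization step for a patch constrained to stay near an artistic pattern.
import torch

def advsign_step(patch, pattern, sign_mask, detection_score_fn,
                 lr=0.01, lambda_pattern=0.1):
    """patch, pattern: (3, H, W); sign_mask: (1, H, W) with 1 inside the sign region."""
    patch = patch.clone().requires_grad_(True)
    # Attack term: drive down the detector's confidence on the targeted class.
    attack_loss = detection_score_fn(patch)
    # Pattern term: keep in-sign pixels near the artistic pattern, ignore the rest.
    pattern_loss = (((patch - pattern) * sign_mask) ** 2).mean()
    loss = attack_loss + lambda_pattern * pattern_loss
    loss.backward()
    with torch.no_grad():
        patch = (patch - lr * patch.grad).clamp(0.0, 1.0)
    return patch.detach()

# `detection_score_fn` would render the patch into a scene and return the target
# detector's score; here it is just a stand-in for demonstration.
patch = torch.rand(3, 128, 128)
pattern = torch.rand(3, 128, 128)
mask = torch.ones(1, 128, 128)
patch = advsign_step(patch, pattern, mask, lambda p: p.mean())
```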
Affiliation(s)
- Guangyu Gao: School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
- Zhuocheng Lv: School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
- Yan Zhang: School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
- A K Qin: Department of Computing Technologies, Swinburne University of Technology, Hawthorn, VIC 3122, Australia
10. Falisse A, Uhlrich SD, Chaudhari AS, Hicks JL, Delp SL. Marker Data Enhancement for Markerless Motion Capture. IEEE Trans Biomed Eng 2025;72:2013-2022. PMID: 40031222. DOI: 10.1109/tbme.2025.3530848.
Abstract
OBJECTIVE: Human pose estimation models can measure movement from videos at a large scale and low cost; however, open-source pose estimation models typically detect only sparse keypoints, which leads to inaccurate joint kinematics. OpenCap, a freely available service for researchers to measure movement from videos, mitigates this issue using a deep learning model, the marker enhancer, that transforms sparse video keypoints into dense anatomical markers. However, OpenCap performs poorly on movements not included in the training data. Here, we create a much larger and more diverse training dataset and develop a more accurate and generalizable marker enhancer. METHODS: We compiled marker-based motion capture data from 1176 subjects and synthesized 1433 hours of video keypoints and anatomical markers to train the marker enhancer. We evaluated its accuracy in computing kinematics using both benchmark movement videos and synthetic data representing unseen, diverse movements. RESULTS: The marker enhancer improved kinematic accuracy on benchmark movements (mean error: 4.1°, max: 8.7°) compared to using video keypoints (mean: 9.6°, max: 43.1°) and OpenCap's original enhancer (mean: 5.3°, max: 11.5°). It also better generalized to unseen, diverse movements (mean: 4.1°, max: 6.7°) than OpenCap's original enhancer (mean: 40.4°, max: 252.0°). CONCLUSION: Our marker enhancer demonstrates both improved accuracy and generalizability across diverse movements. SIGNIFICANCE: We integrated the marker enhancer into OpenCap, thereby offering its thousands of users more accurate measurements across a broader range of movements.
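Conceptually, the marker enhancer maps a sparse keypoint set to a denser anatomical marker set; the sketch below shows such a mapping as a small fully connected network, with keypoint/marker counts and layer sizes as placeholder assumptions rather than OpenCap's actual architecture.

```python
# Sketch: mapping sparse video keypoints to dense anatomical marker positions.
import torch
import torch.nn as nn

N_KEYPOINTS, N_MARKERS = 20, 43     # assumed counts for illustration

class MarkerEnhancer(nn.Module):
    def __init__(self, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_KEYPOINTS * 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, N_MARKERS * 3),
        )

    def forward(self, keypoints_xyz: torch.Tensor) -> torch.Tensor:
        # keypoints_xyz: (batch, N_KEYPOINTS, 3) -> markers: (batch, N_MARKERS, 3)
        out = self.net(keypoints_xyz.flatten(1))
        return out.view(-1, N_MARKERS, 3)

enhancer = MarkerEnhancer()
markers = enhancer(torch.randn(4, N_KEYPOINTS, 3))
# Downstream, marker trajectories would feed inverse kinematics to obtain joint angles.
print(markers.shape)
```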
11. Xu X, Wang C, Yi Q, Ye J, Kong X, Ashraf SQ, Dearn KD, Hajiyavand AM. MedBin: A lightweight End-to-End model-based method for medical waste management. Waste Manag 2025;200:114742. PMID: 40088805. DOI: 10.1016/j.wasman.2025.114742.
Abstract
The surge in medical waste has highlighted the urgent need for cost-effective and advanced management solutions. In this paper, a novel medical waste management approach, "MedBin," is proposed for automated sorting, reusing, and recycling. A comprehensive medical waste dataset, "MedBin-Dataset," is established, comprising 2,119 original images spanning 36 categories, with samples captured in various backgrounds. The lightweight "MedBin-Net" model is introduced to enable detection and instance segmentation of medical waste, enhancing waste recognition capabilities. Experimental results demonstrate the effectiveness of the proposed approach, achieving an average precision of 0.91, recall of 0.97, and F1-score of 0.94 across all categories with just 2.51 million parameters, 5.20 billion floating-point operations (FLOPs), and 0.60 ms inference time. Additionally, the proposed method includes a World Health Organization (WHO) Guideline-Based Classifier that categorizes detected waste into 5 types, each with a corresponding disposal method, following WHO medical waste classification standards. The proposed method, along with the dedicated dataset, offers a promising solution that supports sustainable medical waste management and other related applications. To access the MedBin-Dataset samples, please visit https://universe.roboflow.com/uob-ylti8/medbin_dataset. The source code for MedBin-Net can be found at https://github.com/Wayne3918/MedbinNet.
Affiliation(s)
- Xiazhen Xu: Department of Mechanical Engineering, School of Engineering, University of Birmingham, Birmingham B15 2TT, UK
- Chenyang Wang: Department of Mechanical Engineering, School of Engineering, University of Birmingham, Birmingham B15 2TT, UK
- Qiufeng Yi: Department of Mechanical Engineering, School of Engineering, University of Birmingham, Birmingham B15 2TT, UK
- Jiaqi Ye: Department of Mechanical Engineering, School of Engineering, University of Birmingham, Birmingham B15 2TT, UK
- Xiangfei Kong: Department of Mechanical Engineering, School of Engineering, University of Birmingham, Birmingham B15 2TT, UK
- Shazad Q Ashraf: Queen Elizabeth Hospital, Mindelsohn Way, Birmingham B15 2GW, UK
- Karl D Dearn: Department of Mechanical Engineering, School of Engineering, University of Birmingham, Birmingham B15 2TT, UK
- Amir M Hajiyavand: Department of Mechanical Engineering, School of Engineering, University of Birmingham, Birmingham B15 2TT, UK
12. Guo L, Chang R, Wang J, Narayanan A, Qian P, Leong MC, Kundu PP, Senthilkumar S, Garlapati SC, Yong ECK, Pahwa RS. Artificial intelligence-enhanced 3D gait analysis with a single consumer-grade camera. J Biomech 2025;187:112738. PMID: 40378677. DOI: 10.1016/j.jbiomech.2025.112738.
Abstract
Gait analysis is crucial for diagnosing and monitoring various healthcare conditions, but traditional marker-based motion capture (MoCap) systems require expensive equipment, extensive setup, and trained personnel, limiting their accessibility in clinical and home settings. Markerless systems reduce setup complexity but often require multiple cameras and fixed calibration, and are not designed for widespread clinical adoption. This study introduces 3DGait, an artificial intelligence-enhanced markerless 3-Dimensional gait analysis system that operates with a single consumer-grade depth camera, providing a streamlined, accessible alternative. The system integrates advanced machine learning algorithms to produce 49 angular, spatial, and temporal gait biomarkers commonly used in mobility analysis. We validated 3DGait against a marker-based MoCap system (OptiTrack) using 16 trials from 8 healthy adults performing the Timed Up and Go (TUG) test. The system achieved an overall average mean absolute error (MAE) of 2.3°, with all MAEs under 5.2°, and a Pearson's correlation coefficient (PCC) of 0.75 for angular biomarkers. All spatiotemporal biomarkers had errors no greater than 15%. Temporal biomarkers (excluding TUG time) had errors under 0.03 s, corresponding to one video frame at 30 frames per second. These results demonstrate that 3DGait provides clinically acceptable gait metrics relative to marker-based MoCap, while eliminating the need for markers, calibration, or fixed camera placement. 3DGait's accessible, non-invasive, single-camera design makes it practical for use in non-specialist clinics and home settings, supporting patient monitoring and chronic disease management. Future research will focus on validating 3DGait with diverse populations, including individuals with gait abnormalities, to broaden its clinical applications.
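The two agreement statistics reported here, mean absolute error and Pearson's correlation between markerless and marker-based joint-angle curves, are straightforward to compute; the sketch below uses synthetic signals purely for illustration.

```python
# Sketch: MAE and Pearson correlation between a markerless and a reference angle trace.
import numpy as np

def mae_and_pcc(markerless: np.ndarray, mocap: np.ndarray):
    """Both inputs: 1D arrays of a joint-angle trajectory in degrees."""
    mae = np.mean(np.abs(markerless - mocap))
    pcc = np.corrcoef(markerless, mocap)[0, 1]
    return mae, pcc

t = np.linspace(0, 2 * np.pi, 200)
mocap_angle = 30 * np.sin(t)                               # reference (marker-based) signal
markerless_angle = mocap_angle + np.random.normal(0, 2.0, t.size)
mae, pcc = mae_and_pcc(markerless_angle, mocap_angle)
print(f"MAE = {mae:.2f} deg, PCC = {pcc:.2f}")
```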
Affiliation(s)
- Ling Guo: Carecam Pte Ltd., Singapore; Institute for Infocomm Research (I2R), Agency for Science, Technology and Research (A*STAR), Singapore
- Richard Chang: Institute for Infocomm Research (I2R), Agency for Science, Technology and Research (A*STAR), Singapore
- Jie Wang: Carecam Pte Ltd., Singapore; Institute for Infocomm Research (I2R), Agency for Science, Technology and Research (A*STAR), Singapore
- Amudha Narayanan: Institute for Infocomm Research (I2R), Agency for Science, Technology and Research (A*STAR), Singapore
- Peisheng Qian: Institute for Infocomm Research (I2R), Agency for Science, Technology and Research (A*STAR), Singapore
- Mei Chee Leong: Institute for Infocomm Research (I2R), Agency for Science, Technology and Research (A*STAR), Singapore
- Partha Pratim Kundu: Carecam Pte Ltd., Singapore; Institute for Infocomm Research (I2R), Agency for Science, Technology and Research (A*STAR), Singapore
- Ramanpreet Singh Pahwa: Carecam Pte Ltd., Singapore; Institute for Infocomm Research (I2R), Agency for Science, Technology and Research (A*STAR), Singapore
13. Tamin O, Moung EG, Dargham JA, Karim SAA, Ibrahim AO, Adam N, Osman HA. RGB and RGNIR image dataset for machine learning in plastic waste detection. Data Brief 2025;60:111524. PMID: 40275976. PMCID: PMC12020901. DOI: 10.1016/j.dib.2025.111524.
Abstract
The increasing volume of plastic waste is an environmental issue that demands effective sorting methods for different types of plastic. While spectral imaging offers a promising solution, it has several drawbacks, such as complexity, high cost, and limited spatial resolution. Machine learning has emerged as a potential solution for plastic waste due to its ability to analyse and interpret large volumes of data using algorithms. However, developing an efficient machine learning model requires a comprehensive dataset with information on the size, shape, colour, texture, and other features of plastic waste. Moreover, incorporating near-infrared (NIR) spectral data into machine learning models can reveal crucial information about plastic waste composition and structure that remains invisible in standard RGB images. Despite this potential, no publicly available dataset currently combines RGB with NIR spectral information for plastic waste detection. To address this research gap, we introduce a comprehensive dataset of plastic waste images captured onshore using both standard RGB and RGNIR (red, green, near-infrared) channels. Each of the two colour-space datasets includes 405 images taken along riverbanks and beaches. Both datasets underwent further pre-processing to ensure proper labelling and annotations to prepare them for training machine learning models. In total, 1,344 plastic waste objects have been annotated. The proposed dataset offers a unique resource for researchers to train machine learning models for plastic waste detection. While there are existing datasets on plastic waste, the proposed dataset sets itself apart by offering unique spectral information in the near-infrared region. It is hoped that these datasets will contribute to the advancement of the field of plastic waste detection and encourage further research in this area.
Affiliation(s)
- Owen Tamin: Faculty of Science and Natural Resources, Universiti Malaysia Sabah, Jalan UMS, Kota Kinabalu, 88400, Sabah, Malaysia
- Ervin Gubin Moung: Data Technologies and Applications (DaTA) Research Group, Faculty of Computing and Informatics, Universiti Malaysia Sabah, Jalan UMS, Kota Kinabalu, 88400, Sabah, Malaysia; Faculty of Computing and Informatics, Universiti Malaysia Sabah, Jalan UMS, Kota Kinabalu, 88400, Sabah, Malaysia
- Jamal Ahmad Dargham: Faculty of Engineering, Universiti Malaysia Sabah, Kota Kinabalu, 88400, Sabah, Malaysia
- Samsul Ariffin Abdul Karim: Institute of Strategic Industrial Decision Modelling (ISIDM), School of Quantitative Sciences, UUM College of Arts and Sciences, Universiti Utara Malaysia, 06010 Sintok, Kedah Darul Aman, Malaysia
- Ashraf Osman Ibrahim: Department of Computer and Information Sciences, Universiti Teknologi PETRONAS, 32610 Seri Iskandar, Malaysia; Positive Computing Research Center, Emerging & Digital Technologies Institute, Universiti Teknologi PETRONAS, 32610 Seri Iskandar, Malaysia
- Nada Adam: Department of Computer Science, The Applied College, Northern Border University, Arar 73213, Saudi Arabia
- Hadia Abdelgader Osman: Department of Computer Science, The Applied College, Northern Border University, Arar 73213, Saudi Arabia
14. Guru DS, Saritha N. Banana bunch image and video dataset for variety classification and grading. Data Brief 2025;60:111478. PMID: 40231149. PMCID: PMC11994905. DOI: 10.1016/j.dib.2025.111478.
Abstract
Banana, a major commercial fruit crop, holds high nutritional value and is widely consumed [4,8,10]. The global banana market, valued at USD 140.83 billion in 2024, is projected to reach USD 147.74 billion by 2030. Accurate variety identification and quality grading are crucial for marketing, pricing, and operational efficiency in food processing industries [9]. As wholesalers and food processing industries process bananas in bunches (not at the individual fruit level), our bunch-level dataset offers a more accurate assessment by capturing bunch-level characteristics, which are vital for grading. Existing datasets, such as [1,6], focus on individual bananas or have limited bunch-level data, highlighting the lack of large-scale bunch datasets. This dataset fills the gap by providing bunch-level images and videos of three widely consumed banana varieties (Elakki-bale, Pachbale, and Rasbale) from Mysuru, South Karnataka, India, serving as a valuable resource for food processing industries. Our dataset supports training machine learning models for bunch-level variety classification and grading of bananas and serves as a resource for research and education.
Affiliation(s)
- D.S. Guru: Department of Studies in Computer Science, University of Mysore, Manasagangotri, Mysuru, Karnataka, 570006, India
- Saritha N: Department of Studies in Computer Science, University of Mysore, Manasagangotri, Mysuru, Karnataka, 570006, India
15. Ebert N, Stricker D, Wasenmüller O. Enhancing robustness and generalization in microbiological few-shot detection through synthetic data generation and contrastive learning. Comput Biol Med 2025;191:110141. PMID: 40253923. DOI: 10.1016/j.compbiomed.2025.110141.
Abstract
In many medical and pharmaceutical processes, continuous hygiene monitoring is crucial, often involving the manual detection of microorganisms in agar dishes by qualified personnel. Although deep learning methods hold promise for automating this task, they frequently encounter a shortage of sufficient training data, a prevalent challenge in colony detection. To overcome this limitation, we propose a novel pipeline that combines generative data augmentation with few-shot detection. Our approach aims to significantly enhance detection performance, even with (very) limited training data. A key component of our method is a diffusion-based generator model that inpaints synthetic bacterial colonies onto real agar plate backgrounds. This data augmentation technique enhances the diversity of training data, allowing for effective model training with only 25 real images. Our method outperforms common training techniques, demonstrating a +0.45 mAP improvement compared to training from scratch and a +0.15 mAP advantage over the current SOTA in synthetic data augmentation. Additionally, we integrate a decoupled feature classification strategy, where class-agnostic detection is followed by lightweight classification via a feed-forward network, making it possible to detect and classify colonies with minimal examples. This approach achieves an AP50 score of 0.7 in a few-shot scenario on the AGAR dataset. Our method also demonstrates robustness to various image corruptions, such as noise and blur, proving its applicability in real-world scenarios. By reducing the need for large labeled datasets, our pipeline offers a scalable, efficient solution for colony detection in hygiene monitoring and biomedical research, with potential for broader applications in fields where rapid detection of new colony types is required.
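The decoupled strategy, class-agnostic detection followed by a lightweight feed-forward classifier on per-region features, can be sketched as below; the feature dimension, class count, and the pretend detector output are assumptions for illustration.

```python
# Sketch: lightweight classification head applied to features of class-agnostic detections.
import torch
import torch.nn as nn

class ColonyClassifier(nn.Module):
    """Small feed-forward network classifying per-region feature vectors."""
    def __init__(self, feat_dim: int = 256, num_species: int = 5):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(),
            nn.Linear(128, num_species),
        )

    def forward(self, region_feats: torch.Tensor) -> torch.Tensor:
        return self.mlp(region_feats)

# Pretend a class-agnostic detector returned 12 colony boxes with pooled features.
region_feats = torch.randn(12, 256)
classifier = ColonyClassifier()
species_logits = classifier(region_feats)
print(species_logits.argmax(dim=1))   # predicted class per detected colony
```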
Affiliation(s)
- Nikolas Ebert: Research and Transfer Center CeMOS, Technical University of Applied Sciences Mannheim, Mannheim, 68163, Germany; Department of Computer Science, University of Kaiserslautern-Landau (RPTU), Kaiserslautern, 67663, Germany
- Didier Stricker: Department of Computer Science, University of Kaiserslautern-Landau (RPTU), Kaiserslautern, 67663, Germany
- Oliver Wasenmüller: Research and Transfer Center CeMOS, Technical University of Applied Sciences Mannheim, Mannheim, 68163, Germany
16. Dong X, Zhang C, Wang P, Chen D, Tu GJ, Zhao S, Xiang T. A Novel Dual-Network Approach for Real-Time Liveweight Estimation in Precision Livestock Management. Adv Sci (Weinh) 2025;12:e2417682. PMID: 40285549. PMCID: PMC12165045. DOI: 10.1002/advs.202417682.
Abstract
The increasing demand for automation in livestock farming scenarios highlights the need for effective noncontact measurement methods. The current methods typically require either fixed postures and specific positions of the target animals or high computational demands, making them difficult to implement in practical situations. In this study, a novel dual-network framework is presented that extracts accurate contour information instead of segmented images from unconstrained pigs and then directly employs this information to obtain precise liveweight estimates. The experimental results demonstrate that the developed framework achieves high accuracy, providing liveweight estimates with an R2 value of 0.993. When contour information is used directly to estimate the liveweight, the real-time performance of the framework can reach 1131.6 FPS. This achievement sets a new benchmark for accuracy and efficiency in non-contact liveweight estimation. Moreover, the framework holds significant practical value, equipping farmers with a robust and scalable tool for precision livestock management in dynamic, real-world farming environments. Additionally, the Liveweight and Instance Segmentation Annotation of Pigs dataset is introduced as a comprehensive resource designed to support further advancements and validation in this field.
Affiliation(s)
- Ximing Dong: Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction of Ministry of Education & Key Laboratory of Swine Genetics and Breeding of Ministry of Agriculture, Huazhong Agricultural University, Wuhan 430070, China
- Caiming Zhang: Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction of Ministry of Education & Key Laboratory of Swine Genetics and Breeding of Ministry of Agriculture, Huazhong Agricultural University, Wuhan 430070, China
- Peiyuan Wang: Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction of Ministry of Education & Key Laboratory of Swine Genetics and Breeding of Ministry of Agriculture, Huazhong Agricultural University, Wuhan 430070, China
- Dexuan Chen: Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction of Ministry of Education & Key Laboratory of Swine Genetics and Breeding of Ministry of Agriculture, Huazhong Agricultural University, Wuhan 430070, China
- Gang Jun Tu: Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction of Ministry of Education & Key Laboratory of Swine Genetics and Breeding of Ministry of Agriculture, Huazhong Agricultural University, Wuhan 430070, China
- Shuhong Zhao: Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction of Ministry of Education & Key Laboratory of Swine Genetics and Breeding of Ministry of Agriculture, Huazhong Agricultural University, Wuhan 430070, China
- Tao Xiang: Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction of Ministry of Education & Key Laboratory of Swine Genetics and Breeding of Ministry of Agriculture, Huazhong Agricultural University, Wuhan 430070, China
17. Chen C, Lv F, Guan Y, Wang P, Yu S, Zhang Y, Tang Z. Human-Guided Image Generation for Expanding Small-Scale Training Image Datasets. IEEE Trans Vis Comput Graph 2025;31:3809-3821. PMID: 40323760. DOI: 10.1109/tvcg.2025.3567053.
Abstract
The performance of computer vision models in certain real-world applications (e.g., rare wildlife observation) is limited by the small number of available images. Expanding datasets using pre-trained generative models is an effective way to address this limitation. However, since the automatic generation process is uncontrollable, the generated images are usually limited in diversity, and some of them are undesired. In this paper, we propose a human-guided image generation method for more controllable dataset expansion. We develop a multi-modal projection method with theoretical guarantees to facilitate the exploration of both the original and generated images. Based on the exploration, users refine the prompts and re-generate images for better performance. Since directly refining the prompts is challenging for novice users, we develop a sample-level prompt refinement method to make it easier. With this method, users only need to provide sample-level feedback (e.g., which samples are undesired) to obtain better prompts. The effectiveness of our method is demonstrated through the quantitative evaluation of the multi-modal projection method, improved model performance in the case study for both classification and object detection tasks, and positive feedback from the experts.
18. Yip HF, Li Z, Zhang L, Lyu A. Large Language Models in Integrative Medicine: Progress, Challenges, and Opportunities. J Evid Based Med 2025;18:e70031. PMID: 40384541. PMCID: PMC12086751. DOI: 10.1111/jebm.70031.
Abstract
Integrating Traditional Chinese Medicine (TCM) and Modern Medicine faces significant barriers, including the absence of unified frameworks and standardized diagnostic criteria. While Large Language Models (LLMs) in Medicine hold transformative potential to bridge these gaps, their application in integrative medicine remains underexplored and methodologically fragmented. This review systematically examines the development, deployment, and challenges of LLMs in harmonizing Modern Medicine and TCM practices while identifying actionable strategies to advance this emerging field. First, it summarizes existing LLMs in the General Domain, Modern Medicine, and TCM from the perspective of their model structures, numbers of parameters, and domain-specific training data. It then highlights the limitations of existing LLMs on integrative medicine tasks through benchmark experiments, as well as the unique applications of LLMs in integrative medicine. We discuss the challenges encountered during development and propose possible solutions to mitigate them. This review synthesizes technical insights with practical clinical considerations, providing a roadmap for leveraging LLMs to bridge TCM's empirical wisdom with modern medical systems. These AI-driven synergies could redefine personalized care, optimize therapeutic outcomes, and establish new standards for holistic healthcare innovation.
Affiliation(s)
- Hiu Fung Yip: School of Chinese Medicine, Hong Kong Baptist University, Hong Kong, China; Institute of Systems Medicine and Health Sciences, Hong Kong Baptist University, Hong Kong, China
- Zeming Li: Department of Computer Science, Hong Kong Baptist University, Hong Kong, China
- Lu Zhang: Institute of Systems Medicine and Health Sciences, Hong Kong Baptist University, Hong Kong, China; Department of Computer Science, Hong Kong Baptist University, Hong Kong, China
- Aiping Lyu: School of Chinese Medicine, Hong Kong Baptist University, Hong Kong, China; Institute of Systems Medicine and Health Sciences, Hong Kong Baptist University, Hong Kong, China; Guangdong-Hong Kong-Macau Joint Lab on Chinese Medicine and Immune Disease Research, Guangzhou, China
19. Cui H, Huang D, Feng W, Li Z, Ouyang Q, Zhong C. FIAEPI-KD: A novel knowledge distillation approach for precise detection of missing insulators in transmission lines. PLoS One 2025;20:e0324524. PMID: 40445919. PMCID: PMC12124544. DOI: 10.1371/journal.pone.0324524.
Abstract
Ensuring transmission line safety is crucial, and detecting insulator defects is a key task. UAV-based insulator detection faces challenges: complex backgrounds, scale variations, and high computational costs. To address these, we propose FIAEPI-KD, a knowledge distillation framework integrating Feature Indicator Attention (FIA) and Edge Preservation Index (EPI). The method employs ResNet and FPN for multi-scale feature extraction. The FIA module dynamically focuses on multi-scale insulator edges via dual-path attention mechanisms, suppressing background interference. The EPI module quantifies edge differences between teacher and student models through gradient-aware distillation. The training objective combines Euclidean distance, KL divergence, and FIA-EPI losses to align feature-space similarities and edge details, enabling multi-level knowledge distillation. Experiments demonstrate significant improvements on our custom dataset containing farmland and waterbody scenarios. The RetinaNet-ResNet18 student model achieves a 10.5% mAP improvement, rising from 42.7% to 53.2%. Meanwhile, the Faster R-CNN-ResNet18 model achieves a 7.4% mAP improvement, rising from 42.7% to 50.1%. Additionally, the RepPoints-ResNet18 model achieves a 7.7% mAP improvement, rising from 49.6% to 57.3%. These results validate the effectiveness of FIAEPI-KD in enhancing detection accuracy across diverse detector architectures and backbone networks. On the MS COCO dataset, FIAEPI-KD outperformed mainstream distillation methods such as FKD and PKD. Ablation studies confirmed FIA's role in feature focus and EPI's edge-difference quantification: using FIA alone increased RetinaNet-ResNet50's mAP by 0.9%, while combining FIA and EPI achieved a total 3.0% improvement. The method utilizes a lightweight student model for efficient deployment, providing an effective solution for detecting insulator defects in transmission lines.
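Of the three loss terms listed in this abstract, the Euclidean (feature) and KL-divergence (logit) components are standard distillation losses and are sketched below in PyTorch; the FIA and EPI terms depend on the paper's attention and edge modules and are not reproduced, and the weights and temperature are assumptions.

```python
# Sketch: feature (MSE) + logit (KL) distillation terms of a combined KD objective.
import torch
import torch.nn.functional as F

def distillation_loss(student_feat, teacher_feat, student_logits, teacher_logits,
                      temperature=2.0, w_feat=1.0, w_kl=1.0):
    # Feature-space alignment (Euclidean distance between student and teacher maps).
    feat_loss = F.mse_loss(student_feat, teacher_feat.detach())
    # Soft-label alignment via KL divergence at temperature T.
    t = temperature
    kl_loss = F.kl_div(
        F.log_softmax(student_logits / t, dim=1),
        F.softmax(teacher_logits.detach() / t, dim=1),
        reduction="batchmean",
    ) * (t * t)
    return w_feat * feat_loss + w_kl * kl_loss

s_feat, t_feat = torch.randn(2, 256, 32, 32), torch.randn(2, 256, 32, 32)
s_logits, t_logits = torch.randn(2, 80), torch.randn(2, 80)
print(distillation_loss(s_feat, t_feat, s_logits, t_logits))
```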
Collapse
Affiliation(s)
- Hanzhi Cui
- College of Computer Engineering, Qingdao City University, Qingdao, China
| | - Dawei Huang
- School of Intelligent Equipment, Shandong University of Science and Technology, Tai’an, China
| | - Wancheng Feng
- School of Intelligent Equipment, Shandong University of Science and Technology, Tai’an, China
| | - Zhengao Li
- School of Intelligent Equipment, Shandong University of Science and Technology, Tai’an, China
| | - Qiuxue Ouyang
- College of Computer Engineering, Qingdao City University, Qingdao, China
| | - Conghan Zhong
- College of Computer Engineering, Qingdao City University, Qingdao, China
| |
Collapse
|
20
|
Sirimewan D, Dayarathna S, Raman S, Bai Y, Arashpour M. A benchmark dataset for class-wise segmentation of construction and demolition waste in cluttered environments. Sci Data 2025; 12:885. [PMID: 40436975 PMCID: PMC12120074 DOI: 10.1038/s41597-025-05243-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/13/2025] [Accepted: 05/20/2025] [Indexed: 06/01/2025] Open
Abstract
Efficient management of construction and demolition waste (CDW) is essential for enhancing resource recovery. The lack of publicly available, high-quality datasets for waste recognition limits the development and adoption of automated waste handling solutions. To facilitate data sharing and reuse, this study introduces 'CDW-Seg', a benchmark dataset for class-wise segmentation of CDW. The dataset comprises high-resolution images captured at authentic construction sites, featuring skip bins filled with a diverse mixture of CDW materials in-the-wild. It includes 5,413 manually annotated objects across ten categories: concrete, fill dirt, timber, hard plastic, soft plastic, steel, fabric, cardboard, plasterboard, and the skip bin, representing a total of 2,492,021,189 pixels. Each object was meticulously annotated through semantic segmentation, providing reliable ground-truth labels. To demonstrate the applicability of the dataset, an adapter-based fine-tuning approach was implemented using a hierarchical Vision Transformer, ensuring computational efficiency suitable for deployment in automated waste handling scenarios. The CDW-Seg has been made publicly accessible to promote data sharing, facilitate further research, and support the development of automated solutions for resource recovery.
Collapse
Affiliation(s)
- Diani Sirimewan
- Department of Civil Engineering, Faculty of Engineering, Monash University, Melbourne, Australia.
| | - Sanuwani Dayarathna
- Department of Data Science and AI, Faculty of IT, Monash University, Melbourne, Australia
| | - Sudharshan Raman
- Civil Engineering Discipline, School of Engineering, Monash University, Subang Jaya, Malaysia
| | - Yu Bai
- Department of Civil Engineering, Faculty of Engineering, Monash University, Melbourne, Australia
| | - Mehrdad Arashpour
- Department of Civil Engineering, Faculty of Engineering, Monash University, Melbourne, Australia
| |
Collapse
|
21
|
Rajabi N, Zanettin I, Ribeiro AH, Vasco M, Björkman M, Lundström JN, Kragic D. Exploring the feasibility of olfactory brain-computer interfaces. Sci Rep 2025; 15:18404. [PMID: 40419502 DOI: 10.1038/s41598-025-01488-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/06/2024] [Accepted: 05/05/2025] [Indexed: 05/28/2025] Open
Abstract
In this study, we explore the feasibility of single-trial predictions of odor registration in the brain using olfactory bio-signals. We focus on two main aspects: input data modality and the processing model. For the first time, we assess the predictability of odor registration from novel electrobulbogram (EBG) recordings, both in sensor and source space, and compare these with commonly used electroencephalogram (EEG) signals. Despite having fewer data channels, EBG shows comparable performance to EEG. We also examine whether breathing patterns contain relevant information for this task. By comparing a logistic regression classifier, which requires hand-crafted features, with an end-to-end convolutional deep neural network, we find that end-to-end approaches can be as effective as classic methods. However, due to the high dimensionality of the data, the current dataset is insufficient for either classifier to robustly differentiate odor and non-odor trials. Finally, we identify key challenges in olfactory BCIs and suggest future directions for improving odor detection systems.
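For the classic-baseline side of the comparison, a single-trial decoder with hand-crafted features might look like the following sketch; the synthetic data, channel count, and simple mean/variance features are assumptions, not the study's actual pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# X: trials x channels x time (synthetic stand-in for EBG/EEG epochs)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4, 256))   # e.g. 4 EBG channels, 256 samples per epoch
y = rng.integers(0, 2, size=200)     # odor (1) vs. non-odor (0) labels

# Hand-crafted features: per-channel mean and variance of each epoch
feats = np.concatenate([X.mean(axis=2), X.var(axis=2)], axis=1)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(clf, feats, y, cv=5)
print("single-trial decoding accuracy: %.2f +/- %.2f" % (scores.mean(), scores.std()))
```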
Collapse
Affiliation(s)
- Nona Rajabi
- Department of Intelligent Systems, KTH Royal Institute of Technology, 10044, Stockholm, Sweden.
| | - Irene Zanettin
- Department of Clinical Neuroscience, Karolinska Institute, 17165, Stockholm, Sweden
| | - Antônio H Ribeiro
- Department of Information Technology, Uppsala University, 75105, Uppsala, Sweden
| | - Miguel Vasco
- Department of Intelligent Systems, KTH Royal Institute of Technology, 10044, Stockholm, Sweden
| | - Mårten Björkman
- Department of Intelligent Systems, KTH Royal Institute of Technology, 10044, Stockholm, Sweden
| | - Johan N Lundström
- Department of Clinical Neuroscience, Karolinska Institute, 17165, Stockholm, Sweden
| | - Danica Kragic
- Department of Intelligent Systems, KTH Royal Institute of Technology, 10044, Stockholm, Sweden
| |
Collapse
|
22
|
Alabi O, Toe KKZ, Zhou Z, Budd C, Raison N, Shi M, Vercauteren T. CholecInstanceSeg: A Tool Instance Segmentation Dataset for Laparoscopic Surgery. Sci Data 2025; 12:825. [PMID: 40394065 PMCID: PMC12092654 DOI: 10.1038/s41597-025-05163-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/20/2024] [Accepted: 05/08/2025] [Indexed: 05/22/2025] Open
Abstract
In laparoscopic and robotic surgery, precise tool instance segmentation is an essential technology for advanced computer-assisted interventions. Although publicly available procedures of routine surgeries exist, they often lack comprehensive annotations for tool instance segmentation. Additionally, the majority of standard datasets for tool segmentation are derived from porcine (pig) surgeries. To address this gap, we introduce CholecInstanceSeg, the largest open-access tool instance segmentation dataset to date. Derived from the existing CholecT50 and Cholec80 datasets, CholecInstanceSeg provides novel annotations for laparoscopic cholecystectomy procedures in patients. Our dataset comprises 41.9k annotated frames extracted from 85 clinical procedures and 64.4k tool instances, each labelled with semantic masks and instance IDs. To ensure the reliability of our annotations, we perform extensive quality control, conduct label agreement statistics, and benchmark the segmentation results with various instance segmentation baselines. CholecInstanceSeg aims to advance the field by offering a comprehensive and high-quality open-access dataset for the development and evaluation of tool instance segmentation algorithms.
Collapse
Affiliation(s)
- Oluwatosin Alabi
- Kings College London, Surgical & Interventional Engineering, London, SE1 7EU, United Kingdom
| | - Ko Ko Zayar Toe
- Kings College Hospital Denmark Hill, department, London, SE5 9RS, United Kingdom
| | - Zijian Zhou
- Department of Informatics, King's College London, London, United Kingdom
| | - Charlie Budd
- Kings College London, Surgical & Interventional Engineering, London, SE1 7EU, United Kingdom
| | - Nicholas Raison
- Kings College London, Surgical & Interventional Engineering, London, SE1 7EU, United Kingdom
| | - Miaojing Shi
- Tongji University, College of Electronic and Information Engineering, Shanghai, 200092, China.
| | - Tom Vercauteren
- Kings College London, Surgical & Interventional Engineering, London, SE1 7EU, United Kingdom
| |
Collapse
|
23
|
Day AL, Wahl CB, Dos Reis R, Liao WK, Li Y, Kilic MNT, Mirkin CA, Dravid VP, Choudhary A, Agrawal A. Automated image segmentation for accelerated nanoparticle characterization. Sci Rep 2025; 15:17180. [PMID: 40382402 PMCID: PMC12085630 DOI: 10.1038/s41598-025-01337-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/21/2024] [Accepted: 04/25/2025] [Indexed: 05/20/2025] Open
Abstract
Recent developments in materials science have made it possible to synthesize millions of individual nanoparticles on a chip. However, many steps in the characterization process still require extensive human input. To address this challenge, we present an automated image processing pipeline that optimizes high-throughput nanoparticle characterization using intelligent image segmentation and coordinate generation. The proposed method can rapidly analyze each image and return optimized acquisition coordinates suitable for multiple analytical STEM techniques, including 4D-STEM, EELS, and EDS. The pipeline employs computer vision and unsupervised learning to remove the image background, segment the particle into areas of interest, and generate acquisition coordinates. This approach eliminates the need for uniform grid sampling, focusing data collection on regions of interest. We validated our approach using a diverse dataset of over 900 high-resolution grayscale nanoparticle images, achieving a 96.0% success rate based on expert-validated criteria. Using established 4D-STEM acquisition times as a baseline, our method demonstrates a 25.0 to 29.1-fold reduction in total processing time. By automating this crucial preprocessing step and optimizing data acquisition, our pipeline significantly accelerates materials characterization workflows while reducing unnecessary data collection.
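A minimal sketch of the kind of pipeline described above, assuming a grayscale micrograph as input; Otsu thresholding and k-means are illustrative choices for background removal and coordinate generation, not necessarily the algorithms used in the published pipeline.

```python
import numpy as np
from skimage import filters, measure
from sklearn.cluster import KMeans

def acquisition_coordinates(image, n_points=16):
    """Threshold out the background, keep the largest connected region
    (the particle), and cluster its pixels into acquisition coordinates
    instead of sampling on a uniform grid."""
    # Otsu threshold separates particle from background in a grayscale image
    mask = image > filters.threshold_otsu(image)

    # Keep only the largest connected component as the region of interest
    labels = measure.label(mask)
    props = measure.regionprops(labels)
    largest = max(props, key=lambda p: p.area)
    ys, xs = np.nonzero(labels == largest.label)

    # Cluster foreground pixels; cluster centres become acquisition points
    pts = np.column_stack([xs, ys]).astype(float)
    km = KMeans(n_clusters=n_points, n_init=10, random_state=0).fit(pts)
    return km.cluster_centers_  # (n_points, 2) array of (x, y) coordinates
```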
Collapse
Affiliation(s)
- Alexandra L Day
- Department of Electrical and Computer Engineering, Northwestern University, Evanston, IL, 60208, USA
| | - Carolin B Wahl
- Department of Materials Science and Engineering, Northwestern University, Evanston, IL, 60208, USA
- International Institute for Nanotechnology, Northwestern University, Evanston, IL, 60208, USA
| | - Roberto Dos Reis
- Department of Materials Science and Engineering, Northwestern University, Evanston, IL, 60208, USA
- International Institute for Nanotechnology, Northwestern University, Evanston, IL, 60208, USA
- The NUANCE Center, Northwestern University, Evanston, IL, 60208, USA
| | - Wei-Keng Liao
- Department of Electrical and Computer Engineering, Northwestern University, Evanston, IL, 60208, USA
| | - Youjia Li
- Department of Electrical and Computer Engineering, Northwestern University, Evanston, IL, 60208, USA
| | | | - Chad A Mirkin
- Department of Materials Science and Engineering, Northwestern University, Evanston, IL, 60208, USA
- International Institute for Nanotechnology, Northwestern University, Evanston, IL, 60208, USA
- Department of Chemistry, Northwestern University, Evanston, IL, 60208, USA
| | - Vinayak P Dravid
- Department of Materials Science and Engineering, Northwestern University, Evanston, IL, 60208, USA
- International Institute for Nanotechnology, Northwestern University, Evanston, IL, 60208, USA
- The NUANCE Center, Northwestern University, Evanston, IL, 60208, USA
| | - Alok Choudhary
- Department of Electrical and Computer Engineering, Northwestern University, Evanston, IL, 60208, USA
| | - Ankit Agrawal
- Department of Electrical and Computer Engineering, Northwestern University, Evanston, IL, 60208, USA.
| |
Collapse
|
24
|
Wang J, Wang T, Xu Q, Gao L, Gu G, Jia L, Yao C. RP-DETR: end-to-end rice pests detection using a transformer. PLANT METHODS 2025; 21:63. [PMID: 40382633 DOI: 10.1186/s13007-025-01381-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 02/08/2025] [Accepted: 04/25/2025] [Indexed: 05/20/2025]
Abstract
Pest infestations in rice crops greatly affect yield and quality, making early detection essential. As most rice pests affect leaves and rhizomes, visual inspection of rice for pests is becoming increasingly important. In precision agriculture, fast and accurate automatic pest identification is critical. To tackle this issue, multiple models utilizing computer vision and deep learning have been applied. Owing to its high efficiency, deep learning is now the favored approach for detecting plant pests. In this regard, the paper introduces an effective rice pest detection framework utilizing the Transformer architecture, designed to capture long-range features. The paper enhances the original model by adding the self-developed RepPConv block, which reduces information redundancy during feature extraction in the model backbone and, to a certain extent, reduces the number of model parameters. The original model's CCFM structure is enhanced by integrating the Gold-YOLO neck, improving its ability to fuse multi-scale features. Additionally, the MPDIoU-based loss function enhances the model's detection performance. Using the self-constructed high-quality rice pest dataset, the model achieves higher identification accuracy while reducing the number of parameters. The proposed RP18-DETR and RP34-DETR models reduce parameters by 16.5% and 25.8%, respectively, compared to the original RT18-DETR and RT34-DETR models. With a threshold of 0.5, the calculated average accuracy of RP18-DETR is 1.2% higher than that of RT18-DETR.
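For reference, one common formulation of the MPDIoU term mentioned above can be computed as follows; the box format, corner choice, and normalisation by the squared image dimensions are assumptions based on the general MPDIoU definition, not the paper's exact code.

```python
import torch

def mpdiou(pred, target, img_w, img_h):
    """MPDIoU for axis-aligned boxes given as (x1, y1, x2, y2): IoU minus
    the squared distances between matching corners, normalised by the
    squared image width and height."""
    # Intersection area
    x1 = torch.max(pred[:, 0], target[:, 0])
    y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2])
    y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)

    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + 1e-7)

    # Squared distances between top-left and bottom-right corner pairs
    d1 = (pred[:, 0] - target[:, 0]) ** 2 + (pred[:, 1] - target[:, 1]) ** 2
    d2 = (pred[:, 2] - target[:, 2]) ** 2 + (pred[:, 3] - target[:, 3]) ** 2
    norm = img_w ** 2 + img_h ** 2

    return iou - d1 / norm - d2 / norm  # the loss would be 1 - mpdiou
```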
Collapse
Affiliation(s)
- Jinsheng Wang
- School of Information Engineering, Huzhou University, Huzhou, 313000, China
| | - Tao Wang
- School of Information Engineering, Huzhou University, Huzhou, 313000, China
| | - Qin Xu
- School of Information Engineering, Huzhou University, Huzhou, 313000, China
| | - Lu Gao
- School of Information Engineering, Huzhou University, Huzhou, 313000, China
| | - Guosong Gu
- School of Information Science and Engineering, Jiaxing University, Jiaxing, 314001, China.
| | - Liangquan Jia
- School of Information Engineering, Huzhou University, Huzhou, 313000, China.
| | - Chong Yao
- Huzhou Central Hospital, Huzhou, 313000, China.
| |
Collapse
|
25
|
Elnady M, Abdelmunim HE. A novel YOLO LSTM approach for enhanced human action recognition in video sequences. Sci Rep 2025; 15:17036. [PMID: 40379779 DOI: 10.1038/s41598-025-01898-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/25/2025] [Accepted: 05/09/2025] [Indexed: 05/19/2025] Open
Abstract
Human Action Recognition (HAR) is a critical task in computer vision with applications in surveillance, healthcare, and human-computer interaction. This paper introduces a novel approach combining the strengths of You Only Look Once (YOLO) for feature extraction and Long Short-Term Memory (LSTM) networks for temporal modeling to achieve robust and accurate action recognition in video sequences. The YOLO model efficiently identifies key features from individual frames, enabling real-time processing, while the LSTM network captures temporal dependencies to understand sequential dynamics in human movements. The proposed YOLO-LSTM framework is evaluated on multiple publicly available HAR datasets, achieving an accuracy of 96%, precision of 96%, recall of 97%, and F1-score of 96% on the UCF101 dataset; 99% across all metrics on the KTH dataset; 100% on the WEIZMANN dataset; and 98% on the IXMAS dataset. These results demonstrate the superior performance of our approach compared to existing methods in terms of both accuracy and processing speed. Additionally, this approach effectively handles challenges such as occlusions, varying illumination, and complex backgrounds, making it suitable for real-world applications. The results highlight the potential of combining object detection and recurrent architectures for advancing state-of-the-art HAR systems.
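The detector-plus-LSTM idea can be sketched roughly as below, assuming per-frame feature vectors have already been pooled from the YOLO backbone; the feature dimensions and the single-layer LSTM are illustrative choices rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class FrameSequenceClassifier(nn.Module):
    """Per-frame feature vectors (e.g. pooled detector embeddings) are fed
    to an LSTM that models temporal dynamics, followed by a linear head."""
    def __init__(self, feat_dim=256, hidden_dim=128, n_classes=101):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_classes)

    def forward(self, frame_feats):          # (batch, time, feat_dim)
        _, (h_n, _) = self.lstm(frame_feats) # h_n: (1, batch, hidden_dim)
        return self.head(h_n[-1])            # (batch, n_classes) logits

# Usage with dummy per-frame features for a 16-frame clip
model = FrameSequenceClassifier()
logits = model(torch.randn(8, 16, 256))
print(logits.shape)  # torch.Size([8, 101])
```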
Collapse
Affiliation(s)
- Mahmoud Elnady
- Computer and Systems Engineering, Ain Shams University, El Sarayat, Cairo, 11517, Egypt.
| | - Hossam E Abdelmunim
- Computer and Systems Engineering, Ain Shams University, El Sarayat, Cairo, 11517, Egypt
| |
Collapse
|
26
|
Bondarenko A, Jumutc V, Netter A, Duchateau F, Abrão HM, Noorzadeh S, Giacomello G, Ferrari F, Bourdel N, Kirk UB, Bļizņuks D. Object Detection in Laparoscopic Surgery: A Comparative Study of Deep Learning Models on a Custom Endometriosis Dataset. Diagnostics (Basel) 2025; 15:1254. [PMID: 40428247 PMCID: PMC12110204 DOI: 10.3390/diagnostics15101254] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/02/2025] [Revised: 05/07/2025] [Accepted: 05/07/2025] [Indexed: 05/29/2025] Open
Abstract
Background: Laparoscopic surgery for endometriosis presents unique challenges due to the complexity of and variability in lesion appearances within the abdominal cavity. This study investigates the application of deep learning models for object detection in laparoscopic videos, aiming to assist surgeons in accurately identifying and localizing endometriosis lesions and related anatomical structures. A custom dataset was curated, comprising 199 video sequences and 205,725 frames. Of these, 17,560 frames were meticulously annotated by medical professionals. The dataset includes object detection annotations for 10 object classes relevant to endometriosis, alongside segmentation masks for some classes. Methods: To address the object detection task, we evaluated the performance of two deep learning models-FasterRCNN and YOLOv9-under both stratified and non-stratified training scenarios. Results: The experimental results demonstrated that stratified training significantly reduced the risk of data leakage and improved model generalization. The best-performing FasterRCNN object detection model achieved a high average test precision of 0.9811 ± 0.0084, recall of 0.7083 ± 0.0807, and mAP50 (mean average precision at 50% overlap) of 0.8185 ± 0.0562 across all presented classes. Despite these successes, the study also highlights the challenges posed by the weak annotations and class imbalances in the dataset, which impacted overall model performance. Conclusions: In conclusion, this study provides valuable insights into the application of deep learning for enhancing laparoscopic surgical precision in endometriosis treatment. The findings underscore the importance of robust dataset curation and advanced training strategies in developing reliable AI-assisted tools for surgical interventions. The latter could potentially improve the guidance of surgical interventions and prevent blind spots occurring in difficult-to-reach abdominal regions. Future work will focus on refining the dataset and exploring more sophisticated model architectures to further improve detection accuracy.
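One straightforward way to realise the sequence-aware (stratified) splitting that prevents frame-level leakage is a group-wise split keyed on the source video, sketched here with synthetic identifiers; this illustrates the general idea, not the authors' exact protocol.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# labels are per-frame; video_ids identify the source sequence of each frame.
# Splitting by video keeps all frames of one procedure on the same side,
# which is one way to avoid the leakage the study describes.
rng = np.random.default_rng(0)
n_frames = 1000
video_ids = rng.integers(0, 50, size=n_frames)   # 50 hypothetical sequences
labels = rng.integers(0, 10, size=n_frames)      # 10 object classes

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(np.arange(n_frames), labels, groups=video_ids))

# No video contributes frames to both sides of the split
assert set(video_ids[train_idx]).isdisjoint(video_ids[test_idx])
print(len(train_idx), "training frames,", len(test_idx), "test frames")
```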
Collapse
Affiliation(s)
- Andrey Bondarenko
- Institute of Applied Computer Systems, Riga Technical University, LV-1048 Riga, Latvia; (A.B.); (V.J.)
| | - Vilen Jumutc
- Institute of Applied Computer Systems, Riga Technical University, LV-1048 Riga, Latvia; (A.B.); (V.J.)
| | - Antoine Netter
- Department of Obstetrics and Gynecology, Marseille Hospital, 13005 Marseille, France; (A.N.); (F.D.)
| | - Fanny Duchateau
- Department of Obstetrics and Gynecology, Marseille Hospital, 13005 Marseille, France; (A.N.); (F.D.)
| | | | - Saman Noorzadeh
- SurgAR, 63000 Clermont-Ferrand, France; (S.N.); (G.G.); (F.F.); (N.B.)
| | - Giuseppe Giacomello
- SurgAR, 63000 Clermont-Ferrand, France; (S.N.); (G.G.); (F.F.); (N.B.)
- Department of Obstetrics and Gynecology, Istituto Ospedaliero Fondazione Poliambulanza, 25124 Brescia, Italy
| | - Filippo Ferrari
- SurgAR, 63000 Clermont-Ferrand, France; (S.N.); (G.G.); (F.F.); (N.B.)
- Department of Obstetrics and Gynecology, Gynecologic Oncology and Minimally Invasive Pelvic Surgery, International School of Surgical Anatomy (ISSA), IRCCS “Sacro Cuore—Don Calabria” Hospital, Negrar di Valpolicella, 37024 Verona, Italy
| | - Nicolas Bourdel
- SurgAR, 63000 Clermont-Ferrand, France; (S.N.); (G.G.); (F.F.); (N.B.)
- Department of Clinical Research and Innovation, CHU Clermont Ferrand, 63100 Clermont-Ferrand, France
| | - Ulrik Bak Kirk
- Department of Public Health, Aarhus University, 8000 Aarhus, Denmark;
- The Research Unit for General Practice, 8000 Aarhus, Denmark
| | - Dmitrijs Bļizņuks
- Institute of Applied Computer Systems, Riga Technical University, LV-1048 Riga, Latvia; (A.B.); (V.J.)
| |
Collapse
|
27
|
Vinken K, Sharma S, Livingstone MS. Mapping Macaque to Human Cortex with Natural Scene Responses. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2025:2025.05.11.653327. [PMID: 40462947 PMCID: PMC12132291 DOI: 10.1101/2025.05.11.653327] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 06/16/2025]
Abstract
Neuroscience has long relied on macaque studies to infer human brain function, yet identifying functionally corresponding brain regions across species and measurement modalities remains a fundamental challenge. This is especially true for higher-order cortex, where functional interpretations are constrained by narrow hypotheses and anatomical landmarks are often non-homologous. We present a data-driven approach for mapping functional correspondence across species using rich, naturalistic stimuli. By directly comparing macaque electrophysiology with human fMRI responses to 700 natural scenes, we identify fine-grained alignment based on response pattern similarity, without relying on predefined tuning concepts or hand-picked stimuli. As a test case, we examine the ventral face patch system, a well-studied but contested domain in cross-species alignment. Our approach resolves a longstanding ambiguity, yielding a correspondence consistent with full-brain anatomical warping but inconsistent with prior studies limited by narrow functional hypotheses. These findings show that natural image-evoked response patterns provide a robust foundation for cross-species functional alignment, supporting scalable comparisons as large-scale primate recordings become more widespread.
Collapse
Affiliation(s)
- Kasper Vinken
- Department of Neurobiology, Harvard Medical School, Boston, MA 02115, USA
| | - Saloni Sharma
- Department of Neurobiology, Harvard Medical School, Boston, MA 02115, USA
| | | |
Collapse
|
28
|
Deng K, Wen Q, Yang F, Ouyang H, Shi Z, Shuai S, Wu Z. OS-DETR: End-to-end brain tumor detection framework based on orthogonal channel shuffle networks. PLoS One 2025; 20:e0320757. [PMID: 40359502 PMCID: PMC12074655 DOI: 10.1371/journal.pone.0320757] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/03/2024] [Accepted: 02/21/2025] [Indexed: 05/15/2025] Open
Abstract
OrthoNets use the Gram-Schmidt process to achieve orthogonality among filters but do not impose constraints on the internal orthogonality of individual filters. To reduce the risk of overfitting, especially in scenarios with limited data such as medical imaging, this study explores an enhanced network that ensures the internal orthogonality within individual filters, named the Orthogonal Channel Shuffle Network (OSNet). This network is integrated into the Detection Transformer (DETR) framework for brain tumor detection, resulting in the OS-DETR. To further optimize model performance, this study also incorporates deformable attention mechanisms and an Intersection over Union strategy that emphasizes the internal region influence of bounding boxes and the corner distance disparity. Experimental results on the Br35H brain tumor dataset demonstrate the significant advantages of OS-DETR over mainstream object detection frameworks. Specifically, OS-DETR achieves a Precision of 95.0%, Recall of 94.2%, mAP@50 of 95.7%, and mAP@50:95 of 74.2%. The code implementation and experimental results are available at https://github.com/dkx2077/OS-DETR.git.
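As a rough illustration of the orthogonality idea, a filter-bank orthonormality penalty can be written as follows; note that this only shows the generic between-filter constraint, whereas OSNet additionally enforces orthogonality within individual filters, which is not reproduced here.

```python
import torch

def filter_orthogonality_penalty(conv_weight):
    """Penalise deviation of the flattened filter bank from orthonormality,
    i.e. ||W W^T - I||_F^2, a common orthogonal regulariser for conv layers."""
    w = conv_weight.flatten(1)                      # (out_channels, in*k*k)
    gram = w @ w.t()                                # (out_channels, out_channels)
    eye = torch.eye(gram.size(0), device=w.device)
    return ((gram - eye) ** 2).sum()

# Example on a random 3x3 convolution weight tensor
penalty = filter_orthogonality_penalty(torch.randn(64, 32, 3, 3))
print(float(penalty))
```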
Collapse
Affiliation(s)
- Kaixin Deng
- College of Computer Science and Cyber Security, Chengdu University of Technology, Chengdu, China
| | - Quan Wen
- College of Computer Science and Cyber Security, Chengdu University of Technology, Chengdu, China
| | - Fan Yang
- College of Computer Science and Cyber Security, Chengdu University of Technology, Chengdu, China
| | - Hang Ouyang
- College of Computer Science and Cyber Security, Chengdu University of Technology, Chengdu, China
| | - Zhuohang Shi
- College of Computer Science and Cyber Security, Chengdu University of Technology, Chengdu, China
| | - Shiyu Shuai
- College of Computer Science and Cyber Security, Chengdu University of Technology, Chengdu, China
| | - Zhaowang Wu
- College of Computer Science and Cyber Security, Chengdu University of Technology, Chengdu, China
| |
Collapse
|
29
|
Zhao P, Wang X, Yu S, Dong X, Li B, Wang H, Chen G. An open paradigm dataset for intelligent monitoring of underground drilling operations in coal mines. Sci Data 2025; 12:780. [PMID: 40355463 PMCID: PMC12069595 DOI: 10.1038/s41597-025-05118-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/23/2024] [Accepted: 05/01/2025] [Indexed: 05/14/2025] Open
Abstract
The underground drilling environment in coal mines is critical and prone to accidents, with common accident types including rib spalling, roof falling, and others. High-quality datasets are essential for developing and validating artificial intelligence (AI) algorithms in the field of coal mine safety monitoring and automation. Currently, there is no comprehensive benchmark dataset for coal mine industrial scenarios, limiting the research progress of AI algorithms in this industry. For the first time, this study constructed a benchmark dataset (DsDPM 66) specifically for underground coal mine drilling operations, containing 105,096 images obtained from surveillance videos of multiple drilling operation scenes. The dataset has been manually annotated to support computer vision tasks such as object detection and pose estimation. In addition, this study conducted extensive benchmarking experiments on this dataset, applying various advanced AI algorithms including but not limited to YOLOv8 and DETR. The results indicate that the proposed dataset highlights areas for improvement in algorithmic models and fills the data gap in the coal mining industry, providing valuable resources for developing coal mine safety monitoring systems.
Collapse
Affiliation(s)
- Pengzhen Zhao
- School of Electrical Engineering, Shanghai DianJi University, Shanghai, 201306, China
- Intelligent Decision and Control Technology Institute, Shanghai DianJi University, Shanghai, 201306, China
| | - Xichao Wang
- School of Electrical Engineering, Shanghai DianJi University, Shanghai, 201306, China.
- Intelligent Decision and Control Technology Institute, Shanghai DianJi University, Shanghai, 201306, China.
| | - Shuainan Yu
- School of Electrical Engineering, Shanghai DianJi University, Shanghai, 201306, China
- Intelligent Decision and Control Technology Institute, Shanghai DianJi University, Shanghai, 201306, China
| | - Xiangqing Dong
- School of Electrical Engineering, Shanghai DianJi University, Shanghai, 201306, China
- Intelligent Decision and Control Technology Institute, Shanghai DianJi University, Shanghai, 201306, China
| | - Baojiang Li
- School of Electrical Engineering, Shanghai DianJi University, Shanghai, 201306, China
- Intelligent Decision and Control Technology Institute, Shanghai DianJi University, Shanghai, 201306, China
| | - Haiyan Wang
- School of Electrical Engineering, Shanghai DianJi University, Shanghai, 201306, China
- Intelligent Decision and Control Technology Institute, Shanghai DianJi University, Shanghai, 201306, China
| | - Guochu Chen
- School of Electrical Engineering, Shanghai DianJi University, Shanghai, 201306, China
- Intelligent Decision and Control Technology Institute, Shanghai DianJi University, Shanghai, 201306, China
| |
Collapse
|
30
|
Ruarte G, Bujia G, Care D, Ison MJ, Kamienkowski JE. Integrating Bayesian and neural networks models for eye movement prediction in hybrid search. Sci Rep 2025; 15:16482. [PMID: 40355508 PMCID: PMC12069626 DOI: 10.1038/s41598-025-00272-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/20/2025] [Accepted: 04/28/2025] [Indexed: 05/14/2025] Open
Abstract
Visual search is crucial in daily human interaction with the environment. Hybrid search extends this by requiring observers to find any item from a given set. Recently, a few models were proposed to simulate human eye movements in visual search tasks within natural scenes, but none were implemented for Hybrid search under similar conditions. We present an enhanced neural network Entropy Limit Minimization (nnELM) model, grounded in a Bayesian framework and signal detection theory, and the Hybrid Search Eye Movements (HSEM) Dataset, containing thousands of human eye movements during hybrid tasks. A key Hybrid search challenge is that participants have to look for different objects at the same time. To address this, we developed several strategies involving the posterior probability distributions after each fixation. Adjusting peripheral visibility improved early-stage efficiency, aligning it with human behavior. Limiting the model's memory reduced success in longer searches, mirroring human performance. We validated these improvements by comparing our model with a held-out set within the HSEM and with other models in a separate visual search benchmark. Overall, the new nnELM model not only handles Hybrid search in natural scenes but also closely replicates human behavior, advancing our understanding of search processes while maintaining interpretability.
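The Bayesian core of such searchers, updating a posterior over target locations after each fixation, can be sketched on a toy grid as below; the likelihood model and the greedy next-fixation rule are illustrative simplifications, not the nnELM implementation.

```python
import numpy as np

def update_posterior(prior, likelihood_map):
    """One fixation step of a Bayesian searcher: multiply the prior over
    target locations by the likelihood of the observation at each cell
    and renormalise."""
    posterior = prior * likelihood_map
    return posterior / posterior.sum()

# Toy 2D grid: flat prior, one location made more likely by the "observation"
grid = np.full((20, 20), 1.0 / 400)
likelihood = np.ones((20, 20))
likelihood[12, 7] = 5.0           # evidence in favour of cell (12, 7)
posterior = update_posterior(grid, likelihood)
next_fixation = np.unravel_index(posterior.argmax(), posterior.shape)
print(next_fixation)              # (12, 7)
```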
Collapse
Affiliation(s)
- Gonzalo Ruarte
- Laboratorio de Inteligencia Artificial Aplicada (LIAA), Instituto de Ciencias de la Computación (ICC), CONICET - Universidad de Buenos Aires, Buenos Aires, Argentina.
- Departamento de Computación, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires, Buenos Aires, Argentina.
| | - Gaston Bujia
- Laboratorio de Inteligencia Artificial Aplicada (LIAA), Instituto de Ciencias de la Computación (ICC), CONICET - Universidad de Buenos Aires, Buenos Aires, Argentina
- Departamento de Computación, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires, Buenos Aires, Argentina
| | - Damián Care
- Laboratorio de Inteligencia Artificial Aplicada (LIAA), Instituto de Ciencias de la Computación (ICC), CONICET - Universidad de Buenos Aires, Buenos Aires, Argentina
| | | | - Juan Esteban Kamienkowski
- Laboratorio de Inteligencia Artificial Aplicada (LIAA), Instituto de Ciencias de la Computación (ICC), CONICET - Universidad de Buenos Aires, Buenos Aires, Argentina
- Departamento de Computación, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires, Buenos Aires, Argentina
- Maestría en Explotación de Datos y Descubrimiento del Conocimiento, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires, Buenos Aires, Argentina
| |
Collapse
|
31
|
Alzahrani N, Bchir O, Ismail MMB. YOLO-Act: Unified Spatiotemporal Detection of Human Actions Across Multi-Frame Sequences. SENSORS (BASEL, SWITZERLAND) 2025; 25:3013. [PMID: 40431808 PMCID: PMC12115296 DOI: 10.3390/s25103013] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 03/31/2025] [Revised: 05/04/2025] [Accepted: 05/08/2025] [Indexed: 05/29/2025]
Abstract
Automated action recognition has become essential in the surveillance, healthcare, and multimedia retrieval industries owing to the rapid proliferation of video data. This paper introduces YOLO-Act, a novel spatiotemporal action detection model that enhances the object detection capabilities of YOLOv8 to efficiently manage complex action dynamics within video sequences. YOLO-Act achieves precise and efficient action recognition by integrating keyframe extraction, action tracking, and class fusion. The model captures essential temporal dynamics without the computational overhead of continuous frame processing by leveraging the adaptive selection of three keyframes representing the beginning, middle, and end of the actions. Compared with state-of-the-art approaches such as the Lagrangian Action Recognition Transformer (LART), YOLO-Act exhibits superior performance with a mean average precision (mAP) of 73.28 in experiments conducted on the AVA dataset, resulting in a gain of +28.18 mAP. Furthermore, YOLO-Act achieves this higher accuracy with significantly lower FLOPs, demonstrating its efficiency in computational resource utilization. The results highlight the advantages of incorporating precise tracking, effective spatial detection, and temporal consistency to address the challenges associated with video-based action detection.
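The keyframe selection and class fusion steps can be illustrated with the following sketch; picking the first, middle, and last frames and averaging class scores are assumptions about the general idea rather than the paper's exact rules.

```python
import numpy as np

def select_keyframes(n_frames):
    """Pick the first, middle, and last frame indices of an action segment."""
    return [0, n_frames // 2, n_frames - 1]

def fuse_class_scores(per_frame_scores):
    """Average per-keyframe class confidences and return the fused label.
    Averaging is an illustrative fusion rule, not necessarily the paper's."""
    fused = np.mean(per_frame_scores, axis=0)
    return int(np.argmax(fused)), fused

# Toy example: detector class scores on three keyframes, 5 action classes
scores = np.array([[0.1, 0.6, 0.1, 0.1, 0.1],
                   [0.2, 0.5, 0.1, 0.1, 0.1],
                   [0.1, 0.7, 0.1, 0.05, 0.05]])
print(select_keyframes(48))           # [0, 24, 47]
print(fuse_class_scores(scores)[0])   # 1
```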
Collapse
Affiliation(s)
- Nada Alzahrani
- Computer Science Department, College of Computer and Information Sciences, King Saud University, Riyadh 11451, Saudi Arabia; (O.B.); (M.M.B.I.)
- Computer Science Department, College of Computer Engineering and Science, Prince Sattam Bin Abdulaziz University, Al-Kharj 16278, Saudi Arabia
| | - Ouiem Bchir
- Computer Science Department, College of Computer and Information Sciences, King Saud University, Riyadh 11451, Saudi Arabia; (O.B.); (M.M.B.I.)
| | - Mohamed Maher Ben Ismail
- Computer Science Department, College of Computer and Information Sciences, King Saud University, Riyadh 11451, Saudi Arabia; (O.B.); (M.M.B.I.)
| |
Collapse
|
32
|
Fincato M, Vezzani R. DualPose: Dual-Block Transformer Decoder with Contrastive Denoising for Multi-Person Pose Estimation. SENSORS (BASEL, SWITZERLAND) 2025; 25:2997. [PMID: 40431791 PMCID: PMC12114973 DOI: 10.3390/s25102997] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 04/03/2025] [Revised: 05/05/2025] [Accepted: 05/06/2025] [Indexed: 05/29/2025]
Abstract
Multi-person pose estimation is the task of detecting and regressing the keypoint coordinates of multiple people in a single image. Significant progress has been achieved in recent years, especially with the introduction of transformer-based end-to-end methods. In this paper, we present DualPose, a novel framework that enhances multi-person pose estimation by leveraging a dual-block transformer decoding architecture. Class prediction and keypoint estimation are split into parallel blocks so each sub-task can be separately improved and the risk of interference is reduced. This architecture improves the precision of keypoint localization and the model's capacity to accurately classify individuals. To improve model performance, the Keypoint-Block uses parallel processing of self-attentions, providing a novel strategy that improves keypoint localization accuracy and precision. Additionally, DualPose incorporates a contrastive denoising (CDN) mechanism, leveraging positive and negative samples to stabilize training and improve robustness. Thanks to CDN, a variety of training samples are created by introducing controlled noise into the ground truth, improving the model's ability to discern between valid and incorrect keypoints. DualPose achieves state-of-the-art results outperforming recent end-to-end methods, as shown by extensive experiments on the MS COCO and CrowdPose datasets. The code and pretrained models are publicly available.
Collapse
Affiliation(s)
| | - Roberto Vezzani
- Department of Engineering “Enzo Ferrari”, University of Modena and Reggio Emilia, Via P. Vivarelli 10, 41125 Modena, Italy;
| |
Collapse
|
33
|
Tu J, Liu X, Huang Z, Hao Y, Hong R, Wang M. Cross-Modal Hashing via Diverse Instances Matching. IEEE TRANSACTIONS ON IMAGE PROCESSING : A PUBLICATION OF THE IEEE SIGNAL PROCESSING SOCIETY 2025; 34:2737-2749. [PMID: 40266858 DOI: 10.1109/tip.2025.3561659] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 04/25/2025]
Abstract
Cross-modal hashing is a highly effective technique for searching relevant data across different modalities, owing to its low storage costs and fast similarity retrieval capability. While significant progress has been achieved in this area, prior investigations predominantly concentrate on a one-to-one feature alignment approach, where a singular feature is derived for similarity retrieval. However, the singular feature in these methods fails to adequately capture the varied multi-instance information inherent in the original data across disparate modalities. Consequently, the conventional one-to-one methodology is plagued by a semantic mismatch issue, as the rigid one-to-one alignment inhibits effective multi-instance matching. To address this issue, we propose a novel Diverse Instances Matching for Cross-modal Hashing (DIMCH), which explores the relevance between multiple instances in different modalities using a multi-instance learning algorithm. Specifically, we design a novel diverse instances learning module to extract a multi-feature set, which enables our model to capture detailed multi-instance semantics. To evaluate the similarity between two multi-feature sets, we adopt the smooth chamfer distance function, which enables our model to incorporate the conventional similarity retrieval structure. Moreover, to sufficiently exploit the supervised information from the semantic label, we adopt the weight cosine triplet loss as the objective function, which incorporates the multilevel similarity among the multi-labels into the training procedure and enables the model to mine the multi-label correlation effectively. Extensive experiments demonstrate that our diverse hashing embedding method achieves state-of-the-art performance in supervised cross-modal hashing retrieval tasks.
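The smooth chamfer comparison between two multi-feature sets can be sketched as follows, using a log-sum-exp soft maximum over cosine similarities; the temperature and exact aggregation are assumptions and may differ from the paper's definition.

```python
import torch
import torch.nn.functional as F

def smooth_chamfer(set_a, set_b, alpha=16.0):
    """Smooth chamfer score between two feature sets of shape (n_a, d) and
    (n_b, d): a soft maximum (log-sum-exp) over cosine similarities in each
    direction, averaged over both sets. Higher means more similar."""
    a = F.normalize(set_a, dim=1)
    b = F.normalize(set_b, dim=1)
    sim = a @ b.t()                                       # (n_a, n_b) cosine similarities

    a_to_b = torch.logsumexp(alpha * sim, dim=1) / alpha  # soft max over b for each a
    b_to_a = torch.logsumexp(alpha * sim, dim=0) / alpha  # soft max over a for each b
    return 0.5 * (a_to_b.mean() + b_to_a.mean())

print(float(smooth_chamfer(torch.randn(4, 128), torch.randn(6, 128))))
```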
Collapse
|
34
|
Zhou Q, Pan Z, Niu B. SFEF-Net: Scattering Feature Extraction and Fusion Network for Aircraft Detection in SAR Images. SENSORS (BASEL, SWITZERLAND) 2025; 25:2988. [PMID: 40431781 PMCID: PMC12114894 DOI: 10.3390/s25102988] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 03/13/2025] [Revised: 04/27/2025] [Accepted: 05/07/2025] [Indexed: 05/29/2025]
Abstract
Synthetic aperture radar (SAR) offers robust Earth observation capabilities under diverse lighting and weather conditions, making SAR-based aircraft detection crucial for various applications. However, this task presents significant challenges, including extracting discrete scattering features, mitigating interference from complex backgrounds, and handling potential label noise. To tackle these issues, we propose the scattering feature extraction and fusion network (SFEF-Net). Firstly, we proposed an innovative sparse convolution operator and applied it to feature extraction. Compared to traditional convolution, sparse convolution offers more flexible sampling positions and a larger receptive field without increasing the number of parameters, which enables SFEF-Net to better extract discrete features. Secondly, we developed the global information fusion and distribution module (GIFD) to fuse feature maps of different levels and scales. GIFD possesses the capability for global modeling, enabling the comprehensive fusion of multi-scale features and the utilization of contextual information. Additionally, we introduced a noise-robust loss to mitigate the adverse effects of label noise by reducing the weight of outliers. To assess the performance of our proposed method, we carried out comprehensive experiments utilizing the SAR-AIRcraft1.0 dataset. The experimental results demonstrate the outstanding performance of SFEF-Net.
Collapse
Affiliation(s)
- Qiang Zhou
- Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China; (Q.Z.); (B.N.)
- Key Laboratory of Technology in Geo-Spatial Information Processing and Application System, Chinese Academy of Sciences, Beijing 100190, China
- School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100049, China
| | - Zongxu Pan
- Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China; (Q.Z.); (B.N.)
- Key Laboratory of Technology in Geo-Spatial Information Processing and Application System, Chinese Academy of Sciences, Beijing 100190, China
- School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100049, China
| | - Ben Niu
- Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China; (Q.Z.); (B.N.)
- Key Laboratory of Technology in Geo-Spatial Information Processing and Application System, Chinese Academy of Sciences, Beijing 100190, China
- School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100049, China
| |
Collapse
|
35
|
Abo-Zahhad MM, Abo-Zahhad M. Real time intelligent garbage monitoring and efficient collection using Yolov8 and Yolov5 deep learning models for environmental sustainability. Sci Rep 2025; 15:16024. [PMID: 40341180 PMCID: PMC12062267 DOI: 10.1038/s41598-025-99885-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/07/2025] [Accepted: 04/23/2025] [Indexed: 05/10/2025] Open
Abstract
Effective waste management is currently one of the most influential factors in enhancing the quality of life. Increased garbage production has been identified as a significant problem for many cities worldwide and a crucial issue for countries experiencing rapid urban population growth. According to the World Bank Organization, global waste production is projected to increase from 2.01 billion tonnes in 2018 to 3.4 billion tonnes by 2050 (Kaza et al. in What a Waste 2.0: A Global Snapshot of Solid Waste Management to 2050, The World Bank Group, Washington, DC, USA, 2018). In many cities, growing waste is the primary driver of environmental pollution. Nationally, governments have initiated several programs to improve cleanliness by developing systems that alert businesses when it is time to empty the bins. The current research proposes an enhanced, accurate, real-time object detection system to address the problem of trash accumulating around containers. This system involves numerous trash cans scattered across the city, each equipped with a low-cost device that measures the amount of trash inside. When a certain threshold is reached, the device sends a message with a unique identifier, prompting the appropriate authorities to take action. The system also triggers alerts if individuals throw trash bags outside the container or if the bin overflows, sending a message with a unique identifier to the authorities. Additionally, this paper addresses the need for efficient garbage classification while reducing computing costs to improve resource utilization. Two-stage lightweight deep learning models based on YOLOv5 and YOLOv8 are adopted to significantly decrease the number of parameters and processes, thereby reducing hardware requirements. In this study, trash is first classified into primary categories, which are further subdivided. The primary categories include full trash containers, trash bags, trash outside containers, and wet trash containers. YOLOv5 is particularly effective for classifying small objects, achieving high accuracy in identifying and categorizing different types of waste products on hardware without GPU capabilities. Each main class is further subdivided using YOLOv8 to facilitate recycling. A comparative study of YOLOv8, YOLOv5, and EfficientNet models on public and newly constructed garbage datasets shows that YOLOv8 and YOLOv5 have good accuracy for most classes, with the full-trash bin class achieving the highest accuracy and the wet trash container class the lowest, compared to the EfficientNet model. The results demonstrate that the system effectively addresses the reliability issues of previously proposed systems, including detecting whether a trash bin is full, identifying trash outside the bin, and ensuring proper communication with authorities for necessary actions. Further research is recommended to enhance garbage management and collection, considering target occlusion, CPU and GPU hardware optimization, and robotic integration with the proposed system.
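The bin-level alerting rule described above reduces to a simple threshold check per reading; the sketch below uses an assumed 85% fill threshold and a hypothetical message format, neither of which is specified in the abstract.

```python
from dataclasses import dataclass

FULL_THRESHOLD = 0.85  # assumed fraction of bin capacity that triggers an alert

@dataclass
class BinReading:
    bin_id: str
    fill_level: float        # 0.0 (empty) .. 1.0 (full), from the level sensor
    overflow_detected: bool   # e.g. the vision model sees bags outside the bin

def build_alerts(readings):
    """Report a bin when its measured fill level crosses the threshold or
    when waste is detected outside the container."""
    alerts = []
    for r in readings:
        if r.fill_level >= FULL_THRESHOLD or r.overflow_detected:
            alerts.append({"bin_id": r.bin_id,
                           "fill_level": round(r.fill_level, 2),
                           "overflow": r.overflow_detected})
    return alerts

print(build_alerts([BinReading("bin-017", 0.92, False),
                    BinReading("bin-018", 0.40, True)]))
```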
Collapse
Affiliation(s)
- Mohammed M Abo-Zahhad
- Department of Electrical Engineering, Faculty of Engineering, Sohag University, Sohag, New Sohag City, Egypt
| | - Mohammed Abo-Zahhad
- Department of Electronics and Communications Engineering, Egypt-Japan University of Science and Technology (E-JUST), New Borg El-Arab City, Alexandria, 21934, Egypt.
- Department of Electrical and Electronics Engineering, Assiut University, Assiut, 71515, Egypt.
| |
Collapse
|
36
|
Wu M, Sharapov J, Anderson M, Lu Y, Wu Y. Quantifying dislocation-type defects in post irradiation examination via transfer learning. Sci Rep 2025; 15:15889. [PMID: 40335501 PMCID: PMC12059087 DOI: 10.1038/s41598-025-00238-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/12/2024] [Accepted: 04/25/2025] [Indexed: 05/09/2025] Open
Abstract
The quantitative analysis of dislocation-type defects in irradiated materials is critical to materials characterization in the nuclear energy industry. The conventional approach of an instrument scientist manually identifying any dislocation defects is both time-consuming and subjective, thereby potentially introducing inconsistencies in the quantification. This work approaches dislocation-type defect identification and segmentation using a standard open-source computer vision model, YOLO11, that leverages transfer learning to create a highly effective dislocation defect quantification tool while using only a minimal number of annotated micrographs for training. This model demonstrates the ability to segment both dislocation lines and loops concurrently in micrographs with high pixel noise levels and on two alloys not represented in the training set. Inference of dislocation defects using transmission electron microscopy on three different irradiated alloys relevant to the nuclear energy industry is examined in this work, with widely varying pixel noise levels and completely unrelated compositions and dislocation formations, for practical post irradiation examination analysis. Code and models are available at https://github.com/idaholab/PANDA.
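As a rough illustration of the transfer-learning workflow (not the released PANDA code), the Ultralytics API can fine-tune a pretrained YOLO11 segmentation checkpoint on a small annotated micrograph set; the dataset YAML path, checkpoint name, and hyperparameters below are placeholders.

```python
from ultralytics import YOLO

# Start from a pretrained YOLO11 segmentation checkpoint (transfer learning),
# then fine-tune on a small set of annotated micrographs. The dataset YAML,
# epoch count, and image size are illustrative placeholders.
model = YOLO("yolo11n-seg.pt")
model.train(data="dislocations.yaml", epochs=100, imgsz=640)

# Run inference on a held-out micrograph; each result carries instance masks
# for the segmented dislocation lines and loops.
results = model.predict("micrograph_test.png", conf=0.25)
for r in results:
    print(r.masks)  # segmentation masks predicted for this image
```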
Collapse
Affiliation(s)
- Michael Wu
- Idaho National Laboratory, Idaho Falls, ID, USA
| | | | | | - Yu Lu
- Boise State University, Boise, ID, USA
- Center for Advanced Energy Studies, Idaho Falls, ID, USA
| | - Yaqiao Wu
- Boise State University, Boise, ID, USA
- Center for Advanced Energy Studies, Idaho Falls, ID, USA
| |
Collapse
|
37
|
Zheng S, Wu Z, Xu Y, He C, Wei Z. Detector With Classifier 2: An End-to-End Multi-Stream Feature Aggregation Network for Fine-Grained Object Detection in Remote Sensing Images. IEEE TRANSACTIONS ON IMAGE PROCESSING : A PUBLICATION OF THE IEEE SIGNAL PROCESSING SOCIETY 2025; 34:2707-2720. [PMID: 40305241 DOI: 10.1109/tip.2025.3563708] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/02/2025]
Abstract
Fine-grained object detection (FGOD) fundamentally comprises two primary tasks: object detection and fine-grained classification. In natural scenes, most FGOD methods benefit from higher instance resolution and less environmental variation, attributes more commonly associated with the latter task. In this paper, we propose a unified paradigm named Detector with Classifier2 (DC2), which provides a holistic paradigm by explicitly considering the end-to-end integration of object detection and fine-grained classification tasks, rather than prioritizing one aspect. Initially, our detection sub-network is restricted to determining only whether a proposal belongs to a coarse category, without delving into the specific sub-categories. Moreover, in order to reduce redundant pixel-level calculation, we propose an instance-level feature enhancement (IFE) module to model the semantic similarities among proposals, which poses great potential for locating more instances in remote sensing images (RSIs). After obtaining the coarse detection predictions, we further construct a classification sub-network, which is built on top of the former branch to determine the specific sub-categories of the aforementioned predictions. Importantly, the detection network operates on the complete image, while the classification network conducts secondary modeling of the detected regions. These operations can be viewed as extracting global contextual information and local intrinsic cues for each instance. Therefore, we propose a multi-stream feature aggregation (MSFA) module to integrate global-stream semantic information and local-stream discriminative cues. Our whole DC2 network follows an end-to-end learning fashion, which effectively excavates the internal correlation between the detection and fine-grained classification networks. We evaluate the performance of our DC2 network on two benchmark datasets, SAT-MTB and HRSC2016. Importantly, our method achieves new state-of-the-art results compared with recent works (approximately 7% mAP gains on SAT-MTB) and improves the baseline by a significant margin (43.2% vs. 36.7%) without any complicated post-processing strategies. Source codes of the proposed methods are available at https://github.com/zhengshangdong/DC2.
Collapse
|
38
|
Gao C, Ajith S, Peelen MV. Object representations drive emotion schemas across a large and diverse set of daily-life scenes. Commun Biol 2025; 8:697. [PMID: 40325234 PMCID: PMC12053605 DOI: 10.1038/s42003-025-08145-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/21/2025] [Accepted: 04/29/2025] [Indexed: 05/07/2025] Open
Abstract
The rapid emotional evaluation of objects and events is essential in daily life. While visual scenes reliably evoke emotions, it remains unclear whether emotion schemas evoked by daily-life scenes depend on object processing systems or are extracted independently. To explore this, we collected emotion ratings for 4913 daily-life scenes from 300 participants, and predicted these ratings from representations in deep neural networks and functional magnetic resonance imaging (fMRI) activity patterns in visual cortex. AlexNet, an object-based model, outperformed EmoNet, an emotion-based model, in predicting emotion ratings for daily-life scenes, while EmoNet excelled for explicitly evocative scenes. Emotion information was processed hierarchically within the object recognition system, consistent with the visual cortex's organization. Activity patterns in the lateral occipital complex (LOC), an object-selective region, reliably predicted emotion ratings and outperformed other visual regions. These findings suggest that the emotional evaluation of daily-life scenes is mediated by visual object processing, with additional mechanisms engaged when object content is uninformative.
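Predicting scene-level emotion ratings from network activations is essentially a regularised regression; a minimal sketch with synthetic stand-ins for AlexNet features and mean valence ratings is shown below (the study's feature extraction and cross-validation details may differ).

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score

# Synthetic stand-ins: one feature vector per scene (e.g. layer activations)
# and one averaged emotion rating per scene.
rng = np.random.default_rng(0)
features = rng.normal(size=(500, 4096))       # 500 scenes x 4096-d activations
ratings = rng.normal(size=500)                # mean emotion rating per scene

# Ridge regression with cross-validated regularisation strength
model = RidgeCV(alphas=np.logspace(-2, 4, 13))
r2 = cross_val_score(model, features, ratings, cv=5, scoring="r2")
print("cross-validated R^2: %.3f" % r2.mean())
```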
Collapse
Affiliation(s)
- Chuanji Gao
- School of Psychology, Nanjing Normal University, Nanjing, China.
| | - Susan Ajith
- Department of Medicine, Justus-Liebig-Universität Gießen, Gießen, Germany
- Max Planck Institute for Human Cognitive and Brain Sciences, Leipzig, Germany
| | - Marius V Peelen
- Donders Institute for Brain, Cognition and Behaviour, Radboud University, Nijmegen, the Netherlands.
| |
Collapse
|
39
|
Haupt M, Garrett DD, Cichy RM. Healthy aging delays and dedifferentiates high-level visual representations. Curr Biol 2025; 35:2112-2127.e6. [PMID: 40239656 DOI: 10.1016/j.cub.2025.03.062] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/04/2024] [Revised: 01/23/2025] [Accepted: 03/25/2025] [Indexed: 04/18/2025]
Abstract
Healthy aging impacts visual information processing with consequences for subsequent high-level cognition and everyday behavior, but the underlying neural changes in visual representations remain unknown. Here, we investigate the nature of representations underlying object recognition in older compared to younger adults by tracking them in time using electroencephalography (EEG), across space using functional magnetic resonance imaging (fMRI), and by probing their behavioral relevance using similarity judgments. Applying a multivariate analysis framework to combine experimental assessments, four key findings about how brain aging impacts object recognition emerge. First, aging selectively delays the formation of object representations, profoundly changing the chronometry of visual processing. Second, the delay in the formation of object representations emerges in high-level rather than low- and mid-level ventral visual cortex, supporting the theory that brain areas developing last deteriorate first. Third, aging reduces content selectivity in the high-level ventral visual cortex, indicating age-related neural dedifferentiation as the mechanism of representational change. Finally, we demonstrate that the identified representations of the aging brain are behaviorally relevant, ascertaining ecological relevance. Together, our results reveal the impact of healthy aging on the visual brain.
Collapse
Affiliation(s)
- Marleen Haupt
- Department of Education and Psychology, Freie Universität Berlin, Habelschwerdter Allee 45, Berlin 14195, Germany; Center for Lifespan Psychology, Max Planck Institute for Human Development, Lentzallee 94, Berlin 14195, Germany.
| | - Douglas D Garrett
- Max Planck UCL Centre for Computational Psychiatry and Ageing Research, 10-12 Russell Square, London WC1B 5EH, UK
| | - Radoslaw M Cichy
- Department of Education and Psychology, Freie Universität Berlin, Habelschwerdter Allee 45, Berlin 14195, Germany; Berlin School of Mind and Brain, Faculty of Philosophy, Humboldt-Universität zu Berlin, Luisenstraße 56, Berlin 10117, Germany; Bernstein Center for Computational Neuroscience Berlin, Humbold-Universität zu Berlin, Philippstraße 13, Berlin 10115, Germany.
| |
Collapse
|
40
|
Toosi A, Harsini S, Divband G, Bénard F, Uribe CF, Oviedo F, Dodhia R, Weeks WB, Lavista Ferres JM, Rahmim A. Computer-Aided Detection (CADe) of Small Metastatic Prostate Cancer Lesions on 3D PSMA PET Volumes Using Multi-Angle Maximum Intensity Projections. Cancers (Basel) 2025; 17:1563. [PMID: 40361490 PMCID: PMC12071532 DOI: 10.3390/cancers17091563] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/27/2025] [Revised: 04/28/2025] [Accepted: 04/29/2025] [Indexed: 05/15/2025] Open
Abstract
OBJECTIVES We aimed to develop and evaluate a novel computer-aided detection (CADe) approach for identifying small metastatic biochemically recurrent (BCR) prostate cancer (PCa) lesions on PSMA-PET images, utilizing multi-angle Maximum Intensity Projections (MA-MIPs) and state-of-the-art (SOTA) object detection algorithms. METHODS We fine-tuned and evaluated 16 SOTA object detection algorithms (selected across four main categories of model types) applied to MA-MIPs as extracted from rotated 3D PSMA-PET volumes. Predicted 2D bounding boxes were back-projected to the original 3D space using the Ordered Subset Expectation Maximization (OSEM) algorithm. A fine-tuned Medical Segment-Anything Model (MedSAM) was then also used to segment the identified lesions within the bounding boxes. RESULTS The proposed method achieved a high detection performance for this difficult task, with the FreeAnchor model reaching an F1-score of 0.69 and a recall of 0.74. It outperformed several 3D methods in efficiency while maintaining comparable accuracy. Strong recall rates were observed for clinically relevant areas, such as local relapses (0.82) and bone metastases (0.80). CONCLUSION Our fully automated CADe tool shows promise in assisting physicians as a "second reader" for detecting small metastatic BCR PCa lesions on PSMA-PET images. By leveraging the strength and computational efficiency of 2D models while preserving 3D spatial information of the PSMA-PET volume, the proposed approach has the potential to improve detectability and reduce workload in cancer diagnosis and management.
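The MA-MIP extraction step can be sketched as repeated rotation of the PET volume followed by a maximum intensity projection; the rotation axes, interpolation order, and number of angles below are illustrative assumptions rather than the study's exact settings.

```python
import numpy as np
from scipy.ndimage import rotate

def multi_angle_mips(volume, n_angles=16):
    """Rotate a 3D volume about one anatomical axis and take a maximum
    intensity projection at each angle, yielding a stack of 2D MIP views."""
    mips = []
    for angle in np.linspace(0, 360, n_angles, endpoint=False):
        # Rotate in the plane of axes (1, 2), then project along axis 2
        rotated = rotate(volume, angle, axes=(1, 2), reshape=False, order=1)
        mips.append(rotated.max(axis=2))
    return np.stack(mips)                  # (n_angles, H, W)

# Toy volume with one bright "lesion"
vol = np.zeros((64, 64, 64), dtype=np.float32)
vol[40:43, 20:23, 30:33] = 1.0
print(multi_angle_mips(vol).shape)         # (16, 64, 64)
```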
Affiliation(s)
- Amirhosein Toosi
- Department of Integrative Oncology, BC Cancer Research Institute, Vancouver, BC V5Z 1L3, Canada; (A.T.); (C.F.U.)
- Department of Radiology, University of British Columbia, Vancouver, BC V6T 1Z3, Canada;
- AI for Good Lab, Microsoft Corporation, Redmond, WA 98052, USA; (F.O.); (R.D.); (W.B.W.); (J.M.L.F.)
- François Bénard
- Department of Radiology, University of British Columbia, Vancouver, BC V6T 1Z3, Canada;
- BC Cancer, Vancouver, BC V5Z 1L3, Canada;
- Carlos F. Uribe
- Department of Integrative Oncology, BC Cancer Research Institute, Vancouver, BC V5Z 1L3, Canada; (A.T.); (C.F.U.)
- Department of Radiology, University of British Columbia, Vancouver, BC V6T 1Z3, Canada;
- BC Cancer, Vancouver, BC V5Z 1L3, Canada;
- Felipe Oviedo
- AI for Good Lab, Microsoft Corporation, Redmond, WA 98052, USA; (F.O.); (R.D.); (W.B.W.); (J.M.L.F.)
- Rahul Dodhia
- AI for Good Lab, Microsoft Corporation, Redmond, WA 98052, USA; (F.O.); (R.D.); (W.B.W.); (J.M.L.F.)
- William B. Weeks
- AI for Good Lab, Microsoft Corporation, Redmond, WA 98052, USA; (F.O.); (R.D.); (W.B.W.); (J.M.L.F.)
- Juan M. Lavista Ferres
- AI for Good Lab, Microsoft Corporation, Redmond, WA 98052, USA; (F.O.); (R.D.); (W.B.W.); (J.M.L.F.)
- Arman Rahmim
- Department of Integrative Oncology, BC Cancer Research Institute, Vancouver, BC V5Z 1L3, Canada; (A.T.); (C.F.U.)
- Department of Radiology, University of British Columbia, Vancouver, BC V6T 1Z3, Canada;
- Department of Physics and Astronomy, University of British Columbia, Vancouver, BC V6T 1Z3, Canada
|
41
|
Demircioğlu A, Bos D, Quinsten AS, Umutlu L, Bruder O, Forsting M, Nassenstein K. Detecting the left atrial appendage in CT localizers using deep learning. Sci Rep 2025; 15:15333. [PMID: 40316718 PMCID: PMC12048584 DOI: 10.1038/s41598-025-99701-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/14/2025] [Accepted: 04/22/2025] [Indexed: 05/04/2025] Open
Abstract
Patients with cardioembolic stroke often undergo CT of the left atrial appendage (LAA), for example, to determine whether thrombi are present in the LAA. To guide the imaging process, technologists first perform a localizer scan, which is a preliminary image used to identify the region of interest. However, the lack of well-defined landmarks makes accurate delimitation of the LAA in localizers difficult and often requires whole-heart scans, increasing radiation exposure and cancer risk. This study aims to automate LAA delimitation in CT localizers using deep learning. Four commonly used deep networks (VariFocalNet, Cascade-R-CNN, Task-aligned One-stage Object Detection Network, YOLO v11) were trained to predict the LAA boundaries on a cohort of 1253 localizers, collected retrospectively from a single center. The best-performing network in terms of delimitation accuracy was then evaluated on an internal test cohort of 368 patients, and on an external test cohort of 309 patients. The VariFocalNet performed best, achieving LAA delimitations with high accuracy (97.8% and 96.8%; Dice coefficients: 90.4% and 90.0%) and near-perfect clinical utility (99.8% and 99.3%). Compared to whole-heart scanning, the network-based delimitation reduced the radiation exposure by more than 50% (5.33 ± 6.42 mSv vs. 11.35 ± 8.17 mSv in the internal cohort, 4.39 ± 4.23 mSv vs. 10.09 ± 8.0 mSv in the external cohort). This study demonstrates that a deep learning network can accurately delimit the LAA in the localizer, leading to more accurate CT scans of the LAA, thereby significantly reducing radiation exposure to the patient compared to whole-heart scanning.
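For context, the sketch below (a generic overlap metric, not taken from the paper) shows how a Dice coefficient between a predicted and a reference delimitation can be computed when both are expressed as axis-aligned boxes in localizer pixel coordinates.

def box_dice(a, b) -> float:
    """Dice overlap of two boxes given as (x0, y0, x1, y1)."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))   # intersection width
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))   # intersection height
    inter = iw * ih
    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    return 2.0 * inter / (area_a + area_b) if (area_a + area_b) > 0 else 0.0

print(box_dice((10, 20, 110, 90), (15, 25, 105, 95)))  # ≈ 0.88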
Affiliation(s)
- Aydin Demircioğlu
- Institute of Diagnostic and Interventional Radiology and Neuroradiology, University Hospital Essen, Hufelandstraße 55, 45147, Essen, Germany.
- Denise Bos
- Institute of Diagnostic and Interventional Radiology and Neuroradiology, University Hospital Essen, Hufelandstraße 55, 45147, Essen, Germany
- Anton S Quinsten
- Institute of Diagnostic and Interventional Radiology and Neuroradiology, University Hospital Essen, Hufelandstraße 55, 45147, Essen, Germany
- Lale Umutlu
- Institute of Diagnostic and Interventional Radiology and Neuroradiology, University Hospital Essen, Hufelandstraße 55, 45147, Essen, Germany
- Oliver Bruder
- Department of Cardiology and Angiology, Contilia Heart and Vascular Center, Elisabeth-Krankenhaus Essen, Klara-Kopp-Weg 1, 45138, Essen, Germany
- Faculty of Medicine, Ruhr University Bochum, 44801, Bochum, Germany
- Michael Forsting
- Institute of Diagnostic and Interventional Radiology and Neuroradiology, University Hospital Essen, Hufelandstraße 55, 45147, Essen, Germany
- Kai Nassenstein
- Institute of Diagnostic and Interventional Radiology and Neuroradiology, University Hospital Essen, Hufelandstraße 55, 45147, Essen, Germany
|
42
|
Yang L, He L, Hu D, Liu Y, Peng Y, Chen H, Zhou M. Variational Transformer: A Framework Beyond the Tradeoff Between Accuracy and Diversity for Image Captioning. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2025; 36:9500-9511. [PMID: 39374280 DOI: 10.1109/tnnls.2024.3440872] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 10/09/2024]
Abstract
Accuracy and diversity are two critical, quantifiable performance metrics in the generation of natural and semantically accurate captions. While efforts are made to enhance one of them, the other suffers due to their inherently conflicting and complex relationship. In this study, we demonstrate that the suboptimal accuracy levels derived from human annotations are unsuitable for machine-generated captions. To boost diversity while maintaining high accuracy, we propose an innovative variational transformer (VaT) framework. By integrating an "invisible information prior" (IIP) and an "auto-selectable Gaussian mixture model" (AGMM), we enable its encoder to learn precise linguistic information and object relationships in various scenes, thus ensuring high accuracy. By incorporating the "range-median reward" (RMR) baseline, we preserve a wider range of candidates with higher rewards during the reinforcement-learning-based training process, thereby guaranteeing outstanding diversity. Experimental results indicate that our method achieves simultaneous improvements in accuracy and diversity of up to 1.1% and 4.8%, respectively, over the state of the art. Furthermore, our approach achieves the performance closest to human annotations in semantic retrieval, scoring 50.3 versus the human score of 50.6. Thus, the method can readily be put into industrial use.
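To make the baseline idea concrete, the following schematic sketch (our reading of the range-median reward, with illustrative shapes and numbers; not the authors' implementation) uses the per-image median of K sampled-caption rewards as the baseline, so roughly half of the candidates keep a positive advantage during policy-gradient training.

import numpy as np

def median_baseline_advantages(rewards: np.ndarray) -> np.ndarray:
    """rewards: (batch, K) rewards of K sampled captions per image."""
    baseline = np.median(rewards, axis=1, keepdims=True)   # per-image median reward
    return rewards - baseline                               # advantages for the policy gradient

rewards = np.array([[0.9, 1.2, 0.4, 1.0],
                    [0.2, 0.8, 0.5, 0.7]])
print(median_baseline_advantages(rewards))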
|
43
|
Tang X, Ye S, Shi Y, Hu T, Peng Q, You X. Filter Pruning Based on Information Capacity and Independence. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2025; 36:8401-8413. [PMID: 39231052 DOI: 10.1109/tnnls.2024.3415068] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 09/06/2024]
Abstract
Filter pruning has gained widespread adoption for compressing and speeding up convolutional neural networks (CNNs). However, existing approaches remain far from practical application due to biased filter selection and heavy computation costs. This article introduces a new filter pruning method that selects filters in an interpretable, multi-perspective, and lightweight manner. Specifically, we evaluate the contributions of filters from both individual and overall perspectives. To quantify the amount of information contained in each filter, a new metric called information capacity is proposed. Inspired by information theory, we use the interpretable entropy to measure information capacity and develop a feature-guided approximation process. To capture correlations among filters, another metric called information independence is designed. Since both metrics are evaluated in a simple but effective way, we can identify and prune the least important filters at low computation cost. We conduct comprehensive experiments on benchmark datasets employing various widely used CNN architectures to evaluate the performance of our method. For instance, on ILSVRC-2012, our method outperforms state-of-the-art methods, reducing floating-point operations (FLOPs) by 77.4% and parameters by 69.3% for ResNet-50 with only a minor accuracy decrease of 2.64%.
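As a simplified stand-in for the information-capacity metric (the paper's feature-guided approximation is more involved, and all shapes below are illustrative), the sketch scores each filter by the entropy of its activation histogram over a batch of feature maps and keeps the highest-scoring filters.

import numpy as np

def filter_entropy_scores(feats: np.ndarray, bins: int = 32) -> np.ndarray:
    """feats: (N, C, H, W) activations; returns one entropy score per channel/filter."""
    scores = []
    for c in range(feats.shape[1]):
        x = feats[:, c].ravel()
        hist, _ = np.histogram(x, bins=bins)
        p = hist / hist.sum()
        p = p[p > 0]
        scores.append(-(p * np.log(p)).sum())   # Shannon entropy of activation values
    return np.array(scores)

feats = np.random.rand(8, 16, 14, 14)           # illustrative feature maps
scores = filter_entropy_scores(feats)
keep = np.argsort(scores)[::-1][:12]            # e.g., retain the 12 most "informative" filters
print(keep)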
|
44
|
Liu C, Li B, Shi M, Chen X, Ye Q, Ji X. Explicit Margin Equilibrium for Few-Shot Object Detection. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2025; 36:8072-8084. [PMID: 38980785 DOI: 10.1109/tnnls.2024.3422216] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 07/11/2024]
Abstract
Under low-data regimes, few-shot object detection (FSOD) transfers related knowledge from base classes with sufficient annotations to novel classes with limited samples in a two-step paradigm comprising base training and balanced fine-tuning. In base training, the learned embedding space needs to be dispersed with large class margins to facilitate novel-class accommodation and avoid feature aliasing, whereas in balanced fine-tuning it should be properly concentrated with small margins to represent novel classes precisely. Although attention to this discrimination-representation dilemma has stimulated substantial progress, exploration of the equilibrium of class margins within the embedding space is still ongoing. In this study, we propose a class margin optimization scheme, termed explicit margin equilibrium (EME), that explicitly leverages the quantified relationship between base and novel classes. EME first maximizes base-class margins to reserve adequate space for novel-class adaptation. During fine-tuning, it quantifies the inter-class semantic relationships by calculating equilibrium coefficients based on the assumption that novel instances can be represented by linear combinations of base-class prototypes. EME finally reweights the margin loss using these equilibrium coefficients to adapt base knowledge for novel-instance learning with the help of instance disturbance (ID) augmentation. As a plug-and-play module, EME can also be applied to few-shot classification. Consistent performance gains over various baseline methods and benchmarks validate the generality and efficacy of EME. The code is available at github.com/Bohao-Lee/EME.
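A minimal sketch of the stated assumption follows (illustrative shapes and synthetic data; not the authors' code): if a novel-class prototype is approximately a linear combination of base-class prototypes, the combination coefficients recovered by least squares can play the role of the equilibrium coefficients used to reweight the margin loss.

import numpy as np

def equilibrium_coefficients(base_protos: np.ndarray, novel_proto: np.ndarray) -> np.ndarray:
    """base_protos: (n_base, d); novel_proto: (d,). Returns (n_base,) combination coefficients."""
    coeffs, *_ = np.linalg.lstsq(base_protos.T, novel_proto, rcond=None)
    return coeffs

rng = np.random.default_rng(0)
base = rng.normal(size=(20, 128))               # 20 base-class prototypes
novel = 0.6 * base[3] + 0.4 * base[7]           # a novel prototype built from two base classes
c = equilibrium_coefficients(base, novel)
print(np.argsort(np.abs(c))[-2:])               # indices 3 and 7 dominate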
|
45
|
Song Y, Liu Z, Li G, Xie J, Wu Q, Zeng D, Xu L, Zhang T, Wang J. EMS: A Large-Scale Eye Movement Dataset, Benchmark, and New Model for Schizophrenia Recognition. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2025; 36:9451-9462. [PMID: 39178070 DOI: 10.1109/tnnls.2024.3441928] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 08/25/2024]
Abstract
Schizophrenia (SZ) is a common and disabling mental illness, and most patients encounter cognitive deficits. Eye-tracking technology has been increasingly used to characterize cognitive deficits because of its reasonable time and economic costs. However, there is no large-scale, publicly available eye movement dataset and benchmark for SZ recognition. To address these issues, we release a large-scale Eye Movement dataset for SZ recognition (EMS), which consists of eye movement data from 104 schizophrenics and 104 healthy controls (HCs) based on the free-viewing paradigm with 100 stimuli. We also conduct the first comprehensive benchmark, which has long been absent in this field, comparing 13 related psychosis recognition methods using six metrics. In addition, we propose a novel mean-shift-based network (MSNet) for eye-movement-based SZ recognition, which elaborately combines the mean shift algorithm with convolution to extract the cluster center as the subject feature. In MSNet, a stimulus feature branch (SFB) first enhances each stimulus feature with similar information from all stimulus features, and a cluster center branch (CCB) then generates the cluster center as the subject feature and updates it using the mean shift vector. The performance of our MSNet is superior to that of prior contenders; it can therefore act as a powerful baseline to advance subsequent studies. To pave the road in this research field, the EMS dataset, the benchmark results, and the code of MSNet are publicly available at https://github.com/YingjieSong1/EMS.
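To illustrate the core operation (an illustrative mean-shift routine with assumed feature dimensions, not the MSNet code), the sketch below shifts a subject's per-stimulus feature vectors toward a density mode and returns the converged center as a subject-level feature.

import numpy as np

def mean_shift_center(features: np.ndarray, bandwidth: float = 1.0, iters: int = 50) -> np.ndarray:
    """features: (n_stimuli, d); returns the (d,) cluster-center feature."""
    center = features.mean(axis=0)
    for _ in range(iters):
        dists2 = ((features - center) ** 2).sum(axis=1)
        weights = np.exp(-dists2 / (2 * bandwidth ** 2))    # Gaussian kernel weights
        new_center = (weights[:, None] * features).sum(axis=0) / weights.sum()
        if np.linalg.norm(new_center - center) < 1e-6:
            break
        center = new_center
    return center

feats = np.random.default_rng(1).normal(size=(100, 32))     # 100 stimuli, 32-dim features
print(mean_shift_center(feats).shape)                        # (32,)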
|
46
|
Xu M, Dai N, Jiang L, Fu Y, Deng X, Li S. Recruiting Teacher IF Modality for Nephropathy Diagnosis: A Customized Distillation Method With Attention-Based Diffusion Network. IEEE TRANSACTIONS ON MEDICAL IMAGING 2025; 44:2028-2040. [PMID: 40030767 DOI: 10.1109/tmi.2024.3524544] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 03/05/2025]
Abstract
The joint use of multiple modalities for medical image processing has been widely studied in recent years. The fusion of information from different modalities has demonstrated performance improvements for many medical tasks. For nephropathy diagnosis, immunofluorescence (IF) is one of the most widely used multi-modality medical image types due to its ease of acquisition and its effectiveness for certain nephropathies. However, existing methods mainly assume that different modalities have an equal effect on the diagnosis task, failing to exploit multi-modality knowledge in detail. To avoid this disadvantage, this paper proposes a novel customized multi-teacher knowledge distillation framework to transfer knowledge from trained single-modality teacher networks to a multi-modality student network. Specifically, a new attention-based diffusion network is developed for IF-based diagnosis, considering global, local, and modality attention. In addition, a teacher recruitment module and a diffusion-aware distillation loss are developed to select effective teacher networks based on the medical priors of the input IF sequence. Experimental results on the test and external datasets show that the proposed method has better nephropathy diagnosis performance and generalizability than state-of-the-art methods.
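As a schematic of the distillation objective (weights, temperature, and class counts below are illustrative assumptions, and the paper's recruitment module and diffusion network are not reproduced), the sketch pulls the student's softened predictions toward several single-modality teachers with per-teacher weights.

import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_teacher_kd_loss(student_logits, teacher_logits_list, weights, T=2.0):
    """KL(teacher || student), averaged over the batch and summed over weighted teachers."""
    s = softmax(student_logits, T)
    loss = 0.0
    for w, t_logits in zip(weights, teacher_logits_list):
        t = softmax(t_logits, T)
        loss += w * np.mean(np.sum(t * (np.log(t + 1e-12) - np.log(s + 1e-12)), axis=-1))
    return loss * T * T                         # usual temperature scaling of the KD term

student = np.random.randn(4, 5)                 # batch of 4, 5 illustrative diagnosis classes
teachers = [np.random.randn(4, 5) for _ in range(3)]
print(multi_teacher_kd_loss(student, teachers, weights=[0.5, 0.3, 0.2]))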
|
47
|
Zhou H, Yang R, Zhang Y, Duan H, Huang Y, Hu R, Li X, Zheng Y. UniHead: Unifying Multi-Perception for Detection Heads. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2025; 36:9565-9576. [PMID: 38905097 DOI: 10.1109/tnnls.2024.3412947] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/23/2024]
Abstract
The detection head constitutes a pivotal component within object detectors, tasked with executing both classification and localization functions. Regrettably, the commonly used parallel head often lacks omni perceptual capabilities, such as deformation perception (DP), global perception (GP), and cross-task perception (CTP). Despite numerous methods attempting to enhance these abilities from a single aspect, achieving a comprehensive and unified solution remains a significant challenge. In response to this challenge, we develop an innovative detection head, termed UniHead, to unify three perceptual abilities simultaneously. More precisely, our approach: 1) introduces DP, enabling the model to adaptively sample object features; 2) proposes a dual-axial aggregation transformer (DAT) to adeptly model long-range dependencies, thereby achieving GP; and 3) devises a cross-task interaction transformer (CIT) that facilitates interaction between the classification and localization branches, thus aligning the two tasks. As a plug-and-play method, the proposed UniHead can be conveniently integrated with existing detectors. Extensive experiments on the COCO dataset demonstrate that our UniHead can bring significant improvements to many detectors. For instance, the UniHead can obtain +2.7 AP gains in RetinaNet, +2.9 AP gains in FreeAnchor, and +2.1 AP gains in GFL. The code is available at https://github.com/zht8506/UniHead.
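For intuition about the dual-axial aggregation, the following sketch implements a generic axial-attention pattern (not the exact DAT module; single head, no learned projections): attention is applied along the height axis and then along the width axis of a feature map, giving long-range context at a much lower cost than full 2D self-attention.

import numpy as np

def attend(x):
    """x: (..., L, d) -> scaled dot-product self-attention over the length-L axis."""
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)
    return attn @ x

def dual_axial_attention(feat):
    """feat: (H, W, d). Attention along H (per column), then along W (per row)."""
    out = np.transpose(attend(np.transpose(feat, (1, 0, 2))), (1, 0, 2))  # height axis
    return attend(out)                                                    # width axis

feat = np.random.rand(16, 16, 32)
print(dual_axial_attention(feat).shape)   # (16, 16, 32)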
|
48
|
Bao J, Zhang J, Zhang C, Bao L. DCTCNet: Sequency discrete cosine transform convolution network for visual recognition. Neural Netw 2025; 185:107143. [PMID: 39847941 DOI: 10.1016/j.neunet.2025.107143] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/28/2024] [Revised: 01/01/2025] [Accepted: 01/09/2025] [Indexed: 01/25/2025]
Abstract
The discrete cosine transform (DCT) has been widely used in computer vision tasks due to its high compression ratio and high-quality visual representation. However, conventional DCT is sensitive to the size of the transform region and suffers from blocking effects. Eliminating these blocking effects so that the transform can efficiently serve vision tasks is therefore significant and challenging. In this paper, we introduce the All Phase Sequency DCT (APSeDCT) into convolutional networks to extract multi-frequency information from deep features. Because APSeDCT is equivalent to a convolutional operation, we construct a corresponding convolution module, called APSeDCT Convolution (APSeDCTConv), that has transferability similar to vanilla convolution. We then propose an augmented convolutional operator called MultiConv built on APSeDCTConv. By replacing the last three bottleneck blocks of ResNet with MultiConv, our approach not only reduces computational costs and the number of parameters, but also exhibits strong performance in classification, object detection, and instance segmentation tasks. Extensive experiments show that APSeDCTConv augmentation leads to consistent performance improvements in image classification on ImageNet across various models and scales, including ResNet, Res2Net, and ResNeXt, and achieves 0.5%-1.1% and 0.4%-0.7% AP improvements for object detection and instance segmentation, respectively, on the COCO benchmark compared to the baseline.
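To show why a DCT can be cast as convolution (a plain 2D DCT-II filter bank, not the all-phase sequency variant described in the paper; kernel size and image are illustrative), the sketch below turns each 2D DCT basis function into a fixed kernel and correlates an image with the bank to obtain per-frequency response maps.

import numpy as np
from scipy.ndimage import correlate

def dct2_basis(n: int = 4) -> np.ndarray:
    """Return the n*n separable 2D DCT-II basis as (n*n, n, n) kernels."""
    k = np.arange(n)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)                         # DC row normalization
    return np.stack([np.outer(c[u], c[v]) for u in range(n) for v in range(n)])

image = np.random.rand(32, 32)
bank = dct2_basis(4)                                   # 16 frequency-selective kernels
responses = np.stack([correlate(image, kern, mode="reflect") for kern in bank])
print(responses.shape)                                 # (16, 32, 32)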
Affiliation(s)
- Jiayong Bao
- School of Mathematics and Statistics, Xi'an Jiaotong University, Xi'an 710049, China
- Jiangshe Zhang
- School of Mathematics and Statistics, Xi'an Jiaotong University, Xi'an 710049, China.
- Chunxia Zhang
- School of Mathematics and Statistics, Xi'an Jiaotong University, Xi'an 710049, China
- Lili Bao
- School of Mathematics and Statistics, Xi'an Jiaotong University, Xi'an 710049, China
|
49
|
Zhang K, Zhu D, Min X, Zhai G. Unified Approach to Mesh Saliency: Evaluating Textured and Non-Textured Meshes Through VR and Multifunctional Prediction. IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS 2025; 31:3151-3160. [PMID: 40063447 DOI: 10.1109/tvcg.2025.3549550] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 04/26/2025]
Abstract
Mesh saliency aims to endow models with the adaptability to highlight regions that naturally attract visual attention. Existing advances primarily emphasize the crucial role of geometric shape in determining mesh saliency, but flexibly sensing the distinctive visual appeal introduced by realistic, complex texture patterns remains challenging. To investigate the interaction between geometric shapes and texture features in visual perception, we establish a comprehensive mesh saliency dataset, capturing saliency distributions for identical 3D models under both non-textured and textured conditions. Additionally, we propose a unified saliency prediction model applicable to various mesh types, providing valuable insights for both detailed modeling and realistic rendering applications. The model analyzes the geometric structure of the mesh while seamlessly incorporating texture features into the topological framework, ensuring coherence throughout appearance-enhanced modeling. Through extensive theoretical and empirical validation, our approach not only enhances performance across different mesh types, but also demonstrates the model's scalability and generalizability, particularly through cross-validation of various visual features.
|
50
|
Luo X, Duan Z, Qin A, Tian Z, Xie T, Zhang T, Tang YY. Layer-Wise Mutual Information Meta-Learning Network for Few-Shot Segmentation. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2025; 36:9684-9698. [PMID: 39255186 DOI: 10.1109/tnnls.2024.3438771] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 09/12/2024]
Abstract
The goal of few-shot segmentation (FSS) is to segment unlabeled images belonging to previously unseen classes using only a limited number of labeled images. The main objective is to transfer label information effectively from support images to query images. In this study, we introduce a novel meta-learning framework called layer-wise mutual information (LayerMI), which enhances the propagation of label information by maximizing the mutual information (MI) between support and query features at each layer. Our approach utilizes a LayerMI Block based on information-theoretic co-clustering. This block performs online co-clustering on the joint probability distribution obtained from each layer, generating a target-specific attention map. The LayerMI Block can be seamlessly integrated into the meta-learning framework and applied to all convolutional neural network (CNN) layers without altering the training objectives. Notably, the LayerMI Block not only maximizes MI between support and query features but also facilitates internal clustering within the image. Extensive experiments demonstrate that LayerMI significantly enhances the performance of the baseline and achieves competitive performance compared to state-of-the-art methods on three challenging benchmarks: PASCAL-$5^{i}$, COCO-$20^{i}$, and FSS-1000.
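As a toy version of the quantity being maximized (one simple way to form a joint distribution from two feature sets; the paper's co-clustering procedure is more elaborate, and all shapes are illustrative), the sketch below normalizes support-query similarities into a joint probability table and computes its mutual information.

import numpy as np

def feature_mutual_information(support: np.ndarray, query: np.ndarray) -> float:
    """support: (Ns, d), query: (Nq, d); returns MI of a similarity-based joint p(s, q)."""
    sim = support @ query.T                              # (Ns, Nq) similarity scores
    p = np.exp(sim - sim.max())
    p = p / p.sum()                                      # joint distribution over (s, q) pairs
    ps = p.sum(axis=1, keepdims=True)                    # marginal over support positions
    pq = p.sum(axis=0, keepdims=True)                    # marginal over query positions
    return float((p * (np.log(p + 1e-12) - np.log(ps @ pq + 1e-12))).sum())

rng = np.random.default_rng(0)
support = rng.normal(size=(64, 16))
query = support + 0.1 * rng.normal(size=(64, 16))        # correlated features -> higher MI
print(feature_mutual_information(support, query))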
|