Contrastive Learning Approaches for Deep Learning
Contrastive learning has emerged as a powerful approach for self-supervised representation learning, achieving classification accuracies within 1-2% of supervised benchmarks on ImageNet. These methods learn by comparing positive pairs of augmented samples against negative examples, creating embedding spaces where semantically similar items cluster together while dissimilar ones are pushed apart.
The fundamental challenge lies in designing contrastive objectives and sampling strategies that capture meaningful invariances while avoiding representational collapse.
This page brings together solutions from recent research—including momentum encoders, memory banks, hard negative mining techniques, and multi-view consistency approaches. These and other methods demonstrate how contrastive learning can be effectively implemented across computer vision, natural language processing, and multi-modal applications.
1. Contrastive Fine-Tuning of Image-Text Models with Embedding Perturbation Based on Correlation-Derived Vectors
ROBERT BOSCH GMBH, 2025
Robust contrastive fine-tuning of image-text machine learning models for improved robustness and zero-shot classification accuracy. The method involves perturbing the embeddings of images and their text descriptions during fine-tuning, rather than just the images. This is done by generating perturbation vectors with magnitudes and directions based on correlations between the original embedding and other embeddings. These perturbed embeddings are then used to calculate the contrastive loss for fine-tuning. This improves robustness to perturbations and outlier samples, as well as zero-shot classification accuracy.
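A minimal PyTorch sketch of the idea, under assumptions: the perturbation direction is taken as a correlation-weighted mixture of the other embeddings in the batch and the magnitude scales with how correlated a sample is with the rest; the function names and scaling heuristic are illustrative, not the patented formulation.

```python
import torch
import torch.nn.functional as F

def correlation_perturb(emb: torch.Tensor, eps: float = 0.05) -> torch.Tensor:
    """Perturb each embedding along a direction derived from its correlation
    (cosine similarity) with the other embeddings in the batch."""
    z = F.normalize(emb, dim=-1)                                     # (B, D)
    sim = z @ z.t()                                                  # (B, B) pairwise correlations
    sim = sim.masked_fill(torch.eye(len(z), dtype=torch.bool), 0.0)  # ignore self-correlation
    direction = F.normalize(sim @ z, dim=-1)                         # correlation-weighted mixture
    magnitude = eps * sim.abs().mean(dim=1, keepdim=True)            # stronger for highly correlated samples
    return F.normalize(z + magnitude * direction, dim=-1)

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    logits = img_emb @ txt_emb.t() / temperature
    labels = torch.arange(img_emb.size(0))
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

# Dummy encoder outputs standing in for image/text embeddings during fine-tuning.
img_emb = correlation_perturb(torch.randn(8, 512))
txt_emb = correlation_perturb(torch.randn(8, 512))
loss = clip_style_loss(img_emb, txt_emb)
```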
2. Distributed Contrastive Loss Calculation with Grouped Batch Processing Across Multiple GPUs
ALIPAY INFORMATION TECHNOLOGY CO LTD, 2025
Calculating contrastive loss through multiple GPUs in a more memory-efficient way for training large neural networks with multiple modalities. The method involves dividing the training batches into groups, calculating group-specific contrastive loss on each GPU, and then aggregating the group losses to get the overall batch loss. This reduces the memory requirements per GPU compared to calculating the full loss across all samples.
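A single-process sketch of the grouping trick, assuming groups that would normally sit on separate GPUs are processed sequentially here: instead of one B x B similarity matrix, only G matrices of size (B/G)^2 are materialized, and the group losses are averaged into the batch loss.

```python
import torch
import torch.nn.functional as F

def grouped_contrastive_loss(img, txt, num_groups=4, temperature=0.07):
    img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
    losses = []
    for g_img, g_txt in zip(img.chunk(num_groups), txt.chunk(num_groups)):
        logits = g_img @ g_txt.t() / temperature        # small (B/G, B/G) matrix per group
        labels = torch.arange(g_img.size(0))
        losses.append(0.5 * (F.cross_entropy(logits, labels) +
                             F.cross_entropy(logits.t(), labels)))
    return torch.stack(losses).mean()                   # aggregate group losses into the batch loss

loss = grouped_contrastive_loss(torch.randn(64, 256), torch.randn(64, 256))
```

The trade-off is that negatives are only drawn from within each group, which is exactly what keeps the per-GPU memory bounded.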
3. Contrastive Learning-Based Label Propagation with Variable Representation Mapping in Sparse Datasets
CAPITAL ONE SERVICES LLC, 2025
Propagating labels through a sparsely labeled dataset using contrastive learning projection to improve label accuracy in artificial intelligence applications. The method involves mapping labeled points from a coarse representation to a fine-grained representation using a contrastive learning projection that maximizes distances between dissimilar labeled points and minimizes distances between similar labeled points. This expands the projection space based on specific syntax rules associated with the labeled points. Unlabeled points are then projected into the fine-grained representation and labels are propagated based on similarity to the labeled points.
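An illustrative sketch (not the patented method, and omitting the syntax-rule expansion): a projection head is trained with a supervised contrastive loss on the labeled points, then labels propagate to unlabeled points by nearest labeled neighbour in the projected space.

```python
import torch
import torch.nn.functional as F

proj = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.ReLU(), torch.nn.Linear(64, 32))
opt = torch.optim.Adam(proj.parameters(), lr=1e-3)

def sup_con_loss(z, labels, temperature=0.1):
    z = F.normalize(z, dim=-1)
    sim = z @ z.t() / temperature
    mask_pos = (labels[:, None] == labels[None, :]).float()
    mask_pos.fill_diagonal_(0.0)                       # a point is not its own positive
    logits = sim - torch.eye(len(z)) * 1e9             # exclude self from the softmax denominator
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    denom = mask_pos.sum(1).clamp(min=1.0)
    return -(mask_pos * log_prob).sum(1).div(denom).mean()

x_lab, y_lab = torch.randn(100, 128), torch.randint(0, 5, (100,))
x_unlab = torch.randn(200, 128)

for _ in range(50):                                    # pull same-label points together, push others apart
    opt.zero_grad()
    sup_con_loss(proj(x_lab), y_lab).backward()
    opt.step()

with torch.no_grad():                                  # propagate labels by nearest labeled neighbour
    z_lab, z_unlab = F.normalize(proj(x_lab), dim=-1), F.normalize(proj(x_unlab), dim=-1)
    y_unlab = y_lab[(z_unlab @ z_lab.t()).argmax(dim=1)]
```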
4. Contrastive Language-Audio Pre-Training Model with Emotion Captioning for Joint Audio-Text Representation
ROBERT BOSCH GMBH, 2025
Retrieving speech emotion samples using natural language prompts to improve speech emotion recognition systems. The method involves training a contrastive language-audio pre-training (CLAP) model using emotion captions generated by a large language model like ChatGPT. The captions are generated based on emotion classes and lexicons. The CLAP model learns joint audio-text representations without predefined categories. It processes audio and text separately through encoders and connects them in joint space using linear projections. Contrastive learning is used to learn similarity between audio-text pairs in a batch. This allows zero-shot predictions and improves retrieval diversity for unseen and out-of-domain emotions.
5. Dual Encoder Model with Contrastive Learning Using Separate Image and Text Encoders
GOOGLE LLC, 2025
Training a dual encoder model using contrastive learning to enhance downstream tasks like object detection and caption generation. The model processes input data through two separate encoder networks - one for images and one for text - and trains them together using a contrastive loss function. The network optimizes both encoder parameters based on the similarity between input pairs, with higher-separation pairs receiving increased contribution during training. This approach effectively balances the gradient contributions from both input modalities, enabling more accurate model performance in low-shot settings where training data is scarce.
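A sketch of the dual-encoder objective; the per-pair weighting that emphasizes "higher-separation" (currently poorly aligned) pairs is an illustrative choice, not necessarily the patented weighting scheme.

```python
import torch
import torch.nn.functional as F

def weighted_dual_encoder_loss(img_emb, txt_emb, temperature=0.07):
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    labels = torch.arange(img_emb.size(0))
    per_pair = 0.5 * (F.cross_entropy(logits, labels, reduction="none") +
                      F.cross_entropy(logits.t(), labels, reduction="none"))
    # Up-weight matched pairs whose embeddings are currently far apart.
    separation = 1.0 - (img_emb * txt_emb).sum(dim=-1)     # 0 = aligned, 2 = opposite
    weights = (separation / separation.sum()).detach()
    return (weights * per_pair).sum()

loss = weighted_dual_encoder_loss(torch.randn(16, 512, requires_grad=True),
                                  torch.randn(16, 512, requires_grad=True))
loss.backward()
```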
6. Deep Neural Network for Patch-wise Disease Prediction in Chest X-Rays Using Self-supervised Image and Text Feature Alignment
SIEMENS HEALTHINEERS AG, 2025
AI algorithm for medical image analysis that provides patch-wise predictions of disease presence in chest X-rays. The algorithm uses a deep neural network (DNN) trained using a self-supervised learning approach with both image and text inputs. It takes chest X-rays and text prompts describing the disease as inputs. The DNN extracts features from the images and text. It aligns the features using a loss function that measures the distance between corresponding patch features in the images and text. This self-supervised training enables learning disease features from unlabeled images. In fine-tuning, the DNN is further trained with labeled images to make patch-wise predictions of disease presence.
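A rough sketch of patch-wise image-text alignment; the projection heads, the max-over-patches pooling, and the batch-contrastive pairing are simplified assumptions rather than the Siemens formulation.

```python
import torch
import torch.nn.functional as F

class PatchAligner(torch.nn.Module):
    """Projects image patch features and text-prompt features into a joint space."""
    def __init__(self, img_dim=768, txt_dim=512, joint_dim=256):
        super().__init__()
        self.img_proj = torch.nn.Linear(img_dim, joint_dim)
        self.txt_proj = torch.nn.Linear(txt_dim, joint_dim)

    def forward(self, patch_feats, txt_feat):
        p = F.normalize(self.img_proj(patch_feats), dim=-1)   # (B, N_patches, D)
        t = F.normalize(self.txt_proj(txt_feat), dim=-1)      # (B, D)
        return p, t

def patchwise_alignment_loss(p, t, temperature=0.07):
    # Image-level score = similarity of the best-matching patch to each prompt.
    sim = torch.einsum("bnd,cd->bcn", p, t).max(dim=-1).values   # (B, B)
    labels = torch.arange(p.size(0))
    return F.cross_entropy(sim / temperature, labels)

model = PatchAligner()
patch_feats = torch.randn(4, 196, 768)    # e.g. 14x14 ViT patches per chest X-ray
txt_feat = torch.randn(4, 512)            # encoded disease-description prompts
p, t = model(patch_feats, txt_feat)
loss = patchwise_alignment_loss(p, t)
# At inference, the per-patch similarity map (p @ t[i]) serves as a patch-wise disease score.
```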
7. Few-Shot Learning for Medical Image Segmentation: A Review and Comparative Study
Theekshana Dissanayake, Yasmeen George, Dwarikanath Mahapatra - Association for Computing Machinery, 2025
Medical image segmentation plays a crucial role in assisting clinicians with diagnosing critical medical conditions. In deep learning, few-shot learning methods aim to replicate the human ability to learn from only a few examples when predicting novel classes. Researchers in the medical imaging community have also explored few-shot segmentation, meta-learning, foundation models, and self-supervised learning (SSL). Acknowledging this growing interest, we review the literature from 2020 to early 2025, focusing on architectural modifications, loss-inspired strategies, and meta-learning frameworks. We further divide each category into fine-grained, learning-oriented solutions, including contrastive regularization, providing in-depth discussion of improvements in representation strategies. Additionally, we present preliminary results for several methods across both medical and computer vision domains, evaluating their strengths and limitations in applications. Finally, based on the observed advancements in the natural image domain and our empirical findings, we outline future research directions, with specific insights into data-efficient rapid adaptation and generalization. The code is available here.
8. A Self-Supervised Specific Emitter Identification Method Based on Contrastive Asymmetric Masked Learning
Dong Wang, Yonghui Huang, Tianshu Cui - Multidisciplinary Digital Publishing Institute, 2025
Specific emitter identification (SEI) is a core technology for wireless device security that plays a crucial role in protecting communication systems from various threats. However, current deep learning-based SEI methods rely heavily on large amounts of labeled data for supervised training and face challenges in non-cooperative scenarios. To address these issues, this paper proposes a novel contrastive asymmetric masked learning SEI (CAML-SEI) method, effectively solving the problem of SEI under scarce labeled samples. The proposed method constructs an asymmetric auto-encoder architecture, comprising an encoder network based on channel squeeze-and-excitation residual blocks to capture radio frequency fingerprint (RFF) features embedded in signals, while employing a lightweight single-layer convolutional decoder for signal reconstruction. This design promotes the learning of fine-grained local feature representations. To further enhance feature discriminability, a learnable non-linear mapping is introduced to compress the high-dimensional encoded features into a compact low-dimensional space, accompanied by a loss function that simultaneously achieves aggregation of positive samples and separation of n...
9. Machine Learning Model Training for MRI Reconstruction Using Contrastive Learning with Anchor and Negative Example Integration
SHANGHAI UNITED IMAGING INTELLIGENCE CO LTD, 2025
Training machine learning models for accelerated MRI reconstruction using contrastive learning. The technique involves training the ML model using a contrastive learning approach that involves generating reconstructed MRI datasets from under-sampled MRI datasets, and adjusting the ML model parameters based on anchor and negative examples. The anchor example replaces part of the reconstructed MRI data with the under-sampled data, while the negative example replaces part of the reconstructed MRI data with different values. This contrastive learning helps the ML model learn more effectively and converge faster since it has more supervision signals from both the original and replaced locations.
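An illustrative sketch (not the patented pipeline): the anchor example splices the trusted under-sampled data back into the reconstruction at the observed locations, the negative example splices in mismatched values, and a triplet-style term supplements the usual reconstruction loss.

```python
import torch
import torch.nn.functional as F

def build_examples(recon, undersampled, mask):
    """mask = 1 where the under-sampled acquisition actually observed data."""
    recon_d = recon.detach()
    anchor = torch.where(mask.bool(), undersampled, recon_d)                  # keep trusted values
    negative = torch.where(mask.bool(), torch.randn_like(recon_d), recon_d)   # replace them with different values
    return anchor, negative

def contrastive_recon_loss(recon, anchor, negative, margin=1.0):
    d_pos = F.mse_loss(recon, anchor)
    d_neg = F.mse_loss(recon, negative)
    return F.relu(d_pos - d_neg + margin)      # pull toward the anchor, away from the negative

recon = torch.randn(1, 1, 64, 64, requires_grad=True)    # stand-in for the model's reconstruction
undersampled = torch.randn(1, 1, 64, 64)
mask = (torch.rand(1, 1, 64, 64) < 0.3).float()          # ~30% sampled locations
anchor, negative = build_examples(recon, undersampled, mask)
loss = contrastive_recon_loss(recon, anchor, negative)
loss.backward()
```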
10. A multimodal visual–language foundation model for computational ophthalmology
Danli Shi, Weiyi Zhang, J. Yang - Nature Portfolio, 2025
Early detection of eye diseases is vital for preventing vision loss. Existing ophthalmic artificial intelligence models focus on single modalities, overlooking multi-view information and struggling with rare diseases due to long-tail distributions. We propose EyeCLIP, a multimodal visual-language foundation model trained on 2.77 million ophthalmology images from 11 modalities with partial clinical text. Our novel pretraining strategy combines self-supervised reconstruction, image contrastive learning, and image-text contrastive learning to capture shared representations across modalities. EyeCLIP demonstrates robust performance on 14 benchmark datasets, excelling in disease classification, visual question answering, and cross-modal retrieval. It also exhibits strong few-shot and zero-shot capabilities, enabling accurate predictions in real-world scenarios. EyeCLIP offers significant potential for detecting both ocular and systemic diseases, bridging gaps in real-world applications.
11. Data-Driven Simulation System for Autonomous Vehicle Scenarios with Controllable Guidance Signals and Contrastive Loss-Based Representation Learning
NVIDIA CORP, 2025
Generating realistic driving scenarios for autonomous vehicles using data-driven simulation. The simulation leverages real-world traffic data to accurately generate agent behavior. To enable controllability of the simulation, guidance signals are provided to the scenario generator. These signals can be expressed using self-supervised learning techniques like contrastive loss to learn representations of scenarios without labeled data. This allows generating customized and controllable scenarios for autonomous vehicle training.
12. A network intrusion detection method based on contrastive learning and Bayesian Gaussian Mixture Model
Lei Liu, Ming Xu - Springer Nature, 2025
Network Intrusion Detection Systems (NIDS) are essential for safeguarding networks against malicious activities. However, existing machine learning-based NIDS often require complex feature engineering, which demands significant domain expertise and experimentation, leading to suboptimal model performance. In contrast, deep learning approaches, while powerful, struggle with imbalanced data, resulting in a bias towards normal traffic and reduced effectiveness in detecting rare attacks. To address these issues, we propose a method that combines contrastive learning with a Bayesian Gaussian Mixture Model (BGMM). Specifically, a novel contrastive loss enables the model to automatically learn similarity within and distinction between traffic classes, thereby generating robust and distinguishable representations. This approach not only eliminates the need for manual feature engineering but also helps alleviate the issue of weak representations. The BGMM further enhances detection by adapting to both normal and attack patterns through the use of multiple mixture components. The proposed method is validated by extensive experiments on two widely used modern intrusion datasets. On the UNSW-NB15 datas...
13. Vulnerability Detection Method with Dual-View Causal Reasoning and Contrastive Learning
YANGZHOU UNIVERSITY, 2025
Explainable vulnerability detection using dual-view causal reasoning for accurate, robust, and concise explanations of software security issues. The method involves a two-step process: (1) using contrastive learning to train a vulnerability detection model, and (2) generating explanations using dual-view causal reasoning. The explanations provide a minimal subset of the code that, when removed, changes the model's prediction. This allows concise, explainable vulnerability detection with robustness against perturbations. The contrastive learning uses self-supervised and supervised contrastive losses to train the model.
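A simplified sketch of the two contrastive terms used in step (1), a self-supervised loss over augmented code views plus a supervised loss over vulnerability labels; the code encoder and augmentation are placeholders, not the patented components.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """Self-supervised term: each code sample should match its augmented view."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature
    return F.cross_entropy(logits, torch.arange(z1.size(0)))

def sup_con(z, labels, temperature=0.1):
    """Supervised term: samples with the same vulnerability label attract each other."""
    z = F.normalize(z, dim=-1)
    sim = z @ z.t() / temperature
    self_mask = torch.eye(len(z), dtype=torch.bool)
    pos = (labels[:, None] == labels[None, :]) & ~self_mask
    log_prob = sim - torch.logsumexp(sim.masked_fill(self_mask, -1e9), dim=1, keepdim=True)
    return -(log_prob[pos]).mean()

z_view1 = torch.randn(32, 256, requires_grad=True)   # embeddings of original code snippets
z_view2 = torch.randn(32, 256, requires_grad=True)   # embeddings of augmented code snippets
labels = torch.randint(0, 2, (32,))                  # 1 = vulnerable, 0 = benign
loss = info_nce(z_view1, z_view2) + sup_con(z_view1, labels)
loss.backward()
```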
14. Graph Neural Network-Based Feature Extraction Method with Contrastive Learning for Search Query and Media Node Matching
TENCENT TECHNOLOGY COMPANY LTD, 2025
Method for improving search accuracy in applications like media search by leveraging graph neural networks and contrastive learning. The method involves training a graph neural network to extract features from nodes in a media search graph containing queries, media, and associations. The training uses pairs of nodes that are connected versus randomly combined to learn distinguishing features. This trained network is then used to extract features from search queries and media. During search, these features are compared to find matching nodes for personalized results. The method also involves using a two-tower search model with separate branches for query and media features to further improve accuracy.
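A minimal sketch in plain PyTorch (a stand-in for a real GNN library) of the training signal described above: one round of neighbourhood aggregation over the query/media graph, then a margin loss that pulls connected node pairs together and pushes randomly combined pairs apart. The architecture and margin are assumptions.

```python
import torch
import torch.nn.functional as F

class TinyGraphEncoder(torch.nn.Module):
    def __init__(self, in_dim=64, out_dim=32):
        super().__init__()
        self.lin = torch.nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        agg = (adj @ x) / deg                          # mean over graph neighbours
        return F.normalize(self.lin(x + agg), dim=-1)

n_nodes = 200
x = torch.randn(n_nodes, 64)                           # query + media node features
adj = (torch.rand(n_nodes, n_nodes) < 0.02).float()    # toy association edges
adj = ((adj + adj.t()) > 0).float()

model = TinyGraphEncoder()
z = model(x, adj)

src, dst = adj.nonzero(as_tuple=True)                  # connected (positive) pairs
rand_dst = torch.randint(0, n_nodes, (len(src),))      # randomly combined (negative) pairs
pos = (z[src] * z[dst]).sum(-1)
neg = (z[src] * z[rand_dst]).sum(-1)
loss = F.relu(0.5 - pos + neg).mean()                  # margin-based contrastive objective
loss.backward()
```

At serving time the trained encoder's query and media embeddings would feed the two-tower matching model mentioned above.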
15. Unsupervised Vision Mamba with Contrastive Regularization Network for Image Dehazing
Bin Hu, Jincheng Li, Sai Yang - Research Square, 2025
Benefiting from the powerful nonlinear fitting ability of neural networks, deep learning-based methods have gradually emerged as the dominant solutions for single image dehazing. However, supervised learning-based methods require paired samples for training. To address this, an unsupervised Vision Mamba with Contrastive Regularization network (VMCR) is proposed. The network is designed on the DisentGAN framework, with Vision Mamba as its main module, which performs very competitively compared to transformers while maintaining linear time complexity and constant memory with respect to input size. Furthermore, a contrastive regularization learning method is proposed to enhance reconstruction capabilities and achieve superior dehazing results. Our VMCR-Net outperforms state-of-the-art methods, as evidenced by experimental results on several benchmarks. This research successfully proposes an enhanced unsupervised dehazing approach, overcoming limitations of existing methods and achieving strong performance.
16. Molecular Embedding Model Training via Contrastive Learning with Scaffold-Based Similarity Constraints
MICROSOFT TECHNOLOGY LICENSING LLC, 2025
Training molecular embedding models using contrastive learning with scaffold similarity for improved molecular similarity analysis. The method involves generating a training dataset by separating molecules into similar (sharing scaffold) and dissimilar (different scaffolds) pairs. This defines similarity based on scaffolds. The model learns to map similar molecules close together and dissimilar molecules far apart in embedding space. This improves molecular embedding quality for tasks like drug discovery and property prediction.
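A sketch of scaffold-based pair construction and a triplet objective. The scaffold identifiers are assumed to be precomputed (e.g. Bemis-Murcko scaffolds), the molecule features are random stand-ins for fingerprints, and the encoder is a placeholder rather than a real molecular embedding model.

```python
import random
import torch

molecules = {                                      # feature vector + precomputed scaffold id
    "mol_a": (torch.randn(128), "scaffold_1"),
    "mol_b": (torch.randn(128), "scaffold_1"),     # shares a scaffold with mol_a -> similar pair
    "mol_c": (torch.randn(128), "scaffold_2"),     # different scaffold -> dissimilar pair
    "mol_d": (torch.randn(128), "scaffold_2"),
}

encoder = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.ReLU(), torch.nn.Linear(64, 32))
triplet = torch.nn.TripletMarginLoss(margin=1.0)

def sample_triplet():
    anchor_name = random.choice(list(molecules))
    feat_a, scaf_a = molecules[anchor_name]
    pos = [f for n, (f, s) in molecules.items() if s == scaf_a and n != anchor_name]
    neg = [f for _, (f, s) in molecules.items() if s != scaf_a]
    return feat_a, random.choice(pos), random.choice(neg)

a, p, n = sample_triplet()
loss = triplet(encoder(a.unsqueeze(0)), encoder(p.unsqueeze(0)), encoder(n.unsqueeze(0)))
loss.backward()          # similar molecules move closer, dissimilar ones move apart
```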
17. Neural Network Architecture with Modality-Specific Partitioned Attention for Multimodal Contrastive Learning
DEEPMIND TECHNOLOGIES LTD, 2025
Processing inputs using neural networks that maintain modality-specific representations while enabling multimodal contrastive learning. The approach involves partitioning the input data into disjoint modalities and training separate attention mechanisms for each partition. At inference, the system combines these partition-specific attention mechanisms with fused attention mechanisms to generate outputs that can be processed independently by each modality. This enables the use of contrastive learning techniques while preserving modality-specific representations.
18. Contrastive Attention-Supervised Tuning with Saliency-Guided Geometric Transform for Visual Grounding in Self-Supervised Learning
SALESFORCE INC, 2025
Self-supervised learning technique called Contrastive Attention-Supervised Tuning (CAST) that improves self-supervised learning for computer vision tasks by fixing the visual grounding ability of contrastive SSL methods. The CAST training method uses unsupervised saliency maps to provide explicit grounding supervision to encourage models to focus on specific objects when making decisions. It also introduces a geometric transform for randomly cropping views based on constraints derived from saliency maps to fix the visual grounding issue of randomly sampled crops in complex scenes.
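An illustrative simplification of CAST's geometric transform: random crops are rejected unless they retain a minimum fraction of the unsupervised saliency mass, so both contrastive views keep the salient object. Thresholds and function names are assumptions.

```python
import torch

def saliency_guided_crop(img, saliency, crop=128, min_overlap=0.5, max_tries=20):
    """img: (C, H, W), saliency: (H, W) non-negative unsupervised saliency map."""
    _, H, W = img.shape
    total = saliency.sum().clamp(min=1e-8)
    for _ in range(max_tries):
        top = torch.randint(0, H - crop + 1, (1,)).item()
        left = torch.randint(0, W - crop + 1, (1,)).item()
        kept = saliency[top:top + crop, left:left + crop].sum() / total
        if kept >= min_overlap:            # accept only crops covering enough saliency mass
            break
    return img[:, top:top + crop, left:left + crop]

img = torch.rand(3, 224, 224)
saliency = torch.zeros(224, 224)
saliency[60:160, 80:200] = 1.0             # toy saliency region for the object
view1 = saliency_guided_crop(img, saliency)
view2 = saliency_guided_crop(img, saliency)   # both views now cover the salient region
```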
19. Self-supervised contrastive learning with time-frequency consistency for few-shot bearing fault diagnosis
Xiaoyun Gong, Y. Wei, Wenliao Du - IOP Publishing, 2025
Deep learning technology has made significant progress in fault diagnosis. However, in real-world industrial settings, most existing methods require substantial labeled data for training, while harsh operating conditions and collection constraints often result in scarce fault samples. This limitation significantly impairs their diagnostic performance in practical applications. To address this challenge, we propose a few-shot fault diagnosis approach based on a time-frequency contrastive learning (TF-CL) framework. The TF-CL framework adopts a pre-training and downstream task pipeline, enabling the model to automatically learn and extract multi-perspective features from unlabeled data under self-supervised conditions. During pre-training, dedicated encoders separately extract time-domain and frequency-domain feature representations from abundant unlabeled data, which are then projected into a shared space using a projector. To ensure that consistent representations can be learned from the data, this paper introduces a time-frequency consistency loss function, constructed from novel positive and negative sample pairs. In the downstream task, the pre-trained encoder is combined with a multilayer perceptron classifier and optimized by end-to-end fine-tuning with limited labeled data. Gradient updates...
20. Self-Supervised Learning for Domain Adaptation in Medical Imaging
Murali Krishna Pasupuleti, 2025
Abstract: Self-supervised learning (SSL) offers a transformative path for addressing domain adaptation in medical imaging, where annotated datasets are often limited and expensive to acquire. This paper explores how various SSL approaches, including contrastive learning (SimCLR), masked image modeling (MAE), and transformer-based methods (DINO), improve segmentation and classification performance across heterogeneous imaging domains (MRI, X-ray, CT). Using datasets such as BraTS, CheXpert, and NIH ChestXray14, we evaluate SSL pretraining followed by fine-tuning with minimal supervision. We demonstrate statistically significant improvements (6-15%) in Dice scores and AUC. Regression analysis shows a strong correlation between representation similarity (CKA) and downstream task performance. Explainability tools such as SHAP and LIME are used to validate model reliability and transparency. Keywords: Self-Supervised Learning, Domain Adaptation, Medical Imaging, Contrastive Learning, SimCLR, DINO, Swin UNet, SHAP, LIME, Transfer Learning
21. Neural Network Training via Bilevel Spectral Inference with Covariance-Based Gradient Estimation
DEEPMIND TECHNOLOGIES LTD, 2025
Training neural networks to generate high quality feature representations by optimizing a spectral inference objective using a bilevel optimization technique. This involves maintaining moving averages of covariance measures and the Jacobian of the covariance during training. It also involves computing kernel-weighted mini-batch covariance estimates and using them to generate gradient estimates for updating the network parameters.
22. RPF-MAD: A Robust Pre-Training–Fine-Tuning Algorithm for Meta-Adversarial Defense on the Traffic Sign Classification System of Autonomous Driving
Xiaoxu Peng, Dong Zhou, Zhang Jianwen - Multidisciplinary Digital Publishing Institute, 2025
Traffic sign classification (TSC) based on deep neural networks (DNNs) plays a crucial role in the perception subsystem of autonomous driving systems (ADSs). However, studies reveal that TSC systems can make dangerous and potentially fatal errors under adversarial attacks. Existing defense strategies, such as adversarial training (AT), have demonstrated effectiveness but struggle to generalize across diverse attack scenarios. Recent advancements in self-supervised learning (SSL), particularly adversarial contrastive learning (ACL) methods, show strong potential for enhancing robustness and generalization compared to AT. However, conventional ACL methods lack mechanisms to ensure effective robustness transferability across different training stages. To address this, we propose a robust pre-training and fine-tuning algorithm for meta-adversarial defense (RPF-MAD), designed to enhance robustness sustainability throughout the pipeline. Dual-track meta-adversarial pre-training (Dual-MAP) integrates meta-learning with ACL, which improves the ability of the upstream model to generalize across attack conditions. Meanwhile, adaptive variance anchoring robust fine-tuning (AVA-RFT) utilizes prototype regularization to stabilize feature representations and reinforce generalizable capabili...
23. System for Encoding and Aligning Multimodal Sensor Data Using Neural Networks with Hierarchical Scenario Representation
PONY.AI INC, 2025
System for generating and organizing driving scenarios for autonomous vehicles to improve safety, efficiency, and reliability. The system uses neural networks to encode and decode multimodal sensor data like video, audio, and text prompts. It aligns sequences of sensor data with prompts using contrastive learning. This allows finding specific sensor sequences that match a given prompt. The system then generates a hierarchical structure representing the matching sensor sequence. By encoding, embedding, and aligning multimodal data, it creates a shared analytical space to discover cross-modal correlations and analyze latent dependencies. This improves understanding of navigation scenarios by capturing context and nuances like temporal evolution. The system also organizes scenarios for searchability and retrieval.
24. Automated Detection of Canine Babesia Parasite in Blood Smear Images Using Deep Learning and Contrastive Learning Techniques
Dilip Kumar Baruah, Kuntala Boruah, Nagendra Nath Barman - MDPI AG, 2025
This research introduces a novel method that integrates both unsupervised and supervised learning, leveraging SimCLR (Simple Framework for Contrastive Learning of Visual Representations) self-supervised learning along with different pre-trained models to improve microscopic image classification of the Babesia parasite in canines. We focused on three popular CNN architectures, namely ResNet, EfficientNet, and DenseNet, and evaluated the impact of pre-training on their performance. A detailed comparison of DenseNet variants in terms of accuracy and training efficiency is presented. Base models such as DenseNet were utilized within the SimCLR framework. Firstly, the models were pre-trained on unlabeled images, followed by training classifiers on labeled datasets. This approach significantly improved robustness and accuracy, demonstrating the potential benefits of combining contrastive and conventional supervised techniques. The highest accuracy of 97.07% was achieved with EfficientNet_b2. Thus, detection of Babesia or other hemoparasites in blood smear images could be automated with high accuracy without using a labelled dataset.
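For reference, the NT-Xent objective at the core of SimCLR pre-training, shown standalone; the backbone, augmentations, and hyperparameters here are generic placeholders rather than the exact configuration of this study.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """z1, z2: projections of two augmented views of the same images, shape (B, D)."""
    B = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=-1)                        # (2B, D)
    sim = z @ z.t() / temperature
    sim = sim.masked_fill(torch.eye(2 * B, dtype=torch.bool), float("-inf"))   # drop self-pairs
    targets = torch.cat([torch.arange(B, 2 * B), torch.arange(B)])             # positive = the other view
    return F.cross_entropy(sim, targets)

z1 = torch.randn(32, 128, requires_grad=True)   # projections of augmented view 1
z2 = torch.randn(32, 128, requires_grad=True)   # projections of augmented view 2
loss = nt_xent(z1, z2)
loss.backward()
```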
25. MoHGCN: Momentum Hypergraph Convolution Network for Cross-modal Retrieval
Ying Li, Yuxiang Ding - Association for Computing Machinery, 2025
Cross-modal retrieval tasks, encompassing those of image-text, video-audio, and more, are progressively gaining significance in response to the exponential growth of multimodal information on the Internet. However, there has always been a cloud hanging over multimodal retrieval tasks due to the inherent challenges of aligning different modalities with distinct physical meanings. Most previous works simply rely on a single encoder or a novel similarity calculation for fusion, which often results in unsatisfactory performance. To tackle this challenge, we introduce the Momentum Hypergraph Convolutional Network (MoHGCN) for representation learning, which strengthens the alignment of both visual and textual data before the fusion process. Specifically, MoHGCN utilizes contrastive learning to select the most challenging negative and positive samples to form hyperedges, and completes modality alignment through two rounds of fusion. Subsequently, the fully integrated node features are globally fused using a fusion module to obtain the final vector for image-text retrieval. Extensive experiments conducted on two widely-used datasets, namely Flickr30K and MSCOCO, demonstrate the superiority of the proposed approach, achieving state-of-the-art performances.
26. Quality controlling in capsule gastroduodenoscopy with less annotation via self-supervised learning
Yaqiong Zhang, Kai Zhang, Meijia Wang - Research Square, 2025
Background: It is possible to control the quality of capsule endoscopic images using artificial intelligence (AI), but it requires a great deal of time for labeling. Methods: SimCLR (a simple framework for contrastive learning of visual representations) is capable of acquiring inherent image representations with minimal annotation, but its feasibility for this task had not been studied. 62,850 images were collected to train the models, with internal cross-validation (more training data and less testing data) and reversed cross-validation (less training and more testing data). Random forest and XGBoost (eXtreme Gradient Boosting) were used to finish quality controlling after extracting features from the images. Results: The mean AUROC (Area Under the Receiver Operating Characteristic curve) exceeded 0.98 and 0.97 in the two settings, and the SimCLR-based models surpassed a supervised CNN (Convolutional Neural Network). On an extra 18,636 gathered pictures, the AUROC reached 0.93 (95% CI 0.9271-0.9548), close to that of the CNN (0.9645) in cross-validation, and surpassed 0.96, better than the CNN (0.8374), in the reversed setting. Conclusions: Through SimCLR, the quality-control task can be completed with similar or better performance and fewer annotations.
27. Contrastive In-Context Learning for Personalized Response Generation in Large Language Models
INTUIT INC, 2025
Training large language models to generate personalized and context-specific responses using contrastive in-context learning. The technique involves feeding both positive and negative examples to the model during training. The positive examples are desired responses based on user preferences, while the negative examples are undesired responses. The model learns to generate preferred responses and avoid the non-preferred ones. After training, the model can generate customized answers for new questions based on the learned user preferences.
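A toy sketch of the prompt-assembly side of this idea: preferred and non-preferred example responses are both shown to the model so it can imitate the former and avoid the latter. The template and example content are assumptions, not the patented training procedure.

```python
# Hypothetical preference examples; in practice these would come from user data.
preference_examples = [
    {
        "question": "How do I export my expense report?",
        "preferred": "Short, step-by-step answer referencing the user's plan tier.",
        "non_preferred": "Long generic answer that ignores the user's plan.",
    },
]

def build_contrastive_prompt(examples, new_question):
    parts = ["You will see good and bad example answers. Imitate the good ones."]
    for ex in examples:
        parts.append(f"Question: {ex['question']}")
        parts.append(f"Good answer: {ex['preferred']}")
        parts.append(f"Bad answer (avoid this style): {ex['non_preferred']}")
    parts.append(f"Question: {new_question}")
    parts.append("Good answer:")
    return "\n".join(parts)

prompt = build_contrastive_prompt(preference_examples, "How do I update my billing address?")
print(prompt)
```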
28. Contrastive Learning Model Training with Hierarchical Category Tree-Based Loss Optimization
HANGZHOU ALIBABA INTERNATIONAL INTERNET INDUSTRY CO LTD, 2025
Training a contrastive learning model for query classification in a hierarchical category tree. The method involves optimizing the loss function using semantic relationships between categories. This is based on the relative positions of categories in the tree. The optimized loss function is used to train the contrastive learning model. It allows the model to predict categories accurately by leveraging the hierarchical category tree structure. By optimizing the loss function based on semantic relationships, the model learns to distinguish differences between categories based on their positions in the tree. This improves query classification accuracy compared to traditional flat category methods.
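A hedged sketch of a tree-aware contrastive loss for query embeddings: the separation demanded between two queries grows with the distance between their categories in the hierarchy. The margin schedule and toy tree are assumed design choices, not the patented loss.

```python
import torch
import torch.nn.functional as F

# Toy category tree: child -> parent
parent = {"boots": "shoes", "sneakers": "shoes", "shoes": "apparel",
          "phones": "electronics", "apparel": "root", "electronics": "root"}

def path_to_root(cat):
    path = [cat]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path

def tree_distance(a, b):
    pa, pb = path_to_root(a), path_to_root(b)
    common = set(pa) & set(pb)
    return min(pa.index(c) + pb.index(c) for c in common)   # hops via the closest shared ancestor

def tree_aware_loss(z, cats, base_margin=0.2):
    z = F.normalize(z, dim=-1)
    loss, n_pairs = torch.tensor(0.0), 0
    for i in range(len(cats)):
        for j in range(i + 1, len(cats)):
            d = tree_distance(cats[i], cats[j])
            sim = (z[i] * z[j]).sum()
            if d == 0:                                       # same category: pull together
                loss = loss + (1.0 - sim)
            else:                                            # push apart more for distant categories
                loss = loss + F.relu(sim - (1.0 - base_margin * d))
            n_pairs += 1
    return loss / max(n_pairs, 1)

z = torch.randn(4, 64, requires_grad=True)                   # query embeddings
cats = ["boots", "sneakers", "phones", "boots"]              # their leaf categories
loss = tree_aware_loss(z, cats)
loss.backward()
```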
29. Learning-to-Optimize with PAC-Bayesian Guarantees: Theoretical Considerations and Practical Implementation
Michael Sucker, Jalal Fadili, Peter Ochs, 2024
We use the PAC-Bayesian theory for the setting of learning-to-optimize. To the best of our knowledge, we present the first framework to learn optimization algorithms with provable generalization guarantees (PAC-Bayesian bounds) and explicit trade-off between convergence guarantees and convergence speed, which contrasts with the typical worst-case analysis. Our learned optimization algorithms provably outperform related ones derived from a (deterministic) worst-case analysis. The results rely on PAC-Bayesian bounds for general, possibly unbounded loss-functions based on exponential families. Then, we reformulate the learning procedure into a one-dimensional minimization problem and study the possibility to find a global minimum. Furthermore, we provide a concrete algorithmic realization of the framework and new methodologies for learning-to-optimize, and we conduct four practically relevant experiments to support our theory. With this, we showcase that the provided learning framework yields optimization algorithms that provably outperform the state-of-the-art by orders of magnitude.
30. Which Samples Should Be Learned First: Easy or Hard?
Xiaoling Zhou, Ou Wu - Institute of Electrical and Electronics Engineers (IEEE), 2023
Treating each training sample unequally is prevalent in many machine-learning tasks. Numerous weighting schemes have been proposed. Some schemes take the easy-first mode, whereas others take the hard-first one. Naturally, an interesting yet realistic question is raised. Given a new learning task, which samples should be learned first, easy or hard? To answer this question, both theoretical analysis and experimental verification are conducted. First, a general objective function is proposed and the optimal weight can be derived from it, which reveals the relationship between the difficulty distribution of the training set and the priority mode. Two novel findings are subsequently obtained: besides the easy-first and hard-first modes, there are two other typical modes, namely, medium-first and two-ends-first; the priority mode may be varied if the difficulty distribution of the training set changes greatly. Second, inspired by the findings, a flexible weighting scheme (FlexW) is proposed for selecting the optimal priority mode when there is no prior knowledge or theoretical clues. ...
31. Knowledge Distillation via Route Constrained Optimization
Jin Xiao, Baoyun Peng, Yichao Wu - IEEE, 2019
Distillation-based learning boosts the performance of the miniaturized neural network based on the hypothesis that the representation of a teacher model can be used as structured and relatively weak supervision, and thus would be easily learned by a miniaturized model. However, we find that the representation of a converged heavy model is still a strong constraint for training a small student model, which leads to a higher lower bound of congruence loss. In this work, we consider the knowledge distillation from the perspective of curriculum learning by teacher's routing. Instead of supervising the student model with a converged teacher model, we supervised it with some anchor points selected from the route in parameter space that the teacher model passed by, as we called route constrained optimization (RCO). We experimentally demonstrate this simple operation greatly reduces the lower bound of congruence loss for knowledge distillation, hint and mimicking learning. On close-set classification tasks like CIFAR and ImageNet, RCO improves knowledge distillation by 2.14% and 1.5%, respectively...
