Contrastive learning has emerged as a powerful approach for self-supervised representation learning, achieving classification accuracies within 1-2% of supervised benchmarks on ImageNet. These methods learn by comparing positive pairs of augmented samples against negative examples, creating embedding spaces where semantically similar items cluster together while dissimilar ones are pushed apart.

The fundamental challenge lies in designing contrastive objectives and sampling strategies that capture meaningful invariances while avoiding representational collapse.

This page brings together solutions from recent research—including momentum encoders, memory banks, hard negative mining techniques, and multi-view consistency approaches. These and other methods demonstrate how contrastive learning can be effectively implemented across computer vision, natural language processing, and multi-modal applications.
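As background for the entries below, here is a minimal sketch of the NT-Xent (normalized temperature-scaled cross-entropy) objective that many of these methods build on. The batch size, temperature, and random "embeddings" are illustrative assumptions, not taken from any specific work listed here.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent contrastive loss over two augmented views of the same batch.

    z1, z2: (N, D) embeddings of the two views; row i of z1 and z2 form a
    positive pair, and every other row in the batch serves as a negative.
    """
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, D), unit norm
    sim = z @ z.t() / temperature                        # pairwise cosine similarities
    sim.fill_diagonal_(float("-inf"))                    # exclude self-similarity
    # The positive for sample i is its counterpart in the other view.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

# Toy usage: random vectors standing in for encoder outputs of two views.
if __name__ == "__main__":
    z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
    print(nt_xent_loss(z1, z2).item())
```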

1. Few-Shot Learning for Medical Image Segmentation: A Review and Comparative Study

Theekshana Dissanayake, Yasmeen George, Dwarikanath Mahapatra - Association for Computing Machinery, 2025

Medical image segmentation plays a crucial role in assisting clinicians with diagnosing critical medical conditions. In deep learning, few-shot learning methods aim to replicate the human ability to learn from only a few examples when predicting novel classes. Researchers in the medical imaging community have also explored few-shot segmentation, meta-learning, foundation models, and self-supervised learning (SSL). Acknowledging this growing interest, we review the literature from 2020 to early 2025, focusing on architectural modifications, loss-inspired strategies, and meta-learning frameworks. We further divide each category into fine-grained, learning-oriented solutions, including contrastive regularization, providing in-depth discussion of improvements in representation strategies. Additionally, we present preliminary results for several methods across both medical and natural computer vision domains, evaluating their strengths and limitations in applications. Finally, based on the trends observed, advancements in the natural image domain, and our empirical findings, we outline future research directions, with specific insights into data-efficient rapid adaptation and generalization. The code is publicly available.

2. A Self-Supervised Specific Emitter Identification Method Based on Contrastive Asymmetric Masked Learning

Dong Wang, Yonghui Huang, Tianshu Cui - Multidisciplinary Digital Publishing Institute, 2025

Specific emitter identification (SEI) is a core technology for wireless device security that plays a crucial role in protecting communication systems from various threats. However, current deep learning-based SEI methods rely heavily on large amounts of labeled data for supervised training and face challenges in non-cooperative scenarios. To address these issues, this paper proposes a novel contrastive asymmetric masked learning SEI (CAML-SEI) method, effectively solving the problem of SEI under scarce labeled samples. The proposed method constructs an asymmetric auto-encoder architecture comprising an encoder network based on channel squeeze-and-excitation residual blocks, which captures the radio frequency fingerprint (RFF) features embedded in signals, while employing a lightweight single-layer convolutional decoder for signal reconstruction. This design promotes the learning of fine-grained local feature representations. To further enhance feature discriminability, a learnable non-linear mapping is introduced to compress the high-dimensional encoded features into a compact low-dimensional space, accompanied by a loss function that simultaneously achieves aggregation of positive samples and separation of n…
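A minimal sketch of the asymmetric masked auto-encoder with a contrastive projection head described in entry 2. The block structure follows the summary (SE residual encoder blocks, single-layer convolutional decoder, learnable non-linear projection); all dimensions, the masking scheme, and the loss weighting are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SEResBlock1d(nn.Module):
    """Residual 1-D conv block with channel squeeze-and-excitation (illustrative sizes)."""
    def __init__(self, ch, reduction=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(ch, ch, 3, padding=1), nn.BatchNorm1d(ch), nn.ReLU(),
            nn.Conv1d(ch, ch, 3, padding=1), nn.BatchNorm1d(ch),
        )
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool1d(1), nn.Conv1d(ch, ch // reduction, 1), nn.ReLU(),
            nn.Conv1d(ch // reduction, ch, 1), nn.Sigmoid(),
        )
    def forward(self, x):
        h = self.conv(x)
        return F.relu(x + h * self.se(h))     # channel-wise reweighting then residual add

class CAMLSketch(nn.Module):
    """Encoder (SE residual blocks) + single-layer conv decoder + projection head."""
    def __init__(self, in_ch=2, ch=32, proj_dim=64):
        super().__init__()
        self.stem = nn.Conv1d(in_ch, ch, 7, padding=3)
        self.encoder = nn.Sequential(SEResBlock1d(ch), SEResBlock1d(ch))
        self.decoder = nn.Conv1d(ch, in_ch, 3, padding=1)            # lightweight, single layer
        self.projector = nn.Sequential(nn.Linear(ch, ch), nn.ReLU(), nn.Linear(ch, proj_dim))

    def forward(self, x, mask_ratio=0.3):
        # Randomly mask a fraction of time steps before encoding (assumed masking scheme).
        mask = (torch.rand(x.size(0), 1, x.size(2), device=x.device) > mask_ratio).float()
        feats = self.encoder(self.stem(x * mask))
        recon = self.decoder(feats)                                   # reconstruct the full signal
        z = F.normalize(self.projector(feats.mean(dim=2)), dim=1)     # compact contrastive embedding
        return recon, z

# Toy usage on random I/Q signals; the nt_xent_loss sketched above could supply
# the contrastive term over two masked views, alongside the reconstruction loss.
x = torch.randn(4, 2, 1024)
model = CAMLSketch()
recon, z = model(x)
loss = F.mse_loss(recon, x)   # + contrastive term
```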

3. Machine Learning Model Training for MRI Reconstruction Using Contrastive Learning with Anchor and Negative Example Integration

SHANGHAI UNITED IMAGING INTELLIGENCE CO LTD, 2025

Training machine learning models for accelerated MRI reconstruction using contrastive learning. The technique trains the ML model with a contrastive learning approach: reconstructed MRI datasets are generated from under-sampled MRI datasets, and the ML model parameters are adjusted based on anchor and negative examples. The anchor example replaces part of the reconstructed MRI data with the under-sampled data, while the negative example replaces part of the reconstructed MRI data with different values. This contrastive supervision helps the ML model learn more effectively and converge faster, since it receives supervision signals from both the original and the replaced locations.
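A hedged sketch of the anchor/negative construction described in entry 3. Which locations are replaced and what "different values" means are not specified in the summary, so the random mask and noise replacement below are assumptions.

```python
import torch

def make_anchor_and_negative(recon, undersampled, replace_frac=0.25, seed=0):
    """Build an anchor and a negative from a reconstructed MRI tensor.

    recon:        reconstructed MRI data (image or k-space tensor)
    undersampled: the corresponding under-sampled input data, same shape
    A random subset of locations is replaced with the under-sampled values
    (anchor) or with unrelated values, here noise (negative).
    """
    g = torch.Generator().manual_seed(seed)
    replace = torch.rand(recon.shape, generator=g) < replace_frac
    anchor = torch.where(replace, undersampled, recon)
    negative = torch.where(replace, torch.randn(recon.shape, generator=g), recon)
    return anchor, negative

recon = torch.randn(1, 256, 256)          # stand-in for a reconstructed slice
undersampled = torch.randn(1, 256, 256)   # stand-in for the under-sampled input
anchor, negative = make_anchor_and_negative(recon, undersampled)
# A contrastive term can then pull the reconstruction toward the anchor and
# push it away from the negative at the replaced locations.
```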

4. A multimodal visual–language foundation model for computational ophthalmology

Danli Shi, Weiyi Zhang, J. Yang - Nature Portfolio, 2025

Early detection of eye diseases is vital for preventing vision loss. Existing ophthalmic artificial intelligence models focus on single modalities, overlooking multi-view information and struggling with rare diseases due to long-tail distributions. We propose EyeCLIP, a multimodal visual-language foundation model trained on 2.77 million ophthalmology images spanning 11 modalities, with partial clinical text. Our novel pretraining strategy combines self-supervised reconstruction, image contrastive learning, and image-text contrastive learning to capture shared representations across modalities. EyeCLIP demonstrates robust performance on 14 benchmark datasets, excelling in disease classification, visual question answering, and cross-modal retrieval. It also exhibits strong few-shot and zero-shot capabilities, enabling accurate predictions in real-world scenarios. EyeCLIP offers significant potential for detecting both ocular and systemic diseases, bridging gaps toward real-world applications.

5. Data-Driven Simulation System for Autonomous Vehicle Scenarios with Controllable Guidance Signals and Contrastive Loss-Based Representation Learning

NVIDIA CORP, 2025

Generating realistic driving scenarios for autonomous vehicles using data-driven simulation. The simulation leverages real-world traffic data to accurately generate agent behavior. To enable controllability of the simulation, guidance signals are provided to the scenario generator. These signals can be learned using self-supervised techniques such as a contrastive loss, which produces scenario representations without labeled data. This allows generating customized and controllable scenarios for autonomous vehicle training.

6. A network intrusion detection method based on contrastive learning and Bayesian Gaussian Mixture Model

Lei Liu, Ming Xu - Springer Nature, 2025

Abstract: Network Intrusion Detection Systems (NIDS) are essential for safeguarding networks against malicious activities. However, existing machine learning-based NIDS often require complex feature engineering, which demands significant domain expertise and experimentation, leading to suboptimal model performance in network environments. In contrast, deep learning approaches, while powerful, struggle with imbalanced data, resulting in a bias towards normal traffic and reduced effectiveness in detecting rare attacks. To address these issues, we propose a method that combines contrastive learning with a Bayesian Gaussian Mixture Model (BGMM). Specifically, a novel contrastive loss enables the model to automatically learn similarity within, and distinction between, classes of traffic, thereby generating robust and distinguishable representations. This approach not only eliminates the need for manual feature engineering but also helps alleviate the issue of weak representations. The BGMM further enhances detection by adapting to both normal and attack patterns through the use of multiple mixture components. The proposed method is validated through extensive experiments on two widely used modern intrusion detection datasets. On the UNSW-NB15 datas…
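A minimal sketch of the detection stage in entry 6, assuming a contrastive encoder has already mapped network flows to embedding vectors; the random arrays below only stand in for such embeddings. The component count, scoring rule, and alert threshold are assumptions.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
train_embeddings = rng.normal(size=(2000, 32))   # embeddings of training traffic
test_embeddings = rng.normal(size=(200, 32))     # embeddings of unseen traffic

# Fit a Bayesian GMM on the learned representations; multiple components let it
# adapt to several traffic patterns at once.
bgmm = BayesianGaussianMixture(n_components=8, covariance_type="full", random_state=0)
bgmm.fit(train_embeddings)

# Flows whose log-likelihood under the fitted mixture is unusually low are
# flagged as potential intrusions (1st-percentile threshold chosen arbitrarily).
threshold = np.percentile(bgmm.score_samples(train_embeddings), 1)
alerts = bgmm.score_samples(test_embeddings) < threshold
print(f"{alerts.sum()} of {len(alerts)} flows flagged")
```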

7. Vulnerability Detection Method with Dual-View Causal Reasoning and Contrastive Learning

YANGZHOU UNIVERSITY, 2025

Explainable vulnerability detection using dual-view causal reasoning for accurate, robust, and concise explanations of software security issues. The method involves a two-step process: (1) using contrastive learning to train a vulnerability detection model, and (2) generating explanations using dual-view causal reasoning. The explanations provide a minimal subset of the code that, when removed, changes the model's prediction. This allows concise, explainable vulnerability detection with robustness against perturbations. The contrastive learning uses self-supervised and supervised contrastive losses to train the model.

[Patent drawing: US2025190574A1]

8. Graph Neural Network-Based Feature Extraction Method with Contrastive Learning for Search Query and Media Node Matching

TENCENT TECHNOLOGY COMPANY LTD, 2025

Method for improving search accuracy in applications like media search by leveraging graph neural networks and contrastive learning. The method involves training a graph neural network to extract features from nodes in a media search graph containing queries, media, and associations. The training uses pairs of nodes that are connected versus randomly combined to learn distinguishing features. This trained network is then used to extract features from search queries and media. During search, these features are compared to find matching nodes for personalized results. The method also involves using a two-tower search model with separate branches for query and media features to further improve accuracy.

[Patent drawing: US2025190433A1]
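A hedged sketch of the pair construction behind entry 8: connected query-media node pairs act as positives, randomly combined pairs as negatives, scored by a two-tower model. The GNN feature extractor is abstracted away; `query_feats` and `media_feats` stand in for the node features it would produce, and all dimensions, the margin, and the negative count are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTower(nn.Module):
    """Separate branches for query and media features, as in the summary."""
    def __init__(self, dim=64, out=32):
        super().__init__()
        self.query_tower = nn.Sequential(nn.Linear(dim, out), nn.ReLU(), nn.Linear(out, out))
        self.media_tower = nn.Sequential(nn.Linear(dim, out), nn.ReLU(), nn.Linear(out, out))
    def forward(self, q, m):
        return F.normalize(self.query_tower(q), dim=1), F.normalize(self.media_tower(m), dim=1)

def pair_loss(model, query_feats, media_feats, edges, num_neg=4, margin=0.5):
    q_idx, m_idx = edges[:, 0], edges[:, 1]
    q, m = model(query_feats[q_idx], media_feats[m_idx])
    pos = (q * m).sum(dim=1)                                  # connected pairs: pull together
    rand_m = media_feats[torch.randint(len(media_feats), (len(edges) * num_neg,))]
    q_rep, m_neg = model(query_feats[q_idx].repeat_interleave(num_neg, 0), rand_m)
    neg = (q_rep * m_neg).sum(dim=1)                          # random pairs: push apart
    return F.relu(margin - pos).mean() + F.relu(neg).mean()

query_feats, media_feats = torch.randn(100, 64), torch.randn(500, 64)
edges = torch.stack([torch.randint(100, (300,)), torch.randint(500, (300,))], dim=1)
loss = pair_loss(TwoTower(), query_feats, media_feats, edges)
```

At search time, the same towers embed an incoming query and candidate media, and nearest neighbours in the shared space give the matches.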

9. Unsupervised Vision Mamba with Contrastive Regularization Network for Image Dehazing

Bin Hu, Jincheng Li, Sai Yang - Research Square, 2025

Abstract: Benefiting from the powerful nonlinear fitting ability of neural networks, deep learning-based methods have gradually emerged as the dominant solutions for single image dehazing. However, supervised learning-based methods require paired samples for training. To address this, an unsupervised Vision Mamba with Contrastive Regularization network (VMCR) is proposed. The designed network is based on the DisentGAN framework, and its main module is built on Vision Mamba, which performs very competitively compared to transformers while maintaining linear time complexity and constant memory with respect to input size. Furthermore, a contrastive regularization learning method is proposed to enhance reconstruction capabilities and achieve superior dehazing results. Our VMCR-Net outperforms state-of-the-art methods, as evidenced by experimental results on several benchmarks. This research proposes an enhanced unsupervised dehazing approach, overcoming limitations of existing methods and achieving strong performance.

10. Molecular Embedding Model Training via Contrastive Learning with Scaffold-Based Similarity Constraints

MICROSOFT TECHNOLOGY LICENSING LLC, 2025

Training molecular embedding models using contrastive learning with scaffold similarity for improved molecular similarity analysis. The method involves generating a training dataset by separating molecules into similar (sharing scaffold) and dissimilar (different scaffolds) pairs. This defines similarity based on scaffolds. The model learns to map similar molecules close together and dissimilar molecules far apart in embedding space. This improves molecular embedding quality for tasks like drug discovery and property prediction.

[Patent drawing: US12327616B2]
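A sketch of the pair-generation step in entry 10, assuming Bemis-Murcko scaffolds (via RDKit) as the similarity criterion; the patent summary only says "scaffold", so the exact scaffold definition and the toy molecules are assumptions.

```python
from itertools import combinations
from rdkit.Chem.Scaffolds.MurckoScaffold import MurckoScaffoldSmiles

def scaffold_pairs(smiles_list):
    """Split all molecule pairs into similar (shared scaffold) and dissimilar."""
    scaffolds = {s: MurckoScaffoldSmiles(smiles=s) for s in smiles_list}
    positives, negatives = [], []
    for a, b in combinations(smiles_list, 2):
        (positives if scaffolds[a] == scaffolds[b] else negatives).append((a, b))
    return positives, negatives

mols = ["c1ccccc1CCN", "c1ccccc1CCO", "C1CCCCC1O"]   # toy molecules
pos, neg = scaffold_pairs(mols)
# pos pairs (shared benzene scaffold) are pulled together in embedding space;
# neg pairs (different scaffolds) are pushed apart by the contrastive loss.
```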

11. Neural Network Architecture with Modality-Specific Partitioned Attention for Multimodal Contrastive Learning

DEEPMIND TECHNOLOGIES LTD, 2025

Processing inputs using neural networks that maintain modality-specific representations while enabling multimodal contrastive learning. The approach involves partitioning the input data into disjoint modalities and training separate attention mechanisms for each partition. At inference, the system combines these partition-specific attention mechanisms with fused attention mechanisms to generate outputs that can be processed independently by each modality. This enables the use of contrastive learning techniques while preserving modality-specific representations.

12. Contrastive Attention-Supervised Tuning with Saliency-Guided Geometric Transform for Visual Grounding in Self-Supervised Learning

SALESFORCE INC, 2025

Self-supervised learning technique called Contrastive Attention-Supervised Tuning (CAST) that improves self-supervised learning for computer vision tasks by fixing the visual grounding ability of contrastive SSL methods. The CAST training method uses unsupervised saliency maps to provide explicit grounding supervision to encourage models to focus on specific objects when making decisions. It also introduces a geometric transform for randomly cropping views based on constraints derived from saliency maps to fix the visual grounding issue of randomly sampled crops in complex scenes.

[Patent drawing: US12321418B2]
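A hedged sketch of the saliency-constrained cropping idea in entry 12: keep resampling a random crop until it covers enough of the salient region. The overlap criterion, thresholds, and fallback are assumptions; the summary only states that crop sampling is constrained by the saliency map.

```python
import numpy as np

def saliency_constrained_crop(image, saliency, size=96, min_overlap=0.3, tries=50, rng=None):
    """Return a crop of `image` whose window covers at least `min_overlap` of the saliency mass."""
    rng = rng or np.random.default_rng()
    h, w = saliency.shape
    salient_total = saliency.sum() + 1e-8
    for _ in range(tries):
        top = rng.integers(0, h - size + 1)
        left = rng.integers(0, w - size + 1)
        covered = saliency[top:top + size, left:left + size].sum() / salient_total
        if covered >= min_overlap:
            return image[top:top + size, left:left + size]
    # Fallback: centre the crop on the saliency peak if random sampling fails.
    cy, cx = np.unravel_index(saliency.argmax(), saliency.shape)
    top = np.clip(cy - size // 2, 0, h - size)
    left = np.clip(cx - size // 2, 0, w - size)
    return image[top:top + size, left:left + size]

image = np.random.rand(224, 224, 3)
saliency = np.zeros((224, 224)); saliency[80:160, 80:160] = 1.0   # toy saliency map
view = saliency_constrained_crop(image, saliency)
```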

13. Self-supervised contrastive learning with time-frequency consistency for few-shot bearing fault diagnosis

Xiaoyun Gong, Y. Wei, Wenliao Du - IOP Publishing, 2025

Abstract: Deep learning technology has made significant progress in fault diagnosis. However, in real-world industrial settings, most existing methods require substantial labeled data for training, while harsh operating conditions and collection constraints often result in scarce samples. This limitation significantly impairs their diagnostic performance in practical applications. To address this challenge, we propose a few-shot diagnosis approach based on a time-frequency contrastive learning (TF-CL) framework. The TF-CL framework adopts a pre-training and downstream-task pipeline, enabling the model to automatically learn to extract multi-perspective features from unlabeled data under self-supervised conditions. During pre-training, dedicated encoders separately extract time-domain and frequency-domain feature representations from abundant unlabeled data, which are then projected into a shared space using a projector. To ensure the two representations remain consistent, the paper introduces a consistency loss function constructed from novel positive and negative sample pairs. In the downstream task, the encoder is combined with a multilayer perceptron classifier and fine-tuned end-to-end with limited labeled data. Gradient updates…
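A hedged sketch of the time-frequency consistency idea in entry 13: one encoder sees the raw time-domain signal, another sees its frequency-domain magnitude spectrum, and a contrastive consistency loss treats the two views of the same sample as a positive pair. Architectures, dimensions, and the temperature are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder1d(nn.Module):
    """Small 1-D conv encoder usable for either the time or frequency view."""
    def __init__(self, out_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 16, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv1d(16, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(32, out_dim),
        )
    def forward(self, x):
        return self.net(x)

def tf_consistency_loss(z_time, z_freq, temperature=0.1):
    z_t, z_f = F.normalize(z_time, dim=1), F.normalize(z_freq, dim=1)
    logits = z_t @ z_f.t() / temperature          # cross-domain similarities
    targets = torch.arange(len(z_t))              # sample i in both domains = positive pair
    return F.cross_entropy(logits, targets)

signal = torch.randn(16, 1, 2048)                 # batch of raw vibration segments
spectrum = torch.fft.rfft(signal, dim=-1).abs()   # frequency-domain view of the same batch
time_enc, freq_enc = Encoder1d(), Encoder1d()
loss = tf_consistency_loss(time_enc(signal), freq_enc(spectrum))
```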

14. Self-Supervised Learning for Domain Adaptation in Medical Imaging

Murali Krishna Pasupuleti, 2025

Abstract: Self-supervised learning (SSL) offers a transformative path for addressing domain adaptation in medical imaging, where annotated datasets are often limited and expensive to acquire. This paper explores how various SSL approaches, including contrastive learning (SimCLR), masked image modeling (MAE), and transformer-based self-distillation (DINO), improve segmentation and classification performance across heterogeneous imaging domains (MRI, X-ray, CT). Using datasets such as BraTS, CheXpert, and NIH ChestX-ray14, we evaluate pretraining followed by fine-tuning with minimal supervision. We demonstrate statistically significant improvements (6-15%) in Dice scores and AUC. Regression analysis shows a strong correlation between representation similarity (CKA) and downstream task performance. Explainability tools SHAP and LIME are used to validate model reliability and transparency. Keywords: Self-Supervised Learning, Domain Adaptation, Medical Imaging, Contrastive Learning, SimCLR, DINO, Swin UNet, SHAP, LIME, Transfer Learning

15. Neural Network Training via Bilevel Spectral Inference with Covariance-Based Gradient Estimation

DEEPMIND TECHNOLOGIES LTD, 2025

Training neural networks to generate high quality feature representations by optimizing a spectral inference objective using a bilevel optimization technique. This involves maintaining moving averages of covariance measures and the Jacobian of the covariance during training. It also involves computing kernel-weighted mini-batch covariance estimates and using them to generate gradient estimates for updating the network parameters.

[Patent drawing: US12307376B2]

16. RPF-MAD: A Robust Pre-Training–Fine-Tuning Algorithm for Meta-Adversarial Defense on the Traffic Sign Classification System of Autonomous Driving

Xiaoxu Peng, Dong Zhou, Zhang Jianwen - Multidisciplinary Digital Publishing Institute, 2025

Traffic sign classification (TSC) based on deep neural networks (DNNs) plays a crucial role in the perception subsystem of autonomous driving systems (ADSs). However, studies reveal that TSC systems can make dangerous and potentially fatal errors under adversarial attacks. Existing defense strategies, such as adversarial training (AT), have demonstrated effectiveness but struggle to generalize across diverse attack scenarios. Recent advancements in self-supervised learning (SSL), particularly adversarial contrastive learning (ACL) methods, show strong potential for enhancing robustness and generalization compared to AT. However, conventional ACL methods lack mechanisms to ensure effective transferability of robustness across the different training stages. To address this, we propose a robust pre-training and fine-tuning algorithm for meta-adversarial defense (RPF-MAD), designed to enhance robustness sustainability throughout the pipeline. Dual-track meta-adversarial pre-training (Dual-MAP) integrates meta-learning with ACL, which improves the ability of the upstream model to generalize across attack conditions. Meanwhile, adaptive variance anchoring robust fine-tuning (AVA-RFT) utilizes prototype regularization to stabilize feature representations and reinforce generalizable capabili…

17. System for Encoding and Aligning Multimodal Sensor Data Using Neural Networks with Hierarchical Scenario Representation

PONY.AI INC, 2025

System for generating and organizing driving scenarios for autonomous vehicles to improve safety, efficiency, and reliability. The system uses neural networks to encode and decode multimodal sensor data like video, audio, and text prompts. It aligns sequences of sensor data with prompts using contrastive learning. This allows finding specific sensor sequences that match a given prompt. The system then generates a hierarchical structure representing the matching sensor sequence. By encoding, embedding, and aligning multimodal data, it creates a shared analytical space to discover cross-modal correlations and analyze latent dependencies. This improves understanding of navigation scenarios by capturing context and nuances like temporal evolution. The system also organizes scenarios for searchability and retrieval.

18. Automated Detection of Canine Babesia Parasite in Blood Smear Images Using Deep Learning and Contrastive Learning Techniques

Dilip Kumar Baruah, Kuntala Boruah, Nagendra Nath Barman - MDPI AG, 2025

This research introduces a novel method that integrates both unsupervised and supervised learning, leveraging SimCLR (Simple Framework for Contrastive Learning of Visual Representations) self-supervised learning along with different pre-trained models to improve microscopic image classification of the Babesia parasite in canines. We focused on three popular CNN architectures, namely ResNet, EfficientNet, and DenseNet, and evaluated the impact of pre-training on their performance. A detailed comparison of DenseNet variants in terms of accuracy and training efficiency is presented. Base models such as DenseNet were utilized within the SimCLR framework. The models were first pre-trained on unlabeled images, followed by training classifiers on labeled datasets. This approach significantly improved robustness and accuracy, demonstrating the potential benefits of combining contrastive learning with conventional techniques. The highest accuracy of 97.07% was achieved with EfficientNet_b2. Thus, the detection of Babesia or other hemoparasites in blood smear images could be automated with high accuracy without using a large labelled dataset.

19. MoHGCN: Momentum Hypergraph Convolution Network for Cross-modal Retrieval

Ying Li, Yuxiang Ding - Association for Computing Machinery, 2025

Cross-modal retrieval tasks, encompassing image-text, video-audio, and more, are progressively gaining significance in response to the exponential growth of information on the Internet. However, there has always been a cloud hanging over multimodal tasks due to the inherent challenges of aligning different modalities with distinct physical meanings. Most previous works simply rely on a single encoder or a novel similarity calculation for fusion, which often results in unsatisfactory performance. To tackle this challenge, we introduce the Momentum Hypergraph Convolutional Network (MoHGCN) for representation learning, which strengthens the alignment of both visual and textual data before the fusion process. Specifically, MoHGCN utilizes contrastive learning to select the most challenging negative and positive samples to form hyperedges, and completes modality alignment through two rounds of fusion. Subsequently, the fully integrated node features and global features are fused to obtain the final vector for image-text retrieval. Extensive experiments conducted on two widely used datasets, namely Flickr30K and MSCOCO, demonstrate the superiority of the proposed approach in achieving state-of-the-art performance.
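A hedged sketch of two ingredients named in entry 19: a momentum (EMA) update of a key encoder, and selection of the hardest negatives from cross-modal similarity scores, which MoHGCN uses to form hyperedges. The hypergraph convolution itself is omitted; k, the momentum value, and the tiny encoders are assumptions.

```python
import torch

@torch.no_grad()
def momentum_update(online_encoder, momentum_encoder, m=0.999):
    """Exponential moving average of the online encoder's parameters."""
    for p_o, p_m in zip(online_encoder.parameters(), momentum_encoder.parameters()):
        p_m.mul_(m).add_(p_o, alpha=1 - m)

def select_hard_samples(img_emb, txt_emb, k=3):
    """For each image, keep its matching text as the positive and return the
    indices of its hardest negatives (the most similar non-matching texts)."""
    sim = torch.nn.functional.normalize(img_emb, dim=1) @ \
          torch.nn.functional.normalize(txt_emb, dim=1).t()
    sim.fill_diagonal_(float("-inf"))              # matching pairs sit on the diagonal
    hard_negatives = sim.topk(k, dim=1).indices    # most confusable texts per image
    positives = torch.arange(len(img_emb))
    return positives, hard_negatives

img_emb, txt_emb = torch.randn(8, 256), torch.randn(8, 256)
pos, hard_neg = select_hard_samples(img_emb, txt_emb)
# Each image node, its positive text, and its hard negatives could then be
# grouped into one hyperedge for the hypergraph convolution stage.

online, key = torch.nn.Linear(4, 4), torch.nn.Linear(4, 4)
key.load_state_dict(online.state_dict())
momentum_update(online, key)   # EMA step after each training iteration
```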

20. Quality controlling in capsule gastroduodenoscopy with less annotation via self-supervised learning

Yaqiong Zhang, Kai Zhang, Meijia Wang - Research Square, 2025

Abstract: Background: It is possible to control the quality of capsule endoscopic images using artificial intelligence (AI), but this requires a great deal of time for labeling. Methods: SimCLR (a simple framework for contrastive learning of visual representations) is capable of acquiring inherent image representations with minimal annotation, but its feasibility for this task had not been studied. 62,850 images were collected to train the models, under both internal cross-validation (more training data and less testing data) and reversed cross-validation (less training data and more testing data). Random forest and XGBoost (eXtreme Gradient Boosting) were used to finish quality controlling after extracting features from the images. Results: The mean AUROC (Area Under the Receiver Operating Characteristic curve) exceeded 0.98 and 0.97 in the two settings, and the approach surpassed a supervised CNN (Convolutional Neural Network). On an extra 18,636 pictures that were gathered, the AUROC was 0.93 (95% CI 0.9271-0.9548), close to that of the CNN (0.9645) in cross-validation, and it surpassed 0.96 in the other setting, better than the CNN (0.8374). Conclusions: Through SimCLR, the quality-control task can be completed with similar performance using fewer annotations.
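A minimal sketch of the downstream stage in entry 20: features extracted by a SimCLR-pretrained encoder are fed to a classical classifier (random forest here; XGBoost would be analogous). The random arrays below only stand in for real encoder features and image-quality labels, and the hyperparameters are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 128))      # frozen-encoder features per image
labels = rng.integers(0, 2, size=1000)       # 1 = adequate quality, 0 = not

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
auroc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print(f"AUROC on held-out images: {auroc:.3f}")
```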
