Self-supervised representation learning (SSL) has emerged as a fundamental paradigm in representation learning, enabling models to learn meaningful representations without requiring labeled data. Despite its success, SSL remains constrained by two core challenges: (i) lack of robustness against real-world distribution shifts and adversarial perturbations, and (ii) lack of domain-awareness, which limits its usability beyond natural scenes. These limitations arise from the generic invariance assumptions in SSL: representations are learned from predefined augmentations, and they fail to generalize when exposed to unseen environmental distortions, adversarial attacks, and domain-specific nuances. Existing SSL approaches, whether based on contrastive learning, knowledge distillation, or information maximization, do not explicitly account for these factors, making them vulnerable in real-world applications and suboptimal in specialized domains.
This thesis aims to enhance both robustness and domain-awareness in a modular, plug-and-play manner, so that the advancements apply across different joint embedding architecture and method (JEAM)-based SSL approaches and remain adaptable to future developments in SSL. To achieve this, the thesis follows one guiding principle: leverage invariant representations to improve robustness and domain-awareness without altering the fundamental SSL objectives. This principle ensures that the improvements can be seamlessly integrated into existing and future SSL approaches.
To systematically address these core challenges, this thesis begins with a foundational study of SSL approaches, identifying the common schema that underlies them. This unification provides a conceptual view of SSL methods, allowing us to isolate the domain-sensitive and domain-agnostic components across approaches. This conceptual outcome sets the stage for establishing precisely where improvements are needed to enhance robustness and domain-awareness across methods.
Next, the thesis conducts a large-scale empirical evaluation of existing SSL methods on relevant robustness benchmarks, uncovering their failures under distribution shifts caused by real-world environmental challenges. The evaluation reveals a significant decline in robustness that is consistent across different SSL approaches. It establishes the fundamental research gap and motivates the advancements introduced in this thesis.
The first advancement focuses on robustness against distribution shifts, particularly geometric distortions such as perspective distortion (PD), which are prevalent in real-world environments but not addressed by existing SSL methods. Since PD introduces nonlinear spatial transformations, standard affine augmentations fail to model these effects, leading to degraded representation stability. To address this, this thesis introduces Möbius-based mitigating perspective distortion (MPD) and log conformal maps (LCM), mathematically grounded transformations that enable robustness without requiring perspective-distorted training data or estimation of camera parameters. These methods are additionally adapted to multiple real-world computer vision applications, including crowd counting, object detection, person re-identification, and fisheye view recognition, showcasing their effectiveness. Further, because no dedicated benchmark for perspective distortion is available, the ImageNet-PD robustness benchmark is developed to fill this gap.
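As an illustration of the kind of transformation involved, a Möbius transformation of the complex plane has the form f(z) = (az + b) / (cz + d) with ad - bc ≠ 0. The following is a minimal sketch, not the MPD implementation from this thesis: the function names, the mapping of pixel coordinates to roughly [-1, 1], and the example coefficients are all assumptions made for illustration.

```python
def mobius(z, a, b, c, d):
    """Apply f(z) = (a*z + b) / (c*z + d) to a complex point z."""
    if a * d - b * c == 0:
        raise ValueError("degenerate transform: a*d - b*c must be non-zero")
    return (a * z + b) / (c * z + d)


def pixel_to_complex(x, y, width, height):
    """Map pixel (x, y) to the complex plane, scaled to [-1, 1] per axis.

    This normalisation is an assumption of the sketch, not taken from the thesis.
    """
    return complex(2 * x / (width - 1) - 1, 2 * y / (height - 1) - 1)


def complex_to_pixel(z, width, height):
    """Inverse of pixel_to_complex; returns float pixel coordinates."""
    return ((z.real + 1) * (width - 1) / 2, (z.imag + 1) * (height - 1) / 2)


def warp_point(x, y, width, height, a, b, c, d):
    """Warp one pixel coordinate with a Möbius transformation."""
    z = pixel_to_complex(x, y, width, height)
    return complex_to_pixel(mobius(z, a, b, c, d), width, height)


# Hypothetical coefficients: a non-zero c makes the map nonlinear,
# which an affine augmentation cannot reproduce.
warped = warp_point(10, 20, 101, 101, a=1, b=0, c=0.3j, d=1)
```

With c = 0 the map reduces to a similarity transform (rotation, scaling, translation), so the nonlinear behaviour comes entirely from the cz term in the denominator. Warping a full image would additionally require resampling pixel values at the transformed coordinates, which this sketch omits.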
Beyond environmental challenges, another critical real-world challenge is adversarial attacks. SSL methods are highly susceptible to such attacks because the learned representations are not constrained to be invariant to perturbations. Existing adversarial training approaches in SSL rely on brute-force attack strategies, which fail to adapt dynamically. To address this, this thesis introduces adversarial self-supervised training with adaptive-attacks (ASTrA), in which attack strategies evolve dynamically based on the model’s learning dynamics and establish a correspondence between attack parameters and training examples, so that adversarial perturbations are optimized in a learnable manner. Unlike conventional adversarial training, ASTrA ensures robustness while maintaining SSL’s efficiency and scalability.
While robustness, in this thesis, concerns real-world challenges in natural scenes, domain-awareness concerns specialized visual domains beyond natural scenes. Standard SSL augmentations are designed for variations in natural scenes, making them ill-suited for specialized fields such as medical imaging and industrial mining material inspection. This thesis introduces domain-awareness in SSL by incorporating domain-specific information into the view generation process. In particular, (i) magnification prior contrastive similarity (MPCS) makes learned representations invariant to magnification for histopathology images by inducing varying magnifications in the view generation process, improving breast cancer recognition; and (ii) depth contrast explicitly enforces modality alignment between material images and the height attained by the materials on the conveyor belt, so that the learned representations become aware of physical properties, thereby improving material classification.
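The idea of inducing varying magnifications in view generation can be sketched as follows. This is a hypothetical illustration rather than the MPCS procedure itself: the function name, the dictionary layout keyed by magnification factor, and the file names are assumptions. The sketch only shows how a positive pair could be formed from two different magnifications of the same sample.

```python
import random


def sample_magnification_pair(views_by_magnification, rng=random):
    """Pick two views of the same sample at two different magnifications.

    views_by_magnification maps a magnification factor to the view
    (for example an image path) of one sample at that magnification.
    """
    if len(views_by_magnification) < 2:
        raise ValueError("need at least two magnifications to form a pair")
    # Sampling without replacement guarantees the two magnifications differ.
    mag_a, mag_b = rng.sample(sorted(views_by_magnification), 2)
    return (
        (mag_a, views_by_magnification[mag_a]),
        (mag_b, views_by_magnification[mag_b]),
    )


# Hypothetical sample with views at three magnification factors.
sample_views = {40: "slide_040x.png", 100: "slide_100x.png", 200: "slide_200x.png"}
view_1, view_2 = sample_magnification_pair(sample_views, random.Random(0))
```

Treating the two returned views as a positive pair in a contrastive objective would encourage representations that agree across magnifications.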
Beyond robustness and domain-awareness, the ability of SSL to generalize from limited data matters for its practicality. While the loss objective in SSL is generally domain-agnostic, its effectiveness relies on large-scale data. In this direction, this thesis explores functional knowledge transfer (FKT), in which self-supervised and supervised learning objectives are jointly optimized, enabling SSL representations to adapt dynamically to supervised tasks. This approach enhances generalization in low-data regimes.
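One common way to write such a joint objective is as a weighted sum of the two losses over shared parameters. The form below is an assumed illustration, not necessarily the exact formulation used for FKT: the weighting factor \lambda and the symbol names are introduced here only for exposition.

```latex
\mathcal{L}_{\text{joint}}(\theta)
  = \mathcal{L}_{\text{SSL}}(\theta)
  + \lambda \, \mathcal{L}_{\text{sup}}(\theta),
  \qquad \lambda > 0
```

Here \theta denotes the shared encoder parameters, so each gradient step updates the representation with respect to both objectives at once instead of pretraining and fine-tuning in separate stages.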
In conclusion, this thesis provides a foundation for robust and domain-aware self-supervised representation learning in a modular manner, highlighting its applicability to existing and future JEAM-based SSL approaches, which can inherit these advancements and adapt to emerging challenges.