A Data-driven Typology of Vision Models from Integrated Representational Metrics
A biology-inspired framework that integrates multiple representational similarity metrics to create a principled typology of vision models.
Abstract
Large vision models differ widely in architecture and training paradigm, yet we lack principled methods to determine which aspects of their representations are shared across families and which reflect distinctive computational strategies. We leverage a suite of representational similarity metrics, each capturing a different facet—geometry, unit tuning, or linear decodability—and assess family separability using multiple complementary measures. Metrics preserving geometry or tuning (e.g., RSA, Soft Matching) yield strong family discrimination, whereas flexible mappings such as Linear Predictivity show weaker separation. These findings indicate that geometry and tuning carry family-specific signatures, while linearly decodable information is more broadly shared. To integrate these complementary facets, we adapt Similarity Network Fusion (SNF), a method inspired by multi-omics integration. SNF achieves substantially sharper family separation than any individual metric and produces robust composite signatures. Clustering of the fused similarity matrix recovers both expected and surprising patterns: supervised ResNets and ViTs form distinct clusters, yet all self-supervised models group together across architectural boundaries. Hybrid architectures (ConvNeXt, Swin) cluster with masked autoencoders, suggesting convergence between architectural modernization and reconstruction-based training. This biology-inspired framework provides a principled typology of vision models, showing that emergent computational strategies—shaped jointly by architecture and training objective—define representational structure beyond surface design categories.
Figure 1: Similarity Network Fusion integrates multiple representational metrics to create unified model signatures.
Motivation
Modern computer vision has produced a diverse landscape of models—CNNs, Vision Transformers, hybrid architectures—trained with supervised learning or self-supervised learning. This raises fundamental questions:
- What makes different model families unique?
- Which representational properties are universally shared?
- How do architecture and training jointly shape learned representations?
Understanding these questions is crucial for principled model selection, architecture design, and advancing our understanding of visual intelligence.
Traditional approaches categorize models by surface-level characteristics (architecture type, training method). We argue that a data-driven, representation-centric approach reveals deeper patterns about computational strategies.
Method
Overview
Our approach consists of four main steps:
- Extract representations from 35 vision models across diverse architectures and training paradigms (see the extraction sketch after this list)
- Compute similarity using multiple representational metrics capturing different facets
- Integrate metrics via Similarity Network Fusion to create unified signatures
- Cluster models to reveal a data-driven typology
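As a minimal sketch of the extraction step, assuming standard torchvision models (the choice of ResNet-50 and of the pooled layer is illustrative, not the paper's exact configuration):

```python
import torch
import torchvision.models as tvm
from torchvision.models.feature_extraction import create_feature_extractor

# Illustrative model and layer choice; the paper spans 35 models.
model = tvm.resnet50(weights=tvm.ResNet50_Weights.IMAGENET1K_V2).eval()
extractor = create_feature_extractor(model, return_nodes={"avgpool": "feat"})

images = torch.randn(8, 3, 224, 224)  # stand-in for preprocessed stimuli

with torch.no_grad():
    acts = extractor(images)["feat"].flatten(1)  # (stimuli, units) matrix
```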
Representational Similarity Metrics
- RSA (Representational Similarity Analysis): Compares representational dissimilarity matrices (sketched below, alongside linear CKA)
- Linear CKA (Centered Kernel Alignment): Invariant to orthogonal transformations and isotropic scaling
- Soft Matching: Aligns representations via optimal transport, preserving unit-level tuning properties
- Procrustes Distance: Measures dissimilarity after the best orthogonal alignment of representations
- Linear Predictivity: Measures linearly accessible shared information via unconstrained linear mapping
- PWCCA: Projection-weighted canonical correlation analysis
- SVCCA: Singular vector canonical correlation analysis
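To make two of these facets concrete, here is a minimal sketch of RSA and linear CKA on activation matrices of shape (stimuli x units). The function names are ours, and the paper's exact implementations may differ:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rsa(X, Y):
    """RSA: Spearman correlation between the two models' RDMs."""
    rdm_x = pdist(X, metric="correlation")  # condition-pair dissimilarities
    rdm_y = pdist(Y, metric="correlation")
    return spearmanr(rdm_x, rdm_y).correlation

def linear_cka(X, Y):
    """Linear CKA (Kornblith et al., 2019) between activation matrices."""
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    hsic = np.linalg.norm(X.T @ Y, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") *
                   np.linalg.norm(Y.T @ Y, "fro"))
```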
Similarity Network Fusion
SNF integrates multiple similarity networks through an iterative diffusion process:
- Construct K-NN graphs: For each metric, build a sparse affinity graph
- Message passing: Iteratively update each network using information from others
- Convergence: Networks mutually reinforce consistent structure while suppressing noise
- Output: Unified similarity matrix combining all metrics
This approach amplifies consensus across metrics, reduces metric-specific noise, and reveals robust patterns invisible to individual metrics. A compact sketch of the update rule follows.
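Below is a from-scratch sketch of the SNF update (Wang et al., 2014) applied to a list of per-metric model-by-model similarity matrices; the hyperparameters k and t are illustrative, not the paper's settings:

```python
import numpy as np

def snf_fuse(sims, k=5, t=20):
    """Fuse a list of (n, n) similarity matrices via SNF-style diffusion."""
    def normalize(W):
        # Full transition kernel: off-diagonal mass 1/2, self-weight 1/2.
        off = W - np.diag(np.diag(W))
        P = off / (2 * off.sum(axis=1, keepdims=True))
        np.fill_diagonal(P, 0.5)
        return P

    def knn_kernel(W, k):
        # Sparse local kernel: keep each row's k strongest neighbours.
        S = np.zeros_like(W)
        for i, row in enumerate(W):
            nn = np.argsort(row)[-k:]
            S[i, nn] = row[nn]
        return S / S.sum(axis=1, keepdims=True)

    P = [normalize(W) for W in sims]
    S = [knn_kernel(W, k) for W in sims]
    for _ in range(t):  # message passing: each network absorbs the others
        P = [S[v] @ (sum(P[u] for u in range(len(P)) if u != v)
                     / (len(P) - 1)) @ S[v].T
             for v in range(len(P))]
    return sum(P) / len(P)  # unified similarity matrix

```

Established implementations (e.g., the `snfpy` package) exist; the loop above simply makes the mutual-reinforcement step explicit.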
Results
Different Metrics Reveal Different Patterns
Our systematic evaluation reveals that representational facets vary dramatically in their ability to discriminate model families:
Figure 2: Model-family separability on ImageNet under d', silhouette score, and contrastive ratio. Columns correspond to nine similarity metrics, including two fusion-based methods (SNF, average) and seven commonly used representational metrics. Fusion-based metrics consistently yield higher scores, highlighting their effectiveness in capturing family-level distinctions.
Figure 3: Mean model-family separability across four datasets, evaluated using d', silhouette score, and contrastive ratio. SNF yields the most consistent and robust separation. Scores are shown in their native scales and are not directly comparable across measures.
Key Findings:
- Geometry & tuning preserve family signatures: RSA and Soft Matching strongly discriminate families
- Linearly decodable information is shared: Linear Predictivity and CCA-based metrics show weaker separation
- Mapping flexibility matters: Discrimination decreases as mappings become more flexible (Soft Matching > Procrustes > Linear Predictivity)
This reveals a fundamental pattern: geometric organization and unit tuning constitute family-specific signatures, while linearly accessible information is more universally shared.
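The separability measures themselves are straightforward to compute from a model-by-model similarity matrix. The sketch below gives one plausible reading of d' and the silhouette score, assuming `S` has unit diagonal and values in [0, 1]; the paper's exact definitions may differ:

```python
import numpy as np
from sklearn.metrics import silhouette_score

def family_separability(S, families):
    """d' and silhouette for similarity matrix S and family labels."""
    families = np.asarray(families)
    i, j = np.triu_indices_from(S, k=1)          # unique model pairs
    same = families[i] == families[j]
    within, between = S[i, j][same], S[i, j][~same]
    d_prime = (within.mean() - between.mean()) / np.sqrt(
        0.5 * (within.var() + between.var()))
    sil = silhouette_score(1 - S, families, metric="precomputed")
    return d_prime, sil
```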
SNF Achieves Superior Integration
SNF dramatically outperforms all individual metrics and simple averaging:
- High and balanced discrimination: strong separation across all family pairs
- Robust signatures: Consistent across multiple datasets (ImageNet, Ecoset, CIFAR)
The superior performance demonstrates that integrating complementary representational facets yields comprehensive signatures that most reliably distinguish model families.
A Data-Driven Model Typology
Hierarchical clustering of the SNF-fused similarity matrix reveals a principled typology:
Figure 4: Data-driven typology reveals expected and surprising patterns in model organization.
Expected Patterns:
- Supervised ResNets cluster together
- Supervised ViTs cluster together
- VGG models separate from ResNets
Surprising Discoveries:
- Training paradigm overrides architecture: all self-supervised models cluster together, regardless of whether they are CNNs or Transformers
- Self-supervised ResNets group more closely with self-supervised ViTs than with supervised ResNets
- Hybrid convergence: ConvNeXt (modernized CNN) and Swin (CNN-like ViT) cluster with MAE (masked autoencoder)
This suggests that computational strategies induced by training objectives can override architectural differences, defining a representational “species” that transcends surface-level design categories.
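For readers who want to reproduce this step, here is a minimal sketch of hierarchical clustering on a fused similarity matrix (average linkage and the cluster count are illustrative choices, not necessarily the paper's; `fused` is the output of `snf_fuse` from the SNF sketch above):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Convert the fused similarity matrix to a condensed distance vector.
D = 1 - fused
np.fill_diagonal(D, 0)
Z = linkage(squareform(D, checks=False), method="average")
labels = fcluster(Z, t=5, criterion="maxclust")  # e.g. five model "species"
```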
Cross-Layer Consistency
Figure 5: SNF shows highest consistency across network depths.
Our findings remain stable across different network depths (60%, 80%, 100%), with SNF showing the highest cross-layer consistency.
Practical Applications
Model Selection
- Move beyond architecture heuristics to representation-based model selection
- Match representational properties to downstream task requirements
Model Design
- Training paradigm may be more impactful than architectural details
- Hybrid architectural designs and reconstruction-based training converge on similar representations
Understanding Intelligence
- Models achieve similar solutions via different paths
- Training objectives may be more fundamental than architecture
BibTeX
@misc{wu2025datadriventypologyvisionmodels,
  title={A Data-driven Typology of Vision Models from Integrated Representational Metrics},
  author={Jialin Wu and Shreya Saha and Yiqing Bo and Meenakshi Khosla},
  year={2025},
  eprint={2509.21628},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2509.21628},
}