A Data-driven Typology of Vision Models from Integrated Representational Metrics
A biology-inspired framework that integrates multiple representational similarity metrics to create a principled typology of vision models.
Abstract
Large vision models differ widely in architecture and training paradigm, yet we lack principled methods to determine which aspects of their representations are shared across families and which reflect distinctive computational strategies. We leverage a suite of representational similarity metrics, each capturing a different facet—geometry, unit tuning, or linear decodability—and assess family separability using multiple complementary measures. Metrics preserving geometry or tuning (e.g., RSA, Soft Matching) yield strong family discrimination, whereas flexible mappings such as Linear Predictivity show weaker separation. These findings indicate that geometry and tuning carry family-specific signatures, while linearly decodable information is more broadly shared. To integrate these complementary facets, we adapt Similarity Network Fusion (SNF), a method inspired by multi-omics integration. SNF achieves substantially sharper family separation than any individual metric and produces robust composite signatures. Clustering of the fused similarity matrix recovers both expected and surprising patterns: supervised ResNets and ViTs form distinct clusters, yet all self-supervised models group together across architectural boundaries. Hybrid architectures (ConvNeXt, Swin) cluster with masked autoencoders, suggesting convergence between architectural modernization and reconstruction-based training. This biology-inspired framework provides a principled typology of vision models, showing that emergent computational strategies—shaped jointly by architecture and training objective—define representational structure beyond surface design categories.
Figure 1: Similarity Network Fusion integrates multiple representational metrics to create unified model signatures.
Motivation
Modern computer vision has produced a diverse landscape of models—CNNs, Vision Transformers, hybrid architectures—trained with supervised learning or self-supervised learning. This raises fundamental questions:
- What makes different model families unique?
- Which representational properties are universally shared?
- How do architecture and training jointly shape learned representations?
Understanding these questions is crucial for principled model selection, architecture design, and advancing our understanding of visual intelligence.
Traditional approaches categorize models by surface-level characteristics (architecture type, training method). We argue that a data-driven, representation-centric approach reveals deeper patterns about computational strategies.
Method
Overview
Our approach consists of four main steps:
- Extract representations from 35 vision models across diverse architectures and training paradigms (see the extraction sketch after this list)
- Compute similarity using multiple representational metrics capturing different facets
- Integrate metrics via Similarity Network Fusion to create unified signatures
- Cluster models to reveal a data-driven typology
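As a minimal sketch of the extraction step, assuming standard torchvision models (the choice of ResNet-50 and of the pooled layer is illustrative, not the paper's exact configuration):

```python
import torch
import torchvision.models as tvm
from torchvision.models.feature_extraction import create_feature_extractor

# Illustrative model and layer choice; the paper spans 35 models.
model = tvm.resnet50(weights=tvm.ResNet50_Weights.IMAGENET1K_V2).eval()
extractor = create_feature_extractor(model, return_nodes={"avgpool": "feat"})

images = torch.randn(8, 3, 224, 224)  # stand-in for preprocessed stimuli

with torch.no_grad():
    acts = extractor(images)["feat"].flatten(1)  # (stimuli, units) matrix
```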
Representational Similarity Metrics
- RSA (Representational Similarity Analysis): Compares representational dissimilarity matrices (sketched below, alongside linear CKA)
- Linear CKA (Centered Kernel Alignment): Invariant to orthogonal transformations and isotropic scaling
- Soft Matching: Aligns representations via optimal transport, preserving unit-level tuning properties
- Procrustes Distance: Measures dissimilarity after the best orthogonal alignment of representations
- Linear Predictivity: Measures linearly accessible shared information via unconstrained linear mapping
- PWCCA: Projection-weighted canonical correlation analysis
- SVCCA: Singular vector canonical correlation analysis
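To make two of these facets concrete, here is a minimal sketch of RSA and linear CKA on activation matrices of shape (stimuli x units). The function names are ours, and the paper's exact implementations may differ:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rsa(X, Y):
    """RSA: Spearman correlation between the two models' RDMs."""
    rdm_x = pdist(X, metric="correlation")  # condition-pair dissimilarities
    rdm_y = pdist(Y, metric="correlation")
    return spearmanr(rdm_x, rdm_y).correlation

def linear_cka(X, Y):
    """Linear CKA (Kornblith et al., 2019) between activation matrices."""
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    hsic = np.linalg.norm(X.T @ Y, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") *
                   np.linalg.norm(Y.T @ Y, "fro"))
```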
Similarity Network Fusion
SNF integrates multiple similarity networks through an iterative diffusion process:
- Construct K-NN graphs: For each metric, build a sparse affinity graph
- Message passing: Iteratively update each network using information from others
- Convergence: Networks mutually reinforce consistent structure while suppressing noise
- Output: Unified similarity matrix combining all metrics
This approach amplifies consensus across metrics, reduces metric-specific noise, and reveals robust patterns invisible to individual metrics. A compact sketch of the update rule follows.
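Below is a from-scratch sketch of the SNF update (Wang et al., 2014) applied to a list of per-metric model-by-model similarity matrices; the hyperparameters k and t are illustrative, not the paper's settings:

```python
import numpy as np

def snf_fuse(sims, k=5, t=20):
    """Fuse a list of (n, n) similarity matrices via SNF-style diffusion."""
    def normalize(W):
        # Full transition kernel: off-diagonal mass 1/2, self-weight 1/2.
        off = W - np.diag(np.diag(W))
        P = off / (2 * off.sum(axis=1, keepdims=True))
        np.fill_diagonal(P, 0.5)
        return P

    def knn_kernel(W, k):
        # Sparse local kernel: keep each row's k strongest neighbours.
        S = np.zeros_like(W)
        for i, row in enumerate(W):
            nn = np.argsort(row)[-k:]
            S[i, nn] = row[nn]
        return S / S.sum(axis=1, keepdims=True)

    P = [normalize(W) for W in sims]
    S = [knn_kernel(W, k) for W in sims]
    for _ in range(t):  # message passing: each network absorbs the others
        P = [S[v] @ (sum(P[u] for u in range(len(P)) if u != v)
                     / (len(P) - 1)) @ S[v].T
             for v in range(len(P))]
    return sum(P) / len(P)  # unified similarity matrix

```

Established implementations (e.g., the `snfpy` package) exist; the loop above simply makes the mutual-reinforcement step explicit.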
Results
Different Metrics Reveal Different Patterns
Our systematic evaluation reveals that representational facets vary dramatically in their ability to discriminate model families:
Figure 2: Model-family separability on ImageNet under d', silhouette score, and contrastive ratio. Columns correspond to nine similarity metrics, including two fusion-based methods (SNF, average) and seven commonly used representational metrics. Fusion-based metrics consistently yield higher scores, highlighting their effectiveness in capturing family-level distinctions.
Figure 3: Mean model-family separability across four datasets, evaluated using d', silhouette score, and contrastive ratio. SNF yields the most consistent and robust separation. Scores are shown in their native scales and are not directly comparable across measures.
Key Findings:
- Geometry & tuning preserve family signatures: RSA and Soft Matching strongly discriminate families
- Linearly decodable information is shared: Linear Predictivity and CCA-based metrics show weaker separation
- Mapping flexibility matters: Discrimination decreases as mappings become more flexible (Soft Matching > Procrustes > Linear Predictivity)
This reveals a fundamental pattern: geometric organization and unit tuning constitute family-specific signatures, while linearly accessible information is more universally shared.
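The separability measures themselves are straightforward to compute from a model-by-model similarity matrix. The sketch below gives one plausible reading of d' and the silhouette score, assuming `S` has unit diagonal and values in [0, 1]; the paper's exact definitions may differ:

```python
import numpy as np
from sklearn.metrics import silhouette_score

def family_separability(S, families):
    """d' and silhouette for similarity matrix S and family labels."""
    families = np.asarray(families)
    i, j = np.triu_indices_from(S, k=1)          # unique model pairs
    same = families[i] == families[j]
    within, between = S[i, j][same], S[i, j][~same]
    d_prime = (within.mean() - between.mean()) / np.sqrt(
        0.5 * (within.var() + between.var()))
    sil = silhouette_score(1 - S, families, metric="precomputed")
    return d_prime, sil
```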
SNF Achieves Superior Integration
SNF dramatically outperforms all individual metrics and simple averaging:
- High and balanced discrimination: strong separation across all family pairs
- Robust signatures: Consistent across multiple datasets (ImageNet, Ecoset, CIFAR)
The superior performance demonstrates that integrating complementary representational facets yields comprehensive signatures that most reliably distinguish model families.
A Data-Driven Model Typology
Hierarchical clustering of the SNF-fused similarity matrix reveals a principled typology:
Figure 4: Data-driven typology reveals expected and surprising patterns in model organization.
Expected Patterns:
- Supervised ResNets cluster together
- Supervised ViTs cluster together
- VGG models separate from ResNets
Surprising Discoveries:
- Training paradigm overrides architecture: all self-supervised models cluster together, regardless of whether they are CNNs or Transformers
- Self-supervised ResNets group more closely with self-supervised ViTs than with supervised ResNets
- Hybrid convergence: ConvNeXt (modernized CNN) and Swin (CNN-like ViT) cluster with MAE (masked autoencoder)
This suggests that computational strategies induced by training objectives can override architectural differences, defining a representational “species” that transcends surface-level design categories.
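For readers who want to reproduce this step, here is a minimal sketch of hierarchical clustering on a fused similarity matrix (average linkage and the cluster count are illustrative choices, not necessarily the paper's; `fused` is the output of `snf_fuse` from the SNF sketch above):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Convert the fused similarity matrix to a condensed distance vector.
D = 1 - fused
np.fill_diagonal(D, 0)
Z = linkage(squareform(D, checks=False), method="average")
labels = fcluster(Z, t=5, criterion="maxclust")  # e.g. five model "species"
```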
Cross-Layer Consistency
Figure 5: SNF shows highest consistency across network depths.
Our findings remain stable across different network depths (60%, 80%, 100%), with SNF showing the highest cross-layer consistency.
Practical Applications
Model Selection
- Move beyond architecture heuristics to representation-based model selection
- Match representational properties to downstream task requirements
Model Design
- Training paradigm may be more impactful than architectural details
- Hybrid architectural designs and reconstruction-based training converge on similar representations
Understanding Intelligence
- Models achieve similar solutions via different paths
- Training objectives may be more fundamental than architecture
BibTeX
@misc{wu2025datadriventypologyvisionmodels,
  title={A Data-driven Typology of Vision Models from Integrated Representational Metrics},
  author={Jialin Wu and Shreya Saha and Yiqing Bo and Meenakshi Khosla},
  year={2025},
  eprint={2509.21628},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2509.21628},
}