A Concept-Centric Approach to Multi-Modality Learning

Yuchong Geng1    Ao Tang1

1School of Electrical and Computer Engineering, Cornell University

Paper / Code

TL;DR: Humans reuse knowledge across modalities and tasks. This project builds ML systems that do the same by learning reusable concepts in a shared space and connecting multiple modalities through lightweight projection modules.

Overall architecture of the concept-centric framework

Overall architecture: a shared concept space with modality-specific projection modules.

Overview

Modern multimodal models often learn representations that are tightly coupled to a specific set of modalities and tasks. In contrast, our goal is to build a learning system with a persistent store of abstract knowledge that can be reused. We introduce a concept-centric multi-modality learning framework built around a modality-agnostic concept space, together with a set of lightweight projection modules that map each modality into this shared space.
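
To make the layout concrete, here is a minimal PyTorch sketch that treats the concept space as a learnable embedding table and each projection module as a small MLP. The class names, dimensions, and encoder feature sizes are illustrative assumptions, not the paper's actual implementation.

import torch
import torch.nn as nn

class ConceptSpace(nn.Module):
    """Modality-agnostic concept space: one learnable vector per concept (hypothetical sizes)."""
    def __init__(self, num_concepts=1000, dim=256):
        super().__init__()
        self.embeddings = nn.Parameter(torch.randn(num_concepts, dim) * 0.02)

class ProjectionModule(nn.Module):
    """Lightweight module mapping one modality's encoder features into the shared concept space."""
    def __init__(self, feature_dim, concept_dim=256, hidden=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feature_dim, hidden), nn.GELU(),
                                 nn.Linear(hidden, concept_dim))
    def forward(self, features):
        return self.net(features)

# One shared concept space, one projection module per modality.
concepts = ConceptSpace()
image_proj = ProjectionModule(feature_dim=768)   # e.g. features from an image encoder
text_proj = ProjectionModule(feature_dim=512)    # e.g. features from a text encoder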

Once the concept space is learned, it can be reused as a knowledge base. New modalities can be added by training only their projection modules, rather than retraining the entire system. We evaluate the framework on two representative downstream tasks and show that knowledge-aware projections can converge faster while maintaining competitive performance, with inference performed entirely in the shared concept space.
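
A minimal sketch of that extension step, assuming a frozen tensor of concept embeddings and a data loader yielding (encoder features, target concept index) pairs for the new modality; the function name, projection architecture, and hyperparameters are placeholders rather than the released training code.

import torch
import torch.nn as nn
import torch.nn.functional as F

def add_modality(concept_embeddings, feature_dim, loader, epochs=5, lr=1e-3):
    """Attach a new modality by fitting only its projection module; the concept space stays frozen."""
    targets = F.normalize(concept_embeddings.detach(), dim=-1)   # existing knowledge, kept fixed
    projection = nn.Sequential(nn.Linear(feature_dim, 512), nn.GELU(),
                               nn.Linear(512, targets.shape[1]))
    opt = torch.optim.Adam(projection.parameters(), lr=lr)
    for _ in range(epochs):
        for features, concept_ids in loader:                     # features + target concept indices
            z = F.normalize(projection(features), dim=-1)
            logits = z @ targets.t() / 0.07                      # cosine similarity with a temperature
            loss = F.cross_entropy(logits, concept_ids)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return projection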

Core idea

The main contribution is a reusable and interpretable learning system, not a small improvement on a single benchmark. The framework separates persistent conceptual knowledge from modality-specific perception, enabling modular extension and concept-level probing.

How it works

The system learns a concept space that represents structured, modality-agnostic knowledge. Each modality is paired with a projection module that aligns its inputs to the shared space. After training, the learned concept space can be reused, and adding a new modality requires training only a new projection module while keeping the existing knowledge fixed.
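
Inference in the shared space then reduces to a similarity lookup against the concept embeddings, sketched below under the same hypothetical shapes and names as above.

import torch
import torch.nn.functional as F

@torch.no_grad()
def nearest_concepts(features, projection, concept_embeddings, k=5):
    """Project one modality's features and rank concepts by cosine similarity."""
    z = F.normalize(projection(features), dim=-1)    # (batch, concept_dim)
    c = F.normalize(concept_embeddings, dim=-1)      # (num_concepts, concept_dim)
    scores = z @ c.t()                               # similarity to every concept
    return scores.topk(k, dim=-1).indices            # top-k concept ids per input

# The same routine serves every modality; only the projection module differs.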

Why this matters

From transient to persistent: Many models store knowledge implicitly in dense parameters and activate it only in response to inputs. Our framework stores abstract knowledge explicitly in a shared space, making it durable and reusable.

Cross-modality reuse: Once the concept space is learned, modality-specific modules can align to it instead of relearning representations from scratch, which supports efficient adaptation.

Interpretability by design: Concept-level representations and relations can be inspected through direct queries, enabling transparent analysis and targeted debugging.
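
As a hypothetical probe (not the paper's exact tooling), concept relations can be read directly out of the shared space with a similarity query over the concept embeddings:

import torch.nn.functional as F

def related_concepts(concept_embeddings, concept_names, query, k=5):
    """Query the concept space directly: which concepts lie closest to `query`?"""
    c = F.normalize(concept_embeddings, dim=-1)       # (num_concepts, dim) tensor
    q = c[concept_names.index(query)]
    scores = c @ q                                    # cosine similarity to every concept
    ranked = scores.topk(k + 1).indices.tolist()      # +1 so the query itself can be dropped
    return [concept_names[i] for i in ranked if concept_names[i] != query][:k]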

Figures

Learning curves comparing projection models and baselines

Learning curves: projection modules converge faster by aligning to a shared concept space rather than learning representations from scratch.

Cross-modality alignment heatmaps

Out-of-the-box alignment across modalities: although projection modules are trained independently, aligning to the shared concept space yields a consistent concept entailment structure across the image and multilingual text modalities.
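
A heatmap of this kind can be reproduced, in spirit, by scoring paired inputs from two modalities inside the concept space. The sketch below assumes projection modules and paired feature tensors like those above; it is an illustration, not the figure's actual plotting code.

import torch
import torch.nn.functional as F

@torch.no_grad()
def alignment_heatmap(image_features, text_features, image_proj, text_proj):
    """Cross-modal similarity of paired inputs, measured inside the concept space."""
    zi = F.normalize(image_proj(image_features), dim=-1)   # (n, concept_dim)
    zt = F.normalize(text_proj(text_features), dim=-1)     # (m, concept_dim)
    return zi @ zt.t()                                      # (n, m) heatmap; aligned pairs should dominate the diagonal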

BibTeX

@article{geng2026concept,
  title     = {A Concept-Centric Approach to Multi-Modality Learning},
  author    = {Geng, Yuchong and Tang, Ao},
  journal   = {Transactions on Machine Learning Research},
  year      = {2026}
}

Contact

Questions? Email yg534@cornell.edu.