MultiBench++

A Unified and Comprehensive Multimodal Fusion Benchmarking Across Specialized Domains

About the Project

Unified Benchmark

Standardized evaluation across diverse multimodal tasks including emotion recognition, healthcare, and remote sensing.

Extensive Datasets

Coverage of 37 datasets spanning 15+ modalities and 20+ predictive tasks across specialized domains.

Easy Integration

Modular codebase designed for easy extension, allowing researchers to plug in new models and datasets with minimal code changes.

Advanced Metrics

Task-specific evaluation with accuracy, macro-F1, AUPRC, and MSE, reported under a consistent protocol with multiple random seeds.

Datasets

Explore the MULTIBENCH++ suite of 37 multimodal datasets across specialized domains, grouped by category; see the full documentation for per-dataset details.

Paradigms

Following the paper setup, MULTIBENCH++ evaluates 11 fusion baselines and advanced methods, including Transformer-centric feature fusion and decision-level logit fusion; a minimal early-fusion sketch follows the list below.

  • Early Fusion - Feature-level integration
    • Concatenation
      Concat, ConcatEarly
    • Tensor Fusion
      TensorFusion, LowRankTensorFusion
    • Multiplicative Interactions
      MultiplicativeInteractions2Modal, MultiplicativeInteractions3Modal
  • Intermediate Fusion - Joint representation learning
    • Transformer-Based
      EarlyFusionTransformer, LateFusionTransformer
    • Attention Mechanisms
      CrossAttentionFusion, CrossAttentionConcatFusion, MultiModalCrossAttentionFusion, MultiModalCrossAttentionConcatFusion
    • Hierarchical Attention
      HierarchicalAttentionMultiToOne, HierarchicalAttentionOneToMulti
    • Non-Local Gating
      NLgate
  • Late Fusion - Decision-level integration
    • Confidence-Based
      MultimodalLateFusionClf
    • Evidence Theory
      TMC (Trusted Multi-view Classification)
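
For concreteness, here is a minimal sketch of feature-level concatenation, the simplest early-fusion baseline. The ConcatFusion class and its dimensions are illustrative stand-ins, not the benchmark's own Concat module.

# Illustrative early fusion: concatenate per-modality features, then project.
import torch
import torch.nn as nn

class ConcatFusion(nn.Module):  # hypothetical name, not the benchmark's Concat
    def __init__(self, input_dims, output_dim):
        super().__init__()
        self.proj = nn.Linear(sum(input_dims), output_dim)

    def forward(self, features):
        # features: one [batch, dim_i] tensor per modality
        return self.proj(torch.cat(features, dim=-1))

# Fuse a 768-d text feature with a 512-d image feature into a 256-d vector
fusion = ConcatFusion([768, 512], output_dim=256)
fused = fusion([torch.randn(4, 768), torch.randn(4, 512)])  # shape [4, 256]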

Code

The implementation follows the benchmark design in the paper: unified data access, modular encoders and fusion blocks, and a standardized training/evaluation workflow. Source code is available on GitHub.

MULTIBENCH++ is designed as a reproducible evaluation stack rather than a single fixed model. Data handling, representation learning, and fusion logic are decoupled into reusable modules for cross-domain, repeatable method comparison.

This structure supports fair method comparison under a shared protocol while keeping new model integration straightforward.
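
The sketch below shows how the decoupled pieces compose; MultimodalPipeline and its constructor arguments are hypothetical names used only to make the architecture concrete.

import torch.nn as nn

class MultimodalPipeline(nn.Module):
    """Hypothetical composition: encode each modality, fuse, then predict."""
    def __init__(self, encoders, fusion, head):
        super().__init__()
        self.encoders = nn.ModuleList(encoders)  # one backbone per modality
        self.fusion = fusion  # any fusion module, e.g. the ConcatFusion sketch above
        self.head = head      # task-specific prediction head

    def forward(self, inputs):  # inputs: one raw tensor per modality
        feats = [enc(x) for enc, x in zip(self.encoders, inputs)]
        return self.head(self.fusion(feats))

Swapping any one stage (a loader, an encoder, or a fusion block) leaves the other two untouched, which is what keeps cross-domain comparisons repeatable.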

Dataset Loaders

Each loader handles dataset-specific preprocessing, split rules, and modality alignment. A unified training entry then keeps all tasks on the same interface, so comparisons stay fair and reruns remain reproducible.
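
A hedged sketch of that contract, with a placeholder name (get_dataloaders) and synthetic tensors standing in for real modalities; the actual signatures live under datasets/.

import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

def get_dataloaders(batch_size=32, seed=42):   # placeholder signature
    g = torch.Generator().manual_seed(seed)    # fixed generator => fixed splits
    text = torch.randn(100, 768)               # synthetic stand-ins for real modalities
    image = torch.randn(100, 512)
    labels = torch.randint(0, 2, (100,), generator=g)
    data = TensorDataset(text, image, labels)  # every batch: (text, image, label)
    train, val, test = random_split(data, [70, 15, 15], generator=g)

    def loader(split, shuffle):
        return DataLoader(split, batch_size=batch_size, shuffle=shuffle)

    return loader(train, True), loader(val, False), loader(test, False)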

Unimodal Encoders

Modality-specific encoders convert heterogeneous inputs into compatible latent features for the downstream fusion modules and task heads. This separation allows controlled ablations and backbone replacement without changing the benchmark's evaluation protocol.
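
A minimal illustration with a toy convolutional trunk standing in for a real backbone such as ResNet; only the fixed-size output feature matters to the rest of the stack.

import torch
import torch.nn as nn

class ImageEncoder(nn.Module):
    """Maps raw images to a fixed-size feature; the trunk is a toy stand-in."""
    def __init__(self, out_dim=512):
        super().__init__()
        self.backbone = nn.Sequential(  # toy stand-in for a ResNet trunk
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.proj = nn.Linear(16, out_dim)

    def forward(self, x):  # x: [batch, 3, H, W]
        return self.proj(self.backbone(x))

feats = ImageEncoder()(torch.randn(2, 3, 224, 224))  # shape [2, 512]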

Fusion Implementations

The benchmark includes the early fusion baselines, Transformer-centric feature interaction, and decision-level logit fusion modules reported in the paper, organized for direct side-by-side evaluation so that performance differences can be attributed to the fusion strategy alone.
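
As a sketch of the decision-level pattern, the hypothetical module below averages per-modality logits; methods like TMC replace this plain mean with an evidence-theoretic combination.

import torch
import torch.nn as nn

class LateLogitFusion(nn.Module):  # illustrative name
    def __init__(self, classifiers):
        super().__init__()
        self.classifiers = nn.ModuleList(classifiers)  # one head per modality

    def forward(self, features):
        logits = [clf(f) for clf, f in zip(self.classifiers, features)]
        return torch.stack(logits).mean(dim=0)  # average the per-modality decisions

# Two unimodal heads vote on a binary task
fusion = LateLogitFusion([nn.Linear(768, 2), nn.Linear(512, 2)])
out = fusion([torch.randn(4, 768), torch.randn(4, 512)])  # shape [4, 2]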

Training, Evaluation, Experiments

Training and evaluation follow a standardized routine with task-specific metrics, repeated runs, and fixed hardware and seed settings. An automated tuning workflow further improves reproducibility and reduces manual hyperparameter search overhead.
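
A minimal sketch of the repeated-runs protocol; run_protocol and train_and_eval are hypothetical names, and the real routines live under training_structures/.

import statistics
import torch

def run_protocol(train_and_eval, seeds=(0, 1, 2)):  # hypothetical helper
    scores = []
    for seed in seeds:
        torch.manual_seed(seed)              # fix the seed for each repeat
        scores.append(train_and_eval(seed))  # returns the task metric, e.g. macro-F1
    return statistics.mean(scores), statistics.stdev(scores)

# Report mean and std across seeds instead of a single-run number
mean, std = run_protocol(lambda seed: 0.80 + 0.01 * seed)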

Leaderboard

Comprehensive experimental results for multimodal fusion methods across specialized domains; the best result in each row is highlighted.

Structure & Usage

Project Hierarchy

MultiBench++/
|-- datasets/             # Loaders spanning 37 datasets
|-- encoders/             # Unimodal backbones (BERT, ResNet)
|-- fusions/              # 11 benchmarked fusion methods
|   |-- feature_fusion.py # Concat / TF / Transformer-centric methods
|   `-- logit_fusion.py   # LS / TMC
|-- baseline/             # Standardized experiment scripts
|   `-- MAMI/baseline.py
|-- training_structures/  # Standardized training routines
|-- objective_functions/  # Loss functions & metrics
|-- configs/              # Hyperparameter configs
`-- requirements.txt      # Dependencies

Quick Start

# 1. Environment Setup
conda create -n multibench python=3.9
conda activate multibench
pip install -r requirements.txt

# 2. Data Preparation
# Option A: follow MultiBenchplus-main/DATASET.md
# Option B: Baidu Netdisk: https://pan.baidu.com/s/11ITMTGO4KCnTLr05dnmThg?pwd=8rc9
# Place datasets under ./data/<DATASET_NAME>/

# 3. Run Experiments
cd baseline/MAMI
python baseline.py

Citation

@inproceedings{liang2021multibench,
  title={MultiBench: Multiscale Benchmarks for Multimodal Representation Learning},
  author={Liang, Paul Pu and Lyu, Yiwei and Fan, Xiang and Wu, Zetian and Cheng, Yun and Wu, Jason and Chen, Leslie Yufan and Wu, Peter and Lee, Michelle A and Zhu, Yuke and Salakhutdinov, Ruslan and Morency, Louis-Philippe},
  booktitle={Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1)},
  year={2021}
}

@inproceedings{xue2026multibench,
  title={MULTIBENCH++: A Unified and Comprehensive Multimodal Fusion Benchmarking Across Specialized Domains},
  author={Leyan Xue and Changqing Zhang and Kecheng Xue and Xiaohong Liu and Guangyu Wang and Zongbo Han},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  year={2026}
}