Multimodal Foundation Models

Large-scale vision-language models, covering pretraining, instruction tuning, and alignment. Includes image-text, video-text, and audio-visual models, along with their evaluation on multimodal benchmarks.
