Multimodal Foundation Models
Large-scale vision-language models, covering pretraining, instruction tuning, and alignment. Includes image-text, video-text, and audio-visual models and their evaluation on multimodal benchmarks.