Sotabase
Career
Researcher, MIT CSAIL Spoken Language Systems Group · 2019–
Publications (28)
The Sound of Pixels
European Conference on Computer Vision · 2018
581 cited
Contrastive Audio-Visual Masked Autoencoder
International Conference on Learning Representations · 2022
168 cited
Everything at Once – Multi-modal Fusion Transformer for Video Retrieval
Computer Vision and Pattern Recognition · 2021
156 cited
AVLnet: Learning Audio-Visual Language Representations from Instructional Videos
Interspeech · 2020
146 cited
Self-supervised Audio-visual Co-segmentation
IEEE International Conference on Acoustics, Speech, and Signal Processing · 2019
107 cited
Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos
IEEE International Conference on Computer Vision · 2021
97 cited
Cross-Modal Discrete Representation Learning
Annual Meeting of the Association for Computational Linguistics · 2021
53 cited
Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation
Interspeech · 2024
37 cited
CMKD: CNN/Transformer-Based Cross-Model Knowledge Distillation for Audio Classification
IEEE Transactions on Pattern Analysis and Machine Intelligence · 2022
34 cited
UAVM: Towards Unifying Audio and Visual Models
IEEE Signal Processing Letters · 2022
30 cited
Comparison of Multilingual Self-Supervised and Weakly-Supervised Speech Pre-Training for Adaptation to Unseen Languages
Interspeech · 2023
25 cited
Omni-R1: Do You Really Need Audio to Fine-Tune Your Audio LLM?
arXiv.org · 2025
23 cited
What, When, and Where? Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions
Computer Vision and Pattern Recognition · 2023
9 cited
C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval
IEEE International Conference on Acoustics, Speech, and Signal Processing · 2022
8 cited
Cascaded Multilingual Audio-Visual Learning from Videos
Interspeech · 2021
8 cited
Routing with Self-Attention for Multimodal Capsule Networks
arXiv.org · 2021
5 cited
Self-Supervised Segmentation and Source Separation on Videos
CVPR Workshops · 2019
4 cited
Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset
Interspeech · 2021
4 cited
CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment
Computer Vision and Pattern Recognition · 2025
2 cited
mWhisper-Flamingo for Multilingual Audio-Visual Noise-Robust Speech Recognition
IEEE Signal Processing Letters · 2025
2 cited
Andrew Rouditchenko | Researcher Profile | Sotabase