Andrew Rouditchenko | Researcher Profile | Sotabase

Career

· Researcher, MIT CSAIL Spoken Language Systems Group · 2019–

Publications (28)

The Sound of Pixels

European Conference on Computer Vision · 2018

581 citations

Contrastive Audio-Visual Masked Autoencoder

International Conference on Learning Representations · 2022

168 citations

Everything at Once – Multi-modal Fusion Transformer for Video Retrieval

Computer Vision and Pattern Recognition · 2021

156 citations

AVLnet: Learning Audio-Visual Language Representations from Instructional Videos

Interspeech · 2020

146 citations

Self-supervised Audio-visual Co-segmentation

IEEE International Conference on Acoustics, Speech, and Signal Processing · 2019

107 citations

Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos

IEEE International Conference on Computer Vision · 2021

Cross-Modal Discrete Representation Learning

Annual Meeting of the Association for Computational Linguistics · 2021

Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation

Interspeech · 2024

CMKD: CNN/Transformer-Based Cross-Model Knowledge Distillation for Audio Classification

IEEE Transactions on Pattern Analysis and Machine Intelligence · 2022

UAVM: Towards Unifying Audio and Visual Models

IEEE Signal Processing Letters · 2022

Comparison of Multilingual Self-Supervised and Weakly-Supervised Speech Pre-Training for Adaptation to Unseen Languages

Interspeech · 2023

Omni-R1: Do You Really Need Audio to Fine-Tune Your Audio LLM?

arXiv.org · 2025

What, When, and Where? Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions

Computer Vision and Pattern Recognition · 2023

C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval

IEEE International Conference on Acoustics, Speech, and Signal Processing · 2022

Cascaded Multilingual Audio-Visual Learning from Videos

Interspeech · 2021

Routing with Self-Attention for Multimodal Capsule Networks

arXiv.org · 2021

Self-Supervised Segmentation and Source Separation on Videos

CVPR Workshops · 2019

Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset

Interspeech · 2021

CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment

Computer Vision and Pattern Recognition · 2025

mWhisper-Flamingo for Multilingual Audio-Visual Noise-Robust Speech Recognition

IEEE Signal Processing Letters · 2025
