Researchers at FunAudioLLM have released PrismAudio, an open-source framework that converts video into high-quality audio using a chain-of-thought (CoT) approach. The system builds on the team’s earlier ThinkSound model and addresses a key limitation in existing video-to-audio (V2A) generation: the inability to align multiple objectives simultaneously.
PrismAudio breaks down complex reasoning into four distinct modules: semantic alignment, temporal synchronization, aesthetic quality, and spatial accuracy. With 518 million parameters, it outperforms current state-of-the-art (SOTA) models on standard benchmarks. On the VGGSound and AudioCanvas datasets, it led in four evaluation metrics: semantic consistency, audio-visual synchronization, aesthetic quality, and spatial precision.
The framework is also efficient. Although it is smaller than competing models, it runs faster and produces higher-quality output. FunAudioLLM demonstrated this with a sample that pairs video generated by Google’s Veo3 with audio from PrismAudio.
The project is available on GitHub and Hugging Face, and it includes an interactive demo. The GitHub repository links to the PrismAudio branch under ThinkSound, while the Hugging Face model card and Space provide direct access to the framework and live demonstrations.
PrismAudio targets professionals in film, gaming, and post-production who need to generate sound effects, Foley, or ambient audio from video clips without manual editing.
Resources: