Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation
Jinlong Xue1, Yayue Deng1, Yingming Gao1, Ya Li1
1Beijing University of Posts and Telecommunications, Beijing, China
Recent advancements in diffusion models and large language models (LLMs) have significantly propelled the field of AI-generated content (AIGC). Text-to-Audio (TTA), a burgeoning AIGC application designed to generate audio from natural language prompts, is attracting increasing attention. However, existing TTA studies often struggle with generation quality and text-audio alignment, especially for complex textual inputs. Drawing inspiration from state-of-the-art Text-to-Image (T2I) diffusion models, we introduce Auffusion, a TTA system that adapts T2I model frameworks to the TTA task by effectively leveraging their inherent generative strengths and precise cross-modal alignment. Our objective and subjective evaluations demonstrate that Auffusion surpasses previous TTA approaches while using limited data and computational resources. Furthermore, previous T2I studies recognize the significant impact of encoder choice on cross-modal alignment, such as fine-grained details and object bindings, while similar evaluation is lacking in prior TTA works. Through comprehensive ablation studies and innovative cross-attention map visualizations, we provide insightful assessments of text-audio alignment in TTA. Our findings reveal Auffusion’s superior capability in generating audio that accurately matches textual descriptions, which is further demonstrated in several related tasks, such as audio style transfer, inpainting, and other manipulations.
|Two gunshots followed by birds chirping
|A dog is barking
|People cheering in a stadium while rolling thunder and lightning strikes
Acoustic Environment Control:
|A man is speaking in a huge room.
|A man is speaking in a small room.
|A man is speaking in a studio.
|Chopping tomatos on a wooden table.
|Chopping meat on a wooden table.
|Chopping potatos on a metal table.
|Sine wave with low pitch.
|Sine wave with medium pitch.
|Sine wave with high pitch.
Temporal Order Control:
|A racing car is passing by and disappear.
|Two gunshots followed by birds flying away while chirping
|Wooden table tapping sound followed by water pouring sound.
TTA Generation with ChatGPT Text Prompt
|Birds singing sweetly in a blooming garden
|A kitten mewing for attention
|Magical fairies laughter echoing through an enchanted forest
|Soft whispers of a bedtime story being told
|A monkey laughs before getting hit on the head by a large atomic bomb
|A pencil scribbling on a notepad
|The splashing of water in a pond
|Coins clinking in a piggy bank
|A kid is whistling in a studio
|A distant church bell chiming noon
|A car’s horn honking in traffic
|Angry kids breaking glass in frustration
|An old-fashioned typewriter clacking
|A girl screaming at the most demented and vile sight
|A train whistle blowing in the distance
Multi Event Comparison
|A bell chiming as a clock ticks and a man talks through a television speaker in the background followed by a muffled bell chiming
|Buzzing and humming of a motor with a man speaking
|A series of machine gunfire and two gunshots firing as a jet aircraft flies by followed by soft music playing
|Woman speaks, girl speaks, clapping, croaking noise interrupts, followed by laughter
|A man talking as paper crinkles followed by plastic creaking then a toilet flushing
|Rain falls as people talk and laugh in the background.
|People walk heavily, pause, slide their feet, walk, stop, and begin walking again.
Cross Attention Map Comparison
Text-Guided Audio Style Transfer
|From cat screaming to car racing.
|From bird chirping to ambulance siren.
|From baby crying to cat meowing.
Audio Inpainting Examples
Attention-based Replacement Examples
Attention-based Reweighting Examples
1. We will share our code on GitHub, with the aim of open-sourcing audio generation model training and evaluation for easier comparison.
2. We are confirming data-related copyright issues, after which the pretrained models will be released.
This website was created based on https://github.com/AudioLDM/AudioLDM.github.io