Entry

AudioLDM: a model that generates audio (environmental sounds, music, etc.) from text using latent diffusion

Simple Title

Achieves SOTA in text-to-audio by using CLAP. It is open source, and there is an online demo you can try right away!

Description

Liu, Haohe, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo Mandic, Wenwu Wang, and Mark D. Plumbley. 2023. “AudioLDM: Text-to-Audio Generation with Latent Diffusion Models.” arXiv [cs.SD] . arXiv. http://arxiv.org/abs/2301.12503.

Type

Paper

Year

2023

Posted at

February 10, 2023

Overview

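A text-to-audio model based on diffusion models

Can learn environmental sounds, sound effects, human speech, music, and more

Achieves SOTA among text-to-audio models in audio quality, both quantitatively and qualitatively
Uses CLAP (Contrastive Language-Audio Pretraining, the audio counterpart of CLIP) to keep the output consistent with the text

No text is needed during training: the CLAP audio embedding is used instead

Also enables audio style transfer, inpainting, and similar manipulations
Open source, with a demo on Hugging Face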

Abstract

Text-to-audio (TTA) system has recently gained attention for its ability to synthesize general audio based on text descriptions. However, previous studies in TTA have limited generation quality with high computational costs. In this study, we propose AudioLDM, a TTA system that is built on a latent space to learn the continuous audio representations from contrastive language-audio pretraining (CLAP) latents. The pretrained CLAP models enable us to train latent diffusion models (LDMs) with audio embedding while providing text embedding as a condition during sampling. By learning the latent representations of audio signals and their compositions without modeling the cross-modal relationship, AudioLDM is advantageous in both generation quality and computational efficiency. Trained on AudioCaps with a single GPU, AudioLDM achieves state-of-the-art TTA performance measured by both objective and subjective metrics (e.g., frechet distance). Moreover, AudioLDM is the first TTA system that enables various text-guided audio manipulations (e.g., style transfer) in a zero-shot fashion.

Motivation

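Collecting datasets in which text and audio are aligned is extremely laborious → use a pretrained CLAP (the audio counterpart of CLIP) instead. No text is needed during training.

This gave better results than training together with text.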

Architecture

AudioLDM

A CNN-based variational autoencoder (VAE) handles the mel spectrogram
The VAE latent vectors are generated by latent diffusion.

The conditioning vector for this latent diffusion is the CLAP embedding
During training, the embedding $\boldsymbol{E}^x \in R^L$ obtained from the audio input is used
During sampling (inference), the embedding $\boldsymbol{E}^y \in R^L$ obtained from the text input is used

Because CLAP is trained with a contrastive loss, $\boldsymbol{E}^x \simeq \boldsymbol{E}^y$
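To make the reasoning behind $\boldsymbol{E}^x \simeq \boldsymbol{E}^y$ concrete, here is a minimal sketch of a CLIP-style symmetric contrastive loss on toy embeddings. It is not CLAP's actual code: the function names, the 3-dimensional toy vectors, and the temperature of 0.07 are all assumptions made for illustration. The point is only that the loss is low when each audio embedding is closest to its own caption's embedding, which is what lets a text embedding stand in for an audio embedding at sampling time.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_similarity_matrix(audio_embs, text_embs):
    """Entry [i][j] is the cosine similarity of audio i and text j."""
    a = [l2_normalize(v) for v in audio_embs]
    t = [l2_normalize(v) for v in text_embs]
    return [[sum(x * y for x, y in zip(ai, tj)) for tj in t] for ai in a]

def cross_entropy_on_diagonal(logits):
    """Mean of -log softmax(row_i)[i]: the matching pair is the target."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(audio_embs, text_embs, temperature=0.07):
    """Average of audio-to-text and text-to-audio cross-entropy.
    The temperature value is an assumed, illustrative default."""
    sim = cosine_similarity_matrix(audio_embs, text_embs)
    logits = [[s / temperature for s in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits_t))

# Toy batch of three audio/text pairs (hypothetical values).
audio = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]]
text_aligned = [[0.9, 0.0, 0.1], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
text_shuffled = [text_aligned[1], text_aligned[2], text_aligned[0]]

loss_aligned = symmetric_contrastive_loss(audio, text_aligned)
loss_shuffled = symmetric_contrastive_loss(audio, text_shuffled)
print("aligned pairs :", loss_aligned)
print("shuffled pairs:", loss_shuffled)
```

Minimizing this kind of loss pulls matching audio and text embeddings together in the shared space, so a diffusion model conditioned on audio embeddings during training can plausibly accept text embeddings at inference.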

The diffusion model is a latent diffusion model (this keeps the model size and the memory needed for training small)
The VAE compresses the mel spectrogram into a small latent space $\boldsymbol{z} \in R^{C \times \frac{T}{r} \times \frac{F}{r}}$
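As a rough worked example of the latent shape $C \times \frac{T}{r} \times \frac{F}{r}$, the sketch below computes the latent size for one set of dimensions. The concrete numbers (1024 time frames, 64 mel bins, 8 channels, compression level 4) are illustrative assumptions, not values stated in this note, and `latent_shape` is a hypothetical helper.

```python
def latent_shape(channels, time_frames, mel_bins, r):
    """Shape of the VAE latent for a T x F mel spectrogram compressed by r."""
    if time_frames % r or mel_bins % r:
        raise ValueError("T and F must both be divisible by r")
    return (channels, time_frames // r, mel_bins // r)

# Illustrative assumptions, not figures taken from the note above.
T, F, C, r = 1024, 64, 8, 4

shape = latent_shape(C, T, F, r)
spectrogram_positions = T * F                # time-frequency cells in the input
latent_positions = shape[1] * shape[2]       # spatial cells the diffusion model sees
latent_elements = C * latent_positions       # total numbers stored in the latent

print("latent shape:", shape)
print("spatial reduction factor:", spectrogram_positions // latent_positions)
print("latent elements:", latent_elements)
```

With these assumed values the diffusion model operates on a grid with $r^2$ times fewer time-frequency positions than the spectrogram, which is where the savings in model size and memory come from.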

HiFi-GAN is used as the vocoder that converts the mel spectrogram back into a waveform

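Training and sampling (inference) of the AudioLDM model

Inpainting (completion) and style transfer with AudioLDM

It can also fill in missing parts of an existing sound using a text prompt (inpainting), restore the high-frequency content of audio recorded at a low sampling rate (super resolution), or convert a sound into a different one via a text prompt.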

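Inpainting and super resolution

Inpainting and super resolution can be viewed as completing the spectrogram along the time axis and the frequency axis, respectively.
The latent vector generated as usual from the text embedding and the latent vector $z^{ob} \in R^{C \times \frac{T}{r} \times \frac{F}{r}}$, obtained by encoding the given input spectrogram with the VAE, are blended with weights to obtain the $z$ that is fed to the VAE decoder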
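The blending described above can be sketched as a masked mix of two latents. This is a simplified illustration, not the authors' implementation: it uses a single-channel latent stored as nested lists, binary masks, and one blend, whereas a real sampler would presumably apply the mix during the denoising steps. The helper names `blend_latents`, `time_mask`, and `frequency_mask` are hypothetical.

```python
def blend_latents(z_generated, z_observed, mask):
    """Per-cell mix: mask 1.0 keeps the observed latent, 0.0 uses the generated one.
    Fractional mask values give a weighted blend."""
    return [
        [m * o + (1.0 - m) * g for g, o, m in zip(g_row, o_row, m_row)]
        for g_row, o_row, m_row in zip(z_generated, z_observed, mask)
    ]

def time_mask(n_time, n_freq, missing_start, missing_end):
    """Inpainting: time frames in [missing_start, missing_end) are unknown."""
    return [
        [0.0 if missing_start <= t < missing_end else 1.0 for _ in range(n_freq)]
        for t in range(n_time)
    ]

def frequency_mask(n_time, n_freq, cutoff_bin):
    """Super resolution: frequency bins at or above cutoff_bin are unknown."""
    return [
        [1.0 if f < cutoff_bin else 0.0 for f in range(n_freq)]
        for _ in range(n_time)
    ]

# Toy latents: observed cells are +1, generated cells are -1 (rows = time, cols = frequency).
n_time, n_freq = 6, 4
z_observed = [[1.0] * n_freq for _ in range(n_time)]
z_generated = [[-1.0] * n_freq for _ in range(n_time)]

inpainted = blend_latents(z_generated, z_observed, time_mask(n_time, n_freq, 2, 4))
super_resolved = blend_latents(z_generated, z_observed, frequency_mask(n_time, n_freq, 2))

for row in inpainted:
    print(row)
```

In the toy output, rows 2 and 3 of `inpainted` come from the generated latent (the missing time span), while in `super_resolved` the upper half of every row does (the missing high frequencies).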

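Style transfer

As with img2img in Stable Diffusion, instead of starting from pure noise, the diffusion process starts from the input audio with noise added (the $Z_{n_0}$ in the figure). This makes it possible to generate audio that combines the characteristics of the original sound with those specified by the text prompt
How much the original input is corrupted with noise (the number of forward diffusion steps) determines how much of the original sound's character is preserved

Inpainting (top half). Style transfer (bottom half)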
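The effect of the starting step $n_0$ in style transfer can be illustrated with the standard DDPM closed form for forward diffusion, $z_{n_0} = \sqrt{\bar{\alpha}_{n_0}}\, z_0 + \sqrt{1 - \bar{\alpha}_{n_0}}\, \epsilon$. The sketch below assumes a linear beta schedule with 1000 steps between 1e-4 and 0.02; these are common DDPM defaults used here for illustration, not values confirmed for AudioLDM, and the function names are hypothetical.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (assumed schedule, for illustration)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alpha(betas, n0):
    """alpha_bar: product of (1 - beta) over the first n0 steps."""
    alpha_bar = 1.0
    for beta in betas[:n0]:
        alpha_bar *= 1.0 - beta
    return alpha_bar

def noise_to_step(z0, betas, n0, rng):
    """Jump directly to step n0 of forward diffusion from the clean latent z0."""
    alpha_bar = cumulative_alpha(betas, n0)
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    return [signal_scale * z + noise_scale * rng.gauss(0.0, 1.0) for z in z0]

betas = linear_beta_schedule(1000)
rng = random.Random(0)
z0 = [math.sin(0.1 * i) for i in range(64)]  # stand-in for an encoded input sound

for n0 in (100, 500, 900):
    z_n0 = noise_to_step(z0, betas, n0, rng)
    # Weight of the original latent that survives in z_n0.
    print(n0, "signal scale:", round(math.sqrt(cumulative_alpha(betas, n0)), 3))
```

A small $n_0$ leaves most of the original latent intact, so the result stays close to the input sound; a large $n_0$ leaves almost nothing of it, so the text prompt dominates.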

Results

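Generated audio samples can be listened to on the project page

Quantitatively as well, it achieves SOTA among text-to-audio models

Several variants, from AudioLDM-S to AudioLDM-L-Full, differing in model size and amount of training data, were built and compared with prior work such as DiffSound and AudioGen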

Further Thoughts

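Personal impressions after reading the paper

Surprisingly simple and straightforward
There was no comparison with the Moûsai model, which came out around the same time. Moûsai generates the waveform directly without going through a spectrogram, and uses a small custom dataset rather than CLAP.

Moûsai: music generation with a latent diffusion model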

The recent surge in popularity of diffusion models for image generation has brought new attention to the potential of these models in other areas of media synthesis. One area that has yet to be fully explored is the application of diffusion models to music generation.

createwith.ai

Links

Project page: samples, etc.

AudioLDM: Text-to-Audio Generation with Latent Diffusion Models

Haohe Liu*1, Zehua Chen*2, Yi Yuan1, Xinhao Mei1, Xubo Liu1, Danilo Mandic2, Wenwu Wang1, Mark D. Plumbley1. 1 CVSSP, University of Surrey, Guildford, UK. 2 Department of EEE, Imperial College London, London, UK. * Equal contribution. Text-to-audio (TTA) system has recently gained attention for its ability to synthesize general audio based on text descriptions.

audioldm.github.io


Hugging Face demo

Audioldm Text To Audio Generation - a Hugging Face Space by haoheliu

Discover amazing ML apps made by the community

huggingface.co
