Entry

MusicLM: A Model That Generates Music from Text

Simple Title

Agostinelli, Andrea, Timo I. Denk, Zalán Borsos, Jesse Engel, Mauro Verzetti, Antoine Caillon, Qingqing Huang, et al. 2023. “MusicLM: Generating Music From Text.” arXiv [cs.SD]. http://arxiv.org/abs/2301.11325.

Description

A text prompt such as “a calming violin melody backed by a distorted guitar riff” is turned into music and output as a sound file. Think of it as Stable Diffusion for music.

Type

Paper

Year

2023

Posted at

January 27, 2023

Overview

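A model that generates music from text.
A prompt such as “a calming violin melody backed by a distorted guitar riff” produces the corresponding music.

24 kHz sampling rate

Releases MusicCaps, a dataset of about 5,500 music-text pairs with descriptions written by professional musicians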

Abstract

We introduce MusicLM, a model for generating high-fidelity music from text descriptions such as “a calming violin melody backed by a distorted guitar riff”. MusicLM casts the process of conditional music generation as a hierarchical sequence-to-sequence modeling task, and it generates music at 24 kHz that remains consistent over several minutes. Our experiments show that MusicLM outperforms previous systems both in audio quality and adherence to the text descriptions. Moreover, we demonstrate that MusicLM can be conditioned on both text and a melody in that it can transform whistled and hummed melodies according to the style described in a text caption. To support future research, we publicly release MusicCaps, a dataset composed of 5.5k music-text pairs, with rich text descriptions provided by human experts. google-research.github.io/seanet/musiclm/examples

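The generative model builds on AudioLM, which synthesizes audio by treating it as a hierarchical sequence of discrete tokens: low-level acoustic tokens (signal-level detail) and high-level semantic tokens (somewhat more structural information).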

acoustic tokens: SoundStream (24 kHz mono audio compressed to a 50 Hz token sequence)
semantic tokens: w2v-BERT (a 25 Hz token sequence)

Both are pretrained separately, and their weights are then frozen.
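To make the compression concrete, here is a small arithmetic sketch using only the rates quoted above. It counts token steps per clip and audio samples per step; how many codes each step carries is not covered here, and the helper name `token_steps` is made up for illustration.

```python
SAMPLE_RATE_HZ = 24_000   # audio sampling rate quoted in these notes
ACOUSTIC_RATE_HZ = 50     # SoundStream token steps per second
SEMANTIC_RATE_HZ = 25     # w2v-BERT tokens per second


def token_steps(duration_s, rate_hz):
    """Number of token steps needed to cover `duration_s` seconds of audio."""
    return int(round(duration_s * rate_hz))


# How many raw audio samples each token step summarizes.
samples_per_acoustic_step = SAMPLE_RATE_HZ // ACOUSTIC_RATE_HZ
samples_per_semantic_step = SAMPLE_RATE_HZ // SEMANTIC_RATE_HZ

# Sequence lengths for a 10-second clip.
acoustic_steps_10s = token_steps(10, ACOUSTIC_RATE_HZ)
semantic_steps_10s = token_steps(10, SEMANTIC_RATE_HZ)

print(samples_per_acoustic_step, samples_per_semantic_step)
print(acoustic_steps_10s, semantic_steps_10s)
```

The semantic stream is half as long as the acoustic one, which fits its role as the coarser, more structural layer.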

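Music-text pairs are far scarcer than image-text pairs → MusicLM uses MuLan, a model that maps (embeds) text and music into the same latent space.

Like CLIP, but with music in place of images
Notably, it can be trained even when the text is only loosely related to the music

The text attached to each piece of music came from YouTube music videos: the title, tags, description, and the titles of playlists containing the video.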
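As a rough illustration of what "CLIP with music in place of images" means, the sketch below computes a generic symmetric contrastive loss over a batch of paired audio and text embeddings. These notes do not describe MuLan's actual training objective, so treat the loss form, the temperature value, and all function names as assumptions made for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cross_entropy(logits, target):
    """-log softmax(logits)[target], computed in a numerically stable way."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def contrastive_loss(audio_embs, text_embs, temperature=0.1):
    """Symmetric contrastive loss: row i of each list is assumed to be a matching pair."""
    a = [normalize(v) for v in audio_embs]
    t = [normalize(v) for v in text_embs]
    n = len(a)
    logits = [[dot(a[i], t[j]) / temperature for j in range(n)] for i in range(n)]
    # Each audio clip should pick out its own text, and each text its own clip.
    audio_to_text = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    text_to_audio = sum(
        cross_entropy([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (audio_to_text + text_to_audio)


# Toy embeddings (made up): aligned pairs should give a lower loss than shuffled ones.
audio = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned_text = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
shuffled_text = aligned_text[1:] + aligned_text[:1]

loss_aligned = contrastive_loss(audio, aligned_text)
loss_shuffled = contrastive_loss(audio, shuffled_text)
print(loss_aligned, loss_shuffled)
```

Because the loss only asks each clip to be closer to its own text than to the other texts in the batch, noisy or loosely related captions can still provide a useful training signal.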

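MuLan embeds 10-second audio clips, so it cannot process long music sequences directly → embed a 10-second window every 1 second (stride = 1 second) and average the results to get the embedding for the whole track.
The embedding is then represented as 12 tokens using RVQ (residual vector quantization).
This component is also pretrained in advance.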
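The windowing, averaging, and quantization steps can be sketched as follows. This is a minimal sketch, not the paper's code: `toy_embed` is a hypothetical stand-in for MuLan's audio embedding, the codebooks are random rather than learned, and the embedding dimension (4) and codebook size (8) are arbitrary toy values. Only the 10-second window, 1-second stride, and 12 quantizer stages come from the notes.

```python
import math
import random


def window_starts(duration_s, window_s=10, stride_s=1):
    """Start times (in whole seconds) of every full window that fits in the track."""
    last_start = max(duration_s - window_s, 0)
    return list(range(0, last_start + 1, stride_s))


def mean_vector(vectors):
    """Element-wise average of equally sized vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def rvq_encode(vector, codebooks):
    """Residual VQ: each stage quantizes whatever the previous stages left over."""
    residual = list(vector)
    tokens = []
    for codebook in codebooks:
        best = min(
            range(len(codebook)),
            key=lambda k: sum((r - c) ** 2 for r, c in zip(residual, codebook[k])),
        )
        tokens.append(best)
        residual = [r - c for r, c in zip(residual, codebook[best])]
    return tokens, residual


def toy_embed(start_s, dim=4):
    # Hypothetical stand-in for the MuLan embedding of the window starting at start_s.
    return [math.sin(0.1 * start_s + d) for d in range(dim)]


random.seed(0)
# 12 stages of random toy codebooks; later stages use smaller codewords.
codebooks = [
    [[random.uniform(-1, 1) * 0.5 ** stage for _ in range(4)] for _ in range(8)]
    for stage in range(12)
]

starts = window_starts(30)  # a 30-second track
track_embedding = mean_vector([toy_embed(s) for s in starts])
tokens, residual = rvq_encode(track_embedding, codebooks)
reconstruction = [
    sum(codebooks[i][t][d] for i, t in enumerate(tokens)) for d in range(4)
]
print(len(starts), tokens)
```

The sum of the selected codewords plus the final residual gives back the original embedding exactly, which is what makes the 12 token indices a compact description of it.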

MuLan

Architecture

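Generation happens in two stages:

A model that predicts semantic tokens from the MuLan tokens
A model that predicts acoustic tokens from the MuLan tokens plus the predicted semantic tokens

Each of the two models is trained as a decoder-only Transformer.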
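The two stages can be sketched as plain autoregressive loops over token prefixes. In this sketch `toy_next_token` is a deterministic stand-in for a trained decoder-only Transformer, the vocabulary sizes are arbitrary toy values, and sharing one token id space across token types is a simplification; none of these details come from the paper.

```python
SEMANTIC_VOCAB = 16   # toy vocabulary sizes, chosen arbitrarily
ACOUSTIC_VOCAB = 32


def toy_next_token(context, vocab_size):
    # Stand-in for a trained model: any deterministic function of the context works here.
    return (sum(context) + len(context)) % vocab_size


def generate(prefix, n_new, vocab_size):
    """Autoregressively append n_new tokens after the prefix; return only the new ones."""
    seq = list(prefix)
    for _ in range(n_new):
        seq.append(toy_next_token(seq, vocab_size))
    return seq[len(prefix):]


def two_stage_generate(mulan_tokens, n_semantic, n_acoustic):
    # Stage 1: MuLan tokens -> semantic tokens.
    semantic = generate(mulan_tokens, n_semantic, SEMANTIC_VOCAB)
    # Stage 2: MuLan tokens + predicted semantic tokens -> acoustic tokens.
    acoustic = generate(mulan_tokens + semantic, n_acoustic, ACOUSTIC_VOCAB)
    return semantic, acoustic


mulan_tokens = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]  # 12 made-up conditioning tokens
semantic, acoustic = two_stage_generate(mulan_tokens, n_semantic=25, n_acoustic=50)
print(len(semantic), len(acoustic))
```

The point of the split is that the second model sees both the conditioning and the coarse plan produced by the first, so it only has to fill in acoustic detail.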

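Training pushes the following two to match:

The acoustic tokens produced by the two models above when given the MuLan tokens of a training track as input
The acoustic tokens obtained by passing the same track through the SoundStream encoder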

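Training: the generative models need only music at training time; no text is required.

SoundStream and w2v-BERT are trained on the Free Music Archive; the autoregressive models are trained on a separate dataset of 5 million audio clips (about 280,000 hours of music).
The MuLan tokens obtained by embedding the music with MuLan are used as the input.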

At inference time

The input is swapped for the tokens obtained by embedding the text with MuLan

Figure: training (left) and generation (right)
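The training/inference asymmetry can be sketched like this: training examples for both stages are built from audio alone, and at inference the same conditioning slot is filled from text instead. All tokenizer functions below are hypothetical stubs that return deterministic pseudo-tokens; the sequence lengths (12, 25, 50) and the vocabulary size are illustrative.

```python
def stub_tokens(source, n, vocab=1024):
    """Hypothetical tokenizer stand-in: n deterministic pseudo-tokens derived from the input."""
    seed = sum(ord(c) for c in str(source))
    return [(seed * (i + 1)) % vocab for i in range(n)]


def mulan_tokens_from_audio(audio):
    return stub_tokens(audio, 12)


def mulan_tokens_from_text(text):
    return stub_tokens(text, 12)


def semantic_tokens(audio):
    return stub_tokens(audio, 25)


def acoustic_tokens(audio):
    return stub_tokens(audio, 50)


def training_examples(audio):
    """Build (conditioning, target) pairs for both stages from audio only."""
    m = mulan_tokens_from_audio(audio)
    s = semantic_tokens(audio)
    a = acoustic_tokens(audio)
    return {"stage1": (m, s), "stage2": (m + s, a)}


def inference_conditioning(text):
    """At inference the conditioning comes from the text side of the shared embedding space."""
    return mulan_tokens_from_text(text)


examples = training_examples("clip_0001")  # hypothetical clip id
prompt_tokens = inference_conditioning("a calming violin melody")
print(len(examples["stage1"][0]), len(prompt_tokens))
```

The swap only works because MuLan places audio and text in the same space, so tokens derived from a prompt look to the generative models like tokens derived from matching music.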

Results

Generated music samples

MusicLM

Andrea Agostinelli, Timo I. Denk, Zalán Borsos, Jesse Engel, Mauro Verzetti, Antoine Caillon, Qingqing Huang, Aren Jansen, Adam Roberts, Marco Tagliasacchi, Matt Sharifi, Neil Zeghidour, Christian Frank. Google Research. Abstract: We introduce MusicLM, a model generating high-fidelity music from text descriptions such as "a calming violin melody backed by a distorted guitar riff".

google-research.github.io

MusicLM outperformed Riffusion and Mubert in both audio quality and adherence to the text.
Quantitative evaluation

Frechet Audio Distance
KL Divergence

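Run the generated music through an audio classifier and take the KL divergence between the resulting label distribution and that of the reference music (music of the same genre?)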
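A minimal sketch of the KL computation on two label distributions follows. The probabilities are made-up illustrative values, not classifier outputs, and the direction KL(reference || generated) is an assumption since these notes do not state which way it is taken.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same labels."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


# Made-up label probabilities over three classes, for illustration only.
reference_labels = [0.7, 0.2, 0.1]
generated_labels = [0.5, 0.3, 0.2]

kl = kl_divergence(reference_labels, generated_labels)
print(kl)
```

A value of zero means the classifier sees the same mix of labels in both clips; larger values mean the generated audio drifts from what the reference sounds like to the classifier.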

MuLan Cycle Consistency (MCC)

Re-embed the generated music with MuLan and take the cosine similarity with the embedding of the text prompt it was generated from
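In code, the metric reduces to an average cosine similarity over (prompt embedding, generated-audio embedding) pairs. The vectors below are toy values rather than real MuLan embeddings, and the function names are chosen for illustration.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mulan_cycle_consistency(pairs):
    """Average cosine similarity over (text_embedding, audio_embedding) pairs."""
    return sum(cosine_similarity(text, audio) for text, audio in pairs) / len(pairs)


# Toy pairs: the first matches perfectly, the second is orthogonal.
pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]
mcc = mulan_cycle_consistency(pairs)
print(mcc)
```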

Pair-wise Wins

Human evaluation. Listeners hear a pair of clips and say which one is better → count the number of wins
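Tallying such judgements is a simple count. The sketch below uses placeholder system names and made-up judgements, not results from the paper, and it ignores "no preference" answers for brevity.

```python
from collections import Counter


def count_wins(judgements):
    """Count wins per system from (system_a, system_b, winner) judgements."""
    wins = Counter()
    for system_a, system_b, winner in judgements:
        if winner not in (system_a, system_b):
            raise ValueError(f"winner {winner!r} is not one of the compared systems")
        wins[winner] += 1
    return wins


# Made-up judgements between placeholder systems.
judgements = [
    ("system_a", "system_b", "system_a"),
    ("system_a", "system_c", "system_a"),
    ("system_b", "system_c", "system_c"),
]
wins = count_wins(judgements)
print(wins.most_common())
```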

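In addition, the outputs were compared against data in which professional musicians listened to music (clips taken from AudioSet) and described it in text → this data is released as MusicCaps.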

MusicCaps

5.5k high-quality music captions written by musicians

www.kaggle.com

Annotators listened to a 10-second clip and described the content and style of the music in about four sentences
Sample entries

“This folk song features a male voice singing the main melody in an emotional mood. This is accompanied by an accordion playing fills in the background. A violin plays a droning melody. There is no percussion in this song. This song can be played at a Central Asian classical concert.”
“This is a live recording of a keyboardist playing a twelve bar blues progression on an electric keyboard. The player adds embellishments between chord changes and the piece sounds groovy, bluesy and soulful.”

As expected, MusicLM loses when compared against human-made music paired with human-written text.

Strong A = the dataset (human-made) music is much better; Strong B = the MusicLM output is much better

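It can also take a melody as input (for example, whistled or hummed) and generate music that follows it in the style described by the text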

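When given a fragment of a training example and asked to generate what follows, the generated music turned out to be entirely different from the original track in the training data ← the model is not simply memorizing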

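Risks

Bias in the training data: certain genres (Western music in particular) are overrepresented
Conversely, there is also a risk of the output being perceived as cultural appropriation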

Further Thoughts

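Personal impressions after reading the paper

The key insight is that using MuLan made it possible to train the generative model without collecting large amounts of paired music-text training data.
Will this become something musicians can actually use? Is generating music from text really the most creative approach?

Might it end up constrained by the literalness of words?
Because music is an especially abstract form of expression, are we prone to accepting the model's output uncritically?

It is a pity that there are no plans to release the model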

Links

MusicLM

MusicCaps