Entry

SingSong: a model that takes vocals as input and generates a complete accompaniment directly as audio

Simple Title

Donahue, Chris, Antoine Caillon, Adam Roberts, Ethan Manilow, Philippe Esling, Andrea Agostinelli, Mauro Verzetti, et al. 2023. “SingSong: Generating Musical Accompaniments from Singing.” arXiv [cs.SD]. arXiv. http://arxiv.org/abs/2301.12662.

Description

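Source separation is used to extract vocals and their matching accompaniment from existing songs, and the model learns the relationship between the two. The generated accompaniments still fall short of the ground truth (the accompaniment in the original song), but their quality comes close to it.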

Type

Paper

Year

2023

Posted at

January 31, 2023

Overview

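A model that takes vocals as input and generates a complete accompaniment track. The key point is that the output can be mixed with the input vocals as-is.
Source separation is used to extract vocals and their matching accompaniment from existing songs, and the model learns the relationship between the two.
The generated accompaniments still fall short of the ground truth (the accompaniment in the original song), but their quality comes close to it.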

Abstract

We present SingSong, a system that generates instrumental music to accompany input vocals, potentially offering musicians and non-musicians alike an intuitive new way to create music featuring their own voice. To accomplish this, we build on recent developments in musical source separation and audio generation. Specifically, we apply a state-of-the-art source separation algorithm to a large corpus of music audio to produce aligned pairs of vocals and instrumental sources. Then, we adapt AudioLM (Borsos et al., 2022)—a state-of-the-art approach for unconditional audio generation—to be suitable for conditional “audio-to-audio” generation tasks, and train it on the source-separated (vocal, instrumental) pairs. In a pairwise comparison with the same vocal inputs, listeners expressed a significant preference for instrumentals generated by SingSong compared to those from a strong retrieval baseline.

Architecture

Architecture and training process

The work builds on AudioLM, which was also used in MusicLM.

AudioLM: handles the structure of music hierarchically

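Semantic tokens, which capture the coarse structure of the music = output of the w2v-BERT encoder (temporal resolution of 25 Hz)
Coarse acoustic tokens, which carry somewhat finer-grained sound information = output of the SoundStream encoder (50 Hz)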
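To get a feel for the two token rates, here is a minimal sketch that counts tokens for one clip. The 25 Hz and 50 Hz rates come from the notes above; the number of coarse quantizer levels per SoundStream frame (`NUM_COARSE_QUANTIZERS`) and the function name `token_counts` are assumptions made for illustration, not values taken from these notes.

```python
# Token-count arithmetic for one audio clip.
SEMANTIC_RATE_HZ = 25        # w2v-BERT semantic tokens per second (from the notes)
ACOUSTIC_FRAME_RATE_HZ = 50  # SoundStream frames per second (from the notes)
NUM_COARSE_QUANTIZERS = 4    # ASSUMPTION: coarse quantizer levels per frame


def token_counts(clip_seconds: float, num_coarse: int = NUM_COARSE_QUANTIZERS) -> dict:
    """Return how many tokens of each kind describe a clip of the given length."""
    semantic = int(clip_seconds * SEMANTIC_RATE_HZ)
    frames = int(clip_seconds * ACOUSTIC_FRAME_RATE_HZ)
    coarse = frames * num_coarse  # one token per frame per quantizer level
    return {
        "semantic": semantic,
        "acoustic_frames": frames,
        "coarse_acoustic": coarse,
        "semantic_plus_coarse": semantic + coarse,
    }


if __name__ == "__main__":
    print(token_counts(10.0))
```

Under these assumptions, the coarse acoustic stream is several times longer than the semantic stream for the same clip, which is why the hierarchy puts the low-rate semantic tokens first.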

Use source separation to create a large number of (vocal, accompaniment) pairs

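Run the accompaniment through the w2v-BERT encoder to get semantic tokens, and through the SoundStream encoder to get coarse acoustic tokens
The vector obtained by concatenating these two becomes the target (the Target codes in the figure above)
Train an encoder-decoder Transformer that predicts this accompaniment-derived target from the semantic tokens obtained by running the vocals through the w2v-BERT encoder
Train a decoder-only Transformer that predicts fine acoustic tokens, which represent finer sound detail, from the resulting coarse acoustic tokens.
Artifacts (noise) that source separation leaves in the training data affect learning, so the authors propose deliberately adding white noise to the vocal input during training. Without it, the model does not infer well when given clean vocals at inference time.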
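Two of the steps above can be sketched in a few lines: building the target sequence by concatenation, and adding white noise to the vocals. This is only an illustration under stated assumptions. The function names, the vocabulary offset used to keep semantic and acoustic IDs apart, and the SNR value are all hypothetical and are not taken from the paper.

```python
import math
import random


def build_target_codes(semantic, coarse, semantic_vocab_size):
    """Concatenate semantic tokens, then coarse acoustic tokens, into one target.

    ASSUMPTION: coarse IDs are shifted by the semantic vocabulary size so the
    two token types do not share IDs in the combined sequence.
    """
    return list(semantic) + [c + semantic_vocab_size for c in coarse]


def add_white_noise(samples, snr_db, rng=None):
    """Add Gaussian white noise to a waveform at the requested SNR (in dB)."""
    if not samples:
        return []
    rng = rng or random.Random()
    signal_power = sum(s * s for s in samples) / len(samples)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in samples]


if __name__ == "__main__":
    # Toy token IDs, purely illustrative.
    print(build_target_codes([1, 2], [0, 3], semantic_vocab_size=10))

    # One second of a 440 Hz tone at 16 kHz stands in for a separated vocal.
    vocal = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
    noisy_vocal = add_white_noise(vocal, snr_db=20.0, rng=random.Random(0))
    print(len(noisy_vocal))
```

The idea behind the noise is that it masks the faint separation residue in the training vocals, so the model cannot rely on that residue and behaves more consistently on clean vocals.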

Training data: 1 million songs / 46,000 hours of music

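Source separation uses MDXNet → the audio is then converted to 16 kHz, the sampling rate that SoundStream/w2v-BERT accept as input
Cut into 10-second clips
Clips where the vocals are too loud, and a cappella sections, are removed → so that the model produces some accompaniment whenever vocals are given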
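The clipping and filtering steps above might look roughly like the following. The 16 kHz rate and 10-second clip length come from the notes; the RMS-based criteria, both thresholds (`min_inst_rms`, `max_vocal_to_inst_db`), and all function names are hypothetical stand-ins, since the notes do not say how "too loud" or "a cappella" is decided.

```python
import math

SAMPLE_RATE = 16_000  # Hz, from the notes
CLIP_SECONDS = 10     # from the notes


def rms(samples):
    """Root-mean-square level of a waveform (0.0 for an empty one)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def split_into_clips(samples, sample_rate=SAMPLE_RATE, clip_seconds=CLIP_SECONDS):
    """Cut a waveform into non-overlapping fixed-length clips, dropping the tail."""
    clip_len = sample_rate * clip_seconds
    n_clips = len(samples) // clip_len
    return [samples[i * clip_len:(i + 1) * clip_len] for i in range(n_clips)]


def keep_pair(vocal_clip, inst_clip, min_inst_rms=1e-3, max_vocal_to_inst_db=20.0):
    """Decide whether a (vocal, instrumental) clip pair is usable for training.

    ASSUMPTION: both thresholds are made up for illustration.
    """
    inst_level = rms(inst_clip)
    if inst_level < min_inst_rms:
        return False  # (near-)silent instrumental: treat as a cappella
    vocal_level = rms(vocal_clip)
    if vocal_level == 0.0:
        return True  # no vocal energy, so it cannot be "too loud"
    ratio_db = 20 * math.log10(vocal_level / inst_level)
    return ratio_db <= max_vocal_to_inst_db  # drop clips dominated by vocals


if __name__ == "__main__":
    vocals = [0.1] * (SAMPLE_RATE * 25)
    inst = [0.1] * (SAMPLE_RATE * 25)
    pairs = list(zip(split_into_clips(vocals), split_into_clips(inst)))
    kept = [p for p in pairs if keep_pair(*p)]
    print(len(pairs), len(kept))
```

Filtering out a cappella sections matters because a silent target would teach the model that silence is an acceptable accompaniment.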

Evaluation data: songs from the MUSDB18 dataset, which are provided as separate tracks to begin with

Results

Samples of generated music

SingSong: Generating musical accompaniments from singing

Chris Donahue*, Antoine Caillon*¹, Adam Roberts*, Ethan Manilow, Philippe Esling¹, Andrea Agostinelli, Mauro Verzetti, Ian Simon, Olivier Pietquin, Neil Zeghidour, Jesse Engel. Google Research, ¹IRCAM, *Equal contribution. We present SingSong, a system which generates instrumental music to accompany input vocals, potentially offering musicians and non-musicians alike an intuitive new way to create music featuring their own voice.

storage.googleapis.com


Results of having human listeners compare pairs of songs.

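SingSong mostly loses to the ground truth (the original accompaniment), as expected, but in 34% of comparisons it was rated better than the original accompaniment. Impressive!
Against baselines such as picking at random from existing instrumental tracks, it is preferred a large share of the time (74%)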
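Percentages like these come from counting which item listeners picked in each pair. As a rough illustration of how such a win rate can be checked against chance, here is a sketch using an exact two-sided binomial test against 50%. This is a generic test swapped in for illustration and is not necessarily the statistical test used in the paper; the number of comparisons below is hypothetical, and only the 74% figure comes from the notes.

```python
from math import comb


def win_rate(wins, total):
    """Fraction of pairwise comparisons won."""
    return wins / total


def binomial_two_sided_p(wins, total):
    """Exact two-sided binomial p-value against a 50/50 null hypothesis.

    The null distribution is symmetric, so the two-sided p-value is twice
    the tail probability at least as extreme as the observed count.
    """
    k = max(wins, total - wins)
    tail = sum(comb(total, i) for i in range(k, total + 1)) / 2 ** total
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # HYPOTHETICAL counts: 100 comparisons is a made-up sample size.
    wins, total = 74, 100
    print(win_rate(wins, total), binomial_two_sided_p(wins, total))
```

The same function applies to the ground-truth comparison: a 34% win rate is below chance, which matches the reading that the original accompaniment is still preferred overall.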

Future improvements and goals

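Improve the sampling rate: 16 kHz is still not enough for working with music
Diversify the input/output pairs (something like a bagpipe accompaniment should also be possible)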

Further Thoughts

Personal impressions after reading the paper

The paper is easy to read!
It forms a pair with MusicLM, which was published a few days before this paper

The architectures are also similar (SoundStream / w2v-BERT)
MusicLM conditions on text / this work conditions on vocals

AudioLM really is the key. I need to read the SoundStream paper too.

Links

MusicLM: a model that generates music from text

We introduce MusicLM, a model for generating high-fidelity music from text descriptions such as "a calming violin melody backed by a distorted guitar riff". MusicLM casts the process of conditional music generation as a hierarchical sequence-to-sequence modeling task, and it generates music at 24 kHz that remains consistent over several minutes.

createwith.ai

SoundStream: the neural audio codec also used in AudioLM

SoundStream: An End-to-End Neural Audio Codec

We present SoundStream, a novel neural audio codec that can efficiently compress speech, music and general audio at bitrates normally targeted by speech-tailored codecs. SoundStream relies on a model architecture composed by a fully convolutional encoder/decoder network and a residual vector quantizer, which are trained jointly end-to-end.

arxiv.org