Entry

Simple Title

Garcia, Hugo Flores, Prem Seetharaman, Rithesh Kumar, and Bryan Pardo. 2023. “VampNet: Music Generation via Masked Acoustic Token Modeling.” arXiv [cs.SD]. arXiv. http://arxiv.org/abs/2307.04686.

Type

Paper

Year

2023

Posted at

June 2, 2021

Overview

One-Line Summary

Abstract

We introduce VampNet, a masked acoustic token modeling approach to music synthesis, compression, inpainting, and variation. We use a variable masking schedule during training which allows us to sample coherent music from the model by applying a variety of masking approaches (called prompts) during inference. VampNet is non-autoregressive, leveraging a bidirectional transformer architecture that attends to all tokens in a forward pass. With just 36 sampling passes, VampNet can generate coherent high-fidelity musical waveforms. We show that by prompting VampNet in various ways, we can apply it to tasks like music compression, inpainting, outpainting, continuation, and looping with variation (vamping). Appropriately prompted, VampNet is capable of maintaining style, genre, instrumentation, and other high-level aspects of the music. This flexible prompting capability makes VampNet a powerful music co-creation tool. Code and audio samples are available online.

Motivation

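“To vamp”: to repeat a short phrase while varying it a little each time.

This model can likewise generate phrases that vary while staying coherent.

Because it is not autoregressive and can decode in parallel, the generation process is faster.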

Architecture

Audio tokenization is based on the Descript Audio Codec (DAC).

Kumar, Rithesh, Prem Seetharaman, Alejandro Luebs, Ishaan Kumar, and Kundan Kumar. 2023. “High-Fidelity Audio Compression with Improved RVQGAN.” arXiv [cs.SD]. arXiv. http://arxiv.org/abs/2306.06546.
The method presented in this paper can also be applied to other tokenizers.

System Overview

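MusicLM, AudioLM, Jukebox, and similar models use tokens autoregressively: they predict the next token from the sequence of preceding tokens. This is inevitably slow, and it narrows the range of applications, since the model can only predict a continuation.
VampNet instead uses the idea of a parallel iterative decoding procedure.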

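It adapts the training method used in language models such as BERT: tokens in the sequence are randomly hidden (masked), and the model learns to predict the masked tokens.
A single forward pass predicts the whole sequence, but on its own that prediction is not very accurate. So the model predicts tokens repeatedly while gradually lowering the masking probability, and accuracy improves a little with each pass.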
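
The iterative procedure described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `toy_predict` is a hypothetical stand-in for the bidirectional transformer that returns a random token and confidence for every position, the cosine `mask_ratio` schedule is an assumed choice that these notes do not specify, and the tokens are a flat list rather than the multi-codebook codec tokens a real audio tokenizer produces.

```python
import math
import random

MASK = -1  # hypothetical sentinel marking a masked position


def toy_predict(tokens, rng, vocab_size=8):
    """Stand-in for the model: one (token, confidence) guess per position."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def mask_ratio(step, num_steps):
    """Assumed cosine schedule: fraction left masked after this step."""
    return math.cos(math.pi / 2 * (step + 1) / num_steps)


def iterative_decode(length, num_steps, seed=0, prompt=None):
    """Fill masked positions over several passes, most confident first."""
    rng = random.Random(seed)
    tokens = list(prompt) if prompt is not None else [MASK] * length
    n_masked_start = tokens.count(MASK)
    for step in range(num_steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        # One forward pass proposes a token for every position at once.
        preds = toy_predict(tokens, rng)
        # The schedule decides how many positions stay masked after this pass.
        n_keep_masked = int(mask_ratio(step, num_steps) * n_masked_start)
        # Commit the most confident proposals; the rest are re-predicted later.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[: len(masked) - n_keep_masked]:
            tokens[i] = preds[i][0]
    return tokens


if __name__ == "__main__":
    # Unconditional generation: every position starts masked.
    print(iterative_decode(length=16, num_steps=4))
    # Inpainting-style prompt: known tokens are kept, masked ones are filled.
    print(iterative_decode(length=4, num_steps=4, prompt=[3, MASK, MASK, 5]))
```

The number of model calls is `num_steps` regardless of sequence length, which is where the speed advantage over token-by-token autoregressive decoding comes from; the abstract's 36 sampling passes would correspond to `num_steps=36` in this sketch. Leaving some positions unmasked in `prompt` is how a single procedure covers inpainting, continuation, and similar tasks.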

Results

Further Thoughts

The pretrained models are released under a non-commercial Creative Commons license, but it is nice that the code itself is MIT-licensed.
A team at Google published a similar parallel decoding method around the same time. Borsos, Zalán, Matt Sharifi, Damien Vincent, Eugene Kharitonov, Neil Zeghidour, and Marco Tagliasacchi. 2023. “SoundStorm: Efficient Parallel Audio Generation.” arXiv [cs.SD]. arXiv. http://arxiv.org/abs/2305.09636.