Entry

MuseMorphose: Music Style Transfer with a Transformer-based VAE

Simple Title

Wu, S.-L. and Yang, Y.-H. (2021) ‘MuseMorphose: Full-Song and Fine-Grained Music Style Transfer with Just One Transformer VAE’

Description

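Combines the strength of Transformers, which can learn long-term temporal dependencies, with the strength of VAEs, which are highly controllable. The resulting encoder-decoder architecture achieves style transfer of music represented as MIDI.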

Type

Paper

Year

2021

Posted at

May 21, 2021

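Overview - What's impressive?

Transformers can learn long-term temporal dependencies; VAEs are highly controllable. MuseMorphose combines the two in a single encoder-decoder architecture and uses it for style transfer of music represented as MIDI.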

Abstract

Transformers and variational autoencoders (VAE) have been extensively employed for symbolic (e.g., MIDI) domain music generation. While the former boast an impressive capability in modeling long sequences, the latter allow users to willingly exert control over different parts (e.g., bars) of the music to be generated. In this paper, we are interested in bringing the two together to construct a single model that exhibits both strengths. The task is split into two steps. First, we equip Transformer decoders with the ability to accept segment-level, time-varying conditions during sequence generation. Subsequently, we combine the developed and tested in-attention decoder with a Transformer encoder, and train the resulting MuseMorphose model with the VAE objective to achieve style transfer of long musical pieces, in which users can specify musical attributes including rhythmic intensity and polyphony (i.e., harmonic fullness) they desire, down to the bar level. Experiments show that MuseMorphose outperforms recurrent neural network (RNN) based prior art on numerous widely-used metrics for style transfer tasks.

Overview

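Step #1

Implement the Transformer decoder. To allow conditions that change bar by bar along the time axis, three mechanisms are tried: pre-attention, in-attention, and post-attention (described below). Of the three, in-attention turned out to work best.

Step #2

Combine the in-attention decoder with a Transformer encoder and train the whole model with the VAE objective, which yields a generative model that can be conditioned.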

Architecture

Encoder

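A Transformer-based encoder. It takes the tokens of each bar as input and outputs a 512-dimensional embedding vector.

The (average-pooled) hidden-layer output of a standard next-token prediction model is used as the embedding vector.

Conceptual diagram of the encoder

Tokens fed to the encoder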
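The average pooling mentioned above can be sketched as follows. This is a minimal illustration only: `mean_pool` and the toy 2-dimensional hidden states are hypothetical names and values, not taken from the paper's code, and a real model would pool 512-dimensional vectors.

```python
from typing import List


def mean_pool(hidden_states: List[List[float]]) -> List[float]:
    """Average per-timestep hidden vectors into one bar-level embedding."""
    if not hidden_states:
        raise ValueError("need at least one timestep")
    n = len(hidden_states)
    dim = len(hidden_states[0])
    # Average each dimension over all timesteps of the bar.
    return [sum(h[d] for h in hidden_states) / n for d in range(dim)]


# Toy example: a bar with two timesteps and 2-dimensional hidden states.
bar_hidden = [[1.0, 2.0], [3.0, 4.0]]
bar_embedding = mean_pool(bar_hidden)
```

Whatever the length of a bar's token sequence, the pooled vector has a fixed size, which is what lets it serve as a per-bar condition.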

Decoder

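Like the encoder, the decoder is a next-token prediction model. Here, however, the embedding vector output by the encoder is also used.

The encoder's output embedding vector is used inside every self-attention layer (in-attention) (figure at right).

Pre-attention and post-attention, which add the embedding vector only once, before or after the stack of self-attention layers, were also tried, but in-attention worked best.

How in-attention works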
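To make the difference between the three variants concrete, here is a rough sketch of where the condition vector is injected in each case. It rests on several assumptions: the condition is already projected to the hidden size, it is injected by simple addition, and each "layer" is an arbitrary function standing in for a real self-attention block. `run_decoder` and `halve` are hypothetical names.

```python
from typing import Callable, List

Vector = List[float]
Layer = Callable[[Vector], Vector]


def add(u: Vector, v: Vector) -> Vector:
    return [a + b for a, b in zip(u, v)]


def run_decoder(h: Vector, cond: Vector, layers: List[Layer], mode: str) -> Vector:
    """Pass one timestep's hidden state through the layer stack.

    mode = "pre":  condition added once, before the first layer
    mode = "in":   condition added to the input of every layer
    mode = "post": condition added once, after the last layer
    """
    if mode not in ("pre", "in", "post"):
        raise ValueError(f"unknown mode: {mode}")
    if mode == "pre":
        h = add(h, cond)
    for layer in layers:
        if mode == "in":
            h = add(h, cond)  # the condition is re-injected at every layer
        h = layer(h)
    if mode == "post":
        h = add(h, cond)
    return h


def halve(h: Vector) -> Vector:
    """Toy stand-in for a self-attention block (NOT real attention)."""
    return [0.5 * x for x in h]


layers = [halve, halve, halve]
h0, cond = [0.0, 0.0], [1.0, 1.0]
pre_out = run_decoder(h0, cond, layers, "pre")    # condition fades with depth
in_out = run_decoder(h0, cond, layers, "in")      # condition refreshed each layer
post_out = run_decoder(h0, cond, layers, "post")  # condition bypasses all layers
```

In this toy setting the pre-attention signal shrinks as it passes through each layer, while in-attention keeps feeding it back in. That is only an intuition for why repeated injection might help, not a result from the paper.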

MuseMorphose: User Control

$\left(X,\left\{\tilde{a}_{k}^{1}, \tilde{a}_{k}^{2}, \ldots, \tilde{a}_{k}^{J}\right\}_{k=1}^{K}\right)$ For each segment (bar) $k$ of the piece, the user can specify $J$ kinds of attributes $\tilde{a}_{k}^{1}, \tilde{a}_{k}^{2}, \ldots, \tilde{a}_{k}^{J}$ ($J = 2$ in this work). $X$: the source piece to be transformed

→ This makes it possible to build up or calm down only specific parts of a piece.

The conditions (attributes) that can be controlled at the bar level are:

Rhythmic intensity - the fraction of beats in the bar that contain at least one onset ($B$: number of beats; $n_{\text{onset}, b}$: number of onsets at beat $b$)

$$s^{\text{rhym}}=\frac{1}{B} \sum_{b=1}^{B} \mathbf{1}\left(n_{\text{onset}, b} \geq 1\right)$$

Polyphony score - the average number of notes sounding (onset or sustained) per quarter-note unit ($n_{\text{hold}, b}$: number of notes held at beat $b$)

$$s^{\text{poly}}=\frac{1}{B} \sum_{b=1}^{B}\left(n_{\text{onset}, b}+n_{\text{hold}, b}\right)$$
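The two formulas above translate almost directly into code. The sketch below assumes the per-beat counts $n_{\text{onset}, b}$ and $n_{\text{hold}, b}$ have already been extracted from the bar. The function names and the example counts are hypothetical.

```python
from typing import List


def rhythmic_intensity(n_onset: List[int]) -> float:
    """s^rhym: fraction of the B beats that contain at least one onset."""
    B = len(n_onset)
    if B == 0:
        raise ValueError("bar must contain at least one beat")
    return sum(1 for n in n_onset if n >= 1) / B


def polyphony(n_onset: List[int], n_hold: List[int]) -> float:
    """s^poly: average number of notes hit or held per beat."""
    B = len(n_onset)
    if B == 0 or len(n_hold) != B:
        raise ValueError("n_onset and n_hold must be non-empty and equally long")
    return sum(o + h for o, h in zip(n_onset, n_hold)) / B


# Toy bar with B = 4 beats.
n_onset = [2, 0, 1, 0]  # notes starting at each beat
n_hold = [0, 2, 1, 2]   # notes still sounding at each beat
s_rhym = rhythmic_intensity(n_onset)  # 2 of 4 beats have an onset
s_poly = polyphony(n_onset, n_hold)   # (2 + 2 + 2 + 2) / 4
```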

Overall architecture

Overall flow

$$
\begin{aligned}
\boldsymbol{z}_{k} &= \operatorname{enc}\left(X_{k}\right) \quad \text{for } 1 \leq k \leq K ; \\
\boldsymbol{c}_{k} &= \operatorname{concat}\left(\left[\boldsymbol{z}_{k} ; \boldsymbol{a}_{k}^{\text{rhym}} ; \boldsymbol{a}_{k}^{\text{poly}}\right]\right) \\
\boldsymbol{y}_{t} &= \operatorname{dec}\left(x_{<t} ; \boldsymbol{c}_{k}\right), \; t \in I_{k} \quad \text{for } 1 \leq k \leq K
\end{aligned}
$$

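$K$: number of bars. $X_k$: the note sequence of the $k$-th bar

$\boldsymbol{a}_{k}^{\text{rhym}}$, $\boldsymbol{a}_{k}^{\text{poly}}$: the embedded conditions (attributes)

$I_k$: the timesteps belonging to the $k$-th bar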
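The concatenation step and the role of $I_k$ can be sketched as below. The sketch assumes the per-bar latents and attribute embeddings are already computed. `build_conditions`, `conditions_per_timestep`, and `bar_index` (a list mapping each timestep to its 0-based bar, a compact encoding of the sets $I_k$) are hypothetical names, and the vectors are toy-sized.

```python
from typing import List

Vector = List[float]


def build_conditions(z: List[Vector], a_rhym: List[Vector],
                     a_poly: List[Vector]) -> List[Vector]:
    """c_k = concat([z_k; a_k^rhym; a_k^poly]) for every bar k."""
    if not (len(z) == len(a_rhym) == len(a_poly)):
        raise ValueError("all inputs must cover the same number of bars")
    return [zk + rk + pk for zk, rk, pk in zip(z, a_rhym, a_poly)]


def conditions_per_timestep(c: List[Vector], bar_index: List[int]) -> List[Vector]:
    """Look up, for each timestep t, the condition of the bar containing t."""
    return [c[k] for k in bar_index]


# Toy example with K = 2 bars.
z = [[0.1, 0.2], [0.3, 0.4]]  # per-bar latents from the encoder
a_rhym = [[1.0], [2.0]]       # embedded rhythmic-intensity attribute per bar
a_poly = [[5.0], [6.0]]       # embedded polyphony attribute per bar
c = build_conditions(z, a_rhym, a_poly)

bar_index = [0, 0, 0, 1, 1]   # timesteps 0-2 lie in bar 0, timesteps 3-4 in bar 1
per_step = conditions_per_timestep(c, bar_index)
```

Each decoder timestep therefore sees the condition of its own bar, which is how the conditioning can vary over time.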

$z_k$ follows the VAE formulation. The output of the encoder's last attention layer at the first timestep, $\boldsymbol{h}_{k, 1}^{L_{\text{enc}}}$, is mapped by the learnable weights $W_{\boldsymbol{\mu}}$ and $W_{\boldsymbol{\sigma}}$ to the $\mu$ and $\sigma$ of a normal distribution.

$\boldsymbol{\mu}_{k}=\left(\boldsymbol{h}_{k, 1}^{L_{\mathrm{enc}}}\right)^{\top} W_{\boldsymbol{\mu}} \quad \boldsymbol{\sigma}_{k}=\left(\boldsymbol{h}_{k, 1}^{L_{\mathrm{enc}}}\right)^{\top} W_{\boldsymbol{\sigma}}$

$z_k$ is then sampled from that distribution.

$\boldsymbol{z}_{k} \sim \mathcal{N}\left(\boldsymbol{\mu}_{k}, \operatorname{diag}\left(\boldsymbol{\sigma}_{k}^{2}\right)\right)$
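A minimal sketch of this projection and sampling step follows. It draws the sample as $\mu + \sigma \cdot \epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$, the standard reparameterization used for VAEs. The notes above do not state that this is how the authors implement it. The KL term against a standard normal prior is likewise the textbook VAE regularizer, included here as an assumption. All function names and values are hypothetical, and a plain linear map does not guarantee $\sigma > 0$, so the example simply uses positive values.

```python
import math
import random
from typing import List

Vector = List[float]


def project(h: Vector, W: List[Vector]) -> Vector:
    """Compute h^T W, with W given as len(h) rows of latent_dim columns."""
    latent_dim = len(W[0])
    return [sum(h[i] * W[i][j] for i in range(len(h))) for j in range(latent_dim)]


def sample_latent(mu: Vector, sigma: Vector, rng: random.Random) -> Vector:
    """Draw z ~ N(mu, diag(sigma^2)) as mu + sigma * eps, eps ~ N(0, 1)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def kl_to_standard_normal(mu: Vector, sigma: Vector) -> float:
    """KL( N(mu, diag(sigma^2)) || N(0, I) ); assumes every sigma > 0."""
    return 0.5 * sum(m * m + s * s - 1.0 - math.log(s * s)
                     for m, s in zip(mu, sigma))


# Toy example: 2-dimensional hidden state and latent.
h = [1.0, 2.0]
W_mu = [[1.0, 0.0], [0.0, 1.0]]
W_sigma = [[0.5, 0.0], [0.0, 0.25]]
mu = project(h, W_mu)        # [1.0, 2.0]
sigma = project(h, W_sigma)  # [0.5, 0.5]
z = sample_latent(mu, sigma, random.Random(0))
kl = kl_to_standard_normal(mu, sigma)
```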

Data

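Music is handled as symbolic, MIDI-based information. The dataset used is the Cleansed Lakh dataset.

The representation proposed in PerformanceRNN describes music with Note-On/Note-Off events plus the time shift from the previous event and velocity (note loudness), but on its own it carries no information about position within the bar.

The paper therefore uses the format called Revamped MIDI-derived events (REMI), proposed in Pop Music Transformer.

REMI handles the following information:

BAR (bar line) and POSITION (position within the bar, 1-16)
TEMPO (tempo)
NOTE-ON / NOTE-DURATION (a note onset paired with that note's duration)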
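To give a feel for what such a sequence looks like, the sketch below turns a handful of notes into REMI-like tokens using only the event types listed above. It is a simplification under stated assumptions: the token spellings, the `to_remi_like` function, the `(bar, position, pitch, duration)` note tuples, and the choice to emit one TEMPO token at the first position of each bar are all hypothetical. The actual REMI vocabulary has further event types not covered here.

```python
from typing import List, Tuple

# (bar index, position within bar 1-16, MIDI pitch, duration in position units)
Note = Tuple[int, int, int, int]


def to_remi_like(notes: List[Note], tempo_bpm: int) -> List[str]:
    """Serialize notes into a flat list of REMI-like event tokens."""
    tokens: List[str] = []
    prev_bar, prev_pos = None, None
    new_bar = False
    for bar, pos, pitch, dur in sorted(notes):
        if not 1 <= pos <= 16:
            raise ValueError("position must be in 1..16")
        if bar != prev_bar:
            tokens.append("BAR")  # marks the start of a new bar
            prev_bar, prev_pos, new_bar = bar, None, True
        if pos != prev_pos:
            tokens.append(f"POSITION_{pos}")
            prev_pos = pos
            if new_bar:
                # Simplification: one tempo token per bar, at its first position.
                tokens.append(f"TEMPO_{tempo_bpm}")
                new_bar = False
        # A note is always an onset paired with its duration.
        tokens.append(f"NOTE_ON_{pitch}")
        tokens.append(f"NOTE_DURATION_{dur}")
    return tokens


notes = [(0, 1, 60, 4), (0, 1, 64, 4), (0, 9, 67, 8), (1, 1, 72, 16)]
tokens = to_remi_like(notes, tempo_bpm=120)
```

Because BAR and POSITION are explicit tokens, the model can tell where in the bar each note falls, which is the information the time-shift representation lacks.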

Results

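The results confirm that the model can generate diverse, musically rich music that follows the conditioning while retaining the characteristics of the original piece.

Original piece

Generated examples

Both rhythmic intensity and polyphony high

Gradually raising rhythmic intensity and polyphony

Gradually lowering rhythmic intensity and polyphony

Long pieces too! (121 bars)

Original piece

Generated example

Raising both rhythmic intensity and polyphony...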

Further Thoughts

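The paper summarizes recent research well, so it can also be used as a survey (especially on conditioning)
It is also instructive on how to evaluate generative models of music quantitatively
There are still many parts I don't fully understand, so I'll reread it later!!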

Links

The original Transformer paper

📄Attention is All You Need

Music generation with Transformers

📄Music transformer: Generating music with long-term structure

Pop Music Transformer: Beat-based Modeling and Generation of Expressive Pop Piano Compositions

A great number of deep learning based models have been recently proposed for automatic music composition. Among these models, the Transformer stands out as a prominent approach for generating expressive classical piano performance with a coherent structure of up to one minute.

