Entry

Type

Paper

Year

2018

Posted at

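Overview - What's impressive?

The Transformer showed that long-range dependencies can be learned at low cost in NLP. Music also has temporal dependencies at many resolutions, from micro to macro, so the idea is that the Transformer / self-attention approach could be applied to music as well.

As a result, the paper shows that it can generate pieces four times as long (thousands of steps) as those modeled by Performance RNN.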

Abstract

Music relies heavily on repetition to build structure and meaning. Self-reference occurs on multiple timescales, from motifs to phrases to reusing of entire sections of music, such as in pieces with ABA structure. The Transformer (Vaswani et al., 2017), a sequence model based on self-attention, has achieved compelling results in many generation tasks that require maintaining long-range coherence. This suggests that self-attention might also be well-suited to modeling music. In musical composition and performance, however, relative timing is critically important. Existing approaches for representing relative positional information in the Transformer modulate attention based on pairwise distance (Shaw et al., 2018). This is impractical for long sequences such as musical compositions since their memory complexity for intermediate relative information is quadratic in the sequence length. We propose an algorithm that reduces their intermediate memory requirement to linear in the sequence length. This enables us to demonstrate that a Transformer with our modified relative attention mechanism can generate minute-long compositions (thousands of steps, four times the length modeled in Oore et al. (2018)) with compelling structure, generate continuations that coherently elaborate on a given motif, and in a seq2seq setup generate accompaniments conditioned on melodies. We evaluate the Transformer with our relative attention mechanism on two datasets, JSB Chorales and Piano-e-Competition, and obtain state-of-the-art results on the latter.

Motivation

We want to learn music-specific structure, such as repetition, at low cost!

Architecture

Data representation

Adopts the representation scheme of Performance RNN.
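As a rough illustration of that representation: Performance RNN describes a performance as a stream of events drawn from 128 NOTE_ON, 128 NOTE_OFF, 100 TIME_SHIFT (10 ms steps up to 1 s) and 32 velocity-bin events, 388 in total. The sketch below maps such events to integer token ids. The id layout (which event family comes first) and the helper names are my own assumptions for illustration, not taken from any particular implementation.

```python
# Event vocabulary sizes as described for Performance RNN.
NUM_PITCHES = 128
NUM_TIME_SHIFTS = 100      # 10 ms, 20 ms, ..., 1000 ms
NUM_VELOCITY_BINS = 32

# Hypothetical id layout: NOTE_ON, NOTE_OFF, TIME_SHIFT, VELOCITY.
NOTE_ON_OFFSET = 0
NOTE_OFF_OFFSET = NOTE_ON_OFFSET + NUM_PITCHES
TIME_SHIFT_OFFSET = NOTE_OFF_OFFSET + NUM_PITCHES
VELOCITY_OFFSET = TIME_SHIFT_OFFSET + NUM_TIME_SHIFTS
VOCAB_SIZE = VELOCITY_OFFSET + NUM_VELOCITY_BINS


def note_on(pitch: int) -> int:
    return NOTE_ON_OFFSET + pitch


def note_off(pitch: int) -> int:
    return NOTE_OFF_OFFSET + pitch


def time_shift(ms: int) -> int:
    steps = ms // 10                      # quantize to 10 ms steps
    if not 1 <= steps <= NUM_TIME_SHIFTS:
        raise ValueError("time shift must be between 10 ms and 1000 ms")
    return TIME_SHIFT_OFFSET + steps - 1


def velocity(midi_velocity: int) -> int:
    # Quantize MIDI velocity 0..127 into 32 bins.
    return VELOCITY_OFFSET + midi_velocity * NUM_VELOCITY_BINS // 128


# Middle C (pitch 60) played at MIDI velocity 80 and held for 500 ms.
tokens = [velocity(80), note_on(60), time_shift(500), note_off(60)]
print(VOCAB_SIZE, tokens)
```

A sequence of such ids is what the model consumes, one token per step, which is why a minute of expressive piano easily runs to thousands of steps.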

Relative Position Self-Attention

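The technical core of this paper.

The original Transformer paper encoded timing (position) information with sine and cosine functions. What "Self-Attention with Relative Position Representations" introduced is taking the relative distance between Q and K into account.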

Original self-attention

\text{Attention}(Q, K, V)=\operatorname{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_{k}}}\right) V
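A minimal NumPy sketch of single-head scaled dot-product attention, with the logits divided by $\sqrt{d_k}$ as in Vaswani et al. (2017). The function names and the toy shapes are my own choices for illustration; there is no masking, batching, or multi-head logic here.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(Q, K, V):
    """Single-head scaled dot-product attention. Q, K, V: (L, d_k)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (L, L) logits
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights


rng = np.random.default_rng(0)
L, d = 5, 8
Q, K, V = (rng.standard_normal((L, d)) for _ in range(3))
out, weights = attention(Q, K, V)
print(out.shape, weights.sum(axis=-1))
```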

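Relative Position Self-Attention used in this paper

\text{RelativeAttention}=\operatorname{softmax}\left(\frac{Q K^{\top}+S^{rel}}{\sqrt{D_{h}}}\right) V

$S^{rel}$ is an $(l, l)$ tensor ($l$ is the sequence length) that represents relative distances. It is computed as $S^{rel}=Q R^{\top}$, where $R$ is an $(l, l, d)$ tensor that holds a $d$-dimensional embedding for each of the $(l, l)$ query-key pairs, looked up by their relative distance.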
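To make the shapes concrete, here is a sketch of the memory-hungry formulation, where $R$ is materialized explicitly. It assumes a single head, causal (decoder-style) attention, and a table `E` of shape $(l, d)$ whose row `r` holds the embedding for relative distance `r - (l - 1)`, so the last row is distance 0. The names `E` and `relative_attention_naive` are mine, and this is an illustration rather than the authors' code.

```python
import numpy as np


def relative_attention_naive(Q, K, V, E):
    """Causal relative attention with an explicit (L, L, D) tensor R."""
    L, D = Q.shape
    i = np.arange(L)[:, None]            # query positions
    j = np.arange(L)[None, :]            # key positions
    rel = j - i                          # relative distance, <= 0 for past keys
    causal = rel <= 0

    # One embedding per (query, key) pair; future pairs get a dummy row
    # because they are masked out below anyway.
    R = E[np.where(causal, rel + L - 1, 0)]          # (L, L, D)
    S_rel = np.einsum("id,ijd->ij", Q, R)            # S_rel[i, j] = Q[i] . R[i, j]

    logits = (Q @ K.T + S_rel) / np.sqrt(D)
    logits = np.where(causal, logits, -np.inf)       # hide future positions
    logits -= logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, R


rng = np.random.default_rng(0)
L, D = 6, 4
Q, K, V, E = (rng.standard_normal((L, D)) for _ in range(4))
out, R = relative_attention_naive(Q, K, V, E)
print(out.shape, R.shape)   # (6, 4) (6, 6, 4)
```

The thing to notice is `R.shape`: it grows with the square of the sequence length times the embedding size, even though it only ever contains copies of the $l$ rows of `E`.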

Skewing

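The problem is that this $R$ is far too large: its memory footprint is on the order of $O\left(L^{2} D\right)$. The paper therefore uses what it calls skewing, a bit of array shuffling, to bring this down to $O\left(L D\right)$. That is the technical point of this paper, but honestly my understanding of what the skewing computation actually means hasn't caught up yet 💦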
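As far as I can reconstruct it, the skewing trick for the causal case goes like this: multiply $Q$ directly by the $(L, D)$ embedding table to get an $(L, L)$ matrix indexed by (query position, relative distance), then pad, reshape, and slice so that entries land at (query position, key position). The sketch below assumes a single head and that row `r` of `E` holds the embedding for relative distance `r - (L - 1)`. `E`, `skew`, and `relative_logits_skewed` are names I picked, so treat this as a sketch, not the reference implementation.

```python
import numpy as np


def skew(QEt):
    """Re-index (query, relative distance) -> (query, key) for j <= i."""
    L = QEt.shape[0]
    padded = np.pad(QEt, ((0, 0), (1, 0)))   # dummy column on the left: (L, L+1)
    reshaped = padded.reshape(L + 1, L)      # same data, rows now shifted
    return reshaped[1:]                      # drop the first row: (L, L)


def relative_logits_skewed(Q, E):
    # Only Q @ E.T of shape (L, L) is materialized; no (L, L, D) tensor.
    return skew(Q @ E.T)


rng = np.random.default_rng(0)
L, D = 6, 4
Q = rng.standard_normal((L, D))
E = rng.standard_normal((L, D))
S_rel = relative_logits_skewed(Q, E)

# Check against the definition S_rel[i, j] = Q[i] . E[j - i + L - 1] for j <= i.
for i in range(L):
    for j in range(i + 1):
        assert np.isclose(S_rel[i, j], Q[i] @ E[j - i + L - 1])
print("lower triangle matches the explicit definition")
```

Why it works: after padding, each row has $L+1$ entries, so reinterpreting the buffer with rows of length $L$ shifts every successive row by one more column, which is exactly the shift between "relative distance" and "absolute key position". Entries above the diagonal (future keys) end up holding leftover values, so they still need the usual causal mask.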

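In addition, the sequence is chopped into smaller blocks, and each position attends only within its own block and the immediately preceding block, which reduces the cost further (Relative Local Attention).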
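The block restriction can be pictured as an attention mask. The sketch below only builds that mask for the causal case (own block plus the previous block); it does not implement the paper's local version of skewing. `local_causal_mask` and `block_len` are hypothetical names.

```python
import numpy as np


def local_causal_mask(L, block_len):
    """True where query i may attend to key j."""
    i = np.arange(L)[:, None]
    j = np.arange(L)[None, :]
    block_i = i // block_len
    block_j = j // block_len
    same_or_previous = (block_j == block_i) | (block_j == block_i - 1)
    return (j <= i) & same_or_previous


mask = local_causal_mask(L=8, block_len=2)
print(mask.astype(int))
```

Each query sees at most `2 * block_len` keys regardless of the total sequence length, so the per-query cost stops growing with $L$.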

Results

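First, the paper confirms that the validation numbers improve.

It can generate far longer pieces than earlier RNN-based models
It greatly reduces the memory needed to train self-attention, on the scale of 8.5 GB down to 4.2 MB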
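A quick back-of-the-envelope check of those two figures. It assumes $L = 2048$, $D = 512$ and 4-byte floats; these values are my assumption (I believe they match the paper's Piano-e-Competition setting), chosen because they reproduce the quoted numbers.

```python
L = 2048             # sequence length (assumed)
D = 512              # hidden size (assumed)
BYTES_PER_FLOAT = 4  # float32 (assumed)

naive_bytes = L * L * D * BYTES_PER_FLOAT   # explicit R: O(L^2 D)
skewed_bytes = L * D * BYTES_PER_FLOAT      # embedding table only: O(L D)

print(f"O(L^2 D): {naive_bytes / 1e9:.2f} GB")   # 8.59 GB
print(f"O(L D):   {skewed_bytes / 1e6:.2f} MB")  # 4.19 MB
print(f"ratio:    {naive_bytes // skewed_bytes}x")
```

The ratio between the two is exactly $L$, so the saving grows linearly with the sequence length.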

Examples of generated pieces

Prior work: RNN (Performance RNN)

Further Thoughts

I still don't really understand the skewing procedure...

Links

Music Transformer: Generating Music with Long-Term Structure - Magenta blog

Music Transformer: Generating Music with Long-Term Structure

Update (9/16/19): Play with Music Transformer in an interactive colab! Generating long pieces of music is a challenging problem, as music contains structure at multiple timescales, from millisecond timings to motifs to phrases to repetition of entire sections. We present Music Transformer, an attention-based neural network that can generate music with improved long-term coherence.

magenta.tensorflow.org

Music Transformer: Generating Music with Long-Term Structure

Understanding Music Transformer - gudgud96's Blog

TLDR: This blog will discuss:
1 - Concepts discussed in the Music Transformer paper
2 - Background of Relative Attention, Relative Global Attention, Relative Local Attention
3 - Ideas in the paper to efficiently implement the above attention mechanisms
4 - Results on music generation
I personally suffer from a lot of pain points when I first try to understand the Music Transformer paper, especially the details in this paper about relative attention, local attention, and also the "skewing" procedure.

gudgud96.github.io

Generating piano music with Music Transformer

Explains how Google Magenta's Music Transformer generates piano music from scratch.

wandb.ai

Original Transformer paper

📄Attention is All You Need

Self-Attention with Relative Position Representations

Relying entirely on an attention mechanism, the Transformer introduced by Vaswani et al. (2017) achieves state-of-the-art results for machine translation. In contrast to recurrent and convolutional neural networks, it does not explicitly model relative or absolute position information in its structure. Instead, it requires adding representations of absolute positions to its inputs.

arxiv.org

Performance RNN

This Time with Feeling: Learning Expressive Musical Performance

Music generation has generally been focused on either creating scores or interpreting them. We discuss differences between these two problems and propose that, in fact, it may be valuable to work in the space of direct $\it performance$ generation: jointly predicting the notes $\it and$ $\it also$ their expressive timing and dynamics.

arxiv.org

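Analysis of the repetition structure produced by Music Transformer, and more

The Impact of the Automatic Composition AI Music Transformer - Qiita

Please listen to this piece. It is one of the pieces generated by Google's automatic composition model, Music Transformer. The input is only the first 6 seconds: the model is given just the opening 6 seconds of Debussy's "Clair de Lune" and asked to compose everything that follows. The result is a piece you can listen to intently and be moved by. It is not beyond criticism, but it is good enough that you could not tell the difference if someone told you it was written by a person.

Music Transformer is an automatic composition AI announced by Google in 2018. By applying the Transformer, an algorithm from natural language processing, to music, it made possible music generation of far higher quality than anything before it. In 2019, OpenAI's MuseNet followed, using GPT-2 (a natural language processing network). This article explains the Music Transformer algorithm and how it solved existing problems.

・Music Transformer generates the next note sequentially from the preceding notes
・In doing so, it refers back to earlier notes as appropriate through attention weights
・As a result, repetition structure emerges easily

As background, the article explains the Transformer from natural language processing, a technology also used in Google Translate. A conventional RNN processes a sequence by feeding the output produced after each input back into the input layer. The drawback is that past information gets diluted. To compensate, LSTM (Long Short-Term Memory), which adds a forget gate, was devised. It improved performance but was still insufficient for learning and generating long sequences well. This led to the idea of also learning the attention weights, the parameters that decide where in the past sequence to look. By referring directly to past inputs, past information is no longer lost even in long sequences. The natural language processing model that uses this attention mechanism is called the Transformer.

Past words are weighted and summed, for example 0.1 for one word back, 0.1 for two words back, 0.3 for three words back, and so on. Which words are referenced depends on context. Because even the strength of each reference to earlier words is learned, the model can translate or continue a sentence while flexibly changing what it refers to depending on context.

・Feed the words into the encoder one at a time
・Input the token that marks the start of translation
・Compute a score (representing relevance) as a function of the current decoder state and the past encoder states
・Normalize the scores with softmax so that they sum to 1; these become the attention weights
・Take the weighted sum of the past encoder states using the attention weights
・Concatenate the context vector with the current hidden state to produce the predictive distribution
・Pass this predictive distribution through softmax to obtain the output word (e.g. "Je")
・The next decoder input is the embedding of the previous output word ("Je") concatenated with the context vector
・Stop when the decoder outputs the token that marks the end of the sentence

For details, see Neural machine translation with attention.

An easy-to-follow example: the attention weights of a pronoun in different contexts. In the attention mechanism, the hidden state changes depending on which words were encoded earlier, so the computed attention weights also change with context. The meaning and context of the word "it" are not determined by the word alone but by the inputs around it. (Figure adapted from Google AI Blog.)

Transformer pre-training uses tasks such as predicting a single masked word and judging whether two given sentences are adjacent. Attention itself has no supervised labels, but as training proceeds to reduce the error of the output, the Transformer acquires context. Training a Transformer requires large, high-quality data. As you would expect from Google, they used YouTube. ...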

qiita.com

PyTorch implementation

jason9693/MusicTransformer-pytorch

This Repository is perfectly compatible with pytorch. Domain: Dramatically reduces the memory footprint, allowing it to scale to musical sequences on the order of minutes. Algorithm: Reduced space complexity of Transformer from O(N^2D) to O(ND). In this repository using single track method (2nd method in paper.).

github.com

Music transformer: Generating music with long-term structure