Entry

Music Generation with GANs – MidiNet: A Convolutional Generative Adversarial Network for Symbolic-domain Music Generation using 1D and 2D Conditions

Simple Title

Yang, Li-Chia, Szu-Yu Chou, and Yi-Hsuan Yang. "Midinet: A convolutional generative adversarial network for symbolic-domain music generation." arXiv preprint arXiv:1703.10847 (2017).

Description

Music generation with GANs

Type

Paper

Year

2017

Posted at

July 9, 2017

Overview

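An attempt to generate music with a GAN (generative adversarial network). Following the way people commonly write songs, first deciding on a chord progression and then composing the melody one bar at a time, the model generates the melody of the next bar from two conditions: the chord and the melody of the preceding bar.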

Abstract

In this paper, we present MidiNet, a deep convolutional neural network (CNN) based generative adversarial network (GAN) that is intended to provide a general, highly adaptive network structure for symbolic-domain music generation. The network takes random noise as input and generates a melody sequence one measure (bar) after another. Moreover, it has a novel reflective CNN sub-model that allows us to guide the generation process by providing not only 1D but also 2D conditions. In our implementation, we used the intended chord of the current bar as a 1D condition to provide a harmonic context, and the melody generated for the preceding bar previously as a 2D condition to provide sequential information. The output of the network is a 16 by 128 matrix each time, representing the presence of each of the 128 MIDI notes in the generated melody sequence of that bar, with the smallest temporal unit being the sixteenth note. MidiNet can generate music of arbitrary number of bars, by concatenating these 16 by 128 matrices. The melody sequence can then be played back with a synthesizer. We provide example clips showing the effectiveness of MidiNet in generating harmonic music.

Motivation

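One bar of melody is represented as a 16×128 two-dimensional array: 16 sixteenth-note time steps per bar by the 128 pitches that MIDI can represent. Both the generator and the discriminator of the GAN are based on convolutional neural networks. The discriminator, which classifies a melody as either generated or taken from the training data, looks straightforward to implement, whereas the generator includes some new ideas.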
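To make the 16×128 representation concrete, here is a minimal NumPy sketch. The function name `melody_to_bar` and the `(pitch, start_step, length)` note format are assumptions made for illustration; they are not taken from the paper or its code.

```python
import numpy as np

STEPS_PER_BAR = 16   # sixteenth-note resolution, one bar
NUM_PITCHES = 128    # MIDI pitches 0..127


def melody_to_bar(notes):
    """Encode one bar as a 16x128 binary matrix.

    `notes` is a list of (midi_pitch, start_step, length_in_steps) tuples.
    This input format is hypothetical, chosen only for this example.
    """
    bar = np.zeros((STEPS_PER_BAR, NUM_PITCHES), dtype=np.float32)
    for pitch, start, length in notes:
        end = min(start + length, STEPS_PER_BAR)  # clip notes at the bar line
        bar[start:end, pitch] = 1.0
    return bar


# C4 for a quarter note, E4 for a quarter note, G4 for a half note.
example_notes = [(60, 0, 4), (64, 4, 4), (67, 8, 8)]
bar = melody_to_bar(example_notes)

# Longer pieces are obtained by stacking bars along the time axis.
two_bars = np.concatenate([bar, bar], axis=0)
```

Each row is one sixteenth-note step and each column one MIDI pitch, so a monophonic melody has at most one active cell per row.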

Architecture

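The generator basically takes a random one-dimensional vector (Noise Z in the figure below) and produces the 16×128 array described above through so-called deconvolution (transposed convolution) layers. At each layer, two pieces of information are concatenated, the chord (1D conditions in the figure) and the melody of the preceding bar (2D conditions), which is how melody generation is conditioned on both.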
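The concatenation step can be sketched in NumPy as follows. This is only an illustration of one common way to attach conditions to a feature map (tiling the 1D vector over every position and stacking along the channel axis); the function name `concat_conditions`, the channel counts, and the 13-dimensional chord vector are assumptions, not values confirmed by this summary.

```python
import numpy as np


def concat_conditions(features, cond_1d, cond_2d):
    """Attach a 1D and a 2D condition to a feature map along the channel axis.

    features: (C, H, W)  intermediate generator activation
    cond_1d:  (N,)       e.g. a chord vector, tiled to every (H, W) position
    cond_2d:  (C2, H, W) e.g. features of the previous bar, same H and W
    """
    _, h, w = features.shape
    tiled = np.broadcast_to(cond_1d[:, None, None], (cond_1d.shape[0], h, w))
    return np.concatenate([features, tiled, cond_2d], axis=0)


rng = np.random.default_rng(0)
features = rng.standard_normal((128, 4, 1))          # hypothetical layer output
chord = np.zeros(13)                                 # hypothetical chord encoding
chord[0] = 1.0
prev_bar_features = rng.standard_normal((16, 4, 1))  # hypothetical 2D condition

out = concat_conditions(features, chord, prev_bar_features)
print(out.shape)  # (157, 4, 1): 128 + 13 + 16 channels
```

The 2D condition can only be stacked this way if its height and width match the generator layer it is attached to, which is why it needs its own network to bring it to the right size.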

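The melody of the preceding bar is also a two-dimensional array, but to concatenate it to each intermediate layer of the generator it must be brought to the same size as that layer, so a separate network is prepared for this purpose. The key point is that this network has as many convolution layers as the generator and is trained jointly with it (the model drawn in light blue at the top left). Because it looks like an exact mirror image of the generator's deconvolution layers, the authors call it a reflective CNN.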
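The mirroring can be checked with the standard output-size formulas for convolution and transposed convolution without padding. The kernel size, stride, and number of layers below are assumptions chosen so that the time axis goes from 2 to 16 steps; the actual hyperparameters are not given in this summary.

```python
def conv_out(size, kernel, stride):
    """Output length of an unpadded convolution along one axis."""
    return (size - kernel) // stride + 1


def deconv_out(size, kernel, stride):
    """Output length of an unpadded transposed convolution along one axis."""
    return (size - 1) * stride + kernel


KERNEL, STRIDE, LAYERS = 2, 2, 3  # hypothetical values for illustration

# Generator: the time axis grows layer by layer up to 16 steps.
gen_sizes = [2]
for _ in range(LAYERS):
    gen_sizes.append(deconv_out(gen_sizes[-1], KERNEL, STRIDE))

# Conditioning network: the previous bar's 16 steps shrink layer by layer.
cond_sizes = [16]
for _ in range(LAYERS):
    cond_sizes.append(conv_out(cond_sizes[-1], KERNEL, STRIDE))

print(gen_sizes)   # [2, 4, 8, 16]
print(cond_sizes)  # [16, 8, 4, 2]
```

Read in reverse, the conditioning network's sizes line up with the generator's, so each of its outputs can be concatenated to the generator layer of the same size.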

Results

Below are examples of the generated music. What do you think? The model is reported to have been trained on 1,022 MIDI files, which seems a little small.

Further Thoughts

GANs have been studied extensively for images, but applications to music still seem few. It looks like a field well worth exploring.