Entry

Music generation with a GAN that separately maximizes musical plausibility and novelty: Musicality-Novelty GAN

Simple Title

Chen, Gong, Yan Liu, Sheng-Hua Zhong, and Xiang Zhang. 2018. “Musicality-Novelty Generative Adversarial Nets for Algorithmic Composition.” In Proceedings of the 26th ACM International Conference on Multimedia, 1607–15. MM ’18. New York, NY, USA: Association for Computing Machinery.

Description

An ambitious study that tries to use AI to generate genuinely new music rather than imitations of human work

Type

Paper

Year

2018

Posted at

August 7, 2022

Overview

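A GAN-based music generation model. By building a GAN architecture that alternately maximizes musicality (how plausible the music sounds, i.e., its similarity to the training data) and novelty, it aims to generate new, unique music with AI.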

Abstract

Algorithmic composition, which enables computer to generate music like human composers, has lasting charm because it intends to approximate artistic creation, most mysterious part of human intelligence. To deliver both melodious and refreshing music, this paper proposes the Musicality-Novelty Generative Adversarial Nets for algorithmic composition. With the same generator, two adversarial nets alternately optimize the musicality and novelty of the machine-composed music. A new model called novelty game is presented to maximize the minimal distance between the machine-composed music sample and any human-composed music sample in the novelty space, where all well-known human composed music products are far from each other. We implement the proposed framework using three supervised CNNs with one for generator, one for musicality critic and one for novelty critic on the time-pitch feature space. Specifically, the novelty critic is implemented by Siamese neural networks with temporal alignment using dynamic time warping. We provide empirical validations by generating the music samples under various scenarios.

Motivation

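Music generation models keep getting more accurate, yet this has not led to the generation of "new" or "unique" music. GANs, too, ultimately just learn and imitate the "plausibility" of the training data distribution. On the other hand, a random sequence of notes is meaningless.
This study aims to generate new, unique music with AI by building a GAN architecture that alternately maximizes musicality (similarity to the training data) and novelty.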

From the global view, the machine-composed samples and human-composed samples should have similar distributions, while from the local view, a machine-composed music should guarantee enough distance to the nearest human-composed neighbor. With this framework, we expect to generate music with both good musicality and good novelty.

Architecture

Maximizing musicality - the standard GAN setting. $\tilde{\mathbf{x}}=G_{\theta}(\mathbf{z})$ is the music produced by the generator.

$$\min _{\theta} \max _{w} \underset{\mathbf{x} \sim \mathbb{P}_{r}}{\mathbb{E}}\left(D_{w}(\mathbf{x})\right)-\underset{\tilde{\mathbf{x}} \sim \mathbb{P}_{g}}{\mathbb{E}}\left(D_{w}(\tilde{\mathbf{x}})\right)$$
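The objective above has the same form as a Wasserstein-style critic objective, so it can be estimated from a batch by replacing each expectation with a sample mean. The following is a minimal sketch under that assumption, not the authors' implementation: the function names and the score values are hypothetical, and the critic scores are plain numbers standing in for the outputs of $D_w$.

```python
from statistics import fmean


def musicality_objective(real_scores, fake_scores):
    """Batch estimate of E_{x~P_r}[D_w(x)] - E_{x~~P_g}[D_w(x~)]."""
    return fmean(real_scores) - fmean(fake_scores)


def critic_loss(real_scores, fake_scores):
    # The critic (parameters w) maximizes the objective,
    # so as a loss to minimize it is the negated objective.
    return -musicality_objective(real_scores, fake_scores)


def generator_loss(fake_scores):
    # The generator (parameters theta) minimizes the objective.
    # Only the second term depends on theta, which leaves -E[D_w(x~)].
    return -fmean(fake_scores)


# Hypothetical critic outputs, for illustration only.
real_scores = [0.9, 0.7, 0.8]  # D_w on training pieces
fake_scores = [0.1, 0.3, 0.2]  # D_w on generated pieces

print("objective:", musicality_objective(real_scores, fake_scores))
print("critic loss:", critic_loss(real_scores, fake_scores))
print("generator loss:", generator_loss(fake_scores))
```

In an actual training loop these two losses would be minimized alternately with respect to $w$ and $\theta$; the sketch only shows how the two sides of the min-max read off the same formula.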

Novelty

As with a GAN discriminator, a network that estimates novelty is trained. $H_v$ estimates the pair-wise novelty of a pair of pieces. Overall novelty is considered high when the pair-wise novelty is high for every pair of pieces.

$$\max _{\theta} \min _{v} \inf _{\tilde{\mathbf{x}} \sim \mathbb{P}_{g}, \mathbf{x} \sim \mathbb{P}_{r}}\left(H_{v}(\tilde{\mathbf{x}}, \mathbf{x})\right)-\inf _{\mathbf{x}_{1}, \mathbf{x}_{2} \sim \mathbb{P}_{r}}\left(H_{v}\left(\mathbf{x}_{1}, \mathbf{x}_{2}\right)\right)$$

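$\tilde{\mathbf{x}}$ is a generated piece; $\mathbf{x}_1, \mathbf{x}_2$ are pieces in the training data.
$H_{v}$ is trained adversarially against the generator. It learns to make the pair-wise novelty low between a training piece and a generated piece, and high between pieces within the training data. G does the opposite: it learns to make the pair-wise novelty between training pieces and generated pieces high.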

The interesting part is that novelty is never defined concretely; instead, the network that computes novelty is learned by setting up an adversarial training framework!

The infimum (greatest lower bound) is taken instead of the expectation $\mathbb{E}$ because the goal is not high novelty on average, but high novelty for every pair of a generated piece and a piece in the training data.
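To make the difference between the infimum and the expectation concrete, here is a small sketch that evaluates the novelty objective on toy data. Everything in it is an assumption for illustration: `pairwise_novelty` is a hand-written distance standing in for the learned network $H_v$, the pitch sequences are invented, and the infimum is approximated by the minimum over a finite batch.

```python
from itertools import combinations
from statistics import fmean


def pairwise_novelty(a, b):
    # Hypothetical stand-in for H_v: mean absolute pitch difference
    # between two equal-length note sequences.
    return fmean([abs(p - q) for p, q in zip(a, b)])


def novelty_objective(generated, training, h=pairwise_novelty):
    # First term: smallest novelty over all (generated, training) pairs.
    cross = min(h(g, x) for g in generated for x in training)
    # Second term: smallest novelty over all pairs of distinct training pieces.
    within = min(h(x1, x2) for x1, x2 in combinations(training, 2))
    return cross - within


# Invented MIDI pitch sequences, for illustration only.
training = [[60, 62, 64, 65], [67, 65, 64, 62], [72, 71, 69, 67]]
generated = [
    [60, 62, 64, 66],  # near copy of the first training piece
    [50, 55, 52, 57],  # far from every training piece
]

cross_scores = [pairwise_novelty(g, x) for g in generated for x in training]
print("mean cross novelty:", fmean(cross_scores))
print("min cross novelty:", min(cross_scores))
print("objective:", novelty_objective(generated, training))
```

With these toy values the mean cross novelty is roughly 7.8, which looks healthy, but the minimum is 0.25 because one generated piece nearly copies a training piece, and the objective comes out at -3.0. An expectation-based objective would hide that single near copy; the infimum exposes it.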

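Implementation

Music representation - music is represented as a 3-dimensional tensor of time axis, pitch, and bar information
Generator - implemented with a CNN
Discriminator - implemented with a Siamese structure, because it has to take a pair of pieces as input (this is the novelty critic $H_v$)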
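The abstract adds that the Siamese novelty critic uses temporal alignment via dynamic time warping (DTW). These notes do not describe how the paper wires DTW into the network, so the following is only a sketch of the textbook DTW distance between two 1-D sequences, with a hypothetical function name and absolute difference assumed as the local cost.

```python
def dtw_distance(a, b):
    """Textbook dynamic time warping distance between two numeric sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best alignment cost of a[:i] against b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # assumed local cost
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b[j-1] is reused
                cost[i][j - 1],      # b advances, a[i-1] is reused
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# The same melody with one note held longer aligns at zero cost.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
# Two sequences that differ at every step accumulate cost.
print(dtw_distance([0, 0], [1, 1]))
```

The point of the alignment is that two pieces that differ only in timing, such as a stretched note, are not scored as different, which matters when comparing a generated piece against training pieces.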

Results

Listening test results for novelty

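42 students first listened to several pieces from the training data. They then listened to pieces obtained in the following ways and were asked whether any piece they had heard beforehand was being played again.
Type (a) - some of the pieces played beforehand, replayed
Type (b) - new pieces composed by humans
Type (c) - a different part of a piece played beforehand
Type (d) - results of maximizing musicality only
Type (e) - the method proposed in this study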
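Naturally, the score for (a) is low, meaning listeners judged that a previously played piece was being played again, whereas pieces generated by the proposed method were judged as "never heard before".
Surprisingly, they were judged to be more novel than the new human-composed pieces!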

As for musicality, the evaluation went only as far as confirming that consonant harmonies were properly generated.

Further Thoughts

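This is a 2018 paper that I had completely overlooked.
Studies that treat novelty explicitly, as Creative-GAN does, are very rare (especially in the music domain), so this one is valuable!
It is a pity that no samples of the generated music are available to listen to…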