Entry

GANSynth: Synthesizing Instrument Sounds with GANs

Simple Title

Engel, J. et al. (2019) ‘GANSynth: Adversarial Neural Audio Synthesis’, arXiv. Available at: http://arxiv.org/abs/1902.08710 (Accessed: 24 May 2021).

Type

Paper

Year

2019

Posted at

May 28, 2021

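Overview - What's impressive?

Generates instrument timbres with a GAN.
Replaces the autoregressive decoder of the prior work, NSynth, with a GAN, which drastically cuts synthesis time (about 54,000 times faster than the WaveNet baseline!!)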

Abstract

Efficient audio synthesis is an inherently difficult machine learning task, as human perception is sensitive to both global structure and fine-scale waveform coherence. Autoregressive models, such as WaveNet, model local structure but have slow iterative sampling and lack global latent structure. In contrast, Generative Adversarial Networks (GANs) have global latent conditioning and efficient parallel sampling, but struggle to generate locally-coherent audio waveforms. Herein, we demonstrate that GANs can in fact generate high-fidelity and locally-coherent audio by modeling log magnitudes and instantaneous frequencies with sufficient frequency resolution in the spectral domain. Through extensive empirical investigations on the NSynth dataset, we demonstrate that GANs are able to outperform strong WaveNet baselines on automated and human evaluation metrics, and efficiently generate audio several orders of magnitude faster than their autoregressive counterparts.

Motivation

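Autoregressive models are slow at generation time → the goal is to exploit the parallel sampling that GANs offer.
GAN approaches that generate the waveform directly (for example, WaveGAN) do not reproduce sounds faithfully and have mediocre audio quality... ↔ Working with spectrograms instead runs into the phase problem.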

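As the figure below shows, with methods that generate audio frame by frame, such as the STFT, the phase drifts from one frame to the next (the black line with dots at the top) → learning both the frequency distribution and the phase directly in this form is extremely difficult.
In contrast, the frame-to-frame difference of the unwrapped phase, that is, its time derivative, is constant (the Instantaneous Frequency / IF panel in the middle of the figure below) → much easier to model.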
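
The relationship between wrapped phase and instantaneous frequency described above can be illustrated with a small NumPy sketch. It is not taken from the paper's code: the sample rate, hop size, and test tone are assumed values, chosen so that the phase step per frame stays below π and `np.unwrap` can recover it unambiguously.

```python
import numpy as np

def instantaneous_frequency(wrapped_phase):
    """Unwrap phase along the time axis (last axis), then take the frame-to-frame difference."""
    unwrapped = np.unwrap(wrapped_phase, axis=-1)
    return np.diff(unwrapped, axis=-1)

# All values below are assumptions for illustration only.
sample_rate = 16000   # Hz
hop = 16              # samples between frames (small, so the phase step stays below pi)
freq = 440.0          # Hz, a pure tone

frames = np.arange(50)
true_phase = 2 * np.pi * freq * hop * frames / sample_rate

# What a frame-based analysis would observe: phase wrapped into (-pi, pi].
wrapped = np.angle(np.exp(1j * true_phase))

# The wrapped phase jumps around, but its unwrapped difference is one constant value.
inst_freq = instantaneous_frequency(wrapped)
```

Here `wrapped` looks like a sawtooth, while every entry of `inst_freq` equals the same phase advance per frame, which is the property that makes IF an easier target for the generator.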

This work uses a GAN to generate both the spectrogram and the phase information.

Architecture

The model is based on Progressive GAN. The difference is that pitch information is passed in as a one-hot vector for conditioning.
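
A minimal sketch of how such a pitch-conditioned generator input might be assembled: a random latent vector concatenated with a one-hot pitch vector. The latent size (256), the pitch range (MIDI 24 to 84), and the helper name `generator_input` are assumptions for illustration and are not stated in this note.

```python
import numpy as np

LATENT_DIM = 256    # assumed latent size
MIN_PITCH = 24      # assumed lowest MIDI pitch
NUM_PITCHES = 61    # assumed number of pitches (MIDI 24..84)

def generator_input(midi_pitch, rng):
    """Concatenate a Gaussian latent vector with a one-hot encoding of the pitch."""
    z = rng.standard_normal(LATENT_DIM)
    one_hot = np.zeros(NUM_PITCHES)
    one_hot[midi_pitch - MIN_PITCH] = 1.0
    return np.concatenate([z, one_hot])

rng = np.random.default_rng(0)
x = generator_input(60, rng)   # middle C with a random timbre
```

Because pitch is supplied explicitly, the latent part `z` is free to encode everything else, mainly timbre.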

The output has two channels (magnitude and phase).

Magnitude: take the log and scale it to the range -1 to 1. Phase is also scaled to -1 to 1 → the output layer uses a tanh activation.
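
The scaling described above can be sketched as follows. This is only an illustration on random placeholder data: the helper names, the small epsilon inside the log, and the use of the array's own min/max as the scaling range are assumptions (a real pipeline would more likely use dataset-wide statistics).

```python
import numpy as np

def scale_to_unit_range(x, lo, hi):
    """Linearly map values in [lo, hi] to [-1, 1]."""
    return 2.0 * (x - lo) / (hi - lo) - 1.0

def unscale_from_unit_range(y, lo, hi):
    """Inverse of scale_to_unit_range."""
    return (y + 1.0) * (hi - lo) / 2.0 + lo

rng = np.random.default_rng(0)
# Placeholder "spectrogram": 4 frequency bins x 8 frames (not real audio).
magnitude = rng.uniform(1e-3, 10.0, size=(4, 8))
phase = rng.uniform(-np.pi, np.pi, size=(4, 8))

log_mag = np.log(magnitude + 1e-6)
lo, hi = log_mag.min(), log_mag.max()   # assumed scaling range

# Two-channel training target, both channels in [-1, 1].
target = np.stack([scale_to_unit_range(log_mag, lo, hi), phase / np.pi])

# A tanh output layer produces values in the same range as the target.
pre_activation = rng.standard_normal(target.shape)
output = np.tanh(pre_activation)
```

Matching the target range to the range of tanh means the generator never has to produce values its final activation cannot reach.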

Two variants were tested: a model that outputs the raw phase (leftmost in the figure) and a model that generates the phase difference (rightmost).

Dataset

Uses the NSynth dataset.

Results

About 54,000 times faster than the WaveNet-based model!
As expected, the model that generates the phase difference (IF) performed better!

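Results of a listening test with human listeners (they heard two samples and chose the better one).

The methods compared are WaveGAN, WaveNet, and Real Data (the training data itself).
The best configuration in this work (IF-Mel+HP) generates audio that is on par with the training data.

User test results

Examples of generated audio

Original

GANSynth (proposed method)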

WaveNet

WaveGAN

Varying the latent vector also enables timbre interpolation (a gradual change from one timbre to another).
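
As a rough sketch of what moving through latent space can look like, the snippet below builds a path between two latent vectors. This note does not say which interpolation scheme is used; spherical interpolation (slerp) is a swapped-in assumption here, chosen because it is a common option for Gaussian latents, and the latent size of 256 is also assumed.

```python
import numpy as np

def slerp(z0, z1, t):
    """Spherical interpolation between two latent vectors for t in [0, 1]."""
    cos_omega = np.dot(z0, z1) / (np.linalg.norm(z0) * np.linalg.norm(z1))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    if np.isclose(omega, 0.0):
        # Nearly parallel vectors: fall back to linear interpolation.
        return (1.0 - t) * z0 + t * z1
    return (np.sin((1.0 - t) * omega) * z0 + np.sin(t * omega) * z1) / np.sin(omega)

rng = np.random.default_rng(0)
z_a = rng.standard_normal(256)   # latent for timbre A (assumed size)
z_b = rng.standard_normal(256)   # latent for timbre B

# Nine latent vectors moving gradually from timbre A to timbre B.
path = np.stack([slerp(z_a, z_b, t) for t in np.linspace(0.0, 1.0, 9)])
```

Feeding each row of `path` to the generator with the same pitch condition would yield notes whose timbre shifts step by step.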

Other audio samples

GANSynth: Adversarial Neural Audio Synthesis

GANSynth learns to produce individual instrument notes like the NSynth Dataset. With pitch provided as a conditional attribute, the generator learns to use its latent space to represent different instrument timbres. This allows us to synthesize performances from MIDI files, either keeping the timbre constant, or interpolating between instruments over time.

storage.googleapis.com

Further Thoughts

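Roughly 50,000 times faster than NSynth is amazing 🙌
The low-frequency noise that NSynth had is gone!
The idea of instantaneous frequency draws on a paper from 1992. Digging up an old audio signal processing paper and turning it into a new implementation is impressive too...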

Links

Prior work

📄NSynth: Neural Audio Synthesis (synthesizing instrument sounds with a WaveNet-based autoencoder)

WaveGAN, used as a comparison baseline

Adversarial Audio Synthesis

Audio signals are sampled at high temporal resolutions, and learning to synthesize audio requires capturing structure across a range of timescales. Generative adversarial networks (GANs) have seen wide success at generating images that are both locally and globally coherent, but they have seen little application to audio generation.

arxiv.org

A Colab notebook you can try right away

Google Colaboratory

colab.research.google.com