Entry

NSynth: Neural Audio Synthesis - Synthesizing instrument sounds with a WaveNet-based autoencoder

Simple Title

Engel, J. et al. (2017) ‘Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders’. Available

Description

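An autoencoder built on the WaveNet architecture maps instrument sounds, including how they change over time, into a latent space, then synthesizes instrument sounds from the latent vectors. The NSynth dataset, a large collection of instrument sounds used in this research, was released alongside the model.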

Type

Paper

Year

2017

Posted at

May 28, 2021

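Overview - What makes it impressive?

An autoencoder built on the WaveNet architecture maps instrument sounds, including how they change over time, into a latent space → instrument sounds are synthesized from the latent vectors
The NSynth dataset, a large collection of instrument sounds used in this research, was released alongside the model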

Abstract

Generative models in vision have seen rapid progress due to algorithmic improvements and the availability of high-quality image datasets. In this paper, we offer contributions in both these areas to enable similar progress in audio modeling. First, we detail a powerful new WaveNet-style autoencoder model that conditions an autoregressive decoder on temporal codes learned from the raw audio waveform. Second, we introduce NSynth, a large-scale and high-quality dataset of musical notes that is an order of magnitude larger than comparable public datasets. Using NSynth, we demonstrate improved qualitative and quantitative performance of the WaveNet autoencoder over a well-tuned spectral autoencoder baseline. Finally, we show that the model learns a manifold of embeddings that allows for morphing between instruments, meaningfully interpolating in timbre to create new types of sounds that are realistic and expressive.

Motivation

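Build a neural-network-based method for synthesizing new sounds, specifically musical instrument sounds
Also build a mechanism for controlling timbre, pitch, and loudness dynamics (envelope)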

Architecture

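The model is based on WaveNet. WaveNet is an autoregressive model: it predicts the next sample from the sequence of samples that precede it.

Without external conditioning, it cannot control long-range temporal dependencies well → This is not a problem for the WaveNet designed with text-to-speech in mind (because the text provides the conditioning)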

WaveNet

p(x)=\prod_{i=1}^{N} p\left(x_{i} \mid x_{1}, \ldots, x_{i-1}\right)

This work uses an encoder $Z=f(x)$ to obtain an embedding. The value of $Z$ is then used as conditioning when predicting the next sample:

p(x)=\prod_{i=1}^{N} p\left(x_{i} \mid x_{1}, \ldots, x_{i-1}, f(x)\right)
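The conditioned factorization can be illustrated with a toy example. The following is a minimal sketch, not the paper's model: `toy_conditional` is a hypothetical stand-in for the WaveNet decoder, the four quantization levels are an arbitrary choice, and `cond_per_sample` plays the role of $f(x)$ already aligned to the sample rate. It only shows how the log-likelihood is accumulated one sample at a time with the chain rule.

```python
import math


def toy_conditional(history, cond, num_levels=4):
    """Hypothetical stand-in for the decoder.

    Returns p(x_i | x_1..x_{i-1}, cond) as a list of probabilities over
    `num_levels` quantized amplitude values.
    """
    prev = history[-1] if history else 0
    # Logits favor values close to the previous sample; the conditioning
    # value adds a bias toward the highest level.
    logits = [
        -abs(level - prev) + (cond if level == num_levels - 1 else 0.0)
        for level in range(num_levels)
    ]
    peak = max(logits)
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


def log_likelihood(x, cond_per_sample):
    """log p(x) = sum_i log p(x_i | x_1..x_{i-1}, cond_i)."""
    total = 0.0
    for i, x_i in enumerate(x):
        probs = toy_conditional(x[:i], cond_per_sample[i])
        total += math.log(probs[x_i])
    return total


print(log_likelihood([0, 1, 1, 2], [0.0, 0.5, 0.5, 1.0]))
```

Because every conditional is normalized, the probabilities of all possible sequences sum to 1, which is what makes the product a valid joint distribution.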

Architecture (figure)

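The encoder converts every 512 input samples into one 16-dimensional latent vector $Z$

The encoder is not autoregressive; it takes in the whole input at once

The decoder receives $Z$ upsampled back to the original time resolution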
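The shape bookkeeping can be sketched as follows. This is an illustration under stated assumptions, not the actual network: `toy_encode` is a hypothetical stand-in that fills every channel with the frame mean, the upsampling is done by simple repetition (nearest neighbor), and the 64,000-sample input assumes a 4-second note at 16 kHz.

```python
HOP = 512        # input samples per latent frame (from the text)
LATENT_DIM = 16  # latent channels per frame (from the text)


def toy_encode(audio):
    """Hypothetical encoder stand-in: one LATENT_DIM vector per HOP samples.

    Every channel is just the frame mean here; the real encoder is a
    learned, non-autoregressive network that sees the whole input.
    """
    num_frames = len(audio) // HOP
    z = []
    for t in range(num_frames):
        chunk = audio[t * HOP:(t + 1) * HOP]
        mean = sum(chunk) / HOP
        z.append([mean] * LATENT_DIM)
    return z


def upsample(z):
    """Repeat each latent frame HOP times to restore the sample rate."""
    return [frame for frame in z for _ in range(HOP)]


audio = [0.0] * 64000            # assumed: 4 s at 16 kHz
z = toy_encode(audio)
conditioning = upsample(z)
print(len(z), len(z[0]), len(conditioning))
```

With these assumptions the latent sequence has 125 frames of 16 values each, and the upsampled conditioning signal has one 16-dimensional vector per audio sample again.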

Dataset

The NSynth dataset was newly created for this work

A dataset of 305,979 instrument notes
Recorded at each pitch in the MIDI range 21-108 and at five velocities (intensities): 25, 50, 75, 100, and 127
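To make the pitch and velocity grid concrete, here is a small sketch that enumerates it. The conversion to frequency assumes standard equal temperament with A4 (MIDI 69) at 440 Hz, which the text above does not state, and the grid is only the nominal range: it does not claim that every instrument in the dataset covers every combination.

```python
PITCHES = range(21, 109)             # MIDI 21..108 inclusive
VELOCITIES = (25, 50, 75, 100, 127)  # five velocity levels from the text


def midi_to_hz(pitch):
    """Equal-temperament frequency, assuming A4 (MIDI 69) = 440 Hz."""
    return 440.0 * 2 ** ((pitch - 69) / 12)


# Every nominal (pitch, velocity) combination for one instrument.
grid = [(pitch, velocity) for pitch in PITCHES for velocity in VELOCITIES]

print(len(PITCHES), "pitches x", len(VELOCITIES), "velocities =", len(grid))
print(midi_to_hz(21), midi_to_hz(108))
```

The range corresponds to the 88 keys of a piano, so the nominal grid has 440 combinations per instrument.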

Breakdown of instruments

Results

Resynthesis

Input

Resynthesized result

Timbre interpolation

Timbre mixing

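Since embeddings are available, averaging or interpolating them and passing the result to the decoder should yield a variety of gradually changing sounds
Note that this is different from simply mixing the audio signals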
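The difference between the two operations can be sketched in a few lines. This is a minimal illustration, assuming latent sequences stored as lists of frames (each frame a list of floats); the function names are hypothetical, and the decoder that would turn the blended latents into audio is not included.

```python
def interpolate_latents(z_a, z_b, alpha):
    """Blend two latent sequences frame by frame.

    Returns (1 - alpha) * z_a + alpha * z_b; the result would then be
    passed to the decoder to synthesize a new sound.
    """
    if len(z_a) != len(z_b):
        raise ValueError("latent sequences must have the same number of frames")
    return [
        [(1 - alpha) * a + alpha * b for a, b in zip(frame_a, frame_b)]
        for frame_a, frame_b in zip(z_a, z_b)
    ]


def mix_waveforms(x_a, x_b):
    """Plain audio mixing for contrast: average the two waveforms."""
    return [0.5 * (a + b) for a, b in zip(x_a, x_b)]


# Toy latents: 2 frames x 2 channels (real ones would be 16 channels).
z_bass = [[0.0, 2.0], [4.0, 6.0]]
z_flute = [[2.0, 0.0], [0.0, 2.0]]
print(interpolate_latents(z_bass, z_flute, 0.5))
```

Mixing waveforms yields two instruments playing at once, while decoding a blended latent sequence is meant to yield a single sound whose timbre lies between the two. Sweeping `alpha` from 0 to 1 gives the gradual change described above.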

Training data

Bass

Flute

Bass + Flute (the two audio signals simply mixed)

Sounds generated by this method

Bass

Flute

Bass + Flute

Further Thoughts

A hardware instrument built on this algorithm is this one...

Links

NSynth dataset

The NSynth Dataset

A large-scale and high-quality dataset of annotated musical notes. Recent breakthroughs in generative modeling of images have been predicated on the availability of high-quality and large-scale datasets such as MNIST, CIFAR and ImageNet. We recognized the need for an audio dataset that was as approachable as those in the image domain.

magenta.tensorflow.org
