Survey - Sound Synthesis Using Deep Learning

Simple Title: Huzaifah bin Md Shahrin, M. and Wyse, L. (2020) ‘Deep Generative Models for Musical Audio Synthesis’, arXiv.
Type: Paper
Year: 2020
Posted at: May 24, 2021
Tags: sound
Arxiv: https://arxiv.org/abs/2006.06426

Generative Model

Written formally: how do we minimize the distance $d$ between the target data distribution $p_{data}$ and the model's output distribution $p_\theta$?

$$\min_{\theta \in M} d\left(p_{\text{data}}, p_{\theta}\right)$$
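
As a reminder (standard material, not spelled out in the survey itself): if $d$ is taken to be the KL divergence, minimizing it is equivalent to maximum-likelihood training, since the entropy of $p_{data}$ does not depend on $\theta$:

$$\min_{\theta} D_{\mathrm{KL}}\left(p_{data}\,\|\,p_{\theta}\right) = \min_{\theta}\ \mathbb{E}_{x \sim p_{data}}\left[\log p_{data}(x) - \log p_{\theta}(x)\right] \;\Longleftrightarrow\; \max_{\theta}\ \mathbb{E}_{x \sim p_{data}}\left[\log p_{\theta}(x)\right]$$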

Autoregressive Models

A model that predicts the next token from the preceding part of the sequence:

$$p(X)=\prod_{i=1}^{n} p\left(x_{i} \mid x_{1}, \ldots, x_{i-1}\right)$$
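
As a rough illustration of this factorization (my own sketch, not from the survey): generation proceeds one sample at a time, each step conditioned on everything generated so far. `predict_next` is a hypothetical stand-in for the trained network.

```python
import numpy as np

def sample_autoregressive(predict_next, seed, n_steps, rng=np.random.default_rng(0)):
    """Generate a sequence token by token: x_i ~ p(x_i | x_1, ..., x_{i-1})."""
    x = list(seed)
    for _ in range(n_steps):
        probs = predict_next(x)                        # distribution over the next token
        x.append(int(rng.choice(len(probs), p=probs)))
    return x

# Toy usage: a "model" that is uniform over 256 classes (e.g. 8-bit mu-law audio).
uniform_model = lambda history: np.full(256, 1.0 / 256)
print(sample_autoregressive(uniform_model, seed=[128], n_steps=10))
```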

RNN-based

SampleRNN: An Unconditional End-to-End Neural Audio Generation Model

In this paper we propose a novel model for unconditional audio generation based on generating one audio sample at a time. We show that our model, which profits from combining memory-less modules, namely autoregressive multilayer perceptrons, and stateful recurrent neural networks in a hierarchical structure is able to capture underlying sources of variations in the temporal sequences over very long time spans, on three datasets of different nature.

Dilated CNN

WaveNet: A Generative Model for Raw Audio

This paper introduces WaveNet, a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones; nonetheless we show that it can be efficiently trained on data with tens of thousands of samples per second of audio.
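
The key trick behind WaveNet is stacking causal convolutions with exponentially growing dilation so the receptive field grows quickly with depth. A minimal PyTorch sketch of that idea (illustrative only; it omits WaveNet's gated activations, residual/skip connections, and softmax output):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedCausalStack(nn.Module):
    def __init__(self, channels=32, kernel_size=2, n_layers=8):
        super().__init__()
        # Dilation doubles every layer: 1, 2, 4, ..., so the receptive field is ~2^n_layers.
        self.convs = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size, dilation=2 ** i)
            for i in range(n_layers))
        self.pads = [(kernel_size - 1) * 2 ** i for i in range(n_layers)]

    def forward(self, x):                               # x: (batch, channels, time)
        for pad, conv in zip(self.pads, self.convs):
            x = torch.relu(conv(F.pad(x, (pad, 0))))    # left-pad only => causal
        return x

net = DilatedCausalStack()
print(net(torch.randn(1, 32, 16000)).shape)             # torch.Size([1, 32, 16000])
```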

WaveRNN

Can generate 24 kHz 16-bit audio at 4x real-time speed.


Variational Autoencoder

Learning Disentangled Representations of Timbre and Pitch for Musical Instrument Sounds Using Gaussian Mixture Variational Autoencoders

In this paper, we learn disentangled representations of timbre and pitch for musical instrument sounds. We adapt a framework based on variational autoencoders with Gaussian mixture latent distributions. Specifically, we use two separate encoders to learn distinct latent spaces for timbre and pitch, which form Gaussian mixture components representing instrument identity and pitch, respectively.

Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders

Generative models in vision have seen rapid progress due to algorithmic improvements and the availability of high-quality image datasets. In this paper, we offer contributions in both these areas to enable similar progress in audio modeling. First, we detail a powerful new WaveNet-style autoencoder model that conditions an autoregressive decoder on temporal codes learned from the raw audio waveform.
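
Both papers above build on the same VAE backbone: an encoder outputs an approximate posterior, a latent is sampled with the reparameterization trick, and a decoder reconstructs the input. A deliberately tiny, generic sketch (not either paper's actual architecture; the Gaussian-mixture prior and the WaveNet decoder are omitted):

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    def __init__(self, in_dim=1024, latent_dim=16):
        super().__init__()
        self.enc = nn.Linear(in_dim, 2 * latent_dim)          # predicts mean and log-variance
        self.dec = nn.Linear(latent_dim, in_dim)

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization trick
        recon = self.dec(z)
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
        recon_err = torch.sum((recon - x) ** 2, dim=-1)
        return (recon_err + kl).mean()                        # negative ELBO (up to constants)

loss = TinyVAE()(torch.randn(8, 1024))                        # 8 frames of 1024 samples each
loss.backward()
```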

Normalizing Flow Model

Parallel WaveNet: Fast High-Fidelity Speech Synthesis

The recently-developed WaveNet architecture is the current state of the art in realistic speech synthesis, consistently rated as more natural sounding for many different languages than any previous system. However, because WaveNet relies on sequential generation of one audio sample at a time, it is poorly suited to today's massively parallel computers, and therefore hard to deploy in a real-time production setting.
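
The core mechanism of a normalizing flow is an invertible transform plus the change-of-variables correction to the log-density. A one-transform sketch (purely illustrative; Parallel WaveNet actually uses stacks of inverse-autoregressive flows distilled from a teacher WaveNet):

```python
import math
import torch

# Base sample z ~ N(0, 1), pushed through an invertible affine map x = z * exp(s) + t.
z = torch.randn(10000)
s, t = torch.tensor(0.5), torch.tensor(-1.0)        # hypothetical "learned" parameters
x = z * torch.exp(s) + t

# Change of variables: log p_x(x) = log p_z(z) - log|dx/dz|, and here log|dx/dz| = s.
log_pz = -0.5 * z ** 2 - 0.5 * math.log(2 * math.pi)
log_px = log_pz - s
print(float(log_px.mean()))
```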

Generative Adversarial Networks

Compared with image generation, this area is still relatively under-explored.

Adversarial Audio Synthesis

Audio signals are sampled at high temporal resolutions, and learning to synthesize audio requires capturing structure across a range of timescales. Generative adversarial networks (GANs) have seen wide success at generating images that are both locally and globally coherent, but they have seen little application to audio generation.

All things considered, this line of work is roughly the state of the art (at least as of 2020).

Engel, J. et al. (2019) ‘GANSynth: Adversarial Neural Audio Synthesis’, arXiv. Available at: http://arxiv.org/abs/1902.08710 (Accessed: 24 May 2021).

GANSynth: Adversarial Neural Audio Synthesis

Efficient audio synthesis is an inherently difficult machine learning task, as human perception is sensitive to both global structure and fine-scale waveform coherence.
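
For reference, the adversarial setup these papers build on boils down to alternating two gradient steps. A bare-bones sketch over made-up feature vectors (nothing like the progressive, spectrogram-domain architecture GANSynth actually uses):

```python
import torch
import torch.nn as nn

latent_dim, feat_dim = 64, 512                        # hypothetical sizes
G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, feat_dim))
D = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(32, feat_dim)                      # stand-in for real audio features

# Discriminator step: push real towards 1, generated towards 0.
fake = G(torch.randn(32, latent_dim)).detach()
loss_d = bce(D(real), torch.ones(32, 1)) + bce(D(fake), torch.zeros(32, 1))
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Generator step: fool the discriminator into outputting 1 on generated samples.
fake = G(torch.randn(32, latent_dim))
loss_g = bce(D(fake), torch.ones(32, 1))
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```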

Conditioning

Obviously, to use one of these models as a synthesizer, the timbre needs to be controllable. Autoregressive models such as RNNs generate each step from the influence of the immediately preceding sequence → as the generated sequence grows longer, the influence of the initially specified seed sequence fades away.

For example... synthesizing a sine wave: towards the end, the pitch drifts out of tune. ← Conditioning can relieve the model of the need to memorize long-range temporal dependencies.
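
A common way to implement this (a generic sketch, not a specific paper's code): upsample a slowly varying control signal such as pitch to the audio rate and feed it to the network alongside the waveform at every time step, so the control never has to be "remembered" across the sequence.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocallyConditionedConv(nn.Module):
    def __init__(self, audio_ch=1, cond_ch=1, hidden=32):
        super().__init__()
        self.conv = nn.Conv1d(audio_ch + cond_ch, hidden, kernel_size=3, padding=1)

    def forward(self, audio, cond):
        # cond: (batch, cond_ch, n_frames) at a low frame rate; stretch it to the audio rate.
        cond = F.interpolate(cond, size=audio.shape[-1], mode="linear", align_corners=False)
        return self.conv(torch.cat([audio, cond], dim=1))   # network sees the control at every step

net = LocallyConditionedConv()
audio = torch.randn(1, 1, 16000)        # one second at 16 kHz
pitch = torch.randn(1, 1, 100)          # hypothetical pitch contour, 100 control frames
print(net(audio, pitch).shape)          # torch.Size([1, 32, 16000])
```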

Manzelli, R. et al. (2018) ‘Conditioning deep generative raw audio models for structured automatic music’, in Proceedings of the 19th International Society for Music Information Retrieval Conference, ISMIR 2018, pp. 182–189. doi: 10.5281/zenodo.1492375.

Conditioning Deep Generative Raw Audio Models for Structured Automatic Music

Existing automatic music generation approaches that feature deep learning can be broadly classified into two types: raw audio models and symbolic models. Symbolic models, which train and generate at the note level, are currently the more prevalent approach; these models can capture long-range dependencies of melodic structure, but fail to grasp the nuances and richness of raw audio generations.

Conditioning WaveNet on MIDI

Hawthorne, C. et al. (2018) ‘Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset’, arXiv. Available at: http://arxiv.org/abs/1810.12247 (Accessed: 24 May 2021).

Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset

Generating musical audio directly with neural networks is notoriously difficult because it requires coherently modeling structure at many different timescales. Fortunately, most music is also highly structured and can be represented as discrete note events played on musical instruments.


The Wave2MIDI2Wave model

Timbre mapping - disentangling pitch and timbre → makes it possible to create sounds halfway between one timbre and another.
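
In code, "a sound halfway between two timbres" amounts to interpolating in the learned latent space while holding the pitch conditioning fixed. A hedged sketch with placeholder encoder/decoder (their real counterparts are the trained networks from the papers cited below):

```python
import torch

def interpolate_timbre(encoder, decoder, sound_a, sound_b, pitch, alpha=0.5):
    """Blend the timbre latents of two sounds and resynthesize at a fixed pitch."""
    z_a, z_b = encoder(sound_a), encoder(sound_b)
    z_mix = (1 - alpha) * z_a + alpha * z_b            # alpha=0.5 -> halfway between timbres
    return decoder(z_mix, pitch)

# Toy usage with stand-in encoder/decoder so the sketch runs end to end.
enc = lambda x: x.mean(dim=-1)                         # (batch, ch, time) -> (batch, ch)
dec = lambda z, pitch: z.unsqueeze(-1) + pitch         # trivially "conditioned" decode
a, b = torch.randn(1, 16, 100), torch.randn(1, 16, 100)
print(interpolate_timbre(enc, dec, a, b, pitch=torch.zeros(1, 16, 1)).shape)
```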

Kim, J. W. et al. (2018) ‘Neural Music Synthesis for Flexible Timbre Control’, in ICASSP 2019 - IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 176–180. Available at: http://arxiv.org/abs/1811.00223 (Accessed: 24 May 2021).


Esling, P., Chemla–Romeu-Santos, A. and Bitton, A. (2018) ‘Generative timbre spaces: Regularizing variational auto-encoders with perceptual metrics’, in DAFx 2018 - Proceedings: 21st International Conference on Digital Audio Effects. DAFx18, pp. 369–376. Available at: http://arxiv.org/abs/1805.08501 (Accessed: 24 May 2021).


Engel, J. et al. (2017) ‘Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders’. Available at: http://arxiv.org/abs/1704.01279 (Accessed: 8 April 2017).

Music Translation

Style transfer for audio

Mor, N. et al. (2018) ‘A Universal Music Translation Network’. Available at: http://arxiv.org/abs/1805.07848 (Accessed: 23 May 2018).

Kumar, K. et al. (2019) ‘MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis’, arXiv. Available at: http://arxiv.org/abs/1910.06711 (Accessed: 24 May 2021).

To read next

The ones I kept putting off..

Engel, J. et al. (2017) ‘Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders’. Available at: http://arxiv.org/abs/1704.01279 (Accessed: 8 April 2017).

Hawthorne, C. et al. (2018) ‘Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset’, arXiv. Available at: http://arxiv.org/abs/1810.12247 (Accessed: 24 May 2021).

Défossez, A. et al. (2018) ‘SING: Symbol-to-Instrument Neural Generator’. Available at: http://arxiv.org/abs/1810.09785.

Engel, J. et al. (2019) ‘GANSynth: Adversarial Neural Audio Synthesis’, arXiv. Available at: http://arxiv.org/abs/1902.08710 (Accessed: 24 May 2021).

Kumar, K. et al. (2019) ‘MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis’, arXiv. Available at: http://arxiv.org/abs/1910.06711 (Accessed: 24 May 2021).

Further Thoughts

  • There was no mention of DDSP.