Survey - Sound Synthesis Using Deep Learning

Simple Title: Huzaifah bin Md Shahrin, M. and Wyse, L. (2020) ‘Deep Generative Models for Musical Audio Synthesis’, arXiv.
Type: Paper
Year: 2020
Posted at: May 24, 2021
Tags: sound
Arxiv: https://arxiv.org/abs/2006.06426

Generative Model

Written formally: how do we minimize the distance $d$ between the target data distribution $p_{data}$ and the model's output distribution $p_\theta$?

$$\min_{\theta \in M} d\left(p_{\text{data}}, p_{\theta}\right)$$
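
As a reminder (standard material, not spelled out in the survey itself): if $d$ is taken to be the KL divergence, minimizing it is equivalent to maximum-likelihood training, since the entropy of $p_{data}$ does not depend on $\theta$:

$$\min_{\theta} D_{\mathrm{KL}}\left(p_{data}\,\|\,p_{\theta}\right) = \min_{\theta}\ \mathbb{E}_{x \sim p_{data}}\left[\log p_{data}(x) - \log p_{\theta}(x)\right] \;\Longleftrightarrow\; \max_{\theta}\ \mathbb{E}_{x \sim p_{data}}\left[\log p_{\theta}(x)\right]$$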

Autoregressive Models

A model that predicts the next token from the preceding part of the sequence:

$$p(X)=\prod_{i=1}^{n} p\left(x_{i} \mid x_{1}, \ldots, x_{i-1}\right)$$
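
As a rough illustration of this factorization (my own sketch, not from the survey): generation proceeds one sample at a time, each step conditioned on everything generated so far. `predict_next` is a hypothetical stand-in for the trained network.

```python
import numpy as np

def sample_autoregressive(predict_next, seed, n_steps, rng=np.random.default_rng(0)):
    """Generate a sequence token by token: x_i ~ p(x_i | x_1, ..., x_{i-1})."""
    x = list(seed)
    for _ in range(n_steps):
        probs = predict_next(x)                        # distribution over the next token
        x.append(int(rng.choice(len(probs), p=probs)))
    return x

# Toy usage: a "model" that is uniform over 256 classes (e.g. 8-bit mu-law audio).
uniform_model = lambda history: np.full(256, 1.0 / 256)
print(sample_autoregressive(uniform_model, seed=[128], n_steps=10))
```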

RNN-based

SampleRNN: An Unconditional End-to-End Neural Audio Generation Model

In this paper we propose a novel model for unconditional audio generation based on generating one audio sample at a time. We show that our model, which profits from combining memory-less modules, namely autoregressive multilayer perceptrons, and stateful recurrent neural networks in a hierarchical structure is able to capture underlying sources of variations in the temporal sequences over very long time spans, on three datasets of different nature.

Dilated CNN

WaveNet: A Generative Model for Raw Audio

This paper introduces WaveNet, a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones; nonetheless we show that it can be efficiently trained on data with tens of thousands of samples per second of audio.
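
The key trick behind WaveNet is stacking causal convolutions with exponentially growing dilation so the receptive field grows quickly with depth. A minimal PyTorch sketch of that idea (illustrative only; it omits WaveNet's gated activations, residual/skip connections, and softmax output):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedCausalStack(nn.Module):
    def __init__(self, channels=32, kernel_size=2, n_layers=8):
        super().__init__()
        # Dilation doubles every layer: 1, 2, 4, ..., so the receptive field is ~2^n_layers.
        self.convs = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size, dilation=2 ** i)
            for i in range(n_layers))
        self.pads = [(kernel_size - 1) * 2 ** i for i in range(n_layers)]

    def forward(self, x):                               # x: (batch, channels, time)
        for pad, conv in zip(self.pads, self.convs):
            x = torch.relu(conv(F.pad(x, (pad, 0))))    # left-pad only => causal
        return x

net = DilatedCausalStack()
print(net(torch.randn(1, 32, 16000)).shape)             # torch.Size([1, 32, 16000])
```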

WaveRNN

Can generate 24 kHz 16-bit audio at 4x real-time speed.


Variational Autoencoder

Learning Disentangled Representations of Timbre and Pitch for Musical Instrument Sounds Using Gaussian Mixture Variational Autoencoders

In this paper, we learn disentangled representations of timbre and pitch for musical instrument sounds. We adapt a framework based on variational autoencoders with Gaussian mixture latent distributions. Specifically, we use two separate encoders to learn distinct latent spaces for timbre and pitch, which form Gaussian mixture components representing instrument identity and pitch, respectively.

Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders

Generative models in vision have seen rapid progress due to algorithmic improvements and the availability of high-quality image datasets. In this paper, we offer contributions in both these areas to enable similar progress in audio modeling. First, we detail a powerful new WaveNet-style autoencoder model that conditions an autoregressive decoder on temporal codes learned from the raw audio waveform.
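
Both papers above build on the same VAE backbone: an encoder outputs an approximate posterior, a latent is sampled with the reparameterization trick, and a decoder reconstructs the input. A deliberately tiny, generic sketch (not either paper's actual architecture; the Gaussian-mixture prior and the WaveNet decoder are omitted):

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    def __init__(self, in_dim=1024, latent_dim=16):
        super().__init__()
        self.enc = nn.Linear(in_dim, 2 * latent_dim)          # predicts mean and log-variance
        self.dec = nn.Linear(latent_dim, in_dim)

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization trick
        recon = self.dec(z)
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
        recon_err = torch.sum((recon - x) ** 2, dim=-1)
        return (recon_err + kl).mean()                        # negative ELBO (up to constants)

loss = TinyVAE()(torch.randn(8, 1024))                        # 8 frames of 1024 samples each
loss.backward()
```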

Normalizing Flow Model

Parallel WaveNet: Fast High-Fidelity Speech Synthesis

The recently-developed WaveNet architecture is the current state of the art in realistic speech synthesis, consistently rated as more natural sounding for many different languages than any previous system. However, because WaveNet relies on sequential generation of one audio sample at a time, it is poorly suited to today's massively parallel computers, and therefore hard to deploy in a real-time production setting.
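
The core mechanism of a normalizing flow is an invertible transform plus the change-of-variables correction to the log-density. A one-transform sketch (purely illustrative; Parallel WaveNet actually uses stacks of inverse-autoregressive flows distilled from a teacher WaveNet):

```python
import math
import torch

# Base sample z ~ N(0, 1), pushed through an invertible affine map x = z * exp(s) + t.
z = torch.randn(10000)
s, t = torch.tensor(0.5), torch.tensor(-1.0)        # hypothetical "learned" parameters
x = z * torch.exp(s) + t

# Change of variables: log p_x(x) = log p_z(z) - log|dx/dz|, and here log|dx/dz| = s.
log_pz = -0.5 * z ** 2 - 0.5 * math.log(2 * math.pi)
log_px = log_pz - s
print(float(log_px.mean()))
```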

Generative Adversarial Networks

Compared with image generation, this area is still relatively under-explored.

Adversarial Audio Synthesis

Audio signals are sampled at high temporal resolutions, and learning to synthesize audio requires capturing structure across a range of timescales. Generative adversarial networks (GANs) have seen wide success at generating images that are both locally and globally coherent, but they have seen little application to audio generation.

All things considered, this line of work is roughly the state of the art (at least as of 2020).

Engel, J. et al. (2019) ‘GANSynth: Adversarial Neural Audio Synthesis’, arXiv. Available at: http://arxiv.org/abs/1902.08710 (Accessed: 24 May 2021).

GANSynth: Adversarial Neural Audio Synthesis

Efficient audio synthesis is an inherently difficult machine learning task, as human perception is sensitive to both global structure and fine-scale waveform coherence.
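
For reference, the adversarial setup these papers build on boils down to alternating two gradient steps. A bare-bones sketch over made-up feature vectors (nothing like the progressive, spectrogram-domain architecture GANSynth actually uses):

```python
import torch
import torch.nn as nn

latent_dim, feat_dim = 64, 512                        # hypothetical sizes
G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, feat_dim))
D = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(32, feat_dim)                      # stand-in for real audio features

# Discriminator step: push real towards 1, generated towards 0.
fake = G(torch.randn(32, latent_dim)).detach()
loss_d = bce(D(real), torch.ones(32, 1)) + bce(D(fake), torch.zeros(32, 1))
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Generator step: fool the discriminator into outputting 1 on generated samples.
fake = G(torch.randn(32, latent_dim))
loss_g = bce(D(fake), torch.ones(32, 1))
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```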

Conditioning

Obviously, to use one of these models as a synthesizer, the timbre needs to be controllable. Autoregressive models such as RNNs generate each step from the influence of the immediately preceding sequence → as the generated sequence grows longer, the influence of the initially specified seed sequence fades away.

For example... synthesizing a sine wave: towards the end, the pitch drifts out of tune. ← Conditioning can relieve the model of the need to memorize long-range temporal dependencies.
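
A common way to implement this (a generic sketch, not a specific paper's code): upsample a slowly varying control signal such as pitch to the audio rate and feed it to the network alongside the waveform at every time step, so the control never has to be "remembered" across the sequence.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocallyConditionedConv(nn.Module):
    def __init__(self, audio_ch=1, cond_ch=1, hidden=32):
        super().__init__()
        self.conv = nn.Conv1d(audio_ch + cond_ch, hidden, kernel_size=3, padding=1)

    def forward(self, audio, cond):
        # cond: (batch, cond_ch, n_frames) at a low frame rate; stretch it to the audio rate.
        cond = F.interpolate(cond, size=audio.shape[-1], mode="linear", align_corners=False)
        return self.conv(torch.cat([audio, cond], dim=1))   # network sees the control at every step

net = LocallyConditionedConv()
audio = torch.randn(1, 1, 16000)        # one second at 16 kHz
pitch = torch.randn(1, 1, 100)          # hypothetical pitch contour, 100 control frames
print(net(audio, pitch).shape)          # torch.Size([1, 32, 16000])
```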

Manzelli, R. et al. (2018) ‘Conditioning deep generative raw audio models for structured automatic music’, in Proceedings of the 19th International Society for Music Information Retrieval Conference, ISMIR 2018, pp. 182–189. doi: 10.5281/zenodo.1492375.

Conditioning Deep Generative Raw Audio Models for Structured Automatic Music

Existing automatic music generation approaches that feature deep learning can be broadly classified into two types: raw audio models and symbolic models. Symbolic models, which train and generate at the note level, are currently the more prevalent approach; these models can capture long-range dependencies of melodic structure, but fail to grasp the nuances and richness of raw audio generations.

Conditioning WaveNet on MIDI

Hawthorne, C. et al. (2018) ‘Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset’, arXiv. Available at: http://arxiv.org/abs/1810.12247 (Accessed: 24 May 2021).

Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset

Generating musical audio directly with neural networks is notoriously difficult because it requires coherently modeling structure at many different timescales. Fortunately, most music is also highly structured and can be represented as discrete note events played on musical instruments.


The Wave2MIDI2Wave model

Timbre mapping - disentangling pitch and timbre → makes it possible to create sounds halfway between one timbre and another.
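
In code, "a sound halfway between two timbres" amounts to interpolating in the learned latent space while holding the pitch conditioning fixed. A hedged sketch with placeholder encoder/decoder (their real counterparts are the trained networks from the papers cited below):

```python
import torch

def interpolate_timbre(encoder, decoder, sound_a, sound_b, pitch, alpha=0.5):
    """Blend the timbre latents of two sounds and resynthesize at a fixed pitch."""
    z_a, z_b = encoder(sound_a), encoder(sound_b)
    z_mix = (1 - alpha) * z_a + alpha * z_b            # alpha=0.5 -> halfway between timbres
    return decoder(z_mix, pitch)

# Toy usage with stand-in encoder/decoder so the sketch runs end to end.
enc = lambda x: x.mean(dim=-1)                         # (batch, ch, time) -> (batch, ch)
dec = lambda z, pitch: z.unsqueeze(-1) + pitch         # trivially "conditioned" decode
a, b = torch.randn(1, 16, 100), torch.randn(1, 16, 100)
print(interpolate_timbre(enc, dec, a, b, pitch=torch.zeros(1, 16, 1)).shape)
```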

Kim, J. W. et al. (2018) ‘Neural Music Synthesis for Flexible Timbre Control’, in ICASSP 2019 - IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 176–180. Available at: http://arxiv.org/abs/1811.00223 (Accessed: 24 May 2021).


Esling, P., Chemla–Romeu-Santos, A. and Bitton, A. (2018) ‘Generative timbre spaces: Regularizing variational auto-encoders with perceptual metrics’, in DAFx 2018 - Proceedings: 21st International Conference on Digital Audio Effects. DAFx18, pp. 369–376. Available at: http://arxiv.org/abs/1805.08501 (Accessed: 24 May 2021).


Engel, J. et al. (2017) ‘Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders’. Available at: http://arxiv.org/abs/1704.01279 (Accessed: 8 April 2017).

Music Translation

Style transfer for audio

Mor, N. et al. (2018) ‘A Universal Music Translation Network’. Available at: http://arxiv.org/abs/1805.07848 (Accessed: 23 May 2018).

Kumar, K. et al. (2019) ‘MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis’, arXiv. Available at: http://arxiv.org/abs/1910.06711 (Accessed: 24 May 2021).

To read next

The ones I kept putting off..

Engel, J. et al. (2017) ‘Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders’. Available at: http://arxiv.org/abs/1704.01279 (Accessed: 8 April 2017).

Hawthorne, C. et al. (2018) ‘Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset’, arXiv. Available at: http://arxiv.org/abs/1810.12247 (Accessed: 24 May 2021).

Défossez, A. et al. (2018) ‘SING: Symbol-to-Instrument Neural Generator’. Available at: http://arxiv.org/abs/1810.09785.

Engel, J. et al. (2019) ‘GANSynth: Adversarial Neural Audio Synthesis’, arXiv. Available at: http://arxiv.org/abs/1902.08710 (Accessed: 24 May 2021).

Kumar, K. et al. (2019) ‘MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis’, arXiv. Available at: http://arxiv.org/abs/1910.06711 (Accessed: 24 May 2021).

Further Thoughts

  • There was no mention of DDSP.