Huzaifah bin Md Shahrin, M. and Wyse, L. (2020) ‘Deep Generative Models for Musical Audio Synthesis’, arXiv.
Generative Model
Stated formally: the problem is how to minimize a distance D(p_data, p_θ) between the target probability distribution p_data and the distribution p_θ of the model's outputs.
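As a side note (my own addition, not from the survey): taking that distance to be the forward KL divergence recovers maximum likelihood training.

```latex
% Minimizing forward KL between the data distribution and the model
% is equivalent to maximizing the expected log-likelihood:
\min_\theta D_{\mathrm{KL}}\!\left(p_{\mathrm{data}} \,\|\, p_\theta\right)
  = \min_\theta \; \mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[\log p_{\mathrm{data}}(x) - \log p_\theta(x)\right]
  \;\Longleftrightarrow\;
  \max_\theta \; \mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[\log p_\theta(x)\right]
```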
Autoregressive Models
Models that predict the next token from the preceding sequence.
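A minimal sketch of what this means at generation time (my illustration; `model` is a hypothetical callable returning next-token logits, not an API from any cited paper):

```python
import torch

def sample_autoregressive(model, seed, steps):
    """Ancestral sampling: each new token is drawn from the model's
    predictive distribution given everything generated so far."""
    tokens = list(seed)
    for _ in range(steps):
        context = torch.tensor(tokens).unsqueeze(0)        # (1, T)
        logits = model(context)[0, -1]                     # logits for the next token
        probs = torch.softmax(logits, dim=-1)
        tokens.append(torch.multinomial(probs, 1).item())  # sample one token
    return tokens
```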
RNN-based
SampleRNN: An Unconditional End-to-End Neural Audio Generation Model
In this paper we propose a novel model for unconditional audio generation based on generating one audio sample at a time. We show that our model, which profits from combining memory-less modules, namely autoregressive multilayer perceptrons, and stateful recurrent neural networks in a hierarchical structure is able to capture underlying sources of variations in the temporal sequences over very long time spans, on three datasets of different nature.
Dilated CNN
WaveNet: A Generative Model for Raw Audio
This paper introduces WaveNet, a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones; nonetheless we show that it can be efficiently trained on data with tens of thousands of samples per second of audio.
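A rough sketch of the dilated causal convolution idea (assumptions: PyTorch, kernel size 2, and no gated units or residual/skip connections, so this shows only the receptive-field mechanism, not the full WaveNet):

```python
import torch.nn as nn
import torch.nn.functional as F

class DilatedCausalConv1d(nn.Module):
    def __init__(self, channels, dilation):
        super().__init__()
        self.pad = dilation  # (kernel_size - 1) * dilation, with kernel_size = 2
        self.conv = nn.Conv1d(channels, channels, kernel_size=2, dilation=dilation)

    def forward(self, x):            # x: (batch, channels, time)
        x = F.pad(x, (self.pad, 0))  # left-pad only => causal, no future leakage
        return self.conv(x)

# Doubling the dilation each layer grows the receptive field exponentially:
# dilations 1, 2, 4, ..., 512 give a receptive field of 1024 samples.
stack = nn.Sequential(*[DilatedCausalConv1d(32, 2 ** i) for i in range(10)])
```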
WaveRNN
Can generate 24 kHz, 16-bit audio at 4× real time.
Variational Autoencoder
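For reference (my own addition), the objective the VAE papers below build on is the evidence lower bound (ELBO):

```latex
% Reconstruction term plus a KL regularizer toward the prior p(z);
% q_\phi is the encoder, p_\theta the decoder.
\log p_\theta(x) \;\ge\;
  \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]
  - D_{\mathrm{KL}}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right)
```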
Learning Disentangled Representations of Timbre and Pitch for Musical Instrument Sounds Using Gaussian Mixture Variational Autoencoders
In this paper, we learn disentangled representations of timbre and pitch for musical instrument sounds. We adapt a framework based on variational autoencoders with Gaussian mixture latent distributions. Specifically, we use two separate encoders to learn distinct latent spaces for timbre and pitch, which form Gaussian mixture components representing instrument identity and pitch, respectively.
Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders
Generative models in vision have seen rapid progress due to algorithmic improvements and the availability of high-quality image datasets. In this paper, we offer contributions in both these areas to enable similar progress in audio modeling. First, we detail a powerful new WaveNet-style autoencoder model that conditions an autoregressive decoder on temporal codes learned from the raw audio waveform.
Normalizing Flow Model
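The defining identity (my own addition for context): an invertible map with a tractable Jacobian gives exact likelihoods via a change of variables.

```latex
% x = f_\theta(z) with z ~ p_z and f_\theta invertible:
\log p_x(x) = \log p_z\!\left(f_\theta^{-1}(x)\right)
  + \log \left| \det \frac{\partial f_\theta^{-1}(x)}{\partial x} \right|
```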
Parallel WaveNet: Fast High-Fidelity Speech Synthesis
The recently-developed WaveNet architecture is the current state of the art in realistic speech synthesis, consistently rated as more natural sounding for many different languages than any previous system. However, because WaveNet relies on sequential generation of one audio sample at a time, it is poorly suited to today's massively parallel computers, and therefore hard to deploy in a real-time production setting.
Generative Adversarial Networks
Compared with image generation, GAN research on audio is still relatively undeveloped.
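For context (my own addition), the standard GAN objective these papers adapt to audio:

```latex
% Discriminator D vs. generator G over a noise prior p_z:
\min_G \max_D \;
  \mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z}\!\left[\log\!\left(1 - D(G(z))\right)\right]
```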
Adversarial Audio Synthesis
Audio signals are sampled at high temporal resolutions, and learning to synthesize audio requires capturing structure across a range of timescales. Generative adversarial networks (GANs) have seen wide success at generating images that are both locally and globally coherent, but they have seen little application to audio generation.
For what it's worth, this is roughly the state of the art (at least as of 2020).
Engel, J. et al. (2019) ‘GANSynth: Adversarial neural audio synthesis’, arXiv. Available at: http://arxiv.org/abs/1902.08710 (Accessed: 24 May 2021).
GANSynth: Adversarial Neural Audio Synthesis
Efficient audio synthesis is an inherently difficult machine learning task, as human perception is sensitive to both global structure and fine-scale waveform coherence.
Conditioning
Obviously, to use one of these models as a synthesizer you need to be able to control the timbre. Autoregressive models such as RNNs generate from the influence of the immediately preceding sequence → as the sequence grows longer, the influence of the initially supplied seed sequence fades away.
For example: synthesizing a sine wave, where the pitch drifts further off as generation proceeds. ← Conditioning can relieve the model of the need to memorize long-range temporal dependencies.
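A minimal sketch of what conditioning looks like mechanically (assumptions: PyTorch; simplified to a plain tanh, whereas WaveNet proper uses gated tanh/sigmoid units; all names here are mine):

```python
import torch
import torch.nn as nn

class ConditionedLayer(nn.Module):
    """A conditioning vector h (e.g. pitch, or a MIDI-derived feature) is
    projected and added to the pre-activation at every timestep, so the
    model is told the pitch rather than having to remember it."""
    def __init__(self, channels, cond_dim):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=1)
        self.cond_proj = nn.Linear(cond_dim, channels)

    def forward(self, x, h):                     # x: (B, C, T), h: (B, cond_dim)
        bias = self.cond_proj(h).unsqueeze(-1)   # (B, C, 1), broadcast over time
        return torch.tanh(self.conv(x) + bias)
```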
Manzelli, R. et al. (2018) ‘Conditioning deep generative raw audio models for structured automatic music’, in Proceedings of the 19th International Society for Music Information Retrieval Conference, ISMIR 2018, pp. 182–189. doi: 10.5281/zenodo.1492375.
Conditioning Deep Generative Raw Audio Models for Structured Automatic Music
Existing automatic music generation approaches that feature deep learning can be broadly classified into two types: raw audio models and symbolic models. Symbolic models, which train and generate at the note level, are currently the more prevalent approach; these models can capture long-range dependencies of melodic structure, but fail to grasp the nuances and richness of raw audio generations.
Conditioning WaveNet on MIDI.
Hawthorne, C. et al. (2018) ‘Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset’, arXiv. Available at: http://arxiv.org/abs/1810.12247 (Accessed: 24 May 2021).
Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset
Generating musical audio directly with neural networks is notoriously difficult because it requires coherently modeling structure at many different timescales. Fortunately, most music is also highly structured and can be represented as discrete note events played on musical instruments.
The Wave2Midi2Wave model: a factorized pipeline of transcription (audio → MIDI), note-level modeling, and synthesis (MIDI → audio with a conditioned WaveNet).
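In pseudocode form (my paraphrase of the paper's pipeline; all three callables are hypothetical stand-ins for the trained components):

```python
def wave2midi2wave(audio, transcriber, note_model, synthesizer):
    midi = transcriber(audio)   # audio -> note events (transcription model)
    midi = note_model(midi)     # generate / continue at the note level
    return synthesizer(midi)    # note events -> raw audio (conditioned WaveNet)
```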
Timbre mapping: disentangling pitch and timbre → makes it possible to create sounds partway between one timbre and another (see the sketch after the citations below).
Kim, J. W. et al. (2018) ‘Neural Music Synthesis for Flexible Timbre Control’, in ICASSP 2019 - IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 176–180. Available at: http://arxiv.org/abs/1811.00223 (Accessed: 24 May 2021).
Esling, P., Chemla–Romeu-Santos, A. and Bitton, A. (2018) ‘Generative timbre spaces: Regularizing variational auto-encoders with perceptual metrics’, in DAFx 2018 - Proceedings: 21st International Conference on Digital Audio Effects. DAFx18, pp. 369–376. Available at: http://arxiv.org/abs/1805.08501 (Accessed: 24 May 2021).
Engel, J. et al. (2017) ‘Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders’. Available at: http://arxiv.org/abs/1704.01279 (Accessed: 8 April 2017).
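A sketch of the interpolation idea referenced above (assumptions: a trained model, e.g. a VAE, whose timbre latent is separate from pitch; `encode_timbre` and `decode` are hypothetical stand-ins):

```python
def interpolate_timbre(encode_timbre, decode, sound_a, sound_b, alpha=0.5):
    """Blend the timbre codes of two instrument sounds; with pitch held in
    a separate latent, the result is an in-between timbre at the same pitch."""
    z_a = encode_timbre(sound_a)
    z_b = encode_timbre(sound_b)
    z_mix = (1.0 - alpha) * z_a + alpha * z_b   # linear interpolation
    return decode(z_mix)
```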
Music Translation
Style transfer for audio.
Mor, N. et al. (2018) ‘A Universal Music Translation Network’. Available at: http://arxiv.org/abs/1805.07848 (Accessed: 23 May 2018).
Kumar, K. et al. (2019) ‘MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis’, arXiv. Available at: http://arxiv.org/abs/1910.06711 (Accessed: 24 May 2021).
To read next
Papers I've been putting off for ages..
Engel, J. et al. (2017) ‘Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders’. Available at: http://arxiv.org/abs/1704.01279 (Accessed: 8 April 2017).
Hawthorne, C. et al. (2018) ‘Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset’, arXiv. Available at: http://arxiv.org/abs/1810.12247 (Accessed: 24 May 2021).
Défossez, A. et al. (2018) ‘SING: Symbol-to-Instrument Neural Generator’. Available at: http://arxiv.org/abs/1810.09785.
Engel, J. et al. (2019) ‘GANSynth: Adversarial neural audio synthesis’, arXiv. Available at: http://arxiv.org/abs/1902.08710 (Accessed: 24 May 2021).
Kumar, K. et al. (2019) ‘MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis’, arXiv. Available at: http://arxiv.org/abs/1910.06711 (Accessed: 24 May 2021).
Further Thoughts
- There was no mention of DDSP (Differentiable Digital Signal Processing).