Entry

Granular synthesis with deep learning

Simple Title

Neural Granular Sound Synthesis

Description

An attempt to generate the grains (tiny fragments of sound) used in granular synthesis with a VAE. The model also learns trajectories through the grain space.

Type

Paper

Year

2020

Posted at

March 30, 2021

Overview

An attempt to generate the grains used in granular synthesis with a VAE.

The model also learns trajectories through the grain space.

Abstract

Granular sound synthesis is a popular audio generation technique based on rearranging sequences of small waveform windows. In order to control the synthesis, all grains in a given corpus are analyzed through a set of acoustic descriptors. This provides a representation reflecting some form of local similarities across the grains. However, the quality of this grain space is bound by that of the descriptors. Its traversal is not continuously invertible to signal and does not render any structured temporality. We demonstrate that generative neural networks can implement granular synthesis while alleviating most of its shortcomings. We efficiently replace its audio descriptor basis by a probabilistic latent space learned with a Variational Auto-Encoder. A major advantage of our proposal is that the resulting grain space is invertible, meaning that we can continuously synthesize sound when traversing its dimensions. It also implies that original grains are not stored for synthesis. To learn structured paths inside this latent space, we add a higher-level temporal embedding trained on arranged grain sequences. The model can be applied to many types of libraries, including pitched notes or unpitched drums and environmental noises. We experiment with the common granular synthesis processes and enable new ones.

Motivation

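Granular synthesis is a sound synthesis algorithm that builds sounds by combining short audio fragments (grains) of roughly 10-100 ms. Conventionally, each grain is analyzed and tagged with acoustic descriptors, and grains are then combined on the basis of that information. The quality of granular synthesis therefore depends on the quality of the descriptors, and because the grain "space" changes discontinuously, there is no guarantee that the desired timbre can be produced.

This work instead generates the grains themselves with a VAE (variational autoencoder), which yields a grain space that varies continuously. Trajectories through this grain space are then learned with a separate model.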
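For reference, the grain-combining step of classic granular synthesis can be sketched as a windowed overlap-add. This is a minimal illustration and not code from the paper: the function name `overlap_add`, the Hann window, the 50% overlap, the 16 kHz sample rate, and the sine-wave grains are all assumptions chosen for the example.

```python
import numpy as np

def overlap_add(grains, hop):
    """Sum Hann-windowed grains placed every `hop` samples (hypothetical helper)."""
    n_grains, grain_size = grains.shape
    out = np.zeros(hop * (n_grains - 1) + grain_size)
    window = np.hanning(grain_size)
    for i, grain in enumerate(grains):
        start = i * hop
        out[start:start + grain_size] += window * grain
    return out

sr = 16000                      # assumed sample rate
grain_size = sr // 20           # 50 ms grains, inside the 10-100 ms range
t = np.arange(grain_size) / sr
# Stand-in grains: short sine bursts at different pitches.
grains = np.stack([np.sin(2 * np.pi * f * t) for f in (220.0, 330.0, 440.0, 550.0)])
audio = overlap_add(grains, hop=grain_size // 2)
```

In a descriptor-based system the rows of `grains` would be picked from a recorded corpus; in the approach summarized here they would be generated by the model instead.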

Architecture

Overview of the proposed method

Grain generation

A specific grain is produced by applying frequency-domain coefficients (a filter) to uniform noise.

$$
\begin{aligned}
\hat{\mathbf{X}}_{i} &= \mathbf{H}_{i} * \operatorname{DFT}\left(\mathbf{n}_{i}\right) \\
\hat{\mathbf{x}}_{i} &= \operatorname{iDFT}\left(\hat{\mathbf{X}}_{i}\right)
\end{aligned}
$$

$\mathbf{n}_{i}$: uniform noise

$\mathbf{H}_{i}$: frequency-domain coefficients (learned by the VAE)

$\hat{\mathbf{x}}_{i}$: resynthesized grain

$\operatorname{DFT}$: discrete Fourier transform
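The filtering step above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the paper's implementation: `synthesize_grain` is a hypothetical name, the real-input FFT (`rfft`/`irfft`) is used in place of a full DFT, and a hand-made band-pass around 1 kHz stands in for the coefficients $\mathbf{H}_{i}$ that the VAE would learn.

```python
import numpy as np

def synthesize_grain(H, rng):
    """Shape uniform noise with frequency-domain coefficients H (one per rFFT bin)."""
    grain_size = 2 * (H.shape[0] - 1)            # even grain length implied by the bin count
    noise = rng.uniform(-1.0, 1.0, grain_size)   # n_i: uniform noise
    X_hat = H * np.fft.rfft(noise)               # element-wise product in the frequency domain
    return np.fft.irfft(X_hat, n=grain_size)     # x_hat_i: back to a waveform

rng = np.random.default_rng(0)
grain_size = 1024
sr = 16000                                       # assumed sample rate
freqs = np.fft.rfftfreq(grain_size, d=1.0 / sr)
# Hypothetical filter: a Gaussian band-pass centred on 1 kHz (stand-in for the learned H_i).
H = np.exp(-0.5 * ((freqs - 1000.0) / 200.0) ** 2)
grain = synthesize_grain(H, rng)
```

Because the grain is fully determined by the coefficients and a fresh noise draw, no recorded grain needs to be stored for synthesis, which matches the point made in the abstract.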

Trajectories in grain space

The trajectory through the latent $z$ space of the grain VAE is learned with a recurrent model.

Training loss

$$
\mathcal{L}_{\theta, \phi}=\underbrace{\sum_{n=1}^{N}\left\|l_{n}(\mathbf{x})-l_{n}(\hat{\mathbf{x}})\right\|_{1}}_{\text {reconstructions }}+\beta * \underbrace{\sum_{i=1}^{g} \mathcal{D}_{K L}\left[q_{\phi}\left(\mathbf{z}_{i} \mid \mathbf{x}_{i}\right) \| p_{\theta}(\mathbf{z})\right]}_{\text {regularizations }}
$$

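Reconstruction loss (resynthesis accuracy): uses the multi-scale spectrogram loss also used in DDSP and elsewhere. Spectrograms are computed with FFT window sizes increasing through $[128, 256, 512, 1024, 2048]$, and the $L_1$ distance between the target and reconstructed spectrograms is taken at each size.
Regularization: the regularization term commonly used when training a VAE (the KL divergence between the approximate posterior and the prior, weighted by $\beta$).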
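The two terms of the loss can be sketched in NumPy as below. Several details are assumptions made for illustration and are not confirmed by the summary above: $l_{n}$ is taken to be a plain magnitude spectrogram with a Hann window and 75% overlap, the prior $p_{\theta}(\mathbf{z})$ is taken to be a standard normal, the posterior is a diagonal Gaussian described by `mu` and `logvar`, and all function names are hypothetical.

```python
import numpy as np

def magnitude_spectrogram(x, n_fft):
    """Magnitude STFT with a Hann window; hop = n_fft // 4 is an assumption."""
    hop = n_fft // 4
    x = np.pad(x, (0, max(0, n_fft - len(x))))   # ensure at least one full frame
    n_frames = 1 + (len(x) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = x[idx] * np.hanning(n_fft)
    return np.abs(np.fft.rfft(frames, axis=-1))

def multiscale_spectral_l1(x, x_hat, fft_sizes=(128, 256, 512, 1024, 2048)):
    """Sum over FFT sizes of the L1 distance between the two spectrograms."""
    return sum(
        np.abs(magnitude_spectrogram(x, n) - magnitude_spectrogram(x_hat, n)).sum()
        for n in fft_sizes
    )

def kl_to_standard_normal(mu, logvar):
    """KL[N(mu, diag(exp(logvar))) || N(0, I)], summed over grains and latent dims."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)

def vae_loss(x, x_hat, mu, logvar, beta=1.0):
    """Reconstruction term plus beta-weighted regularization term."""
    return multiscale_spectral_l1(x, x_hat) + beta * kl_to_standard_normal(mu, logvar)

rng = np.random.default_rng(0)
x = rng.standard_normal(4096)                    # stand-in target waveform
x_hat = x + 0.1 * rng.standard_normal(4096)      # stand-in reconstruction
mu = np.zeros((8, 16))                           # 8 grains, 16 latent dims (arbitrary sizes)
logvar = np.zeros((8, 16))
loss = vae_loss(x, x_hat, mu, logvar, beta=0.5)
```

Comparing spectrograms at several window sizes trades off time and frequency resolution, so errors that are invisible at one scale still contribute at another.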

Results

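Beyond learning instrument sounds, the model also achieves sound transfer by applying a trajectory from one grain space to a different grain space.

Synthesized instrument sounds

Moving around arbitrarily in the grain space...

Timbre transfer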

Further Thoughts

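No code has been released, so the implementation details are very hard to work out.
It is also hard to follow without first understanding the DDSP implementation.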

Links

Deep generative models for musical audio synthesis

Sound modelling is the process of developing algorithms that generate sound under parametric control. There are a few distinct approaches that have been developed historically including modelling the physics of sound production and propagation, assembling signal generating and processing elements to capture acoustic features, and manipulating collections of recorded audio samples.

arxiv.org

Neural Granular Sound Synthesis