Entry

Waveshaping Synthesis with Deep Learning - NEURAL WAVESHAPING SYNTHESIS

Simple Title

Hayes, B., Saitis, C., & Fazekas, G. (2021). Neural Waveshaping Synthesis.

Description

The key point: it runs smoothly even on a CPU!

Type

Paper

Year

2021

Posted at

January 19, 2022

Overview

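Many deep-learning models for sound synthesis, and for instrument sound synthesis in particular, have been proposed recently (DDSP, GANSynth, NSynth, etc.). But all of them are heavy! The computational cost is high and a GPU is required.
This work uses simple MLPs (dense layers) to build an instrument sound synthesis method that is fully usable even on a CPU.
It is an application of the fairly classical technique of waveshaping synthesis.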

Abstract

We present the Neural Waveshaping Unit (NEWT): a novel, lightweight, fully causal approach to neural audio synthesis which operates directly in the waveform domain, with an accompanying optimisation (FastNEWT) for efficient CPU inference. The NEWT uses time-distributed multilayer perceptrons with periodic activations to implicitly learn nonlinear transfer functions that encode the characteristics of a target timbre. Once trained, a NEWT can produce complex timbral evolutions by simple affine transformations of its input and output signals. We paired the NEWT with a differentiable noise synthesiser and reverb and found it capable of generating realistic musical instrument performances with only 260k total model parameters, conditioned on F0 and loudness features. We compared our method to state-of-the-art benchmarks with a multi-stimulus listening test and the Fréchet Audio Distance and found it performed competitively across the tested timbral domains. Our method significantly outperformed the benchmarks in terms of generation speed, and achieved real-time performance on a consumer CPU, both with and without FastNEWT, suggesting it is a viable basis for future creative sound design tools.

Motivation

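AI audio synthesis is, above all, computationally heavy. That is a major obstacle for musicians who want to use it in actual composition → the goal is something that runs smoothly even on a CPU, so that musicians and artists can use it casually.

Avoid complex deep-learning layers as far as possible. Implement it as a combination of simple MLPs (multi-layer perceptrons), i.e. feed-forward layers.

Combine the long-established technique of waveshaping with deep learning.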
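The idea of a learned shaping function can be sketched in a few lines of NumPy. This is only an illustration of the concept described in the abstract (a time-distributed MLP with periodic activations, wrapped in affine transforms of its input and output), not the authors' implementation: the weights are random and untrained, and the layer sizes, sample rate, and all function names (`init_mlp`, `neural_shaper`, `shaped_signal`) are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(sizes, rng):
    """Random (untrained) weights for a small dense network."""
    return [(rng.normal(0.0, 1.0, (m, n)) / np.sqrt(m), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def neural_shaper(x, params):
    """Apply the same MLP to every sample independently (time-distributed)."""
    h = x[:, None]                        # shape (num_samples, 1)
    for w, b in params[:-1]:
        h = np.sin(h @ w + b)             # periodic (sine) activation
    w, b = params[-1]
    return (h @ w + b)[:, 0]              # shape (num_samples,)

def shaped_signal(x, params, in_scale, in_shift, out_scale, out_shift):
    """Affine transform -> shaping function -> affine transform."""
    shaped = neural_shaper(in_scale * x + in_shift, params)
    return out_scale * shaped + out_shift

# Hypothetical sizes: 1 input sample -> two hidden layers of 8 -> 1 output sample.
params = init_mlp([1, 8, 8, 1], rng)

sample_rate = 16000                        # assumed for this sketch
t = np.arange(sample_rate) / sample_rate   # one second of time stamps
x = np.sin(2 * np.pi * 220.0 * t)          # sinusoidal input at an assumed 220 Hz

# A time-varying input scale changes which part of the shaping function is used.
index = np.linspace(0.1, 1.0, t.size)
y = shaped_signal(x, params, index, 0.0, 1.0, 0.0)
```

Because the network sees one sample at a time, it acts as a memoryless transfer function, which is what makes it a waveshaper rather than a filter. In a trained model the weights would encode a target timbre.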

Waveshaping?

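Waveshaping is a simple technique: the values of an input waveform $x(t)$ are mapped to new values by another function $f$, the shaping function. It is common to multiply the input by a time-varying factor $a(t)$, the distortion index, so that the input amplitude changes over time:

$$y = f(a(t)\,x(t))$$

When the input is an instrument sound, it normally has harmonics (an overtone structure). So if $x(t)$ is taken to be $\cos(\omega t)$, $f$ can be chosen so that the output is a sum of harmonics with amplitudes $h_k$:

$$f(\cos \omega t) = \sum_{k=1}^{\infty} h_{k} \cos k \omega t$$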
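As a small worked example of classical waveshaping, the sketch below builds a shaping function from Chebyshev polynomials of the first kind, which satisfy $T_k(\cos\theta) = \cos k\theta$, so a full-scale cosine input comes out as a weighted sum of harmonics. The sample rate, fundamental frequency, harmonic amplitudes, and the helper name `chebyshev_shaper` are all assumptions chosen for illustration.

```python
import numpy as np
from numpy.polynomial import chebyshev as cheb

def chebyshev_shaper(h):
    """Return f(x) = sum_k h[k] * T_k(x); h[0] is the constant (DC) term."""
    coeffs = np.asarray(h, dtype=float)
    return lambda x: cheb.chebval(x, coeffs)

sample_rate = 16000                        # Hz, assumed
f0 = 220.0                                 # Hz, assumed fundamental
t = np.arange(sample_rate) / sample_rate   # one second of time stamps
x = np.cos(2 * np.pi * f0 * t)             # x(t) = cos(omega t)

h = [0.0, 1.0, 0.5, 0.25]                  # hypothetical amplitudes h_0..h_3
f = chebyshev_shaper(h)

# Full-scale input: the output is the harmonic sum with amplitudes h_k.
y_full = f(x)
expected = sum(hk * np.cos(2 * np.pi * k * f0 * t) for k, hk in enumerate(h))

# Time-varying distortion index a(t): y = f(a(t) x(t)).
a = np.linspace(0.0, 1.0, t.size)
y = f(a * x)
```

With `a` below 1 the input covers only part of the shaping function, so the balance of harmonics shifts as the amplitude changes. This coupling between loudness and timbre is the property that makes waveshaping attractive for instrument-like sounds.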

Architecture

Results

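Timbre synthesis: examples of sounds synthesized from pitch (F0) and volume (loudness) information (left: flute, right: trumpet)

Timbre style transfer: resynthesis based on the pitch and volume of an input vocal track

Input:

Flute:

Trumpet:

Quantitative evaluation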

Further Thoughts

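The model really is light. When I tried the pretrained model on Google Colab, synthesizing 20 seconds of audio took less than 4 seconds. That is 5 times real-time speed. With FastNEWT it is more than twice as fast again.
However, the sound is not that good. When I played the violin model to a friend, they said it sounded like an elementary school student who has just started learning the violin and is barely managing to play.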
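The speed figure above is a real-time factor: seconds of audio produced per second of compute. A minimal sketch of that arithmetic follows. The function name `realtime_factor` is hypothetical, and the numbers are the rounded ones from the Colab run described here.

```python
def realtime_factor(audio_seconds, wall_seconds):
    """Seconds of audio produced per second of wall-clock compute time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds

# 20 s of audio in (just under) 4 s of compute.
rtf = realtime_factor(20.0, 4.0)   # 5.0, a lower bound since the run took less than 4 s
```

Any value above 1 means the model can keep up with real-time playback on that machine.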

Links

A Google Colab notebook you can try right away

Google Colaboratory

colab.research.google.com

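A simple waveshaping synthesis demo using PureData

Waveshaping synthesis

Instruments such as pianos, guitars, and wind instruments produce the timbre of a loud note when played hard and the timbre of a soft note when played softly. In other words, timbre and intensity are correlated, but with electronic sounds it is possible to change only the volume while keeping the same timbre. That can be a good thing, but it differs from the response of the "natural" instruments we know. ...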

puredatajapan.info

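Applications of waveshaping synthesis

You should have grasped the basics of waveshaping from the previous tutorial; this time we cover applications. Using Chebyshev polynomials: the tutorial "E05.Chebychev.pd" bundled with PureData shows how to control harmonics with waveshaping using Chebyshev polynomials. ...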

puredatajapan.info

Closely related work: Google Magenta's DDSP, which the paper brings up as being computationally heavy.

DDSP: Differentiable Digital Signal Processing

Today, we're pleased to introduce the Differentiable Digital Signal Processing (DDSP) library. DDSP lets you combine the interpretable structure of classical DSP elements (such as filters, oscillators, reverberation, etc.) with the expressivity of deep learning. Neural networks (such as WaveNet or GANSynth) are often black boxes.

magenta.tensorflow.org