Entry

LoopNet: Drum Loop Generation

Simple Title

Chandna, P., Ramires, A., Serra, X., & Gómez, E. (2021). LoopNet: Musical Loop Synthesis Conditioned On Intuitive Musical Parameters.

Description

A system that generates entire drum loops using the Wave-U-Net architecture, which was originally proposed as a source separation model.

Type

Paper

Year

2021

Posted at

June 5, 2021

Overview - What's Impressive?

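A system that generates whole drum loops in one pass using the Wave-U-Net architecture, which was originally proposed as a source separation model.
It builds on earlier research by the same group on generating one-shot percussion and drum sounds.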

📄Percussion Sound Synthesis - NEURAL PERCUSSIVE SYNTHESIS

Abstract

Loops, seamlessly repeatable musical segments, are a cornerstone of modern music production. Contemporary artists often mix and match various sampled or pre-recorded loops based on musical criteria such as rhythm, harmony and timbral texture to create compositions. Taking such criteria into account, we present LoopNet, a feed-forward generative model for creating loops conditioned on intuitive parameters. We leverage Music Information Retrieval (MIR) models as well as a large collection of public loop samples in our study and use the Wave-U-Net architecture to map control parameters to audio. We also evaluate the quality of the generated audio and propose intuitive controls for composers to map the ideas in their minds to an audio loop.

Motivation

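Production in recent music genres relies heavily on combining loop material (for example in Ableton Live).
AI-based sound generation, on the other hand, mostly works at the level of single sounds → this work attempts generation at the level of whole drum loops.
Generation is conditioned on the volume of each drum sound (kick, snare, hi-hat).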

Architecture

Model architecture

Uses the Wave-U-Net architecture proposed for source separation (see the figure at the top).
Uses the information introduced in the Conditioning section that follows.
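To make the shape of a Wave-U-Net-style generator concrete, here is a minimal NumPy sketch of the forward pass: downsampling blocks that store skip features, a bottleneck, and upsampling blocks that concatenate those skips before a final one-channel output. It is not the LoopNet implementation. The number of levels, channel counts, kernel size, and the function names are all hypothetical, the weights are random, and nearest-neighbour repetition stands in for the linear-interpolation upsampling used in the original Wave-U-Net.

```python
import numpy as np

rng = np.random.default_rng(0)  # random weights only; nothing is trained here

def conv1d_same(x, w):
    """Zero-padded 1-D convolution. x: (c_in, T), w: (c_out, c_in, k) with k odd."""
    c_out, c_in, k = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    out = np.zeros((c_out, x.shape[1]))
    for t in range(k):
        out += w[:, :, t] @ xp[:, t : t + x.shape[1]]
    return out

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def wave_u_net_forward(cond, n_levels=3, base_ch=8, k=5):
    """Map conditioning (c_cond, T) to a waveform (1, T) in [-1, 1].

    All hyperparameters are assumptions made for illustration.
    """
    assert cond.shape[1] % (2 ** n_levels) == 0, "T must be divisible by 2**n_levels"
    x, skips = cond, []
    ch_in = cond.shape[0]
    # Downsampling path: convolve, remember the features, drop every other sample.
    for level in range(n_levels):
        ch_out = base_ch * (level + 1)
        x = leaky_relu(conv1d_same(x, 0.1 * rng.standard_normal((ch_out, ch_in, k))))
        skips.append(x)
        x = x[:, ::2]
        ch_in = ch_out
    # Bottleneck at the coarsest time resolution.
    x = leaky_relu(conv1d_same(x, 0.1 * rng.standard_normal((ch_in, ch_in, k))))
    # Upsampling path: double the length, concatenate the matching skip, convolve.
    for skip in reversed(skips):
        x = np.repeat(x, 2, axis=1)
        x = np.concatenate([x, skip], axis=0)
        x = leaky_relu(conv1d_same(x, 0.1 * rng.standard_normal((skip.shape[0], x.shape[0], k))))
    # Final 1x1 convolution down to a single audio channel.
    return np.tanh(conv1d_same(x, 0.1 * rng.standard_normal((1, x.shape[0], 1))))
```

Calling `wave_u_net_forward(np.ones((19, 64)))` returns an array of shape `(1, 64)`. The 19 input channels are just an example count for stacked conditioning features.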

Loss functions

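Three loss functions were tested: a simple reconstruction loss; a loss that adds an STFT term; and a multi-scale loss that sums the STFT losses computed with FFT window sizes ranging from small to large.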

\begin{array}{c}\mathcal{L}_{\mathrm{recon}}=\mathbb{E}\left[\|\hat{x}-x\|_{1}\right] \\\\ \mathcal{L}_{\mathrm{stft}}=\mathbb{E}\left[\|\hat{x}-x\|_{1}\right]+\mathbb{E}\left[\|\operatorname{STFT}(\hat{x})-\operatorname{STFT}(x)\|_{1}\right] \\\\ \mathcal{L}_{\mathrm{multi}}=\mathbb{E}\left[\|\hat{x}-x\|_{1}\right]+\sum_{i=0}^{5} \mathbb{E}\left[\left\|\operatorname{STFT}_{i}(\hat{x})-\operatorname{STFT}_{i}(x)\right\|_{1}\right]\end{array}

$x$: original signal, $\hat{x}$: synthesized signal
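The three losses can be sketched in NumPy as follows. This is an illustration under assumptions, not the authors' code: the notes do not give the FFT sizes, so the six sizes below (matching the sum over $i = 0 \dots 5$) are borrowed from common multi-scale spectral loss setups, and the Hann window, the hop of a quarter window, the use of STFT magnitudes, and the use of a mean absolute difference in place of the raw L1 norm are also choices made here.

```python
import numpy as np

def stft_mag(x, n_fft, hop=None):
    """Magnitude STFT of a 1-D signal with a Hann window (one frame per row)."""
    hop = hop or n_fft // 4
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

def l1(a, b):
    """Mean absolute difference, used here as a scaled stand-in for the L1 norm."""
    return np.mean(np.abs(a - b))

def loss_recon(x_hat, x):
    # L_recon: waveform-domain reconstruction loss
    return l1(x_hat, x)

def loss_stft(x_hat, x, n_fft=1024):
    # L_stft: waveform loss plus a single-resolution STFT loss (n_fft is an assumption)
    return l1(x_hat, x) + l1(stft_mag(x_hat, n_fft), stft_mag(x, n_fft))

# Assumed FFT sizes for STFT_0 ... STFT_5; the notes do not list them.
FFT_SIZES = (2048, 1024, 512, 256, 128, 64)

def loss_multi(x_hat, x, fft_sizes=FFT_SIZES):
    # L_multi: waveform loss plus the sum of STFT losses over several resolutions
    return l1(x_hat, x) + sum(l1(stft_mag(x_hat, n), stft_mag(x, n)) for n in fft_sizes)
```

Both signals must be at least as long as the largest FFT size. Small windows give fine time resolution and large windows give fine frequency resolution, which is the usual motivation for summing over several scales.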

Conditioning

Time-varying conditioning - The volume of each drum sound (kick, snare, hi-hat) is extracted automatically with a drum transcription tool and used as conditioning.

Drum transcription

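Global conditioning - A tool is used to analyze characteristics of the overall sound such as warmth, sharpness, brightness, and roughness. The Harmonic Pitch Class Profile (HPCP), which describes how pitch is distributed over the 12 pitch classes, is also computed with a tool. Both are used as conditioning on the loop as a whole.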

Conditioning on the overall volume envelope was also implemented ← in the end it turned out to be largely unnecessary.
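One simple way to combine the two kinds of conditioning is to repeat the global features along the time axis and stack them under the per-drum envelopes. The sketch below shows that idea only. The notes do not say how LoopNet actually feeds these features to the network, so the tiling scheme, the function name `build_conditioning`, and the feature counts (3 drum envelopes, 4 timbral descriptors, 12 HPCP bins) are assumptions.

```python
import numpy as np

def build_conditioning(drum_envelopes, timbral, hpcp):
    """Stack time-varying and global conditioning into one (channels, T) array.

    drum_envelopes: (3, T) volume curves for kick, snare, hi-hat
    timbral:        global descriptors, e.g. warmth, sharpness, brightness, roughness
    hpcp:           12-bin Harmonic Pitch Class Profile for the whole loop
    """
    drum_envelopes = np.asarray(drum_envelopes, dtype=float)
    global_vec = np.concatenate([timbral, hpcp]).astype(float)
    n_frames = drum_envelopes.shape[1]
    # Repeat each global value at every time step so it lines up with the envelopes.
    global_tiled = np.tile(global_vec[:, None], (1, n_frames))
    return np.concatenate([drum_envelopes, global_tiled], axis=0)
```

With 3 envelopes of length 16, 4 timbral values, and 12 HPCP bins, the result has shape `(19, 16)`, and every row after the first three is constant over time.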

Dataset

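Uses data from looperman, a community site where royalty-free loop material can be downloaded.
Loops with a BPM of 120-140 were selected, then run through a tempo estimation algorithm for loops to verify the BPM.
All loops were unified to 130 BPM with a time-stretching algorithm ← the Wave-U-Net model needs to be trained on sound files of the same length.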
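The arithmetic behind the tempo unification is small enough to write out. The sketch below computes the duration multiplier that moves a loop to 130 BPM and builds (but does not run) a Rubber Band command line for it. The helper names are hypothetical, the `--time` option reflects my understanding of the `rubberband` CLI (stretch to X times the original duration), and the notes do not say which options the authors actually used.

```python
TARGET_BPM = 130.0

def time_ratio(source_bpm, target_bpm=TARGET_BPM):
    """Duration multiplier that moves a loop from source_bpm to target_bpm.

    A slower loop (e.g. 120 BPM) must get shorter to reach 130 BPM, so the ratio is < 1.
    """
    return source_bpm / target_bpm

def loop_seconds(n_beats, bpm=TARGET_BPM):
    """Length in seconds of a loop with n_beats beats at the given tempo."""
    return n_beats * 60.0 / bpm

def rubberband_command(src, dst, source_bpm):
    """Argument list for the rubberband CLI (assumed usage: options, infile, outfile)."""
    return ["rubberband", "--time", f"{time_ratio(source_bpm):.6f}", src, dst]
```

For example, a 120 BPM loop needs a duration ratio of 120/130, and after stretching, loops with the same number of beats all have the same length, which is what a fixed-length model input requires.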

Results

Generated results

Original #1

Synthesized #1

Per-drum features of Original #1

On the left is this drum loop's

Blue is kick, orange is snare, light blue is hi-hat

Original #2

Synthesized #2

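According to the Fréchet Audio Distance (FAD), which measures how well audio quality is reproduced (lower is better), the model trained with the multi-scale spectrogram loss and without envelope information is the most accurate (MULTI NOENV 3.35).

Even so, it is worse than simply converting the original loop → spectrogram → back to audio with Griffin-Lim (Griffin-Lim 1.26).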
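For intuition about what the FAD numbers measure: FAD fits a Gaussian to embeddings of the reference audio and another to embeddings of the generated audio, then takes the Fréchet distance between them, $\|\mu_a-\mu_b\|^2 + \mathrm{tr}(\Sigma_a + \Sigma_b - 2(\Sigma_a\Sigma_b)^{1/2})$. The sketch below computes only that distance from two embedding matrices. Extracting the embeddings (the FAD paper uses a pretrained audio classifier) is outside its scope, and the function name is hypothetical.

```python
import numpy as np

def frechet_distance(emb_a, emb_b):
    """Fréchet distance between Gaussians fitted to two embedding sets.

    emb_a, emb_b: arrays of shape (n_examples, n_dims).
    """
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)
    # tr((cov_a cov_b)^(1/2)) equals the sum of square roots of the eigenvalues of
    # cov_a @ cov_b, which are real and non-negative for covariance matrices;
    # clipping removes tiny negative values caused by rounding.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)
```

Two identical embedding sets give a distance of (numerically) zero, and shifting every embedding by a constant vector adds the squared length of that vector, since the covariances are unchanged.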

Further Thoughts

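In terms of audio quality, there is still a way to go.
It is helpful that so many music information retrieval tools are introduced!
Playing with the Colab notebook is the easiest way to understand it.
Creatively misusing it might produce interesting sounds...

With that in mind I tried feeding in strange parameters... but the results were not very interesting.

With typical parameters.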

Read next - Ramires, A., Chandna, P., Favory, X., Gómez, E., & Serra, X. (2019). Neural Percussive Synthesis Parameterised by High-Level Timbral Features. ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, 2020-May, 786–790. http://arxiv.org/abs/1911.11853

Links

A Google Colab notebook you can try right away

Google Colaboratory

colab.research.google.com

Synthesis of one-shot percussion and drum sounds, which this research is based on

📄Percussion Sound Synthesis - NEURAL PERCUSSIVE SYNTHESIS

A library for audio time-stretching (conveniently, it can be run from the command line)

Rubber Band Library

Rubber Band Library is a high quality software library for audio time-stretching and pitch-shifting. It permits you to change the tempo and pitch of an audio stream or recording dynamically and independently of one another. 12th March, 2021: Rubber Band Library v1.9.1 released!

breakfastquay.com

Drum transcription

Southall, C., Stables, R., & Hockman, J. (2017). Automatic drum transcription for polyphonic recordings using soft attention mechanisms and convolutional neural networks. Proc. of the International Society for Music Information Retrieval Conference (ISMIR), 606–612.

www.open-access.bcu.ac.uk

CarlSouthall/ADTLib

The automatic drum transcription (ADT) library contains open source ADT algorithms to aid other researchers in areas of music information retrieval (MIR). The algorithms return both a .txt file of kick drum, snare drum, and hi-hat onsets and an automatically generated drum tabulature.

github.com

Tempo estimation for loop material

ffont/ismir2016

This repository contains code and instructions for reproducing the research described in the paper Font, F., & Serra, X. (2016). Tempo Estimation for Music Loops and a Simple Confidence Measure. In Int. Conf. on Music Information Retrieval (ISMIR). The full text of the paper can be found here.

github.com

Wave-U-Net

Stoller, D., Ewert, S., & Dixon, S. (2018). Wave-U-Net: A Multi-Scale Neural Network for End-to-End Audio Source Separation. Proceedings of the 19th International Society for Music Information Retrieval Conference, ISMIR 2018, 334–340. http://arxiv.org/abs/1806.03185

Wave-U-Net: A Multi-Scale Neural Network for End-to-End Audio Source Separation

Models for audio source separation usually operate on the magnitude spectrum, which ignores phase information and makes separation performance dependant on hyper-parameters for the spectral front-end. Therefore, we investigate end-to-end source separation in the time-domain, which allows modelling phase information and avoids fixed spectral transformations.

arxiv.org

Audio quality metric: Fréchet Audio Distance (FAD)

Fréchet Audio Distance: A Metric for Evaluating Music Enhancement Algorithms

We propose the Fréchet Audio Distance (FAD), a novel, reference-free evaluation metric for music enhancement algorithms. We demonstrate how typical evaluation metrics for speech enhancement and blind source separation can fail to accurately measure the perceived effect of a wide variety of distortions.

arxiv.org

Tool for extracting audio features

AudioCommons/ac-audio-extractor

The Audio Commons Audio Extractor is a tool for analyzing audio files and extract both music properties (for music samples and music pieces) as well as high-level non-musical properties (timbre models). See this blog post for further details about the Audio Commons Audio Extractor.

github.com

Tool used to compute the HPCP

HPCP

streaming mode | Tonal category Computes a Harmonic Pitch Class Profile (HPCP) from the spectral peaks of a signal. HPCP is a k*12 dimensional vector which represents the intensities of the twelve (k==1) semitone pitch classes (corresponding to notes from A to G#), or subdivisions of these (k>1).

essentia.upf.edu

Where the multi-scale loss was proposed

Engel, J., Hantrakul, L., Gu, C., & Roberts, A. (2020). DDSP: Differentiable Digital Signal Processing. arXiv.

DDSP: Differentiable Digital Signal Processing

Most generative models of audio directly generate samples in one of two domains: time or frequency. While sufficient to express any signal, these representations are inefficient, as they do not utilize existing knowledge of how sound is generated and perceived.

arxiv.org
