AIを用いたAudio Visual – Stylizing Audio Reactive Visuals

Entry
AIを用いたAudio Visual – Stylizing Audio Reactive Visuals
Simple Title
Han-Hung Lee, Da-Gin Wu, and Hwann-Tzong Chen, "Stylizing Audio Reactive Visuals", NeurIPS 2019 (2019)
Type
Paper
Year
2019
Posted at
June 24, 2020
Tags
visualGAN

Overview - What makes it impressive?

The visuals switch cleanly in time with the melody and rhythm of the music, and the generated results are highly compelling videos.

Abstract

VJ is an art-form of mixing videos so that the visuals match the mood or groove of the music being played. While the video clips may be sampled from movies or animations, the video frames alone are more static in a sense as the contents of the video frames are fixed. Audio reactive visuals enable more dynamic performances by mapping audio input of the music to some visual effect so that the visuals vibrate or resonate with the music. A common way to do this is to use FFT and filters to obtain a frequency band that might correspond to an instrument such as snare drums, and then map the magnitudes of the changes in this frequency band to the parameters of a visual effect such as blur or distortion. In this work, we explore the use of GANs to produce audio reactive visuals by following these magnitude changes in the frequency band to traverse the latent space of GANs. Because the latent space is smooth and interpolatable, by concatenating the generated images we can form a smooth video clip that reacts to the audio clip. We choose to use StyleGan in particular because it maps a latent vector to several styles that control coarse-to-fine structures of the image, which makes it intuitive to map to when we have multiple features. We also explore using Nsynth to extract features from the audio clip and GAN steerability to learn specific walks in latent space that correspond to effects such as zooming and rotation. We use different pretrained StyleGan models for our experiments. Video results can be found on the website listed in Supplementary Materials.
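As a concrete illustration of the FFT-and-filter feature the abstract mentions, here is a minimal sketch, assuming librosa is available; the file path, band edges, and hop length are illustrative choices, not values from the paper.

```python
# Minimal sketch of the FFT-band feature described in the abstract.
# Assumes librosa; band edges and hop length are illustrative assumptions.
import numpy as np
import librosa

def band_magnitude_changes(path, f_lo=150.0, f_hi=400.0,
                           n_fft=2048, hop_length=512):
    """Per-frame magnitude of one frequency band and its frame-to-frame change."""
    y, sr = librosa.load(path, sr=None, mono=True)
    spec = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length))
    freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)
    band = (freqs >= f_lo) & (freqs < f_hi)           # e.g. a snare-like band
    mag = spec[band].mean(axis=0)                      # band magnitude per frame
    mag = mag / (mag.max() + 1e-8)                     # normalize to [0, 1]
    delta = np.abs(np.diff(mag, prepend=mag[:1]))      # frame-to-frame change
    return mag, delta
```

In the conventional VJ setup the abstract describes, `mag` would drive a visual effect such as blur; in this paper the change signal instead drives movement through the GAN's latent space.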

Motivation

Realize audio-reactive visual expression by traversing the latent space of a GAN, enabling more dynamic performances.

Architecture

This work extracts features from the audio data (using NSynth, frequency filters, and so on), computes how much those features change at each time step, and traverses the latent space of StyleGAN according to those values; a small sketch of this traversal follows below.
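A minimal sketch of that latent-space traversal, assuming a pretrained StyleGAN `generator` with a hypothetical `synthesize` call; only the stepping logic, where the step size is proportional to the audio-feature change, follows the description above.

```python
# Minimal sketch: walk through latent space with step sizes driven by audio changes.
# The generator interface in the usage comment is a hypothetical placeholder.
import numpy as np

def latent_walk(deltas, latent_dim=512, step_scale=2.0, seed=0):
    """Return one latent vector per video frame, moving farther when the audio changes more."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(latent_dim)
    latents = []
    for d in deltas:                                   # one audio-change value per frame
        direction = rng.standard_normal(latent_dim)
        direction /= np.linalg.norm(direction)         # unit step direction
        z = z + step_scale * d * direction             # large audio change -> large jump
        latents.append(z.copy())
    return np.stack(latents)

# frames = [generator.synthesize(z) for z in latent_walk(delta)]  # hypothetical call
```

Because StyleGAN maps a latent vector to several coarse-to-fine styles, different audio features could in principle be routed to different style layers, which is the intuition the paper gives for choosing StyleGAN.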

(Figure: architecture diagram)

Results

The method does not capture the overall mood of the music, and the generated images end up being essentially random samples from the training distribution; even so, it captures changes in rhythm and melody well and varies the visuals accordingly. Also, given the scale of the model architecture, real-time generation remains out of reach for now.

Further Thoughts

Going forward, tying the generated images to the character of the music and proposing models capable of real-time generation look like the important next steps.

Links