Entry

Type

Paper

Year

2020

Posted at

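Overview - What's impressive?

An attempt to let hearing-impaired people understand sound by converting it into images with deep generative models. It starts from the hypothesis that this should be far easier to realize than approaches that implant electrodes in the brain, such as those pursued by Elon Musk and others.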

Abstract

Sensory substitution can help persons with perceptual deficits. In this work, we attempt to visualize audio with video. Our long-term goal is to create sound perception for hearing impaired people, for instance, to facilitate feedback for training deaf speech. Different from existing models that translate between speech and text or text and images, we target an immediate and low-level translation that applies to generic environment sounds and human speech without delay. No canonical mapping is known for this artificial translation task. Our design is to translate from audio to video by compressing both into a common latent space with shared structure. Our core contribution is the development and evaluation of learned mappings that respect human perception limits and maximize user comfort by enforcing priors and combining strategies from unpaired image translation and disentanglement. We demonstrate qualitatively and quantitatively that our AudioViewer model maintains important audio features in the generated video and that generated videos of faces and numbers are well suited for visualizing high-dimensional audio features since they can easily be parsed by humans to match and distinguish between sounds, words, and speakers.

Motivation

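Transcribing speech to text is one way to convey sound visually, but while text conveys the verbal content, it cannot convey fine nuances or tone, let alone environmental sounds. The paper therefore maps sound directly into the latent space of an image generation model so that fine-grained changes are conveyed. The underlying hypothesis is that viewers will gradually get used to the images and eventually learn to match them to sounds and their content.

Premises

1. People are sensitive to changes in natural imagery, especially human faces → use face images as the target of the mapping.
2. People do not cope well with abrupt changes in what they see → introduce a constraint that enforces smooth changes.
3. Frequently occurring sounds should map to frequently occurring images → introduce a CycleGAN-style cycle consistency and learn the relationship between sound and image.
4. Separate the content of the sound (the words) from its style (speaker, tone, etc.).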
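
The note does not say what form the smoothness constraint in the second premise takes. As a minimal sketch, assuming one latent vector per video frame, one common choice is to penalize the squared distance between consecutive latents. The function name `smoothness_penalty` and the toy trajectories are illustrative assumptions, not the paper's definition or data.

```python
from typing import Sequence


def smoothness_penalty(latents: Sequence[Sequence[float]]) -> float:
    """Mean squared distance between consecutive latent vectors.

    Assumption: `latents` holds one latent vector per frame, in time order.
    """
    if len(latents) < 2:
        return 0.0
    total = 0.0
    for prev, cur in zip(latents, latents[1:]):
        total += sum((c - p) ** 2 for p, c in zip(prev, cur))
    return total / (len(latents) - 1)


# Toy trajectories (made-up numbers, not data from the paper).
smooth = [[0.0, 0.0], [0.1, 0.1], [0.2, 0.2]]
jumpy = [[0.0, 0.0], [1.0, -1.0], [0.0, 0.0]]

# A slowly drifting trajectory is penalized less than one that jumps around.
print(smoothness_penalty(smooth), smoothness_penalty(jumpy))
```

Adding such a term to the training loss would push the encoder toward latents, and hence images, that change gradually over time.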

Architecture

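Conceptual diagram of the architecture

First, an encoder-decoder for audio and an encoder-decoder for images are trained separately.

To separate style from content (premise 4), the $d$-dimensional latent vector $z$ is split into two parts: $z_s$, representing the speaker, and $z_c$, representing the content. Concatenating the two gives $z$.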

$\mathbf{z}_{s}=\left[z_{1}, \cdots, z_{m}\right]^{T}$

$\mathbf{z}_{c}=\left[z_{m+1}, \cdots, z_{d}\right]^{T}$
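
As a small illustration of the two definitions above, the split is just a slice at index $m$. This is a sketch with plain Python lists; the helper names `split_latent` and `join_latent` and the example values are assumptions made for illustration.

```python
from typing import List, Tuple

Vector = List[float]


def split_latent(z: Vector, m: int) -> Tuple[Vector, Vector]:
    """Split z into the speaker part z_s (first m dims) and content part z_c (rest)."""
    if not 0 < m < len(z):
        raise ValueError("m must satisfy 0 < m < d")
    return z[:m], z[m:]


def join_latent(z_s: Vector, z_c: Vector) -> Vector:
    """Concatenate z_s and z_c back into the full latent z."""
    return list(z_s) + list(z_c)


z = [0.1, -0.3, 0.7, 1.2, -0.5, 0.0]  # d = 6, toy values
z_s, z_c = split_latent(z, m=2)        # z_s has 2 dims, z_c has 4 dims
print(z_s, z_c, join_latent(z_s, z_c) == z)
```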

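For example, suppose we have the word water spoken by speakers a and b, and the word air spoken by speaker a.

Training takes the content vector $z_c$ obtained by encoding speaker b's water, concatenates it with the speaker vector $z_s$ from speaker a's air, and learns to reconstruct water as spoken by speaker a from the resulting $z$.

To make the structure of the audio and image latent spaces similar, a cycle loss $\mathcal{L}_{\text{cycle}}$ is introduced.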
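
The recombination step can be sketched as follows. This is a toy illustration only: the latents are made-up numbers, `decode_audio` is a hypothetical identity stand-in for the trained audio decoder, and mean squared error is an assumed choice of reconstruction loss.

```python
from typing import List

Vector = List[float]


def recombine(z_speaker_src: Vector, z_content_src: Vector, m: int) -> Vector:
    """Build z from the speaker part of one latent and the content part of another."""
    return z_speaker_src[:m] + z_content_src[m:]


def mse(x: Vector, y: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


def decode_audio(z: Vector) -> Vector:
    # Hypothetical stand-in for the trained audio decoder: identity mapping.
    return list(z)


m = 2  # the first m dims are the speaker part z_s
z_a_air = [0.9, 0.1, -1.0, 0.3]    # speaker a saying "air" (toy latent)
z_b_water = [-0.4, 0.8, 0.5, 0.7]  # speaker b saying "water" (toy latent)
x_a_water = [0.9, 0.1, 0.5, 0.7]   # toy target: speaker a saying "water"

# Speaker part from a's "air", content part from b's "water".
z_mixed = recombine(z_a_air, z_b_water, m)
loss = mse(decode_audio(z_mixed), x_a_water)
print(z_mixed, loss)
```

In this idealized toy the latents are already perfectly disentangled, so the loss is zero; in training, minimizing this loss is what pushes speaker information into $z_s$ and word content into $z_c$.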

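$\mathcal{L}_{\text{cycle}}=\left|E_{V}\left(D_{V}\left(\mathbf{z}_{c}\right)\right)-\mathbf{z}_{c}\right|$

Here $z_c$ is the content part of the latent $z$ produced by the audio encoder. The audio latent vector is decoded with the decoder $D_V$ of the image VAE and then encoded again with the image encoder $E_V$; training minimizes the distance between the result and $z_c$.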
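
A minimal sketch of the cycle loss described here, using toy functions in place of the image VAE. The scalar "decoder" and "encoders" are hypothetical stand-ins, and the mean absolute difference is an assumed reading of the distance in the formula.

```python
from typing import Callable, List

Vector = List[float]


def cycle_loss(z_c: Vector,
               encode_v: Callable[[Vector], Vector],
               decode_v: Callable[[Vector], Vector]) -> float:
    """Mean absolute difference between E_V(D_V(z_c)) and z_c."""
    z_cycled = encode_v(decode_v(z_c))
    return sum(abs(a - b) for a, b in zip(z_cycled, z_c)) / len(z_c)


# Hypothetical stand-ins for the image VAE (a real model would be a neural network).
def decode_v(z: Vector) -> Vector:
    return [2.0 * v for v in z]


def encode_perfect(x: Vector) -> Vector:
    return [0.5 * v for v in x]  # exact inverse of decode_v


def encode_imperfect(x: Vector) -> Vector:
    return [0.4 * v for v in x]  # does not invert decode_v


z_c = [1.0, -2.0, 0.5]  # toy audio content latent
print(cycle_loss(z_c, encode_perfect, decode_v))
print(cycle_loss(z_c, encode_imperfect, decode_v))
```

The loss is zero only when the image encoder recovers the audio latent from the generated image, which is what ties the two latent spaces together.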

Results

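Images obtained by converting words with the trained model

Differences when the words that, water, and children, spoken by different speakers, are converted to faces

Speaker-to-speaker differences for the first phoneme of each word. The same phoneme is indeed converted to similar faces and images.

User tests - two user tests were conducted

Matching question: participants are shown two images and asked which one was generated from the same sound (word or phoneme).

Grouping question: participants are asked to identify the pair of images generated from the same sound (word or phoneme).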

Matching question

Grouping question

In both tests, participants scored far above chance.

The face model (CelebA) gave better results overall.

User test results

Further Thoughts

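Methods that connect audio and visual latent spaces have been proposed before, for example for VJ-style applications, but the angle of doing this for hearing-impaired people is new.
The separation of content and style, and the cycle loss used to bring the structure of the two latent spaces closer together, are very instructive.
An open question is whether people can really learn to identify sounds by looking at a large number of such images.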

Links

The audio VAE used in this work is SpeechVAE. The sounds it generates are interesting, so it might also be useful for making music.

Learning Latent Representations for Speech Generation and Transformation

An ability to model a generative process and learn a latent representation for speech in an unsupervised fashion will be crucial to process vast quantities of unlabelled speech data. Recently, deep probabilistic generative models such as Variational Autoencoders (VAEs) have achieved tremendous success in modeling natural images.

arxiv.org

Learning Latent Representations for Speech Generation and Transformation

(this work is to appear in Interspeech 2017. the paper can be found here) apply a convolutional VAE to model the generative process of natural speech. derive latent space arithmetic operations to disentangle learned latent representations. demonstrate the capability to randomly generate short speech segments (200ms).

people.csail.mit.edu

Learning Latent Representations for Speech Generation and Transformation

AudioViewer: Learning to Visualize Sound

Learning Latent Representations for Speech Generation and Transformation

Code found at https://github.com/wnhsu/SpeechVAE