Overview - What makes it impressive?

Using a framework the authors built themselves, a video clip and its monaural audio can be converted into binaural audio.

Abstract

Binaural audio provides a listener with 3D sound sensation, allowing a rich perceptual experience of the scene. However, binaural recordings are scarcely available and require nontrivial expertise and equipment to obtain. We propose to convert common monaural audio into binaural audio by leveraging video. The key idea is that visual frames reveal significant spatial cues that, while explicitly lacking in the accompanying single-channel audio, are strongly linked to it. Our multi-modal approach recovers this link from unlabeled video. We devise a deep convolutional neural network that learns to decode the monaural (single-channel) soundtrack into its binaural counterpart by injecting visual information about object and scene configurations. We call the resulting output 2.5D visual sound---the visual stream helps "lift" the flat single channel audio into spatialized sound. In addition to sound generation, we show the self-supervised representation learned by our network benefits audio-visual source separation.

Motivation

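Binaural audio makes recorded visual information feel more vivid, but it is difficult to record without dedicated equipment. With this technique, pseudo-binaural audio can be generated from a video clip and its paired monaural audio.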

Architecture

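To build the dataset, 3D audio was recorded with Free Space, a binaural microphone for capturing 3D sound made by the US company 3Dio, and video was recorded with a GoPro (1,871 short clips, 5.2 hours in total).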
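As a quick sanity check on the dataset figures, the two numbers quoted above imply an average clip length of roughly 10 seconds. The snippet below is only that arithmetic; the function name is made up for illustration, and actual clip lengths are not given in this note.

```python
# Figures quoted in the note: 1871 clips, 5.2 hours in total.
NUM_CLIPS = 1871
TOTAL_HOURS = 5.2


def average_clip_seconds(num_clips: int, total_hours: float) -> float:
    """Return the mean clip duration in seconds (hypothetical helper)."""
    return total_hours * 3600.0 / num_clips


avg = average_clip_seconds(NUM_CLIPS, TOTAL_HOURS)
print(f"average clip length: {avg:.1f} s")
```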

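The network takes monaural audio and the accompanying video as input, extracts visual features with ResNet-18 and audio features with a U-Net, performs audio-visual analysis, and estimates binaural audio that is consistent with the spatial layout of the video.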
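One way to picture the estimation step is sketched below. It rests on an assumption that this note does not state: the mono input is the sum of the two channels, and the model only has to predict their difference, from which left and right follow by simple arithmetic. The learned network is not implemented here; the ground-truth difference stands in for its prediction, and all function names are hypothetical.

```python
from typing import List, Tuple

Signal = List[float]


def mix_to_mono(left: Signal, right: Signal) -> Signal:
    """Assumed mono input: sample-wise sum of the two channels."""
    return [l + r for l, r in zip(left, right)]


def channel_difference(left: Signal, right: Signal) -> Signal:
    """Assumed prediction target: sample-wise left minus right."""
    return [l - r for l, r in zip(left, right)]


def reconstruct_binaural(mono: Signal, diff: Signal) -> Tuple[Signal, Signal]:
    """Recover (left, right) from the mono mix and a difference signal."""
    left = [(m + d) / 2 for m, d in zip(mono, diff)]
    right = [(m - d) / 2 for m, d in zip(mono, diff)]
    return left, right


# Toy two-channel recording (made-up sample values).
true_left = [0.5, -0.25, 0.75]
true_right = [0.25, 0.5, -0.5]

mono = mix_to_mono(true_left, true_right)
# Stand-in for the network output: here we simply use the true difference.
predicted_diff = channel_difference(true_left, true_right)

est_left, est_right = reconstruct_binaural(mono, predicted_diff)
print(est_left, est_right)
```

Under this framing the mono signal already carries the shared content, so the visual features only need to supply what distinguishes the two ears.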

Results

The binaural audio estimated this way (2.5D visual sound) lets the listener perceive where sound sources are located, enabling a more immersive audio experience.

Further Thoughts

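As smartphone cameras have improved, recording video has become an everyday part of life. Yet even as visual recording technology advances, what gets preserved vividly is the visual information, not the auditory information. Could a technique like this, by recreating the sound of the moment, make our memories of that moment more vivid?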

Links

1. The Sound of Pixels (https://arxiv.org/pdf/1804.03160.pdf)

2. A Speaker-Independent Audio-Visual Model for Speech Separation (https://arxiv.org/pdf/1804.03619.pdf)