Overview

AudioSet is a large-scale dataset of sound clips released by Google.

Audio event recognition, the human-like ability to identify and relate sounds from audio, is a nascent problem in machine perception. Comparable problems such as object detection in images have reaped enormous benefits from comprehensive datasets, principally ImageNet. This paper describes the creation of Audio Set, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research. Using a carefully structured hierarchical ontology of 632 audio classes guided by the literature and manual curation, we collect data from human labelers to probe the presence of specific audio classes in 10 second segments of YouTube videos. Segments are proposed for labeling using searches based on metadata, context (e.g., links), and content analysis. The result is a dataset of unprecedented breadth and size that will, we hope, substantially stimulate the development of high-performance audio event recognizers.

Data

The sound clips are segments of about 10 seconds extracted from YouTube videos, and they are classified into 527 classes such as music, human voice, and car sounds.
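As a rough sketch of what one labeled clip might look like in code, the snippet below parses a segment list in which each row holds a video ID, a start and end time in seconds, and a quoted, comma-separated list of class IDs. The row layout, the sample rows, and the names `Segment` and `parse_segments` are all assumptions made for illustration; check them against the files you actually download.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Segment:
    video_id: str
    start: float
    end: float
    labels: list


def parse_segments(text):
    """Parse rows of the assumed form: video_id, start, end, "label1,label2"."""
    # Drop blank lines and comment lines (assumed to start with '#').
    rows = [ln for ln in io.StringIO(text) if ln.strip() and not ln.startswith("#")]
    segments = []
    # skipinitialspace lets the quoted label field follow ", " correctly.
    for video_id, start, end, labels in csv.reader(rows, skipinitialspace=True):
        segments.append(Segment(video_id, float(start), float(end), labels.split(",")))
    return segments


# Made-up sample rows; the video IDs and class IDs are placeholders.
SAMPLE = '''# video_id, start_seconds, end_seconds, labels
VIDEO_A, 30.000, 40.000, "/x/speech,/x/music"
VIDEO_B, 0.000, 10.000, "/x/car"
'''

segments = parse_segments(SAMPLE)
for seg in segments:
    print(seg.video_id, seg.end - seg.start, seg.labels)
```

Keeping the start and end times next to the video ID makes it easy to confirm that every clip really spans about 10 seconds before fetching any audio.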

The classes also form a hierarchy: under human voice, for example, sit finer-grained classes such as speech, whispering, and shouting.
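The hierarchy can be pictured as a tree in which each class lists its children. The sketch below walks such a tree using a tiny hand-written sample. The class IDs are placeholders, and the `id` / `name` / `child_ids` keys are an assumed layout for the published ontology file, so adjust them to match the real data.

```python
import json

# Tiny made-up ontology; IDs are placeholders, and the "id" / "name" /
# "child_ids" keys are an assumed layout for the published JSON file.
ONTOLOGY_JSON = '''
[
  {"id": "/x/voice", "name": "Human voice",
   "child_ids": ["/x/speech", "/x/whisper", "/x/shout"]},
  {"id": "/x/speech", "name": "Speech", "child_ids": []},
  {"id": "/x/whisper", "name": "Whispering", "child_ids": []},
  {"id": "/x/shout", "name": "Shout", "child_ids": []}
]
'''


def load_ontology(text):
    """Index ontology entries by their id."""
    return {entry["id"]: entry for entry in json.loads(text)}


def find_roots(nodes):
    """Return the ids that never appear as another class's child."""
    children = {c for n in nodes.values() for c in n["child_ids"]}
    return [node_id for node_id in nodes if node_id not in children]


def descendants(nodes, node_id):
    """Collect the names of every class below node_id, depth first."""
    names = []
    for child_id in nodes[node_id]["child_ids"]:
        names.append(nodes[child_id]["name"])
        names.extend(descendants(nodes, child_id))
    return names


nodes = load_ontology(ONTOLOGY_JSON)
roots = find_roots(nodes)
print(roots)
print(descendants(nodes, "/x/voice"))
```

Indexing the entries by ID first keeps each child lookup constant-time, which matters once the tree has hundreds of classes rather than four.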

You can search the dataset and listen to the sounds it actually contains here!

AudioSet/Dataset

This hierarchy is published on GitHub. The structure and the details of the dataset's contents are described in the following paper.

Audio Set: An ontology and human-labeled dataset for audio events

AudioSet: A Dataset of 2 Million Sound Clips
