Paper

Entry
Type: Paper
Year:
Posted at: June 2, 2021
Tags: image

Overview

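  • Proposes WavCraft, a process that combines an LLM with a variety of audio-processing models to automatically carry out a sequence of audio-processing steps that matches a text prompt.
  • Agent-based audio synthesis/processing systems have been proposed before, but WavCraft improves accuracy by also extracting the acoustic features of the input audio files.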

Abstract

We introduce WavCraft, a collective system that leverages large language models (LLMs) to connect diverse task-specific models for audio content creation and editing. Specifically, WavCraft describes the content of raw sound materials in natural language and prompts the LLM conditioned on audio descriptions and users’ requests. WavCraft leverages the in-context learning ability of the LLM to decompose users’ instructions into several tasks and tackle each task collaboratively with audio expert modules. Through task decomposition along with a set of task-specific models, WavCraft follows the input instruction to create or edit audio content with more details and rationales, facilitating users’ control. In addition, WavCraft is able to cooperate with users via dialogue interaction and even produce the audio content without explicit user commands. Experiments demonstrate that WavCraft yields a better performance than existing methods, especially when adjusting the local regions of audio clips. Moreover, WavCraft can follow complex instructions to edit and even create audio content on top of input recordings, facilitating audio producers in a broader range of applications. Our implementation and demos are available at https://github.com/JinhuaLiang/WavCraft.
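
The flow described in the abstract (describe the input audio in text, let an LLM split the request into tasks, hand each task to an expert module) can be illustrated with a small toy sketch. This is not the WavCraft implementation: `describe_audio`, `plan_tasks`, the `EXPERTS` table, and the module names `separate`, `generate`, and `mix` are all hypothetical stand-ins, the "LLM" is replaced by a keyword rule, and the expert modules only return log strings instead of processing audio.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Task:
    module: str    # hypothetical expert-module name
    argument: str  # text passed to that module


def describe_audio(path: str) -> str:
    # Placeholder for an audio-analysis model that captions the input clip.
    return f"unlabelled audio clip at {path}"


def plan_tasks(description: str, request: str) -> List[Task]:
    # Placeholder for the LLM call: a real planner would be prompted with both
    # the audio description and the user's request. A keyword rule stands in.
    words = request.lower().split()
    tasks: List[Task] = []
    if "remove" in words:
        tasks.append(Task("separate", request))
    if "add" in words:
        tasks.append(Task("generate", request))
        tasks.append(Task("mix", description))
    return tasks


# Toy expert modules: each returns a log string instead of processing audio.
EXPERTS: Dict[str, Callable[[str], str]] = {
    "separate": lambda arg: f"separate sources for: {arg}",
    "generate": lambda arg: f"generate sound for: {arg}",
    "mix": lambda arg: f"mix result into: {arg}",
}


def run(path: str, request: str) -> List[str]:
    # Describe the audio, plan the tasks, then dispatch each task in order.
    description = describe_audio(path)
    return [EXPERTS[t.module](t.argument) for t in plan_tasks(description, request)]


if __name__ == "__main__":
    for step in run("street.wav", "add a dog barking"):
        print(step)
```

Keeping the planner separate from the expert table mirrors the division of labour the abstract describes: the planner only decides which modules to call and in what order, so either side could be swapped out independently.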

Motivation

  • Large-scale text-to-audio/music models have appeared, but they are not designed to take requests from users interactively.

Architecture

Results

Further Thoughts

Personal impressions after reading the paper

Links