Neural Text Generation with Unlikelihood Training

Entry

Unlikelihood Training - generating more natural text by not optimizing for "likelihood"

Simple Title

Welleck, S., Kulikov, I., Roller, S., Dinan, E., Cho, K., & Weston, J. (2019). Neural Text Generation with Unlikelihood Training.

Description

Optimizing for likelihood makes frequent words appear even more often than they should.

Type

Paper

Year

2019

Posted at

May 14, 2021

Abstract

To train a machine learning model is necessary to take numerous decisions about many options for each process involved, in the field of sequence generation and more specifically of music composition, the nature of the problem helps to narrow the options but at the same time, some other options appear for specific challenges. This paper takes the framework proposed in a previous research that did not consider rhythm to make a series of design decisions, then, rhythm support is added to evaluate the performance of two RNN memory cells in the creation of monophonic music. The model considers the handling of music transposition and the framework evaluates the quality of the generated pieces using automatic quantitative metrics based on geometry which have rhythm support added as well.

Motivation

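Everyone knows that language generation models such as GPT-3 keep getting more accurate, but their output still feels unnatural. They tend to repeat the same patterns, and they exaggerate differences in word frequency: common words are used even more often, and rare words are not used at all. It is also known that adding more training data does not change this.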

Examples of unnatural text produced by language generation models

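This study starts from the hypothesis that these problems are caused by the training method itself, which maximizes likelihood (that is, it maximizes the probability of the correct next word given the preceding sequence).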

Architecture

The objective (loss) function commonly used to maximize the likelihood of a language model:

\mathcal{L}_{\mathrm{MLE}}\left(p_{\theta}, \mathcal{D}\right)=-\sum_{i=1}^{|\mathcal{D}|} \sum_{t=1}^{\left|\mathbf{x}^{(i)}\right|} \log p_{\theta}\left(x_{t}^{(i)} \mid x_{<t}^{(i)}\right)
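As a rough illustration of the formula above, here is a minimal sketch of the MLE loss for a single sequence in plain Python. The function name `mle_loss`, the dictionary representation of the model distribution, and the toy probabilities are assumptions made for this example, not something taken from the paper.

```python
import math

def mle_loss(step_probs, targets):
    """Negative log-likelihood of one sequence.

    step_probs[t] is a dict mapping token -> p_theta(token | x_<t),
    targets[t] is the ground-truth token x_t at that step.
    """
    return -sum(math.log(probs[x_t]) for probs, x_t in zip(step_probs, targets))

# Toy distributions (hypothetical values, chosen only for illustration).
step_probs = [
    {"the": 0.5, "cat": 0.25, "sat": 0.25},
    {"the": 0.25, "cat": 0.5, "sat": 0.25},
]
targets = ["the", "cat"]

loss = mle_loss(step_probs, targets)
print(loss)  # equals 2 * log(2), since both targets have probability 0.5
```

Summing this quantity over every sequence in the dataset gives the outer sum over $\mathcal{D}$ in the formula.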

Unlikelihood Training

The following unlikelihood loss is added:

\mathcal{L}_{\mathrm{UL}}^{t}\left(p_{\theta}\left(\cdot \mid x_{<t}\right), \mathcal{C}^{t}\right)=-\sum_{c \in \mathcal{C}^{t}} \log \left(1-p_{\theta}\left(c \mid x_{<t}\right)\right)

Combined with the original likelihood loss, the full token-level unlikelihood training objective becomes:

\mathcal{L}_{\mathrm{UL}-\text { token }}^{t}\left(p_{\theta}\left(\cdot \mid x_{<t}\right), \mathcal{C}^{t}\right)=-\alpha \cdot \underbrace{\sum_{c \in \mathcal{C}^{t}} \log \left(1-p_{\theta}\left(c \mid x_{<t}\right)\right)}_{\text {unlikelihood }}-\underbrace{\log p_{\theta}\left(x_{t} \mid x_{<t}\right)}_{\text {likelihood }} .

The tokens in $\mathcal{C}^{t}=\left\{c_{1}, \ldots, c_{m}\right\}$ are the negative candidates. The smaller $p_{\theta}\left(c \mid x_{<t}\right)$ becomes for these tokens, the smaller the unlikelihood loss.
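To make the behavior of the two terms concrete, the following is a minimal sketch of the token-level objective for a single time step. The name `token_ul_loss`, the dictionary representation of the distribution, and the toy numbers are assumptions for illustration; a real implementation would operate on batched model outputs.

```python
import math

def token_ul_loss(probs, target, candidates, alpha=1.0):
    """Token-level unlikelihood loss for one time step.

    probs      : dict mapping token -> p_theta(token | x_<t)
    target     : the ground-truth next token x_t
    candidates : iterable of negative candidates C^t
    alpha      : weight on the unlikelihood term
    """
    likelihood_term = -math.log(probs[target])
    # Assumes every candidate has probability strictly below 1.
    unlikelihood_term = -sum(math.log(1.0 - probs[c]) for c in candidates)
    return alpha * unlikelihood_term + likelihood_term

# Same target probability, but the negative candidate "cat" gets less mass
# in the second distribution (hypothetical values).
high = {"the": 0.5, "cat": 0.25, "sat": 0.25}
low = {"the": 0.5, "cat": 0.125, "sat": 0.375}

loss_high = token_ul_loss(high, target="the", candidates={"cat"})
loss_low = token_ul_loss(low, target="the", candidates={"cat"})
print(loss_high > loss_low)  # True
```

With an empty candidate set the unlikelihood term vanishes and the loss reduces to the ordinary per-token likelihood loss.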

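Setting $\mathcal{C}^{t}$ as follows (the tokens that already appeared in the preceding context, excluding the ground-truth next token $x_{t}$) suppresses the probability that a previously used word appears again.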

\mathcal{C}_{\text {prev-context }}^{t}=\left\{x_{1}, \ldots, x_{t-1}\right\} \backslash\left\{x_{t}\right\}
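The candidate set defined above can be sketched in a few lines. The helper name `prev_context_candidates` and the example sentence are hypothetical; `t` is 1-indexed to match the notation in the formula.

```python
def prev_context_candidates(tokens, t):
    """Return C^t: the tokens x_1 .. x_{t-1}, minus the target x_t.

    tokens is the full ground-truth sequence; t is 1-indexed.
    """
    previous = set(tokens[: t - 1])  # x_1 .. x_{t-1}
    target = tokens[t - 1]           # x_t
    return previous - {target}

tokens = ["I", "like", "cats", "and", "I", "like", "dogs"]

# At t = 5 the target is "I", so "I" is removed from the candidates.
print(sorted(prev_context_candidates(tokens, 5)))  # ['and', 'cats', 'like']

# At t = 7 the target is "dogs", which has not appeared before.
print(sorted(prev_context_candidates(tokens, 7)))  # ['I', 'and', 'cats', 'like']
```

Removing the target keeps the two loss terms from conflicting when the correct next token really is a repeat of an earlier one.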

Results

Quantitative results

The number of unique words used in the generated text (uniq) increased compared with the baseline $\mathcal{L}_{\mathrm{MLE}}$ model
Repetition of the same words (rep / wrep) decreased
Accuracy did not change much

Accordingly, as the table below shows, human evaluators also rated the model higher than the $\mathcal{L}_{\mathrm{MLE}}$ model!

Human evaluation

Further Thoughts

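Could the same method be applied to music?
In music, repeating the same word (note??) is not something that needs to be avoided so strongly.
It would be better to build a mechanism that avoids repeating runs of melody rather than repeating individual notes.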
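One possible way to explore this idea, purely as a speculative sketch and not a method from the paper, is to choose the negative candidates so that only tokens completing an already-seen n-gram are penalized. The function `ngram_repeat_candidates`, the choice of `n`, and the note names below are all assumptions for illustration.

```python
def ngram_repeat_candidates(context, n=3):
    """Tokens that would complete an n-gram already present in context.

    context is the list of previous tokens (for example, notes);
    n is the n-gram length and must be at least 2.
    """
    if n < 2:
        raise ValueError("n must be at least 2")
    if len(context) < n:
        return set()
    prefix = tuple(context[-(n - 1):])  # the last n-1 tokens
    candidates = set()
    for i in range(len(context) - n + 1):
        # If an earlier n-gram starts with the same n-1 tokens,
        # its final token would reproduce that n-gram.
        if tuple(context[i:i + n - 1]) == prefix:
            candidates.add(context[i + n - 1])
    return candidates

# "C D" was followed by "E" before, so "E" would repeat the phrase "C D E".
print(ngram_repeat_candidates(["C", "D", "E", "F", "C", "D"]))  # {'E'}

# A single repeated note is not penalized: no 3-note phrase is repeated yet.
print(ngram_repeat_candidates(["C", "D", "C"]))  # set()
```

Such a set could be passed in place of $\mathcal{C}^{t}$ in the token-level loss, so that individual notes may repeat freely while repeated phrases are discouraged.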