Entry

Train a new model on an AI model's output, repeat, and... the model collapses within a few generations!

Simple Title

Shumailov, Ilia, Zakhar Shumaylov, Yiren Zhao, Nicolas Papernot, Ross Anderson, and Yarin Gal. 2024. “AI Models Collapse When Trained on Recursively Generated Data.” Nature 631 (8022): 755–59.

Description

Use LLM-generated text for training → generate new training data → train again, and repeat. What happens? A paper published in Nature.

Type

Paper

Year

2024

Posted at

December 13, 2024 4:33 PM (GMT+9)

Overview

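When an AI model (LLM) is repeatedly trained on AI-generated data, its accuracy drops sharply within a few generations.
As training is repeated, errors increase and, at the same time, the model ends up generating only plausible (high-frequency) text!
This could become a serious problem once AI-generated data makes up the majority and human-made training data (text, images, etc.) becomes the minority.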

Abstract

Stable diffusion revolutionized image creation from descriptive text. GPT-2 (ref. 1), GPT-3(.5) (ref. 2) and GPT-4 (ref. 3) demonstrated high performance across a variety of language tasks. ChatGPT introduced such language models to the public. It is now clear that generative artificial intelligence (AI) such as large language models (LLMs) is here to stay and will substantially change the ecosystem of online text and images. Here we consider what may happen to GPT-{n} once LLMs contribute much of the text found online. We find that indiscriminate use of model-generated content in training causes irreversible defects in the resulting models, in which tails of the original content distribution disappear. We refer to this effect as ‘model collapse’ and show that it can occur in LLMs as well as in variational autoencoders (VAEs) and Gaussian mixture models (GMMs). We build theoretical intuition behind the phenomenon and portray its ubiquity among all learned generative models. We demonstrate that it must be taken seriously if we are to sustain the benefits of training from large-scale data scraped from the web. Indeed, the value of data collected about genuine human interactions with systems will be increasingly valuable in the presence of LLM-generated content in data crawled from the Internet.

Motivation

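As LLM training datasets grow ever larger, the data available on the internet has been almost fully consumed, and there is concern that training data will run short in the future.
Some suggest, "If there isn't enough training data, why not use AI-generated data?" But does that really work? If training on AI-generated data is repeated, how do the model's accuracy and the tendencies of its output change?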

Method

A pretrained open-source LLM (OPT-125M) is used as the base model and fine-tuned on the wikitext2 dataset.

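"Training an LLM from scratch would probably lead to the same result, but it would emit about twice an average American's lifetime CO2, so we skipped that experiment"... or so the paper says.
The model is trained to predict the next 64 tokens from 64 input tokens.
At first, it is trained only on the wikitext2 data. From the next generation on, generation and training are repeated, with each new model trained on text generated by the previously trained model.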
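The generate-then-retrain loop described above can be sketched with a toy stand-in. This is only an illustration under loud assumptions: a bigram count table replaces the fine-tuned OPT-125M, plain sampling replaces whatever decoding the paper used, and every function name here (`make_blocks`, `fit_bigram`, `continue_block`, `run_generations`) is hypothetical. It also assumes the 64-token prompts are taken from the original dataset in every generation.

```python
import random
from collections import Counter, defaultdict

BLOCK = 64  # block length taken from the description above


def make_blocks(tokens, block=BLOCK):
    """Split a token list into non-overlapping blocks; the remainder is dropped."""
    return [tokens[i:i + block] for i in range(0, len(tokens) - block + 1, block)]


def fit_bigram(blocks):
    """Toy 'training': count which token follows which."""
    table = defaultdict(Counter)
    for blk in blocks:
        for a, b in zip(blk, blk[1:]):
            table[a][b] += 1
    return table


def continue_block(table, prompt, rng, n=BLOCK):
    """Toy 'generation': sample n tokens that continue the prompt."""
    out, cur = [], prompt[-1]
    for _ in range(n):
        nxt = table.get(cur)
        if not nxt:  # token never seen as a context: restart from a known one
            cur = rng.choice(sorted(table))
            nxt = table[cur]
        toks, weights = zip(*sorted(nxt.items()))
        cur = rng.choices(toks, weights=weights)[0]
        out.append(cur)
    return out


def run_generations(real_blocks, n_generations, seed=0):
    """Generation 0 trains on real data; later ones train on model output only."""
    rng = random.Random(seed)
    data, history = real_blocks, []
    for _ in range(n_generations):
        model = fit_bigram(data)
        history.append(model)
        # One 64-token continuation per original block, so the size stays the same.
        data = [continue_block(model, blk, rng) for blk in real_blocks]
    return history, data
```

Calling `run_generations(make_blocks(tokens), n_generations=10)` on any list of integer or string tokens returns the per-generation toy models and the last synthetic dataset. The point of the sketch is the data flow between generations, not the model itself.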

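Two scenarios were prepared

Scenario #1: The original training data (wikitext2 text) is used only to train the very first model + 5 epochs of training
Scenario #2: 10% of the original training data is randomly sampled and added at training time + 10 epochs of training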
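The difference between the two scenarios comes down to how each generation's training set is assembled. A minimal sketch, assuming the sampled 10% is added on top of the generated data (rather than replacing part of it) and is re-sampled for every generation. The function name `next_training_set` and the `SCENARIOS` table are hypothetical, not from the paper.

```python
import random

# Settings as summarized above; the dictionary layout itself is an assumption.
SCENARIOS = {
    "scenario_1": {"keep_original": 0.0, "epochs": 5},
    "scenario_2": {"keep_original": 0.1, "epochs": 10},
}


def next_training_set(generated, original, keep_original=0.0, seed=0):
    """Mix model-generated examples with a random fraction of the original data.

    keep_original=0.0 reproduces scenario #1 (synthetic data only);
    keep_original=0.1 reproduces scenario #2 (10% of the original data added).
    """
    rng = random.Random(seed)
    n_real = round(len(original) * keep_original)
    mixed = list(generated) + rng.sample(list(original), n_real)
    rng.shuffle(mixed)
    return mixed
```

Passing a different `seed` for each generation draws a fresh 10% sample each time, which matches the "randomly sampled" wording above.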

Results

Scenario #1: panel b in the figure below
Scenario #2: panel c in the figure below

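Performance gradually degraded in both, but...

Scenario #2 (with 10% of the original training data mixed in) performed better! (relatively lower perplexity)

Exposure to the distribution of the original training data (= human-written text), even in small amounts, matters for training.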
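For reference, the perplexity used as the metric above is the exponential of the average negative log-likelihood per token, so lower is better. A minimal sketch, assuming natural-log token probabilities are already available from some model. The function name `perplexity` is only illustrative.

```python
import math


def perplexity(token_log_probs):
    """exp(mean negative log-likelihood); expects natural-log probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))
```

A model that spreads its probability uniformly over k tokens has a perplexity of exactly k, which gives a rough intuition for the scale: a higher perplexity means the model is, on average, "choosing" among more candidates for each token.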

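As training is repeated:

High-frequency text is generated even more frequently → the model ends up saying only plausible, obvious things
Conversely, a long tail also appears = a small number of sequences that should not be generated (= very unlikely / erroneous) are generated as well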
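The first effect, rare content dropping out, can be reproduced with a much simpler toy than an LLM. The sketch below is an assumption-laden illustration, not the paper's experiment: a categorical distribution is repeatedly re-estimated from a finite sample of its own output. Once a rare token happens to receive zero samples, it can never come back. It does not model the second effect (newly introduced erroneous sequences). The name `resample_generations` and the Zipf-like starting counts are hypothetical.

```python
import random
from collections import Counter


def resample_generations(counts, n_generations, sample_size, seed=0):
    """Repeatedly sample from the current distribution and refit it by counting."""
    rng = random.Random(seed)
    support_sizes = [len(counts)]
    for _ in range(n_generations):
        tokens = sorted(counts)
        weights = [counts[t] for t in tokens]
        sample = rng.choices(tokens, weights=weights, k=sample_size)
        counts = Counter(sample)  # maximum-likelihood refit = empirical frequencies
        support_sizes.append(len(counts))
    return counts, support_sizes


# Zipf-like start: token i has weight roughly proportional to 1 / (i + 1).
start = Counter({i: 1000 // (i + 1) for i in range(200)})
final, sizes = resample_generations(start, n_generations=10, sample_size=500)
```

Here `sizes` records how many distinct tokens survive after each generation. The count can only shrink or stay flat, since a token missing from the sample gets probability zero in the refit model.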

A concrete example of model collapse (Scenario #1)

Input:

some started before 1360 — was typically accomplished by a master mason and a small team of itinerant masons, supplemented by local parish labourers, according to Poyntz Wright. But other authors reject this model, suggesting instead that leading architects designed the parish church towers based on early examples of Perpendicular.

Gen 0 (trained only on the original training data)

Revival architecture such as St. John’s Cathedral in London. The earliest surviving example of Perpendicular Revival architecture is found in the 18th @-@ century Church of Our Lady of Guernsey, which dates from the late 19th century. There are two types of perpendicular churches : those.

Gen 1

architecture such as St. Peter’s Basilica in Rome or St. Peter’s Basilica in Buenos Aires. There is no evidence that any of these buildings were built during the reign of Pope Innocent III, but it is possible that they may have been built during the reign of his successor, Pope Innocent.

Gen 5

ism, which had been translated into more than 100 languages including English, French, German, Italian, Spanish, Portuguese, Dutch, Swedish, Norwegian, Polish, Hungarian, Slovak, Lithuanian, Estonian, Finnish, Romanian, Bulgarian, Turkish, Croatian, Serbian, Ukrainian, Russian, Kazakh, Kyrgyz.

Gen 9 (the topic was architecture, yet somehow it has turned into jackrabbits...)

architecture. In addition to being home to some of the world’s largest populations of black @-@ tailed jackrabbits, white @-@ tailed jackrabbits, blue @-@ tailed jackrabbits, red @-@ tailed jackrabbits, yellow @-.

Further Thoughts

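Personal impressions after reading the paper

It is surprising that training a new model on an AI model's output makes the model collapse in just a few generations.
The argument that AI-generated content can simply be used once usable training data on the internet runs out does not hold.
That continued training on AI-generated content leaves a model generating only plausible things is also an important implication when thinking about cultural diversity.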

This experiment targets LLMs, but the same probably applies to music and images.

As the Discussion section also notes, there is an interesting side effect: the vested interests of first movers are protected and barriers to entry get higher.

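Whoever trains a huge model now, while AI-generated content is still relatively scarce, wins. From here on, training data will be contaminated with AI-generated content, so training an accurate model will become harder and harder.
It brings to mind chemical manufacturers that did as they pleased and built enormous fortunes before environmental awareness grew.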