Thursday, January 25, 2024

Large Language Models/Generative Pre-trained Transformers

(I'm turning this stock comment into a blog article so that I can refer to it in the future.)

My concern is that by the time we figure out we need an enormous volume of high-quality content created and curated by human experts to correctly train Large Language Models (LLMs) like ChatGPT, we will have eliminated all the entry-level career paths of those very same human experts by using those same LLMs. As the existing cohort of experts retire, die, move into management, or otherwise quit producing content, there will be no one to take their place. We will have “eaten our own seed corn”.

Because human-created and -curated content will be more expensive to produce, organizations will be strongly incentivized to use LLM-created content to train other LLMs - or perhaps even the same LLM. This tends to cause errors in the training data to be amplified, leading to model collapse, where the LLM produces nonsense. (This is less likely to happen with human-created content because humans, unlike an algorithm, are unlikely to make exactly the same mistakes.)
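A toy sketch of the feedback loop described above, in Python. It is not a real LLM, and everything in it is an assumption made for illustration: the "model" is just the token frequencies of its training corpus, the corpus and the function name `resample_generations` are hypothetical, and each generation is trained only on the previous generation's output. It shows one facet of model collapse, the loss of rare content, rather than error amplification in general: a token that fails to appear in one generation's output can never reappear in a later one.

```python
import random
from collections import Counter


def resample_generations(corpus, generations, sample_size, seed=0):
    """Return the number of distinct tokens in the corpus after each generation."""
    rng = random.Random(seed)
    current = list(corpus)
    distinct = [len(set(current))]
    for _ in range(generations):
        # "Train": the toy model is only the empirical token frequencies.
        counts = Counter(current)
        tokens = list(counts)
        weights = [counts[t] for t in tokens]
        # "Generate": the next corpus is sampled entirely from the toy model.
        current = rng.choices(tokens, weights=weights, k=sample_size)
        distinct.append(len(set(current)))
    return distinct


# Hypothetical corpus: one very common token plus 100 tokens that each appear once.
corpus = ["common"] * 900 + [f"rare_{i}" for i in range(100)]
history = resample_generations(corpus, generations=20, sample_size=1000)
print(history)  # distinct-token count per generation; it never increases
```

The exact numbers depend on the random seed, but the count of distinct tokens can only stay flat or fall, because each generation draws solely from what the previous one produced. Mixing fresh human-written text back in at every generation is what would break that ratchet.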

Because human-created and -curated content will be deemed to be of higher quality, organizations will be strongly incentivized to not label LLM-created content as such. This will be problematic for LLM developers who are looking for the enormous amounts of high-quality data necessary to train their models.

The seeds of the destruction of LLMs lie in the economics of creating and using LLMs.

I believe that LLMs have a future in being used as tools by experienced users in the same way such users may use tools like Wikipedia, Google Search, and StackOverflow today, with much of the same risk.
