Wednesday, July 26, 2023

Model Collapse

A few decades ago I was working at the National Center for Atmospheric Research, a national lab in Boulder, Colorado sponsored by the National Science Foundation. Although our missions were completely different, we had a lot in common operationally with the Department of Energy labs, like Los Alamos and Lawrence Livermore, regarding supercomputers and large data storage systems, so we did a lot of collaboration with them.

I had a boss at NCAR who once remarked that the hidden agenda behind the DoE labs was that they were a work program for physicists. Sometimes, often without much warning, you need a bunch of physicists for a Manhattan Project kind of activity. And you can't just turn out experienced Ph.D. physicists at the drop of a hat; it takes years or even decades. So for reasons of national security and defense policy you have to maintain a pipeline of physicist production, and a means to keep them employed and busy so that they can get the experience they need. Then you've got them when you need them.

This always seemed very forward-thinking to me. The kind of forward thinking you hope someone in the U.S. Government is doing.

It came to me today that this is the same issue in the screenwriters' and actors' strike.

Machine learning (ML) algorithms, of which Large Language Models (LLMs) are just one example, need almost unbelievably large amounts of data to train their huge neural networks. There is a temptation to use the output of ML models to train other ML models, because generating more input data that way is relatively cheap and easy, whereas expensive humans take a long time to do it. But training an ML model on the output of another ML model leads to an effect called "model collapse".

In a prior blog post I mentioned a VentureBeat article (which cites an academic paper) on this topic. The VentureBeat article by Carl Franzen provides the following metaphor:

If you trained an ML model to recognize cats, you could feed it billions of "natural" real-life examples of data about blue cats and yellow cats. Then if you asked it questions about cats, you would get answers containing blue cats, and yellow cats, and maybe even occasionally green cats.

But suppose yellow cats were relatively rarely represented in your data, whether or not they were rare in the real world. Then you would mostly get answers about blue cats, almost never yellow cats, and rarely if ever green cats.

Then suppose you started training your new, improved ML model on the output of the prior ML model. The new "synthetic" data set would dilute out all the examples of yellow cats. Eventually you would have a model that didn't even recognize yellow cats at all.

This is one example of model collapse: the ML model no longer represents the real world, and cannot be relied upon to deliver accurate results.
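To make the mechanism concrete, here is a minimal sketch of that feedback loop; it is my own illustration, not code from the VentureBeat article or the paper it cites. A toy "cat color" distribution is repeatedly re-estimated from samples drawn from the previous generation's estimate. The color names and probabilities are made up; the point is that a rare category like yellow tends to draw zero samples at some generation, and once its estimated probability hits zero it can never come back.

```python
# Toy illustration of model collapse: each "generation" re-estimates a
# categorical distribution from samples drawn from the previous generation's
# estimate. Rare categories eventually draw zero samples, and once their
# estimated probability is zero the tail of the distribution is gone for good.
import random
from collections import Counter

COLORS = ["blue", "yellow", "green"]
TRUE_DIST = {"blue": 0.80, "yellow": 0.05, "green": 0.15}  # hypothetical real-world mix

def sample(dist, n):
    """Draw n cats from a categorical distribution over colors."""
    return random.choices(list(dist), weights=list(dist.values()), k=n)

def refit(samples):
    """Re-estimate the distribution from raw counts (no smoothing)."""
    counts = Counter(samples)
    total = len(samples)
    return {c: counts.get(c, 0) / total for c in COLORS}

def run(generations=10, n=200, seed=0):
    random.seed(seed)
    dist = dict(TRUE_DIST)
    for g in range(generations):
        # Train the "next model" purely on the previous model's output.
        dist = refit(sample(dist, n))
        print(f"gen {g:2d}: " + ", ".join(f"{c}={p:.3f}" for c, p in dist.items()))

if __name__ == "__main__":
    run()
```

Run it a few times with different seeds and watch which generation the yellow cats vanish in. The real phenomenon in large neural networks is more subtle than this toy version, but the basic failure, losing the tails of the distribution, is the same.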

This is what will happen if you eliminate the human elements from your screenwriting or acting (or software development), using AI algorithms to write scripts and to synthesize and portray characters (or to write software). If you don't have a full pipeline constantly producing people with training and experience at writing or acting (or writing software, or whatever it is you need), you no longer have a way to generate the huge human-created and human-curated datasets you need to train your AIs. The models collapse, and eventually the writing or the portrayal of characters (or the software) in no way represents the real world.

That valley isn't even uncanny; it's just wrong.

But you can't just gen up more competent, trained, experienced writers or actors (or software developers) on the spur of the moment. It takes years to do that. By the time you realize you're in trouble, it's too late.

This is the precipice some folks want us to move towards today.
