Wednesday, July 26, 2023

Model Collapse

A few decades ago I was working at the National Center for Atmospheric Research, a national lab in Boulder, Colorado sponsored by the National Science Foundation. Although our missions were completely different, we had a lot in common operationally with the Department of Energy labs, like Los Alamos and Lawrence Livermore, regarding supercomputers and large data storage systems, so we did a lot of collaboration with them.

I had a boss at NCAR who once remarked that the hidden agenda behind the DoE labs was that they were a work program for physicists. Sometimes, often without much warning, you need a bunch of physicists for a Manhattan Project kind of activity. And you can't just turn out experienced Ph.D. physicists at the drop of a hat; it takes years or even decades. So for reasons of national security and defense policy you have to maintain a pipeline of physicist production, and a means to keep them employed and busy so that they can get the experience they need. Then you've got them when you need them.

This always seemed very forward thinking to me. The kind of forward thinking you hope someone in the U.S. Government is doing.

It came to me today that this is the same issue in the screen writers' and actors' strike.

Machine learning (ML) algorithms, of which Large Language Models (LLMs) are just one example, need almost unbelievably large amounts of data to train their huge neural networks. There is a temptation to use the output of ML models to train other ML models because it's relatively cheap and easy to create more input data that way, whereas expensive humans can take a long time to do it. But training an ML model with the output of another ML model leads to an effect called "model collapse".

I mentioned an article on VentureBeat (which cites an academic paper) on this topic in a prior blog article. The VentureBeat article by Carl Franzen provides the following metaphor:

If you trained an ML model to recognize cats, you could feed it billions of "natural" real-life examples of data about blue cats and yellow cats. Then if you asked it questions about cats, you would get answers containing blue cats, and yellow cats, and maybe even occasionally green cats.

But suppose yellow cats were relatively rarely represented in your data, whether they were rare in the real world or not. Mostly then you would get answers about blue cats, almost never yellow cats, and rarely if ever green cats.

Then suppose you started training your new improved ML model on the output of the prior ML model. The new "synthetic" data set would dilute out all the examples of yellow cats. Eventually you would have a model that didn't even recognize yellow cats at all.

This is one example of model collapse: the ML model no longer represents the real world, and cannot be relied upon to deliver accurate results.
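The cat metaphor can be turned into a toy simulation. This is only a sketch under my own assumptions, not anything from the VentureBeat article or the paper it cites: the "model" is nothing more than a table of color frequencies, and each generation is refit to a finite sample drawn from the previous generation. The function names, the 95/5 starting split, and the sample size are all made up for illustration.

```python
import random
from collections import Counter

def next_generation(weights, sample_size, rng):
    """Draw a synthetic data set from the current model, then refit on it."""
    colors = list(weights)
    samples = rng.choices(colors, weights=[weights[c] for c in colors], k=sample_size)
    counts = Counter(samples)
    # The refit model only knows the frequencies seen in the synthetic data.
    return {c: counts[c] / sample_size for c in colors}

def simulate(generations=200, sample_size=100, seed=1):
    """Return the list of models, generation 0 first."""
    rng = random.Random(seed)
    # Generation 0 is fit to "natural" data: yellow cats are rare but real.
    model = {"blue": 0.95, "yellow": 0.05}
    history = [model]
    for _ in range(generations):
        model = next_generation(model, sample_size, rng)
        history.append(model)
    return history

if __name__ == "__main__":
    history = simulate()
    for generation in (0, 10, 50, 100, 200):
        print(generation, history[generation])
```

The thing to notice is that zero is a trap: once a color fails to show up in one generation's synthetic data, its frequency becomes zero and no later generation can ever produce it again. Nothing in the loop can bring the yellow cats back except fresh real-world data.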

This is what will happen if you eliminate the human elements from your screenwriting or acting (or software development), using AI algorithms to write and to synthesize and portray characters (or write software). If you don't have a full pipeline constantly producing people who have training and experience at writing or acting (or writing software, or whatever it is you need), you no longer have a way to generate the huge human-created and human-curated datasets you need to train your AIs. The models collapse, and eventually the writing or portrayal of characters (or the software) in no way represents the real world.

That valley isn't even uncanny; it's just wrong.

But you can't just gen up more competent, trained, experienced writers or actors (or software developers) on the spur of the moment. It takes years to do that. By the time you realize you're in trouble, it's too late.

This is the precipice some folks want us to move towards today.

Saturday, July 22, 2023


The movie we've all been waiting for, the movie about a beloved childhood toy, opened in theaters everywhere this week. I am talking, of course, about Christopher Nolan's Oppenheimer.

Who knows how many little girls were inspired to enter STEM fields by playing with their "Oppie" dolls. The Spousal Unit has regaled me with stories about the many happy hours spent dressing her Oppie in different suits, trench coats, and fedora hats. Of the joy she felt on that one extra special Christmas morning when she opened a brightly wrapped package to find her very own "Oppie's Dream Atomic Bomb". How she collected Oppie's friends like Lieutenant Colonel Leslie Groves ("With Angry Face!"), and the terrific glow-in-the-dark Louis Slotin (which inspired her to pursue a career in medicine). And how building their own Trinity test site in the backyard sand box brought her and her older brother closer together.

She and I look forward to seeing Oppenheimer so that she can relive those golden childhood memories.

Using Microsoft's Windows Subsystem for Linux on Windows 11

I recently bought a new laptop that runs Microsoft Windows 11.

And I didn't immediately install a Linux distro over top of Windows.

Shocking, I know, for a guy who for years has been, and remains, firmly in the Apple ecosystem: laptop, desktop, phone, and tablet. And who, for the past few decades, has been writing software for the Linux ecosystem (including for Android). But I didn't own a hardware platform that can run Windows 11, which only works on systems that have a Trusted Platform Module (TPM). And a lot of the commercial tools for embedded systems, and vendor tools for GNSS devices, that I use only run on Windows.

I bought a 2022 HP Envy x360 Convertible Model 15. That's a laptop with a 15.6" touch sensitive screen that folds up to convert into a tablet. It's the first hardware platform I've run my code on that uses an AMD processor: a Ryzen 7 5825U. It came with Windows 11 Professional. It has 64GB of RAM, and a 2TB PCIe SSD.

So of course almost the first thing I did was get Microsoft's Windows Subsystem for Linux (WSL) working on it. This allows you to run a full blown Linux/GNU distro - not an emulation layer - with a Linux kernel, in a highly optimized virtual machine environment native to Microsoft. Then I got my own software running on it, my Diminuto and Hazer repositories.

It was mostly straightforward, although getting a USB device (in my case, a GNSS receiver dongle) attached to the Linux environment was a little weird - definitely weirder than doing the same thing using the commercial VMware software, which I have done many many times.

Here's a snapshot of my GNSS software, gpstool, running under Ubuntu Linux, under Windows 11, on the new laptop.

Hazer on HP Envy x360 15 Convertible

I have come to think of the WSL window as the "system console" for the running Linux. If you do an ifconfig command in this window, you can get the local IP address for the Linux instance. Using that address, you can ssh into Linux from Windows and have multiple concurrent Linux sessions. I use the popular Windows app PuTTY - which I also use to connect to serial-attached devices - but anything similar, like using Windows' native ssh command from a PowerShell console, should work.

You can easily tell that the NMEA 0183 data stream from the GNSS device is running through some kind of USB bridge software layer under Windows that adds significant latency. My GNSS software displays, second by second, both the local system clock (LOC) and the GPS time from the incoming NMEA data (TIM). On this system, they consistently differ by one second, TIM running one second late. I have seen this also when running under VMware, but never when running natively on a system. Definitely won't be using this approach for precision timing, but it should be fine for geolocation.
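The comparison itself is simple to sketch. The Python below is not how gpstool does it (gpstool is written in C on top of Hazer); it is a minimal illustration that pulls the UTC time and date out of an NMEA RMC sentence and subtracts it from a clock reading. The sample sentence and the clock value are made up for the example, and the function names are hypothetical.

```python
from datetime import datetime, timezone

def rmc_utc(sentence):
    """Return the UTC timestamp carried by an NMEA RMC sentence."""
    body = sentence.strip().lstrip("$").split("*")[0]
    fields = body.split(",")
    if not fields[0].endswith("RMC"):
        raise ValueError("not an RMC sentence")
    hhmmss = fields[1].split(".")[0]  # drop fractional seconds, if any
    ddmmyy = fields[9]
    stamp = datetime.strptime(ddmmyy + hhmmss, "%d%m%y%H%M%S")
    return stamp.replace(tzinfo=timezone.utc)

def clock_offset(sentence, local_now):
    """Seconds by which the local clock leads the time in the sentence."""
    return (local_now - rmc_utc(sentence)).total_seconds()

# Illustrative data only: a sample sentence and a pretend local clock reading.
sample = "$GPRMC,123519,A,4807.038,N,01131.000,E,022.4,084.4,230394,003.1,W*6A"
local = datetime(1994, 3, 23, 12, 35, 20, tzinfo=timezone.utc)
print(clock_offset(sample, local))  # prints 1.0
```

In a live setup you would pass datetime.now(timezone.utc) as the clock reading at the moment each sentence arrives; a steady positive offset is the kind of latency described above.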

I've found an issue with one of my USB-attached GNSS receivers: the optional Windows usbipd utility, which you use to manage the connection of USB devices to WSL, refuses to attach it to Linux, reporting "device is in an error state". It's the one dongle I have that uses the Data Carrier Detect (DCD) indication to provide the GNSS one-pulse-per-second ("1PPS") signal. It works fine natively with Linux on, for example, a Raspberry Pi. Other USB-attached GNSS devices have worked fine.

Otherwise: so far, so good.

Sunday, July 16, 2023

Large Machine Learning Models Are Not Intrinsically Ethical - And Neither Are Large Corporations

I think the screen actors' and writers' concerns about the use of large AI models are legitimate, since the models cannot exist and could not be successful without being trained using a ginormous amount of human-created input, typically without the permission or even knowledge of the original creators.

But that's just the tip of the iceberg, being currently the most visible public example of this concern.

Eventually, software engineers will wise up and figure out they have the same issue, with companies training AIs using software - including open source software - written by humans, most of whom are no longer, or never were, employees of theirs, without any compensation, consent, or acknowledgement.

Worse, companies will try to get around using expensive, experienced, and ethical developers by training AIs to generate software that will be used in safety critical or weapons systems.

Eventually, companies will save even more money, and avoid any intellectual property issues, by training AIs using software that was itself generated by other AIs, and... it's turtles all the way down. With each iteration, it will be like a game of telephone, the quality of the output getting worse and worse. Except sometimes with ground-to-air missiles.

In time, there will be corporate executives for some prime defense contractor sitting in front of a Congressional committee, trying to explain why their automated weapons system shot down a commercial airliner because it thought it was a Russian bomber. They will be forced to admit that no one - not their scientists, not their own engineers, not anyone - really understands how the AI in the system came to that decision.

Because that's how complex large neural network machine learning models are. It's not traditional if-then-else logic, a so-called "rule-based" system, like I studied when I was a graduate student in Computer Science. It's an almost incomprehensibly gigantic simulated network of neurons that was configured by an almost unbelievably huge dataset of input. A dataset whose contents no human moderated or approved or even examined. Or, because of its volume, could examine.

I admit this isn't my area of expertise. But I have a couple of degrees in Computer Science from an accredited program at a university. I have worked for many years in large multi-national corporations, part of that time in the defense-industrial complex. So I feel like I have a reasonably informed opinion on both the technical aspects and how large corporations work.

I believe that the application of very large machine learning models to weapons systems is inevitable. If not by the U.S., then by other countries, perhaps including our allies. The results will be unpredictable. And unexplainable.


It only just now occurred to me that how large machine learning models work might be a good metaphor for the hive minds of large organizations.

Not really joking.

Postscript 2

My use of "hive minds" above was quite deliberate, BTW, since my train of thought first connected machine learning models with the emergent behavior of some insect colonies, e.g. bees. The individual bee - and the neural network inside its brain - is relatively simple, but the group behavior of a lot of bees is quite complex - and not even remotely understood by any individual bee.

Postscript 3

I couldn't read this paywalled article from Bloomberg [2023-07-16], but the part I could see, just a few minutes ago, coincidentally, was enough.

"Israel Quietly Embeds AI Systems in Deadly Military Operations

Selecting targets for air strikes and executing raids can now be conducted with unprecedented speed, according to army officials.

The Israel Defense Forces have started using artificial intelligence to select targets for air strikes and organize wartime logistics as tensions escalate in the occupied territories and with arch-rival Iran.

Though the military won’t comment on specific operations, officials say that it now uses an AI recommendation system that can crunch huge amounts of data to select targets for air strikes. Ensuing raids can then be rapidly assembled with another artificial intelligence model called Fire Factory, which uses data about military-approved targets to calculate munition loads, prioritize and assign thousands of targets to aircraft and drones, and propose a schedule."

Postscript 4

There's a very recent article on Vox about how the inner workings of large machine learning models are unknowable.

Postscript 5

Postscript 6

The article from VentureBeat that I cite just above makes an interesting point: the fact that using AI model output as training data for another AI model leads to "model collapse" means that high-quality human-generated or human-curated training data becomes increasingly rare and valuable. I predict this will lead to new open source licenses, GNU and otherwise, that restrict the use of data or code as training data for machine learning models. (And of course, AI developers will routinely violate those open source licenses, just as they are violated now.)

Monday, July 03, 2023

Hazer with the U-blox NEO-F10T GNSS Receiver on the Ardusimple SimpleGNSS Board

It's been a while since I talked about my GPS/GNSS efforts. Some time ago I bought a SimpleGNSS board from Ardusimple to try out. The SimpleGNSS has the new U-blox NEO-F10T GNSS receiver. This was my first experience using a new U-blox generation 10 device. It is the first GNSS device of any kind I've used that includes features specific to the latest version 4.11 of the National Marine Electronics Association 0183 standard. And it's the first I've used that's capable of receiving the new L5 band signal from the latest Block III GPS satellites.

I was about ready to dismantle this experiment. But before I removed it from my workbench, I thought I'd update y'all on my latest version of the Hazer library and its gpstool utility.

Hazer is a Linux/GNU/C-based library that supports the processing not only of the usual NMEA 0183 output of GNSS receivers, but also proprietary binary output like UBX from U-blox devices, and CPO output from Garmin devices. It also handles input and output of RTCM messages in support of differential GNSS, yielding geolocation precision down to about 1.5 centimeters. (I run a DGNSS fixed base and a stationary rover 24x7 at the Palatial Overclock Estate.)
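For readers who haven't looked inside an NMEA 0183 sentence: each one starts with '$', ends with '*' and two hex digits, and those digits are the XOR of every character in between. Here is a minimal Python sketch of that check. It is not the Hazer API (Hazer is a C library); the function names and the example sentence body are made up for illustration.

```python
from functools import reduce

def nmea_checksum(body):
    """XOR of every character between '$' and '*', as two hex digits."""
    return format(reduce(lambda acc, ch: acc ^ ord(ch), body, 0), "02X")

def frame(body):
    """Wrap a sentence body in the '$', '*', checksum, and CR LF framing."""
    return "$" + body + "*" + nmea_checksum(body) + "\r\n"

def is_valid(sentence):
    """True if the sentence's trailing checksum matches its body."""
    sentence = sentence.strip()
    if not sentence.startswith("$") or "*" not in sentence:
        return False
    body, _, received = sentence[1:].rpartition("*")
    return received.upper() == nmea_checksum(body)

# Round trip on a made-up body: frame it, then verify it.
sentence = frame("GPTXT,01,01,02,example")
print(repr(sentence), is_valid(sentence))
```

A single flipped character anywhere in the body changes the XOR, which is enough to catch most serial line noise, although it is a far weaker check than the CRC used by RTCM.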

gpstool is the Swiss Army Knife of Hazer. It's one of those old-school tools that has a nightmare of command line options. I use it to functionally test the library, and also as the main component of most of my GNSS efforts. You wouldn't want to use gpstool to navigate cross-country (although I have used it in conjunction with OpenStreetMap to generate a moving map display in real-time). But I find it really useful for testing and evaluating new GNSS devices and just generally futzing around with geolocation and precision timing.

(You may want to click on the short video clips to view them on YouTube instead of from this blog; the UI seems to crop the images, losing some information. You can also click on photographs to see a larger image.)

Here is a short video clip of a Raspberry Pi 4B running gpstool with the NEO-F10T connected via USB.

Besides processing the output of the U-blox device, gpstool is following the 1 Hz One Pulse Per Second (1PPS) digital output signal from the device, which is syntonized to GPS time, and strobes another digital line on the Raspberry Pi, syntonized to 1PPS (subject to software latency), to which I've attached a red LED.

Here is a short screen capture that shows the output from gpstool. I've SSHed into the Raspberry Pi from my Mac desktop to run gpstool. The utility uses some simple ANSI screen controls to update the display in real-time. I'm viewing standard output directly from gpstool, but the utility also has the ability to run headless in the background as a daemon, and you can view this real-time display remotely at leisure. (This is how I run my DGNSS systems.)

If you have perused the output of gpstool in any of my prior articles, you may notice a new field in the SAT lines showing the frequency band used by the device to receive data from the indicated satellite, e.g. L1 C/A. Some GNSS devices (like this one) may receive data from the same satellite over more than one band. (I confess this "new feature" is a long delayed bug fix because I botched the handling of new fields in version 4.10 of the NMEA 0183 spec. I have no excuse.)

gpstool isn't just a passive monitor. You can use its command line options to send NMEA, UBX, and other message formats to the device to interrogate and configure it. I did this with the NEO-F10T. Here is a screen shot of a snippet of a script similar to the one I used to generate the output shown in this article. It sends a batch of UBX messages, sequentially, waiting for an acknowledgement for most of them, to configure the device.

Screen Shot 2023-07-06 at 13.28.38

You can find this script in its entirety in the Hazer repository on GitHub.
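If you're curious what a UBX message looks like on the wire, here is a minimal Python sketch of the framing: two sync bytes, a class byte, an ID byte, a little-endian 16-bit payload length, the payload, and a two-byte 8-bit Fletcher checksum computed over everything after the sync bytes. This is not taken from the script or from Hazer; the function names are hypothetical, and the example builds a poll request for UBX-MON-VER rather than any of the configuration messages the script sends.

```python
import struct

def ubx_checksum(data):
    """8-bit Fletcher checksum over class, ID, length, and payload."""
    ck_a = ck_b = 0
    for byte in data:
        ck_a = (ck_a + byte) & 0xFF
        ck_b = (ck_b + ck_a) & 0xFF
    return bytes((ck_a, ck_b))

def ubx_frame(msg_class, msg_id, payload=b""):
    """Build a complete UBX frame, sync bytes through checksum."""
    header = struct.pack("<BBH", msg_class, msg_id, len(payload))
    return b"\xb5\x62" + header + payload + ubx_checksum(header + payload)

# Poll request for UBX-MON-VER (class 0x0A, ID 0x04, empty payload).
print(ubx_frame(0x0A, 0x04).hex(" "))  # prints b5 62 0a 04 00 00 0e 34
```

The bytes.hex(" ") call with a separator needs Python 3.8 or later.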

Finally, in addition to the standard output, gpstool generates log output on standard error, which here I capture in a file. (If you run gpstool headless, log output can also be automatically directed to the system log without any need for a command line option or API call.)

Hazer gpstool Example Log File

In this example, I have the log output level set to INFO (informational) and above; setting it to the DBUG (debug) level and above generates more output than I typically need unless I am actually debugging new code. (The logging system is a feature of Diminuto, my Linux/GNU/C-based systems programming library on which Hazer and gpstool are built; Diminuto has its own ginormous feature set.)

That's a quick update of some of my latest poking around with GNSS. I'm always on the lookout for new (to me, anyway) GNSS devices to play with!