Tuesday, February 20, 2024

Pig Butchering with Large Language Models

I have my Facebook default privacy settings locked down so that only my FB friends can see my posts on my timeline. And I only accept friend requests from folks I feel I know pretty well, and typically only those I know in meat space. But when I shared my post about selling a BMW motorcycle to my motorcycle club's group on FB, I had to change the privacy setting of that particular post from private to public so that members who weren't on my FB friends list could see it. The comments below are the result.

(Image: Pig Butchering With LLMs)

Take a close look at them. All of course claim to be from attractive young women. The first two of them are just short comments trying to get me to engage. The fourth one is a long missive that is probably a standard form letter with no specific detail. But the third one has enough specificity that it had me looking up the commenter's profile: a young divorced Asian woman in the fashion industry who lives in San Francisco. Possible but not likely in the BMW motorcycle owner demographic.

It was almost certainly written by an AI built on the current artificial neural network technology, the kind of Large Language Model (LLM) that underlies ChatGPT. It has all sorts of detail about my post, and at first seems legit, but is really nothing much more than a rewording of what I originally posted to the group.

This is where LLMs are taking the pig butchering or romance scam artists. As the models are trained with more and more data, they are just going to get better and better.

Wednesday, February 14, 2024

Are AI Generated Works Intellectual Property?

The U.S. Patent and Trademark Office (USPTO) has once again stressed that only humans can be listed as inventors on patents. And the U.S. Copyright Office, part of the Library of Congress and typically a small bureaucracy with just a few people, is about to make big news as it evaluates whether AI generated works can be copyrighted.

If the USPTO declines to recognize AI "inventors", and the Library of Congress similarly disallows copyrighting of AI generated material, that's going to really put a crimp in the monetization of AI generated intellectual property, since it cannot be protected.

My current thinking is that right now it's the right thing to do.

Current Generative Pre-trained Transformer (GPT) AIs are nothing more than gigantic text or image prediction engines based on huge artificial neural network-based statistical models trained with enormous amounts of human created and curated input - input for which the original authors and artists are not being compensated, despite the fact that their work may have been copyrighted. There's no cognition or creativity involved.

But the counter argument is worth thinking about.

We ourselves are nothing but gigantic text or image prediction engines based on huge natural neural network-based statistical models trained with enormous amounts of human created and curated input - material we have read or examined - for which the original authors and artists are not being compensated, despite the fact that their work may have been copyrighted.

The difference is that when we write or make art, we may be trying to use the trained neural network in our brain to create what others have not done before. That's creativity.

Update (2024-02-20)

Another counter argument is that there is creativity and cognition involved in the prompt engineering - the term used for the creation of the prompt, or series of prompts, the human operator gives the AI to produce its output. Perhaps, in this respect, using an AI is no different than using tools like Microsoft Word or Adobe Photoshop for your writing or art.

I'm still leaning towards not providing IP protection for AI generated output. But this is a complicated issue. As the subtitle of my blog reminds you, 90% of this opinion could be crap.


(Perhaps ironically, this article is based on the no doubt copyrighted work of several others that I would like to cite... if only I could remember them. As I do, I'll add the citations here.) 

Emilia David, "US patent office confirms AI can't hold patents", The Verge, 2024-02-13, https://www.theverge.com/2024/2/13/24072241/ai-patent-us-office-guidance

Cecilia Kang, "The Sleepy Copyright Office in the Middle of a High Stakes Clash over A.I.", The New York Times, 2024-01-25, https://www.nytimes.com/2024/01/25/technology/ai-copyright-office-law.html

Louis Menand, "Is A.I. the Death of I.P.?", The New Yorker, 2024-01-15, https://www.newyorker.com/magazine/2024/01/22/who-owns-this-sentence-a-history-of-copyrights-and-wrongs-david-bellos-alexandre-montagu-book-review

Shira Perlmutter, "Copyright Registration Guidance: Works Containing Material Generated by Artificial Intelligence", U.S. Copyright Office, Federal Register, 2023-03-10, https://copyright.gov/ai/ai_policy_guidance.pdf

Katherine Kelly Vidal, "Inventorship Guidance for AI-assisted Invention", U.S. Patent and Trademark Office, Federal Register, 2024-02-13, https://public-inspection.federalregister.gov/2024-02623.pdf

Thursday, January 25, 2024

Large Language Models/Generative Pre-trained Transformers

(I'm turning this stock comment into a blog article so that I can refer to it in the future.)

My concern is that by the time we figure out we need an enormous volume of high quality content created and curated by human experts to correctly train Large Language Models (LLMs) like ChatGPT, we will have eliminated all the entry-level career paths of those very same human experts by using those same LLMs. As the existing cohort of experts retire, die, move into management, or otherwise quit producing content, there will be no one to take their place. We will have “eaten our own seed corn”.

Because human-created and -curated content will be more expensive to produce, organizations will be strongly incentivized to use LLM-created content to train other LLMs - or perhaps even the same LLM. This tends to cause errors in the training data to be amplified, leading to model collapse, where the LLM produces nonsense. (This is less likely to happen with human-created content because humans, unlike an algorithm, are unlikely to make exactly the same mistakes.)
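The feedback loop described above can be sketched with a toy simulation. This is not an LLM and not anyone's actual training pipeline: it is a minimal stand-in, assuming a "model" that is just a Gaussian fitted to a small sample, where each generation is trained only on the output of the previous one. The function name, sample size, and generation count are all made up for illustration. What it shows is one facet of model collapse: with no fresh human-created data coming in, the variety in the data tends to shrink generation after generation.

```python
import random
import statistics


def simulate_collapse(generations=200, sample_size=10, seed=0):
    """Repeatedly fit a Gaussian to samples drawn from the previous fit.

    Hypothetical toy: generation 0 is the "human" data distribution
    (mean 0, standard deviation 1). Every later generation sees only
    synthetic samples produced by the generation before it.
    """
    rng = random.Random(seed)
    mean, std = 0.0, 1.0
    history = [std]  # spread of the model at each generation
    for _ in range(generations):
        # The current model generates the next training set.
        samples = [rng.gauss(mean, std) for _ in range(sample_size)]
        # The next model is fitted to that synthetic data only.
        mean = statistics.fmean(samples)
        std = statistics.pstdev(samples)
        history.append(std)
    return history


history = simulate_collapse()
print(f"spread at generation 0: {history[0]:.3f}")
print(f"spread at generation {len(history) - 1}: {history[-1]:.3g}")
```

Each fit loses a little of the original spread, and because nothing ever puts it back, the losses compound. A larger sample_size slows the shrinkage but does not stop it, which is roughly the argument for keeping human-created content in the mix.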

Because human-created and -curated content will be deemed to be of higher quality, organizations will be strongly incentivized to not label LLM-created content as such. This will be problematic for LLM developers who are looking for the enormous amounts of high quality data necessary to train their models.

The seeds of the destruction of LLMs lie in the economics of creating and using LLMs.

I believe that LLMs have a future in being used as tools by experienced users in the same way such users may use tools like Wikipedia, Google Search, and StackOverflow today, with much of the same risk.