Winter is Coming: AI Companies Must Account for Rights Violations

Authors are using litigation to force a final reckoning

Sep 27, 2023

“Over my dead body!” Image by Ellen Levy Finch July 1998, Seattle, WA. - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=900369

Last week, George R.R. Martin joined authors like John Grisham and Jodi Picoult in suing OpenAI for blatantly ignoring copyright and threatening their livelihoods. The authors are accusing OpenAI of feeding their work into the large language models (LLMs) behind ChatGPT for consumption and eventual regurgitation. From the filing:

[OpenAI] could have “trained” their LLMs on works in the public domain. They could have paid a reasonable licensing fee to use copyrighted works. What [OpenAI] could not do was evade the Copyright Act altogether to power their lucrative commercial endeavor, taking whatever datasets of relatively recent books they could get their hands on without authorization.
There is nothing fair about this. [OpenAI]’s unauthorized use of Plaintiffs’ copyrighted works thus presents a straightforward infringement case applying well-established law to well-recognized copyright harms.

This is actually the third class action lawsuit authors have filed against OpenAI this year. Each filing amounts to the same thing: OpenAI continues to brazenly flout Copyright Law and is a clear and present danger to all writers of quality everywhere.

The construction and composition of the training data behind the LLMs is the heart of the issue. Per their name, these models need a very large library of text. Common Crawl is a dataset widely used as a primary training library for LLMs. It includes 240 billion pages, and adds another 3-5 billion every month. Common Crawl uses a bot that crawls the internet and copies whatever it finds into a database.

There is also a quality factor to consider. ChatGPT’s ability to generate human-sounding responses derives entirely from the textual meal it is fed. Garbage in, garbage out. Therefore, stuff written by professional authors and fine-tuned by professional editors is especially desirable. Like, the most desirable.

OpenAI has admitted to using datasets it dubbed Books1 and Books2, but is reluctant to disclose what those libraries entail. Books1 is believed to be Project Gutenberg, a library of free public domain books. Nobody knows the exact origins of Books2, but most suspect it was sourced from a pirate library operating in the internet’s shadows. There’s also a Books3 dataset, which an AI researcher built using Bibliotik, one such shadow library.[1] The court filing claims the size and disposition of Books3 (again, confirmed to include pirated material) is similar to Books2 and therefore indicates Books2 is also packed to the gills with illegally obtained material.

The filing includes examples where ChatGPT summarized the named authors’ books and chapters with a high degree of detail and accuracy. It also generated alternative versions. In Martin’s case, ChatGPT created a detailed outline for a sequel to A Clash of Kings titled “A Dance With Shadows.” This is probably the closest they’ll come to a smoking gun.

This whole debacle draws the open nature of the Internet into sharp relief. Datasets like Common Crawl take advantage of the free ubiquity of data and copy whatever they will, without consent. This is nothing new (before crawling, it was screen scraping), but it has never been used to feed a profit engine to such an extent. And while Common Crawl is a non-profit, and therefore can only be accused of being morally malfeasant, OpenAI is a for-profit enterprise. OpenAI was actually founded as a non-profit but changed its tune once it realized how much money was to be made. And is already making.

For its part, OpenAI has rallied under the banner of Fair Use.

OpenAI claimed that the authors "misconceive the scope of copyright, failing to take into account the limitations and exceptions (including fair use) that properly leave room for innovations like the large language models now at the forefront of artificial intelligence."

So basically: Our innovation is more important than your intrinsic rights. Got it.

Fair Use is a tricky bit of legislation. Section 107 of the Copyright Act provides that “the fair use of a copyrighted work, including such use by reproduction in copies or phonorecords or by any other means specified by that section, for purposes such as criticism, comment, news reporting, teaching (including multiple copies for classroom use), scholarship, or research, is not an infringement of copyright.”

As someone running a pop culture publication, I think about Fair Use every day. It is what allows us to use an image from Star Wars in an article about the mating habits of Wookiees. I am also very cognizant that there is a line between Fair Use and outright Abuse, and I make sure we don’t cross it.

To quote the great Charlie Murphy, OpenAI is a habitual line-stepper.

The Fair Use doctrine includes a series of factors to consider when evaluating such a matter:

1. the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes; (emphasis mine)
2. the nature of the copyrighted work;
3. the amount and substantiality of the portion used in relation to the copyrighted work as a whole; (again, me)
4. the effect of the use upon the potential market for or value of the copyrighted work. (hello again)

I’m no lawyer, but given OpenAI is making a ton of money off of ill-gotten material (point 1), which is based in part upon consuming copyrighted books totally and completely (point 3), and which may make it difficult for these writers to make a buck in the future (point 4), it seems pretty open and shut to me.

I think point 4 is where the authors drew the line and finally said, “eff this shit.” Pirating of eBooks has been around since there have been eBooks. But when the content of those books—the unique structure and word choice, which together form the author’s voice—can be used to allow anyone to generate content that mimics the style of George R.R. Martin or Michael Connelly, the jig is up.

ChatGPT isn’t there yet. Someone already used it to create the last two books of the A Song of Ice and Fire series, and they were not good. Not yet. But given the speed of innovation, and the fact that OpenAI is clearly willing to cut corners—ethical, legal, or otherwise—it’s really only a matter of time.

George and his cohort are seeking damages up to $150,000 per book for making them unwilling accomplices in their own replacement, as well as a permanent cease-and-desist from any such shenanigans in the future. Which honestly doesn’t seem like enough.

And while it may seem like this only affects people with famous names, that’s simply not true if you write on the Internet, where opt-in is the default and consent is assumed. In Google’s sanitized version of the Common Crawl dataset, Medium was the 46th largest site in terms of quantity of data copied. For its part, Medium is considering how best to handle the situation and is leaning toward opt-out by default. Which is great news. But as it stands, I guarantee there are words written by me in the Common Crawl dataset. And that sucks.

I’m small potatoes in the scheme of all this. But this exact issue is what prompted me to put posts on All the Fanfare behind a free paywall.

We can’t put AI tools back in the box. But we can do our best to protect our work. It’s the only real choice we have.

Other Headlines

HBO Copies Netflix’s Homework, Doesn’t Realize Netflix is Failing

Winning Time has been on my watch list for some time. I finally started watching it a few weeks ago. Naturally, that prompted HBO to announce they were canceling the show after season 2. Which is a damned tragedy. Winning Time is one of the best shows I’ve seen in some time.

Netflix patented the ‘get hooked on this show and then we’ll cancel it after 2 seasons’ model. The fact that HBO is now deploying said model is an indictment of how far it has fallen.

A Guillermo del Toro Star Wars Film Would Be Different and Fresh, Thus it Was Canceled

There’s been a lot of news lately about Star Wars projects that will probably never see the light of day, which is just nature’s way of healing. For a hot minute there, Lucasfilm was handing out new Star Wars projects like Oprah gives away cars. But I was legitimately bummed that del Toro had been penciled in to direct a Star Wars film that is now shelved indefinitely.

Nothing is known about the film, which was based on a David Goyer script. Goyer is primarily known as the screenwriter for Nolan’s The Dark Knight trilogy as well as a ton of lesser superhero movies, including Batman v. Superman, so maybe we’re dodging a bullet here. I am very much interested in del Toro’s ideas for a Jabba the Hutt film reminiscent of The Godfather, though.

Expend4bles Bombed, Core Demographic Would Rather Nap

Expendables 4 opened this past weekend with a paltry $8 million, leading to the worst box office weekend of 2023.

Somebody thought it was a good idea to spend $100 million making this film. I can only assume the lion’s share went toward explosions and Metamucil.

Notable Releases This Week

The Creator (Theaters)

I’m pretty excited about this one.

The Creator is a futuristic yet timely movie about humans, our artificial creations, and our natural abject horror when we are faced with what our hands hath wrought. The trailer is dramatic and actiony, and looks like a modern take on something like Steven Spielberg’s A.I. Artificial Intelligence.[2] Except I’m getting the sort of icky thoughtful vibes I got watching Ex Machina. This is a very good thing.

Early buzz has been overwhelmingly positive.

Gen V: Season 1 (Prime Video)

The Boys is a series that asks the question: What would happen if Superman was a huge asshole? Gen V is a spin-off in the same mold. It follows young supes-to-be attending a university for gifted people. Picture something like Van Wilder crossed with X-Men and you’ll be close to the mark.

Survivor: Season 45 and The Amazing Race: Season 35 (CBS)

I stopped watching Survivor sometime around season 5. What season had Colby and Tina? (I just did some googling and turns out they were season 2.) This is a good example of where I stand with this show.

I do watch The Amazing Race. I think that might be the only reality television I routinely watch, but in my defense, it’s really more like a sport.


[1] My favorite part about the Books3 article is where the guy admits he was just browsing the Internet, looking to do some light intellectual theft. You know, like you do.

[2] Spielberg’s film was released in 2001, not so long ago, but apparently the thought at the time was that the common Joe or Jane wouldn’t understand A.I. = Artificial Intelligence, hence the clunky title. If it were released today, I think it would just be called AI.
