Two Rulings on Fair Use and LLM Training
District court judges find themselves begrudgingly ruling for Meta and Anthropic
Last week, partial rulings were handed down in two key copyright cases: Bartz et al. v. Anthropic and Kadrey et al. v. Meta. In each, a federal district court judge ruled on whether frontier labs could permissibly use copyrighted books as training data without compensating the copyright holders, which here are the books’ authors. The records of the cases are similar, and both rulings should be seen as victories for the labs—but both judges indicated that a different fact pattern or a different complaint could have resulted in a finding of liability for copyright infringement.
What’s happening, and why does it matter?
Since the release of ChatGPT, observers have been waiting for a ruling on whether training on copyrighted materials constitutes infringement. All large language models have been trained on an extensive quantity of copyrighted data: books, news articles, lyrics, and more. In some cases, the labs may have paid for rights to the data in order to forestall a lawsuit; last year, OpenAI inked deals with Time, Condé Nast, and a few other publishers. Books, however, are especially difficult to license in bulk, because publishing houses generally do not hold the rights to license the content for new uses—the authors themselves do.
The labs, aware of this reality, were faced with a set of poor options: They could try to negotiate with authors, but the sheer scale of the publishing industry made this largely infeasible. They could forgo book data, but as both opinions note, books are among the highest-value data for pretraining: they are generally quite long, well written (at least compared with snippets of text from the internet), and internally consistent (that is, the final tokens of a book depend meaningfully on the early tokens, training the model to attend over a full context window). Or, finally, they could just copy the books without paying and hope that the legal consequences wouldn’t be too severe.
Had the first or second options been chosen, no lawsuit would have been filed. Instead, both Anthropic and Meta relied on pirated copies of books (the LibGen and books3 datasets in particular) at the beginning of their training process. Anthropic, realizing that this illegal acquisition could cost them an enormous amount of money down the road, pivoted to purchasing physical books, cutting the pages to size, and reading them with a book scanner—a course of action which indicates just how pricey they believed a finding of infringement could be, since chopping up physical books is time-consuming, expensive, and far outside the core competencies of a research lab.
When large language models were a research curiosity, this piracy was unlikely to cause many problems, due to a broad exception to copyright law called fair use. To understand fair use, it’s helpful to remember why we protect artistic works at all—and indeed, why it is so important that it is expressly mentioned in the Constitution.
Why have copyright?
The creation of an artistic work is, generally speaking, the result of a substantial amount of work and training. The artist is generally not guaranteed profit for that effort, and must instead rely on selling or licensing the result to earn a living: A new book is valuable only insofar as it sells copies, is adapted into a movie, or is otherwise utilized in an economically valuable manner. Hence, we grant the artist exclusive rights to their expressive creation for a period of time—though these rights cover neither the ideas contained in the work nor the style of expression, just the actual words on the page.
This grant of rights does not come without a cost, however. By granting exclusive rights to the creator, society is prevented from broadly developing derivative works, despite every creator being the product of societal investment (school, infrastructure, other thinkers, etc.). The United States consequently caps the duration of copyright at the lifetime of the author plus seventy years—more than enough time to recoup what value can be earned. Once that time lapses, the work enters the public domain, and all uses are permitted, including making copies of the work and selling them without returning any profit to the descendants of the original rightsholder. Prior to that point, any copying requires paying for a license—the price of which can vary widely based on the market and the profitability of the resulting use.
There’s something missing here, though. A book review which excerpts only a handful of phrases adds public value without stealing value from the author; use of books for teaching or reporting seems both beneficial and unlikely to displace much demand for the original book. A bright-line, no-exceptions rule would grant too much power to the author, and would deprive society of the benefits of debating, critiquing, and analyzing the published works. This is the origin of the fair use doctrine, which provides a narrow avenue where licensing is not required, and copying is permitted.
As Judge Chhabria, writing in Kadrey, summarized:
Under this doctrine, “the fair use of a copyrighted work . . . for purposes such as criticism, comment, news reporting, teaching . . . , scholarship, or research, is not an infringement of copyright.” Fair use “permits courts to avoid rigid application of the copyright statute when, on occasion, it would stifle the very creativity which that law is designed to foster.” (citations omitted)
In these cases, no party disputes whether copyrighted works were used—but the labs contend that fair use means they are not liable for copyright infringement.
Background principles
These opinions are both rulings on summary judgment—which means that the parties have taken discovery (i.e. forced disclosure of relevant information for the lawsuit), but a trial has not taken place. The opinions evaluate as a matter of law whether the labs were entitled to a fair use defense; “as a matter of law” here means that the judges take the authors’ side on any contested question of fact in the record (that is, if the authors and the labs dispute what happened, the judge assumes that the authors are correct for the purposes of this opinion). This means each opinion analyzes only how the legal doctrines of fair use apply to uncontested facts, making the comparison between the opinions quite straightforward. (N.B. Both cases involve other allegations: Bartz includes a claim under the Digital Millennium Copyright Act, and Kadrey a claim about re-uploading pirated content. Since those issues have not been ruled on, we omit them from the discussion.)
The problem for frontier labs is that the boundaries of fair use doctrine are anything but clear: each reviewing court must apply a four-factor balancing test to assess whether the unauthorized copying and use of a work falls within the exception. The factors the court considers are (1) the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes; (2) the nature of the copyrighted work; (3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and (4) the effect of the use upon the potential market for or value of the copyrighted work.
Both Judges Alsup and Chhabria analyzed the fair use claim with respect to each use of the text, and for both, the core question was whether training a large language model falls within the fair use exception and hence does not require permission from the copyright owner. The first and fourth factors could plausibly be weighed in either direction; the fourth factor in particular receives substantial weight in a standard fair use evaluation. The second and third factors are evaluated similarly by the two judges, but do not play a starring role in their overall balancing, so we won’t discuss them further here.
With regard to the first factor, the purpose and character of the use, the use is inarguably transformative, which is precisely the kind of innovation fair use is designed to protect. But both Anthropic and Meta (and every other leading frontier lab) used these materials for commercial purposes: They sought to build a new product that people would pay for, and fair use would absolve them of returning any of that revenue to the authors of these books. On balance, however, the commercial nature of the use was insufficient for the judges to rule in the authors’ favor on the first factor. Though it’s hard to assess why a complex factor like this breaks one way or the other, the judges’ language seems to indicate that the technology is simply too transformative for this factor to weigh against the labs. As Judge Alsup wrote, “the ‘purpose and character’ of using works to train LLMs was transformative—spectacularly so.”
For the critical fourth factor, a straightforward analysis would conclude that since these models do not reproduce substantial portions of the copyrighted text verbatim (due to training and technical safeguards implemented by the labs), they cannot meaningfully infringe on the market for the source works. That analysis, however, is so narrow as to miss the broader point: In most cases, the purchaser of a book is not wedded to the exact words the book contains, but rather to words of the sort the book contains. I might buy the new Sally Rooney bestseller to hear her words in particular, but for an average modern novel or many nonfiction topics, I would plausibly accept any adequate substitute—making the models a threat to the overall market for copyrighted work even if not to the market for any particular work. On this point, more than any other in the analysis, the two judges are in tension—and neither ruling definitively shuts the door on it.
The opinions written by these two judges, despite ending up with similar findings, represent starkly different visions of how law should adapt itself to artificial intelligence, and neither is the final word—at a minimum, the Ninth Circuit will find itself reviewing one or both decisions on appeal, and other courts across the country will be faced with similar questions.
We turn now to two concrete distinctions between the rulings, before highlighting the consequences for copyright holders and frontier labs generally.
How many copies exist?
A standard copyright infringement claim (and fair use defense) proceeds by identifying each granular act of copying that occurred, and then considering each act in isolation, since the fair use factors may weigh differently for each.
In Bartz, there are two relevant sources of copyrighted material: First, Anthropic pirated millions of books, including those of the plaintiffs, from sources like LibGen. It stored those copies in a “library,” which likely means just in networked storage. It then realized that it might have a copyright issue, and proceeded to buy physical copies of books, cut the pages, scan them, and use those transcribed copies for actual training. Alsup sliced this fact pattern into three separate infringement claims:
The pirating of books to produce a “permanent, general-purpose library,” which appears to refer to the fact that Anthropic stored copies of every relevant book, which were used for “research”—and did not delete them after deciding to use physical copies instead.
The transcription of books from physical to digital media, after which the physical copies were discarded. (N.B. Alsup ruled that this transcription is fair use, but the rationale for it is not specific to artificial intelligence, so we won’t discuss it further.)
The training of models off of the copyrighted data.
In contrast, Chhabria’s opinion in Kadrey considered only the training of models, which was done with pirated data—data that likely was retained in a functionally identical manner as Anthropic’s “library,” as above.
The first claim recognized by Alsup is questionable on at least two fronts—and is quietly critiqued by Chhabria in a citation.
First, it ascribes to those copies purposes that the record does not support. Yes, the retained copies of the books could technically have been used by Anthropic for other purposes—but any such use would ground its own, separate infringement claim. The record alleges only that some books were not used for training models, not that they were used for anything else. The opinion states that, “the point . . . [was] to build a central library that one could have paid for,” but absent any evidence of a separate use, it’s hard to see how this qualifies as a discrete and colorable infringement claim.
Second, if I read the opinion correctly, had Anthropic simply deleted any copies it did not include in a pretraining data set (re-downloading them later if needed), it would have incurred no loss and avoided this claim entirely. The ease of avoiding the claim is, of course, not itself part of the fair use analysis—but it suggests that the first claim should collapse into the third.
It’s worth pointing out the substantial scope of damages here. In general, the award would include the damages suffered by the authors (here, the books that went unused in training seem unlikely to have produced any), plus any profits of the infringer (which, again, are likely nothing given that the record shows no use of these materials). But, per 17 U.S.C. § 504, the authors can instead elect statutory damages of up to $30,000 per work infringed. Books3, one of the datasets at issue, contains around 200,000 books, so this could expose Anthropic to billions of dollars in damages if all affected authors pursued them in court.
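The rough arithmetic behind that exposure can be sketched as follows. The statutory range comes from 17 U.S.C. § 504(c); the 200,000-book count is the commonly cited approximate size of Books3. Both figures are illustrative assumptions, not findings from the record—the actual number of infringed works and the per-work award would be determined at trial.

```python
# Back-of-envelope statutory damages exposure under 17 U.S.C. § 504(c).
# Assumes every work in the dataset is a separately infringed work and
# ignores willful infringement (which can raise the cap to $150,000/work).
works = 200_000        # approximate number of books in Books3 (assumption)
per_work_max = 30_000  # statutory maximum per work, non-willful
per_work_min = 750     # statutory minimum per work

print(f"Maximum exposure: ${works * per_work_max:,}")  # $6,000,000,000
print(f"Minimum exposure: ${works * per_work_min:,}")  # $150,000,000
```

Even at the statutory floor, the exposure runs to nine figures, which helps explain why Anthropic thought it worthwhile to buy and destructively scan physical books.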
The litigation over damages will continue and will undoubtedly be appealed, but at first glance it appears that Alsup wanted to hand the plaintiffs a win on at least one claim—though I wouldn’t put money on this part of the ruling surviving appellate review.
What does market substitution consist of?
The marquee holding in each decision is with respect to whether using copyrighted materials for training constitutes fair use, and as noted above, this question often turns on “the effect of the use upon the potential market for or value of the copyrighted work.” In essence, did the copying harm the ability to monetize the original content?
For a plaintiff to win on infringement for training, they likely need to win on this element—the other three factors came out identically for the two judges, and aren’t obviously contestable. Neither plaintiff won on this factor, but each judge pointed quite explicitly to what it would take to succeed in their eyes.
Precise reproduction
Some books can likely be reproduced in substantial part by many language models; precise reproduction of copyrighted text is also a key element of the record in the ongoing lawsuit between the New York Times and OpenAI. Yet, in both Bartz and Kadrey, the plaintiffs did not demonstrate any substantial reproduction of exact text.
For Judge Alsup, this omission was critical: He repeatedly emphasized the failure of the plaintiffs to make any allegation of reproduction, and hinted strongly that this was the key to winning an infringement claim for training:
Authors do not allege that any LLM output provided to users infringed upon Authors’ works. Our record shows the opposite. Users interacted only with the Claude service, which placed additional software between the user and the underlying LLM to ensure that no infringing output ever reached the users. . . . Here, if the outputs seen by users had been infringing, Authors would have a different case. And, if the outputs were ever to become infringing, Authors could bring such a case. But that is not this case.
This is, however, part of the record in New York Times, as noted above. And it is critical: If a plaintiff can demonstrate some quantity of precise reproduction, the judge is more likely to allow expansive discovery to identify the scope of the reproduction, as has just happened in that case. If the plaintiff cannot demonstrate any evidence of reproduction, a judge is unlikely to accept the claim that reproduction is occurring “in the wild,” is unlikely to order broad discovery, and consequently the plaintiff will fail for the same reason as in Bartz.
In some sense, this indicates that the New York Times was canny enough to capture substantial evidence of reproduction before technical safeguards—of the sort every leading lab employs today—were put in place. But regardless of the outcome of that case, the path Alsup highlights is likely a dead end: Blocking the reproduction of copyrighted content has become routine; it is technically feasible and cheap (enough) to do, and consequently the major labs are unlikely to produce close-enough reproductions to plausibly expose themselves to an infringement claim on these grounds.
Market substitution
Chhabria, on the other hand, does not spend many words hunting for precise reproduction: “In short, Llama cannot currently be used to read or otherwise meaningfully access the plaintiffs’ books.” He relies upon expert testimony for this holding, but it bears repeating that this will likely be true of all major text-based models for the foreseeable future: No major lab is going to make the mistake of handing ammunition to copyright holders like in New York Times.
Instead, Chhabria identifies a seemingly novel market substitution argument. He notes that lesser-known authors, whose books do not sell on name recognition alone, are particularly vulnerable to having their works displaced by distinct-but-similar generated works:
It seems unlikely, for instance, that AI-generated books would meaningfully siphon sales away from well-known authors who sell books to people looking for books by those particular authors. But it’s easy to imagine that AI-generated books could successfully crowd out lesser-known works or works by up-and-coming authors. While AI-generated books probably wouldn’t have much of an effect on the market for the works of Agatha Christie, they could very well prevent the next Agatha Christie from getting noticed or selling enough books to keep writing.
From a broad perspective, this argument (which plaintiffs did not raise in their original complaint—striking, given how many words the judge devotes to it) does hold water; I would not want to be in the business of writing generic how-to guides on topics which a model can easily converse about. These models pose a substantial threat to search engine traffic; how could a single unknown author survive?
The problem, as with precise reproduction, lies in the actual pleading. In Kadrey, expert testimony for Meta suggested that Llama’s release had no impact on book sales, which went functionally uncontroverted by the plaintiffs. Chhabria clearly believes that this is the path forward for a winning claim:
On this record, then, Meta has defeated the plaintiffs’ half-hearted argument that its copying causes or threatens significant market harm. That conclusion may be in significant tension with reality, but it’s dictated by the choice the plaintiffs made to put forward two flawed theories of market harm while failing to present meaningful evidence on the effect of training LLMs like Llama with their books on the market for those books.
But to plausibly allege this, plaintiffs would need to demonstrate causal evidence of the defendant’s model driving a decline in book sales. And they would need to disentangle that effect from broader market trends, the shifting popularity of various topics and authors, and any other potential failure in the chain between authorship and delivery to readers.
I just don’t see how a plaintiff can surmount this barrier absent the market for certain types of books fully collapsing in response to a particular model’s release. Instead, it seems likely that the market for lesser-known works will erode slowly over time, but not in a way that demonstrates a clear chain of causation from any one model. The better analogy is Wikipedia’s effect on the sales of encyclopedias (not that there’s a colorable infringement claim against Wikipedia): A certain mode of knowledge acquisition slowly faded out of existence, but there was no single moment at which the market simply disappeared.
Chhabria’s analysis, while more novel and aware of the consequences of technology, likely provides no more meaningful a path forward for plaintiffs.
Wrapping up
These two opinions both seem like rulings from judges who are displeased about having to grant summary judgment to the labs, as evidenced by the words they spend indicating how plaintiffs could have won. But neither of the routes noted by Alsup or Chhabria appears likely to succeed in practice, especially since labs wised up to the risk of exact reproduction years ago.
These cases, and New York Times, remain open, and these orders will undoubtedly be appealed. Similar cases are also being heard in jurisdictions around the country, so it seems likely that we will see at least a few disagreements on matters of law. If those disagreements persist after review by the circuit courts, the question is likely to reach the Supreme Court—but the arguments discussed in last week’s opinions appear to hold their own.
If these rulings stand, these cases seem to represent a near-closing of the door to new infringement claims for textual materials used in training, aside from the lingering claims in Bartz (which may be resolved on appeal) and the existing claims in New York Times. And there are unlikely to be major new sources of copyrighted text data for labs, for three reasons: they’ve already consumed most of what’s available; they are now well aware of the risks of training without a license; and synthetic (or proprietary) data has become an increasingly valuable source for both pretraining and post-training. It’s possible that a different medium could produce new problems—such as the Italian Plumber problem for image generation. For authors of books, newspaper articles, and other copyrighted works, however, these rulings may mean they are simply out of luck.