Explore the intersection of artificial intelligence and law. This hub offers insights into regulations, confidentiality, and copyright in the context of AI.

Training Data Cases

Is training AI models on copyrighted data infringement?

Creators of AI models need to train their models on data. High-quality data, and tons of it. Large language models are supposed to learn world knowledge and culture from their training data. And multimodal models learn how to create music, images, or movies from the highest quality existing examples. But these materials are often copyrighted.

The Fair Use Argument

AI developers consistently argue that using copyrighted material as training data constitutes fair use, and is therefore not infringement. Good AI models are not supposed to output copies of the training data, but should learn like a human learns from reading books or watching films. The point of AI models is to generalize beyond the training data to create novel output--a transformative use that provides a public benefit different from that of the underlying material. A user might infringe if they intentionally generate output too similar to a copyrighted work. But the developer has made only a tool and should not be liable for anything.

The Counter Argument

Rightsholders counter that AI developers make actual copies of works during training, and can often reproduce portions of training data verbatim. Even if the use is "transformative," the copying process itself is commercial and substitutes for properly licensing the works. And AI models harm relevant markets by generating similar content at scale.

This page lists legal cases in the US that involve the issue of whether using copyrighted data for AI training constitutes infringement or fair use. Many cases have been filed, but no US court has squarely addressed the issue.

The "fair use" defense will be the core issue for many of these cases, and for the AI community in general. Defendants have not moved to dismiss on fair use grounds for procedural reasons: fair use is an affirmative defense, so the burden is on the defendant to prove it, whereas on a motion to dismiss, courts must accept the plaintiffs' allegations as true.

Doe v. GitHub, Inc.
Filed: Nov 3, 2022
Developers assert DMCA and breach-of-contract claims over GitHub's use of software uploaded to its platform to train AI coding models.
Current Status:
Trial court dismissed the DMCA claims, but refused to dismiss the contract claims. Case is stayed pending the Plaintiffs' request for interlocutory appeal of the DMCA dismissal.
Visual artists allege creators of Stable Diffusion and Midjourney infringed copyrights by using their works as training data.
Current Status:
The court has dismissed multiple claims but a few remain. The court has not answered the core question of whether training on copyrighted data constitutes infringement.
Getty Images accuses Stability AI of infringing on millions of photographs in training Stable Diffusion.
Current Status:
Jurisdictional motion to dismiss is pending.
Tremblay v. OpenAI, Inc.
Filed: Jun 28, 2023
Authors allege OpenAI infringed their copyrighted books in training GPT models.
Current Status:
Case is in discovery. Certain claims were dismissed, but the claim for direct copyright infringement remains at issue.
Authors allege Meta infringed their copyrighted books by using them to train LLaMA models.
Current Status:
Meta successfully moved to dismiss all claims except the core claim of copyright infringement. Case is in discovery.
Leovy v. Google
Filed: Jul 11, 2023
Authors allege Google violated copyright by training AI models on the authors' works.
Current Status:
Hearing on motion to dismiss set for December 18, 2024, but the motion does not present the core copyright issues.
Huckabee v. Bloomberg
Filed: Oct 17, 2023
Authors allege unauthorized use of books in AI training by Meta, Bloomberg, Microsoft, and EleutherAI.
Current Status:
Motion to dismiss is pending. The case is unique in that the trained model, BloombergGPT, was never publicly released.
Music publishers sue Anthropic for copyright infringement for using lyrics in AI training.
Current Status:
Preliminary injunction hearing set for 11/25/24.
Alter v. OpenAI
Filed: Nov 21, 2023
Authors allege OpenAI infringed their copyrighted books in training GPT models.
Current Status:
Case is in discovery.
NYT claims OpenAI infringed their copyrights by training GPT models on their news articles.
Current Status:
A motion to dismiss is fully briefed and remains pending. The case is in discovery.
News organizations allege OpenAI stripped out copyright management information from journalism works used to train ChatGPT, and therefore violated the DMCA.
Current Status:
Awaiting court's ruling on motion to dismiss, which was heard on 11/1/24.
Authors allege unauthorized use of books to train NVIDIA's NeMo LLM.
Current Status:
Case is in discovery.
Daily News v. Microsoft
Filed: Apr 30, 2024
Newspaper publishers sue Microsoft and OpenAI for copyright infringement.
Current Status:
Case is in discovery.
Record labels claim Suno infringed their copyrighted recordings by using them to train AI music generation models.
Current Status:
Case is in discovery.
Record labels claim Udio infringed their copyrighted recordings by using them to train AI music generation models.
Current Status:
Case is in discovery.
Nonprofit news organization alleges copyright infringement in OpenAI's training of GPT models.
Current Status:
Case is in discovery. Motion to dismiss is pending, but not on the core copyright infringement claim.
Millette v. OpenAI
Filed: Aug 2, 2024
YouTube creators allege OpenAI infringed copyright by using transcriptions of videos as training data for AI models.
Current Status:
Motion to dismiss due 12/16/2024; further papers due in early 2025. The same plaintiff filed cases against Google and Nvidia.
Dow Jones v. Perplexity
Filed: Oct 21, 2024
Owners of the New York Post and the Wall Street Journal allege that Perplexity 'cop[ied] without authorization ... copyrighted works for inclusion into Perplexity's RAG index,' and that its outputs reproduce the plaintiffs' content.
Current Status:
Perplexity has not yet appeared.

Contact us

Grounds LLP is a law firm based in Los Angeles, California. We represent clients in class actions, appeals, and matters involving intellectual property, competition, and technology.

Please read our disclaimer.

Contact Bahrad at bahrad@grounds.ai

Contact Joe at joe@grounds.ai