Explore the intersection of artificial intelligence and law. This hub offers insights into regulations, confidentiality, and copyright in the context of AI.
Training Data Cases
Is training AI models on copyrighted data infringement?
Creators of AI models need to train their models on data. High-quality data, and tons of it. Large language models are supposed to learn world knowledge and culture from their training data. And multimodal models learn how to create music, images, or movies from the highest quality existing examples. But these materials are often copyrighted.
The Fair Use Argument
AI developers consistently argue that using copyrighted material as training data constitutes fair use and is therefore not infringement. Good AI models are not supposed to output copies of the training data, but should learn like a human learns from reading books or watching films. The point of AI models is to generalize beyond the training data to create novel output, a transformative use that provides a public benefit different from that of the underlying material. A user might infringe if they intentionally generate output too similar to a copyrighted work. But the developer has made only a tool and should not be liable for anything.
The Counter Argument
Rightsholders counter that AI developers make actual copies of works during training, and that the resulting models can often reproduce portions of training data verbatim. Even if the use is "transformative," the copying process itself is commercial and substitutes for properly licensing the works. And AI models harm relevant markets by generating similar content at scale.
This page lists legal cases in the US that involve the issue of whether using copyrighted data for AI training constitutes infringement or fair use. Many cases have been filed, but no US court has squarely addressed the issue.
The "fair use" defense will be the core issue for many of these cases, and for the AI community in general. Defendants have not moved to dismiss on fair use grounds for procedural reasons: fair use is an affirmative defense, so the burden is on the defendant to prove it, whereas on a motion to dismiss, courts must accept the plaintiffs' allegations as true.
Case Title | Date Filed | Description | Status |
---|---|---|---|
Doe v. GitHub, Inc. | Nov 3, 2022 | Developers assert DMCA and breach of contract claims over GitHub's use of software uploaded to its platform to train AI coding models. | Trial court dismissed the DMCA claims, but refused to dismiss the contract claims. Case is stayed pending the Plaintiffs' request for interlocutory appeal of the DMCA dismissal. |
Andersen v. Stability AI Ltd. | Jan 23, 2023 | Visual artists allege creators of Stable Diffusion and Midjourney infringed copyrights by using their works as training data. | The court has dismissed multiple claims but a few remain. The court has not answered the core question of whether training on copyrighted data constitutes infringement. |
Getty Images v. Stability AI | Feb 3, 2023 | Getty Images accuses Stability AI of infringing on millions of photographs in training Stable Diffusion. | Jurisdictional motion to dismiss is pending. |
Tremblay v. OpenAI, Inc. | Jun 28, 2023 | Authors allege OpenAI infringed their copyrighted books in training GPT models. | Case is in discovery. Certain claims were dismissed, but the claim for direct copyright infringement remains at issue. |
Kadrey v. Meta Platforms, Inc. | Jul 7, 2023 | Authors allege Meta infringed their copyrighted books by using them to train LLaMA models. | Meta successfully moved to dismiss all claims except the core claim of copyright infringement. Case is in discovery. |
Leovy v. Google | Jul 11, 2023 | Authors allege Google violated copyright by training AI models on the authors' works. | Hearing on motion to dismiss set for December 18, 2024, but the motion does not present the core copyright issues. |
Huckabee v. Bloomberg | Oct 17, 2023 | Authors allege unauthorized use of books in AI training by Meta, Bloomberg, Microsoft, and EleutherAI. | Motion to dismiss is pending. The case is unique in that the trained model, BloombergGPT, was never publicly released. |
Concord Music Group, Inc. v. Anthropic PBC | Oct 18, 2023 | Music publishers sue Anthropic for copyright infringement for using lyrics in AI training. | Preliminary injunction hearing set for 11/25/24. |
Alter v. OpenAI | Nov 21, 2023 | Authors allege OpenAI infringed their copyrighted books in training GPT models. | Case is in discovery. |
The New York Times Company v. Microsoft Corporation | Dec 28, 2023 | NYT claims OpenAI infringed their copyrights by training GPT models on their news articles. | A motion to dismiss is fully briefed and has been pending. The case is in discovery. |
The Intercept Media and Raw Story Media v. OpenAI | Feb 28, 2024 | News organizations allege OpenAI stripped out copyright management information from journalism works used to train ChatGPT, and therefore violated the DMCA. | Awaiting court's ruling on motion to dismiss, which was heard on 11/1/24. |
Nazemian and Dubus v. NVIDIA Corporation | Mar 8, 2024 | Authors allege unauthorized use of books to train NVIDIA's NeMo LLM. | Case is in discovery. |
Daily News v. Microsoft | Apr 30, 2024 | Newspaper publishers sue Microsoft and OpenAI for copyright infringement. | Case is in discovery. |
UMG Recordings, Inc. v. Suno, Inc. | Jun 24, 2024 | Record labels claim Suno infringed their copyrighted recordings by using them to train AI music generation models. | Case is in discovery. |
UMG Recordings, Inc. v. Uncharted Labs, Inc. | Jun 24, 2024 | Record labels claim Udio infringed their copyrighted recordings by using them to train AI music generation models. | Case is in discovery. |
Center for Investigative Reporting v. OpenAI | Jun 27, 2024 | Nonprofit news organization alleges copyright infringement in OpenAI's training of GPT models. | Case is in discovery. Motion to dismiss is pending, but not on the core copyright infringement claim. |
Millette v. OpenAI | Aug 2, 2024 | YouTube creators allege OpenAI infringed copyright by using transcriptions of videos as training data for AI models. | Motion to dismiss due 12/16/2024, further papers due in early 2025. The same plaintiff filed cases against Google and Nvidia. |
Dow Jones v. Perplexity | Oct 21, 2024 | Owners of New York Post and Wall Street Journal allege that Perplexity 'cop[ied] without authorization ... copyrighted works for inclusion into Perplexity's RAG index,' and that its outputs reproduce the plaintiffs' content. | Perplexity has not yet appeared. |