07 Feb Rage Against the Machine: The lawsuit tsunami and the impact of ChatGPT on Fair Use and Copyright Laws
Welcome to the third installment of my series (see below for links) that delves into tech’s new plaything: ChatGPT. This edition explores the thorniest part of AI and GPT-3: whether the data sets used to train large language models can legally be used for that purpose.
How do Large Language Models work?
Large language models (LLMs) learn from huge volumes of data, and the size of the training dataset is central to what a model can do. The definition of “large” has grown rapidly, and these models are now typically trained on datasets large enough to include a substantial share of what has been written on the internet over a significant span of time. According to OpenAI, GPT-2, an earlier model in the family that led to ChatGPT, was trained on a dataset of approximately 8 million web pages. This dataset, known as the “WebText” dataset, was collected from the internet and includes a wide variety of text from news articles, websites, and other online sources. Later models such as GPT-3 were trained on additional datasets that include books, articles, and other text sources, which all contribute to the overall size of the training data.
Who owns this data set and the copyright?
Companies like Stability AI and OpenAI, the company behind ChatGPT, have long claimed that “fair use” protects them in the event that their systems were trained on copyrighted content. This doctrine, enshrined in U.S. law, permits limited use of copyrighted material without first having to obtain permission from the rights holder. The success of a fair use defense will depend on whether the works generated by the AI are considered “transformative”, that is, whether they use the copyrighted works in a way that varies significantly from the originals. The U.S. Supreme Court’s Google v. Oracle decision suggests that using collected data to create new works can be transformative. In that case, Google’s use of portions of Java SE code to create its Android operating system was found to be fair use.
The legal tsunami has begun…
Microsoft, GitHub and OpenAI are currently being sued in a class action lawsuit that accuses them of violating copyright law by allowing Copilot, a code-generating AI system trained on billions of lines of public code, to regurgitate licensed code snippets without providing credit. Two companies behind popular AI art tools, Midjourney and Stability AI, are in the crosshairs of a legal case that alleges they infringed on the rights of millions of artists by training their tools on web-scraped images. Stock image supplier Getty Images has also taken Stability AI to court for reportedly using millions of images from its site without permission to train Stable Diffusion, an art-generating AI.
“A Lawyer in the style of Banksy” Image Credits: MidJourney under a CC BY 4.0 license.
At this point the legal wrangling has started with images and creators, but it will inevitably move on to other forms of data. These lawsuits could drag on for years as the courts dissect, case by case, whether LLM outputs are derivative or transformative. The alternative is for regulators to step in and broadly update fair use and copyright laws, which is a long shot at this time.
What are enterprises supposed to do?
Enterprise Users
Amid this chaos, enterprise customers will likely demand ChatGPT-style functionality (indeed, this is already happening), understandably won over by its ease of use and conversational prowess. Software suppliers will be forced to incorporate these LLM capabilities, but will likely hit a stone wall with risk-averse IT teams. What happens if company data is copied and pasted into ChatGPT or other tools?
Corporate IT
IT security teams and corporate lawyers will tread carefully and will not be willing to change their carefully constructed commercial software contracts till the murkiness clears. They will default to the terms on data liability, copyright protection, and proprietary data that have been carved out of stone and that they routinely enforce with their commercial software suppliers; any new generative AI capabilities will also have to comply.
“Data liability and copyright laws carved out of stone” Image Credits: MidJourney under a CC BY 4.0 license.
Software Vendors
Commercial software vendors will carefully evaluate what is possible and, before adding any capabilities, ensure they do not conflict with the liability and data protection terms in their existing contracts. They will cherry-pick the evolving generative AI capabilities that still fit the compliance directives they must adhere to.
In conclusion, for all the philosophizing about generative AI and LLMs, for large enterprises the promise of the technology will be tempered by more mundane questions about fair use, copyright, and data ownership, and by how the courts and regulators respond. Expect a twilight zone until the dust settles.
Links to my previous posts delving into ChatGPT:
The promise of ChatGPT and the reality of the enterprise
The Rise of the Machines: The Impact and Future of ChatGPT