ChatGPT and Copyright: What You Need to Know

Robin Hood stole from the rich to give to the poor. Whether or not he was a real figure in history is beside the point. The myth, the man, the legend, beloved by the common man and reviled by the elite, exuded cleverness and charm. Is ChatGPT a modern-day Robin Hood?

How ChatGPT Works

We were exposed to AI long before OpenAI released ChatGPT in 2022, but this new tech made it all but ubiquitous. From search engines to language translators to smart home devices, you’re more likely to interact with AI now than ever before. You may even use it yourself to write emails or draft resumes.

ChatGPT is a large language model tool. OpenAI provided the bot with content using a database of public information. The company then trained the bot using a technique called Reinforcement Learning from Human Feedback (RLHF). People interacted with it, playing both the AI and the user side of a conversation, to teach it to respond to prompts the way we speak to each other in real life. OpenAI continues to train it, using the prompts that people put into it every day to make it better and better (so watch what you say!). Meanwhile, on the back end, the AI simply processes data as numbers (think 0s and 1s falling from the sky à la The Matrix) and uses predictive modeling to deliver the content you want in a user-friendly way.
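To get a feel for what "predictive modeling" means, here is a toy sketch in Python. It is not how ChatGPT is built (that involves a far larger neural network), and the function names and the tiny Robin Hood corpus are made up for illustration. It simply counts which word tends to follow which, then guesses the most likely next word.

```python
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count, for each word, which words follow it in the text."""
    words = text.lower().split()
    following = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1
    return following


def predict_next(following, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    counts = following.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Hypothetical mini "dataset", invented for this example.
corpus = (
    "robin hood stole from the rich . "
    "robin hood gave to the poor . "
    "robin hood hid in the forest"
)
model = train_bigrams(corpus)

print(predict_next(model, "robin"))    # the word seen most often after "robin"
print(predict_next(model, "sheriff"))  # a word the model never saw
```

The model can only echo patterns from the text it was given, which is why the contents of the training dataset matter so much.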

The Dataset

When you use an AI like ChatGPT, you should ask yourself where all of that information comes from. You don’t need a computer science degree, but you have a right to know if the information you are getting is legitimate and up-to-date. After all, ChatGPT has been known to “hallucinate”, i.e., make stuff up.

Both the free version, which runs on GPT-3.5, and the paid version, which runs on GPT-4 (sold as ChatGPT Plus), use a dataset that only goes up to September 2021. The bulk of that data comes from Common Crawl, a repository of web data gathered by a non-profit organization of the same name. OpenAI reportedly also turned to books, news articles, and other data sources like Wikipedia. The big question is which specific books and other creative works were included. OpenAI keeps that information under lock and key, which raises important issues about copyright and fair use.

Internet Access

The free version of ChatGPT does not have access to the live internet, but the paid version does. OpenAI also claims that the premium version is 40% more accurate than its predecessor.

Robin Hood vs. Copyright Infringement

Back to the Robin Hood analogy. ChatGPT takes from one group and gives to another. Specifically, it takes information that would cost you a whole lot of time, money, and effort to gather yourself and gives it to you for free!

The problem is whether it has the right to do so. If copyrighted materials are included in the ChatGPT dataset (and we just don’t know!), creators could lose revenue and see their livelihoods affected.

That’s why OpenAI faces multiple class-action lawsuits.

  • In one suit, authors Paul Tremblay (The Cabin at the End of the World) and Mona Awad (13 Ways of Looking at a Fat Girl, Bunny) point out that the bot summarized their novels in “very accurate” detail, so much so that it would only be possible if the bot had been trained on them in the first place. Including these books in the ChatGPT dataset without permission, they argue, would violate copyright law.
  • In another suit, comedian/author Sarah Silverman (The Bedwetter), author Christopher Golden (The Myth Hunters, The Ferryman), and author Richard Kadrey (Sandman Slim, The Grand Dark) argue that the bot violates copyright law by producing “derivative” versions of their work when prompted to summarize the source.

This is not the first time copyright issues have come up. A 2023 study tried to reverse-engineer the dataset by taking text from well-known books and removing the character names. The study authors then asked the bot questions about the text and, sure enough, ChatGPT seemed to know more than it should. Books that raised flags included everything from George Orwell’s 1984 to more modern books like Helen Fielding’s Bridget Jones’s Diary, Stephen King’s The Shining, and Max Brooks’s World War Z.
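The name-removal idea behind that study can be sketched in a few lines of Python. This is only a rough illustration, not the researchers' actual code: the function names, the `[MASK]` placeholder, and the sample sentence are all made up here, and the step of actually sending the masked passage to a chatbot is left out.

```python
import re


def mask_name(passage, name, mask="[MASK]"):
    """Replace every whole-word occurrence of `name` with a placeholder."""
    pattern = r"\b" + re.escape(name) + r"\b"
    return re.sub(pattern, mask, passage)


def is_correct_guess(guess, name):
    """Compare a guess to the hidden name, ignoring case and outer spaces."""
    return guess.strip().lower() == name.strip().lower()


# Invented sentence for illustration; not a quote from any real book.
passage = "Robin drew his bow and aimed at the sheriff's purse."
masked = mask_name(passage, "Robin")
print(masked)

# Pretend this string came back from the chatbot (hypothetical reply).
chatbot_reply = " robin "
print(is_correct_guess(chatbot_reply, "Robin"))
```

If a bot keeps filling in the right name for passages that give no other clue, that suggests it has seen the book before.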

Whether or not you take advantage of ChatGPT, one thing remains true. It is not 100% reliable. Check your sources, and more importantly, credit them. You would want that same respect given to you.

References

Capoot, A. (2023). Authors sue OpenAI, allege their books were used to train ChatGPT without their consent. CNBC. https://www.cnbc.com/2023/07/05/authors-sue-openai-allege-chatgpt-was-trained-on-their-books.html

Chang, K. K., Cramer, M., Soni, S., & Bamman, D. (2023). Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4. ArXiv.org. https://arxiv.org/abs/2305.00118

Dsouza, E.G. (2023). How ChatGPT Works? Training Model of ChatGPT. Edureka. https://www.edureka.co/blog/how-chatgpt-works-training-model-of-chatgpt/

GPT-4. (2023). OpenAI. https://openai.com/research/gpt-4

Paul Tremblay and Mona Awad Class Action Suit vs. OpenAI. (2023). https://llmlitigation.com/pdf/03223/tremblay-openai-complaint.pdf

Rogers, A. (2023). ChatGPT secret training data: the top 50 books AI bots are reading. Business Insider. https://www.businessinsider.com/chatbot-training-data-chatgpt-gpt4-books-sci-fi-artificial-intelligence-2023-5

Sarah Silverman, Christopher Golden, and Richard Kadrey vs. OpenAI. (2023). https://llmlitigation.com/pdf/03223/tremblay-openai-complaint.pdf

What is ChatGPT? (2021). OpenAI. https://help.openai.com/en/articles/6783457-what-is-chatgpt
