The New York Times filed a copyright infringement lawsuit against OpenAI earlier this month, and now OpenAI has publicly responded on its blog. Notably, one of their points is about the word-for-word regurgitation that the Times was able to generate through some creative prompting. OpenAI writes, “Memorization is a rare failure of the learning process that we are continually making progress on, but it’s more common when particular content appears more than once in training data, like if pieces of it appear on lots of different public websites. So we have measures in place to limit inadvertent memorization and prevent regurgitation in model outputs. We also expect our users to act responsibly; intentionally manipulating our models to regurgitate is not an appropriate use of our technology and is against our terms of use.”
The blog post also offers insight into OpenAI’s negotiations with the Times before those talks broke down. Perhaps insultingly, OpenAI informed the Times that “like any single source, their content didn’t meaningfully contribute to the training of our existing models and also wouldn’t be sufficiently impactful for future training.” AI companies have always argued that, for their usage, data is like sand, not oil. Only in aggregate is it valuable, and taking away a few grains isn’t going to matter. Perhaps OpenAI was using that argument to defend its offer of several million dollars per year to the New York Times, which was reported by The Information (subscription required); for comparison, OpenAI’s annualized revenue is believed to be $1.6 billion.
Related: Here is thoughtful commentary by Peter Schoppert on how AI companies will likely avoid regurgitation of copyrighted material and thus have a stronger fair use case. This won’t resolve all plaintiffs’ complaints, of course, but it’s a major factor in determining fair use in a court of law. He writes, “My read is that [tech companies] will try very hard to separate the copying for training and operation and deployment from the creation of copied outputs. The training of LLMs is fair use, or it would be except for this pesky memorisation issue. They will begin to push harder for safe harbour on the copied outputs, characterising these as a minor issue, not a real problem, just few inadvertent copies made by angry (and greedy) copyright holders.”

Jane Friedman has spent her entire career working in the publishing industry, with a focus on business reporting and author education. Established in 2015, her newsletter The Bottom Line provides nuanced market intelligence to thousands of authors and industry professionals; in 2023, she was named Publishing Commentator of the Year by Digital Book World.
Jane’s expertise regularly features in major media outlets such as The New York Times, The Atlantic, NPR, The Today Show, Wired, The Guardian, Fox News, and BBC. Her book, The Business of Being a Writer, Second Edition (The University of Chicago Press), is used as a classroom text by many writing and publishing degree programs. She reaches thousands through speaking engagements and workshops at diverse venues worldwide, including NYU’s Advanced Publishing Institute, Frankfurt Book Fair, and numerous MFA programs.



