OpenAI hits back at the New York Times

The New York Times filed a copyright infringement lawsuit against OpenAI earlier this month, and OpenAI has now responded publicly on its blog. Notably, one of its points concerns the word-for-word regurgitation that the Times was able to elicit through some creative prompting. OpenAI writes, “Memorization is a rare failure of the learning process that we are continually making progress on, but it’s more common when particular content appears more than once in training data, like if pieces of it appear on lots of different public websites. So we have measures in place to limit inadvertent memorization and prevent regurgitation in model outputs. We also expect our users to act responsibly; intentionally manipulating our models to regurgitate is not an appropriate use of our technology and is against our terms of use.”

In the blog post, OpenAI also offers insight into its negotiations with the Times before the talks broke down. Perhaps insultingly, OpenAI informed the Times that “like any single source, their content didn’t meaningfully contribute to the training of our existing models and also wouldn’t be sufficiently impactful for future training.” AI companies have long argued that, for their purposes, data is like sand, not oil: it is valuable only in aggregate, and taking away a few grains isn’t going to matter. Perhaps OpenAI was using that argument to defend its offer of several million dollars per year to the New York Times, which was reported by The Information (subscription required). For comparison, OpenAI’s annualized revenue is believed to be $1.6 billion.

Related: Here is thoughtful commentary by Peter Schoppert on how AI companies will likely try to avoid regurgitating copyrighted material and thereby strengthen their fair use case. This won’t resolve all plaintiffs’ complaints, of course, but whether a model’s outputs copy the original is a major factor in determining fair use in a court of law. He writes, “My read is that [tech companies] will try very hard to separate the copying for training and operation and deployment from the creation of copied outputs. The training of LLMs is fair use, or it would be except for this pesky memorisation issue. They will begin to push harder for safe harbour on the copied outputs, characterising these as a minor issue, not a real problem, just few inadvertent copies made by angry (and greedy) copyright holders.”