OpenAI accidentally deleted case data

Lawyers representing The New York Times (NYT) and the Daily News in their lawsuit against OpenAI, which alleges unauthorized use of their content to train AI models, claim that OpenAI engineers accidentally deleted data that could have been relevant to the case, TechCrunch reports.

Earlier this year, OpenAI provided two virtual machines with computing resources so that NYT and Daily News consultants could search its AI training sets for their copyrighted content.

An attorney representing the NYT and the Daily News filed the letter in the U.S. District Court for the Southern District of New York. The letter is an update on the status of the training data issues and a new request that OpenAI identify and acknowledge what work from the News plaintiffs (NYT and Daily News) it used to train each of its GPT models.

The letter stated that on November 14, 2024, OpenAI engineers deleted programs and search results data stored on one of the dedicated virtual machines. However, the publishers’ lawyer added that they had no reason to believe that the deletion was intentional.

OpenAI training datasets are sandboxed: NYT

The publishers’ attorney said they “incur significant burdens and expenses in searching for their copyrighted works in OpenAI’s training datasets in a highly controlled environment that this court and the parties have previously called a ‘sandbox.’”

The publishers’ lawyer said they and the experts they hired had spent more than 150 hours searching OpenAI’s training data since November 1, 2024. He added that OpenAI was able to recover much of the data it had “erased.” However, OpenAI “irretrievably lost” the folder structure and file names of the publishers’ work product.

OpenAI is in a better position to search its own datasets.

He added that without the folder structure and original file names, the recovered data is “unreliable” and cannot confirm whether OpenAI used copied publisher articles to build its models. Calling the recovered data “unusable,” the publishers’ lawyer argued that OpenAI is in a better position to search its own datasets using its own tools and equipment.

“The News plaintiffs have also provided the information OpenAI requires to conduct these searches; all that remains is for OpenAI to commit to doing so in a timely manner,” the letter said.

The News plaintiffs provided OpenAI with detailed instructions on how to search their content using specific URLs and “n-gram” analysis that detects duplicate phrases in their works. However, OpenAI has yet to produce results or confirm meaningful progress. According to the documents, OpenAI’s lawyer reported only “promising meetings” with the company’s engineers, but without tangible results. Moreover, in response to plaintiffs’ formal requests for admission, OpenAI stated that it “does not acknowledge or deny” the use of publishers’ work in its training datasets or models.
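The filings do not describe the plaintiffs’ exact method, but n-gram analysis of the kind mentioned above is a standard technique for detecting verbatim reuse: split both texts into overlapping sequences of n consecutive words and measure how many of one text’s sequences appear in the other. A minimal sketch (function names and the threshold of 8-word n-grams are illustrative, not taken from the case documents):

```python
def ngrams(text, n=8):
    """Return the set of overlapping n-word sequences in a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(article, training_chunk, n=8):
    """Fraction of the article's n-grams found verbatim in the chunk."""
    a = ngrams(article, n)
    if not a:
        return 0.0
    return len(a & ngrams(training_chunk, n)) / len(a)

# A verbatim excerpt scores 1.0; unrelated text scores 0.0.
article = "the quick brown fox jumps over the lazy dog near the river bank"
chunk = "some prefix text the quick brown fox jumps over the lazy dog near the river bank and more"
print(overlap_ratio(article, chunk))  # → 1.0
```

In practice, a high overlap ratio for long n-grams is strong evidence that a document was present in a training corpus, since lengthy word sequences rarely repeat by chance.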

OpenAI’s response

On November 22, 2024, OpenAI filed a response in the case. In it, OpenAI’s lawyer denied that the company had destroyed any evidence, instead attributing the problem to a system misconfiguration requested by the publishers that led to a technical issue.

“Plaintiffs requested a reconfiguration of one of several machines that OpenAI provided to search training data sets. However, implementation of the change requested by Plaintiffs resulted in the removal of the folder structure and some file names on one hard drive—a drive that was intended to be used as a temporary cache for OpenAI data but was apparently also used by Plaintiffs to store some of their search results (apparently without any backups). In any case, there is no reason to believe that any files were actually lost, and plaintiffs can re-run the searches to recreate the files with just a couple of days of computer time,” the company said.

“Plaintiffs’ verification efforts began with them repeatedly running faulty code that overloaded and crashed the file system,” it added.


OpenAI’s lawyer further stated that the company first submitted training data for review in June, but the publishers delayed reviewing it until October.

“Once the case began, the plaintiffs caused a number of technical problems due to their own errors. As a result of Plaintiffs’ self-inflicted wounds, OpenAI was forced to invest enormous resources in support of Plaintiffs’ review, far beyond what would have been necessary,” it added.

The statement said the publishers want an order requiring OpenAI to respond to nearly 500 million requests for admission.

Willingness to cooperate

The statement said OpenAI is ready to collaborate with publishers. “The main obstacle here is not technical; it is the plaintiffs’ unwillingness to cooperate,” the company said in its response.

OpenAI said it had offered to run the searches on the publishers’ behalf, on the condition that they provide “clear and reasonable proposals.”

“OpenAI also offered to conduct at least some of the Plaintiffs’ searches for them and asked the Plaintiffs to put together a comprehensive proposal. Despite OpenAI’s support, Plaintiffs have returned to their ineffective ‘boil the ocean’ approach, demanding ever-improving hardware performance,” the statement said.

Background

In this case, OpenAI argues that using publicly available data, such as articles from the NYT and Daily News, to train its models constitutes fair use. According to OpenAI, “learning” from billions of examples does not require licensing or compensation for the data. It says this remains true even when models use the data for commercial purposes.

Moreover, OpenAI has entered into licensing agreements with a growing number of publishers, including Condé Nast, TIME, the Associated Press, and News Corp.
