The prevailing paradigm in AI development is that the more training data, the better. OpenAI’s GPT-2 model had a data set consisting of 40 gigabytes of text. GPT-3, on which ChatGPT is based, was trained on 570 GB of data. OpenAI has not disclosed the size of the data set for its latest model, GPT-4.
But that hunger for larger models is now coming back to bite the company. In the past few weeks, several Western data protection authorities have launched investigations into how OpenAI collects and processes the data that powers ChatGPT. They believe it has collected and used people’s personal information, such as names or email addresses, without their consent.
Italian authorities have banned ChatGPT as a precaution, while data regulators in France, Germany, Ireland and Canada are also investigating how the OpenAI system collects and uses data. The European Data Protection Board, the umbrella organization for data protection authorities, is setting up an EU-wide task force to coordinate investigations and enforcement around ChatGPT.
Italy has given OpenAI until April 30 to comply with the law. This means OpenAI must either ask people for permission to collect their data or prove it has a “legitimate interest” in collecting it. OpenAI will also have to explain to people how ChatGPT uses their data and give them the power to correct any errors about them that the chatbot spits out, to have their data deleted if they want, and to object to the program’s use of it.
If OpenAI cannot convince authorities that its data-use practices are legal, it could be banned in certain countries or even the entire European Union. The company could also face heavy fines and be forced to delete models and the data used to train them, says Alexis Leautier, an AI expert at France’s data protection agency CNIL.
Lilian Edwards, a professor of internet law at Newcastle University, says OpenAI’s violations are so clear that the case could end up in the Court of Justice of the European Union, the EU’s highest court. It may take years before we see answers to the questions raised by the Italian data regulator.
High-stakes game
The stakes couldn’t be higher for OpenAI. The European Union’s General Data Protection Regulation is the world’s strictest data protection regime and has been widely copied elsewhere. Regulators from Brazil to California will be paying close attention to what happens next, and the outcome could fundamentally change the way AI companies go about collecting data.
In addition to being more transparent about its data practices, OpenAI must demonstrate that it relies on one of two possible legal grounds for collecting the data used to train its algorithms: consent or “legitimate interest.”