As artificial intelligence (AI) reaches the peak of its popularity, researchers warn that the industry may be running out of training data, the fuel that drives powerful AI systems. A shortage could slow the development of AI models, particularly large language models, and might even alter the trajectory of the AI revolution. Given how much data there is on the internet, why would a potential shortage be a problem? And is there a way to address the risk?
Why is high-quality data important for AI?
Training powerful, accurate, high-quality AI algorithms requires enormous amounts of data. ChatGPT, for example, was trained on 570 gigabytes of text data, or roughly 300 billion words. Similarly, the Stable Diffusion algorithm, which underpins many image-generating AI apps such as Lensa and Midjourney, was trained on the LAION-5B dataset of 5.8 billion image-text pairs. An algorithm trained on insufficient data will produce inaccurate or low-quality outputs.
The quality of training data matters too. Low-quality data such as social media posts or blurry photographs is easy to obtain, but it is not sufficient for training high-performance AI models. Text scraped from social media platforms may be biased or prejudiced, or may contain misinformation or illegal content that the model then reproduces. For example, when Microsoft tried to train its AI chatbot on Twitter content, the bot learned to produce racist and misogynistic outputs.
This is why AI developers seek out high-quality content such as text from books, online articles, scientific papers, Wikipedia, and certain filtered web content. Google Assistant, for instance, was made more conversational by training it on 11,000 romance novels taken from the self-publishing site Smashwords.
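To give a sense of what "filtered web content" means in practice, here is a minimal sketch of a heuristic quality filter. The thresholds and rules are purely illustrative assumptions; real filtering pipelines combine many more signals, often including model-based quality classifiers.

```python
# Illustrative, hypothetical thresholds -- not from any real pipeline.
MIN_WORDS = 50             # very short pages are usually boilerplate
MAX_SYMBOL_RATIO = 0.10    # heavy symbol/markup debris suggests low quality
MAX_DUP_LINE_RATIO = 0.30  # heavily repeated lines suggest menus and footers

def looks_high_quality(text: str) -> bool:
    """Cheap heuristic check for a single web-scraped document."""
    words = text.split()
    if len(words) < MIN_WORDS:
        return False

    # Reject documents dominated by non-alphanumeric characters.
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    if symbols / len(text) > MAX_SYMBOL_RATIO:
        return False

    # Reject documents where many lines are exact duplicates.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if lines and 1 - len(set(lines)) / len(lines) > MAX_DUP_LINE_RATIO:
        return False

    return True

spam = "Click here!\nSign up now!\n" * 20
prose = "Large language models are trained on curated text corpora. " * 20
print(looks_high_quality(spam), looks_high_quality(prose))  # False True
```

Production systems also deduplicate across documents and score text with learned classifiers, but the principle is the same: discard cheap, noisy text before it ever reaches the model.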
Do we have enough data?
The AI industry has been training AI systems on ever-larger datasets, which is why we now have high-performance models such as ChatGPT and DALL-E 3. At the same time, research shows that the stock of online data is growing far more slowly than the datasets used to train AI.
In a paper published last year, a group of researchers predicted that if current AI training trends continue, we will run out of high-quality text data by 2026. They also estimated that low-quality language data will be exhausted sometime between 2030 and 2050, and low-quality image data between 2030 and 2060. AI could contribute up to US$15.7 trillion (A$24.1 trillion) to the global economy by 2030, according to accounting and consulting group PwC, but a shortage of usable data could slow its development.
Should we be worried?
While these points may worry some AI fans, the situation may not be as bad as it seems. There is much we don't know about how AI models will develop in the future, and there are several ways to address the risk of data shortages. One option is for AI developers to improve their algorithms so that they use the data they already have more efficiently.
In the coming years, they will likely be able to train high-performing AI systems with less data, and possibly less computing power, which would also help shrink AI's carbon footprint. Another option is to use AI itself to create synthetic data for training systems. In other words, developers could generate the data they need, tailored to suit their particular AI model.
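As a rough illustration of the synthetic-data idea, the sketch below generates labelled, review-style training examples from templates. In practice developers usually prompt an existing large model and then filter its output; the templates, labels, and file name here are entirely hypothetical.

```python
import json
import random

random.seed(0)  # reproducible toy output

# Hypothetical templates and labels, purely for illustration.
PRODUCTS = ["laptop", "camera", "headset", "keyboard"]
TEMPLATES = {
    "positive": [
        "I love this {p}, it exceeded my expectations.",
        "The {p} works flawlessly and arrived on time.",
    ],
    "negative": [
        "The {p} broke after two days, very disappointing.",
        "I regret buying this {p}; the quality is poor.",
    ],
}

def synth_examples(n: int) -> list[dict]:
    """Generate n labelled review-like sentences for a toy classifier."""
    examples = []
    for _ in range(n):
        label = random.choice(list(TEMPLATES))
        template = random.choice(TEMPLATES[label])
        examples.append({
            "text": template.format(p=random.choice(PRODUCTS)),
            "label": label,
        })
    return examples

# Write a small synthetic training set to disk in JSON Lines format.
with open("synthetic_reviews.jsonl", "w") as f:
    for ex in synth_examples(100):
        f.write(json.dumps(ex) + "\n")
```

One known caveat is that training heavily on model-generated text can degrade quality over successive generations, so synthetic data typically supplements rather than replaces human-created data.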
Many projects already use synthetic content, often sourced from data-generation services such as Mostly AI, and this approach is likely to become more common. Developers are also searching for content outside the free online space, such as material held by large publishers and in offline repositories. Think of the millions of texts published before the internet: if digitized, they could provide a new source of data for AI projects.
News Corp, one of the world's largest owners of news content (much of which sits behind a paywall), recently said it was negotiating content deals with AI developers. Such deals would force AI companies to pay for training data, whereas until now they have mostly scraped it from the internet for free.
Creators have protested the unauthorized use of their content to train AI models, and some have sued companies such as Microsoft, OpenAI, and Stability AI. Remunerating them for their work could help redress the power imbalance between creatives and AI companies.