AI companies collect data through web crawlers, which explore the internet and record content.
Now, a new study reveals that websites around the world are blocking these crawlers from accessing their content. OpenAI’s crawlers are restricted from nearly 26 percent of high-quality data sources, Google’s from 10 percent, and Meta’s from 4 percent.
Even freely available public data, such as Wikipedia, is expected to run out.
To counter this, AI companies are now paying millions to acquire training data. OpenAI has entered into deals with publications such as The Atlantic, Vox Media, and the Financial Times to use their content for AI training. Another possible solution is synthetic data, which I’ll discuss next.
Reference link 🔗 https://observer.com/2024/07/ai-training-data-crisis/
#ai #ainews #ai2024