The AI company OpenAI is said to have used its speech recognition tool Whisper to extract training data from YouTube videos.
The AI race is leading to a desperate hunt for the digital data needed to train models. The New York Times has published a report detailing how major AI players have sought to expand their access to data in ways that run up against copyright law. For example, OpenAI reportedly used its speech recognition tool Whisper to transcribe more than a million hours of YouTube videos and used this data to train the GPT-4 model.
Whisper
OpenAI faced a shortage of training data at the end of 2021 and addressed it by developing the speech recognition tool Whisper, which can transcribe the audio from YouTube videos. The resulting conversational text was then used to make the AI system smarter.
Although Google-owned YouTube prohibits the use of its videos for applications independent of the platform, the OpenAI team is said to have transcribed more than a million hours of YouTube videos, the New York Times reports. This data was reportedly used to train OpenAI’s GPT-4 model.
Legal measures
In an email to The Verge, OpenAI spokeswoman Lindsay Held says the company curates “unique” datasets for each of its models to “enhance their understanding of the world” and maintain its global research competitiveness.
Google spokesman Matt Bryant wrote to The Verge: “Both our robots.txt files and our terms of service prohibit the unauthorized scraping or downloading of YouTube content.” Bryant said Google takes “technical and legal measures” to prevent such unauthorized uses “if we have a clear legal or technical basis for it”.