OpenAI utilizes YouTube to “train” GPT-4

According to the New York Times, OpenAI transcribed more than a million hours of YouTube footage to train GPT-4.

This article looks at artificial intelligence and, specifically, at how OpenAI sourced the data used to train GPT-4. The Wall Street Journal reported earlier this week that AI companies are struggling to find high-quality training data. Today, The New York Times detailed some of the ways companies have dealt with the problem, including practices that fall into a legal gray area around AI and copyright.


The story begins with OpenAI, which, desperate for training data, reportedly developed its own audio transcription model, Whisper, and used it to transcribe more than a million hours of YouTube videos to train GPT-4, its most advanced large language model. According to The New York Times, the company knew the practice was legally questionable but believed it fell within fair use. The Times also reports that OpenAI president Greg Brockman was personally involved in collecting the videos.

Lindsay Held, an OpenAI spokesperson, told The Verge via email that the company curates “unique” datasets for each of its models to “further their understanding of the world” and maintain its global research competitiveness. Held added that the company uses “numerous sources, including publicly available data and non-public data partnerships,” and plans to produce its own synthetic data.

In an email to The Verge, Google spokesperson Matt Bryant said the company has “seen unconfirmed reports” of OpenAI's activity, adding that “both our robots.txt files and the Terms of Service prohibit scraping or unauthorized downloading of YouTube content,” echoing the company's terms of use. This week, YouTube CEO Neal Mohan made similar remarks about the possibility that OpenAI used YouTube to train Sora, its video generation model. Bryant said that Google employs “technical and legal measures” to prevent such unauthorized use “when we have a clear legal or technical basis for doing so.”

According to the Times' sources, Google also gathered transcripts from YouTube. Bryant said the company has trained its models “on some YouTube content, in accordance with our agreements with YouTube creators.” The Times also reports that Google's legal department asked the company's privacy team to adjust its policy language to expand what the company could do with consumer data, such as data from office tools like Google Docs. The new policy was reportedly released on July 1 to take advantage of the distraction of the Independence Day holiday weekend.

Meta has likewise run into limits on the availability of high-quality training data, and according to the Times, its AI team discussed its unauthorized use of copyrighted works while working to catch up with OpenAI. After going through nearly every English-language book, essay, poem, and news article it could find on the internet, the company apparently considered paying for book licenses or even buying a large publisher outright. Meta also appears to have been limited in how it could use consumer data by privacy-focused changes it made after the Cambridge Analytica scandal.

Google, OpenAI, and the broader AI industry are facing a rapidly shrinking supply of training data for their models, which improve as they absorb more data. The Journal reported earlier this week that companies may run out of new material by 2028.


According to the Journal, possible solutions include training models on “synthetic” data generated by the companies' own models, or so-called “curriculum learning,” which involves feeding models high-quality data in a deliberate order in the hope that they can form “smarter connections between concepts” with less information. Neither approach has been proven yet. The companies' other option is to use whatever they can find, with or without permission. Given the number of lawsuits filed over the last year or two, though, that approach is more than a little risky.
