OpenAI Transcribes One Million Hours of YouTube Videos to Teach GPT-4

By Consultants Review Team Monday, 08 April 2024

The Wall Street Journal revealed earlier this week that AI companies were having trouble gathering high-quality training data.

The New York Times has detailed some of the strategies companies are using to deal with this. Unsurprisingly, they involve doing things that fall into the hazy gray area of AI copyright law.

How was GPT-4 trained by OpenAI?

The story starts with OpenAI, which reportedly developed its Whisper audio transcription model out of desperation for training data, then overcame the shortage by transcribing more than a million hours of YouTube videos to train GPT-4, its most powerful large language model.

The New York Times reports that although the company was aware of the legal concerns, it believed the practice qualified as fair use. According to The Times, OpenAI president Greg Brockman was personally involved in collecting the videos that were used.

How did the company respond?

OpenAI spokesperson Lindsay Held told The Verge via email that the company curates "unique" datasets for every model to "help their understanding of the world" and keep its research globally competitive.

Held stated that the company is considering generating its own synthetic data and that it uses "many sources, including publicly available data and partnerships for non-public data."

Why did the company choose to use footage from YouTube?

The Times article claims that in 2021, the company ran out of useful data and, having exhausted other avenues, considered transcribing podcasts, audiobooks, and YouTube videos.

By then, it had already used data from Quizlet homework assignments, chess move databases, and GitHub computer code to train its models.