By Consultants Review Team
Tech companies are using questionable methods, frequently without the creators' knowledge, to feed their data-hungry artificial intelligence models. These methods involve sucking up books, websites, images, and social media posts. While most AI companies keep their training data sources a secret, a Proof News investigation revealed that some of the world's wealthiest AI companies have been using content from thousands of YouTube videos for AI training. They did so despite YouTube's policies prohibiting the unauthorized extraction of content from the platform.
The analysis revealed that major Silicon Valley players, such as Anthropic, Nvidia, Apple, and Salesforce, were using the subtitles from 173,536 YouTube videos taken from more than 48,000 channels.
Video transcripts from educational and online learning channels like Khan Academy, MIT, and Harvard are included in the YouTube Subtitles collection. Videos from the Wall Street Journal, NPR, BBC, The Late Show With Stephen Colbert, Last Week Tonight With John Oliver, and Jimmy Kimmel Live were also used to train AI. Proof News also found content from YouTube megastars, including MrBeast (289 million subscribers, two videos taken for training), Marques Brownlee (19 million subscribers, seven videos taken), Jacksepticeye (nearly 31 million subscribers, 377 videos taken), and PewDiePie (111 million subscribers, 337 videos taken). Some of the material used for AI training also promoted conspiracy theories such as the "flat-earth theory."
Dave Wiskus, the CEO of Nebula, a streaming service that is partially owned by its creators, some of whose work was taken from YouTube to train AI, called it "theft." Wiskus said that using creators' work without permission is "disrespectful," particularly in light of the possibility that studios will use "generative AI to replace as many of the artists along the way as they can." "Will artists be taken advantage of and harmed by this? Indeed, indeed," Wiskus said.
The makers of the dataset, EleutherAI, did not reply to inquiries about Proof's conclusions, including claims that videos were used without authorization. According to the organization's website, its main objective is to make AI development more accessible to people outside of Big Tech. In the past, it has done this by giving people "access to cutting-edge AI technologies by training and releasing models."
YouTube Subtitles contains only the plain text of the videos' subtitles; it frequently includes translations into other languages, such as Arabic, German, and Japanese. According to a research paper published by EleutherAI, the dataset is one component of a compilation that the nonprofit calls the Pile. In addition to YouTube, the developers of the Pile incorporated content from the English Wikipedia, the European Parliament, and a vast collection of emails sent by Enron Corporation employees that were made public as a result of a federal probe into the company.
Anyone on the internet with enough storage space and computing power can access most of the Pile's datasets. The dataset was used not just by Big Tech companies but also by academics and other developers. In their research papers and posts, Apple, Nvidia, and Salesforce (companies with market values in the hundreds of billions and trillions of dollars) describe how they trained AI models using the Pile. Records also show that Apple used the Pile to train OpenELM, a well-known model released in April, just weeks before the company announced plans to add new AI capabilities to iPhones and MacBooks. Disclosures from Bloomberg and Databricks show that those companies also trained models on the Pile.
Anthropic, a well-known AI developer that received a $4 billion investment from Amazon and promotes its emphasis on "AI safety," also did so. In a statement, Anthropic spokesperson Jennifer Martinez confirmed that the Pile is used in Anthropic's generative AI assistant, Claude. "The Pile includes a very small subset of YouTube subtitles," Martinez added. "Direct usage of YouTube's platform is covered under its terms; this is not the same as using the Pile dataset. Regarding possible infringements upon YouTube's terms of service, we would have to direct you to the creators of Pile."