Tech giants reportedly used YouTube videos to train AI, violating platform's terms

A recent investigation by Proof News, in collaboration with Wired, revealed that Apple, Anthropic, Nvidia, and Salesforce are among the companies that used a massive dataset containing subtitles from over 170,000 YouTube videos to train their AI systems.

This dataset, known as "YouTube Subtitles," was created without permission and includes content from more than 48,000 YouTube channels.

The dataset includes videos from popular YouTube creators such as MrBeast and Marques Brownlee, as well as major news outlets such as ABC News, the BBC, and The New York Times. Although YouTube's terms of service prohibit the use of its content to train artificial intelligence, the investigation found that these companies trained their systems on the scraped subtitles anyway.

Marques Brownlee, also known as MKBHD, commented on the issue on social media, noting that Apple sourced data from several companies, one of which scraped transcripts from YouTube videos, including his own. He added that this is likely to be an evolving problem for a long time.

YouTube has yet to respond to inquiries about the alleged violation of its terms of service. Meanwhile, Proof News has released an interactive tool that lets users check whether specific content or creators appear in the dataset.

The "YouTube Subtitles" dataset is part of a larger open-source collection called The Pile, curated by the nonprofit EleutherAI. This collection also includes books, Wikipedia articles, and other materials. Last year, the Books3 dataset within The Pile led to lawsuits from authors whose works were used to train AI systems without their consent.