Apple, NVIDIA and Anthropic reportedly used YouTube transcripts without permission to train AI models


Some of the world’s largest tech companies trained their AI models on a dataset that included transcripts of more than 173,000 YouTube videos without permission, a new investigation from Proof News has found. The dataset, which was created by a nonprofit company called EleutherAI, contains transcripts of YouTube videos from more than 48,000 channels and was used by Apple, NVIDIA and Anthropic among other companies. The findings of the investigation spotlight AI’s uncomfortable truth: the technology is largely built on the backs of data siphoned from creators without their consent or compensation.

The dataset doesn’t include any videos or images from YouTube, but contains video transcripts from the platform’s biggest creators including Marques Brownlee and MrBeast, as well as large news publishers like The New York Times, the BBC, and ABC News. Subtitles from videos belonging to Tech Reader are also part of the dataset.

“Apple has sourced data for their AI from several companies,” Brownlee posted on X. “One of them scraped tons of data/transcripts from YouTube videos, including mine,” he added. “This is going to be an evolving problem for a long time.”

A Google spokesperson told Tech Reader that previous comments by YouTube CEO Neal Mohan, who said that companies using YouTube’s data to train AI models would violate the platform’s terms of service, still stand. Apple, NVIDIA, Anthropic and EleutherAI did not respond to a request for comment from Tech Reader.

So far, AI companies haven’t been transparent about the data used to train their models. Earlier this month, artists and photographers criticized Apple for failing to reveal the source of training data for Apple Intelligence, the company’s own spin on generative AI coming to millions of Apple devices this year.

YouTube, the world’s largest repository of videos, is a goldmine of not only transcripts but also audio, video, and images, making it a particularly attractive source of data for training AI models. Earlier this year, OpenAI’s chief technology officer, Mira Murati, evaded questions from The Wall Street Journal about whether the company used YouTube videos to train Sora, OpenAI’s upcoming AI video generation tool. “I’m not going to go into the details of the data that was used, but it was publicly available or licensed data,” Murati said at the time. Alphabet CEO Sundar Pichai has also said that companies using data from YouTube to train their AI models would violate the platform’s terms of service.

If you want to see whether subtitles from your YouTube videos or from your favorite channels are part of the dataset, head over to Proof News’ lookup tool.

Update, July 16, 2024, 3:17 PM PT: This story has been updated to include a statement from Google.




