Nvidia’s new AI audio model can synthesize sounds that have never existed

At this point, anyone who has been following AI research is long familiar with generative models that can synthesize speech or melodic music from nothing but text prompting. Nvidia’s newly revealed “Fugatto” model looks to go a step further, using new synthetic training methods and inference-level combination techniques to “transform any mix of music, voices, and sounds,” including the synthesis of sounds that have never existed.

While Fugatto isn’t available for public testing yet, a sample-filled website showcases how the model can be used to dial a number of distinct audio traits and descriptions up or down, resulting in everything from the sound of saxophones barking to people speaking underwater to ambulance sirens singing in a kind of choir. While the results can be a bit hit or miss, the vast array of capabilities on display helps support Nvidia’s description of Fugatto as “a Swiss Army knife for sound.”

You’re only as good as your data

In an explanatory research paper, over a dozen Nvidia researchers explain the difficulty in crafting a training dataset that can “reveal meaningful relationships between audio and language.” While standard language models can often infer how to handle various instructions from the text-based data itself, it can be hard to generalize descriptions and traits from audio without more explicit guidance.

To that end, the researchers start by using an LLM to generate a Python script that can create a large number of template-based and free-form instructions describing different audio “personas” (e.g., “standard, young-crowd, thirty-somethings, professional”). They then generate a set of both absolute (e.g., “synthesize a happy voice”) and relative (e.g., “increase the happiness of this voice”) instructions that can be applied to those personas.
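The paper describes this data-generation step rather than publishing the generated script itself, but a minimal sketch of the idea might look something like the following. The persona names, trait names, and template strings here are illustrative assumptions standing in for whatever Nvidia's LLM actually produced:

```python
import random

# Illustrative personas and traits; the actual lists used in Fugatto's
# data-generation script are not published, so these are assumptions.
PERSONAS = ["standard", "young-crowd", "thirty-somethings", "professional"]
TRAITS = ["happiness", "speech rate", "reverb"]

# Absolute instructions describe a target sound outright; relative
# instructions describe a change to apply to an existing clip.
ABSOLUTE_TEMPLATES = [
    "synthesize a {persona} voice with high {trait}",
    "generate speech in a {persona} style, emphasizing {trait}",
]
RELATIVE_TEMPLATES = [
    "increase the {trait} of this {persona} voice",
    "slightly reduce the {trait} of this recording",
]

def make_instructions(n: int, seed: int = 0) -> list[str]:
    """Sample a mix of absolute and relative text instructions."""
    rng = random.Random(seed)
    instructions = []
    for _ in range(n):
        template = rng.choice(ABSOLUTE_TEMPLATES + RELATIVE_TEMPLATES)
        instructions.append(template.format(
            persona=rng.choice(PERSONAS),
            trait=rng.choice(TRAITS),
        ))
    return instructions

if __name__ == "__main__":
    for line in make_instructions(5):
        print(line)
```

The point of mixing template-based and free-form, absolute and relative phrasings is to give the model many different ways of expressing the same underlying audio manipulation, rather than teaching it a single rigid command syntax.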

The wide array of open source audio datasets used as the basis for Fugatto generally don’t have these kinds of trait measurements embedded in them by default. But the researchers make use of existing audio understanding models to create “synthetic captions” for their training clips based on their prompts, creating natural language descriptions that can automatically quantify traits such as gender, emotion, and speech quality. Audio processing tools are also used to describe and quantify training clips on a more acoustic level (e.g., “fundamental frequency variance” or “reverb”).
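Nvidia doesn’t name the specific tools involved, but as a rough illustration of the kind of acoustic-level measurement being described, a small script using the open source librosa library (an assumption on our part, not a tool cited in the paper) could caption a clip’s pitch variability and loudness like so:

```python
import librosa
import numpy as np

def describe_clip(path: str) -> str:
    """Produce a rough acoustic caption for one clip.

    Reports fundamental frequency variance and mean RMS energy using
    librosa's pYIN pitch tracker. This is an illustrative stand-in for
    whatever audio processing tools Nvidia actually used.
    """
    y, sr = librosa.load(path, sr=None)

    # Frame-level fundamental frequency estimate (NaN on unvoiced frames).
    f0, _, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
    )
    f0_variance = float(np.nanvar(f0))

    # Crude loudness proxy; a real captioning pipeline would also estimate
    # properties like reverb and signal-to-noise ratio.
    rms = float(np.mean(librosa.feature.rms(y=y)))

    return (
        f"fundamental frequency variance: {f0_variance:.1f} Hz^2, "
        f"mean RMS energy: {rms:.4f}"
    )

if __name__ == "__main__":
    # Hypothetical example file; replace with a real audio clip.
    print(describe_clip("example.wav"))
```

Numeric descriptors like these can then be folded into the natural language captions, giving the model explicit, quantified guidance about what “more reverb” or “a flatter voice” actually sounds like in its training data.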


