Nvidia’s new Fugatto AI model creates & manipulates audio

I’m a video editor. I edit most of the videos on both Tech Critter and Nasi Lemak Tech. One of my biggest gripes is dealing with audio. If I want sound effects, I have to scour the internet and make sure they’re cleared for commercial use before integrating them into the video. Music is much more difficult, and let’s not talk about bespoke sound effects or music just yet. So, when I saw the news that Nvidia has announced a brand new AI model that deals exclusively with audio, I was ecstatic!

This new model is called Fugatto, and as you would expect, it can generate audio. Give it a text description, and the AI model will spit out audio based on what we have described. But what interests me the most is that this AI model can also manipulate existing audio. If we can upload our own audio to the model, I’m picturing a super smart tool for removing background noise and wind noise.

It can even change the intonation of spoken words to match the emotion that we want. The example shown in the video above is a very good demonstration of what it can do. Perhaps it could even learn our own voice and work as a text-to-speech engine, so that we can fix any misspoken words in a video. That would be really nice to have.

As of now, there are no download links yet. From what I can see, the entire UI is a locally hosted webpage, similar to something like Stable Diffusion A1111. If it really is hosted and run locally, I’m really excited for the future of this Fugatto AI model.

To learn more about it, click here to read it on Nvidia’s official blog.