Audio And Music Generation

Full‑Song / Vocal‑Capable Commercial Models
These generate complete songs with vocals, structure, and production.
- Eleven Music — ElevenLabs’ music model with highly lifelike multilingual vocals; strong API orientation.
- Mureka V8 — 2026 “Supermodel” with strong structure reasoning (MusiCoT) and competitive full‑song quality.
- Sonauto — Extremely fast generation (~15s), free & unlimited for users; developer API available.
- Beatoven Maestro — Licensed-data, mood‑based generation for commercial-safe output.
- Loudly VEGA‑2 — Royalty‑free, content‑creator‑oriented generator.
- AIVA — Classical/cinematic composition engine with MIDI workflows.
- ProducerAI — “Music agent” layer built on top of frontier models.
- Boomy
- Soundraw
- Mubert
- Endlesss (loop‑based, collaborative)
- Amper Music (legacy, now absorbed/licensed)
- Jukebox‑based commercial derivatives (various)
- Suno (v3, v4)
- Udio
Open‑Source / Research‑Grade Music Models
These are self‑hostable or research‑oriented.
- Stable Audio (1.x, 2.x)
- Stable Audio Open — Open‑weights version of Stability’s text‑to‑audio system.
- AudioCraft (Meta: MusicGen, AudioGen, EnCodec)
- AudioLDM/AudioLDM2 — Latent diffusion for text‑to‑audio/music.
- Magenta — Google’s long‑running music ML project (MelodyRNN, MusicVAE, etc.).
- Jukebox — OpenAI’s pioneering hierarchical VQ‑VAE music generator.
- MusicLM (unofficial) — Community implementation of Google’s MusicLM.
- Riffusion / Riffusion OSS — Spectrogram‑diffusion music generator.
- Mustango — Controllable text‑to‑music model.
- DiffRhythm / DiffRhythm 2 — DiffRhythm 2 is a 2026 open‑source model; strong but still behind commercial systems.
- MuseNet (legacy, not generally available now)
- JEN‑1 / other academic text‑to‑music models
Background‑Music / Content‑Creator Platforms
These focus on safe licensing and mood‑based generation.
- Mubert — API‑first generative music for apps and platforms.
- Soundraw — Customizable music for video creators.
- Boomy — Consumer‑friendly quick‑generation tool.
Sound‑Effects / Audio‑to‑Audio Models
Text‑to‑audio: Models that generate audio from text, usually sound effects, ambience, environmental sounds, and other non‑musical audio.
Audio‑to‑audio: Models that transform existing audio (inpainting, style transfer, editing).
Most modern SFX models do both, so the categories overlap.
- AudioGen - Meta’s text‑to‑audio model for SFX, ambience, and general audio.
- AudioLDM2‑SFX — Diffusion‑based SFX generator; strong for environmental and synthetic sounds.
- Stable Audio 2.5 SFX — Stability AI’s SFX‑capable diffusion model; supports inpainting and audio‑to‑audio.
- Stable Audio Open — Open‑weights general audio generator (not music‑specific).
- GANSynth — Legacy timbre‑focused GAN model; historically important.
- Beatoven SFX — Commercial SFX generator with licensing‑safe output.
Voice / Speech Models
These models cover speech synthesis, speech editing, and voice cloning; speech‑editing models can modify specific words in an existing recording while preserving the surrounding audio. They are not strictly “music models” but are essential for vocals, singing, and voice cloning.
- ElevenLabs Voice Models — Industry-leading speech synthesis; integrated into Eleven Music.
- Google Lyria 3 Pro — High‑fidelity music & voice model for API products.
- VoiceCraft
- Voicebox (Meta)
- VALL‑E / VALL‑E X (Microsoft research)
- Neural Codec Language Models for speech (various)
- Coqui TTS (open‑source) - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
- Coqui XTTS-v2 (available on Hugging Face)
- Bark TTS
- Mozilla TTS
- ChatTTS
- MeloTTS
Singing / vocal synthesis
Audio Codecs & Tokenizers
These are the building blocks for modern audio LLMs.
The discrete tokens produced by neural codecs are what audio language models operate on, just as text tokens are the unit for LLMs.
- Opus - open, royalty-free, highly versatile audio codec
- Enhanced Voice Services (EVS) - a superwideband speech audio coding standard that was developed for VoLTE and VoNR.
- EnCodec — Meta’s neural audio codec, which compresses audio into discrete tokens at very low bitrates while preserving quality.
- DAC - Descript Audio Codec (.dac), a high-fidelity general neural audio codec, introduced in the paper “High-Fidelity Audio Compression with Improved RVQGAN”.
- Lyra - a neural audio codec for low-bitrate speech
- SoundStream — the first neural network codec to work on speech and music, while being able to run in real-time on a smartphone CPU.
- Vocos — Neural vocoder used in several open‑source pipelines.
- MusicGen Tokenizer Variants — Multiple bitrate/token‑rate configurations.
- HiFi‑GAN - Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
- WaveNet - introduced in 2016, one of the first AI models to generate natural-sounding speech.
- WaveRNN - DeepMind's vocoder model from the paper “Efficient Neural Audio Synthesis”; a PyTorch implementation (WaveRNN Vocoder + TTS) is available.
- MelGAN - a generative adversarial network (GAN) model that generates audio from mel spectrograms.
- BigVGAN - a Universal Neural Vocoder (PyTorch implementation available)
- VoiceCraft - Zero-Shot Speech Editing and Text-to-Speech in the Wild
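
As a rough illustration of how neural codecs such as EnCodec, DAC, and SoundStream turn audio into discrete tokens at low bitrates, the sketch below shows residual vector quantization (RVQ) on toy data, plus the arithmetic that links token rate to bitrate. It is not any library's actual implementation. The function names (codec_bitrate_bps, make_toy_codebooks, rvq_encode, rvq_decode) are hypothetical, the codebooks are random rather than learned, and a zero entry is added to each codebook purely so that the toy reconstruction error can never grow from one stage to the next.

```python
import math
from typing import Sequence

import numpy as np


def codec_bitrate_bps(frame_rate_hz: float, n_codebooks: int, codebook_size: int) -> float:
    """Bits per second of a token stream: frames/s * tokens/frame * bits/token."""
    return frame_rate_hz * n_codebooks * math.log2(codebook_size)


def make_toy_codebooks(n_codebooks: int, codebook_size: int, dim: int,
                       rng: np.random.Generator) -> list:
    """Random stand-ins for learned codebooks (assumption: real codecs train these)."""
    codebooks = []
    for stage in range(n_codebooks):
        # Later stages encode smaller residuals, so shrink their scale.
        cb = rng.normal(scale=0.5 ** stage, size=(codebook_size, dim))
        cb[0] = 0.0  # toy choice: a zero entry means a stage can never add error
        codebooks.append(cb)
    return codebooks


def rvq_encode(frames: np.ndarray, codebooks: Sequence[np.ndarray]) -> np.ndarray:
    """Quantize each frame with a cascade of codebooks.

    frames:    (n_frames, dim) latent vectors, e.g. the output of a codec encoder.
    codebooks: one (codebook_size, dim) array per quantizer stage.
    Returns integer tokens of shape (n_frames, n_codebooks).
    """
    residual = frames.astype(float).copy()
    tokens = np.empty((frames.shape[0], len(codebooks)), dtype=np.int64)
    for stage, codebook in enumerate(codebooks):
        # Squared distance from every residual to every codebook entry.
        dists = ((residual[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        tokens[:, stage] = dists.argmin(axis=1)
        # The next stage only has to encode what this stage missed.
        residual = residual - codebook[tokens[:, stage]]
    return tokens


def rvq_decode(tokens: np.ndarray, codebooks: Sequence[np.ndarray]) -> np.ndarray:
    """Reconstruct frames by summing the selected entry from each stage."""
    out = np.zeros((tokens.shape[0], codebooks[0].shape[1]))
    for stage, codebook in enumerate(codebooks):
        out += codebook[tokens[:, stage]]
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    codebooks = make_toy_codebooks(n_codebooks=4, codebook_size=16, dim=8, rng=rng)
    frames = rng.normal(size=(20, 8))
    tokens = rvq_encode(frames, codebooks)
    for k in range(1, 5):
        err = np.linalg.norm(frames - rvq_decode(tokens[:, :k], codebooks[:k]))
        print(f"stages used: {k}, reconstruction error: {err:.3f}")
```

With figures chosen to resemble a 24 kHz EnCodec-style configuration (75 frames per second, 8 codebooks of 1024 entries each; treat these as illustrative), codec_bitrate_bps(75, 8, 1024) works out to 6000 bits per second, or 6 kbps. Dropping codebooks lowers the bitrate at the cost of fidelity, which is one reason tokenizers are offered in multiple bitrate and token-rate variants.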
Classical / symbolic / MIDI‑focused models
Platforms bundling multiple models or agents
References

(last updated: 8/June/2026)