Audio And Music Generation

  1. Full‑Song / Vocal‑Capable Commercial Models
    These generate complete songs with vocals, structure, and production.
    • Eleven Music — ElevenLabs’ music model with highly lifelike multilingual vocals; strong API orientation.
    • Mureka V8 — 2026 “Supermodel” with strong structure reasoning (MusiCoT) and competitive full‑song quality.
    • Sonauto — Extremely fast generation (~15s), free & unlimited for users; developer API available.
    • Beatoven Maestro — Licensed-data, mood‑based generation for commercial-safe output.
    • Loudly VEGA‑2 — Royalty‑free, content‑creator‑oriented generator.
    • AIVA — Classical/cinematic composition engine with MIDI workflows.
    • ProducerAI — “Music agent” layer built on top of frontier models.
    • Boomy
    • Soundraw
    • Mubert
    • Endlesss (loop‑based, collaborative)
    • Amper Music (legacy, now absorbed/licensed)
    • Jukebox‑based commercial derivatives (various)
    • Suno (v3, v4)
    • Udio

  2. Open‑Source / Research‑Grade Music Models
    These are self‑hostable or research‑oriented.
    • Stable Audio (1.x, 2.x)
    • Stable Audio Open — Open‑weights version of Stability’s text‑to‑audio system.
    • AudioCraft (Meta: MusicGen, AudioGen, EnCodec)
    • AudioLDM/AudioLDM2 — Latent diffusion for text‑to‑audio/music.
    • Magenta — Google’s long‑running music ML project (MelodyRNN, MusicVAE, etc.).
    • Jukebox — OpenAI’s pioneering hierarchical VQ‑VAE music generator.
    • MusicLM (unofficial) — Community implementations of Google’s MusicLM.
    • Riffusion / Riffusion OSS — Spectrogram‑diffusion music generator.
    • Mustango — Controllable text‑to‑music model.
    • DiffRhythm / DiffRhythm 2 — Open‑source models; DiffRhythm 2 (2026) is strong but still behind commercial systems.
    • MuseNet (legacy, not generally available now)
    • JEN‑1 / other academic text‑to‑music models

  3. Background‑Music / Content‑Creator Platforms
    These focus on safe licensing and mood‑based generation.
    • Mubert — API‑first generative music for apps and platforms.
    • Soundraw — Customizable music for video creators.
    • Boomy — Consumer‑friendly quick‑generation tool.

  4. Sound‑Effects / Audio‑to‑Audio Models
    Text‑to‑audio: Models that generate audio from text, usually sound effects, ambience, environmental sounds, and other non‑musical audio.
    Audio‑to‑audio: Models that transform existing audio (inpainting, style transfer, editing).
    Most modern SFX models do both, so the categories overlap.
    • AudioGen - Meta’s text‑to‑audio model for SFX, ambience, and general audio.
    • AudioLDM2‑SFX — Diffusion‑based SFX generator; strong for environmental and synthetic sounds.
    • Stable Audio 2.5 SFX — Stability AI’s SFX‑capable diffusion model; supports inpainting and audio‑to‑audio.
    • Stable Audio Open — Open‑weights general audio generator (not music‑specific).
    • GANSynth — Legacy timbre‑focused GAN model; historically important.
    • Beatoven SFX — Commercial SFX generator with licensing‑safe output.

  5. Voice / Speech Models
    Speech editing and voice cloning models; some can modify specific words in an existing recording while preserving the surrounding audio. These are not strictly “music models” but are essential for vocals, singing, and voice cloning.

  6. Singing / Vocal Synthesis

  7. Audio Codecs & Tokenizers
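    These are the building blocks for modern audio LLMs.
    Neural codecs compress audio into discrete tokens, and these tokens are what audio language models operate on, just as text tokens are the unit for text LLMs.
    The list also covers conventional codecs (Opus, EVS) and neural vocoders, which turn intermediate representations such as mel spectrograms back into waveforms.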
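    The token analogy above can be made concrete with a little arithmetic. The sketch below assumes a codec that emits a fixed number of frames per second, each frame carrying one index per codebook; the function name and the example figures (75 frames/s, 8 codebooks, 1024 entries each) are illustrative assumptions, not values taken from any entry in this list.

```python
import math


def codec_token_stats(frame_rate_hz, n_codebooks, codebook_size):
    """Return (tokens_per_second, bits_per_second) for a multi-codebook codec.

    Assumes one token per codebook per frame and a fixed-size codebook,
    so each token carries log2(codebook_size) bits.
    """
    bits_per_token = math.log2(codebook_size)
    tokens_per_second = frame_rate_hz * n_codebooks
    return tokens_per_second, tokens_per_second * bits_per_token


# Illustrative (assumed) configuration: 75 frames/s, 8 codebooks, 1024 entries.
tokens, bps = codec_token_stats(75, 8, 1024)
print(tokens, bps)  # 600 6000.0
```

    Under these assumptions, one second of audio becomes 600 tokens at about 6 kbit/s, which gives a feel for the sequence lengths an audio language model has to handle.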
    • Opus - An open, royalty-free, highly versatile audio codec.
    • Enhanced Voice Services (EVS) - A superwideband speech coding standard developed for VoLTE and VoNR.
    • EnCodec — A neural audio codec that compresses audio into discrete tokens at very low bitrates while preserving quality.
    • DAC - Descript Audio Codec (.dac), a high-fidelity general-purpose neural audio codec introduced in the paper “High-Fidelity Audio Compression with Improved RVQGAN”.
    • Lyra - A neural audio codec for low-bitrate speech.
    • SoundStream — the first neural network codec to work on speech and music, while being able to run in real-time on a smartphone CPU.
    • Vocos — Neural vocoder used in several open‑source pipelines.
    • MusicGen Tokenizer Variants — Multiple bitrate/token‑rate configurations.
    • HiFi‑GAN - A GAN-based vocoder from the paper “HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis”.
    • WaveNet - Introduced in 2016; one of the first AI models to generate natural-sounding speech.
    • WaveRNN - DeepMind’s vocoder model from the paper “Efficient Neural Audio Synthesis”; a PyTorch implementation (WaveRNN Vocoder + TTS) is available.
    • MelGAN - A generative adversarial network (GAN) that generates audio from mel spectrograms.
    • BigVGAN - A universal neural vocoder; a PyTorch implementation is available.
    • VoiceCraft - Zero-shot speech editing and text-to-speech in the wild.
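
    Several of the neural codecs listed above produce their discrete tokens through residual vector quantization (the "RVQ" in RVQGAN). The following is a toy sketch of that idea in plain Python, assuming tiny hand-written codebooks and a two-dimensional frame; the function names and the CODEBOOKS values are hypothetical, and real codecs use learned encoders, decoders, and codebooks rather than anything this simple.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes the previous residual."""
    residual = list(vec)
    tokens = []
    for cb in codebooks:
        idx = nearest(cb, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, cb[idx])]
    return tokens


def rvq_decode(tokens, codebooks):
    """Reconstruct a frame by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, cb in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, cb[idx])]
    return out


# Hypothetical codebooks: a coarse first stage, then a finer second stage.
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0], [1.0, -1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, -0.25]],
]

frame = [1.2, 0.9]
tokens = rvq_encode(frame, CODEBOOKS)
print(tokens)                         # [1, 1]
print(rvq_decode(tokens, CODEBOOKS))  # [1.25, 1.0]
```

    Each extra stage refines the approximation left by the previous one, which is why such codecs can trade bitrate for quality simply by keeping more or fewer codebooks per frame.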

  8. Classical / Symbolic / MIDI‑Focused Models

  9. Platforms Bundling Multiple Models or Agents

  10. References
(last updated: 8/June/2026)