ZeroTwo’s audio Studio provides access to multiple AI audio generation models, each optimized for different types of audio output. This page explains the model types available and how to choose between them.
AI audio generation technology is evolving rapidly. New models are added to ZeroTwo regularly. Check the model dropdown in the audio Studio for the current full list.

Model categories

Audio generation models in ZeroTwo fall into three main categories:

Music generation models

Specialized for generating original music from text descriptions. These models understand genre, mood, instrumentation, tempo, and musical structure. Best for:
  • Background music for videos, presentations, and apps
  • Ambient soundscapes and atmospheric audio
  • Jingles, intros, and branded audio pieces
  • Specific genre requests (jazz, classical, electronic, etc.)
Prompt approach: Describe genre, mood, instruments, tempo, and duration. Example: "Upbeat electronic background music, 120 BPM, synthesizer melody, suitable for a tech product demo, 60 seconds"

Voice synthesis / text-to-speech models

Generate spoken audio from text input. These models produce natural-sounding narration in various voices and styles. Best for:
  • AI narration for videos and presentations
  • Podcast-style spoken content
  • Accessibility audio (screen reader-style narration)
  • Character voices for creative projects
Prompt approach: Provide the text to be spoken and describe the voice characteristics — tone, pace, gender, accent, emotional register. Example: "Narrate the following in a warm, professional female voice at a moderate pace: [text]"

Sound effects models

Generate specific, discrete audio events — clicks, chimes, environment sounds, and other effects. Best for:
  • UI sounds and notification tones
  • Environmental and ambient effects
  • Production sound design
  • Game audio assets
Prompt approach: Describe the specific sound event as precisely as possible. Example: "Two knocks on a wooden door, natural reverb, interior setting"

Choosing the right model

| Use case | Model type to choose |
|---|---|
| Background music for a video | Music generation |
| Voiceover narration | Voice synthesis / TTS |
| App notification sound | Sound effects |
| Ambient environment audio | Music generation or sound effects |
| Podcast intro | Music generation |
| AI-read article | Voice synthesis / TTS |

For music, describing the genre and mood is the most important part of the prompt. For voice, the most important elements are the text content and the voice tone/style description. For sound effects, precision about the specific sound event produces the best results.
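
The use-case guidance above can be expressed as a simple lookup. The sketch below is illustrative only: the dictionary and the function name `suggest_model_type` are hypothetical and are not part of any ZeroTwo API.

```python
# Hypothetical lookup mirroring the use-case guidance; not a ZeroTwo API.
MODEL_TYPE_BY_USE_CASE = {
    "background music for a video": ["music generation"],
    "voiceover narration": ["voice synthesis / TTS"],
    "app notification sound": ["sound effects"],
    "ambient environment audio": ["music generation", "sound effects"],
    "podcast intro": ["music generation"],
    "ai-read article": ["voice synthesis / TTS"],
}


def suggest_model_type(use_case: str) -> list[str]:
    """Return the suggested model types for a use case.

    Matching is case-insensitive; unknown use cases return an empty list.
    """
    return MODEL_TYPE_BY_USE_CASE.get(use_case.strip().lower(), [])


print(suggest_model_type("Ambient environment audio"))
```

When a use case maps to more than one model type, as ambient audio does, try the same idea in each and compare the results.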

Output formats

| Format | Best for |
|---|---|
| MP3 | Web sharing, social media, general use |
| WAV | Professional production, lossless quality, video editing |

The download formats offered depend on the model selected. MP3 is available from all models; WAV is available only from higher-quality models.
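
As a rough sketch of that decision, the helper below prefers WAV for production work when the selected model offers it and falls back to MP3 otherwise. The function name `pick_format`, the purpose strings, and the idea of passing the model's formats as a set are all assumptions made for illustration; in practice, the download options shown for the selected model tell you what is available.

```python
def pick_format(purpose: str, model_formats: set[str]) -> str:
    """Pick a download format for a purpose (hypothetical helper).

    WAV is chosen for production-oriented purposes when the model
    offers it; MP3 is the fallback because every model provides it.
    """
    production_purposes = {"professional production", "video editing"}
    if purpose.strip().lower() in production_purposes and "WAV" in model_formats:
        return "WAV"
    return "MP3"


print(pick_format("video editing", {"MP3", "WAV"}))
print(pick_format("video editing", {"MP3"}))
```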

Prompting by model type

Each audio model type responds to different prompt elements:

Prompting music generation models

The most important elements for music prompts are genre, mood, and instrumentation:

| Prompt element | Examples |
|---|---|
| Genre | jazz, classical, electronic, ambient, folk, hip-hop, cinematic |
| Mood | uplifting, tense, melancholic, energetic, peaceful, mysterious |
| Instruments | piano, acoustic guitar, orchestral strings, synthesizer, drums, bass |
| Tempo | 120 BPM, slow and deliberate, fast-paced, moderate tempo |
| Duration | 30 seconds, 60 seconds, 2 minutes |
| Purpose | background music for a product video, podcast intro, game menu music |

Strong music prompt: Cinematic orchestral piece with rising strings and dramatic percussion, building tension over 30 seconds, suitable for a movie trailer
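
If you write many music prompts, it can help to assemble them from the elements listed above so that none is forgotten. The following is a minimal sketch; `build_music_prompt` and its parameters are hypothetical names, and the output is plain text to paste into the audio Studio prompt field.

```python
def build_music_prompt(genre, mood, instruments=(), tempo=None,
                       purpose=None, duration=None):
    """Join music prompt elements into one comma-separated prompt.

    Genre and mood are required because they matter most; the other
    elements are appended only when provided.
    """
    parts = [f"{mood.capitalize()} {genre} music"]
    parts.extend(instruments)
    for optional in (tempo, purpose, duration):
        if optional:
            parts.append(optional)
    return ", ".join(parts)


print(build_music_prompt(
    "electronic", "energetic",
    instruments=["synthesizer", "drums"],
    tempo="120 BPM",
    purpose="background music for a product video",
    duration="60 seconds",
))
```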

Prompting voice synthesis models

Voice synthesis prompts focus on the text to be spoken and the voice characteristics:

| Prompt element | Examples |
|---|---|
| Voice characteristics | warm and friendly, authoritative and professional, energetic, calm and soothing |
| Gender / age | male voice, female voice, neutral, mature, young |
| Pace | slow and deliberate, conversational pace, brisk and confident |
| Accent / style | American English, British accent, news anchor style |

Strong voice prompt: Read the following in a warm, professional female voice at a conversational pace, with natural pauses: [your text here]
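
A voice prompt has two parts: the voice description and the text to be read. The sketch below keeps them separate and rejects empty text. `build_voice_prompt` and its parameters are hypothetical; `pace` is assumed to be an adjective such as "conversational", without the word "pace".

```python
def build_voice_prompt(text, characteristics, gender=None,
                       pace=None, accent=None):
    """Compose a voice synthesis prompt from a voice description and text."""
    text = text.strip()
    if not text:
        raise ValueError("text to be spoken must not be empty")

    # e.g. "warm, professional female voice"
    voice = " ".join(p for p in (characteristics, gender, "voice") if p)
    description = f"a {voice}"
    if pace:
        description += f" at a {pace} pace"
    if accent:
        description += f", {accent}"
    return f"Read the following in {description}: {text}"


print(build_voice_prompt(
    "Welcome to the show.",
    "warm, professional",
    gender="female",
    pace="conversational",
))
```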

Prompting sound effects models

Sound effect prompts should be as specific as possible about the exact sound event:

| Prompt element | Examples |
|---|---|
| Sound event | door knock, coin drop, camera click, notification chime |
| Material / character | wooden, metallic, glass, soft, sharp |
| Environment | interior, outdoor, reverberant space, dry studio |
| Duration | brief 1-second burst, 3-second sustained |

Strong sound effect prompt: A single metallic coin dropped onto a hardwood floor, brief ring and roll, indoor environment with slight room reverb
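
For batches of related effects, such as a set of UI sounds, a small record type keeps the prompt elements consistent from one sound to the next. This is only a sketch: the `SoundEffectSpec` class and its field names are hypothetical, chosen to mirror the prompt elements described for sound effects.

```python
from dataclasses import dataclass


@dataclass
class SoundEffectSpec:
    """Hypothetical container for the elements of a sound effect prompt."""
    event: str
    material: str = ""
    environment: str = ""
    duration: str = ""

    def to_prompt(self) -> str:
        # Material describes the event, so it goes directly in front of it.
        head = f"{self.material} {self.event}".strip()
        parts = [head, self.environment, self.duration]
        return ", ".join(p for p in parts if p)


spec = SoundEffectSpec(
    event="coin drop",
    material="metallic",
    environment="dry studio",
    duration="brief 1-second burst",
)
print(spec.to_prompt())
```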

Model updates

ZeroTwo’s audio model library is updated as new models become available. Check the ZeroTwo changelog for announcements about newly added audio models.
