Audio Generation Models

ZeroTwo’s audio Studio provides access to multiple AI audio generation models, each optimized for different types of audio output. This page explains the model types available and how to choose between them.

AI audio generation technology is evolving rapidly. New models are added to ZeroTwo regularly. Check the model dropdown in the audio Studio for the current full list.

Model categories

Audio generation models in ZeroTwo fall into three main categories:

Music generation models

Specialized for generating original music from text descriptions. These models understand genre, mood, instrumentation, tempo, and musical structure. Best for:

Background music for videos, presentations, and apps
Ambient soundscapes and atmospheric audio
Jingles, intros, and branded audio pieces
Specific genre requests (jazz, classical, electronic, etc.)

Prompt approach: Describe genre, mood, instruments, tempo, and duration. Example: "Upbeat electronic background music, 120 BPM, synthesizer melody, suitable for a tech product demo, 60 seconds"

Voice synthesis / text-to-speech models

Generate spoken audio from text input. These models produce natural-sounding narration in various voices and styles. Best for:

AI narration for videos and presentations
Podcast-style spoken content
Accessibility audio (screen reader-style narration)
Character voices for creative projects

Prompt approach: Provide the text to be spoken and describe the voice characteristics — tone, pace, gender, accent, emotional register. Example: "Narrate the following in a warm, professional female voice at a moderate pace: [text]"

Sound effects models

Generate specific, discrete audio events: clicks, chimes, environmental sounds, and other effects. Best for:

UI sounds and notification tones
Environmental and ambient effects
Production sound design
Game audio assets

Prompt approach: Describe the specific sound event as precisely as possible. Example: "Two knocks on a wooden door, natural reverb, interior setting"

Choosing the right model

Use case	Model type to choose
Background music for a video	Music generation
Voiceover narration	Voice synthesis / TTS
App notification sound	Sound effects
Ambient environment audio	Music generation or sound effects
Podcast intro	Music generation
AI-read article	Voice synthesis / TTS

For music, describing the genre and mood is the most important part of the prompt. For voice, the most important elements are the text content and the voice tone/style description. For sound effects, precision about the specific sound event produces the best results.
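If you prepare audio requests in a script, the table above can be captured as a simple lookup. The following is a minimal sketch only: the dictionary and the `suggest_model_types` function are hypothetical names for illustration and are not part of any ZeroTwo API.

```python
# Hypothetical lookup that mirrors the "Choosing the right model" table.
# Keys are lower-cased use cases; values are the suggested model types.
MODEL_TYPES_BY_USE_CASE = {
    "background music for a video": ["music generation"],
    "voiceover narration": ["voice synthesis / TTS"],
    "app notification sound": ["sound effects"],
    "ambient environment audio": ["music generation", "sound effects"],
    "podcast intro": ["music generation"],
    "ai-read article": ["voice synthesis / TTS"],
}


def suggest_model_types(use_case: str) -> list[str]:
    """Return the model types suggested for a use case, or [] if it is not listed."""
    return MODEL_TYPES_BY_USE_CASE.get(use_case.strip().lower(), [])


print(suggest_model_types("Ambient environment audio"))
```

An unlisted use case returns an empty list, which leaves the choice to you rather than guessing a model type.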

Output formats

Format	Best for
MP3	Web sharing, social media, general use
WAV	Professional production, lossless quality, video editing

Download format options depend on the model selected. MP3 is available from all models; WAV is available from higher-quality models.
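The availability rule above suggests a simple fallback: prefer WAV when lossless quality matters and the selected model offers it, otherwise use MP3. Here is a minimal sketch of that rule; `pick_download_format` and its parameters are hypothetical and assume you already know which formats the selected model lists.

```python
from typing import Iterable


def pick_download_format(available: Iterable[str], needs_lossless: bool) -> str:
    """Choose a download format from the formats a model offers (hypothetical helper)."""
    formats = {name.upper() for name in available}
    # WAV is only offered by some models, so check before choosing it.
    if needs_lossless and "WAV" in formats:
        return "WAV"
    # MP3 is the fallback because the page states it is available from all models.
    return "MP3"


print(pick_download_format(["mp3", "wav"], needs_lossless=True))
print(pick_download_format(["mp3"], needs_lossless=True))
```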

Prompting by model type

Each audio model type responds to different prompt elements:

Prompting music generation models

The most important elements for music prompts are genre, mood, and instrumentation:

Prompt element	Examples
Genre	`jazz`, `classical`, `electronic`, `ambient`, `folk`, `hip-hop`, `cinematic`
Mood	`uplifting`, `tense`, `melancholic`, `energetic`, `peaceful`, `mysterious`
Instruments	`piano`, `acoustic guitar`, `orchestral strings`, `synthesizer`, `drums`, `bass`
Tempo	`120 BPM`, `slow and deliberate`, `fast-paced`, `moderate tempo`
Duration	`30 seconds`, `60 seconds`, `2 minutes`
Purpose	`background music for a product video`, `podcast intro`, `game menu music`

Strong music prompt:

Cinematic orchestral piece with rising strings and dramatic percussion, building tension over 30 seconds, suitable for a movie trailer
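If you generate many music prompts, it can help to assemble them from the elements in the table above so that none is forgotten. The sketch below is one possible way to do that; `build_music_prompt` and its parameter names are assumptions for illustration, and the result is just a text string to paste into the prompt field.

```python
from typing import Optional, Sequence


def build_music_prompt(
    genre: str,
    mood: str,
    instruments: Sequence[str] = (),
    tempo: Optional[str] = None,
    duration: Optional[str] = None,
    purpose: Optional[str] = None,
) -> str:
    """Join the prompt elements into one comma-separated description."""
    # Mood and genre lead the prompt, since they matter most for music.
    parts = [f"{mood} {genre} music"]
    if instruments:
        parts.append("with " + " and ".join(instruments))
    # Optional elements are appended only when they are provided.
    parts.extend(p for p in (tempo, duration, purpose) if p)
    return ", ".join(parts)


prompt = build_music_prompt(
    genre="electronic",
    mood="upbeat",
    instruments=["synthesizer", "drums"],
    tempo="120 BPM",
    duration="60 seconds",
    purpose="background music for a product video",
)
print(prompt)
```

Only genre and mood are required here; the other elements are optional so that a short prompt remains valid.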

Prompting voice synthesis models

Voice synthesis prompts focus on the text to be spoken and the voice characteristics:

Prompt element	Examples
Voice characteristics	`warm and friendly`, `authoritative and professional`, `energetic`, `calm and soothing`
Gender / age	`male voice`, `female voice`, `neutral`, `mature`, `young`
Pace	`slow and deliberate`, `conversational pace`, `brisk and confident`
Accent / style	`American English`, `British accent`, `news anchor style`

Strong voice prompt:

Read the following in a warm, professional female voice at a conversational pace, with natural pauses: [your text here]
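For repeated narration jobs, the voice elements from the table above can be combined with the text in a small helper. This is a sketch under assumptions: `build_voice_prompt` and its parameters are hypothetical names, and the "Read the following in a ..." wording simply follows the example prompt on this page.

```python
from typing import Optional


def build_voice_prompt(
    text: str,
    characteristics: str,
    gender_age: Optional[str] = None,
    pace: Optional[str] = None,
    accent_style: Optional[str] = None,
) -> str:
    """Combine voice characteristics and the text to be spoken into one prompt."""
    if not text.strip():
        raise ValueError("text to be spoken must not be empty")
    # gender_age is expected to read naturally before the colon, e.g. "female voice";
    # when it is omitted, the plain word "voice" is used instead.
    voice = f"{characteristics} {gender_age or 'voice'}"
    details = [voice] + [d for d in (pace, accent_style) if d]
    return f"Read the following in a {', '.join(details)}: {text.strip()}"


voice_prompt = build_voice_prompt(
    "Welcome to the show.",
    characteristics="warm and friendly",
    gender_age="female voice",
    pace="conversational pace",
    accent_style="British accent",
)
print(voice_prompt)
```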

Prompting sound effects models

Sound effect prompts should be as specific as possible about the exact sound event:

Prompt element	Examples
Sound event	`door knock`, `coin drop`, `camera click`, `notification chime`
Material / character	`wooden`, `metallic`, `glass`, `soft`, `sharp`
Environment	`interior`, `outdoor`, `reverberant space`, `dry studio`
Duration	`brief 1-second burst`, `3-second sustained`

Strong sound effect prompt:

A single metallic coin dropped onto a hardwood floor, brief ring and roll, indoor environment with slight room reverb
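When you need several versions of the same effect, for example for game audio assets, you can vary one element at a time and keep the rest fixed. The sketch below builds every material and environment combination for a single sound event; `sound_effect_variants` is a hypothetical helper, not a ZeroTwo feature, and each returned string is meant to be submitted as its own prompt.

```python
from itertools import product
from typing import Optional, Sequence


def sound_effect_variants(
    event: str,
    materials: Sequence[str],
    environments: Sequence[str],
    duration: Optional[str] = None,
) -> list[str]:
    """Return one prompt per (material, environment) combination for a sound event."""
    prompts = []
    for material, environment in product(materials, environments):
        parts = [f"{material} {event}", environment]
        if duration:
            parts.append(duration)
        prompts.append(", ".join(parts))
    return prompts


variants = sound_effect_variants(
    "door knock",
    materials=["wooden", "metallic"],
    environments=["interior", "outdoor"],
    duration="brief 1-second burst",
)
for variant in variants:
    print(variant)
```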

Model updates

ZeroTwo’s audio model library is updated as new models become available. Check the ZeroTwo changelog for announcements about newly added audio models.

Related

Creating audio

Step-by-step guide and prompt examples for all audio types.

Audio troubleshooting

Fix common issues with audio generation.

Studio overview

Overview of all three Studio sections — images, video, and audio.

Video generation

Generate AI video clips from text descriptions.
