Voice Documentation
Complete guide to generating AI-powered speech — narration, NPC dialogue, podcast intros, product demos, and ads.
1. Overview
WokGen Voice is an AI text-to-speech studio built for creators who need high-quality audio without recording equipment or voice talent. Type your script, pick a voice, and get back a WAV file in seconds.
Use cases
- Narration — explainer videos, tutorials, and e-learning modules.
- NPC dialogue — game characters, interactive fiction, and branching narratives.
- Product demos — voiceovers for SaaS walkthroughs and app store videos.
- Podcast intros — polished intro and outro segments with music-ready timing.
- Advertisements — punchy 15 s and 30 s ad spots with energetic delivery.
2. Getting Started
Generating your first voice clip takes under a minute:
- Step 1 — Navigate to Voice mode from the top nav or the WokGen home page.
- Step 2 — Enter your text in the script box. Keep it under 200 words for fastest results.
- Step 3 — Pick a Voice Type from the selector (Natural, Character, Whisper, etc.).
- Step 4 — Choose a language and set your speed if needed.
- Step 5 — Click Generate. Your audio will appear in the player within seconds.
3. Voice Types
WokGen Voice offers six voice types to match different content needs. Each type affects delivery, pacing, and emotional tone.
| Voice Type | Character | Best for |
|---|---|---|
| Natural | Clean, measured, announcer-style | Narration, explainers, product demos |
| Character | Expressive, dramatic, character-driven | Game NPC dialogue, fantasy/gaming content |
| Whisper | Soft, intimate, low-energy | ASMR, meditation guides, romantic scenes |
| Energetic | Upbeat, fast-paced, enthusiastic | Marketing spots, ads, hype reels |
| News | Authoritative, clipped, broadcast-quality | News segments, corporate announcements |
| Deep | Rich baritone, commanding presence | Movie trailers, documentary narration |
In Standard tier, voice types are applied through SpeechSynthesis pitch and rate parameters.
4. Languages
WokGen Voice currently supports seven languages. Language availability may vary between Standard and HD tiers.
- English (en) — full support, all voice types, all speeds
- Spanish (es) — full support
- French (fr) — full support
- German (de) — full support
- Japanese (ja) — HD tier recommended for natural prosody
- Portuguese (pt) — full support
- Chinese (zh) — HD tier recommended for tonal accuracy
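For scripted workflows, the list above can be captured as a small lookup table. This is a minimal TypeScript sketch, not WokGen's actual data model: the names LANGUAGES and recommendedTier are hypothetical, and only the language codes and the HD recommendations are taken from the list.

```typescript
// Hypothetical lookup table built from the language list above.
type Tier = "standard" | "hd";

interface LanguageInfo {
  name: string;
  hdRecommended: boolean; // true where the list recommends the HD tier
}

const LANGUAGES: Record<string, LanguageInfo> = {
  en: { name: "English", hdRecommended: false },
  es: { name: "Spanish", hdRecommended: false },
  fr: { name: "French", hdRecommended: false },
  de: { name: "German", hdRecommended: false },
  ja: { name: "Japanese", hdRecommended: true },
  pt: { name: "Portuguese", hdRecommended: false },
  zh: { name: "Chinese", hdRecommended: true },
};

// Returns the suggested tier for a language code,
// or null when the code is not in the supported list.
function recommendedTier(code: string): Tier | null {
  const key = code.toLowerCase();
  if (!Object.prototype.hasOwnProperty.call(LANGUAGES, key)) return null;
  return LANGUAGES[key].hdRecommended ? "hd" : "standard";
}
```

A caller might use recommendedTier("ja") to warn before a Japanese script is generated in Standard tier.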
5. Speed Control
The speed slider controls how fast the voice speaks, from a slow dramatic pace to a rapid upbeat delivery.
| Speed | Feel | Recommended for |
|---|---|---|
| 0.5× | Very slow, dramatic, weighted | Trailers, dramatic reveals, meditation |
| 0.75× | Slow and deliberate | Technical explanations, e-learning |
| 1.0× | Normal conversational pace | Most content (default) |
| 1.25× | Slightly brisk | Podcasts, quick explainers |
| 1.5× | Fast, energetic | Ads, hype content, social clips |
| 2.0× | Very fast, upbeat, punchy | Short-form marketing, TikTok/Reels |
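When planning a timed spot, it can help to estimate how speed affects clip length. The sketch below is illustrative only: it assumes the six speeds in the table are the only slider stops, and the baseWpm default of 150 words per minute at 1.0× is a hypothetical figure, not a documented WokGen value.

```typescript
// Speeds listed in the table above. Treating them as the only
// slider stops is an assumption.
const SPEED_STOPS = [0.5, 0.75, 1.0, 1.25, 1.5, 2.0];

// Snap an arbitrary value to the nearest listed speed.
// On a tie, the lower speed wins.
function snapSpeed(value: number): number {
  return SPEED_STOPS.reduce((best, stop) =>
    Math.abs(stop - value) < Math.abs(best - value) ? stop : best
  );
}

// Rough clip length in seconds. baseWpm (words per minute at 1.0x)
// is a hypothetical default, not a measured WokGen figure.
function estimateSeconds(
  wordCount: number,
  speed: number,
  baseWpm: number = 150
): number {
  return (wordCount / (baseWpm * speed)) * 60;
}
```

Under that assumed pace, a 75-word script at 1.0× comes out near 30 s, and the same script at 2.0× near 15 s. Actual timing depends on the voice and punctuation, so treat the result as a starting point.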
6. Standard vs HD Tier
Standard tier (free)
Standard voice generation uses your browser's built-in Web Speech API (SpeechSynthesis). Results are instant and require no server round-trip or credits. Voice quality depends on the OS and browser — it can sound robotic on some platforms, especially for non-English text.
- Instant — no network latency
- Free, no credits consumed
- Robotic on some OS/browser combos
- Language quality varies by installed system voices
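To make the Standard tier concrete, here is a minimal sketch of driving the browser's SpeechSynthesis API directly. It is not WokGen's implementation: the VOICE_PRESETS numbers and the function names are hypothetical. Only the API surface (SpeechSynthesisUtterance with its lang, pitch, and rate properties, and speechSynthesis.speak) is standard.

```typescript
type VoiceType =
  | "Natural" | "Character" | "Whisper" | "Energetic" | "News" | "Deep";

interface UtteranceSettings {
  pitch: number; // Web Speech API range: 0 to 2, default 1
  rate: number;  // Web Speech API range: 0.1 to 10, default 1
}

// Hypothetical presets: illustrative values only, not WokGen's real mapping.
const VOICE_PRESETS: Record<VoiceType, UtteranceSettings> = {
  Natural:   { pitch: 1.0, rate: 1.0 },
  Character: { pitch: 1.2, rate: 1.0 },
  Whisper:   { pitch: 0.9, rate: 0.85 },
  Energetic: { pitch: 1.2, rate: 1.15 },
  News:      { pitch: 1.0, rate: 1.05 },
  Deep:      { pitch: 0.6, rate: 0.9 },
};

function clamp(value: number, min: number, max: number): number {
  return Math.min(max, Math.max(min, value));
}

// Combine a voice preset with the speed setting, kept within the API's ranges.
function settingsFor(voice: VoiceType, speed: number): UtteranceSettings {
  const preset = VOICE_PRESETS[voice];
  return {
    pitch: clamp(preset.pitch, 0, 2),
    rate: clamp(preset.rate * speed, 0.1, 10),
  };
}

// Speaks the text when the Web Speech API is available.
// Returns false outside a supporting browser (for example, under Node).
function speakStandard(
  text: string,
  voice: VoiceType,
  speed: number = 1.0,
  lang: string = "en"
): boolean {
  const g = globalThis as any;
  if (!g.speechSynthesis || !g.SpeechSynthesisUtterance) return false;
  const utterance = new g.SpeechSynthesisUtterance(text);
  const { pitch, rate } = settingsFor(voice, speed);
  utterance.lang = lang;
  utterance.pitch = pitch;
  utterance.rate = rate;
  g.speechSynthesis.speak(utterance);
  return true;
}
```

Because the browser picks the actual voice from those installed on the system, the same settings can sound quite different across platforms, which is the quality variation described above.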
HD tier (1 credit per clip)
HD voice uses HuggingFace Kokoro (fast, natural-sounding neural TTS) or Bark (expressive, character-style voice synthesis). Results are significantly more natural, with proper prosody, breathing, and emotional inflection.
- Natural-sounding neural TTS
- Consistent quality across all languages
- Better character voice expressiveness with Bark
- Costs 1 credit per clip
- Takes 5–15 seconds to generate
7. Audio Player
After generation, the audio appears in the built-in player. Controls include:
- Play / Pause — toggle playback with the play button or spacebar.
- Seek — click anywhere on the progress bar to jump to that position.
- Volume — drag the volume slider to adjust output level (0–100%).
- Duration — elapsed time and total clip length are displayed.
The player is non-destructive: re-generating with different settings keeps your current clip until you explicitly clear it.
8. Downloading
All generated audio is available for immediate download in WAV format, which is lossless and universally compatible with DAWs, video editors, and game engines.
- Click the Download button below the player.
- The file is named wokgen-voice-[timestamp].wav by default.
- WAV files can be imported directly into Unity, Godot, Premiere, DaVinci Resolve, etc.
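If you batch-import downloads into a project, the default naming pattern makes the clips easy to pick out of a folder listing. The sketch below is a hypothetical helper, not part of WokGen. The timestamp format is not documented here, so the code treats it as an opaque string rather than parsing it.

```typescript
// Matches the default download name. The timestamp is captured as an
// opaque string because its exact format is not documented.
const CLIP_NAME = /^wokgen-voice-(.+)\.wav$/;

// Returns the timestamp portion of a default file name,
// or null if the name does not follow the pattern.
function clipTimestamp(fileName: string): string | null {
  const match = CLIP_NAME.exec(fileName);
  return match ? match[1] : null;
}

// Keeps only the names that follow the default pattern.
function defaultClips(fileNames: string[]): string[] {
  return fileNames.filter((name) => CLIP_NAME.test(name));
}
```

Renamed clips will not match, so a pipeline that relies on this pattern should keep the default names.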
9. Saving to Gallery
Voice clips can be saved to your WokGen Gallery for later access and sharing.
- Click Save to Gallery after generation.
- Choose Public to share with the WokGen community, or Private to keep it in your account only.
- Saved clips appear under your profile in the Gallery with the script text and voice settings.
- You can delete saved clips at any time from your Gallery page.
10. Rate Limits
| Plan | Voice clips per hour | HD clips |
|---|---|---|
| Free | 5 clips / hour | From credit balance |
| Pro | 20 clips / hour | From credit balance |
| Max | Unlimited | Unlimited (subject to fair use) |
Rate limits apply to the number of generation requests, not clip duration: a 500-word clip counts as one request. Limits are enforced over a rolling 60-minute window, so each request stops counting 60 minutes after it was made.
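A rolling window can be tracked on the client to avoid sending requests that would be rejected. The sketch below shows one common way to model it and is not WokGen's server logic. The class name is hypothetical, and whether rejected requests count toward the limit is not stated here, so the sketch assumes they do not.

```typescript
// Hourly limits from the table above. Infinity stands in for "Unlimited".
const CLIPS_PER_HOUR = { free: 5, pro: 20, max: Infinity };

class RollingWindowLimiter {
  private timestamps: number[] = [];
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number = 60 * 60 * 1000) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true and records the request if it fits in the window ending
  // at `now` (milliseconds). Returns false without recording otherwise.
  tryAcquire(now: number): boolean {
    // Forget requests that are a full window old or older.
    this.timestamps = this.timestamps.filter((t) => now - t < this.windowMs);
    if (this.timestamps.length >= this.limit) return false;
    this.timestamps.push(now);
    return true;
  }
}
```

In practice you would pass Date.now() as the argument, for example new RollingWindowLimiter(CLIPS_PER_HOUR.free).tryAcquire(Date.now()). Taking the time as a parameter keeps the class easy to test.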
11. Best Practices
Script length
Keep your script under 200 words for the fastest generation times and most consistent quality. Long scripts may be split internally — shorter inputs give better control over pacing and tone.
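One way to stay under the 200-word guideline is to split a long script at sentence boundaries before generating. The following is a simple sketch with hypothetical function names, not the splitting WokGen performs internally. It treats ., !, and ? as sentence endings, so abbreviations such as "e.g." will cause extra splits.

```typescript
// Counts whitespace-separated words.
function countWords(text: string): number {
  return text.trim().split(/\s+/).filter(Boolean).length;
}

// Groups whole sentences into chunks of at most maxWords words.
// A single sentence longer than maxWords is kept whole in its own chunk.
function splitScript(script: string, maxWords: number = 200): string[] {
  const sentences = script.match(/[^.!?]+[.!?]*/g) ?? [];
  const chunks: string[] = [];
  let current: string[] = [];
  let currentWords = 0;

  for (const raw of sentences) {
    const sentence = raw.trim();
    const words = countWords(sentence);
    if (words === 0) continue;
    // Start a new chunk when this sentence would push the current one over.
    if (currentWords > 0 && currentWords + words > maxWords) {
      chunks.push(current.join(" "));
      current = [];
      currentWords = 0;
    }
    current.push(sentence);
    currentWords += words;
  }
  if (current.length > 0) chunks.push(current.join(" "));
  return chunks;
}
```

Each chunk is generated as a separate clip, so it counts as a separate request and, in HD tier, a separate credit.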
Voice selection
- Use Character voice for game NPC dialogue — it adds the expressive variability that makes characters feel alive.
- Use Natural for product demos and explainers — clean, professional, and easy to understand.
- Use News for authoritative corporate content — it signals credibility and broadcast quality.
- Use Whisper for ASMR or intimate voiceover — the soft delivery reduces listener fatigue.
Punctuation matters
TTS engines respond to punctuation. Use commas for brief pauses, periods for full stops, and ellipses (...) for dramatic hesitation. Exclamation marks increase energy in Energetic and Character voices.
Iterating on HD clips
Each HD clip costs 1 credit, so proof your script in Standard tier first: fix typos, check pacing, and confirm the word count is within the recommended length. Switch to HD only once the script is finalised.
12. Limitations
- Standard tier uses browser TTS — output quality varies significantly between Chrome, Firefox, Safari, and different operating systems. On some Linux systems it may sound robotic.
- HD tier requires API credits — each clip consumes 1 credit. Credits are consumed even if you choose not to download the result.
- No real-time streaming — the full clip is generated before playback begins.
- Script length cap — inputs over 500 words may be truncated. Split long scripts into multiple clips.
- Emotional nuance — AI voices are improving rapidly but still lack the full expressive range of a human voice actor for nuanced emotional scenes.
- Singing not supported — WokGen Voice is for speech only. It cannot generate musical singing or melody.