WokGen Voice

Voice Documentation

Complete guide to generating AI-powered speech — narration, NPC dialogue, podcast intros, product demos, and ads.

1. Overview

WokGen Voice is an AI text-to-speech studio built for creators who need high-quality audio without recording equipment or voice talent. Type your script, pick a voice, and get back a WAV file in seconds.

Use cases

Narration — explainer videos, tutorials, and e-learning modules.
NPC dialogue — game characters, interactive fiction, and branching narratives.
Product demos — voiceovers for SaaS walkthroughs and app store videos.
Podcast intros — polished intro and outro segments with music-ready timing.
Advertisements — punchy 15 s and 30 s ad spots with energetic delivery.

Note: WokGen Voice is in Beta. Features are actively expanding, so expect new voices and language support in upcoming releases.

2. Getting Started

Generating your first voice clip takes under a minute:

Step 1 — Navigate to Voice mode from the top nav or the WokGen home page.
Step 2 — Enter your text in the script box. Keep it under 200 words for fastest results.
Step 3 — Pick a Voice Type from the selector (Natural, Character, Whisper, etc.).
Step 4 — Choose a language and set your speed if needed.
Step 5 — Click Generate. Your audio will appear in the player within seconds.

Note: Standard tier generates audio instantly using your browser's built-in TTS engine. HD tier sends the request to a Kokoro or Bark model hosted on Hugging Face and may take 5–15 seconds.

3. Voice Types

WokGen Voice offers six voice types to match different content needs. Each type affects delivery, pacing, and emotional tone.

Voice Type	Delivery style	Best for
Natural	Clean, measured, announcer-style	Narration, explainers, product demos
Character	Expressive, dramatic, character-driven	Game NPC dialogue, fantasy/gaming content
Whisper	Soft, intimate, low-energy	ASMR, meditation guides, romantic scenes
Energetic	Upbeat, fast-paced, enthusiastic	Marketing spots, ads, hype reels
News	Authoritative, clipped, broadcast-quality	News segments, corporate announcements
Deep	Rich baritone, commanding presence	Movie trailers, documentary narration

Note: In HD tier, the voice type influences the parameters passed to the Kokoro or Bark model. In Standard tier, it maps to the pitch and rate settings of the browser's SpeechSynthesis API.

4. Languages

WokGen Voice currently supports seven languages. Language availability may vary between Standard and HD tiers.

English (en) — full support, all voice types, all speeds
Spanish (es) — full support
French (fr) — full support
German (de) — full support
Japanese (ja) — HD tier recommended for natural prosody
Portuguese (pt) — full support
Chinese (zh) — HD tier recommended for tonal accuracy

Warning: Standard tier uses the browser's system TTS voices. Available languages depend on the voices installed on the user's OS. For guaranteed multilingual output, use HD tier.

5. Speed Control

The speed slider controls how fast the voice speaks, from a slow dramatic pace to a rapid upbeat delivery.

Speed	Feel	Recommended for
`0.5×`	Very slow, dramatic, weighted	Trailers, dramatic reveals, meditation
`0.75×`	Slow and deliberate	Technical explanations, e-learning
`1.0×`	Normal conversational pace	Most content — default
`1.25×`	Slightly brisk	Podcasts, quick explainers
`1.5×`	Fast, energetic	Ads, hype content, social clips
`2.0×`	Very fast, upbeat, punchy	Short-form marketing, TikTok/Reels
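
When writing for a fixed slot such as a 15 s or 30 s ad spot, it can help to estimate clip length before generating. The sketch below is only a rough planning aid. It assumes a baseline of about 150 words per minute at 1.0× (a common narration rule of thumb, not a documented WokGen figure) and assumes the speed multiplier scales that rate linearly. The function names are hypothetical.

```python
# Assumption: ~150 words per minute at 1.0x. This is a rule of thumb for
# narration, not a value published by WokGen. Actual pacing depends on the
# voice type, language, and punctuation.
BASELINE_WPM = 150.0


def estimate_seconds(script: str, speed: float = 1.0,
                     baseline_wpm: float = BASELINE_WPM) -> float:
    """Rough clip length in seconds for a script at a given speed multiplier."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    words = len(script.split())
    # Assumption: the speed multiplier scales the speaking rate linearly.
    return words / (baseline_wpm * speed) * 60.0


def max_words_for(target_seconds: float, speed: float = 1.0,
                  baseline_wpm: float = BASELINE_WPM) -> int:
    """Approximate word budget that fits in target_seconds at a given speed."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    return int(target_seconds * baseline_wpm * speed / 60.0)


if __name__ == "__main__":
    print(max_words_for(15, speed=1.5))  # word budget for a 15 s spot at 1.5x
    print(max_words_for(30, speed=1.0))  # word budget for a 30 s spot at 1.0x
```

Treat the result as a starting point and confirm the real duration in the player after generating.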

6. Standard vs HD Tier

Standard tier (free)

Standard voice generation uses your browser's built-in Web Speech API (SpeechSynthesis). Results are instant and require no server round-trip or credits. Voice quality depends on the OS and browser — it can sound robotic on some platforms, especially for non-English text.

Instant — no network latency
Free, no credits consumed
Robotic on some OS/browser combos
Language quality varies by installed system voices

HD tier (1 credit per clip)

HD voice uses models hosted on Hugging Face: Kokoro (fast, natural-sounding neural TTS) or Bark (expressive, character-style voice synthesis). Results sound noticeably more natural than Standard tier, with better prosody, breathing, and emotional inflection.

Natural-sounding neural TTS
Consistent quality across all languages
Better character voice expressiveness with Bark
Costs 1 credit per clip
Takes 5–15 seconds to generate

Tip: Use Standard to iterate on your script and timing, then switch to HD for your final production clip.

7. Audio Player

After generation, the audio appears in the built-in player. Controls include:

Play / Pause — toggle playback with the play button or spacebar.
Seek — click anywhere on the progress bar to jump to that position.
Volume — drag the volume slider to adjust output level (0–100%).
Duration — elapsed time and total clip length are displayed.

The player is non-destructive: regenerating with different settings does not discard your current clip, which stays available until you explicitly clear it.

8. Downloading

All generated audio is available for immediate download in WAV format, which is lossless and widely supported by DAWs, video editors, and game engines.

Click the Download button below the player.
The file is named wokgen-voice-[timestamp].wav by default.
WAV files can be imported directly into Unity, Godot, Premiere, DaVinci Resolve, etc.
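
Before importing a clip into an engine or editor, it can be useful to check its sample rate, channel count, and duration. The following is a minimal sketch using Python's standard `wave` module. It assumes the downloaded file is plain PCM WAV; the sample rate and bit depth WokGen actually produces are not documented here, so the code reads them from the file rather than assuming them. `describe_wav` is a hypothetical helper name.

```python
import wave


def describe_wav(path: str) -> dict:
    """Read basic properties from a PCM WAV file header."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate_hz": rate,
            "bits_per_sample": wav.getsampwidth() * 8,
            # Duration is the frame count divided by frames per second.
            "duration_s": frames / rate if rate else 0.0,
        }


if __name__ == "__main__":
    import sys

    # Pass the path of a downloaded clip on the command line.
    if len(sys.argv) > 1:
        print(describe_wav(sys.argv[1]))
```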

Note: WAV is used for maximum compatibility. If you need MP3 or OGG, convert the file with Audacity, ffmpeg, or an online converter after downloading.
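
For the ffmpeg route, a conversion can be scripted along these lines. This is a sketch that assumes ffmpeg is installed and on your PATH and that your build includes the `libmp3lame` and `libvorbis` encoders. The quality settings are illustrative choices, not WokGen recommendations, and the helper names are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path

# Illustrative encoder settings (variable bitrate quality levels).
CODEC_ARGS = {
    ".mp3": ["-codec:a", "libmp3lame", "-q:a", "2"],
    ".ogg": ["-codec:a", "libvorbis", "-q:a", "5"],
}


def build_ffmpeg_command(wav_path: str, target_ext: str) -> list[str]:
    """Build the ffmpeg argument list for converting a WAV to MP3 or OGG."""
    ext = target_ext.lower()
    if not ext.startswith("."):
        ext = "." + ext
    if ext not in CODEC_ARGS:
        raise ValueError(f"unsupported target format: {target_ext}")
    source = Path(wav_path)
    output = source.with_suffix(ext)  # same name, new extension
    # -y overwrites an existing output file without prompting.
    return ["ffmpeg", "-y", "-i", str(source), *CODEC_ARGS[ext], str(output)]


def convert(wav_path: str, target_ext: str) -> Path:
    """Run ffmpeg and return the path of the converted file."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg was not found on PATH")
    command = build_ffmpeg_command(wav_path, target_ext)
    subprocess.run(command, check=True)
    return Path(command[-1])
```

Keeping the command builder separate from the function that runs it makes the arguments easy to inspect before anything is executed.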

9. Saving to Gallery

Voice clips can be saved to your WokGen Gallery for later access and sharing.

Click Save to Gallery after generation.
Choose Public to share with the WokGen community, or Private to keep it in your account only.
Saved clips appear under your profile in the Gallery with the script text and voice settings.
You can delete saved clips at any time from your Gallery page.

10. Rate Limits

Plan	Voice clips per hour	HD clips
Free	5	From credit balance
Pro	20	From credit balance
Max	Unlimited	Unlimited (subject to fair use)

Rate limits apply to the number of generation requests, not clip duration. A 500-word clip counts as one request. Limits are enforced over a rolling 60-minute window, so each request stops counting against your limit 60 minutes after it was made.
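
If you drive generation from a script or batch tool, you can track the rolling window on your side to avoid hitting the limit. The sketch below shows one way to model a rolling 60-minute window. How WokGen accounts for requests on the server is not documented here, so treat this as a client-side approximation. The class name is hypothetical; the per-hour figures come from the table above.

```python
import time
from collections import deque


class RollingWindowLimiter:
    """Client-side model of an N-requests-per-rolling-window limit."""

    def __init__(self, max_requests: int, window_seconds: float = 3600.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._timestamps = deque()  # request times, oldest first

    def _evict(self, now: float) -> None:
        # Drop requests that have aged out of the window.
        while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()

    def try_acquire(self, now=None) -> bool:
        """Record a request if the window has room; return whether it was allowed."""
        now = time.monotonic() if now is None else now
        self._evict(now)
        if len(self._timestamps) >= self.max_requests:
            return False
        self._timestamps.append(now)
        return True

    def seconds_until_free(self, now=None) -> float:
        """Seconds until the next request would be allowed (0 if allowed now)."""
        now = time.monotonic() if now is None else now
        self._evict(now)
        if len(self._timestamps) < self.max_requests:
            return 0.0
        return self.window_seconds - (now - self._timestamps[0])


if __name__ == "__main__":
    free_plan = RollingWindowLimiter(max_requests=5)  # Free: 5 clips per hour
    print([free_plan.try_acquire() for _ in range(6)])
```

The `now` parameter exists so the limiter can be exercised with fixed timestamps instead of waiting for real time to pass.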

11. Best Practices

Script length

Keep your script under 200 words for the fastest generation times and most consistent quality. Long scripts may be split internally — shorter inputs give better control over pacing and tone.
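
One way to stay under the 200-word guideline is to split a long script at sentence boundaries before generating. The following is a minimal sketch, not WokGen's internal splitting logic, which is not documented here. The function name and the simple punctuation-based sentence detection are assumptions chosen for illustration.

```python
import re

SOFT_LIMIT_WORDS = 200  # recommended length from this guide


def split_script(script: str, max_words: int = SOFT_LIMIT_WORDS) -> list[str]:
    """Group whole sentences into chunks of at most max_words words.

    A single sentence longer than max_words is kept intact as its own
    chunk rather than being cut mid-sentence.
    """
    # Naive sentence split: break after ., !, or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", script.strip()) if s]
    chunks = []
    current = []
    count = 0
    for sentence in sentences:
        n = len(sentence.split())
        # Start a new chunk if adding this sentence would exceed the limit.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    demo = "Welcome to the tour. This is the dashboard. Click Start to begin."
    for i, chunk in enumerate(split_script(demo, max_words=8), start=1):
        print(i, len(chunk.split()), chunk)
```

Each chunk is generated as a separate clip, so in HD tier the number of chunks is also the number of credits the script will consume.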

Voice selection

Use Character voice for game NPC dialogue — it adds the expressive variability that makes characters feel alive.
Use Natural for product demos and explainers — clean, professional, and easy to understand.
Use News for authoritative corporate content — it signals credibility and broadcast quality.
Use Whisper for ASMR or intimate voiceover — the soft delivery reduces listener fatigue.

Punctuation matters

TTS engines respond to punctuation. Use commas for brief pauses, periods for full stops, and ellipses (...) for dramatic hesitation. Exclamation marks increase energy in Energetic and Character voices.

Iterating on HD clips

Each HD clip costs a credit, so proof your script in Standard tier first: fix typos, check pacing, and confirm the word count is within the recommended 200 words. Switch to HD only once the script is finalised.

12. Limitations

Standard tier uses browser TTS — output quality varies significantly between Chrome, Firefox, Safari, and different operating systems. On some Linux systems it may sound robotic.
HD tier requires API credits — each clip consumes 1 credit. Credits are consumed even if you choose not to download the result.
No real-time streaming — the full clip is generated before playback begins.
Script length cap — inputs over 500 words may be truncated. Split long scripts into multiple clips.
Emotional nuance — AI voices are improving rapidly but still lack the full expressive range of a human voice actor for nuanced emotional scenes.
Singing not supported — WokGen Voice is for speech only. It cannot generate musical singing or melody.

Warning: Do not use WokGen Voice to impersonate real people or generate deceptive content. Review the WokSpec Terms of Service before using voice output in commercial projects.