WokGen Voice

Voice Documentation

Complete guide to generating AI-powered speech — narration, NPC dialogue, podcast intros, product demos, and ads.

1. Overview

WokGen Voice is an AI text-to-speech studio built for creators who need high-quality audio without recording equipment or voice talent. Type your script, pick a voice, and get back a WAV file in seconds.

Use cases

  • Narration — explainer videos, tutorials, and e-learning modules.
  • NPC dialogue — game characters, interactive fiction, and branching narratives.
  • Product demos — voiceovers for SaaS walkthroughs and app store videos.
  • Podcast intros — polished intro and outro segments with music-ready timing.
  • Advertisements — punchy 15 s and 30 s ad spots with energetic delivery.
WokGen Voice is in Beta. Features are actively expanding — expect new voices and language support in upcoming releases.

2. Getting Started

Generating your first voice clip takes under a minute:

  • Step 1 — Navigate to Voice mode from the top nav or the WokGen home page.
  • Step 2 — Enter your text in the script box. Keep it under 200 words for fastest results.
  • Step 3 — Pick a Voice Type from the selector (Natural, Character, Whisper, etc.).
  • Step 4 — Choose a language and set your speed if needed.
  • Step 5 — Click Generate. Your audio will appear in the player within seconds.
Standard tier generates instantly using your browser's built-in TTS engine. HD tier sends the request to HuggingFace Kokoro or Bark and may take 5–15 seconds.
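
The five steps above amount to filling in a small settings record before generating. The sketch below is illustrative only: `VoiceRequest` and its field names are hypothetical and not part of any published WokGen API, and the defaults other than the 1.0× speed are assumptions. The allowed values are the voice types, language codes, and speed range listed in this guide.

```python
from dataclasses import dataclass

# Allowed values come from this guide; all names here are hypothetical.
VOICE_TYPES = {"natural", "character", "whisper", "energetic", "news", "deep"}
LANGUAGES = {"en", "es", "fr", "de", "ja", "pt", "zh"}
TIERS = {"standard", "hd"}
MIN_SPEED, MAX_SPEED = 0.5, 2.0


@dataclass(frozen=True)
class VoiceRequest:
    text: str
    voice_type: str = "natural"   # assumed default
    language: str = "en"          # assumed default
    speed: float = 1.0            # documented default pace
    tier: str = "standard"        # assumed default

    def problems(self) -> list[str]:
        """Return human-readable issues; an empty list means the request looks valid."""
        issues = []
        if not self.text.strip():
            issues.append("script is empty")
        if self.voice_type not in VOICE_TYPES:
            issues.append(f"unknown voice type: {self.voice_type}")
        if self.language not in LANGUAGES:
            issues.append(f"unsupported language: {self.language}")
        if not MIN_SPEED <= self.speed <= MAX_SPEED:
            issues.append(f"speed {self.speed} is outside {MIN_SPEED}-{MAX_SPEED}")
        if self.tier not in TIERS:
            issues.append(f"unknown tier: {self.tier}")
        return issues
```

Checking a request before clicking Generate (or before calling whatever client you build) catches typos in settings without spending a generation request.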

3. Voice Types

WokGen Voice offers six voice types to match different content needs. Each type affects delivery, pacing, and emotional tone.

Voice Type  Character                                  Best for
----------  -----------------------------------------  -----------------------------------------
Natural     Clean, measured, announcer-style           Narration, explainers, product demos
Character   Expressive, dramatic, character-driven     Game NPC dialogue, fantasy/gaming content
Whisper     Soft, intimate, low-energy                 ASMR, meditation guides, romantic scenes
Energetic   Upbeat, fast-paced, enthusiastic           Marketing spots, ads, hype reels
News        Authoritative, clipped, broadcast-quality  News segments, corporate announcements
Deep        Rich baritone, commanding presence         Movie trailers, documentary narration

In HD tier, voice type influences the Kokoro/Bark model parameters. In Standard tier, it maps to browser SpeechSynthesis pitch and rate parameters.
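
To make the Standard-tier mapping concrete, here is a minimal sketch of how a voice type could translate into pitch and rate values. The preset numbers are invented for illustration, since this guide does not publish the real ones, and multiplying the preset rate by the speed setting is also an assumption. The clamping ranges are the ones the Web Speech API defines for `SpeechSynthesisUtterance` (pitch 0 to 2, rate 0.1 to 10).

```python
# Hypothetical presets: illustrative values only, not WokGen's actual numbers.
PRESETS = {
    "natural":   {"pitch": 1.0, "rate": 1.0},
    "character": {"pitch": 1.2, "rate": 1.0},
    "whisper":   {"pitch": 0.9, "rate": 0.85},
    "energetic": {"pitch": 1.2, "rate": 1.2},
    "news":      {"pitch": 1.0, "rate": 1.1},
    "deep":      {"pitch": 0.6, "rate": 0.9},
}

# Valid ranges for SpeechSynthesisUtterance.pitch and .rate in the Web Speech API.
PITCH_RANGE = (0.0, 2.0)
RATE_RANGE = (0.1, 10.0)


def clamp(value: float, low: float, high: float) -> float:
    """Limit value to the closed interval [low, high]."""
    return max(low, min(high, value))


def utterance_params(voice_type: str, speed: float = 1.0) -> dict:
    """Combine a voice-type preset with a speed multiplier (an assumed design)."""
    preset = PRESETS[voice_type]
    return {
        "pitch": clamp(preset["pitch"], *PITCH_RANGE),
        "rate": clamp(preset["rate"] * speed, *RATE_RANGE),
    }
```

In a browser, the resulting numbers would be assigned to the `pitch` and `rate` properties of an utterance before speaking it.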

4. Languages

WokGen Voice currently supports seven languages. Language availability may vary between Standard and HD tiers.

  • English (en) — full support, all voice types, all speeds
  • Spanish (es) — full support
  • French (fr) — full support
  • German (de) — full support
  • Japanese (ja) — HD tier recommended for natural prosody
  • Portuguese (pt) — full support
  • Chinese (zh) — HD tier recommended for tonal accuracy
Standard tier uses the browser's system TTS voices. Available languages depend on the voices installed on the user's OS. For guaranteed multilingual output, use HD tier.
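
Because Standard tier depends on installed system voices, it can help to check whether any voice matches the chosen language before relying on it. The sketch below assumes you already have a list of voice language tags (in a browser these come from the `lang` property of each entry returned by `speechSynthesis.getVoices()`); the function names and the fallback rule are hypothetical.

```python
def voices_for_language(installed: list[str], code: str) -> list[str]:
    """Return installed voice tags whose primary language subtag equals `code`.

    Tags are usually written like "en-US"; underscore separators ("en_US")
    are tolerated as well.
    """
    code = code.lower()
    matches = []
    for tag in installed:
        primary = tag.replace("_", "-").split("-")[0].lower()
        if primary == code:
            matches.append(tag)
    return matches


def suggested_tier(installed: list[str], code: str) -> str:
    """Suggest HD when no installed system voice covers the language."""
    return "standard" if voices_for_language(installed, code) else "hd"
```

For example, with only English and Spanish voices installed, a Japanese script would be routed to HD under this rule.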

5. Speed Control

The speed slider controls how fast the voice speaks, from a slow dramatic pace to a rapid upbeat delivery.

Speed  Feel                           Recommended for
-----  -----------------------------  --------------------------------------
0.5×   Very slow, dramatic, weighted  Trailers, dramatic reveals, meditation
0.75×  Slow and deliberate            Technical explanations, e-learning
1.0×   Normal conversational pace     Most content (default)
1.25×  Slightly brisk                 Podcasts, quick explainers
1.5×   Fast, energetic                Ads, hype content, social clips
2.0×   Very fast, upbeat, punchy      Short-form marketing, TikTok/Reels
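
Speed also determines whether a script fits a fixed slot such as a 15 s or 30 s ad. The rough estimate below assumes a baseline of 150 words per minute at 1.0×, which is a common rule of thumb and not a figure published by WokGen; real clip lengths will vary with voice type and punctuation. The function names are hypothetical.

```python
SPEED_STEPS = (0.5, 0.75, 1.0, 1.25, 1.5, 2.0)  # speeds listed in the table
ASSUMED_WPM = 150  # assumption: words per minute at 1.0x


def nearest_speed(value: float) -> float:
    """Snap an arbitrary value to the closest speed listed in the table."""
    return min(SPEED_STEPS, key=lambda step: abs(step - value))


def estimated_seconds(script: str, speed: float = 1.0) -> float:
    """Estimate clip length from the word count and the speed multiplier."""
    words = len(script.split())
    return words / (ASSUMED_WPM * speed) * 60


def slowest_speed_that_fits(script: str, target_seconds: float):
    """Return the slowest listed speed whose estimate fits, or None if none does."""
    for step in SPEED_STEPS:
        if estimated_seconds(script, step) <= target_seconds:
            return step
    return None
```

Under this assumption a 75-word script runs about 30 seconds at 1.0× and about 15 seconds at 2.0×. If no listed speed fits the slot, trim the script instead of speeding up further.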

6. Standard vs HD Tier

Standard tier (free)

Standard voice generation uses your browser's built-in Web Speech API (SpeechSynthesis). Results are instant and require no server round-trip or credits. Voice quality depends on the OS and browser — it can sound robotic on some platforms, especially for non-English text.

  • Instant — no network latency
  • Free, no credits consumed
  • Robotic on some OS/browser combos
  • Language quality varies by installed system voices

HD tier (1 credit per clip)

HD voice uses HuggingFace Kokoro (fast, natural-sounding neural TTS) or Bark (expressive, character-style voice synthesis). Results are significantly more natural, with proper prosody, breathing, and emotional inflection.

  • Natural-sounding neural TTS
  • Consistent quality across all languages
  • Better character voice expressiveness with Bark
  • Costs 1 credit per clip
  • Takes 5–15 seconds to generate
Use Standard to iterate on your script and timing, then switch to HD for your final production clip.
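
As a back-of-the-envelope aid for planning a session, the sketch below turns the figures in this section (1 credit and 5–15 seconds per HD clip, no credits for Standard) into a budget estimate. It assumes clips are generated one after another; the function names are hypothetical.

```python
HD_CREDITS_PER_CLIP = 1
HD_SECONDS_RANGE = (5, 15)  # documented generation time per HD clip


def hd_batch_estimate(clips: int) -> dict:
    """Estimate credits and total wait for `clips` HD generations run in sequence."""
    if clips < 0:
        raise ValueError("clips must not be negative")
    low, high = HD_SECONDS_RANGE
    return {
        "credits": clips * HD_CREDITS_PER_CLIP,
        "min_seconds": clips * low,
        "max_seconds": clips * high,
    }


def session_credits(standard_drafts: int, hd_finals: int) -> int:
    """Standard drafts are free, so only the HD finals consume credits."""
    return hd_finals * HD_CREDITS_PER_CLIP
```

Ten Standard drafts followed by two HD finals cost two credits in total, which is the saving the draft-then-finalise workflow is meant to capture.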

7. Audio Player

After generation, the audio appears in the built-in player. Controls include:

  • Play / Pause — toggle playback with the play button or spacebar.
  • Seek — click anywhere on the progress bar to jump to that position.
  • Volume — drag the volume slider to adjust output level (0–100%).
  • Duration — elapsed time and total clip length are displayed.

The player is non-destructive — you can re-generate with different settings without losing your current clip until you explicitly clear it.
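
If you want to check a clip's length outside the player, for example in a build script, the total duration shown by the Duration control can be recomputed from any WAV file. This is a minimal sketch using Python's standard `wave` module; it assumes an uncompressed PCM WAV file, and the function names are hypothetical.

```python
import wave


def wav_duration_seconds(source) -> float:
    """Return the clip length. `source` is a file path or a binary file object."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def format_clock(seconds: float) -> str:
    """Format seconds as m:ss, similar to a typical player readout."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes}:{secs:02d}"
```

Duration is simply the number of audio frames divided by the sample rate, so the check works without decoding or playing the audio.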

8. Downloading

All generated audio is available for immediate download in WAV format, which is lossless and universally compatible with DAWs, video editors, and game engines.

  • Click the Download button below the player.
  • The file is named wokgen-voice-[timestamp].wav by default.
  • WAV files can be imported directly into Unity, Godot, Premiere, DaVinci Resolve, etc.
WAV format is chosen for maximum compatibility. If you need MP3 or OGG, convert using Audacity, ffmpeg, or an online converter after downloading.
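
For batch conversion, ffmpeg can be driven from a short script. The sketch below assumes ffmpeg is installed and on your PATH, and it lets ffmpeg choose the encoder from the output file extension. The helper names and the example filename are hypothetical.

```python
import subprocess
from pathlib import Path


def ffmpeg_command(wav_path: str, target_ext: str) -> list[str]:
    """Build the ffmpeg argument list for converting a WAV file to MP3 or OGG."""
    if target_ext not in ("mp3", "ogg"):
        raise ValueError("target_ext must be 'mp3' or 'ogg'")
    src = Path(wav_path)
    dst = src.with_suffix("." + target_ext)
    # -n makes ffmpeg refuse to overwrite an existing output file.
    return ["ffmpeg", "-n", "-i", str(src), str(dst)]


def convert(wav_path: str, target_ext: str) -> Path:
    """Run ffmpeg and return the output path; raises CalledProcessError on failure."""
    command = ffmpeg_command(wav_path, target_ext)
    subprocess.run(command, check=True)
    return Path(command[-1])
```

Calling `convert("wokgen-voice-demo.wav", "mp3")` would write `wokgen-voice-demo.mp3` next to the original. Keep the WAV as your master copy, since MP3 and OGG are lossy.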

9. Gallery

Voice clips can be saved to your WokGen Gallery for later access and sharing.

  • Click Save to Gallery after generation.
  • Choose Public to share with the WokGen community, or Private to keep it in your account only.
  • Saved clips appear under your profile in the Gallery with the script text and voice settings.
  • You can delete saved clips at any time from your Gallery page.

10. Rate Limits

Plan  Voice clips per hour  HD clips
----  --------------------  -------------------------------
Free  5 clips / hour        From credit balance
Pro   20 clips / hour       From credit balance
Max   Unlimited             Unlimited (subject to fair use)

Rate limits count generation requests, not clip duration: a 500-word clip counts as one request. Limits are enforced over a rolling 60-minute window, so each request stops counting against your limit 60 minutes after it was made.
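
A rolling window differs from a limit that resets on the hour: each request ages out individually. The sketch below shows one common way to model that with a queue of timestamps. It is an illustration of the idea, not WokGen's server implementation, and the class and parameter names are hypothetical.

```python
import time
from collections import deque

# Requests allowed per window for each plan; None means no fixed limit.
PLAN_LIMITS = {"free": 5, "pro": 20, "max": None}
WINDOW_SECONDS = 60 * 60


class RollingLimiter:
    def __init__(self, plan: str, clock=time.monotonic):
        self.limit = PLAN_LIMITS[plan]
        self.clock = clock          # injectable so the logic can be tested
        self.stamps = deque()       # times of requests still inside the window

    def try_request(self) -> bool:
        """Record a request and return True, or return False if the plan is at its limit."""
        now = self.clock()
        # Drop requests that are at least one full window old.
        while self.stamps and now - self.stamps[0] >= WINDOW_SECONDS:
            self.stamps.popleft()
        if self.limit is not None and len(self.stamps) >= self.limit:
            return False
        self.stamps.append(now)
        return True
```

On the Free plan, five requests made in the first minute block a sixth until 60 minutes after the first one, at which point slots free up one by one.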

11. Best Practices

Script length

Keep your script under 200 words for the fastest generation times and most consistent quality. Long scripts may be split internally — shorter inputs give better control over pacing and tone.
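
One way to stay under the 200-word guideline is to split a long script at sentence boundaries before generating. The sketch below is a simple approach under stated assumptions: it treats ".", "!" and "?" followed by whitespace as sentence ends, so abbreviations such as "Dr." will cause extra splits, and it counts words by whitespace, which does not suit Japanese or Chinese text. The function name is hypothetical.

```python
import re

RECOMMENDED_WORDS = 200  # guideline from this section


def split_script(script: str, max_words: int = RECOMMENDED_WORDS) -> list[str]:
    """Group whole sentences into chunks of at most `max_words` words.

    A single sentence longer than `max_words` is kept whole in its own chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", script.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        words = len(sentence.split())
        if words == 0:
            continue
        # Start a new chunk when adding this sentence would exceed the limit.
        if current and count + words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += words
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each chunk can then be generated as its own clip, which also gives you per-segment control over voice type and speed.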

Voice selection

  • Use Character voice for game NPC dialogue — it adds the expressive variability that makes characters feel alive.
  • Use Natural for product demos and explainers — clean, professional, and easy to understand.
  • Use News for authoritative corporate content — it signals credibility and broadcast quality.
  • Use Whisper for ASMR or intimate voiceover — the soft delivery reduces listener fatigue.

Punctuation matters

TTS engines respond to punctuation. Use commas for brief pauses, periods for full stops, and ellipses (...) for dramatic hesitation. Exclamation marks increase energy in Energetic and Character voices.

Iterating on HD clips

Credits are precious. Proof your script in Standard tier first — fix typos, check pacing, and confirm the word count is correct. Only upgrade to HD once the script is finalised.

12. Limitations

  • Standard tier uses browser TTS — output quality varies significantly between Chrome, Firefox, Safari, and different operating systems. On some Linux systems it may sound robotic.
  • HD tier requires API credits — each clip consumes 1 credit. Credits are consumed even if you choose not to download the result.
  • No real-time streaming — the full clip is generated before playback begins.
  • Script length cap — inputs over 500 words may be truncated. Split long scripts into multiple clips.
  • Emotional nuance — AI voices are improving rapidly but still lack the full expressive range of a human voice actor for nuanced emotional scenes.
  • Singing not supported — WokGen Voice is for speech only. It cannot generate musical singing or melody.
Do not use WokGen Voice to impersonate real people or generate deceptive content. Review the WokSpec Terms of Service before using voice output in commercial projects.