This endpoint generates the complete audio first, then aligns it with the input text to produce timing information for each segment. The response contains both the audio and an array of timestamp segments.

Response Format

The response is a JSON object containing:
Field         Type    Description
audio_base64  string  Base64-encoded audio data
text          string  The synthesized text (with emotion markers removed)
alignment     array   Array of timestamp segments

Each timestamp segment contains:

Field  Type    Description
text   string  The text content of this segment
start  number  Start time in seconds
end    number  End time in seconds

Example Response

{
  "audio_base64": "UklGRiQAAABXQVZFZm10IBAAAAABAAEAQB8...",
  "text": "Hello, world!",
  "alignment": [
    {"text": "Hello,", "start": 0.0, "end": 0.45},
    {"text": "world!", "start": 0.52, "end": 1.1}
  ]
}
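
A response shaped like the example above could be unpacked as in the following minimal Python sketch. The function name parse_timestamp_response is hypothetical and not part of the API; the sketch assumes only the three documented fields and makes no assumption about the audio container format, so it returns the decoded bytes as-is.

```python
import base64
import json


def parse_timestamp_response(body: str):
    """Split a JSON response body into (audio_bytes, text, segments).

    Hypothetical helper: it relies only on the documented fields
    audio_base64, text, and alignment.
    """
    data = json.loads(body)
    # audio_base64 is documented as Base64-encoded audio data.
    audio = base64.b64decode(data["audio_base64"])
    # Each alignment entry carries text, start, and end (seconds).
    segments = [
        (seg["text"], float(seg["start"]), float(seg["end"]))
        for seg in data["alignment"]
    ]
    return audio, data["text"], segments


# Illustrative payload built locally; the audio bytes are a placeholder,
# not real audio.
sample_body = json.dumps({
    "audio_base64": base64.b64encode(b"placeholder-bytes").decode("ascii"),
    "text": "Hello, world!",
    "alignment": [
        {"text": "Hello,", "start": 0.0, "end": 0.45},
        {"text": "world!", "start": 0.52, "end": 1.1},
    ],
})
audio, text, segments = parse_timestamp_response(sample_body)
```

The decoded bytes can then be written to a file or handed to an audio player, while the segment tuples drive any timing-dependent logic.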

Use Cases

  • Subtitle generation: Automatically create synchronized subtitles for video content
  • Karaoke-style highlighting: Highlight words as they are spoken
  • Accessibility features: Provide visual indicators synchronized with audio playback
  • Audio editing: Precisely locate and edit specific words in generated speech
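
As a rough illustration of the first two use cases, the sketch below converts an alignment array into SubRip (SRT) subtitle text and looks up the segment being spoken at a given playback time. The helper names srt_time, to_srt, and active_segment are hypothetical, and the lookup assumes segments are sorted by start time and do not overlap, which the example response suggests but the field descriptions do not state.

```python
import bisect


def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(alignment) -> str:
    """Emit one numbered SRT cue per timestamp segment."""
    blocks = []
    for index, seg in enumerate(alignment, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_time(seg['start'])} --> {srt_time(seg['end'])}\n"
            f"{seg['text']}\n"
        )
    # Cues are separated by a blank line.
    return "\n".join(blocks)


def active_segment(alignment, t: float):
    """Return the segment covering playback time t, or None in a gap.

    Assumption: segments are sorted by start and non-overlapping.
    """
    starts = [seg["start"] for seg in alignment]
    i = bisect.bisect_right(starts, t) - 1
    if i >= 0 and t < alignment[i]["end"]:
        return alignment[i]
    return None


alignment = [
    {"text": "Hello,", "start": 0.0, "end": 0.45},
    {"text": "world!", "start": 0.52, "end": 1.1},
]
srt_text = to_srt(alignment)
current = active_segment(alignment, 0.6)  # segment for "world!"
```

For real subtitles, adjacent segments would usually be grouped into longer cues before formatting; per-segment cues as shown here are closer to what karaoke-style highlighting needs.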