This guide only applies to cascading agents. If you are using speech-to-speech models, this feature does not apply.
Real-time transcription involves a trade-off between latency and accuracy. Acting on interim results gives the lowest latency, but the recognizer has less context, so errors are more likely. Waiting for results built on more context improves accuracy but adds delay after the user stops speaking.

Transcription Modes

Optimize for Speed

Passes the latest interim results to downstream processing, using a low endpointing setting. Lowest latency, but slightly less accurate on entities such as numbers and dates.

Optimize for Accuracy

Uses results produced with a higher endpointing setting, which waits longer and gives the recognizer more context, so transcripts are more accurate. Adds roughly 200 ms of latency.
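
As a rough illustration of the difference between the two modes, the sketch below models endpointing as a silence window that starts when the user stops speaking: whatever hypothesis is available when the window closes is what gets passed downstream. Everything here is an assumption made for illustration, including the `TranscriptEvent` type, the `finalize` helper, the 100 ms and 300 ms window sizes (picked only to mirror the roughly 200 ms gap), and the sample hypotheses. It is a toy model, not the platform's API or measured output.

```python
from dataclasses import dataclass


@dataclass
class TranscriptEvent:
    """One transcript hypothesis emitted by a (hypothetical) recognizer."""
    time_ms: int  # when the hypothesis became available
    text: str     # the recognizer's best guess at that moment


def finalize(events: list[TranscriptEvent], speech_end_ms: int, endpointing_ms: int) -> str:
    """Return the latest hypothesis available once the endpointing window closes."""
    deadline = speech_end_ms + endpointing_ms
    available = [e for e in sorted(events, key=lambda e: e.time_ms) if e.time_ms <= deadline]
    return available[-1].text if available else ""


# Made-up event stream: the user stops speaking at 1000 ms, and a revised
# hypothesis with more context arrives 250 ms later.
EVENTS = [
    TranscriptEvent(900, "my code is four"),
    TranscriptEvent(1000, "my code is 4 2"),
    TranscriptEvent(1250, "my code is 42 17"),
]

SPEED_ENDPOINTING_MS = 100     # hypothetical low setting
ACCURACY_ENDPOINTING_MS = 300  # hypothetical setting, 200 ms higher

fast = finalize(EVENTS, speech_end_ms=1000, endpointing_ms=SPEED_ENDPOINTING_MS)
accurate = finalize(EVENTS, speech_end_ms=1000, endpointing_ms=ACCURACY_ENDPOINTING_MS)

print(fast)      # my code is 4 2
print(accurate)  # my code is 42 17
```

In this toy run the shorter window responds 200 ms sooner but misses the late revision, which is the kind of entity-level difference the two modes trade against latency.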

Which Mode Should You Use?

Benchmarking shows that both modes have similar Word Error Rate (WER). The main difference is in capturing entities like numbers, dates, and proper nouns.
Use case | Recommended mode
General conversation, low latency priority | Optimize for speed
Capturing numbers, dates, or specific entities | Optimize for accuracy
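
To see why similar WER can still hide a difference in entity capture, here is a minimal sketch of the standard WER calculation (word-level edit distance divided by the number of reference words). The `word_error_rate` helper and the example sentences are illustrative assumptions, not the benchmark or data behind the comparison above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in the reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: each hypothesis has exactly one wrong word out of ten.
REFERENCE = "please call me back on march 14 at 3 pm"
WRONG_ENTITY = "please call me back on march 40 at 3 pm"  # the date is wrong
WRONG_FILLER = "please call me back in march 14 at 3 pm"  # only "on" is wrong

print(word_error_rate(REFERENCE, WRONG_ENTITY))  # 0.1
print(word_error_rate(REFERENCE, WRONG_FILLER))  # 0.1
```

Both hypotheses score the same WER, yet only the first one would cause a wrong date to be recorded. When a call depends on values like this, entity accuracy is worth checking separately from WER.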