Answer
Latency and throughput are complementary metrics for evaluating inference performance: latency measures the time to serve a single request, while throughput measures how many requests (or tokens) the system completes per unit time. Builders must choose which to prioritize based on use-case constraints.
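The distinction can be made concrete with a toy calculation (all numbers hypothetical): average latency comes from per-request durations, while throughput comes from completed work divided by wall-clock time under concurrency.

```python
# Toy illustration with made-up numbers: latency is time per request,
# throughput is completed requests per unit of wall-clock time.
request_durations_s = [0.8, 1.1, 0.9, 1.3, 1.0]  # seconds each request took
wall_clock_s = 2.0  # elapsed time while serving them concurrently

avg_latency_s = sum(request_durations_s) / len(request_durations_s)
throughput_rps = len(request_durations_s) / wall_clock_s

print(f"avg latency: {avg_latency_s:.2f} s/request")   # 1.02
print(f"throughput:  {throughput_rps:.2f} requests/s") # 2.50
```

Note that with concurrency the two numbers are decoupled: total request time (5.1 s) exceeds wall-clock time (2.0 s), which is exactly why one metric cannot stand in for the other.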
Key points
- Latency matters most for interactive applications (e.g., voice chat, real-time agents).
- Throughput matters most for batch processing or high-concurrency serving (e.g., offline summarization, embedding generation).
- Optimizing one often trades off against the other—hardware, quantization, and batching strategies affect both.
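The batching trade-off in the last point can be sketched with a minimal cost model (the overhead and per-request costs below are assumptions, not measurements): each forward pass pays a fixed overhead plus a marginal cost per request, so larger batches amortize the overhead and raise throughput, but every request in the batch waits for the whole pass.

```python
# Minimal cost model (hypothetical constants): a batch of B requests takes
# a fixed overhead plus a marginal cost per request.
overhead_s = 0.050     # fixed cost per forward pass (assumed)
per_request_s = 0.010  # marginal cost per batched request (assumed)

def batch_time(B):
    """Time for one forward pass over a batch of B requests."""
    return overhead_s + per_request_s * B

for B in (1, 8, 32):
    t = batch_time(B)
    # Throughput climbs with B, but so does per-request latency.
    print(f"B={B:2d}  latency={t * 1000:6.1f} ms  throughput={B / t:6.1f} req/s")
```

Under this model, going from B=1 to B=32 multiplies throughput roughly fivefold while also multiplying latency roughly sixfold, which is the trade-off the bullet describes.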
What changed recently
- Gemini 3.1’s Flash Live demonstrates ultra-low-latency voice interaction as a concrete latency target for real-time interfaces.
- Recent evidence confirms that reasoning depth (e.g., Chain-of-Thought) imposes inherent latency floors—even with prompt masking, conceptual processing cannot be bypassed.
Explanation
Latency is measured in milliseconds per token or per request; it reflects responsiveness under load and affects user-perceived speed.
Throughput is measured in tokens or requests per second; it reflects system efficiency and cost per inference at scale. Both metrics depend on model architecture, hardware, and serving configuration.
Tools / Examples
- A voice assistant targeting sub-300 ms end-to-end latency requires tight orchestration of ASR, LLM, and TTS stages; latency dominates design decisions.
- A document-processing pipeline ingesting 10K PDFs/day prioritizes throughput: batching, KV caching, and FP8 quantization raise tokens/sec even if individual requests take longer.
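For throughput-bound workloads like the second example, the relevant question is a capacity check rather than a latency budget. A back-of-envelope sketch (the per-document token count and sustained throughput below are hypothetical assumptions):

```python
# Back-of-envelope capacity check for a batch pipeline.
# All figures are hypothetical assumptions for illustration.
pdfs_per_day = 10_000
tokens_per_pdf = 4_000            # assumed average tokens per document
server_throughput_tok_s = 600     # assumed sustained batched tokens/sec

tokens_per_day = pdfs_per_day * tokens_per_pdf
seconds_needed = tokens_per_day / server_throughput_tok_s
hours_needed = seconds_needed / 3600

print(f"{hours_needed:.1f} server-hours/day needed")  # 18.5
```

Under these assumptions one server fits the daily load with headroom; per-request latency never enters the calculation, which is the point of the example.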
Evidence timeline
The semantic irreducibility of Chain-of-Thought (CoT) reasoning has been empirically demonstrated: even when specific words are masked via prompt engineering, LLMs remain unable to bypass the underlying conceptual reasoning.
The Gemini 3.1 series launches strongly, with dual breakthroughs in Flash Live (ultra-low-latency voice interaction) and Pro Grounding (search augmentation), securing second place in Search Arena.
FAQ
Is lower latency always better?
Not universally. Aggressive latency reduction may reduce output quality or increase cost per token—measure against your SLA and user expectations.
Can I improve both latency and throughput simultaneously?
Sometimes—e.g., kernel fusion or better memory layout—but trade-offs persist. Recent advances like Flash Live optimize for latency *without* sacrificing throughput in narrow domains.
Last updated: 2026-03-28