Topics

Latency and throughput (what to measure)

Evergreen topic pages updated with new evidence

Answer

Latency and throughput are complementary metrics for evaluating inference performance: latency measures time per request, throughput measures requests per unit time. Builders must choose which to prioritize based on use case constraints.

Key points

  • Latency matters most for interactive applications (e.g., voice chat, real-time agents).
  • Throughput matters most for batch processing or high-concurrency serving (e.g., offline summarization, embedding generation).
  • Optimizing one often trades off against the other—hardware, quantization, and batching strategies affect both.
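The coupling between the two metrics can be made concrete with Little's law, which relates sustained throughput, mean latency, and the number of requests in flight. A minimal sketch with illustrative numbers (not from any benchmark):

```python
def concurrency(throughput_rps: float, mean_latency_s: float) -> float:
    """Little's law: mean in-flight requests = throughput (req/s) * mean latency (s)."""
    return throughput_rps * mean_latency_s

# Sustaining 100 req/s at 250 ms mean latency needs ~25 requests in flight.
print(concurrency(100, 0.25))  # 25.0
```

The practical reading: if batching raises latency without raising throughput proportionally, the server needs more concurrent requests (and memory) to stay busy.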

What changed recently

  • Gemini 3.1’s Flash Live demonstrates ultra-low-latency voice interaction as a concrete latency target for real-time interfaces.
  • Recent evidence confirms that reasoning depth (e.g., Chain-of-Thought) imposes inherent latency floors—even with prompt masking, conceptual processing cannot be bypassed.

Explanation

Latency is measured in milliseconds per request, and is often broken down into time-to-first-token and inter-token latency for streaming; it determines user-perceived responsiveness.

Throughput is measured in tokens or requests per second; it reflects system efficiency and cost per inference at scale. Both metrics depend on model architecture, hardware, and serving configuration.
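In practice both metrics are derived from the same request timings. A self-contained sketch, using hypothetical (start, end) timestamps in seconds rather than real serving data:

```python
import statistics

# Hypothetical completed requests: (start_time, end_time) in seconds.
requests = [(0.00, 0.21), (0.05, 0.33), (0.10, 0.95), (0.12, 0.40), (0.20, 0.55)]

# Latency: per-request elapsed time; report a percentile, not just the mean.
latencies = [end - start for start, end in requests]
p50 = statistics.median(latencies)

# Throughput: completed requests divided by the wall-clock window they span.
wall_clock = max(end for _, end in requests) - min(start for start, _ in requests)
throughput = len(requests) / wall_clock

print(f"p50 latency: {p50 * 1000:.0f} ms, throughput: {throughput:.2f} req/s")
```

Note that the two numbers answer different questions: the percentile describes a single user's experience (including tail behavior), while the wall-clock ratio describes aggregate system capacity.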

Tools / Examples

  • A voice assistant targeting sub-300 ms end-to-end latency requires tight orchestration of ASR, LLM, and TTS stages; latency dominates design decisions.
  • A document-processing pipeline ingesting 10K PDFs/day prioritizes throughput: batching, KV caching, and FP8 quantization raise tokens/sec even if individual requests get slower.
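The batching trade-off in the second example can be illustrated with a toy cost model (assumed numbers, not measurements): each forward pass pays a fixed overhead plus a per-token cost, and a batch amortizes the overhead across requests.

```python
OVERHEAD_S = 0.050    # assumed fixed cost per forward pass
PER_TOKEN_S = 0.002   # assumed incremental cost per token in the batch

def batch_stats(batch_size: int, tokens_per_req: int = 100):
    """Return (per-batch latency in s, aggregate throughput in tokens/s)."""
    pass_time = OVERHEAD_S + PER_TOKEN_S * tokens_per_req * batch_size
    throughput = batch_size * tokens_per_req / pass_time
    return pass_time, throughput

for b in (1, 8, 32):
    latency, tps = batch_stats(b)
    print(f"batch={b:2d}  latency={latency:.2f}s  throughput={tps:.0f} tok/s")
```

Under this model, growing the batch raises throughput toward an asymptote while latency climbs roughly linearly, which is exactly why batch size is the first knob to tune differently for interactive versus offline workloads.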

Evidence timeline

March 27 AI Briefing · Issue #151

The semantic irreducibility of Chain-of-Thought (CoT) reasoning has been empirically demonstrated: even when specific words are masked via prompt engineering, LLMs remain unable to bypass the underlying conceptual reasoning.

March 27 AI Briefing · Issue #150

The Gemini 3.1 series launches strongly, with dual breakthroughs in Flash Live (ultra-low-latency voice interaction) and Pro Grounding (search augmentation), securing second place in Search Arena; meanwhile, Mistral's Vo

FAQ

Is lower latency always better?

Not universally. Aggressive latency reduction may reduce output quality or increase cost per token—measure against your SLA and user expectations.

Can I improve both latency and throughput simultaneously?

Sometimes—e.g., kernel fusion or better memory layout—but trade-offs persist. Recent advances like Flash Live optimize for latency *without* sacrificing throughput in narrow domains.

Last updated: 2026-03-28 · Policy: Editorial standards · Methodology