Answer
Latency and throughput are complementary metrics for evaluating inference performance: latency measures time per request; throughput measures requests completed per unit time. Builders must choose which to prioritize based on use-case constraints.
Key points
- Latency reflects responsiveness—critical for interactive applications like chat or real-time editing.
- Throughput reflects system efficiency—key for batch processing or high-concurrency serving.
- Trade-offs between them are inherent; optimizing one often degrades the other without architectural changes.
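The two metrics can be captured in one run. A minimal sketch in Python (sequential timing only, with a stand-in workload; real serving numbers need concurrent load against the actual stack):

```python
import time
import statistics

def measure(fn, requests):
    """Time each request individually (latency) and the whole run
    (throughput). Sequential sketch, not a load test."""
    latencies = []
    start = time.perf_counter()
    for req in requests:
        t0 = time.perf_counter()
        fn(req)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    return {
        "p50_s": statistics.median(latencies),
        "p95_s": statistics.quantiles(latencies, n=20)[-1],  # 95th percentile cut
        "throughput_rps": len(requests) / elapsed,
    }

# Stand-in workload; swap in a real inference call.
stats = measure(lambda r: sum(range(10_000)), range(200))
```

Measuring both from the same run makes the trade-off visible immediately, rather than after separate benchmarks.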
What changed recently
- OpenAI’s MRC protocol (May 2026) targets network bottlenecks in large-scale GPU training—potentially improving throughput in distributed inference setups.
- Mistral Medium 3.5 (May 2026) enables dense 128B model inference on just 4 GPUs, lowering hardware barriers to balanced latency-throughput tuning.
Explanation
Latency and throughput depend on hardware, software stack, model architecture, and input characteristics—not just model size. For example, quantization may reduce latency but increase throughput variance under load.
Recent infrastructure advances (e.g., MRC) address underlying network constraints, but evidence of direct inference latency or throughput improvements in production remains limited. Builder decisions should still be grounded in empirical measurement across representative workloads.
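One inherent trade-off shows up with batching. Under a simple assumed cost model (fixed per-batch overhead plus a per-item cost; the numbers below are illustrative, not measured), larger batches raise throughput while every request in the batch waits longer:

```python
# Toy cost model (assumed numbers, not benchmarks).
FIXED_OVERHEAD_S = 0.050   # per-batch launch cost (assumption)
PER_ITEM_S = 0.005         # marginal cost per request (assumption)

def batch_stats(batch_size):
    batch_time = FIXED_OVERHEAD_S + PER_ITEM_S * batch_size
    latency = batch_time                  # each request waits for the whole batch
    throughput = batch_size / batch_time  # requests completed per second
    return latency, throughput

for b in (1, 8, 32):
    lat, thr = batch_stats(b)
    print(f"batch={b:2d}  latency={lat * 1000:5.0f} ms  throughput={thr:6.1f} req/s")
```

Under this model, throughput climbs with batch size while per-request latency climbs too, which is why the right batch size depends on which metric the workload prioritizes.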
Tools / Examples
- A customer-facing chatbot requires <500ms p95 latency—even at the cost of lower throughput.
- A document summarization pipeline processing 10K reports/hour prioritizes stable throughput over sub-second latency.
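The two examples above translate into concrete checks. A sketch with illustrative latency samples (nearest-rank p95, integer arithmetic to avoid float-rounding surprises):

```python
# Illustrative latency samples in ms, not from a real system.
latency_samples_ms = sorted([120, 180, 230, 310, 420, 480, 510, 260, 190, 350,
                             140, 220, 300, 410, 470, 160, 280, 330, 390, 450])

# Nearest-rank p95: rank = ceil(95n/100), computed in integer math.
n = len(latency_samples_ms)
p95_index = (95 * n + 99) // 100 - 1
p95_ms = latency_samples_ms[p95_index]
chatbot_ok = p95_ms < 500  # SLO from the chatbot example

# Throughput target from the summarization example: 10K reports/hour.
reports_per_hour = 10_000
required_rps = reports_per_hour / 3600  # ~2.8 reports/s sustained
```

Expressing the constraints this way (a p95 ceiling vs. a sustained-rate floor) makes clear that the two examples are optimizing different quantities.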
Evidence timeline
- OpenAI open-sourced the MRC (Multi-Path Reliable Connection) protocol, collaborating with industry giants including AMD and NVIDIA to overcome network bottlenecks in large-scale GPU training.
- Luma Uni-1 adds a programmable inference layer to break the text-to-image 'black box'; Mistral Medium 3.5 unifies encoding, reasoning, and instruction-following in a single 128B dense model deployable on just 4 GPUs.
FAQ
Should I measure latency or throughput first?
Start with latency if user experience is time-sensitive (e.g., APIs, UIs); start with throughput if cost-per-request or batch volume dominates (e.g., offline jobs). Measure both early to identify trade-offs.
Do newer models automatically improve latency or throughput?
Not necessarily. Model efficiency depends on implementation, hardware alignment, and optimization—not just release date. Evidence from Mistral Medium 3.5 shows improved deployability, but real-world latency/throughput gains require benchmarking.
Last updated: 2026-05-12