Observability 2.0: GPU Bottlenecks and Agent Latency
Bottom Line
Modern AI systems fail long before average GPU utilization looks scary. The winning move is to correlate queue time, KV-cache pressure, tool latency, and token-level traces so you can predict saturation before users feel it.
Key Takeaways
- High GPU utilization is not enough; queue time and KV-cache pressure often break latency first.
- Agent latency needs span-level attribution across retrieval, planning, tools, and final generation.
- Reference tuning cut p95 time-to-first-token from 2.6s to 1.1s by fixing queueing, not raw compute.
- Safe observability requires payload masking before prompts, tool args, or user data hit traces.
AI observability is moving past uptime dashboards and after-the-fact traces. In production agent systems, the expensive failures are usually visible minutes earlier in queue depth, KV-cache pressure, and span timing drift than in top-line GPU graphs. The next generation of monitoring is predictive: treat every token, tool call, and scheduler decision as part of one latency budget, then detect when the budget is about to break before customers hit refresh.
The Lead
Classic infrastructure monitoring answers whether a cluster is alive. It does not explain why a user waits three seconds for the first token even when the dashboard says the GPUs are only moderately busy. That gap is exactly where Observability 2.0 starts: not with more charts, but with a model of how inference pipelines actually degrade under mixed workloads.
Bottom Line
The fastest way to improve agent experience is usually not a larger model server fleet. It is better attribution: separate queueing, compute, cache contention, and tool latency, then alert on whichever one is rising first.
Why GPU utilization lies
- nv_gpu_utilization can stay reasonable while scheduler queues expand and users see worse time to first token.
- Batching improves throughput, but it can hide fairness problems when short requests wait behind larger prompts.
- Memory headroom matters as much as compute headroom because cache churn can force expensive recomputation.
- In agent workflows, model inference is only one phase; retrieval and tool execution often dominate tail latency.
Latency has more than one critical path
OpenTelemetry now defines GenAI spans and metrics for model operations, agent spans, and tool execution, which is the right conceptual shift. The operational trick is to stop measuring one end-to-end duration and instead decompose latency into phases that map to a concrete owner: platform, model serving, retrieval, or application logic.
- agent.request
- retrieve.context
- model.plan
- tool.execute
- model.final
- stream.output

Once those phases are explicit, the engineering conversation gets sharper. A spike in user latency can be attributed to a queueing regime change, a retrieval miss pattern, or a tool timeout instead of the vague conclusion that the model is slow.
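As a concrete reference, here is a minimal sketch of that decomposition using the OpenTelemetry Python SDK. The span names mirror the phases above; the handler body and its retrieval, planning, and tool calls are hypothetical placeholders, not a prescribed agent design.

```python
# Minimal phase-level span decomposition with the OpenTelemetry Python SDK.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent")

def handle_request(user_query: str) -> str:
    # One parent span per user request; every phase becomes a child span,
    # so latency can be attributed to a concrete owner at query time.
    with tracer.start_as_current_span("agent.request"):
        with tracer.start_as_current_span("retrieve.context"):
            docs = []  # retrieval call goes here
        with tracer.start_as_current_span("model.plan"):
            plan = ["search"]  # planning call goes here
        for tool in plan:
            with tracer.start_as_current_span("tool.execute") as span:
                span.set_attribute("tool.name", tool)  # tool call goes here
        with tracer.start_as_current_span("model.final"):
            answer = "..."  # final generation call goes here
        with tracer.start_as_current_span("stream.output"):
            return answer
```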
Architecture & Implementation
Build three telemetry planes
The most resilient implementations separate telemetry into three planes and join them at query time.
- Resource plane: GPU, CPU, memory, power, and node metrics. DCGM Exporter is the standard NVIDIA path for exposing GPU metrics to Prometheus.
- Serving plane: model-server request and scheduler metrics. Triton exposes metrics such as nv_inference_queue_duration_us, nv_inference_compute_infer_duration_us, nv_gpu_utilization, and nv_gpu_memory_used_bytes.
- Interaction plane: spans and GenAI metrics from the application layer, including gen_ai.server.time_to_first_token and gen_ai.server.time_per_output_token.
If you run vLLM, the same pattern holds. Its Prometheus endpoint surfaces server state and cache behavior through metrics such as vllm:num_requests_running, vllm:kv_cache_usage_perc, vllm:prompt_tokens_total, and vllm:generation_tokens_total. The important design choice is not the backend itself; it is whether those metrics share a request identity with your agent spans.
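To make the serving plane concrete, here is a hedged sketch of scraping a vLLM Prometheus endpoint into a flat dictionary with the prometheus_client parser. The endpoint URL is an assumption for a local deployment; the same approach works against Triton's metrics endpoint.

```python
# Read the serving plane directly from a Prometheus text exposition endpoint.
import requests
from prometheus_client.parser import text_string_to_metric_families

VLLM_METRICS_URL = "http://localhost:8000/metrics"  # assumed endpoint

def scrape(url: str) -> dict[str, float]:
    # Flatten the exposition into {sample_name: value}; this simple sketch
    # keeps the last sample per name and ignores label dimensions.
    out: dict[str, float] = {}
    for family in text_string_to_metric_families(requests.get(url, timeout=5).text):
        for sample in family.samples:
            out[sample.name] = sample.value
    return out

metrics = scrape(VLLM_METRICS_URL)
print("requests running:", metrics.get("vllm:num_requests_running"))
print("kv cache usage:", metrics.get("vllm:kv_cache_usage_perc"))
```

The sketch deliberately returns plain floats: the valuable part is not the scrape but tagging these values with the same request identity your agent spans carry.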
Correlate everything with one request contract
A predictive stack needs a minimal schema that survives hops across gateways, model routers, tool workers, and retrieval services. Keep it small enough to be adopted everywhere.
- trace_id
- request_id
- tenant_id
- agent_id
- model_name
- prompt_tokens
- output_tokens
- tool_count
- retrieval_docs
- queue_class
- cache_tier

That contract lets you ask high-value questions quickly (a sketch of stamping it onto spans follows the list below).
- Did TTFT regress only for one queue class?
- Did queue duration rise before compute duration?
- Did latency correlate with cache saturation or with a single slow tool?
- Did a routing change shift load to a weaker GPU pool or a smaller MIG slice?
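A minimal sketch of the contract as code, assuming OpenTelemetry spans as the carrier. Field names follow the list above; the app. attribute prefix is an illustrative choice, not a standard.

```python
# Stamp the request contract onto every span so all three telemetry planes
# can be joined on one identity at query time.
from dataclasses import asdict, dataclass

from opentelemetry import trace

@dataclass(frozen=True)
class RequestContract:
    trace_id: str
    request_id: str
    tenant_id: str
    agent_id: str
    model_name: str
    prompt_tokens: int
    output_tokens: int
    tool_count: int
    retrieval_docs: int
    queue_class: str
    cache_tier: str

def stamp(span: trace.Span, contract: RequestContract) -> None:
    # Every field becomes a span attribute, so the questions above reduce
    # to simple group-bys over trace data.
    for key, value in asdict(contract).items():
        span.set_attribute(f"app.{key}", value)
```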
Mask before you export
Prompt text, tool arguments, and retrieved documents can turn tracing into a privacy problem if you export them raw. If you need content-level debugging, sanitize first with the Data Masking Tool so traces preserve structure without leaking secrets, customer identifiers, or regulated fields.
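If a dedicated masking pipeline is not yet in place, even a small regex pass before attributes are set beats exporting raw payloads. This is a minimal sketch of the idea, not the Data Masking Tool itself; the patterns are illustrative assumptions, not a complete PII policy.

```python
# Redact obvious secrets and identifiers from payloads before they become
# span attributes. Sanitize before export, never after.
import re

PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<email>"),
    (re.compile(r"\b(?:sk|api|key)[-_][A-Za-z0-9]{16,}\b"), "<api_key>"),
    (re.compile(r"\b\d{13,19}\b"), "<card_number>"),
]

def mask(text: str) -> str:
    for pattern, replacement in PATTERNS:
        text = pattern.sub(replacement, text)
    return text

safe = mask("contact alice@example.com with key sk-abcdef1234567890abcd")
print(safe)  # -> contact <email> with key <api_key>
```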
Predictive signals to calculate every minute
- Queue share: queue time divided by total request time. Rising queue share is often the earliest saturation signal.
- Cache pressure slope: rate of change in vllm:kv_cache_usage_perc or equivalent memory pressure metrics.
- TTFT divergence: gap between p95 and p50 time to first token. This reveals emerging unfairness before averages move.
- Tool amplification: total tool span time divided by model span time. Agents with high amplification need a different SLO than plain chat.
- Prompt inflation: prompt-token growth relative to stable task classes. This often predicts sudden queue instability.
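A minimal sketch of these per-minute calculations, assuming you already collect per-request queue, total, TTFT, and span durations in seconds over a one-minute window:

```python
# Derived predictive signals computed over a one-minute sample window.
from statistics import median, quantiles

def queue_share(queue_s: list[float], total_s: list[float]) -> float:
    # Queue time divided by total request time, aggregated over the window.
    return sum(queue_s) / max(sum(total_s), 1e-9)

def cache_pressure_slope(kv_usage: list[float], interval_s: float) -> float:
    # First-difference slope of KV-cache usage samples spaced interval_s apart
    # (e.g. vllm:kv_cache_usage_perc).
    return (kv_usage[-1] - kv_usage[0]) / (interval_s * max(len(kv_usage) - 1, 1))

def ttft_divergence(ttft_s: list[float]) -> float:
    # Gap between p95 and p50 TTFT: an early unfairness signal.
    p95 = quantiles(ttft_s, n=20)[18]  # 19 cut points; index 18 is p95
    return p95 - median(ttft_s)

def tool_amplification(tool_span_s: float, model_span_s: float) -> float:
    # Total tool span time relative to model span time for one request.
    return tool_span_s / max(model_span_s, 1e-9)
```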
Benchmarks & Metrics
Reference workload
The numbers below describe a reference deployment pattern, not a vendor benchmark: an eight-GPU inference pool serving mixed interactive requests, plus an agent layer that performs retrieval and up to three tool calls per task. The goal was not maximizing aggregate throughput. The goal was to hold conversational responsiveness under bursty load while keeping infrastructure cost flat.
- Workload mix: 70% direct generation, 30% agentic flows.
- Prompt profile: short prompts mixed with retrieval-augmented prompts and occasional long contexts.
- Primary SLO: p95 TTFT under 1.2 seconds for interactive traffic.
- Secondary SLO: stable token streaming cadence after first token.
What predicted failure first
Three signals moved before user-facing errors.
- Queue duration rose ahead of compute duration, showing the bottleneck was scheduling and admission, not raw matrix math.
- KV-cache usage climbed into the high 0.8 to low 0.9 range, increasing eviction pressure and destabilizing response times.
- Tool span variance widened, which made agent tails worse even when model-only requests looked healthy.
| Metric | Baseline | Tuned | Why it mattered |
|---|---|---|---|
| p95 TTFT | 2.6s | 1.1s | Best proxy for perceived responsiveness |
| Queue share of request time | 47% | 19% | Proved the main issue was waiting, not compute |
| Median GPU utilization | 61% | 68% | Higher utilization with lower latency showed better scheduling |
| Peak KV-cache usage | 0.92 | 0.78 | More headroom reduced churn and tail spikes |
| p95 tool latency contribution | 38% | 24% | Separate budgets for tools stopped agent tails from dominating |
What changed
- Interactive traffic was isolated into a stricter queue class instead of competing with long-context batch work.
- Prompt budgets were tightened for the noisiest agent paths, reducing prompt inflation during peaks.
- Slow tools were wrapped with shorter deadlines and fallback behavior instead of holding the entire request open.
- Cache headroom targets were enforced, treating sustained high usage as an incident precursor rather than a harmless efficiency win.
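Of these changes, the tool deadlines are the easiest to reproduce. Here is a hedged sketch of the pattern, assuming synchronous tool functions; the pool size, deadline, and fallback payload are placeholders to tune per tool.

```python
# Bound each tool call with its own deadline and a fallback, instead of
# letting one slow tool hold the entire request open.
from concurrent.futures import ThreadPoolExecutor, TimeoutError

_pool = ThreadPoolExecutor(max_workers=8)

def call_tool_with_deadline(tool_fn, args: dict, deadline_s: float, fallback):
    future = _pool.submit(tool_fn, **args)
    try:
        return future.result(timeout=deadline_s)
    except TimeoutError:
        # Best effort; a running worker may still finish in the background.
        future.cancel()
        return fallback

# Usage (hypothetical tool):
# result = call_tool_with_deadline(search, {"query": q}, 0.8, fallback={"results": []})
```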
The notable lesson is that the tuned system did not win by brute-force overprovisioning. It won by making latency visible as a composition of queueing, compute, memory pressure, and tool work. That is the essence of predictive observability: the data tells you which lever to pull before you start scaling the wrong thing.
Strategic Impact
Why this changes platform decisions
Once teams can predict bottlenecks instead of react to outages, several architectural decisions become more disciplined.
- Capacity planning shifts from peak GPU count to queue-class and cache-headroom modeling.
- Routing policy becomes evidence-based because model choice can be evaluated against latency budgets, not only quality and cost.
- Agent design improves because tool-heavy flows can be given explicit latency budgets and fallbacks.
- FinOps gets cleaner because you can prove whether a cost increase bought lower queueing or merely more idle hardware.
The organizational payoff
This style of observability also reduces the unproductive blame loop between application teams and infrastructure teams. If traces show that queue duration expanded while compute stayed flat, the issue belongs to scheduling or admission. If tool spans widened, the model server is not the culprit. If TTFT rose only for one tenant or prompt class, that is a routing or product policy issue. Clear ownership is a performance feature.
- Platform teams get measurable thresholds for pre-scaling or traffic shedding.
- Application teams can tune prompts, retrieval limits, and tool fan-out with immediate feedback.
- Security teams can approve telemetry exports with lower risk when masking is built into the pipeline.
Road Ahead
What gets better next
The next stage is not just richer dashboards. It is closed-loop control.
- Schedulers will use recent queue and cache signals to adapt admission before tail latency spikes.
- Routers will select models based on real-time latency headroom, not static preferences.
- Agent frameworks will trim tool plans dynamically when the latency budget is already half spent.
- Observability platforms will forecast TTFT and throughput saturation windows instead of merely reporting them.
What remains immature
There is still standardization work to do. OpenTelemetry's GenAI semantic conventions are useful, but they are still marked as Development, which means teams should adopt them deliberately and expect some churn. GPU telemetry, model-serving metrics, and agent traces also live at different layers with different naming habits, so normalization remains an implementation task rather than a solved standard.
- Keep your internal metric dictionary explicit, even if you export standard names where possible.
- Treat semantic-convention upgrades as schema changes, not cosmetic changes.
- Prefer a thin internal abstraction over direct dashboard dependence on raw backend metric names.
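A thin abstraction can be as small as one mapping. This sketch assumes dashboards and alerts reference the stable internal names on the left, so a semantic-convention upgrade becomes a one-file schema change; the backend names reuse metrics cited earlier in the article.

```python
# Internal metric dictionary: only this mapping knows raw backend names.
METRIC_DICTIONARY: dict[str, str] = {
    "inference.queue_duration_us": "nv_inference_queue_duration_us",
    "inference.compute_duration_us": "nv_inference_compute_infer_duration_us",
    "gpu.utilization": "nv_gpu_utilization",
    "kv_cache.usage": "vllm:kv_cache_usage_perc",
    "genai.ttft": "gen_ai.server.time_to_first_token",
}

def backend_name(internal_name: str) -> str:
    # Fail loudly on unknown names so schema drift is caught in review.
    return METRIC_DICTIONARY[internal_name]
```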
Observability 2.0 is therefore less about buying another monitoring product and more about adopting a sharper performance model. If you can explain every user-visible delay as a mix of queueing, compute, cache pressure, and tool execution, you can usually predict the next bottleneck before it becomes an outage. That is the difference between monitoring infrastructure and operating an AI system.
Frequently Asked Questions
How do I tell whether GPU saturation or queueing is the real bottleneck?
If nv_inference_queue_duration_us grows faster than nv_inference_compute_infer_duration_us, the system is waiting more than it is computing. That usually points to admission control, batching policy, prompt inflation, or queue fairness rather than insufficient raw GPU math.

What metrics matter most for agent latency analysis?
Queue share, cache pressure slope, the gap between p95 and p50 TTFT, and tool amplification are the signals that tend to move before user-facing errors do. Agents with heavy tool fan-out deserve their own latency budget, separate from plain chat.

Can OpenTelemetry trace model calls and tool calls in one request?
Yes. The GenAI semantic conventions define spans for model operations, agent steps, and tool execution, all of which can share one trace_id across the entire request path. The practical requirement is consistent context propagation through gateways, async workers, and tool runners.

How do I avoid leaking prompts or customer data into traces?
Mask payloads before export: sanitize prompt text, tool arguments, and retrieved documents so traces preserve structure without carrying secrets, customer identifiers, or regulated fields.