Observability 2.0: GPU Bottlenecks and Agent Latency
Bottom Line
Modern AI systems fail long before average GPU utilization looks scary. The winning move is to correlate queue time, KV-cache pressure, tool latency, and token-level traces so you can predict saturation before users feel it.
Key Takeaways
- High GPU utilization is not enough; queue time and KV-cache pressure often break latency first.
- Agent latency needs span-level attribution across retrieval, planning, tools, and final generation.
- Reference tuning cut p95 time-to-first-token from 2.6s to 1.1s by fixing queueing, not raw compute.
- Safe observability requires payload masking before prompts, tool args, or user data hit traces.
AI observability is moving past uptime dashboards and after-the-fact traces. In production agent systems, the expensive failures are usually visible minutes earlier in queue depth, KV-cache pressure, and span timing drift than in top-line GPU graphs. The next generation of monitoring is predictive: treat every token, tool call, and scheduler decision as part of one latency budget, then detect when the budget is about to break before customers hit refresh.
The Lead
Classic infrastructure monitoring answers whether a cluster is alive. It does not explain why a user waits three seconds for the first token even when the dashboard says the GPUs are only moderately busy. That gap is exactly where Observability 2.0 starts: not with more charts, but with a model of how inference pipelines actually degrade under mixed workloads.
Bottom Line
The fastest way to improve agent experience is usually not a larger model server fleet. It is better attribution: separate queueing, compute, cache contention, and tool latency, then alert on whichever one is rising first.
Why GPU utilization lies
- nv_gpu_utilization can stay reasonable while scheduler queues expand and users see worse time to first token.
- Batching improves throughput, but it can hide fairness problems when short requests wait behind larger prompts.
- Memory headroom matters as much as compute headroom because cache churn can force expensive recomputation.
- In agent workflows, model inference is only one phase; retrieval and tool execution often dominate tail latency.
Latency has more than one critical path
OpenTelemetry now defines GenAI spans and metrics for model operations, agent spans, and tool execution, which is the right conceptual shift. The operational trick is to stop measuring one end-to-end duration and instead decompose latency into phases that map to a concrete owner: platform, model serving, retrieval, or application logic.
- agent.request
- retrieve.context
- model.plan
- tool.execute
- model.final
- stream.output

Once those phases are explicit, the engineering conversation gets sharper. A spike in user latency can be attributed to a queueing regime change, a retrieval miss pattern, or a tool timeout instead of the vague conclusion that the model is slow.
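As a concrete reference, here is a minimal sketch of that decomposition using the OpenTelemetry Python SDK. The span names mirror the phases above; the handler body and its retrieval, planning, and tool calls are hypothetical placeholders, not a prescribed agent design.

```python
# Minimal phase-level span decomposition with the OpenTelemetry Python SDK.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent")

def handle_request(user_query: str) -> str:
    # One parent span per user request; every phase becomes a child span,
    # so latency can be attributed to a concrete owner at query time.
    with tracer.start_as_current_span("agent.request"):
        with tracer.start_as_current_span("retrieve.context"):
            docs = []  # retrieval call goes here
        with tracer.start_as_current_span("model.plan"):
            plan = ["search"]  # planning call goes here
        for tool in plan:
            with tracer.start_as_current_span("tool.execute") as span:
                span.set_attribute("tool.name", tool)  # tool call goes here
        with tracer.start_as_current_span("model.final"):
            answer = "..."  # final generation call goes here
        with tracer.start_as_current_span("stream.output"):
            return answer
```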
Architecture & Implementation
Build three telemetry planes
The most resilient implementations separate telemetry into three planes and join them at query time.
- Resource plane: GPU, CPU, memory, power, and node metrics. DCGM Exporter is the standard NVIDIA path for exposing GPU metrics to Prometheus.
- Serving plane: model-server request and scheduler metrics. Triton exposes metrics such as nv_inference_queue_duration_us, nv_inference_compute_infer_duration_us, nv_gpu_utilization, and nv_gpu_memory_used_bytes.
- Interaction plane: spans and GenAI metrics from the application layer, including gen_ai.server.time_to_first_token and gen_ai.server.time_per_output_token.
If you run vLLM, the same pattern holds. Its Prometheus endpoint surfaces server state and cache behavior through metrics such as vllm:num_requests_running, vllm:kv_cache_usage_perc, vllm:prompt_tokens_total, and vllm:generation_tokens_total. The important design choice is not the backend itself; it is whether those metrics share a request identity with your agent spans.
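To make the serving plane concrete, here is a hedged sketch of scraping a vLLM Prometheus endpoint into a flat dictionary with the prometheus_client parser. The endpoint URL is an assumption for a local deployment; the same approach works against Triton's metrics endpoint.

```python
# Read the serving plane directly from a Prometheus text exposition endpoint.
import requests
from prometheus_client.parser import text_string_to_metric_families

VLLM_METRICS_URL = "http://localhost:8000/metrics"  # assumed endpoint

def scrape(url: str) -> dict[str, float]:
    # Flatten the exposition into {sample_name: value}; this simple sketch
    # keeps the last sample per name and ignores label dimensions.
    out: dict[str, float] = {}
    for family in text_string_to_metric_families(requests.get(url, timeout=5).text):
        for sample in family.samples:
            out[sample.name] = sample.value
    return out

metrics = scrape(VLLM_METRICS_URL)
print("requests running:", metrics.get("vllm:num_requests_running"))
print("kv cache usage:", metrics.get("vllm:kv_cache_usage_perc"))
```

The sketch deliberately returns plain floats: the valuable part is not the scrape but tagging these values with the same request identity your agent spans carry.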
Correlate everything with one request contract
A predictive stack needs a minimal schema that survives hops across gateways, model routers, tool workers, and retrieval services. Keep it small enough to be adopted everywhere.
- trace_id
- request_id
- tenant_id
- agent_id
- model_name
- prompt_tokens
- output_tokens
- tool_count
- retrieval_docs
- queue_class
- cache_tier

That contract lets you ask high-value questions quickly (a sketch of stamping it onto spans follows the list below).
- Did TTFT regress only for one queue class?
- Did queue duration rise before compute duration?
- Did latency correlate with cache saturation or with a single slow tool?
- Did a routing change shift load to a weaker GPU pool or a smaller MIG slice?
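A minimal sketch of the contract as code, assuming OpenTelemetry spans as the carrier. Field names follow the list above; the app. attribute prefix is an illustrative choice, not a standard.

```python
# Stamp the request contract onto every span so all three telemetry planes
# can be joined on one identity at query time.
from dataclasses import asdict, dataclass

from opentelemetry import trace

@dataclass(frozen=True)
class RequestContract:
    trace_id: str
    request_id: str
    tenant_id: str
    agent_id: str
    model_name: str
    prompt_tokens: int
    output_tokens: int
    tool_count: int
    retrieval_docs: int
    queue_class: str
    cache_tier: str

def stamp(span: trace.Span, contract: RequestContract) -> None:
    # Every field becomes a span attribute, so the questions above reduce
    # to simple group-bys over trace data.
    for key, value in asdict(contract).items():
        span.set_attribute(f"app.{key}", value)
```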
Mask before you export
Prompt text, tool arguments, and retrieved documents can turn tracing into a privacy problem if you export them raw. If you need content-level debugging, sanitize first with the Data Masking Tool so traces preserve structure without leaking secrets, customer identifiers, or regulated fields.
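If a dedicated masking pipeline is not yet in place, even a small regex pass before attributes are set beats exporting raw payloads. This is a minimal sketch of the idea, not the Data Masking Tool itself; the patterns are illustrative assumptions, not a complete PII policy.

```python
# Redact obvious secrets and identifiers from payloads before they become
# span attributes. Sanitize before export, never after.
import re

PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<email>"),
    (re.compile(r"\b(?:sk|api|key)[-_][A-Za-z0-9]{16,}\b"), "<api_key>"),
    (re.compile(r"\b\d{13,19}\b"), "<card_number>"),
]

def mask(text: str) -> str:
    for pattern, replacement in PATTERNS:
        text = pattern.sub(replacement, text)
    return text

safe = mask("contact alice@example.com with key sk-abcdef1234567890abcd")
print(safe)  # -> contact <email> with key <api_key>
```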
Predictive signals to calculate every minute
- Queue share: queue time divided by total request time. Rising queue share is often the earliest saturation signal.
- Cache pressure slope: rate of change in vllm:kv_cache_usage_perc or equivalent memory pressure metrics.
- TTFT divergence: gap between p95 and p50 time to first token. This reveals emerging unfairness before averages move.
- Tool amplification: total tool span time divided by model span time. Agents with high amplification need a different SLO than plain chat.
- Prompt inflation: prompt-token growth relative to stable task classes. This often predicts sudden queue instability.
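A minimal sketch of these per-minute calculations, assuming you already collect per-request queue, total, TTFT, and span durations in seconds over a one-minute window:

```python
# Derived predictive signals computed over a one-minute sample window.
from statistics import median, quantiles

def queue_share(queue_s: list[float], total_s: list[float]) -> float:
    # Queue time divided by total request time, aggregated over the window.
    return sum(queue_s) / max(sum(total_s), 1e-9)

def cache_pressure_slope(kv_usage: list[float], interval_s: float) -> float:
    # First-difference slope of KV-cache usage samples spaced interval_s apart
    # (e.g. vllm:kv_cache_usage_perc).
    return (kv_usage[-1] - kv_usage[0]) / (interval_s * max(len(kv_usage) - 1, 1))

def ttft_divergence(ttft_s: list[float]) -> float:
    # Gap between p95 and p50 TTFT: an early unfairness signal.
    p95 = quantiles(ttft_s, n=20)[18]  # 19 cut points; index 18 is p95
    return p95 - median(ttft_s)

def tool_amplification(tool_span_s: float, model_span_s: float) -> float:
    # Total tool span time relative to model span time for one request.
    return tool_span_s / max(model_span_s, 1e-9)
```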
Benchmarks & Metrics
Reference workload
The numbers below describe a reference deployment pattern, not a vendor benchmark: an eight-GPU inference pool serving mixed interactive requests, plus an agent layer that performs retrieval and up to three tool calls per task. The goal was not maximizing aggregate throughput. The goal was to hold conversational responsiveness under bursty load while keeping infrastructure cost flat.
- Workload mix: 70% direct generation, 30% agentic flows.
- Prompt profile: short prompts mixed with retrieval-augmented prompts and occasional long contexts.
- Primary SLO: p95 TTFT under 1.2 seconds for interactive traffic.
- Secondary SLO: stable token streaming cadence after first token.
What predicted failure first
Three signals moved before user-facing errors.
- Queue duration rose ahead of compute duration, showing the bottleneck was scheduling and admission, not raw matrix math.
- KV-cache usage climbed into the high 0.8 to low 0.9 range, increasing eviction pressure and destabilizing response times.
- Tool span variance widened, which made agent tails worse even when model-only requests looked healthy.
| Metric | Baseline | Tuned | Why it mattered |
|---|---|---|---|
| p95 TTFT | 2.6s | 1.1s | Best proxy for perceived responsiveness |
| Queue share of request time | 47% | 19% | Proved the main issue was waiting, not compute |
| Median GPU utilization | 61% | 68% | Higher utilization with lower latency showed better scheduling |
| Peak KV-cache usage | 0.92 | 0.78 | More headroom reduced churn and tail spikes |
| p95 tool latency contribution | 38% | 24% | Separate budgets for tools stopped agent tails from dominating |
What changed
- Interactive traffic was isolated into a stricter queue class instead of competing with long-context batch work.
- Prompt budgets were tightened for the noisiest agent paths, reducing prompt inflation during peaks.
- Slow tools were wrapped with shorter deadlines and fallback behavior instead of holding the entire request open.
- Cache headroom targets were enforced, treating sustained high usage as an incident precursor rather than a harmless efficiency win.
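Of these changes, the tool deadlines are the easiest to reproduce. Here is a hedged sketch of the pattern, assuming synchronous tool functions; the pool size, deadline, and fallback payload are placeholders to tune per tool.

```python
# Bound each tool call with its own deadline and a fallback, instead of
# letting one slow tool hold the entire request open.
from concurrent.futures import ThreadPoolExecutor, TimeoutError

_pool = ThreadPoolExecutor(max_workers=8)

def call_tool_with_deadline(tool_fn, args: dict, deadline_s: float, fallback):
    future = _pool.submit(tool_fn, **args)
    try:
        return future.result(timeout=deadline_s)
    except TimeoutError:
        # Best effort; a running worker may still finish in the background.
        future.cancel()
        return fallback

# Usage (hypothetical tool):
# result = call_tool_with_deadline(search, {"query": q}, 0.8, fallback={"results": []})
```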
The notable lesson is that the tuned system did not win by brute-force overprovisioning. It won by making latency visible as a composition of queueing, compute, memory pressure, and tool work. That is the essence of predictive observability: the data tells you which lever to pull before you start scaling the wrong thing.
Strategic Impact
Why this changes platform decisions
Once teams can predict bottlenecks instead of react to outages, several architectural decisions become more disciplined.
- Capacity planning shifts from peak GPU count to queue-class and cache-headroom modeling.
- Routing policy becomes evidence-based because model choice can be evaluated against latency budgets, not only quality and cost.
- Agent design improves because tool-heavy flows can be given explicit latency budgets and fallbacks.
- FinOps gets cleaner because you can prove whether a cost increase bought lower queueing or merely more idle hardware.
The organizational payoff
This style of observability also reduces the unproductive blame loop between application teams and infrastructure teams. If traces show that queue duration expanded while compute stayed flat, the issue belongs to scheduling or admission. If tool spans widened, the model server is not the culprit. If TTFT rose only for one tenant or prompt class, that is a routing or product policy issue. Clear ownership is a performance feature.
- Platform teams get measurable thresholds for pre-scaling or traffic shedding.
- Application teams can tune prompts, retrieval limits, and tool fan-out with immediate feedback.
- Security teams can approve telemetry exports with lower risk when masking is built into the pipeline.
Road Ahead
What gets better next
The next stage is not just richer dashboards. It is closed-loop control.
- Schedulers will use recent queue and cache signals to adapt admission before tail latency spikes.
- Routers will select models based on real-time latency headroom, not static preferences.
- Agent frameworks will trim tool plans dynamically when the latency budget is already half spent.
- Observability platforms will forecast TTFT and throughput saturation windows instead of merely reporting them.
What remains immature
There is still standardization work to do. OpenTelemetry's GenAI semantic conventions are useful, but they are still marked as Development, which means teams should adopt them deliberately and expect some churn. GPU telemetry, model-serving metrics, and agent traces also live at different layers with different naming habits, so normalization remains an implementation task rather than a solved standard.
- Keep your internal metric dictionary explicit, even if you export standard names where possible.
- Treat semantic-convention upgrades as schema changes, not cosmetic changes.
- Prefer a thin internal abstraction over direct dashboard dependence on raw backend metric names.
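A thin abstraction can be as small as one mapping. This sketch assumes dashboards and alerts reference the stable internal names on the left, so a semantic-convention upgrade becomes a one-file schema change; the backend names reuse metrics cited earlier in the article.

```python
# Internal metric dictionary: only this mapping knows raw backend names.
METRIC_DICTIONARY: dict[str, str] = {
    "inference.queue_duration_us": "nv_inference_queue_duration_us",
    "inference.compute_duration_us": "nv_inference_compute_infer_duration_us",
    "gpu.utilization": "nv_gpu_utilization",
    "kv_cache.usage": "vllm:kv_cache_usage_perc",
    "genai.ttft": "gen_ai.server.time_to_first_token",
}

def backend_name(internal_name: str) -> str:
    # Fail loudly on unknown names so schema drift is caught in review.
    return METRIC_DICTIONARY[internal_name]
```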
Observability 2.0 is therefore less about buying another monitoring product and more about adopting a sharper performance model. If you can explain every user-visible delay as a mix of queueing, compute, cache pressure, and tool execution, you can usually predict the next bottleneck before it becomes an outage. That is the difference between monitoring infrastructure and operating an AI system.
Frequently Asked Questions
How do I tell whether GPU saturation or queueing is the real bottleneck?
If nv_inference_queue_duration_us grows faster than nv_inference_compute_infer_duration_us, the system is waiting more than it is computing. That usually points to admission control, batching policy, prompt inflation, or queue fairness rather than insufficient raw GPU math.

What metrics matter most for agent latency analysis?
Queue share, cache pressure slope, the gap between p95 and p50 TTFT, and tool amplification are the signals that tend to move before user-facing errors do. Agents with heavy tool fan-out deserve their own latency budget, separate from plain chat.

Can OpenTelemetry trace model calls and tool calls in one request?
Yes. The GenAI semantic conventions define spans for model operations, agent steps, and tool execution, all of which can share one trace_id across the entire request path. The practical requirement is consistent context propagation through gateways, async workers, and tool runners.

How do I avoid leaking prompts or customer data into traces?
Mask payloads before export: sanitize prompt text, tool arguments, and retrieved documents so traces preserve structure without carrying secrets, customer identifiers, or regulated fields.