The Phantom Throttling: How You're Misdiagnosing AI API Timeouts

TechRounder PDF Edition
Live article: https://www.techrounder.com/ai/the-phantom-throttling-how-youre-misdiagnosing-ai-api-timeouts/
PDF: https://www.techrounder.com/pdf/blog/the-phantom-throttling-how-youre-misdiagnosing-ai-api-timeouts.pdf

By Vipin PG | Published February 20, 2026 | Updated February 20, 2026 | Format: Analysis | 5 min read

In brief

Engineers frequently misdiagnose AI API timeouts as provider throttling. The real root cause is often internal connection pool exhaustion driven by long-lived LLM requests. The requests are not waiting on upstream latency. They pile up silently in local queues because the HTTP client's socket limit is too low. The fix is to size connection capacity for the workload and to monitor internal queue duration, not to add retries.

Key points

- The "Phantom Throttling" Phenomenon: Modern AI workloads often stall or time out in ways that mimic provider throttling. This leads engineers to blame upstream APIs (like OpenAI) when the issue is actually local.
- The Technical Root Cause: The bottleneck is frequently connection pool exhaustion within internal proxies. LLM calls are long-lived and hold sockets for seconds, so low HTTP client connection limits (e.g., 10 max connections) saturate quickly. Requests then queue locally before they ever reach the provider.
- Cognitive Bias in Debugging: Teams often fall victim to availability bias and anchoring. Seeing a single "429 Too Many Requests" error leads them to assume all subsequent latency is throttling, which keeps them from investigating internal infrastructure.
- Harmful Reflexive Fixes: Standard incident responses, such as increasing retry attempts or extending timeout windows, actually amplify the problem by increasing internal queue pressure and socket contention.
- The Critical Metric: To diagnose this correctly, teams must monitor client-side queue duration (e.g., `http.queue.duration`) rather than just provider latency or CPU usage.
- Architecture Lesson: Connection pools must be treated as capacity-managed infrastructure. Throughput calculations must account for how long AI inference calls hold a connection, and configuration must move beyond default settings to match high-concurrency needs.

Modern AI workloads don't fail loudly. They degrade. They stall. They surface as timeouts that look like provider throttling.

In 2026, most production AI stacks are layered: serverless edges, containerized workers, LLM gateways, observability pipelines, circuit breakers, rate limiters. When something times out, the instinct is predictable:

"The model provider is throttling us."

Sometimes that's true. Often, it isn't.

This article walks through a real-world failure pattern I've seen repeatedly in AI-heavy systems, and explains why experienced engineers still misdiagnose it.

1. The Failure Scenario

A US-based SaaS platform provides AI-assisted document review. Its architecture looks like this:

- Edge: Cloudflare Workers
- API layer: AWS API Gateway
- App tier: EKS (Kubernetes 1.32)
- LLM gateway: internal proxy service
- Model provider: OpenAI-compatible endpoint
- Observability: Datadog + OpenTelemetry
- Queue: SQS for async processing
- Autoscaling: HPA + Karpenter

Traffic pattern:

- ~1,200 RPS peak
- 15-25% of requests trigger LLM calls
- Average LLM latency: 2.4s
- p95 LLM latency: 5.8s

Then incidents start appearing:

```
ERROR llm-proxy: upstream timeout
provider=openai
model=gpt-4.2
duration=29.998s
status=504
trace_id=7b92c1...
```

The metrics dashboard shows:

- Spike in 504s
- p95 latency jumps from 6s to 30s
- CPU utilization normal (55%)
- Memory normal (62%)
- No obvious pod crashes
- No provider status alerts

Engineers immediately suspect provider throttling, because the errors look like this:

```
HTTP/1.1 504 Gateway Timeout
x-request-id: req_92af3
```

And occasionally:

```
HTTP/1.1 429 Too Many Requests
retry-after: 2
```

The team adds more retries. They raise the client timeout from 30s to 60s. They add exponential backoff.

The problem gets worse.
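
A rough worst-case model shows why those changes backfire. The sketch below is hypothetical and is not the team's code. It uses the retry settings the team ends up with: five attempts, a 500 ms exponential backoff base, and a 60 s timeout. It treats `attempts` as total tries; if the policy actually means five retries on top of the first call, the numbers only grow. It also assumes every attempt runs until the timeout.

```javascript
// Hypothetical retry model: `attempts` total tries, each allowed to run until
// `timeoutMs`, with exponential backoff (baseMs, 2*baseMs, 4*baseMs, ...) between tries.
function backoffDelaysMs(attempts, baseMs) {
  // One delay before each retry, so attempts - 1 delays in total.
  return Array.from({ length: Math.max(0, attempts - 1) }, (_, i) => baseMs * 2 ** i);
}

// Worst case: every attempt times out.
function worstCase({ attempts, timeoutMs, baseMs }) {
  const backoffMs = backoffDelaysMs(attempts, baseMs).reduce((sum, d) => sum + d, 0);
  return {
    // Time spent inside the proxy, either queued or holding a socket.
    inProxyMs: attempts * timeoutMs,
    // Time the caller waits, including the sleeps between retries.
    wallClockMs: attempts * timeoutMs + backoffMs,
  };
}

// Illustrative comparison: a single 30s attempt vs. five 60s attempts.
const baseline = worstCase({ attempts: 1, timeoutMs: 30_000, baseMs: 500 });
const tuned = worstCase({ attempts: 5, timeoutMs: 60_000, baseMs: 500 });

console.log(baseline.inProxyMs, tuned.inProxyMs, tuned.wallClockMs); // 30000 300000 307500
console.log(`${tuned.inProxyMs / baseline.inProxyMs}x more proxy time per failing request`);
```

Under these assumptions, one failing request can occupy the proxy ten times longer than before. Once the pool is saturated, that is ten times more queue or socket time that other requests have to wait behind.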

2. The Technical Root Cause

The issue was not upstream throttling. It was connection pool exhaustion inside the internal LLM proxy.

The proxy service used:

- Node.js 22
- undici HTTP client
- Keep-alive enabled
- Max connections per origin: 10

Configuration looked like this:

```js
import { Pool } from "undici";

// One pool for the provider origin, capped at 10 sockets.
const client = new Pool("https://api.provider.com", {
  connections: 10,          // at most 10 concurrent upstream requests
  pipelining: 1,            // one in-flight request per socket
  keepAliveTimeout: 60_000, // idle keep-alive timeout in ms
});
```

Under peak load:

- 300+ concurrent LLM calls
- Each request held a connection for ~3-6 seconds
- Pool size fixed at 10

Effect:

- Requests queued internally
- No visible CPU spike
- No container memory pressure
- No autoscaler trigger
- No immediate provider rejection

Queue time inside the proxy grew silently.

Actual breakdown (from tracing):

| Stage | Duration |
|---|---|
| API Gateway -> proxy | 8 ms |
| Proxy internal queue wait | 21,700 ms |
| Provider processing | 2,900 ms |
| Response serialization | 40 ms |

The 30-second timeout wasn't caused by provider latency. The request spent 21.7 seconds waiting for a free outbound socket and 2.9 seconds in actual model inference, about 24.6 seconds in total. Network jitter on top of that pushed it past the timeout.

Why did 429s appear occasionally?

Once a backlog built up, bursts of retries amplified outbound concurrency spikes and briefly exceeded the provider's rate limits.

But throttling wasn't the origin of the problem. It was a saturated client pool.

3. The Cognitive Bias Involved

This failure pattern is driven by availability bias and anchoring.

Availability Bias

AI providers sometimes throttle. Developers have experienced it before. So when timeouts occur, that explanation is cognitively "available."

The brain prefers familiar causes.

Anchoring

The first log entry seen was:

```
HTTP 429 Too Many Requests
```

That becomes the anchor.

Subsequent 504s are interpreted through that lens:

"We're being throttled harder now."

The team never re-evaluates the base assumption.

Instead of asking:

"Where exactly is latency accumulating?"

They assume:

"The provider is slow or limiting us."

Even experienced engineers fall into this trap because the hypothesis is plausible.

4. The Missed Signal

The signal was in distributed tracing, but it was subtle.

OpenTelemetry traces showed:

```
span: llm-proxy.request
duration: 29998ms
attributes:
  http.client.duration: 2897ms
  http.queue.duration: 21987ms
```

That second attribute, `http.queue.duration`, was the clue.

But the team had no alerting tied to client-side queuing.
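
An alert tied to client-side queuing could start from the span data above. The sketch below is hypothetical. It reads the attribute names shown in that trace from a plain object, treating them as custom attributes in milliseconds, and flags spans where local queue wait makes up most of the request. The 50% threshold is an assumption to tune for your workload.

```javascript
// Hypothetical alert rule over exported span data. The attribute names are the
// ones shown in the trace above, treated here as custom attributes in milliseconds.
function queueAlert(span, { maxQueueShare = 0.5 } = {}) {
  const attrs = span.attributes ?? {};
  const queueMs = attrs["http.queue.duration"] ?? 0;
  const upstreamMs = attrs["http.client.duration"] ?? 0;
  const queueShare = span.durationMs > 0 ? queueMs / span.durationMs : 0;
  // Alert when local waiting, not the upstream call, dominates the request.
  return { queueMs, upstreamMs, queueShare, alert: queueShare > maxQueueShare };
}

// The span from the incident, expressed as a plain object.
const incidentSpan = {
  name: "llm-proxy.request",
  durationMs: 29998,
  attributes: { "http.client.duration": 2897, "http.queue.duration": 21987 },
};

const verdict = queueAlert(incidentSpan);
console.log(verdict.alert, verdict.queueShare.toFixed(2)); // true 0.73
```

A rule like this fires on the incident span even though CPU, memory and provider latency all look healthy.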

They monitored:

- CPU
- Memory
- Pod count
- Provider latency

They did not monitor:

- Outbound connection saturation
- Event loop lag
- Per-service internal queue depth
- HTTP client pool metrics

Another missed signal:

```
$ netstat -an | grep ESTABLISHED | wc -l
10
```

Exactly 10 persistent outbound connections, under a load of hundreds of concurrent calls.

Yet dashboards showed normal infrastructure health. The failure was invisible to standard system metrics.
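
Little's law (average in-flight requests = arrival rate × average time in system) turns that mismatch into numbers. Below is a minimal sketch using the scenario's traffic figures. The helper names are hypothetical, and the latency input is provider latency rather than total time in the proxy.

```javascript
// Little's law: average in-flight requests = arrival rate × average time in system.
function requiredConcurrency({ totalRps, llmShare, avgLatencySec }) {
  return totalRps * llmShare * avgLatencySec;
}

// Smallest pod count whose combined connection caps cover that concurrency.
function podsNeeded(concurrency, connectionsPerPod) {
  return Math.ceil(concurrency / connectionsPerPod);
}

// Scenario numbers: ~1,200 RPS peak, 15-25% of requests call the LLM, 2.4s average latency.
const low = requiredConcurrency({ totalRps: 1200, llmShare: 0.15, avgLatencySec: 2.4 });
const high = requiredConcurrency({ totalRps: 1200, llmShare: 0.25, avgLatencySec: 2.4 });

console.log(Math.round(low), Math.round(high)); // 432 720
console.log(podsNeeded(low, 10), podsNeeded(high, 10)); // 44 72
```

Because 2.4 s excludes queue wait, these figures are a floor. Once requests start queuing, time in system grows, and so does the concurrency you need. For comparison, the 40 pods the team later scaled to, each capped at 10 connections, top out at 400 concurrent upstream calls. That is below even the low estimate.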

5. The Cost of the Wrong Assumption

The misdiagnosis caused cascading side effects:

1. Retry Amplification

Retry logic multiplied concurrency:

```yaml
retry:
  attempts: 5
  backoff: exponential
  base: 500ms
```

Each timeout triggered up to 5 new requests, which increased internal queue pressure.

2. Increased Client Timeout

Raising the timeout from 30s to 60s:

- Doubled the maximum socket hold time
- Increased connection pool contention
- Reduced throughput further

3. Autoscaling Noise

HPA scaled based on CPU, but CPU wasn't the bottleneck.

Scaling to 40 pods did little, because each pod still had a 10-connection cap.

4. Incident Duration

Identifying the root cause took 11 hours.

The fix took 10 minutes:

```js
connections: 200, // undici Pool option, raised from 10
```

Plus backpressure control.

The real cost wasn't downtime. It was engineering attention misdirected by a faulty mental model.

6. A Repeatable Debugging Framework

Here's a structured way to avoid phantom throttling diagnoses.

Step 1: Decompose Latency by Stage

Break request time into:

- Edge latency
- Ingress
- App processing
- Outbound queue
- Upstream processing
- Response handling

If you can't see outbound queue time, instrument it.
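
One way to instrument outbound queue time is to put an explicit concurrency gate in front of the HTTP client and time the wait separately from the call. The class below is a hypothetical, dependency-free sketch, not an undici or OpenTelemetry API. In a real service you would record `queueMs` and `upstreamMs` as span attributes or histogram metrics.

```javascript
// Hypothetical concurrency gate that separates time spent waiting for a slot
// (outbound queue) from time spent in the upstream call itself.
class InstrumentedLimiter {
  constructor(maxConcurrent) {
    this.maxConcurrent = maxConcurrent;
    this.active = 0;   // slots currently in use
    this.waiters = []; // resolve callbacks for callers waiting on a slot, FIFO
  }

  async run(task) {
    const enqueuedAt = performance.now();
    if (this.active < this.maxConcurrent) {
      this.active++;
    } else {
      // release() hands its slot straight to us, so `active` is not touched here.
      await new Promise((resolve) => this.waiters.push(resolve));
    }
    const startedAt = performance.now();
    try {
      const result = await task();
      return {
        result,
        queueMs: startedAt - enqueuedAt,              // waiting for a free slot
        upstreamMs: performance.now() - startedAt,    // time inside the call
      };
    } finally {
      this.release();
    }
  }

  release() {
    const next = this.waiters.shift();
    if (next) next();   // transfer the slot to the oldest waiter
    else this.active--; // nobody waiting: free the slot
  }
}
```

For example, with `new InstrumentedLimiter(2)` and six simulated 50 ms calls, the first two report near-zero `queueMs` and the last two wait roughly 100 ms. Every call still shows about 50 ms of `upstreamMs`. That growing queue time with flat upstream time is the signature of phantom throttling.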

Step 2: Measure Concurrency vs. Pool Capacity

For every outbound dependency, record:

- Max concurrent requests
- Max connections configured
- Average request duration

Compute theoretical saturation (Little's law, rearranged):

```
Throughput = connections / avg_duration
```

If:

- connections = 10
- avg_duration = 3s

Max throughput ≈ 3.3 RPS per pod.

That's not enough for most AI workloads.

Step 3: Check for Internal Queue Growth

Add metrics:

```js
client.stats // undici Pool: a getter (not a method) returning a PoolStats object
```

Or use the equivalent in your HTTP client.
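
A periodic saturation check on those numbers might look like the sketch below. It assumes undici's `Pool` exposes a `stats` getter with `connected`, `free`, `pending`, `queued` and `running` counters; check the exact semantics for your undici version. `maxConnections`, `maxWaiting` and the helper name are hypothetical. Because the function only reads `pool.stats`, a plain object stands in for a real pool in the example.

```javascript
// Hypothetical saturation check over an undici-style `pool.stats` object.
// Assumed fields: connected, free, pending, queued, running (verify for your version).
// `maxConnections` should match the `connections` option the pool was created with.
function poolHealth(pool, maxConnections, { maxWaiting = 0 } = {}) {
  const { connected, free, pending, queued, running } = pool.stats;
  const busy = connected - free;    // connections currently serving a request
  const waiting = pending + queued; // assumed: requests not yet running on a socket
  return {
    running,
    waiting,
    utilization: busy / maxConnections,
    // Saturated: every allowed connection is busy and work is stacking up behind them.
    saturated: connected >= maxConnections && free === 0 && waiting > maxWaiting,
  };
}

// A plain object standing in for the incident's pool: 10 sockets busy, ~290 requests waiting.
const fakePool = { stats: { connected: 10, free: 0, pending: 0, queued: 290, running: 10 } };
console.log(poolHealth(fakePool, 10).saturated); // true
```

Sampled every few seconds and exported as gauges, `waiting` and `utilization` give you the queue-depth alert that error-rate dashboards miss.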

Monitor:

- Pending requests
- Active sockets
- Idle sockets

Alert on queue depth, not just errors.

Step 4: Disable Retries During Diagnosis

Retries distort signals. Temporarily reduce the policy to a single attempt:

```yaml
retry:
  attempts: 1
```

This prevents artificial amplification.

Step 5: Correlate Provider Latency Independently

Use provider-side metrics if available.

If provider p95 remains stable at 3s while your system reports 30s, the bottleneck is local.
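
The comparison in Step 5 can be scripted. Below is a minimal sketch, assuming you can export two sets of durations for the same time window: end-to-end durations from your own traces and provider-reported processing times. It uses a nearest-rank percentile. Subtracting two p95 values is only a rough attribution, because the slowest local requests need not be the ones with the slowest provider times. The sample data is illustrative, not from the incident.

```javascript
// Nearest-rank percentile of durations in ms; p in (0, 100].
function percentile(values, p) {
  if (values.length === 0) throw new Error("percentile of empty sample");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.min(sorted.length, Math.max(1, rank)) - 1];
}

// Rough split of tail latency between the provider and your own stack.
// Subtracting percentiles is a heuristic, not an exact per-request attribution.
function attributeTail(localMs, providerMs, p = 95) {
  const local = percentile(localMs, p);
  const provider = percentile(providerMs, p);
  const localGap = Math.max(0, local - provider);
  return { local, provider, localGap, localShare: local > 0 ? localGap / local : 0 };
}

// Illustrative samples in the spirit of Step 5: provider p95 ~3s, local p95 ~30s.
const providerSamples = Array.from({ length: 20 }, (_, i) => 2000 + i * 50); // 2.00-2.95s
const localSamples = Array.from({ length: 20 }, (_, i) => (i < 15 ? 3000 : 30000));

const split = attributeTail(localSamples, providerSamples);
console.log(split.local, split.provider, split.localGap); // 30000 2900 27100
```

If `localShare` stays high while provider p95 is flat, the time is being spent on your side of the wire.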

Step 6: Validate with a Load Test

Reproduce the problem under controlled load:

- Fixed RPS
- Measure queue growth
- Observe the saturation point

This turns speculation into measurement.

7. Architecture Lessons

This class of failure reveals structural weaknesses in many AI stacks.

1. LLM Calls Are Long-Lived

Compared to typical REST calls, LLM calls:

- Are slower (seconds, not milliseconds)
- Hold connections longer
- Increase the risk of pool starvation

Default HTTP client configs are rarely sufficient.

2. Connection Pools Are Infrastructure

Treat them as capacity-managed resources.

In Kubernetes, scale math must include:

```
Total outbound capacity = pods × connections per pod
```

If each pod supports 10 connections and you have 20 pods:

Max concurrent upstream calls = 200.

That's your ceiling.

3. Observability Needs Queue Awareness

Monitor:

- Internal queue wait time
- Connection pool utilization %
- Event loop delay (Node)
- Async task backlog

Without these, you're blind to phantom throttling.

4. Backpressure Beats Retries

Instead of aggressive retries:

- Apply circuit breakers
- Enforce bounded concurrency
- Use token buckets per pod

Example pattern:

```js
import pLimit from "p-limit";

// Cap in-flight LLM calls per pod; callLLM() stands in for your provider call.
const limit = pLimit(100);
const result = await limit(() => callLLM());
```

This prevents runaway saturation.

5. Cognitive Guardrails Matter

Institutionalize bias checks during incidents:

- What evidence supports our assumption?
- What would disprove it?
- Where is latency actually accumulating?
- Are we measuring queue time or guessing?

This is engineering hygiene.

Final Reflection

AI infrastructure failures in 2026 are rarely dramatic outages. They are subtle resource mismatches amplified by retries and scaling policies.

When timeouts appear, "provider throttling" is an easy narrative. It feels plausible.
It matches prior experience. It explains the symptoms.

But if you don't isolate queue time, connection limits, and concurrency math, you're debugging a story, not a system.

Phantom throttling isn't a vendor problem. It's a visibility problem shaped by cognitive bias.

The fix is usually small. Recognizing your own mental shortcut is the hard part.