Tail Control: The Counterintuitive Engineering of Reliable Agentic Workflows | Towards Data Science

Run a workflow inside your own company and almost any failure is cheap: you retry, fall back, or potentially even ignore it. Put that same workflow behind a customer’s API or MCP server and the grace is gone. Now only one thing matters: did the customer get a correct, usable result? Their process depends on yours delivering one. They, not you, now decide what counts as delivered. At Databook we process billions of tokens for the world’s largest enterprises; this article is based on real data from production flows at scale. I hope it offers you some useful insights.

Delivering that result is harder than it looks, because LLMs are notoriously unreliable. They fail frequently, in four flavors: an invalid answer (empty, unparseable, or simply wrong), a hard error, no answer at all, or no answer in time. And the whole run only succeeds if every step does, so the more you chain together, the more chances there are for one of them to fail. A workflow of individually excellent steps can still come out a coin flip.

FIGURE 1 – The four ways an LLM call fails. Three are loud (an invalid answer, a hard error, no answer at all), and you see and handle each. The fourth is quiet: a correct answer that simply arrives too late, which looks like success on your side and like failure on the customer’s.

Inside your own company you can absorb every one of these, because you have slack on every axis: retry the failed step, wait out the slow one, spend a little more, relax the bar if you must. Put the same workflow behind a customer’s API and the slack vanishes, because the run now has to clear three resource budgets at the same time, none of which you set:

Time — a window that closes whether or not you’re done: a hard gateway timeout (one to three minutes, sometimes five) that severs the connection mid-run, or something softer: an SLA, a caller blocked on the result, a process that can only wait so long. And it doesn’t resume: when the window closes, the customer just retries, starting the whole run over from zero.
Cost — now a margin, not a pool. Every run carries a price the customer already paid, so it has to come back profitable, not merely affordable. And the customer, not you, decides how often it runs.
Tokens and rate — a per-minute token budget (TPM) you share across every customer at once, and they tend to call in the same bursts. You hit the ceiling exactly when load is heaviest, which is exactly when latency is worst.

Under all three sits a hard floor you never trade beneath: quality. The answer has to be right to count at all. A fast, cheap, on-time answer that’s wrong is still a failure. Quality isn’t a budget you spend down.

FIGURE 2 – The three resource budgets a customer-facing run spends simultaneously — time, cost, and token/rate — resting on a fixed quality floor. Each budget is imposed from outside; the floor is the one line no trade may cross.

Any one of these you could manage on its own. The bind is that they apply together and pull against each other, so the obvious fix for one spends another. Wait out a slow step and you blow the time window. Race a second copy to beat the clock and you burn cost and quota. Reach for a stronger model to clear the quality floor and you get slower. None of the budgets are yours to loosen, so the only move left is to trade deliberately across all of them at once — without ever dropping below the floor.

That is what makes a customer-facing workflow a genuinely different thing to build, and it sometimes forces a playbook that, from the inside, looks totally backwards:

Kill a call that hasn’t failed
Fire a duplicate of a call you’re already paying for
Drop to a weaker model on purpose

Within your own walls you’d never bother. You’d just let the slow step finish. And the budget that punishes you most quietly is time: miss it and nothing looks broken on your side. A perfect answer that lands a few seconds late still reads as a success on your dashboards and as a failure to the customer, and it’s the one limit nothing in the stack enforces for you.

Here’s the thesis, up front, because everything else serves it: once quality clears the bar, reliable delivery is a question of variance, not speed. A predictable completion time beats a fast one with a long tail, because your customers can’t run their infrastructure on your best case; they have to build for your worst.

What this is — and isn’t: workflows, not free reasoning agents

One distinction up front, because it changes everything. This is about an agentic workflow: a known process flow with LLM-powered steps inside it, run by a deterministic orchestrator. It is not a reasoning agent that decides its own next move at runtime. For the same task, a workflow is simply faster: it already knows the plan, skips the deliberation, and runs every independent step in parallel, so it reaches the same answer in a fraction of the time and cost a reasoning agent would take. Both have their place (reasoning agents are much more flexible), but they fail differently and you fix them differently. A reasoning agent’s problem is deciding what to do; a workflow’s problem (the one customers feel) is delivering what it already knows how to do, with quality, and in time. This article is about the latter.

How our system is built

The findings below come from our architecture, and they should generalize. These are ordinary, direct API calls. Still, it helps to know the setup so you can compare it to yours.

We run a custom orchestrator over managed third-party APIs (no self-hosted models in this dataset), and we run flagship models both directly through their providers (OpenAI, Anthropic, …) and through managed platforms (Bedrock, Databricks, …), so each top model is reachable through more than one provider. That lets us compare serving paths and move work between them.

Our workloads are a mix: simple agent calls, deep reasoning, extractions, JSON and free-text outputs. A large fraction of calls synthesize a large fact base into an answer, so inputs tend to be large and outputs small to medium. The analytics in this article hold input and output size constant within buckets (see appendix).

The slow tails we encounter are largely transient. If your architecture is self-hosted or runs on dedicated capacity, the tail may behave differently and may warrant a different approach. Running several providers is also what makes routing a hedge to a separate budget practical. With a single provider, fewer of these moves are available.

The claim, and the receipts

So here’s the move that sounds backwards: we cut a step off at 20-30 seconds even when we know it might have answered perfectly a little later — and that makes the system more reliable, not less.

That isn’t a hunch. It’s true on paper — the math of heavy-tailed retries is unambiguous — and it’s true in the data: a scan of well over a million recent production LLM calls across our enterprise workloads — real customer traffic. The first thing that traffic tells you is how strange a single call’s timing really is. A typical longer-output call comes back in about a dozen seconds. But one in a hundred takes thirty seconds, sometimes a full minute or more — for no reason connected to how much work it was doing.

FIGURE 3 – Answer-time distribution for longer calls (output ≥ 600 tokens), one curve per model and serving path. Typical times sit in a tight band; the tails do not. Real production data (1M+ calls, top-100 enterprise workloads, anonymized); 1s bins, capped at 90s. Model names are withheld on purpose. This is not a leaderboard, and not a fair head-to-head: different models run different workloads in our system, so the calls behind each curve aren’t the same task, and the chart says nothing about which model is “faster.” What it does show: every model has a meaningful tail (note Model C: the quickest typical time, yet a long tail), and the serving path matters as much as the model. Model F via a managed API vs. direct is one model with two different tails. Model A shows free-form answer calls only; a separate, tightly-bounded structured-prefill workload on that same model is held out (see the data note) so it doesn’t split the curve into two artificial peaks.

That gap between the typical call and the slow one underlies much of this article. The rest of the article reviews what to do about it.

Why the clock is unforgiving

A workflow isn’t judged on its average. It’s judged against a deadline. On average our flows finish comfortably; the outlier runs in the long tail don’t. Those tail runs aren’t broken. They’d return a perfect answer a bit later, and on an internal run they would count as successes. On the customer’s side, every one of them is a failure. The entire tail of your latency distribution, however correct, becomes an addition to your failure rate.

That’s why the number that matters here isn’t average latency, it’s variance. A fast median buys you nothing if your tail is long.

The second squeeze is sunk cost. The deeper you are into a workflow, the more you’ve already spent: time, dollars, and your TPM quota. A failure on step nine is far more expensive than the same failure on step two. You throw away everything the workflow built and you have less of the clock left to shift gears. We never restart the whole workflow ourselves, but the customer will. If we fail, they will almost certainly retry, starting the full flow again from the beginning. That compounds the problem on our side. It burns more cost, more token budget, and the error budget on the SLA. And because the conditions that made the run fail usually haven’t changed, the retry has a similar chance of failing. Worse, it tends to happen during a high-TPM window: the worst possible time to pile extra load onto an already-strained system, and exactly when the odds of failing again are highest.

There’s a second multiplier, and it’s easy to miss. The first is the one from the opening: reliability compounds, so a chain of individually excellent steps can still come out a coin flip¹. But that failure is always told as a story about correctness: getting a wrong answer.

Here’s what you almost never hear about: the exact same compounding happens on the clock. Every step adds its own small chance of landing in the slow tail, and those chances stack. So the more steps you chain, the more likely it is that at least one of them blows the deadline, even when every step is individually fast. That’s the multiplier this article is about, and it’s the one the literature leaves out. So let’s look at the numbers.
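The stacking is easy to check numerically. The following is a minimal sketch, assuming the steps are independent and each one lands in its slow tail with the same probability; both are simplifications, and the function name and example rates are illustrative rather than production figures.

```python
def p_any_step_slow(n_steps: int, p_slow: float) -> float:
    """Probability that at least one of n independent steps lands in its slow tail."""
    return 1.0 - (1.0 - p_slow) ** n_steps


# Compare a 1-in-100 and a 1-in-20 per-step slow tail across chain lengths.
for n in (1, 5, 10, 20):
    print(n, round(p_any_step_slow(n, 0.01), 3), round(p_any_step_slow(n, 0.05), 3))
```

Under these assumptions, a 1-in-100 slow tail per step means a ten-step chain hits at least one slow step in roughly 10% of runs, and a twenty-step chain in roughly 18%.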

What an LLM answer time actually looks like

The typical times in the chart above sit in a fairly tight band: every model finishes a typical call somewhere between eight and twenty seconds. The tails are not tight at all. One model’s 99th-percentile call comes in around 30 seconds, another’s past 80. Similar median, wildly different worst case. Promise a customer your median and you’re lying to the 1-in-20 and 1-in-100 calls in the tail, and a multi-step workflow hits those constantly. A fast typical time is not a predictable one.

The obvious objection is that the slow calls are just doing more work: bigger prompts, longer answers. They aren’t. Pin both the prompt size and the response length and the tail barely moves: within a single size bucket (work held fixed), p99 still runs two to seven times the median (Figure 4). The slowness isn’t about how much the call has to do — in our traffic it’s largely transient (queueing, scheduling, mid-stream contention, a provider hiccup), which is exactly what makes it worth interrupting.

FIGURE 4 – “The tail isn’t the workload.” Each row fixes both prompt size and response size; the median climbs as the work grows, but inside every row the p50→p99 gap stays 3.8-6.7×. A dumbbell plot, deliberately not a distribution curve: same-size calls, wildly different finish times.

One slow step sinks the whole run

You’d think a workflow misses its deadline because many steps were each a little slow. It almost never happens that way. When a chain blows its budget, it’s usually one step that wandered into its tail while everything else behaved fine. Mathematically, a chain’s overrun is dominated by its single worst step, not by the accumulation of mildly slow ones. The total behaves like its maximum, not its sum.²

That’s good news. You don’t need every step fast. You need to stop any single step from running away. Which is the cutoff.
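A small simulation can make this concrete. The sketch below does not use the production data: it assumes independent lognormal step times with a 10-second median and a spread chosen so that p99 is roughly six times the median, sets the deadline where 1% of simulated runs miss it, and then asks how many late runs would have been on time if only their slowest step had taken a typical time. All parameters and names are illustrative assumptions.

```python
import math
import random


def simulate_late_runs(n_runs=20000, n_steps=10, median_s=10.0, sigma=0.8, seed=7):
    """Simulate chains of independent lognormal steps and inspect the late ones."""
    rng = random.Random(seed)
    mu = math.log(median_s)  # a lognormal's median is exp(mu)
    runs = [[rng.lognormvariate(mu, sigma) for _ in range(n_steps)]
            for _ in range(n_runs)]
    totals = sorted(sum(r) for r in runs)
    deadline = totals[int(0.99 * n_runs)]  # deadline that about 1% of runs miss
    late = [r for r in runs if sum(r) > deadline]
    # A late run counts as rescued if swapping only its slowest step for a
    # typical (median) one brings the total back under the deadline.
    rescued = sum(1 for r in late if sum(r) - max(r) + median_s <= deadline)
    return deadline, len(late), rescued / len(late)


deadline, n_late, rescued_share = simulate_late_runs()
print(f"deadline={deadline:.0f}s late_runs={n_late} "
      f"rescued_by_fixing_one_step={rescued_share:.0%}")
```

The interesting number is the rescued share: if it is high, controlling one runaway step is enough, which is the case for a per-step cutoff rather than for making every step faster.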

Sidebar — The math, briefly (skip unless you like math)

Three results sit underneath the argument:

Compounding. Just the arithmetic of independent steps: n steps each succeeding with probability p gives pⁿ end-to-end. At p = 0.95, ten steps ≈ 60% and twenty ≈ 36% — multiplication, no modeling. The same compounding hits the clock: each added step is another independent draw against the latency tail (the 2-7× p99/p50 we measure per call), so the odds that at least one step blows its budget only rise with length. Independence is the simplification — shared capacity correlates real steps — but it’s the conservative, illustrative case.

The single big jump. LLM latency is heavy-tailed (lognormal-ish), and the lognormal is subexponential. For independent subexponential steps the tail of the sum is just the sum of the tails — `P(ΣX_i > t) ≈ Σ P(X_i > t) ≈ P(maxᵢ X_i > t)` as t grows. In words: a chain overruns because one step hit its tail, not because many were mildly slow.²
Hedging, and why it works for any failure. Fire n independent attempts and take the first good one: if a single attempt fails with probability q, all n fail with probability qⁿ. That arithmetic doesn’t care what “fail” means — a blown deadline, a hard error, or a wrong answer all buy down the same way, which is why the same retry/race/fallback move serves every flavor. For the timing flavor specifically it also shrinks spread: since the variances of independent steps add, `Var(ΣX_i) = Σ Var(X_i)`, capping each step’s tail shrinks the whole chain’s. All of it rests on the attempts being independent (fresh draws, fresh queue) — which is exactly why a parallel re-draw collapses a transient tail (or an unlucky bad answer) and does nothing for a deterministic one.³
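The compounding and hedging arithmetic above fits in a few lines. This is a sketch under the same independence assumption as the sidebar, and it further assumes you can tell a good attempt from a bad one (a deadline, an error, or a validator); the function names are illustrative.

```python
def chain_success(p_step: float, n_steps: int) -> float:
    """End-to-end success of n independent steps, each succeeding with p_step."""
    return p_step ** n_steps


def all_attempts_fail(q: float, n_attempts: int) -> float:
    """Probability that every one of n independent attempts fails."""
    return q ** n_attempts


def hedged_chain_success(p_step: float, n_steps: int, n_attempts: int) -> float:
    """Chain success when each step gets n independent attempts."""
    p_hedged_step = 1.0 - all_attempts_fail(1.0 - p_step, n_attempts)
    return chain_success(p_hedged_step, n_steps)


print(round(chain_success(0.95, 10), 3))            # ten plain steps
print(round(chain_success(0.95, 20), 3))            # twenty plain steps
print(round(hedged_chain_success(0.95, 10, 2), 3))  # ten steps, two attempts each
```

With two independent attempts per step, a 95% step becomes a 99.75% step, and the ten-step chain moves from about 60% to about 97.5%. Correlated failures (shared capacity, a deterministic bad prompt) would shrink that gain.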

The move: cut early, then race

If a step has wandered into its tail, waiting is the worst thing you can do — you’re spending your scarcest resource on your least likely payoff. So you give up early and try again in parallel: fire a fresh attempt and take whichever returns first. A fresh attempt rarely lands in the same pothole, so two of them fit inside the time one stuck call would have eaten — and the odds of both being slow are tiny (if one is slow with probability q, two are both slow with probability q²).³

FIGURE 5 – The same longer step, waited out versus raced. Each dot is one production run of that step (top-100 enterprise traffic, anonymized); red marks the slow tail. Racing a second attempt and taking the first to return collapses the spread (std 6s → 3s, p99 roughly halved) for the price of extra tokens — the body barely moves, so you get the same typical speed with far less variance. A sequential re-draw on total time wouldn’t help here: you’d pay the generation floor twice.

The median barely moves: about 10 seconds instead of 12. The tail does the opposite: the 99th percentile drops from roughly 60 seconds to 25, and the run-to-run spread is more than cut in half. You buy predictability for the price of some extra tokens.
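The cut-then-race move can be sketched with `asyncio`. This is an illustration and not our orchestrator: `hedged_call`, `make_attempt`, and the fake model call are hypothetical names, and the cutoff would come from your own measured latency curve. Note that cancelling the loser only frees the local task; as discussed later, the provider may keep generating and billing.

```python
import asyncio
import itertools


async def hedged_call(make_attempt, cutoff_s):
    """Run one attempt; if it is slow or fails, race a fresh one and take the first good result."""
    first = asyncio.create_task(make_attempt())
    done, pending = await asyncio.wait({first}, timeout=cutoff_s)
    if first in done and first.exception() is None:
        return first.result()  # answered inside the cutoff
    last_error = first.exception() if first in done else None
    # Slow (or failed): fire a fresh attempt and keep the original running.
    pending = set(pending) | {asyncio.create_task(make_attempt())}
    while pending:
        done, pending = await asyncio.wait(pending, return_when=asyncio.FIRST_COMPLETED)
        for task in done:
            if task.exception() is None:
                for loser in pending:
                    loser.cancel()  # frees our side only
                return task.result()
            last_error = task.exception()
    raise last_error


async def demo():
    delays = itertools.chain([0.5], itertools.repeat(0.01))  # first attempt stalls
    counter = itertools.count(1)

    async def fake_llm_call():
        n, delay = next(counter), next(delays)
        await asyncio.sleep(delay)
        return f"answer from attempt {n}"

    return await hedged_call(fake_llm_call, cutoff_s=0.05)


print(asyncio.run(demo()))
```

In the demo the first attempt stalls past the cutoff, so the second attempt wins the race. A real version would send the second attempt to a different provider or model where possible.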

That price is real, and it pushes back. Racing doubles the token bill on that step, and tokens are a shared, capped budget. So cost is a genuine downward force on how freely you retry and race. But run the arithmetic and it’s lopsided. Doubling one step costs you that step’s tokens, once. Blowing the deadline throws away everything you’ve already paid for, and the customer almost always retries, re-running all N steps of the workflow, at least once, sometimes more. The deeper into the flow you are, the more one-sided the trade: a redundant attempt on step nine is cheap next to discarding steps one through nine and watching them run again. So you hedge anyway. You just don’t hedge indiscriminately, because that shared token budget bites back hardest exactly when you most want to spend it (more on that pressure shortly).

One nuance that decides which fallback to reach for: the direction has to match why the step is failing.

Slow for transient reasons → re-draw, ideally in parallel. A fresh attempt escapes the stall. (A plain serial retry is weaker here on a longer step — you’d pay the long generation time twice.)
Slow because the work is genuinely big → don’t re-run the same call. Fall down to a faster model, or to an alternate path that reaches the same result more cheaply.
Wrong, not slow → fall up to a more capable model. Speed won’t fix a bad answer; capability might. (This is the quality floor from earlier, enforced at runtime.)

Cut on the right signal

An answer time is really two phases.⁴ The wait for the first token is mostly queueing and scheduling; the generation that follows, token by token, is the rest. Which phase carries the tail decides what you put the cutoff on. And that depends on how much the step writes.

For the longer steps this article is about (the ones that press against a deadline), the tail lives in generation, not the first-token wait. A slow queue is a small slice of a forty-second call; the spread that blows the budget is in the tokens. So cut these on total elapsed time, or on tokens emitted so far against the time you have left, not on time-to-first-token. (For short steps the balance flips: with little to generate, the first-token wait is most of the call, and time-to-first-token becomes the cleaner cut. Measure your own steps to see which side you’re on.)

Two signals are worth wiring in regardless:

No first token at all, past the cutoff? That’s stuck, not slow. Give up and hedge. A fresh parallel attempt gets newly scheduled and almost always wins.
Tokens flowing but it’ll blow the budget? Don’t re-run it. You’d just regenerate the same length at the same speed. Fall to a faster model.
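The two signals above can be expressed as one decision function over the state of a streaming call. This is a rough sketch: the field names, the expected output length, and the simple tokens-per-second projection are assumptions for illustration, and a real projection would use per-step measurements.

```python
from dataclasses import dataclass


@dataclass
class StreamState:
    elapsed_s: float             # time since the call was sent
    tokens_emitted: int          # output tokens received so far
    expected_tokens: int         # assumed typical output length for this step
    first_token_cutoff_s: float  # how long we wait for a first token
    time_left_s: float           # what remains of this step's time budget


def next_move(s: StreamState) -> str:
    """Return 'wait', 'hedge' (fresh parallel attempt), or 'fall_down' (faster model)."""
    if s.tokens_emitted == 0:
        # No first token yet: past the cutoff this is stuck, not slow.
        return "hedge" if s.elapsed_s > s.first_token_cutoff_s else "wait"
    if s.elapsed_s <= 0:
        return "wait"
    rate = s.tokens_emitted / s.elapsed_s  # crude: includes the first-token wait
    remaining = max(s.expected_tokens - s.tokens_emitted, 0)
    projected_s = remaining / rate
    # Tokens are flowing; re-running would regenerate at the same speed.
    return "fall_down" if projected_s > s.time_left_s else "wait"


print(next_move(StreamState(25, 0, 800, 20, 10)))    # stuck: no first token
print(next_move(StreamState(20, 200, 800, 20, 15)))  # flowing but too slow
print(next_move(StreamState(10, 600, 800, 20, 20)))  # on track
```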

And one failure no clock can catch: a step that returns on time but returns junk (e.g. it’s empty, truncated, or unparseable). A latency cutoff sails right past it; only a quality check downstream will. For any step that’s supposed to return a specific shape, the cheapest such check is a strict validation right after the call. Parse the result against the expected schema or object, and treat a validation failure exactly like any other: cut and fall back (re-draw, or fall up to a more capable model). It catches a meaningful slice of bad answers before they reach the next step. Cutting early buys you predictability, not correctness. Keep those two jobs separate.
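A strict check right after the call can be as small as the sketch below. It uses only the standard library and a hand-rolled field check; the schema, the field names, and the `first_valid` helper are illustrative assumptions, and in practice a schema library would do the same job.

```python
import json


class StepOutputError(ValueError):
    """Raised when a step's output is empty, unparseable, or the wrong shape."""


def parse_step_output(raw, required: dict) -> dict:
    """Parse raw model output as a JSON object and check required fields and types."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError) as exc:
        raise StepOutputError(f"unparseable output: {exc}") from exc
    if not isinstance(data, dict):
        raise StepOutputError("expected a JSON object")
    for key, expected_type in required.items():
        if key not in data:
            raise StepOutputError(f"missing field: {key}")
        if not isinstance(data[key], expected_type):
            raise StepOutputError(f"wrong type for field: {key}")
    return data


def first_valid(attempts, required: dict) -> dict:
    """Try attempts in order (e.g. a re-draw, then a more capable model)."""
    last_error = None
    for attempt in attempts:
        try:
            return parse_step_output(attempt(), required)
        except StepOutputError as exc:
            last_error = exc  # treat like any other failure: cut and fall back
    raise last_error or StepOutputError("no attempts given")


schema = {"company": str, "score": int}
attempts = [
    lambda: "",                                  # empty answer
    lambda: '{"company": "Acme"}',               # truncated: missing a field
    lambda: '{"company": "Acme", "score": 3}',   # valid
]
print(first_valid(attempts, schema))
```

The check only proves the shape is right, not that the content is; it is the cheap first gate, not the whole quality floor.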

The catch: hedging spends the budget you’re shortest on

Racing has an awkward property. The tail is worst when the system is busy. And “busy” is exactly when your tokens-per-minute budget has the least room left. So the one move that fixes the tail wants to spend tokens at the precise moment they’re hardest to come by. Do it blindly and you get a pile-on: slow calls trigger hedges, hedges add load, load makes everything slower, more calls cross their cutoff. A latency problem becomes a rate-limit problem.

Two facts make this less forgiving than it first looks. The cost is committed the instant you fire the second call. Cancelling the loser frees your connection, but the provider keeps generating, and billing, the abandoned attempt. There’s no clawback, so all the control has to live at the decision to hedge, not after. And you usually can’t see how much budget is left. Estimating it is possible but involved, so any scheme that “eases off as the quota fills” is hard to run in practice.

What works in practice is cruder and more structural:

Send the hedge somewhere with its own budget. Token limits are per-model and per-provider, and most of us run more than one (as noted in How our system is built). Routing the retry to a different model or provider gets a separate quota and an independent draw. The same move that escapes the stall also avoids spending the scarce budget twice.
Keep hedges rare by construction. This is what the precomputed cutoffs already buy you: with the threshold set at each step’s measured p95, a hedge fires only on the slow minority, so the extra spend stays small with no runtime accounting at all. (Same cutoffs as the next section, no new machinery.)
React to the signals you actually get. You probably can’t read headroom, but you can read 429s and climbing latency. Treat those as the cue to hedge less and cut later, not more.
At real saturation, stop hedging. Once the provider is already returning rate-limit errors, more attempts only deepen the hole. Downshift to a smaller, cheaper model or shed the work instead.
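The structural rules above can be combined into a small gate that decides where a hedge goes, or whether it fires at all. This is a sketch, not our implementation: `HedgeGate`, the window length, and the hit threshold are hypothetical, and the only signal it reads is rate-limit (429) responses you have already observed.

```python
import time
from collections import deque


class HedgeGate:
    """Pick a hedge target with its own budget, and back off when 429s appear."""

    def __init__(self, window_s=60.0, max_rate_limit_hits=3, clock=time.monotonic):
        self.window_s = window_s
        self.max_hits = max_rate_limit_hits
        self.clock = clock
        self.hits = {}  # provider -> timestamps of recent rate-limit errors

    def record_rate_limit(self, provider):
        self.hits.setdefault(provider, deque()).append(self.clock())

    def _recent_hits(self, provider):
        q = self.hits.get(provider, deque())
        oldest_allowed = self.clock() - self.window_s
        while q and q[0] < oldest_allowed:
            q.popleft()  # forget errors that fell out of the window
        return len(q)

    def hedge_target(self, primary, providers):
        """Return a provider other than the primary, or None to skip the hedge."""
        for candidate in providers:
            if candidate != primary and self._recent_hits(candidate) < self.max_hits:
                return candidate
        return None  # every alternative is saturated: downshift or shed instead


gate = HedgeGate()
print(gate.hedge_target("provider_a", ["provider_a", "provider_b"]))
```

Returning `None` is the "stop hedging" case: the caller would then downshift to a smaller model or shed the work rather than add load.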

One lever we haven’t built, and offer only as a direction: an explicit global cap that holds hedged calls to a small fraction of total traffic, independent of the per-step decisions. It’s the principled backstop the tail-at-scale work points to;³ we set conservative cutoffs instead and haven’t needed it, but at higher hedge rates that’s where we’d go next.
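For readers who want to try that direction, a global cap could look roughly like the sketch below. It is purely hypothetical, since we have not built this: the class name and the 5% default are assumptions, and a production version would count over a sliding window rather than since start-up.

```python
class HedgeBudget:
    """Allow hedges only while they stay under a fixed fraction of all calls."""

    def __init__(self, max_fraction=0.05):
        self.max_fraction = max_fraction
        self.calls = 0
        self.hedges = 0

    def record_call(self):
        """Count every primary call, hedged or not."""
        self.calls += 1

    def try_hedge(self) -> bool:
        """Spend one unit of hedge budget if the cap allows it."""
        if self.calls and (self.hedges + 1) / self.calls <= self.max_fraction:
            self.hedges += 1
            return True
        return False


budget = HedgeBudget(max_fraction=0.05)
for _ in range(100):
    budget.record_call()
print([budget.try_hedge() for _ in range(6)])
```

After 100 calls at a 5% cap, the first five hedge requests are granted and the sixth is refused, regardless of what the per-step cutoffs decided.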

Sidebar — The cheap moves you make first

Cutoffs and hedging are insurance. You buy less of it if the workflow is built well to begin with. The defaults that fire on every request, before any reactive trick:

Parallelism by design. Lay the flow out as a dependency graph and run every step the moment its inputs exist. Then go further — design the dependencies out. Fewer dependencies means more steps are leaves, and a leaf can fail cheaply without taking the rest of the graph down.
Don’t call the model at all when you don’t have to. The most reliable call is the one you don’t make — use code, lookups, and validators wherever the work doesn’t actually need a model.
Mix models per step, not per workflow. Fast and cheap where it’s enough; capable where it isn’t.
Cache the deterministic parts. Don’t pay an LLM twice for an answer that can’t change.

The point here: spend your reliability budget on structure first, so the clock work has less to fix.
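As an illustration of parallelism by design, the sketch below runs a small dependency graph so that every step starts the moment its inputs exist. It assumes an acyclic graph and uses hypothetical step names; it is not our orchestrator, which also has to handle cutoffs, fallbacks, and partial failure.

```python
import asyncio


async def run_graph(steps, deps):
    """steps: name -> async fn(inputs dict); deps: name -> list of upstream step names."""
    tasks = {}

    async def run(name):
        upstream = deps.get(name, [])
        # Wait only for this step's own inputs, not for the whole graph.
        inputs = await asyncio.gather(*(tasks[d] for d in upstream))
        return await steps[name](dict(zip(upstream, inputs)))

    # Tasks do not start executing until we yield to the event loop,
    # so every entry exists before any step looks up its upstream tasks.
    for name in steps:
        tasks[name] = asyncio.create_task(run(name))
    results = await asyncio.gather(*tasks.values())
    return dict(zip(tasks, results))


async def demo_graph():
    async def fetch_a(_inputs):
        await asyncio.sleep(0.2)
        return "A"

    async def fetch_b(_inputs):
        await asyncio.sleep(0.2)
        return "B"

    async def combine(inputs):
        return inputs["a"] + inputs["b"]

    steps = {"a": fetch_a, "b": fetch_b, "c": combine}
    return await run_graph(steps, {"c": ["a", "b"]})


print(asyncio.run(demo_graph()))
```

The two independent steps overlap, so the demo takes about as long as one of them rather than the sum.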

When do you actually pull the trigger?

The cutoff is a knob, not a constant. How hard you turn it comes down to three plain questions about each step:

How much does the answer need this step? Nice-to-have: let it go. Must-have: protect it.
How much is waiting on it? If nothing depends on it, let it run to the deadline. If half the workflow is queued behind it, finish it sooner, and make sure it’s right, because a wrong answer here poisons everything downstream.
How much time is left? Plenty: retry calmly. Almost out: cut fast and fall back.

The more a step is must-have, load-bearing, and short on time, the earlier you fire the backup and the more you’ll spend to hedge it. An optional, terminal, early step gets none of that. (“Early or late in the flow” was never the real axis. It was a proxy for how much still depends on this step.)

And you don’t guess the number. You run the workflow many times, measure each step’s latency curve (for example its p95), and set the cutoff from that curve: below the step’s worst case, and weighted by the three questions. A step that usually answers in 20 seconds gets cut at 30, even though it might have succeeded at 60.
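One way to turn measured latencies into a cutoff is sketched below. The percentile function is a plain nearest-rank calculation; the weighting by the three questions is a hypothetical scheme invented for illustration (the specific percentiles are assumptions, not our production values).

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile (pct in 0..100) of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def step_cutoff(latencies_s, must_have, dependents, time_left_s):
    """Pick a cutoff from the step's measured latencies, weighted by its role."""
    pct = 95  # default: hedge only the slow minority
    if must_have and dependents > 0:
        pct = 90  # load-bearing step: fire the backup a little earlier
    elif not must_have and dependents == 0:
        pct = 99  # optional leaf: let it run longer
    # Never plan to wait longer than the time that is actually left.
    return min(percentile(latencies_s, pct), time_left_s)


measured = list(range(1, 101))  # stand-in for 100 measured step latencies, in seconds
print(step_cutoff(measured, must_have=True, dependents=3, time_left_s=200))
print(step_cutoff(measured, must_have=False, dependents=0, time_left_s=200))
print(step_cutoff(measured, must_have=True, dependents=0, time_left_s=50))
```

The cutoffs are computed offline from past runs, so nothing here needs runtime accounting beyond knowing how much of the deadline remains.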

Why almost nobody does this

This isn’t hard. It’s nuanced, and most teams don’t have the engine for it.

The popular workflow tools, the Airflows and Temporals, were built to make pipelines durable: retry, resume, don’t lose state, and they’re very good at it. Their timeout advice follows from that goal: set a per-step timeout longer than the slowest run and retry until it succeeds.⁵ That’s the right instinct when the job is durable completion, and it’s exactly the wrong advice when the job is to finish in time. Your workflow engine will happily retry a step many times; it has no notion of a step’s measured typical time or its downstream implications, so it can’t cut early and switch models. That isn’t a flaw. It’s by design.

The distributed-systems fundamentals are already on our side: work from a deadline budget, match each timeout to measured latency.⁶ We’re not contradicting that. We’re applying it to a case those tools don’t assume: a short, non-resumable budget where the right move at the cutoff is a faster alternative, not the same call again. Same principle, inverted direction.

Takeaway

One thing, if you keep nothing else: a predictable completion time beats a fast one with a long tail. Low variance beats low latency. You can’t promise a customer a median, only a bound. Everything here serves that bound. Cutting early, hedging, racing, designing out dependencies: each trades a little average speed for a lot less variance. You give up the right tail to buy the left.

In a customer-facing agentic workflow, reliability is the product. The craft isn’t owning a bag of retries and fallbacks; those are table stakes. It’s deciding, per step, whether to hedge and when to give up, from the constraints and the measured behavior of your own system.

APPENDIX

About the author

Frank Wittkampf is Head of Applied AI Engineering at Databook. His team architects, builds, and operates a fully custom AI stack including deep reasoning, an agentic workflow engine, AI asset generation, agentic harnesses, knowledge base & context graph, AI pre-processing, multi-tenant AI configuration management, etc. This AI infrastructure powers the GTM teams of top Enterprise companies like Microsoft, Salesforce, Amazon, Databricks, and many others.

A note on the data

The latency figures here come from recent (June 2026), anonymized production traffic across enterprise customer workloads — roughly 1.2 million LLM calls over a 30-day window, not synthetic benchmarks or a public trace. As described in How our system is built, these are direct calls to managed third-party APIs, which is part of why the slow tail is largely transient. The numbers in the text describe the longer calls (output ≥ 600 tokens), since those are the ones that actually press against a deadline; shorter calls are faster and less variable. Throughout, a “tail ratio” (p99/p50) holds call size fixed within a bucket unless stated otherwise. Models are labeled by family and serving path only; predictability depends on the serving path (e.g. a managed API vs. a direct one), not just the model, so these are deliberately not a model ranking. Durations were bucketed in one-second bins; a hard 90-second ceiling truncates only the last ~0.2% of longer calls, so the tail you see is real, not an artifact of the cap.

Isn’t the tail just the bigger calls?

The fair objection to Figure 4: each row is a token bucket, not a fixed token count, so maybe the slow calls inside a cell are simply the larger ones — more to prefill, more to generate — and the tail is just size, not anything transient.

It isn’t, and the data’s own shape shows why. If size drove the within-cell tail, two things would follow: the tail ratio would grow with the amount of work, and the most tightly bounded cells would have almost no tail. Neither holds.

FIGURE A1 – Within-cell p99/p50 tail ratio by output-size bucket. Each dot is one model × cell with both token counts held to a bucket; color = input size, dot area ∝ call volume; red bar = volume-weighted mean per column.

Two things to read off it. First, the tail ratio is flat at roughly 2–4× across every output-size column: it doesn’t climb as the work grows, so the tail doesn’t scale with the work. Second, and decisively, look at the leftmost column: those calls emit at most 50 output tokens, so generation time physically can’t vary by more than about a second, yet the tail there is still ~3.5×. There is no size variable large enough to produce that. The residual spread is transient (queueing, scheduling, a momentary provider hiccup), which is exactly what a fresh attempt escapes.

Why these numbers look smaller than the 2–7× quoted earlier: the column figures here are volume-weighted averages across many cells, which smooth out the spread, whereas the 2–7× in the body is the per-call envelope — the range individual cells actually span. Same data, two different cuts: the averages show the tail doesn’t scale with work; the envelope shows how wide it gets on any given call.

Notes & Footnotes

Note: All images created by the author.

¹: Ten steps at 95% each ≈ 60% end-to-end; twenty ≈ 36% (assuming independence).

²: The lognormal lies in the subexponential class, where the tail of a sum of independent terms is asymptotically the sum of the individual tails: `P(S_n > t) ∼ Σ_i P(X_i > t) ∼ P(max_i X_i > t)` as t → ∞ — the “single big jump” principle (Foss, Korshunov & Zachary, An Introduction to Heavy-Tailed and Subexponential Distributions, Springer, 2nd ed. 2013, eqs. 1.3 & 1.6). It’s an asymptotic statement and assumes independence, so treat it as the intuition for why one slow step dominates, not a plug-in formula.

³: If each independent attempt is slow with probability q, two parallel attempts are both slow with probability q²; n attempts, qⁿ. The classic hedged-request result (Dean & Barroso, “The Tail at Scale,” CACM 2013); in an agent setting, Winston et al. (arXiv:2605.21470, ICML 2026) choose between serial, parallel, and hedged execution from measured latency curves. On our production data, racing two attempts cut p99 on longer steps by more than half (≈60s→25s) while sequential re-draw on total time did not.

⁴: The split is standard in inference work: “time to first token” (queue + prefill) versus per-token generation. See e.g. Agrawal et al., Taming the Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve (arXiv:2403.02310, 2024). In our production traffic the tail for longer calls sits in the generation phase, not the first-token wait — which is why we cut long steps on total elapsed time rather than time-to-first-token.

⁵: Temporal’s activity timeouts are designed to finish eventually, including retries — hence Start-To-Close set above the slow tail.

⁶: Google SRE, gRPC deadlines, and Spanner all advise propagating a total budget and dropping work that can no longer help the caller. We extend the same principle to a sync, non-resumable customer budget.
