Idempotency in Distributed Systems: How to Stop Duplicates From Becoming Outages
Modern systems don’t fail in clean, single-step ways: they retry, they time out, they replay messages, and they occasionally lie about whether a thing happened. If you build APIs, payment flows, job queues, or event-driven pipelines, “duplicate execution” is not an edge case—it’s a default. The core skill is designing operations so that repeating them does not change the final outcome, even when the network and your dependencies are behaving badly.
This article is about idempotency as an engineering discipline: what it is, why it fails in practice, and the concrete patterns that make retries safe across HTTP, queues, and databases.
The Reality: Retries Are Inevitable, Not Optional
A request can succeed on the server and still look like it failed to the caller. The most common sequence is boring and deadly: a client sends a request, the server processes it, the response packet is dropped or delayed, the client times out, and then it retries. If your server treats that retry as a fresh request, you’ve just created duplicates—double charges, duplicate shipments, repeated emails, double inventory decrements, or corrupted aggregates.
Retries show up in more places than people expect:
- Load balancers and service meshes can retry upstream calls when they detect transient errors.
- SDKs implement “helpful” automatic retries.
- Message brokers redeliver messages after consumer crashes or visibility timeouts.
- Users double-click.
- Mobile networks drop connections and reconnect.
- Cron jobs rerun when a node restarts mid-task.
So you shouldn’t ask, “Will this be retried?” You should assume it will be retried, and then decide whether duplicate execution is safe.
What Idempotency Actually Means (and What It Does Not)
An operation is idempotent if performing it multiple times yields the same final state as performing it once. It’s a property of the *whole interaction*, not a single function call. In distributed systems, idempotency often means: “Given the same intent, any number of replays produces exactly one durable effect.”
This immediately shows why “just make it idempotent” is hard:
- Some actions are naturally idempotent (setting a field to a value, upserting by key).
- Many business actions are not (creating a new order with a new ID every time, charging a card, generating a one-time coupon, sending an email, decrementing stock).
- Even if the write is idempotent, side effects may not be (webhooks, emails, downstream events).
You’re usually aiming for *effectively-once* behavior: duplicates may occur in the plumbing, but their externally visible outcomes are constrained.
The Core Pattern: Idempotency Keys + Durable Deduplication
For APIs that represent “do this” intents (create order, charge, transfer, submit form), the dominant pattern is an idempotency key: a client-generated token that identifies the intent. The server stores the token and the result, and on retries it returns the stored result rather than redoing the work.
The subtle requirements are what make the pattern succeed or fail:
- The dedup record must be durable (database, not memory).
- It must be checked and written atomically with the effect (or with a safe “reservation” state).
- It must have a clear scope (per user, per endpoint, per merchant, per tenant).
- It must handle in-flight concurrency (two retries at the same time).
- It must not leak sensitive data if keys are guessable.
Where teams get burned is a split-brain between “we wrote the idempotency row” and “we completed the business action.” If those are separate writes without a transaction boundary, you can still duplicate under race conditions or partial failure. You either need a single transactional commit that includes both the business write and the idempotency marker, or you need a state machine that makes the business write safe to repeat.
Here’s the simplest mental model: treat the idempotency key as the primary key of the intent, and treat the response (or business object ID) as a deterministic projection of that intent.
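A minimal sketch of that mental model, using SQLite and a hypothetical `create_order` handler (the table and column names are assumptions, not a prescribed schema). The idempotency key is literally the primary key, so the dedup check and the business write commit together:

```python
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        idempotency_key TEXT PRIMARY KEY,  -- the intent's identity
        order_id        TEXT NOT NULL,     -- deterministic projection of the intent
        response        TEXT NOT NULL      -- stored result, replayed on retries
    )
""")

def create_order(idempotency_key: str, payload: dict) -> dict:
    """Create the order once; replays with the same key return the stored result."""
    order_id = str(uuid.uuid4())
    response = json.dumps({"order_id": order_id, "items": payload["items"]})
    try:
        # The business write and the idempotency marker are one row,
        # so they commit atomically.
        with conn:
            conn.execute(
                "INSERT INTO orders (idempotency_key, order_id, response) "
                "VALUES (?, ?, ?)",
                (idempotency_key, order_id, response),
            )
        return json.loads(response)
    except sqlite3.IntegrityError:
        # Duplicate intent: return the first attempt's stored response.
        row = conn.execute(
            "SELECT response FROM orders WHERE idempotency_key = ?",
            (idempotency_key,),
        ).fetchone()
        return json.loads(row[0])

first = create_order("key-123", {"items": ["sku-1"]})
retry = create_order("key-123", {"items": ["sku-1"]})
assert first == retry  # the retry observes the original outcome
```

In a real service the business object would live in its own table; the point is only that the marker and the effect share one transaction boundary.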
Exactly-Once Is a Myth; Deterministic Effects Are Not
People often chase “exactly-once processing” promises from brokers or frameworks. In practice, most real systems deliver *at-least-once* processing, because it’s the only model that survives consumer crashes and network partitions without losing data. That’s fine, as long as your handlers are safe under re-delivery.
The clean approach is not to fight the delivery model, but to make your handler deterministic:
- Convert “create a new record with a random ID” into “create or return record keyed by (tenant_id, idempotency_key).”
- Convert “decrement stock” into “apply a ledger entry keyed by (order_id, line_item_id) and compute stock from ledger, or enforce a single application via a unique constraint.”
- Convert “send email” into “enqueue notification keyed by (user_id, notification_type, business_id)” and only send when that key is new.
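The stock-ledger conversion above can be sketched as follows; this is an illustrative schema, assuming a composite `(order_id, line_item_id)` key enforces single application:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE stock_ledger (
        order_id     TEXT NOT NULL,
        line_item_id TEXT NOT NULL,
        sku          TEXT NOT NULL,
        delta        INTEGER NOT NULL,
        PRIMARY KEY (order_id, line_item_id)  -- one application per line item, ever
    )
""")

def apply_decrement(order_id: str, line_item_id: str, sku: str, qty: int) -> None:
    """Record the decrement exactly once; a redelivered message becomes a no-op."""
    with conn:
        conn.execute(
            "INSERT OR IGNORE INTO stock_ledger VALUES (?, ?, ?, ?)",
            (order_id, line_item_id, sku, -qty),
        )

def stock_delta(sku: str) -> int:
    # Stock is computed from the ledger, so duplicates cannot corrupt it.
    return conn.execute(
        "SELECT COALESCE(SUM(delta), 0) FROM stock_ledger WHERE sku = ?", (sku,)
    ).fetchone()[0]

apply_decrement("order-1", "line-1", "sku-9", 2)
apply_decrement("order-1", "line-1", "sku-9", 2)  # duplicate delivery
assert stock_delta("sku-9") == -2                 # applied exactly once
```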
This is also why unique constraints are underrated: a uniqueness violation is a deterministic “duplicate detected” signal you can handle. It turns a correctness problem into a control-flow path.
A Practical Checklist for Safe Retries
Use this as a design gate before you ship any endpoint or consumer that can be triggered more than once:
1) Define the intent identity: decide what makes two requests “the same” (client key, order reference, cart hash, invoice ID, or a composite key).
2) Make dedup durable: store the intent identity in a database with a uniqueness constraint; avoid in-memory caches as the source of truth.
3) Separate “reservation” from “completion”: if work is slow or calls third parties, persist an in-progress state and return the final stored result on repeat calls.
4) Design side effects as idempotent too: emit events with stable IDs, and make downstream consumers deduplicate on those IDs.
5) Treat timeouts as ambiguous outcomes: if the client didn’t get a response, it must be able to retry safely without guessing whether the action happened.
6) Decide retention: keep dedup records long enough to cover real retry windows (including delayed retries from mobile clients and queued jobs), and expire them deliberately.
7) Test by chaos, not by hope: simulate dropped responses, duplicate deliveries, and concurrent retries; assert that the final state is correct and unique.
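Item 7 can be exercised with a very small test. The sketch below uses a lock-protected dict purely as a stand-in for the durable dedup table, so the assertion logic stays in focus; the handler name is hypothetical:

```python
import threading
import uuid

dedup = {}                # stand-in for a durable dedup table
lock = threading.Lock()   # stand-in for the database's atomicity guarantee

def handle(idempotency_key: str) -> dict:
    """Return the one result for this intent, no matter how many callers race."""
    with lock:
        if idempotency_key not in dedup:
            dedup[idempotency_key] = {"order_id": str(uuid.uuid4())}
        return dedup[idempotency_key]

# Ten concurrent retries of the same intent.
results = []
threads = [
    threading.Thread(target=lambda: results.append(handle("key-1")))
    for _ in range(10)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

assert len(dedup) == 1                          # exactly one durable effect
assert all(r == results[0] for r in results)    # every retry saw the same result
```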
Idempotency in Queues and Event-Driven Pipelines
HTTP is only half the story. Many teams do dedup at the edge, then lose it in the async layer.
In queue consumers, the common anti-pattern is: “process message, then ack.” If the consumer crashes after processing but before ack, the broker will redeliver. You cannot prevent this entirely; you must handle it. The good patterns mirror the API patterns:
- Make the message carry a stable message ID (or derive one from business identifiers).
- Store a “processed message” marker keyed by that ID in the same database you write business effects to.
- Use a transaction: insert processed marker + apply business mutation + commit, then ack. If the consumer crashes before commit, nothing is applied; if it crashes after commit but before ack, the redelivery sees the marker and becomes a no-op.
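The transaction boundary described above can be sketched like this, with SQLite standing in for the consumer's database and a caller-supplied `ack` callback standing in for the broker acknowledgment (message IDs and table names are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE processed_messages (message_id TEXT PRIMARY KEY);
    CREATE TABLE balances (account TEXT PRIMARY KEY, amount INTEGER NOT NULL);
""")
conn.execute("INSERT INTO balances VALUES ('acct-1', 100)")
conn.commit()

def handle(message_id, account, delta, ack):
    """Apply the mutation at most once, then ack; redelivery is a no-op."""
    try:
        with conn:  # marker + mutation commit together, or roll back together
            conn.execute(
                "INSERT INTO processed_messages VALUES (?)", (message_id,)
            )
            conn.execute(
                "UPDATE balances SET amount = amount + ? WHERE account = ?",
                (delta, account),
            )
    except sqlite3.IntegrityError:
        pass  # redelivery: marker already committed, mutation already applied
    ack(message_id)  # safe to ack on both paths

acked = []
handle("msg-1", "acct-1", -30, acked.append)
handle("msg-1", "acct-1", -30, acked.append)  # broker redelivered after a crash
balance = conn.execute(
    "SELECT amount FROM balances WHERE account = 'acct-1'"
).fetchone()[0]
assert balance == 70  # the debit applied exactly once
```

Note that `with conn:` rolls the transaction back if the marker insert raises, so the mutation never runs on a redelivery.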
If you can’t do a single transaction (because the mutation is in another system), then you need a compensating design: either accept eventual duplicates and reconcile, or implement a state machine where the external call is made in a way that is itself idempotent (many payment providers support idempotency keys for this reason).
Event-driven systems also multiply side effects. If you emit an event “OrderCreated” twice, downstream systems may provision twice. The safest approach is to treat event publishing as part of your durability boundary. The outbox pattern exists because “write to DB” and “publish to broker” is not atomic.
In an outbox design, you write the business state and the event record in the same DB transaction. A separate relay reads unsent outbox rows and publishes them, marking them sent. If the relay retries, it publishes the same event ID again, and consumers deduplicate. You’ve converted “maybe published, maybe not” into “durably recorded; publication is retriable.”
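A compact outbox sketch under the same assumptions (table names illustrative, `publish` supplied by the caller as a stand-in for the broker client):

```python
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id TEXT PRIMARY KEY, status TEXT NOT NULL);
    CREATE TABLE outbox (
        event_id TEXT PRIMARY KEY,          -- stable ID consumers deduplicate on
        payload  TEXT NOT NULL,
        sent     INTEGER NOT NULL DEFAULT 0
    );
""")

def create_order(order_id: str) -> None:
    event_id = str(uuid.uuid4())
    with conn:  # business state and event row are durable together, or neither is
        conn.execute("INSERT INTO orders VALUES (?, 'created')", (order_id,))
        conn.execute(
            "INSERT INTO outbox (event_id, payload) VALUES (?, ?)",
            (event_id, json.dumps({"type": "OrderCreated", "order_id": order_id})),
        )

def relay(publish) -> None:
    """Publish unsent events. A crash between publish and the UPDATE just
    republishes the same event_id, which consumers treat as a duplicate."""
    rows = conn.execute(
        "SELECT event_id, payload FROM outbox WHERE sent = 0"
    ).fetchall()
    for event_id, payload in rows:
        publish(event_id, json.loads(payload))
        with conn:
            conn.execute(
                "UPDATE outbox SET sent = 1 WHERE event_id = ?", (event_id,)
            )

published = []
create_order("order-7")
relay(lambda eid, evt: published.append((eid, evt)))
relay(lambda eid, evt: published.append((eid, evt)))  # second run finds nothing unsent
assert len(published) == 1
```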
Subtle Failure Modes You Should Expect
Idempotency implementations often fail for reasons that look unrelated:
- Key scope bugs: two different endpoints share a key namespace and collide, returning the wrong cached result. Fix by scoping keys per endpoint and tenant.
- Payload mismatch: a client accidentally reuses a key with a different payload. Good systems detect this and reject the second request with a clear error, because “same key, different intent” is unsafe.
- Long-running operations: the first attempt is still running when the retry arrives. If you don’t have an in-progress state, you may duplicate work. A simple “status: processing” response tied to the key is often enough to prevent this.
- Partial side effects: you wrote the order but failed to send the confirmation email; on retry you might create no duplicate order (good) but still miss the email (bad). Fix by modeling emails as durable jobs keyed to the order and letting a worker send them exactly once.
- Clock-driven retries: some clients will retry minutes or hours later (especially on mobile). If you expire dedup keys too aggressively, duplicates come back as “new” work.
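Two of these modes, key reuse with a different payload and retries that arrive mid-flight, can be guarded with a little extra state on the dedup record. A sketch, with assumed table and status names:

```python
import hashlib
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE intents (
        idempotency_key TEXT PRIMARY KEY,
        payload_hash    TEXT NOT NULL,     -- detects "same key, different intent"
        status          TEXT NOT NULL      -- 'processing' or 'done'
    )
""")

def begin_intent(key: str, payload: dict) -> str:
    digest = hashlib.sha256(
        json.dumps(payload, sort_keys=True).encode()
    ).hexdigest()
    try:
        with conn:
            conn.execute(
                "INSERT INTO intents VALUES (?, ?, 'processing')", (key, digest)
            )
        return "accepted"          # first attempt: caller proceeds with the work
    except sqlite3.IntegrityError:
        row = conn.execute(
            "SELECT payload_hash, status FROM intents WHERE idempotency_key = ?",
            (key,),
        ).fetchone()
        if row[0] != digest:
            return "conflict"      # same key, different payload: reject loudly
        return "in_progress" if row[1] == "processing" else "done"

assert begin_intent("k1", {"amount": 5}) == "accepted"
assert begin_intent("k1", {"amount": 5}) == "in_progress"  # retry while running
assert begin_intent("k1", {"amount": 9}) == "conflict"     # payload mismatch
```

Hashing the canonicalized payload keeps the check cheap without storing the full request body.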
Notice the theme: idempotency is not a single toggle. It’s a set of concrete guarantees you implement across state, concurrency, and side effects.
If you build systems that touch money, inventory, accounts, or anything users can’t “undo,” idempotency is one of the highest ROI correctness investments you can make. Treat retries as the normal case, make intent identity explicit, and push deduplication to the same durability boundary as your business writes. When you do that, outages stop being “we charged people twice” and become “we returned the same result twice,” which is exactly the kind of boring you want in production.