
Offline-First Transaction Queuing in Venue POS Systems: Idempotency Under Network Partitions

I still remember the text message from the on-site lead during a college football opener: “All 14 portable stands are offline. Fans are still buying. What now?” That was minute 18 of a 45-minute LTE outage caused by a carrier tower handover failure nobody had anticipated. The tills kept ringing because we had already shipped an offline queue six months earlier—but the real test was whether those queued sales would land cleanly once the network came back.

The Partition Reality in Venues

Venues are partition factories. You get:

  • Concrete + steel killing 5 GHz Wi-Fi dead
  • Thousands of phones creating massive channel contention on 2.4 GHz
  • Handhelds roaming between APs with 3–8 second blackouts
  • Carrier-grade NAT timeouts during sudden traffic spikes
  • Full LTE/5G blackouts when the stadium DAS gets overloaded

In practice this means a single lane can be offline for 30 seconds to 12+ minutes multiple times per event, and different lanes see different partitions. You cannot count on “the network is back” meaning every device sees the server at the same moment.

Local Queue + Exponential Backoff

The pattern that survived the hardest incidents is simple but strict:

  1. Every transaction is written first to a local SQLite table (status = pending)
  2. The client generates a UUIDv7 as the idempotency key immediately
  3. Transaction is signed with device-specific key pair (prevents forgery after compromise)
  4. A background worker attempts delivery with exponential backoff + full jitter
  5. On 409 Conflict or 200 OK from server → mark as confirmed
  6. On terminal failure modes (410 Gone, auth errors) → move to dead-letter table for operator review
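A minimal sketch of the status transitions in steps 4–6, using an in-memory array in place of the SQLite table and a synchronous `deliver` stand-in for the signed HTTP call (the real worker is async, persists every transition, and waits out the backoff between passes):

```typescript
// In-memory stand-in for the local SQLite queue table.
type TxStatus = "pending" | "confirmed" | "dead";

interface QueuedTx {
  idempotencyKey: string; // UUIDv7 in production
  payload: string;
  status: TxStatus;
  attempts: number;
}

// Terminal failure modes: never retried, surfaced to an operator instead.
const TERMINAL = new Set([410, 401, 403]);

// One delivery pass. `deliver` returns the server's HTTP status code,
// or throws on a network failure.
function drainQueue(queue: QueuedTx[], deliver: (tx: QueuedTx) => number): void {
  for (const tx of queue) {
    if (tx.status !== "pending") continue;
    tx.attempts += 1;
    try {
      const code = deliver(tx);
      if (code === 200 || code === 409) tx.status = "confirmed"; // server has it
      else if (TERMINAL.has(code)) tx.status = "dead"; // dead-letter for review
      // anything else: stays pending, retried after backoff
    } catch {
      // network error: stays pending, retried after backoff
    }
  }
}
```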

The backoff schedule we converged on after too many thundering-herd incidents:

const BACKOFF_SECONDS = [0, 1, 3, 8, 15, 30, 60, 120, 300, 600];
function nextAttemptDelay(attempt: number, maxSeconds = 600): number {
  // Cap at maxSeconds once the table is exhausted, but keep jittering:
  // an un-jittered cap re-synchronizes lanes and recreates the herd.
  const base = attempt < BACKOFF_SECONDS.length ? BACKOFF_SECONDS[attempt] : maxSeconds;
  // Full jitter: uniform random in [0, base], returned in milliseconds.
  return Math.floor(Math.random() * base * 1000);
}

This spread retry traffic enough that a mass reconnection after a 7-minute outage didn’t instantly kill our API gateway.

Idempotency Key Design

We store the key + SHA-256 of the payload body. This allows safe retries even if the client re-signs or slightly mutates non-semantic fields (timestamps, etc.).
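A sketch of that hashing step, assuming Node's `crypto` module; the set of excluded non-semantic fields is illustrative, and only top-level keys are canonicalized here (the real payloads would need recursive canonicalization for nested objects):

```typescript
import { createHash } from "crypto";

// Fields that may legitimately differ between retries of the same sale.
const NON_SEMANTIC = new Set(["sent_at", "client_clock", "signature"]);

// Hash a canonical form of the payload: drop non-semantic fields and
// sort keys so two logically identical retries hash identically.
function payloadHash(payload: Record<string, unknown>): string {
  const semantic: Record<string, unknown> = {};
  for (const key of Object.keys(payload).sort()) {
    if (!NON_SEMANTIC.has(key)) semantic[key] = payload[key];
  }
  return createHash("sha256").update(JSON.stringify(semantic)).digest("hex");
}
```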

Server table (simplified):

CREATE TABLE idempotency (
  key            TEXT PRIMARY KEY,
  payload_hash   TEXT NOT NULL,
  lane_id        TEXT NOT NULL,
  accepted_at    TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  status         TEXT NOT NULL,          -- accepted | duplicate | conflict
  response_body  JSONB,
  expires_at     TIMESTAMPTZ NOT NULL   -- 72 hours
);

On incoming request:

  • If key exists and hash matches → return cached 200 + response_body
  • If key exists but hash differs → 409 Conflict + alert (almost always a client bug)
  • Else → process → store on success → return 200
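The three branches above, sketched with an in-memory `Map` standing in for the Postgres table; in production the lookup and insert run inside one database transaction, and `process` is a stand-in for actually executing the sale:

```typescript
interface IdemRecord {
  payloadHash: string;
  status: "accepted" | "duplicate" | "conflict";
  responseBody: string;
}

type Outcome =
  | { code: 200; body: string; cached: boolean }
  | { code: 409 };

function handleRequest(
  store: Map<string, IdemRecord>,
  key: string,
  payloadHash: string,
  process: () => string
): Outcome {
  const existing = store.get(key);
  if (existing) {
    if (existing.payloadHash === payloadHash) {
      // Safe retry: replay the cached response, do not re-process.
      return { code: 200, body: existing.responseBody, cached: true };
    }
    // Same key, different payload: almost always a client bug. Alert.
    return { code: 409 };
  }
  const body = process();
  store.set(key, { payloadHash, status: "accepted", responseBody: body });
  return { code: 200, body, cached: false };
}
```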

We keep the window at 72 hours because some venues run multi-day events and staff may void/retry old tickets hours later.

Real War Story: The Partial Partition

During a baseball double-header, half the field-level stands lost connectivity for ~22 minutes while the upper deck stayed up. We saw:

  • 38 lanes queue 2,100+ tx locally
  • 14 lanes stay online the whole time
  • When the field-level network returned, 2 lanes had stale price data (happy-hour promo ended)

Because idempotency was strict and inventory was soft-reserved locally, we:

  1. Accepted every queued tx as-is (stale price honored—revenue > perfection)
  2. Applied compensating inventory adjustment for the promo items later
  3. Operator got a single Slack thread with the 38 affected lanes and delta report

No double-sales, no lost revenue, one very confused GM who expected the system to freeze.

What Breaks If You Skip Strict Idempotency

We tried a weaker version early on (just UUID without payload hash). During one partition recovery:

  • Client A sent tx-123, got timeout
  • Client A retried → succeeded
  • Client B (same lane, same tx) retried an older payload → accepted as new tx-124

Result: one beer sold twice, charged twice, angry customer. Never again.

Closing Lessons

  • Idempotency is not a nice-to-have; it’s the only thing that keeps your financials sane when the network is lying to you.
  • Payload hashing catches client bugs early—don’t skip it.
  • Jittered exponential backoff is boring but prevents retry storms that kill everything.
  • Accept that stale data during outages is usually the lesser evil compared to downtime.

Next time someone asks “can’t we just use the cloud POS vendor?”, I show them the graph of 2,100 queued transactions that landed cleanly during a 22-minute field-level blackout. That usually ends the conversation.