
Distributed Locks and Fencing Tokens — Handling Concurrency Safely in Microservices

Over the last few years, I have worked mostly on building microservices using Spring Boot, but I never really had a chance to work directly with distributed locking systems like Redis, ZooKeeper, or etcd.

Still, I kept coming across these topics whenever I read about system design, schedulers, or how large systems avoid inconsistent updates. I always wanted to understand how distributed locks actually work and what could go wrong when multiple service instances try to modify the same shared resource.

Recently, I started spending some time reading different blogs, GitHub discussions, and documentation to understand the concepts better. This post is my attempt to write down what I learned so far, mostly for my own understanding, and hopefully useful for anyone else exploring this topic.


Why Do We Need Distributed Locks?

In a single application instance, handling concurrency is straightforward with synchronized blocks or java.util.concurrent locks.
But in microservice-based systems, we often run multiple instances across containers, nodes, or regions, and in-process locks cannot coordinate across them.

Common situations where more than one instance competes for the same resource:

  • Two scheduler instances try to pick the same job.
  • Two order processors update the same inventory.
  • Multiple consumers process the same message.
  • Several services attempt to refresh the same cache entry.

To avoid inconsistent state, the natural thought is: use a distributed lock.
Redis, ZooKeeper, or etcd are the common options.

Once I started reading more deeply, I realized a distributed lock alone cannot guarantee correctness, especially during failures.


The Problem: Locks Can Expire or Be Lost

Consider a simple scenario:

  1. Service A acquires a Redis lock (with a 50ms expiration).
  2. Service A suddenly pauses because of GC or CPU pressure.
  3. The lock expires before A completes the operation.
  4. Service B acquires the same lock.
  5. Both A and B believe they own the lock and update shared state.

This creates corrupted or inconsistent data.
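
To make the failure concrete, here is a minimal sketch of the naive pattern in Java, assuming the Jedis Redis client (class names like NaiveRedisLock are just for illustration, and the exact Jedis API varies a little between versions):

import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

public class NaiveRedisLock {

    private final Jedis jedis = new Jedis("localhost", 6379);

    public void processJob(String jobId) {
        // SET key value NX PX 50: acquire only if the key does not exist,
        // with a 50ms expiration.
        String result = jedis.set("lock:" + jobId, "instance-A",
                SetParams.setParams().nx().px(50));
        if (!"OK".equals(result)) {
            return; // another instance holds the lock
        }

        // If a GC pause or slow I/O here lasts longer than 50ms,
        // the lock silently expires and another instance can acquire it.
        updateSharedState(jobId); // two instances may reach this line
    }

    private void updateSharedState(String jobId) {
        // write to the shared resource
    }
}

Nothing in this code notices that the lock has expired; the final write goes through either way.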

Distributed locks are vulnerable because:

  • Nodes pause unexpectedly.
  • Network delays cause late heartbeats.
  • Lock expirations are time-based.
  • Processes may crash after performing half the work.
  • Network partitions create a split-brain situation.

These edge cases make naive locking unsafe.


Why Naive Locking Fails

Even if you use the “correct” algorithm (like Redis Redlock), a lock cannot protect against:

  • GC pauses
  • Network delays or packet drops
  • Clock drift
  • Lock expiration happening too early or too late
  • Multiple lock holders caused by failover or partition
  • A process waking up after losing the lock

The key insight is:


A service that once held a lock may not be the valid lock holder at the time it writes to shared state.

This is why systems need something stronger than just a lock.


Fencing Tokens — The Missing Piece

A fencing token is a monotonically increasing number issued every time a lock is granted.

Example:


Client A acquires lock → token 101
Client B acquires lock → token 102
Client C acquires lock → token 103

Each token is strictly larger than the previous one.

Downstream systems (database, queue, storage service) remember the highest token they have accepted so far and reject any operation that arrives with a lower token.

This prevents stale lock holders from modifying the resource.
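
To make this concrete, here is a minimal in-memory sketch of the storage-side check (FencedStore and its method names are made up for illustration; in practice this logic lives in the database or storage service):

public class FencedStore {

    private long highestTokenSeen = 0;
    private String value;

    // Accept a write only if its fencing token is at least as large
    // as the highest token already seen; otherwise the caller is stale.
    public synchronized boolean write(long fencingToken, String newValue) {
        if (fencingToken < highestTokenSeen) {
            return false; // stale lock holder, rejected
        }
        highestTokenSeen = fencingToken;
        this.value = newValue;
        return true;
    }
}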

How It Helps

Even if Service A pauses after acquiring its token (101):

  • It wakes up later.
  • Tries to write to shared state.
  • Shared state sees: 101 < 102.
  • Update is rejected.

This ensures data correctness regardless of timing delays.


Real Example: Inventory Update

Imagine two services deducting inventory for the same product.

Without fencing tokens:

  • A gets lock but is slow.
  • Lock expires.
  • B gets lock and writes update.
  • A wakes up and writes old inventory.
  • Inventory becomes corrupted.

With fencing tokens:

  • A gets token 101.
  • B gets token 102.
  • Database accepts B’s update because 102 > 101.
  • When A tries to update with 101, database rejects it.

This prevents inconsistent inventory updates.
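
In a relational database this is just a conditional update. Here is a JDBC sketch, assuming a hypothetical inventory table with columns product_id, quantity, and last_token:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class InventoryDao {

    // Deducts stock only if the caller's fencing token is not older
    // than the last token this row has accepted.
    public boolean deduct(Connection conn, String productId,
                          int amount, long fencingToken) throws SQLException {
        String sql = "UPDATE inventory "
                   + "   SET quantity = quantity - ?, last_token = ? "
                   + " WHERE product_id = ? AND last_token <= ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setInt(1, amount);
            ps.setLong(2, fencingToken);
            ps.setString(3, productId);
            ps.setLong(4, fencingToken);
            return ps.executeUpdate() == 1; // 0 rows means the token was stale
        }
    }
}

In the scenario above, A's late write with token 101 matches zero rows because last_token is already 102.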


How to Implement Fencing Tokens

1. Redis (simple approach)

Use an atomic counter for fencing tokens:


INCR lock:token

The returned number is the fencing token.

Use SETNX or Redlock separately for lock ownership.
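
Putting the two together in Java, here is a sketch assuming the Jedis client (ownership via SET with NX, token via INCR; the exact API can differ between Jedis versions):

import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

public class RedisFencedLock {

    private final Jedis jedis = new Jedis("localhost", 6379);

    // Returns a fencing token if the lock was acquired, or -1 if not.
    public long tryAcquire(String lockName, String ownerId, long ttlMillis) {
        String ok = jedis.set("lock:" + lockName, ownerId,
                SetParams.setParams().nx().px(ttlMillis));
        if (!"OK".equals(ok)) {
            return -1;
        }
        // INCR is atomic, so each call returns a strictly larger number.
        return jedis.incr("lock:" + lockName + ":token");
    }
}

One caveat with this simple approach: if the process pauses between the SET and the INCR, the token and the lock can get out of step, which is one reason coordination services that hand out the token as part of granting the lock are cleaner.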

2. ZooKeeper or etcd

These systems support:

  • Monotonic sequence numbers
  • Ephemeral nodes
  • Session expiration

ZooKeeper’s sequential nodes are ideal for fencing tokens.
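
A rough sketch with the plain ZooKeeper client, where the sequence number that ZooKeeper appends to the node name serves as the fencing token (the parent path /locks/job1 is assumed to exist, and error handling is omitted):

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkFencingToken {

    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {});

        // ZooKeeper appends a monotonically increasing sequence number
        // to the node name, e.g. /locks/job1/lock-0000000042.
        String path = zk.create("/locks/job1/lock-",
                new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE,
                CreateMode.EPHEMERAL_SEQUENTIAL);

        // The numeric suffix doubles as the fencing token.
        long token = Long.parseLong(path.substring(path.lastIndexOf('-') + 1));
        System.out.println("fencing token = " + token);
    }
}

Because the node is ephemeral, it also disappears automatically when the session expires, which covers lock release after a crash.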

3. Database versioning (practical and safe)

Example using a table:


lock_name | last_token

Increment using:


UPDATE locks SET last_token = last_token + 1 WHERE lock_name = 'job1'
RETURNING last_token;

This gives a safe fencing token.
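
A JDBC sketch of issuing that statement, assuming PostgreSQL (whose driver returns the RETURNING value as a result set) and the locks table described above:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class DbFencingToken {

    // Atomically increments and returns the next token for the given lock.
    public long nextToken(Connection conn, String lockName) throws SQLException {
        String sql = "UPDATE locks SET last_token = last_token + 1 "
                   + "WHERE lock_name = ? RETURNING last_token";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, lockName);
            try (ResultSet rs = ps.executeQuery()) {
                rs.next();
                return rs.getLong(1);
            }
        }
    }
}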


Why Fencing Tokens Matter More Than Locks

A distributed lock only tries to ensure “one at a time”.
It cannot protect against failures in timing or network conditions.

Fencing tokens enforce:


Only the newest valid lock holder is allowed to write.

This shifts correctness to the storage layer, which is much safer.


Where Fencing Tokens Are Used

  • Kafka producers use epochs to avoid stale writes.
  • ZooKeeper and etcd use sequence numbers and leases.
  • DynamoDB uses conditional writes (optimistic versioning).
  • PostgreSQL advisory locks are often combined with version checks.

You start noticing this pattern everywhere once you understand it.


Common Questions

1. How do you ensure only one service instance processes a job?
Use a lock + fencing tokens.

2. What are the weaknesses of Redis-based locks?
Time-based expiration means a paused or slow client can keep acting as the lock holder after another client has already acquired the same lock.

3. How do you avoid double-processing events?
Use versioning or fencing tokens.

4. Why is leader election tricky?
A leader might think it still holds leadership after a pause. Fencing tokens prevent stale leaders from updating state.


My Takeaway

Before learning about distributed locks, I assumed acquiring a lock was enough to guarantee safety.
Later I realized the real issue is not who holds the lock, but whether that lock is still valid when the service updates shared state.

Fencing tokens solve this by letting the storage layer enforce correctness instead of relying on timing or node behavior.

Now whenever I read or think about distributed coordination, I ask:


What if a lock holder pauses and resumes after losing the lock?

If the system would break in that scenario, fencing tokens are needed.

