Implementing PaxosLease in Python with HTTP Flask

Marton Trencseni - Sat 13 December 2025 - Distributed

Introduction

In earlier posts I walked through plain Paxos and then Multi-Paxos, writing a toy replicated log implementation in Python. The goal of Paxos is to agree on values which then become log entries — these can practically be thought of as database commands, which are then executed against local database replicas. In Paxos, once a value is chosen, it's chosen forever, never forgotten — in this sense, each entry of the replicated log is write-once, and all nodes (replicas) see the same sequence of values (commands), which ensures that their local databases execute the same sequence of commands. This is exactly what we want for reliable database replication, but it's overkill for some other common coordination problems in a distributed system, such as master election or lock ownership.

Temporal leases are the underlying primitive for master election and lock ownership. In practice you rarely want a lock that lasts forever; you want a lease: a lock that expires, so if the owner goes down, somebody else can take over. Master election is just a type of lock, with the same desired property: "this node is master for at most T seconds, then we'll see." If that node dies, you don't want the cluster to sit there blocked forever waiting for it to come back and release the lock. You want the lease to just expire and somebody else to take over. PaxosLease is a specialization of Paxos that does exactly that: it negotiates leases using a Paxos-style protocol, but because leases expire, the algorithm can get away with no disk writes and no synchronized clocks.

The code for this article is on Github.

Scalien

I originally came up with PaxosLease while working on my startup, Scalien. Scalien built two Paxos-based distributed databases, Keyspace and ScalienDB, roughly between 2008 and 2012. Keyspace was a consistently replicated key-value store that used Paxos to replicate write commands and guarantee strong consistency. ScalienDB was its successor, a bigger C++ system that again used Paxos under the hood, and we put it into production at a few paying customers before we ran out of funding.

In both Keyspace and ScalienDB we needed a way to elect a master inside a small controller quorum. You can absolutely do this with regular Paxos (or even full Multi-Paxos), but it's clunky: you have to write acceptor state to disk, worry about recovery, and you're still just trying to answer one simple question: "who's the leader right now?" So I ended up designing PaxosLease as a leaner variant: same majority/quorum logic as Paxos, but optimized for the case where the "value" is a lease with a fixed maximum duration. PaxosLease ended up being used for master lease negotiation in both Keyspace and ScalienDB.

The core realization was that leader/master selection in these systems doesn't need the "once chosen, true forever" property that Paxos is built around. Masters come and go. What we really needed was: at any given moment, there is at most one master, and we're okay forgetting the past as long as we don't overlap leases. That difference in requirements is what let PaxosLease drop disk writes and still remain safe.

PaxosLease

PaxosLease follows classical Paxos: we have proposers and acceptors, and we're trying to reach agreement on the "state" of a tiny replicated state machine:

  • "no one has the lease"
  • "node 1 has the lease"
  • "node 2 has the lease"

The proposer runs a two-phase protocol very similar to Paxos. In the prepare phase it picks a fresh ballot number and asks a majority of acceptors whether they're willing to promise not to accept lower-numbered proposals. In the propose phase it actually proposes "node i has the lease for T seconds." If a majority of acceptors accept that proposal, the proposer believes it now holds the lease for T seconds.

The crucial difference from regular Paxos is that the lease expires. Each acceptor starts a local timer for T seconds when it accepts a lease, and when that timer fires it just forgets the proposal. There's also a globally known maximum lease time M, and proposers are required to choose T < M. Because leases are bounded in time, acceptors don't have to persist anything to disk: if they crash and lose state, they simply wait M seconds before rejoining. By the time they start answering messages again, any lease they might have been part of before the crash must have already expired. That's the trick that gives PaxosLease its "diskless" property.
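
As a tiny illustration of that restart rule, here is a hedged sketch (MAX_LEASE_SECONDS and the ready flag are made-up names for this post, not identifiers from the repo):

import time

MAX_LEASE_SECONDS = 10  # the globally known M; proposers always request T < M

def acceptor_startup(acceptor):
    # a restarted acceptor has lost its in-memory promises, so it stays silent
    # for M seconds; by the time it answers prepare/propose requests again, any
    # lease it might have helped grant before the crash has already expired
    time.sleep(MAX_LEASE_SECONDS)
    acceptor.ready = True  # hypothetical flag: only now start serving requests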

Importantly, PaxosLease assumes nothing about nodes agreeing on absolute time—acceptors and proposers never compare timestamps or rely on synchronized clocks. They only use their own local notion of elapsed time, which is safe as long as their clock drift (the rate error) is reasonably small. Even if one machine's clock runs slightly faster or slower, the algorithm still guarantees correctness in practice, because lease ownership is determined purely by majority agreement and local expiry. In production systems this can be implemented by having the proposer treat its lease as ending slightly before the nominal duration (subtract a small amount, e.g. 100–1000 ms), providing an extra buffer against drift; this toy implementation doesn't do that, but it could be added trivially.
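
As a concrete sketch of that safety margin (not present in the toy code; LEASE_SECONDS and SAFETY_MARGIN are assumed names):

import time

LEASE_SECONDS = 5.0     # the T requested from the acceptors
SAFETY_MARGIN = 0.5     # e.g. 100-1000 ms, subtracted locally to absorb clock-rate drift

def local_lease_deadline(now=None):
    # the proposer stops believing it owns the lease slightly before the nominal
    # duration, while acceptors keep it for the full T, so modest drift cannot
    # lead two nodes to believe they own the lease at the same time
    now = time.time() if now is None else now
    return now + LEASE_SECONDS - SAFETY_MARGIN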

Code

The little Python program, linked above on Github, is a minimal PaxosLease implementation using Flask for RPC. Each node runs both a PaxosLeaseAcceptor and a PaxosLeaseProposer. The acceptor keeps the usual Paxos metadata: the highest ballot it has promised, and the last lease it accepted. In this toy version, the lease is a small dict with the owner, the duration and an absolute expiry time:

import threading
import time
from types import SimpleNamespace

class PaxosLeaseAcceptor:
    def __init__(self):
        self._lock = threading.Lock()        # guards all acceptor state
        self.state = SimpleNamespace()
        self.state.promised_n = None         # highest ballot number promised so far
        self.state.accepted_n = None         # ballot number of the last accepted lease
        self.state.accepted_value = None     # the accepted lease dict, or None
        self._lease_timer = None             # local timer that clears the lease on expiry

On a prepare, the acceptor does the usual Paxos check: if the proposal id is greater than anything it has promised so far, it updates promised_n and returns success along with its current accepted value (which might be None). On a propose, if the proposal id is at least the promised id, it accepts the lease, stores it in accepted_value and starts a local timer:

    def on_propose(self, proposal_id, lease_owner, lease_seconds):
        with self._lock:
            # accept unless we have already promised a higher ballot
            if self.state.promised_n is None or proposal_id >= self.state.promised_n:
                self.state.promised_n = proposal_id
                self.state.accepted_n = proposal_id
                self.state.accepted_value = {
                    "owner": lease_owner,
                    "lease_seconds": lease_seconds,
                    "lease_expires_at": time.time() + lease_seconds,
                }
                # forget the lease after lease_seconds, measured on the local clock
                self._restart_timer(lease_seconds)
                success = True
            else:
                success = False
            return success, self.state

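The prepare handler isn't shown above; a minimal sketch that matches the description (the repo's actual on_prepare may differ in detail) could look like this:

    def on_prepare(self, proposal_id):
        with self._lock:
            if self.state.promised_n is None or proposal_id > self.state.promised_n:
                # promise: from now on, ignore anything below this ballot
                self.state.promised_n = proposal_id
                success = True
            else:
                success = False
            # the returned state includes accepted_value, so the proposer can see
            # whether some lease is currently held (it may be None)
            return success, self.state

Since the RPC layer is Flask, both handlers can be exposed as plain HTTP endpoints. The route names and JSON shapes below are assumptions for this sketch, not necessarily the repo's exact API (each node would run its own Flask app on its own port):

from flask import Flask, request, jsonify

app = Flask(__name__)
acceptor = PaxosLeaseAcceptor()

@app.route("/prepare", methods=["POST"])
def prepare():
    msg = request.get_json()
    success, state = acceptor.on_prepare(msg["proposal_id"])
    return jsonify({"success": success, "state": vars(state)})

@app.route("/propose", methods=["POST"])
def propose():
    msg = request.get_json()
    success, state = acceptor.on_propose(
        msg["proposal_id"], msg["lease_owner"], msg["lease_seconds"])
    return jsonify({"success": success, "state": vars(state)})
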
The PaxosLeaseProposer is responsible for actually trying to grab the lease. It maintains a proposal_id of the form node_id + k*256 so each node owns its own "lane" of ballot numbers. On each attempt it bumps the proposal id, runs the prepare phase, and if it doesn't get enough promises it looks at the promised_n values the acceptors returned and jumps ahead to the next proposal id that's guaranteed to be larger than all of them:

    def _next_proposal_id_after(self, max_seen):
        # each node proposes in its own "lane": node_id, node_id + 256, node_id + 512, ...
        if max_seen is None:
            return self.node_id
        # smallest ballot in our lane that is strictly greater than max_seen
        k = (max_seen - self.node_id) // 256 + 1
        new_id = k * 256 + self.node_id
        if new_id <= max_seen:
            new_id += 256
        return new_id
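
For example, with node_id = 1 the lane is 1, 257, 513, 769, 1025, and so on: _next_proposal_id_after(None) returns 1, _next_proposal_id_after(700) returns 769 (513 is in the lane but not greater than 700), and _next_proposal_id_after(769) returns 1025. (These are illustrative numbers, not values from the repo.)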

If prepare fails, the proposer updates its proposal_id with _next_proposal_id_after(max_seen) and retries once. If it still can't get a majority, it gives up and returns a failed_prepare status. If prepare succeeds and a majority of acceptors report that they're not currently holding any lease, the proposer starts its own local timer and sends propose requests. If a majority accept, this node now believes it owns the lease until its timer fires; at that point it flips lease_owner back to False and forgets about it.
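
Putting the two phases together, here is a condensed, hedged sketch of the acquire flow just described (ProposerSketch and try_acquire are hypothetical names; the real PaxosLeaseProposer talks to acceptors over HTTP and also does the single retry via _next_proposal_id_after, which is omitted here):

import time

class ProposerSketch:
    def __init__(self, node_id, acceptors, lease_seconds=5):
        self.node_id = node_id
        self.acceptors = acceptors         # objects exposing on_prepare / on_propose
        self.lease_seconds = lease_seconds
        self.proposal_id = node_id         # ballots live in this node's 256-wide lane

    def try_acquire(self):
        majority = len(self.acceptors) // 2 + 1
        self.proposal_id += 256            # pick a fresh ballot in our lane
        # phase 1: prepare
        replies = [a.on_prepare(self.proposal_id) for a in self.acceptors]
        promises = [state for ok, state in replies if ok]
        if len(promises) < majority:
            return "failed_prepare"
        empty = sum(1 for state in promises if state.accepted_value is None)
        if empty < majority:
            return "lease_busy"            # somebody seems to hold an unexpired lease
        # phase 2: propose; start the local timer before sending, to stay conservative
        local_expiry = time.time() + self.lease_seconds
        accepted = sum(1 for a in self.acceptors
                       if a.on_propose(self.proposal_id, self.node_id, self.lease_seconds)[0])
        if accepted >= majority:
            return ("acquired", local_expiry)   # we believe we own the lease until then
        return "failed_propose"

With three local PaxosLeaseAcceptor instances passed in as acceptors, this sketch walks through the same prepare/propose sequence as the trace below.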

Trace

Let's walk through a simple trace with three nodes: 0, 1 and 2, so the majority is 2. Assume each HTTP call takes about 500 ms round-trip, and the lease length is 5 seconds.

At time t = 0, node 0's client calls /start. The proposer on node 0 bumps its proposal_id and sends prepare to all three acceptors. Those messages arrive around t = 0.5 s. Each acceptor looks at its promised_n; since this is the first round it's still empty on every node, so they all promise.

Their responses arrive back at node 0 at around t = 1.0 s. Node 0 now has three promises; it only needs a majority, so it's ready to move on. Node 0 checks that a majority of the acceptors have accepted_value = None (no current lease). That's true, so it starts its local lease timer for 5 seconds (still at t = 1.0 s, expiring at t = 6.0 s) and sends propose messages to all three acceptors. Those arrive around t = 1.5 s. Each acceptor accepts the proposal, stores a lease with owner = 0, lease_seconds = 5 and a local lease_expires_at ≈ t + 5, and starts its own expiry timer, which will fire at t = 6.5 s.

Their responses arrive back at node 0 around t = 2.0 s. At that moment node 0 flips lease_owner = True and logs "I am the lease owner". From then until t = 6.0 s node 0 behaves as the master. When its local timer fires, it relinquishes the lease. Separately, each acceptor's timer will clear its accepted_value after its own 5-second window at t = 6.5 s, so the cluster's internal state also returns to "no one has the lease."

The interesting part is what doesn't happen. Suppose node 1 tries to acquire the lease while node 0 is already the lease owner. In the prepare phase it's likely to see that at least a majority of acceptors have accepted_value set (they currently hold a lease), so even if it can collect promises, it will discover that the lease is busy and return lease_busy without ever getting to propose a conflicting lease. If you then kill and restart one of the acceptors, the code sleeps for LEASE_SECONDS on startup before it starts answering requests, which mimics the "wait M seconds after restart" rule from the paper and ensures that node can't accidentally resurrect an expired lease.

Note that at t = 6.0s the lease owner’s local timer expires, so it no longer believes it holds the lease, but the acceptors don’t clear their accepted state until their own timers fire around t = 6.5s. In this simple implementation—with deliberately exaggerated 500 ms message round-trips—that creates a brief window in which no new lease can be elected. This is expected: neither proposers nor acceptors can assume anything about one another’s absolute clocks or about how long messages took in transit, and the algorithm is deliberately designed to avoid relying on global time synchrony. A production implementation would typically add an extend_lease() mechanism that allows the current lease holder to refresh its lease before expiration, preventing unnecessary gaps. If the lease owner actually goes down, a short safety window—on the order of a few milliseconds in real systems—before another node can acquire the lease is a reasonable trade-off for maintaining correctness without requiring any disk persistence.
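
One plausible way to sketch such an extension on top of the proposer sketch above (try_extend is a made-up helper, and a production version would follow the PaxosLease paper's rules for extensions): the owner re-runs the protocol shortly before its local expiry, and during prepare only leases owned by somebody else count as busy.

def try_extend(proposer):
    # re-run both phases with a fresh ballot; our own still-live lease doesn't block us
    majority = len(proposer.acceptors) // 2 + 1
    proposer.proposal_id += 256
    replies = [a.on_prepare(proposer.proposal_id) for a in proposer.acceptors]
    promises = [state for ok, state in replies if ok]
    if len(promises) < majority:
        return False
    blocked = sum(1 for s in promises
                  if s.accepted_value is not None
                  and s.accepted_value["owner"] != proposer.node_id)
    if blocked > 0:
        return False                       # somebody else holds (or is grabbing) the lease
    accepted = sum(1 for a in proposer.acceptors
                   if a.on_propose(proposer.proposal_id, proposer.node_id,
                                   proposer.lease_seconds)[0])
    return accepted >= majority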

Conclusion

PaxosLease is a nice example of how you can take the heavy machinery of Paxos and make it much simpler if you're willing to narrow the problem. By noticing that leadership and lock ownership are temporary properties, the algorithm gets to throw out one of the most painful parts of Paxos implementations: disk persistence in the acceptor. Everything becomes in-memory with timers, plus a simple "wait M seconds after restart" rule to keep things safe.

In real systems like Keyspace and ScalienDB, PaxosLease sat next to a more traditional Paxos/Multi-Paxos pipeline: one part of the system agreed on a replicated log of writes, another part agreed on who the current master was. In the next post I'll zoom back out to that broader picture: leader-optimized Paxos, skipping the prepare phase, and how these pieces fit together in something closer to a production-ready distributed database.