Paxos: Lamport's Crown Jewel for Replicated State Machines

Marton Trencseni - Thu 04 December 2025 - Distributed

Introduction

In distributed systems, a few ideas have had an outsized influence on how we think about coordination, time, state, and agreement. Leslie Lamport contributed many of them: logical clocks, mutual exclusion, the Byzantine Generals problem, and eventually Paxos. Among these, Paxos stands out as — in my opinion — the most elegant algorithm for achieving consensus in a system that suffers from crashes, delays, and unreliable communication. It is the minimal, essential solution to a deceptively simple question: how can a group of machines agree on one value, even if several of them fail?

This post introduces Paxos, explains what problem it solves, how it fits into the history of distributed algorithms, how it compares to the Byzantine fault-tolerant protocols discussed previously on Bytepawn, and finally walks through a small toy implementation in Python using Flask. The goal is not to cover every aspect of Paxos, but rather to show why it matters and how the mechanics work in practice.

Before Paxos, Lamport explored the nature of agreement in a world where nodes might be deceptive or faulty. The Byzantine Generals problem formalized the idea of arbitrary (malicious) faults, where nodes can lie, collude, or send conflicting information. Algorithms in this model require heavy communication and at least $ 3f+1 $ replicas to tolerate $f$ faulty ones. Most real distributed systems, however, do not operate in an adversarial environment. Nodes crash. Packets are lost. Machines pause. Network links briefly fail. These failures are frustrating but not malicious. For this class of problems — called the crash-fault model — Lamport created Paxos, an algorithm that tolerates message loss, duplication, reordering, arbitrary delays, node crashes, and network partitions, while requiring only $ N = 2f+1 $ replicas. In other words, the algorithm works if a majority of nodes are able to communicate with one another. This shift from arbitrary-fault tolerance to crash-fault tolerance is a central turning point in the history of distributed coordination: it enables vastly simpler algorithms and clearer correctness proofs, yet still solves the form of consensus most systems actually need.

The code for this article is on GitHub.

Paxos Guarantees

Paxos solves what is usually called single-decree consensus: among $N$ nodes, agree on a single value, even if some subset of them crashes. The guarantee is first and foremost about safety: no two nodes ever decide on different values. Liveness, by contrast, is conditional — it requires that a majority of nodes be up and able to communicate with one another. Paxos willingly gives up progress during certain network failures in order to maintain safety. This leads to a subtle but important point: Paxos does not claim to make progress in all circumstances. If only a minority of nodes are reachable, Paxos simply halts. This is the correct behavior.

Much of Paxos's character can be seen in how it behaves during a network split. Consider a five-node system where a majority is three. If exactly three nodes can still reach each other, the algorithm continues normally. A proposal can gather three promises, then three accepts, and the value becomes chosen. If, however, the partition leaves only two nodes communicating, Paxos will not progress. It will remain safe but inactive. This outcome is not a limitation but a design choice rooted in correctness. A minority partition should not attempt to push the system forward. Progress must wait until a majority reconnects, at which point the protocol resumes and, crucially, cannot produce a value inconsistent with anything decided before the partition.

If a minority could choose a value, two different minorities might choose two different values. By requiring a strict majority for progress, Paxos ensures that any two majorities overlap, and therefore cannot contradict one another. In short, Paxos handles message loss, reordering, duplication, delays, and node crashes, and it behaves predictably under network partitions. It does not tolerate malicious behavior or nodes that arbitrarily violate the protocol; for that, you need Byzantine algorithms like PBFT.
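
It is worth spelling out the arithmetic behind this overlap: with $ N = 2f+1 $ nodes, a majority quorum has size $ f+1 $, so any two quorums $ Q_1 $ and $ Q_2 $ satisfy $ |Q_1 \cap Q_2| \ge |Q_1| + |Q_2| - N = 2(f+1) - (2f+1) = 1 $. At least one acceptor is therefore a member of both quorums, and it is exactly that acceptor's memory that a later proposer reads back during the prepare phase.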

The Paxos Algorithm

Paxos divides the work of consensus into three conceptual roles: proposers, acceptors, and learners. A single physical node may play all three roles, but separating them helps clarify the algorithm. The core idea is that proposers suggest values, acceptors provide the fault-tolerant “memory” of the system, and learners are informed once a value has been irrevocably chosen. Note that Paxos is not a leader-based algorithm, i.e. there is no assumption that only a single proposer is present in the system; many can be present and “competing” to get their value chosen.

The protocol runs in three phases. In Phase 1, a proposer selects a proposal number—an integer that must be unique and monotonically increasing—and sends a prepare request to the acceptors. When an acceptor receives such a request, it promises not to accept any proposals numbered lower than the one it has just seen. It also returns information about any previously accepted proposal: specifically, the highest-numbered proposal it has accepted so far and the value that accompanied it. These two pieces of state, the highest promised proposal number and the last accepted proposal number and value, are the essence of the acceptor's memory.

If the proposer receives promises from a majority of acceptors, it moves to Phase 2. Before sending out the proposed value, the proposer examines the acceptors' replies. If any acceptor reported having already accepted a proposal in the past, the proposer is obligated to adopt the value from the highest-numbered such proposal. If no acceptor has accepted anything yet, the proposer proceeds with its own initially chosen value. This rule ensures that once any value has been "in play" at a majority, no competing value can later override it.

The proposer then sends an accept request to the acceptors, including the proposal number and the chosen value. When an acceptor receives this request, it checks whether it has already promised to ignore lower-numbered proposals. If the proposal number is at least as large as the highest promise it has made, it accepts the proposal, storing both the proposal number and the value. If the proposer manages to collect acceptances from a majority of acceptors, the value is considered chosen. At that point, proposers or acceptors can notify learners, which record the chosen value. From that moment on, any further participants in the system that missed earlier messages can recover by communicating with any node that knows the chosen value; safety ensures that there can be no conflicting choice elsewhere.

In Phase 3, the proposer informs the learners of the chosen value. The algorithm guarantees that all learners learn the same value.

The reason Paxos needs both a prepare phase and a propose phase is that proposers must coordinate not only with a current majority of acceptors, but also with the history of what any majority may have accepted earlier. A single-phase protocol cannot safely do this. Imagine a proposer simply sending “accept this value” messages: if an earlier proposal had partially succeeded — for example, it reached some acceptors but not a full majority — a later proposer might unknowingly overwrite that value with a new one. Although any two majorities intersect, a later proposer in a single-phase design has no way to learn whether it should preserve a previously accepted value from an earlier, incomplete round. The prepare phase solves this by collecting information from a majority about the highest-numbered proposal they have accepted (if any). This forces proposers to carry forward any potentially chosen value. Only after this reconciliation step can the proposer safely enter the propose phase and actually ask acceptors to record the new proposal. The two-phase structure ensures that later rounds cannot accidentally discard information from earlier rounds, and this is exactly what preserves Paxos's safety guarantees.

This structure—two phases, majority quorums, and the careful rules governing promises and acceptance—guarantees the fundamental Paxos property: even in the presence of message loss, reordering, duplicated messages, pauses, and node crashes, no two different values can ever be chosen. A value may be delayed, retried, or blocked by a network split, but it cannot be contradicted. That is the heart of Paxos.

Replicated Logs and State Machine Replication

A single execution of Paxos chooses one value — in this sense it can be thought of as a write-once logical register replicated across the learners. How can that be useful, if the value, once chosen, can never change? The trick is to use subsequent rounds of Paxos to choose values that are actually database commands, and use these to modify local copies of databases on the learners. Assuming the learners start with the same initial local database (state), Paxos guarantees that the same commands (state transitions) will be executed by all of them. The sequence of replicated database commands forms a Replicated Log. This concept is also called State Machine Replication (SMR). Note that Paxos does not care about the nature of the database or the database commands: the database can be an RDBMS with SQL commands being replicated (CREATE TABLE ..., UPDATE ..., INSERT ...), or it can be a key-value database with simple SET commands.
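
To make this concrete, here is a minimal sketch (not part of the toy implementation below) of a replica that applies a log of chosen SET commands to a local key-value store; the class and command format are illustrative:

class ReplicatedKV:
    """A local state machine: applies chosen commands in log order."""

    def __init__(self):
        self.db = {}    # the local copy of the database
        self.log = []   # the replicated log of chosen commands

    def apply(self, command):
        # each Paxos instance chooses exactly one command, e.g. ("SET", "x", 42);
        # because every replica applies the same commands in the same order,
        # every replica ends up with identical db contents
        self.log.append(command)
        op, key, value = command
        if op == "SET":
            self.db[key] = value

replica = ReplicatedKV()
replica.apply(("SET", "x", 42))     # chosen by Paxos instance 0
replica.apply(("SET", "y", "foo"))  # chosen by Paxos instance 1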

This idea is the conceptual foundation for almost every replicated database and coordination system built since. My toy implementation performs only one round, intentionally so, as a way to keep the core of Paxos visible without bringing in slot numbers, log persistence, leader election, or multi-instance maintenance. In a follow-up post I'll extend it with these pieces and show how logs emerge from repeated consensus.

Code

To illustrate the algorithm, I wrote a minimal Paxos implementation using Python and Flask. At just over 200 LOC, it's a compact way to explore and play around with the distributed algorithm. Each node runs all three roles: proposer, acceptor, and learner, and the nodes communicate using simple HTTP calls. The endpoints correspond directly to the phases of Paxos:

  • /start kicks off a round of Paxos; it is routed to the proposer, which then calls the next three endpoints during the three phases of Paxos.
  • /prepare for Phase 1, this endpoint is routed to the acceptor
  • /propose for Phase 2, also routed to the acceptor
  • /learn for disseminating the chosen value, routed to the learner
  • /status to see the internal state of nodes

The three roles are implemented in separate Python classes: PaxosProposer, PaxosAcceptor and PaxosLearner. The Paxos algorithm's core logic is contained in the first two. First, let's see PaxosAcceptor:

class PaxosAcceptor:
    def __init__(self):
        ...
        self.state.promised_n = None
        self.state.accepted_n = None
        self.state.accepted_value = None

    def on_prepare(self, proposal_id):
        with self._lock:
            # Phase 1: promise not to accept any proposal numbered below proposal_id,
            # unless we have already promised a higher number to someone else
            if self.state.promised_n is None or proposal_id > self.state.promised_n:
                self.state.promised_n = proposal_id
                success = True
            else:
                success = False
            # the returned state carries any previously accepted (n, value) back to the proposer
            return success, self.state

    def on_propose(self, proposal_id, value):
        with self._lock:
            # Phase 2: accept the proposal unless we have promised a higher-numbered one
            if self.state.promised_n is None or proposal_id >= self.state.promised_n:
                self.state.promised_n = proposal_id
                self.state.accepted_n = proposal_id
                self.state.accepted_value = value
                success = True
            else:
                success = False
            return success, self.state

The acceptor class stores the three essential pieces of state defined by Paxos:

  • promised_n, the highest proposal number for which it has promised not to accept lower numbers;
  • accepted_n, the proposal number it has most recently accepted; and
  • accepted_value, the value corresponding to that accepted proposal.

The logic is simple: the acceptor honors previous promises and proposals. If a previous value was accepted, it returns it to the proposer.
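
To make these rules concrete, here is a hypothetical interaction with a single acceptor (construction details are elided above, and the return values are summarized in the comments):

acc = PaxosAcceptor()
acc.on_prepare(256)         # success: promises, promised_n = 256
acc.on_propose(256, "foo")  # success: accepts, accepted_n = 256, accepted_value = "foo"
acc.on_prepare(100)         # failure: 100 is below promised_n, the promise is refused
acc.on_prepare(512)         # success: promised_n = 512, and the returned state still carries (256, "foo")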

Next, let's see the three phases of the algorithm in PaxosProposer:

class PaxosProposer:
    ...

    def paxos_round(self, initial_value):
        self.increment_proposal_id()
        # phase 1: prepare
        prepare_responses = self._send_prepare(self.state.proposal_id)
        promises = [r for r in prepare_responses if r.get("success")]
        # n_majority is the quorum size: a strict majority of the n nodes
        if len(promises) < n_majority:
            return {
                "status": "failed_prepare",
                "reason": f"Only got {len(promises)} promises, need {n_majority}",
                "proposal_id": self.state.proposal_id,
                "prepare_responses": prepare_responses,
            }
        # if any acceptor already accepted a value, choose the one with the highest accepted_n
        chosen_value = initial_value
        highest_accepted_n = -1
        for r in promises:
            accepted_n = r["acceptor_state"].get("accepted_n")
            accepted_value = r["acceptor_state"].get("accepted_value")
            if accepted_n is not None and accepted_value is not None and accepted_n > highest_accepted_n:
                highest_accepted_n = accepted_n
                chosen_value = accepted_value
        # phase 2: propose
        propose_responses = self._send_propose(self.state.proposal_id, chosen_value)
        accepts = [r for r in propose_responses if r.get("success")]
        if len(accepts) < n_majority:
            return {
                "status": "failed_propose",
                ...
            }
        # phase 3: learn
        self._broadcast_learn(chosen_value)
        return {
            "status": "success",
            ...
        }

The proposer generates proposal IDs, performs Phase 1 by collecting promises from a majority, decides which value to propose in Phase 2 (falling back to previously accepted values if necessary), and broadcasts the learned value once a majority accepts it. The learner records the chosen value and asserts that no conflicting value can appear.
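
The networking helpers _send_prepare and _send_propose are elided above. A minimal sketch of what _send_prepare might look like, assuming one HTTP call per node using the requests library (the node_urls attribute and the payload shape are illustrative; the repo code may differ):

import requests

class PaxosProposer:
    ...

    def _send_prepare(self, proposal_id):
        # ask every node's acceptor for a promise; nodes that are down or
        # partitioned away simply do not contribute a response
        responses = []
        for url in self.node_urls:  # e.g. ["http://localhost:5000", "http://localhost:5001", ...]
            try:
                r = requests.post(f"{url}/prepare", json={"proposal_id": proposal_id}, timeout=1)
                responses.append(r.json())
            except requests.RequestException:
                pass
        return responses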

Finally, the PaxosLearner class is trivial:

class PaxosLearner:
    def __init__(self):
        ...
        self.state.chosen_value = None

    def learn(self, value):
        with self._lock:
            # in Paxos, the learner should never get two different chosen values
            if self.state.chosen_value is not None:
                assert self.state.chosen_value == value
            self.state.chosen_value = value
            return True, self.state

The learner just stores the chosen value. In a production implementation of Paxos, this value would be handed to the local database, so that the database command contained in chosen_value can be executed against it.
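
In the toy implementation the chosen value is just a string, but in a replicated-log setting where the value is a database command, the learner might hand it straight to a local replica, e.g. the hypothetical ReplicatedKV sketched earlier (the extra line is illustrative, not repo code):

    def learn(self, value):
        with self._lock:
            if self.state.chosen_value is not None:
                assert self.state.chosen_value == value
            self.state.chosen_value = value
            replica.apply(value)  # hypothetical: execute the chosen command on the local database
            return True, self.state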

The implementation accepts a value from a client via /start and carries it through a single execution of the Paxos algorithm.

@app.route("/start", methods=["POST"])
def start():
    data = request.get_json(force=True, silent=True) or {}
    if "value" not in data:
        return jsonify({"error": "Missing 'value' in JSON body"}), 400
    result = proposer.paxos_round(data["value"])
    return jsonify(result)
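
The remaining endpoints follow the same delegation pattern. As a hedged sketch (not the exact repo code), /prepare might hand the request to a module-level acceptor object, mirroring how /start uses the proposer, and return the fields the proposer expects:

@app.route("/prepare", methods=["POST"])
def prepare():
    data = request.get_json(force=True, silent=True) or {}
    success, state = acceptor.on_prepare(data["proposal_id"])
    # the proposer reads "success" and "acceptor_state" from this reply
    return jsonify({
        "success": success,
        "acceptor_state": {
            "promised_n": state.promised_n,
            "accepted_n": state.accepted_n,
            "accepted_value": state.accepted_value,
        },
    })

The /propose and /learn routes would delegate to acceptor.on_propose and learner.learn in the same way.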

If a majority of nodes are up, they will all converge on the same chosen value; if not, the protocol halts safely. Because this is single-decree Paxos, it resolves only one value, but this suffices to demonstrate the mechanics of the algorithm.

Executing the program

Running the code is straightforward. We specify the id of the node and the total number n of nodes on the command line; each node listens on localhost, port 5000+id. Start each node in its own terminal:

# start nodes:
python3 node.py 0 3
python3 node.py 1 3
python3 node.py 2 3

Then we can kick off a round of Paxos with some value:

# kick off a round of paxos:
curl -X POST http://localhost:5000/start -H "Content-Type: application/json" -d '{"value": "foo"}'

A Simple Trace: Three Nodes, One Proposal

Assume we have three nodes: node 0, 1 and 2. Each node runs a proposer, acceptor, and learner. A majority is any 2 out of 3. All acceptors start in the same empty state:

promised_n = None
accepted_n = None
accepted_value = None

A client sends a request to node 0:

curl -X POST http://localhost:5000/start -H "Content-Type: application/json" -d '{"value": "foo"}'

Node 0 acts as the proposer. In the toy implementation, each proposer maintains a proposal_id counter and bumps it by 256 for each round. Let's say node 0 chooses proposal number 256.
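
A minimal sketch of this numbering scheme (the repo code may differ slightly): the counter starts at the node's ID and grows by 256 per round, so proposal numbers are unique and increasing across nodes as long as there are at most 256 of them.

class PaxosProposer:
    def __init__(self, node_id):
        ...
        self.state.proposal_id = node_id  # node 0 starts at 0, node 2 starts at 2, ...

    def increment_proposal_id(self):
        # node 0 proposes 256, 512, ...; node 2 proposes 258, 514, ...
        # the low bits encode the node id, so no two nodes ever emit the same proposal number
        self.state.proposal_id += 256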

Phase 1: Prepare

Node 0 sends /prepare with proposal_id = 256 to all three nodes.

  • Node 0's acceptor sees promised_n = None, so it updates promised_n to 256 and replies success along with its state (accepted_n = None, accepted_value = None).
  • Node 1's acceptor does the same: sets promised_n = 256, returns success, no previously accepted value.
  • Node 2's acceptor likewise sets promised_n = 256, returns success, no accepted value.

Node 0 now has three promises, which is more than the required majority of 2. It inspects the replies: none of the acceptors have an accepted_n or accepted_value yet, so there is no prior value to preserve. The proposer keeps its original client value foo as the value to propose.

Phase 2: Propose

Node 0 now sends /propose with (proposal_id = 256, value = "foo") to all three nodes.

  • Each acceptor verifies that the incoming proposal number is at least as large as the promised_n it recorded in Phase 1. Since promised_n = 256 everywhere, and the proposal number is 256, the check passes.
  • Each acceptor updates accepted_n = 256 and accepted_value = "foo", and returns success.

Node 0 collects the responses. It received three success responses, which constitutes a majority. At that point, according to Paxos, value foo is chosen.

Phase 3: Learn

The proposer calls /learn with foo to all three nodes. Each learner updates its local chosen_value to foo. From now on, even if some messages were lost or delayed, the invariant is that no other value can ever be chosen in this instance of Paxos. The system has safely decided on foo.

A More Interesting Trace: Node Crash and Majority Intersection

Now consider a more realistic scenario. Again we have 3 nodes, but initially only nodes 0 and 1 are up; node 2 is down (crashed or partitioned away). The majority is still 2, so nodes 0 and 1 can make progress without node 2.

Part 1: Choosing foo with Only Two Nodes Up

A client sends a request to node 0 to propose foo:

curl -X POST http://localhost:5000/start -H "Content-Type: application/json" -d '{"value": "foo"}'

Node 0, as proposer, picks proposal number 256 again.

In Phase 1, node 0 sends /prepare(256) to all three nodes. Node 2 is down, so only nodes 0 and 1 reply. Both acceptors have promised_n = None initially, so both set promised_n = 256 and return success. Node 0 sees two promises, exactly the majority needed; Phase 1 succeeds. In Phase 2, node 0 sends /propose(256, "foo") to all three nodes. Again only nodes 0 and 1 respond. Both acceptors see a proposal number equal to their promised_n (256), so they set accepted_n = 256, accepted_value = "foo", and return success. Node 0 receives two success responses and concludes that foo is chosen. The learners on nodes 0 and 1 record foo as the chosen value. Node 2 is still down and knows nothing about this round.

At this point the system has safely chosen foo using only two nodes. Paxos guarantees that no other value will be chosen.

Part 2: Node 0 Goes Down, Node 2 Comes Back and Tries to Propose bar

Later, node 0 goes down but node 2 comes up for the first time or recovers from a crash. Its acceptor state is fresh: promised_n = None, accepted_n = None, accepted_value = None. Its learner has no chosen value yet.

A client now sends a request to node 2 to propose bar:

curl -X POST http://localhost:5002/start -H "Content-Type: application/json" -d '{"value": "bar"}'

Node 2 now acts as proposer. In the toy implementation, node 2's initial proposal_id is its node ID (2), and the proposer increments it by 256 before each round, so it chooses proposal number 258.

Phase 1: Prepare(258)

Node 2 sends /prepare(258) to all three nodes (but node 0 is down).

  • Node 1's acceptor currently has promised_n = 256, accepted_n = 256, accepted_value = "foo". It compares 258 with its current promised_n and sees that 258 is higher, so it updates promised_n = 258 and replies success, including its accepted state: (accepted_n = 256, accepted_value = "foo").
  • Node 2's own acceptor had never seen any proposal before, so it sets promised_n = 258 and returns success with no previously accepted value.

Node 2, as proposer, collects two successful responses. Among those responses, some acceptors report an accepted_n and accepted_value. In particular, node 1 reports accepted_n = 256, accepted_value = "foo". Node 2 now applies the Paxos rule: if any acceptor has already accepted a value, the proposer must adopt the value associated with the highest-numbered accepted proposal. Here the highest accepted_n is 256, with value foo. So although node 2's client asked it to propose bar, the proposer is now obligated to propose foo instead.

Phase 2: Propose(258, "foo")

Node 2 sends /propose(258, "foo") to all three nodes (node 0 is still down).

  • Node 1 has promised_n = 258 and previously accepted proposal 256 with foo. It sees that 258 is at least as large as its current promise, so it accepts: accepted_n = 258, accepted_value = "foo" (its accepted_value does not change).
  • Node 2's own acceptor has promised_n = 258 and no accepted value yet. It also accepts (258, "foo").

Node 2 receives success responses from a majority and concludes that foo is chosen (again).

Phase 3: Learn("foo")

It then sends /learn("foo") to all nodes (node 0, still down, misses the message). Each reachable learner sets its chosen_value to foo. Nodes 0 and 1 already recorded foo in the first round; node 2 now learns it for the first time.

The important observation is that node 2's client asked it to propose bar, but after Phase 1, node 2 realized that there was a “potentially chosen” value in the system, namely foo with proposal number 256. Paxos requires node 2 to respect that history and carry foo forward. As a result, the system never diverges: foo is the only value that is ever chosen for this instance, even though later proposers tried to introduce a different candidate. This second trace is where the two-phase structure and the majority intersection property really show their value. The prepare phase allowed node 2 to discover and preserve the earlier decision, and the propose phase committed that decision under a newer, higher proposal number.

How would the second client's bar value ever be chosen? In Paxos, if a previous value foo was chosen, bar will never be chosen in this instance. In practice, once a value is chosen, the nodes move on to a new, fresh, empty instance of Paxos (the next slot in the replicated log), where a different value can be chosen.

Conclusion

Paxos reconciles partial progress, failures, and restarts into a coherent, safe protocol that guarantees that once a value has any real chance of being chosen, it cannot be lost or contradicted. The toy implementation here exposes the mechanics in a clean and accessible way, but the same logic scales to real systems once persistence, repeated instances, leader optimization, and log replication are added. In follow-up posts, we will build on this foundation and extend the implementation into a true replicated log with durable storage and multiple slots. The underlying idea, however, is already visible in the toy implementation: with nothing more than majority quorums and Paxos's two-phase structure, the distributed system can achieve consistent agreement in the presence of faults.