Exploring Lamport's Bakery algorithm in C++20

Marton Trencseni - Sun 06 July 2025 - Programming

Introduction

The Bakery algorithm was discovered by Leslie Lamport in 1974 and described in his seminal paper A New Solution of Dijkstra's Concurrent Programming Problem. The algorithm solves the mutual exclusion problem: given n processes with shared memory, how can they cooperate so that each one can execute a critical section of code without any other process executing its own critical section at the same time? Effectively, the algorithm implements a locking mechanism: each process acquires the lock before entering its critical section, and releases it afterwards. On his publications page, Lamport adds the following comments:

This paper describes the bakery algorithm for implementing mutual exclusion. I have invented many concurrent algorithms. I feel that I did not invent the bakery algorithm, I discovered it. Like all shared-memory synchronization algorithms, the bakery algorithm requires that one process be able to read a word of memory while another process is writing it. (Each memory location is written by only one process, so concurrent writing never occurs.) Unlike any previous algorithm, and almost all subsequent algorithms, the bakery algorithm works regardless of what value is obtained by a read that overlaps a write. If the write changes the value from 0 to 1, a concurrent read could obtain the value 7456 (assuming that 7456 is a value that could be in the memory location). The algorithm still works. I didn't try to devise an algorithm with this property. I discovered that the bakery algorithm had this property after writing a proof of its correctness and noticing that the proof did not depend on what value is returned by a read that overlaps a write. I don't know how many people realize how remarkable this algorithm is. Perhaps the person who realized it better than anyone is Anatol Holt, a former colleague at Massachusetts Computer Associates. When I showed him the algorithm and its proof and pointed out its amazing property, he was shocked. He refused to believe it could be true. He could find nothing wrong with my proof, but he was certain there must be a flaw. He left that night determined to find it. I don't know when he finally reconciled himself to the algorithm's correctness.

The code for the article is on GitHub.

Pseudocode

The pseudocode implementation of the Bakery algorithm is as follows:

# Lamport’s Bakery algorithm for n concurrent processes
Choosing[1 .. n] : boolean    # true  ⇢  process is choosing its number
Numbers [1 .. n] : integer    # 0     ⇢  process is not competing

procedure lock(i) # called by process i (1-based index)
    Choosing[i] := true
    Numbers[i] := 1 + max(Numbers[1 .. n]) # take the next “ticket”
    Choosing[i] := false

    for j := 1 to n do
        # wait while j is still choosing
        while Choosing[j] do
            yield
        # wait while (ticket[j], j) < (ticket[i], i)
        while Numbers[j] != 0
              and (Numbers[j] < Numbers[i]                  # lower number
                   or (Numbers[j] = Numbers[i] and j < i )) # same number, lower ID                                 
        do
            yield
    end for
end procedure


procedure unlock(i) # called by process i when leaving critical section
    Numbers[i] := 0
end procedure

When a process calls lock(i) it first raises Choosing[i] to announce, “I’m about to take a ticket.” While that flag is true every other process is obliged to spin, so nobody will look at Numbers[i] until it is final. Process i then writes its ticket as 1 + max(Numbers[1..n]), drops Choosing[i] to false, and enters the scan over all peers. For each peer j it does two checks: (1) it waits until Choosing[j] is false, guaranteeing that any ticket in Numbers[j] is fully written, and (2) it waits while j’s lexicographic pair (Numbers[j], j) is strictly smaller than its own. Because these pairs form a total order, at most one process can find itself “smallest,” so only that process can pass the entire loop and enter the critical section.

This two-phase approach is what makes the algorithm safe on hardware where a read concurrent with a write can return an arbitrary value. Choosing[] acts as a one-bit publish/subscribe system: a process must set the bit before writing its ticket and clear it after, and other processes must not trust the ticket until they observe the bit cleared. This guarantees that everyone compares only stable tickets, prevents two processes from both thinking they hold the smallest pair, and thus enforces mutual exclusion without needing any atomic read-modify-write operations.
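
The lexicographic comparison on (ticket, id) pairs maps directly onto tuple comparison in C++. As a minimal illustration (the helper name has_priority is mine, not part of the algorithm):

#include <cstddef>
#include <tuple>

// True if process j's pair (ticket_j, j) is strictly smaller than process
// i's pair (ticket_i, i), i.e. j goes before i. Callers also exclude
// ticket_j == 0, which means j is not competing at all.
bool has_priority(std::size_t ticket_j, std::size_t j,
                  std::size_t ticket_i, std::size_t i)
{
    return std::tie(ticket_j, j) < std::tie(ticket_i, i);
}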

Naive C++ implementation

Implementing the Bakery algorithm in C++20 is not as straightforward as it seems. Below is an incorrect implementation:

class bakery_mutex_naive
{
public:
    explicit bakery_mutex_naive(std::size_t n)
        : choosing(n, 0), number(n, 0)
    {        
    }

    std::size_t lock(std::size_t id)
    {
        std::size_t my_ticket = announce_intent(id);
        wait_acquire(id);
        return my_ticket;
        // can enter critical section
    }

    void unlock(std::size_t id)
    {
        number[id] = 0;
    }

private:
    std::size_t get_max_ticket() const
    {
        std::size_t max_ticket = 0;
        for (std::size_t t : number)
            max_ticket = std::max(max_ticket, t);
        return max_ticket;
    }

    std::size_t announce_intent(std::size_t id)
    {
        choosing[id] = 1;
        // pick ticket = 1 + max(number)
        std::size_t max_ticket = get_max_ticket() + 1;
        number[id] = max_ticket;
        choosing[id] = 0; // done choosing
        return max_ticket;
    }

    void wait_acquire(std::size_t id)
    {
        const std::size_t n = choosing.size();
        for (std::size_t j = 0; j < n; ++j) {
            if (j == id) continue;
            // wait while j is still choosing
            while (choosing[j])
                std::this_thread::yield();
            // wait while (ticket[j], j) < (ticket[i], i)
            while (true) {
                std::size_t tj = number[j];
                if (tj == 0) break;
                std::size_t ti = number[id];
                if (tj >  ti) break;
                if (tj == ti && j > id) break;
                std::this_thread::yield();
            }
        }
    }

    std::vector<std::uint64_t>  choosing;   // 0 = false, 1 = true
    std::vector<std::uint64_t>  number;     // ticket values
};

We can test our implementation by creating a number of threads and a critical section that yields the correct result if mutual exclusion works, but will likely yield incorrect results otherwise. For this sample, we simply increment a counter. If mutual exclusion does not work, eventually two threads will try to increment the memory location at the same time (read-increment-write), read the same value, and produce an overall increment of just 1 instead of 2:

int main(int argc, char* argv[])
{
    constexpr std::size_t DEFAULT_THREADS = 16;
    constexpr std::size_t DEFAULT_LOOPS   = 1000*1000;

    std::size_t num_threads = DEFAULT_THREADS;
    std::size_t num_loops   = DEFAULT_LOOPS;

    if (argc >= 2 && !parse_positive(argv[1], num_threads)) {
        std::cerr << "Usage: " << argv[0] << " [n_threads] [iterations]\n";
        return 1;
    }
    if (argc >= 3 && !parse_positive(argv[2], num_loops)) {
        std::cerr << "Usage: " << argv[0] << " [n_threads] [iterations]\n";
        return 1;
    }

    bakery_mutex_naive mtx{num_threads};
    //bakery_mutex_atomic mtx{num_threads};
    //bakery_mutex_bounded mtx{num_threads, 1u << 16};

    std::uint64_t shared_counter = 0;
    std::size_t   max_ticket     = 0;

    auto worker = [&](std::size_t id)
    {
        for (std::size_t k = 0; k < num_loops; ++k) {
            std::size_t t = mtx.lock(id);
            ++shared_counter;                   // critical section
            max_ticket = std::max(max_ticket, t);
            mtx.unlock(id);
        }
    };

    std::vector<std::thread> threads;
    threads.reserve(num_threads);
    for (std::size_t i = 0; i < num_threads; ++i)
        threads.emplace_back(worker, i);

    for (auto& t : threads) t.join();

    auto expected = num_threads * num_loops;
    auto good = expected == shared_counter;
    std::cout << "Threads:    " << num_threads         << '\n'
              << "Iterations: " << num_loops           << '\n'
              << "Expected:   " << expected            << '\n'
              << "Observed:   " << shared_counter      << '\n'
              << "Max ticket: " << max_ticket          << '\n'
              << (good ? "Passed!" : "FAILED!")        << '\n'
              ;
}
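
The parse_positive helper used above is not shown here; a minimal sketch of one possible implementation using std::from_chars (the real one is in the GitHub repo and may differ):

#include <charconv>
#include <cstddef>
#include <cstring>

// Parses a strictly positive integer from a C string; returns false on
// malformed input, trailing characters, or zero.
bool parse_positive(const char* s, std::size_t& out)
{
    std::size_t value = 0;
    auto [ptr, ec] = std::from_chars(s, s + std::strlen(s), value);
    if (ec != std::errc{} || *ptr != '\0' || value == 0)
        return false;
    out = value;
    return true;
}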

After compiling with g++ -std=c++20 -O3 -flto -march=native -DNDEBUG -pthread bakery.cpp -o bakery we can run and time the code with time ./bakery <num_threads> <num_loops>.

On x86_64 it mostly works:

Threads:    8
Iterations: 10000
Expected:   80000
Observed:   80000
Max ticket: 77888
Passed!

real    0m0.076s
user    0m0.276s
sys     0m0.218s

... but it's not too hard to find runs where it breaks:

Threads:    8
Iterations: 100000
Expected:   800000
Observed:   799997
Max ticket: 793381
FAILED!

real    0m0.475s
user    0m1.706s
sys     0m2.039s

Memory access and correctness

The C++ code shown is a straightforward transcription of Lamport's Bakery algorithm, yet it doesn't work. Why not? The naive version relies on plain reads and writes to the shared choosing and number arrays, so under the C++ memory model these concurrent accesses are data races, which is undefined behaviour, with no ordering guarantees whatsoever. On a weakly ordered CPU such as ARM or RISC-V the two stores inside announce_intent() can be observed out of order by other cores: a competing thread may see choosing[j] == 0 (meaning “I’m done choosing”) while it still sees the old ticket value in number[j]. That violates the algorithm’s fundamental invariant that every process publishes its ticket before dropping the flag, allowing two threads to believe they both hold the smallest (ticket, id) pair and to enter the critical section simultaneously. In C++, we must use std::atomic with release/acquire semantics (or an explicit memory fence) to enforce the required ticket-then-flag order on every architecture.
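
To make the missing edge concrete, here is a minimal sketch of a release/acquire pairing that restores just this ticket-then-flag order. This is an illustration, not the article's implementation: the bakery proof also leans on a single global order of operations, which is why the full version below uses the stronger memory_order_seq_cst throughout.

#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

std::vector<std::atomic<std::size_t>> number(16);
std::vector<std::atomic<bool>>        choosing(16);

void publish_ticket(std::size_t id, std::size_t my_ticket)
{
    // Write the ticket first, then drop the flag with a release store.
    number[id].store(my_ticket, std::memory_order_relaxed);
    choosing[id].store(false, std::memory_order_release); // pairs with acquire below
}

void wait_not_choosing(std::size_t j)
{
    // A reader that acquire-loads the flag as false is guaranteed to also
    // observe the ticket stored before the release above.
    while (choosing[j].load(std::memory_order_acquire))
        std::this_thread::yield();
}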

Another interesting detail related to memory access: why are we not using std::vector<bool> for the choosing array? This is easy to see by making the change to the code and running the program; on my computer, it results in a deadlock. The explanation is the following: std::vector<bool> isn’t an array of independent booleans, it’s an STL optimization that packs eight flags into one byte and turns every write into a read-modify-write on that shared byte. In Lamport’s Bakery algorithm each thread must own its own choosing[i] slot, yet with bit-packing two different threads end up flipping bits in the same underlying byte, creating silent write-write races. The result is that one thread can lose another’s update and spin forever, giving the appearance of a deadlock even though the algorithm is correct. Using a plain uint64_t per slot guarantees that every writer has a private word, eliminating those hidden data hazards and keeping the code’s behaviour true to the original proof.
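
The shared-byte hazard is easy to demonstrate in isolation; a minimal self-contained sketch (note it contains a deliberate data race, i.e. undefined behaviour, so outcomes will vary by compiler and CPU):

#include <iostream>
#include <thread>
#include <vector>

int main()
{
    std::vector<bool> flags(2, false);   // both elements packed into one byte
    auto flip = [&flags](std::size_t i) {
        for (int k = 0; k < 10'000'000; ++k)
            flags[i] = !flags[i];        // read-modify-write on the shared byte
    };
    std::thread a(flip, 0), b(flip, 1);  // two threads, two *different* indices
    a.join();
    b.join();
    // Each flag was flipped an even number of times, so both should be
    // false again; lost updates from the write-write race can leave
    // either one true.
    std::cout << flags[0] << ' ' << flags[1] << '\n';
}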

ISO C++20 implementation

The naive solution can be fixed by using the std::atomic type and memory fencing. By promoting the shared arrays to std::atomic and using memory_order_seq_cst, this version upgrades the code to fully standards-compliant and portable. Every write of a ticket is now a release-style event that, thanks to the subsequent fence, is guaranteed to become visible on all cores before the corresponding choosing = false store; every thread that spins on choosing[j] performs an acquire-style load and therefore sees that ticket in the required order. Because the C++ memory model now formally defines these inter-thread happens-before edges, neither the compiler nor weakly ordered CPUs such as ARM or RISC-V can reorder the operations and violate the algorithm’s invariant. Note that with std::atomic<bool>, it is safe to use a std::vector for the choosing array: std::vector<std::atomic<bool>> stores each flag as a separate atomic object, so there is no bit-packing:

class bakery_mutex_atomic
{
public:
    explicit bakery_mutex_atomic(std::size_t n)
        : choosing(n), number(n)
    {
        for (auto& c : choosing)
            c.store(false, std::memory_order_relaxed);
        for (auto& t : number)
            t.store(0,     std::memory_order_relaxed);
    }

    std::size_t lock(std::size_t id)
    {
        auto my_ticket = announce_intent(id); // choose ticket
        wait_acquire(id);                     // wait for turn
        return my_ticket;                     // critical section may begin
    }

    void unlock(std::size_t id)
    {
        number[id].store(0, std::memory_order_seq_cst); // release ticket
    }

private:
    static inline void full_barrier()
    {
        std::atomic_thread_fence(std::memory_order_seq_cst);
    }

    std::size_t get_max_ticket() const
    {
        std::size_t max_ticket = 0;
        for (const auto& t : number)
            max_ticket = std::max(max_ticket, t.load(std::memory_order_seq_cst));
        return max_ticket;
    }

    std::size_t announce_intent(std::size_t id)
    {
        choosing[id].store(true, std::memory_order_seq_cst);
        // pick ticket = 1 + max(number)
        std::size_t my_ticket = get_max_ticket() + 1;
        number[id].store(my_ticket, std::memory_order_seq_cst);
        choosing[id].store(false, std::memory_order_seq_cst);  // done choosing
        full_barrier(); // make ordering global
        return my_ticket;
    }

    void wait_acquire(std::size_t id)
    {
        const std::size_t n = choosing.size();
        for (std::size_t j = 0; j < n; ++j) {
            if (j == id) continue;
            // wait while j is still choosing
            while (choosing[j].load(std::memory_order_seq_cst))
                std::this_thread::yield();
            // wait while (ticket[j], j) < (ticket[i], i)
            while (true) {
                std::size_t tj = number[j].load(std::memory_order_seq_cst);
                if (tj == 0) break;
                std::size_t ti = number[id].load(std::memory_order_seq_cst);
                if (tj >  ti) break;
                if (tj == ti && j > id) break;
                std::this_thread::yield();
            }
        }
    }

    std::vector<std::atomic<bool>>          choosing;  // true = thread is in doorway
    std::vector<std::atomic<std::size_t>>   number;    // ticket values
};

Bounding ticket sizes

Throughout the article, the program output has been showing the maximum ticket seen, which is usually a number close to num_threads * num_loops. In a real-world production setting, this number could overflow and cause the algorithm to break. This is relatively easy to handle if we can assume that there are far fewer threads than the maximum ticket value we can store. In that case, as tickets approach the limit, we can simply create a synchronization point where all threads wait until the others exit their critical sections and reset their tickets to 0:

    void synchronized_wait_outside()
    {
        auto max_ticket = get_max_ticket();
        if (max_ticket > max_ticket_allowed - choosing.size()) {
            while (get_max_ticket() > 0) {
                // wait until everybody unlocks and all tickets reset to 0
                std::this_thread::yield();
            }
        }
    }

and add one line to lock():

    std::size_t lock(std::size_t id)
    {
        synchronized_wait_outside();          // assure ticket numbers bounded
        auto my_ticket = announce_intent(id); // choose ticket
        wait_acquire(id);                     // wait for turn
        return my_ticket;                     // critical section may begin
    }
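
For reference, this implies a constructor taking the bound, matching the bakery_mutex_bounded mtx{num_threads, 1u << 16} line in main() earlier. A minimal sketch, assuming the rest of the class mirrors bakery_mutex_atomic (the full source is on GitHub):

#include <atomic>
#include <cstddef>
#include <vector>

class bakery_mutex_bounded
{
public:
    bakery_mutex_bounded(std::size_t n, std::size_t max_ticket_allowed_)
        : choosing(n), number(n), max_ticket_allowed(max_ticket_allowed_)
    {
        for (auto& c : choosing) c.store(false, std::memory_order_relaxed);
        for (auto& t : number)   t.store(0,     std::memory_order_relaxed);
    }

    // lock()/unlock() as in bakery_mutex_atomic, with lock() first calling
    // synchronized_wait_outside() as shown above.

private:
    std::vector<std::atomic<bool>>        choosing;  // true = thread is in doorway
    std::vector<std::atomic<std::size_t>> number;    // ticket values
    const std::size_t                     max_ticket_allowed;
};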

The actual max_ticket_allowed is passed in the constructor and is set to 2^16 here to simulate the limitations of a 16-bit register. Running the program then prints:

Threads:    16
Iterations: 1000000
Expected:   16000000
Observed:   16000000
Max ticket: 65521
Passed!

Conclusion

Lamport’s Bakery algorithm shows how a neat idea on paper can fail if we ignore how real CPUs and compilers reorder memory. A plain “read-write” version may pass every test on x86, yet slip on ARM because another core can see the flag drop before the ticket is written. Turning the shared arrays into std::atomic and using the right memory orders keeps the ticket-then-flag sequence correct on every machine, without changing the core logic. Add a limit on ticket growth and keep each flag in its own word, and the lock stays safe, fast, and portable. The exercise is a good reminder that undefined behaviour and weak memory models are not corner cases: they are bugs waiting to happen.