Testing randomness extractors

Marton Trencseni - Sun 02 June 2024 - Data


In a previous post titled Randomness extractors: making fair coins out of biased coins, I wrote various implementations of randomness extractors:

  • von Neumann extractor: works when the input bits are independent, but have a constant bias
  • Parity extractor: works when the input bits are independent, but the bias varies from bit to bit
  • Blum extractor: works when the input can be modeled as a Markov chain
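
As a quick refresher, the core of the von Neumann trick can be sketched as follows. This is a minimal standalone version (the function name `von_neumann` is illustrative, not the `VonNeumannExtractor` class from the previous post):

```python
def von_neumann(bits):
    # Consume bits in non-overlapping pairs: emit 0 for (0,1), emit 1 for (1,0),
    # discard (0,0) and (1,1). If each input bit is independent with P(1) = p,
    # both accepted pairs have the same probability p*(1-p), so the output is unbiased.
    out = []
    for b1, b2 in zip(bits[::2], bits[1::2]):
        if b1 != b2:
            out.append(b1)
    return out

print(von_neumann([0, 1, 1, 0, 0, 0, 1, 1]))  # → [0, 1]
```

Note that the extractor discards input: with bias $p$, only a $2p(1-p)$ fraction of pairs produces an output bit.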

In this follow-up post, I will apply a suite of statistical randomness tests to these randomness extractor implementations.

The code is up on GitHub.

Testing for randomness

Given an $N$-long sequence of bits, how can we test whether it was produced by a series of independent, fair coin tosses? In principle, any given bit sequence of length $N$ is equally likely under the null hypothesis of being produced by a series of independent, fair coin tosses; for example, for $N=8$, 0000 0000 is just as likely as 1111 1111, which is just as likely as 0100 1011: all of them occur with probability $2^{-N}$. So, what can we do?

We can compute certain statistics about the bit sequence, where most bit sequences will yield certain typical values, and only a few special ones will yield unlikely values. The simplest example is to calculate the ratio of 0s (or 1s, same thing) in the bit string. As $N$ becomes very large, this ratio will be close to $0.5$ for most bit strings, and only a few will be far away from this average. Of course, it's possible even at $N=1024$ to get all 0s or all 1s, but the vast majority of bit sequences will have close to equal numbers of 0s and 1s. So, this is just like in A/B testing: we don't really know whether the input sequence was truly random or not, but we can compute a statistic, which tells us how unlikely the computed value is among all $2^N$ possible sequences. For example, for $N=1024$, the probability of the ratio being less than $0.4$ or greater than $0.6$ is vanishingly small. Note that probability in the previous sentence means "the ratio of bit strings for which the ratio of 0s is less than $0.4$ or greater than $0.6$ among all bit strings, assuming the bit strings were produced by independent fair coin tosses", which can be computed explicitly using binomial formulas, or by making a normality assumption under the Central Limit Theorem.
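
To make the "vanishingly small" claim concrete, the exact binomial tail probability for $N=1024$ can be computed in a few lines (this is just an illustration of the binomial formula mentioned above):

```python
from math import comb

N = 1024
# P(ratio < 0.4 or ratio > 0.6) under independent fair coin tosses:
# sum the binomial counts of getting k ones, for k outside [0.4*N, 0.6*N],
# and divide by the total number of bit strings, 2^N.
p_tail = sum(comb(N, k) for k in range(N + 1)
             if k < 0.4 * N or k > 0.6 * N) / 2 ** N
print(p_tail)  # vanishingly small, on the order of 1e-10
```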

What I have described above is the same concept I explain in more detail, with Monte Carlo simulations, in the post Building intuition for p-values and statistical significance. If the above wasn't clear, I recommend reading that longer explanation.

It would be fairly easy to come up with a few such randomness tests. Computing the statistics could be done by evaluating formulas, or, by the Monte Carlo method: creating a large number of bit sequences of length N under the null hypothesis of independent fair coin tosses, and measuring the distribution of the statistic. Fortunately, we don't have to do either. NIST, the US National Institute of Standards and Technology, has created a suite of such tests, available online, with an accompanying paper titled A Statistical Test Suite for Random and Pseudorandom Number Generators for Cryptographic Applications describing all 15 tests. There are various implementations of these tests; I used the ones from github.com/stevenang/randomness_testsuite.

The 15 tests in the NIST suite are:

  • Frequency Test: Monobit
  • Frequency Test: Block
  • Runs Test
  • Test for the Longest Runs of Ones in a Block
  • Binary Matrix Rank Test
  • Discrete Fourier Transform (Spectral Test)
  • Non-Overlapping Template Matching Test
  • Overlapping Template Matching Test
  • Maurer's Universal Statistical Test
  • Linear Complexity Test
  • Serial Test
  • Approximate Entropy Test
  • Cumulative Sums Test
  • Random Excursions Test
  • Random Excursions Variant Test

The Monobit test is the simplest possible ratio test I described above. The Block Frequency test is the same idea applied to $M$-bit blocks: within each block, is the proportion of 1s close to $1/2$, to within a reasonable degree? The Runs test looks at the statistics of runs of 0s and 1s, and so on.
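
For illustration, the Monobit test itself is only a few lines. This is a sketch of the statistic as defined in the NIST paper; a sequence passes if the p-value is at least $0.01$:

```python
from math import erfc, sqrt

def monobit_p_value(bits: str) -> float:
    # S_n = (#1s - #0s); under the null hypothesis, S_n / sqrt(n) is
    # approximately standard normal, so the two-sided p-value is
    # erfc(|S_n| / sqrt(2 * n)).
    n = len(bits)
    s = sum(1 if b == '1' else -1 for b in bits)
    return erfc(abs(s) / sqrt(2 * n))

print(monobit_p_value('01' * 512))  # 1.0: perfectly balanced, passes
print(monobit_p_value('0' * 1024))  # ~0.0: fails the test
```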


Python's built-in random.random() is not a true random number generator: Python uses the Mersenne Twister as the core generator. It produces 53-bit precision floats and has a period of $2^{19937}-1$. The underlying implementation in C is both fast and threadsafe. The Mersenne Twister is one of the most extensively tested random number generators in existence. However, being completely deterministic, it is not suitable for all purposes, and is completely unsuitable for cryptographic purposes.
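
The determinism is easy to demonstrate: re-seeding the generator replays the exact same sequence, which is why the Mersenne Twister, despite its excellent statistical properties, is unusable for cryptography:

```python
import random

random.seed(42)
first = [random.random() for _ in range(5)]
random.seed(42)
second = [random.random() for _ in range(5)]
print(first == second)  # True: the "random" stream is fully reproducible
```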

First, in order to get better random numbers, I implemented a class that uses os.urandom(). The class uses an internal cache to avoid making slow system calls repeatedly:

from os import urandom
from struct import unpack

class RandomGenerator:
    def __init__(self):
        self.bit_cache = []
        self.byte_cache = []
        self.cache_size = 128  # Number of bytes to retrieve at a time

    def _fill_bit_cache(self):
        random_bytes = urandom(self.cache_size)
        self.bit_cache = [
            (byte >> i) & 1
            for byte in random_bytes
            for i in range(8)
        ]

    def _fill_byte_cache(self):
        self.byte_cache = list(urandom(self.cache_size))

    def random_bit(self) -> int:
        if not self.bit_cache:
            self._fill_bit_cache()
        return self.bit_cache.pop(0)

    def random_float(self) -> float:
        if len(self.byte_cache) < 8:
            self._fill_byte_cache()
        random_bytes = bytes(self.byte_cache[:8])
        self.byte_cache = self.byte_cache[8:]
        # interpret 8 random bytes as an unsigned 64-bit integer, scaled to [0, 1)
        random_float = unpack('>Q', random_bytes)[0] / (1 << 64)
        return random_float

rg = RandomGenerator()

def random_bit() -> int:
    return rg.random_bit()

def random_float() -> float:
    return rg.random_float()

I then replaced calls to random.random() with my random_float() in the bit sources from the previous post. I wrote a simple wrapper for the test suite functions used:

def run_tests(r):
    funcs = [
        # ... the other NIST test functions from the suite go here ...
        lambda r: Serial.serial_test(r)[0] and Serial.serial_test(r)[1],
        lambda r: CumulativeSums.cumulative_sums_test(r, 0),
        lambda r: CumulativeSums.cumulative_sums_test(r, 1),
    ]
    num_passed = 0
    for f in funcs:
        result = f(r)  # each test returns a (p_value, passed) tuple
        num_passed += int(result[1])
    passed = (num_passed >= 10)
    if passed:
        print(f'PASSED: The bit sequence appears to be a sequence of independent fair coin tosses ({num_passed}/{len(funcs)} passed) ✅')
    else:
        print(f'FAILED: The bit sequence does not appear to be a sequence of independent fair coin tosses ({num_passed}/{len(funcs)} passed) ❌')

Throughout the tests, I generate bit sequences of length N = 8*1024 bits.

First, a trivial test: is the all-0s sequence random?

r = ''.join(['0' for _ in range(N)])

As expected, it yields:

FAILED: The bit sequence does not appear to be a sequence of independent fair coin tosses 

Next, let's see if the built-in random.random() passes the tests:

r = ''.join([str(round(random())) for _ in range(N)])

The result varies between runs; sometimes all tests pass, but usually 12-13 out of the 14 tests pass:

PASSED: The bit sequence appears to be a sequence of independent fair coin tosses (12/14 passed) 

To my surprise, the stronger random implementation does not do much better, and passes around the same number of tests:

r = ''.join([str(random_bit()) for _ in range(N)])

Prints something like:

PASSED: The bit sequence appears to be a sequence of independent fair coin tosses (13/14 passed) 

With these, we can create all sorts of interesting tests. For example, let's verify that our von Neumann extractor works as expected:

bit_source = BiasedBitSource(0.9)
extractor = VonNeumannExtractor(bit_source)

r = ''.join([str(extractor.random()) for _ in range(N)])


PASSED: The bit sequence appears to be a sequence of independent fair coin tosses (13/14 passed) 

Another example, let's create a Markov chain which is very likely to produce long runs of identical bits:

class ExampleMarkovChainBitSource(MarkovChainBitSource):
    def __init__(self):
        self.transitions = {
            'A': {0: ('A', 0.99), 1: ('B', 0.01)},
            'B': {0: ('A', 0.01), 1: ('B', 0.99)},
        }

    def get_transitions(self, state: str) -> Dict[int, Tuple[str, float]]:
        return self.transitions[state]
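
To see just how sticky this chain is, we can simulate it and measure the average run length. The class above relies on the MarkovChainBitSource base class from the previous post, so here is a self-contained re-implementation of the same transition table (the `simulate` helper is illustrative, not the previous post's API):

```python
import random

random.seed(0)  # fixed seed for reproducibility
transitions = {
    'A': {0: ('A', 0.99), 1: ('B', 0.01)},
    'B': {0: ('A', 0.01), 1: ('B', 0.99)},
}

def simulate(num_bits):
    state, bits = 'A', []
    for _ in range(num_bits):
        # take the 0-emitting transition with its probability, else the 1-emitting one
        zero_state, zero_p = transitions[state][0]
        if random.random() < zero_p:
            state, bit = zero_state, 0
        else:
            state, bit = transitions[state][1][0], 1
        bits.append(bit)
    return bits

bits = simulate(100_000)
num_runs = 1 + sum(1 for b1, b2 in zip(bits, bits[1:]) if b1 != b2)
print(len(bits) / num_runs)  # average run length, close to 1/0.01 = 100
```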

Let's try various tests:

bit_source = ExampleMarkovChainBitSource()

# 1. raw Markov chain output
r = ''.join([str(bit_source.next()) for _ in range(N)])

# 2. von Neumann extractor
extractor = VonNeumannExtractor(bit_source)
r = ''.join([str(extractor.random()) for _ in range(N)])

# 3. Parity extractor
extractor = ParityExtractor(bit_source=bit_source, N=32)
r = ''.join([str(extractor.random()) for _ in range(N)])

# 4. Blum extractor
extractor = BlumExtractor(bit_source)
r = ''.join([str(extractor.random()) for _ in range(N)])

The first 3 will fail, while the last one will pass. Only the BlumExtractor, which has access to the internal state of the Markov chain, is able to extract good random bits.
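
The Blum construction can be sketched as the von Neumann trick applied per state: bits emitted while the chain sits in a given state are independent with a fixed bias, so pairing them state-by-state restores fairness. A minimal standalone sketch, assuming the extractor can observe the chain's state alongside each emitted bit (the name `blum_extract` is illustrative, not the previous post's BlumExtractor API):

```python
from collections import defaultdict

def blum_extract(state_bit_pairs):
    # Group bits by the state that emitted them; within each state the bits
    # are i.i.d. with a fixed bias, so the von Neumann trick applies per state.
    pending = defaultdict(list)  # state -> not-yet-paired bits
    out = []
    for state, bit in state_bit_pairs:
        pending[state].append(bit)
        if len(pending[state]) == 2:
            b1, b2 = pending[state]
            pending[state] = []
            if b1 != b2:  # von Neumann: keep only unequal pairs
                out.append(b1)
    return out

# Example: bits tagged with the state that produced them
print(blum_extract([('A', 0), ('B', 1), ('A', 1), ('B', 1), ('A', 0), ('A', 0)]))  # → [0]
```

Without the state tags, bits from different states get paired together and the pairs are no longer exchangeable, which is why the stateless von Neumann and Parity extractors fail on this source.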


In the next post I will investigate why some of the randomness tests fail even though we're using os.urandom() here. An even stronger option to try is to retrieve randomness from random.org.