Wordcount III: Beating the system `wc`

Marton Trencseni - Fri 29 September 2023 - Programming

Introduction

In the previous two articles I re-implemented wc is modern C++ as a toy exercise:

I had three reasons to write a third version.

First, my core architectural takeaways at the end of the second article was since wc is a fully specified tool that is not going to see new features introduced, a lot of the usual C++ abstractions are not neccesary. For example, the code doesn't need to follow the Open–closed principle and be open for extension, but closed for modification. So all the complexity introduced in the second version isn't good complexity.

Second, I also found that the second (and also first) implementation loses time compared to my system wc in the following for_each loop (same for different ways of iterating a C++ container):

unsigned process_block(
    const char* cp, unsigned n, stream_counter_vector_t& counters, bool eof) {
    long remaining = n;
    std::mbstate_t ps{};
    while (remaining > 0) {
        // ... code omitted
        std::for_each(counters.begin(), counters.end(),                    <-----
            [&](auto& c) { c->process_wchar(wc, num_bytes, error); });     <-----
        // ... code omitted
    }
    return remaining;
}

Third, I found that the system wc has a trick up its sleeve when doing line counting. Instead of processing each character and checking whether it's a \n, it uses the memchr() family of functions. memchr() takes a buffer of character and finds the first occurence of a character, in this case \n, and is much faster than doing the same in a C loop.

Simples classes and code

To keep this implementation simple I used a very simple base counter class, and derived all other implementation classes from this:

class counter_t {
    public:
    void reset() { count = 0; active = true; }
    bool active = false;
    long count = 0;
};

For example, the class counting bytes is now as simple as:

lass byte_counter_t : public counter_t {
    public:
    void process_file(const std::string fname) {
        struct stat st;
        stat(fname.c_str(), &st);
        count = st.st_size;
    }
    void process_wchar(const wchar_t, unsigned num_bytes, bool) {
        count += num_bytes;
    }
};

Since I realized extensibility is not required here, the class itself stores whether the count should happen or not. The active flag is flipped based on the command-line parameters, and checked in the internal loops. This means that the dynamic for loop is now replaced with compiled ifs:

void process_wchar(const wchar_t wc, unsigned num_bytes, bool error) {
    if (byte_counter.active)
        byte_counter.process_wchar(wc, num_bytes, error);
    if (char_counter.active)
        char_counter.process_wchar(wc, num_bytes, error);
    if (line_counter.active)
        line_counter.process_wchar(wc, num_bytes, error);
    if (longest_line_counter.active)
        longest_line_counter.process_wchar(wc, num_bytes, error);
    if (word_counter.active)
        word_counter.process_wchar(wc, num_bytes, error);
}

As I will show shortly, this yield much faste code. Obviously this is repeated code that goes against the DRY principle, but here it's permissible.

Lastly, the use of memchr() for in linecounter:

unsigned process_block_lines_only(
    const char* cp, unsigned n, bool eof) {
    void* cur = static_cast<void*>(const_cast<char*>(cp));
    while (cur != nullptr) {
        auto remaining = n - ((char*)cur - cp);
        cur = memchr(static_cast<void*>(cur), '\n', remaining);
        if (cur != nullptr) {
            line_counter.count++;
            cur++;
        }
    }
    return 0;
}

Headers

To further clean up the code, I factored the command-line parsing and tabulation code into a separate headers clargs.hpp and tabular.hpp. The complete code is up on Github.

Performance

With the above changes and optimizations, the code now runs about 10% faster than my system wc when running all 5 counters in file mode:

$ time wc -c -m -l -w -L playground/MNIST_data/t10k-images-idx3-ubyte.gz
   6121   34669  874589 1648877     402 playground/MNIST_data/t10k-images-idx3-ubyte.gz

real    0m0.101s
user    0m0.100s
sys     0m0.000s

$ time ./wcpp_v3 -c -m -l -w -L playground/MNIST_data/t10k-images-idx3-ubyte.gz
    6121   34669  874589 1648877     402 playground/MNIST_data/t10k-images-idx3-ubyte.gz

real    0m0.089s
user    0m0.088s
sys     0m0.000s

Results are similar when reading from stdin:

$ time cat playground/MNIST_data/t10k-images-idx3-ubyte.gz | wc -c -m -l -w -L
   6121   34669  874589 1648877     402

real    0m0.101s
user    0m0.104s
sys     0m0.000s

$ time cat playground/MNIST_data/t10k-images-idx3-ubyte.gz | ./wcpp_v3 -c -m -l -w -L
    6121   34669  874589 1648877     402

real    0m0.091s
user    0m0.080s
sys     0m0.008s

Conclusion

This was a fun toy exercise so far. One final optimization possibility would be to multi-threading when parsing multiple files.