Wordcount II: Introducing a cleaner C++ class hierarchy to `wc`

Marton Trencseni - Sat 23 September 2023 - Programming

Introduction

In the previous article, I wrote a modern C++ version of the Unix command-line utility wc. The basic architecture of the program was a collection of state machines, one for each type of count that the utility supports:

  • bytes
  • characters (Unicode or ASCII)
  • words
  • lines
  • the longest line

Each counter was implemented as a class that derives from and implements from class counter:

class counter {
    public:
    virtual bool has_optimization() const { return false; }
    virtual void optimized_count(const std::string fname) { }
    virtual void process_wchar(const wchar_t, unsigned, bool) = 0;
    unsigned long get_count() const { return count; }
    void add(unsigned long i) { count += i; }
    void clear() { count = 0; }
    protected:
    unsigned long count = 0;
};

The core state machine logic is implemented in process_wchar(). In the case of eg. byte counting this is trivial:

void process_wchar(const wchar_t, unsigned num_bytes, bool) {
    count += num_bytes;
}

In the case of words it's a real edge-triggered state machine:

void process_wchar(const wchar_t wc, unsigned, bool error) {
    if (error)
        return;
    bool whitespace = is_word_sep(wc);
    bool printable = iswprint(wc);
    if (in_word && whitespace) {
        in_word = false;
    } else if (!in_word && printable && !whitespace) {
        in_word = true;
        count++;
    }
}

The source code for the previous version is here.

Optimized counts

However, for some counts, there are more optimized ways to get the desired result. The simplest example, and the one I implemented in the previous post, is byte counting. If the user of wc just wants to know the byte count by typing wc -c or wc --bytes and the input is a file (not stdin), then we can just use fstat() and ask the file system for the size in bytes, and return it — no need to actually open the file and count the bytes.

This is what the has_optimization() and optimized_count() functions in class counter are for. But, I was unhappy with this, since it mixes multiple things in one class:

  • querying whether the counter supports optimized counting
  • performing the file-based optimized count
  • performing the character-wise state machine-based count

Also, even classes that don't support optimized counting will contain (empty) functions for this, derived from the base class.

Abstractions

To address this, I started to create separate classes to encapsulate separate concepts. In general there are counters, which can return an unsigned long count; in our case, a counter is either a file system based one (eg. fstat() for character counting), or a stream-based one (the state machines):

class counter {
    public:
    unsigned long get_count() const { return count; }
    protected:
    unsigned long count = 0;
};

class filesystem_counter : public counter {
    public:
    virtual void process_file(const std::string fname) = 0;
};

class stream_counter : public counter {
    public:
    virtual void process_wchar(const wchar_t, const unsigned, const bool) = 0;
};

With this structure, the fstat() case is nicely separated from other concerns:

class byte_filesystem_counter : public filesystem_counter {
    public:
    void process_file(const std::string fname) {
        struct stat st;
        stat(fname.c_str(), &st);
        count = st.st_size;
    }
};

Then there is the concept of deciding which one is supported for the given count, and retrieving the supported one (with sensible default implementation):

class counter_policy {
    public:
    virtual bool supports_file_based() const {
        return false;
    }
    virtual filesystem_counter_t new_filesystem_counter() const {
        return nullptr;
    }
    virtual stream_counter_t new_stream_counter() const = 0;
};

To implement, other class byte_filesystem_counter shows above, we need two more classes:

class byte_stream_counter : public stream_counter {
    public:
    void process_wchar(const wchar_t, const unsigned num_bytes, const bool) {
        count += num_bytes;
    }
};

class byte_counter_policy : public counter_policy {
    public:
    virtual bool supports_file_based() const {
        return true;
    }
    virtual filesystem_counter_t new_filesystem_counter() const {
        return std::make_unique<byte_filesystem_counter>();
    }
    virtual stream_counter_t new_stream_counter() const {
        return std::make_unique<byte_stream_counter>();
    }
};

This is quite clean, but it has drawbacks. After parsing the command-line arguments, I construct an std::vector<counter_policy> corresponding to what needs to be counted, and eventually an std::vector of counters. This is where the problems start: an std::vector<derived_class> cannot be used as an std::vector<base_class>, because even though all instances of derived_class are base_class also, this is not true for container types such as std::vector. So, if I create a vector of counter, and I pass it to a function which runs the stream-based state machine logic, then it needs to static_cast<> to stream_counter as it iterates through the vector, which is ugly and not type safe. Also, if I wanted to, I cannot mix different types of counters in the vector, since I don't know how to cast (was it a stream-based or file-based one?). Or, I can decide up-front that, due to the command-line arguments used, I will instantiate all stream-based or all file-based counters: in this case I don't need to do ugly casts, but now I have a different problem: the functions that display the results as a table, they don't care how the counts were constructed, they just care about the numbers, so they do want to just call counter::get_count(), so they want to get passed an std::vector<counter>, irrespective of whether the count was stream-based or file-based.

In the end, I decided that instead of passing counters to the display functions it's easier to extract the counts into typedef std::vector<unsigned long>s and pass that. I still have to have two different functions to extract, but that can be handled reasonably with templates:

template <typename T>
count_vector_t to_counts(const std::vector<T>& counters) {
    count_vector_t counts;
    std::for_each(counters.begin(), counters.end(),
            [&](auto& c) { counts.push_back(c->get_count()); });
    return counts;
}

So, for example, the function which handles counting things on stdin looks like this:

table_t process_stdin(const cl_args_t& cl_args) {
    table_t table;
    auto policies = policies_from_arguments(cl_args);
    stream_counter_vector_t counters;
    stream_counters_from_policies(policies, counters);
    process_stream(&std::cin, counters);
    auto counts = to_counts<stream_counter_t>(counters);
    table.push_back(tabulate(counts));
    return table;
}

The function countings things per file has two branches, one if everything can be counted on a file-based way, or it needs to get processed stream-wise, character-by-character:

count_vector_t process_file(const std::string fname, const cl_args_t& cl_args) {
    count_vector_t counts;
    auto policies = policies_from_arguments(cl_args);
    auto all_sfb = std::all_of(
        policies.begin(),
        policies.end(),
        [](auto& policy) { return policy->supports_file_based(); }
        );
    if (all_sfb) {
        filesystem_counter_vector_t counters;
        filesystem_counters_from_policies(policies, counters);
        std::for_each(counters.begin(), counters.end(),
            [&](auto& counter) {
                counter->process_file(fname);
            });
        counts = to_counts<filesystem_counter_t>(counters);
    } else {
        stream_counter_vector_t counters;
        stream_counters_from_policies(policies, counters);
        std::ifstream f(fname);
        if (f.is_open()) {
            process_stream(&f, counters);
        } else {
            std::cerr << "cannot read from file: " << fname << std::endl;
        }
        counts = to_counts<stream_counter_t>(counters);
    }
    return counts;
}

This second version of wc, with more abtracted classes, is here.

Complexity

Unfortunately, overall there is a hefty price to pay for these additional abstractions. The original version was 469 lines of code, this one is 567, so it's an extra 100 lines of code, or roughly 20% more. Unfortunately, I don't think it's cleaner, clearer, or easier to modify (in terms of lines of code to be touched). This is a case of too much complexity. A good sign of this is the long list forward declarations and typedefs I introduced to keep things somewhat readable:

class counter;
class counter_policy;
class stream_counter;
class filesystem_counter;

typedef std::unique_ptr<counter> counter_t;
typedef std::unique_ptr<counter_policy> counter_policy_t;
typedef std::unique_ptr<stream_counter> stream_counter_t;
typedef std::unique_ptr<filesystem_counter> filesystem_counter_t;
typedef std::vector<counter_policy_t> policy_vector_t;
typedef std::vector<counter_t> counter_vector_t;
typedef std::vector<stream_counter_t> stream_counter_vector_t;
typedef std::vector<filesystem_counter_t> filesystem_counter_vector_t;
typedef std::vector<unsigned long> count_vector_t;

// .. actual class definitions start here

The only "win" in all this was the realization that it's cleaner to just pass std::vector<unsigned long>s to the display functions, so they can be completely independent of however I decide to implement counters.

Performance

This version of wc, compared to my previous version, has a different code structure, but actual runtime performance is identical. At this point it started to bother me that I invested so much time, the performance is not too far off from my sytem wc, but it's still about 10% slower. So I did some profiling, and found that following: in the case that's relevant for benchmarking — the stream-based state machine branches — I lose time in the following for_each loop (this is true for both my previous and the current implementation):

unsigned process_block(
    const char* cp, unsigned n, stream_counter_vector_t& counters, bool eof) {
    long remaining = n;
    std::mbstate_t ps{};
    while (remaining > 0) {
        // ... code omitted
        std::for_each(counters.begin(), counters.end(),                    <-----
            [&](auto& c) { c->process_wchar(wc, num_bytes, error); });     <-----
        // ... code omitted
    }
    return remaining;
}

I tried multiple different ways of iterating, all had identical performance.

This was quite a blow, since the whole point of implementing wc in C++ is to abstract the state machines into their own classes, instantiate the state machines we need based on the command-like arguments, and perform only those counts. This — irrespective of the abstractions introduced in this post — does yield significantly cleaner code than the original C version in coreutils, which is one giant while loop with 10s of variables over 100s of lines of code, ie. all state machines are spilled into one another, with ugly, hard to follow, not-obviously-correct code.

Open-closed principle

After I wrote this version and in the end decided that it's over-engineered and all the astractions are not worth it, I realized that many otherwise good software engineering principles don't apply to wc. For example, the S and O in SOLID stand for (from Wikipedia):

  • The Single-responsibility principle: There should never be more than one reason for a class to change. In other words, every class should have only one responsibility.
  • The Open–closed principle: Software entities should be open for extension, but closed for modification.

Both cases are about what happens when code changes to incorporate features in the future. But wc is not like that: wc has been around with minimal changes since 1971, so more than 50 years! It's one of the few exceptions where you don't have to anticipate future changes to the code.

Conclusion

In the next post, I will show a 3rd variation of my C++ wc implementation, with a reasonable trade-off between C++ complexity, readability and run-time performance, based on the above considerations — this time with the explicit goal of beating my system wc in performance tests.