Wordcount I: Implementing the Unix command-line tool `wc` in modern C++

Marton Trencseni - Sun 10 September 2023 - Programming

Introduction

Like many programmers, I started programming on the Commodore 64, in BASIC, at age 6-7. A few years later, in elementary school, I graduated to Pascal, using the excellent Turbo Pascal compiler. Later, in high school, I learned C and C++ at about the same time. I remember being 16 and going on a trip with my family to Yosemite Park, where I chose my then-favorite book, Teach Yourself ANSI C++ in 21 Days, over going hiking with my parents.

Teach Yourself ANSI C++ In 21 Days

I finished University in 2004 (Computer Science, Budapest University of Technology) and I got a job as a junior C++ programmer at the company Graphisoft, working on the architectural CAD product ArchiCAD. Later I had a startup called Scalien, where we wrote a distributed key-value database called ScalienDB (code on Github) in C++.

In all this reminiscing it's worth noting that I was not a good programmer initially. I was definitely not (and am still not) a natural talent like John Carmack, identifying problems worth solving and writing elegant and highly efficient solutions that millions of people still marvel at 20 years later (Quake source code on Github). In retrospect I would say I became a good programmer towards the end of my time at Graphisoft, and I wrote mostly good code at Scalien. However, this was from 2009 onwards, so before the onset of modern C++. Back then my position was that the best way to use the C++ language was as C-with-objects. Specifically, I did not enjoy using the Standard Template Library (STL), so in ScalienDB we rolled our own container classes, which is considered a big no-no per the C++ Core Guidelines (see below).

After Scalien I pivoted to working in the data field, where I was able to leverage my second University degree as a Physicist, which I earned while working at Graphisoft. However, this also meant leaving C++ behind, as over the last 10 years Python became the lingua franca of data, with SQL for querying relational databases.

Lately I realized I miss writing pedal-to-the-metal shotgun-to-the-foot C++ code. Part of this is the lively progress the language has made, with the C++11, C++14, C++17 and latest C++20 standards introducing exciting new features into my favorite programming language. So I ordered a bunch of books on modern C++ development a few months ago, among them the latest version of Bjarne Stroustrup's book (of which I owned an earlier edition before I gave it to my University library in 2015), and started to devour them.

Over the last couple of weeks I read the excellent Davidson & Gregory book Beautiful C++. To my surprise, it's one of the best technical books I've read in a long time, discussing hard-earned programming wisdom in parallel with new language features. The book is built on top of the C++ Core Guidelines, a set of rules for writing good modern C++ code, edited by Bjarne Stroustrup and Herb Sutter.

Beautiful C++

After reading the book cover-to-cover I had a strong urge to go and use all the new features and write some modern C++ code. I didn't want to write purely toy code, I wanted to approximate solving a real-world problem; but I also have limited time, so anything big was out of the question. I settled on re-implementing, to a satisfying (to me) degree, the standard Unix command-line tool wc, most commonly used to count lines in a file with wc -l <file>.

The out-of-the-box wc --help output on my Debian devbox is:

$ wc --help
Usage: wc [OPTION]... [FILE]...
  or:  wc [OPTION]... --files0-from=F
Print newline, word, and byte counts for each FILE, and a total line if
more than one FILE is specified.  A word is a non-zero-length sequence of
characters delimited by white space.

With no FILE, or when FILE is -, read standard input.

The options below may be used to select which counts are printed, always in
the following order: newline, word, character, byte, maximum line length.
  -c, --bytes            print the byte counts
  -m, --chars            print the character counts
  -l, --lines            print the newline counts
      --files0-from=F    read input from the files specified by
                           NUL-terminated names in file F;
                           If F is - then read names from standard input
  -L, --max-line-length  print the maximum display width
  -w, --words            print the word counts
      --help     display this help and exit
      --version  output version information and exit

GNU coreutils online help: <http://www.gnu.org/software/coreutils/>
Full documentation at: <http://www.gnu.org/software/coreutils/wc>
or available locally via: info '(coreutils) wc invocation'

I set off to re-implement all 5 basic modes of wc in C++, while trying to maintain compatibility with the standard C implementation to a reasonable degree.

The full source code of my C++ implementation is up on Github. You can compile it like this:

$ g++ -o wcpp -O3 wcpp_v1.cpp

The C implementation I most often peeked at to understand the classic behaviour is in the coreutils Github repo.

Command line arguments

wc only has a few command-line arguments, but each has a - and a -- way to specify it. Additionally, wc supports passing in -xyz, which is equivalent to passing in -x -y -z, and passing in the special key=value type argument --files0-from=F. To my surprise, there is no C++ standard library facility for parsing command-line arguments, and even boost::program_options does not support this -xyz behaviour. So I wrote a few helper functions utilizing the STL to implement this:

#include <cstdlib>
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <vector>

struct cl_args_t {
    std::set<std::string> flags;
    std::map<std::string, std::string> key_values;
    std::vector<std::string> filename_args;
};

cl_args_t get_cl_args(int argc, char** argv) {
    cl_args_t cl_args;
    for (auto i = 1; i < argc; i++) {
        auto arg = std::string(argv[i]);
        if (arg[0] == '-') {
            if (cl_args.filename_args.size() > 0) {
                std::cerr << "error: switches must precede filename arguments"
                    << std::endl;
                exit(1);
            }
            auto eqi = arg.find('=');
            if (eqi == std::string::npos) {
                // no '=' in arg, it's a flag like -l or --lines
                cl_args.flags.insert(arg);
            } else {
                // key=value argument like --files0-from=F
                auto k = arg.substr(0, eqi);
                auto v = arg.substr(eqi + 1);
                cl_args.key_values[k] = v;
            }
        } else {
            cl_args.filename_args.push_back(arg);
        }
    }
    return cl_args;
}

void normalize_flags(cl_args_t& cl_args) {
    // convert eg. -lm to -l -m
    std::set<std::string> new_flags;
    for (const auto& f : cl_args.flags) {
        if (f[0] == '-' && f[1] != '-' && f.length() > 2) {
            for (std::size_t i = 1; i < f.length(); i++) {
                std::string new_f = std::string("-") + f[i];
                new_flags.insert(new_f);
            }
        } else {
            new_flags.insert(f);
        }
    }
    cl_args.flags = new_flags;
}

void check_bad_args(const cl_args_t& cl_args) {
    std::set<std::string> accepted_flags = {
        "-l", "--lines",
        "-w", "--words",
        "-m", "--chars",
        "-c", "--bytes",
        "-L", "--max-line-length",
              "--help",
              "--version"
    };
    for (const auto& f : cl_args.flags) {
        if (!accepted_flags.count(f)) {
            std::cerr << "invalid option: " << f << std::endl;
            exit(1);
        }
    }
    for (const auto& kv : cl_args.key_values) {
        // the only accepted key=value argument is --files0-from=F
        if (kv.first != "--files0-from") {
            std::cerr << "invalid option: " << kv.first << std::endl;
            exit(1);
        }
    }
}
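
To show how these pieces fit together, here is a minimal sketch of a main() wiring them up. This is illustrative, not the exact code in wcpp_v1.cpp; print_usage() and process_file() are hypothetical helpers:

int main(int argc, char** argv) {
    auto cl_args = get_cl_args(argc, argv);
    normalize_flags(cl_args);  // split -lm into -l -m
    check_bad_args(cl_args);   // exits on invalid options
    if (cl_args.flags.count("--help")) {
        // print_usage();  // hypothetical helper printing the usage text
        return 0;
    }
    for (const auto& fname : cl_args.filename_args) {
        // process_file(fname, ...);  // hypothetical: run the selected counters
    }
    return 0;
}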

Counting architecture

wc supports counting 5 quantities:

  • bytes
  • characters
  • words
  • lines
  • the longest line

To support this, I defined an abstract base class called counter, and 5 child classes, one implementing each counting logic:

class counter {
    public:
    // virtual destructor, since counters are used via base class pointers
    virtual ~counter() = default;
    virtual bool has_optimization() const { return false; }
    virtual void optimized_count(const std::string& fname) { }
    virtual void process_wchar(const wchar_t, unsigned, bool) = 0;
    unsigned long get_count() const { return count; }
    void add(unsigned long i) { count += i; }
    void clear() { count = 0; }
    protected:
    unsigned long count = 0;
};

Essentially, each counter is assumed to be a state machine, and each invocation of process_wchar(const wchar_t wc, unsigned size, bool error) triggers the state machine to process a new wide character wc of length size bytes that was read; if the current bytes do not form a valid character per the current locale, then error is set and 1 byte is consumed. The has_optimization() and optimized_count() functions are a special path for supporting wc -c byte counting: in this case we don't have to actually count the bytes, we can just use stat() to ask the operating system what the size of the file is, assuming we are processing files and not stdin.

As an example, here is the implementation for byte counting, the most trivial of the 5:

class byte_counter : public counter {
    public:
    bool has_optimization() const override { return true; }
    void optimized_count(const std::string& fname) override {
        struct stat st;  // needs <sys/stat.h>
        if (stat(fname.c_str(), &st) == 0)
            count = st.st_size;
    }
    void process_wchar(const wchar_t, unsigned num_bytes, bool) override {
        count += num_bytes;
    }
};
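
Byte counting is stateless, but most of the counters are not. For contrast, here is a sketch of a word counter, which has to remember whether it is currently inside a word, and counts each word at its first character. This is a simplified version of what is in my repo; in particular, how encoding errors interact with word boundaries is glossed over here by treating them as separators:

class word_counter : public counter {
    public:
    void process_wchar(const wchar_t wc, unsigned, bool error) override {
        // std::iswspace needs <cwctype>; treating undecodable sequences
        // as separators is a simplification in this sketch
        bool is_space = error || std::iswspace(static_cast<wint_t>(wc));
        if (!in_word && !is_space)
            count++;  // a new word starts here
        in_word = !is_space;
    }
    private:
    bool in_word = false;
};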

Handling wide characters

Although I've been using wc for 25 years, I had never used it to count characters or words, so I was surprised to learn that a significant fraction of the complexity comes from differentiating between bytes and (non-ASCII) characters. Essentially, in Unix a locale can be set, and wc honors this setting. On my Debian box, running locale shows:

$ locale
LANG=en_US.utf8
LANGUAGE=
LC_CTYPE="en_US.utf8"
LC_NUMERIC="en_US.utf8"
LC_TIME="en_US.utf8"
LC_COLLATE="en_US.utf8"
LC_MONETARY="en_US.utf8"
LC_MESSAGES="en_US.utf8"
LC_PAPER="en_US.utf8"
LC_NAME="en_US.utf8"
LC_ADDRESS="en_US.utf8"
LC_TELEPHONE="en_US.utf8"
LC_MEASUREMENT="en_US.utf8"
LC_IDENTIFICATION="en_US.utf8"
LC_ALL=

This means that bytes are interpreted as UTF-8 characters by wc. So valid UTF-8 byte sequences are counted as 1 character by wc, and this is also honored by the word counting logic. However, if we set the locale to C, we switch to chars=bytes mode, and in this case wc -c and wc -m always output the same number; you can trigger this with LC_ALL="C" wc -c -m -l -w -L <file>.
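
For example, with a 2-byte UTF-8 character in the input, the character and byte counts differ under the UTF-8 locale but coincide under the C locale (é is 2 bytes in UTF-8, so 6 characters but 7 bytes; output spacing approximate):

$ printf 'héllo\n' > test.txt
$ wc -m -c test.txt
6 7 test.txt
$ LC_ALL="C" wc -m -c test.txt
7 7 test.txt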

However, not all byte sequences are valid UTF-8 (or whatever else the locale is set to), so invalid sequences are skipped one byte at a time, and a new attempt is made to read a valid character. The workhorse here is the C function mbrtowc(), which attempts to read a character per the current locale and returns it, or signals an error. This logic is in my process_block() function, which, after reading a block of bytes, repeatedly tries to extract wide characters. This function is quite C-like, and the core logic is similar to the standard wc implementation:

// is_basic() is a helper (not shown here) that short-circuits plain
// single-byte characters so they don't go through mbrtowc();
// needs <cwchar> for mbrtowc() and <algorithm> for std::for_each
unsigned process_block(
    const char* cp, unsigned n, counter_vector& counters, bool eof) {
    long remaining = n;
    std::mbstate_t ps{};
    while (remaining > 0) {
        wchar_t wc = 0;  // initialized so it has a defined value on errors
        std::size_t num_bytes = 0;
        if (is_basic(*cp)) {
            wc = *cp;
            num_bytes = 1;
        } else {
            num_bytes = mbrtowc(&wc, cp, remaining, &ps);
        }
        bool error = false;
        if (num_bytes == 0) {
            // encountered null character
            num_bytes = 1;
        } else if (num_bytes == static_cast<std::size_t>(-1)) {
            // encountered bad wide character
            num_bytes = 1;
            error = true;
        } else if (num_bytes == static_cast<std::size_t>(-2)) {
            // encountered incomplete wide character, get more bytes if possible
            if (!eof)
                return remaining;
            num_bytes = remaining;
            error = true;
        }
        // else: success, read a complete wchar_t
        std::for_each(counters.begin(), counters.end(),
            [&](auto& c) { c->process_wchar(wc, num_bytes, error); });
        cp += num_bytes;
        remaining -= static_cast<long>(num_bytes);
    }
    return remaining;
}
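
The caller of process_block() is responsible for carrying over the undecoded tail of an incomplete multi-byte sequence: the returned remaining bytes are moved to the front of the buffer, and the next read appends after them. The actual read loop in wcpp_v1.cpp may differ in details; here is a minimal sketch of that contract, with a hypothetical block size (counter_vector is the vector-of-counters alias used in process_block above):

#include <cstring>   // std::memmove
#include <istream>

void process_stream(std::istream& in, counter_vector& counters) {
    constexpr std::size_t BUF_SIZE = 64 * 1024;  // hypothetical block size
    char buf[BUF_SIZE];
    std::size_t leftover = 0;  // undecoded bytes carried over from last block
    while (in) {
        in.read(buf + leftover, BUF_SIZE - leftover);
        std::size_t n = leftover + in.gcount();
        if (n == 0)
            break;
        bool eof = !in;  // no more bytes will follow after this block
        unsigned remaining =
            process_block(buf, static_cast<unsigned>(n), counters, eof);
        // move the incomplete tail to the front for the next iteration
        std::memmove(buf, buf + n - remaining, remaining);
        leftover = remaining;
    }
}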

Bugs

Getting the locale-specific conversion of bytes to characters right took at least half of all development time. I used large zipped files as a test, since these contain random characters, some of which are valid UTF-8, many are not, so this allowed me to test many branches of my code. In the end, I got everything to match except one case: U+2029, the Unicode paragraph separator character. My implementation calls std::iswspace(wc), which considers this white-space, so it separates words and triggers an extra word count. However, my Debian system's wc does not consider this character to be a word separator. I reduced the difference to a test file containing aU+2029a (5 bytes long, as U+2029 is represented in 3 bytes in UTF-8), on which the system wc -w returns 1 word, but my implementation counts 2 words (test file is here). I believe mine is correct, so I left this difference.
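
The difference is easy to reproduce from the shell, since U+2029 is the byte sequence E2 80 A9 in UTF-8:

$ printf 'a\xe2\x80\xa9a' > u2029.txt
$ wc -w u2029.txt        # system wc reports 1 word
$ ./wcpp -w u2029.txt    # my implementation reports 2 words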

Performance

I tested the performance by running all 5 counters on fairly large zip files repeatedly and using time. To my surprise, without any additional optimizations, just by compiling with -O3, my C++ implementation is only about 10% slower than my system's highly optimized C version from 20-30 years ago:

$ time wc -c -m -l -w -L playground/MNIST_data/train-images-idx3-ubyte.gz
  36468  207566 5259183 9912422     448 playground/MNIST_data/train-images-idx3-ubyte.gz

real    0m0.560s
user    0m0.548s
sys     0m0.008s
$ time ./wcpp -c -m -l -w -L playground/MNIST_data/train-images-idx3-ubyte.gz
   36468  207567 5259183 9912422     448 playground/MNIST_data/train-images-idx3-ubyte.gz

real    0m0.619s
user    0m0.620s
sys     0m0.000s

Note the difference in word counts due to the issue mentioned above.

Conclusion

I had a lot of fun writing wcpp.cpp and learned a lot about Unix, locale, wide character handling, and new features of C++ such as auto and for_each. I'm sure there are still a lot of optimization possibilities in terms of C++ language usage in my code; I'm happy to take feedback. I'm planning to continue on this path and, as time allows, implement other classic Unix command-line tools such as grep or find.