Wordcount III: Beating the system `wc`
Marton Trencseni - Fri 29 September 2023 - Programming
Introduction
In the previous two articles I re-implemented wc in modern C++ as a toy exercise:
- Wordcount I: Implementing the Unix command-line tool wc in modern C++
- Wordcount II: Introducing a cleaner C++ class hierarchy to wc
I had three reasons to write a third version.
First, my core architectural takeaway at the end of the second article was that since wc is a fully specified tool that is not going to see new features introduced, a lot of the usual C++ abstractions are not necessary. For example, the code doesn't need to follow the open-closed principle and be open for extension but closed for modification. So all the complexity introduced in the second version isn't good complexity.
Second, I also found that the second (and also the first) implementation loses time compared to my system wc in the following for_each loop (the same holds for other ways of iterating a C++ container):
unsigned process_block(
        const char* cp, unsigned n, stream_counter_vector_t& counters, bool eof) {
    long remaining = n;
    std::mbstate_t ps{};
    while (remaining > 0) {
        // ... code omitted
        std::for_each(counters.begin(), counters.end(),                 // <-----
            [&](auto& c) { c->process_wchar(wc, num_bytes, error); });  // <-----
        // ... code omitted
    }
    return remaining;
}
Third, I found that the system wc has a trick up its sleeve when counting lines. Instead of processing each character and checking whether it's a \n, it uses the memchr() family of functions. memchr() takes a buffer of characters and finds the first occurrence of a given character, in this case \n, and is much faster than doing the same in a hand-written C loop.
Simple classes and code
To keep this implementation simple I used a very simple counter base class, and derived all other implementation classes from it:
class counter_t {
public:
void reset() { count = 0; active = true; }
bool active = false;
long count = 0;
};
For example, the class counting bytes is now as simple as:
class byte_counter_t : public counter_t {
public:
void process_file(const std::string fname) {
struct stat st;
stat(fname.c_str(), &st);
count = st.st_size;
}
void process_wchar(const wchar_t, unsigned num_bytes, bool) {
count += num_bytes;
}
};
Since I realized extensibility is not required here, the class itself stores whether it should count or not. The active flag is set based on the command-line parameters and checked in the inner loops. This means the dynamic for loop is now replaced with a hard-coded series of ifs:
void process_wchar(const wchar_t wc, unsigned num_bytes, bool error) {
if (byte_counter.active)
byte_counter.process_wchar(wc, num_bytes, error);
if (char_counter.active)
char_counter.process_wchar(wc, num_bytes, error);
if (line_counter.active)
line_counter.process_wchar(wc, num_bytes, error);
if (longest_line_counter.active)
longest_line_counter.process_wchar(wc, num_bytes, error);
if (word_counter.active)
word_counter.process_wchar(wc, num_bytes, error);
}
As I will show shortly, this yields much faster code. Obviously this is repeated code that goes against the DRY principle, but here it's permissible.
Lastly, the use of memchr() in the line counter:
unsigned process_block_lines_only(
        const char* cp, unsigned n, bool eof) {
    const char* cur = cp;
    while (cur != nullptr) {
        auto remaining = n - (cur - cp);
        cur = static_cast<const char*>(memchr(cur, '\n', remaining));
        if (cur != nullptr) {
            line_counter.count++;
            cur++;  // continue the search after this newline
        }
    }
    return 0;
}
Headers
To further clean up the code, I factored the command-line parsing and tabulation code out into separate headers, clargs.hpp and tabular.hpp. The complete code is up on Github.
Performance
With the above changes and optimizations, the code now runs about 10% faster than my system wc
when running all 5 counters in file mode:
$ time wc -c -m -l -w -L playground/MNIST_data/t10k-images-idx3-ubyte.gz
6121 34669 874589 1648877 402 playground/MNIST_data/t10k-images-idx3-ubyte.gz
real 0m0.101s
user 0m0.100s
sys 0m0.000s
$ time ./wcpp_v3 -c -m -l -w -L playground/MNIST_data/t10k-images-idx3-ubyte.gz
6121 34669 874589 1648877 402 playground/MNIST_data/t10k-images-idx3-ubyte.gz
real 0m0.089s
user 0m0.088s
sys 0m0.000s
Results are similar when reading from stdin
:
$ time cat playground/MNIST_data/t10k-images-idx3-ubyte.gz | wc -c -m -l -w -L
6121 34669 874589 1648877 402
real 0m0.101s
user 0m0.104s
sys 0m0.000s
$ time cat playground/MNIST_data/t10k-images-idx3-ubyte.gz | ./wcpp_v3 -c -m -l -w -L
6121 34669 874589 1648877 402
real 0m0.091s
user 0m0.080s
sys 0m0.008s
Conclusion
This was a fun toy exercise so far. One final optimization possibility would be to use multi-threading when processing multiple files.
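As a rough sketch of that idea (not part of the article's code), each file could be handed to its own task with std::async, with the per-file results combined at the end. Here line counting stands in for running all the active counters, and the file names are placeholders:

```cpp
#include <cstdio>
#include <cstring>
#include <future>
#include <string>
#include <vector>

// Count newlines in one file; a stand-in for running all active counters.
long count_lines_in_file(const std::string& fname) {
    std::FILE* f = std::fopen(fname.c_str(), "rb");
    if (f == nullptr)
        return 0;
    char buf[1 << 16];
    long count = 0;
    size_t n;
    while ((n = std::fread(buf, 1, sizeof(buf), f)) > 0) {
        const char* cur = buf;
        const char* end = buf + n;
        while ((cur = static_cast<const char*>(
                    std::memchr(cur, '\n', end - cur))) != nullptr) {
            count++;
            cur++;
        }
    }
    std::fclose(f);
    return count;
}

// Process each file on its own thread and sum the per-file counts.
long count_lines_parallel(const std::vector<std::string>& fnames) {
    std::vector<std::future<long>> futures;
    for (const auto& fname : fnames)
        futures.push_back(std::async(std::launch::async, count_lines_in_file, fname));
    long total = 0;
    for (auto& fut : futures)
        total += fut.get();
    return total;
}
```

Since the futures are collected in the order the files were given, a real wc replacement could still print the per-file results in their original order before the total.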