Re: wc(1): accelerate word counting

Scott Cheloha Wed, 03 Aug 2022 16:14:34 -0700

On Wed, Nov 17, 2021 at 08:37:53AM -0600, Scott Cheloha wrote:
> In wc(1) we currently count words, both ASCII and multibyte, in a
> getline(3) loop.
> 
> This makes sense in the multibyte case because stdio handles all the
> nasty buffer resizing for us.  We avoid splitting a multibyte character
> between two read(2) calls and the resulting code is simpler.
> 
> However, for ASCII input we don't have the split-character problem.
> Using getline(3) doesn't really buy us anything.  We can count words
> in a big buffer (as we do in the ASCII byte- and line-counting modes)
> just fine.
> 
> [...]


37 week bump.

Counting words in a big buffer is faster than doing it with
getline(3).  We don't need the convenience of getline(3) except
in the multibyte case.

The state machine for counting words doesn't need to change because
word transitions still happen within a single byte.  We just move the
logic out of the getline(3) loop and into a read(2) loop.
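
To illustrate why the state machine carries over unchanged, here is a
rough sketch (not the patch itself) of the same logic with its one bit
of state kept across buffers.  The names struct wc_state, wc_init() and
wc_feed() are made up for this example; wc.c keeps the counters and
gotsp as plain variables.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of wc.c: running totals plus the
 * single flag the word-counting state machine needs.
 */
struct wc_state {
	long long lines, words, bytes;
	int gotsp;	/* nonzero if the previous byte was whitespace */
};

static void
wc_init(struct wc_state *st)
{
	st->lines = st->words = st->bytes = 0;
	st->gotsp = 1;	/* start of input behaves like "after whitespace" */
}

/*
 * Feed one chunk of single-byte input.  Chunk boundaries may fall
 * anywhere, including mid-word, because every transition is decided
 * by looking at exactly one byte plus gotsp.
 */
static void
wc_feed(struct wc_state *st, const char *buf, size_t len)
{
	const char *p, *end = buf + len;

	st->bytes += len;
	for (p = buf; p < end; p++) {
		if (isspace((unsigned char)*p)) {
			st->gotsp = 1;
			if (*p == '\n')
				st->lines++;
		} else if (st->gotsp) {
			/* first non-space byte after whitespace: new word */
			st->gotsp = 0;
			st->words++;
		}
	}
}
```

In a read(2) loop you would call wc_feed(&st, buf, len) once per
chunk.  A word straddling two reads is still counted once because
gotsp stays 0 across the boundary.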

As for "faster", consider The Adventures of Sherlock Holmes:

$ ftp -o sherlock-holmes.txt https://www.gutenberg.org/files/1661/1661-0.txt
Trying 152.19.134.47...
Requesting https://www.gutenberg.org/files/1661/1661-0.txt
100% |**************************************************|   593 KB    00:01    
607430 bytes received in 1.05 seconds (563.58 KB/s)
$ ls -lh sherlock-holmes.txt
-rw-r--r--  1 ssc  ssc   593K Jun  9  2021 sherlock-holmes.txt

-current:

$ command time /usr/bin/wc $(jot -b ~/sherlock-holmes.txt 200) | tail -n 1
        2.081 real         2.730 user         0.080 sys
 2460800 21512000 121486000 total

Patched:

$ command time obj/wc $(jot -b /home/ssc/sherlock-holmes.txt 200) | tail -n 1
        1.093 real         1.910 user         0.030 sys
 2460800 21512000 121486000 total

So, roughly twice as fast in wall-clock time (2.081s vs. 1.093s real)
on an input with normal-ish line lengths.

ok?

Index: wc.c
===================================================================
RCS file: /cvs/src/usr.bin/wc/wc.c,v
retrieving revision 1.29
diff -u -p -r1.29 wc.c
--- wc.c        28 Nov 2021 19:28:42 -0000      1.29
+++ wc.c        3 Aug 2022 23:11:45 -0000
@@ -145,16 +145,42 @@ cnt(const char *path)
                fd = STDIN_FILENO;
        }
 
-       if (!doword && !multibyte) {
+       if (!multibyte) {
                if (bufsz < _MAXBSIZE &&
                    (buf = realloc(buf, _MAXBSIZE)) == NULL)
                        err(1, NULL);
+
+               /*
+                * According to POSIX, a word is a "maximal string of
+                * characters delimited by whitespace."  Nothing is said
+                * about a character being printing or non-printing.
+                */
+               if (doword) {
+                       gotsp = 1;
+                       while ((len = read(fd, buf, _MAXBSIZE)) > 0) {
+                               charct += len;
+                               for (C = buf; len--; ++C) {
+                                       if (isspace((unsigned char)*C)) {
+                                               gotsp = 1;
+                                               if (*C == '\n')
+                                                       ++linect;
+                                       } else if (gotsp) {
+                                               gotsp = 0;
+                                               ++wordct;
+                                       }
+                               }
+                       }
+                       if (len == -1) {
+                               warn("%s", file);
+                               rval = 1;
+                       }
+               }
                /*
                 * Line counting is split out because it's a lot
                 * faster to get lines than to get words, since
                 * the word count requires some logic.
                 */
-               if (doline) {
+               else if (doline) {
                        while ((len = read(fd, buf, _MAXBSIZE)) > 0) {
                                charct += len;
                                for (C = buf; len--; ++C)
@@ -204,46 +230,26 @@ cnt(const char *path)
                        return;
                }
 
-               /*
-                * Do it the hard way.
-                * According to POSIX, a word is a "maximal string of
-                * characters delimited by whitespace."  Nothing is said
-                * about a character being printing or non-printing.
-                */
                gotsp = 1;
                while ((len = getline(&buf, &bufsz, stream)) > 0) {
-                       if (multibyte) {
-                               const char *end = buf + len;
-                               for (C = buf; C < end; C += len) {
-                                       ++charct;
-                                       len = mbtowc(&wc, C, MB_CUR_MAX);
-                                       if (len == -1) {
-                                               mbtowc(NULL, NULL,
-                                                   MB_CUR_MAX);
-                                               len = 1;
-                                               wc = L'?';
-                                       } else if (len == 0)
-                                               len = 1;
-                                       if (iswspace(wc)) {
-                                               gotsp = 1;
-                                               if (wc == L'\n')
-                                                       ++linect;
-                                       } else if (gotsp) {
-                                               gotsp = 0;
-                                               ++wordct;
-                                       }
-                               }
-                       } else {
-                               charct += len;
-                               for (C = buf; len--; ++C) {
-                                       if (isspace((unsigned char)*C)) {
-                                               gotsp = 1;
-                                               if (*C == '\n')
-                                                       ++linect;
-                                       } else if (gotsp) {
-                                               gotsp = 0;
-                                               ++wordct;
-                                       }
+                       const char *end = buf + len;
+                       for (C = buf; C < end; C += len) {
+                               ++charct;
+                               len = mbtowc(&wc, C, MB_CUR_MAX);
+                               if (len == -1) {
+                                       mbtowc(NULL, NULL,
+                                           MB_CUR_MAX);
+                                       len = 1;
+                                       wc = L'?';
+                               } else if (len == 0)
+                                       len = 1;
+                               if (iswspace(wc)) {
+                                       gotsp = 1;
+                                       if (wc == L'\n')
+                                               ++linect;
+                               } else if (gotsp) {
+                                       gotsp = 0;
+                                       ++wordct;
                                }
                        }
                }
