On Wed, Nov 17, 2021 at 08:37:53AM -0600, Scott Cheloha wrote:
> In wc(1) we currently count words, both ASCII and multibyte, in a
> getline(3) loop.
>
> This makes sense in the multibyte case because stdio handles all the
> nasty buffer resizing for us. We avoid splitting a multibyte between
> two read(2) calls and the resulting code is simpler.
>
> However, for ASCII input we don't have the split-character problem.
> Using getline(3) doesn't really buy us anything. We can count words
> in a big buffer (as we do in the ASCII byte- and line-counting modes)
> just fine.
>
> [...]

37 week bump.

Counting words in a big buffer is faster than doing it with
getline(3). We don't need the convenience of getline(3) except in
the multibyte case.

The state machine for counting words doesn't need to change because
word transitions still happen within a single byte. We just move the
logic out of the getline(3) loop and into a read(2) loop.

As for "faster", consider The Adventures of Sherlock Holmes:

$ ftp -o sherlock-holmes.txt https://www.gutenberg.org/files/1661/1661-0.txt
Trying 152.19.134.47...
Requesting https://www.gutenberg.org/files/1661/1661-0.txt
100% |**************************************************|   593 KB    00:01
607430 bytes received in 1.05 seconds (563.58 KB/s)
$ ls -lh sherlock-holmes.txt
-rw-r--r--  1 ssc  ssc  593K Jun  9  2021 sherlock-holmes.txt

-current:

$ command time /usr/bin/wc $(jot -b ~/sherlock-holmes.txt 200) | tail -n 1
        2.081 real         2.730 user         0.080 sys
 2460800 21512000 121486000 total

Patched:

$ command time obj/wc $(jot -b /home/ssc/sherlock-holmes.txt 200) | tail -n 1
        1.093 real         1.910 user         0.030 sys
 2460800 21512000 121486000 total

So, twice as fast on an input with normal-ish line lengths.

ok?

Index: wc.c
===================================================================
RCS file: /cvs/src/usr.bin/wc/wc.c,v
retrieving revision 1.29
diff -u -p -r1.29 wc.c
--- wc.c	28 Nov 2021 19:28:42 -0000	1.29
+++ wc.c	3 Aug 2022 23:11:45 -0000
@@ -145,16 +145,42 @@ cnt(const char *path)
 		fd = STDIN_FILENO;
 	}
 
-	if (!doword && !multibyte) {
+	if (!multibyte) {
 		if (bufsz < _MAXBSIZE &&
 		    (buf = realloc(buf, _MAXBSIZE)) == NULL)
 			err(1, NULL);
+
+		/*
+		 * According to POSIX, a word is a "maximal string of
+		 * characters delimited by whitespace." Nothing is said
+		 * about a character being printing or non-printing.
+		 */
+		if (doword) {
+			gotsp = 1;
+			while ((len = read(fd, buf, _MAXBSIZE)) > 0) {
+				charct += len;
+				for (C = buf; len--; ++C) {
+					if (isspace((unsigned char)*C)) {
+						gotsp = 1;
+						if (*C == '\n')
+							++linect;
+					} else if (gotsp) {
+						gotsp = 0;
+						++wordct;
+					}
+				}
+			}
+			if (len == -1) {
+				warn("%s", file);
+				rval = 1;
+			}
+		}
 		/*
 		 * Line counting is split out because it's a lot
 		 * faster to get lines than to get words, since
 		 * the word count requires some logic.
 		 */
-		if (doline) {
+		else if (doline) {
 			while ((len = read(fd, buf, _MAXBSIZE)) > 0) {
 				charct += len;
 				for (C = buf; len--; ++C)
@@ -204,46 +230,26 @@ cnt(const char *path)
 		return;
 	}
 
-	/*
-	 * Do it the hard way.
-	 * According to POSIX, a word is a "maximal string of
-	 * characters delimited by whitespace." Nothing is said
-	 * about a character being printing or non-printing.
-	 */
 	gotsp = 1;
 	while ((len = getline(&buf, &bufsz, stream)) > 0) {
-		if (multibyte) {
-			const char *end = buf + len;
-			for (C = buf; C < end; C += len) {
-				++charct;
-				len = mbtowc(&wc, C, MB_CUR_MAX);
-				if (len == -1) {
-					mbtowc(NULL, NULL,
-					    MB_CUR_MAX);
-					len = 1;
-					wc = L'?';
-				} else if (len == 0)
-					len = 1;
-				if (iswspace(wc)) {
-					gotsp = 1;
-					if (wc == L'\n')
-						++linect;
-				} else if (gotsp) {
-					gotsp = 0;
-					++wordct;
-				}
-			}
-		} else {
-			charct += len;
-			for (C = buf; len--; ++C) {
-				if (isspace((unsigned char)*C)) {
-					gotsp = 1;
-					if (*C == '\n')
-						++linect;
-				} else if (gotsp) {
-					gotsp = 0;
-					++wordct;
-				}
+		const char *end = buf + len;
+		for (C = buf; C < end; C += len) {
+			++charct;
+			len = mbtowc(&wc, C, MB_CUR_MAX);
+			if (len == -1) {
+				mbtowc(NULL, NULL,
+				    MB_CUR_MAX);
+				len = 1;
+				wc = L'?';
+			} else if (len == 0)
+				len = 1;
+			if (iswspace(wc)) {
+				gotsp = 1;
+				if (wc == L'\n')
+					++linect;
+			} else if (gotsp) {
+				gotsp = 0;
+				++wordct;
 			}
 		}
 	}
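
For anyone who wants to poke at the "word transitions still happen
within a single byte" claim outside of wc(1), here is a rough
standalone sketch of the same state machine. It is not part of the
diff: struct counts, counts_init() and counts_feed() are hypothetical
names made up for this illustration. The only state carried between
chunks is gotsp, which is why it shouldn't matter where a read(2)
happens to cut the input.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper, not from wc.c: running totals plus the single
 * bit of state that has to survive from one chunk to the next.
 */
struct counts {
	long long lines;
	long long words;
	long long bytes;
	int gotsp;		/* nonzero if the previous byte was whitespace */
};

static void
counts_init(struct counts *c)
{
	c->lines = c->words = c->bytes = 0;
	c->gotsp = 1;		/* start of input acts like whitespace */
}

/*
 * Feed one chunk, e.g. whatever a single read(2) returned.  A word is
 * counted on the first non-space byte that follows whitespace, so a
 * word split across two chunks is still counted exactly once.
 */
static void
counts_feed(struct counts *c, const char *buf, size_t len)
{
	const char *p;

	c->bytes += (long long)len;
	for (p = buf; len--; ++p) {
		if (isspace((unsigned char)*p)) {
			c->gotsp = 1;
			if (*p == '\n')
				++c->lines;
		} else if (c->gotsp) {
			c->gotsp = 0;
			++c->words;
		}
	}
}
```

Feeding "hello world\nfoo  bar\n" in one call, or cut anywhere in the
middle of "hello", should give the same totals either way, assuming
the C locale and single-byte input as in the !multibyte path.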
