Hi Scott,
thank you for spending quite some work on our small utility programs,
and sorry for failing to follow up as often as i should.
Scott Cheloha wrote on Fri, Jan 07, 2022 at 09:45:41AM -0600:
> In rev(1), we call MB_CUR_MAX for every byte in the input stream.
> This is extremely expensive.
That's actually quite ridiculous.
The reason why MB_CUR_MAX is so expensive is -- multi-thread support.
That is of course completely pointless for small utility programs,
on any operating system. And on OpenBSD in particular, it is
mostly useless even for large application programs that actually
use multiple threads because we basically only support one single
locale in the first place.
So i had a brief look at the implementation of MB_CUR_MAX in our libc
to see whether it could be made less expensive, but i don't see a
simple way without potentially breaking some application programs
(for example, consider a third-party webserver that wants to run
with a POSIX locale in one thread but with a UTF-8 locale in
another thread).
Apart from that, even if it could be made less expensive, avoiding
to call it in tight loops might still be valuable in the interest
of portability. On other systems, it must necessarily be expensive:
that web server might want to run one thread with UTF-8, one with
UTF-16, one with UTF-32, and one with Shift JIS...
So i agree with jour general idea to improve rev(1) and other
small utility programs that might have similar issues.
> It is much cheaper to call it once per line and use a simpler loop
> (see the inlined patch below) if the current locale doesn't handle
> multibyte characters:
[useful measurements snipped]
> CC schwarze@ to double-check I'm not misunderstanding MB_CUR_MAX.
> I'm under the impression the return value cannot change unless
> we call setlocale(3).
Mostly correct, except that uselocale(3) may also change it.
But of course, we don't call uselocale(3) in small utility programs.
> ok?
I think your patch can still be improved a bit before committing.
millert@ wrote:
> Why not just store the result of MB_CUR_MAX in a variable and use
> that? It's value is not going to change during execution of the
> program. This made a big difference in sort(1) once upon a time.
I agree with that.
Even with your patch, rev(1) would still be slow for large files
consisting of very short lines.
> Index: rev.c
> ===================================================================
> RCS file: /cvs/src/usr.bin/rev/rev.c,v
> retrieving revision 1.13
> diff -u -p -r1.13 rev.c
> --- rev.c 10 Apr 2016 17:06:52 -0000 1.13
> +++ rev.c 7 Jan 2022 15:38:39 -0000
> @@ -82,13 +82,18 @@ main(int argc, char *argv[])
> while ((len = getline(&p, &ps, fp)) != -1) {
> if (p[len - 1] == '\n')
> --len;
> - for (t = p + len - 1; t >= p; --t) {
> - if (isu8cont(*t))
> - continue;
> - u = t;
> - do {
> - putchar(*u);
> - } while (isu8cont(*(++u)));
> + if (MB_CUR_MAX == 1) {
> + for (t = p + len - 1; t >= p; t--)
> + putchar(*t);
> + } else {
> + for (t = p + len - 1; t >= p; --t) {
I dislike how you use t-- in one loop but --t in the other.
It made me wonder for a moment whether that needs to be different
depending on the character set. Wouldn't it be less confusing
to pick one of the forms and stick to it?
> + if (isu8cont(*t))
> + continue;
> + u = t;
> + do {
> + putchar(*u);
> + } while (isu8cont(*(++u)));
> + }
> }
> putchar('\n');
> }
> @@ -104,7 +109,7 @@ main(int argc, char *argv[])
> int
> isu8cont(unsigned char c)
> {
> - return MB_CUR_MAX > 1 && (c & (0x80 | 0x40)) == 0x80;
> + return (c & (0x80 | 0x40)) == 0x80;
> }
The isu8cont() function is a semi-standard function used in many
OpenBSD utility programs, originally designed by tedu@, i think,
and it has proven highly useful to get UTF-8 support going.
But back in the day, i think we failed to realize that MB_CUR_MAX
is expensive. In that sense, it is probably somewhat ill-designed:
it is typically called in tight loops, so i guess there aren't
many use cases where the original version is really that good.
Given that ksh(1) and ypldap(8) already use versions of this
function identical to your version, i'm OK with your suggestion
of keeping the name while changing the semantics of the function,
even though that means that semantics of isu8cont() continue to vary
across our tree depending on the needs of individual programs.
I should probably audit the tree to see whether isu8cont() and
isu8start() are used correctly and efficiently everywhere and
whether the versions calling MB_CUR_MAX can be eliminated (to
improve uniformity). But that's independent of your diff.
Also, i think i should write a MB_CUR_MAX(3) manual page.
It is quite misleading that we document it using .Dv, which
makes it look like a #defined constant and misleads people
to think it is cheap, while it is actually a function call,
and necessarily expensive at least on other systems. Again,
not your job to worry about that missing manual page...
Yours,
Ingo