On Fri, Jan 07, 2022 at 07:28:30PM +0100, Ingo Schwarze wrote:
>
> [...]
>
> Scott Cheloha wrote on Fri, Jan 07, 2022 at 09:45:41AM -0600:
>
> > In rev(1), we call MB_CUR_MAX for every byte in the input stream.
> > This is extremely expensive.
>
> [...]
>
> So i agree with jour general idea to improve rev(1) and other
> small utility programs that might have similar issues.
>
> > It is much cheaper to call it once per line and use a simpler loop
> > (see the inlined patch below) if the current locale doesn't handle
> > multibyte characters:
>
> [useful measurements snipped]
>
> > CC schwarze@ to double-check I'm not misunderstanding MB_CUR_MAX.
> > I'm under the impression the return value cannot change unless
> > we call setlocale(3).
>
> Mostly correct, except that uselocale(3) may also change it.
> But of course, we don't call uselocale(3) in small utility programs.
>
> > ok?
>
> I think your patch can still be improved a bit before committing.
>
> millert@ wrote:
>
> > Why not just store the result of MB_CUR_MAX in a variable and use
> > that? It's value is not going to change during execution of the
> > program. This made a big difference in sort(1) once upon a time.
>
> I agree with that.
>
> Even with your patch, rev(1) would still be slow for large files
> consisting of very short lines.
Like this?
Index: rev.c
===================================================================
RCS file: /cvs/src/usr.bin/rev/rev.c,v
retrieving revision 1.13
diff -u -p -r1.13 rev.c
--- rev.c 10 Apr 2016 17:06:52 -0000 1.13
+++ rev.c 7 Jan 2022 19:41:23 -0000
@@ -50,9 +50,10 @@ main(int argc, char *argv[])
FILE *fp;
ssize_t len;
size_t ps = 0;
- int ch, rval;
+ int ch, rval, multibyte;
setlocale(LC_CTYPE, "");
+ multibyte = MB_CUR_MAX > 1;
if (pledge("stdio rpath", NULL) == -1)
err(1, "pledge");
@@ -82,13 +83,18 @@ main(int argc, char *argv[])
while ((len = getline(&p, &ps, fp)) != -1) {
if (p[len - 1] == '\n')
--len;
- for (t = p + len - 1; t >= p; --t) {
- if (isu8cont(*t))
- continue;
- u = t;
- do {
- putchar(*u);
- } while (isu8cont(*(++u)));
+ if (multibyte) {
+ for (t = p + len - 1; t >= p; --t) {
+ if (isu8cont(*t))
+ continue;
+ u = t;
+ do {
+ putchar(*u);
+ } while (isu8cont(*(++u)));
+ }
+ } else {
+ for (t = p + len - 1; t >= p; t--)
+ putchar(*t);
}
putchar('\n');
}
@@ -104,7 +110,7 @@ main(int argc, char *argv[])
int
isu8cont(unsigned char c)
{
- return MB_CUR_MAX > 1 && (c & (0x80 | 0x40)) == 0x80;
+ return (c & (0x80 | 0x40)) == 0x80;
}
void