On Fri, Jan 07, 2022 at 07:28:30PM +0100, Ingo Schwarze wrote:
> 
> [...]
> 
> Scott Cheloha wrote on Fri, Jan 07, 2022 at 09:45:41AM -0600:
> 
> > In rev(1), we call MB_CUR_MAX for every byte in the input stream.
> > This is extremely expensive.
> 
> [...]
> 
> So i agree with jour general idea to improve rev(1) and other
> small utility programs that might have similar issues.
> 
> > It is much cheaper to call it once per line and use a simpler loop
> > (see the inlined patch below) if the current locale doesn't handle
> > multibyte characters:
> 
> [useful measurements snipped]
> 
> > CC schwarze@ to double-check I'm not misunderstanding MB_CUR_MAX.
> > I'm under the impression the return value cannot change unless
> > we call setlocale(3).
> 
> Mostly correct, except that uselocale(3) may also change it.
> But of course, we don't call uselocale(3) in small utility programs.
> 
> > ok?
> 
> I think your patch can still be improved a bit before committing.
> 
> millert@ wrote:
> 
> > Why not just store the result of MB_CUR_MAX in a variable and use
> > that?  It's value is not going to change during execution of the
> > program.  This made a big difference in sort(1) once upon a time.
> 
> I agree with that.
> 
> Even with your patch, rev(1) would still be slow for large files
> consisting of very short lines.

Like this?

Index: rev.c
===================================================================
RCS file: /cvs/src/usr.bin/rev/rev.c,v
retrieving revision 1.13
diff -u -p -r1.13 rev.c
--- rev.c       10 Apr 2016 17:06:52 -0000      1.13
+++ rev.c       7 Jan 2022 19:41:23 -0000
@@ -50,9 +50,10 @@ main(int argc, char *argv[])
        FILE *fp;
        ssize_t len;
        size_t ps = 0;
-       int ch, rval;
+       int ch, rval, multibyte;
 
        setlocale(LC_CTYPE, "");
+       multibyte = MB_CUR_MAX > 1;
 
        if (pledge("stdio rpath", NULL) == -1)
                err(1, "pledge");
@@ -82,13 +83,18 @@ main(int argc, char *argv[])
                while ((len = getline(&p, &ps, fp)) != -1) {
                        if (p[len - 1] == '\n')
                                --len;
-                       for (t = p + len - 1; t >= p; --t) {
-                               if (isu8cont(*t))
-                                       continue;
-                               u = t;
-                               do {
-                                       putchar(*u);
-                               } while (isu8cont(*(++u)));
+                       if (multibyte) {
+                               for (t = p + len - 1; t >= p; --t) {
+                                       if (isu8cont(*t))
+                                               continue;
+                                       u = t;
+                                       do {
+                                               putchar(*u);
+                                       } while (isu8cont(*(++u)));
+                               }
+                       } else {
+                               for (t = p + len - 1; t >= p; t--)
+                                       putchar(*t);
                        }
                        putchar('\n');
                }
@@ -104,7 +110,7 @@ main(int argc, char *argv[])
 int
 isu8cont(unsigned char c)
 {
-       return MB_CUR_MAX > 1 && (c & (0x80 | 0x40)) == 0x80;
+       return (c & (0x80 | 0x40)) == 0x80;
 }
 
 void

Reply via email to