Any takers?

On Wed, 2020-06-24 at 13:27 +0200, Martijn van Duren wrote:
> Moving to tech@
> 
> On Tue, 2020-06-23 at 22:17 -0900, Philip Guenther wrote:
> > Nope.  This is a grep of a single file, so procfile() must be overflowing
> > and this only 'fixes' it by relying on signed overflow, which is undefined
> > behavior, being handled in a particular way by the compiler.  So, luck
> > (which fails when the compiler decides to hate you).  There are more places
> > that need to change for the reported problem to be handled safely.
> > 
> > Philip Guenther
> 
> I'm not sure I understand exactly what you mean, but the overflow can
> still propagate through to the return value. Since everything propagated
> up from procfile is only eventually used to determine the exit value we
> can start treating it as a boolean.
> 
> I gave a quick glance at freebsd and they also treat it as a boolean,
> but more explicitly. They however don't appear have overflow detection.
> 
> Is this better?
> 
> martijn@
> > 
> > On Tue, Jun 23, 2020 at 9:58 PM Martijn van Duren <
> > open...@list.imperialat.at> wrote:
> > 
> > > This seems to fix the issue for me.
> > > 
> > > OK?
> > > 
> > > martijn@
> > > 
> > > On Tue, 2020-06-23 at 19:29 -0700, Jordan Geoghegan wrote:
> > > > Hello,
> > > > 
> > > > I was working on a couple POSIX regular expressions to search for and
> > > > validate IPv4 and IPv6 addresses with optional CIDR blocks, and
> > > > encountered some strange behaviour from the base system grep.
> > > > 
> > > > I wanted to validate my regex against a list of every valid IPv4
> > > > address, so I generated a list with a zsh 1 liner:
> > > > 
> > > >       for i in {0..255}; do; echo $i.{0..255}.{0..255}.{0..255} ; done |
> > > > tr '[:space:]' '\n' > IPv4.txt
> > > > 
> > > > My intentions were to test the regex by running it with 'grep -c' to
> > > > confirm there was indeed 2^32 addresses matched, and I also wanted to
> > > > benchmark and compare performance between BSD grep, GNU grep and
> > > > ripgrep. The command I used:
> > > > 
> > > >     grep -Eoc
> > > > 
> > > "((25[0-5]|(2[0-4]|1{0,1}[[:digit:]]){0,1}[[:digit:]])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[[:digit:]]){0,1}[[:digit:]])(/[1-9]|/[1-2][[:digit:]]|/3[0-2])?"
> > > > My findings were surprising. Both GNU grep and ripgrep were able get
> > > > through the file in roughly 10 and 20 minutes respectively, whereas the
> > > > base system grep took over 20 hours! What interested me the most was
> > > > that the base system grep when run with '-c' returned '0' for match
> > > > count. It seems that 'grep -c' will have its counter overflow if there
> > > > are more than 2^32-1 matches (4294967295) and then the counter will
> > > > start counting from zero again for further matches.
> > > > 
> > > >      ryzen$ time zcat IPv4.txt.gz | grep -Eoc
> > > "((25[0-5]|(2[0-4]|1{0,1}...
> > > >      0
> > > >      1222m09.32s real  1224m28.02s user     1m16.17s system
> > > > 
> > > >      ryzen$ time zcat allip.txt.gz | ggrep -Eoc
> > > "((25[0-5]|(2[0-4]|1{0,1}...
> > > >      4294967296
> > > >      10m00.38s real    11m40.57s user     0m30.55s system
> > > > 
> > > >      ryzen$ time rg -zoc "((25[0-5]|(2[0-4]|1{0,1}...
> > > >      4294967296
> > > >      21m06.36s real    27m06.04s user     0m50.08s system
> > > > 
> > > > # See the counter overflow/reset:
> > > >      jot 4294967350 | grep -c "^[[:digit:]]"
> > > >      54
> > > > 
> > > > All testing was done on a Ryzen desktop machine running 6.7 stable.
> > > > 
> > > > The grep counting bug can be reproduced with this command:
> > > >     jot 4294967296 | nice grep -c "^[[:digit:]]"
> > > > 
> > > > Regards,
> > > > 
> > > > Jordan
> > > > 
> Index: grep.c
> ===================================================================
> RCS file: /cvs/src/usr.bin/grep/grep.c,v
> retrieving revision 1.64
> diff -u -p -r1.64 grep.c
> --- grep.c    3 Dec 2019 09:14:37 -0000       1.64
> +++ grep.c    24 Jun 2020 11:26:25 -0000
> @@ -517,7 +517,7 @@ main(int argc, char *argv[])
>               c = grep_tree(argv);
>       else
>               for (c = 0; argc--; ++argv)
> -                     c += procfile(*argv);
> +                     c |= procfile(*argv);
>  
>       exit(c ? (file_err ? (qflag ? 0 : 2) : 0) : (file_err ? 2 : 1));
>  }
> Index: util.c
> ===================================================================
> RCS file: /cvs/src/usr.bin/grep/util.c,v
> retrieving revision 1.62
> diff -u -p -r1.62 util.c
> --- util.c    3 Dec 2019 09:14:37 -0000       1.62
> +++ util.c    24 Jun 2020 11:26:25 -0000
> @@ -91,7 +91,7 @@ grep_tree(char **argv)
>                       /* skip "./" if implied */
>                       if (argv == dot_argv && p->fts_pathlen >= 2)
>                               path += 2;
> -                     c += procfile(path);
> +                     c |= procfile(path);
>                       break;
>               }
>       }
> @@ -106,7 +106,8 @@ procfile(char *fn)
>  {
>       str_t ln;
>       file_t *f;
> -     int c, t, z, nottext;
> +     int t, z, nottext, overflow = 0;
> +     unsigned long long c;
>  
>       mcount = mlimit;
>  
> @@ -158,7 +159,10 @@ procfile(char *fn)
>                       enqueue(&ln);
>                       linesqueued++;
>               }
> -             c += t;
> +             if (ULLONG_MAX - c < (unsigned long long)t)
> +                     overflow = 1;
> +             else
> +                     c += t;
>               if (mflag && mcount <= 0)
>                       break;
>       }
> @@ -169,7 +173,7 @@ procfile(char *fn)
>       if (cflag) {
>               if (!hflag)
>                       printf("%s:", ln.file);
> -             printf("%u\n", c);
> +             printf("%llu%s\n", c, overflow ? "+" : "");
>       }
>       if (lflag && c != 0)
>               printf("%s\n", fn);
> @@ -179,7 +183,7 @@ procfile(char *fn)
>           binbehave == BIN_FILE_BIN && nottext && !qflag)
>               printf("Binary file %s matches\n", fn);
>  
> -     return c;
> +     return overflow || c != 0;
>  }
>  
>  

Reply via email to