Any takers?
On Wed, 2020-06-24 at 13:27 +0200, Martijn van Duren wrote:
> Moving to tech@
>
> On Tue, 2020-06-23 at 22:17 -0900, Philip Guenther wrote:
> > Nope. This is a grep of a single file, so procfile() must be overflowing
> > and this only 'fixes' it by relying on signed overflow, which is undefined
> > behavior, being handled in a particular way by the compiler. So, luck
> > (which fails when the compiler decides to hate you). There are more places
> > that need to change for the reported problem to be handled safely.
> >
> > Philip Guenther
>
> I'm not sure I understand exactly what you mean, but the overflow can
> still propagate through to the return value. Since everything propagated
> up from procfile is only eventually used to determine the exit value we
> can start treating it as a boolean.
>
> I gave a quick glance at freebsd and they also treat it as a boolean,
> but more explicitly. They however don't appear have overflow detection.
>
> Is this better?
>
> martijn@
> >
> > On Tue, Jun 23, 2020 at 9:58 PM Martijn van Duren <
> > open...@list.imperialat.at> wrote:
> >
> > > This seems to fix the issue for me.
> > >
> > > OK?
> > >
> > > martijn@
> > >
> > > On Tue, 2020-06-23 at 19:29 -0700, Jordan Geoghegan wrote:
> > > > Hello,
> > > >
> > > > I was working on a couple POSIX regular expressions to search for and
> > > > validate IPv4 and IPv6 addresses with optional CIDR blocks, and
> > > > encountered some strange behaviour from the base system grep.
> > > >
> > > > I wanted to validate my regex against a list of every valid IPv4
> > > > address, so I generated a list with a zsh 1 liner:
> > > >
> > > > for i in {0..255}; do; echo $i.{0..255}.{0..255}.{0..255} ; done |
> > > > tr '[:space:]' '\n' > IPv4.txt
> > > >
> > > > My intentions were to test the regex by running it with 'grep -c' to
> > > > confirm there was indeed 2^32 addresses matched, and I also wanted to
> > > > benchmark and compare performance between BSD grep, GNU grep and
> > > > ripgrep. The command I used:
> > > >
> > > > grep -Eoc
> > > >
> > > "((25[0-5]|(2[0-4]|1{0,1}[[:digit:]]){0,1}[[:digit:]])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[[:digit:]]){0,1}[[:digit:]])(/[1-9]|/[1-2][[:digit:]]|/3[0-2])?"
> > > > My findings were surprising. Both GNU grep and ripgrep were able get
> > > > through the file in roughly 10 and 20 minutes respectively, whereas the
> > > > base system grep took over 20 hours! What interested me the most was
> > > > that the base system grep when run with '-c' returned '0' for match
> > > > count. It seems that 'grep -c' will have its counter overflow if there
> > > > are more than 2^32-1 matches (4294967295) and then the counter will
> > > > start counting from zero again for further matches.
> > > >
> > > > ryzen$ time zcat IPv4.txt.gz | grep -Eoc
> > > "((25[0-5]|(2[0-4]|1{0,1}...
> > > > 0
> > > > 1222m09.32s real 1224m28.02s user 1m16.17s system
> > > >
> > > > ryzen$ time zcat allip.txt.gz | ggrep -Eoc
> > > "((25[0-5]|(2[0-4]|1{0,1}...
> > > > 4294967296
> > > > 10m00.38s real 11m40.57s user 0m30.55s system
> > > >
> > > > ryzen$ time rg -zoc "((25[0-5]|(2[0-4]|1{0,1}...
> > > > 4294967296
> > > > 21m06.36s real 27m06.04s user 0m50.08s system
> > > >
> > > > # See the counter overflow/reset:
> > > > jot 4294967350 | grep -c "^[[:digit:]]"
> > > > 54
> > > >
> > > > All testing was done on a Ryzen desktop machine running 6.7 stable.
> > > >
> > > > The grep counting bug can be reproduced with this command:
> > > > jot 4294967296 | nice grep -c "^[[:digit:]]"
> > > >
> > > > Regards,
> > > >
> > > > Jordan
> > > >
> Index: grep.c
> ===================================================================
> RCS file: /cvs/src/usr.bin/grep/grep.c,v
> retrieving revision 1.64
> diff -u -p -r1.64 grep.c
> --- grep.c 3 Dec 2019 09:14:37 -0000 1.64
> +++ grep.c 24 Jun 2020 11:26:25 -0000
> @@ -517,7 +517,7 @@ main(int argc, char *argv[])
> c = grep_tree(argv);
> else
> for (c = 0; argc--; ++argv)
> - c += procfile(*argv);
> + c |= procfile(*argv);
>
> exit(c ? (file_err ? (qflag ? 0 : 2) : 0) : (file_err ? 2 : 1));
> }
> Index: util.c
> ===================================================================
> RCS file: /cvs/src/usr.bin/grep/util.c,v
> retrieving revision 1.62
> diff -u -p -r1.62 util.c
> --- util.c 3 Dec 2019 09:14:37 -0000 1.62
> +++ util.c 24 Jun 2020 11:26:25 -0000
> @@ -91,7 +91,7 @@ grep_tree(char **argv)
> /* skip "./" if implied */
> if (argv == dot_argv && p->fts_pathlen >= 2)
> path += 2;
> - c += procfile(path);
> + c |= procfile(path);
> break;
> }
> }
> @@ -106,7 +106,8 @@ procfile(char *fn)
> {
> str_t ln;
> file_t *f;
> - int c, t, z, nottext;
> + int t, z, nottext, overflow = 0;
> + unsigned long long c;
>
> mcount = mlimit;
>
> @@ -158,7 +159,10 @@ procfile(char *fn)
> enqueue(&ln);
> linesqueued++;
> }
> - c += t;
> + if (ULLONG_MAX - c < (unsigned long long)t)
> + overflow = 1;
> + else
> + c += t;
> if (mflag && mcount <= 0)
> break;
> }
> @@ -169,7 +173,7 @@ procfile(char *fn)
> if (cflag) {
> if (!hflag)
> printf("%s:", ln.file);
> - printf("%u\n", c);
> + printf("%llu%s\n", c, overflow ? "+" : "");
> }
> if (lflag && c != 0)
> printf("%s\n", fn);
> @@ -179,7 +183,7 @@ procfile(char *fn)
> binbehave == BIN_FILE_BIN && nottext && !qflag)
> printf("Binary file %s matches\n", fn);
>
> - return c;
> + return overflow || c != 0;
> }
>
>