Re: Potential grep bug?

2020-07-22 Thread Todd C . Miller
On Wed, 22 Jul 2020 20:57:56 +0200, Martijn van Duren wrote:

> Any takers?

OK millert@

 - todd



Re: Potential grep bug?

2020-07-22 Thread Martijn van Duren
Any takers?

On Wed, 2020-06-24 at 13:27 +0200, Martijn van Duren wrote:
> Moving to tech@
> 
> On Tue, 2020-06-23 at 22:17 -0900, Philip Guenther wrote:
> > Nope.  This is a grep of a single file, so procfile() must be overflowing
> > and this only 'fixes' it by relying on signed overflow, which is undefined
> > behavior, being handled in a particular way by the compiler.  So, luck
> > (which fails when the compiler decides to hate you).  There are more places
> > that need to change for the reported problem to be handled safely.
> > 
> > Philip Guenther
> 
> I'm not sure I understand exactly what you mean, but the overflow can
> still propagate through to the return value. Since everything propagated
> up from procfile is only eventually used to determine the exit value we
> can start treating it as a boolean.
> 
> I gave a quick glance at freebsd and they also treat it as a boolean,
> but more explicitly. They however don't appear have overflow detection.
> 
> Is this better?
> 
> martijn@
> > 
> > On Tue, Jun 23, 2020 at 9:58 PM Martijn van Duren <
> > open...@list.imperialat.at> wrote:
> > 
> > > This seems to fix the issue for me.
> > > 
> > > OK?
> > > 
> > > martijn@
> > > 
> > > On Tue, 2020-06-23 at 19:29 -0700, Jordan Geoghegan wrote:
> > > > Hello,
> > > > 
> > > > I was working on a couple POSIX regular expressions to search for and
> > > > validate IPv4 and IPv6 addresses with optional CIDR blocks, and
> > > > encountered some strange behaviour from the base system grep.
> > > > 
> > > > I wanted to validate my regex against a list of every valid IPv4
> > > > address, so I generated a list with a zsh 1 liner:
> > > > 
> > > >   for i in {0..255}; do; echo $i.{0..255}.{0..255}.{0..255} ; done |
> > > > tr '[:space:]' '\n' > IPv4.txt
> > > > 
> > > > My intentions were to test the regex by running it with 'grep -c' to
> > > > confirm there was indeed 2^32 addresses matched, and I also wanted to
> > > > benchmark and compare performance between BSD grep, GNU grep and
> > > > ripgrep. The command I used:
> > > > 
> > > > grep -Eoc
> > > > 
> > > "((25[0-5]|(2[0-4]|1{0,1}[[:digit:]]){0,1}[[:digit:]])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[[:digit:]]){0,1}[[:digit:]])(/[1-9]|/[1-2][[:digit:]]|/3[0-2])?"
> > > > My findings were surprising. Both GNU grep and ripgrep were able get
> > > > through the file in roughly 10 and 20 minutes respectively, whereas the
> > > > base system grep took over 20 hours! What interested me the most was
> > > > that the base system grep when run with '-c' returned '0' for match
> > > > count. It seems that 'grep -c' will have its counter overflow if there
> > > > are more than 2^32-1 matches (4294967295) and then the counter will
> > > > start counting from zero again for further matches.
> > > > 
> > > >  ryzen$ time zcat IPv4.txt.gz | grep -Eoc
> > > "((25[0-5]|(2[0-4]|1{0,1}...
> > > >  0
> > > >  1222m09.32s real  1224m28.02s user 1m16.17s system
> > > > 
> > > >  ryzen$ time zcat allip.txt.gz | ggrep -Eoc
> > > "((25[0-5]|(2[0-4]|1{0,1}...
> > > >  4294967296
> > > >  10m00.38s real11m40.57s user 0m30.55s system
> > > > 
> > > >  ryzen$ time rg -zoc "((25[0-5]|(2[0-4]|1{0,1}...
> > > >  4294967296
> > > >  21m06.36s real27m06.04s user 0m50.08s system
> > > > 
> > > > # See the counter overflow/reset:
> > > >  jot 4294967350 | grep -c "^[[:digit:]]"
> > > >  54
> > > > 
> > > > All testing was done on a Ryzen desktop machine running 6.7 stable.
> > > > 
> > > > The grep counting bug can be reproduced with this command:
> > > > jot 4294967296 | nice grep -c "^[[:digit:]]"
> > > > 
> > > > Regards,
> > > > 
> > > > Jordan
> > > > 
> Index: grep.c
> ===
> RCS file: /cvs/src/usr.bin/grep/grep.c,v
> retrieving revision 1.64
> diff -u -p -r1.64 grep.c
> --- grep.c3 Dec 2019 09:14:37 -   1.64
> +++ grep.c24 Jun 2020 11:26:25 -
> @@ -517,7 +517,7 @@ main(int argc, char *argv[])
>   c = grep_tree(argv);
>   else
>   for (c = 0; argc--; ++argv)
> - c += procfile(*argv);
> + c |= procfile(*argv);
>  
>   exit(c ? (file_err ? (qflag ? 0 : 2) : 0) : (file_err ? 2 : 1));
>  }
> Index: util.c
> ===
> RCS file: /cvs/src/usr.bin/grep/util.c,v
> retrieving revision 1.62
> diff -u -p -r1.62 util.c
> --- util.c3 Dec 2019 09:14:37 -   1.62
> +++ util.c24 Jun 2020 11:26:25 -
> @@ -91,7 +91,7 @@ grep_tree(char **argv)
>   /* skip "./" if implied */
>   if (argv == dot_argv && p->fts_pathlen >= 2)
>   path += 2;
> - c += procfile(path);
> + c |= procfile(path);
>   break;
>   }
>   }
> @@ -106,7 +106,8 @@ procfile(char *fn)
>  {
>   str_t 

Re: Potential grep bug?

2020-06-24 Thread Martijn van Duren
Moving to tech@

On Tue, 2020-06-23 at 22:17 -0900, Philip Guenther wrote:
> Nope.  This is a grep of a single file, so procfile() must be overflowing
> and this only 'fixes' it by relying on signed overflow, which is undefined
> behavior, being handled in a particular way by the compiler.  So, luck
> (which fails when the compiler decides to hate you).  There are more places
> that need to change for the reported problem to be handled safely.
> 
> Philip Guenther

I'm not sure I understand exactly what you mean, but the overflow can
still propagate through to the return value. Since everything propagated
up from procfile is only eventually used to determine the exit value we
can start treating it as a boolean.

I gave a quick glance at freebsd and they also treat it as a boolean,
but more explicitly. They however don't appear have overflow detection.

Is this better?

martijn@
> 
> 
> On Tue, Jun 23, 2020 at 9:58 PM Martijn van Duren <
> open...@list.imperialat.at> wrote:
> 
> > This seems to fix the issue for me.
> > 
> > OK?
> > 
> > martijn@
> > 
> > On Tue, 2020-06-23 at 19:29 -0700, Jordan Geoghegan wrote:
> > > Hello,
> > > 
> > > I was working on a couple POSIX regular expressions to search for and
> > > validate IPv4 and IPv6 addresses with optional CIDR blocks, and
> > > encountered some strange behaviour from the base system grep.
> > > 
> > > I wanted to validate my regex against a list of every valid IPv4
> > > address, so I generated a list with a zsh 1 liner:
> > > 
> > >   for i in {0..255}; do; echo $i.{0..255}.{0..255}.{0..255} ; done |
> > > tr '[:space:]' '\n' > IPv4.txt
> > > 
> > > My intentions were to test the regex by running it with 'grep -c' to
> > > confirm there was indeed 2^32 addresses matched, and I also wanted to
> > > benchmark and compare performance between BSD grep, GNU grep and
> > > ripgrep. The command I used:
> > > 
> > > grep -Eoc
> > > 
> > "((25[0-5]|(2[0-4]|1{0,1}[[:digit:]]){0,1}[[:digit:]])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[[:digit:]]){0,1}[[:digit:]])(/[1-9]|/[1-2][[:digit:]]|/3[0-2])?"
> > > My findings were surprising. Both GNU grep and ripgrep were able get
> > > through the file in roughly 10 and 20 minutes respectively, whereas the
> > > base system grep took over 20 hours! What interested me the most was
> > > that the base system grep when run with '-c' returned '0' for match
> > > count. It seems that 'grep -c' will have its counter overflow if there
> > > are more than 2^32-1 matches (4294967295) and then the counter will
> > > start counting from zero again for further matches.
> > > 
> > >  ryzen$ time zcat IPv4.txt.gz | grep -Eoc
> > "((25[0-5]|(2[0-4]|1{0,1}...
> > >  0
> > >  1222m09.32s real  1224m28.02s user 1m16.17s system
> > > 
> > >  ryzen$ time zcat allip.txt.gz | ggrep -Eoc
> > "((25[0-5]|(2[0-4]|1{0,1}...
> > >  4294967296
> > >  10m00.38s real11m40.57s user 0m30.55s system
> > > 
> > >  ryzen$ time rg -zoc "((25[0-5]|(2[0-4]|1{0,1}...
> > >  4294967296
> > >  21m06.36s real27m06.04s user 0m50.08s system
> > > 
> > > # See the counter overflow/reset:
> > >  jot 4294967350 | grep -c "^[[:digit:]]"
> > >  54
> > > 
> > > All testing was done on a Ryzen desktop machine running 6.7 stable.
> > > 
> > > The grep counting bug can be reproduced with this command:
> > > jot 4294967296 | nice grep -c "^[[:digit:]]"
> > > 
> > > Regards,
> > > 
> > > Jordan
> > > 
Index: grep.c
===
RCS file: /cvs/src/usr.bin/grep/grep.c,v
retrieving revision 1.64
diff -u -p -r1.64 grep.c
--- grep.c  3 Dec 2019 09:14:37 -   1.64
+++ grep.c  24 Jun 2020 11:26:25 -
@@ -517,7 +517,7 @@ main(int argc, char *argv[])
c = grep_tree(argv);
else
for (c = 0; argc--; ++argv)
-   c += procfile(*argv);
+   c |= procfile(*argv);
 
exit(c ? (file_err ? (qflag ? 0 : 2) : 0) : (file_err ? 2 : 1));
 }
Index: util.c
===
RCS file: /cvs/src/usr.bin/grep/util.c,v
retrieving revision 1.62
diff -u -p -r1.62 util.c
--- util.c  3 Dec 2019 09:14:37 -   1.62
+++ util.c  24 Jun 2020 11:26:25 -
@@ -91,7 +91,7 @@ grep_tree(char **argv)
/* skip "./" if implied */
if (argv == dot_argv && p->fts_pathlen >= 2)
path += 2;
-   c += procfile(path);
+   c |= procfile(path);
break;
}
}
@@ -106,7 +106,8 @@ procfile(char *fn)
 {
str_t ln;
file_t *f;
-   int c, t, z, nottext;
+   int t, z, nottext, overflow = 0;
+   unsigned long long c;
 
mcount = mlimit;
 
@@ -158,7 +159,10 @@ procfile(char *fn)
enqueue();
linesqueued++;