On Fri, Jun 09, 2017 at 07:58:48AM +0000, kshe wrote:

> Hi,
> 
> There is an ongoing confusion in sed's source about the concept of EOL:
> the code contradicts itself as to whether it should be determined by a
> trailing NUL or by looking up the `len' field of the SPACE structure.
> 
> At least two commands (`l' and `s') assume that the pattern space is
> always NUL-terminated.  However, at least two other commands (`c' and
> `D') do not comply with that assumption, modifying it by merely updating
> its length attribute.  As a result, using `l' or `s' after `c' or `D'
> produces incorrect output.
> 
> For instance, in the following excerpt, `l' should print `bb$':
> 
>       $ printf 'a\nbb\n' | sed '${l;q;};N;D'
>       bbbb$
>       bb
> 
> For the same underlying reason, it also disregards the deletion of the
> pattern space that `c' is supposed to entail:
> 
>       $ echo abc | sed 'c\
>       > text
>       > l'
>       text
>       abc$
> 
> The substitute command can likewise get confused and add an extra
> character after a replacement:
> 
>       $ printf 'a\nbb\n' | sed '${s/$/x/;q;};N;D'
>       bbxb
> 
> Note how this does not happen when the substitution pattern matches more
> than the empty string:
> 
>       $ printf 'a\nbb\n' | sed '${s/.$/&x/;q;};N;D'
>       bbx
> 
> After an empty match, the character that `s' adds to the pattern space
> depends on how memmove(3) modified it beforehand.  Here's a very
> simplified version of the original sequence of commands from which I
> discovered these bugs, where `i' appear instead of `b':
> 
>       $ printf 'a\nbb\n' | sed '${s/$/x/;q;};N;s/a/ZZi/;D'
>       bbxi
> 
> Similarily, when one believes the pattern space to be empty after a `c'
> command, something like `s/.*//' will magically revive one character
> from that emptiness:
> 
>       $ echo abc | sed 'c\
>       > text
>       > s/.*//'
>       text
>       a
> 
> As I came across these in an actual script that I was writing, I was
> quite amazed to find out that they went unnoticed for so long, without
> anyone ever trying to see what `l' does after `c' or `D'.  Indeed, the
> `l' command is broken since revision 1.1 of process.c, making that bug
> older than I am.  It has also been present in the other BSDs, but:
> 
>       o FreeBSD fixed it by accident in 2004 by adding support for
>         multibyte encodings;
> 
>       o NetBSD fixed it by accident in 2014 by importing FreeBSD's sed
>         (merging their changes afterwards, but that didn't reintroduce
>         the bug, since they hadn't touched that part of the code).
> 
> The more serious bug regarding empty matches in the substitute command
> was introduced in OpenBSD in 2011, when schwarze@ decided to rewrite its
> main loop, intending to "fix multiple issues regarding the replacement
> of zero-length strings", but apparently with the unfortunate side effect
> of introducing a new issue regarding the replacement of zero-length
> strings.  Finally, this commit was adopted by FreeBSD in 2016 when they
> synchronised some of their code with OpenBSD's to ease importing other
> fixes from it.
> 
> Thus, OpenBSD still has the old `l' bug as well as the more recent `s'
> one, FreeBSD only has the latter, and since 2014 NetBSD is not longer
> affected by any of these, although the fact that `c' and `D' omit to
> NUL-terminate the pattern space could be considered a bug in itself,
> since at least one piece of code present for 20 years in their tree was
> making the contrary assumption.  So who knows where else it was made?
> Actually, with all due respect, probably not even the original
> developers: many of their choices suggest that the end of the pattern
> space was not always meant to be encoded as a trailing NUL, but they
> nevertheless designed the function `lputs' to only take a simple string
> parameter, by which no separate length information could ever be
> accessible.
> 
> I therefore have the strong feeling that this is not the last bug of
> BSD's sed.  The remaining ones may be hidden in dark corners, but I do
> expect them to show up at some point in the future, considering how
> fragile the existing code looks.  The overall situation also prevents
> any significant improvement to be successfully conducted, as changing
> more than two lines will most likely lead to subtle regressions; see the
> last few commits at
> 
> https://svnweb.freebsd.org/base/head/usr.bin/sed/process.c
> 
> and
> 
> http://cvsweb.netbsd.org/cgi-bin/cvsweb.cgi/src/usr.bin/sed/process.c
> 
> for recent examples of such failed attempts.
> 
> Anyway, until I digress even more, here is a patch for that one, forcing
> `c' and `D' to put a NUL where `l' and `s' expect one to be.  By the
> way, whether the fault is on one side or the other is actually unclear.
> It does seem redundant to enforce such a convention when one already has
> access to the length of the corresponding string.  As a matter of fact,
> the overwhelming majority of the code does not appear to ever rely on a
> NUL being there at all; for instance, `D' and `P' use memchr(3), not
> strchr(3), which, besides, allows for the manipulation of NUL bytes
> originally present in the input.  Hence a more complete patch could
> instead try to unify the whole source to follow one convention
> consistently; however, that would probably take as much time as a full
> rewrite, because that is essentially what would need to be done, so, for
> now, I kept it as simple as it could be, to avoid messing it up even
> more than it already is.
> 
> --- a/src/usr.bin/sed/process.c       Wed Feb 22 14:09:09 2017
> +++ b/src/usr.bin/sed/process.c       Wed Jun  7 09:56:20 2017
> @@ -121,7 +121,7 @@ redirect:
>                               goto redirect;
>                       case 'c':
>                               pd = 1;
> -                             psl = 0;
> +                             ps[psl = 0] = '\0';
>                               if (cp->a2 == NULL || lastaddr || lastline())
>                                       (void)fprintf(outfile, "%s", cp->t);
>                               break;
> @@ -138,6 +138,7 @@ redirect:
>                               } else {
>                                       psl -= (p + 1) - ps;
>                                       memmove(ps, p + 1, psl);
> +                                     ps[psl] = '\0';
>                                       goto top;
>                               }
>                       case 'g':
> 
> Also note how one could have thought of fixing `D' by making its
> memmove(3) go up to `psl + 1' instead of `psl', since there now should
> always be a NUL at the end; however, a separate assignment makes the
> intent clearer, and I read somewhere that explicit was better than
> implicit.
> 
> That being said, the more I think about this, the more I am convinced
> that the problematic commands were in fact `l' and `s'.  Since the code
> already uses fgetln(3) to read lines, it would have been nicer (and
> optimal) to never rely on NULs.  As stated above, this is already the
> case almost everywhere; therefore, if any of the OpenBSD developers
> wanted to elaborate on that idea after committing such an unsatisfying
> patch, I would of course be very happy to help.
> 
> Kind regards,
> 
> kshe

Thanks for the analysis and diff, I hope to get a chanche to think
about this soon. At least I'll make sure this diff is not forgotten,

        -Otto

Reply via email to