On Fri, Jun 09, 2017 at 07:58:48AM +0000, kshe wrote: > Hi, > > There is an ongoing confusion in sed's source about the concept of EOL: > the code contradicts itself as to whether it should be determined by a > trailing NUL or by looking up the `len' field of the SPACE structure. > > At least two commands (`l' and `s') assume that the pattern space is > always NUL-terminated. However, at least two other commands (`c' and > `D') do not comply with that assumption, modifying it by merely updating > its length attribute. As a result, using `l' or `s' after `c' or `D' > produces incorrect output. > > For instance, in the following excerpt, `l' should print `bb$': > > $ printf 'a\nbb\n' | sed '${l;q;};N;D' > bbbb$ > bb > > For the same underlying reason, it also disregards the deletion of the > pattern space that `c' is supposed to entail: > > $ echo abc | sed 'c\ > > text > > l' > text > abc$ > > The substitute command can likewise get confused and add an extra > character after a replacement: > > $ printf 'a\nbb\n' | sed '${s/$/x/;q;};N;D' > bbxb > > Note how this does not happen when the substitution pattern matches more > than the empty string: > > $ printf 'a\nbb\n' | sed '${s/.$/&x/;q;};N;D' > bbx > > After an empty match, the character that `s' adds to the pattern space > depends on how memmove(3) modified it beforehand. Here's a very > simplified version of the original sequence of commands from which I > discovered these bugs, where `i' appear instead of `b': > > $ printf 'a\nbb\n' | sed '${s/$/x/;q;};N;s/a/ZZi/;D' > bbxi > > Similarily, when one believes the pattern space to be empty after a `c' > command, something like `s/.*//' will magically revive one character > from that emptiness: > > $ echo abc | sed 'c\ > > text > > s/.*//' > text > a > > As I came across these in an actual script that I was writing, I was > quite amazed to find out that they went unnoticed for so long, without > anyone ever trying to see what `l' does after `c' or `D'. Indeed, the > `l' command is broken since revision 1.1 of process.c, making that bug > older than I am. It has also been present in the other BSDs, but: > > o FreeBSD fixed it by accident in 2004 by adding support for > multibyte encodings; > > o NetBSD fixed it by accident in 2014 by importing FreeBSD's sed > (merging their changes afterwards, but that didn't reintroduce > the bug, since they hadn't touched that part of the code). > > The more serious bug regarding empty matches in the substitute command > was introduced in OpenBSD in 2011, when schwarze@ decided to rewrite its > main loop, intending to "fix multiple issues regarding the replacement > of zero-length strings", but apparently with the unfortunate side effect > of introducing a new issue regarding the replacement of zero-length > strings. Finally, this commit was adopted by FreeBSD in 2016 when they > synchronised some of their code with OpenBSD's to ease importing other > fixes from it. > > Thus, OpenBSD still has the old `l' bug as well as the more recent `s' > one, FreeBSD only has the latter, and since 2014 NetBSD is not longer > affected by any of these, although the fact that `c' and `D' omit to > NUL-terminate the pattern space could be considered a bug in itself, > since at least one piece of code present for 20 years in their tree was > making the contrary assumption. So who knows where else it was made? > Actually, with all due respect, probably not even the original > developers: many of their choices suggest that the end of the pattern > space was not always meant to be encoded as a trailing NUL, but they > nevertheless designed the function `lputs' to only take a simple string > parameter, by which no separate length information could ever be > accessible. > > I therefore have the strong feeling that this is not the last bug of > BSD's sed. The remaining ones may be hidden in dark corners, but I do > expect them to show up at some point in the future, considering how > fragile the existing code looks. The overall situation also prevents > any significant improvement to be successfully conducted, as changing > more than two lines will most likely lead to subtle regressions; see the > last few commits at > > https://svnweb.freebsd.org/base/head/usr.bin/sed/process.c > > and > > http://cvsweb.netbsd.org/cgi-bin/cvsweb.cgi/src/usr.bin/sed/process.c > > for recent examples of such failed attempts. > > Anyway, until I digress even more, here is a patch for that one, forcing > `c' and `D' to put a NUL where `l' and `s' expect one to be. By the > way, whether the fault is on one side or the other is actually unclear. > It does seem redundant to enforce such a convention when one already has > access to the length of the corresponding string. As a matter of fact, > the overwhelming majority of the code does not appear to ever rely on a > NUL being there at all; for instance, `D' and `P' use memchr(3), not > strchr(3), which, besides, allows for the manipulation of NUL bytes > originally present in the input. Hence a more complete patch could > instead try to unify the whole source to follow one convention > consistently; however, that would probably take as much time as a full > rewrite, because that is essentially what would need to be done, so, for > now, I kept it as simple as it could be, to avoid messing it up even > more than it already is. > > --- a/src/usr.bin/sed/process.c Wed Feb 22 14:09:09 2017 > +++ b/src/usr.bin/sed/process.c Wed Jun 7 09:56:20 2017 > @@ -121,7 +121,7 @@ redirect: > goto redirect; > case 'c': > pd = 1; > - psl = 0; > + ps[psl = 0] = '\0'; > if (cp->a2 == NULL || lastaddr || lastline()) > (void)fprintf(outfile, "%s", cp->t); > break; > @@ -138,6 +138,7 @@ redirect: > } else { > psl -= (p + 1) - ps; > memmove(ps, p + 1, psl); > + ps[psl] = '\0'; > goto top; > } > case 'g': > > Also note how one could have thought of fixing `D' by making its > memmove(3) go up to `psl + 1' instead of `psl', since there now should > always be a NUL at the end; however, a separate assignment makes the > intent clearer, and I read somewhere that explicit was better than > implicit. > > That being said, the more I think about this, the more I am convinced > that the problematic commands were in fact `l' and `s'. Since the code > already uses fgetln(3) to read lines, it would have been nicer (and > optimal) to never rely on NULs. As stated above, this is already the > case almost everywhere; therefore, if any of the OpenBSD developers > wanted to elaborate on that idea after committing such an unsatisfying > patch, I would of course be very happy to help. > > Kind regards, > > kshe
Thanks for the analysis and diff, I hope to get a chanche to think about this soon. At least I'll make sure this diff is not forgotten, -Otto