On Tue, Jun 13, 2017 at 10:08:11AM +0000, kshe wrote: > On Sat, 10 Jun 2017 10:25:27 +0000, Otto Moerbeek wrote: > > Thanks for the analysis and diff, I hope to get a chanche to think > > about this soon. At least I'll make sure this diff is not forgotten, > > -Otto > > I have seen the tests that you recently added to the tree; in the mean > time, however, while I continued experimenting and reading the source, I > discovered more bugs (as expected), as well as other trivial issues > which might be worthy of note before committing any real fixes. > > First of all, my comprehension of the `c' command has slightly changed > since my last message: I now think its behaviour should be aligned with > that of `p', `P', `w' and others, all of which produce no output after > `c', treating a deleted pattern space as truly inexistent, and not > merely as an empty one. It would hence appear more coherent to add > > if (pd) > break; > > before calling lputs(), as is done in many other places. This would > then change the expected output of one of the newly added tests. > > In general though, that `c' command is a big mess. It is supposed to > delete the pattern space: in other implementations of sed that I have > tested, that means it will essentially behave like `i' followed by `d'. > In BSD's sed, however, it continues processing commands... Here's one > of the many bugs that arise because of that: > > $ echo abc | sed 'c\ > > text > > s/.*/x/ > > p > > s/.*/y/ > > p' > text > x > > So why isn't `y' printed too? In comparison, using flags instead of > standalone commands: > > $ echo abc | sed 'c\ > > text > > s/.*/x/p > > s/.*/y/p' > text > x > y > > Here, we do get a `y', but only one (there should be two given that `-n' > has not been supplied to suppress normal output). Since at least `p' as > a flag given to `s' appears to work, let us try the same with `w': > > $ echo abc | sed 'c\ > > text > > s/.*/x/w /tmp/sed.out > > s/.*/y/w /tmp/sed.out' > text > $ cat /tmp/sed.out > x > > But this time it doesn't work. Now add a third `s' command... > > $ echo abc | sed 'c\ > > text > > s/.*/x/w /tmp/sed.out > > s/.*/y/w /tmp/sed.out > > s/.*/z/w /tmp/sed.out' > text > z > $ cat /tmp/sed.out > x > z > > And it works. Well, not really: we do see a `z', but still no `y'. > Indeed, this does not make any sense. > > The reason for this strange behaviour is nevertheless easy to figure out > when looking at how `s' is implemented, keeping in mind that the > substitute space is globally defined. In any case, even if one was to > fix that particular bug, I would probably be able to find more. This is > why I believe that the only sane solution is that, just like `d', `c' > should simply > > goto new; > > which would, at the same time, obviate the need for the `deleted' field > of the SPACE structure, subsequently allowing for substantial cleanups. > > Also, aside of that `c' mess, there is still the fact that NUL bytes > originally present in the input are not well handled at all. Of course, > `l' cannot print past them (whereas `p' can), which is a shame; but > significantly worse is the inconsistent behaviour of `s' in that case. > At first, it seems to work well (in the output, `^@' represents a NUL): > > $ printf 'x\0y' | sed 's/x/_/' > _^@y > > It can even handle some empty matches perfectly: > > $ printf 'x\0y' | sed 's/[[:<:]]/_/' # pattern space survives > _x^@y > > But not all of them: > > $ printf 'x\0y' | sed 's/[[:>:]]/_/' # pattern space is cut off > x_ > $ printf 'x\0y' | sed 's/[[:>:]]/_/2' # likewise > x > > One might argue that NUL bytes are not supposed to happen in textual > input, so that these examples are irrelevant; the real problem however > is that it also affects newlines, thus perfectly valid text: > > $ printf 'x\ny' | sed 'N;s/[[:<:]]/_/' > _x > y > $ printf 'x\ny' | sed 'N;s/[[:>:]]/_/' > x_ > $ printf 'x\ny' | sed 'N;s/[[:>:]]/_/2' > x > > So maybe substitute() should test for (slen == 0) instead of the > questionable (*s == '\0' || *s == '\n'). That would seem more correct > to me, but this is only a guess since I must admit that I don't fully > understand that part of the code yet. Here is a diff, for instance, > that fixes all the above issues. > > --- a/src/usr.bin/sed/process.c Wed Jun 7 09:56:20 2017 > +++ b/src/usr.bin/sed/process.c Tue Jun 13 07:21:59 2017 > @@ -387,14 +388,11 @@ substitute(struct s_command *cp) > * and at the end of the line, terminate. > */ > if (match[0].rm_so == match[0].rm_eo) { > - if (*s == '\0' || *s == '\n') > - slen = -1; > - else > - slen--; > - if (*s != '\0') { > - cspace(&SS, s++, 1, APPEND); > - le++; > - } > + if (slen == 0) > + break; > + cspace(&SS, s++, 1, APPEND); > + slen--; > + le++; > lastempty = 1; > } else > lastempty = 0; > > However, even if it passes the regression tests with the same success as > the current sed, please do not blindly commit this patch. Although I > fail to see what it was, surely there must have been a reason to test > for that particular condition in the original code, and not simply for > (slen == 0). This patch is therefore only intented as an example of how > a fix might look like, but definitely not as the correct and final > solution. > > Lastly, I have prepared one more patch dealing with trivial issues that > I found while browsing the code. Contrary to the previous one, I am > very confident that it will not have any undesired effects, although of > course it should be carefully reviewed, as any patch should be. > > --- a/src/usr.bin/sed/defs.h Fri Jan 20 10:26:16 2017 > +++ b/src/usr.bin/sed/defs.h Mon Jun 12 09:32:57 2017 > @@ -129,7 +129,6 @@ typedef struct { > size_t len; /* Current length. */ > int deleted; /* If deleted. */ > int append_newline; /* If originally terminated by \n. */ > - char *back; /* Backing memory. */ > size_t blen; /* Backing memory length. */ > } SPACE; > > --- a/src/usr.bin/sed/process.c Wed Jun 7 09:56:20 2017 > +++ b/src/usr.bin/sed/process.c Tue Jun 13 07:21:59 2017 > @@ -77,9 +77,9 @@ static regex_t *defpreg; > size_t maxnsub; > regmatch_t *match; > > -#define OUT() do {\ > - fwrite(ps, 1, psl, outfile);\ > - if (psanl) fputc('\n', outfile);\ > +#define OUT() do { \ > + fwrite(ps, 1, psl, outfile); \ > + if (psanl) fputc('\n', outfile); \ > } while (0) > > void > @@ -145,15 +145,15 @@ redirect: > cspace(&PS, hs, hsl, REPLACE); > break; > case 'G': > - cspace(&PS, "\n", 1, 0); > - cspace(&PS, hs, hsl, 0); > + cspace(&PS, "\n", 1, APPEND); > + cspace(&PS, hs, hsl, APPEND); > break; > case 'h': > cspace(&HS, ps, psl, REPLACE); > break; > case 'H': > - cspace(&HS, "\n", 1, 0); > - cspace(&HS, ps, psl, 0); > + cspace(&HS, "\n", 1, APPEND); > + cspace(&HS, ps, psl, APPEND); > break; > case 'i': > (void)fprintf(outfile, "%s", cp->t); > @@ -171,7 +171,7 @@ redirect: > break; > case 'N': > flush_appends(); > - cspace(&PS, "\n", 1, 0); > + cspace(&PS, "\n", 1, APPEND); > if (!mf_fgets(&PS, 0)) > exit(0); > break; > @@ -256,8 +256,9 @@ redirect: > cp = cp->next; > } /* for all cp */ > > -new: if (!nflag && !pd) > + if (!nflag && !pd) > OUT(); > +new: > flush_appends(); > } /* for all lines */ > } > @@ -266,7 +267,7 @@ new: if (!nflag && !pd) > * TRUE if the address passed matches the current program state > * (lastline, linenumber, ps). > */ > -#define MATCH(a) \ > +#define MATCH(a) \ > (a)->type == AT_RE ? regexec_e((a)->u.r, ps, 0, 1, 0, psl) : \ > (a)->type == AT_LINE ? linenum == (a)->u.l : lastline() > > @@ -361,7 +362,7 @@ substitute(struct s_command *cp) > cspace(&SS, s, match[0].rm_so - le, APPEND); > > /* Skip zero-length matches right after other matches. */ > - if (lastempty || (match[0].rm_so - le) || > + if (lastempty || match[0].rm_so != le || > match[0].rm_so != match[0].rm_eo) { > if (n <= 1) { > /* Want this match: append replacement. */ > @@ -370,7 +371,7 @@ substitute(struct s_command *cp) > n = -1; > } else { > /* Want a later match: append original. */ > - if (match[0].rm_eo - le) > + if (match[0].rm_eo != le) > cspace(&SS, s, match[0].rm_eo - le, > APPEND); > n--; > @@ -418,7 +416,6 @@ substitute(struct s_command *cp) > PS = SS; > psanl = tspace.append_newline; > SS = tspace; > - SS.space = SS.back; > > /* Handle the 'p' flag. */ > if (cp->u.s->p) > @@ -547,13 +544,14 @@ regexec_e(regex_t *preg, const char *string, int eflag > static void > regsub(SPACE *sp, char *string, char *src) > { > - int len, no; > + int no; > + size_t len; > char c, *dst; > > -#define NEEDSP(reqlen) > \ > +#define NEEDSP(reqlen) > \ > if (sp->len + (reqlen) + 1 >= sp->blen) { \ > size_t newlen = sp->blen + (reqlen) + 1024; \ > - sp->space = sp->back = xrealloc(sp->back, newlen); \ > + sp->space = xrealloc(sp->space, newlen); \ > sp->blen = newlen; \ > dst = sp->space + sp->len; \ > } > @@ -572,8 +570,7 @@ regsub(SPACE *sp, char *string, char *src) > NEEDSP(1); > *dst++ = c; > ++sp->len; > - } else if (match[no].rm_so != -1 && match[no].rm_eo != -1) { > - len = match[no].rm_eo - match[no].rm_so; > + } else if ((len = match[no].rm_eo - match[no].rm_so) != 0) { > NEEDSP(len); > memmove(dst, string + match[no].rm_so, len); > dst += len; > @@ -594,18 +591,18 @@ cspace(SPACE *sp, const char *p, size_t len, enum e_sp > { > size_t tlen; > > + if (spflag == REPLACE) > + sp->len = 0; > + > /* Make sure SPACE has enough memory and ramp up quickly. */ > tlen = sp->len + len + 1; > if (tlen > sp->blen) { > size_t newlen = tlen + 1024; > - sp->space = sp->back = xrealloc(sp->back, newlen); > + sp->space = xrealloc(sp->space, newlen); > sp->blen = newlen; > } > > - if (spflag == REPLACE) > - sp->len = 0; > - > - memmove(sp->space + sp->len, p, len); > + memcpy(sp->space + sp->len, p, len); > > sp->space[sp->len += len] = '\0'; > } > > Here is the corresponding list of changes: > > 1. Remove the `back' field from the SPACE structure, which was useless > since always equal to the `space' field. > > 2. Use APPEND in calls to cspace() that previously used 0 directly, > which could be slightly confusing if one did not know the order of the > `e_spflag' enum, and also missed the point of it being defined in the > first place. > > 3. As suggested by the outdated comment preceding its definition, > cspace() was originally only meant to append to, not to replace, the > contents of a SPACE. The calculation of the size for the potentially > new buffer was however never adjusted, which meant that, with input > lines of approximately equal length, the function would allocate roughly > twice as much memory as necessary, half of that memory staying unused. > To fix it, conditionally set `sp->len' to zero before the relevent > assignment, not after. > > 4. Since cspace() never receives arguments that overlap in memory (as > that would make little sense), and considering that it is probably one > of the most frequently called routines, make it use memcpy(3) in lieu of > memmove(3). > > 5. Change conditionals of the form (a - b) to (a != b) for better > readability. > > 6. Use `size_t' instead of `int' for a variable holding a length. > > 7. Slightly improve the logic of regsub() to avoid a bunch of no-op > instructions when `len' is zero. > > 8. Move a goto label previously redirecting into a conditional that was > always false within that context. > > 9. Add missing tabs to align backslashes in macro definitions. > > Kind regards, > > kshe
Hi, One thing: in general, try to limit posts to one diff including explanation, with a proper Subject: line, it makes this easier to track. -Otto