Re: sed: Fix up y command

Todd C. Miller Fri, 08 Dec 2017 13:44:58 -0800

On Thu, 07 Dec 2017 15:59:16 +0100, Martijn van Duren wrote:

> Our sed command goes a bit bit bonkers on the y command in a couple of  
> different scenarios and was found thanks to the sed.1 diff from kshe.
> 
> 1) is a posix[0] violation, which states:
> If a <backslash> followed by an 'n' appear in string1 or string2, the
> two characters shall be handled as a single <newline>.
> and
> If the delimiter is not 'n', within string1 and string2, the delimiter
> itself can be used as a literal character if it is preceded by a
> <backslash>.
> But our implementation (same as Free, Net, and gsed) replace \n with n
> if the delimiter is n.


That is clearly a bug.

> 2) is a broken delimiter parser:
> y needs to take all characters literally (except for <backslash>), but
> all the *BSD implementations pass the delimiter parsing through
> compile_delimited, which does an extra '[' pair matching, where it
> should not do this. The handling of said character is done as expected
> once the closing tag is found. gsed does this correct:
> $ echo [ | sed 'y/[/a/'
> sed: 1: "y/[/a/": unbalanced brackets ([])
> $ echo [ | sed 'y/[a]/bcd/'
> b
> $
> $ echo [ | ./patchedsed 'y/[/a/'  
> a

Same.

> 3) is inconsistent behaviour between implementations:
> According to posix:
> The meaning of a <backslash> followed by any character that is not 'n',
> a <backslash>, or the delimiter character is undefined.
> and
> If the number of characters in string1 and string2 are not equal, or if
> any of the characters in string1 appear more than once, the results are
> undefined.
> 
> Right now we generate an error if the strings are of unequal length, but
> double characters and unmatched <backslash> treated literal. This also
> contradicts our manpage:
> Within string1 and string2, a backslash followed by any character other
> than a newline is that literal character.
> 
> But the code does something different:
> $ echo a | sed 'y/ab/\c/' 
> \
> gsed is a little mostly more accurate to our manpage, but awefully
> inconsistent:
> $ echo a | gsed 'y/a/\c/'
> gsed: -e expression #1, char 7: strings for `y' command are different lengths
> $ echo a | gsed 'y/a/\d/'   
> d

Here I would naively expect the escaped character to be treated as
if no escape was present.  This is what the Solaris sed does.

> As for the repeating lines we always choose the last matching character,
> where gsed chooses the first:
> $ echo a | sed 'y/aa/bc/'
> c
> $ echo a | gsed 'y/aa/bc/'
> b

Solaris sed also chooses the last match.

> To make portability issues more obvious I've chosen to turn these undefined
> scenarios into an error to make these non-obvious issues more visible:
> $ echo a | ./patchedsed 'y/ab/\c/'
> sed: 1: "y/ab/\c/": Unexpected character after backslash
> $ echo a | ./patchedsed 'y/aa/bc/'
> sed: 1: "y/aa/bc/": Repeated character in source string

I'm a bit leary of changing these into errors.  However, these are
undefined behaviors and implementations do vary in how they handle
them, so throwing an error is probably best.  Otherwise we are just
creating a trap for the user.

> Since this changes functional behaviour and can cause backwards
> incompatible changes for people with scripts that live in this unspecified
> area: I would like to know if someone is against this change.
> 
> The manpage bits were helped by jmc@, but this code makes his head spin,
> so extra eyes on that part are welcome as well.
> 
> OK?
> 
> Sincerely,
> 
> martijn@
> 
> [0]http://pubs.opengroup.org/onlinepubs/9699919799/utilities/sed.html
> 
> Index: compile.c
> ===================================================================
> RCS file: /cvs/src/usr.bin/sed/compile.c,v
> retrieving revision 1.42
> diff -u -r1.42 compile.c
> --- compile.c 1 Aug 2017 18:05:53 -0000       1.42
> +++ compile.c 7 Dec 2017 14:57:04 -0000
> @@ -617,46 +617,62 @@
>   * Compile a translation set of strings into a lookup table.
>   */
>  static char *
> -compile_tr(char *p, char **transtab)
> +compile_tr(char *old, char **transtab)
>  {
>       int i;
> -     char *lt, *op, *np;
> -     char *old = NULL, *new = NULL;
> +     char delimiter, check[UCHAR_MAX];
> +     char *new, *end;
>  
> -     if (*p == '\0' || *p == '\\')
> -             error(COMPILE,
> -"transform pattern can not be delimited by newline or backslash");
> -     old = xmalloc(strlen(p) + 1);
> -     p = compile_delimited(p, old, 1);
> -     if (p == NULL) {
> -             error(COMPILE, "unterminated transform source string");
> -             goto bad;
> -     }
> -     new = xmalloc(strlen(p) + 1);
> -     p = compile_delimited(--p, new, 1);
> -     if (p == NULL) {
> -             error(COMPILE, "unterminated transform target string");
> -             goto bad;
> +     bzero(check, sizeof(check));

I'd prefer memset over bzero here.

> +     delimiter = *old++;
> +     if (delimiter == '\0' || delimiter == '\\')
> +             error(COMPILE, "transform pattern can not be delimited by "
> +                 "newline or backslash");
> +
> +     new = old;
> +     do {
> +             if ((new = strchr(new + 1, delimiter)) == NULL)
> +                     error(COMPILE, "unterminated transform source string");
> +     } while (*(new - 1) == '\\' && *(new -2) != '\\');

Given "y/", new + 1 will point outside the string (one past the
NUL).  Perhaps add an explicit check for *new == '\0' or just enforce
a minimum length of 3 characters?

> +     *new++ = '\0';
> +     end = new;
> +     do {
> +             if ((end = strchr(end + 1, delimiter)) == NULL)
> +                     error(COMPILE, "unterminated transform target string");
> +     } while (*(end -1) == '\\' && *(end -2) != '\\');
> +     *end = '\0';

Similar problem here.

> +
> +     // We assume characters are 8 bits
> +     *transtab = xmalloc(UCHAR_MAX + 1);
> +     for (i = 0; i <= UCHAR_MAX; i++)
> +             (*transtab)[i] = (char)i;

Can we preserve the old-style C comment for consistency?

> +
> +     while (*old != '\0' && *new != '\0') {
> +             if (*old == '\\') {
> +                     old++;
> +                     if (*old == 'n')
> +                             *old = '\n';
> +                     else if (*old != delimiter && *old != '\\')
> +                             error(COMPILE, "Unexpected character after "
> +                                 "backslash");
> +                     
> +             }
> +             if (*new == '\\') {
> +                     new++;
> +                     if (*new == 'n')
> +                             *new = '\n';
> +                     else if (*new != delimiter && *new != '\\')
> +                             error(COMPILE, "Unexpected character after "
> +                                 "backslash");
> +             }
> +             if (check[*old] == 1)
> +                     error(COMPILE, "Repeated character in source string");
> +             check[*old] = 1;
> +             (*transtab)[*old++] = *new++;
>       }
> -     EATSPACE();
> -     if (strlen(new) != strlen(old)) {
> +     if (*old != '\0' || *new != '\0')
>               error(COMPILE, "transform strings are not the same length");
> -             goto bad;
> -     }
> -     /* We assume characters are 8 bits */
> -     lt = xmalloc(UCHAR_MAX + 1);
> -     for (i = 0; i <= UCHAR_MAX; i++)
> -             lt[i] = (char)i;
> -     for (op = old, np = new; *op; op++, np++)
> -             lt[(u_char)*op] = *np;
> -     *transtab = lt;
> -     free(old);
> -     free(new);
> -     return (p);
> -bad:
> -     free(old);
> -     free(new);
> -     return (NULL);
> +     return end + 1;
>  }

I can't see any problems with the rest.  All errors are fatal so
there's no need to return NULL (and returning NULL would have
resulted in a crash anyway).

When this goes in you can also remove the "is_tr" function argument
from compile_delimited() as it is now unused.

The manual change looks good as well.

 - todd

Re: sed: Fix up y command

Reply via email to