Hello tech@, Our sed command goes a bit bit bonkers on the y command in a couple of different scenarios and was found thanks to the sed.1 diff from kshe.
1) is a posix[0] violation, which states: If a <backslash> followed by an 'n' appear in string1 or string2, the two characters shall be handled as a single <newline>. and If the delimiter is not 'n', within string1 and string2, the delimiter itself can be used as a literal character if it is preceded by a <backslash>. But our implementation (same as Free, Net, and gsed) replace \n with n if the delimiter is n. $ echo a | sed 'ynan\nn' n $ # With patch $ echo a | ./patchedsed 'ynan\nn' $ 2) is a broken delimiter parser: y needs to take all characters literally (except for <backslash>), but all the *BSD implementations pass the delimiter parsing through compile_delimited, which does an extra '[' pair matching, where it should not do this. The handling of said character is done as expected once the closing tag is found. gsed does this correct: $ echo [ | sed 'y/[/a/' sed: 1: "y/[/a/": unbalanced brackets ([]) $ echo [ | sed 'y/[a]/bcd/' b $ $ echo [ | ./patchedsed 'y/[/a/' a 3) is inconsistent behaviour between implementations: According to posix: The meaning of a <backslash> followed by any character that is not 'n', a <backslash>, or the delimiter character is undefined. and If the number of characters in string1 and string2 are not equal, or if any of the characters in string1 appear more than once, the results are undefined. Right now we generate an error if the strings are of unequal length, but double characters and unmatched <backslash> treated literal. This also contradicts our manpage: Within string1 and string2, a backslash followed by any character other than a newline is that literal character. But the code does something different: $ echo a | sed 'y/ab/\c/' \ gsed is a little mostly more accurate to our manpage, but awefully inconsistent: $ echo a | gsed 'y/a/\c/' gsed: -e expression #1, char 7: strings for `y' command are different lengths $ echo a | gsed 'y/a/\d/' d As for the repeating lines we always choose the last matching character, where gsed chooses the first: $ echo a | sed 'y/aa/bc/' c $ echo a | gsed 'y/aa/bc/' b To make portability issues more obvious I've chosen to turn these undefined scenarios into an error to make these non-obvious issues more visible: $ echo a | ./patchedsed 'y/ab/\c/' sed: 1: "y/ab/\c/": Unexpected character after backslash $ echo a | ./patchedsed 'y/aa/bc/' sed: 1: "y/aa/bc/": Repeated character in source string Since this changes functional behaviour and can cause backwards incompatible changes for people with scripts that live in this unspecified area: I would like to know if someone is against this change. The manpage bits were helped by jmc@, but this code makes his head spin, so extra eyes on that part are welcome as well. OK? Sincerely, martijn@ [0]http://pubs.opengroup.org/onlinepubs/9699919799/utilities/sed.html Index: compile.c =================================================================== RCS file: /cvs/src/usr.bin/sed/compile.c,v retrieving revision 1.42 diff -u -r1.42 compile.c --- compile.c 1 Aug 2017 18:05:53 -0000 1.42 +++ compile.c 7 Dec 2017 14:57:04 -0000 @@ -617,46 +617,62 @@ * Compile a translation set of strings into a lookup table. */ static char * -compile_tr(char *p, char **transtab) +compile_tr(char *old, char **transtab) { int i; - char *lt, *op, *np; - char *old = NULL, *new = NULL; + char delimiter, check[UCHAR_MAX]; + char *new, *end; - if (*p == '\0' || *p == '\\') - error(COMPILE, -"transform pattern can not be delimited by newline or backslash"); - old = xmalloc(strlen(p) + 1); - p = compile_delimited(p, old, 1); - if (p == NULL) { - error(COMPILE, "unterminated transform source string"); - goto bad; - } - new = xmalloc(strlen(p) + 1); - p = compile_delimited(--p, new, 1); - if (p == NULL) { - error(COMPILE, "unterminated transform target string"); - goto bad; + bzero(check, sizeof(check)); + delimiter = *old++; + if (delimiter == '\0' || delimiter == '\\') + error(COMPILE, "transform pattern can not be delimited by " + "newline or backslash"); + + new = old; + do { + if ((new = strchr(new + 1, delimiter)) == NULL) + error(COMPILE, "unterminated transform source string"); + } while (*(new - 1) == '\\' && *(new -2) != '\\'); + *new++ = '\0'; + end = new; + do { + if ((end = strchr(end + 1, delimiter)) == NULL) + error(COMPILE, "unterminated transform target string"); + } while (*(end -1) == '\\' && *(end -2) != '\\'); + *end = '\0'; + + // We assume characters are 8 bits + *transtab = xmalloc(UCHAR_MAX + 1); + for (i = 0; i <= UCHAR_MAX; i++) + (*transtab)[i] = (char)i; + + while (*old != '\0' && *new != '\0') { + if (*old == '\\') { + old++; + if (*old == 'n') + *old = '\n'; + else if (*old != delimiter && *old != '\\') + error(COMPILE, "Unexpected character after " + "backslash"); + + } + if (*new == '\\') { + new++; + if (*new == 'n') + *new = '\n'; + else if (*new != delimiter && *new != '\\') + error(COMPILE, "Unexpected character after " + "backslash"); + } + if (check[*old] == 1) + error(COMPILE, "Repeated character in source string"); + check[*old] = 1; + (*transtab)[*old++] = *new++; } - EATSPACE(); - if (strlen(new) != strlen(old)) { + if (*old != '\0' || *new != '\0') error(COMPILE, "transform strings are not the same length"); - goto bad; - } - /* We assume characters are 8 bits */ - lt = xmalloc(UCHAR_MAX + 1); - for (i = 0; i <= UCHAR_MAX; i++) - lt[i] = (char)i; - for (op = old, np = new; *op; op++, np++) - lt[(u_char)*op] = *np; - *transtab = lt; - free(old); - free(new); - return (p); -bad: - free(old); - free(new); - return (NULL); + return end + 1; } /* Index: sed.1 =================================================================== RCS file: /cvs/src/usr.bin/sed/sed.1,v retrieving revision 1.51 diff -u -r1.51 sed.1 --- sed.1 7 Dec 2017 09:52:26 -0000 1.51 +++ sed.1 7 Dec 2017 14:57:04 -0000 @@ -482,14 +482,29 @@ .Ar string2 . Any character other than a backslash or newline can be used instead of a slash to delimit the strings. +.Pp Within .Ar string1 and .Ar string2 , -a backslash followed by any character other than a newline is that literal -character, and a backslash followed by an +a backslash followed by another backslash +is replaced by a single backslash, +a backslash followed by an .Sq n -is replaced by a newline character. +is replaced by a newline character, +and a backslash followed by the delimiting character +is replaced by that character, +causing it to be treated literally, +with the exception of the +.Sq n +character, +which will still be treated like a newline character. +It is an error for a backslash to not be followed by another backslash, +.Sq n , +or the delimiting character, +or for +.Ar string1 +to contain repeating characters, .It [0addr] Ns Ic \&: Ns Ar label This function does nothing; it bears a .Ar label
