Hello tech@,

Our sed command goes a bit bit bonkers on the y command in a couple of  
different scenarios and was found thanks to the sed.1 diff from kshe.

1) is a posix[0] violation, which states:
If a <backslash> followed by an 'n' appear in string1 or string2, the
two characters shall be handled as a single <newline>.
and
If the delimiter is not 'n', within string1 and string2, the delimiter
itself can be used as a literal character if it is preceded by a
<backslash>.
But our implementation (same as Free, Net, and gsed) replace \n with n
if the delimiter is n.
$ echo a | sed 'ynan\nn'
n
$ # With patch
$ echo a | ./patchedsed 'ynan\nn'


$

2) is a broken delimiter parser:
y needs to take all characters literally (except for <backslash>), but
all the *BSD implementations pass the delimiter parsing through
compile_delimited, which does an extra '[' pair matching, where it
should not do this. The handling of said character is done as expected
once the closing tag is found. gsed does this correct:
$ echo [ | sed 'y/[/a/'
sed: 1: "y/[/a/": unbalanced brackets ([])
$ echo [ | sed 'y/[a]/bcd/'
b
$
$ echo [ | ./patchedsed 'y/[/a/'  
a

3) is inconsistent behaviour between implementations:
According to posix:
The meaning of a <backslash> followed by any character that is not 'n',
a <backslash>, or the delimiter character is undefined.
and
If the number of characters in string1 and string2 are not equal, or if
any of the characters in string1 appear more than once, the results are
undefined.

Right now we generate an error if the strings are of unequal length, but
double characters and unmatched <backslash> treated literal. This also
contradicts our manpage:
Within string1 and string2, a backslash followed by any character other
than a newline is that literal character.

But the code does something different:
$ echo a | sed 'y/ab/\c/' 
\
gsed is a little mostly more accurate to our manpage, but awefully
inconsistent:
$ echo a | gsed 'y/a/\c/'
gsed: -e expression #1, char 7: strings for `y' command are different lengths
$ echo a | gsed 'y/a/\d/'   
d

As for the repeating lines we always choose the last matching character,
where gsed chooses the first:
$ echo a | sed 'y/aa/bc/'
c
$ echo a | gsed 'y/aa/bc/'
b

To make portability issues more obvious I've chosen to turn these undefined
scenarios into an error to make these non-obvious issues more visible:
$ echo a | ./patchedsed 'y/ab/\c/'
sed: 1: "y/ab/\c/": Unexpected character after backslash
$ echo a | ./patchedsed 'y/aa/bc/'
sed: 1: "y/aa/bc/": Repeated character in source string

Since this changes functional behaviour and can cause backwards
incompatible changes for people with scripts that live in this unspecified
area: I would like to know if someone is against this change.

The manpage bits were helped by jmc@, but this code makes his head spin,
so extra eyes on that part are welcome as well.

OK?

Sincerely,

martijn@

[0]http://pubs.opengroup.org/onlinepubs/9699919799/utilities/sed.html

Index: compile.c
===================================================================
RCS file: /cvs/src/usr.bin/sed/compile.c,v
retrieving revision 1.42
diff -u -r1.42 compile.c
--- compile.c   1 Aug 2017 18:05:53 -0000       1.42
+++ compile.c   7 Dec 2017 14:57:04 -0000
@@ -617,46 +617,62 @@
  * Compile a translation set of strings into a lookup table.
  */
 static char *
-compile_tr(char *p, char **transtab)
+compile_tr(char *old, char **transtab)
 {
        int i;
-       char *lt, *op, *np;
-       char *old = NULL, *new = NULL;
+       char delimiter, check[UCHAR_MAX];
+       char *new, *end;
 
-       if (*p == '\0' || *p == '\\')
-               error(COMPILE,
-"transform pattern can not be delimited by newline or backslash");
-       old = xmalloc(strlen(p) + 1);
-       p = compile_delimited(p, old, 1);
-       if (p == NULL) {
-               error(COMPILE, "unterminated transform source string");
-               goto bad;
-       }
-       new = xmalloc(strlen(p) + 1);
-       p = compile_delimited(--p, new, 1);
-       if (p == NULL) {
-               error(COMPILE, "unterminated transform target string");
-               goto bad;
+       bzero(check, sizeof(check));
+       delimiter = *old++;
+       if (delimiter == '\0' || delimiter == '\\')
+               error(COMPILE, "transform pattern can not be delimited by "
+                   "newline or backslash");
+
+       new = old;
+       do {
+               if ((new = strchr(new + 1, delimiter)) == NULL)
+                       error(COMPILE, "unterminated transform source string");
+       } while (*(new - 1) == '\\' && *(new -2) != '\\');
+       *new++ = '\0';
+       end = new;
+       do {
+               if ((end = strchr(end + 1, delimiter)) == NULL)
+                       error(COMPILE, "unterminated transform target string");
+       } while (*(end -1) == '\\' && *(end -2) != '\\');
+       *end = '\0';
+
+       // We assume characters are 8 bits
+       *transtab = xmalloc(UCHAR_MAX + 1);
+       for (i = 0; i <= UCHAR_MAX; i++)
+               (*transtab)[i] = (char)i;
+
+       while (*old != '\0' && *new != '\0') {
+               if (*old == '\\') {
+                       old++;
+                       if (*old == 'n')
+                               *old = '\n';
+                       else if (*old != delimiter && *old != '\\')
+                               error(COMPILE, "Unexpected character after "
+                                   "backslash");
+                       
+               }
+               if (*new == '\\') {
+                       new++;
+                       if (*new == 'n')
+                               *new = '\n';
+                       else if (*new != delimiter && *new != '\\')
+                               error(COMPILE, "Unexpected character after "
+                                   "backslash");
+               }
+               if (check[*old] == 1)
+                       error(COMPILE, "Repeated character in source string");
+               check[*old] = 1;
+               (*transtab)[*old++] = *new++;
        }
-       EATSPACE();
-       if (strlen(new) != strlen(old)) {
+       if (*old != '\0' || *new != '\0')
                error(COMPILE, "transform strings are not the same length");
-               goto bad;
-       }
-       /* We assume characters are 8 bits */
-       lt = xmalloc(UCHAR_MAX + 1);
-       for (i = 0; i <= UCHAR_MAX; i++)
-               lt[i] = (char)i;
-       for (op = old, np = new; *op; op++, np++)
-               lt[(u_char)*op] = *np;
-       *transtab = lt;
-       free(old);
-       free(new);
-       return (p);
-bad:
-       free(old);
-       free(new);
-       return (NULL);
+       return end + 1;
 }
 
 /*
Index: sed.1
===================================================================
RCS file: /cvs/src/usr.bin/sed/sed.1,v
retrieving revision 1.51
diff -u -r1.51 sed.1
--- sed.1       7 Dec 2017 09:52:26 -0000       1.51
+++ sed.1       7 Dec 2017 14:57:04 -0000
@@ -482,14 +482,29 @@
 .Ar string2 .
 Any character other than a backslash or newline can be used instead of
 a slash to delimit the strings.
+.Pp
 Within
 .Ar string1
 and
 .Ar string2 ,
-a backslash followed by any character other than a newline is that literal
-character, and a backslash followed by an
+a backslash followed by another backslash
+is replaced by a single backslash,
+a backslash followed by an
 .Sq n
-is replaced by a newline character.
+is replaced by a newline character,
+and a backslash followed by the delimiting character
+is replaced by that character,
+causing it to be treated literally,
+with the exception of the
+.Sq n
+character,
+which will still be treated like a newline character.
+It is an error for a backslash to not be followed by another backslash,
+.Sq n ,
+or the delimiting character,
+or for
+.Ar string1
+to contain repeating characters,
 .It [0addr] Ns Ic \&: Ns Ar label
 This function does nothing; it bears a
 .Ar label

Reply via email to