On 12/12/17 15:02, kshe wrote:
> On Tue, 12 Dec 2017 12:44:03 +0000, Todd C. Miller wrote:
>> On Tue, 12 Dec 2017 11:57:58 +0000, kshe wrote:
>>
>>> Perhaps the worst part of all this, though, is how the change of
>>> behaviour, which made sed fail hard where it previously handled input in
>>> a perfectly defined and reasonable way, was apparently approved because
>>> "implementations do vary in how they handle [it], so throwing an error
>>> is probably best". Following the same kind of reasoning, I think
>>> OpenBSD should also modify the `echo' command to fail if given an
>>> argument like `-E', as its behaviour in that case differs from system to
>>> system, hence the current implementation is likewise "just creating a
>>> trap for the user", and surely this is unacceptable and therefore ought
>>> to be fixed, right?
>>
>> It's not really the same situation. The question is what to do in
>> cases where POSIX leaves the behavior undefined and where there is
>> no consensus among implementations. Is it best for sed to be as
>> consistent as possible, knowing that other commonly used implementations
>> will produce different results, or is it better to produce an error
>> for the unwary user? This is not just theoretical, we run into
>> issues all the time with scripts that "work fine" on Linux but fail
>> in odd ways on other systems that don't use the GNU utilities.
>>
>> Most users will have no way to determine the source of the problem.
>> At least if the undefined behavior results in an error they have
>> something to go on.
>
> Trying to prevent the unwary user from being unwary is a noble but
> impossible task to accomplish. There are so many ways to introduce
> non-portability in shell scripts that replacing historical behaviour by
> hard failures in an attempt to improve this situation is likely to be
> counterproductive, especially when such attempt goes against the
> internal consistency of the affected commands.
>
> Regards,
>
> kshe
>
Funny that you use echo as an example here. POSIX itself says[0]:
  "The echo utility has not been made obsolescent because of its extremely
  widespread use in historical applications."
and
  "New applications are encouraged to use printf instead of echo."
and
  "It is not possible to use echo portably across all POSIX systems unless
  both -n (as the first argument) and escape sequences are omitted."
So even POSIX says that this tool should be avoided and is only there for
backwards compatibility.
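For what it's worth, the printf replacement POSIX recommends is short. A
minimal sketch, assuming a POSIX sh; the helper names say and say_n are made
up for this mail and are not part of any standard:

```sh
# say: print the arguments followed by a newline, like echo, but without
# treating any argument as an option or expanding backslash escapes.
# (say and say_n are hypothetical helper names.)
say() {
	printf '%s\n' "$*"
}

# say_n: the same without the trailing newline, replacing echo -n.
say_n() {
	printf '%s' "$*"
}

say -E          # prints: -E
say 'a\nb'      # prints: a\nb
say_n -n; say   # prints: -n
```

Because the format string is always the first argument, nothing the caller
passes can be mistaken for an option or an escape sequence.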
As for the discrepancy in how \n is handled between the y and s commands,
this is how POSIX actually specifies things[1]:
  s: "Within the BRE and the replacement, the BRE delimiter itself can be
  used as a literal character if it is preceded by a <backslash>."
  s: "The meaning of an unescaped <backslash> immediately followed by any
  character other than '&', <backslash>, a digit, <newline>, or the
  delimiter character used for this command, is unspecified."
  y: "If a <backslash> followed by an 'n' appear in string1 or string2, the
  two characters shall be handled as a single <newline>."
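In practice that leaves two specified ways to put a newline into the output.
A small sketch, assuming a POSIX sed; split_y and split_s are illustrative
names only:

```sh
# split_y: turn commas into newlines with y, where \n is specified to
# mean a single newline. (split_y and split_s are hypothetical names.)
split_y() {
	sed 'y/,/\n/'
}

# split_s: the same with s, using a backslash followed by a real newline
# in the replacement, which is one of the escapes POSIX does specify.
split_s() {
	sed 's/,/\
/g'
}

printf 'a,b\n' | split_y
printf 'a,b\n' | split_s
```

Both print a and b on separate lines; neither depends on what an
implementation does with \n in the replacement of s.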
As for "a perfectly defined and reasonable way": POSIX leaves this
behaviour unspecified, and our manual never said anything about what
happens if the backslash is used anywhere else, so it's not defined.
Whether it is reasonable is also debatable, because its behaviour
actually changes on the BRE side if you use n as a separator, and gsed
has an identical quirk on the replacement side:
$ printf '\n\n' | sed 'N;s/\n/a/'
a
$ printf '\n\n' | gsed 'N;s/\n/a/'
a
$ printf '\n\n' | sed 'N;sn\nnan'
$ printf '\n\n' | gsed 'N;sn\nnan'
a
$ printf 'a\n' | sed 's/a/\n/'
n
$ printf 'a\n' | gsed 's/a/\n/'
$ printf 'a\n' | sed 'snan\nn'
n
$ printf 'a\n' | gsed 'snan\nn'
n
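To check which of these behaviours a given sed has, something like the
following can be used. This is a rough sketch only: probe_sed and its labels
are names made up here, and it covers just the s/a/\n/ case from above:

```sh
# probe_sed CMD: classify how CMD treats \n in the replacement of s.
# CMD is any sed-like command name, e.g. sed or gsed (hypothetical helper).
probe_sed() {
	if out=$(printf 'a\n' | "$1" 's/a/\n/' 2>/dev/null); then
		# Command substitution strips trailing newlines, so a
		# replacement by a newline leaves $out empty.
		case $out in
		n)  printf '%s\n' 'literal n' ;;
		'') printf '%s\n' 'newline' ;;
		*)  printf '%s\n' 'something else' ;;
		esac
	else
		printf '%s\n' 'error'
	fi
}

probe_sed sed
```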
Maybe we should also pull the s command a little closer to what POSIX
states (see the defined behaviour above) and do something similar to y.
Do note that:
$ printf '\n\n' | ./sed 'N;s/\n/a/'
$ printf 'n\n' | ./sed 's/\n/a/'
a
That is because the \n now becomes part of the regex, and re_format(7)
*does* define it:
\c Any backslash-escaped character c, except for ‘{’, ‘}’, ‘(’, and
‘)’, matches itself.
Patch to do so below. Not asking for OKs (yet), since compile_re is used
in more places and this was whipped up in about a minute.
martijn@
[0] http://pubs.opengroup.org/onlinepubs/009695399/utilities/echo.html
[1] http://pubs.opengroup.org/onlinepubs/9699919799/utilities/sed.html
Index: compile.c
===================================================================
RCS file: /cvs/src/usr.bin/sed/compile.c,v
retrieving revision 1.44
diff -u -r1.44 compile.c
--- compile.c 11 Dec 2017 13:25:57 -0000 1.44
+++ compile.c 12 Dec 2017 15:42:55 -0000
@@ -58,7 +58,6 @@
} *labels[LHSZ];
static char *compile_addr(char *, struct s_addr *);
-static char *compile_ccl(char **, char *);
static char *compile_delimited(char *, char *);
static char *compile_flags(char *, struct s_subst *);
static char *compile_re(char *, regex_t **);
@@ -353,31 +352,17 @@
static char *
compile_delimited(char *p, char *d)
{
- char c;
+ char delimiter;
- c = *p++;
- if (c == '\0')
- return (NULL);
- else if (c == '\\')
+ delimiter = *p++;
+ if (delimiter == '\\')
error(COMPILE, "\\ can not be used as a string delimiter");
- else if (c == '\n')
+ else if (delimiter == '\n' || delimiter == '\0')
error(COMPILE, "newline can not be used as a string delimiter");
while (*p) {
- if (*p == '[' && *p != c) {
- if ((d = compile_ccl(&p, d)) == NULL)
- error(COMPILE, "unbalanced brackets ([])");
- continue;
- } else if (*p == '\\' && p[1] == '[') {
- *d++ = *p++;
- } else if (*p == '\\' && p[1] == c) {
- p++;
- } else if (*p == '\\' && p[1] == 'n') {
- *d++ = '\n';
- p += 2;
- continue;
- } else if (*p == '\\' && p[1] == '\\') {
- *d++ = *p++;
- } else if (*p == c) {
+ if (*p == '\\' && p[1] == delimiter) {
+ p++;
+ } else if (*p == delimiter) {
*d = '\0';
return (p + 1);
}
@@ -387,36 +372,6 @@
}
-/* compile_ccl: expand a POSIX character class */
-static char *
-compile_ccl(char **sp, char *t)
-{
- int c, d;
- char *s = *sp;
-
- *t++ = *s++;
- if (*s == '^')
- *t++ = *s++;
- if (*s == ']')
- *t++ = *s++;
- for (; *s && (*t = *s) != ']'; s++, t++)
- if (*s == '[' && ((d = *(s+1)) == '.' || d == ':' || d == '=')) {
- *++t = *++s, t++, s++;
- for (c = *s; (*t = *s) != ']' || c != d; s++, t++)
- if ((c = *s) == '\0')
- return NULL;
- } else if (*s == '\\' && s[1] == 'n') {
- *t = '\n';
- s++;
- }
- if (*s == ']') {
- *sp = ++s;
- return (++t);
- } else {
- return (NULL);
- }
-}
-
/*
* Get a regular expression. P points to the delimiter of the regular
* expression; repp points to the address of a regexp pointer. Newline
@@ -513,6 +468,10 @@
s->maxbref = ref;
} else if (*p == '&' || *p == '\\')
*sp++ = '\\';
+ else if (*p != '\n' && *p != c) {
+ error(COMPILE,
+"Unexpected character after backslash");
+ }
} else if (*p == c) {
p++;
*sp++ = '\0';