So, finally, here is the update of my patch to fix zero-length
matches in sed(1). Sorry for the ridiculous delay.
Changes with respect to the first version:
- Removed the now unused local variable re_off (noticed by otto@).
- Make slen signed, or zero-length matches at EOF without a trailing
newline would cause it to underflow (suspected by otto@ after
looking at FreeBSD code, suspicion confirmed myself).
- Advancing after zero-length matches was not quite right:
A zero-length match before the final character of a file caused the
zero-length match after the final character to get lost if the file
ended without a terminating newline (noticed myself by changing
echo "$in" to echo -n "$in" in substitute.sh of our own test suite);
fixed by reworking the end of the do...while loop, which also
makes it easier to understand.
This now passes our own test suite both as it is and with -n,
and otto@ checked that the remaining failures with respect to the
FreeBSD test suite look unrelated to this particular patch.
I'm leaving the original rationale at the end, for reference.
Index: process.c
===================================================================
RCS file: /cvs/src/usr.bin/sed/process.c,v
retrieving revision 1.15
diff -u -p -r1.15 process.c
--- process.c 27 Oct 2009 23:59:43 -0000 1.15
+++ process.c 22 Jul 2011 00:11:49 -0000
@@ -312,7 +312,7 @@ substitute(struct s_command *cp)
{
SPACE tspace;
regex_t *re;
- size_t re_off, slen;
+ regoff_t slen;
int n, lastempty;
char *s;
@@ -333,60 +333,54 @@ substitute(struct s_command *cp)
n = cp->u.s->n;
lastempty = 1;
- switch (n) {
- case 0: /* Global */
- do {
- if (lastempty || match[0].rm_so != match[0].rm_eo) {
- /* Locate start of replaced string. */
- re_off = match[0].rm_so;
- /* Copy leading retained string. */
- cspace(&SS, s, re_off, APPEND);
- /* Add in regular expression. */
- regsub(&SS, s, cp->u.s->new);
- }
+ do {
+ /* Copy the leading retained string. */
+ if (n <= 1 && match[0].rm_so)
+ cspace(&SS, s, match[0].rm_so, APPEND);
- /* Move past this match. */
- if (match[0].rm_so != match[0].rm_eo) {
- s += match[0].rm_eo;
- slen -= match[0].rm_eo;
- lastempty = 0;
+ /* Skip zero-length matches right after other matches. */
+ if (lastempty || match[0].rm_so ||
+ match[0].rm_so != match[0].rm_eo) {
+ if (n <= 1) {
+ /* Want this match: append replacement. */
+ regsub(&SS, s, cp->u.s->new);
+ if (n == 1)
+ n = -1;
} else {
- if (match[0].rm_so == 0)
- cspace(&SS, s, match[0].rm_so + 1,
- APPEND);
- else
- cspace(&SS, s + match[0].rm_so, 1,
- APPEND);
- s += match[0].rm_so + 1;
- slen -= match[0].rm_so + 1;
- lastempty = 1;
+ /* Want a later match: append original. */
+ if (match[0].rm_eo)
+ cspace(&SS, s, match[0].rm_eo, APPEND);
+ n--;
}
- } while (slen > 0 && regexec_e(re, s, REG_NOTBOL, 0, slen));
- /* Copy trailing retained string. */
- if (slen > 0)
- cspace(&SS, s, slen, APPEND);
- break;
- default: /* Nth occurrence */
- while (--n) {
- s += match[0].rm_eo;
- slen -= match[0].rm_eo;
- if (!regexec_e(re, s, REG_NOTBOL, 0, slen))
- return (0);
}
- /* FALLTHROUGH */
- case 1: /* 1st occurrence */
- /* Locate start of replaced string. */
- re_off = match[0].rm_so + (s - ps);
- /* Copy leading retained string. */
- cspace(&SS, ps, re_off, APPEND);
- /* Add in regular expression. */
- regsub(&SS, s, cp->u.s->new);
- /* Copy trailing retained string. */
+
+ /* Move past this match. */
s += match[0].rm_eo;
slen -= match[0].rm_eo;
+
+ /*
+ * After a zero-length match, advance one byte,
+ * and at the end of the line, terminate.
+ */
+ if (match[0].rm_so == match[0].rm_eo) {
+ if (*s == '\n')
+ slen = -1;
+ else
+ slen--;
+ cspace(&SS, s++, 1, APPEND);
+ lastempty = 1;
+ } else
+ lastempty = 0;
+
+ } while (n >= 0 && slen >= 0 && regexec_e(re, s, REG_NOTBOL, 0, slen));
+
+ /* Did not find the requested number of matches. */
+ if (n > 1)
+ return (0);
+
+ /* Copy the trailing retained string. */
+ if (slen > 0)
cspace(&SS, s, slen, APPEND);
- break;
- }
/*
* Swap the substitute space and the pattern space, and make sure
----- 8< ----- schnipp ----- >8 ----- 8< ----- schnapp ----- >8 -----
Otto Moerbeek wrote on Fri, Jun 24, 2011 at 09:41:00PM +0200:
> On Sat, Jun 18, 2011 at 08:16:05PM -0600, Ingo Schwarze wrote:
>> When a regular expression has zero-length matches in a string,
>> both sed(1) global replacement (/g) and replacement of numbered
>> instances (e.g. /2) are broken. This is not even limited to sed -E.
>> Both Otto's patch and my own refactoring patch on misc@ only address
>> global replacement and leave numbered instances broken.
>>
>> Already now, code in the three parts of the switch statement
>> is rather repetitive, and by fixing the numbered instances case,
>> code duplication would become much worse. Thus, i propose
>> the following full rewrite of the whole switch statement.
>>
>> This is a bit scary because...
>> Well, sed(1) is not exactly the tool we want to break.
>> Hence, a test suite is included.
>> Both my patch and GNU sed(1) pass the test suite.
>>
>> Our -current sed fails 43 of these tests, quite a few of them
>> in spectacular ways, like
>>
>> $ echo | sed 's/$/x/2' # expect ""
>> x
>> $ echo aaa | sed 's/a*/x/2' # expect "aaa"
>> aaax
>> $ echo abc | sed -E 's/a|$/x/g' # expect "xbcx"
>> x
>> $ echo abc | sed -E 's/()/x/2' # expect "axbc"
>> xabc
>> $ echo abc | sed -E 's/()/x/42' # expect "abc"
>> xabc
>>
>> One common source of confusion on misc@ was that Perl allows
>> empty matches right after other matches, like in:
>>
>> $ perl -Mstrict -we '$_ = "abc"; s/b|()/x/g; print "$_\n";'
>> xaxxcx
>> $ perl -Mstrict -we '$_ = "a"; s/^|a|$/x/g; print "$_\n";'
>> xxx
>>
>> I consider that broken behaviour in Perl. For example, think about
>> the case of "a" =~ /^|a/. The branch /^/ matches at 0, length 0.
>> The branch /a/ matches at 0, length 1. So by greediness, the latter
>> ought to prevail. But then, we have already consumed the character,
>> and there is no way to get a second match. Hence,
>>
>> $ echo abc | sed -E 's/b|()/x/g'
>> xaxcx
>> $ echo a | sed -E 's/^|a|$/x/g'
>> x
>>
>> Comments?
>> Ingo
> Couple of comments;
>
> - re_off is no longer used.
>
> - slen is unsigned. We need to be 100% sure it cannot wrap. ATM I'm
> not completely convinced the empty match case is safe. If you look at
> the FreeBSD code, it appears the case match[0].rm_eo == slen can
> happen, and in that case slen will wrap. They fixed it by making slen
> a signed type (regoff_t). See rev 1.30 and 1.31 in their tree.
>
> When I run the regress test suite from FreeBSD (in
> src/tools/regression/usr.bin/sed) I see a couple of failures with our
> sed. I did not have time yet to look if these failures have anything
> with this diff and if we should incoorporate them.