Hello tech@,
Here's another shot at fixing the sed/regex bug. This time I've split it
up into a libc/regex part and a sed part. The sed patch is still the
same, but I'll resend it after the regex changes are in the tree.[1]
I've had a lot of help from schwarze@ in testing the consequences of
this patch and comparing it to other regex implementations.
He and I both agree that this is the way forward.
The patch itself changes the behavior when REG_STARTEND is combined
with REG_NOTBOL. Currently, when the two are combined,
"string + pmatch[0].rm_so" is used as the start of the string and
REG_NOTBOL is applied exactly as it would be for a string without an
offset. This patch instead treats pmatch[0].rm_so as a continuation
offset into string, which allows a proper assessment of the character at
that offset with regard to its line or word position ('^' or '\<'),
without risking a read from unallocated memory.
This change also fixes a feature^wbug in vi. Consider the string " a b"
where the cursor is at position 0. When searching for '\<' 'b' will be
found first, since 'a' will be tested first with REG_NOTBOL, but won't
match, because of the current behavior of REG_NOTBOL. After my patch
'a' will be found first, as expected.
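
For anyone who wants to try the vi case outside of vi, here is a minimal
sketch. It is not part of the patch: next_word_start() is a hypothetical
helper, and starting the search at offset 1 is my assumption about what
vi does when the cursor is at position 0. It returns -1 on libcs without
REG_STARTEND. Going by the description above, an unpatched libc should
report offset 3 ('b') for " a b" and a patched one offset 1 ('a'); other
implementations may differ.

```c
#include <sys/types.h>
#include <assert.h>
#include <regex.h>
#include <string.h>

/*
 * Hypothetical helper, not part of the patch: return the offset
 * (relative to s) of the first match of the BRE "\<" at or after
 * "from", or -1 if nothing matches or REG_STARTEND is unavailable.
 * REG_NOTBOL is added whenever the search starts past offset 0.
 */
static long
next_word_start(const char *s, size_t from)
{
#ifdef REG_STARTEND
	regex_t re;
	regmatch_t pm[1];
	int eflags = REG_STARTEND;
	long off = -1;

	assert(from <= strlen(s));

	if (regcomp(&re, "\\<", 0) != 0)
		return -1;

	/* With REG_STARTEND, pm[0] is input as well as output. */
	pm[0].rm_so = (regoff_t)from;
	pm[0].rm_eo = (regoff_t)strlen(s);
	if (from > 0)
		eflags |= REG_NOTBOL;

	if (regexec(&re, s, 1, pm, eflags) == 0)
		off = (long)pm[0].rm_so;
	regfree(&re);
	return off;
#else
	/* Assumption: treat a missing REG_STARTEND as "no result". */
	(void)s;
	(void)from;
	return -1;
#endif
}
```

Calling next_word_start(" a b", 1) exercises exactly the
REG_STARTEND | REG_NOTBOL combination discussed here; calling it with
from == 0 leaves REG_NOTBOL out and should be unaffected by the patch.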
When comparing REG_STARTEND across vendors, it appears to be not only
non-standard, but also implemented inconsistently, especially when it
comes to combining it with REG_NOTBOL. This change, however, is similar
to how glibc handles REG_STARTEND | REG_NOTBOL.
Ingo and I also did a full sweep of both src and ports[2] and found that
the combination is seldom used. Where it is used, there is either an
improvement (see the vi example), no regression because the offset is 0,
or the implementation uses a custom regex library.
The only place where an unexpected change could occur is in the
luarexlib port, which wraps the system regex, but this isn't considered
a big risk.
Most of the manpage changes are by schwarze@, who (of course) also OKs
the patch, but considering this is a libc change I'd like some extra OKs.
I won't commit this until Ingo's patch for "buffer underflow segfault in
regexec"[3] is in. From what I understood an extra OK is still required
here as well.
martijn@
[1] For those who are curious, yes: vi and ed are also affected by this
issue, and maybe more software that uses REG_NOTBOL. I'll do a sweep
when I find the time.
[2] Thanks to sthen@ for providing us with an overview of where
REG_STARTEND is used in ports.
[3] http://marc.info/?l=openbsd-tech&m=146333174823516&w=2
Index: engine.c
===================================================================
RCS file: /cvs/src/lib/libc/regex/engine.c,v
retrieving revision 1.20
diff -u -p -r1.20 engine.c
--- engine.c 17 May 2016 22:03:18 -0000 1.20
+++ engine.c 24 May 2016 18:10:46 -0000
@@ -674,12 +674,17 @@ fast(struct match *m, char *start, char
states fresh = m->fresh;
states tmp = m->tmp;
char *p = start;
- int c = (start == m->beginp) ? OUT : *(start-1);
+ int c;
int lastc; /* previous c */
int flagch;
int i;
char *coldp; /* last p after which no match was underway */
+ if (start == m->offp || (start == m->beginp && !(m->eflags&REG_NOTBOL)))
+ c = OUT;
+ else
+ c = *(start-1);
+
CLEAR(st);
SET1(st, startst);
st = step(m->g, startst, stopst, st, NOTHING, st);
@@ -758,11 +763,16 @@ slow(struct match *m, char *start, char
states empty = m->empty;
states tmp = m->tmp;
char *p = start;
- int c = (start == m->beginp) ? OUT : *(start-1);
+ int c;
int lastc; /* previous c */
int flagch;
int i;
char *matchp; /* last p at which a match ended */
+
+ if (start == m->offp || (start == m->beginp && !(m->eflags&REG_NOTBOL)))
+ c = OUT;
+ else
+ c = *(start-1);
AT("slow", start, stop, startst, stopst);
CLEAR(st);
Index: regex.3
===================================================================
RCS file: /cvs/src/lib/libc/regex/regex.3,v
retrieving revision 1.26
diff -u -p -r1.26 regex.3
--- regex.3 10 Nov 2015 23:48:18 -0000 1.26
+++ regex.3 24 May 2016 18:10:46 -0000
@@ -225,11 +225,16 @@ argument is the bitwise
of zero or more of the following values:
.Bl -tag -width XREG_STARTENDX
.It Dv REG_NOTBOL
-The first character of
-the string
-is not the beginning of a line, so the
-.Ql ^
-anchor should not match before it.
+The first character of the string is treated as the continuation
+of a line.
+This means that the anchors
+.Ql ^ ,
+.Ql [[:<:]] ,
+and
+.Ql \e<
+do not match before it; but see
+.Dv REG_STARTEND
+below.
This does not affect the behavior of newlines under
.Dv REG_NEWLINE .
.It Dv REG_NOTEOL
@@ -237,15 +242,16 @@ The NUL terminating
the string
does not end a line, so the
.Ql $
-anchor should not match before it.
+anchor does not match before it.
This does not affect the behavior of newlines under
.Dv REG_NEWLINE .
.It Dv REG_STARTEND
The string is considered to start at
-\fIstring\fR\ + \fIpmatch\fR[0].\fIrm_so\fR
-and to have a terminating NUL located at
-\fIstring\fR\ + \fIpmatch\fR[0].\fIrm_eo\fR
-(there need not actually be a NUL at that location),
+.Fa string No +
+.Fa pmatch Ns [0]. Ns Fa rm_so
+and to end before the byte located at
+.Fa string No +
+.Fa pmatch Ns [0]. Ns Fa rm_eo ,
regardless of the value of
.Fa nmatch .
See below for the definition of
@@ -257,11 +263,37 @@ compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
-Note that a non-zero \fIrm_so\fR does not imply
-.Dv REG_NOTBOL ;
-.Dv REG_STARTEND
-affects only the location of the string,
-not how it is matched.
+.Pp
+Without
+.Dv REG_NOTBOL ,
+the position
+.Fa rm_so
+is considered the beginning of a line, such that
+.Ql ^
+matches before it, and the beginning of a word if there is a word
+character at this position, such that
+.Ql [[:<:]]
+and
+.Ql \e<
+match before it.
+.Pp
+With
+.Dv REG_NOTBOL ,
+the character at position
+.Fa rm_so
+is treated as the continuation of a line, and if
+.Fa rm_so
+is greater than 0, the preceding character is taken into consideration.
+If the preceding character is a newline and the regular expression was compiled
+with
+.Dv REG_NEWLINE ,
+.Ql ^
+matches before the string; if the preceding character is not a word character
+but the string starts with a word character,
+.Ql [[:<:]]
+and
+.Ql \e<
+match before the string.
.El
.Pp
See