Re: sed/regcomp bug?

Martijn van Duren Tue, 24 May 2016 11:21:57 -0700

Hello tech@,

Here's another shot at fixing the sed/regex bug. This time I've split it
up into a libc/regex part and a sed part. The sed patch is still the
same, but I'll resend it after the regex changes are in the tree.[1]


I've had a lot of help from schwarze@ at testing the consequences of
this patch and comparing it to other regex implementations.
He and I both agree that this is the way forward.

The patch itself changes the behavior when REG_STARTEND is combined
with REG_NOTBOL. Currently, when the two are combined,
"string + pmatch[0].rm_so" is used as the start of the string and
REG_NOTBOL is evaluated exactly as it would be for a string without an
offset. This patch instead assumes that pmatch[0].rm_so is a
continuation offset into string, which allows us to properly assess the
character at that offset with regard to its line and word position
('^' or '\<'), without risking going into unallocated memory.
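
To make the combination concrete, here is a minimal sketch of a caller.
The helper name search_from() and its parameters are made up for this
illustration and are not part of the patch; it assumes a libc that
defines REG_STARTEND and supports the '\<' extension.

```c
#include <regex.h>
#include <string.h>

/*
 * Hypothetical helper, not part of the patch: search string[off..end)
 * for a basic regular expression.  REG_STARTEND is always passed; the
 * caller adds cflags (e.g. REG_NEWLINE) and eflags (e.g. REG_NOTBOL).
 * Returns the match offset relative to string, or -1 if nothing matched.
 */
long
search_from(const char *pattern, int cflags, const char *string,
    size_t off, int eflags)
{
	regex_t re;
	regmatch_t pm[1];
	long found = -1;

	if (regcomp(&re, pattern, cflags) != 0)
		return -1;
	/* With REG_STARTEND, pm[0] delimits the part of string to search. */
	pm[0].rm_so = (regoff_t)off;
	pm[0].rm_eo = (regoff_t)strlen(string);
	if (regexec(&re, string, 1, pm, eflags | REG_STARTEND) == 0)
		found = (long)pm[0].rm_so;
	regfree(&re);
	return found;
}
```

With the behavior proposed here,
search_from("^bar", REG_NEWLINE, "foo\nbar", 4, REG_NOTBOL) should
return 4, because the byte before the offset is a newline, and
search_from("\\<bar", 0, "foo bar", 4, REG_NOTBOL) should return 4,
because the byte before the offset is not a word character.
search_from("\\<bar", 0, "foobar", 3, REG_NOTBOL) should return -1.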

This change also fixes a feature^wbug in vi. Consider the string " a b"
where the cursor is at position 0. When searching for '\<', 'b' is
currently found first: 'a' is tested first with REG_NOTBOL, but doesn't
match because of the current behavior of REG_NOTBOL. After my patch
'a' will be found first, as expected.
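
The vi case can be approximated outside the editor with a small sketch.
This is not vi's actual search code; next_word_start() is a hypothetical
function that only mimics "search forward from just past the cursor"
using REG_STARTEND | REG_NOTBOL.

```c
#include <regex.h>
#include <string.h>

/*
 * Hypothetical sketch of an editor-style forward search, not vi's code:
 * find the next word start strictly after the cursor position.
 * Returns the offset of that word start in line, or -1 if there is none.
 */
long
next_word_start(const char *line, size_t cursor)
{
	regex_t re;
	regmatch_t pm[1];
	size_t len = strlen(line);
	long found = -1;

	if (cursor + 1 >= len)		/* nothing left after the cursor */
		return -1;
	if (regcomp(&re, "\\<", 0) != 0)	/* '\<' is an extension */
		return -1;
	/* Search from one byte past the cursor to the end of the line. */
	pm[0].rm_so = (regoff_t)(cursor + 1);
	pm[0].rm_eo = (regoff_t)len;
	/* REG_NOTBOL: the byte at rm_so continues the line. */
	if (regexec(&re, line, 1, pm, REG_STARTEND | REG_NOTBOL) == 0)
		found = (long)pm[0].rm_so;
	regfree(&re);
	return found;
}
```

With the patched regex code, next_word_start(" a b", 0) should return 1
(the 'a'); with the current code it returns 3 (the 'b'), which is the
behavior described above.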

When comparing REG_STARTEND across vendors, it appears to be not only
non-standard, but also implemented inconsistently, especially when it
comes to combining it with REG_NOTBOL. This change, however, is similar
to how glibc handles REG_STARTEND | REG_NOTBOL.

Ingo and I also did a full sweep of both src and ports[2] and found that
the combination is seldom used. Where it is used, there is either an
improvement (see the vi example), no regression because the offset is 0,
or the implementation uses a custom regex library.
The only place where an unexpected change could occur is the luarexlib
port, which wraps the system regex, but this isn't considered a big
risk.

Most of the manpage changes are by, and (of course) OK by, schwarze@,
but considering this is a libc change I'd like some extra OKs.
I won't commit this until Ingo's patch for "buffer underflow segfault in
regexec"[3] is in. From what I understood an extra OK is still required
here as well.

martijn@

[1] For those who are curious, yes: vi and ed are also affected by this
issue, and maybe more software that uses REG_NOTBOL. I'll do a sweep
when I find the time.
[2] Thanks to sthen@ for providing us with an overview of where
REG_STARTEND is used in ports.
[3] http://marc.info/?l=openbsd-tech&m=146333174823516&w=2

Index: engine.c
===================================================================
RCS file: /cvs/src/lib/libc/regex/engine.c,v
retrieving revision 1.20
diff -u -p -r1.20 engine.c
--- engine.c    17 May 2016 22:03:18 -0000      1.20
+++ engine.c    24 May 2016 18:10:46 -0000
@@ -674,12 +674,17 @@ fast(struct match *m, char *start, char 
        states fresh = m->fresh;
        states tmp = m->tmp;
        char *p = start;
-       int c = (start == m->beginp) ? OUT : *(start-1);
+       int c;
        int lastc;      /* previous c */
        int flagch;
        int i;
        char *coldp;    /* last p after which no match was underway */
 
+       if (start == m->offp || (start == m->beginp && !(m->eflags&REG_NOTBOL)))
+               c = OUT;
+       else
+               c = *(start-1);
+
        CLEAR(st);
        SET1(st, startst);
        st = step(m->g, startst, stopst, st, NOTHING, st);
@@ -758,11 +763,16 @@ slow(struct match *m, char *start, char 
        states empty = m->empty;
        states tmp = m->tmp;
        char *p = start;
-       int c = (start == m->beginp) ? OUT : *(start-1);
+       int c;
        int lastc;      /* previous c */
        int flagch;
        int i;
        char *matchp;   /* last p at which a match ended */
+
+       if (start == m->offp || (start == m->beginp && !(m->eflags&REG_NOTBOL)))
+               c = OUT;
+       else
+               c = *(start-1);
 
        AT("slow", start, stop, startst, stopst);
        CLEAR(st);
Index: regex.3
===================================================================
RCS file: /cvs/src/lib/libc/regex/regex.3,v
retrieving revision 1.26
diff -u -p -r1.26 regex.3
--- regex.3     10 Nov 2015 23:48:18 -0000      1.26
+++ regex.3     24 May 2016 18:10:46 -0000
@@ -225,11 +225,16 @@ argument is the bitwise
 of zero or more of the following values:
 .Bl -tag -width XREG_STARTENDX
 .It Dv REG_NOTBOL
-The first character of
-the string
-is not the beginning of a line, so the
-.Ql ^
-anchor should not match before it.
+The first character of the string is treated as the continuation
+of a line.
+This means that the anchors
+.Ql ^ ,
+.Ql [[:<:]] ,
+and
+.Ql \e<
+do not match before it; but see
+.Dv REG_STARTEND
+below.
 This does not affect the behavior of newlines under
 .Dv REG_NEWLINE .
 .It Dv REG_NOTEOL
@@ -237,15 +242,16 @@ The NUL terminating
 the string
 does not end a line, so the
 .Ql $
-anchor should not match before it.
+anchor does not match before it.
 This does not affect the behavior of newlines under
 .Dv REG_NEWLINE .
 .It Dv REG_STARTEND
 The string is considered to start at
-\fIstring\fR\ + \fIpmatch\fR[0].\fIrm_so\fR
-and to have a terminating NUL located at
-\fIstring\fR\ + \fIpmatch\fR[0].\fIrm_eo\fR
-(there need not actually be a NUL at that location),
+.Fa string No +
+.Fa pmatch Ns [0]. Ns Fa rm_so
+and to end before the byte located at
+.Fa string No +
+.Fa pmatch Ns [0]. Ns Fa rm_eo ,
 regardless of the value of
 .Fa nmatch .
 See below for the definition of
@@ -257,11 +263,37 @@ compatible with but not specified by
 .St -p1003.2 ,
 and should be used with
 caution in software intended to be portable to other systems.
-Note that a non-zero \fIrm_so\fR does not imply
-.Dv REG_NOTBOL ;
-.Dv REG_STARTEND
-affects only the location of the string,
-not how it is matched.
+.Pp
+Without
+.Dv REG_NOTBOL ,
+the position
+.Fa rm_so
+is considered the beginning of a line, such that
+.Ql ^
+matches before it, and the beginning of a word if there is a word
+character at this position, such that
+.Ql [[:<:]]
+and
+.Ql \e<
+match before it.
+.Pp
+With
+.Dv REG_NOTBOL ,
+the character at position
+.Fa rm_so
+is treated as the continuation of a line, and if
+.Fa rm_so
+is greater than 0, the preceding character is taken into consideration.
+If the preceding character is a newline and the regular expression was compiled
+with
+.Dv REG_NEWLINE ,
+.Ql ^
+matches before the string; if the preceding character is not a word character
+but the string starts with a word character,
+.Ql [[:<:]]
+and
+.Ql \e<
+match before the string.
 .El
 .Pp
 See
