Re: sed/regcomp bug?

Martijn van Duren Tue, 10 May 2016 13:36:43 -0700

On 05/10/16 20:58, Ingo Schwarze wrote:
> Hi Martijn,
> 
> Martijn van Duren wrote on Tue, May 10, 2016 at 08:08:34PM +0200:
>> On 05/10/16 19:29, Ingo Schwarze wrote:
>>> Martijn van Duren wrote on Tue, May 10, 2016 at 02:43:54PM +0200:
> 
>>>> Index: ./lib/libc/regex/engine.c
>>>> ===================================================================
>>>> RCS file: /cvs/src/lib/libc/regex/engine.c,v
>>>> retrieving revision 1.19
>>>> diff -u -p -r1.19 engine.c
>>>> --- ./lib/libc/regex/engine.c      28 Dec 2015 23:01:22 -0000      1.19
>>>> +++ ./lib/libc/regex/engine.c      2 May 2016 08:50:20 -0000
>>>> @@ -674,7 +674,7 @@ fast(struct match *m, char *start, char 
>>>>    states fresh = m->fresh;
>>>>    states tmp = m->tmp;
>>>>    char *p = start;
>>>> -  int c = (start == m->beginp) ? OUT : *(start-1);
>>>> +  int c = (start == m->offp) ? OUT : *(start-1);
>>>>    int lastc;      /* previous c */
>>>>    int flagch;
>>>>    int i;
>>>> @@ -758,7 +758,7 @@ slow(struct match *m, char *start, char 
>>>>    states empty = m->empty;
>>>>    states tmp = m->tmp;
>>>>    char *p = start;
>>>> -  int c = (start == m->beginp) ? OUT : *(start-1);
>>>> +  int c = (start == m->offp) ? OUT : *(start-1);
>>>>    int lastc;      /* previous c */
>>>>    int flagch;
>>>>    int i;
> 
>>> i hate to say that this change appears to cause a regression.
>>>
>>> The regexec(3) manual explicitly says:
>>>
>>>   REG_STARTEND      The string is considered to start at [...]
>>>             Note that a non-zero rm_so does not imply REG_NOTBOL;
>>>             REG_STARTEND affects only the location of the string,
>>>             not how it is matched.
>>>
>>> Right now, the library actually implements that.  The test program
>>> appended below produces the following output, as documented:
>>>
>>>   rt: regcomp: OK
>>>   rt: mismatch: regexec() failed to match
>>>   rt: BOL match: OK
>>>   rt: ST match: OK
>>>
>>> With your change, the library now fails to match:
>>>
>>>   rt: regcomp: OK
>>>   rt: mismatch: regexec() failed to match
>>>   rt: BOL match: OK
>>>   rt: ST match: regexec() failed to match
>>>
>>> I don't think that change is intentional, or is it?
> 
>> This change is intentional.  You try to match y on the start of the
>> string.  This falsely succeeds in the current library, but is fixed
>> in my change.
> 
> Since this is an intentional change of the way the library works,
> i think the following is needed:
> 
>  1. It ought to be separate from fixing sed(1).
>     Mixing interface changes of libraries with bug fixes
>     in programs using them seems like a bad idea to me.
> 
>  2. We need a rationale why the change is useful in general,
>     not just for sed(1), and why we think that nobody relies on the
>     current interface definition, or how programs that rely on the
>     current behaviour can be found and adjusted to work with the
>     new behaviour.
> 
>  3. There should be a list explaining how other implementations
>     behave in this respect, in particular FreeBSD, NetBSD,
>     DragonFly, glibc, illumos or Solaris, and maybe some other
>     commercial Unixes.
> 
>  4. The library must be kept consistent.  For example, you only
>     change the treatment of "^" and "\<", but leave "$" and "\>"
>     unchanged, see the test program below.  That is unlikely to
>     be the only inconsistency you introduce.  In the function
>     backref(), there are multiple references to beginp in the
>     immediate vicinity of REG_NOTBOL and ISWORD.  I did not
>     check the details, but i fear that your change introduces
>     a second inconsistency between the treatment of anchors
>     inside and outside of backreferences.  These are just two
>     examples of (potential) inconsistencies, we have to convince
>     ourselves there aren't more.
> 
>  5. The change needs to be documented.  It blantantly contradicts
>     what the manual currently says.  The regular expression
>     manuals are among the most precise we have, and we should
>     really keep that level of perfection.
> 
> I'm not saying we don't want such a change, nor am i saying we want
> it, but as it stands, the patch is inconsistent, incomplete, and
> mixed up with unrelated stuff, so it is not OK.


I reread the REG_STARTEND section again and it turns out I
misinterpreted it's intention and I thought the current behaviour
was wrong/incomplete. So the change was intentional, but for the
wrong reasons.

I changed the diff to allow regexec to combine REG_NOTBOL and
REG_STARTEND. This way the current syntax stays intact and we can
use the extra flag to do lookbacks to verify that the last
character was a word character or not, without risking going into
unallocated memory.

Does this look better to you?
> 
> Yours,
>   Ingo
> 
> 

Index: ./lib/libc/regex/engine.c
===================================================================
RCS file: /cvs/src/lib/libc/regex/engine.c,v
retrieving revision 1.19
diff -u -p -r1.19 engine.c
--- ./lib/libc/regex/engine.c   28 Dec 2015 23:01:22 -0000      1.19
+++ ./lib/libc/regex/engine.c   10 May 2016 20:22:09 -0000
@@ -674,12 +674,17 @@ fast(struct match *m, char *start, char 
        states fresh = m->fresh;
        states tmp = m->tmp;
        char *p = start;
-       int c = (start == m->beginp) ? OUT : *(start-1);
+       int c;
        int lastc;      /* previous c */
        int flagch;
        int i;
        char *coldp;    /* last p after which no match was underway */
 
+       if (start == m->offp || (start == m->beginp && !(m->eflags&REG_NOTBOL)))
+               c = OUT;
+       else
+               c = *(start-1);
+
        CLEAR(st);
        SET1(st, startst);
        st = step(m->g, startst, stopst, st, NOTHING, st);
@@ -758,11 +763,16 @@ slow(struct match *m, char *start, char 
        states empty = m->empty;
        states tmp = m->tmp;
        char *p = start;
-       int c = (start == m->beginp) ? OUT : *(start-1);
+       int c;
        int lastc;      /* previous c */
        int flagch;
        int i;
        char *matchp;   /* last p at which a match ended */
+
+       if (start == m->offp || (start == m->beginp && !(m->eflags&REG_NOTBOL)))
+               c = OUT;
+       else
+               c = *(start-1);
 
        AT("slow", start, stop, startst, stopst);
        CLEAR(st);
Index: ./usr.bin/sed/process.c
===================================================================
RCS file: /cvs/src/usr.bin/sed/process.c,v
retrieving revision 1.27
diff -u -p -r1.27 process.c
--- ./usr.bin/sed/process.c     26 Oct 2015 14:08:47 -0000      1.27
+++ ./usr.bin/sed/process.c     10 May 2016 20:22:09 -0000
@@ -61,7 +61,8 @@ static SPACE HS, PS, SS;
 static inline int       applies(struct s_command *);
 static void             flush_appends(void);
 static void             lputs(char *);
-static inline int       regexec_e(regex_t *, const char *, int, int, size_t);
+static inline int       regexec_e(regex_t *, const char *, int, int, size_t,
+                            size_t);
 static void             regsub(SPACE *, char *, char *);
 static int              substitute(struct s_command *);
 
@@ -267,7 +268,7 @@ new:                if (!nflag && !pd)
  * (lastline, linenumber, ps).
  */
 #define        MATCH(a)                                                \
-       (a)->type == AT_RE ? regexec_e((a)->u.r, ps, 0, 1, psl) :       \
+       (a)->type == AT_RE ? regexec_e((a)->u.r, ps, 0, 1, 0, psl) :    \
            (a)->type == AT_LINE ? linenum == (a)->u.l : lastline()
 
 /*
@@ -335,6 +336,7 @@ substitute(struct s_command *cp)
        regex_t *re;
        regoff_t slen;
        int n, lastempty;
+       size_t le = 0;
        char *s;
 
        s = ps;
@@ -346,7 +348,7 @@ substitute(struct s_command *cp)
                            cp->u.s->maxbref);
                }
        }
-       if (!regexec_e(re, s, 0, 0, psl))
+       if (!regexec_e(re, s, 0, 0, 0, psl))
                return (0);
 
        SS.len = 0;                             /* Clean substitute space. */
@@ -356,28 +358,29 @@ substitute(struct s_command *cp)
 
        do {
                /* Copy the leading retained string. */
-               if (n <= 1 && match[0].rm_so)
-                       cspace(&SS, s, match[0].rm_so, APPEND);
+               if (n <= 1 && (match[0].rm_so - le))
+                       cspace(&SS, s, match[0].rm_so - le, APPEND);
 
                /* Skip zero-length matches right after other matches. */
-               if (lastempty || match[0].rm_so ||
+               if (lastempty || (match[0].rm_so - le) ||
                    match[0].rm_so != match[0].rm_eo) {
                        if (n <= 1) {
                                /* Want this match: append replacement. */
-                               regsub(&SS, s, cp->u.s->new);
+                               regsub(&SS, ps, cp->u.s->new);
                                if (n == 1)
                                        n = -1;
                        } else {
                                /* Want a later match: append original. */
-                               if (match[0].rm_eo)
-                                       cspace(&SS, s, match[0].rm_eo, APPEND);
+                               if (match[0].rm_eo - le)
+                                       cspace(&SS, s, match[0].rm_eo - le, 
APPEND);
                                n--;
                        }
                }
 
                /* Move past this match. */
-               s += match[0].rm_eo;
-               slen -= match[0].rm_eo;
+               s += (match[0].rm_eo - le);
+               slen -= (match[0].rm_eo - le);
+               le = match[0].rm_eo;
 
                /*
                 * After a zero-length match, advance one byte,
@@ -388,13 +391,15 @@ substitute(struct s_command *cp)
                                slen = -1;
                        else
                                slen--;
-                       if (*s != '\0')
+                       if (*s != '\0') {
                                cspace(&SS, s++, 1, APPEND);
+                               le++;
+                       }
                        lastempty = 1;
                } else
                        lastempty = 0;
 
-       } while (n >= 0 && slen >= 0 && regexec_e(re, s, REG_NOTBOL, 0, slen));
+       } while (n >= 0 && slen >= 0 && regexec_e(re, ps, REG_NOTBOL, 0, le, 
psl));
 
        /* Did not find the requested number of matches. */
        if (n > 1)
@@ -509,7 +514,7 @@ lputs(char *s)
 
 static inline int
 regexec_e(regex_t *preg, const char *string, int eflags,
-    int nomatch, size_t slen)
+    int nomatch, size_t start, size_t stop)
 {
        int eval;
 
@@ -520,8 +525,8 @@ regexec_e(regex_t *preg, const char *str
                defpreg = preg;
 
        /* Set anchors */
-       match[0].rm_so = 0;
-       match[0].rm_eo = slen;
+       match[0].rm_so = start;
+       match[0].rm_eo = stop;
 
        eval = regexec(defpreg, string,
            nomatch ? 0 : maxnsub + 1, match, eflags | REG_STARTEND);

Re: sed/regcomp bug?

Reply via email to