Re: sed/regcomp bug?

Ingo Schwarze Thu, 12 May 2016 02:36:07 -0700

Hi Martijn,

Martijn van Duren wrote on Tue, May 10, 2016 at 10:34:54PM +0200:
> On 05/10/16 20:58, Ingo Schwarze wrote:
>> Martijn van Duren wrote on Tue, May 10, 2016 at 08:08:34PM +0200:
>>> On 05/10/16 19:29, Ingo Schwarze wrote:
>>>> Martijn van Duren wrote on Tue, May 10, 2016 at 02:43:54PM +0200:


>>>>> Index: ./lib/libc/regex/engine.c
>>>>> ===================================================================
>>>>> RCS file: /cvs/src/lib/libc/regex/engine.c,v
>>>>> retrieving revision 1.19
>>>>> diff -u -p -r1.19 engine.c
>>>>> --- ./lib/libc/regex/engine.c     28 Dec 2015 23:01:22 -0000      1.19
>>>>> +++ ./lib/libc/regex/engine.c     2 May 2016 08:50:20 -0000
>>>>> @@ -674,7 +674,7 @@ fast(struct match *m, char *start, char 
>>>>>   states fresh = m->fresh;
>>>>>   states tmp = m->tmp;
>>>>>   char *p = start;
>>>>> - int c = (start == m->beginp) ? OUT : *(start-1);
>>>>> + int c = (start == m->offp) ? OUT : *(start-1);
>>>>>   int lastc;      /* previous c */
>>>>>   int flagch;
>>>>>   int i;
>>>>> @@ -758,7 +758,7 @@ slow(struct match *m, char *start, char 
>>>>>   states empty = m->empty;
>>>>>   states tmp = m->tmp;
>>>>>   char *p = start;
>>>>> - int c = (start == m->beginp) ? OUT : *(start-1);
>>>>> + int c = (start == m->offp) ? OUT : *(start-1);
>>>>>   int lastc;      /* previous c */
>>>>>   int flagch;
>>>>>   int i;

>>>> i hate to say that this change appears to cause a regression.
>>>>
>>>> The regexec(3) manual explicitly says:
>>>>
>>>>   REG_STARTEND     The string is considered to start at [...]
>>>>            Note that a non-zero rm_so does not imply REG_NOTBOL;
>>>>            REG_STARTEND affects only the location of the string,
>>>>            not how it is matched.
>>>>
>>> Right now, the library actually implements that.  The test program
>>>> appended below produces the following output, as documented:
>>>>
>>>>   rt: regcomp: OK
>>>>   rt: mismatch: regexec() failed to match
>>>>   rt: BOL match: OK
>>>>   rt: ST match: OK
>>>>
>>>> With your change, the library now fails to match:
>>>>
>>>>   rt: regcomp: OK
>>>>   rt: mismatch: regexec() failed to match
>>>>   rt: BOL match: OK
>>>>   rt: ST match: regexec() failed to match
>>>>
>>>> I don't think that change is intentional, or is it?

>>> This change is intentional.  You try to match y on the start of the
>>> string.  This falsely succeeds in the current library, but is fixed
>>> in my change.

>> Since this is an intentional change of the way the library works,
>> i think the following is needed:
>> 
>>  1. It ought to be separate from fixing sed(1).
>>     Mixing interface changes of libraries with bug fixes
>>     in programs using them seems like a bad idea to me.
>> 
>>  2. We need a rationale why the change is useful in general,
>>     not just for sed(1), and why we think that nobody relies on the
>>     current interface definition, or how programs that rely on the
>>     current behaviour can be found and adjusted to work with the
>>     new behaviour.
>> 
>>  3. There should be a list explaining how other implementations
>>     behave in this respect, in particular FreeBSD, NetBSD,
>>     DragonFly, glibc, illumos or Solaris, and maybe some other
>>     commercial Unixes.
>> 
>>  4. The library must be kept consistent.  For example, you only
>>     change the treatment of "^" and "\<", but leave "$" and "\>"
>>     unchanged, see the test program below.  That is unlikely to
>>     be the only inconsistency you introduce.  In the function
>>     backref(), there are multiple references to beginp in the
>>     immediate vicinity of REG_NOTBOL and ISWORD.  I did not
>>     check the details, but i fear that your change introduces
>>     a second inconsistency between the treatment of anchors
>>     inside and outside of backreferences.  These are just two
>>     examples of (potential) inconsistencies, we have to convince
>>     ourselves there aren't more.
>> 
>>  5. The change needs to be documented.  It blantantly contradicts
>>     what the manual currently says.  The regular expression
>>     manuals are among the most precise we have, and we should
>>     really keep that level of perfection.
>> 
>> I'm not saying we don't want such a change, nor am i saying we want
>> it, but as it stands, the patch is inconsistent, incomplete, and
>> mixed up with unrelated stuff, so it is not OK.

> I reread the REG_STARTEND section again and it turns out I
> misinterpreted it's intention and I thought the current behaviour
> was wrong/incomplete. So the change was intentional, but for the
> wrong reasons.
> 
> I changed the diff to allow regexec to combine REG_NOTBOL and
> REG_STARTEND. This way the current syntax stays intact and we can
> use the extra flag to do lookbacks to verify that the last
> character was a word character or not, without risking going into
> unallocated memory.
> 
> Does this look better to you?

Probably minimally better, but still not OK.

All the above reservations still apply:

 1. It changes the interface definition of the library,
    so don't mix it with changing a program using the library.
 2. Assessment is missing how the change of behaviour might
    affect existing programs.
 3. Research is missing what other systems do.
 4. It is inconsistent.  You only change ^/\>, not $/\<,
    and there is still no attempt to even investigate what's
    up with backreferences.
 5. Before, you violated the documentation in one respect.
    Now, you violate the documentation even in two respects.
    The following is no longer true:

      REG_NOTBOL  The first character of the string is not the
                  beginning of a line, so the '^' anchor should
                  not match before it.

    With your change, there are situations where ^ can match at
    the beginning of the string even with REG_NOTBOL:  For example,
    if REG_NEWLINE is set and the character before the beginning
    of the string is '\n'.

    The following is also wrong now:

       REG_STARTEND affects only the location of the string, not
       how it is matched.

    With your change, it does affect how it is matched.  Suddenly,
    how the string is matched is affected by a character *outside*
    the string, if both REG_STARTEND and REG_NOTBOL are given.

You seem to be aiming for this semantics:

  !REG_STARTEND && !REG_NOTBOL (req. by the standard):
    The beginning of the string is the beginning of a line, and the
    beginning of a word if the first character is a word character.
  !REG_STARTEND && REG_NOTBOL (req. by the standard):
    The beginning of the string is not the beginning of a line
    and not the beginning of a word.
  REG_STARTEND && !REG_NOTBOL (existing extension):
    The beginning of the string is the beginning of a line, and the
    beginning of a word if the first character is a word character.
  REG_STARTEND && REG_NOTBOL (changed extension):
    The beginning of the string is the beginning of a line
    if and only if REG_NEWLINE is set and the previous character
    is '\n', and it is the beginning of a word if and only if
    the first character is a word character and either it is the
    beginning of a line, or the previous character is not a word
    character.

But if so (or something similar, there may be slight oversights in
the above specification, i did not scrutinize it perfectly), it
needs to be implemented and documented consistently, including
REG_NOTEOL and backreferences.

Yours,
  Ingo


> Index: ./lib/libc/regex/engine.c
> ===================================================================
> RCS file: /cvs/src/lib/libc/regex/engine.c,v
> retrieving revision 1.19
> diff -u -p -r1.19 engine.c
> --- ./lib/libc/regex/engine.c 28 Dec 2015 23:01:22 -0000      1.19
> +++ ./lib/libc/regex/engine.c 10 May 2016 20:22:09 -0000
> @@ -674,12 +674,17 @@ fast(struct match *m, char *start, char 
>       states fresh = m->fresh;
>       states tmp = m->tmp;
>       char *p = start;
> -     int c = (start == m->beginp) ? OUT : *(start-1);
> +     int c;
>       int lastc;      /* previous c */
>       int flagch;
>       int i;
>       char *coldp;    /* last p after which no match was underway */
>  
> +     if (start == m->offp || (start == m->beginp && !(m->eflags&REG_NOTBOL)))
> +             c = OUT;
> +     else
> +             c = *(start-1);
> +
>       CLEAR(st);
>       SET1(st, startst);
>       st = step(m->g, startst, stopst, st, NOTHING, st);
> @@ -758,11 +763,16 @@ slow(struct match *m, char *start, char 
>       states empty = m->empty;
>       states tmp = m->tmp;
>       char *p = start;
> -     int c = (start == m->beginp) ? OUT : *(start-1);
> +     int c;
>       int lastc;      /* previous c */
>       int flagch;
>       int i;
>       char *matchp;   /* last p at which a match ended */
> +
> +     if (start == m->offp || (start == m->beginp && !(m->eflags&REG_NOTBOL)))
> +             c = OUT;
> +     else
> +             c = *(start-1);
>  
>       AT("slow", start, stop, startst, stopst);
>       CLEAR(st);

Re: sed/regcomp bug?

Reply via email to