Re: sed/regcomp bug?

Ingo Schwarze Tue, 10 May 2016 12:04:51 -0700

Hi Martijn,

Martijn van Duren wrote on Tue, May 10, 2016 at 08:08:34PM +0200:
> On 05/10/16 19:29, Ingo Schwarze wrote:
>> Martijn van Duren wrote on Tue, May 10, 2016 at 02:43:54PM +0200:


>>> Index: ./lib/libc/regex/engine.c
>>> ===================================================================
>>> RCS file: /cvs/src/lib/libc/regex/engine.c,v
>>> retrieving revision 1.19
>>> diff -u -p -r1.19 engine.c
>>> --- ./lib/libc/regex/engine.c       28 Dec 2015 23:01:22 -0000      1.19
>>> +++ ./lib/libc/regex/engine.c       2 May 2016 08:50:20 -0000
>>> @@ -674,7 +674,7 @@ fast(struct match *m, char *start, char 
>>>     states fresh = m->fresh;
>>>     states tmp = m->tmp;
>>>     char *p = start;
>>> -   int c = (start == m->beginp) ? OUT : *(start-1);
>>> +   int c = (start == m->offp) ? OUT : *(start-1);
>>>     int lastc;      /* previous c */
>>>     int flagch;
>>>     int i;
>>> @@ -758,7 +758,7 @@ slow(struct match *m, char *start, char 
>>>     states empty = m->empty;
>>>     states tmp = m->tmp;
>>>     char *p = start;
>>> -   int c = (start == m->beginp) ? OUT : *(start-1);
>>> +   int c = (start == m->offp) ? OUT : *(start-1);
>>>     int lastc;      /* previous c */
>>>     int flagch;
>>>     int i;

>> i hate to say that this change appears to cause a regression.
>> 
>> The regexec(3) manual explicitly says:
>> 
>>   REG_STARTEND       The string is considered to start at [...]
>>              Note that a non-zero rm_so does not imply REG_NOTBOL;
>>              REG_STARTEND affects only the location of the string,
>>              not how it is matched.
>> 
>> Right now, the library actually implements that.  The test program
>> appended below produces the following output, as documented:
>> 
>>   rt: regcomp: OK
>>   rt: mismatch: regexec() failed to match
>>   rt: BOL match: OK
>>   rt: ST match: OK
>> 
>> With your change, the library now fails to match:
>> 
>>   rt: regcomp: OK
>>   rt: mismatch: regexec() failed to match
>>   rt: BOL match: OK
>>   rt: ST match: regexec() failed to match
>> 
>> I don't think that change is intentional, or is it?

> This change is intentional.  You try to match y on the start of the
> string.  This falsely succeeds in the current library, but is fixed
> in my change.

Since this is an intentional change of the way the library works,
i think the following is needed:

 1. It ought to be separate from fixing sed(1).
    Mixing interface changes of libraries with bug fixes
    in programs using them seems like a bad idea to me.

 2. We need a rationale why the change is useful in general,
    not just for sed(1), and why we think that nobody relies on the
    current interface definition, or how programs that rely on the
    current behaviour can be found and adjusted to work with the
    new behaviour.

 3. There should be a list explaining how other implementations
    behave in this respect, in particular FreeBSD, NetBSD,
    DragonFly, glibc, illumos or Solaris, and maybe some other
    commercial Unixes.

 4. The library must be kept consistent.  For example, you only
    change the treatment of "^" and "\<", but leave "$" and "\>"
    unchanged, see the test program below.  That is unlikely to
    be the only inconsistency you introduce.  In the function
    backref(), there are multiple references to beginp in the
    immediate vicinity of REG_NOTBOL and ISWORD.  I did not
    check the details, but i fear that your change introduces
    a second inconsistency between the treatment of anchors
    inside and outside of backreferences.  These are just two
    examples of (potential) inconsistencies, we have to convince
    ourselves there aren't more.

 5. The change needs to be documented.  It blantantly contradicts
    what the manual currently says.  The regular expression
    manuals are among the most precise we have, and we should
    really keep that level of perfection.

I'm not saying we don't want such a change, nor am i saying we want
it, but as it stands, the patch is inconsistent, incomplete, and
mixed up with unrelated stuff, so it is not OK.

Yours,
  Ingo


#include <sys/types.h>
#include <err.h>
#include <regex.h>

static regex_t           re;

static int
report(int errcode, const char *msg)
{
        const size_t     errbuf_size = 2048;
        char             errbuf[errbuf_size];
        size_t           sz;

        if (errcode) {
                sz = regerror(errcode, &re, errbuf, errbuf_size);
                warnx("%s: %s%s", msg, errbuf,
                    sz > errbuf_size ? "[...]" : "");
        } else
                warnx("%s: OK", msg);
        return errcode;
}

int
main(void)
{
        regmatch_t       pmatch;

        if (report(regcomp(&re, "y$", REG_EXTENDED), "regcomp"))
                return 1;

        report(regexec(&re, "yz", 0, NULL, 0), "mismatch");
        report(regexec(&re, "xy", 0, NULL, 0), "EOL match");

        pmatch.rm_so = 1;
        pmatch.rm_eo = 2;

        report(regexec(&re, "xyz", 0, &pmatch, REG_STARTEND), "SE match");
        return 0;
}

Re: sed/regcomp bug?

Reply via email to