Hi Martijn,
Martijn van Duren wrote on Tue, May 10, 2016 at 08:08:34PM +0200:
> On 05/10/16 19:29, Ingo Schwarze wrote:
>> Martijn van Duren wrote on Tue, May 10, 2016 at 02:43:54PM +0200:
>>> Index: ./lib/libc/regex/engine.c
>>> ===================================================================
>>> RCS file: /cvs/src/lib/libc/regex/engine.c,v
>>> retrieving revision 1.19
>>> diff -u -p -r1.19 engine.c
>>> --- ./lib/libc/regex/engine.c 28 Dec 2015 23:01:22 -0000 1.19
>>> +++ ./lib/libc/regex/engine.c 2 May 2016 08:50:20 -0000
>>> @@ -674,7 +674,7 @@ fast(struct match *m, char *start, char
>>> states fresh = m->fresh;
>>> states tmp = m->tmp;
>>> char *p = start;
>>> - int c = (start == m->beginp) ? OUT : *(start-1);
>>> + int c = (start == m->offp) ? OUT : *(start-1);
>>> int lastc; /* previous c */
>>> int flagch;
>>> int i;
>>> @@ -758,7 +758,7 @@ slow(struct match *m, char *start, char
>>> states empty = m->empty;
>>> states tmp = m->tmp;
>>> char *p = start;
>>> - int c = (start == m->beginp) ? OUT : *(start-1);
>>> + int c = (start == m->offp) ? OUT : *(start-1);
>>> int lastc; /* previous c */
>>> int flagch;
>>> int i;
>> i hate to say that this change appears to cause a regression.
>>
>> The regexec(3) manual explicitly says:
>>
>> REG_STARTEND The string is considered to start at [...]
>> Note that a non-zero rm_so does not imply REG_NOTBOL;
>> REG_STARTEND affects only the location of the string,
>> not how it is matched.
>>
>> Right now, the library actually implements that. The test program
>> appended below produces the following output, as documented:
>>
>> rt: regcomp: OK
>> rt: mismatch: regexec() failed to match
>> rt: BOL match: OK
>> rt: ST match: OK
>>
>> With your change, the library now fails to match:
>>
>> rt: regcomp: OK
>> rt: mismatch: regexec() failed to match
>> rt: BOL match: OK
>> rt: ST match: regexec() failed to match
>>
>> I don't think that change is intentional, or is it?
> This change is intentional. You try to match y on the start of the
> string. This falsely succeeds in the current library, but is fixed
> in my change.
Since this is an intentional change of the way the library works,
i think the following is needed:
1. It ought to be separate from fixing sed(1).
Mixing interface changes of libraries with bug fixes
in programs using them seems like a bad idea to me.
2. We need a rationale why the change is useful in general,
not just for sed(1), and why we think that nobody relies on the
current interface definition, or how programs that rely on the
current behaviour can be found and adjusted to work with the
new behaviour.
3. There should be a list explaining how other implementations
behave in this respect, in particular FreeBSD, NetBSD,
DragonFly, glibc, illumos or Solaris, and maybe some other
commercial Unixes.
4. The library must be kept consistent. For example, you only
change the treatment of "^" and "\<", but leave "$" and "\>"
unchanged, see the test program below. That is unlikely to
be the only inconsistency you introduce. In the function
backref(), there are multiple references to beginp in the
immediate vicinity of REG_NOTBOL and ISWORD. I did not
check the details, but i fear that your change introduces
a second inconsistency between the treatment of anchors
inside and outside of backreferences. These are just two
examples of (potential) inconsistencies, we have to convince
ourselves there aren't more.
5. The change needs to be documented. It blantantly contradicts
what the manual currently says. The regular expression
manuals are among the most precise we have, and we should
really keep that level of perfection.
I'm not saying we don't want such a change, nor am i saying we want
it, but as it stands, the patch is inconsistent, incomplete, and
mixed up with unrelated stuff, so it is not OK.
Yours,
Ingo
#include <sys/types.h>
#include <err.h>
#include <regex.h>
static regex_t re;
static int
report(int errcode, const char *msg)
{
const size_t errbuf_size = 2048;
char errbuf[errbuf_size];
size_t sz;
if (errcode) {
sz = regerror(errcode, &re, errbuf, errbuf_size);
warnx("%s: %s%s", msg, errbuf,
sz > errbuf_size ? "[...]" : "");
} else
warnx("%s: OK", msg);
return errcode;
}
int
main(void)
{
regmatch_t pmatch;
if (report(regcomp(&re, "y$", REG_EXTENDED), "regcomp"))
return 1;
report(regexec(&re, "yz", 0, NULL, 0), "mismatch");
report(regexec(&re, "xy", 0, NULL, 0), "EOL match");
pmatch.rm_so = 1;
pmatch.rm_eo = 2;
report(regexec(&re, "xyz", 0, &pmatch, REG_STARTEND), "SE match");
return 0;
}