On 05/10/16 20:58, Ingo Schwarze wrote:
> Hi Martijn,
>
> Martijn van Duren wrote on Tue, May 10, 2016 at 08:08:34PM +0200:
>> On 05/10/16 19:29, Ingo Schwarze wrote:
>>> Martijn van Duren wrote on Tue, May 10, 2016 at 02:43:54PM +0200:
>
>>>> Index: ./lib/libc/regex/engine.c
>>>> ===================================================================
>>>> RCS file: /cvs/src/lib/libc/regex/engine.c,v
>>>> retrieving revision 1.19
>>>> diff -u -p -r1.19 engine.c
>>>> --- ./lib/libc/regex/engine.c 28 Dec 2015 23:01:22 -0000 1.19
>>>> +++ ./lib/libc/regex/engine.c 2 May 2016 08:50:20 -0000
>>>> @@ -674,7 +674,7 @@ fast(struct match *m, char *start, char
>>>> states fresh = m->fresh;
>>>> states tmp = m->tmp;
>>>> char *p = start;
>>>> - int c = (start == m->beginp) ? OUT : *(start-1);
>>>> + int c = (start == m->offp) ? OUT : *(start-1);
>>>> int lastc; /* previous c */
>>>> int flagch;
>>>> int i;
>>>> @@ -758,7 +758,7 @@ slow(struct match *m, char *start, char
>>>> states empty = m->empty;
>>>> states tmp = m->tmp;
>>>> char *p = start;
>>>> - int c = (start == m->beginp) ? OUT : *(start-1);
>>>> + int c = (start == m->offp) ? OUT : *(start-1);
>>>> int lastc; /* previous c */
>>>> int flagch;
>>>> int i;
>
>>> i hate to say that this change appears to cause a regression.
>>>
>>> The regexec(3) manual explicitly says:
>>>
>>> REG_STARTEND The string is considered to start at [...]
>>> Note that a non-zero rm_so does not imply REG_NOTBOL;
>>> REG_STARTEND affects only the location of the string,
>>> not how it is matched.
>>>
>>> Right now, the library actually implements that. The test program
>>> appended below produces the following output, as documented:
>>>
>>> rt: regcomp: OK
>>> rt: mismatch: regexec() failed to match
>>> rt: BOL match: OK
>>> rt: ST match: OK
>>>
>>> With your change, the library now fails to match:
>>>
>>> rt: regcomp: OK
>>> rt: mismatch: regexec() failed to match
>>> rt: BOL match: OK
>>> rt: ST match: regexec() failed to match
>>>
>>> I don't think that change is intentional, or is it?
>
>> This change is intentional. You try to match y on the start of the
>> string. This falsely succeeds in the current library, but is fixed
>> in my change.
>
> Since this is an intentional change of the way the library works,
> i think the following is needed:
>
> 1. It ought to be separate from fixing sed(1).
> Mixing interface changes of libraries with bug fixes
> in programs using them seems like a bad idea to me.
>
> 2. We need a rationale why the change is useful in general,
> not just for sed(1), and why we think that nobody relies on the
> current interface definition, or how programs that rely on the
> current behaviour can be found and adjusted to work with the
> new behaviour.
>
> 3. There should be a list explaining how other implementations
> behave in this respect, in particular FreeBSD, NetBSD,
> DragonFly, glibc, illumos or Solaris, and maybe some other
> commercial Unixes.
>
> 4. The library must be kept consistent. For example, you only
> change the treatment of "^" and "\<", but leave "$" and "\>"
> unchanged, see the test program below. That is unlikely to
> be the only inconsistency you introduce. In the function
> backref(), there are multiple references to beginp in the
> immediate vicinity of REG_NOTBOL and ISWORD. I did not
> check the details, but i fear that your change introduces
> a second inconsistency between the treatment of anchors
> inside and outside of backreferences. These are just two
> examples of (potential) inconsistencies, we have to convince
> ourselves there aren't more.
>
> 5. The change needs to be documented. It blantantly contradicts
> what the manual currently says. The regular expression
> manuals are among the most precise we have, and we should
> really keep that level of perfection.
>
> I'm not saying we don't want such a change, nor am i saying we want
> it, but as it stands, the patch is inconsistent, incomplete, and
> mixed up with unrelated stuff, so it is not OK.
I reread the REG_STARTEND section again and it turns out I
misinterpreted it's intention and I thought the current behaviour
was wrong/incomplete. So the change was intentional, but for the
wrong reasons.
I changed the diff to allow regexec to combine REG_NOTBOL and
REG_STARTEND. This way the current syntax stays intact and we can
use the extra flag to do lookbacks to verify that the last
character was a word character or not, without risking going into
unallocated memory.
Does this look better to you?
>
> Yours,
> Ingo
>
>
Index: ./lib/libc/regex/engine.c
===================================================================
RCS file: /cvs/src/lib/libc/regex/engine.c,v
retrieving revision 1.19
diff -u -p -r1.19 engine.c
--- ./lib/libc/regex/engine.c 28 Dec 2015 23:01:22 -0000 1.19
+++ ./lib/libc/regex/engine.c 10 May 2016 20:22:09 -0000
@@ -674,12 +674,17 @@ fast(struct match *m, char *start, char
states fresh = m->fresh;
states tmp = m->tmp;
char *p = start;
- int c = (start == m->beginp) ? OUT : *(start-1);
+ int c;
int lastc; /* previous c */
int flagch;
int i;
char *coldp; /* last p after which no match was underway */
+ if (start == m->offp || (start == m->beginp && !(m->eflags®_NOTBOL)))
+ c = OUT;
+ else
+ c = *(start-1);
+
CLEAR(st);
SET1(st, startst);
st = step(m->g, startst, stopst, st, NOTHING, st);
@@ -758,11 +763,16 @@ slow(struct match *m, char *start, char
states empty = m->empty;
states tmp = m->tmp;
char *p = start;
- int c = (start == m->beginp) ? OUT : *(start-1);
+ int c;
int lastc; /* previous c */
int flagch;
int i;
char *matchp; /* last p at which a match ended */
+
+ if (start == m->offp || (start == m->beginp && !(m->eflags®_NOTBOL)))
+ c = OUT;
+ else
+ c = *(start-1);
AT("slow", start, stop, startst, stopst);
CLEAR(st);
Index: ./usr.bin/sed/process.c
===================================================================
RCS file: /cvs/src/usr.bin/sed/process.c,v
retrieving revision 1.27
diff -u -p -r1.27 process.c
--- ./usr.bin/sed/process.c 26 Oct 2015 14:08:47 -0000 1.27
+++ ./usr.bin/sed/process.c 10 May 2016 20:22:09 -0000
@@ -61,7 +61,8 @@ static SPACE HS, PS, SS;
static inline int applies(struct s_command *);
static void flush_appends(void);
static void lputs(char *);
-static inline int regexec_e(regex_t *, const char *, int, int, size_t);
+static inline int regexec_e(regex_t *, const char *, int, int, size_t,
+ size_t);
static void regsub(SPACE *, char *, char *);
static int substitute(struct s_command *);
@@ -267,7 +268,7 @@ new: if (!nflag && !pd)
* (lastline, linenumber, ps).
*/
#define MATCH(a) \
- (a)->type == AT_RE ? regexec_e((a)->u.r, ps, 0, 1, psl) : \
+ (a)->type == AT_RE ? regexec_e((a)->u.r, ps, 0, 1, 0, psl) : \
(a)->type == AT_LINE ? linenum == (a)->u.l : lastline()
/*
@@ -335,6 +336,7 @@ substitute(struct s_command *cp)
regex_t *re;
regoff_t slen;
int n, lastempty;
+ size_t le = 0;
char *s;
s = ps;
@@ -346,7 +348,7 @@ substitute(struct s_command *cp)
cp->u.s->maxbref);
}
}
- if (!regexec_e(re, s, 0, 0, psl))
+ if (!regexec_e(re, s, 0, 0, 0, psl))
return (0);
SS.len = 0; /* Clean substitute space. */
@@ -356,28 +358,29 @@ substitute(struct s_command *cp)
do {
/* Copy the leading retained string. */
- if (n <= 1 && match[0].rm_so)
- cspace(&SS, s, match[0].rm_so, APPEND);
+ if (n <= 1 && (match[0].rm_so - le))
+ cspace(&SS, s, match[0].rm_so - le, APPEND);
/* Skip zero-length matches right after other matches. */
- if (lastempty || match[0].rm_so ||
+ if (lastempty || (match[0].rm_so - le) ||
match[0].rm_so != match[0].rm_eo) {
if (n <= 1) {
/* Want this match: append replacement. */
- regsub(&SS, s, cp->u.s->new);
+ regsub(&SS, ps, cp->u.s->new);
if (n == 1)
n = -1;
} else {
/* Want a later match: append original. */
- if (match[0].rm_eo)
- cspace(&SS, s, match[0].rm_eo, APPEND);
+ if (match[0].rm_eo - le)
+ cspace(&SS, s, match[0].rm_eo - le,
APPEND);
n--;
}
}
/* Move past this match. */
- s += match[0].rm_eo;
- slen -= match[0].rm_eo;
+ s += (match[0].rm_eo - le);
+ slen -= (match[0].rm_eo - le);
+ le = match[0].rm_eo;
/*
* After a zero-length match, advance one byte,
@@ -388,13 +391,15 @@ substitute(struct s_command *cp)
slen = -1;
else
slen--;
- if (*s != '\0')
+ if (*s != '\0') {
cspace(&SS, s++, 1, APPEND);
+ le++;
+ }
lastempty = 1;
} else
lastempty = 0;
- } while (n >= 0 && slen >= 0 && regexec_e(re, s, REG_NOTBOL, 0, slen));
+ } while (n >= 0 && slen >= 0 && regexec_e(re, ps, REG_NOTBOL, 0, le,
psl));
/* Did not find the requested number of matches. */
if (n > 1)
@@ -509,7 +514,7 @@ lputs(char *s)
static inline int
regexec_e(regex_t *preg, const char *string, int eflags,
- int nomatch, size_t slen)
+ int nomatch, size_t start, size_t stop)
{
int eval;
@@ -520,8 +525,8 @@ regexec_e(regex_t *preg, const char *str
defpreg = preg;
/* Set anchors */
- match[0].rm_so = 0;
- match[0].rm_eo = slen;
+ match[0].rm_so = start;
+ match[0].rm_eo = stop;
eval = regexec(defpreg, string,
nomatch ? 0 : maxnsub + 1, match, eflags | REG_STARTEND);