Hi,
there is another segfault in regexec(3), engine.c, backref(), similar
to the one i just reported, with the following difference: The first
elementary atom in the expression must be "[[:<:]]" or "\<" rather
than '^'.
The condition screwing up is:
case OBOW:
if (( (sp == m->beginp && !(m->eflags&REG_NOTBOL)) ||
(sp < m->endp && *(sp-1) == '\n' &&
(m->g->cflags&REG_NEWLINE)) ||
(sp > m->beginp &&
!ISWORD(*(sp-1))) ) &&
(sp < m->endp && ISWORD(*sp)) )
Beautiful, isn't it?
It can be fixed schematically by adding "sp > m->offp &&"
to the line below the "if", just like in the other patch i sent.
But i would like to shorten the expression while here rather than
making it even longer. For that purpose, observe that the last
line is joined to the rest with &&, so pulling it up front
allows eliminating "sp < m->endp" from the rest:
case OBOW:
if (sp < m->endp && ISWORD(*sp) &&
((sp == m->beginp && !(m->eflags&REG_NOTBOL)) ||
(sp > m->offp && *(sp-1) == '\n' &&
(m->g->cflags&REG_NEWLINE)) ||
(sp > m->beginp && !ISWORD(*(sp-1)))))
Now, exploiting the following invariants,
m->offp <= m->beginp <= sp && ISWORD('\n') == 0
we see that the last three lines can be simplified as follows:
case OBOW:
if (sp < m->endp && ISWORD(*sp) &&
((sp == m->beginp && !(m->eflags&REG_NOTBOL)) ||
(sp > m->offp && !ISWORD(*(sp-1)))))
Proof:
Direction => (simple):
Case sp > m->offp && *(sp-1) == '\n':
*(sp-1) == '\n' => !ISWORD(*(sp-1)) qed.
Case sp > m->beginp && !ISWORD(*(sp-1)):
sp > m->beginp => sp > m->offp qed.
Direction <= (tricky):
If sp > m->beginp, the statement is already proven.
So the only case that remains is sp == m->beginp.
For !REG_NOTBOL, that is already covered by the previous line.
So the only case that remains is REG_NOTBOL.
Actually, it's the REG_STARTEND case since m->offp < m->beginp == sp.
In anticipation of martijn@'s pending patch, it is OK to already
look at the preceding character in this case rather than always
rejecting the match.
Regarding the last sentence, note that the respective code in case
OBOL already does the same: It accepts '\n' preceding rm_so as BOL
even with REG_NOTBOL, even without martijn@'s semantic change,
simply because right now, without martijn@'s semantic change, the
filter functions fast() and slow() already catch that case and never
let the code get into backref(). So we can do the same for case
OBOW, such that we won't have to change it again when committing
martijn@'s improvements.
Again, i'm appending a test program demonstrating the segfault
and the patch fixing it.
OK?
Ingo
----- 8< ----- schnipp ----- >8 ----- 8< ----- schnapp ----- >8 -----
#include <sys/types.h>
#include <err.h>
#include <regex.h>
#include <stdlib.h>
int
main(void)
{
regex_t re;
char *buf;
if (regcomp(&re, "\\([[:<:]]\\)*\\(x\\)\\2", REG_BASIC | REG_NEWLINE))
errx(1, "regcomp");
/*
* Allocate a huge buffer such that we get
* a guard page in front of it.
*/
if ((buf = malloc(64 * 1024)) == NULL)
err(1, NULL);
buf[0] = 'x';
buf[1] = 'x';
buf[2] = '\0';
/*
* Trigger the segfault in regex/engine.c,
* backref(), case OBOW.
*/
regexec(&re, buf, 0, NULL, REG_NOTBOL);
errx(1, "This is unexpected: regexec did not segfault.");
}
----- 8< ----- schnipp ----- >8 ----- 8< ----- schnapp ----- >8 -----
Index: engine.c
===================================================================
RCS file: /cvs/src/lib/libc/regex/engine.c,v
retrieving revision 1.19
diff -u -p -r1.19 engine.c
--- engine.c 28 Dec 2015 23:01:22 -0000 1.19
+++ engine.c 15 May 2016 16:47:50 -0000
@@ -522,12 +522,9 @@ backref(struct match *m, char *start, ch
return(NULL);
break;
case OBOW:
- if (( (sp == m->beginp && !(m->eflags&REG_NOTBOL)) ||
- (sp < m->endp && *(sp-1) == '\n' &&
- (m->g->cflags&REG_NEWLINE)) ||
- (sp > m->beginp &&
- !ISWORD(*(sp-1))) ) &&
- (sp < m->endp && ISWORD(*sp)) )
+ if (sp < m->endp && ISWORD(*sp) &&
+ ((sp == m->beginp && !(m->eflags&REG_NOTBOL)) ||
+ (sp > m->offp && !ISWORD(*(sp-1)))))
{ /* yes */ }
else
return(NULL);