Updates:
Status: Duplicate
Mergedinto: 287
Comment #1 on issue 1915 by [email protected]: v8 locks up on regex matching
http://code.google.com/p/v8/issues/detail?id=1915
The regexp iin question is:
/^ *<\w+(?!:\/|@)\b(?:"[^"]*"|'[^']*'|[^>])*?> *(?:\n{2,}|\s*$)/
Notice that it's anchored, so it *should* only match the first tag of a
string, and only if it's either alone in the string, or followed by two or
more newlines. That doesn't happen here.
The problem, which they have fixed now, is that it has more than one way to
match inside a *-quantifier. It's an instance of catastrophic backtracking.
That usually happens with nested quantifiers, but can also happen with
simply a non-exclusive choice inside a quantifier.
If the (?:"[^"]*"|'[^']*'|[^>]) part reaches a string, it first matches the
string, and then if it following fails, it will let the [^>] match the
string quote too. That allows it to continue scanning inside the string,
and start a *new* string match at the end quote. Thus out of sync, the
quantifier will likely match the entire rest of input, since the '<'
and '>' are now between the end quote of one string and the start quote of
the next.
It'll eventually fail, reaching the end of input without seeing the
expected '>', but that takes some time. Then it continues with the NEXT
single character and does it all again. And that can happen for, worst
case, each string in the source, giving an execution time that is
exponential in the number of strings. It won't accidentally match the
final '>' correctly, because that requires an unbalanced string quoting
(and the input is valid HTML).
That other RegExp engines reports failure sooner isn't because they are
smarter, but because they bail out of the regexp engine after a while and
reports non-match, even if the regexp would have matched. You can see that
by adding a "|" to the regexp, i.e.,
var regex = /^ *<\w+(?!:\/|@)\b(?:"[^"]*"|'[^']*'|[^>])*?> *(?:\n{2,}|
\s*$)|/;
which will first try the current regexp, and if that fails, match the empty
string. That regexp should *always* match, if nothing else the empty string.
If you run it on the same string in, e.g., Firefox, it quickly returns
null, which cannot possibly be the *correct* result.
V8 instead tries to complete the computation, not knowing that it is really
going to take forever, and instead relies on the embedder to interrupt it,
just as any other long-running script. It's no different from
executing "while(true){};".
The original problem was fixed by changing the regexp to:
var regex = /^ *<\w+(?!:\/|@)\b(?:"[^"]*"|'[^']*'|[^'">])*> *(?:\n{2,}|
\s*$)/;
This makes the different alternatives inside the quantifier disjoint. One
should always try to do that.
I even recommend:
var regex = /^ *<\w+\b(?!:[/@])(?:"[^"]*"|'[^']*'|[^'">])*?> *(?:\n{2,}|
\s*$)/;
--
v8-dev mailing list
[email protected]
http://groups.google.com/group/v8-dev