Updates:
        Status: Duplicate
        Mergedinto: 287

Comment #1 on issue 1915 by [email protected]: v8 locks up on regex matching
http://code.google.com/p/v8/issues/detail?id=1915

The regexp iin question is:
 /^ *<\w+(?!:\/|@)\b(?:"[^"]*"|'[^']*'|[^>])*?> *(?:\n{2,}|\s*$)/

Notice that it's anchored, so it *should* only match the first tag of a string, and only if it's either alone in the string, or followed by two or more newlines. That doesn't happen here.

The problem, which they have fixed now, is that it has more than one way to match inside a *-quantifier. It's an instance of catastrophic backtracking. That usually happens with nested quantifiers, but can also happen with simply a non-exclusive choice inside a quantifier.

If the (?:"[^"]*"|'[^']*'|[^>]) part reaches a string, it first matches the string, and then if it following fails, it will let the [^>] match the string quote too. That allows it to continue scanning inside the string, and start a *new* string match at the end quote. Thus out of sync, the quantifier will likely match the entire rest of input, since the '<' and '>' are now between the end quote of one string and the start quote of the next. It'll eventually fail, reaching the end of input without seeing the expected '>', but that takes some time. Then it continues with the NEXT single character and does it all again. And that can happen for, worst case, each string in the source, giving an execution time that is exponential in the number of strings. It won't accidentally match the final '>' correctly, because that requires an unbalanced string quoting (and the input is valid HTML).

That other RegExp engines reports failure sooner isn't because they are smarter, but because they bail out of the regexp engine after a while and reports non-match, even if the regexp would have matched. You can see that by adding a "|" to the regexp, i.e., var regex = /^ *<\w+(?!:\/|@)\b(?:"[^"]*"|'[^']*'|[^>])*?> *(?:\n{2,}| \s*$)|/; which will first try the current regexp, and if that fails, match the empty string. That regexp should *always* match, if nothing else the empty string. If you run it on the same string in, e.g., Firefox, it quickly returns null, which cannot possibly be the *correct* result. V8 instead tries to complete the computation, not knowing that it is really going to take forever, and instead relies on the embedder to interrupt it, just as any other long-running script. It's no different from executing "while(true){};".

The original problem was fixed by changing the regexp to:
var regex = /^ *<\w+(?!:\/|@)\b(?:"[^"]*"|'[^']*'|[^'">])*> *(?:\n{2,}| \s*$)/; This makes the different alternatives inside the quantifier disjoint. One should always try to do that.

I even recommend:
var regex = /^ *<\w+\b(?!:[/@])(?:"[^"]*"|'[^']*'|[^'">])*?> *(?:\n{2,}| \s*$)/;


--
v8-dev mailing list
[email protected]
http://groups.google.com/group/v8-dev

Reply via email to