[webkit-dev] regular expression notes (mostly about performance optimization)

Darin Adler Sat, 05 Jan 2008 18:28:49 -0800

I've been spending some time recently on our JavaScript regular expression code.

For background, our regular expression code is a much-modified fork of PCRE 6.5. When I talk about "latest" PCRE here, I'm talking about PCRE version 7.4.


Here are a few notes I thought others might find interesting:

SunSpider's regular expression test coverage is limited.
- The regexp-dna test covers only a small part of the regular expression engine.

PCRE has a first character bitmap optimization that is only done as part of "study" in the real PCRE.
- It's part of study rather than compile -- maybe because it's too costly for many expressions?
- Doing this could help eliminate the annoyingly complex code that handles cased vs. caseless firstByte and reqByte values.
- I'd like to do it for reqByte too, not just firstByte, but there may be a reason that's not a good idea.
- This would be a huge win for the regexp-dna test.
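
For illustration only, here is a minimal sketch of the first character bitmap idea, written in Python rather than the engine's own code. The pre-parsed pattern representation (a list of alternatives, each a list of ASCII character-set strings) and all function names are assumptions made for the example; nothing here is taken from PCRE.

```python
def first_char_bitmap(alternatives, caseless=False):
    """Build a 256-bit map of the bytes that can begin a match.

    `alternatives` is a hypothetical pre-parsed pattern: a list of
    non-empty alternatives, each a list of ASCII character-set strings.
    """
    bits = bytearray(32)  # 32 bytes * 8 bits = one bit per byte value
    for alt in alternatives:
        for ch in alt[0]:
            # Caseless matching just sets the bit for both cases.
            variants = {ch.lower(), ch.upper()} if caseless else {ch}
            for v in variants:
                code = ord(v)
                bits[code >> 3] |= 1 << (code & 7)
    return bits


def may_start_match(bits, ch):
    """True if `ch` is a possible first character of a match."""
    code = ord(ch)
    return code < 256 and bool(bits[code >> 3] & (1 << (code & 7)))


def candidate_starts(subject, bits):
    """Return the offsets where a full match attempt is worth making."""
    return [i for i, ch in enumerate(subject) if may_start_match(bits, ch)]


# An illustrative pattern of the /a|b/ form: /[cgt]gggtaaa|tttaccc[acg]/
dna = [["cgt", "g", "g", "g", "t", "a", "a", "a"],
       ["t", "t", "t", "a", "c", "c", "c", "acg"]]
print(candidate_starts("aacgta", first_char_bitmap(dna)))
```

In this sketch the cased and caseless paths differ only in which bits get set while building the map; the scanning loop is the same for both.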

Latest PCRE has an optimized form of bracket for cases where the compiler determines it can never be empty.
- For this form the execution engine can use tail recursion, which is faster.
- For this form the execution engine can skip the "bracket chain" work, also faster.
- Brackets that are known to never be empty don't have to increment the matchCount and check against matchLimit.
- Will be a win for the regexp-dna test.
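
As a rough sketch of the kind of compile-time check involved, the function below decides whether a bracket can ever match the empty string. It works on a hypothetical tuple-based syntax tree invented for this example; PCRE works on its own compiled form, so this shows only the reasoning, not the actual implementation.

```python
def can_match_empty(node):
    """Return True if `node` can match without consuming any characters.

    Hypothetical node shapes (assumptions for this sketch):
      ("char", "a")                 a literal character
      ("class", "abc")              a character class
      ("seq", [nodes])              items matched one after another
      ("alt", [nodes])              alternatives separated by |
      ("repeat", node, min, max)    a quantified item; max may be None
      ("group", node)               a bracket around a sub-expression
    """
    kind = node[0]
    if kind in ("char", "class"):
        return False  # always consumes exactly one character
    if kind == "seq":
        # Empty only if every item in the sequence can be empty.
        return all(can_match_empty(child) for child in node[1])
    if kind == "alt":
        # Empty if any one alternative can be empty.
        return any(can_match_empty(child) for child in node[1])
    if kind == "repeat":
        _, child, minimum, _maximum = node
        return minimum == 0 or can_match_empty(child)
    if kind == "group":
        return can_match_empty(node[1])
    raise ValueError("unknown node kind: %r" % (kind,))


# (a|b) can never be empty, so it would qualify for the optimized form.
never_empty = ("group", ("alt", [("char", "a"), ("char", "b")]))
# (a*|b) can be empty, because the first alternative allows zero a's.
maybe_empty = ("group", ("alt", [("repeat", ("char", "a"), 0, None),
                                 ("char", "b")]))
print(can_match_empty(never_empty), can_match_empty(maybe_empty))
```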

Latest PCRE handles the opcodes for capturing brackets in a cleaner way that no longer includes the bracket number in the opcode.
- Not necessarily faster, but a precondition for the optimized form of bracket.
- If we do this we can remove the code to do opcode jump table initialization in the match function.
- That might make things a bit faster.

Latest PCRE has a function that counts capturing subpatterns.
- We need to know this to make the JavaScript semantic for references vs. octal values work.
- Using a function like the PCRE one would no doubt be more efficient than calling the calculate length function twice as we currently do.
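
A minimal sketch of such a counting pass, assuming only the bracket forms JavaScript had at the time, (?: (?= and (?! being non-capturing. The function names are made up for this example, and the escape classification is deliberately simplified rather than a full statement of the JavaScript rules.

```python
def count_capturing_groups(pattern):
    """Count capturing brackets in one scan, without compiling the pattern."""
    count = 0
    in_class = False
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if in_class:
            if ch == "]":
                in_class = False
        elif ch == "[":
            in_class = True  # brackets inside a class are literals
        elif ch == "(" and pattern[i + 1:i + 2] != "?":
            count += 1  # "(" not followed by "?" opens a capturing bracket
        i += 1
    return count


def classify_decimal_escape(number, group_count):
    """Simplified rule: \\N is a back reference only if group N exists."""
    if 1 <= number <= group_count:
        return "backreference"
    return "octal-or-literal"


groups = count_capturing_groups(r"(a)(?:b)(c)\2")
print(groups, classify_decimal_escape(2, groups), classify_decimal_escape(7, groups))
```

The point of the sketch is that the count comes from a single cheap scan, so the decision about each \N escape can be made before or during the real compile pass.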

Latest PCRE uses the real compilation code to compute length rather than having a separate function.
- According to Philip Hazel this is slower than the separate length computing function was.
- But it's easier to maintain; I think the buffer overrun bugs PCRE had may have motivated this change.


Latest PCRE converts a*b into a*+b (a possessive repeat) for speed.
- Maybe we want this optimization.
- The function to do it is called check_auto_possessive.
- Will not affect the regexp-dna test.
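
For illustration, here is why making a repeat possessive (so that it never gives characters back) is safe when the character that follows differs from the repeated one. This is a toy matcher for the single pattern shape x*y, written for this note; the function names are hypothetical and it is not how PCRE implements the check.

```python
def match_star_then_char(subject, repeated, following, possessive=False):
    """Match `repeated*following` at offset 0.

    Returns (end_offset or None, tries), where `tries` counts how many
    times the matcher attempted to match `following`.
    """
    run = 0
    while run < len(subject) and subject[run] == repeated:
        run += 1
    # Greedy: try the longest run first, then give characters back one
    # at a time. Possessive: keep the longest run and never back off.
    lengths = [run] if possessive else range(run, -1, -1)
    tries = 0
    for taken in lengths:
        tries += 1
        if subject[taken:taken + 1] == following:
            return taken + 1, tries
    return None, tries


def can_auto_possessify(repeated, following):
    """Giving back a repeated character cannot help if it differs from
    the character that must come next."""
    return repeated != following


# a*b against "aaaac": same result either way, but far fewer tries.
print(match_star_then_char("aaaac", "a", "b"))
print(match_star_then_char("aaaac", "a", "b", possessive=True))
```

When the two characters are the same, as in a*a, the possessive form changes the result, which is why a check like can_auto_possessify has to come first.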

The regexp-dna expressions all could be translated from brackets into character classes.
- They have the form: /a|b/ where "a" and "b" are expressions solely containing letters and character classes of the same length.
- They could be converted into a sequence of character classes without a bracket.
- Would this be faster?
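
One possible reading of this idea is a position-by-position merge of the two alternatives, sketched below under that assumption. The pattern representation and function names are invented for the example. Under this reading the merged sequence accepts every string the alternation accepts plus some it does not, so it could only act as a quick filter ahead of the exact match.

```python
def merge_alternatives(alternatives):
    """Merge equal-length alternatives into one character class per position.

    `alternatives` is a hypothetical pre-parsed pattern: a list of
    alternatives, each a list of character-set strings.
    """
    length = len(alternatives[0])
    if any(len(alt) != length for alt in alternatives):
        raise ValueError("alternatives must have the same length")
    return ["".join(sorted(set().union(*(alt[i] for alt in alternatives))))
            for i in range(length)]


def matches_at(classes, subject, start=0):
    """True if `subject` matches the class sequence beginning at `start`."""
    if start + len(classes) > len(subject):
        return False
    return all(subject[start + i] in cls for i, cls in enumerate(classes))


# /agggtaaa|tttaccct/ written as two lists of single-character sets.
merged = merge_alternatives([list("agggtaaa"), list("tttaccct")])
print(merged)
print(matches_at(merged, "agggtaaa"), matches_at(merged, "atgacaat"))
```

Here "atgacaat" passes the merged classes even though neither alternative matches it, so any hit from the merged form would still need confirming against the original expression.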

    -- Darin

_______________________________________________
webkit-dev mailing list
webkit-dev@lists.webkit.org
http://lists.webkit.org/mailman/listinfo/webkit-dev
