I've been spending some time recently on our JavaScript regular expression code.

For background, our regular expression code is a much-modified fork of PCRE 6.5. When I talk about "latest" PCRE here, I'm talking about PCRE version 7.4.

Here are a few notes I thought others might find interesting:

SunSpider's regular expression test coverage is limited.
- The regexp-dna test covers only a small part of the regular expression engine.

PCRE has first character bitmap optimization that is only done as part of "study" in the real PCRE. - But it's part of study, not compile -- maybe because it's too costly for many expressions? - Doing this could help eliminate annoyingly complex code to handle cased vs. caseless firstByte and reqByte values. - I'd like to do it for reqByte too, not just firstByte, but there may be a reason that's not a good idea.
- This would be a huge win for the regexp-dna test.

Latest PCRE has an optimized form of bracket for cases where the compiler determines it can never be empty. - For this form the execution engine can use tail recursion, which is faster. - For this form the execution engine can skip the "bracket chain" work, also faster. - Brackets that are known to never be empty don't have to increment the matchCount and check against matchLimit.
- Will be a win for the regexp-dna test.

Latest PCRE handles the opcodes for capturing brackets in a cleaner way, eliminates including the number in the opcode. - Not necessarily faster, but a precondition for the optimized form of bracket. - If we do this we can remove the code to do opcode jump table initialization in the match function.
- That might make things a bit faster.

Latest PCRE has a function that counts capturing subpatterns.
- We need to know this to make the JavaScript semantic for references vs. octal values work. - Using a function liek the PCRE one would no doubt be more efficient than calling the calculate length function twice as we currently do.

Latest PCRE uses the real compilation code to compute length rather than having a separate function. - According to Philip Hazel this is slower than the separate length computing function was. - But it's easier to maintain; I think the buffer overrun bugs PCRE had may have motivated this change.

Latest PCRE converts a*?b into a*b for speed.
- Maybe we want this optmimization.
- The function to do it is called check_auto_possessive.
- Will not affect the regexp-dna test.

The regexp-dna expressions all could be translated from brackets into character classes. - They have the form: /a|b/ where "a" and "b" are expressions solely containing letters and character classes of the same length. - They could be converted into a sequence of character classes without a bracket.
- Would this be faster?

    -- Darin

_______________________________________________
webkit-dev mailing list
webkit-dev@lists.webkit.org
http://lists.webkit.org/mailman/listinfo/webkit-dev

Reply via email to