Hi all,

we have been working on the auto-possessifying optimization in PCRE for some 
time, and I would like to give you a brief summary of the results we achived. 
This optimization replaces greedy/non-greedy single character repetitions with 
their appropriate posessive form. A simple example is rewriting /a+b/ to 
/a++b/. This optimization has been part of PCRE for a very long time, but it 
only supported simple cases before. Now it can even replace \s* to \s*+ in 
/\s*(?:left|right)?hand/. Of course using \s*+ directly in the pattern would 
provide the same effect, but possessive quantifiers are among the less known 
regular expression features, and they are rarely used (this is my impression at 
least). The performance of possessive quantifiers are usually much higher than 
other quantifiers, since the backtracking phase can be totally skipped.

However, before I show some results, let me tell you a bit more about 
possessive quantifiers. Their primary purpose is to define unbreakable 
multi-byte character sequences. The improved matching speed is just a side 
effect. For example, "sch" represent a single consonant in german, splitting it 
to "sc" and "h" is meaningless. Another nice example is newlines: most of the 
time a newline can be \r, \n and \r\n. If we want
to find the "aa" and "bb" strings, which are separated by at least two 
newlines, we can use the /aa(?>\r\n|\r|\n){2,}bb/ pattern. Without the ?> 
bracket type, the pattern would happily accept aa\r\nbb as well, and that is 
incorrect.

Back to the results, then. Once I got a pattern set used by an Intrusion 
Detection System, and I use it for benchmarking and also getting ideas how 
people use regular expressions. Sometimes I browse http://regexlib.com/ as 
well. I realized that most patterns are not exactly efficient, so regex 
compiler optimizations such as auto-possessifying seems very important. The 
gain provided by this particular optimization is the following (INT: 
interpreter, JIT: PCRE-JIT compiler, s: seconds):

was: INT: 412.16 s, JIT: 86.22 s
now: INT: 182.94 s, JIT: 45.46 s
progress: INT: 125% JIT: 90%

Of course on other pattern sets the results might be totally different, but we 
hope this helps to improve the overall performance of our favourite regex 
engine.

Regards,
Zoltan


-- 
## List details at https://lists.exim.org/mailman/listinfo/pcre-dev 

Reply via email to