On Mon, Apr 4, 2016 at 1:12 PM, Geoffrey Simmons <[email protected]> wrote: >> On Apr 4, 2016, at 9:21 PM, Devon H. O'Dell <[email protected]> wrote: >> >> ## PCRE >> >> There are other things we've done (like optimizing regexes that are >> obviously prefix and suffix matches -- turns out lots of people write >> things like `if (req.http.x-foo ~ "^bar.*$")` that are effectively `if >> (strncmp(req.http.x-foo, "bar" 3))` because it's easy), but I don't >> see those as being as high priority for upstream; they're largely >> issues for our multi-tenant use case. We have done this already; >> another thing we would like to do is to check regexes for things like >> backtracking and use DFA-based matching where possible. In the flame >> graph screenshot, the obvious VRT functions are PCRE. > > You might be interested in this, although it's new as can be (just today > tagged as v0.1) -- a VMOD to access Google's RE2 regular expression lib: > > https://code.uplex.de/uplex-varnish/libvmod-re2 > > For those not familiar with RE2: it limits the syntax so that patterns are > regular languages in the strictly formal sense. Most notably, backrefs within > a pattern are not allowed. That means that the matcher can run as DFAs/NFAs, > there is never any backtracking, and the time requirement for matches is > always linear in the length of the string to be matched.
Thanks for pointing this out! We have also considered RE2, but one of the problems is determining whether a particular regular expression is actually regular. One way would be to first try to compile with RE2, then fall back to PCRE if that didn't work. But there are also other problems -- we expose regular expression groups into VCL through `re.group.N` where `N` is some number in [0, 9] (we do not currently support named captures, and I don't have any plans to implement that) -- and providing cross-API consistency there is also problematic. PCRE does allow one to execute regexps in a DFA context using `pcre_dfa_exec`, but this also changes the semantics of matching as well due to what PCRE calls "auto-possessification". So we've continued to punt on PCRE improvements in our Varnish until we have time to think about all the semantic changes. > So far this is just a proof of concept, and I haven't done any performance > testing. From the documentation, I suspect that there are certain kinds of > use cases for Varnish where RE2 would perform better than PCRE, and many > cases where it doesn't make much difference (such as the prefix or suffix > matches you mentioned). But that's all speculation until it's been tested. The prefix and suffix cases are really interesting. Several people here were surprised at the limitations of regex optimizations performed by the engines. There are some optimizations that can be performed, but we don't currently run pcre_study either, and I'm not sure the range of things it actually improves. Most of the performance problems I see with regular expressions are more to do with the regexes themselves either being poorly written, or just the wrong tool for the job. Anyway, I believe that RE2 will outperform PCRE for a great many things; I haven't benchmarked it against pcre_dfa_exec (though I've seen some benchmarks in the context of pcre-jit, which is another thing I haven't tried at all), but my primary concern about just adopting RE2 is about the classes of regexes that simply don't work in RE2 context. We do have a non-trivial number of customers who rely on being able to use effectively non-regular expressions. --dho > > Best, > Geoff > > _______________________________________________ varnish-dev mailing list [email protected] https://www.varnish-cache.org/lists/mailman/listinfo/varnish-dev
