Chris,
On 9/28/20 02:40, Christopher Schultz wrote:
Carsten,
On 9/27/20 05:53, Carsten Klein wrote:
Any comments on that? Is it worth preparing a PR?
Regular expressions are fairly expensive.
Yes, but my measurements of the HashSet lookups were wrong: String caches
the result of hashCode(), so, while measuring in a loop, the hash value
gets computed only once. Setting up a fair test is challenging (using new
strings in the loop causes memory allocations and GCs). I ended up
additionally calling a clone of String's hashCode() function in the loop.
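To illustrate the point, the cached computation can be forced to run on every call by re-running a copy of String's well-known hash function (a sketch; the actual benchmark harness is not shown):

```java
public class UncachedHash {
    // Copy of String's hash function (h = 31*h + c), so the O(n)
    // computation runs on every call instead of being cached.
    static int uncachedHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        String origin = "https://www.example.org";
        // Produces the same value as the cached String.hashCode()
        System.out.println(uncachedHash(origin) == origin.hashCode()); // true
    }
}
```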
Now, testing an origin with HashSet.contains(origin) (the current
solution) takes about 19 ns for a cache miss (empty list in the bucket)
and 30 ns for a cache hit on my box.
In contrast, evaluating a regular expression takes about 120 ns; so these
are about 6 times slower (NOT 25 times, as stated in my last mail).
Creating a new Matcher instance each time is a significant bottleneck
(in my measurement loop): reusing the same Matcher instance and resetting
it with a new input string makes the test take only about 75 ns (only
about 4 times slower than the HashSet test).
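A minimal sketch of that Matcher reuse (the pattern and origins here are made up for illustration):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatcherReuse {
    public static void main(String[] args) {
        Pattern allowed = Pattern.compile("https?://([a-z0-9-]+\\.)*example\\.org");
        Matcher m = allowed.matcher(""); // created once...
        for (String origin : new String[] {
                "https://www.example.org", "https://evil.example.com" }) {
            // ...then reset with each input instead of allocating a new Matcher
            System.out.println(origin + " -> " + m.reset(origin).matches());
        }
    }
}
```

Note that a Matcher is not thread-safe, so this reuse only works per thread (or behind a lock) in a real filter.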
So, regular expressions are not as bad as we were thinking?
If there is a way to build the code such that some subset of wildcards
can be serviced without regex (and of course exact matches without using
regex), then I'm at least a +0 on this.
I never intended to implement tests for exact matches with regular
expressions. Configured "exact" origins (those without wildcards and not
enclosed between slashes) will still be tested with HashSet.contains, of
course. So there's no performance change for Tomcat setups that only use
exactly defined allowed origins.
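The dispatch I have in mind could look roughly like this (the names and structure are illustrative, not the actual patch):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

public class OriginMatcher {
    private final Set<String> exactOrigins = new HashSet<>();
    private final List<Pattern> patternOrigins;

    OriginMatcher(List<String> exact, List<Pattern> patterns) {
        exactOrigins.addAll(exact);
        patternOrigins = patterns;
    }

    boolean isAllowed(String origin) {
        // Fast path: unchanged behavior for exactly defined origins
        if (exactOrigins.contains(origin)) {
            return true;
        }
        // Slow path: only relevant when inexact specifiers are configured
        for (Pattern p : patternOrigins) {
            if (p.matcher(origin).matches()) {
                return true;
            }
        }
        return false;
    }
}
```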
If someone uses the new "inexact" origin specifiers, won't it be
understandable that these are more expensive from a performance point
of view?
It may seem like over-engineering, but maybe creating a Matcher
interface and a Factory (I know, I know) which produces Matchers which
implement the optimal strategy for a particular value would be a good
way to do this.
A single matcher could be used for really simple values and maybe a
MultipleMatcher could be used for multiple different value-checks.
Something like that will likely lead to the highest-performing filter
processing.
I did some tests on that. Since I don't believe that a (self-made)
NFA-based solution could outperform Java's regular expression engine, I
was looking for a different (simpler) algorithm.
I started with a rule-driven globbing algorithm that supports
? (any char except "."),
# (any digit, using Character.isDigit)
$ (any letter, using Character.isAlphabetic)
as well as literal character sequences.
(Actually, I followed your suggestion. Your Matchers are my Rules
together with a piece of code in a switch-case block. Using real
classes/instances for the matchers requires method invocations and maybe
instanceof tests. Both add extra overhead, so I decided on a more C-like
approach.)
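A stripped-down sketch of that switch-based approach for the single-character wildcards (no * or ** yet; the real Rules precompile the pattern, which is omitted here):

```java
public class SimpleGlob {
    // One-to-one match of pattern against input:
    //   ?  any char except '.'
    //   #  any digit (Character.isDigit)
    //   $  any letter (Character.isAlphabetic)
    // everything else matches literally.
    static boolean matches(String pattern, String input) {
        if (pattern.length() != input.length()) {
            return false; // single-char wildcards cannot change the length
        }
        for (int i = 0; i < pattern.length(); i++) {
            char p = pattern.charAt(i);
            char c = input.charAt(i);
            switch (p) {
                case '?': if (c == '.') return false; break;
                case '#': if (!Character.isDigit(c)) return false; break;
                case '$': if (!Character.isAlphabetic(c)) return false; break;
                default:  if (p != c) return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(matches("http://host#", "http://host1")); // true
        System.out.println(matches("http://host?", "http://host.")); // false
    }
}
```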
That simple algorithm takes about 42 ns and so is still 2 times slower
than the HashSet test. I'm already more than halfway toward supporting
the * and ** multi-matching wildcards. That implementation uses
non-recursive backtracking, similar to the algorithms described at
https://www.codeproject.com/Articles/5163931/Fast-String-Matching-with-Wildcards-Globs-and-Giti
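For the * case, the non-recursive backtracking boils down to something like the following sketch (simplified to a single * meaning "any run of characters"; the ** distinction and the # and $ classes are left out):

```java
public class BacktrackGlob {
    // Iterative wildcard match with backtracking: remember the position of
    // the last '*' and, on a mismatch, let it swallow one more character.
    static boolean glob(String p, String s) {
        int pi = 0, si = 0, star = -1, mark = 0;
        while (si < s.length()) {
            if (pi < p.length()
                    && (p.charAt(pi) == s.charAt(si)
                        || (p.charAt(pi) == '?' && s.charAt(si) != '.'))) {
                pi++; si++;                 // literal or '?' match
            } else if (pi < p.length() && p.charAt(pi) == '*') {
                star = pi++; mark = si;     // note '*', first try matching nothing
            } else if (star >= 0) {
                pi = star + 1; si = ++mark; // backtrack: '*' eats one more char
            } else {
                return false;
            }
        }
        while (pi < p.length() && p.charAt(pi) == '*') {
            pi++;                           // trailing '*' may match nothing
        }
        return pi == p.length();
    }

    public static void main(String[] args) {
        System.out.println(glob("http://*.example.org", "http://www.example.org")); // true
        System.out.println(glob("http://*.example.org", "http://example.com"));     // false
    }
}
```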
With * and ** partly in place, time consumption is about 50 ns. The code
additions needed to make the algorithm handle the many edge cases will
very likely add more nanoseconds, so we may soon end up at 60 ns or even
more. That's almost the same time required to evaluate a regular
expression (excluding the time needed to create the Matcher instance).
The algorithm is optimized, using only a few method calls and no OOP
constructs (since the Rules are like beans that specify what to match
next, the whole logic can be implemented in a single method). But it's
still not much faster than a Java regular expression test.
I don't believe it's worth self-implementing such a rather complex
(error-prone) and hard-to-understand (and maintain) algorithm if it's
not significantly faster than real Java regular expressions.
Anyhow, if performance is not to degrade when wildcards are used in the
allowed origins, the goal is not just to be better than Java regular
expressions, but to be close to the HashSet test (~19 ns). And if real
regular expressions are to be supported as well (enclosed between
slashes), all of that won't help much for those.
That's why I wanted to combine this with a HashMap-based (LRU) cache, so
that regular expressions only need to be evaluated if the current
request's origin is not yet in the cache. This cache's performance nearly
equals that of the HashSet test (depending on whether real LRU is used or
not). That way, the time required for testing a regular expression is no
longer significant in the average case.
Finally, we should not only compare the performance of the different
matching algorithms. The absolute time values should also be compared to
the request's overall processing time. We are talking about time
intervals between ~20 and ~120 nano(!)seconds. I was using Tomcat's
access log with the %D placeholder, which reports the request's overall
processing time (before Tomcat 10, in milliseconds as an integer). On my
box, a request to a simple servlet echoing some request properties (no
database or filesystem access) takes between 0 and 2 milliseconds.
Using the average of 1 millisecond (that is 1,000,000 nanoseconds),
every regular expression test adds ~0.012% of extra time to the request.
In other words, we could easily test about 83 regular expressions per
request and add only ~1% extra processing time (few people will
configure that many different allowed origin patterns).
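The arithmetic behind those figures, spelled out (the 120 ns and 1 ms inputs are the measurements from above):

```java
public class OverheadEstimate {
    public static void main(String[] args) {
        double regexNs = 120.0;           // one regular expression test
        double requestNs = 1_000_000.0;   // ~1 ms average request time
        double extraPct = regexNs / requestNs * 100.0;
        int regexesPerOnePercent = (int) (1.0 / extraPct);
        System.out.println(extraPct);             // ~0.012
        System.out.println(regexesPerOnePercent); // 83
    }
}
```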
So, I still believe that adding support for specifying the CORS filter's
allowed origins with wildcards (globbing) or real regular expressions
does not (and should not) depend on a new pattern-matching algorithm.
Java's regular expression engine is, in fact, not that slow (and the new
algorithm is probably not significantly faster).
Compared with the request's overall processing time, even with Java's
slow but proven regular expression engine, there's enough headroom to
test some dozens of regular expressions without degrading performance by
more than one percent.
With a cache, the number of regular expression tests can even be
minimized, so that the allowed state of the presented origin can be
determined by only two HashMap lookups for most requests.
Carsten
---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org