Chris,
On 9/28/20 02:40, Christopher Schultz wrote:
Carsten,
On 9/27/20 05:53, Carsten Klein wrote:
Any comments on that? Is it worth preparing a PR?
Regular expressions are fairly expensive.
Yes, but my measurements of the HashSet lookups were wrong: String caches
the result of hashCode(), so, while measuring in a loop, the hash value
gets computed only once. Setting up a fair test is challenging (using new
strings in the loop causes memory allocations and GCs). I ended up
additionally calling a clone of String's hashCode() function in the loop.
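To illustrate the point, the cached computation can be forced to run on every call by re-running a copy of String's well-known hash function (a sketch; the actual benchmark harness is not shown):

```java
public class UncachedHash {
    // Copy of String's hash function (h = 31*h + c), so the O(n)
    // computation runs on every call instead of being cached.
    static int uncachedHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        String origin = "https://www.example.org";
        // Produces the same value as the cached String.hashCode()
        System.out.println(uncachedHash(origin) == origin.hashCode()); // true
    }
}
```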
Now, testing an origin with HashSet.contains(origin) (the current
solution) takes about 19 ns for a cache miss (empty list in the bucket)
and 30 ns for a cache hit on my box.
In contrast, evaluating a regular expression takes about 120 ns; so these
are about 6 times slower (NOT 25 times, as stated in my last mail).
Creating a new Matcher instance each time is a significant bottleneck
(in my measurement loop): reusing the same Matcher instance and resetting
it with a new input string makes the test take only about 75 ns (only
about 4 times slower than the HashSet test).
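A minimal sketch of that Matcher reuse (the pattern and origins here are made up for illustration):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatcherReuse {
    public static void main(String[] args) {
        Pattern allowed = Pattern.compile("https?://([a-z0-9-]+\\.)*example\\.org");
        Matcher m = allowed.matcher(""); // created once...
        for (String origin : new String[] {
                "https://www.example.org", "https://evil.example.com" }) {
            // ...then reset with each input instead of allocating a new Matcher
            System.out.println(origin + " -> " + m.reset(origin).matches());
        }
    }
}
```

Note that a Matcher is not thread-safe, so this reuse only works per thread (or behind a lock) in a real filter.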
So, regular expressions are not as bad as we were thinking?
If there is a way to build the code such that some subset of wildcards
can be serviced without regex (and of course exact matches without using
regex), then I'm at least a +0 on this.
I never intended to implement tests for exact matches with regular
expressions. Configured "exact" origins (those without wildcards and not
enclosed between slashes) will still be tested with HashSet.contains, of
course. So there's no performance change for Tomcat setups that only use
exactly defined allowed origins.
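The dispatch I have in mind could look roughly like this (the names and structure are illustrative, not the actual patch):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

public class OriginMatcher {
    private final Set<String> exactOrigins = new HashSet<>();
    private final List<Pattern> patternOrigins;

    OriginMatcher(List<String> exact, List<Pattern> patterns) {
        exactOrigins.addAll(exact);
        patternOrigins = patterns;
    }

    boolean isAllowed(String origin) {
        // Fast path: unchanged behavior for exactly defined origins
        if (exactOrigins.contains(origin)) {
            return true;
        }
        // Slow path: only relevant when inexact specifiers are configured
        for (Pattern p : patternOrigins) {
            if (p.matcher(origin).matches()) {
                return true;
            }
        }
        return false;
    }
}
```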
If someone uses the new "inexact" origin specifiers, won't it be
understandable that these are more expensive from a performance point
of view?
It may seem like over-engineering, but maybe creating a Matcher
interface and a Factory (I know, I know) which produces Matchers which
implement the optimal strategy for a particular value would be a good
way to do this.
A single matcher could be used for really simple values and maybe a
MultipleMatcher could be used for multiple different value-checks.
Something like that will likely lead to the highest-performing filter
processing.
I did some tests on that. Since I don't believe that a (self-made)
NFA-based solution could outperform Java's regular expression engine, I
was looking for a different (simpler) algorithm.
I started with a rule-driven globbing algorithm that supports
? (any char except "."),
# (any digit, using Character.isDigit)
$ (any letter, using Character.isAlphabetic)
as well as literal character sequences.
(Actually, I followed your suggestion. Your Matchers are my Rules
together with a piece of code in a switch-case block. Using real
classes/instances for the matchers requires method invocations and maybe
instanceof tests. Both add extra overhead, so I decided on a more C-like
approach.)
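A stripped-down sketch of that switch-based approach for the single-character wildcards (no * or ** yet; the real Rules precompile the pattern, which is omitted here):

```java
public class SimpleGlob {
    // One-to-one match of pattern against input:
    //   ?  any char except '.'
    //   #  any digit (Character.isDigit)
    //   $  any letter (Character.isAlphabetic)
    // everything else matches literally.
    static boolean matches(String pattern, String input) {
        if (pattern.length() != input.length()) {
            return false; // single-char wildcards cannot change the length
        }
        for (int i = 0; i < pattern.length(); i++) {
            char p = pattern.charAt(i);
            char c = input.charAt(i);
            switch (p) {
                case '?': if (c == '.') return false; break;
                case '#': if (!Character.isDigit(c)) return false; break;
                case '$': if (!Character.isAlphabetic(c)) return false; break;
                default:  if (p != c) return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(matches("http://host#", "http://host1")); // true
        System.out.println(matches("http://host?", "http://host.")); // false
    }
}
```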
That simple algorithm takes about 42 ns and so is still 2 times slower
than the HashSet test. I'm already more than halfway toward supporting
the * and ** multi-matching wildcards. That implementation uses
non-recursive backtracking, similar to the algorithms described at
https://www.codeproject.com/Articles/5163931/Fast-String-Matching-with-Wildcards-Globs-and-Giti
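For the * case, the non-recursive backtracking boils down to something like the following sketch (simplified to a single * meaning "any run of characters"; the ** distinction and the # and $ classes are left out):

```java
public class BacktrackGlob {
    // Iterative wildcard match with backtracking: remember the position of
    // the last '*' and, on a mismatch, let it swallow one more character.
    static boolean glob(String p, String s) {
        int pi = 0, si = 0, star = -1, mark = 0;
        while (si < s.length()) {
            if (pi < p.length()
                    && (p.charAt(pi) == s.charAt(si)
                        || (p.charAt(pi) == '?' && s.charAt(si) != '.'))) {
                pi++; si++;                 // literal or '?' match
            } else if (pi < p.length() && p.charAt(pi) == '*') {
                star = pi++; mark = si;     // note '*', first try matching nothing
            } else if (star >= 0) {
                pi = star + 1; si = ++mark; // backtrack: '*' eats one more char
            } else {
                return false;
            }
        }
        while (pi < p.length() && p.charAt(pi) == '*') {
            pi++;                           // trailing '*' may match nothing
        }
        return pi == p.length();
    }

    public static void main(String[] args) {
        System.out.println(glob("http://*.example.org", "http://www.example.org")); // true
        System.out.println(glob("http://*.example.org", "http://example.com"));     // false
    }
}
```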
With * and ** partly in place, time consumption is about 50 ns. The code
additions needed to make the algorithm handle the many edge cases will
very likely add more nanoseconds, so we may soon end up at 60 ns or even
more. That's almost the same time required to evaluate a regular
expression (excluding the time needed to create the Matcher instance).
The algorithm is optimized, using only a few method calls and no OOP
constructs (since the Rules are like beans that specify what to match
next, the whole logic can be implemented in a single method). But it's
still not much faster than a Java regular expression test.
I don't believe it's worth self-implementing such a rather complex
(error-prone) and hard-to-understand (and maintain) algorithm if it's
not significantly faster than real Java regular expressions.
Anyhow, if performance is not to degrade when wildcards are used in the
allowed origins, the goal is not just to be better than Java regular
expressions, but to be close to the HashSet test (~19 ns). And if real
regular expressions are to be supported as well (enclosed between
slashes), all of that won't help much for those.
That's why I wanted to combine this with a HashMap-based (LRU) cache, so
that regular expressions only need to be evaluated if the current
request's origin is not yet in the cache. This cache's performance nearly
equals that of the HashSet test (depending on whether real LRU is used or
not). That way, the time required for testing a regular expression is no
longer significant in the average case.
Finally, we should not only compare the performance of the different
matching algorithms. The absolute time values should also be compared to
the request's overall processing time. We are talking about time
intervals between ~20 and ~120 nano(!)seconds. I was using Tomcat's
access log with the %D placeholder, which reports the request's overall
processing time (before Tomcat 10, in milliseconds as an integer). On my
box, a request to a simple servlet echoing some request properties (no
database or filesystem access) takes between 0 and 2 milliseconds.
Using the average of 1 millisecond (that is 1,000,000 nanoseconds),
every regular expression test adds ~0.012% of extra time to the request.
In other words, we could easily test about 83 regular expressions per
request and add only ~1% extra processing time (few people will
configure that many different allowed origin patterns).
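The arithmetic behind those figures, spelled out (the 120 ns and 1 ms inputs are the measurements from above):

```java
public class OverheadEstimate {
    public static void main(String[] args) {
        double regexNs = 120.0;           // one regular expression test
        double requestNs = 1_000_000.0;   // ~1 ms average request time
        double extraPct = regexNs / requestNs * 100.0;
        int regexesPerOnePercent = (int) (1.0 / extraPct);
        System.out.println(extraPct);             // ~0.012
        System.out.println(regexesPerOnePercent); // 83
    }
}
```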
So, I still believe that adding support for specifying the CORS filter's
allowed origins with wildcards (globbing) or real regular expressions
does not (and should not) depend on a new pattern-matching algorithm.
Java's regular expression engine is, in fact, not that slow (and the new
algorithm is probably not significantly faster).
Compared with the request's overall processing time, even with Java's
slow but proven regular expression engine, there's enough headroom to
test some dozens of regular expressions without degrading performance by
more than one percent.
With a cache, the number of regular expression tests can even be
minimized, so that the allowed state of the presented origin can be
determined by only two HashMap lookups for most requests.
Carsten
---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org