Hi Ivan,
On 11/23/2017 01:16 AM, Ivan Gerasimov wrote:
Hi Peter!
Thank you very much for looking into this!
On 11/22/17 1:45 AM, Peter Levart wrote:
Hi Ivan,
Here's my attempt to increase multithreaded scalability of Cache:
http://cr.openjdk.java.net/~plevart/jdk10-dev/8186628_ssl_session_cache_scalability/webrev.01/
Haven't tested this yet, but I thought that since you already have
relevant performance tests, you might want to try this, so I decided
to release it as is.
I plugged your implementation into the benchmark I had [*] and got
these numbers for the throughput:
Original MemoryCache: 4,116,537 puts/second
Your variant: 2,942,561 puts/second
So, in the tested scenario performance degraded by 28%.
Hm. What kind of benchmark did you have? Can you share the source? Was
this multi-threaded with concurrent puts, or single-threaded? I'm surprised
by the results, since I did some benchmarking myself and got entirely
different results. Here's what I did:
http://cr.openjdk.java.net/~plevart/jdk10-dev/8186628_ssl_session_cache_scalability/bench/
You can download the whole directory and just set up the JMH project over
it. The main class is CacheBench. It uses a typical caching scenario:
    @Threads(4)
    @Benchmark
    public Object original() {
        Integer key = ThreadLocalRandom.current().nextInt(keyspan);
        Object val = origCache.get(key);
        if (val == null) {
            origCache.put(key, val = Boolean.TRUE);
        }
        return val;
    }
The cache hit-ratio is varied by adjusting the cache size and/or the range
of distinct key values produced by the RNG.
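For reference, here is roughly how a key range can be derived from a desired
hit ratio; the helper name and the exact formula are mine and need not match
what CacheBench actually does:

    // Hypothetical helper (not necessarily the CacheBench code): with keys drawn
    // uniformly from [0, keyspan) and the cache bounded at `size` entries, the
    // steady-state hit ratio is roughly size / keyspan, so keyspan can be derived
    // from the desired hit ratio given in percent.
    static int keyspanFor(int size, int hitratio) {
        return (hitratio == 0)
                ? Integer.MAX_VALUE         // practically no hits
                : size * 100 / hitratio;    // e.g. size=1000, hitratio=95 -> keyspan=1052
    }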
With this setup I get the following results:
Benchmark            (hitratio)  (size)  (timeout)  Mode  Cnt      Score     Error  Units
CacheBench.original           0    1000      86400  avgt   10  18023,156 ± 434,710  ns/op
CacheBench.original           5    1000      86400  avgt   10  17645,688 ± 767,696  ns/op
CacheBench.original          50    1000      86400  avgt   10  10412,352 ± 216,208  ns/op
CacheBench.original          95    1000      86400  avgt   10   2275,257 ± 449,302  ns/op
CacheBench.original         100    1000      86400  avgt   10    542,321 ±  15,133  ns/op
CacheBench.patched1           0    1000      86400  avgt   10   1685,499 ±  24,257  ns/op
CacheBench.patched1           5    1000      86400  avgt   10   1633,203 ±  15,045  ns/op
CacheBench.patched1          50    1000      86400  avgt   10   1061,596 ±  12,312  ns/op
CacheBench.patched1          95    1000      86400  avgt   10    125,456 ±   3,426  ns/op
CacheBench.patched1         100    1000      86400  avgt   10     47,166 ±   3,140  ns/op
patched1 is mostly what I had in my proposed code, but with a fix in the
put() method, which could otherwise install an entry into the cacheMap after
some other thread had already removed it from the DL-list. The race is
prevented by 1st installing a unique reservation value into the cacheMap,
then adding the entry to the DL-list, and afterwards trying to replace the
reservation value with the entry, cleaning up if not successful.
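To make that concrete, here is a simplified sketch of the put() sequence;
the helper names (newEntry, linkIntoList, unlinkFromList, removeEntry) are
placeholders, and the real webrev code handles expiry, size bounding and
soft references on top of this:

    private final ConcurrentHashMap<K, Object> cacheMap = new ConcurrentHashMap<>();

    public void put(K key, V value) {
        Object reservation = new Object();                   // unique placeholder value
        Object displaced = cacheMap.put(key, reservation);   // 1st: reserve the slot
        if (displaced instanceof CacheEntry) {
            removeEntry((CacheEntry<K, V>) displaced);       // unlink + invalidate old entry
        }
        CacheEntry<K, V> entry = newEntry(key, value);
        linkIntoList(entry);                                 // 2nd: link into the DL-list
        if (!cacheMap.replace(key, reservation, entry)) {
            // 3rd: someone removed or replaced the reservation in the meantime,
            // so clean up - otherwise the entry would linger in the DL-list
            unlinkFromList(entry);
        }
    }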
If this is not enough, I also tried an approach with a lock-free DL-list:
basically a modified ConcurrentLinkedDeque where intermediary nodes
may be removed in constant time (important for fast removal of elements
that become softly reachable). This variant is even more scalable than the
lock-based DL-list:
Benchmark            (hitratio)  (size)  (timeout)  Mode  Cnt      Score     Error  Units
CacheBench.patched2           0    1000      86400  avgt   10    774,356 ±   4,113  ns/op
CacheBench.patched2           5    1000      86400  avgt   10    754,222 ±   2,956  ns/op
CacheBench.patched2          50    1000      86400  avgt   10    434,969 ±   2,359  ns/op
CacheBench.patched2          95    1000      86400  avgt   10     98,352 ±   0,750  ns/op
CacheBench.patched2         100    1000      86400  avgt   10     47,692 ±   0,995  ns/op
Here's what such a MemoryCache looks like in the JDK:
http://cr.openjdk.java.net/~plevart/jdk10-dev/8186628_ssl_session_cache_scalability/webrev.02/
I took the ConcurrentLinkedDeque by Doug Lea and Martin Buchholz and
removed all public methods, keeping just the delicate internal mechanics.
Node is an interface and can be implemented by different classes;
BasicNode and SoftReferenceNode are provided (and are subclassed in
MemoryCache). The public methods are tailored to the Cache use case and
deal with Node<E>s rather than E(lement)s.
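In outline, the node abstraction looks something like the following; this is
a condensed sketch (java.lang.ref imports assumed), and the actual webrev
code carries the prev/next fields and the lock-free machinery that I left
out here:

    interface Node<E> {
        E element();    // null once the element has been cleared/collected
        // prev/next accessors and CAS helpers used by the deque machinery elided
    }

    // Plain node holding a strong reference to its element.
    static class BasicNode<E> implements Node<E> {
        final E element;
        BasicNode(E element) { this.element = element; }
        public E element() { return element; }
    }

    // Node that is itself a SoftReference, so the element can be reclaimed under
    // memory pressure and the node afterwards unlinked in constant time.
    static class SoftReferenceNode<E> extends SoftReference<E> implements Node<E> {
        SoftReferenceNode(E element, ReferenceQueue<? super E> queue) {
            super(element, queue);
        }
        public E element() { return get(); }
    }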
Still, I think it's important to try to improve the performance of the
MemoryCache and, if possible, remove the points of contention.
So what do you think of the benchmark results?
Rough sketch of changes:
- replaced LinkedHashMap with ConcurrentHashMap
- CacheEntry(s) are additionally linked into a double-linked list.
DL-list management is the only synchronization point (similar to the
Cleaner API) and the rule is: 1st a new entry is linked into the DL-list,
then it is put into the map - published. The same with removing: 1st an
entry is unlinked from the DL-list, then, if successful, it is removed from
the map and invalidated.
- changed the way expiry is handled so that the whole list never needs to
be scanned.
The code speaks for itself.
Your implementation is somewhat similar to what I had tried before
coming up with the proposal of the option to turn the cache off.
I used ConcurrentHashMap + ConcurrentLinkedQueue to maintain the FIFO
order, but the performance was still a few percent less than that of
the original implementation.
Removing an intermediary element from ConcurrentLinkedQueue takes O(n)
time. This is needed in emptyQueue() for all softly-reachable-and-cleared
entries and in remove() when the application removes an entry explicitly
from the cache (for example on SSL session termination). We need a
double-linked list (like ConcurrentLinkedDeque) with constant-time removal
of intermediary node(s).
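For illustration only (hypothetical names, not the Cache code): if each
entry remembers its own list node, explicit removal becomes a constant-time
splice instead of a scan:

    // Minimal doubly-linked node; assumes sentinel head/tail nodes so prev/next
    // are never null, and that the caller holds the list lock. The webrev.02
    // variant achieves the equivalent lock-free inside the modified deque.
    final class DLNode<E> {
        E element;
        DLNode<E> prev, next;

        void unlink() {
            prev.next = next;
            next.prev = prev;
            prev = next = null;
        }
    }

    // Contrast: ConcurrentLinkedQueue.remove(entry) must traverse from the head
    // until it finds an equal element, which is O(n) per removal.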
That's really why I decided to split the issue:
- first, provide an option to turn the cache off (this should be easy to
backport and can provide immediate relief to the customers that
experience the scalability bottleneck);
- second, continue work on the cache implementation improvements.
Maybe with a scalable Cache such a switch will not be needed. I can
imagine that such a switch provides relief for the kinds of loads that may
happen in very special circumstances (like DoS attacks, for example, that
don't re-use SSL sessions), but in normal operation such a switch would
degrade performance, wouldn't it?
The problem I see is that if you provide such a switch now, users may set
it and later, when they upgrade to a newer JDK with a scalable cache, not
realize that they don't need it any more, and consequently keep paying for
it unnecessarily in the future...
Regards, Peter
Let me know if you find it useful and/or whether it solves the scalability
bottleneck.
Yes, I think it's very useful!
However, as I wrote above, I think that the issue needs to be split into
two parts: an option to turn the caching off (which can be easily
backported) and improving the cache implementation (which may even
relax the requirements, such as the FIFO order or an absolutely hard upper
bound on the cache size).
With kind regards,
Ivan