Re: RFR JDK-8187653: Lock in CoderResult.Cache becomes performance bottleneck
One less (benign) race - and possibly more efficient, too :-) If we really worry about the startup costs here, we could make it so that the three Cache classes themselves aren't loaded until someone actually has a need for them: http://cr.openjdk.java.net/~redestad/scratch/coderresult_cache.00/ /Claes On 2018-03-02 21:02, Xueming Shen wrote: To follow Claes's suggestion to make the CoderResult.Cache.cache field final and allocate early. issue: https://bugs.openjdk.java.net/browse/JDK-8198966 webrev: http://cr.openjdk.java.net/~sherman/8198966/webrev/ Thanks, -Sherman
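The deferred class loading Claes suggests can be done with the classic holder-class idiom. A minimal runnable sketch (class and field names here are illustrative, not taken from the linked webrev): the JVM does not initialize a nested class until its first active use, so the cached instances cost nothing when the outer class loads.

```java
// LazyCacheDemo: Holder (and the array it builds) is initialized only on
// the first call to malformedForLength(), not when LazyCacheDemo loads.
public class LazyCacheDemo {
    static final StringBuilder log = new StringBuilder();

    private static final class Holder {
        static { log.append("holder-initialized;"); }  // runs lazily, once
        static final String[] COMMON = { "CR[1]", "CR[2]", "CR[3]", "CR[4]" };
    }

    public static String malformedForLength(int len) {
        return Holder.COMMON[len - 1];  // first access triggers Holder init
    }

    public static void main(String[] args) {
        log.append("main-entered;");
        System.out.println(malformedForLength(2));  // prints CR[2]
        System.out.println(log);  // prints main-entered;holder-initialized;
    }
}
```

The `log` output demonstrates the laziness: the holder's static initializer runs after `main` has already started, at the first lookup.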
Re: RFR JDK-8187653: Lock in CoderResult.Cache becomes performance bottleneck
Following Claes's suggestion, this makes the CoderResult.Cache.cache field final and allocates it early. issue: https://bugs.openjdk.java.net/browse/JDK-8198966 webrev: http://cr.openjdk.java.net/~sherman/8198966/webrev/ Thanks, -Sherman
Re: RFR JDK-8187653: Lock in CoderResult.Cache becomes performance bottleneck
Hi Sherman, On 03/02/18 03:20, David Holmes wrote: Also this: 195 private Map<Integer,WeakReference<CoderResult>> cache = null; should now be volatile. Either that, or you should load the 'cache' field only once per method call into a local variable, unless you want reorderings of reads and writes observed from concurrent threads to result in NPEs. If you do replace it with a volatile, you should also load the field into a local variable just once (although that is not strictly necessary for correctness). Also: 193 private abstract static class Cache { 197 *protected* abstract CoderResult create(int len); the abstract method is protected, but the implementations: 217 = new Cache() { 218 *public* CoderResult create(int len) { 219 return new CoderResult(CR_MALFORMED, len); 220 }}; ...are public. Since you are dealing with a private nested class, all create() methods could be package-private. Fewer words to write and read... Regards, Peter
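The pattern Peter describes can be sketched as follows (class and method names are illustrative, not the actual webrev code). The key point is that an unsynchronized field read is racy, so the field must be read exactly once per call; reading it twice lets a concurrent writer slip in between the null check and the use.

```java
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class RacyCacheSketch {
    // volatile, per Peter's first option; without it the unsynchronized
    // reads below would be an unordered data race
    private volatile Map<Integer, WeakReference<String>> cache;

    String lookup(int len) {
        // BROKEN pattern for contrast:
        //   if (cache != null) return cache.get(len);
        // reads the field twice; the second read could observe null -> NPE.
        Map<Integer, WeakReference<String>> m = cache;  // read the field once
        if (m == null)
            return null;
        WeakReference<String> ref = m.get(len);  // safe: m cannot change
        return (ref != null) ? ref.get() : null;
    }

    void store(int len, String value) {
        Map<Integer, WeakReference<String>> m = cache;
        if (m == null)
            cache = m = new ConcurrentHashMap<>();  // benign race: last map wins
        m.put(len, new WeakReference<>(value));
    }
}
```

Loading the field into a local also helps when the field *is* volatile, as Peter notes: it avoids repeated volatile reads and guarantees all accesses in one call see the same map instance.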
Re: RFR JDK-8187653: Lock in CoderResult.Cache becomes performance bottleneck
On 3/1/18, 4:35 PM, David Holmes wrote: When you replace synchronized code with concurrent data structures you introduce race conditions that are precluded in the synchronized code. These need to be examined carefully to ensure they are safe. For example, whenever you replace a HashMap with a ConcurrentHashMap you need to see if put() needs to be replaced by putIfAbsent(). Hi David, The assumption here is that putIfAbsent() does not help/save anything, as the value object would have been created already by the time it reaches here. And it appears there is no need for the check to be atomic; replacing any existing key/value pair is fine in this use scenario. I was thinking about computeIfAbsent(), but concluded it's just a little overdone (I wonder why we decided to use the WeakReference in this case; it probably is not worth it, but I'll keep it as is out of "compatibility" concern). -Sherman
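Sherman's argument can be sketched like this (illustrative names, not the webrev code): because the cached value is a pure function of the key, two racing threads insert interchangeable objects, so a plain put() that overwrites an existing entry is harmless.

```java
import java.util.concurrent.ConcurrentHashMap;

class PutRaceSketch {
    static final ConcurrentHashMap<Integer, String> CACHE = new ConcurrentHashMap<>();

    static String resultForLength(int len) {
        String r = CACHE.get(len);
        if (r == null) {
            r = "MALFORMED[" + len + "]";  // value depends only on the key
            CACHE.put(len, r);             // racing overwrite replaces an equal value
            // If a single canonical instance were required, the atomic form is:
            //   String prev = CACHE.putIfAbsent(len, r);
            //   if (prev != null) r = prev;
        }
        return r;
    }
}
```

The only observable difference from putIfAbsent() is that, during a race, two callers may briefly hold distinct but equal instances, which does not matter for CoderResult's use.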
Re: RFR JDK-8187653: Lock in CoderResult.Cache becomes performance bottleneck
Hi Sherman, On 24/02/2018 11:26 AM, Xueming Shen wrote: Hi, Please help review the proposed change to remove the potential performance bottleneck in CoderResult caused by the "over"-synchronization of the cache mechanism. issue: https://bugs.openjdk.java.net/browse/JDK-8187653 webrev: http://cr.openjdk.java.net/~sherman/8187653/webrev Notes: While the original test case (new String/String.getBytes() with UTF8, as shown in the stack trace) described in the bug report might no longer be reproducible in jdk9, as we have been optimizing many String-related char[]/byte[] coding paths away from the paths that use CoderResult, it is still a concern/issue for the "general" CharsetDe/Encoder.de/encode() operations, in which a malformed or unmappable CoderResult object is returned. As shown in the "[synchronized]" section of bm.scores [1], produced by the simple JMH benchmark CoderResultBM.java [2], the scores get significantly worse as the number of concurrent de/encoding threads grows. It appears the easy fix of replacing the sync mechanism from "method synchronized + HashMap" to "ConcurrentHashMap" solves the problem, as shown in the [ConcurrentHashMap] section of the same bm result [1], because most of the accesses to the caching HashMap are read operations. ConcurrentHashMap's almost non-blocking retrieval operations really help here. When you replace synchronized code with concurrent data structures you introduce race conditions that are precluded in the synchronized code. These need to be examined carefully to ensure they are safe. For example, whenever you replace a HashMap with a ConcurrentHashMap you need to see if put() needs to be replaced by putIfAbsent(). David - There is another fact that might help us optimize further here. For most of our charsets in the JDK repository (with a couple of exceptions), the length of the malformed or unmappable CoderResult that these charsets might trigger is actually never longer than 4.
So we might not need to access the ConcurrentHashMap cache at all in the normal use scenario. I'm putting a CoderResult[4] on top of the hashmap cache in the proposed webrev. It does not improve the performance significantly, but when the thread count gets bigger, a 10%+ improvement is observed. So I would assume it is something worth doing, given its cost is really low. Thanks, Sherman [1] http://cr.openjdk.java.net/~sherman/8187653/bm.scores [2] http://cr.openjdk.java.net/~sherman/8187653/CoderResultBM.java [3] http://cr.openjdk.java.net/~sherman/8187653/bm.scores.ms932
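The two-level scheme described above can be sketched as follows (illustrative names and placeholder values, not the webrev code): the common lengths 1..4 are served from a plain final array with no map access at all, and only the rare longer lengths fall back to the ConcurrentHashMap.

```java
import java.util.concurrent.ConcurrentHashMap;

class TwoLevelCacheSketch {
    // fast path: final array covering the lengths almost all charsets produce
    private static final String[] COMMON = { "CR[1]", "CR[2]", "CR[3]", "CR[4]" };
    // slow path: shared map for the rare longer lengths
    private static final ConcurrentHashMap<Integer, String> OVERFLOW =
        new ConcurrentHashMap<>();

    static String forLength(int len) {
        if (len >= 1 && len <= 4)
            return COMMON[len - 1];    // lock-free, allocation-free fast path
        String r = OVERFLOW.get(len);  // mostly-read path, rarely contended
        if (r == null) {
            r = "CR[" + len + "]";     // value is a pure function of the key,
            OVERFLOW.put(len, r);      // so a racing overwrite is benign
        }
        return r;
    }
}
```

Since the array is final and populated once, the fast path involves no synchronization or CAS at all, which is consistent with the extra 10%+ gain reported at higher thread counts.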
Re: RFR JDK-8187653: Lock in CoderResult.Cache becomes performance bottleneck
On 01/03/2018 20:10, Xueming Shen wrote: Right, they all should be final. webrev has been updated accordingly. Thanks, it looks okay now.
Re: RFR JDK-8187653: Lock in CoderResult.Cache becomes performance bottleneck
On 03/01/2018 12:00 PM, Alan Bateman wrote: On 01/03/2018 19:42, Roger Riggs wrote: Hi Sherman, Looks to be a good fix with predictable behavior and performance. +1, Roger I assume malformed4 and unmappable4 should be final, the XXXCache fields too. -Alan Right, they all should be final. webrev has been updated accordingly. -sherman
Re: RFR JDK-8187653: Lock in CoderResult.Cache becomes performance bottleneck
On 01/03/2018 19:42, Roger Riggs wrote: Hi Sherman, Looks to be a good fix with predictable behavior and performance. +1, Roger I assume malformed4 and unmappable4 should be final, the XXXCache fields too. -Alan
Re: RFR JDK-8187653: Lock in CoderResult.Cache becomes performance bottleneck
Hi Sherman, Looks to be a good fix with predictable behavior and performance. +1, Roger On 2/23/2018 8:26 PM, Xueming Shen wrote: Hi, Please help review the proposed change to remove the potential performance bottleneck in CoderResult caused by the "over"-synchronization of the cache mechanism. issue: https://bugs.openjdk.java.net/browse/JDK-8187653 webrev: http://cr.openjdk.java.net/~sherman/8187653/webrev Notes: While the original test case (new String/String.getBytes() with UTF8, as shown in the stack trace) described in the bug report might no longer be reproducible in jdk9, as we have been optimizing many String-related char[]/byte[] coding paths away from the paths that use CoderResult, it is still a concern/issue for the "general" CharsetDe/Encoder.de/encode() operations, in which a malformed or unmappable CoderResult object is returned. As shown in the "[synchronized]" section of bm.scores [1], produced by the simple JMH benchmark CoderResultBM.java [2], the scores get significantly worse as the number of concurrent de/encoding threads grows. It appears the easy fix of replacing the sync mechanism from "method synchronized + HashMap" to "ConcurrentHashMap" solves the problem, as shown in the [ConcurrentHashMap] section of the same bm result [1], because most of the accesses to the caching HashMap are read operations. ConcurrentHashMap's almost non-blocking retrieval operations really help here. There is another fact that might help us optimize further here. For most of our charsets in the JDK repository (with a couple of exceptions), the length of the malformed or unmappable CoderResult that these charsets might trigger is actually never longer than 4. So we might not need to access the ConcurrentHashMap cache at all in the normal use scenario. I'm putting a CoderResult[4] on top of the hashmap cache in the proposed webrev. It does not improve the performance significantly, but when the thread count gets bigger, a 10%+ improvement is observed.
So I would assume it is something worth doing, given its cost is really low. Thanks, Sherman [1] http://cr.openjdk.java.net/~sherman/8187653/bm.scores [2] http://cr.openjdk.java.net/~sherman/8187653/CoderResultBM.java [3] http://cr.openjdk.java.net/~sherman/8187653/bm.scores.ms932