Re: write performance issue in 3.6.2

2021-05-03 Thread Michael Han
>> because the tests were run with Prometheus enabled, which is new in 3.6 and has significant negative perf impact. Interesting, let's see what the numbers are without Prometheus involved. It could be that the increased latency we observed in CommitProcessor is just a symptom rather than the

Re: write performance issue in 3.6.2

2021-05-03 Thread Li Wang
Hi Michael, Thanks for your additional inputs. On Mon, May 3, 2021 at 3:13 PM Michael Han wrote: > Hi Li, > > Thanks for following up. > > >> write_commitproc_time_ms were large > > This measures how long a local write op hears back from the leader. If it's > big, then either the leader is

Re: write performance issue in 3.6.2

2021-05-03 Thread Michael Han
Hi Li, Thanks for following up. >> write_commitproc_time_ms were large This measures how long it takes a local write op to hear back from the leader. If it's big, then either the leader is very busy acking the request, or your network RTT is high. What does the local fsync time (fsynctime) look like
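A small sketch of how the two metrics named above could be compared in practice, using the tab-separated output of the `mntr` four-letter command (e.g. `echo mntr | nc host 2181`). The metric names and sample values below follow the ones quoted in this thread and are illustrative; exact names can differ by ZooKeeper version.

```python
# Parse mntr-style output and compare fsync time vs commit-proc time.
def parse_mntr(text):
    """Parse tab-separated key/value lines from the mntr command."""
    metrics = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition("\t")
        try:
            metrics[key] = float(value)
        except ValueError:
            metrics[key] = value  # non-numeric values, e.g. version string
    return metrics

# Hypothetical sample output for illustration only.
sample = (
    "zk_version\t3.6.2\n"
    "avg_fsynctime\t4.2\n"
    "avg_write_commitproc_time_ms\t57.0\n"
)

m = parse_mntr(sample)
# If commit-proc time dwarfs fsync time, the delay is likely on the
# leader/network side rather than the local disk.
print(m["avg_write_commitproc_time_ms"] > m["avg_fsynctime"])
```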

Re: write performance issue in 3.6.2

2021-04-26 Thread Li Wang
Hi Srikant, 1. Have you tried running the test without enabling Prometheus metrics? What I observed is that enabling Prometheus has a significant performance impact (about 40%-60% degradation). 2. In addition to the session expiry errors and the max latency increase, did you see any issue with
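For anyone wanting to rerun the comparison without Prometheus: the exporter is enabled through a metrics-provider entry in zoo.cfg, so commenting it out disables it. A sketch (the port number is an example):

```
# zoo.cfg — comment these out to test without the Prometheus exporter
# metricsProvider.className=org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider
# metricsProvider.httpPort=7000
```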

Re: write performance issue in 3.6.2

2021-04-26 Thread Li Wang
Hi Michael, Thanks for your reply. 1. The workload is 500 concurrent users creating nodes with a data size of 4 bytes. 2. It's pure write. 3. The perf issue is that under the same load, there were many SessionExpired and ConnectionLoss errors when using ZK 3.6.2 but no such errors in ZK 3.4.14.

Re: write performance issue in 3.6.2

2021-04-23 Thread shrikant kalani
Hi Andor, Thanks for your reply. We are planning to perform one more round of stress testing, and then I will be able to provide the detailed logs needed for any troubleshooting. Other details are provided against each question. - which version of Zookeeper is being used: 3.6.2 at server side

Re: write performance issue in 3.6.2

2021-04-23 Thread Andor Molnar
Hi folks, As previously mentioned the community won’t be able to help if you don’t share more information about your scenario. We need to see the following: - which version of Zookeeper is being used, - how many nodes are you running in the ZK cluster, - what is the server configuration? any

Re: write performance issue in 3.6.2

2021-04-21 Thread Antoine Pitrou
Can you explain why this is posted to the Arrow mailing-list? This does not seem relevant to Arrow. If indeed it isn't, please remove the Arrow mailing-list from the recipients. Regards Antoine. On Wed, 21 Apr 2021 11:25:20 +0800 shrikant kalani wrote: > Hello Everyone, > > We are also

Re: write performance issue in 3.6.2

2021-04-20 Thread shrikant kalani
Hello Everyone, We are also using ZooKeeper 3.6.2 with SSL enabled on both sides. We observed the same behaviour where, under high write load, the ZK server starts expiring sessions. There are no JVM-related issues. During high load the max latency increases significantly. Also the session

Re: write performance issue in 3.6.2

2021-04-20 Thread Michael Han
What does the workload look like? Is it pure write, or mixed read/write? A couple of ideas to move this forward: * Publish the performance benchmark so the community can help. * Bisect git commits to find the bad commit that caused the regression. * Use the fine grained metrics introduced in 3.6
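On publishing the benchmark: a minimal driver along these lines would let others reproduce the numbers. This is a self-contained sketch, not the benchmark used in this thread; the actual ZooKeeper `create()` call is stubbed out, and `do_write` would be swapped for a real client call (e.g. via a ZooKeeper client library) to test a live ensemble.

```python
# Sketch of a minimal throughput/latency benchmark driver with the
# ZooKeeper write stubbed out so the harness runs stand-alone.
import time
from concurrent.futures import ThreadPoolExecutor

def do_write(i):
    # Stand-in for a real znode create; returns per-op latency in ms.
    start = time.perf_counter()
    # ... real write (e.g. zk.create("/bench/node", b"data")) goes here ...
    return (time.perf_counter() - start) * 1000.0

def run_bench(workers=8, ops=1000):
    t0 = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        latencies = list(pool.map(do_write, range(ops)))
    elapsed = time.perf_counter() - t0
    return {
        "throughput_ops_s": ops / elapsed,
        "avg_ms": sum(latencies) / len(latencies),
        "max_ms": max(latencies),
    }

stats = run_bench(workers=4, ops=100)
print(stats)
```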

Re: write performance issue in 3.6.2

2021-03-11 Thread Li Wang
The CPU usage of both server and client is normal (< 50%) during the test. Based on the investigation, the server is too busy to keep up with the load. The issue doesn't exist in 3.4.14. I wonder why there is a significant write performance degradation from 3.4.14 to 3.6.2 and how we can address the

Re: write performance issue in 3.6.2

2021-03-11 Thread Andor Molnar
What is the CPU usage of both server and client during the test? It looks like the server is dropping clients because either the server, or both server and client, are too busy to deal with the load. This log line is also concerning: "Too busy to snap, skipping". If that's the case I believe you'll have to profile the

Re: write performance issue in 3.6.2

2021-02-21 Thread Li Wang
Thanks, Patrick. Yes, we are using the same JVM version and GC configurations when running the two tests. I have checked the GC metrics and also the heap dump of the 3.6, the GC pause and the memory usage look okay. Best, Li On Sun, Feb 21, 2021 at 3:34 PM Patrick Hunt wrote: > On Sun, Feb

Re: write performance issue in 3.6.2

2021-02-21 Thread Patrick Hunt
On Sun, Feb 21, 2021 at 3:28 PM Li Wang wrote: > Hi Enrico, Sushant, > > I re-run the perf test with the data consistency check feature disabled > (i.e. -Dzookeeper.digest.enabled=false), the write performance issue of 3.6 > is still there. > > With everything exactly the same, the throughput of

Re: write performance issue in 3.6.2

2021-02-21 Thread Li Wang
Hi Enrico, Sushant, I re-run the perf test with the data consistency check feature disabled (i.e. -Dzookeeper.digest.enabled=false), the write performance issue of 3.6 is still there. With everything exactly the same, the throughput of 3.6 was only 1/2 of 3.4 and the max latency was more than 8
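For reference, one way to pass the flag mentioned above to the server is via the JVM flags picked up by the start scripts; the file name below is the conventional one read by zkEnv.sh, but check your deployment's setup:

```
# conf/java.env — disable the AdHash consistency checker for an
# apples-to-apples comparison with 3.4, which has no such feature
SERVER_JVMFLAGS="-Dzookeeper.digest.enabled=false"
```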

Re: write performance issue in 3.6.2

2021-02-20 Thread Li Wang
Thanks Sushant and Enrico! This is a really good point. According to the 3.6 documentation, the feature is disabled by default. https://zookeeper.apache.org/doc/r3.6.2/zookeeperAdmin.html#ch_administration. However, checking the code, the default is enabled. Let me set the

Re: write performance issue in 3.6.2

2021-02-19 Thread Sushant Mane
Hi Li, On 3.6.2 the consistency checker (AdHash-based) is enabled by default: https://github.com/apache/zookeeper/blob/803c7f1a12f85978cb049af5e4ef23bd8b688715/zookeeper-server/src/main/java/org/apache/zookeeper/server/ZooKeeperServer.java#L136. It is not present in ZK 3.4.14. This feature does have

Re: write performance issue in 3.6.2

2021-02-19 Thread Enrico Olivelli
Li, I wonder if we have some new throttling/back-pressure mechanism that is enabled by default. Does anyone have pointers to the relevant implementations? Enrico On Fri, 19 Feb 2021 at 19:46, Li Wang wrote: > Hi, > > We switched to Netty on both client side and server side and the >

Re: write performance issue in 3.6.2

2021-02-19 Thread Li Wang
Hi, We switched to Netty on both client side and server side and the performance issue is still there. Does anyone have any insights on what could be the cause of the higher latency? Thanks, Li On Mon, Feb 15, 2021 at 2:17 PM Li Wang wrote: > Hi Enrico, > > > Thanks for the reply. > > > 1. We are

Re: write performance issue in 3.6.2

2021-02-15 Thread Li Wang
Hi Enrico, Thanks for the reply. 1. We are using the NIO-based stack, not the Netty-based one yet. 2. Yes, here are some metrics on the client side.
3.6: throughput: 7K, failures: 81215228, Avg Latency: 57ms, Max Latency: 31s
3.4: throughput: 15k, failures: 0, Avg Latency: 30ms, Max Latency: 1.6s
3.

Re: write performance issue in 3.6.2

2021-02-15 Thread Li Wang
Hi Enrico, Thanks for the reply. 1. We are using the direct NIO-based stack, not the Netty-based one yet. 2. Yes, on the client side, here are the metrics 3.6: On Mon, Feb 15, 2021 at 10:44 AM Enrico Olivelli wrote: > IIRC The main difference is about the switch to Netty 4 and about using > more

Re: write performance issue in 3.6.2

2021-02-15 Thread Enrico Olivelli
IIRC the main differences are the switch to Netty 4 and the use of more DirectMemory. Are you using the Netty-based stack? Apart from those macro differences there have been many, many changes since 3.4. Do you have some metrics to share? Are the JVM configurations and zoo.cfg configuration
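For completeness, switching both sides to the Netty stack (as done later in this thread) is controlled by two system properties; a sketch of the relevant flags:

```
# Server side (JVM flag in SERVER_JVMFLAGS):
-Dzookeeper.serverCnxnFactory=org.apache.zookeeper.server.NettyServerCnxnFactory

# Client side (JVM flag in the client process):
-Dzookeeper.clientCnxnSocket=org.apache.zookeeper.ClientCnxnSocketNetty
```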

write performance issue in 3.6.2

2021-02-15 Thread Li Wang
Hi, We want to upgrade from 3.4.14 to 3.6.2. During the performance/load comparison test, it was found that write performance in 3.6 is significantly degraded compared to 3.4. Under the same load, there was a huge number of SessionExpired and ConnectionLoss errors in 3.6