[jira] [Updated] (ZOOKEEPER-4768) Flaky test org.apache.zookeeper.metrics.prometheus.ExportJvmInfoTest#exportInfo
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yike Xiao updated ZOOKEEPER-4768: - Labels: pull-requests-available (was: pull-request-available) > Flaky test > org.apache.zookeeper.metrics.prometheus.ExportJvmInfoTest#exportInfo > --- > > Key: ZOOKEEPER-4768 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4768 > Project: ZooKeeper > Issue Type: Test > Components: metric system >Affects Versions: 3.10.0 >Reporter: Yike Xiao >Priority: Major > Labels: pull-requests-available > Time Spent: 10m > Remaining Estimate: 0h > > If the {{io.prometheus.client.hotspot.DefaultExports#initialize}} method has > been executed by other test cases before running the > {{org.apache.zookeeper.metrics.prometheus.ExportJvmInfoTest#exportInfo}} test > case, then this test case will fail. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4768) Flaky test org.apache.zookeeper.metrics.prometheus.ExportJvmInfoTest#exportInfo
Yike Xiao created ZOOKEEPER-4768: Summary: Flaky test org.apache.zookeeper.metrics.prometheus.ExportJvmInfoTest#exportInfo Key: ZOOKEEPER-4768 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4768 Project: ZooKeeper Issue Type: Test Components: metric system Affects Versions: 3.10.0 Reporter: Yike Xiao If the {{io.prometheus.client.hotspot.DefaultExports#initialize}} method has been executed by other test cases before running the {{org.apache.zookeeper.metrics.prometheus.ExportJvmInfoTest#exportInfo}} test case, then this test case will fail. -- This message was sent by Atlassian Jira (v8.20.10#820010)
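The order dependence here comes from static state in simpleclient_hotspot: {{DefaultExports#initialize}} registers its JVM collectors into the shared default registry at most once per JVM, guarded by a static flag. A simplified sketch of that mechanism (paraphrased for illustration, not the exact simpleclient source):
{noformat}
// Paraphrased sketch of io.prometheus.client.hotspot.DefaultExports.
// The static flag makes a second initialize() call in the same JVM a no-op,
// so a test that expects to observe the side effects of "its own" call
// passes or fails depending on which test class happened to run first.
public class DefaultExportsSketch {
    private static boolean initialized = false;

    public static synchronized void initialize() {
        if (!initialized) {
            // Registers GC, memory, thread, class-loading, etc. collectors
            // into CollectorRegistry.defaultRegistry -- once per JVM.
            initialized = true;
        }
    }
}
{noformat}
A common way to make such a test order-independent is to reset the shared state in a setup method, e.g. {{CollectorRegistry.defaultRegistry.clear()}}; whether the actual fix takes that route is determined by the linked pull request.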
[jira] [Updated] (ZOOKEEPER-4767) New implementation of prometheus quantile metrics based on DataSketches
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yike Xiao updated ZOOKEEPER-4767: - Labels: pull-request-available (was: ) > New implementation of prometheus quantile metrics based on DataSketches > --- > > Key: ZOOKEEPER-4767 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4767 > Project: ZooKeeper > Issue Type: Improvement > Components: metric system >Reporter: Yike Xiao >Priority: Major > Labels: pull-request-available > > If the built-in Prometheus metrics feature introduced in version 3.6 is > enabled, under high-load scenarios (such as when there are a large number of > read requests), the percentile metrics (Summary) used to collect request > latencies can easily become a bottleneck and impact the service itself. This > is because the internal implementation of Summary involves the overhead of > lock operations. In scenarios with a large number of requests, lock > contention can lead to a dramatic deterioration in request latency. The > details of this issue and related profiling can be viewed in ZOOKEEPER-4741. > In ZOOKEEPER-4289, the updates to Summary were switched to be executed in a > separate thread pool. While this approach avoids the overhead of lock > contention caused by multiple threads updating Summary simultaneously, it > introduces the operational overhead of the thread pool queue and additional > garbage collection (GC) overhead. Especially when the thread pool queue is > full, a large number of RejectedExecutionException instances will be thrown, > further increasing the pressure on GC. > To address the problems above, I have implemented an almost lock-free solution > based on DataSketches. Benchmark results show that it offers over a 10x speed > improvement compared to version 3.9.1 and avoids the frequent GC caused by > creating a large number of temporary objects. The trade-off is that the > latency percentiles will be reported with a delay (60 seconds by default), > and each Summary metric will have a certain amount of permanent > memory overhead. > This solution draws on Matteo Merli's optimization work on the percentile > latency metrics for BookKeeper, as detailed in > https://github.com/apache/bookkeeper/commit/3bff19956e70e37c025a8e29aa8428937af77aa1. -- This message was sent by Atlassian Jira (v8.20.10#820010)
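To make the ZOOKEEPER-4289 trade-off described above concrete, here is a hedged sketch of the offloading pattern (the class name and queue bound are illustrative, not the actual ZooKeeper code): a bounded single-worker executor takes the Summary lock off the request path, but once the queue fills, every further sample becomes a thrown-and-caught RejectedExecutionException.
{noformat}
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

import io.prometheus.client.Summary;

class OffloadedSummary {
    // A single worker drains the queue, so only one thread at a time
    // touches the lock inside Summary.observe().
    private final ThreadPoolExecutor worker = new ThreadPoolExecutor(
            1, 1, 0L, TimeUnit.MILLISECONDS,
            new LinkedBlockingQueue<>(1000)); // illustrative bound

    void add(Summary summary, double value) {
        try {
            worker.execute(() -> summary.observe(value));
        } catch (RejectedExecutionException e) {
            // Queue full under load: the sample is dropped, and each
            // rejection allocates a new exception object -- the extra GC
            // pressure the description above refers to.
        }
    }
}
{noformat}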
[jira] [Updated] (ZOOKEEPER-4767) New implementation of prometheus quantile metrics based on DataSketches
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yike Xiao updated ZOOKEEPER-4767: - Summary: New implementation of prometheus quantile metrics based on DataSketches (was: New implementation of prometheus qunatile metrics based on DataSketches) > New implementation of prometheus quantile metrics based on DataSketches > --- > > Key: ZOOKEEPER-4767 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4767 > Project: ZooKeeper > Issue Type: Improvement > Components: metric system >Reporter: Yike Xiao >Priority: Major > > If the built-in Prometheus metrics feature introduced in version 3.6 is > enabled, under high-load scenarios (such as when there are a large number of > read requests), the percentile metrics (Summary) used to collect request > latencies can easily become a bottleneck and impact the service itself. This > is because the internal implementation of Summary involves the overhead of > lock operations. In scenarios with a large number of requests, lock > contention can lead to a dramatic deterioration in request latency. The > details of this issue and related profiling can be viewed in ZOOKEEPER-4741. > In ZOOKEEPER-4289, the updates to Summary were switched to be executed in a > separate thread pool. While this approach avoids the overhead of lock > contention caused by multiple threads updating Summary simultaneously, it > introduces the operational overhead of the thread pool queue and additional > garbage collection (GC) overhead. Especially when the thread pool queue is > full, a large number of RejectedExecutionException instances will be thrown, > further increasing the pressure on GC. > To address the problems above, I have implemented an almost lock-free solution > based on DataSketches. Benchmark results show that it offers over a 10x speed > improvement compared to version 3.9.1 and avoids the frequent GC caused by > creating a large number of temporary objects. The trade-off is that the > latency percentiles will be reported with a delay (60 seconds by default), > and each Summary metric will have a certain amount of permanent > memory overhead. > This solution draws on Matteo Merli's optimization work on the percentile > latency metrics for BookKeeper, as detailed in > https://github.com/apache/bookkeeper/commit/3bff19956e70e37c025a8e29aa8428937af77aa1. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4767) New implementation of prometheus qunatile metrics based on DataSketches
Yike Xiao created ZOOKEEPER-4767: Summary: New implementation of prometheus qunatile metrics based on DataSketches Key: ZOOKEEPER-4767 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4767 Project: ZooKeeper Issue Type: Improvement Components: metric system Reporter: Yike Xiao If the built-in Prometheus metrics feature introduced in version 3.6 is enabled, under high-load scenarios (such as when there are a large number of read requests), the percentile metrics (Summary) used to collect request latencies can easily become a bottleneck and impact the service itself. This is because the internal implementation of Summary involves the overhead of lock operations. In scenarios with a large number of requests, lock contention can lead to a dramatic deterioration in request latency. The details of this issue and related profiling can be viewed in ZOOKEEPER-4741. In ZOOKEEPER-4289, the updates to Summary were switched to be executed in a separate thread pool. While this approach avoids the overhead of lock contention caused by multiple threads updating Summary simultaneously, it introduces the operational overhead of the thread pool queue and additional garbage collection (GC) overhead. Especially when the thread pool queue is full, a large number of RejectedExecutionException instances will be thrown, further increasing the pressure on GC. To address the problems above, I have implemented an almost lock-free solution based on DataSketches. Benchmark results show that it offers over a 10x speed improvement compared to version 3.9.1 and avoids the frequent GC caused by creating a large number of temporary objects. The trade-off is that the latency percentiles will be reported with a delay (60 seconds by default), and each Summary metric will have a certain amount of permanent memory overhead. This solution draws on Matteo Merli's optimization work on the percentile latency metrics for BookKeeper, as detailed in https://github.com/apache/bookkeeper/commit/3bff19956e70e37c025a8e29aa8428937af77aa1. -- This message was sent by Atlassian Jira (v8.20.10#820010)
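For illustration, a minimal sketch of the rotation scheme described above, assuming the Apache DataSketches quantiles API ({{org.apache.datasketches:datasketches-java}}); the class and method names here are hypothetical, not the actual patch:
{noformat}
import org.apache.datasketches.quantiles.DoublesSketch;
import org.apache.datasketches.quantiles.UpdateDoublesSketch;

class RotatingSummarySketch {
    // Writers record into "current"; a scheduled task swaps it out each
    // rotation interval (60 seconds by default, per the description above).
    // Readers only query the sealed "snapshot", which is why the reported
    // percentiles lag by up to one interval.
    private volatile UpdateDoublesSketch current = DoublesSketch.builder().build();
    private volatile DoublesSketch snapshot = DoublesSketch.builder().build();

    void observe(double value) {
        // A real implementation shards this per thread (or stripes it) to
        // stay almost lock-free; it is shown unsharded here for brevity.
        current.update(value);
    }

    void rotate() { // invoked by a scheduler, e.g. every 60 seconds
        UpdateDoublesSketch old = current;
        current = DoublesSketch.builder().build();
        snapshot = old; // two live sketches per metric: the permanent memory cost
    }

    double quantile(double rank) {
        DoublesSketch s = snapshot;
        return s.isEmpty() ? Double.NaN : s.getQuantile(rank);
    }
}
{noformat}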
[jira] [Updated] (ZOOKEEPER-4741) High latency under heavy load when prometheus metrics enabled
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yike Xiao updated ZOOKEEPER-4741: - Affects Version/s: 3.6.2 3.6.1 3.6.3 3.6.0 (was: 3.8.2) > High latency under heavy load when prometheus metrics enabled > - > > Key: ZOOKEEPER-4741 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4741 > Project: ZooKeeper > Issue Type: Bug > Components: server >Affects Versions: 3.6.0, 3.6.3, 3.6.1, 3.6.2, 3.6.4 > Environment: zookeeper version: 3.6.4 > kernel: 3.10.0-1160.95.1.el7.x86_64 > java version "1.8.0_111" > > metricsProvider.className=org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider >Reporter: Yike Xiao >Priority: Major > Attachments: 32010.lock.profile.html, 32010.threaddump.001.txt, > 32010.wallclock.profile.html, image-2023-09-11-16-17-21-166.png > > > In our production environment, we use ZooKeeper's built-in PrometheusMetricsProvider to > monitor ZooKeeper status. Recently we observed very high latency in one of > our ZooKeeper clusters, which serves a heavy load. > Measured on a heavily loaded client, the latency could exceed 25 seconds. > {noformat} > [arthas@8]$ watch org.apache.zookeeper.ClientCnxn submitRequest > Press Q or Ctrl+C to abort. > Affect(class count: 1 , method count: 1) cost in 294 ms, listenerId: 1 > method=org.apache.zookeeper.ClientCnxn.submitRequest location=AtExit > ts=2023-09-06 16:01:32; [cost=28180.767833ms] result=@ArrayList[ > @Object[][isEmpty=false;size=4], > @ClientCnxn[sessionid:0x221f087c3d7eb6c local:/:40942 > remoteserver:/:2181 lastZxid:146177807315 xid:589796 sent:589796 > recv:973892 queuedpkts:0 pendingresp:4398 queuedevents:0], > @ReplyHeader[585397,146177807315,0 > ], > ] > method=org.apache.zookeeper.ClientCnxn.submitRequest location=AtExit > ts=2023-09-06 16:01:34; [cost=29052.726493ms] result=@ArrayList[ > @Object[][isEmpty=false;size=4], > @ClientCnxn[sessionid:0x221f087c3d7eb6c local:/:40942 > remoteserver:/:2181 lastZxid:146177807862 xid:589989 sent:589989 > recv:974165 queuedpkts:0 pendingresp:4400 queuedevents:0], > @ReplyHeader[585588,146177807862,0 > ]{noformat} > > We observed many connections with *high Recv-Q* on the server side.
> CommitProcWorkThread *BLOCKED* in > {{org.apache.zookeeper.server.ServerStats#updateLatency:}} > {noformat} > "CommitProcWorkThread-15" #21595 daemon prio=5 os_prio=0 > tid=0x7f86d804a000 nid=0x6bca waiting for monitor entry > [0x7f86deb95000] >java.lang.Thread.State: BLOCKED (on object monitor) > at io.prometheus.client.CKMSQuantiles.insert(CKMSQuantiles.java:91) > - waiting to lock <0x000784dd1a18> (a > io.prometheus.client.CKMSQuantiles) > at > io.prometheus.client.TimeWindowQuantiles.insert(TimeWindowQuantiles.java:38) > at io.prometheus.client.Summary$Child.observe(Summary.java:281) > at io.prometheus.client.Summary.observe(Summary.java:307) > at > org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider$PrometheusSummary.add(PrometheusMetricsProvider.java:355) > at > org.apache.zookeeper.server.ServerStats.updateLatency(ServerStats.java:153) > at > org.apache.zookeeper.server.FinalRequestProcessor.updateStats(FinalRequestProcessor.java:669) > at > org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:585) > at > org.apache.zookeeper.server.quorum.CommitProcessor$CommitWorkRequest.doWork(CommitProcessor.java:545) > at > org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) >Locked ownable synchronizers: > - <0x000734923e80> (a > java.util.concurrent.ThreadPoolExecutor$Worker){noformat} > > The wall clock profile shows that there is lock contention within > `CommitProcWorkThread` threads. -- This message was sent by Atlassian Jira (v8.20.10#820010)
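The monitor being waited on in the dump above is a shared {{CKMSQuantiles}} instance: in simpleclient, every {{Summary.observe()}} funnels through a synchronized insert on that object, so all request-processing threads serialize on a single lock. A simplified, hypothetical rendering of that hot path (not the actual simpleclient source):
{noformat}
import java.util.ArrayList;
import java.util.List;

// Simplified rendering of the contended path shown in the stack trace:
// Summary.observe() -> TimeWindowQuantiles.insert() -> CKMSQuantiles.insert().
class CkmsQuantilesSketch {
    private final List<Double> buffer = new ArrayList<>();

    // One monitor guards every insert; this is the
    // "waiting to lock <0x...> (a io.prometheus.client.CKMSQuantiles)"
    // state the BLOCKED CommitProcWorkThreads show above.
    synchronized void insert(double value) {
        buffer.add(value);
        if (buffer.size() >= 500) { // illustrative threshold
            compressUnderLock();
        }
    }

    private void compressUnderLock() {
        // Merges buffered samples into the quantile summary; it runs while
        // still holding the monitor, so all writer threads stall meanwhile.
        buffer.clear();
    }
}
{noformat}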
[jira] [Commented] (ZOOKEEPER-4741) High latency under heavy load when prometheus metrics enabled
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17763621#comment-17763621 ] Yike Xiao commented on ZOOKEEPER-4741: -- ZooKeeper 3.6.4 includes io.prometheus:simpleclient:0.6.0; this patch is included in 0.7.0: [https://github.com/prometheus/client_java/commit/62001352f37c22e01ff5c806148abca44b41ba81]. ZooKeeper 3.8.2 includes io.prometheus:simpleclient:0.9.0. I'll update here after upgrading the simpleclient component. > High latency under heavy load when prometheus metrics enabled > - > > Key: ZOOKEEPER-4741 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4741 > Project: ZooKeeper > Issue Type: Bug > Components: server >Affects Versions: 3.6.4, 3.8.2 > Environment: zookeeper version: 3.6.4 > kernel: 3.10.0-1160.95.1.el7.x86_64 > java version "1.8.0_111" > > metricsProvider.className=org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider >Reporter: Yike Xiao >Priority: Major > Attachments: 32010.lock.profile.html, 32010.threaddump.001.txt, > 32010.wallclock.profile.html, image-2023-09-11-16-17-21-166.png > > > In our production environment, we use ZooKeeper's built-in PrometheusMetricsProvider to > monitor ZooKeeper status. Recently we observed very high latency in one of > our ZooKeeper clusters, which serves a heavy load. > Measured on a heavily loaded client, the latency could exceed 25 seconds. > {noformat} > [arthas@8]$ watch org.apache.zookeeper.ClientCnxn submitRequest > Press Q or Ctrl+C to abort. > Affect(class count: 1 , method count: 1) cost in 294 ms, listenerId: 1 > method=org.apache.zookeeper.ClientCnxn.submitRequest location=AtExit > ts=2023-09-06 16:01:32; [cost=28180.767833ms] result=@ArrayList[ > @Object[][isEmpty=false;size=4], > @ClientCnxn[sessionid:0x221f087c3d7eb6c local:/:40942 > remoteserver:/:2181 lastZxid:146177807315 xid:589796 sent:589796 > recv:973892 queuedpkts:0 pendingresp:4398 queuedevents:0], > @ReplyHeader[585397,146177807315,0 > ], > ] > method=org.apache.zookeeper.ClientCnxn.submitRequest location=AtExit > ts=2023-09-06 16:01:34; [cost=29052.726493ms] result=@ArrayList[ > @Object[][isEmpty=false;size=4], > @ClientCnxn[sessionid:0x221f087c3d7eb6c local:/:40942 > remoteserver:/:2181 lastZxid:146177807862 xid:589989 sent:589989 > recv:974165 queuedpkts:0 pendingresp:4400 queuedevents:0], > @ReplyHeader[585588,146177807862,0 > ]{noformat} > > We observed many connections with *high Recv-Q* on the server side.
> CommitProcWorkThread *BLOCKED* in > {{org.apache.zookeeper.server.ServerStats#updateLatency:}} > {noformat} > "CommitProcWorkThread-15" #21595 daemon prio=5 os_prio=0 > tid=0x7f86d804a000 nid=0x6bca waiting for monitor entry > [0x7f86deb95000] >java.lang.Thread.State: BLOCKED (on object monitor) > at io.prometheus.client.CKMSQuantiles.insert(CKMSQuantiles.java:91) > - waiting to lock <0x000784dd1a18> (a > io.prometheus.client.CKMSQuantiles) > at > io.prometheus.client.TimeWindowQuantiles.insert(TimeWindowQuantiles.java:38) > at io.prometheus.client.Summary$Child.observe(Summary.java:281) > at io.prometheus.client.Summary.observe(Summary.java:307) > at > org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider$PrometheusSummary.add(PrometheusMetricsProvider.java:355) > at > org.apache.zookeeper.server.ServerStats.updateLatency(ServerStats.java:153) > at > org.apache.zookeeper.server.FinalRequestProcessor.updateStats(FinalRequestProcessor.java:669) > at > org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:585) > at > org.apache.zookeeper.server.quorum.CommitProcessor$CommitWorkRequest.doWork(CommitProcessor.java:545) > at > org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) >Locked ownable synchronizers: > - <0x000734923e80> (a > java.util.concurrent.ThreadPoolExecutor$Worker){noformat} > > The wall clock profile shows that there is lock contention within > `CommitProcWorkThread` threads. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ZOOKEEPER-4741) High latency under heavy load when prometheus metrics enabled
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17763611#comment-17763611 ] Yike Xiao commented on ZOOKEEPER-4741: -- Workaround: after disabling the PrometheusMetricsProvider, the problem disappeared. > High latency under heavy load when prometheus metrics enabled > - > > Key: ZOOKEEPER-4741 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4741 > Project: ZooKeeper > Issue Type: Bug > Components: server >Affects Versions: 3.6.4, 3.8.2 > Environment: zookeeper version: 3.6.4 > kernel: 3.10.0-1160.95.1.el7.x86_64 > java version "1.8.0_111" > > metricsProvider.className=org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider >Reporter: Yike Xiao >Priority: Major > Attachments: 32010.lock.profile.html, 32010.threaddump.001.txt, > 32010.wallclock.profile.html, image-2023-09-11-16-17-21-166.png > > > In our production environment, we use ZooKeeper's built-in PrometheusMetricsProvider to > monitor ZooKeeper status. Recently we observed very high latency in one of > our ZooKeeper clusters, which serves a heavy load. > Measured on a heavily loaded client, the latency could exceed 25 seconds. > {noformat} > [arthas@8]$ watch org.apache.zookeeper.ClientCnxn submitRequest > Press Q or Ctrl+C to abort. > Affect(class count: 1 , method count: 1) cost in 294 ms, listenerId: 1 > method=org.apache.zookeeper.ClientCnxn.submitRequest location=AtExit > ts=2023-09-06 16:01:32; [cost=28180.767833ms] result=@ArrayList[ > @Object[][isEmpty=false;size=4], > @ClientCnxn[sessionid:0x221f087c3d7eb6c local:/:40942 > remoteserver:/:2181 lastZxid:146177807315 xid:589796 sent:589796 > recv:973892 queuedpkts:0 pendingresp:4398 queuedevents:0], > @ReplyHeader[585397,146177807315,0 > ], > ] > method=org.apache.zookeeper.ClientCnxn.submitRequest location=AtExit > ts=2023-09-06 16:01:34; [cost=29052.726493ms] result=@ArrayList[ > @Object[][isEmpty=false;size=4], > @ClientCnxn[sessionid:0x221f087c3d7eb6c local:/:40942 > remoteserver:/:2181 lastZxid:146177807862 xid:589989 sent:589989 > recv:974165 queuedpkts:0 pendingresp:4400 queuedevents:0], > @ReplyHeader[585588,146177807862,0 > ]{noformat} > > We observed many connections with *high Recv-Q* on the server side.
> CommitProcWorkThread *BLOCKED* in > {{org.apache.zookeeper.server.ServerStats#updateLatency:}} > {noformat} > "CommitProcWorkThread-15" #21595 daemon prio=5 os_prio=0 > tid=0x7f86d804a000 nid=0x6bca waiting for monitor entry > [0x7f86deb95000] >java.lang.Thread.State: BLOCKED (on object monitor) > at io.prometheus.client.CKMSQuantiles.insert(CKMSQuantiles.java:91) > - waiting to lock <0x000784dd1a18> (a > io.prometheus.client.CKMSQuantiles) > at > io.prometheus.client.TimeWindowQuantiles.insert(TimeWindowQuantiles.java:38) > at io.prometheus.client.Summary$Child.observe(Summary.java:281) > at io.prometheus.client.Summary.observe(Summary.java:307) > at > org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider$PrometheusSummary.add(PrometheusMetricsProvider.java:355) > at > org.apache.zookeeper.server.ServerStats.updateLatency(ServerStats.java:153) > at > org.apache.zookeeper.server.FinalRequestProcessor.updateStats(FinalRequestProcessor.java:669) > at > org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:585) > at > org.apache.zookeeper.server.quorum.CommitProcessor$CommitWorkRequest.doWork(CommitProcessor.java:545) > at > org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) >Locked ownable synchronizers: > - <0x000734923e80> (a > java.util.concurrent.ThreadPoolExecutor$Worker){noformat} > > The wall clock profile shows that there is lock contention within > `CommitProcWorkThread` threads. -- This message was sent by Atlassian Jira (v8.20.10#820010)
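For reference, the provider is enabled by the {{metricsProvider.className}} line shown in the Environment field above; the workaround amounts to commenting that line out of {{zoo.cfg}} and restarting, after which ZooKeeper falls back to its default lightweight in-memory metrics provider. A config sketch (the {{httpPort}} line is the standard companion property, shown here as an assumption about the deployment):
{noformat}
# zoo.cfg -- workaround: disable the Prometheus provider and restart.
# With these lines commented out, ZooKeeper uses its default in-memory
# metrics provider and the contended Summary path is never exercised.
#metricsProvider.className=org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider
#metricsProvider.httpPort=7000
{noformat}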
[jira] [Updated] (ZOOKEEPER-4741) High latency under heavy load when prometheus metrics enabled
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yike Xiao updated ZOOKEEPER-4741: - Description: In our production environment, we use ZooKeeper's built-in PrometheusMetricsProvider to monitor ZooKeeper status. Recently we observed very high latency in one of our ZooKeeper clusters, which serves a heavy load. Measured on a heavily loaded client, the latency could exceed 25 seconds. {noformat} [arthas@8]$ watch org.apache.zookeeper.ClientCnxn submitRequest Press Q or Ctrl+C to abort. Affect(class count: 1 , method count: 1) cost in 294 ms, listenerId: 1 method=org.apache.zookeeper.ClientCnxn.submitRequest location=AtExit ts=2023-09-06 16:01:32; [cost=28180.767833ms] result=@ArrayList[ @Object[][isEmpty=false;size=4], @ClientCnxn[sessionid:0x221f087c3d7eb6c local:/:40942 remoteserver:/:2181 lastZxid:146177807315 xid:589796 sent:589796 recv:973892 queuedpkts:0 pendingresp:4398 queuedevents:0], @ReplyHeader[585397,146177807315,0 ], ] method=org.apache.zookeeper.ClientCnxn.submitRequest location=AtExit ts=2023-09-06 16:01:34; [cost=29052.726493ms] result=@ArrayList[ @Object[][isEmpty=false;size=4], @ClientCnxn[sessionid:0x221f087c3d7eb6c local:/:40942 remoteserver:/:2181 lastZxid:146177807862 xid:589989 sent:589989 recv:974165 queuedpkts:0 pendingresp:4400 queuedevents:0], @ReplyHeader[585588,146177807862,0 ]{noformat} We observed many connections with *high Recv-Q* on the server side. CommitProcWorkThread *BLOCKED* in {{org.apache.zookeeper.server.ServerStats#updateLatency}}: {noformat} "CommitProcWorkThread-15" #21595 daemon prio=5 os_prio=0 tid=0x7f86d804a000 nid=0x6bca waiting for monitor entry [0x7f86deb95000] java.lang.Thread.State: BLOCKED (on object monitor) at io.prometheus.client.CKMSQuantiles.insert(CKMSQuantiles.java:91) - waiting to lock <0x000784dd1a18> (a io.prometheus.client.CKMSQuantiles) at io.prometheus.client.TimeWindowQuantiles.insert(TimeWindowQuantiles.java:38) at io.prometheus.client.Summary$Child.observe(Summary.java:281) at io.prometheus.client.Summary.observe(Summary.java:307) at org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider$PrometheusSummary.add(PrometheusMetricsProvider.java:355) at org.apache.zookeeper.server.ServerStats.updateLatency(ServerStats.java:153) at org.apache.zookeeper.server.FinalRequestProcessor.updateStats(FinalRequestProcessor.java:669) at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:585) at org.apache.zookeeper.server.quorum.CommitProcessor$CommitWorkRequest.doWork(CommitProcessor.java:545) at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Locked ownable synchronizers: - <0x000734923e80> (a java.util.concurrent.ThreadPoolExecutor$Worker){noformat} The wall clock profile shows that there is lock contention within `CommitProcWorkThread` threads. was: In our production environment, we use ZooKeeper's built-in PrometheusMetricsProvider to monitor ZooKeeper status. Recently we observed very high latency in one of our ZooKeeper clusters, which serves a heavy load. Measured on a heavily loaded client, the latency could exceed 25 seconds. {noformat} [arthas@8]$ watch org.apache.zookeeper.ClientCnxn submitRequest Press Q or Ctrl+C to abort. Affect(class count: 1 , method count: 1) cost in 294 ms, listenerId: 1 method=org.apache.zookeeper.ClientCnxn.submitRequest location=AtExit ts=2023-09-06 16:01:32; [cost=28180.767833ms] result=@ArrayList[ @Object[][isEmpty=false;size=4], @ClientCnxn[sessionid:0x221f087c3d7eb6c local:/:40942 remoteserver:/:2181 lastZxid:146177807315 xid:589796 sent:589796 recv:973892 queuedpkts:0 pendingresp:4398 queuedevents:0], @ReplyHeader[585397,146177807315,0 ], ] method=org.apache.zookeeper.ClientCnxn.submitRequest location=AtExit ts=2023-09-06 16:01:34; [cost=29052.726493ms] result=@ArrayList[ @Object[][isEmpty=false;size=4], @ClientCnxn[sessionid:0x221f087c3d7eb6c local:/:40942 remoteserver:/:2181 lastZxid:146177807862 xid:589989 sent:589989 recv:974165 queuedpkts:0 pendingresp:4400 queuedevents:0], @ReplyHeader[585588,146177807862,0 ]{noformat} We observed many connections with *high Recv-Q* on the server side. CommitProcWorkThread *BLOCKED* in {{org.apache.zookeeper.server.ServerStats#updateLatency}}: {noformat} "CommitProcWorkThread-15" #21595 daemon prio=5 os_prio=0 tid=0x7f86d804a000 nid=0x6bca waiting for monitor entry [0x7f86deb95000] java.lang.Thread.State: BLOCKED (on object monitor)
[jira] [Updated] (ZOOKEEPER-4741) High latency under heavy load when prometheus metrics enabled
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yike Xiao updated ZOOKEEPER-4741: - Attachment: 32010.lock.profile.html > High latency under heavy load when prometheus metrics enabled > - > > Key: ZOOKEEPER-4741 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4741 > Project: ZooKeeper > Issue Type: Bug > Components: server >Affects Versions: 3.6.4, 3.8.2 > Environment: zookeeper version: 3.6.4 > kernel: 3.10.0-1160.95.1.el7.x86_64 > java version "1.8.0_111" > > metricsProvider.className=org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider >Reporter: Yike Xiao >Priority: Major > Attachments: 32010.lock.profile.html, 32010.threaddump.001.txt, > 32010.wallclock.profile.html, image-2023-09-11-16-17-21-166.png > > > In our production environment, we use ZooKeeper's built-in PrometheusMetricsProvider to > monitor ZooKeeper status. Recently we observed very high latency in one of > our ZooKeeper clusters, which serves a heavy load. > Measured on a heavily loaded client, the latency could exceed 25 seconds. > {noformat} > [arthas@8]$ watch org.apache.zookeeper.ClientCnxn submitRequest > Press Q or Ctrl+C to abort. > Affect(class count: 1 , method count: 1) cost in 294 ms, listenerId: 1 > method=org.apache.zookeeper.ClientCnxn.submitRequest location=AtExit > ts=2023-09-06 16:01:32; [cost=28180.767833ms] result=@ArrayList[ > @Object[][isEmpty=false;size=4], > @ClientCnxn[sessionid:0x221f087c3d7eb6c local:/:40942 > remoteserver:/:2181 lastZxid:146177807315 xid:589796 sent:589796 > recv:973892 queuedpkts:0 pendingresp:4398 queuedevents:0], > @ReplyHeader[585397,146177807315,0 > ], > ] > method=org.apache.zookeeper.ClientCnxn.submitRequest location=AtExit > ts=2023-09-06 16:01:34; [cost=29052.726493ms] result=@ArrayList[ > @Object[][isEmpty=false;size=4], > @ClientCnxn[sessionid:0x221f087c3d7eb6c local:/:40942 > remoteserver:/:2181 lastZxid:146177807862 xid:589989 sent:589989 > recv:974165 queuedpkts:0 pendingresp:4400 queuedevents:0], > @ReplyHeader[585588,146177807862,0 > ]{noformat} > > We observed many connections with *high Recv-Q* on the server side.
> CommitProcWorkThread *BLOCKED* in > {{org.apache.zookeeper.server.ServerStats#updateLatency:}} > {noformat} > "CommitProcWorkThread-15" #21595 daemon prio=5 os_prio=0 > tid=0x7f86d804a000 nid=0x6bca waiting for monitor entry > [0x7f86deb95000] >java.lang.Thread.State: BLOCKED (on object monitor) > at io.prometheus.client.CKMSQuantiles.insert(CKMSQuantiles.java:91) > - waiting to lock <0x000784dd1a18> (a > io.prometheus.client.CKMSQuantiles) > at > io.prometheus.client.TimeWindowQuantiles.insert(TimeWindowQuantiles.java:38) > at io.prometheus.client.Summary$Child.observe(Summary.java:281) > at io.prometheus.client.Summary.observe(Summary.java:307) > at > org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider$PrometheusSummary.add(PrometheusMetricsProvider.java:355) > at > org.apache.zookeeper.server.ServerStats.updateLatency(ServerStats.java:153) > at > org.apache.zookeeper.server.FinalRequestProcessor.updateStats(FinalRequestProcessor.java:669) > at > org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:585) > at > org.apache.zookeeper.server.quorum.CommitProcessor$CommitWorkRequest.doWork(CommitProcessor.java:545) > at > org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) >Locked ownable synchronizers: > - <0x000734923e80> (a > java.util.concurrent.ThreadPoolExecutor$Worker){noformat} > > The wall clock profile shows that there is lock contention within > `CommitProcWorkThread` threads. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ZOOKEEPER-4741) High latency under heavy load when prometheus metrics enabled
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yike Xiao updated ZOOKEEPER-4741: - Description: In our production environment, we use ZooKeeper's built-in PrometheusMetricsProvider to monitor ZooKeeper status. Recently we observed very high latency in one of our ZooKeeper clusters, which serves a heavy load. Measured on a heavily loaded client, the latency could exceed 25 seconds. {noformat} [arthas@8]$ watch org.apache.zookeeper.ClientCnxn submitRequest Press Q or Ctrl+C to abort. Affect(class count: 1 , method count: 1) cost in 294 ms, listenerId: 1 method=org.apache.zookeeper.ClientCnxn.submitRequest location=AtExit ts=2023-09-06 16:01:32; [cost=28180.767833ms] result=@ArrayList[ @Object[][isEmpty=false;size=4], @ClientCnxn[sessionid:0x221f087c3d7eb6c local:/:40942 remoteserver:/:2181 lastZxid:146177807315 xid:589796 sent:589796 recv:973892 queuedpkts:0 pendingresp:4398 queuedevents:0], @ReplyHeader[585397,146177807315,0 ], ] method=org.apache.zookeeper.ClientCnxn.submitRequest location=AtExit ts=2023-09-06 16:01:34; [cost=29052.726493ms] result=@ArrayList[ @Object[][isEmpty=false;size=4], @ClientCnxn[sessionid:0x221f087c3d7eb6c local:/:40942 remoteserver:/:2181 lastZxid:146177807862 xid:589989 sent:589989 recv:974165 queuedpkts:0 pendingresp:4400 queuedevents:0], @ReplyHeader[585588,146177807862,0 ]{noformat} We observed many connections with *high Recv-Q* on the server side. CommitProcWorkThread *BLOCKED* in {{org.apache.zookeeper.server.ServerStats#updateLatency}}: {noformat} "CommitProcWorkThread-15" #21595 daemon prio=5 os_prio=0 tid=0x7f86d804a000 nid=0x6bca waiting for monitor entry [0x7f86deb95000] java.lang.Thread.State: BLOCKED (on object monitor) at io.prometheus.client.CKMSQuantiles.insert(CKMSQuantiles.java:91) - waiting to lock <0x000784dd1a18> (a io.prometheus.client.CKMSQuantiles) at io.prometheus.client.TimeWindowQuantiles.insert(TimeWindowQuantiles.java:38) at io.prometheus.client.Summary$Child.observe(Summary.java:281) at io.prometheus.client.Summary.observe(Summary.java:307) at org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider$PrometheusSummary.add(PrometheusMetricsProvider.java:355) at org.apache.zookeeper.server.ServerStats.updateLatency(ServerStats.java:153) at org.apache.zookeeper.server.FinalRequestProcessor.updateStats(FinalRequestProcessor.java:669) at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:585) at org.apache.zookeeper.server.quorum.CommitProcessor$CommitWorkRequest.doWork(CommitProcessor.java:545) at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Locked ownable synchronizers: - <0x000734923e80> (a java.util.concurrent.ThreadPoolExecutor$Worker){noformat} !https://gitlab.dev.zhaopin.com/sucheng.wang/notes/uploads/b9da2552d6b00c3f9130d87caf01325e/image.png! was: In our production environment, we use ZooKeeper's built-in PrometheusMetricsProvider to monitor ZooKeeper status. Recently we observed very high latency in one of our ZooKeeper clusters, which serves a heavy load. Measured on a heavily loaded client, the latency could exceed 25 seconds. !image-2023-09-11-16-17-21-166.png! We observed many connections with high Recv-Q on the server side. CommitProcWorkThread *BLOCKED* in {{org.apache.zookeeper.server.ServerStats#updateLatency}}: {noformat} "CommitProcWorkThread-15" #21595 daemon prio=5 os_prio=0 tid=0x7f86d804a000 nid=0x6bca waiting for monitor entry [0x7f86deb95000] java.lang.Thread.State: BLOCKED (on object monitor) at io.prometheus.client.CKMSQuantiles.insert(CKMSQuantiles.java:91) - waiting to lock <0x000784dd1a18> (a io.prometheus.client.CKMSQuantiles) at io.prometheus.client.TimeWindowQuantiles.insert(TimeWindowQuantiles.java:38) at io.prometheus.client.Summary$Child.observe(Summary.java:281) at io.prometheus.client.Summary.observe(Summary.java:307) at org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider$PrometheusSummary.add(PrometheusMetricsProvider.java:355) at org.apache.zookeeper.server.ServerStats.updateLatency(ServerStats.java:153) at org.apache.zookeeper.server.FinalRequestProcessor.updateStats(FinalRequestProcessor.java:669) at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:585) at
[jira] [Updated] (ZOOKEEPER-4741) High latency under heavy load when prometheus metrics enabled
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yike Xiao updated ZOOKEEPER-4741: - Description: In our production environment, we use ZooKeeper's built-in PrometheusMetricsProvider to monitor ZooKeeper status. Recently we observed very high latency in one of our ZooKeeper clusters, which serves a heavy load. Measured on a heavily loaded client, the latency could exceed 25 seconds. !image-2023-09-11-16-17-21-166.png! We observed many connections with high Recv-Q on the server side. CommitProcWorkThread *BLOCKED* in {{org.apache.zookeeper.server.ServerStats#updateLatency}}: {noformat} "CommitProcWorkThread-15" #21595 daemon prio=5 os_prio=0 tid=0x7f86d804a000 nid=0x6bca waiting for monitor entry [0x7f86deb95000] java.lang.Thread.State: BLOCKED (on object monitor) at io.prometheus.client.CKMSQuantiles.insert(CKMSQuantiles.java:91) - waiting to lock <0x000784dd1a18> (a io.prometheus.client.CKMSQuantiles) at io.prometheus.client.TimeWindowQuantiles.insert(TimeWindowQuantiles.java:38) at io.prometheus.client.Summary$Child.observe(Summary.java:281) at io.prometheus.client.Summary.observe(Summary.java:307) at org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider$PrometheusSummary.add(PrometheusMetricsProvider.java:355) at org.apache.zookeeper.server.ServerStats.updateLatency(ServerStats.java:153) at org.apache.zookeeper.server.FinalRequestProcessor.updateStats(FinalRequestProcessor.java:669) at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:585) at org.apache.zookeeper.server.quorum.CommitProcessor$CommitWorkRequest.doWork(CommitProcessor.java:545) at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Locked ownable synchronizers: - <0x000734923e80> (a java.util.concurrent.ThreadPoolExecutor$Worker){noformat} !https://gitlab.dev.zhaopin.com/sucheng.wang/notes/uploads/b9da2552d6b00c3f9130d87caf01325e/image.png! was: In our production environment, we use ZooKeeper's built-in PrometheusMetricsProvider to monitor ZooKeeper status. Recently we observed very high latency in one of our ZooKeeper clusters, which serves a heavy load. Measured on a heavily loaded client, the latency could exceed 25 seconds. !image-2023-09-11-16-17-21-166.png! We observed many connections with high Recv-Q on the server side. CommitProcWorkThread *BLOCKED* in {{org.apache.zookeeper.server.ServerStats#updateLatency}}: "CommitProcWorkThread-15" #21595 daemon prio=5 os_prio=0 tid=0x7f86d804a000 nid=0x6bca waiting for monitor entry [0x7f86deb95000] java.lang.Thread.State: BLOCKED (on object monitor) at io.prometheus.client.CKMSQuantiles.insert(CKMSQuantiles.java:91) - waiting to lock <0x000784dd1a18> (a io.prometheus.client.CKMSQuantiles) at io.prometheus.client.TimeWindowQuantiles.insert(TimeWindowQuantiles.java:38) at io.prometheus.client.Summary$Child.observe(Summary.java:281) at io.prometheus.client.Summary.observe(Summary.java:307) at org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider$PrometheusSummary.add(PrometheusMetricsProvider.java:355) at org.apache.zookeeper.server.ServerStats.updateLatency(ServerStats.java:153) at org.apache.zookeeper.server.FinalRequestProcessor.updateStats(FinalRequestProcessor.java:669) at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:585) at org.apache.zookeeper.server.quorum.CommitProcessor$CommitWorkRequest.doWork(CommitProcessor.java:545) at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) The wall clock profile shows that there is lock contention within the {{CommitProcWorkThread}} threads. !https://gitlab.dev.zhaopin.com/sucheng.wang/notes/uploads/b9da2552d6b00c3f9130d87caf01325e/image.png! > High latency under heavy load when prometheus metrics enabled > - > > Key: ZOOKEEPER-4741 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4741 > Project: ZooKeeper > Issue Type: Bug > Components: server >Affects Versions: 3.6.4, 3.8.2 >
[jira] [Created] (ZOOKEEPER-4741) High latency under heavy load when prometheus metrics enabled
Yike Xiao created ZOOKEEPER-4741: Summary: High latency under heavy load when prometheus metrics enabled Key: ZOOKEEPER-4741 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4741 Project: ZooKeeper Issue Type: Bug Components: server Affects Versions: 3.8.2, 3.6.4 Environment: zookeeper version: 3.6.4 kernel: 3.10.0-1160.95.1.el7.x86_64 java version "1.8.0_111" metricsProvider.className=org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider Reporter: Yike Xiao Attachments: 32010.threaddump.001.txt, 32010.wallclock.profile.html, image-2023-09-11-16-17-21-166.png In our production environment, we use ZooKeeper's built-in PrometheusMetricsProvider to monitor ZooKeeper status. Recently we observed very high latency in one of our ZooKeeper clusters, which serves a heavy load. Measured on a heavily loaded client, the latency could exceed 25 seconds. !image-2023-09-11-16-17-21-166.png! We observed many connections with high Recv-Q on the server side. CommitProcWorkThread *BLOCKED* in {{org.apache.zookeeper.server.ServerStats#updateLatency}}: "CommitProcWorkThread-15" #21595 daemon prio=5 os_prio=0 tid=0x7f86d804a000 nid=0x6bca waiting for monitor entry [0x7f86deb95000] java.lang.Thread.State: BLOCKED (on object monitor) at io.prometheus.client.CKMSQuantiles.insert(CKMSQuantiles.java:91) - waiting to lock <0x000784dd1a18> (a io.prometheus.client.CKMSQuantiles) at io.prometheus.client.TimeWindowQuantiles.insert(TimeWindowQuantiles.java:38) at io.prometheus.client.Summary$Child.observe(Summary.java:281) at io.prometheus.client.Summary.observe(Summary.java:307) at org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider$PrometheusSummary.add(PrometheusMetricsProvider.java:355) at org.apache.zookeeper.server.ServerStats.updateLatency(ServerStats.java:153) at org.apache.zookeeper.server.FinalRequestProcessor.updateStats(FinalRequestProcessor.java:669) at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:585) at org.apache.zookeeper.server.quorum.CommitProcessor$CommitWorkRequest.doWork(CommitProcessor.java:545) at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) The wall clock profile shows that there is lock contention within the {{CommitProcWorkThread}} threads. !https://gitlab.dev.zhaopin.com/sucheng.wang/notes/uploads/b9da2552d6b00c3f9130d87caf01325e/image.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)