There are multiple options we can try here. What if we used the CachedDataAccessor for this use case? Clients will only read if the node has changed. This optimization can benefit all use cases.
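A minimal sketch of the version-gated read this would enable, with ZooKeeper stood in by a tiny in-memory store. All class and method names here are illustrative, not actual Helix or ZooKeeper API: the accessor does a cheap stat first and pays for the full read only when the node's version has moved.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of a version-gated cached read (not Helix's real
// CachedDataAccessor): re-read a large znode only when its version changed.
public class VersionGatedCache {
    static class Znode { String data; int version; }

    // Stand-in for ZooKeeper: tracks data, a version counter, and full reads.
    static class ZnodeStore {
        private final Map<String, Znode> nodes = new HashMap<>();
        int reads = 0; // counts full data reads, the expensive operation

        void write(String path, String data) {
            Znode z = nodes.computeIfAbsent(path, p -> new Znode());
            z.data = data;
            z.version++;
        }
        int statVersion(String path) { return nodes.get(path).version; } // cheap stat
        String readData(String path) { reads++; return nodes.get(path).data; } // expensive
    }

    private final ZnodeStore store;
    private final Map<String, Znode> cache = new HashMap<>();

    VersionGatedCache(ZnodeStore store) { this.store = store; }

    String get(String path) {
        int current = store.statVersion(path);
        Znode cached = cache.get(path);
        if (cached == null || cached.version != current) {
            // Only pay for the full read when the version moved.
            Znode z = new Znode();
            z.data = store.readData(path);
            z.version = current;
            cache.put(path, z);
            cached = z;
        }
        return cached.data;
    }
}
```

With 100 clients each re-reading a ~3M external view on every callback, skipping reads whose version is unchanged is exactly where the bandwidth saving would come from.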
What about batching the watch triggers? I'm not sure which version of Helix has this option. Another option is to use a poll-based RoutingTable instead of a watch-based one. This, coupled with the CachedDataAccessor, can be very efficient.

Thanks,
Kishore G

On Feb 2, 2015 8:17 PM, "Varun Sharma" <[email protected]> wrote:

> My total external view across all resources is roughly 3M in size and
> there are 100 clients downloading it twice for every node restart - that's
> 600M of data for every restart. So I guess that is causing this issue. We
> are thinking of doing some tricks to limit the # of clients to 1 from 100.
> I guess that should help significantly.
>
> Varun
>
> On Mon, Feb 2, 2015 at 7:37 PM, Zhen Zhang <[email protected]> wrote:
>
>> Hey Varun,
>>
>> I guess your external view is pretty large, since each external view
>> callback takes ~3s. The RoutingTableProvider is callback based, so only
>> when there is a change in the external view will the RoutingTableProvider
>> read the entire external view from ZK. During the rolling upgrade, there
>> are lots of live instance changes, which may lead to a lot of changes in
>> the external view. One possible way to mitigate the issue is to smooth the
>> traffic by having some delay in between bouncing nodes. We can do a rough
>> estimation of how many external view changes you might have during the
>> upgrade, how many listeners you have, and how large the external views are.
>> Once we have these numbers, we will know the ZK bandwidth requirement. ZK
>> read bandwidth can be scaled by adding ZK observers.
>>
>> A ZK watcher is one-time only, so every time a listener receives a
>> callback, it re-registers its watcher with ZK.
>>
>> It's normally unreliable to depend on delta changes instead of reading
>> the entire znode. There are corner cases where you could lose delta
>> changes if you depend on them.
>>
>> For the ZK connection issue, do you have any logs on the ZK server side
>> regarding this connection?
>>
>> Thanks,
>> Jason
>>
>> ------------------------------
>> *From:* Varun Sharma [[email protected]]
>> *Sent:* Monday, February 02, 2015 4:41 PM
>> *To:* [email protected]
>> *Subject:* Re: Excessive ZooKeeper load
>>
>> I believe there is a misbehaving client. Here is a stack trace - it
>> probably lost connection and is now stampeding it:
>>
>> "ZkClient-EventThread-104-terrapinzk001a:2181,terrapinzk002b:2181,terrapinzk003e:2181" daemon prio=10 tid=0x00007f534144b800 nid=0x7db5 in Object.wait() [0x00007f52ca9c3000]
>>    java.lang.Thread.State: WAITING (on object monitor)
>>        at java.lang.Object.wait(Native Method)
>>        at java.lang.Object.wait(Object.java:503)
>>        at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1309)
>>        - locked <0x00000004fb0d8c38> (a org.apache.zookeeper.ClientCnxn$Packet)
>>        at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1036)
>>        at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1069)
>>        at org.I0Itec.zkclient.ZkConnection.exists(ZkConnection.java:95)
>>        at org.I0Itec.zkclient.ZkClient$11.call(ZkClient.java:823)
>>        *at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:675)*
>>        *at org.I0Itec.zkclient.ZkClient.watchForData(ZkClient.java:820)*
>>        *at org.I0Itec.zkclient.ZkClient.subscribeDataChanges(ZkClient.java:136)*
>>        at org.apache.helix.manager.zk.CallbackHandler.subscribeDataChange(CallbackHandler.java:241)
>>        at org.apache.helix.manager.zk.CallbackHandler.subscribeForChanges(CallbackHandler.java:287)
>>        at org.apache.helix.manager.zk.CallbackHandler.invoke(CallbackHandler.java:202)
>>        - locked <0x000000056b75a948> (a org.apache.helix.manager.zk.ZKHelixManager)
>>        at org.apache.helix.manager.zk.CallbackHandler.handleDataChange(CallbackHandler.java:338)
>>        at org.I0Itec.zkclient.ZkClient$6.run(ZkClient.java:547)
>>        at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)
>>
>> On Mon, Feb 2, 2015 at 4:28 PM, Varun Sharma <[email protected]> wrote:
>>
>>> I am wondering what is causing the zk subscription to happen every 2-3
>>> seconds - is this a new watch being established every 3 seconds?
>>>
>>> Thanks
>>> Varun
>>>
>>> On Mon, Feb 2, 2015 at 4:23 PM, Varun Sharma <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> We are serving a few different resources whose total # of partitions
>>>> is ~30K. We just did a rolling restart of the cluster, and the clients
>>>> which use the RoutingTableProvider are stuck in a bad state where they are
>>>> constantly subscribing to changes in the external view of a cluster. Here
>>>> is the Helix log on the client after our rolling restart finished - the
>>>> client is constantly polling ZK. The ZooKeeper node is pushing 300mbps
>>>> right now and most of the traffic is being pulled by clients. Is this a
>>>> race condition? Also, is there an easy way to make the clients not poll so
>>>> aggressively? We restarted one of the clients and we don't see these same
>>>> messages anymore. Also, is it possible to just propagate external view
>>>> diffs instead of the whole big znode?
>>>>
>>>> 15/02/03 00:21:18 INFO zk.CallbackHandler: 104 END:INVOKE /main_a/EXTERNALVIEW listener:org.apache.helix.spectator.RoutingTableProvider Took: 3340ms
>>>>
>>>> 15/02/03 00:21:18 INFO zk.CallbackHandler: 104 START:INVOKE /main_a/EXTERNALVIEW listener:org.apache.helix.spectator.RoutingTableProvider
>>>>
>>>> 15/02/03 00:21:18 INFO zk.CallbackHandler: pinacle2084 subscribes child-change.
>>>> path: /main_a/EXTERNALVIEW, listener: org.apache.helix.spectator.RoutingTableProvider@76984879
>>>>
>>>> 15/02/03 00:21:22 INFO zk.CallbackHandler: 104 END:INVOKE /main_a/EXTERNALVIEW listener:org.apache.helix.spectator.RoutingTableProvider Took: 3371ms
>>>>
>>>> 15/02/03 00:21:22 INFO zk.CallbackHandler: 104 START:INVOKE /main_a/EXTERNALVIEW listener:org.apache.helix.spectator.RoutingTableProvider
>>>>
>>>> 15/02/03 00:21:22 INFO zk.CallbackHandler: pinacle2084 subscribes child-change. path: /main_a/EXTERNALVIEW, listener: org.apache.helix.spectator.RoutingTableProvider@76984879
>>>>
>>>> 15/02/03 00:21:25 INFO zk.CallbackHandler: 104 END:INVOKE /main_a/EXTERNALVIEW listener:org.apache.helix.spectator.RoutingTableProvider Took: 3281ms
>>>>
>>>> 15/02/03 00:21:25 INFO zk.CallbackHandler: 104 START:INVOKE /main_a/EXTERNALVIEW listener:org.apache.helix.spectator.RoutingTableProvider
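As noted above, ZK watches are one-shot, which is why CallbackHandler re-subscribes on every callback (the subscribeDataChanges/watchForData frames in the stack trace). A self-contained simulation of that semantic, with illustrative names only (this is not the real ZkClient or ZooKeeper API): a fired watch is consumed, so a listener that does not re-register misses the next change.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model of ZooKeeper's one-shot watch semantics: each change
// event delivers every registered watch at most once, then discards it, so
// the callback itself must re-subscribe to keep receiving notifications.
public class OneShotWatchDemo {
    interface Watcher { void onChange(String path); }

    static class WatchedStore {
        private final List<Watcher> watchers = new ArrayList<>();
        void subscribe(Watcher w) { watchers.add(w); }
        void change(String path) {
            // Deliver each watch once, then discard it (one-shot).
            List<Watcher> fired = new ArrayList<>(watchers);
            watchers.clear();
            for (Watcher w : fired) w.onChange(path);
        }
    }

    public static void main(String[] args) {
        WatchedStore store = new WatchedStore();
        int[] events = {0};
        Watcher listener = new Watcher() {
            @Override public void onChange(String path) {
                events[0]++;
                store.subscribe(this); // re-register, or the next change is missed
            }
        };
        store.subscribe(listener);
        store.change("/main_a/EXTERNALVIEW");
        store.change("/main_a/EXTERNALVIEW");
        System.out.println(events[0]); // both changes observed via re-registration
    }
}
```

This also explains why the subscriptions in the log above repeat on every callback: they are expected re-registrations, not a bug by themselves; the cost is the full external-view read that accompanies each one.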

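The "batching the watch triggers" idea from the top of the thread can be sketched as follows: each callback only marks a dirty flag, and a single worker performs one full read per burst of triggers instead of one read per trigger. The names here are purely illustrative assumptions, not a Helix API.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative coalescer for bursty watch callbacks: N triggers in a burst
// collapse into a single expensive refresh (e.g. one external-view read).
public class CallbackCoalescer {
    private final AtomicBoolean dirty = new AtomicBoolean(false);
    final AtomicInteger refreshes = new AtomicInteger(0);

    // Called from the watch callback thread: just mark, never read here.
    void onExternalViewChange() { dirty.set(true); }

    // Called by a single periodic worker: at most one full read per burst.
    void refreshIfDirty() {
        if (dirty.compareAndSet(true, false)) {
            refreshes.incrementAndGet(); // stand-in for the expensive full read
        }
    }

    public static void main(String[] args) {
        CallbackCoalescer c = new CallbackCoalescer();
        for (int i = 0; i < 100; i++) c.onExternalViewChange(); // burst of triggers
        c.refreshIfDirty();
        c.refreshIfDirty(); // no new changes, no extra read
        System.out.println(c.refreshes.get()); // prints 1
    }
}
```

During a rolling restart, where every live-instance bounce perturbs the external view, this turns a storm of callbacks into a handful of reads, at the cost of slightly staler routing data between refreshes.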