Jason, I remember having the ability to compress/decompress, and before we added support for bucketizing, compression was used to support a large number of partitions. However, I don't see the code anywhere. Did we do this on a separate branch?

thanks,
Kishore G
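[For reference, this is roughly what plugging compression in could look like via the ZkSerializer hook in the I0Itec ZkClient that Helix builds on. This is only an illustrative sketch, not the implementation being asked about; the class name and the wrapped serializer are made up.]

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

import org.I0Itec.zkclient.exception.ZkMarshallingError;
import org.I0Itec.zkclient.serialize.ZkSerializer;

// Hypothetical wrapper: gzips the bytes produced by whatever serializer
// is already in use before they are written to ZooKeeper, and gunzips
// them on the way back.
public class GzipZkSerializer implements ZkSerializer {
  private final ZkSerializer inner;

  public GzipZkSerializer(ZkSerializer inner) {
    this.inner = inner;
  }

  @Override
  public byte[] serialize(Object data) throws ZkMarshallingError {
    try {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      GZIPOutputStream gzip = new GZIPOutputStream(out);
      gzip.write(inner.serialize(data));
      gzip.close(); // flushes the gzip trailer
      return out.toByteArray();
    } catch (IOException e) {
      throw new ZkMarshallingError(e);
    }
  }

  @Override
  public Object deserialize(byte[] bytes) throws ZkMarshallingError {
    try {
      GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(bytes));
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      byte[] buf = new byte[4096];
      int n;
      while ((n = gzip.read(buf)) > 0) {
        out.write(buf, 0, n);
      }
      return inner.deserialize(out.toByteArray());
    } catch (IOException e) {
      throw new ZkMarshallingError(e);
    }
  }
}

[Since the wrapper composes with the existing serializer and would be installed via the ZkClient constructor that accepts a ZkSerializer, turning compression on/off could simply be a matter of whether the wrapper is applied.]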
On Wed, Feb 4, 2015 at 3:30 PM, Zhen Zhang <[email protected]> wrote:

> Hi Varun, we can certainly add compression and have a config for turning
> it on/off. We have implemented compression in our own zkclient before.
> The issues with compression might be:
> 1) CPU consumption on the controller will increase.
> 2) It is harder to debug.
>
> Thanks,
> Jason
> ------------------------------
> *From:* kishore g [[email protected]]
> *Sent:* Wednesday, February 04, 2015 3:08 PM
> *To:* [email protected]
> *Subject:* Re: Excessive ZooKeeper load
>
> We do have the ability to compress the data. I am not sure if there is
> an easy way to turn the compression on and off.
>
> On Wed, Feb 4, 2015 at 2:49 PM, Varun Sharma <[email protected]> wrote:
>
>> I am wondering if it's possible to gzip the external view znode - a
>> simple gzip cut down the data size by 25X. Is it possible to plug in
>> compression/decompression as ZooKeeper nodes are read?
>>
>> Varun
>>
>> On Mon, Feb 2, 2015 at 8:53 PM, kishore g <[email protected]> wrote:
>>
>>> There are multiple options we can try here.
>>> What if we used a CachedDataAccessor for this use case? Clients will
>>> only read if the node has changed. This optimization can benefit all
>>> use cases.
>>>
>>> What about batching the watch triggers? Not sure which version of Helix
>>> has this option.
>>>
>>> Another option is to use a poll-based routing table instead of a
>>> watch-based one. Coupled with the CachedDataAccessor, this can be very
>>> efficient.
>>>
>>> Thanks,
>>> Kishore G
>>> On Feb 2, 2015 8:17 PM, "Varun Sharma" <[email protected]> wrote:
>>>
>>>> My total external view across all resources is roughly 3M in size and
>>>> there are 100 clients downloading it twice for every node restart -
>>>> that's 600M of data for every restart. So I guess that is causing this
>>>> issue. We are thinking of doing some tricks to limit the # of clients
>>>> from 100 to 1. I guess that should help significantly.
>>>>
>>>> Varun
>>>>
>>>> On Mon, Feb 2, 2015 at 7:37 PM, Zhen Zhang <[email protected]> wrote:
>>>>
>>>>> Hey Varun,
>>>>>
>>>>> I guess your external view is pretty large, since each external view
>>>>> callback takes ~3s. The RoutingTableProvider is callback based, so
>>>>> only when there is a change in the external view will the
>>>>> RoutingTableProvider read the entire external view from ZK. During
>>>>> the rolling upgrade there are lots of live instance changes, which
>>>>> may lead to a lot of changes in the external view. One possible way
>>>>> to mitigate the issue is to smooth the traffic by adding some delay
>>>>> between bouncing nodes. We can do a rough estimate of how many
>>>>> external view changes you might have during the upgrade, how many
>>>>> listeners you have, and how large the external views are. Once we
>>>>> have these numbers, we will know the ZK bandwidth requirement. ZK
>>>>> read bandwidth can be scaled by adding ZK observers.
>>>>>
>>>>> A ZK watcher is one-time only, so every time a listener receives a
>>>>> callback, it re-registers its watcher with ZK.
>>>>>
>>>>> It's generally unreliable to depend on delta changes instead of
>>>>> reading the entire znode. There are corner cases where you would lose
>>>>> delta changes if you depended on them.
>>>>>
>>>>> For the ZK connection issue, do you have any log on the ZK server
>>>>> side regarding this connection?
>>>>>
>>>>> Thanks,
>>>>> Jason
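[To put rough numbers on Jason's estimate, using the figures quoted above (~3 MB of external view, 100 listeners, two full reads per node restart). The number of nodes bounced in the rolling restart is an assumption, not a figure from this thread.]

// Back-of-the-envelope ZK read traffic for a rolling restart, using the
// numbers quoted in this thread. All values are rough; the node count
// is an assumption.
public class ZkReadEstimate {
  public static void main(String[] args) {
    long viewBytes = 3L * 1000 * 1000; // ~3 MB total external view
    int listeners = 100;               // RoutingTableProvider clients
    int readsPerBounce = 2;            // full re-reads per node restart
    int nodesBounced = 100;            // nodes in the rolling restart (assumed)

    long perBounce = viewBytes * listeners * readsPerBounce; // ~600 MB
    long total = perBounce * nodesBounced;                   // ~60 GB
    System.out.printf("per bounce: %d MB, full rolling restart: %d GB%n",
        perBounce / 1_000_000, total / 1_000_000_000L);
  }
}

[At those magnitudes the 300mbps reported below is unsurprising, which is why smoothing the bounces, adding ZK observers, or cutting the listener count all attack the same product of terms.]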
>>>>> ------------------------------
>>>>> *From:* Varun Sharma [[email protected]]
>>>>> *Sent:* Monday, February 02, 2015 4:41 PM
>>>>> *To:* [email protected]
>>>>> *Subject:* Re: Excessive ZooKeeper load
>>>>>
>>>>> I believe there is a misbehaving client. Here is a stack trace -
>>>>> it probably lost its connection and is now stampeding ZK:
>>>>>
>>>>> "ZkClient-EventThread-104-terrapinzk001a:2181,terrapinzk002b:2181,terrapinzk003e:2181"
>>>>> daemon prio=10 tid=0x00007f534144b800 nid=0x7db5 in Object.wait()
>>>>> [0x00007f52ca9c3000]
>>>>>
>>>>>    java.lang.Thread.State: WAITING (on object monitor)
>>>>>         at java.lang.Object.wait(Native Method)
>>>>>         at java.lang.Object.wait(Object.java:503)
>>>>>         at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1309)
>>>>>         - locked <0x00000004fb0d8c38> (a org.apache.zookeeper.ClientCnxn$Packet)
>>>>>         at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1036)
>>>>>         at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1069)
>>>>>         at org.I0Itec.zkclient.ZkConnection.exists(ZkConnection.java:95)
>>>>>         at org.I0Itec.zkclient.ZkClient$11.call(ZkClient.java:823)
>>>>>         *at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:675)*
>>>>>         *at org.I0Itec.zkclient.ZkClient.watchForData(ZkClient.java:820)*
>>>>>         *at org.I0Itec.zkclient.ZkClient.subscribeDataChanges(ZkClient.java:136)*
>>>>>         at org.apache.helix.manager.zk.CallbackHandler.subscribeDataChange(CallbackHandler.java:241)
>>>>>         at org.apache.helix.manager.zk.CallbackHandler.subscribeForChanges(CallbackHandler.java:287)
>>>>>         at org.apache.helix.manager.zk.CallbackHandler.invoke(CallbackHandler.java:202)
>>>>>         - locked <0x000000056b75a948> (a org.apache.helix.manager.zk.ZKHelixManager)
>>>>>         at org.apache.helix.manager.zk.CallbackHandler.handleDataChange(CallbackHandler.java:338)
>>>>>         at org.I0Itec.zkclient.ZkClient$6.run(ZkClient.java:547)
>>>>>         at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)
>>>>>
>>>>> On Mon, Feb 2, 2015 at 4:28 PM, Varun Sharma <[email protected]> wrote:
>>>>>
>>>>>> I am wondering what is causing the zk subscription to happen every
>>>>>> 2-3 seconds - is a new watch being established every 3 seconds?
>>>>>>
>>>>>> Thanks
>>>>>> Varun
>>>>>>
>>>>>> On Mon, Feb 2, 2015 at 4:23 PM, Varun Sharma <[email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> We are serving a few different resources whose total # of partitions
>>>>>>> is ~30K. We just did a rolling restart of the cluster, and the
>>>>>>> clients which use the RoutingTableProvider are stuck in a bad state
>>>>>>> where they are constantly subscribing to changes in the external
>>>>>>> view of the cluster. Here is the Helix log on the client after our
>>>>>>> rolling restart finished - the client is constantly polling ZK. The
>>>>>>> ZooKeeper node is pushing 300mbps right now, and most of the traffic
>>>>>>> is being pulled by clients. Is this a race condition - and is there
>>>>>>> an easy way to make the clients not poll so aggressively? We
>>>>>>> restarted one of the clients, and we don't see these same messages
>>>>>>> anymore. Also, is it possible to just propagate external view diffs
>>>>>>> instead of the whole big znode?
>>>>>>> 15/02/03 00:21:18 INFO zk.CallbackHandler: 104 END:INVOKE
>>>>>>> /main_a/EXTERNALVIEW
>>>>>>> listener:org.apache.helix.spectator.RoutingTableProvider Took: 3340ms
>>>>>>>
>>>>>>> 15/02/03 00:21:18 INFO zk.CallbackHandler: 104 START:INVOKE
>>>>>>> /main_a/EXTERNALVIEW
>>>>>>> listener:org.apache.helix.spectator.RoutingTableProvider
>>>>>>>
>>>>>>> 15/02/03 00:21:18 INFO zk.CallbackHandler: pinacle2084 subscribes
>>>>>>> child-change. path: /main_a/EXTERNALVIEW, listener:
>>>>>>> org.apache.helix.spectator.RoutingTableProvider@76984879
>>>>>>>
>>>>>>> 15/02/03 00:21:22 INFO zk.CallbackHandler: 104 END:INVOKE
>>>>>>> /main_a/EXTERNALVIEW
>>>>>>> listener:org.apache.helix.spectator.RoutingTableProvider Took: 3371ms
>>>>>>>
>>>>>>> 15/02/03 00:21:22 INFO zk.CallbackHandler: 104 START:INVOKE
>>>>>>> /main_a/EXTERNALVIEW
>>>>>>> listener:org.apache.helix.spectator.RoutingTableProvider
>>>>>>>
>>>>>>> 15/02/03 00:21:22 INFO zk.CallbackHandler: pinacle2084 subscribes
>>>>>>> child-change. path: /main_a/EXTERNALVIEW, listener:
>>>>>>> org.apache.helix.spectator.RoutingTableProvider@76984879
>>>>>>>
>>>>>>> 15/02/03 00:21:25 INFO zk.CallbackHandler: 104 END:INVOKE
>>>>>>> /main_a/EXTERNALVIEW
>>>>>>> listener:org.apache.helix.spectator.RoutingTableProvider Took: 3281ms
>>>>>>>
>>>>>>> 15/02/03 00:21:25 INFO zk.CallbackHandler: 104 START:INVOKE
>>>>>>> /main_a/EXTERNALVIEW
>>>>>>> listener:org.apache.helix.spectator.RoutingTableProvider
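[The START:INVOKE / END:INVOKE / "subscribes" loop above is the one-shot-watch cycle Jason describes: every callback re-reads the external view and re-registers the watch, which is also what the watchForData/subscribeDataChanges frames in the stack trace are doing via ZkClient. A stripped-down sketch of the same cycle against the raw ZooKeeper API; the class name, connect string, and znode path are placeholders.]

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Why the log above loops: ZooKeeper watches fire exactly once, so a
// listener must re-read the (possibly large) znode and re-register
// after every event it receives.
public class OneShotWatchDemo implements Watcher {
  private final ZooKeeper zk;
  private final String path;

  OneShotWatchDemo(ZooKeeper zk, String path) {
    this.zk = zk;
    this.path = path;
  }

  void watch() throws Exception {
    // Reads the full znode AND sets a new one-shot data watch.
    zk.getData(path, this, null);
  }

  @Override
  public void process(WatchedEvent event) {
    try {
      // The watch that delivered this event is already consumed; to keep
      // hearing about changes we must read and resubscribe again. This
      // read-then-resubscribe step is what each CallbackHandler INVOKE
      // in the log corresponds to.
      watch();
    } catch (Exception e) {
      e.printStackTrace();
    }
  }

  public static void main(String[] args) throws Exception {
    ZooKeeper zk = new ZooKeeper("terrapinzk001a:2181", 30000, null);
    new OneShotWatchDemo(zk, "/main_a/EXTERNALVIEW/myResource").watch();
    Thread.sleep(Long.MAX_VALUE); // keep the demo alive
  }
}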
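[And for Varun's idea of cutting the readers from 100 to 1: the sketch below is the standard Helix spectator wiring that the single shared process would run; how it re-serves routing answers to the other 99 clients is deliberately left out, and the instance name, resource, and partition are placeholders.]

import java.util.List;

import org.apache.helix.HelixManager;
import org.apache.helix.HelixManagerFactory;
import org.apache.helix.InstanceType;
import org.apache.helix.model.InstanceConfig;
import org.apache.helix.spectator.RoutingTableProvider;

// One shared spectator instead of 100: a single process subscribes to
// external view changes and answers routing queries locally, so ZK
// serves the 3 MB view once per change instead of 100 times.
public class SharedSpectator {
  public static void main(String[] args) throws Exception {
    HelixManager manager = HelixManagerFactory.getZKHelixManager(
        "main_a",               // cluster name, as in the logs above
        "routingProxy0",        // placeholder instance name
        InstanceType.SPECTATOR,
        "terrapinzk001a:2181"); // ZK connect string
    manager.connect();

    RoutingTableProvider routingTable = new RoutingTableProvider();
    manager.addExternalViewChangeListener(routingTable);

    // Example lookup (fills in as external view callbacks arrive):
    // which instances host partition 0 of "myResource" in state ONLINE.
    List<InstanceConfig> online =
        routingTable.getInstances("myResource", "myResource_0", "ONLINE");
    System.out.println("ONLINE replicas: " + online);
  }
}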
