Hi Varun,

Kishore already checked in a fix for that:
https://git-wip-us.apache.org/repos/asf?p=helix.git;a=commit;h=99baacf7f19a09d972754902c50f1618fc8b804c

It's in 0.6.x branch.

Thanks,
Jason
________________________________
From: Varun Sharma [[email protected]]
Sent: Monday, March 09, 2015 2:11 PM
To: [email protected]
Subject: Re: RoutingTableProvider dropping callbacks

Just pinging this thread to check on the hot fix to not remove externalview znode and release for the same. Is there a JIRA tracking that?

On Sun, Mar 8, 2015 at 11:46 PM, Varun Sharma <[email protected]> wrote:

If I recall correctly from a previous thread, it seems like we don't even support changing of bucket sizes for the same resource - so it seems we should probably not be deleting the znode in this case?

On Sun, Mar 8, 2015 at 11:43 PM, Zhen Zhang <[email protected]> wrote:

@Kishore, I think the remove is used in case bucket size is changed, so we can clean all the buckets for old size and set it using new size. The issue seems like a race condition in setting bucketized external view and add watches on child paths. Will investigate more.

Thanks,
Jason
________________________________
From: Varun Sharma [[email protected]]
Sent: Saturday, March 07, 2015 11:07 PM
To: [email protected]
Subject: Re: RoutingTableProvider dropping callbacks

Please find the attached log file with the above trace.

On Sat, Mar 7, 2015 at 8:12 PM, kishore g <[email protected]> wrote:

Another thing is that the RoutingTable is logging this line "Resetting the routing table.". Looks like this happens when we fail to set the watcher.

thanks,
Kishore G

On Sat, Mar 7, 2015 at 8:05 PM, kishore g <[email protected]> wrote:

Your explanation makes sense.
https://github.com/apache/helix/blob/helix-0.6.4/helix-core/src/main/java/org/apache/helix/manager/zk/ZKHelixDataAccessor.java
For bucketized resource we see that path is deleted and set again. Jason, any idea why we are removing the path?
    case EXTERNALVIEW:
      if (value.getBucketSize() == 0) {
        records.add(value.getRecord());
      } else {
        _baseDataAccessor.remove(path, options);

On Sat, Mar 7, 2015 at 4:03 PM, Varun Sharma <[email protected]> wrote:

How does the writing of externalview work for bucketized resources - is it possible that the top level znode for the resource is first deleted and then rewritten with the latest external view?

On Sat, Mar 7, 2015 at 3:56 PM, Varun Sharma <[email protected]> wrote:

Here is the stack trace - there is a zookeeper race and the detailed stack trace appears for bucketized resources. I saw that the ideal state for the resource was created on 26th Feb and was modified on 7th March. However, the external view for the resource is showing up as created on 7th March as well as modified on 7th March. The external view is created at 10:36:04 on 7th March, which is 20 seconds after this log message stack trace is spit out. After this the routing table provider no longer receives any more zk callbacks.

2015-03-07 10:35:43,735 [main-EventThread] (ZkAsyncCallbacks.java:127) WARN org.apache.helix.manager.zk.ZkAsyncCallbacks$SetDataCallbackHandler@3c8589f0, rc:NONODE, path: /main_a/EXTERNALVIEW/$terrapin$data$visual_seo_joins_staging$1422384697040
2015-03-07 10:35:43,736 [main-EventThread] (ZkAsyncCallbacks.java:127) WARN org.apache.helix.manager.zk.ZkAsyncCallbacks$SetDataCallbackHandler@63230a9a, rc:NONODE, path: /main_a/EXTERNALVIEW/$terrapin$data$recommendation_p2p_exp_candset_1$1425671237739
2015-03-07 10:35:43,736 [main-EventThread] (ZkAsyncCallbacks.java:127) WARN org.apache.helix.manager.zk.ZkAsyncCallbacks$SetDataCallbackHandler@118d374f, rc:NONODE, path: /main_a/EXTERNALVIEW/$terrapin$data$None$1422308641250
2015-03-07 10:35:43,736 [ZkClient-EventThread-17-terrapinzk001a:2181] (CallbackHandler.java:304) WARN fail to subscribe child/data change. path: /main_a/EXTERNALVIEW, listener: com.pinterest.terrapin.controller.TerrapinRoutingTableProvider@2c6691da
org.I0Itec.zkclient.exception.ZkNoNodeException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /main_a/EXTERNALVIEW/$terrapin$data$None$1422308641250
    at org.I0Itec.zkclient.exception.ZkException.create(ZkException.java:47)
    at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:685)
    at org.apache.helix.manager.zk.ZkClient.getChildren(ZkClient.java:210)
    at org.I0Itec.zkclient.ZkClient.getChildren(ZkClient.java:409)
    at org.apache.helix.manager.zk.CallbackHandler.subscribeForChanges(CallbackHandler.java:279)
    at org.apache.helix.manager.zk.CallbackHandler.invoke(CallbackHandler.java:202)
    at org.apache.helix.manager.zk.CallbackHandler.handleChildChange(CallbackHandler.java:391)
    at org.I0Itec.zkclient.ZkClient$7.run(ZkClient.java:570)
    at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /main_a/EXTERNALVIEW/$terrapin$data$None$1422308641250
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
    at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1249)
2015-03-07 10:35:43,848 [ZkClient-EventThread-17-terrapinzk001a:2181] (RoutingTableProvider.java:99) INFO Resetting the routing table.

On Thu, Mar 5, 2015 at 11:33 AM, Varun Sharma <[email protected]> wrote:

I suspect the callbacks are not coming in, for a long time now.

On Thu, Mar 5, 2015 at 11:30 AM, Varun Sharma <[email protected]> wrote:

I grepped this and found nothing:

    sudo grep START:INVOKE.*EXTERNALVIEW /var/log/terrapin/controller.log*

I found a bunch of START:INVOKE for the IDEALSTATES znode though.

On Thu, Mar 5, 2015 at 11:15 AM, Zhen Zhang <[email protected]> wrote:

Yes. You should see a pair of "START:INVOKE..." and "END:INVOKE:..." for each callback in your log.
________________________________
From: Varun Sharma [[email protected]]
Sent: Thursday, March 05, 2015 11:11 AM
To: [email protected]
Subject: Re: RoutingTableProvider dropping callbacks

Ohk - is there a way to confirm that the callbacks are being processed (from the logs etc.)?

On Thu, Mar 5, 2015 at 10:50 AM, Zhen Zhang <[email protected]> wrote:

Hi Varun,

This should not be a problem. When we register a callback, we are expecting a callback type of INIT first, followed by a sequence of CALLBACK types, and when you unregister the callback, you will receive a FINALIZED type. Since unregister is an async operation, when you receive a FINALIZED type, you might still see a couple of CALLBACK type callbacks, which are simply ignored. The log is basically telling you that.

Thanks,
Jason
________________________________
From: Varun Sharma [[email protected]]
Sent: Thursday, March 05, 2015 10:44 AM
To: [email protected]
Subject: RoutingTableProvider dropping callbacks

Hi,

It seems that the RoutingTableProvider is dropping callbacks in our case. Here is a log:

[ZkClient-EventThread-17-terrapinzk001a:2181] (CallbackHandler.java:130) WARN Skip processing callbacks for listener: com.pinterest.terrapin.controller.TerrapinRoutingTableProvider@7e7f8062, path: /main_a/EXTERNALVIEW, expected types: [INIT] but was CALLBACK

We have a custom RoutingTableProvider to catch callbacks and do some processing - this is causing a lot of issues for us. What could be causing this?

Thanks
Varun
