There is a timing issue in ZkHelixParticipant#setupMsgHandler(). The ZK callback is currently hooked up (line 347 in https://github.com/apache/helix/blob/master/helix-core/src/main/java/org/apache/helix/manager/zk/ZkHelixParticipant.java) before the message handler registrations are done (line 354 in the same file), so a message delivered in between finds no handler factory registered for its type. The fix is to move adding the ZK callback to the end, after all registrations. I will add a test case that reliably reproduces this issue.
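
To illustrate the ordering problem, here is a minimal, self-contained sketch. It is not the Helix code: HandlerOrderingSketch, Dispatcher, MessageSource and setup() are hypothetical stand-ins. The one behaviour it borrows from the stack trace is that attaching a listener immediately invokes the callback for messages that are already pending. It also shows the "warn instead of NPE" handling suggested further down the thread.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical model of the race; none of these types are part of the Helix API.
public class HandlerOrderingSketch {

  /** Plays the role of the handler-factory registry plus the message executor. */
  static class Dispatcher {
    private final Map<String, Consumer<String>> handlers = new HashMap<>();
    final List<String> handled = new ArrayList<>();
    final List<String> unread = new ArrayList<>();

    void register(String type, Consumer<String> handler) {
      handlers.put(type, handler);
    }

    /** Invoked by the simulated callback for each delivered message. */
    void onMessage(String type, String payload) {
      Consumer<String> handler = handlers.get(type);
      if (handler == null) {
        // Warn and keep the message unread instead of dereferencing null.
        System.err.println("WARN: no handler registered for type " + type);
        unread.add(payload);
        return;
      }
      handler.accept(payload);
      handled.add(payload);
    }
  }

  /** Simulated message source: attaching a listener delivers pending messages at once. */
  static class MessageSource {
    private final List<String[]> pending = new ArrayList<>();

    void enqueue(String type, String payload) {
      pending.add(new String[] {type, payload});
    }

    void addListener(Dispatcher dispatcher) {
      for (String[] message : pending) {
        dispatcher.onMessage(message[0], message[1]);
      }
      pending.clear();
    }
  }

  /** Runs the setup in either order and returns the dispatcher for inspection. */
  static Dispatcher setup(boolean listenerFirst) {
    MessageSource source = new MessageSource();
    source.enqueue("STATE_TRANSITION", "OFFLINE->ONLINE");
    Dispatcher dispatcher = new Dispatcher();
    if (listenerFirst) {
      // Problematic order: the callback fires before any handler exists.
      source.addListener(dispatcher);
      dispatcher.register("STATE_TRANSITION", payload -> { });
    } else {
      // Proposed order: register every handler, then hook up the callback.
      dispatcher.register("STATE_TRANSITION", payload -> { });
      source.addListener(dispatcher);
    }
    return dispatcher;
  }

  public static void main(String[] args) {
    Dispatcher early = setup(true);
    Dispatcher late = setup(false);
    System.out.println("listener first: handled=" + early.handled.size()
        + " unread=" + early.unread.size());
    System.out.println("listener last:  handled=" + late.handled.size()
        + " unread=" + late.unread.size());
  }
}
```

With the listener attached first, the pending message is left unread; with the listener attached last, the same message is handled. The sketch does not model re-processing of unread messages after a later registration.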

Thanks,
Zhen

On Sun, Feb 15, 2015 at 11:45 PM, Zhen Zhang <[email protected]> wrote:

> might be some race conditions. need to double check this.
> On Feb 15, 2015 11:38 PM, "Steph Meslin-Weber" <[email protected]> wrote:
>
>> Hi Kishore,
>>
>> That's right, the node doesn't process any state transitions. They should
>> have been logged in the first set of logs had they occurred.
>>
>> Thanks,
>> Steph
>> On 16 Feb 2015 07:28, "kishore g" <[email protected]> wrote:
>>
>>> Hi Steph,
>>>
>>> When the NPE occurs, do you get the state transition callbacks?
>>>
>>> thanks,
>>> Kishore G
>>>
>>> On Sun, Feb 15, 2015 at 11:23 PM, Steph Meslin-Weber <[email protected]> wrote:
>>>
>>>> Unfortunately it appears that when the NPE occurs, dropping the
>>>> participant no longer cleans up the related INSTANCE node. Perhaps some
>>>> state is lost?
>>>>
>>>> Thanks,
>>>> Steph
>>>> On 16 Feb 2015 06:52, "Zhen Zhang" <[email protected]> wrote:
>>>>
>>>>> I think the NPE is not fatal. It happens when no message handler
>>>>> factory is registered for this message type. The message will not be
>>>>> removed and remain in UNREAD state. Later when the message handler factory
>>>>> is registered via:
>>>>> DefaultMessagingService#registerMessageHandlerFactory, we will send a
>>>>> NOP message, which will in turn trigger HelixTaskExecutor to process all
>>>>> UNREAD messages. We should definitely fix this by logging a warning
>>>>> message instead of throwing an NPE.
>>>>>
>>>>> Thanks,
>>>>> Jason
>>>>>
>>>>> On Sun, Feb 15, 2015 at 7:30 PM, kishore g <[email protected]> wrote:
>>>>>
>>>>>> Controller assuming the state transition occurred is even more
>>>>>> dangerous.
>>>>>>
>>>>>> On Sun, Feb 15, 2015 at 7:18 PM, [email protected] <[email protected]> wrote:
>>>>>>
>>>>>>> In my experience it was fatal. The callback would not be called but the
>>>>>>> controller would somehow assume the state transition occurred.
>>>>>>> On Feb 15, 2015 7:13 PM, "kishore g" <[email protected]> wrote:
>>>>>>>
>>>>>>> > Thanks Vlad. That explains the problem. That also explains how adding
>>>>>>> > sleep of 3 seconds works.
>>>>>>> >
>>>>>>> > Jason, is this exception fatal? Will the message be processed again after
>>>>>>> > the handler is added?
>>>>>>> >
>>>>>>> > thanks,
>>>>>>> > Kishore G
>>>>>>> >
>>>>>>> > On Sun, Feb 15, 2015 at 6:41 PM, [email protected] <[email protected]> wrote:
>>>>>>> >
>>>>>>> >> https://issues.apache.org/jira/browse/HELIX-548
>>>>>>> >> On Feb 15, 2015 6:38 PM, "kishore g" <[email protected]> wrote:
>>>>>>> >>
>>>>>>> >> > Hi Vlad,
>>>>>>> >> >
>>>>>>> >> > Was there any jira associated with it?
>>>>>>> >> >
>>>>>>> >> > thanks.
>>>>>>> >> > Kishore G
>>>>>>> >> >
>>>>>>> >> > On Sun, Feb 15, 2015 at 4:36 PM, [email protected] <[email protected]> wrote:
>>>>>>> >> >
>>>>>>> >> >> Looks like the same problem we encountered recently.
>>>>>>> >> >>
>>>>>>> >> >> Regards,
>>>>>>> >> >> Vlad
>>>>>>> >> >> On Feb 15, 2015 4:35 PM, "kishore g" <[email protected]> wrote:
>>>>>>> >> >>
>>>>>>> >> >> > Steph described this problem on IRC.
>>>>>>> >> >> >
>>>>>>> >> >> > He is using 0.7.1. On connecting to cluster he gets this NPE
>>>>>>> >> >> >
>>>>>>> >> >> > http://pastebin.com/YE3fwK5i
>>>>>>> >> >> >
>>>>>>> >> >> > java.lang.NullPointerException
>>>>>>> >> >> >     at org.apache.helix.messaging.handling.HelixTaskExecutor.createMessageHandler(HelixTaskExecutor.java:661)
>>>>>>> >> >> >     at org.apache.helix.messaging.handling.HelixTaskExecutor.onMessage(HelixTaskExecutor.java:581)
>>>>>>> >> >> >     at org.apache.helix.manager.zk.ZkCallbackHandler.invoke(ZkCallbackHandler.java:202)
>>>>>>> >> >> >     at org.apache.helix.manager.zk.ZkCallbackHandler.init(ZkCallbackHandler.java:336)
>>>>>>> >> >> >     at org.apache.helix.manager.zk.ZkCallbackHandler.<init>(ZkCallbackHandler.java:130)
>>>>>>> >> >> >     at org.apache.helix.manager.zk.ZkHelixConnection.addListener(ZkHelixConnection.java:533)
>>>>>>> >> >> >     at org.apache.helix.manager.zk.ZkHelixConnection.addMessageListener(ZkHelixConnection.java:267)
>>>>>>> >> >> >     at org.apache.helix.manager.zk.ZkHelixParticipant.setupMsgHandler(ZkHelixParticipant.java:347)
>>>>>>> >> >> >     at org.apache.helix.manager.zk.ZkHelixParticipant.init(ZkHelixParticipant.java:383)
>>>>>>> >> >> >     at org.apache.helix.manager.zk.ZkHelixParticipant.onConnected(ZkHelixParticipant.java:401)
>>>>>>> >> >> >     at org.apache.helix.manager.zk.ZkHelixParticipant.start(ZkHelixParticipant.java:428)
>>>>>>> >> >> >     at com.example.ProtostuffServerNode.spinUpParticipant(ProtostuffServerNode.java:134)
>>>>>>> >> >> >
>>>>>>> >> >> > Here is his connection code.
>>>>>>> >> >> >
>>>>>>> >> >> > http://pastebin.com/QRfVU1tc
>>>>>>> >> >> >
>>>>>>> >> >> > private static HelixParticipant spinUpParticipant(HelixAdmin admin, ParticipantId participantId) {
>>>>>>> >> >> >     LOGGER.info("Starting up "+participantId);
>>>>>>> >> >> >     HelixConnection connection = new ZkHelixConnection(ZK_ADDRESS);
>>>>>>> >> >> >     connection.connect();
>>>>>>> >> >> >     HelixParticipant participant = connection.createParticipant(CLUSTER_ID, participantId);
>>>>>>> >> >> >     StateMachineEngine stateMach = participant.getStateMachineEngine();
>>>>>>> >> >> >
>>>>>>> >> >> >     StateTransitionHandlerFactory<LocalTransitionHandler> transitionHandlerFactory = new OnlineOfflineHandlerFactory();
>>>>>>> >> >> >     stateMach.registerStateModelFactory(STATE_MODEL_NAME, transitionHandlerFactory);
>>>>>>> >> >> >     participant.start();
>>>>>>> >> >> >
>>>>>>> >> >> >     admin.enableInstance(CLUSTER_NAME, participantId.toString(), true);
>>>>>>> >> >> >
>>>>>>> >> >> >     return participant;
>>>>>>> >> >> > }
>>>>>>> >> >> >
>>>>>>> >> >> > Adding 3s sleep after registerStateModelFactory works. Any idea what is
>>>>>>> >> >> > happening.
>>>>>>> >> >> >
>>>>>>> >> >> > thanks,
>>>>>>> >> >> > Kishore G
