Is there a workaround for this, and is it fatal, as Vlad mentioned?

On Mon, Feb 16, 2015 at 10:28 AM, Zhen Zhang <[email protected]> wrote:
> There is a timing issue in ZkHelixParticipant#setupMsgHandler(). We should
> hook up ZK callback (line 347 in
> https://github.com/apache/helix/blob/master/helix-core/src/main/java/org/apache/helix/manager/zk/ZkHelixParticipant.java)
> after all message handler registrations are done (line 354 in
> https://github.com/apache/helix/blob/master/helix-core/src/main/java/org/apache/helix/manager/zk/ZkHelixParticipant.java).
> Fix is to move adding ZK callback to the end. Will add a test case that can
> reliably reproduce this issue.
>
> Thanks,
> Zhen
>
> On Sun, Feb 15, 2015 at 11:45 PM, Zhen Zhang <[email protected]> wrote:
>
>> might be some race conditions. need to double check this.
>> On Feb 15, 2015 11:38 PM, "Steph Meslin-Weber" <[email protected]>
>> wrote:
>>
>>> Hi Kishore,
>>>
>>> That's right, the node doesn't process any state transitions. They
>>> should have been logged in the first set of logs had they occurred.
>>>
>>> Thanks,
>>> Steph
>>> On 16 Feb 2015 07:28, "kishore g" <[email protected]> wrote:
>>>
>>>> Hi Steph,
>>>>
>>>> When the NPE occurs, do you get the state transition callbacks?
>>>>
>>>> thanks,
>>>> Kishore G
>>>>
>>>> On Sun, Feb 15, 2015 at 11:23 PM, Steph Meslin-Weber <
>>>> [email protected]> wrote:
>>>>
>>>>> Unfortunately it appears that when the NPE occurs, dropping the
>>>>> participant no longer cleans up the related INSTANCE node. Perhaps some
>>>>> state is lost?
>>>>>
>>>>> Thanks,
>>>>> Steph
>>>>> On 16 Feb 2015 06:52, "Zhen Zhang" <[email protected]> wrote:
>>>>>
>>>>>> I think the NPE is not fatal. It happens when no message handler
>>>>>> factory is registered for this message type. The message will not be
>>>>>> removed and remain in UNREAD state. Later when the message handler
>>>>>> factory is registered via:
>>>>>> DefaultMessagingService#registerMessageHandlerFactory, we will send a
>>>>>> NOP message, which will in turn trigger HelixTaskExecutor to process all
>>>>>> UNREAD messages.
>>>>>> We should definitely fix this by logging a warning message
>>>>>> instead of throwing an NPE.
>>>>>>
>>>>>> Thanks,
>>>>>> Jason
>>>>>>
>>>>>> On Sun, Feb 15, 2015 at 7:30 PM, kishore g <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Controller assuming the state transition occurred is even more
>>>>>>> dangerous.
>>>>>>>
>>>>>>> On Sun, Feb 15, 2015 at 7:18 PM, [email protected] <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> In my experience it was fatal. The callback would not be called but
>>>>>>>> the controller would somehow assume the state transition occurred.
>>>>>>>> On Feb 15, 2015 7:13 PM, "kishore g" <[email protected]> wrote:
>>>>>>>>
>>>>>>>> > Thanks Vlad. That explains the problem. That also explains how
>>>>>>>> > adding a sleep of 3 seconds works.
>>>>>>>> >
>>>>>>>> > Jason, is this exception fatal? Will the message be processed
>>>>>>>> > again after the handler is added?
>>>>>>>> >
>>>>>>>> > thanks,
>>>>>>>> > Kishore G
>>>>>>>> >
>>>>>>>> > On Sun, Feb 15, 2015 at 6:41 PM, [email protected] <
>>>>>>>> > [email protected]> wrote:
>>>>>>>> >
>>>>>>>> >> https://issues.apache.org/jira/browse/HELIX-548
>>>>>>>> >> On Feb 15, 2015 6:38 PM, "kishore g" <[email protected]>
>>>>>>>> >> wrote:
>>>>>>>> >>
>>>>>>>> >> > Hi Vlad,
>>>>>>>> >> >
>>>>>>>> >> > Was there any jira associated with it?
>>>>>>>> >> >
>>>>>>>> >> > thanks.
>>>>>>>> >> > Kishore G
>>>>>>>> >> >
>>>>>>>> >> > On Sun, Feb 15, 2015 at 4:36 PM, [email protected] <
>>>>>>>> >> > [email protected]> wrote:
>>>>>>>> >> >
>>>>>>>> >> >> Looks like the same problem we encountered recently.
>>>>>>>> >> >>
>>>>>>>> >> >> Regards,
>>>>>>>> >> >> Vlad
>>>>>>>> >> >> On Feb 15, 2015 4:35 PM, "kishore g" <[email protected]>
>>>>>>>> >> >> wrote:
>>>>>>>> >> >>
>>>>>>>> >> >> > Steph described this problem on IRC.
>>>>>>>> >> >> >
>>>>>>>> >> >> > He is using 0.7.1.
>>>>>>>> >> >> > On connecting to cluster he gets this NPE
>>>>>>>> >> >> >
>>>>>>>> >> >> > http://pastebin.com/YE3fwK5i
>>>>>>>> >> >> >
>>>>>>>> >> >> > java.lang.NullPointerException
>>>>>>>> >> >> >     at org.apache.helix.messaging.handling.HelixTaskExecutor.createMessageHandler(HelixTaskExecutor.java:661)
>>>>>>>> >> >> >     at org.apache.helix.messaging.handling.HelixTaskExecutor.onMessage(HelixTaskExecutor.java:581)
>>>>>>>> >> >> >     at org.apache.helix.manager.zk.ZkCallbackHandler.invoke(ZkCallbackHandler.java:202)
>>>>>>>> >> >> >     at org.apache.helix.manager.zk.ZkCallbackHandler.init(ZkCallbackHandler.java:336)
>>>>>>>> >> >> >     at org.apache.helix.manager.zk.ZkCallbackHandler.<init>(ZkCallbackHandler.java:130)
>>>>>>>> >> >> >     at org.apache.helix.manager.zk.ZkHelixConnection.addListener(ZkHelixConnection.java:533)
>>>>>>>> >> >> >     at org.apache.helix.manager.zk.ZkHelixConnection.addMessageListener(ZkHelixConnection.java:267)
>>>>>>>> >> >> >     at org.apache.helix.manager.zk.ZkHelixParticipant.setupMsgHandler(ZkHelixParticipant.java:347)
>>>>>>>> >> >> >     at org.apache.helix.manager.zk.ZkHelixParticipant.init(ZkHelixParticipant.java:383)
>>>>>>>> >> >> >     at org.apache.helix.manager.zk.ZkHelixParticipant.onConnected(ZkHelixParticipant.java:401)
>>>>>>>> >> >> >     at org.apache.helix.manager.zk.ZkHelixParticipant.start(ZkHelixParticipant.java:428)
>>>>>>>> >> >> >     at com.example.ProtostuffServerNode.spinUpParticipant(ProtostuffServerNode.java:134)
>>>>>>>> >> >> >
>>>>>>>> >> >> > Here is his connection code.
>>>>>>>> >> >> >
>>>>>>>> >> >> > http://pastebin.com/QRfVU1tc
>>>>>>>> >> >> >
>>>>>>>> >> >> > private static HelixParticipant spinUpParticipant(HelixAdmin admin,
>>>>>>>> >> >> >     ParticipantId participantId) {
>>>>>>>> >> >> >   LOGGER.info("Starting up "+participantId);
>>>>>>>> >> >> >   HelixConnection connection = new ZkHelixConnection(ZK_ADDRESS);
>>>>>>>> >> >> >   connection.connect();
>>>>>>>> >> >> >   HelixParticipant participant =
>>>>>>>> >> >> >       connection.createParticipant(CLUSTER_ID, participantId);
>>>>>>>> >> >> >   StateMachineEngine stateMach = participant.getStateMachineEngine();
>>>>>>>> >> >> >
>>>>>>>> >> >> >   StateTransitionHandlerFactory<LocalTransitionHandler> transitionHandlerFactory =
>>>>>>>> >> >> >       new OnlineOfflineHandlerFactory();
>>>>>>>> >> >> >   stateMach.registerStateModelFactory(STATE_MODEL_NAME, transitionHandlerFactory);
>>>>>>>> >> >> >   participant.start();
>>>>>>>> >> >> >
>>>>>>>> >> >> >   admin.enableInstance(CLUSTER_NAME, participantId.toString(), true);
>>>>>>>> >> >> >
>>>>>>>> >> >> >   return participant;
>>>>>>>> >> >> > }
>>>>>>>> >> >> >
>>>>>>>> >> >> > Adding 3s sleep after registerStateModelFactory works. Any idea what is
>>>>>>>> >> >> > happening.
>>>>>>>> >> >> >
>>>>>>>> >> >> > thanks,
>>>>>>>> >> >> > Kishore G
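
For anyone following the thread, the ordering problem Zhen describes and the warning-instead-of-NPE behaviour Jason suggests can be illustrated with a small toy model. This is only a sketch and not Helix code: `ToyExecutor`, `RegistrationOrderSketch`, and all of their methods are hypothetical names made up for illustration. It assumes, as Jason describes, that a message with no registered handler stays unread and is picked up again once a handler is registered.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: none of these classes are part of Helix.
class ToyExecutor {
    interface Handler {
        void handle(String payload);
    }

    private final Map<String, Handler> handlers = new HashMap<>();
    private final List<String[]> unread = new ArrayList<>(); // each entry: {type, payload}
    final List<String> warnings = new ArrayList<>();
    final List<String> processed = new ArrayList<>();

    // Registering a handler re-drives all unread messages
    // (a stand-in for the NOP message mentioned in the thread).
    void register(String type, Handler handler) {
        handlers.put(type, handler);
        List<String[]> pending = new ArrayList<>(unread);
        unread.clear();
        for (String[] m : pending) {
            onMessage(m[0], m[1]);
        }
    }

    // On arrival: if no handler exists, record a warning and keep the
    // message unread instead of failing with a NullPointerException.
    void onMessage(String type, String payload) {
        Handler handler = handlers.get(type);
        if (handler == null) {
            warnings.add("No handler registered for message type " + type);
            unread.add(new String[] {type, payload});
            return;
        }
        handler.handle(payload);
    }

    int unreadCount() {
        return unread.size();
    }
}

class RegistrationOrderSketch {
    // Order seen in the stack trace: the callback fires before any handler exists.
    static ToyExecutor callbackFirst() {
        ToyExecutor ex = new ToyExecutor();
        ex.onMessage("STATE_TRANSITION", "OFFLINE->ONLINE");
        ex.register("STATE_TRANSITION", p -> ex.processed.add(p));
        return ex;
    }

    // Proposed order: register handlers first, then let the callback deliver messages.
    static ToyExecutor handlersFirst() {
        ToyExecutor ex = new ToyExecutor();
        ex.register("STATE_TRANSITION", p -> ex.processed.add(p));
        ex.onMessage("STATE_TRANSITION", "OFFLINE->ONLINE");
        return ex;
    }

    public static void main(String[] args) {
        ToyExecutor a = callbackFirst();
        System.out.println("callback first: warnings=" + a.warnings.size()
                + ", processed=" + a.processed.size());
        ToyExecutor b = handlersFirst();
        System.out.println("handlers first: warnings=" + b.warnings.size()
                + ", processed=" + b.processed.size());
    }
}
```

In this toy model both orderings end up processing the message, but only the handlers-first ordering does so without a warning and without the message sitting unread in between. Whether the real 0.7.1 code recovers the same way is exactly what Vlad and Steph are questioning above, so this should not be read as a statement about actual Helix behaviour.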
