I don't think it's fatal. When the NPE happens, the messages are marked as UNPROCESSABLE and removed. All state transitions should still happen once the message handler factory is registered later, because the controller will resend all transitions. The error messages are harmless.
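To make the ordering problem discussed in this thread concrete, here is a minimal standalone sketch. It is not Helix code: `MiniExecutor`, `MiniMessageQueue`, and `start` are hypothetical names invented for illustration. The only behavior it borrows from the stack trace in this thread is that attaching the callback immediately delivers any messages that are already pending. It also logs-and-skips a message with no registered factory rather than throwing, which is the behavior proposed in the thread, not what 0.7.1 does.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class HandlerOrderingSketch {

    /** Hypothetical stand-in for a task executor: maps message types to handler factories. */
    static class MiniExecutor {
        private final Map<String, Function<String, String>> factories = new HashMap<>();
        final List<String> processed = new ArrayList<>();
        final List<String> skipped = new ArrayList<>();

        void registerFactory(String type, Function<String, String> factory) {
            factories.put(type, factory);
        }

        void onMessage(String type, String payload) {
            Function<String, String> factory = factories.get(type);
            if (factory == null) {
                // Assumption: skip with a warning instead of dereferencing null (no NPE).
                skipped.add(type);
                return;
            }
            processed.add(factory.apply(payload));
        }
    }

    /** Hypothetical message source: subscribing delivers all pending messages right away. */
    static class MiniMessageQueue {
        private final List<String[]> pending = new ArrayList<>();

        void enqueue(String type, String payload) {
            pending.add(new String[] {type, payload});
        }

        void subscribe(MiniExecutor executor) {
            for (String[] message : pending) {
                executor.onMessage(message[0], message[1]);
            }
            pending.clear();
        }
    }

    /** Starts a toy participant with one pending message, in either ordering. */
    static MiniExecutor start(boolean registerBeforeSubscribe) {
        MiniMessageQueue queue = new MiniMessageQueue();
        queue.enqueue("STATE_TRANSITION", "OFFLINE->ONLINE");

        MiniExecutor executor = new MiniExecutor();
        Function<String, String> factory = payload -> "handled " + payload;

        if (registerBeforeSubscribe) {
            // Proposed ordering: register every factory, then attach the callback last.
            executor.registerFactory("STATE_TRANSITION", factory);
            queue.subscribe(executor);
        } else {
            // Racy ordering: the callback fires before any factory exists.
            queue.subscribe(executor);
            executor.registerFactory("STATE_TRANSITION", factory);
        }
        return executor;
    }

    public static void main(String[] args) {
        System.out.println("callback first: processed=" + start(false).processed.size());
        System.out.println("register first: processed=" + start(true).processed.size());
    }
}
```

With the callback attached first, the pending message is skipped; with registration first, it is handled. In the sketch the skipped message is simply lost, whereas in Helix the controller resends the transitions.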
I also tried dropping an instance, and it seems to work fine. When dropping an instance, remember to first disable the instance and then stop it; otherwise, some state may remain in ZooKeeper.

On Feb 16, 2015 11:36 AM, "kishore g" <[email protected]> wrote:

> Is there any workaround for this, and is this fatal as Vlad mentioned?
>
> On Mon, Feb 16, 2015 at 10:28 AM, Zhen Zhang <[email protected]> wrote:
>
> > There is a timing issue in ZkHelixParticipant#setupMsgHandler(). We should
> > hook up the ZK callback (line 347 in
> > https://github.com/apache/helix/blob/master/helix-core/src/main/java/org/apache/helix/manager/zk/ZkHelixParticipant.java)
> > after all message handler registrations are done (line 354 in
> > https://github.com/apache/helix/blob/master/helix-core/src/main/java/org/apache/helix/manager/zk/ZkHelixParticipant.java).
> > The fix is to move adding the ZK callback to the end. I will add a test case
> > that can reliably reproduce this issue.
> >
> > Thanks,
> > Zhen
> >
> > On Sun, Feb 15, 2015 at 11:45 PM, Zhen Zhang <[email protected]> wrote:
> >
> >> There might be some race conditions. I need to double check this.
> >>
> >> On Feb 15, 2015 11:38 PM, "Steph Meslin-Weber" <[email protected]> wrote:
> >>
> >>> Hi Kishore,
> >>>
> >>> That's right, the node doesn't process any state transitions. They
> >>> should have been logged in the first set of logs had they occurred.
> >>>
> >>> Thanks,
> >>> Steph
> >>>
> >>> On 16 Feb 2015 07:28, "kishore g" <[email protected]> wrote:
> >>>
> >>>> Hi Steph,
> >>>>
> >>>> When the NPE occurs, do you get the state transition callbacks?
> >>>>
> >>>> thanks,
> >>>> Kishore G
> >>>>
> >>>> On Sun, Feb 15, 2015 at 11:23 PM, Steph Meslin-Weber <[email protected]> wrote:
> >>>>
> >>>>> Unfortunately it appears that when the NPE occurs, dropping the
> >>>>> participant no longer cleans up the related INSTANCE node. Perhaps
> >>>>> some state is lost?
> >>>>>
> >>>>> Thanks,
> >>>>> Steph
> >>>>>
> >>>>> On 16 Feb 2015 06:52, "Zhen Zhang" <[email protected]> wrote:
> >>>>>
> >>>>>> I think the NPE is not fatal. It happens when no message handler
> >>>>>> factory is registered for this message type. The message will not be
> >>>>>> removed and will remain in the UNREAD state. Later, when the message
> >>>>>> handler factory is registered via
> >>>>>> DefaultMessagingService#registerMessageHandlerFactory, we will send a
> >>>>>> NOP message, which will in turn trigger HelixTaskExecutor to process
> >>>>>> all UNREAD messages. We should definitely fix this by logging a
> >>>>>> warning message instead of throwing an NPE.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Jason
> >>>>>>
> >>>>>> On Sun, Feb 15, 2015 at 7:30 PM, kishore g <[email protected]> wrote:
> >>>>>>
> >>>>>>> Controller assuming the state transition occurred is even more
> >>>>>>> dangerous.
> >>>>>>>
> >>>>>>> On Sun, Feb 15, 2015 at 7:18 PM, [email protected] <[email protected]> wrote:
> >>>>>>>
> >>>>>>>> In my experience it was fatal. The callback would not be called, but
> >>>>>>>> the controller would somehow assume the state transition occurred.
> >>>>>>>>
> >>>>>>>> On Feb 15, 2015 7:13 PM, "kishore g" <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>> > Thanks Vlad. That explains the problem. That also explains why
> >>>>>>>> > adding a 3-second sleep works.
> >>>>>>>> >
> >>>>>>>> > Jason, is this exception fatal? Will the message be processed
> >>>>>>>> > again after the handler is added?
> >>>>>>>> >
> >>>>>>>> > thanks,
> >>>>>>>> > Kishore G
> >>>>>>>> >
> >>>>>>>> > On Sun, Feb 15, 2015 at 6:41 PM, [email protected] <[email protected]> wrote:
> >>>>>>>> >
> >>>>>>>> >> https://issues.apache.org/jira/browse/HELIX-548
> >>>>>>>> >>
> >>>>>>>> >> On Feb 15, 2015 6:38 PM, "kishore g" <[email protected]> wrote:
> >>>>>>>> >>
> >>>>>>>> >> > Hi Vlad,
> >>>>>>>> >> >
> >>>>>>>> >> > Was there any jira associated with it?
> >>>>>>>> >> >
> >>>>>>>> >> > thanks.
> >>>>>>>> >> > Kishore G
> >>>>>>>> >> >
> >>>>>>>> >> > On Sun, Feb 15, 2015 at 4:36 PM, [email protected] <[email protected]> wrote:
> >>>>>>>> >> >
> >>>>>>>> >> >> Looks like the same problem we encountered recently.
> >>>>>>>> >> >>
> >>>>>>>> >> >> Regards,
> >>>>>>>> >> >> Vlad
> >>>>>>>> >> >>
> >>>>>>>> >> >> On Feb 15, 2015 4:35 PM, "kishore g" <[email protected]> wrote:
> >>>>>>>> >> >>
> >>>>>>>> >> >> > Steph described this problem on IRC.
> >>>>>>>> >> >> >
> >>>>>>>> >> >> > He is using 0.7.1. On connecting to the cluster he gets this NPE:
> >>>>>>>> >> >> >
> >>>>>>>> >> >> > http://pastebin.com/YE3fwK5i
> >>>>>>>> >> >> >
> >>>>>>>> >> >> > java.lang.NullPointerException
> >>>>>>>> >> >> >     at org.apache.helix.messaging.handling.HelixTaskExecutor.createMessageHandler(HelixTaskExecutor.java:661)
> >>>>>>>> >> >> >     at org.apache.helix.messaging.handling.HelixTaskExecutor.onMessage(HelixTaskExecutor.java:581)
> >>>>>>>> >> >> >     at org.apache.helix.manager.zk.ZkCallbackHandler.invoke(ZkCallbackHandler.java:202)
> >>>>>>>> >> >> >     at org.apache.helix.manager.zk.ZkCallbackHandler.init(ZkCallbackHandler.java:336)
> >>>>>>>> >> >> >     at org.apache.helix.manager.zk.ZkCallbackHandler.<init>(ZkCallbackHandler.java:130)
> >>>>>>>> >> >> >     at org.apache.helix.manager.zk.ZkHelixConnection.addListener(ZkHelixConnection.java:533)
> >>>>>>>> >> >> >     at org.apache.helix.manager.zk.ZkHelixConnection.addMessageListener(ZkHelixConnection.java:267)
> >>>>>>>> >> >> >     at org.apache.helix.manager.zk.ZkHelixParticipant.setupMsgHandler(ZkHelixParticipant.java:347)
> >>>>>>>> >> >> >     at org.apache.helix.manager.zk.ZkHelixParticipant.init(ZkHelixParticipant.java:383)
> >>>>>>>> >> >> >     at org.apache.helix.manager.zk.ZkHelixParticipant.onConnected(ZkHelixParticipant.java:401)
> >>>>>>>> >> >> >     at org.apache.helix.manager.zk.ZkHelixParticipant.start(ZkHelixParticipant.java:428)
> >>>>>>>> >> >> >     at com.example.ProtostuffServerNode.spinUpParticipant(ProtostuffServerNode.java:134)
> >>>>>>>> >> >> >
> >>>>>>>> >> >> > Here is his connection code.
> >>>>>>>> >> >> >
> >>>>>>>> >> >> > http://pastebin.com/QRfVU1tc
> >>>>>>>> >> >> >
> >>>>>>>> >> >> > private static HelixParticipant spinUpParticipant(HelixAdmin admin, ParticipantId participantId) {
> >>>>>>>> >> >> >     LOGGER.info("Starting up "+participantId);
> >>>>>>>> >> >> >     HelixConnection connection = new ZkHelixConnection(ZK_ADDRESS);
> >>>>>>>> >> >> >     connection.connect();
> >>>>>>>> >> >> >     HelixParticipant participant = connection.createParticipant(CLUSTER_ID, participantId);
> >>>>>>>> >> >> >     StateMachineEngine stateMach = participant.getStateMachineEngine();
> >>>>>>>> >> >> >
> >>>>>>>> >> >> >     StateTransitionHandlerFactory<LocalTransitionHandler> transitionHandlerFactory = new OnlineOfflineHandlerFactory();
> >>>>>>>> >> >> >     stateMach.registerStateModelFactory(STATE_MODEL_NAME, transitionHandlerFactory);
> >>>>>>>> >> >> >     participant.start();
> >>>>>>>> >> >> >
> >>>>>>>> >> >> >     admin.enableInstance(CLUSTER_NAME, participantId.toString(), true);
> >>>>>>>> >> >> >
> >>>>>>>> >> >> >     return participant;
> >>>>>>>> >> >> > }
> >>>>>>>> >> >> >
> >>>>>>>> >> >> > Adding a 3s sleep after registerStateModelFactory works. Any
> >>>>>>>> >> >> > idea what is happening?
> >>>>>>>> >> >> >
> >>>>>>>> >> >> > thanks,
> >>>>>>>> >> >> > Kishore G
