Hi Varun,

I missed the conversation on IRC. You could create a JIRA at: https://issues.apache.org/jira/browse/HELIX
and attach the ZK log to it. We will be able to figure it out.

Thanks,
Zhen

________________________________
From: Zhen Zhang [[email protected]]
Sent: Monday, November 17, 2014 5:16 PM
To: [email protected]
Subject: RE: Helix issue - External View out of sync

Hi Varun,

You can join us on freenode IRC: http://helix.apache.org/IRC.html

Thanks,
Zhen

________________________________
From: Varun Sharma [[email protected]]
Sent: Monday, November 17, 2014 5:08 PM
To: [email protected]
Subject: Re: Helix issue - External View out of sync

I looked at the logs and GC was fine, as the system was processing other events around the same time. Is there anything else specifically I should look for in the logs? Is there a way to find out whether a node was removed from the cluster due to a ZK issue?

Thanks!
Varun

On Mon, Nov 17, 2014 at 4:32 PM, Varun Sharma <[email protected]> wrote:

I am wondering how a partition could be in the ONLINE state for a resource that was newly created.

Thanks
Varun

On Mon, Nov 17, 2014 at 4:31 PM, Varun Sharma <[email protected]> wrote:

I am using 0.6.4. In this case, I created a resource and set its ideal state, and the partitions onlined themselves. It seems that this node opened a whole bunch of other partitions at around the same time (about 30) but failed to open 3-4 partitions. This was for a brand new resource I created.

Thanks!
Varun

On Mon, Nov 17, 2014 at 4:24 PM, kishore g <[email protected]> wrote:

One suggestion is to check for GC pauses on the nodes. Nodes lose their cluster membership if they go into a long GC pause or start flapping. That might be the cause of the state mismatch. However, the external view must be up to date. It might help if you can attach the controller logs and node logs.

On Mon, Nov 17, 2014 at 4:10 PM, Varun Sharma <[email protected]> wrote:

Hi,

I am seeing the following issue for many partitions in Helix using a simple Online->Offline state model factory. The external view says that the partition has been assigned to 3 hosts. However, when I look at the hosts, only 1 of them executed the OFFLINE --> ONLINE transition. On the hosts that did not execute the transition, I see the following:

2014-11-13 09:29:54,394 [pool-3-thread-11] (HelixStateTransitionHandler.java:206) WARN Force CurrentState on Zk to be stateModel's CurrentState. partitionKey: 490, currentState: ONLINE, message: 12690ce8-8098-46b1-a93d-279604f0e3db, {CREATE_TIMESTAMP=1415870993349, ClusterEventName=idealStateChange, EXECUTE_START_TIMESTAMP=1415870994382, EXE_SESSION_ID=149a14ada0d0013, FROM_STATE=OFFLINE, MSG_ID=12690ce8-8098-46b1-a93d-279604f0e3db, MSG_STATE=read, MSG_TYPE=STATE_TRANSITION, PARTITION_NAME=490, READ_TIMESTAMP=1415870993787, RESOURCE_NAME=$terrapin$data$meta_pin_join$1415866960201, SRC_NAME=hdfsterrapin-a-namenode001_9090, SRC_SESSION_ID=147a7beb2dd8ed7, STATE_MODEL_DEF=OnlineOffline, STATE_MODEL_FACTORY_NAME=DEFAULT, TGT_NAME=hdfsterrapin-a-datanode-ba3ad256, TGT_SESSION_ID=149a14ada0d0013, TO_STATE=ONLINE}{}{}

When I grep for the message ID in the controller log, I see the following:

2014-11-14 09:34:56,265 [StatusDumpTimerTask] (ZKPathDataDumpTask.java:155) INFO { "id" : "149a14ada0d0013__$terrapin$data$meta_pin_join$1415866960201", "mapFields" : { "HELIX_ERROR 20141113-092954.000419 STATE_TRANSITION c1193025-b416-49d7-adc2-10afe2389141" : { "AdditionalInfo" : "Message execution failed. msgId: 12690ce8-8098-46b1-a93d-279604f0e3db, errorMsg: org.apache.helix.messaging.handling.HelixStateTransitionHandler$HelixStateMismatchException: Current state of stateModel does not match the fromState in Message, Current State:ONLINE, message expected:OFFLINE, partition: 490, from: hdfsterrapin-a-namenode001_9090, to: hdfsterrapin-a-datanode-ba3ad256", "Class" : "class org.apache.helix.messaging.handling.HelixStateTransitionHandler", "MSG_ID" : "12690ce8-8098-46b1-a93d-279604f0e3db", "Message state" : "READ" },

What could be causing this state mismatch? When I restart the node, the error disappears (meaning that the node is able to perform the state transition).

Thanks
Varun
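
For anyone reading this thread later: the error text in the controller log describes a check of the message's FROM_STATE against the state the participant already holds for that partition. Below is a minimal plain-Java sketch of that kind of check. It is not Helix's implementation; the class names (StateMismatchSketch, TransitionMessage, Participant, StateMismatchException) are hypothetical, and it assumes the participant tracks per-partition state in memory and starts every unseen partition as OFFLINE. It does not explain why the state model already reported ONLINE for a newly created resource; it only shows why a message with a stale FROM_STATE would be rejected, and why a fresh process would accept the same message.

```java
import java.util.HashMap;
import java.util.Map;

public class StateMismatchSketch {

    /** Hypothetical stand-in for a state transition message; not Helix's Message class. */
    static final class TransitionMessage {
        final String partition;
        final String fromState;
        final String toState;

        TransitionMessage(String partition, String fromState, String toState) {
            this.partition = partition;
            this.fromState = fromState;
            this.toState = toState;
        }
    }

    /** Hypothetical exception mirroring the mismatch reported in the controller log. */
    static final class StateMismatchException extends RuntimeException {
        StateMismatchException(String message) {
            super(message);
        }
    }

    /** Hypothetical participant that tracks per-partition state in memory only. */
    static final class Participant {
        private final Map<String, String> currentStates = new HashMap<>();

        String currentState(String partition) {
            // Assumption: a partition this process has never seen starts as OFFLINE.
            return currentStates.getOrDefault(partition, "OFFLINE");
        }

        void handle(TransitionMessage msg) {
            String current = currentState(msg.partition);
            // Reject the message if its from-state disagrees with the in-memory state.
            if (!current.equals(msg.fromState)) {
                throw new StateMismatchException(
                    "Current State:" + current
                        + ", message expected:" + msg.fromState
                        + ", partition: " + msg.partition);
            }
            currentStates.put(msg.partition, msg.toState);
        }
    }

    public static void main(String[] args) {
        Participant participant = new Participant();
        TransitionMessage msg = new TransitionMessage("490", "OFFLINE", "ONLINE");

        participant.handle(msg); // in-memory state for "490" is now ONLINE
        try {
            participant.handle(msg); // same from-state again, so it no longer matches
        } catch (StateMismatchException e) {
            System.out.println("Rejected: " + e.getMessage());
        }

        // A restarted process has empty in-memory state, so the same message is accepted.
        Participant restarted = new Participant();
        restarted.handle(msg);
        System.out.println("After restart: " + restarted.currentState("490"));
    }
}
```

In this sketch the second delivery fails only because the in-memory state for key "490" is already ONLINE. Whether something similar happened on the affected hosts would need to be confirmed from the controller and ZK logs.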
