I looked at the logs and GC was fine, as the system was processing other events around the same time.
Is there anything else specifically I should look for in the logs? Is there a way to find out whether a node was removed from the cluster due to a ZK issue?

Thanks!
Varun

On Mon, Nov 17, 2014 at 4:32 PM, Varun Sharma <[email protected]> wrote:

> I am wondering how come a partition was in the online state for a resource
> that was newly created.
>
> Thanks
> Varun
>
> On Mon, Nov 17, 2014 at 4:31 PM, Varun Sharma <[email protected]> wrote:
>
>> I am using 0.6.4. In this case, I created a resource and set its ideal
>> state and the partitions onlined themselves. It seems that node opened
>> a whole bunch of other partitions at around the same time (~30 or so)
>> but failed to open 3-4 partitions. This was for a brand new resource I
>> created.
>>
>> Thanks!
>> Varun
>>
>> On Mon, Nov 17, 2014 at 4:24 PM, kishore g <[email protected]> wrote:
>>
>>> One suggestion is to check for GC pauses on the nodes. Nodes lose
>>> their cluster membership if they get into a long GC or start flapping.
>>> That might be the cause of the state mismatch. However, the external
>>> view must be up to date. It might help if you can attach the
>>> controller logs and node logs.
>>>
>>> On Mon, Nov 17, 2014 at 4:10 PM, Varun Sharma <[email protected]>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am seeing the following issue for many partitions in helix using a
>>>> simple Online->Offline state model factory. The external view says
>>>> that the partition has been assigned to 3 hosts. However, when I look
>>>> at the hosts, only 1 of them executed the OFFLINE --> ONLINE
>>>> transition.
>>>>
>>>> On the hosts that did not execute the transition, I see the
>>>> following:
>>>>
>>>> 2014-11-13 09:29:54,394 [pool-3-thread-11]
>>>> (HelixStateTransitionHandler.java:206) WARN *Force CurrentState on Zk
>>>> to be stateModel's CurrentState*. *partitionKey: 490*, currentState:
>>>> ONLINE, message: 12690ce8-8098-46b1-a93d-279604f0e3db,
>>>> {CREATE_TIMESTAMP=1415870993349, ClusterEventName=idealStateChange,
>>>> EXECUTE_START_TIMESTAMP=1415870994382, EXE_SESSION_ID=149a14ada0d0013,
>>>> FROM_STATE=OFFLINE, MSG_ID=*12690ce8-8098-46b1-a93d-279604f0e3db*,
>>>> MSG_STATE=read, MSG_TYPE=STATE_TRANSITION, PARTITION_NAME=490,
>>>> READ_TIMESTAMP=1415870993787,
>>>> RESOURCE_NAME=$terrapin$data$meta_pin_join$1415866960201,
>>>> SRC_NAME=hdfsterrapin-a-namenode001_9090, SRC_SESSION_ID=147a7beb2dd8ed7,
>>>> STATE_MODEL_DEF=OnlineOffline, STATE_MODEL_FACTORY_NAME=DEFAULT,
>>>> TGT_NAME=hdfsterrapin-a-datanode-ba3ad256, TGT_SESSION_ID=149a14ada0d0013,
>>>> TO_STATE=ONLINE}{}{}
>>>>
>>>> When I grep for the message ID in the controller, I see the following:
>>>>
>>>> 2014-11-14 09:34:56,265 [StatusDumpTimerTask]
>>>> (ZKPathDataDumpTask.java:155) INFO {
>>>>   "id" : "149a14ada0d0013__$terrapin$data$meta_pin_join$1415866960201",
>>>>   "mapFields" : {
>>>>     "HELIX_ERROR 20141113-092954.000419 STATE_TRANSITION
>>>>     c1193025-b416-49d7-adc2-10afe2389141" : {
>>>>       "AdditionalInfo" : "Message execution failed. msgId:
>>>>       12690ce8-8098-46b1-a93d-279604f0e3db, errorMsg:
>>>>       org.apache.helix.messaging.handling.*HelixStateTransitionHandler$HelixStateMismatchException*:
>>>>       Current state of stateModel does not match the fromState in
>>>>       Message, Current State:ONLINE, message expected:OFFLINE,
>>>>       partition: 490, from: hdfsterrapin-a-namenode001_9090, to:
>>>>       hdfsterrapin-a-datanode-ba3ad256",
>>>>       "Class" : "class
>>>>       org.apache.helix.messaging.handling.HelixStateTransitionHandler",
>>>>       "MSG_ID" : "12690ce8-8098-46b1-a93d-279604f0e3db",
>>>>       "Message state" : "READ"
>>>>     },
>>>>
>>>> What could be causing this? When I restart the node, the error
>>>> disappears (meaning that the node is able to perform the state
>>>> transition). What could be causing this state mismatch?
>>>>
>>>> Thanks
>>>>
>>>> Varun
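For anyone reading the WARN line above, here is a rough sketch of how the key=value map in it could be pulled apart to compare the message timestamps and session IDs. It is only a log-reading helper in plain Python, not anything from Helix itself. The names `parse_fields` and `summarize` are made up for illustration, and the field map is copied from the quoted log line, trimmed to the fields used here.

```python
import re
from datetime import datetime, timezone

# Field map copied from the WARN line quoted in this thread (trimmed).
LOG_FIELDS = (
    "{CREATE_TIMESTAMP=1415870993349, ClusterEventName=idealStateChange, "
    "EXECUTE_START_TIMESTAMP=1415870994382, EXE_SESSION_ID=149a14ada0d0013, "
    "FROM_STATE=OFFLINE, MSG_STATE=read, PARTITION_NAME=490, "
    "READ_TIMESTAMP=1415870993787, SRC_SESSION_ID=147a7beb2dd8ed7, "
    "TGT_SESSION_ID=149a14ada0d0013, TO_STATE=ONLINE}"
)


def parse_fields(text):
    """Return the KEY=VALUE pairs found in the first {...} group as a dict."""
    body = re.search(r"\{(.*?)\}", text).group(1)
    return dict(pair.split("=", 1) for pair in body.split(", "))


def summarize(fields):
    """Compute message delays (ms) and check the target/executing sessions."""
    created = int(fields["CREATE_TIMESTAMP"])
    read = int(fields["READ_TIMESTAMP"])
    started = int(fields["EXECUTE_START_TIMESTAMP"])
    return {
        # Timestamps are epoch milliseconds; render the creation time in UTC.
        "created_utc": datetime.fromtimestamp(
            created / 1000, tz=timezone.utc
        ).isoformat(),
        "read_delay_ms": read - created,
        "execute_delay_ms": started - read,
        # True when the message ran in the session it was addressed to.
        "session_matches": fields["TGT_SESSION_ID"] == fields["EXE_SESSION_ID"],
    }


if __name__ == "__main__":
    for key, value in summarize(parse_fields(LOG_FIELDS)).items():
        print(f"{key}: {value}")
```

For this message the target and executing session IDs are identical and the gaps between create, read, and execute are well under a second. My assumption, not something confirmed in this thread, is that a session change between send and execute would show up as a mismatch between those two IDs, so running this over the WARN lines for the failed partitions might be a quick way to rule that out.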
