I shared the logs with Zhen using Google Drive.

On Tue, Nov 18, 2014 at 12:56 PM, kishore g <[email protected]> wrote:
> Did you try Dropbox or any other public file sharing service?
>
> On Tue, Nov 18, 2014 at 10:57 AM, Varun Sharma <[email protected]> wrote:
>
>> Hi Zhen,
>>
>> My logs are larger than 10M and jira does not allow me to attach them.
>> Also, gmail is not allowing me to send them over as it flags them as
>> "blocked for security reasons" - link here
>> <https://support.google.com/mail/answer/6590?hl=en> - Do you have any
>> other options to send over the file? I created HELIX-551 for this issue.
>>
>> Thanks
>> Varun
>>
>> On Mon, Nov 17, 2014 at 6:49 PM, Zhen Zhang <[email protected]> wrote:
>>
>>> Hi Varun, I missed the conversation on IRC. You could create a jira at:
>>> https://issues.apache.org/jira/browse/HELIX
>>>
>>> And attach the zk log in the jira. We will be able to figure it out.
>>>
>>> Thanks,
>>> Zhen
>>>
>>> ------------------------------
>>> *From:* Zhen Zhang [[email protected]]
>>> *Sent:* Monday, November 17, 2014 5:16 PM
>>> *To:* [email protected]
>>> *Subject:* RE: Helix issue - External View out of sync
>>>
>>> Hi, Varun, you can join us on freenode IRC:
>>> http://helix.apache.org/IRC.html
>>>
>>> Thanks,
>>> Zhen
>>>
>>> ------------------------------
>>> *From:* Varun Sharma [[email protected]]
>>> *Sent:* Monday, November 17, 2014 5:08 PM
>>> *To:* [email protected]
>>> *Subject:* Re: Helix issue - External View out of sync
>>>
>>> I looked at the logs and gc was fine as the system was processing
>>> other events around the same time.
>>>
>>> Is there anything else specifically I should look for in the logs? Is
>>> there a way to find out whether a node was removed from the cluster due to
>>> a ZK issue?
>>>
>>> Thanks!
>>> Varun
>>>
>>> On Mon, Nov 17, 2014 at 4:32 PM, Varun Sharma <[email protected]> wrote:
>>>
>>>> I am wondering how come a partition was in the online state for a
>>>> resource that was newly created.
>>>>
>>>> Thanks
>>>> Varun
>>>>
>>>> On Mon, Nov 17, 2014 at 4:31 PM, Varun Sharma <[email protected]> wrote:
>>>>
>>>>> I am using 0.6.4. In this case, I created a resource and set its ideal
>>>>> state and the partitions onlined themselves. It seems that this node
>>>>> opened a whole bunch of other partitions at around the same time (~30 or
>>>>> so) but failed to open 3-4 partitions. This was for a brand new resource
>>>>> I created.
>>>>>
>>>>> Thanks!
>>>>> Varun
>>>>>
>>>>> On Mon, Nov 17, 2014 at 4:24 PM, kishore g <[email protected]> wrote:
>>>>>
>>>>>> One suggestion is to check for GC pauses on the nodes. Nodes lose
>>>>>> their cluster membership if they get into a long GC or start flapping.
>>>>>> That might be the cause of the state mismatch. However, the external
>>>>>> view must be up to date. It might help if you can attach the controller
>>>>>> logs and node logs.
>>>>>>
>>>>>> On Mon, Nov 17, 2014 at 4:10 PM, Varun Sharma <[email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am seeing the following issue for many partitions in helix using
>>>>>>> a simple Online->Offline state model factory. The external view says
>>>>>>> that the partition has been assigned to 3 hosts. However, when I look
>>>>>>> at the hosts, only 1 of them executed the OFFLINE --> ONLINE transition.
>>>>>>>
>>>>>>> On the hosts that did not execute the transition, I see the
>>>>>>> following:
>>>>>>>
>>>>>>> 2014-11-13 09:29:54,394 [pool-3-thread-11]
>>>>>>> (HelixStateTransitionHandler.java:206) WARN *Force CurrentState on
>>>>>>> Zk to be stateModel's CurrentState*.
>>>>>>> *partitionKey: 490*,
>>>>>>> currentState: ONLINE, message: 12690ce8-8098-46b1-a93d-279604f0e3db,
>>>>>>> {CREATE_TIMESTAMP=1415870993349, ClusterEventName=idealStateChange,
>>>>>>> EXECUTE_START_TIMESTAMP=1415870994382, EXE_SESSION_ID=149a14ada0d0013,
>>>>>>> FROM_STATE=OFFLINE, MSG_ID=*12690ce8-8098-46b1-a93d-279604f0e3db*,
>>>>>>> MSG_STATE=read, MSG_TYPE=STATE_TRANSITION, PARTITION_NAME=490,
>>>>>>> READ_TIMESTAMP=1415870993787,
>>>>>>> RESOURCE_NAME=$terrapin$data$meta_pin_join$1415866960201,
>>>>>>> SRC_NAME=hdfsterrapin-a-namenode001_9090,
>>>>>>> SRC_SESSION_ID=147a7beb2dd8ed7,
>>>>>>> STATE_MODEL_DEF=OnlineOffline, STATE_MODEL_FACTORY_NAME=DEFAULT,
>>>>>>> TGT_NAME=hdfsterrapin-a-datanode-ba3ad256,
>>>>>>> TGT_SESSION_ID=149a14ada0d0013,
>>>>>>> TO_STATE=ONLINE}{}{}
>>>>>>>
>>>>>>> When I grep the message ID in the controller, I see the following:
>>>>>>>
>>>>>>> 2014-11-14 09:34:56,265 [StatusDumpTimerTask]
>>>>>>> (ZKPathDataDumpTask.java:155) INFO {
>>>>>>>
>>>>>>> "id" :
>>>>>>> "149a14ada0d0013__$terrapin$data$meta_pin_join$1415866960201",
>>>>>>>
>>>>>>> "mapFields" : {
>>>>>>>
>>>>>>> "HELIX_ERROR 20141113-092954.000419 STATE_TRANSITION
>>>>>>> c1193025-b416-49d7-adc2-10afe2389141" : {
>>>>>>>
>>>>>>> "AdditionalInfo" : "Message execution failed. msgId:
>>>>>>> 12690ce8-8098-46b1-a93d-279604f0e3db, errorMsg:
>>>>>>> org.apache.helix.messaging.handling.
>>>>>>> *HelixStateTransitionHandler$HelixStateMismatchException*: Current
>>>>>>> state of stateModel does not match the fromState in Message, Current
>>>>>>> State:ONLINE, message expected:OFFLINE, partition: 490, from:
>>>>>>> hdfsterrapin-a-namenode001_9090, to: hdfsterrapin-a-datanode-ba3ad256",
>>>>>>>
>>>>>>> "Class" : "class
>>>>>>> org.apache.helix.messaging.handling.HelixStateTransitionHandler",
>>>>>>>
>>>>>>> "MSG_ID" : "12690ce8-8098-46b1-a93d-279604f0e3db",
>>>>>>>
>>>>>>> "Message state" : "READ"
>>>>>>>
>>>>>>> },
>>>>>>>
>>>>>>> When I restart the node, the error disappears (meaning that the node
>>>>>>> is able to perform the state transition).
>>>>>>> What could be causing this state mismatch?
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> Varun
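
For anyone triaging the same warning across many partitions, here is a minimal sketch (plain Python, not part of Helix) that pulls the key=value record out of a "Force CurrentState on Zk" WARN line like the one quoted above and reports when the state model's current state differs from the message's FROM_STATE. The helper names are hypothetical, the sample line is abbreviated from the log in this thread, and the sketch assumes each log entry has been unwrapped onto a single line.

```python
import re

# Hypothetical helpers, not part of Helix. They only understand the
# "currentState: X, message: <id>, {KEY=VALUE, ...}" layout seen in the
# WARN line quoted in this thread.
FIELD_RE = re.compile(r"(\w+)=([^,}]*)")
STATE_RE = re.compile(r"currentState:\s*(\w+)")


def parse_transition_warning(line):
    """Return (current_state, fields) parsed from one unwrapped WARN line."""
    state_match = STATE_RE.search(line)
    start = line.find("{")
    if state_match is None or start == -1:
        raise ValueError("not a state transition warning line")
    end = line.find("}", start)
    record = line[start:end + 1] if end != -1 else line[start:]
    # Strip whitespace and the '*' emphasis markers the mail client added.
    fields = {k: v.strip().strip("*") for k, v in FIELD_RE.findall(record)}
    return state_match.group(1), fields


def describe_mismatch(line):
    """Return a summary string if currentState != FROM_STATE, else None."""
    current, fields = parse_transition_warning(line)
    if current == fields.get("FROM_STATE"):
        return None
    return "partition {} on {}: state model is {}, message expected {} -> {}".format(
        fields.get("PARTITION_NAME"),
        fields.get("TGT_NAME"),
        current,
        fields.get("FROM_STATE"),
        fields.get("TO_STATE"),
    )


# Abbreviated copy of the WARN line from this thread, joined onto one line.
SAMPLE = (
    "2014-11-13 09:29:54,394 [pool-3-thread-11] "
    "(HelixStateTransitionHandler.java:206) WARN Force CurrentState on Zk "
    "to be stateModel's CurrentState. partitionKey: 490, currentState: ONLINE, "
    "message: 12690ce8-8098-46b1-a93d-279604f0e3db, "
    "{FROM_STATE=OFFLINE, MSG_ID=*12690ce8-8098-46b1-a93d-279604f0e3db*, "
    "PARTITION_NAME=490, TGT_NAME=hdfsterrapin-a-datanode-ba3ad256, "
    "TO_STATE=ONLINE}{}{}"
)

if __name__ == "__main__":
    print(describe_mismatch(SAMPLE))
```

Running this over the participant log and grouping the results by TGT_NAME would show whether the mismatches cluster on particular nodes, which is one way to narrow down whether a single participant's state model is stale.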
