Re: Helix issue - External View out of sync

Varun Sharma Tue, 18 Nov 2014 15:17:08 -0800

Hmm, it seems that in my case the resources are not known in advance, and I
need to decommission and create resources on the fly as data comes in or gets
deleted. Is there a way around that?


Thanks
Varun
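
For illustration, the collision Zhen describes in the quoted reply below can
be modeled with a small toy registry. This is only a sketch, not Helix source
code: the class FactoryRegistrySketch, its methods, and the name "resourceB"
are hypothetical, and it assumes that a second registration under the same
(state model name, factory name) pair is rejected, which is what the quoted
warning suggests.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: NOT Helix source code. It mimics a registry keyed by
// (state model name, factory name) to show why a shared name collides.
class FactoryRegistrySketch {
    // stateModelName -> (factoryName -> factory); all names here are hypothetical.
    private final Map<String, Map<String, Object>> factories = new HashMap<>();

    // Assumption: a second registration for the same pair is rejected with a warning.
    boolean register(String stateModelName, Object factory, String factoryName) {
        Map<String, Object> byFactoryName =
            factories.computeIfAbsent(stateModelName, k -> new HashMap<>());
        if (byFactoryName.containsKey(factoryName)) {
            System.out.println("WARN: stateModelFactory for " + stateModelName
                + " using factoryName " + factoryName + " has already been registered.");
            return false;
        }
        byFactoryName.put(factoryName, factory);
        return true;
    }

    // Returns the factory registered for the pair, or null if there is none.
    Object lookup(String stateModelName, String factoryName) {
        Map<String, Object> byFactoryName = factories.get(stateModelName);
        return byFactoryName == null ? null : byFactoryName.get(factoryName);
    }

    public static void main(String[] args) {
        FactoryRegistrySketch registry = new FactoryRegistrySketch();
        Object factoryA = new Object(); // stand-in for the first resource's factory
        Object factoryB = new Object(); // stand-in for the second resource's factory

        registry.register("OnlineOffline", factoryA, "DEFAULT");
        registry.register("OnlineOffline", factoryB, "DEFAULT"); // rejected, prints WARN
        registry.register("OnlineOffline", factoryB, "resourceB"); // distinct name, accepted
        System.out.println(registry.lookup("OnlineOffline", "resourceB") == factoryB);
    }
}
```

In this toy model, the second factory only becomes reachable once it is
registered under its own name, which mirrors the advice to use the resource
name as the factory name.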

On Tue, Nov 18, 2014 at 3:06 PM, Zhen Zhang <[email protected]> wrote:

>  Hi Varun,
>
> Here is the problem. You are using ONLINE-OFFLINE state model for multiple
> resources, and in this case when you register state model factory, you need
> to use your resource name (e.g. $terrapin$data$meta_pin_join$1415866960201)
> as your factory name instead of using the default factory name (which is
> "DEFAULT"), something like this:
>
> helixManager.getStateMachineEngine().registerStateModelFactory("OnlineOffline",
> factory, "$terrapin$data$meta_pin_join$1415866960201")
>
> Otherwise, Helix can't distinguish the state model factories for the two
> different resources using the same state model and the same factory name.
> To confirm, you should have the following message in your participant log:
>
> WARN: "stateModelFactory for " + stateModelName + " using factoryName
> DEFAULT has already been registered."
>
> Let us know if this solves the problem.
>
> Thanks,
> Zhen
>
>  ------------------------------
> *From:* Varun Sharma [[email protected]]
> *Sent:* Tuesday, November 18, 2014 12:59 PM
>
> *To:* [email protected]
> *Subject:* Re: Helix issue - External View out of sync
>
>   I shared the logs with Zhen using Google Drive.
>
> On Tue, Nov 18, 2014 at 12:56 PM, kishore g <[email protected]> wrote:
>
>> Did you try Dropbox or any other public file sharing service?
>>
>> On Tue, Nov 18, 2014 at 10:57 AM, Varun Sharma <[email protected]>
>> wrote:
>>
>>> Hi Zhen,
>>>
>>>  My logs are > 10 MB and JIRA does not allow me to attach them. Also,
>>> Gmail is not allowing me to send them over as it flags them as "blocked for
>>> security reasons" - link here
>>> <https://support.google.com/mail/answer/6590?hl=en> - Do you have any
>>> other options to send over the file? I created HELIX-551 for this issue.
>>>
>>>  Thanks
>>>  Varun
>>>
>>> On Mon, Nov 17, 2014 at 6:49 PM, Zhen Zhang <[email protected]> wrote:
>>>
>>>>  Hi Varun, I missed the conversation on IRC. You could create a jira
>>>> at:
>>>> https://issues.apache.org/jira/browse/HELIX
>>>>
>>>> And attach the zk log in the jira. We will be able to figure it out.
>>>>
>>>> Thanks,
>>>> Zhen
>>>>
>>>>  ------------------------------
>>>> *From:* Zhen Zhang [[email protected]]
>>>> *Sent:* Monday, November 17, 2014 5:16 PM
>>>> *To:* [email protected]
>>>> *Subject:* RE: Helix issue - External View out of sync
>>>>
>>>>    Hi, Varun, you can join us on freenode IRC:
>>>> http://helix.apache.org/IRC.html
>>>>
>>>> Thanks,
>>>> Zhen
>>>>
>>>>  ------------------------------
>>>> *From:* Varun Sharma [[email protected]]
>>>> *Sent:* Monday, November 17, 2014 5:08 PM
>>>> *To:* [email protected]
>>>> *Subject:* Re: Helix issue - External View out of sync
>>>>
>>>>   I looked at the logs and gc was fine as the system was processing
>>>> other events around the same time.
>>>>
>>>>  Is there anything else specifically I should look for in the logs? Is
>>>> there a way to find out whether a node was removed from the cluster due to
>>>> a ZK issue?
>>>>
>>>>  Thanks !
>>>> Varun
>>>>
>>>> On Mon, Nov 17, 2014 at 4:32 PM, Varun Sharma <[email protected]>
>>>> wrote:
>>>>
>>>>> I am wondering how come a partition was in the online state for a
>>>>> resource that was newly created.
>>>>>
>>>>>  Thanks
>>>>>  Varun
>>>>>
>>>>> On Mon, Nov 17, 2014 at 4:31 PM, Varun Sharma <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> I am using 0.6.4. In this case, I created a resource and set its
>>>>>> ideal state, and the partitions onlined themselves. It seems that this
>>>>>> node opened a whole bunch of other partitions at around the same time
>>>>>> (~30 or so) but failed to open 3-4 partitions. This was for a brand new
>>>>>> resource I created.
>>>>>>
>>>>>>  Thanks!
>>>>>>  Varun
>>>>>>
>>>>>> On Mon, Nov 17, 2014 at 4:24 PM, kishore g <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> One suggestion is to check for GC pauses on the nodes. Nodes lose
>>>>>>> cluster membership if they get into a long GC pause or start flapping.
>>>>>>> That might be the cause of the state mismatch. However, the external
>>>>>>> view must be up to date. It might help if you can attach the controller
>>>>>>> logs and node logs.
>>>>>>>
>>>>>>> On Mon, Nov 17, 2014 at 4:10 PM, Varun Sharma <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>>  I am seeing the following issue for many partitions in Helix
>>>>>>>> using a simple Online->Offline state model factory. The external view
>>>>>>>> says that the partition has been assigned to 3 hosts. However, when I
>>>>>>>> look at the hosts, only 1 of them executed the OFFLINE --> ONLINE
>>>>>>>> transition.
>>>>>>>>
>>>>>>>>  On the hosts that did not execute the transition, I see the
>>>>>>>> following:
>>>>>>>>
>>>>>>>>  2014-11-13 09:29:54,394 [pool-3-thread-11]
>>>>>>>> (HelixStateTransitionHandler.java:206) WARN  *Force CurrentState
>>>>>>>> on Zk to be stateModel's CurrentState*. *partitionKey: 490*,
>>>>>>>> currentState: ONLINE, message: 12690ce8-8098-46b1-a93d-279604f0e3db,
>>>>>>>> {CREATE_TIMESTAMP=1415870993349, ClusterEventName=idealStateChange,
>>>>>>>> EXECUTE_START_TIMESTAMP=1415870994382, EXE_SESSION_ID=149a14ada0d0013,
>>>>>>>> FROM_STATE=OFFLINE, MSG_ID=*12690ce8-8098-46b1-a93d-279604f0e3db*,
>>>>>>>> MSG_STATE=read, MSG_TYPE=STATE_TRANSITION, PARTITION_NAME=490,
>>>>>>>> READ_TIMESTAMP=1415870993787,
>>>>>>>> RESOURCE_NAME=$terrapin$data$meta_pin_join$1415866960201,
>>>>>>>> SRC_NAME=hdfsterrapin-a-namenode001_9090, 
>>>>>>>> SRC_SESSION_ID=147a7beb2dd8ed7,
>>>>>>>> STATE_MODEL_DEF=OnlineOffline, STATE_MODEL_FACTORY_NAME=DEFAULT,
>>>>>>>> TGT_NAME=hdfsterrapin-a-datanode-ba3ad256, 
>>>>>>>> TGT_SESSION_ID=149a14ada0d0013,
>>>>>>>> TO_STATE=ONLINE}{}{}
>>>>>>>>
>>>>>>>>  When I grep the message ID in the controller, I see the following:
>>>>>>>>
>>>>>>>>  2014-11-14 09:34:56,265 [StatusDumpTimerTask]
>>>>>>>> (ZKPathDataDumpTask.java:155) INFO  {
>>>>>>>>
>>>>>>>>   "id" :
>>>>>>>> "149a14ada0d0013__$terrapin$data$meta_pin_join$1415866960201",
>>>>>>>>
>>>>>>>>   "mapFields" : {
>>>>>>>>
>>>>>>>>     "HELIX_ERROR     20141113-092954.000419 STATE_TRANSITION
>>>>>>>> c1193025-b416-49d7-adc2-10afe2389141" : {
>>>>>>>>
>>>>>>>>       "AdditionalInfo" : "Message execution failed. msgId:
>>>>>>>> 12690ce8-8098-46b1-a93d-279604f0e3db, errorMsg:
>>>>>>>> org.apache.helix.messaging.handling.
>>>>>>>> *HelixStateTransitionHandler$HelixStateMismatchException*: Current
>>>>>>>> state of stateModel does not match the fromState in Message, Current
>>>>>>>> State:ONLINE, message expected:OFFLINE, partition: 490, from:
>>>>>>>> hdfsterrapin-a-namenode001_9090, to: hdfsterrapin-a-datanode-ba3ad256",
>>>>>>>>
>>>>>>>>       "Class" : "class
>>>>>>>> org.apache.helix.messaging.handling.HelixStateTransitionHandler",
>>>>>>>>
>>>>>>>>       "MSG_ID" : "12690ce8-8098-46b1-a93d-279604f0e3db",
>>>>>>>>
>>>>>>>>       "Message state" : "READ"
>>>>>>>>
>>>>>>>>     },
>>>>>>>>
>>>>>>>>
>>>>>>>>  What could be causing this state mismatch? When I restart the node,
>>>>>>>> the error disappears (meaning that the node is able to perform the
>>>>>>>> state transition).
>>>>>>>>
>>>>>>>>
>>>>>>>>  Thanks
>>>>>>>>
>>>>>>>> Varun
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
