RE: Helix issue - External View out of sync

Zhen Zhang Tue, 18 Nov 2014 15:35:59 -0800

Hi Varun,

The factory name need not be the resource name. You can probably do this:
1) Register multiple state model factories for the OnlineOffline state model 
under a set of predefined, distinct factory names.
2) When you create a new resource, set the factory name in its ideal state to 
one that is not used by any current resource: 
IdealState#setStateModelFactoryName(factory-name)
3) When you decommission a resource, I guess you will remove the resource from 
the ideal state, so its factory name will no longer be in use.
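
The bookkeeping behind steps 2) and 3) could be sketched roughly as below. This 
is only an illustration in plain Java: FactoryNamePool, acquire and release are 
hypothetical names, not part of Helix, and the sketch assumes a factory name 
may be handed out again once its resource has been removed. The Helix calls 
themselves appear only as comments.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not a Helix class): tracks which predefined factory names are free. */
class FactoryNamePool {
    // Factory names that were registered up front (step 1) and are not in use.
    private final Deque<String> free = new ArrayDeque<>();
    // resource name -> factory name currently assigned to it
    private final Map<String, String> inUse = new HashMap<>();

    FactoryNamePool(List<String> predefinedNames) {
        free.addAll(predefinedNames);
    }

    /** Step 2: pick a factory name that no current resource is using. */
    synchronized String acquire(String resourceName) {
        String existing = inUse.get(resourceName);
        if (existing != null) {
            return existing; // resource already has a name; keep it stable
        }
        String name = free.poll();
        if (name == null) {
            throw new IllegalStateException("no free factory name for " + resourceName);
        }
        inUse.put(resourceName, name);
        // In real code, this is where the ideal state would be updated, e.g.
        // idealState.setStateModelFactoryName(name);
        return name;
    }

    /** Step 3: when a resource is decommissioned, its factory name becomes free again. */
    synchronized void release(String resourceName) {
        String name = inUse.remove(resourceName);
        if (name != null) {
            free.addLast(name);
        }
    }
}
```

The pool size caps how many resources can exist at once, so the set of 
predefined names would have to be at least as large as the expected number of 
concurrent resources.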

Thanks,
Zhen
________________________________
From: Varun Sharma [[email protected]]
Sent: Tuesday, November 18, 2014 3:16 PM
To: [email protected]
Subject: Re: Helix issue - External View out of sync

Hmm, it seems that in my case the resources are not known in advance, and I 
need to decommission/create resources on the fly as data comes in or gets 
deleted. Is there a way around that?

Thanks
Varun

On Tue, Nov 18, 2014 at 3:06 PM, Zhen Zhang 
<[email protected]<mailto:[email protected]>> wrote:
Hi Varun,

Here is the problem. You are using the ONLINE-OFFLINE state model for multiple 
resources, and in this case, when you register the state model factory, you 
need to use your resource name (e.g. $terrapin$data$meta_pin_join$1415866960201) 
as the factory name instead of the default factory name (which is "DEFAULT"); 
something like this:

HelixManager#getStateMachineEngine#registerStateModelFactory("OnlineOffline", 
factory, "$terrapin$data$meta_pin_join$1415866960201")

Otherwise, Helix can't distinguish the state model factories for the two 
different resources using the same state model and the same factory name. To 
confirm, you should have the following message in your participant log:

WARN: "stateModelFactory for " + stateModelName + " using factoryName DEFAULT 
has already been registered."
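
To illustrate why the second registration does not take effect, here is a toy 
model in plain Java. It is not Helix's implementation: ToyFactoryRegistry and 
its methods are hypothetical, and the only assumption it encodes is that a 
factory is identified by the pair (state model name, factory name), so a second 
registration under the same pair is refused with the warning above.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model only (not Helix code): factories keyed by state model name and factory name. */
class ToyFactoryRegistry {
    // stateModelName -> (factoryName -> factory object)
    private final Map<String, Map<String, Object>> factories = new HashMap<>();

    /** Returns false and warns if the (stateModelName, factoryName) pair is already taken. */
    boolean register(String stateModelName, Object factory, String factoryName) {
        Map<String, Object> byName =
            factories.computeIfAbsent(stateModelName, k -> new HashMap<>());
        if (byName.containsKey(factoryName)) {
            System.out.println("WARN: stateModelFactory for " + stateModelName
                + " using factoryName " + factoryName + " has already been registered.");
            return false; // the first factory stays in place
        }
        byName.put(factoryName, factory);
        return true;
    }

    /** Looks up the factory for a pair, or null if none was registered. */
    Object lookup(String stateModelName, String factoryName) {
        Map<String, Object> byName = factories.get(stateModelName);
        return byName == null ? null : byName.get(factoryName);
    }
}
```

In this model, two resources that both register under "DEFAULT" end up sharing 
the first factory, while registering the second one under its resource name 
gives it a separate entry.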

Let us know if this solves the problem.

Thanks,
Zhen

________________________________
From: Varun Sharma [[email protected]<mailto:[email protected]>]
Sent: Tuesday, November 18, 2014 12:59 PM

To: [email protected]<mailto:[email protected]>
Subject: Re: Helix issue - External View out of sync

I shared the logs with Zhen using Google Drive.

On Tue, Nov 18, 2014 at 12:56 PM, kishore g 
<[email protected]<mailto:[email protected]>> wrote:
Did you try Dropbox or any other public file sharing service?

On Tue, Nov 18, 2014 at 10:57 AM, Varun Sharma 
<[email protected]<mailto:[email protected]>> wrote:
Hi Zhen,

My logs are > 10M and JIRA does not allow me to attach them. Also, Gmail is not 
allowing me to send them over as it flags them as "blocked for security 
reasons" - link here<https://support.google.com/mail/answer/6590?hl=en> - Do 
you have any other options for sending over the file? I created HELIX-551 for 
this issue.

Thanks
Varun

On Mon, Nov 17, 2014 at 6:49 PM, Zhen Zhang 
<[email protected]<mailto:[email protected]>> wrote:
Hi Varun, I missed the conversation on IRC. You could create a jira at:
https://issues.apache.org/jira/browse/HELIX

And attach the zk log in the jira. We will be able to figure it out.

Thanks,
Zhen

________________________________
From: Zhen Zhang [[email protected]<mailto:[email protected]>]
Sent: Monday, November 17, 2014 5:16 PM
To: [email protected]<mailto:[email protected]>
Subject: RE: Helix issue - External View out of sync

Hi, Varun, you can join us on freenode IRC: http://helix.apache.org/IRC.html

Thanks,
Zhen

________________________________
From: Varun Sharma [[email protected]<mailto:[email protected]>]
Sent: Monday, November 17, 2014 5:08 PM
To: [email protected]<mailto:[email protected]>
Subject: Re: Helix issue - External View out of sync

I looked at the logs and GC was fine, as the system was processing other events 
around the same time.

Is there anything else specifically I should look for in the logs? Is there a 
way to find out whether a node was removed from the cluster due to a ZK issue?

Thanks !
Varun

On Mon, Nov 17, 2014 at 4:32 PM, Varun Sharma 
<[email protected]<mailto:[email protected]>> wrote:
I am wondering how come a partition was in the online state for a resource that 
was newly created.

Thanks
Varun

On Mon, Nov 17, 2014 at 4:31 PM, Varun Sharma 
<[email protected]<mailto:[email protected]>> wrote:
I am using 0.6.4. In this case, I created a resource and set its ideal state, 
and the partitions onlined themselves. That node opened a whole bunch of other 
partitions at around the same time (~30 or so) but failed to open 3-4 
partitions. This was for a brand new resource I created.

Thanks!
Varun

On Mon, Nov 17, 2014 at 4:24 PM, kishore g 
<[email protected]<mailto:[email protected]>> wrote:
One suggestion is to check for GC pauses on the nodes. Nodes lose their cluster 
membership if they get into a long GC or start flapping. That might be the 
cause of the state mismatch. However, the external view must be up to date. It 
might help if you can attach the controller logs and node logs.

On Mon, Nov 17, 2014 at 4:10 PM, Varun Sharma 
<[email protected]<mailto:[email protected]>> wrote:
Hi,

I am seeing the following issue for many partitions in Helix using a simple 
OnlineOffline state model factory. The external view says that the partition 
has been assigned to 3 hosts. However, when I look at the hosts, only 1 of them 
executed the OFFLINE --> ONLINE transition.

On the hosts that did not execute the transition, I see the following:

2014-11-13 09:29:54,394 [pool-3-thread-11] 
(HelixStateTransitionHandler.java:206) WARN  Force CurrentState on Zk to be 
stateModel's CurrentState. partitionKey: 490, currentState: ONLINE, message: 
12690ce8-8098-46b1-a93d-279604f0e3db, {CREATE_TIMESTAMP=1415870993349, 
ClusterEventName=idealStateChange, EXECUTE_START_TIMESTAMP=1415870994382, 
EXE_SESSION_ID=149a14ada0d0013, FROM_STATE=OFFLINE, 
MSG_ID=12690ce8-8098-46b1-a93d-279604f0e3db, MSG_STATE=read, 
MSG_TYPE=STATE_TRANSITION, PARTITION_NAME=490, READ_TIMESTAMP=1415870993787, 
RESOURCE_NAME=$terrapin$data$meta_pin_join$1415866960201, 
SRC_NAME=hdfsterrapin-a-namenode001_9090, SRC_SESSION_ID=147a7beb2dd8ed7, 
STATE_MODEL_DEF=OnlineOffline, STATE_MODEL_FACTORY_NAME=DEFAULT, 
TGT_NAME=hdfsterrapin-a-datanode-ba3ad256, TGT_SESSION_ID=149a14ada0d0013, 
TO_STATE=ONLINE}{}{}

When I grep for the message ID in the controller log, I see the following:

2014-11-14 09:34:56,265 [StatusDumpTimerTask] (ZKPathDataDumpTask.java:155) 
INFO  {
  "id" : "149a14ada0d0013__$terrapin$data$meta_pin_join$1415866960201",
  "mapFields" : {
    "HELIX_ERROR     20141113-092954.000419 STATE_TRANSITION 
c1193025-b416-49d7-adc2-10afe2389141" : {
      "AdditionalInfo" : "Message execution failed. msgId: 
12690ce8-8098-46b1-a93d-279604f0e3db, errorMsg: 
org.apache.helix.messaging.handling.HelixStateTransitionHandler$HelixStateMismatchException:
 Current state of stateModel does not match the fromState in Message, Current 
State:ONLINE, message expected:OFFLINE, partition: 490, from: 
hdfsterrapin-a-namenode001_9090, to: hdfsterrapin-a-datanode-ba3ad256",
      "Class" : "class 
org.apache.helix.messaging.handling.HelixStateTransitionHandler",
      "MSG_ID" : "12690ce8-8098-46b1-a93d-279604f0e3db",
      "Message state" : "READ"
    },

When I restart the node, the error disappears (meaning that the node is able 
to perform the state transition). What could be causing this state mismatch?

Thanks

Varun
