Hmm, it seems that in my case the resources are not known in advance, and I need to create and decommission resources on the fly as data comes in or gets deleted. Is there a way around that?
Thanks
Varun

On Tue, Nov 18, 2014 at 3:06 PM, Zhen Zhang <[email protected]> wrote:

> Hi Varun,
>
> Here is the problem. You are using the ONLINE-OFFLINE state model for
> multiple resources, and in this case, when you register the state model
> factory, you need to use your resource name
> (e.g. $terrapin$data$meta_pin_join$1415866960201) as your factory name
> instead of the default factory name (which is "DEFAULT"); something like this:
>
> HelixManager#getStateMachineEngine#registerStateModelFactory("ONLINEOFFLINE",
> factory, "$terrapin$data$meta_pin_join$1415866960201")
>
> Otherwise, Helix can't distinguish the state model factories for the two
> different resources using the same state model and the same factory name.
> To confirm, you should have the following message in your participant log:
>
> WARN: "stateModelFactory for " + stateModelName + " using factoryName
> DEFAULT has already been registered."
>
> Let us know if this solves the problem.
>
> Thanks,
> Zhen
>
> ------------------------------
> *From:* Varun Sharma [[email protected]]
> *Sent:* Tuesday, November 18, 2014 12:59 PM
> *To:* [email protected]
> *Subject:* Re: Helix issue - External View out of sync
>
> I shared the logs with Zhen using Google Drive.
>
> On Tue, Nov 18, 2014 at 12:56 PM, kishore g <[email protected]> wrote:
>
>> Did you try Dropbox or any other public file sharing service?
>>
>> On Tue, Nov 18, 2014 at 10:57 AM, Varun Sharma <[email protected]> wrote:
>>
>>> Hi Zhen,
>>>
>>> My logs are > 10M and jira does not allow me to attach them. Also,
>>> gmail is not allowing me to send them over as it flags them as "blocked
>>> for security reasons" - link here
>>> <https://support.google.com/mail/answer/6590?hl=en> - Do you have any
>>> other options to send over the file? I created HELIX-551 for this issue.
>>>
>>> Thanks
>>> Varun
>>>
>>> On Mon, Nov 17, 2014 at 6:49 PM, Zhen Zhang <[email protected]> wrote:
>>>
>>>> Hi Varun, I missed the conversation on IRC. You could create a jira at:
>>>> https://issues.apache.org/jira/browse/HELIX
>>>>
>>>> And attach the zk log in the jira. We will be able to figure it out.
>>>>
>>>> Thanks,
>>>> Zhen
>>>>
>>>> ------------------------------
>>>> *From:* Zhen Zhang [[email protected]]
>>>> *Sent:* Monday, November 17, 2014 5:16 PM
>>>> *To:* [email protected]
>>>> *Subject:* RE: Helix issue - External View out of sync
>>>>
>>>> Hi, Varun, you can join us on freenode IRC:
>>>> http://helix.apache.org/IRC.html
>>>>
>>>> Thanks,
>>>> Zhen
>>>>
>>>> ------------------------------
>>>> *From:* Varun Sharma [[email protected]]
>>>> *Sent:* Monday, November 17, 2014 5:08 PM
>>>> *To:* [email protected]
>>>> *Subject:* Re: Helix issue - External View out of sync
>>>>
>>>> I looked at the logs and gc was fine as the system was processing
>>>> other events around the same time.
>>>>
>>>> Is there anything else specifically I should look for in the logs? Is
>>>> there a way to find out whether a node was removed from the cluster due
>>>> to a ZK issue?
>>>>
>>>> Thanks!
>>>> Varun
>>>>
>>>> On Mon, Nov 17, 2014 at 4:32 PM, Varun Sharma <[email protected]> wrote:
>>>>
>>>>> I am wondering how come a partition was in the online state for a
>>>>> resource that was newly created.
>>>>>
>>>>> Thanks
>>>>> Varun
>>>>>
>>>>> On Mon, Nov 17, 2014 at 4:31 PM, Varun Sharma <[email protected]> wrote:
>>>>>
>>>>>> I am using 0.6.4. In this case, I created a resource and set its
>>>>>> ideal state and the partitions onlined themselves. It seems for that
>>>>>> node - it opened a whole bunch of other partitions at around the same
>>>>>> time (~ 30 or so) but failed to open 3-4 partitions. This was for a
>>>>>> brand new resource I created.
>>>>>>
>>>>>> Thanks!
>>>>>> Varun
>>>>>>
>>>>>> On Mon, Nov 17, 2014 at 4:24 PM, kishore g <[email protected]> wrote:
>>>>>>
>>>>>>> One suggestion is to check for GC pauses on the nodes. Nodes lose
>>>>>>> their cluster membership if they get into a long GC or start
>>>>>>> flapping. That might be the cause of the state mismatch. However,
>>>>>>> the external view must be up to date. It might help if you can
>>>>>>> attach the controller logs and node logs.
>>>>>>>
>>>>>>> On Mon, Nov 17, 2014 at 4:10 PM, Varun Sharma <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I am seeing the following issue for many partitions in helix
>>>>>>>> using a simple Online->Offline state model factory. The external
>>>>>>>> view says that the partition has been assigned to 3 hosts. However,
>>>>>>>> when I look at the hosts only 1 of them executed the
>>>>>>>> OFFLINE --> ONLINE transition.
>>>>>>>>
>>>>>>>> On the hosts that did not execute the transition, I see the
>>>>>>>> following:
>>>>>>>>
>>>>>>>> 2014-11-13 09:29:54,394 [pool-3-thread-11]
>>>>>>>> (HelixStateTransitionHandler.java:206) WARN *Force CurrentState
>>>>>>>> on Zk to be stateModel's CurrentState*. *partitionKey: 490*,
>>>>>>>> currentState: ONLINE, message: 12690ce8-8098-46b1-a93d-279604f0e3db,
>>>>>>>> {CREATE_TIMESTAMP=1415870993349, ClusterEventName=idealStateChange,
>>>>>>>> EXECUTE_START_TIMESTAMP=1415870994382, EXE_SESSION_ID=149a14ada0d0013,
>>>>>>>> FROM_STATE=OFFLINE, MSG_ID=*12690ce8-8098-46b1-a93d-279604f0e3db*,
>>>>>>>> MSG_STATE=read, MSG_TYPE=STATE_TRANSITION, PARTITION_NAME=490,
>>>>>>>> READ_TIMESTAMP=1415870993787,
>>>>>>>> RESOURCE_NAME=$terrapin$data$meta_pin_join$1415866960201,
>>>>>>>> SRC_NAME=hdfsterrapin-a-namenode001_9090,
>>>>>>>> SRC_SESSION_ID=147a7beb2dd8ed7,
>>>>>>>> STATE_MODEL_DEF=OnlineOffline, STATE_MODEL_FACTORY_NAME=DEFAULT,
>>>>>>>> TGT_NAME=hdfsterrapin-a-datanode-ba3ad256,
>>>>>>>> TGT_SESSION_ID=149a14ada0d0013,
>>>>>>>> TO_STATE=ONLINE}{}{}
>>>>>>>>
>>>>>>>> When I grep the message ID in the controller, I see the following:
>>>>>>>>
>>>>>>>> 2014-11-14 09:34:56,265 [StatusDumpTimerTask]
>>>>>>>> (ZKPathDataDumpTask.java:155) INFO {
>>>>>>>>   "id" : "149a14ada0d0013__$terrapin$data$meta_pin_join$1415866960201",
>>>>>>>>   "mapFields" : {
>>>>>>>>     "HELIX_ERROR 20141113-092954.000419 STATE_TRANSITION c1193025-b416-49d7-adc2-10afe2389141" : {
>>>>>>>>       "AdditionalInfo" : "Message execution failed. msgId: 12690ce8-8098-46b1-a93d-279604f0e3db, errorMsg: org.apache.helix.messaging.handling.*HelixStateTransitionHandler$HelixStateMismatchException*: Current state of stateModel does not match the fromState in Message, Current State:ONLINE, message expected:OFFLINE, partition: 490, from: hdfsterrapin-a-namenode001_9090, to: hdfsterrapin-a-datanode-ba3ad256",
>>>>>>>>       "Class" : "class org.apache.helix.messaging.handling.HelixStateTransitionHandler",
>>>>>>>>       "MSG_ID" : "12690ce8-8098-46b1-a93d-279604f0e3db",
>>>>>>>>       "Message state" : "READ"
>>>>>>>>     },
>>>>>>>>
>>>>>>>> What could be causing this - when I restart the node, the error
>>>>>>>> disappears (meaning that the node is able to perform the state
>>>>>>>> transition). What could be causing this state mismatch?
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Varun
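
To picture the collision Zhen describes in the thread, here is a minimal sketch of a registry keyed first by state model name and then by factory name. It is plain Java and not Helix code: `FactoryRegistry`, `register`, `lookup`, and the second resource name are all hypothetical, and the string values merely stand in for factory objects. The actual call to use is the `registerStateModelFactory` one quoted in Zhen's reply. For resources that are created on the fly, the same idea would presumably be applied at the moment each resource is created, though nothing in this thread confirms how Helix behaves in that case.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only (NOT the Helix implementation): entries are keyed by
// state model name, then by factory name.
final class FactoryRegistry {
    private final Map<String, Map<String, Object>> factories = new HashMap<>();

    // Returns false and keeps the existing entry when the
    // (stateModel, factoryName) pair is already taken.
    boolean register(String stateModel, String factoryName, Object factory) {
        Map<String, Object> byName =
                factories.computeIfAbsent(stateModel, k -> new HashMap<>());
        return byName.putIfAbsent(factoryName, factory) == null;
    }

    // Returns the registered entry, or null when there is none.
    Object lookup(String stateModel, String factoryName) {
        Map<String, Object> byName = factories.get(stateModel);
        return byName == null ? null : byName.get(factoryName);
    }

    public static void main(String[] args) {
        FactoryRegistry registry = new FactoryRegistry();
        String first = "$terrapin$data$meta_pin_join$1415866960201";
        String second = "hypothetical_second_resource"; // made-up name

        // Same model and same default factory name: the second call is rejected.
        System.out.println(registry.register("OnlineOffline", "DEFAULT", "factory-1"));
        System.out.println(registry.register("OnlineOffline", "DEFAULT", "factory-2"));

        // Resource name as factory name: both entries coexist.
        System.out.println(registry.register("OnlineOffline", first, "factory-1"));
        System.out.println(registry.register("OnlineOffline", second, "factory-2"));
    }
}
```

In this toy model the second `DEFAULT` registration is refused, which corresponds to the "has already been registered" warning mentioned in the thread, while per-resource factory names keep the two entries apart.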
