Another quick question: if I open the external view from inside a controller using HelixAdmin.getResourceExternalView, is that a ZK call, or is the external view cached in local memory? If the former, is it better to establish a spectator connection so we get notified of changes instead of having to poll every time? (I am polling the external view for all resources every few minutes, which is why I am asking.)
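For reference, a minimal sketch of the push-based alternative mentioned here: a SPECTATOR connection that registers an ExternalViewChangeListener, so external view changes arrive as callbacks instead of being fetched on a timer. This assumes the Helix 0.6.x Java API; the ZooKeeper address and the spectator instance name are placeholders.

    import java.util.List;
    import org.apache.helix.ExternalViewChangeListener;
    import org.apache.helix.HelixManager;
    import org.apache.helix.HelixManagerFactory;
    import org.apache.helix.InstanceType;
    import org.apache.helix.NotificationContext;
    import org.apache.helix.model.ExternalView;

    public class ExternalViewWatcher {
      public static void main(String[] args) throws Exception {
        String zkAddr = "localhost:2181";   // placeholder ZK address
        String clusterName = "terrapin";    // cluster name from this thread

        // Connect as a SPECTATOR: read-only, receives cluster change callbacks.
        HelixManager manager = HelixManagerFactory.getZKHelixManager(
            clusterName, "externalViewWatcher", InstanceType.SPECTATOR, zkAddr);
        manager.connect();

        // Invoked once on registration and then on every external view change,
        // removing the need to poll getResourceExternalView periodically.
        manager.addExternalViewChangeListener(new ExternalViewChangeListener() {
          @Override
          public void onExternalViewChange(List<ExternalView> externalViewList,
                                           NotificationContext changeContext) {
            for (ExternalView ev : externalViewList) {
              System.out.println("External view changed for resource: " + ev.getResourceName());
            }
          }
        });
        // ... keep the process alive; call manager.disconnect() on shutdown.
      }
    }

The same manager can also register ideal state or live instance listeners if those are being polled elsewhere.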
On Tue, Aug 26, 2014 at 5:02 PM, kishore g <[email protected]> wrote:

I think they are thread safe because ZKHelixAdmin is stateless. I think the right question is "are the operations atomic?" Most HelixAdmin operations change znodes in ZooKeeper. By default, none of the operations are atomic. However, HelixAdmin under the hood uses HelixDataAccessor, which supports atomic operations.

If you really want these operations to be atomic, you can use HelixDataAccessor and BaseDataAccessor. These are low-level APIs, and if you really need atomicity, we should probably file a JIRA and provide the high-level APIs in HelixAdmin.

On Tue, Aug 26, 2014 at 4:48 PM, Varun Sharma <[email protected]> wrote:

I am doing an "addResource" and a "dropResource" in separate threads. It is highly unlikely for me to call these operations on the same resource concurrently.

Varun

On Tue, Aug 26, 2014 at 4:45 PM, Kanak Biscuitwala <[email protected]> wrote:

I would have to say, "it depends." There are operations that are idempotent (e.g. dropResource), atomic (e.g. setResourceIdealState), both, or neither (e.g. resetResource). Generally speaking, you should be OK for most operations, but there isn't any synchronization, so depending on which znodes are affected and how, there may be some thread-safety issues.

Are there specific operations you need to be thread-safe?

------------------------------
Date: Tue, 26 Aug 2014 16:37:50 -0700
Subject: Re: Error on participant while joining cluster
From: [email protected]
To: [email protected]

Thanks Kanak. Another question: is HelixAdmin thread safe?

Varun

On Tue, Aug 26, 2014 at 3:36 PM, Kanak Biscuitwala <[email protected]> wrote:

Hi Varun,

To answer your question on IRC, the resource's znode is deleted immediately on dropResource(), but Helix will still be able to send dropped messages after this happens because there is enough persisted information in the current state on each node.

Kanak

------------------------------
Date: Thu, 21 Aug 2014 12:56:21 -0700
Subject: Re: Error on participant while joining cluster
From: [email protected]
To: [email protected]

I don't see any issue at runtime. However, Helix has support for backing up the ZooKeeper nodes onto a file system, and I think "|" might cause problems while storing or restoring that data. I would use something that is file-system compatible, like "_" or perhaps "-".

On Thu, Aug 21, 2014 at 12:03 PM, Varun Sharma <[email protected]> wrote:

Are there any restrictions on choosing resource names? I was initially putting "/" in the name, but that does not work well since it ends up creating a znode with a slash. I found that if I replace "/" with "|", the znode can be created. Could there be any other issues inside Helix with using "|" in a resource name?

Varun
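Tying back to kishore's note above about HelixDataAccessor and BaseDataAccessor: a rough sketch of what an atomic (versioned read-modify-write) update could look like with those lower-level APIs. This assumes Helix 0.6.x; the resource name and the simple field being written are purely illustrative.

    import org.I0Itec.zkclient.DataUpdater;
    import org.apache.helix.AccessOption;
    import org.apache.helix.BaseDataAccessor;
    import org.apache.helix.HelixDataAccessor;
    import org.apache.helix.HelixManager;
    import org.apache.helix.PropertyKey;
    import org.apache.helix.ZNRecord;

    public class AtomicIdealStateTouch {
      // manager is an already-connected HelixManager of any type.
      static void touchIdealState(HelixManager manager, String resource) {
        HelixDataAccessor accessor = manager.getHelixDataAccessor();
        PropertyKey key = accessor.keyBuilder().idealStates(resource);
        BaseDataAccessor<ZNRecord> base = accessor.getBaseDataAccessor();

        // update() performs a versioned read-modify-write on the znode and retries
        // on version conflict, so concurrent writers do not silently overwrite each other.
        base.update(key.getPath(), new DataUpdater<ZNRecord>() {
          @Override
          public ZNRecord update(ZNRecord current) {
            // "current" is the latest copy read from ZooKeeper; this sketch assumes
            // the resource's ideal state znode already exists.
            current.setSimpleField("LAST_TOUCHED", String.valueOf(System.currentTimeMillis()));
            return current;
          }
        }, AccessOption.PERSISTENT);
      }
    }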
On Tue, Aug 19, 2014 at 2:20 PM, Kanak Biscuitwala <[email protected]> wrote:

But of course, since HelixAdmin seems to be bugging out, what Jason said is right :)

------------------------------
From: [email protected]
To: [email protected]
Subject: RE: Error on participant while joining cluster
Date: Tue, 19 Aug 2014 14:18:23 -0700

As Jason said, typically the naming convention is host_port, which Helix tools automatically parse as host and port. It is possible to use arbitrary instance IDs in theory though, so it might be worth filing as a bug.

As for removing instances, the typical flow is to shut the instance down (so that the live instance is gone), disable it, and then drop it using HelixAdmin.
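A rough sketch of that removal flow with the Java HelixAdmin API rather than the CLI, assuming Helix 0.6.x and that the participant process has already been stopped (so its ephemeral LIVEINSTANCES znode is gone); the ZooKeeper address and instance name are placeholders.

    import org.apache.helix.HelixAdmin;
    import org.apache.helix.manager.zk.ZKHelixAdmin;
    import org.apache.helix.model.InstanceConfig;

    public class DropDeadInstance {
      public static void main(String[] args) {
        String zkAddr = "localhost:2181";   // placeholder
        String clusterName = "terrapin";
        String instanceName = "hdfsterrapin-a-datanode-531b2679_9090";

        HelixAdmin admin = new ZKHelixAdmin(zkAddr);

        // 1. The participant process must already be shut down, so its ephemeral
        //    node under /<cluster>/LIVEINSTANCES is gone.
        // 2. Disable the instance so the controller stops assigning replicas to it.
        admin.enableInstance(clusterName, instanceName, false);

        // 3. Drop it, which removes its INSTANCES and CONFIGS/PARTICIPANT znodes.
        InstanceConfig config = admin.getInstanceConfig(clusterName, instanceName);
        admin.dropInstance(clusterName, config);
      }
    }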
------------------------------
From: [email protected]
To: [email protected]
Subject: Re: Error on participant while joining cluster
Date: Tue, 19 Aug 2014 21:05:46 +0000

First make sure that the node you want to remove from the cluster is not running under /<CLUSTER_NAME>/LIVEINSTANCES/. Then you can simply remove the orphaned znodes under /<CLUSTER_NAME>/INSTANCES as well as under /<CLUSTER_NAME>/CONFIGS/PARTICIPANT. Normally ":" is not recommended in the instance id, and we internally replace it with "_". We will check how to get rid of an instance with ":" in its id.

Thanks,
Jason

From: Varun Sharma <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Tuesday, August 19, 2014 1:58 PM
To: "[email protected]" <[email protected]>
Subject: Re: Error on participant while joining cluster

Can I simply remove the orphaned znodes under /<CLUSTER_NAME>/INSTANCES?

Varun

On Tue, Aug 19, 2014 at 1:54 PM, Varun Sharma <[email protected]> wrote:

Another issue I have now is that I ended up registering the participants as <host>:<port>. This causes MBean-related exceptions (because it does not like colon separators), and I don't know if that is interfering with normal controller operation. I restarted the instances with the ":" replaced, but those old names are still stuck in the INSTANCES znode. How can I get rid of these? helix-admin seems to be replacing the ":" in the node name with an underscore "_" and can't delete the node.

This is still causing MBean-related exceptions in the log trace.

Varun

On Tue, Aug 19, 2014 at 12:18 PM, Zhen Zhang <[email protected]> wrote:

Sure, will add it.

From: kishore g <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Tuesday, August 19, 2014 12:14 PM
To: "[email protected]" <[email protected]>
Subject: Re: Error on participant while joining cluster

Thanks Jason. We need to add this to the documentation; I could not find how to enable auto-join in the docs. Should we add this to the admin interface documentation?

On Tue, Aug 19, 2014 at 12:06 PM, Zhen Zhang <[email protected]> wrote:

Hi Varun, you need to either add the participant to the cluster before starting it, or enable the participant auto-join config.

Add the participant to the cluster:

    ./helix-admin.sh --zkSvr <ZookeeperServerAddress, e.g. localhost:2181> --addNode <clusterName, e.g. terrapin> <instanceId, e.g. hdfsterrapin-a-datanode-531b2679_9090>

Or, enable the auto-join config:

    ./helix-admin.sh --zkSvr <ZookeeperServerAddress> --setConfig CLUSTER <clusterName> allowParticipantAutoJoin=true

Thanks,
Jason
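The same two options via the Java admin API, as a rough sketch (Helix 0.6.x assumed; the host, port, and instance id below mirror the examples in the helix-admin.sh commands above and are placeholders).

    import java.util.Collections;
    import org.apache.helix.HelixAdmin;
    import org.apache.helix.manager.zk.ZKHelixAdmin;
    import org.apache.helix.model.HelixConfigScope;
    import org.apache.helix.model.HelixConfigScope.ConfigScopeProperty;
    import org.apache.helix.model.InstanceConfig;
    import org.apache.helix.model.builder.HelixConfigScopeBuilder;

    public class RegisterParticipant {
      public static void main(String[] args) {
        String zkAddr = "localhost:2181";   // placeholder
        String clusterName = "terrapin";

        HelixAdmin admin = new ZKHelixAdmin(zkAddr);

        // Option 1: pre-register the participant before starting it.
        InstanceConfig config = new InstanceConfig("hdfsterrapin-a-datanode-531b2679_9090");
        config.setHostName("hdfsterrapin-a-datanode-531b2679");
        config.setPort("9090");
        config.setInstanceEnabled(true);
        admin.addInstance(clusterName, config);

        // Option 2: allow auto-join by setting the cluster-level config key
        // shown in the helix-admin.sh command above.
        HelixConfigScope scope = new HelixConfigScopeBuilder(ConfigScopeProperty.CLUSTER)
            .forCluster(clusterName).build();
        admin.setConfig(scope, Collections.singletonMap("allowParticipantAutoJoin", "true"));
      }
    }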
From: Varun Sharma <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Tuesday, August 19, 2014 11:47 AM
To: "[email protected]" <[email protected]>
Subject: Error on participant while joining cluster

I am getting the following error while trying to join a cluster as a participant. The cluster is set up and a controller has already connected to it. Can someone help out as to why this is happening?

2014-08-19 18:41:36,843 [main] (ZKHelixManager.java:727) INFO Handling new session, session id: 147a7beb2dd63f4, instance: hdfsterrapin-a-datanode-531b2679:9090, instanceTye: PARTICIPANT, cluster: terrapin, zkconnection: State:CONNECTED Timeout:30000 sessionid:0x147a7beb2dd63f4 local:/10.65.145.80:43854 remoteserver:terrapinzk001a/10.115.59.31:2181 lastZxid:0 xid:1 sent:1 recv:1 queuedpkts:0 pendingresp:0 queuedevents:0
2014-08-19 18:41:36,843 [main] (ParticipantHealthReportTask.java:67) WARN ParticipantHealthReportTimerTask already stopped
2014-08-19 18:41:36,914 [main] (ParticipantManagerHelper.java:101) INFO instance: hdfsterrapin-a-datanode-531b2679:9090 auto-joining terrapin is false
2014-08-19 18:41:36,917 [main] (ZKUtil.java:95) INFO Invalid instance setup, missing znode path: /terrapin/CONFIGS/PARTICIPANT/hdfsterrapin-a-datanode-531b2679:9090
2014-08-19 18:41:36,918 [main] (ZKUtil.java:95) INFO Invalid instance setup, missing znode path: /terrapin/INSTANCES/hdfsterrapin-a-datanode-531b2679:9090/MESSAGES
2014-08-19 18:41:36,918 [main] (ZKUtil.java:95) INFO Invalid instance setup, missing znode path: /terrapin/INSTANCES/hdfsterrapin-a-datanode-531b2679:9090/CURRENTSTATES
2014-08-19 18:41:36,919 [main] (ZKUtil.java:95) INFO Invalid instance setup, missing znode path: /terrapin/INSTANCES/hdfsterrapin-a-datanode-531b2679:9090/STATUSUPDATES
2014-08-19 18:41:36,920 [main] (ZKUtil.java:95) INFO Invalid instance setup, missing znode path: /terrapin/INSTANCES/hdfsterrapin-a-datanode-531b2679:9090/ERRORS
2014-08-19 18:41:36,920 [main] (ZKHelixManager.java:496) ERROR fail to createClient.
org.apache.helix.HelixException: Initial cluster structure is not set up for instance: hdfsterrapin-a-datanode-531b2679:9090, instanceType: PARTICIPANT
        at org.apache.helix.manager.zk.ParticipantManagerHelper.joinCluster(ParticipantManagerHelper.java:108)
        at org.apache.helix.manager.zk.ZKHelixManager.handleNewSessionAsParticipant(ZKHelixManager.java:869)
        at org.apache.helix.manager.zk.ZKHelixManager.handleNewSession(ZKHelixManager.java:838)
        at org.apache.helix.manager.zk.ZKHelixManager.createClient(ZKHelixManager.java:493)
        at org.apache.helix.manager.zk.ZKHelixManager.connect(ZKHelixManager.java:519)
        at com.pinterest.terrapin.server.TerrapinServerHandler.start(TerrapinServerHandler.java:84)
        at com.pinterest.terrapin.server.TerrapinServerMain.main(TerrapinServerMain.java:31)
2014-08-19 18:41:36,921 [main] (ZKHelixManager.java:522) ERROR fail to connect hdfsterrapin-a-datanode-531b2679:9090
org.apache.helix.HelixException: Initial cluster structure is not set up for instance: hdfsterrapin-a-datanode-531b2679:9090, instanceType: PARTICIPANT
        at org.apache.helix.manager.zk.ParticipantManagerHelper.joinCluster(ParticipantManagerHelper.java:108)
        at org.apache.helix.manager.zk.ZKHelixManager.handleNewSessionAsParticipant(ZKHelixManager.java:869)
        at org.apache.helix.manager.zk.ZKHelixManager.handleNewSession(ZKHelixManager.java:838)
        at org.apache.helix.manager.zk.ZKHelixManager.createClient(ZKHelixManager.java:493)
        at org.apache.helix.manager.zk.ZKHelixManager.connect(ZKHelixManager.java:519)
        at com.pinterest.terrapin.server.TerrapinServerHandler.start(TerrapinServerHandler.java:84)
        at com.pinterest.terrapin.server.TerrapinServerMain.main(TerrapinServerMain.java:31)
