I like the last idea. When an instance is shutting down today it does the
following:

- disable the instance
- drop the instance
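Roughly, the current shutdown path looks like the sketch below (untested;
the class name, zk address, cluster name, and instance name are made-up
placeholders, and it assumes the participant's HelixManager.disconnect()
runs between the disable and the drop, so the instance is no longer live
when it is dropped):

  import org.apache.helix.HelixAdmin;
  import org.apache.helix.manager.zk.ZKHelixAdmin;
  import org.apache.helix.model.InstanceConfig;

  public class ParticipantShutdownHook {

    // Called from the node's shutdown path.
    public static void shutdown(String zkAddr, String cluster, String instance) {
      HelixAdmin admin = new ZKHelixAdmin(zkAddr);
      try {
        // 1. disable the instance so the controller transitions its
        //    partitions off this node
        admin.enableInstance(cluster, instance, false);

        // ... the participant disconnects from the cluster here ...

        // 2. drop the instance, which removes its config and the
        //    <cluster>/INSTANCES/<instance> subtree
        InstanceConfig config = admin.getInstanceConfig(cluster, instance);
        admin.dropInstance(cluster, config);
      } finally {
        admin.close();
      }
    }
  }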
Instead, I can do the following. When an instance shuts down:

- disable the instance
- a reaper thread on the controller (this can be present on any of the
  current live instances) wakes up, looks for all disabled instances, and
  drops them.

- Sesh .J

On Mon, Nov 28, 2016 at 2:06 PM, kishore g <[email protected]> wrote:

> If you know that the instance will never come back up with the same name,
> you can do the following:
>
> - disable the instance
> - wait for all partitions hosted by this instance to get to the
>   OFFLINE/DROPPED state
> - disconnect from the cluster
> - use ZkHelixAdmin to drop the instance from the cluster. This should
>   clean up everything related to the old node.
>
> You can also do this via the controller node. Watch for liveinstances, and
> if nodes are not present under liveinstances you can delete those nodes.
> One suggestion here: when a node shuts down, write the state to the
> instanceConfig of that node, say STATE="SHUTDOWN". Your reaper thread can
> look for nodes that are in this state and invoke admin.dropInstance.
>
> dropInstance will take care of cleaning up everything related to a dead
> node.
>
> On Mon, Nov 28, 2016 at 1:56 PM, Sesh Jalagam <[email protected]> wrote:
>
>> Kishore, thanks.
>>
>> Option 1 and Option 3 are plausible. Option 2 is not feasible: even
>> though the cluster name is the same, the instance name is different
>> (usually this is a random value).
>>
>> With Option 1, what should I be looking at in the External View? Should
>> I be looking at all the resources that should have been transitioned off?
>>
>> With Option 3, when a cluster is redeployed the controller is moving
>> around (because of leader election) from old nodes to old nodes, so I
>> wonder if the controller will miss any messages for dead nodes. Or I can
>> simply have a reaper that comes up and deletes all messages that are
>> destined for instances that are not present in /LIVEINSTANCES/.
>>
>> How should I be dealing with <cluster_id>/INSTANCES/INSTANCES/CURRENTSTATES?
>> This has stale current states (session ids that are no longer valid).
>>
>> On Mon, Nov 28, 2016 at 12:52 PM, kishore g <[email protected]> wrote:
>>
>>> Looks like nodes add and remove themselves quite often. After you
>>> disable the instance, Helix will send messages to go from ONLINE to
>>> OFFLINE. It looks like the nodes shut down before they get those
>>> messages, and when they come back up, they use a different instance id.
>>>
>>> There are two solutions:
>>> - During shutdown: after disabling, wait for the state to be reflected
>>>   in the External View.
>>> - During startup: if possible, re-join the cluster with the same name.
>>>   If you do that, Helix will remove old messages.
>>>
>>> A third option is to support autoCleanUp in Helix. The Helix controller
>>> can monitor the cluster for dead nodes and remove them automatically
>>> after some time.
>>>
>>> On Mon, Nov 28, 2016 at 12:39 PM, Sesh Jalagam <[email protected]> wrote:
>>>
>>>> <clustername>/INSTANCES/INSTANCES/MESSAGES has already-read messages.
>>>>
>>>> Here is an example:
>>>> ,"FROM_STATE":"ONLINE"
>>>> ,"MSG_STATE":"read"
>>>> ,"MSG_TYPE":"STATE_TRANSITION"
>>>> ,"STATE_MODEL_DEF":"OnlineOffline"
>>>> ,"STATE_MODEL_FACTORY_NAME":"DEFAULT"
>>>> ,"TO_STATE":"OFFLINE"
>>>>
>>>> I see these messages after the participant is disabled and dropped,
>>>> i.e. <clustername>/INSTANCES/<PARTICIPANT_ID> is removed.
>>>>
>>>> Thanks
>>>>
>>>> On Mon, Nov 28, 2016 at 12:18 PM, kishore g <[email protected]> wrote:
>>>>
>>>>> <clustername>/INSTANCES/INSTANCES/MESSAGES - by this do you mean
>>>>> <clustername>/INSTANCES/<PARTICIPANT_ID>/MESSAGES?
>>>>>
>>>>> What kind of messages do you see under these nodes?
>>>>>
>>>>> On Mon, Nov 28, 2016 at 12:04 PM, Sesh Jalagam <[email protected]> wrote:
>>>>>
>>>>>> Our setup is the following:
>>>>>>
>>>>>> - Controller (leader elected from one of the cluster nodes)
>>>>>> - Cluster of nodes as participants in the OnlineOffline StateModel
>>>>>> - Set of resources with partitions
>>>>>>
>>>>>> Each node, on its startup, creates a controller, adds a participant
>>>>>> if it does not already exist, and waits for the callbacks to handle
>>>>>> partition rebalancing.
>>>>>>
>>>>>> Please note this cluster is created on the fly multiple times a day
>>>>>> (the actual cluster is not deleted, but new participants are removed
>>>>>> and re-added).
>>>>>>
>>>>>> Everything works fine in production, but I see that the znodes
>>>>>> in <clustername>/INSTANCES/INSTANCES/MESSAGES are growing.
>>>>>>
>>>>>> What is <cluster_id>/INSTANCES/INSTANCES used for? Is there a way
>>>>>> for the messages to be deleted automatically?
>>>>>>
>>>>>> I see a similar buildup in <cluster_id>/INSTANCES/INSTANCES/CURRENTSTATES.
>>>>>>
>>>>>> Thanks
>>>>>> --
>>>>>> - Sesh .J
>>>>
>>>> --
>>>> - Sesh .J
>>
>> --
>> - Sesh .J

--
- Sesh .J
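P.S. To make the reaper idea concrete, here is a rough, untested sketch of
what I have in mind. The class name, zk address, cluster name, and schedule
are made-up placeholders, and it assumes that in our cluster a disabled
instance always means one that already went through the shutdown path and
disconnected; following the STATE="SHUTDOWN" suggestion above, the check
could instead read a marker field from the instanceConfig.

  import java.util.concurrent.Executors;
  import java.util.concurrent.TimeUnit;

  import org.apache.helix.HelixAdmin;
  import org.apache.helix.manager.zk.ZKHelixAdmin;
  import org.apache.helix.model.InstanceConfig;

  public class DisabledInstanceReaper implements Runnable {

    private final HelixAdmin admin;
    private final String cluster;

    public DisabledInstanceReaper(String zkAddr, String cluster) {
      this.admin = new ZKHelixAdmin(zkAddr);
      this.cluster = cluster;
    }

    @Override
    public void run() {
      for (String instance : admin.getInstancesInCluster(cluster)) {
        InstanceConfig config = admin.getInstanceConfig(cluster, instance);
        // A disabled instance here is one that already shut down and
        // disconnected; dropInstance cleans up everything related to the
        // dead node, as Kishore noted above.
        if (!config.getInstanceEnabled()) {
          admin.dropInstance(cluster, config);
        }
      }
    }

    // Example wiring; this would be started only on the node that currently
    // holds the controller leadership, so a single reaper is active.
    public static void main(String[] args) {
      Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(
          new DisabledInstanceReaper("zk-host:2181", "MY_CLUSTER"),
          5, 5, TimeUnit.MINUTES);
    }
  }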
