If your ConnectionStateListener gets SUSPENDED or LOST, you've lost your connection to ZooKeeper. You therefore cannot use that same ZooKeeper connection to manage a node that denotes whether the process is running. Only one VM at a time will be running the process, and that process can watch for SUSPENDED/LOST and wind down the task.
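
To make the "watch for SUSPENDED/LOST and wind down" idea concrete, here is a minimal sketch. It deliberately avoids a Curator dependency: `ConnState` is a local stand-in for the Curator connection states named above, and `LeaderTaskGuard`, `mayContinue` and `runTask` are hypothetical names invented for illustration. In a real application the `stateChanged` method below would be called from Curator's `ConnectionStateListener.stateChanged(CuratorFramework, ConnectionState)` callback.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Local stand-in for the Curator connection states, so the sketch compiles on its own.
enum ConnState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

// Hypothetical helper: tracks whether this VM may keep running the leader-only task.
class LeaderTaskGuard {
    // True only while we believe the ZooKeeper connection (and so our leadership) is intact.
    private final AtomicBoolean leadershipValid = new AtomicBoolean(true);

    // Forward connection state changes here from the listener callback.
    void stateChanged(ConnState newState) {
        if (newState == ConnState.SUSPENDED || newState == ConnState.LOST) {
            // Assume all locks/leaders are invalid from this point on.
            leadershipValid.set(false);
        }
        // RECONNECTED does not restore the flag: leadership has to be re-acquired first.
    }

    boolean mayContinue() {
        return leadershipValid.get();
    }

    // Runs up to totalSteps units of work, checking the flag before each unit.
    // Returns the number of units that were actually completed.
    int runTask(int totalSteps, Runnable step) {
        int done = 0;
        while (done < totalSteps && mayContinue()) {
            step.run();
            done++;
        }
        return done;
    }
}
```

The task is split into small units so that it can stop soon after the flag flips. As the quoted discussion below points out, the notification can arrive late, so this narrows the window in which two VMs run the task but does not by itself guarantee exclusivity.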

> You can't assume that the notification is received locally before another
> leader election finishes elsewhere

Which notification? The ConnectionStateListener is an abstraction on ZooKeeper's watcher mechanism. It's only significant for the VM that is the leader. Non-leaders don't need to be concerned.

-JZ

On Dec 8, 2012, at 9:12 PM, Henry Robinson <[email protected]> wrote:

> You can't assume that the notification is received locally before another
> leader election finishes elsewhere (particularly if you are running slowly
> for some reason!), so it's not sufficient to guarantee that the process
> that is running locally has finished before someone else starts another.
>
> It's usually best - if possible - to restructure the system so that
> processes are idempotent to work around these kinds of problem, in
> conjunction with using the kind of primitives that Curator provides.
>
> Henry
>
> On 8 December 2012 21:04, Jordan Zimmerman <[email protected]> wrote:
>
>> This is why you need a ConnectionStateListener. You'll get a notice that
>> the connection has been suspended and you should assume all locks/leaders
>> are invalid.
>>
>> -JZ
>>
>> On Dec 8, 2012, at 9:02 PM, Henry Robinson <[email protected]> wrote:
>>
>>> What about a network disconnection? Presumably leadership is revoked when
>>> the leader appears to have failed, which can be for more reasons than a VM
>>> crash (VM running slow, network event, GC pause etc).
>>>
>>> Henry
>>>
>>> On 8 December 2012 21:00, Jordan Zimmerman <[email protected]> wrote:
>>>
>>>> The leader latch lock is the equivalent of task in progress. I assume the
>>>> task is running in the same VM as the leader lock. The only reason the VM
>>>> would lose leadership is if it crashes in which case the process would die
>>>> anyway.
>>>>
>>>> -JZ
>>>>
>>>> On Dec 8, 2012, at 8:56 PM, Eric Pederson <[email protected]> wrote:
>>>>
>>>>> If I recall correctly it was Henry Robinson that gave me the advice to have
>>>>> a "task in progress" check.
>>>>>
>>>>> -- Eric
>>>>>
>>>>> On Sat, Dec 8, 2012 at 11:54 PM, Eric Pederson <[email protected]> wrote:
>>>>>
>>>>>> I am using Curator LeaderLatch :)
>>>>>>
>>>>>> -- Eric
>>>>>>
>>>>>> On Sat, Dec 8, 2012 at 11:52 PM, Jordan Zimmerman <[email protected]> wrote:
>>>>>>
>>>>>>> You might check your leader implementation. Writing a correct leader
>>>>>>> recipe is actually quite challenging due to edge cases. Have a look at
>>>>>>> Curator (disclosure: I wrote it) for an example.
>>>>>>>
>>>>>>> -JZ
>>>>>>>
>>>>>>> On Dec 8, 2012, at 8:49 PM, Eric Pederson <[email protected]> wrote:
>>>>>>>
>>>>>>>> Actually I had the same thought and didn't consider having to do this until
>>>>>>>> I talked about my project at a Zookeeper User Group a month or so ago and I
>>>>>>>> was given this advice.
>>>>>>>>
>>>>>>>> I know that I do see leadership being lost/transferred when one of the ZK
>>>>>>>> servers is restarted (not the whole ensemble). And it seems like I've
>>>>>>>> seen it happen even when the ensemble stays totally stable (though I am not
>>>>>>>> 100% sure as it's been a while since I have worked on this particular
>>>>>>>> application).
>>>>>>>>
>>>>>>>> -- Eric
>>>>>>>>
>>>>>>>> On Sat, Dec 8, 2012 at 11:25 PM, Jordan Zimmerman <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Why would it lose leadership? The only reason I can think of is if the ZK
>>>>>>>>> cluster goes down. In normal use, the ZK cluster won't go down (I assume
>>>>>>>>> you're running 3 or 5 instances).
>>>>>>>>>
>>>>>>>>> -JZ
>>>>>>>>>
>>>>>>>>> On Dec 8, 2012, at 8:17 PM, Eric Pederson <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> During the time the task is running a cluster member could lose its
>>>>>>>>>> leadership.
>>>
>>> --
>>> Henry Robinson
>>> Software Engineer
>>> Cloudera
>>> 415-994-6679
