Thanks Vitalii! I will think about this and ask if I have any questions.
-- Eric

On Tue, Dec 11, 2012 at 3:09 PM, Vitalii Tymchyshyn <[email protected]> wrote:

I am asking because you have an "at most once" vs. "at least once" problem. I don't think you can have "exactly once" unless your jobs are transactional and you can synchronize your transaction commits to ZooKeeper (preferably with two-phase commit). So you need to decide.

What I'd recommend is a queue-like architecture rather than a lock-based one. This way you can:
a) Process tasks in parallel.
b) Increase timeouts to be larger than the maximum task time, e.g. set the timeout to one hour. This would mean that the task restarts in an hour if its client fails.

But this would mean moving task status/queueing from the database to ZooKeeper. In my view that would be good, since the database is a SPOF for you.

Best regards,
Vitalii Tymchyshyn

2012/12/10 Eric Pederson <[email protected]>:

It depends on the scheduled task. Some have status fields in the database that indicate new/in-progress/done, but others do not.

-- Eric

On Mon, Dec 10, 2012 at 1:49 AM, Vitalii Tymchyshyn <[email protected]> wrote:

How are you going to ensure atomicity? I mean, if your processor dies in the middle of the operation, how do you know whether it is done or not?

--
Best regards,
Vitalii Tymchyshyn

On Dec 10, 2012 at 00:11, "Eric Pederson" <[email protected]> wrote:

Also, sometimes the app leadership (via LeaderLatch) will get lost - I will follow up about this on the Curator list:
https://gist.github.com/4247226

So, back to my previous question: what is the best way to implement the "fence"?
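[Editor's note: Vitalii's "queue with a timeout larger than the maximum task time" suggestion can be sketched in plain Java. This is a minimal, hypothetical in-memory model (the real design would keep the queue in ZooKeeper); the class and method names are illustrative. It shows the at-least-once behavior he describes: a claimed task whose lease expires becomes claimable again.]

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical sketch of a lease-based task queue: a worker claims a task
// for a fixed lease period; if the worker dies and never calls complete(),
// the task becomes visible again after the lease expires. This gives
// at-least-once (not exactly-once) semantics, as discussed in the thread.
class LeaseQueue {
    private final Queue<String> pending = new ArrayDeque<>();
    private final Map<String, Long> leased = new HashMap<>(); // task -> lease expiry
    private final long leaseMillis;

    LeaseQueue(long leaseMillis) { this.leaseMillis = leaseMillis; }

    synchronized void add(String task) { pending.add(task); }

    // Claim the next task, or re-claim one whose lease has expired.
    synchronized String claim(long now) {
        for (Map.Entry<String, Long> e : leased.entrySet()) {
            if (e.getValue() <= now) {           // lease expired: redeliver
                e.setValue(now + leaseMillis);
                return e.getKey();
            }
        }
        String task = pending.poll();
        if (task != null) leased.put(task, now + leaseMillis);
        return task;
    }

    // Worker finished: remove the task for good.
    synchronized void complete(String task) { leased.remove(task); }
}
```

With a one-hour lease (as in the example above), a task whose worker crashes is simply handed to another worker an hour later, with no manual failover step.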
-- Eric

On Sun, Dec 9, 2012 at 4:42 PM, Eric Pederson <[email protected]> wrote:

The irony is that I am using leader election to convert non-idempotent operations into idempotent operations :) For example, once a night a report is emailed out to a set of addresses. We don't want the report to go to the same person more than once.

Prior to using leader election, one of the cluster members was designated as the scheduled-task "leader" using a system property. But if that cluster member crashed, it required a manual operation to fail over the "leader" responsibility to another cluster member. I considered using app-specific techniques to make the scheduled tasks idempotent (for example, "select for update" / database locking), but I wanted a general solution, and I needed clustering support for other reasons (cluster membership, etc.).

Anyway, here is the code that I'm using.

Application startup (using Curator LeaderLatch):
https://gist.github.com/3936162
https://gist.github.com/3935895
https://gist.github.com/3935889

ClusterStatus:
https://gist.github.com/3943149
https://gist.github.com/3935861

Scheduled task:
https://gist.github.com/4246388

In the last gist the "distribute" scheduled task runs every 30 seconds. It checks clusterStatus.isLeader to see if the current cluster member is the leader before running the real method (which sends email). clusterStatus() calls methods on LeaderLatch.
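[Editor's note: the guard pattern Eric describes - a scheduled method that checks an isLeader flag immediately before doing the real work - can be sketched as below. This is a hypothetical stand-in, not code from Eric's gists; in the real application the flag would be fed by Curator's LeaderLatch and the method scheduled every 30 seconds.]

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: the periodic task runs its real work only when this
// member currently believes it is the leader, and silently skips otherwise.
class LeaderGuardedTask {
    private final AtomicBoolean isLeader = new AtomicBoolean(false);
    final AtomicInteger runs = new AtomicInteger(0); // stands in for "emails sent"

    // In the real app this would be driven by LeaderLatch callbacks.
    void setLeader(boolean leader) { isLeader.set(leader); }

    // Invoked on a schedule (every 30 seconds in the thread's example).
    void distribute() {
        if (!isLeader.get()) {
            return;                 // not the leader: do nothing this cycle
        }
        runs.incrementAndGet();     // stand-in for "send the report"
    }
}
```

Note that, as Henry points out below, the check and the work are not atomic: leadership can be lost between the isLeader check and the send, which is exactly the fencing problem under discussion.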
Here is the output that I am seeing if I kill the ZK quorum leader and the app cluster member that was the leader loses its LeaderLatch leadership to another cluster member:
https://gist.github.com/4247058

-- Eric

On Sun, Dec 9, 2012 at 12:30 AM, Henry Robinson <[email protected]> wrote:

On 8 December 2012 21:18, Jordan Zimmerman <[email protected]> wrote:

> If your ConnectionStateListener gets SUSPENDED or LOST you've lost connection to ZooKeeper. Therefore you cannot use that same ZooKeeper connection to manage a node that denotes whether the process is running. Only one VM at a time will be running the process. That process can watch for SUSPENDED/LOST and wind down the task.

My point is that by the time that VM sees SUSPENDED/LOST, another VM may have been elected leader and have started running another process.

It's a classic problem - you need some mechanism to fence a node that thinks it's the leader, but isn't and hasn't got the memo yet. The way around the problem is either to ensure that no work is done once you are no longer the leader (perhaps by checking every time you want to do work), or that the work you do does not affect the system (e.g. via idempotent work units).

ZK itself solves this internally by checking that it has a quorum for every operation, which forces an ordering between the disconnection event and trying to do something that relies upon being the leader. Other systems forcibly terminate old leaders before allowing a new leader to take the throne.

Henry

On 8 December 2012 21:18, Jordan Zimmerman <[email protected]> wrote:

> You can't assume that the notification is received locally before another leader election finishes elsewhere

Which notification? The ConnectionStateListener is an abstraction on ZooKeeper's watcher mechanism. It's only significant for the VM that is the leader. Non-leaders don't need to be concerned.

-JZ

On Dec 8, 2012, at 9:12 PM, Henry Robinson <[email protected]> wrote:

You can't assume that the notification is received locally before another leader election finishes elsewhere (particularly if you are running slowly for some reason!), so it's not sufficient to guarantee that the process that is running locally has finished before someone else starts another.

It's usually best - if possible - to restructure the system so that processes are idempotent, in conjunction with using the kind of primitives that Curator provides, to work around these kinds of problems.

Henry

On 8 December 2012 21:04, Jordan Zimmerman <[email protected]> wrote:

This is why you need a ConnectionStateListener. You'll get a notice that the connection has been suspended, and you should assume all locks/leaders are invalid.
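[Editor's note: the fencing mechanism Henry describes - "other systems forcibly terminate old leaders" or reject their work - is commonly implemented with a fencing token: each newly elected leader is issued a strictly increasing epoch (a ZooKeeper znode version or sequence number can serve), and the shared resource rejects requests carrying a stale epoch. A minimal, hypothetical sketch:]

```java
// Hypothetical fenced resource: it remembers the highest leader epoch it
// has seen and rejects writes from any older epoch. A deposed leader that
// has not yet observed SUSPENDED/LOST is thereby fenced out even though
// it still believes it is the leader.
class FencedResource {
    private long highestEpoch = -1;

    // Returns true if the write is accepted, false if the caller is fenced.
    synchronized boolean write(long epoch, String data) {
        if (epoch < highestEpoch) {
            return false;           // stale leader: reject the late write
        }
        highestEpoch = epoch;
        // ... apply `data` to the real resource here ...
        return true;
    }
}
```

This forces the ordering Henry mentions: once the new leader's first write arrives, any delayed write from the old leader fails rather than sending, say, a duplicate report.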
-JZ

On Dec 8, 2012, at 9:02 PM, Henry Robinson <[email protected]> wrote:

What about a network disconnection? Presumably leadership is revoked when the leader appears to have failed, which can happen for more reasons than a VM crash (VM running slowly, network event, GC pause, etc.).

Henry

On 8 December 2012 21:00, Jordan Zimmerman <[email protected]> wrote:

The leader latch lock is the equivalent of "task in progress". I assume the task is running in the same VM as the leader lock. The only reason the VM would lose leadership is if it crashes, in which case the process would die anyway.

-JZ

On Dec 8, 2012, at 8:56 PM, Eric Pederson <[email protected]> wrote:

If I recall correctly, it was Henry Robinson who gave me the advice to have a "task in progress" check.
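[Editor's note: Jordan's advice - on SUSPENDED or LOST, assume all locks and leadership are invalid and wind the task down - can be sketched without a live ZooKeeper as below. The enum mirrors Curator's ConnectionState values; in the real application the listener would be registered on the CuratorFramework instance, and regaining a connection would still require re-checking the LeaderLatch before doing leader-only work.]

```java
// Hypothetical connection-state listener: leader-only work is permitted
// while connected, and is immediately forbidden on SUSPENDED or LOST.
class WindDownListener {
    enum ConnectionState { CONNECTED, RECONNECTED, SUSPENDED, LOST }

    private volatile boolean mayDoLeaderWork = false;

    void stateChanged(ConnectionState newState) {
        switch (newState) {
            case SUSPENDED:
            case LOST:
                mayDoLeaderWork = false; // assume all locks/leadership invalid
                break;
            case CONNECTED:
            case RECONNECTED:
                // Connected again - but leadership itself must still be
                // re-verified against the latch before acting on this.
                mayDoLeaderWork = true;
                break;
        }
    }

    boolean mayDoLeaderWork() { return mayDoLeaderWork; }
}
```

As Henry's objection above makes clear, this flag alone is not a fence - the notification may arrive after a new leader has already started - so it should be combined with idempotent tasks or a fencing token, not used instead of them.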
-- Eric

On Sat, Dec 8, 2012 at 11:54 PM, Eric Pederson <[email protected]> wrote:

I am using Curator LeaderLatch :)

-- Eric

On Sat, Dec 8, 2012 at 11:52 PM, Jordan Zimmerman <[email protected]> wrote:

You might check your leader implementation. Writing a correct leader recipe is actually quite challenging due to edge cases. Have a look at Curator (disclosure: I wrote it) for an example.

-JZ

On Dec 8, 2012, at 8:49 PM, Eric Pederson <[email protected]> wrote:

Actually, I had the same thought and didn't consider having to do this until I talked about my project at a Zookeeper User Group a month or so ago, where I was given this advice.

I know that I do see leadership being lost/transferred when one of the ZK servers is restarted (not the whole ensemble).
And it seems like I've seen it happen even when the ensemble stays totally stable (though I am not 100% sure, as it's been a while since I have worked on this particular application).

-- Eric

On Sat, Dec 8, 2012 at 11:25 PM, Jordan Zimmerman <[email protected]> wrote:

Why would it lose leadership? The only reason I can think of is if the ZK cluster goes down. In normal use, the ZK cluster won't go down (I assume you're running 3 or 5 instances).

-JZ

On Dec 8, 2012, at 8:17 PM, Eric Pederson <[email protected]> wrote:

During the time the task is running, a cluster member could lose its leadership.
--
Henry Robinson
Software Engineer
Cloudera
415-994-6679

--
Best regards,
Vitalii Tymchyshyn
