Actually, I wonder if the "register" here means Aurora registering with Mesos as a framework... I was assuming it referred to registering with ZK as the leader.
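To make the distinction concrete, here is a minimal sketch (not Aurora's actual code) of why the two cannot be atomic: ZK leadership is decided by a ZooKeeper recipe (Curator's LeaderLatch below), while framework registration only happens afterwards, once the new leader has connected to the Mesos master. The ZK connect string, latch path, and the connectToMesosAndAwaitRegistration() helper are invented for illustration.

// Sketch only: ZK leadership and Mesos framework registration are two
// separate events, so "elected as leader" does not imply "registered".
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.framework.recipes.leader.LeaderLatchListener;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class LeaderVsRegistration {
  public static void main(String[] args) throws Exception {
    CuratorFramework zk = CuratorFrameworkFactory.newClient(
        "zk1:2181,zk2:2181,zk3:2181",              // hypothetical ZK ensemble
        new ExponentialBackoffRetry(1000, 3));
    zk.start();

    // Hypothetical latch path; Aurora's actual serverset path will differ.
    LeaderLatch latch = new LeaderLatch(zk, "/aurora/scheduler/leader");
    latch.addListener(new LeaderLatchListener() {
      @Override
      public void isLeader() {
        // Step 1 done: we won the ZK election ("Elected as leading scheduler!").
        // Step 2 is still pending: connect to the Mesos master and wait for it
        // to acknowledge the framework registration. If that never arrives,
        // ZK leadership alone is useless and the scheduler has to give up.
        connectToMesosAndAwaitRegistration();      // placeholder, not a real API
      }

      @Override
      public void notLeader() {
        // Lost ZK leadership; any Mesos connection should be torn down here.
      }
    });
    latch.start();
  }

  private static void connectToMesosAndAwaitRegistration() {
    // Intentionally left empty; in Aurora this is driven by SchedulerLifecycle.
  }
}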
On Tue, Sep 26, 2017 at 2:27 PM, Renan DelValle Rueda <[email protected]> wrote:

> I've had this, or something very similar, happen before. It's an issue with Aurora and ZK. Election is based upon ZK, so if writing down who the leader is to the ZK server path fails, or if ZK is unable to reach quorum on the write, the election will fail. Sometimes this might manifest itself in weird ways, such as two Aurora schedulers believing they are leaders. If you could tell us a little bit about your ZK setup, we might be able to narrow down the issue. Also, the Aurora version and whether you are using Curator or the commons library will help as well.
>
> On Tue, Sep 26, 2017 at 2:02 PM, Mohit Jaggi <[email protected]> wrote:
>
>> Hmm... it seems machine62 became a leader but could not "register" as leader. Not sure what that means. My naive assumption is that "becoming leader" and "registering as leader" are atomic.
>>
>> ------- grep on SchedulerLifecycle -----
>> aurora-scheduler.log:Sep 26 18:11:33 machine62 aurora-scheduler[24743]: I0926 18:11:33.158 [LeaderSelector-0, StateMachine$Builder:389] SchedulerLifecycle state machine transition STORAGE_PREPARED -> LEADER_AWAITING_REGISTRATION
>> aurora-scheduler.log:Sep 26 18:11:33 machine62 aurora-scheduler[24743]: I0926 18:11:33.159 [LeaderSelector-0, SchedulerLifecycle$4:224] Elected as leading scheduler!
>> aurora-scheduler.log:Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.204 [LeaderSelector-0, SchedulerLifecycle$DefaultDelayedActions:163] Giving up on registration in (10, mins)
>> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]: E0926 18:21:37.205 [Lifecycle-0, SchedulerLifecycle$4:235] Framework has not been registered within the tolerated delay.
>> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]: I0926 18:21:37.205 [Lifecycle-0, StateMachine$Builder:389] SchedulerLifecycle state machine transition LEADER_AWAITING_REGISTRATION -> DEAD
>> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]: I0926 18:21:37.215 [Lifecycle-0, StateMachine$Builder:389] SchedulerLifecycle state machine transition DEAD -> DEAD
>> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]: I0926 18:21:37.215 [Lifecycle-0, SchedulerLifecycle$6:275] Shutdown already invoked, ignoring extra call.
>> aurora-scheduler.log:Sep 26 18:22:05 machine62 aurora-scheduler[54502]: I0926 18:22:05.681 [main, StateMachine$Builder:389] SchedulerLifecycle state machine transition IDLE -> PREPARING_STORAGE
>> aurora-scheduler.log:Sep 26 18:22:06 machine62 aurora-scheduler[54502]: I0926 18:22:06.396 [main, StateMachine$Builder:389] SchedulerLifecycle state machine transition PREPARING_STORAGE -> STORAGE_PREPARED
>>
>> ------ connecting to mesos -----
>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.211750 24871 group.cpp:757] Found non-sequence node 'log_replicas' at '/mesos' in ZooKeeper
>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.211817 24871 detector.cpp:152] Detected a new leader: (id='1506')
>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.211917 24871 group.cpp:699] Trying to get '/mesos/json.info_0000001506' in ZooKeeper
>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.216063 24871 zookeeper.cpp:262] A new leading master (UPID=[email protected]:5050) is detected
>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.216162 24871 scheduler.cpp:470] New master detected at [email protected]:5050
>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.217772 24871 scheduler.cpp:479] Waiting for 12.81503ms before initiating a re-(connection) attempt with the master
>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.231549 24868 scheduler.cpp:361] Connected with the master at http://10.163.25.45:5050/master/api/v1/scheduler
>>
>> On Tue, Sep 26, 2017 at 1:24 PM, Bill Farner <[email protected]> wrote:
>>
>>> Is there a reason a non-leading scheduler will talk to Mesos
>>>
>>> No, there is not a legitimate reason. Did this occur for an extended period of time? Do you have logs from the scheduler indicating that it lost ZK leadership and subsequently interacted with mesos?
>>>
>>> On Tue, Sep 26, 2017 at 1:02 PM, Mohit Jaggi <[email protected]> wrote:
>>>
>>>> Fellows,
>>>> While examining Aurora log files, I noticed a condition where a scheduler was talking to Mesos but it was not showing up as a leader in Zookeeper. It ultimately restarted itself and another scheduler became the leader.
>>>> Is there a reason a non-leading scheduler will talk to Mesos?
>>>>
>>>> Mohit.
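For reference, the LEADER_AWAITING_REGISTRATION -> DEAD sequence in the log excerpts quoted above ("Giving up on registration in (10, mins)" followed ten minutes later by "Framework has not been registered within the tolerated delay.") looks like a watchdog on registration. A rough sketch of that general pattern follows; this is not Aurora's actual SchedulerLifecycle implementation, and the class and method names are invented for illustration.

// Sketch of a registration watchdog: after winning the election, arm a timer;
// if registration with the Mesos master has not completed when it fires, tear
// the process down so another scheduler instance can take over.
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class RegistrationWatchdog {
  private final AtomicBoolean registered = new AtomicBoolean(false);
  private final ScheduledExecutorService timer =
      Executors.newSingleThreadScheduledExecutor();

  /** Call when ZK leadership is acquired ("Elected as leading scheduler!"). */
  public void onElected(long toleratedDelayMinutes) {
    timer.schedule(() -> {
      if (!registered.get()) {
        // Mirrors "Framework has not been registered within the tolerated delay."
        System.err.println("Framework not registered in time; shutting down.");
        System.exit(1);  // the process dies so leadership can move elsewhere
      }
    }, toleratedDelayMinutes, TimeUnit.MINUTES);
  }

  /** Call once the Mesos master acknowledges the framework registration. */
  public void onRegistered() {
    registered.set(true);
  }
}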
