On Tue, Sep 26, 2017 at 3:31 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
> Actually I wonder if the "register" here is for Aurora to register with
> Mesos as a framework... I was assuming it referred to registering with ZK
> as a leader.

Bingo.

> On Tue, Sep 26, 2017 at 2:27 PM, Renan DelValle Rueda <
> renanidelva...@gmail.com> wrote:
>
>> I've had this, or something very similar, happen before. It's an issue
>> with Aurora and ZK. Election is based upon ZK, so if writing down who the
>> leader is to the ZK server path fails, or if ZK is unable to reach quorum
>> on the write, the election will fail. Sometimes this manifests in weird
>> ways, such as two Aurora schedulers both believing they are the leader.
>> If you could tell us a little about your ZK setup, we might be able to
>> narrow down the issue. The Aurora version, and whether you are using
>> Curator or the commons library, will help as well.
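A side note on the election mechanics, since they come up again below: the
"LeaderSelector-0" thread in the logs further down is Curator's leader
selector recipe, which boils down to each contender writing an ephemeral
sequential node under a ZK path, so a failed write (or lost quorum) is a
failed election. A rough sketch of that pattern; the connect string and
election path are made up for illustration and this is not Aurora's actual
wiring:

----- sketch: leader election via Curator's LeaderSelector -----
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderSelector;
import org.apache.curator.framework.recipes.leader.LeaderSelectorListenerAdapter;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class ElectionSketch {
  public static void main(String[] args) throws Exception {
    // Connect to the ZK ensemble; every election write needs quorum,
    // which is why a flaky ensemble makes elections fail.
    CuratorFramework client = CuratorFrameworkFactory.newClient(
        "zk1:2181,zk2:2181,zk3:2181",           // illustrative ensemble
        new ExponentialBackoffRetry(1000, 3));  // retry transient ZK errors
    client.start();

    // Each contender creates an ephemeral sequential node under the
    // election path; the lowest sequence number holds leadership.
    LeaderSelector selector = new LeaderSelector(
        client,
        "/aurora/scheduler/leader",             // illustrative path
        new LeaderSelectorListenerAdapter() {
          @Override
          public void takeLeadership(CuratorFramework c) throws Exception {
            // Leadership is held only while this method blocks; if the
            // ZK session drops, the ephemeral node vanishes and another
            // contender is elected.
            System.out.println("Elected as leading scheduler!");
            Thread.currentThread().join();
          }
        });
    selector.autoRequeue();  // rejoin the election if leadership is lost
    selector.start();
    Thread.currentThread().join();  // keep the demo process alive
  }
}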
>> On Tue, Sep 26, 2017 at 2:02 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>
>>> Hmm... it seems machine62 became a leader but could not "register" as
>>> leader. Not sure what that means. My naive assumption is that "becoming
>>> leader" and "registering as leader" are atomic.
>>>
>>> ------- grep on SchedulerLifecycle -----
>>> aurora-scheduler.log:Sep 26 18:11:33 machine62 aurora-scheduler[24743]: I0926 18:11:33.158 [LeaderSelector-0, StateMachine$Builder:389] SchedulerLifecycle state machine transition STORAGE_PREPARED -> LEADER_AWAITING_REGISTRATION
>>> aurora-scheduler.log:Sep 26 18:11:33 machine62 aurora-scheduler[24743]: I0926 18:11:33.159 [LeaderSelector-0, SchedulerLifecycle$4:224] Elected as leading scheduler!
>>> aurora-scheduler.log:Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.204 [LeaderSelector-0, SchedulerLifecycle$DefaultDelayedActions:163] Giving up on registration in (10, mins)
>>> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]: E0926 18:21:37.205 [Lifecycle-0, SchedulerLifecycle$4:235] Framework has not been registered within the tolerated delay.
>>> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]: I0926 18:21:37.205 [Lifecycle-0, StateMachine$Builder:389] SchedulerLifecycle state machine transition LEADER_AWAITING_REGISTRATION -> DEAD
>>> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]: I0926 18:21:37.215 [Lifecycle-0, StateMachine$Builder:389] SchedulerLifecycle state machine transition DEAD -> DEAD
>>> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]: I0926 18:21:37.215 [Lifecycle-0, SchedulerLifecycle$6:275] Shutdown already invoked, ignoring extra call.
>>> aurora-scheduler.log:Sep 26 18:22:05 machine62 aurora-scheduler[54502]: I0926 18:22:05.681 [main, StateMachine$Builder:389] SchedulerLifecycle state machine transition IDLE -> PREPARING_STORAGE
>>> aurora-scheduler.log:Sep 26 18:22:06 machine62 aurora-scheduler[54502]: I0926 18:22:06.396 [main, StateMachine$Builder:389] SchedulerLifecycle state machine transition PREPARING_STORAGE -> STORAGE_PREPARED
>>>
>>> ------ connecting to mesos -----
>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.211750 24871 group.cpp:757] Found non-sequence node 'log_replicas' at '/mesos' in ZooKeeper
>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.211817 24871 detector.cpp:152] Detected a new leader: (id='1506')
>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.211917 24871 group.cpp:699] Trying to get '/mesos/json.info_0000001506' in ZooKeeper
>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.216063 24871 zookeeper.cpp:262] A new leading master (UPID=master@10.163.25.45:5050) is detected
>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.216162 24871 scheduler.cpp:470] New master detected at master@10.163.25.45:5050
>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.217772 24871 scheduler.cpp:479] Waiting for 12.81503ms before initiating a re-(connection) attempt with the master
>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.231549 24868 scheduler.cpp:361] Connected with the master at http://10.163.25.45:5050/master/api/v1/scheduler
>>>
>>> On Tue, Sep 26, 2017 at 1:24 PM, Bill Farner <wfar...@apache.org> wrote:
>>>
>>>> Is there a reason a non-leading scheduler will talk to Mesos
>>>>
>>>> No, there is not a legitimate reason. Did this occur for an extended
>>>> period of time? Do you have logs from the scheduler indicating that it
>>>> lost ZK leadership and subsequently interacted with Mesos?
>>>>
>>>> On Tue, Sep 26, 2017 at 1:02 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>
>>>>> Fellows,
>>>>> While examining Aurora log files, I noticed a condition where a
>>>>> scheduler was talking to Mesos but was not showing up as the leader in
>>>>> ZooKeeper. It ultimately restarted itself and another scheduler became
>>>>> the leader.
>>>>> Is there a reason a non-leading scheduler will talk to Mesos?
>>>>>
>>>>> Mohit.
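To close the loop on the original question: winning the ZK election and
registering with Mesos are two distinct steps. After being elected, the
scheduler opens an HTTP connection to the leading master and sends a
SUBSCRIBE call on the v1 API (the "Connected with the master" line above is
only the connection, not the subscription), and it sits in
LEADER_AWAITING_REGISTRATION until that completes. If it cannot register
within the tolerated delay, it deliberately exits so a standby can take
over, which is exactly what machine62 did at 18:21:37. The pattern is
roughly the following; this is an illustrative sketch, not the actual
SchedulerLifecycle code, and the class and method names are made up:

----- sketch: registration timeout watchdog -----
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class RegistrationWatchdog {
  private final AtomicBoolean registered = new AtomicBoolean(false);
  private final ScheduledExecutorService executor =
      Executors.newSingleThreadScheduledExecutor();

  // Invoked after winning the ZK election ("Elected as leading scheduler!").
  void onElectedLeader() {
    // Give framework registration a bounded window; compare
    // "Giving up on registration in (10, mins)" in the logs above.
    executor.schedule(() -> {
      if (!registered.get()) {
        // Mirrors "Framework has not been registered within the tolerated
        // delay.": exit so a standby scheduler can win the next election.
        System.err.println("Framework not registered in time; shutting down");
        System.exit(1);
      }
    }, 10, TimeUnit.MINUTES);
  }

  // Invoked once the Mesos master acknowledges the subscription.
  void onFrameworkRegistered() {
    registered.set(true);
  }
}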