On Tue, Sep 26, 2017 at 4:33 PM, Mohit Jaggi <[email protected]> wrote:
> John,
> I was referring to the following log message...isn't that the right one?
> Aha - it is, apologies.
> Sep 26 18:11:58 machine62 aurora-scheduler[24743]: I0926 18:11:58.795
> [Thread-814, MesosCallbackHandler$MesosCallbackHandlerImpl:180]
> Registered with ID value: "4ca9aa06-3214-4d2c-a678-0832e2f84d17-0000"
>
> On Tue, Sep 26, 2017 at 3:30 PM, John Sirois <[email protected]> wrote:
>
>> > ... but it succeeded (see logs at the end) ...
>>
>> The underlying C++ code in the scheduler driver successfully connected to
>> the leading master socket:
>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.231549
>> 24868 scheduler.cpp:361] Connected with the master at
>> http://10.163.25.45:5050/master/api/v1/scheduler
>>
>> This is not the same as a framework registration call being successfully
>> executed against the newly connected master.
>> You need to be careful about what you derive from the logs based on a
>> reading of the words alone. Generally you'll need to look carefully / grep
>> the source code to be sure you are mentally modelling the code flows
>> correctly. It certainly gets tricky.
>>
>> On Tue, Sep 26, 2017 at 4:14 PM, Mohit Jaggi <[email protected]> wrote:
>>
>>> Hmm....so it is indeed the Mesos registration, but it succeeded (see
>>> logs at the end). Looking at the code, I see that
>>> com.google.common.eventbus is used as a pub-sub mechanism to link the
>>> registration (first log message) to the registrationAcked flag...and
>>> this flag is not being set for 10 mins (?), otherwise the registration
>>> timeout handler would not print the second log message.
>>>
>>> delayedActions.onRegistrationTimeout(
>>>     () -> {
>>>       if (!registrationAcked.get()) {
>>>         LOG.error(
>>>             "Framework has not been registered within the tolerated delay.");
>>>         stateMachine.transition(State.DEAD);
>>>       }
>>>     });
>>>
>>> Sep 26 18:11:58 machine62 aurora-scheduler[24743]: I0926 18:11:58.795
>>> [Thread-814, MesosCallbackHandler$MesosCallbackHandlerImpl:180]
>>> Registered with ID value: "4ca9aa06-3214-4d2c-a678-0832e2f84d17-0000"
>>>
>>> ...
>>>
>>> ...
>>>
>>> Sep 26 18:21:37 machine62 aurora-scheduler[24743]: E0926 18:21:37.205
>>> [Lifecycle-0, SchedulerLifecycle$4:235] Framework has not been
>>> registered within the tolerated delay.
>>>
>>> On Tue, Sep 26, 2017 at 2:34 PM, John Sirois <[email protected]> wrote:
>>>
>>>> On Tue, Sep 26, 2017 at 3:31 PM, Mohit Jaggi <[email protected]> wrote:
>>>>
>>>>> Actually I wonder if the "register" here is for Aurora to register
>>>>> with Mesos as a framework...I was assuming it referred to registering
>>>>> with ZK as a leader.
>>>>
>>>> Bingo.
>>>>
>>>>> On Tue, Sep 26, 2017 at 2:27 PM, Renan DelValle Rueda <[email protected]> wrote:
>>>>>
>>>>>> I've had this, or something very similar, happen before. It's an
>>>>>> issue with Aurora and ZK. Election is based upon ZK, so if writing
>>>>>> down who the leader is to the ZK server path fails, or if ZK is
>>>>>> unable to reach quorum on the write, the election will fail.
>>>>>> Sometimes this might manifest itself in weird ways, such as two
>>>>>> Aurora schedulers believing they are leaders. If you could tell us a
>>>>>> little bit about your ZK setup we might be able to narrow down the
>>>>>> issue. The Aurora version, and whether you are using Curator or the
>>>>>> commons library, will help as well.
>>>>>>
>>>>>> On Tue, Sep 26, 2017 at 2:02 PM, Mohit Jaggi <[email protected]> wrote:
>>>>>>
>>>>>>> Hmm..it seems machine62 became a leader but could not "register" as
>>>>>>> leader.
>>>>>>> Not sure what that means. My naive assumption is that "becoming
>>>>>>> leader" and "registering as leader" are "atomic".
>>>>>>>
>>>>>>> ------- grep on SchedulerLifecycle -----
>>>>>>> aurora-scheduler.log:Sep 26 18:11:33 machine62 aurora-scheduler[24743]:
>>>>>>> I0926 18:11:33.158 [LeaderSelector-0, StateMachine$Builder:389]
>>>>>>> SchedulerLifecycle state machine transition STORAGE_PREPARED ->
>>>>>>> LEADER_AWAITING_REGISTRATION
>>>>>>> aurora-scheduler.log:Sep 26 18:11:33 machine62 aurora-scheduler[24743]:
>>>>>>> I0926 18:11:33.159 [LeaderSelector-0, SchedulerLifecycle$4:224]
>>>>>>> Elected as leading scheduler!
>>>>>>> aurora-scheduler.log:Sep 26 18:11:37 machine62 aurora-scheduler[24743]:
>>>>>>> I0926 18:11:37.204 [LeaderSelector-0,
>>>>>>> SchedulerLifecycle$DefaultDelayedActions:163] Giving up on
>>>>>>> registration in (10, mins)
>>>>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]:
>>>>>>> E0926 18:21:37.205 [Lifecycle-0, SchedulerLifecycle$4:235]
>>>>>>> Framework has not been registered within the tolerated delay.
>>>>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]:
>>>>>>> I0926 18:21:37.205 [Lifecycle-0, StateMachine$Builder:389]
>>>>>>> SchedulerLifecycle state machine transition
>>>>>>> LEADER_AWAITING_REGISTRATION -> DEAD
>>>>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]:
>>>>>>> I0926 18:21:37.215 [Lifecycle-0, StateMachine$Builder:389]
>>>>>>> SchedulerLifecycle state machine transition DEAD -> DEAD
>>>>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]:
>>>>>>> I0926 18:21:37.215 [Lifecycle-0, SchedulerLifecycle$6:275]
>>>>>>> Shutdown already invoked, ignoring extra call.
>>>>>>> aurora-scheduler.log:Sep 26 18:22:05 machine62 aurora-scheduler[54502]:
>>>>>>> I0926 18:22:05.681 [main, StateMachine$Builder:389]
>>>>>>> SchedulerLifecycle state machine transition IDLE -> PREPARING_STORAGE
>>>>>>> aurora-scheduler.log:Sep 26 18:22:06 machine62 aurora-scheduler[54502]:
>>>>>>> I0926 18:22:06.396 [main, StateMachine$Builder:389]
>>>>>>> SchedulerLifecycle state machine transition PREPARING_STORAGE ->
>>>>>>> STORAGE_PREPARED
>>>>>>>
>>>>>>> ------ connecting to mesos -----
>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.211750
>>>>>>> 24871 group.cpp:757] Found non-sequence node 'log_replicas' at
>>>>>>> '/mesos' in ZooKeeper
>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.211817
>>>>>>> 24871 detector.cpp:152] Detected a new leader: (id='1506')
>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.211917
>>>>>>> 24871 group.cpp:699] Trying to get '/mesos/json.info_0000001506' in
>>>>>>> ZooKeeper
>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.216063
>>>>>>> 24871 zookeeper.cpp:262] A new leading master
>>>>>>> (UPID=[email protected]:5050) is detected
>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.216162
>>>>>>> 24871 scheduler.cpp:470] New master detected at [email protected]:5050
>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.217772
>>>>>>> 24871 scheduler.cpp:479] Waiting for 12.81503ms before initiating a
>>>>>>> re-(connection) attempt with the master
>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.231549
>>>>>>> 24868 scheduler.cpp:361] Connected with the master at
>>>>>>> http://10.163.25.45:5050/master/api/v1/scheduler
>>>>>>>
>>>>>>> On Tue, Sep 26, 2017 at 1:24 PM, Bill Farner <[email protected]> wrote:
>>>>>>>
>>>>>>>> Is there a reason a non-leading scheduler will talk to Mesos
>>>>>>>>
>>>>>>>> No, there is not a legitimate reason. Did this occur for an
>>>>>>>> extended period of time? Do you have logs from the scheduler
>>>>>>>> indicating that it lost ZK leadership and subsequently interacted
>>>>>>>> with Mesos?
>>>>>>>>
>>>>>>>> On Tue, Sep 26, 2017 at 1:02 PM, Mohit Jaggi <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Fellows,
>>>>>>>>> While examining Aurora log files, I noticed a condition where a
>>>>>>>>> scheduler was talking to Mesos but was not showing up as the
>>>>>>>>> leader in ZooKeeper. It ultimately restarted itself and another
>>>>>>>>> scheduler became the leader.
>>>>>>>>> Is there a reason a non-leading scheduler will talk to Mesos?
>>>>>>>>>
>>>>>>>>> Mohit.
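[Editor's note] The race discussed in this thread - a "Registered" event that must flip the registrationAcked flag before a delayed timeout action fires - can be sketched with plain java.util.concurrent. This is a hypothetical illustration, not Aurora's actual implementation: the class name RegistrationWatchdog and its methods are made up; Aurora delivers the registered event through Guava's EventBus and drives the timeout via its DelayedActions/SchedulerLifecycle machinery.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of the registration-timeout pattern:
// an acked flag set by the "registered" callback, and a delayed
// action that kills the scheduler if the flag was never set.
public class RegistrationWatchdog {
  private final AtomicBoolean registrationAcked = new AtomicBoolean(false);
  private final AtomicBoolean dead = new AtomicBoolean(false);
  private final ScheduledExecutorService timer =
      Executors.newSingleThreadScheduledExecutor();

  // Invoked when the master's Registered event is delivered
  // (in Aurora this arrives via a Guava EventBus subscriber).
  public void onRegistered() {
    registrationAcked.set(true);
  }

  // Arms the "Giving up on registration in ..." timer.
  public void armTimeout(long delay, TimeUnit unit) {
    timer.schedule(() -> {
      if (!registrationAcked.get()) {
        // Corresponds to "Framework has not been registered within
        // the tolerated delay." followed by the transition to DEAD.
        dead.set(true);
      }
    }, delay, unit);
  }

  public boolean isDead() {
    return dead.get();
  }

  public void shutdown() {
    timer.shutdownNow();
  }
}
```

The point John makes above holds in this sketch too: connecting to the master socket touches nothing here; only delivery of the registered event calls onRegistered(), so a connected-but-never-registered (or never-delivered) scheduler walks straight into DEAD when the timer fires.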
