> ... but it succeeded (see logs at the end) ... The underlying C++ code
> in the scheduler driver successfully connected to the leading master
> socket:
>
> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.231549
> 24868 scheduler.cpp:361] Connected with the master at
> http://10.163.25.45:5050/master/api/v1/scheduler

This is not the same as a framework registration call being successfully
executed against the newly connected master. You need to be careful about
what you derive from the logs based on a reading of the words alone.
Generally you'll need to look carefully / grep the source code to be sure
you are mentally modelling the code flows correctly. It certainly gets
tricky.
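To make the distinction concrete, here is a minimal sketch of the two
callbacks involved (illustrative names only, not Aurora's or Mesos'
actual classes): the driver-level "connected" callback only means the
HTTP connection to the leading master is up; registration is complete
only once the master answers the SUBSCRIBE call.

import java.util.concurrent.atomic.AtomicBoolean;

public class ConnectedVsRegistered {
  private final AtomicBoolean registrationAcked = new AtomicBoolean(false);

  // Fired when the master socket is up ("Connected with the master at
  // http://..." in scheduler.cpp). Nothing is registered yet.
  void onConnected() {
    System.out.println("connected: sending SUBSCRIBE call");
    // The SUBSCRIBE call would be sent here; the master may still drop
    // or reject it, so the "Connected" log line alone proves nothing
    // about registration.
  }

  // Fired only when the master acks the SUBSCRIBE with a framework ID
  // ("Registered with ID value: ..." in MesosCallbackHandler).
  void onSubscribed(String frameworkId) {
    registrationAcked.set(true);
    System.out.println("registered with ID " + frameworkId);
  }

  public static void main(String[] args) {
    ConnectedVsRegistered scheduler = new ConnectedVsRegistered();
    scheduler.onConnected();
    // If onSubscribed() never fires, registrationAcked stays false and
    // the registration-timeout watchdog eventually kills the scheduler.
    System.out.println("registered? " + scheduler.registrationAcked.get());
  }
}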
On Tue, Sep 26, 2017 at 4:14 PM, Mohit Jaggi <[email protected]> wrote:

> Hmm... so it is indeed the Mesos registration, but it succeeded (see
> logs at the end). Looking at the code, I see that
> com.google.common.eventbus is used as a pub-sub mechanism to link the
> registration (first log message) to the registrationAcked flag... and
> this flag is not being set for 10 mins (?), otherwise the registration
> timeout handler would not print the second log message.
>
> delayedActions.onRegistrationTimeout(
>     () -> {
>       if (!registrationAcked.get()) {
>         LOG.error(
>             "Framework has not been registered within the tolerated delay.");
>         stateMachine.transition(State.DEAD);
>       }
>     });
>
> Sep 26 18:11:58 machine62 aurora-scheduler[24743]: I0926 18:11:58.795
> [Thread-814, MesosCallbackHandler$MesosCallbackHandlerImpl:180]
> Registered with ID value: "4ca9aa06-3214-4d2c-a678-0832e2f84d17-0000"
>
> ...
>
> Sep 26 18:21:37 machine62 aurora-scheduler[24743]: E0926 18:21:37.205
> [Lifecycle-0, SchedulerLifecycle$4:235] Framework has not been registered
> within the tolerated delay.
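For reference, a minimal sketch of the Guava EventBus wiring Mohit
describes (the event and class names are illustrative, not Aurora's
actual ones): the Mesos callback handler posts an event when registration
is acked, and a lifecycle subscriber flips the registrationAcked flag
that the timeout handler above checks.

import com.google.common.eventbus.EventBus;
import com.google.common.eventbus.Subscribe;
import java.util.concurrent.atomic.AtomicBoolean;

public class RegistrationWiring {
  // Event posted when the master acks registration (hypothetical name).
  static final class DriverRegistered {
    final String frameworkId;
    DriverRegistered(String frameworkId) { this.frameworkId = frameworkId; }
  }

  static final class Lifecycle {
    final AtomicBoolean registrationAcked = new AtomicBoolean(false);

    // Guava invokes this for every DriverRegistered posted on the bus.
    @Subscribe
    public void onRegistered(DriverRegistered event) {
      // The only place the flag is set; if this subscriber never runs,
      // the registration-timeout handler wins and the scheduler dies.
      registrationAcked.set(true);
    }
  }

  public static void main(String[] args) {
    EventBus eventBus = new EventBus();
    Lifecycle lifecycle = new Lifecycle();
    eventBus.register(lifecycle);

    // The callback handler would post this on "Registered with ID value".
    eventBus.post(new DriverRegistered("4ca9aa06-3214-4d2c-a678-0832e2f84d17-0000"));
    System.out.println("acked: " + lifecycle.registrationAcked.get());
  }
}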
> On Tue, Sep 26, 2017 at 2:34 PM, John Sirois <[email protected]> wrote:
>
>> On Tue, Sep 26, 2017 at 3:31 PM, Mohit Jaggi <[email protected]> wrote:
>>
>>> Actually I wonder if the "register" here is for Aurora to register
>>> with Mesos as a framework... I was assuming it referred to registering
>>> with ZK as a leader.
>>
>> Bingo.
>>
>>> On Tue, Sep 26, 2017 at 2:27 PM, Renan DelValle Rueda <
>>> [email protected]> wrote:
>>>
>>>> I've had this, or something very similar, happen before. It's an
>>>> issue with Aurora and ZK. Election is based on ZK, so if writing down
>>>> who the leader is to the ZK server path fails, or if ZK is unable to
>>>> reach quorum on the write, the election will fail. Sometimes this can
>>>> manifest itself in weird ways, such as two Aurora schedulers believing
>>>> they are leaders. If you could tell us a little bit about your ZK
>>>> setup, we might be able to narrow down the issue. The Aurora version,
>>>> and whether you are using Curator or the commons library, would help
>>>> as well.
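A minimal sketch of the ZK-based election Renan refers to, using Apache
Curator's LeaderSelector recipe (the connect string and election path are
made up; this is not Aurora's actual election code):

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderSelector;
import org.apache.curator.framework.recipes.leader.LeaderSelectorListenerAdapter;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class ElectionSketch {
  public static void main(String[] args) throws Exception {
    CuratorFramework client = CuratorFrameworkFactory.newClient(
        "zk1:2181,zk2:2181,zk3:2181",        // hypothetical ensemble
        new ExponentialBackoffRetry(1000, 3));
    client.start();

    LeaderSelector selector = new LeaderSelector(
        client,
        "/aurora/scheduler",                 // hypothetical election path
        new LeaderSelectorListenerAdapter() {
          @Override
          public void takeLeadership(CuratorFramework c) throws Exception {
            // Leadership lasts only while this method blocks. Advertising
            // the leader (a write into ZK) can still fail here, e.g. on
            // quorum loss -- one way "became leader" and "registered as
            // leader" can diverge.
            System.out.println("elected as leading scheduler");
            Thread.currentThread().join();   // hold leadership
          }
        });
    selector.autoRequeue();                  // rejoin the election if lost
    selector.start();
  }
}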
>>>> On Tue, Sep 26, 2017 at 2:02 PM, Mohit Jaggi <[email protected]> wrote:
>>>>
>>>>> Hmm... it seems machine62 became a leader but could not "register"
>>>>> as leader. Not sure what that means. My naive assumption is that
>>>>> "becoming leader" and "registering as leader" are "atomic".
>>>>>
>>>>> ------- grep on SchedulerLifecycle -----
>>>>> aurora-scheduler.log:Sep 26 18:11:33 machine62 aurora-scheduler[24743]:
>>>>> I0926 18:11:33.158 [LeaderSelector-0, StateMachine$Builder:389]
>>>>> SchedulerLifecycle state machine transition STORAGE_PREPARED ->
>>>>> LEADER_AWAITING_REGISTRATION
>>>>> aurora-scheduler.log:Sep 26 18:11:33 machine62 aurora-scheduler[24743]:
>>>>> I0926 18:11:33.159 [LeaderSelector-0, SchedulerLifecycle$4:224]
>>>>> Elected as leading scheduler!
>>>>> aurora-scheduler.log:Sep 26 18:11:37 machine62 aurora-scheduler[24743]:
>>>>> I0926 18:11:37.204 [LeaderSelector-0,
>>>>> SchedulerLifecycle$DefaultDelayedActions:163] Giving up on
>>>>> registration in (10, mins)
>>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]:
>>>>> E0926 18:21:37.205 [Lifecycle-0, SchedulerLifecycle$4:235] Framework
>>>>> has not been registered within the tolerated delay.
>>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]:
>>>>> I0926 18:21:37.205 [Lifecycle-0, StateMachine$Builder:389]
>>>>> SchedulerLifecycle state machine transition
>>>>> LEADER_AWAITING_REGISTRATION -> DEAD
>>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]:
>>>>> I0926 18:21:37.215 [Lifecycle-0, StateMachine$Builder:389]
>>>>> SchedulerLifecycle state machine transition DEAD -> DEAD
>>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]:
>>>>> I0926 18:21:37.215 [Lifecycle-0, SchedulerLifecycle$6:275] Shutdown
>>>>> already invoked, ignoring extra call.
>>>>> aurora-scheduler.log:Sep 26 18:22:05 machine62 aurora-scheduler[54502]:
>>>>> I0926 18:22:05.681 [main, StateMachine$Builder:389] SchedulerLifecycle
>>>>> state machine transition IDLE -> PREPARING_STORAGE
>>>>> aurora-scheduler.log:Sep 26 18:22:06 machine62 aurora-scheduler[54502]:
>>>>> I0926 18:22:06.396 [main, StateMachine$Builder:389] SchedulerLifecycle
>>>>> state machine transition PREPARING_STORAGE -> STORAGE_PREPARED
>>>>>
>>>>> ------ connecting to mesos -----
>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926
>>>>> 18:11:37.211750 24871 group.cpp:757] Found non-sequence node
>>>>> 'log_replicas' at '/mesos' in ZooKeeper
>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926
>>>>> 18:11:37.211817 24871 detector.cpp:152] Detected a new leader:
>>>>> (id='1506')
>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926
>>>>> 18:11:37.211917 24871 group.cpp:699] Trying to get
>>>>> '/mesos/json.info_0000001506' in ZooKeeper
>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926
>>>>> 18:11:37.216063 24871 zookeeper.cpp:262] A new leading master (UPID=
>>>>> [email protected]:5050) is detected
>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926
>>>>> 18:11:37.216162 24871 scheduler.cpp:470] New master detected at
>>>>> [email protected]:5050
>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926
>>>>> 18:11:37.217772 24871 scheduler.cpp:479] Waiting for 12.81503ms before
>>>>> initiating a re-(connection) attempt with the master
>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926
>>>>> 18:11:37.231549 24868 scheduler.cpp:361] Connected with the master at
>>>>> http://10.163.25.45:5050/master/api/v1/scheduler
>>>>>
>>>>> On Tue, Sep 26, 2017 at 1:24 PM, Bill Farner <[email protected]> wrote:
>>>>>
>>>>>> Is there a reason a non-leading scheduler will talk to Mesos
>>>>>>
>>>>>> No, there is not a legitimate reason. Did this occur for an extended
>>>>>> period of time? Do you have logs from the scheduler indicating that
>>>>>> it lost ZK leadership and subsequently interacted with Mesos?
>>>>>>
>>>>>> On Tue, Sep 26, 2017 at 1:02 PM, Mohit Jaggi <[email protected]> wrote:
>>>>>>
>>>>>>> Fellows,
>>>>>>> While examining Aurora log files, I noticed a condition where a
>>>>>>> scheduler was talking to Mesos but was not showing up as a leader
>>>>>>> in ZooKeeper. It ultimately restarted itself and another scheduler
>>>>>>> became the leader.
>>>>>>> Is there a reason a non-leading scheduler would talk to Mesos?
>>>>>>>
>>>>>>> Mohit.
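Tying the log lines together: "Giving up on registration in (10, mins)"
is the delayed action being scheduled, and the DEAD transition ten
minutes later is it firing with registrationAcked still false. A
self-contained sketch of that pattern (illustrative; Aurora's
DefaultDelayedActions presumably differs in detail):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class RegistrationTimeout {
  public static void main(String[] args) {
    AtomicBoolean registrationAcked = new AtomicBoolean(false);
    ScheduledExecutorService executor =
        Executors.newSingleThreadScheduledExecutor();

    // Scheduled when the scheduler is elected leader ("Giving up on
    // registration in (10, mins)" at 18:11:37 in the logs).
    executor.schedule(
        () -> {
          if (!registrationAcked.get()) {
            // Fires ten minutes later (18:21:37) if nothing set the flag.
            System.out.println(
                "Framework has not been registered within the tolerated delay.");
            // The real handler then runs stateMachine.transition(State.DEAD).
          }
          executor.shutdown();
        },
        10, TimeUnit.MINUTES);
  }
}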
