John, I was referring to the following log message... isn't that the right one?

Sep 26 18:11:58 machine62 aurora-scheduler[24743]: I0926 18:11:58.795 [Thread-814, MesosCallbackHandler$MesosCallbackHandlerImpl:180] Registered with ID value: "4ca9aa06-3214-4d2c-a678-0832e2f84d17-0000"
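To make sure I'm reading the wiring correctly, here is a minimal sketch of the pub-sub link I described in my earlier message (quoted below). The class and event names are made up for illustration and this is not Aurora's actual code: a Guava EventBus subscriber flips an AtomicBoolean when a registration event is posted, and the delayed timeout check only complains if that flag was never set.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

import com.google.common.eventbus.EventBus;
import com.google.common.eventbus.Subscribe;

// Hypothetical event type standing in for whatever gets posted on the
// scheduler driver's "registered" callback.
final class RegisteredEvent {}

final class RegistrationWatcher {
  private final AtomicBoolean registrationAcked = new AtomicBoolean(false);

  // Subscriber: flips the flag when a registration event is posted on the bus.
  @Subscribe
  public void onRegistered(RegisteredEvent event) {
    registrationAcked.set(true);
  }

  // Timeout check: fires after the tolerated delay and only complains if no
  // registration event was ever seen.
  void scheduleTimeoutCheck(ScheduledExecutorService executor, long delay, TimeUnit unit) {
    executor.schedule(
        () -> {
          if (!registrationAcked.get()) {
            System.err.println(
                "Framework has not been registered within the tolerated delay.");
          }
        },
        delay,
        unit);
  }

  public static void main(String[] args) {
    EventBus eventBus = new EventBus();
    RegistrationWatcher watcher = new RegistrationWatcher();
    eventBus.register(watcher);

    ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
    // 10 seconds here stands in for the 10-minute tolerated delay in the logs.
    watcher.scheduleTimeoutCheck(executor, 10, TimeUnit.SECONDS);

    // Elsewhere, the "registered" callback would post the event on the bus:
    eventBus.post(new RegisteredEvent());

    executor.shutdown();
  }
}

In my case the "Registered with ID value" log line fired at 18:11:58, yet the timeout handler still fired at 18:21:37, which is what I'm trying to reconcile.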
On Tue, Sep 26, 2017 at 3:30 PM, John Sirois <[email protected]> wrote:

> > ... but it succeeded (see logs at the end) ...
>
> The underlying c++ code in the scheduler driver successfully connected to the leading master socket:
>
> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.231549 24868 scheduler.cpp:361] Connected with the master at http://10.163.25.45:5050/master/api/v1/scheduler
>
> This is not the same as a framework registration call being successfully executed against the newly connected master.
>
> You need to be careful about what you derive from the logs just based on a reading of the words. Generally you'll need to look carefully / grep source code to be sure you are mentally modelling the code flows correctly. It certainly gets tricky.
>
> On Tue, Sep 26, 2017 at 4:14 PM, Mohit Jaggi <[email protected]> wrote:
>
>> Hmm.... so it is indeed the mesos registration, but it succeeded (see logs at the end). Looking at the code, I see that com.google.common.eventbus is used as a pub-sub mechanism to link the registration (first log message) to the registrationAcked flag... and this flag is not being set for 10 mins (?), otherwise the registration timeout handler would not print the second log message.
>>
>> delayedActions.onRegistrationTimeout(
>>     () -> {
>>       if (!registrationAcked.get()) {
>>         LOG.error(
>>             "Framework has not been registered within the tolerated delay.");
>>         stateMachine.transition(State.DEAD);
>>       }
>>     });
>>
>> Sep 26 18:11:58 machine62 aurora-scheduler[24743]: I0926 18:11:58.795 [Thread-814, MesosCallbackHandler$MesosCallbackHandlerImpl:180] Registered with ID value: "4ca9aa06-3214-4d2c-a678-0832e2f84d17-0000"
>>
>> ...
>>
>> Sep 26 18:21:37 machine62 aurora-scheduler[24743]: E0926 18:21:37.205 [Lifecycle-0, SchedulerLifecycle$4:235] Framework has not been registered within the tolerated delay.
>>
>> On Tue, Sep 26, 2017 at 2:34 PM, John Sirois <[email protected]> wrote:
>>
>>> On Tue, Sep 26, 2017 at 3:31 PM, Mohit Jaggi <[email protected]> wrote:
>>>
>>>> Actually I wonder if the "register" here is for Aurora to register with mesos as a framework... I was assuming it referred to registering with ZK as a leader.
>>>
>>> Bingo.
>>>
>>>> On Tue, Sep 26, 2017 at 2:27 PM, Renan DelValle Rueda <[email protected]> wrote:
>>>>
>>>>> I've had this, or something very similar, happen before. It's an issue with Aurora and ZK. Election is based upon ZK, so if writing down who the leader is to the ZK server path fails, or if ZK is unable to reach quorum on the write, the election will fail. Sometimes this might manifest itself in weird ways, such as two Aurora schedulers believing they are leaders. If you could tell us a little bit about your ZK setup we might be able to narrow down the issue. Also, the Aurora version and whether you are using Curator or the commons library will help as well.
>>>>>
>>>>> On Tue, Sep 26, 2017 at 2:02 PM, Mohit Jaggi <[email protected]> wrote:
>>>>>
>>>>>> Hmm.. it seems machine62 became a leader but could not "register" as leader. Not sure what that means. My naive assumption is that "becoming leader" and "registering as leader" are "atomic".
>>>>>>
>>>>>> ------- grep on SchedulerLifecycle -----
>>>>>> aurora-scheduler.log:Sep 26 18:11:33 machine62 aurora-scheduler[24743]: I0926 18:11:33.158 [LeaderSelector-0, StateMachine$Builder:389] SchedulerLifecycle state machine transition STORAGE_PREPARED -> LEADER_AWAITING_REGISTRATION
>>>>>> aurora-scheduler.log:Sep 26 18:11:33 machine62 aurora-scheduler[24743]: I0926 18:11:33.159 [LeaderSelector-0, SchedulerLifecycle$4:224] Elected as leading scheduler!
>>>>>> aurora-scheduler.log:Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.204 [LeaderSelector-0, SchedulerLifecycle$DefaultDelayedActions:163] Giving up on registration in (10, mins)
>>>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]: E0926 18:21:37.205 [Lifecycle-0, SchedulerLifecycle$4:235] Framework has not been registered within the tolerated delay.
>>>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]: I0926 18:21:37.205 [Lifecycle-0, StateMachine$Builder:389] SchedulerLifecycle state machine transition LEADER_AWAITING_REGISTRATION -> DEAD
>>>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]: I0926 18:21:37.215 [Lifecycle-0, StateMachine$Builder:389] SchedulerLifecycle state machine transition DEAD -> DEAD
>>>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]: I0926 18:21:37.215 [Lifecycle-0, SchedulerLifecycle$6:275] Shutdown already invoked, ignoring extra call.
>>>>>> aurora-scheduler.log:Sep 26 18:22:05 machine62 aurora-scheduler[54502]: I0926 18:22:05.681 [main, StateMachine$Builder:389] SchedulerLifecycle state machine transition IDLE -> PREPARING_STORAGE
>>>>>> aurora-scheduler.log:Sep 26 18:22:06 machine62 aurora-scheduler[54502]: I0926 18:22:06.396 [main, StateMachine$Builder:389] SchedulerLifecycle state machine transition PREPARING_STORAGE -> STORAGE_PREPARED
>>>>>>
>>>>>> ------ connecting to mesos -----
>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.211750 24871 group.cpp:757] Found non-sequence node 'log_replicas' at '/mesos' in ZooKeeper
>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.211817 24871 detector.cpp:152] Detected a new leader: (id='1506')
>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.211917 24871 group.cpp:699] Trying to get '/mesos/json.info_0000001506' in ZooKeeper
>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.216063 24871 zookeeper.cpp:262] A new leading master (UPID=[email protected]:5050) is detected
>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.216162 24871 scheduler.cpp:470] New master detected at [email protected]:5050
>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.217772 24871 scheduler.cpp:479] Waiting for 12.81503ms before initiating a re-(connection) attempt with the master
>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.231549 24868 scheduler.cpp:361] Connected with the master at http://10.163.25.45:5050/master/api/v1/scheduler
>>>>>>
>>>>>> On Tue, Sep 26, 2017 at 1:24 PM, Bill Farner <[email protected]> wrote:
>>>>>>
>>>>>>> Is there a reason a non-leading scheduler will talk to Mesos
>>>>>>>
>>>>>>> No, there is not a legitimate reason. Did this occur for an extended period of time? Do you have logs from the scheduler indicating that it lost ZK leadership and subsequently interacted with mesos?
>>>>>>>
>>>>>>> On Tue, Sep 26, 2017 at 1:02 PM, Mohit Jaggi <[email protected]> wrote:
>>>>>>>
>>>>>>>> Fellows,
>>>>>>>> While examining Aurora log files, I noticed a condition where a scheduler was talking to Mesos but it was not showing up as a leader in Zookeeper. It ultimately restarted itself and another scheduler became the leader.
>>>>>>>> Is there a reason a non-leading scheduler will talk to Mesos?
>>>>>>>>
>>>>>>>> Mohit.
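P.S. On Renan's question above about Curator vs the commons library: for anyone comparing notes on the ZK side, here is a minimal sketch of Curator-style leader election. It is illustrative only, not Aurora's actual wiring; the connect string, latch path, and participant id are made up. It also shows the distinction John points out: winning the ZK election happens entirely in ZooKeeper, while framework registration is a separate exchange with the Mesos master.

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;

public final class ElectionSketch {
  public static void main(String[] args) throws Exception {
    // Connect string and latch path are made up for illustration.
    CuratorFramework client = CuratorFrameworkFactory.newClient(
        "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3));
    client.start();

    // Every scheduler instance creates a latch on the same path; ZK decides
    // which instance holds the lowest ephemeral sequential node and leads.
    LeaderLatch latch = new LeaderLatch(client, "/illustration/scheduler-election", "machine62");
    latch.start();

    // Blocks until this instance wins the election. If the write to ZK fails
    // or quorum is lost, leadership is never acquired (or is later revoked).
    latch.await();
    System.out.println("Elected as leading scheduler: " + latch.hasLeadership());

    // Registering the framework with the Mesos master happens after this
    // point and is a separate step that can still time out.

    latch.close();
    client.close();
  }
}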
