Updating subject...as it looks like leader election was fine but the
registration ack did not make it to the SchedulerLifecycle code. Weird that
an event would get lost like that.
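
For my own understanding, here is a tiny, self-contained example of one way a
Guava EventBus event can silently go missing: if post() runs before the
handler has been register()ed, the event is only seen as a DeadEvent. Names
below are made up for illustration; this is not the Aurora code, and I'm not
claiming this is what happened here.

import com.google.common.eventbus.DeadEvent;
import com.google.common.eventbus.EventBus;
import com.google.common.eventbus.Subscribe;

public class LostEventSketch {
  // Stand-in for the registration-ack event; not an Aurora class.
  static final class Registered {}

  public static void main(String[] args) {
    EventBus bus = new EventBus();

    // DeadEvent catches anything posted with no matching subscriber.
    bus.register(new Object() {
      @Subscribe
      public void onDead(DeadEvent e) {
        System.out.println("dropped: " + e.getEvent());
      }
    });

    // Posted before the real subscriber exists -> only the DeadEvent
    // handler above ever sees it.
    bus.post(new Registered());

    bus.register(new Object() {
      @Subscribe
      public void onRegistered(Registered e) {
        System.out.println("acked");  // never printed for the post above
      }
    });
  }
}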

On Tue, Sep 26, 2017 at 4:21 PM, John Sirois <[email protected]> wrote:

>
>
> On Tue, Sep 26, 2017 at 4:33 PM, Mohit Jaggi <[email protected]> wrote:
>
>> John,
>> I was referring to the following log message...isn't that the right one?
>>
>
> Aha - it is, apologies.
>
>
>> Sep 26 18:11:58 machine62 aurora-scheduler[24743]: I0926 18:11:58.795
>>  [Thread-814, MesosCallbackHandler$MesosCallbackHandlerImpl:180]
>> Registered with ID value: "4ca9aa06-3214-4d2c-a678-0832e2f84d17-0000"
>>
>> On Tue, Sep 26, 2017 at 3:30 PM, John Sirois <[email protected]>
>> wrote:
>>
>>> > ... but it succeeded(see logs at the end) ...
>>>
>>> The underlying c++ code in the scheduler driver successfully connected
>>> to the leading master socket:
>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.231549 24868
>>> scheduler.cpp:361] Connected with the master at
>>> http://10.163.25.45:5050/master/api/v1/scheduler
>>>
>>> This is not the same as a framework registration call being successfully
>>> executed against the newly connected master.
>>> You need to be careful about what you derive from the logs based on a
>>> reading of the words alone. Generally you'll need to look carefully / grep
>>> the source code to be sure you are mentally modelling the code flows
>>> correctly. It certainly gets tricky.
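>>>
>>> Roughly speaking, these show up as two distinct callbacks on the scheduler
>>> side. A sketch with made-up names (not the real driver interface), just to
>>> show the distinction:
>>>
>>> interface SchedulerCallbacksSketch {
>>>   // Fires when the HTTP connection to the leading master is established.
>>>   // This is what "Connected with the master at ..." corresponds to.
>>>   void connected();
>>>
>>>   // Fires only after the master acks the SUBSCRIBE call and hands back a
>>>   // framework ID. This is the ack SchedulerLifecycle waits on while in
>>>   // LEADER_AWAITING_REGISTRATION.
>>>   void registered(String frameworkId);
>>> }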
>>>
>>> On Tue, Sep 26, 2017 at 4:14 PM, Mohit Jaggi <[email protected]>
>>> wrote:
>>>
>>>> Hmm...so it is indeed the Mesos registration, but it succeeded (see
>>>> logs at the end). Looking at the code, I see that
>>>> com.google.common.eventbus is used as the pub-sub mechanism to link the
>>>> registration (first log message) to the registrationAcked flag...and
>>>> this flag is apparently not set for 10 mins (?), otherwise the
>>>> registration timeout handler would not print the second log message.
>>>>
>>>> delayedActions.onRegistrationTimeout(
>>>>     () -> {
>>>>       if (!registrationAcked.get()) {
>>>>         LOG.error(
>>>>             "Framework has not been registered within the tolerated delay.");
>>>>         stateMachine.transition(State.DEAD);
>>>>       }
>>>>     });
>>>>
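>>>> The ack side of that race looks roughly like this, as far as I can tell
>>>> (shape only -- the event type and handler name are paraphrased, not copied
>>>> from the Aurora source):
>>>>
>>>> @Subscribe
>>>> public void registered(DriverRegistered event) {
>>>>   // Delivered over the Guava EventBus by the Mesos callback handler; if
>>>>   // this never runs, the timeout above fires and we transition to DEAD.
>>>>   registrationAcked.set(true);
>>>> }
>>>>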
>>>> Sep 26 18:11:58 machine62 aurora-scheduler[24743]: I0926 18:11:58.795
>>>> [Thread-814, MesosCallbackHandler$MesosCallbackHandlerImpl:180]
>>>> Registered with ID value: "4ca9aa06-3214-4d2c-a678-0832e2f84d17-0000"
>>>>
>>>> ...
>>>>
>>>> ...
>>>>
>>>> Sep 26 18:21:37 machine62 aurora-scheduler[24743]: E0926 18:21:37.205
>>>> [Lifecycle-0, SchedulerLifecycle$4:235] Framework has not been registered
>>>> within the tolerated delay.
>>>>
>>>> On Tue, Sep 26, 2017 at 2:34 PM, John Sirois <[email protected]>
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Tue, Sep 26, 2017 at 3:31 PM, Mohit Jaggi <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Actually I wonder if the "register" here is for Aurora to register
>>>>>> with Mesos as a framework...I was assuming it referred to registering
>>>>>> with ZK as a leader.
>>>>>>
>>>>>
>>>>> Bingo.
>>>>>
>>>>>
>>>>>>
>>>>>> On Tue, Sep 26, 2017 at 2:27 PM, Renan DelValle Rueda <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> I've had this, or something very similar, happen before. It's an
>>>>>>> issue with Aurora and ZK. Election is based upon ZK, so if writing down
>>>>>>> who the leader is to the ZK server path fails, or if ZK is unable to
>>>>>>> reach quorum on the write, the election will fail. Sometimes this might
>>>>>>> manifest itself in weird ways, such as two Aurora schedulers believing
>>>>>>> they are leaders. If you could tell us a little bit about your ZK setup
>>>>>>> we might be able to narrow down the issue. Also, knowing the Aurora
>>>>>>> version and whether you are using Curator or the commons library will
>>>>>>> help as well.
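>>>>>>>
>>>>>>> For reference, if you're on the Curator path, the election essentially
>>>>>>> comes down to Curator's LeaderLatch recipe, roughly like the sketch below
>>>>>>> (the ensemble string, latch path, and participant ID are placeholders,
>>>>>>> not Aurora's actual values):
>>>>>>>
>>>>>>> import org.apache.curator.framework.CuratorFramework;
>>>>>>> import org.apache.curator.framework.CuratorFrameworkFactory;
>>>>>>> import org.apache.curator.framework.recipes.leader.LeaderLatch;
>>>>>>> import org.apache.curator.retry.ExponentialBackoffRetry;
>>>>>>>
>>>>>>> public class ElectionSketch {
>>>>>>>   public static void main(String[] args) throws Exception {
>>>>>>>     CuratorFramework client = CuratorFrameworkFactory.newClient(
>>>>>>>         "zk1:2181,zk2:2181,zk3:2181",        // placeholder ensemble
>>>>>>>         new ExponentialBackoffRetry(1000, 3));
>>>>>>>     client.start();
>>>>>>>
>>>>>>>     // Each candidate creates an ephemeral sequential node under the
>>>>>>>     // latch path; the lowest sequence number is the leader. That create
>>>>>>>     // is the write that needs ZK quorum -- if it fails or is ambiguous,
>>>>>>>     // the election misbehaves in the ways described above.
>>>>>>>     LeaderLatch latch = new LeaderLatch(client, "/aurora/leader", "scheduler-1");
>>>>>>>     latch.start();
>>>>>>>     latch.await();                           // blocks until leadership
>>>>>>>     System.out.println("leading: " + latch.hasLeadership());
>>>>>>>   }
>>>>>>> }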
>>>>>>>
>>>>>>> On Tue, Sep 26, 2017 at 2:02 PM, Mohit Jaggi <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hmm..it seems machine62 became a leader but could not "register" as
>>>>>>>> leader. Not sure what that means. My naive assumption is that "becoming
>>>>>>>> leader" and "registering as leader" are "atomic".
>>>>>>>>
>>>>>>>> ------- grep on SchedulerLifecycle -----
>>>>>>>> aurora-scheduler.log:Sep 26 18:11:33 machine62
>>>>>>>> aurora-scheduler[24743]: I0926 18:11:33.158 [LeaderSelector-0,
>>>>>>>> StateMachine$Builder:389] SchedulerLifecycle state machine transition
>>>>>>>> STORAGE_PREPARED -> LEADER_AWAITING_REGISTRATION
>>>>>>>> aurora-scheduler.log:Sep 26 18:11:33 machine62
>>>>>>>> aurora-scheduler[24743]: I0926 18:11:33.159 [LeaderSelector-0,
>>>>>>>> SchedulerLifecycle$4:224] Elected as leading scheduler!
>>>>>>>> aurora-scheduler.log:Sep 26 18:11:37 machine62
>>>>>>>> aurora-scheduler[24743]: I0926 18:11:37.204 [LeaderSelector-0,
>>>>>>>> SchedulerLifecycle$DefaultDelayedActions:163] Giving up on
>>>>>>>> registration in (10, mins)
>>>>>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62
>>>>>>>> aurora-scheduler[24743]: E0926 18:21:37.205 [Lifecycle-0,
>>>>>>>> SchedulerLifecycle$4:235] Framework has not been registered within the
>>>>>>>> tolerated delay.
>>>>>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62
>>>>>>>> aurora-scheduler[24743]: I0926 18:21:37.205 [Lifecycle-0,
>>>>>>>> StateMachine$Builder:389] SchedulerLifecycle state machine transition
>>>>>>>> LEADER_AWAITING_REGISTRATION -> DEAD
>>>>>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62
>>>>>>>> aurora-scheduler[24743]: I0926 18:21:37.215 [Lifecycle-0,
>>>>>>>> StateMachine$Builder:389] SchedulerLifecycle state machine transition
>>>>>>>> DEAD -> DEAD
>>>>>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62
>>>>>>>> aurora-scheduler[24743]: I0926 18:21:37.215 [Lifecycle-0,
>>>>>>>> SchedulerLifecycle$6:275] Shutdown already invoked, ignoring extra call.
>>>>>>>> aurora-scheduler.log:Sep 26 18:22:05 machine62
>>>>>>>> aurora-scheduler[54502]: I0926 18:22:05.681 [main,
>>>>>>>> StateMachine$Builder:389] SchedulerLifecycle state machine transition
>>>>>>>> IDLE -> PREPARING_STORAGE
>>>>>>>> aurora-scheduler.log:Sep 26 18:22:06 machine62
>>>>>>>> aurora-scheduler[54502]: I0926 18:22:06.396 [main,
>>>>>>>> StateMachine$Builder:389] SchedulerLifecycle state machine transition
>>>>>>>> PREPARING_STORAGE -> STORAGE_PREPARED
>>>>>>>>
>>>>>>>>
>>>>>>>> ------ connecting to mesos -----
>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926
>>>>>>>> 18:11:37.211750 24871 group.cpp:757] Found non-sequence node
>>>>>>>> 'log_replicas' at '/mesos' in ZooKeeper
>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926
>>>>>>>> 18:11:37.211817 24871 detector.cpp:152] Detected a new leader: (id='1506')
>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926
>>>>>>>> 18:11:37.211917 24871 group.cpp:699] Trying to get
>>>>>>>> '/mesos/json.info_0000001506' in ZooKeeper
>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926
>>>>>>>> 18:11:37.216063 24871 zookeeper.cpp:262] A new leading master (UPID=
>>>>>>>> [email protected]:5050) is detected
>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926
>>>>>>>> 18:11:37.216162 24871 scheduler.cpp:470] New master detected at
>>>>>>>> [email protected]:5050
>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926
>>>>>>>> 18:11:37.217772 24871 scheduler.cpp:479] Waiting for 12.81503ms before
>>>>>>>> initiating a re-(connection) attempt with the master
>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926
>>>>>>>> 18:11:37.231549 24868 scheduler.cpp:361] Connected with the master at
>>>>>>>> http://10.163.25.45:5050/master/api/v1/scheduler
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Sep 26, 2017 at 1:24 PM, Bill Farner <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Is there a reason a non-leading scheduler will talk to Mesos
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> No, there is not a legitimate reason.  Did this occur for an
>>>>>>>>> extended period of time?  Do you have logs from the scheduler
>>>>>>>>> indicating that it lost ZK leadership and subsequently interacted
>>>>>>>>> with Mesos?
>>>>>>>>>
>>>>>>>>> On Tue, Sep 26, 2017 at 1:02 PM, Mohit Jaggi <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Fellows,
>>>>>>>>>> While examining Aurora log files, I noticed a condition where a
>>>>>>>>>> scheduler was talking to Mesos but it was not showing up as a leader
>>>>>>>>>> in Zookeeper. It ultimately restarted itself and another scheduler
>>>>>>>>>> became the leader.
>>>>>>>>>> Is there a reason a non-leading scheduler will talk to Mesos?
>>>>>>>>>>
>>>>>>>>>> Mohit.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
