FYI - I believe we stumbled on this issue, which is greatly exacerbated by commit f2755e1 <https://github.com/apache/aurora/commit/f2755e1cdd67f3c1516726c21d6e8f13059a5a01>. The good news is that we now have a good handle on the culprit! More details at https://issues.apache.org/jira/browse/AURORA-1953
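For context on the thread below: the mechanism at issue is the scheduler's registration-ack timeout. As Mohit describes further down, the Registered callback from Mesos is relayed over a Guava EventBus to a subscriber that sets the registrationAcked flag, and a delayed action shuts the scheduler down if that flag is still false after the tolerated delay. The following is a minimal, self-contained sketch of that pattern only; the class and event names (RegistrationAckSketch, DriverRegistered) are illustrative and the wiring is simplified, so treat it as an approximation rather than Aurora's actual code.

    import com.google.common.eventbus.EventBus;
    import com.google.common.eventbus.Subscribe;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicBoolean;

    // Illustrative sketch only (not Aurora's code): a "registered" callback is
    // relayed over a Guava EventBus, a subscriber flips a flag, and a delayed
    // action gives up if the flag was never set.
    public final class RegistrationAckSketch {

      // Hypothetical event, standing in for whatever the real callback handler
      // posts when it logs "Registered with ID value: ...".
      static final class DriverRegistered {
        final String frameworkId;
        DriverRegistered(String frameworkId) {
          this.frameworkId = frameworkId;
        }
      }

      final AtomicBoolean registrationAcked = new AtomicBoolean(false);
      final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

      @Subscribe
      public void onRegistered(DriverRegistered event) {
        // Only the registration ack flips the flag; merely being connected to
        // the leading master socket ("Connected with the master" in the
        // scheduler.cpp logs below) does not.
        registrationAcked.set(true);
        System.out.println("Registered with ID " + event.frameworkId);
      }

      void scheduleRegistrationTimeout(long timeout, TimeUnit unit) {
        timer.schedule(
            () -> {
              if (!registrationAcked.get()) {
                // Aurora transitions its lifecycle state machine to DEAD here.
                System.err.println(
                    "Framework has not been registered within the tolerated delay.");
              }
            },
            timeout,
            unit);
      }

      public static void main(String[] args) throws InterruptedException {
        EventBus eventBus = new EventBus();
        RegistrationAckSketch lifecycle = new RegistrationAckSketch();
        eventBus.register(lifecycle);

        // Short timeout for demonstration; the thread below shows a 10 minute delay.
        lifecycle.scheduleRegistrationTimeout(2, TimeUnit.SECONDS);

        // If this post is delayed, dropped, or made on a bus the subscriber never
        // registered with, the timeout fires even though Mesos acknowledged the
        // registration -- the symptom discussed in the thread.
        eventBus.post(new DriverRegistered("4ca9aa06-3214-4d2c-a678-0832e2f84d17-0000"));

        Thread.sleep(3_000);
        lifecycle.timer.shutdownNow();
      }
    }

In the thread, the ack log line ("Registered with ID value: ...") appears at 18:11:58 yet the timeout still fires at 18:21:37, which is consistent with the event never reaching the SchedulerLifecycle subscriber.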
On Thu, Sep 28, 2017 at 2:14 PM, Mohit Jaggi <[email protected]> wrote:

> Hmm... it is a very busy cluster and 10 mins of logs will be voluminous.
> They contain some internal details which I cannot share publicly. If you
> suspect specific areas, I can try to get those logs and remove internal
> info.
>
> Re: code, we have a fork which is very close to master.
>
> On Wed, Sep 27, 2017 at 10:03 PM, Bill Farner <[email protected]> wrote:
>
>> What commit/release was this with? From the looks of the log contents,
>> it's not master. I'd like to make sure I'm looking at the correct code.
>>
>> Are there more logs not being included? If so, can you share more
>> complete logs? In particular, logs during the 10 minute delay would be
>> particularly helpful.
>>
>> On Tue, Sep 26, 2017 at 11:51 PM, Mohit Jaggi <[email protected]> wrote:
>>
>>> Updating subject... as it looks like leader election was fine but
>>> registration ack did not make it to the SchedulerLifecycle code. Weird
>>> that an event would get lost like that.
>>>
>>> On Tue, Sep 26, 2017 at 4:21 PM, John Sirois <[email protected]> wrote:
>>>
>>>> On Tue, Sep 26, 2017 at 4:33 PM, Mohit Jaggi <[email protected]> wrote:
>>>>
>>>>> John,
>>>>> I was referring to the following log message... isn't that the right one?
>>>>
>>>> Aha - it is, apologies.
>>>>
>>>>> Sep 26 18:11:58 machine62 aurora-scheduler[24743]: I0926 18:11:58.795
>>>>> [Thread-814, MesosCallbackHandler$MesosCallbackHandlerImpl:180]
>>>>> Registered with ID value: "4ca9aa06-3214-4d2c-a678-0832e2f84d17-0000"
>>>>>
>>>>> On Tue, Sep 26, 2017 at 3:30 PM, John Sirois <[email protected]> wrote:
>>>>>
>>>>>> > ... but it succeeded (see logs at the end) ...
>>>>>>
>>>>>> The underlying C++ code in the scheduler driver successfully
>>>>>> connected to the leading master socket:
>>>>>>
>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926
>>>>>> 18:11:37.231549 24868 scheduler.cpp:361] Connected with the master
>>>>>> at http://10.163.25.45:5050/master/api/v1/scheduler
>>>>>>
>>>>>> This is not the same as a framework registration call being
>>>>>> successfully executed against the newly connected master.
>>>>>> You need to be careful about what you derive from the logs just based
>>>>>> on a reading of the words. Generally you'll need to look carefully /
>>>>>> grep the source code to be sure you are mentally modelling the code
>>>>>> flows correctly. It certainly gets tricky.
>>>>>>
>>>>>> On Tue, Sep 26, 2017 at 4:14 PM, Mohit Jaggi <[email protected]> wrote:
>>>>>>
>>>>>>> Hmm... so it is indeed the mesos registration, but it succeeded (see
>>>>>>> logs at the end). Looking at the code, I see that
>>>>>>> com.google.common.eventbus is used as the pub-sub mechanism to link
>>>>>>> the registration (first log message) to the registrationAcked flag,
>>>>>>> and this flag is not being set for 10 mins (?), otherwise the
>>>>>>> registration timeout handler would not print the second log message.
>>>>>>>
>>>>>>> delayedActions.onRegistrationTimeout(
>>>>>>>     () -> {
>>>>>>>       if (!registrationAcked.get()) {
>>>>>>>         LOG.error(
>>>>>>>             "Framework has not been registered within the tolerated delay.");
>>>>>>>         stateMachine.transition(State.DEAD);
>>>>>>>       }
>>>>>>>     });
>>>>>>>
>>>>>>> Sep 26 18:11:58 machine62 aurora-scheduler[24743]: I0926 18:11:58.795
>>>>>>> [Thread-814, MesosCallbackHandler$MesosCallbackHandlerImpl:180]
>>>>>>> Registered with ID value: "4ca9aa06-3214-4d2c-a678-0832e2f84d17-0000"
>>>>>>>
>>>>>>> ...
>>>>>>>
>>>>>>> Sep 26 18:21:37 machine62 aurora-scheduler[24743]: E0926 18:21:37.205
>>>>>>> [Lifecycle-0, SchedulerLifecycle$4:235] Framework has not been
>>>>>>> registered within the tolerated delay.
>>>>>>>
>>>>>>> On Tue, Sep 26, 2017 at 2:34 PM, John Sirois <[email protected]> wrote:
>>>>>>>
>>>>>>>> On Tue, Sep 26, 2017 at 3:31 PM, Mohit Jaggi <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Actually I wonder if the "register" here is for Aurora to register
>>>>>>>>> with mesos as a framework... I was assuming it referred to
>>>>>>>>> registering with ZK as a leader.
>>>>>>>>
>>>>>>>> Bingo.
>>>>>>>>
>>>>>>>>> On Tue, Sep 26, 2017 at 2:27 PM, Renan DelValle Rueda <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> I've had this, or something very similar, happen before. It's an
>>>>>>>>>> issue with Aurora and ZK. Election is based upon ZK, so if writing
>>>>>>>>>> down who the leader is to the ZK server path fails, or if ZK is
>>>>>>>>>> unable to reach quorum on the write, the election will fail.
>>>>>>>>>> Sometimes this might manifest itself in weird ways, such as two
>>>>>>>>>> Aurora schedulers believing they are leaders. If you could tell us
>>>>>>>>>> a little bit about your ZK setup we might be able to narrow down
>>>>>>>>>> the issue. Also, the Aurora version and whether you are using
>>>>>>>>>> Curator or the commons library will help as well.
>>>>>>>>>>
>>>>>>>>>> On Tue, Sep 26, 2017 at 2:02 PM, Mohit Jaggi <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hmm... it seems machine62 became a leader but could not "register"
>>>>>>>>>>> as leader. Not sure what that means. My naive assumption is that
>>>>>>>>>>> "becoming leader" and "registering as leader" are "atomic".
>>>>>>>>>>>
>>>>>>>>>>> ------- grep on SchedulerLifecycle -----
>>>>>>>>>>> aurora-scheduler.log:Sep 26 18:11:33 machine62 aurora-scheduler[24743]: I0926 18:11:33.158 [LeaderSelector-0, StateMachine$Builder:389] SchedulerLifecycle state machine transition STORAGE_PREPARED -> LEADER_AWAITING_REGISTRATION
>>>>>>>>>>> aurora-scheduler.log:Sep 26 18:11:33 machine62 aurora-scheduler[24743]: I0926 18:11:33.159 [LeaderSelector-0, SchedulerLifecycle$4:224] Elected as leading scheduler!
>>>>>>>>>>> aurora-scheduler.log:Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.204 [LeaderSelector-0, SchedulerLifecycle$DefaultDelayedActions:163] Giving up on registration in (10, mins)
>>>>>>>>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]: E0926 18:21:37.205 [Lifecycle-0, SchedulerLifecycle$4:235] Framework has not been registered within the tolerated delay.
>>>>>>>>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]: I0926 18:21:37.205 [Lifecycle-0, StateMachine$Builder:389] SchedulerLifecycle state machine transition LEADER_AWAITING_REGISTRATION -> DEAD
>>>>>>>>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]: I0926 18:21:37.215 [Lifecycle-0, StateMachine$Builder:389] SchedulerLifecycle state machine transition DEAD -> DEAD
>>>>>>>>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]: I0926 18:21:37.215 [Lifecycle-0, SchedulerLifecycle$6:275] Shutdown already invoked, ignoring extra call.
>>>>>>>>>>> aurora-scheduler.log:Sep 26 18:22:05 machine62 aurora-scheduler[54502]: I0926 18:22:05.681 [main, StateMachine$Builder:389] SchedulerLifecycle state machine transition IDLE -> PREPARING_STORAGE
>>>>>>>>>>> aurora-scheduler.log:Sep 26 18:22:06 machine62 aurora-scheduler[54502]: I0926 18:22:06.396 [main, StateMachine$Builder:389] SchedulerLifecycle state machine transition PREPARING_STORAGE -> STORAGE_PREPARED
>>>>>>>>>>>
>>>>>>>>>>> ------ connecting to mesos -----
>>>>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.211750 24871 group.cpp:757] Found non-sequence node 'log_replicas' at '/mesos' in ZooKeeper
>>>>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.211817 24871 detector.cpp:152] Detected a new leader: (id='1506')
>>>>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.211917 24871 group.cpp:699] Trying to get '/mesos/json.info_0000001506' in ZooKeeper
>>>>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.216063 24871 zookeeper.cpp:262] A new leading master (UPID=[email protected]:5050) is detected
>>>>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.216162 24871 scheduler.cpp:470] New master detected at [email protected]:5050
>>>>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.217772 24871 scheduler.cpp:479] Waiting for 12.81503ms before initiating a re-(connection) attempt with the master
>>>>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.231549 24868 scheduler.cpp:361] Connected with the master at http://10.163.25.45:5050/master/api/v1/scheduler
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Sep 26, 2017 at 1:24 PM, Bill Farner <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>>> Is there a reason a non-leading scheduler will talk to Mesos?
>>>>>>>>>>>>
>>>>>>>>>>>> No, there is not a legitimate reason. Did this occur for an
>>>>>>>>>>>> extended period of time? Do you have logs from the scheduler
>>>>>>>>>>>> indicating that it lost ZK leadership and subsequently interacted
>>>>>>>>>>>> with mesos?
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Sep 26, 2017 at 1:02 PM, Mohit Jaggi <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Fellows,
>>>>>>>>>>>>> While examining Aurora log files, I noticed a condition where a
>>>>>>>>>>>>> scheduler was talking to Mesos but it was not showing up as a
>>>>>>>>>>>>> leader in Zookeeper. It ultimately restarted itself and another
>>>>>>>>>>>>> scheduler became the leader.
>>>>>>>>>>>>> Is there a reason a non-leading scheduler will talk to Mesos?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Mohit.
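Returning to Renan's point about ZK-backed election: leadership is only granted once the candidate's znode is written to ZooKeeper and that write reaches quorum, which is why a failed or slow write can produce the odd symptoms he mentions. Below is a minimal, generic Apache Curator LeaderLatch sketch of that behavior, assuming a hypothetical ZooKeeper connect string and latch path; it is not Aurora's election code, which (as Renan notes) may go through either Curator or the commons library.

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.framework.recipes.leader.LeaderLatch;
    import org.apache.curator.framework.recipes.leader.LeaderLatchListener;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    // Generic Curator LeaderLatch sketch (not Aurora's election code). The
    // ensemble address and latch path are hypothetical.
    public final class LeaderElectionSketch {
      public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
            "zk1:2181,zk2:2181,zk3:2181",           // hypothetical ZK ensemble
            new ExponentialBackoffRetry(1000, 3));  // 1s base backoff, 3 retries
        client.start();

        // Leadership is granted only after the latch's ephemeral sequential
        // node is successfully written under this path; if that write fails
        // or cannot reach quorum, isLeader() is never invoked.
        LeaderLatch latch = new LeaderLatch(client, "/example/scheduler/leader");
        latch.addListener(new LeaderLatchListener() {
          @Override
          public void isLeader() {
            // Elected: only now should a scheduler start driving Mesos.
            System.out.println("Elected as leading scheduler!");
          }

          @Override
          public void notLeader() {
            // Leadership lost (e.g. ZK session expiry): stop driving Mesos
            // and typically exit so a healthy peer can take over.
            System.out.println("Lost leadership, shutting down.");
          }
        });

        latch.start();
        latch.await();  // Block until this instance wins the election.

        // ... do leader-only work here ...

        latch.close();
        client.close();
      }
    }

With a latch like this, becoming leader and any follow-on work (such as registering with Mesos as a framework) are separate steps, which matches the LEADER_AWAITING_REGISTRATION state visible in the grep output above.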
