Hmm... it is a very busy cluster, and 10 minutes of logs will be voluminous.
They also contain internal details that I cannot share publicly. If you
suspect specific areas, I can try to pull those logs and strip out the
internal info.

Re: code, we have a fork which is very close to master.

On Wed, Sep 27, 2017 at 10:03 PM, Bill Farner <[email protected]> wrote:

> What commit/release was this with?  From the looks of the log contents,
> it's not master.  I'd like to make sure i'm looking at the correct code.
>
> Are there more logs not being included?  If so, can you share more
> complete logs?  In particular, logs during the 10 minute delay would be
> particularly helpful.
>
> On Tue, Sep 26, 2017 at 11:51 PM, Mohit Jaggi <[email protected]>
> wrote:
>
>> Updating subject... it looks like leader election was fine, but the
>> registration ack did not make it to the SchedulerLifecycle code. Weird that
>> an event would get lost like that.
>>
>> On Tue, Sep 26, 2017 at 4:21 PM, John Sirois <[email protected]>
>> wrote:
>>
>>>
>>>
>>> On Tue, Sep 26, 2017 at 4:33 PM, Mohit Jaggi <[email protected]>
>>> wrote:
>>>
>>>> John,
>>>> I was referring to the following log message...isn't that the right one?
>>>>
>>>
>>> Aha - it is, apologies.
>>>
>>>
>>>> Sep 26 18:11:58 machine62 aurora-scheduler[24743]: I0926 18:11:58.795
>>>>  [Thread-814, MesosCallbackHandler$MesosCallbackHandlerImpl:180]
>>>> Registered with ID value: "4ca9aa06-3214-4d2c-a678-0832e2f84d17-0000"
>>>>
>>>> On Tue, Sep 26, 2017 at 3:30 PM, John Sirois <[email protected]>
>>>> wrote:
>>>>
>>>>> > ... but it succeeded (see logs at the end) ...
>>>>>
>>>>> The underlying C++ code in the scheduler driver successfully connected
>>>>> to the leading master socket:
>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926
>>>>> 18:11:37.231549 24868 scheduler.cpp:361] Connected with the master at
>>>>> http://10.163.25.45:5050/master/api/v1/scheduler
>>>>>
>>>>> This is not the same as a framework registration call being successfully
>>>>> executed against the newly connected master.
>>>>> You need to be careful about what you derive from the logs based only on
>>>>> a reading of the words. Generally you'll need to look carefully at / grep
>>>>> the source code to be sure you are mentally modelling the code flows correctly.
>>>>> It certainly gets tricky.
>>>>>
>>>>> On Tue, Sep 26, 2017 at 4:14 PM, Mohit Jaggi <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hmm... so it is indeed the Mesos registration, but it succeeded (see
>>>>>> logs at the end). Looking at the code, I see that
>>>>>> com.google.common.eventbus is used as a pub-sub mechanism to link the
>>>>>> registration (first log message) to the registrationAcked flag... and
>>>>>> this flag is not being set for 10 mins (?), otherwise the registration
>>>>>> timeout handler would not print the second log message.
>>>>>>
>>>>>> delayedActions.onRegistrationTimeout(
>>>>>>     () -> {
>>>>>>       if (!registrationAcked.get()) {
>>>>>>         LOG.error(
>>>>>>             "Framework has not been registered within the tolerated delay.");
>>>>>>         stateMachine.transition(State.DEAD);
>>>>>>       }
>>>>>>     });
>>>>>>
>>>>>> Sep 26 18:11:58 machine62 aurora-scheduler[24743]: I0926 18:11:58.795
>>>>>> [Thread-814, MesosCallbackHandler$MesosCallbackHandlerImpl:180]
>>>>>> Registered with ID value: "4ca9aa06-3214-4d2c-a678-0832e2f84d17-0000"
>>>>>>
>>>>>> ...
>>>>>>
>>>>>> ...
>>>>>>
>>>>>> Sep 26 18:21:37 machine62 aurora-scheduler[24743]: E0926
>>>>>> 18:21:37.205 [Lifecycle-0, SchedulerLifecycle$4:235] Framework has not 
>>>>>> been
>>>>>> registered within the tolerated delay.
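>>>>>>
>>>>>> For anyone following along, below is a minimal, self-contained sketch of
>>>>>> the Guava EventBus pattern described above: a subscriber flips an
>>>>>> AtomicBoolean when a registration event is posted, and a delayed check
>>>>>> logs the error if the flag was never set. The DriverRegistered event
>>>>>> class, the listener, and the 2-second delay are my own placeholders for
>>>>>> illustration; they are not Aurora's actual classes or timings.
>>>>>>
>>>>>> import com.google.common.eventbus.EventBus;
>>>>>> import com.google.common.eventbus.Subscribe;
>>>>>>
>>>>>> import java.util.concurrent.Executors;
>>>>>> import java.util.concurrent.ScheduledExecutorService;
>>>>>> import java.util.concurrent.TimeUnit;
>>>>>> import java.util.concurrent.atomic.AtomicBoolean;
>>>>>>
>>>>>> public class RegistrationAckSketch {
>>>>>>   // Hypothetical event type standing in for the registration callback.
>>>>>>   public static final class DriverRegistered {}
>>>>>>
>>>>>>   // Subscriber that records the ack; EventBus finds it via @Subscribe.
>>>>>>   public static final class AckListener {
>>>>>>     final AtomicBoolean acked;
>>>>>>     AckListener(AtomicBoolean acked) { this.acked = acked; }
>>>>>>
>>>>>>     @Subscribe
>>>>>>     public void onRegistered(DriverRegistered event) {
>>>>>>       acked.set(true);
>>>>>>     }
>>>>>>   }
>>>>>>
>>>>>>   public static void main(String[] args) throws Exception {
>>>>>>     EventBus eventBus = new EventBus();
>>>>>>     AtomicBoolean registrationAcked = new AtomicBoolean(false);
>>>>>>     eventBus.register(new AckListener(registrationAcked));
>>>>>>
>>>>>>     // Delayed check, analogous to onRegistrationTimeout above
>>>>>>     // (2 seconds here instead of 10 minutes, purely for the sketch).
>>>>>>     ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
>>>>>>     timer.schedule(() -> {
>>>>>>       if (!registrationAcked.get()) {
>>>>>>         System.err.println(
>>>>>>             "Framework has not been registered within the tolerated delay.");
>>>>>>       } else {
>>>>>>         System.out.println("Registration ack observed in time.");
>>>>>>       }
>>>>>>     }, 2, TimeUnit.SECONDS);
>>>>>>
>>>>>>     // Comment this out to simulate the ack event never reaching the bus.
>>>>>>     eventBus.post(new DriverRegistered());
>>>>>>
>>>>>>     timer.shutdown();
>>>>>>     timer.awaitTermination(5, TimeUnit.SECONDS);
>>>>>>   }
>>>>>> }
>>>>>>
>>>>>> Run as-is it prints the success message; dropping the post() call
>>>>>> reproduces the "tolerated delay" error path, which is what the logs above
>>>>>> suggest happened here.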
>>>>>>
>>>>>> On Tue, Sep 26, 2017 at 2:34 PM, John Sirois <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Sep 26, 2017 at 3:31 PM, Mohit Jaggi <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Actually I wonder if the "register" here is for Aurora to register
>>>>>>>> with Mesos as a framework... I was assuming it referred to registering
>>>>>>>> with ZK as a leader.
>>>>>>>>
>>>>>>>
>>>>>>> Bingo.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Sep 26, 2017 at 2:27 PM, Renan DelValle Rueda <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> I've had this, or something very similar, happen before. It's an
>>>>>>>>> issue with Aurora and ZK. Election is based upon ZK, so if writing down
>>>>>>>>> who the leader is to the ZK server path fails, or if ZK is unable to
>>>>>>>>> reach quorum on the write, the election will fail. Sometimes this might
>>>>>>>>> manifest itself in weird ways, such as two Aurora schedulers believing
>>>>>>>>> they are leaders. If you could tell us a little bit about your ZK setup
>>>>>>>>> we might be able to narrow down the issue. Also, the Aurora version and
>>>>>>>>> whether you are using Curator or the commons library will help as well.
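>>>>>>>>>
>>>>>>>>> To make the ZK side concrete, here is a minimal sketch of leader
>>>>>>>>> election using Curator's LeaderLatch recipe (one of the two client
>>>>>>>>> libraries mentioned above). The connect string, latch path, and
>>>>>>>>> instance ID are placeholders for illustration; this is only a sketch
>>>>>>>>> of the recipe, not Aurora's actual election code.
>>>>>>>>>
>>>>>>>>> import org.apache.curator.framework.CuratorFramework;
>>>>>>>>> import org.apache.curator.framework.CuratorFrameworkFactory;
>>>>>>>>> import org.apache.curator.framework.recipes.leader.LeaderLatch;
>>>>>>>>> import org.apache.curator.retry.ExponentialBackoffRetry;
>>>>>>>>>
>>>>>>>>> public class LeaderElectionSketch {
>>>>>>>>>   public static void main(String[] args) throws Exception {
>>>>>>>>>     // Placeholder ensemble; a real scheduler points at its ZK quorum.
>>>>>>>>>     CuratorFramework client = CuratorFrameworkFactory.newClient(
>>>>>>>>>         "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3));
>>>>>>>>>     client.start();
>>>>>>>>>
>>>>>>>>>     // The latch creates an ephemeral sequential node under the latch
>>>>>>>>>     // path; if that write fails or ZK cannot reach quorum, leadership
>>>>>>>>>     // is never acquired, matching the failure mode described above.
>>>>>>>>>     LeaderLatch latch =
>>>>>>>>>         new LeaderLatch(client, "/aurora/scheduler/leader", "scheduler-1");
>>>>>>>>>     latch.start();
>>>>>>>>>
>>>>>>>>>     latch.await();  // Blocks until this instance becomes the leader.
>>>>>>>>>     if (latch.hasLeadership()) {
>>>>>>>>>       System.out.println(
>>>>>>>>>           "Elected as leading scheduler; next step is registering with Mesos.");
>>>>>>>>>     }
>>>>>>>>>
>>>>>>>>>     latch.close();
>>>>>>>>>     client.close();
>>>>>>>>>   }
>>>>>>>>> }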
>>>>>>>>>
>>>>>>>>> On Tue, Sep 26, 2017 at 2:02 PM, Mohit Jaggi <[email protected]
>>>>>>>>> > wrote:
>>>>>>>>>
>>>>>>>>>> Hmm... it seems machine62 became a leader but could not "register"
>>>>>>>>>> as leader. Not sure what that means. My naive assumption is that
>>>>>>>>>> "becoming leader" and "registering as leader" are "atomic".
>>>>>>>>>>
>>>>>>>>>> ------- grep on SchedulerLifecycle -----
>>>>>>>>>> aurora-scheduler.log:Sep 26 18:11:33 machine62
>>>>>>>>>> aurora-scheduler[24743]: I0926 18:11:33.158 [LeaderSelector-0,
>>>>>>>>>> StateMachine$Builder:389] SchedulerLifecycle state machine transition
>>>>>>>>>> STORAGE_PREPARED -> LEADER_AWAITING_REGISTRATION
>>>>>>>>>> aurora-scheduler.log:Sep 26 18:11:33 machine62
>>>>>>>>>> aurora-scheduler[24743]: I0926 18:11:33.159 [LeaderSelector-0,
>>>>>>>>>> SchedulerLifecycle$4:224] Elected as leading scheduler!
>>>>>>>>>> aurora-scheduler.log:Sep 26 18:11:37 machine62
>>>>>>>>>> aurora-scheduler[24743]: I0926 18:11:37.204 [LeaderSelector-0,
>>>>>>>>>> SchedulerLifecycle$DefaultDelayedActions:163] Giving up on
>>>>>>>>>> registration in (10, mins)
>>>>>>>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62
>>>>>>>>>> aurora-scheduler[24743]: E0926 18:21:37.205 [Lifecycle-0,
>>>>>>>>>> SchedulerLifecycle$4:235] Framework has not been registered within 
>>>>>>>>>> the
>>>>>>>>>> tolerated delay.
>>>>>>>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62
>>>>>>>>>> aurora-scheduler[24743]: I0926 18:21:37.205 [Lifecycle-0,
>>>>>>>>>> StateMachine$Builder:389] SchedulerLifecycle state machine transition
>>>>>>>>>> LEADER_AWAITING_REGISTRATION -> DEAD
>>>>>>>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62
>>>>>>>>>> aurora-scheduler[24743]: I0926 18:21:37.215 [Lifecycle-0,
>>>>>>>>>> StateMachine$Builder:389] SchedulerLifecycle state machine 
>>>>>>>>>> transition DEAD
>>>>>>>>>> -> DEAD
>>>>>>>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62
>>>>>>>>>> aurora-scheduler[24743]: I0926 18:21:37.215 [Lifecycle-0,
>>>>>>>>>> SchedulerLifecycle$6:275] Shutdown already invoked, ignoring extra 
>>>>>>>>>> call.
>>>>>>>>>> aurora-scheduler.log:Sep 26 18:22:05 machine62
>>>>>>>>>> aurora-scheduler[54502]: I0926 18:22:05.681 [main,
>>>>>>>>>> StateMachine$Builder:389] SchedulerLifecycle state machine 
>>>>>>>>>> transition IDLE
>>>>>>>>>> -> PREPARING_STORAGE
>>>>>>>>>> aurora-scheduler.log:Sep 26 18:22:06 machine62
>>>>>>>>>> aurora-scheduler[54502]: I0926 18:22:06.396 [main,
>>>>>>>>>> StateMachine$Builder:389] SchedulerLifecycle state machine transition
>>>>>>>>>> PREPARING_STORAGE -> STORAGE_PREPARED
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> ------ connecting to mesos -----
>>>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926
>>>>>>>>>> 18:11:37.211750 24871 group.cpp:757] Found non-sequence node 
>>>>>>>>>> 'log_replicas'
>>>>>>>>>> at '/mesos' in ZooKeeper
>>>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926
>>>>>>>>>> 18:11:37.211817 24871 detector.cpp:152] Detected a new leader: 
>>>>>>>>>> (id='1506')
>>>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926
>>>>>>>>>> 18:11:37.211917 24871 group.cpp:699] Trying to get
>>>>>>>>>> '/mesos/json.info_0000001506' in ZooKeeper
>>>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926
>>>>>>>>>> 18:11:37.216063 24871 zookeeper.cpp:262] A new leading master (UPID=
>>>>>>>>>> [email protected]:5050) is detected
>>>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926
>>>>>>>>>> 18:11:37.216162 24871 scheduler.cpp:470] New master detected at
>>>>>>>>>> [email protected]:5050
>>>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926
>>>>>>>>>> 18:11:37.217772 24871 scheduler.cpp:479] Waiting for 12.81503ms 
>>>>>>>>>> before
>>>>>>>>>> initiating a re-(connection) attempt with the master
>>>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926
>>>>>>>>>> 18:11:37.231549 24868 scheduler.cpp:361] Connected with the master at
>>>>>>>>>> http://10.163.25.45:5050/master/api/v1/scheduler
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Sep 26, 2017 at 1:24 PM, Bill Farner <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Is there a reason a non-leading scheduler will talk to Mesos
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> No, there is not a legitimate reason. Did this occur for an
>>>>>>>>>>> extended period of time? Do you have logs from the scheduler
>>>>>>>>>>> indicating that it lost ZK leadership and subsequently interacted
>>>>>>>>>>> with Mesos?
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Sep 26, 2017 at 1:02 PM, Mohit Jaggi <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Fellows,
>>>>>>>>>>>> While examining Aurora log files, I noticed a condition where a
>>>>>>>>>>>> scheduler was talking to Mesos but was not showing up as the leader
>>>>>>>>>>>> in ZooKeeper. It ultimately restarted itself and another scheduler
>>>>>>>>>>>> became the leader.
>>>>>>>>>>>> Is there a reason a non-leading scheduler will talk to Mesos?
>>>>>>>>>>>>
>>>>>>>>>>>> Mohit.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
