Hmm... so it is indeed the Mesos registration, but it succeeded (see the logs
at the end). Looking at the code, I see that com.google.common.eventbus is
used as a pub-sub mechanism to link the registration (the first log message)
to the registrationAcked flag... and that flag apparently was not set within
10 minutes (?), because otherwise the registration timeout handler would not
have printed the second log message:
delayedActions.onRegistrationTimeout(
    () -> {
      if (!registrationAcked.get()) {
        LOG.error(
            "Framework has not been registered within the tolerated delay.");
        stateMachine.transition(State.DEAD);
      }
    });
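For reference, here is a minimal, self-contained sketch of the other side of
that EventBus link, i.e. the subscriber that is supposed to flip
registrationAcked when the Mesos callback handler posts the "registered"
event. The class and event names below (RegistrationAckSketch,
DriverRegistered) are stand-ins I made up for illustration, not the actual
Aurora identifiers; only the Guava EventBus/@Subscribe wiring and the
AtomicBoolean flag mirror what I am describing above:

import java.util.concurrent.atomic.AtomicBoolean;

import com.google.common.eventbus.EventBus;
import com.google.common.eventbus.Subscribe;

public class RegistrationAckSketch {
  // Stand-in for the "registered with the Mesos master" pub-sub event.
  static final class DriverRegistered {}

  private final AtomicBoolean registrationAcked = new AtomicBoolean(false);

  @Subscribe
  public void registered(DriverRegistered event) {
    // If this subscriber never sees the event (not registered on the same bus
    // instance, or the callback handler posts elsewhere), the timeout handler
    // above fires 10 minutes later and transitions the scheduler to DEAD.
    registrationAcked.set(true);
  }

  boolean acked() {
    return registrationAcked.get();
  }

  public static void main(String[] args) {
    EventBus bus = new EventBus();
    RegistrationAckSketch lifecycle = new RegistrationAckSketch();
    bus.register(lifecycle);          // subscriber must be on the same bus...
    bus.post(new DriverRegistered()); // ...as the handler that posts the event
    System.out.println("acked = " + lifecycle.acked());
  }
}

In other words, the timeout in the logs below means this @Subscribe path never
ran (or ran too late), even though the callback handler clearly logged the
registration.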
Sep 26 18:11:58 machine62 aurora-scheduler[24743]: I0926 18:11:58.795
[Thread-814, MesosCallbackHandler$MesosCallbackHandlerImpl:180] Registered
with ID value: "4ca9aa06-3214-4d2c-a678-0832e2f84d17-0000"
...
...
Sep 26 18:21:37 machine62 aurora-scheduler[24743]: E0926 18:21:37.205
[Lifecycle-0, SchedulerLifecycle$4:235] Framework has not been registered
within the tolerated delay.
On Tue, Sep 26, 2017 at 2:34 PM, John Sirois <[email protected]> wrote:
>
>
> On Tue, Sep 26, 2017 at 3:31 PM, Mohit Jaggi <[email protected]> wrote:
>
>> Actually I wonder if the "register" here is for Aurora to register with
>> Mesos as a framework... I was assuming it referred to registering with ZK
>> as a leader.
>>
>
> Bingo.
>
>
>>
>> On Tue, Sep 26, 2017 at 2:27 PM, Renan DelValle Rueda <
>> [email protected]> wrote:
>>
>>> I've had this, or something very similar, happen before. It's an issue
>>> with Aurora and ZK. Election is based on ZK, so if writing down who the
>>> leader is to the ZK server path fails, or if ZK is unable to reach quorum
>>> on the write, the election will fail. Sometimes this manifests itself in
>>> weird ways, such as two Aurora schedulers believing they are leaders. If
>>> you could tell us a little bit about your ZK setup we might be able to
>>> narrow down the issue. The Aurora version and whether you are using
>>> Curator or the commons library would also help.
>>>
>>> On Tue, Sep 26, 2017 at 2:02 PM, Mohit Jaggi <[email protected]>
>>> wrote:
>>>
>>>> Hmm... it seems machine62 became the leader but could not "register" as
>>>> leader. Not sure what that means. My naive assumption is that "becoming
>>>> leader" and "registering as leader" are "atomic".
>>>>
>>>> ------- grep on SchedulerLifecycle -----
>>>> aurora-scheduler.log:Sep 26 18:11:33 machine62 aurora-scheduler[24743]:
>>>> I0926 18:11:33.158 [LeaderSelector-0, StateMachine$Builder:389]
>>>> SchedulerLifecycle state machine transition STORAGE_PREPARED ->
>>>> LEADER_AWAITING_REGISTRATION
>>>> aurora-scheduler.log:Sep 26 18:11:33 machine62 aurora-scheduler[24743]:
>>>> I0926 18:11:33.159 [LeaderSelector-0, SchedulerLifecycle$4:224] Elected as
>>>> leading scheduler!
>>>> aurora-scheduler.log:Sep 26 18:11:37 machine62 aurora-scheduler[24743]:
>>>> I0926 18:11:37.204 [LeaderSelector-0,
>>>> SchedulerLifecycle$DefaultDelayedActions:163]
>>>> Giving up on registration in (10, mins)
>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]:
>>>> E0926 18:21:37.205 [Lifecycle-0, SchedulerLifecycle$4:235] Framework has
>>>> not been registered within the tolerated delay.
>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]:
>>>> I0926 18:21:37.205 [Lifecycle-0, StateMachine$Builder:389]
>>>> SchedulerLifecycle state machine transition LEADER_AWAITING_REGISTRATION ->
>>>> DEAD
>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]:
>>>> I0926 18:21:37.215 [Lifecycle-0, StateMachine$Builder:389]
>>>> SchedulerLifecycle state machine transition DEAD -> DEAD
>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]:
>>>> I0926 18:21:37.215 [Lifecycle-0, SchedulerLifecycle$6:275] Shutdown already
>>>> invoked, ignoring extra call.
>>>> aurora-scheduler.log:Sep 26 18:22:05 machine62 aurora-scheduler[54502]:
>>>> I0926 18:22:05.681 [main, StateMachine$Builder:389] SchedulerLifecycle
>>>> state machine transition IDLE -> PREPARING_STORAGE
>>>> aurora-scheduler.log:Sep 26 18:22:06 machine62 aurora-scheduler[54502]:
>>>> I0926 18:22:06.396 [main, StateMachine$Builder:389] SchedulerLifecycle
>>>> state machine transition PREPARING_STORAGE -> STORAGE_PREPARED
>>>>
>>>>
>>>> ------ connecting to mesos -----
>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926
>>>> 18:11:37.211750 24871 group.cpp:757] Found non-sequence node 'log_replicas'
>>>> at '/mesos' in ZooKeeper
>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926
>>>> 18:11:37.211817 24871 detector.cpp:152] Detected a new leader: (id='1506')
>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926
>>>> 18:11:37.211917 24871 group.cpp:699] Trying to get
>>>> '/mesos/json.info_0000001506' in ZooKeeper
>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926
>>>> 18:11:37.216063 24871 zookeeper.cpp:262] A new leading master (UPID=
>>>> [email protected]:5050) is detected
>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926
>>>> 18:11:37.216162 24871 scheduler.cpp:470] New master detected at
>>>> [email protected]:5050
>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926
>>>> 18:11:37.217772 24871 scheduler.cpp:479] Waiting for 12.81503ms before
>>>> initiating a re-(connection) attempt with the master
>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926
>>>> 18:11:37.231549 24868 scheduler.cpp:361] Connected with the master at
>>>> http://10.163.25.45:5050/master/api/v1/scheduler
>>>>
>>>>
>>>>
>>>> On Tue, Sep 26, 2017 at 1:24 PM, Bill Farner <[email protected]>
>>>> wrote:
>>>>
>>>>> Is there a reason a non-leading scheduler will talk to Mesos
>>>>>
>>>>>
>>>>> No, there is not a legitimate reason. Did this occur for an extended
>>>>> period of time? Do you have logs from the scheduler indicating that it
>>>>> lost ZK leadership and subsequently interacted with Mesos?
>>>>>
>>>>> On Tue, Sep 26, 2017 at 1:02 PM, Mohit Jaggi <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Fellows,
>>>>>> While examining Aurora log files, I noticed a condition where a
>>>>>> scheduler was talking to Mesos but it was not showing up as a leader in
>>>>>> Zookeeper. It ultimately restarted itself and another scheduler became
>>>>>> the
>>>>>> leader.
>>>>>> Is there a reason a non-leading scheduler will talk to Mesos?
>>>>>>
>>>>>> Mohit.
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>