Hmm.. it seems machine62 became the leader but could not "register" as leader, i.e. it never completed framework registration with the Mesos master. Not sure what that means. My naive assumption was that "becoming leader" and "registering as leader" are "atomic".
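For my own understanding, here is a rough sketch of how I picture the two steps (class and method names are invented for illustration, this is not the actual SchedulerLifecycle code): leadership is a callback from the ZooKeeper elector, while registration is a later callback from the Mesos driver, so nothing forces the two to happen together:

// A minimal sketch (invented names, not Aurora's real API) of why "became leader"
// and "registered with Mesos" are two independent callbacks rather than one
// atomic step: one comes from the ZooKeeper elector, the other from the Mesos
// driver, and the second may never arrive.
public class TwoPhaseSchedulerSketch {

    private volatile boolean leader = false;
    private volatile boolean registered = false;

    // Callback from the leader elector (e.g. Curator's LeaderSelector).
    public void onElectedLeader() {
        leader = true;
        System.out.println("Elected as leading scheduler!");
        // Only now do we start the Mesos driver and ask the master to register
        // our framework. That request travels over the network and is acked by a
        // different callback, onRegistered(), some time later, or not at all.
        startMesosDriver();
    }

    // Callback from the Mesos scheduler driver once the master accepts us.
    public void onRegistered(String frameworkId) {
        registered = true;
        System.out.println("Registered with Mesos as framework " + frameworkId
                + " (leader=" + leader + ", registered=" + registered + ")");
    }

    private void startMesosDriver() {
        // Stand-in for driver.start(); in the failure below the scheduler
        // connected to the master but the registration ack never came back.
        new Thread(() -> onRegistered("fake-framework-id")).start();
    }

    public static void main(String[] args) {
        new TwoPhaseSchedulerSketch().onElectedLeader();
    }
}

With that model, the logs below make more sense to me: the election succeeded, but the second callback never fired.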
------- grep on SchedulerLifecycle -----
aurora-scheduler.log:Sep 26 18:11:33 machine62 aurora-scheduler[24743]: I0926 18:11:33.158 [LeaderSelector-0, StateMachine$Builder:389] SchedulerLifecycle state machine transition STORAGE_PREPARED -> LEADER_AWAITING_REGISTRATION
aurora-scheduler.log:Sep 26 18:11:33 machine62 aurora-scheduler[24743]: I0926 18:11:33.159 [LeaderSelector-0, SchedulerLifecycle$4:224] Elected as leading scheduler!
aurora-scheduler.log:Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.204 [LeaderSelector-0, SchedulerLifecycle$DefaultDelayedActions:163] Giving up on registration in (10, mins)
aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]: E0926 18:21:37.205 [Lifecycle-0, SchedulerLifecycle$4:235] Framework has not been registered within the tolerated delay.
aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]: I0926 18:21:37.205 [Lifecycle-0, StateMachine$Builder:389] SchedulerLifecycle state machine transition LEADER_AWAITING_REGISTRATION -> DEAD
aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]: I0926 18:21:37.215 [Lifecycle-0, StateMachine$Builder:389] SchedulerLifecycle state machine transition DEAD -> DEAD
aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]: I0926 18:21:37.215 [Lifecycle-0, SchedulerLifecycle$6:275] Shutdown already invoked, ignoring extra call.
aurora-scheduler.log:Sep 26 18:22:05 machine62 aurora-scheduler[54502]: I0926 18:22:05.681 [main, StateMachine$Builder:389] SchedulerLifecycle state machine transition IDLE -> PREPARING_STORAGE
aurora-scheduler.log:Sep 26 18:22:06 machine62 aurora-scheduler[54502]: I0926 18:22:06.396 [main, StateMachine$Builder:389] SchedulerLifecycle state machine transition PREPARING_STORAGE -> STORAGE_PREPARED

------ connecting to mesos -----
Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.211750 24871 group.cpp:757] Found non-sequence node 'log_replicas' at '/mesos' in ZooKeeper
Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.211817 24871 detector.cpp:152] Detected a new leader: (id='1506')
Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.211917 24871 group.cpp:699] Trying to get '/mesos/json.info_0000001506' in ZooKeeper
Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.216063 24871 zookeeper.cpp:262] A new leading master ([email protected]:5050) is detected
Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.216162 24871 scheduler.cpp:470] New master detected at [email protected]:5050
Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.217772 24871 scheduler.cpp:479] Waiting for 12.81503ms before initiating a re-(connection) attempt with the master
Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.231549 24868 scheduler.cpp:361] Connected with the master at http://10.163.25.45:5050/master/api/v1/scheduler

On Tue, Sep 26, 2017 at 1:24 PM, Bill Farner <[email protected]> wrote:

> Is there a reason a non-leading scheduler will talk to Mesos
>
> No, there is not a legitimate reason. Did this occur for an extended
> period of time? Do you have logs from the scheduler indicating that it
> lost ZK leadership and subsequently interacted with mesos?
>
> On Tue, Sep 26, 2017 at 1:02 PM, Mohit Jaggi <[email protected]> wrote:
>
>> Fellows,
>> While examining Aurora log files, I noticed a condition where a scheduler
>> was talking to Mesos but it was not showing up as a leader in Zookeeper. It
>> ultimately restarted itself and another scheduler became the leader.
>> Is there a reason a non-leading scheduler will talk to Mesos?
>>
>> Mohit.
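One more sketch, this time of the delayed "Giving up on registration in (10, mins)" action visible in the log (again, invented names, not the real SchedulerLifecycle API): a timer is armed when the scheduler enters LEADER_AWAITING_REGISTRATION, and if the registration callback never arrives it forces the transition to DEAD and exits, which would also explain the fresh process (PID 24743 -> 54502) re-entering the election right afterwards:

// Rough illustration of a registration watchdog; class, state, and method names
// here are assumptions for the sketch, not Aurora's actual implementation.
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class RegistrationWatchdogSketch {

    enum State { STORAGE_PREPARED, LEADER_AWAITING_REGISTRATION, ACTIVE, DEAD }

    private final AtomicReference<State> state =
            new AtomicReference<>(State.STORAGE_PREPARED);
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor();

    void onElectedLeader(long toleratedDelay, TimeUnit unit) {
        state.set(State.LEADER_AWAITING_REGISTRATION);
        System.out.println("Giving up on registration in (" + toleratedDelay + ", " + unit + ")");
        // If onRegistered() has not fired by then, tear the scheduler down so a
        // fresh process can re-enter the leader election.
        timer.schedule(() -> {
            if (state.compareAndSet(State.LEADER_AWAITING_REGISTRATION, State.DEAD)) {
                System.err.println("Framework has not been registered within the tolerated delay.");
                System.exit(1);
            }
        }, toleratedDelay, unit);
    }

    // Invoked by the Mesos driver once the master accepts the framework.
    void onRegistered() {
        if (state.compareAndSet(State.LEADER_AWAITING_REGISTRATION, State.ACTIVE)) {
            System.out.println("Registered; scheduler is now ACTIVE.");
            timer.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        RegistrationWatchdogSketch scheduler = new RegistrationWatchdogSketch();
        scheduler.onElectedLeader(2, TimeUnit.SECONDS);
        Thread.sleep(500);        // pretend the master answered quickly...
        scheduler.onRegistered(); // ...so the watchdog never fires.
        System.exit(0);
    }
}

In the failed run above the happy path never happened, so after the 10-minute delay the watchdog path ran and the scheduler died.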
