I've had this, or something very similar, happen before. It's an issue between Aurora and ZK. Election is based on ZK, so if writing the leader's identity to the ZK path fails, or if ZK cannot reach quorum on that write, the election will fail. This can manifest in strange ways, such as two Aurora schedulers each believing they are the leader. If you could tell us a bit about your ZK setup we might be able to narrow down the issue. The Aurora version, and whether you are using Curator or the commons library for election, would help as well.
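For reference, the shape of the election the scheduler runs (your logs show a LeaderSelector-0 thread, which suggests the Curator-backed implementation) is roughly the sketch below. The connect string, ZK path, and callback bodies are placeholders I picked for illustration, not Aurora's actual wiring:

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.framework.recipes.leader.LeaderSelector;
    import org.apache.curator.framework.recipes.leader.LeaderSelectorListenerAdapter;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    public class ElectionSketch {
      public static void main(String[] args) throws Exception {
        // Placeholder ensemble and path -- substitute your own.
        CuratorFramework client = CuratorFrameworkFactory.newClient(
            "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        LeaderSelector selector = new LeaderSelector(
            client,
            "/aurora/scheduler",  // hypothetical election path
            new LeaderSelectorListenerAdapter() {
              @Override
              public void takeLeadership(CuratorFramework client) throws Exception {
                // Leadership is held only while this method blocks. If advertising
                // the leader (a write to ZK) or the later Mesos registration fails,
                // returning or throwing here gives up leadership so another
                // scheduler can win the next round.
                System.out.println("Elected as leading scheduler");
                Thread.currentThread().join();  // stand-in for "serve until shutdown"
              }
            });
        selector.autoRequeue();  // rejoin the election if leadership is lost
        selector.start();
      }
    }

The point being: the write that advertises the winner is an ordinary ZK write, and it can fail even after the election itself succeeds.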
On Tue, Sep 26, 2017 at 2:02 PM, Mohit Jaggi <[email protected]> wrote:

> Hmm.. it seems machine62 became a leader but could not "register" as
> leader. Not sure what that means. My naive assumption is that "becoming
> leader" and "registering as leader" are "atomic".
>
> ------- grep on SchedulerLifecycle -----
> aurora-scheduler.log:Sep 26 18:11:33 machine62 aurora-scheduler[24743]: I0926 18:11:33.158 [LeaderSelector-0, StateMachine$Builder:389] SchedulerLifecycle state machine transition STORAGE_PREPARED -> LEADER_AWAITING_REGISTRATION
> aurora-scheduler.log:Sep 26 18:11:33 machine62 aurora-scheduler[24743]: I0926 18:11:33.159 [LeaderSelector-0, SchedulerLifecycle$4:224] Elected as leading scheduler!
> aurora-scheduler.log:Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.204 [LeaderSelector-0, SchedulerLifecycle$DefaultDelayedActions:163] Giving up on registration in (10, mins)
> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]: E0926 18:21:37.205 [Lifecycle-0, SchedulerLifecycle$4:235] Framework has not been registered within the tolerated delay.
> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]: I0926 18:21:37.205 [Lifecycle-0, StateMachine$Builder:389] SchedulerLifecycle state machine transition LEADER_AWAITING_REGISTRATION -> DEAD
> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]: I0926 18:21:37.215 [Lifecycle-0, StateMachine$Builder:389] SchedulerLifecycle state machine transition DEAD -> DEAD
> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]: I0926 18:21:37.215 [Lifecycle-0, SchedulerLifecycle$6:275] Shutdown already invoked, ignoring extra call.
> aurora-scheduler.log:Sep 26 18:22:05 machine62 aurora-scheduler[54502]: I0926 18:22:05.681 [main, StateMachine$Builder:389] SchedulerLifecycle state machine transition IDLE -> PREPARING_STORAGE
> aurora-scheduler.log:Sep 26 18:22:06 machine62 aurora-scheduler[54502]: I0926 18:22:06.396 [main, StateMachine$Builder:389] SchedulerLifecycle state machine transition PREPARING_STORAGE -> STORAGE_PREPARED
>
> ------ connecting to mesos -----
> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.211750 24871 group.cpp:757] Found non-sequence node 'log_replicas' at '/mesos' in ZooKeeper
> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.211817 24871 detector.cpp:152] Detected a new leader: (id='1506')
> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.211917 24871 group.cpp:699] Trying to get '/mesos/json.info_0000001506' in ZooKeeper
> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.216063 24871 zookeeper.cpp:262] A new leading master (UPID=[email protected]:5050) is detected
> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.216162 24871 scheduler.cpp:470] New master detected at [email protected]:5050
> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.217772 24871 scheduler.cpp:479] Waiting for 12.81503ms before initiating a re-(connection) attempt with the master
> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.231549 24868 scheduler.cpp:361] Connected with the master at http://10.163.25.45:5050/master/api/v1/scheduler
>
> On Tue, Sep 26, 2017 at 1:24 PM, Bill Farner <[email protected]> wrote:
>
>> Is there a reason a non-leading scheduler will talk to Mesos
>>
>> No, there is not a legitimate reason. Did this occur for an extended
>> period of time? Do you have logs from the scheduler indicating that it
>> lost ZK leadership and subsequently interacted with mesos?
>>
>> On Tue, Sep 26, 2017 at 1:02 PM, Mohit Jaggi <[email protected]> wrote:
>>
>>> Fellows,
>>> While examining Aurora log files, I noticed a condition where a
>>> scheduler was talking to Mesos but it was not showing up as a leader
>>> in Zookeeper. It ultimately restarted itself and another scheduler
>>> became the leader.
>>> Is there a reason a non-leading scheduler will talk to Mesos?
>>>
>>> Mohit.
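To make the quoted log concrete: "elected" and "registered with Mesos" really are two separate steps, and after winning the election the scheduler arms a timer and kills itself if Mesos registration hasn't completed within the tolerated delay (the LEADER_AWAITING_REGISTRATION -> DEAD transition you see). Roughly this shape; the names and the 10-minute constant below are illustrative, not Aurora's actual code:

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicBoolean;

    public class RegistrationTimeoutSketch {
      // Mirrors "Giving up on registration in (10, mins)" from the log.
      private static final long MAX_REGISTRATION_DELAY_MINS = 10;

      private final AtomicBoolean registered = new AtomicBoolean(false);
      private final ScheduledExecutorService timer =
          Executors.newSingleThreadScheduledExecutor();

      // Called once the ZK election is won (LEADER_AWAITING_REGISTRATION).
      void onElected() {
        System.out.println("Elected as leading scheduler!");
        timer.schedule(() -> {
          if (!registered.get()) {
            // Corresponds to "Framework has not been registered within the
            // tolerated delay" followed by the DEAD transition: give up and
            // exit so another scheduler can be elected.
            System.err.println("Framework not registered in time; shutting down.");
            System.exit(1);
          }
        }, MAX_REGISTRATION_DELAY_MINS, TimeUnit.MINUTES);
      }

      // Called when Mesos acknowledges framework registration.
      void onRegistered() {
        registered.set(true);
      }
    }

So the two steps are deliberately not atomic; the timeout is the safety net for the gap between them, which is exactly the window your machine62 instance got stuck in.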
