Actually, I wonder if the "register" here means Aurora registering with Mesos as a framework... I was assuming it referred to registering with ZK as the leader.
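To make the distinction concrete, here is a minimal sketch (not Aurora's actual code) of why the two cannot be atomic: ZK leadership is decided by a ZooKeeper recipe (Curator's LeaderLatch below), while framework registration only happens afterwards, once the new leader has connected to the Mesos master. The ZK connect string, latch path, and the connectToMesosAndAwaitRegistration() helper are invented for illustration.

// Sketch only: ZK leadership and Mesos framework registration are two
// separate events, so "elected as leader" does not imply "registered".
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.framework.recipes.leader.LeaderLatchListener;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class LeaderVsRegistration {
  public static void main(String[] args) throws Exception {
    CuratorFramework zk = CuratorFrameworkFactory.newClient(
        "zk1:2181,zk2:2181,zk3:2181",              // hypothetical ZK ensemble
        new ExponentialBackoffRetry(1000, 3));
    zk.start();

    // Hypothetical latch path; Aurora's actual serverset path will differ.
    LeaderLatch latch = new LeaderLatch(zk, "/aurora/scheduler/leader");
    latch.addListener(new LeaderLatchListener() {
      @Override
      public void isLeader() {
        // Step 1 done: we won the ZK election ("Elected as leading scheduler!").
        // Step 2 is still pending: connect to the Mesos master and wait for it
        // to acknowledge the framework registration. If that never arrives,
        // ZK leadership alone is useless and the scheduler has to give up.
        connectToMesosAndAwaitRegistration();      // placeholder, not a real API
      }

      @Override
      public void notLeader() {
        // Lost ZK leadership; any Mesos connection should be torn down here.
      }
    });
    latch.start();
  }

  private static void connectToMesosAndAwaitRegistration() {
    // Intentionally left empty; in Aurora this is driven by SchedulerLifecycle.
  }
}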
On Tue, Sep 26, 2017 at 2:27 PM, Renan DelValle Rueda <[email protected]> wrote:

> I've had this, or something very similar, happen before. It's an issue with Aurora and ZK. Election is based upon ZK, so if writing down who the leader is to the ZK server path fails, or if ZK is unable to reach quorum on the write, the election will fail. Sometimes this might manifest itself in weird ways, such as two Aurora schedulers believing they are leaders. If you could tell us a little bit about your ZK setup, we might be able to narrow down the issue. Also, the Aurora version and whether you are using Curator or the commons library will help as well.
>
> On Tue, Sep 26, 2017 at 2:02 PM, Mohit Jaggi <[email protected]> wrote:
>
>> Hmm... it seems machine62 became a leader but could not "register" as leader. Not sure what that means. My naive assumption is that "becoming leader" and "registering as leader" are atomic.
>>
>> ------- grep on SchedulerLifecycle -----
>> aurora-scheduler.log:Sep 26 18:11:33 machine62 aurora-scheduler[24743]: I0926 18:11:33.158 [LeaderSelector-0, StateMachine$Builder:389] SchedulerLifecycle state machine transition STORAGE_PREPARED -> LEADER_AWAITING_REGISTRATION
>> aurora-scheduler.log:Sep 26 18:11:33 machine62 aurora-scheduler[24743]: I0926 18:11:33.159 [LeaderSelector-0, SchedulerLifecycle$4:224] Elected as leading scheduler!
>> aurora-scheduler.log:Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.204 [LeaderSelector-0, SchedulerLifecycle$DefaultDelayedActions:163] Giving up on registration in (10, mins)
>> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]: E0926 18:21:37.205 [Lifecycle-0, SchedulerLifecycle$4:235] Framework has not been registered within the tolerated delay.
>> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]: I0926 18:21:37.205 [Lifecycle-0, StateMachine$Builder:389] SchedulerLifecycle state machine transition LEADER_AWAITING_REGISTRATION -> DEAD
>> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]: I0926 18:21:37.215 [Lifecycle-0, StateMachine$Builder:389] SchedulerLifecycle state machine transition DEAD -> DEAD
>> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]: I0926 18:21:37.215 [Lifecycle-0, SchedulerLifecycle$6:275] Shutdown already invoked, ignoring extra call.
>> aurora-scheduler.log:Sep 26 18:22:05 machine62 aurora-scheduler[54502]: I0926 18:22:05.681 [main, StateMachine$Builder:389] SchedulerLifecycle state machine transition IDLE -> PREPARING_STORAGE
>> aurora-scheduler.log:Sep 26 18:22:06 machine62 aurora-scheduler[54502]: I0926 18:22:06.396 [main, StateMachine$Builder:389] SchedulerLifecycle state machine transition PREPARING_STORAGE -> STORAGE_PREPARED
>>
>> ------ connecting to mesos -----
>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.211750 24871 group.cpp:757] Found non-sequence node 'log_replicas' at '/mesos' in ZooKeeper
>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.211817 24871 detector.cpp:152] Detected a new leader: (id='1506')
>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.211917 24871 group.cpp:699] Trying to get '/mesos/json.info_0000001506' in ZooKeeper
>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.216063 24871 zookeeper.cpp:262] A new leading master (UPID=[email protected]:5050) is detected
>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.216162 24871 scheduler.cpp:470] New master detected at [email protected]:5050
>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.217772 24871 scheduler.cpp:479] Waiting for 12.81503ms before initiating a re-(connection) attempt with the master
>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.231549 24868 scheduler.cpp:361] Connected with the master at http://10.163.25.45:5050/master/api/v1/scheduler
>>
>> On Tue, Sep 26, 2017 at 1:24 PM, Bill Farner <[email protected]> wrote:
>>
>>> Is there a reason a non-leading scheduler will talk to Mesos
>>>
>>> No, there is not a legitimate reason. Did this occur for an extended period of time? Do you have logs from the scheduler indicating that it lost ZK leadership and subsequently interacted with mesos?
>>>
>>> On Tue, Sep 26, 2017 at 1:02 PM, Mohit Jaggi <[email protected]> wrote:
>>>
>>>> Fellows,
>>>> While examining Aurora log files, I noticed a condition where a scheduler was talking to Mesos but it was not showing up as a leader in Zookeeper. It ultimately restarted itself and another scheduler became the leader.
>>>> Is there a reason a non-leading scheduler will talk to Mesos?
>>>>
>>>> Mohit.
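For reference, the LEADER_AWAITING_REGISTRATION -> DEAD sequence in the log excerpts quoted above ("Giving up on registration in (10, mins)" followed ten minutes later by "Framework has not been registered within the tolerated delay.") looks like a watchdog on registration. A rough sketch of that general pattern follows; this is not Aurora's actual SchedulerLifecycle implementation, and the class and method names are invented for illustration.

// Sketch of a registration watchdog: after winning the election, arm a timer;
// if registration with the Mesos master has not completed when it fires, tear
// the process down so another scheduler instance can take over.
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class RegistrationWatchdog {
  private final AtomicBoolean registered = new AtomicBoolean(false);
  private final ScheduledExecutorService timer =
      Executors.newSingleThreadScheduledExecutor();

  /** Call when ZK leadership is acquired ("Elected as leading scheduler!"). */
  public void onElected(long toleratedDelayMinutes) {
    timer.schedule(() -> {
      if (!registered.get()) {
        // Mirrors "Framework has not been registered within the tolerated delay."
        System.err.println("Framework not registered in time; shutting down.");
        System.exit(1);  // the process dies so leadership can move elsewhere
      }
    }, toleratedDelayMinutes, TimeUnit.MINUTES);
  }

  /** Call once the Mesos master acknowledges the framework registration. */
  public void onRegistered() {
    registered.set(true);
  }
}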
