On Tue, Sep 26, 2017 at 3:31 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
> Actually I wonder if the "register" here is for Aurora to register with
> Mesos as a framework... I was assuming it referred to registering with ZK
> as a leader.

Bingo.

> On Tue, Sep 26, 2017 at 2:27 PM, Renan DelValle Rueda <
> renanidelva...@gmail.com> wrote:
>
>> I've had this, or something very similar, happen before. It's an issue
>> with Aurora and ZK. Election is based upon ZK, so if writing down who the
>> leader is to the ZK server path fails, or if ZK is unable to reach quorum
>> on the write, the election will fail. Sometimes this manifests in weird
>> ways, such as two Aurora schedulers both believing they are the leader.
>> If you could tell us a little about your ZK setup, we might be able to
>> narrow down the issue. The Aurora version, and whether you are using
>> Curator or the commons library, will help as well.
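A side note on the election mechanics, since they come up again below: the
"LeaderSelector-0" thread in the logs further down is Curator's leader
selector recipe, which boils down to each contender writing an ephemeral
sequential node under a ZK path, so a failed write (or lost quorum) is a
failed election. A rough sketch of that pattern; the connect string and
election path are made up for illustration and this is not Aurora's actual
wiring:

----- sketch: leader election via Curator's LeaderSelector -----
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderSelector;
import org.apache.curator.framework.recipes.leader.LeaderSelectorListenerAdapter;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class ElectionSketch {
  public static void main(String[] args) throws Exception {
    // Connect to the ZK ensemble; every election write needs quorum,
    // which is why a flaky ensemble makes elections fail.
    CuratorFramework client = CuratorFrameworkFactory.newClient(
        "zk1:2181,zk2:2181,zk3:2181",           // illustrative ensemble
        new ExponentialBackoffRetry(1000, 3));  // retry transient ZK errors
    client.start();

    // Each contender creates an ephemeral sequential node under the
    // election path; the lowest sequence number holds leadership.
    LeaderSelector selector = new LeaderSelector(
        client,
        "/aurora/scheduler/leader",             // illustrative path
        new LeaderSelectorListenerAdapter() {
          @Override
          public void takeLeadership(CuratorFramework c) throws Exception {
            // Leadership is held only while this method blocks; if the
            // ZK session drops, the ephemeral node vanishes and another
            // contender is elected.
            System.out.println("Elected as leading scheduler!");
            Thread.currentThread().join();
          }
        });
    selector.autoRequeue();  // rejoin the election if leadership is lost
    selector.start();
    Thread.currentThread().join();  // keep the demo process alive
  }
}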
>> On Tue, Sep 26, 2017 at 2:02 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>
>>> Hmm... it seems machine62 became a leader but could not "register" as
>>> leader. Not sure what that means. My naive assumption is that "becoming
>>> leader" and "registering as leader" are atomic.
>>>
>>> ------- grep on SchedulerLifecycle -----
>>> aurora-scheduler.log:Sep 26 18:11:33 machine62 aurora-scheduler[24743]: I0926 18:11:33.158 [LeaderSelector-0, StateMachine$Builder:389] SchedulerLifecycle state machine transition STORAGE_PREPARED -> LEADER_AWAITING_REGISTRATION
>>> aurora-scheduler.log:Sep 26 18:11:33 machine62 aurora-scheduler[24743]: I0926 18:11:33.159 [LeaderSelector-0, SchedulerLifecycle$4:224] Elected as leading scheduler!
>>> aurora-scheduler.log:Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.204 [LeaderSelector-0, SchedulerLifecycle$DefaultDelayedActions:163] Giving up on registration in (10, mins)
>>> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]: E0926 18:21:37.205 [Lifecycle-0, SchedulerLifecycle$4:235] Framework has not been registered within the tolerated delay.
>>> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]: I0926 18:21:37.205 [Lifecycle-0, StateMachine$Builder:389] SchedulerLifecycle state machine transition LEADER_AWAITING_REGISTRATION -> DEAD
>>> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]: I0926 18:21:37.215 [Lifecycle-0, StateMachine$Builder:389] SchedulerLifecycle state machine transition DEAD -> DEAD
>>> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]: I0926 18:21:37.215 [Lifecycle-0, SchedulerLifecycle$6:275] Shutdown already invoked, ignoring extra call.
>>> aurora-scheduler.log:Sep 26 18:22:05 machine62 aurora-scheduler[54502]: I0926 18:22:05.681 [main, StateMachine$Builder:389] SchedulerLifecycle state machine transition IDLE -> PREPARING_STORAGE
>>> aurora-scheduler.log:Sep 26 18:22:06 machine62 aurora-scheduler[54502]: I0926 18:22:06.396 [main, StateMachine$Builder:389] SchedulerLifecycle state machine transition PREPARING_STORAGE -> STORAGE_PREPARED
>>>
>>> ------ connecting to mesos -----
>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.211750 24871 group.cpp:757] Found non-sequence node 'log_replicas' at '/mesos' in ZooKeeper
>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.211817 24871 detector.cpp:152] Detected a new leader: (id='1506')
>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.211917 24871 group.cpp:699] Trying to get '/mesos/json.info_0000001506' in ZooKeeper
>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.216063 24871 zookeeper.cpp:262] A new leading master (UPID=master@10.163.25.45:5050) is detected
>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.216162 24871 scheduler.cpp:470] New master detected at master@10.163.25.45:5050
>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.217772 24871 scheduler.cpp:479] Waiting for 12.81503ms before initiating a re-(connection) attempt with the master
>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.231549 24868 scheduler.cpp:361] Connected with the master at http://10.163.25.45:5050/master/api/v1/scheduler
>>>
>>> On Tue, Sep 26, 2017 at 1:24 PM, Bill Farner <wfar...@apache.org> wrote:
>>>
>>>> Is there a reason a non-leading scheduler will talk to Mesos
>>>>
>>>> No, there is not a legitimate reason. Did this occur for an extended
>>>> period of time? Do you have logs from the scheduler indicating that it
>>>> lost ZK leadership and subsequently interacted with Mesos?
>>>>
>>>> On Tue, Sep 26, 2017 at 1:02 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>
>>>>> Fellows,
>>>>> While examining Aurora log files, I noticed a condition where a
>>>>> scheduler was talking to Mesos but was not showing up as the leader in
>>>>> ZooKeeper. It ultimately restarted itself and another scheduler became
>>>>> the leader.
>>>>> Is there a reason a non-leading scheduler will talk to Mesos?
>>>>>
>>>>> Mohit.
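To close the loop on the original question: winning the ZK election and
registering with Mesos are two distinct steps. After being elected, the
scheduler opens an HTTP connection to the leading master and sends a
SUBSCRIBE call on the v1 API (the "Connected with the master" line above is
only the connection, not the subscription), and it sits in
LEADER_AWAITING_REGISTRATION until that completes. If it cannot register
within the tolerated delay, it deliberately exits so a standby can take
over, which is exactly what machine62 did at 18:21:37. The pattern is
roughly the following; this is an illustrative sketch, not the actual
SchedulerLifecycle code, and the class and method names are made up:

----- sketch: registration timeout watchdog -----
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class RegistrationWatchdog {
  private final AtomicBoolean registered = new AtomicBoolean(false);
  private final ScheduledExecutorService executor =
      Executors.newSingleThreadScheduledExecutor();

  // Invoked after winning the ZK election ("Elected as leading scheduler!").
  void onElectedLeader() {
    // Give framework registration a bounded window; compare
    // "Giving up on registration in (10, mins)" in the logs above.
    executor.schedule(() -> {
      if (!registered.get()) {
        // Mirrors "Framework has not been registered within the tolerated
        // delay.": exit so a standby scheduler can win the next election.
        System.err.println("Framework not registered in time; shutting down");
        System.exit(1);
      }
    }, 10, TimeUnit.MINUTES);
  }

  // Invoked once the Mesos master acknowledges the subscription.
  void onFrameworkRegistered() {
    registered.set(true);
  }
}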