> ... but it succeeded (see logs at the end) ... The underlying C++ code
> in the scheduler driver successfully connected to the leading master
> socket:
>
> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.231549
> 24868 scheduler.cpp:361] Connected with the master at
> http://10.163.25.45:5050/master/api/v1/scheduler

This is not the same as a framework registration call being successfully
executed against the newly connected master. You need to be careful about
what you derive from the logs based on a reading of the words alone.
Generally you'll need to look carefully / grep the source code to be sure
you are mentally modelling the code flows correctly. It certainly gets
tricky.
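To make the distinction concrete, here is a minimal sketch of the two
callbacks involved (illustrative names only, not Aurora's or Mesos'
actual classes): the driver-level "connected" callback only means the
HTTP connection to the leading master is up; registration is complete
only once the master answers the SUBSCRIBE call.

import java.util.concurrent.atomic.AtomicBoolean;

public class ConnectedVsRegistered {
  private final AtomicBoolean registrationAcked = new AtomicBoolean(false);

  // Fired when the master socket is up ("Connected with the master at
  // http://..." in scheduler.cpp). Nothing is registered yet.
  void onConnected() {
    System.out.println("connected: sending SUBSCRIBE call");
    // The SUBSCRIBE call would be sent here; the master may still drop
    // or reject it, so the "Connected" log line alone proves nothing
    // about registration.
  }

  // Fired only when the master acks the SUBSCRIBE with a framework ID
  // ("Registered with ID value: ..." in MesosCallbackHandler).
  void onSubscribed(String frameworkId) {
    registrationAcked.set(true);
    System.out.println("registered with ID " + frameworkId);
  }

  public static void main(String[] args) {
    ConnectedVsRegistered scheduler = new ConnectedVsRegistered();
    scheduler.onConnected();
    // If onSubscribed() never fires, registrationAcked stays false and
    // the registration-timeout watchdog eventually kills the scheduler.
    System.out.println("registered? " + scheduler.registrationAcked.get());
  }
}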
On Tue, Sep 26, 2017 at 4:14 PM, Mohit Jaggi <[email protected]> wrote:

> Hmm... so it is indeed the Mesos registration, but it succeeded (see
> logs at the end). Looking at the code, I see that
> com.google.common.eventbus is used as a pub-sub mechanism to link the
> registration (first log message) to the registrationAcked flag... and
> this flag is not being set for 10 mins (?), otherwise the registration
> timeout handler would not print the second log message.
>
> delayedActions.onRegistrationTimeout(
>     () -> {
>       if (!registrationAcked.get()) {
>         LOG.error(
>             "Framework has not been registered within the tolerated delay.");
>         stateMachine.transition(State.DEAD);
>       }
>     });
>
> Sep 26 18:11:58 machine62 aurora-scheduler[24743]: I0926 18:11:58.795
> [Thread-814, MesosCallbackHandler$MesosCallbackHandlerImpl:180]
> Registered with ID value: "4ca9aa06-3214-4d2c-a678-0832e2f84d17-0000"
>
> ...
>
> Sep 26 18:21:37 machine62 aurora-scheduler[24743]: E0926 18:21:37.205
> [Lifecycle-0, SchedulerLifecycle$4:235] Framework has not been registered
> within the tolerated delay.
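For reference, a minimal sketch of the Guava EventBus wiring Mohit
describes (the event and class names are illustrative, not Aurora's
actual ones): the Mesos callback handler posts an event when registration
is acked, and a lifecycle subscriber flips the registrationAcked flag
that the timeout handler above checks.

import com.google.common.eventbus.EventBus;
import com.google.common.eventbus.Subscribe;
import java.util.concurrent.atomic.AtomicBoolean;

public class RegistrationWiring {
  // Event posted when the master acks registration (hypothetical name).
  static final class DriverRegistered {
    final String frameworkId;
    DriverRegistered(String frameworkId) { this.frameworkId = frameworkId; }
  }

  static final class Lifecycle {
    final AtomicBoolean registrationAcked = new AtomicBoolean(false);

    // Guava invokes this for every DriverRegistered posted on the bus.
    @Subscribe
    public void onRegistered(DriverRegistered event) {
      // The only place the flag is set; if this subscriber never runs,
      // the registration-timeout handler wins and the scheduler dies.
      registrationAcked.set(true);
    }
  }

  public static void main(String[] args) {
    EventBus eventBus = new EventBus();
    Lifecycle lifecycle = new Lifecycle();
    eventBus.register(lifecycle);

    // The callback handler would post this on "Registered with ID value".
    eventBus.post(new DriverRegistered("4ca9aa06-3214-4d2c-a678-0832e2f84d17-0000"));
    System.out.println("acked: " + lifecycle.registrationAcked.get());
  }
}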
> On Tue, Sep 26, 2017 at 2:34 PM, John Sirois <[email protected]> wrote:
>
>> On Tue, Sep 26, 2017 at 3:31 PM, Mohit Jaggi <[email protected]> wrote:
>>
>>> Actually I wonder if the "register" here is for Aurora to register
>>> with Mesos as a framework... I was assuming it referred to registering
>>> with ZK as a leader.
>>
>> Bingo.
>>
>>> On Tue, Sep 26, 2017 at 2:27 PM, Renan DelValle Rueda <
>>> [email protected]> wrote:
>>>
>>>> I've had this, or something very similar, happen before. It's an
>>>> issue with Aurora and ZK. Election is based on ZK, so if writing down
>>>> who the leader is to the ZK server path fails, or if ZK is unable to
>>>> reach quorum on the write, the election will fail. Sometimes this can
>>>> manifest itself in weird ways, such as two Aurora schedulers believing
>>>> they are leaders. If you could tell us a little bit about your ZK
>>>> setup, we might be able to narrow down the issue. The Aurora version,
>>>> and whether you are using Curator or the commons library, would help
>>>> as well.
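A minimal sketch of the ZK-based election Renan refers to, using Apache
Curator's LeaderSelector recipe (the connect string and election path are
made up; this is not Aurora's actual election code):

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderSelector;
import org.apache.curator.framework.recipes.leader.LeaderSelectorListenerAdapter;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class ElectionSketch {
  public static void main(String[] args) throws Exception {
    CuratorFramework client = CuratorFrameworkFactory.newClient(
        "zk1:2181,zk2:2181,zk3:2181",        // hypothetical ensemble
        new ExponentialBackoffRetry(1000, 3));
    client.start();

    LeaderSelector selector = new LeaderSelector(
        client,
        "/aurora/scheduler",                 // hypothetical election path
        new LeaderSelectorListenerAdapter() {
          @Override
          public void takeLeadership(CuratorFramework c) throws Exception {
            // Leadership lasts only while this method blocks. Advertising
            // the leader (a write into ZK) can still fail here, e.g. on
            // quorum loss -- one way "became leader" and "registered as
            // leader" can diverge.
            System.out.println("elected as leading scheduler");
            Thread.currentThread().join();   // hold leadership
          }
        });
    selector.autoRequeue();                  // rejoin the election if lost
    selector.start();
  }
}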
>>>> On Tue, Sep 26, 2017 at 2:02 PM, Mohit Jaggi <[email protected]> wrote:
>>>>
>>>>> Hmm... it seems machine62 became a leader but could not "register"
>>>>> as leader. Not sure what that means. My naive assumption is that
>>>>> "becoming leader" and "registering as leader" are "atomic".
>>>>>
>>>>> ------- grep on SchedulerLifecycle -----
>>>>> aurora-scheduler.log:Sep 26 18:11:33 machine62 aurora-scheduler[24743]:
>>>>> I0926 18:11:33.158 [LeaderSelector-0, StateMachine$Builder:389]
>>>>> SchedulerLifecycle state machine transition STORAGE_PREPARED ->
>>>>> LEADER_AWAITING_REGISTRATION
>>>>> aurora-scheduler.log:Sep 26 18:11:33 machine62 aurora-scheduler[24743]:
>>>>> I0926 18:11:33.159 [LeaderSelector-0, SchedulerLifecycle$4:224]
>>>>> Elected as leading scheduler!
>>>>> aurora-scheduler.log:Sep 26 18:11:37 machine62 aurora-scheduler[24743]:
>>>>> I0926 18:11:37.204 [LeaderSelector-0,
>>>>> SchedulerLifecycle$DefaultDelayedActions:163] Giving up on
>>>>> registration in (10, mins)
>>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]:
>>>>> E0926 18:21:37.205 [Lifecycle-0, SchedulerLifecycle$4:235] Framework
>>>>> has not been registered within the tolerated delay.
>>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]:
>>>>> I0926 18:21:37.205 [Lifecycle-0, StateMachine$Builder:389]
>>>>> SchedulerLifecycle state machine transition
>>>>> LEADER_AWAITING_REGISTRATION -> DEAD
>>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]:
>>>>> I0926 18:21:37.215 [Lifecycle-0, StateMachine$Builder:389]
>>>>> SchedulerLifecycle state machine transition DEAD -> DEAD
>>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]:
>>>>> I0926 18:21:37.215 [Lifecycle-0, SchedulerLifecycle$6:275] Shutdown
>>>>> already invoked, ignoring extra call.
>>>>> aurora-scheduler.log:Sep 26 18:22:05 machine62 aurora-scheduler[54502]:
>>>>> I0926 18:22:05.681 [main, StateMachine$Builder:389] SchedulerLifecycle
>>>>> state machine transition IDLE -> PREPARING_STORAGE
>>>>> aurora-scheduler.log:Sep 26 18:22:06 machine62 aurora-scheduler[54502]:
>>>>> I0926 18:22:06.396 [main, StateMachine$Builder:389] SchedulerLifecycle
>>>>> state machine transition PREPARING_STORAGE -> STORAGE_PREPARED
>>>>>
>>>>> ------ connecting to mesos -----
>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926
>>>>> 18:11:37.211750 24871 group.cpp:757] Found non-sequence node
>>>>> 'log_replicas' at '/mesos' in ZooKeeper
>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926
>>>>> 18:11:37.211817 24871 detector.cpp:152] Detected a new leader:
>>>>> (id='1506')
>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926
>>>>> 18:11:37.211917 24871 group.cpp:699] Trying to get
>>>>> '/mesos/json.info_0000001506' in ZooKeeper
>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926
>>>>> 18:11:37.216063 24871 zookeeper.cpp:262] A new leading master (UPID=
>>>>> [email protected]:5050) is detected
>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926
>>>>> 18:11:37.216162 24871 scheduler.cpp:470] New master detected at
>>>>> [email protected]:5050
>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926
>>>>> 18:11:37.217772 24871 scheduler.cpp:479] Waiting for 12.81503ms before
>>>>> initiating a re-(connection) attempt with the master
>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926
>>>>> 18:11:37.231549 24868 scheduler.cpp:361] Connected with the master at
>>>>> http://10.163.25.45:5050/master/api/v1/scheduler
>>>>>
>>>>> On Tue, Sep 26, 2017 at 1:24 PM, Bill Farner <[email protected]> wrote:
>>>>>
>>>>>> Is there a reason a non-leading scheduler will talk to Mesos
>>>>>>
>>>>>> No, there is not a legitimate reason. Did this occur for an extended
>>>>>> period of time? Do you have logs from the scheduler indicating that
>>>>>> it lost ZK leadership and subsequently interacted with Mesos?
>>>>>>
>>>>>> On Tue, Sep 26, 2017 at 1:02 PM, Mohit Jaggi <[email protected]> wrote:
>>>>>>
>>>>>>> Fellows,
>>>>>>> While examining Aurora log files, I noticed a condition where a
>>>>>>> scheduler was talking to Mesos but was not showing up as a leader
>>>>>>> in ZooKeeper. It ultimately restarted itself and another scheduler
>>>>>>> became the leader.
>>>>>>> Is there a reason a non-leading scheduler would talk to Mesos?
>>>>>>>
>>>>>>> Mohit.
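Tying the log lines together: "Giving up on registration in (10, mins)"
is the delayed action being scheduled, and the DEAD transition ten
minutes later is it firing with registrationAcked still false. A
self-contained sketch of that pattern (illustrative; Aurora's
DefaultDelayedActions presumably differs in detail):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class RegistrationTimeout {
  public static void main(String[] args) {
    AtomicBoolean registrationAcked = new AtomicBoolean(false);
    ScheduledExecutorService executor =
        Executors.newSingleThreadScheduledExecutor();

    // Scheduled when the scheduler is elected leader ("Giving up on
    // registration in (10, mins)" at 18:11:37 in the logs).
    executor.schedule(
        () -> {
          if (!registrationAcked.get()) {
            // Fires ten minutes later (18:21:37) if nothing set the flag.
            System.out.println(
                "Framework has not been registered within the tolerated delay.");
            // The real handler then runs stateMachine.transition(State.DEAD).
          }
          executor.shutdown();
        },
        10, TimeUnit.MINUTES);
  }
}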
