On Tue, Sep 26, 2017 at 4:33 PM, Mohit Jaggi <[email protected]> wrote:
> John,
> I was referring to the following log message...isn't that the right one?
> Aha - it is, apologies.
> Sep 26 18:11:58 machine62 aurora-scheduler[24743]: I0926 18:11:58.795
> [Thread-814, MesosCallbackHandler$MesosCallbackHandlerImpl:180]
> Registered with ID value: "4ca9aa06-3214-4d2c-a678-0832e2f84d17-0000"
>
> On Tue, Sep 26, 2017 at 3:30 PM, John Sirois <[email protected]> wrote:
>
>> > ... but it succeeded (see logs at the end) ...
>>
>> The underlying C++ code in the scheduler driver successfully connected to
>> the leading master socket:
>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.231549
>> 24868 scheduler.cpp:361] Connected with the master at
>> http://10.163.25.45:5050/master/api/v1/scheduler
>>
>> This is not the same as a framework registration call being successfully
>> executed against the newly connected master.
>> You need to be careful about what you derive from the logs based on a
>> reading of the words alone. Generally you'll need to look carefully / grep
>> the source code to be sure you are mentally modelling the code flows
>> correctly. It certainly gets tricky.
>>
>> On Tue, Sep 26, 2017 at 4:14 PM, Mohit Jaggi <[email protected]> wrote:
>>
>>> Hmm....so it is indeed the Mesos registration, but it succeeded (see
>>> logs at the end). Looking at the code, I see that
>>> com.google.common.eventbus is used as a pub-sub mechanism to link the
>>> registration (first log message) to the registrationAcked flag...and
>>> this flag is not being set for 10 mins (?), otherwise the registration
>>> timeout handler would not print the second log message.
>>>
>>> delayedActions.onRegistrationTimeout(
>>>     () -> {
>>>       if (!registrationAcked.get()) {
>>>         LOG.error(
>>>             "Framework has not been registered within the tolerated delay.");
>>>         stateMachine.transition(State.DEAD);
>>>       }
>>>     });
>>>
>>> Sep 26 18:11:58 machine62 aurora-scheduler[24743]: I0926 18:11:58.795
>>> [Thread-814, MesosCallbackHandler$MesosCallbackHandlerImpl:180]
>>> Registered with ID value: "4ca9aa06-3214-4d2c-a678-0832e2f84d17-0000"
>>>
>>> ...
>>>
>>> ...
>>>
>>> Sep 26 18:21:37 machine62 aurora-scheduler[24743]: E0926 18:21:37.205
>>> [Lifecycle-0, SchedulerLifecycle$4:235] Framework has not been
>>> registered within the tolerated delay.
>>>
>>> On Tue, Sep 26, 2017 at 2:34 PM, John Sirois <[email protected]> wrote:
>>>
>>>> On Tue, Sep 26, 2017 at 3:31 PM, Mohit Jaggi <[email protected]> wrote:
>>>>
>>>>> Actually I wonder if the "register" here is for Aurora to register
>>>>> with Mesos as a framework...I was assuming it referred to registering
>>>>> with ZK as a leader.
>>>>
>>>> Bingo.
>>>>
>>>>> On Tue, Sep 26, 2017 at 2:27 PM, Renan DelValle Rueda <[email protected]> wrote:
>>>>>
>>>>>> I've had this, or something very similar, happen before. It's an
>>>>>> issue with Aurora and ZK. Election is based upon ZK, so if writing
>>>>>> down who the leader is to the ZK server path fails, or if ZK is
>>>>>> unable to reach quorum on the write, the election will fail.
>>>>>> Sometimes this might manifest itself in weird ways, such as two
>>>>>> Aurora schedulers believing they are leaders. If you could tell us a
>>>>>> little bit about your ZK setup we might be able to narrow down the
>>>>>> issue. The Aurora version, and whether you are using Curator or the
>>>>>> commons library, will help as well.
>>>>>>
>>>>>> On Tue, Sep 26, 2017 at 2:02 PM, Mohit Jaggi <[email protected]> wrote:
>>>>>>
>>>>>>> Hmm..it seems machine62 became a leader but could not "register" as
>>>>>>> leader.
>>>>>>> Not sure what that means. My naive assumption is that "becoming
>>>>>>> leader" and "registering as leader" are "atomic".
>>>>>>>
>>>>>>> ------- grep on SchedulerLifecycle -----
>>>>>>> aurora-scheduler.log:Sep 26 18:11:33 machine62 aurora-scheduler[24743]:
>>>>>>> I0926 18:11:33.158 [LeaderSelector-0, StateMachine$Builder:389]
>>>>>>> SchedulerLifecycle state machine transition STORAGE_PREPARED ->
>>>>>>> LEADER_AWAITING_REGISTRATION
>>>>>>> aurora-scheduler.log:Sep 26 18:11:33 machine62 aurora-scheduler[24743]:
>>>>>>> I0926 18:11:33.159 [LeaderSelector-0, SchedulerLifecycle$4:224]
>>>>>>> Elected as leading scheduler!
>>>>>>> aurora-scheduler.log:Sep 26 18:11:37 machine62 aurora-scheduler[24743]:
>>>>>>> I0926 18:11:37.204 [LeaderSelector-0,
>>>>>>> SchedulerLifecycle$DefaultDelayedActions:163] Giving up on
>>>>>>> registration in (10, mins)
>>>>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]:
>>>>>>> E0926 18:21:37.205 [Lifecycle-0, SchedulerLifecycle$4:235]
>>>>>>> Framework has not been registered within the tolerated delay.
>>>>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]:
>>>>>>> I0926 18:21:37.205 [Lifecycle-0, StateMachine$Builder:389]
>>>>>>> SchedulerLifecycle state machine transition
>>>>>>> LEADER_AWAITING_REGISTRATION -> DEAD
>>>>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]:
>>>>>>> I0926 18:21:37.215 [Lifecycle-0, StateMachine$Builder:389]
>>>>>>> SchedulerLifecycle state machine transition DEAD -> DEAD
>>>>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]:
>>>>>>> I0926 18:21:37.215 [Lifecycle-0, SchedulerLifecycle$6:275]
>>>>>>> Shutdown already invoked, ignoring extra call.
>>>>>>> aurora-scheduler.log:Sep 26 18:22:05 machine62 aurora-scheduler[54502]:
>>>>>>> I0926 18:22:05.681 [main, StateMachine$Builder:389]
>>>>>>> SchedulerLifecycle state machine transition IDLE -> PREPARING_STORAGE
>>>>>>> aurora-scheduler.log:Sep 26 18:22:06 machine62 aurora-scheduler[54502]:
>>>>>>> I0926 18:22:06.396 [main, StateMachine$Builder:389]
>>>>>>> SchedulerLifecycle state machine transition PREPARING_STORAGE ->
>>>>>>> STORAGE_PREPARED
>>>>>>>
>>>>>>> ------ connecting to mesos -----
>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.211750
>>>>>>> 24871 group.cpp:757] Found non-sequence node 'log_replicas' at
>>>>>>> '/mesos' in ZooKeeper
>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.211817
>>>>>>> 24871 detector.cpp:152] Detected a new leader: (id='1506')
>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.211917
>>>>>>> 24871 group.cpp:699] Trying to get '/mesos/json.info_0000001506' in
>>>>>>> ZooKeeper
>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.216063
>>>>>>> 24871 zookeeper.cpp:262] A new leading master
>>>>>>> (UPID=[email protected]:5050) is detected
>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.216162
>>>>>>> 24871 scheduler.cpp:470] New master detected at [email protected]:5050
>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.217772
>>>>>>> 24871 scheduler.cpp:479] Waiting for 12.81503ms before initiating a
>>>>>>> re-(connection) attempt with the master
>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.231549
>>>>>>> 24868 scheduler.cpp:361] Connected with the master at
>>>>>>> http://10.163.25.45:5050/master/api/v1/scheduler
>>>>>>>
>>>>>>> On Tue, Sep 26, 2017 at 1:24 PM, Bill Farner <[email protected]> wrote:
>>>>>>>
>>>>>>>> Is there a reason a non-leading scheduler will talk to Mesos
>>>>>>>>
>>>>>>>> No, there is not a legitimate reason. Did this occur for an
>>>>>>>> extended period of time? Do you have logs from the scheduler
>>>>>>>> indicating that it lost ZK leadership and subsequently interacted
>>>>>>>> with Mesos?
>>>>>>>>
>>>>>>>> On Tue, Sep 26, 2017 at 1:02 PM, Mohit Jaggi <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Fellows,
>>>>>>>>> While examining Aurora log files, I noticed a condition where a
>>>>>>>>> scheduler was talking to Mesos but was not showing up as the
>>>>>>>>> leader in ZooKeeper. It ultimately restarted itself and another
>>>>>>>>> scheduler became the leader.
>>>>>>>>> Is there a reason a non-leading scheduler will talk to Mesos?
>>>>>>>>>
>>>>>>>>> Mohit.
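[Editor's note] The race discussed in this thread - a "Registered" event that must flip the registrationAcked flag before a delayed timeout action fires - can be sketched with plain java.util.concurrent. This is a hypothetical illustration, not Aurora's actual implementation: the class name RegistrationWatchdog and its methods are made up; Aurora delivers the registered event through Guava's EventBus and drives the timeout via its DelayedActions/SchedulerLifecycle machinery.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of the registration-timeout pattern:
// an acked flag set by the "registered" callback, and a delayed
// action that kills the scheduler if the flag was never set.
public class RegistrationWatchdog {
  private final AtomicBoolean registrationAcked = new AtomicBoolean(false);
  private final AtomicBoolean dead = new AtomicBoolean(false);
  private final ScheduledExecutorService timer =
      Executors.newSingleThreadScheduledExecutor();

  // Invoked when the master's Registered event is delivered
  // (in Aurora this arrives via a Guava EventBus subscriber).
  public void onRegistered() {
    registrationAcked.set(true);
  }

  // Arms the "Giving up on registration in ..." timer.
  public void armTimeout(long delay, TimeUnit unit) {
    timer.schedule(() -> {
      if (!registrationAcked.get()) {
        // Corresponds to "Framework has not been registered within
        // the tolerated delay." followed by the transition to DEAD.
        dead.set(true);
      }
    }, delay, unit);
  }

  public boolean isDead() {
    return dead.get();
  }

  public void shutdown() {
    timer.shutdownNow();
  }
}
```

The point John makes above holds in this sketch too: connecting to the master socket touches nothing here; only delivery of the registered event calls onRegistered(), so a connected-but-never-registered (or never-delivered) scheduler walks straight into DEAD when the timer fires.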
