John, I was referring to the following log message... isn't that the right one?

Sep 26 18:11:58 machine62 aurora-scheduler[24743]: I0926 18:11:58.795 [Thread-814, MesosCallbackHandler$MesosCallbackHandlerImpl:180] Registered with ID value: "4ca9aa06-3214-4d2c-a678-0832e2f84d17-0000"
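To make sure I'm reading the wiring correctly, here is a minimal sketch of the pub-sub link I described in my earlier message (quoted below). The class and event names are made up for illustration and this is not Aurora's actual code: a Guava EventBus subscriber flips an AtomicBoolean when a registration event is posted, and the delayed timeout check only complains if that flag was never set.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

import com.google.common.eventbus.EventBus;
import com.google.common.eventbus.Subscribe;

// Hypothetical event type standing in for whatever gets posted on the
// scheduler driver's "registered" callback.
final class RegisteredEvent {}

final class RegistrationWatcher {
  private final AtomicBoolean registrationAcked = new AtomicBoolean(false);

  // Subscriber: flips the flag when a registration event is posted on the bus.
  @Subscribe
  public void onRegistered(RegisteredEvent event) {
    registrationAcked.set(true);
  }

  // Timeout check: fires after the tolerated delay and only complains if no
  // registration event was ever seen.
  void scheduleTimeoutCheck(ScheduledExecutorService executor, long delay, TimeUnit unit) {
    executor.schedule(
        () -> {
          if (!registrationAcked.get()) {
            System.err.println(
                "Framework has not been registered within the tolerated delay.");
          }
        },
        delay,
        unit);
  }

  public static void main(String[] args) {
    EventBus eventBus = new EventBus();
    RegistrationWatcher watcher = new RegistrationWatcher();
    eventBus.register(watcher);

    ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
    // 10 seconds here stands in for the 10-minute tolerated delay in the logs.
    watcher.scheduleTimeoutCheck(executor, 10, TimeUnit.SECONDS);

    // Elsewhere, the "registered" callback would post the event on the bus:
    eventBus.post(new RegisteredEvent());

    executor.shutdown();
  }
}

In my case the "Registered with ID value" log line fired at 18:11:58, yet the timeout handler still fired at 18:21:37, which is what I'm trying to reconcile.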
On Tue, Sep 26, 2017 at 3:30 PM, John Sirois <[email protected]> wrote:

> > ... but it succeeded (see logs at the end) ...
>
> The underlying c++ code in the scheduler driver successfully connected to the leading master socket:
>
> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.231549 24868 scheduler.cpp:361] Connected with the master at http://10.163.25.45:5050/master/api/v1/scheduler
>
> This is not the same as a framework registration call being successfully executed against the newly connected master.
>
> You need to be careful about what you derive from the logs just based on a reading of the words. Generally you'll need to look carefully / grep source code to be sure you are mentally modelling the code flows correctly. It certainly gets tricky.
>
> On Tue, Sep 26, 2017 at 4:14 PM, Mohit Jaggi <[email protected]> wrote:
>
>> Hmm.... so it is indeed the mesos registration, but it succeeded (see logs at the end). Looking at the code, I see that com.google.common.eventbus is used as a pub-sub mechanism to link the registration (first log message) to the registrationAcked flag... and this flag is not being set for 10 mins (?), otherwise the registration timeout handler would not print the second log message.
>>
>> delayedActions.onRegistrationTimeout(
>>     () -> {
>>       if (!registrationAcked.get()) {
>>         LOG.error(
>>             "Framework has not been registered within the tolerated delay.");
>>         stateMachine.transition(State.DEAD);
>>       }
>>     });
>>
>> Sep 26 18:11:58 machine62 aurora-scheduler[24743]: I0926 18:11:58.795 [Thread-814, MesosCallbackHandler$MesosCallbackHandlerImpl:180] Registered with ID value: "4ca9aa06-3214-4d2c-a678-0832e2f84d17-0000"
>>
>> ...
>>
>> Sep 26 18:21:37 machine62 aurora-scheduler[24743]: E0926 18:21:37.205 [Lifecycle-0, SchedulerLifecycle$4:235] Framework has not been registered within the tolerated delay.
>>
>> On Tue, Sep 26, 2017 at 2:34 PM, John Sirois <[email protected]> wrote:
>>
>>> On Tue, Sep 26, 2017 at 3:31 PM, Mohit Jaggi <[email protected]> wrote:
>>>
>>>> Actually I wonder if the "register" here is for Aurora to register with mesos as a framework... I was assuming it referred to registering with ZK as a leader.
>>>
>>> Bingo.
>>>
>>>> On Tue, Sep 26, 2017 at 2:27 PM, Renan DelValle Rueda <[email protected]> wrote:
>>>>
>>>>> I've had this, or something very similar, happen before. It's an issue with Aurora and ZK. Election is based upon ZK, so if writing down who the leader is to the ZK server path fails, or if ZK is unable to reach quorum on the write, the election will fail. Sometimes this might manifest itself in weird ways, such as two Aurora schedulers believing they are leaders. If you could tell us a little bit about your ZK setup we might be able to narrow down the issue. Also, the Aurora version and whether you are using Curator or the commons library will help as well.
>>>>>
>>>>> On Tue, Sep 26, 2017 at 2:02 PM, Mohit Jaggi <[email protected]> wrote:
>>>>>
>>>>>> Hmm.. it seems machine62 became a leader but could not "register" as leader. Not sure what that means. My naive assumption is that "becoming leader" and "registering as leader" are "atomic".
>>>>>>
>>>>>> ------- grep on SchedulerLifecycle -----
>>>>>> aurora-scheduler.log:Sep 26 18:11:33 machine62 aurora-scheduler[24743]: I0926 18:11:33.158 [LeaderSelector-0, StateMachine$Builder:389] SchedulerLifecycle state machine transition STORAGE_PREPARED -> LEADER_AWAITING_REGISTRATION
>>>>>> aurora-scheduler.log:Sep 26 18:11:33 machine62 aurora-scheduler[24743]: I0926 18:11:33.159 [LeaderSelector-0, SchedulerLifecycle$4:224] Elected as leading scheduler!
>>>>>> aurora-scheduler.log:Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.204 [LeaderSelector-0, SchedulerLifecycle$DefaultDelayedActions:163] Giving up on registration in (10, mins)
>>>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]: E0926 18:21:37.205 [Lifecycle-0, SchedulerLifecycle$4:235] Framework has not been registered within the tolerated delay.
>>>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]: I0926 18:21:37.205 [Lifecycle-0, StateMachine$Builder:389] SchedulerLifecycle state machine transition LEADER_AWAITING_REGISTRATION -> DEAD
>>>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]: I0926 18:21:37.215 [Lifecycle-0, StateMachine$Builder:389] SchedulerLifecycle state machine transition DEAD -> DEAD
>>>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]: I0926 18:21:37.215 [Lifecycle-0, SchedulerLifecycle$6:275] Shutdown already invoked, ignoring extra call.
>>>>>> aurora-scheduler.log:Sep 26 18:22:05 machine62 aurora-scheduler[54502]: I0926 18:22:05.681 [main, StateMachine$Builder:389] SchedulerLifecycle state machine transition IDLE -> PREPARING_STORAGE
>>>>>> aurora-scheduler.log:Sep 26 18:22:06 machine62 aurora-scheduler[54502]: I0926 18:22:06.396 [main, StateMachine$Builder:389] SchedulerLifecycle state machine transition PREPARING_STORAGE -> STORAGE_PREPARED
>>>>>>
>>>>>> ------ connecting to mesos -----
>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.211750 24871 group.cpp:757] Found non-sequence node 'log_replicas' at '/mesos' in ZooKeeper
>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.211817 24871 detector.cpp:152] Detected a new leader: (id='1506')
>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.211917 24871 group.cpp:699] Trying to get '/mesos/json.info_0000001506' in ZooKeeper
>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.216063 24871 zookeeper.cpp:262] A new leading master (UPID=[email protected]:5050) is detected
>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.216162 24871 scheduler.cpp:470] New master detected at [email protected]:5050
>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.217772 24871 scheduler.cpp:479] Waiting for 12.81503ms before initiating a re-(connection) attempt with the master
>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.231549 24868 scheduler.cpp:361] Connected with the master at http://10.163.25.45:5050/master/api/v1/scheduler
>>>>>>
>>>>>> On Tue, Sep 26, 2017 at 1:24 PM, Bill Farner <[email protected]> wrote:
>>>>>>
>>>>>>> Is there a reason a non-leading scheduler will talk to Mesos
>>>>>>>
>>>>>>> No, there is not a legitimate reason. Did this occur for an extended period of time? Do you have logs from the scheduler indicating that it lost ZK leadership and subsequently interacted with mesos?
>>>>>>>
>>>>>>> On Tue, Sep 26, 2017 at 1:02 PM, Mohit Jaggi <[email protected]> wrote:
>>>>>>>
>>>>>>>> Fellows,
>>>>>>>> While examining Aurora log files, I noticed a condition where a scheduler was talking to Mesos but it was not showing up as a leader in Zookeeper. It ultimately restarted itself and another scheduler became the leader.
>>>>>>>> Is there a reason a non-leading scheduler will talk to Mesos?
>>>>>>>>
>>>>>>>> Mohit.
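P.S. On Renan's question above about Curator vs the commons library: for anyone comparing notes on the ZK side, here is a minimal sketch of Curator-style leader election. It is illustrative only, not Aurora's actual wiring; the connect string, latch path, and participant id are made up. It also shows the distinction John points out: winning the ZK election happens entirely in ZooKeeper, while framework registration is a separate exchange with the Mesos master.

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;

public final class ElectionSketch {
  public static void main(String[] args) throws Exception {
    // Connect string and latch path are made up for illustration.
    CuratorFramework client = CuratorFrameworkFactory.newClient(
        "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3));
    client.start();

    // Every scheduler instance creates a latch on the same path; ZK decides
    // which instance holds the lowest ephemeral sequential node and leads.
    LeaderLatch latch = new LeaderLatch(client, "/illustration/scheduler-election", "machine62");
    latch.start();

    // Blocks until this instance wins the election. If the write to ZK fails
    // or quorum is lost, leadership is never acquired (or is later revoked).
    latch.await();
    System.out.println("Elected as leading scheduler: " + latch.hasLeadership());

    // Registering the framework with the Mesos master happens after this
    // point and is a separate step that can still time out.

    latch.close();
    client.close();
  }
}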
