Hmm... it is a very busy cluster and 10 minutes of logs would be voluminous. They contain some internal details which I cannot share publicly. If you suspect specific areas, I can try to pull those logs and redact the internal info.
Re: code, we have a fork which is very close to master.

On Wed, Sep 27, 2017 at 10:03 PM, Bill Farner <[email protected]> wrote:

> What commit/release was this with? From the looks of the log contents,
> it's not master. I'd like to make sure I'm looking at the correct code.
>
> Are there more logs not being included? If so, can you share more
> complete logs? In particular, logs during the 10 minute delay would be
> particularly helpful.
>
> On Tue, Sep 26, 2017 at 11:51 PM, Mohit Jaggi <[email protected]> wrote:
>
>> Updating subject... as it looks like leader election was fine but the
>> registration ack did not make it to the SchedulerLifecycle code. Weird
>> that an event would get lost like that.
>>
>> On Tue, Sep 26, 2017 at 4:21 PM, John Sirois <[email protected]> wrote:
>>
>>> On Tue, Sep 26, 2017 at 4:33 PM, Mohit Jaggi <[email protected]> wrote:
>>>
>>>> John,
>>>> I was referring to the following log message... isn't that the right one?
>>>
>>> Aha - it is, apologies.
>>>
>>>> Sep 26 18:11:58 machine62 aurora-scheduler[24743]: I0926 18:11:58.795 [Thread-814, MesosCallbackHandler$MesosCallbackHandlerImpl:180] Registered with ID value: "4ca9aa06-3214-4d2c-a678-0832e2f84d17-0000"
>>>>
>>>> On Tue, Sep 26, 2017 at 3:30 PM, John Sirois <[email protected]> wrote:
>>>>
>>>>> > ... but it succeeded (see logs at the end) ...
>>>>>
>>>>> The underlying C++ code in the scheduler driver successfully connected
>>>>> to the leading master socket:
>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.231549 24868 scheduler.cpp:361] Connected with the master at http://10.163.25.45:5050/master/api/v1/scheduler
>>>>>
>>>>> This is not the same as a framework registration call being
>>>>> successfully executed against the newly connected master.
>>>>> You need to be careful about what you derive from the logs just based
>>>>> on a reading of the words. Generally you'll need to look carefully /
>>>>> grep the source code to be sure you are mentally modelling the code
>>>>> flows correctly. It certainly gets tricky.
>>>>>
>>>>> On Tue, Sep 26, 2017 at 4:14 PM, Mohit Jaggi <[email protected]> wrote:
>>>>>
>>>>>> Hmm... so it is indeed the mesos registration, but it succeeded (see
>>>>>> logs at the end). Looking at the code, I see that
>>>>>> com.google.common.eventbus is used as the pub-sub mechanism to link
>>>>>> the registration (first log message) to the registrationAcked flag...
>>>>>> and this flag is not being set for 10 mins (?), otherwise the
>>>>>> registration timeout handler would not print the second log message.
>>>>>>
>>>>>> delayedActions.onRegistrationTimeout(
>>>>>>     () -> {
>>>>>>       if (!registrationAcked.get()) {
>>>>>>         LOG.error(
>>>>>>             "Framework has not been registered within the tolerated delay.");
>>>>>>         stateMachine.transition(State.DEAD);
>>>>>>       }
>>>>>>     });
>>>>>>
>>>>>> Sep 26 18:11:58 machine62 aurora-scheduler[24743]: I0926 18:11:58.795 [Thread-814, MesosCallbackHandler$MesosCallbackHandlerImpl:180] Registered with ID value: "4ca9aa06-3214-4d2c-a678-0832e2f84d17-0000"
>>>>>>
>>>>>> ...
>>>>>>
>>>>>> Sep 26 18:21:37 machine62 aurora-scheduler[24743]: E0926 18:21:37.205 [Lifecycle-0, SchedulerLifecycle$4:235] Framework has not been registered within the tolerated delay.
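>>>>>>
>>>>>> For reference, the wiring being described is just Guava's EventBus
>>>>>> pattern: the Mesos callback handler posts an event when the registered
>>>>>> callback fires, and the lifecycle side subscribes and flips
>>>>>> registrationAcked. A minimal sketch of that pattern (the class and
>>>>>> event names below are illustrative, not Aurora's exact ones):
>>>>>>
>>>>>> import com.google.common.eventbus.EventBus;
>>>>>> import com.google.common.eventbus.Subscribe;
>>>>>> import java.util.concurrent.atomic.AtomicBoolean;
>>>>>>
>>>>>> public final class RegistrationAckSketch {
>>>>>>   // Illustrative event type standing in for Aurora's "driver registered" event.
>>>>>>   public static final class DriverRegistered {
>>>>>>     final String frameworkId;
>>>>>>     DriverRegistered(String frameworkId) { this.frameworkId = frameworkId; }
>>>>>>   }
>>>>>>
>>>>>>   // Stands in for the SchedulerLifecycle side: flips the flag that the
>>>>>>   // onRegistrationTimeout check above reads.
>>>>>>   public static final class Lifecycle {
>>>>>>     final AtomicBoolean registrationAcked = new AtomicBoolean(false);
>>>>>>
>>>>>>     @Subscribe
>>>>>>     public void onRegistered(DriverRegistered event) {
>>>>>>       registrationAcked.set(true);
>>>>>>     }
>>>>>>   }
>>>>>>
>>>>>>   public static void main(String[] args) {
>>>>>>     EventBus bus = new EventBus();
>>>>>>     Lifecycle lifecycle = new Lifecycle();
>>>>>>     bus.register(lifecycle);
>>>>>>
>>>>>>     // Stands in for the MesosCallbackHandler side: posts on the registered callback.
>>>>>>     bus.post(new DriverRegistered("some-framework-id"));
>>>>>>
>>>>>>     // If the post never happens, or goes to a bus this subscriber is not
>>>>>>     // registered on, registrationAcked stays false and the 10-minute
>>>>>>     // timeout above fires.
>>>>>>     System.out.println("registrationAcked = " + lifecycle.registrationAcked.get());
>>>>>>   }
>>>>>> }
>>>>>>
>>>>>> So in this pattern, for the ack to go missing, either the post never
>>>>>> happened or it went to a bus the subscriber was not registered on.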
>>>>>>
>>>>>> On Tue, Sep 26, 2017 at 2:34 PM, John Sirois <[email protected]> wrote:
>>>>>>
>>>>>>> On Tue, Sep 26, 2017 at 3:31 PM, Mohit Jaggi <[email protected]> wrote:
>>>>>>>
>>>>>>>> Actually I wonder if the "register" here is for Aurora to register
>>>>>>>> with mesos as a framework... I was assuming it referred to
>>>>>>>> registering with ZK as a leader.
>>>>>>>
>>>>>>> Bingo.
>>>>>>>
>>>>>>>> On Tue, Sep 26, 2017 at 2:27 PM, Renan DelValle Rueda <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> I've had this, or something very similar, happen before. It's an
>>>>>>>>> issue with Aurora and ZK. Election is based upon ZK, so if writing
>>>>>>>>> down who the leader is to the ZK server path fails, or if ZK is
>>>>>>>>> unable to reach quorum on the write, the election will fail.
>>>>>>>>> Sometimes this might manifest itself in weird ways, such as two
>>>>>>>>> Aurora schedulers believing they are leaders. If you could tell us
>>>>>>>>> a little bit about your ZK setup we might be able to narrow down
>>>>>>>>> the issue. Also, the Aurora version and whether you are using
>>>>>>>>> Curator or the commons library will help as well.
>>>>>>>>>
>>>>>>>>> On Tue, Sep 26, 2017 at 2:02 PM, Mohit Jaggi <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hmm... it seems machine62 became a leader but could not "register"
>>>>>>>>>> as leader. Not sure what that means. My naive assumption is that
>>>>>>>>>> "becoming leader" and "registering as leader" are "atomic".
>>>>>>>>>>
>>>>>>>>>> ------- grep on SchedulerLifecycle -----
>>>>>>>>>> aurora-scheduler.log:Sep 26 18:11:33 machine62 aurora-scheduler[24743]: I0926 18:11:33.158 [LeaderSelector-0, StateMachine$Builder:389] SchedulerLifecycle state machine transition STORAGE_PREPARED -> LEADER_AWAITING_REGISTRATION
>>>>>>>>>> aurora-scheduler.log:Sep 26 18:11:33 machine62 aurora-scheduler[24743]: I0926 18:11:33.159 [LeaderSelector-0, SchedulerLifecycle$4:224] Elected as leading scheduler!
>>>>>>>>>> aurora-scheduler.log:Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.204 [LeaderSelector-0, SchedulerLifecycle$DefaultDelayedActions:163] Giving up on registration in (10, mins)
>>>>>>>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]: E0926 18:21:37.205 [Lifecycle-0, SchedulerLifecycle$4:235] Framework has not been registered within the tolerated delay.
>>>>>>>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]: I0926 18:21:37.205 [Lifecycle-0, StateMachine$Builder:389] SchedulerLifecycle state machine transition LEADER_AWAITING_REGISTRATION -> DEAD
>>>>>>>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]: I0926 18:21:37.215 [Lifecycle-0, StateMachine$Builder:389] SchedulerLifecycle state machine transition DEAD -> DEAD
>>>>>>>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]: I0926 18:21:37.215 [Lifecycle-0, SchedulerLifecycle$6:275] Shutdown already invoked, ignoring extra call.
>>>>>>>>>> aurora-scheduler.log:Sep 26 18:22:05 machine62 aurora-scheduler[54502]: I0926 18:22:05.681 [main, StateMachine$Builder:389] SchedulerLifecycle state machine transition IDLE -> PREPARING_STORAGE
>>>>>>>>>> aurora-scheduler.log:Sep 26 18:22:06 machine62 aurora-scheduler[54502]: I0926 18:22:06.396 [main, StateMachine$Builder:389] SchedulerLifecycle state machine transition PREPARING_STORAGE -> STORAGE_PREPARED
>>>>>>>>>>
>>>>>>>>>> ------ connecting to mesos -----
>>>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.211750 24871 group.cpp:757] Found non-sequence node 'log_replicas' at '/mesos' in ZooKeeper
>>>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.211817 24871 detector.cpp:152] Detected a new leader: (id='1506')
>>>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.211917 24871 group.cpp:699] Trying to get '/mesos/json.info_0000001506' in ZooKeeper
>>>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.216063 24871 zookeeper.cpp:262] A new leading master (UPID=[email protected]:5050) is detected
>>>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.216162 24871 scheduler.cpp:470] New master detected at [email protected]:5050
>>>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.217772 24871 scheduler.cpp:479] Waiting for 12.81503ms before initiating a re-(connection) attempt with the master
>>>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.231549 24868 scheduler.cpp:361] Connected with the master at http://10.163.25.45:5050/master/api/v1/scheduler
>>>>>>>>>>
>>>>>>>>>> On Tue, Sep 26, 2017 at 1:24 PM, Bill Farner <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Is there a reason a non-leading scheduler would talk to Mesos
>>>>>>>>>>>
>>>>>>>>>>> No, there is not a legitimate reason. Did this occur for an
>>>>>>>>>>> extended period of time? Do you have logs from the scheduler
>>>>>>>>>>> indicating that it lost ZK leadership and subsequently interacted
>>>>>>>>>>> with mesos?
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Sep 26, 2017 at 1:02 PM, Mohit Jaggi <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Fellows,
>>>>>>>>>>>> While examining Aurora log files, I noticed a condition where a
>>>>>>>>>>>> scheduler was talking to Mesos but it was not showing up as a
>>>>>>>>>>>> leader in Zookeeper. It ultimately restarted itself and another
>>>>>>>>>>>> scheduler became the leader.
>>>>>>>>>>>> Is there a reason a non-leading scheduler would talk to Mesos?
>>>>>>>>>>>>
>>>>>>>>>>>> Mohit.
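P.S. On Renan's question about the ZK setup: it may help to keep in mind that ZK leader election (whether via Curator or the commons library) and Mesos framework registration are two separate steps. The "Elected as leading scheduler!" line in the grep above is the ZK half completing; the timeout ten minutes later is the Mesos half never being acknowledged. A minimal sketch of what the Curator-style election step looks like (the connect string and latch path below are placeholders, not our actual config, and Aurora's own election code is more involved):

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.framework.recipes.leader.LeaderLatchListener;
import org.apache.curator.retry.ExponentialBackoffRetry;

public final class LeaderElectionSketch {
  public static void main(String[] args) throws Exception {
    // Placeholder ensemble and latch path; substitute your own.
    CuratorFramework client = CuratorFrameworkFactory.newClient(
        "zk1:2181,zk2:2181,zk3:2181",
        new ExponentialBackoffRetry(1000, 3));
    client.start();

    LeaderLatch latch = new LeaderLatch(client, "/example/scheduler-leader");
    latch.addListener(new LeaderLatchListener() {
      @Override
      public void isLeader() {
        // Corresponds to the "Elected as leading scheduler!" step: the ZK
        // side is done, but the framework still has to register with Mesos.
        System.out.println("Won ZK leadership; now register with Mesos.");
      }

      @Override
      public void notLeader() {
        System.out.println("Lost ZK leadership.");
      }
    });
    latch.start();

    latch.await();  // Blocks until this instance becomes leader.
  }
}

The point is that winning the latch and receiving the Mesos registration ack are independent; the logs show the first happening and the second never arriving.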
