FYI - I believe we stumbled on this issue, which is greatly exacerbated by commit f2755e1 <https://github.com/apache/aurora/commit/f2755e1cdd67f3c1516726c21d6e8f13059a5a01>. The good news is that we now have a good handle on the culprit! More details at https://issues.apache.org/jira/browse/AURORA-1953
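For context on the thread below: the mechanism at issue is the scheduler's registration-ack timeout. As Mohit describes further down, the Registered callback from Mesos is relayed over a Guava EventBus to a subscriber that sets the registrationAcked flag, and a delayed action shuts the scheduler down if that flag is still false after the tolerated delay. The following is a minimal, self-contained sketch of that pattern only; the class and event names (RegistrationAckSketch, DriverRegistered) are illustrative and the wiring is simplified, so treat it as an approximation rather than Aurora's actual code.

    import com.google.common.eventbus.EventBus;
    import com.google.common.eventbus.Subscribe;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicBoolean;

    // Illustrative sketch only (not Aurora's code): a "registered" callback is
    // relayed over a Guava EventBus, a subscriber flips a flag, and a delayed
    // action gives up if the flag was never set.
    public final class RegistrationAckSketch {

      // Hypothetical event, standing in for whatever the real callback handler
      // posts when it logs "Registered with ID value: ...".
      static final class DriverRegistered {
        final String frameworkId;
        DriverRegistered(String frameworkId) {
          this.frameworkId = frameworkId;
        }
      }

      final AtomicBoolean registrationAcked = new AtomicBoolean(false);
      final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

      @Subscribe
      public void onRegistered(DriverRegistered event) {
        // Only the registration ack flips the flag; merely being connected to
        // the leading master socket ("Connected with the master" in the
        // scheduler.cpp logs below) does not.
        registrationAcked.set(true);
        System.out.println("Registered with ID " + event.frameworkId);
      }

      void scheduleRegistrationTimeout(long timeout, TimeUnit unit) {
        timer.schedule(
            () -> {
              if (!registrationAcked.get()) {
                // Aurora transitions its lifecycle state machine to DEAD here.
                System.err.println(
                    "Framework has not been registered within the tolerated delay.");
              }
            },
            timeout,
            unit);
      }

      public static void main(String[] args) throws InterruptedException {
        EventBus eventBus = new EventBus();
        RegistrationAckSketch lifecycle = new RegistrationAckSketch();
        eventBus.register(lifecycle);

        // Short timeout for demonstration; the thread below shows a 10 minute delay.
        lifecycle.scheduleRegistrationTimeout(2, TimeUnit.SECONDS);

        // If this post is delayed, dropped, or made on a bus the subscriber never
        // registered with, the timeout fires even though Mesos acknowledged the
        // registration -- the symptom discussed in the thread.
        eventBus.post(new DriverRegistered("4ca9aa06-3214-4d2c-a678-0832e2f84d17-0000"));

        Thread.sleep(3_000);
        lifecycle.timer.shutdownNow();
      }
    }

In the thread, the ack log line ("Registered with ID value: ...") appears at 18:11:58 yet the timeout still fires at 18:21:37, which is consistent with the event never reaching the SchedulerLifecycle subscriber.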
On Thu, Sep 28, 2017 at 2:14 PM, Mohit Jaggi <[email protected]> wrote:

> Hmm... it is a very busy cluster and 10 mins of logs will be voluminous.
> They contain some internal details which I cannot share publicly. If you
> suspect specific areas, I can try to get those logs and remove internal
> info.
>
> Re: code, we have a fork which is very close to master.
>
> On Wed, Sep 27, 2017 at 10:03 PM, Bill Farner <[email protected]> wrote:
>
>> What commit/release was this with? From the looks of the log contents,
>> it's not master. I'd like to make sure I'm looking at the correct code.
>>
>> Are there more logs not being included? If so, can you share more
>> complete logs? In particular, logs during the 10 minute delay would be
>> particularly helpful.
>>
>> On Tue, Sep 26, 2017 at 11:51 PM, Mohit Jaggi <[email protected]> wrote:
>>
>>> Updating subject... as it looks like leader election was fine but
>>> registration ack did not make it to the SchedulerLifecycle code. Weird
>>> that an event would get lost like that.
>>>
>>> On Tue, Sep 26, 2017 at 4:21 PM, John Sirois <[email protected]> wrote:
>>>
>>>> On Tue, Sep 26, 2017 at 4:33 PM, Mohit Jaggi <[email protected]> wrote:
>>>>
>>>>> John,
>>>>> I was referring to the following log message... isn't that the right one?
>>>>
>>>> Aha - it is, apologies.
>>>>
>>>>> Sep 26 18:11:58 machine62 aurora-scheduler[24743]: I0926 18:11:58.795
>>>>> [Thread-814, MesosCallbackHandler$MesosCallbackHandlerImpl:180]
>>>>> Registered with ID value: "4ca9aa06-3214-4d2c-a678-0832e2f84d17-0000"
>>>>>
>>>>> On Tue, Sep 26, 2017 at 3:30 PM, John Sirois <[email protected]> wrote:
>>>>>
>>>>>> > ... but it succeeded (see logs at the end) ...
>>>>>>
>>>>>> The underlying C++ code in the scheduler driver successfully
>>>>>> connected to the leading master socket:
>>>>>>
>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926
>>>>>> 18:11:37.231549 24868 scheduler.cpp:361] Connected with the master
>>>>>> at http://10.163.25.45:5050/master/api/v1/scheduler
>>>>>>
>>>>>> This is not the same as a framework registration call being
>>>>>> successfully executed against the newly connected master.
>>>>>> You need to be careful about what you derive from the logs just based
>>>>>> on a reading of the words. Generally you'll need to look carefully /
>>>>>> grep the source code to be sure you are mentally modelling the code
>>>>>> flows correctly. It certainly gets tricky.
>>>>>>
>>>>>> On Tue, Sep 26, 2017 at 4:14 PM, Mohit Jaggi <[email protected]> wrote:
>>>>>>
>>>>>>> Hmm... so it is indeed the mesos registration, but it succeeded (see
>>>>>>> logs at the end). Looking at the code, I see that
>>>>>>> com.google.common.eventbus is used as the pub-sub mechanism to link
>>>>>>> the registration (first log message) to the registrationAcked flag,
>>>>>>> and this flag is not being set for 10 mins (?), otherwise the
>>>>>>> registration timeout handler would not print the second log message.
>>>>>>>
>>>>>>> delayedActions.onRegistrationTimeout(
>>>>>>>     () -> {
>>>>>>>       if (!registrationAcked.get()) {
>>>>>>>         LOG.error(
>>>>>>>             "Framework has not been registered within the tolerated delay.");
>>>>>>>         stateMachine.transition(State.DEAD);
>>>>>>>       }
>>>>>>>     });
>>>>>>>
>>>>>>> Sep 26 18:11:58 machine62 aurora-scheduler[24743]: I0926 18:11:58.795
>>>>>>> [Thread-814, MesosCallbackHandler$MesosCallbackHandlerImpl:180]
>>>>>>> Registered with ID value: "4ca9aa06-3214-4d2c-a678-0832e2f84d17-0000"
>>>>>>>
>>>>>>> ...
>>>>>>>
>>>>>>> Sep 26 18:21:37 machine62 aurora-scheduler[24743]: E0926 18:21:37.205
>>>>>>> [Lifecycle-0, SchedulerLifecycle$4:235] Framework has not been
>>>>>>> registered within the tolerated delay.
>>>>>>>
>>>>>>> On Tue, Sep 26, 2017 at 2:34 PM, John Sirois <[email protected]> wrote:
>>>>>>>
>>>>>>>> On Tue, Sep 26, 2017 at 3:31 PM, Mohit Jaggi <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Actually I wonder if the "register" here is for Aurora to register
>>>>>>>>> with mesos as a framework... I was assuming it referred to
>>>>>>>>> registering with ZK as a leader.
>>>>>>>>
>>>>>>>> Bingo.
>>>>>>>>
>>>>>>>>> On Tue, Sep 26, 2017 at 2:27 PM, Renan DelValle Rueda <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> I've had this, or something very similar, happen before. It's an
>>>>>>>>>> issue with Aurora and ZK. Election is based upon ZK, so if writing
>>>>>>>>>> down who the leader is to the ZK server path fails, or if ZK is
>>>>>>>>>> unable to reach quorum on the write, the election will fail.
>>>>>>>>>> Sometimes this might manifest itself in weird ways, such as two
>>>>>>>>>> Aurora schedulers believing they are leaders. If you could tell us
>>>>>>>>>> a little bit about your ZK setup we might be able to narrow down
>>>>>>>>>> the issue. Also, the Aurora version and whether you are using
>>>>>>>>>> Curator or the commons library will help as well.
>>>>>>>>>>
>>>>>>>>>> On Tue, Sep 26, 2017 at 2:02 PM, Mohit Jaggi <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hmm... it seems machine62 became a leader but could not "register"
>>>>>>>>>>> as leader. Not sure what that means. My naive assumption is that
>>>>>>>>>>> "becoming leader" and "registering as leader" are "atomic".
>>>>>>>>>>>
>>>>>>>>>>> ------- grep on SchedulerLifecycle -----
>>>>>>>>>>> aurora-scheduler.log:Sep 26 18:11:33 machine62 aurora-scheduler[24743]: I0926 18:11:33.158 [LeaderSelector-0, StateMachine$Builder:389] SchedulerLifecycle state machine transition STORAGE_PREPARED -> LEADER_AWAITING_REGISTRATION
>>>>>>>>>>> aurora-scheduler.log:Sep 26 18:11:33 machine62 aurora-scheduler[24743]: I0926 18:11:33.159 [LeaderSelector-0, SchedulerLifecycle$4:224] Elected as leading scheduler!
>>>>>>>>>>> aurora-scheduler.log:Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.204 [LeaderSelector-0, SchedulerLifecycle$DefaultDelayedActions:163] Giving up on registration in (10, mins)
>>>>>>>>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]: E0926 18:21:37.205 [Lifecycle-0, SchedulerLifecycle$4:235] Framework has not been registered within the tolerated delay.
>>>>>>>>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]: I0926 18:21:37.205 [Lifecycle-0, StateMachine$Builder:389] SchedulerLifecycle state machine transition LEADER_AWAITING_REGISTRATION -> DEAD
>>>>>>>>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]: I0926 18:21:37.215 [Lifecycle-0, StateMachine$Builder:389] SchedulerLifecycle state machine transition DEAD -> DEAD
>>>>>>>>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]: I0926 18:21:37.215 [Lifecycle-0, SchedulerLifecycle$6:275] Shutdown already invoked, ignoring extra call.
>>>>>>>>>>> aurora-scheduler.log:Sep 26 18:22:05 machine62 aurora-scheduler[54502]: I0926 18:22:05.681 [main, StateMachine$Builder:389] SchedulerLifecycle state machine transition IDLE -> PREPARING_STORAGE
>>>>>>>>>>> aurora-scheduler.log:Sep 26 18:22:06 machine62 aurora-scheduler[54502]: I0926 18:22:06.396 [main, StateMachine$Builder:389] SchedulerLifecycle state machine transition PREPARING_STORAGE -> STORAGE_PREPARED
>>>>>>>>>>>
>>>>>>>>>>> ------ connecting to mesos -----
>>>>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.211750 24871 group.cpp:757] Found non-sequence node 'log_replicas' at '/mesos' in ZooKeeper
>>>>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.211817 24871 detector.cpp:152] Detected a new leader: (id='1506')
>>>>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.211917 24871 group.cpp:699] Trying to get '/mesos/json.info_0000001506' in ZooKeeper
>>>>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.216063 24871 zookeeper.cpp:262] A new leading master (UPID=[email protected]:5050) is detected
>>>>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.216162 24871 scheduler.cpp:470] New master detected at [email protected]:5050
>>>>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.217772 24871 scheduler.cpp:479] Waiting for 12.81503ms before initiating a re-(connection) attempt with the master
>>>>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.231549 24868 scheduler.cpp:361] Connected with the master at http://10.163.25.45:5050/master/api/v1/scheduler
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Sep 26, 2017 at 1:24 PM, Bill Farner <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>>> Is there a reason a non-leading scheduler will talk to Mesos?
>>>>>>>>>>>>
>>>>>>>>>>>> No, there is not a legitimate reason. Did this occur for an
>>>>>>>>>>>> extended period of time? Do you have logs from the scheduler
>>>>>>>>>>>> indicating that it lost ZK leadership and subsequently interacted
>>>>>>>>>>>> with mesos?
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Sep 26, 2017 at 1:02 PM, Mohit Jaggi <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Fellows,
>>>>>>>>>>>>> While examining Aurora log files, I noticed a condition where a
>>>>>>>>>>>>> scheduler was talking to Mesos but it was not showing up as a
>>>>>>>>>>>>> leader in Zookeeper. It ultimately restarted itself and another
>>>>>>>>>>>>> scheduler became the leader.
>>>>>>>>>>>>> Is there a reason a non-leading scheduler will talk to Mesos?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Mohit.
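Returning to Renan's point about ZK-backed election: leadership is only granted once the candidate's znode is written to ZooKeeper and that write reaches quorum, which is why a failed or slow write can produce the odd symptoms he mentions. Below is a minimal, generic Apache Curator LeaderLatch sketch of that behavior, assuming a hypothetical ZooKeeper connect string and latch path; it is not Aurora's election code, which (as Renan notes) may go through either Curator or the commons library.

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.framework.recipes.leader.LeaderLatch;
    import org.apache.curator.framework.recipes.leader.LeaderLatchListener;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    // Generic Curator LeaderLatch sketch (not Aurora's election code). The
    // ensemble address and latch path are hypothetical.
    public final class LeaderElectionSketch {
      public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
            "zk1:2181,zk2:2181,zk3:2181",           // hypothetical ZK ensemble
            new ExponentialBackoffRetry(1000, 3));  // 1s base backoff, 3 retries
        client.start();

        // Leadership is granted only after the latch's ephemeral sequential
        // node is successfully written under this path; if that write fails
        // or cannot reach quorum, isLeader() is never invoked.
        LeaderLatch latch = new LeaderLatch(client, "/example/scheduler/leader");
        latch.addListener(new LeaderLatchListener() {
          @Override
          public void isLeader() {
            // Elected: only now should a scheduler start driving Mesos.
            System.out.println("Elected as leading scheduler!");
          }

          @Override
          public void notLeader() {
            // Leadership lost (e.g. ZK session expiry): stop driving Mesos
            // and typically exit so a healthy peer can take over.
            System.out.println("Lost leadership, shutting down.");
          }
        });

        latch.start();
        latch.await();  // Block until this instance wins the election.

        // ... do leader-only work here ...

        latch.close();
        client.close();
      }
    }

With a latch like this, becoming leader and any follow-on work (such as registering with Mesos as a framework) are separate steps, which matches the LEADER_AWAITING_REGISTRATION state visible in the grep output above.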
