This looks similar to https://issues.apache.org/jira/browse/MESOS-7966. Can you add your information and logs to that ticket?
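In the meantime, since the crash appears to be triggered by concurrent POSTs to /maintenance/schedule, funneling all updates through a single writer that does a read-modify-write of the schedule should avoid it. A minimal, untested sketch in Python 3 (the GET/POST /maintenance/schedule endpoints and the JSON window format follow the maintenance documentation; the master address, `set_window`, and its arguments are placeholders of mine, not anything from your setup):

    import json
    import urllib.request

    MASTER = "http://127.0.0.1:5050"  # placeholder master address

    def get_schedule():
        # GET /maintenance/schedule returns the current schedule as JSON.
        with urllib.request.urlopen(MASTER + "/maintenance/schedule") as resp:
            return json.load(resp)

    def post_schedule(schedule):
        # POST /maintenance/schedule replaces the entire schedule.
        req = urllib.request.Request(
            MASTER + "/maintenance/schedule",
            data=json.dumps(schedule).encode(),
            headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req)

    def set_window(hostnames, start_ns, duration_ns):
        # Read-modify-write from one caller only, so no two updates race.
        schedule = get_schedule()
        schedule.setdefault("windows", []).append({
            "machine_ids": [{"hostname": h} for h in hostnames],
            "unavailability": {
                "start": {"nanoseconds": start_ns},
                "duration": {"nanoseconds": duration_ns},
            },
        })
        post_schedule(schedule)

Because each POST replaces the whole schedule, two concurrent writers can interleave in exactly the way described below; serializing them sidesteps that until the ticket is resolved.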
On Fri, Sep 15, 2017 at 3:18 AM, Qi Feng <athlum5...@outlook.com> wrote:

> My mesos version is 1.2.0. Sorry.
>
> ------------------------------
> *From:* Qi Feng <athlum5...@outlook.com>
> *Sent:* Friday, September 15, 2017 10:14 AM
> *To:* user@mesos.apache.org; Bayou
> *Subject:* Re: [ISSUE] Check failed: slave.maintenance.isSome()
>
> This case can be reproduced by running `for i in {1..8}; do python
> call.py; done` (call.py gist:
> https://gist.github.com/athlum/e2cd04bfb9f81a790d31643606252a49).
>
> It looks like something goes wrong when /maintenance/schedule is called
> concurrently.
>
> We hit this case because we wrote a service based on Ansible that
> manages the Mesos cluster. When we create a task to update slave configs
> on a certain number of workers, it goes like this:
>
> 1. Call schedule for 3 machines: a, b, c.
> 2. When machine a is done, the maintenance window updates to: b, c.
> 3. When another machine "d" is assigned right after a, the window
> updates to: b, c, d.
>
> These changes sometimes happen within a very short interval, and then we
> see the fatal log shown in Bayou's mail.
>
> What's the right way to update a maintenance window? Thanks for any
> reply.
>
>
> ------------------------------
> *From:* Bayou <zkcresc...@gmail.com>
> *Sent:* Thursday, September 14, 2017 12:06 PM
> *To:* user@mesos.apache.org
> *Subject:* [ISSUE] Check failed: slave.maintenance.isSome()
>
> Hi all,
> I'm trying to continuously run mesos-maintenance-schedule, machine-down,
> machine-up, and mesos-maintenance-schedule-cancel over and over again in
> a three-slave cluster, with no other operations, just using the Mesos API
> to schedule the three slaves asynchronously. At the beginning it worked
> well, but after many tries, a few hundred times, there was always a
> `Check failed: slave.maintenance.isSome()` and the mesos master crashed.
> The relevant code is at
> https://github.com/apache/mesos/blob/2fe2bb26a425da9aaf1d7cf34019dd347d0cf9a4/src/master/allocator/mesos/hierarchical.cpp#L983
> Some logs from the mesos master are below:
> 2017-09-12 16:39:07.394 err mesos-master[254491]: F0912 16:39:07.393944 254527 hierarchical.cpp:903] Check failed: slave.maintenance.isSome()
> 2017-09-12 16:39:07.394 err mesos-master[254491]: *** Check failure stack trace: ***
> 2017-09-12 16:39:07.402 err mesos-master[254491]: @ 0x7f4cf356fba6 google::LogMessage::Fail()
> 2017-09-12 16:39:07.413 err mesos-master[254491]: @ 0x7f4cf356fb05 google::LogMessage::SendToLog()
> 2017-09-12 16:39:07.420 err mesos-master[254491]: @ 0x7f4cf356f516 google::LogMessage::Flush()
> 2017-09-12 16:39:07.424 err mesos-master[254491]: @ 0x7f4cf357224a google::LogMessageFatal::~LogMessageFatal()
> 2017-09-12 16:39:07.429 err mesos-master[254491]: @ 0x7f4cf2344a32 mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::updateInverseOffer()
> 2017-09-12 16:39:07.435 err mesos-master[254491]: @ 0x7f4cf1f8d9f9 _ZZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS1_7SlaveIDERKNS1_11FrameworkIDERK6OptionINS1_20UnavailableResourcesEERKSC_INS1_9allocator18InverseOfferStatusEERKSC_INS1_7FiltersEES6_S9_SE_SJ_SN_EEvRKNS_3PIDIT_EEMSR_FvT0_T1_T2_T3_T4_ET5_T6_T7_T8_T9_ENKUlPNS_11ProcessBaseEE_clES18_
> 2017-09-12 16:39:07.445 err mesos-master[254491]: @ 0x7f4cf1f938bb _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS5_7SlaveIDERKNS5_11FrameworkIDERK6OptionINS5_20UnavailableResourcesEERKSG_INS5_9allocator18InverseOfferStatusEERKSG_INS5_7FiltersEESA_SD_SI_SN_SR_EEvRKNS0_3PIDIT_EEMSV_FvT0_T1_T2_T3_T4_ET5_T6_T7_T8_T9_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
> 2017-09-12 16:39:07.455 err mesos-master[254491]: @ 0x7f4cf34dd049 std::function<>::operator()()
> 2017-09-12 16:39:07.460 err mesos-master[254491]: @ 0x7f4cf34c1285 process::ProcessBase::visit()
> 2017-09-12 16:39:07.464 err mesos-master[254491]: @ 0x7f4cf34cc58a process::DispatchEvent::visit()
> 2017-09-12 16:39:07.465 err mesos-master[254491]: @ 0x7f4cf4e4ad4e process::ProcessBase::serve()
> 2017-09-12 16:39:07.469 err mesos-master[254491]: @ 0x7f4cf34bd281 process::ProcessManager::resume()
> 2017-09-12 16:39:07.471 err mesos-master[254491]: @ 0x7f4cf34b9a2c _ZZN7process14ProcessManager12init_threadsEvENKUt_clEv
> 2017-09-12 16:39:07.473 err mesos-master[254491]: @ 0x7f4cf34cbbf2 _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE
> 2017-09-12 16:39:07.475 err mesos-master[254491]: @ 0x7f4cf34cbb36 _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEclEv
> 2017-09-12 16:39:07.477 err mesos-master[254491]: @ 0x7f4cf34cbac0 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
> 2017-09-12 16:39:07.478 err mesos-master[254491]: @ 0x7f4ced3ba1e0 (unknown)
> 2017-09-12 16:39:07.478 err mesos-master[254491]: @ 0x7f4ced613dc5 start_thread
> 2017-09-12 16:39:07.479 err mesos-master[254491]: @ 0x7f4cecb21ced __clone
> 2017-09-12 16:39:07.486 notice systemd[1]: mesos-master.service: main process exited, code=killed, status=6/ABRT
>
> Is this an issue, or did I do something wrong? Hope someone can help me
> work this out. Thank you.