Re: Mesos Master Crashes when Task launched with LAUNCH_GROUP fails

Benjamin Mahler Fri, 01 Mar 2019 10:08:12 -0800

For posterity: https://issues.apache.org/jira/browse/MESOS-9619


On Thu, Feb 28, 2019 at 6:02 PM Meng Zhu <m...@mesosphere.com> wrote:

> Hi Nimi:
>
> Thanks for reporting this.
>
> From the log snippet, it looks like, when de-allocating resources, the agent
> does not have the port resources that are supposed to have been allocated.
> Can you provide the master log (covering at least the period from when
> the resources on the agent are offered to the crash point)? Also, can you
> create a JIRA ticket and upload the log there? (
> https://issues.apache.org/jira/projects/MESOS/issues)
>
> -Meng
>
> On Thu, Feb 28, 2019 at 1:58 PM Nimi W <psnim2...@gmail.com> wrote:
>
>> Hi,
>>
>> Mesos: 1.7.1
>>
>> I'm trying to debug an issue where, if I launch a task using the
>> LAUNCH_GROUP operation and the task fails to start, the Mesos master
>> crashes. I am using a custom framework that I built on the HTTP
>> Scheduler API.
>>
>> When my framework receives an offer, I respond with an ACCEPT call
>> containing this JSON:
>>
>> https://gist.github.com/nemosupremo/3b23c4e1ca0ab241376aa5b975993270
>>
>> I then receive the following UPDATE events:
>>
>> TASK_STARTING
>> TASK_RUNNING
>> TASK_FAILED
>>
>> My framework then immediately tries to relaunch the task on the next
>> OFFERS event:
>>
>> https://gist.github.com/nemosupremo/2b02443241c3bd002f04be034d8e64f7
>>
>> But sometime between receiving that event and trying to acknowledge the
>> TASK_FAILED update, the Mesos master crashes with:
>>
>> Feb 28 21:34:02 master03 mesos-master[7124]: F0228 21:34:02.118693  7142
>> sorter.hpp:357] Check failed: resources.at(slaveId).contains(toRemove)
>> Resources disk(allocated: faust)(reservations: [(STATIC,faust)]):1;
>> cpus(allocated: faust)(reservations: [(STATIC,faust)]):0.1; mem(allocated:
>> faust)(reservations: [(STATIC,faust)]):64 at agent
>> 643078ba-8cb8-4582-b9c3-345d602506c8-S0 does not contain cpus(allocated:
>> faust)(reservations: [(STATIC,faust)]):0.1; mem(allocated:
>> faust)(reservations: [(STATIC,faust)]):64; disk(allocated:
>> faust)(reservations: [(STATIC,faust)]):1; ports(allocated:
>> faust)(reservations: [(STATIC,faust)]):[7777-7777]
>> Feb 28 21:34:02 master03 mesos-master[7124]: *** Check failure stack
>> trace: ***
>> Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd935e48d
>> google::LogMessage::Fail()
>> Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd9360240
>> google::LogMessage::SendToLog()
>> Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd935e073
>> google::LogMessage::Flush()
>> Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd9360c69
>> google::LogMessageFatal::~LogMessageFatal()
>> Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd83d85f8
>> mesos::internal::master::allocator::DRFSorter::unallocated()
>> Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd83a78af
>> mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::untrackAllocatedResources()
>> Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd83ba281
>> mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::recoverResources()
>> Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd92a6631
>> process::ProcessBase::consume()
>> Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd92c878a
>> process::ProcessManager::resume()
>> Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd92cc4d6
>> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
>> Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd6289c80
>> (unknown)
>> Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd5da56ba
>> start_thread
>> Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd5adb41d
>> (unknown)
>> Feb 28 21:34:02 master03 systemd[1]: mesos-master.service: Main process
>> exited, code=killed, status=6/ABRT
>> Feb 28 21:34:02 master03 systemd[1]: mesos-master.service: Unit entered
>> failed state.
>> Feb 28 21:34:02 master03 systemd[1]: mesos-master.service: Failed with
>> result 'signal'.
>>
>> The entire process works with the older LAUNCH API (for some reason the
>> Docker task crashes with filesystem permission issues when using
>> LAUNCH_GROUP).
>>
>
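
The check failure quoted above can be illustrated with a small model. This is only a sketch and not the Mesos implementation: it assumes resources can be reduced to named scalars plus a set of ports, and it ignores roles, reservations and allocation info. The function name `contains` and the dictionary layout are hypothetical; the values are taken from the log message.

```python
def contains(tracked, to_remove):
    """Return True if every scalar amount and every port in to_remove
    is present in tracked (simplified model, not the Mesos logic)."""
    for name, amount in to_remove["scalars"].items():
        if tracked["scalars"].get(name, 0.0) < amount:
            return False
    return set(to_remove["ports"]) <= set(tracked["ports"])


# What the sorter tracks as allocated on the agent, per the log message.
tracked = {"scalars": {"disk": 1, "cpus": 0.1, "mem": 64}, "ports": []}

# What is being recovered after TASK_FAILED, per the log message.
to_remove = {"scalars": {"cpus": 0.1, "mem": 64, "disk": 1}, "ports": [7777]}

# Ports requested for removal that were never tracked as allocated.
missing_ports = set(to_remove["ports"]) - set(tracked["ports"])

print(contains(tracked, to_remove))
print(sorted(missing_ports))
```

In this model the scalar amounts match exactly and only the port 7777 is missing from the tracked allocation, which is consistent with the observation that the port resources are absent at de-allocation time.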
