For posterity: https://issues.apache.org/jira/browse/MESOS-9619
On Thu, Feb 28, 2019 at 6:02 PM Meng Zhu <m...@mesosphere.com> wrote:

> Hi Nimi:
>
> Thanks for reporting this.
>
> From the log snippet, it looks like, when de-allocating resources, the
> agent does not have the port resources that are supposed to have been
> allocated. Can you provide the master log (covering at least the period
> from when the resources on the agent were offered to the crash point)?
> Also, can you create a JIRA ticket and upload the log there?
> (https://issues.apache.org/jira/projects/MESOS/issues)
>
> -Meng
>
> On Thu, Feb 28, 2019 at 1:58 PM Nimi W <psnim2...@gmail.com> wrote:
>
>> Hi,
>>
>> Mesos: 1.7.1
>>
>> I'm trying to debug an issue where, if I launch a task using the
>> LAUNCH_GROUP method and the task fails to start, the Mesos master
>> crashes. I am using a custom framework I've built with the HTTP
>> Scheduler API.
>>
>> When my framework receives an offer, I return an ACCEPT with this JSON:
>>
>> https://gist.github.com/nemosupremo/3b23c4e1ca0ab241376aa5b975993270
>>
>> I then receive the following UPDATE events:
>>
>> TASK_STARTING
>> TASK_RUNNING
>> TASK_FAILED
>>
>> My framework then immediately tries to relaunch the task on the next
>> OFFERS:
>>
>> https://gist.github.com/nemosupremo/2b02443241c3bd002f04be034d8e64f7
>>
>> But sometime between receiving that event and trying to acknowledge the
>> TASK_FAILED event, the Mesos master crashes with:
>>
>> Feb 28 21:34:02 master03 mesos-master[7124]: F0228 21:34:02.118693 7142 sorter.hpp:357] Check failed: resources.at(slaveId).contains(toRemove) Resources disk(allocated: faust)(reservations: [(STATIC,faust)]):1; cpus(allocated: faust)(reservations: [(STATIC,faust)]):0.1; mem(allocated: faust)(reservations: [(STATIC,faust)]):64 at agent 643078ba-8cb8-4582-b9c3-345d602506c8-S0 does not contain cpus(allocated: faust)(reservations: [(STATIC,faust)]):0.1; mem(allocated: faust)(reservations: [(STATIC,faust)]):64; disk(allocated: faust)(reservations: [(STATIC,faust)]):1; ports(allocated: faust)(reservations: [(STATIC,faust)]):[7777-7777]
>> Feb 28 21:34:02 master03 mesos-master[7124]: *** Check failure stack trace: ***
>> Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd935e48d  google::LogMessage::Fail()
>> Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd9360240  google::LogMessage::SendToLog()
>> Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd935e073  google::LogMessage::Flush()
>> Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd9360c69  google::LogMessageFatal::~LogMessageFatal()
>> Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd83d85f8  mesos::internal::master::allocator::DRFSorter::unallocated()
>> Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd83a78af  mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::untrackAllocatedResources()
>> Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd83ba281  mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::recoverResources()
>> Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd92a6631  process::ProcessBase::consume()
>> Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd92c878a  process::ProcessManager::resume()
>> Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd92cc4d6  _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
>> Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd6289c80  (unknown)
>> Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd5da56ba  start_thread
>> Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd5adb41d  (unknown)
>> Feb 28 21:34:02 master03 systemd[1]: mesos-master.service: Main process exited, code=killed, status=6/ABRT
>> Feb 28 21:34:02 master03 systemd[1]: mesos-master.service: Unit entered failed state.
>> Feb 28 21:34:02 master03 systemd[1]: mesos-master.service: Failed with result 'signal'.
>>
>> The entire process works with the older LAUNCH API (for some reason the
>> Docker task crashes with filesystem permission issues when using
>> LAUNCH_GROUP).
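[Archive note] For readers reconstructing the flow above, here is a minimal sketch of the two v1 scheduler HTTP API calls the thread describes: the ACCEPT carrying a LAUNCH_GROUP operation, and the ACKNOWLEDGE of the TASK_FAILED status update. Field names follow the public v1 scheduler API (scheduler.proto); every ID value here is hypothetical, and the actual payloads are in the linked gists.

```python
def accept_with_launch_group(framework_id, offer_id, task):
    """Build an ACCEPT call carrying a single LAUNCH_GROUP operation.

    All ID strings are placeholders; `task` is a TaskInfo-shaped dict.
    """
    return {
        "framework_id": {"value": framework_id},
        "type": "ACCEPT",
        "accept": {
            "offer_ids": [{"value": offer_id}],
            "operations": [{
                "type": "LAUNCH_GROUP",
                "launch_group": {
                    # Task groups run under a default executor.
                    "executor": {
                        "type": "DEFAULT",
                        "executor_id": {"value": "default-executor"},
                        "framework_id": {"value": framework_id},
                    },
                    "task_group": {"tasks": [task]},
                },
            }],
            "filters": {"refuse_seconds": 5.0},
        },
    }


def acknowledge(framework_id, agent_id, task_id, status_uuid):
    """Build the ACKNOWLEDGE call for a status update (e.g. TASK_FAILED).

    `status_uuid` must echo the base64-encoded uuid from the UPDATE event.
    """
    return {
        "framework_id": {"value": framework_id},
        "type": "ACKNOWLEDGE",
        "acknowledge": {
            "agent_id": {"value": agent_id},
            "task_id": {"value": task_id},
            "uuid": status_uuid,
        },
    }
```

Both dicts would be POSTed as JSON to the master's /api/v1/scheduler endpoint on the open subscription connection. Note the crash above fires in recoverResources() around the time the acknowledgement is in flight, per the stack trace.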