Hi,

Mesos: 1.7.1
I'm trying to debug an issue where, if I launch a task using the LAUNCH_GROUP operation and the task fails to start, the Mesos master crashes. I am using a custom framework I built on the HTTP Scheduler API.

When my framework receives an offer, it responds with an ACCEPT containing this JSON: https://gist.github.com/nemosupremo/3b23c4e1ca0ab241376aa5b975993270

I then receive the following UPDATE events:

TASK_STARTING
TASK_RUNNING
TASK_FAILED

My framework then immediately tries to relaunch the task on the next OFFERS event: https://gist.github.com/nemosupremo/2b02443241c3bd002f04be034d8e64f7

But sometime between receiving the TASK_FAILED event and acknowledging it, the Mesos master crashes with:

Feb 28 21:34:02 master03 mesos-master[7124]: F0228 21:34:02.118693 7142 sorter.hpp:357] Check failed: resources.at(slaveId).contains(toRemove) Resources disk(allocated: faust)(reservations: [(STATIC,faust)]):1; cpus(allocated: faust)(reservations: [(STATIC,faust)]):0.1; mem(allocated: faust)(reservations: [(STATIC,faust)]):64 at agent 643078ba-8cb8-4582-b9c3-345d602506c8-S0 does not contain cpus(allocated: faust)(reservations: [(STATIC,faust)]):0.1; mem(allocated: faust)(reservations: [(STATIC,faust)]):64; disk(allocated: faust)(reservations: [(STATIC,faust)]):1; ports(allocated: faust)(reservations: [(STATIC,faust)]):[7777-7777]
Feb 28 21:34:02 master03 mesos-master[7124]: *** Check failure stack trace: ***
Feb 28 21:34:02 master03 mesos-master[7124]: @ 0x7f1fd935e48d google::LogMessage::Fail()
Feb 28 21:34:02 master03 mesos-master[7124]: @ 0x7f1fd9360240 google::LogMessage::SendToLog()
Feb 28 21:34:02 master03 mesos-master[7124]: @ 0x7f1fd935e073 google::LogMessage::Flush()
Feb 28 21:34:02 master03 mesos-master[7124]: @ 0x7f1fd9360c69 google::LogMessageFatal::~LogMessageFatal()
Feb 28 21:34:02 master03 mesos-master[7124]: @ 0x7f1fd83d85f8 mesos::internal::master::allocator::DRFSorter::unallocated()
Feb 28 21:34:02 master03 mesos-master[7124]: @ 0x7f1fd83a78af mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::untrackAllocatedResources()
Feb 28 21:34:02 master03 mesos-master[7124]: @ 0x7f1fd83ba281 mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::recoverResources()
Feb 28 21:34:02 master03 mesos-master[7124]: @ 0x7f1fd92a6631 process::ProcessBase::consume()
Feb 28 21:34:02 master03 mesos-master[7124]: @ 0x7f1fd92c878a process::ProcessManager::resume()
Feb 28 21:34:02 master03 mesos-master[7124]: @ 0x7f1fd92cc4d6 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
Feb 28 21:34:02 master03 mesos-master[7124]: @ 0x7f1fd6289c80 (unknown)
Feb 28 21:34:02 master03 mesos-master[7124]: @ 0x7f1fd5da56ba start_thread
Feb 28 21:34:02 master03 mesos-master[7124]: @ 0x7f1fd5adb41d (unknown)
Feb 28 21:34:02 master03 systemd[1]: mesos-master.service: Main process exited, code=killed, status=6/ABRT
Feb 28 21:34:02 master03 systemd[1]: mesos-master.service: Unit entered failed state.
Feb 28 21:34:02 master03 systemd[1]: mesos-master.service: Failed with result 'signal'.

The entire flow works with the older LAUNCH operation (for some reason the Docker task crashes with filesystem permission issues when using LAUNCH_GROUP).
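For reference, the ACCEPT I send has roughly the structure below. This is only a trimmed sketch of the v1 scheduler ACCEPT call so the shape is clear; the IDs, resources, and container/command details are placeholders, and the actual payload is in the first gist above:

{
  "framework_id": {"value": "<framework-id>"},
  "type": "ACCEPT",
  "accept": {
    "offer_ids": [{"value": "<offer-id>"}],
    "operations": [
      {
        "type": "LAUNCH_GROUP",
        "launch_group": {
          "executor": {
            "type": "DEFAULT",
            "executor_id": {"value": "<executor-id>"},
            "framework_id": {"value": "<framework-id>"},
            "resources": [
              {"name": "cpus", "type": "SCALAR", "scalar": {"value": 0.1}},
              {"name": "mem", "type": "SCALAR", "scalar": {"value": 32}}
            ]
          },
          "task_group": {
            "tasks": [
              {
                "name": "<task-name>",
                "task_id": {"value": "<task-id>"},
                "agent_id": {"value": "<agent-id>"},
                "resources": [
                  {"name": "cpus", "type": "SCALAR", "scalar": {"value": 0.1}},
                  {"name": "mem", "type": "SCALAR", "scalar": {"value": 64}}
                ]
              }
            ]
          }
        }
      }
    ]
  }
}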

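The acknowledgement my framework sends for each status update is the standard v1 ACKNOWLEDGE call, roughly like this (placeholders again; the uuid is copied from the status update being acknowledged):

{
  "framework_id": {"value": "<framework-id>"},
  "type": "ACKNOWLEDGE",
  "acknowledge": {
    "agent_id": {"value": "<agent-id>"},
    "task_id": {"value": "<task-id>"},
    "uuid": "<base64 uuid from the TASK_FAILED update>"
  }
}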
