Hi Nimi,

Thanks for reporting this.
From the log snippet, it looks like, when de-allocating resources, the agent
does not have the port resources that are supposed to have been allocated. Can
you provide the master log (covering at least the period from when the
resources on the agent are offered up to the crash)? Also, can you create a
JIRA ticket and upload the log there?
(https://issues.apache.org/jira/projects/MESOS/issues)

-Meng

On Thu, Feb 28, 2019 at 1:58 PM Nimi W <[email protected]> wrote:

> Hi,
>
> Mesos: 1.7.1
>
> I'm trying to debug an issue where, if I launch a task using the
> LAUNCH_GROUP method and the task fails to start, the mesos master will
> crash. I am using a custom framework I've built using the HTTP Scheduler
> API.
>
> When my framework receives an offer, I return an ACCEPT with this JSON:
>
> https://gist.github.com/nemosupremo/3b23c4e1ca0ab241376aa5b975993270
>
> I then receive the following UPDATE events:
>
> TASK_STARTING
> TASK_RUNNING
> TASK_FAILED
>
> My framework then immediately tries to relaunch the task on the next
> OFFERS:
>
> https://gist.github.com/nemosupremo/2b02443241c3bd002f04be034d8e64f7
>
> But sometime between when I receive that event and when I try to
> acknowledge the TASK_FAILED event, the mesos master crashes with:
>
> Feb 28 21:34:02 master03 mesos-master[7124]: F0228 21:34:02.118693 7142 sorter.hpp:357] Check failed: resources.at(slaveId).contains(toRemove) Resources disk(allocated: faust)(reservations: [(STATIC,faust)]):1; cpus(allocated: faust)(reservations: [(STATIC,faust)]):0.1; mem(allocated: faust)(reservations: [(STATIC,faust)]):64 at agent 643078ba-8cb8-4582-b9c3-345d602506c8-S0 does not contain cpus(allocated: faust)(reservations: [(STATIC,faust)]):0.1; mem(allocated: faust)(reservations: [(STATIC,faust)]):64; disk(allocated: faust)(reservations: [(STATIC,faust)]):1; ports(allocated: faust)(reservations: [(STATIC,faust)]):[7777-7777]
> Feb 28 21:34:02 master03 mesos-master[7124]: *** Check failure stack trace: ***
> Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd935e48d  google::LogMessage::Fail()
> Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd9360240  google::LogMessage::SendToLog()
> Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd935e073  google::LogMessage::Flush()
> Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd9360c69  google::LogMessageFatal::~LogMessageFatal()
> Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd83d85f8  mesos::internal::master::allocator::DRFSorter::unallocated()
> Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd83a78af  mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::untrackAllocatedResources()
> Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd83ba281  mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::recoverResources()
> Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd92a6631  process::ProcessBase::consume()
> Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd92c878a  process::ProcessManager::resume()
> Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd92cc4d6  _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd6289c80  (unknown)
> Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd5da56ba  start_thread
> Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd5adb41d  (unknown)
> Feb 28 21:34:02 master03 systemd[1]: mesos-master.service: Main process exited, code=killed, status=6/ABRT
> Feb 28 21:34:02 master03 systemd[1]: mesos-master.service: Unit entered failed state.
> Feb 28 21:34:02 master03 systemd[1]: mesos-master.service: Failed with result 'signal'.
>
> The entire process works with the older LAUNCH API (for some reason the
> docker task crashes with filesystem permission issues when using
> LAUNCH_GROUPS).
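
For reference, the ACCEPT call described above is a plain POST to the master's
/api/v1/scheduler endpoint carrying a LAUNCH_GROUP operation. Below is a
minimal sketch of such a call; it is not the contents of the gists above. The
master address, stream id, framework/offer/agent ids, and the task/executor
names are placeholders, the resource values are borrowed from the crash
message purely for illustration, and reservation fields are omitted.

import requests

MASTER = "http://master03:5050"             # placeholder master address
FRAMEWORK_ID = {"value": "<framework-id>"}  # value from the SUBSCRIBED event
STREAM_ID = "<mesos-stream-id>"             # Mesos-Stream-Id header returned by SUBSCRIBE

def accept_with_launch_group(offer_id, agent_id):
    # Resource values mirror the crash message (cpus 0.1, mem 64, disk 1,
    # port 7777); reservation fields are left out for brevity.
    task_resources = [
        {"name": "cpus", "type": "SCALAR", "scalar": {"value": 0.1}},
        {"name": "mem", "type": "SCALAR", "scalar": {"value": 64}},
        {"name": "disk", "type": "SCALAR", "scalar": {"value": 1}},
        {"name": "ports", "type": "RANGES",
         "ranges": {"range": [{"begin": 7777, "end": 7777}]}},
    ]
    call = {
        "framework_id": FRAMEWORK_ID,
        "type": "ACCEPT",
        "accept": {
            "offer_ids": [{"value": offer_id}],
            "operations": [{
                "type": "LAUNCH_GROUP",
                "launch_group": {
                    # LAUNCH_GROUP runs the group under an explicit executor;
                    # the default executor typically gets its own small
                    # cpu/mem slice out of the same offer.
                    "executor": {
                        "type": "DEFAULT",
                        "executor_id": {"value": "faust-executor"},
                        "framework_id": FRAMEWORK_ID,
                        "resources": [
                            {"name": "cpus", "type": "SCALAR", "scalar": {"value": 0.1}},
                            {"name": "mem", "type": "SCALAR", "scalar": {"value": 32}},
                        ],
                    },
                    "task_group": {
                        "tasks": [{
                            "name": "faust",
                            "task_id": {"value": "faust-task-1"},
                            "agent_id": {"value": agent_id},
                            "resources": task_resources,
                        }],
                    },
                },
            }],
            "filters": {"refuse_seconds": 5.0},
        },
    }
    resp = requests.post(MASTER + "/api/v1/scheduler", json=call,
                         headers={"Mesos-Stream-Id": STREAM_ID})
    resp.raise_for_status()  # the master replies 202 Accepted to a valid call

The Mesos-Stream-Id header comes from the SUBSCRIBE response and has to be
attached to every later call made on behalf of that subscription.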
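
Acknowledging a status update such as the TASK_FAILED above works the same
way, echoing the update's uuid back to the master. A minimal sketch,
continuing the example above (same requests import and the same MASTER,
FRAMEWORK_ID, and STREAM_ID placeholders):

def acknowledge(agent_id, task_id, uuid):
    # 'uuid' is status.uuid from the UPDATE event, echoed back unchanged
    # (it is base64-encoded in the JSON representation).
    call = {
        "framework_id": FRAMEWORK_ID,
        "type": "ACKNOWLEDGE",
        "acknowledge": {
            "agent_id": {"value": agent_id},
            "task_id": {"value": task_id},
            "uuid": uuid,
        },
    }
    resp = requests.post(MASTER + "/api/v1/scheduler", json=call,
                         headers={"Mesos-Stream-Id": STREAM_ID})
    resp.raise_for_status()

Only updates that carry a uuid need (and should get) an acknowledgement.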

