Hi,

Mesos: 1.7.1

I'm trying to debug an issue where, if I launch a task using the
LAUNCH_GROUP method and the task fails to start, the mesos master
crashes. I am using a custom framework I've built on the HTTP
Scheduler API.

When my framework receives an offer, I respond with an ACCEPT call
containing this JSON:

https://gist.github.com/nemosupremo/3b23c4e1ca0ab241376aa5b975993270
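
In case the gist is inaccessible, the ACCEPT call is shaped roughly
like this (a trimmed Python sketch, not the exact gist contents: the
framework/offer/agent IDs, image name and stream id are placeholders,
and the resource values just mirror the ones in the crash log below):

    import requests

    MASTER = "http://master03:5050/api/v1/scheduler"  # placeholder URL

    accept = {
        "framework_id": {"value": "<framework-id>"},
        "type": "ACCEPT",
        "accept": {
            "offer_ids": [{"value": "<offer-id>"}],
            "operations": [{
                "type": "LAUNCH_GROUP",
                "launch_group": {
                    # LAUNCH_GROUP takes an executor plus a task group.
                    "executor": {
                        "type": "DEFAULT",
                        "executor_id": {"value": "faust-executor"},
                        "framework_id": {"value": "<framework-id>"},
                        "resources": [
                            {"name": "cpus", "type": "SCALAR", "scalar": {"value": 0.1}},
                            {"name": "mem", "type": "SCALAR", "scalar": {"value": 32}},
                        ],
                    },
                    "task_group": {"tasks": [{
                        "name": "faust",
                        "task_id": {"value": "faust.1"},
                        "agent_id": {"value": "<agent-id-from-offer>"},
                        "resources": [
                            {"name": "cpus", "type": "SCALAR", "scalar": {"value": 0.1}},
                            {"name": "mem", "type": "SCALAR", "scalar": {"value": 64}},
                            {"name": "disk", "type": "SCALAR", "scalar": {"value": 1}},
                            {"name": "ports", "type": "RANGES",
                             "ranges": {"range": [{"begin": 7777, "end": 7777}]}},
                        ],
                        # With the default executor the task runs under the
                        # mesos containerizer, so the docker image goes in a
                        # MESOS-type ContainerInfo.
                        "container": {
                            "type": "MESOS",
                            "mesos": {"image": {"type": "DOCKER",
                                                "docker": {"name": "<image>"}}},
                        },
                    }]},
                },
            }],
            "filters": {"refuse_seconds": 5.0},
        },
    }

    # Calls other than SUBSCRIBE go on a separate connection and carry the
    # stream id returned by the SUBSCRIBE response.
    requests.post(MASTER, json=accept,
                  headers={"Mesos-Stream-Id": "<stream-id>"})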

I then receive the following UPDATE events:

TASK_STARTING
TASK_RUNNING
TASK_FAILED
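
Each of those arrives as an UPDATE event on the subscription stream;
the TASK_FAILED one looks roughly like this (sketch with placeholder
task id and uuid, the agent id is the one from the crash log):

    # Rough shape of the UPDATE event carrying TASK_FAILED.
    update_event = {
        "type": "UPDATE",
        "update": {
            "status": {
                "task_id": {"value": "faust.1"},
                "agent_id": {"value": "643078ba-8cb8-4582-b9c3-345d602506c8-S0"},
                "state": "TASK_FAILED",
                "source": "SOURCE_EXECUTOR",
                # This uuid has to be echoed back in an ACKNOWLEDGE call,
                # otherwise the status update is re-delivered.
                "uuid": "<base64-encoded-uuid>",
            }
        }
    }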

My framework then immediately tries to relaunch the task on the next
OFFERS event:

https://gist.github.com/nemosupremo/2b02443241c3bd002f04be034d8e64f7
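
The acknowledgement I send for each status update is the standard
ACKNOWLEDGE call, roughly (again a sketch, not my exact code; the
framework id and uuid are placeholders, with the uuid taken straight
from the UPDATE event):

    import requests

    MASTER = "http://master03:5050/api/v1/scheduler"  # placeholder URL

    acknowledge = {
        "framework_id": {"value": "<framework-id>"},
        "type": "ACKNOWLEDGE",
        "acknowledge": {
            "agent_id": {"value": "643078ba-8cb8-4582-b9c3-345d602506c8-S0"},
            "task_id": {"value": "faust.1"},
            # uuid copied unchanged from the TASK_FAILED status update.
            "uuid": "<base64-encoded-uuid-from-the-update>",
        },
    }

    requests.post(MASTER, json=acknowledge,
                  headers={"Mesos-Stream-Id": "<stream-id>"})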

But sometime between receiving that event and acknowledging the
TASK_FAILED update, the mesos master crashes with:

Feb 28 21:34:02 master03 mesos-master[7124]: F0228 21:34:02.118693  7142
sorter.hpp:357] Check failed: resources.at(slaveId).contains(toRemove)
Resources disk(allocated: faust)(reservations: [(STATIC,faust)]):1;
cpus(allocated: faust)(reservations: [(STATIC,faust)]):0.1; mem(allocated:
faust)(reservations: [(STATIC,faust)]):64 at agent
643078ba-8cb8-4582-b9c3-345d602506c8-S0 does not contain cpus(allocated:
faust)(reservations: [(STATIC,faust)]):0.1; mem(allocated:
faust)(reservations: [(STATIC,faust)]):64; disk(allocated:
faust)(reservations: [(STATIC,faust)]):1; ports(allocated:
faust)(reservations: [(STATIC,faust)]):[7777-7777]
Feb 28 21:34:02 master03 mesos-master[7124]: *** Check failure stack trace:
***
Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd935e48d
google::LogMessage::Fail()
Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd9360240
google::LogMessage::SendToLog()
Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd935e073
google::LogMessage::Flush()
Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd9360c69
google::LogMessageFatal::~LogMessageFatal()
Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd83d85f8
mesos::internal::master::allocator::DRFSorter::unallocated()
Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd83a78af
mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::untrackAllocatedResources()
Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd83ba281
mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::recoverResources()
Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd92a6631
process::ProcessBase::consume()
Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd92c878a
process::ProcessManager::resume()
Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd92cc4d6
_ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd6289c80
(unknown)
Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd5da56ba
start_thread
Feb 28 21:34:02 master03 mesos-master[7124]:     @     0x7f1fd5adb41d
(unknown)
Feb 28 21:34:02 master03 systemd[1]: mesos-master.service: Main process
exited, code=killed, status=6/ABRT
Feb 28 21:34:02 master03 systemd[1]: mesos-master.service: Unit entered
failed state.
Feb 28 21:34:02 master03 systemd[1]: mesos-master.service: Failed with
result 'signal'.

The entire process works with the older LAUNCH API (for some reason
the docker task crashes with filesystem permission issues when using
LAUNCH_GROUP).
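
For reference, the shape of the older LAUNCH operation in that working
path is roughly (sketch; task name, ids, image and resource values are
placeholders):

    # Same ACCEPT body as above, but with a LAUNCH operation:
    # a bare list of TaskInfos, no executor or task group.
    launch_operation = {
        "type": "LAUNCH",
        "launch": {
            "task_infos": [{
                "name": "faust",
                "task_id": {"value": "faust.1"},
                "agent_id": {"value": "<agent-id-from-offer>"},
                "resources": [
                    {"name": "cpus", "type": "SCALAR", "scalar": {"value": 0.1}},
                    {"name": "mem", "type": "SCALAR", "scalar": {"value": 64}},
                ],
                # Here the docker containerizer can run the image directly.
                "container": {"type": "DOCKER",
                              "docker": {"image": "<image>"}},
            }]
        }
    }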
