We recently upgraded our Mesos cluster from version 1.3 to 1.5, and since then have been getting periodic master crashes due to this error:
Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: F0205 15:53:57.385118 8434 hierarchical.cpp:2630] Check failed: reservationScalarQuantities.contains(role) Full stack trace is at the end of this email. When the master fails, we automatically restart it and it rejoins the cluster just fine. I did some initial searching and was unable to find any existing bug reports or other people experiencing this issue. We run a cluster of 3 masters, and see crashes on all 3 instances. Hope to get some guidance on what is going on and/or where to start looking for more information. Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e9170a7d google::LogMessage::Fail() Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e9172830 google::LogMessage::SendToLog() Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e9170663 google::LogMessage::Flush() Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e9173259 google::LogMessageFatal::~LogMessageFatal() Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e8443cbd mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::untrackReservations() Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e8448fcd mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::removeSlave() Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e90c4f11 process::ProcessBase::consume() Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e90dea4a process::ProcessManager::resume() Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e90e25d6 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e6700c80 (unknown) Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e5f136ba start_thread Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e5c4941d (unknown)