Thanks for reporting this; we can help investigate it with you in JIRA.

On Tue, Feb 5, 2019 at 5:40 PM Jeff Pollard <jeff.poll...@gmail.com> wrote:
> Thanks for the info. I did find the "Removed agent" line as you suspected,
> but not much else in the logging looked promising. I opened a JIRA to track
> from here on out: https://issues.apache.org/jira/browse/MESOS-9555.
>
> On Tue, Feb 5, 2019 at 2:03 PM Joseph Wu <jos...@mesosphere.io> wrote:
>
>> From the stack, it looks like the master is attempting to remove an agent
>> from the master's in-memory state. In the master's logs, you should find a
>> line shortly before the exit, like:
>>
>> <timestamp> master.cpp:nnnn] Removed agent <ID of agent>: <reason>
>>
>> The agent's ID should at least give you some pointer to which agent is
>> causing the problem. Feel free to create a JIRA
>> (https://issues.apache.org/jira/) with any information you can glean.
>> This particular type of failure, a CHECK failure, means some invariant has
>> been violated, which usually means we missed a corner case.
>>
>> On Tue, Feb 5, 2019 at 12:04 PM Jeff Pollard <jeff.poll...@gmail.com> wrote:
>>
>>> We recently upgraded our Mesos cluster from version 1.3 to 1.5, and
>>> since then we have been getting periodic master crashes due to this error:
>>>
>>> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: F0205 15:53:57.385118 8434 hierarchical.cpp:2630] Check failed: reservationScalarQuantities.contains(role)
>>>
>>> The full stack trace is at the end of this email. When the master fails, we
>>> automatically restart it and it rejoins the cluster just fine. I did some
>>> initial searching and was unable to find any existing bug reports or other
>>> people experiencing this issue. We run a cluster of 3 masters and see
>>> crashes on all 3 instances.
>>>
>>> Hope to get some guidance on what is going on and/or where to start
>>> looking for more information.
>>>
>>> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e9170a7d google::LogMessage::Fail()
>>> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e9172830 google::LogMessage::SendToLog()
>>> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e9170663 google::LogMessage::Flush()
>>> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e9173259 google::LogMessageFatal::~LogMessageFatal()
>>> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e8443cbd mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::untrackReservations()
>>> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e8448fcd mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::removeSlave()
>>> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e90c4f11 process::ProcessBase::consume()
>>> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e90dea4a process::ProcessManager::resume()
>>> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e90e25d6 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
>>> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e6700c80 (unknown)
>>> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e5f136ba start_thread
>>> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e5c4941d (unknown)
>>