Hi, I'm trying to understand a chain of events that occurred this weekend to see if I'm doing something wrong in my custom framework.
The cluster in question had a single Mesos master and fewer than 10 slaves. At some point the master hit a fatal error and apparently respawned:

Log file created at: 2015/09/12 20:42:18
Running on machine: ip-172-31-11-45
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
F0912 20:42:18.452498 2832 sorter.cpp:213] Check failed: total.resources.contains(slaveId)

More from syslog:

Sep 12 20:42:18 ip-172-31-11-45 mesos-master[2811]: @ 0x7ff0c08ed8dd google::LogMessage::Fail()
Sep 12 20:42:18 ip-172-31-11-45 mesos-master[2811]: @ 0x7ff0c08ef71d google::LogMessage::SendToLog()
Sep 12 20:42:18 ip-172-31-11-45 mesos-master[2811]: @ 0x7ff0c08ed4cc google::LogMessage::Flush()
Sep 12 20:42:18 ip-172-31-11-45 mesos-master[2811]: @ 0x7ff0c08f0019 google::LogMessageFatal::~LogMessageFatal()
Sep 12 20:42:18 ip-172-31-11-45 mesos-master[2811]: @ 0x7ff0c03d6c40 mesos::internal::master::allocator::DRFSorter::remove()
Sep 12 20:42:18 ip-172-31-11-45 mesos-master[2811]: @ 0x7ff0c02ba2ec mesos::internal::master::allocator::HierarchicalAllocatorProcess<>::removeFramework()
Sep 12 20:42:18 ip-172-31-11-45 mesos-master[2811]: @ 0x7ff0c089f2c9 process::ProcessManager::resume()
Sep 12 20:42:18 ip-172-31-11-45 mesos-master[2811]: @ 0x7ff0c089f5bf process::schedule()
Sep 12 20:42:18 ip-172-31-11-45 mesos-master[2811]: @ 0x7ff0bf02c182 start_thread
Sep 12 20:42:18 ip-172-31-11-45 mesos-master[2811]: @ 0x7ff0bed5947d (unknown)
Sep 12 20:42:42 ip-172-31-11-45 kernel: [9183384.713130] init: mesos-master main process (2811) killed by ABRT signal
Sep 12 20:42:42 ip-172-31-11-45 kernel: [9183384.713146] init: mesos-master main process ended, respawning

The new master process was elected leader, and that's when my problems began. All of the tasks on the cluster were logged as "possibly orphaned", and while they weren't stopped, they seemed to get disconnected from the Mesos master. (I'm not sure exactly what "disconnected" means here; the symptoms are that the tasks' entries in Mesos-DNS were lost and I no longer saw them listed on the landing page of the Mesos master UI, even though the processes continued to run on the slaves.) Here is the log of the master restarting and noting the possibly orphaned tasks:

Sep 12 20:42:42 ip-172-31-11-45 mesos-master[10672]: I0912 20:42:42.451133 10692 master.hpp:159] Adding task 0 with resources cpus(service):4; mem(service):50000 on slave 20150908-084257-755703724-5050-2811-S0 (ip-172-31-19-210.ec2.internal)
...
Sep 12 20:42:42 ip-172-31-11-45 mesos-master[10672]: W0912 20:42:42.451499 10692 master.cpp:4926] Possibly orphaned task 0 of framework 20150908-084257-755703724-5050-2811-0006 running on slave 20150908-084257-755703724-5050-2811-S0 at slave(1)@172.31.19.210:5051 (ip-172-31-19-210.ec2.internal)
Sep 12 20:42:42 ip-172-31-11-45 mesos-master[10672]: I0912 20:42:42.451741 10692 master.cpp:3597] Re-registered slave 20150908-084257-755703724-5050-2811-S0 at slave(1)@172.31.19.210:5051 (ip-172-31-19-210.ec2.internal) with cpus(service):16; mem(service):63395; disk(service):35056; ports(service):[31000-32000]
Sep 12 20:42:42 ip-172-31-11-45 mesos-master[10672]: I0912 20:42:42.451797 10692 master.cpp:3625] Sending updated checkpointed resources to slave 20150908-084257-755703724-5050-2811-S0 at slave(1)@172.31.19.210:5051 (ip-172-31-19-210.ec2.internal)
Sep 12 20:42:42 ip-172-31-11-45 mesos-master[10672]: I0912 20:42:42.451910 10692 hierarchical.hpp:528] Added slave 20150908-084257-755703724-5050-2811-S0 (ip-172-31-19-210.ec2.internal) with cpus(service):16; mem(service):63395; disk(service):35056; ports(service):[31000-32000] (allocated: cpus(service):6.25; mem(service):52432)

Here are the slave's logs from just before and after the master respawn:
Sep 12 20:42:18 ip-172-31-19-210 mesos-slave[54699]: I0912 20:42:18.450093 54718 slave.cpp:1946] Asked to shut down framework 20150908-084257-755703724-5050-2811-0041 by master@172.31.11.45:5050
Sep 12 20:42:18 ip-172-31-19-210 mesos-slave[54699]: W0912 20:42:18.450150 54718 slave.cpp:1961] Cannot shut down unknown framework 20150908-084257-755703724-5050-2811-0041
Sep 12 20:42:28 ip-172-31-19-210 mesos-slave[54699]: I0912 20:42:28.003684 54713 detector.cpp:138] Detected a new leader: None
Sep 12 20:42:28 ip-172-31-19-210 mesos-slave[54699]: I0912 20:42:28.003851 54720 slave.cpp:677] Lost leading master
Sep 12 20:42:28 ip-172-31-19-210 mesos-slave[54699]: I0912 20:42:28.003882 54720 slave.cpp:720] Detecting new master
Sep 12 20:42:28 ip-172-31-19-210 mesos-slave[54699]: I0912 20:42:28.003856 54715 status_update_manager.cpp:176] Pausing sending status updates
Sep 12 20:42:30 ip-172-31-19-210 mesos-slave[54699]: I0912 20:42:30.451076 54720 slave.cpp:4179] Querying resource estimator for oversubscribable resources

I'm not sure if this is relevant, but there was some logging from the framework scheduler that showed "Shutdown failed" messages I hadn't seen before:

I0908 10:15:37.936308 856 exec.cpp:132] Version: 0.23.0
I0908 10:15:37.938036 858 exec.cpp:206] Executor registered on slave 20150825-081928-755703724-5050-18248-S0
I0908 10:15:38.398712 873 sched.cpp:157] Version: 0.23.0
I0908 10:15:38.400921 891 sched.cpp:254] New master detected at master@172.31.11.45:5050
I0908 10:15:38.401202 891 sched.cpp:264] No credentials provided. Attempting to register without authentication
I0908 10:15:38.402683 891 sched.cpp:448] Framework registered with 20150908-084257-755703724-5050-2811-0000
E0909 18:22:57.000236 865 socket.hpp:107] Shutdown failed on fd=10: Transport endpoint is not connected [107]
E0909 18:23:26.991600 898 socket.hpp:107] Shutdown failed on fd=13: Transport endpoint is not connected [107]
E0909 20:15:44.286170 898 socket.hpp:107] Shutdown failed on fd=13: Transport endpoint is not connected [107]

Is there something I need to do so that tasks don't get orphaned and disconnected from their schedulers after failures? Is there something I can write in my custom framework that would let me reconnect the scheduler to its tasks?

Thanks,
Mike
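P.S. To make the question concrete, here is a minimal sketch of what I *think* the fix looks like on my side: set a non-zero failover_timeout on FrameworkInfo and persist the FrameworkID handed back at registration, so a restarted scheduler can re-register with the same ID and reattach to its tasks. With the real bindings this would be a mesos_pb2.FrameworkInfo protobuf; a plain dict stands in here so the logic is self-contained, and the state-file path is made up. Please correct me if this isn't the right approach.

```python
import json
import os
import tempfile

# Hypothetical location where the scheduler persists its FrameworkID
# between restarts (in practice this would go somewhere durable,
# e.g. ZooKeeper or a file on reliable storage).
STATE_FILE = os.path.join(tempfile.gettempdir(), "my_framework_state.json")

def save_framework_id(framework_id):
    """Persist the FrameworkID the master hands us in registered()."""
    with open(STATE_FILE, "w") as f:
        json.dump({"framework_id": framework_id}, f)

def load_framework_id():
    """Return the FrameworkID saved by a previous run, or None."""
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f).get("framework_id")
    return None

def build_framework_info():
    """Build the FrameworkInfo fields that (I believe) matter for failover.

    A dict stands in for mesos_pb2.FrameworkInfo here; the field names
    mirror the protobuf fields of the same name.
    """
    info = {
        "name": "my-framework",
        "user": "",  # let Mesos fill in the current user
        # Keep the framework's tasks running for up to a week while the
        # scheduler is disconnected, instead of the default of 0, which
        # (as I understand it) lets the master tear tasks down as soon
        # as the scheduler fails over.
        "failover_timeout": 7 * 24 * 3600,
        # Ask slaves to checkpoint framework state so tasks also survive
        # slave restarts.
        "checkpoint": True,
    }
    saved = load_framework_id()
    if saved is not None:
        # Re-registering with the previous FrameworkID is what lets a
        # new scheduler instance reattach to its still-running tasks.
        info["id"] = saved
    return info
```

In the real scheduler, save_framework_id() would be called from the registered() callback, and build_framework_info() would feed the driver at startup.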

