Hi, I'm trying to understand a chain of events that occurred this weekend to see if I'm doing something wrong in my custom framework.
The cluster in question had a single Mesos master and fewer than 10 slaves. At some point the master hit a fatal error and apparently respawned:

Log file created at: 2015/09/12 20:42:18
Running on machine: ip-172-31-11-45
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
F0912 20:42:18.452498 2832 sorter.cpp:213] Check failed: total.resources.contains(slaveId)

More from syslog:

Sep 12 20:42:18 ip-172-31-11-45 mesos-master[2811]: @ 0x7ff0c08ed8dd google::LogMessage::Fail()
Sep 12 20:42:18 ip-172-31-11-45 mesos-master[2811]: @ 0x7ff0c08ef71d google::LogMessage::SendToLog()
Sep 12 20:42:18 ip-172-31-11-45 mesos-master[2811]: @ 0x7ff0c08ed4cc google::LogMessage::Flush()
Sep 12 20:42:18 ip-172-31-11-45 mesos-master[2811]: @ 0x7ff0c08f0019 google::LogMessageFatal::~LogMessageFatal()
Sep 12 20:42:18 ip-172-31-11-45 mesos-master[2811]: @ 0x7ff0c03d6c40 mesos::internal::master::allocator::DRFSorter::remove()
Sep 12 20:42:18 ip-172-31-11-45 mesos-master[2811]: @ 0x7ff0c02ba2ec mesos::internal::master::allocator::HierarchicalAllocatorProcess<>::removeFramework()
Sep 12 20:42:18 ip-172-31-11-45 mesos-master[2811]: @ 0x7ff0c089f2c9 process::ProcessManager::resume()
Sep 12 20:42:18 ip-172-31-11-45 mesos-master[2811]: @ 0x7ff0c089f5bf process::schedule()
Sep 12 20:42:18 ip-172-31-11-45 mesos-master[2811]: @ 0x7ff0bf02c182 start_thread
Sep 12 20:42:18 ip-172-31-11-45 mesos-master[2811]: @ 0x7ff0bed5947d (unknown)
Sep 12 20:42:42 ip-172-31-11-45 kernel: [9183384.713130] init: mesos-master main process (2811) killed by ABRT signal
Sep 12 20:42:42 ip-172-31-11-45 kernel: [9183384.713146] init: mesos-master main process ended, respawning

The new master process was elected leader, and that's when my problems began. All of the tasks on the cluster were logged as "possibly orphaned", and while they weren't stopped, they seemed to get disconnected from the Mesos master. (I'm not sure exactly what "disconnected" means here; the symptoms are that the tasks' entries in Mesos-DNS were lost and I no longer saw them listed on the landing page of the Mesos master UI, even though the processes continued to run on the slaves.) Here is the log of the master restarting and noting the possibly orphaned tasks:

Sep 12 20:42:42 ip-172-31-11-45 mesos-master[10672]: I0912 20:42:42.451133 10692 master.hpp:159] Adding task 0 with resources cpus(service):4; mem(service):50000 on slave 20150908-084257-755703724-5050-2811-S0 (ip-172-31-19-210.ec2.internal)
...
Sep 12 20:42:42 ip-172-31-11-45 mesos-master[10672]: W0912 20:42:42.451499 10692 master.cpp:4926] Possibly orphaned task 0 of framework 20150908-084257-755703724-5050-2811-0006 running on slave 20150908-084257-755703724-5050-2811-S0 at slave(1)@172.31.19.210:5051 (ip-172-31-19-210.ec2.internal)
Sep 12 20:42:42 ip-172-31-11-45 mesos-master[10672]: I0912 20:42:42.451741 10692 master.cpp:3597] Re-registered slave 20150908-084257-755703724-5050-2811-S0 at slave(1)@172.31.19.210:5051 (ip-172-31-19-210.ec2.internal) with cpus(service):16; mem(service):63395; disk(service):35056; ports(service):[31000-32000]
Sep 12 20:42:42 ip-172-31-11-45 mesos-master[10672]: I0912 20:42:42.451797 10692 master.cpp:3625] Sending updated checkpointed resources to slave 20150908-084257-755703724-5050-2811-S0 at slave(1)@172.31.19.210:5051 (ip-172-31-19-210.ec2.internal)
Sep 12 20:42:42 ip-172-31-11-45 mesos-master[10672]: I0912 20:42:42.451910 10692 hierarchical.hpp:528] Added slave 20150908-084257-755703724-5050-2811-S0 (ip-172-31-19-210.ec2.internal) with cpus(service):16; mem(service):63395; disk(service):35056; ports(service):[31000-32000] (allocated: cpus(service):6.25; mem(service):52432)

Here are the slave's logs from just before and after the master respawn:
Sep 12 20:42:18 ip-172-31-19-210 mesos-slave[54699]: I0912 20:42:18.450093 54718 slave.cpp:1946] Asked to shut down framework 20150908-084257-755703724-5050-2811-0041 by master@172.31.11.45:5050
Sep 12 20:42:18 ip-172-31-19-210 mesos-slave[54699]: W0912 20:42:18.450150 54718 slave.cpp:1961] Cannot shut down unknown framework 20150908-084257-755703724-5050-2811-0041
Sep 12 20:42:28 ip-172-31-19-210 mesos-slave[54699]: I0912 20:42:28.003684 54713 detector.cpp:138] Detected a new leader: None
Sep 12 20:42:28 ip-172-31-19-210 mesos-slave[54699]: I0912 20:42:28.003851 54720 slave.cpp:677] Lost leading master
Sep 12 20:42:28 ip-172-31-19-210 mesos-slave[54699]: I0912 20:42:28.003882 54720 slave.cpp:720] Detecting new master
Sep 12 20:42:28 ip-172-31-19-210 mesos-slave[54699]: I0912 20:42:28.003856 54715 status_update_manager.cpp:176] Pausing sending status updates
Sep 12 20:42:30 ip-172-31-19-210 mesos-slave[54699]: I0912 20:42:30.451076 54720 slave.cpp:4179] Querying resource estimator for oversubscribable resources

I'm not sure if this is relevant, but there was some logging from the framework scheduler that showed "Shutdown failed" messages I hadn't seen before:

I0908 10:15:37.936308 856 exec.cpp:132] Version: 0.23.0
I0908 10:15:37.938036 858 exec.cpp:206] Executor registered on slave 20150825-081928-755703724-5050-18248-S0
I0908 10:15:38.398712 873 sched.cpp:157] Version: 0.23.0
I0908 10:15:38.400921 891 sched.cpp:254] New master detected at master@172.31.11.45:5050
I0908 10:15:38.401202 891 sched.cpp:264] No credentials provided. Attempting to register without authentication
I0908 10:15:38.402683 891 sched.cpp:448] Framework registered with 20150908-084257-755703724-5050-2811-0000
E0909 18:22:57.000236 865 socket.hpp:107] Shutdown failed on fd=10: Transport endpoint is not connected [107]
E0909 18:23:26.991600 898 socket.hpp:107] Shutdown failed on fd=13: Transport endpoint is not connected [107]
E0909 20:15:44.286170 898 socket.hpp:107] Shutdown failed on fd=13: Transport endpoint is not connected [107]

Is there something I need to do so that tasks don't get orphaned and disconnected from their schedulers after failures? Is there something I can write in my custom framework that would let me reconnect the scheduler to its tasks?

Thanks,
Mike
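P.S. To make the question concrete, here is a minimal sketch of what I *think* the fix looks like on my side: set a non-zero failover_timeout on FrameworkInfo and persist the FrameworkID handed back at registration, so a restarted scheduler can re-register with the same ID and reattach to its tasks. With the real bindings this would be a mesos_pb2.FrameworkInfo protobuf; a plain dict stands in here so the logic is self-contained, and the state-file path is made up. Please correct me if this isn't the right approach.

```python
import json
import os
import tempfile

# Hypothetical location where the scheduler persists its FrameworkID
# between restarts (in practice this would go somewhere durable,
# e.g. ZooKeeper or a file on reliable storage).
STATE_FILE = os.path.join(tempfile.gettempdir(), "my_framework_state.json")

def save_framework_id(framework_id):
    """Persist the FrameworkID the master hands us in registered()."""
    with open(STATE_FILE, "w") as f:
        json.dump({"framework_id": framework_id}, f)

def load_framework_id():
    """Return the FrameworkID saved by a previous run, or None."""
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f).get("framework_id")
    return None

def build_framework_info():
    """Build the FrameworkInfo fields that (I believe) matter for failover.

    A dict stands in for mesos_pb2.FrameworkInfo here; the field names
    mirror the protobuf fields of the same name.
    """
    info = {
        "name": "my-framework",
        "user": "",  # let Mesos fill in the current user
        # Keep the framework's tasks running for up to a week while the
        # scheduler is disconnected, instead of the default of 0, which
        # (as I understand it) lets the master tear tasks down as soon
        # as the scheduler fails over.
        "failover_timeout": 7 * 24 * 3600,
        # Ask slaves to checkpoint framework state so tasks also survive
        # slave restarts.
        "checkpoint": True,
    }
    saved = load_framework_id()
    if saved is not None:
        # Re-registering with the previous FrameworkID is what lets a
        # new scheduler instance reattach to its still-running tasks.
        info["id"] = saved
    return info
```

In the real scheduler, save_framework_id() would be called from the registered() callback, and build_framework_info() would feed the driver at startup.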

