The framework is expected to re-register with a failed-over master using the same framework ID as before. If you are using the scheduler driver, that should happen automatically.
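If your scheduler process itself restarts, it can reconnect to its running tasks by re-registering with the FrameworkID it was originally assigned. Below is a minimal sketch against the 0.23 C++ scheduler driver (the version in your logs): it checkpoints the FrameworkID on first registration and reuses it on startup, and sets a failover_timeout so the master keeps the tasks alive in the meantime. The framework name, ZooKeeper URL, and "framework.id" file are made up for illustration, and resourceOffers just declines offers as a placeholder.

// A minimal sketch, assuming the Mesos 0.23 C++ scheduler driver.
// The framework name, ZooKeeper URL, and "framework.id" checkpoint
// file below are illustrative, not anything your setup requires.
#include <fstream>
#include <string>
#include <vector>

#include <mesos/mesos.hpp>
#include <mesos/scheduler.hpp>

using namespace mesos;

const char* ID_FILE = "framework.id"; // hypothetical checkpoint location

class MyScheduler : public Scheduler
{
public:
  // First registration: checkpoint the assigned FrameworkID so a
  // restarted scheduler can come back as the same framework.
  virtual void registered(
      SchedulerDriver*, const FrameworkID& frameworkId, const MasterInfo&)
  {
    std::ofstream out(ID_FILE);
    out << frameworkId.value();
  }

  // Called when the driver re-registers with a failed-over master;
  // the driver keeps the same FrameworkID automatically.
  virtual void reregistered(SchedulerDriver*, const MasterInfo&) {}

  virtual void disconnected(SchedulerDriver*) {}

  virtual void resourceOffers(
      SchedulerDriver* driver, const std::vector<Offer>& offers)
  {
    for (size_t i = 0; i < offers.size(); i++) {
      driver->declineOffer(offers[i].id()); // placeholder: launch tasks here
    }
  }

  virtual void offerRescinded(SchedulerDriver*, const OfferID&) {}
  virtual void statusUpdate(SchedulerDriver*, const TaskStatus&) {}
  virtual void frameworkMessage(
      SchedulerDriver*, const ExecutorID&, const SlaveID&,
      const std::string&) {}
  virtual void slaveLost(SchedulerDriver*, const SlaveID&) {}
  virtual void executorLost(
      SchedulerDriver*, const ExecutorID&, const SlaveID&, int) {}
  virtual void error(SchedulerDriver*, const std::string&) {}
};

int main()
{
  FrameworkInfo framework;
  framework.set_user(""); // empty user: Mesos fills in the current user
  framework.set_name("my-framework");

  // Keep our tasks alive while the scheduler is disconnected; with the
  // default timeout of 0 the master tears the framework's tasks down
  // as soon as the scheduler fails over.
  framework.set_failover_timeout(7 * 24 * 3600.0); // one week, in seconds

  // Reuse a previously checkpointed FrameworkID, if any, so the master
  // re-attaches this scheduler to its still-running tasks.
  std::ifstream in(ID_FILE);
  std::string savedId;
  if (in >> savedId && !savedId.empty()) {
    framework.mutable_id()->set_value(savedId);
  }

  MyScheduler scheduler;
  MesosSchedulerDriver driver(
      &scheduler, framework, "zk://172.31.11.45:2181/mesos");

  return driver.run() == DRIVER_STOPPED ? 0 : 1;
}

With the ID checkpointed this way, a restarted scheduler re-registers as the same framework, and the master re-associates it with the tasks it had marked as possibly orphaned instead of shutting them down when the failover timeout expires.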
On Mon, Sep 14, 2015 at 6:38 AM, Mike Barborak <[email protected]> wrote:

> Hi,
>
> I'm trying to understand a chain of events that occurred this weekend to
> see if I'm doing something wrong in my custom framework.
>
> The cluster in question had a single Mesos master and < 10 slaves. At some
> point the master hit a fatal error and apparently respawned:
>
> Log file created at: 2015/09/12 20:42:18
> Running on machine: ip-172-31-11-45
> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
> F0912 20:42:18.452498  2832 sorter.cpp:213] Check failed: total.resources.contains(slaveId)
>
> More from syslog:
>
> Sep 12 20:42:18 ip-172-31-11-45 mesos-master[2811]: @ 0x7ff0c08ed8dd google::LogMessage::Fail()
> Sep 12 20:42:18 ip-172-31-11-45 mesos-master[2811]: @ 0x7ff0c08ef71d google::LogMessage::SendToLog()
> Sep 12 20:42:18 ip-172-31-11-45 mesos-master[2811]: @ 0x7ff0c08ed4cc google::LogMessage::Flush()
> Sep 12 20:42:18 ip-172-31-11-45 mesos-master[2811]: @ 0x7ff0c08f0019 google::LogMessageFatal::~LogMessageFatal()
> Sep 12 20:42:18 ip-172-31-11-45 mesos-master[2811]: @ 0x7ff0c03d6c40 mesos::internal::master::allocator::DRFSorter::remove()
> Sep 12 20:42:18 ip-172-31-11-45 mesos-master[2811]: @ 0x7ff0c02ba2ec mesos::internal::master::allocator::HierarchicalAllocatorProcess<>::removeFramework()
> Sep 12 20:42:18 ip-172-31-11-45 mesos-master[2811]: @ 0x7ff0c089f2c9 process::ProcessManager::resume()
> Sep 12 20:42:18 ip-172-31-11-45 mesos-master[2811]: @ 0x7ff0c089f5bf process::schedule()
> Sep 12 20:42:18 ip-172-31-11-45 mesos-master[2811]: @ 0x7ff0bf02c182 start_thread
> Sep 12 20:42:18 ip-172-31-11-45 mesos-master[2811]: @ 0x7ff0bed5947d (unknown)
> Sep 12 20:42:42 ip-172-31-11-45 kernel: [9183384.713130] init: mesos-master main process (2811) killed by ABRT signal
> Sep 12 20:42:42 ip-172-31-11-45 kernel: [9183384.713146] init: mesos-master main process ended, respawning
>
> The new master process was elected as the new leader, and then my problems
> began. All of the tasks on the cluster were logged as "possibly orphaned,"
> and while they weren't stopped, they seemed to get disconnected from the
> Mesos master. (I'm not sure exactly what I mean by disconnected. The
> symptoms were that the tasks' entries in Mesos DNS were lost and I no
> longer saw them listed on the landing page of the Mesos master UI. The
> processes continued to run on the slaves, though.) Here is the log of the
> master restarting and noting the possibly orphaned tasks:
>
> Sep 12 20:42:42 ip-172-31-11-45 mesos-master[10672]: I0912 20:42:42.451133 10692 master.hpp:159] Adding task 0 with resources cpus(service):4; mem(service):50000 on slave 20150908-084257-755703724-5050-2811-S0 (ip-172-31-19-210.ec2.internal)
> ...
> Sep 12 20:42:42 ip-172-31-11-45 mesos-master[10672]: W0912 20:42:42.451499 10692 master.cpp:4926] Possibly orphaned task 0 of framework 20150908-084257-755703724-5050-2811-0006 running on slave 20150908-084257-755703724-5050-2811-S0 at slave(1)@172.31.19.210:5051 (ip-172-31-19-210.ec2.internal)
> Sep 12 20:42:42 ip-172-31-11-45 mesos-master[10672]: I0912 20:42:42.451741 10692 master.cpp:3597] Re-registered slave 20150908-084257-755703724-5050-2811-S0 at slave(1)@172.31.19.210:5051 (ip-172-31-19-210.ec2.internal) with cpus(service):16; mem(service):63395; disk(service):35056; ports(service):[31000-32000]
> Sep 12 20:42:42 ip-172-31-11-45 mesos-master[10672]: I0912 20:42:42.451797 10692 master.cpp:3625] Sending updated checkpointed resources to slave 20150908-084257-755703724-5050-2811-S0 at slave(1)@172.31.19.210:5051 (ip-172-31-19-210.ec2.internal)
> Sep 12 20:42:42 ip-172-31-11-45 mesos-master[10672]: I0912 20:42:42.451910 10692 hierarchical.hpp:528] Added slave 20150908-084257-755703724-5050-2811-S0 (ip-172-31-19-210.ec2.internal) with cpus(service):16; mem(service):63395; disk(service):35056; ports(service):[31000-32000] (allocated: cpus(service):6.25; mem(service):52432)
>
> Here is the slave's log from just before and after the master respawn:
>
> Sep 12 20:42:18 ip-172-31-19-210 mesos-slave[54699]: I0912 20:42:18.450093 54718 slave.cpp:1946] Asked to shut down framework 20150908-084257-755703724-5050-2811-0041 by master@172.31.11.45:5050
> Sep 12 20:42:18 ip-172-31-19-210 mesos-slave[54699]: W0912 20:42:18.450150 54718 slave.cpp:1961] Cannot shut down unknown framework 20150908-084257-755703724-5050-2811-0041
> Sep 12 20:42:28 ip-172-31-19-210 mesos-slave[54699]: I0912 20:42:28.003684 54713 detector.cpp:138] Detected a new leader: None
> Sep 12 20:42:28 ip-172-31-19-210 mesos-slave[54699]: I0912 20:42:28.003851 54720 slave.cpp:677] Lost leading master
> Sep 12 20:42:28 ip-172-31-19-210 mesos-slave[54699]: I0912 20:42:28.003882 54720 slave.cpp:720] Detecting new master
> Sep 12 20:42:28 ip-172-31-19-210 mesos-slave[54699]: I0912 20:42:28.003856 54715 status_update_manager.cpp:176] Pausing sending status updates
> Sep 12 20:42:30 ip-172-31-19-210 mesos-slave[54699]: I0912 20:42:30.451076 54720 slave.cpp:4179] Querying resource estimator for oversubscribable resources
>
> I'm not sure if this is relevant, but the framework scheduler logged some
> "Shutdown failed" messages I hadn't seen before:
>
> I0908 10:15:37.936308   856 exec.cpp:132] Version: 0.23.0
> I0908 10:15:37.938036   858 exec.cpp:206] Executor registered on slave 20150825-081928-755703724-5050-18248-S0
> I0908 10:15:38.398712   873 sched.cpp:157] Version: 0.23.0
> I0908 10:15:38.400921   891 sched.cpp:254] New master detected at master@172.31.11.45:5050
> I0908 10:15:38.401202   891 sched.cpp:264] No credentials provided. Attempting to register without authentication
> I0908 10:15:38.402683   891 sched.cpp:448] Framework registered with 20150908-084257-755703724-5050-2811-0000
> E0909 18:22:57.000236   865 socket.hpp:107] Shutdown failed on fd=10: Transport endpoint is not connected [107]
> E0909 18:23:26.991600   898 socket.hpp:107] Shutdown failed on fd=13: Transport endpoint is not connected [107]
> E0909 20:15:44.286170   898 socket.hpp:107] Shutdown failed on fd=13: Transport endpoint is not connected [107]
>
> Is there something I need to do so that tasks don't get orphaned and
> disconnected from their schedulers after failures? Is there something I
> can write in my custom framework that would let me reconnect the
> scheduler to its tasks?
>
> Thanks,
> Mike

