The framework is expected to re-register with a failed-over master using the same framework ID as before. If you are using the scheduler driver, that should happen automatically.
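If your scheduler process itself restarts, it can reconnect to its running tasks by re-registering with the FrameworkID it was originally assigned. Below is a minimal sketch against the 0.23 C++ scheduler driver (the version in your logs): it checkpoints the FrameworkID on first registration and reuses it on startup, and sets a failover_timeout so the master keeps the tasks alive in the meantime. The framework name, ZooKeeper URL, and "framework.id" file are made up for illustration, and resourceOffers just declines offers as a placeholder.

// A minimal sketch, assuming the Mesos 0.23 C++ scheduler driver.
// The framework name, ZooKeeper URL, and "framework.id" checkpoint
// file below are illustrative, not anything your setup requires.
#include <fstream>
#include <string>
#include <vector>

#include <mesos/mesos.hpp>
#include <mesos/scheduler.hpp>

using namespace mesos;

const char* ID_FILE = "framework.id"; // hypothetical checkpoint location

class MyScheduler : public Scheduler
{
public:
  // First registration: checkpoint the assigned FrameworkID so a
  // restarted scheduler can come back as the same framework.
  virtual void registered(
      SchedulerDriver*, const FrameworkID& frameworkId, const MasterInfo&)
  {
    std::ofstream out(ID_FILE);
    out << frameworkId.value();
  }

  // Called when the driver re-registers with a failed-over master;
  // the driver keeps the same FrameworkID automatically.
  virtual void reregistered(SchedulerDriver*, const MasterInfo&) {}

  virtual void disconnected(SchedulerDriver*) {}

  virtual void resourceOffers(
      SchedulerDriver* driver, const std::vector<Offer>& offers)
  {
    for (size_t i = 0; i < offers.size(); i++) {
      driver->declineOffer(offers[i].id()); // placeholder: launch tasks here
    }
  }

  virtual void offerRescinded(SchedulerDriver*, const OfferID&) {}
  virtual void statusUpdate(SchedulerDriver*, const TaskStatus&) {}
  virtual void frameworkMessage(
      SchedulerDriver*, const ExecutorID&, const SlaveID&,
      const std::string&) {}
  virtual void slaveLost(SchedulerDriver*, const SlaveID&) {}
  virtual void executorLost(
      SchedulerDriver*, const ExecutorID&, const SlaveID&, int) {}
  virtual void error(SchedulerDriver*, const std::string&) {}
};

int main()
{
  FrameworkInfo framework;
  framework.set_user(""); // empty user: Mesos fills in the current user
  framework.set_name("my-framework");

  // Keep our tasks alive while the scheduler is disconnected; with the
  // default timeout of 0 the master tears the framework's tasks down
  // as soon as the scheduler fails over.
  framework.set_failover_timeout(7 * 24 * 3600.0); // one week, in seconds

  // Reuse a previously checkpointed FrameworkID, if any, so the master
  // re-attaches this scheduler to its still-running tasks.
  std::ifstream in(ID_FILE);
  std::string savedId;
  if (in >> savedId && !savedId.empty()) {
    framework.mutable_id()->set_value(savedId);
  }

  MyScheduler scheduler;
  MesosSchedulerDriver driver(
      &scheduler, framework, "zk://172.31.11.45:2181/mesos");

  return driver.run() == DRIVER_STOPPED ? 0 : 1;
}

With the ID checkpointed this way, a restarted scheduler re-registers as the same framework, and the master re-associates it with the tasks it had marked as possibly orphaned instead of shutting them down when the failover timeout expires.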
On Mon, Sep 14, 2015 at 6:38 AM, Mike Barborak <[email protected]> wrote:

> Hi,
>
> I'm trying to understand a chain of events that occurred this weekend to
> see if I'm doing something wrong in my custom framework.
>
> The cluster in question had a single Mesos master and < 10 slaves. At some
> point the master hit a fatal error and apparently respawned:
>
> Log file created at: 2015/09/12 20:42:18
> Running on machine: ip-172-31-11-45
> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
> F0912 20:42:18.452498  2832 sorter.cpp:213] Check failed: total.resources.contains(slaveId)
>
> More from syslog:
>
> Sep 12 20:42:18 ip-172-31-11-45 mesos-master[2811]: @ 0x7ff0c08ed8dd google::LogMessage::Fail()
> Sep 12 20:42:18 ip-172-31-11-45 mesos-master[2811]: @ 0x7ff0c08ef71d google::LogMessage::SendToLog()
> Sep 12 20:42:18 ip-172-31-11-45 mesos-master[2811]: @ 0x7ff0c08ed4cc google::LogMessage::Flush()
> Sep 12 20:42:18 ip-172-31-11-45 mesos-master[2811]: @ 0x7ff0c08f0019 google::LogMessageFatal::~LogMessageFatal()
> Sep 12 20:42:18 ip-172-31-11-45 mesos-master[2811]: @ 0x7ff0c03d6c40 mesos::internal::master::allocator::DRFSorter::remove()
> Sep 12 20:42:18 ip-172-31-11-45 mesos-master[2811]: @ 0x7ff0c02ba2ec mesos::internal::master::allocator::HierarchicalAllocatorProcess<>::removeFramework()
> Sep 12 20:42:18 ip-172-31-11-45 mesos-master[2811]: @ 0x7ff0c089f2c9 process::ProcessManager::resume()
> Sep 12 20:42:18 ip-172-31-11-45 mesos-master[2811]: @ 0x7ff0c089f5bf process::schedule()
> Sep 12 20:42:18 ip-172-31-11-45 mesos-master[2811]: @ 0x7ff0bf02c182 start_thread
> Sep 12 20:42:18 ip-172-31-11-45 mesos-master[2811]: @ 0x7ff0bed5947d (unknown)
> Sep 12 20:42:42 ip-172-31-11-45 kernel: [9183384.713130] init: mesos-master main process (2811) killed by ABRT signal
> Sep 12 20:42:42 ip-172-31-11-45 kernel: [9183384.713146] init: mesos-master main process ended, respawning
>
> The new master process was elected as the new leader, and then my problems
> began. All of the tasks on the cluster were logged as "possibly orphaned,"
> and while they weren't stopped, they seemed to get disconnected from the
> Mesos master. (I'm not sure exactly what I mean by disconnected. The
> symptoms were that the tasks' entries in Mesos DNS were lost and I no
> longer saw them listed on the landing page of the Mesos master UI. The
> processes continued to run on the slaves, though.) Here is the log of the
> master restarting and noting the possibly orphaned tasks:
>
> Sep 12 20:42:42 ip-172-31-11-45 mesos-master[10672]: I0912 20:42:42.451133 10692 master.hpp:159] Adding task 0 with resources cpus(service):4; mem(service):50000 on slave 20150908-084257-755703724-5050-2811-S0 (ip-172-31-19-210.ec2.internal)
> ...
> Sep 12 20:42:42 ip-172-31-11-45 mesos-master[10672]: W0912 20:42:42.451499 10692 master.cpp:4926] Possibly orphaned task 0 of framework 20150908-084257-755703724-5050-2811-0006 running on slave 20150908-084257-755703724-5050-2811-S0 at slave(1)@172.31.19.210:5051 (ip-172-31-19-210.ec2.internal)
> Sep 12 20:42:42 ip-172-31-11-45 mesos-master[10672]: I0912 20:42:42.451741 10692 master.cpp:3597] Re-registered slave 20150908-084257-755703724-5050-2811-S0 at slave(1)@172.31.19.210:5051 (ip-172-31-19-210.ec2.internal) with cpus(service):16; mem(service):63395; disk(service):35056; ports(service):[31000-32000]
> Sep 12 20:42:42 ip-172-31-11-45 mesos-master[10672]: I0912 20:42:42.451797 10692 master.cpp:3625] Sending updated checkpointed resources to slave 20150908-084257-755703724-5050-2811-S0 at slave(1)@172.31.19.210:5051 (ip-172-31-19-210.ec2.internal)
> Sep 12 20:42:42 ip-172-31-11-45 mesos-master[10672]: I0912 20:42:42.451910 10692 hierarchical.hpp:528] Added slave 20150908-084257-755703724-5050-2811-S0 (ip-172-31-19-210.ec2.internal) with cpus(service):16; mem(service):63395; disk(service):35056; ports(service):[31000-32000] (allocated: cpus(service):6.25; mem(service):52432)
>
> Here is the slave's log from just before and after the master respawn:
>
> Sep 12 20:42:18 ip-172-31-19-210 mesos-slave[54699]: I0912 20:42:18.450093 54718 slave.cpp:1946] Asked to shut down framework 20150908-084257-755703724-5050-2811-0041 by master@172.31.11.45:5050
> Sep 12 20:42:18 ip-172-31-19-210 mesos-slave[54699]: W0912 20:42:18.450150 54718 slave.cpp:1961] Cannot shut down unknown framework 20150908-084257-755703724-5050-2811-0041
> Sep 12 20:42:28 ip-172-31-19-210 mesos-slave[54699]: I0912 20:42:28.003684 54713 detector.cpp:138] Detected a new leader: None
> Sep 12 20:42:28 ip-172-31-19-210 mesos-slave[54699]: I0912 20:42:28.003851 54720 slave.cpp:677] Lost leading master
> Sep 12 20:42:28 ip-172-31-19-210 mesos-slave[54699]: I0912 20:42:28.003882 54720 slave.cpp:720] Detecting new master
> Sep 12 20:42:28 ip-172-31-19-210 mesos-slave[54699]: I0912 20:42:28.003856 54715 status_update_manager.cpp:176] Pausing sending status updates
> Sep 12 20:42:30 ip-172-31-19-210 mesos-slave[54699]: I0912 20:42:30.451076 54720 slave.cpp:4179] Querying resource estimator for oversubscribable resources
>
> I'm not sure if this is relevant, but the framework scheduler logged some
> "Shutdown failed" messages I hadn't seen before:
>
> I0908 10:15:37.936308   856 exec.cpp:132] Version: 0.23.0
> I0908 10:15:37.938036   858 exec.cpp:206] Executor registered on slave 20150825-081928-755703724-5050-18248-S0
> I0908 10:15:38.398712   873 sched.cpp:157] Version: 0.23.0
> I0908 10:15:38.400921   891 sched.cpp:254] New master detected at master@172.31.11.45:5050
> I0908 10:15:38.401202   891 sched.cpp:264] No credentials provided. Attempting to register without authentication
> I0908 10:15:38.402683   891 sched.cpp:448] Framework registered with 20150908-084257-755703724-5050-2811-0000
> E0909 18:22:57.000236   865 socket.hpp:107] Shutdown failed on fd=10: Transport endpoint is not connected [107]
> E0909 18:23:26.991600   898 socket.hpp:107] Shutdown failed on fd=13: Transport endpoint is not connected [107]
> E0909 20:15:44.286170   898 socket.hpp:107] Shutdown failed on fd=13: Transport endpoint is not connected [107]
>
> Is there something I need to do so that tasks don't get orphaned and
> disconnected from their schedulers after failures? Is there something I
> can write in my custom framework that would let me reconnect the
> scheduler to its tasks?
>
> Thanks,
> Mike

