Hi,

This is what I see regarding the framework in the logs on the master:

mesos-master.INFO:W0912 20:42:42.451398 10692 master.cpp:4926] Possibly orphaned task 0 of framework 20150908-084257-755703724-5050-2811-0000 running on slave 20150908-084257-755703724-5050-2811-S0 at slave(1)@172.31.19.210:5051 (ip-172-31-19-210.ec2.internal)
mesos-master.ip-172-31-11-45.invalid-user.log.INFO.20150912-204242.10672:W0912 20:42:42.451398 10692 master.cpp:4926] Possibly orphaned task 0 of framework 20150908-084257-755703724-5050-2811-0000 running on slave 20150908-084257-755703724-5050-2811-S0 at slave(1)@172.31.19.210:5051 (ip-172-31-19-210.ec2.internal)
mesos-master.ip-172-31-11-45.invalid-user.log.WARNING.20150912-204242.10672:W0912 20:42:42.451398 10692 master.cpp:4926] Possibly orphaned task 0 of framework 20150908-084257-755703724-5050-2811-0000 running on slave 20150908-084257-755703724-5050-2811-S0 at slave(1)@172.31.19.210:5051 (ip-172-31-19-210.ec2.internal)
mesos-master.WARNING:W0912 20:42:42.451398 10692 master.cpp:4926] Possibly orphaned task 0 of framework 20150908-084257-755703724-5050-2811-0000 running on slave 20150908-084257-755703724-5050-2811-S0 at slave(1)@172.31.19.210:5051 (ip-172-31-19-210.ec2.internal)

The framework itself logged this at startup:

Starting framework with previous ID value: "20150908-084257-755703724-5050-2811-0000"
Registered! ID = 20150908-084257-755703724-5050-2811-0000
No status messages in 5s. Assuming reconciliation complete.
task 0 null
Received offer 20150908-084257-755703724-5050-2811-O525 with cpus: 16.0 and mem: 63395.0
Launching task 0 using offer 20150908-084257-755703724-5050-2811-O525
Status update: task 0 is in state TASK_RUNNING

and then nothing during the reregistration.

On the slave 20150908-084257-755703724-5050-2811-S0, this is what I see at that time:

mesos-slave.INFO:I0912 20:42:15.450817 54714 slave.cpp:4179] Querying resource estimator for oversubscribable resources
mesos-slave.INFO:I0912 20:42:15.450942 54723 slave.cpp:4193] Received oversubscribable resources from the resource estimator
mesos-slave.INFO:I0912 20:42:18.450093 54718 slave.cpp:1946] Asked to shut down framework 20150908-084257-755703724-5050-2811-0041 by [email protected]:5050
mesos-slave.INFO:W0912 20:42:18.450150 54718 slave.cpp:1961] Cannot shut down unknown framework 20150908-084257-755703724-5050-2811-0041
mesos-slave.INFO:I0912 20:42:28.003684 54713 detector.cpp:138] Detected a new leader: None
mesos-slave.INFO:I0912 20:42:28.003851 54720 slave.cpp:677] Lost leading master
mesos-slave.INFO:I0912 20:42:28.003882 54720 slave.cpp:720] Detecting new master
mesos-slave.INFO:I0912 20:42:28.003856 54715 status_update_manager.cpp:176] Pausing sending status updates
mesos-slave.INFO:I0912 20:42:30.451076 54720 slave.cpp:4179] Querying resource estimator for oversubscribable resources
mesos-slave.INFO:I0912 20:42:30.451254 54728 slave.cpp:4193] Received oversubscribable resources from the resource estimator
mesos-slave.INFO:I0912 20:42:42.157058 54722 slave.cpp:3087] [email protected]:5050 exited
mesos-slave.INFO:W0912 20:42:42.157120 54722 slave.cpp:3090] Master disconnected! Waiting for a new master to be elected
mesos-slave.INFO:I0912 20:42:42.301710 54720 detector.cpp:138] Detected a new leader: (id='29')
mesos-slave.INFO:I0912 20:42:42.301834 54724 group.cpp:656] Trying to get '/mesos/info_0000000029' in ZooKeeper
mesos-slave.INFO:W0912 20:42:42.303886 54724 detector.cpp:444] Leading master [email protected]:5050 is using a Protobuf binary format when registering with ZooKeeper (info): this will be deprecated as of Mesos 0.24 (see MESOS-2340)
mesos-slave.INFO:I0912 20:42:42.303947 54724 detector.cpp:481] A new leading master ([email protected]:5050) is detected
mesos-slave.INFO:I0912 20:42:42.304061 54724 slave.cpp:684] New master detected at [email protected]:5050
mesos-slave.INFO:I0912 20:42:42.304075 54716 status_update_manager.cpp:176] Pausing sending status updates
mesos-slave.INFO:I0912 20:42:42.304178 54724 slave.cpp:709] No credentials provided. Attempting to register without authentication
mesos-slave.INFO:I0912 20:42:42.304221 54724 slave.cpp:720] Detecting new master
mesos-slave.INFO:I0912 20:42:42.453955 54723 slave.cpp:959] Re-registered with master [email protected]:5050
mesos-slave.INFO:I0912 20:42:42.454052 54723 slave.cpp:995] Forwarding total oversubscribed resources
mesos-slave.INFO:I0912 20:42:42.454056 54720 status_update_manager.cpp:183] Resuming sending status updates
mesos-slave.INFO:I0912 20:42:42.454222 54723 slave.cpp:2202] Updated checkpointed resources from to
mesos-slave.INFO:W0912 20:42:43.504453 54725 slave.cpp:2105] Ignoring updating pid for framework 20150501-182221-755703724-5050-17276-0000 because it does not exist
mesos-slave.INFO:I0912 20:42:45.451882 54722 slave.cpp:4179] Querying resource estimator for oversubscribable resources
mesos-slave.INFO:I0912 20:42:45.452005 54720 slave.cpp:4193] Received oversubscribable resources from the resource estimator

The line about “Ignoring updating pid” is confusing to me.
The framework 20150501-182221-755703724-5050-17276-0000 mentioned in that log entry is Marathon. On my cluster, framework 20150908-084257-755703724-5050-2811-0000 is started by Marathon as a command. In this case, to launch the framework, Marathon created a task called service-clara-users-mikeb-jobs-112.1406dc6b-5634-11e5-adfc-56847afe9799 on a different slave than the one whose log is shown above. These are the entries on the master about this task being an orphaned task of Marathon:

mesos-master.INFO:I0912 20:42:43.249708 10689 master.hpp:159] Adding task service-cl-users-mikeb-jobs-112.1406dc6b-5634-11e5-adfc-56847afe9799 with resources cpus(marathon):0.01; mem(marathon):200; ports(marathon):[31238-31238] on slave 20150825-081928-755703724-5050-18248-S0 (dev1.ml.com)
mesos-master.INFO:W0912 20:42:43.250097 10689 master.cpp:4926] Possibly orphaned task service-clara-users-mikeb-jobs-112.1406dc6b-5634-11e5-adfc-56847afe9799 of framework 20150501-182221-755703724-5050-17276-0000 running on slave 20150825-081928-755703724-5050-18248-S0 at slave(1)@172.31.11.45:5051 (dev1.ml.com)
mesos-master.ip-172-31-11-45.invalid-user.log.INFO.20150912-204242.10672:I0912 20:42:43.249708 10689 master.hpp:159] Adding task service-clara-users-mikeb-jobs-112.1406dc6b-5634-11e5-adfc-56847afe9799 with resources cpus(marathon):0.01; mem(marathon):200; ports(marathon):[31238-31238] on slave 20150825-081928-755703724-5050-18248-S0 (dev1.ml.com)
mesos-master.ip-172-31-11-45.invalid-user.log.INFO.20150912-204242.10672:W0912 20:42:43.250097 10689 master.cpp:4926] Possibly orphaned task service-cl-users-mikeb-jobs-112.1406dc6b-5634-11e5-adfc-56847afe9799 of framework 20150501-182221-755703724-5050-17276-0000 running on slave 20150825-081928-755703724-5050-18248-S0 at slave(1)@172.31.11.45:5051 (dev1.ml.com)
mesos-master.ip-172-31-11-45.invalid-user.log.WARNING.20150912-204242.10672:W0912 20:42:43.250097 10689 master.cpp:4926] Possibly orphaned task service-cl-users-mikeb-jobs-112.1406dc6b-5634-11e5-adfc-56847afe9799 of framework 20150501-182221-755703724-5050-17276-0000 running on slave 20150825-081928-755703724-5050-18248-S0 at slave(1)@172.31.11.45:5051 (dev1.ml.com)
mesos-master.WARNING:W0912 20:42:43.250097 10689 master.cpp:4926] Possibly orphaned task service-cl-users-mikeb-jobs-112.1406dc6b-5634-11e5-adfc-56847afe9799 of framework 20150501-182221-755703724-5050-17276-0000 running on slave 20150825-081928-755703724-5050-18248-S0 at slave(1)@172.31.11.45:5051 (dev1.ml.com)

Perhaps my problem has to do with launching the framework via a command through Marathon? I will try to reproduce this tomorrow with and without Marathon.

Thanks for looking this over.

Best,
Mike

From: Vinod Kone [mailto:[email protected]]
Sent: Monday, September 14, 2015 4:07 PM
To: [email protected]
Subject: Re: how to overcome orphaned tasks after master failure

On Mon, Sep 14, 2015 at 12:40 PM, Mike Barborak <[email protected]<mailto:[email protected]>> wrote:

Sorry for my ignorance, but what is the “scheduler driver?” My framework is based on the Java example:

Some details about the driver should be here: http://mesos.apache.org/documentation/latest/app-framework-development-guide/

Looks like you are using the driver. Can you paste the logs of the scheduler and master (related to the framework) after master failover?
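
For pulling the master log lines related to one framework out of grep output like the excerpts in this thread, a minimal Python sketch follows. It is only an illustration: the names `GLOG_RE`, `ORPHAN_RE`, and `orphaned_tasks` are hypothetical, and the patterns assume the line layout seen in the excerpts (an optional `file:` prefix added by grep, then severity letter, date, time, thread id, and source location). The same warning shows up once per log file, so the sketch collapses those repeats.

```python
import re

# Assumed layout of one grep'd log line, based on the excerpts in this thread:
#   [<file>:]<severity><MMDD> <HH:MM:SS.micros> <thread> <source>:<line>] <message>
GLOG_RE = re.compile(
    r"^(?:(?P<file>\S+?):)?"
    r"(?P<severity>[IWEF])(?P<date>\d{4}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<thread>\d+) (?P<source>[\w.]+:\d+)\] (?P<message>.*)$"
)

# Wording of the master warning as it appears in the excerpts.
ORPHAN_RE = re.compile(
    r"Possibly orphaned task (?P<task>\S+) of framework (?P<framework>\S+) "
    r"running on slave (?P<slave>\S+)"
)


def orphaned_tasks(lines, framework_id=None):
    """Return unique (time, task, framework, slave) tuples for orphan warnings.

    Copies of the same warning that differ only in the grep file prefix
    are reported once. If framework_id is given, other frameworks are skipped.
    """
    seen = set()
    result = []
    for line in lines:
        parsed = GLOG_RE.match(line.strip())
        if not parsed:
            continue  # not a log line in the assumed layout
        warning = ORPHAN_RE.search(parsed.group("message"))
        if not warning:
            continue  # some other log message
        if framework_id and warning.group("framework") != framework_id:
            continue
        key = (
            parsed.group("time"),
            warning.group("task"),
            warning.group("framework"),
            warning.group("slave"),
        )
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result


if __name__ == "__main__":
    import sys

    # Usage: grep -h "" mesos-master.* | python orphans.py [framework-id]
    wanted = sys.argv[1] if len(sys.argv) > 1 else None
    for time, task, framework, slave in orphaned_tasks(sys.stdin, wanted):
        print(time, task, framework, slave)
```

Passing a framework ID as the first argument restricts the output to that framework, which keeps the Marathon entries apart from the entries for the framework that Marathon launched.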

