RE: how to overcome orphaned tasks after master failure

Mike Barborak Mon, 14 Sep 2015 14:53:38 -0700

Hi,

This is what I see regarding the framework in the logs on the master:


mesos-master.INFO:W0912 20:42:42.451398 10692 master.cpp:4926] Possibly 
orphaned task 0 of framework 20150908-084257-755703724-5050-2811-0000 running 
on slave 20150908-084257-755703724-5050-2811-S0 at slave(1)@172.31.19.210:5051 
(ip-172-31-19-210.ec2.internal)
mesos-master.ip-172-31-11-45.invalid-user.log.INFO.20150912-204242.10672:W0912 
20:42:42.451398 10692 master.cpp:4926] Possibly orphaned task 0 of framework 
20150908-084257-755703724-5050-2811-0000 running on slave 
20150908-084257-755703724-5050-2811-S0 at slave(1)@172.31.19.210:5051 
(ip-172-31-19-210.ec2.internal)
mesos-master.ip-172-31-11-45.invalid-user.log.WARNING.20150912-204242.10672:W0912
 20:42:42.451398 10692 master.cpp:4926] Possibly orphaned task 0 of framework 
20150908-084257-755703724-5050-2811-0000 running on slave 
20150908-084257-755703724-5050-2811-S0 at slave(1)@172.31.19.210:5051 
(ip-172-31-19-210.ec2.internal)
mesos-master.WARNING:W0912 20:42:42.451398 10692 master.cpp:4926] Possibly 
orphaned task 0 of framework 20150908-084257-755703724-5050-2811-0000 running 
on slave 20150908-084257-755703724-5050-2811-S0 at slave(1)@172.31.19.210:5051 
(ip-172-31-19-210.ec2.internal)

The framework itself logged this at startup:

Starting framework with previous ID value: 
"20150908-084257-755703724-5050-2811-0000"
Registered! ID = 20150908-084257-755703724-5050-2811-0000
No status messages in 5s. Assuming reconciliation complete.
task 0 null
Received offer 20150908-084257-755703724-5050-2811-O525 with cpus: 16.0 and 
mem: 63395.0
Launching task 0 using offer 20150908-084257-755703724-5050-2811-O525
Status update: task 0 is in state TASK_RUNNING

and then nothing during the reregistration.
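
The "No status messages in 5s" line above suggests that the framework waits for a quiet period after asking for reconciliation and then treats any task it has not heard about as gone. A minimal sketch of that pattern follows, assuming status updates are delivered on a queue. The function `wait_for_quiet`, the queue, and the timings are hypothetical and use no Mesos API; this is not the framework's actual code.

```python
import queue


def wait_for_quiet(updates, quiet_seconds):
    """Drain `updates` until none arrives for `quiet_seconds`; return those seen."""
    seen = []
    while True:
        try:
            # Every update that arrives restarts the quiet-period timer.
            seen.append(updates.get(timeout=quiet_seconds))
        except queue.Empty:
            # Nothing arrived within the quiet period: assume reconciliation is done.
            return seen


# Hypothetical usage: one status update is already waiting when we start.
updates = queue.Queue()
updates.put(("0", "TASK_RUNNING"))
print(wait_for_quiet(updates, quiet_seconds=0.1))
```

If the master answers more slowly than the quiet period (for example, right after a failover), a loop like this would return an empty list and the framework would conclude that the task does not exist.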

On the slave 20150908-084257-755703724-5050-2811-S0, this is what I see at this time:

mesos-slave.INFO:I0912 20:42:15.450817 54714 slave.cpp:4179] Querying resource 
estimator for oversubscribable resources
mesos-slave.INFO:I0912 20:42:15.450942 54723 slave.cpp:4193] Received 
oversubscribable resources  from the resource estimator
mesos-slave.INFO:I0912 20:42:18.450093 54718 slave.cpp:1946] Asked to shut down 
framework 20150908-084257-755703724-5050-2811-0041 by [email protected]:5050
mesos-slave.INFO:W0912 20:42:18.450150 54718 slave.cpp:1961] Cannot shut down 
unknown framework 20150908-084257-755703724-5050-2811-0041
mesos-slave.INFO:I0912 20:42:28.003684 54713 detector.cpp:138] Detected a new 
leader: None
mesos-slave.INFO:I0912 20:42:28.003851 54720 slave.cpp:677] Lost leading master
mesos-slave.INFO:I0912 20:42:28.003882 54720 slave.cpp:720] Detecting new master
mesos-slave.INFO:I0912 20:42:28.003856 54715 status_update_manager.cpp:176] 
Pausing sending status updates
mesos-slave.INFO:I0912 20:42:30.451076 54720 slave.cpp:4179] Querying resource 
estimator for oversubscribable resources
mesos-slave.INFO:I0912 20:42:30.451254 54728 slave.cpp:4193] Received 
oversubscribable resources  from the resource estimator
mesos-slave.INFO:I0912 20:42:42.157058 54722 slave.cpp:3087] 
[email protected]:5050 exited
mesos-slave.INFO:W0912 20:42:42.157120 54722 slave.cpp:3090] Master 
disconnected! Waiting for a new master to be elected
mesos-slave.INFO:I0912 20:42:42.301710 54720 detector.cpp:138] Detected a new 
leader: (id='29')
mesos-slave.INFO:I0912 20:42:42.301834 54724 group.cpp:656] Trying to get 
'/mesos/info_0000000029' in ZooKeeper
mesos-slave.INFO:W0912 20:42:42.303886 54724 detector.cpp:444] Leading master 
[email protected]:5050 is using a Protobuf binary format when registering 
with ZooKeeper (info): this will be deprecated as of Mesos 0.24 (see 
MESOS-2340)
mesos-slave.INFO:I0912 20:42:42.303947 54724 detector.cpp:481] A new leading 
master ([email protected]:5050) is detected
mesos-slave.INFO:I0912 20:42:42.304061 54724 slave.cpp:684] New master detected 
at [email protected]:5050
mesos-slave.INFO:I0912 20:42:42.304075 54716 status_update_manager.cpp:176] 
Pausing sending status updates
mesos-slave.INFO:I0912 20:42:42.304178 54724 slave.cpp:709] No credentials 
provided. Attempting to register without authentication
mesos-slave.INFO:I0912 20:42:42.304221 54724 slave.cpp:720] Detecting new master
mesos-slave.INFO:I0912 20:42:42.453955 54723 slave.cpp:959] Re-registered with 
master [email protected]:5050
mesos-slave.INFO:I0912 20:42:42.454052 54723 slave.cpp:995] Forwarding total 
oversubscribed resources
mesos-slave.INFO:I0912 20:42:42.454056 54720 status_update_manager.cpp:183] 
Resuming sending status updates
mesos-slave.INFO:I0912 20:42:42.454222 54723 slave.cpp:2202] Updated 
checkpointed resources from  to
mesos-slave.INFO:W0912 20:42:43.504453 54725 slave.cpp:2105] Ignoring updating 
pid for framework 20150501-182221-755703724-5050-17276-0000 because it does not 
exist
mesos-slave.INFO:I0912 20:42:45.451882 54722 slave.cpp:4179] Querying resource 
estimator for oversubscribable resources
mesos-slave.INFO:I0912 20:42:45.452005 54720 slave.cpp:4193] Received 
oversubscribable resources  from the resource estimator

The line about “Ignoring updating pid” confuses me. The framework 
20150501-182221-755703724-5050-17276-0000 mentioned in that log entry is 
Marathon. On my cluster, framework 
20150908-084257-755703724-5050-2811-0000 is started by Marathon as a command. 
In this case, to launch the framework, Marathon created a task called 
service-clara-users-mikeb-jobs-112.1406dc6b-5634-11e5-adfc-56847afe9799 on a 
different slave from the one whose log is shown above. These are the master's 
log entries reporting this task as a possibly orphaned task of Marathon:

mesos-master.INFO:I0912 20:42:43.249708 10689 master.hpp:159] Adding task 
service-cl-users-mikeb-jobs-112.1406dc6b-5634-11e5-adfc-56847afe9799 with 
resources cpus(marathon):0.01; mem(marathon):200; ports(marathon):[31238-31238] 
on slave 20150825-081928-755703724-5050-18248-S0 (dev1.ml.com)
mesos-master.INFO:W0912 20:42:43.250097 10689 master.cpp:4926] Possibly 
orphaned task 
service-clara-users-mikeb-jobs-112.1406dc6b-5634-11e5-adfc-56847afe9799 of 
framework 20150501-182221-755703724-5050-17276-0000 running on slave 
20150825-081928-755703724-5050-18248-S0 at slave(1)@172.31.11.45:5051 
(dev1.ml.com)
mesos-master.ip-172-31-11-45.invalid-user.log.INFO.20150912-204242.10672:I0912 
20:42:43.249708 10689 master.hpp:159] Adding task 
service-clara-users-mikeb-jobs-112.1406dc6b-5634-11e5-adfc-56847afe9799 with 
resources cpus(marathon):0.01; mem(marathon):200; ports(marathon):[31238-31238] 
on slave 20150825-081928-755703724-5050-18248-S0 (dev1.ml.com)
mesos-master.ip-172-31-11-45.invalid-user.log.INFO.20150912-204242.10672:W0912 
20:42:43.250097 10689 master.cpp:4926] Possibly orphaned task 
service-cl-users-mikeb-jobs-112.1406dc6b-5634-11e5-adfc-56847afe9799 of 
framework 20150501-182221-755703724-5050-17276-0000 running on slave 
20150825-081928-755703724-5050-18248-S0 at slave(1)@172.31.11.45:5051 
(dev1.ml.com)
mesos-master.ip-172-31-11-45.invalid-user.log.WARNING.20150912-204242.10672:W0912
 20:42:43.250097 10689 master.cpp:4926] Possibly orphaned task 
service-cl-users-mikeb-jobs-112.1406dc6b-5634-11e5-adfc-56847afe9799 of 
framework 20150501-182221-755703724-5050-17276-0000 running on slave 
20150825-081928-755703724-5050-18248-S0 at slave(1)@172.31.11.45:5051 
(dev1.ml.com)
mesos-master.WARNING:W0912 20:42:43.250097 10689 master.cpp:4926] Possibly 
orphaned task 
service-cl-users-mikeb-jobs-112.1406dc6b-5634-11e5-adfc-56847afe9799 of 
framework 20150501-182221-755703724-5050-17276-0000 running on slave 
20150825-081928-755703724-5050-18248-S0 at slave(1)@172.31.11.45:5051 
(dev1.ml.com)
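
Because the same warning is matched in several log files (mesos-master.INFO, mesos-master.WARNING, and their timestamped counterparts), the grep output can be collapsed to one entry per task. A minimal sketch in Python, assuming each log entry has first been joined back into a single line; the regular expression is inferred from the log lines above and the function name `orphaned_tasks` is hypothetical, not part of Mesos.

```python
import re

# Pattern inferred from the "Possibly orphaned task" warnings quoted above.
ORPHAN_RE = re.compile(
    r"Possibly orphaned task (?P<task>\S+) of framework (?P<framework>\S+) "
    r"running on slave (?P<slave>\S+)"
)


def orphaned_tasks(lines):
    """Return the de-duplicated set of (framework, task, slave) triples."""
    found = set()
    for line in lines:
        # Collapse runs of whitespace left over from mail-client wrapping.
        match = ORPHAN_RE.search(" ".join(line.split()))
        if match:
            found.add(
                (match.group("framework"), match.group("task"), match.group("slave"))
            )
    return found


# The same warning as it appears in two of the log files quoted earlier.
sample = [
    "mesos-master.INFO:W0912 20:42:42.451398 10692 master.cpp:4926] Possibly "
    "orphaned task 0 of framework 20150908-084257-755703724-5050-2811-0000 running "
    "on slave 20150908-084257-755703724-5050-2811-S0 at slave(1)@172.31.19.210:5051 "
    "(ip-172-31-19-210.ec2.internal)",
    "mesos-master.WARNING:W0912 20:42:42.451398 10692 master.cpp:4926] Possibly "
    "orphaned task 0 of framework 20150908-084257-755703724-5050-2811-0000 running "
    "on slave 20150908-084257-755703724-5050-2811-S0 at slave(1)@172.31.19.210:5051 "
    "(ip-172-31-19-210.ec2.internal)",
]

for framework, task, slave in sorted(orphaned_tasks(sample)):
    print(framework, task, slave)
```

Grouping the result by framework ID makes it easier to see which orphaned tasks belong to Marathon and which belong to the framework it launched.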

Perhaps my problem has to do with launching the framework via a command through 
Marathon? I will try to reproduce this tomorrow with and without Marathon.

Thanks for looking this over.

Best,
Mike


From: Vinod Kone [mailto:[email protected]]
Sent: Monday, September 14, 2015 4:07 PM
To: [email protected]
Subject: Re: how to overcome orphaned tasks after master failure


On Mon, Sep 14, 2015 at 12:40 PM, Mike Barborak <[email protected]> wrote:
Sorry for my ignorance, but what is the “scheduler driver?” My framework is 
based on the Java example:

Some details about the driver should be here: 
http://mesos.apache.org/documentation/latest/app-framework-development-guide/

Looks like you are using the driver. Can you paste the logs of the scheduler 
and master (related to the framework) after master failover?
