Just a quick update... I was only able to get the orphans cleared by stopping mesos-slave, deleting the contents of the scratch directory, and then restarting mesos-slave.
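For anyone searching the archives later, that procedure looks roughly like the sketch below. It assumes the "scratch directory" is the agent's --work_dir (default /tmp/mesos) and a systemd-managed mesos-slave service; adjust both for your own setup. Note that wiping the work directory also discards all sandboxes and the agent's checkpointed state on that node, so the agent re-registers from scratch:

$ sudo systemctl stop mesos-slave      # or: sudo service mesos-slave stop
$ sudo rm -rf /tmp/mesos/*             # contents of the agent's --work_dir, not the directory itself
$ sudo systemctl start mesos-slave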
Thanks,
June Taylor
System Administrator, Minnesota Population Center
University of Minnesota

On Thu, Apr 7, 2016 at 12:01 PM, Vinod Kone <[email protected]> wrote:

> A task/executor is called "orphaned" if the corresponding scheduler
> doesn't register with Mesos. Is your framework scheduler running, or gone
> for good? The resources should be cleaned up if the agent (and consequently
> the master) has realized that the executor exited.
>
> Can you paste the master and agent logs for one of the orphaned
> tasks/executors (grep the log with the task/executor id)?
>
> On Thu, Apr 7, 2016 at 9:00 AM, haosdent <[email protected]> wrote:
>
>> Hmm, sorry I didn't express my idea clearly. I mean kill those orphan
>> tasks here.
>>
>> On Thu, Apr 7, 2016 at 11:57 PM, June Taylor <[email protected]> wrote:
>>
>>> Forgive my ignorance, are you literally saying I should just sigkill
>>> these instances? How will that clean up the mesos orphans?
>>>
>>> Thanks,
>>> June Taylor
>>> System Administrator, Minnesota Population Center
>>> University of Minnesota
>>>
>>> On Thu, Apr 7, 2016 at 10:44 AM, haosdent <[email protected]> wrote:
>>>
>>>> Suppose your --work_dir=/tmp/mesos. Then you could run:
>>>>
>>>> $ find /tmp/mesos -name $YOUR_EXECUTOR_ID
>>>>
>>>> That gives you a list of folders, and you can then use lsof on them.
>>>>
>>>> As an example, my executor id is "test" here:
>>>>
>>>> $ find /tmp/mesos/ -name 'test'
>>>> /tmp/mesos/0/slaves/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-S0/frameworks/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-0002/executors/test
>>>>
>>>> When I execute
>>>> lsof /tmp/mesos/0/slaves/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-S0/frameworks/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-0002/executors/test/runs/latest/
>>>> (keep in mind I append runs/latest here),
>>>> then you can see the pid list:
>>>>
>>>> COMMAND     PID USER     FD  TYPE DEVICE SIZE/OFF       NODE NAME
>>>> mesos-exe 21811 haosdent cwd  DIR    8,3        6 3221463220
>>>> /tmp/mesos/0/slaves/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-S0/frameworks/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-0003/executors/test/runs/efecb119-1019-4629-91ab-fec7724a0f11
>>>> sleep     21847 haosdent cwd  DIR    8,3        6 3221463220
>>>> /tmp/mesos/0/slaves/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-S0/frameworks/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-0003/executors/test/runs/efecb119-1019-4629-91ab-fec7724a0f11
>>>>
>>>> Kill all of them.
>>>>
>>>> On Thu, Apr 7, 2016 at 11:23 PM, June Taylor <[email protected]> wrote:
>>>>
>>>>> I do have the executor ID. Can you advise how to kill it?
>>>>>
>>>>> I have one master and three slaves. Each slave has one of these
>>>>> orphans.
>>>>>
>>>>> Thanks,
>>>>> June Taylor
>>>>> System Administrator, Minnesota Population Center
>>>>> University of Minnesota
>>>>>
>>>>> On Thu, Apr 7, 2016 at 10:14 AM, haosdent <[email protected]> wrote:
>>>>>
>>>>>> > Going to this slave I can find an executor within the mesos working
>>>>>> > directory which matches this framework ID
>>>>>>
>>>>>> The quickest way here is to use kill on the slave, if you can find the
>>>>>> mesos-executor pid. You can use lsof/fuser, or dig through the logs, to
>>>>>> find the executor pid.
>>>>>>
>>>>>> However, it still seems weird given your feedback. Do you have
>>>>>> multiple masters, and did a failover happen on your master? In that case
>>>>>> the slave could not connect to the new master and the tasks became
>>>>>> orphans.
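The find/lsof/kill suggestion above boils down to roughly the following sketch. It is only a sketch: $YOUR_EXECUTOR_ID and the /tmp/mesos work_dir are placeholders for your own values, lsof -t prints only the matching PIDs, and plain kill sends SIGTERM (escalate to SIGKILL only if an executor ignores it):

$ EXEC_DIR=$(find /tmp/mesos -type d -name "$YOUR_EXECUTOR_ID" | head -n 1)
$ lsof -t "$EXEC_DIR"/runs/latest            # PIDs holding the run directory open (e.g. as their cwd)
$ kill $(lsof -t "$EXEC_DIR"/runs/latest)    # terminate those processes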
>>>>>>
>>>>>> On Thu, Apr 7, 2016 at 11:06 PM, June Taylor <[email protected]> wrote:
>>>>>>
>>>>>>> Here is one of three orphaned tasks (first two octets of IP removed):
>>>>>>>
>>>>>>> "orphan_tasks": [
>>>>>>>     {
>>>>>>>         "executor_id": "",
>>>>>>>         "name": "Task 1",
>>>>>>>         "framework_id": "14cddded-e692-4838-9893-6e04a81481d8-0006",
>>>>>>>         "state": "TASK_RUNNING",
>>>>>>>         "statuses": [
>>>>>>>             {
>>>>>>>                 "timestamp": 1459887295.05554,
>>>>>>>                 "state": "TASK_RUNNING",
>>>>>>>                 "container_status": {
>>>>>>>                     "network_infos": [
>>>>>>>                         {
>>>>>>>                             "ip_addresses": [
>>>>>>>                                 {
>>>>>>>                                     "ip_address": "xxx.xxx.163.205"
>>>>>>>                                 }
>>>>>>>                             ],
>>>>>>>                             "ip_address": "xxx.xxx.163.205"
>>>>>>>                         }
>>>>>>>                     ]
>>>>>>>                 }
>>>>>>>             }
>>>>>>>         ],
>>>>>>>         "slave_id": "182cf09f-0843-4736-82f1-d913089d7df4-S83",
>>>>>>>         "id": "1",
>>>>>>>         "resources": {
>>>>>>>             "mem": 112640.0,
>>>>>>>             "disk": 0.0,
>>>>>>>             "cpus": 30.0
>>>>>>>         }
>>>>>>>     }
>>>>>>>
>>>>>>> Going to this slave, I can find an executor within the mesos working
>>>>>>> directory which matches this framework ID. Reviewing the stdout
>>>>>>> within indicates the program has finished its work, but it is still
>>>>>>> holding these resources open.
>>>>>>>
>>>>>>> This framework ID is not shown as Active in the main Mesos Web UI,
>>>>>>> but does show up if you display the slave's web UI.
>>>>>>>
>>>>>>> The resources consumed count towards the Idle pool, and have
>>>>>>> resulted in zero available resources for other offers.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> June Taylor
>>>>>>> System Administrator, Minnesota Population Center
>>>>>>> University of Minnesota
>>>>>>>
>>>>>>> On Thu, Apr 7, 2016 at 9:46 AM, haosdent <[email protected]> wrote:
>>>>>>>
>>>>>>>> > pyspark executors hanging around and consuming resources marked
>>>>>>>> > as Idle in mesos Web UI
>>>>>>>>
>>>>>>>> Do you have some logs about this?
>>>>>>>>
>>>>>>>> > is there an API call I can make to kill these orphans?
>>>>>>>>
>>>>>>>> As far as I know, the mesos agent tries to clean up orphan containers
>>>>>>>> when it restarts. But I'm not sure the orphans I mean here are the
>>>>>>>> same as yours.
>>>>>>>>
>>>>>>>> On Thu, Apr 7, 2016 at 10:21 PM, June Taylor <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Greetings mesos users!
>>>>>>>>>
>>>>>>>>> I am debugging an issue with pyspark executors hanging around and
>>>>>>>>> consuming resources marked as Idle in the mesos Web UI. These tasks
>>>>>>>>> also show up under the orphan_tasks key in `mesos state`.
>>>>>>>>>
>>>>>>>>> I'm first wondering how to clear them out - is there an API call I
>>>>>>>>> can make to kill these orphans? Secondly, how did this happen at all?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> June Taylor
>>>>>>>>> System Administrator, Minnesota Population Center
>>>>>>>>> University of Minnesota
>>>>>>>>
>>>>>>>> --
>>>>>>>> Best Regards,
>>>>>>>> Haosdent Huang
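The orphan_tasks snippet pasted in the thread comes from the master's state endpoint, so you can watch whether the orphans actually go away with something like the line below. This is a sketch only: <master-host> is a placeholder, 5050 is the default master port, the path may be /master/state or /master/state.json depending on the Mesos version, and it assumes jq is installed:

$ curl -s http://<master-host>:5050/master/state.json | jq '.orphan_tasks'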

