Just a quick update... I was only able to get the orphans cleared by stopping mesos-slave, deleting the contents of the scratch directory, and then restarting mesos-slave.
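For anyone searching the archives later, that procedure looks roughly like the sketch below. It assumes the "scratch directory" is the agent's --work_dir (default /tmp/mesos) and a systemd-managed mesos-slave service; adjust both for your own setup. Note that wiping the work directory also discards all sandboxes and the agent's checkpointed state on that node, so the agent re-registers from scratch:

$ sudo systemctl stop mesos-slave      # or: sudo service mesos-slave stop
$ sudo rm -rf /tmp/mesos/*             # contents of the agent's --work_dir, not the directory itself
$ sudo systemctl start mesos-slave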
Thanks,
June Taylor
System Administrator, Minnesota Population Center
University of Minnesota

On Thu, Apr 7, 2016 at 12:01 PM, Vinod Kone <[email protected]> wrote:

> A task/executor is called "orphaned" if the corresponding scheduler
> doesn't register with Mesos. Is your framework scheduler running, or gone
> for good? The resources should be cleaned up if the agent (and consequently
> the master) has realized that the executor exited.
>
> Can you paste the master and agent logs for one of the orphaned
> tasks/executors (grep the log with the task/executor id)?
>
> On Thu, Apr 7, 2016 at 9:00 AM, haosdent <[email protected]> wrote:
>
>> Hmm, sorry I didn't express my idea clearly. I mean kill those orphan
>> tasks here.
>>
>> On Thu, Apr 7, 2016 at 11:57 PM, June Taylor <[email protected]> wrote:
>>
>>> Forgive my ignorance, are you literally saying I should just sigkill
>>> these instances? How will that clean up the mesos orphans?
>>>
>>> Thanks,
>>> June Taylor
>>> System Administrator, Minnesota Population Center
>>> University of Minnesota
>>>
>>> On Thu, Apr 7, 2016 at 10:44 AM, haosdent <[email protected]> wrote:
>>>
>>>> Suppose your --work_dir=/tmp/mesos. Then you could run:
>>>>
>>>> $ find /tmp/mesos -name $YOUR_EXECUTOR_ID
>>>>
>>>> That gives you a list of folders, and you can then use lsof on them.
>>>>
>>>> As an example, my executor id is "test" here:
>>>>
>>>> $ find /tmp/mesos/ -name 'test'
>>>> /tmp/mesos/0/slaves/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-S0/frameworks/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-0002/executors/test
>>>>
>>>> When I execute
>>>> lsof /tmp/mesos/0/slaves/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-S0/frameworks/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-0002/executors/test/runs/latest/
>>>> (keep in mind I append runs/latest here),
>>>> then you can see the pid list:
>>>>
>>>> COMMAND     PID USER     FD  TYPE DEVICE SIZE/OFF       NODE NAME
>>>> mesos-exe 21811 haosdent cwd  DIR    8,3        6 3221463220
>>>> /tmp/mesos/0/slaves/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-S0/frameworks/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-0003/executors/test/runs/efecb119-1019-4629-91ab-fec7724a0f11
>>>> sleep     21847 haosdent cwd  DIR    8,3        6 3221463220
>>>> /tmp/mesos/0/slaves/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-S0/frameworks/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-0003/executors/test/runs/efecb119-1019-4629-91ab-fec7724a0f11
>>>>
>>>> Kill all of them.
>>>>
>>>> On Thu, Apr 7, 2016 at 11:23 PM, June Taylor <[email protected]> wrote:
>>>>
>>>>> I do have the executor ID. Can you advise how to kill it?
>>>>>
>>>>> I have one master and three slaves. Each slave has one of these
>>>>> orphans.
>>>>>
>>>>> Thanks,
>>>>> June Taylor
>>>>> System Administrator, Minnesota Population Center
>>>>> University of Minnesota
>>>>>
>>>>> On Thu, Apr 7, 2016 at 10:14 AM, haosdent <[email protected]> wrote:
>>>>>
>>>>>> > Going to this slave I can find an executor within the mesos working
>>>>>> > directory which matches this framework ID
>>>>>>
>>>>>> The quickest way here is to use kill on the slave, if you can find the
>>>>>> mesos-executor pid. You can use lsof/fuser, or dig through the logs, to
>>>>>> find the executor pid.
>>>>>>
>>>>>> However, it still seems weird given your feedback. Do you have
>>>>>> multiple masters, and did a failover happen on your master? In that case
>>>>>> the slave could not connect to the new master and the tasks became
>>>>>> orphans.
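The find/lsof/kill suggestion above boils down to roughly the following sketch. It is only a sketch: $YOUR_EXECUTOR_ID and the /tmp/mesos work_dir are placeholders for your own values, lsof -t prints only the matching PIDs, and plain kill sends SIGTERM (escalate to SIGKILL only if an executor ignores it):

$ EXEC_DIR=$(find /tmp/mesos -type d -name "$YOUR_EXECUTOR_ID" | head -n 1)
$ lsof -t "$EXEC_DIR"/runs/latest            # PIDs holding the run directory open (e.g. as their cwd)
$ kill $(lsof -t "$EXEC_DIR"/runs/latest)    # terminate those processes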
>>>>>>
>>>>>> On Thu, Apr 7, 2016 at 11:06 PM, June Taylor <[email protected]> wrote:
>>>>>>
>>>>>>> Here is one of three orphaned tasks (first two octets of IP removed):
>>>>>>>
>>>>>>> "orphan_tasks": [
>>>>>>>     {
>>>>>>>         "executor_id": "",
>>>>>>>         "name": "Task 1",
>>>>>>>         "framework_id": "14cddded-e692-4838-9893-6e04a81481d8-0006",
>>>>>>>         "state": "TASK_RUNNING",
>>>>>>>         "statuses": [
>>>>>>>             {
>>>>>>>                 "timestamp": 1459887295.05554,
>>>>>>>                 "state": "TASK_RUNNING",
>>>>>>>                 "container_status": {
>>>>>>>                     "network_infos": [
>>>>>>>                         {
>>>>>>>                             "ip_addresses": [
>>>>>>>                                 {
>>>>>>>                                     "ip_address": "xxx.xxx.163.205"
>>>>>>>                                 }
>>>>>>>                             ],
>>>>>>>                             "ip_address": "xxx.xxx.163.205"
>>>>>>>                         }
>>>>>>>                     ]
>>>>>>>                 }
>>>>>>>             }
>>>>>>>         ],
>>>>>>>         "slave_id": "182cf09f-0843-4736-82f1-d913089d7df4-S83",
>>>>>>>         "id": "1",
>>>>>>>         "resources": {
>>>>>>>             "mem": 112640.0,
>>>>>>>             "disk": 0.0,
>>>>>>>             "cpus": 30.0
>>>>>>>         }
>>>>>>>     }
>>>>>>>
>>>>>>> Going to this slave, I can find an executor within the mesos working
>>>>>>> directory which matches this framework ID. Reviewing the stdout
>>>>>>> within indicates the program has finished its work, but it is still
>>>>>>> holding these resources open.
>>>>>>>
>>>>>>> This framework ID is not shown as Active in the main Mesos Web UI,
>>>>>>> but does show up if you display the slave's web UI.
>>>>>>>
>>>>>>> The resources consumed count towards the Idle pool, and have
>>>>>>> resulted in zero available resources for other offers.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> June Taylor
>>>>>>> System Administrator, Minnesota Population Center
>>>>>>> University of Minnesota
>>>>>>>
>>>>>>> On Thu, Apr 7, 2016 at 9:46 AM, haosdent <[email protected]> wrote:
>>>>>>>
>>>>>>>> > pyspark executors hanging around and consuming resources marked
>>>>>>>> > as Idle in mesos Web UI
>>>>>>>>
>>>>>>>> Do you have some logs about this?
>>>>>>>>
>>>>>>>> > is there an API call I can make to kill these orphans?
>>>>>>>>
>>>>>>>> As far as I know, the mesos agent tries to clean up orphan containers
>>>>>>>> when it restarts. But I'm not sure the orphans I mean here are the
>>>>>>>> same as yours.
>>>>>>>>
>>>>>>>> On Thu, Apr 7, 2016 at 10:21 PM, June Taylor <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Greetings mesos users!
>>>>>>>>>
>>>>>>>>> I am debugging an issue with pyspark executors hanging around and
>>>>>>>>> consuming resources marked as Idle in the mesos Web UI. These tasks
>>>>>>>>> also show up under the orphan_tasks key in `mesos state`.
>>>>>>>>>
>>>>>>>>> I'm first wondering how to clear them out - is there an API call I
>>>>>>>>> can make to kill these orphans? Secondly, how did this happen at all?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> June Taylor
>>>>>>>>> System Administrator, Minnesota Population Center
>>>>>>>>> University of Minnesota
>>>>>>>>
>>>>>>>> --
>>>>>>>> Best Regards,
>>>>>>>> Haosdent Huang
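The orphan_tasks snippet pasted in the thread comes from the master's state endpoint, so you can watch whether the orphans actually go away with something like the line below. This is a sketch only: <master-host> is a placeholder, 5050 is the default master port, the path may be /master/state or /master/state.json depending on the Mesos version, and it assumes jq is installed:

$ curl -s http://<master-host>:5050/master/state.json | jq '.orphan_tasks'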

