Greg,

I'm on the ops side and fairly new to Spark/Mesos, so I'm not quite sure I
understand your question. Here's how the task shows up in a process listing:

/usr/lib/jvm/java-8-oracle/bin/java
  -cp /path/to/spark/spark-installations/spark-1.6.0-bin-hadoop2.6/conf/:/path/to/spark/spark-installations/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/path/to/spark/spark-installations/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/path/to/spark/spark-installations/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/path/to/spark/spark-installations/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar
  -Xms10G -Xmx10G
  org.apache.spark.deploy.SparkSubmit
  --master mesos://master.ourdomain.com:5050
  --conf spark.driver.memory=10G
  --executor-memory 100G
  --total-executor-cores 90
  pyspark-shell


Thanks,
June Taylor
System Administrator, Minnesota Population Center
University of Minnesota

On Thu, Apr 7, 2016 at 3:37 PM, Greg Mann <[email protected]> wrote:

> Hi June,
> Are these Spark tasks being run in cluster mode or client mode? If it's
> client mode, then perhaps your local Spark scheduler is tearing itself down
> before the executors exit, thus leaving them orphaned.
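>
> A rough sketch of the distinction (hostnames and my_job.py below are
> placeholders; the pyspark shell itself only runs in client mode, while
> cluster mode on Mesos goes through the MesosClusterDispatcher):
>
> # Client mode (the default): the driver runs wherever spark-submit runs.
> $ spark-submit --master mesos://master.ourdomain.com:5050 my_job.py
>
> # Cluster mode: start the dispatcher, then submit to it.
> $ ./sbin/start-mesos-dispatcher.sh --master mesos://master.ourdomain.com:5050
> $ spark-submit --deploy-mode cluster --master mesos://dispatcher.ourdomain.com:7077 my_job.py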
>
> I'd love to see master/agent logs during the time that the tasks are
> becoming orphaned if you have them available.
>
> Cheers,
> Greg
>
>
> On Thu, Apr 7, 2016 at 1:08 PM, June Taylor <[email protected]> wrote:
>
>> Just a quick update... I was only able to get the orphans cleared by
>> stopping mesos-slave, deleting the contents of the scratch directory, and
>> then restarting mesos-slave.
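>>
>> Roughly, that amounted to (a sketch; the service name and /var/lib/mesos are
>> assumptions for a packaged install - substitute whatever --work_dir actually
>> points at):
>>
>> $ sudo service mesos-slave stop
>> $ sudo rm -rf /var/lib/mesos/*      # contents of the agent's scratch/work directory
>> $ sudo service mesos-slave start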
>>
>>
>> Thanks,
>> June Taylor
>> System Administrator, Minnesota Population Center
>> University of Minnesota
>>
>> On Thu, Apr 7, 2016 at 12:01 PM, Vinod Kone <[email protected]> wrote:
>>
>>> A task/executor is called "orphaned" if the corresponding scheduler
>>> doesn't register with Mesos. Is your framework scheduler running, or gone
>>> for good? The resources should be cleaned up once the agent (and consequently
>>> the master) has realized that the executor exited.
>>>
>>> Can you paste the master and agent logs for one of the orphaned
>>> tasks/executors (grep the log for the task/executor id)?
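>>>
>>> For example (assuming logs go to the packaged default /var/log/mesos; adjust
>>> for your --log_dir, and substitute the real task/executor id):
>>>
>>> $ grep '<task-or-executor-id>' /var/log/mesos/mesos-master.INFO
>>> $ grep '<task-or-executor-id>' /var/log/mesos/mesos-slave.INFO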
>>>
>>> On Thu, Apr 7, 2016 at 9:00 AM, haosdent <[email protected]> wrote:
>>>
>>>> Hmm, sorry I didn't express my idea clearly. I meant killing those orphan
>>>> tasks here.
>>>>
>>>> On Thu, Apr 7, 2016 at 11:57 PM, June Taylor <[email protected]> wrote:
>>>>
>>>>> Forgive my ignorance: are you literally saying I should just SIGKILL
>>>>> these instances? How will that clean up the Mesos orphans?
>>>>>
>>>>>
>>>>> Thanks,
>>>>> June Taylor
>>>>> System Administrator, Minnesota Population Center
>>>>> University of Minnesota
>>>>>
>>>>> On Thu, Apr 7, 2016 at 10:44 AM, haosdent <[email protected]> wrote:
>>>>>
>>>>>> Suppose your --work_dir is /tmp/mesos. Then you could run
>>>>>>
>>>>>> $ find /tmp/mesos -name $YOUR_EXECUTOR_ID
>>>>>>
>>>>>> That gives you a list of folders, and you can then use lsof on them.
>>>>>>
>>>>>> As an example, my executor id is "test" here.
>>>>>>
>>>>>> $ find /tmp/mesos/ -name 'test'
>>>>>>
>>>>>> /tmp/mesos/0/slaves/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-S0/frameworks/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-0002/executors/test
>>>>>>
>>>>>> When I execute
>>>>>> $ lsof /tmp/mesos/0/slaves/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-S0/frameworks/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-0002/executors/test/runs/latest/
>>>>>> (keep in mind that I appended runs/latest here).
>>>>>>
>>>>>> Then you could see the pid list:
>>>>>>
>>>>>> COMMAND     PID      USER   FD   TYPE DEVICE SIZE/OFF       NODE NAME
>>>>>> mesos-exe 21811 haosdent  cwd    DIR    8,3        6 3221463220
>>>>>> /tmp/mesos/0/slaves/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-S0/frameworks/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-0003/executors/test/runs/efecb119-1019-4629-91ab-fec7724a0f11
>>>>>> sleep     21847 haosdent  cwd    DIR    8,3        6 3221463220
>>>>>> /tmp/mesos/0/slaves/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-S0/frameworks/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-0003/executors/test/runs/efecb119-1019-4629-91ab-fec7724a0f11
>>>>>>
>>>>>> Kill all of them.
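>>>>>>
>>>>>> If you want to script that, something along these lines should work (an
>>>>>> untested sketch; adjust the work_dir and executor id for your setup):
>>>>>>
>>>>>> $ find /tmp/mesos -type d -name "$YOUR_EXECUTOR_ID" | xargs -I{} lsof -t +D {} | sort -u | xargs -r kill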
>>>>>>
>>>>>> On Thu, Apr 7, 2016 at 11:23 PM, June Taylor <[email protected]> wrote:
>>>>>>
>>>>>>> I do have the executor ID. Can you advise how to kill it?
>>>>>>>
>>>>>>> I have one master and three slaves. Each slave has one of these
>>>>>>> orphans.
>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> June Taylor
>>>>>>> System Administrator, Minnesota Population Center
>>>>>>> University of Minnesota
>>>>>>>
>>>>>>> On Thu, Apr 7, 2016 at 10:14 AM, haosdent <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> >Going to this slave I can find an executor within the mesos
>>>>>>>> working directory which matches this framework ID
>>>>>>>> The quickest way here is to use kill on the slave, if you can find the
>>>>>>>> mesos-executor process. You can use lsof/fuser, or dig through the logs,
>>>>>>>> to find the executor pid.
>>>>>>>>
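>>>>>>>> For example, with fuser (the sandbox path below is a placeholder; use the
>>>>>>>> path that find returns for your executor):
>>>>>>>>
>>>>>>>> $ fuser -v <work_dir>/.../executors/<executor_id>/runs/latest
>>>>>>>> $ kill <pid_of_the_mesos-executor_process>
>>>>>>>>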
>>>>>>>> However, it still seems weird given your feedback. Do you have multiple
>>>>>>>> masters, and did a failover happen on the master? In that case the slave
>>>>>>>> might not have reconnected to the new master, and the tasks would become
>>>>>>>> orphaned.
>>>>>>>>
>>>>>>>> On Thu, Apr 7, 2016 at 11:06 PM, June Taylor <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Here is one of three orphaned tasks (first two octets of IP
>>>>>>>>> removed):
>>>>>>>>>
>>>>>>>>> "orphan_tasks": [
>>>>>>>>>         {
>>>>>>>>>             "executor_id": "",
>>>>>>>>>             "name": "Task 1",
>>>>>>>>>             "framework_id":
>>>>>>>>> "14cddded-e692-4838-9893-6e04a81481d8-0006",
>>>>>>>>>             "state": "TASK_RUNNING",
>>>>>>>>>             "statuses": [
>>>>>>>>>                 {
>>>>>>>>>                     "timestamp": 1459887295.05554,
>>>>>>>>>                     "state": "TASK_RUNNING",
>>>>>>>>>                     "container_status": {
>>>>>>>>>                         "network_infos": [
>>>>>>>>>                             {
>>>>>>>>>                                 "ip_addresses": [
>>>>>>>>>                                     {
>>>>>>>>>                                         "ip_address":
>>>>>>>>> "xxx.xxx.163.205"
>>>>>>>>>                                     }
>>>>>>>>>                                 ],
>>>>>>>>>                                 "ip_address": "xxx.xxx.163.205"
>>>>>>>>>                             }
>>>>>>>>>                         ]
>>>>>>>>>                     }
>>>>>>>>>                 }
>>>>>>>>>             ],
>>>>>>>>>             "slave_id": "182cf09f-0843-4736-82f1-d913089d7df4-S83",
>>>>>>>>>             "id": "1",
>>>>>>>>>             "resources": {
>>>>>>>>>                 "mem": 112640.0,
>>>>>>>>>                 "disk": 0.0,
>>>>>>>>>                 "cpus": 30.0
>>>>>>>>>             }
>>>>>>>>>         }
>>>>>>>>>
>>>>>>>>> Going to this slave, I can find an executor within the Mesos working
>>>>>>>>> directory which matches this framework ID. The stdout within indicates
>>>>>>>>> the program has finished its work, but it is still holding these
>>>>>>>>> resources open.
>>>>>>>>>
>>>>>>>>> This framework ID is not shown as Active in the main Mesos Web UI,
>>>>>>>>> but does show up if you display the Slave's web UI.
>>>>>>>>>
>>>>>>>>> The resources consumed count towards the Idle pool, and have
>>>>>>>>> resulted in zero available resources for other Offers.
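>>>>>>>>>
>>>>>>>>> For what it's worth, the same information should also be visible via the
>>>>>>>>> agent's state endpoint (5051 is the default agent port; jq is only there
>>>>>>>>> for readability):
>>>>>>>>>
>>>>>>>>> $ curl -s http://<slave-host>:5051/state.json | jq '.frameworks[] | select(.id == "14cddded-e692-4838-9893-6e04a81481d8-0006")'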
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> June Taylor
>>>>>>>>> System Administrator, Minnesota Population Center
>>>>>>>>> University of Minnesota
>>>>>>>>>
>>>>>>>>> On Thu, Apr 7, 2016 at 9:46 AM, haosdent <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> > pyspark executors hanging around and consuming resources
>>>>>>>>>> marked as Idle in mesos Web UI
>>>>>>>>>>
>>>>>>>>>> Do you have some logs about this?
>>>>>>>>>>
>>>>>>>>>> >is there an API call I can make to kill these orphans?
>>>>>>>>>>
>>>>>>>>>> As far as I know, the Mesos agent tries to clean up orphaned containers
>>>>>>>>>> when it restarts. But I'm not sure the orphans I mean here are the same
>>>>>>>>>> as yours.
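>>>>>>>>>>
>>>>>>>>>> If a restart is acceptable, the agent's --recover flag is the relevant
>>>>>>>>>> knob (a sketch from memory; double-check `mesos-slave --help` for the
>>>>>>>>>> exact semantics). With --recover=cleanup the agent kills any old live
>>>>>>>>>> executors during recovery and then exits, after which you start it again
>>>>>>>>>> normally:
>>>>>>>>>>
>>>>>>>>>> $ mesos-slave --master=<master-host>:5050 --work_dir=/tmp/mesos --recover=cleanup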
>>>>>>>>>>
>>>>>>>>>> On Thu, Apr 7, 2016 at 10:21 PM, June Taylor <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Greetings mesos users!
>>>>>>>>>>>
>>>>>>>>>>> I am debugging an issue with pyspark executors hanging around and
>>>>>>>>>>> consuming resources marked as Idle in the Mesos Web UI. These tasks
>>>>>>>>>>> also show up in the orphan_tasks key in `mesos state`.
>>>>>>>>>>>
>>>>>>>>>>> I'm first wondering how to clear them out - is there an API call I can
>>>>>>>>>>> make to kill these orphans? Secondly, I'd like to understand how this
>>>>>>>>>>> happened at all.
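>>>>>>>>>>>
>>>>>>>>>>> One candidate, assuming the master's teardown endpoint is available in
>>>>>>>>>>> this Mesos version (it may not apply once the framework is only known
>>>>>>>>>>> as an orphan; host and id below are placeholders):
>>>>>>>>>>>
>>>>>>>>>>> $ curl -X POST http://<master-host>:5050/master/teardown -d 'frameworkId=<framework-id>'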
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> June Taylor
>>>>>>>>>>> System Administrator, Minnesota Population Center
>>>>>>>>>>> University of Minnesota
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Best Regards,
>>>>>>>>>> Haosdent Huang
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Best Regards,
>>>>>>>> Haosdent Huang
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Best Regards,
>>>>>> Haosdent Huang
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Best Regards,
>>>> Haosdent Huang
>>>>
>>>
>>>
>>
>
