Unfortunately I'm not able to glean much from that command, but perhaps someone out there with more Spark experience can. I do know that there are a couple of ways to launch Spark jobs on a cluster: you can run them in client mode, where the Spark driver runs locally on your machine and exits when it's finished, or in cluster mode, where the Spark driver runs persistently on the cluster as a Mesos framework. How exactly are you launching these tasks on the Mesos cluster?
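For reference, the two launch styles usually look something like the following with spark-submit. The client-mode master URL below is the one from your process listing; the dispatcher host/port, the --class name, and the job files are just placeholders, not taken from your setup.

# Client mode (the default): the driver runs inside the spark-submit JVM on your machine
$ spark-submit --master mesos://master.ourdomain.com:5050 --deploy-mode client my_job.py

# Cluster mode: the job is handed to a MesosClusterDispatcher, and the driver keeps running on the cluster
$ spark-submit --master mesos://dispatcher.ourdomain.com:7077 --deploy-mode cluster \
    --class com.example.MyJob my_job.jar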
On Fri, Apr 8, 2016 at 5:41 AM, June Taylor <[email protected]> wrote:

> Greg,
>
> I'm on the ops side and fairly new to spark/mesos, so I'm not quite sure I understand your question; here's how the task shows up in a process listing:
>
> /usr/lib/jvm/java-8-oracle/bin/java -cp /path/to/spark/spark-installations/spark-1.6.0-bin-hadoop2.6/conf/:/path/to/spark/spark-installations/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/path/to/spark/spark-installations/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/path/to/spark/spark-installations/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/path/to/spark/spark-installations/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar -Xms10G -Xmx10G org.apache.spark.deploy.SparkSubmit --master mesos://master.ourdomain.com:5050 --conf spark.driver.memory=10G --executor-memory 100G --total-executor-cores 90 pyspark-shell
>
> Thanks,
> June Taylor
> System Administrator, Minnesota Population Center
> University of Minnesota
>
> On Thu, Apr 7, 2016 at 3:37 PM, Greg Mann <[email protected]> wrote:
>
>> Hi June,
>> Are these Spark tasks being run in cluster mode or client mode? If it's client mode, then perhaps your local Spark scheduler is tearing itself down before the executors exit, thus leaving them orphaned.
>>
>> I'd love to see master/agent logs during the time that the tasks are becoming orphaned, if you have them available.
>>
>> Cheers,
>> Greg
>>
>> On Thu, Apr 7, 2016 at 1:08 PM, June Taylor <[email protected]> wrote:
>>
>>> Just a quick update... I was only able to get the orphans cleared by stopping mesos-slave, deleting the contents of the scratch directory, and then restarting mesos-slave.
>>>
>>> Thanks,
>>> June Taylor
>>> System Administrator, Minnesota Population Center
>>> University of Minnesota
>>>
>>> On Thu, Apr 7, 2016 at 12:01 PM, Vinod Kone <[email protected]> wrote:
>>>
>>>> A task/executor is called "orphaned" if the corresponding scheduler doesn't register with Mesos. Is your framework scheduler still running, or gone for good? The resources should be cleaned up once the agent (and consequently the master) has realized that the executor exited.
>>>>
>>>> Can you paste the master and agent logs for one of the orphaned tasks/executors (grep the log with the task/executor ID)?
>>>>
>>>> On Thu, Apr 7, 2016 at 9:00 AM, haosdent <[email protected]> wrote:
>>>>
>>>>> Hmm, sorry I didn't express my idea clearly. I mean kill those orphan tasks here.
>>>>>
>>>>> On Thu, Apr 7, 2016 at 11:57 PM, June Taylor <[email protected]> wrote:
>>>>>
>>>>>> Forgive my ignorance: are you literally saying I should just SIGKILL these instances? How will that clean up the mesos orphans?
>>>>>>
>>>>>> Thanks,
>>>>>> June Taylor
>>>>>> System Administrator, Minnesota Population Center
>>>>>> University of Minnesota
>>>>>>
>>>>>> On Thu, Apr 7, 2016 at 10:44 AM, haosdent <[email protected]> wrote:
>>>>>>
>>>>>>> Suppose your --work_dir is /tmp/mesos. Then you could run
>>>>>>>
>>>>>>> $ find /tmp/mesos -name $YOUR_EXECUTOR_ID
>>>>>>>
>>>>>>> That gives you a list of folders, and you can then use lsof on them.
>>>>>>>
>>>>>>> As an example, my executor id is "test" here.
>>>>>>>
>>>>>>> $ find /tmp/mesos/ -name 'test'
>>>>>>> /tmp/mesos/0/slaves/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-S0/frameworks/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-0002/executors/test
>>>>>>>
>>>>>>> When I execute
>>>>>>>
>>>>>>> $ lsof /tmp/mesos/0/slaves/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-S0/frameworks/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-0002/executors/test/runs/latest/
>>>>>>>
>>>>>>> (keep in mind I appended runs/latest here), you can see the pid list:
>>>>>>>
>>>>>>> COMMAND   PID   USER     FD  TYPE DEVICE SIZE/OFF NODE       NAME
>>>>>>> mesos-exe 21811 haosdent cwd DIR  8,3    6        3221463220 /tmp/mesos/0/slaves/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-S0/frameworks/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-0003/executors/test/runs/efecb119-1019-4629-91ab-fec7724a0f11
>>>>>>> sleep     21847 haosdent cwd DIR  8,3    6        3221463220 /tmp/mesos/0/slaves/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-S0/frameworks/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-0003/executors/test/runs/efecb119-1019-4629-91ab-fec7724a0f11
>>>>>>>
>>>>>>> Kill all of them.
>>>>>>>
>>>>>>> On Thu, Apr 7, 2016 at 11:23 PM, June Taylor <[email protected]> wrote:
>>>>>>>
>>>>>>>> I do have the executor ID. Can you advise how to kill it?
>>>>>>>>
>>>>>>>> I have one master and three slaves. Each slave has one of these orphans.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> June Taylor
>>>>>>>> System Administrator, Minnesota Population Center
>>>>>>>> University of Minnesota
>>>>>>>>
>>>>>>>> On Thu, Apr 7, 2016 at 10:14 AM, haosdent <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> > Going to this slave I can find an executor within the mesos working directory which matches this framework ID
>>>>>>>>>
>>>>>>>>> The quickest way here is to use kill on the slave, if you can find the mesos-executor process. You can use lsof/fuser, or dig through the logs, to find the executor PID.
>>>>>>>>>
>>>>>>>>> However, it still seems weird given your feedback. Do you have multiple masters, and did a failover happen on your master? That could leave the slave unable to connect to the new master, so its tasks become orphaned.
>>>>>>>>>
>>>>>>>>> On Thu, Apr 7, 2016 at 11:06 PM, June Taylor <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Here is one of three orphaned tasks (first two octets of IP removed):
>>>>>>>>>>
>>>>>>>>>> "orphan_tasks": [
>>>>>>>>>>     {
>>>>>>>>>>         "executor_id": "",
>>>>>>>>>>         "name": "Task 1",
>>>>>>>>>>         "framework_id": "14cddded-e692-4838-9893-6e04a81481d8-0006",
>>>>>>>>>>         "state": "TASK_RUNNING",
>>>>>>>>>>         "statuses": [
>>>>>>>>>>             {
>>>>>>>>>>                 "timestamp": 1459887295.05554,
>>>>>>>>>>                 "state": "TASK_RUNNING",
>>>>>>>>>>                 "container_status": {
>>>>>>>>>>                     "network_infos": [
>>>>>>>>>>                         {
>>>>>>>>>>                             "ip_addresses": [
>>>>>>>>>>                                 {
>>>>>>>>>>                                     "ip_address": "xxx.xxx.163.205"
>>>>>>>>>>                                 }
>>>>>>>>>>                             ],
>>>>>>>>>>                             "ip_address": "xxx.xxx.163.205"
>>>>>>>>>>                         }
>>>>>>>>>>                     ]
>>>>>>>>>>                 }
>>>>>>>>>>             }
>>>>>>>>>>         ],
>>>>>>>>>>         "slave_id": "182cf09f-0843-4736-82f1-d913089d7df4-S83",
>>>>>>>>>>         "id": "1",
>>>>>>>>>>         "resources": {
>>>>>>>>>>             "mem": 112640.0,
>>>>>>>>>>             "disk": 0.0,
>>>>>>>>>>             "cpus": 30.0
>>>>>>>>>>         }
>>>>>>>>>>     }
>>>>>>>>>>
>>>>>>>>>> Going to this slave, I can find an executor within the mesos working directory which matches this framework ID. Reviewing the stdout messages within indicates the program has finished its work, but it is still holding these resources open.
>>>>>>>>>>
>>>>>>>>>> This framework ID is not shown as Active in the main Mesos Web UI, but does show up if you display the slave's web UI.
>>>>>>>>>>
>>>>>>>>>> The resources consumed count towards the Idle pool and have resulted in zero available resources for other offers.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> June Taylor
>>>>>>>>>> System Administrator, Minnesota Population Center
>>>>>>>>>> University of Minnesota
>>>>>>>>>>
>>>>>>>>>> On Thu, Apr 7, 2016 at 9:46 AM, haosdent <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> > pyspark executors hanging around and consuming resources marked as Idle in mesos Web UI
>>>>>>>>>>>
>>>>>>>>>>> Do you have some logs about this?
>>>>>>>>>>>
>>>>>>>>>>> > is there an API call I can make to kill these orphans?
>>>>>>>>>>>
>>>>>>>>>>> As far as I know, the mesos agent tries to clean up orphan containers when it restarts. But I'm not sure the orphans I mean here are the same as yours.
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Apr 7, 2016 at 10:21 PM, June Taylor <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Greetings mesos users!
>>>>>>>>>>>>
>>>>>>>>>>>> I am debugging an issue with pyspark executors hanging around and consuming resources marked as Idle in the Mesos Web UI. These tasks also show up in the orphan_tasks key in `mesos state`.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm first wondering how to clear them out - is there an API call I can make to kill these orphans? Secondly, how did this happen at all?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> June Taylor
>>>>>>>>>>>> System Administrator, Minnesota Population Center
>>>>>>>>>>>> University of Minnesota
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Best Regards,
>>>>>>>>>>> Haosdent Huang
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Best Regards,
>>>>>>>>> Haosdent Huang
>>>>>>>
>>>>>>> --
>>>>>>> Best Regards,
>>>>>>> Haosdent Huang
>>>>>
>>>>> --
>>>>> Best Regards,
>>>>> Haosdent Huang
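For anyone who finds this thread later, here is a rough, condensed sketch of the two steps discussed above: listing the orphans the master knows about, and then the manual cleanup haosdent describes on the agent. The work_dir (/tmp/mesos) and the executor ID ('test') are just the example values from his messages, and the <...> path segment and <pid> are placeholders; substitute your own values, and double-check the PID list before killing anything.

# 1. List the orphaned tasks from the master's state endpoint (the same orphan_tasks data June pasted from `mesos state`)
$ curl -s http://master.ourdomain.com:5050/master/state.json | python -m json.tool | grep -A 5 '"orphan_tasks"'

# 2. On the affected agent, locate the executor's sandbox under the agent work_dir
$ find /tmp/mesos -type d -name 'test'

# 3. Find the PIDs still holding the latest run directory open (-t prints PIDs only)
$ lsof -t /tmp/mesos/<...>/executors/test/runs/latest/

# 4. Kill those PIDs (try a plain SIGTERM before resorting to kill -9)
$ kill <pid>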

