While I was waiting for more info, the app finally did start up. I am now trying to figure out why it took so long.

Thanks,
June Taylor
System Administrator, Minnesota Population Center
University of Minnesota

On Mon, Apr 11, 2016 at 9:50 AM, haosdent <[email protected]> wrote:

> Could you find marathon on the http://${YOUR_MASTER_IP}:${YOUR_MASTER_PORT}/#/frameworks page? And:
>
> >While deploying I am looking at mesos-master.WARNING, mesos-master.INFO and mesos-master.ERROR log files, but I never see anything show up that would indicate a problem, or even an attempt.
>
> When you create a new task in marathon, can you see any related logs in the mesos master?
>
> On Mon, Apr 11, 2016 at 10:11 PM, June Taylor <[email protected]> wrote:
>
>> Hello again. I am not sure this has been resolved yet, because I am still unable to get Marathon deployments to start.
>>
>> I have deleted the /marathon/ node from Zookeeper, and I now have the Marathon WebUI accessible again. I try to add a new task to deploy, and there seem to be available resources, but it is still stuck in a 'Waiting' status.
>>
>> While deploying I am looking at the mesos-master.WARNING, mesos-master.INFO and mesos-master.ERROR log files, but I never see anything show up that would indicate a problem, or even an attempt.
>>
>> Where am I going wrong?
>>
>> Thanks,
>> June Taylor
>> System Administrator, Minnesota Population Center
>> University of Minnesota
>>
>> On Sat, Apr 9, 2016 at 6:07 AM, Pradeep Chhetri <[email protected]> wrote:
>>
>>> Hi Greg & June,
>>>
>>> Looking at the command above, I can say that you are running spark in client mode, because you are invoking the pyspark-shell.
>>>
>>> One simple way to distinguish the two is that in cluster mode it is mandatory to start the MesosClusterDispatcher in your mesos cluster, which acts as the spark framework scheduler.
>>>
>>> As everyone said above, I guess the reason you are observing orphaned tasks is that the scheduler is getting killed before the tasks finish.
>>>
>>> I would suggest June run Spark in cluster mode (http://spark.apache.org/docs/latest/running-on-mesos.html#cluster-mode).
>>>
>>> Also, as Radek suggested above, run spark in coarse-grained mode (the default run mode), which will save you much of the JVM startup time.
>>>
>>> Keep us informed of how it goes.
>>>
>>> On Sat, Apr 9, 2016 at 12:28 AM, Rad Gruchalski <[email protected]> wrote:
>>>
>>>> Greg,
>>>>
>>>> All you need to do is tell Spark that the master is mesos://…, as in the example from June. It's all nicely documented here:
>>>>
>>>> http://spark.apache.org/docs/latest/running-on-mesos.html
>>>>
>>>> I'd suggest running in coarse mode, as fine-grained is a bit choppy.
>>>>
>>>> Best regards,
>>>> Radek Gruchalski
>>>> [email protected] <[email protected]>
>>>> de.linkedin.com/in/radgruchalski/
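
[Aside: a minimal sketch of the cluster-mode setup Pradeep and Radek describe, reusing the master URL from June's command. The dispatcher host name, its default port 7077, and the job URI are placeholders/assumptions, and an interactive pyspark shell always keeps its driver in client mode regardless.]

  # Start the MesosClusterDispatcher once somewhere in the cluster; it registers with Mesos
  # as a persistent framework and accepts cluster-mode submissions (script ships with Spark 1.4+):
  $ ./sbin/start-mesos-dispatcher.sh --master mesos://master.ourdomain.com:5050

  # Submit batch jobs against the dispatcher rather than the Mesos master directly.
  # Whether a .py batch job can use cluster mode depends on the Spark version; see the
  # running-on-mesos page linked above. spark.mesos.coarse=true forces coarse-grained mode.
  $ ./bin/spark-submit \
      --master mesos://dispatcher-host.ourdomain.com:7077 \
      --deploy-mode cluster \
      --conf spark.mesos.coarse=true \
      --driver-memory 10G --executor-memory 100G --total-executor-cores 90 \
      http://artifacts.ourdomain.com/jobs/my_job.py
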
>>>>
>>>> On Saturday, 9 April 2016 at 00:48, Greg Mann wrote:
>>>>
>>>> Unfortunately I'm not able to glean much from that command, but perhaps someone out there with more Spark experience can? I do know that there are a couple of ways to launch Spark jobs on a cluster: you can run them in client mode, where the Spark driver runs locally on your machine and exits when it's finished, or they can be run in cluster mode, where the Spark driver runs persistently on the cluster as a Mesos framework. How exactly are you launching these tasks on the Mesos cluster?
>>>>
>>>> On Fri, Apr 8, 2016 at 5:41 AM, June Taylor <[email protected]> wrote:
>>>>
>>>> Greg,
>>>>
>>>> I'm on the ops side and fairly new to spark/mesos, so I'm not quite sure I understand your question; here's how the task shows up in a process listing:
>>>>
>>>> /usr/lib/jvm/java-8-oracle/bin/java -cp /path/to/spark/spark-installations/spark-1.6.0-bin-hadoop2.6/conf/:/path/to/spark/spark-installations/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/path/to/spark/spark-installations/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/path/to/spark/spark-installations/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/path/to/spark/spark-installations/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar -Xms10G -Xmx10G org.apache.spark.deploy.SparkSubmit --master mesos://master.ourdomain.com:5050 --conf spark.driver.memory=10G --executor-memory 100G --total-executor-cores 90 pyspark-shell
>>>>
>>>> Thanks,
>>>> June Taylor
>>>> System Administrator, Minnesota Population Center
>>>> University of Minnesota
>>>>
>>>> On Thu, Apr 7, 2016 at 3:37 PM, Greg Mann <[email protected]> wrote:
>>>>
>>>> Hi June,
>>>>
>>>> Are these Spark tasks being run in cluster mode or client mode? If it's client mode, then perhaps your local Spark scheduler is tearing itself down before the executors exit, thus leaving them orphaned.
>>>>
>>>> I'd love to see master/agent logs from the time the tasks are becoming orphaned, if you have them available.
>>>>
>>>> Cheers,
>>>> Greg
>>>>
>>>> On Thu, Apr 7, 2016 at 1:08 PM, June Taylor <[email protected]> wrote:
>>>>
>>>> Just a quick update... I was only able to get the orphans cleared by stopping mesos-slave, deleting the contents of the scratch directory, and then restarting mesos-slave.
>>>>
>>>> Thanks,
>>>> June Taylor
>>>> System Administrator, Minnesota Population Center
>>>> University of Minnesota
>>>>
>>>> On Thu, Apr 7, 2016 at 12:01 PM, Vinod Kone <[email protected]> wrote:
>>>>
>>>> A task/executor is called "orphaned" if the corresponding scheduler doesn't register with Mesos. Is your framework scheduler running, or gone for good? The resources should be cleaned up once the agent (and consequently the master) has realized that the executor exited.
>>>>
>>>> Can you paste the master and agent logs for one of the orphaned tasks/executors (grep the logs for the task/executor id)?
>>>>
>>>> On Thu, Apr 7, 2016 at 9:00 AM, haosdent <[email protected]> wrote:
>>>>
>>>> Hmm, sorry I didn't express my idea clearly. I mean kill those orphan tasks here.
>>>>
>>>> On Thu, Apr 7, 2016 at 11:57 PM, June Taylor <[email protected]> wrote:
>>>>
>>>> Forgive my ignorance: are you literally saying I should just sigkill these instances? How will that clean up the mesos orphans?
>>>>
>>>> Thanks,
>>>> June Taylor
>>>> System Administrator, Minnesota Population Center
>>>> University of Minnesota
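
[Aside: a quick sketch of the log grep Vinod asks for above, assuming the stock package layout where the mesos-master.INFO / mesos-slave.INFO files mentioned earlier live under /var/log/mesos; the ID variables are placeholders.]

  # On the master, look for the task and framework IDs:
  $ grep -E "$YOUR_TASK_ID|$YOUR_FRAMEWORK_ID" /var/log/mesos/mesos-master.INFO
  # On the agent (slave) that ran the executor:
  $ grep "$YOUR_EXECUTOR_ID" /var/log/mesos/mesos-slave.INFO
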
>>>>
>>>> On Thu, Apr 7, 2016 at 10:44 AM, haosdent <[email protected]> wrote:
>>>>
>>>> Suppose your --work_dir is /tmp/mesos. Then you could run:
>>>>
>>>> $ find /tmp/mesos -name $YOUR_EXECUTOR_ID
>>>>
>>>> That gives you a list of folders, and you can then use lsof on them.
>>>>
>>>> As an example, my executor id is "test" here:
>>>>
>>>> $ find /tmp/mesos/ -name 'test'
>>>> /tmp/mesos/0/slaves/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-S0/frameworks/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-0002/executors/test
>>>>
>>>> When I then execute (keep in mind I append runs/latest):
>>>>
>>>> $ lsof /tmp/mesos/0/slaves/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-S0/frameworks/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-0002/executors/test/runs/latest/
>>>>
>>>> I can see the pid list:
>>>>
>>>> COMMAND   PID   USER     FD  TYPE DEVICE SIZE/OFF NODE       NAME
>>>> mesos-exe 21811 haosdent cwd DIR  8,3    6        3221463220 /tmp/mesos/0/slaves/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-S0/frameworks/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-0003/executors/test/runs/efecb119-1019-4629-91ab-fec7724a0f11
>>>> sleep     21847 haosdent cwd DIR  8,3    6        3221463220 /tmp/mesos/0/slaves/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-S0/frameworks/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-0003/executors/test/runs/efecb119-1019-4629-91ab-fec7724a0f11
>>>>
>>>> Kill all of them.
>>>>
>>>> On Thu, Apr 7, 2016 at 11:23 PM, June Taylor <[email protected]> wrote:
>>>>
>>>> I do have the executor ID. Can you advise how to kill it?
>>>>
>>>> I have one master and three slaves. Each slave has one of these orphans.
>>>>
>>>> Thanks,
>>>> June Taylor
>>>> System Administrator, Minnesota Population Center
>>>> University of Minnesota
>>>>
>>>> On Thu, Apr 7, 2016 at 10:14 AM, haosdent <[email protected]> wrote:
>>>>
>>>> >Going to this slave I can find an executor within the mesos working directory which matches this framework ID
>>>>
>>>> The quickest way here is to use kill on the slave, if you can find the mesos-executor id. You can use lsof/fuser, or dig through the logs, to find the executor pid.
>>>>
>>>> However, it still seems weird given your feedback. Do you have multiple masters, and did a failover happen on your master? In that case the slave could fail to connect to the new master and the tasks would become orphaned.
>>>>
>>>> On Thu, Apr 7, 2016 at 11:06 PM, June Taylor <[email protected]> wrote:
>>>>
>>>> Here is one of three orphaned tasks (first two octets of the IP removed):
>>>>
>>>> "orphan_tasks": [
>>>>     {
>>>>         "executor_id": "",
>>>>         "name": "Task 1",
>>>>         "framework_id": "14cddded-e692-4838-9893-6e04a81481d8-0006",
>>>>         "state": "TASK_RUNNING",
>>>>         "statuses": [
>>>>             {
>>>>                 "timestamp": 1459887295.05554,
>>>>                 "state": "TASK_RUNNING",
>>>>                 "container_status": {
>>>>                     "network_infos": [
>>>>                         {
>>>>                             "ip_addresses": [
>>>>                                 {
>>>>                                     "ip_address": "xxx.xxx.163.205"
>>>>                                 }
>>>>                             ],
>>>>                             "ip_address": "xxx.xxx.163.205"
>>>>                         }
>>>>                     ]
>>>>                 }
>>>>             }
>>>>         ],
>>>>         "slave_id": "182cf09f-0843-4736-82f1-d913089d7df4-S83",
>>>>         "id": "1",
>>>>         "resources": {
>>>>             "mem": 112640.0,
>>>>             "disk": 0.0,
>>>>             "cpus": 30.0
>>>>         }
>>>>     }
>>>>
>>>> Going to this slave, I can find an executor within the mesos working directory which matches this framework ID. Reviewing the stdout messaging within indicates the program has finished its work. But it is still holding these resources open.
>>>>
>>>> This framework ID is not shown as Active in the main Mesos Web UI, but it does show up if you display the slave's web UI.
>>>>
>>>> The resources consumed count towards the Idle pool, and have resulted in zero available resources for other offers.
>>>>
>>>> Thanks,
>>>> June Taylor
>>>> System Administrator, Minnesota Population Center
>>>> University of Minnesota
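
[Aside: the JSON above comes from the master's state endpoint. A hedged one-liner to pull just the orphans, assuming curl and jq are installed and reusing the ${YOUR_MASTER_IP} placeholder from earlier; older Mesos releases expose the endpoint as /state.json rather than /state.]

  $ curl -s http://${YOUR_MASTER_IP}:5050/state | jq '.orphan_tasks[] | {id, name, framework_id, slave_id, resources}'
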
>>>>
>>>> On Thu, Apr 7, 2016 at 9:46 AM, haosdent <[email protected]> wrote:
>>>>
>>>> > pyspark executors hanging around and consuming resources marked as Idle in mesos Web UI
>>>>
>>>> Do you have some logs about this?
>>>>
>>>> >is there an API call I can make to kill these orphans?
>>>>
>>>> As far as I know, the mesos agent tries to clean up orphan containers when it restarts, but I'm not sure the orphans I mean here are the same as yours.
>>>>
>>>> On Thu, Apr 7, 2016 at 10:21 PM, June Taylor <[email protected]> wrote:
>>>>
>>>> Greetings mesos users!
>>>>
>>>> I am debugging an issue with pyspark executors hanging around and consuming resources marked as Idle in the mesos Web UI. These tasks also show up under the orphan_tasks key in `mesos state`.
>>>>
>>>> I'm first wondering how to clear them out - is there an API call I can make to kill these orphans? Secondly, how did this happen at all?
>>>>
>>>> Thanks,
>>>> June Taylor
>>>> System Administrator, Minnesota Population Center
>>>> University of Minnesota
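
[Aside: pulling haosdent's find/lsof/kill steps from earlier in the thread into one sketch, assuming --work_dir=/tmp/mesos and that the executor ID is already known; verify the PID list before killing anything.]

  # Locate the orphaned executor's sandbox under the agent work_dir (take the first match):
  $ EXECUTOR_DIR=$(find /tmp/mesos -type d -name "$YOUR_EXECUTOR_ID" | head -n 1)
  # See which processes still have the latest run directory open (same lsof call as above):
  $ lsof "$EXECUTOR_DIR"/runs/latest/
  # Once confirmed, kill them by PID (-t prints bare PIDs; xargs -r skips an empty list):
  $ lsof -t "$EXECUTOR_DIR"/runs/latest/ | xargs -r kill
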

