Greg,

All you need to do is tell Spark that the master is mesos://…, as in the example from June. It's all nicely documented here:
http://spark.apache.org/docs/latest/running-on-mesos.html

I'd suggest running in coarse-grained mode, as fine-grained mode is a bit choppy.
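For example, a coarse-grained pyspark session would look roughly like the sketch below. The master URL, memory, and core settings simply mirror the command June posted in this thread; the only real addition is spark.mesos.coarse=true (fine-grained is the default in Spark 1.6):

    # values below mirror June's submission; spark.mesos.coarse=true is the only addition
    pyspark \
        --master mesos://master.ourdomain.com:5050 \
        --conf spark.mesos.coarse=true \
        --conf spark.driver.memory=10G \
        --executor-memory 100G \
        --total-executor-cores 90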
Best regards,
Radek Gruchalski
[email protected]
de.linkedin.com/in/radgruchalski/

Confidentiality: This communication is intended for the above-named person and may be confidential and/or legally privileged. If it has come to you in error you must take no action based on it, nor must you copy or show it to anyone; please delete/destroy and inform the sender immediately.

On Saturday, 9 April 2016 at 00:48, Greg Mann wrote:
> Unfortunately I'm not able to glean much from that command, but perhaps someone out there with more Spark experience can? I do know that there are a couple of ways to launch Spark jobs on a cluster: you can run them in client mode, where the Spark driver runs locally on your machine and exits when it's finished, or they can be run in cluster mode, where the Spark driver runs persistently on the cluster as a Mesos framework. How exactly are you launching these tasks on the Mesos cluster?
>
> On Fri, Apr 8, 2016 at 5:41 AM, June Taylor <[email protected]> wrote:
> > Greg,
> >
> > I'm on the ops side and fairly new to spark/mesos, so I'm not quite sure I understand your question; here's how the task shows up in a process listing:
> >
> > /usr/lib/jvm/java-8-oracle/bin/java -cp /path/to/spark/spark-installations/spark-1.6.0-bin-hadoop2.6/conf/:/path/to/spark/spark-installations/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/path/to/spark/spark-installations/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/path/to/spark/spark-installations/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/path/to/spark/spark-installations/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar -Xms10G -Xmx10G org.apache.spark.deploy.SparkSubmit --master mesos://master.ourdomain.com:5050 --conf spark.driver.memory=10G --executor-memory 100G --total-executor-cores 90 pyspark-shell
> >
> > Thanks,
> > June Taylor
> > System Administrator, Minnesota Population Center
> > University of Minnesota
> >
> > On Thu, Apr 7, 2016 at 3:37 PM, Greg Mann <[email protected]> wrote:
> > > Hi June,
> > > Are these Spark tasks being run in cluster mode or client mode? If it's client mode, then perhaps your local Spark scheduler is tearing itself down before the executors exit, thus leaving them orphaned.
> > >
> > > I'd love to see master/agent logs during the time that the tasks are becoming orphaned, if you have them available.
> > >
> > > Cheers,
> > > Greg
> > >
> > > On Thu, Apr 7, 2016 at 1:08 PM, June Taylor <[email protected]> wrote:
> > > > Just a quick update... I was only able to get the orphans cleared by stopping mesos-slave, deleting the contents of the scratch directory, and then restarting mesos-slave.
> > > >
> > > > Thanks,
> > > > June Taylor
> > > > System Administrator, Minnesota Population Center
> > > > University of Minnesota
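In shell terms, the manual cleanup June describes above amounts to something like this on each affected agent. The service name and work_dir path below are only common defaults, not necessarily what her installation uses:

    # stop the agent, clear its work_dir ("scratch directory"), and start it again
    # NOTE: the service name and /var/lib/mesos are assumptions -- use your own --work_dir
    # NOTE: this wipes the sandboxes of every task on the agent, not just the orphans
    sudo service mesos-slave stop
    sudo rm -rf /var/lib/mesos/*
    sudo service mesos-slave start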
> > > >
> > > > On Thu, Apr 7, 2016 at 12:01 PM, Vinod Kone <[email protected]> wrote:
> > > > > A task/executor is called "orphaned" if the corresponding scheduler doesn't register with Mesos. Is your framework scheduler still running, or gone for good? The resources should be cleaned up once the agent (and consequently the master) have realized that the executor exited.
> > > > >
> > > > > Can you paste the master and agent logs for one of the orphaned tasks/executors (grep the log with the task/executor id)?
> > > > >
> > > > > On Thu, Apr 7, 2016 at 9:00 AM, haosdent <[email protected]> wrote:
> > > > > > Hmm, sorry I didn't express my idea clearly. I meant kill those orphan tasks here.
> > > > > >
> > > > > > On Thu, Apr 7, 2016 at 11:57 PM, June Taylor <[email protected]> wrote:
> > > > > > > Forgive my ignorance, are you literally saying I should just sigkill these instances? How will that clean up the mesos orphans?
> > > > > > >
> > > > > > > Thanks,
> > > > > > > June Taylor
> > > > > > > System Administrator, Minnesota Population Center
> > > > > > > University of Minnesota
> > > > > > >
> > > > > > > On Thu, Apr 7, 2016 at 10:44 AM, haosdent <[email protected]> wrote:
> > > > > > > > Suppose your --work_dir is /tmp/mesos. Then you could run
> > > > > > > >
> > > > > > > > $ find /tmp/mesos -name $YOUR_EXECUTOR_ID
> > > > > > > >
> > > > > > > > to get a list of folders, and then use lsof on them.
> > > > > > > >
> > > > > > > > As an example, my executor id is "test" here:
> > > > > > > >
> > > > > > > > $ find /tmp/mesos/ -name 'test'
> > > > > > > > /tmp/mesos/0/slaves/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-S0/frameworks/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-0002/executors/test
> > > > > > > >
> > > > > > > > When I execute lsof on /tmp/mesos/0/slaves/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-S0/frameworks/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-0002/executors/test/runs/latest/ (keep in mind I append runs/latest here), I can see the pid list:
> > > > > > > >
> > > > > > > > COMMAND     PID     USER  FD  TYPE DEVICE SIZE/OFF       NODE NAME
> > > > > > > > mesos-exe 21811 haosdent  cwd   DIR    8,3        6 3221463220 /tmp/mesos/0/slaves/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-S0/frameworks/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-0003/executors/test/runs/efecb119-1019-4629-91ab-fec7724a0f11
> > > > > > > > sleep     21847 haosdent  cwd   DIR    8,3        6 3221463220 /tmp/mesos/0/slaves/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-S0/frameworks/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-0003/executors/test/runs/efecb119-1019-4629-91ab-fec7724a0f11
> > > > > > > >
> > > > > > > > Kill all of them.
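Put together, haosdent's suggestion boils down to something like the loop below, run on each slave. This is only a sketch: the work_dir and executor id are placeholders for your own values.

    # find the orphaned executor's sandbox under the agent work_dir, then kill
    # every process that still has a file or working directory open under it
    WORK_DIR=/tmp/mesos        # whatever --work_dir points at
    EXECUTOR_ID=test           # your own executor id
    for dir in $(find "$WORK_DIR" -type d -name "$EXECUTOR_ID"); do
        lsof -t +D "$dir/runs/latest" | xargs -r kill
    done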
> > > > > > > >
> > > > > > > > On Thu, Apr 7, 2016 at 11:23 PM, June Taylor <[email protected]> wrote:
> > > > > > > > > I do have the executor ID. Can you advise how to kill it?
> > > > > > > > >
> > > > > > > > > I have one master and three slaves. Each slave has one of these orphans.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > June Taylor
> > > > > > > > > System Administrator, Minnesota Population Center
> > > > > > > > > University of Minnesota
> > > > > > > > >
> > > > > > > > > On Thu, Apr 7, 2016 at 10:14 AM, haosdent <[email protected]> wrote:
> > > > > > > > > > > Going to this slave I can find an executor within the mesos working directory which matches this framework ID
> > > > > > > > > >
> > > > > > > > > > The quickest way here is to use kill on the slave, if you can find the mesos-executor pid. You can use lsof/fuser, or dig through the logs, to find the executor pid.
> > > > > > > > > >
> > > > > > > > > > However, it still seems odd given your feedback. Do you have multiple masters, and did a failover happen on your master? In that case the slave might not connect to the new master and the tasks would become orphaned.
> > > > > > > > > >
> > > > > > > > > > On Thu, Apr 7, 2016 at 11:06 PM, June Taylor <[email protected]> wrote:
> > > > > > > > > > > Here is one of three orphaned tasks (first two octets of the IP removed):
> > > > > > > > > > >
> > > > > > > > > > > "orphan_tasks": [
> > > > > > > > > > >     {
> > > > > > > > > > >         "executor_id": "",
> > > > > > > > > > >         "name": "Task 1",
> > > > > > > > > > >         "framework_id": "14cddded-e692-4838-9893-6e04a81481d8-0006",
> > > > > > > > > > >         "state": "TASK_RUNNING",
> > > > > > > > > > >         "statuses": [
> > > > > > > > > > >             {
> > > > > > > > > > >                 "timestamp": 1459887295.05554,
> > > > > > > > > > >                 "state": "TASK_RUNNING",
> > > > > > > > > > >                 "container_status": {
> > > > > > > > > > >                     "network_infos": [
> > > > > > > > > > >                         {
> > > > > > > > > > >                             "ip_addresses": [
> > > > > > > > > > >                                 {
> > > > > > > > > > >                                     "ip_address": "xxx.xxx.163.205"
> > > > > > > > > > >                                 }
> > > > > > > > > > >                             ],
> > > > > > > > > > >                             "ip_address": "xxx.xxx.163.205"
> > > > > > > > > > >                         }
> > > > > > > > > > >                     ]
> > > > > > > > > > >                 }
> > > > > > > > > > >             }
> > > > > > > > > > >         ],
> > > > > > > > > > >         "slave_id": "182cf09f-0843-4736-82f1-d913089d7df4-S83",
> > > > > > > > > > >         "id": "1",
> > > > > > > > > > >         "resources": {
> > > > > > > > > > >             "mem": 112640.0,
> > > > > > > > > > >             "disk": 0.0,
> > > > > > > > > > >             "cpus": 30.0
> > > > > > > > > > >         }
> > > > > > > > > > >     }
> > > > > > > > > > > ]
> > > > > > > > > > >
> > > > > > > > > > > Going to this slave, I can find an executor within the mesos working directory which matches this framework ID. Reviewing the stdout messaging within indicates that the program has finished its work, but it is still holding these resources open.
> > > > > > > > > > >
> > > > > > > > > > > This framework ID is not shown as Active in the main Mesos web UI, but does show up if you display the slave's web UI.
> > > > > > > > > > >
> > > > > > > > > > > The resources consumed count towards the Idle pool, and have resulted in zero available resources for other offers.
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > June Taylor
> > > > > > > > > > > System Administrator, Minnesota Population Center
> > > > > > > > > > > University of Minnesota
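To pull the logs Vinod and Greg asked about, grepping by the framework id from the orphan_tasks entry above is usually enough. /var/log/mesos below is only a common default; substitute wherever your --log_dir points:

    # on the master and on the affected slaves, using the framework id from the JSON above
    grep 14cddded-e692-4838-9893-6e04a81481d8-0006 /var/log/mesos/mesos-master.INFO
    grep 14cddded-e692-4838-9893-6e04a81481d8-0006 /var/log/mesos/mesos-slave.INFO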
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Apr 7, 2016 at 9:46 AM, haosdent <[email protected]> wrote:
> > > > > > > > > > > > > pyspark executors hanging around and consuming resources marked as Idle in mesos Web UI
> > > > > > > > > > > >
> > > > > > > > > > > > Do you have some logs about this?
> > > > > > > > > > > >
> > > > > > > > > > > > > is there an API call I can make to kill these orphans?
> > > > > > > > > > > >
> > > > > > > > > > > > As far as I know, the mesos agent tries to clean up orphan containers when it restarts. But I'm not sure the orphans I mean here are the same as yours.
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, Apr 7, 2016 at 10:21 PM, June Taylor <[email protected]> wrote:
> > > > > > > > > > > > > Greetings mesos users!
> > > > > > > > > > > > >
> > > > > > > > > > > > > I am debugging an issue with pyspark executors hanging around and consuming resources marked as Idle in the mesos web UI. These tasks also show up in the orphan_tasks key in `mesos state`.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I'm first wondering how to clear them out - is there an API call I can make to kill these orphans? And secondly, how it happened at all.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > June Taylor
> > > > > > > > > > > > > System Administrator, Minnesota Population Center
> > > > > > > > > > > > > University of Minnesota
> > > > > > > > > > > >
> > > > > > > > > > > > --
> > > > > > > > > > > > Best Regards,
> > > > > > > > > > > > Haosdent Huang
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Best Regards,
> > > > > > > > > > Haosdent Huang
> > > > > > > >
> > > > > > > > --
> > > > > > > > Best Regards,
> > > > > > > > Haosdent Huang
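For what it's worth, the orphan list June mentions can be read straight off the master's state endpoint, which is also a quick way to confirm the entries disappear after cleanup. The hostname below is just the one from her command, and jq is assumed to be installed:

    $ curl -s http://master.ourdomain.com:5050/master/state | jq '.orphan_tasks'
    # older Mesos releases expose the same data as /master/state.json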

