Hi Greg & June,

Looking at the command above, I can tell that you are running Spark in
client mode, because it is invoking pyspark-shell.

One simple way to distinguish the two: in cluster mode it is mandatory to
start the MesosClusterDispatcher in your Mesos cluster, which acts as the
Spark framework scheduler.
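
For reference, here is a minimal sketch of starting the dispatcher from the
Spark distribution; the master host/port below are placeholders for your setup:

# Run from the Spark installation directory on a node that can reach the Mesos master.
# The dispatcher registers with Mesos as a framework and accepts cluster-mode submissions.
$ ./sbin/start-mesos-dispatcher.sh --master mesos://master.ourdomain.com:5050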

As others noted above, I suspect the reason you are observing orphaned tasks
is that the scheduler is getting killed before the tasks finish.

I would suggest that June run Spark in cluster mode (
http://spark.apache.org/docs/latest/running-on-mesos.html#cluster-mode).

Also, as Radek suggested above, run Spark in coarse-grained mode
(spark.mesos.coarse=true), which will save you much of the JVM startup time.
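
A rough cluster-mode submission sketch, assuming a dispatcher like the one
above listening on its default port 7077 (the example class, jar URL and
resource sizes are placeholders; in cluster mode the application artifact has
to be reachable from the cluster, e.g. over http:// or hdfs://):

# Submit to the MesosClusterDispatcher, not directly to the Mesos master.
$ ./bin/spark-submit \
    --class org.apache.spark.examples.SparkPi \
    --master mesos://dispatcher.ourdomain.com:7077 \
    --deploy-mode cluster \
    --conf spark.mesos.coarse=true \
    --executor-memory 10G \
    --total-executor-cores 30 \
    http://path.to/spark-examples.jar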

Keep us informed of how it goes.


On Sat, Apr 9, 2016 at 12:28 AM, Rad Gruchalski <[email protected]>
wrote:

> Greg,
>
> All you need to do is tell Spark that the master is mesos://…, as in the
> example from June.
> It’s all nicely documented here:
>
> http://spark.apache.org/docs/latest/running-on-mesos.html
>
> I'd suggest running in coarse-grained mode, as fine-grained is a bit choppy.
>
> Best regards,
> Radek Gruchalski
> [email protected] <[email protected]>
> de.linkedin.com/in/radgruchalski/
>
>
> *Confidentiality:* This communication is intended for the above-named
> person and may be confidential and/or legally privileged.
> If it has come to you in error you must take no action based on it, nor
> must you copy or show it to anyone; please delete/destroy and inform the
> sender immediately.
>
> On Saturday, 9 April 2016 at 00:48, Greg Mann wrote:
>
> Unfortunately I'm not able to glean much from that command, but perhaps
> someone out there with more Spark experience can? I do know that there are
> a couple of ways to launch Spark jobs on a cluster: you can run them in
> client mode, where the Spark driver runs locally on your machine and exits
> when it's finished, or in cluster mode, where the Spark driver runs
> persistently on the cluster as a Mesos framework. How exactly are you
> launching these tasks on the Mesos cluster?
>
> On Fri, Apr 8, 2016 at 5:41 AM, June Taylor <[email protected]> wrote:
>
> Greg,
>
> I'm on the ops side and fairly new to Spark/Mesos, so I'm not quite sure I
> understand your question, but here's how the task shows up in a process listing:
>
> (a single command line, wrapped here for readability)
>
> /usr/lib/jvm/java-8-oracle/bin/java
>   -cp /path/to/spark/spark-installations/spark-1.6.0-bin-hadoop2.6/conf/
>       :/path/to/spark/spark-installations/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar
>       :/path/to/spark/spark-installations/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar
>       :/path/to/spark/spark-installations/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar
>       :/path/to/spark/spark-installations/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar
>   -Xms10G -Xmx10G
>   org.apache.spark.deploy.SparkSubmit
>   --master mesos://master.ourdomain.com:5050
>   --conf spark.driver.memory=10G
>   --executor-memory 100G
>   --total-executor-cores 90
>   pyspark-shell
>
>
> Thanks,
> June Taylor
> System Administrator, Minnesota Population Center
> University of Minnesota
>
> On Thu, Apr 7, 2016 at 3:37 PM, Greg Mann <[email protected]> wrote:
>
> Hi June,
> Are these Spark tasks being run in cluster mode or client mode? If it's
> client mode, then perhaps your local Spark scheduler is tearing itself down
> before the executors exit, thus leaving them orphaned.
>
> I'd love to see master/agent logs during the time that the tasks are
> becoming orphaned if you have them available.
>
> Cheers,
> Greg
>
>
> On Thu, Apr 7, 2016 at 1:08 PM, June Taylor <[email protected]> wrote:
>
> Just a quick update... I was only able to get the orphans cleared by
> stopping mesos-slave, deleting the contents of the scratch directory, and
> then restarting mesos-slave.
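>
> (Roughly, and assuming the agent work_dir mentioned earlier in the thread,
> /tmp/mesos, plus a systemd-managed service; adjust both to your install:)
>
> $ sudo systemctl stop mesos-slave
> $ sudo rm -rf /tmp/mesos/*     # clear the agent work_dir ("scratch directory")
> $ sudo systemctl start mesos-slave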
>
>
> Thanks,
> June Taylor
> System Administrator, Minnesota Population Center
> University of Minnesota
>
> On Thu, Apr 7, 2016 at 12:01 PM, Vinod Kone <[email protected]> wrote:
>
> A task/executor is called "orphaned" if the corresponding scheduler
> doesn't register with Mesos. Is your framework scheduler still running, or is
> it gone for good? The resources should be cleaned up once the agent (and
> consequently the master) has realized that the executor exited.
>
> Can you paste the master and agent logs for one of the orphaned
> tasks/executors (grep the logs for the task/executor id)?
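>
> (Something along these lines, assuming the logs live under /var/log/mesos;
> the exact location depends on the --log_dir your master/agent were started
> with, and the id below is the framework id from the listing further down the
> thread:)
>
> $ grep '14cddded-e692-4838-9893-6e04a81481d8-0006' /var/log/mesos/mesos-master.INFO
> $ grep '14cddded-e692-4838-9893-6e04a81481d8-0006' /var/log/mesos/mesos-slave.INFO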
>
> On Thu, Apr 7, 2016 at 9:00 AM, haosdent <[email protected]> wrote:
>
> Hmm, sorry for not expressing my idea clearly. I meant kill those orphan
> tasks here.
>
> On Thu, Apr 7, 2016 at 11:57 PM, June Taylor <[email protected]> wrote:
>
> Forgive my ignorance, but are you literally saying I should just SIGKILL
> these instances? How will that clean up the Mesos orphans?
>
>
> Thanks,
> June Taylor
> System Administrator, Minnesota Population Center
> University of Minnesota
>
> On Thu, Apr 7, 2016 at 10:44 AM, haosdent <[email protected]> wrote:
>
> Suppose your --work_dir is /tmp/mesos. Then you could run:
>
> $ find /tmp/mesos -name $YOUR_EXECUTOR_ID
>
> That gives you a list of folders, and you can then use lsof on them.
>
> As an example, my executor id is "test" here.
>
> $ find /tmp/mesos/ -name 'test'
>
> /tmp/mesos/0/slaves/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-S0/frameworks/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-0002/executors/test
>
> When I execute (keep in mind I appended runs/latest here):
>
> $ lsof /tmp/mesos/0/slaves/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-S0/frameworks/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-0002/executors/test/runs/latest/
>
> then I can see the pid list:
>
> COMMAND     PID      USER   FD   TYPE DEVICE SIZE/OFF       NODE NAME
> mesos-exe 21811 haosdent  cwd    DIR    8,3        6 3221463220
> /tmp/mesos/0/slaves/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-S0/frameworks/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-0003/executors/test/runs/efecb119-1019-4629-91ab-fec7724a0f11
> sleep     21847 haosdent  cwd    DIR    8,3        6 3221463220
> /tmp/mesos/0/slaves/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-S0/frameworks/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-0003/executors/test/runs/efecb119-1019-4629-91ab-fec7724a0f11
>
> Kill all of them.
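>
> (A compact variant of the same, assuming the run directory found above;
> lsof -t prints just the pids, so they can be fed straight to kill:)
>
> $ kill $(lsof -t /tmp/mesos/0/slaves/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-S0/frameworks/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-0002/executors/test/runs/latest/)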
>
> On Thu, Apr 7, 2016 at 11:23 PM, June Taylor <[email protected]> wrote:
>
> I do have the executor ID. Can you advise how to kill it?
>
> I have one master and three slaves. Each slave has one of these orphans.
>
>
> Thanks,
> June Taylor
> System Administrator, Minnesota Population Center
> University of Minnesota
>
> On Thu, Apr 7, 2016 at 10:14 AM, haosdent <[email protected]> wrote:
>
> > Going to this slave I can find an executor within the mesos working
> > directory which matches this framework ID
>
> The quickest way here is to use kill on the slave, if you can find the
> mesos-executor id. You can use lsof/fuser or dig through the logs to find
> the executor pid.
>
> However, it still seems weird given your feedback. Do you have multiple
> masters, and did a failover happen on your master? In that case the slave
> might not reconnect to the new master and the tasks would become orphans.
>
> On Thu, Apr 7, 2016 at 11:06 PM, June Taylor <[email protected]> wrote:
>
> Here is one of three orphaned tasks (first two octets of IP removed):
>
> "orphan_tasks": [
>         {
>             "executor_id": "",
>             "name": "Task 1",
>             "framework_id": "14cddded-e692-4838-9893-6e04a81481d8-0006",
>             "state": "TASK_RUNNING",
>             "statuses": [
>                 {
>                     "timestamp": 1459887295.05554,
>                     "state": "TASK_RUNNING",
>                     "container_status": {
>                         "network_infos": [
>                             {
>                                 "ip_addresses": [
>                                     {
>                                         "ip_address": "xxx.xxx.163.205"
>                                     }
>                                 ],
>                                 "ip_address": "xxx.xxx.163.205"
>                             }
>                         ]
>                     }
>                 }
>             ],
>             "slave_id": "182cf09f-0843-4736-82f1-d913089d7df4-S83",
>             "id": "1",
>             "resources": {
>                 "mem": 112640.0,
>                 "disk": 0.0,
>                 "cpus": 30.0
>             }
>         }
>
> Going to this slave, I can find an executor within the Mesos working
> directory which matches this framework ID. Reviewing the stdout within
> indicates the program has finished its work, but it is still holding these
> resources open.
>
> This framework ID is not shown as Active in the main Mesos Web UI, but
> does show up if you display the Slave's web UI.
>
> The resources consumed count towards the Idle pool, and have resulted in
> zero available resources for other Offers.
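>
> (For reference, the same listing can also be pulled straight from the master
> state endpoint; the hostname is a placeholder and jq is only used to filter:)
>
> $ curl -s http://master.ourdomain.com:5050/master/state.json | jq '.orphan_tasks'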
>
>
>
> Thanks,
> June Taylor
> System Administrator, Minnesota Population Center
> University of Minnesota
>
> On Thu, Apr 7, 2016 at 9:46 AM, haosdent <[email protected]> wrote:
>
> > pyspark executors hanging around and consuming resources marked as Idle
> > in the Mesos Web UI
>
> Do you have any logs for this?
>
> > is there an API call I can make to kill these orphans?
>
> As far as I know, the Mesos agent tries to clean up orphan containers when it
> restarts. But I'm not sure the orphans I mean here are the same as yours.
>
> On Thu, Apr 7, 2016 at 10:21 PM, June Taylor <[email protected]> wrote:
>
> Greetings mesos users!
>
> I am debugging an issue with pyspark executors hanging around and
> consuming resources marked as Idle in the Mesos Web UI. These tasks also show
> up under the orphan_tasks key in `mesos state`.
>
> First, I'm wondering how to clear them out: is there an API call I can
> make to kill these orphans? Second, how did this happen at all?
>
> Thanks,
> June Taylor
> System Administrator, Minnesota Population Center
> University of Minnesota
>
>
>
>
> --
> Best Regards,
> Haosdent Huang
>
>
>
>
>
> --
> Best Regards,
> Haosdent Huang
>
>
>
>
>
> --
> Best Regards,
> Haosdent Huang
>
>
>
>
>
> --
> Best Regards,
> Haosdent Huang
>
>
>
>
>
>
>
>


-- 
Regards,
Pradeep Chhetri
