While I was waiting for more info, the app finally did start up. I am now trying to figure out why it took so long.

Thanks,
June Taylor
System Administrator, Minnesota Population Center
University of Minnesota

On Mon, Apr 11, 2016 at 9:50 AM, haosdent <[email protected]> wrote:

> Could you find marathon on the http://${YOUR_MASTER_IP}:${YOUR_MASTER_PORT}/#/frameworks page? And:
>
> >While deploying I am looking at mesos-master.WARNING, mesos-master.INFO and mesos-master.ERROR log files, but I never see anything show up that would indicate a problem, or even an attempt.
>
> When you create a new task in marathon, can you see any related logs in the mesos master?
>
> On Mon, Apr 11, 2016 at 10:11 PM, June Taylor <[email protected]> wrote:
>
>> Hello again. I am not sure this has been resolved yet, because I am still unable to get Marathon deployments to start.
>>
>> I have deleted the /marathon/ node from Zookeeper, and I now have the Marathon WebUI accessible again. I try to add a new task to deploy, and there seem to be available resources, but it is still stuck in a 'Waiting' status.
>>
>> While deploying I am looking at the mesos-master.WARNING, mesos-master.INFO and mesos-master.ERROR log files, but I never see anything show up that would indicate a problem, or even an attempt.
>>
>> Where am I going wrong?
>>
>> Thanks,
>> June Taylor
>> System Administrator, Minnesota Population Center
>> University of Minnesota
>>
>> On Sat, Apr 9, 2016 at 6:07 AM, Pradeep Chhetri <[email protected]> wrote:
>>
>>> Hi Greg & June,
>>>
>>> Looking at the command above, I can say that you are running spark in client mode, because you are invoking the pyspark-shell.
>>>
>>> One simple way to distinguish the two is that in cluster mode it is mandatory to start the MesosClusterDispatcher in your mesos cluster, which acts as the spark framework scheduler.
>>>
>>> As everyone said above, I guess the reason you are observing orphaned tasks is that the scheduler is getting killed before the tasks finish.
>>>
>>> I would suggest June run Spark in cluster mode (http://spark.apache.org/docs/latest/running-on-mesos.html#cluster-mode).
>>>
>>> Also, as Radek suggested above, run spark in coarse-grained mode (the default run mode), which will save you much of the JVM startup time.
>>>
>>> Keep us informed of how it goes.
>>>
>>> On Sat, Apr 9, 2016 at 12:28 AM, Rad Gruchalski <[email protected]> wrote:
>>>
>>>> Greg,
>>>>
>>>> All you need to do is tell Spark that the master is mesos://…, as in the example from June. It's all nicely documented here:
>>>>
>>>> http://spark.apache.org/docs/latest/running-on-mesos.html
>>>>
>>>> I'd suggest running in coarse mode, as fine-grained is a bit choppy.
>>>>
>>>> Best regards,
>>>> Radek Gruchalski
>>>> [email protected] <[email protected]>
>>>> de.linkedin.com/in/radgruchalski/
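
[Aside: a minimal sketch of the cluster-mode setup Pradeep and Radek describe, reusing the master URL from June's command. The dispatcher host name, its default port 7077, and the job URI are placeholders/assumptions, and an interactive pyspark shell always keeps its driver in client mode regardless.]

  # Start the MesosClusterDispatcher once somewhere in the cluster; it registers with Mesos
  # as a persistent framework and accepts cluster-mode submissions (script ships with Spark 1.4+):
  $ ./sbin/start-mesos-dispatcher.sh --master mesos://master.ourdomain.com:5050

  # Submit batch jobs against the dispatcher rather than the Mesos master directly.
  # Whether a .py batch job can use cluster mode depends on the Spark version; see the
  # running-on-mesos page linked above. spark.mesos.coarse=true forces coarse-grained mode.
  $ ./bin/spark-submit \
      --master mesos://dispatcher-host.ourdomain.com:7077 \
      --deploy-mode cluster \
      --conf spark.mesos.coarse=true \
      --driver-memory 10G --executor-memory 100G --total-executor-cores 90 \
      http://artifacts.ourdomain.com/jobs/my_job.py
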
>>>>
>>>> On Saturday, 9 April 2016 at 00:48, Greg Mann wrote:
>>>>
>>>> Unfortunately I'm not able to glean much from that command, but perhaps someone out there with more Spark experience can? I do know that there are a couple of ways to launch Spark jobs on a cluster: you can run them in client mode, where the Spark driver runs locally on your machine and exits when it's finished, or they can be run in cluster mode, where the Spark driver runs persistently on the cluster as a Mesos framework. How exactly are you launching these tasks on the Mesos cluster?
>>>>
>>>> On Fri, Apr 8, 2016 at 5:41 AM, June Taylor <[email protected]> wrote:
>>>>
>>>> Greg,
>>>>
>>>> I'm on the ops side and fairly new to spark/mesos, so I'm not quite sure I understand your question; here's how the task shows up in a process listing:
>>>>
>>>> /usr/lib/jvm/java-8-oracle/bin/java -cp /path/to/spark/spark-installations/spark-1.6.0-bin-hadoop2.6/conf/:/path/to/spark/spark-installations/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/path/to/spark/spark-installations/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/path/to/spark/spark-installations/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/path/to/spark/spark-installations/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar -Xms10G -Xmx10G org.apache.spark.deploy.SparkSubmit --master mesos://master.ourdomain.com:5050 --conf spark.driver.memory=10G --executor-memory 100G --total-executor-cores 90 pyspark-shell
>>>>
>>>> Thanks,
>>>> June Taylor
>>>> System Administrator, Minnesota Population Center
>>>> University of Minnesota
>>>>
>>>> On Thu, Apr 7, 2016 at 3:37 PM, Greg Mann <[email protected]> wrote:
>>>>
>>>> Hi June,
>>>>
>>>> Are these Spark tasks being run in cluster mode or client mode? If it's client mode, then perhaps your local Spark scheduler is tearing itself down before the executors exit, thus leaving them orphaned.
>>>>
>>>> I'd love to see master/agent logs from the time the tasks are becoming orphaned, if you have them available.
>>>>
>>>> Cheers,
>>>> Greg
>>>>
>>>> On Thu, Apr 7, 2016 at 1:08 PM, June Taylor <[email protected]> wrote:
>>>>
>>>> Just a quick update... I was only able to get the orphans cleared by stopping mesos-slave, deleting the contents of the scratch directory, and then restarting mesos-slave.
>>>>
>>>> Thanks,
>>>> June Taylor
>>>> System Administrator, Minnesota Population Center
>>>> University of Minnesota
>>>>
>>>> On Thu, Apr 7, 2016 at 12:01 PM, Vinod Kone <[email protected]> wrote:
>>>>
>>>> A task/executor is called "orphaned" if the corresponding scheduler doesn't register with Mesos. Is your framework scheduler running, or gone for good? The resources should be cleaned up once the agent (and consequently the master) has realized that the executor exited.
>>>>
>>>> Can you paste the master and agent logs for one of the orphaned tasks/executors (grep the logs for the task/executor id)?
>>>>
>>>> On Thu, Apr 7, 2016 at 9:00 AM, haosdent <[email protected]> wrote:
>>>>
>>>> Hmm, sorry I didn't express my idea clearly. I mean kill those orphan tasks here.
>>>>
>>>> On Thu, Apr 7, 2016 at 11:57 PM, June Taylor <[email protected]> wrote:
>>>>
>>>> Forgive my ignorance: are you literally saying I should just sigkill these instances? How will that clean up the mesos orphans?
>>>>
>>>> Thanks,
>>>> June Taylor
>>>> System Administrator, Minnesota Population Center
>>>> University of Minnesota
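
[Aside: a quick sketch of the log grep Vinod asks for above, assuming the stock package layout where the mesos-master.INFO / mesos-slave.INFO files mentioned earlier live under /var/log/mesos; the ID variables are placeholders.]

  # On the master, look for the task and framework IDs:
  $ grep -E "$YOUR_TASK_ID|$YOUR_FRAMEWORK_ID" /var/log/mesos/mesos-master.INFO
  # On the agent (slave) that ran the executor:
  $ grep "$YOUR_EXECUTOR_ID" /var/log/mesos/mesos-slave.INFO
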
>>>>
>>>> On Thu, Apr 7, 2016 at 10:44 AM, haosdent <[email protected]> wrote:
>>>>
>>>> Suppose your --work_dir is /tmp/mesos. Then you could run:
>>>>
>>>> $ find /tmp/mesos -name $YOUR_EXECUTOR_ID
>>>>
>>>> That gives you a list of folders, and you can then use lsof on them.
>>>>
>>>> As an example, my executor id is "test" here:
>>>>
>>>> $ find /tmp/mesos/ -name 'test'
>>>> /tmp/mesos/0/slaves/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-S0/frameworks/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-0002/executors/test
>>>>
>>>> When I then execute (keep in mind I append runs/latest):
>>>>
>>>> $ lsof /tmp/mesos/0/slaves/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-S0/frameworks/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-0002/executors/test/runs/latest/
>>>>
>>>> I can see the pid list:
>>>>
>>>> COMMAND   PID   USER     FD  TYPE DEVICE SIZE/OFF NODE       NAME
>>>> mesos-exe 21811 haosdent cwd DIR  8,3    6        3221463220 /tmp/mesos/0/slaves/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-S0/frameworks/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-0003/executors/test/runs/efecb119-1019-4629-91ab-fec7724a0f11
>>>> sleep     21847 haosdent cwd DIR  8,3    6        3221463220 /tmp/mesos/0/slaves/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-S0/frameworks/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-0003/executors/test/runs/efecb119-1019-4629-91ab-fec7724a0f11
>>>>
>>>> Kill all of them.
>>>>
>>>> On Thu, Apr 7, 2016 at 11:23 PM, June Taylor <[email protected]> wrote:
>>>>
>>>> I do have the executor ID. Can you advise how to kill it?
>>>>
>>>> I have one master and three slaves. Each slave has one of these orphans.
>>>>
>>>> Thanks,
>>>> June Taylor
>>>> System Administrator, Minnesota Population Center
>>>> University of Minnesota
>>>>
>>>> On Thu, Apr 7, 2016 at 10:14 AM, haosdent <[email protected]> wrote:
>>>>
>>>> >Going to this slave I can find an executor within the mesos working directory which matches this framework ID
>>>>
>>>> The quickest way here is to use kill on the slave, if you can find the mesos-executor id. You can use lsof/fuser, or dig through the logs, to find the executor pid.
>>>>
>>>> However, it still seems weird given your feedback. Do you have multiple masters, and did a failover happen on your master? In that case the slave could fail to connect to the new master and the tasks would become orphaned.
>>>>
>>>> On Thu, Apr 7, 2016 at 11:06 PM, June Taylor <[email protected]> wrote:
>>>>
>>>> Here is one of three orphaned tasks (first two octets of the IP removed):
>>>>
>>>> "orphan_tasks": [
>>>>     {
>>>>         "executor_id": "",
>>>>         "name": "Task 1",
>>>>         "framework_id": "14cddded-e692-4838-9893-6e04a81481d8-0006",
>>>>         "state": "TASK_RUNNING",
>>>>         "statuses": [
>>>>             {
>>>>                 "timestamp": 1459887295.05554,
>>>>                 "state": "TASK_RUNNING",
>>>>                 "container_status": {
>>>>                     "network_infos": [
>>>>                         {
>>>>                             "ip_addresses": [
>>>>                                 {
>>>>                                     "ip_address": "xxx.xxx.163.205"
>>>>                                 }
>>>>                             ],
>>>>                             "ip_address": "xxx.xxx.163.205"
>>>>                         }
>>>>                     ]
>>>>                 }
>>>>             }
>>>>         ],
>>>>         "slave_id": "182cf09f-0843-4736-82f1-d913089d7df4-S83",
>>>>         "id": "1",
>>>>         "resources": {
>>>>             "mem": 112640.0,
>>>>             "disk": 0.0,
>>>>             "cpus": 30.0
>>>>         }
>>>>     }
>>>>
>>>> Going to this slave, I can find an executor within the mesos working directory which matches this framework ID. Reviewing the stdout messaging within indicates the program has finished its work. But it is still holding these resources open.
>>>>
>>>> This framework ID is not shown as Active in the main Mesos Web UI, but it does show up if you display the slave's web UI.
>>>>
>>>> The resources consumed count towards the Idle pool, and have resulted in zero available resources for other offers.
>>>>
>>>> Thanks,
>>>> June Taylor
>>>> System Administrator, Minnesota Population Center
>>>> University of Minnesota
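
[Aside: the JSON above comes from the master's state endpoint. A hedged one-liner to pull just the orphans, assuming curl and jq are installed and reusing the ${YOUR_MASTER_IP} placeholder from earlier; older Mesos releases expose the endpoint as /state.json rather than /state.]

  $ curl -s http://${YOUR_MASTER_IP}:5050/state | jq '.orphan_tasks[] | {id, name, framework_id, slave_id, resources}'
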
>>>>
>>>> On Thu, Apr 7, 2016 at 9:46 AM, haosdent <[email protected]> wrote:
>>>>
>>>> > pyspark executors hanging around and consuming resources marked as Idle in mesos Web UI
>>>>
>>>> Do you have some logs about this?
>>>>
>>>> >is there an API call I can make to kill these orphans?
>>>>
>>>> As far as I know, the mesos agent tries to clean up orphan containers when it restarts, but I'm not sure the orphans I mean here are the same as yours.
>>>>
>>>> On Thu, Apr 7, 2016 at 10:21 PM, June Taylor <[email protected]> wrote:
>>>>
>>>> Greetings mesos users!
>>>>
>>>> I am debugging an issue with pyspark executors hanging around and consuming resources marked as Idle in the mesos Web UI. These tasks also show up under the orphan_tasks key in `mesos state`.
>>>>
>>>> I'm first wondering how to clear them out - is there an API call I can make to kill these orphans? Secondly, how did this happen at all?
>>>>
>>>> Thanks,
>>>> June Taylor
>>>> System Administrator, Minnesota Population Center
>>>> University of Minnesota
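
[Aside: pulling haosdent's find/lsof/kill steps from earlier in the thread into one sketch, assuming --work_dir=/tmp/mesos and that the executor ID is already known; verify the PID list before killing anything.]

  # Locate the orphaned executor's sandbox under the agent work_dir (take the first match):
  $ EXECUTOR_DIR=$(find /tmp/mesos -type d -name "$YOUR_EXECUTOR_ID" | head -n 1)
  # See which processes still have the latest run directory open (same lsof call as above):
  $ lsof "$EXECUTOR_DIR"/runs/latest/
  # Once confirmed, kill them by PID (-t prints bare PIDs; xargs -r skips an empty list):
  $ lsof -t "$EXECUTOR_DIR"/runs/latest/ | xargs -r kill
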

