Greg, I'm on the ops side and fairly new to Spark/Mesos, so I'm not quite sure I understand your question. Here's how the task shows up in a process listing:
/usr/lib/jvm/java-8-oracle/bin/java -cp /path/to/spark/spark-installations/spark-1.6.0-bin-hadoop2.6/conf/:/path/to/spark/spark-installations/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/path/to/spark/spark-installations/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/path/to/spark/spark-installations/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/path/to/spark/spark-installations/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar -Xms10G -Xmx10G org.apache.spark.deploy.SparkSubmit --master mesos://master.ourdomain.com:5050 --conf spark.driver.memory=10G --executor-memory 100G --total-executor-cores 90 pyspark-shell

Thanks,
June Taylor
System Administrator, Minnesota Population Center
University of Minnesota

On Thu, Apr 7, 2016 at 3:37 PM, Greg Mann <[email protected]> wrote:

> Hi June,
> Are these Spark tasks being run in cluster mode or client mode? If it's
> client mode, then perhaps your local Spark scheduler is tearing itself down
> before the executors exit, thus leaving them orphaned.
>
> I'd love to see master/agent logs from the time the tasks become
> orphaned, if you have them available.
>
> Cheers,
> Greg
>
> On Thu, Apr 7, 2016 at 1:08 PM, June Taylor <[email protected]> wrote:
>
>> Just a quick update... I was only able to get the orphans cleared by
>> stopping mesos-slave, deleting the contents of the scratch directory, and
>> then restarting mesos-slave.
>>
>> Thanks,
>> June Taylor
>> System Administrator, Minnesota Population Center
>> University of Minnesota
>>
>> On Thu, Apr 7, 2016 at 12:01 PM, Vinod Kone <[email protected]> wrote:
>>
>>> A task/executor is called "orphaned" if the corresponding scheduler
>>> doesn't register with Mesos. Is your framework scheduler running, or gone
>>> for good? The resources should be cleaned up once the agent (and
>>> consequently the master) have realized that the executor exited.
>>>
>>> Can you paste the master and agent logs for one of the orphaned
>>> tasks/executors (grep the log for the task/executor id)?
>>>
>>> On Thu, Apr 7, 2016 at 9:00 AM, haosdent <[email protected]> wrote:
>>>
>>>> Hmm, sorry I didn't express my idea clearly. I mean kill those orphan
>>>> tasks here.
>>>>
>>>> On Thu, Apr 7, 2016 at 11:57 PM, June Taylor <[email protected]> wrote:
>>>>
>>>>> Forgive my ignorance, but are you literally saying I should just
>>>>> sigkill these instances? How will that clean up the mesos orphans?
>>>>>
>>>>> Thanks,
>>>>> June Taylor
>>>>> System Administrator, Minnesota Population Center
>>>>> University of Minnesota
>>>>>
>>>>> On Thu, Apr 7, 2016 at 10:44 AM, haosdent <[email protected]> wrote:
>>>>>
>>>>>> Suppose your --work_dir=/tmp/mesos. Then you could run:
>>>>>>
>>>>>> $ find /tmp/mesos -name $YOUR_EXECUTOR_ID
>>>>>>
>>>>>> That gives you a list of folders, and you can then use lsof on them.
>>>>>>
>>>>>> As an example, my executor id is "test" here:
>>>>>>
>>>>>> $ find /tmp/mesos/ -name 'test'
>>>>>> /tmp/mesos/0/slaves/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-S0/frameworks/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-0002/executors/test
>>>>>>
>>>>>> When I execute:
>>>>>>
>>>>>> $ lsof /tmp/mesos/0/slaves/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-S0/frameworks/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-0002/executors/test/runs/latest/
>>>>>>
>>>>>> (keep in mind I appended runs/latest here),
>>>>>> I see the pid list:
>>>>>>
>>>>>> COMMAND   PID   USER     FD  TYPE DEVICE SIZE/OFF NODE       NAME
>>>>>> mesos-exe 21811 haosdent cwd DIR  8,3    6        3221463220 /tmp/mesos/0/slaves/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-S0/frameworks/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-0003/executors/test/runs/efecb119-1019-4629-91ab-fec7724a0f11
>>>>>> sleep     21847 haosdent cwd DIR  8,3    6        3221463220 /tmp/mesos/0/slaves/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-S0/frameworks/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-0003/executors/test/runs/efecb119-1019-4629-91ab-fec7724a0f11
>>>>>>
>>>>>> Kill all of them.
>>>>>>
>>>>>> On Thu, Apr 7, 2016 at 11:23 PM, June Taylor <[email protected]> wrote:
>>>>>>
>>>>>>> I do have the executor ID. Can you advise how to kill it?
>>>>>>>
>>>>>>> I have one master and three slaves. Each slave has one of these
>>>>>>> orphans.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> June Taylor
>>>>>>> System Administrator, Minnesota Population Center
>>>>>>> University of Minnesota
>>>>>>>
>>>>>>> On Thu, Apr 7, 2016 at 10:14 AM, haosdent <[email protected]> wrote:
>>>>>>>
>>>>>>>> > Going to this slave I can find an executor within the mesos
>>>>>>>> > working directory which matches this framework ID
>>>>>>>>
>>>>>>>> The quickest way here is to use kill on the slave, if you can find
>>>>>>>> the mesos-executor pid. You can use lsof/fuser, or dig through the
>>>>>>>> logs, to find the executor pid.
>>>>>>>>
>>>>>>>> However, it still seems weird given your feedback. Do you have
>>>>>>>> multiple masters, and did a failover happen on your master? In that
>>>>>>>> case the slave could not connect to the new master and the tasks
>>>>>>>> become orphaned.
>>>>>>>>
>>>>>>>> On Thu, Apr 7, 2016 at 11:06 PM, June Taylor <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Here is one of the three orphaned tasks (first two octets of the IP
>>>>>>>>> removed):
>>>>>>>>>
>>>>>>>>> "orphan_tasks": [
>>>>>>>>>     {
>>>>>>>>>         "executor_id": "",
>>>>>>>>>         "name": "Task 1",
>>>>>>>>>         "framework_id": "14cddded-e692-4838-9893-6e04a81481d8-0006",
>>>>>>>>>         "state": "TASK_RUNNING",
>>>>>>>>>         "statuses": [
>>>>>>>>>             {
>>>>>>>>>                 "timestamp": 1459887295.05554,
>>>>>>>>>                 "state": "TASK_RUNNING",
>>>>>>>>>                 "container_status": {
>>>>>>>>>                     "network_infos": [
>>>>>>>>>                         {
>>>>>>>>>                             "ip_addresses": [
>>>>>>>>>                                 {
>>>>>>>>>                                     "ip_address": "xxx.xxx.163.205"
>>>>>>>>>                                 }
>>>>>>>>>                             ],
>>>>>>>>>                             "ip_address": "xxx.xxx.163.205"
>>>>>>>>>                         }
>>>>>>>>>                     ]
>>>>>>>>>                 }
>>>>>>>>>             }
>>>>>>>>>         ],
>>>>>>>>>         "slave_id": "182cf09f-0843-4736-82f1-d913089d7df4-S83",
>>>>>>>>>         "id": "1",
>>>>>>>>>         "resources": {
>>>>>>>>>             "mem": 112640.0,
>>>>>>>>>             "disk": 0.0,
>>>>>>>>>             "cpus": 30.0
>>>>>>>>>         }
>>>>>>>>>     }
>>>>>>>>>
>>>>>>>>> Going to this slave, I can find an executor within the mesos working
>>>>>>>>> directory which matches this framework ID. Reviewing the stdout
>>>>>>>>> messaging within indicates the program has finished its work, but it
>>>>>>>>> is still holding these resources open.
>>>>>>>>>
>>>>>>>>> This framework ID is not shown as Active in the main Mesos Web UI,
>>>>>>>>> but it does show up if you display the slave's Web UI.
>>>>>>>>>
>>>>>>>>> The resources consumed count towards the Idle pool, and have
>>>>>>>>> resulted in zero available resources for other offers.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> June Taylor
>>>>>>>>> System Administrator, Minnesota Population Center
>>>>>>>>> University of Minnesota
>>>>>>>>>
>>>>>>>>> On Thu, Apr 7, 2016 at 9:46 AM, haosdent <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> > pyspark executors hanging around and consuming resources
>>>>>>>>>> > marked as Idle in mesos Web UI
>>>>>>>>>>
>>>>>>>>>> Do you have some logs about this?
>>>>>>>>>>
>>>>>>>>>> > is there an API call I can make to kill these orphans?
>>>>>>>>>>
>>>>>>>>>> As far as I know, the mesos agent tries to clean up orphan containers
>>>>>>>>>> when it restarts. But I'm not sure the orphans I mean here are the
>>>>>>>>>> same as yours.
>>>>>>>>>>
>>>>>>>>>> On Thu, Apr 7, 2016 at 10:21 PM, June Taylor <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Greetings mesos users!
>>>>>>>>>>>
>>>>>>>>>>> I am debugging an issue with pyspark executors hanging around and
>>>>>>>>>>> consuming resources marked as Idle in the mesos Web UI. These tasks
>>>>>>>>>>> also show up in the orphan_tasks key in `mesos state`.
>>>>>>>>>>>
>>>>>>>>>>> I'm first wondering how to clear them out - is there an API call I
>>>>>>>>>>> can make to kill these orphans? Secondly, how did this happen at all?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> June Taylor
>>>>>>>>>>> System Administrator, Minnesota Population Center
>>>>>>>>>>> University of Minnesota
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Best Regards,
>>>>>>>>>> Haosdent Huang
>>>>>>>>
>>>>>>>> --
>>>>>>>> Best Regards,
>>>>>>>> Haosdent Huang
>>>>>>
>>>>>> --
>>>>>> Best Regards,
>>>>>> Haosdent Huang
>>>>
>>>> --
>>>> Best Regards,
>>>> Haosdent Huang
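On June's question about "an API call I can make": the orphan_tasks JSON quoted in the thread is what the master's state endpoint reports, i.e. the same data `mesos state` reads. A minimal sketch for watching the orphans from a script, assuming the master shown in the process listing (master.ourdomain.com:5050) and the /master/state endpoint (some releases serve the same data at /master/state.json):

#!/usr/bin/env python
# Sketch: list orphaned tasks reported by the Mesos master's state endpoint.
import json
from urllib.request import urlopen

MASTER = "http://master.ourdomain.com:5050"  # assumption: master URL from the thread

with urlopen(MASTER + "/master/state") as resp:
    state = json.load(resp)

for task in state.get("orphan_tasks", []):
    res = task.get("resources", {})
    print("framework=%s slave=%s task=%s cpus=%s mem=%s" % (
        task["framework_id"], task["slave_id"], task["id"],
        res.get("cpus"), res.get("mem")))

The state endpoint is read-only; as the thread itself concludes, the orphans here were actually cleared on the agent side, either by killing the leftover executor processes or by restarting mesos-slave with a cleaned work directory.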

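haosdent's find/lsof/kill procedure can also be wrapped in a small script. A sketch only, assuming the agent's --work_dir is /tmp/mesos as in his example and that lsof is installed; `lsof -t PATH` prints just the PIDs of processes holding the sandbox path open (for example as their working directory), which is what his lsof output shows:

#!/usr/bin/env python
# Sketch: find an orphaned executor's sandbox under the agent work_dir and
# SIGKILL whatever still runs out of it. Run on the agent, as root if needed,
# e.g.: sudo python kill_orphan_executor.py <executor-id>  (script name is illustrative)
import os
import signal
import subprocess
import sys

WORK_DIR = "/tmp/mesos"    # assumption: the agent's --work_dir, as in haosdent's example
EXECUTOR_ID = sys.argv[1]  # the executor directory name found under .../executors/

for root, _dirs, _files in os.walk(WORK_DIR):
    # Executor sandboxes live at .../frameworks/<framework-id>/executors/<executor-id>
    if (os.path.basename(root) == EXECUTOR_ID
            and os.path.basename(os.path.dirname(root)) == "executors"):
        sandbox = os.path.join(root, "runs", "latest")
        try:
            # -t: terse output, PIDs only, for processes holding the sandbox open
            pids = subprocess.check_output(["lsof", "-t", sandbox]).split()
        except subprocess.CalledProcessError:
            continue  # lsof exits non-zero when nothing holds the path open
        for pid in pids:
            print("killing pid %d (holds %s)" % (int(pid), sandbox))
            os.kill(int(pid), signal.SIGKILL)

Killing the mesos-executor and its children found this way is exactly the "kill all of them" step in haosdent's reply, just automated per executor ID.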

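On Greg's client-mode hypothesis, the usual guard on the Spark side is to make sure the driver always stops its SparkContext, so the framework unregisters from Mesos before the driver process exits rather than leaving executors behind. A generic illustration rather than June's actual job, reusing the master URL and resource settings from the process listing (spark.executor.memory and spark.cores.max correspond to --executor-memory and --total-executor-cores; the app name is made up):

#!/usr/bin/env python
# Sketch: run a PySpark job with a guaranteed SparkContext shutdown so the
# Mesos framework unregisters before the driver exits.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("mesos://master.ourdomain.com:5050")  # master from the process listing
        .setAppName("orphan-free-job")                   # assumption: illustrative app name
        .set("spark.executor.memory", "100g")            # --executor-memory 100G
        .set("spark.cores.max", "90"))                   # --total-executor-cores 90

sc = SparkContext(conf=conf)
try:
    # ... the real job goes here; a trivial stand-in:
    print(sc.parallelize(range(1000)).sum())
finally:
    # Without an explicit stop, a driver that exits abruptly in client mode can
    # leave executors behind, which the master then reports as orphaned tasks.
    sc.stop()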