Also, do you see any lines in the YARN NodeManager logs where it says that it's killing a container?
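
If the NodeManager is enforcing its memory limit, the kill shows up in its
log as something roughly like the following (I'm quoting from memory and
the exact wording varies by Hadoop version, so grepping the NodeManager
log for "Killing container" is the reliable way to find it):

    ContainersMonitorImpl: Container [pid=...,containerID=...] is running
    beyond physical memory limits. ... Killing container.

If you do find kills like that, the usual first remedy for Spark 1.x on
YARN is to give each executor more headroom beyond its heap, e.g. (a
sketch; the 2048 MB figure is just a starting guess for your setup):

    val conf = new org.apache.spark.SparkConf()
      // Off-heap headroom (in MB) added on top of the executor heap
      // when sizing the YARN container; too little of it is a common
      // cause of container kills.
      .set("spark.yarn.executor.memoryOverhead", "2048")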

-Sandy

On Wed, Feb 4, 2015 at 8:56 AM, Imran Rashid <iras...@cloudera.com> wrote:

> Hi Michael,
>
> Judging from the logs, it seems that those tasks are just working a
> really long time. If you have long-running tasks, then you wouldn't
> expect the driver to output anything while those tasks are working.
>
> What is unusual is that there is no activity during all that time the
> tasks are executing. Are you sure you are looking at the activity of
> the executors (the nodes that are actually running the tasks), and not
> the activity of the driver node (the node where your "main" program
> lives, but that doesn't do any of the distributed computation)? It
> would be perfectly normal for the driver node to be idle while all the
> executors were busy with long-running tasks.
>
> I would look at:
> (a) the CPU usage etc. of the executor nodes during those long-running
> tasks;
> (b) the thread dumps of the executors during those long-running tasks
> (available via the UI under the "Executors" tab, or just log into the
> boxes and run jstack). Ideally this will point out a hotspot in your
> code that is making these tasks take so long. (Or perhaps it'll point
> out what is going on in Spark internals that is so slow.)
> (c) the summary metrics for the long-running stage, when it finally
> finishes (also available in the UI, under the "Stages" tab). You will
> get a breakdown of how much time is spent in various phases of the
> tasks, how much data is read, etc., which can help you figure out why
> tasks are slow.
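>
> If you want the executor logs themselves to show progress during these
> long tasks, another cheap trick is to log a heartbeat from inside the
> closure. Something like this (an untested sketch; I'm assuming a Spark
> version where TaskContext.get() is available, i.e. 1.2+, and "records"
> is a placeholder for whatever RDD feeds the slow stage):
>
>     import org.apache.spark.TaskContext
>
>     records.mapPartitions { iter =>
>       val pid = TaskContext.get().partitionId()
>       var n = 0L
>       iter.map { rec =>
>         n += 1
>         // A heartbeat every million records, so the executor stderr
>         // shows whether the task is actually moving.
>         if (n % 1000000 == 0) println(s"partition $pid: $n records")
>         rec
>       }
>     }
>
> If the heartbeats keep advancing, the tasks are slow but alive; if they
> stop, a jstack taken at that moment should show exactly which call the
> tasks are blocked in.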
>
> Hopefully this will help you find out what is taking so long. If you
> find out the executors really aren't doing anything during these really
> long tasks, it would be great to find that out, and maybe get some more
> info for a bug report.
>
> Imran
>
>
> On Tue, Feb 3, 2015 at 6:18 PM, Michael Albert <
> m_albert...@yahoo.com.invalid> wrote:
>
>> Greetings!
>>
>> First, my sincere thanks to all who have given me advice.
>> Following previous discussion, I've rearranged my code to try to keep
>> the partitions to more manageable sizes.
>> Thanks to all who commented.
>>
>> At the moment, the input set I'm trying to work with is about 90GB
>> (avro parquet format).
>>
>> When I run on a reasonable chunk of the data (say half), things work
>> reasonably.
>>
>> On the full data, the Spark process stalls.
>> That is, for about 1.5 hours out of a 3.5-hour run, I see no activity:
>> no CPU usage, no error messages, no network activity.
>> It just seems to sit there.
>>
>> Any advice on how to diagnose this?
>> I don't get any error messages.
>> The Spark UI says that it is running a stage, but it makes no
>> discernible progress.
>> Ganglia shows no CPU usage or network activity.
>> When I shell into the worker nodes, there are no filled disks or other
>> obvious problems.
>>
>> How can I discern what Spark is waiting for?
>>
>> The only weird thing seen, other than the stall, is that the YARN logs
>> on the workers have lines with messages like this:
>>
>> 2015-02-03 22:59:58,890 INFO
>> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
>> (Container Monitor): Memory usage of ProcessTree 13158 for container-id
>> container_1422834185427_0083_01_000021: 7.1 GB of 8.5 GB physical
>> memory used; 7.6 GB of 42.5 GB virtual memory used
>>
>> It's rather strange that it mentions 42.5 GB of virtual memory. The
>> machines are EMR machines with 32 GB of physical memory and, as far as
>> I can determine, no swap space.
>>
>> The messages bracketing the stall are shown below.
>>
>> Any advice is welcome.
>>
>> Thanks!
>>
>> Sincerely,
>> Mike Albert
>>
>> Before the stall:
>>
>> 15/02/03 21:45:28 INFO cluster.YarnClientClusterScheduler: Removed
>> TaskSet 5.0, whose tasks have all completed, from pool
>> 15/02/03 21:45:28 INFO scheduler.DAGScheduler: Stage 5
>> (mapPartitionsWithIndex at Transposer.scala:147) finished in 4880.317 s
>> 15/02/03 21:45:28 INFO scheduler.DAGScheduler: looking for newly
>> runnable stages
>> 15/02/03 21:45:28 INFO scheduler.DAGScheduler: running: Set(Stage 3)
>> 15/02/03 21:45:28 INFO scheduler.DAGScheduler: waiting: Set(Stage 6,
>> Stage 7, Stage 8)
>> 15/02/03 21:45:28 INFO scheduler.DAGScheduler: failed: Set()
>> 15/02/03 21:45:28 INFO scheduler.DAGScheduler: Missing parents for
>> Stage 6: List(Stage 3)
>> 15/02/03 21:45:28 INFO scheduler.DAGScheduler: Missing parents for
>> Stage 7: List(Stage 6)
>> 15/02/03 21:45:28 INFO scheduler.DAGScheduler: Missing parents for
>> Stage 8: List(Stage 7)
>>
>> At this point, I see no activity for 1.5 hours except for this (XXX
>> for IP address):
>>
>> 15/02/03 22:13:24 INFO util.AkkaUtils: Connecting to ExecutorActor:
>> akka.tcp://sparkExecutor@ip-XXX.ec2.internal:36301/user/ExecutorActor
>>
>> Then finally it started again:
>>
>> 15/02/03 23:31:34 INFO scheduler.TaskSetManager: Finished task 1.0 in
>> stage 3.0 (TID 7301) in 7208259 ms on ip-10-171-0-124.ec2.internal (3/4)
>> 15/02/03 23:31:34 INFO scheduler.TaskSetManager: Finished task 0.0 in
>> stage 3.0 (TID 7300) in 7208503 ms on ip-10-171-0-128.ec2.internal (4/4)
>> 15/02/03 23:31:34 INFO scheduler.DAGScheduler: Stage 3 (mapPartitions
>> at Transposer.scala:211) finished in 7209.534 s
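
One more thing worth noting from the final log lines: stage 3 finished with
only four tasks (the "(3/4)" and "(4/4)" counters), each taking about
7,208,000 ms (roughly two hours). If those four tasks cover most of the
~90 GB input, that is on the order of 20 GB per task, which can easily
look like a stall from the driver's side: no task-completion events arrive
for hours even though the executors are working. If the thread dumps bear
that out, one option is to spread the work over many more, smaller
partitions before the expensive stage, e.g. (a sketch; "records" and the
count of 400 are placeholders):

    // Same data, many more partitions: per-task progress becomes visible
    // in the UI and per-task memory pressure drops.
    val repartitioned = records.repartition(400)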