Also, do you see any lines in the YARN NodeManager logs where it says that it's killing a container?
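
If the NodeManager is enforcing its memory limit, the kill shows up in its
log as something roughly like the following (I'm quoting from memory and
the exact wording varies by Hadoop version, so grepping the NodeManager
log for "Killing container" is the reliable way to find it):

    ContainersMonitorImpl: Container [pid=...,containerID=...] is running
    beyond physical memory limits. ... Killing container.

If you do find kills like that, the usual first remedy for Spark 1.x on
YARN is to give each executor more headroom beyond its heap, e.g. (a
sketch; the 2048 MB figure is just a starting guess for your setup):

    val conf = new org.apache.spark.SparkConf()
      // Off-heap headroom (in MB) added on top of the executor heap
      // when sizing the YARN container; too little of it is a common
      // cause of container kills.
      .set("spark.yarn.executor.memoryOverhead", "2048")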

-Sandy

On Wed, Feb 4, 2015 at 8:56 AM, Imran Rashid <iras...@cloudera.com> wrote:

> Hi Michael,
>
> Judging from the logs, it seems that those tasks are just working a
> really long time. If you have long-running tasks, then you wouldn't
> expect the driver to output anything while those tasks are working.
>
> What is unusual is that there is no activity during all that time the
> tasks are executing. Are you sure you are looking at the activity of
> the executors (the nodes that are actually running the tasks), and not
> the activity of the driver node (the node where your "main" program
> lives, but that doesn't do any of the distributed computation)? It
> would be perfectly normal for the driver node to be idle while all the
> executors were busy with long-running tasks.
>
> I would look at:
> (a) the CPU usage etc. of the executor nodes during those long-running
> tasks;
> (b) the thread dumps of the executors during those long-running tasks
> (available via the UI under the "Executors" tab, or just log into the
> boxes and run jstack). Ideally this will point out a hotspot in your
> code that is making these tasks take so long. (Or perhaps it'll point
> out what is going on in Spark internals that is so slow.)
> (c) the summary metrics for the long-running stage, when it finally
> finishes (also available in the UI, under the "Stages" tab). You will
> get a breakdown of how much time is spent in various phases of the
> tasks, how much data is read, etc., which can help you figure out why
> tasks are slow.
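>
> If you want the executor logs themselves to show progress during these
> long tasks, another cheap trick is to log a heartbeat from inside the
> closure. Something like this (an untested sketch; I'm assuming a Spark
> version where TaskContext.get() is available, i.e. 1.2+, and "records"
> is a placeholder for whatever RDD feeds the slow stage):
>
>     import org.apache.spark.TaskContext
>
>     records.mapPartitions { iter =>
>       val pid = TaskContext.get().partitionId()
>       var n = 0L
>       iter.map { rec =>
>         n += 1
>         // A heartbeat every million records, so the executor stderr
>         // shows whether the task is actually moving.
>         if (n % 1000000 == 0) println(s"partition $pid: $n records")
>         rec
>       }
>     }
>
> If the heartbeats keep advancing, the tasks are slow but alive; if they
> stop, a jstack taken at that moment should show exactly which call the
> tasks are blocked in.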
>
> Hopefully this will help you find out what is taking so long. If you
> find out the executors really aren't doing anything during these really
> long tasks, it would be great to find that out, and maybe get some more
> info for a bug report.
>
> Imran
>
>
> On Tue, Feb 3, 2015 at 6:18 PM, Michael Albert <
> m_albert...@yahoo.com.invalid> wrote:
>
>> Greetings!
>>
>> First, my sincere thanks to all who have given me advice.
>> Following previous discussion, I've rearranged my code to try to keep
>> the partitions to more manageable sizes.
>> Thanks to all who commented.
>>
>> At the moment, the input set I'm trying to work with is about 90GB
>> (avro parquet format).
>>
>> When I run on a reasonable chunk of the data (say half), things work
>> reasonably.
>>
>> On the full data, the Spark process stalls.
>> That is, for about 1.5 hours out of a 3.5-hour run, I see no activity:
>> no CPU usage, no error messages, no network activity.
>> It just seems to sit there.
>>
>> Any advice on how to diagnose this?
>> I don't get any error messages.
>> The Spark UI says that it is running a stage, but it makes no
>> discernible progress.
>> Ganglia shows no CPU usage or network activity.
>> When I shell into the worker nodes, there are no filled disks or other
>> obvious problems.
>>
>> How can I discern what Spark is waiting for?
>>
>> The only weird thing seen, other than the stall, is that the YARN logs
>> on the workers have lines with messages like this:
>>
>> 2015-02-03 22:59:58,890 INFO
>> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
>> (Container Monitor): Memory usage of ProcessTree 13158 for container-id
>> container_1422834185427_0083_01_000021: 7.1 GB of 8.5 GB physical
>> memory used; 7.6 GB of 42.5 GB virtual memory used
>>
>> It's rather strange that it mentions 42.5 GB of virtual memory. The
>> machines are EMR machines with 32 GB of physical memory and, as far as
>> I can determine, no swap space.
>>
>> The messages bracketing the stall are shown below.
>>
>> Any advice is welcome.
>>
>> Thanks!
>>
>> Sincerely,
>> Mike Albert
>>
>> Before the stall:
>>
>> 15/02/03 21:45:28 INFO cluster.YarnClientClusterScheduler: Removed
>> TaskSet 5.0, whose tasks have all completed, from pool
>> 15/02/03 21:45:28 INFO scheduler.DAGScheduler: Stage 5
>> (mapPartitionsWithIndex at Transposer.scala:147) finished in 4880.317 s
>> 15/02/03 21:45:28 INFO scheduler.DAGScheduler: looking for newly
>> runnable stages
>> 15/02/03 21:45:28 INFO scheduler.DAGScheduler: running: Set(Stage 3)
>> 15/02/03 21:45:28 INFO scheduler.DAGScheduler: waiting: Set(Stage 6,
>> Stage 7, Stage 8)
>> 15/02/03 21:45:28 INFO scheduler.DAGScheduler: failed: Set()
>> 15/02/03 21:45:28 INFO scheduler.DAGScheduler: Missing parents for
>> Stage 6: List(Stage 3)
>> 15/02/03 21:45:28 INFO scheduler.DAGScheduler: Missing parents for
>> Stage 7: List(Stage 6)
>> 15/02/03 21:45:28 INFO scheduler.DAGScheduler: Missing parents for
>> Stage 8: List(Stage 7)
>>
>> At this point, I see no activity for 1.5 hours except for this (XXX
>> for IP address):
>>
>> 15/02/03 22:13:24 INFO util.AkkaUtils: Connecting to ExecutorActor:
>> akka.tcp://sparkExecutor@ip-XXX.ec2.internal:36301/user/ExecutorActor
>>
>> Then finally it started again:
>>
>> 15/02/03 23:31:34 INFO scheduler.TaskSetManager: Finished task 1.0 in
>> stage 3.0 (TID 7301) in 7208259 ms on ip-10-171-0-124.ec2.internal (3/4)
>> 15/02/03 23:31:34 INFO scheduler.TaskSetManager: Finished task 0.0 in
>> stage 3.0 (TID 7300) in 7208503 ms on ip-10-171-0-128.ec2.internal (4/4)
>> 15/02/03 23:31:34 INFO scheduler.DAGScheduler: Stage 3 (mapPartitions
>> at Transposer.scala:211) finished in 7209.534 s
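
One more thing worth noting from the final log lines: stage 3 finished with
only four tasks (the "(3/4)" and "(4/4)" counters), each taking about
7,208,000 ms (roughly two hours). If those four tasks cover most of the
~90 GB input, that is on the order of 20 GB per task, which can easily
look like a stall from the driver's side: no task-completion events arrive
for hours even though the executors are working. If the thread dumps bear
that out, one option is to spread the work over many more, smaller
partitions before the expensive stage, e.g. (a sketch; "records" and the
count of 400 are placeholders):

    // Same data, many more partitions: per-task progress becomes visible
    // in the UI and per-task memory pressure drops.
    val repartitioned = records.repartition(400)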