On this note, the Ganglia web front end that runs on the master (assuming 
you're launching with the EC2 scripts) is great for this. 

Also, a common technique for diagnosing which step is slow is to call '.cache()' 
and then '.count()' on the RDD produced by each step. The count forces the RDD 
to be materialized, which sidesteps the lazy evaluation that otherwise makes 
this kind of diagnosis hard. 
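
For instance, here's a rough sketch of what that might look like in spark-shell 
(note that 'left', 'right', and the flatMap body below are toy placeholders, 
not anything from your actual job):

    // Toy placeholder inputs; substitute your real RDDs.
    val left  = sc.parallelize(Seq(1 -> "a", 2 -> "b", 3 -> "c"))
    val right = sc.parallelize(Seq(1 -> 10, 3 -> 30))

    // Small helper to time a single action.
    def timed[T](label: String)(body: => T): T = {
      val start = System.nanoTime()
      val result = body
      println(f"$label took ${(System.nanoTime() - start) / 1e9}%.2f s")
      result
    }

    val joined = left.leftOuterJoin(right).cache()
    timed("leftOuterJoin") { joined.count() }   // materializes just the join

    val flattened = joined.flatMap { case (k, (v, opt)) =>
      Seq(s"$k:$v:${opt.getOrElse(-1)}")        // stand-in for the real flatMap function
    }.cache()
    timed("flatMap") { flattened.count() }      // isolates the flatMap step

    timed("collect") { flattened.collect() }    // whatever is left is the collect itself

The step whose count dominates the wall-clock time is the one to dig into; once 
everything upstream is cached, the final collect measures little more than the 
cost of shipping results back to the driver.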

- Evan

> On Jan 8, 2014, at 2:57 PM, Andrew Ash <[email protected]> wrote:
> 
> My first thought on hearing that you're calling collect is that bringing all 
> the data back to the driver is network-intensive. Try checking the basic 
> system metrics on the machines to get a sense of what's being heavily used:
> 
> disk IO
> CPU
> network
> 
> Any kind of distributed system monitoring framework should be able to handle 
> these sorts of things.
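> 
> A quick sanity check on that front: count before you collect, and pull a small 
> sample first. This is just a sketch, with 'results' standing in for the RDD 
> your flatMap produces:
> 
>     // 'results' is a placeholder for the RDD produced by the flatMap step.
>     val n = results.count()
>     println(s"about to collect $n records to the driver")
> 
>     // If take() returns quickly but collect() crawls, shipping the full
>     // result set back to the driver is a likely suspect.
>     val sample = results.take(100)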
> 
> Cheers!
> Andrew
> 
> 
>> On Wed, Jan 8, 2014 at 1:49 PM, Yann Luppo <[email protected]> wrote:
>> Hi,
>> 
>> I have what I hope is a simple question. What's a typical approach to 
>> diagnosing performance issues on a Spark cluster?
>> We've followed all the pertinent parts of the following document already: 
>> http://spark.incubator.apache.org/docs/latest/tuning.html
>> But we still seem to have issues. More specifically, we have a leftOuterJoin 
>> followed by a flatMap and then a collect that is running a bit long.
>> 
>> How would I go about determining the bottleneck operation(s)? 
>> Is our leftOuterJoin taking a long time? 
>> Is the function we pass to the flatMap not optimized?
>> 
>> Thanks,
>> Yann
> 
