Thanks guys, that was really helpful in identifying the slow step, which in
our case is the leftOuterJoin.
I'm checking with our admins to see if we have some sort of distributed system 
monitoring in place, which I'm sure we do.

Now, just out of curiosity, what would be a good rule of thumb or general
guideline for the number of partitions and the number of reducers?
Should it be some factor of the number of cores available? Of the number of
nodes available? Should the number of partitions match the number of reducers,
or at least be a multiple of it, for better performance?
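
Just to make sure I'm asking about the right knobs, here's a toy sketch of
what I mean (spark-shell style, made-up data; the 128 / 64 are arbitrary
numbers, not a recommendation):

    // Toy data standing in for the real inputs.
    val bigLeft  = sc.parallelize(1 to 1000).map(i => (i % 10, i))
    val bigRight = sc.parallelize(1 to 1000).map(i => (i % 10, i * i))

    val joined = bigLeft.leftOuterJoin(bigRight, 128)  // explicit number of reduce-side partitions
    val fewer  = joined.repartition(64)                // reshuffle into a different partition count
    println(fewer.partitions.size)

    // Or a cluster-wide default, set before the SparkContext is created:
    // System.setProperty("spark.default.parallelism", "128")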

Thanks,
Yann

From: Evan Sparks <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, January 8, 2014 5:28 PM
To: "[email protected]" <[email protected]>
Cc: "[email protected]" <[email protected]>
Subject: Re: performance

On this note - the Ganglia web front end that runs on the master (assuming
you're launching with the EC2 scripts) is great for this.

Also, a common technique for diagnosing "which step is slow" is to run a
.cache() and a .count() on the RDD after each step. This forces the RDD to be
materialized, which gets around the lazy evaluation that can otherwise make
this kind of diagnosis hard.
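
Something along these lines - the names and data are made up, but this is the
shape of it (paste into spark-shell, where sc is already defined):

    // Toy stand-ins for the real RDDs.
    val left  = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
    val right = sc.parallelize(Seq((1, "x"), (3, "y")))

    def timed[A](label: String)(body: => A): A = {
      val t0 = System.currentTimeMillis()
      val result = body
      println(label + " took " + (System.currentTimeMillis() - t0) + " ms")
      result
    }

    // cache + count after each step materializes it, so each step can be timed in isolation.
    val joined = left.leftOuterJoin(right).cache()
    timed("leftOuterJoin")(joined.count())

    val flattened = joined.flatMap { case (k, (l, rOpt)) => rOpt.toSeq.map(r => (k, l, r)) }.cache()
    timed("flatMap")(flattened.count())   // the join is already cached, so this times the flatMap alone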

- Evan

On Jan 8, 2014, at 2:57 PM, Andrew Ash <[email protected]> wrote:

My first thought on hearing that you're calling collect is that taking all the 
data back to the driver is intensive on the network.  Try checking the basic 
systems stuff on the machines to get a sense of what's being heavily used:

- disk IO
- CPU
- network

Any kind of distributed system monitoring framework should be able to handle 
these sorts of things.
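
And if the collect itself does turn out to be the expensive part, bringing
less data back to the driver usually helps a lot. A rough sketch (spark-shell
style, toy RDD, made-up output path):

    // Toy RDD just to show the alternatives.
    val results = sc.parallelize(1 to 1000000).map(i => (i, i * 2))

    // results.collect() would ship every record back to the driver over the network.
    // When you don't actually need all of it on the driver:
    val sample = results.take(100)                        // only the first 100 records come back
    val total  = results.count()                          // only a single number comes back
    results.saveAsTextFile("hdfs:///tmp/results-sketch")  // hypothetical path; data stays on the cluster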

Cheers!
Andrew


On Wed, Jan 8, 2014 at 1:49 PM, Yann Luppo <[email protected]> wrote:
Hi,

I have what I hope is a simple question. What's a typical approach to
diagnosing performance issues on a Spark cluster?
We've already followed all the pertinent parts of the following document:
http://spark.incubator.apache.org/docs/latest/tuning.html
But we still seem to have issues. More specifically, we have a leftOuterJoin
followed by a flatMap and then a collect that runs a bit long.

How would I go about determining the bottleneck operation(s)?
Is our leftOuterJoin taking a long time?
Is the function we send to the flatMap not optimized?
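
For reference, the job looks roughly like this (heavily simplified,
placeholder names and data, not our real code):

    // Heavily simplified stand-in for the real job.
    val events = sc.parallelize(Seq((1, "click"), (2, "view"), (3, "click")))
    val users  = sc.parallelize(Seq((1, "alice"), (3, "bob")))

    val joined    = events.leftOuterJoin(users)            // (id, (event, Option[user]))
    val flattened = joined.flatMap { case (id, (event, userOpt)) =>
      userOpt.toSeq.map(user => (user, event))             // stand-in for our real flatMap function
    }
    val result = flattened.collect()                       // everything comes back to the driver here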

Thanks,
Yann
