If your left outer join is slow and one of the tables is relatively small, you could consider broadcasting the smaller table and doing the join against the broadcast copy, as in slide 11 of this presentation: http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-amp-camp-2012-advanced-spark.pdf
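Roughly, the broadcast version looks like this (toy data and the local SparkContext are just for illustration; in a real job you would broadcast your actual small table). The output has the same (key, (left, Option[right])) shape as leftOuterJoin, but the big RDD is never shuffled:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

val sc = new SparkContext("local", "broadcast-join-sketch")

// Hypothetical data: a large keyed RDD and a much smaller lookup table.
val bigRdd   = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
val smallRdd = sc.parallelize(Seq((1, "x"), (3, "z")))

// Ship the small table to every executor once instead of shuffling both sides.
val smallMap = sc.broadcast(smallRdd.collectAsMap())

// Map-side left outer join: each partition of the big RDD does local lookups.
val joined = bigRdd.map { case (k, v) => (k, (v, smallMap.value.get(k))) }

joined.collect().foreach(println)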
If both tables are big, there has been some work on IndexedRDDs which might help speed things up, but this feature hasn't made it into Spark master yet: https://github.com/mesos/spark/pull/848

(A sketch of the cache/count diagnosis Evan mentions below is in the P.S. at the bottom of this mail.)

On Thu, Jan 9, 2014 at 2:11 PM, Yann Luppo <[email protected]> wrote:

> Thank you guys, that was really helpful in identifying the slow step,
> which in our case is the leftOuterJoin.
> I'm checking with our admins to see if we have some sort of distributed
> system monitoring in place, which I'm sure we do.
>
> Now just out of curiosity, what would be the rule of thumb or general
> guideline for the number of partitions and the number of reducers?
> Should it be some kind of factor of the number of cores available? Of
> nodes available? Should the number of partitions match the number of
> reducers, or at least be some multiple of it, for better performance?
>
> Thanks,
> Yann
>
> From: Evan Sparks <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Wednesday, January 8, 2014 5:28 PM
> To: "[email protected]" <[email protected]>
> Cc: "[email protected]" <[email protected]>
> Subject: Re: performance
>
> On this note - the Ganglia web front end that runs on the master
> (assuming you're launching with the EC2 scripts) is great for this.
>
> Also, a common technique for diagnosing "which step is slow" is to run a
> '.cache' and a '.count' on the RDD after each step. This forces the RDD to
> be materialized, which subverts the lazy evaluation that sometimes makes
> such diagnosis hard.
>
> - Evan
>
> On Jan 8, 2014, at 2:57 PM, Andrew Ash <[email protected]> wrote:
>
> My first thought on hearing that you're calling collect is that taking
> all the data back to the driver is intensive on the network. Try checking
> the basic system stats on the machines to get a sense of what's being
> heavily used:
>
> disk IO
> CPU
> network
>
> Any kind of distributed system monitoring framework should be able to
> handle these sorts of things.
>
> Cheers!
> Andrew
>
>
> On Wed, Jan 8, 2014 at 1:49 PM, Yann Luppo <[email protected]> wrote:
>
>> Hi,
>>
>> I have what I hope is a simple question. What's a typical approach to
>> diagnosing performance issues on a Spark cluster?
>> We've followed all the pertinent parts of the following document already:
>> http://spark.incubator.apache.org/docs/latest/tuning.html
>> But we seem to still have issues. More specifically, we have a
>> leftOuterJoin followed by a flatMap and then a collect running a bit long.
>>
>> How would I go about determining the bottleneck operation(s)?
>> Is our leftOuterJoin taking a long time?
>> Is the function we send to the flatMap not optimized?
>>
>> Thanks,
>> Yann
>>
>
>
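P.S. A rough sketch of the cache/count trick Evan described, in case it helps (the RDDs and the flatMap function below are made-up stand-ins for your actual pipeline):

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

val sc = new SparkContext("local", "cache-count-diagnosis")

// Hypothetical stand-ins for the real tables.
val leftRdd  = sc.parallelize(Seq((1, "a"), (2, "b")))
val rightRdd = sc.parallelize(Seq((1, "x")))

// Cache and count after each step so each stage runs (and shows its time in
// the web UI) separately, instead of everything collapsing into one lazy
// pipeline at collect() time.
val joined = leftRdd.leftOuterJoin(rightRdd).cache()
println("rows after join: " + joined.count())

val flattened = joined.flatMap { case (k, (l, r)) => r.toList.map(x => (k, l + x)) }.cache()
println("rows after flatMap: " + flattened.count())

val result = flattened.collect()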
