If your left outer join is slow and one of the tables is relatively small, you could consider broadcasting the smaller table and doing the join against the broadcast copy, as in slide 11 of this presentation: http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-amp-camp-2012-advanced-spark.pdf
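Roughly, the broadcast version looks like this (toy data and the local SparkContext are just for illustration; in a real job you would broadcast your actual small table). The output has the same (key, (left, Option[right])) shape as leftOuterJoin, but the big RDD is never shuffled:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

val sc = new SparkContext("local", "broadcast-join-sketch")

// Hypothetical data: a large keyed RDD and a much smaller lookup table.
val bigRdd   = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
val smallRdd = sc.parallelize(Seq((1, "x"), (3, "z")))

// Ship the small table to every executor once instead of shuffling both sides.
val smallMap = sc.broadcast(smallRdd.collectAsMap())

// Map-side left outer join: each partition of the big RDD does local lookups.
val joined = bigRdd.map { case (k, v) => (k, (v, smallMap.value.get(k))) }

joined.collect().foreach(println)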
If both tables are big, there has been some work on IndexedRDDs which might help speed things up, but this feature hasn't made it into Spark master yet: https://github.com/mesos/spark/pull/848

(A sketch of the cache/count diagnosis Evan mentions below is in the P.S. at the bottom of this mail.)

On Thu, Jan 9, 2014 at 2:11 PM, Yann Luppo <[email protected]> wrote:

> Thank you guys, that was really helpful in identifying the slow step,
> which in our case is the leftOuterJoin.
> I'm checking with our admins to see if we have some sort of distributed
> system monitoring in place, which I'm sure we do.
>
> Now just out of curiosity, what would be the rule of thumb or general
> guideline for the number of partitions and the number of reducers?
> Should it be some kind of factor of the number of cores available? Of
> nodes available? Should the number of partitions match the number of
> reducers, or at least be some multiple of it, for better performance?
>
> Thanks,
> Yann
>
> From: Evan Sparks <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Wednesday, January 8, 2014 5:28 PM
> To: "[email protected]" <[email protected]>
> Cc: "[email protected]" <[email protected]>
> Subject: Re: performance
>
> On this note - the Ganglia web front end that runs on the master
> (assuming you're launching with the EC2 scripts) is great for this.
>
> Also, a common technique for diagnosing "which step is slow" is to run a
> '.cache' and a '.count' on the RDD after each step. This forces the RDD to
> be materialized, which subverts the lazy evaluation that sometimes makes
> such diagnosis hard.
>
> - Evan
>
> On Jan 8, 2014, at 2:57 PM, Andrew Ash <[email protected]> wrote:
>
> My first thought on hearing that you're calling collect is that taking
> all the data back to the driver is intensive on the network. Try checking
> the basic system stats on the machines to get a sense of what's being
> heavily used:
>
> disk IO
> CPU
> network
>
> Any kind of distributed system monitoring framework should be able to
> handle these sorts of things.
>
> Cheers!
> Andrew
>
>
> On Wed, Jan 8, 2014 at 1:49 PM, Yann Luppo <[email protected]> wrote:
>
>> Hi,
>>
>> I have what I hope is a simple question. What's a typical approach to
>> diagnosing performance issues on a Spark cluster?
>> We've followed all the pertinent parts of the following document already:
>> http://spark.incubator.apache.org/docs/latest/tuning.html
>> But we seem to still have issues. More specifically, we have a
>> leftOuterJoin followed by a flatMap and then a collect running a bit long.
>>
>> How would I go about determining the bottleneck operation(s)?
>> Is our leftOuterJoin taking a long time?
>> Is the function we send to the flatMap not optimized?
>>
>> Thanks,
>> Yann
>>
>
>
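P.S. A rough sketch of the cache/count trick Evan described, in case it helps (the RDDs and the flatMap function below are made-up stand-ins for your actual pipeline):

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

val sc = new SparkContext("local", "cache-count-diagnosis")

// Hypothetical stand-ins for the real tables.
val leftRdd  = sc.parallelize(Seq((1, "a"), (2, "b")))
val rightRdd = sc.parallelize(Seq((1, "x")))

// Cache and count after each step so each stage runs (and shows its time in
// the web UI) separately, instead of everything collapsing into one lazy
// pipeline at collect() time.
val joined = leftRdd.leftOuterJoin(rightRdd).cache()
println("rows after join: " + joined.count())

val flattened = joined.flatMap { case (k, (l, r)) => r.toList.map(x => (k, l + x)) }.cache()
println("rows after flatMap: " + flattened.count())

val result = flattened.collect()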
