Hi, I have what I hope is a simple question. What's a typical approach to diagnosing performance issues on a Spark cluster? We've already followed all the pertinent parts of the following document: http://spark.incubator.apache.org/docs/latest/tuning.html But we still seem to have issues. More specifically, we have a leftOuterJoin followed by a flatMap and then a collect that is running a bit long.
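Roughly, the job has this shape (a minimal sketch with made-up RDD and function names, not our actual code):

```scala
// Sketch of the job shape only; usersRdd, eventsRdd, and
// perRecordWork are illustrative placeholders, not our real code.
val joined = usersRdd.leftOuterJoin(eventsRdd)   // RDD[(K, (V, Option[W]))]
val flattened = joined.flatMap { case (key, (v, maybeW)) =>
  perRecordWork(key, v, maybeW)                  // hypothetical per-record function
}
val results = flattened.collect()                // pulls all results back to the driver
```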
How would I go about determining the bottleneck operation(s)? Is our leftOuterJoin taking a long time? Is the function we pass to the flatMap not optimized?

Thanks,
Yann
