On this note - the Ganglia web front end that runs on the master (assuming you're launching with the EC2 scripts) is great for this.
Also, a common technique for diagnosing "which step is slow" is to run a
'.cache' and a '.count' on the RDD after each step. This forces the RDD to be
materialized, which works around the lazy evaluation that sometimes makes this
kind of diagnosis hard.

- Evan

> On Jan 8, 2014, at 2:57 PM, Andrew Ash <[email protected]> wrote:
>
> My first thought on hearing that you're calling collect is that taking all
> the data back to the driver is intensive on the network. Try checking the
> basic systems stuff on the machines to get a sense of what's being heavily
> used:
>
> disk IO
> CPU
> network
>
> Any kind of distributed system monitoring framework should be able to
> handle these sorts of things.
>
> Cheers!
> Andrew
>
>
>> On Wed, Jan 8, 2014 at 1:49 PM, Yann Luppo <[email protected]> wrote:
>> Hi,
>>
>> I have what I hope is a simple question. What's a typical approach to
>> diagnosing performance issues on a Spark cluster?
>> We've followed all the pertinent parts of the following document already:
>> http://spark.incubator.apache.org/docs/latest/tuning.html
>> But we still seem to have issues. More specifically, we have a
>> leftOuterJoin followed by a flatMap and then a collect that run a bit long.
>>
>> How would I go about determining the bottleneck operation(s)?
>> Is our leftOuterJoin taking a long time?
>> Is the function we send to the flatMap not optimized?
>>
>> Thanks,
>> Yann
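[Editor's note: a minimal sketch of the cache-and-count technique Evan describes above, assuming a hypothetical pipeline shaped like Yann's (leftOuterJoin, then flatMap, then collect). The RDD names, input data, and the timed helper are invented for illustration and are not from the original thread.]

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._  // pair-RDD implicits (needed on pre-1.3 Spark)

object DiagnoseSteps {
  // Time one forced materialization and print how long it took.
  def timed[T](label: String)(body: => T): T = {
    val start = System.nanoTime()
    val result = body
    println(s"$label took ${(System.nanoTime() - start) / 1e9} s")
    result
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "diagnose-steps")

    // Stand-in inputs; in practice these would be your real pair RDDs.
    val left  = sc.parallelize(1 to 1000000).map(i => (i % 1000, i))
    val right = sc.parallelize(1 to 1000).map(i => (i, i.toString))

    // Step 1: leftOuterJoin. cache() + count() forces this stage to run now,
    // so the time printed is attributable to the join alone.
    val joined = left.leftOuterJoin(right).cache()
    timed("leftOuterJoin")(joined.count())

    // Step 2: flatMap. Because 'joined' is already cached, this measurement
    // excludes the join and isolates the flatMap function.
    val flattened = joined.flatMap { case (k, (v, opt)) => opt.map(s => s"$k:$v:$s") }.cache()
    timed("flatMap")(flattened.count())

    // Step 3: collect. Any remaining time is mostly serialization plus the
    // network cost of pulling results back to the driver.
    timed("collect")(flattened.collect())

    sc.stop()
  }
}

Because each intermediate RDD is cached before it is counted, the time printed for a step excludes the work of the earlier steps. The cache() calls are for diagnosis only; once the slow step is identified they can be removed, and the per-stage timings in the Spark web UI (or Ganglia, as noted above) tell a similar story.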
