With so little information about what your code is actually doing I can't be certain, but what you've shared looks like an anti-pattern to me. Calling collect many times is something to avoid if at all possible: each collect forces network communication to materialize the results back in the driver process, and that network round-trip severely constrains performance.
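To make the point concrete, here is a minimal sketch in plain Python (no Spark; the data and query names are purely illustrative) of the usual fix: instead of issuing one full pass over the data per query, merge all the queries into a single pass and "collect" once.

```python
# Stand-ins for the RDD's contents and the values the loop would look up.
# These names are illustrative, not from the original post.
data = list(range(100_000))
queries = [10, 500, 99_999]

# Anti-pattern: one full pass over the data (one "collect") per query.
slow_results = [[x for x in data if x == q] for q in queries]

# Better: a single pass that answers every query, then one "collect".
wanted = set(queries)
matches = {}
for x in data:
    if x in wanted:
        matches.setdefault(x, []).append(x)
fast_results = [matches.get(q, []) for q in queries]

assert slow_results == fast_results
```

In Spark terms, the single-pass version corresponds to broadcasting the set of queries, answering all of them inside one map/filter stage, and calling collect once, rather than scheduling 10-100k tiny stages.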
On Mon, Feb 17, 2014 at 9:51 AM, David Thomas <[email protected]> wrote:

> I have a spark application that has the below structure:
>
> while(...) { // 10-100k iterations
>   rdd.map(...).collect
> }
>
> Basically, I have an RDD and I need to query it multiple times.
>
> Now when I run this, for each iteration, Spark creates a new stage (each
> stage having multiple tasks). What I find is that the stage execution takes
> about 1 second and most time is spent in scheduling the tasks. Since a
> stage is not submitted until the previous stage is completed, this loop
> takes a long time to complete. So my question is, is there a way to
> interleave multiple stage executions? Any other suggestions to improve the
> above query pattern?
