With so little information about what your code is actually doing, what you
have shared looks like an anti-pattern to me.  Doing many collect
actions is something to avoid if at all possible, since each collect forces a
round of network communication to materialize the results back in the
driver process, and that network communication severely constrains performance.
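
One alternative (a sketch only, not from this thread) is to cache the RDD
and fold all of the per-iteration "queries" into a single job, so Spark
schedules one stage instead of thousands.  The names `queries` and the
divisibility predicate below are hypothetical stand-ins for whatever the
original map(...) computes per iteration:

```scala
import org.apache.spark.SparkContext

object BatchedQueries {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "batched-queries")

    // Cache so repeated evaluation doesn't recompute the lineage.
    val rdd = sc.parallelize(1 to 1000000).cache()

    // Instead of: for (q <- queries) { rdd.map(matches(q, _)).collect() }
    // ship all queries to the executors once and collect a single result.
    val queries = (2 to 100).toSeq
    val bcQueries = sc.broadcast(queries)

    val results = rdd
      .flatMap(x => bcQueries.value.filter(q => x % q == 0).map(q => (q, x)))
      .groupByKey()
      .collect() // one job, one scheduling round-trip

    sc.stop()
  }
}
```

Whether this restructuring applies depends on whether each iteration's
query really needs the previous iteration's collected result; if it does,
the loop is inherently sequential and the per-stage scheduling cost is
harder to avoid.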


On Mon, Feb 17, 2014 at 9:51 AM, David Thomas <[email protected]> wrote:

> I have a spark application that has the below structure:
>
> while(...) { // 10-100k iterations
>   rdd.map(...).collect
> }
>
> Basically, I have an RDD and I need to query it multiple times.
>
> Now when I run this, for each iteration, Spark creates a new stage (each
> stage having multiple tasks). What I find is that the stage execution takes
> about 1 second and most of the time is spent in scheduling the tasks. Since
> a stage is not submitted until the previous stage is completed, this loop
> takes a long time to complete. So my question is: is there a way to
> interleave multiple stage executions? Any other suggestions to improve the
> above query pattern?
>
