Whatever you want to do, if you really have to do it that way, don't use Spark. And the answer to your question is: Spark automatically "interleaves" stages that can be interleaved.

Now, I do not believe that you really want to do that. You probably should just do a filter + map, or a flatMap. But explain what you're trying to achieve so we can recommend a better way.
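The filter + map / flatMap suggestion amounts to collapsing the many per-iteration passes into a single pass over the data. A minimal sketch of the idea in plain Python (not the Spark API; `many_passes`, `one_pass`, and the query tuples are illustrative names, assuming each query is a (predicate, transform) pair):

```python
# Anti-pattern: one full pass over the data (one Spark job) per query.
def many_passes(data, queries):
    return [[f(x) for x in data if p(x)] for (p, f) in queries]

# Better: a single pass that evaluates every query against each element,
# analogous to one flatMap over a cached RDD instead of 10-100k jobs.
def one_pass(data, queries):
    results = [[] for _ in queries]
    for x in data:
        for i, (p, f) in enumerate(queries):
            if p(x):
                results[i].append(f(x))
    return results
```

The single pass pays the per-job scheduling overhead once instead of once per iteration, which is exactly the cost dominating the loop described below.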

Guillaume
With so little information about what your code is actually doing, what you have shared looks like an anti-pattern to me. Doing many collect actions is something to be avoided if at all possible, since each one forces network communication to materialize the results back in the driver process, and that network communication severely constrains performance.


On Mon, Feb 17, 2014 at 9:51 AM, David Thomas <[email protected]> wrote:
I have a spark application that has the below structure:

while(...) { // 10-100k iterations
  rdd.map(...).collect
}

Basically, I have an RDD and I need to query it multiple times.

Now when I run this, for each iteration, Spark creates a new stage (each stage having multiple tasks). What I find is that the stage execution takes about 1 second, and most of that time is spent scheduling the tasks. Since a stage is not submitted until the previous stage is completed, this loop takes a long time to complete. So my question is: is there a way to interleave multiple stage executions? Any other suggestions to improve the above query pattern?



--
eXenSa
Guillaume PITEL, Président
+33(0)6 25 48 86 80

eXenSa S.A.S.
41, rue Périer - 92120 Montrouge - FRANCE
Tel +33(0)1 84 16 36 77 / Fax +33(0)9 72 28 37 05
