Is there a way I can queue several stages at once?
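
To make the question concrete, here is a rough sketch of what "queueing several at once" might look like. It relies on the documented behaviour that a single SparkContext accepts jobs submitted from separate threads. The object name, the local[4] master, the sample data and the thresholds are all made up for illustration; the real queries would replace the filter.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global

// Hypothetical driver program: names and data are placeholders.
object ConcurrentQueries {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("concurrent-queries")
      .setMaster("local[4]") // assumption: local mode, for illustration only
    val sc = new SparkContext(conf)

    // Cache the RDD so repeated queries do not recompute it.
    val rdd = sc.parallelize(1 to 1000).cache()
    rdd.count() // force the cache to be populated once

    // Stand-in for the per-iteration query parameters.
    val thresholds = Seq(10, 100, 500)

    // Each Future submits its own job from a separate thread, so the
    // scheduler can have several jobs in flight instead of one at a time.
    val futures = thresholds.map { t =>
      Future { rdd.filter(_ < t).collect() }
    }

    // Block until every query has returned its result to the driver.
    val results = Await.result(Future.sequence(futures), Duration.Inf)
    results.foreach(r => println(r.length))

    sc.stop()
  }
}
```

This only overlaps the scheduling of the jobs; each one still pulls its results back to the driver with collect, so it does not address the network cost raised below.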

On Mon, Feb 17, 2014 at 12:08 PM, Mark Hamstra <[email protected]> wrote:

> With so little information about what your code is actually doing, what
> you have shared looks likely to be an anti-pattern to me. Doing many
> collect actions is something to be avoided if at all possible, since this
> forces a lot of network communication to materialize the results back
> within the driver process, and network communication severely constrains
> performance.
>
>
> On Mon, Feb 17, 2014 at 9:51 AM, David Thomas <[email protected]> wrote:
>
>> I have a Spark application that has the below structure:
>>
>> while(...) { // 10-100k iterations
>>   rdd.map(...).collect
>> }
>>
>> Basically, I have an RDD and I need to query it multiple times.
>>
>> Now when I run this, for each iteration, Spark creates a new stage (each
>> stage having multiple tasks). What I find is that the stage execution takes
>> about 1 second and most of the time is spent in scheduling the tasks. Since a
>> stage is not submitted until the previous stage is completed, this loop
>> takes a long time to complete. So my question is, is there a way to
>> interleave multiple stage executions? Any other suggestions to improve the
>> above query pattern?
