I have a spark application that has the below structure:
while(...) { // 10-100k iterations
rdd.map(...).collect
}
Basically, I have an RDD and I need to query it multiple times.
When I run this, Spark creates a new stage for each iteration (each stage having multiple tasks). Each stage takes about 1 second to execute, and most of that time is spent scheduling the tasks. Since a stage is not submitted until the previous stage has completed, the loop takes a long time to finish. So my question is: is there a way to interleave the execution of multiple stages? Any other suggestions for improving this query pattern?
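One thing worth noting: Spark's scheduler does accept jobs submitted concurrently from separate threads within the same application, so independent `collect()` calls do not have to run strictly one after another. Below is a minimal sketch of that idea using Scala `Future`s; the `queries` list and the function bodies are placeholders standing in for the per-iteration `map` logic, and the RDD is cached so each job reuses it instead of recomputing it.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import org.apache.spark.{SparkConf, SparkContext}

object ConcurrentJobs {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("concurrent-jobs"))

    // Cache the RDD so repeated jobs read it from memory instead of
    // recomputing the lineage on every iteration.
    val rdd = sc.parallelize(1 to 1000000).cache()

    // Placeholder per-iteration queries; in the real application these
    // would be the functions passed to rdd.map(...) inside the loop.
    val queries: Seq[Int => Int] = Seq(_ + 1, _ * 2, _ - 3)

    // Submit each job from its own thread. Spark schedules the resulting
    // jobs concurrently, so their stages can overlap instead of waiting
    // for the previous collect() to finish.
    val futures = queries.map { f =>
      Future { rdd.map(f).collect() }
    }

    val results = Await.result(Future.sequence(futures), Duration.Inf)
    println(results.map(_.length))

    sc.stop()
  }
}
```

By default Spark schedules concurrent jobs FIFO; setting `spark.scheduler.mode=FAIR` lets them share executor resources more evenly. This is a sketch, not a drop-in fix: with 10-100k iterations you would want to submit jobs in bounded batches (or restructure the loop to batch many queries into one job) rather than launch one thread per iteration.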