The difference between this code and the code you showed is that this code calls .reduce(), whose result is the size of ONE element of the RDD, while your code calls .collect(), whose result is the size of your entire dataset.
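To make that size difference concrete, here is a minimal sketch using a plain Python list as a stand-in for an RDD (the names `data`, `total`, `everything` are illustrative, not from the thread). A `reduce` folds everything down to a single element before it reaches the driver, while `collect` ships the entire dataset back.

```python
from functools import reduce

# Stand-in for an RDD of numbers; in Spark this would be partitioned
# across the cluster, and only the action's result reaches the driver.
data = list(range(1_000_000))

# reduce(_ + _): the driver receives ONE element, regardless of dataset size.
total = reduce(lambda a, b: a + b, data)

# collect(): the driver receives the ENTIRE dataset.
everything = list(data)  # in Spark: rdd.collect()

print(total)            # a single number
print(len(everything))  # one million elements materialized on the driver
```

The cost of `collect` therefore grows with the data, while the cost of shipping a `reduce` result stays constant.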
It would help if you provided a more complete example.

On Tue, Feb 18, 2014 at 11:58 AM, David Thomas <[email protected]> wrote:

> Here is an example code that is bundled with Spark:
>
>     for (i <- 1 to ITERATIONS) {
>       println("On iteration " + i)
>       val gradient = points.map { p =>
>         (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
>       }.reduce(_ + _)
>       w -= gradient
>     }
>
> As you can see, an action is called on every iteration. I guess this is
> the same pattern I have in my code. So why do you think this is not the
> right thing to do with Spark?
>
> On Tue, Feb 18, 2014 at 12:44 AM, Guillaume Pitel <[email protected]> wrote:
>
>> Whatever you want to do, if you really have to do it that way, don't
>> use Spark. And the answer to your question is: Spark automatically
>> "interleaves" stages that can be interleaved.
>>
>> Now, I do not believe that you really want to do that. You probably
>> should just do a filter + map or a flatMap. But explain what you're
>> trying to achieve so we can recommend a better way.
>>
>> Guillaume
>>
>> With so little information about what your code is actually doing, what
>> you have shared looks likely to be an anti-pattern to me. Doing many
>> collect actions is something to be avoided if at all possible, since this
>> forces a lot of network communication to materialize the results back
>> within the driver process, and network communication severely constrains
>> performance.
>>
>> On Mon, Feb 17, 2014 at 9:51 AM, David Thomas <[email protected]> wrote:
>>
>>> I have a Spark application that has the below structure:
>>>
>>>     while(...) { // 10-100k iterations
>>>       rdd.map(...).collect
>>>     }
>>>
>>> Basically, I have an RDD and I need to query it multiple times.
>>>
>>> Now when I run this, for each iteration, Spark creates a new stage
>>> (each stage having multiple tasks). What I find is that the stage
>>> execution takes about 1 second and most of the time is spent in
>>> scheduling the tasks. Since a stage is not submitted until the previous
>>> stage is completed, this loop takes a long time to complete. So my
>>> question is: is there a way to interleave multiple stage executions?
>>> Any other suggestions to improve the above query pattern?
>>
>> --
>> Guillaume PITEL, Président
>> +33(0)6 25 48 86 80
>>
>> eXenSa S.A.S. <http://www.exensa.com/>
>> 41, rue Périer - 92120 Montrouge - FRANCE
>> Tel +33(0)1 84 16 36 77 / Fax +33(0)9 72 28 37 05
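The thread never spells out a concrete fix, but as one illustration of Guillaume's "filter + map or flatMap" suggestion, this plain-Python sketch (the names `data` and `queries` are hypothetical, not from the thread) shows how many per-iteration passes over the data can be fused into a single pass. In Spark terms, that would mean one action and one stage instead of 10-100k stages scheduled back-to-back.

```python
# Anti-pattern from the thread: one full pass over the data (one Spark
# stage, one collect) per query.
data = list(range(100))
queries = [2, 3, 5]  # hypothetical per-iteration parameters

per_query = [[x for x in data if x % q == 0] for q in queries]  # N passes

# Batched alternative: emit (query, value) pairs in a single flatMap-style
# traversal, so the scheduler sees one stage instead of N.
batched = {q: [] for q in queries}
for x in data:                 # one pass over the data
    for q in queries:
        if x % q == 0:
            batched[q].append(x)

assert per_query == [batched[q] for q in queries]  # same results, one pass
```

The same idea in Spark would be a single `rdd.flatMap` keyed by query, followed by one action, rather than a driver-side loop of `collect` calls.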
