I think you get a sort of "silent" caching after shuffles in some cases, since the shuffle files are not immediately removed and can be reused. (This is the flip side of the frequent question/complaint that the shuffle files aren't removed straight away.)
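You can see this in the web UI: run two actions against the same shuffled RDD and the second job marks the map-side stage as "skipped". A minimal self-contained sketch (local master, toy data, all names made up for illustration):

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ShuffleReuseDemo {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf()
        .setAppName("shuffle-reuse-demo").setMaster("local[2]");
    JavaSparkContext sc = new JavaSparkContext(conf);

    JavaRDD<Integer> nums = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6));

    // groupBy forces a shuffle; the shuffle files it writes are not
    // deleted immediately when the job finishes.
    JavaPairRDD<Integer, Iterable<Integer>> grouped = nums.groupBy(n -> n % 2);

    // First action: runs the map-side stage, writes shuffle files,
    // then runs the reduce-side stage.
    System.out.println(grouped.count());

    // Second action on the same RDD object: the map-side stage shows
    // up as "skipped" in the web UI because the existing shuffle
    // output is reused -- the "silent" caching described above.
    System.out.println(grouped.count());

    sc.stop();
  }
}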
On Mon, Mar 30, 2015 at 9:43 AM, Renato Marroquín Mogrovejo
<renatoj.marroq...@gmail.com> wrote:
> Hi all,
>
> I am trying to understand how Spark's lazy evaluation works, and I need
> some help. I have noticed that creating an RDD once and using it many
> times won't trigger recomputation of it every time it gets used, whereas
> creating a new RDD every time a new operation is performed will trigger
> recomputation of the whole RDD again.
> I would have thought that both approaches behave similarly (i.e. no
> caching) due to Spark's lazy evaluation strategy, but I guess Spark keeps
> track of the RDDs used and of the partial results computed so far, so it
> doesn't do unnecessary extra work. Could anybody point me to where Spark
> decides what to cache, or how I can disable this behaviour?
> Thanks in advance!
>
>
> Renato M.
>
> Approach 1 --> this doesn't trigger recomputation of the RDD in every
> iteration
> =========
> JavaRDD aggrRel =
>     Utils.readJavaRDD(...).groupBy(groupFunction).map(mapFunction);
> for (int i = 0; i < NUM_RUNS; i++) {
>     // doing some computation like aggrRel.count()
>     . . .
> }
>
> Approach 2 --> this triggers recomputation of the RDD in every iteration
> =========
> for (int i = 0; i < NUM_RUNS; i++) {
>     JavaRDD aggrRel =
>         Utils.readJavaRDD(...).groupBy(groupFunction).map(mapFunction);
>     // doing some computation like aggrRel.count()
>     . . .
> }
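As for controlling it: rather than relying on leftover shuffle files, you can make the reuse in Approach 1 explicit with cache()/unpersist(). A sketch with explicit caching (sc, the input path, and the key extraction are placeholder assumptions, not your actual Utils.readJavaRDD pipeline):

// sc is an existing JavaSparkContext; the path and NUM_RUNS are
// placeholders standing in for the quoted code above.
int NUM_RUNS = 10;
JavaRDD<String> aggrRel = sc.textFile("/path/to/input")
    .groupBy(line -> line.split(",")[0])  // stand-in for groupFunction
    .map(t -> t._1());                    // stand-in for mapFunction

// Explicitly mark the RDD for caching: the first action materializes
// it, and later actions read the cached partitions instead of
// recomputing or depending on shuffle files that may be cleaned up.
aggrRel.cache();

for (int i = 0; i < NUM_RUNS; i++) {
  long n = aggrRel.count();
  // . . .
}

// Release the cached partitions once the loop is done.
aggrRel.unpersist();

That way the behaviour no longer depends on whether the shuffle files from a previous job happen to still be around.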