Thanks Sean! Do you know if there is a way (even manually) to delete these intermediate shuffle results? I was just want to test the "expected" behaviour. I know that re-caching might be a positive action most of the times but I want to try it without it.
Renato M. 2015-03-30 12:15 GMT+02:00 Sean Owen <so...@cloudera.com>: > I think that you get a sort of "silent" caching after shuffles, in > some cases, since the shuffle files are not immediately removed and > can be reused. > > (This is the flip side to the frequent question/complaint that the > shuffle files aren't removed straight away.) > > On Mon, Mar 30, 2015 at 9:43 AM, Renato Marroquín Mogrovejo > <renatoj.marroq...@gmail.com> wrote: > > Hi all, > > > > I am trying to understand Spark lazy evaluation works, and I need some > help. > > I have noticed that creating an RDD once and using it many times won't > > trigger recomputation of it every time it gets used. Whereas creating a > new > > RDD for every time a new operation is performed will trigger > recomputation > > of the whole RDD again. > > I would have thought that both approaches behave similarly (i.e. not > > caching) due to Spark's lazy evaluation strategy, but I guess Spark is > keeps > > track of the RDD used and of the partial results computed so far so it > > doesn't do unnecessary extra work. Could anybody point me to where Spark > > decides what to cache or how I can disable this behaviour? > > Thanks in advance! > > > > > > Renato M. > > > > Approach 1 --> this doesn't trigger recomputation of the RDD in every > > iteration > > ========= > > JavaRDD aggrRel = > > Utils.readJavaRDD(...).groupBy(groupFunction).map(mapFunction); > > for (int i = 0; i < NUM_RUNS; i++) { > > // doing some computation like aggrRel.count() > > . . . > > } > > > > Approach 2 --> this triggers recomputation of the RDD in every iteration > > ========= > > for (int i = 0; i < NUM_RUNS; i++) { > > JavaRDD aggrRel = > > Utils.readJavaRDD(...).groupBy(groupFunction).map(mapFunction); > > // doing some computation like aggrRel.count() > > . . . > > } >