Hi all, I am trying to understand how Spark's lazy evaluation works, and I need some help. I have noticed that creating an RDD once and reusing it many times does not trigger recomputation of it on every use, whereas creating a new RDD each time an operation is performed recomputes the whole RDD again. I would have expected both approaches to behave the same (i.e. no caching) given Spark's lazy evaluation strategy, but I guess Spark keeps track of the RDDs used and of the partial results computed so far, so that it avoids unnecessary extra work. Could anybody point me to where Spark decides what to cache, or to how I can disable this behaviour? Thanks in advance!
Approach 1 --> this doesn't trigger recomputation of the RDD in every iteration
=========

    JavaRDD aggrRel = Utils.readJavaRDD(...)
                           .groupBy(groupFunction)
                           .map(mapFunction);

    for (int i = 0; i < NUM_RUNS; i++) {
        // doing some computation like aggrRel.count()
        . . .
    }

Approach 2 --> this triggers recomputation of the RDD in every iteration
=========

    for (int i = 0; i < NUM_RUNS; i++) {
        JavaRDD aggrRel = Utils.readJavaRDD(...)
                               .groupBy(groupFunction)
                               .map(mapFunction);

        // doing some computation like aggrRel.count()
        . . .
    }

Renato M.