To experiment, try this in the Spark shell:

    val r0 = sc.makeRDD(1 to 3, 1)
    val r1 = r0.map { x => println(x); x }
    val r2 = r1.map(_ * 2)
    val r3 = r1.map(_ * 2 + 1)
    (r2 ++ r3).collect()
You’ll see that the elements of r1 are printed (thus evaluated) twice. After adding .cache() to r1, those elements are printed only once.

On Wed, Apr 23, 2014 at 4:35 PM, Cheng Lian <lian.cs....@gmail.com> wrote:

> Good question :)
>
> Although the RDD DAG is lazily evaluated, it’s not exactly the same as a
> Scala lazy val. For a Scala lazy val, the evaluated value is automatically
> cached, while evaluated RDD elements are not cached unless you call
> .cache() explicitly, because materializing an RDD can often be expensive.
> Take local file reading as an analogy:
>
>     val v0 = sc.textFile("input.log").cache()
>
> is similar to a lazy val:
>
>     lazy val u0 = Source.fromFile("input.log").mkString
>
> while
>
>     val v1 = sc.textFile("input.log")
>
> is similar to a function:
>
>     def u1 = Source.fromFile("input.log").mkString
>
> Think of it this way: if you want to “reuse” the evaluated elements, you
> have to cache those elements somewhere. Without caching, you have to
> re-evaluate the RDD, and the semantics of an uncached RDD simply downgrade
> from a lazy val to a function.
>
>
> On Wed, Apr 23, 2014 at 4:00 PM, Mayur Rustagi <mayur.rust...@gmail.com> wrote:
>
>> Shouldn’t the DAG optimizer optimize these routines? Sorry if it’s a dumb
>> question :)
>>
>>
>> Mayur Rustagi
>> Ph: +1 (760) 203 3257
>> http://www.sigmoidanalytics.com
>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>
>>
>>
>> On Wed, Apr 23, 2014 at 12:29 PM, Cheng Lian <lian.cs....@gmail.com> wrote:
>>
>>> Without caching, an RDD will be evaluated multiple times if it is
>>> referenced multiple times by other RDDs. A silly example:
>>>
>>>     val text = sc.textFile("input.log")
>>>     val r1 = text.filter(_ startsWith "ERROR")
>>>     val r2 = text.filter(_ startsWith "WARN")
>>>     val r3 = (r1 ++ r2).collect()
>>>
>>> Here the input file will be scanned twice unless you call .cache() on
>>> text. So if your computation involves nondeterminism (e.g. random
>>> numbers), you may get different results.
>>>
>>>
>>> On Tue, Apr 22, 2014 at 11:30 AM, randylu <randyl...@gmail.com> wrote:
>>>
>>>> It works when I call doc_topic_dist.cache() first.
>>>>
>>>>
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/two-calls-of-saveAsTextFile-have-different-results-on-the-same-RDD-tp4578p4580.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
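[Editor's note] The lazy val vs. def analogy in the thread above can be sketched in plain Scala, with no Spark required. This is only an illustration of the evaluation semantics being discussed; the `compute` function and counter are made up for the example:

```scala
object LazyVsDef extends App {
  var evalCount = 0
  def compute(): Int = { evalCount += 1; 42 }

  // lazy val: evaluated once on first access, then cached —
  // analogous to an RDD with .cache()
  lazy val cached = compute()
  cached; cached
  println(s"lazy val evaluations: $evalCount")  // prints 1

  evalCount = 0
  // def: re-evaluated on every access —
  // analogous to an uncached RDD referenced multiple times
  def uncached = compute()
  uncached; uncached
  println(s"def evaluations: $evalCount")       // prints 2
}
```

Accessing the lazy val twice runs `compute()` once; accessing the def twice runs it twice, which mirrors why an uncached RDD referenced by two downstream RDDs is materialized twice.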