To experiment, try this in the Spark shell:

val r0 = sc.makeRDD(1 to 3, 1)
val r1 = r0.map { x =>
  println(x)
  x
}
val r2 = r1.map(_ * 2)
val r3 = r1.map(_ * 2 + 1)
(r2 ++ r3).collect()

You’ll see that the elements of r1 are printed (and thus evaluated) twice. After
adding .cache() to r1, you’ll see them printed only once.
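
For reference, here’s the cached variant (a minimal sketch; only the .cache()
call is new):

val r0 = sc.makeRDD(1 to 3, 1)
val r1 = r0.map { x =>
  println(x)  // side effect that lets us observe evaluation
  x
}.cache()     // keep evaluated elements in memory after the first computation
val r2 = r1.map(_ * 2)
val r3 = r1.map(_ * 2 + 1)
(r2 ++ r3).collect()  // each element of r1 is now printed once instead of twice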


On Wed, Apr 23, 2014 at 4:35 PM, Cheng Lian <lian.cs....@gmail.com> wrote:

> Good question :)
>
> Although the RDD DAG is lazily evaluated, it’s not exactly the same as a Scala
> lazy val. For a Scala lazy val, the evaluated value is automatically cached,
> while evaluated RDD elements are not cached unless you call .cache()
> explicitly, because materializing an RDD can often be expensive. Take local
> file reading as an analogy:
>
> val v0 = sc.textFile("input.log").cache()
>
> is similar to a lazy val
>
> lazy val u0 = Source.fromFile("input.log").mkString
>
> while
>
> val v1 = sc.textFile("input.log")
>
> is similar to a function
>
> def u1 = Source.fromFile("input.log").mkString
>
> Think of it this way: if you want to “reuse” the evaluated elements, you have
> to cache them somewhere. Without caching, you have to re-evaluate the RDD
> every time, so the semantics of an uncached RDD degrade to those of a
> function (def) rather than a lazy val.
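>
> To see the memoization difference in plain Scala (a minimal sketch,
> independent of Spark):
>
> var evaluations = 0
> lazy val cached = { evaluations += 1; 42 }  // body runs at most once; result is memoized
> def recomputed = { evaluations += 1; 42 }   // body runs on every call
>
> cached; cached          // evaluations == 1
> recomputed; recomputed  // evaluations == 3: the lazy val ran once, the def twice
>
> An uncached RDD behaves like the def: every action that touches it
> re-evaluates the whole lineage.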
>
>
> On Wed, Apr 23, 2014 at 4:00 PM, Mayur Rustagi <mayur.rust...@gmail.com> wrote:
>
>> Shouldn't the DAG optimizer optimize these routines? Sorry if it's a dumb
>> question :)
>>
>>
>> Mayur Rustagi
>> Ph: +1 (760) 203 3257
>> http://www.sigmoidanalytics.com
>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>
>>
>>
>> On Wed, Apr 23, 2014 at 12:29 PM, Cheng Lian <lian.cs....@gmail.com> wrote:
>>
>>> Without caching, an RDD will be evaluated multiple times if referenced
>>> multiple times by other RDDs. A silly example:
>>>
>>> val text = sc.textFile("input.log")
>>> val r1 = text.filter(_ startsWith "ERROR")
>>> val r2 = text.map(_ split " ")
>>> val r3 = r1.collect()
>>> val r4 = r2.collect()
>>>
>>> Here the input file will be scanned twice unless you call .cache() on
>>> text. And if your computation involves nondeterminism (e.g. random
>>> numbers), re-evaluating the RDD may even give you different results.
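>>>
>>> A quick sketch of that pitfall (a hypothetical example, not from the
>>> original thread):
>>>
>>> import scala.util.Random
>>>
>>> val rnd = sc.makeRDD(1 to 5, 1).map(_ => Random.nextInt(100))
>>> val first = rnd.collect()   // draws one set of random numbers
>>> val second = rnd.collect()  // re-runs the map, so likely a different set
>>>
>>> // Calling rnd.cache() before the first collect() would materialize the
>>> // elements once, and both arrays would then contain the same values.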
>>>
>>>
>>> On Tue, Apr 22, 2014 at 11:30 AM, randylu <randyl...@gmail.com> wrote:
>>>
>>>> It's OK when I call doc_topic_dist.cache() first.
>>>>
>>>>
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/two-calls-of-saveAsTextFile-have-different-results-on-the-same-RDD-tp4578p4580.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>
>>>
>>>
>>
>
