Shouldn't the DAG optimizer optimize these routines? Sorry if it's a dumb
question :)


Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>



On Wed, Apr 23, 2014 at 12:29 PM, Cheng Lian <lian.cs....@gmail.com> wrote:

> Without caching, an RDD will be evaluated multiple times if referenced
> multiple times by other RDDs. A silly example:
>
> val text = sc.textFile("input.log")
> val r1 = text.filter(_ startsWith "ERROR")
> val r2 = text.map(_ split " ")
> val r3 = (r1 ++ r2).collect()
>
> Here the input file will be scanned twice unless you call .cache() on text.
> So if your computation involves nondeterminism (e.g. random numbers), you
> may get different results.
>
>
> On Tue, Apr 22, 2014 at 11:30 AM, randylu <randyl...@gmail.com> wrote:
>
>> It's OK when I call doc_topic_dist.cache() first.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/two-calls-of-saveAsTextFile-have-different-results-on-the-same-RDD-tp4578p4580.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>
>
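
The behaviour discussed in this thread can be reproduced with a small local
sketch. It assumes Spark is on the classpath and uses a local master; the
names CacheDemo and collectTwice are made up for illustration and do not
come from the thread. The uncached RDD re-runs its random map on every
action, so the two collects will very likely differ, while the cached RDD
is expected to return the same values as long as its partitions stay in
memory (caching is best effort, and evicted partitions are recomputed).

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import scala.util.Random

object CacheDemo {
  // Hypothetical helper: runs the same action twice and returns both results.
  def collectTwice(rdd: RDD[Int]): (Seq[Int], Seq[Int]) =
    (rdd.collect().toSeq, rdd.collect().toSeq)

  def main(args: Array[String]): Unit = {
    // Local master is an assumption so the sketch runs without a cluster.
    val conf = new SparkConf().setAppName("cache-demo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Nondeterministic transformation, not cached:
      // each action recomputes the map, drawing fresh random numbers.
      val uncached = sc.parallelize(1 to 5).map(_ => Random.nextInt(1000000))
      val (a, b) = collectTwice(uncached)
      println(s"uncached, results equal: ${a == b}")

      // Same transformation, cached: the first action materializes the
      // partitions and the second action reads them back.
      val cached = sc.parallelize(1 to 5).map(_ => Random.nextInt(1000000)).cache()
      val (c, d) = collectTwice(cached)
      println(s"cached, results equal: ${c == d}")
    } finally {
      sc.stop()
    }
  }
}
```

This mirrors the saveAsTextFile case: two actions on the same uncached RDD
are two independent evaluations of its lineage.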
