Thanks Liquan, that makes sense, but if I am only doin the computation once, there will essentially be no difference, correct?
I had second question related to mapPartitions 1) All of the records of the Iterator[T] that a single function call in mapPartitions process must fit into memory, correct? 2) Is there someway to process that iterator in sorted order? Thanks! Arun On Tue, Sep 23, 2014 at 5:21 PM, Liquan Pei <liquan...@gmail.com> wrote: > Hi Arun, > > The intermediate results like keyedRecordPieces will not be > materialized. This indicates that if you run > > partitoned = keyedRecordPieces.partitionBy(KeyPartitioner) > > partitoned.mapPartitions(doComputation).save() > > again, the keyedRecordPieces will be re-computed . In this case, cache or > persist keyedRecordPieces is a good idea to eliminate unnecessary expensive > computation. What you can probably do is > > keyedRecordPieces = records.flatMap( record => Seq(key, > recordPieces)).cache() > > Which will cache the RDD referenced by keyedRecordPieces in memory. For > more options on cache and persist, take a look at > http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD. > There are two APIs you can use to persist RDDs and one allows you to > specify storage level. > > Thanks, > Liquan > > > > On Tue, Sep 23, 2014 at 2:08 PM, Arun Ahuja <aahuj...@gmail.com> wrote: > >> I have a general question on when persisting will be beneficial and when >> it won't: >> >> I have a task that runs as follow >> >> keyedRecordPieces = records.flatMap( record => Seq(key, recordPieces)) >> partitoned = keyedRecordPieces.partitionBy(KeyPartitioner) >> >> partitoned.mapPartitions(doComputation).save() >> >> Is there value in having a persist somewhere here? For example if the >> flatMap step is particularly expensive, will it ever be computed twice when >> there are no failures? >> >> Thanks >> >> Arun >> > > > > -- > Liquan Pei > Department of Physics > University of Massachusetts Amherst >