Thanks Liquan, that makes sense, but if I am only doing the computation
once, there will essentially be no difference, correct?

I had a couple of follow-up questions related to mapPartitions:
1) All of the records of the Iterator[T] that a single function call in
mapPartitions processes must fit into memory, correct?
2) Is there some way to process that iterator in sorted order?
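
For 2, the only approach I can come up with is to sort within the
partition myself, which of course materializes it and runs into
question 1. A rough sketch of what I mean, assuming a pair RDD whose
keys have an ordering:

partitioned.mapPartitions { iter =>
  // NOTE: toArray pulls the whole partition into memory before sorting
  doComputation(iter.toArray.sortBy(_._1).iterator)
}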

Thanks!

Arun

On Tue, Sep 23, 2014 at 5:21 PM, Liquan Pei <liquan...@gmail.com> wrote:

> Hi Arun,
>
> The intermediate results like keyedRecordPieces will not be
> materialized. This means that if you run
>
> partitioned = keyedRecordPieces.partitionBy(KeyPartitioner)
>
> partitioned.mapPartitions(doComputation).save()
>
> again, keyedRecordPieces will be re-computed. In this case, caching or
> persisting keyedRecordPieces is a good idea to eliminate unnecessary,
> expensive computation. What you can probably do is
>
> keyedRecordPieces = records.flatMap(record => Seq((key, recordPieces))).cache()
>
> This will cache the RDD referenced by keyedRecordPieces in memory. For
> more options on cache and persist, take a look at
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD.
> There are two APIs you can use to persist RDDs; one of them lets you
> specify the storage level.
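>
> For example, a minimal sketch of the variant that takes a storage level
> (MEMORY_AND_DISK here is just one possible level):
>
> import org.apache.spark.storage.StorageLevel
> keyedRecordPieces.persist(StorageLevel.MEMORY_AND_DISK)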
>
> Thanks,
> Liquan
>
>
>
> On Tue, Sep 23, 2014 at 2:08 PM, Arun Ahuja <aahuj...@gmail.com> wrote:
>
>> I have a general question on when persisting will be beneficial and when
>> it won't:
>>
>> I have a task that runs as follows:
>>
>> keyedRecordPieces = records.flatMap(record => Seq((key, recordPieces)))
>> partitioned = keyedRecordPieces.partitionBy(KeyPartitioner)
>>
>> partitioned.mapPartitions(doComputation).save()
>>
>> Is there value in having a persist somewhere here? For example, if the
>> flatMap step is particularly expensive, will it ever be computed twice
>> when there are no failures?
>>
>> Thanks
>>
>> Arun
>>
>
>
>
> --
> Liquan Pei
> Department of Physics
> University of Massachusetts Amherst
>
