Re: General question on persist

2014-09-23 Thread Liquan Pei
Hi Arun,

The intermediate results like keyedRecordPieces will not be materialized
by default. This means that if you run

partitioned = keyedRecordPieces.partitionBy(KeyPartitioner)

partitioned.mapPartitions(doComputation).save()

again, keyedRecordPieces will be recomputed. In this case, caching or
persisting keyedRecordPieces is a good idea to eliminate unnecessary,
expensive computation. What you can probably do is

keyedRecordPieces = records.flatMap(record => Seq((key, recordPieces))).cache()

This will cache the RDD referenced by keyedRecordPieces in memory. For
more options on cache and persist, take a look at
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD.
There are two APIs you can use to persist RDDs: cache(), which uses the
default in-memory storage level, and persist(), which lets you specify a
storage level.
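
For instance, a minimal sketch of the persist() variant (the
MEMORY_AND_DISK level, the outputPath name, and the unpersist() call are
illustrative choices, not from your snippet):

import org.apache.spark.storage.StorageLevel

val keyedRecordPieces = records
  .flatMap(record => Seq((key, recordPieces)))
  .persist(StorageLevel.MEMORY_AND_DISK) // like cache(), but spills to disk

val partitioned = keyedRecordPieces.partitionBy(KeyPartitioner)
partitioned.mapPartitions(doComputation).saveAsTextFile(outputPath) // outputPath: hypothetical

keyedRecordPieces.unpersist() // free the cached blocks once the save is done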

Thanks,
Liquan



On Tue, Sep 23, 2014 at 2:08 PM, Arun Ahuja aahuj...@gmail.com wrote:

 I have a general question on when persisting will be beneficial and when
 it won't:

 I have a task that runs as follows:

 keyedRecordPieces = records.flatMap(record => Seq((key, recordPieces)))
 partitioned = keyedRecordPieces.partitionBy(KeyPartitioner)

 partitioned.mapPartitions(doComputation).save()

 Is there value in having a persist somewhere here?  For example, if the
 flatMap step is particularly expensive, will it ever be computed twice when
 there are no failures?

 Thanks

 Arun




-- 
Liquan Pei
Department of Physics
University of Massachusetts Amherst


Re: General question on persist

2014-09-23 Thread Arun Ahuja
Thanks Liquan, that makes sense, but if I am only doing the computation
once, there will essentially be no difference, correct?

I had a second question, related to mapPartitions:
1) All of the records of the Iterator[T] that a single function call in
mapPartitions processes must fit into memory, correct?
2) Is there some way to process that iterator in sorted order? (See the
sketch below.)
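
A hedged sketch for (2), not from the thread: mapPartitions hands
doComputation a lazy Iterator, so for (1) the records only all need to fit
in memory if something buffers them, and an in-partition sort is exactly
such a buffering step. The names partitioned and doComputation follow the
earlier pseudocode:

val results = partitioned.mapPartitions { iter =>
  // Streaming through iter needs one record at a time, but in arbitrary
  // order. Sorting forces the whole partition into memory first.
  // Assumes an Ordering is available on the key type.
  doComputation(iter.toVector.sortBy(_._1).iterator)
}

If buffering a partition is too costly, sorting during the shuffle instead
(e.g. sortByKey()) keeps user code streaming, though it replaces the custom
KeyPartitioner with a range partitioner.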

Thanks!

Arun

On Tue, Sep 23, 2014 at 5:21 PM, Liquan Pei liquan...@gmail.com wrote:

 Hi Arun,

 The intermediate results like keyedRecordPieces will not be materialized
 by default. This means that if you run

 partitioned = keyedRecordPieces.partitionBy(KeyPartitioner)

 partitioned.mapPartitions(doComputation).save()

 again, keyedRecordPieces will be recomputed. In this case, caching or
 persisting keyedRecordPieces is a good idea to eliminate unnecessary,
 expensive computation. What you can probably do is

 keyedRecordPieces = records.flatMap(record => Seq((key, recordPieces))).cache()

 This will cache the RDD referenced by keyedRecordPieces in memory. For
 more options on cache and persist, take a look at
 http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD.
 There are two APIs you can use to persist RDDs: cache(), which uses the
 default in-memory storage level, and persist(), which lets you specify a
 storage level.

 Thanks,
 Liquan



 On Tue, Sep 23, 2014 at 2:08 PM, Arun Ahuja aahuj...@gmail.com wrote:

 I have a general question on when persisting will be beneficial and when
 it won't:

 I have a task that runs as follows:

 keyedRecordPieces = records.flatMap(record => Seq((key, recordPieces)))
 partitioned = keyedRecordPieces.partitionBy(KeyPartitioner)

 partitioned.mapPartitions(doComputation).save()

 Is there value in having a persist somewhere here?  For example, if the
 flatMap step is particularly expensive, will it ever be computed twice when
 there are no failures?

 Thanks

 Arun




 --
 Liquan Pei
 Department of Physics
 University of Massachusetts Amherst