hi yash, that really helps me, great thanks.

On Thu, Mar 24, 2016 at 7:07 PM, yash datta <sau...@gmail.com> wrote:
> Yes, that is correct.
>
> When you call cache on an RDD, internally it calls
> persist(StorageLevel.MEMORY_ONLY), which further calls
> persist(StorageLevel.MEMORY_ONLY, allowOverride = false) if the RDD is not
> marked for localCheckpointing.
>
> Below is what is finally triggered:
>
>   /**
>    * Mark this RDD for persisting using the specified level.
>    *
>    * @param newLevel the target storage level
>    * @param allowOverride whether to override any existing level with the new one
>    */
>   private def persist(newLevel: StorageLevel, allowOverride: Boolean): this.type = {
>     // TODO: Handle changes of StorageLevel
>     if (storageLevel != StorageLevel.NONE && newLevel != storageLevel && !allowOverride) {
>       throw new UnsupportedOperationException(
>         "Cannot change storage level of an RDD after it was already assigned a level")
>     }
>     // If this is the first time this RDD is marked for persisting, register it
>     // with the SparkContext for cleanups and accounting. Do this only once.
>     if (storageLevel == StorageLevel.NONE) {
>       sc.cleaner.foreach(_.registerRDDForCleanup(this))
>       sc.persistRDD(this)
>     }
>     storageLevel = newLevel
>     this
>   }
>
> As is clear from the code, persistRDD is called only when the storageLevel of
> the RDD was never set (so it will be called only once across multiple calls
> on the same RDD).
> Also, persistRDD only sets an entry in the persistentRdds map, which is keyed
> by RDD id:
>
>   /**
>    * Register an RDD to be persisted in memory and/or disk storage
>    */
>   private[spark] def persistRDD(rdd: RDD[_]) {
>     persistentRdds(rdd.id) = rdd
>   }
>
> Hope this helps.
>
> Best
> Yash
>
> On Thu, Mar 24, 2016 at 1:58 PM, charles li <charles.up...@gmail.com> wrote:
>
>> happened to see this problem on stackoverflow:
>> http://stackoverflow.com/questions/36195105/what-happens-if-i-cache-the-same-rdd-twice-in-spark/36195812#36195812
>>
>> I think it's very interesting, and the answer posted by Aaron sounds
>> promising, but I'm not sure, and I can't find the details of the caching
>> mechanism in Spark, so I'm posting here to ask everyone about the
>> internal principle behind the implementation of cache.
>>
>> great thanks.
>>
>> ----- aaron's answer to that question [Is that right?] -----
>>
>> Nothing happens; it will just cache the RDD once. The reason, I think,
>> is that every RDD has an id internally; Spark uses the id to mark
>> whether an RDD has been cached or not, so caching one RDD multiple
>> times will do nothing.
>> -----------
>>
>> --
>> *--------------------------------------*
>> a spark lover, a quant, a developer and a good man.
>>
>> http://github.com/litaotao
>
> --
> When events unfold with calm and ease
> When the winds that blow are merely breeze
> Learn from nature, from birds and bees
> Live your life in love, and let joy not cease.
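
P.S. The guard logic quoted above can be sketched as a small standalone model. This is not the real Spark code: FakeRDD, the plain string storage levels, and the registrations counter are simplified stand-ins for RDD, StorageLevel, and the sc.persistRDD bookkeeping, just to make the "register once, never silently change the level" behaviour runnable outside Spark.

```scala
// Minimal model of RDD.persist's guard logic (simplified stand-ins, not Spark types).
object CacheDemo {
  var registrations = 0 // stands in for sc.persistRDD bookkeeping

  class FakeRDD {
    var storageLevel: String = "NONE"

    def persist(newLevel: String, allowOverride: Boolean = false): this.type = {
      // Changing an already-assigned level without allowOverride is an error
      if (storageLevel != "NONE" && newLevel != storageLevel && !allowOverride)
        throw new UnsupportedOperationException(
          "Cannot change storage level of an RDD after it was already assigned a level")
      // Register only on the first persist call
      if (storageLevel == "NONE") registrations += 1
      storageLevel = newLevel
      this
    }

    def cache(): this.type = persist("MEMORY_ONLY")
  }

  def main(args: Array[String]): Unit = {
    val rdd = new FakeRDD
    rdd.cache()
    rdd.cache() // no-op: level is already MEMORY_ONLY, nothing registered twice
    println(s"registrations = ${registrations}")
    try {
      rdd.persist("MEMORY_AND_DISK") // different level, no override: throws
    } catch {
      case e: UnsupportedOperationException => println("caught: " + e.getMessage)
    }
  }
}
```

Running main shows registration happens once across both cache() calls, and that asking for a different level afterwards throws rather than silently re-caching.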