Thanks Cheng, that was helpful.
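For anyone reading this later, here is a minimal runnable version of the
pattern you describe below. It is only a sketch: the input path, the filter
predicates, and the local master are placeholders, not from the original
thread.

import org.apache.spark.{SparkConf, SparkContext}

object EarlyUnpersist {
  def main(args: Array[String]): Unit = {
    // Local master only for a quick test; use spark-submit in practice.
    val sc = new SparkContext(
      new SparkConf().setAppName("early-unpersist").setMaster("local[*]"))

    // cache() is lazy: these calls only mark the RDDs for caching.
    val rdd1 = sc.textFile("hdfs:///path/to/input").cache()  // placeholder path
    val rdd2 = rdd1.filter(_.contains("foo")).cache()        // placeholder predicate
    val rdd3 = rdd1.filter(_.contains("bar")).cache()        // placeholder predicate

    // A single action materializes and caches rdd1, rdd2 and rdd3.
    (rdd2 ++ rdd3).count()

    // unpersist() is eager: rdd1 is dropped from the cache immediately,
    // while rdd2 and rdd3 stay cached for later use.
    rdd1.unpersist()

    sc.stop()
  }
}
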
On Wed, Apr 16, 2014 at 1:29 PM, Cheng Lian <lian.cs....@gmail.com> wrote:

> You can remove the cached rdd1 from the cache manager by calling
> rdd1.unpersist(). But there are some subtleties here: RDD.cache() is
> *lazy* while RDD.unpersist() is *eager*. When .cache() is called, it just
> tells the Spark runtime to cache the RDD *later*, when a corresponding
> job that uses this RDD is submitted; when .unpersist() is called, the
> cached RDD is removed immediately. So you may want to do something like
> this to avoid rdd1 taking too much memory:
>
> val rdd1 = sc.textFile(path).cache()
> val rdd2 = rdd1.filter(...).cache()
> val rdd3 = rdd1.filter(...).cache()
> // Trigger a job to materialize and cache rdd1, rdd2 & rdd3
> (rdd2 ++ rdd3).count()
> // Remove rdd1
> rdd1.unpersist()
> // Use rdd2 & rdd3 for later logic.
>
> This way, one additional job is required, but you get the chance to
> evict rdd1 as early as possible.
>
>
> On Wed, Apr 16, 2014 at 2:43 PM, Arpit Tak <arpit.sparku...@gmail.com> wrote:
>
>> Hi Cheng,
>>
>> Is it possible to delete or replicate an RDD?
>>
>> > rdd1 = textFile("hdfs...").cache()
>> >
>> > rdd2 = rdd1.filter(userDefinedFunc1).cache()
>> > rdd3 = rdd1.filter(userDefinedFunc2).cache()
>>
>> To reframe the question above: if rdd1 is around 50G and after filtering
>> it comes to around, say, 4G, then to increase computing performance we
>> cache rdd1, while rdd2 and rdd3 stay on disk. Would that perform better
>> than running the filter against disk and then caching rdd2 and rdd3?
>>
>> Or can we also remove a particular RDD from the cache, say rdd1 (if
>> cached), after the filter operation, since it is no longer required and
>> we would save memory?
>>
>> Regards,
>> Arpit
>>
>>
>> On Tue, Apr 15, 2014 at 7:14 AM, Cheng Lian <lian.cs....@gmail.com> wrote:
>>
>>> Hi Joe,
>>>
>>> You need to figure out which RDD is used most frequently. In your case,
>>> rdd2 & rdd3 are filtered results of rdd1, so usually they are relatively
>>> smaller than rdd1, and it would be more reasonable to cache rdd2 and/or
>>> rdd3 if rdd1 is not referenced elsewhere.
>>>
>>> Say rdd1 takes 10G and rdd2 takes 1G after filtering; if you cache both
>>> of them, you end up with 11G of memory consumption, which might not be
>>> what you want.
>>>
>>> Regards,
>>> Cheng
>>>
>>>
>>> On Mon, Apr 14, 2014 at 8:32 PM, Joe L <selme...@yahoo.com> wrote:
>>>
>>>> Hi, I am trying to cache 2 GB of data and to implement the following
>>>> procedure. In order to cache them I did as follows. Is it necessary to
>>>> cache rdd2 since rdd1 is already cached?
>>>>
>>>> rdd1 = textFile("hdfs...").cache()
>>>>
>>>> rdd2 = rdd1.filter(userDefinedFunc1).cache()
>>>> rdd3 = rdd1.filter(userDefinedFunc2).cache()
>>>>
>>>>
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Proper-caching-method-tp4206.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
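
A follow-up sketch on the disk question above. One option the thread does
not spell out is to leave the 50G parent uncached and persist only the
small filtered RDDs with an explicit StorageLevel, so partitions that fit
stay in memory and the rest spill to local disk instead of being
recomputed. The path and predicates are again placeholders.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object FilteredCaching {
  def main(args: Array[String]): Unit = {
    // Local master only for a quick test; use spark-submit in practice.
    val sc = new SparkContext(
      new SparkConf().setAppName("filtered-caching").setMaster("local[*]"))

    // Leave the ~50G parent uncached; keep only the ~4G filtered results.
    val rdd1 = sc.textFile("hdfs:///path/to/big/input")  // placeholder path

    // MEMORY_AND_DISK holds partitions in memory when they fit and spills
    // the rest to local disk instead of recomputing them from rdd1.
    val rdd2 = rdd1.filter(_.contains("foo")).persist(StorageLevel.MEMORY_AND_DISK)
    val rdd3 = rdd1.filter(_.contains("bar")).persist(StorageLevel.MEMORY_AND_DISK)

    // The first action on each RDD scans the parent once to fill its cache;
    // later actions read from the persisted copies.
    println(rdd2.count())
    println(rdd3.count())

    sc.stop()
  }
}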