Thanks Cheng, that was helpful.
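For anyone reading this later, here is a minimal runnable version of the
pattern you describe below. It is only a sketch: the input path, the filter
predicates, and the local master are placeholders, not from the original
thread.

import org.apache.spark.{SparkConf, SparkContext}

object EarlyUnpersist {
  def main(args: Array[String]): Unit = {
    // Local master only for a quick test; use spark-submit in practice.
    val sc = new SparkContext(
      new SparkConf().setAppName("early-unpersist").setMaster("local[*]"))

    // cache() is lazy: these calls only mark the RDDs for caching.
    val rdd1 = sc.textFile("hdfs:///path/to/input").cache()  // placeholder path
    val rdd2 = rdd1.filter(_.contains("foo")).cache()        // placeholder predicate
    val rdd3 = rdd1.filter(_.contains("bar")).cache()        // placeholder predicate

    // A single action materializes and caches rdd1, rdd2 and rdd3.
    (rdd2 ++ rdd3).count()

    // unpersist() is eager: rdd1 is dropped from the cache immediately,
    // while rdd2 and rdd3 stay cached for later use.
    rdd1.unpersist()

    sc.stop()
  }
}
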
On Wed, Apr 16, 2014 at 1:29 PM, Cheng Lian <lian.cs....@gmail.com> wrote:

> You can remove the cached rdd1 from the cache manager by calling
> rdd1.unpersist(). But there are some subtleties here: RDD.cache() is
> *lazy* while RDD.unpersist() is *eager*. When .cache() is called, it just
> tells the Spark runtime to cache the RDD *later*, when a corresponding
> job that uses this RDD is submitted; when .unpersist() is called, the
> cached RDD is removed immediately. So you may want to do something like
> this to avoid rdd1 taking too much memory:
>
> val rdd1 = sc.textFile(path).cache()
> val rdd2 = rdd1.filter(...).cache()
> val rdd3 = rdd1.filter(...).cache()
> // Trigger a job to materialize and cache rdd1, rdd2 & rdd3
> (rdd2 ++ rdd3).count()
> // Remove rdd1
> rdd1.unpersist()
> // Use rdd2 & rdd3 for later logic.
>
> This way, one additional job is required, but you get the chance to
> evict rdd1 as early as possible.
>
>
> On Wed, Apr 16, 2014 at 2:43 PM, Arpit Tak <arpit.sparku...@gmail.com> wrote:
>
>> Hi Cheng,
>>
>> Is it possible to delete or replicate an RDD?
>>
>> > rdd1 = textFile("hdfs...").cache()
>> >
>> > rdd2 = rdd1.filter(userDefinedFunc1).cache()
>> > rdd3 = rdd1.filter(userDefinedFunc2).cache()
>>
>> To reframe the question above: if rdd1 is around 50G and after filtering
>> it comes to around, say, 4G, then to increase computing performance we
>> cache rdd1, while rdd2 and rdd3 stay on disk. Would that perform better
>> than running the filter against disk and then caching rdd2 and rdd3?
>>
>> Or can we also remove a particular RDD from the cache, say rdd1 (if
>> cached), after the filter operation, since it is no longer required and
>> we would save memory?
>>
>> Regards,
>> Arpit
>>
>>
>> On Tue, Apr 15, 2014 at 7:14 AM, Cheng Lian <lian.cs....@gmail.com> wrote:
>>
>>> Hi Joe,
>>>
>>> You need to figure out which RDD is used most frequently. In your case,
>>> rdd2 & rdd3 are filtered results of rdd1, so usually they are relatively
>>> smaller than rdd1, and it would be more reasonable to cache rdd2 and/or
>>> rdd3 if rdd1 is not referenced elsewhere.
>>>
>>> Say rdd1 takes 10G and rdd2 takes 1G after filtering; if you cache both
>>> of them, you end up with 11G of memory consumption, which might not be
>>> what you want.
>>>
>>> Regards,
>>> Cheng
>>>
>>>
>>> On Mon, Apr 14, 2014 at 8:32 PM, Joe L <selme...@yahoo.com> wrote:
>>>
>>>> Hi, I am trying to cache 2 GB of data and to implement the following
>>>> procedure. In order to cache them I did as follows. Is it necessary to
>>>> cache rdd2 since rdd1 is already cached?
>>>>
>>>> rdd1 = textFile("hdfs...").cache()
>>>>
>>>> rdd2 = rdd1.filter(userDefinedFunc1).cache()
>>>> rdd3 = rdd1.filter(userDefinedFunc2).cache()
>>>>
>>>>
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Proper-caching-method-tp4206.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
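
A follow-up sketch on the disk question above. One option the thread does
not spell out is to leave the 50G parent uncached and persist only the
small filtered RDDs with an explicit StorageLevel, so partitions that fit
stay in memory and the rest spill to local disk instead of being
recomputed. The path and predicates are again placeholders.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object FilteredCaching {
  def main(args: Array[String]): Unit = {
    // Local master only for a quick test; use spark-submit in practice.
    val sc = new SparkContext(
      new SparkConf().setAppName("filtered-caching").setMaster("local[*]"))

    // Leave the ~50G parent uncached; keep only the ~4G filtered results.
    val rdd1 = sc.textFile("hdfs:///path/to/big/input")  // placeholder path

    // MEMORY_AND_DISK holds partitions in memory when they fit and spills
    // the rest to local disk instead of recomputing them from rdd1.
    val rdd2 = rdd1.filter(_.contains("foo")).persist(StorageLevel.MEMORY_AND_DISK)
    val rdd3 = rdd1.filter(_.contains("bar")).persist(StorageLevel.MEMORY_AND_DISK)

    // The first action on each RDD scans the parent once to fill its cache;
    // later actions read from the persisted copies.
    println(rdd2.count())
    println(rdd3.count())

    sc.stop()
  }
}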