Hi Deepak,

For persistence across Spark jobs, you can store and access the RDDs in
Tachyon. Tachyon is backed by a ramdisk, so it gives you roughly the same
in-memory performance you would get within a single Spark job.
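As a concrete sketch (plain Spark core API; this assumes a Tachyon master at
tachyon://localhost:19998 with the Tachyon client on the Spark classpath, and
the paths are placeholders): one job writes the RDD out to Tachyon, and a
later job reads it back at memory speed.

// sc: an existing SparkContext (e.g., from spark-shell)
// Job 1: write the computed RDD into Tachyon's ramdisk-backed storage
val events = sc.textFile("hdfs:///data/events")  // placeholder input
events.saveAsTextFile("tachyon://localhost:19998/rdds/events")

// Job 2, a separate Spark application: read it back at memory speed
val restored = sc.textFile("tachyon://localhost:19998/rdds/events")
restored.count()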
For more information, you can take a look at the docs on Tachyon-Spark
integration:
http://tachyon-project.org/documentation/Running-Spark-on-Tachyon.html

Hope this helps,
Calvin

On Thu, Nov 5, 2015 at 10:29 PM, Deenar Toraskar <deenar.toras...@gmail.com> wrote:

> You can have a long-running Spark context in several fashions. This will
> ensure your data stays cached in memory. Clients can then access the RDDs
> through a REST API that you expose. See the Spark Job Server; it does
> something similar, through a facility called Named RDDs.
>
> Using Named RDDs
>
> Named RDDs are a way to easily share RDDs among jobs. Using this facility,
> computed RDDs can be cached under a given name and retrieved later. To use
> this feature, the SparkJob needs to mix in NamedRddSupport (see the sketch
> at the end of this thread).
>
> Alternatively, if you use the Spark Thrift Server, any cached
> DataFrames/RDDs will be available to all clients of the Thrift Server
> until it is shut down.
>
> If you want to support key-value lookups, you might want to use IndexedRDD
> <https://github.com/amplab/spark-indexedrdd> (also sketched at the end of
> this thread).
>
> Finally, though it is not the same as sharing RDDs, Tachyon can cache the
> underlying HDFS blocks.
>
> Deenar
>
> *Think Reactive Ltd*
> deenar.toras...@thinkreactive.co.uk
> 07714140812
>
>
> On 6 November 2015 at 05:56, r7raul1...@163.com <r7raul1...@163.com> wrote:
>
>> You can try
>> http://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html#Archival_Storage_SSD__Memory
>> Hive uses this feature for tmp tables to speed up jobs:
>> https://issues.apache.org/jira/browse/HIVE-7313
>>
>> ------------------------------
>> r7raul1...@163.com
>>
>>
>> *From:* Christian <engr...@gmail.com>
>> *Date:* 2015-11-06 13:50
>> *To:* Deepak Sharma <deepakmc...@gmail.com>
>> *CC:* user <user@spark.apache.org>
>> *Subject:* Re: Spark RDD cache persistence
>> I've never had this need and I've never done it, but there are options
>> that allow it. For example, there are web apps out there that work like
>> the Spark REPL; one of them, I believe, is called Zeppelin. I've never
>> used them, but I've seen them demoed. There is also Tachyon, which Spark
>> supports. Hopefully that gives you a place to start.
>> On Thu, Nov 5, 2015 at 9:21 PM Deepak Sharma <deepakmc...@gmail.com> wrote:
>>
>>> Thanks Christian.
>>> So is there any built-in mechanism in Spark, or an API integration with
>>> other in-memory cache products such as Redis, to load the RDD into those
>>> systems when the program exits?
>>> What's the best approach to a long-lived RDD cache?
>>> Thanks
>>>
>>> Deepak
>>> On 6 Nov 2015 8:34 am, "Christian" <engr...@gmail.com> wrote:
>>>
>>>> The cache gets cleared out when the job finishes. I am not aware of a
>>>> way to keep the cache around between jobs. You could save the RDD as an
>>>> object file to disk and load it back as an object file in your next job
>>>> for speed (see the sketch at the end of this thread).
>>>> On Thu, Nov 5, 2015 at 6:17 PM Deepak Sharma <deepakmc...@gmail.com> wrote:
>>>>
>>>>> Hi All
>>>>> I am confused about RDD persistence in the cache.
>>>>> If I cache an RDD, will it stay in memory even after the Spark program
>>>>> that created it completes execution?
>>>>> If not, how can I guarantee that an RDD is persisted in the cache even
>>>>> after the program finishes execution?
>>>>>
>>>>> Thanks
>>>>>
>>>>> Deepak
>>>>>
>>>>
>
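A minimal sketch of the Named RDD pattern Deenar describes above, assuming
spark-jobserver's NamedRddSupport API; the job names, input path, and element
type are placeholders:

import com.typesafe.config.Config
import org.apache.spark.SparkContext
import spark.jobserver.{NamedRddSupport, SparkJob, SparkJobValid, SparkJobValidation}

// First job: compute an RDD and cache it under a name that outlives the job
object BuildEventsJob extends SparkJob with NamedRddSupport {
  override def validate(sc: SparkContext, config: Config): SparkJobValidation = SparkJobValid
  override def runJob(sc: SparkContext, config: Config): Any = {
    val events = sc.textFile("hdfs:///data/events")  // placeholder input
    namedRdds.update("events", events.cache())       // register under the name "events"
    events.count()
  }
}

// A later job in the same long-running context: retrieve the cached RDD by name
object QueryEventsJob extends SparkJob with NamedRddSupport {
  override def validate(sc: SparkContext, config: Config): SparkJobValidation = SparkJobValid
  override def runJob(sc: SparkContext, config: Config): Any = {
    val events = namedRdds.get[String]("events")
      .getOrElse(sys.error("RDD 'events' is not cached"))
    events.take(10)
  }
}

Both jobs must run in the same long-lived context managed by the job server;
the named RDD disappears when that context is stopped.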
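And a sketch of the IndexedRDD suggestion for key-value lookups, assuming the
API shown in the project's README (the keys and values are made up):

import edu.berkeley.cs.amplab.spark.indexedrdd.IndexedRDD
import edu.berkeley.cs.amplab.spark.indexedrdd.IndexedRDD._

// sc: an existing SparkContext
// Build the index once over a pair RDD with Long keys...
val pairs = sc.parallelize((1 to 1000000).map(x => (x.toLong, 0)))
val indexed = IndexedRDD(pairs).cache()

// ...then serve point lookups and functional updates without full scans
indexed.get(1234L)                   // => Some(0)
val updated = indexed.put(1234L, 42)
updated.get(1234L)                   // => Some(42)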
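Finally, Christian's object-file fallback uses only Spark core APIs; a minimal
sketch with placeholder paths and element type:

// sc: an existing SparkContext
val rdd = sc.parallelize(Seq("a", "b", "c"))  // placeholder for the computed RDD

// End of job 1: spill the computed RDD to durable storage
rdd.saveAsObjectFile("hdfs:///tmp/cached-events")

// Start of job 2: reload it, then re-cache for in-memory access
val restored = sc.objectFile[String]("hdfs:///tmp/cached-events").cache()
restored.count()  // first action repopulates the cache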