Hi Deepak,

For persistence across Spark jobs, you can store and access the RDDs in
Tachyon. Tachyon is backed by a ramdisk, so it gives you roughly the same
in-memory performance you would get within a single Spark job.
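As a concrete sketch (plain Spark core API; this assumes a Tachyon master at
tachyon://localhost:19998 with the Tachyon client on the Spark classpath, and
the paths are placeholders): one job writes the RDD out to Tachyon, and a
later job reads it back at memory speed.

// sc: an existing SparkContext (e.g., from spark-shell)
// Job 1: write the computed RDD into Tachyon's ramdisk-backed storage
val events = sc.textFile("hdfs:///data/events")  // placeholder input
events.saveAsTextFile("tachyon://localhost:19998/rdds/events")

// Job 2, a separate Spark application: read it back at memory speed
val restored = sc.textFile("tachyon://localhost:19998/rdds/events")
restored.count()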
For more information, you can take a look at the docs on Tachyon-Spark
integration:
http://tachyon-project.org/documentation/Running-Spark-on-Tachyon.html

Hope this helps,
Calvin

On Thu, Nov 5, 2015 at 10:29 PM, Deenar Toraskar <deenar.toras...@gmail.com> wrote:

> You can have a long-running Spark context in several fashions. This will
> ensure your data stays cached in memory. Clients can then access the RDDs
> through a REST API that you expose. See the Spark Job Server; it does
> something similar, through a facility called Named RDDs.
>
> Using Named RDDs
>
> Named RDDs are a way to easily share RDDs among jobs. Using this facility,
> computed RDDs can be cached under a given name and retrieved later. To use
> this feature, the SparkJob needs to mix in NamedRddSupport (see the sketch
> at the end of this thread).
>
> Alternatively, if you use the Spark Thrift Server, any cached
> DataFrames/RDDs will be available to all clients of the Thrift Server
> until it is shut down.
>
> If you want to support key-value lookups, you might want to use IndexedRDD
> <https://github.com/amplab/spark-indexedrdd> (also sketched at the end of
> this thread).
>
> Finally, though it is not the same as sharing RDDs, Tachyon can cache the
> underlying HDFS blocks.
>
> Deenar
>
> *Think Reactive Ltd*
> deenar.toras...@thinkreactive.co.uk
> 07714140812
>
>
> On 6 November 2015 at 05:56, r7raul1...@163.com <r7raul1...@163.com> wrote:
>
>> You can try
>> http://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html#Archival_Storage_SSD__Memory
>> Hive uses this feature for tmp tables to speed up jobs:
>> https://issues.apache.org/jira/browse/HIVE-7313
>>
>> ------------------------------
>> r7raul1...@163.com
>>
>>
>> *From:* Christian <engr...@gmail.com>
>> *Date:* 2015-11-06 13:50
>> *To:* Deepak Sharma <deepakmc...@gmail.com>
>> *CC:* user <user@spark.apache.org>
>> *Subject:* Re: Spark RDD cache persistence
>> I've never had this need and I've never done it, but there are options
>> that allow it. For example, there are web apps out there that work like
>> the Spark REPL; one of them, I believe, is called Zeppelin. I've never
>> used them, but I've seen them demoed. There is also Tachyon, which Spark
>> supports. Hopefully that gives you a place to start.
>> On Thu, Nov 5, 2015 at 9:21 PM Deepak Sharma <deepakmc...@gmail.com> wrote:
>>
>>> Thanks Christian.
>>> So is there any built-in mechanism in Spark, or an API integration with
>>> other in-memory cache products such as Redis, to load the RDD into those
>>> systems when the program exits?
>>> What's the best approach to a long-lived RDD cache?
>>> Thanks
>>>
>>> Deepak
>>> On 6 Nov 2015 8:34 am, "Christian" <engr...@gmail.com> wrote:
>>>
>>>> The cache gets cleared out when the job finishes. I am not aware of a
>>>> way to keep the cache around between jobs. You could save the RDD as an
>>>> object file to disk and load it back as an object file in your next job
>>>> for speed (see the sketch at the end of this thread).
>>>> On Thu, Nov 5, 2015 at 6:17 PM Deepak Sharma <deepakmc...@gmail.com> wrote:
>>>>
>>>>> Hi All
>>>>> I am confused about RDD persistence in the cache.
>>>>> If I cache an RDD, will it stay in memory even after the Spark program
>>>>> that created it completes execution?
>>>>> If not, how can I guarantee that an RDD is persisted in the cache even
>>>>> after the program finishes execution?
>>>>>
>>>>> Thanks
>>>>>
>>>>> Deepak
>>>>>
>>>>
>
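A minimal sketch of the Named RDD pattern Deenar describes above, assuming
spark-jobserver's NamedRddSupport API; the job names, input path, and element
type are placeholders:

import com.typesafe.config.Config
import org.apache.spark.SparkContext
import spark.jobserver.{NamedRddSupport, SparkJob, SparkJobValid, SparkJobValidation}

// First job: compute an RDD and cache it under a name that outlives the job
object BuildEventsJob extends SparkJob with NamedRddSupport {
  override def validate(sc: SparkContext, config: Config): SparkJobValidation = SparkJobValid
  override def runJob(sc: SparkContext, config: Config): Any = {
    val events = sc.textFile("hdfs:///data/events")  // placeholder input
    namedRdds.update("events", events.cache())       // register under the name "events"
    events.count()
  }
}

// A later job in the same long-running context: retrieve the cached RDD by name
object QueryEventsJob extends SparkJob with NamedRddSupport {
  override def validate(sc: SparkContext, config: Config): SparkJobValidation = SparkJobValid
  override def runJob(sc: SparkContext, config: Config): Any = {
    val events = namedRdds.get[String]("events")
      .getOrElse(sys.error("RDD 'events' is not cached"))
    events.take(10)
  }
}

Both jobs must run in the same long-lived context managed by the job server;
the named RDD disappears when that context is stopped.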
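And a sketch of the IndexedRDD suggestion for key-value lookups, assuming the
API shown in the project's README (the keys and values are made up):

import edu.berkeley.cs.amplab.spark.indexedrdd.IndexedRDD
import edu.berkeley.cs.amplab.spark.indexedrdd.IndexedRDD._

// sc: an existing SparkContext
// Build the index once over a pair RDD with Long keys...
val pairs = sc.parallelize((1 to 1000000).map(x => (x.toLong, 0)))
val indexed = IndexedRDD(pairs).cache()

// ...then serve point lookups and functional updates without full scans
indexed.get(1234L)                   // => Some(0)
val updated = indexed.put(1234L, 42)
updated.get(1234L)                   // => Some(42)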
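Finally, Christian's object-file fallback uses only Spark core APIs; a minimal
sketch with placeholder paths and element type:

// sc: an existing SparkContext
val rdd = sc.parallelize(Seq("a", "b", "c"))  // placeholder for the computed RDD

// End of job 1: spill the computed RDD to durable storage
rdd.saveAsObjectFile("hdfs:///tmp/cached-events")

// Start of job 2: reload it, then re-cache for in-memory access
val restored = sc.objectFile[String]("hdfs:///tmp/cached-events").cache()
restored.count()  // first action repopulates the cache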