I'd guess that if the resources are broadcast, Spark would put them into 
Tachyon...
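
Either way, Tachyon speaks the Hadoop FileSystem API, so you can do the 
read/write through the normal Spark file methods. A rough sketch in Scala 
(the master host/port and path are made up, and it assumes the Tachyon 
client jar is on the classpath):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("resource-loader"))

    // Tachyon implements the Hadoop FileSystem interface, so tachyon://
    // URIs work with the usual RDD file methods.
    val lines = sc.textFile("tachyon://tachyon-master:19998/resources/dictionary.txt")

    // Pull the dictionary to the driver once, then broadcast it so each
    // executor holds one local copy instead of re-reading it per task.
    val dictBc = sc.broadcast(lines.collect().toSet)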

> On Jan 12, 2016, at 7:04 AM, Dmitry Goldenberg <dgoldenberg...@gmail.com> 
> wrote:
> 
> Would it make sense to load them into Tachyon and read and broadcast them 
> from there, since Tachyon is already part of the Spark stack?
> 
> If so, I wonder if I could do the Tachyon read/write via a Spark API?
> 
> 
>> On Jan 12, 2016, at 2:21 AM, Sabarish Sasidharan 
>> <sabarish.sasidha...@manthan.com> wrote:
>> 
>> One option is to store them as blobs in a cache like Redis and then read 
>> + broadcast them from the driver. Or you could store them in HDFS and do 
>> the same read + broadcast from there.
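>> 
>> For instance, roughly (assuming an existing SparkContext sc; the HDFS 
>> path and names are just placeholders):
>> 
>>     import org.apache.hadoop.conf.Configuration
>>     import org.apache.hadoop.fs.{FileSystem, Path}
>>     import org.apache.commons.io.IOUtils
>> 
>>     // Read the blob once on the driver via the Hadoop FileSystem API.
>>     val fs    = FileSystem.get(new Configuration())
>>     val in    = fs.open(new Path("/resources/model.bin"))
>>     val bytes = try IOUtils.toByteArray(in) finally in.close()
>> 
>>     // Broadcast it; each executor then keeps a single local copy
>>     // instead of pulling it from HDFS for every task.
>>     val modelBc = sc.broadcast(bytes)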
>> 
>> Regards
>> Sab
>> 
>>> On Tue, Jan 12, 2016 at 1:44 AM, Dmitry Goldenberg 
>>> <dgoldenberg...@gmail.com> wrote:
>>> We have a bunch of Spark jobs deployed, plus a few large resource files, 
>>> such as a dictionary for lookups or a statistical model.
>>> 
>>> Right now, these are deployed as part of the Spark job jars, which will 
>>> eventually make those mono-jars too bloated for deployment.
>>> 
>>> What are some of the best practices to consider for maintaining and sharing 
>>> large resource files like these?
>>> 
>>> Thanks.
>> 
>> 
>> 
>> -- 
>> 
>> Architect - Big Data
>> Ph: +91 99805 99458
>> 
>> Manthan Systems | Company of the year - Analytics (2014 Frost and Sullivan 
>> India ICT)
