Hi there, I would recommend checking out https://github.com/spark-jobserver/spark-jobserver which I think provides the functionality you are looking for. I haven't tested it myself, though.
BR

On 5 June 2015 at 01:35, Olivier Girardot <ssab...@gmail.com> wrote:

> You can use it as a broadcast variable, but if it's "too" large (more than
> 1 GB, I'd guess), you may need to share it by loading it as its own RDD and
> joining it to the other RDDs on some kind of key.
> But this is the kind of thing broadcast variables were designed for.
>
> Regards,
>
> Olivier.
>
> On Thu, 4 June 2015 at 23:50, dgoldenberg <dgoldenberg...@gmail.com> wrote:
>
>> We have some pipelines defined where we sometimes need to load potentially
>> large resources such as dictionaries.
>>
>> What would be the best strategy for sharing such resources among the
>> transformations/actions within a consumer? Can they be shared somehow
>> across the RDDs?
>>
>> I'm looking for a way to load such a resource once into the cluster memory
>> and have it be available throughout the lifecycle of a consumer...
>>
>> Thanks.
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-share-large-resources-like-dictionaries-while-processing-data-with-Spark-tp23162.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org