Hi there, I would recommend checking out https://github.com/spark-jobserver/spark-jobserver which I think provides the functionality you are looking for. I haven't tested it myself, though.
BR

On 5 June 2015 at 01:35, Olivier Girardot <ssab...@gmail.com> wrote:

> You can use it as a broadcast variable, but if it's "too" large (more than
> 1 GB, I'd guess), you may need to share it by loading it as its own RDD and
> joining it to the other RDDs on some kind of key.
> But this is the kind of thing broadcast variables were designed for.
>
> Regards,
>
> Olivier.
>
> On Thu, 4 June 2015 at 23:50, dgoldenberg <dgoldenberg...@gmail.com> wrote:
>
>> We have some pipelines defined where we sometimes need to load potentially
>> large resources such as dictionaries.
>>
>> What would be the best strategy for sharing such resources among the
>> transformations/actions within a consumer? Can they be shared somehow
>> across the RDDs?
>>
>> I'm looking for a way to load such a resource once into the cluster memory
>> and have it be available throughout the lifecycle of a consumer...
>>
>> Thanks.
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-share-large-resources-like-dictionaries-while-processing-data-with-Spark-tp23162.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org