Have you looked at Alluxio? (earlier tachyon) Best Regards, Sonal Founder, Nube Technologies <http://www.nubetech.co> Reifier at Strata Hadoop World <https://www.youtube.com/watch?v=eD3LkpPQIgM> Reifier at Spark Summit 2015 <https://spark-summit.org/2015/events/real-time-fuzzy-matching-with-spark-and-elastic-search/>
<http://in.linkedin.com/in/sonalgoyal> On Wed, Jun 29, 2016 at 7:30 PM, Aaron Perrin <aper...@timerazor.com> wrote: > The user guide describes a broadcast as a way to move a large dataset to > each node: > > "Broadcast variables allow the programmer to keep a read-only variable > cached on each machine rather than shipping a copy of it with tasks. They > can be used, for example, to give every node a copy of a large input > dataset in an efficient manner." > > And the broadcast example shows it being used with a variable. > > But, is it somehow possible to instead broadcast a function that can be > executed once, per node? > > My use case is the following: > > I have a large data structure that I currently create on each executor. > The way that I create it is a hack. That is, when the RDD function is > executed on the executor, I block, load a bunch of data (~250 GiB) from an > external data source, create the data structure as a static object in the > JVM, and then resume execution. This works, but it ends up costing me a > lot of extra memory (i.e. a few TiB when I have a lot of executors). > > What I'd like to do is use the broadcast mechanism to load the data > structure once, per node. But, I can't serialize the data structure from > the driver. > > Any ideas? > > Thanks! > > Aaron > >