Have you looked at Alluxio? (earlier tachyon)

Best Regards,
Sonal
Founder, Nube Technologies <http://www.nubetech.co>
Reifier at Strata Hadoop World <https://www.youtube.com/watch?v=eD3LkpPQIgM>
Reifier at Spark Summit 2015
<https://spark-summit.org/2015/events/real-time-fuzzy-matching-with-spark-and-elastic-search/>

<http://in.linkedin.com/in/sonalgoyal>



On Wed, Jun 29, 2016 at 7:30 PM, Aaron Perrin <aper...@timerazor.com> wrote:

> The user guide describes a broadcast as a way to move a large dataset to
> each node:
>
> "Broadcast variables allow the programmer to keep a read-only variable
> cached on each machine rather than shipping a copy of it with tasks. They
> can be used, for example, to give every node a copy of a large input
> dataset in an efficient manner."
>
> And the broadcast example shows it being used with a variable.
>
> But, is it somehow possible to instead broadcast a function that can be
> executed once, per node?
>
> My use case is the following:
>
> I have a large data structure that I currently create on each executor.
> The way that I create it is a hack.  That is, when the RDD function is
> executed on the executor, I block, load a bunch of data (~250 GiB) from an
> external data source, create the data structure as a static object in the
> JVM, and then resume execution.  This works, but it ends up costing me a
> lot of extra memory (i.e. a few TiB when I have a lot of executors).
>
> What I'd like to do is use the broadcast mechanism to load the data
> structure once, per node.  But, I can't serialize the data structure from
> the driver.
>
> Any ideas?
>
> Thanks!
>
> Aaron
>
>

Reply via email to