If you have one executor per machine, which is the right default, and the
data structure is a singleton in the JVM, then you already have just one
copy per machine. Of course an executor is tied to an app, so if you mean
to hold this data across executors, that won't help.
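In case it's useful, here's a rough Scala sketch of what I mean. The names
(BigStructureHolder, loadFromExternalSource) and the Map type are just
placeholders for your actual structure and loading code:

import org.apache.spark.rdd.RDD

// Loaded lazily, at most once per executor JVM.
object BigStructureHolder {
  // Placeholder: swap in your real structure and the ~250 GiB load here.
  lazy val data: Map[String, String] = loadFromExternalSource()

  private def loadFromExternalSource(): Map[String, String] =
    Map("example-key" -> "example-value")
}

object Example {
  // The first task on each executor triggers the load; later tasks in the
  // same JVM reuse the same instance, so you pay the cost once per
  // executor rather than once per task.
  def process(rdd: RDD[String]): RDD[String] =
    rdd.mapPartitions { iter =>
      val big = BigStructureHolder.data
      iter.map(record => big.getOrElse(record, record))
    }
}

Because the object is referenced statically from the closure, nothing large
has to be serialized from the driver.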


On Wed, Jun 29, 2016 at 3:00 PM, Aaron Perrin <aper...@timerazor.com> wrote:
> The user guide describes a broadcast as a way to move a large dataset to
> each node:
>
> "Broadcast variables allow the programmer to keep a read-only variable
> cached on each machine rather than shipping a copy of it with tasks. They
> can be used, for example, to give every node a copy of a large input dataset
> in an efficient manner."
>
> And the broadcast example shows it being used with a variable.
>
> But is it somehow possible to instead broadcast a function that is
> executed once per node?
>
> My use case is the following:
>
> I have a large data structure that I currently create on each executor.  The
> way that I create it is a hack.  That is, when the RDD function is executed
> on the executor, I block, load a bunch of data (~250 GiB) from an external
> data source, create the data structure as a static object in the JVM, and
> then resume execution.  This works, but it ends up costing me a lot of extra
> memory (i.e. a few TiB when I have a lot of executors).
>
> What I'd like to do is use the broadcast mechanism to load the data
> structure once per node. But I can't serialize the data structure from
> the driver.
>
> Any ideas?
>
> Thanks!
>
> Aaron
>
