Broadcasts let you send one copy of read-only data to each executor. That's
not the same as a DataFrame: a DataFrame is distributed by nature, so it
doesn't make sense to think of one as not distributed. But consider things
like broadcast hash joins, which may be what you are looking for if you
really mean to join against a small DF efficiently.
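
To make the broadcast hash join idea concrete, here is a minimal plain-Python
sketch of the technique. It is not Spark's implementation; the data, the
partition layout, and all function names are made up for illustration. The
small side is turned into a hash table once, that one read-only table is
handed to every partition of the large side, and each partition probes it
locally without any shuffle.

```python
from typing import Dict, List, Tuple

# Hypothetical data: the large side is already split into partitions
# (think one per executor); the small side fits comfortably in memory.
large_partitions: List[List[Tuple[int, str]]] = [
    [(1, "order-a"), (2, "order-b")],
    [(3, "order-c"), (1, "order-d")],
]
small_table: List[Tuple[int, str]] = [(1, "alice"), (2, "bob")]


def build_hash_table(rows: List[Tuple[int, str]]) -> Dict[int, List[str]]:
    """Build the read-only lookup table that would be broadcast."""
    table: Dict[int, List[str]] = {}
    for key, value in rows:
        table.setdefault(key, []).append(value)
    return table


def join_partition(
    partition: List[Tuple[int, str]],
    broadcast_table: Dict[int, List[str]],
) -> List[Tuple[int, str, str]]:
    """Inner-join one partition against the broadcast table, locally."""
    joined = []
    for key, left_value in partition:
        for right_value in broadcast_table.get(key, []):
            joined.append((key, left_value, right_value))
    return joined


def broadcast_hash_join(
    partitions: List[List[Tuple[int, str]]],
    small_rows: List[Tuple[int, str]],
) -> List[List[Tuple[int, str, str]]]:
    """Build the table once, then reuse the same copy for every partition."""
    broadcast_table = build_hash_table(small_rows)
    return [join_partition(p, broadcast_table) for p in partitions]


result = broadcast_hash_join(large_partitions, small_table)
```

In Spark itself you would not write this by hand. You would ask for it with
the broadcast() hint from pyspark.sql.functions (or
org.apache.spark.sql.functions in Scala), along the lines of
large_df.join(broadcast(small_df), "key"), where large_df, small_df and
"key" are placeholder names.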

On Tue, Feb 28, 2017, 16:03 johndesuv <desu...@gmail.com> wrote:

> Hi,
>
> I have an application that runs on a series of JVMs that each contain a
> subset of a large dataset in memory.  I'd like to use this data in spark and
> am looking at ways to use this as a data source in spark without writing the
> data to disk as a handoff.
>
> Parallelize doesn't work for me since I need to use the data across all the
> JVMs as one DataFrame.
>
> The only option I've come up with so far is to write a custom DataSource
> that then transmits the data from each of the JVMs over the network.  This
> seems like overkill though.
>
> Is there a simpler solution for getting this data into a DataFrame?
>
> Thanks,
> John
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/DataFrame-from-in-memory-datasets-in-multiple-JVMs-tp28438.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
