Since the data is in multiple JVMs, only one of them can be the driver. So I can parallelize the data from one of the JVMs, but I don't have a way to do the same for the others. Or am I missing something?
On Tue, Feb 28, 2017 at 3:53 PM, ayan guha <guha.a...@gmail.com> wrote:
> How about parallelize and then union all of them to one data frame?
>
> On Wed, 1 Mar 2017 at 3:07 am, Sean Owen <so...@cloudera.com> wrote:
>
>> Broadcasts let you send one copy of read-only data to each executor.
>> That's not the same as a DataFrame, and its nature means it doesn't
>> make sense to think of them as distributed. But consider things like
>> broadcast hash joins, which may be what you are looking for if you
>> really mean to join on a small DF efficiently.
>>
>> On Tue, Feb 28, 2017, 16:03 johndesuv <desu...@gmail.com> wrote:
>>
>> Hi,
>>
>> I have an application that runs on a series of JVMs that each contain
>> a subset of a large dataset in memory. I'd like to use this data in
>> Spark and am looking at ways to use it as a data source in Spark
>> without writing the data to disk as a handoff.
>>
>> Parallelize doesn't work for me since I need to use the data across
>> all the JVMs as one DataFrame.
>>
>> The only option I've come up with so far is to write a custom
>> DataSource that transmits the data from each of the JVMs over the
>> network. This seems like overkill though.
>>
>> Is there a simpler solution for getting this data into a DataFrame?
>>
>> Thanks,
>> John
>>
>> --
>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/DataFrame-from-in-memory-datasets-in-multiple-JVMs-tp28438.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> --
> Best Regards,
> Ayan Guha