You can simply save the join result in a distributed fashion, for example as an 
HDFS file, and then copy the HDFS file to a local file.
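
A minimal sketch in Scala (the DataFrame name "joined" and the paths below are 
made up for illustration):

  // Write the join result to HDFS as CSV; Spark writes one part file
  // per partition under the output directory.
  joined.write
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .save("hdfs:///tmp/join-result")

Afterwards you can merge the part files into a single local file, e.g. with 
hdfs dfs -getmerge /tmp/join-result /tmp/join-result.csv (note that with the 
header option each part file carries its own header line).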

There is also a memory-efficient alternative to collect() for bringing 
distributed data back to the driver: toLocalIterator(). The iterator consumes 
only as much memory as the largest partition in your dataset.

With Spark versions prior to 2.0, use DataFrame.rdd.toLocalIterator(); with 
Spark 2.0, use Dataset.toLocalIterator().
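
For example, a rough sketch with the Spark 2.0 API (the Dataset name "joined" 
and the local path are made up):

  import java.io.PrintWriter
  import scala.collection.JavaConverters._

  // toLocalIterator() returns a java.util.Iterator and pulls one
  // partition at a time to the driver, so peak memory stays around
  // the size of the largest partition.
  val out = new PrintWriter("/tmp/join-result.csv")
  joined.toLocalIterator().asScala.foreach(row => out.println(row.mkString(",")))
  out.close()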

For details, refer to https://issues.apache.org/jira/browse/SPARK-14334

> On Jul 15, 2016, at 09:05, Pedro Rodriguez <ski.rodrig...@gmail.com> wrote:
> 
> Out of curiosity, is there a way to pull all the data back to the driver to 
> save without collect()? That is, stream the data in chunks back to the driver 
> so that the maximum memory used is comparable to a single node's data, but all 
> the data is saved on one node.
> 
> —
> Pedro Rodriguez
> PhD Student in Large-Scale Machine Learning | CU Boulder
> Systems Oriented Data Scientist
> UC Berkeley AMPLab Alumni
> 
> pedrorodriguez.io | 909-353-4423
> github.com/EntilZha | LinkedIn: https://www.linkedin.com/in/pedrorodriguezscience
> On July 14, 2016 at 6:02:12 PM, Jacek Laskowski (ja...@japila.pl) wrote:
> 
>> Hi, 
>> 
>> Please reconsider your wish, since it is going to move the entire 
>> distributed dataset to the single driver machine and may lead 
>> to an OOME. It's better to save your result to HDFS or S3 or any other 
>> distributed filesystem (one that is accessible by both the driver and 
>> the executors). 
>> 
>> If you insist... 
>> 
>> Use collect() after select() and work with Array[T]. 
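>> 
>> For instance, a minimal sketch (the DataFrame name "joined" and the column 
>> names are made up): 
>> 
>>   // collect() materializes the whole result as an Array[Row] on the 
>>   // driver, so the result must fit in driver memory. 
>>   val rows: Array[org.apache.spark.sql.Row] = 
>>     joined.select("col1", "col2").collect() 
>>   rows.foreach(println) 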
>> 
>> Pozdrawiam, 
>> Jacek Laskowski 
>> ---- 
>> https://medium.com/@jaceklaskowski/ 
>> Mastering Apache Spark http://bit.ly/mastering-apache-spark 
>> Follow me at https://twitter.com/jaceklaskowski 
>> 
>> 
>> On Fri, Jul 15, 2016 at 12:15 AM, vr.n. nachiappan 
>> <nachiappan_...@yahoo.com.invalid> wrote: 
>> > Hello, 
>> > 
>> > I am using DataFrames to join two Cassandra tables. 
>> > 
>> > Currently, when I invoke save on the DataFrame as shown below, it saves 
>> > the join results on the executor nodes. 
>> > 
>> > joineddataframe.select(<col1>, <col2> 
>> > ...).write.format("com.databricks.spark.csv").option("header", 
>> > "true").save(<path>) 
>> > 
>> > I would like to persist the results of the join on the Spark Master/Driver 
>> > node. Is it possible to save the results on the Spark Master/Driver, and 
>> > how can it be done? 
>> > 
>> > I appreciate your help. 
>> > 
>> > Nachi 
>> > 
>> 
