collect() will store the entire output in a List in memory. This solution is
acceptable for "Little Data" problems, although if the entire problem fits
in the memory of a single machine there is less motivation to use Spark.
Most problems which benefit from Spark are large enough that even the data
for the final result may not fit on a single machine.
Hey Steve - the way to do this is to use the coalesce() function to
coalesce your RDD into a single partition. Then you can do a saveAsTextFile
and you'll wind up with outputDir/part-00000 containing all the data.
-Ilya Ganelin
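A minimal sketch of the coalesce approach described above (the sample data, app name, and "outputDir" path are hypothetical; a real job would already have the RDD from upstream work):

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SingleFileOutput {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("single-file-output")
                .setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Hypothetical stand-in for the RDD produced by the real computation,
        // spread across 4 partitions.
        JavaRDD<String> lines = sc.parallelize(Arrays.asList("a", "b", "c"), 4);

        // coalesce(1) funnels every partition into one, so saveAsTextFile
        // writes exactly one part file: outputDir/part-00000.
        lines.coalesce(1).saveAsTextFile("outputDir");

        sc.stop();
    }
}
```

Note that coalescing to one partition means the final stage runs on a single task, so this only makes sense when the output is small.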
On Mon, Oct 20, 2014 at 11:01 PM, jay vyas wrote:
sounds more like a use case for using "collect"... and writing out the file
in your program?
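A sketch of that collect-and-write approach. Only the file writing is shown as runnable code; the collect() call is left as a comment since it needs a live Spark context, and the sample strings and "output.txt" path are placeholders:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class CollectAndWrite {
    // Writes the collected elements to a single local file,
    // one element per line, preserving their order.
    static void writeToFile(List<String> lines, Path out) throws IOException {
        Files.write(out, lines);
    }

    public static void main(String[] args) throws IOException {
        // In the real job: List<String> lines = rdd.collect();
        List<String> lines = Arrays.asList("first", "second", "third");
        writeToFile(lines, Paths.get("output.txt"));
        System.out.println(Files.readAllLines(Paths.get("output.txt")));
        // → [first, second, third]
    }
}
```

Since collect() pulls everything to the driver, this works only while the output fits in driver memory, which is the limitation Steve raises below.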
On Mon, Oct 20, 2014 at 6:53 PM, Steve Lewis wrote:
Sorry I missed the discussion - although it did not answer the question -
In my case (and I suspect the askers) the 100 slaves are doing a lot of
useful work but the generated output is small enough to be handled by a
single process.
Many of the large data problems I have worked on process a lot of data but
produce only a small amount of output.
This was covered a few days ago:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-write-a-RDD-into-One-Local-Existing-File-td16720.html
Writing multiple output files is actually essential for parallelism, and
certainly not a bad idea. You don't want 100 distributed workers all
writing to 1 file.
At the end of a set of computations I have a JavaRDD<String>. I want a
single file where each string is printed in order. The data is small enough
that it is acceptable to handle the printout on a single processor. It may
be large enough that using collect to generate a list might be unacceptable.