Re: How do you write a JavaRDD into a single file

2014-10-21 Thread Steve Lewis
Collect will store the entire output in a List in memory. This solution is
acceptable for Little Data problems, although if the entire problem fits
in the memory of a single machine there is less motivation to use Spark.

Most problems which benefit from Spark are large enough that even the data
assigned to a single partition will not fit into memory.

In my particular case the output is now in the 0.5-4 GB range but in the
future might reach 4 times that size - something a single machine could
write but not hold in memory at one time. I find that for most problems a
file like part-00001 is not what the next step wants to use - the minute a
step is required to further process that file, even a move and rename, there
is little reason not to let the Spark code write what is wanted in the first
place.

I like the solution of using toLocalIterator and writing my own file.
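A minimal sketch of that approach, assuming a JavaRDD<String> named rdd and
a local output path (both placeholders):

import java.io.FileNotFoundException;
import java.io.PrintWriter;
import java.util.Iterator;

import org.apache.spark.api.java.JavaRDD;

public final class SingleFileWriter {
    // Streams the RDD to one local file. toLocalIterator pulls one
    // partition at a time to the driver, so only a single partition
    // needs to fit in driver memory (unlike collect).
    public static void writeSingleFile(JavaRDD<String> rdd, String path)
            throws FileNotFoundException {
        try (PrintWriter out = new PrintWriter(path)) {
            Iterator<String> it = rdd.toLocalIterator();
            while (it.hasNext()) {
                out.println(it.next());
            }
        }
    }
}

Elements arrive in partition order, so sort the RDD first if a strict
ordering of the lines matters.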


Re: How do you write a JavaRDD into a single file

2014-10-20 Thread Sean Owen
This was covered a few days ago:

http://apache-spark-user-list.1001560.n3.nabble.com/How-to-write-a-RDD-into-One-Local-Existing-File-td16720.html

The multiple output files are actually essential for parallelism, and
certainly not a bad idea. You don't want 100 distributed workers
writing to 1 file in 1 place, not if you want it to be fast.

RDD and JavaRDD already expose a method to iterate over the data,
called toLocalIterator. It does not require that the RDD fit entirely
in memory.

On Mon, Oct 20, 2014 at 6:13 PM, Steve Lewis lordjoe2...@gmail.com wrote:
   At the end of a set of computations I have a JavaRDD<String>. I want a
 single file where each string is printed in order. The data is small enough
 that it is acceptable to handle the printout on a single processor. It may
 be large enough that using collect to generate a list might be unacceptable.
 The saveAsTextFile command creates multiple files with names like part-00000,
 part-00001... This was bad behavior in Hadoop for final output and is also
 bad for Spark.
   A more general issue is whether it is possible to convert a JavaRDD into
 an iterator or iterable over the entire data set without using collect or
 holding all data in memory.
 In many problems where it is desirable to parallelize intermediate steps
 but use a single process for handling the final result, this could be very
 useful.




Re: How do you write a JavaRDD into a single file

2014-10-20 Thread Steve Lewis
Sorry I missed the discussion - although it did not answer the question -
in my case (and I suspect the asker's) the 100 slaves are doing a lot of
useful work, but the generated output is small enough to be handled by a
single process.
Many of the large data problems I have worked on process a lot of data but
end up with a single report file - frequently in a format specified by
preexisting downstream code.
  I do not want a separate Hadoop merge step, for a lot of reasons starting
with better control of the generation of the file.
However, toLocalIterator is exactly what I need.
Somewhat off topic - I am being overwhelmed by the volume of emails from
the list - is there a way to get a daily summary, which might be a lot
easier to keep up with?


-- 
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com


Re: How do you write a JavaRDD into a single file

2014-10-20 Thread jay vyas
sounds more like a use case for using collect... and writing out the file
in your program?
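
If the data really is small, that could be as simple as the sketch below
(rdd and path are placeholders):

import java.io.FileNotFoundException;
import java.io.PrintWriter;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;

public final class CollectWriter {
    // Materializes the entire RDD on the driver via collect, then writes
    // one file. Only safe when the collected List fits comfortably in
    // driver memory - exactly the concern raised earlier in the thread.
    public static void writeViaCollect(JavaRDD<String> rdd, String path)
            throws FileNotFoundException {
        List<String> lines = rdd.collect();
        try (PrintWriter out = new PrintWriter(path)) {
            for (String line : lines) {
                out.println(line);
            }
        }
    }
}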



-- 
jay vyas


Re: How do you write a JavaRDD into a single file

2014-10-20 Thread Ilya Ganelin
Hey Steve - the way to do this is to use the coalesce() function to
coalesce your RDD into a single partition. Then you can do a saveAsTextFile
and you'll wind up with outputDir/part-00000 containing all the data.
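
Something along these lines (rdd and the output directory are placeholders):

import org.apache.spark.api.java.JavaRDD;

public final class CoalesceWriter {
    // Coalesces to a single partition so saveAsTextFile emits one part
    // file (outputDir/part-00000). The lone partition must fit on one
    // executor, and the final stage runs with no parallelism.
    public static void saveAsSingleFile(JavaRDD<String> rdd, String outputDir) {
        rdd.coalesce(1).saveAsTextFile(outputDir);
    }
}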

-Ilya Ganelin
