Re: How do you write a JavaRDD into a single file
Collect will store the entire output in a List in memory. That is acceptable for Little Data problems, although if the entire problem fits in the memory of a single machine there is less motivation to use Spark. Most problems which benefit from Spark are large enough that even the data assigned to a single partition will not fit into memory. In my particular case the output is currently in the 0.5-4 GB range but in the future might grow to four times that size - something a single machine could write but not hold all at once. I find that for most problems a file like part-00001 is not what the next step wants to consume - the minute a further step is required to process that file, even a move and rename, there is little reason not to let the Spark code write what is wanted in the first place. I like the solution of using toLocalIterator and writing my own file.
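A minimal sketch of that approach (the class and file names below are my own, not from the thread): the writer accepts any Iterator&lt;String&gt;, so on the driver you could hand it results.toLocalIterator(), which pulls one partition at a time instead of materializing the whole RDD.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Iterator;

public class SingleFileWriter {
    // Streams lines to one local file. With a plain iterator this holds
    // one line at a time; with JavaRDD.toLocalIterator() the driver only
    // ever buffers about one partition's worth of data.
    public static void writeLines(Iterator<String> lines, String path) throws IOException {
        try (PrintWriter out = new PrintWriter(path, "UTF-8")) {
            while (lines.hasNext()) {
                out.println(lines.next());
            }
        }
    }
}
```

On the driver this would be called as SingleFileWriter.writeLines(results.toLocalIterator(), "report.txt"), assuming results is the final JavaRDD&lt;String&gt;.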
Re: How do you write a JavaRDD into a single file
This was covered a few days ago: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-write-a-RDD-into-One-Local-Existing-File-td16720.html

The multiple output files are actually essential for parallelism, and certainly not a bad idea. You don't want 100 distributed workers writing to 1 file in 1 place, not if you want it to be fast. RDD and JavaRDD already expose a method to iterate over the data, called toLocalIterator. It does not require that the RDD fit entirely in memory.

On Mon, Oct 20, 2014 at 6:13 PM, Steve Lewis lordjoe2...@gmail.com wrote: At the end of a set of computations I have a JavaRDD&lt;String&gt;. I want a single file where each string is printed in order. The data is small enough that it is acceptable to handle the printout on a single processor. It may be large enough that using collect to generate a list might be unacceptable. The saveAsTextFile command creates multiple files with names like part-00000, part-00001. This was bad behavior in Hadoop for final output and is also bad for Spark. A more general issue is whether it is possible to convert a JavaRDD into an iterator or iterable over the entire data set without using collect or holding all data in memory. In many problems where it is desirable to parallelize intermediate steps but use a single process for handling the final result, this could be very useful.
Re: How do you write a JavaRDD into a single file
Sorry I missed the discussion - although it did not answer the question. In my case (and I suspect the asker's) the 100 slaves are doing a lot of useful work, but the generated output is small enough to be handled by a single process. Many of the large data problems I have worked on process a lot of data but end up with a single report file - frequently in a format specified by preexisting downstream code. I do not want a separate Hadoop merge step, for a lot of reasons starting with better control of the generation of the file. However, toLocalIterator is exactly what I need.

Somewhat off topic - I am being overwhelmed by the number of emails from the list - is there a way to get a daily summary, which might be a lot easier to keep up with?

--
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell) Skype lordjoe_com
Re: How do you write a JavaRDD into a single file
sounds more like a use case for using collect... and writing out the file in your program?

--
jay vyas
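A hedged sketch of the collect() route (class and path names are illustrative): collect() materializes the whole RDD as a java.util.List on the driver, so this is only viable when the final output fits comfortably in driver memory.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class CollectAndWrite {
    // With Spark the rows would come from: List<String> rows = results.collect();
    // Files.write emits one line per list element.
    public static void writeAll(List<String> rows, String path) throws IOException {
        Files.write(Paths.get(path), rows);
    }
}
```

This is the simplest option when the data is genuinely small; the toLocalIterator approach discussed above trades that simplicity for a much smaller driver memory footprint.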
Re: How do you write a JavaRDD into a single file
Hey Steve - the way to do this is to use the coalesce() function to coalesce your RDD into a single partition. Then you can do a saveAsTextFile and you'll wind up with outputDir/part-00000 containing all the data.

-Ilya Ganelin
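Ilya's suggestion as a fragment (not compilable on its own - it assumes a live SparkContext and an existing JavaRDD&lt;String&gt; named results, which are my placeholders):

```java
// Merge all partitions into one, so a single task writes a single part file.
// Only sensible when one executor can handle all the data. Note that
// coalesce(1) without a shuffle can also collapse the parallelism of
// upstream stages; coalesce(1, true) or repartition(1) keeps the upstream
// work parallel at the cost of a shuffle.
JavaRDD<String> single = results.coalesce(1);
single.saveAsTextFile("outputDir"); // outputDir/part-00000 holds every row
```

This still writes Hadoop-style output (a directory with a part file and _SUCCESS marker), so a rename step may remain necessary if downstream code expects a bare file.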