It looks like you might be able to combine the output files using HDFS's getmerge command (hadoop fs -getmerge), which concatenates the files in an HDFS directory into a single local file: http://stackoverflow.com/questions/5700068/merge-output-files-after-reduce-phase
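In case it helps, here is a rough Python sketch of the two steps discussed in this thread: turning each record into one CSV line before saving, and then concatenating the resulting part files. The helper names to_csv_line and merge_parts are made up for illustration, and merge_parts works on a local directory only; it is a stand-in for what getmerge does, not the Hadoop implementation.

```python
import csv
import io
import shutil
from pathlib import Path


def to_csv_line(record):
    """Format one record (a sequence of fields) as a single CSV line.

    Hypothetical helper: no trailing newline is added, since
    saveAsTextFile() writes one record per line itself.
    """
    buf = io.StringIO()
    csv.writer(buf, lineterminator="").writerow(record)
    return buf.getvalue()


def merge_parts(part_dir, merged_path):
    """Concatenate the part-* files in part_dir, in name order, into merged_path.

    Hypothetical local stand-in for a getmerge-style merge; marker files
    such as _SUCCESS are skipped. Returns the number of parts merged.
    """
    parts = sorted(p for p in Path(part_dir).glob("part-*") if p.is_file())
    with open(merged_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
    return len(parts)
```

With PySpark, the first helper would be used roughly as myrdd.map(to_csv_line).saveAsTextFile("hdfs://..../my.csv"), assuming each record is a tuple or list of fields. The write still happens in parallel, and the merge into one file is a separate step afterwards.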

On Wed, Oct 30, 2013 at 9:16 PM, Shay Seng wrote:

> Doing a coalesce will be kind of a problem... I was hoping there would be a
> utility or command option that could concat all the files together for
> me...
>
> Thanks for the replies though!
>
> On Wed, Oct 30, 2013 at 9:07 PM, Patrick Wendell wrote:
>
>> You can do this if you coalesce the data first. However, this will
>> put all of your final data through a single reduce task (so you get
>> no parallelism and may overload a node):
>>
>> myrdd.coalesce(1).saveAsTextFile("hdfs://..../my.csv")
>>
>> Basically you have to choose: either you do the write in parallel and
>> get a lot of files, or you do the write on one node/reducer and get a
>> single file.
>>
>> - Patrick
>>
>> On Wed, Oct 30, 2013 at 8:05 PM, Shay Seng wrote:
>> > Well that almost works... when I call
>> > myrdd.saveAsTextFile("hdfs://..../my.csv")
>> >
>> > Instead of getting a single my.csv file, as I expect, my.csv is a
>> > directory with a bunch of parts - all of which are csv.
>> > Is there some way to have those files concatenated automatically?
>> >
>> > On Wed, Oct 30, 2013 at 7:13 PM, Josh Rosen wrote:
>> >>
>> >> saveAsTextFile() is implemented in terms of Hadoop's TextOutputFormat,
>> >> which writes one record per line:
>> >> https://github.com/apache/incubator-spark/blob/v0.8.0-incubating/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L816
>> >>
>> >> You could map() each entry in your RDD into a comma-separated string,
>> >> then write those strings using saveAsTextFile().
>> >>
>> >> On Wed, Oct 30, 2013 at 7:10 PM, Andre Schumacher wrote:
>> >>>
>> >>> Hi,
>> >>>
>> >>> Can you use saveAsTextFile? See
>> >>>
>> >>> http://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.RDD
>> >>>
>> >>> I'm not sure what the default field separator is (Tab probably) but if
>> >>> you don't mind that may work? No need to collect it to the master.
>> >>>
>> >>> Andre
>> >>>
>> >>> On 10/30/2013 06:34 PM, Shay Seng wrote:
>> >>> > What's the recommended way to save an RDD as a CSV on, say, HDFS?
>> >>> > Do I have to collect the RDD and save it from the master, or is
>> >>> > there some way I can write out the CSV file in parallel to HDFS?
>> >>> >
>> >>> > tks
>> >>> > shay
