Re: Save RDDs as CSV

Patrick Wendell Wed, 30 Oct 2013 21:53:15 -0700

I don't think HDFS supports concurrent appends to a single file, so
I'm not sure if this is possible with any framework (Spark/MapReduce)
that creates new HDFS connections per reducer.
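
Since the parallel write leaves one part file per task, one workaround is to concatenate the parts after the job finishes. Hadoop's shell offers `hadoop fs -getmerge` for pulling a directory of parts into a single local file. The sketch below shows the same idea in plain Python, assuming the output directory has already been copied to a local filesystem; `merge_part_files` and its parameters are hypothetical names used only for illustration.

```python
import shutil
from pathlib import Path


def merge_part_files(output_dir, merged_path):
    """Concatenate part-* files from output_dir into merged_path.

    Assumes zero-padded part names (part-00000, part-00001, ...), so
    sorting by name keeps the original partition order. Marker files
    such as _SUCCESS are skipped because they do not match "part-*".
    Returns the number of part files merged.
    """
    parts = sorted(p for p in Path(output_dir).glob("part-*") if p.is_file())
    with open(merged_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Stream each part so large files are never held in memory.
                shutil.copyfileobj(src, out)
    return len(parts)
```

The merge itself still runs on a single machine, so it trades the same parallelism as coalesce(1), just after the job rather than during it.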


On Wed, Oct 30, 2013 at 9:16 PM, Shay Seng <[email protected]> wrote:
> Doing a coalesce will be kind of a problem... I was hoping there would be a
> utility or command option that could concatenate all the files together for
> me...
>
> Thanks for the replies though!
>
>
>
> On Wed, Oct 30, 2013 at 9:07 PM, Patrick Wendell <[email protected]> wrote:
>>
>> You can do this if you coalesce the data first. However, this will
>> put all of your final data through a single reduce task (so you get
>> no parallelism and may overload a node):
>>
>> myrdd.coalesce(1).saveAsTextFile("hdfs://..../my.csv")
>>
>> Basically you have to choose: either you do the write in parallel and
>> get a lot of files, or you do the write on one node/reducer and get a
>> single file.
>>
>> - Patrick
>>
>> On Wed, Oct 30, 2013 at 8:05 PM, Shay Seng <[email protected]> wrote:
>> > Well that almost works... when I call
>> > myrdd.saveAsTextFile("hdfs://..../my.csv")
>> >
>> > Instead of getting a single my.csv file, as I expected, my.csv is a
>> > directory
>> > with a bunch of part files, all of which are CSV.
>> > Is there some way to have those files concatenated automatically?
>> >
>> >
>> >
>> >
>> > On Wed, Oct 30, 2013 at 7:13 PM, Josh Rosen <[email protected]>
>> > wrote:
>> >>
>> >> saveAsTextFile() is implemented in terms of Hadoop's TextOutputFormat,
>> >> which writes one record per line:
>> >>
>> >> https://github.com/apache/incubator-spark/blob/v0.8.0-incubating/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L816
>> >>
>> >> You could map() each entry in your RDD into a comma-separated string,
>> >> then
>> >> write those strings using saveAsTextFile().
>> >>
>> >>
>> >>
>> >>
>> >> On Wed, Oct 30, 2013 at 7:10 PM, Andre Schumacher
>> >> <[email protected]> wrote:
>> >>>
>> >>>
>> >>> Hi,
>> >>>
>> >>> Can you use saveAsTextFile? See
>> >>>
>> >>>
>> >>>
>> >>> http://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.RDD
>> >>>
>> >>> I'm not sure what the default field separator is (Tab probably) but if
>> >>> you don't mind that may work? No need to collect it to the master.
>> >>>
>> >>> Andre
>> >>>
>> >>> On 10/30/2013 06:34 PM, Shay Seng wrote:
>> >>> > What's the recommended way to save an RDD as a CSV on, say, HDFS?
>> >>> > Do I have to collect the RDD and save it from the master, or is
>> >>> > there
>> >>> > some way I can write out the CSV file in parallel to HDFS?
>> >>> >
>> >>> >
>> >>> > tks
>> >>> > shay
>> >>> >
>> >>>
>> >>
>> >
>
>
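
The suggestion above to map() each entry into a comma-separated string before calling saveAsTextFile() could look roughly like the following. This is a minimal sketch in Python using only the standard csv module; `to_csv_line` is a hypothetical helper, and the Spark calls in the trailing comment assume an RDD named myrdd as in the thread. Using the csv module rather than a plain ",".join() means fields containing commas or quotes are escaped.

```python
import csv
import io


def to_csv_line(fields):
    """Format one record (a sequence of values) as a single CSV line.

    The trailing line terminator is stripped because saveAsTextFile()
    writes one record per line and adds the newline itself.
    Note: fields with embedded newlines would still span several lines.
    """
    buf = io.StringIO()
    csv.writer(buf).writerow(fields)
    return buf.getvalue().rstrip("\r\n")


# Assumed usage with an RDD of tuples or lists (not run here):
#   myrdd.map(to_csv_line).saveAsTextFile("hdfs://..../my.csv")
```

For example, to_csv_line(["a", 1, "x,y"]) yields a,1,"x,y", with the third field quoted so the embedded comma is not read as a separator.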
