Thanks Nicholas, that's a bit of a shame; it's not very practical for log roll-up, 
for example, when every output needs to be in its own "directory". 
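One workaround might be to write to a temporary prefix and then move the single 
part file to the flat key with the Hadoop FileSystem API. This is just an untested 
sketch; the bucket and path names are placeholders, and it assumes coalesce(1) 
left exactly one part-00000 file:

    import java.net.URI
    import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

    // Placeholder paths for illustration only.
    val tmpDir = "s3n://bucket/tmp/concatted"   // directory Spark writes into
    val target = "s3n://bucket/concatted.csv"   // the flat key we actually want

    sortedMap.values.map(_.mkString(",")).coalesce(1).saveAsTextFile(tmpDir)

    // Move the lone part file to the flat key, deleting the source copy.
    val fs = FileSystem.get(new URI(tmpDir), sc.hadoopConfiguration)
    FileUtil.copy(fs, new Path(tmpDir + "/part-00000"),
                  fs, new Path(target),
                  true, sc.hadoopConfiguration)
    // Clean up whatever else Spark left in the temp directory (e.g. _SUCCESS).
    fs.delete(new Path(tmpDir), true)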
On Wednesday, April 30, 2014 12:15 PM, Nicholas Chammas 
<nicholas.cham...@gmail.com> wrote:
 
Yes, saveAsTextFile() will give you 1 part per RDD partition. When you 
coalesce(1), you move everything in the RDD to a single partition, which then 
gives you 1 output file. 
It will still be called part-00000 or something like that because that’s 
defined by the Hadoop API that Spark uses for reading from/writing to S3. I 
don’t know of a way to change that.
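
In code, that pattern looks roughly like this (a minimal sketch; rdd and the 
bucket path are placeholders):

    // Collapse the RDD into a single partition so the save emits one part file.
    // Everything is funneled through one task, so this only makes sense when
    // the output comfortably fits on a single worker.
    rdd.coalesce(1).saveAsTextFile("s3n://bucket/single-output")
    // Result: s3n://bucket/single-output/part-00000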



On Wed, Apr 30, 2014 at 2:47 PM, Peter <thenephili...@yahoo.com> wrote:

Ah, looks like RDD.coalesce(1) solves one part of the problem.
>On Wednesday, April 30, 2014 11:15 AM, Peter <thenephili...@yahoo.com> wrote:
> 
>Hi
>
>
>Playing around with Spark & S3, I'm opening multiple objects (CSV files) with:
>
>
>    val hfile = sc.textFile("s3n://bucket/2014-04-28/")
>
>
>so hfile is an RDD representing the 10 objects that were "underneath" 2014-04-28. 
>After I've sorted and otherwise transformed the content, I'm trying to write 
>it back to a single object:
>
>
>    sortedMap.values.map(_.mkString(",")).saveAsTextFile("s3n://bucket/concatted.csv")
>
>
>Unfortunately this results in a "folder" named concatted.csv with 10 objects 
>underneath, part-00000 .. part-00009, corresponding to the 10 original objects 
>loaded. 
>
>
>How can I achieve the desired behaviour of writing a single object named 
>concatted.csv?
>
>
>I've tried 0.9.1 and 1.0.0-RC3. 
>
>
>Thanks!
>Peter
