Maybe we should provide an API like saveTextFilesByKey(path).
Could you create a JIRA for it?

There is one in DPark [1] actually.

[1] https://github.com/douban/dpark/blob/master/dpark/rdd.py#L309
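
Until something like that exists, here is a minimal sketch of what such a
helper could look like in pure PySpark. This is not an existing API: the
name, the per-key path layout, and the assumption of string keys are all
mine. It filters the RDD once per distinct key, so the data is scanned
once per key (caching keeps it from being recomputed from source each
time):

    def save_text_files_by_key(rdd, prefix):
        # Hypothetical helper, not a Spark API: writes one output
        # directory per distinct key under `prefix`.
        rdd.cache()  # avoid recomputing the lineage once per key
        for key in rdd.keys().distinct().collect():
            (rdd.filter(lambda kv, k=key: kv[0] == k)
                .values()
                # lowercase to match the layout in Nick's example
                .saveAsTextFile('{0}/{1}'.format(prefix, key.lower())))

With Nick's example RDD below, save_text_files_by_key(a, '/path/prefix')
would produce /path/prefix/n, /path/prefix/b and /path/prefix/f, each
containing its own part-* files.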

On Mon, Sep 15, 2014 at 7:08 AM, Nicholas Chammas
<nicholas.cham...@gmail.com> wrote:
> Any tips from anybody on how to do this in PySpark? (Or regular Spark, for
> that matter.)
>
> On Sat, Sep 13, 2014 at 1:25 PM, Nick Chammas <nicholas.cham...@gmail.com>
> wrote:
>>
>> Howdy doody Spark Users,
>>
>> I’d like to somehow write out a single RDD to multiple paths in one go.
>> Here’s an example.
>>
>> I have an RDD of (key, value) pairs like this:
>>
>> >>> a = sc.parallelize(['Nick', 'Nancy', 'Bob', 'Ben', 'Frankie']).keyBy(lambda x: x[0])
>> >>> a.collect()
>> [('N', 'Nick'), ('N', 'Nancy'), ('B', 'Bob'), ('B', 'Ben'), ('F', 'Frankie')]
>>
>> Now I want to write the RDD out to different paths depending on the keys,
>> so that I have one output directory per distinct key. Each output directory
>> could potentially contain multiple part-* files.
>>
>> So my output would be something like:
>>
>> /path/prefix/n [/part-1, /part-2, etc]
>> /path/prefix/b [/part-1, /part-2, etc]
>> /path/prefix/f [/part-1, /part-2, etc]
>>
>> How would you do that?
>>
>> I suspect I need to use saveAsNewAPIHadoopFile or saveAsHadoopFile along
>> with the MultipleTextOutputFormat output format class, but I’m not sure how.
>>
>> By the way, there is a very similar question to this on Stack
>> Overflow.
>>
>> Nick
>>
>>
>
>
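
As for the MultipleTextOutputFormat route Nick suspects: PySpark's
rdd.saveAsHadoopFile() can take an output format class by name, but
splitting files by key requires a small JVM-side subclass of
org.apache.hadoop.mapred.lib.MultipleTextOutputFormat that overrides
generateFileNameForKeyValue() to return something like key + "/" + name;
there is no pure-Python way to write that subclass. Assuming such a class
is compiled into a jar on the executors' classpath (the class name below
is hypothetical), the call would look roughly like this:

    a.saveAsHadoopFile(
        '/path/prefix',
        'com.example.KeyBasedOutput',  # hypothetical MultipleTextOutputFormat subclass
        keyClass='org.apache.hadoop.io.Text',
        valueClass='org.apache.hadoop.io.Text')

Note it has to be saveAsHadoopFile rather than saveAsNewAPIHadoopFile,
since MultipleTextOutputFormat belongs to the old mapred API.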
