Maybe we should provide an API like saveTextFilesByKey(path). Could you create a JIRA for it?
There is one in DPark [1] actually.

[1] https://github.com/douban/dpark/blob/master/dpark/rdd.py#L309

On Mon, Sep 15, 2014 at 7:08 AM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
> Any tips from anybody on how to do this in PySpark? (Or regular Spark, for
> that matter.)
>
> On Sat, Sep 13, 2014 at 1:25 PM, Nick Chammas <nicholas.cham...@gmail.com> wrote:
>>
>> Howdy doody Spark Users,
>>
>> I'd like to somehow write out a single RDD to multiple paths in one go.
>> Here's an example.
>>
>> I have an RDD of (key, value) pairs like this:
>>
>> >>> a = sc.parallelize(['Nick', 'Nancy', 'Bob', 'Ben', 'Frankie']).keyBy(lambda x: x[0])
>> >>> a.collect()
>> [('N', 'Nick'), ('N', 'Nancy'), ('B', 'Bob'), ('B', 'Ben'), ('F', 'Frankie')]
>>
>> Now I want to write the RDD out to different paths depending on the keys,
>> so that I have one output directory per distinct key. Each output directory
>> could potentially have multiple part- files or whatever.
>>
>> So my output would be something like:
>>
>> /path/prefix/n [/part-1, /part-2, etc]
>> /path/prefix/b [/part-1, /part-2, etc]
>> /path/prefix/f [/part-1, /part-2, etc]
>>
>> How would you do that?
>>
>> I suspect I need to use saveAsNewAPIHadoopFile or saveAsHadoopFile along
>> with the MultipleTextOutputFormat output format class, but I'm not sure how.
>>
>> By the way, there is a very similar question to this here on Stack
>> Overflow.
>>
>> Nick
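For reference, the per-key layout the thread describes can be sketched in plain Python, without Spark. The helper name `save_text_files_by_key` and the `part-00000` file naming are illustrative assumptions, not part of any Spark or DPark API; in real Spark this grouping and writing would be handled by `saveAsHadoopFile` with `MultipleTextOutputFormat`:

```python
import os
import tempfile
from collections import defaultdict

def save_text_files_by_key(pairs, prefix):
    """Write each (key, value) pair into a directory named after its key,
    mimicking the /path/prefix/<key>/part-N layout from the thread.
    (Hypothetical helper, modeled on DPark's saveTextFilesByKey.)"""
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[key.lower()].append(value)
    for key, values in buckets.items():
        out_dir = os.path.join(prefix, key)
        os.makedirs(out_dir, exist_ok=True)
        # One part file per key here; Spark would emit one per partition.
        with open(os.path.join(out_dir, "part-00000"), "w") as f:
            f.write("\n".join(values) + "\n")

pairs = [('N', 'Nick'), ('N', 'Nancy'), ('B', 'Bob'),
         ('B', 'Ben'), ('F', 'Frankie')]
prefix = tempfile.mkdtemp()
save_text_files_by_key(pairs, prefix)
print(sorted(os.listdir(prefix)))  # ['b', 'f', 'n']
```

The key design point is that the grouping happens before any file is opened, so each output directory is written exactly once; a distributed implementation has to do the equivalent per partition, which is why MultipleTextOutputFormat (which routes each record to a file derived from its key at write time) is the usual answer in Spark.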