Davies,

That’s pretty neat. I heard there was a pure Python clone of Spark out
there—so you were one of the people behind it!

I’ve created a JIRA issue for this: SPARK-3533: Add saveAsTextFileByKey()
method to RDDs <https://issues.apache.org/jira/browse/SPARK-3533>

Sean,

I think you might be able to get this working with a subclass of
MultipleTextOutputFormat, which overrides generateFileNameForKeyValue,
generateActualKey, etc. A bit of work for sure, but it should do the job.
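
Roughly along these lines, I'd imagine (a minimal, untested sketch: the class
name is made up, and I haven't double-checked the exact signatures):

    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

    // Write each record into a subdirectory named after its key.
    class RDDMultipleTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {

      // Suppress the key in the output file, so each line is just the value.
      override def generateActualKey(key: Any, value: Any): Any =
        NullWritable.get()

      // Send each key's records to <path>/<key>/, keeping the default
      // leaf file names (part-00000, part-00001, ...).
      override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
        key.toString + "/" + name
    }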

I’m looking at how to make this work in PySpark as of 1.1.0. The closest
examples I can find of using the saveAsHadoop...() methods in this way are
these two: HBase Output Format
<https://github.com/apache/spark/blob/cc14644460872efb344e8d895859d70213a40840/examples/src/main/python/hbase_outputformat.py#L60>
and Avro Input Format
<https://github.com/apache/spark/blob/cc14644460872efb344e8d895859d70213a40840/examples/src/main/python/avro_inputformat.py#L73>

Basically, I’m thinking I need to subclass MultipleTextOutputFormat and
override some methods in a Scala file, and then reference that from Python?
Like how the AvroWrapperToJavaConverter class is done? Seems pretty
involved, but I’ll give it a shot if that’s the right direction to go in.
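
For reference, I'd expect the Scala-side usage of such a subclass to look
something like this (again untested; it assumes an RDDMultipleTextOutputFormat
class like the sketch above is compiled and on the classpath):

    // From a spark-shell session; each distinct key should end up with its
    // own subdirectory under /path/prefix.
    val pairs = sc.parallelize(Seq("Nick", "Nancy", "Bob", "Ben", "Frankie"))
      .keyBy(_.take(1))

    pairs.saveAsHadoopFile(
      "/path/prefix",
      classOf[String],
      classOf[String],
      classOf[RDDMultipleTextOutputFormat])

The part I'm unsure about is how to point PySpark's saveAsHadoop...() methods
at that class once it's built.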

Nick

On Mon, Sep 15, 2014 at 1:08 PM, Davies Liu <dav...@databricks.com> wrote:

> Maybe we should provide an API like saveTextFilesByKey(path).
> Could you create a JIRA for it?
>
> There is one in DPark [1] actually.
>
> [1] https://github.com/douban/dpark/blob/master/dpark/rdd.py#L309
>
> On Mon, Sep 15, 2014 at 7:08 AM, Nicholas Chammas
> <nicholas.cham...@gmail.com> wrote:
> > Any tips from anybody on how to do this in PySpark? (Or regular Spark,
> > for that matter.)
> >
> > On Sat, Sep 13, 2014 at 1:25 PM, Nick Chammas <nicholas.cham...@gmail.com>
> > wrote:
> >>
> >> Howdy doody Spark Users,
> >>
> >> I’d like to somehow write out a single RDD to multiple paths in one go.
> >> Here’s an example.
> >>
> >> I have an RDD of (key, value) pairs like this:
> >>
> >> >>> a = sc.parallelize(['Nick', 'Nancy', 'Bob', 'Ben', 'Frankie']).keyBy(lambda x: x[0])
> >> >>> a.collect()
> >> [('N', 'Nick'), ('N', 'Nancy'), ('B', 'Bob'), ('B', 'Ben'), ('F', 'Frankie')]
> >>
> >> Now I want to write the RDD out to different paths depending on the
> >> keys, so that I have one output directory per distinct key. Each output
> >> directory could potentially have multiple part- files or whatever.
> >>
> >> So my output would be something like:
> >>
> >> /path/prefix/n [/part-1, /part-2, etc]
> >> /path/prefix/b [/part-1, /part-2, etc]
> >> /path/prefix/f [/part-1, /part-2, etc]
> >>
> >> How would you do that?
> >>
> >> I suspect I need to use saveAsNewAPIHadoopFile or saveAsHadoopFile along
> >> with the MultipleTextOutputFormat output format class, but I’m not sure
> >> how.
> >>
> >> By the way, there is a very similar question to this here on Stack
> >> Overflow.
> >>
> >> Nick
> >>
> >>
> >
> >
>
