Any tips from anybody on how to do this in PySpark? (Or regular Spark, for that matter.)
On Sat, Sep 13, 2014 at 1:25 PM, Nick Chammas <nicholas.cham...@gmail.com> wrote:

> Howdy doody Spark Users,
>
> I’d like to somehow write out a single RDD to multiple paths in one go.
> Here’s an example.
>
> I have an RDD of (key, value) pairs like this:
>
> >>> a = sc.parallelize(['Nick', 'Nancy', 'Bob', 'Ben', 'Frankie']).keyBy(lambda x: x[0])
> >>> a.collect()
> [('N', 'Nick'), ('N', 'Nancy'), ('B', 'Bob'), ('B', 'Ben'), ('F', 'Frankie')]
>
> Now I want to write the RDD out to different paths depending on the keys,
> so that I have one output directory per distinct key. Each output directory
> could potentially have multiple part- files or whatever.
>
> So my output would be something like:
>
> /path/prefix/n [/part-1, /part-2, etc]
> /path/prefix/b [/part-1, /part-2, etc]
> /path/prefix/f [/part-1, /part-2, etc]
>
> How would you do that?
>
> I suspect I need to use saveAsNewAPIHadoopFile
> <http://spark.apache.org/docs/latest/api/python/pyspark.rdd.RDD-class.html#saveAsNewAPIHadoopFile>
> or saveAsHadoopFile
> <http://spark.apache.org/docs/latest/api/python/pyspark.rdd.RDD-class.html#saveAsHadoopFile>
> along with the MultipleTextOutputFormat output format class, but I’m not
> sure how.
>
> By the way, there is a very similar question to this here on Stack Overflow
> <http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job>.
>
> Nick
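
The only workable approach I’ve come up with so far in PySpark is the brute-force one: collect the distinct keys on the driver and save one filtered copy of the RDD per key. A rough sketch (assuming the keyed RDD a from the example above, and that the number of distinct keys is small, since this rescans the RDD once per key; the lowercasing is just to match the directory names I showed):

a.cache()  # avoid recomputing the whole lineage on every pass

for k in a.keys().distinct().collect():
    # one filter-and-save pass per distinct key, written to /path/prefix/<key>
    a.filter(lambda kv, k=k: kv[0] == k) \
     .values() \
     .saveAsTextFile('/path/prefix/' + k.lower())

That clearly isn’t “in one go”, though, which is why I’m hoping there’s a way to drive MultipleTextOutputFormat through saveAsHadoopFile instead. As far as I can tell, that class wants generateFileNameForKeyValue overridden in a small Java/Scala subclass, and I don’t see how to do that from PySpark directly.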