Any tips from anybody on how to do this in PySpark? (Or regular Spark, for that matter.)
On Sat, Sep 13, 2014 at 1:25 PM, Nick Chammas <nicholas.cham...@gmail.com> wrote:

> Howdy doody Spark Users,
>
> I’d like to somehow write out a single RDD to multiple paths in one go.
> Here’s an example.
>
> I have an RDD of (key, value) pairs like this:
>
> >>> a = sc.parallelize(['Nick', 'Nancy', 'Bob', 'Ben', 'Frankie']).keyBy(lambda x: x[0])
> >>> a.collect()
> [('N', 'Nick'), ('N', 'Nancy'), ('B', 'Bob'), ('B', 'Ben'), ('F', 'Frankie')]
>
> Now I want to write the RDD out to different paths depending on the keys,
> so that I have one output directory per distinct key. Each output directory
> could potentially have multiple part- files or whatever.
>
> So my output would be something like:
>
> /path/prefix/n [/part-1, /part-2, etc]
> /path/prefix/b [/part-1, /part-2, etc]
> /path/prefix/f [/part-1, /part-2, etc]
>
> How would you do that?
>
> I suspect I need to use saveAsNewAPIHadoopFile
> <http://spark.apache.org/docs/latest/api/python/pyspark.rdd.RDD-class.html#saveAsNewAPIHadoopFile>
> or saveAsHadoopFile
> <http://spark.apache.org/docs/latest/api/python/pyspark.rdd.RDD-class.html#saveAsHadoopFile>
> along with the MultipleTextOutputFormat output format class, but I’m not
> sure how.
>
> By the way, there is a very similar question to this here on Stack Overflow
> <http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job>.
>
> Nick
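
The only workable approach I’ve come up with so far in PySpark is the brute-force one: collect the distinct keys on the driver and save one filtered copy of the RDD per key. A rough sketch (assuming the keyed RDD a from the example above, and that the number of distinct keys is small, since this rescans the RDD once per key; the lowercasing is just to match the directory names I showed):

a.cache()  # avoid recomputing the whole lineage on every pass

for k in a.keys().distinct().collect():
    # one filter-and-save pass per distinct key, written to /path/prefix/<key>
    a.filter(lambda kv, k=k: kv[0] == k) \
     .values() \
     .saveAsTextFile('/path/prefix/' + k.lower())

That clearly isn’t “in one go”, though, which is why I’m hoping there’s a way to drive MultipleTextOutputFormat through saveAsHadoopFile instead. As far as I can tell, that class wants generateFileNameForKeyValue overridden in a small Java/Scala subclass, and I don’t see how to do that from PySpark directly.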