I ended up post-processing the result in Hive with a dynamic partition insert query to get a table partitioned by the key.
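For reference, a dynamic partition insert of this kind can be sketched as follows. This is illustrative only: the table names (raw_events, events_partitioned) and column names are hypothetical, and it assumes the Spark 1.2-era HiveContext API.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext("local[*]", "dynamic-partition-example")
val hiveContext = new HiveContext(sc)

// Hive requires dynamic partitioning to be enabled before the insert.
hiveContext.sql("SET hive.exec.dynamic.partition = true")
hiveContext.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

// The partition column must come last in the SELECT; Hive routes each
// row into the partition matching its directory_name value.
hiveContext.sql("""
  INSERT OVERWRITE TABLE events_partitioned PARTITION (directory_name)
  SELECT val0, val1, directory_name FROM raw_events
""")
```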
Looking further, it seems that 'dynamic partition' insert is in Spark SQL and working well in Spark SQL versions > 1.2.0: https://issues.apache.org/jira/browse/SPARK-3007

On Tue, Apr 21, 2015 at 5:45 PM, Arun Luthra <arun.lut...@gmail.com> wrote:
> Is there an efficient way to save an RDD with saveAsTextFile in such a way
> that the data gets shuffled into separate directories according to a key?
> (My end goal is to wrap the result in a multi-partitioned Hive table.)
>
> Suppose you have:
>
> case class MyData(val0: Int, val1: String, directory_name: String)
>
> and an RDD called myrdd of type RDD[MyData]. Suppose that you already
> have an array of the distinct directory_names, called distinct_directories.
>
> A very inefficient way to do this is:
>
> distinct_directories.foreach(
>   dir_name => myrdd.filter( mydata => mydata.directory_name == dir_name )
>     .map( mydata => Array(mydata.val0.toString, mydata.val1).mkString(",") )
>     .coalesce(5)
>     .saveAsTextFile("base_dir_name/" + f"$dir_name")
> )
>
> I tried this solution, and it does not run the multiple myrdd.filter's in
> parallel.
>
> I'm guessing partitionBy might be part of the efficient solution, if an easy
> efficient solution exists...
>
> Thanks,
> Arun
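A common single-pass alternative to the per-key filter loop, without going through Hive at all, is to key the RDD by directory_name and use Hadoop's MultipleTextOutputFormat so that each record is written under a subdirectory named after its key. A hedged sketch, assuming the MyData case class from the quoted mail (the output-format class name and base path are illustrative):

```scala
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

class KeyedOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  // Route each (key, value) pair to <base_dir_name>/<key>/part-NNNNN.
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.asInstanceOf[String] + "/" + name

  // Suppress the key in the output lines so only the CSV value is written.
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()
}

myrdd
  .map(d => (d.directory_name, Array(d.val0.toString, d.val1).mkString(",")))
  .saveAsHadoopFile("base_dir_name",
    classOf[String], classOf[String], classOf[KeyedOutputFormat])
```

This writes all directories in one job instead of launching one filter-and-save job per key; the trade-off is that records for every key are interleaved across the same set of output tasks, so per-directory file counts are not controlled the way coalesce(5) controlled them in the loop version.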