I ended up post-processing the result in Hive with a dynamic partition
insert query to get a table partitioned by the key.

Looking further, it seems that dynamic partition inserts are supported in
Spark SQL and work well in Spark SQL versions > 1.2.0:
https://issues.apache.org/jira/browse/SPARK-3007
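
For reference, a minimal sketch of driving such an insert from Spark,
assuming a HiveContext and hypothetical table names ('staging' for the raw
rows, 'mydata_partitioned' for the target table partitioned by
directory_name):

import org.apache.spark.sql.hive.HiveContext

val hc = new HiveContext(sc)  // sc is an existing SparkContext

// Dynamic partitioning must be enabled before the insert.
hc.sql("SET hive.exec.dynamic.partition = true")
hc.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

// The dynamic partition column goes last in the SELECT list.
hc.sql("INSERT OVERWRITE TABLE mydata_partitioned PARTITION (directory_name) " +
  "SELECT val0, val1, directory_name FROM staging")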

On Tue, Apr 21, 2015 at 5:45 PM, Arun Luthra <arun.lut...@gmail.com> wrote:

> Is there an efficient way to save an RDD with saveAsTextFile in such a way
> that the data gets shuffled into separated directories according to a key?
> (My end goal is to wrap the result in a multi-partitioned Hive table)
>
> Suppose you have:
>
> case class MyData(val0: Int, val1: String, directory_name: String)
>
> and an RDD called myrdd with type RDD[MyData]. Suppose that you already
> have an array of the distinct directory_name's, called distinct_directories.
>
> A very inefficient way to do this is:
>
> distinct_directories.foreach(
>   dir_name => myrdd.filter( mydata => mydata.directory_name == dir_name )
>     .map( mydata => Array(mydata.val0.toString, mydata.val1).mkString(","))
>     .coalesce(5)
>     .saveAsTextFile(s"base_dir_name/$dir_name")
> )
>
> I tried this solution, and it does not run the multiple myrdd.filter
> passes in parallel.
>
> I'm guessing partitionBy might be part of an efficient solution, if an
> easy efficient solution exists...
>
> Thanks,
> Arun
>
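
Regarding the partitionBy guess in the quoted mail: one way to get per-key
output directories in a single job is to shuffle by key and save with
Hadoop's MultipleTextOutputFormat, which routes each record to a file path
derived from its key. A minimal sketch (the partition count of 20 is an
arbitrary placeholder):

import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
import org.apache.spark.HashPartitioner

// Writes each (key, value) pair under a subdirectory named after the key,
// and omits the key itself from the output lines.
class KeyBasedOutput extends MultipleTextOutputFormat[Any, Any] {
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.asInstanceOf[String] + "/" + name
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()
}

myrdd
  .map(d => (d.directory_name, Array(d.val0.toString, d.val1).mkString(",")))
  .partitionBy(new HashPartitioner(20))
  .saveAsHadoopFile("base_dir_name", classOf[String], classOf[String],
    classOf[KeyBasedOutput])

This does a single shuffle instead of one filter pass per directory; each
task then writes its records straight into the right subdirectory of
base_dir_name.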
