I'm looking at the ShuffledRDD code, and it looks like there is a method setKeyOrdering(). Is this guaranteed to order everything in the partition? I'm on Spark 1.2.0.
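
For context, here is a minimal sketch of the kind of job this question is about. It assumes repartitionAndSortWithinPartitions() on pair RDDs (present in 1.2.0), which from my reading of the source builds a ShuffledRDD and calls setKeyOrdering() on it. GroupPartitioner, the (group, key) composite key, the sample records, and the local master are all hypothetical and only there for illustration.

```scala
import org.apache.spark.{Partitioner, SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // implicit pair/ordered RDD functions on Spark 1.2

// Hypothetical partitioner: routes a (group, key) composite key by its group only,
// so every record of a group lands in the same partition.
class GroupPartitioner(groups: Seq[String]) extends Partitioner {
  private val indexOf: Map[String, Int] = groups.zipWithIndex.toMap
  override def numPartitions: Int = groups.size
  override def getPartition(key: Any): Int = key match {
    case (group: String, _) => indexOf(group)
  }
}

object SortWithinPartitionsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("sort-within-partitions-sketch")
      .setMaster("local[2]") // assumption: local run, for illustration only
    val sc = new SparkContext(conf)
    try {
      // Made-up sample data keyed by (group, key).
      val records = sc.parallelize(Seq(
        (("groupB", "k2"), "v3"),
        (("groupA", "k9"), "v1"),
        (("groupA", "k1"), "v2"),
        (("groupB", "k1"), "v4")
      ))

      // Shuffle by group and ask the shuffle to order keys within each partition,
      // using the implicit tuple Ordering on (String, String).
      val sorted = records.repartitionAndSortWithinPartitions(
        new GroupPartitioner(Seq("groupA", "groupB")))

      // Print each partition's records in the order its iterator yields them.
      sorted
        .mapPartitionsWithIndex { (i, it) => it.map(kv => s"partition $i: $kv") }
        .collect()
        .foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

The sketch only shows the partition-and-sort step; writing each partition to its own folder in HDFS is not covered. Whether this ordering holds for a whole partition without the partition having to fit in memory is exactly what I'm asking about.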

On Wed, Jan 28, 2015 at 9:07 AM, Corey Nolet <cjno...@gmail.com> wrote:

> In all of the solutions I've found thus far, sorting has been done by casting
> the partition iterator into an array and sorting the array. This is not
> going to work for my case, as the amount of data in each partition may not
> necessarily fit into memory. Any ideas?
>
> On Wed, Jan 28, 2015 at 1:29 AM, Corey Nolet <cjno...@gmail.com> wrote:
>
>> I wanted to update this thread for others who may be looking for a
>> solution to this as well. I found [1] and I'm going to investigate whether
>> this is a viable solution.
>>
>> [1]
>> http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job
>>
>> On Wed, Jan 28, 2015 at 12:51 AM, Corey Nolet <cjno...@gmail.com> wrote:
>>
>>> I need to be able to take an input RDD[Map[String,Any]] and split it
>>> into several different RDDs based on some partitionable piece of the key
>>> (groups) and then send each partition to a separate set of files in
>>> different folders in HDFS.
>>>
>>> 1) Would running the RDD through a custom partitioner be the best way to
>>> go about this, or should I split the RDD into different RDDs and call
>>> saveAsHadoopFile() on each?
>>> 2) I need the resulting partitions sorted by key; they also need to be
>>> written to the underlying files in sorted order.
>>> 3) The number of keys in each partition will almost always be too big to
>>> fit into memory.
>>>
>>> Thanks.