I have a huge RDD[DocVector] with millions of items. I partitioned it using a 
HashPartitioner and saved it as an object file. But when I load the object file 
back into an RDD, the HashPartitioner is lost. How do I preserve the partitioning 
when loading the object file?

Here is the code:


import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

val docVectors: RDD[DocVector] = computeRdd() // expensive calculation

val partitionedDocVectors: RDD[(String, DocVector)] =
  docVectors.keyBy(d => d.id).partitionBy(new HashPartitioner(16))
partitionedDocVectors.saveAsObjectFile("c:/temp/partitionedDocVectors.obj")
// At this point the folder c:/temp/partitionedDocVectors.obj contains
// 16 parts: part-00000, part-00001, ... part-00015


// Now load the object file back
val partitionedDocVectors2: RDD[(String, DocVector)] =
  sc.objectFile("c:/temp/partitionedDocVectors.obj")
// Now partitionedDocVectors2 contains 956 partitions and has no partitioner

println(s"partitions: ${partitionedDocVectors2.partitions.size}") // prints 956
if (partitionedDocVectors2.partitioner.isEmpty) println("No partitioner") // this line is printed

So how can I preserve the partitioning of partitionedDocVectors on disk, so that 
it comes back with the same 16 partitions and the same partitioner when I load it?
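
For reference, here is a minimal sketch of one fallback: re-apply the 
HashPartitioner right after loading. It does not reuse the layout on disk (it 
shuffles the data once), but it avoids rerunning the expensive calculation and 
gives back an RDD with 16 partitions and a partitioner. The DocVector case class 
below is only a hypothetical stand-in for the real class, and ReloadPartitioned 
and load are names made up for this example.

```scala
import org.apache.spark.{HashPartitioner, SparkContext}
import org.apache.spark.rdd.RDD
// On Spark versions before 1.3, also: import org.apache.spark.SparkContext._

// Hypothetical stand-in for the real DocVector class; only `id` matters here.
case class DocVector(id: String, values: Array[Double])

object ReloadPartitioned {
  // Loads the saved object file and re-applies a HashPartitioner.
  // This costs one shuffle on load, but does not recompute the data.
  def load(sc: SparkContext,
           path: String,
           numPartitions: Int): RDD[(String, DocVector)] =
    sc.objectFile[(String, DocVector)](path)
      .partitionBy(new HashPartitioner(numPartitions))
}
```

With this, ReloadPartitioned.load(sc, "c:/temp/partitionedDocVectors.obj", 16) 
should return an RDD whose partitioner is defined. Whether the shuffle can be 
avoided entirely is still the open question.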

Ningjun
