I have a huge RDD[Document] with millions of items. I partitioned it using a HashPartitioner and saved it as an object file. But when I load the object file back into an RDD, the HashPartitioner is lost. How do I preserve the partitioning when loading the object file?
Here is the code:

    import org.apache.spark.HashPartitioner
    import org.apache.spark.rdd.RDD

    val docVectors: RDD[DocVector] = computeRdd() // expensive calculation
    val partitionedDocVectors: RDD[(String, DocVector)] =
      docVectors.keyBy(d => d.id).partitionBy(new HashPartitioner(16))
    partitionedDocVectors.saveAsObjectFile("c:/temp/partitionedDocVectors.obj")
    // At this point, I check the folder c:/temp/partitionedDocVectors.obj; it contains
    // 16 parts: "part-00000, part-00001, ... part-00015"

    // Now load the object file back
    val partitionedDocVectors2: RDD[(String, DocVector)] =
      sc.objectFile("c:/temp/partitionedDocVectors.obj")

    // Now partitionedDocVectors2 contains 956 partitions and it has no partitioner
    println(s"partitions: ${partitionedDocVectors2.partitions.size}") // prints 956
    if (partitionedDocVectors2.partitioner.isEmpty)
      println("No partitioner") // this line does get printed

So how can I preserve the partitioning of partitionedDocVectors on disk so that I can load it back?

Ningjun