Greetings! I would like to know whether the code below will read the data one partition at a time, and whether I am reinventing the wheel.
If I may explain: upstream code has managed (I hope) to save an RDD such that each partition file (e.g., part-r-00000, part-r-00001) contains exactly the data subset I would like to repackage into a file of a non-Hadoop format. So what I want to do is something like mapPartitionsWithIndex over this data (which is stored in SNAPPY-compressed sequence files). However, if I simply open the data set with sequenceFile(), it is re-partitioned and I lose the partitioning I want. My intention is that the closure passed to mapPartitionsWithIndex will open an HDFS file and write the partition's data in my desired format, one output file per input partition; I've included a rough sketch of that closure after the input-format code below.

The code below seems to work, I think. Have I missed something bad?

Thanks!
-Mike

class NonSplittingSequenceFileInputFormat[K, V]
  //extends org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat[K, V] // XXX
  extends org.apache.hadoop.mapred.SequenceFileInputFormat[K, V]
{
  // Report every file as unsplittable, so each part-r-* file becomes
  // exactly one Spark partition.
  override def isSplitable(
      //context: org.apache.hadoop.mapreduce.JobContext,
      //path: org.apache.hadoop.fs.Path) = false
      fs: org.apache.hadoop.fs.FileSystem,
      filename: org.apache.hadoop.fs.Path) = false
}

val rdd = sc.hadoopFile(outPathPhase1,
                        classOf[NonSplittingSequenceFileInputFormat[K, V]],
                        classOf[K],
                        classOf[V],
                        1)
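For concreteness, here is roughly the writer side I have in mind. It is only a sketch: encode() (serializing one record to bytes), outPathPhase2, and the ".myfmt" extension are placeholders, and the Hadoop Configuration is built inside the closure since it is not serializable.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val results = rdd.mapPartitionsWithIndex { (idx, iter) =>
  // Create the Configuration on the executor; it is not serializable.
  val fs = FileSystem.get(new Configuration())
  // One output file per input partition, named by partition index.
  val out = fs.create(new Path(f"$outPathPhase2/part-$idx%05d.myfmt"))
  try {
    iter.foreach { case (k, v) => out.write(encode(k, v)) }  // encode(): placeholder
  } finally {
    out.close()
  }
  Iterator.single(idx)  // yield something so the task has a result
}
results.count()  // an action is needed to force the writes to run

Does that look like the sane way to do this, or is there an existing mechanism I should be using instead?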