Greetings! I would like to know whether the code below will read the data one partition at a time, and whether I am reinventing the wheel.
If I may explain: upstream code has managed (I hope) to save an RDD such that each partition file (e.g., part-r-00000, part-r-00001) contains exactly the data subset I would like to repackage into a file of a non-Hadoop format. So what I want to do is something like mapPartitionsWithIndex over this data (which is stored in SNAPPY-compressed sequence files). However, if I simply open the data set with sequenceFile(), it is re-partitioned and I lose the partitioning I want. My intention is that the closure passed to mapPartitionsWithIndex will open an HDFS file and write the partition's data in my desired format, one output file per input partition; I've included a rough sketch of that closure after the input-format code below.

The code below seems to work, I think. Have I missed something bad?

Thanks!
-Mike

class NonSplittingSequenceFileInputFormat[K, V]
  //extends org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat[K, V] // XXX
  extends org.apache.hadoop.mapred.SequenceFileInputFormat[K, V]
{
  // Report every file as unsplittable, so each part-r-* file becomes
  // exactly one Spark partition.
  override def isSplitable(
      //context: org.apache.hadoop.mapreduce.JobContext,
      //path: org.apache.hadoop.fs.Path) = false
      fs: org.apache.hadoop.fs.FileSystem,
      filename: org.apache.hadoop.fs.Path) = false
}

val rdd = sc.hadoopFile(outPathPhase1,
                        classOf[NonSplittingSequenceFileInputFormat[K, V]],
                        classOf[K],
                        classOf[V],
                        1)
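For concreteness, here is roughly the writer side I have in mind. It is only a sketch: encode() (serializing one record to bytes), outPathPhase2, and the ".myfmt" extension are placeholders, and the Hadoop Configuration is built inside the closure since it is not serializable.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val results = rdd.mapPartitionsWithIndex { (idx, iter) =>
  // Create the Configuration on the executor; it is not serializable.
  val fs = FileSystem.get(new Configuration())
  // One output file per input partition, named by partition index.
  val out = fs.create(new Path(f"$outPathPhase2/part-$idx%05d.myfmt"))
  try {
    iter.foreach { case (k, v) => out.write(encode(k, v)) }  // encode(): placeholder
  } finally {
    out.close()
  }
  Iterator.single(idx)  // yield something so the task has a result
}
results.count()  // an action is needed to force the writes to run

Does that look like the sane way to do this, or is there an existing mechanism I should be using instead?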