We built a custom assembly jar with helper functions for reading our
protobuf data.
First, the serialized protobuf data should be stored as sequence files
in HDFS. You would then read the serialized data in like this:
val myDataRDD = context.sequenceFile[LongWritable, BytesWritable](HdfsPath)
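For reference, the call above assumes roughly this setup (a sketch only;
the Writable imports come from Hadoop, and the master string, app name,
and HdfsPath value are all placeholders for your own):

```scala
import org.apache.spark.SparkContext
import org.apache.hadoop.io.{LongWritable, BytesWritable}

// Placeholder master URL and app name
val context = new SparkContext("local", "protobuf-read")
// Placeholder path; point this at your actual sequence file(s)
val HdfsPath = "hdfs://namenode:9000/data/messages.seq"
val myDataRDD = context.sequenceFile[LongWritable, BytesWritable](HdfsPath)
```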
All that is left to do is map the RDD to a collection of the object type
you are deserializing to:
myDataRDD.map {
  case (_, b: BytesWritable) => MyMessage.parse(b)
}
Hope this helps.
On Tue, Oct 8, 2013 at 9:16 PM, Shay Seng <[email protected]> wrote:
>
> Hi,
>
> I would like to store some data as a seq of protobuf objects. I would of
> course need to be able to read that into an RDD and write the RDD back out
> in some binary format.
>
> First of all, is this supported natively (or through some download)?
>
> If not, are there examples on how I might write my own RDDs? I was hoping
> I would be able to accomplish this using some invocation of
> sparkContext.newAPIHadoopFile, but the comments there are just too terse.
> Are there more verbose examples out there? Either on how to write new RDD
> inputFormats, or how to make use of newAPIHadoopFile
>
> tks
> Shay
>