We ended up implementing custom Hadoop InputFormats and RecordReaders by extending FileInputFormat / RecordReader, and using sc.newAPIHadoopFile to read the file as an RDD.
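For reference, a rough sketch of that approach in Scala is below. The class names and the record layout (a 4-byte big-endian length header followed by that many payload bytes) are made up for illustration; your header format and item parsing will differ.

import org.apache.hadoop.fs.{FSDataInputStream, Path}
import org.apache.hadoop.io.{BytesWritable, LongWritable}
import org.apache.hadoop.mapreduce.{InputSplit, JobContext, RecordReader, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, FileSplit}

// Hypothetical format: each record is a 4-byte length header followed by
// that many bytes of payload.
class CustomBinaryInputFormat extends FileInputFormat[LongWritable, BytesWritable] {

  // Record boundaries are only discoverable by walking the headers,
  // so don't let Hadoop split the file; one task reads it sequentially.
  override def isSplitable(context: JobContext, file: Path): Boolean = false

  override def createRecordReader(split: InputSplit, context: TaskAttemptContext)
      : RecordReader[LongWritable, BytesWritable] = new CustomBinaryRecordReader
}

class CustomBinaryRecordReader extends RecordReader[LongWritable, BytesWritable] {
  private var in: FSDataInputStream = _
  private var fileLength: Long = 0L
  private var pos: Long = 0L
  private val key = new LongWritable()
  private val value = new BytesWritable()

  override def initialize(split: InputSplit, context: TaskAttemptContext): Unit = {
    val fileSplit = split.asInstanceOf[FileSplit]
    val path = fileSplit.getPath
    fileLength = fileSplit.getLength
    in = path.getFileSystem(context.getConfiguration).open(path)
  }

  override def nextKeyValue(): Boolean = {
    if (pos >= fileLength) return false
    key.set(pos)                     // key = byte offset of the record in the file
    val recordLength = in.readInt()  // 4-byte length header (assumed layout)
    val buf = new Array[Byte](recordLength)
    in.readFully(buf)                // payload of `recordLength` bytes
    value.set(buf, 0, recordLength)
    pos += 4 + recordLength
    true
  }

  override def getCurrentKey: LongWritable = key
  override def getCurrentValue: BytesWritable = value
  override def getProgress: Float = if (fileLength == 0) 1.0f else pos.toFloat / fileLength
  override def close(): Unit = if (in != null) in.close()
}

Then on the Spark side (path is a placeholder):

val records = sc.newAPIHadoopFile(
  "/path/to/binary/file",
  classOf[CustomBinaryInputFormat],
  classOf[LongWritable],
  classOf[BytesWritable])

// Hadoop reuses Writable instances, so copy the bytes before caching or collecting.
val payloads = records.map { case (_, bytes) => bytes.copyBytes() }

From there you can parse each byte array into your record type and register it as a table if you need Hive/SQL access.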
On Wed, Mar 9, 2016 at 9:15 AM Ruslan Dautkhanov <dautkha...@gmail.com> wrote:

> We have a huge binary file in a custom serialization format (e.g. header
> tells the length of the record, then there is a varying number of items for
> that record). This is produced by an old c++ application.
> What would be best approach to deserialize it into a Hive table or a Spark
> RDD?
> Format is known and well documented.
>
>
> --
> Ruslan Dautkhanov