We ended up implementing a custom Hadoop InputFormat and RecordReader by
extending FileInputFormat / RecordReader, and using sc.newAPIHadoopFile to
read the file as an RDD.
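
In case it helps, here is a minimal sketch of that approach in Scala. It
assumes a record layout of a 4-byte big-endian length header followed by
that many payload bytes; the class names (LengthPrefixedInputFormat,
LengthPrefixedRecordReader) and the path are made up for illustration, so
adjust the header parsing to your actual format.

import org.apache.hadoop.fs.{FSDataInputStream, Path}
import org.apache.hadoop.io.{BytesWritable, LongWritable}
import org.apache.hadoop.mapreduce.{InputSplit, JobContext, RecordReader, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, FileSplit}

// Hypothetical InputFormat for files made of [4-byte length][payload] records.
class LengthPrefixedInputFormat extends FileInputFormat[LongWritable, BytesWritable] {

  // Records are variable-length with no sync marker, so don't let Hadoop
  // split the file at arbitrary byte offsets.
  override protected def isSplitable(context: JobContext, file: Path): Boolean = false

  override def createRecordReader(split: InputSplit,
                                  context: TaskAttemptContext): RecordReader[LongWritable, BytesWritable] =
    new LengthPrefixedRecordReader
}

class LengthPrefixedRecordReader extends RecordReader[LongWritable, BytesWritable] {
  private var in: FSDataInputStream = _
  private var fileLength: Long = 0L
  private var bytesRead: Long = 0L
  private var recordIndex: Long = -1L
  private val key = new LongWritable()
  private val value = new BytesWritable()

  override def initialize(split: InputSplit, context: TaskAttemptContext): Unit = {
    val fileSplit = split.asInstanceOf[FileSplit]
    fileLength = fileSplit.getLength
    val fs = fileSplit.getPath.getFileSystem(context.getConfiguration)
    in = fs.open(fileSplit.getPath)
  }

  override def nextKeyValue(): Boolean = {
    if (bytesRead >= fileLength) return false
    // Assumed header: a 4-byte big-endian record length, then that many bytes.
    val recordLength = in.readInt()
    val payload = new Array[Byte](recordLength)
    in.readFully(payload)
    bytesRead += 4L + recordLength
    recordIndex += 1
    key.set(recordIndex)
    value.set(payload, 0, recordLength)
    true
  }

  override def getCurrentKey: LongWritable = key
  override def getCurrentValue: BytesWritable = value
  override def getProgress: Float =
    if (fileLength == 0) 1.0f else math.min(1.0f, bytesRead.toFloat / fileLength)
  override def close(): Unit = if (in != null) in.close()
}

Reading it back as an RDD of (record index, raw bytes) pairs looks roughly
like this; note that Hadoop reuses Writable instances, so copy the bytes out
before caching or collecting:

val raw = sc.newAPIHadoopFile(
  "/path/to/legacy.bin",              // placeholder path
  classOf[LengthPrefixedInputFormat],
  classOf[LongWritable],
  classOf[BytesWritable])

val records = raw.map { case (_, bytes) => bytes.copyBytes() }

Decoding each byte array into typed fields then happens in a normal map over
the RDD, after which you can convert to a DataFrame and write to Hive.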

On Wed, Mar 9, 2016 at 9:15 AM Ruslan Dautkhanov <dautkha...@gmail.com>
wrote:

> We have a huge binary file in a custom serialization format (e.g. a header
> tells the length of the record, then there is a varying number of items for
> that record). This is produced by an old C++ application.
> What would be the best approach to deserialize it into a Hive table or a
> Spark RDD?
> The format is known and well documented.
>
>
> --
> Ruslan Dautkhanov
>
