Re: binary file deserialization
We ended up implementing custom Hadoop InputFormats and RecordReaders by extending FileInputFormat / RecordReader, and using sc.newAPIHadoopFile to read it as an RDD.

On Wed, Mar 9, 2016 at 9:15 AM Ruslan Dautkhanov wrote:

> We have a huge binary file in a custom serialization format (e.g. header
> tells the length of the record, then there is a varying number of items for
> that record). This is produced by an old C++ application.
> What would be the best approach to deserialize it into a Hive table or a
> Spark RDD?
> Format is known and well documented.
>
> --
> Ruslan Dautkhanov
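For reference, the read loop inside such a RecordReader boils down to length-prefixed framing. A minimal sketch of that core logic, assuming a hypothetical layout of a 4-byte big-endian length header followed by that many payload bytes (the actual record layout will differ):

```scala
import java.io.{ByteArrayInputStream, DataInputStream, EOFException, InputStream}

// Parse a stream of length-prefixed records: each record starts with a
// 4-byte big-endian length header, followed by that many payload bytes.
// A custom RecordReader's nextKeyValue() would issue the same reads
// against the split's input stream.
def readRecords(in: InputStream): List[Array[Byte]] = {
  val data = new DataInputStream(in)
  val out  = scala.collection.mutable.ListBuffer.empty[Array[Byte]]
  try {
    while (true) {
      val len     = data.readInt()      // header: record length
      val payload = new Array[Byte](len)
      data.readFully(payload)           // the record's items
      out += payload
    }
  } catch {
    case _: EOFException => ()          // clean end of stream
  }
  out.toList
}
```

In the actual InputFormat you would also need to decide whether to override isSplitable (returning false unless you can seek to a record boundary), since a variable-length record may straddle an HDFS block boundary.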
Re: binary file deserialization
bq. there is a varying number of items for that record

If the combination of items is very large, using a case class would be tedious.

On Wed, Mar 9, 2016 at 9:57 AM, Saurabh Bajaj wrote:

> You can load that binary up as a String RDD, then map over that RDD and
> convert each row to your case class representing the data. In the map stage
> you could also map the input string into an RDD of JSON values and use the
> following function to convert it into a DF:
>
> http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets
>
> val anotherPeople = sqlContext.read.json(anotherPeopleRDD)
>
> On Wed, Mar 9, 2016 at 9:15 AM, Ruslan Dautkhanov wrote:
>
>> We have a huge binary file in a custom serialization format (e.g. header
>> tells the length of the record, then there is a varying number of items
>> for that record). This is produced by an old C++ application.
>> What would be the best approach to deserialize it into a Hive table or a
>> Spark RDD?
>> Format is known and well documented.
>>
>> --
>> Ruslan Dautkhanov
Re: binary file deserialization
You can load that binary up as a String RDD, then map over that RDD and convert each row to your case class representing the data. In the map stage you could also map the input string into an RDD of JSON values and use the following function to convert it into a DF:

http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets

val anotherPeople = sqlContext.read.json(anotherPeopleRDD)

On Wed, Mar 9, 2016 at 9:15 AM, Ruslan Dautkhanov wrote:

> We have a huge binary file in a custom serialization format (e.g. header
> tells the length of the record, then there is a varying number of items for
> that record). This is produced by an old C++ application.
> What would be the best approach to deserialize it into a Hive table or a
> Spark RDD?
> Format is known and well documented.
>
> --
> Ruslan Dautkhanov
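As a sketch of that map stage: once each binary record is decoded into fields, you can emit one JSON string per record and hand the resulting RDD[String] to sqlContext.read.json, which infers the schema. The record shape below (an id plus a variable-length item list) is invented for illustration; the real fields come from the documented format:

```scala
// Hypothetical decoded record: an id plus a varying number of items.
case class Rec(id: Int, items: Seq[Double])

// Render one record as a JSON object. An RDD[String] of such lines can be
// passed straight to sqlContext.read.json(...) to get a DataFrame.
def toJson(r: Rec): String =
  s"""{"id":${r.id},"items":[${r.items.mkString(",")}]}"""
```

The driver-side wiring would then look roughly like binaryRdd.map(decode).map(toJson) feeding sqlContext.read.json, where decode is your binary-format parser.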
binary file deserialization
We have a huge binary file in a custom serialization format (e.g. header tells the length of the record, then there is a varying number of items for that record). This is produced by an old C++ application.

What would be the best approach to deserialize it into a Hive table or a Spark RDD?

Format is known and well documented.

--
Ruslan Dautkhanov