Hi, thanks for the answer so far. However, I still think there must be an easy way. The file format I'm looking at is pretty simple. First there is a header of n bytes, which can be ignored. After that comes the data. The data consists of rows, where each row has 9 bytes: first a 1-byte int (0..255), then an 8-byte int (0…).
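Just to make the row layout concrete, this is roughly how I'd expect a single row to decode in plain Java (untested sketch; I'm assuming the 8-byte int is big-endian, which may well be wrong for this format):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class RowDecoder {
    // Untested sketch: decode one 9-byte row into its two fields.
    // Byte order is an assumption; switch to LITTLE_ENDIAN if needed.
    static long[] decodeRow(byte[] row) {
        ByteBuffer buf = ByteBuffer.wrap(row).order(ByteOrder.BIG_ENDIAN);
        int small = buf.get() & 0xFF; // the 1-byte int, read as unsigned (0..255)
        long big = buf.getLong();     // the 8-byte int
        return new long[] { small, big };
    }
}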
If I understand correctly, lazy.LazySimpleSerDe should do the SerDe part. Is that right? So if I give the schema TinyInt, BigInt, a row consisting of 9 bytes will be parsed correctly? The only thing missing would then be a proper input format. Ignoring the header, org.apache.hadoop.hive.ql.io.HiveBinaryOutputFormat would actually do the output part. Any hints on how to do the input part? (I've appended a rough sketch of what I have in mind at the bottom of this mail.)

Thanks in advance!

On 12 Dec 2014, at 17:02, Moore, Douglas <douglas.mo...@thinkbiganalytics.com> wrote:

> You want to look into ADD JAR and CREATE FUNCTION (for UDFs) and STORED AS
> 'full.class.name' for the serde.
>
> For tutorials, google for "adding custom serde"; I found one from
> Cloudera:
> http://blog.cloudera.com/blog/2012/12/how-to-use-a-serde-in-apache-hive/
>
> Depending on your numbers (rows / file, bytes / file, files per time
> interval, #containers || map slots, mem size per slot or container),
> creating a split of your file may not be necessary to obtain good
> performance.
>
> - Douglas
>
>
> On 12/12/14 2:17 AM, "Ingo Thon" <ist...@gmx.de> wrote:
>
>> Dear List,
>>
>> I want to set up a DW based on Hive. However, my data does not come as
>> handy CSV files but as binary files in a proprietary format.
>>
>> The binary file consists of
>> - 1 header of a dynamic number of bytes, whose length can be read from
>> the contents of the header. The header tells me how to parse the rows
>> and how many bytes each row has.
>> - n rows of k bytes, where k is defined within the header
>>
>> The solution I have in mind looks as follows:
>> - Write a custom InputFormat which chunks the data into blobs of length k
>> but skips the bytes of the header. So I'd have two parameters for the
>> InputFormat (bytes to skip, bytes per row).
>> Do I really have to build this myself, or does something like this
>> already exist? Worst case, I could also remove the header prior to
>> pushing the data into HDFS.
>> - Write a custom SerDe to parse the blobs. At least in theory easy.
>>
>> The coding part does not look too complicated; however, I'm kind of
>> struggling with how to compile and install such a serde. I installed
>> Hive from source and imported it into Eclipse.
>> I guess I've now to build my own project… Still, I'm a little bit lost.
>> Is there any tutorial which describes the process?
>> And also, is my general idea OK?
>>
>> thanks in advance
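P.S. In case it makes the question clearer, here is a rough, untested sketch of the kind of input format I have in mind. The property names ("binary.header.bytes", "binary.row.bytes") are placeholders I made up, and I'm using the old org.apache.hadoop.mapred API since, as far as I understand, that is what Hive expects for table input formats:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Untested sketch: fixed-length binary rows, skipping a header of known size.
public class FixedLengthRowInputFormat
        extends FileInputFormat<LongWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false; // one mapper per file keeps the header offset trivial
    }

    @Override
    public RecordReader<LongWritable, BytesWritable> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        return new FixedLengthRowReader((FileSplit) split, job);
    }

    static class FixedLengthRowReader
            implements RecordReader<LongWritable, BytesWritable> {
        private final FSDataInputStream in;
        private final int rowLength;
        private final long end;
        private long pos;

        FixedLengthRowReader(FileSplit split, Configuration conf) throws IOException {
            // "binary.header.bytes" and "binary.row.bytes" are made-up names
            int headerBytes = conf.getInt("binary.header.bytes", 0);
            rowLength = conf.getInt("binary.row.bytes", 9);
            Path file = split.getPath();
            in = file.getFileSystem(conf).open(file);
            in.seek(headerBytes);          // skip the header
            pos = headerBytes;
            end = split.getStart() + split.getLength();
        }

        public boolean next(LongWritable key, BytesWritable value) throws IOException {
            if (pos + rowLength > end) return false; // no complete row left
            byte[] row = new byte[rowLength];
            in.readFully(row);             // read exactly one fixed-size row
            key.set(pos);
            value.set(row, 0, rowLength);
            pos += rowLength;
            return true;
        }

        public LongWritable createKey() { return new LongWritable(); }
        public BytesWritable createValue() { return new BytesWritable(); }
        public long getPos() { return pos; }
        public float getProgress() { return end == 0 ? 0f : (float) pos / end; }
        public void close() throws IOException { in.close(); }
    }
}

My understanding is that I would then wire it up in the CREATE TABLE with STORED AS INPUTFORMAT 'FixedLengthRowInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveBinaryOutputFormat', but corrections welcome.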