Hi, thanks for the answer so far. However, I still think there must be an easy way. The file format I'm looking at is pretty simple. First there is a header of n bytes, which can be ignored. After that comes the data. The data consists of rows, where each row has 9 bytes: first a 1-byte int (0..255), then an 8-byte int (0…).
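Just to make the row layout concrete, this is roughly how I'd expect a single row to decode in plain Java (untested sketch; I'm assuming the 8-byte int is big-endian, which may well be wrong for this format):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class RowDecoder {
    // Untested sketch: decode one 9-byte row into its two fields.
    // Byte order is an assumption; switch to LITTLE_ENDIAN if needed.
    static long[] decodeRow(byte[] row) {
        ByteBuffer buf = ByteBuffer.wrap(row).order(ByteOrder.BIG_ENDIAN);
        int small = buf.get() & 0xFF; // the 1-byte int, read as unsigned (0..255)
        long big = buf.getLong();     // the 8-byte int
        return new long[] { small, big };
    }
}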
If I understand correctly, lazy.LazySimpleSerDe should do the SerDe part. Is that right? So if I give the schema TinyInt, BigInt, a row consisting of 9 bytes will be parsed correctly? The only thing missing would then be a proper input format. Ignoring the header, org.apache.hadoop.hive.ql.io.HiveBinaryOutputFormat would actually do the output part. Any hints on how to do the input part? (I've appended a rough sketch of what I have in mind at the bottom of this mail.)

Thanks in advance!

On 12 Dec 2014, at 17:02, Moore, Douglas <douglas.mo...@thinkbiganalytics.com> wrote:

> You want to look into ADD JAR and CREATE FUNCTION (for UDFs) and STORED AS
> 'full.class.name' for the serde.
>
> For tutorials, google for "adding custom serde"; I found one from
> Cloudera:
> http://blog.cloudera.com/blog/2012/12/how-to-use-a-serde-in-apache-hive/
>
> Depending on your numbers (rows / file, bytes / file, files per time
> interval, #containers || map slots, mem size per slot or container),
> creating a split of your file may not be necessary to obtain good
> performance.
>
> - Douglas
>
>
> On 12/12/14 2:17 AM, "Ingo Thon" <ist...@gmx.de> wrote:
>
>> Dear List,
>>
>> I want to set up a DW based on Hive. However, my data does not come as
>> handy CSV files but as binary files in a proprietary format.
>>
>> The binary file consists of
>> - 1 header of a dynamic number of bytes, whose length can be read from
>> the contents of the header. The header tells me how to parse the rows
>> and how many bytes each row has.
>> - n rows of k bytes, where k is defined within the header
>>
>> The solution I have in mind looks as follows:
>> - Write a custom InputFormat which chunks the data into blobs of length k
>> but skips the bytes of the header. So I'd have two parameters for the
>> InputFormat (bytes to skip, bytes per row).
>> Do I really have to build this myself, or does something like this
>> already exist? Worst case, I could also remove the header prior to
>> pushing the data into HDFS.
>> - Write a custom SerDe to parse the blobs. At least in theory easy.
>>
>> The coding part does not look too complicated; however, I'm kind of
>> struggling with how to compile and install such a serde. I installed
>> Hive from source and imported it into Eclipse.
>> I guess I've now to build my own project… Still, I'm a little bit lost.
>> Is there any tutorial which describes the process?
>> And also, is my general idea OK?
>>
>> thanks in advance
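P.S. In case it makes the question clearer, here is a rough, untested sketch of the kind of input format I have in mind. The property names ("binary.header.bytes", "binary.row.bytes") are placeholders I made up, and I'm using the old org.apache.hadoop.mapred API since, as far as I understand, that is what Hive expects for table input formats:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Untested sketch: fixed-length binary rows, skipping a header of known size.
public class FixedLengthRowInputFormat
        extends FileInputFormat<LongWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false; // one mapper per file keeps the header offset trivial
    }

    @Override
    public RecordReader<LongWritable, BytesWritable> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        return new FixedLengthRowReader((FileSplit) split, job);
    }

    static class FixedLengthRowReader
            implements RecordReader<LongWritable, BytesWritable> {
        private final FSDataInputStream in;
        private final int rowLength;
        private final long end;
        private long pos;

        FixedLengthRowReader(FileSplit split, Configuration conf) throws IOException {
            // "binary.header.bytes" and "binary.row.bytes" are made-up names
            int headerBytes = conf.getInt("binary.header.bytes", 0);
            rowLength = conf.getInt("binary.row.bytes", 9);
            Path file = split.getPath();
            in = file.getFileSystem(conf).open(file);
            in.seek(headerBytes);          // skip the header
            pos = headerBytes;
            end = split.getStart() + split.getLength();
        }

        public boolean next(LongWritable key, BytesWritable value) throws IOException {
            if (pos + rowLength > end) return false; // no complete row left
            byte[] row = new byte[rowLength];
            in.readFully(row);             // read exactly one fixed-size row
            key.set(pos);
            value.set(row, 0, rowLength);
            pos += rowLength;
            return true;
        }

        public LongWritable createKey() { return new LongWritable(); }
        public BytesWritable createValue() { return new BytesWritable(); }
        public long getPos() { return pos; }
        public float getProgress() { return end == 0 ? 0f : (float) pos / end; }
        public void close() throws IOException { in.close(); }
    }
}

My understanding is that I would then wire it up in the CREATE TABLE with STORED AS INPUTFORMAT 'FixedLengthRowInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveBinaryOutputFormat', but corrections welcome.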