Use Avro or Protobuf support.

On Tuesday, April 8, 2014, Petter von Dolwitz (Hem) <petter.von.dolw...@gmail.com> wrote:
> Good stuff!
>
> I am glad that I could help.
>
> Br,
> Petter
>
>
> 2014-04-04 6:02 GMT+02:00 David Quigley <dquigle...@gmail.com>:
>>
>> Thanks again Petter, the custom input format was exactly what I needed.
>> Here is an example of my code in case anyone is interested:
>> https://github.com/quicklyNotQuigley/nest
>>
>> Basically, it gives you SQL access to arbitrary JSON data. I know there are
>> solutions for dealing with JSON data in Hive fields, but nothing I saw
>> actually decomposes nested JSON into a set of discrete records. It's super
>> useful for us.
>>
>> On Wed, Apr 2, 2014 at 2:15 AM, Petter von Dolwitz (Hem) <petter.von.dolw...@gmail.com> wrote:
>>>
>>> Hi David,
>>>
>>> You can implement a custom InputFormat (extends
>>> org.apache.hadoop.mapred.FileInputFormat) accompanied by a custom
>>> RecordReader (implements org.apache.hadoop.mapred.RecordReader). The
>>> RecordReader will be used to read your documents, and from there you can
>>> decide which units to return as records (returned by the next() method).
>>> You'll still probably need a SerDe that transforms your data into Hive
>>> data types using a 1:1 mapping.
>>>
>>> This way, you only duplicate your data while the query runs (and possibly
>>> in the results), avoiding JOIN operations, while the raw files contain no
>>> duplicate data.
>>>
>>> Something like this:
>>>
>>> CREATE EXTERNAL TABLE IF NOT EXISTS MyTable (
>>>   myfield1 STRING,
>>>   myfield2 INT)
>>> PARTITIONED BY (your_partition_if_applicable STRING)
>>> ROW FORMAT SERDE 'quigley.david.myserde'
>>> STORED AS INPUTFORMAT 'quigley.david.myinputformat'
>>> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
>>> LOCATION 'mylocation';
>>>
>>> Hope this helps.
>>>
>>> Br,
>>> Petter
>>>
>>>
>>> 2014-04-02 5:45 GMT+02:00 David Quigley <dquigle...@gmail.com>:
>>>>
>>>> We are currently streaming complex documents to HDFS with the hope of
>>>> being able to query them. Each single document logically breaks down into
>>>> a set of individual records. In order to use Hive, we preprocess each
>>>> input document into a set of discrete records, which we save on HDFS and
>>>> create an external table on top of.
>>>>
>>>> This approach works, but we end up duplicating a lot of data in the
>>>> records. It would be much more efficient to deserialize a document into
>>>> its set of records when a query is made. That way, we can just save the
>>>> raw documents on HDFS.
>>>>
>>>> I have looked into writing a custom SerDe:
>>>>
>>>> Object deserialize(org.apache.hadoop.io.Writable blob)
>>>>
>>>> It looks like the input record => deserialized record still needs to be a
>>>> 1:1 relationship. Is there any way to deserialize a record into multiple
>>>> records?
>>>>
>>>> Thanks,
>>>> Dave
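
For readers finding this thread later: below is a minimal sketch of the InputFormat/RecordReader approach Petter describes, using the old org.apache.hadoop.mapred API named in the thread. The class names (MyJsonInputFormat, MyJsonRecordReader) and the flatten() helper are illustrative placeholders, not code from the thread; the actual logic for decomposing a nested JSON document into flat records would go in flatten() (see David's quicklyNotQuigley/nest repo for a complete implementation).

import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical input format: reads one JSON document per line and emits
// several flat records per document, so Hive sees N rows for one document.
public class MyJsonInputFormat extends FileInputFormat<LongWritable, Text> {

  @Override
  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    return new MyJsonRecordReader(new LineRecordReader(job, (FileSplit) split));
  }

  public static class MyJsonRecordReader implements RecordReader<LongWritable, Text> {
    private final LineRecordReader lineReader;
    private final LongWritable docKey = new LongWritable();
    private final Text docValue = new Text();
    private final Deque<String> pending = new ArrayDeque<>();
    private long recordCount = 0;

    public MyJsonRecordReader(LineRecordReader lineReader) {
      this.lineReader = lineReader;
    }

    @Override
    public boolean next(LongWritable key, Text value) throws IOException {
      // Refill the queue with the flattened records of the next document.
      while (pending.isEmpty()) {
        if (!lineReader.next(docKey, docValue)) {
          return false; // no more documents in this split
        }
        pending.addAll(flatten(docValue.toString()));
      }
      key.set(recordCount++);
      value.set(pending.poll());
      return true;
    }

    // Placeholder: decompose one nested JSON document into N flat records.
    // A real implementation would walk the nested structure (e.g. with Jackson).
    private List<String> flatten(String jsonDocument) {
      return Collections.singletonList(jsonDocument);
    }

    @Override public LongWritable createKey() { return new LongWritable(); }
    @Override public Text createValue() { return new Text(); }
    @Override public long getPos() throws IOException { return lineReader.getPos(); }
    @Override public float getProgress() throws IOException { return lineReader.getProgress(); }
    @Override public void close() throws IOException { lineReader.close(); }
  }
}

Registered via STORED AS INPUTFORMAT 'quigley.david.myinputformat' as in Petter's DDL above, a SerDe can then map each flat record to Hive columns 1:1, while the raw files keep only the original documents.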