Use Avro or Protobuf support.

On Tuesday, April 8, 2014, Petter von Dolwitz (Hem) <
petter.von.dolw...@gmail.com> wrote:
> Good stuff!
>
> I am glad that I could help.
>
> Br,
> Petter
>
>
> 2014-04-04 6:02 GMT+02:00 David Quigley <dquigle...@gmail.com>:
>>
>> Thanks again Petter, the custom input format was exactly what I needed.
>> Here is an example of my code in case anyone is interested:
>> https://github.com/quicklyNotQuigley/nest
>>
>> Basically gives you SQL access to arbitrary JSON data. I know there are
>> solutions for dealing with JSON data in Hive fields, but nothing I saw
>> actually decomposes nested JSON into a set of discrete records. It's super
>> useful for us.
>>
>> On Wed, Apr 2, 2014 at 2:15 AM, Petter von Dolwitz (Hem) <
>> petter.von.dolw...@gmail.com> wrote:
>>>
>>> Hi David,
>>>
>>> You can implement a custom InputFormat (extends
>>> org.apache.hadoop.mapred.FileInputFormat) accompanied by a custom
>>> RecordReader (implements org.apache.hadoop.mapred.RecordReader). The
>>> RecordReader will be used to read your documents, and from there you can
>>> decide which units you return as records (returned by the next() method).
>>> You'll still probably need a SerDe that transforms your data into Hive
>>> data types using a 1:1 mapping.
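>>>
>>> A rough sketch of what that pair could look like (the package and class
>>> names are just placeholders, and how you split one document into records
>>> is left as a stub for your format):
>>>
>>> package quigley.david;
>>>
>>> import java.io.IOException;
>>> import java.util.Iterator;
>>> import java.util.List;
>>> import org.apache.hadoop.fs.FSDataInputStream;
>>> import org.apache.hadoop.fs.FileSystem;
>>> import org.apache.hadoop.fs.Path;
>>> import org.apache.hadoop.io.IOUtils;
>>> import org.apache.hadoop.io.LongWritable;
>>> import org.apache.hadoop.io.Text;
>>> import org.apache.hadoop.mapred.FileInputFormat;
>>> import org.apache.hadoop.mapred.FileSplit;
>>> import org.apache.hadoop.mapred.InputSplit;
>>> import org.apache.hadoop.mapred.JobConf;
>>> import org.apache.hadoop.mapred.RecordReader;
>>> import org.apache.hadoop.mapred.Reporter;
>>>
>>> public class MyInputFormat extends FileInputFormat<LongWritable, Text> {
>>>
>>>   @Override
>>>   protected boolean isSplitable(FileSystem fs, Path file) {
>>>     return false;  // read each document as a whole
>>>   }
>>>
>>>   @Override
>>>   public RecordReader<LongWritable, Text> getRecordReader(
>>>       InputSplit split, JobConf job, Reporter reporter) throws IOException {
>>>     return new MyRecordReader((FileSplit) split, job);
>>>   }
>>>
>>>   static class MyRecordReader implements RecordReader<LongWritable, Text> {
>>>     private final Iterator<String> records;  // one entry per logical record
>>>     private long pos = 0;
>>>
>>>     MyRecordReader(FileSplit split, JobConf job) throws IOException {
>>>       FileSystem fs = split.getPath().getFileSystem(job);
>>>       byte[] buf = new byte[(int) split.getLength()];
>>>       FSDataInputStream in = fs.open(split.getPath());
>>>       try {
>>>         in.readFully(0, buf);
>>>       } finally {
>>>         IOUtils.closeStream(in);
>>>       }
>>>       // splitDocument() is a stub: decompose one document into the
>>>       // individual records that Hive should see as rows.
>>>       records = splitDocument(new String(buf, "UTF-8")).iterator();
>>>     }
>>>
>>>     public boolean next(LongWritable key, Text value) {
>>>       if (!records.hasNext()) {
>>>         return false;
>>>       }
>>>       key.set(pos++);
>>>       value.set(records.next());
>>>       return true;
>>>     }
>>>
>>>     public LongWritable createKey() { return new LongWritable(); }
>>>     public Text createValue() { return new Text(); }
>>>     public long getPos() { return pos; }
>>>     public float getProgress() { return records.hasNext() ? 0.0f : 1.0f; }
>>>     public void close() { }
>>>   }
>>>
>>>   private static List<String> splitDocument(String json) {
>>>     throw new UnsupportedOperationException("split the document here");
>>>   }
>>> }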
>>>
>>> This way, data is only duplicated while your query runs (and possibly in
>>> the results) to avoid JOIN operations, but the raw files will not contain
>>> duplicate data.
>>>
>>> Something like this:
>>>
>>> CREATE EXTERNAL TABLE IF NOT EXISTS MyTable (
>>>   myfield1 STRING,
>>>   myfield2 INT)
>>>   PARTITIONED BY (your_partition_if_applicable STRING)
>>>   ROW FORMAT SERDE 'quigley.david.myserde'
>>>   STORED AS INPUTFORMAT 'quigley.david.myinputformat'
>>>   OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
>>>   LOCATION 'mylocation';
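>>>
>>> The SerDe named in ROW FORMAT SERDE then only has to do the per-record
>>> 1:1 mapping onto Hive types. A rough sketch (class name is a placeholder
>>> and the field parsing is stubbed out):
>>>
>>> package quigley.david;
>>>
>>> import java.util.ArrayList;
>>> import java.util.Arrays;
>>> import java.util.List;
>>> import java.util.Properties;
>>> import org.apache.hadoop.conf.Configuration;
>>> import org.apache.hadoop.hive.serde2.AbstractSerDe;
>>> import org.apache.hadoop.hive.serde2.SerDeException;
>>> import org.apache.hadoop.hive.serde2.SerDeStats;
>>> import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
>>> import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
>>> import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
>>> import org.apache.hadoop.io.Text;
>>> import org.apache.hadoop.io.Writable;
>>>
>>> public class MySerDe extends AbstractSerDe {
>>>   private ObjectInspector rowOI;
>>>
>>>   @Override
>>>   public void initialize(Configuration conf, Properties tbl) throws SerDeException {
>>>     // One struct field per table column (myfield1 STRING, myfield2 INT).
>>>     List<String> names = Arrays.asList("myfield1", "myfield2");
>>>     List<ObjectInspector> ois = new ArrayList<ObjectInspector>();
>>>     ois.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
>>>     ois.add(PrimitiveObjectInspectorFactory.javaIntObjectInspector);
>>>     rowOI = ObjectInspectorFactory.getStandardStructObjectInspector(names, ois);
>>>   }
>>>
>>>   @Override
>>>   public Object deserialize(Writable blob) throws SerDeException {
>>>     // 1:1 mapping: one incoming record -> one row. The fan-out into
>>>     // multiple records already happened in the RecordReader.
>>>     String record = blob.toString();
>>>     List<Object> row = new ArrayList<Object>();
>>>     row.add(parseField1(record));  // stubbed parsing logic
>>>     row.add(parseField2(record));
>>>     return row;
>>>   }
>>>
>>>   @Override
>>>   public ObjectInspector getObjectInspector() throws SerDeException { return rowOI; }
>>>
>>>   @Override
>>>   public SerDeStats getSerDeStats() { return null; }
>>>
>>>   @Override
>>>   public Class<? extends Writable> getSerializedClass() { return Text.class; }
>>>
>>>   @Override
>>>   public Writable serialize(Object obj, ObjectInspector oi) throws SerDeException {
>>>     throw new SerDeException("write not supported for this external table");
>>>   }
>>>
>>>   private static String parseField1(String record) { return record; }  // stub
>>>   private static Integer parseField2(String record) { return 0; }      // stub
>>> }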
>>>
>>>
>>> Hope this helps.
>>>
>>> Br,
>>> Petter
>>>
>>>
>>>
>>>
>>> 2014-04-02 5:45 GMT+02:00 David Quigley <dquigle...@gmail.com>:
>>>>
>>>> We are currently streaming complex documents to HDFS with the hope of
>>>> being able to query them. Each single document logically breaks down into
>>>> a set of individual records. In order to use Hive, we preprocess each
>>>> input document into a set of discrete records, which we save on HDFS and
>>>> create an external table on top of.
>>>> This approach works, but we end up duplicating a lot of data in the
>>>> records. It would be much more efficient to deserialize the document into
>>>> a set of records when a query is made. That way, we can just save the raw
>>>> documents on HDFS.
>>>> I have looked into writing a custom SerDe:
>>>> Object deserialize(org.apache.hadoop.io.Writable blob)
>>>> It looks like the input record => deserialized record still needs to be
>>>> a 1:1 relationship. Is there any way to deserialize a record into
>>>> multiple records?
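>>>>
>>>> For reference, this is roughly the shape of the contract I am looking at
>>>> (paraphrased from org.apache.hadoop.hive.serde2.Deserializer):
>>>>
>>>> public interface Deserializer {
>>>>   void initialize(Configuration conf, Properties tbl) throws SerDeException;
>>>>   // one Writable in -> one row object out
>>>>   Object deserialize(Writable blob) throws SerDeException;
>>>>   ObjectInspector getObjectInspector() throws SerDeException;
>>>>   SerDeStats getSerDeStats();
>>>> }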
>>>> Thanks,
>>>> Dave
>>
>
>

-- 
Sorry this was sent from mobile. Will do less grammar and spell check than
usual.
