Thanks Adrian,

Sorry for the late response.

I think your second approach is the better one, but for now I have
implemented the first.

Thanks & Regards,
B Anil Kumar.


On Thu, Feb 20, 2014 at 10:22 PM, Adrian Hains <[email protected]> wrote:

> If the avro data from flume has the schema:
> {"type":"record","name":"Event","fields":[{"name":"
> headers","type":{"type":"map","values":"string"}},{"name":"
> body","type":"bytes"}]}
> then a record can only contain a headers map of strings, and a body field
> with bytes. I don't see how it could contain structured data in the body
> like you described:
> {"headers": {"timestamp": "1392825607332", "parentnode": 
> "2014021909\/1392825638009"},
> "body": {"bytes": "{"row":"000372d8","data":{"
> x1":"v1","x2":"v2","x3":"v3"},"timestamp":1392380848474}"}}
>
> Typically your flume event contains your data payload in that body field
> as a blob. So if you have a flume hdfs sink that is logging the raw flume
> event with a config of serializer=avro_event, then you would need to unpack
> the data in the body field manually in your mapreduce. If you instead want
> the hdfs sink to write your payload in your custom avro format, then I think
> you would need to configure the sink with the appropriate serializer (e.g.
> https://github.com/cloudera/cdk/blob/master/cdk-flume-avro-event-serializer/src/main/java/org/apache/flume/serialization/AvroEventSerializer.java
> )
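>
> A rough sketch of that manual unpacking (untested; it assumes the body
> bytes are UTF-8 JSON like your sample, and the class and variable names
> are just for illustration):
>
> import java.io.IOException;
> import java.nio.ByteBuffer;
> import java.nio.charset.StandardCharsets;
> import org.apache.avro.generic.GenericRecord;
> import org.apache.avro.mapred.AvroKey;
> import org.apache.hadoop.io.NullWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.Mapper;
>
> public class FlumeEventMapper
>     extends Mapper<AvroKey<GenericRecord>, NullWritable, Text, NullWritable> {
>   @Override
>   protected void map(AvroKey<GenericRecord> key, NullWritable value, Context context)
>       throws IOException, InterruptedException {
>     GenericRecord event = key.datum();
>     // "body" is declared as Avro bytes, so it decodes to a ByteBuffer
>     ByteBuffer body = (ByteBuffer) event.get("body");
>     byte[] raw = new byte[body.remaining()];
>     body.get(raw);
>     String json = new String(raw, StandardCharsets.UTF_8);
>     // from here, parse the JSON string with any JSON library to reach
>     // "row", "data" and "timestamp", then emit whatever you need
>     context.write(new Text(json), NullWritable.get());
>   }
> }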
>
> Apologies if I'm misunderstanding your problem and what you're trying to
> accomplish.
> -a
>
>
>
> On Wed, Feb 19, 2014 at 9:52 PM, AnilKumar B <[email protected]> wrote:
>
>> Hi,
>>
>> I am trying to process Avro data using MapReduce. The Avro data I receive
>> is generated by Flume with the following schema:
>>
>>
>> {"type":"record","name":"Event","fields":[{"name":"headers","type":{"type":"map","values":"string"}},{"name":"body","type":"bytes"}]}
>>
>>
>> A sample record looks like this:
>>
>> {"headers": {"timestamp": "1392825607332", "parentnode": 
>> "2014021909\/1392825638009"},
>> "body": {"bytes":
>> "{"row":"000372d8","data":{"x1":"v1","x2":"v2","x3":"v3"},"timestamp":1392380848474}"}}
>>
>> When I process this data in MapReduce, I read it in the mapper as
>> AvroKey<GenericData.Record>, NullWritable. I can see the whole message
>> via key.datum(), but I am unable to access fields such as "row", "data",
>> and "timestamp".
>>
>>
>> So how can I resolve this? Do I need to generate a specific Avro Java
>> class for the schema below and use that generated class in MapReduce, or
>> should I use GenericData.Record itself?
>>
>>
>> {
>>   "namespace": "com.test.avro",
>>   "type": "record",
>>   "name": "Event",
>>   "fields": [
>>     {"name": "row", "type": "string"},
>>     {"name": "data", "type": {"type": "map", "values": "string"}},
>>     {"name": "timestamp", "type": "string"}
>>   ]
>> }
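>>
>> For reference, the driver is currently wired up roughly like this
>> (trimmed to the Avro-related parts; EVENT_SCHEMA_JSON is just a constant
>> holding the Flume event schema, and AvroJob/AvroKeyInputFormat are from
>> org.apache.avro.mapreduce):
>>
>> Schema eventSchema = new Schema.Parser().parse(EVENT_SCHEMA_JSON);
>> Job job = Job.getInstance(conf, "process-flume-avro");
>> job.setInputFormatClass(AvroKeyInputFormat.class);
>> AvroJob.setInputKeySchema(job, eventSchema);
>> job.setMapperClass(EventMapper.class);
>> job.setOutputKeyClass(Text.class);
>> job.setOutputValueClass(NullWritable.class);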
>>
>>
>> Thanks & Regards,
>> B Anil Kumar.
>>
>
>
