RE: best way to put UDP JSON data into Hadoop

Paul Chavez Thu, 30 May 2013 13:44:52 -0700

I can't speak to the UDP transport mechanism, but we do use JSON events with 
Hive and it works quite well.


In our case we have an application that takes an internal object, serializes it 
to JSON, and puts that JSON into another object we call the 'flume envelope', 
which has a timestamp and a couple of other headers for routing. We use an 
HTTPSource to POST the JSON 'envelope' events to Flume, which never does 
anything special with the JSON 'payload'. On the sink side, after a couple of 
Avro hops, we serialize to TEXT files with the HDFS sink. Then we use a Hive 
JSON SerDe to create an external table (Flume is configured to write to 
partitions based on the timestamp). Every hour an Oozie job processes the 
previous hour's data into a 'native' Hive table and then we drop the external 
partition and data. The only catch is that each JSON event has to be on a 
single line, since the text files are read one record per line.
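
As a rough illustration of the envelope idea (not our actual code), here is a 
minimal Python sketch. It assumes the event layout that Flume's stock 
HTTPSource JSON handler accepts, a JSON array of objects with string-valued 
"headers" and a string "body". The function names and the logType/logSubType 
header values are made up for the example.

```python
import json
import time


def make_envelope(payload, log_type, log_sub_type, ts_ms=None):
    """Wrap a payload dict in a Flume-style event (hypothetical helper)."""
    if ts_ms is None:
        ts_ms = int(time.time() * 1000)
    return {
        # Flume headers are string-to-string; "timestamp" is epoch milliseconds.
        "headers": {
            "timestamp": str(ts_ms),
            "logType": log_type,
            "logSubType": log_sub_type,
        },
        # json.dumps without indent emits no raw newlines; newlines inside
        # string values are escaped as \n, so the payload stays on one line.
        "body": json.dumps(payload, separators=(",", ":")),
    }


def to_post_body(events):
    """Serialize a list of envelopes into one HTTP POST body (a JSON array)."""
    return json.dumps(events)
```

Because the payload is serialized without indentation, the single-line 
requirement is met even when a field value itself contains a line break.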

This overall workflow has proven to be extremely useful and flexible. We manage 
multiple data flows with a single source/channel/sink by writing to paths based 
on the envelope headers (e.g. 
/flume/%{logType}/%{logSubType}/date=%Y%m%d/hour=%H).
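
To show how such a path template resolves, here is a small Python emulation. 
It is only a sketch: the HDFS sink does this expansion itself, the function 
name is invented, only a handful of date escapes are handled, and UTC is 
assumed for the timestamp. One thing it makes visible is that month is %m, 
while %M is the minute.

```python
import re
from datetime import datetime, timezone

# Matches %{headerName} or one of a few single-letter date escapes.
_ESCAPE = re.compile(r"%\{(\w+)\}|%([YmdHMS])")


def expand_path(template, headers):
    """Expand header and date escapes in a path template (illustration only)."""
    # The "timestamp" header is epoch milliseconds; UTC is an assumption here.
    ts = datetime.fromtimestamp(int(headers["timestamp"]) / 1000, tz=timezone.utc)

    def repl(match):
        if match.group(1) is not None:
            return headers[match.group(1)]       # %{name} -> header value
        return ts.strftime("%" + match.group(2))  # %Y, %m, %d, %H, %M, %S

    return _ESCAPE.sub(repl, template)


headers = {"timestamp": "1369946692000", "logType": "app", "logSubType": "web"}
path = expand_path("/flume/%{logType}/%{logSubType}/date=%Y%m%d/hour=%H", headers)
# path is "/flume/app/web/date=20130530/hour=20"
```

Events with different logType or logSubType headers land in different 
directories, which is what lets one sink serve several data flows.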

Hope that helps!
Paul Chavez

-----Original Message-----
From: Peter Eisentraut [mailto:[email protected]] 
Sent: Thursday, May 30, 2013 1:27 PM
To: [email protected]
Subject: best way to put UDP JSON data into Hadoop

I have a use case for Flume and I'm wondering which of the many options in 
Flume to use for making this work.

I have a data source that produces log data in UDP packets containing JSON (a 
bit like syslog, but the data is already structured).  I want to get this into 
Hadoop somehow (either HBase or HDFS+Hive, not sure yet).

My first attempt was to write a source (based on the syslog UDP source) that 
receives UDP packets, parses the JSON, stuffs the fields into the headers of 
the internal Flume event object, and sends it off.  (The body is left empty.)  
On the receiving end, I wrote a serializer for the hbase sink that writes each 
header field into a separate column.  That works, but I was confused that the 
default supplied hbase serializers ignored all event headers, so I was 
wondering whether I'm abusing them.

An alternative approach I was thinking about was writing a generic UDP source 
that stuffs the entire UDP packet into the event body, and then writing a 
serializer for the hbase sink that parses the JSON and puts the fields into the 
columns.  Or alternatively write the JSON straight into HDFS and have Hive do 
the JSON parsing later.
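
To make the two event shapes concrete, here is a minimal Python sketch. It is 
an illustration, not Flume code: events are modeled as plain dicts, the 
function names are hypothetical, and each packet is assumed to hold one 
top-level JSON object. Flume headers are string-to-string, so non-string 
values have to be stringified in the first variant.

```python
import json


def event_with_headers(datagram: bytes) -> dict:
    """First approach: parsed fields become headers, the body stays empty."""
    fields = json.loads(datagram.decode("utf-8"))
    # Headers only hold strings, so numbers and nested values are re-encoded.
    headers = {
        key: value if isinstance(value, str) else json.dumps(value)
        for key, value in fields.items()
    }
    return {"headers": headers, "body": ""}


def event_with_body(datagram: bytes) -> dict:
    """Alternative: carry the packet verbatim in the body, parse it later."""
    return {"headers": {}, "body": datagram.decode("utf-8")}
```

The second variant keeps the original types intact for whatever parses the 
JSON downstream (an hbase serializer or a Hive SerDe), while the first 
flattens everything to strings before the event leaves the source.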

Which one of these would be more idiomatic and/or generally useful?
