I have a use case for Flume and I'm wondering which of Flume's many options is the right way to make it work.
I have a data source that produces log data in UDP packets containing JSON (a bit like syslog, but the data is already structured). I want to get this into Hadoop somehow, either into HBase or into HDFS+Hive; I'm not sure yet.

My first attempt was to write a custom source (based on the syslog UDP source) that receives the UDP packets, parses the JSON, stuffs the fields into the headers of the internal Flume event object, and sends it on with an empty body. On the receiving end, I wrote a serializer for the HBase sink that writes each header field into a separate column. That works, but the default HBase serializers that ship with Flume ignore all event headers, so I'm wondering whether I'm abusing the header mechanism.

An alternative approach I was considering is a generic UDP source that puts the entire UDP packet into the event body, paired with a serializer for the HBase sink that parses the JSON and maps the fields to columns. Or I could write the raw JSON straight to HDFS and have Hive do the JSON parsing later.

Which of these would be more idiomatic and/or generally useful?
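For what it's worth, whichever component ends up doing the parsing (custom source or HBase serializer), the core transform is the same: turn the JSON payload into a flat string-to-string map, which is the shape both Flume event headers and HBase column qualifier/value pairs want. Here's a minimal Python sketch of just that transform (Flume itself would of course do this in Java); the field names and sample packet are made up:

```python
import json

def packet_to_headers(packet):
    """Flatten a JSON UDP payload into a flat string-to-string map,
    the shape Flume event headers (and HBase columns) expect."""
    record = json.loads(packet.decode("utf-8"))
    headers = {}

    def flatten(prefix, value):
        if isinstance(value, dict):
            # Nested objects become dotted keys, e.g. "meta.pid".
            for key, child in value.items():
                flatten(prefix + "." + key if prefix else key, child)
        else:
            # Headers are strings, so scalar values are stringified.
            headers[prefix] = str(value)

    flatten("", record)
    return headers

# Example: a syslog-like structured payload (hypothetical fields).
packet = b'{"host": "web1", "level": "ERROR", "meta": {"pid": 4242}}'
print(packet_to_headers(packet))
# -> {'host': 'web1', 'level': 'ERROR', 'meta.pid': '4242'}
```

The open question is really just *where* this flattening should live in the pipeline, not what it looks like.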
