Summary: I want to harvest a subset of custom data out of an Avro container structure and store it off as Avro data files. I'm having difficulty determining the cleanest place to implement the extraction logic.
Details: I have a Flume flow that ships a somewhat complex set of XML structures to HDFS by way of a custom Avro wrapper. I want to pull certain values out of those XML structures into Hive tables for reporting. To get at the XML data inside the HDFS files of Flume events, I logically need to:

(1) unmarshal the Flume event to get the byte array of the _body_ field;
(2) unmarshal the bytes of the _body_ field into my custom Avro wrapper structure;
(3) navigate through the wrapper structure to locate the specific XML payload I need to harvest data from (the XML itself is binary-serialized as Fast Infoset);
(4) unmarshal the Fast Infoset encoded XML to POJOs (or plain XML) and pull out certain values to store in a Hive table. The data is semi-structured: primarily atomic values, plus some lists and other structures.

(A rough sketch of the read side and the write side of this pipeline is in the P.S. below.)

I intend to store the set of values from step (4) as an Avro data file and expose it as a Hive table using org.apache.hadoop.hive.serde2.avro.AvroSerDe. I could alternatively store this data as flat files, but I'm not sure that buys me anything.

I was originally thinking I would use a custom SerDe with Hive that would read in the Flume event Avro structures, harvest the desired data, and write out the custom Avro structure that represents my reduced dataset. At reporting time I would then use a plain AvroSerDe with a schema describing the reduced dataset. After experimenting in Hive with SERDE=AvroSerDe, INPUTFORMAT=AvroContainerInputFormat, and OUTPUTFORMAT=AvroContainerOutputFormat, I see that all three are tied to a single Avro schema (used for reading, for writing, and for providing the table metadata). That makes it difficult to read with the Flume event schema while writing with my custom schema for the transformed structure.

I think I can still work with this by implementing my own InputFormat that is specialized for Flume events (and ignores the Avro schema defined in the table properties), but I'm wondering whether I'm reaching for the wrong tool for the job and would be better off with a custom map job (or something else).

If anyone has experience with requirements along these lines, I would love to hear what you learned!

Cheers,
Adrian Hains
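
P.S. To make steps (1)-(4) concrete, here is a rough sketch of the read side as a standalone Java program. It assumes the HDFS files were written by the Flume HDFS sink with the stock avro_event serializer (records with "headers" and "body" fields), and that the body bytes are a plain Avro binary datum of my wrapper (no container header; if the wrapper includes the data-file header, a DataFileReader over the bytes would be needed instead). The file name "wrapper.avsc", the field "fiPayload", the element "someValue", and the class name EventHarvester are all placeholders for my actual structures. The Fast Infoset decode uses StAXDocumentParser from the FastInfoset library.

import java.io.ByteArrayInputStream;
import java.io.File;

import javax.xml.stream.XMLStreamReader;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DecoderFactory;

import com.sun.xml.fastinfoset.stax.StAXDocumentParser;

public class EventHarvester {

  public static void main(String[] args) throws Exception {
    // placeholder: the .avsc describing my custom avro wrapper
    Schema wrapperSchema = new Schema.Parser().parse(new File("wrapper.avsc"));

    // (1) iterate the flume events in one hdfs-sourced avro data file
    DataFileReader<GenericRecord> events = new DataFileReader<GenericRecord>(
        new File(args[0]), new GenericDatumReader<GenericRecord>());

    GenericDatumReader<GenericRecord> wrapperReader =
        new GenericDatumReader<GenericRecord>(wrapperSchema);
    BinaryDecoder decoder = null;

    for (GenericRecord event : events) {
      java.nio.ByteBuffer body = (java.nio.ByteBuffer) event.get("body");
      byte[] bodyBytes = new byte[body.remaining()];
      body.get(bodyBytes);

      // (2) decode the body bytes into the custom wrapper structure
      decoder = DecoderFactory.get().binaryDecoder(bodyBytes, decoder);
      GenericRecord wrapper = wrapperReader.read(null, decoder);

      // (3) navigate to the Fast Infoset payload ("fiPayload" is a placeholder)
      java.nio.ByteBuffer payload = (java.nio.ByteBuffer) wrapper.get("fiPayload");
      byte[] fiBytes = new byte[payload.remaining()];
      payload.get(fiBytes);

      // (4) decode Fast Infoset into a StAX stream and pull out values
      XMLStreamReader xml = new StAXDocumentParser(new ByteArrayInputStream(fiBytes));
      while (xml.hasNext()) {
        if (xml.next() == XMLStreamReader.START_ELEMENT
            && "someValue".equals(xml.getLocalName())) { // placeholder element
          String value = xml.getElementText();
          // ... collect the value into the reduced record here ...
        }
      }
    }
    events.close();
  }
}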

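And for the write side, assuming the harvested values go into a flat record, this is roughly how I'd emit the Avro data file that the AvroSerDe-backed reporting table would sit on. The reduced-dataset schema and the field names "someValue"/"someCount" are stand-ins for my actual reduced structure; the same .avsc would be what I point the Hive table's avro.schema.url/avro.schema.literal property at, so the stock AvroSerDe and AvroContainerInputFormat only ever see the one reduced schema.

import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class ReducedDatasetWriter implements java.io.Closeable {

  private final Schema reducedSchema;
  private final DataFileWriter<GenericRecord> writer;

  public ReducedDatasetWriter(File schemaFile, File outFile) throws IOException {
    // placeholder: the .avsc describing the reduced dataset (shared with hive)
    reducedSchema = new Schema.Parser().parse(schemaFile);
    writer = new DataFileWriter<GenericRecord>(
        new GenericDatumWriter<GenericRecord>(reducedSchema));
    writer.create(reducedSchema, outFile);
  }

  // called once per harvested xml payload
  public void emit(String someValue, long someCount) throws IOException {
    GenericRecord r = new GenericData.Record(reducedSchema);
    r.put("someValue", someValue); // placeholder field names
    r.put("someCount", someCount);
    writer.append(r);
  }

  public void close() throws IOException {
    writer.close();
  }
}

Dropping the resulting files under the reporting table's LOCATION would sidestep the one-schema coupling entirely, since the transform would happen outside Hive; the open question is whether that transform belongs in a custom InputFormat, a map job, or somewhere else.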