You could bunch these messages up into a single Flume event, then write a serializer that reads the Avro messages back out of each event and writes them into an Avro container file (take a look at AvroEventSerializer for reference). The downside of this is that you'd have to decode and re-encode the messages in your serializer.
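A minimal sketch of such a serializer is below. The class name and the "schemaJson" config key are hypothetical, and it assumes every event body is a run of back-to-back binary-encoded Avro datums all sharing one schema:

import java.io.IOException;
import java.io.OutputStream;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.serialization.EventSerializer;

/**
 * Hypothetical serializer: each Flume event body holds many
 * binary-encoded Avro datums (no per-record container framing),
 * all with the same schema. Each datum is decoded and re-appended
 * through a DataFileWriter so the HDFS file is one valid Avro
 * container with a single header.
 */
public class BatchedAvroEventSerializer implements EventSerializer {

  private final OutputStream out;
  private final Schema schema;
  private final GenericDatumReader<GenericRecord> reader;
  private DataFileWriter<GenericRecord> writer;
  private BinaryDecoder decoder; // reused across events

  private BatchedAvroEventSerializer(Schema schema, OutputStream out) {
    this.out = out;
    this.schema = schema;
    this.reader = new GenericDatumReader<>(schema);
  }

  @Override
  public void afterCreate() throws IOException {
    writer = new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema));
    writer.create(schema, out); // container header is written exactly once
  }

  @Override
  public void afterReopen() throws IOException {
    throw new UnsupportedOperationException("reopen not supported");
  }

  @Override
  public void write(Event event) throws IOException {
    // Decode every datum packed into this event body and append
    // each one individually; DataFileWriter handles the framing.
    decoder = DecoderFactory.get().binaryDecoder(event.getBody(), decoder);
    while (!decoder.isEnd()) {
      GenericRecord record = reader.read(null, decoder);
      writer.append(record);
    }
  }

  @Override
  public void flush() throws IOException {
    writer.flush();
  }

  @Override
  public void beforeClose() throws IOException {
    // writer.close() would also close the underlying stream, which
    // the HDFS sink manages itself, so only flush here.
    writer.flush();
  }

  @Override
  public boolean supportsReopen() {
    return false;
  }

  public static class Builder implements EventSerializer.Builder {
    @Override
    public EventSerializer build(Context context, OutputStream out) {
      // "schemaJson" is a made-up config key; load the schema
      // however suits your deployment.
      Schema schema = new Schema.Parser().parse(context.getString("schemaJson"));
      return new BatchedAvroEventSerializer(schema, out);
    }
  }
}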
Thanks,
Hari

On Tue, Sep 30, 2014 at 12:54 AM, Bryce Alcock <[email protected]> wrote:

> Not sure if I am approaching this problem correctly, but here is the basic
> outline:
> I would like to send, say, 10,000 or even more small Avro messages in a
> single Flume event for storage on HDFS.
> When I do this, it corrupts the Avro file created on HDFS because (I
> assume, based on a bit of reading) it messes with the framing that Avro
> provides.
> So the long and the short of it is: if I send, say, 2 Flume events, each
> containing 10,000 Avro messages, and the HDFS sink stores the 2 batches of
> Avro messages in a single file on HDFS, the first 10,000 messages are
> readable, but message 10,001 is corrupt.
> I am doing this for performance reasons: I need to send about
> 1500 * 3600 = 5,400,000 (yes, 5.4 million) small messages every ~4 seconds.
> I know this is a lot of messages...
> I can produce the messages at the correct rate, but I cannot flume them in
> very fast, because I have to create a Flume event with an Avro schema
> attached to each message, so I thought that if I could batch up a bunch of
> them at once, it would be more efficient.
> Thanks in advance!
> Q. Boiler
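For completeness, the sender-side batching Bryce describes could look roughly like the sketch below: many records are written as consecutive binary-encoded datums into one event body, which pairs with the serializer sketched earlier. The class and method names are hypothetical:

import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;
import org.apache.flume.Event;
import org.apache.flume.event.EventBuilder;

public final class AvroBatcher {

  /**
   * Packs a batch of records into one Flume event body as
   * back-to-back binary-encoded Avro datums (no container header
   * per record). The schema travels out of band, not per message.
   */
  public static Event batchToEvent(Schema schema,
                                   Iterable<GenericRecord> records)
      throws IOException {
    ByteArrayOutputStream body = new ByteArrayOutputStream();
    GenericDatumWriter<GenericRecord> writer =
        new GenericDatumWriter<>(schema);
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(body, null);
    for (GenericRecord record : records) {
      writer.write(record, encoder);
    }
    encoder.flush();
    return EventBuilder.withBody(body.toByteArray());
  }
}

Since the schema is attached once per batch rather than once per message, this amortizes the per-event overhead that was limiting throughput.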
