Hi all,

I'm a Systems Engineer at the Wikimedia Foundation, and we're investigating 
using Flume for our web request log HDFS imports.  We've previously been using 
Kafka, but have had to change our short-term architecture plans in order to get 
data into HDFS reliably and regularly soon.

Our current web request logs are available for consumption over a multicast UDP 
stream.  I could hack something together to try and pipe this into Flume using 
the existing sources (SyslogUDPSource, or maybe some combination of socat + 
NetcatSource), but I'd rather reduce the number of moving parts.  I'd like to 
consume directly from the multicast UDP stream as a Flume source.

I coded up a proof of concept based on the SyslogUDPSource, mainly by stripping 
out the syslog event header extraction and adding multicast datagram 
connection code.  I plan on cleaning this up and making it a generic raw UDP 
source, with multicast as a configuration option.
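
For anyone curious, the receive side of such a source might look roughly like 
the sketch below.  This is not the actual proof of concept code, just a minimal 
illustration using plain java.net: the class and method names (RawUdpReceiver, 
open, receiveOne) are hypothetical, the Flume source wiring is left out, and 
joining on the default network interface is an assumption.

```java
import java.io.IOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetSocketAddress;
import java.net.MulticastSocket;
import java.util.Arrays;

// Hypothetical helper, not part of Flume: receives raw UDP payloads,
// optionally joining a multicast group first.
final class RawUdpReceiver {

    // Opens a UDP socket bound to the given port. If multicastGroup is
    // non-null, the socket also joins that group on the default interface.
    static DatagramSocket open(int port, String multicastGroup) throws IOException {
        if (multicastGroup == null) {
            return new DatagramSocket(port);
        }
        MulticastSocket socket = new MulticastSocket(port);
        // A null NetworkInterface defers to the socket's default interface.
        socket.joinGroup(new InetSocketAddress(multicastGroup, port), null);
        return socket;
    }

    // Blocks until one datagram arrives and returns exactly its payload bytes.
    // Payloads longer than maxSize are truncated by the underlying socket.
    static byte[] receiveOne(DatagramSocket socket, int maxSize) throws IOException {
        byte[] buffer = new byte[maxSize];
        DatagramPacket packet = new DatagramPacket(buffer, buffer.length);
        socket.receive(packet);
        int start = packet.getOffset();
        return Arrays.copyOfRange(packet.getData(), start, start + packet.getLength());
    }
}
```

In a real source, each payload returned by receiveOne would presumably become 
the body of one Flume event, with the group address and port read from the 
agent configuration.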

My question to you all: is this something the Flume community would find 
useful?  If so, should I open a JIRA to track it?  I've got a fork of the 
Flume git repo over on GitHub and will be doing my work there.  I'd love to 
share it upstream if it would be useful.

Thanks!
-Andrew Otto
 Systems Engineer
 Wikimedia Foundation

