Hi Steven,

Thanks for chiming in! Please see my responses inline.

On Thu, Feb 7, 2013 at 3:04 PM, Steven Yates <[email protected]> wrote:
> The only missing link within the Flume architecture I see in this
> conversation is the actual channels and brokers themselves which
> orchestrate this lovely undertaking of data collection.

Can you define what you mean by channels and brokers in this context? In Flume parlance, "channel" is a synonym for a queueing event buffer. Also, can you elaborate on what you mean by orchestration? I think I know where you're going, but I don't want to put words in your mouth.

> One opportunity I do see (and I may be wrong) is for the data to be offloaded
> into a system such as Apache Mahout before being sent to the sink. Perhaps
> the concept of a ChannelAdapter of sorts? I.e. a Mahout Adapter? Just
> thinking out loud and it may be well out of the question.

Why not a Mahout sink? Since Mahout often wants sequence files in a particular format to begin its MapReduce processing (e.g., its k-means clustering implementation), Flume is already a good fit: its HDFS sink and EventSerializers let you write a plugin to format your data however it needs to go in. In fact, that works today if you have a batch (even a 5-minute batch) use case. With today's functionality, you could use Oozie to coordinate kicking off the Mahout M/R job periodically, as new data becomes available and the files are rolled.

Perhaps even more interestingly, I can see a use case where you might want to use Mahout to do streaming / realtime updates driven by Flume, in the form of an interceptor or a Mahout sink. If online machine learning (e.g., stochastic gradient descent or something else online) was what you were thinking, I wonder if there are any folks on this list who might have an interest in helping to put such a thing together.

In any case, I'd like to hear more about specific use cases for streaming analytics. :)

Regards,
Mike
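P.S. To make the batch path concrete, here is a minimal sketch of the HDFS-sink side of an agent config. The agent, channel, and sink names and the HDFS path are all placeholder choices, not prescriptive:

```properties
# Sketch: roll Flume events into HDFS as SequenceFiles that a downstream
# Mahout M/R job can pick up. Names and paths are illustrative only.
agent1.sinks = hdfsSink
agent1.sinks.hdfsSink.type = hdfs
agent1.sinks.hdfsSink.channel = memChannel
# Bucket events by date so a scheduler can find closed files
agent1.sinks.hdfsSink.hdfs.path = hdfs://namenode/flume/events/%Y-%m-%d
agent1.sinks.hdfsSink.hdfs.fileType = SequenceFile
agent1.sinks.hdfsSink.hdfs.writeFormat = Writable
# Roll a new file every 300 seconds -- the "5-minute batch" case
agent1.sinks.hdfsSink.hdfs.rollInterval = 300
agent1.sinks.hdfsSink.hdfs.rollSize = 0
agent1.sinks.hdfsSink.hdfs.rollCount = 0
```

If the default Writable/Text record formats aren't what your Mahout job expects, that's where a custom serializer plugin comes in.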
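For the Oozie coordination piece, a rough sketch of a coordinator that fires a workflow every 5 minutes (the coordinator name, dates, and workflow path are all hypothetical):

```xml
<!-- Hypothetical coordinator: every 5 minutes, run a workflow that
     launches the Mahout M/R job over recently rolled files. -->
<coordinator-app name="mahout-kmeans-coord"
                 frequency="${coord:minutes(5)}"
                 start="2013-02-08T00:00Z" end="2013-12-31T00:00Z"
                 timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.2">
  <action>
    <workflow>
      <app-path>hdfs://namenode/apps/mahout-kmeans-wf</app-path>
    </workflow>
  </action>
</coordinator-app>
```

In practice you'd likely add dataset/input-event declarations so the coordinator only triggers when new rolled files are actually present, rather than purely on the clock.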
