Nitin, +1 on the Storm sink. Worth discussing further IMO. -Steve
From: Nitin Pawar <[email protected]>
Date: Fri, 8 Feb 2013 10:25:51 +0530
To: <[email protected]>, Steven Yates <[email protected]>
Subject: Re: Analysis of Data

Hi Steve,

I can understand the idea of having data processed inside Flume by streaming it to another Flume agent. But do we really need to re-engineer something inside Flume? That is what I am wondering. The core Flume dev team may have better ideas on this, but for streaming data processing Storm is currently a strong candidate. Flume does have an open JIRA on this integration: FLUME-1286 <https://issues.apache.org/jira/browse/FLUME-1286>

It would be interesting to draw up performance comparisons if the data processing logic were added to Flume. We do currently see people doing a little pre-processing of their data (they have their own custom channel types where they modify the data and sink it).

On Fri, Feb 8, 2013 at 8:52 AM, Steve Yates <[email protected]> wrote:
> Thanks for your feedback Mike. I have been thinking about this a little more,
> and using Mahout as an example, I was considering the concept of developing an
> enriched 'sink', so to speak, which would accept input streams / messages from
> a Flume channel and forward them on to a 'service', i.e. a Mahout service,
> which would subsequently deliver the results to the configured sink. So yes,
> it would behave as an intercept -> filter -> process -> sink for applicable
> data items.
>
> I apologise if that is still vague. It would be great to receive further
> feedback from the user group.
>
> -Steve
>
> Mike Percy <[email protected]> wrote:
> Hi Steven,
> Thanks for chiming in! Please see my responses inline:
>
> On Thu, Feb 7, 2013 at 3:04 PM, Steven Yates <[email protected]> wrote:
>> The only missing link within the Flume architecture I see in this
>> conversation is the actual channels and brokers themselves which orchestrate
>> this lovely undertaking of data collection.
> Can you define what you mean by channels and brokers in this context?
> "Channel" is a synonym for a queueing event buffer in Flume parlance. Also,
> can you elaborate on what you mean by orchestration? I think I know where
> you're going, but I don't want to put words in your mouth.
>
>> One opportunity I do see (and I may be wrong) is for the data to be offloaded
>> into a system such as Apache Mahout before being sent to the sink. Perhaps
>> the concept of a ChannelAdapter of sorts? I.e. a Mahout Adapter? Just
>> thinking out loud, and it may well be out of the question.
>
> Why not a Mahout sink? Since Mahout often wants sequence files in a particular
> format to begin its MapReduce processing (e.g. its k-means clustering
> implementation), Flume is already a good fit: its HDFS sink and
> EventSerializers allow writing a plugin to format your data however it needs
> to go in. In fact, that works today if you have a batch (even 5-minute batch)
> use case. With today's functionality, you could use Oozie to coordinate
> kicking off the Mahout M/R job periodically, as new data becomes available
> and the files are rolled.
>
> Perhaps even more interestingly, I can see a use case where you might want to
> use Mahout to do streaming / realtime updates driven by Flume, in the form of
> an interceptor or a Mahout sink. If online machine learning (e.g. stochastic
> gradient descent or something else online) was what you were thinking, I
> wonder if there are any folks on this list who might have an interest in
> helping to work on putting such a thing together.
>
> In any case, I'd like to hear more about specific use cases for streaming
> analytics. :)
>
> Regards,
> Mike

--
Nitin Pawar
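[List archive note: Mike's batch suggestion (HDFS sink + a custom EventSerializer, with Oozie kicking off the Mahout job as files roll) maps onto ordinary Flume agent configuration. A minimal sketch follows; the agent/component names and the serializer class are hypothetical, while the `type`, `hdfs.*`, and `serializer` keys are standard HDFS sink properties:]

```properties
# Hypothetical agent "a1": land events in HDFS so a downstream Mahout
# M/R job (coordinated by Oozie) can pick them up once files are rolled.
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events/%Y-%m-%d

# Write a raw data stream so the pluggable EventSerializer controls
# the on-disk record format.
a1.sinks.k1.hdfs.fileType = DataStream

# Roll a new file every 5 minutes (the batch window discussed above),
# disabling size- and count-based rolling.
a1.sinks.k1.hdfs.rollInterval = 300
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 0

# Hypothetical custom EventSerializer.Builder that formats each event
# the way the Mahout job expects its input.
a1.sinks.k1.serializer = com.example.MahoutVectorSerializer$Builder
```

[With a setup along these lines, the streaming variant Mike mentions would instead hang the Mahout call off an interceptor or a dedicated sink rather than the serializer.]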
