Thanks for your feedback, Mike. I have been thinking about this a little more. 
Using Mahout as an example, I was considering the idea of developing an 
enriched 'sink' that would accept events from a Flume channel, forward them to 
a specific 'service' (e.g. a Mahout service), and then deliver the results to 
the configured downstream sink. So yes, it would behave as an 
intercept -> filter -> process -> sink pipeline for applicable data items.
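
To make that a bit more concrete, here is a very rough sketch of what such a 
sink might look like as a Flume plugin. The MahoutService class and its 
score()/deliver() methods are placeholders for whatever the real service 
interface would be, not an existing API:

// Rough sketch only. MahoutService, score() and deliver() are hypothetical
// placeholders for the external service side of the idea.
import org.apache.flume.Channel;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.Transaction;
import org.apache.flume.conf.Configurable;
import org.apache.flume.sink.AbstractSink;

public class EnrichingServiceSink extends AbstractSink implements Configurable {

  private MahoutService service;   // hypothetical wrapper around the external service

  @Override
  public void configure(Context context) {
    // e.g. read the service endpoint from the agent configuration
    String endpoint = context.getString("service.endpoint");
    service = new MahoutService(endpoint);
  }

  @Override
  public Status process() throws EventDeliveryException {
    Channel channel = getChannel();
    Transaction tx = channel.getTransaction();
    tx.begin();
    try {
      Event event = channel.take();
      if (event == null) {
        tx.commit();
        return Status.BACKOFF;
      }
      // Hand the event body to the service, then forward the enriched result
      byte[] enriched = service.score(event.getBody());
      service.deliver(enriched);   // or write to a configured downstream destination
      tx.commit();
      return Status.READY;
    } catch (Exception e) {
      tx.rollback();
      throw new EventDeliveryException("Failed to deliver event to service", e);
    } finally {
      tx.close();
    }
  }
}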

I apologise if that is still vague. It would be great to receive further 
feedback from the user group.

-Steve

Mike Percy <[email protected]> wrote:

Hi Steven,
Thanks for chiming in! Please see my responses inline:

On Thu, Feb 7, 2013 at 3:04 PM, Steven Yates <[email protected]> wrote:
The only missing link in the Flume architecture that I see in this conversation 
is the actual channels and brokers themselves, which orchestrate this lovely 
undertaking of data collection.

Can you define what you mean by channels and brokers in this context? In Flume 
parlance, a channel is essentially a queueing event buffer. Also, can you 
elaborate on what you mean by orchestration? I think I know where you're going, 
but I don't want to put words in your mouth.

One opportunity I do see (and I may be wrong) is for the data to be offloaded 
into a system such as Apache Mahout before being sent to the sink. Perhaps the 
concept of a ChannelAdapter of sorts, i.e. a Mahout adapter? Just thinking out 
loud; it may well be out of the question.

Why not a Mahout sink? Since Mahout often wants sequence files in a particular 
format to begin its MapReduce processing (e.g. its k-Means clustering 
implementation), Flume is already a good fit: its HDFS sink and 
EventSerializers allow you to write a plugin to format your data however it 
needs to go in. In fact, that works today if you have a batch (even 5-minute 
batch) use case. With today's functionality, you could use Oozie to coordinate 
kicking off the Mahout M/R job periodically, as new data becomes available and 
the files are rolled.
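
For reference, a custom EventSerializer for the HDFS sink can be quite small. 
The sketch below just writes one text record per event; the actual formatting 
for a given Mahout job (or sequence-file output via the HDFS sink's own 
settings) would be adapted accordingly:

// Minimal sketch of a custom EventSerializer plugin. The output format here
// (one event body per line) is only illustrative.
import java.io.IOException;
import java.io.OutputStream;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.serialization.EventSerializer;

public class MahoutInputSerializer implements EventSerializer {

  private final OutputStream out;

  private MahoutInputSerializer(Context context, OutputStream out) {
    this.out = out;
  }

  @Override
  public void afterCreate() throws IOException { /* write a header if needed */ }

  @Override
  public void afterReopen() throws IOException { }

  @Override
  public void write(Event event) throws IOException {
    // Convert the event body into one record per line for the downstream job.
    out.write(event.getBody());
    out.write('\n');
  }

  @Override
  public void flush() throws IOException {
    out.flush();
  }

  @Override
  public void beforeClose() throws IOException { }

  @Override
  public boolean supportsReopen() {
    return false;
  }

  public static class Builder implements EventSerializer.Builder {
    @Override
    public EventSerializer build(Context context, OutputStream out) {
      return new MahoutInputSerializer(context, out);
    }
  }
}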

Perhaps even more interestingly, I can see a use case where you might want to 
use Mahout to do streaming / realtime updates driven by Flume in the form of an 
interceptor or a Mahout sink. If online machine learning (e.g. stochastic 
gradient descent or something else online) was what you were thinking, I wonder 
if there are any folks on this list who might have an interest in helping to 
work on putting such a thing together.
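
To sketch the interceptor flavour of that idea: each event passing through 
could incrementally update a model. OnlineLearner and its methods below are 
just placeholders for a model wrapper (e.g. around Mahout's online SGD 
classes), not an existing API:

// Sketch of a Flume interceptor that feeds each event to an online learner
// as it passes through. OnlineLearner, update() and persist() are placeholders.
import java.util.List;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

public class OnlineLearningInterceptor implements Interceptor {

  private OnlineLearner learner;   // hypothetical wrapper around the model

  @Override
  public void initialize() {
    learner = new OnlineLearner();
  }

  @Override
  public Event intercept(Event event) {
    // Update the model incrementally; the event itself flows on unchanged.
    learner.update(event.getBody());
    return event;
  }

  @Override
  public List<Event> intercept(List<Event> events) {
    for (Event e : events) {
      intercept(e);
    }
    return events;
  }

  @Override
  public void close() {
    learner.persist();   // e.g. checkpoint the model on shutdown
  }

  public static class Builder implements Interceptor.Builder {
    @Override
    public Interceptor build() {
      return new OnlineLearningInterceptor();
    }

    @Override
    public void configure(Context context) { }
  }
}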

In any case, I'd like to hear more about specific use cases for streaming 
analytics. :)

Regards,
Mike
