Absolutely, Mike, thank you. Specifically, though, it would be nice to be able to feed the results from an external process (such as Mahout or Storm) back into a Flume channel/sink. Would that be feasible?
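To make that concrete, here is a rough sketch of what I have in mind, using Flume's RpcClient API to push results from an external process back to an AvroSource. The host, port, class name, and event body below are just placeholders; it assumes an agent with an AvroSource listening on port 41414 whose channel feeds the rest of the flow:

    import java.nio.charset.Charset;
    import org.apache.flume.Event;
    import org.apache.flume.EventDeliveryException;
    import org.apache.flume.api.RpcClient;
    import org.apache.flume.api.RpcClientFactory;
    import org.apache.flume.event.EventBuilder;

    // Hypothetical external process (e.g. a Storm bolt or a Mahout job driver)
    // pushing its results back into a Flume AvroSource.
    public class FeedbackClient {
      public static void main(String[] args) throws EventDeliveryException {
        // Placeholder host/port: wherever the receiving AvroSource is listening.
        RpcClient client = RpcClientFactory.getDefaultInstance("flume-agent.example.com", 41414);
        try {
          // Placeholder body: the result produced by the external process.
          Event event = EventBuilder.withBody("model-score=0.87", Charset.forName("UTF-8"));
          client.append(event); // handed to whatever channel(s) the AvroSource is wired to
        } finally {
          client.close();
        }
      }
    }

From there the events land on whatever channel the AvroSource is bound to, so the downstream channel/sink configuration would stay unchanged.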
-Steve

From: Mike Percy <[email protected]>
Reply-To: <[email protected]>
Date: Fri, 8 Feb 2013 14:09:04 -0800
To: "[email protected]" <[email protected]>
Cc: Nitin Pawar <[email protected]>
Subject: Re: Analysis of Data

Steven,
Any reason you are not using interceptors for that? Can you provide more detail on what you are doing?

See more about interceptors here:
http://flume.apache.org/FlumeUserGuide.html#flume-interceptors

Regards,
Mike

On Fri, Feb 8, 2013 at 3:34 AM, <[email protected]> wrote:
> Hi Nitin,
>
> Would it be feasible to consider the addition of another extension point within
> Flume for the purposes of custom filtering, enrichment, routing, etc., without
> trying to re-envision Flume as something it was never designed for (i.e.
> without going overboard)? The concept of some sort of intermediate processing
> unit is quite attractive to me personally, as I have dedicated AvroSources
> purely for aggregating data; however, in the interest of modularisation I may
> want to perform some enrichment/filtering exercise before I dump the events on
> my durable channel. I guess this leads to the conversation about flow and some
> sort of declarative way of configuring the ordering of the processing units,
> etc. Just thinking out loud.
>
> @Nitin/Mike, your experience in the field will help validate this further.
>
> -Steve
>
> Quoting Nitin Pawar <[email protected]>:
>
>> Mike, yes.
>>
>> I am not against the approach of Flume doing it. I would love to see it as part
>> of Flume (it of course helps to remove the overload on one processing engine).
>> As Flume already supports the grouping of agents, the normal route of
>> acquisition and sink can continue.
>>
>> In another route, we can have it sink to a processor source of Flume, which
>> then converts the data, runs quick analysis on the data in memory, and updates
>> the global counters and the like, which can then be sunk to live reporting
>> systems.
>>
>> Thanks,
>> Nitin
>>
>>
>> On Fri, Feb 8, 2013 at 2:26 PM, Mike Percy <[email protected]> wrote:
>>
>>> Nitin,
>>> Good to hear more of your thoughts. Please see inline.
>>>
>>> On Thu, Feb 7, 2013 at 8:55 PM, Nitin Pawar <[email protected]> wrote:
>>>
>>>> I can understand the idea of having data processed inside Flume by
>>>> streaming it to another Flume agent. But do we really need to re-engineer
>>>> something inside Flume? That is what I am wondering. The core Flume dev
>>>> team may have better ideas on this, but currently Storm is a strong
>>>> candidate for streaming data processing.
>>>> Flume does have an open JIRA on this integration:
>>>> FLUME-1286 <https://issues.apache.org/jira/browse/FLUME-1286>
>>>
>>> Yes, a Storm sink could be useful. But that wouldn't preclude us from
>>> taking a hard look at what may be missing in Flume itself, right?
>>>
>>>> It will be interesting to draw up comparisons in performance if the data
>>>> processing logic is added to Flume. We do currently see people doing a
>>>> little bit of pre-processing of their data (they have their own custom
>>>> channel types where they modify the data and sink it).
>>>
>>> It sounds like you have some experience with Flume. Are you guys using it
>>> at Rightster?
>>>
>>> I work with a lot of folks to set up and deploy Flume, many of whom do
>>> lookups/joins with other systems, transformations, etc. in real time along
>>> their data ingest pipeline before writing the data to HDFS or HBase for
>>> further processing and archival. I wouldn't say these are really heavy
>>> number-crunching implementations in Flume, but I certainly see a lot of
>>> inline parsing, inspection, enrichment, routing, and the like going on. I
>>> think Flume could do a lot more, given the right abstractions.
>>>
>>> Regards,
>>> Mike
>>
>> --
>> Nitin Pawar
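
P.S. Regarding the interceptor suggestion above: for the enrichment/filtering case I described, I am picturing something along these lines. This is only a sketch; the class name and the "enriched" header are made up for illustration:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import org.apache.flume.Context;
    import org.apache.flume.Event;
    import org.apache.flume.interceptor.Interceptor;

    // Sketch of a custom interceptor that enriches (or drops) events before
    // they reach the durable channel.
    public class EnrichmentInterceptor implements Interceptor {

      @Override
      public void initialize() {
        // Open any lookup/enrichment resources here.
      }

      @Override
      public Event intercept(Event event) {
        Map<String, String> headers = event.getHeaders();
        // Placeholder enrichment: tag the event; a channel selector could route on this header.
        headers.put("enriched", "true");
        return event; // returning null drops (filters out) the event
      }

      @Override
      public List<Event> intercept(List<Event> events) {
        List<Event> out = new ArrayList<Event>(events.size());
        for (Event e : events) {
          Event intercepted = intercept(e);
          if (intercepted != null) {
            out.add(intercepted);
          }
        }
        return out;
      }

      @Override
      public void close() {
        // Release resources opened in initialize().
      }

      public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
          return new EnrichmentInterceptor();
        }

        @Override
        public void configure(Context context) {
          // Read any interceptor properties from the agent configuration here.
        }
      }
    }

It would be wired onto the source in the agent config by pointing the interceptor type at the Builder class (e.g. agent.sources.r1.interceptors = i1 and agent.sources.r1.interceptors.i1.type = com.example.EnrichmentInterceptor$Builder, with the names again being placeholders). What this does not give me is the feedback path from an external system, which is why I am asking about pushing results back into a channel/sink.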
