Perhaps an MR job that writes directly into HBase (without going through Flume) would be more efficient. For examples, see http://hbase.apache.org/book/mapreduce.example.html
Wolfgang.

On Jul 19, 2013, at 1:13 AM, Flavio Pompermaier wrote:

> Thank you for the reply Wolfgang, I was just looking at the great use case
> presented by Ari Flink of Cisco and in fact those technologies sound great!
> The problem is that in my use case there will be an initial MapReduce job
> that will parse some text, perform some analysis and send the results of
> those analyses to my HBaseSink.
> Only once that job has finished (not in streaming!), I have to start
> processing the data stored in that HBase table "newer than some date
> contained in this end-message" (I thus need a way to trigger the start of
> such processing), which requires invoking an external REST service and
> storing data in another output table. Here too, only once that step has
> finished, I have to reduce all that information and put it into Solr.
>
> So I think that the main problem is to avoid streaming and to trigger
> MapReduce jobs. Is there a way to do that with Flume?
>
> Best,
> Flavio
>
> On Fri, Jul 19, 2013 at 12:51 AM, Wolfgang Hoschek <[email protected]>
> wrote:
>
> Take a look at these options:
>
> - HBase Sinks (send data into HBase):
>   http://flume.apache.org/FlumeUserGuide.html#hbasesinks
>
> - Apache Flume Morphline Solr Sink (for heavy-duty ETL processing and
>   ingestion into Solr):
>   http://flume.apache.org/FlumeUserGuide.html#morphlinesolrsink
>
> - Apache Flume MorphlineInterceptor (for lightweight event annotations
>   and routing):
>   http://flume.apache.org/FlumeUserGuide.html#morphline-interceptor
>
> - For MapReduce jobs it is typically more straightforward and efficient
>   to send data directly to destinations, i.e. without going through
>   Flume. For example, use the MapReduceIndexerTool when going from HDFS
>   into Solr:
>   https://github.com/cloudera/search/tree/master/search-mr
>
> Wolfgang.
>
> On Jul 18, 2013, at 3:37 PM, Flavio Pompermaier wrote:
>
> > Hi to all,
> >
> > I'm new to Flume but I'm very excited about it!
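[For reference, the fan-out flow discussed in this thread could be sketched roughly as the Flume agent configuration below: one Avro source replicated to two channels, feeding an HBase sink and a Morphline Solr sink. All names (agent1, mytable, the port, the morphline file path) are illustrative placeholders, not values from the thread.]

```properties
# Hypothetical agent: one Avro source fanned out to an HBase sink and a
# MorphlineSolrSink via two channels. Every name here is a placeholder.
agent1.sources = avroSrc
agent1.channels = hbaseCh solrCh
agent1.sinks = hbaseSnk solrSnk

agent1.sources.avroSrc.type = avro
agent1.sources.avroSrc.bind = 0.0.0.0
agent1.sources.avroSrc.port = 41414
# The default selector replicates each event into both channels.
agent1.sources.avroSrc.channels = hbaseCh solrCh

agent1.channels.hbaseCh.type = memory
agent1.channels.solrCh.type = memory

agent1.sinks.hbaseSnk.type = hbase
agent1.sinks.hbaseSnk.table = mytable
agent1.sinks.hbaseSnk.columnFamily = cf
agent1.sinks.hbaseSnk.channel = hbaseCh

agent1.sinks.solrSnk.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
agent1.sinks.solrSnk.morphlineFile = /etc/flume-ng/conf/morphline.conf
agent1.sinks.solrSnk.channel = solrCh
```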
> > I'd like to use it to gather some data, process the received messages
> > and then index them into Solr.
> > Any suggestion about how to do that with Flume?
> > I've already tested an Avro source that sends data to HBase,
> > but my use case requires those messages to be saved in HBase but also
> > processed and then indexed in Solr (obviously I also need to convert
> > their object structure).
> > I think the first part is quite simple (I just use 2 sinks: one that
> > stores in HBase and another one that forwards to another Avro
> > instance), right?
> > If messages are sent during a map/reduce job, is the Avro source the
> > best option to send the documents to index to my sink (i.e. the first
> > part of the flow, which up to now I have simulated with an Avro
> > source)?
> > Best,
> > Flavio
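[As a sketch of the ETL side of the MorphlineSolrSink mentioned above, a minimal morphline file might look like the following. The collection name, ZooKeeper address, and command package prefixes are assumptions; the exact importCommands prefix depends on the installed morphlines version, and the Solr schema must define the "message" field that readLine produces.]

```
# Hypothetical morphline: parse each Flume event body as one line of text
# (stored in the "message" field) and load the resulting record into Solr.
morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**"]
    commands : [
      { readLine { charset : UTF-8 } }
      {
        loadSolr {
          solrLocator : {
            collection : collection1
            zkHost : "localhost:2181/solr"
          }
        }
      }
    ]
  }
]
```

In a richer flow, extraction/transformation commands (e.g. grok or field-manipulation steps) would sit between readLine and loadSolr.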
