Perhaps an MR job that writes directly into HBase (without going through Flume) would be more efficient. For examples, see http://hbase.apache.org/book/mapreduce.example.html
Wolfgang.

On Jul 19, 2013, at 1:13 AM, Flavio Pompermaier wrote:

> Thank you for the reply Wolfgang, I was just looking at the great use case
> presented by Ari Flink of Cisco and in fact those technologies sound great!
> The problem is that in my use case there will be an initial MapReduce job
> that will parse some text, perform some analysis and send the results of
> those analyses to my HBaseSink.
> Only once that job has finished (not in streaming!), I have to start
> processing the data stored in that HBase table "newer than some date
> contained in this end-message" (I thus need a way to trigger the start of
> such processing), which requires invoking an external REST service and
> storing data in another output table. Here too, only once that step has
> finished, I have to reduce all that information and put it into Solr.
>
> So I think that the main problem is to avoid streaming and to trigger
> MapReduce jobs. Is there a way to do that with Flume?
>
> Best,
> Flavio
>
> On Fri, Jul 19, 2013 at 12:51 AM, Wolfgang Hoschek <[email protected]>
> wrote:
>
> Take a look at these options:
>
> - HBase Sinks (send data into HBase):
>   http://flume.apache.org/FlumeUserGuide.html#hbasesinks
>
> - Apache Flume Morphline Solr Sink (for heavy-duty ETL processing and
>   ingestion into Solr):
>   http://flume.apache.org/FlumeUserGuide.html#morphlinesolrsink
>
> - Apache Flume MorphlineInterceptor (for lightweight event annotations
>   and routing):
>   http://flume.apache.org/FlumeUserGuide.html#morphline-interceptor
>
> - For MapReduce jobs it is typically more straightforward and efficient
>   to send data directly to destinations, i.e. without going through
>   Flume. For example, use the MapReduceIndexerTool when going from HDFS
>   into Solr:
>   https://github.com/cloudera/search/tree/master/search-mr
>
> Wolfgang.
>
> On Jul 18, 2013, at 3:37 PM, Flavio Pompermaier wrote:
>
> > Hi to all,
> >
> > I'm new to Flume but I'm very excited about it!
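[For reference, the fan-out flow discussed in this thread could be sketched roughly as the Flume agent configuration below: one Avro source replicated to two channels, feeding an HBase sink and a Morphline Solr sink. All names (agent1, mytable, the port, the morphline file path) are illustrative placeholders, not values from the thread.]

```properties
# Hypothetical agent: one Avro source fanned out to an HBase sink and a
# MorphlineSolrSink via two channels. Every name here is a placeholder.
agent1.sources = avroSrc
agent1.channels = hbaseCh solrCh
agent1.sinks = hbaseSnk solrSnk

agent1.sources.avroSrc.type = avro
agent1.sources.avroSrc.bind = 0.0.0.0
agent1.sources.avroSrc.port = 41414
# The default selector replicates each event into both channels.
agent1.sources.avroSrc.channels = hbaseCh solrCh

agent1.channels.hbaseCh.type = memory
agent1.channels.solrCh.type = memory

agent1.sinks.hbaseSnk.type = hbase
agent1.sinks.hbaseSnk.table = mytable
agent1.sinks.hbaseSnk.columnFamily = cf
agent1.sinks.hbaseSnk.channel = hbaseCh

agent1.sinks.solrSnk.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
agent1.sinks.solrSnk.morphlineFile = /etc/flume-ng/conf/morphline.conf
agent1.sinks.solrSnk.channel = solrCh
```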
> > I'd like to use it to gather some data, process the received messages
> > and then index them into Solr.
> > Any suggestion about how to do that with Flume?
> > I've already tested an Avro source that sends data to HBase,
> > but my use case requires those messages to be saved in HBase but also
> > processed and then indexed in Solr (obviously I also need to convert
> > their object structure).
> > I think the first part is quite simple (I just use 2 sinks: one that
> > stores in HBase and another one that forwards to another Avro
> > instance), right?
> > If messages are sent during a map/reduce job, is the Avro source the
> > best option to send the documents to index to my sink (i.e. the first
> > part of the flow, which up to now I have simulated with an Avro
> > source)?
> > Best,
> > Flavio
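[As a sketch of the ETL side of the MorphlineSolrSink mentioned above, a minimal morphline file might look like the following. The collection name, ZooKeeper address, and command package prefixes are assumptions; the exact importCommands prefix depends on the installed morphlines version, and the Solr schema must define the "message" field that readLine produces.]

```
# Hypothetical morphline: parse each Flume event body as one line of text
# (stored in the "message" field) and load the resulting record into Solr.
morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**"]
    commands : [
      { readLine { charset : UTF-8 } }
      {
        loadSolr {
          solrLocator : {
            collection : collection1
            zkHost : "localhost:2181/solr"
          }
        }
      }
    ]
  }
]
```

In a richer flow, extraction/transformation commands (e.g. grok or field-manipulation steps) would sit between readLine and loadSolr.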
