I see what you are saying about the temp table being hosted on a single region server - especially for a limited set of rows that just care about the aggregations but receive a lot of traffic. I wonder whether the same would hold if I used the source table to maintain these temporary records instead of creating a temp table on the fly ...
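If I do end up pre-creating regions for a temp table, I am picturing something like the following. A minimal, untested sketch; the table name, column family, and split points are made up for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class TempTableSetup {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        // Hypothetical temp table with one small column family.
        HTableDescriptor desc = new HTableDescriptor("agg_temp");
        desc.addFamily(new HColumnDescriptor("d"));

        // Pre-split so each event-id-type bucket starts in its own
        // region instead of everything landing on one region server.
        // Split points here are placeholders for real key prefixes.
        byte[][] splits = new byte[][] {
            Bytes.toBytes("eventTypeB"),
            Bytes.toBytes("eventTypeC"),
            Bytes.toBytes("eventTypeD")
        };
        admin.createTable(desc, splits);
    }
}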
On Fri, Sep 16, 2011 at 5:24 PM, Doug Meil <[email protected]> wrote:
>
> I'll add this to the book in the MR section.
>
>
> On 9/16/11 8:22 PM, "Doug Meil" <[email protected]> wrote:
>
>> I was in the middle of responding to Mike's email when yours arrived, so
>> I'll respond to both.
>>
>> I think the temp-table idea is interesting. The caution is that a default
>> temp-table creation will be hosted on a single RS and thus be a bottleneck
>> for aggregation. So I would imagine that you would need to tune the
>> temp-table for the job and pre-create regions.
>>
>> Doug
>>
>> On 9/16/11 8:16 PM, "Sam Seigal" <[email protected]> wrote:
>>
>>> I am trying to do something similar with HBase Map/Reduce.
>>>
>>> I have event ids and amounts stored in HBase in the following format:
>>> prefix-event_id_type-timestamp-event_id as the row key and amount as
>>> the value.
>>> I want to be able to aggregate the amounts based on the event id type,
>>> and for this I am using a reducer. I basically reduce on the event id
>>> type from the incoming row in the map phase, and perform the
>>> aggregation in the reducer on the amounts for the event types. Then I
>>> write the results back into HBase.
>>>
>>> I hadn't thought about writing values directly into a temp HBase table
>>> in the map phase, as Mike suggested.
>>>
>>> For this case, each mapper can declare its own mapperId_event_type row
>>> with totalAmount and, for each row it receives, do a get, add the
>>> current amount, and then a put. We are basically then doing a
>>> get/add/put for every row that a mapper receives. Is this any more
>>> efficient when compared to the overhead of sorting/partitioning for a
>>> reducer?
>>>
>>> At the end of the mapping phase, aggregating the output of all the
>>> mappers should be trivial.
>>>
>>> On Fri, Sep 16, 2011 at 1:24 PM, Michael Segel
>>> <[email protected]> wrote:
>>>>
>>>> Doug and company...
>>>>
>>>> Look, I'm not saying that there aren't m/r jobs where you might need
>>>> reducers when working with HBase. What I am saying is that if we look at
>>>> what you're attempting to do, you may end up getting better performance
>>>> if you created a temp table in HBase and let HBase do some of the heavy
>>>> lifting where you are currently using a reducer. From the jobs that we
>>>> run, when we looked at what we were doing, there wasn't any need for a
>>>> reducer. I suspect that it's true of other jobs.
>>>>
>>>> Remember that HBase is much more than just an HFile format to persist
>>>> stuff.
>>>>
>>>> Even looking at Sonal's example... you have other ways of doing the
>>>> record counts, like dynamic counters or a temp table in HBase, which I
>>>> believe will give you better performance numbers, although I haven't
>>>> benchmarked either against a reducer.
>>>>
>>>> Does that make sense?
>>>>
>>>> -Mike
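On Sam's get/add/put question above: a server-side increment should be cheaper than a client-side get followed by a put, since it is one atomic round trip per row. A rough map-only sketch along the lines of Mike's temp-table idea, assuming the amounts fit in a long; the temp table, column family, and qualifier names are made up:

import java.io.IOException;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;

public class IncrementMapper
        extends TableMapper<NullWritable, NullWritable> {

    private HTable tempTable;

    @Override
    protected void setup(Context context) throws IOException {
        // Hypothetical pre-split temp table from earlier in the thread.
        tempTable = new HTable(context.getConfiguration(), "agg_temp");
    }

    @Override
    protected void map(ImmutableBytesWritable row, Result value,
            Context context) throws IOException {
        // Row key layout from the thread:
        // prefix-event_id_type-timestamp-event_id
        // (assumes none of the fields themselves contain '-').
        String[] parts = Bytes.toString(row.get()).split("-");
        String eventType = parts[1];
        long amount = Bytes.toLong(value.getValue(Bytes.toBytes("d"),
                                                  Bytes.toBytes("amount")));
        // Server-side atomic add: no client-side get/add/put round trips.
        tempTable.incrementColumnValue(Bytes.toBytes(eventType),
                Bytes.toBytes("d"), Bytes.toBytes("total"), amount);
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        tempTable.close();
    }
}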
" >>>> > >>>> > To me, that sounds a like the "read-summary" example with JDBC-Oracle >>>>in >>>> > the reduce step. >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > On 9/16/11 2:58 PM, "Chris Tarnas" <[email protected]> wrote: >>>> > >>>> > >If only I could make NY in Nov :) >>>> > > >>>> > >We extract out large numbers of DNA sequence reads from HBase, run >>>>them >>>> > >through M/R pipelines to analyze and aggregate and then we load the >>>> > >results back in. Definitely specialized usage, but I could see other >>>> > >perfectly valid uses for reducers with HBase. >>>> > > >>>> > >-chris >>>> > > >>>> > >On Sep 16, 2011, at 11:43 AM, Michael Segel wrote: >>>> > > >>>> > >> >>>> > >> Sonal, >>>> > >> >>>> > >> You do realize that HBase is a "database", right? ;-) >>>> > >> >>>> > >> So again, why do you need a reducer? ;-) >>>> > >> >>>> > >> Using your example... >>>> > >> "Again, there will be many cases where one may want a reducer, say >>>> > >>trying to count the occurrence of words in a particular column." >>>> > >> >>>> > >> You can do this one of two ways... >>>> > >> 1) Dynamic Counters in Hadoop. >>>> > >> 2) Use a temp table and auto increment the value in a column which >>>> > >>contains the word count. (Fat row where rowkey is doc_id and >>>>column is >>>> > >>word or rowkey is doc_id|word) >>>> > >> >>>> > >> I'm sorry but if you go through all of your examples of why you >>>>would >>>> > >>want to use a reducer, you end up finding out that writing to an >>>>HBase >>>> > >>table would be faster than a reduce job. >>>> > >> (Again we haven't done an exhaustive search, but in all of the >>>>HBase >>>> > >>jobs we've run... no reducers were necessary.) >>>> > >> >>>> > >> The point I'm trying to make is that you want to avoid using a >>>>reducer >>>> > >>whenever possible and if you think about your problem... you can >>>> > >>probably come up with a solution that avoids the reducer... >>>> > >> >>>> > >> >>>> > >> HTH >>>> > >> >>>> > >> -Mike >>>> > >> PS. I haven't looked at *all* of the potential use cases of HBase >>>>which >>>> > >>is why I don't want to say you'll never need a reducer. I will say >>>>that >>>> > >>based on what we've done at my client's site, we try very hard to >>>>avoid >>>> > >>reducers. >>>> > >> [Note, I'm sure I'm going to get hammered on this when I head to >>>>NY in >>>> > >>Nov. :-) ] >>>> > >> >>>> > >> >>>> > >>> Date: Fri, 16 Sep 2011 23:00:49 +0530 >>>> > >>> Subject: Re: Writing MR-Job: Something like OracleReducer, >>>>JDBCReducer >>>> > >>>... >>>> > >>> From: [email protected] >>>> > >>> To: [email protected] >>>> > >>> >>>> > >>> Hi Michael, >>>> > >>> >>>> > >>> Yes, thanks, I understand the fact that reducers can be expensive >>>>with >>>> > >>>all >>>> > >>> the shuffling and the sorting, and you may not need them always. >>>>At >>>> > >>>the same >>>> > >>> time, there are many cases where reducers are useful, like >>>>secondary >>>> > >>> sorting. In many cases, one can have multiple map phases and not >>>>have a >>>> > >>> reduce phase at all. Again, there will be many cases where one >>>>may >>>> > >>>want a >>>> > >>> reducer, say trying to count the occurrence of words in a >>>>particular >>>> > >>>column. >>>> > >>> >>>> > >>> >>>> > >>> With this thought chain, I do not feel ready to say that when >>>>dealing >>>> > >>>with >>>> > >>> HBase, I really dont want to use a reducer. Please correct me if >>>>I am >>>> > >>> wrong. >>>> > >>> >>>> > >>> Thanks again. 
>>>> > >>>
>>>> > >>> On Fri, Sep 16, 2011 at 10:35 PM, Michael Segel
>>>> > >>> <[email protected]> wrote:
>>>> > >>>
>>>> > >>>> Sonal,
>>>> > >>>>
>>>> > >>>> Just because you have a m/r job doesn't mean that you need to
>>>> > >>>> reduce anything. You can have a job that contains only a mapper.
>>>> > >>>> Or your job runner can have a series of map jobs in serial.
>>>> > >>>>
>>>> > >>>> Most, if not all, of the map/reduce jobs where we pull data from
>>>> > >>>> HBase don't require a reducer.
>>>> > >>>>
>>>> > >>>> To give you a simple example... if I want to determine the table
>>>> > >>>> schema where I am storing some sort of structured data...
>>>> > >>>> I just write a m/r job which opens a table and scans it, counting
>>>> > >>>> the occurrence of each column name via dynamic counters.
>>>> > >>>>
>>>> > >>>> There is no need for a reducer.
>>>> > >>>>
>>>> > >>>> Does that help?
>>>> > >>>>
>>>> > >>>>> Date: Fri, 16 Sep 2011 21:41:01 +0530
>>>> > >>>>> Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
>>>> > >>>>> From: [email protected]
>>>> > >>>>> To: [email protected]
>>>> > >>>>>
>>>> > >>>>> Michel,
>>>> > >>>>>
>>>> > >>>>> Sorry, can you please help me understand what you mean when you
>>>> > >>>>> say that when dealing with HBase, you really don't want to use a
>>>> > >>>>> reducer? Here, HBase is being used as the input to the MR job.
>>>> > >>>>>
>>>> > >>>>> Thanks
>>>> > >>>>> Sonal
>>>> > >>>>>
>>>> > >>>>> On Fri, Sep 16, 2011 at 2:35 PM, Michel Segel
>>>> > >>>>> <[email protected]> wrote:
>>>> > >>>>>
>>>> > >>>>>> I think you need to get a little bit more information.
>>>> > >>>>>> Reducers are expensive.
>>>> > >>>>>> When Thomas says that he is aggregating data, what exactly does
>>>> > >>>>>> he mean?
>>>> > >>>>>> When dealing with HBase, you really don't want to use a reducer.
>>>> > >>>>>>
>>>> > >>>>>> You may want to run two map jobs, and it could be that just
>>>> > >>>>>> dumping the output via JDBC makes the most sense.
>>>> > >>>>>>
>>>> > >>>>>> We are starting to see a lot of questions where the OP isn't
>>>> > >>>>>> providing enough information, so the recommendation could be
>>>> > >>>>>> wrong...
>>>> > >>>>>>
>>>> > >>>>>> Sent from a remote device. Please excuse any typos...
>>>> > >>>>>>
>>>> > >>>>>> Mike Segel
>>>> > >>>>>>
>>>> > >>>>>> On Sep 16, 2011, at 2:22 AM, Sonal Goyal <[email protected]> wrote:
>>>> > >>>>>>
>>>> > >>>>>>> There is a DBOutputFormat class in the
>>>> > >>>>>>> org.apache.hadoop.mapreduce.lib.db package, you could use that.
>>>> > >>>>>>> Or you could write to HDFS and then use something like HIHO[1]
>>>> > >>>>>>> to export to the db. I have been working extensively in this
>>>> > >>>>>>> area; you can write to me directly if you need any help.
>>>> > >>>>>>>
>>>> > >>>>>>> 1. https://github.com/sonalgoyal/hiho
>>>> > >>>>>>>
>>>> > >>>>>>> Best Regards,
>>>> > >>>>>>> Sonal
>>>> > >>>>>>> Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
>>>> > >>>>>>> Nube Technologies <http://www.nubetech.co>
>>>> > >>>>>>> <http://in.linkedin.com/in/sonalgoyal>
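For the DBOutputFormat route Sonal mentions, the job wiring would look roughly like this. The driver class is the standard Oracle thin driver, but the URL, credentials, and table/field names are placeholders, and the reducer's output key class would have to implement DBWritable:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBOutputFormat;

public class OracleExportJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Driver class plus placeholder URL and credentials.
        DBConfiguration.configureDB(conf,
                "oracle.jdbc.driver.OracleDriver",
                "jdbc:oracle:thin:@dbhost:1521:orcl",
                "user", "secret");

        Job job = new Job(conf, "hbase-to-oracle");
        job.setOutputFormatClass(DBOutputFormat.class);
        // Hypothetical target table and columns; the job's output key
        // class must implement DBWritable and fill these fields.
        DBOutputFormat.setOutput(job, "event_totals",
                "event_type", "total");
        // ... set mapper/reducer and the input format
        // (e.g. TableInputFormat via TableMapReduceUtil) here.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}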
>>>> > >>>>>>>
>>>> > >>>>>>> On Fri, Sep 16, 2011 at 10:55 AM, Steinmaurer Thomas <
>>>> > >>>>>>> [email protected]> wrote:
>>>> > >>>>>>>
>>>> > >>>>>>>> Hello,
>>>> > >>>>>>>>
>>>> > >>>>>>>> we are writing an MR job to process HBase data and store
>>>> > >>>>>>>> aggregated data in Oracle. How would you do that in an MR job?
>>>> > >>>>>>>>
>>>> > >>>>>>>> Currently, for test purposes, we write the result into an
>>>> > >>>>>>>> HBase table again by using a TableReducer. Is there something
>>>> > >>>>>>>> like an OracleReducer, RelationalReducer, JDBCReducer or
>>>> > >>>>>>>> whatever? Or should one simply use plain JDBC code in the
>>>> > >>>>>>>> reduce step?
>>>> > >>>>>>>>
>>>> > >>>>>>>> Thanks!
>>>> > >>>>>>>>
>>>> > >>>>>>>> Thomas
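And for Thomas's original question: there is no OracleReducer out of the box, so plain JDBC in the reduce step (Doug's "read-summary" shape) seems to be the usual answer. A minimal sketch, with placeholder connection details and a hypothetical target table:

import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class JdbcSummaryReducer
        extends Reducer<Text, LongWritable, NullWritable, NullWritable> {

    private Connection conn;
    private PreparedStatement stmt;

    @Override
    protected void setup(Context context) throws IOException {
        try {
            // Placeholder connection details; the Oracle driver jar
            // has to be on the task classpath.
            Class.forName("oracle.jdbc.driver.OracleDriver");
            conn = DriverManager.getConnection(
                    "jdbc:oracle:thin:@dbhost:1521:orcl", "user", "secret");
            stmt = conn.prepareStatement(
                    "INSERT INTO event_totals (event_type, total) VALUES (?, ?)");
        } catch (ClassNotFoundException | SQLException e) {
            throw new IOException(e);
        }
    }

    @Override
    protected void reduce(Text eventType, Iterable<LongWritable> amounts,
            Context context) throws IOException {
        // Sum the per-event-type amounts, then write one row to Oracle.
        // For real volumes you would batch the inserts.
        long total = 0;
        for (LongWritable amount : amounts) {
            total += amount.get();
        }
        try {
            stmt.setString(1, eventType.toString());
            stmt.setLong(2, total);
            stmt.executeUpdate();
        } catch (SQLException e) {
            throw new IOException(e);
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        try {
            stmt.close();
            conn.close();
        } catch (SQLException e) {
            throw new IOException(e);
        }
    }
}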
