I was in the middle of responding to Mike's email when yours arrived, so I'll respond to both.
I think the temp-table idea is interesting. The caution is that a
temp table created with default settings will start out as a single
region on one RS and thus be a bottleneck for aggregation. So I would
imagine that you would need to tune the temp table for the job and
pre-create regions.

Doug

On 9/16/11 8:16 PM, "Sam Seigal" <[email protected]> wrote:

>I am trying to do something similar with HBase Map/Reduce.
>
>I have event ids and amounts stored in HBase in the following format:
>prefix-event_id_type-timestamp-event_id as the row key and amount as
>the value.
>
>I want to be able to aggregate the amounts based on the event id type,
>and for this I am using a reducer. I basically key on the
>event id type from the incoming row in the map phase, and perform the
>aggregation in the reducer on the amounts for the event types. Then I
>write the results back into HBase.
>
>I hadn't thought about writing values directly into a temp HBase table
>in the map phase, as suggested by Mike.
>
>For this case, each mapper can declare its own mapperId_event_type row
>with totalAmount and, for each row it receives, do a get, add the
>current amount, and then a put. We are basically then doing a
>get/add/put for every row that a mapper receives. Is this any more
>efficient when compared to the overhead of sorting/partitioning for a
>reducer?
>
>At the end of the mapping phase, aggregating the output of all the
>mappers should be trivial.
>
>On Fri, Sep 16, 2011 at 1:24 PM, Michael Segel
><[email protected]> wrote:
>>
>> Doug and company...
>>
>> Look, I'm not saying that there aren't m/r jobs where you might need
>> reducers when working with HBase. What I am saying is that if we look
>> at what you're attempting to do, you may end up getting better
>> performance if you created a temp table in HBase and let HBase do some
>> of the heavy lifting where you are currently using a reducer. From the
>> jobs that we run, when we looked at what we were doing, there wasn't
>> any need for a reducer.
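[A note on Sam's get/add/put question above: one common way to avoid a
round trip per input row is in-mapper combining, i.e. keep the running
totals in a HashMap inside the mapper and write each distinct total out
once, e.g. from cleanup(), with a put or HBase's atomic
incrementColumnValue(). The accumulation itself is plain Java; the class
and method names below are illustrative, not from the thread:]

```java
import java.util.HashMap;
import java.util.Map;

// In-mapper combining: accumulate per-event-type totals in memory and
// flush once per mapper, instead of one HBase get/add/put per input row.
// Inside a real Mapper this map would be a field, add() would be called
// from map(), and the snapshot would be written out in cleanup().
public class EventTotals {
    private final Map<String, Long> totals = new HashMap<String, Long>();

    // Called once per input row.
    public void add(String eventType, long amount) {
        Long current = totals.get(eventType);
        totals.put(eventType, current == null ? amount : current + amount);
    }

    // Called from cleanup(): one write per distinct event type.
    public Map<String, Long> snapshot() {
        return new HashMap<String, Long>(totals);
    }
}
```

[This trades mapper memory for I/O: the map can only grow as large as the
number of distinct event types seen by one mapper, which is exactly the
case where a shuffle/sort feels wasteful.]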
>> I suspect that it's true of other jobs.
>>
>> Remember that HBase is much more than just an HFile format to persist
>> stuff.
>>
>> Even looking at Sonal's example... you have other ways of doing the
>> record counts, like dynamic counters or a temp table in HBase, which
>> I believe will give you better performance numbers, although I haven't
>> benchmarked either against a reducer.
>>
>> Does that make sense?
>>
>> -Mike
>>
>>
>> > From: [email protected]
>> > To: [email protected]
>> > Date: Fri, 16 Sep 2011 15:41:44 -0400
>> > Subject: Re: Writing MR-Job: Something like OracleReducer,
>> > JDBCReducer ...
>> >
>> >
>> > Chris, agreed... There are times when reducers aren't required, and
>> > then situations where they are useful. We have both kinds of jobs.
>> >
>> > For others following the thread, I updated the book recently with
>> > more MR examples (read-only, read-write, read-summary):
>> >
>> > http://hbase.apache.org/book.html#mapreduce.example
>> >
>> >
>> > As to the question that started this thread...
>> >
>> > re: "Store aggregated data in Oracle."
>> >
>> > To me, that sounds like the "read-summary" example with JDBC-Oracle
>> > in the reduce step.
>> >
>> >
>> > On 9/16/11 2:58 PM, "Chris Tarnas" <[email protected]> wrote:
>> >
>> > >If only I could make NY in Nov :)
>> > >
>> > >We extract large numbers of DNA sequence reads from HBase, run them
>> > >through M/R pipelines to analyze and aggregate, and then we load the
>> > >results back in. Definitely specialized usage, but I could see other
>> > >perfectly valid uses for reducers with HBase.
>> > >
>> > >-chris
>> > >
>> > >On Sep 16, 2011, at 11:43 AM, Michael Segel wrote:
>> > >
>> > >>
>> > >> Sonal,
>> > >>
>> > >> You do realize that HBase is a "database", right? ;-)
>> > >>
>> > >> So again, why do you need a reducer? ;-)
>> > >>
>> > >> Using your example...
>> > >> "Again, there will be many cases where one may want a reducer,
>> > >> say trying to count the occurrence of words in a particular
>> > >> column."
>> > >>
>> > >> You can do this one of two ways:
>> > >> 1) Dynamic counters in Hadoop.
>> > >> 2) Use a temp table and auto-increment the value in a column
>> > >> which contains the word count. (Fat row where the rowkey is
>> > >> doc_id and the column is word, or the rowkey is doc_id|word.)
>> > >>
>> > >> I'm sorry, but if you go through all of your examples of why you
>> > >> would want to use a reducer, you end up finding that writing to
>> > >> an HBase table would be faster than a reduce job.
>> > >> (Again, we haven't done an exhaustive search, but in all of the
>> > >> HBase jobs we've run... no reducers were necessary.)
>> > >>
>> > >> The point I'm trying to make is that you want to avoid using a
>> > >> reducer whenever possible, and if you think about your problem,
>> > >> you can probably come up with a solution that avoids the reducer.
>> > >>
>> > >> HTH
>> > >>
>> > >> -Mike
>> > >> PS. I haven't looked at *all* of the potential use cases of
>> > >> HBase, which is why I don't want to say you'll never need a
>> > >> reducer. I will say that based on what we've done at my client's
>> > >> site, we try very hard to avoid reducers.
>> > >> [Note, I'm sure I'm going to get hammered on this when I head to
>> > >> NY in Nov. :-) ]
>> > >>
>> > >>
>> > >>> Date: Fri, 16 Sep 2011 23:00:49 +0530
>> > >>> Subject: Re: Writing MR-Job: Something like OracleReducer,
>> > >>> JDBCReducer ...
>> > >>> From: [email protected]
>> > >>> To: [email protected]
>> > >>>
>> > >>> Hi Michael,
>> > >>>
>> > >>> Yes, thanks, I understand the fact that reducers can be
>> > >>> expensive, with all the shuffling and the sorting, and you may
>> > >>> not need them always. At the same time, there are many cases
>> > >>> where reducers are useful, like secondary sorting.
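[Mike's first option, dynamic counters, looks roughly like the sketch
below: a map-only job over a table where each word seen bumps a counter
named after the word, and the framework aggregates counters across all
mappers. The column family/qualifier ("contents:text") and counter group
("words") are made-up names. One caveat worth knowing: the framework
caps the number of distinct counters per job, so this only suits a
bounded vocabulary, which is also why it fits Mike's schema-discovery
example further down.]

```java
import java.io.IOException;

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;

// Map-only word counting via dynamic counters: no reducer, no shuffle.
// The mapper emits nothing; all results live in the job's counters.
public class WordCountMapper
        extends TableMapper<NullWritable, NullWritable> {

    private static final byte[] FAMILY = Bytes.toBytes("contents");
    private static final byte[] QUALIFIER = Bytes.toBytes("text");

    @Override
    protected void map(ImmutableBytesWritable row, Result value,
                       Context context)
            throws IOException, InterruptedException {
        byte[] cell = value.getValue(FAMILY, QUALIFIER);
        if (cell == null) {
            return;
        }
        for (String word : Bytes.toString(cell).split("\\s+")) {
            // Dynamic counter: the counter name is the word itself.
            context.getCounter("words", word).increment(1);
        }
    }
}
```

[Wiring it up would use TableMapReduceUtil.initTableMapperJob(...) with
job.setNumReduceTasks(0) and a NullOutputFormat; after
waitForCompletion() the totals are read back via job.getCounters().]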
>> > >>> In many cases, one can have multiple map phases and not have a
>> > >>> reduce phase at all. Again, there will be many cases where one
>> > >>> may want a reducer, say trying to count the occurrence of words
>> > >>> in a particular column.
>> > >>>
>> > >>> With this thought chain, I do not feel ready to say that when
>> > >>> dealing with HBase, I really don't want to use a reducer.
>> > >>> Please correct me if I am wrong.
>> > >>>
>> > >>> Thanks again.
>> > >>>
>> > >>> Best Regards,
>> > >>> Sonal
>> > >>> Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
>> > >>> Nube Technologies <http://www.nubetech.co>
>> > >>>
>> > >>> <http://in.linkedin.com/in/sonalgoyal>
>> > >>>
>> > >>>
>> > >>> On Fri, Sep 16, 2011 at 10:35 PM, Michael Segel
>> > >>> <[email protected]> wrote:
>> > >>>
>> > >>>>
>> > >>>> Sonal,
>> > >>>>
>> > >>>> Just because you have a m/r job doesn't mean that you need to
>> > >>>> reduce anything. You can have a job that contains only a
>> > >>>> mapper. Or your job runner can have a series of map jobs in
>> > >>>> serial.
>> > >>>>
>> > >>>> Most if not all of the map/reduce jobs where we pull data from
>> > >>>> HBase don't require a reducer.
>> > >>>>
>> > >>>> To give you a simple example... if I want to determine the
>> > >>>> table schema where I am storing some sort of structured data,
>> > >>>> I just write a m/r job which opens a table and scans it,
>> > >>>> counting the occurrence of each column name via dynamic
>> > >>>> counters.
>> > >>>>
>> > >>>> There is no need for a reducer.
>> > >>>>
>> > >>>> Does that help?
>> > >>>>
>> > >>>>
>> > >>>>> Date: Fri, 16 Sep 2011 21:41:01 +0530
>> > >>>>> Subject: Re: Writing MR-Job: Something like OracleReducer,
>> > >>>>> JDBCReducer ...
>> > >>>>> From: [email protected]
>> > >>>>> To: [email protected]
>> > >>>>>
>> > >>>>> Michel,
>> > >>>>>
>> > >>>>> Sorry, can you please help me understand what you mean when
>> > >>>>> you say that when dealing with HBase, you really don't want
>> > >>>>> to use a reducer? Here, HBase is being used as the input to
>> > >>>>> the MR job.
>> > >>>>>
>> > >>>>> Thanks
>> > >>>>> Sonal
>> > >>>>>
>> > >>>>>
>> > >>>>> On Fri, Sep 16, 2011 at 2:35 PM, Michel Segel
>> > >>>>> <[email protected]> wrote:
>> > >>>>>
>> > >>>>>> I think you need to get a little bit more information.
>> > >>>>>> Reducers are expensive.
>> > >>>>>> When Thomas says that he is aggregating data, what exactly
>> > >>>>>> does he mean?
>> > >>>>>> When dealing with HBase, you really don't want to use a
>> > >>>>>> reducer.
>> > >>>>>>
>> > >>>>>> You may want to run two map jobs, and it could be that just
>> > >>>>>> dumping the output via JDBC makes the most sense.
>> > >>>>>>
>> > >>>>>> We are starting to see a lot of questions where the OP isn't
>> > >>>>>> providing enough information, so the recommendation could be
>> > >>>>>> wrong...
>> > >>>>>>
>> > >>>>>>
>> > >>>>>> Sent from a remote device. Please excuse any typos...
>> > >>>>>>
>> > >>>>>> Mike Segel
>> > >>>>>>
>> > >>>>>> On Sep 16, 2011, at 2:22 AM, Sonal Goyal
>> > >>>>>> <[email protected]> wrote:
>> > >>>>>>
>> > >>>>>>> There is a DBOutputFormat class in the
>> > >>>>>>> org.apache.hadoop.mapreduce.lib.db package; you could use
>> > >>>>>>> that. Or you could write to HDFS and then use something
>> > >>>>>>> like HIHO[1] to export to the db. I have been working
>> > >>>>>>> extensively in this area; you can write to me directly if
>> > >>>>>>> you need any help.
>> > >>>>>>>
>> > >>>>>>> 1.
>> > >>>>>>> https://github.com/sonalgoyal/hiho
>> > >>>>>>>
>> > >>>>>>> Best Regards,
>> > >>>>>>> Sonal
>> > >>>>>>> Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
>> > >>>>>>> Nube Technologies <http://www.nubetech.co>
>> > >>>>>>>
>> > >>>>>>> <http://in.linkedin.com/in/sonalgoyal>
>> > >>>>>>>
>> > >>>>>>>
>> > >>>>>>> On Fri, Sep 16, 2011 at 10:55 AM, Steinmaurer Thomas <
>> > >>>>>>> [email protected]> wrote:
>> > >>>>>>>
>> > >>>>>>>> Hello,
>> > >>>>>>>>
>> > >>>>>>>> We are writing a MR job to process HBase data and store
>> > >>>>>>>> the aggregated data in Oracle. How would you do that in a
>> > >>>>>>>> MR job?
>> > >>>>>>>>
>> > >>>>>>>> Currently, for test purposes, we write the result into an
>> > >>>>>>>> HBase table again by using a TableReducer. Is there
>> > >>>>>>>> something like an OracleReducer, RelationalReducer,
>> > >>>>>>>> JDBCReducer or whatever? Or should one simply use plain
>> > >>>>>>>> JDBC code in the reduce step?
>> > >>>>>>>>
>> > >>>>>>>> Thanks!
>> > >>>>>>>>
>> > >>>>>>>> Thomas
>> > >>>>>>>>
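[For Thomas's original question: there is no OracleReducer as such; the
stock approach Sonal points at is DBOutputFormat plus a value class
implementing DBWritable. A sketch under assumptions: the JDBC URL,
credentials, table name "event_totals", and columns "event_type"/"total"
are placeholders, not anything from the thread:]

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBOutputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

// One aggregated row bound for Oracle. DBOutputFormat builds an INSERT
// statement from the field names passed to setOutput() and calls
// write(PreparedStatement) on each record emitted as the output key.
public class EventTotal implements Writable, DBWritable {
    private String eventType;
    private long total;

    public EventTotal() { }

    public EventTotal(String eventType, long total) {
        this.eventType = eventType;
        this.total = total;
    }

    // DBWritable: bind values for
    // INSERT INTO event_totals (event_type, total) VALUES (?, ?)
    public void write(PreparedStatement stmt) throws SQLException {
        stmt.setString(1, eventType);
        stmt.setLong(2, total);
    }

    public void readFields(ResultSet rs) throws SQLException {
        eventType = rs.getString(1);
        total = rs.getLong(2);
    }

    // Writable: needed if the record travels through the shuffle.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(eventType);
        out.writeLong(total);
    }

    public void readFields(DataInput in) throws IOException {
        eventType = in.readUTF();
        total = in.readLong();
    }

    // Job wiring: point the job's output at Oracle instead of HBase.
    public static void configureOutput(Job job) throws IOException {
        DBConfiguration.configureDB(job.getConfiguration(),
                "oracle.jdbc.driver.OracleDriver",
                "jdbc:oracle:thin:@dbhost:1521:ORCL",  // placeholder URL
                "dbuser", "dbpass");                   // placeholder creds
        DBOutputFormat.setOutput(job, "event_totals",  // placeholder table
                "event_type", "total");
    }
}
```

[The reducer would then emit EventTotal as the output key with a
NullWritable value, since DBOutputFormat writes keys. The alternative
Thomas mentions, plain JDBC in the reduce step, also works but leaves
connection pooling and batching to you.]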
