Chris, agreed... There are situations where reducers aren't required, and others where they are useful. We have both kinds of jobs.
For others following the thread, I updated the book recently with more MR examples (read-only, read-write, read-summary):
http://hbase.apache.org/book.html#mapreduce.example

As to the question that started this thread... re: "Store aggregated data in Oracle." To me, that sounds like the "read-summary" example with JDBC-Oracle in the reduce step. A rough sketch of what that could look like is at the end of this message.

On 9/16/11 2:58 PM, "Chris Tarnas" <[email protected]> wrote:

> If only I could make NY in Nov :)
>
> We extract large numbers of DNA sequence reads from HBase, run them
> through M/R pipelines to analyze and aggregate, and then we load the
> results back in. Definitely specialized usage, but I could see other
> perfectly valid uses for reducers with HBase.
>
> -chris
>
> On Sep 16, 2011, at 11:43 AM, Michael Segel wrote:
>
>> Sonal,
>>
>> You do realize that HBase is a "database", right? ;-)
>>
>> So again, why do you need a reducer? ;-)
>>
>> Using your example...
>> "Again, there will be many cases where one may want a reducer, say
>> trying to count the occurrence of words in a particular column."
>>
>> You can do this one of two ways...
>> 1) Dynamic counters in Hadoop.
>> 2) Use a temp table and auto-increment the value in a column which
>> contains the word count. (Fat row where the rowkey is doc_id and the
>> column is word, or the rowkey is doc_id|word.)
>>
>> I'm sorry, but if you go through all of your examples of why you would
>> want to use a reducer, you end up finding out that writing to an HBase
>> table would be faster than a reduce job.
>> (Again, we haven't done an exhaustive search, but in all of the HBase
>> jobs we've run... no reducers were necessary.)
>>
>> The point I'm trying to make is that you want to avoid using a reducer
>> whenever possible, and if you think about your problem... you can
>> probably come up with a solution that avoids the reducer...
>>
>> HTH
>>
>> -Mike
>> PS. I haven't looked at *all* of the potential use cases of HBase, which
>> is why I don't want to say you'll never need a reducer. I will say that,
>> based on what we've done at my client's site, we try very hard to avoid
>> reducers.
>> [Note, I'm sure I'm going to get hammered on this when I head to NY in
>> Nov. :-) ]
>>
>>> Date: Fri, 16 Sep 2011 23:00:49 +0530
>>> Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
>>> From: [email protected]
>>> To: [email protected]
>>>
>>> Hi Michael,
>>>
>>> Yes, thanks, I understand the fact that reducers can be expensive with
>>> all the shuffling and the sorting, and you may not need them always. At
>>> the same time, there are many cases where reducers are useful, like
>>> secondary sorting. In many cases, one can have multiple map phases and
>>> not have a reduce phase at all. Again, there will be many cases where
>>> one may want a reducer, say trying to count the occurrence of words in
>>> a particular column.
>>>
>>> With this thought chain, I do not feel ready to say that when dealing
>>> with HBase, I really don't want to use a reducer. Please correct me if
>>> I am wrong.
>>>
>>> Thanks again.
>>>
>>> Best Regards,
>>> Sonal
>>> Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
>>> Nube Technologies <http://www.nubetech.co>
>>> <http://in.linkedin.com/in/sonalgoyal>
>>>
>>> On Fri, Sep 16, 2011 at 10:35 PM, Michael Segel
>>> <[email protected]> wrote:
>>>
>>>> Sonal,
>>>>
>>>> Just because you have a m/r job doesn't mean that you need to reduce
>>>> anything. You can have a job that contains only a mapper.
>>>> Or your job runner can have a series of map jobs in serial.
>>>>
>>>> Most, if not all, of the map/reduce jobs where we pull data from HBase
>>>> don't require a reducer.
>>>>
>>>> To give you a simple example... if I want to determine the table schema
>>>> where I am storing some sort of structured data...
>>>> I just write a m/r job which opens a table and scans it, counting the
>>>> occurrence of each column name via dynamic counters.
>>>>
>>>> There is no need for a reducer.
>>>>
>>>> Does that help?
>>>>
>>>>> Date: Fri, 16 Sep 2011 21:41:01 +0530
>>>>> Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
>>>>> From: [email protected]
>>>>> To: [email protected]
>>>>>
>>>>> Michel,
>>>>>
>>>>> Sorry, can you please help me understand what you mean when you say
>>>>> that when dealing with HBase, you really don't want to use a reducer?
>>>>> Here, HBase is being used as the input to the MR job.
>>>>>
>>>>> Thanks
>>>>> Sonal
>>>>>
>>>>> On Fri, Sep 16, 2011 at 2:35 PM, Michel Segel
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> I think you need to get a little bit more information.
>>>>>> Reducers are expensive.
>>>>>> When Thomas says that he is aggregating data, what exactly does he
>>>>>> mean?
>>>>>> When dealing w/ HBase, you really don't want to use a reducer.
>>>>>>
>>>>>> You may want to run two map jobs, and it could be that just dumping
>>>>>> the output via JDBC makes the most sense.
>>>>>>
>>>>>> We are starting to see a lot of questions where the OP isn't
>>>>>> providing enough information, so the recommendation could be wrong...
>>>>>>
>>>>>> Sent from a remote device. Please excuse any typos...
>>>>>>
>>>>>> Mike Segel
>>>>>>
>>>>>> On Sep 16, 2011, at 2:22 AM, Sonal Goyal <[email protected]> wrote:
>>>>>>
>>>>>>> There is a DBOutputFormat class in the
>>>>>>> org.apache.hadoop.mapreduce.lib.db package, you could use that. Or
>>>>>>> you could write to HDFS and then use something like HIHO[1] to
>>>>>>> export to the db. I have been working extensively in this area,
>>>>>>> you can write to me directly if you need any help.
>>>>>>>
>>>>>>> 1. https://github.com/sonalgoyal/hiho
>>>>>>>
>>>>>>> Best Regards,
>>>>>>> Sonal
>>>>>>> Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
>>>>>>> Nube Technologies <http://www.nubetech.co>
>>>>>>> <http://in.linkedin.com/in/sonalgoyal>
>>>>>>>
>>>>>>> On Fri, Sep 16, 2011 at 10:55 AM, Steinmaurer Thomas <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> we are writing a MR-Job to process HBase data and store aggregated
>>>>>>>> data in Oracle. How would you do that in a MR-job?
>>>>>>>>
>>>>>>>> Currently, for test purposes, we write the result into an HBase
>>>>>>>> table again by using a TableReducer. Is there something like an
>>>>>>>> OracleReducer, RelationalReducer, JDBCReducer or whatever? Or
>>>>>>>> should one simply use plain JDBC code in the reduce step?
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>> Thomas
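
To make the "read-summary with JDBC-Oracle in the reduce step" answer to Thomas's original question concrete, here is a minimal sketch. It is not code from anyone on this thread: the HBase table "mytable", the column "cf:word", the Oracle table "word_summary", and the JDBC URL/credentials are placeholders, and the Oracle JDBC driver jar is assumed to be on the task classpath (e.g. shipped with the job via -libjars). Sonal's DBOutputFormat (org.apache.hadoop.mapreduce.lib.db) would be the packaged alternative to the hand-rolled JDBC shown here.

import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class HBaseToOracleSummary {

  // Emits (cell value, 1) for every row that has the column we care about.
  static class SummaryMapper extends TableMapper<Text, LongWritable> {
    private static final byte[] CF = Bytes.toBytes("cf");     // placeholder family
    private static final byte[] COL = Bytes.toBytes("word");  // placeholder qualifier
    private static final LongWritable ONE = new LongWritable(1);

    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
        throws IOException, InterruptedException {
      byte[] cell = value.getValue(CF, COL);
      if (cell != null) {
        context.write(new Text(Bytes.toString(cell)), ONE);  // emit (word, 1)
      }
    }
  }

  // Sums the counts per key and writes each summary row to Oracle via plain JDBC.
  static class OracleSummaryReducer
      extends Reducer<Text, LongWritable, Text, LongWritable> {
    private Connection conn;
    private PreparedStatement insert;

    @Override
    protected void setup(Context context) throws IOException {
      try {
        // Placeholder URL/credentials; in practice read them from the job Configuration.
        conn = DriverManager.getConnection(
            "jdbc:oracle:thin:@dbhost:1521:ORCL", "user", "password");
        conn.setAutoCommit(false);
        insert = conn.prepareStatement(
            "INSERT INTO word_summary (word, cnt) VALUES (?, ?)");
      } catch (SQLException e) {
        throw new IOException("Could not open JDBC connection", e);
      }
    }

    @Override
    protected void reduce(Text word, Iterable<LongWritable> counts, Context context)
        throws IOException {
      long total = 0;
      for (LongWritable c : counts) total += c.get();
      try {
        insert.setString(1, word.toString());
        insert.setLong(2, total);
        insert.addBatch();   // batch the rows rather than one round trip per INSERT
      } catch (SQLException e) {
        throw new IOException(e);
      }
    }

    @Override
    protected void cleanup(Context context) throws IOException {
      try {
        insert.executeBatch();
        conn.commit();
        conn.close();
      } catch (SQLException e) {
        throw new IOException(e);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "hbase-to-oracle-summary");
    job.setJarByClass(HBaseToOracleSummary.class);
    Scan scan = new Scan();
    scan.setCaching(500);        // bigger scanner caching for a full-table scan
    scan.setCacheBlocks(false);  // don't churn the block cache
    TableMapReduceUtil.initTableMapperJob("mytable", scan,
        SummaryMapper.class, Text.class, LongWritable.class, job);
    job.setReducerClass(OracleSummaryReducer.class);
    job.setNumReduceTasks(1);    // one reducer keeps the Oracle connection count low
    job.setOutputFormatClass(NullOutputFormat.class);  // nothing to write to HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

A single reducer keeps the number of Oracle connections down; if the summary is large you would raise the reducer count and let each task hold its own connection and batch.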
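And, for the no-reducer side of the discussion, a minimal sketch of the map-only pattern Mike describes: scan a table and count the occurrence of each column name with dynamic counters, no reduce phase at all. Again the table name "mytable" and the class names are only illustrative; the counts show up in the job's counter output when it completes. Note that Hadoop caps the number of counters per job, so this only works for a modest number of distinct column names.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class ColumnCounter {

  static class ColumnCountMapper
      extends TableMapper<NullWritable, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable row, Result result, Context context) {
      for (KeyValue kv : result.raw()) {
        String column = Bytes.toString(kv.getFamily()) + ":"
            + Bytes.toString(kv.getQualifier());
        // One dynamic counter per distinct family:qualifier seen in the scan.
        context.getCounter("columns", column).increment(1);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "column-counter");
    job.setJarByClass(ColumnCounter.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // larger scanner caching for a full-table scan
    scan.setCacheBlocks(false);  // don't pollute the block cache

    TableMapReduceUtil.initTableMapperJob("mytable", scan,
        ColumnCountMapper.class, NullWritable.class, NullWritable.class, job);

    job.setNumReduceTasks(0);                          // map-only, no reducer
    job.setOutputFormatClass(NullOutputFormat.class);  // no file output either

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}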
