Aren't there memory considerations with this approach? I would assume the
HashMap can get pretty big if it retains in memory every record that passes
through... (Apologies if I am being ignorant with my limited knowledge of
Hadoop's internal workings.)
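For what it's worth, in the scheme Doug describes below, the HashMap holds
one entry per distinct aggregate key (key = aggregate, value = count) rather
than one per record, and it can also be flushed early if the key space
itself is large. Here is a rough sketch of that pattern; the "summary" table
name, the "f:count" column, and the flush threshold are made up for
illustration, and it uses incrementColumnValue in place of the checkAndPut
pass for brevity:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;

public class AggregatingMapper extends TableMapper<NullWritable, NullWritable> {

  private static final int FLUSH_THRESHOLD = 10000; // safety valve, tune to taste
  private static final byte[] CF = Bytes.toBytes("f");
  private static final byte[] COUNT = Bytes.toBytes("count");

  private final Map<String, Long> counts = new HashMap<String, Long>();
  private HTable summaryTable; // assumed pre-split summary table

  @Override
  protected void setup(Context context) throws IOException {
    summaryTable = new HTable(context.getConfiguration(), "summary");
  }

  @Override
  protected void map(ImmutableBytesWritable row, Result value, Context context)
      throws IOException {
    String aggKey = extractAggregateKey(row.get());
    Long current = counts.get(aggKey);
    counts.put(aggKey, current == null ? 1L : current + 1L);
    if (counts.size() >= FLUSH_THRESHOLD) {
      flush(); // bounds memory even if the aggregate key space is large
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    flush(); // the single pass Doug mentions: one write batch per map task
    summaryTable.close();
  }

  private void flush() throws IOException {
    for (Map.Entry<String, Long> e : counts.entrySet()) {
      // atomic server-side add, standing in for the checkAndPut pass
      summaryTable.incrementColumnValue(Bytes.toBytes(e.getKey()), CF, COUNT,
          e.getValue());
    }
    counts.clear();
  }

  private String extractAggregateKey(byte[] rowKey) {
    // application-specific: pull the aggregate dimension out of the row key
    return Bytes.toString(rowKey);
  }
}

The cleanup() flush is the single pass Doug mentions; the threshold flush is
only a safety valve for high-cardinality aggregates.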
On Fri, Sep 16, 2011 at 6:14 PM, Doug Meil <[email protected]> wrote:
>
> However, if the aggregations in the mapper were kept in a HashMap (key
> being the aggregate, value being the count), and then the mapper made a
> single pass over this map during the cleanup method and then did the
> checkAndPuts, it would mean that the writes would only happen once per
> map-task, rather than on a per-row basis (which would be really
> expensive).
>
> A single region on a single RS could handle that no problem.
>
> On 9/16/11 9:00 PM, "Sam Seigal" <[email protected]> wrote:
>
>> I see what you are saying about the temp table being hosted on a
>> single region server - especially for a limited set of rows that just
>> care about the aggregations, but receive a lot of traffic. I wonder if
>> this will also be the case if I were to use the source table to
>> maintain these temporary records, and not create a temp table on the
>> fly...
>>
>> On Fri, Sep 16, 2011 at 5:24 PM, Doug Meil <[email protected]> wrote:
>>>
>>> I'll add this to the book in the MR section.
>>>
>>> On 9/16/11 8:22 PM, "Doug Meil" <[email protected]> wrote:
>>>
>>>> I was in the middle of responding to Mike's email when yours
>>>> arrived, so I'll respond to both.
>>>>
>>>> I think the temp-table idea is interesting. The caution is that a
>>>> default temp-table creation will be hosted on a single RS and thus
>>>> be a bottleneck for aggregation. So I would imagine that you would
>>>> need to tune the temp-table for the job and pre-create regions.
>>>>
>>>> Doug
>>>>
>>>> On 9/16/11 8:16 PM, "Sam Seigal" <[email protected]> wrote:
>>>>
>>>>> I am trying to do something similar with HBase Map/Reduce.
>>>>>
>>>>> I have event ids and amounts stored in HBase in the following
>>>>> format: prefix-event_id_type-timestamp-event_id as the row key, and
>>>>> amount as the value. I want to be able to aggregate the amounts
>>>>> based on the event id type, and for this I am using a reducer. I
>>>>> basically key on the event_id_type from the incoming row in the map
>>>>> phase, and perform the aggregation in the reducer on the amounts
>>>>> for the event types. Then I write the results back into HBase.
>>>>>
>>>>> I hadn't thought about writing values directly into a temp HBase
>>>>> table in the map phase, as suggested by Mike.
>>>>>
>>>>> For this case, each mapper can declare its own mapperId_event_type
>>>>> row with totalAmount and, for each row it receives, do a get, add
>>>>> the current amount, and then a put. We are basically then doing a
>>>>> get/add/put for every row that a mapper receives. Is this any more
>>>>> efficient when compared to the overhead of sorting/partitioning for
>>>>> a reducer?
>>>>>
>>>>> At the end of the mapping phase, aggregating the output of all the
>>>>> mappers should be trivial.
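A rough sketch of the reducer version Sam describes, following the shape of
the read-summary example in the book; the row-key split and the "f:amount"
and "f:total" columns are assumptions based on his description:

import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

// Map: row key is prefix-event_id_type-timestamp-event_id;
// emit (event_id_type, amount).
class EventTypeMapper extends TableMapper<Text, LongWritable> {

  private static final byte[] CF = Bytes.toBytes("f");          // assumed family
  private static final byte[] AMOUNT = Bytes.toBytes("amount"); // assumed qualifier

  @Override
  protected void map(ImmutableBytesWritable row, Result value, Context context)
      throws IOException, InterruptedException {
    // assumes none of the key components themselves contain '-'
    String[] parts = Bytes.toString(row.get()).split("-");
    String eventIdType = parts[1];
    long amount = Bytes.toLong(value.getValue(CF, AMOUNT));
    context.write(new Text(eventIdType), new LongWritable(amount));
  }
}

// Reduce: sum the amounts per event type and write the total back to HBase.
class EventTypeReducer
    extends TableReducer<Text, LongWritable, ImmutableBytesWritable> {

  @Override
  protected void reduce(Text eventIdType, Iterable<LongWritable> amounts,
      Context context) throws IOException, InterruptedException {
    long total = 0;
    for (LongWritable amount : amounts) {
      total += amount.get();
    }
    Put put = new Put(Bytes.toBytes(eventIdType.toString()));
    put.add(Bytes.toBytes("f"), Bytes.toBytes("total"), Bytes.toBytes(total));
    context.write(null, put); // TableOutputFormat ignores the key
  }
}

The map-side get/add/put alternative trades the shuffle and sort for a
read-modify-write round trip per input row, which is exactly the cost
Doug's batch-in-the-mapper variant avoids by writing once per map task.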
>>>>> On Fri, Sep 16, 2011 at 1:24 PM, Michael Segel <[email protected]> wrote:
>>>>>
>>>>>> Doug and company...
>>>>>>
>>>>>> Look, I'm not saying that there aren't m/r jobs where you might
>>>>>> need reducers when working with HBase. What I am saying is that if
>>>>>> we look at what you're attempting to do, you may end up getting
>>>>>> better performance if you created a temp table in HBase and let
>>>>>> HBase do some of the heavy lifting where you are currently using a
>>>>>> reducer. From the jobs that we run, when we looked at what we were
>>>>>> doing, there wasn't any need for a reducer. I suspect that it's
>>>>>> true of other jobs.
>>>>>>
>>>>>> Remember that HBase is much more than just an HFile format to
>>>>>> persist stuff.
>>>>>>
>>>>>> Even looking at Sonal's example... you have other ways of doing
>>>>>> the record counts, like dynamic counters or using a temp table in
>>>>>> HBase, which I believe will give you better performance numbers,
>>>>>> although I haven't benchmarked either against a reducer.
>>>>>>
>>>>>> Does that make sense?
>>>>>>
>>>>>> -Mike
>>>>>>
>>>>>>> From: [email protected]
>>>>>>> To: [email protected]
>>>>>>> Date: Fri, 16 Sep 2011 15:41:44 -0400
>>>>>>> Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
>>>>>>>
>>>>>>> Chris, agreed... there are some cases where reducers aren't
>>>>>>> required, and other situations where they are useful. We have
>>>>>>> both kinds of jobs.
>>>>>>>
>>>>>>> For others following the thread, I updated the book recently with
>>>>>>> more MR examples (read-only, read-write, read-summary):
>>>>>>>
>>>>>>> http://hbase.apache.org/book.html#mapreduce.example
>>>>>>>
>>>>>>> As to the question that started this thread...
>>>>>>>
>>>>>>> re: "Store aggregated data in Oracle."
>>>>>>>
>>>>>>> To me, that sounds like the "read-summary" example with
>>>>>>> JDBC-Oracle in the reduce step.
>>>>>>>
>>>>>>> On 9/16/11 2:58 PM, "Chris Tarnas" <[email protected]> wrote:
>>>>>>>
>>>>>>>> If only I could make NY in Nov :)
>>>>>>>>
>>>>>>>> We extract large numbers of DNA sequence reads from HBase, run
>>>>>>>> them through M/R pipelines to analyze and aggregate, and then we
>>>>>>>> load the results back in. Definitely specialized usage, but I
>>>>>>>> could see other perfectly valid uses for reducers with HBase.
>>>>>>>>
>>>>>>>> -chris
>>>>>>>>
>>>>>>>> On Sep 16, 2011, at 11:43 AM, Michael Segel wrote:
>>>>>>>>
>>>>>>>>> Sonal,
>>>>>>>>>
>>>>>>>>> You do realize that HBase is a "database", right? ;-)
>>>>>>>>>
>>>>>>>>> So again, why do you need a reducer? ;-)
>>>>>>>>>
>>>>>>>>> Using your example...
>>>>>>>>> "Again, there will be many cases where one may want a reducer,
>>>>>>>>> say trying to count the occurrence of words in a particular
>>>>>>>>> column."
>>>>>>>>>
>>>>>>>>> You can do this one of two ways...
>>>>>>>>> 1) Dynamic counters in Hadoop.
>>>>>>>>> 2) Use a temp table and auto-increment the value in a column
>>>>>>>>> which contains the word count. (Fat row where rowkey is doc_id
>>>>>>>>> and column is word, or rowkey is doc_id|word.)
>>>>>>>>>
>>>>>>>>> I'm sorry, but if you go through all of your examples of why
>>>>>>>>> you would want to use a reducer, you end up finding out that
>>>>>>>>> writing to an HBase table would be faster than a reduce job.
>>>>>>>>> (Again, we haven't done an exhaustive search, but in all of the
>>>>>>>>> HBase jobs we've run... no reducers were necessary.)
>>>>>>>>>
>>>>>>>>> The point I'm trying to make is that you want to avoid using a
>>>>>>>>> reducer whenever possible, and if you think about your
>>>>>>>>> problem... you can probably come up with a solution that avoids
>>>>>>>>> the reducer...
>>>>>>>>>
>>>>>>>>> HTH
>>>>>>>>>
>>>>>>>>> -Mike
>>>>>>>>>
>>>>>>>>> PS. I haven't looked at *all* of the potential use cases of
>>>>>>>>> HBase, which is why I don't want to say you'll never need a
>>>>>>>>> reducer. I will say that, based on what we've done at my
>>>>>>>>> client's site, we try very hard to avoid reducers. [Note, I'm
>>>>>>>>> sure I'm going to get hammered on this when I head to NY in
>>>>>>>>> Nov. :-)]
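Mike's option 2 in concrete form could look like the following; the
"wordcounts" temp table, the "f" family, and the source "f:body" column are
assumptions, and the doc_id|word layout is the second of the two he
mentions:

import java.io.IOException;

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;

public class WordCountMapper extends TableMapper<NullWritable, NullWritable> {

  private static final byte[] CF = Bytes.toBytes("f");       // assumed family
  private static final byte[] BODY = Bytes.toBytes("body");  // assumed text column
  private static final byte[] COUNT = Bytes.toBytes("count");

  private HTable countTable; // assumed temp table, pre-created

  @Override
  protected void setup(Context context) throws IOException {
    countTable = new HTable(context.getConfiguration(), "wordcounts");
  }

  @Override
  protected void map(ImmutableBytesWritable row, Result value, Context context)
      throws IOException {
    String docId = Bytes.toString(row.get());
    String text = Bytes.toString(value.getValue(CF, BODY));
    for (String word : text.split("\\s+")) {
      // rowkey is doc_id|word, per Mike's second layout; HBase does the math
      countTable.incrementColumnValue(Bytes.toBytes(docId + "|" + word),
          CF, COUNT, 1L);
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    countTable.close();
  }
}

One RPC per word is chatty, so a production version would batch the
increments, but the counting itself happens inside HBase with no shuffle at
all.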
>>>>>>>>>> Date: Fri, 16 Sep 2011 23:00:49 +0530
>>>>>>>>>> Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
>>>>>>>>>> From: [email protected]
>>>>>>>>>> To: [email protected]
>>>>>>>>>>
>>>>>>>>>> Hi Michael,
>>>>>>>>>>
>>>>>>>>>> Yes, thanks, I understand the fact that reducers can be
>>>>>>>>>> expensive with all the shuffling and the sorting, and you may
>>>>>>>>>> not need them always. At the same time, there are many cases
>>>>>>>>>> where reducers are useful, like secondary sorting. In many
>>>>>>>>>> cases, one can have multiple map phases and not have a reduce
>>>>>>>>>> phase at all. Again, there will be many cases where one may
>>>>>>>>>> want a reducer, say trying to count the occurrence of words in
>>>>>>>>>> a particular column.
>>>>>>>>>>
>>>>>>>>>> With this thought chain, I do not feel ready to say that when
>>>>>>>>>> dealing with HBase, I really don't want to use a reducer.
>>>>>>>>>> Please correct me if I am wrong.
>>>>>>>>>>
>>>>>>>>>> Thanks again.
>>>>>>>>>>
>>>>>>>>>> Best Regards,
>>>>>>>>>> Sonal
>>>>>>>>>> Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
>>>>>>>>>> Nube Technologies <http://www.nubetech.co>
>>>>>>>>>> <http://in.linkedin.com/in/sonalgoyal>
>>>>>>>>>>
>>>>>>>>>> On Fri, Sep 16, 2011 at 10:35 PM, Michael Segel <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Sonal,
>>>>>>>>>>>
>>>>>>>>>>> Just because you have a m/r job doesn't mean that you need to
>>>>>>>>>>> reduce anything. You can have a job that contains only a
>>>>>>>>>>> mapper. Or your job runner can have a series of map jobs in
>>>>>>>>>>> serial.
>>>>>>>>>>>
>>>>>>>>>>> Most if not all of the map/reduce jobs where we pull data
>>>>>>>>>>> from HBase don't require a reducer.
>>>>>>>>>>>
>>>>>>>>>>> To give you a simple example... if I want to determine the
>>>>>>>>>>> table schema where I am storing some sort of structured data,
>>>>>>>>>>> I just write a m/r job which opens a table and scans it,
>>>>>>>>>>> counting the occurrence of each column name via dynamic
>>>>>>>>>>> counters.
>>>>>>>>>>>
>>>>>>>>>>> There is no need for a reducer.
>>>>>>>>>>>
>>>>>>>>>>> Does that help?
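Mike's schema-discovery job is about as small as an HBase M/R job gets. A
sketch, with the counter group name "columns" chosen arbitrarily:

import java.io.IOException;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;

// Map-only job: tally the occurrence of each column name with dynamic counters.
public class ColumnNameCounter extends TableMapper<NullWritable, NullWritable> {

  @Override
  protected void map(ImmutableBytesWritable row, Result value, Context context)
      throws IOException {
    for (KeyValue kv : value.raw()) {
      String column = Bytes.toString(kv.getFamily()) + ":"
          + Bytes.toString(kv.getQualifier());
      // the framework sums counters across all map tasks: no reducer needed
      context.getCounter("columns", column).increment(1);
    }
  }
}

The framework aggregates counters across all map tasks, so the per-column
totals come out of the job's counter report with no reduce phase.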
>>>>>>>>>>>> Date: Fri, 16 Sep 2011 21:41:01 +0530
>>>>>>>>>>>> Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
>>>>>>>>>>>> From: [email protected]
>>>>>>>>>>>> To: [email protected]
>>>>>>>>>>>>
>>>>>>>>>>>> Michel,
>>>>>>>>>>>>
>>>>>>>>>>>> Sorry, can you please help me understand what you mean when
>>>>>>>>>>>> you say that when dealing with HBase, you really don't want
>>>>>>>>>>>> to use a reducer? Here, HBase is being used as the input to
>>>>>>>>>>>> the MR job.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Sonal
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Sep 16, 2011 at 2:35 PM, Michel Segel <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I think you need to get a little bit more information.
>>>>>>>>>>>>> Reducers are expensive. When Thomas says that he is
>>>>>>>>>>>>> aggregating data, what exactly does he mean? When dealing
>>>>>>>>>>>>> with HBase, you really don't want to use a reducer.
>>>>>>>>>>>>>
>>>>>>>>>>>>> You may want to run two map jobs, and it could be that just
>>>>>>>>>>>>> dumping the output via JDBC makes the most sense.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We are starting to see a lot of questions where the OP
>>>>>>>>>>>>> isn't providing enough information, so the recommendation
>>>>>>>>>>>>> could be wrong...
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sent from a remote device. Please excuse any typos...
>>>>>>>>>>>>>
>>>>>>>>>>>>> Mike Segel
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sep 16, 2011, at 2:22 AM, Sonal Goyal <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> There is a DBOutputFormat class in the
>>>>>>>>>>>>>> org.apache.hadoop.mapreduce.lib.db package, you could use
>>>>>>>>>>>>>> that. Or you could write to HDFS and then use something
>>>>>>>>>>>>>> like HIHO[1] to export to the db. I have been working
>>>>>>>>>>>>>> extensively in this area; you can write to me directly if
>>>>>>>>>>>>>> you need any help.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1. https://github.com/sonalgoyal/hiho
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>>> Sonal
>>>>>>>>>>>>>> Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
>>>>>>>>>>>>>> Nube Technologies <http://www.nubetech.co>
>>>>>>>>>>>>>> <http://in.linkedin.com/in/sonalgoyal>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Sep 16, 2011 at 10:55 AM, Steinmaurer Thomas
>>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We are writing a MR-Job to process HBase data and store
>>>>>>>>>>>>>>> aggregated data in Oracle. How would you do that in a
>>>>>>>>>>>>>>> MR-job?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Currently, for test purposes, we write the result into a
>>>>>>>>>>>>>>> HBase table again by using a TableReducer. Is there
>>>>>>>>>>>>>>> something like an OracleReducer, RelationalReducer,
>>>>>>>>>>>>>>> JDBCReducer or whatever? Or should one simply use plain
>>>>>>>>>>>>>>> JDBC code in the reduce step?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thomas
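As for the original question, the plain-JDBC route is just a reducer that
opens a connection in setup() and inserts one row per aggregate. A sketch,
with the Oracle URL, credentials, and target table as placeholders
(DBOutputFormat, as Sonal notes, is the ready-made alternative):

import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reducer that sums values per key and writes each aggregate to Oracle via
// JDBC. The Oracle driver jar must be on the task classpath.
public class OracleReducer
    extends Reducer<Text, LongWritable, NullWritable, NullWritable> {

  private Connection conn;
  private PreparedStatement insert;

  @Override
  protected void setup(Context context) throws IOException {
    try {
      conn = DriverManager.getConnection(
          "jdbc:oracle:thin:@dbhost:1521:ORCL", "user", "password"); // placeholders
      insert = conn.prepareStatement(
          "INSERT INTO aggregates (agg_key, total) VALUES (?, ?)");
    } catch (SQLException e) {
      throw new IOException(e);
    }
  }

  @Override
  protected void reduce(Text key, Iterable<LongWritable> values, Context context)
      throws IOException {
    long total = 0;
    for (LongWritable v : values) {
      total += v.get();
    }
    try {
      insert.setString(1, key.toString());
      insert.setLong(2, total);
      insert.executeUpdate();
    } catch (SQLException e) {
      throw new IOException(e);
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    try {
      insert.close();
      conn.close();
    } catch (SQLException e) {
      throw new IOException(e);
    }
  }
}

Each reduce task opens its own connection, so the number of reducers should
stay modest relative to what the database will accept.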
