I was in the middle of responding to Mike's email when yours arrived, so I'll respond to both.
I think the temp-table idea is interesting. The caution is that a
temp table created with default settings will start out as a single
region on one RS and thus be a bottleneck for aggregation. So I would
imagine that you would need to tune the temp table for the job and
pre-create regions.

Doug

On 9/16/11 8:16 PM, "Sam Seigal" <[email protected]> wrote:

>I am trying to do something similar with HBase Map/Reduce.
>
>I have event ids and amounts stored in HBase in the following format:
>prefix-event_id_type-timestamp-event_id as the row key and amount as
>the value.
>
>I want to be able to aggregate the amounts based on the event id type,
>and for this I am using a reducer. I basically key on the
>event id type from the incoming row in the map phase, and perform the
>aggregation in the reducer on the amounts for the event types. Then I
>write the results back into HBase.
>
>I hadn't thought about writing values directly into a temp HBase table
>in the map phase, as suggested by Mike.
>
>For this case, each mapper can declare its own mapperId_event_type row
>with totalAmount and, for each row it receives, do a get, add the
>current amount, and then a put. We are basically then doing a
>get/add/put for every row that a mapper receives. Is this any more
>efficient when compared to the overhead of sorting/partitioning for a
>reducer?
>
>At the end of the mapping phase, aggregating the output of all the
>mappers should be trivial.
>
>On Fri, Sep 16, 2011 at 1:24 PM, Michael Segel
><[email protected]> wrote:
>>
>> Doug and company...
>>
>> Look, I'm not saying that there aren't m/r jobs where you might need
>> reducers when working with HBase. What I am saying is that if we look
>> at what you're attempting to do, you may end up getting better
>> performance if you created a temp table in HBase and let HBase do some
>> of the heavy lifting where you are currently using a reducer. From the
>> jobs that we run, when we looked at what we were doing, there wasn't
>> any need for a reducer.
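[A note on Sam's get/add/put question above: one common way to avoid a
round trip per input row is in-mapper combining, i.e. keep the running
totals in a HashMap inside the mapper and write each distinct total out
once, e.g. from cleanup(), with a put or HBase's atomic
incrementColumnValue(). The accumulation itself is plain Java; the class
and method names below are illustrative, not from the thread:]

```java
import java.util.HashMap;
import java.util.Map;

// In-mapper combining: accumulate per-event-type totals in memory and
// flush once per mapper, instead of one HBase get/add/put per input row.
// Inside a real Mapper this map would be a field, add() would be called
// from map(), and the snapshot would be written out in cleanup().
public class EventTotals {
    private final Map<String, Long> totals = new HashMap<String, Long>();

    // Called once per input row.
    public void add(String eventType, long amount) {
        Long current = totals.get(eventType);
        totals.put(eventType, current == null ? amount : current + amount);
    }

    // Called from cleanup(): one write per distinct event type.
    public Map<String, Long> snapshot() {
        return new HashMap<String, Long>(totals);
    }
}
```

[This trades mapper memory for I/O: the map can only grow as large as the
number of distinct event types seen by one mapper, which is exactly the
case where a shuffle/sort feels wasteful.]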
>> I suspect that it's true of other jobs.
>>
>> Remember that HBase is much more than just an HFile format to persist
>> stuff.
>>
>> Even looking at Sonal's example... you have other ways of doing the
>> record counts, like dynamic counters or a temp table in HBase, which
>> I believe will give you better performance numbers, although I haven't
>> benchmarked either against a reducer.
>>
>> Does that make sense?
>>
>> -Mike
>>
>>
>> > From: [email protected]
>> > To: [email protected]
>> > Date: Fri, 16 Sep 2011 15:41:44 -0400
>> > Subject: Re: Writing MR-Job: Something like OracleReducer,
>> > JDBCReducer ...
>> >
>> >
>> > Chris, agreed... There are times when reducers aren't required, and
>> > then situations where they are useful. We have both kinds of jobs.
>> >
>> > For others following the thread, I updated the book recently with
>> > more MR examples (read-only, read-write, read-summary):
>> >
>> > http://hbase.apache.org/book.html#mapreduce.example
>> >
>> >
>> > As to the question that started this thread...
>> >
>> > re: "Store aggregated data in Oracle."
>> >
>> > To me, that sounds like the "read-summary" example with JDBC-Oracle
>> > in the reduce step.
>> >
>> >
>> > On 9/16/11 2:58 PM, "Chris Tarnas" <[email protected]> wrote:
>> >
>> > >If only I could make NY in Nov :)
>> > >
>> > >We extract large numbers of DNA sequence reads from HBase, run them
>> > >through M/R pipelines to analyze and aggregate, and then we load the
>> > >results back in. Definitely specialized usage, but I could see other
>> > >perfectly valid uses for reducers with HBase.
>> > >
>> > >-chris
>> > >
>> > >On Sep 16, 2011, at 11:43 AM, Michael Segel wrote:
>> > >
>> > >>
>> > >> Sonal,
>> > >>
>> > >> You do realize that HBase is a "database", right? ;-)
>> > >>
>> > >> So again, why do you need a reducer? ;-)
>> > >>
>> > >> Using your example...
>> > >> "Again, there will be many cases where one may want a reducer,
>> > >> say trying to count the occurrence of words in a particular
>> > >> column."
>> > >>
>> > >> You can do this one of two ways:
>> > >> 1) Dynamic counters in Hadoop.
>> > >> 2) Use a temp table and auto-increment the value in a column
>> > >> which contains the word count. (Fat row where the rowkey is
>> > >> doc_id and the column is word, or the rowkey is doc_id|word.)
>> > >>
>> > >> I'm sorry, but if you go through all of your examples of why you
>> > >> would want to use a reducer, you end up finding that writing to
>> > >> an HBase table would be faster than a reduce job.
>> > >> (Again, we haven't done an exhaustive search, but in all of the
>> > >> HBase jobs we've run... no reducers were necessary.)
>> > >>
>> > >> The point I'm trying to make is that you want to avoid using a
>> > >> reducer whenever possible, and if you think about your problem,
>> > >> you can probably come up with a solution that avoids the reducer.
>> > >>
>> > >> HTH
>> > >>
>> > >> -Mike
>> > >> PS. I haven't looked at *all* of the potential use cases of
>> > >> HBase, which is why I don't want to say you'll never need a
>> > >> reducer. I will say that based on what we've done at my client's
>> > >> site, we try very hard to avoid reducers.
>> > >> [Note, I'm sure I'm going to get hammered on this when I head to
>> > >> NY in Nov. :-) ]
>> > >>
>> > >>
>> > >>> Date: Fri, 16 Sep 2011 23:00:49 +0530
>> > >>> Subject: Re: Writing MR-Job: Something like OracleReducer,
>> > >>> JDBCReducer ...
>> > >>> From: [email protected]
>> > >>> To: [email protected]
>> > >>>
>> > >>> Hi Michael,
>> > >>>
>> > >>> Yes, thanks, I understand the fact that reducers can be
>> > >>> expensive, with all the shuffling and the sorting, and you may
>> > >>> not need them always. At the same time, there are many cases
>> > >>> where reducers are useful, like secondary sorting.
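[Mike's first option, dynamic counters, looks roughly like the sketch
below: a map-only job over a table where each word seen bumps a counter
named after the word, and the framework aggregates counters across all
mappers. The column family/qualifier ("contents:text") and counter group
("words") are made-up names. One caveat worth knowing: the framework
caps the number of distinct counters per job, so this only suits a
bounded vocabulary, which is also why it fits Mike's schema-discovery
example further down.]

```java
import java.io.IOException;

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;

// Map-only word counting via dynamic counters: no reducer, no shuffle.
// The mapper emits nothing; all results live in the job's counters.
public class WordCountMapper
        extends TableMapper<NullWritable, NullWritable> {

    private static final byte[] FAMILY = Bytes.toBytes("contents");
    private static final byte[] QUALIFIER = Bytes.toBytes("text");

    @Override
    protected void map(ImmutableBytesWritable row, Result value,
                       Context context)
            throws IOException, InterruptedException {
        byte[] cell = value.getValue(FAMILY, QUALIFIER);
        if (cell == null) {
            return;
        }
        for (String word : Bytes.toString(cell).split("\\s+")) {
            // Dynamic counter: the counter name is the word itself.
            context.getCounter("words", word).increment(1);
        }
    }
}
```

[Wiring it up would use TableMapReduceUtil.initTableMapperJob(...) with
job.setNumReduceTasks(0) and a NullOutputFormat; after
waitForCompletion() the totals are read back via job.getCounters().]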
>> > >>> In many cases, one can have multiple map phases and not have a
>> > >>> reduce phase at all. Again, there will be many cases where one
>> > >>> may want a reducer, say trying to count the occurrence of words
>> > >>> in a particular column.
>> > >>>
>> > >>> With this thought chain, I do not feel ready to say that when
>> > >>> dealing with HBase, I really don't want to use a reducer.
>> > >>> Please correct me if I am wrong.
>> > >>>
>> > >>> Thanks again.
>> > >>>
>> > >>> Best Regards,
>> > >>> Sonal
>> > >>> Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
>> > >>> Nube Technologies <http://www.nubetech.co>
>> > >>>
>> > >>> <http://in.linkedin.com/in/sonalgoyal>
>> > >>>
>> > >>>
>> > >>> On Fri, Sep 16, 2011 at 10:35 PM, Michael Segel
>> > >>> <[email protected]> wrote:
>> > >>>
>> > >>>>
>> > >>>> Sonal,
>> > >>>>
>> > >>>> Just because you have a m/r job doesn't mean that you need to
>> > >>>> reduce anything. You can have a job that contains only a
>> > >>>> mapper. Or your job runner can have a series of map jobs in
>> > >>>> serial.
>> > >>>>
>> > >>>> Most if not all of the map/reduce jobs where we pull data from
>> > >>>> HBase don't require a reducer.
>> > >>>>
>> > >>>> To give you a simple example... if I want to determine the
>> > >>>> table schema where I am storing some sort of structured data,
>> > >>>> I just write a m/r job which opens a table and scans it,
>> > >>>> counting the occurrence of each column name via dynamic
>> > >>>> counters.
>> > >>>>
>> > >>>> There is no need for a reducer.
>> > >>>>
>> > >>>> Does that help?
>> > >>>>
>> > >>>>
>> > >>>>> Date: Fri, 16 Sep 2011 21:41:01 +0530
>> > >>>>> Subject: Re: Writing MR-Job: Something like OracleReducer,
>> > >>>>> JDBCReducer ...
>> > >>>>> From: [email protected]
>> > >>>>> To: [email protected]
>> > >>>>>
>> > >>>>> Michel,
>> > >>>>>
>> > >>>>> Sorry, can you please help me understand what you mean when
>> > >>>>> you say that when dealing with HBase, you really don't want
>> > >>>>> to use a reducer? Here, HBase is being used as the input to
>> > >>>>> the MR job.
>> > >>>>>
>> > >>>>> Thanks
>> > >>>>> Sonal
>> > >>>>>
>> > >>>>>
>> > >>>>> On Fri, Sep 16, 2011 at 2:35 PM, Michel Segel
>> > >>>>> <[email protected]> wrote:
>> > >>>>>
>> > >>>>>> I think you need to get a little bit more information.
>> > >>>>>> Reducers are expensive.
>> > >>>>>> When Thomas says that he is aggregating data, what exactly
>> > >>>>>> does he mean?
>> > >>>>>> When dealing with HBase, you really don't want to use a
>> > >>>>>> reducer.
>> > >>>>>>
>> > >>>>>> You may want to run two map jobs, and it could be that just
>> > >>>>>> dumping the output via JDBC makes the most sense.
>> > >>>>>>
>> > >>>>>> We are starting to see a lot of questions where the OP isn't
>> > >>>>>> providing enough information, so the recommendation could be
>> > >>>>>> wrong...
>> > >>>>>>
>> > >>>>>>
>> > >>>>>> Sent from a remote device. Please excuse any typos...
>> > >>>>>>
>> > >>>>>> Mike Segel
>> > >>>>>>
>> > >>>>>> On Sep 16, 2011, at 2:22 AM, Sonal Goyal
>> > >>>>>> <[email protected]> wrote:
>> > >>>>>>
>> > >>>>>>> There is a DBOutputFormat class in the
>> > >>>>>>> org.apache.hadoop.mapreduce.lib.db package; you could use
>> > >>>>>>> that. Or you could write to HDFS and then use something
>> > >>>>>>> like HIHO[1] to export to the db. I have been working
>> > >>>>>>> extensively in this area; you can write to me directly if
>> > >>>>>>> you need any help.
>> > >>>>>>>
>> > >>>>>>> 1.
>> > >>>>>>> https://github.com/sonalgoyal/hiho
>> > >>>>>>>
>> > >>>>>>> Best Regards,
>> > >>>>>>> Sonal
>> > >>>>>>> Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
>> > >>>>>>> Nube Technologies <http://www.nubetech.co>
>> > >>>>>>>
>> > >>>>>>> <http://in.linkedin.com/in/sonalgoyal>
>> > >>>>>>>
>> > >>>>>>>
>> > >>>>>>> On Fri, Sep 16, 2011 at 10:55 AM, Steinmaurer Thomas <
>> > >>>>>>> [email protected]> wrote:
>> > >>>>>>>
>> > >>>>>>>> Hello,
>> > >>>>>>>>
>> > >>>>>>>> We are writing a MR job to process HBase data and store
>> > >>>>>>>> the aggregated data in Oracle. How would you do that in a
>> > >>>>>>>> MR job?
>> > >>>>>>>>
>> > >>>>>>>> Currently, for test purposes, we write the result into an
>> > >>>>>>>> HBase table again by using a TableReducer. Is there
>> > >>>>>>>> something like an OracleReducer, RelationalReducer,
>> > >>>>>>>> JDBCReducer or whatever? Or should one simply use plain
>> > >>>>>>>> JDBC code in the reduce step?
>> > >>>>>>>>
>> > >>>>>>>> Thanks!
>> > >>>>>>>>
>> > >>>>>>>> Thomas
>> > >>>>>>>>
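[For Thomas's original question: there is no OracleReducer as such; the
stock approach Sonal points at is DBOutputFormat plus a value class
implementing DBWritable. A sketch under assumptions: the JDBC URL,
credentials, table name "event_totals", and columns "event_type"/"total"
are placeholders, not anything from the thread:]

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBOutputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

// One aggregated row bound for Oracle. DBOutputFormat builds an INSERT
// statement from the field names passed to setOutput() and calls
// write(PreparedStatement) on each record emitted as the output key.
public class EventTotal implements Writable, DBWritable {
    private String eventType;
    private long total;

    public EventTotal() { }

    public EventTotal(String eventType, long total) {
        this.eventType = eventType;
        this.total = total;
    }

    // DBWritable: bind values for
    // INSERT INTO event_totals (event_type, total) VALUES (?, ?)
    public void write(PreparedStatement stmt) throws SQLException {
        stmt.setString(1, eventType);
        stmt.setLong(2, total);
    }

    public void readFields(ResultSet rs) throws SQLException {
        eventType = rs.getString(1);
        total = rs.getLong(2);
    }

    // Writable: needed if the record travels through the shuffle.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(eventType);
        out.writeLong(total);
    }

    public void readFields(DataInput in) throws IOException {
        eventType = in.readUTF();
        total = in.readLong();
    }

    // Job wiring: point the job's output at Oracle instead of HBase.
    public static void configureOutput(Job job) throws IOException {
        DBConfiguration.configureDB(job.getConfiguration(),
                "oracle.jdbc.driver.OracleDriver",
                "jdbc:oracle:thin:@dbhost:1521:ORCL",  // placeholder URL
                "dbuser", "dbpass");                   // placeholder creds
        DBOutputFormat.setOutput(job, "event_totals",  // placeholder table
                "event_type", "total");
    }
}
```

[The reducer would then emit EventTotal as the output key with a
NullWritable value, since DBOutputFormat writes keys. The alternative
Thomas mentions, plain JDBC in the reduce step, also works but leaves
connection pooling and batching to you.]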
