I see what you are saying about the temp table being hosted on a single region server - especially for a limited set of rows that just care about the aggregations but receive a lot of traffic. I wonder whether the same would hold if I used the source table to maintain these temporary records instead of creating a temp table on the fly ...
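If I do end up pre-creating regions for a temp table, I am picturing something like the following. A minimal, untested sketch; the table name, column family, and split points are made up for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class TempTableSetup {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        // Hypothetical temp table with one small column family.
        HTableDescriptor desc = new HTableDescriptor("agg_temp");
        desc.addFamily(new HColumnDescriptor("d"));

        // Pre-split so each event-id-type bucket starts in its own
        // region instead of everything landing on one region server.
        // Split points here are placeholders for real key prefixes.
        byte[][] splits = new byte[][] {
            Bytes.toBytes("eventTypeB"),
            Bytes.toBytes("eventTypeC"),
            Bytes.toBytes("eventTypeD")
        };
        admin.createTable(desc, splits);
    }
}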
On Fri, Sep 16, 2011 at 5:24 PM, Doug Meil <[email protected]> wrote:
>
> I'll add this to the book in the MR section.
>
>
> On 9/16/11 8:22 PM, "Doug Meil" <[email protected]> wrote:
>
>> I was in the middle of responding to Mike's email when yours arrived, so
>> I'll respond to both.
>>
>> I think the temp-table idea is interesting. The caution is that a default
>> temp-table creation will be hosted on a single RS and thus be a bottleneck
>> for aggregation. So I would imagine that you would need to tune the
>> temp-table for the job and pre-create regions.
>>
>> Doug
>>
>> On 9/16/11 8:16 PM, "Sam Seigal" <[email protected]> wrote:
>>
>>> I am trying to do something similar with HBase Map/Reduce.
>>>
>>> I have event ids and amounts stored in HBase in the following format:
>>> prefix-event_id_type-timestamp-event_id as the row key and amount as
>>> the value.
>>> I want to be able to aggregate the amounts based on the event id type,
>>> and for this I am using a reducer. I basically reduce on the event id
>>> type from the incoming row in the map phase, and perform the
>>> aggregation in the reducer on the amounts for the event types. Then I
>>> write the results back into HBase.
>>>
>>> I hadn't thought about writing values directly into a temp HBase table
>>> in the map phase, as Mike suggested.
>>>
>>> For this case, each mapper can declare its own mapperId_event_type row
>>> with totalAmount and, for each row it receives, do a get, add the
>>> current amount, and then a put. We are basically then doing a
>>> get/add/put for every row that a mapper receives. Is this any more
>>> efficient when compared to the overhead of sorting/partitioning for a
>>> reducer?
>>>
>>> At the end of the mapping phase, aggregating the output of all the
>>> mappers should be trivial.
>>>
>>> On Fri, Sep 16, 2011 at 1:24 PM, Michael Segel
>>> <[email protected]> wrote:
>>>>
>>>> Doug and company...
>>>>
>>>> Look, I'm not saying that there aren't m/r jobs where you might need
>>>> reducers when working with HBase. What I am saying is that if we look at
>>>> what you're attempting to do, you may end up getting better performance
>>>> if you created a temp table in HBase and let HBase do some of the heavy
>>>> lifting where you are currently using a reducer. From the jobs that we
>>>> run, when we looked at what we were doing, there wasn't any need for a
>>>> reducer. I suspect that it's true of other jobs.
>>>>
>>>> Remember that HBase is much more than just an HFile format to persist
>>>> stuff.
>>>>
>>>> Even looking at Sonal's example... you have other ways of doing the
>>>> record counts, like dynamic counters or a temp table in HBase, which I
>>>> believe will give you better performance numbers, although I haven't
>>>> benchmarked either against a reducer.
>>>>
>>>> Does that make sense?
>>>>
>>>> -Mike
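On Sam's get/add/put question above: a server-side increment should be cheaper than a client-side get followed by a put, since it is one atomic round trip per row. A rough map-only sketch along the lines of Mike's temp-table idea, assuming the amounts fit in a long; the temp table, column family, and qualifier names are made up:

import java.io.IOException;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;

public class IncrementMapper
        extends TableMapper<NullWritable, NullWritable> {

    private HTable tempTable;

    @Override
    protected void setup(Context context) throws IOException {
        // Hypothetical pre-split temp table from earlier in the thread.
        tempTable = new HTable(context.getConfiguration(), "agg_temp");
    }

    @Override
    protected void map(ImmutableBytesWritable row, Result value,
            Context context) throws IOException {
        // Row key layout from the thread:
        // prefix-event_id_type-timestamp-event_id
        // (assumes none of the fields themselves contain '-').
        String[] parts = Bytes.toString(row.get()).split("-");
        String eventType = parts[1];
        long amount = Bytes.toLong(value.getValue(Bytes.toBytes("d"),
                                                  Bytes.toBytes("amount")));
        // Server-side atomic add: no client-side get/add/put round trips.
        tempTable.incrementColumnValue(Bytes.toBytes(eventType),
                Bytes.toBytes("d"), Bytes.toBytes("total"), amount);
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        tempTable.close();
    }
}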
" >>>> > >>>> > To me, that sounds a like the "read-summary" example with JDBC-Oracle >>>>in >>>> > the reduce step. >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > On 9/16/11 2:58 PM, "Chris Tarnas" <[email protected]> wrote: >>>> > >>>> > >If only I could make NY in Nov :) >>>> > > >>>> > >We extract out large numbers of DNA sequence reads from HBase, run >>>>them >>>> > >through M/R pipelines to analyze and aggregate and then we load the >>>> > >results back in. Definitely specialized usage, but I could see other >>>> > >perfectly valid uses for reducers with HBase. >>>> > > >>>> > >-chris >>>> > > >>>> > >On Sep 16, 2011, at 11:43 AM, Michael Segel wrote: >>>> > > >>>> > >> >>>> > >> Sonal, >>>> > >> >>>> > >> You do realize that HBase is a "database", right? ;-) >>>> > >> >>>> > >> So again, why do you need a reducer? ;-) >>>> > >> >>>> > >> Using your example... >>>> > >> "Again, there will be many cases where one may want a reducer, say >>>> > >>trying to count the occurrence of words in a particular column." >>>> > >> >>>> > >> You can do this one of two ways... >>>> > >> 1) Dynamic Counters in Hadoop. >>>> > >> 2) Use a temp table and auto increment the value in a column which >>>> > >>contains the word count. (Fat row where rowkey is doc_id and >>>>column is >>>> > >>word or rowkey is doc_id|word) >>>> > >> >>>> > >> I'm sorry but if you go through all of your examples of why you >>>>would >>>> > >>want to use a reducer, you end up finding out that writing to an >>>>HBase >>>> > >>table would be faster than a reduce job. >>>> > >> (Again we haven't done an exhaustive search, but in all of the >>>>HBase >>>> > >>jobs we've run... no reducers were necessary.) >>>> > >> >>>> > >> The point I'm trying to make is that you want to avoid using a >>>>reducer >>>> > >>whenever possible and if you think about your problem... you can >>>> > >>probably come up with a solution that avoids the reducer... >>>> > >> >>>> > >> >>>> > >> HTH >>>> > >> >>>> > >> -Mike >>>> > >> PS. I haven't looked at *all* of the potential use cases of HBase >>>>which >>>> > >>is why I don't want to say you'll never need a reducer. I will say >>>>that >>>> > >>based on what we've done at my client's site, we try very hard to >>>>avoid >>>> > >>reducers. >>>> > >> [Note, I'm sure I'm going to get hammered on this when I head to >>>>NY in >>>> > >>Nov. :-) ] >>>> > >> >>>> > >> >>>> > >>> Date: Fri, 16 Sep 2011 23:00:49 +0530 >>>> > >>> Subject: Re: Writing MR-Job: Something like OracleReducer, >>>>JDBCReducer >>>> > >>>... >>>> > >>> From: [email protected] >>>> > >>> To: [email protected] >>>> > >>> >>>> > >>> Hi Michael, >>>> > >>> >>>> > >>> Yes, thanks, I understand the fact that reducers can be expensive >>>>with >>>> > >>>all >>>> > >>> the shuffling and the sorting, and you may not need them always. >>>>At >>>> > >>>the same >>>> > >>> time, there are many cases where reducers are useful, like >>>>secondary >>>> > >>> sorting. In many cases, one can have multiple map phases and not >>>>have a >>>> > >>> reduce phase at all. Again, there will be many cases where one >>>>may >>>> > >>>want a >>>> > >>> reducer, say trying to count the occurrence of words in a >>>>particular >>>> > >>>column. >>>> > >>> >>>> > >>> >>>> > >>> With this thought chain, I do not feel ready to say that when >>>>dealing >>>> > >>>with >>>> > >>> HBase, I really dont want to use a reducer. Please correct me if >>>>I am >>>> > >>> wrong. >>>> > >>> >>>> > >>> Thanks again. 
>>>> > >>>
>>>> > >>> On Fri, Sep 16, 2011 at 10:35 PM, Michael Segel
>>>> > >>> <[email protected]> wrote:
>>>> > >>>
>>>> > >>>> Sonal,
>>>> > >>>>
>>>> > >>>> Just because you have a m/r job doesn't mean that you need to
>>>> > >>>> reduce anything. You can have a job that contains only a mapper.
>>>> > >>>> Or your job runner can have a series of map jobs in serial.
>>>> > >>>>
>>>> > >>>> Most, if not all, of the map/reduce jobs where we pull data from
>>>> > >>>> HBase don't require a reducer.
>>>> > >>>>
>>>> > >>>> To give you a simple example... if I want to determine the table
>>>> > >>>> schema where I am storing some sort of structured data...
>>>> > >>>> I just write a m/r job which opens a table and scans it, counting
>>>> > >>>> the occurrence of each column name via dynamic counters.
>>>> > >>>>
>>>> > >>>> There is no need for a reducer.
>>>> > >>>>
>>>> > >>>> Does that help?
>>>> > >>>>
>>>> > >>>>> Date: Fri, 16 Sep 2011 21:41:01 +0530
>>>> > >>>>> Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
>>>> > >>>>> From: [email protected]
>>>> > >>>>> To: [email protected]
>>>> > >>>>>
>>>> > >>>>> Michel,
>>>> > >>>>>
>>>> > >>>>> Sorry, can you please help me understand what you mean when you
>>>> > >>>>> say that when dealing with HBase, you really don't want to use a
>>>> > >>>>> reducer? Here, HBase is being used as the input to the MR job.
>>>> > >>>>>
>>>> > >>>>> Thanks
>>>> > >>>>> Sonal
>>>> > >>>>>
>>>> > >>>>> On Fri, Sep 16, 2011 at 2:35 PM, Michel Segel
>>>> > >>>>> <[email protected]> wrote:
>>>> > >>>>>
>>>> > >>>>>> I think you need to get a little bit more information.
>>>> > >>>>>> Reducers are expensive.
>>>> > >>>>>> When Thomas says that he is aggregating data, what exactly does
>>>> > >>>>>> he mean?
>>>> > >>>>>> When dealing with HBase, you really don't want to use a reducer.
>>>> > >>>>>>
>>>> > >>>>>> You may want to run two map jobs, and it could be that just
>>>> > >>>>>> dumping the output via JDBC makes the most sense.
>>>> > >>>>>>
>>>> > >>>>>> We are starting to see a lot of questions where the OP isn't
>>>> > >>>>>> providing enough information, so the recommendation could be
>>>> > >>>>>> wrong...
>>>> > >>>>>>
>>>> > >>>>>> Sent from a remote device. Please excuse any typos...
>>>> > >>>>>>
>>>> > >>>>>> Mike Segel
>>>> > >>>>>>
>>>> > >>>>>> On Sep 16, 2011, at 2:22 AM, Sonal Goyal <[email protected]> wrote:
>>>> > >>>>>>
>>>> > >>>>>>> There is a DBOutputFormat class in the
>>>> > >>>>>>> org.apache.hadoop.mapreduce.lib.db package, you could use that.
>>>> > >>>>>>> Or you could write to HDFS and then use something like HIHO[1]
>>>> > >>>>>>> to export to the db. I have been working extensively in this
>>>> > >>>>>>> area; you can write to me directly if you need any help.
>>>> > >>>>>>>
>>>> > >>>>>>> 1. https://github.com/sonalgoyal/hiho
>>>> > >>>>>>>
>>>> > >>>>>>> Best Regards,
>>>> > >>>>>>> Sonal
>>>> > >>>>>>> Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
>>>> > >>>>>>> Nube Technologies <http://www.nubetech.co>
>>>> > >>>>>>> <http://in.linkedin.com/in/sonalgoyal>
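For the DBOutputFormat route Sonal mentions, the job wiring would look roughly like this. The driver class is the standard Oracle thin driver, but the URL, credentials, and table/field names are placeholders, and the reducer's output key class would have to implement DBWritable:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBOutputFormat;

public class OracleExportJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Driver class plus placeholder URL and credentials.
        DBConfiguration.configureDB(conf,
                "oracle.jdbc.driver.OracleDriver",
                "jdbc:oracle:thin:@dbhost:1521:orcl",
                "user", "secret");

        Job job = new Job(conf, "hbase-to-oracle");
        job.setOutputFormatClass(DBOutputFormat.class);
        // Hypothetical target table and columns; the job's output key
        // class must implement DBWritable and fill these fields.
        DBOutputFormat.setOutput(job, "event_totals",
                "event_type", "total");
        // ... set mapper/reducer and the input format
        // (e.g. TableInputFormat via TableMapReduceUtil) here.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}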
>>>> > >>>>>>>
>>>> > >>>>>>> On Fri, Sep 16, 2011 at 10:55 AM, Steinmaurer Thomas <
>>>> > >>>>>>> [email protected]> wrote:
>>>> > >>>>>>>
>>>> > >>>>>>>> Hello,
>>>> > >>>>>>>>
>>>> > >>>>>>>> we are writing an MR job to process HBase data and store
>>>> > >>>>>>>> aggregated data in Oracle. How would you do that in an MR job?
>>>> > >>>>>>>>
>>>> > >>>>>>>> Currently, for test purposes, we write the result into an
>>>> > >>>>>>>> HBase table again by using a TableReducer. Is there something
>>>> > >>>>>>>> like an OracleReducer, RelationalReducer, JDBCReducer or
>>>> > >>>>>>>> whatever? Or should one simply use plain JDBC code in the
>>>> > >>>>>>>> reduce step?
>>>> > >>>>>>>>
>>>> > >>>>>>>> Thanks!
>>>> > >>>>>>>>
>>>> > >>>>>>>> Thomas
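And for Thomas's original question: there is no OracleReducer out of the box, so plain JDBC in the reduce step (Doug's "read-summary" shape) seems to be the usual answer. A minimal sketch, with placeholder connection details and a hypothetical target table:

import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class JdbcSummaryReducer
        extends Reducer<Text, LongWritable, NullWritable, NullWritable> {

    private Connection conn;
    private PreparedStatement stmt;

    @Override
    protected void setup(Context context) throws IOException {
        try {
            // Placeholder connection details; the Oracle driver jar
            // has to be on the task classpath.
            Class.forName("oracle.jdbc.driver.OracleDriver");
            conn = DriverManager.getConnection(
                    "jdbc:oracle:thin:@dbhost:1521:orcl", "user", "secret");
            stmt = conn.prepareStatement(
                    "INSERT INTO event_totals (event_type, total) VALUES (?, ?)");
        } catch (ClassNotFoundException | SQLException e) {
            throw new IOException(e);
        }
    }

    @Override
    protected void reduce(Text eventType, Iterable<LongWritable> amounts,
            Context context) throws IOException {
        // Sum the per-event-type amounts, then write one row to Oracle.
        // For real volumes you would batch the inserts.
        long total = 0;
        for (LongWritable amount : amounts) {
            total += amount.get();
        }
        try {
            stmt.setString(1, eventType.toString());
            stmt.setLong(2, total);
            stmt.executeUpdate();
        } catch (SQLException e) {
            throw new IOException(e);
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        try {
            stmt.close();
            conn.close();
        } catch (SQLException e) {
            throw new IOException(e);
        }
    }
}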
