Aren't there memory considerations with this approach? I would assume the
HashMap can get pretty big if it retains in memory every record that passes
through... (Apologies if I am being ignorant with my limited knowledge of
Hadoop's internal workings.)
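For what it's worth, in the scheme Doug describes below, the HashMap holds
one entry per distinct aggregate key (key = aggregate, value = count) rather
than one per record, and it can also be flushed early if the key space
itself is large. Here is a rough sketch of that pattern; the "summary" table
name, the "f:count" column, and the flush threshold are made up for
illustration, and it uses incrementColumnValue in place of the checkAndPut
pass for brevity:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;

public class AggregatingMapper extends TableMapper<NullWritable, NullWritable> {

  private static final int FLUSH_THRESHOLD = 10000; // safety valve, tune to taste
  private static final byte[] CF = Bytes.toBytes("f");
  private static final byte[] COUNT = Bytes.toBytes("count");

  private final Map<String, Long> counts = new HashMap<String, Long>();
  private HTable summaryTable; // assumed pre-split summary table

  @Override
  protected void setup(Context context) throws IOException {
    summaryTable = new HTable(context.getConfiguration(), "summary");
  }

  @Override
  protected void map(ImmutableBytesWritable row, Result value, Context context)
      throws IOException {
    String aggKey = extractAggregateKey(row.get());
    Long current = counts.get(aggKey);
    counts.put(aggKey, current == null ? 1L : current + 1L);
    if (counts.size() >= FLUSH_THRESHOLD) {
      flush(); // bounds memory even if the aggregate key space is large
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    flush(); // the single pass Doug mentions: one write batch per map task
    summaryTable.close();
  }

  private void flush() throws IOException {
    for (Map.Entry<String, Long> e : counts.entrySet()) {
      // atomic server-side add, standing in for the checkAndPut pass
      summaryTable.incrementColumnValue(Bytes.toBytes(e.getKey()), CF, COUNT,
          e.getValue());
    }
    counts.clear();
  }

  private String extractAggregateKey(byte[] rowKey) {
    // application-specific: pull the aggregate dimension out of the row key
    return Bytes.toString(rowKey);
  }
}

The cleanup() flush is the single pass Doug mentions; the threshold flush is
only a safety valve for high-cardinality aggregates.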
On Fri, Sep 16, 2011 at 6:14 PM, Doug Meil <[email protected]> wrote:
>
> However, if the aggregations in the mapper were kept in a HashMap (key
> being the aggregate, value being the count), and then the mapper made a
> single pass over this map during the cleanup method and then did the
> checkAndPuts, it would mean that the writes would only happen once per
> map-task, rather than on a per-row basis (which would be really
> expensive).
>
> A single region on a single RS could handle that no problem.
>
> On 9/16/11 9:00 PM, "Sam Seigal" <[email protected]> wrote:
>
>> I see what you are saying about the temp table being hosted on a
>> single region server - especially for a limited set of rows that just
>> care about the aggregations, but receive a lot of traffic. I wonder if
>> this will also be the case if I were to use the source table to
>> maintain these temporary records, and not create a temp table on the
>> fly...
>>
>> On Fri, Sep 16, 2011 at 5:24 PM, Doug Meil <[email protected]> wrote:
>>>
>>> I'll add this to the book in the MR section.
>>>
>>> On 9/16/11 8:22 PM, "Doug Meil" <[email protected]> wrote:
>>>
>>>> I was in the middle of responding to Mike's email when yours
>>>> arrived, so I'll respond to both.
>>>>
>>>> I think the temp-table idea is interesting. The caution is that a
>>>> default temp-table creation will be hosted on a single RS and thus
>>>> be a bottleneck for aggregation. So I would imagine that you would
>>>> need to tune the temp-table for the job and pre-create regions.
>>>>
>>>> Doug
>>>>
>>>> On 9/16/11 8:16 PM, "Sam Seigal" <[email protected]> wrote:
>>>>
>>>>> I am trying to do something similar with HBase Map/Reduce.
>>>>>
>>>>> I have event ids and amounts stored in HBase in the following
>>>>> format: prefix-event_id_type-timestamp-event_id as the row key, and
>>>>> amount as the value. I want to be able to aggregate the amounts
>>>>> based on the event id type, and for this I am using a reducer. I
>>>>> basically key on the event_id_type from the incoming row in the map
>>>>> phase, and perform the aggregation in the reducer on the amounts
>>>>> for the event types. Then I write the results back into HBase.
>>>>>
>>>>> I hadn't thought about writing values directly into a temp HBase
>>>>> table in the map phase, as suggested by Mike.
>>>>>
>>>>> For this case, each mapper can declare its own mapperId_event_type
>>>>> row with totalAmount and, for each row it receives, do a get, add
>>>>> the current amount, and then a put. We are basically then doing a
>>>>> get/add/put for every row that a mapper receives. Is this any more
>>>>> efficient when compared to the overhead of sorting/partitioning for
>>>>> a reducer?
>>>>>
>>>>> At the end of the mapping phase, aggregating the output of all the
>>>>> mappers should be trivial.
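A rough sketch of the reducer version Sam describes, following the shape of
the read-summary example in the book; the row-key split and the "f:amount"
and "f:total" columns are assumptions based on his description:

import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

// Map: row key is prefix-event_id_type-timestamp-event_id;
// emit (event_id_type, amount).
class EventTypeMapper extends TableMapper<Text, LongWritable> {

  private static final byte[] CF = Bytes.toBytes("f");          // assumed family
  private static final byte[] AMOUNT = Bytes.toBytes("amount"); // assumed qualifier

  @Override
  protected void map(ImmutableBytesWritable row, Result value, Context context)
      throws IOException, InterruptedException {
    // assumes none of the key components themselves contain '-'
    String[] parts = Bytes.toString(row.get()).split("-");
    String eventIdType = parts[1];
    long amount = Bytes.toLong(value.getValue(CF, AMOUNT));
    context.write(new Text(eventIdType), new LongWritable(amount));
  }
}

// Reduce: sum the amounts per event type and write the total back to HBase.
class EventTypeReducer
    extends TableReducer<Text, LongWritable, ImmutableBytesWritable> {

  @Override
  protected void reduce(Text eventIdType, Iterable<LongWritable> amounts,
      Context context) throws IOException, InterruptedException {
    long total = 0;
    for (LongWritable amount : amounts) {
      total += amount.get();
    }
    Put put = new Put(Bytes.toBytes(eventIdType.toString()));
    put.add(Bytes.toBytes("f"), Bytes.toBytes("total"), Bytes.toBytes(total));
    context.write(null, put); // TableOutputFormat ignores the key
  }
}

The map-side get/add/put alternative trades the shuffle and sort for a
read-modify-write round trip per input row, which is exactly the cost
Doug's batch-in-the-mapper variant avoids by writing once per map task.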
>>>>> On Fri, Sep 16, 2011 at 1:24 PM, Michael Segel <[email protected]> wrote:
>>>>>
>>>>>> Doug and company...
>>>>>>
>>>>>> Look, I'm not saying that there aren't m/r jobs where you might
>>>>>> need reducers when working with HBase. What I am saying is that if
>>>>>> we look at what you're attempting to do, you may end up getting
>>>>>> better performance if you created a temp table in HBase and let
>>>>>> HBase do some of the heavy lifting where you are currently using a
>>>>>> reducer. From the jobs that we run, when we looked at what we were
>>>>>> doing, there wasn't any need for a reducer. I suspect that it's
>>>>>> true of other jobs.
>>>>>>
>>>>>> Remember that HBase is much more than just an HFile format to
>>>>>> persist stuff.
>>>>>>
>>>>>> Even looking at Sonal's example... you have other ways of doing
>>>>>> the record counts, like dynamic counters or using a temp table in
>>>>>> HBase, which I believe will give you better performance numbers,
>>>>>> although I haven't benchmarked either against a reducer.
>>>>>>
>>>>>> Does that make sense?
>>>>>>
>>>>>> -Mike
>>>>>>
>>>>>>> From: [email protected]
>>>>>>> To: [email protected]
>>>>>>> Date: Fri, 16 Sep 2011 15:41:44 -0400
>>>>>>> Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
>>>>>>>
>>>>>>> Chris, agreed... there are some cases where reducers aren't
>>>>>>> required, and other situations where they are useful. We have
>>>>>>> both kinds of jobs.
>>>>>>>
>>>>>>> For others following the thread, I updated the book recently with
>>>>>>> more MR examples (read-only, read-write, read-summary):
>>>>>>>
>>>>>>> http://hbase.apache.org/book.html#mapreduce.example
>>>>>>>
>>>>>>> As to the question that started this thread...
>>>>>>>
>>>>>>> re: "Store aggregated data in Oracle."
>>>>>>>
>>>>>>> To me, that sounds like the "read-summary" example with
>>>>>>> JDBC-Oracle in the reduce step.
>>>>>>>
>>>>>>> On 9/16/11 2:58 PM, "Chris Tarnas" <[email protected]> wrote:
>>>>>>>
>>>>>>>> If only I could make NY in Nov :)
>>>>>>>>
>>>>>>>> We extract large numbers of DNA sequence reads from HBase, run
>>>>>>>> them through M/R pipelines to analyze and aggregate, and then we
>>>>>>>> load the results back in. Definitely specialized usage, but I
>>>>>>>> could see other perfectly valid uses for reducers with HBase.
>>>>>>>>
>>>>>>>> -chris
>>>>>>>>
>>>>>>>> On Sep 16, 2011, at 11:43 AM, Michael Segel wrote:
>>>>>>>>
>>>>>>>>> Sonal,
>>>>>>>>>
>>>>>>>>> You do realize that HBase is a "database", right? ;-)
>>>>>>>>>
>>>>>>>>> So again, why do you need a reducer? ;-)
>>>>>>>>>
>>>>>>>>> Using your example...
>>>>>>>>> "Again, there will be many cases where one may want a reducer,
>>>>>>>>> say trying to count the occurrence of words in a particular
>>>>>>>>> column."
>>>>>>>>>
>>>>>>>>> You can do this one of two ways...
>>>>>>>>> 1) Dynamic counters in Hadoop.
>>>>>>>>> 2) Use a temp table and auto-increment the value in a column
>>>>>>>>> which contains the word count. (Fat row where rowkey is doc_id
>>>>>>>>> and column is word, or rowkey is doc_id|word.)
>>>>>>>>>
>>>>>>>>> I'm sorry, but if you go through all of your examples of why
>>>>>>>>> you would want to use a reducer, you end up finding out that
>>>>>>>>> writing to an HBase table would be faster than a reduce job.
>>>>>>>>> (Again, we haven't done an exhaustive search, but in all of the
>>>>>>>>> HBase jobs we've run... no reducers were necessary.)
>>>>>>>>>
>>>>>>>>> The point I'm trying to make is that you want to avoid using a
>>>>>>>>> reducer whenever possible, and if you think about your
>>>>>>>>> problem... you can probably come up with a solution that avoids
>>>>>>>>> the reducer...
>>>>>>>>>
>>>>>>>>> HTH
>>>>>>>>>
>>>>>>>>> -Mike
>>>>>>>>>
>>>>>>>>> PS. I haven't looked at *all* of the potential use cases of
>>>>>>>>> HBase, which is why I don't want to say you'll never need a
>>>>>>>>> reducer. I will say that, based on what we've done at my
>>>>>>>>> client's site, we try very hard to avoid reducers. [Note, I'm
>>>>>>>>> sure I'm going to get hammered on this when I head to NY in
>>>>>>>>> Nov. :-)]
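Mike's option 2 in concrete form could look like the following; the
"wordcounts" temp table, the "f" family, and the source "f:body" column are
assumptions, and the doc_id|word layout is the second of the two he
mentions:

import java.io.IOException;

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;

public class WordCountMapper extends TableMapper<NullWritable, NullWritable> {

  private static final byte[] CF = Bytes.toBytes("f");       // assumed family
  private static final byte[] BODY = Bytes.toBytes("body");  // assumed text column
  private static final byte[] COUNT = Bytes.toBytes("count");

  private HTable countTable; // assumed temp table, pre-created

  @Override
  protected void setup(Context context) throws IOException {
    countTable = new HTable(context.getConfiguration(), "wordcounts");
  }

  @Override
  protected void map(ImmutableBytesWritable row, Result value, Context context)
      throws IOException {
    String docId = Bytes.toString(row.get());
    String text = Bytes.toString(value.getValue(CF, BODY));
    for (String word : text.split("\\s+")) {
      // rowkey is doc_id|word, per Mike's second layout; HBase does the math
      countTable.incrementColumnValue(Bytes.toBytes(docId + "|" + word),
          CF, COUNT, 1L);
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    countTable.close();
  }
}

One RPC per word is chatty, so a production version would batch the
increments, but the counting itself happens inside HBase with no shuffle at
all.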
>>>>>>>>>> Date: Fri, 16 Sep 2011 23:00:49 +0530
>>>>>>>>>> Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
>>>>>>>>>> From: [email protected]
>>>>>>>>>> To: [email protected]
>>>>>>>>>>
>>>>>>>>>> Hi Michael,
>>>>>>>>>>
>>>>>>>>>> Yes, thanks, I understand the fact that reducers can be
>>>>>>>>>> expensive with all the shuffling and the sorting, and you may
>>>>>>>>>> not need them always. At the same time, there are many cases
>>>>>>>>>> where reducers are useful, like secondary sorting. In many
>>>>>>>>>> cases, one can have multiple map phases and not have a reduce
>>>>>>>>>> phase at all. Again, there will be many cases where one may
>>>>>>>>>> want a reducer, say trying to count the occurrence of words in
>>>>>>>>>> a particular column.
>>>>>>>>>>
>>>>>>>>>> With this thought chain, I do not feel ready to say that when
>>>>>>>>>> dealing with HBase, I really don't want to use a reducer.
>>>>>>>>>> Please correct me if I am wrong.
>>>>>>>>>>
>>>>>>>>>> Thanks again.
>>>>>>>>>>
>>>>>>>>>> Best Regards,
>>>>>>>>>> Sonal
>>>>>>>>>> Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
>>>>>>>>>> Nube Technologies <http://www.nubetech.co>
>>>>>>>>>> <http://in.linkedin.com/in/sonalgoyal>
>>>>>>>>>>
>>>>>>>>>> On Fri, Sep 16, 2011 at 10:35 PM, Michael Segel <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Sonal,
>>>>>>>>>>>
>>>>>>>>>>> Just because you have a m/r job doesn't mean that you need to
>>>>>>>>>>> reduce anything. You can have a job that contains only a
>>>>>>>>>>> mapper. Or your job runner can have a series of map jobs in
>>>>>>>>>>> serial.
>>>>>>>>>>>
>>>>>>>>>>> Most if not all of the map/reduce jobs where we pull data
>>>>>>>>>>> from HBase don't require a reducer.
>>>>>>>>>>>
>>>>>>>>>>> To give you a simple example... if I want to determine the
>>>>>>>>>>> table schema where I am storing some sort of structured data,
>>>>>>>>>>> I just write a m/r job which opens a table and scans it,
>>>>>>>>>>> counting the occurrence of each column name via dynamic
>>>>>>>>>>> counters.
>>>>>>>>>>>
>>>>>>>>>>> There is no need for a reducer.
>>>>>>>>>>>
>>>>>>>>>>> Does that help?
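Mike's schema-discovery job is about as small as an HBase M/R job gets. A
sketch, with the counter group name "columns" chosen arbitrarily:

import java.io.IOException;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;

// Map-only job: tally the occurrence of each column name with dynamic counters.
public class ColumnNameCounter extends TableMapper<NullWritable, NullWritable> {

  @Override
  protected void map(ImmutableBytesWritable row, Result value, Context context)
      throws IOException {
    for (KeyValue kv : value.raw()) {
      String column = Bytes.toString(kv.getFamily()) + ":"
          + Bytes.toString(kv.getQualifier());
      // the framework sums counters across all map tasks: no reducer needed
      context.getCounter("columns", column).increment(1);
    }
  }
}

The framework aggregates counters across all map tasks, so the per-column
totals come out of the job's counter report with no reduce phase.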
>>>>>>>>>>>> Date: Fri, 16 Sep 2011 21:41:01 +0530
>>>>>>>>>>>> Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
>>>>>>>>>>>> From: [email protected]
>>>>>>>>>>>> To: [email protected]
>>>>>>>>>>>>
>>>>>>>>>>>> Michel,
>>>>>>>>>>>>
>>>>>>>>>>>> Sorry, can you please help me understand what you mean when
>>>>>>>>>>>> you say that when dealing with HBase, you really don't want
>>>>>>>>>>>> to use a reducer? Here, HBase is being used as the input to
>>>>>>>>>>>> the MR job.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Sonal
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Sep 16, 2011 at 2:35 PM, Michel Segel <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I think you need to get a little bit more information.
>>>>>>>>>>>>> Reducers are expensive. When Thomas says that he is
>>>>>>>>>>>>> aggregating data, what exactly does he mean? When dealing
>>>>>>>>>>>>> with HBase, you really don't want to use a reducer.
>>>>>>>>>>>>>
>>>>>>>>>>>>> You may want to run two map jobs, and it could be that just
>>>>>>>>>>>>> dumping the output via JDBC makes the most sense.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We are starting to see a lot of questions where the OP
>>>>>>>>>>>>> isn't providing enough information, so the recommendation
>>>>>>>>>>>>> could be wrong...
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sent from a remote device. Please excuse any typos...
>>>>>>>>>>>>>
>>>>>>>>>>>>> Mike Segel
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sep 16, 2011, at 2:22 AM, Sonal Goyal <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> There is a DBOutputFormat class in the
>>>>>>>>>>>>>> org.apache.hadoop.mapreduce.lib.db package, you could use
>>>>>>>>>>>>>> that. Or you could write to HDFS and then use something
>>>>>>>>>>>>>> like HIHO[1] to export to the db. I have been working
>>>>>>>>>>>>>> extensively in this area; you can write to me directly if
>>>>>>>>>>>>>> you need any help.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1. https://github.com/sonalgoyal/hiho
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>>> Sonal
>>>>>>>>>>>>>> Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
>>>>>>>>>>>>>> Nube Technologies <http://www.nubetech.co>
>>>>>>>>>>>>>> <http://in.linkedin.com/in/sonalgoyal>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Sep 16, 2011 at 10:55 AM, Steinmaurer Thomas
>>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We are writing a MR-Job to process HBase data and store
>>>>>>>>>>>>>>> aggregated data in Oracle. How would you do that in a
>>>>>>>>>>>>>>> MR-job?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Currently, for test purposes, we write the result into a
>>>>>>>>>>>>>>> HBase table again by using a TableReducer. Is there
>>>>>>>>>>>>>>> something like an OracleReducer, RelationalReducer,
>>>>>>>>>>>>>>> JDBCReducer or whatever? Or should one simply use plain
>>>>>>>>>>>>>>> JDBC code in the reduce step?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thomas
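As for the original question, the plain-JDBC route is just a reducer that
opens a connection in setup() and inserts one row per aggregate. A sketch,
with the Oracle URL, credentials, and target table as placeholders
(DBOutputFormat, as Sonal notes, is the ready-made alternative):

import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reducer that sums values per key and writes each aggregate to Oracle via
// JDBC. The Oracle driver jar must be on the task classpath.
public class OracleReducer
    extends Reducer<Text, LongWritable, NullWritable, NullWritable> {

  private Connection conn;
  private PreparedStatement insert;

  @Override
  protected void setup(Context context) throws IOException {
    try {
      conn = DriverManager.getConnection(
          "jdbc:oracle:thin:@dbhost:1521:ORCL", "user", "password"); // placeholders
      insert = conn.prepareStatement(
          "INSERT INTO aggregates (agg_key, total) VALUES (?, ?)");
    } catch (SQLException e) {
      throw new IOException(e);
    }
  }

  @Override
  protected void reduce(Text key, Iterable<LongWritable> values, Context context)
      throws IOException {
    long total = 0;
    for (LongWritable v : values) {
      total += v.get();
    }
    try {
      insert.setString(1, key.toString());
      insert.setLong(2, total);
      insert.executeUpdate();
    } catch (SQLException e) {
      throw new IOException(e);
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    try {
      insert.close();
      conn.close();
    } catch (SQLException e) {
      throw new IOException(e);
    }
  }
}

Each reduce task opens its own connection, so the number of reducers should
stay modest relative to what the database will accept.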
