Chris, agreed... There are situations where reducers aren't required, and others where they are useful. We have both kinds of jobs.
For others following the thread, I updated the book recently with more MR examples (read-only, read-write, read-summary):
http://hbase.apache.org/book.html#mapreduce.example

As to the question that started this thread... re: "Store aggregated data in Oracle." To me, that sounds like the "read-summary" example with JDBC-Oracle in the reduce step. A rough sketch of what that could look like is at the end of this message.

On 9/16/11 2:58 PM, "Chris Tarnas" <[email protected]> wrote:

> If only I could make NY in Nov :)
>
> We extract large numbers of DNA sequence reads from HBase, run them
> through M/R pipelines to analyze and aggregate, and then we load the
> results back in. Definitely specialized usage, but I could see other
> perfectly valid uses for reducers with HBase.
>
> -chris
>
> On Sep 16, 2011, at 11:43 AM, Michael Segel wrote:
>
>> Sonal,
>>
>> You do realize that HBase is a "database", right? ;-)
>>
>> So again, why do you need a reducer? ;-)
>>
>> Using your example...
>> "Again, there will be many cases where one may want a reducer, say
>> trying to count the occurrence of words in a particular column."
>>
>> You can do this one of two ways...
>> 1) Dynamic counters in Hadoop.
>> 2) Use a temp table and auto-increment the value in a column which
>> contains the word count. (Fat row where the rowkey is doc_id and the
>> column is word, or the rowkey is doc_id|word.)
>>
>> I'm sorry, but if you go through all of your examples of why you would
>> want to use a reducer, you end up finding out that writing to an HBase
>> table would be faster than a reduce job.
>> (Again, we haven't done an exhaustive search, but in all of the HBase
>> jobs we've run... no reducers were necessary.)
>>
>> The point I'm trying to make is that you want to avoid using a reducer
>> whenever possible, and if you think about your problem... you can
>> probably come up with a solution that avoids the reducer...
>>
>> HTH
>>
>> -Mike
>> PS. I haven't looked at *all* of the potential use cases of HBase, which
>> is why I don't want to say you'll never need a reducer. I will say that,
>> based on what we've done at my client's site, we try very hard to avoid
>> reducers.
>> [Note, I'm sure I'm going to get hammered on this when I head to NY in
>> Nov. :-) ]
>>
>>> Date: Fri, 16 Sep 2011 23:00:49 +0530
>>> Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
>>> From: [email protected]
>>> To: [email protected]
>>>
>>> Hi Michael,
>>>
>>> Yes, thanks, I understand the fact that reducers can be expensive with
>>> all the shuffling and the sorting, and you may not need them always. At
>>> the same time, there are many cases where reducers are useful, like
>>> secondary sorting. In many cases, one can have multiple map phases and
>>> not have a reduce phase at all. Again, there will be many cases where
>>> one may want a reducer, say trying to count the occurrence of words in
>>> a particular column.
>>>
>>> With this thought chain, I do not feel ready to say that when dealing
>>> with HBase, I really don't want to use a reducer. Please correct me if
>>> I am wrong.
>>>
>>> Thanks again.
>>>
>>> Best Regards,
>>> Sonal
>>> Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
>>> Nube Technologies <http://www.nubetech.co>
>>> <http://in.linkedin.com/in/sonalgoyal>
>>>
>>> On Fri, Sep 16, 2011 at 10:35 PM, Michael Segel
>>> <[email protected]> wrote:
>>>
>>>> Sonal,
>>>>
>>>> Just because you have a m/r job doesn't mean that you need to reduce
>>>> anything. You can have a job that contains only a mapper.
>>>> Or your job runner can have a series of map jobs in serial.
>>>>
>>>> Most, if not all, of the map/reduce jobs where we pull data from HBase
>>>> don't require a reducer.
>>>>
>>>> To give you a simple example... if I want to determine the table schema
>>>> where I am storing some sort of structured data...
>>>> I just write a m/r job which opens a table and scans it, counting the
>>>> occurrence of each column name via dynamic counters.
>>>>
>>>> There is no need for a reducer.
>>>>
>>>> Does that help?
>>>>
>>>>> Date: Fri, 16 Sep 2011 21:41:01 +0530
>>>>> Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
>>>>> From: [email protected]
>>>>> To: [email protected]
>>>>>
>>>>> Michel,
>>>>>
>>>>> Sorry, can you please help me understand what you mean when you say
>>>>> that when dealing with HBase, you really don't want to use a reducer?
>>>>> Here, HBase is being used as the input to the MR job.
>>>>>
>>>>> Thanks
>>>>> Sonal
>>>>>
>>>>> On Fri, Sep 16, 2011 at 2:35 PM, Michel Segel
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> I think you need to get a little bit more information.
>>>>>> Reducers are expensive.
>>>>>> When Thomas says that he is aggregating data, what exactly does he
>>>>>> mean?
>>>>>> When dealing w/ HBase, you really don't want to use a reducer.
>>>>>>
>>>>>> You may want to run two map jobs, and it could be that just dumping
>>>>>> the output via JDBC makes the most sense.
>>>>>>
>>>>>> We are starting to see a lot of questions where the OP isn't
>>>>>> providing enough information, so the recommendation could be wrong...
>>>>>>
>>>>>> Sent from a remote device. Please excuse any typos...
>>>>>>
>>>>>> Mike Segel
>>>>>>
>>>>>> On Sep 16, 2011, at 2:22 AM, Sonal Goyal <[email protected]> wrote:
>>>>>>
>>>>>>> There is a DBOutputFormat class in the
>>>>>>> org.apache.hadoop.mapreduce.lib.db package, you could use that. Or
>>>>>>> you could write to HDFS and then use something like HIHO[1] to
>>>>>>> export to the db. I have been working extensively in this area,
>>>>>>> you can write to me directly if you need any help.
>>>>>>>
>>>>>>> 1. https://github.com/sonalgoyal/hiho
>>>>>>>
>>>>>>> Best Regards,
>>>>>>> Sonal
>>>>>>> Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
>>>>>>> Nube Technologies <http://www.nubetech.co>
>>>>>>> <http://in.linkedin.com/in/sonalgoyal>
>>>>>>>
>>>>>>> On Fri, Sep 16, 2011 at 10:55 AM, Steinmaurer Thomas <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> we are writing a MR-Job to process HBase data and store aggregated
>>>>>>>> data in Oracle. How would you do that in a MR-job?
>>>>>>>>
>>>>>>>> Currently, for test purposes, we write the result into an HBase
>>>>>>>> table again by using a TableReducer. Is there something like an
>>>>>>>> OracleReducer, RelationalReducer, JDBCReducer or whatever? Or
>>>>>>>> should one simply use plain JDBC code in the reduce step?
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>> Thomas
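
To make the "read-summary with JDBC-Oracle in the reduce step" answer to Thomas's original question concrete, here is a minimal sketch. It is not code from anyone on this thread: the HBase table "mytable", the column "cf:word", the Oracle table "word_summary", and the JDBC URL/credentials are placeholders, and the Oracle JDBC driver jar is assumed to be on the task classpath (e.g. shipped with the job via -libjars). Sonal's DBOutputFormat (org.apache.hadoop.mapreduce.lib.db) would be the packaged alternative to the hand-rolled JDBC shown here.

import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class HBaseToOracleSummary {

  // Emits (cell value, 1) for every row that has the column we care about.
  static class SummaryMapper extends TableMapper<Text, LongWritable> {
    private static final byte[] CF = Bytes.toBytes("cf");     // placeholder family
    private static final byte[] COL = Bytes.toBytes("word");  // placeholder qualifier
    private static final LongWritable ONE = new LongWritable(1);

    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
        throws IOException, InterruptedException {
      byte[] cell = value.getValue(CF, COL);
      if (cell != null) {
        context.write(new Text(Bytes.toString(cell)), ONE);  // emit (word, 1)
      }
    }
  }

  // Sums the counts per key and writes each summary row to Oracle via plain JDBC.
  static class OracleSummaryReducer
      extends Reducer<Text, LongWritable, Text, LongWritable> {
    private Connection conn;
    private PreparedStatement insert;

    @Override
    protected void setup(Context context) throws IOException {
      try {
        // Placeholder URL/credentials; in practice read them from the job Configuration.
        conn = DriverManager.getConnection(
            "jdbc:oracle:thin:@dbhost:1521:ORCL", "user", "password");
        conn.setAutoCommit(false);
        insert = conn.prepareStatement(
            "INSERT INTO word_summary (word, cnt) VALUES (?, ?)");
      } catch (SQLException e) {
        throw new IOException("Could not open JDBC connection", e);
      }
    }

    @Override
    protected void reduce(Text word, Iterable<LongWritable> counts, Context context)
        throws IOException {
      long total = 0;
      for (LongWritable c : counts) total += c.get();
      try {
        insert.setString(1, word.toString());
        insert.setLong(2, total);
        insert.addBatch();   // batch the rows rather than one round trip per INSERT
      } catch (SQLException e) {
        throw new IOException(e);
      }
    }

    @Override
    protected void cleanup(Context context) throws IOException {
      try {
        insert.executeBatch();
        conn.commit();
        conn.close();
      } catch (SQLException e) {
        throw new IOException(e);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "hbase-to-oracle-summary");
    job.setJarByClass(HBaseToOracleSummary.class);
    Scan scan = new Scan();
    scan.setCaching(500);        // bigger scanner caching for a full-table scan
    scan.setCacheBlocks(false);  // don't churn the block cache
    TableMapReduceUtil.initTableMapperJob("mytable", scan,
        SummaryMapper.class, Text.class, LongWritable.class, job);
    job.setReducerClass(OracleSummaryReducer.class);
    job.setNumReduceTasks(1);    // one reducer keeps the Oracle connection count low
    job.setOutputFormatClass(NullOutputFormat.class);  // nothing to write to HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

A single reducer keeps the number of Oracle connections down; if the summary is large you would raise the reducer count and let each task hold its own connection and batch.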
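And, for the no-reducer side of the discussion, a minimal sketch of the map-only pattern Mike describes: scan a table and count the occurrence of each column name with dynamic counters, no reduce phase at all. Again the table name "mytable" and the class names are only illustrative; the counts show up in the job's counter output when it completes. Note that Hadoop caps the number of counters per job, so this only works for a modest number of distinct column names.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class ColumnCounter {

  static class ColumnCountMapper
      extends TableMapper<NullWritable, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable row, Result result, Context context) {
      for (KeyValue kv : result.raw()) {
        String column = Bytes.toString(kv.getFamily()) + ":"
            + Bytes.toString(kv.getQualifier());
        // One dynamic counter per distinct family:qualifier seen in the scan.
        context.getCounter("columns", column).increment(1);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "column-counter");
    job.setJarByClass(ColumnCounter.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // larger scanner caching for a full-table scan
    scan.setCacheBlocks(false);  // don't pollute the block cache

    TableMapReduceUtil.initTableMapperJob("mytable", scan,
        ColumnCountMapper.class, NullWritable.class, NullWritable.class, job);

    job.setNumReduceTasks(0);                          // map-only, no reducer
    job.setOutputFormatClass(NullOutputFormat.class);  // no file output either

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}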
