If only I could make NY in Nov :)

We extract large numbers of DNA sequence reads from HBase, run them through 
M/R pipelines to analyze and aggregate them, and then load the results back in. 
Definitely specialized usage, but I could see other perfectly valid uses for 
reducers with HBase.

-chris
 
On Sep 16, 2011, at 11:43 AM, Michael Segel wrote:

> 
> Sonal,
> 
> You do realize that HBase is a "database", right? ;-)
> 
> So again, why do you need a reducer?  ;-)
> 
> Using your example...
> "Again, there will be many cases where one may want a reducer, say trying to 
> count the occurrence of words in a particular column."
> 
> You can do this one of two ways...
> 1) Dynamic Counters in Hadoop.
> 2) Use a temp table and auto-increment the value in a column which contains 
> the word count.  (Either a fat row where the rowkey is doc_id and each column 
> is a word, or a tall layout where the rowkey is doc_id|word.)
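> The counting logic behind option 2 can be sketched like this (doc/word names
> are illustrative; in the real map-only job, each map update below would instead
> be an HTable.incrementColumnValue call against the temp table, so the
> aggregation happens inside HBase rather than in a reducer):

```java
import java.util.HashMap;
import java.util.Map;

public class WordCountSketch {

    // Rowkey scheme from option 2: doc_id|word
    static String rowKey(String docId, String word) {
        return docId + "|" + word;
    }

    // Count the words in one column value. In the real map-only job, each
    // counts.put(...) below would instead be an
    // HTable.incrementColumnValue(rowKey, family, qualifier, 1) against the
    // temp table, and the map stands in for that table here.
    static Map<String, Long> countWords(String docId, String columnValue) {
        Map<String, Long> counts = new HashMap<String, Long>();
        for (String word : columnValue.toLowerCase().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            String key = rowKey(docId, word);
            Long prev = counts.get(key);
            counts.put(key, prev == null ? 1L : prev + 1L);
        }
        return counts;
    }
}
```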
> 
> I'm sorry, but if you go through all of your examples of why you would want 
> to use a reducer, you'll find that writing to an HBase table is faster than 
> running a reduce job.
> (Again we haven't done an exhaustive search, but in all of the HBase jobs 
> we've run... no reducers were necessary.)
> 
> The point I'm trying to make is that you want to avoid using a reducer 
> whenever possible and if you think about your problem... you can probably 
> come up with a solution that avoids the reducer...
> 
> 
> HTH
> 
> -Mike
> PS. I haven't looked at *all* of the potential use cases of HBase which is 
> why I don't want to say you'll never need a reducer. I will say that based on 
> what we've done at my client's site, we try very hard to avoid reducers.
> [Note, I'm sure I'm going to get hammered on this when I head to NY in Nov. 
> :-)   ]
> 
> 
>> Date: Fri, 16 Sep 2011 23:00:49 +0530
>> Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
>> From: [email protected]
>> To: [email protected]
>> 
>> Hi Michael,
>> 
>> Yes, thanks, I understand that reducers can be expensive with all
>> the shuffling and sorting, and you may not always need them. At the same
>> time, there are many cases where reducers are useful, like secondary
>> sorting. In many cases, one can have multiple map phases and no
>> reduce phase at all. Again, there will be many cases where one may want a
>> reducer, say trying to count the occurrence of words in a particular column.
>> 
>> 
>> With this thought chain, I do not feel ready to say that when dealing with
>> HBase, I really don't want to use a reducer. Please correct me if I am
>> wrong.
>> 
>> Thanks again.
>> 
>> Best Regards,
>> Sonal
>> Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
>> Nube Technologies <http://www.nubetech.co>
>> 
>> <http://in.linkedin.com/in/sonalgoyal>
>> 
>> 
>> 
>> 
>> 
>> On Fri, Sep 16, 2011 at 10:35 PM, Michael Segel
>> <[email protected]>wrote:
>> 
>>> 
>>> Sonal,
>>> 
>>> Just because you have an m/r job doesn't mean that you need to reduce
>>> anything. You can have a job that contains only a mapper,
>>> or your job runner can run a series of map jobs in sequence.
>>> 
>>> Most if not all of the map/reduce jobs where we pull data from HBase, don't
>>> require a reducer.
>>> 
>>> To give you a simple example... if I want to determine the table schema
>>> where I am storing some sort of structured data...
>>> I just write an m/r job which opens a table and scans it, counting the
>>> occurrence of each column name via dynamic counters.
>>> 
>>> There is no need for a reducer.
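>>> The per-row mapper logic for that schema scan looks roughly like this
>>> (standalone sketch; in the real TableMapper the map would call
>>> context.getCounter("columns", name).increment(1), and Hadoop sums the
>>> counters across all map tasks, which is why no reducer is needed):

```java
import java.util.List;
import java.util.Map;

public class SchemaCounterSketch {

    // Per-row logic of the mapper: bump one counter per column name seen.
    // In the real TableMapper this would be
    // context.getCounter("columns", name).increment(1); the framework
    // aggregates the counters across all map tasks, so no reducer is needed.
    static void tallyRow(List<String> columnNames, Map<String, Long> counters) {
        for (String name : columnNames) {
            Long prev = counters.get(name);
            counters.put(name, prev == null ? 1L : prev + 1L);
        }
    }
}
```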
>>> 
>>> Does that help?
>>> 
>>> 
>>>> Date: Fri, 16 Sep 2011 21:41:01 +0530
>>>> Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
>>>> From: [email protected]
>>>> To: [email protected]
>>>> 
>>>> Michel,
>>>> 
>>>> Sorry, can you please help me understand what you mean when you say that
>>>> when dealing with HBase, you really don't want to use a reducer? Here,
>>>> HBase is being used as the input to the MR job.
>>>> 
>>>> Thanks
>>>> Sonal
>>>> 
>>>> 
>>>> On Fri, Sep 16, 2011 at 2:35 PM, Michel Segel <[email protected]>
>>>> wrote:
>>>> 
>>>>> I think you need to get a little bit more information.
>>>>> Reducers are expensive.
>>>>> When Thomas says that he is aggregating data, what exactly does he mean?
>>>>> When dealing with HBase, you really don't want to use a reducer.
>>>>> 
>>>>> You may want to run two map jobs and it could be that just dumping the
>>>>> output via jdbc makes the most sense.
>>>>> 
>>>>> We are starting to see a lot of questions where the OP isn't providing
>>>>> enough information, so the recommendations could be wrong...
>>>>> 
>>>>> 
>>>>> Sent from a remote device. Please excuse any typos...
>>>>> 
>>>>> Mike Segel
>>>>> 
>>>>> On Sep 16, 2011, at 2:22 AM, Sonal Goyal <[email protected]> wrote:
>>>>> 
>>>>>> There is a DBOutputFormat class in the org.apache.hadoop.mapreduce.lib.db
>>>>>> package; you could use that. Or you could write to HDFS and then use
>>>>>> something like HIHO[1] to export to the db. I have been working extensively
>>>>>> in this area; you can write to me directly if you need any help.
>>>>>> 
>>>>>> 1. https://github.com/sonalgoyal/hiho
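>>>>>> A minimal wiring sketch for DBOutputFormat (the table, column, user, and
>>>>>> JDBC URL values below are placeholders; the value class the reducer emits
>>>>>> must implement DBWritable and fill the PreparedStatement in its write()
>>>>>> method):

```java
// Job-configuration sketch only; table/column names and the JDBC URL
// are placeholders, not values from this thread.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBOutputFormat;

public class OracleExportJob {

    public static Job configure(Configuration conf) throws Exception {
        // Driver class, connection URL, user, and password for the target DB.
        DBConfiguration.configureDB(conf,
            "oracle.jdbc.driver.OracleDriver",
            "jdbc:oracle:thin:@dbhost:1521:orcl", "user", "password");

        Job job = new Job(conf, "hbase-to-oracle");
        job.setOutputFormatClass(DBOutputFormat.class);

        // One field name per column of the value class, which must implement
        // DBWritable so DBOutputFormat can bind it to the INSERT statement.
        DBOutputFormat.setOutput(job, "AGG_RESULTS", "ROW_KEY", "TOTAL");
        return job;
    }
}
```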
>>>>>> 
>>>>>> Best Regards,
>>>>>> Sonal
>>>>>> Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
>>>>>> Nube Technologies <http://www.nubetech.co>
>>>>>> 
>>>>>> <http://in.linkedin.com/in/sonalgoyal>
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Fri, Sep 16, 2011 at 10:55 AM, Steinmaurer Thomas <
>>>>>> [email protected]> wrote:
>>>>>> 
>>>>>>> Hello,
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> We are writing an MR-Job to process HBase data and store aggregated
>>>>>>> data in Oracle. How would you do that in an MR-Job?
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Currently, for test purposes, we write the result into an HBase table
>>>>>>> again by using a TableReducer. Is there something like an OracleReducer,
>>>>>>> RelationalReducer, JDBCReducer or whatever? Or should one simply use
>>>>>>> plain JDBC code in the reduce step?
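>>>>>>> For the plain-JDBC route, the reducer-side pieces can be sketched in
>>>>>>> isolation like this (table and column names are made up; the real
>>>>>>> reducer would prepare the statement once in setup() and bind/execute
>>>>>>> it per key):

```java
import java.util.Iterator;

public class JdbcReduceSketch {

    // Build the parameterized INSERT the reducer would prepare once in
    // setup() via connection.prepareStatement(...).
    static String insertSql(String table, String... cols) {
        StringBuilder sql = new StringBuilder("INSERT INTO " + table + " (");
        StringBuilder marks = new StringBuilder();
        for (int i = 0; i < cols.length; i++) {
            if (i > 0) {
                sql.append(", ");
                marks.append(", ");
            }
            sql.append(cols[i]);
            marks.append("?");
        }
        return sql.append(") VALUES (").append(marks).append(")").toString();
    }

    // The aggregation itself: sum the values for one key. The reduce() method
    // would then bind the key and total and call executeUpdate() or addBatch().
    static long sum(Iterator<Long> values) {
        long total = 0;
        while (values.hasNext()) {
            total += values.next();
        }
        return total;
    }
}
```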
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Thanks!
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Thomas
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>> 
>>> 
>>> 
