After a quick scan of the performance section, I didn't see what I consider to be a huge performance consideration: if at all possible, don't run a reduce phase for your puts. The shuffle/sort part of the map/reduce paradigm is often wasted work if all you are trying to do is insert or update data in HBase. From the OP's description, it sounds like he doesn't need any kind of reduce phase [and may be a great candidate for bulk loading and pre-creating regions]. In any case, don't reduce if you can avoid it.
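To make the region pre-creation idea concrete, here is a rough sketch of computing split points up front. It assumes the row keys are spread evenly over the first byte (for example, hashed or salted keys), which may not match the OP's schema. The names SplitKeys and evenSplits are made up for illustration; the resulting byte[][] is the kind of argument you would pass as the split keys to HBaseAdmin.createTable(descriptor, splitKeys).

```java
// Hypothetical helper: not part of HBase, just plain Java.
class SplitKeys {

    // Returns numRegions - 1 single-byte split points that divide the
    // first-byte key space 0x00..0xFF into roughly equal ranges.
    // Assumption: row keys are evenly distributed over their first byte.
    static byte[][] evenSplits(int numRegions) {
        if (numRegions < 2 || numRegions > 256) {
            throw new IllegalArgumentException("numRegions must be in 2..256");
        }
        byte[][] splits = new byte[numRegions - 1][];
        for (int i = 1; i < numRegions; i++) {
            splits[i - 1] = new byte[] { (byte) (i * 256 / numRegions) };
        }
        return splits;
    }

    public static void main(String[] args) {
        // Print the split points for a table pre-created with 4 regions.
        for (byte[] split : evenSplits(4)) {
            System.out.printf("%02x%n", split[0] & 0xff);
        }
    }
}
```

With the table pre-split this way, the puts from many clients or map tasks land on several regions from the start instead of all hitting one region until it splits.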
Dave

-----Original Message-----
From: Doug Meil [mailto:[email protected]]
Sent: Sunday, July 17, 2011 4:40 PM
To: [email protected]
Subject: Re: loading data in HBase table using APIs

Hi there-

Take a look at this for starters:

http://hbase.apache.org/book.html#schema

1) Double-check your row keys (sanity check); that's in the Schema Design chapter.

http://hbase.apache.org/book.html#performance

2) If not using bulk-load, pre-create regions. Do this regardless of using MR or non-MR.

3) If not using an MR job and you are using multiple threads with the Java API, take a look at HTableUtil. It's on trunk, but that utility can help you.

On 7/17/11 4:08 PM, "abhay ratnaparkhi" <[email protected]> wrote:

>Hello,
>
>I am loading lots of data through API in HBase table.
>I am using HBase Java API to do this.
>If I convert this code to map-reduce task and use *TableOutputFormat*
>class
>then will I get any performance improvement?
>
>As I am not getting input data from existing HBase table or HDFS files
>there
>will not be any input to map task.
>The only advantage is multiple map tasks running simultaneously might make
>processing faster.
>
>Thanks!
>Regards,
>Abhay
