So your answer would be that it is better to aim for the best possible load balancing during the reduce phase than to make sure each output HFile fits within a single region, because the splitting done by LoadIncrementalHFiles is rather fast?
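
To make the trade-off concrete, here is a minimal sketch of region-aligned partitioning and the reducer skew it can cause. It is not Hadoop's TotalOrderPartitioner or any HBase API: the class name, method names, region start keys, and sample row keys are all made up for illustration, and it compares Java strings, whereas HBase orders row keys as raw bytes.

```java
import java.util.Arrays;

// Hypothetical stand-in for a region-aligned partitioner; not a Hadoop/HBase class.
public class RegionPartitionSketch {

    // Returns the index of the region whose [startKey, nextStartKey) range
    // contains the key. regionStartKeys must be sorted and start with "".
    public static int partitionFor(String key, String[] regionStartKeys) {
        int pos = Arrays.binarySearch(regionStartKeys, key);
        // Exact match: the key is a region start key and belongs to that region.
        // Otherwise binarySearch returns -(insertionPoint) - 1, and the owning
        // region is the one just before the insertion point.
        return pos >= 0 ? pos : -pos - 2;
    }

    // Counts how many keys each reducer would receive (one reducer per region).
    public static int[] countPerPartition(String[] keys, String[] regionStartKeys) {
        int[] counts = new int[regionStartKeys.length];
        for (String key : keys) {
            counts[partitionFor(key, regionStartKeys)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Assumed region boundaries: ["", "g"), ["g", "p"), ["p", end)
        String[] regionStartKeys = {"", "g", "p"};
        // Assumed sample row keys, deliberately skewed toward the first region.
        String[] keys = {"apple", "banana", "cherry", "date", "fig", "kiwi", "zebra"};
        // prints [5, 1, 1]
        System.out.println(Arrays.toString(countPerPartition(keys, regionStartKeys)));
    }
}
```

With these made-up keys, the reducer for the first region handles five of the seven rows. That is the imbalance I am asking about: evenly sized partitions would remove it, but then an HFile could span more than one region.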

> Date: Wed, 25 May 2011 09:20:10 -0700
> Subject: Re: HFiles that fit within a single region VS better load balancing at reduce phase
> From: [email protected]
> To: [email protected]
>
> LoadIncrementalHFiles would split HFile if it doesn't fit within a single
> region.
>
> Please refer to the following JIRAs which speedup LoadIncrementalHFiles:
> https://issues.apache.org/jira/browse/HBASE-3871
> https://issues.apache.org/jira/browse/HBASE-3721
>
> Note: parallelizing splitting of HFile(s) by LoadIncrementalHFiles is done
> on a single machine.
>
> Thanks
>
> 2011/5/25 Panayotis Antonopoulos <[email protected]>
>
> >
> > Hello,
> > I am currently working on a MR job that will output HFiles that will be
> > bulk loaded in an HBase Table.
> > According to the HBase site in order for the bulk loading to be efficient
> > each HFile of the MR job should fit within a single region.
> > In order to achieve that I use the TotalOrderPartitioner so that each
> > reducer gets Key/Value pairs from a single region.
> > However this prevents partitioning Mapper's output in equal splits so that
> > I have the best possible load balancing during the reduce phase.
> >
> > So I would like to ask you how important is to create HFiles that fit
> > within a single region.
> > If it makes bulk loading much faster probably it is better to sacrifice
> > load balancing.
> > But is this the case?
> > Has anyone tried both choices?
> >
> > Thank you in advance!
> > Panagiotis.
