Hello,
I am currently working on an MR job that outputs HFiles, which will then be
bulk loaded into an HBase table.
According to the HBase site, for the bulk load to be efficient, each HFile
produced by the MR job should fit within a single region.
To achieve that, I use the TotalOrderPartitioner so that each reducer
receives key/value pairs belonging to a single region.
However, this prevents me from partitioning the mappers' output into equal
splits, which would give the best possible load balancing during the reduce
phase.
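To make the trade-off concrete, here is a minimal Python sketch (not the actual job code). It mimics the idea of partitioning by region boundaries: each row key goes to the reducer for the region whose key range contains it. The region start keys, the row keys, and the helper names `region_index` and `reducer_load` are all made up for illustration.

```python
from bisect import bisect_right
from collections import Counter


def region_index(row_key, region_start_keys):
    """Return the index of the region whose key range contains row_key.

    region_start_keys must be sorted, and its first entry must be the empty
    key b"", which marks the start of the first region.
    """
    # bisect_right counts the start keys that are <= row_key, so the
    # containing region is the last of those (start keys are inclusive).
    return bisect_right(region_start_keys, row_key) - 1


def reducer_load(row_keys, region_start_keys):
    """Count the rows each reducer receives with one reducer per region."""
    counts = Counter(region_index(k, region_start_keys) for k in row_keys)
    return [counts.get(i, 0) for i in range(len(region_start_keys))]


# Hypothetical table with three regions: ["", "g"), ["g", "p"), ["p", end).
region_start_keys = [b"", b"g", b"p"]

# Hypothetical, skewed mapper output: most keys fall into the first region.
row_keys = [b"apple", b"avocado", b"banana", b"cherry", b"grape", b"plum"]

print(reducer_load(row_keys, region_start_keys))  # [4, 1, 1]
```

With these made-up keys the first reducer gets four rows while the other two get one each, which is the kind of imbalance I am worried about. An even split would give two rows per reducer, but a reducer's output could then span more than one region.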
So I would like to ask how important it is to create HFiles that each fit
within a single region.
If it makes bulk loading much faster, it is probably better to sacrifice load
balancing.
But is this the case?
Has anyone tried both approaches and compared them?
Thank you in advance!
Panagiotis.