Re: Speeding up LoadIncrementalHFiles?

Adam Phelps Thu, 31 Mar 2011 11:15:23 -0700

On 3/30/11 8:39 PM, Stack wrote:

What is slow?  The running of the LoadIncrementalHFiles or the copy?


It's the LoadIncrementalHFiles portion.

If the former, is it because the table it's loading into has different
boundaries than those of the HFiles so the HFiles have to be split?

I'm sure that could be one aspect of it; however, from the logs it looks like <1% of the HFiles we're loading have to be split. Looking at the code for LoadIncrementalHFiles (HBase v0.90.1), I'm actually thinking our problem is that this code loads the HFiles sequentially. Our largest table has over 2500 regions and the data being loaded is fairly well distributed across them, so there end up being around 2500 HFiles for each load period. At 1-2 seconds per HFile, that means the loading process is very time consuming.

On the primary cluster (16 regionservers) one of these sets of HFiles loads in ~350s vs ~3200s on the backup (with 4 regionservers). Overall the nodes on the backup cluster are running at around 5% CPU (and similarly minimal disk and network usage). So we have plenty of resources to throw at the problem; it's just a matter of determining what we can do here other than adding additional nodes to the cluster.

My first thoughts are to try to add some parallelism, either by splitting the HFiles into multiple chunks for separate load instances, or by changing LoadIncrementalHFiles itself to use multiple loading threads.
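
As a rough illustration of the second option (multiple loading threads), here is a minimal sketch in plain Java. It is not taken from LoadIncrementalHFiles and makes no HBase calls: the class name ParallelHFileLoadSketch, the method loadAll, and the loadOne callback are all hypothetical, and loadOne merely stands in for whatever per-HFile load step the real tool performs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

// Hypothetical sketch: none of these names come from HBase.
class ParallelHFileLoadSketch {

    // Runs loadOne for every path on a fixed-size thread pool and
    // returns the number of loads that completed without error.
    static int loadAll(List<String> hfilePaths, int threads, Consumer<String> loadOne) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one task per HFile; the pool runs up to `threads` at once.
            List<Future<?>> pending = new ArrayList<>();
            for (String path : hfilePaths) {
                pending.add(pool.submit(() -> loadOne.accept(path)));
            }
            // Wait for every task so that a failed load is not silently dropped.
            int loaded = 0;
            for (Future<?> task : pending) {
                try {
                    task.get();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while loading", e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException("a load task failed", e.getCause());
                }
                loaded++;
            }
            return loaded;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> paths = new ArrayList<>();
        for (int i = 0; i < 20; i++) {
            paths.add("hfile-" + i); // placeholder names, not real HDFS paths
        }
        // The callback is a no-op here; a real loader would go in its place.
        int n = loadAll(paths, 4, path -> { });
        System.out.println("loaded " + n + " files");
    }
}
```

For scale, 2500 HFiles at 1-2 seconds each is roughly 2500-5000 seconds when done one after another, which is in line with the ~3200s seen on the backup cluster. Whether the regionservers would actually sustain several concurrent loads is an assumption that would need testing.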

Is your data only coming in via bulk load?

Yes, everything we put into HBase is via bulk load. We found it to be a huge improvement over doing individual Puts from the M/R jobs.


- Adam
