Adam: I logged https://issues.apache.org/jira/browse/HBASE-3721
Feel free to comment on that JIRA.

On Thu, Mar 31, 2011 at 11:14 AM, Adam Phelps <[email protected]> wrote:

> On 3/30/11 8:39 PM, Stack wrote:
>
>> What is slow?  The running of the LoadIncrementHFiles or the copy?
>
> It's the LoadIncrementHFiles portion.
>
>> If the former, is it because the table it's loading into has different
>> boundaries than those of the HFiles so the HFiles have to be split?
>
> I'm sure that could be one aspect of it, however from the logs it looks
> like <1% of the hfiles we're loading have to be split.  Looking at the code
> for LoadIncrementHFiles (hbase v0.90.1), I'm actually thinking our problem
> is that this code loads the hfiles sequentially.  Our largest table has over
> 2500 regions and the data being loaded is fairly well distributed across
> them, so there end up being around 2500 HFiles for each load period.  At 1-2
> seconds per HFile that means the loading process is very time consuming.
>
> On the primary cluster (16 regionservers) one of these sets of HFiles loads
> in ~350s vs ~3200s on the backup (with 4 regionservers).  Overall the nodes
> on the backup cluster are running at around 5% CPU (and similarly minimal
> disk and network usage).  So we have plenty of resources to throw at the
> problem, it's just a matter of determining what we can do here other than
> adding additional nodes to the cluster.
>
> My first thoughts are to try to add some parallelism, either by splitting
> the HFiles into multiple chunks for separate load instances, or to change
> LoadIncrementHFiles itself to use multiple loading threads.
>
>> Is your data only coming in via bulk load?
>
> Yes, everything we put into hbase is via bulk load.  We found it to be a
> huge improvement over doing individual Puts from the M/R jobs.
>
> - Adam
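
For what it's worth, the "multiple loading threads" idea quoted above could be sketched roughly as follows. This is a generic Java thread-pool illustration only, not the actual LoadIncrementHFiles code and not the change tracked in HBASE-3721. The names `ParallelLoadSketch`, `loadAll`, and the `loadOne` callback are hypothetical; `loadOne` stands in for whatever call loads a single HFile.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

// Hypothetical sketch: fan per-HFile loads out over a fixed thread pool.
final class ParallelLoadSketch {

    /**
     * Calls loadOne once per path using `threads` worker threads and returns
     * how many paths completed without throwing. loadOne is an assumed
     * placeholder for the real single-HFile load step.
     */
    static int loadAll(List<String> hfilePaths, int threads, Consumer<String> loadOne)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // One task per HFile; a failed load is reported as false.
            List<Callable<Boolean>> tasks = new ArrayList<>();
            for (String path : hfilePaths) {
                tasks.add(() -> {
                    try {
                        loadOne.accept(path);
                        return true;
                    } catch (RuntimeException e) {
                        return false;
                    }
                });
            }

            // invokeAll blocks until every task has finished.
            int loaded = 0;
            for (Future<Boolean> result : pool.invokeAll(tasks)) {
                try {
                    if (result.get()) {
                        loaded++;
                    }
                } catch (ExecutionException e) {
                    // Non-RuntimeException failure: count the file as not loaded.
                }
            }
            return loaded;
        } finally {
            pool.shutdown();
        }
    }
}
```

With roughly 2500 HFiles at 1-2 seconds each, a sequential pass is on the order of 2500 to 5000 seconds, so even a handful of threads might cut the wall-clock time noticeably if the regionservers are as idle as described. Whether the real load path is safe to call concurrently is something the sketch assumes rather than shows.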
