Adam:
I logged https://issues.apache.org/jira/browse/HBASE-3721

Feel free to comment on that JIRA.
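
In case it helps the discussion there, here is a rough sketch of the multi-threaded variant you describe below. It is not the LoadIncrementalHFiles code and not a patch against it: loadOne is a hypothetical stand-in for whatever loads a single HFile into its region, and the class name, method name, and thread count are made up for illustration. It also uses lambdas, so it assumes a newer JDK than 0.90.1 targets. At 1-2 seconds per HFile, 2500 HFiles take roughly 2500-5000 seconds sequentially, which matches the ~3200s you see on the backup cluster. If the regionservers are as idle as you report, N loader threads might cut that to roughly 1/N, though that is untested.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

// Sketch only: loadOne is a hypothetical callback, not an HBase API.
class ParallelBulkLoadSketch {

    /**
     * Runs loadOne for every HFile path on a fixed-size thread pool and
     * returns the paths whose load threw an exception, in input order.
     */
    static List<String> loadAll(List<String> hfilePaths, int threads,
                                Consumer<String> loadOne) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one task per HFile; the pool caps concurrency at 'threads'.
            List<Future<?>> pending = new ArrayList<>();
            for (String path : hfilePaths) {
                pending.add(pool.submit(() -> loadOne.accept(path)));
            }
            // Wait for every task and collect the ones that failed.
            List<String> failed = new ArrayList<>();
            for (int i = 0; i < pending.size(); i++) {
                try {
                    pending.get(i).get();
                } catch (ExecutionException e) {
                    failed.add(hfilePaths.get(i));
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while loading", e);
                }
            }
            return failed;
        } finally {
            pool.shutdown();
        }
    }
}
```

Returning the failed paths instead of aborting on the first error means one bad HFile does not stall the other ~2500, and the caller can retry only the failures.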

On Thu, Mar 31, 2011 at 11:14 AM, Adam Phelps <[email protected]> wrote:

> On 3/30/11 8:39 PM, Stack wrote:
>
>> What is slow?  The running of LoadIncrementalHFiles or the copy?
>>
>
> It's the LoadIncrementalHFiles portion.
>
>
>> If the former, is it because the table it's loading into has different
>> boundaries than those of the HFiles, so the HFiles have to be split?
>>
>
> I'm sure that could be one aspect of it; however, from the logs it looks
> like <1% of the HFiles we're loading have to be split.  Looking at the code
> for LoadIncrementalHFiles (HBase v0.90.1), I'm actually thinking our problem
> is that this code loads the HFiles sequentially.  Our largest table has over
> 2500 regions and the data being loaded is fairly well distributed across
> them, so there end up being around 2500 HFiles for each load period.  At 1-2
> seconds per HFile, that makes the loading process very time-consuming.
>
> On the primary cluster (16 regionservers) one of these sets of HFiles loads
> in ~350s vs ~3200s on the backup (with 4 regionservers).  Overall the nodes
> on the backup cluster are running at around 5% CPU (and similarly minimal
> disk and network usage).  So we have plenty of resources to throw at the
> problem; it's just a matter of determining what we can do here other than
> adding additional nodes to the cluster.
>
> My first thoughts are to try to add some parallelism, either by splitting
> the HFiles into multiple chunks for separate load instances, or by changing
> LoadIncrementalHFiles itself to use multiple loading threads.
>
>
>> Is your data only coming in via bulk load?
>>
>
> Yes, everything we put into HBase is via bulk load.  We found it to be a
> huge improvement over doing individual Puts from the M/R jobs.
>
> - Adam
>
