I've made some good progress using the HBase bulk load tools, with HBase
0.89.20100924+28.

My initial implementation did not have importtsv do compression, and it ran
directly on the HBase cluster's Hadoop.  It's been working OK for a while
(but slowly, because of limited resources).
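
For reference, the invocation looked roughly like this (from memory, not a
cut-and-paste; the jar name, paths, and column mapping are placeholders,
though "metrics" is one of the table's real column families):

    # generate HFiles with importtsv (no compression in this first version);
    # assumes the HBase jars and conf are on the Hadoop classpath
    hadoop jar hbase.jar importtsv \
      -Dimporttsv.columns=HBASE_ROW_KEY,metrics:value \
      -Dimporttsv.bulk.output=/bulk/mytable \
      mytable /input/mytable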

My next implementation, as discussed in another thread, has compression
turned on for importtsv (thanks, Lars).  I'm also now running importtsv on
a remote cluster and then distcp'ing the results (thanks, Todd) to the
HBase cluster for the completebulkload step.
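
Concretely, the extra steps look something like this (node names, ports,
and paths are placeholders; hftp:// on the distcp source is only needed if
the two clusters run different Hadoop versions):

    # same importtsv invocation as above, run on the remote cluster,
    # with gzip-compressed HFile output added via:
    #   -Dhfile.compression=gz
    # (hfile.compression being the HFileOutputFormat setting in this
    # version, as I understood it from the other thread)

    # copy the generated HFiles over to the HBase cluster's HDFS
    hadoop distcp hftp://remote-nn:50070/bulk/mytable \
        hdfs://hbase-nn:8020/bulk/mytable

    # then, on the HBase cluster, assign the HFiles to the table's regions
    hadoop jar hbase.jar completebulkload /bulk/mytable mytable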

I'm trying this out with a fresh (empty) HBase table.  The first run of
importtsv takes a very long time because the table has only one region, so
the job starts only one reducer (a pre-split idea for next time is sketched
after the list below).

   - Bulk load into a new table
   - About 20 GB of data (compressed with gzip)
   - Created one massive region
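
In hindsight, pre-splitting the table before the first load would probably
have given importtsv more reducers and avoided the one giant region.
Something like this is what I have in mind for next time; the split keys
are made-up placeholders, and I haven't verified that the SPLITS option
exists in this version's shell (if it doesn't, I gather HBaseAdmin's
createTable overload with explicit split keys is the programmatic route):

    # hbase shell: create the table pre-split at a few row-key boundaries
    # (keys are placeholders; the real table also has a second column
    # family, omitted here)
    create 'mytable', 'metrics',
      {SPLITS => ['ad_format160x600', 'ad_format300x250', 'ad_format728x90']}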

The load itself seemed to complete successfully, but we are seeing some
intermittent errors (missing blocks and such):

> Could not obtain block: blk_-5944324410280250477_429443
> file=/hbase/mytable/7c2b09e1ef8c4984732f362d7876305c/metrics/7947729174003011436

The initial region seems to have split once, but I'm not sure the split
completed, since the key ranges overlap and the storefileSizeMB is still
about as big as it started out.  My theory is that the initial load was too
large for one region, and the split either failed or is still in progress.

Both regions are on the same region server:

> mytable,ad_format728x90site_category2advertiser14563countrysepublisher2e03ab73-b234-4413-bcee-6183a99bd840starttime1293897600,1294094158507.2360f0a03e2566c72ea1a07c40f5f296.
> stores=2, storefiles=1075, storefileSizeMB=19230, memstoreSizeMB=0,
> storefileIndexSizeMB=784
> --
> mytable,,1294094158507.33b1e47c5fb004aa801b0bd88ce8322d.
> stores=2, storefiles=1083, storefileSizeMB=19546, memstoreSizeMB=0,
> storefileIndexSizeMB=796
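
Related to question 2 below, the one thing I know to try is the shell's
split command, though I'm not sure whether it does anything useful if a
split is already in flight:

    # hbase shell: ask the master to split the table's region(s), then
    # watch the region count and storefileSizeMB in the master web UI
    # (http://<master>:60010/ by default)
    split 'mytable'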

Another new table on this same HBase cluster, loaded around the same time,
has already split into 69 regions (storefileSizeMB 200-400 each).  That one
was loaded in smaller chunks, with importtsv running directly on the HBase
cluster, but also with compression on.

Now that I've gotten all the background down, here are my questions:

   1. Is it still working on the split?  Any way to monitor progress?
   2. Can I force more splits?
   3. Should I have done something first to avoid having the bulk load
   create one big region?
   4. Would it be easier to split if my initial bulk load was not gzip
   compressed?
   5. Am I looking in the wrong place entirely for this issue?

thanks,
Marc
