I've made some good progress using the HBase Bulk Load Tool, with HBase 0.89.20100924+28.
My initial implementation did not have importtsv do compression, and it ran directly on the HBase cluster's Hadoop. It has been working OK for a while (but slowly, because of limited resources). My next implementation, as discussed in another thread, has compression turned on for importtsv (thanks, Lars), and I am running importtsv on a remote cluster and then distcp'ing the results (thanks, Todd) to the HBase cluster for the completebulkload step.

I'm trying this out with a fresh (empty) HBase table, so the first run of importtsv takes a very long time: the table has only one region, so only one Reducer starts.

- Bulk load into a new table
- About 20 GB of data (compressed with gzip)
- Created one massive region

It seemed to complete successfully, but we are seeing some intermittent errors (missing blocks and such):

> Could not obtain block: blk_-5944324410280250477_429443
> file=/hbase/mytable/7c2b09e1ef8c4984732f362d7876305c/metrics/7947729174003011436

The initial region seems to have split once, but I'm not sure the split completed, since the key ranges overlap and the storefileSizeMB is still about as big as it started out. My theory is that the initial load was too large for one region, and the split either failed or is still in progress. Both regions are on the same Region Server:

> mytable,ad_format728x90site_category2advertiser14563countrysepublisher2e03ab73-b234-4413-bcee-6183a99bd840starttime1293897600,1294094158507.2360f0a03e2566c72ea1a07c40f5f296.
> stores=2, storefiles=1075, storefileSizeMB=19230, memstoreSizeMB=0,
> storefileIndexSizeMB=784
> --
> mytable,,1294094158507.33b1e47c5fb004aa801b0bd88ce8322d.
> stores=2, storefiles=1083, storefileSizeMB=19546, memstoreSizeMB=0,
> storefileIndexSizeMB=796

Another new table on this same HBase cluster, loaded around the same time, has already split into 69 regions (storefileSizeMB 200-400 each).
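Back-of-the-envelope, here's why that one region worries me. Assuming hbase.hregion.max.filesize is still at its default of 256 MB on this cluster (an assumption; I haven't changed it), the store files in that region hold many regions' worth of data:

```python
# Rough arithmetic: how many regions would ~19 GB of store files
# normally occupy, assuming the default hbase.hregion.max.filesize
# of 256 MB (an assumption -- I haven't tuned it on this cluster)?

storefile_size_mb = 19230   # storefileSizeMB from the region listing above
max_filesize_mb = 256       # assumed default region split threshold

expected_regions = -(-storefile_size_mb // max_filesize_mb)  # ceiling division
print(expected_regions)  # 76
```

So if splitting is driven by that threshold, this table would eventually need on the order of 75+ splits, not one.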
That other table was loaded in smaller chunks, with importtsv running directly on the HBase cluster, but also with compression on.

Now that I've laid out all the background, here are my questions:

1. Is it still working on the split? Is there any way to monitor progress?
2. Can I force more splits?
3. Should I have done something first to avoid having the bulk load create one big region?
4. Would the split be easier if my initial bulk load were not gzip compressed?
5. Am I looking in the wrong place entirely for this issue?

thanks,
Marc
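P.S. Regarding question 3: if pre-splitting the table before the bulk load is the right fix, this is roughly how I'd compute the split boundaries. The uniform byte key space here is only a stand-in (my real row keys start with ad_format..., so the actual boundaries would have to come from sampling the data):

```python
# Sketch: compute num_regions - 1 evenly spaced split keys over a
# fixed-width byte key space, to pre-split a table before bulk load.
# The uniform key-space assumption is a placeholder; real split
# points should come from sampling the actual row keys.

def split_keys(num_regions, key_bytes=2):
    """Return num_regions - 1 hex-encoded split keys, evenly spaced
    over the space of key_bytes-byte big-endian keys."""
    space = 256 ** key_bytes
    step = space // num_regions
    return [(i * step).to_bytes(key_bytes, "big").hex()
            for i in range(1, num_regions)]

# e.g. pre-split into 4 regions -> 3 boundary keys
print(split_keys(4))  # ['4000', '8000', 'c000']
```

The resulting keys could then be fed to the table creation as split points, so the bulk load's reducers write into many regions from the start instead of one.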
