Marc,
Just an idea: can you create your table with N regions, with null as the start key and the largest possible key value as your end key? Talking with ssechrist on IRC, he pointed me to this API:

http://hbase.apache.org/docs/r0.89.20100924/apidocs/org/apache/hadoop/hbase/client/HBaseAdmin.html#createTable%28org.apache.hadoop.hbase.HTableDescriptor,%20byte[],%20byte[],%20int%29

Take a look at the createTable() methods (rough sketch below, after your mail).

HTH,
-Mike

> From: [email protected]
> Date: Tue, 4 Jan 2011 06:03:21 -0800
> Subject: hbase bulk load / table split
> To: [email protected]
>
> I've made some good progress using the HBase Bulk Load Tool with HBase
> 0.89.20100924+28.
>
> My initial implementation did not have importtsv do compression, and it ran
> directly on the HBase cluster's Hadoop. It's been working OK for a while
> (but slowly, because of limited resources).
>
> My next implementation, as discussed in another thread, has compression
> turned on for importtsv (thanks, Lars), and I am running importtsv on a
> remote cluster and then distcp'ing (thanks, Todd) the results to the HBase
> cluster for the completebulkload step.
>
> I'm trying this out with a fresh (empty) HBase table, so the first run of
> importtsv takes a very long time: the table has only one region, so the job
> starts only one reducer.
>
> - Bulk load into a new table
> - About 20 GB of data (compressed with gzip)
> - Created one massive region
>
> It seemed to complete successfully, but we are seeing some intermittent
> errors (missing blocks and such):
>
> > Could not obtain block: blk_-5944324410280250477_429443
> > file=/hbase/mytable/7c2b09e1ef8c4984732f362d7876305c/metrics/7947729174003011436
>
> The initial region seems to have split once, but I'm not sure the split
> completed, since the key ranges overlap and the storefileSizeMB seems to be
> about as big as it started out. My theory is that the initial load was too
> large for one region, and the split either failed or is still in progress.
>
> Both regions are on the same region server:
>
> > mytable,ad_format728x90site_category2advertiser14563countrysepublisher2e03ab73-b234-4413-bcee-6183a99bd840starttime1293897600,1294094158507.2360f0a03e2566c72ea1a07c40f5f296.
> > stores=2, storefiles=1075, storefileSizeMB=19230, memstoreSizeMB=0,
> > storefileIndexSizeMB=784
> > --
> > mytable,,1294094158507.33b1e47c5fb004aa801b0bd88ce8322d.
> > stores=2, storefiles=1083, storefileSizeMB=19546, memstoreSizeMB=0,
> > storefileIndexSizeMB=796
>
> Another new table on the same HBase cluster, loaded around the same time,
> has already split into 69 regions (storefileSizeMB 200-400 each). That one
> was loaded in smaller chunks, with importtsv running directly on the HBase
> cluster, but also with compression on.
>
> Now that I've gotten all the background down, here are my questions:
>
> 1. Is it still working on the split? Is there any way to monitor progress?
> 2. Can I force more splits?
> 3. Should I have done something first to keep the bulk load from creating
>    one big region?
> 4. Would the region be easier to split if my initial bulk load were not
>    gzip-compressed?
> 5. Am I looking in the wrong place entirely for this issue?
>
> thanks,
> Marc
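
P.S. Here's a rough, untested sketch of the pre-split idea against the 0.89
client API linked above. The table name, column family, and key bounds are
placeholders; you'd want start and end keys that bracket your real row keys
so the generated split points actually land inside your data.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PreSplitTable {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor desc = new HTableDescriptor("mytable");
        desc.addFamily(new HColumnDescriptor("metrics"));

        // Divide the key space between startKey and endKey evenly into
        // 20 regions. Your "ad_format..." keys look like printable ASCII,
        // so plain string bounds are used here as placeholders; substitute
        // the real low and high ends of your row keys.
        byte[] startKey = Bytes.toBytes("a");
        byte[] endKey = Bytes.toBytes("z");
        admin.createTable(desc, startKey, endKey, 20);
      }
    }

If an even spread doesn't match your key distribution, the same javadoc page
should also show a createTable(HTableDescriptor, byte[][] splitKeys) variant
where you hand it the exact split points yourself.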
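
P.P.S. On your question 2 (forcing more splits): HBaseAdmin also has an
asynchronous split() call. I'm assuming it's in your 0.89 build, so
double-check the javadoc, but something like this:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class ForceSplit {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        // Ask the region server to split each region of the table at its
        // midpoint. The call is asynchronous: it returns before the split
        // finishes, so watch the master web UI for the daughter regions.
        admin.split("mytable");
      }
    }

You can also pass a specific region name instead of the table name to split
just one region. That should at least tell you whether the big region *can*
split, or whether something about the ~19 GB of store files is blocking it.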
