Michael,

Your solution seems to work. My keys are not evenly dispersed, so I couldn't use the createTable-with-N-regions signature, but I was able to sample one data set and come up with a halfway reasonable set of starter keys. I used those to create a new table with 20 splits. HBase appears to be much happier now; it's continuing to re-split my initial tables and distribute the load better.
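In case it helps anyone searching the archives later, the sampling step looks roughly like this (the class and method names are illustrative, not my actual code). Given a sorted sample of row keys, it picks split points at evenly spaced ranks; the resulting byte[][] would then be passed to HBaseAdmin.createTable(HTableDescriptor, byte[][]) to pre-create the regions before the bulk load:

```java
import java.util.List;

public class SplitKeySampler {

    // Given a sorted sample of row keys, pick (numRegions - 1) split points
    // at evenly spaced ranks, so each pre-created region covers roughly the
    // same share of the sampled keys.
    public static byte[][] pickSplitKeys(List<byte[]> sortedSample, int numRegions) {
        byte[][] splits = new byte[numRegions - 1][];
        for (int i = 1; i < numRegions; i++) {
            // rank of the i-th split point in the sample
            int idx = (int) ((long) sortedSample.size() * i / numRegions);
            splits[i - 1] = sortedSample.get(idx);
        }
        return splits;
    }
}
```

The quality of the splits depends entirely on how representative the sample is; a skewed sample just moves the hot-region problem around.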
thanks, Mike

On Tue, Jan 4, 2011 at 9:09 AM, Michael Segel <[email protected]> wrote:

>
> Marc,
>
> Just an idea.
> Can you create your table with N regions, with null as the start key and
> the largest possible key value as your end key?
> Talking with ssechrist on IRC, he pointed me to the API:
>
> http://hbase.apache.org/docs/r0.89.20100924/apidocs/org/apache/hadoop/hbase/client/HBaseAdmin.html#createTable%28org.apache.hadoop.hbase.HTableDescriptor,%20byte[],%20byte[],%20int%29
>
> And looking at the createTable() methods....
>
> HTH
>
> -Mike
>
> > From: [email protected]
> > Date: Tue, 4 Jan 2011 06:03:21 -0800
> > Subject: hbase bulk load / table split
> > To: [email protected]
> >
> > I've made some good progress using the HBase Bulk Load Tool, with HBase
> > 0.89.20100924+28.
> >
> > My initial implementation did not have importtsv do compression, and it
> > ran directly on the hbase cluster's hadoop. It's been working ok for a
> > while (but slow, because of limited resources).
> >
> > My next implementation, as discussed in another thread, has compression
> > settings turned on for importtsv (thanks, Lars), and I am running
> > importtsv on a remote cluster and then distcp'ing (thanks, Todd) the
> > results to the HBase cluster for the completebulkload step.
> >
> > I'm trying this out with a fresh (empty) HBase table. So, the first run
> > of importtsv takes a very long time, because the table only has one
> > region, so it starts only one Reducer.
> >
> > - Bulk load into a new table
> > - About 20 GB of data (compressed with gzip)
> > - Created one massive region
> >
> > It seemed to complete successfully, but we are seeing some intermittent
> > errors (missing blocks and such):
> >
> > Could not obtain block: blk_-5944324410280250477_429443
> > file=/hbase/mytable/7c2b09e1ef8c4984732f362d7876305c/metrics/7947729174003011436
> >
> > The initial region seems to have split once, but I'm not sure the split
> > completed, since the key ranges overlap and the storefileSizeMB seems to
> > be about as big as it started out. My theory is that the initial load is
> > too large for a region, and the split either failed or is still in
> > progress.
> >
> > Both on the same Region Server:
> >
> > mytable,ad_format728x90site_category2advertiser14563countrysepublisher2e03ab73-b234-4413-bcee-6183a99bd840starttime1293897600,1294094158507.2360f0a03e2566c72ea1a07c40f5f296.
> >   stores=2, storefiles=1075, storefileSizeMB=19230, memstoreSizeMB=0,
> >   storefileIndexSizeMB=784
> > --
> > mytable,,1294094158507.33b1e47c5fb004aa801b0bd88ce8322d.
> >   stores=2, storefiles=1083, storefileSizeMB=19546, memstoreSizeMB=0,
> >   storefileIndexSizeMB=796
> >
> > Another new table on this same HBase cluster, loaded around the same
> > time, has already split into 69 regions (storefileSizeMB 200-400 each).
> > This one was loaded in smaller chunks with importtsv running directly on
> > the hbase cluster, but also with compression on.
> >
> > Now that I've gotten all the background down, here are my questions:
> >
> > 1. Is it still working on the split? Any way to monitor progress?
> > 2. Can I force more splits?
> > 3. Should I have done something first to avoid having the bulk load
> >    create one big region?
> > 4. Would it be easier to split if my initial bulk load was not gzip
> >    compressed?
> > 5. Am I looking in the wrong place entirely for this issue?
> >
> > thanks,
> > Marc
>
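A note on Michael's suggestion, for the archives: the createTable(desc, startKey, endKey, numRegions) variant linked above derives its split points by treating the two boundary keys as unsigned big-endian integers and interpolating evenly between them, which is why it only helps when the row keys really are evenly dispersed across that range. A simplified sketch of the idea (not HBase's actual Bytes.split implementation; this version assumes fixed-width keys):

```java
import java.math.BigInteger;

public class EvenSplits {

    // Roughly what createTable(desc, startKey, endKey, numRegions) does:
    // interpret the boundary keys as unsigned big-endian integers and place
    // (numRegions - 1) split points at equal numeric intervals between them.
    public static byte[][] evenSplits(byte[] start, byte[] end, int numRegions) {
        BigInteger lo = new BigInteger(1, start);
        BigInteger hi = new BigInteger(1, end);
        BigInteger range = hi.subtract(lo);
        int width = Math.max(start.length, end.length);
        byte[][] splits = new byte[numRegions - 1][];
        for (int i = 1; i < numRegions; i++) {
            BigInteger point = lo.add(
                range.multiply(BigInteger.valueOf(i))
                     .divide(BigInteger.valueOf(numRegions)));
            splits[i - 1] = toFixedWidth(point, width);
        }
        return splits;
    }

    // Encode a non-negative value as a fixed-width big-endian byte array,
    // dropping any leading sign byte and zero-padding on the left.
    static byte[] toFixedWidth(BigInteger v, int width) {
        byte[] raw = v.toByteArray();
        byte[] out = new byte[width];
        int copy = Math.min(raw.length, width);
        System.arraycopy(raw, raw.length - copy, out, width - copy, copy);
        return out;
    }
}
```

If the keys cluster in a small part of the range (as Marc's do), most of these interpolated regions stay empty and one region still takes the whole load, which is exactly why sampling real keys worked better here.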
