Regards, Jean-Marc.
What version of HBase are you using? In the new version of the platform (0.94), there are a lot of improvements to automatic splitting and pre-splitting of regions. The great Hortonworks team published an amazing post on this particular topic:
http://hortonworks.com/blog/apache-hbase-region-splitting-and-merging/
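As a quick sanity check on the numbers quoted below, here is a minimal Python sketch that converts the byte counts reported by `hadoop fs` into GiB and compares the store file against the split threshold. It assumes the 10GB default mentioned in the quoted message is the effective limit for this table; the helper names (`to_gib`, `exceeds_threshold`) are only illustrative and not part of HBase.

```python
GIB = 1024 ** 3

# Assumption: the 10GB default FILE_SIZE from the quoted message is the
# effective split threshold for this table.
MAX_FILESIZE_BYTES = 10 * GIB


def to_gib(size_bytes: int) -> float:
    """Convert a raw byte count (as printed by `hadoop fs`) to GiB."""
    return size_bytes / GIB


def exceeds_threshold(size_bytes: int, threshold: int = MAX_FILESIZE_BYTES) -> bool:
    """Return True when a store file is larger than the split threshold."""
    return size_bytes > threshold


# Byte counts copied from the quoted message.
table_bytes = 275164921921
store_file_bytes = 22911054018

print(f"table:      {to_gib(table_bytes):.1f} GiB")
print(f"store file: {to_gib(store_file_bytes):.1f} GiB")
print(f"over split threshold: {exceeds_threshold(store_file_bytes)}")
```

This only confirms that the store file is well over the limit. It says nothing about why the region was not split, which depends on the split policy in use.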
2013/8/9, Jean-Marc Spaggiari <[email protected]>:
> Hi,
>
> Quick question regarding the split.
>
> Let's consider the table 'work_proposed' below:
>
> 275164921921 hdfs://node3:9000/hbase/work_proposed
>
> This is a 256GB table. I think there are more than 1B rows in it, but I
> have not counted them for a while.
>
> This table has a pretty default definition:
>
> hbase(main):001:0> describe 'work_proposed'
> DESCRIPTION                                                      ENABLED
>  'work_proposed', {NAME => '@', DATA_BLOCK_ENCODING => 'NONE',   true
>  BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0',
>  COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647',
>  MIN_VERSIONS => '0', KEEP_DELETED_CELLS => 'false',
>  BLOCKSIZE => '65536', ENCODE_ON_DISK => 'true',
>  IN_MEMORY => 'false', BLOCKCACHE => 'true'},
>  {NAME => 'a', DATA_BLOCK_ENCODING => 'NONE',
>  BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '3',
>  COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => '2147483647',
>  KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536',
>  IN_MEMORY => 'false', ENCODE_ON_DISK => 'true',
>  BLOCKCACHE => 'true'}
> 1 row(s) in 0.7590 seconds
>
> Those are all default parameters, which means the default FILE_SIZE value
> is 10GB.
>
> If I look into Hannibal, it's fine. I can see my table, the regions, the
> red line at 10GB showing the max size before the split, etc. All the
> regions are under this line... except one!
>
> hadoop@buldo:~/hadoop-1.0.3$ bin/hadoop fs -ls /hbase/work_proposed/46f8ea6e24982fbeb249a4516c879109/@
> Found 1 items
> -rw-r--r--   3 hbase supergroup 22911054018 2013-08-03 20:57 /hbase/work_proposed/46f8ea6e24982fbeb249a4516c879109/@/404fcf681e5e4fdbac99db80345b011b
>
> This region is 21GB, and it doesn't want to split. The first thing you will
> say is that it's because I have one single 21GB row in this region, but I
> don't think so. My rows are URLs. I would be surprised if I had a 21GB URL ;)
>
> I triggered major_compact many times and I stopped/started the cluster many
> times. Nothing. I can most probably ask for a manual split and that will
> work, but I want to take this opportunity to figure out why it's not
> splitting, whether it should be, and whether there is any defect behind that.
>
> I have not found any exception in the logs. I just started another
> major_compaction and will grep the region name from the logs, but any idea
> why I'm facing that, and where in the code I should start looking? I can
> deploy customized code to show more logs if required. I will still start
> looking at the split policies...
>
> JM
>

--
Marcos Ortiz Valmaseda
Product Manager at PDVSA
http://about.me/marcosortiz
