Hello,
In a prototypical cluster we have 8 region servers with 4G of HBase heap space each. Each region server holds about 107 regions, with a region size of 1G, using Snappy as the compression codec. The table has ~1.8 billion rows with a 48-character row key and measurement values as cell values, so the values are rather small. Currently, the storefileIndexSize for each region server is ~1300M.

We are afraid that with an increasing number of rows we will need quite an amount of RAM per RS just for holding the index. Is this somehow linear, e.g. if we double the number of rows to ~3.6 billion, will we have around 2600M of storefileIndexSize?

I found a few references discussing storefileIndexSize:

http://search-hadoop.com/m/hemBv1LiN4Q1/a+question+storefileIndexSize&subj=a+question+storefileIndexSize
http://hbase.apache.org/book.html#keysize

The basic suggestions are to increase the block size (we currently use the default 64K) and to reduce the length of the row key, column family, and qualifier names. Are there more? True, in our prototypical implementation we have used rather "readable" names for column families and qualifiers. Does anybody have numbers from practice on how much storefileIndexSize decreased with shorter column family and qualifier names?

Thanks,
Thomas
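P.S. To make the linearity question concrete, here is a rough back-of-envelope model, not HBase internals: the block index holds roughly one entry per HFile block (the block's first key plus offset/size metadata), so the index size grows linearly with the total data volume and shrinks inversely with the block size. The per-row size, per-entry overhead, and full-key length below are assumptions for illustration only.

```python
def estimate_index_size(num_rows, bytes_per_row, block_size, key_len,
                        entry_overhead=16):
    """Rough estimate of total block index size across all store files.

    Assumes one index entry per HFile block, each entry costing about
    the first key of the block (key_len) plus a fixed offset/size
    overhead (entry_overhead). All parameters are assumptions, not
    values read from HBase internals.
    """
    total_data = num_rows * bytes_per_row        # uncompressed data volume
    num_blocks = total_data // block_size        # one index entry per block
    return num_blocks * (key_len + entry_overhead)

# ~1.8 billion rows, assuming ~100 bytes stored per row and an ~80-byte
# full key (48-char row key + family + qualifier + timestamp):
small_blocks = estimate_index_size(1_800_000_000, 100, 64 * 1024, 80)
large_blocks = estimate_index_size(1_800_000_000, 100, 256 * 1024, 80)

print(f"64K blocks:  ~{small_blocks / 1024 / 1024:.0f} MB index")
print(f"256K blocks: ~{large_blocks / 1024 / 1024:.0f} MB index")
# Quadrupling the block size cuts the index to roughly a quarter;
# doubling the rows roughly doubles it.
```

Under this model, shortening family/qualifier names helps proportionally to their share of the full key length, while raising the block size helps proportionally to the block-size factor.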
