HBase always loads the whole block and then seeks forward in that block until it finds the KV it is looking for (there is no indexing inside the block).
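As a toy illustration of that lookup path (this is a sketch, not HBase's actual code, and the keys/values are made up): the block index is binary-searched to find the right block, but inside a block the reader can only scan forward key by key.

```python
# Toy model of an HFile block lookup. The block index locates the block whose
# first key <= target (binary search), then the whole block is read and
# scanned forward linearly -- there is no per-KV index inside a block.
import bisect

def get(block_index, blocks, key):
    # block_index: sorted list of each block's first key
    # blocks: parallel list of sorted (key, value) lists
    i = bisect.bisect_right(block_index, key) - 1  # binary search over blocks
    if i < 0:
        return None                                # key sorts before all blocks
    for k, v in blocks[i]:                         # linear scan within the block
        if k == key:
            return v
        if k > key:                                # passed where it would be
            break
    return None

blocks = [[("a", 1), ("c", 2)], [("d", 3), ("f", 4)]]
index = ["a", "d"]
print(get(index, blocks, "f"))  # -> 4
print(get(index, blocks, "b"))  # -> None (scan stops at "c")
```

The point of the sketch: the whole block is the unit of I/O and of cache residency, which is why BLOCKSIZE matters for random-read workloads.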
Also note that HBase has both compression and block encoding; these are different. Compression compresses the files on disk (at the HDFS level), not in memory, so it does not help with your cache size. Encoding is applied at the HBase block level and is retained in the block cache. I'm really curious as to what kind of improvement you see with a smaller block size. Remember that after you change BLOCKSIZE you need to issue a major compaction so that the data is rewritten into smaller blocks. We should really document this stuff better.

-- Lars

________________________________
From: Jan Schellenberger <[email protected]>
To: [email protected]
Sent: Friday, January 31, 2014 10:31 PM
Subject: RE: Slow Get Performance (or how many disk I/O does it take for one non-cached read?)

A lot of useful information here...

I disabled bloom filters. I changed to gz compression (which compressed the files significantly). I'm now seeing about *80 gets/sec/server*, which is a pretty good improvement. Since I estimate that the server is capable of about 300-350 hard disk operations/second, that works out to about 4 hard disk operations per get. I will experiment with BLOCKSIZE next.

Unfortunately, upgrading our system to a newer HBase/Hadoop is tricky for various IT/regulation reasons, but I'll ask to upgrade. From what I see, even Cloudera 4.5.0 still ships with HBase 0.94.6.

I also restarted the regionservers and am now getting blockCacheHitCachingRatio=51% and blockCacheHitRatio=51%. So conceivably, each get could be hitting the:

root index (cache hit)
block index (cache hit)
load of, on average, 2 blocks to get the data (cache misses, most likely, as my total heap space is 1/7 the compressed dataset)

That would be about a 52% cache hit rate overall, and if each data access requires 2 hard drive reads (data + checksum), that would explain my throughput. It still seems high, but it is probably within the realm of reason.
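The arithmetic in the message above can be checked with a quick sketch. The 320 IOPS figure is just the midpoint of the 300-350 ops/second estimate given in the message, and the two-reads-per-miss assumption (data + checksum) is Jan's hypothesis, not a measured value.

```python
# Back-of-envelope check of the numbers in the message above.
cache_hits_per_get = 2    # root index + block index, assumed cached
cache_misses_per_get = 2  # ~2 data blocks loaded from disk per get

# Fraction of block accesses served from cache, per get
hit_ratio = cache_hits_per_get / (cache_hits_per_get + cache_misses_per_get)

disk_reads_per_miss = 2   # data read + separate checksum read, per the message
disk_ops_per_get = cache_misses_per_get * disk_reads_per_miss

disk_iops = 320           # midpoint of the 300-350 ops/sec estimate
gets_per_sec = disk_iops / disk_ops_per_get

print(hit_ratio)       # -> 0.5, close to the observed 51% hit ratios
print(disk_ops_per_get)  # -> 4, matching the ~4 disk ops/get estimate
print(gets_per_sec)    # -> 80.0, matching the observed ~80 gets/sec/server
```

Under these assumptions the model lines up with all three observed figures, which makes the "index blocks cached, data blocks not" explanation plausible.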
Does HBase always read a full block (the 64k HFile block, not the HDFS block) at a time, or can it just jump to a particular location within the block?

--
View this message in context: http://apache-hbase.679495.n3.nabble.com/Slow-Get-Performance-or-how-many-disk-I-O-does-it-take-for-one-non-cached-read-tp4055545p4055564.html
Sent from the HBase User mailing list archive at Nabble.com.
