RE: HDFS Compression... that is interesting -- I didn't think HBase forced any HDFS-specific operations (other than short-circuit reads, which are configurable on/off)?
... So how is the compression encoding implemented, and how do other file systems handle it? I don't think compression is specifically part of the FileSystem API.

On Sat, Feb 1, 2014 at 11:06 PM, lars hofhansl <[email protected]> wrote:

> HBase always loads the whole block and then seeks forward in that block
> until it finds the KV it is looking for (there is no indexing inside the
> block).
>
> Also note that HBase has compression and block encoding. These are
> different. Compression compresses the files on disk (at the HDFS level) and
> not in memory, so it does not help with your cache size. Encoding is
> applied at the HBase block level and is retained in the block cache.
>
> I'm really curious as to what kind of improvement you see with smaller
> block size. Remember that after you change BLOCKSIZE you need to issue a
> major compaction so that the data is rewritten into smaller blocks.
>
> We should really document this stuff better.
>
> -- Lars
>
> ________________________________
> From: Jan Schellenberger <[email protected]>
> To: [email protected]
> Sent: Friday, January 31, 2014 10:31 PM
> Subject: RE: Slow Get Performance (or how many disk I/O does it take for
> one non-cached read?)
>
> A lot of useful information here...
>
> I disabled bloom filters.
> I changed to gz compression (compressed the files significantly).
>
> I'm now seeing about *80 gets/sec/server*, which is a pretty good
> improvement. Since I estimate that the server is capable of about 300-350
> hard disk operations/second, that's about 4 hard disk operations per get.
>
> I will experiment with BLOCKSIZE next. Unfortunately, upgrading our
> system to a newer HBase/Hadoop is tricky for various IT/regulation
> reasons, but I'll ask to upgrade. From what I see, even Cloudera 4.5.0
> still comes with HBase 0.94.6.
>
> I also restarted the regionservers and am now getting
> blockCacheHitCachingRatio=51% and blockCacheHitRatio=51%.
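(An aside on Lars's point above about whole-block loads: a minimal sketch, in plain Python, of why a Get pays for an entire HFile block. The block is read in full and then scanned key by key, since there is no per-key index inside the block. The names and data here are illustrative, not HBase's actual code.)

```python
# Hypothetical sketch: a Get against one HFile block. The whole block is
# loaded (one sequential read), then scanned KV by KV in sorted order --
# there is no per-key index inside a block.

def get_from_block(block, target_key):
    """Linear scan: keys inside a block are sorted, but not indexed."""
    for key, value in block:
        if key == target_key:
            return value
        if key > target_key:
            break  # sorted order lets us stop early on a miss
    return None

# A toy "block" of sorted key/value pairs:
block = [("row1", "a"), ("row3", "b"), ("row7", "c")]
print(get_from_block(block, "row3"))  # b
print(get_from_block(block, "row5"))  # None (key not in the block)
```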
> So conceivably, I could be hitting:
> root index (cache hit)
> block index (cache hit)
> load on average 2 blocks to get the data (cache misses, most likely, as my
> total heap space is 1/7 the compressed dataset)
>
> That would be about 52% cache hit overall, and if each data access
> requires 2 hard drive reads (data + checksum), then that would explain my
> throughput. It still seems high, but probably within the realm of reason.
>
> Does HBase always read a full block (the 64k HFile block, not the HDFS
> block) at a time, or can it just jump to a particular location within the
> block?
>
> --
> View this message in context:
> http://apache-hbase.679495.n3.nabble.com/Slow-Get-Performance-or-how-many-disk-I-O-does-it-take-for-one-non-cached-read-tp4055545p4055564.html
> Sent from the HBase User mailing list archive at Nabble.com.

--
Jay Vyas
http://jayunit100.blogspot.com
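For what it's worth, the arithmetic in Jan's quoted reasoning does check out. A back-of-the-envelope check with the numbers from this thread (Python, all figures taken from the messages above):

```python
# Back-of-the-envelope check of the numbers reported in this thread.

disk_ops_per_sec = 325   # midpoint of Jan's 300-350 estimate per server
gets_per_sec = 80        # observed throughput per server

disk_ops_per_get = disk_ops_per_sec / gets_per_sec
print(round(disk_ops_per_get, 1))  # ~4.1 disk operations per get

# Per get: root index (cache hit) + block index (cache hit)
# + 2 data blocks (cache misses) = 2 hits out of 4 block accesses.
cache_hits, block_accesses = 2, 4
print(100 * cache_hits // block_accesses)  # 50 -- near the observed 51%

# 2 uncached data blocks x 2 HDD reads each (data + checksum) = 4 reads
# per get, consistent with the ~4 disk operations per get computed above.
print(2 * 2)  # 4
```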
