On Tue, Jan 31, 2012 at 7:30 PM, Zheng Da <[email protected]> wrote:
> It mentions "block size" and also the figure shows data is split into
> blocks and each block starts with a magic header, which shows whether data
> in the block is compressed or not. Also, blocks in HBase are indexed.
>
These 'blocks' are not hdfs 'blocks'. The hbase hfile that we write to hdfs
is written in, by default, 64k chunks/blocks (these are the same as the
read-time blocks I talked of in my earlier message). As said already, these
are not hdfs blocks; this blocking is done on top of hdfs blocking.

> "Minimum block size. We recommend a setting of minimum block size between
> 8KB to 1MB for general usage. Larger block size is preferred if files are
> primarily for sequential access. However, it would lead to inefficient
> random access (because there are more data to decompress). Smaller blocks
> are good for random access, but require more memory to hold the block
> index, and may be slower to create (because we must flush the compressor
> stream at the conclusion of each data block, which leads to an FS I/O
> flush). Further, due to the internal caching in Compression codec, the
> smallest possible block size would be around 20KB-30KB."
> So each block with its prefixed "magic" header contains either plain or
> compressed data. What that looks like, we will have a look at in the next
> section.
>
> If data isn't split into blocks, how do these things work?
>

The above prescription rings about right (though you should be referring to
the reference guide rather than to Lars' blog; see
http://hbase.apache.org/book.html#hfilev2 which builds on Lars' blog to
explain how hfile works in more recent HBase versions). It pertains to the
hfile blocks.

I don't understand your question 'If data isn't split into blocks, how do
these things work?' Data is split into hfile blocks. Splits usually happen
on hfile block boundaries.

Please ask more questions so I can help you understand what's going on.

St.Ack
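
P.S. If it helps, below is a minimal, untested sketch of how you would set
the hfile block size and compression per column family from the Java
client. The table name "mytable" and family "f" are just placeholders, and
this is written against the 0.92-era API (in later releases the Compression
class lives in a different package and some constructors are deprecated):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.io.hfile.Compression;

public class CreateTableWithBlocksize {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    // Placeholder family name; substitute your own.
    HColumnDescriptor family = new HColumnDescriptor("f");
    // hfile block size for this family: 64k is the default.
    // Larger blocks favor sequential scans, smaller blocks favor
    // random reads at the cost of a bigger block index.
    family.setBlocksize(64 * 1024);
    // Compression is applied per hfile block, not per hdfs block.
    family.setCompressionType(Compression.Algorithm.GZ);

    // Placeholder table name; substitute your own.
    HTableDescriptor table = new HTableDescriptor("mytable");
    table.addFamily(family);
    admin.createTable(table);
  }
}

The same settings can be made from the shell when creating or altering a
table; the Java form above just makes it explicit that block size and
compression are per-column-family, hfile-level settings.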
