Hello,

On Tue, Jan 31, 2012 at 3:45 PM, Stack <[email protected]> wrote:
> On Mon, Jan 30, 2012 at 5:27 PM, Zheng Da <[email protected]> wrote:
> > Hello,
> >
> > I'm thinking of using HBase to store a matrix, so each subblock of a
> > matrix is stored as a value in HBase, and the key of the value is the
> > location of the subblock in the matrix. At the beginning, I wanted the
> > subblock to be as large as 8MB. But when I read
> > http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html,
> > I found that HBase splits keyvalue pairs into blocks and that the
> > block size is usually much smaller than 8MB. So what happens if I
> > store data of 8MB as a value in HBase? I tried, and it seems to work
> > fine. But how about the performance?
>
> Please point to what in that blog has you thinking we split keyvalues.
> We do not.

It mentions "block size", and the figure shows data split into blocks,
with each block starting with a magic header that indicates whether the
data in the block is compressed or not. Also, blocks in HBase are
indexed. The blog includes this passage:

"Minimum block size. We recommend a setting of minimum block size between
8KB to 1MB for general usage. Larger block size is preferred if files are
primarily for sequential access. However, it would lead to inefficient
random access (because there are more data to decompress). Smaller blocks
are good for random access, but require more memory to hold the block
index, and may be slower to create (because we must flush the compressor
stream at the conclusion of each data block, which leads to an FS I/O
flush). Further, due to the internal caching in Compression codec, the
smallest possible block size would be around 20KB-30KB."

So each block, with its prefixed "magic" header, contains either plain or
compressed data; the blog shows what that looks like in its next section.
If data isn't split into blocks, how do these things work?

> Writing, we persist files that by default use hdfs blocks of 64MB.
> Reading, we will by default read in 64KB chunks (hbase read blocks).
> The 64KB will contain whole keyvalues, which means we likely rarely
> read exactly 64KB. If a keyvalue is 8MB, though we're configured to
> read in 64KB blocks, we'll read the coherent 8MB keyvalue in as a
> single block.
>
> Performance-wise, it's best you try it out. Be aware that unless you
> configure stuff otherwise, this 8MB block coming up out of the
> filesystem will probably traverse the read-side block cache and blow
> out a bunch of lesser entries. These are the kind of things you'll
> need to consider. Check out the performance section in the hbase
> reference guide: http://hbase.apache.org/book.html#performance

Thanks,
Da
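
For concreteness, here is a minimal sketch of the storage scheme Da
describes, assuming the 2012-era HBase client API (HTable/Put/Get). The
table name "matrix", family "m", qualifier "block", and the rowKey helper
are all hypothetical, not anything from the thread:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class MatrixBlockStore {
        // Hypothetical family/qualifier names.
        private static final byte[] FAMILY = Bytes.toBytes("m");
        private static final byte[] QUALIFIER = Bytes.toBytes("block");

        // Encode the subblock's (row, col) position in the matrix
        // as the HBase row key.
        static byte[] rowKey(int blockRow, int blockCol) {
            return Bytes.add(Bytes.toBytes(blockRow),
                             Bytes.toBytes(blockCol));
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "matrix");

            // Store one 8MB subblock as a single value; it is kept as
            // one keyvalue, not split across HFile blocks.
            byte[] subblock = new byte[8 * 1024 * 1024];
            Put put = new Put(rowKey(0, 0));
            put.add(FAMILY, QUALIFIER, subblock);
            table.put(put);

            // Read it back; the whole 8MB cell comes back as one cell.
            Result result = table.get(new Get(rowKey(0, 0)));
            byte[] fetched = result.getValue(FAMILY, QUALIFIER);
            table.close();
        }
    }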

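And a sketch of one way to act on Stack's caveat about 8MB values blowing
out the read-side block cache: block size and block caching are
per-column-family settings, so a family holding large cells can opt out of
the cache when the table is created. This assumes the old
HBaseAdmin/HColumnDescriptor API, and the names are again hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class CreateMatrixTable {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);

            HColumnDescriptor family = new HColumnDescriptor("m");
            // The HFile block size is only a target: a block is closed at
            // the first keyvalue boundary past this size, so an 8MB cell
            // still becomes one oversized block.
            family.setBlocksize(64 * 1024);
            // Keep those oversized blocks from evicting smaller, hotter
            // entries by disabling block caching for this family (one way
            // to "configure stuff otherwise", per Stack's caveat).
            family.setBlockCacheEnabled(false);

            HTableDescriptor table = new HTableDescriptor("matrix");
            table.addFamily(family);
            admin.createTable(table);
            admin.close();
        }
    }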