Ryan (who wrote HFile) did a lot of testing around block size and didn't really see any difference when changing it. So I would recommend that you benchmark different values with your own data and usage pattern and see whether you get better or worse performance.
The tradeoff with larger values is that in order to retrieve a single cell, you have to fetch a lot more data than required: e.g. if your cell is 5KB and your block size is 1MB, that's how much you need to pull over the network in order to read it. Obviously, if you are scanning, then you probably want all that data anyway, so larger values *theoretically* give you better performance.

J-D

On Mon, Jul 26, 2010 at 10:41 PM, Andrew Nguyen <[email protected]> wrote:
> I found the following snippet in the HFile javadocs and had some questions
> seeking clarification. The recommendation is a minimum block size between
> 8KB and 1MB, with larger sizes for sequential access. Our data are time series
> data (high resolution, sampled at 125Hz). The primary/typical access pattern
> is subsets of the data, anywhere from 37k points to millions of points.
>
> Should I be setting this to 1MB? Would even larger values be a good idea
> (i.e., greater than 1MB)? What are the tradeoffs of larger values?
>
> From the HFile javadocs:
>
> Minimum block size. We recommend a setting of minimum block size between 8KB
> and 1MB for general usage. Larger block sizes are preferred if files are
> primarily for sequential access. However, they lead to inefficient random
> access (because there is more data to decompress). Smaller blocks are good
> for random access, but require more memory to hold the block index, and may
> be slower to create (because we must flush the compressor stream at the
> conclusion of each data block, which leads to an FS I/O flush). Further, due
> to the internal caching in the Compression codec, the smallest possible block
> size would be around 20KB-30KB.
>
> Thanks!
>
> --Andrew
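The read-amplification tradeoff described above can be sketched with some back-of-the-envelope arithmetic. This is a minimal illustration, not HBase code: it assumes (as in the 5KB-cell example) that a single-cell read must fetch at least one whole block over the network, and the cell/block sizes are chosen for illustration only.

```python
def bytes_fetched_per_cell(cell_size: int, block_size: int) -> int:
    """Bytes pulled over the network to read a single cell, assuming a
    read always fetches whole blocks (rounding up if the cell spans
    more than one block)."""
    blocks_needed = -(-cell_size // block_size)  # ceiling division
    return blocks_needed * block_size

cell = 5 * 1024  # the 5KB cell from the example above
for block in (8 * 1024, 64 * 1024, 1024 * 1024):
    fetched = bytes_fetched_per_cell(cell, block)
    print(f"block={block // 1024}KB -> fetch {fetched // 1024}KB "
          f"({fetched / cell:.1f}x the cell size)")
```

For random single-cell reads, the amplification grows linearly with block size (a 1MB block means fetching roughly 200x the 5KB cell), while for scans the extra bytes would have been read anyway, which is why the javadoc recommends larger blocks only for sequential access.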
