I found the snippet below in the HFile javadocs and have some questions seeking clarification. The recommendation is a minimum block size between 8KB and 1MB, with larger sizes for sequential access. Our data are time series data (high resolution, sampled at 125Hz). The primary/typical access pattern is reading subsets of the data, anywhere from 37k points to millions of points.
Should I be setting this to 1MB? Would even larger values be a good idea (i.e. greater than 1MB)? What are the tradeoffs for larger values?

From the HFile javadocs:

> Minimum block size. We recommend a setting of minimum block size between 8KB to 1MB for general usage. Larger block size is preferred if files are primarily for sequential access. However, it would lead to inefficient random access (because there are more data to decompress). Smaller blocks are good for random access, but require more memory to hold the block index, and may be slower to create (because we must flush the compressor stream at the conclusion of each data block, which leads to an FS I/O flush). Further, due to the internal caching in Compression codec, the smallest possible block size would be around 20KB-30KB.

Thanks! --Andrew
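For context, I'd be applying the setting per column family via the HBase shell, something like the sketch below (table and family names are placeholders, and 1048576 is just 1MB as an example value, not a recommendation):

```
# Set BLOCKSIZE at table creation time (default is 65536, i.e. 64KB).
# 'tsdata' and 'd' are placeholder names for our time-series table/family.
create 'tsdata', {NAME => 'd', BLOCKSIZE => '1048576'}

# Or change it on an existing table; new blocks pick up the size as
# HFiles are rewritten (flushes/compactions).
alter 'tsdata', {NAME => 'd', BLOCKSIZE => '1048576'}
```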
