I found the following snippet in the HFile javadocs and have some questions 
seeking clarification.  The recommendation is a minimum block size between 8KB 
and 1MB, with larger sizes for sequential access.  Our data are time series 
(high resolution, sampled at 125Hz).  The primary/typical access pattern is 
reading subsets of the data, anywhere from ~37k points to millions of points.

Should I be setting this to 1MB?  Would even larger values (i.e. greater than 
1MB) be a good idea?  What are the tradeoffs of larger values?


From the HFile javadocs:

Minimum block size. We recommend a setting of minimum block size between 8KB to 
1MB for general usage. Larger block size is preferred if files are primarily 
for sequential access. However, it would lead to inefficient random access 
(because there are more data to decompress). Smaller blocks are good for random 
access, but require more memory to hold the block index, and may be slower to 
create (because we must flush the compressor stream at the conclusion of each 
data block, which leads to an FS I/O flush). Further, due to the internal 
caching in Compression codec, the smallest possible block size would be around 
20KB-30KB.

Thanks!

--Andrew