Hi Sam,

The idea is that the entire result of the scan will not fit into the cache if the scan covers a "reasonable" number of cells, and hence it is unlikely that another scan will hit the cached blocks before they get evicted, especially with an LRU cache. Meanwhile, caching those blocks evicts blocks that frequently accessed readers actually did need, so for a full table scan the block cache gets all of the churn and none of the benefit.
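To make that concrete, here is a toy simulation (plain Java, using an access-ordered `LinkedHashMap` as a stand-in for the region server's LRU block cache; the class name, block counts, and cache size are made up for illustration). A scan that touches more blocks than the cache can hold gets zero hits even when repeated back to back, because each block is evicted before the scan comes around to it again:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LruScanDemo {

    // An access-ordered LinkedHashMap with removeEldestEntry overridden
    // is a minimal LRU cache: the least recently used entry is evicted
    // once the capacity is exceeded.
    static Map<Long, byte[]> newLruCache(final int capacity) {
        return new LinkedHashMap<Long, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, byte[]> eldest) {
                return size() > capacity;
            }
        };
    }

    // Sequentially "read" blocks 0..numBlocks-1, caching each one.
    // Returns how many reads were served from the cache.
    static int scan(Map<Long, byte[]> cache, long numBlocks) {
        int hits = 0;
        for (long b = 0; b < numBlocks; b++) {
            if (cache.containsKey(b)) {
                hits++;                 // block was still cached
            }
            cache.put(b, new byte[0]);  // "read from disk" and cache the block
        }
        return hits;
    }

    public static void main(String[] args) {
        // Scan is 10x larger than the cache: no pass ever hits.
        Map<Long, byte[]> cache = newLruCache(100);
        System.out.println("first pass hits:  " + scan(cache, 1000)); // 0
        System.out.println("second pass hits: " + scan(cache, 1000)); // 0
    }
}
```

By contrast, a working set that fits in the cache (say 50 blocks in a 100-block cache) gets 100% hits on the second pass. That asymmetry is the whole argument behind `scan.setCacheBlocks(false)` for MapReduce scans: the scan cannot benefit from the cache, but it can still flush out the hot rows of other readers.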
-- Lars

----- Original Message -----
From: Sam Seigal <[email protected]>
To: [email protected]
Cc:
Sent: Thursday, November 17, 2011 1:44 PM
Subject: block caching

I have a table that I only use for generating indexes. It will rarely have random reads, but will have M/R jobs running against it constantly to generate indexes. Even on the index table, random reads will be rare; it will mostly be used for scanning blocks of data.

According to "HBase: The Definitive Guide":

"As HBase reads entire blocks of data for efficient IO usage it retains these blocks in an in-memory cache, so that subsequent reads do not need any disk operation. The default of true enables the block cache for every read operation. But if your use-case only ever has sequential reads on a particular column family it is advisable to disable it from polluting the block cache by setting the block cache enabled flag to false."

"There are other options you can use to influence how the block cache is used, for example during a scan operation. This is useful during full table scans so that you do not cause a major churn on the cache. See the section called 'Configuration' for more information about this feature."

"Scan instances can be set to use the block cache in the region server via the setCacheBlocks() method. For scans used with MapReduce jobs, this should be false. For frequently accessed rows, it is advisable to use the block cache."

What is the reasoning behind the above? Why is using the block cache a bad idea for M/R jobs that do full table scans?
