Wellington Chevreuil created HBASE-28170:
--------------------------------------------

             Summary: Put the cached time at the beginning of the block; run cache validation in the background when retrieving the persistent cache
                 Key: HBASE-28170
                 URL: https://issues.apache.org/jira/browse/HBASE-28170
             Project: HBase
          Issue Type: Sub-task
            Reporter: Wellington Chevreuil
            Assignee: Wellington Chevreuil


In HBASE-28004, we added a "cached time" long at the end of each block in the 
bucket cache. We also record the cached time in the backing map that we persist 
to disk periodically, so that the cache can be retrieved after crashes or 
restarts. The persisted backing map also includes the last modification time of 
the cache itself.

On restart, once we read the backing map from the persisted file, we compare 
the last modification time of the cache recorded there against the cache's 
current last modification time. If they differ, the cache was updated after the 
backing map was persisted, so the backing map might not be accurate. We then 
iterate through the backing map entries, compare each entry's cached time 
against the cached time stored with the related block in the cache, and remove 
the entry from the map if they differ.
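
A minimal sketch of this validation, assuming a hypothetical `Entry` class 
(offset, length, cached time) for backing map entries, a `ByteBuffer` standing 
in for the cache, and the HBASE-28004 layout with the cached time in the last 8 
bytes of each block. These names are illustrative only, not HBase's actual 
`BucketCache` classes.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class BackingMapValidation {

  // Hypothetical backing map entry: where the block lives in the cache and the
  // cached time that was recorded when the map was persisted.
  static final class Entry {
    final int offset;      // start of the block in the cache
    final int length;      // block length, including the trailing cached time
    final long cachedTime; // cached time as recorded in the backing map

    Entry(int offset, int length, long cachedTime) {
      this.offset = offset;
      this.length = length;
      this.cachedTime = cachedTime;
    }
  }

  // Assumed HBASE-28004 layout: the cached time is the last 8 bytes of the block.
  static long cachedTimeInCache(ByteBuffer cache, Entry e) {
    return cache.getLong(e.offset + e.length - Long.BYTES);
  }

  // Removes every entry whose cached time no longer matches the cache contents.
  // The full scan only runs if the cache changed after the map was persisted.
  static int validate(Map<String, Entry> backingMap, ByteBuffer cache,
      long persistedLastModified, long currentLastModified) {
    if (persistedLastModified == currentLastModified) {
      return 0; // cache untouched since the map was persisted: trust the map
    }
    int removed = 0;
    Iterator<Map.Entry<String, Entry>> it = backingMap.entrySet().iterator();
    while (it.hasNext()) {
      Entry e = it.next().getValue();
      if (cachedTimeInCache(cache, e) != e.cachedTime) {
        it.remove();
        removed++;
      }
    }
    return removed;
  }

  public static void main(String[] args) {
    ByteBuffer cache = ByteBuffer.allocate(32);
    cache.putLong(8, 100L);  // block "a": offset 0, length 16, cached time 100
    cache.putLong(24, 300L); // block "b": offset 16, length 16, now holds 300

    Map<String, Entry> backingMap = new HashMap<>();
    backingMap.put("a", new Entry(0, 16, 100L));
    backingMap.put("b", new Entry(16, 16, 200L)); // stale: map says 200

    int removed = validate(backingMap, cache, 1_000L, 2_000L);
    System.out.println(removed + " " + backingMap.keySet()); // prints "1 [a]"
  }
}
```

The whole map is scanned here in a single pass, which is what makes the check 
proportional to the number of cached blocks.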

Currently this validation is done at RS initialisation time, but with caches as 
large as 1.6TB (30M+ blocks) it can take up to an hour, during which the RS is 
unusable. This PR moves the validation to a background task, while any direct 
access to a block in the cache also performs the "cached time" comparison.
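
A minimal sketch of that split between a background scan and an on-access 
check, assuming a single-thread `ExecutorService` for the scan, a `ByteBuffer` 
standing in for the cache, and the cached time stored in the first 8 bytes of 
each block (the layout this issue also proposes). `BackgroundValidatingCache` 
and `Entry` are hypothetical names, not HBase's `BucketCache` API.

```java
import java.nio.ByteBuffer;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BackgroundValidatingCache implements AutoCloseable {

  // Hypothetical entry: block offset plus the cached time recorded in the map.
  static final class Entry {
    final int offset;
    final long cachedTime;

    Entry(int offset, long cachedTime) {
      this.offset = offset;
      this.cachedTime = cachedTime;
    }
  }

  private final Map<String, Entry> backingMap = new ConcurrentHashMap<>();
  private final ByteBuffer cache;
  private final ExecutorService validator = Executors.newSingleThreadExecutor();

  BackgroundValidatingCache(ByteBuffer cache, Map<String, Entry> restored) {
    this.cache = cache;
    this.backingMap.putAll(restored);
  }

  // Assumes the cached time is the first 8 bytes of the block. Absolute reads
  // do not change the buffer's position.
  private boolean matches(Entry e) {
    return cache.getLong(e.offset) == e.cachedTime;
  }

  // Runs the full scan off the initialisation path; yields the number removed.
  Future<Integer> startBackgroundValidation() {
    return validator.submit(() -> {
      int removed = 0;
      for (Map.Entry<String, Entry> me : backingMap.entrySet()) {
        // remove(key, value) only succeeds if the mapping still points at the
        // entry we checked.
        if (!matches(me.getValue()) && backingMap.remove(me.getKey(), me.getValue())) {
          removed++;
        }
      }
      return removed;
    });
  }

  // Direct access performs the same check, so a stale entry is never served
  // even if the background scan has not reached it yet.
  Entry get(String key) {
    Entry e = backingMap.get(key);
    if (e == null) {
      return null;
    }
    if (!matches(e)) {
      backingMap.remove(key, e);
      return null; // treated as a cache miss
    }
    return e;
  }

  @Override
  public void close() {
    validator.shutdown();
  }

  public static void main(String[] args) throws Exception {
    ByteBuffer cache = ByteBuffer.allocate(32);
    cache.putLong(0, 100L);  // block "a" at offset 0 holds cached time 100
    cache.putLong(16, 300L); // block "b" at offset 16 now holds cached time 300
    Map<String, Entry> restored = Map.of(
        "a", new Entry(0, 100L),
        "b", new Entry(16, 200L)); // stale: map says 200
    try (BackgroundValidatingCache c = new BackgroundValidatingCache(cache, restored)) {
      Future<Integer> scan = c.startBackgroundValidation();
      System.out.println(c.get("a") != null); // true
      System.out.println(c.get("b") != null); // false: stale entry rejected
      scan.get(); // wait for the background scan to finish
    }
  }
}
```

Because both paths use `ConcurrentHashMap.remove(key, value)`, neither the scan 
nor a reader drops an entry that was concurrently replaced with a freshly cached 
block under the same key.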

This PR also moves the "cached time" to the beginning of the block in the 
cache, instead of the end. We noticed that with the "cached time" at the end we 
can fail to ensure consistency under some conditions. Consider the following:
1) A block B1 of size S gets allocated at offset 0 with cached time T1;
2) The backing map is persisted, containing B1 at offset 0 and cached time T1;
3) B1 is evicted. Its offset in the cache is now free; however, its contents 
are still there, including the cached time T1 at its end;
4) A new block B2 of size S/2 gets allocated at offset 0 with cached time T2;
5) The RS crashes before the backing map gets saved again, so the persisted 
backing map still references only B1, not B2;
6) At restart, we run the validation. Because B2 is only half the size of B1, it 
hasn't overwritten the cached time T1 at the end of B1's slot, so B1 is 
successfully validated even though the first half of its content has been 
overwritten by B2.
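
The scenario above can be reproduced with a small simulation. This is a sketch 
that assumes a `ByteBuffer` as the cache and uses hypothetical helper names; it 
replays steps 1 to 6 and checks B1 against its persisted cached time T1, once 
with the cached time at the end of the block and once at the beginning.

```java
import java.nio.ByteBuffer;

public class CachedTimePlacement {

  // Fills `size` bytes at `offset` with `fill`, then stores `cachedTime`
  // in either the first or the last 8 bytes of that slot.
  static void write(ByteBuffer cache, int offset, int size, long cachedTime,
      byte fill, boolean timeAtStart) {
    for (int i = 0; i < size; i++) {
      cache.put(offset + i, fill);
    }
    int timePos = timeAtStart ? offset : offset + size - Long.BYTES;
    cache.putLong(timePos, cachedTime);
  }

  // The validation check: does the cache still hold the expected cached time?
  static boolean validates(ByteBuffer cache, int offset, int size, long expected,
      boolean timeAtStart) {
    int timePos = timeAtStart ? offset : offset + size - Long.BYTES;
    return cache.getLong(timePos) == expected;
  }

  static boolean runScenario(boolean timeAtStart) {
    final int s = 64; // size S of block B1
    final long t1 = 1L;
    final long t2 = 2L;
    ByteBuffer cache = ByteBuffer.allocate(s);
    // 1) B1 (size S) allocated at offset 0 with cached time T1.
    write(cache, 0, s, t1, (byte) 1, timeAtStart);
    // 2) Backing map persisted: {B1 -> offset 0, size S, cached time T1}.
    // 3) B1 evicted: its bytes are left untouched in the cache.
    // 4) B2 (size S/2) allocated at offset 0 with cached time T2.
    write(cache, 0, s / 2, t2, (byte) 2, timeAtStart);
    // 5) Crash before the backing map is saved again.
    // 6) On restart, validate B1 against its persisted entry.
    return validates(cache, 0, s, t1, timeAtStart);
  }

  public static void main(String[] args) {
    // prints "time at end:   B1 validates = true"
    System.out.println("time at end:   B1 validates = " + runScenario(false));
    // prints "time at start: B1 validates = false"
    System.out.println("time at start: B1 validates = " + runScenario(true));
  }
}
```

With the cached time at the start, any block reallocated at B1's offset 
overwrites T1 regardless of its size, so in this scenario the stale entry is 
rejected.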




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
