[jira] [Commented] (HDDS-1499) OzoneManager Cache
[ https://issues.apache.org/jira/browse/HDDS-1499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16905335#comment-16905335 ] Anu Engineer commented on HDDS-1499:

Just to add some historical context: many of these arguments predate us inventing the current method of doing things. Let me add the notes for historical context; please note, this explains how we got here -- it does not defend the design decision. At this point, I believe the question you asked is entirely valid.

1. When we first started working on this, we found that when you make an entry into the RAFT log, it has to generate a log ID. This ID is a monotonically increasing number and *strictly* serial.

2. The first approach of the HA code was to move the current code into Raft callbacks. That created a problem: we were inside this RAFT +critical section+ doing a lot of heavy work for Ozone. For example, during createVolume, we would end up looking up Ozone metadata to verify whether the volume already exists, etc. That is, all this complicated code was getting executed in the callback from Ratis, which is serial.

3. At this point, it should have been evident that if we stepped back and looked at the problem, we would see the solution that has been proposed by Nanda and you. However, there was a catch that prevented us from going there.

4. The Raft protocol does not guarantee that when you read data from a leader, it is the real leader. Please take a look at this issue ( https://github.com/etcd-io/etcd/issues/741 ) and this paper for a deeper dive into the problem: https://www.usenix.org/system/files/conference/hotcloud17/hotcloud17-paper-arora.pdf Whatever solution we came up with failed the correctness test because of this issue; that is, we could never guarantee that the leader we are reading from is the real leader. At first, we tried to solve the problem within these constraints of the RAFT protocol.
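To make the "critical section" problem concrete, here is a minimal sketch (this is a toy, not the actual Ratis StateMachine API; the class and method names are made up for illustration) of why heavy validation inside the apply callback is a bottleneck: committed entries carry strictly serial log indexes, so every metadata lookup done inside the callback stalls all later entries.

```java
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for a Raft state machine (NOT the Ratis API).
// applyTransaction is invoked once per committed log entry,
// in strict log-index order, so its body is a serial bottleneck.
class ToyStateMachine {
    private long lastAppliedIndex = 0;
    private final Map<String, String> volumes = new HashMap<>();

    synchronized boolean applyTransaction(long logIndex, String volumeName) {
        if (logIndex != lastAppliedIndex + 1) {
            // Log indexes are monotonically increasing and strictly serial.
            throw new IllegalStateException("out-of-order apply: " + logIndex);
        }
        lastAppliedIndex = logIndex;
        // The problem described above: validation (a metadata lookup)
        // runs inside this serial section, stalling every later entry.
        if (volumes.containsKey(volumeName)) {
            return false; // volume already exists
        }
        volumes.put(volumeName, "owner");
        return true;
    }
}
```

Moving the lookup out of the callback is exactly what the leader-read guarantee problem (point 4) blocked: the pre-validation is only safe if you know you are reading from the real leader.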
That is, we thought we could solve the issue of RAFT protocol reading and writing without needing any extra protocol -- and that led to this callback architecture. For SCM, it is very obvious that we can move to the architecture you are proposing and greatly simplify the code; or, after the first version solving all the HA issues in Ozone Manager, we will go and rewrite this code. That is why we have the current code; it is *a combination of three things* -- prototype-class code, the lack of Raft leader guarantees, and trying to fit the current code into the existing Ratis code base.

For people who have minimal background and are trying to understand how the eventual solution should look: the current (non-HA) code does the following.

1. A call comes in -- say, create volume -- and the server looks up the metadata and proceeds to make updates to the metadata of the system in a consistent way.

2. With HA, the last update step will move away from updating metadata in place to updating metadata via Ratis. That is, a future object will be returned.

3. When the future is completed, Ozone Manager or SCM will reply to the caller.

So yes; eventually we will get to that proposed architecture -- I am just telling you the problems that we worked on. [~arp], [~bharatviswa], [~hanishakoneru], as the original developers of this feature, please feel free to jump in and correct me. I have just been an observer of this code's evolution and am commenting from what I recall. I have not been a participant in many of these discussions and I might have missed lots of details on why certain decisions were made.

Closing comment: *Yes, I agree we should move to what [~nandakumar131] and you ([~elek]) have been proposing*.
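The validate-then-replicate flow above can be sketched as follows. This is a hedged illustration, not the real Ozone Manager API: `OmWritePath`, `submitToRatis`, and the string return codes are all invented names, and a plain map stands in for both the metadata store and the consensus layer.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of the proposed flow: validate up front,
// push only the final update through the consensus layer (which
// returns a future), and reply when the future completes.
class OmWritePath {
    private final Map<String, String> metadata = new ConcurrentHashMap<>();

    // Stand-in for submitting the update to Ratis.
    private CompletableFuture<Void> submitToRatis(String key, String value) {
        return CompletableFuture.runAsync(() -> metadata.put(key, value));
    }

    CompletableFuture<String> createVolume(String volume, String owner) {
        // Step 1: look up metadata and validate (outside the Raft callback).
        if (metadata.containsKey(volume)) {
            return CompletableFuture.completedFuture("ALREADY_EXISTS");
        }
        // Step 2: the last update step goes through Ratis; a future is returned.
        return submitToRatis(volume, owner)
                // Step 3: reply to the caller once replication completes.
                .thenApply(v -> "OK");
    }
}
```

Note the sketch quietly assumes the validation read in step 1 is against the real leader's state, which is precisely the leader-lease gap discussed above.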
This current code allows us to get the HA feature complete and working. The biggest issue we have is the lack of a lease for the leader; at this point that is a question of correctness. Once we fix that, we can go back and refactor the code in Ozone Manager to reflect the changes proposed by both of you.
[jira] [Commented] (HDDS-1499) OzoneManager Cache
[ https://issues.apache.org/jira/browse/HDDS-1499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16900108#comment-16900108 ] Elek, Marton commented on HDDS-1499:

Sorry, I am very late to this party. I found a problem with create volume (volume creation is not cached) and I am trying to understand the current design.

bq. well, I was really hoping that the fact that there is a cache is not visible to the layer that is reading and writing. Is there a reason why that should be exposed to calling applications?

This was a comment by [~anu] and I had the same question very soon. It was not clear to me why we need separate methods on the TypedTable (cached put/get instead of the simple put/get, which may or may not be cached). What I expected was a Table implementation with an in-memory map under the hood and a real rocksdb Table.

If I understood the arguments in the PR correctly (but correct me if I am wrong), that was not possible because this cache is not a traditional cache. When a value is added to the cache it may not be committed yet (as the cache is independent from the write path). It's more like an in-memory overlay table: if something is added to the in-memory table, it should be used as the return value.

But in this case the in-memory overlay table seems to be an independent component. As I can see, the TableCache interface is a simplified version of a key-value table. In the original TypedTable a lot of methods just ignore the cache, and if the cache is not updated manually (it is not updated by the put method), the behavior will be inconsistent. It seems safer to separate the TableCache from the TypedTable.
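The "in-memory overlay table" reading above can be sketched like this. These names (`OverlayTable`, `addCacheEntry`, `cleanup`) are illustrative, not the actual TableCache/TypedTable code: uncommitted writes live in an overlay map tagged with an epoch, `get()` consults the overlay before the persistent table, and entries are evicted once their epoch has been flushed.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Sketch of an overlay-style table cache (illustrative names only).
class OverlayTable {
    private final Map<String, String> db = new HashMap<>();       // committed state (stands in for rocksdb)
    private final Map<String, String> overlay = new HashMap<>();  // uncommitted writes
    private final Map<String, Long> epochs = new HashMap<>();     // key -> epoch of its pending write

    // Writes land in the overlay first; they are not yet committed.
    void addCacheEntry(String key, String value, long epoch) {
        overlay.put(key, value);
        epochs.put(key, epoch);
    }

    // The overlay wins over the committed table, so an uncommitted
    // value is still "used as the return value".
    Optional<String> get(String key) {
        if (overlay.containsKey(key)) {
            return Optional.of(overlay.get(key));
        }
        return Optional.ofNullable(db.get(key));
    }

    // Once everything up to 'epoch' has been flushed to the real table,
    // persist and evict those overlay entries.
    void cleanup(long epoch) {
        epochs.entrySet().removeIf(e -> {
            if (e.getValue() <= epoch) {
                db.put(e.getKey(), overlay.remove(e.getKey()));
                return true;
            }
            return false;
        });
    }
}
```

The inconsistency concern above maps directly onto this sketch: any code path that reads `db` without consulting `overlay` first will see stale state for keys with pending writes.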
> OzoneManager Cache
> --
>
> Key: HDDS-1499
> URL: https://issues.apache.org/jira/browse/HDDS-1499
> Project: Hadoop Distributed Data Store
> Issue Type: Sub-task
> Components: Ozone Manager
> Reporter: Bharat Viswanadham
> Assignee: Bharat Viswanadham
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.4.1, 0.5.0
>
> Time Spent: 12h
> Remaining Estimate: 0h
>
> In this Jira, we shall implement a cache for Table.
> As with OM HA, we are planning a double buffer implementation to flush transactions in a batch, instead of using rocksdb put() for every operation. When this comes into place we need a cache in OzoneManager HA to handle/serve the requests for validation/returning responses.
>
> This Jira will implement Cache as an integral part of the table. In this way users of this table do not need to check cache/db themselves. For this, we can update the get API in the table to handle the cache.
>
> This Jira will implement:
> # Cache as a part of each Table.
> # Use of this cache in get().
> # APIs for cleanup and for adding entries to the cache.
> Usage to add the entries into the cache will be done in further Jiras.

-- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
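The double-buffer batching mentioned in the issue description can be sketched as follows (a minimal illustration, not the real OzoneManagerDoubleBuffer; the class name and a plain map standing in for RocksDB are assumptions): writes accumulate in a "current" buffer while a flusher swaps it out and commits the whole batch at once, instead of issuing one rocksdb put() per operation.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal double-buffer sketch (illustrative, not the Ozone code).
class DoubleBuffer {
    private List<Map.Entry<String, String>> current = new ArrayList<>();
    private final Map<String, String> db = new HashMap<>(); // stands in for RocksDB

    // Writers append to the current buffer; nothing is committed yet.
    synchronized void add(String key, String value) {
        current.add(new SimpleEntry<>(key, value));
    }

    // The flusher swaps buffers, then commits the ready buffer as one batch.
    synchronized int flush() {
        List<Map.Entry<String, String>> ready = current;
        current = new ArrayList<>();
        for (Map.Entry<String, String> e : ready) {
            db.put(e.getKey(), e.getValue());
        }
        return ready.size(); // entries committed in this batch
    }

    String getCommitted(String key) {
        return db.get(key);
    }
}
```

The gap between `add` and `flush` is exactly why the table needs the cache described in this Jira: until the batch is flushed, a plain read of the committed table cannot see the pending writes.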
[jira] [Commented] (HDDS-1499) OzoneManager Cache
[ https://issues.apache.org/jira/browse/HDDS-1499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16843593#comment-16843593 ] Hudson commented on HDDS-1499:

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #16573 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/16573/])
HDDS-1499. OzoneManager Cache. (#798) (arp7: rev 0d1d7c86ec34fabc62c0e3844aca3733024bc172)
* (add) hadoop-hdds/common/src/test/java/org/apache/hadoop/utils/db/cache/package-info.java
* (add) hadoop-hdds/common/src/main/java/org/apache/hadoop/utils/db/cache/CacheKey.java
* (edit) hadoop-hdds/common/src/main/java/org/apache/hadoop/utils/db/DBStore.java
* (edit) hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OmMetadataManagerImpl.java
* (edit) hadoop-hdds/common/src/test/java/org/apache/hadoop/utils/db/TestTypedRDBTableStore.java
* (edit) hadoop-hdds/common/src/main/java/org/apache/hadoop/utils/db/RDBTable.java
* (edit) hadoop-hdds/common/src/main/java/org/apache/hadoop/utils/db/Table.java
* (edit) hadoop-hdds/common/src/main/java/org/apache/hadoop/utils/db/TypedTable.java
* (add) hadoop-hdds/common/src/main/java/org/apache/hadoop/utils/db/cache/PartialTableCache.java
* (add) hadoop-hdds/common/src/main/java/org/apache/hadoop/utils/db/cache/TableCache.java
* (add) hadoop-hdds/common/src/main/java/org/apache/hadoop/utils/db/cache/CacheValue.java
* (add) hadoop-hdds/common/src/test/java/org/apache/hadoop/utils/db/cache/TestPartialTableCache.java
* (add) hadoop-hdds/common/src/main/java/org/apache/hadoop/utils/db/cache/EpochEntry.java
* (add) hadoop-hdds/common/src/main/java/org/apache/hadoop/utils/db/cache/package-info.java