[jira] [Commented] (HDDS-3354) OM HA replay optimization
[ https://issues.apache.org/jira/browse/HDDS-3354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142996#comment-17142996 ] Marton Elek commented on HDDS-3354: --- Moved the target.version=0.6.0 flag to the sub issue (HDDS-3685). I think we should track that one to close this automatically. > OM HA replay optimization > - > > Key: HDDS-3354 > URL: https://issues.apache.org/jira/browse/HDDS-3354 > Project: Hadoop Distributed Data Store > Issue Type: Improvement > Components: OM HA, Ozone Manager >Reporter: Bharat Viswanadham >Assignee: Bharat Viswanadham >Priority: Major > Labels: Triaged > Attachments: OM HA Replay.pdf, Screen Shot 2020-05-20 at 1.28.48 > PM.png > > > This Jira is to improve the OM HA replay scenario. > Attached the design document which discusses about the proposal and issue in > detail. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org
[jira] [Commented] (HDDS-3354) OM HA replay optimization
[ https://issues.apache.org/jira/browse/HDDS-3354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17113494#comment-17113494 ] Bharat Viswanadham commented on HDDS-3354: --

{quote}That's very interesting. Do you have more findings about the root cause? How many buckets did you have? It's very surprising to have GC pauses for a few thousand buckets, especially as they are not frequently updated. Do we need to adjust something on the cache size?{quote}

Initially, the test was started with the default heap size settings. Once OM was started with 8GB memory settings, no issues were seen. The bucket count is not in the order of thousands; it is 14 million buckets. The command used in the test is {{$ bin/ozone freon ombg -n=100}}

Even during cache design, we assumed that in a single OM cluster we would have on the order of 1000 volumes, that each volume would have on the order of 1000 buckets, and that a cluster would have billions of keys. As volume/bucket existence checks are done for each request, we decided to keep the bucket/volume cache in memory the whole time, as a full cache. We may want to revisit this decision, but that is not related to this Jira. (This came out as part of testing when creating a million buckets to test this Jira.)

Snippet from the cache design doc attached to HDDS-505:

{noformat}
Memory Usage: As discussed above, for the Volume and Bucket Table we store the full table information in memory. This will help validate requests very quickly, as for every request Ozone Manager receives, the mandatory check is whether the volume/bucket exists or not. On a typical Ozone cluster, volumes can be in the number of thousands. (Considering this as an admin-level operation in a system where each team/organization gets a volume for their usage.) And for each volume we can expect 1000 to 10000 buckets. These numbers are considered just for calculation purposes. Let's assume each VolumeInfo and BucketInfo structure consumes 1KB in memory. Then, volume cache memory usage can be 1000 * 1KB = 1 MB. Bucket cache memory usage can be 1000 * 1000 * 1KB = 1 GB. We can make the Volume and Bucket Table caches partial if the number of buckets and volumes is very high in the system. This can be given as an option to the end user. For now we assume that the entire list of volumes and buckets can be safely cached in memory.
{noformat}
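The back-of-the-envelope sizing above, and why 14 million buckets blows far past it, can be sketched as follows (a hypothetical calculation using the design doc's ~1KB-per-entry assumption; class and method names are illustrative, not Ozone code):

```java
public class CacheSizeEstimate {
  // Assumption from the cache design doc: ~1 KB per cached VolumeInfo/BucketInfo.
  static final long ENTRY_BYTES = 1024;

  static long volumeCacheBytes(long volumes) {
    return volumes * ENTRY_BYTES;
  }

  static long bucketCacheBytes(long totalBuckets) {
    return totalBuckets * ENTRY_BYTES;
  }

  public static void main(String[] args) {
    // Design estimate: 1000 volumes, 1000 buckets per volume.
    System.out.printf("volume cache: %.1f MB%n",
        volumeCacheBytes(1000) / (1024.0 * 1024));                 // ~1 MB
    System.out.printf("bucket cache: %.2f GB%n",
        bucketCacheBytes(1000L * 1000) / (1024.0 * 1024 * 1024));  // ~1 GB
    // The stress test above had 14 million buckets, far beyond the
    // design estimate the full-table cache was sized for.
    System.out.printf("14M buckets: %.1f GB%n",
        bucketCacheBytes(14_000_000L) / (1024.0 * 1024 * 1024));   // ~13 GB
  }
}
```

Note the 1 KB/entry figure is an upper-bound assumption from the doc; real entries are evidently smaller, since 8GB sufficed in the test, but the order-of-magnitude gap between the design estimate and 14 million buckets is what matters for the GC behavior.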
[jira] [Commented] (HDDS-3354) OM HA replay optimization
[ https://issues.apache.org/jira/browse/HDDS-3354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17113488#comment-17113488 ] Bharat Viswanadham commented on HDDS-3354: --

{quote}That's an interesting question. If I understood well, your objection is that if we do a Ratis log snapshot and a RocksDB snapshot (=checkpoint) at the same time, they can be inconsistent with each other in case of any error. I don't think it's a problem. Writing the Ratis log snapshot can fail even now, which should be handled. The only question is whether we can finalize both snapshots in one step, which should be possible: for example, write the Ratis log snapshot file and the RocksDB snapshot file to the same directory and move it to the final location.{quote}

*Let me share the complete reasoning here.*

1. Even if we write both the Ratis log snapshot file (which holds the snapshot index) and the RocksDB checkpoint to a temporary directory, suppose the checkpoint succeeds and the snapshot file write fails. Then the current snapshot directory still has the old checkpoint and the old snapshot file. During OM restart, if we use the current OM DB, we cannot avoid the replay logic. So in this case, whenever OM restarts, it should come up from the last checkpoint DB and snapshot file. If we agree with this, it will delay startup until the leader has applied all OM entries from the snapshot up to the latest log, and clients will get LeaderNotReadyException. These kinds of issues will not be seen with the proposed approach.

2. And one step failing is not the only issue; it is one of the issues. If taking snapshots is controlled by Ratis, then while a checkpoint is happening we should not allow any transactions to be flushed to the DB, because we want to know the exact last transaction applied to the DB, so that when a restart happens we know where to resume from. That means every time a checkpoint happens, we need to stop the double buffer, take a checkpoint, and write it to the snapshot file. Stopping the double buffer today means sending a signal to interrupt the flush thread; with this change we would have to either keep tracking unflushed transactions not yet completed by the flush, or wait for the flush to complete. So this might increase the current queue length in the double buffer, as applyTransaction will still continue applying transactions to the StateMachine. This looks more complex than what is proposed, and it also comes with its own disadvantages of startup slowness and double-buffer queue growth.

Another approach: instead of putting the transaction info into the DB, repeat the above checkpoint-and-snapshot-file process on every iteration, so that we don't stop the double buffer and applyTransaction keeps feeding it. But this is not a great solution, as it makes the double buffer slow and increases the number of checkpoints (just want to point that out), and we would need another background thread for cleanup. It would not have the startup slowness problem, though.

Testing has shown that with HDDS-3474 + HDDS-3475 performance is not degraded and is on par, and with this we can remove the replay logic from the actual request logic. So even if we want to revisit this later, it will be simpler, and developers implementing new non-idempotent write requests will not need to know about handling the replay case. Let me know your thoughts?
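To make the trade-off in point 2 concrete, here is a minimal sketch of the coordination that stopping the double buffer for a checkpoint would require (class and method names are hypothetical, not the actual OzoneManagerDoubleBuffer API):

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Sketch: while a checkpoint runs, flushes must pause so the recorded
// "last applied transaction" matches exactly what is in the DB.
class DoubleBufferCheckpointCoordinator {
  private final ReentrantLock lock = new ReentrantLock();
  private final Condition unpaused = lock.newCondition();
  private boolean paused = false;
  private volatile long lastFlushedIndex = -1;

  // Called by the flush thread for each batch it writes to the DB.
  void flushBatch(long highestIndexInBatch) throws InterruptedException {
    lock.lock();
    try {
      while (paused) {       // block while a checkpoint is in progress;
        unpaused.await();    // meanwhile applyTransaction keeps enqueueing,
      }                      // so the double-buffer queue grows
      // ... write the batch to RocksDB here ...
      lastFlushedIndex = highestIndexInBatch;
    } finally {
      lock.unlock();
    }
  }

  // Called by the snapshot thread: pause flushes, checkpoint, resume.
  long checkpoint(Runnable takeDbCheckpointAndSnapshotFile) {
    lock.lock();
    try {
      paused = true;         // no flush can start or finish from here on
    } finally {
      lock.unlock();
    }
    try {
      takeDbCheckpointAndSnapshotFile.run();  // possibly slow
      return lastFlushedIndex;                // the index the checkpoint reflects
    } finally {
      lock.lock();
      paused = false;
      unpaused.signalAll();
      lock.unlock();
    }
  }
}
```

The checkpoint holds flushes back for its full duration, which is exactly the queue-growth concern described above.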
[jira] [Commented] (HDDS-3354) OM HA replay optimization
[ https://issues.apache.org/jira/browse/HDDS-3354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17112874#comment-17112874 ] Marton Elek commented on HDDS-3354: ---

bq. Another reason is as bucket cache is full table cache it will be in memory, we need some JVM tunings to be set, because without tuning, we are seeing a lot of GC Pauses happening in OM.

That's very interesting. Do you have more findings about the root cause? How many buckets did you have? It's very surprising to have GC pauses for a few thousand buckets, especially as they are not frequently updated. Do we need to adjust something on the cache size?
[jira] [Commented] (HDDS-3354) OM HA replay optimization
[ https://issues.apache.org/jira/browse/HDDS-3354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17112873#comment-17112873 ] Marton Elek commented on HDDS-3354: ---

bq. If there are no further objections

I have no further objection, but I have a comment on your answer. Take it as a friendly chat during the coffee break about interesting questions related to distributed systems.

bq. if any of the steps fails like checkpoint succeeded, but snapshot file writes failed, then when Om restart

That's an interesting question. If I understood well, your objection is that if we do a Ratis log snapshot and a RocksDB snapshot (=checkpoint) at the same time, they can be inconsistent with each other in case of any error. I don't think it's a problem. Writing the Ratis log snapshot can fail even now, which should be handled. The only question is whether we can finalize both snapshots in one step, which should be possible: for example, write the Ratis log snapshot file and the RocksDB snapshot file to the same directory and move it to the final location.

I wouldn't like to say it's better. But I think it's possible. (How is your coffee?)
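The "same directory plus one move" finalization suggested here can be sketched with java.nio (file names and directory layout are hypothetical; the real snapshot layout is defined by Ratis and OM):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class AtomicSnapshotFinalize {
  // Write the Ratis snapshot file and the RocksDB checkpoint into one
  // temporary directory, then publish both with a single atomic rename:
  // either both files become visible, or neither does.
  static Path finalizeSnapshot(Path parent, byte[] ratisSnapshot,
                               byte[] dbCheckpoint) throws IOException {
    Path tmp = Files.createTempDirectory(parent, "snapshot-tmp-");
    Files.write(tmp.resolve("ratis.snapshot"), ratisSnapshot);
    Files.write(tmp.resolve("om.db.checkpoint"), dbCheckpoint);
    // Assumes "current" does not exist yet; replacing an existing snapshot
    // directory would need a rename-aside step first.
    Path current = parent.resolve("current");
    Files.move(tmp, current, StandardCopyOption.ATOMIC_MOVE);
    return current;
  }
}
```

The atomicity here relies on both paths being on the same filesystem, so that the move is a single rename; a partially written temp directory is simply never published.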
[jira] [Commented] (HDDS-3354) OM HA replay optimization
[ https://issues.apache.org/jira/browse/HDDS-3354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17112592#comment-17112592 ] Bharat Viswanadham commented on HDDS-3354: -- If there are no further objections, I will proceed further with the commit and continue on the next tasks required for this improvement.
[jira] [Commented] (HDDS-3354) OM HA replay optimization
[ https://issues.apache.org/jira/browse/HDDS-3354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17112591#comment-17112591 ] Bharat Viswanadham commented on HDDS-3354: --

The reason for the inclusion of HDDS-3615 and HDDS-3623 is that we identified a few potential problems that cause slowness as the number of objects in OM increases.

Another reason is that, since the bucket cache is a full-table cache kept in memory, some JVM tunings need to be set; without tuning, we are seeing a lot of GC pauses in OM. (So this test is like a stress test, where OM has millions of buckets in memory, whereas during the design of the cache we assumed a single OM would have a couple of thousand volumes/buckets.)

{noformat}
-XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled -Xloggc:/tmp/gc.log-$(date +'%Y%m%d%H%M') -XX:NewSize=1024m -XX:MaxNewSize=1024m -Xms8192m -Xmx8192m -XX:+PrintGCDateStamps
{noformat}
[jira] [Commented] (HDDS-3354) OM HA replay optimization
[ https://issues.apache.org/jira/browse/HDDS-3354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17112590#comment-17112590 ] Bharat Viswanadham commented on HDDS-3354: -- Hi [~elek]

Perf results:

!Screen Shot 2020-05-20 at 1.28.48 PM.png!

The command used for the test: {{$ bin/ozone freon ombg -n=100}}

{quote}But I am fine with this approach, unless it causes significant performance degradation. And if I understood well, this is the case: it introduces some new IO pressure but also removes a lot of unnecessary queries, which can make the overall picture even better than before.{quote}

The above test results show not much perf impact with HDDS-3474 and HDDS-3475. The tests were run on a Mac with a single SSD and 16GB RAM.

{quote}For example: use RocksDB checkpoints and snapshot db together with Ratis log.{quote}

With this proposal, we want to avoid replay logic in the actual request handling, as that is hard on developers: when implementing new APIs, they would need to know that replay must be handled for non-idempotent requests. Also, the problem with the checkpoint approach is how to map the last applied transaction to the DB atomically with the checkpoint: if any of the steps fails, e.g. the checkpoint succeeded but the snapshot file write failed, then when OM restarts we have to start from the previous snapshot value, so we still need the replay code. So the proposed solution removes the need to handle replay at the request logic level.
[jira] [Commented] (HDDS-3354) OM HA replay optimization
[ https://issues.apache.org/jira/browse/HDDS-3354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17110342#comment-17110342 ] Marton Elek commented on HDDS-3354: ---

Thanks for explaining, and also thanks for the offline updates about this plan. It was not clear to me why it was decided to solve the problem in this way (there are one or two other ways to do the same, which were not mentioned as considered alternatives. For example: use RocksDB checkpoints and snapshot the DB together with the Ratis log. Or keep the list of active keys always in memory and store only the values in the database.)

But I am fine with this approach, unless it causes significant performance degradation. And if I understood well, this is the case: it introduces some new IO pressure but also removes a lot of unnecessary queries, which can make the overall picture even better than before.

As I know, you have some initial numbers about the performance. I would propose to share them here. Thanks again for explaining it to me offline, multiple times.
[jira] [Commented] (HDDS-3354) OM HA replay optimization
[ https://issues.apache.org/jira/browse/HDDS-3354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17093754#comment-17093754 ] Bharat Viswanadham commented on HDDS-3354: -- Hi [~elek]

Thanks for reviewing the design doc.

{quote}What is the performance impact of turning on sync writes? It seems that we make the restart faster but making the production run slower. Is it true? Do you have numbers of the additional cost of the sync write?{quote}

We don't need sync to be turned on, as a RocksDB batch put is an atomic operation. After an OM restart, we can read the transaction information from the table and use it as the last applied index. To add more info: even a sync write would not impact write request performance, as it runs in a background thread and the write does not wait for the flush to be completed. But either way, for HA we don't need sync to be turned on.

{quote}It's not clear the structure of the new table. You wrote "For this, we can have a new table in rocks db with key as timestamp and value as largest transaction index in that batch flush to DB." But you mentioned a String->long table. And in the document you mentioned that only one key will be used. What will be the content of the table?{quote}

During implementation, we figured out that we need both the term and the log index, so this was modified during implementation.

Table: ratislogTable. Key = TRANSACTIONINFO, value = currentTerm-transactionIndex

I will update the doc as well, as a few more things came up during implementation.
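The idea, as described above: because a RocksDB write batch is applied atomically, the term-index marker lands in the DB in the same atomic step as the flushed transactions, so the marker can never disagree with the data. A minimal sketch, using a plain Map as a stand-in for the DB and write batch (the names {{ratislogTable}}/{{TRANSACTIONINFO}} come from the comment above; everything else is hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: flush a batch of transactions together with a marker recording
// the highest (term, index) in that batch.
public class TransactionInfoFlush {
  static final String TXN_INFO_KEY = "TRANSACTIONINFO";

  static void flush(Map<String, String> db, Map<String, String> pendingWrites,
                    long currentTerm, long lastIndexInBatch) {
    Map<String, String> batch = new HashMap<>(pendingWrites);
    // Value format from the comment above: currentTerm-transactionIndex.
    batch.put(TXN_INFO_KEY, currentTerm + "-" + lastIndexInBatch);
    db.putAll(batch); // stands in for an atomic rocksDB.write(options, writeBatch)
  }

  // On restart, read the marker back instead of replaying the Ratis log.
  static long[] lastApplied(Map<String, String> db) {
    String[] parts = db.get(TXN_INFO_KEY).split("-");
    return new long[] { Long.parseLong(parts[0]), Long.parseLong(parts[1]) };
  }
}
```

The Map stand-in obviously has no crash atomicity; in the real design that property comes from the RocksDB write batch itself.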
[jira] [Commented] (HDDS-3354) OM HA replay optimization
[ https://issues.apache.org/jira/browse/HDDS-3354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091559#comment-17091559 ] Marton Elek commented on HDDS-3354: ---

Thanks for working on this, Bharat. I have a few questions to understand the problem:

1. What is the performance impact of turning on sync writes? It seems that we make the restart faster but making the production run slower. Is it true? Do you have numbers of the additional cost of the sync write?

2. It's not clear the structure of the new table. You wrote "For this, we can have a new table in rocks db with key as timestamp and value as largest transaction index in that batch flush to DB." But you mentioned a String->long table. And in the document you mentioned that only one key will be used. What will be the content of the table?