[jira] [Commented] (HDDS-3354) OM HA replay optimization

2020-06-23 Thread Marton Elek (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-3354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142996#comment-17142996
 ] 

Marton Elek commented on HDDS-3354:
---

Moved the target.version=0.6.0 flag to the sub issue (HDDS-3685). I think we 
should track that one to close this automatically.

> OM HA replay optimization
> -
>
> Key: HDDS-3354
> URL: https://issues.apache.org/jira/browse/HDDS-3354
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>  Components: OM HA, Ozone Manager
>Reporter: Bharat Viswanadham
>Assignee: Bharat Viswanadham
>Priority: Major
>  Labels: Triaged
> Attachments: OM HA Replay.pdf, Screen Shot 2020-05-20 at 1.28.48 
> PM.png
>
>
> This Jira is to improve the OM HA replay scenario.
> Attached the design document which discusses about the proposal and issue in 
> detail.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-3354) OM HA replay optimization

2020-05-21 Thread Bharat Viswanadham (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-3354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17113494#comment-17113494
 ] 

Bharat Viswanadham commented on HDDS-3354:
--

{quote}That's very interesting. Do you have more findings about the root cause? 
How many buckets did you have? It's very surprising to have GC pauses for a few 
thousand buckets, especially as they are not frequently updated. Do we need to 
adjust something on the cache size?{quote}

Initially, the test was started with the default heap size settings. Once OM was 
started with an 8 GB heap, no issues were seen.

The bucket count is not in the order of thousands; it is 14 million buckets.
The command used in the test is 
{{$ bin/ozone freon ombg -n=100}}

Even during cache design, we assumed that a single OM cluster would have on the 
order of 1000 volumes, that each volume would have around 1000 buckets, and that 
the cluster would have billions of keys. As volume/bucket existence checks are 
done for each request, we decided to keep the bucket/volume cache in memory the 
whole time as a full cache. We may want to revisit this decision, but that is 
not related to this Jira. (This came out as part of testing when creating a 
million buckets to test this Jira.)
A snippet from the cache design doc attached to HDDS-505:


{noformat}
Memory Usage:
As discussed above for Volume and Bucket Table we store full table information 
in memory. This will help in validation of the requests very quickly. As for 
every request Ozone Manager receives the mandatory check is volume/bucket 
exists or not. 

On a typical Ozone cluster Volumes can be in number of thousands. (Considering 
this as an admin level operation in a system where each team/organization gets 
a volume for their usage). And for each volume we can expect 1000 to 1 
buckets. These are considered just for calculation purpose.

Let’s assume each VolumeInfo and BucketInfo structure consumes 1KB in memory. 
Then,

Volume cache memory usage can be 1000 * 1KB = 10 MB. 
Bucket cache memory usage can be 1000 * 1000 * 1KB  = 1GB.

We can make the Volume and BucketTable caches partial if the number of buckets 
and volumes are very high in the system. This can be given as an option to end 
user. For now we assume that the entire list of volumes and buckets can be 
safely cached in memory.
{noformat}
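The back-of-the-envelope sizing quoted above can be checked directly; note that 
1000 entries at 1 KB each come to about 1 MB, so the volume cache is even 
smaller than the snippet's "10 MB" figure suggests. A quick sketch, where the 
counts and the 1 KB per-entry footprint are the design doc's assumptions, not 
measured values:

```python
# Cache sizing from the design-doc assumptions quoted above.
# The counts and the 1 KB per-entry footprint are assumptions, not measurements.
KB = 1024

volumes = 1000                    # assumed volumes per cluster
buckets_per_volume = 1000         # assumed buckets per volume
entry_size = 1 * KB               # assumed VolumeInfo/BucketInfo size in memory

volume_cache = volumes * entry_size
bucket_cache = volumes * buckets_per_volume * entry_size

print(f"volume cache: {volume_cache / KB**2:.2f} MiB")   # about 1 MiB
print(f"bucket cache: {bucket_cache / KB**3:.2f} GiB")   # about 1 GiB
```

With 14 million buckets instead of the assumed one million, the same arithmetic 
gives roughly 14x the bucket-cache footprint, which is consistent with the GC 
pressure seen in the test.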





> OM HA replay optimization
> -
>
> Key: HDDS-3354
> URL: https://issues.apache.org/jira/browse/HDDS-3354
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Reporter: Bharat Viswanadham
>Assignee: Bharat Viswanadham
>Priority: Major
> Attachments: OM HA Replay.pdf, Screen Shot 2020-05-20 at 1.28.48 
> PM.png
>
>
> This Jira is to improve the OM HA replay scenario.
> Attached the design document which discusses about the proposal and issue in 
> detail.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-3354) OM HA replay optimization

2020-05-21 Thread Bharat Viswanadham (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-3354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17113488#comment-17113488
 ] 

Bharat Viswanadham commented on HDDS-3354:
--

{quote}That's an interesting question. If I understood well, your objection is 
that if we do a Ratis log snapshot and a RocksDB snapshot (=checkpoint) at the 
same time, they can be inconsistent with each other in case of any error.

I don't think it's a problem. Writing the Ratis log snapshot can fail even now, 
which should be handled. The only question is whether we can finalize both 
snapshots in one step, which should be possible: for example, write the Ratis 
log snapshot file and the RocksDB snapshot file to the same directory and move 
it to the final location.{quote}

*Let me share the complete thought process here.*
1. Suppose we write both the Ratis log snapshot file (which holds the snapshot 
index) and the RocksDB checkpoint to a temporary directory, and the checkpoint 
succeeds but the snapshot file write fails. The current snapshot directory then 
still holds the old checkpoint and the old snapshot file. If OM uses its 
current DB on restart, we cannot avoid the replay logic. So in this case, 
whenever OM restarts, it should come up from the last checkpoint DB and 
snapshot file. As discussed, if we accept this, startup is delayed until the 
leader has applied all OM entries from the snapshot up to the latest log, and 
clients will get LeaderNotReadyException in the meantime. These kinds of issues 
will not be seen with the proposed approach.

2. A single step failing is not the only issue; it is one of several. If taking 
the snapshot is controlled by Ratis, then while a checkpoint is happening we 
must not allow any transactions to be flushed to the DB, because we want to 
know exactly which transaction was last applied to the DB, so that this is 
known when a restart happens. That means every time a checkpoint happens, we 
need to stop the double buffer, take the checkpoint, and write the snapshot 
file. Stopping the double buffer currently sends a signal to interrupt the 
flush thread, but with this change we would also need to keep track of the 
unflushed transactions not yet completed by the flush, or wait for the flush to 
complete. This might increase the current queue length in the double buffer, 
since applyTransaction will continue applying transactions to the StateMachine. 
This looks more complex than what is proposed, and it also comes with its own 
disadvantages of startup slowness and double-buffer queue growth.

Another approach is, instead of persisting the transaction info, to repeat the 
above checkpoint-and-snapshot-file process on every iteration, so that we don't 
stop the double buffer and applyTransaction keeps feeding it. But this is not a 
great solution either: it makes the double buffer slow, increases the number of 
checkpoints (just to point it out), and requires another background thread for 
cleanup. It does not have the slow-startup problem, though.
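The queueing concern above can be illustrated with a toy double buffer. This is 
a simplified sketch, not the actual OzoneManagerDoubleBuffer code: writers keep 
appending transactions while a flusher drains whole batches, and while the 
flusher is paused for a checkpoint, the unflushed queue only grows.

```python
from collections import deque

class ToyDoubleBuffer:
    """Simplified model: add() enqueues (index, key, value); flush_batch()
    swaps the buffer out and writes it as one atomic batch."""
    def __init__(self):
        self.pending = deque()          # unflushed transactions
        self.db = {}                    # stands in for RocksDB
        self.last_flushed_index = -1
        self.paused = False             # set while a checkpoint is taken

    def add(self, index, key, value):
        self.pending.append((index, key, value))

    def flush_batch(self):
        if self.paused or not self.pending:
            return 0
        batch, self.pending = self.pending, deque()
        for index, key, value in batch:  # one atomic batch put in the real code
            self.db[key] = value
        self.last_flushed_index = batch[-1][0]
        return len(batch)

buf = ToyDoubleBuffer()
buf.paused = True                        # checkpoint in progress
for i in range(5):
    buf.add(i, f"k{i}", f"v{i}")
buf.flush_batch()                        # no-op: the queue keeps growing
print(len(buf.pending))                  # 5
buf.paused = False
print(buf.flush_batch(), buf.last_flushed_index)   # 5 4
```

The longer the checkpoint pause, the larger the batch that piles up, which is 
exactly the double-buffer queue-length concern above.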

Testing has shown (HDDS-3474 + HDDS-3475) that performance is not degraded and 
is on par, and with this we can remove the replay logic from the actual request 
logic. So even if we want to revisit this later, it will be simpler, and 
developers implementing new non-idempotent write requests will not need to know 
about handling the replay case.

Let me know your thoughts.







[jira] [Commented] (HDDS-3354) OM HA replay optimization

2020-05-21 Thread Marton Elek (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-3354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17112874#comment-17112874
 ] 

Marton Elek commented on HDDS-3354:
---

bq. One of the main reasons is that as the bucket cache is a full-table cache 
it will be in memory, we need some JVM tuning to be set, because without 
tuning, we are seeing a lot of GC pauses happening in OM.

That's very interesting. Do you have more findings about the root cause? How 
many buckets did you have? It's very surprising to have GC pauses for a few 
thousand buckets, especially as they are not frequently updated. Do we need to 
adjust something on the cache size?




[jira] [Commented] (HDDS-3354) OM HA replay optimization

2020-05-21 Thread Marton Elek (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-3354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17112873#comment-17112873
 ] 

Marton Elek commented on HDDS-3354:
---

bq. If there are no further objection

I have no further objection, but I have a comment on your answer. Take it as a 
friendly chat during the coffee break about interesting questions related to 
distributed systems.

bq.  if any of the steps fails like checkpoint succeeded, but snapshot file 
writes failed, then when Om restart

That's an interesting question. If I understood well, your objection is that if 
we do a Ratis log snapshot and a RocksDB snapshot (=checkpoint) at the same 
time, they can be inconsistent with each other in case of any error.

I don't think it's a problem. Writing the Ratis log snapshot can fail even now, 
which should be handled. The only question is whether we can finalize both 
snapshots in one step, which should be possible: for example, write the Ratis 
log snapshot file and the RocksDB snapshot file to the same directory and move 
it to the final location.

I wouldn't say it's better, but I think it's possible. (How is your coffee?)
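The "write both files to one directory and move it to the final location" idea 
can be sketched as a rename-based commit. The file names below are 
illustrative, not Ozone's actual snapshot layout:

```python
import os
import tempfile

def publish_snapshot(final_dir, ratis_snapshot, db_checkpoint):
    """Write both snapshot artifacts into a temp directory next to the
    target, then rename the directory into place. On POSIX, rename within
    one filesystem is atomic, so readers see both files or neither."""
    parent = os.path.dirname(os.path.abspath(final_dir))
    tmp = tempfile.mkdtemp(prefix="snapshot-", dir=parent)
    with open(os.path.join(tmp, "ratis.snapshot"), "wb") as f:
        f.write(ratis_snapshot)
    with open(os.path.join(tmp, "om.db.checkpoint"), "wb") as f:
        f.write(db_checkpoint)
    os.rename(tmp, final_dir)            # the single commit point
```

If the process dies before the rename, the temp directory is simply left behind 
for cleanup and the previous snapshot, if any, stays intact, which is the 
finalize-in-one-step property discussed above.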




[jira] [Commented] (HDDS-3354) OM HA replay optimization

2020-05-20 Thread Bharat Viswanadham (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-3354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17112592#comment-17112592
 ] 

Bharat Viswanadham commented on HDDS-3354:
--

If there are no further objections, I will proceed with the commit and continue 
with the next tasks required for this improvement.




[jira] [Commented] (HDDS-3354) OM HA replay optimization

2020-05-20 Thread Bharat Viswanadham (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-3354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17112591#comment-17112591
 ] 

Bharat Viswanadham commented on HDDS-3354:
--

The reason for including HDDS-3615 and HDDS-3623 is that we identified a few 
potential problems that cause slowness as the number of objects in OM 
increases. Another reason is that, since the bucket cache is a full-table cache 
held in memory, some JVM tuning is needed; without it, we see a lot of GC 
pauses in OM. (So this test is like a stress test, where OM has millions of 
buckets in memory, whereas during the cache design we assumed a single OM would 
have a couple of thousand volumes/buckets.)

The JVM options used:
{noformat}
-XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC 
-XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled 
-Xloggc:/tmp/gc.log-$(date +'%Y%m%d%H%M') -XX:NewSize=1024m 
-XX:MaxNewSize=1024m -Xms8192m -Xmx8192m -XX:+PrintGCDateStamps
{noformat}





[jira] [Commented] (HDDS-3354) OM HA replay optimization

2020-05-20 Thread Bharat Viswanadham (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-3354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17112590#comment-17112590
 ] 

Bharat Viswanadham commented on HDDS-3354:
--

Hi [~elek]

Perf results:
 !Screen Shot 2020-05-20 at 1.28.48 PM.png! 


The command used for the test:
{{$ bin/ozone freon ombg -n=100}}

{quote}But I am fine with this approach, unless it causes significant 
performance degradation. And if I understood well, this is the case: it 
introduces some new IO pressure but also removes a lot of unnecessary queries, 
which can make the overall picture even better than before.{quote}

The above test results show no significant performance impact with HDDS-3474 
and HDDS-3475. The tests were run on a Mac with a single SSD and 16 GB of RAM.

{quote} For example: use RocksDB checkpoints and snapshot db together with 
Ratis log. {quote}
With this proposal we want to avoid replay logic in the actual request 
handling; otherwise it is hard on developers, who when implementing new APIs 
need to know about the replay case they must handle for non-idempotent 
requests. Also, the problem with the checkpoint approach is how to map the last 
applied transaction to the DB checkpoint atomically: if any of the steps fails 
(say the checkpoint succeeded but the snapshot file write failed), then on OM 
restart we have to start from the previous snapshot value, so we still need the 
replay code. The proposed solution removes the need to handle replay at the 
request logic level.









[jira] [Commented] (HDDS-3354) OM HA replay optimization

2020-05-18 Thread Marton Elek (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-3354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17110342#comment-17110342
 ] 

Marton Elek commented on HDDS-3354:
---

Thanks for the explanation, and also for the offline updates about this plan. 
It was not clear to me why it was decided to solve the problem in this way 
(there are one or two other ways to do the same, which were not mentioned as 
considered alternatives. For example: use RocksDB checkpoints and snapshot the 
db together with the Ratis log, or keep the list of the active keys always in 
memory and store only the values in the database.)

But I am fine with this approach, unless it causes significant performance 
degradation. And if I understood well, this is the case: it introduces some new 
IO pressure but also removes a lot of unnecessary queries, which can make the 
overall picture even better than before.

As far as I know, you have some initial numbers about the performance. I would 
propose to share them here.

Thanks again for explaining it to me offline, multiple times.




[jira] [Commented] (HDDS-3354) OM HA replay optimization

2020-04-27 Thread Bharat Viswanadham (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-3354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17093754#comment-17093754
 ] 

Bharat Viswanadham commented on HDDS-3354:
--

Hi [~elek]
Thanks for reviewing the design doc.

{quote}What is the performance impact of turning on sync for writes? It seems 
that we make the restart faster while making the production run slower. Is that 
true? Do you have numbers on the additional cost of the sync write?{quote}

We don't need sync to be turned on, as the RocksDB batch put is an atomic 
operation. After an OM restart, we can read the transaction information from 
the table and use it as the last applied index.

To add more info: even a sync write does not impact write-request performance, 
as flushing runs in a background thread and the write does not wait for the 
flush to complete. But in any case, for HA we don't need sync to be turned on.

{quote}The structure of the new table is not clear. You wrote "For this, we can 
have a new table in rocks db with key as timestamp and value as largest 
transaction index in that batch flush to DB." But you mentioned a String->long 
table. And in the document you mentioned that only one key will be used. What 
will be the content of the table?{quote}

During implementation we figured out that we need both the term and the log 
index, so this was modified accordingly.
Table -> ratislogTable
Key = TRANSACTIONINFO
value = currentTerm-transactionIndex

I will update the doc as well, since a few more things were figured out during 
implementation.
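The key point of this scheme is that the transaction info is written in the 
same RocksDB batch as the data it describes, so the two can never diverge. A 
hypothetical sketch follows; the table key and value format follow the comment 
above, while the encoding details and helper names are illustrative:

```python
TXN_KEY = "TRANSACTIONINFO"

def encode(term, index):
    return f"{term}-{index}"             # currentTerm-transactionIndex

def decode(value):
    term, index = value.split("-")
    return int(term), int(index)

def apply_batch(db, term, entries):
    """entries: list of (txn_index, key, value). The data entries and the
    transaction info land in one atomic batch (a plain dict update stands
    in for a RocksDB WriteBatch here), so after a restart the last applied
    index can simply be read back from the table."""
    batch = {key: value for _, key, value in entries}
    batch[TXN_KEY] = encode(term, entries[-1][0])
    db.update(batch)

db = {}
apply_batch(db, term=2, entries=[(100, "k1", "v1"), (101, "k2", "v2")])
print(decode(db[TXN_KEY]))               # (2, 101)
```

On restart, reading TRANSACTIONINFO gives exactly the term and index of the 
last flushed batch, which is why no sync and no replay logic is needed.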






[jira] [Commented] (HDDS-3354) OM HA replay optimization

2020-04-24 Thread Marton Elek (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-3354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091559#comment-17091559
 ] 

Marton Elek commented on HDDS-3354:
---

Thanks for working on this, Bharat. I have a few questions to understand the 
problem:

 1. What is the performance impact of turning on sync for writes? It seems that 
we make the restart faster while making the production run slower. Is that 
true? Do you have numbers on the additional cost of the sync write?

 2. The structure of the new table is not clear. You wrote "For this, we can 
have a new table in rocks db with key as timestamp and value as largest 
transaction index in that batch flush to DB." But you mentioned a String->long 
table. And in the document you mentioned that only one key will be used. What 
will be the content of the table? 



