Re: [PR] RATIS-2385. Don't keep failed requests in the RetryCache. [ratis]
github-actions[bot] commented on PR #1339: URL: https://github.com/apache/ratis/pull/1339#issuecomment-4230466824 This PR has been marked as stale due to 60 days of inactivity. Please comment or remove the stale label to keep it open. Otherwise, it will be automatically closed in ~30 days. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] RATIS-2385. Don't keep failed requests in the RetryCache. [ratis]
szetszwo commented on PR #1339: URL: https://github.com/apache/ratis/pull/1339#issuecomment-3874348416 @ss77892 , I agree that it is inconsistent for NotLeaderException for different cases. A consistent approach is to add all the calls to the RetryCache. However, it adds overhead but not really solving any problems. if Ozone has already been fixed by [HDDS-13621](https://issues.apache.org/jira/browse/HDDS-13621), how about not changing Ratis? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] RATIS-2385. Don't keep failed requests in the RetryCache. [ratis]
ss77892 commented on PR #1339: URL: https://github.com/apache/ratis/pull/1339#issuecomment-3874046548 @szetszwo The workflow was following: 1. client send the request and it successfully pass the first check and lands to the RetryCache. 2. leadership changed 3. before landing the request to the log we find that we are not leader anymore and fail the request 4. OM wants to retry the request but it checks the RetryCache directly and tries to get the response. Since the request failed with an exception it fails with the NPE. HDDS-13621 adds the check that request from the cache is successful. If it's not the regular logic with retry will be triggered. So, from Ozone perspective the issue is already resolved. In other hand Ratis behavior is not consistent. Under similar circumstances (the leadership has been changed) the request might or might not be in the RetryCache, depending on the time when leadership has been changed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] RATIS-2385. Don't keep failed requests in the RetryCache. [ratis]
szetszwo commented on PR #1339: URL: https://github.com/apache/ratis/pull/1339#issuecomment-3873395846 @ss77892 , thanks for pointing the code. HDDS-13621 only has the exception but not the details. Could you provide explain how invalidating the cache could solve the problem in Ozone? Why Ozone cannot retry with a new callId? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] RATIS-2385. Don't keep failed requests in the RetryCache. [ratis]
ss77892 commented on PR #1339: URL: https://github.com/apache/ratis/pull/1339#issuecomment-3872567515 @ivandika3 @szetszwo Please check this. https://github.com/apache/ratis/blob/b940b65004bea5f15593e2bbcb5ce925c0580b39/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L991 We fail with an exception at the beginning of writeAsyncImpl before the request lands to RetryCache. So, there is no idempotency guarantee. Also the client might go to another node that is the leader now with the same callId and that node doesn't have it in the RetryCache. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] RATIS-2385. Don't keep failed requests in the RetryCache. [ratis]
ivandika3 commented on PR #1339: URL: https://github.com/apache/ratis/pull/1339#issuecomment-3869124461 > on second thought, we should not invalidate any cache entries. Ratis retry cache is designed to have unique cache entries. +1 on this, we should not invalidate cache entries even if there is an exception. This will break the idempotency guarantee for a single ClientInvocationId, CallId. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] RATIS-2385. Don't keep failed requests in the RetryCache. [ratis]
ss77892 commented on PR #1339: URL: https://github.com/apache/ratis/pull/1339#issuecomment-3866145726 https://github.com/apache/ratis/blob/b940b65004bea5f15593e2bbcb5ce925c0580b39/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L991 Here we call checkLeaderState that would throw an exception if the node is not the leader. To the cache, we add it 5 lines later. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] RATIS-2385. Don't keep failed requests in the RetryCache. [ratis]
szetszwo commented on PR #1339: URL: https://github.com/apache/ratis/pull/1339#issuecomment-3865714357 > ... We have another leadership check at the beginning, ... Could you point out the code? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] RATIS-2385. Don't keep failed requests in the RetryCache. [ratis]
ss77892 commented on PR #1339: URL: https://github.com/apache/ratis/pull/1339#issuecomment-3864908196 @szetszwo We have another leadership check at the beginning, so we don't store the entry in the cache if we are not the leader at the moment. So, it would behave exactly in the same way even without the patch. The patch removes an inconsistency in behavior that might occur when leadership changes during the event processing (the node was the leader when it received the event, but it's not a leader when it wants to add the event to the log). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] RATIS-2385. Don't keep failed requests in the RetryCache. [ratis]
szetszwo commented on PR #1339: URL: https://github.com/apache/ratis/pull/1339#issuecomment-3864804427 @ss77892 , on second thought, we should not invalidate any cache entries. Ratis retry cache is designed to have unique cache entries. - If a call has replied normally or exceptionally, the calls with the same `ClientInvocationId` should always receive the same result. - For retires, the client always uses a different `ClientInvocationId`. Such design makes thing simple since all the replies are always fixed. In contrast, if we invalidate a NotLeaderException entry. It could cause problem like below: 1. Client sends a call. 2. Server replies NotLeaderException and invalidates the entry. 3. Client has not received the NotLeaderException reply (due to slow network). It retries with the same ClientInvocationId. 4. Server replies replies again with a different result. 5. Client receives the second result. 6. Client receives the NotLeaderException -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
