Re: [PR] RATIS-2385. Don't keep failed requests in the RetryCache. [ratis]

2026-04-11 Thread via GitHub


github-actions[bot] commented on PR #1339:
URL: https://github.com/apache/ratis/pull/1339#issuecomment-4230466824

   This PR has been marked as stale due to 60 days of inactivity. Please 
comment or remove the stale label to keep it open. Otherwise, it will be 
automatically closed in ~30 days.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] RATIS-2385. Don't keep failed requests in the RetryCache. [ratis]

2026-02-09 Thread via GitHub


szetszwo commented on PR #1339:
URL: https://github.com/apache/ratis/pull/1339#issuecomment-3874348416

   @ss77892 ,  I agree that it is inconsistent for NotLeaderException for 
different cases.  A consistent approach is to add all the  calls to the 
RetryCache.  However, it adds overhead but not really solving any problems.
   
   if Ozone has already been fixed by 
[HDDS-13621](https://issues.apache.org/jira/browse/HDDS-13621), how about not 
changing Ratis?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] RATIS-2385. Don't keep failed requests in the RetryCache. [ratis]

2026-02-09 Thread via GitHub


ss77892 commented on PR #1339:
URL: https://github.com/apache/ratis/pull/1339#issuecomment-3874046548

   @szetszwo The workflow was following:
   1. client send the request and it successfully pass the first check and 
lands to the RetryCache. 
   2. leadership changed
   3. before landing the request to the log we find that we are not leader 
anymore and fail the request
   4. OM wants to retry the request but it checks the RetryCache directly and 
tries to get the response. Since the request failed with an exception it fails 
with the NPE.
   
   HDDS-13621 adds the check that request from the cache is successful. If it's 
not the regular logic with retry will be triggered. 
   So, from Ozone perspective the issue is already resolved. In other hand 
Ratis behavior is not consistent. Under similar circumstances (the leadership 
has been changed) the request might or might not be in the RetryCache, 
depending on the time when leadership has been changed. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] RATIS-2385. Don't keep failed requests in the RetryCache. [ratis]

2026-02-09 Thread via GitHub


szetszwo commented on PR #1339:
URL: https://github.com/apache/ratis/pull/1339#issuecomment-3873395846

   @ss77892 , thanks for pointing the code.
   
   HDDS-13621 only has the exception but not the details.  Could you provide 
explain how invalidating the cache could solve the problem in Ozone?  Why Ozone 
cannot retry with a new callId?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] RATIS-2385. Don't keep failed requests in the RetryCache. [ratis]

2026-02-09 Thread via GitHub


ss77892 commented on PR #1339:
URL: https://github.com/apache/ratis/pull/1339#issuecomment-3872567515

   @ivandika3 @szetszwo 
   Please check this.
   
https://github.com/apache/ratis/blob/b940b65004bea5f15593e2bbcb5ce925c0580b39/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L991
   We fail with an exception at the beginning of  writeAsyncImpl before the 
request lands to RetryCache. So, there is no idempotency guarantee. Also the 
client might go to another node that is the leader now with the same callId and 
that node doesn't have it in the RetryCache. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] RATIS-2385. Don't keep failed requests in the RetryCache. [ratis]

2026-02-08 Thread via GitHub


ivandika3 commented on PR #1339:
URL: https://github.com/apache/ratis/pull/1339#issuecomment-3869124461

   > on second thought, we should not invalidate any cache entries. Ratis retry 
cache is designed to have unique cache entries.
   
   +1 on this, we should not invalidate cache entries even if there is an 
exception. This will break the idempotency guarantee for a single 
ClientInvocationId, CallId.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] RATIS-2385. Don't keep failed requests in the RetryCache. [ratis]

2026-02-07 Thread via GitHub


ss77892 commented on PR #1339:
URL: https://github.com/apache/ratis/pull/1339#issuecomment-3866145726

   
https://github.com/apache/ratis/blob/b940b65004bea5f15593e2bbcb5ce925c0580b39/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L991
   Here we call checkLeaderState that would throw an exception if the node is 
not the leader. To the cache, we add it 5 lines later.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] RATIS-2385. Don't keep failed requests in the RetryCache. [ratis]

2026-02-07 Thread via GitHub


szetszwo commented on PR #1339:
URL: https://github.com/apache/ratis/pull/1339#issuecomment-3865714357

   > ... We have another leadership check at the beginning,  ...
   
   Could you point out the code?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] RATIS-2385. Don't keep failed requests in the RetryCache. [ratis]

2026-02-07 Thread via GitHub


ss77892 commented on PR #1339:
URL: https://github.com/apache/ratis/pull/1339#issuecomment-3864908196

   @szetszwo We have another leadership check at the beginning, so we don't 
store the entry in the cache if we are not the leader at the moment. So, it 
would behave exactly in the same way even without the patch. The patch removes 
an inconsistency in behavior that might occur when leadership changes during 
the event processing (the node was the leader when it received the event, but 
it's not a leader when it wants to add the event to the log). 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] RATIS-2385. Don't keep failed requests in the RetryCache. [ratis]

2026-02-07 Thread via GitHub


szetszwo commented on PR #1339:
URL: https://github.com/apache/ratis/pull/1339#issuecomment-3864804427

   @ss77892 , on second thought, we should not invalidate any cache entries.  
Ratis retry cache is designed to have unique cache entries.
   - If a call has replied normally or exceptionally, the calls with the same 
`ClientInvocationId` should always receive the same result.
   - For retires, the client always uses a different `ClientInvocationId`.
   
   Such design makes thing simple since all the replies are always fixed.
   
   In contrast, if we invalidate a NotLeaderException entry.  It could cause 
problem like below:
   
   1. Client sends a call.
   2. Server replies NotLeaderException and invalidates the entry.
   3. Client has not received the NotLeaderException reply (due to slow 
network).  It retries with the same ClientInvocationId.
   4. Server replies replies again with a different result.
   5. Client receives the second result.
   6. Client receives the NotLeaderException


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]