[jira] [Commented] (HBASE-28048) RSProcedureDispatcher to abort executing request after configurable retries

2024-03-05 Thread Viraj Jasani (Jira)


[ https://issues.apache.org/jira/browse/HBASE-28048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17823735#comment-17823735 ]

Viraj Jasani commented on HBASE-28048:
--------------------------------------

Indeed, that's a good idea. Somewhat similar is HBASE-28366: if the
AssignmentManager accepts an old regionserver report instead of rejecting it, we
get into trouble. If we implement Nick's idea, we might first want to take care
of this, i.e. we will need consistency between ServerManager and AssignmentManager.

> RSProcedureDispatcher to abort executing request after configurable retries
> ----------------------------------------------------------------------------
>
>                 Key: HBASE-28048
>                 URL: https://issues.apache.org/jira/browse/HBASE-28048
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 3.0.0-alpha-4, 2.4.17, 2.5.5
>            Reporter: Viraj Jasani
>            Priority: Major
>             Fix For: 2.4.18, 2.7.0, 3.0.0-beta-2, 2.6.1, 2.5.9
>
>
> In a recent incident, we observed that RSProcedureDispatcher continues 
> executing region open/close procedures with unbounded retries even in the 
> presence of known failures like GSS initiate failure:
>  
> {code:java}
> 2023-08-25 02:21:02,821 WARN [ispatcher-pool-40777] procedure.RSProcedureDispatcher - request to rs1,61020,1692930044498 failed due to java.io.IOException: Call to address=rs1:61020 failed on local exception: java.io.IOException: org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS initiate failed, try=0, retrying... {code}
>  
>  
> If the remote execution results in an IOException, the dispatcher attempts to
> schedule the procedure for further retries:
>  
> {code:java}
>     private boolean scheduleForRetry(IOException e) {
>       LOG.debug("Request to {} failed, try={}", serverName, numberOfAttemptsSoFar, e);
>       // Should we wait a little before retrying? If the server is starting it's yes.
>       ...
>       ...
>       ...
>       numberOfAttemptsSoFar++;
>       // Add some backoff here as the attempts rise otherwise if a stuck condition, will fill
>       // logs with failed attempts. None of our backoff classes -- RetryCounter or
>       // ClientBackoffPolicy -- fit here nicely so just do something simple; increment by
>       // rsRpcRetryInterval millis * retry^2 on each try up to max of 10 seconds (don't want
>       // to back off too much in case of situation change).
>       submitTask(this,
>         Math.min(rsRpcRetryInterval * (this.numberOfAttemptsSoFar * this.numberOfAttemptsSoFar),
>           10 * 1000),
>         TimeUnit.MILLISECONDS);
>       return true;
>     }
> {code}
>  
>  
> Even though we try to provide backoff while retrying, the max wait time is only 10s:
>  
> {code:java}
> submitTask(this,
>   Math.min(rsRpcRetryInterval * (this.numberOfAttemptsSoFar * this.numberOfAttemptsSoFar),
>     10 * 1000),
>   TimeUnit.MILLISECONDS); {code}
>  
>  
> This results in an endless loop of retries, until either the underlying issue
> is fixed (e.g. the krb issue in this case) or the regionserver is killed and
> the ongoing open/close region procedure (and perhaps the entire SCP) for the
> affected regionserver is sidelined manually.
> {code:java}
> 2023-08-25 03:04:18,918 WARN  [ispatcher-pool-41274] procedure.RSProcedureDispatcher - request to rs1,61020,1692930044498 failed due to java.io.IOException: Call to address=rs1:61020 failed on local exception: java.io.IOException: org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS initiate failed, try=217, retrying...
> 2023-08-25 03:04:18,916 WARN  [ispatcher-pool-41280] procedure.RSProcedureDispatcher - request to rs1,61020,1692930044498 failed due to java.io.IOException: Call to address=rs1:61020 failed on local exception: java.io.IOException: org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS initiate failed, try=193, retrying...
> 2023-08-25 03:04:28,968 WARN  [ispatcher-pool-41315] procedure.RSProcedureDispatcher - request to rs1,61020,1692930044498 failed due to java.io.IOException: Call to address=rs1:61020 failed on local exception: java.io.IOException: org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS initiate failed, try=266, retrying...
> 2023-08-25 03:04:28,969 WARN  [ispatcher-pool-41240] procedure.RSProcedureDispatcher - request to rs1,61020,1692930044498 failed due to java.io.IOException: Call to address=rs1:61020 failed on local exception: java.io.IOException: org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS initiate failed, try=266, retrying...{code}
>  
> While external issues like "krb ticket expiry" require operator intervention, it is not prudent to 
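
To make the retry cadence above concrete, here is a minimal standalone illustration of the quoted formula, delay = min(rsRpcRetryInterval * try^2, 10s). The 100 ms interval is an assumed value for illustration, not taken from this issue; the point is that the delay plateaus at 10 seconds, which is why the try counter in the logs climbs into the hundreds against a server that never recovers.

{code:java}
// Standalone sketch (not HBase code): prints the backoff schedule of
// delay = min(interval * try^2, 10s) for an assumed 100 ms interval.
public class BackoffSchedule {
  public static void main(String[] args) {
    long rsRpcRetryInterval = 100; // millis; assumed value for illustration
    for (int attempt = 1; attempt <= 300; attempt += (attempt < 15 ? 1 : 50)) {
      long delayMs = Math.min(rsRpcRetryInterval * attempt * attempt, 10_000L);
      System.out.printf("try=%d -> next delay=%d ms%n", attempt, delayMs);
    }
    // From try=10 onward the delay is pinned at 10_000 ms, so a permanently
    // failing server keeps being retried roughly every 10 seconds, forever.
  }
}
{code}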

[jira] [Commented] (HBASE-28048) RSProcedureDispatcher to abort executing request after configurable retries

2024-02-28 Thread Andrew Kyle Purtell (Jira)


[ https://issues.apache.org/jira/browse/HBASE-28048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17821980#comment-17821980 ]

Andrew Kyle Purtell commented on HBASE-28048:
---------------------------------------------

{quote}When Master recognizes that the {{ServerName}} that is the target of a 
{{RemoteProcedure}} has left the cluster, the remote procedure must be failed. 
How the failure is handled is up to each procedure.
{quote}
This is a good idea.


[jira] [Commented] (HBASE-28048) RSProcedureDispatcher to abort executing request after configurable retries

2024-02-28 Thread Nick Dimiduk (Jira)


[ https://issues.apache.org/jira/browse/HBASE-28048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17821590#comment-17821590 ]

Nick Dimiduk commented on HBASE-28048:
--------------------------------------

An alternative to the liveness check would be to tie procedure execution into
events processed by the {{ServerManager}}. When the Master recognizes that the
{{ServerName}} that is the target of a {{RemoteProcedure}} has left the
cluster, the remote procedure must be failed. How the failure is handled is up
to each procedure.
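
A rough sketch of what tying remote procedures to ServerManager events could look like; every name below (the RemoteProcedure interface shape, DeadTargetProcedureFailer, onServerRemoved) is an illustrative stand-in, not HBase's actual API:

{code:java}
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Illustrative stand-in for a remote procedure that can be failed by the master.
interface RemoteProcedure {
  void remoteOperationFailed(Exception cause);
}

class DeadTargetProcedureFailer {
  // Remote procedures currently dispatched, keyed by target server name.
  private final Map<String, List<RemoteProcedure>> inFlight = new ConcurrentHashMap<>();

  void onDispatch(String serverName, RemoteProcedure proc) {
    inFlight.computeIfAbsent(serverName, s -> new CopyOnWriteArrayList<>()).add(proc);
  }

  // Hooked into the ServerManager event path: invoked when a server leaves the cluster.
  void onServerRemoved(String serverName) {
    List<RemoteProcedure> procs = inFlight.remove(serverName);
    if (procs == null) {
      return;
    }
    for (RemoteProcedure proc : procs) {
      // Fail the procedure; how the failure is handled stays up to each procedure.
      proc.remoteOperationFailed(new Exception(serverName + " left the cluster"));
    }
  }
}
{code}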


[jira] [Commented] (HBASE-28048) RSProcedureDispatcher to abort executing request after configurable retries

2024-01-13 Thread Bryan Beaudreault (Jira)


[ https://issues.apache.org/jira/browse/HBASE-28048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17806353#comment-17806353 ]

Bryan Beaudreault commented on HBASE-28048:
-------------------------------------------

Moving out of 2.6.0

[jira] [Commented] (HBASE-28048) RSProcedureDispatcher to abort executing request after configurable retries

2023-10-11 Thread Andrew Kyle Purtell (Jira)


[ https://issues.apache.org/jira/browse/HBASE-28048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17774182#comment-17774182 ]

Andrew Kyle Purtell commented on HBASE-28048:
---------------------------------------------

Moving out of 2.5.6

[jira] [Commented] (HBASE-28048) RSProcedureDispatcher to abort executing request after configurable retries

2023-08-30 Thread Duo Zhang (Jira)


[ https://issues.apache.org/jira/browse/HBASE-28048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17760676#comment-17760676 ]

Duo Zhang commented on HBASE-28048:
-----------------------------------

If we want to recover, the only safe way is to kill the region server. Just
giving up and trying another region server may cause another, more serious
problem: double assignment. That is much more difficult to figure out and also
much more difficult to fix.

Introducing a liveness check is a good idea; maybe we could add a feature to
the canary where we kill the unhealthy region server after checking availability?

Thanks.

[jira] [Commented] (HBASE-28048) RSProcedureDispatcher to abort executing request after configurable retries

2023-08-30 Thread Andrew Kyle Purtell (Jira)


[ https://issues.apache.org/jira/browse/HBASE-28048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17760627#comment-17760627 ]

Andrew Kyle Purtell commented on HBASE-28048:
---------------------------------------------

bq. Let's assume we are moving all regions from server A to server B. If server 
A is not reachable, and we fail all TRSP for region moves from A to B, the only 
alternative that the operator or software would be left with is stopping server 
A non-gracefully so that new SCP for server A can be processed by master.

We have this same problem for any region state transitions, any TRSP. 
[~vjasani] [~zhangduo] 

In some cases in our production we are seeing retries for more than 10 minutes
to an unresponsive or dead regionserver. It's too much, too long. It cannot be
required for an operator to step in every time to manually schedule an SCP for
the unresponsive server. TRSP should abort itself, or the parent procedure of
the TRSP should abort it, if the target server does not respond within a
reasonable time bound. I am thinking 1 minute. The clock is ticking on the RIT
while we are retrying RPCs to an unresponsive server. The time required to
detect that the server is unresponsive should be fairly short, so the total RIT
time remains fairly short.

To start with, TRSP should not retry an effectively infinite number of times.
If the total retry time is more than a minute or two, it should give up. Then,
depending on the region state, either another server is chosen or the
unresponsive server is fenced and killed with a forced SCP, which grabs the
lease on the RS WAL to split it, killing the RS as desired.

We should maybe also consider adding active probes, liveness checks, and a
predictive component (like Φ-accrual failure detection) so the master can
identify sick or unresponsive regionservers before they impact production, and
fence and kill them proactively with an SCP.
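
A hedged sketch of the bounded retry described above, as a standalone policy object; the 60-second budget and the 100 ms interval are assumptions for illustration, and the dispatcher's real failure path (remoteCallFailed) is only referenced in a comment:

{code:java}
import java.io.IOException;

// Minimal sketch, not HBase code: the existing quadratic backoff plus a hard
// bound on total retry time, after which the caller should fail the procedure.
class BoundedRetryPolicy {
  private final long rsRpcRetryInterval = 100;  // millis; assumed default
  private final long maxRetryMillis = 60_000;   // hypothetical bound: give up after 1 minute
  private final long firstAttemptTimeMillis = System.currentTimeMillis();
  private int numberOfAttemptsSoFar;

  /** Returns the next backoff delay, or -1 when the caller should give up (e.g. fail the TRSP). */
  long nextDelayMillis(IOException lastFailure) {
    long elapsed = System.currentTimeMillis() - firstAttemptTimeMillis;
    if (elapsed > maxRetryMillis) {
      return -1; // caller takes its failure path, e.g. remoteCallFailed(procedureEnv, e)
    }
    numberOfAttemptsSoFar++;
    return Math.min(rsRpcRetryInterval * numberOfAttemptsSoFar * numberOfAttemptsSoFar, 10_000L);
  }
}
{code}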


[jira] [Commented] (HBASE-28048) RSProcedureDispatcher to abort executing request after configurable retries

2023-08-29 Thread Viraj Jasani (Jira)


[ https://issues.apache.org/jira/browse/HBASE-28048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17760052#comment-17760052 ]

Viraj Jasani commented on HBASE-28048:
--------------------------------------

Let's assume we are moving all regions from server A to server B. If server A
is not reachable and we fail all TRSPs for region moves from A to B, the only
alternative the operator or software would be left with is stopping server A
non-gracefully so that a new SCP for server A can be processed by the master.

This should still be okay, I guess, assuming the remaining servers are
responsive to requests from the master, and hence the procedures overall are
making good progress (instead of some of them getting stuck).


[jira] [Commented] (HBASE-28048) RSProcedureDispatcher to abort executing request after configurable retries

2023-08-29 Thread Duo Zhang (Jira)


[ https://issues.apache.org/jira/browse/HBASE-28048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759850#comment-17759850 ]

Duo Zhang commented on HBASE-28048:
-----------------------------------

Aborting the master does not help here; the new master will still try to send
the procedure to the same region server.

We could add a log message mentioning that we have retried for a long time
without success while the region server is still alive, asking the operator to
check manually.


[jira] [Commented] (HBASE-28048) RSProcedureDispatcher to abort executing request after configurable retries

2023-08-29 Thread Viraj Jasani (Jira)


[ https://issues.apache.org/jira/browse/HBASE-28048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759843#comment-17759843 ]

Viraj Jasani commented on HBASE-28048:
--------------------------------------

I agree, we already have such logic for ServerNotRunningYetException,
DoNotRetryIOException and CallQueueTooBigException.

Adding SaslException should be relatively straightforward.

I still wonder whether we could track how many dispatcher threads are occupied
by a given regionserver (group by target server name and count how many threads
are busy serving each). If we find that a considerably higher number of threads
are busy performing region transitions against a single target server, with the
majority of them having already exhausted a high number of retries, perhaps it
would make sense to fail them, even if that leads to a master abort. The key is
to not saturate the dispatcher threads on a single (or a few) problematic
regionservers.

At worst, we will see inconsistencies when the new master takes over as active,
and that requires operational intervention, which is still fine compared to the
majority of dispatcher threads being occupied by tasks that are just not making
any progress.
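
As a rough illustration of that per-server accounting (the 50% threshold and all names below are hypothetical, not an existing HBase mechanism):

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: count in-flight dispatcher tasks per target server so the master can
// notice when a single regionserver is tying up most of the dispatcher pool.
class DispatcherSaturationTracker {
  private final Map<String, AtomicInteger> busyByServer = new ConcurrentHashMap<>();
  private final int poolSize;
  private final double saturationRatio = 0.5; // assumed: flag when one server holds >50% of the pool

  DispatcherSaturationTracker(int poolSize) {
    this.poolSize = poolSize;
  }

  void onDispatch(String serverName) {
    busyByServer.computeIfAbsent(serverName, s -> new AtomicInteger()).incrementAndGet();
  }

  void onComplete(String serverName) {
    AtomicInteger busy = busyByServer.get(serverName);
    if (busy != null) {
      busy.decrementAndGet();
    }
  }

  /** True when a single server occupies most of the dispatcher pool and is likely stuck. */
  boolean isSaturatedBy(String serverName) {
    AtomicInteger busy = busyByServer.get(serverName);
    return busy != null && busy.get() > poolSize * saturationRatio;
  }
}
{code}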


[jira] [Commented] (HBASE-28048) RSProcedureDispatcher to abort executing request after configurable retries

2023-08-29 Thread Duo Zhang (Jira)


[ https://issues.apache.org/jira/browse/HBASE-28048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759836#comment-17759836 ]

Duo Zhang commented on HBASE-28048:
-----------------------------------

The key here is that it is not safe to give up if we are not sure whether we
have already sent the procedure to the remote region server.

If we can make sure that the request did not reach the remote region server,
then we are safe to give up and try another region server.

You can see the scheduleForRetry method for more details. We used to consider
connection exceptions in the past, IIRC: for example, if we get a connection
refused exception, we can be sure that we have not sent the request to the
region server. But in general, a connection refused exception usually means the
region server is already dead, so we will soon go through dead server
processing and solve the problem.

I think the kerberos failure here is the same as a connection refused
exception: we can be sure that the region server has not received the request
yet, so we are safe to quit and try another region server.
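
A sketch of that classification; treating ConnectException and SaslException as "the request never reached the server" follows the reasoning above, and the class itself is illustrative, not HBase code:

{code:java}
import java.io.IOException;
import java.net.ConnectException;
import javax.security.sasl.SaslException;

// Sketch: errors raised before the request is handed to the region server
// (connection refused, SASL/kerberos handshake failure) are safe to fail fast,
// because the server cannot have started executing the operation.
final class FailFastClassifier {
  private FailFastClassifier() {
  }

  static boolean requestNeverReachedServer(IOException e) {
    Throwable t = e;
    while (t != null) {
      if (t instanceof ConnectException || t instanceof SaslException) {
        return true; // safe to give up here and try another region server
      }
      t = t.getCause();
    }
    return false; // ambiguous: the server may have received the request, keep retrying
  }
}
{code}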

[jira] [Commented] (HBASE-28048) RSProcedureDispatcher to abort executing request after configurable retries

2023-08-28 Thread Viraj Jasani (Jira)


[ https://issues.apache.org/jira/browse/HBASE-28048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759795#comment-17759795 ]

Viraj Jasani commented on HBASE-28048:
--------------------------------------

We already have relevant TODOs :)
{code:java}
try {
  sendRequest(getServerName(), request.build());
} catch (IOException e) {
  e = unwrapException(e);
  // TODO: In the future some operation may want to bail out early.
  // TODO: How many times should we retry (use numberOfAttemptsSoFar)
  if (!scheduleForRetry(e)) {
remoteCallFailed(procedureEnv, e);
  }
} {code}
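
And a hedged sketch of how the second TODO might be resolved in that catch block; the maxAttempts field and whatever configuration key would feed it are hypothetical, nothing below exists in HBase today:

{code:java}
try {
  sendRequest(getServerName(), request.build());
} catch (IOException e) {
  e = unwrapException(e);
  // Hypothetical: bail out once a configurable attempt budget is exhausted,
  // instead of rescheduling forever; maxAttempts <= 0 keeps today's behavior.
  boolean retryBudgetLeft = maxAttempts <= 0 || numberOfAttemptsSoFar < maxAttempts;
  if (!retryBudgetLeft || !scheduleForRetry(e)) {
    remoteCallFailed(procedureEnv, e);
  }
}
{code}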
