[jira] [Commented] (HBASE-28422) SplitWalProcedure will attempt SplitWalRemoteProcedure on the same target RegionServer indefinitely

2024-03-06 Thread Duo Zhang (Jira)


[ https://issues.apache.org/jira/browse/HBASE-28422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824014#comment-17824014 ]

Duo Zhang commented on HBASE-28422:
---

{quote}
Might as well be a good opportunity to refactor isSaslError() as a global 
static utility, available for use to anyone.
{quote}

+1.

> SplitWalProcedure will attempt SplitWalRemoteProcedure on the same target 
> RegionServer indefinitely
> ---
>
> Key: HBASE-28422
> URL: https://issues.apache.org/jira/browse/HBASE-28422
> Project: HBase
> Issue Type: Bug
> Components: master, proc-v2, wal
> Affects Versions: 2.5.5
> Reporter: David Manning
> Priority: Minor
>
> Similar to HBASE-28050. If HMaster selects a RegionServer for
> SplitWALRemoteProcedure, it will retry this server as long as the server is
> alive. I believe this is because even though
> {{RSProcedureDispatcher.ExecuteProceduresRemoteCall.run}} calls
> {{remoteCallFailed}}, there is no logic after this to select a new target
> server. For {{TransitRegionStateProcedure}} there is logic to select a new
> server for opening a region, using {{forceNewPlan}}. But
> SplitWALRemoteProcedure only has logic to try another server if we receive a
> {{DoNotRetryIOException}} in SplitWALRemoteProcedure#complete:
> [https://github.com/apache/hbase/blob/780ff56b3f23e7041ef1b705b7d3d0a53fdd05ae/hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/SplitWALRemoteProcedure.java#L104-L110]
> If we receive any other IOException, we will just retry the target server
> forever. Just like in HBASE-28050, if there is a SaslException, the master
> never retries the SplitWALRemoteProcedure on a new server, which can
> prevent ServerCrashProcedure from finishing until the target server for
> SplitWALRemoteProcedure is restarted. The following log is seen repeatedly,
> always sending to the same host.
> {code:java}
> 2024-01-31 15:59:43,616 WARN  [RSProcedureDispatcher-pool-72846] procedure.SplitWALRemoteProcedure - Failed split of hdfs:///hbase/WALs/,1704984571464-splitting/1704984571464.1706710908543, retry...
> java.io.IOException: Call to address= failed on local exception: java.io.IOException: Can not send request because relogin is in progress.
>   at sun.reflect.GeneratedConstructorAccessor363.newInstance(Unknown Source)
>   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:239)
>   at org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:391)
>   at org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:92)
>   at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:425)
>   at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:420)
>   at org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:114)
>   at org.apache.hadoop.hbase.ipc.Call.setException(Call.java:129)
>   at org.apache.hadoop.hbase.ipc.NettyRpcConnection.lambda$sendRequest$4(NettyRpcConnection.java:365)
>   at org.apache.hbase.thirdparty.io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
>   at org.apache.hbase.thirdparty.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
>   at org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
>   at org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:403)
>   at org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
>   at org.apache.hbase.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>   at org.apache.hbase.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>   at java.lang.Thread.run(Thread.java:750)
> Caused by: java.io.IOException: Can not send request because relogin is in progress.
>   at org.apache.hadoop.hbase.ipc.NettyRpcConnection.sendRequest0(NettyRpcConnection.java:321)
>   at org.apache.hadoop.hbase.ipc.NettyRpcConnection.lambda$sendRequest$4(NettyRpcConnection.java:363)
>   ... 8 more
> {code}
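
The retry decision described in the issue can be modeled roughly as below. This is a simplified illustration, not HBase source: the class {{SplitRetryModel}}, the enum {{Next}}, the method {{onComplete}} and the stand-in exception {{DoNotRetryStandIn}} are all hypothetical names, used so the sketch compiles against the JDK alone.

```java
import java.io.IOException;

/** Simplified, hypothetical model of the completion logic described in the issue. */
final class SplitRetryModel {

  /** Hypothetical stand-in for HBase's DoNotRetryIOException (not the real class). */
  static final class DoNotRetryStandIn extends IOException {
    DoNotRetryStandIn(String message) {
      super(message);
    }
  }

  /** What the parent procedure would do after the remote call finishes. */
  enum Next { DONE, PICK_ANOTHER_SERVER, RETRY_SAME_SERVER }

  private SplitRetryModel() {
  }

  static Next onComplete(Throwable error) {
    if (error == null) {
      return Next.DONE; // the split succeeded
    }
    if (error instanceof DoNotRetryStandIn) {
      return Next.PICK_ANOTHER_SERVER; // the only path that abandons the target
    }
    // Every other failure, including the relogin IOException from the log,
    // falls through to a retry against the same target server.
    return Next.RETRY_SAME_SERVER;
  }
}
```

In this model, calling {{onComplete}} with the relogin {{IOException}} from the log returns {{RETRY_SAME_SERVER}} on every attempt, which matches the repeated log lines reported above.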



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-28422) SplitWalProcedure will attempt SplitWalRemoteProcedure on the same target RegionServer indefinitely

2024-03-05 Thread Viraj Jasani (Jira)


[ https://issues.apache.org/jira/browse/HBASE-28422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17823811#comment-17823811 ]

Viraj Jasani commented on HBASE-28422:
--

Might as well be a good opportunity to refactor _isSaslError()_ as a global 
static utility, available for use to anyone.
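
A minimal sketch of what such a shared utility might look like, using only JDK classes. The class name {{SaslErrors}}, the constant {{RELOGIN_MESSAGE}} and the depth limit are assumptions made for illustration. Treating the "relogin is in progress" message as SASL-related is also an assumption, taken from the log in the issue description. This is not the existing HBase method.

```java
import javax.security.sasl.SaslException;

/** Hypothetical shared helper; names and matching rules are illustrative assumptions. */
final class SaslErrors {

  /** Assumed marker text, copied from the log message in the issue description. */
  static final String RELOGIN_MESSAGE = "relogin is in progress";

  /** Bound on the cause chain walk, guarding against cyclic cause chains. */
  private static final int MAX_DEPTH = 32;

  private SaslErrors() {
  }

  /** Returns true if the throwable or any of its causes looks SASL-related. */
  static boolean isSaslError(Throwable error) {
    int depth = 0;
    for (Throwable cause = error; cause != null && depth < MAX_DEPTH;
        cause = cause.getCause(), depth++) {
      if (cause instanceof SaslException) {
        return true;
      }
      String message = cause.getMessage();
      if (message != null && message.contains(RELOGIN_MESSAGE)) {
        return true;
      }
    }
    return false;
  }
}
```

A static, stateless helper like this could be called from any procedure that needs to decide whether to stop retrying the same target server.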



