[jira] [Commented] (HBASE-22287) infinite retries on failed server in RSProcedureDispatcher
[ https://issues.apache.org/jira/browse/HBASE-22287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17120259#comment-17120259 ]

Hudson commented on HBASE-22287:

Results for branch master [build #1741 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/master/1741/]: (/) *{color:green}+1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color} -- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/master/1741/General_20Nightly_20Build_20Report/]
(/) {color:green}+1 jdk8 hadoop2 checks{color} -- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/master/1698/JDK8_20Nightly_20Build_20Report_20_28Hadoop2_29/]
(/) {color:green}+1 jdk8 hadoop3 checks{color} -- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/master/1741/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/]
(/) {color:green}+1 jdk11 hadoop3 checks{color} -- For more information [see jdk11 report|https://builds.apache.org/job/HBase%20Nightly/job/master/1741/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/]
(/) {color:green}+1 source release artifact{color} -- See build output for details.
(/) {color:green}+1 client integration test{color}

> infinite retries on failed server in RSProcedureDispatcher
> ----------------------------------------------------------
>
> Key: HBASE-22287
> URL: https://issues.apache.org/jira/browse/HBASE-22287
> Project: HBase
> Issue Type: Bug
> Reporter: Sergey Shelukhin
> Assignee: Michael Stack
> Priority: Major
> Fix For: 3.0.0-alpha-1, 2.3.0
>
> We observed this recently on some cluster. I'm still investigating the root cause, but it seems the retries should have special handling for this exception, and separately there should probably be a cap on the number of retries.
> {noformat}
> 2019-04-20 04:24:27,093 WARN [RSProcedureDispatcher-pool4-t1285] procedure.RSProcedureDispatcher: request to server ,17020,1555742560432 failed due to java.io.IOException: Call to :17020 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: :17020, try=26603, retrying...
> {noformat}
> The corresponding worker is stuck

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HBASE-22287) infinite retries on failed server in RSProcedureDispatcher
[ https://issues.apache.org/jira/browse/HBASE-22287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17120094#comment-17120094 ]

Hudson commented on HBASE-22287:

Results for branch branch-2.3 [build #112 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.3/112/]: (/) *{color:green}+1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color} -- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.3/112/General_20Nightly_20Build_20Report/]
(/) {color:green}+1 jdk8 hadoop2 checks{color} -- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.3/112/JDK8_20Nightly_20Build_20Report_20_28Hadoop2_29/]
(/) {color:green}+1 jdk8 hadoop3 checks{color} -- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.3/112/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/]
(/) {color:green}+1 jdk11 hadoop3 checks{color} -- For more information [see jdk11 report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.3/112/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/]
(/) {color:green}+1 source release artifact{color} -- See build output for details.
(/) {color:green}+1 client integration test{color}
[jira] [Commented] (HBASE-22287) infinite retries on failed server in RSProcedureDispatcher
[ https://issues.apache.org/jira/browse/HBASE-22287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17120075#comment-17120075 ]

Hudson commented on HBASE-22287:

Results for branch branch-2 [build #2683 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2683/]: (/) *{color:green}+1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color} -- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2683/General_20Nightly_20Build_20Report/]
(/) {color:green}+1 jdk8 hadoop2 checks{color} -- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2683/JDK8_20Nightly_20Build_20Report_20_28Hadoop2_29/]
(/) {color:green}+1 jdk8 hadoop3 checks{color} -- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2683/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/]
(/) {color:green}+1 jdk11 hadoop3 checks{color} -- For more information [see jdk11 report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2683/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/]
(/) {color:green}+1 source release artifact{color} -- See build output for details.
(/) {color:green}+1 client integration test{color}
[jira] [Commented] (HBASE-22287) infinite retries on failed server in RSProcedureDispatcher
[ https://issues.apache.org/jira/browse/HBASE-22287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17119181#comment-17119181 ]

Michael Stack commented on HBASE-22287:
---

Put up a patch to add some backoff so we don't fill the logs. If we get into this situation, we don't want to break off retrying; conditions need to change first. A ServerCrashProcedure is the usual way we break off retrying. In the case cited above, the SCP had a hole, so the RPC kept on. Bug; it shouldn't have happened.
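The backoff idea above can be sketched roughly as follows. This is an illustrative sketch only, not the actual HBase patch; the class name, constants, and method are hypothetical:

```java
// Hypothetical sketch of capped exponential backoff between dispatch
// retries. The delay grows with the attempt number but is bounded, so a
// long-running retry loop stops flooding the log without ever giving up
// on its own; only external state (e.g. a ServerCrashProcedure marking
// the server dead) ends the loop. Names and constants are illustrative.
public class RetryBackoff {
    private static final long BASE_SLEEP_MS = 100;
    private static final long MAX_SLEEP_MS = 10_000;

    // Delay before attempt number `tries` (0-based): BASE * 2^tries, capped.
    static long backoffMillis(int tries) {
        // Clamp the shift so the multiplication cannot overflow a long.
        long sleep = BASE_SLEEP_MS * (1L << Math.min(tries, 30));
        return Math.min(sleep, MAX_SLEEP_MS);
    }

    public static void main(String[] args) {
        System.out.println(backoffMillis(0));  // 100
        System.out.println(backoffMillis(3));  // 800
        System.out.println(backoffMillis(20)); // capped at 10000
    }
}
```

With the cap at ten seconds, even a server stuck on the failed list for hours produces only a handful of log lines per minute instead of ten per second.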
[jira] [Commented] (HBASE-22287) infinite retries on failed server in RSProcedureDispatcher
[ https://issues.apache.org/jira/browse/HBASE-22287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17119138#comment-17119138 ]

Michael Stack commented on HBASE-22287:
---

Here are logs showing retries 525 and 526, with 100ms between attempts, from the trace-level log attached to HBASE-22041:

{code}
2020-05-21 17:29:49,267 TRACE [RSProcedureDispatcher-pool3-t44] procedure.RSProcedureDispatcher: Building request with operations count=1
2020-05-21 17:29:49,268 DEBUG [RSProcedureDispatcher-pool3-t44] ipc.AbstractRpcClient: Not trying to connect to regionserver-2.hbase.hbase.svc.cluster.local/10.128.14.39:16020 this server is in the failed servers list
2020-05-21 17:29:49,268 TRACE [RSProcedureDispatcher-pool3-t44] ipc.AbstractRpcClient: Call: ExecuteProcedures, callTime: 0ms
2020-05-21 17:29:49,268 DEBUG [RSProcedureDispatcher-pool3-t44] procedure.RSProcedureDispatcher: request to regionserver-2.hbase.hbase.svc.cluster.local,16020,1590082132059 failed, try=525
org.apache.hadoop.hbase.ipc.FailedServerException: Call to regionserver-2.hbase.hbase.svc.cluster.local/10.128.14.39:16020 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: regionserver-2.hbase.hbase.svc.cluster.local/10.128.14.39:16020
	at sun.reflect.GeneratedConstructorAccessor8.newInstance(Unknown Source)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:220)
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:392)
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:97)
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:423)
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:419)
	at org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:117)
	at org.apache.hadoop.hbase.ipc.Call.setException(Call.java:132)
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callMethod(AbstractRpcClient.java:436)
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:330)
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$200(AbstractRpcClient.java:97)
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:585)
	at org.apache.hadoop.hbase.shaded.protobuf.generated.AdminProtos$AdminService$BlockingStub.executeProcedures(AdminProtos.java:31006)
	at org.apache.hadoop.hbase.master.procedure.RSProcedureDispatcher$ExecuteProceduresRemoteCall.sendRequest(RSProcedureDispatcher.java:349)
	at org.apache.hadoop.hbase.master.procedure.RSProcedureDispatcher$ExecuteProceduresRemoteCall.run(RSProcedureDispatcher.java:314)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: regionserver-2.hbase.hbase.svc.cluster.local/10.128.14.39:16020
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient.getConnection(AbstractRpcClient.java:354)
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callMethod(AbstractRpcClient.java:433)
	... 9 more
2020-05-21 17:29:49,268 WARN [RSProcedureDispatcher-pool3-t44] procedure.RSProcedureDispatcher: request to server regionserver-2.hbase.hbase.svc.cluster.local,16020,1590082132059 failed due to org.apache.hadoop.hbase.ipc.FailedServerException: Call to regionserver-2.hbase.hbase.svc.cluster.local/10.128.14.39:16020 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: regionserver-2.hbase.hbase.svc.cluster.local/10.128.14.39:16020, try=525, retrying...
2020-05-21 17:29:49,368 TRACE [RSProcedureDispatcher-pool3-t45] procedure.RSProcedureDispatcher: Building request with operations count=1
2020-05-21 17:29:49,369 DEBUG [RSProcedureDispatcher-pool3-t45] ipc.AbstractRpcClient: Not trying to connect to regionserver-2.hbase.hbase.svc.cluster.local/10.128.14.39:16020 this server is in the failed servers list
2020-05-21 17:29:49,369 TRACE [RSProcedureDispatcher-pool3-t45] ipc.AbstractRpcClient: Call: ExecuteProcedures, callTime: 1ms
2020-05-21 17:29:49,369 DEBUG [RSProcedureDispatcher-pool3-t45] procedure.RSProcedureDispatcher: request to regionserver-2.hbase.hbase.svc.cluster.local,16020,1590082132059 failed, try=526
org.apache.hadoop.hbase.ipc.FailedServerException: Call to regionserver-2.hbase.hbase.svc.cluster.local/10.128.14.39:16020 failed on local exception: org.
{code}
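The excerpt shows a fixed ~100ms delay between attempts, which is why the try counter climbs so quickly. A back-of-the-envelope sketch (illustrative arithmetic only, class and method names are hypothetical):

```java
// Illustrative arithmetic: with a fixed 100 ms retry interval and no
// backoff, try=525 corresponds to under a minute of retrying, and the
// try=26603 from the original report to well under an hour of log spam.
public class RetryRate {
    // Total elapsed time, in whole seconds, after `tries` attempts
    // spaced `intervalMs` milliseconds apart.
    static long elapsedSeconds(long tries, long intervalMs) {
        return tries * intervalMs / 1000;
    }

    public static void main(String[] args) {
        System.out.println(elapsedSeconds(525, 100));   // 52 seconds
        System.out.println(elapsedSeconds(26603, 100)); // 2660 seconds, ~44 minutes
    }
}
```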
[jira] [Commented] (HBASE-22287) infinite retries on failed server in RSProcedureDispatcher
[ https://issues.apache.org/jira/browse/HBASE-22287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16824502#comment-16824502 ]

Sergey Shelukhin commented on HBASE-22287:
--

I'll check the logs if we hit it again... it looks like there were too many logs and they got rolled.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-22287) infinite retries on failed server in RSProcedureDispatcher
[ https://issues.apache.org/jira/browse/HBASE-22287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16823593#comment-16823593 ]

Duo Zhang commented on HBASE-22287:
---

Has the region server already been marked as dead on the master side? If not, then I think this is intentional. We can only give up once we are sure the RS is dead.
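The rationale above can be sketched as a retry policy that never gives up on the RPC error itself, only on master-side liveness state. This is a simplified illustration, not the actual RSProcedureDispatcher API; all names are hypothetical:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: the dispatcher keeps retrying on a
// FailedServerException and abandons the request only once the master has
// marked the region server dead (normally via a ServerCrashProcedure).
// Dropping the request any earlier could lose a procedure the RS must run.
public class DispatchLoop {
    private final Set<String> deadServers = ConcurrentHashMap.newKeySet();

    // Called when a ServerCrashProcedure (or equivalent) declares the
    // server dead on the master side.
    void markDead(String serverName) {
        deadServers.add(serverName);
    }

    // Decide whether a failed dispatch should be retried: keep retrying
    // unless the target is known dead, because the failure may be transient.
    boolean shouldRetry(String serverName) {
        return !deadServers.contains(serverName);
    }
}
```

The bug in this issue was effectively that the "mark dead" side had a hole, so `shouldRetry` never flipped to false and the worker retried forever.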
[jira] [Commented] (HBASE-22287) infinite retries on failed server in RSProcedureDispatcher
[ https://issues.apache.org/jira/browse/HBASE-22287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16823294#comment-16823294 ]

Sergey Shelukhin commented on HBASE-22287:
--

cc [~Apache9]