[jira] [Commented] (HBASE-22287) infinite retries on failed server in RSProcedureDispatcher

2020-05-30 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17120259#comment-17120259
 ] 

Hudson commented on HBASE-22287:


Results for branch master
[build #1741 on 
builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/master/1741/]: (/) 
*{color:green}+1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://builds.apache.org/job/HBase%20Nightly/job/master/1741/General_20Nightly_20Build_20Report/]




(/) {color:green}+1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://builds.apache.org/job/HBase%20Nightly/job/master/1698/JDK8_20Nightly_20Build_20Report_20_28Hadoop2_29/]


(/) {color:green}+1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://builds.apache.org/job/HBase%20Nightly/job/master/1741/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 jdk11 hadoop3 checks{color}
-- For more information [see jdk11 
report|https://builds.apache.org/job/HBase%20Nightly/job/master/1741/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}


> infinite retries on failed server in RSProcedureDispatcher
> ---
>
> Key: HBASE-22287
> URL: https://issues.apache.org/jira/browse/HBASE-22287
> Project: HBase
> Issue Type: Bug
> Reporter: Sergey Shelukhin
> Assignee: Michael Stack
> Priority: Major
> Fix For: 3.0.0-alpha-1, 2.3.0
>
>
> We observed this recently on a cluster. I'm still investigating the root 
> cause, but it seems the retries should have special handling for this 
> exception; separately, there should probably be a cap on the number of retries.
> {noformat}
> 2019-04-20 04:24:27,093 WARN  [RSProcedureDispatcher-pool4-t1285] 
> procedure.RSProcedureDispatcher: request to server ,17020,1555742560432 
> failed due to java.io.IOException: Call to :17020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the 
> failed servers list: :17020, try=26603, retrying...
> {noformat}
> The corresponding worker is stuck
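
A minimal sketch of the special handling the description asks for: cap the 
retries and inspect the failure before pausing. The class, constant, and 
method names below (CappedRetrySketch, MAX_RETRIES, Request) are illustrative, 
not HBase API.

{code:java}
import java.io.IOException;

public class CappedRetrySketch {
  private static final int MAX_RETRIES = 100; // hypothetical cap, not an HBase default
  private static final long PAUSE_MS = 100;   // matches the ~100ms interval seen in the logs

  /** Hypothetical stand-in for one remote procedure dispatch. */
  interface Request {
    void send() throws IOException;
  }

  static void runWithRetries(Request request) throws IOException, InterruptedException {
    for (int attempt = 1; ; attempt++) {
      try {
        request.send();
        return;
      } catch (IOException e) {
        if (attempt >= MAX_RETRIES) {
          // Give up instead of retrying forever and wedging the worker thread.
          throw new IOException("giving up after " + attempt + " attempts", e);
        }
        // Special handling could inspect the cause here (e.g. a
        // FailedServerException) and back off harder or bail out sooner.
        Thread.sleep(PAUSE_MS);
      }
    }
  }
}
{code}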





[jira] [Commented] (HBASE-22287) infinite retries on failed server in RSProcedureDispatcher

2020-05-29 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17120094#comment-17120094
 ] 

Hudson commented on HBASE-22287:


Results for branch branch-2.3
[build #112 on 
builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.3/112/]: 
(/) *{color:green}+1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.3/112/General_20Nightly_20Build_20Report/]




(/) {color:green}+1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.3/112/JDK8_20Nightly_20Build_20Report_20_28Hadoop2_29/]


(/) {color:green}+1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.3/112/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 jdk11 hadoop3 checks{color}
-- For more information [see jdk11 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.3/112/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}




[jira] [Commented] (HBASE-22287) infinite retries on failed server in RSProcedureDispatcher

2020-05-29 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17120075#comment-17120075
 ] 

Hudson commented on HBASE-22287:


Results for branch branch-2
[build #2683 on 
builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2683/]: 
(/) *{color:green}+1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2683/General_20Nightly_20Build_20Report/]




(/) {color:green}+1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2683/JDK8_20Nightly_20Build_20Report_20_28Hadoop2_29/]


(/) {color:green}+1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2683/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 jdk11 hadoop3 checks{color}
-- For more information [see jdk11 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2683/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}




[jira] [Commented] (HBASE-22287) infinite retries on failed server in RSProcedureDispatcher

2020-05-28 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17119181#comment-17119181
 ] 

Michael Stack commented on HBASE-22287:
---

Put up a patch to add some backoff so we don't fill the logs.

If we get into this situation, we don't want to break off retrying. Conditions 
need to change first: e.g., a ServerCrashProcedure is the usual way we break 
off retrying. In the case cited above, SCP had a hole so the rpc kept on. That 
was a bug; it shouldn't have happened.
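
A minimal sketch of what such backoff might look like: exponential pauses 
capped at a maximum, so the dispatcher keeps probing but stops flooding the 
logs. The constants and names here are illustrative, not the actual patch.

{code:java}
public final class RetryBackoff {
  private static final long BASE_PAUSE_MS = 100;   // first pause, per the ~100ms seen in the logs
  private static final long MAX_PAUSE_MS = 10_000; // hypothetical ceiling; keep probing at this rate

  private RetryBackoff() {}

  /** Pause before the given (1-based) attempt: doubles each try, capped at the ceiling. */
  static long pauseMillis(int attempt) {
    long pause = BASE_PAUSE_MS << Math.min(attempt - 1, 30); // clamp shift to avoid overflow
    return Math.min(pause, MAX_PAUSE_MS);
  }

  public static void main(String[] args) {
    for (int attempt = 1; attempt <= 10; attempt++) {
      System.out.println("try=" + attempt + " pause=" + pauseMillis(attempt) + "ms");
    }
  }
}
{code}

With these numbers the pause ramps 100, 200, 400, ... and flattens at 10 
seconds, so a server that stays failed produces a log line every 10 seconds 
instead of ten per second.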



[jira] [Commented] (HBASE-22287) infinite retries on failed server in RSProcedureDispatcher

2020-05-28 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17119138#comment-17119138
 ] 

Michael Stack commented on HBASE-22287:
---

Here are logs showing retries 525 and 526, with 100ms between attempts, from 
the trace-level log attached to HBASE-22041
{code}
 2020-05-21 17:29:49,267 TRACE [RSProcedureDispatcher-pool3-t44] 
procedure.RSProcedureDispatcher: Building request with operations count=1
 2020-05-21 17:29:49,268 DEBUG [RSProcedureDispatcher-pool3-t44] 
ipc.AbstractRpcClient: Not trying to connect to 
regionserver-2.hbase.hbase.svc.cluster.local/10.128.14.39:16020 this server is 
in the failed servers list
 2020-05-21 17:29:49,268 TRACE [RSProcedureDispatcher-pool3-t44] 
ipc.AbstractRpcClient: Call: ExecuteProcedures, callTime: 0ms
 2020-05-21 17:29:49,268 DEBUG [RSProcedureDispatcher-pool3-t44] 
procedure.RSProcedureDispatcher: request to 
regionserver-2.hbase.hbase.svc.cluster.local,16020,1590082132059 failed, try=525
 org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
regionserver-2.hbase.hbase.svc.cluster.local/10.128.14.39:16020 failed on local 
exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in 
the failed servers list: regionserver-2.hbase.hbase.svc.cluster.local/ 
10.128.14.39:16020
   at sun.reflect.GeneratedConstructorAccessor8.newInstance(Unknown Source)
   at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
   at org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:220)
   at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:392)
   at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:97)
   at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:423)
   at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:419)
   at org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:117)
   at org.apache.hadoop.hbase.ipc.Call.setException(Call.java:132)
   at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.callMethod(AbstractRpcClient.java:436)
   at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:330)
   at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$200(AbstractRpcClient.java:97)
   at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:585)
   at 
org.apache.hadoop.hbase.shaded.protobuf.generated.AdminProtos$AdminService$BlockingStub.executeProcedures(AdminProtos.java:31006)
   at 
org.apache.hadoop.hbase.master.procedure.RSProcedureDispatcher$ExecuteProceduresRemoteCall.sendRequest(RSProcedureDispatcher.java:349)
   at 
org.apache.hadoop.hbase.master.procedure.RSProcedureDispatcher$ExecuteProceduresRemoteCall.run(RSProcedureDispatcher.java:314)
   at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   at java.lang.Thread.run(Thread.java:748)
 Caused by: org.apache.hadoop.hbase.ipc.FailedServerException: This server is 
in the failed servers list: 
regionserver-2.hbase.hbase.svc.cluster.local/10.128.14.39:16020
   at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.getConnection(AbstractRpcClient.java:354)
   at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.callMethod(AbstractRpcClient.java:433)
   ... 9 more
 2020-05-21 17:29:49,268 WARN  [RSProcedureDispatcher-pool3-t44] 
procedure.RSProcedureDispatcher: request to server 
regionserver-2.hbase.hbase.svc.cluster.local,16020,1590082132059 failed due to 
org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
regionserver-2.hbase.hbase.svc.cluster.local/10.128.14.39:16020 failed on 
local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server 
is in the failed servers list: 
regionserver-2.hbase.hbase.svc.cluster.local/10.128.14.39:16020, try=525, 
retrying...
 2020-05-21 17:29:49,368 TRACE [RSProcedureDispatcher-pool3-t45] 
procedure.RSProcedureDispatcher: Building request with operations count=1
 2020-05-21 17:29:49,369 DEBUG [RSProcedureDispatcher-pool3-t45] 
ipc.AbstractRpcClient: Not trying to connect to 
regionserver-2.hbase.hbase.svc.cluster.local/10.128.14.39:16020 this server is 
in the failed servers list
 2020-05-21 17:29:49,369 TRACE [RSProcedureDispatcher-pool3-t45] 
ipc.AbstractRpcClient: Call: ExecuteProcedures, callTime: 1ms
 2020-05-21 17:29:49,369 DEBUG [RSProcedureDispatcher-pool3-t45] 
procedure.RSProcedureDispatcher: request to 
regionserver-2.hbase.hbase.svc.cluster.local,16020,1590082132059 failed, try=526
 org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
regionserver-2.hbase.hbase.svc.cluster.local/10.128.14.39:16020 failed on local 
exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in 
the failed servers list: regionserver-2.hbase.hbase.svc.cluster.local/10.128.14.39:16020
{code}
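
For scale: assuming the ~100 ms between attempts seen above held throughout, 
try=525 corresponds to roughly 525 × 0.1 s ≈ 53 seconds of continuous 
retrying, and the try=26603 from the original report to roughly 
26603 × 0.1 s ≈ 44 minutes; each attempt logs a full stack trace, which is 
how the logs fill up.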

[jira] [Commented] (HBASE-22287) infinite retries on failed server in RSProcedureDispatcher

2019-04-23 Thread Sergey Shelukhin (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16824502#comment-16824502
 ] 

Sergey Shelukhin commented on HBASE-22287:
--

I'll check the logs if we hit it again... it looks like there were too many 
logs and they got rolled.



[jira] [Commented] (HBASE-22287) infinite retries on failed server in RSProcedureDispatcher

2019-04-22 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16823593#comment-16823593
 ] 

Duo Zhang commented on HBASE-22287:
---

Has the region server already been marked as dead on the master side? If not, 
then I think this is intentional. We can only give up once we are sure that 
the RS is dead.
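
A minimal sketch of that invariant: retrying stops only for servers the master 
has confirmed dead (e.g. once a ServerCrashProcedure has been scheduled). The 
class and method names are illustrative, not HBase API.

{code:java}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class DeadServerGate {
  // Servers the master has confirmed dead; hypothetical stand-in for the
  // master's dead-server tracking.
  private final Set<String> confirmedDead = ConcurrentHashMap.newKeySet();

  void markDead(String serverName) {
    confirmedDead.add(serverName);
  }

  /** Keep retrying unless the master knows the server is dead. */
  boolean shouldKeepRetrying(String serverName) {
    return !confirmedDead.contains(serverName);
  }
}
{code}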



[jira] [Commented] (HBASE-22287) infinite retries on failed server in RSProcedureDispatcher

2019-04-22 Thread Sergey Shelukhin (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16823294#comment-16823294
 ] 

Sergey Shelukhin commented on HBASE-22287:
--

cc [~Apache9]
