[jira] [Updated] (HBASE-20137) TestRSGroups is flakey

2018-03-09 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-20137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-20137:
--
Status: Open  (was: Patch Available)

Cancelling patch. This patch went the wrong route. It allowed the unassign to
complete if it could not schedule a server expire (because a concurrent server
crash procedure was in progress). The problem with this is that if the unassign
were done as part of a move, we'd next go to the assign step and could online a
region before its logs had been split.

Will be back. Working in umbrella issue HBASE-20152 first.
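
The ordering hazard described in the comment above can be pictured with a small toy model. This is only a sketch under stated assumptions: ToyRegion and ToyMove are hypothetical names invented for illustration, not HBase classes, and the model reduces a move to "split logs" and "assign" steps.

```java
// Toy model only: ToyRegion and ToyMove are hypothetical, not HBase code.
class ToyRegion {
    boolean walSplit;              // have the crashed server's logs been split?
    boolean online;                // has the region been opened elsewhere?
    boolean openedWithUnsplitLogs; // set if the region came online too early

    void splitLogs() {
        walSplit = true;
    }

    void assign() {
        online = true;
        if (!walSplit) {
            openedWithUnsplitLogs = true;
        }
    }
}

class ToyMove {
    // Replays a move (unassign, then assign) whose source server has crashed.
    static ToyRegion move(boolean unassignWaitsForCrashHandling) {
        ToyRegion region = new ToyRegion();
        if (unassignWaitsForCrashHandling) {
            region.splitLogs(); // crash handling finishes, then wakes the unassign
            region.assign();
        } else {
            region.assign();    // unassign completed early, so assign runs first
            region.splitLogs(); // crash handling only finishes afterwards
        }
        return region;
    }
}
```

In this model, ToyMove.move(false) leaves openedWithUnsplitLogs set, which is the outcome the cancelled patch risked; ToyMove.move(true) does not.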

> TestRSGroups is flakey
> --
>
> Key: HBASE-20137
> URL: https://issues.apache.org/jira/browse/HBASE-20137
> Project: HBase
> Issue Type: Bug
> Components: flakey
> Affects Versions: 2.0.0-beta-2
> Reporter: stack
> Assignee: stack
> Priority: Major
> Fix For: 2.0.0
>
> Attachments: HBASE-20137.branch-2.001.patch, 
> HBASE-20137.branch-2.002.patch, HBASE-20137.branch-2.003.patch, 
> HBASE-20137.branch-2.003.patch
>
>
> It was the single test that failed the hbase-2 nightlies in #440 at the
> hadoop2 stage.
> The failure manifests as a timeout. It actually has an interesting cause,
> calling into question some of the clauses in
> UnassignProcedure#remoteCallFailed.
> We are running a disabletable concurrent with a shutdown. pid=309 is the
> disable. pid=311 is the interesting one. The below is a little hard to read
> -- the exception 'message' is the current procedure as a String... hard
> to parse, fixing -- but we are trying to unassign as part of the
> disabletable. Our RPC fails because the server we are trying to RPC to is
> currently being processed as crashed (pid=308 is a servercrashprocedure for
> this server). As part of the processing of the failed RPC we will expire the
> server -- if we can't RPC to it, it must be gone. The current procedure is
> then suspended until it gets woken up by the servercrashprocedure triggered
> by the expire. Only in this case we are shutting down, so the expire is
> ignored... The current procedure is left in its suspended state. This prevents
> the Master going down. So we time out.
> 2018-03-05 11:29:22,507 INFO  [PEWorker-13] 
> assignment.RegionTransitionProcedure(213): Dispatch pid=311, ppid=309, 
> state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure 
> table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, 
> server=1cfd208ff882,40584,1520249102524; rit=CLOSING, 
> location=1cfd208ff882,40584,1520249102524
> 2018-03-05 11:29:22,508 WARN  [PEWorker-13] 
> assignment.RegionTransitionProcedure(187): Remote call failed pid=311, 
> ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure 
> table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, 
> server=1cfd208ff882,40584,1520249102524; rit=CLOSING, 
> location=1cfd208ff882,40584,1520249102524; exception=pid=311, ppid=309, 
> state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure 
> table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, 
> server=1cfd208ff882,40584,1520249102524 to 1cfd208ff882,40584,1520249102524
> 2018-03-05 11:29:22,508 WARN  [PEWorker-13] 
> assignment.UnassignProcedure(276): Expiring server pid=311, ppid=309, 
> state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure 
> table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, 
> server=1cfd208ff882,40584,1520249102524; rit=CLOSING, 
> location=1cfd208ff882,40584,1520249102524, 
> exception=org.apache.hadoop.hbase.master.assignment.FailedRemoteDispatchException:
>  pid=311, ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; 
> UnassignProcedure table=Group_ns:testKillRS, 
> region=de7534c208a06502537cd95c248b3043, 
> server=1cfd208ff882,40584,1520249102524 to 1cfd208ff882,40584,1520249102524
> 2018-03-05 11:29:22,508 WARN  [PEWorker-13] master.ServerManager(580): 
> Expiration of 1cfd208ff882,40584,1520249102524 but server shutdown already in 
> progress
> I need to cater for the case where the server expire is rejected.
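
The hang described in the quoted report can be reproduced in miniature. The following is a minimal sketch, assuming a simplified model: ToyServerManager, ToyUnassign and ToyState are hypothetical names made up for illustration and are not HBase APIs. The boolean return from expireServer is likewise an assumption of the model, used to make the "expire rejected" case visible.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model only: none of these types are HBase classes.
enum ToyState { RUNNABLE, SUSPENDED, EXPIRE_REJECTED }

class ToyServerManager {
    private final Set<String> shutdownInProgress = new HashSet<>();

    void markShutdownInProgress(String server) {
        shutdownInProgress.add(server);
    }

    // Returns true only when this call scheduled new crash handling, i.e. when
    // something will later wake procedures waiting on the server. Returns false
    // when shutdown handling for the server was already in progress.
    boolean expireServer(String server) {
        return shutdownInProgress.add(server);
    }
}

class ToyUnassign {
    ToyState state = ToyState.RUNNABLE;

    // Models the reported behaviour: suspend whether or not the expire was accepted.
    void remoteCallFailedIgnoringResult(ToyServerManager sm, String server) {
        sm.expireServer(server);
        state = ToyState.SUSPENDED; // nobody will wake us if the expire was rejected
    }

    // Variant that at least notices the rejection instead of suspending blindly.
    void remoteCallFailedCheckingResult(ToyServerManager sm, String server) {
        state = sm.expireServer(server) ? ToyState.SUSPENDED : ToyState.EXPIRE_REJECTED;
    }
}
```

With the server already marked as shutting down, the first variant ends in SUSPENDED with no waker, which matches the timeout above. The second variant only surfaces the rejection; what the procedure should then do is deliberately left open here.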



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-20137) TestRSGroups is flakey

2018-03-06 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-20137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-20137:
--
Attachment: HBASE-20137.branch-2.003.patch



[jira] [Updated] (HBASE-20137) TestRSGroups is flakey

2018-03-05 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-20137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-20137:
--
Attachment: HBASE-20137.branch-2.003.patch

> TestRSGroups is flakey
> --
>
> Key: HBASE-20137
> URL: https://issues.apache.org/jira/browse/HBASE-20137
> Project: HBase
> Issue Type: Bug
> Components: flakey
> Affects Versions: 2.0.0-beta-2
> Reporter: stack
> Assignee: stack
> Priority: Major
> Fix For: 2.0.0
>
> Attachments: HBASE-20137.branch-2.001.patch, 
> HBASE-20137.branch-2.002.patch, HBASE-20137.branch-2.003.patch


[jira] [Updated] (HBASE-20137) TestRSGroups is flakey

2018-03-05 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-20137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-20137:
--
Attachment: HBASE-20137.branch-2.002.patch

> TestRSGroups is flakey
> --
>
> Key: HBASE-20137
> URL: https://issues.apache.org/jira/browse/HBASE-20137
> Project: HBase
> Issue Type: Bug
> Components: flakey
> Affects Versions: 2.0.0-beta-2
> Reporter: stack
> Assignee: stack
> Priority: Major
> Fix For: 2.0.0
>
> Attachments: HBASE-20137.branch-2.001.patch, 
> HBASE-20137.branch-2.002.patch


[jira] [Updated] (HBASE-20137) TestRSGroups is flakey

2018-03-05 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-20137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-20137:
--
Status: Patch Available  (was: Open)

Trying the patch against precommit while I try to figure out a test (the
circumstance is a bit tough to conjure, what with an RPC to a server
concurrently undergoing a server crash procedure...)

> TestRSGroups is flakey
> --
>
> Key: HBASE-20137
> URL: https://issues.apache.org/jira/browse/HBASE-20137
> Project: HBase
> Issue Type: Bug
> Components: flakey
> Affects Versions: 2.0.0-beta-2
> Reporter: stack
> Assignee: stack
> Priority: Major
> Fix For: 2.0.0
>
> Attachments: HBASE-20137.branch-2.001.patch


[jira] [Updated] (HBASE-20137) TestRSGroups is flakey

2018-03-05 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-20137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-20137:
--
Fix Version/s: 2.0.0



[jira] [Updated] (HBASE-20137) TestRSGroups is flakey

2018-03-05 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-20137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-20137:
--
Attachment: HBASE-20137.branch-2.001.patch
