[jira] [Updated] (HBASE-20137) TestRSGroups is flakey
[ https://issues.apache.org/jira/browse/HBASE-20137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-20137:
--------------------------
    Status: Open  (was: Patch Available)

Cancelling patch. This patch went the wrong route. It allowed the unassign to complete if it could not schedule a server expire (because a concurrent server crash procedure was in progress). The problem with this is that if the unassign were done as part of a move, we'd next go to the assign step and could online a region before its logs had been split.

Will be back. Working in umbrella issue HBASE-20152 first.

> TestRSGroups is flakey
> ----------------------
>
>                 Key: HBASE-20137
>                 URL: https://issues.apache.org/jira/browse/HBASE-20137
>             Project: HBase
>          Issue Type: Bug
>          Components: flakey
>    Affects Versions: 2.0.0-beta-2
>            Reporter: stack
>            Assignee: stack
>            Priority: Major
>             Fix For: 2.0.0
>
>         Attachments: HBASE-20137.branch-2.001.patch, HBASE-20137.branch-2.002.patch, HBASE-20137.branch-2.003.patch, HBASE-20137.branch-2.003.patch
>
>
> It was the single test that failed the hbase-2 nightlies in #440 at the hadoop2 stage.
>
> The failure manifests as a timeout. It actually has an interesting cause, calling into question some of the clauses in UnassignProcedure#remoteCallFailed.
>
> We are running a disabletable concurrent with a shutdown. pid=309 is the disable. pid=311 is the interesting one. The below is a little hard to read -- the exception 'message' is the current procedure as a String... hard to parse, fixing -- but we are trying to unassign as part of the disabletable. Our RPC fails because the server we are trying to RPC to is currently being processed as crashed (pid=308 is a servercrashprocedure for this server). As part of the processing of the failed RPC we will expire the server -- if we can't RPC to it, it must be gone. The current procedure is then suspended until it gets woken up by the servercrashprocedure triggered by the expire, only in this case we are shutting down so the expire is ignored... The current procedure is left in its suspended state. This prevents the Master going down. So we time out.
>
> 2018-03-05 11:29:22,507 INFO [PEWorker-13] assignment.RegionTransitionProcedure(213): Dispatch pid=311, ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, server=1cfd208ff882,40584,1520249102524; rit=CLOSING, location=1cfd208ff882,40584,1520249102524
> 2018-03-05 11:29:22,508 WARN [PEWorker-13] assignment.RegionTransitionProcedure(187): Remote call failed pid=311, ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, server=1cfd208ff882,40584,1520249102524; rit=CLOSING, location=1cfd208ff882,40584,1520249102524; exception=pid=311, ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, server=1cfd208ff882,40584,1520249102524 to 1cfd208ff882,40584,1520249102524
> 2018-03-05 11:29:22,508 WARN [PEWorker-13] assignment.UnassignProcedure(276): Expiring server pid=311, ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, server=1cfd208ff882,40584,1520249102524; rit=CLOSING, location=1cfd208ff882,40584,1520249102524, exception=org.apache.hadoop.hbase.master.assignment.FailedRemoteDispatchException: pid=311, ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, server=1cfd208ff882,40584,1520249102524 to 1cfd208ff882,40584,1520249102524
> 2018-03-05 11:29:22,508 WARN [PEWorker-13] master.ServerManager(580): Expiration of 1cfd208ff882,40584,1520249102524 but server shutdown already in progress
>
> I need to cater for the case where the expire server request is rejected.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
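For readers following the description above, the hang can be reduced to a toy Java model. This is only a sketch under stated assumptions: `ToyServerManager`, `Decision`, and both `...RemoteCallFailed` methods are hypothetical names made up for illustration, not HBase classes, and the "guarded" variant is not the fix that was eventually adopted for this issue. It only shows why the caller has to notice when the expire request is rejected instead of suspending unconditionally.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model only: none of these types are HBase classes.
class RejectedExpireSketch {

    /** What the unassign step decides after its close RPC fails. */
    enum Decision {
        SUSPEND_AWAITING_WAKEUP, // crash handling was scheduled and is expected to wake us
        EXPIRE_REJECTED          // nothing was scheduled; suspending here could hang forever
    }

    /** Hypothetical stand-in for the master's server bookkeeping. */
    static final class ToyServerManager {
        private final Set<String> beingProcessed = new HashSet<>();
        private boolean shuttingDown;

        void setShuttingDown(boolean shuttingDown) {
            this.shuttingDown = shuttingDown;
        }

        void markBeingProcessed(String server) {
            beingProcessed.add(server);
        }

        /** Returns true only when a new crash-handling run was scheduled for the server. */
        boolean expireServer(String server) {
            if (shuttingDown || beingProcessed.contains(server)) {
                return false; // request ignored, as in the "already in progress" log line
            }
            beingProcessed.add(server);
            return true;
        }
    }

    /** Naive handling: always suspend, whether or not the expire was accepted. */
    static Decision naiveRemoteCallFailed(ToyServerManager sm, String server) {
        sm.expireServer(server); // result dropped on the floor
        return Decision.SUSPEND_AWAITING_WAKEUP;
    }

    /** Guarded handling: suspend only if a wake-up can actually arrive. */
    static Decision guardedRemoteCallFailed(ToyServerManager sm, String server) {
        return sm.expireServer(server)
            ? Decision.SUSPEND_AWAITING_WAKEUP
            : Decision.EXPIRE_REJECTED;
    }

    public static void main(String[] args) {
        String server = "1cfd208ff882,40584,1520249102524";
        ToyServerManager sm = new ToyServerManager();
        sm.markBeingProcessed(server); // plays the role of pid=308 in the log
        sm.setShuttingDown(true);

        System.out.println("naive:   " + naiveRemoteCallFailed(sm, server));
        System.out.println("guarded: " + guardedRemoteCallFailed(sm, server));
    }
}
```

What to do on `EXPIRE_REJECTED` is deliberately left open here. As the cancel note says, simply letting the unassign complete is not safe, because a following assign could online the region before its logs had been split.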
[jira] [Updated] (HBASE-20137) TestRSGroups is flakey
stack updated HBASE-20137:
--------------------------
    Attachment: HBASE-20137.branch-2.003.patch
[jira] [Updated] (HBASE-20137) TestRSGroups is flakey
stack updated HBASE-20137:
--------------------------
    Attachment: HBASE-20137.branch-2.003.patch
[jira] [Updated] (HBASE-20137) TestRSGroups is flakey
stack updated HBASE-20137:
--------------------------
    Attachment: HBASE-20137.branch-2.002.patch
[jira] [Updated] (HBASE-20137) TestRSGroups is flakey
stack updated HBASE-20137:
--------------------------
    Status: Patch Available  (was: Open)

Trying patch against precommit while I try to figure out a test (the circumstance is a bit tough to conjure, what w/ an rpc to a server concurrently undergoing a server crash procedure...)
[jira] [Updated] (HBASE-20137) TestRSGroups is flakey
stack updated HBASE-20137:
--------------------------
    Fix Version/s: 2.0.0
[jira] [Updated] (HBASE-20137) TestRSGroups is flakey
stack updated HBASE-20137:
--------------------------
    Attachment: HBASE-20137.branch-2.001.patch