[jira] [Commented] (HBASE-20976) SCP can be scheduled multiple times for the same RS
[ https://issues.apache.org/jira/browse/HBASE-20976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580372#comment-16580372 ] Josh Elser commented on HBASE-20976:

Coming in late...

{quote}the worst case is that there is a race condition so we still schedule redundant SCPs, still better than now I think{quote}

{quote}Yes. Could age them out instead... i.e. a deadserver needs to stick around for an hour at least?{quote}

What's the downside here: we run an SCP for an RS that was already processed, or something worse? As long as SCP is idempotent, we'd just want to reduce the likelihood that we do things multiple times (maybe I'm incorrectly assuming that SCP is idempotent ;))

> SCP can be scheduled multiple times for the same RS
> ---
>
> Key: HBASE-20976
> URL: https://issues.apache.org/jira/browse/HBASE-20976
> Project: HBase
> Issue Type: Sub-task
> Affects Versions: 2.1.0, 2.0.1
> Reporter: Allan Yang
> Assignee: Allan Yang
> Priority: Major
> Attachments: HBASE-20976.branch-2.0.001.patch, HBASE-20976.branch-2.0.002.patch
>
>
> SCP can be scheduled multiple times for the same RS:
> 1. An RS crashed, and an SCP was submitted for it.
> 2. Before this SCP finished, the Master crashed.
> 3. The new master scans the meta table and finds some regions still open on a dead server.
> 4. The new master submits an SCP for the dead server again.
> Without HBASE-20846, the two SCPs for the same RS can even execute concurrently…
> A test case to reproduce this issue and a fix are provided in the patch.
> Another case in which an SCP might be scheduled multiple times for the same RS (with HBASE-20708):
> 1. An RS crashed, and an SCP was submitted for it.
> 2. A new RS started on the same host, and the old RS's ServerName was removed from DeadServer.deadServers.
> 3. After the SCP passed the Handle_RIT state, an UnassignProcedure needed to send a close region operation to the crashed RS.
> 4. The UnassignProcedure's dispatch failed with 'NoServerDispatchException'.
> 5. Expiration of the RS began, but the RS was found to be neither online nor in the deadServers list, so an SCP was submitted for the same RS again.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
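The second scenario can be reduced to a toy model of the master's bookkeeping. This is a simplified sketch, not HBase code: `ServerId`, `ToyMaster`, and their methods are hypothetical names. The model assumes, consistent with the "came back up, removed it from the dead servers list" master log in Allan's later comment, that a new instance reporting in on the same host and port clears older dead entries for that host:port. Under that assumption, a second expiration of the crashed instance finds it neither online nor dead and schedules another SCP.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

/** Hypothetical stand-in for a region server identity: host, port and start code. */
final class ServerId {
  final String host;
  final int port;
  final long startCode;

  ServerId(String host, int port, long startCode) {
    this.host = host;
    this.port = port;
    this.startCode = startCode;
  }

  boolean sameHostAndPort(ServerId other) {
    return host.equals(other.host) && port == other.port;
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof ServerId)) {
      return false;
    }
    ServerId s = (ServerId) o;
    return host.equals(s.host) && port == s.port && startCode == s.startCode;
  }

  @Override
  public int hashCode() {
    return Objects.hash(host, port, startCode);
  }

  @Override
  public String toString() {
    return host + "," + port + "," + startCode;
  }
}

/** Toy model of the master-side bookkeeping in the second scenario; not HBase code. */
class ToyMaster {
  final Set<ServerId> online = new HashSet<>();
  final Set<ServerId> deadServers = new HashSet<>();
  final List<ServerId> scheduledScps = new ArrayList<>();

  /** Step 2: a new instance on the same host:port reports in; older dead entries for it are dropped. */
  void serverCameBackUp(ServerId fresh) {
    deadServers.removeIf(dead -> dead.sameHostAndPort(fresh));
    online.add(fresh);
  }

  /** Steps 1 and 5: expiration schedules an SCP unless the server is already in deadServers. */
  void expireServer(ServerId sn) {
    if (deadServers.contains(sn)) {
      return; // assumed to be handled by an earlier SCP
    }
    online.remove(sn);
    deadServers.add(sn);
    scheduledScps.add(sn);
  }

  public static void main(String[] args) {
    ToyMaster master = new ToyMaster();
    ServerId crashed = new ServerId("rs1", 16020, 1L);
    master.online.add(crashed);
    master.expireServer(crashed);                            // step 1: first SCP
    master.serverCameBackUp(new ServerId("rs1", 16020, 2L)); // step 2: restart on the same host
    master.expireServer(crashed);                            // step 5: dispatch failed, expire again
    System.out.println("SCPs scheduled: " + master.scheduledScps); // two entries for rs1,16020,1
  }
}
```

Without the restart in step 2, the deadServers check alone suppresses the repeat; the restart is what reopens the window.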
[jira] [Commented] (HBASE-20976) SCP can be scheduled multiple times for the same RS
[ https://issues.apache.org/jira/browse/HBASE-20976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579017#comment-16579017 ] stack commented on HBASE-20976:
---
bq. IIRC, the deadservers are removed so that the master Web UI won't show a dead server there forever...

Yes. Could age them out instead... i.e. a deadserver needs to stick around for an hour at least?
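The aging idea can be sketched as a registry that keeps a dead server's entry for a minimum retention period instead of dropping it as soon as a new instance starts on the same host. This is a hypothetical illustration: `AgingDeadServers`, `markDead`, and `purgeExpired` are invented names, not the API of HBase's `DeadServer` class, and the one-hour retention is only the figure floated in the comment. The clock is injected so the behavior can be exercised without waiting.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;

/**
 * Hypothetical dead-server registry that ages entries out after a minimum retention
 * period instead of removing them when a new instance starts on the same host.
 * Not HBase's DeadServer class; names and semantics are assumptions for illustration.
 */
class AgingDeadServers {
  private final Map<String, Long> deadSince = new HashMap<>(); // server name -> time marked dead
  private final long minRetentionMillis;
  private final LongSupplier clock;

  AgingDeadServers(long minRetentionMillis, LongSupplier clock) {
    this.minRetentionMillis = minRetentionMillis;
    this.clock = clock;
  }

  /** Records a dead server; returns false if it was already recorded, i.e. no new SCP is needed. */
  synchronized boolean markDead(String serverName) {
    return deadSince.putIfAbsent(serverName, clock.getAsLong()) == null;
  }

  synchronized boolean isDead(String serverName) {
    return deadSince.containsKey(serverName);
  }

  /** Removes entries that have been dead for at least the retention period; returns how many. */
  synchronized int purgeExpired() {
    long now = clock.getAsLong();
    int before = deadSince.size();
    deadSince.values().removeIf(since -> now - since >= minRetentionMillis);
    return before - deadSince.size();
  }

  public static void main(String[] args) {
    long[] now = {0L};
    AgingDeadServers dead = new AgingDeadServers(60 * 60 * 1000L, () -> now[0]); // one hour
    String sn = "rs1,16020,1533725024975";
    System.out.println(dead.markDead(sn)); // true: the first expiration would schedule an SCP
    System.out.println(dead.markDead(sn)); // false: a repeat expiration is recognized
    now[0] = 60 * 60 * 1000L;
    System.out.println(dead.purgeExpired() + " " + dead.isDead(sn)); // "1 false" after an hour
  }
}
```

The trade-off is that the Web UI keeps listing the server for up to the retention period, which is the cost of keeping the entry long enough to catch late expirations.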
[jira] [Commented] (HBASE-20976) SCP can be scheduled multiple times for the same RS
[ https://issues.apache.org/jira/browse/HBASE-20976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16576210#comment-16576210 ] Allan Yang commented on HBASE-20976:

{quote}
I think we'd better do it a bit more cleanly, without adding too many checks... I think here we need to make sure that the deadServers check can work and prevent scheduling redundant SCPs. The reason we can do the SCP check when restarting is that we have not started the PE yet, so it is safe; but during execution this is not a good idea, as there is no fencing...
{quote}

Yes, there is no fence here... But the worst case is that there is a race condition so we still schedule redundant SCPs, which is still better than now, I think. Making the deadServers check work is indeed a better idea, but I can't think of a better way to do it for now. IIRC, the deadservers are removed so that the master Web UI won't show a dead server there forever...
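The race conceded here is a check-then-act race: two callers can both observe "not dead" before either records the server. Within a single master process, one way to close that window is to make the check and the registration a single atomic step, as in the hypothetical sketch below (`ScpGate`, `tryClaim`, and `race` are invented names, not HBase APIs). It does nothing for the master-restart case, where, as Duo notes, there is no fencing and the state would have to survive failover.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical gate that lets exactly one caller schedule an SCP per server name. */
class ScpGate {
  private final Set<String> claimed = ConcurrentHashMap.newKeySet();

  /** Check and record in one atomic step; only the first caller for a name gets true. */
  boolean tryClaim(String serverName) {
    return claimed.add(serverName);
  }

  /** Starts several threads that all try to expire the same server; returns how many won. */
  static int race(int threads, String serverName) {
    ScpGate gate = new ScpGate();
    AtomicInteger scheduled = new AtomicInteger();
    CountDownLatch start = new CountDownLatch(1);
    CountDownLatch done = new CountDownLatch(threads);
    for (int i = 0; i < threads; i++) {
      new Thread(() -> {
        try {
          start.await();
          if (gate.tryClaim(serverName)) {
            scheduled.incrementAndGet(); // stand-in for submitting an SCP
          }
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        } finally {
          done.countDown();
        }
      }).start();
    }
    start.countDown(); // release all threads at once
    try {
      done.await();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return scheduled.get();
  }

  public static void main(String[] args) {
    System.out.println(race(8, "rs1,16020,1533725024975")); // prints 1
  }
}
```

`Set.add` on a concurrent set is used because its return value reports whether the element was newly added, which folds the "is it already dead?" check into the write itself.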
[jira] [Commented] (HBASE-20976) SCP can be scheduled multiple times for the same RS
[ https://issues.apache.org/jira/browse/HBASE-20976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16576186#comment-16576186 ] Hadoop QA commented on HBASE-20976:
---
| (/) *+1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 18s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | hbaseanti | 0m 0s | Patch does not have any anti-patterns. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| -0 | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
|| || || || branch-2.0 Compile Tests ||
| +1 | mvninstall | 3m 29s | branch-2.0 passed |
| +1 | compile | 1m 41s | branch-2.0 passed |
| +1 | checkstyle | 1m 0s | branch-2.0 passed |
| +1 | shadedjars | 3m 41s | branch has no errors when building our shaded downstream artifacts. |
| +1 | findbugs | 2m 0s | branch-2.0 passed |
| +1 | javadoc | 0m 26s | branch-2.0 passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 3m 22s | the patch passed |
| +1 | compile | 1m 42s | the patch passed |
| +1 | javac | 1m 42s | the patch passed |
| +1 | checkstyle | 1m 1s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedjars | 3m 46s | patch has no errors when building our shaded downstream artifacts. |
| +1 | hadoopcheck | 10m 50s | Patch does not cause any errors with Hadoop 2.6.5 2.7.4 or 3.0.0. |
| +1 | findbugs | 2m 5s | the patch passed |
| +1 | javadoc | 0m 26s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 187m 11s | hbase-server in the patch passed. |
| +1 | asflicense | 0m 22s | The patch does not generate ASF License warnings. |
| | | 223m 45s | |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:6f01af0 |
| JIRA Issue | HBASE-20976 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12935097/HBASE-20976.branch-2.0.002.patch |
| Optional Tests | asflicense javac javadoc unit findbugs shadedjars hadoopcheck hbaseanti checkstyle compile |
| uname | Linux 8949f767b952 4.4.0-130-generic #156-Ubuntu SMP Thu Jun 14 08:53:28 UTC 2018 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build@2/component/dev-support/hbase-personality.sh |
| git revision | branch-2.0 / 7ee4aa459c |
| maven | version: Apache Maven 3.5.4 (1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z) |
| Default Java | 1.8.0_171 |
| findbugs | v3.1.0-RC3 |
| Test Results | https://builds.apache.org/job/PreCommit-HBASE-Build/14000/testReport/ |
| Max. process+thread count | 4466 (vs. ulimit of 1) |
| modules | C: hbase-server U: hbase-server |
| Console output |
[jira] [Commented] (HBASE-20976) SCP can be scheduled multiple times for the same RS
[ https://issues.apache.org/jira/browse/HBASE-20976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16576095#comment-16576095 ] Duo Zhang commented on HBASE-20976:
---
I think we'd better do it a bit more cleanly, without adding too many checks... I think here we need to make sure that the deadServers check can work and prevent scheduling redundant SCPs. The reason we can do the SCP check when restarting is that we have not started the PE yet, so it is safe; but during execution this is not a good idea, as there is no fencing...
[jira] [Commented] (HBASE-20976) SCP can be scheduled multiple times for the same RS
[ https://issues.apache.org/jira/browse/HBASE-20976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16576082#comment-16576082 ] Allan Yang commented on HBASE-20976:

{quote}
I think there may still be races? For example, if the previous SCP has already finished and been removed from ProcedureExecutor, and then the UnassignProcedure tries to expire the server...
{quote}

Maybe I deleted some procedure WALs, which caused this. But whatever, I think a double check won't hurt.
[jira] [Commented] (HBASE-20976) SCP can be scheduled multiple times for the same RS
[ https://issues.apache.org/jira/browse/HBASE-20976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16576066#comment-16576066 ] Duo Zhang commented on HBASE-20976:
---
I think there may still be races? For example, if the previous SCP has already finished and been removed from ProcedureExecutor, and then the UnassignProcedure tries to expire the server...
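The race described here can be illustrated with a hypothetical model of an executor that forgets procedures once they finish (`ProcedureRegistry`, `hasActiveScpFor`, `submitScpIfAbsent`, and `finish` are invented names, not ProcedureExecutor's API). A "double check" that only looks at SCPs still present blocks a duplicate while the first SCP runs, but once that SCP completes and is removed, the same check lets a second SCP through.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical executor model that forgets procedures once they finish; not ProcedureExecutor. */
class ProcedureRegistry {
  private final Map<Long, String> activeScps = new HashMap<>(); // procId -> server name
  private final List<String> submitted = new ArrayList<>();     // every SCP ever submitted
  private long nextProcId = 1;

  /** The "double check": it can only see SCPs that are still active. */
  boolean hasActiveScpFor(String serverName) {
    return activeScps.containsValue(serverName);
  }

  /** Submits an SCP unless an active one exists; returns the new proc id, or -1 if skipped. */
  long submitScpIfAbsent(String serverName) {
    if (hasActiveScpFor(serverName)) {
      return -1;
    }
    long procId = nextProcId++;
    activeScps.put(procId, serverName);
    submitted.add(serverName);
    return procId;
  }

  /** Completion removes the SCP, so later checks no longer see it. */
  void finish(long procId) {
    activeScps.remove(procId);
  }

  int submittedCount(String serverName) {
    return Collections.frequency(submitted, serverName);
  }

  public static void main(String[] args) {
    ProcedureRegistry pe = new ProcedureRegistry();
    String sn = "rs1,16020,1533725024975";
    long first = pe.submitScpIfAbsent(sn);
    System.out.println(pe.submitScpIfAbsent(sn)); // -1: blocked while the first SCP is active
    pe.finish(first);
    pe.submitScpIfAbsent(sn);                     // a late expiration after the SCP finished
    System.out.println(pe.submittedCount(sn));    // 2: the check missed the finished SCP
  }
}
```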
[jira] [Commented] (HBASE-20976) SCP can be scheduled multiple times for the same RS
[ https://issues.apache.org/jira/browse/HBASE-20976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575920#comment-16575920 ] Allan Yang commented on HBASE-20976:

[~stack], [~Apache9]. Sorry that I have to reopen this again, but I found another case in which an SCP can be scheduled multiple times, as described in the issue description.

1. The RS is expired, and an SCP is submitted:
{code}
2018-08-09 12:29:55,665 WARN [PEWorker-11] master.ServerManager: Expiration of izbp1azj9xjvk1h9vioyvfz,16020,1533725024975 but server not online
2018-08-09 12:29:55,665 INFO [PEWorker-11] master.ServerManager: Processing expiration of izbp1azj9xjvk1h9vioyvfz,16020,1533725024975 on izbp1azj9xjvk1h9vioyvfz,16000,1533787159573
2018-08-09 12:29:55,815 DEBUG [PEWorker-11] assignment.AssignmentManager: Added=izbp1azj9xjvk1h9vioyvfz,16020,1533725024975 to dead servers, submitted shutdown handler to be executed meta=false
{code}

2. The RS restarts on the same host, and the old ServerName is removed from the dead servers list:
{code}
2018-08-09 12:29:58,034 DEBUG [RpcServer.default.FPBQ.Fifo.handler=157,queue=13,port=16000] master.ServerManager: REPORT: Server izbp1azj9xjvk1h9vioyvfz,16020,1533787086010 came back up, removed it from the dead servers list
{code}

3. Another UnassignProcedure detects the dead server too; since it thinks no one is handling it, an SCP is submitted again:
{code}
2018-08-09 12:29:58,061 WARN [PEWorker-15] assignment.RegionTransitionProcedure: Remote call failed izbp1azj9xjvk1h9vioyvfz,16020,1533725024975; pid=4034, ppid=4012, state=RUNNABLE:REGION_TRANSITION_DISPATCH, hasLock=true; UnassignProcedure table=randowmWrite15, region=e07c5ad01ce7b76b80a92e809fb98e26, server=izbp1azj9xjvk1h9vioyvfz,16020,1533735186645; rit=CLOSING, location=izbp1azj9xjvk1h9vioyvfz,16020,1533725024975; exception=NoServerDispatchException
org.apache.hadoop.hbase.procedure2.NoServerDispatchException: izbp1azj9xjvk1h9vioyvfz,16020,1533725024975; pid=4034, ppid=4012, state=RUNNABLE:REGION_TRANSITION_DISPATCH, hasLock=true; UnassignProcedure table=randowmWrite15, region=e07c5ad01ce7b76b80a92e809fb98e26, server=izbp1azj9xjvk1h9vioyvfz,16020,1533735186645
    at org.apache.hadoop.hbase.procedure2.RemoteProcedureDispatcher.addOperationToNode(RemoteProcedureDispatcher.java:177)
    at org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.addToRemoteDispatcher(RegionTransitionProcedure.java:263)
    at org.apache.hadoop.hbase.master.assignment.UnassignProcedure.updateTransition(UnassignProcedure.java:207)
    at org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:349)
    at org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:101)
    at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:873)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1498)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1278)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$900(ProcedureExecutor.java:76)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1785)
2018-08-09 12:29:58,061 WARN [PEWorker-15] assignment.UnassignProcedure: Expiring izbp1azj9xjvk1h9vioyvfz,16020,1533725024975, pid=4034, ppid=4012, state=RUNNABLE:REGION_TRANSITION_DISPATCH, hasLock=true; UnassignProcedure table=randowmWrite15, region=e07c5ad01ce7b76b80a92e809fb98e26, server=izbp1azj9xjvk1h9vioyvfz,16020,1533735186645 rit=CLOSING, location=izbp1azj9xjvk1h9vioyvfz,16020,1533725024975; exception=NoServerDispatchException
2018-08-09 12:29:58,061 WARN [PEWorker-15] master.ServerManager: Expiration of izbp1azj9xjvk1h9vioyvfz,16020,1533725024975 but server not online
2018-08-09 12:29:58,061 INFO [PEWorker-15] master.ServerManager: Processing expiration of izbp1azj9xjvk1h9vioyvfz,16020,1533725024975 on izbp1azj9xjvk1h9vioyvfz,16000,1533787159573
2018-08-09 12:29:58,540 DEBUG [PEWorker-15] assignment.AssignmentManager: Added=izbp1azj9xjvk1h9vioyvfz,16020,1533725024975 to dead servers, submitted shutdown handler to be executed meta=false
{code}
[jira] [Commented] (HBASE-20976) SCP can be scheduled multiple times for the same RS
[ https://issues.apache.org/jira/browse/HBASE-20976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16572619#comment-16572619 ] Allan Yang commented on HBASE-20976:

[~stack], sure, let's resolve it. Thanks for backporting HBASE-20708.
[jira] [Commented] (HBASE-20976) SCP can be scheduled multiple times for the same RS
[ https://issues.apache.org/jira/browse/HBASE-20976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16569191#comment-16569191 ] stack commented on HBASE-20976:
---
[~allan163] I backported HBASE-20708 yesterday. Should we close this now?

Over the last few days I've been backporting the main AMv2 changes from branch-2.1 to branch-2.0, after first trying the patches on a cluster. The patches are big, but I see them as bug fixes, critical ones. AMv2 has to work well when folks go to use hbase2, even for those who try hbase-2.0.x first. It is taking a while as I test before committing. I'm almost caught up.
[jira] [Commented] (HBASE-20976) SCP can be scheduled multiple times for the same RS
[ https://issues.apache.org/jira/browse/HBASE-20976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16564653#comment-16564653 ] Allan Yang commented on HBASE-20976:

{quote}
Agree with the reopen. My fault that HBASE-20708 was marked with 2.0.2. I should have opened a new issue for the backport instead (see HBASE-20987). To be clear, HBASE-20987 is not in branch-2.0. It seems too big a change for branch-2.0, but if we are seeing issues like this one, then we should consider it.
{quote}

[~stack], we can make a decision here: if HBASE-20708 is too big to be back-ported, we can fix the issue here with a smaller patch.
[jira] [Commented] (HBASE-20976) SCP can be scheduled multiple times for the same RS
[ https://issues.apache.org/jira/browse/HBASE-20976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16563621#comment-16563621 ] stack commented on HBASE-20976:
---
Agree with the reopen. My fault that HBASE-20708 was marked with 2.0.2. I should have opened a new issue for the backport instead (see HBASE-20987). To be clear, HBASE-20987 is not in branch-2.0. It seems too big a change for branch-2.0, but if we are seeing issues like this one, then we should consider it.
[jira] [Commented] (HBASE-20976) SCP can be scheduled multiple times for the same RS
[ https://issues.apache.org/jira/browse/HBASE-20976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16563176#comment-16563176 ] Duo Zhang commented on HBASE-20976:
---
But we set 2.0.2 as the fix version for HBASE-20708? [~stack]
[jira] [Commented] (HBASE-20976) SCP can be scheduled multiple times for the same RS
[ https://issues.apache.org/jira/browse/HBASE-20976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16563120#comment-16563120 ] Allan Yang commented on HBASE-20976:

Sorry, HBASE-20708 has not been back-ported, so the issue still exists.
[jira] [Commented] (HBASE-20976) SCP can be scheduled multiple times for the same RS
[ https://issues.apache.org/jira/browse/HBASE-20976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16563099#comment-16563099 ] Allan Yang commented on HBASE-20976:

Seems like HBASE-20708 has been back-ported to branch-2.0. Resolving this one as fixed.
[jira] [Commented] (HBASE-20976) SCP can be scheduled multiple times for the same RS
[ https://issues.apache.org/jira/browse/HBASE-20976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16561999#comment-16561999 ] Ted Yu commented on HBASE-20976:

+1
[jira] [Commented] (HBASE-20976) SCP can be scheduled multiple times for the same RS
[ https://issues.apache.org/jira/browse/HBASE-20976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16561955#comment-16561955 ]

Hadoop QA commented on HBASE-20976:
---

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 11s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | hbaseanti | 0m 0s | Patch does not have any anti-patterns. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| || || || branch-2.0 Compile Tests ||
| +1 | mvninstall | 3m 46s | branch-2.0 passed |
| +1 | compile | 1m 43s | branch-2.0 passed |
| +1 | checkstyle | 1m 10s | branch-2.0 passed |
| +1 | shadedjars | 4m 9s | branch has no errors when building our shaded downstream artifacts. |
| +1 | findbugs | 2m 0s | branch-2.0 passed |
| +1 | javadoc | 0m 29s | branch-2.0 passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 3m 35s | the patch passed |
| +1 | compile | 1m 41s | the patch passed |
| +1 | javac | 1m 41s | the patch passed |
| -1 | checkstyle | 1m 9s | hbase-server: The patch generated 2 new + 29 unchanged - 0 fixed = 31 total (was 29) |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedjars | 4m 6s | patch has no errors when building our shaded downstream artifacts. |
| +1 | hadoopcheck | 11m 25s | Patch does not cause any errors with Hadoop 2.6.5 2.7.4 or 3.0.0. |
| +1 | findbugs | 2m 36s | the patch passed |
| +1 | javadoc | 0m 30s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 99m 48s | hbase-server in the patch passed. |
| +1 | asflicense | 0m 22s | The patch does not generate ASF License warnings. |
| | | 139m 9s | |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:6f01af0 |
| JIRA Issue | HBASE-20976 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12933579/HBASE-20976.branch-2.0.001.patch |
| Optional Tests | asflicense javac javadoc unit findbugs shadedjars hadoopcheck hbaseanti checkstyle compile |
| uname | Linux b07cad149a8c 3.13.0-143-generic #192-Ubuntu SMP Tue Feb 27 10:45:36 UTC 2018 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh |
| git revision | branch-2.0 / e7eadd61d2 |
| maven | version: Apache Maven 3.5.4 (1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z) |
| Default Java | 1.8.0_171 |
| findbugs | v3.1.0-RC3 |
| checkstyle | https://builds.apache.org/job/PreCommit-HBASE-Build/13850/artifact/patchprocess/diff-checkstyle-hbase-server.txt |
| Test Results | https://builds.apache.org/job/PreCommit-HBASE-Build/13850/testReport/ |
| Max. process+thread count | 4206 (vs. ulimit of 1) |
| modules | C: hbase-server U: hbase-server |
[jira] [Commented] (HBASE-20976) SCP can be scheduled multiple times for the same RS
[ https://issues.apache.org/jira/browse/HBASE-20976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16561837#comment-16561837 ]

Allan Yang commented on HBASE-20976:

Confirmed that this issue exists only in branch-2.0; the other 2.x branches are fixed by HBASE-20708.

> SCP can be scheduled multiple times for the same RS
> ---
>
> Key: HBASE-20976
> URL: https://issues.apache.org/jira/browse/HBASE-20976
> Project: HBase
> Issue Type: Sub-task
> Affects Versions: 2.1.0, 2.0.1
> Reporter: Allan Yang
> Assignee: Allan Yang
> Priority: Major
> Attachments: HBASE-20976.branch-2.0.001.patch
>
>
> SCP can be scheduled multiple times for the same RS:
> 1. An RS crashed, and an SCP was submitted for it.
> 2. Before this SCP finished, the master crashed.
> 3. The new master scans the meta table and finds that some regions are still open
> on a dead server.
> 4. The new master submits an SCP for the dead server again.
> The two SCPs for the same RS can even execute concurrently without
> HBASE-20846…
> The patch provides a test case that reproduces this issue, along with a fix.