[jira] [Commented] (HBASE-16270) Handle duplicate clearing of snapshot in region replicas
[ https://issues.apache.org/jira/browse/HBASE-16270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15446655#comment-15446655 ]

Enis Soztutar commented on HBASE-16270:
---------------------------------------

I've committed Robert's patch. Let me assign to him.

> Handle duplicate clearing of snapshot in region replicas
> --------------------------------------------------------
>
>                 Key: HBASE-16270
>                 URL: https://issues.apache.org/jira/browse/HBASE-16270
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 1.1.2
>            Reporter: Robert Yokota
>            Assignee: Robert Yokota
>             Fix For: 2.0.0, 1.3.0, 1.4.0, 1.1.6, 1.2.3
>
>         Attachments: HBASE-16270-branch-1.2.patch, HBASE-16270-master.patch
>
>
> We have an HBase (1.1.2) production cluster with 58 region servers and a
> staging cluster with 6 region servers.
> For both clusters, we enabled region replicas with the following settings:
> hbase.regionserver.storefile.refresh.period = 0
> hbase.region.replica.replication.enabled = true
> hbase.region.replica.replication.memstore.enabled = true
> hbase.master.hfilecleaner.ttl = 360
> hbase.master.loadbalancer.class = org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer
> hbase.meta.replica.count = 3
> hbase.regionserver.meta.storefile.refresh.period = 3
> hbase.region.replica.wait.for.primary.flush = true
> hbase.region.replica.storefile.refresh.memstore.multiplier = 4
> We then altered our HBase tables to have REGION_REPLICATION => 2.
> Both clusters got into a state where a few region servers were spewing the
> error below in the HBase logs. In one instance the error occurred over 70K
> times. At those times, the affected region servers would see 10x write
> traffic (the hadoop.HBase.RegionServer.Server.writeRequestCount metric) and
> in some instances would crash.
> At the same time, we would see secondary regions move and then go into the
> "reads are disabled" state for extended periods.
> It appears there is a race condition where the DefaultMemStore::clearSnapshot
> method might be called more than once in succession. The first call sets
> snapshotId to -1, but the second call throws an exception. It seems the
> second call should just return if the snapshotId is already -1, rather than
> throwing an exception.
>
> Thu Jul 21 08:38:50 UTC 2016,
> RpcRetryingCaller{globalStartTime=1469090201543, pause=100, retries=35},
> org.apache.hadoop.hbase.regionserver.UnexpectedStateException:
> org.apache.hadoop.hbase.regionserver.UnexpectedStateException: Current snapshot id is -1,passed 1469085004304
>         at org.apache.hadoop.hbase.regionserver.DefaultMemStore.clearSnapshot(DefaultMemStore.java:187)
>         at org.apache.hadoop.hbase.regionserver.HStore.updateStorefiles(HStore.java:1054)
>         at org.apache.hadoop.hbase.regionserver.HStore.access$500(HStore.java:128)
>         at org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.replayFlush(HStore.java:2270)
>         at org.apache.hadoop.hbase.regionserver.HRegion.replayFlushInStores(HRegion.java:4487)
>         at org.apache.hadoop.hbase.regionserver.HRegion.replayWALFlushCommitMarker(HRegion.java:4388)
>         at org.apache.hadoop.hbase.regionserver.HRegion.replayWALFlushMarker(HRegion.java:4191)
>         at org.apache.hadoop.hbase.regionserver.RSRpcServices.doReplayBatchOp(RSRpcServices.java:776)
>         at org.apache.hadoop.hbase.regionserver.RSRpcServices.replay(RSRpcServices.java:1655)
>         at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:22255)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2114)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:101)
>         at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
>         at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
>         at java.lang.Thread.run(Thread.java:745)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
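The idempotent-clear behavior proposed in the description above can be sketched as follows. This is a simplified stand-alone model of the snapshot-id handshake, not the actual HBASE-16270 patch: the class name `MemStoreSketch` and its methods are hypothetical, and the real DefaultMemStore also swaps cell-set references and updates memstore sizes under its lock.

```java
// Hedged sketch: a duplicate clearSnapshot (e.g. from a replayed flush-commit
// marker on a secondary replica) becomes a harmless no-op instead of an
// UnexpectedStateException. Names here are illustrative, not HBase's.
public class MemStoreSketch {
    static final long NO_SNAPSHOT_ID = -1;
    private long snapshotId = NO_SNAPSHOT_ID;

    // A flush takes a snapshot of the active memstore under a given id.
    synchronized void snapshot(long id) {
        snapshotId = id;
    }

    // Clears the snapshot once the flushed files are committed.
    synchronized void clearSnapshot(long id) {
        if (snapshotId == NO_SNAPSHOT_ID) {
            return; // already cleared by an earlier call: ignore the duplicate
        }
        if (snapshotId != id) {
            // A genuinely mismatched id is still an error.
            throw new IllegalStateException(
                "Current snapshot id is " + snapshotId + ", passed " + id);
        }
        snapshotId = NO_SNAPSHOT_ID; // mark the snapshot as cleared
    }

    public static void main(String[] args) {
        MemStoreSketch m = new MemStoreSketch();
        m.snapshot(1469085004304L);
        m.clearSnapshot(1469085004304L); // normal clear
        m.clearSnapshot(1469085004304L); // duplicate clear: no-op, no exception
        System.out.println("duplicate clearSnapshot handled");
    }
}
```

The early return is the entire fix as the reporter describes it: only the duplicate-clear case is swallowed, while a mismatched snapshot id still fails loudly.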
[jira] [Commented] (HBASE-16270) Handle duplicate clearing of snapshot in region replicas
[ https://issues.apache.org/jira/browse/HBASE-16270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15443704#comment-15443704 ]

Nick Dimiduk commented on HBASE-16270:
--------------------------------------

No assignee here?
[jira] [Commented] (HBASE-16270) Handle duplicate clearing of snapshot in region replicas
[ https://issues.apache.org/jira/browse/HBASE-16270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15435933#comment-15435933 ]

Hudson commented on HBASE-16270:
--------------------------------

SUCCESS: Integrated in Jenkins build HBase-1.1-JDK8 #1857 (See [https://builds.apache.org/job/HBase-1.1-JDK8/1857/])
HBASE-16270 Handle duplicate clearing of snapshot in region replicas (enis: rev dca60dbc566c6693269718e469021cdbbfce)
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/DefaultMemStore.java
[jira] [Commented] (HBASE-16270) Handle duplicate clearing of snapshot in region replicas
[ https://issues.apache.org/jira/browse/HBASE-16270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15435413#comment-15435413 ]

Hudson commented on HBASE-16270:
--------------------------------

SUCCESS: Integrated in Jenkins build HBase-1.2-JDK7 #13 (See [https://builds.apache.org/job/HBase-1.2-JDK7/13/])
HBASE-16270 Handle duplicate clearing of snapshot in region replicas (enis: rev fefb8e8513f6eac93cd89cd2a2cf4a8874a33116)
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/DefaultMemStore.java
[jira] [Commented] (HBASE-16270) Handle duplicate clearing of snapshot in region replicas
[ https://issues.apache.org/jira/browse/HBASE-16270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15435372#comment-15435372 ]

Hudson commented on HBASE-16270:
--------------------------------

FAILURE: Integrated in Jenkins build HBase-Trunk_matrix #1474 (See [https://builds.apache.org/job/HBase-Trunk_matrix/1474/])
HBASE-16270 Handle duplicate clearing of snapshot in region replicas (enis: rev dda8f67b2cc9f6ef4ab434beea2a47d461a20a1f)
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/AbstractMemStore.java
[jira] [Commented] (HBASE-16270) Handle duplicate clearing of snapshot in region replicas
[ https://issues.apache.org/jira/browse/HBASE-16270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15435331#comment-15435331 ]

Hudson commented on HBASE-16270:
--------------------------------

SUCCESS: Integrated in Jenkins build HBase-1.2-JDK8 #10 (See [https://builds.apache.org/job/HBase-1.2-JDK8/10/])
HBASE-16270 Handle duplicate clearing of snapshot in region replicas (enis: rev fefb8e8513f6eac93cd89cd2a2cf4a8874a33116)
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/DefaultMemStore.java
[jira] [Commented] (HBASE-16270) Handle duplicate clearing of snapshot in region replicas
[ https://issues.apache.org/jira/browse/HBASE-16270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15435212#comment-15435212 ]

Hudson commented on HBASE-16270:
--------------------------------

FAILURE: Integrated in Jenkins build HBase-1.3 #826 (See [https://builds.apache.org/job/HBase-1.3/826/])
HBASE-16270 Handle duplicate clearing of snapshot in region replicas (enis: rev 25995a2bf71195d0fe697e919f226a21b0d84be1)
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/DefaultMemStore.java
[jira] [Commented] (HBASE-16270) Handle duplicate clearing of snapshot in region replicas
[ https://issues.apache.org/jira/browse/HBASE-16270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15435182#comment-15435182 ]
Hudson commented on HBASE-16270:
FAILURE: Integrated in Jenkins build HBase-1.4 #365 (See [https://builds.apache.org/job/HBase-1.4/365/])
HBASE-16270 Handle duplicate clearing of snapshot in region replicas (enis: rev f0385b4b83bd9725aefcc9c9ec3f08cb57b33afa)
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/DefaultMemStore.java
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
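The fix described in the issue can be illustrated with a minimal sketch. This is a hypothetical stand-in, not the actual HBASE-16270 patch or the real DefaultMemStore class: it only shows the idempotent-guard idea, where a duplicate clearSnapshot call finds snapshotId already reset to -1 and returns quietly instead of throwing.

```java
// Hypothetical sketch of the idempotent-guard fix; class and method names
// are illustrative, not the real HBase DefaultMemStore implementation.
public class SnapshotGuardSketch {
    // -1 means "no snapshot outstanding", mirroring the issue description.
    private long snapshotId = -1;

    /** Record a new snapshot id, as a flush would. */
    public void snapshot(long id) {
        this.snapshotId = id;
    }

    /**
     * Clear the snapshot. Returns true if this call cleared it, false if it
     * was already cleared (e.g. a duplicate replay of a flush-commit marker).
     */
    public synchronized boolean clearSnapshot(long id) {
        if (snapshotId == -1) {
            // Duplicate call: already cleared. The pre-patch behavior was to
            // throw UnexpectedStateException here; the fix is to just return.
            return false;
        }
        if (snapshotId != id) {
            throw new IllegalStateException(
                "Current snapshot id is " + snapshotId + ", passed " + id);
        }
        snapshotId = -1;
        return true;
    }

    public static void main(String[] args) {
        SnapshotGuardSketch m = new SnapshotGuardSketch();
        m.snapshot(1469085004304L);
        boolean first = m.clearSnapshot(1469085004304L);   // clears the snapshot
        boolean second = m.clearSnapshot(1469085004304L);  // duplicate: no-op
        if (!first || second) {
            throw new AssertionError("guard behaved unexpectedly");
        }
        System.out.println("duplicate clearSnapshot handled without exception");
    }
}
```

Under this scheme a racing second replay of the same flush-commit marker becomes a harmless no-op rather than an UnexpectedStateException that forces the RPC caller into its retry loop.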
[jira] [Commented] (HBASE-16270) Handle duplicate clearing of snapshot in region replicas
[ https://issues.apache.org/jira/browse/HBASE-16270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15435057#comment-15435057 ]
Hudson commented on HBASE-16270:
FAILURE: Integrated in Jenkins build HBase-1.3-IT #800 (See [https://builds.apache.org/job/HBase-1.3-IT/800/])
HBASE-16270 Handle duplicate clearing of snapshot in region replicas (enis: rev 25995a2bf71195d0fe697e919f226a21b0d84be1)
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/DefaultMemStore.java
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-16270) Handle duplicate clearing of snapshot in region replicas
[ https://issues.apache.org/jira/browse/HBASE-16270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15435039#comment-15435039 ]
Hudson commented on HBASE-16270:
FAILURE: Integrated in Jenkins build HBase-1.1-JDK7 #1772 (See [https://builds.apache.org/job/HBase-1.1-JDK7/1772/])
HBASE-16270 Handle duplicate clearing of snapshot in region replicas (enis: rev dca60dbc566c6693269718e469021cdbbfce)
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/DefaultMemStore.java
-- This message was sent by Atlassian JIRA (v6.3.4#6332)