[jira] [Commented] (HBASE-16270) Handle duplicate clearing of snapshot in region replicas

2016-08-29 Thread Enis Soztutar (JIRA)

[ https://issues.apache.org/jira/browse/HBASE-16270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15446655#comment-15446655 ]

Enis Soztutar commented on HBASE-16270:
---

I've committed Robert's patch. Let me assign it to him.

> Handle duplicate clearing of snapshot in region replicas
> 
>
> Key: HBASE-16270
> URL: https://issues.apache.org/jira/browse/HBASE-16270
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 1.1.2
>Reporter: Robert Yokota
>Assignee: Robert Yokota
> Fix For: 2.0.0, 1.3.0, 1.4.0, 1.1.6, 1.2.3
>
> Attachments: HBASE-16270-branch-1.2.patch, HBASE-16270-master.patch
>
>
> We have an HBase (1.1.2) production cluster with 58 region servers and a 
> staging cluster with 6 region servers.
> For both clusters, we enabled region replicas with the following settings:
> hbase.regionserver.storefile.refresh.period = 0
> hbase.region.replica.replication.enabled = true
> hbase.region.replica.replication.memstore.enabled = true
> hbase.master.hfilecleaner.ttl = 360
> hbase.master.loadbalancer.class = 
> org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer
> hbase.meta.replica.count = 3
> hbase.regionserver.meta.storefile.refresh.period = 3
> hbase.region.replica.wait.for.primary.flush = true
> hbase.region.replica.storefile.refresh.memstore.multiplier = 4
> We then altered our HBase tables to have REGION_REPLICATION => 2
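> (For reference: REGION_REPLICATION is a per-table setting changed from the
> HBase shell; with a hypothetical table name, the alter looks like this.)
> {code}
> alter 'my_table', {REGION_REPLICATION => 2}
> {code}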
> Both clusters got into a state where a few region servers were repeatedly
> logging the error shown below. In one instance the error occurred
> over 70K times. While in this state, these region servers would see 10x write
> traffic (the hadoop.HBase.RegionServer.Server.writeRequestCount metric) and
> in some instances would crash.
> At the same time, we would see secondary regions move and then go into the 
> "reads are disabled" state for extended periods.  
> It appears there is a race condition where the DefaultMemStore::clearSnapshot 
> method might be called more than once in succession. The first call would set 
> snapshotId to -1, but the second call would throw an exception.  It seems the 
> second call should just return if the snapshotId is already -1, rather than 
> throwing an exception.
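> A minimal sketch of that guard (illustrative only, not the committed patch;
> the field and exception names are assumed from DefaultMemStore):
> {code:java}
> void clearSnapshot(long id) throws UnexpectedStateException {
>   if (this.snapshotId == -1) {
>     return; // a previous (duplicate) call already cleared the snapshot
>   }
>   if (this.snapshotId != id) {
>     throw new UnexpectedStateException(
>         "Current snapshot id is " + this.snapshotId + ",passed " + id);
>   }
>   // ... existing clear logic: swap out the snapshot section, then reset ...
>   this.snapshotId = -1;
> }
> {code}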
> Thu Jul 21 08:38:50 UTC 2016, 
> RpcRetryingCaller{globalStartTime=1469090201543, pause=100, retries=35}, 
> org.apache.hadoop.hbase.regionserver.UnexpectedStateException: 
> org.apache.hadoop.hbase.regionserver.UnexpectedStateException: Current 
> snapshot id is -1,passed 1469085004304
> at 
> org.apache.hadoop.hbase.regionserver.DefaultMemStore.clearSnapshot(DefaultMemStore.java:187)
> at 
> org.apache.hadoop.hbase.regionserver.HStore.updateStorefiles(HStore.java:1054)
> at 
> org.apache.hadoop.hbase.regionserver.HStore.access$500(HStore.java:128)
> at 
> org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.replayFlush(HStore.java:2270)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion.replayFlushInStores(HRegion.java:4487)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion.replayWALFlushCommitMarker(HRegion.java:4388)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion.replayWALFlushMarker(HRegion.java:4191)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.doReplayBatchOp(RSRpcServices.java:776)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.replay(RSRpcServices.java:1655)
> at 
> org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:22255)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2114)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:101)
> at 
> org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
> at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
> at java.lang.Thread.run(Thread.java:745)





[jira] [Commented] (HBASE-16270) Handle duplicate clearing of snapshot in region replicas

2016-08-28 Thread Nick Dimiduk (JIRA)

[ https://issues.apache.org/jira/browse/HBASE-16270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15443704#comment-15443704 ]

Nick Dimiduk commented on HBASE-16270:
--

No assignee here?



[jira] [Commented] (HBASE-16270) Handle duplicate clearing of snapshot in region replicas

2016-08-24 Thread Hudson (JIRA)

[ https://issues.apache.org/jira/browse/HBASE-16270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15435933#comment-15435933 ]

Hudson commented on HBASE-16270:


SUCCESS: Integrated in Jenkins build HBase-1.1-JDK8 #1857 (See [https://builds.apache.org/job/HBase-1.1-JDK8/1857/])
HBASE-16270 Handle duplicate clearing of snapshot in region replicas (enis: rev dca60dbc566c6693269718e469021cdbbfce)
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/DefaultMemStore.java




[jira] [Commented] (HBASE-16270) Handle duplicate clearing of snapshot in region replicas

2016-08-24 Thread Hudson (JIRA)

[ https://issues.apache.org/jira/browse/HBASE-16270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15435413#comment-15435413 ]

Hudson commented on HBASE-16270:


SUCCESS: Integrated in Jenkins build HBase-1.2-JDK7 #13 (See [https://builds.apache.org/job/HBase-1.2-JDK7/13/])
HBASE-16270 Handle duplicate clearing of snapshot in region replicas (enis: rev fefb8e8513f6eac93cd89cd2a2cf4a8874a33116)
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/DefaultMemStore.java




[jira] [Commented] (HBASE-16270) Handle duplicate clearing of snapshot in region replicas

2016-08-24 Thread Hudson (JIRA)

[ https://issues.apache.org/jira/browse/HBASE-16270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15435372#comment-15435372 ]

Hudson commented on HBASE-16270:


FAILURE: Integrated in Jenkins build HBase-Trunk_matrix #1474 (See [https://builds.apache.org/job/HBase-Trunk_matrix/1474/])
HBASE-16270 Handle duplicate clearing of snapshot in region replicas (enis: rev dda8f67b2cc9f6ef4ab434beea2a47d461a20a1f)
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/AbstractMemStore.java




[jira] [Commented] (HBASE-16270) Handle duplicate clearing of snapshot in region replicas

2016-08-24 Thread Hudson (JIRA)

[ https://issues.apache.org/jira/browse/HBASE-16270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15435331#comment-15435331 ]

Hudson commented on HBASE-16270:


SUCCESS: Integrated in Jenkins build HBase-1.2-JDK8 #10 (See [https://builds.apache.org/job/HBase-1.2-JDK8/10/])
HBASE-16270 Handle duplicate clearing of snapshot in region replicas (enis: rev fefb8e8513f6eac93cd89cd2a2cf4a8874a33116)
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/DefaultMemStore.java




[jira] [Commented] (HBASE-16270) Handle duplicate clearing of snapshot in region replicas

2016-08-24 Thread Hudson (JIRA)

[ https://issues.apache.org/jira/browse/HBASE-16270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15435212#comment-15435212 ]

Hudson commented on HBASE-16270:


FAILURE: Integrated in Jenkins build HBase-1.3 #826 (See [https://builds.apache.org/job/HBase-1.3/826/])
HBASE-16270 Handle duplicate clearing of snapshot in region replicas (enis: rev 25995a2bf71195d0fe697e919f226a21b0d84be1)
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/DefaultMemStore.java




[jira] [Commented] (HBASE-16270) Handle duplicate clearing of snapshot in region replicas

2016-08-24 Thread Hudson (JIRA)

[ https://issues.apache.org/jira/browse/HBASE-16270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15435182#comment-15435182 ]

Hudson commented on HBASE-16270:


FAILURE: Integrated in Jenkins build HBase-1.4 #365 (See [https://builds.apache.org/job/HBase-1.4/365/])
HBASE-16270 Handle duplicate clearing of snapshot in region replicas (enis: rev f0385b4b83bd9725aefcc9c9ec3f08cb57b33afa)
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/DefaultMemStore.java




[jira] [Commented] (HBASE-16270) Handle duplicate clearing of snapshot in region replicas

2016-08-24 Thread Hudson (JIRA)

[ https://issues.apache.org/jira/browse/HBASE-16270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15435057#comment-15435057 ]

Hudson commented on HBASE-16270:


FAILURE: Integrated in Jenkins build HBase-1.3-IT #800 (See [https://builds.apache.org/job/HBase-1.3-IT/800/])
HBASE-16270 Handle duplicate clearing of snapshot in region replicas (enis: rev 25995a2bf71195d0fe697e919f226a21b0d84be1)
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/DefaultMemStore.java




[jira] [Commented] (HBASE-16270) Handle duplicate clearing of snapshot in region replicas

2016-08-24 Thread Hudson (JIRA)

[ https://issues.apache.org/jira/browse/HBASE-16270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15435039#comment-15435039 ]

Hudson commented on HBASE-16270:


FAILURE: Integrated in Jenkins build HBase-1.1-JDK7 #1772 (See [https://builds.apache.org/job/HBase-1.1-JDK7/1772/])
HBASE-16270 Handle duplicate clearing of snapshot in region replicas (enis: rev dca60dbc566c6693269718e469021cdbbfce)
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/DefaultMemStore.java

