[jira] [Updated] (GEODE-10403) Distributed deadlock when stopping gateway sender

2022-09-16 Thread Alberto Gomez (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-10403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alberto Gomez updated GEODE-10403:
--
Fix Version/s: 1.15.1

> Distributed deadlock when stopping gateway sender
> -
>
> Key: GEODE-10403
> URL: https://issues.apache.org/jira/browse/GEODE-10403
> Project: Geode
>  Issue Type: Bug
>  Components: wan
>Affects Versions: 1.12.9, 1.13.8, 1.14.4, 1.15.0
>Reporter: Alberto Gomez
>Assignee: Alberto Gomez
>Priority: Major
>  Labels: needsTriage, pull-request-available
> Fix For: 1.15.1, 1.16.0
>
>
> A distributed deadlock has been found during some tests of a Geode system 
> with WAN replication when stopping the gateway sender while sending a fair 
> amount of operations to the servers.
> The distributed deadlock manifests in the gateway sender stop command hanging 
> forever and by all normal Geode operations from clients (gets, puts,...) not 
> being responded.
> The situation is provoked by the Gateway sender stop command that first takes 
> the lifecycle lock and then, at a given point, tries to retrieve the size of 
> the gateway sender. This operation, that requires communication with the 
> other peers never finishes, probably because the response from one of the 
> peers is never received.
> Another thread is blocked when trying to acquire the lifecycle lock in 
> AbstractGatewaySender.distribute().
> Finally many threads handling Geode operations (get, put...) get blocked in 
> the DistributedCacheOperation._distribute() call waiting for a response from 
> another peer.
> Thread dump section from blocked gateway sender stop command in call to get 
> size of queue:
> {{"ConcurrentParallelGatewaySenderEventProcessor Stopper Thread1" #1316 
> daemon prio=10 os_prio=0 cpu=45.55ms elapsed=4152.76s tid=0x7f92bc1c2000 
> nid=0x2154 waiting on condition  [0x7f9179cd2000]}}
> {{   java.lang.Thread.State: TIMED_WAITING (parking)}}
> {{        at jdk.internal.misc.Unsafe.park(java.base@11.0.11/Native Method)}}
> {{        - parking to wait for  <0x00031ca2be50> (a 
> java.util.concurrent.CountDownLatch$Sync)}}
> {{        at 
> java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.11/LockSupport.java:234)}}
> {{        at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(java.base@11.0.11/AbstractQueuedSynchronizer.java:1079)}}
> {{        at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(java.base@11.0.11/AbstractQueuedSynchronizer.java:1369)}}
> {{        at 
> java.util.concurrent.CountDownLatch.await(java.base@11.0.11/CountDownLatch.java:278)}}
> {{        at 
> org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:72)}}
> {{        at 
> org.apache.geode.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:731)}}
> {{        at 
> org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:802)}}
> {{        at 
> org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:779)}}
> {{        at 
> org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:865)}}
> {{        at 
> org.apache.geode.internal.cache.partitioned.SizeMessage$SizeResponse.waitBucketSizes(SizeMessage.java:344)}}
> {{        at 
> org.apache.geode.internal.cache.PartitionedRegion.getSizeRemotely(PartitionedRegion.java:6758)}}
> {{        at 
> org.apache.geode.internal.cache.PartitionedRegion.entryCount(PartitionedRegion.java:6709)}}
> {{        at 
> org.apache.geode.internal.cache.PartitionedRegion.entryCount(PartitionedRegion.java:6691)}}
> {{        at 
> org.apache.geode.internal.cache.PartitionedRegion.getRegionSize(PartitionedRegion.java:6663)}}
> {{        at 
> org.apache.geode.internal.cache.LocalRegionDataView.entryCount(LocalRegionDataView.java:99)}}
> {{        at 
> org.apache.geode.internal.cache.LocalRegion.entryCount(LocalRegion.java:2078)}}
> {{        at 
> org.apache.geode.internal.cache.LocalRegion.size(LocalRegion.java:8301)}}
> {{        at 
> org.apache.geode.internal.cache.wan.parallel.ParallelGatewaySenderQueue.size(ParallelGatewaySenderQueue.java:1670)}}
> {{        at 
> org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor.closeProcessor(AbstractGatewaySenderEventProcessor.java:1259)}}
> {{        at 
> org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor.stopProcessing(AbstractGatewaySenderEventProcessor.java:1247)}}
> {{        at 
> org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor$SenderStopperCallable.call(AbstractGatewaySenderEventProcessor.java:1399)}}
> {{        at 
> 

[jira] [Updated] (GEODE-10403) Distributed deadlock when stopping gateway sender

2022-09-06 Thread Alberto Gomez (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-10403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alberto Gomez updated GEODE-10403:
--
Fix Version/s: 1.16.0

> Distributed deadlock when stopping gateway sender
> -
>
> Key: GEODE-10403
> URL: https://issues.apache.org/jira/browse/GEODE-10403
> Project: Geode
>  Issue Type: Bug
>  Components: wan
>Affects Versions: 1.12.9, 1.13.8, 1.14.4, 1.15.0
>Reporter: Alberto Gomez
>Assignee: Alberto Gomez
>Priority: Major
>  Labels: needsTriage, pull-request-available
> Fix For: 1.16.0
>
>
> A distributed deadlock has been found during some tests of a Geode system 
> with WAN replication when stopping the gateway sender while sending a fair 
> amount of operations to the servers.
> The distributed deadlock manifests in the gateway sender stop command hanging 
> forever and by all normal Geode operations from clients (gets, puts,...) not 
> being responded.
> The situation is provoked by the Gateway sender stop command that first takes 
> the lifecycle lock and then, at a given point, tries to retrieve the size of 
> the gateway sender. This operation, that requires communication with the 
> other peers never finishes, probably because the response from one of the 
> peers is never received.
> Another thread is blocked when trying to acquire the lifecycle lock in 
> AbstractGatewaySender.distribute().
> Finally many threads handling Geode operations (get, put...) get blocked in 
> the DistributedCacheOperation._distribute() call waiting for a response from 
> another peer.
> Thread dump section from blocked gateway sender stop command in call to get 
> size of queue:
> {{"ConcurrentParallelGatewaySenderEventProcessor Stopper Thread1" #1316 
> daemon prio=10 os_prio=0 cpu=45.55ms elapsed=4152.76s tid=0x7f92bc1c2000 
> nid=0x2154 waiting on condition  [0x7f9179cd2000]}}
> {{   java.lang.Thread.State: TIMED_WAITING (parking)}}
> {{        at jdk.internal.misc.Unsafe.park(java.base@11.0.11/Native Method)}}
> {{        - parking to wait for  <0x00031ca2be50> (a 
> java.util.concurrent.CountDownLatch$Sync)}}
> {{        at 
> java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.11/LockSupport.java:234)}}
> {{        at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(java.base@11.0.11/AbstractQueuedSynchronizer.java:1079)}}
> {{        at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(java.base@11.0.11/AbstractQueuedSynchronizer.java:1369)}}
> {{        at 
> java.util.concurrent.CountDownLatch.await(java.base@11.0.11/CountDownLatch.java:278)}}
> {{        at 
> org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:72)}}
> {{        at 
> org.apache.geode.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:731)}}
> {{        at 
> org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:802)}}
> {{        at 
> org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:779)}}
> {{        at 
> org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:865)}}
> {{        at 
> org.apache.geode.internal.cache.partitioned.SizeMessage$SizeResponse.waitBucketSizes(SizeMessage.java:344)}}
> {{        at 
> org.apache.geode.internal.cache.PartitionedRegion.getSizeRemotely(PartitionedRegion.java:6758)}}
> {{        at 
> org.apache.geode.internal.cache.PartitionedRegion.entryCount(PartitionedRegion.java:6709)}}
> {{        at 
> org.apache.geode.internal.cache.PartitionedRegion.entryCount(PartitionedRegion.java:6691)}}
> {{        at 
> org.apache.geode.internal.cache.PartitionedRegion.getRegionSize(PartitionedRegion.java:6663)}}
> {{        at 
> org.apache.geode.internal.cache.LocalRegionDataView.entryCount(LocalRegionDataView.java:99)}}
> {{        at 
> org.apache.geode.internal.cache.LocalRegion.entryCount(LocalRegion.java:2078)}}
> {{        at 
> org.apache.geode.internal.cache.LocalRegion.size(LocalRegion.java:8301)}}
> {{        at 
> org.apache.geode.internal.cache.wan.parallel.ParallelGatewaySenderQueue.size(ParallelGatewaySenderQueue.java:1670)}}
> {{        at 
> org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor.closeProcessor(AbstractGatewaySenderEventProcessor.java:1259)}}
> {{        at 
> org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor.stopProcessing(AbstractGatewaySenderEventProcessor.java:1247)}}
> {{        at 
> org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor$SenderStopperCallable.call(AbstractGatewaySenderEventProcessor.java:1399)}}
> {{        at 
> 

[jira] [Updated] (GEODE-10403) Distributed deadlock when stopping gateway sender

2022-07-29 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-10403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated GEODE-10403:
---
Labels: needsTriage pull-request-available  (was: needsTriage)

> Distributed deadlock when stopping gateway sender
> -
>
> Key: GEODE-10403
> URL: https://issues.apache.org/jira/browse/GEODE-10403
> Project: Geode
>  Issue Type: Bug
>  Components: wan
>Affects Versions: 1.12.9, 1.13.8, 1.14.4, 1.15.0
>Reporter: Alberto Gomez
>Assignee: Alberto Gomez
>Priority: Major
>  Labels: needsTriage, pull-request-available
>
> A distributed deadlock has been found during some tests of a Geode system 
> with WAN replication when stopping the gateway sender while sending a fair 
> amount of operations to the servers.
> The distributed deadlock manifests in the gateway sender stop command hanging 
> forever and by all normal Geode operations from clients (gets, puts,...) not 
> being responded.
> The situation is provoked by the Gateway sender stop command that first takes 
> the lifecycle lock and then, at a given point, tries to retrieve the size of 
> the gateway sender. This operation, that requires communication with the 
> other peers never finishes, probably because the response from one of the 
> peers is never received.
> Another thread is blocked when trying to acquire the lifecycle lock in 
> AbstractGatewaySender.distribute().
> Finally many threads handling Geode operations (get, put...) get blocked in 
> the DistributedCacheOperation._distribute() call waiting for a response from 
> another peer.
> Thread dump section from blocked gateway sender stop command in call to get 
> size of queue:
> {{"ConcurrentParallelGatewaySenderEventProcessor Stopper Thread1" #1316 
> daemon prio=10 os_prio=0 cpu=45.55ms elapsed=4152.76s tid=0x7f92bc1c2000 
> nid=0x2154 waiting on condition  [0x7f9179cd2000]}}
> {{   java.lang.Thread.State: TIMED_WAITING (parking)}}
> {{        at jdk.internal.misc.Unsafe.park(java.base@11.0.11/Native Method)}}
> {{        - parking to wait for  <0x00031ca2be50> (a 
> java.util.concurrent.CountDownLatch$Sync)}}
> {{        at 
> java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.11/LockSupport.java:234)}}
> {{        at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(java.base@11.0.11/AbstractQueuedSynchronizer.java:1079)}}
> {{        at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(java.base@11.0.11/AbstractQueuedSynchronizer.java:1369)}}
> {{        at 
> java.util.concurrent.CountDownLatch.await(java.base@11.0.11/CountDownLatch.java:278)}}
> {{        at 
> org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:72)}}
> {{        at 
> org.apache.geode.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:731)}}
> {{        at 
> org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:802)}}
> {{        at 
> org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:779)}}
> {{        at 
> org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:865)}}
> {{        at 
> org.apache.geode.internal.cache.partitioned.SizeMessage$SizeResponse.waitBucketSizes(SizeMessage.java:344)}}
> {{        at 
> org.apache.geode.internal.cache.PartitionedRegion.getSizeRemotely(PartitionedRegion.java:6758)}}
> {{        at 
> org.apache.geode.internal.cache.PartitionedRegion.entryCount(PartitionedRegion.java:6709)}}
> {{        at 
> org.apache.geode.internal.cache.PartitionedRegion.entryCount(PartitionedRegion.java:6691)}}
> {{        at 
> org.apache.geode.internal.cache.PartitionedRegion.getRegionSize(PartitionedRegion.java:6663)}}
> {{        at 
> org.apache.geode.internal.cache.LocalRegionDataView.entryCount(LocalRegionDataView.java:99)}}
> {{        at 
> org.apache.geode.internal.cache.LocalRegion.entryCount(LocalRegion.java:2078)}}
> {{        at 
> org.apache.geode.internal.cache.LocalRegion.size(LocalRegion.java:8301)}}
> {{        at 
> org.apache.geode.internal.cache.wan.parallel.ParallelGatewaySenderQueue.size(ParallelGatewaySenderQueue.java:1670)}}
> {{        at 
> org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor.closeProcessor(AbstractGatewaySenderEventProcessor.java:1259)}}
> {{        at 
> org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor.stopProcessing(AbstractGatewaySenderEventProcessor.java:1247)}}
> {{        at 
> org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor$SenderStopperCallable.call(AbstractGatewaySenderEventProcessor.java:1399)}}
> {{        at 
> 

[jira] [Updated] (GEODE-10403) Distributed deadlock when stopping gateway sender

2022-07-29 Thread Alberto Gomez (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-10403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alberto Gomez updated GEODE-10403:
--
Description: 
A distributed deadlock has been found during some tests of a Geode system with 
WAN replication when stopping the gateway sender while sending a fair amount of 
operations to the servers.

The distributed deadlock manifests in the gateway sender stop command hanging 
forever and by all normal Geode operations from clients (gets, puts,...) not 
being responded.
The situation is provoked by the Gateway sender stop command that first takes 
the lifecycle lock and then, at a given point, tries to retrieve the size of 
the gateway sender. This operation, that requires communication with the other 
peers never finishes, probably because the response from one of the peers is 
never received.
Another thread is blocked when trying to acquire the lifecycle lock in 
AbstractGatewaySender.distribute().
Finally many threads handling Geode operations (get, put...) get blocked in the 
DistributedCacheOperation._distribute() call waiting for a response from 
another peer.

Thread dump section from blocked gateway sender stop command in call to get 
size of queue:

{{"ConcurrentParallelGatewaySenderEventProcessor Stopper Thread1" #1316 daemon 
prio=10 os_prio=0 cpu=45.55ms elapsed=4152.76s tid=0x7f92bc1c2000 
nid=0x2154 waiting on condition  [0x7f9179cd2000]}}
{{   java.lang.Thread.State: TIMED_WAITING (parking)}}
{{        at jdk.internal.misc.Unsafe.park(java.base@11.0.11/Native Method)}}
{{        - parking to wait for  <0x00031ca2be50> (a 
java.util.concurrent.CountDownLatch$Sync)}}
{{        at 
java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.11/LockSupport.java:234)}}
{{        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(java.base@11.0.11/AbstractQueuedSynchronizer.java:1079)}}
{{        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(java.base@11.0.11/AbstractQueuedSynchronizer.java:1369)}}
{{        at 
java.util.concurrent.CountDownLatch.await(java.base@11.0.11/CountDownLatch.java:278)}}
{{        at 
org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:72)}}
{{        at 
org.apache.geode.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:731)}}
{{        at 
org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:802)}}
{{        at 
org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:779)}}
{{        at 
org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:865)}}
{{        at 
org.apache.geode.internal.cache.partitioned.SizeMessage$SizeResponse.waitBucketSizes(SizeMessage.java:344)}}
{{        at 
org.apache.geode.internal.cache.PartitionedRegion.getSizeRemotely(PartitionedRegion.java:6758)}}
{{        at 
org.apache.geode.internal.cache.PartitionedRegion.entryCount(PartitionedRegion.java:6709)}}
{{        at 
org.apache.geode.internal.cache.PartitionedRegion.entryCount(PartitionedRegion.java:6691)}}
{{        at 
org.apache.geode.internal.cache.PartitionedRegion.getRegionSize(PartitionedRegion.java:6663)}}
{{        at 
org.apache.geode.internal.cache.LocalRegionDataView.entryCount(LocalRegionDataView.java:99)}}
{{        at 
org.apache.geode.internal.cache.LocalRegion.entryCount(LocalRegion.java:2078)}}
{{        at 
org.apache.geode.internal.cache.LocalRegion.size(LocalRegion.java:8301)}}
{{        at 
org.apache.geode.internal.cache.wan.parallel.ParallelGatewaySenderQueue.size(ParallelGatewaySenderQueue.java:1670)}}
{{        at 
org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor.closeProcessor(AbstractGatewaySenderEventProcessor.java:1259)}}
{{        at 
org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor.stopProcessing(AbstractGatewaySenderEventProcessor.java:1247)}}
{{        at 
org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor$SenderStopperCallable.call(AbstractGatewaySenderEventProcessor.java:1399)}}
{{        at 
org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor$SenderStopperCallable.call(AbstractGatewaySenderEventProcessor.java:1387)}}
{{        at 
java.util.concurrent.FutureTask.run(java.base@11.0.11/FutureTask.java:264)}}
{{        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.11/ThreadPoolExecutor.java:1128)}}
{{        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.11/ThreadPoolExecutor.java:628)}}
{{        at java.lang.Thread.run(java.base@11.0.11/Thread.java:829)}}

 

Thread dump section from blocked call to AbstractGatewaySender.distribute() 
call trying to acquire the lifecycle lock:

{{"P2P message reader for 

[jira] [Updated] (GEODE-10403) Distributed deadlock when stopping gateway sender

2022-07-29 Thread Alberto Gomez (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-10403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alberto Gomez updated GEODE-10403:
--
Description: 
A distributed deadlock has been found during some tests of a Geode system with 
WAN replication when stopping the gateway sender while sending a fair amount of 
operations to the servers.

The distributed deadlock manifests in the gateway sender stop command hanging 
forever and by all normal Geode operations from clients (gets, puts,...) not 
being responded.
The situation is provoked by the Gateway sender stop command that first takes 
the lifecycle lock and then, at a given point, tries to retrieve the size of 
the gateway sender. This operation, that requires communication with the other 
peers never finishes, probably because the response from one of the peers is 
never received.
Another thread is blocked when trying to acquire the lifecycle lock in 
AbstractGatewaySender.distribute().
Finally many threads handling Geode operations (get, put...) get blocked in the 
DistributedCacheOperation._distribute() call waiting for a response from 
another peer.

Thread dump section from blocked gateway sender stop command in call to get 
size of queue:

{{"ConcurrentParallelGatewaySenderEventProcessor Stopper Thread4" #1319 daemon 
prio=10 os_prio=0 cpu=46.95ms elapsed=4152.76s tid=0x7f92bc1bb000 
nid=0x2157 waiting on condition  [0x7f9179bd1000]
   java.lang.Thread.State: TIMED_WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@11.0.11/Native Method)
- parking to wait for  <0x00031ca2cbd8> (a 
java.util.concurrent.CountDownLatch$Sync)
at 
java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.11/LockSupport.java:234)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(java.base@11.0.11/AbstractQueuedSynchronizer.java:1079)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(java.base@11.0.11/AbstractQueuedSynchronizer.java:1369)
at 
java.util.concurrent.CountDownLatch.await(java.base@11.0.11/CountDownLatch.java:278)
at 
org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:72)
at 
org.apache.geode.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:731)
at 
org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:802)
at 
org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:779)
at 
org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:865)
at 
org.apache.geode.internal.cache.partitioned.SizeMessage$SizeResponse.waitBucketSizes(SizeMessage.java:344)
at 
org.apache.geode.internal.cache.PartitionedRegion.getSizeRemotely(PartitionedRegion.java:6758)
at 
org.apache.geode.internal.cache.PartitionedRegion.entryCount(PartitionedRegion.java:6709)
at 
org.apache.geode.internal.cache.PartitionedRegion.entryCount(PartitionedRegion.java:6691)
at 
org.apache.geode.internal.cache.PartitionedRegion.getRegionSize(PartitionedRegion.java:6663)
at 
org.apache.geode.internal.cache.LocalRegionDataView.entryCount(LocalRegionDataView.java:99)
at 
org.apache.geode.internal.cache.LocalRegion.entryCount(LocalRegion.java:2078)
at 
org.apache.geode.internal.cache.LocalRegion.size(LocalRegion.java:8301)
at 
org.apache.geode.internal.cache.wan.parallel.ParallelGatewaySenderQueue.size(ParallelGatewaySenderQueue.java:1670)
at 
org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor.closeProcessor(AbstractGatewaySenderEventProcessor.java:1259)
at 
org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor.stopProcessing(AbstractGatewaySenderEventProcessor.java:1247)
at 
org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor$SenderStopperCallable.call(AbstractGatewaySenderEventProcessor.java:1399)
at 
org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor$SenderStopperCallable.call(AbstractGatewaySenderEventProcessor.java:1387)
at 
java.util.concurrent.FutureTask.run(java.base@11.0.11/FutureTask.java:264)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.11/ThreadPoolExecutor.java:1128)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.11/ThreadPoolExecutor.java:628)
at java.lang.Thread.run(java.base@11.0.11/Thread.java:829)

}}

Thread dump section from blocked call to AbstractGatewaySender.distribute() 
call trying to acquire the lifecycle lock:
{{"P2P message reader for 
192.168.78.164(eric-data-kvdb-ag-server-0:1):41000 shared ordered uid=6 
local port=60360 remote port=57246" #56 daemon prio=10 

[jira] [Updated] (GEODE-10403) Distributed deadlock when stopping gateway sender

2022-07-29 Thread Alberto Gomez (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-10403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alberto Gomez updated GEODE-10403:
--
Description: 
A distributed deadlock has been found during some tests of a Geode system with 
WAN replication when stopping the gateway sender while sending a fair amount of 
operations to the servers.

The distributed deadlock manifests in the gateway sender stop command hanging 
forever and by all normal Geode operations from clients (gets, puts,...) not 
being responded.
The situation is provoked by the Gateway sender stop command that first takes 
the lifecycle lock and then, at a given point, tries to retrieve the size of 
the gateway sender. This operation, that requires communication with the other 
peers never finishes, probably because the response from one of the peers is 
never received.
Another thread is blocked when trying to acquire the lifecycle lock in 
AbstractGatewaySender.distribute().
Finally many threads handling Geode operations (get, put...) get blocked in the 
DistributedCacheOperation._distribute() call waiting for a response from 
another peer.

Thread dump section from blocked gateway sender stop command in call to get 
size of queue:

{{"ConcurrentParallelGatewaySenderEventProcessor Stopper Thread4" #1319 daemon 
prio=10 os_prio=0 cpu=46.95ms elapsed=4152.76s tid=0x7f92bc1bb000 
nid=0x2157 waiting on condition  [0x7f9179bd1000]
   java.lang.Thread.State: TIMED_WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@11.0.11/Native Method)
- parking to wait for  <0x00031ca2cbd8> (a 
java.util.concurrent.CountDownLatch$Sync)
at 
java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.11/LockSupport.java:234)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(java.base@11.0.11/AbstractQueuedSynchronizer.java:1079)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(java.base@11.0.11/AbstractQueuedSynchronizer.java:1369)
at 
java.util.concurrent.CountDownLatch.await(java.base@11.0.11/CountDownLatch.java:278)
at 
org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:72)
at 
org.apache.geode.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:731)
at 
org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:802)
at 
org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:779)
at 
org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:865)
at 
org.apache.geode.internal.cache.partitioned.SizeMessage$SizeResponse.waitBucketSizes(SizeMessage.java:344)
at 
org.apache.geode.internal.cache.PartitionedRegion.getSizeRemotely(PartitionedRegion.java:6758)
at 
org.apache.geode.internal.cache.PartitionedRegion.entryCount(PartitionedRegion.java:6709)
at 
org.apache.geode.internal.cache.PartitionedRegion.entryCount(PartitionedRegion.java:6691)
at 
org.apache.geode.internal.cache.PartitionedRegion.getRegionSize(PartitionedRegion.java:6663)
at 
org.apache.geode.internal.cache.LocalRegionDataView.entryCount(LocalRegionDataView.java:99)
at 
org.apache.geode.internal.cache.LocalRegion.entryCount(LocalRegion.java:2078)
at 
org.apache.geode.internal.cache.LocalRegion.size(LocalRegion.java:8301)
at 
org.apache.geode.internal.cache.wan.parallel.ParallelGatewaySenderQueue.size(ParallelGatewaySenderQueue.java:1670)
at 
org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor.closeProcessor(AbstractGatewaySenderEventProcessor.java:1259)
at 
org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor.stopProcessing(AbstractGatewaySenderEventProcessor.java:1247)
at 
org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor$SenderStopperCallable.call(AbstractGatewaySenderEventProcessor.java:1399)
at 
org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor$SenderStopperCallable.call(AbstractGatewaySenderEventProcessor.java:1387)
at 
java.util.concurrent.FutureTask.run(java.base@11.0.11/FutureTask.java:264)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.11/ThreadPoolExecutor.java:1128)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.11/ThreadPoolExecutor.java:628)
at java.lang.Thread.run(java.base@11.0.11/Thread.java:829)
}}


Thread dump section from blocked call to AbstractGatewaySender.distribute() 
call trying to acquire the lifecycle lock:
{{"P2P message reader for 
192.168.78.164(eric-data-kvdb-ag-server-0:1):41000 shared ordered uid=6 
local port=60360 remote port=57246" #56 daemon prio=10 

[jira] [Updated] (GEODE-10403) Distributed deadlock when stopping gateway sender

2022-07-29 Thread Alexander Murmann (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-10403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Murmann updated GEODE-10403:
--
Labels: needsTriage  (was: )

> Distributed deadlock when stopping gateway sender
> -
>
> Key: GEODE-10403
> URL: https://issues.apache.org/jira/browse/GEODE-10403
> Project: Geode
>  Issue Type: Bug
>  Components: wan
>Affects Versions: 1.12.9, 1.13.8, 1.14.4, 1.15.0
>Reporter: Alberto Gomez
>Priority: Major
>  Labels: needsTriage
>
> A distributed deadlock has been found during some tests of a Geode system 
> with WAN replication when stopping the gateway sender while sending a fair 
> amount of operations to the servers.
> The distributed deadlock manifests in the gateway sender stop command hanging 
> forever and by all normal Geode operations from clients (gets, puts,...) not 
> being responded.
> The situation is provoked by the Gateway sender stop command that first takes 
> the lifecycle lock and then, at a given point, tries to retrieve the size of 
> the gateway sender. This operation, that requires communication with the 
> other peers never finishes, probably because the response from one of the 
> peers is never received.
> Another thread is blocked when trying to acquire the lifecycle lock in 
> AbstractGatewaySender.distribute().
> Finally many threads handling Geode operations (get, put...) get blocked in 
> the DistributedCacheOperation._distribute() call waiting for a response from 
> another peer.
> Thread dump section from blocked gateway sender stop command in call to get 
> size of queue:
> "ConcurrentParallelGatewaySenderEventProcessor Stopper Thread4" #1319 daemon 
> prio=10 os_prio=0 cpu=46.95ms elapsed=4152.76s tid=0x7f92bc1bb000 
> nid=0x2157 waiting on condition  [0x7f9179bd1000]
>java.lang.Thread.State: TIMED_WAITING (parking)
> at jdk.internal.misc.Unsafe.park(java.base@11.0.11/Native Method)
> - parking to wait for  <0x00031ca2cbd8> (a 
> java.util.concurrent.CountDownLatch$Sync)
> at 
> java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.11/LockSupport.java:234)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(java.base@11.0.11/AbstractQueuedSynchronizer.java:1079)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(java.base@11.0.11/AbstractQueuedSynchronizer.java:1369)
> at 
> java.util.concurrent.CountDownLatch.await(java.base@11.0.11/CountDownLatch.java:278)
> at 
> org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:72)
> at 
> org.apache.geode.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:731)
> at 
> org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:802)
> at 
> org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:779)
> at 
> org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:865)
> at 
> org.apache.geode.internal.cache.partitioned.SizeMessage$SizeResponse.waitBucketSizes(SizeMessage.java:344)
> at 
> org.apache.geode.internal.cache.PartitionedRegion.getSizeRemotely(PartitionedRegion.java:6758)
> at 
> org.apache.geode.internal.cache.PartitionedRegion.entryCount(PartitionedRegion.java:6709)
> at 
> org.apache.geode.internal.cache.PartitionedRegion.entryCount(PartitionedRegion.java:6691)
> at 
> org.apache.geode.internal.cache.PartitionedRegion.getRegionSize(PartitionedRegion.java:6663)
> at 
> org.apache.geode.internal.cache.LocalRegionDataView.entryCount(LocalRegionDataView.java:99)
> at 
> org.apache.geode.internal.cache.LocalRegion.entryCount(LocalRegion.java:2078)
> at 
> org.apache.geode.internal.cache.LocalRegion.size(LocalRegion.java:8301)
> at 
> org.apache.geode.internal.cache.wan.parallel.ParallelGatewaySenderQueue.size(ParallelGatewaySenderQueue.java:1670)
> at 
> org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor.closeProcessor(AbstractGatewaySenderEventProcessor.java:1259)
> at 
> org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor.stopProcessing(AbstractGatewaySenderEventProcessor.java:1247)
> at 
> org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor$SenderStopperCallable.call(AbstractGatewaySenderEventProcessor.java:1399)
> at 
> org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor$SenderStopperCallable.call(AbstractGatewaySenderEventProcessor.java:1387)
> at 
>