Hi,

Following up on the discussion we had yesterday in the Apache Geode Community 
meeting around the "Reflections on conserve-sockets setting in Apache Geode" 
topic, I'd like to post here some questions that could not be fully answered 
during the meeting:

The Geode documentation states the following about conserve-sockets and WAN 
deployments in [1]:
"WAN deployments increase the messaging demands on a Geode system. To avoid 
hangs related to WAN messaging, always set `conserve-sockets=false` for Geode 
members that participate in a WAN deployment."

It also states the following about conserve-sockets and transactions in [2]:
"When you have transactions operating on EMPTY, NORMAL or PARTITION regions, 
make sure that conserve-sockets is set to false to avoid distributed deadlocks."
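
For reference, here is a minimal sketch of what both recommendations amount to 
in a member's configuration. It only builds the property set with 
java.util.Properties; the class and method names are hypothetical, and how the 
properties reach the member (a gemfire.properties file, or passing them to 
CacheFactory) is an assumption about the deployment:

```java
import java.util.Properties;

public class ConserveSocketsExample {
    // Hypothetical helper: builds the setting that the documentation quoted
    // above recommends for WAN members and for transactional members.
    static Properties recommendedMemberProperties() {
        Properties props = new Properties();
        // "conserve-sockets" is the Geode property name; "false" gives each
        // thread its own sockets instead of sharing them between threads.
        props.setProperty("conserve-sockets", "false");
        return props;
    }

    public static void main(String[] args) {
        Properties props = recommendedMemberProperties();
        // Assumption: in a real member these properties would be handed to
        // Geode, for example through new CacheFactory(props).create().
        System.out.println("conserve-sockets=" + props.getProperty("conserve-sockets"));
    }
}
```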

Searching the Geode tests, the only test case I have found related to deadlocks 
with conserve-sockets=true is:
https://github.com/apache/geode/blob/41eb49989f25607acfcbf9ac5afe3d4c0721bb35/geode-wan/src/distributedTest/java/org/apache/geode/internal/cache/wan/serial/SerialGatewaySenderDistributedDeadlockDUnitTest.java#L176
According to the comments in the test, it always causes a distributed deadlock 
and is therefore commented out. However, the test case is actually NOT 
commented out and, if you execute it, it passes without any failure or 
deadlock.

And here are the questions:

Could it be that deadlocks with conserve-sockets=true and WAN and/or 
transactions over partitioned regions were a legacy issue that has already 
been fixed?

Otherwise, could someone please provide some more information about why these 
deadlocks could happen? It would be great if there were test cases that 
showcase this possibility.

It looks like a big limitation of Geode that you are forced to set 
conserve-sockets to false (with the implications this has on resource usage) 
when you are using WAN replication and/or transactions on partitioned regions.

Could it be that there are other elements (for example, the use of 
CacheListeners, as Anthony Baker pointed out) that would increase the risk of 
hitting a distributed deadlock?

Thanks in advance,

Alberto

[1]: 
https://geode.apache.org/docs/guide/114/managing/monitor_tune/sockets_and_gateways.html

[2]: 
https://geode.apache.org/docs/guide/114/managing/monitor_tune/performance_controls_controlling_socket_use.html
