[jira] [Assigned] (GEODE-10391) Region Operation During Primary Change in P2P-only Configuration Results in Spurious Entry{NotFound|Exists}Exception
Bill Burcham assigned an issue to Unassigned. Geode / GEODE-10391: Region Operation During Primary Change in P2P-only Configuration Results in Spurious Entry{NotFound|Exists}Exception. Change By: Bill Burcham. Assignee: Bill Burcham. This message was sent by Atlassian Jira (v8.20.10#820010-sha1:ace47f9)
[jira] [Assigned] (GEODE-10391) Region Operation During Primary Change in P2P-only Configuration Results in Spurious Entry{NotFound|Exists}Exception
[ https://issues.apache.org/jira/browse/GEODE-10391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Burcham reassigned GEODE-10391: Assignee: Bill Burcham
[jira] [Updated] (GEODE-10391) Region Operation During Primary Change in P2P-only Configuration Results in Spurious Entry{NotFound|Exists}Exception
[ https://issues.apache.org/jira/browse/GEODE-10391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Burcham updated GEODE-10391: Labels: (was: needsTriage)
[jira] [Created] (GEODE-10391) Region Operation During Primary Change in P2P-only Configuration Results in Spurious Entry{NotFound|Exists}Exception
Bill Burcham created GEODE-10391:

Summary: Region Operation During Primary Change in P2P-only Configuration Results in Spurious Entry{NotFound|Exists}Exception
Key: GEODE-10391
URL: https://issues.apache.org/jira/browse/GEODE-10391
Project: Geode
Issue Type: Bug
Components: regions
Affects Versions: 1.16.0
Reporter: Bill Burcham

When a primary moves while a region operation (e.g. a create) is in flight, i.e. started but not yet acknowledged, the operation is retried automatically until it succeeds or fails.

When a member notices another member has crashed, the surviving member requests (from the remaining members) the data for which the crashed member had been primary (delta-GII/sync). This sync is necessary to regain consistency in case the (retrying) requester fails before it can re-issue the request to the new primary.

In GEODE-5055 we learned that we needed to delay that sync request long enough for the new primary to be chosen and for the original requester to make a new request against the new primary. If we didn't delay the sync, the primary could end up with the entry in the new state (as if the operation had completed) but without the corresponding event tracker data needed to conflate the retried event.

The fix for GEODE-5055 introduced a delay, but only for configurations where clients were present. If only peers were present there was no delay. This ticket pertains to the P2P-only case.
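The conflation mechanism at the heart of this bug can be illustrated with a minimal sketch. This is not Geode's actual EventTracker; the class and method names here are invented for illustration. The idea: each member records the highest event sequence it has applied per event source, so a retried operation that was already applied is recognized and skipped rather than failing with a spurious Entry{NotFound|Exists}Exception. The GEODE-5055 hazard is a member receiving the entry's new state via sync without this bookkeeping.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch (NOT Geode's actual EventTracker) of duplicate-event
// conflation. Each operation carries an event id (source member/thread plus a
// sequence number). A member records the highest sequence applied per source;
// a retry of an already-applied event is recognized and skipped instead of
// being re-applied and failing.
public class EventTrackerSketch {
    private final Map<String, Long> highestSeenSequence = new HashMap<>();

    /** Returns true if the event should be applied, false if it is a duplicate retry. */
    public boolean recordAndCheck(String sourceId, long sequence) {
        Long seen = highestSeenSequence.get(sourceId);
        if (seen != null && sequence <= seen) {
            return false; // duplicate (retried) event: conflate, do not re-apply
        }
        highestSeenSequence.put(sourceId, sequence);
        return true;
    }

    public static void main(String[] args) {
        EventTrackerSketch tracker = new EventTrackerSketch();
        System.out.println(tracker.recordAndCheck("member1:thread1", 1)); // true: first delivery
        System.out.println(tracker.recordAndCheck("member1:thread1", 1)); // false: retry, conflated
        System.out.println(tracker.recordAndCheck("member1:thread1", 2)); // true: next event
    }
}
```

A member that received the entry state through sync but not the tracker data corresponds, in this sketch, to an empty `highestSeenSequence` map: the retry would wrongly look like a first delivery.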
[jira] [Updated] (GEODE-10326) Convert MessageType into an enum
[ https://issues.apache.org/jira/browse/GEODE-10326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Burcham updated GEODE-10326: Summary: Convert MessageType into an enum (was: Covert MessageType into an enum)

> Convert MessageType into an enum
>
> Key: GEODE-10326
> URL: https://issues.apache.org/jira/browse/GEODE-10326
> Project: Geode
> Issue Type: Improvement
> Components: messaging
> Reporter: Jacob Barrett
> Assignee: Jacob Barrett
> Priority: Major
> Labels: pull-request-available
>
> Currently {{MessageType}} is a class with lots of numeric constants, effectively an enum without all the compile-time checking that comes with one. Let's make it an enum for type safety.
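The conversion described above can be sketched roughly as follows. The constant names here are invented for illustration, not the actual {{MessageType}} values; the point is that an enum carrying an explicit id preserves any wire-format numeric value while adding compile-time type checking.

```java
// Sketch of replacing int constants with an enum (illustrative names, not the
// real Geode MessageType constants).
public class MessageTypeSketch {
    // Before: public static final int REQUEST = 0; public static final int RESPONSE = 1;
    // Any int could be passed where a message type was expected.

    // After: only valid MessageType values can be passed; the explicit id
    // keeps the on-the-wire numeric value stable.
    public enum MessageType {
        REQUEST(0), RESPONSE(1);

        private final int id;

        MessageType(int id) {
            this.id = id;
        }

        public int id() {
            return id;
        }

        /** Decode a wire id, failing loudly on unknown values. */
        public static MessageType fromId(int id) {
            for (MessageType t : values()) {
                if (t.id == id) {
                    return t;
                }
            }
            throw new IllegalArgumentException("unknown MessageType id: " + id);
        }
    }

    public static void main(String[] args) {
        System.out.println(MessageType.fromId(1)); // prints RESPONSE
    }
}
```

With the int constants, an out-of-range value was silently accepted; with the enum, `fromId` rejects it at the deserialization boundary and everything past that boundary is type-safe.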
[jira] [Assigned] (GEODE-9402) Automatic Reconnect Failure: Address already in use
[ https://issues.apache.org/jira/browse/GEODE-9402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Burcham reassigned GEODE-9402: Assignee: Jianxia Chen (was: Bill Burcham)

> Automatic Reconnect Failure: Address already in use
>
> Key: GEODE-9402
> URL: https://issues.apache.org/jira/browse/GEODE-9402
> Project: Geode
> Issue Type: Bug
> Components: membership
> Reporter: Juan Ramos
> Assignee: Jianxia Chen
> Priority: Major
> Attachments: cluster_logs_gke_latest_54.zip, cluster_logs_pks_121.zip
>
> There are 2 locators and 4 servers during the test. Once they're all up and running, the test drops the network connectivity between all members to generate a full network partition and cause all members to shut down and go into reconnect mode. Upon reaching that state, the test automatically restores the network connectivity and expects all members to automatically come up again and re-form the distributed system.
> This works fine most of the time, and we see every member successfully reconnecting to the distributed system:
> {noformat}
> [info 2021/06/23 15:58:12.981 GMT gemfire-cluster-locator-0 tid=0x87] Reconnect completed.
> [info 2021/06/23 15:58:14.726 GMT gemfire-cluster-locator-1 tid=0x86] Reconnect completed.
> [info 2021/06/23 15:58:46.702 GMT gemfire-cluster-server-0 tid=0x94] Reconnect completed.
> [info 2021/06/23 15:58:46.485 GMT gemfire-cluster-server-1 tid=0x96] Reconnect completed.
> [info 2021/06/23 15:58:46.273 GMT gemfire-cluster-server-2 tid=0x97] Reconnect completed.
> [info 2021/06/23 15:58:46.902 GMT gemfire-cluster-server-3 tid=0x95] Reconnect completed.
> {noformat} > In some rare occasions, though, one of the servers fails during the reconnect > phase with the following exception: > {noformat} > [error 2021/06/09 18:48:52.872 GMT gemfire-cluster-server-1 > tid=0x91] Cache initialization for GemFireCache[id = 575310555; isClosing = > false; isShutDownAll = false; created = Wed Jun 09 18:46:49 GMT 2021; server > = false; copyOnRead = false; lockLease = 120; lockTimeout = 60] failed > because: > org.apache.geode.GemFireIOException: While starting cache server CacheServer > on port=40404 client subscription config policy=none client subscription > config capacity=1 client subscription config overflow directory=. > at > org.apache.geode.internal.cache.xmlcache.CacheCreation.startCacheServers(CacheCreation.java:800) > at > org.apache.geode.internal.cache.xmlcache.CacheCreation.create(CacheCreation.java:599) > at > org.apache.geode.internal.cache.xmlcache.CacheXmlParser.create(CacheXmlParser.java:339) > at > org.apache.geode.internal.cache.GemFireCacheImpl.loadCacheXml(GemFireCacheImpl.java:4207) > at > org.apache.geode.internal.cache.ClusterConfigurationLoader.applyClusterXmlConfiguration(ClusterConfigurationLoader.java:197) > at > org.apache.geode.internal.cache.GemFireCacheImpl.applyJarAndXmlFromClusterConfig(GemFireCacheImpl.java:1497) > at > org.apache.geode.internal.cache.GemFireCacheImpl.initialize(GemFireCacheImpl.java:1449) > at > org.apache.geode.internal.cache.InternalCacheBuilder.create(InternalCacheBuilder.java:191) > at > org.apache.geode.distributed.internal.InternalDistributedSystem.reconnect(InternalDistributedSystem.java:2668) > at > org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect(InternalDistributedSystem.java:2426) > at > org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1277) > at > org.apache.geode.distributed.internal.ClusterDistributionManager$DMListener.membershipFailure(ClusterDistributionManager.java:2315) > 
at > org.apache.geode.distributed.internal.membership.gms.GMSMembership.uncleanShutdown(GMSMembership.java:1183) > at > org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.lambda$forceDisconnect$0(GMSMembership.java:1807) > at java.base/java.lang.Thread.run(Thread.java:829) > Caused by: java.net.BindException: Address already in use (Bind failed) > at java.base/java.net.PlainSocketImpl.socketBind(Native Method) > at > java.base/java.net.AbstractPlainSocketImpl.bind(AbstractPlainSocketImpl.java:436) > at java.base/java.net.ServerSocket.bind(ServerSocket.java:395) > at > org.apache.geode.internal.net.SCClusterSocketCreator.createServerSocket(SCClusterSocketCreator.java:70) > at > org.apache.geode.internal.net.SocketCreator.createServerSocket(SocketCreator.java:529) > at > org.apache.geode.internal.cache.tier.sockets.AcceptorImpl.<init>(AcceptorImpl.java:573) > at > org.apache.geode.internal.cache.tier.sockets.AcceptorBui
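The root-cause symptom in the stack trace above, a java.net.BindException on a port that is still bound, is easy to reproduce in isolation. This is a generic JDK sketch, not Geode code: the reconnecting server's attempt to re-bind its cache server port while the old socket is still held fails the same way.

```java
import java.io.IOException;
import java.net.BindException;
import java.net.InetAddress;
import java.net.ServerSocket;

// Generic reproduction (not Geode code) of "Address already in use": a second
// ServerSocket cannot bind a loopback port the first socket still holds,
// standing in for the cache server re-binding its port during reconnect.
public class BindConflict {
    public static void main(String[] args) throws IOException {
        // Bind to an ephemeral port, standing in for the cache server's port 40404.
        try (ServerSocket first = new ServerSocket(0, 50, InetAddress.getLoopbackAddress())) {
            int port = first.getLocalPort();
            try (ServerSocket second = new ServerSocket(port, 50, InetAddress.getLoopbackAddress())) {
                System.out.println("unexpected: second bind succeeded on port " + port);
            } catch (BindException expected) {
                System.out.println("Address already in use, as in the reconnect failure");
            }
        }
    }
}
```

The fix space for the ticket is therefore about lifecycle, ensuring the old acceptor socket is fully closed before the reconnect path binds again, rather than about the bind call itself.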
[jira] [Created] (GEODE-10272) CI failure: SerialGatewaySenderEventProcessor throws RejectedExecutionException in handlePrimaryDestroy
Bill Burcham created GEODE-10272: Summary: CI failure: SerialGatewaySenderEventProcessor throws RejectedExecutionException in handlePrimaryDestroy Key: GEODE-10272 URL: https://issues.apache.org/jira/browse/GEODE-10272 Project: Geode Issue Type: Bug Components: wan Affects Versions: 1.15.0 Reporter: Bill Burcham [https://hydradb.hdb.gemfire-ci.info/hdb/testresult/14917007] {noformat} > Task :geode-wan:distributedTest SerialWANPropagationOffHeapDUnitTest > testReplicatedSerialPropagationWithRemoteReceiverRestarted_SenderReceiverPersistent FAILED java.lang.AssertionError: Suspicious strings were written to the log during this run. Fix the strings or use IgnoredException.addIgnoredException to ignore. --- Found suspect string in 'dunit_suspect-vm5.log' at line 578 [error 2022/04/30 17:54:20.129 UTC :51004 unshared ordered sender uid=22 dom #1 local port=51185 remote port=59364> tid=172] Exception occurred in CacheListener java.util.concurrent.RejectedExecutionException: Task org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderEventProcessor$$Lambda$419/1037103054@1aae2bfe rejected from java.util.concurrent.ThreadPoolExecutor@7d2e5a91[Shutting down, pool size = 1, active threads = 0, queued tasks = 0, completed tasks = 8478] at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063) at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830) at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379) at org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderEventProcessor.handlePrimaryDestroy(SerialGatewaySenderEventProcessor.java:592) at org.apache.geode.internal.cache.wan.serial.SerialSecondaryGatewayListener.afterDestroy(SerialSecondaryGatewayListener.java:92) at org.apache.geode.internal.cache.EnumListenerEvent$AFTER_DESTROY.dispatchEvent(EnumListenerEvent.java:183) at org.apache.geode.internal.cache.LocalRegion.dispatchEvent(LocalRegion.java:8313) at 
org.apache.geode.internal.cache.LocalRegion.dispatchListenerEvent(LocalRegion.java:7021) at org.apache.geode.internal.cache.LocalRegion.invokeDestroyCallbacks(LocalRegion.java:6822) at org.apache.geode.internal.cache.EntryEventImpl.invokeCallbacks(EntryEventImpl.java:2454) at org.apache.geode.internal.cache.entries.AbstractRegionEntry.dispatchListenerEvents(AbstractRegionEntry.java:164) at org.apache.geode.internal.cache.LocalRegion.basicDestroyPart2(LocalRegion.java:6763) at org.apache.geode.internal.cache.map.RegionMapDestroy.destroyExistingEntry(RegionMapDestroy.java:420) at org.apache.geode.internal.cache.map.RegionMapDestroy.handleExistingRegionEntry(RegionMapDestroy.java:244) at org.apache.geode.internal.cache.map.RegionMapDestroy.destroy(RegionMapDestroy.java:152) at org.apache.geode.internal.cache.AbstractRegionMap.destroy(AbstractRegionMap.java:940) at org.apache.geode.internal.cache.LocalRegion.mapDestroy(LocalRegion.java:6552) at org.apache.geode.internal.cache.LocalRegion.mapDestroy(LocalRegion.java:6526) at org.apache.geode.internal.cache.LocalRegionDataView.destroyExistingEntry(LocalRegionDataView.java:59) at org.apache.geode.internal.cache.LocalRegion.basicDestroy(LocalRegion.java:6477) at org.apache.geode.internal.cache.DistributedRegion.basicDestroy(DistributedRegion.java:1745) at org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderQueue$SerialGatewaySenderQueueMetaRegion.basicDestroy(SerialGatewaySenderQueue.java:1372) at org.apache.geode.internal.cache.LocalRegion.localDestroy(LocalRegion.java:2261) at org.apache.geode.internal.cache.DistributedRegion.localDestroy(DistributedRegion.java:981) at org.apache.geode.internal.cache.wan.serial.BatchDestroyOperation$DestroyMessage.operateOnRegion(BatchDestroyOperation.java:121) at org.apache.geode.internal.cache.DistributedCacheOperation$CacheOperationMessage.basicProcess(DistributedCacheOperation.java:1196) at 
org.apache.geode.internal.cache.DistributedCacheOperation$CacheOperationMessage.process(DistributedCacheOperation.java:1102) at org.apache.geode.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:380) at org.apache.geode.distributed.internal.DistributionMessage.schedule(DistributionMessage.java:436) at org.apache.geode.distributed.internal.ClusterDistributionManager.scheduleIncomingMessage(ClusterDistributionManager.java:2080) at org.apache.geode.distributed.internal.ClusterDistributionManager.handleIncomingDMsg(ClusterDistributionManager.java:1844) at org.apache.geode.distributed.internal.membe
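The proximate failure in the suspect string above, a task handed to an executor that is already shutting down, can be reproduced standalone. This is generic JDK behavior, not Geode code: ThreadPoolExecutor's default AbortPolicy throws RejectedExecutionException for any task submitted after shutdown() has been called, which matches the "Shutting down, pool size = 1" state in the logged exception.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

// Standalone reproduction (not Geode code) of the failure mode above:
// submitting work to an executor after shutdown() triggers the default
// AbortPolicy, which throws RejectedExecutionException.
public class RejectAfterShutdown {
    public static void main(String[] args) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        executor.shutdown(); // the processor is stopping, as during sender shutdown
        try {
            executor.execute(() -> System.out.println("never runs"));
        } catch (RejectedExecutionException e) {
            System.out.println("rejected: task submitted after shutdown");
        }
    }
}
```

A common remedy for this race is for the callback to check the executor's state (or catch and ignore the rejection) when the component is shutting down, since the dropped task is irrelevant once the processor stops.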
[jira] [Created] (GEODE-10271) CI failure: dead server monitor fails to increment server count after a new server is started
Bill Burcham created GEODE-10271: Summary: CI failure: dead server monitor fails to increment server count after a new server is started Key: GEODE-10271 URL: https://issues.apache.org/jira/browse/GEODE-10271 Project: Geode Issue Type: Bug Components: client/server Affects Versions: 1.15.0 Reporter: Bill Burcham [https://hydradb.hdb.gemfire-ci.info/hdb/testresult/14919517] {noformat} > Task :geode-core:integrationTest ConnectionProxyJUnitTest > testDeadServerMonitorPingNature1 FAILED org.awaitility.core.ConditionTimeoutException: Assertion condition defined as a lambda expression in org.apache.geode.internal.cache.tier.sockets.ConnectionProxyJUnitTest expected:<1> but was:<0> within 5 minutes. at org.awaitility.core.ConditionAwaiter.await(ConditionAwaiter.java:167) at org.awaitility.core.AssertionCondition.await(AssertionCondition.java:119) at org.awaitility.core.AssertionCondition.await(AssertionCondition.java:31) at org.awaitility.core.ConditionFactory.until(ConditionFactory.java:985) at org.awaitility.core.ConditionFactory.untilAsserted(ConditionFactory.java:769) at org.apache.geode.internal.cache.tier.sockets.ConnectionProxyJUnitTest.testDeadServerMonitorPingNature1(ConnectionProxyJUnitTest.java:246) Caused by: java.lang.AssertionError: expected:<1> but was:<0> at org.junit.Assert.fail(Assert.java:89) at org.junit.Assert.failNotEquals(Assert.java:835) at org.junit.Assert.assertEquals(Assert.java:647) at org.junit.Assert.assertEquals(Assert.java:633) at org.apache.geode.internal.cache.tier.sockets.ConnectionProxyJUnitTest.lambda$testDeadServerMonitorPingNature1$0(ConnectionProxyJUnitTest.java:247) 4053 tests completed, 1 failed, 84 skipped =-=-=-=-=-=-=-=-=-=-=-=-=-=-= Test Results URI =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= http://files.apachegeode-ci.info/builds/apache-develop-main/1.15.0-build.1131/test-results/integrationTest/1651501470/ =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= Test report artifacts from this job are 
available at: http://files.apachegeode-ci.info/builds/apache-develop-main/1.15.0-build.1131/test-artifacts/1651501470/integrationtestfiles-openjdk8-1.15.0-build.1131.tgz{noformat}
[jira] [Commented] (GEODE-9402) Automatic Reconnect Failure: Address already in use
[ https://issues.apache.org/jira/browse/GEODE-9402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529123#comment-17529123 ] Bill Burcham commented on GEODE-9402:

A shortcoming of my testing around this problem is that my new test/experiment isn't starting the cache server from cache XML. I notice we have a test for that scenario, ReconnectWithCacheXMLDUnitTest, and that test mentions GEODE-2732. If you look at that ticket you'll see a BindException. It's got me thinking perhaps a problem (this problem) remains when reconnecting from a server started with cache XML.
[jira] [Updated] (GEODE-10122) With TLSv1.3 and GCM-based cipher (the default), P2P Messaging Fails When Encrypted Data Limit is Reached
[ https://issues.apache.org/jira/browse/GEODE-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Burcham updated GEODE-10122: Fix Version/s: 1.12.10

> With TLSv1.3 and GCM-based cipher (the default), P2P Messaging Fails When Encrypted Data Limit is Reached
>
> Key: GEODE-10122
> URL: https://issues.apache.org/jira/browse/GEODE-10122
> Project: Geode
> Issue Type: Bug
> Components: messaging
> Affects Versions: 1.12.0, 1.13.0, 1.14.0, 1.15.0
> Reporter: Bill Burcham
> Assignee: Bill Burcham
> Priority: Major
> Labels: blocks-1.15.0, pull-request-available, ssl
> Fix For: 1.12.10, 1.13.9, 1.14.5, 1.15.0
> Attachments: patch-P2PMessagingConcurrencyDUnitTest.txt
>
> TLSv1.3 introduced [1] the ability to set per-algorithm limits on symmetric key usage lifetimes. Once a certain number of bytes have been encrypted, a KeyUpdate post-handshake message [2] is sent.
> With default settings, on Liberica JDK 11, Geode's P2P framework will negotiate TLSv1.3 with the TLS_AES_256_GCM_SHA384 cipher suite. Geode P2P messaging will eventually fail, with a "Tag mismatch!" IOException in shared ordered receivers, after a session has been in heavy use for days. We have not seen this failure on TLSv1.2.
> The implementation of TLSv1.3 in the Java runtime provides a security property [3] to configure the encrypted data limit. The attached patch to P2PMessagingConcurrencyDUnitTest configures the limit large enough that the test makes it through the (P2P) TLS handshake, but small enough that the "Tag mismatch!" exception is encountered less than a minute later.
> The bug is caused by the Geode NioSslEngine class's ignorance of the "rehandshaking" phase of the TLS protocol [4]:
> Creation - ready to be configured.
> Initial handshaking - perform authentication and negotiate communication parameters.
> Application data - ready for application exchange.
> *Rehandshaking* - renegotiate communication parameters/authentication; handshaking data may be mixed with application data.
> Closure - ready to shut down the connection.
> Geode's tcp.Connection and NioSslEngine classes (particularly wrap() and unwrap()), as currently implemented, fail to fully attend to the handshake status from javax.net.ssl.SSLEngine. As a result these Geode classes fail to respond to the KeyUpdate message, resulting in the "Tag mismatch!" IOException.
> When that exception is encountered, the Connection is destroyed and a new one is created in its place. But users of the old Connection, waiting for acknowledgements, will never receive them. This can result in cluster-wide hangs.
> [1] https://datatracker.ietf.org/doc/html/rfc8446#section-5.5
> [2] https://www.ibm.com/docs/en/sdk-java-technology/8?topic=handshake-post-messages
> [3] https://docs.oracle.com/en/java/javase/11/security/java-secure-socket-extension-jsse-reference-guide.html#GUID-B970ADD6-1E9F-4C18-A26E-0679B50CC946
> [4] https://www.ibm.com/docs/en/sdk-java-technology/7.1?topic=sslengine-
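The security property referenced at [3] is jdk.tls.keyLimits, read from the JDK's security configuration when JSSE initializes. A sketch of how one might lower the limit to provoke a KeyUpdate quickly in a test, as the attached patch does; the exact byte limit below is illustrative, not necessarily the value the patch uses:

```properties
# Supply as an overlay without editing the JDK's java.security file:
#   java -Djava.security.properties=/path/to/tls-test.properties ...
#
# Trigger a TLSv1.3 KeyUpdate after ~1 MiB of AES-GCM-encrypted data instead
# of the default 2^37 bytes, so rehandshaking happens within the test run
# (illustrative value, not the one from the attached patch).
jdk.tls.keyLimits=AES/GCM/NoPadding KeyUpdate 1048576
```

With the limit this low, any wrap()/unwrap() path that ignores the SSLEngine handshake status, as described above, hits the "Tag mismatch!" failure within seconds rather than days.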
[jira] [Updated] (GEODE-10122) With TLSv1.3 and GCM-based cipher (the default), P2P Messaging Fails When Encrypted Data Limit is Reached
[ https://issues.apache.org/jira/browse/GEODE-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Burcham updated GEODE-10122: Affects Version/s: 1.12.0, 1.13.0, 1.14.0 (was: 1.12.9, 1.13.7, 1.14.3)
[jira] [Updated] (GEODE-10122) With TLSv1.3 and GCM-based cipher (the default), P2P Messaging Fails When Encrypted Data Limit is Reached
[ https://issues.apache.org/jira/browse/GEODE-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Burcham updated GEODE-10122: Affects Version/s: 1.12.9
[jira] [Updated] (GEODE-10122) With TLSv1.3 and GCM-based cipher (the default), P2P Messaging Fails When Encrypted Data Limit is Reached
[ https://issues.apache.org/jira/browse/GEODE-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Burcham updated GEODE-10122: - Fix Version/s: 1.13.9 1.14.5 > With TLSv1.3 and GCM-based cipher (the default), P2P Messaging Fails When > Encrypted Data Limit is Reached > - > > Key: GEODE-10122 > URL: https://issues.apache.org/jira/browse/GEODE-10122 > Project: Geode > Issue Type: Bug > Components: messaging >Affects Versions: 1.13.7, 1.14.3, 1.15.0 >Reporter: Bill Burcham >Assignee: Bill Burcham >Priority: Major > Labels: blocks-1.15.0, pull-request-available, ssl > Fix For: 1.13.9, 1.14.5, 1.15.0 > > Attachments: patch-P2PMessagingConcurrencyDUnitTest.txt > > > TLSv1.3 introduced [1] the ability to set per-algorithm limits on symmetric > key usage lifetimes. Once a certain number of bytes have been encrypted, a > KeyUpdate post-handshake message [2] is sent. > With default settings, on Liberica JDK 11, Geode's P2P framework will > negotiate TLSv1.3 with the TLS_AES_256_GCM_SHA384 cipher suite. Geode P2P > messaging will eventually fail, with a "Tag mismatch!" IOException in shared > ordered receivers, after a session has been in heavy use for days. > We have not seen this failure on TLSv1.2. > The implementation of TLSv1.3 in the Java runtime provides a security > property [3] to configure the encrypted data limit. The attached patch to > P2PMessagingConcurrencyDUnitTest configures the limit large enough that the > test makes it through the (P2P) TLS handshake but small enough so that the > "Tag mismatch!" exception is encountered less than a minute later. > The bug is caused by Geode’s NioSslEngine class’ ignorance of the > “rehandshaking” phase of the TLS protocol [4]: > Creation - ready to be configured. > Initial handshaking - perform authentication and negotiate communication > parameters. > Application data - ready for application exchange. 
> *Rehandshaking* - renegotiate communications parameters/authentication; > handshaking data may be mixed with application data. > Closure - ready to shut down connection. > Geode's tcp.Connection and NioSslEngine classes (particularly wrap() and > unwrap()), as they are currently implemented, fail to fully attend to the > handshake status from javax.net.ssl.SSLEngine. As a result these Geode > classes fail to respond to the KeyUpdate message, resulting in the "Tag > mismatch!" IOException. > When that exception is encountered, the Connection is destroyed and a new one > created in its place. But users of the old Connection, waiting for > acknowledgements, will never receive them. This can result in cluster-wide > hangs. > [1] [https://datatracker.ietf.org/doc/html/rfc8446#section-5.5] > [2] > [https://www.ibm.com/docs/en/sdk-java-technology/8?topic=handshake-post-messages] > > [3] > [https://docs.oracle.com/en/java/javase/11/security/java-secure-socket-extension-jsse-reference-guide.html#GUID-B970ADD6-1E9F-4C18-A26E-0679B50CC946] > [4] [https://www.ibm.com/docs/en/sdk-java-technology/7.1?topic=sslengine-] -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (GEODE-8506) BufferPool returns byte buffers that may be much larger than requested
[ https://issues.apache.org/jira/browse/GEODE-8506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Burcham updated GEODE-8506: Description: BufferPool manages several pools of direct-memory ByteBuffers. When asked for a ByteBuffer of size X you may receive a buffer that is any size greater than or equal to X. For users of this pool this is unexpected behavior and is causing some trouble. MsgStreamer, for instance, performs message "chunking" based on the size of a socket's buffer size. It requests a byte buffer of that size and then fills it over and over again with message chunks to be written to the socket. But it does this based on the buffer's capacity, which may be much larger than the expected buffer size. This results in incorrect chunking and requires larger buffers in the receiver of these message chunks. BufferPool should always return a buffer that has exactly the requested capacity. It could be a _slice_ of a pooled buffer, for instance. That would let it hand out a larger buffer while not confusing the code that requested the buffer. was: BufferPool manages several pools of direct-memory ByteBuffers. When asked for a ByteBuffer of size X you may receive a buffer that is any size greater than or equal to X. For users of this pool this is unexpected behavior and is causing some trouble. MessageStreamer, for instance, performs message "chunking" based on the size of a socket's buffer size. It requests a byte buffer of that size and then fills it over and over again with message chunks to be written to the socket. But it does this based on the buffer's capacity, which may be much larger than the expected buffer size. This results in incorrect chunking and requires larger buffers in the receiver of these message chunks. BufferPool should always return a buffer that has exactly the requested capacity. It could be a _slice_ of a pooled buffer, for instance. 
That would let it hand out a larger buffer while not confusing the code that requested the buffer. > BufferPool returns byte buffers that may be much larger than requested > -- > > Key: GEODE-8506 > URL: https://issues.apache.org/jira/browse/GEODE-8506 > Project: Geode > Issue Type: Improvement > Components: membership >Reporter: Bruce J Schuchardt >Assignee: Bruce J Schuchardt >Priority: Major > Labels: pull-request-available > Fix For: 1.12.1, 1.13.1, 1.14.0 > > > BufferPool manages several pools of direct-memory ByteBuffers. When asked > for a ByteBuffer of size X you may receive a buffer that is any size greater > than or equal to X. For users of this pool this is unexpected behavior and > is causing some trouble. > MsgStreamer, for instance, performs message "chunking" based on the size of a > socket's buffer size. It requests a byte buffer of that size and then fills > it over and over again with message chunks to be written to the socket. But > it does this based on the buffer's capacity, which may be much larger than > the expected buffer size. This results in incorrect chunking and requires > larger buffers in the receiver of these message chunks. > BufferPool should always return a buffer that has exactly the requested > capacity. It could be a _slice_ of a pooled buffer, for instance. That > would let it hand out a larger buffer while not confusing the code that > requested the buffer. -- This message was sent by Atlassian Jira (v8.20.7#820007)
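The _slice_ idea suggested in the description can be sketched as follows. The helper name and pool interaction are hypothetical (this is not BufferPool's actual API): the pool may hand over a backing buffer larger than requested, but the caller receives a view whose capacity() is exactly the requested size, so code that chunks by capacity() (like MsgStreamer) sees the size it asked for.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of the "slice" fix: exactCapacitySlice() is an invented
// helper, not BufferPool's real API. The returned view shares the pooled
// buffer's memory but reports capacity() == requestedSize, so callers that
// chunk by capacity() are no longer confused by an oversized pooled buffer.
public class BufferSliceSketch {
  static ByteBuffer exactCapacitySlice(ByteBuffer pooled, int requestedSize) {
    if (requestedSize > pooled.capacity()) {
      throw new IllegalArgumentException("pooled buffer too small");
    }
    ByteBuffer view = pooled.duplicate(); // leave the pooled buffer's position/limit alone
    view.position(0);
    view.limit(requestedSize);
    return view.slice();                  // capacity() is exactly requestedSize
  }
}
```

Because slice() produces a view over the same memory, the pool can still reclaim the full-size buffer when the caller is done with it.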
[jira] [Commented] (GEODE-9402) Automatic Reconnect Failure: Address already in use
[ https://issues.apache.org/jira/browse/GEODE-9402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526130#comment-17526130 ] Bill Burcham commented on GEODE-9402: - Here’s a draft PR with my experiments: [https://github.com/apache/geode/pull/7614] (In my testing I enabled TLS for all components. I don’t think it matters for this ticket but it’s become a habit.) I wrote a test that starts a three-member cluster and then binds a server socket to port X and then calls geode.cache.Cache.addCacheServer() to create a CacheServer and then calls setPort(X) on it and then start(). Here’s the exception I get: {{BGB caught: java.net.BindException: Address already in use (Bind failed) at java.net.PlainSocketImpl.socketBind(Native Method) at java.net.AbstractPlainSocketImpl.bind(AbstractPlainSocketImpl.java:387) at java.net.ServerSocket.bind(ServerSocket.java:390) at org.apache.geode.internal.net.SCClusterSocketCreator.createServerSocket(SCClusterSocketCreator.java:79) at org.apache.geode.internal.net.SocketCreator.createServerSocket(SocketCreator.java:491) at org.apache.geode.internal.cache.tier.sockets.AcceptorImpl.<init>(AcceptorImpl.java:574) at org.apache.geode.internal.cache.tier.sockets.AcceptorBuilder.create(AcceptorBuilder.java:291) at org.apache.geode.internal.cache.CacheServerImpl.createAcceptor(CacheServerImpl.java:421) at org.apache.geode.internal.cache.CacheServerImpl.start(CacheServerImpl.java:378) at org.apache.geode.cache30.ReconnectWithTlsAndClientsCacheServerDistributedTest.startClientsCacheServer(ReconnectWithTlsAndClientsCacheServerDistributedTest.java:126) at org.apache.geode.cache30.ReconnectWithTlsAndClientsCacheServerDistributedTest.disconnectAndReconnectTest(ReconnectWithTlsAndClientsCacheServerDistributedTest.java:105) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)}} Part of that stack trace, from the exception to CacheServerImpl.start matches the stack trace from GEM-3359. 
The test does not create the cache from cache XML (e.g. ClusterConfigurationLoader.applyClusterXmlConfiguration()) as described in the ticket however. *This may be an area we want to explore further.* By explicitly causing the bind exception (in my new preBindToClientsCacheServerPortTest() test) I can see that the AcceptorImpl constructor is retrying when it encounters the BindException (a SocketException). It’ll repeatedly try to create the server socket for 120 seconds (CacheServerImpl.getTimeLimitMillis()), sleeping 1 second in between tries. This is also true of the code path described by the stack trace in the ticket. Calling ServerSocket.setReuseAddress(true) when I bind to port X, does not eliminate the bind exception. From the documentation: Enabling SO_REUSEADDR prior to binding the socket using bind(SocketAddress) allows the socket to be bound even though a previous connection is in a timeout state. This setting only allows something else to bind to the port when the original socket is in the timeout state. A socket not in the timeout state, bound to a port, simply monopolizes that port. The short of it is that setReuseAddress(true) is helpful for addressing certain race conditions but it can’t address them all. I did confirm that Geode does always call setReuseAddress(true) whenever creating a server socket for a SocketCreator: non-TLS case: SocketCreator.createServerSocket() TLS case: SCClusterSocketCreator.createServerSocket() I’ve got a test (disconnectAndReconnectTest()) that enables TLS for all Geode components (including clients) and creates a three-member cluster. Then it repeatedly starts a client’s CacheServer (bound to port X), crashes the distributed system via MembershipManagerHelper.crashDistributedSystem() and verifies that the disconnected member reconnects. I haven’t been able to reproduce the problem with this test. This is not exactly the way the forced-disconnect was generated in GEM-3359. 
In that case a network partition caused the forced-disconnection. *This may be an area we want to explore further.* Searching for asynchrony that could lead to a race condition I took a look at GMSMembership.ManagerImpl.forceDisconnect(). When that calls uncleanShutdownDS() a thread is spawned to do the actual work of shutting down the distributed system. Inserting a 30-second delay at the start of that thread’s task (run()) did not reproduce GEM-3359. The path from uncleanShutdownDS() that actually leads to closing the client’s CacheServer’s ServerSocket can be seen in this stack trace: {{BGB in AcceptorImpl.close() closing server socket bound to port: 20009, java.lang.Throwable at org.apache.geode.internal.cache.tier.sockets.AcceptorImpl.close(AcceptorImpl.java:1617) at org.apache.geode.internal.cache.CacheServerImpl.stop(CacheServerImpl.java:485) at org.apache.geode.internal.cache.GemF
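The SO_REUSEADDR behavior discussed in the comment above can be demonstrated with a minimal sketch using plain JDK sockets (not Geode's SocketCreator; the helper name is invented): the flag must be set before bind(), and it only permits binding over a previous connection lingering in the timeout (TIME_WAIT) state; it does not let a second listener bind while the first is still live.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

// Minimal sketch with plain JDK sockets (bindReusable is an invented helper,
// not Geode's SocketCreator). SO_REUSEADDR must be enabled before bind();
// it lets a new socket bind over one stuck in TIME_WAIT, but while another
// live socket holds the port, bind() still fails with "Address already in use".
public class ReuseAddressSketch {
  static ServerSocket bindReusable(int port) throws IOException {
    ServerSocket ss = new ServerSocket();   // create unbound
    ss.setReuseAddress(true);               // must precede bind(), per the javadoc
    ss.bind(new InetSocketAddress(port));   // port 0 picks an ephemeral port
    return ss;
  }
}
```

This matches the comment's conclusion: setReuseAddress(true) helps with certain races (a lingering TIME_WAIT socket) but cannot help when a live socket still monopolizes the port.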
[jira] [Comment Edited] (GEODE-10236) Compatibility issues while upgrading Jgroups to versions 4.0+
[ https://issues.apache.org/jira/browse/GEODE-10236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17522437#comment-17522437 ] Bill Burcham edited comment on GEODE-10236 at 4/14/22 5:34 PM: --- I agree with [~abaker] . If you want to see the JGroups protocol stack used in Geode (membership) communication it's primarily here: [https://github.com/apache/geode/blob/develop/geode-membership/src/main/resources/org/apache/geode/distributed/internal/membership/gms/messenger/jgroups-config.xml] There is also a multicast protocol stack here: [https://github.com/apache/geode/blob/develop/geode-membership/src/main/resources/org/apache/geode/distributed/internal/membership/gms/messenger/jgroups-mcast.xml] Neither mentions the deprecated ENCRYPT protocol/layer or the AUTH protocol/layer. was (Author: bburcham): I agree with [~abaker] . If you want to see the JGroups protocol stack use in Geode (membership) communication it's primarily here: [https://github.com/apache/geode/blob/develop/geode-membership/src/main/resources/org/apache/geode/distributed/internal/membership/gms/messenger/jgroups-config.xml] There is also a multicast protocol stack here: [https://github.com/apache/geode/blob/develop/geode-membership/src/main/resources/org/apache/geode/distributed/internal/membership/gms/messenger/jgroups-mcast.xml] Neither mentions the deprecated ENCRYPT protocol/layer or the AUTH protocol/layer. 
> Compatibility issues while upgrading Jgroups to versions 4.0+ > - > > Key: GEODE-10236 > URL: https://issues.apache.org/jira/browse/GEODE-10236 > Project: Geode > Issue Type: Bug >Affects Versions: 1.14.4 >Reporter: Rohan Jagtap >Priority: Major > Labels: needsTriage > > According to a recent CVE: > {quote}CVE-2016-2141 > NVD: 2016/06/30 - CVSS v2 Base Score: 7.5 - CVSS v3.1 Base Score: 9.8 > JGroups before 4.0 does not require the proper headers for the ENCRYPT and > AUTH protocols from nodes joining the cluster, which allows remote attackers > to bypass security restrictions and send and receive messages within the > cluster via unspecified vectors. > > {quote} > Hence we intend to upgrade jgroups to a recommended version. > However, even the latest version of apache geode ([geode-core > 1.14.4|https://mvnrepository.com/artifact/org.apache.geode/geode-core/1.14.4]) > uses jgroups 3.6.14 which has the aforementioned vulnerability. > Overriding the jgroups dependency to anything over 4.0+ gives the following > issue on running: > {{Caused by: org.springframework.beans.factory.BeanCreationException: Error > creating bean with name 'gemfireCache': FactoryBean threw exception on object > creation; nested exception is java.lang.ExceptionInInitializerError}} > {{ at > org.springframework.beans.factory.support.FactoryBeanRegistrySupport.doGetObjectFromFactoryBean(FactoryBeanRegistrySupport.java:176)}} > {{ at > org.springframework.beans.factory.support.FactoryBeanRegistrySupport.getObjectFromFactoryBean(FactoryBeanRegistrySupport.java:101)}} > {{ at > org.springframework.beans.factory.support.AbstractBeanFactory.getObjectForBeanInstance(AbstractBeanFactory.java:1828)}} > {{ at > org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.getObjectForBeanInstance(AbstractAutowireCapableBeanFactory.java:1265)}} > {{ at > org.springframework.beans.factory.support.AbstractBeanFactory.doGetBean(AbstractBeanFactory.java:334)}} > {{ at > 
org.springframework.beans.factory.support.AbstractBeanFactory.getBean(AbstractBeanFactory.java:202)}} > {{ at > org.springframework.beans.factory.support.BeanDefinitionValueResolver.resolveReference(BeanDefinitionValueResolver.java:330)}} > {{ ... 32 common frames omitted}} > {{Caused by: java.lang.ExceptionInInitializerError: null}} > {{ at > org.apache.geode.distributed.internal.membership.gms.Services.<init>(Services.java:155)}} > {{ at > org.apache.geode.distributed.internal.membership.gms.MembershipBuilderImpl.create(MembershipBuilderImpl.java:114)}} > {{ at > org.apache.geode.distributed.internal.DistributionImpl.<init>(DistributionImpl.java:150)}} > {{ at > org.apache.geode.distributed.internal.DistributionImpl.createDistribution(DistributionImpl.java:217)}} > {{ at > org.apache.geode.distributed.internal.ClusterDistributionManager.<init>(ClusterDistributionManager.java:464)}} > {{ at > org.apache.geode.distributed.internal.ClusterDistributionManager.<init>(ClusterDistributionManager.java:497)}} > {{ at > org.apache.geode.distributed.internal.ClusterDistributionManager.create(ClusterDistributionManager.java:326)}} > {{ at > org.apache.geode.distributed.internal.InternalDistributedSystem.initialize(I
[jira] [Commented] (GEODE-10236) Compatibility issues while upgrading Jgroups to versions 4.0+
[ https://issues.apache.org/jira/browse/GEODE-10236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17522437#comment-17522437 ] Bill Burcham commented on GEODE-10236: -- I agree with [~abaker] . If you want to see the JGroups protocol stack used in Geode (membership) communication it's primarily here: [https://github.com/apache/geode/blob/develop/geode-membership/src/main/resources/org/apache/geode/distributed/internal/membership/gms/messenger/jgroups-config.xml] There is also a multicast protocol stack here: [https://github.com/apache/geode/blob/develop/geode-membership/src/main/resources/org/apache/geode/distributed/internal/membership/gms/messenger/jgroups-mcast.xml] Neither mentions the deprecated ENCRYPT protocol/layer or the AUTH protocol/layer. > Compatibility issues while upgrading Jgroups to versions 4.0+ > - > > Key: GEODE-10236 > URL: https://issues.apache.org/jira/browse/GEODE-10236 > Project: Geode > Issue Type: Bug >Affects Versions: 1.14.4 >Reporter: Rohan Jagtap >Priority: Major > Labels: needsTriage > > According to a recent CVE: > {quote}CVE-2016-2141 > NVD: 2016/06/30 - CVSS v2 Base Score: 7.5 - CVSS v3.1 Base Score: 9.8 > JGroups before 4.0 does not require the proper headers for the ENCRYPT and > AUTH protocols from nodes joining the cluster, which allows remote attackers > to bypass security restrictions and send and receive messages within the > cluster via unspecified vectors. > > {quote} > Hence we intend to upgrade jgroups to a recommended version. > However, even the latest version of apache geode ([geode-core > 1.14.4|https://mvnrepository.com/artifact/org.apache.geode/geode-core/1.14.4]) > uses jgroups 3.6.14 which has the aforementioned vulnerability. 
> Overriding the jgroups dependency to anything over 4.0+ gives the following > issue on running: > {{Caused by: org.springframework.beans.factory.BeanCreationException: Error > creating bean with name 'gemfireCache': FactoryBean threw exception on object > creation; nested exception is java.lang.ExceptionInInitializerError}} > {{ at > org.springframework.beans.factory.support.FactoryBeanRegistrySupport.doGetObjectFromFactoryBean(FactoryBeanRegistrySupport.java:176)}} > {{ at > org.springframework.beans.factory.support.FactoryBeanRegistrySupport.getObjectFromFactoryBean(FactoryBeanRegistrySupport.java:101)}} > {{ at > org.springframework.beans.factory.support.AbstractBeanFactory.getObjectForBeanInstance(AbstractBeanFactory.java:1828)}} > {{ at > org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.getObjectForBeanInstance(AbstractAutowireCapableBeanFactory.java:1265)}} > {{ at > org.springframework.beans.factory.support.AbstractBeanFactory.doGetBean(AbstractBeanFactory.java:334)}} > {{ at > org.springframework.beans.factory.support.AbstractBeanFactory.getBean(AbstractBeanFactory.java:202)}} > {{ at > org.springframework.beans.factory.support.BeanDefinitionValueResolver.resolveReference(BeanDefinitionValueResolver.java:330)}} > {{ ... 
32 common frames omitted}} > {{Caused by: java.lang.ExceptionInInitializerError: null}} > {{ at > org.apache.geode.distributed.internal.membership.gms.Services.<init>(Services.java:155)}} > {{ at > org.apache.geode.distributed.internal.membership.gms.MembershipBuilderImpl.create(MembershipBuilderImpl.java:114)}} > {{ at > org.apache.geode.distributed.internal.DistributionImpl.<init>(DistributionImpl.java:150)}} > {{ at > org.apache.geode.distributed.internal.DistributionImpl.createDistribution(DistributionImpl.java:217)}} > {{ at > org.apache.geode.distributed.internal.ClusterDistributionManager.<init>(ClusterDistributionManager.java:464)}} > {{ at > org.apache.geode.distributed.internal.ClusterDistributionManager.<init>(ClusterDistributionManager.java:497)}} > {{ at > org.apache.geode.distributed.internal.ClusterDistributionManager.create(ClusterDistributionManager.java:326)}} > {{ at > org.apache.geode.distributed.internal.InternalDistributedSystem.initialize(InternalDistributedSystem.java:779)}} > {{ at > org.apache.geode.distributed.internal.InternalDistributedSystem.access$200(InternalDistributedSystem.java:135)}} > {{ at > org.apache.geode.distributed.internal.InternalDistributedSystem$Builder.build(InternalDistributedSystem.java:3036)}} > {{ at > org.apache.geode.distributed.internal.InternalDistributedSystem.connectInternal(InternalDistributedSystem.java:290)}} > {{ at > org.apache.geode.distributed.internal.InternalDistributedSystem.connectInternal(InternalDistributedSystem.java:216)}} > {{ at > org.apache.geode.internal.cache.InternalCacheBuilder.createInternalDistributedSy
[jira] [Resolved] (GEODE-10122) With TLSv1.3 and GCM-based cipher (the default), P2P Messaging Fails When Encrypted Data Limit is Reached
[ https://issues.apache.org/jira/browse/GEODE-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Burcham resolved GEODE-10122. -- Fix Version/s: 1.15.0 Resolution: Fixed > With TLSv1.3 and GCM-based cipher (the default), P2P Messaging Fails When > Encrypted Data Limit is Reached > - > > Key: GEODE-10122 > URL: https://issues.apache.org/jira/browse/GEODE-10122 > Project: Geode > Issue Type: Bug > Components: messaging >Affects Versions: 1.13.7, 1.14.3, 1.15.0 >Reporter: Bill Burcham >Assignee: Bill Burcham >Priority: Major > Labels: blocks-1.15.0, pull-request-available, ssl > Fix For: 1.15.0 > > Attachments: patch-P2PMessagingConcurrencyDUnitTest.txt > > > TLSv1.3 introduced [1] the ability to set per-algorithm limits on symmetric > key usage lifetimes. Once a certain number of bytes have been encrypted, a > KeyUpdate post-handshake message [2] is sent. > With default settings, on Liberica JDK 11, Geode's P2P framework will > negotiate TLSv1.3 with the TLS_AES_256_GCM_SHA384 cipher suite. Geode P2P > messaging will eventually fail, with a "Tag mismatch!" IOException in shared > ordered receivers, after a session has been in heavy use for days. > We have not seen this failure on TLSv1.2. > The implementation of TLSv1.3 in the Java runtime provides a security > property [3] to configure the encrypted data limit. The attached patch to > P2PMessagingConcurrencyDUnitTest configures the limit large enough that the > test makes it through the (P2P) TLS handshake but small enough so that the > "Tag mismatch!" exception is encountered less than a minute later. > The bug is caused by Geode’s NioSslEngine class’ ignorance of the > “rehandshaking” phase of the TLS protocol [4]: > Creation - ready to be configured. > Initial handshaking - perform authentication and negotiate communication > parameters. > Application data - ready for application exchange. 
> *Rehandshaking* - renegotiate communications parameters/authentication; > handshaking data may be mixed with application data. > Closure - ready to shut down connection. > Geode's tcp.Connection and NioSslEngine classes (particularly wrap() and > unwrap()), as they are currently implemented, fail to fully attend to the > handshake status from javax.net.ssl.SSLEngine. As a result these Geode > classes fail to respond to the KeyUpdate message, resulting in the "Tag > mismatch!" IOException. > When that exception is encountered, the Connection is destroyed and a new one > created in its place. But users of the old Connection, waiting for > acknowledgements, will never receive them. This can result in cluster-wide > hangs. > [1] [https://datatracker.ietf.org/doc/html/rfc8446#section-5.5] > [2] > [https://www.ibm.com/docs/en/sdk-java-technology/8?topic=handshake-post-messages] > > [3] > [https://docs.oracle.com/en/java/javase/11/security/java-secure-socket-extension-jsse-reference-guide.html#GUID-B970ADD6-1E9F-4C18-A26E-0679B50CC946] > [4] [https://www.ibm.com/docs/en/sdk-java-technology/7.1?topic=sslengine-] -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (GEODE-10192) CI hang: testRollingEnabledRecoverValuesTruePersistWithOverFlowWithEarlyTerminationOfRoller
[ https://issues.apache.org/jira/browse/GEODE-10192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Burcham updated GEODE-10192: - Description: Hung here: [https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-develop-main/jobs/integration-test-openjdk8/builds/246#C] {noformat} > Task :geode-for-redis:integrationTest timeout exceeded =-=-=-=-=-=-=-=-=-=-=-=-=-=-= Test Results URI =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= http://files.apachegeode-ci.info/builds/apache-develop-main/1.15.0-build.1035/test-results/integrationTest/1648477166/ =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= Test report artifacts from this job are available at: http://files.apachegeode-ci.info/builds/apache-develop-main/1.15.0-build.1035/test-artifacts/1648477166/integrationtestfiles-openjdk8-1.15.0-build.1035.tgz{noformat} The only test in the "started" state is: {noformat} |2.3.1| bburcham-a01 in ~/Downloads/integrationtestfiles-openjdk8-1.15.0-build.1035 ○ → progress -s started org.apache.geode.internal.cache.DiskRandomOperationsAndRecoveryJUnitTest.testRollingEnabledRecoverValuesTruePersistWithOverFlowWithEarlyTerminationOfRoller Iteration: 1 Start: 2022-03-28 13:41:07.109 + End: 0001-01-01 00:00:00.000 + Duration: 0s Status: started {noformat} That JUnit test takes about 20s to run on a Macbook Pro. was: Hung here: [https://hydradb.hdb.gemfire-ci.info/hdb/testresult/14291020] The only test in the "started" state is: {noformat} |2.3.1| bburcham-a01 in ~/Downloads/integrationtestfiles-openjdk8-1.15.0-build.1035 ○ → progress -s started org.apache.geode.internal.cache.DiskRandomOperationsAndRecoveryJUnitTest.testRollingEnabledRecoverValuesTruePersistWithOverFlowWithEarlyTerminationOfRoller Iteration: 1 Start: 2022-03-28 13:41:07.109 + End: 0001-01-01 00:00:00.000 + Duration: 0s Status: started {noformat} That JUnit test takes about 20s to run on a Macbook Pro. 
> CI hang: > testRollingEnabledRecoverValuesTruePersistWithOverFlowWithEarlyTerminationOfRoller > --- > > Key: GEODE-10192 > URL: https://issues.apache.org/jira/browse/GEODE-10192 > Project: Geode > Issue Type: Bug > Components: persistence >Affects Versions: 1.15.0 >Reporter: Bill Burcham >Priority: Major > Labels: needsTriage > > Hung here: > [https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-develop-main/jobs/integration-test-openjdk8/builds/246#C] > > > {noformat} > > Task :geode-for-redis:integrationTest > timeout exceeded > =-=-=-=-=-=-=-=-=-=-=-=-=-=-= Test Results URI > =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= > http://files.apachegeode-ci.info/builds/apache-develop-main/1.15.0-build.1035/test-results/integrationTest/1648477166/ > =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= > Test report artifacts from this job are available at: > http://files.apachegeode-ci.info/builds/apache-develop-main/1.15.0-build.1035/test-artifacts/1648477166/integrationtestfiles-openjdk8-1.15.0-build.1035.tgz{noformat} > The only test in the "started" state is: > > {noformat} > |2.3.1| bburcham-a01 in > ~/Downloads/integrationtestfiles-openjdk8-1.15.0-build.1035 > ○ → progress -s started > org.apache.geode.internal.cache.DiskRandomOperationsAndRecoveryJUnitTest.testRollingEnabledRecoverValuesTruePersistWithOverFlowWithEarlyTerminationOfRoller > Iteration: 1 > Start: 2022-03-28 13:41:07.109 + > End: 0001-01-01 00:00:00.000 + > Duration: 0s > Status: started > {noformat} > That JUnit test takes about 20s to run on a Macbook Pro. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (GEODE-10192) CI hang: testRollingEnabledRecoverValuesTruePersistWithOverFlowWithEarlyTerminationOfRoller
Bill Burcham created GEODE-10192: Summary: CI hang: testRollingEnabledRecoverValuesTruePersistWithOverFlowWithEarlyTerminationOfRoller Key: GEODE-10192 URL: https://issues.apache.org/jira/browse/GEODE-10192 Project: Geode Issue Type: Bug Components: persistence Affects Versions: 1.15.0 Reporter: Bill Burcham Hung here: [https://hydradb.hdb.gemfire-ci.info/hdb/testresult/14291020] The only test in the "started" state is: {noformat} |2.3.1| bburcham-a01 in ~/Downloads/integrationtestfiles-openjdk8-1.15.0-build.1035 ○ → progress -s started org.apache.geode.internal.cache.DiskRandomOperationsAndRecoveryJUnitTest.testRollingEnabledRecoverValuesTruePersistWithOverFlowWithEarlyTerminationOfRoller Iteration: 1 Start: 2022-03-28 13:41:07.109 + End: 0001-01-01 00:00:00.000 + Duration: 0s Status: started {noformat} That JUnit test takes about 20s to run on a Macbook Pro. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Comment Edited] (GEODE-10188) AvailablePortHelperIntegrationTest > initializeUniquePortRange_returnSamePortsForSameRange gets different ports on subsequent tries
[ https://issues.apache.org/jira/browse/GEODE-10188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17513744#comment-17513744 ] Bill Burcham edited comment on GEODE-10188 at 3/29/22, 12:49 AM: - A theory about what happened (thanks [~demery] ): # The 10 Keepers created by a previous test (returnsUniqueKeepers) held onto their ports for a little longer than usual. # The failing test called getRandomAvailableTCPPorts, which skipped those 10 ports because they were still in use, and instead picked up the next ten ports in the initialized range. # Then the Keepers released their ports. # Then the failing test called getRandomAvailableTCPPorts again, and picked up the first ports in the initialized range. was (Author: bburcham): A theory about what happened from Dale Emery: # The 10 Keepers created by a previous test (returnsUniqueKeepers) held onto their ports for a little longer than usual. # The failing test called getRandomAvailableTCPPorts, which skipped those 10 ports because they were still in use, and instead picked up the next ten ports in the initialized range. # Then the Keepers released their ports. # Then the failing test called getRandomAvailableTCPPorts again, and picked up the first ports in the initialized range. 
> AvailablePortHelperIntegrationTest > > initializeUniquePortRange_returnSamePortsForSameRange gets different ports on > subsequent tries > --- > > Key: GEODE-10188 > URL: https://issues.apache.org/jira/browse/GEODE-10188 > Project: Geode > Issue Type: Bug > Components: tests >Affects Versions: 1.13.9 >Reporter: Bill Burcham >Priority: Major > Labels: needsTriage > > Failed here: [https://hydradb.hdb.gemfire-ci.info/hdb/testresult/14294054] > > {noformat} > > Task :geode-core:integrationTest > org.apache.geode.internal.AvailablePortHelperIntegrationTest > > initializeUniquePortRange_returnSamePortsForSameRange(useMembershipPortRange=true) > FAILED > org.junit.ComparisonFailure: expected:<[460[10, 46011, 4601]2]> but > was:<[460[00, 46001, 4600]2]> > at > jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at > org.apache.geode.internal.AvailablePortHelperIntegrationTest.initializeUniquePortRange_returnSamePortsForSameRange(AvailablePortHelperIntegrationTest.java:322) > 4023 tests completed, 1 failed, 82 skipped > =-=-=-=-=-=-=-=-=-=-=-=-=-=-= Test Results URI > =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= > http://files.apachegeode-ci.info/builds/apache-support-1-13-main/1.13.9-build.0668/test-results/integrationTest/1648509410/ > =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= > Test report artifacts from this job are available at: > http://files.apachegeode-ci.info/builds/apache-support-1-13-main/1.13.9-build.0668/test-artifacts/1648509410/integrationtestfiles-openjdk11-1.13.9-build.0668.tgz > {noformat} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (GEODE-10187) PutAllGlobalDUnitTest > testputAllGlobalRemoteVM fails to receive expected TimeoutException
[ https://issues.apache.org/jira/browse/GEODE-10187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Burcham updated GEODE-10187: - Affects Version/s: 1.14.5 (was: 1.15.0) > PutAllGlobalDUnitTest > testputAllGlobalRemoteVM fails to receive expected > TimeoutException > --- > > Key: GEODE-10187 > URL: https://issues.apache.org/jira/browse/GEODE-10187 > Project: Geode > Issue Type: Bug > Components: regions >Affects Versions: 1.14.5 >Reporter: Bill Burcham >Priority: Major > Labels: needsTriage > > Failed here: [https://hydradb.hdb.gemfire-ci.info/hdb/testresult/14277444] > {noformat} > > Task :geode-core:distributedTest > org.apache.geode.internal.cache.PutAllGlobalDUnitTest > > testputAllGlobalRemoteVM FAILED > java.lang.AssertionError: async2 failed > at org.apache.geode.test.dunit.Assert.fail(Assert.java:66) > at > org.apache.geode.internal.cache.PutAllGlobalDUnitTest.testputAllGlobalRemoteVM(PutAllGlobalDUnitTest.java:215) > Caused by: > java.lang.AssertionError: Should have thrown TimeoutException > at org.junit.Assert.fail(Assert.java:89) > at > org.apache.geode.internal.cache.PutAllGlobalDUnitTest$2.run2(PutAllGlobalDUnitTest.java:193) > 8805 tests completed, 1 failed, 455 skipped > =-=-=-=-=-=-=-=-=-=-=-=-=-=-= Test Results URI > =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= > http://files.apachegeode-ci.info/builds/apache-support-1-14-main/1.14.5-build.0942/test-results/distributedTest/1648360227/ > =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= > Test report artifacts from this job are available at: > http://files.apachegeode-ci.info/builds/apache-support-1-14-main/1.14.5-build.0942/test-artifacts/1648360227/distributedtestfiles-openjdk11-1.14.5-build.0942.tgz{noformat} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (GEODE-10188) AvailablePortHelperIntegrationTest > initializeUniquePortRange_returnSamePortsForSameRange gets different ports on subsequent tries
[ https://issues.apache.org/jira/browse/GEODE-10188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Burcham updated GEODE-10188: - Affects Version/s: 1.13.9 (was: 1.15.0) -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (GEODE-10188) AvailablePortHelperIntegrationTest > initializeUniquePortRange_returnSamePortsForSameRange gets different ports on subsequent tries
Bill Burcham created GEODE-10188: Summary: AvailablePortHelperIntegrationTest > initializeUniquePortRange_returnSamePortsForSameRange gets different ports on subsequent tries Key: GEODE-10188 URL: https://issues.apache.org/jira/browse/GEODE-10188 Project: Geode Issue Type: Bug Components: tests Affects Versions: 1.15.0 Reporter: Bill Burcham Failed here: [https://hydradb.hdb.gemfire-ci.info/hdb/testresult/14294054] -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (GEODE-10187) PutAllGlobalDUnitTest > testputAllGlobalRemoteVM fails to receive expected TimeoutException
Bill Burcham created GEODE-10187: Summary: PutAllGlobalDUnitTest > testputAllGlobalRemoteVM fails to receive expected TimeoutException Key: GEODE-10187 URL: https://issues.apache.org/jira/browse/GEODE-10187 Project: Geode Issue Type: Bug Components: regions Affects Versions: 1.15.0 Reporter: Bill Burcham Failed here: [https://hydradb.hdb.gemfire-ci.info/hdb/testresult/14277444] -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (GEODE-10186) CI failure: RedundancyLevelPart1DUnitTest > testRedundancySpecifiedNonPrimaryEPFailsDetectionByCCU times out waiting for getClientProxies() to return more than 0 objects
Bill Burcham created GEODE-10186: Summary: CI failure: RedundancyLevelPart1DUnitTest > testRedundancySpecifiedNonPrimaryEPFailsDetectionByCCU times out waiting for getClientProxies() to return more than 0 objects Key: GEODE-10186 URL: https://issues.apache.org/jira/browse/GEODE-10186 Project: Geode Issue Type: Bug Components: client queues Affects Versions: 1.15.0 Reporter: Bill Burcham Failed here: [https://hydradb.hdb.gemfire-ci.info/hdb/testresult/14277358] {noformat} > Task :geode-core:distributedTest RedundancyLevelPart1DUnitTest > testRedundancySpecifiedNonPrimaryEPFailsDetectionByCCU FAILED org.apache.geode.test.dunit.RMIException: While invoking org.apache.geode.internal.cache.tier.sockets.RedundancyLevelPart1DUnitTest$$Lambda$543/510122765.run in VM 2 running on Host heavy-lifter-f58561da-caf9-5bc0-a7fa-f938c3fd1e51.c.apachegeode-ci.internal with 4 VMs at org.apache.geode.test.dunit.VM.executeMethodOnObject(VM.java:631) at org.apache.geode.test.dunit.VM.invoke(VM.java:448) at org.apache.geode.internal.cache.tier.sockets.RedundancyLevelPart1DUnitTest.testRedundancySpecifiedNonPrimaryEPFailsDetectionByCCU(RedundancyLevelPart1DUnitTest.java:284) Caused by: org.awaitility.core.ConditionTimeoutException: Assertion condition defined as a lambda expression in org.apache.geode.internal.cache.tier.sockets.RedundancyLevelPart1DUnitTest that uses org.apache.geode.internal.cache.tier.sockets.CacheClientNotifier Expecting actual: 0 to be greater than: 0 within 5 minutes. 
at org.awaitility.core.ConditionAwaiter.await(ConditionAwaiter.java:167) at org.awaitility.core.AssertionCondition.await(AssertionCondition.java:119) at org.awaitility.core.AssertionCondition.await(AssertionCondition.java:31) at org.awaitility.core.ConditionFactory.until(ConditionFactory.java:985) at org.awaitility.core.ConditionFactory.untilAsserted(ConditionFactory.java:769) at org.apache.geode.internal.cache.tier.sockets.RedundancyLevelPart1DUnitTest.verifyInterestRegistration(RedundancyLevelPart1DUnitTest.java:505) Caused by: java.lang.AssertionError: Expecting actual: 0 to be greater than: 0 at org.apache.geode.internal.cache.tier.sockets.RedundancyLevelPart1DUnitTest.lambda$verifyInterestRegistration$19(RedundancyLevelPart1DUnitTest.java:506) 8352 tests completed, 1 failed, 414 skipped =-=-=-=-=-=-=-=-=-=-=-=-=-=-= Test Results URI =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= http://files.apachegeode-ci.info/builds/apache-develop-mass-test-run/1.15.0-build.1033/test-results/distributedTest/1648331031/ =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= Test report artifacts from this job are available at: http://files.apachegeode-ci.info/builds/apache-develop-mass-test-run/1.15.0-build.1033/test-artifacts/1648331031/distributedtestfiles-openjdk8-1.15.0-build.1033.tgz {noformat} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (GEODE-10184) CI failure on windows: non-zero exit status on gfsh command in DeployWithLargeJarTest > deployLargeSetOfJars
[ https://issues.apache.org/jira/browse/GEODE-10184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Burcham updated GEODE-10184: - Summary: CI failure on windows: non-zero exit status on gfsh command in DeployWithLargeJarTest > deployLargeSetOfJars (was: CI failure: non-zero exit status on gfsh command in DeployWithLargeJarTest > deployLargeSetOfJars) > CI failure on windows: non-zero exit status on gfsh command in > DeployWithLargeJarTest > deployLargeSetOfJars > > > Key: GEODE-10184 > URL: https://issues.apache.org/jira/browse/GEODE-10184 > Project: Geode > Issue Type: Bug > Components: gfsh >Affects Versions: 1.15.0 >Reporter: Bill Burcham >Priority: Major > Labels: needsTriage > > Deploy large jar test fails due to non-zero exit status on gfsh command on > windows > > [https://hydradb.hdb.gemfire-ci.info/hdb/testresult/14291025] > > {noformat} > > Task :geode-assembly:acceptanceTest > DeployWithLargeJarTest > deployLargeSetOfJars FAILED > org.opentest4j.AssertionFailedError: [Exit value from process started by > [e66e7d3e01750dd9: gfsh -e start locator --name=locator --max-heap=128m -e > start server --name=server --max-heap=128m --server-port=0 -e sleep --time=1 > -e deploy > 
--jars=C:\\Users\\geode\\geode\\geode-assembly\\build\\install\\apache-geode\\lib\\commons-beanutils-1.9.4.jar,C:\\Users\\geode\\geode\\geode-assembly\\build\\install\\apache-geode\\lib\\commons-codec-1.15.jar,C:\\Users\\geode\\geode\\geode-assembly\\build\\install\\apache-geode\\lib\\commons-collections-3.2.2.jar,C:\\Users\\geode\\geode\\geode-assembly\\build\\install\\apache-geode\\lib\\commons-digester-2.1.jar,C:\\Users\\geode\\geode\\geode-assembly\\build\\install\\apache-geode\\lib\\commons-io-2.11.0.jar,C:\\Users\\geode\\geode\\geode-assembly\\build\\install\\apache-geode\\lib\\commons-lang3-3.12.0.jar,C:\\Users\\geode\\geode\\geode-assembly\\build\\install\\apache-geode\\lib\\commons-logging-1.2.jar,C:\\Users\\geode\\geode\\geode-assembly\\build\\install\\apache-geode\\lib\\commons-modeler-2.0.1.jar,C:\\Users\\geode\\geode\\geode-assembly\\build\\install\\apache-geode\\lib\\commons-validator-1.7.jar]] > > expected: 0 > but was: 1 > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at > org.apache.geode.test.junit.rules.gfsh.GfshExecution.awaitTermination(GfshExecution.java:103) > at > org.apache.geode.test.junit.rules.gfsh.GfshRule.execute(GfshRule.java:154) > at > org.apache.geode.test.junit.rules.gfsh.GfshRule.execute(GfshRule.java:163) > at > org.apache.geode.test.junit.rules.gfsh.GfshScript.execute(GfshScript.java:153) > at > org.apache.geode.management.internal.cli.commands.DeployWithLargeJarTest.deployLargeSetOfJars(DeployWithLargeJarTest.java:41) > 176 tests completed, 1 failed, 18 skipped > =-=-=-=-=-=-=-=-=-=-=-=-=-=-= Test Results URI > =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= > http://files.apachegeode-ci.info/builds/apache-develop-main/1.15.0-build.1035/test-results/acceptanceTest/1648482211/ > 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= > Test report artifacts from this job are available at: > http://files.apachegeode-ci.info/builds/apache-develop-main/1.15.0-build.1035/test-artifacts/1648482211/windows-acceptancetestfiles-openjdk8-1.15.0-build.1035.tgz{noformat} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (GEODE-10184) CI failure: non-zero exit status on gfsh command in DeployWithLargeJarTest > deployLargeSetOfJars
[ https://issues.apache.org/jira/browse/GEODE-10184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Burcham updated GEODE-10184: - Summary: CI failure: non-zero exit status on gfsh command in DeployWithLargeJarTest > deployLargeSetOfJars (was: non-zero exit status on gfsh command in DeployWithLargeJarTest > deployLargeSetOfJars) -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (GEODE-10184) non-zero exit status on gfsh command in DeployWithLargeJarTest > deployLargeSetOfJars
Bill Burcham created GEODE-10184: Summary: non-zero exit status on gfsh command in DeployWithLargeJarTest > deployLargeSetOfJars Key: GEODE-10184 URL: https://issues.apache.org/jira/browse/GEODE-10184 Project: Geode Issue Type: Bug Components: gfsh Affects Versions: 1.15.0 Reporter: Bill Burcham Deploy large jar test fails due to non-zero exit status on gfsh command on windows [https://hydradb.hdb.gemfire-ci.info/hdb/testresult/14291025] -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (GEODE-10122) With TLSv1.3 and GCM-based cipher (the default), P2P Messaging Fails When Encrypted Data Limit is Reached
[ https://issues.apache.org/jira/browse/GEODE-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17509131#comment-17509131 ] Bill Burcham commented on GEODE-10122: -- Made progress on the PR: the JUnit ("Integration") test fails reliably, sending 2 bytes of encoded (TLS) data. Next step is to make the test pass! > With TLSv1.3 and GCM-based cipher (the default), P2P Messaging Fails When > Encrypted Data Limit is Reached > - > > Key: GEODE-10122 > URL: https://issues.apache.org/jira/browse/GEODE-10122 > Project: Geode > Issue Type: Bug > Components: messaging >Affects Versions: 1.13.7, 1.14.3, 1.15.0 >Reporter: Bill Burcham >Assignee: Bill Burcham >Priority: Major > Labels: blocks-1.15.0, pull-request-available > Attachments: patch-P2PMessagingConcurrencyDUnitTest.txt > > > TLSv1.3 introduced [1] the ability to set per-algorithm limits on symmetric > key usage lifetimes. Once a certain number of bytes have been encrypted, a > KeyUpdate post-handshake message [2] is sent. > With default settings, on Liberica JDK 11, Geode's P2P framework will > negotiate TLSv1.3 with the TLS_AES_256_GCM_SHA384 cipher suite. Geode P2P > messaging will eventually fail, with a "Tag mismatch!" IOException in shared > ordered receivers, after a session has been in heavy use for days. > We have not seen this failure on TLSv1.2. > The implementation of TLSv1.3 in the Java runtime provides a security > property [3] to configure the encrypted data limit. The attached patch to > P2PMessagingConcurrencyDUnitTest configures the limit large enough that the > test makes it through the (P2P) TLS handshake but small enough so that the > "Tag mismatch!" exception is encountered less than a minute later. > The bug is caused by Geode’s NioSslEngine class’s ignorance of the > “rehandshaking” phase of the TLS protocol [4]: > Creation - ready to be configured. > Initial handshaking - perform authentication and negotiate communication > parameters.
> Application data - ready for application exchange. > *Rehandshaking* - renegotiate communications parameters/authentication; > handshaking data may be mixed with application data. > Closure - ready to shut down connection. > Geode's tcp.Connection and NioSslEngine classes (particularly wrap() and > unwrap()), as they are currently implemented, fail to fully attend to the > handshake status from javax.net.ssl.SSLEngine. As a result these Geode > classes fail to respond to the KeyUpdate message, resulting in the "Tag > mismatch!" IOException. > When that exception is encountered, the Connection is destroyed and a new one > created in its place. But users of the old Connection, waiting for > acknowledgements, will never receive them. This can result in cluster-wide > hangs. > [1] [https://datatracker.ietf.org/doc/html/rfc8446#section-5.5] > [2] > [https://www.ibm.com/docs/en/sdk-java-technology/8?topic=handshake-post-messages] > > [3] > [https://docs.oracle.com/en/java/javase/11/security/java-secure-socket-extension-jsse-reference-guide.html#GUID-B970ADD6-1E9F-4C18-A26E-0679B50CC946] > [4] [https://www.ibm.com/docs/en/sdk-java-technology/7.1?topic=sslengine-] -- This message was sent by Atlassian Jira (v8.20.1#820001)
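The state handling the description says is missing can be sketched as a dispatch on the SSLEngine's handshake status: after every wrap()/unwrap(), the status must be consulted and acted on even mid-session, because processing a peer's TLSv1.3 KeyUpdate typically leaves the engine in NEED_WRAP (it must send its own KeyUpdate reply). The class, enum, and method names below are illustrative, not Geode's actual NioSslEngine API; the dispatch rules follow the javax.net.ssl.SSLEngine contract:

```java
import javax.net.ssl.SSLEngineResult.HandshakeStatus;

public class HandshakeDispatchSketch {

    enum Action { RUN_DELEGATED_TASKS, WRAP_AND_SEND, UNWRAP_MORE, DELIVER_APP_DATA }

    // What a KeyUpdate-aware I/O loop should do next, given the engine's
    // current handshake status. A loop that consults this only during the
    // initial handshake never performs the WRAP_AND_SEND needed to answer a
    // mid-session KeyUpdate, which is the failure mode described above.
    static Action nextAction(HandshakeStatus status) {
        switch (status) {
            case NEED_TASK:
                return Action.RUN_DELEGATED_TASKS; // run getDelegatedTask() work, then re-check
            case NEED_WRAP:
                return Action.WRAP_AND_SEND;       // engine has handshake data to send (e.g. a KeyUpdate reply)
            case NEED_UNWRAP:
            case NEED_UNWRAP_AGAIN:
                return Action.UNWRAP_MORE;         // engine expects more handshake data from the peer
            default:                               // FINISHED or NOT_HANDSHAKING
                return Action.DELIVER_APP_DATA;    // ordinary application data
        }
    }

    public static void main(String[] args) {
        // Mid-session KeyUpdate case: unwrap() consumed the peer's KeyUpdate
        // and the engine now reports NEED_WRAP; ignoring it eventually yields
        // the "Tag mismatch!" IOException once the peer updates its keys.
        System.out.println(nextAction(HandshakeStatus.NEED_WRAP)); // WRAP_AND_SEND
    }
}
```

For reproduction, the security property referenced in [3] (per the JSSE reference guide, `jdk.tls.keyLimits`) can shrink the AES-GCM key usage limit so a KeyUpdate arrives within seconds rather than days, which is the approach the attached patch takes.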
[jira] [Updated] (GEODE-10122) With TLSv1.3 and GCM-based cipher (the default), P2P Messaging Fails When Encrypted Data Limit is Reached
[ https://issues.apache.org/jira/browse/GEODE-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Burcham updated GEODE-10122: - Description: TLSv1.3 introduced [1] the ability to set per-algorithm limits on symmetric key usage lifetimes. Once a certain number of bytes have been encrypted, a KeyUpdate post-handshake message [2] is sent. With default settings, on Liberica JDK 11, Geode's P2P framework will negotiate TLSv1.3 with the TLS_AES_256_GCM_SHA384 cipher suite. Geode P2P messaging will eventually fail, with a "Tag mismatch!" IOException in shared ordered receivers, after a session has been in heavy use for days. We have not seen this failure on TLSv1.2. The implementation of TLSv1.3 in the Java runtime provides a security property [3] to configure the encrypted data limit. The attached patch to P2PMessagingConcurrencyDUnitTest configures the limit large enough that the test makes it through the (P2P) TLS handshake but small enough so that the "Tag mismatch!" exception is encountered less than a minute later. The bug is caused by Geode’s NioSslEngine class’s ignorance of the “rehandshaking” phase of the TLS protocol [4]: Creation - ready to be configured. Initial handshaking - perform authentication and negotiate communication parameters. Application data - ready for application exchange. *Rehandshaking* - renegotiate communications parameters/authentication; handshaking data may be mixed with application data. Closure - ready to shut down connection. Geode's tcp.Connection and NioSslEngine classes (particularly wrap() and unwrap()), as they are currently implemented, fail to fully attend to the handshake status from javax.net.ssl.SSLEngine. As a result these Geode classes fail to respond to the KeyUpdate message, resulting in the "Tag mismatch!" IOException. When that exception is encountered, the Connection is destroyed and a new one created in its place.
But users of the old Connection, waiting for acknowledgements, will never receive them. This can result in cluster-wide hangs. [1] [https://datatracker.ietf.org/doc/html/rfc8446#section-5.5] [2] [https://www.ibm.com/docs/en/sdk-java-technology/8?topic=handshake-post-messages] [3] [https://docs.oracle.com/en/java/javase/11/security/java-secure-socket-extension-jsse-reference-guide.html#GUID-B970ADD6-1E9F-4C18-A26E-0679B50CC946] [4] [https://www.ibm.com/docs/en/sdk-java-technology/7.1?topic=sslengine-] was: TLSv1.3 introduced [1] the ability to set per-algorithm limits on symmetric key usage lifetimes. Once a certain number of bytes have been encrypted, a KeyUpdate post-handshake message is sent. With default settings, on Liberica JDK 11, Geode's P2P framework will negotiate TLSv1.3 with the TLS_AES_256_GCM_SHA384 cipher suite. Geode P2P messaging will eventually fail, with a "Tag mismatch!" IOException in shared ordered receivers, after a session has been in heavy use for days. We have not see this failure on TLSv1.2. The implementation of TLSv1.3 in the Java runtime provides a security property [2] to configure the encrypted data limit. The attached patch to P2PMessagingConcurrencyDUnitTest configures the limit large enough that the test makes it through the (P2P) TLS handshake but small enough so that the "Tag mismatch!" exception is encountered less than a minute later. The bug is caused by Geode’s NioSslEngine class’ ignorance of the “rehandshaking” phase of the TLS protocol [3]: Creation - ready to be configured. Initial handshaking - perform authentication and negotiate communication parameters. Application data - ready for application exchange. *Rehandshaking* - renegotiate communications parameters/authentication; handshaking data may be mixed with application data. Closure - ready to shut down connection. 
Geode's tcp.Connection and NioSslEngine classes (particularly wrap() and unwrap()), as they are currently implemented, fail to fully attend to the handshake status from javax.net.ssl.SSLEngine. As a result these Geode classes fail to respond to the KeyUpdate message, resulting in the "Tag mismatch!" IOException. When that exception is encountered, the Connection is destroyed and a new one created in its place. But users of the old Connection, waiting for acknowledgements, will never receive them. This can result in cluster-wide hangs. [1] [https://datatracker.ietf.org/doc/html/rfc8446#section-5.5] [2] [https://docs.oracle.com/en/java/javase/11/security/java-secure-socket-extension-jsse-reference-guide.html#GUID-B970ADD6-1E9F-4C18-A26E-0679B50CC946] [3] [https://www.ibm.com/docs/en/sdk-java-technology/7.1?topic=sslengine-] > With TLSv1.3 and GCM-based cipher (the default), P2P Messaging Fails When > Encrypted Data Limit is Reached > - > >
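The phases above map onto javax.net.ssl.SSLEngine's HandshakeStatus values. Below is a minimal receive-path sketch, not Geode's actual tcp.Connection/NioSslEngine code: after every wrap()/unwrap() the handshake status is consulted, so a mid-session KeyUpdate (which flips the engine back into NEED_UNWRAP/NEED_WRAP/NEED_TASK) gets serviced instead of surfacing later as "Tag mismatch!". Buffer sizing and the actual socket I/O are elided.

```java
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;
import javax.net.ssl.SSLEngineResult;
import javax.net.ssl.SSLEngineResult.HandshakeStatus;
import java.nio.ByteBuffer;

public class RehandshakeAwareLoop {
    // Service any pending handshake activity. A TLSv1.3 KeyUpdate arriving
    // mid-session moves the engine back into a handshaking state, so this
    // must be called after every wrap()/unwrap(), not only at session start.
    static void serviceHandshake(SSLEngine engine, ByteBuffer netIn,
                                 ByteBuffer netOut, ByteBuffer appIn) throws Exception {
        HandshakeStatus hs = engine.getHandshakeStatus();
        while (hs != HandshakeStatus.NOT_HANDSHAKING && hs != HandshakeStatus.FINISHED) {
            switch (hs) {
                case NEED_TASK: // run delegated tasks (e.g. crypto computations)
                    Runnable task;
                    while ((task = engine.getDelegatedTask()) != null) task.run();
                    break;
                case NEED_WRAP: // engine owes the peer handshake bytes
                    netOut.clear();
                    SSLEngineResult wr = engine.wrap(ByteBuffer.allocate(0), netOut);
                    if (wr.getStatus() != SSLEngineResult.Status.OK) return;
                    // caller must now flush netOut to the peer (elided)
                    break;
                case NEED_UNWRAP: // engine needs handshake bytes from the peer
                    SSLEngineResult r = engine.unwrap(netIn, appIn);
                    if (r.getStatus() == SSLEngineResult.Status.BUFFER_UNDERFLOW) return;
                    break;
                default:
                    return;
            }
            hs = engine.getHandshakeStatus();
        }
    }

    public static void main(String[] args) throws Exception {
        SSLEngine engine = SSLContext.getDefault().createSSLEngine();
        engine.setUseClientMode(true);
        engine.beginHandshake();
        // A client engine that has just begun handshaking owes a ClientHello.
        System.out.println(engine.getHandshakeStatus());
    }
}
```

The main method only exercises the status machinery offline: a client-mode engine that has just begun handshaking reports NEED_WRAP, illustrating that handshake state can be observed (and must be honored) entirely through getHandshakeStatus().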
[jira] [Updated] (GEODE-10122) With TLSv1.3 and GCM-based cipher (the default), P2P Messaging Fails When Encrypted Data Limit is Reached
[ https://issues.apache.org/jira/browse/GEODE-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Burcham updated GEODE-10122: - Component/s: messaging -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (GEODE-10122) With TLSv1.3 and GCM-based cipher (the default), P2P Messaging Fails When Encrypted Data Limit is Reached
[ https://issues.apache.org/jira/browse/GEODE-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Burcham updated GEODE-10122: - Labels: (was: needsTriage)
[jira] [Assigned] (GEODE-10122) With TLSv1.3 and GCM-based cipher (the default), P2P Messaging Fails When Encrypted Data Limit is Reached
[ https://issues.apache.org/jira/browse/GEODE-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Burcham reassigned GEODE-10122: Assignee: Bill Burcham
[jira] [Updated] (GEODE-10122) With TLSv1.3 and GCM-based cipher (the default), P2P Messaging Fails When Encrypted Data Limit is Reached
[ https://issues.apache.org/jira/browse/GEODE-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Burcham updated GEODE-10122: - Attachment: patch-P2PMessagingConcurrencyDUnitTest.txt
[jira] [Created] (GEODE-10122) With TLSv1.3 and GCM-based cipher (the default), P2P Messaging Fails When Encrypted Data Limit is Reached
Bill Burcham created GEODE-10122: Summary: With TLSv1.3 and GCM-based cipher (the default), P2P Messaging Fails When Encrypted Data Limit is Reached Key: GEODE-10122 URL: https://issues.apache.org/jira/browse/GEODE-10122 Project: Geode Issue Type: Bug Affects Versions: 1.14.3, 1.13.7, 1.15.0, 1.16.0 Reporter: Bill Burcham Attachments: patch-P2PMessagingConcurrencyDUnitTest.txt
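To reproduce the failure quickly (as the attached patch does for the dunit test), the TLSv1.3 encrypted-data limit can be shrunk via the JDK security property jdk.tls.keyLimits, documented in the JSSE reference guide. A sketch follows; note that the property is normally set in conf/security/java.security or an overrides file, and setting it programmatically (as here) is an assumption that only holds if it happens before any TLS activity in the JVM:

```java
import java.security.Security;

public class ShrinkKeyLimit {
    public static void main(String[] args) {
        // Request a KeyUpdate after ~1 KiB of AES-GCM ciphertext instead of
        // the default 2^37 bytes. NOTE: programmatic setting (rather than the
        // java.security file) is an assumption; it must precede all TLS use.
        Security.setProperty("jdk.tls.keyLimits",
            "AES/GCM/NoPadding KeyUpdate 1024");
        System.out.println(Security.getProperty("jdk.tls.keyLimits"));
    }
}
```

With a limit this small, a TLS session doing any sustained traffic hits the KeyUpdate path within seconds rather than days, which is how the test provokes "Tag mismatch!" in under a minute.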
[jira] [Updated] (GEODE-9680) Newly Started/Restarted Locators are Susceptible to Split-Brains
[ https://issues.apache.org/jira/browse/GEODE-9680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Burcham updated GEODE-9680: Description: The issues described here are present in all versions of Geode (this is not new to 1.15.0)… Geode is built on the assumption that views progress linearly in a sequence. If that sequence ever forks into two or more parallel lines then we have a "split brain". In a split brain condition, each of the parallel views is independent. It's as if you have more than one system running concurrently. It's possible e.g. for some clients to connect to members of one view and other clients to connect to members of another view. Updates to members in one view are not seen by members of a parallel view. Geode views are produced by a coordinator. As long as only a single coordinator is running, there is no possibility of a split brain. Split brain arises when more than one coordinator is producing views at the same time. Each Geode member (peer) is started with the {{locators}} configuration parameter. That parameter specifies locator(s) to use to find the (already running!) coordinator (member) to join with. When a locator (member) starts, it goes through this sequence to find the coordinator:
# it first tries to find the coordinator through one of the (other) configured locators
# if it can't contact any of those, it tries contacting non-locator (cache server) members it has retrieved from the "view persistence" ({{.dat}}) file
If it hasn't found a coordinator to join with, then the locator may _become_ a coordinator. Sometimes this is ok. If no other coordinator is currently running then this behavior is fine. An example is when an [administrator is starting up a brand new cluster|https://geode.apache.org/docs/guide/114/configuring/running/running_the_locator.html]. In that case we want the very first locator we start to become the coordinator.
But there are a number of situations where there may already be another coordinator running but it cannot be reached:
* if the administrator/operator wants to *start up a brand new cluster* with multiple locators and…
** maybe Geode is running in a managed environment like Kubernetes and the locators' hostnames are not (yet) resolvable in DNS
** maybe there is a network partition between the starting locators so they can't communicate
** maybe the existing locators or coordinator are running very slowly or the network is degraded. This is effectively the same as the network partition just mentioned
* if a cluster is already running and the administrator/operator wants to *scale it up* by starting/adding a new locator, Geode is susceptible to the same issues just mentioned
* if a cluster is already running and the administrator/operator needs to *restart* a locator, e.g. for a rolling upgrade, if none of the locators in the {{locators}} configuration parameter are reachable (maybe they are not running, or maybe there is a network partition) and…
** if the "view persistence" {{.dat}} file is missing or deleted
** or if the current set of running Geode members has evolved so far that the coordinates (host+port) in the {{.dat}} file are completely out of date
In each of those cases, the newly starting locator will become a coordinator and will start producing views. Now we'll have the old coordinator producing views at the same time as the new one.
h2. When This Ticket is Complete
There are a number of possible solutions to these problems. Here is one possible solution… Geode will offer a locator startup mode (via TBD {{LocatorLauncher}} startup parameter) that prevents that locator from becoming a coordinator. In that mode, it will be possible for an administrator/operator to avoid many of the problematic scenarios mentioned above, while retaining the ability (via some _other_ mode) to start a first locator which is allowed to become a coordinator.
For purposes of discussion we'll call the startup mode that allows the locator to become a coordinator "seed" mode, and we'll call the new startup mode that prevents the locator from becoming a coordinator before first joining "join-only" mode. After this mode split is implemented, it is envisioned that to start a brand new cluster, an administrator/operator will start the first locator in "seed" mode. After that the operator will start all subsequent locators in "join-only" mode. If network partitions occur during startup, those newly started ("join-only") nodes will exit with a failure status; under no circumstances will they ever become coordinators. To add a locator to a running cluster, an operator starts it in "join-only" mode. The new member will similarly either join with an existing coordinator or exit with a failure status, thereby avoiding split brains. When an operator restarts a locator, e.g. during a rolling upgrade, they will restart it in "join-only" mode as well.
[jira] [Resolved] (GEODE-9822) Split-brain Certain During Network Partition in Two-Locator Cluster
[ https://issues.apache.org/jira/browse/GEODE-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Burcham resolved GEODE-9822. - Fix Version/s: 1.15.0 Resolution: Fixed > Split-brain Certain During Network Partition in Two-Locator Cluster > --- > > Key: GEODE-9822 > URL: https://issues.apache.org/jira/browse/GEODE-9822 > Project: Geode > Issue Type: Bug > Components: membership >Reporter: Bill Burcham >Priority: Major > Labels: pull-request-available > Fix For: 1.15.0 > > > In a two-locator cluster with default member weights and default setting > (true) of enable-network-partition-detection, if a long-lived network > partition separates the two members, a split-brain will arise: there will be > two coordinators at the same time. > The reason for this can be found in the GMSJoinLeave.isNetworkPartition() > method. That method's name is misleading. A name like isMajorityLost() would > probably be more apt. It needs to return true iff the weight of "crashed" > members (in the prospective view) is greater-than-or-equal-to half (50%) of > the total weight (of all members in the current view). > What the method actually does is return true iff the weight of "crashed" > members is greater-than 51% of the total weight. As a result, if we have two > members of equal weight, and the coordinator sees that the non-coordinator is > "crashed", the coordinator will keep running. If a network partition is > happening, and the non-coordinator is still running, then it will become a > coordinator and start producing views. Now we'll have two coordinators > producing views concurrently. > For this discussion "crashed" members are members for which the coordinator > has received a RemoveMemberRequest message. These are members that the > failure detector has deemed failed. 
Keep in mind the failure detector is > imperfect (it's not always right), and that's kind of the whole point of this > ticket: we've lost contact with the non-coordinator member, but that doesn't > mean it can't still be running (on the other side of a partition). > This bug is not limited to the two-locator scenario. Any set of members that > can be partitioned into two equal sets is susceptible. In fact it's even a > little worse than that. Any set of members that can be partitioned (into more > than one set), where two or more of the sets each still have 49% or more of the > total weight, will result in a split-brain.
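The weight arithmetic is easy to see in a toy sketch (hypothetical method names, not Geode's actual GMSJoinLeave code): with two members of equal weight 10, the greater-than-51% test never fires, so a coordinator that has "lost" its peer keeps running, while a lost-majority (>= 50%) test fires, preventing either side from continuing as a lone coordinator.

```java
public class MajorityLostCheck {
    // Buggy shape described in the ticket: declare a partition only when
    // the crashed weight exceeds 51% of the total weight.
    static boolean buggyIsNetworkPartition(int crashedWeight, int totalWeight) {
        return crashedWeight > 0.51 * totalWeight;
    }

    // Proposed shape: the majority is lost when the crashed weight reaches
    // half (or more) of the total weight.
    static boolean isMajorityLost(int crashedWeight, int totalWeight) {
        return 2 * crashedWeight >= totalWeight;
    }

    public static void main(String[] args) {
        // Two locators of equal default weight; the partition hides one of them.
        int crashed = 10, total = 20;
        System.out.println(buggyIsNetworkPartition(crashed, total)); // coordinator keeps running
        System.out.println(isMajorityLost(crashed, total));          // shut down, avoid split-brain
    }
}
```

The same arithmetic shows the generalization in the last paragraph: any partition leaving a side with 49% of the weight sees only 51% "crashed", which `crashedWeight > 0.51 * totalWeight` treats as survivable on both sides at once.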
[jira] [Updated] (GEODE-9822) Split-brain Certain During Network Partition in Two-Locator Cluster
[ https://issues.apache.org/jira/browse/GEODE-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Burcham updated GEODE-9822: Description: In a two-locator cluster with default member weights and default setting (true) of enable-network-partition-detection, if a long-lived network partition separates the two members, a split-brain will arise: there will be two coordinators at the same time. The reason for this can be found in the GMSJoinLeave.isNetworkPartition() method. That method's name is misleading. A name like isMajorityLost() would probably be more apt. It needs to return true iff the weight of "crashed" members (in the prospective view) is greater-than-or-equal-to half (50%) of the total weight (of all members in the current view). What the method actually does is return true iff the weight of "crashed" members is greater-than 51% of the total weight. As a result, if we have two members of equal weight, and the coordinator sees that the non-coordinator is "crashed", the coordinator will keep running. If a network partition is happening, and the non-coordinator is still running, then it will become a coordinator and start producing views. Now we'll have two coordinators producing views concurrently. For this discussion "crashed" members are members for which the coordinator has received a RemoveMemberRequest message. These are members that the failure detector has deemed failed. Keep in mind the failure detector is imperfect (it's not always right), and that's kind of the whole point of this ticket: we've lost contact with the non-coordinator member, but that doesn't mean it can't still be running (on the other side of a partition). This bug is not limited to the two-locator scenario. Any set of members that can be partitioned into two equal sets is susceptible. In fact it's even a little worse than that. 
Any set of members that can be partitioned (into more than one set), where any two-or-more sets, each still have 49% or more of the total weight, will result in a split-brain was: In a two-locator cluster with default member weights and default setting (true) of enable-network-partition-detection, if a long-lived network partition separates the two members, a split-brain will arise: there will be two coordinators at the same time. The reason for this can be found in the GMSJoinLeave.isNetworkPartition() method. That method's name is misleading. A name like isMajorityLost() would probably be more apt. It needs to return true iff the weight of "crashed" members (in the prospective view) is greater-than-or-equal-to half (50%) of the total weight (of all members in the current view). What the method actually does is return true iff the weight of "crashed" members is greater-than 51% of the total weight. As a result, if we have two members of equal weight, and the coordinator sees that the non-coordinator is "crashed", the coordinator will keep running. If a network partition is happening, and the non-coordinator is still running, then it will become a coordinator and start producing views. Now we'll have two coordinators producing views concurrently. For this discussion "crashed" members are members for which the coordinator has received a RemoveMemberRequest message. These are members that the failure detector has deemed failed. Keep in mind the failure detector is imperfect (it's not always right), and that's kind of the whole point of this ticket: we've lost contact with the non-coordinator member, but that doesn't mean it can't still be running (on the other side of a partition). This bug is not limited to the two-locator scenario. Any set of members that can be partitioned into two equal sets is susceptible. In fact it's even a little worse than that. 
Any set of members that can be partitioned into two sets, both of which still have 49% or more of the total weight, will result in a split-brain.
[jira] [Updated] (GEODE-9880) Cluster with multiple locators in an environment with no host name resolution, leads to null pointer exception
[ https://issues.apache.org/jira/browse/GEODE-9880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Burcham updated GEODE-9880: Component/s: membership > Cluster with multiple locators in an environment with no host name > resolution, leads to null pointer exception > -- > > Key: GEODE-9880 > URL: https://issues.apache.org/jira/browse/GEODE-9880 > Project: Geode > Issue Type: Bug > Components: locator, membership >Affects Versions: 1.12.5 >Reporter: Tigran Ghahramanyan >Priority: Major > Labels: membership > > In our use case we have two locators that are initially configured with IP > addresses, but _AutoConnectionSourceImpl.UpdateLocatorList()_ flow keeps on > adding their corresponding host names to the locators list, while these host > names are not resolvable. > Later in {_}AutoConnectionSourceImpl.queryLocators(){_}, whenever a client > tries to use such a non-resolvable host name to connect to a locator, it tries > to establish a connection to {_}socketaddr=0.0.0.0{_}, as written in > {_}SocketCreator.connect(){_}, which seems strange. > Then, if there is no locator running on the same host, the next locator in > the list is contacted, until reaching a locator contact configured with IP > address - which succeeds eventually. > But, when there happens to be a locator listening on the same host, then we > have a null pointer exception in the second line below, because _inetadd=null_ > _socket.connect(sockaddr, Math.max(timeout, 0)); // sockaddr=0.0.0.0, > connects to a locator listening on the same host_ > _configureClientSSLSocket(socket, inetadd.getHostName(), timeout); // inetadd > = null_ > > As a result, the cluster comes to a failed state, unable to recover.
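The null inetadd can be demonstrated without Geode: java.net.InetSocketAddress leaves getAddress() null when a host name cannot be resolved, and that null is exactly what gets dereferenced later. A small sketch (the host name is hypothetical; the .invalid TLD is reserved by RFC 2606 and never resolves):

```java
import java.net.InetSocketAddress;

public class UnresolvedAddressNpeDemo {
    public static void main(String[] args) {
        // Mimics a locator host name that is absent from DNS.
        InetSocketAddress addr = new InetSocketAddress("locator-1.invalid", 10334);
        // For an unresolvable host, getAddress() returns null -- the inetadd
        // value that is later dereferenced without a guard.
        System.out.println(addr.isUnresolved() + " " + (addr.getAddress() == null));
    }
}
```

A guard such as checking isUnresolved() before calling getHostName() on the resolved address (or falling back to the connected socket's own getInetAddress()) would avoid the NPE; the exact fix in Geode may differ.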
[jira] [Updated] (GEODE-9822) Split-brain Certain During Network Partition in Two-Locator Cluster
[ https://issues.apache.org/jira/browse/GEODE-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Burcham updated GEODE-9822: Description: In a two-locator cluster with default member weights and default setting (true) of enable-network-partition-detection, if a long-lived network partition separates the two members, a split-brain will arise: there will be two coordinators at the same time. The reason for this can be found in the GMSJoinLeave.isNetworkPartition() method. That method's name is misleading. A name like isMajorityLost() would probably be more apt. It needs to return true iff the weight of "crashed" members (in the prospective view) is greater-than-or-equal-to half (50%) of the total weight (of all members in the current view). What the method actually does is return true iff the weight of "crashed" members is greater-than 51% of the total weight. As a result, if we have two members of equal weight, and the coordinator sees that the non-coordinator is "crashed", the coordinator will keep running. If a network partition is happening, and the non-coordinator is still running, then it will become a coordinator and start producing views. Now we'll have two coordinators producing views concurrently. For this discussion "crashed" members are members for which the coordinator has received a RemoveMemberRequest message. These are members that the failure detector has deemed failed. Keep in mind the failure detector is imperfect (it's not always right), and that's kind of the whole point of this ticket: we've lost contact with the non-coordinator member, but that doesn't mean it can't still be running (on the other side of a partition). This bug is not limited to the two-locator scenario. Any set of members that can be partitioned into two equal sets is susceptible. In fact it's even a little worse than that. 
Any set of members that can be partitioned into two sets, both of which still have 49% or more of the total weight, will result in a split-brain. was: In a two-locator cluster with default member weights and default setting (true) of enable-network-partition-detection, if a long-lived network partition separates the two members, a split-brain will arise: there will be two coordinators at the same time. The reason for this can be found in the GMSJoinLeave.isNetworkPartition() method. That method's name is misleading. A name like isMajorityLost() would probably be more apt. It needs to return true iff the weight of "crashed" members (in the prospective view) is greater-than-or-equal-to half (50%) of the total weight (of all members in the current view). What the method actually does is return true iff the weight of "crashed" members is greater-than 51% of the total weight. As a result, if we have two members of equal weight, and the coordinator sees that the non-coordinator is "crashed", the coordinator will keep running. If a network partition is happening, and the non-coordinator is still running, then it will become a coordinator and start producing views. Now we'll have two coordinators producing views concurrently. For this discussion "crashed" members are members for which the coordinator has received a RemoveMemberRequest message. These are members that the failure detector has deemed failed. Keep in mind the failure detector is imperfect (it's not always right), and that's kind of the whole point of this ticket: we've lost contact with the non-coordinator member, but that doesn't mean it can't still be running (on the other side of a partition). 
> Split-brain Certain During Network Partition in Two-Locator Cluster > --- > > Key: GEODE-9822 > URL: https://issues.apache.org/jira/browse/GEODE-9822 > Project: Geode > Issue Type: Bug > Components: membership >Reporter: Bill Burcham >Priority: Major > Labels: pull-request-available > > In a two-locator cluster with default member weights and default setting > (true) of enable-network-partition-detection, if a long-lived network > partition separates the two members, a split-brain will arise: there will be > two coordinators at the same time. > The reason for this can be found in the GMSJoinLeave.isNetworkPartition() > method. That method's name is misleading. A name like isMajorityLost() would > probably be more apt. It needs to return true iff the weight of "crashed" > members (in the prospective view) is greater-than-or-equal-to half (50%) of > the total weight (of all members in the current view). > What the method actually does is return true iff the weight of "crashed" > members is greater-than 51% of the total weight. As a result, if we have two > members of equal weight, and the coordinator sees that the non-coordinator is > "crashed", the coordinator will keep running.
[jira] [Updated] (GEODE-9822) Split-brain Certain During Network Partition in Two-Locator Cluster
[ https://issues.apache.org/jira/browse/GEODE-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Burcham updated GEODE-9822: Summary: Split-brain Certain During Network Partition in Two-Locator Cluster (was: Split-brain Possible During Network Partition in Two-Locator Cluster) > Split-brain Certain During Network Partition in Two-Locator Cluster > --- > > Key: GEODE-9822 > URL: https://issues.apache.org/jira/browse/GEODE-9822 > Project: Geode > Issue Type: Bug > Components: membership >Reporter: Bill Burcham >Priority: Major > Labels: pull-request-available > > In a two-locator cluster with default member weights and default setting > (true) of enable-network-partition-detection, if a long-lived network > partition separates the two members, a split-brain will arise: there will be > two coordinators at the same time. > The reason for this can be found in the GMSJoinLeave.isNetworkPartition() > method. That method's name is misleading. A name like isMajorityLost() would > probably be more apt. It needs to return true iff the weight of "crashed" > members (in the prospective view) is greater-than-or-equal-to half (50%) of > the total weight (of all members in the current view). > What the method actually does is return true iff the weight of "crashed" > members is greater-than 51% of the total weight. As a result, if we have two > members of equal weight, and the coordinator sees that the non-coordinator is > "crashed", the coordinator will keep running. If a network partition is > happening, and the non-coordinator is still running, then it will become a > coordinator and start producing views. Now we'll have two coordinators > producing views concurrently. > For this discussion "crashed" members are members for which the coordinator > has received a RemoveMemberRequest message. These are members that the > failure detector has deemed failed. 
Keep in mind the failure detector is > imperfect (it's not always right), and that's kind of the whole point of this > ticket: we've lost contact with the non-coordinator member, but that doesn't > mean it can't still be running (on the other side of a partition). -- This message was sent by Atlassian Jira (v8.20.1#820001)
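[Editorial note: the threshold problem GEODE-9822 describes can be illustrated with a small sketch. The method names below are hypothetical, chosen to match the ticket's suggested naming; this is not Geode's actual GMSJoinLeave code.]

```java
public class MajorityLostDemo {
    // The corrected predicate the ticket asks for: declare majority lost when
    // crashed weight is greater-than-or-equal-to half (50%) of total weight.
    // Multiplying by 2 keeps the comparison in integer arithmetic.
    static boolean isMajorityLost(int crashedWeight, int totalWeight) {
        return crashedWeight * 2 >= totalWeight;
    }

    // The behavior the ticket attributes to isNetworkPartition(): only trips
    // when crashed weight exceeds 51% of total weight.
    static boolean buggyCheck(int crashedWeight, int totalWeight) {
        return crashedWeight > totalWeight * 51 / 100;
    }

    public static void main(String[] args) {
        // Two locators of equal weight (10 each); each side sees the other
        // "crashed" during the partition.
        int total = 20, crashed = 10;
        System.out.println(buggyCheck(crashed, total));     // false: neither side shuts down -> split-brain
        System.out.println(isMajorityLost(crashed, total)); // true: exactly half lost, so shut down
    }
}
```

With the buggy check, both sides of the partition conclude they may keep running and each produces views, which is exactly the two-coordinator outcome the ticket reports.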
[jira] [Created] (GEODE-9872) DistTXPersistentDebugDUnitTest tests fail because "cluster configuration service not available"
Bill Burcham created GEODE-9872: --- Summary: DistTXPersistentDebugDUnitTest tests fail because "cluster configuration service not available" Key: GEODE-9872 URL: https://issues.apache.org/jira/browse/GEODE-9872 Project: Geode Issue Type: Bug Components: tests Reporter: Bill Burcham I suspect this failure is due to something in the test framework, or perhaps one or more tests failing to manage ports correctly, allowing two or more tests to interfere with one another. In this distributed test: [https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-develop-mass-test-run/jobs/distributed-test-openjdk8/builds/388] we see two failures. Here's the first full stack trace: {code:java} [error 2021/12/04 20:40:53.796 UTC tid=33] org.apache.geode.GemFireConfigException: cluster configuration service not available at org.junit.vintage.engine.execution.TestRun.getStoredResultOrSuccessful(TestRun.java:196) at org.junit.vintage.engine.execution.RunListenerAdapter.fireExecutionFinished(RunListenerAdapter.java:226) at org.junit.vintage.engine.execution.RunListenerAdapter.testFinished(RunListenerAdapter.java:192) at org.junit.vintage.engine.execution.RunListenerAdapter.testFinished(RunListenerAdapter.java:79) at org.junit.runner.notification.SynchronizedRunListener.testFinished(SynchronizedRunListener.java:87) at org.junit.runner.notification.RunNotifier$9.notifyListener(RunNotifier.java:225) at org.junit.runner.notification.RunNotifier$SafeNotifier.run(RunNotifier.java:72) at org.junit.runner.notification.RunNotifier.fireTestFinished(RunNotifier.java:222) at org.junit.internal.runners.model.EachTestNotifier.fireTestFinished(EachTestNotifier.java:38) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:372) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63) at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331) at 
org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329) at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) at org.junit.runners.ParentRunner.run(ParentRunner.java:413) at org.junit.runner.JUnitCore.run(JUnitCore.java:137) at org.junit.runner.JUnitCore.run(JUnitCore.java:115) at org.junit.vintage.engine.execution.RunnerExecutor.execute(RunnerExecutor.java:43) at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183) at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) at java.util.Iterator.forEachRemaining(Iterator.java:116) at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801) at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472) at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150) at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173) at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485) at org.junit.vintage.engine.VintageTestEngine.executeAllChildren(VintageTestEngine.java:82) at org.junit.vintage.engine.VintageTestEngine.execute(VintageTestEngine.java:73) at org.junit.platform.launcher.core.EngineExecutionOrchestrator.execute(EngineExecutionOrchestrator.java:108) at org.junit.platform.launcher.core.EngineExecutionOrchestrator.execute(EngineExecutionOrchestrator.java:88) at org.junit.platform.launcher.core.EngineExecutionOrchestrator.lambda$execute$0(EngineExecutionOrchestrator.java:54) at 
org.junit.platform.launcher.core.EngineExecutionOrchestrator.withInterceptedStreams(EngineExecutionOrchestrator.java:67) at org.junit.platform.launcher.core.EngineExecutionOrchestrator.execute(EngineExecutionOrchestrator.java:52) at org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:96) at org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:75) at org.gradle.api.internal.tasks.testing.junitplatform.JUnitPlatformTestClassProcessor$CollectAllTestClassesExecutor.processAllTestClasses(JUnitPlatformTes
[jira] [Created] (GEODE-9871) CI failure: InfoStatsIntegrationTest > networkKiloBytesReadOverLastSecond_shouldBeCloseToBytesReadOverLastSecond
Bill Burcham created GEODE-9871: --- Summary: CI failure: InfoStatsIntegrationTest > networkKiloBytesReadOverLastSecond_shouldBeCloseToBytesReadOverLastSecond Key: GEODE-9871 URL: https://issues.apache.org/jira/browse/GEODE-9871 Project: Geode Issue Type: Bug Components: redis, statistics Affects Versions: 1.15.0 Reporter: Bill Burcham link: [https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-develop-main/jobs/integration-test-openjdk8/builds/38] stack trace: {code:java} InfoStatsIntegrationTest > networkKiloBytesReadOverLastSecond_shouldBeCloseToBytesReadOverLastSecond FAILED org.opentest4j.AssertionFailedError: expected: 0.0 but was: 0.01 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at org.apache.geode.redis.internal.commands.executor.server.AbstractRedisInfoStatsIntegrationTest.networkKiloBytesReadOverLastSecond_shouldBeCloseToBytesReadOverLastSecond(AbstractRedisInfoStatsIntegrationTest.java:228) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) at 
org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63) at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329) at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) at org.apache.geode.test.junit.rules.serializable.SerializableExternalResource$1.evaluate(SerializableExternalResource.java:38) at org.junit.rules.RunRules.evaluate(RunRules.java:20) at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) at org.junit.runners.ParentRunner.run(ParentRunner.java:413) at org.junit.runner.JUnitCore.run(JUnitCore.java:137) at org.junit.runner.JUnitCore.run(JUnitCore.java:115) at org.junit.vintage.engine.execution.RunnerExecutor.execute(RunnerExecutor.java:43) at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183) at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) at java.util.Iterator.forEachRemaining(Iterator.java:116) at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801) at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472) at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150) at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173) at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) 
at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485) at org.junit.vintage.engine.VintageTestEngine.executeAllChildren(VintageTestEngine.java:82) at org.junit.vintage.engine.VintageTestEngine.execute(VintageTestEngine.java:73) at org.junit.platform.launcher.core.EngineExecutionOrchestrator.execute(EngineExecutionOrchestrator.java:108) at org.junit.platform.launcher.core.EngineExecutionOrchestrator.execute(EngineExecutionOrchestrator.java:88) at org.junit.platform.launc
[jira] [Reopened] (GEODE-9866) CI Failure : MemoryStatsIntegrationTest > usedMemory_shouldIncrease_givenAdditionalValuesAdded FAILED
[ https://issues.apache.org/jira/browse/GEODE-9866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Burcham reopened GEODE-9866: - Seen again: https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-develop-main/jobs/integration-test-openjdk8/builds/37 > CI Failure : MemoryStatsIntegrationTest > > usedMemory_shouldIncrease_givenAdditionalValuesAdded FAILED > - > > Key: GEODE-9866 > URL: https://issues.apache.org/jira/browse/GEODE-9866 > Project: Geode > Issue Type: Bug > Components: redis, statistics >Reporter: Nabarun Nag >Assignee: Jens Deppe >Priority: Major > Labels: pull-request-available > Fix For: 1.15.0 > > > link : > [https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-develop-main/jobs/windows-integration-test-openjdk8/builds/31] > Bug Report: > {noformat} > MemoryStatsIntegrationTest > > usedMemory_shouldIncrease_givenAdditionalValuesAdded FAILED > java.lang.AssertionError: > Expecting actual: > 61121264L > to be greater than: > 105070472L > at > org.apache.geode.redis.internal.commands.executor.server.AbstractRedisMemoryStatsIntegrationTest.usedMemory_shouldIncrease_givenAdditionalValuesAdded(AbstractRedisMemoryStatsIntegrationTest.java:80) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) > at > 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) > at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) > at > org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100) > at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63) > at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331) > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79) > at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329) > at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66) > at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293) > at > org.apache.geode.test.junit.rules.serializable.SerializableExternalResource$1.evaluate(SerializableExternalResource.java:38) > at org.junit.rules.RunRules.evaluate(RunRules.java:20) > at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) > at org.junit.runners.ParentRunner.run(ParentRunner.java:413) > at org.junit.runner.JUnitCore.run(JUnitCore.java:137) > at org.junit.runner.JUnitCore.run(JUnitCore.java:115) > at > org.junit.vintage.engine.execution.RunnerExecutor.execute(RunnerExecutor.java:43) > at > java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183) > at > java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) > at java.util.Iterator.forEachRemaining(Iterator.java:116) > at > java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801) > at > java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) > at > java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472) > at > java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150) > at > java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173) 
> at > java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) > at > java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485) > at > org.junit.vintage.engine.VintageTestEngine.executeAllChildren(VintageTestEngine.java:82) > at > org.junit.vintage.engine.VintageTestEngine.execute(VintageTestEngine.java:73) > at > org.junit.platform.launcher.core.Engi
[jira] [Updated] (GEODE-9870) JedisMovedDataException exception in testReconnectionWithAuthAndServerRestarts
[ https://issues.apache.org/jira/browse/GEODE-9870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Burcham updated GEODE-9870: Description: CI failure here [https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-develop-mass-test-run/jobs/distributed-test-openjdk8/builds/315|https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-develop-mass-test-run/jobs/distributed-test-openjdk8/builds/315]: {code:java} AuthWhileServersRestartDUnitTest > testReconnectionWithAuthAndServerRestarts FAILED redis.clients.jedis.exceptions.JedisMovedDataException: MOVED 12539 127.0.0.1:26259 at redis.clients.jedis.Protocol.processError(Protocol.java:119) at redis.clients.jedis.Protocol.process(Protocol.java:169) at redis.clients.jedis.Protocol.read(Protocol.java:223) at redis.clients.jedis.Connection.readProtocolWithCheckingBroken(Connection.java:352) at redis.clients.jedis.Connection.getStatusCodeReply(Connection.java:270) at redis.clients.jedis.BinaryJedis.flushAll(BinaryJedis.java:826) at org.apache.geode.test.dunit.rules.RedisClusterStartupRule.flushAll(RedisClusterStartupRule.java:147) at org.apache.geode.test.dunit.rules.RedisClusterStartupRule.flushAll(RedisClusterStartupRule.java:131) at org.apache.geode.redis.internal.executor.auth.AuthWhileServersRestartDUnitTest.after(AuthWhileServersRestartDUnitTest.java:88){code} was: CI failure: {code:java} AuthWhileServersRestartDUnitTest > testReconnectionWithAuthAndServerRestarts FAILED redis.clients.jedis.exceptions.JedisMovedDataException: MOVED 12539 127.0.0.1:26259 at redis.clients.jedis.Protocol.processError(Protocol.java:119) at redis.clients.jedis.Protocol.process(Protocol.java:169) at redis.clients.jedis.Protocol.read(Protocol.java:223) at redis.clients.jedis.Connection.readProtocolWithCheckingBroken(Connection.java:352) at redis.clients.jedis.Connection.getStatusCodeReply(Connection.java:270) at redis.clients.jedis.BinaryJedis.flushAll(BinaryJedis.java:826) at 
org.apache.geode.test.dunit.rules.RedisClusterStartupRule.flushAll(RedisClusterStartupRule.java:147) at org.apache.geode.test.dunit.rules.RedisClusterStartupRule.flushAll(RedisClusterStartupRule.java:131) at org.apache.geode.redis.internal.executor.auth.AuthWhileServersRestartDUnitTest.after(AuthWhileServersRestartDUnitTest.java:88){code} > JedisMovedDataException exception in testReconnectionWithAuthAndServerRestarts > -- > > Key: GEODE-9870 > URL: https://issues.apache.org/jira/browse/GEODE-9870 > Project: Geode > Issue Type: Bug > Components: redis >Affects Versions: 1.15.0 >Reporter: Bill Burcham >Priority: Major > > CI failure here > [https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-develop-mass-test-run/jobs/distributed-test-openjdk8/builds/315|https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-develop-mass-test-run/jobs/distributed-test-openjdk8/builds/315]: > > {code:java} > AuthWhileServersRestartDUnitTest > testReconnectionWithAuthAndServerRestarts > FAILED > redis.clients.jedis.exceptions.JedisMovedDataException: MOVED 12539 > 127.0.0.1:26259 > at redis.clients.jedis.Protocol.processError(Protocol.java:119) > at redis.clients.jedis.Protocol.process(Protocol.java:169) > at redis.clients.jedis.Protocol.read(Protocol.java:223) > at > redis.clients.jedis.Connection.readProtocolWithCheckingBroken(Connection.java:352) > at > redis.clients.jedis.Connection.getStatusCodeReply(Connection.java:270) > at redis.clients.jedis.BinaryJedis.flushAll(BinaryJedis.java:826) > at > org.apache.geode.test.dunit.rules.RedisClusterStartupRule.flushAll(RedisClusterStartupRule.java:147) > at > org.apache.geode.test.dunit.rules.RedisClusterStartupRule.flushAll(RedisClusterStartupRule.java:131) > at > org.apache.geode.redis.internal.executor.auth.AuthWhileServersRestartDUnitTest.after(AuthWhileServersRestartDUnitTest.java:88){code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (GEODE-9870) JedisMovedDataException exception in testReconnectionWithAuthAndServerRestarts
Bill Burcham created GEODE-9870: --- Summary: JedisMovedDataException exception in testReconnectionWithAuthAndServerRestarts Key: GEODE-9870 URL: https://issues.apache.org/jira/browse/GEODE-9870 Project: Geode Issue Type: Bug Components: redis Affects Versions: 1.15.0 Reporter: Bill Burcham CI failure: {code:java} AuthWhileServersRestartDUnitTest > testReconnectionWithAuthAndServerRestarts FAILED redis.clients.jedis.exceptions.JedisMovedDataException: MOVED 12539 127.0.0.1:26259 at redis.clients.jedis.Protocol.processError(Protocol.java:119) at redis.clients.jedis.Protocol.process(Protocol.java:169) at redis.clients.jedis.Protocol.read(Protocol.java:223) at redis.clients.jedis.Connection.readProtocolWithCheckingBroken(Connection.java:352) at redis.clients.jedis.Connection.getStatusCodeReply(Connection.java:270) at redis.clients.jedis.BinaryJedis.flushAll(BinaryJedis.java:826) at org.apache.geode.test.dunit.rules.RedisClusterStartupRule.flushAll(RedisClusterStartupRule.java:147) at org.apache.geode.test.dunit.rules.RedisClusterStartupRule.flushAll(RedisClusterStartupRule.java:131) at org.apache.geode.redis.internal.executor.auth.AuthWhileServersRestartDUnitTest.after(AuthWhileServersRestartDUnitTest.java:88){code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (GEODE-9396) Upgrades using SSL fail with mismatch of hostname in certificates
[ https://issues.apache.org/jira/browse/GEODE-9396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Burcham resolved GEODE-9396. - Fix Version/s: 1.15.0 Resolution: Fixed > Upgrades using SSL fail with mismatch of hostname in certificates > - > > Key: GEODE-9396 > URL: https://issues.apache.org/jira/browse/GEODE-9396 > Project: Geode > Issue Type: Bug > Components: membership >Affects Versions: 1.15.0 >Reporter: Ernest Burghardt >Assignee: Bill Burcham >Priority: Major > Labels: pull-request-available, release-blocker > Fix For: 1.15.0 > > > When upgrading from a previous version (prior to 1.14) the ssl handshake will > fail to complete in cases where the Certificate contains a symbolic name that > doesn't match the hostname used by the sslengine. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (GEODE-9396) Upgrades using SSL fail with mismatch of hostname in certificates
[ https://issues.apache.org/jira/browse/GEODE-9396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Burcham reassigned GEODE-9396: --- Assignee: Bill Burcham (was: Kamilla Aslami) > Upgrades using SSL fail with mismatch of hostname in certificates > - > > Key: GEODE-9396 > URL: https://issues.apache.org/jira/browse/GEODE-9396 > Project: Geode > Issue Type: Bug > Components: membership >Affects Versions: 1.15.0 >Reporter: Ernest Burghardt >Assignee: Bill Burcham >Priority: Major > Labels: pull-request-available, release-blocker > > When upgrading from a previous version (prior to 1.14) the ssl handshake will > fail to complete in cases where the Certificate contains a symbolic name that > doesn't match the hostname used by the sslengine. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (GEODE-9825) Disparate socket-buffer-size Results in "IOException: Unknown header byte" and Hangs
[ https://issues.apache.org/jira/browse/GEODE-9825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Burcham resolved GEODE-9825. - Resolution: Fixed > Disparate socket-buffer-size Results in "IOException: Unknown header byte" > and Hangs > > > Key: GEODE-9825 > URL: https://issues.apache.org/jira/browse/GEODE-9825 > Project: Geode > Issue Type: Bug > Components: messaging >Affects Versions: 1.12.4, 1.15.0 >Reporter: Bill Burcham >Assignee: Bill Burcham >Priority: Major > Labels: pull-request-available > Fix For: 1.12.6, 1.13.5, 1.14.1, 1.15.0 > > > GEODE-9141 introduced a bug that causes {{IOException: "Unknown header > byte..."}} and hangs if members are configured with different > {{socket-buffer-size}} settings. > h2. Reproduction > To reproduce this bug, turn off TLS, set socket-buffer-size on the sender to > 64KB, and set socket-buffer-size on the receiver to 32KB. See the associated > PR for an example. > h2. Analysis > In {{{}Connection.processInputBuffer(){}}}, when that method has read all the > messages it can from the current input buffer, it considers whether the > buffer needs expansion. If it does, then > {code:java} > inputBuffer = inputSharing.expandReadBufferIfNeeded(allocSize); {code} > is executed and the method returns. The caller then expects to be able to > _write_ bytes into {{{}inputBuffer{}}}. > The problem, it seems, is that > {{ByteBufferSharingInternalImpl.expandReadBufferIfNeeded()}} does not leave > the {{ByteBuffer}} in the proper state. It leaves the buffer ready to be > _read_, not written. 
> Before the changes for GEODE-9141 were introduced, the line of code > referenced above used to be this snippet in > {{Connection.compactOrResizeBuffer(int messageLength)}} (that method has > since been removed): > {code:java} > // need a bigger buffer > logger.info("Allocating larger network read buffer, new size is {} old > size was {}.", > allocSize, oldBufferSize); > ByteBuffer oldBuffer = inputBuffer; > inputBuffer = getBufferPool().acquireDirectReceiveBuffer(allocSize); > if (oldBuffer != null) { > int oldByteCount = oldBuffer.remaining(); > inputBuffer.put(oldBuffer); > inputBuffer.position(oldByteCount); > getBufferPool().releaseReceiveBuffer(oldBuffer); > } {code} > Notice how this method leaves {{inputBuffer}} ready to be _written_ to. > But the code inside > {{ByteBufferSharingInternalImpl.expandReadBufferIfNeeded()}} is doing > something like: > {code:java} > newBuffer.clear(); > newBuffer.put(existing); > newBuffer.flip(); > releaseBuffer(type, existing); > return newBuffer; {code} > A solution (shown in the associated PR) is to add logic after the call to > {{expandReadBufferIfNeeded(allocSize)}} to leave the buffer in a _writable_ > state: > {code:java} > inputBuffer = inputSharing.expandReadBufferIfNeeded(allocSize); > // we're returning to the caller (done == true) so make buffer writeable > inputBuffer.position(inputBuffer.limit()); > inputBuffer.limit(inputBuffer.capacity()); {code} > h2. Resolution > When this ticket is complete the bug will be fixed and > {{P2PMessagingConcurrencyDUnitTest}} will be enhanced to test at least these > combinations: > [security, sender/locator socket-buffer-size, receiver socket-buffer-size] > [TLS, (default), (default)] this is what the test currently does > [no TLS, 64 * 1024, 32 * 1024] *new: this illustrates this bug* > [no TLS, (default), (default)] *new* > We might want to mix in conserve-sockets true/false in there too while we're > at it (the test currently holds it at true). 
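[Editorial note: the position/limit states discussed in GEODE-9825 can be demonstrated in isolation. The sketch below mimics the two code paths quoted in the ticket with standalone hypothetical helpers; it is not Geode's Connection or ByteBufferSharingInternalImpl code.]

```java
import java.nio.ByteBuffer;

public class BufferStateDemo {
    // Mimics what the ticket says expandReadBufferIfNeeded() does: copy the
    // leftover bytes into a bigger buffer, then flip() — which leaves the
    // buffer positioned for *reading* (position=0, limit=bytesPreserved).
    static ByteBuffer expandReadReady(ByteBuffer existing, int newCapacity) {
        ByteBuffer newBuffer = ByteBuffer.allocate(newCapacity);
        newBuffer.clear();
        newBuffer.put(existing);
        newBuffer.flip();
        return newBuffer;
    }

    // The caller-side fix quoted in the ticket: restore a *writable* state so
    // the next socket read appends after the preserved bytes.
    static void makeWritable(ByteBuffer buf) {
        buf.position(buf.limit());
        buf.limit(buf.capacity());
    }

    public static void main(String[] args) {
        // Three leftover bytes to preserve across the expansion.
        ByteBuffer inputBuffer = expandReadReady(ByteBuffer.wrap(new byte[] {1, 2, 3}), 8);
        // Read-ready: position=0, limit=3. A socket read into this buffer
        // would overwrite the preserved bytes — the bug.
        System.out.println(inputBuffer.position() + "/" + inputBuffer.limit()); // 0/3
        makeWritable(inputBuffer);
        // Write-ready: position=3, limit=8. New bytes land after the old ones.
        System.out.println(inputBuffer.position() + "/" + inputBuffer.limit()); // 3/8
    }
}
```

The demo shows why the caller's expectation matters: a read into the flipped buffer starts at position 0 and clobbers the preserved message prefix, which is consistent with the "Unknown header byte" symptom.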
-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (GEODE-9825) Disparate socket-buffer-size Results in "IOException: Unknown header byte" and Hangs
[ https://issues.apache.org/jira/browse/GEODE-9825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Burcham updated GEODE-9825: Fix Version/s: 1.12.6 1.13.5 1.14.1 1.15.0 > Disparate socket-buffer-size Results in "IOException: Unknown header byte" > and Hangs > > > Key: GEODE-9825 > URL: https://issues.apache.org/jira/browse/GEODE-9825 > Project: Geode > Issue Type: Bug > Components: messaging >Affects Versions: 1.12.4, 1.15.0 >Reporter: Bill Burcham >Assignee: Bill Burcham >Priority: Major > Labels: pull-request-available > Fix For: 1.12.6, 1.13.5, 1.14.1, 1.15.0 > > > GEODE-9141 introduced a bug that causes {{IOException: "Unknown header > byte..."}} and hangs if members are configured with different > {{socket-buffer-size}} settings. > h2. Reproduction > To reproduce this bug, turn off TLS, set socket-buffer-size on the sender to > 64KB, and set socket-buffer-size on the receiver to 32KB. See the associated > PR for an example. > h2. Analysis > In {{{}Connection.processInputBuffer(){}}}, when that method has read all the > messages it can from the current input buffer, it considers whether the > buffer needs expansion. If it does, then > {code:java} > inputBuffer = inputSharing.expandReadBufferIfNeeded(allocSize); {code} > is executed and the method returns. The caller then expects to be able to > _write_ bytes into {{{}inputBuffer{}}}. > The problem, it seems, is that > {{ByteBufferSharingInternalImpl.expandReadBufferIfNeeded()}} does not leave > the {{ByteBuffer}} in the proper state. It leaves the buffer ready to be > _read_, not written. 
-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (GEODE-9825) Disparate socket-buffer-size Results in "IOException: Unknown header byte" and Hangs
[ https://issues.apache.org/jira/browse/GEODE-9825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17448309#comment-17448309 ] Bill Burcham commented on GEODE-9825: - Merged to {{develop}}. Back-port PR to 1.14 is ready to merge. A flaky test failed in the PR for 1.13 (wrote a new ticket GEODE-9850 and re-initiated the test). Back-port to 1.12 has a problem. I had to back-port the PR for GEODE-9713 (test framework enhancement). Unfortunately it relies on a newer version (4.1.0) of Awaitility (was at 3.1.6). Bumping just that version in DependencyConstraints.groovy was not sufficient as something (TBD) is dependent on Awaitility 2.0.0 and that version is taking precedence. I want to work this out and then merge all three PRs together (in close succession). -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (GEODE-9850) flaky test: testGetOldestTombstoneTimeForReplicateTombstoneSweeper
Bill Burcham created GEODE-9850: --- Summary: flaky test: testGetOldestTombstoneTimeForReplicateTombstoneSweeper Key: GEODE-9850 URL: https://issues.apache.org/jira/browse/GEODE-9850 Project: Geode Issue Type: Bug Components: tests Affects Versions: 1.13.5 Reporter: Bill Burcham First saw this failure in PR pipeline on support/1.13 here: [https://concourse.apachegeode-ci.info/builds/3912569] {code:java} org.apache.geode.internal.cache.versions.TombstoneDUnitTest > testGetOldestTombstoneTimeForReplicateTombstoneSweeper FAILED org.apache.geode.test.dunit.RMIException: While invoking org.apache.geode.internal.cache.versions.TombstoneDUnitTest$$Lambda$42/2046302475.run in VM 0 running on Host 9a305b2d7db7 with 4 VMs at org.apache.geode.test.dunit.VM.executeMethodOnObject(VM.java:610) at org.apache.geode.test.dunit.VM.invoke(VM.java:437) at org.apache.geode.internal.cache.versions.TombstoneDUnitTest.testGetOldestTombstoneTimeForReplicateTombstoneSweeper(TombstoneDUnitTest.java:228) Caused by: java.lang.AssertionError: Expecting: <-1637701703343L> to be greater than: <0L> at org.apache.geode.internal.cache.versions.TombstoneDUnitTest.lambda$testGetOldestTombstoneTimeForReplicateTombstoneSweeper$bb17a952$3(TombstoneDUnitTest.java:237) {code} I believe the fix is to wrap this assertion in an awaitility call. -- This message was sent by Atlassian Jira (v8.20.1#820001)
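The proposed fix (wrapping the assertion in an Awaitility call) can be sketched without the Awaitility dependency itself. The following self-contained example uses invented names; the {{LongSupplier}} stands in for the real call under test. It shows the poll-until-the-assertion-passes idea behind Awaitility's {{await().untilAsserted(...)}}:

```java
import java.util.function.LongSupplier;

public class AwaitAssertionDemo {
    // Poll until the supplied value satisfies the assertion (here: > 0),
    // failing only if the deadline passes first. This converts a racy
    // one-shot assertion into an eventually-consistent one.
    static void awaitValueGreaterThanZero(LongSupplier value, long timeoutMillis)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (value.getAsLong() <= 0) {
            if (System.currentTimeMillis() > deadline) {
                throw new AssertionError(
                    "timed out waiting for value > 0, last=" + value.getAsLong());
            }
            Thread.sleep(10); // poll interval
        }
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.currentTimeMillis();
        // Simulated "oldest tombstone time" that becomes positive after ~50 ms,
        // mimicking the race that made the one-shot assertion flaky.
        LongSupplier oldestTombstoneTime = () -> System.currentTimeMillis() - start - 50;
        awaitValueGreaterThanZero(oldestTombstoneTime, 5_000);
        System.out.println("assertion eventually passed");
    }
}
```

With Awaitility itself, the equivalent would be a single `await().untilAsserted(...)` call around the existing AssertJ assertion.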
[jira] [Updated] (GEODE-9764) Request-Response Messaging Should Time Out
[ https://issues.apache.org/jira/browse/GEODE-9764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Burcham updated GEODE-9764: Description: There is a weakness in the P2P/DirectChannel messaging architecture, in that it never gives up on a request (in a request-response scenario). As a result a bug (software fault) anywhere from the point where the requesting thread hands off the {{DistributionMessage}} e.g. to {{{}ClusterDistributionManager.putOutgoing(DistributionMessage){}}}, to the point where that request is ultimately fulfilled on a (one) receiver, can result in a hang (of some task on the send side, which is waiting for a response). Well it's a little worse than that because any code in the return (response) path can also cause disruption of the (response) flow, thereby leaving the requesting task hanging. If the code in the request path (primarily in P2P messaging) and the code in the response path (P2P messaging and TBD higher-level code) were perfect this might not be a problem. But there is a fair amount of code there and we have some evidence that it is currently not perfect, nor do we expect it to become perfect and stay that way. This is a sketch of the situation. The left-most column is the request path or the originating member. The middle column is the server-side of the request-response path. And the right-most column is the response path back on the originating member. !image-2021-11-22-12-14-59-117.png! You can see that Geode product code, JDK code, and hardware components all lie in the end-to-end request-response messaging path. That being the case it seems prudent to institute response timeouts so that bugs of this sort (which disrupt request-response message flow) don't result in hangs. It's TBD if we want to go a step further and institute retries. The latter would entail introducing duplicate-suppression (conflation) in P2P messaging. 
We might also add exponential backoff (open-loop) or back-pressure (closed-loop) to prevent a flood of retries when the system is at or near the point of thrashing. But even without retries, a configurable timeout might have good ROI as a first step. This would entail: * adding a configuration parameter to specify the timeout value * changing ReplyProcessor21 and others TBD to "give up" after the timeout has elapsed * changing higher-level code dependent on request-reply messaging so it properly handles the situations where we might have to "give up" This issue affects all versions of Geode. h2. Counterpoint Not everybody thinks timeouts are a good idea. This section has the highlights. h3. Timeouts Will Result in Data-Inconsistency If we leave most of the surrounding code as-is and introduce timeouts, then we risk data inconsistency. TODO: describe in detail why data inconsistency is _inherent_ in using timeouts. h3. Narrow The Vulnerability Cross-Section Without Timeouts The proposal (above) seeks to solve the problem using end-to-end timeouts, since any component in the path can, in general, have faults. An alternative approach would be to assume that _some_ of the components can be made "good enough" (without adding timeouts) and that those "good enough" components can protect themselves (and user applications) from faults in the remaining components. With this approach, the Cluster Distribution Manager and the P2P / TCP Conduit / Direct Channel framework would be enhanced so that they are less susceptible to bugs in: * the 341 Distribution Message classes * the 68 Reply Message classes * the 95 Reply Processor classes The question is: what form would that enhancement take, and would it be sufficient to overcome faults in the remaining components (the JDK, and the host+network layers)? h2. Alternatives Discussed These alternatives have been discussed, to varying degrees. 
* Baseline: no timeouts; members waiting for replies do "the right thing" if the recipient departs the view * Give-up-after-timeout * Retry-after-timeout-and-eventually-give-up * Retry-after-forcing-receiver-out-of-view
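The retry-after-timeout-and-eventually-give-up alternative can be sketched in a few lines. This is a minimal, self-contained illustration with invented names (not Geode code); {{attempt.getAsBoolean()}} stands in for "send the request and wait a bounded time for a reply":

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

public class RetryWithBackoffDemo {
    // Try the request up to maxAttempts times; sleep between attempts with
    // open-loop exponential backoff, and give up (return false) at the end.
    static boolean requestWithRetries(BooleanSupplier attempt, int maxAttempts,
                                      long initialBackoffMillis) throws InterruptedException {
        long backoff = initialBackoffMillis;
        for (int i = 0; i < maxAttempts; i++) {
            if (attempt.getAsBoolean()) {
                return true;              // reply received within this attempt's timeout
            }
            TimeUnit.MILLISECONDS.sleep(backoff);
            backoff *= 2;                 // exponential backoff between retries
        }
        return false;                     // eventually give up
    }

    public static void main(String[] args) throws InterruptedException {
        int[] calls = {0};
        // Simulated receiver that only responds on the third attempt.
        BooleanSupplier flakyReceiver = () -> ++calls[0] >= 3;
        boolean ok = requestWithRetries(flakyReceiver, 5, 10);
        System.out.println(ok + " after " + calls[0] + " attempts");
    }
}
```

As the description notes, retries only become safe once the receiver suppresses duplicates (conflation), since an attempt may time out after the request was actually delivered.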
[jira] [Updated] (GEODE-9764) Request-Response Messaging Should Time Out
[ https://issues.apache.org/jira/browse/GEODE-9764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Burcham updated GEODE-9764: Attachment: image-2021-11-22-12-14-59-117.png > Request-Response Messaging Should Time Out > -- > > Key: GEODE-9764 > URL: https://issues.apache.org/jira/browse/GEODE-9764 > Project: Geode > Issue Type: Improvement > Components: messaging >Reporter: Bill Burcham >Assignee: Bill Burcham >Priority: Major > Attachments: image-2021-11-22-11-52-23-586.png, > image-2021-11-22-12-14-59-117.png -- This message was sent by Atlassian Jira (v8.20.1#820001)
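The give-up-after-timeout behavior proposed for ReplyProcessor21 can be illustrated with a small self-contained sketch (invented names, not Geode's actual classes): a reply processor that waits a bounded time for a response instead of blocking forever.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ReplyTimeoutDemo {
    static class ReplyProcessor {
        private final CompletableFuture<String> reply = new CompletableFuture<>();

        // Called from the messaging layer when the response arrives.
        void processReply(String response) {
            reply.complete(response);
        }

        // Waits a bounded time for the reply; a lost response surfaces as a
        // TimeoutException rather than a hung sender-side task.
        String waitForReply(long timeout, TimeUnit unit) throws Exception {
            try {
                return reply.get(timeout, unit);
            } catch (TimeoutException e) {
                throw new TimeoutException("no reply within " + timeout + " " + unit);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        ReplyProcessor processor = new ReplyProcessor();
        // Simulate the receiver responding on another thread.
        new Thread(() -> processor.processReply("ACK")).start();
        System.out.println(processor.waitForReply(5, TimeUnit.SECONDS));
    }
}
```

The timeout value here would come from the configuration parameter the ticket proposes; callers then need the "give up" handling the description calls out.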
[jira] [Updated] (GEODE-9764) Request-Response Messaging Should Time Out
[ https://issues.apache.org/jira/browse/GEODE-9764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Burcham updated GEODE-9764: Description: There is a weakness in the P2P/DirectChannel messaging architecture, in that it never gives up on a request (in a request-response scenario). As a result a bug (software fault) anywhere from the point where the requesting thread hands off the {{DistributionMessage}} e.g. to {{{}ClusterDistributionManager.putOutgoing(DistributionMessage){}}}, to the point where that request is ultimately fulfilled on a (one) receiver, can result in a hang (of some task on the send side, which is waiting for a response). Well it's a little worse than that because any code in the return (response) path can also cause disruption of the (response) flow, thereby leaving the requesting task hanging. If the code in the request path (primarily in P2P messaging) and the code in the response path (P2P messaging and TBD higher-level code) were perfect this might not be a problem. But there is a fair amount of code there and we have some evidence that it is currently not perfect, nor do we expect it to become perfect and stay that way. This is a sketch of the situation. The left-most column is the request path or the originating member. The middle column is the server-side of the request-response path. And the right-most column is the response path back on the originating member. !image-2021-11-22-12-14-59-117.png! You can see that Geode product code, JDK code, and hardware components all lie in the end-to-end request-response messaging path. That being the case it seems prudent to institute response timeouts so that bugs of this sort (which disrupt request-response message flow) don't result in hangs. It's TBD if we want to go a step further and institute retries. The latter would entail introducing duplicate-suppression (conflation) in P2P messaging. 
We might also add exponential backoff (open-loop) or back-pressure (closed-loop) to prevent a flood of retries when the system is at or near the point of thrashing. But even without retries, a configurable timeout might have good ROI as a first step. This would entail: * adding a configuration parameter to specify the timeout value * changing ReplyProcessor21 and others TBD to "give up" after the timeout has elapsed * changing higher-level code dependent on request-reply messaging so it properly handles the situations where we might have to "give up" This issue affects all versions of Geode. h2. Counterpoint Not everybody thinks timeouts are a good idea. Here are some alternative ideas: The proposal (above) seeks to solve the problem using end-to-end timeouts, since any component in the path can, in general, have faults. An alternative approach would be to assume that _some_ of the components can be made "good enough" (without adding timeouts) and that those "good enough" components can protect themselves (and user applications) from faults in the remaining components. With this approach, the Cluster Distribution Manager and the P2P / TCP Conduit / Direct Channel framework would be enhanced so that they are less susceptible to bugs in: * the 341 Distribution Message classes * the 68 Reply Message classes * the 95 Reply Processor classes The question is: what form would that enhancement take, and would it be sufficient to overcome faults in the remaining components (the JDK, and the host+network layers)?
[jira] [Updated] (GEODE-9764) Request-Response Messaging Should Time Out
[ https://issues.apache.org/jira/browse/GEODE-9764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Burcham updated GEODE-9764: Attachment: image-2021-11-22-11-52-23-586.png > Request-Response Messaging Should Time Out > -- > > Key: GEODE-9764 > URL: https://issues.apache.org/jira/browse/GEODE-9764 > Project: Geode > Issue Type: Improvement > Components: messaging >Reporter: Bill Burcham >Assignee: Bill Burcham >Priority: Major > Attachments: image-2021-11-22-11-52-23-586.png -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (GEODE-9764) Request-Response Messaging Should Time Out
[ https://issues.apache.org/jira/browse/GEODE-9764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Burcham updated GEODE-9764: Description: There is a weakness in the P2P/DirectChannel messaging architecture, in that it never gives up on a request (in a request-response scenario). As a result a bug (software fault) anywhere from the point where the requesting thread hands off the {{DistributionMessage}} e.g. to {{{}ClusterDistributionManager.putOutgoing(DistributionMessage){}}}, to the point where that request is ultimately fulfilled on a (one) receiver, can result in a hang (of some task on the send side, which is waiting for a response). Well it's a little worse than that because any code in the return (response) path can also cause disruption of the (response) flow, thereby leaving the requesting task hanging. If the code in the request path (primarily in P2P messaging) and the code in the response path (P2P messaging and TBD higher-level code) were perfect this might not be a problem. But there is a fair amount of code there and we have some evidence that it is currently not perfect, nor do we expect it to become perfect and stay that way. That being the case it seems prudent to institute response timeouts so that bugs of this sort (which disrupt request-response message flow) don't result in hangs. It's TBD if we want to go a step further and institute retries. The latter would entail introducing duplicate-suppression (conflation) in P2P messaging. We might also add exponential backoff (open-loop) or back-pressure (closed-loop) to prevent a flood of retries when the system is at or near the point of thrashing. But even without retries, a configurable timeout might have good ROI as a first step. 
This would entail: * adding a configuration parameter to specify the timeout value * changing ReplyProcessor21 and others TBD to "give up" after the timeout has elapsed * changing higher-level code dependent on request-reply messaging so it properly handles the situations where we might have to "give up" This issue affects all versions of Geode. h2. Counterpoint Not everybody thinks timeouts are a good idea. Here are some alternative ideas: Make the request-response primitive better: make it so only bugs in our core messaging framework could cause a lack of response - rather than our current approach where a bug in a class like “RemotePutMessage” could cause a lack of a response.
[jira] [Assigned] (GEODE-9825) Disparate socket-buffer-size Results in "IOException: Unknown header byte" and Hangs
[ https://issues.apache.org/jira/browse/GEODE-9825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Burcham reassigned GEODE-9825: --- Assignee: Bill Burcham > Disparate socket-buffer-size Results in "IOException: Unknown header byte" > and Hangs > > > Key: GEODE-9825 > URL: https://issues.apache.org/jira/browse/GEODE-9825 > Project: Geode > Issue Type: Bug > Components: messaging >Affects Versions: 1.12.4, 1.15.0 >Reporter: Bill Burcham >Assignee: Bill Burcham >Priority: Major > Labels: pull-request-available
-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (GEODE-9825) Disparate socket-buffer-size Results in "IOException: Unknown header byte" and Hangs
[ https://issues.apache.org/jira/browse/GEODE-9825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Burcham updated GEODE-9825: Attachment: (was: GEODE-9825-demo.patch)
[ https://issues.apache.org/jira/browse/GEODE-9825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Burcham updated GEODE-9825: Description:
[ https://issues.apache.org/jira/browse/GEODE-9825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Burcham updated GEODE-9825: Attachment: GEODE-9825-demo.patch
[jira] [Updated] (GEODE-9825) Disparate socket-buffer-size Results in "IOException: Unknown header byte" and Hangs
[ https://issues.apache.org/jira/browse/GEODE-9825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Burcham updated GEODE-9825:

Description:

GEODE-9141 introduced a bug that causes {{IOException: "Unknown header byte..."}} and hangs if members are configured with different {{socket-buffer-size}} settings.

h2. Reproduction

To reproduce this bug, modify {{P2PMessagingConcurrencyDUnitTest}} so that the sender, locator, and receiver use different configuration parameters. Set {{socket-buffer-size}} to 212992 for the sender and 32 * 1024 for the receiver. Also, skip the call to {{securityProperties()}}: we want to induce the "Unknown header byte" exception, not exceptions from the TLS framework. See the attached patch file GEODE-9825-demo.patch for an example.

h2. Analysis

In {{Connection.processInputBuffer()}}, when that method has read all the messages it can from the current input buffer, it considers whether the buffer needs expansion. If it does, then:

{code:java}
inputBuffer = inputSharing.expandReadBufferIfNeeded(allocSize);
{code}

is executed and the method returns. The caller then expects to be able to _write_ bytes into {{inputBuffer}}. The problem, it seems, is that {{ByteBufferSharingInternalImpl.expandReadBufferIfNeeded()}} does not leave the {{ByteBuffer}} in the proper state: it leaves the buffer ready to be _read_, not _written_.

Before the changes for GEODE-9141 were introduced, the line of code referenced above used to be this method in {{Connection}} (which has since been removed):

{code:java}
private void compactOrResizeBuffer(int messageLength) {
  final int oldBufferSize = inputBuffer.capacity();
  int allocSize = messageLength + MSG_HEADER_BYTES;
  if (oldBufferSize < allocSize) {
    // need a bigger buffer
    logger.info("Allocating larger network read buffer, new size is {} old size was {}.",
        allocSize, oldBufferSize);
    ByteBuffer oldBuffer = inputBuffer;
    inputBuffer = getBufferPool().acquireDirectReceiveBuffer(allocSize);
    if (oldBuffer != null) {
      int oldByteCount = oldBuffer.remaining();
      inputBuffer.put(oldBuffer);
      inputBuffer.position(oldByteCount);
      getBufferPool().releaseReceiveBuffer(oldBuffer);
    }
  } else {
    if (inputBuffer.position() != 0) {
      inputBuffer.compact();
    } else {
      inputBuffer.position(inputBuffer.limit());
      inputBuffer.limit(inputBuffer.capacity());
    }
  }
}
{code}

Notice how this method leaves {{inputBuffer}} ready to be _written_ to. It's not sufficient to simply call {{flip()}} on the inputBuffer before returning it (I tried that and it didn't fix the bug). More work is needed.

h2. Resolution

When this ticket is complete the bug will be fixed and {{P2PMessagingConcurrencyDUnitTest}} will be enhanced to test these combinations:

[security, sender/locator socket-buffer-size, receiver socket-buffer-size]
[TLS, (default), (default)] this is what the test currently does
[no TLS, 64 * 1024, 32 * 1024] *new: this illustrates this bug*
[no TLS, (default), (default)] *new*

We might want to mix in conserve-sockets true/false too while we're at it (the test currently holds it at true).

The attached patch file GEODE-9825-demo.patch shows a quick hack to {{P2PMessagingConcurrencyDUnitTest}} to illustrate the bug.
[jira] [Updated] (GEODE-9825) Disparate socket-buffer-size Results in "IOException: Unknown header byte" and Hangs
[ https://issues.apache.org/jira/browse/GEODE-9825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Burcham updated GEODE-9825: Attachment: GEODE-9825-demo.patch

> Disparate socket-buffer-size Results in "IOException: Unknown header byte"
> and Hangs
>
> Key: GEODE-9825
> URL: https://issues.apache.org/jira/browse/GEODE-9825
> Project: Geode
> Issue Type: Bug
> Components: messaging
> Affects Versions: 1.12.4, 1.15.0
> Reporter: Bill Burcham
> Priority: Major
> Attachments: GEODE-9825-demo.patch
>
> GEODE-9141 introduced a bug that causes {{IOException: "Unknown header byte..."}} and hangs if members are configured with different {{socket-buffer-size}} settings.
>
> h2. Reproduction
> To reproduce this bug, modify {{P2PMessagingConcurrencyDUnitTest}} so that the sender, locator, and receiver use different configuration parameters. Set {{socket-buffer-size}} to 212992 for the sender and 32 * 1024 for the receiver. Also, skip the call to {{securityProperties()}}: we want to induce the "Unknown header byte" exception, not exceptions from the TLS framework.
>
> h2. Analysis
> In {{Connection.processInputBuffer()}}, when that method has read all the messages it can from the current input buffer, it considers whether the buffer needs expansion. If it does, then:
> {code:java}
> inputBuffer = inputSharing.expandReadBufferIfNeeded(allocSize); {code}
> is executed and the method returns. The caller then expects to be able to _write_ bytes into {{inputBuffer}}.
> The problem, it seems, is that {{ByteBufferSharingInternalImpl.expandReadBufferIfNeeded()}} does not leave the {{ByteBuffer}} in the proper state: it leaves the buffer ready to be _read_, not _written_.
> The line of code referenced above used to be this method in {{Connection}} (which has since been removed):
> {code:java}
> private void compactOrResizeBuffer(int messageLength) {
>   final int oldBufferSize = inputBuffer.capacity();
>   int allocSize = messageLength + MSG_HEADER_BYTES;
>   if (oldBufferSize < allocSize) {
>     // need a bigger buffer
>     logger.info("Allocating larger network read buffer, new size is {} old size was {}.",
>         allocSize, oldBufferSize);
>     ByteBuffer oldBuffer = inputBuffer;
>     inputBuffer = getBufferPool().acquireDirectReceiveBuffer(allocSize);
>     if (oldBuffer != null) {
>       int oldByteCount = oldBuffer.remaining();
>       inputBuffer.put(oldBuffer);
>       inputBuffer.position(oldByteCount);
>       getBufferPool().releaseReceiveBuffer(oldBuffer);
>     }
>   } else {
>     if (inputBuffer.position() != 0) {
>       inputBuffer.compact();
>     } else {
>       inputBuffer.position(inputBuffer.limit());
>       inputBuffer.limit(inputBuffer.capacity());
>     }
>   }
> } {code}
> Notice how this method leaves {{inputBuffer}} ready to be _written_ to.
> It's not sufficient to simply call {{flip()}} on the inputBuffer before returning it (I tried that and it didn't fix the bug). More work is needed.
>
> h2. Resolution
> When this ticket is complete the bug will be fixed and {{P2PMessagingConcurrencyDUnitTest}} will be enhanced to test these combinations:
> [security, sender/locator socket-buffer-size, receiver socket-buffer-size]
> [TLS, (default), (default)] this is what the test currently does
> [no TLS, 64 * 1024, 32 * 1024] *new: this illustrates this bug*
> [no TLS, (default), (default)] *new*
> We might want to mix in conserve-sockets true/false too while we're at it (the test currently holds it at true).

-- This message was sent by Atlassian Jira (v8.20.1#820001)
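[Editor's note] The distinction the analysis above turns on, a {{ByteBuffer}} left read-ready versus write-ready, can be seen in a standalone sketch. The buffer sizes and contents here are arbitrary illustration, not Geode code:

```java
import java.nio.ByteBuffer;

public class BufferStates {
    // Consume one of four buffered bytes, then compact(): the buffer ends up
    // write-ready with the three unread bytes moved to the front.
    static ByteBuffer afterCompact() {
        ByteBuffer buf = ByteBuffer.allocate(16);
        buf.put(new byte[] {1, 2, 3, 4}); // write mode: position=4, limit=16
        buf.flip();                       // read mode: position=0, limit=4
        buf.get();                        // consume one byte; 3 remain unread
        buf.compact();                    // write mode: position=3, limit=16
        return buf;
    }

    public static void main(String[] args) {
        ByteBuffer buf = afterCompact();
        System.out.println(buf.position() + " " + buf.limit()); // prints "3 16"
        buf.flip();                       // read mode again: position=0, limit=3
        System.out.println(buf.position() + " " + buf.limit()); // prints "0 3"
    }
}
```

A caller that receives the read-mode buffer (position 0, small limit) but expects write mode (position after the unread bytes, limit at capacity) will misbehave exactly as the ticket describes.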
[jira] [Updated] (GEODE-9825) Disparate socket-buffer-size Results in "IOException: Unknown header byte" and Hangs
[ https://issues.apache.org/jira/browse/GEODE-9825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Burcham updated GEODE-9825: Summary: Disparate socket-buffer-size Results in "IOException: Unknown header byte" and Hangs (was: Disparate socket-buffer-size Results in "Unknown header byte" Exceptions and Hangs)

> Disparate socket-buffer-size Results in "IOException: Unknown header byte"
> and Hangs
>
> Key: GEODE-9825
> URL: https://issues.apache.org/jira/browse/GEODE-9825
> Project: Geode
> Issue Type: Bug
> Components: messaging
> Affects Versions: 1.12.4, 1.15.0
> Reporter: Bill Burcham
> Priority: Major
>
> GEODE-9141 introduced a bug that causes hangs in {{Connection.processInputBuffer()}}. When that method has read all the messages it can from the current input buffer, it considers whether the buffer needs expansion. If it does, then:
> {code:java}
> inputBuffer = inputSharing.expandReadBufferIfNeeded(allocSize); {code}
> is executed and the method returns. The caller then expects to be able to _write_ bytes into {{inputBuffer}}.
> The problem, it seems, is that {{ByteBufferSharingInternalImpl.expandReadBufferIfNeeded()}} does not leave the {{ByteBuffer}} in the proper state: it leaves the buffer ready to be _read_, not _written_.
> The line of code referenced above used to be this method in {{Connection}} (which has since been removed):
> {code:java}
> private void compactOrResizeBuffer(int messageLength) {
>   final int oldBufferSize = inputBuffer.capacity();
>   int allocSize = messageLength + MSG_HEADER_BYTES;
>   if (oldBufferSize < allocSize) {
>     // need a bigger buffer
>     logger.info("Allocating larger network read buffer, new size is {} old size was {}.",
>         allocSize, oldBufferSize);
>     ByteBuffer oldBuffer = inputBuffer;
>     inputBuffer = getBufferPool().acquireDirectReceiveBuffer(allocSize);
>     if (oldBuffer != null) {
>       int oldByteCount = oldBuffer.remaining();
>       inputBuffer.put(oldBuffer);
>       inputBuffer.position(oldByteCount);
>       getBufferPool().releaseReceiveBuffer(oldBuffer);
>     }
>   } else {
>     if (inputBuffer.position() != 0) {
>       inputBuffer.compact();
>     } else {
>       inputBuffer.position(inputBuffer.limit());
>       inputBuffer.limit(inputBuffer.capacity());
>     }
>   }
> } {code}
> Notice how this method leaves {{inputBuffer}} ready to be _written_ to.
> It's not sufficient to simply call {{flip()}} on the inputBuffer before returning it (I tried that and it didn't fix the bug). More work is needed.
> To reproduce this bug, modify {{P2PMessagingConcurrencyDUnitTest}} so that the sender, locator, and receiver use different configuration parameters. Set {{socket-buffer-size}} to 212992 for the sender and 32 * 1024 for the receiver. Also, skip the call to {{securityProperties()}}: we want to induce the "Unknown header byte" exception, not exceptions from the TLS framework.
> When this ticket is complete {{P2PMessagingConcurrencyDUnitTest}} will be enhanced to test these combinations:
> [security, sender/locator socket-buffer-size, receiver socket-buffer-size]
> [TLS, (default), (default)] this is what the test currently does
> [no TLS, 212992, 32 * 1024] *new: this illustrates this bug*
> [no TLS, (default), (default)] *new*
> We might want to mix in conserve-sockets true/false too while we're at it (the test currently holds it at true).

-- This message was sent by Atlassian Jira (v8.20.1#820001)
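[Editor's note] As a hedged sketch of the direction a fix might take, the grow branch of {{compactOrResizeBuffer()}} above can be mirrored so that the returned buffer is left write-ready. The method name {{expandWriteReady}} is hypothetical, and plain {{ByteBuffer.allocate}} stands in for Geode's buffer pool:

```java
import java.nio.ByteBuffer;

public class ExpandSketch {
    // Hypothetical sketch: grow a read-mode buffer while leaving the result
    // in WRITE mode, mirroring the grow branch of compactOrResizeBuffer().
    // Assumes oldBuffer is positioned at its unread bytes (read mode).
    static ByteBuffer expandWriteReady(ByteBuffer oldBuffer, int allocSize) {
        ByteBuffer newBuffer = ByteBuffer.allocate(allocSize); // pool omitted in this sketch
        int unread = oldBuffer.remaining();
        newBuffer.put(oldBuffer);           // copy unread bytes to the front
        newBuffer.position(unread);         // the next channel read appends after them
        newBuffer.limit(newBuffer.capacity());
        return newBuffer;                   // write mode: position=unread, limit=capacity
    }

    public static void main(String[] args) {
        ByteBuffer old = ByteBuffer.allocate(8);
        old.put(new byte[] {9, 8});
        old.flip();                         // read mode, two unread bytes
        ByteBuffer grown = expandWriteReady(old, 32);
        System.out.println(grown.position() + " " + grown.limit()); // prints "2 32"
    }
}
```

Per the ticket, a bare {{flip()}} on the returned buffer was tried and did not fix the bug, so this sketch only illustrates the required end state, not the complete repair.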
[jira] [Created] (GEODE-9825) Disparate socket-buffer-size Results in "Unknown header byte" Exceptions and Hangs
Bill Burcham created GEODE-9825: --- Summary: Disparate socket-buffer-size Results in "Unknown header byte" Exceptions and Hangs Key: GEODE-9825 URL: https://issues.apache.org/jira/browse/GEODE-9825 Project: Geode Issue Type: Bug Components: messaging Affects Versions: 1.12.4, 1.15.0 Reporter: Bill Burcham GEODE-9141 introduced a bug in {{{}Connection.processInputBuffer(){}}}. When that method has read all the messages it can from the current input buffer, it then considers whether the buffer needs expansion. If it does then: {code:java} inputBuffer = inputSharing.expandReadBufferIfNeeded(allocSize); {code} Is executed and the method returns. The caller then expects to be able to _write_ bytes into {{{}inputBuffer{}}}. The problem, it seems, is that {{ByteBufferSharingInternalImpl.expandReadBufferIfNeeded()}} does not leave the the {{ByteBuffer}} in the proper state. It leaves the buffer ready to be _read_ not written. The line of code referenced above used to be this method in {{Connection}} (which has since been removed): {code:java} private void compactOrResizeBuffer(int messageLength) { final int oldBufferSize = inputBuffer.capacity(); int allocSize = messageLength + MSG_HEADER_BYTES; if (oldBufferSize < allocSize) { // need a bigger buffer logger.info("Allocating larger network read buffer, new size is {} old size was {}.", allocSize, oldBufferSize); ByteBuffer oldBuffer = inputBuffer; inputBuffer = getBufferPool().acquireDirectReceiveBuffer(allocSize); if (oldBuffer != null) { int oldByteCount = oldBuffer.remaining(); inputBuffer.put(oldBuffer); inputBuffer.position(oldByteCount); getBufferPool().releaseReceiveBuffer(oldBuffer); } } else { if (inputBuffer.position() != 0) { inputBuffer.compact(); } else { inputBuffer.position(inputBuffer.limit()); inputBuffer.limit(inputBuffer.capacity()); } } } {code} Notice how this method leaves {{inputBuffer}} ready to be _written_ to. 
It's not sufficient to simply call {{flip()}} on the inputBuffer before returning it (I tried it and it didn't fix the bug). More work is needed. To reproduce this bug, modify {{P2PMessagingConcurrencyDUnitTest}} so that sender, locator, and receiver use different configuration parameters. Set {{socket-buffer-size}} to 212992 for the sender and 32 * 1024 for the receiver. Also, just skip the call to {{securityProperties()}}: we want to induce the "Unknown header byte" exception, not exceptions from the TLS framework. When this ticket is complete, {{P2PMessagingConcurrencyDUnitTest}} will be enhanced to test these combinations of [security, sender/locator socket-buffer-size, receiver socket-buffer-size]:
* [TLS, (default), (default)] (this is what the test currently does)
* [no TLS, 212992, 32 * 1024] *new: this illustrates this bug*
* [no TLS, (default), (default)] *new*
We might want to mix in conserve-sockets true/false in there too while we're at it (the test currently holds it at true). -- This message was sent by Atlassian Jira (v8.20.1#820001)
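A minimal sketch of the mismatched configuration the reproduction above calls for. {{socket-buffer-size}} is the real Geode property key; the class and constant names are ours, and wiring these Properties into {{P2PMessagingConcurrencyDUnitTest}}'s member setup is left out.

```java
import java.util.Properties;

public class MismatchedBufferConfig {
    // Values from the ticket: large sender/locator buffer, small receiver buffer.
    static final int SENDER_BUFFER = 212992;
    static final int RECEIVER_BUFFER = 32 * 1024;

    // Build member properties with the given socket-buffer-size.
    // Per the ticket, TLS (securityProperties()) is deliberately omitted so the
    // mismatch surfaces as "Unknown header byte" rather than a TLS error.
    static Properties memberProperties(int socketBufferSize) {
        Properties props = new Properties();
        props.setProperty("socket-buffer-size", String.valueOf(socketBufferSize));
        return props;
    }

    public static void main(String[] args) {
        System.out.println(memberProperties(SENDER_BUFFER).getProperty("socket-buffer-size"));
        System.out.println(memberProperties(RECEIVER_BUFFER).getProperty("socket-buffer-size"));
    }
}
```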
[jira] [Updated] (GEODE-9822) Split-brain Possible During Network Partition in Two-Locator Cluster
[ https://issues.apache.org/jira/browse/GEODE-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Burcham updated GEODE-9822: Description: In a two-locator cluster with default member weights and default setting (true) of enable-network-partition-detection, if a long-lived network partition separates the two members, a split-brain will arise: there will be two coordinators at the same time. The reason for this can be found in the GMSJoinLeave.isNetworkPartition() method. That method's name is misleading. A name like isMajorityLost() would probably be more apt. It needs to return true iff the weight of "crashed" members (in the prospective view) is greater-than-or-equal-to half (50%) of the total weight (of all members in the current view). What the method actually does is return true iff the weight of "crashed" members is greater-than 51% of the total weight. As a result, if we have two members of equal weight, and the coordinator sees that the non-coordinator is "crashed", the coordinator will keep running. If a network partition is happening, and the non-coordinator is still running, then it will become a coordinator and start producing views. Now we'll have two coordinators producing views concurrently. For this discussion "crashed" members are members for which the coordinator has received a RemoveMemberRequest message. These are members that the failure detector has deemed failed. Keep in mind the failure detector is imperfect (it's not always right), and that's kind of the whole point of this ticket: we've lost contact with the non-coordinator member, but that doesn't mean it can't still be running (on the other side of a partition). was: In a two-locator cluster with default member weights and default setting (true) of enable-network-partition-detection, if a long-lived network partition separates the two members, a split-brain will arise: there will be two coordinators at the same time. 
The reason for this can be found in the GMSJoinLeave.isNetworkPartition() method. That method's name is misleading. A name like isMajorityLost() would probably be more apt. It needs to return true iff the weight of "crashed" members (in the prospective view) is greater-than-or-equal-to 50% of the total weight (of all members in the current view). What the method actually does is return true iff the weight of "crashed" members is greater-than 51% of the total weight. As a result, if we have two members of equal weight, and the coordinator sees that the non-coordinator is "crashed", the coordinator will keep running. If a network partition is happening, and the non-coordinator is still running, then it will become a coordinator and start producing views. Now we'll have two coordinators producing views concurrently. For this discussion "crashed" members are members for which the coordinator has received a RemoveMemberRequest message. These are members that the failure detector has deemed failed. Keep in mind the failure detector is imperfect (it's not always right), and that's kind of the whole point of this ticket: we've lost contact with the non-coordinator member, but that doesn't mean it can't still be running (on the other side of a partition). > Split-brain Possible During Network Partition in Two-Locator Cluster > > > Key: GEODE-9822 > URL: https://issues.apache.org/jira/browse/GEODE-9822 > Project: Geode > Issue Type: Bug > Components: membership >Reporter: Bill Burcham >Priority: Major > Labels: pull-request-available > > In a two-locator cluster with default member weights and default setting > (true) of enable-network-partition-detection, if a long-lived network > partition separates the two members, a split-brain will arise: there will be > two coordinators at the same time. > The reason for this can be found in the GMSJoinLeave.isNetworkPartition() > method. That method's name is misleading. A name like isMajorityLost() would > probably be more apt. 
It needs to return true iff the weight of "crashed" > members (in the prospective view) is greater-than-or-equal-to half (50%) of > the total weight (of all members in the current view). > What the method actually does is return true iff the weight of "crashed" > members is greater-than 51% of the total weight. As a result, if we have two > members of equal weight, and the coordinator sees that the non-coordinator is > "crashed", the coordinator will keep running. If a network partition is > happening, and the non-coordinator is still running, then it will become a > coordinator and start producing views. Now we'll have two coordinators > producing views concurrently. > For this discussion "crashed" members are members for which the coordinator > has received a Re
[jira] [Updated] (GEODE-9822) Split-brain Possible During Network Partition in Two-Locator Cluster
[ https://issues.apache.org/jira/browse/GEODE-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Burcham updated GEODE-9822: Description: In a two-locator cluster with default member weights and default setting (true) of enable-network-partition-detection, if a long-lived network partition separates the two members, a split-brain will arise: there will be two coordinators at the same time. The reason for this can be found in the GMSJoinLeave.isNetworkPartition() method. That method's name is misleading. A name like isMajorityLost() would probably be more apt. It needs to return true iff the weight of "crashed" members (in the prospective view) is greater-than-or-equal-to 50% of the total weight (of all members in the current view). What the method actually does is return true iff the weight of "crashed" members is greater-than 51% of the total weight. As a result, if we have two members of equal weight, and the coordinator sees that the non-coordinator is "crashed", the coordinator will keep running. If a network partition is happening, and the non-coordinator is still running, then it will become a coordinator and start producing views. Now we'll have two coordinators producing views concurrently. For this discussion "crashed" members are members for which the coordinator has received a RemoveMemberRequest message. These are members that the failure detector has deemed failed. Keep in mind the failure detector is imperfect (it's not always right), and that's kind of the whole point of this ticket: we've lost contact with the non-coordinator member, but that doesn't mean it can't still be running (on the other side of a partition). was: In a two-locator cluster with default member weights and default setting (true) of enable-network-partition-detection, if a long-lived network partition separates the two members, a split-brain will arise: there will be two coordinators at the same time. 
The reason for this can be found in the GMSJoinLeave.isNetworkPartition() method. That method's name is misleading. A name like majorityLost() would probably be more apt. It needs to return true iff the weight of "crashed" members (in the prospective view) is greater-than-or-equal-to 50% of the total weight (of all members in the current view). What the method actually does is return true iff the weight of "crashed" members is greater-than 51% of the total weight. As a result, if we have two members of equal weight, and the coordinator sees that the non-coordinator is "crashed", the coordinator will keep running. If a network partition is happening, and the non-coordinator is still running, then it will become a coordinator and start producing views. Now we'll have two coordinators producing views concurrently. For this discussion "crashed" members are members for which the coordinator has received a RemoveMemberRequest message. These are members that the failure detector has deemed failed. Keep in mind the failure detector is imperfect (it's not always right), and that's kind of the whole point of this ticket: we've lost contact with the non-coordinator member, but that doesn't mean it can't still be running (on the other side of the partition). > Split-brain Possible During Network Partition in Two-Locator Cluster > > > Key: GEODE-9822 > URL: https://issues.apache.org/jira/browse/GEODE-9822 > Project: Geode > Issue Type: Bug > Components: membership >Reporter: Bill Burcham >Priority: Major > Labels: pull-request-available > > In a two-locator cluster with default member weights and default setting > (true) of enable-network-partition-detection, if a long-lived network > partition separates the two members, a split-brain will arise: there will be > two coordinators at the same time. > The reason for this can be found in the GMSJoinLeave.isNetworkPartition() > method. That method's name is misleading. A name like isMajorityLost() would > probably be more apt. 
It needs to return true iff the weight of "crashed" > members (in the prospective view) is greater-than-or-equal-to 50% of the > total weight (of all members in the current view). > What the method actually does is return true iff the weight of "crashed" > members is greater-than 51% of the total weight. As a result, if we have two > members of equal weight, and the coordinator sees that the non-coordinator is > "crashed", the coordinator will keep running. If a network partition is > happening, and the non-coordinator is still running, then it will become a > coordinator and start producing views. Now we'll have two coordinators > producing views concurrently. > For this discussion "crashed" members are members for which the coordinator > has received a RemoveMemberRequ
[jira] [Updated] (GEODE-9822) Split-brain Possible During Network Partition in Two-Locator Cluster
[ https://issues.apache.org/jira/browse/GEODE-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Burcham updated GEODE-9822: Description: In a two-locator cluster with default member weights and default setting (true) of enable-network-partition-detection, if a long-lived network partition separates the two members, a split-brain will arise: there will be two coordinators at the same time. The reason for this can be found in the GMSJoinLeave.isNetworkPartition() method. That method's name is misleading. A name like majorityLost() would probably be more apt. It needs to return true iff the weight of "crashed" members (in the prospective view) is greater-than-or-equal-to 50% of the total weight (of all members in the current view). What the method actually does is return true iff the weight of "crashed" members is greater-than 51% of the total weight. As a result, if we have two members of equal weight, and the coordinator sees that the non-coordinator is "crashed", the coordinator will keep running. If a network partition is happening, and the non-coordinator is still running, then it will become a coordinator and start producing views. Now we'll have two coordinators producing views concurrently. For this discussion "crashed" members are members for which the coordinator has received a RemoveMemberRequest message. These are members that the failure detector has deemed failed. Keep in mind the failure detector is imperfect (it's not always right), and that's kind of the whole point of this ticket: we've lost contact with the non-coordinator member, but that doesn't mean it can't still be running (on the other side of the partition). was: In a two-locator cluster with default member weights and default setting (true) of enable-network-partition-detection, if a long-lived network partition separates the two members, a split-brain will arise: there will be two coordinators at the same time. 
The reason for this can be found in the GMSJoinLeave.isNetworkPartition() method. That method's name is misleading. A name like majorityLost() would probably be more apt. It needs to return true iff the weight of "crashed" members (in the prospective view) is greater-than-or-equal-to 50% of the total weight (of all members in the current view). What the method actually does is return true iff the weight of "crashed" members is greater-than 51% of the total weight. As a result, if we have two members of equal weight, and the coordinator sees that the non-coordinator is "crashed", the coordinator will keep running. If a network partition is happening, and the non-coordinator is still running, then it will become a coordinator and start producing views. Now we'll have two coordinators producing views concurrently. > Split-brain Possible During Network Partition in Two-Locator Cluster > > > Key: GEODE-9822 > URL: https://issues.apache.org/jira/browse/GEODE-9822 > Project: Geode > Issue Type: Bug > Components: membership >Reporter: Bill Burcham >Priority: Major > > In a two-locator cluster with default member weights and default setting > (true) of enable-network-partition-detection, if a long-lived network > partition separates the two members, a split-brain will arise: there will be > two coordinators at the same time. > The reason for this can be found in the GMSJoinLeave.isNetworkPartition() > method. That method's name is misleading. A name like majorityLost() would > probably be more apt. It needs to return true iff the weight of "crashed" > members (in the prospective view) is greater-than-or-equal-to 50% of the > total weight (of all members in the current view). > What the method actually does is return true iff the weight of "crashed" > members is greater-than 51% of the total weight. As a result, if we have two > members of equal weight, and the coordinator sees that the non-coordinator is > "crashed", the coordinator will keep running. 
If a network partition is > happening, and the non-coordinator is still running, then it will become a > coordinator and start producing views. Now we'll have two coordinators > producing views concurrently. > For this discussion "crashed" members are members for which the coordinator > has received a RemoveMemberRequest message. These are members that the > failure detector has deemed failed. Keep in mind the failure detector is > imperfect (it's not always right), and that's kind of the whole point of this > ticket: we've lost contact with the non-coordinator member, but that doesn't > mean it can't still be running (on the other side of the partition).
[jira] [Created] (GEODE-9822) Split-brain Possible During Network Partition in Two-Locator Cluster
Bill Burcham created GEODE-9822: --- Summary: Split-brain Possible During Network Partition in Two-Locator Cluster Key: GEODE-9822 URL: https://issues.apache.org/jira/browse/GEODE-9822 Project: Geode Issue Type: Bug Components: membership Reporter: Bill Burcham In a two-locator cluster with default member weights and default setting (true) of enable-network-partition-detection, if a long-lived network partition separates the two members, a split-brain will arise: there will be two coordinators at the same time. The reason for this can be found in the GMSJoinLeave.isNetworkPartition() method. That method's name is misleading. A name like majorityLost() would probably be more apt. It needs to return true iff the weight of "crashed" members (in the prospective view) is greater-than-or-equal-to 50% of the total weight (of all members in the current view). What the method actually does is return true iff the weight of "crashed" members is greater-than 51% of the total weight. As a result, if we have two members of equal weight, and the coordinator sees that the non-coordinator is "crashed", the coordinator will keep running. If a network partition is happening, and the non-coordinator is still running, then it will become a coordinator and start producing views. Now we'll have two coordinators producing views concurrently.
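The weight arithmetic above can be sketched directly (an editor's illustration, not Geode's actual code). With two equal-weight members, the lost weight is exactly 50% of the total: the ">= 50%" rule the ticket calls for detects the partition, while the "> 51%" behavior it describes does not, so both sides keep running.

```java
public class MajorityCheck {
    // Rule the ticket says is needed: majority is lost when crashed weight
    // is at least half the total weight. Integer arithmetic avoids rounding.
    static boolean isMajorityLost(int lostWeight, int totalWeight) {
        return lostWeight * 2 >= totalWeight;
    }

    // Behavior the ticket attributes to isNetworkPartition(): strictly more
    // than 51% of the total weight must be lost before a partition is declared.
    static boolean isNetworkPartitionBuggy(int lostWeight, int totalWeight) {
        return lostWeight * 100 > totalWeight * 51;
    }

    public static void main(String[] args) {
        int memberWeight = 10;            // two locators of equal weight
        int total = 2 * memberWeight;
        System.out.println(isMajorityLost(memberWeight, total));          // true
        System.out.println(isNetworkPartitionBuggy(memberWeight, total)); // false: split-brain
    }
}
```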
[jira] [Updated] (GEODE-9738) CI failure: RollingUpgradeRollServersOnReplicatedRegion_dataserializable failed with DistributedSystemDisconnectedException
[ https://issues.apache.org/jira/browse/GEODE-9738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Burcham updated GEODE-9738: Attachment: GEODE-9738-short.log.all > CI failure: RollingUpgradeRollServersOnReplicatedRegion_dataserializable > failed with DistributedSystemDisconnectedException > --- > > Key: GEODE-9738 > URL: https://issues.apache.org/jira/browse/GEODE-9738 > Project: Geode > Issue Type: Bug > Components: membership, messaging >Affects Versions: 1.15.0 >Reporter: Kamilla Aslami >Assignee: Bill Burcham >Priority: Major > Labels: needsTriage > Attachments: GEODE-9738-short.log.all, controller.log, locator.log, > vm0.log, vm1.log, vm2.log, vm3.log > > > {noformat} > RollingUpgradeRollServersOnReplicatedRegion_dataserializable > > testRollServersOnReplicatedRegion_dataserializable[from_v1.13.4] FAILED > java.lang.AssertionError: Suspicious strings were written to the log > during this run. > Fix the strings or use IgnoredException.addIgnoredException to ignore. 
> --- > Found suspect string in 'dunit_suspect-vm2.log' at line 685[fatal > 2021/10/14 00:24:14.739 UTC tid=115] Uncaught exception > in thread Thread[FederatingManager6,5,RMI Runtime] > org.apache.geode.management.ManagementException: > org.apache.geode.distributed.DistributedSystemDisconnectedException: > Distribution manager on > heavy-lifter-10ae5f9d-2528-5e02-b707-d968eb54d50a(vm2:580278:locator):54751 > started at Thu Oct 14 00:23:52 UTC 2021: Message distribution has terminated > at > org.apache.geode.management.internal.FederatingManager.addMemberArtifacts(FederatingManager.java:486) > at > org.apache.geode.management.internal.FederatingManager$AddMemberTask.call(FederatingManager.java:596) > at > org.apache.geode.management.internal.FederatingManager.lambda$addMember$1(FederatingManager.java:199) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.base/java.lang.Thread.run(Thread.java:829) > Caused by: > org.apache.geode.distributed.DistributedSystemDisconnectedException: > Distribution manager on > heavy-lifter-10ae5f9d-2528-5e02-b707-d968eb54d50a(vm2:580278:locator):54751 > started at Thu Oct 14 00:23:52 UTC 2021: Message distribution has terminated > at > org.apache.geode.distributed.internal.ClusterDistributionManager$Stopper.generateCancelledException(ClusterDistributionManager.java:2885) > at > org.apache.geode.distributed.internal.InternalDistributedSystem$Stopper.generateCancelledException(InternalDistributedSystem.java:1177) > at > org.apache.geode.internal.cache.GemFireCacheImpl$Stopper.generateCancelledException(GemFireCacheImpl.java:5212) > at > org.apache.geode.CancelCriterion.checkCancelInProgress(CancelCriterion.java:83) > at > org.apache.geode.internal.cache.CreateRegionProcessor.initializeRegion(CreateRegionProcessor.java:121) > at > 
org.apache.geode.internal.cache.DistributedRegion.getInitialImageAndRecovery(DistributedRegion.java:1164) > at > org.apache.geode.internal.cache.DistributedRegion.initialize(DistributedRegion.java:1095) > at > org.apache.geode.internal.cache.GemFireCacheImpl.createVMRegion(GemFireCacheImpl.java:3108) > at > org.apache.geode.internal.cache.InternalRegionFactory.create(InternalRegionFactory.java:78) > at > org.apache.geode.management.internal.FederatingManager.addMemberArtifacts(FederatingManager.java:429) > ... 5 more > at org.junit.Assert.fail(Assert.java:89) > at > org.apache.geode.test.dunit.internal.DUnitLauncher.closeAndCheckForSuspects(DUnitLauncher.java:420) > at > org.apache.geode.test.dunit.internal.DUnitLauncher.closeAndCheckForSuspects(DUnitLauncher.java:436) > at > org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.cleanupAllVms(JUnit4DistributedTestCase.java:551) > at > org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.doTearDownDistributedTestCase(JUnit4DistributedTestCase.java:498) > at > org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.tearDownDistributedTestCase(JUnit4DistributedTestCase.java:481) > at jdk.internal.reflect.GeneratedMethodAccessor11.invoke(Unknown > Source) > at > jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:566) > a
[jira] [Comment Edited] (GEODE-9738) CI failure: RollingUpgradeRollServersOnReplicatedRegion_dataserializable failed with DistributedSystemDisconnectedException
[ https://issues.apache.org/jira/browse/GEODE-9738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17441878#comment-17441878 ] Bill Burcham edited comment on GEODE-9738 at 11/16/21, 10:55 PM: - The logs in the failing test run (previous comment) are all interleaved in the "standard output" section of the failing test. I have attached the individual logs to the ticket, so we can analyze them. The attached logs (controller.log, locator.log, vm\{0-3}.log) each contain content for multiple tests. I've attached the stdout for just the test of interest as GEODE-9738-short.log.all. That needs to be split so we can see a more focused view of the various logs. was (Author: bburcham): The logs in the failing test run (previous comment) are all interleaved in the "standard output" section of the failing test. I have attached the individual logs to the ticket, so we can analyze them. > CI failure: RollingUpgradeRollServersOnReplicatedRegion_dataserializable > failed with DistributedSystemDisconnectedException > --- > > Key: GEODE-9738 > URL: https://issues.apache.org/jira/browse/GEODE-9738 > Project: Geode > Issue Type: Bug > Components: membership, messaging >Affects Versions: 1.15.0 >Reporter: Kamilla Aslami >Assignee: Bill Burcham >Priority: Major > Labels: needsTriage > Attachments: GEODE-9738-short.log.all, controller.log, locator.log, > vm0.log, vm1.log, vm2.log, vm3.log > > > {noformat} > RollingUpgradeRollServersOnReplicatedRegion_dataserializable > > testRollServersOnReplicatedRegion_dataserializable[from_v1.13.4] FAILED > java.lang.AssertionError: Suspicious strings were written to the log > during this run. > Fix the strings or use IgnoredException.addIgnoredException to ignore. 
> --- > Found suspect string in 'dunit_suspect-vm2.log' at line 685[fatal > 2021/10/14 00:24:14.739 UTC tid=115] Uncaught exception > in thread Thread[FederatingManager6,5,RMI Runtime] > org.apache.geode.management.ManagementException: > org.apache.geode.distributed.DistributedSystemDisconnectedException: > Distribution manager on > heavy-lifter-10ae5f9d-2528-5e02-b707-d968eb54d50a(vm2:580278:locator):54751 > started at Thu Oct 14 00:23:52 UTC 2021: Message distribution has terminated > at > org.apache.geode.management.internal.FederatingManager.addMemberArtifacts(FederatingManager.java:486) > at > org.apache.geode.management.internal.FederatingManager$AddMemberTask.call(FederatingManager.java:596) > at > org.apache.geode.management.internal.FederatingManager.lambda$addMember$1(FederatingManager.java:199) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.base/java.lang.Thread.run(Thread.java:829) > Caused by: > org.apache.geode.distributed.DistributedSystemDisconnectedException: > Distribution manager on > heavy-lifter-10ae5f9d-2528-5e02-b707-d968eb54d50a(vm2:580278:locator):54751 > started at Thu Oct 14 00:23:52 UTC 2021: Message distribution has terminated > at > org.apache.geode.distributed.internal.ClusterDistributionManager$Stopper.generateCancelledException(ClusterDistributionManager.java:2885) > at > org.apache.geode.distributed.internal.InternalDistributedSystem$Stopper.generateCancelledException(InternalDistributedSystem.java:1177) > at > org.apache.geode.internal.cache.GemFireCacheImpl$Stopper.generateCancelledException(GemFireCacheImpl.java:5212) > at > org.apache.geode.CancelCriterion.checkCancelInProgress(CancelCriterion.java:83) > at > org.apache.geode.internal.cache.CreateRegionProcessor.initializeRegion(CreateRegionProcessor.java:121) > at > 
org.apache.geode.internal.cache.DistributedRegion.getInitialImageAndRecovery(DistributedRegion.java:1164) > at > org.apache.geode.internal.cache.DistributedRegion.initialize(DistributedRegion.java:1095) > at > org.apache.geode.internal.cache.GemFireCacheImpl.createVMRegion(GemFireCacheImpl.java:3108) > at > org.apache.geode.internal.cache.InternalRegionFactory.create(InternalRegionFactory.java:78) > at > org.apache.geode.management.internal.FederatingManager.addMemberArtifacts(FederatingManager.java:429) > ... 5 more > at org.junit.Assert.fail(Assert.java:89) > at > org.apache.geode.test.dunit.internal.DUnitLauncher.closeAndCheckForSuspects(DUnitLauncher.java:420) > at > org.apache.geo
[jira] [Created] (GEODE-9808) Client ops fail with NoLocatorsAvailableException when all servers leave the DS
Bill Burcham created GEODE-9808: --- Summary: Client ops fail with NoLocatorsAvailableException when all servers leave the DS Key: GEODE-9808 URL: https://issues.apache.org/jira/browse/GEODE-9808 Project: Geode Issue Type: Bug Components: client/server Affects Versions: 1.15.0 Reporter: Bill Burcham When there are no cache servers (only locators) in a cluster, client operations will fail with a misleading exception:
{noformat}
org.apache.geode.cache.client.NoAvailableLocatorsException: Unable to connect to any locators in the list [gemfire-cluster-locator-0.gemfire-cluster-locator.namespace-1850250019.svc.cluster.local:10334, gemfire-cluster-locator-1.gemfire-cluster-locator.namespace-1850250019.svc.cluster.local:10334, gemfire-cluster-locator-2.gemfire-cluster-locator.namespace-1850250019.svc.cluster.local:10334]
at org.apache.geode.cache.client.internal.AutoConnectionSourceImpl.findServer(AutoConnectionSourceImpl.java:174)
at org.apache.geode.cache.client.internal.ConnectionFactoryImpl.createClientToServerConnection(ConnectionFactoryImpl.java:211)
at org.apache.geode.cache.client.internal.pooling.ConnectionManagerImpl.createPooledConnection(ConnectionManagerImpl.java:196)
at org.apache.geode.cache.client.internal.pooling.ConnectionManagerImpl.forceCreateConnection(ConnectionManagerImpl.java:227)
at org.apache.geode.cache.client.internal.pooling.ConnectionManagerImpl.exchangeConnection(ConnectionManagerImpl.java:365)
at org.apache.geode.cache.client.internal.OpExecutorImpl.execute(OpExecutorImpl.java:161)
at org.apache.geode.cache.client.internal.OpExecutorImpl.execute(OpExecutorImpl.java:120)
at org.apache.geode.cache.client.internal.PoolImpl.execute(PoolImpl.java:805)
at org.apache.geode.cache.client.internal.PutOp.execute(PutOp.java:91)
{noformat}
Even though the client is able to connect to a locator, we encounter a NoAvailableLocatorsException with the message "Unable to connect to any locators in the list".
Investigating the product code we see:
# If there are no cache servers in the cluster, ServerLocator.pickServer() will definitely construct a ClientConnectionResponse(null), which causes that object's hasResult() to respond with false in the loop termination in AutoConnectionSourceImpl.queryLocators()
# Not only is the exception wording misleading in AutoConnectionSourceImpl.findServer(): it's also misleading in at least two other calling locations in AutoConnectionSourceImpl: findReplacementServer() and findServersForQueue()
# In each of those cases the calling method translates a null response from queryLocators() into a throw of a NoAvailableLocatorsException
# An appropriate exception, NoAvailableServersException, already exists for the case where we were able to contact a locator but the locator was not able to find any cache servers
# According to my Git spelunking, queryLocators() has been obfuscating the true cause of the failure since at least 2015
Without analyzing ServerLocator.pickServer() (LocatorLoadSnapshot.getServerForConnection()) to discern why two locators might disagree on how many cache servers are in the cluster, it seems to me that we should modify AutoConnectionSourceImpl.queryLocators() so that:
* if it gets a ServerLocationResponse with hasResult() true, it immediately returns that as it does now
* otherwise it keeps trying, and it keeps track of the last (non-null) ServerLocationResponse it has received
* it returns the last non-null ServerLocationResponse it received (otherwise it returns null)
With that in hand, we can change the three call locations in AutoConnectionSourceImpl: findServer(), findReplacementServer(), and findServersForQueue() to each throw NoAvailableLocatorsException if no locator responded, or NoAvailableServersException if a locator responded with a ClientConnectionResponse for which hasResult() returns false.
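The proposed queryLocators() control flow can be sketched as follows. This is a hedged stand-in, not Geode's actual API: {{Response}}, the {{contact}} function, and the class name are invented for illustration; the point is only the logic that distinguishes "no locator responded" (null return) from "a locator responded but had no servers" (non-null response with hasResult() false).

```java
import java.util.List;
import java.util.function.Function;

public class QueryLocatorsSketch {
    // Stand-in for ServerLocationResponse: hasResult() is false when the
    // locator answered but found no cache servers.
    interface Response { boolean hasResult(); }

    static Response queryLocators(List<String> locators,
                                  Function<String, Response> contact) {
        Response lastNonNull = null;
        for (String locator : locators) {
            Response response = contact.apply(locator); // null if unreachable
            if (response == null) {
                continue;                // locator unreachable: keep trying
            }
            if (response.hasResult()) {
                return response;         // found a server: return immediately
            }
            lastNonNull = response;      // reachable, but no servers available
        }
        return lastNonNull;              // null only if no locator answered
    }
}
```

Callers such as findServer() could then throw NoAvailableLocatorsException on a null return, and NoAvailableServersException when the returned response has hasResult() false.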
[jira] [Updated] (GEODE-9738) CI failure: RollingUpgradeRollServersOnReplicatedRegion_dataserializable failed with DistributedSystemDisconnectedException
[ https://issues.apache.org/jira/browse/GEODE-9738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Burcham updated GEODE-9738: Attachment: controller.log locator.log vm3.log vm2.log vm1.log vm0.log > CI failure: RollingUpgradeRollServersOnReplicatedRegion_dataserializable > failed with DistributedSystemDisconnectedException > --- > > Key: GEODE-9738 > URL: https://issues.apache.org/jira/browse/GEODE-9738 > Project: Geode > Issue Type: Bug > Components: membership, messaging >Affects Versions: 1.15.0 >Reporter: Kamilla Aslami >Assignee: Bill Burcham >Priority: Major > Labels: needsTriage > Attachments: controller.log, locator.log, vm0.log, vm1.log, vm2.log, > vm3.log > > > {noformat} > RollingUpgradeRollServersOnReplicatedRegion_dataserializable > > testRollServersOnReplicatedRegion_dataserializable[from_v1.13.4] FAILED > java.lang.AssertionError: Suspicious strings were written to the log > during this run. > Fix the strings or use IgnoredException.addIgnoredException to ignore. 
> --- > Found suspect string in 'dunit_suspect-vm2.log' at line 685[fatal > 2021/10/14 00:24:14.739 UTC tid=115] Uncaught exception > in thread Thread[FederatingManager6,5,RMI Runtime] > org.apache.geode.management.ManagementException: > org.apache.geode.distributed.DistributedSystemDisconnectedException: > Distribution manager on > heavy-lifter-10ae5f9d-2528-5e02-b707-d968eb54d50a(vm2:580278:locator):54751 > started at Thu Oct 14 00:23:52 UTC 2021: Message distribution has terminated > at > org.apache.geode.management.internal.FederatingManager.addMemberArtifacts(FederatingManager.java:486) > at > org.apache.geode.management.internal.FederatingManager$AddMemberTask.call(FederatingManager.java:596) > at > org.apache.geode.management.internal.FederatingManager.lambda$addMember$1(FederatingManager.java:199) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.base/java.lang.Thread.run(Thread.java:829) > Caused by: > org.apache.geode.distributed.DistributedSystemDisconnectedException: > Distribution manager on > heavy-lifter-10ae5f9d-2528-5e02-b707-d968eb54d50a(vm2:580278:locator):54751 > started at Thu Oct 14 00:23:52 UTC 2021: Message distribution has terminated > at > org.apache.geode.distributed.internal.ClusterDistributionManager$Stopper.generateCancelledException(ClusterDistributionManager.java:2885) > at > org.apache.geode.distributed.internal.InternalDistributedSystem$Stopper.generateCancelledException(InternalDistributedSystem.java:1177) > at > org.apache.geode.internal.cache.GemFireCacheImpl$Stopper.generateCancelledException(GemFireCacheImpl.java:5212) > at > org.apache.geode.CancelCriterion.checkCancelInProgress(CancelCriterion.java:83) > at > org.apache.geode.internal.cache.CreateRegionProcessor.initializeRegion(CreateRegionProcessor.java:121) > at > 
org.apache.geode.internal.cache.DistributedRegion.getInitialImageAndRecovery(DistributedRegion.java:1164) > at > org.apache.geode.internal.cache.DistributedRegion.initialize(DistributedRegion.java:1095) > at > org.apache.geode.internal.cache.GemFireCacheImpl.createVMRegion(GemFireCacheImpl.java:3108) > at > org.apache.geode.internal.cache.InternalRegionFactory.create(InternalRegionFactory.java:78) > at > org.apache.geode.management.internal.FederatingManager.addMemberArtifacts(FederatingManager.java:429) > ... 5 more > at org.junit.Assert.fail(Assert.java:89) > at > org.apache.geode.test.dunit.internal.DUnitLauncher.closeAndCheckForSuspects(DUnitLauncher.java:420) > at > org.apache.geode.test.dunit.internal.DUnitLauncher.closeAndCheckForSuspects(DUnitLauncher.java:436) > at > org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.cleanupAllVms(JUnit4DistributedTestCase.java:551) > at > org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.doTearDownDistributedTestCase(JUnit4DistributedTestCase.java:498) > at > org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.tearDownDistributedTestCase(JUnit4DistributedTestCase.java:481) > at jdk.internal.reflect.GeneratedMethodAccessor11.invoke(Unknown > Source) > at > jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[jira] [Updated] (GEODE-9738) CI failure: RollingUpgradeRollServersOnReplicatedRegion_dataserializable failed with DistributedSystemDisconnectedException
[ https://issues.apache.org/jira/browse/GEODE-9738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Burcham updated GEODE-9738: Attachment: (was: controller.log)
[jira] [Updated] (GEODE-9738) CI failure: RollingUpgradeRollServersOnReplicatedRegion_dataserializable failed with DistributedSystemDisconnectedException
[ https://issues.apache.org/jira/browse/GEODE-9738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Burcham updated GEODE-9738: Attachment: (was: vm3.log)
[jira] [Updated] (GEODE-9738) CI failure: RollingUpgradeRollServersOnReplicatedRegion_dataserializable failed with DistributedSystemDisconnectedException
[ https://issues.apache.org/jira/browse/GEODE-9738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Burcham updated GEODE-9738: Attachment: (was: vm2.log)
[jira] [Updated] (GEODE-9738) CI failure: RollingUpgradeRollServersOnReplicatedRegion_dataserializable failed with DistributedSystemDisconnectedException
[ https://issues.apache.org/jira/browse/GEODE-9738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Burcham updated GEODE-9738: Attachment: (was: locator.log)
[jira] [Updated] (GEODE-9738) CI failure: RollingUpgradeRollServersOnReplicatedRegion_dataserializable failed with DistributedSystemDisconnectedException
[ https://issues.apache.org/jira/browse/GEODE-9738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Burcham updated GEODE-9738: Attachment: (was: vm1.log)
[jira] [Updated] (GEODE-9738) CI failure: RollingUpgradeRollServersOnReplicatedRegion_dataserializable failed with DistributedSystemDisconnectedException
[ https://issues.apache.org/jira/browse/GEODE-9738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Burcham updated GEODE-9738: Attachment: (was: vm0.log)
[jira] [Comment Edited] (GEODE-9738) CI failure: RollingUpgradeRollServersOnReplicatedRegion_dataserializable failed with DistributedSystemDisconnectedException
[ https://issues.apache.org/jira/browse/GEODE-9738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17441878#comment-17441878 ] Bill Burcham edited comment on GEODE-9738 at 11/10/21, 5:59 PM: The logs in the failing test run (previous comment) are all interleaved in the "standard output" section of the failing test. I have attached the individual logs to the ticket, so we can analyze them. was (Author: bburcham): The logs in the failing test run (previous comment) are all interleaved in the "standard output" section of the failing test. I have attached the separated logs to the ticket, so we can analyze them.
[jira] [Commented] (GEODE-9738) CI failure: RollingUpgradeRollServersOnReplicatedRegion_dataserializable failed with DistributedSystemDisconnectedException
[ https://issues.apache.org/jira/browse/GEODE-9738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17441878#comment-17441878 ] Bill Burcham commented on GEODE-9738: - The logs in the failing test run (previous comment) are all interleaved in the "standard output" section of the failing test. I have attached the separated logs to the ticket, so we can analyze them. > CI failure: RollingUpgradeRollServersOnReplicatedRegion_dataserializable > failed with DistributedSystemDisconnectedException > --- > > Key: GEODE-9738 > URL: https://issues.apache.org/jira/browse/GEODE-9738 > Project: Geode > Issue Type: Bug > Components: membership, messaging >Affects Versions: 1.15.0 >Reporter: Kamilla Aslami >Assignee: Bill Burcham >Priority: Major > Labels: needsTriage > Attachments: controller.log, locator.log, vm0.log, vm1.log, vm2.log, > vm3.log > > > {noformat} > RollingUpgradeRollServersOnReplicatedRegion_dataserializable > > testRollServersOnReplicatedRegion_dataserializable[from_v1.13.4] FAILED > java.lang.AssertionError: Suspicious strings were written to the log > during this run. > Fix the strings or use IgnoredException.addIgnoredException to ignore. 
> --- > Found suspect string in 'dunit_suspect-vm2.log' at line 685[fatal > 2021/10/14 00:24:14.739 UTC tid=115] Uncaught exception > in thread Thread[FederatingManager6,5,RMI Runtime] > org.apache.geode.management.ManagementException: > org.apache.geode.distributed.DistributedSystemDisconnectedException: > Distribution manager on > heavy-lifter-10ae5f9d-2528-5e02-b707-d968eb54d50a(vm2:580278:locator):54751 > started at Thu Oct 14 00:23:52 UTC 2021: Message distribution has terminated > at > org.apache.geode.management.internal.FederatingManager.addMemberArtifacts(FederatingManager.java:486) > at > org.apache.geode.management.internal.FederatingManager$AddMemberTask.call(FederatingManager.java:596) > at > org.apache.geode.management.internal.FederatingManager.lambda$addMember$1(FederatingManager.java:199) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.base/java.lang.Thread.run(Thread.java:829) > Caused by: > org.apache.geode.distributed.DistributedSystemDisconnectedException: > Distribution manager on > heavy-lifter-10ae5f9d-2528-5e02-b707-d968eb54d50a(vm2:580278:locator):54751 > started at Thu Oct 14 00:23:52 UTC 2021: Message distribution has terminated > at > org.apache.geode.distributed.internal.ClusterDistributionManager$Stopper.generateCancelledException(ClusterDistributionManager.java:2885) > at > org.apache.geode.distributed.internal.InternalDistributedSystem$Stopper.generateCancelledException(InternalDistributedSystem.java:1177) > at > org.apache.geode.internal.cache.GemFireCacheImpl$Stopper.generateCancelledException(GemFireCacheImpl.java:5212) > at > org.apache.geode.CancelCriterion.checkCancelInProgress(CancelCriterion.java:83) > at > org.apache.geode.internal.cache.CreateRegionProcessor.initializeRegion(CreateRegionProcessor.java:121) > at > 
org.apache.geode.internal.cache.DistributedRegion.getInitialImageAndRecovery(DistributedRegion.java:1164) > at > org.apache.geode.internal.cache.DistributedRegion.initialize(DistributedRegion.java:1095) > at > org.apache.geode.internal.cache.GemFireCacheImpl.createVMRegion(GemFireCacheImpl.java:3108) > at > org.apache.geode.internal.cache.InternalRegionFactory.create(InternalRegionFactory.java:78) > at > org.apache.geode.management.internal.FederatingManager.addMemberArtifacts(FederatingManager.java:429) > ... 5 more > at org.junit.Assert.fail(Assert.java:89) > at > org.apache.geode.test.dunit.internal.DUnitLauncher.closeAndCheckForSuspects(DUnitLauncher.java:420) > at > org.apache.geode.test.dunit.internal.DUnitLauncher.closeAndCheckForSuspects(DUnitLauncher.java:436) > at > org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.cleanupAllVms(JUnit4DistributedTestCase.java:551) > at > org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.doTearDownDistributedTestCase(JUnit4DistributedTestCase.java:498) > at > org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.tearDownDistributedTestCase(JUnit4DistributedTestCase.java:481) > at jdk.internal.reflect.GeneratedMethodAccessor11.invoke(Unknown >
[jira] [Updated] (GEODE-9738) CI failure: RollingUpgradeRollServersOnReplicatedRegion_dataserializable failed with DistributedSystemDisconnectedException
[ https://issues.apache.org/jira/browse/GEODE-9738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Burcham updated GEODE-9738: Attachment: controller.log locator.log vm0.log vm1.log vm2.log vm3.log > CI failure: RollingUpgradeRollServersOnReplicatedRegion_dataserializable > failed with DistributedSystemDisconnectedException > --- > > Key: GEODE-9738 > URL: https://issues.apache.org/jira/browse/GEODE-9738 > Project: Geode > Issue Type: Bug > Components: membership, messaging >Affects Versions: 1.15.0 >Reporter: Kamilla Aslami >Assignee: Bill Burcham >Priority: Major > Labels: needsTriage > Attachments: controller.log, locator.log, vm0.log, vm1.log, vm2.log, > vm3.log > > > {noformat} > RollingUpgradeRollServersOnReplicatedRegion_dataserializable > > testRollServersOnReplicatedRegion_dataserializable[from_v1.13.4] FAILED > java.lang.AssertionError: Suspicious strings were written to the log > during this run. > Fix the strings or use IgnoredException.addIgnoredException to ignore. 
> --- > Found suspect string in 'dunit_suspect-vm2.log' at line 685[fatal > 2021/10/14 00:24:14.739 UTC tid=115] Uncaught exception > in thread Thread[FederatingManager6,5,RMI Runtime] > org.apache.geode.management.ManagementException: > org.apache.geode.distributed.DistributedSystemDisconnectedException: > Distribution manager on > heavy-lifter-10ae5f9d-2528-5e02-b707-d968eb54d50a(vm2:580278:locator):54751 > started at Thu Oct 14 00:23:52 UTC 2021: Message distribution has terminated > at > org.apache.geode.management.internal.FederatingManager.addMemberArtifacts(FederatingManager.java:486) > at > org.apache.geode.management.internal.FederatingManager$AddMemberTask.call(FederatingManager.java:596) > at > org.apache.geode.management.internal.FederatingManager.lambda$addMember$1(FederatingManager.java:199) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.base/java.lang.Thread.run(Thread.java:829) > Caused by: > org.apache.geode.distributed.DistributedSystemDisconnectedException: > Distribution manager on > heavy-lifter-10ae5f9d-2528-5e02-b707-d968eb54d50a(vm2:580278:locator):54751 > started at Thu Oct 14 00:23:52 UTC 2021: Message distribution has terminated > at > org.apache.geode.distributed.internal.ClusterDistributionManager$Stopper.generateCancelledException(ClusterDistributionManager.java:2885) > at > org.apache.geode.distributed.internal.InternalDistributedSystem$Stopper.generateCancelledException(InternalDistributedSystem.java:1177) > at > org.apache.geode.internal.cache.GemFireCacheImpl$Stopper.generateCancelledException(GemFireCacheImpl.java:5212) > at > org.apache.geode.CancelCriterion.checkCancelInProgress(CancelCriterion.java:83) > at > org.apache.geode.internal.cache.CreateRegionProcessor.initializeRegion(CreateRegionProcessor.java:121) > at > 
org.apache.geode.internal.cache.DistributedRegion.getInitialImageAndRecovery(DistributedRegion.java:1164) > at > org.apache.geode.internal.cache.DistributedRegion.initialize(DistributedRegion.java:1095) > at > org.apache.geode.internal.cache.GemFireCacheImpl.createVMRegion(GemFireCacheImpl.java:3108) > at > org.apache.geode.internal.cache.InternalRegionFactory.create(InternalRegionFactory.java:78) > at > org.apache.geode.management.internal.FederatingManager.addMemberArtifacts(FederatingManager.java:429) > ... 5 more > at org.junit.Assert.fail(Assert.java:89) > at > org.apache.geode.test.dunit.internal.DUnitLauncher.closeAndCheckForSuspects(DUnitLauncher.java:420) > at > org.apache.geode.test.dunit.internal.DUnitLauncher.closeAndCheckForSuspects(DUnitLauncher.java:436) > at > org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.cleanupAllVms(JUnit4DistributedTestCase.java:551) > at > org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.doTearDownDistributedTestCase(JUnit4DistributedTestCase.java:498) > at > org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.tearDownDistributedTestCase(JUnit4DistributedTestCase.java:481) > at jdk.internal.reflect.GeneratedMethodAccessor11.invoke(Unknown > Source) > at > jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccesso
[jira] [Resolved] (GEODE-9675) CI: ClusterDistributionManagerDUnitTest > testConnectAfterBeingShunned FAILED
[ https://issues.apache.org/jira/browse/GEODE-9675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Burcham resolved GEODE-9675. - Fix Version/s: 1.15.0 Resolution: Fixed Fixed this test by deleting this test. > CI: ClusterDistributionManagerDUnitTest > testConnectAfterBeingShunned FAILED > - > > Key: GEODE-9675 > URL: https://issues.apache.org/jira/browse/GEODE-9675 > Project: Geode > Issue Type: Bug > Components: membership >Affects Versions: 1.15.0 >Reporter: Xiaojian Zhou >Assignee: Bill Burcham >Priority: Major > Labels: pull-request-available > Fix For: 1.15.0 > > Attachments: screenshot-1.png > > > https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-develop-mass-test-run/jobs/distributed-test-openjdk8/builds/1983 > {code:java} > ClusterDistributionManagerDUnitTest > testConnectAfterBeingShunned FAILED > org.apache.geode.SystemConnectException: Problem starting up membership > services > at > org.apache.geode.distributed.internal.DistributionImpl.start(DistributionImpl.java:186) > at > org.apache.geode.distributed.internal.DistributionImpl.createDistribution(DistributionImpl.java:222) > at > org.apache.geode.distributed.internal.ClusterDistributionManager.(ClusterDistributionManager.java:466) > at > org.apache.geode.distributed.internal.ClusterDistributionManager.(ClusterDistributionManager.java:499) > at > org.apache.geode.distributed.internal.ClusterDistributionManager.create(ClusterDistributionManager.java:328) > at > org.apache.geode.distributed.internal.InternalDistributedSystem.initialize(InternalDistributedSystem.java:757) > at > org.apache.geode.distributed.internal.InternalDistributedSystem.access$200(InternalDistributedSystem.java:133) > at > org.apache.geode.distributed.internal.InternalDistributedSystem$Builder.build(InternalDistributedSystem.java:3013) > at > org.apache.geode.distributed.internal.InternalDistributedSystem.connectInternal(InternalDistributedSystem.java:283) > at > 
org.apache.geode.distributed.internal.InternalDistributedSystem.connectInternal(InternalDistributedSystem.java:209) > at > org.apache.geode.distributed.DistributedSystem.connect(DistributedSystem.java:159) > at > org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.getSystem(JUnit4DistributedTestCase.java:180) > at > org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.getSystem(JUnit4DistributedTestCase.java:256) > at > org.apache.geode.distributed.internal.ClusterDistributionManagerDUnitTest.testConnectAfterBeingShunned(ClusterDistributionManagerDUnitTest.java:170) > Caused by: > > org.apache.geode.distributed.internal.membership.api.MemberStartupException: > unable to create jgroups channel > at > org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger.start(JGroupsMessenger.java:401) > at > org.apache.geode.distributed.internal.membership.gms.Services.start(Services.java:203) > at > org.apache.geode.distributed.internal.membership.gms.GMSMembership.start(GMSMembership.java:1642) > at > org.apache.geode.distributed.internal.DistributionImpl.start(DistributionImpl.java:171) > ... 13 more > Caused by: > java.lang.Exception: failed to open a port in range 41003-41003 > at > org.jgroups.protocols.UDP.createMulticastSocketWithBindPort(UDP.java:503) > at org.jgroups.protocols.UDP.createSockets(UDP.java:348) > at org.jgroups.protocols.UDP.start(UDP.java:266) > at > org.jgroups.stack.ProtocolStack.startStack(ProtocolStack.java:966) > at org.jgroups.JChannel.startStack(JChannel.java:889) > at org.jgroups.JChannel._preConnect(JChannel.java:553) > at org.jgroups.JChannel.connect(JChannel.java:288) > at org.jgroups.JChannel.connect(JChannel.java:279) > at > org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger.start(JGroupsMessenger.java:397) > ... 16 more > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
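The root cause in the trace above, "failed to open a port in range 41003-41003", is the classic fixed-port flakiness in distributed tests: the hard-coded port was already taken (or not yet released) when JGroups tried to bind it. A minimal sketch of the usual mitigation is to ask the OS for an ephemeral port instead of hard-coding one. The `FreePort` class and `findFreePort` helper below are illustrative only, not Geode's own port-allocation utilities:

```java
import java.io.IOException;
import java.net.ServerSocket;

public class FreePort {
    /**
     * Ask the OS for an ephemeral port rather than hard-coding one.
     * Binding to port 0 lets the kernel pick any free port; we read the
     * chosen port back before the socket is closed.
     */
    public static int findFreePort() throws IOException {
        try (ServerSocket socket = new ServerSocket(0)) {
            return socket.getLocalPort();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(findFreePort());
    }
}
```

Note the small caveat with this pattern: the port is released again when the probe socket closes, so another process can grab it before the test rebinds it; it narrows the race window but does not eliminate it.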
[jira] [Assigned] (GEODE-9675) CI: ClusterDistributionManagerDUnitTest > testConnectAfterBeingShunned FAILED
[ https://issues.apache.org/jira/browse/GEODE-9675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Burcham reassigned GEODE-9675: --- Assignee: Bill Burcham > CI: ClusterDistributionManagerDUnitTest > testConnectAfterBeingShunned FAILED > - > > Key: GEODE-9675 > URL: https://issues.apache.org/jira/browse/GEODE-9675 > Project: Geode > Issue Type: Bug > Components: membership >Affects Versions: 1.15.0 >Reporter: Xiaojian Zhou >Assignee: Bill Burcham >Priority: Major > Labels: pull-request-available > Attachments: screenshot-1.png > > > https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-develop-mass-test-run/jobs/distributed-test-openjdk8/builds/1983 > {code:java} > ClusterDistributionManagerDUnitTest > testConnectAfterBeingShunned FAILED > org.apache.geode.SystemConnectException: Problem starting up membership > services > at > org.apache.geode.distributed.internal.DistributionImpl.start(DistributionImpl.java:186) > at > org.apache.geode.distributed.internal.DistributionImpl.createDistribution(DistributionImpl.java:222) > at > org.apache.geode.distributed.internal.ClusterDistributionManager.(ClusterDistributionManager.java:466) > at > org.apache.geode.distributed.internal.ClusterDistributionManager.(ClusterDistributionManager.java:499) > at > org.apache.geode.distributed.internal.ClusterDistributionManager.create(ClusterDistributionManager.java:328) > at > org.apache.geode.distributed.internal.InternalDistributedSystem.initialize(InternalDistributedSystem.java:757) > at > org.apache.geode.distributed.internal.InternalDistributedSystem.access$200(InternalDistributedSystem.java:133) > at > org.apache.geode.distributed.internal.InternalDistributedSystem$Builder.build(InternalDistributedSystem.java:3013) > at > org.apache.geode.distributed.internal.InternalDistributedSystem.connectInternal(InternalDistributedSystem.java:283) > at > 
org.apache.geode.distributed.internal.InternalDistributedSystem.connectInternal(InternalDistributedSystem.java:209) > at > org.apache.geode.distributed.DistributedSystem.connect(DistributedSystem.java:159) > at > org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.getSystem(JUnit4DistributedTestCase.java:180) > at > org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.getSystem(JUnit4DistributedTestCase.java:256) > at > org.apache.geode.distributed.internal.ClusterDistributionManagerDUnitTest.testConnectAfterBeingShunned(ClusterDistributionManagerDUnitTest.java:170) > Caused by: > > org.apache.geode.distributed.internal.membership.api.MemberStartupException: > unable to create jgroups channel > at > org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger.start(JGroupsMessenger.java:401) > at > org.apache.geode.distributed.internal.membership.gms.Services.start(Services.java:203) > at > org.apache.geode.distributed.internal.membership.gms.GMSMembership.start(GMSMembership.java:1642) > at > org.apache.geode.distributed.internal.DistributionImpl.start(DistributionImpl.java:171) > ... 13 more > Caused by: > java.lang.Exception: failed to open a port in range 41003-41003 > at > org.jgroups.protocols.UDP.createMulticastSocketWithBindPort(UDP.java:503) > at org.jgroups.protocols.UDP.createSockets(UDP.java:348) > at org.jgroups.protocols.UDP.start(UDP.java:266) > at > org.jgroups.stack.ProtocolStack.startStack(ProtocolStack.java:966) > at org.jgroups.JChannel.startStack(JChannel.java:889) > at org.jgroups.JChannel._preConnect(JChannel.java:553) > at org.jgroups.JChannel.connect(JChannel.java:288) > at org.jgroups.JChannel.connect(JChannel.java:279) > at > org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger.start(JGroupsMessenger.java:397) > ... 16 more > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (GEODE-9738) CI failure: RollingUpgradeRollServersOnReplicatedRegion_dataserializable failed with DistributedSystemDisconnectedException
[ https://issues.apache.org/jira/browse/GEODE-9738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Burcham reassigned GEODE-9738: --- Assignee: Bill Burcham > CI failure: RollingUpgradeRollServersOnReplicatedRegion_dataserializable > failed with DistributedSystemDisconnectedException > --- > > Key: GEODE-9738 > URL: https://issues.apache.org/jira/browse/GEODE-9738 > Project: Geode > Issue Type: Bug > Components: membership, messaging >Affects Versions: 1.15.0 >Reporter: Kamilla Aslami >Assignee: Bill Burcham >Priority: Major > Labels: needsTriage > > {noformat} > RollingUpgradeRollServersOnReplicatedRegion_dataserializable > > testRollServersOnReplicatedRegion_dataserializable[from_v1.13.4] FAILED > java.lang.AssertionError: Suspicious strings were written to the log > during this run. > Fix the strings or use IgnoredException.addIgnoredException to ignore. > --- > Found suspect string in 'dunit_suspect-vm2.log' at line 685[fatal > 2021/10/14 00:24:14.739 UTC tid=115] Uncaught exception > in thread Thread[FederatingManager6,5,RMI Runtime] > org.apache.geode.management.ManagementException: > org.apache.geode.distributed.DistributedSystemDisconnectedException: > Distribution manager on > heavy-lifter-10ae5f9d-2528-5e02-b707-d968eb54d50a(vm2:580278:locator):54751 > started at Thu Oct 14 00:23:52 UTC 2021: Message distribution has terminated > at > org.apache.geode.management.internal.FederatingManager.addMemberArtifacts(FederatingManager.java:486) > at > org.apache.geode.management.internal.FederatingManager$AddMemberTask.call(FederatingManager.java:596) > at > org.apache.geode.management.internal.FederatingManager.lambda$addMember$1(FederatingManager.java:199) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.base/java.lang.Thread.run(Thread.java:829) > Caused by: > 
org.apache.geode.distributed.DistributedSystemDisconnectedException: > Distribution manager on > heavy-lifter-10ae5f9d-2528-5e02-b707-d968eb54d50a(vm2:580278:locator):54751 > started at Thu Oct 14 00:23:52 UTC 2021: Message distribution has terminated > at > org.apache.geode.distributed.internal.ClusterDistributionManager$Stopper.generateCancelledException(ClusterDistributionManager.java:2885) > at > org.apache.geode.distributed.internal.InternalDistributedSystem$Stopper.generateCancelledException(InternalDistributedSystem.java:1177) > at > org.apache.geode.internal.cache.GemFireCacheImpl$Stopper.generateCancelledException(GemFireCacheImpl.java:5212) > at > org.apache.geode.CancelCriterion.checkCancelInProgress(CancelCriterion.java:83) > at > org.apache.geode.internal.cache.CreateRegionProcessor.initializeRegion(CreateRegionProcessor.java:121) > at > org.apache.geode.internal.cache.DistributedRegion.getInitialImageAndRecovery(DistributedRegion.java:1164) > at > org.apache.geode.internal.cache.DistributedRegion.initialize(DistributedRegion.java:1095) > at > org.apache.geode.internal.cache.GemFireCacheImpl.createVMRegion(GemFireCacheImpl.java:3108) > at > org.apache.geode.internal.cache.InternalRegionFactory.create(InternalRegionFactory.java:78) > at > org.apache.geode.management.internal.FederatingManager.addMemberArtifacts(FederatingManager.java:429) > ... 
5 more > at org.junit.Assert.fail(Assert.java:89) > at > org.apache.geode.test.dunit.internal.DUnitLauncher.closeAndCheckForSuspects(DUnitLauncher.java:420) > at > org.apache.geode.test.dunit.internal.DUnitLauncher.closeAndCheckForSuspects(DUnitLauncher.java:436) > at > org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.cleanupAllVms(JUnit4DistributedTestCase.java:551) > at > org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.doTearDownDistributedTestCase(JUnit4DistributedTestCase.java:498) > at > org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.tearDownDistributedTestCase(JUnit4DistributedTestCase.java:481) > at jdk.internal.reflect.GeneratedMethodAccessor11.invoke(Unknown > Source) > at > jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:566) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) > at > org.junit.internal.
[jira] [Commented] (GEODE-9402) Automatic Reconnect Failure: Address already in use
[ https://issues.apache.org/jira/browse/GEODE-9402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17440828#comment-17440828 ] Bill Burcham commented on GEODE-9402: - h2. Summary In each of the attached logs, we see the member that logged the BindException eventually joining the view (in 8 and 11 seconds respectively). My suspicion is that what we see here is nondeterminism in the time it takes for a port to become available after it is unbound. Since the members in question do re-join the cluster successfully I don't think this is a bug. What do you think [~jjramos] ? h2. Detailed Analysis of cluster_logs_gke_latest_54 Looking at cluster_logs_gke_latest_54 quorum loss happens: [Entry id=4208, date=2021/06/23 15:55:48.119 GMT, level=fatal, thread=tid=0x92, emitter=Geode Membership View Creator, message=Possible loss of quorum due to the loss of 5 cache processes: [gemfire-cluster-server-3(gemfire-cluster-server-3:1):41000, gemfire-cluster-server-1(gemfire-cluster-server-1:1):41000, gemfire-cluster-locator-1(gemfire-cluster-locator-1:1:locator):41000, gemfire-cluster-server-2(gemfire-cluster-server-2:1):41000, gemfire-cluster-locator-0(gemfire-cluster-locator-0:1:locator):41000] , Host=gemfire-cluster-server-0 , mergedFile=/Users/bburcham/Downloads/cluster_logs_gke_latest_54/gemfire-cluster-server-0/gemfire-cluster-server-0-01-01.log] It takes about two minutes for the network partition to be healed and for a coordinator to be designated. It is TBD what part of that two minutes was due to the test delaying the healing of the partition, vs what part of that time was spent re-forming a cluster after the network partition was healed. 
Here's the coordinator thread starting: [Entry id=4925, date=2021/06/23 15:57:57.671 GMT, level=info, thread=tid=0x87, emitter=ReconnectThread, message=This member is becoming the membership coordinator with address gemfire-cluster-locator-0(gemfire-cluster-locator-0:1:locator):41000 , Host=gemfire-cluster-locator-0 , mergedFile=/Users/bburcham/Downloads/cluster_logs_gke_latest_54/gemfire-cluster-locator-0/gemfire-cluster-locator-0.log] That point in time corresponds to view 21 (the pre-partition view sequence ended at view 5): [Entry id=4960, date=2021/06/23 15:57:58.009 GMT, level=info, thread=tid=0xad, emitter=Geode Membership View Creator, message=sending new view View[gemfire-cluster-locator-0(gemfire-cluster-locator-0:1:locator):41000|21] members: [gemfire-cluster-locator-0(gemfire-cluster-locator-0:1:locator):41000, gemfire-cluster-server-0(gemfire-cluster-server-0:1):41000\{lead}, gemfire-cluster-server-1(gemfire-cluster-server-1:1):41000, gemfire-cluster-server-3(gemfire-cluster-server-3:1):41000, gemfire-cluster-server-2(gemfire-cluster-server-2:1):41000, gemfire-cluster-locator-1(gemfire-cluster-locator-1:1:locator):41000] crashed: [gemfire-cluster-locator-1(gemfire-cluster-locator-1:1:locator):41000, gemfire-cluster-server-2(gemfire-cluster-server-2:1):41000, gemfire-cluster-locator-0(gemfire-cluster-locator-0:1:locator):41000] , Host=gemfire-cluster-locator-0 , mergedFile=/Users/bburcham/Downloads/cluster_logs_gke_latest_54/gemfire-cluster-locator-0/gemfire-cluster-locator-0.log] About a minute later server-0 logs the BindException while reconnecting: [Entry id=5536, date=2021/06/23 16:00:31.491 GMT, level=error, thread=tid=0x94, emitter=ReconnectThread, message=Cache initialization for GemFireCache[id = 1795575589; isClosing = false; isShutDownAll = false; created = Wed Jun 23 15:58:29 GMT 2021; server = false; copyOnRead = false; lockLease = 120; lockTimeout = 60] failed because: org.apache.geode.GemFireIOException: While starting cache server 
CacheServer on port=40404 client subscription config policy=none client subscription config capacity=1 client subscription config overflow directory=. at org.apache.geode.internal.cache.xmlcache.CacheCreation.startCacheServers(CacheCreation.java:800) at org.apache.geode.internal.cache.xmlcache.CacheCreation.create(CacheCreation.java:599) at org.apache.geode.internal.cache.xmlcache.CacheXmlParser.create(CacheXmlParser.java:339) at org.apache.geode.internal.cache.GemFireCacheImpl.loadCacheXml(GemFireCacheImpl.java:4207) at org.apache.geode.internal.cache.ClusterConfigurationLoader.applyClusterXmlConfiguration(ClusterConfigurationLoader.java:199) at org.apache.geode.internal.cache.GemFireCacheImpl.applyJarAndXmlFromClusterConfig(GemFireCacheImpl.java:1497) at org.apache.geode.internal.cache.GemFireCacheImpl.initialize(GemFireCacheImpl.java:1449) at org.apache.geode.internal.cache.InternalCacheBuilder.create(InternalCacheBuilder.java:191) at org.apache.geode.distributed.internal.InternalDistributedSystem.reconnect(InternalDistributedSystem.java:2668) at org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect(InternalDistribut
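If the hypothesis above is right, and the port is simply unavailable for a short window after the old socket is unbound, then a bounded retry with backoff around the bind would mask that window. A hedged sketch under that assumption; the `BindRetry` class, the `bindWithRetry` helper, and its parameters are illustrative, and Geode's reconnect path does not necessarily work this way:

```java
import java.io.IOException;
import java.net.BindException;
import java.net.ServerSocket;

public class BindRetry {
    /**
     * Try to bind a listening socket, backing off between attempts when
     * the port is still in use. Throws the last BindException if every
     * attempt fails.
     */
    public static ServerSocket bindWithRetry(int port, int attempts, long backoffMs)
            throws IOException, InterruptedException {
        BindException last = null;
        for (int i = 0; i < attempts; i++) {
            try {
                return new ServerSocket(port);
            } catch (BindException e) {
                last = e; // port may still be releasing; wait and retry
                Thread.sleep(backoffMs);
            }
        }
        if (last == null) {
            last = new BindException("no bind attempts made");
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        // Grab a free port, release it, then rebind with retries.
        int port;
        try (ServerSocket probe = new ServerSocket(0)) {
            port = probe.getLocalPort();
        }
        try (ServerSocket bound = bindWithRetry(port, 5, 100)) {
            System.out.println(bound.getLocalPort() == port);
        }
    }
}
```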
[jira] [Assigned] (GEODE-9402) Automatic Reconnect Failure: Address already in use
[ https://issues.apache.org/jira/browse/GEODE-9402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Burcham reassigned GEODE-9402: --- Assignee: Bill Burcham > Automatic Reconnect Failure: Address already in use > --- > > Key: GEODE-9402 > URL: https://issues.apache.org/jira/browse/GEODE-9402 > Project: Geode > Issue Type: Bug > Components: membership >Reporter: Juan Ramos >Assignee: Bill Burcham >Priority: Major > Attachments: cluster_logs_gke_latest_54.zip, cluster_logs_pks_121.zip > > > There are 2 locators and 4 servers during the test, once they're all up and > running the test drops the network connectivity between all members to > generate a full network partition and cause all members to shutdown and go > into reconnect mode. Upon reaching the mentioned state, the test > automatically restores the network connectivity and expects all members to > automatically go up again and re-form the distributed system. > This works fine most of the time, and we see every member successfully > reconnecting to the distributed system: > {noformat} > [info 2021/06/23 15:58:12.981 GMT gemfire-cluster-locator-0 > tid=0x87] Reconnect completed. > [info 2021/06/23 15:58:14.726 GMT gemfire-cluster-locator-1 > tid=0x86] Reconnect completed. > [info 2021/06/23 15:58:46.702 GMT gemfire-cluster-server-0 > tid=0x94] Reconnect completed. > [info 2021/06/23 15:58:46.485 GMT gemfire-cluster-server-1 > tid=0x96] Reconnect completed. > [info 2021/06/23 15:58:46.273 GMT gemfire-cluster-server-2 > tid=0x97] Reconnect completed. > [info 2021/06/23 15:58:46.902 GMT gemfire-cluster-server-3 > tid=0x95] Reconnect completed. 
> {noformat} > In some rare occasions, though, one of the servers fails during the reconnect > phase with the following exception: > {noformat} > [error 2021/06/09 18:48:52.872 GMT gemfire-cluster-server-1 > tid=0x91] Cache initialization for GemFireCache[id = 575310555; isClosing = > false; isShutDownAll = false; created = Wed Jun 09 18:46:49 GMT 2021; server > = false; copyOnRead = false; lockLease = 120; lockTimeout = 60] failed > because: > org.apache.geode.GemFireIOException: While starting cache server CacheServer > on port=40404 client subscription config policy=none client subscription > config capacity=1 client subscription config overflow directory=. > at > org.apache.geode.internal.cache.xmlcache.CacheCreation.startCacheServers(CacheCreation.java:800) > at > org.apache.geode.internal.cache.xmlcache.CacheCreation.create(CacheCreation.java:599) > at > org.apache.geode.internal.cache.xmlcache.CacheXmlParser.create(CacheXmlParser.java:339) > at > org.apache.geode.internal.cache.GemFireCacheImpl.loadCacheXml(GemFireCacheImpl.java:4207) > at > org.apache.geode.internal.cache.ClusterConfigurationLoader.applyClusterXmlConfiguration(ClusterConfigurationLoader.java:197) > at > org.apache.geode.internal.cache.GemFireCacheImpl.applyJarAndXmlFromClusterConfig(GemFireCacheImpl.java:1497) > at > org.apache.geode.internal.cache.GemFireCacheImpl.initialize(GemFireCacheImpl.java:1449) > at > org.apache.geode.internal.cache.InternalCacheBuilder.create(InternalCacheBuilder.java:191) > at > org.apache.geode.distributed.internal.InternalDistributedSystem.reconnect(InternalDistributedSystem.java:2668) > at > org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect(InternalDistributedSystem.java:2426) > at > org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1277) > at > org.apache.geode.distributed.internal.ClusterDistributionManager$DMListener.membershipFailure(ClusterDistributionManager.java:2315) > 
at > org.apache.geode.distributed.internal.membership.gms.GMSMembership.uncleanShutdown(GMSMembership.java:1183) > at > org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.lambda$forceDisconnect$0(GMSMembership.java:1807) > at java.base/java.lang.Thread.run(Thread.java:829) > Caused by: java.net.BindException: Address already in use (Bind failed) > at java.base/java.net.PlainSocketImpl.socketBind(Native Method) > at > java.base/java.net.AbstractPlainSocketImpl.bind(AbstractPlainSocketImpl.java:436) > at java.base/java.net.ServerSocket.bind(ServerSocket.java:395) > at > org.apache.geode.internal.net.SCClusterSocketCreator.createServerSocket(SCClusterSocketCreator.java:70) > at > org.apache.geode.internal.net.SocketCreator.createServerSocket(SocketCreator.java:529) > at > org.apache.geode.internal.cache.tier.sockets.AcceptorImpl.(AcceptorImpl.java:573) > at > org.apache.geode.internal.cache.tier.sockets.AcceptorBuilder.create(AcceptorB
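The `BindException` above comes from re-creating the cache server's listening socket on port 40404 while the previous binding may still be lingering. One conventional mitigation is to enable `SO_REUSEADDR` before binding, which requires constructing the socket unbound first, since setting the option after bind has undefined effect. The `ReuseAddr` class below is an illustrative sketch, not Geode's `SCClusterSocketCreator`:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

public class ReuseAddr {
    /**
     * Open a listening socket that can reclaim a port still lingering
     * from a previous incarnation of the server.
     */
    public static ServerSocket listen(int port) throws IOException {
        ServerSocket socket = new ServerSocket();   // create unbound
        socket.setReuseAddress(true);               // must precede bind()
        socket.bind(new InetSocketAddress(port));
        return socket;
    }

    public static void main(String[] args) throws IOException {
        try (ServerSocket server = listen(0)) {
            System.out.println(server.isBound());
        }
    }
}
```

Whether this would actually help here depends on why the address was unavailable: `SO_REUSEADDR` covers the TIME_WAIT case, but not a port genuinely held open by another live socket.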
[jira] [Assigned] (GEODE-9675) CI: ClusterDistributionManagerDUnitTest > testConnectAfterBeingShunned FAILED
[ https://issues.apache.org/jira/browse/GEODE-9675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bill Burcham reassigned GEODE-9675:
-----------------------------------

    Assignee: (was: Bill Burcham)

> CI: ClusterDistributionManagerDUnitTest > testConnectAfterBeingShunned FAILED
> -----------------------------------------------------------------------------
>
>                 Key: GEODE-9675
>                 URL: https://issues.apache.org/jira/browse/GEODE-9675
>             Project: Geode
>          Issue Type: Bug
>          Components: membership
>    Affects Versions: 1.15.0
>            Reporter: Xiaojian Zhou
>            Priority: Major
>         Attachments: screenshot-1.png
>
> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-develop-mass-test-run/jobs/distributed-test-openjdk8/builds/1983
> {code:java}
> ClusterDistributionManagerDUnitTest > testConnectAfterBeingShunned FAILED
>     org.apache.geode.SystemConnectException: Problem starting up membership services
>         at org.apache.geode.distributed.internal.DistributionImpl.start(DistributionImpl.java:186)
>         at org.apache.geode.distributed.internal.DistributionImpl.createDistribution(DistributionImpl.java:222)
>         at org.apache.geode.distributed.internal.ClusterDistributionManager.<init>(ClusterDistributionManager.java:466)
>         at org.apache.geode.distributed.internal.ClusterDistributionManager.<init>(ClusterDistributionManager.java:499)
>         at org.apache.geode.distributed.internal.ClusterDistributionManager.create(ClusterDistributionManager.java:328)
>         at org.apache.geode.distributed.internal.InternalDistributedSystem.initialize(InternalDistributedSystem.java:757)
>         at org.apache.geode.distributed.internal.InternalDistributedSystem.access$200(InternalDistributedSystem.java:133)
>         at org.apache.geode.distributed.internal.InternalDistributedSystem$Builder.build(InternalDistributedSystem.java:3013)
>         at org.apache.geode.distributed.internal.InternalDistributedSystem.connectInternal(InternalDistributedSystem.java:283)
>         at org.apache.geode.distributed.internal.InternalDistributedSystem.connectInternal(InternalDistributedSystem.java:209)
>         at org.apache.geode.distributed.DistributedSystem.connect(DistributedSystem.java:159)
>         at org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.getSystem(JUnit4DistributedTestCase.java:180)
>         at org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.getSystem(JUnit4DistributedTestCase.java:256)
>         at org.apache.geode.distributed.internal.ClusterDistributionManagerDUnitTest.testConnectAfterBeingShunned(ClusterDistributionManagerDUnitTest.java:170)
>
>         Caused by:
>         org.apache.geode.distributed.internal.membership.api.MemberStartupException: unable to create jgroups channel
>             at org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger.start(JGroupsMessenger.java:401)
>             at org.apache.geode.distributed.internal.membership.gms.Services.start(Services.java:203)
>             at org.apache.geode.distributed.internal.membership.gms.GMSMembership.start(GMSMembership.java:1642)
>             at org.apache.geode.distributed.internal.DistributionImpl.start(DistributionImpl.java:171)
>             ... 13 more
>
>             Caused by:
>             java.lang.Exception: failed to open a port in range 41003-41003
>                 at org.jgroups.protocols.UDP.createMulticastSocketWithBindPort(UDP.java:503)
>                 at org.jgroups.protocols.UDP.createSockets(UDP.java:348)
>                 at org.jgroups.protocols.UDP.start(UDP.java:266)
>                 at org.jgroups.stack.ProtocolStack.startStack(ProtocolStack.java:966)
>                 at org.jgroups.JChannel.startStack(JChannel.java:889)
>                 at org.jgroups.JChannel._preConnect(JChannel.java:553)
>                 at org.jgroups.JChannel.connect(JChannel.java:288)
>                 at org.jgroups.JChannel.connect(JChannel.java:279)
>                 at org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger.start(JGroupsMessenger.java:397)
>                 ... 16 more
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
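The innermost cause above is JGroups failing to bind its UDP socket because the configured range is a single port (41003-41003) that is already in use. A minimal sketch of that failure mode, not Geode or JGroups code: binding a second `DatagramSocket` to a port that a first socket already holds raises the same "Address already in use" condition that JGroups wraps as "failed to open a port in range N-N".

```java
import java.net.DatagramSocket;
import java.net.InetSocketAddress;
import java.net.SocketException;

// Sketch (illustrative, not Geode code): reproduce the port-clash condition
// underlying "failed to open a port in range 41003-41003".
public class PortClashDemo {
    public static void main(String[] args) throws Exception {
        // First socket takes an OS-assigned ephemeral port on loopback.
        try (DatagramSocket first = new DatagramSocket(null)) {
            first.bind(new InetSocketAddress("127.0.0.1", 0));
            int port = first.getLocalPort();
            // Second socket tries the exact same port, mirroring a
            // single-port range: the bind fails because the port is held.
            try (DatagramSocket second = new DatagramSocket(null)) {
                second.bind(new InetSocketAddress("127.0.0.1", port));
                System.out.println("unexpected: bind succeeded");
            } catch (SocketException e) {
                System.out.println("bind failed as expected");
            }
        }
    }
}
```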
[jira] [Updated] (GEODE-9675) CI: ClusterDistributionManagerDUnitTest > testConnectAfterBeingShunned FAILED
[ https://issues.apache.org/jira/browse/GEODE-9675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bill Burcham updated GEODE-9675:
--------------------------------

    Labels:  (was: needsTriage)

> CI: ClusterDistributionManagerDUnitTest > testConnectAfterBeingShunned FAILED
> -----------------------------------------------------------------------------
>
>                 Key: GEODE-9675
>                 URL: https://issues.apache.org/jira/browse/GEODE-9675
>             Project: Geode
>          Issue Type: Bug
>          Components: membership
>    Affects Versions: 1.15.0
>            Reporter: Xiaojian Zhou
>            Assignee: Bill Burcham
>            Priority: Major
>         Attachments: screenshot-1.png
>
> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-develop-mass-test-run/jobs/distributed-test-openjdk8/builds/1983
> (issue body and stack trace identical to the notification quoted above)

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
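Since the failure comes from a one-port range, the usual remedy in test infrastructure is to probe for a port that is actually free before handing it to the membership layer. A sketch of such a probe, assuming a hypothetical helper (this is not Geode's real port-allocation code): attempt a bind on each candidate and return the first one that succeeds.

```java
import java.net.DatagramSocket;
import java.net.InetSocketAddress;
import java.net.SocketException;

// Sketch (hypothetical helper, not Geode's port allocator): scan a UDP port
// range and return the first port that can currently be bound. A range wider
// than a single port gives the caller room to fall back when one is taken.
public final class UdpPortScanner {
    public static int firstFreePort(int from, int to) {
        for (int port = from; port <= to; port++) {
            try (DatagramSocket probe = new DatagramSocket(null)) {
                probe.bind(new InetSocketAddress("127.0.0.1", port));
                return port; // bind succeeded: port is free right now
            } catch (SocketException e) {
                // port in use; try the next candidate
            }
        }
        throw new IllegalStateException("no free UDP port in " + from + "-" + to);
    }

    public static void main(String[] args) {
        int port = firstFreePort(41003, 41103);
        System.out.println("free port found: " + (port >= 41003 && port <= 41103));
    }
}
```

Note the inherent race: the port is free only at probe time, so a test that sleeps between probing and binding can still collide with a concurrently started JVM.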