[jira] [Assigned] (GEODE-10391) Region Operation During Primary Change in P2P-only Configuration Results in Spurious Entry{NotFound|Exists}Exception

2022-06-28 Thread Bill Burcham (Jira)
Bill Burcham assigned an issue to Unassigned

Geode / GEODE-10391
Region Operation During Primary Change in P2P-only Configuration Results in Spurious Entry{NotFound|Exists}Exception

Change By: Bill Burcham
Assignee: Bill Burcham

--
This message was sent by Atlassian Jira
(v8.20.10#820010-sha1:ace47f9)

[jira] [Assigned] (GEODE-10391) Region Operation During Primary Change in P2P-only Configuration Results in Spurious Entry{NotFound|Exists}Exception

2022-06-17 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-10391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham reassigned GEODE-10391:


Assignee: Bill Burcham

> Region Operation During Primary Change in P2P-only Configuration Results in 
> Spurious Entry{NotFound|Exists}Exception
> 
>
> Key: GEODE-10391
> URL: https://issues.apache.org/jira/browse/GEODE-10391
> Project: Geode
>  Issue Type: Bug
>  Components: regions
>Affects Versions: 1.16.0
>Reporter: Bill Burcham
>Assignee: Bill Burcham
>Priority: Major
>
> When a primary moves while a region operation, e.g. create, is in-flight, 
> i.e. started but not yet acknowledged, the operation will be retried 
> automatically, until the operation succeeds or fails.
> When a member notices another member has crashed, the surviving member 
> requests (from the remaining members) data for which the crashed member had 
> been primary (delta-GII/sync). This sync is necessary to regain consistency 
> in case the (retrying) requester fails before it can re-issue the request to 
> the new primary.
> In GEODE-5055 we learned that we needed to delay that sync request long 
> enough for the new primary to be chosen and for the original requester to 
> make a new request against the new primary. If we didn't delay the sync, the 
> primary could end up with the entry in the new state (as if the operation had 
> completed) but without the corresponding event tracker data needed to 
> conflate the retried event.
> The fix for GEODE-5055 introduced a delay, but only for configurations where 
> clients were present. If only peers were present there would be no delay. 
> This ticket pertains to the P2P-only case.
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (GEODE-10391) Region Operation During Primary Change in P2P-only Configuration Results in Spurious Entry{NotFound|Exists}Exception

2022-06-17 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-10391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham updated GEODE-10391:
-
Labels:   (was: needsTriage)

> Region Operation During Primary Change in P2P-only Configuration Results in 
> Spurious Entry{NotFound|Exists}Exception
> 
>
> Key: GEODE-10391
> URL: https://issues.apache.org/jira/browse/GEODE-10391
> Project: Geode
>  Issue Type: Bug
>  Components: regions
>Affects Versions: 1.16.0
>Reporter: Bill Burcham
>Priority: Major
>
> When a primary moves while a region operation, e.g. create, is in-flight, 
> i.e. started but not yet acknowledged, the operation will be retried 
> automatically, until the operation succeeds or fails.
> When a member notices another member has crashed, the surviving member 
> requests (from the remaining members) data for which the crashed member had 
> been primary (delta-GII/sync). This sync is necessary to regain consistency 
> in case the (retrying) requester fails before it can re-issue the request to 
> the new primary.
> In GEODE-5055 we learned that we needed to delay that sync request long 
> enough for the new primary to be chosen and for the original requester to 
> make a new request against the new primary. If we didn't delay the sync, the 
> primary could end up with the entry in the new state (as if the operation had 
> completed) but without the corresponding event tracker data needed to 
> conflate the retried event.
> The fix for GEODE-5055 introduced a delay, but only for configurations where 
> clients were present. If only peers were present there would be no delay. 
> This ticket pertains to the P2P-only case.
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (GEODE-10391) Region Operation During Primary Change in P2P-only Configuration Results in Spurious Entry{NotFound|Exists}Exception

2022-06-16 Thread Bill Burcham (Jira)
Bill Burcham created GEODE-10391:


 Summary: Region Operation During Primary Change in P2P-only 
Configuration Results in Spurious Entry{NotFound|Exists}Exception
 Key: GEODE-10391
 URL: https://issues.apache.org/jira/browse/GEODE-10391
 Project: Geode
  Issue Type: Bug
  Components: regions
Affects Versions: 1.16.0
Reporter: Bill Burcham


When a primary moves while a region operation, e.g. create, is in-flight, i.e. 
started but not yet acknowledged, the operation will be retried automatically, 
until the operation succeeds or fails.

When a member notices another member has crashed, the surviving member requests 
(from the remaining members) data for which the crashed member had been primary 
(delta-GII/sync). This sync is necessary to regain consistency in case the 
(retrying) requester fails before it can re-issue the request to the new 
primary.

In GEODE-5055 we learned that we needed to delay that sync request long enough 
for the new primary to be chosen and for the original requester to make a new 
request against the new primary. If we didn't delay the sync, the primary could 
end up with the entry in the new state (as if the operation had completed) but 
without the corresponding event tracker data needed to conflate the retried 
event.

The fix for GEODE-5055 introduced a delay, but only for configurations where 
clients were present. If only peers were present there would be no delay. This 
ticket pertains to the P2P-only case.
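
For illustration only (hypothetical names, not the actual Geode fix), here is a sketch of the kind of deferred sync the ticket calls for: the delta-GII/sync request is delayed long enough for a new primary to be chosen and for the retried operation to reach it, regardless of whether clients are present.

{noformat}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not Geode's implementation: defer the delta-GII/sync
// that follows a primary's crash so the retried operation can land on the new
// primary (and its event tracker) first. The delay should apply in P2P-only
// configurations as well as client/server ones.
class DelayedSyncScheduler {
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  void scheduleSync(Runnable requestDeltaGiiSync, long delayMillis) {
    scheduler.schedule(requestDeltaGiiSync, delayMillis, TimeUnit.MILLISECONDS);
  }
}
{noformat}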

 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (GEODE-10326) Convert MessageType into an enum

2022-05-24 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-10326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham updated GEODE-10326:
-
Summary: Convert MessageType into an enum  (was: Covert MessageType into an 
enum)

> Convert MessageType into an enum
> 
>
> Key: GEODE-10326
> URL: https://issues.apache.org/jira/browse/GEODE-10326
> Project: Geode
>  Issue Type: Improvement
>  Components: messaging
>Reporter: Jacob Barrett
>Assignee: Jacob Barrett
>Priority: Major
>  Labels: pull-request-available
>
> Currently {{MessageType}} is a class with lots of numeric constants, 
> effectively an enum without all the compile-time checking that comes with 
> it. Let's make it an enum for type safety.
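
A minimal before/after sketch of the conversion the ticket asks for (the constant names below are illustrative, not the full MessageType list):

{noformat}
// Before: bare int constants; any int compiles where a message type is expected.
class MessageTypeConstants {
  static final int REQUEST = 0;
  static final int RESPONSE = 1;
}

// After: an enum, so APIs can require a MessageType and get compile-time checking.
enum MessageType {
  REQUEST,
  RESPONSE
}
{noformat}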



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Assigned] (GEODE-9402) Automatic Reconnect Failure: Address already in use

2022-05-02 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-9402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham reassigned GEODE-9402:
---

Assignee: Jianxia Chen  (was: Bill Burcham)

> Automatic Reconnect Failure: Address already in use
> ---
>
> Key: GEODE-9402
> URL: https://issues.apache.org/jira/browse/GEODE-9402
> Project: Geode
>  Issue Type: Bug
>  Components: membership
>Reporter: Juan Ramos
>Assignee: Jianxia Chen
>Priority: Major
> Attachments: cluster_logs_gke_latest_54.zip, cluster_logs_pks_121.zip
>
>
> There are 2 locators and 4 servers during the test. Once they're all up and 
> running, the test drops the network connectivity between all members to 
> generate a full network partition and cause all members to shut down and go 
> into reconnect mode. Upon reaching the mentioned state, the test 
> automatically restores the network connectivity and expects all members to 
> automatically go up again and re-form the distributed system.
>  This works fine most of the time, and we see every member successfully 
> reconnecting to the distributed system:
> {noformat}
> [info 2021/06/23 15:58:12.981 GMT gemfire-cluster-locator-0  
> tid=0x87] Reconnect completed.
> [info 2021/06/23 15:58:14.726 GMT gemfire-cluster-locator-1  
> tid=0x86] Reconnect completed.
> [info 2021/06/23 15:58:46.702 GMT gemfire-cluster-server-0  
> tid=0x94] Reconnect completed.
> [info 2021/06/23 15:58:46.485 GMT gemfire-cluster-server-1  
> tid=0x96] Reconnect completed.
> [info 2021/06/23 15:58:46.273 GMT gemfire-cluster-server-2  
> tid=0x97] Reconnect completed.
> [info 2021/06/23 15:58:46.902 GMT gemfire-cluster-server-3  
> tid=0x95] Reconnect completed.
> {noformat}
> On rare occasions, though, one of the servers fails during the reconnect 
> phase with the following exception:
> {noformat}
> [error 2021/06/09 18:48:52.872 GMT gemfire-cluster-server-1  
> tid=0x91] Cache initialization for GemFireCache[id = 575310555; isClosing = 
> false; isShutDownAll = false; created = Wed Jun 09 18:46:49 GMT 2021; server 
> = false; copyOnRead = false; lockLease = 120; lockTimeout = 60] failed 
> because:
> org.apache.geode.GemFireIOException: While starting cache server CacheServer 
> on port=40404 client subscription config policy=none client subscription 
> config capacity=1 client subscription config overflow directory=.
>   at 
> org.apache.geode.internal.cache.xmlcache.CacheCreation.startCacheServers(CacheCreation.java:800)
>   at 
> org.apache.geode.internal.cache.xmlcache.CacheCreation.create(CacheCreation.java:599)
>   at 
> org.apache.geode.internal.cache.xmlcache.CacheXmlParser.create(CacheXmlParser.java:339)
>   at 
> org.apache.geode.internal.cache.GemFireCacheImpl.loadCacheXml(GemFireCacheImpl.java:4207)
>   at 
> org.apache.geode.internal.cache.ClusterConfigurationLoader.applyClusterXmlConfiguration(ClusterConfigurationLoader.java:197)
>   at 
> org.apache.geode.internal.cache.GemFireCacheImpl.applyJarAndXmlFromClusterConfig(GemFireCacheImpl.java:1497)
>   at 
> org.apache.geode.internal.cache.GemFireCacheImpl.initialize(GemFireCacheImpl.java:1449)
>   at 
> org.apache.geode.internal.cache.InternalCacheBuilder.create(InternalCacheBuilder.java:191)
>   at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.reconnect(InternalDistributedSystem.java:2668)
>   at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect(InternalDistributedSystem.java:2426)
>   at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1277)
>   at 
> org.apache.geode.distributed.internal.ClusterDistributionManager$DMListener.membershipFailure(ClusterDistributionManager.java:2315)
>   at 
> org.apache.geode.distributed.internal.membership.gms.GMSMembership.uncleanShutdown(GMSMembership.java:1183)
>   at 
> org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.lambda$forceDisconnect$0(GMSMembership.java:1807)
>   at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: java.net.BindException: Address already in use (Bind failed)
>   at java.base/java.net.PlainSocketImpl.socketBind(Native Method)
>   at 
> java.base/java.net.AbstractPlainSocketImpl.bind(AbstractPlainSocketImpl.java:436)
>   at java.base/java.net.ServerSocket.bind(ServerSocket.java:395)
>   at 
> org.apache.geode.internal.net.SCClusterSocketCreator.createServerSocket(SCClusterSocketCreator.java:70)
>   at 
> org.apache.geode.internal.net.SocketCreator.createServerSocket(SocketCreator.java:529)
>   at 
> org.apache.geode.internal.cache.tier.sockets.AcceptorImpl.<init>(AcceptorImpl.java:573)
>   at 
> org.apache.geode.internal.cache.tier.sockets.AcceptorBui

[jira] [Created] (GEODE-10272) CI failure: SerialGatewaySenderEventProcessor throws RejectedExecutionException in handlePrimaryDestroy

2022-05-02 Thread Bill Burcham (Jira)
Bill Burcham created GEODE-10272:


 Summary: CI failure: SerialGatewaySenderEventProcessor throws 
RejectedExecutionException in handlePrimaryDestroy 
 Key: GEODE-10272
 URL: https://issues.apache.org/jira/browse/GEODE-10272
 Project: Geode
  Issue Type: Bug
  Components: wan
Affects Versions: 1.15.0
Reporter: Bill Burcham


[https://hydradb.hdb.gemfire-ci.info/hdb/testresult/14917007]

 
{noformat}
> Task :geode-wan:distributedTest

SerialWANPropagationOffHeapDUnitTest > 
testReplicatedSerialPropagationWithRemoteReceiverRestarted_SenderReceiverPersistent
 FAILED
java.lang.AssertionError: Suspicious strings were written to the log during 
this run.
Fix the strings or use IgnoredException.addIgnoredException to ignore.
---
Found suspect string in 'dunit_suspect-vm5.log' at line 578

[error 2022/04/30 17:54:20.129 UTC :51004 unshared 
ordered sender uid=22 dom #1 local port=51185 remote port=59364> tid=172] 
Exception occurred in CacheListener
java.util.concurrent.RejectedExecutionException: Task 
org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderEventProcessor$$Lambda$419/1037103054@1aae2bfe
 rejected from java.util.concurrent.ThreadPoolExecutor@7d2e5a91[Shutting down, 
pool size = 1, active threads = 0, queued tasks = 0, completed tasks = 8478]
  at 
java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063)
  at 
java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
  at 
java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
  at 
org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderEventProcessor.handlePrimaryDestroy(SerialGatewaySenderEventProcessor.java:592)
  at 
org.apache.geode.internal.cache.wan.serial.SerialSecondaryGatewayListener.afterDestroy(SerialSecondaryGatewayListener.java:92)
  at 
org.apache.geode.internal.cache.EnumListenerEvent$AFTER_DESTROY.dispatchEvent(EnumListenerEvent.java:183)
  at 
org.apache.geode.internal.cache.LocalRegion.dispatchEvent(LocalRegion.java:8313)
  at 
org.apache.geode.internal.cache.LocalRegion.dispatchListenerEvent(LocalRegion.java:7021)
  at 
org.apache.geode.internal.cache.LocalRegion.invokeDestroyCallbacks(LocalRegion.java:6822)
  at 
org.apache.geode.internal.cache.EntryEventImpl.invokeCallbacks(EntryEventImpl.java:2454)
  at 
org.apache.geode.internal.cache.entries.AbstractRegionEntry.dispatchListenerEvents(AbstractRegionEntry.java:164)
  at 
org.apache.geode.internal.cache.LocalRegion.basicDestroyPart2(LocalRegion.java:6763)
  at 
org.apache.geode.internal.cache.map.RegionMapDestroy.destroyExistingEntry(RegionMapDestroy.java:420)
  at 
org.apache.geode.internal.cache.map.RegionMapDestroy.handleExistingRegionEntry(RegionMapDestroy.java:244)
  at 
org.apache.geode.internal.cache.map.RegionMapDestroy.destroy(RegionMapDestroy.java:152)
  at 
org.apache.geode.internal.cache.AbstractRegionMap.destroy(AbstractRegionMap.java:940)
  at 
org.apache.geode.internal.cache.LocalRegion.mapDestroy(LocalRegion.java:6552)
  at 
org.apache.geode.internal.cache.LocalRegion.mapDestroy(LocalRegion.java:6526)
  at 
org.apache.geode.internal.cache.LocalRegionDataView.destroyExistingEntry(LocalRegionDataView.java:59)
  at 
org.apache.geode.internal.cache.LocalRegion.basicDestroy(LocalRegion.java:6477)
  at 
org.apache.geode.internal.cache.DistributedRegion.basicDestroy(DistributedRegion.java:1745)
  at 
org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderQueue$SerialGatewaySenderQueueMetaRegion.basicDestroy(SerialGatewaySenderQueue.java:1372)
  at 
org.apache.geode.internal.cache.LocalRegion.localDestroy(LocalRegion.java:2261)
  at 
org.apache.geode.internal.cache.DistributedRegion.localDestroy(DistributedRegion.java:981)
  at 
org.apache.geode.internal.cache.wan.serial.BatchDestroyOperation$DestroyMessage.operateOnRegion(BatchDestroyOperation.java:121)
  at 
org.apache.geode.internal.cache.DistributedCacheOperation$CacheOperationMessage.basicProcess(DistributedCacheOperation.java:1196)
  at 
org.apache.geode.internal.cache.DistributedCacheOperation$CacheOperationMessage.process(DistributedCacheOperation.java:1102)
  at 
org.apache.geode.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:380)
  at 
org.apache.geode.distributed.internal.DistributionMessage.schedule(DistributionMessage.java:436)
  at 
org.apache.geode.distributed.internal.ClusterDistributionManager.scheduleIncomingMessage(ClusterDistributionManager.java:2080)
  at 
org.apache.geode.distributed.internal.ClusterDistributionManager.handleIncomingDMsg(ClusterDistributionManager.java:1844)
  at 
org.apache.geode.distributed.internal.membe

[jira] [Created] (GEODE-10271) CI failure: dead server monitor fails to increment server count after a new server is started

2022-05-02 Thread Bill Burcham (Jira)
Bill Burcham created GEODE-10271:


 Summary: CI failure: dead server monitor fails to increment server 
count after a new server is started
 Key: GEODE-10271
 URL: https://issues.apache.org/jira/browse/GEODE-10271
 Project: Geode
  Issue Type: Bug
  Components: client/server
Affects Versions: 1.15.0
Reporter: Bill Burcham


[https://hydradb.hdb.gemfire-ci.info/hdb/testresult/14919517]

 
{noformat}
> Task :geode-core:integrationTest

ConnectionProxyJUnitTest > testDeadServerMonitorPingNature1 FAILED
org.awaitility.core.ConditionTimeoutException: Assertion condition defined 
as a lambda expression in 
org.apache.geode.internal.cache.tier.sockets.ConnectionProxyJUnitTest 
expected:<1> but was:<0> within 5 minutes.
at org.awaitility.core.ConditionAwaiter.await(ConditionAwaiter.java:167)
at 
org.awaitility.core.AssertionCondition.await(AssertionCondition.java:119)
at 
org.awaitility.core.AssertionCondition.await(AssertionCondition.java:31)
at org.awaitility.core.ConditionFactory.until(ConditionFactory.java:985)
at 
org.awaitility.core.ConditionFactory.untilAsserted(ConditionFactory.java:769)
at 
org.apache.geode.internal.cache.tier.sockets.ConnectionProxyJUnitTest.testDeadServerMonitorPingNature1(ConnectionProxyJUnitTest.java:246)

Caused by:
java.lang.AssertionError: expected:<1> but was:<0>
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.failNotEquals(Assert.java:835)
at org.junit.Assert.assertEquals(Assert.java:647)
at org.junit.Assert.assertEquals(Assert.java:633)
at 
org.apache.geode.internal.cache.tier.sockets.ConnectionProxyJUnitTest.lambda$testDeadServerMonitorPingNature1$0(ConnectionProxyJUnitTest.java:247)

4053 tests completed, 1 failed, 84 skipped

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=  Test Results URI 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
http://files.apachegeode-ci.info/builds/apache-develop-main/1.15.0-build.1131/test-results/integrationTest/1651501470/
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Test report artifacts from this job are available at:

http://files.apachegeode-ci.info/builds/apache-develop-main/1.15.0-build.1131/test-artifacts/1651501470/integrationtestfiles-openjdk8-1.15.0-build.1131.tgz{noformat}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (GEODE-9402) Automatic Reconnect Failure: Address already in use

2022-04-27 Thread Bill Burcham (Jira)


[ 
https://issues.apache.org/jira/browse/GEODE-9402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529123#comment-17529123
 ] 

Bill Burcham commented on GEODE-9402:
-

A shortcoming of my testing around this problem is that my new test/experiment 
isn't starting the cache server from cache XML. I notice we have a test for 
that scenario, ReconnectWithCacheXMLDUnitTest, and that test mentions 
GEODE-2732. If you look at that ticket you'll see a BindException. It's got me 
thinking that perhaps a problem (this problem) remains when reconnecting from a 
server started with cache XML.

> Automatic Reconnect Failure: Address already in use
> ---
>
> Key: GEODE-9402
> URL: https://issues.apache.org/jira/browse/GEODE-9402
> Project: Geode
>  Issue Type: Bug
>  Components: membership
>Reporter: Juan Ramos
>Assignee: Bill Burcham
>Priority: Major
> Attachments: cluster_logs_gke_latest_54.zip, cluster_logs_pks_121.zip
>
>
> There are 2 locators and 4 servers during the test. Once they're all up and 
> running, the test drops the network connectivity between all members to 
> generate a full network partition and cause all members to shut down and go 
> into reconnect mode. Upon reaching the mentioned state, the test 
> automatically restores the network connectivity and expects all members to 
> automatically go up again and re-form the distributed system.
>  This works fine most of the time, and we see every member successfully 
> reconnecting to the distributed system:
> {noformat}
> [info 2021/06/23 15:58:12.981 GMT gemfire-cluster-locator-0  
> tid=0x87] Reconnect completed.
> [info 2021/06/23 15:58:14.726 GMT gemfire-cluster-locator-1  
> tid=0x86] Reconnect completed.
> [info 2021/06/23 15:58:46.702 GMT gemfire-cluster-server-0  
> tid=0x94] Reconnect completed.
> [info 2021/06/23 15:58:46.485 GMT gemfire-cluster-server-1  
> tid=0x96] Reconnect completed.
> [info 2021/06/23 15:58:46.273 GMT gemfire-cluster-server-2  
> tid=0x97] Reconnect completed.
> [info 2021/06/23 15:58:46.902 GMT gemfire-cluster-server-3  
> tid=0x95] Reconnect completed.
> {noformat}
> On rare occasions, though, one of the servers fails during the reconnect 
> phase with the following exception:
> {noformat}
> [error 2021/06/09 18:48:52.872 GMT gemfire-cluster-server-1  
> tid=0x91] Cache initialization for GemFireCache[id = 575310555; isClosing = 
> false; isShutDownAll = false; created = Wed Jun 09 18:46:49 GMT 2021; server 
> = false; copyOnRead = false; lockLease = 120; lockTimeout = 60] failed 
> because:
> org.apache.geode.GemFireIOException: While starting cache server CacheServer 
> on port=40404 client subscription config policy=none client subscription 
> config capacity=1 client subscription config overflow directory=.
>   at 
> org.apache.geode.internal.cache.xmlcache.CacheCreation.startCacheServers(CacheCreation.java:800)
>   at 
> org.apache.geode.internal.cache.xmlcache.CacheCreation.create(CacheCreation.java:599)
>   at 
> org.apache.geode.internal.cache.xmlcache.CacheXmlParser.create(CacheXmlParser.java:339)
>   at 
> org.apache.geode.internal.cache.GemFireCacheImpl.loadCacheXml(GemFireCacheImpl.java:4207)
>   at 
> org.apache.geode.internal.cache.ClusterConfigurationLoader.applyClusterXmlConfiguration(ClusterConfigurationLoader.java:197)
>   at 
> org.apache.geode.internal.cache.GemFireCacheImpl.applyJarAndXmlFromClusterConfig(GemFireCacheImpl.java:1497)
>   at 
> org.apache.geode.internal.cache.GemFireCacheImpl.initialize(GemFireCacheImpl.java:1449)
>   at 
> org.apache.geode.internal.cache.InternalCacheBuilder.create(InternalCacheBuilder.java:191)
>   at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.reconnect(InternalDistributedSystem.java:2668)
>   at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect(InternalDistributedSystem.java:2426)
>   at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1277)
>   at 
> org.apache.geode.distributed.internal.ClusterDistributionManager$DMListener.membershipFailure(ClusterDistributionManager.java:2315)
>   at 
> org.apache.geode.distributed.internal.membership.gms.GMSMembership.uncleanShutdown(GMSMembership.java:1183)
>   at 
> org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.lambda$forceDisconnect$0(GMSMembership.java:1807)
>   at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: java.net.BindException: Address already in use (Bind failed)
>   at java.base/java.net.PlainSocketImpl.socketBind(Native Method)
>   at 
> java.base/java.net.AbstractPlainSocketImpl.bind(AbstractPlainSocketImpl.java:436)
>   at java.base/java.net.ServerSoc

[jira] [Updated] (GEODE-10122) With TLSv1.3 and GCM-based cipher (the default), P2P Messaging Fails When Encrypted Data Limit is Reached

2022-04-27 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham updated GEODE-10122:
-
Fix Version/s: 1.12.10

> With TLSv1.3 and GCM-based cipher (the default), P2P Messaging Fails When 
> Encrypted Data Limit is Reached
> -
>
> Key: GEODE-10122
> URL: https://issues.apache.org/jira/browse/GEODE-10122
> Project: Geode
>  Issue Type: Bug
>  Components: messaging
>Affects Versions: 1.12.0, 1.13.0, 1.14.0, 1.15.0
>Reporter: Bill Burcham
>Assignee: Bill Burcham
>Priority: Major
>  Labels: blocks-1.15.0, pull-request-available, ssl
> Fix For: 1.12.10, 1.13.9, 1.14.5, 1.15.0
>
> Attachments: patch-P2PMessagingConcurrencyDUnitTest.txt
>
>
> TLSv1.3 introduced [1] the ability to set per-algorithm limits on symmetric 
> key usage lifetimes. Once a certain number of bytes have been encrypted, a 
> KeyUpdate post-handshake message [2] is sent.
> With default settings, on Liberica JDK 11, Geode's P2P framework will 
> negotiate TLSv1.3 with the TLS_AES_256_GCM_SHA384 cipher suite. Geode P2P 
> messaging will eventually fail, with a "Tag mismatch!" IOException in shared 
> ordered receivers, after a session has been in heavy use for days.
> We have not seen this failure on TLSv1.2.
> The implementation of TLSv1.3 in the Java runtime provides a security 
> property [3] to configure the encrypted data limit. The attached patch to 
> P2PMessagingConcurrencyDUnitTest configures the limit large enough that the 
> test makes it through the (P2P) TLS handshake but small enough so that the 
> "Tag mismatch!" exception is encountered less than a minute later.
> The bug is caused by Geode’s NioSslEngine class’ ignorance of the 
> “rehandshaking” phase of the TLS protocol [4]:
>     Creation - ready to be configured.
>     Initial handshaking - perform authentication and negotiate communication 
> parameters.
>     Application data - ready for application exchange.
>     *Rehandshaking* - renegotiate communications parameters/authentication; 
> handshaking data may be mixed with application data.
>     Closure - ready to shut down connection.
> Geode's tcp.Connection and NioSslEngine classes (particularly wrap() and 
> unwrap()), as they are currently implemented, fail to fully attend to the 
> handshake status from javax.net.ssl.SSLEngine. As a result these Geode 
> classes fail to respond to the KeyUpdate message, resulting in the "Tag 
> mismatch!" IOException.
> When that exception is encountered, the Connection is destroyed and a new one 
> created in its place. But users of the old Connection, waiting for 
> acknowledgements, will never receive them. This can result in cluster-wide 
> hangs.
> [1] [https://datatracker.ietf.org/doc/html/rfc8446#section-5.5]
> [2] 
> [https://www.ibm.com/docs/en/sdk-java-technology/8?topic=handshake-post-messages]
>  
> [3] 
> [https://docs.oracle.com/en/java/javase/11/security/java-secure-socket-extension-jsse-reference-guide.html#GUID-B970ADD6-1E9F-4C18-A26E-0679B50CC946]
> [4] [https://www.ibm.com/docs/en/sdk-java-technology/7.1?topic=sslengine-]
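
For illustration, a minimal sketch (not Geode's actual NioSslEngine code) of the kind of handshake-status handling the ticket says is missing: after each unwrap() the SSLEngineResult's handshake status is consulted, so a post-handshake message such as KeyUpdate is processed instead of surfacing later as a "Tag mismatch!" IOException. Buffer overflow/underflow handling is omitted for brevity.

{noformat}
import java.nio.ByteBuffer;
import javax.net.ssl.SSLEngine;
import javax.net.ssl.SSLEngineResult;
import javax.net.ssl.SSLEngineResult.HandshakeStatus;
import javax.net.ssl.SSLException;

// Hypothetical sketch of a rehandshake-aware unwrap loop.
class RehandshakeAwareUnwrap {
  static void unwrapAll(SSLEngine engine, ByteBuffer netIn, ByteBuffer appOut)
      throws SSLException {
    while (netIn.hasRemaining()) {
      SSLEngineResult result = engine.unwrap(netIn, appOut);
      HandshakeStatus hs = result.getHandshakeStatus();
      if (hs == HandshakeStatus.NEED_TASK) {
        // Run delegated tasks (e.g. processing a KeyUpdate) before continuing.
        Runnable task;
        while ((task = engine.getDelegatedTask()) != null) {
          task.run();
        }
      } else if (hs == HandshakeStatus.NEED_WRAP) {
        // The engine has handshake data to send back (e.g. a KeyUpdate of its
        // own); the caller must wrap() and flush it before unwrapping more.
        break;
      }
      // NEED_UNWRAP / NOT_HANDSHAKING / FINISHED: keep unwrapping as usual.
    }
  }
}
{noformat}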



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (GEODE-10122) With TLSv1.3 and GCM-based cipher (the default), P2P Messaging Fails When Encrypted Data Limit is Reached

2022-04-26 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham updated GEODE-10122:
-
Affects Version/s: 1.14.0
   1.13.0
   1.12.0
   (was: 1.13.7)
   (was: 1.14.3)
   (was: 1.12.9)

> With TLSv1.3 and GCM-based cipher (the default), P2P Messaging Fails When 
> Encrypted Data Limit is Reached
> -
>
> Key: GEODE-10122
> URL: https://issues.apache.org/jira/browse/GEODE-10122
> Project: Geode
>  Issue Type: Bug
>  Components: messaging
>Affects Versions: 1.12.0, 1.13.0, 1.14.0, 1.15.0
>Reporter: Bill Burcham
>Assignee: Bill Burcham
>Priority: Major
>  Labels: blocks-1.15.0, pull-request-available, ssl
> Fix For: 1.13.9, 1.14.5, 1.15.0
>
> Attachments: patch-P2PMessagingConcurrencyDUnitTest.txt
>
>
> TLSv1.3 introduced [1] the ability to set per-algorithm limits on symmetric 
> key usage lifetimes. Once a certain number of bytes have been encrypted, a 
> KeyUpdate post-handshake message [2] is sent.
> With default settings, on Liberica JDK 11, Geode's P2P framework will 
> negotiate TLSv1.3 with the TLS_AES_256_GCM_SHA384 cipher suite. Geode P2P 
> messaging will eventually fail, with a "Tag mismatch!" IOException in shared 
> ordered receivers, after a session has been in heavy use for days.
> We have not seen this failure on TLSv1.2.
> The implementation of TLSv1.3 in the Java runtime provides a security 
> property [3] to configure the encrypted data limit. The attached patch to 
> P2PMessagingConcurrencyDUnitTest configures the limit large enough that the 
> test makes it through the (P2P) TLS handshake but small enough so that the 
> "Tag mismatch!" exception is encountered less than a minute later.
> The bug is caused by Geode’s NioSslEngine class’ ignorance of the 
> “rehandshaking” phase of the TLS protocol [4]:
>     Creation - ready to be configured.
>     Initial handshaking - perform authentication and negotiate communication 
> parameters.
>     Application data - ready for application exchange.
>     *Rehandshaking* - renegotiate communications parameters/authentication; 
> handshaking data may be mixed with application data.
>     Closure - ready to shut down connection.
> Geode's tcp.Connection and NioSslEngine classes (particularly wrap() and 
> unwrap()), as they are currently implemented, fail to fully attend to the 
> handshake status from javax.net.ssl.SSLEngine. As a result these Geode 
> classes fail to respond to the KeyUpdate message, resulting in the "Tag 
> mismatch!" IOException.
> When that exception is encountered, the Connection is destroyed and a new one 
> created in its place. But users of the old Connection, waiting for 
> acknowledgements, will never receive them. This can result in cluster-wide 
> hangs.
> [1] [https://datatracker.ietf.org/doc/html/rfc8446#section-5.5]
> [2] 
> [https://www.ibm.com/docs/en/sdk-java-technology/8?topic=handshake-post-messages]
>  
> [3] 
> [https://docs.oracle.com/en/java/javase/11/security/java-secure-socket-extension-jsse-reference-guide.html#GUID-B970ADD6-1E9F-4C18-A26E-0679B50CC946]
> [4] [https://www.ibm.com/docs/en/sdk-java-technology/7.1?topic=sslengine-]



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (GEODE-10122) With TLSv1.3 and GCM-based cipher (the default), P2P Messaging Fails When Encrypted Data Limit is Reached

2022-04-26 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham updated GEODE-10122:
-
Affects Version/s: 1.12.9

> With TLSv1.3 and GCM-based cipher (the default), P2P Messaging Fails When 
> Encrypted Data Limit is Reached
> -
>
> Key: GEODE-10122
> URL: https://issues.apache.org/jira/browse/GEODE-10122
> Project: Geode
>  Issue Type: Bug
>  Components: messaging
>Affects Versions: 1.12.9, 1.13.7, 1.14.3, 1.15.0
>Reporter: Bill Burcham
>Assignee: Bill Burcham
>Priority: Major
>  Labels: blocks-1.15.0, pull-request-available, ssl
> Fix For: 1.13.9, 1.14.5, 1.15.0
>
> Attachments: patch-P2PMessagingConcurrencyDUnitTest.txt
>
>
> TLSv1.3 introduced [1] the ability to set per-algorithm limits on symmetric 
> key usage lifetimes. Once a certain number of bytes have been encrypted, a 
> KeyUpdate post-handshake message [2] is sent.
> With default settings, on Liberica JDK 11, Geode's P2P framework will 
> negotiate TLSv1.3 with the TLS_AES_256_GCM_SHA384 cipher suite. Geode P2P 
> messaging will eventually fail, with a "Tag mismatch!" IOException in shared 
> ordered receivers, after a session has been in heavy use for days.
> We have not seen this failure on TLSv1.2.
> The implementation of TLSv1.3 in the Java runtime provides a security 
> property [3] to configure the encrypted data limit. The attached patch to 
> P2PMessagingConcurrencyDUnitTest configures the limit large enough that the 
> test makes it through the (P2P) TLS handshake but small enough so that the 
> "Tag mismatch!" exception is encountered less than a minute later.
> The bug is caused by Geode’s NioSslEngine class’ ignorance of the 
> “rehandshaking” phase of the TLS protocol [4]:
>     Creation - ready to be configured.
>     Initial handshaking - perform authentication and negotiate communication 
> parameters.
>     Application data - ready for application exchange.
>     *Rehandshaking* - renegotiate communications parameters/authentication; 
> handshaking data may be mixed with application data.
>     Closure - ready to shut down connection.
> Geode's tcp.Connection and NioSslEngine classes (particularly wrap() and 
> unwrap()), as they are currently implemented, fail to fully attend to the 
> handshake status from javax.net.ssl.SSLEngine. As a result these Geode 
> classes fail to respond to the KeyUpdate message, resulting in the "Tag 
> mismatch!" IOException.
> When that exception is encountered, the Connection is destroyed and a new one 
> created in its place. But users of the old Connection, waiting for 
> acknowledgements, will never receive them. This can result in cluster-wide 
> hangs.
> [1] [https://datatracker.ietf.org/doc/html/rfc8446#section-5.5]
> [2] 
> [https://www.ibm.com/docs/en/sdk-java-technology/8?topic=handshake-post-messages]
>  
> [3] 
> [https://docs.oracle.com/en/java/javase/11/security/java-secure-socket-extension-jsse-reference-guide.html#GUID-B970ADD6-1E9F-4C18-A26E-0679B50CC946]
> [4] [https://www.ibm.com/docs/en/sdk-java-technology/7.1?topic=sslengine-]



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (GEODE-10122) With TLSv1.3 and GCM-based cipher (the default), P2P Messaging Fails When Encrypted Data Limit is Reached

2022-04-26 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham updated GEODE-10122:
-
Fix Version/s: 1.13.9
   1.14.5

> With TLSv1.3 and GCM-based cipher (the default), P2P Messaging Fails When 
> Encrypted Data Limit is Reached
> -
>
> Key: GEODE-10122
> URL: https://issues.apache.org/jira/browse/GEODE-10122
> Project: Geode
>  Issue Type: Bug
>  Components: messaging
>Affects Versions: 1.13.7, 1.14.3, 1.15.0
>Reporter: Bill Burcham
>Assignee: Bill Burcham
>Priority: Major
>  Labels: blocks-1.15.0, pull-request-available, ssl
> Fix For: 1.13.9, 1.14.5, 1.15.0
>
> Attachments: patch-P2PMessagingConcurrencyDUnitTest.txt
>
>
> TLSv1.3 introduced [1] the ability to set per-algorithm limits on symmetric 
> key usage lifetimes. Once a certain number of bytes have been encrypted, a 
> KeyUpdate post-handshake message [2] is sent.
> With default settings, on Liberica JDK 11, Geode's P2P framework will 
> negotiate TLSv1.3 with the TLS_AES_256_GCM_SHA384 cipher suite. Geode P2P 
> messaging will eventually fail, with a "Tag mismatch!" IOException in shared 
> ordered receivers, after a session has been in heavy use for days.
> We have not seen this failure on TLSv1.2.
> The implementation of TLSv1.3 in the Java runtime provides a security 
> property [3] to configure the encrypted data limit. The attached patch to 
> P2PMessagingConcurrencyDUnitTest configures the limit large enough that the 
> test makes it through the (P2P) TLS handshake but small enough so that the 
> "Tag mismatch!" exception is encountered less than a minute later.
> The bug is caused by Geode’s NioSslEngine class’ ignorance of the 
> “rehandshaking” phase of the TLS protocol [4]:
>     Creation - ready to be configured.
>     Initial handshaking - perform authentication and negotiate communication 
> parameters.
>     Application data - ready for application exchange.
>     *Rehandshaking* - renegotiate communications parameters/authentication; 
> handshaking data may be mixed with application data.
>     Closure - ready to shut down connection.
> Geode's tcp.Connection and NioSslEngine classes (particularly wrap() and 
> unwrap()), as they are currently implemented, fail to fully attend to the 
> handshake status from javax.net.ssl.SSLEngine. As a result these Geode 
> classes fail to respond to the KeyUpdate message, resulting in the "Tag 
> mismatch!" IOException.
> When that exception is encountered, the Connection is destroyed and a new one 
> created in its place. But users of the old Connection, waiting for 
> acknowledgements, will never receive them. This can result in cluster-wide 
> hangs.
> [1] [https://datatracker.ietf.org/doc/html/rfc8446#section-5.5]
> [2] 
> [https://www.ibm.com/docs/en/sdk-java-technology/8?topic=handshake-post-messages]
>  
> [3] 
> [https://docs.oracle.com/en/java/javase/11/security/java-secure-socket-extension-jsse-reference-guide.html#GUID-B970ADD6-1E9F-4C18-A26E-0679B50CC946]
> [4] [https://www.ibm.com/docs/en/sdk-java-technology/7.1?topic=sslengine-]



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (GEODE-8506) BufferPool returns byte buffers that may be much larger than requested

2022-04-26 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-8506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham updated GEODE-8506:

Description: 
BufferPool manages several pools of direct-memory ByteBuffers.  When asked for 
a ByteBuffer of size X you may receive a buffer that is any size greater than 
or equal to X.  For users of this pool this is unexpected behavior and is 
causing some trouble.

MsgStreamer, for instance, performs message "chunking" based on the size of a 
socket's buffer size.  It requests a byte buffer of that size and then fills it 
over and over again with message chunks to be written to the socket.  But it 
does this based on the buffer's capacity, which may be much larger than the 
expected buffer size.  This results in incorrect chunking and requires larger 
buffers in the receiver of these message chunks.

BufferPool should always return a buffer that has exactly the requested 
capacity.  It could be a _slice_ of a pooled buffer, for instance.  That would 
let it hand out a larger buffer while not confusing the code that requested the 
buffer.

  was:
BufferPool manages several pools of direct-memory ByteBuffers.  When asked for 
a ByteBuffer of size X you may receive a buffer that is any size greater than 
or equal to X.  For users of this pool this is unexpected behavior and is 
causing some trouble.

MessageStreamer, for instance, performs message "chunking" based on the size of 
a socket's buffer size.  It requests a byte buffer of that size and then fills 
it over and over again with message chunks to be written to the socket.  But it 
does this based on the buffer's capacity, which may be much larger than the 
expected buffer size.  This results in incorrect chunking and requires larger 
buffers in the receiver of these message chunks.

BufferPool should always return a buffer that has exactly the requested 
capacity.  It could be a _slice_ of a pooled buffer, for instance.  That would 
let it hand out a larger buffer while not confusing the code that requested the 
buffer.


> BufferPool returns byte buffers that may be much larger than requested
> --
>
> Key: GEODE-8506
> URL: https://issues.apache.org/jira/browse/GEODE-8506
> Project: Geode
>  Issue Type: Improvement
>  Components: membership
>Reporter: Bruce J Schuchardt
>Assignee: Bruce J Schuchardt
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.12.1, 1.13.1, 1.14.0
>
>
> BufferPool manages several pools of direct-memory ByteBuffers.  When asked 
> for a ByteBuffer of size X you may receive a buffer that is any size greater 
> than or equal to X.  For users of this pool this is unexpected behavior and 
> is causing some trouble.
> MsgStreamer, for instance, performs message "chunking" based on the size of a 
> socket's buffer size.  It requests a byte buffer of that size and then fills 
> it over and over again with message chunks to be written to the socket.  But 
> it does this based on the buffer's capacity, which may be much larger than 
> the expected buffer size.  This results in incorrect chunking and requires 
> larger buffers in the receiver of these message chunks.
> BufferPool should always return a buffer that has exactly the requested 
> capacity.  It could be a _slice_ of a pooled buffer, for instance.  That 
> would let it hand out a larger buffer while not confusing the code that 
> requested the buffer.
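
For illustration, a minimal sketch of the slicing idea (names are illustrative, not the BufferPool API): carve an exact-capacity view out of the larger pooled buffer so callers that size their chunks by capacity() see exactly what they asked for.

{noformat}
import java.nio.ByteBuffer;

// Hypothetical sketch: hand out a view whose capacity() equals the requested size.
class ExactCapacityBuffers {
  static ByteBuffer exactSlice(ByteBuffer pooled, int requestedSize) {
    pooled.clear();
    pooled.limit(requestedSize);
    return pooled.slice(); // capacity() == requestedSize; shares the pooled memory
  }
}
{noformat}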



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (GEODE-9402) Automatic Reconnect Failure: Address already in use

2022-04-21 Thread Bill Burcham (Jira)


[ 
https://issues.apache.org/jira/browse/GEODE-9402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526130#comment-17526130
 ] 

Bill Burcham commented on GEODE-9402:
-

Here’s a draft PR with my experiments: 
[https://github.com/apache/geode/pull/7614]

(In my testing I enabled TLS for all components. I don’t think it matters for 
this ticket but it’s become a habit.)

I wrote a test that starts a three-member cluster, binds a server socket to 
port X, calls geode.cache.Cache.addCacheServer() to create a CacheServer, 
calls setPort(X) on it, and then calls start(). Here’s the 
exception I get:

{{BGB caught: java.net.BindException: Address already in use (Bind failed)
at java.net.PlainSocketImpl.socketBind(Native Method)
at java.net.AbstractPlainSocketImpl.bind(AbstractPlainSocketImpl.java:387)
at java.net.ServerSocket.bind(ServerSocket.java:390)
at 
org.apache.geode.internal.net.SCClusterSocketCreator.createServerSocket(SCClusterSocketCreator.java:79)
at 
org.apache.geode.internal.net.SocketCreator.createServerSocket(SocketCreator.java:491)
at 
org.apache.geode.internal.cache.tier.sockets.AcceptorImpl.<init>(AcceptorImpl.java:574)
at 
org.apache.geode.internal.cache.tier.sockets.AcceptorBuilder.create(AcceptorBuilder.java:291)
at 
org.apache.geode.internal.cache.CacheServerImpl.createAcceptor(CacheServerImpl.java:421)
at 
org.apache.geode.internal.cache.CacheServerImpl.start(CacheServerImpl.java:378)
at 
org.apache.geode.cache30.ReconnectWithTlsAndClientsCacheServerDistributedTest.startClientsCacheServer(ReconnectWithTlsAndClientsCacheServerDistributedTest.java:126)
at 
org.apache.geode.cache30.ReconnectWithTlsAndClientsCacheServerDistributedTest.disconnectAndReconnectTest(ReconnectWithTlsAndClientsCacheServerDistributedTest.java:105)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)}}

Part of that stack trace, from the exception to CacheServerImpl.start, matches 
the stack trace from GEM-3359. The test does not create the cache from cache 
XML (e.g. ClusterConfigurationLoader.applyClusterXmlConfiguration()) as 
described in the ticket, however. *This may be an area we want to explore 
further.*
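
For reference, a minimal sketch of the pre-bind scenario described above (the test scaffolding and port value are illustrative; Cache.addCacheServer(), CacheServer.setPort() and CacheServer.start() are the public Geode calls named in the comment):

{noformat}
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import org.apache.geode.cache.Cache;
import org.apache.geode.cache.CacheFactory;
import org.apache.geode.cache.server.CacheServer;

// Hypothetical sketch: occupy port X first, then start a CacheServer on the same port.
class PreBindSketch {
  static void reproduce() throws Exception {
    int portX = 20009; // illustrative port
    try (ServerSocket preBound = new ServerSocket()) {
      preBound.bind(new InetSocketAddress(portX));

      Cache cache = new CacheFactory().create();
      CacheServer server = cache.addCacheServer();
      server.setPort(portX);
      server.start(); // expected to fail with java.net.BindException: Address already in use
    }
  }
}
{noformat}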

By explicitly causing the bind exception (in my new 
preBindToClientsCacheServerPortTest() test) I can see that the AcceptorImpl 
constructor is retrying when it encounters the BindException (a 
SocketException). It’ll repeatedly try to create the server socket for 120 
seconds (CacheServerImpl.getTimeLimitMillis()), sleeping 1 second in between 
tries. This is also true of the code path described by the stack trace in the 
ticket.
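
A rough sketch of the retry behavior just described (hypothetical code, not the actual AcceptorImpl logic): keep trying to bind for up to the time limit, sleeping one second between attempts.

{noformat}
import java.io.IOException;
import java.net.ServerSocket;

// Hypothetical sketch of bind-with-retry, as described above.
class BindRetrySketch {
  static ServerSocket bindWithRetry(int port, long timeLimitMillis)
      throws IOException, InterruptedException {
    long deadline = System.currentTimeMillis() + timeLimitMillis; // e.g. ~120,000 ms
    while (true) {
      try {
        return new ServerSocket(port);
      } catch (IOException bindFailure) {
        if (System.currentTimeMillis() >= deadline) {
          throw bindFailure; // give up once the time limit is exceeded
        }
        Thread.sleep(1000); // sleep 1 second between tries
      }
    }
  }
}
{noformat}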

Calling ServerSocket.setReuseAddress(true) when I bind to port X does not 
eliminate the bind exception. From the documentation:

Enabling SO_REUSEADDR prior to binding the socket using bind(SocketAddress) 
allows the socket to be bound even though a previous connection is in a timeout 
state.

This setting only allows something else to bind to the port when the original 
socket is in the timeout state. A socket not in the timeout state, bound to a 
port, simply monopolizes that port. The short of it is that 
setReuseAddress(true) is helpful for addressing certain race conditions but it 
can’t address them all.

I did confirm that Geode does always call setReuseAddress(true) whenever 
creating a server socket for a SocketCreator:

non-TLS case:

SocketCreator.createServerSocket()

TLS case:

SCClusterSocketCreator.createServerSocket()

I’ve got a test (disconnectAndReconnectTest()) that enables TLS for all Geode 
components (including clients) and creates a three-member cluster. Then it 
repeatedly starts a client’s CacheServer (bound to port X), crashes the 
distributed system via MembershipManagerHelper.crashDistributedSystem() and 
verifies that the disconnected member reconnects. I haven’t been able to 
reproduce the problem with this test.

This is not exactly the way the forced-disconnect was generated in GEM-3359. In 
that case a network partition caused the forced-disconnection. *This may be an 
area we want to explore further.*

Searching for asynchrony that could lead to a race condition, I took a look at 
GMSMembership.ManagerImpl.forceDisconnect(). When that calls 
uncleanShutdownDS(), a thread is spawned to do the actual work of shutting down 
the distributed system. Inserting a 30-second delay at the start of that 
thread’s task (run()) did not reproduce GEM-3359.

The path from uncleanShutdownDS() that actually leads to closing the client’s 
CacheServer’s ServerSocket can be seen in this stack trace:

{{BGB in AcceptorImpl.close() closing server socket bound to port: 20009, 
java.lang.Throwable
at 
org.apache.geode.internal.cache.tier.sockets.AcceptorImpl.close(AcceptorImpl.java:1617)
at 
org.apache.geode.internal.cache.CacheServerImpl.stop(CacheServerImpl.java:485)
at 
org.apache.geode.internal.cache.GemF

[jira] [Comment Edited] (GEODE-10236) Compatibility issues while upgrading Jgroups to versions 4.0+

2022-04-14 Thread Bill Burcham (Jira)


[ 
https://issues.apache.org/jira/browse/GEODE-10236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17522437#comment-17522437
 ] 

Bill Burcham edited comment on GEODE-10236 at 4/14/22 5:34 PM:
---

I agree with [~abaker] . If you want to see the JGroups protocol stack used in 
Geode (membership) communication it's primarily here:

[https://github.com/apache/geode/blob/develop/geode-membership/src/main/resources/org/apache/geode/distributed/internal/membership/gms/messenger/jgroups-config.xml]

There is also a multicast protocol stack here:

[https://github.com/apache/geode/blob/develop/geode-membership/src/main/resources/org/apache/geode/distributed/internal/membership/gms/messenger/jgroups-mcast.xml]

Neither mentions the deprecated ENCRYPT protocol/layer or the AUTH 
protocol/layer.


was (Author: bburcham):
I agree with [~abaker] . If you want to see the JGroups protocol stack use in 
Geode (membership) communication it's primarily here:

[https://github.com/apache/geode/blob/develop/geode-membership/src/main/resources/org/apache/geode/distributed/internal/membership/gms/messenger/jgroups-config.xml]


There is also a multicast protocol stack here:

[https://github.com/apache/geode/blob/develop/geode-membership/src/main/resources/org/apache/geode/distributed/internal/membership/gms/messenger/jgroups-mcast.xml]

Neither mentions the deprecated ENCRYPT protocol/layer or the AUTH 
protocol/layer.

> Compatibility issues while upgrading Jgroups to versions 4.0+
> -
>
> Key: GEODE-10236
> URL: https://issues.apache.org/jira/browse/GEODE-10236
> Project: Geode
>  Issue Type: Bug
>Affects Versions: 1.14.4
>Reporter: Rohan Jagtap
>Priority: Major
>  Labels: needsTriage
>
> According to a recent CVE: 
> {quote}CVE-2016-2141
> NVD: 2016/06/30 - CVSS v2 Base Score: 7.5 - CVSS v3.1 Base Score: 9.8
> JGroups before 4.0 does not require the proper headers for the ENCRYPT and 
> AUTH protocols from nodes joining the cluster, which allows remote attackers 
> to bypass security restrictions and send and receive messages within the 
> cluster via unspecified vectors.
>  
> {quote}
> Hence we intend to upgrade jgroups to a recommended version.
> However, even the latest version of apache geode ([geode-core 
> 1.14.4|https://mvnrepository.com/artifact/org.apache.geode/geode-core/1.14.4])
>  uses jgroups 3.6.14 which has the aforementioned vulnerability.
> Overriding the jgroups dependency to anything over 4.0+ gives the following 
> issue on running:
> {{Caused by: org.springframework.beans.factory.BeanCreationException: Error 
> creating bean with name 'gemfireCache': FactoryBean threw exception on object 
> creation; nested exception is java.lang.ExceptionInInitializerError}}
> {{        at 
> org.springframework.beans.factory.support.FactoryBeanRegistrySupport.doGetObjectFromFactoryBean(FactoryBeanRegistrySupport.java:176)}}
> {{        at 
> org.springframework.beans.factory.support.FactoryBeanRegistrySupport.getObjectFromFactoryBean(FactoryBeanRegistrySupport.java:101)}}
> {{        at 
> org.springframework.beans.factory.support.AbstractBeanFactory.getObjectForBeanInstance(AbstractBeanFactory.java:1828)}}
> {{        at 
> org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.getObjectForBeanInstance(AbstractAutowireCapableBeanFactory.java:1265)}}
> {{        at 
> org.springframework.beans.factory.support.AbstractBeanFactory.doGetBean(AbstractBeanFactory.java:334)}}
> {{        at 
> org.springframework.beans.factory.support.AbstractBeanFactory.getBean(AbstractBeanFactory.java:202)}}
> {{        at 
> org.springframework.beans.factory.support.BeanDefinitionValueResolver.resolveReference(BeanDefinitionValueResolver.java:330)}}
> {{        ... 32 common frames omitted}}
> {{Caused by: java.lang.ExceptionInInitializerError: null}}
> {{        at 
> org.apache.geode.distributed.internal.membership.gms.Services.(Services.java:155)}}
> {{        at 
> org.apache.geode.distributed.internal.membership.gms.MembershipBuilderImpl.create(MembershipBuilderImpl.java:114)}}
> {{        at 
> org.apache.geode.distributed.internal.DistributionImpl.(DistributionImpl.java:150)}}
> {{        at 
> org.apache.geode.distributed.internal.DistributionImpl.createDistribution(DistributionImpl.java:217)}}
> {{        at 
> org.apache.geode.distributed.internal.ClusterDistributionManager.(ClusterDistributionManager.java:464)}}
> {{        at 
> org.apache.geode.distributed.internal.ClusterDistributionManager.(ClusterDistributionManager.java:497)}}
> {{        at 
> org.apache.geode.distributed.internal.ClusterDistributionManager.create(ClusterDistributionManager.java:326)}}
> {{        at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.initialize(I

[jira] [Commented] (GEODE-10236) Compatibility issues while upgrading Jgroups to versions 4.0+

2022-04-14 Thread Bill Burcham (Jira)


[ 
https://issues.apache.org/jira/browse/GEODE-10236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17522437#comment-17522437
 ] 

Bill Burcham commented on GEODE-10236:
--

I agree with [~abaker] . If you want to see the JGroups protocol stack use in 
Geode (membership) communication it's primarily here:

[https://github.com/apache/geode/blob/develop/geode-membership/src/main/resources/org/apache/geode/distributed/internal/membership/gms/messenger/jgroups-config.xml]


There is also a multicast protocol stack here:

[https://github.com/apache/geode/blob/develop/geode-membership/src/main/resources/org/apache/geode/distributed/internal/membership/gms/messenger/jgroups-mcast.xml]

Neither mentions the deprecated ENCRYPT protocol/layer or the AUTH 
protocol/layer.

> Compatibility issues while upgrading Jgroups to versions 4.0+
> -
>
> Key: GEODE-10236
> URL: https://issues.apache.org/jira/browse/GEODE-10236
> Project: Geode
>  Issue Type: Bug
>Affects Versions: 1.14.4
>Reporter: Rohan Jagtap
>Priority: Major
>  Labels: needsTriage
>
> According to a recent CVE: 
> {quote}CVE-2016-2141
> NVD: 2016/06/30 - CVSS v2 Base Score: 7.5 - CVSS v3.1 Base Score: 9.8
> JGroups before 4.0 does not require the proper headers for the ENCRYPT and 
> AUTH protocols from nodes joining the cluster, which allows remote attackers 
> to bypass security restrictions and send and receive messages within the 
> cluster via unspecified vectors.
>  
> {quote}
> Hence we intend to upgrade jgroups to a recommended version.
> However, even the latest version of apache geode ([geode-core 
> 1.14.4|https://mvnrepository.com/artifact/org.apache.geode/geode-core/1.14.4])
>  uses jgroups 3.6.14 which has the aforementioned vulnerability.
> Overriding the jgroups dependency to anything over 4.0+ gives the following 
> issue on running:
> {{Caused by: org.springframework.beans.factory.BeanCreationException: Error 
> creating bean with name 'gemfireCache': FactoryBean threw exception on object 
> creation; nested exception is java.lang.ExceptionInInitializerError}}
> {{        at 
> org.springframework.beans.factory.support.FactoryBeanRegistrySupport.doGetObjectFromFactoryBean(FactoryBeanRegistrySupport.java:176)}}
> {{        at 
> org.springframework.beans.factory.support.FactoryBeanRegistrySupport.getObjectFromFactoryBean(FactoryBeanRegistrySupport.java:101)}}
> {{        at 
> org.springframework.beans.factory.support.AbstractBeanFactory.getObjectForBeanInstance(AbstractBeanFactory.java:1828)}}
> {{        at 
> org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.getObjectForBeanInstance(AbstractAutowireCapableBeanFactory.java:1265)}}
> {{        at 
> org.springframework.beans.factory.support.AbstractBeanFactory.doGetBean(AbstractBeanFactory.java:334)}}
> {{        at 
> org.springframework.beans.factory.support.AbstractBeanFactory.getBean(AbstractBeanFactory.java:202)}}
> {{        at 
> org.springframework.beans.factory.support.BeanDefinitionValueResolver.resolveReference(BeanDefinitionValueResolver.java:330)}}
> {{        ... 32 common frames omitted}}
> {{Caused by: java.lang.ExceptionInInitializerError: null}}
> {{        at 
> org.apache.geode.distributed.internal.membership.gms.Services.(Services.java:155)}}
> {{        at 
> org.apache.geode.distributed.internal.membership.gms.MembershipBuilderImpl.create(MembershipBuilderImpl.java:114)}}
> {{        at 
> org.apache.geode.distributed.internal.DistributionImpl.(DistributionImpl.java:150)}}
> {{        at 
> org.apache.geode.distributed.internal.DistributionImpl.createDistribution(DistributionImpl.java:217)}}
> {{        at 
> org.apache.geode.distributed.internal.ClusterDistributionManager.(ClusterDistributionManager.java:464)}}
> {{        at 
> org.apache.geode.distributed.internal.ClusterDistributionManager.(ClusterDistributionManager.java:497)}}
> {{        at 
> org.apache.geode.distributed.internal.ClusterDistributionManager.create(ClusterDistributionManager.java:326)}}
> {{        at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.initialize(InternalDistributedSystem.java:779)}}
> {{        at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.access$200(InternalDistributedSystem.java:135)}}
> {{        at 
> org.apache.geode.distributed.internal.InternalDistributedSystem$Builder.build(InternalDistributedSystem.java:3036)}}
> {{        at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.connectInternal(InternalDistributedSystem.java:290)}}
> {{        at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.connectInternal(InternalDistributedSystem.java:216)}}
> {{        at 
> org.apache.geode.internal.cache.InternalCacheBuilder.createInternalDistributedSy

[jira] [Resolved] (GEODE-10122) With TLSv1.3 and GCM-based cipher (the default), P2P Messaging Fails When Encrypted Data Limit is Reached

2022-04-06 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham resolved GEODE-10122.
--
Fix Version/s: 1.15.0
   Resolution: Fixed

> With TLSv1.3 and GCM-based cipher (the default), P2P Messaging Fails When 
> Encrypted Data Limit is Reached
> -
>
> Key: GEODE-10122
> URL: https://issues.apache.org/jira/browse/GEODE-10122
> Project: Geode
>  Issue Type: Bug
>  Components: messaging
>Affects Versions: 1.13.7, 1.14.3, 1.15.0
>Reporter: Bill Burcham
>Assignee: Bill Burcham
>Priority: Major
>  Labels: blocks-1.15.0, pull-request-available, ssl
> Fix For: 1.15.0
>
> Attachments: patch-P2PMessagingConcurrencyDUnitTest.txt
>
>
> TLSv1.3 introduced [1] the ability to set per-algorithm limits on symmetric 
> key usage lifetimes. Once a certain number of bytes have been encrypted, a 
> KeyUpdate post-handshake message [2] is sent.
> With default settings, on Liberica JDK 11, Geode's P2P framework will 
> negotiate TLSv1.3 with the TLS_AES_256_GCM_SHA384 cipher suite. Geode P2P 
> messaging will eventually fail, with a "Tag mismatch!" IOException in shared 
> ordered receivers, after a session has been in heavy use for days.
> We have not seen this failure on TLSv1.2.
> The implementation of TLSv1.3 in the Java runtime provides a security 
> property [3] to configure the encrypted data limit. The attached patch to 
> P2PMessagingConcurrencyDUnitTest configures the limit large enough that the 
> test makes it through the (P2P) TLS handshake but small enough so that the 
> "Tag mismatch!" exception is encountered less than a minute later.
> The bug is caused by Geode’s NioSslEngine class’ ignorance of the 
> “rehandshaking” phase of the TLS protocol [4]:
>     Creation - ready to be configured.
>     Initial handshaking - perform authentication and negotiate communication 
> parameters.
>     Application data - ready for application exchange.
>     *Rehandshaking* - renegotiate communications parameters/authentication; 
> handshaking data may be mixed with application data.
>     Closure - ready to shut down connection.
> Geode's tcp.Connection and NioSslEngine classes (particularly wrap() and 
> unwrap()), as they are currently implemented, fail to fully attend to the 
> handshake status from javax.net.ssl.SSLEngine. As a result these Geode 
> classes fail to respond to the KeyUpdate message, resulting in the "Tag 
> mismatch!" IOException.
> When that exception is encountered, the Connection is destroyed and a new one 
> created in its place. But users of the old Connection, waiting for 
> acknowledgements, will never receive them. This can result in cluster-wide 
> hangs.
> [1] [https://datatracker.ietf.org/doc/html/rfc8446#section-5.5]
> [2] 
> [https://www.ibm.com/docs/en/sdk-java-technology/8?topic=handshake-post-messages]
>  
> [3] 
> [https://docs.oracle.com/en/java/javase/11/security/java-secure-socket-extension-jsse-reference-guide.html#GUID-B970ADD6-1E9F-4C18-A26E-0679B50CC946]
> [4] [https://www.ibm.com/docs/en/sdk-java-technology/7.1?topic=sslengine-]
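
For illustration, a minimal sketch of the configuration mechanism described in 
[3]. The property name and its default come from the JSSE reference guide; the 
lowered limit and the use of Security.setProperty here are assumptions made for 
the example, not the contents of the attached patch:

{noformat}
import java.security.Security;

public class LowKeyLimitDemo {
  public static void main(String[] args) {
    // Must run before any JSSE/TLS classes initialize, since the property is
    // read once. The JDK default is "AES/GCM/NoPadding KeyUpdate 2^37";
    // lowering the limit forces a KeyUpdate (and the rehandshaking phase)
    // after about 1 MiB of encrypted data instead of after days of heavy use.
    Security.setProperty("jdk.tls.keyLimits", "AES/GCM/NoPadding KeyUpdate 2^20");

    // ... start the TLS peers under test after this point ...
  }
}
{noformat}

The same effect should be reachable by putting the property in a file passed 
via -Djava.security.properties, which lets a test patch avoid touching the 
installed java.security file.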



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (GEODE-10192) CI hang: testRollingEnabledRecoverValuesTruePersistWithOverFlowWithEarlyTerminationOfRoller

2022-03-29 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-10192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham updated GEODE-10192:
-
Description: 
Hung here: 
[https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-develop-main/jobs/integration-test-openjdk8/builds/246#C]
 

 
{noformat}
> Task :geode-for-redis:integrationTest

timeout exceeded

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=  Test Results URI 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
http://files.apachegeode-ci.info/builds/apache-develop-main/1.15.0-build.1035/test-results/integrationTest/1648477166/
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Test report artifacts from this job are available at:

http://files.apachegeode-ci.info/builds/apache-develop-main/1.15.0-build.1035/test-artifacts/1648477166/integrationtestfiles-openjdk8-1.15.0-build.1035.tgz{noformat}
The only test in the "started" state is:

 
{noformat}
  |2.3.1| bburcham-a01 in 
~/Downloads/integrationtestfiles-openjdk8-1.15.0-build.1035
○ → progress -s started
org.apache.geode.internal.cache.DiskRandomOperationsAndRecoveryJUnitTest.testRollingEnabledRecoverValuesTruePersistWithOverFlowWithEarlyTerminationOfRoller
    Iteration: 1
    Start:     2022-03-28 13:41:07.109 +
    End:       0001-01-01 00:00:00.000 +
    Duration:  0s
    Status:    started
{noformat}
That JUnit test takes about 20s to run on a MacBook Pro.

  was:
Hung here: [https://hydradb.hdb.gemfire-ci.info/hdb/testresult/14291020]

The only test in the "started" state is:

 
{noformat}
  |2.3.1| bburcham-a01 in 
~/Downloads/integrationtestfiles-openjdk8-1.15.0-build.1035
○ → progress -s started
org.apache.geode.internal.cache.DiskRandomOperationsAndRecoveryJUnitTest.testRollingEnabledRecoverValuesTruePersistWithOverFlowWithEarlyTerminationOfRoller
    Iteration: 1
    Start:     2022-03-28 13:41:07.109 +
    End:       0001-01-01 00:00:00.000 +
    Duration:  0s
    Status:    started
{noformat}
That JUnit test takes about 20s to run on a MacBook Pro.


> CI hang: 
> testRollingEnabledRecoverValuesTruePersistWithOverFlowWithEarlyTerminationOfRoller
> ---
>
> Key: GEODE-10192
> URL: https://issues.apache.org/jira/browse/GEODE-10192
> Project: Geode
>  Issue Type: Bug
>  Components: persistence
>Affects Versions: 1.15.0
>Reporter: Bill Burcham
>Priority: Major
>  Labels: needsTriage
>
> Hung here: 
> [https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-develop-main/jobs/integration-test-openjdk8/builds/246#C]
>  
>  
> {noformat}
> > Task :geode-for-redis:integrationTest
> timeout exceeded
> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=  Test Results URI 
> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
> http://files.apachegeode-ci.info/builds/apache-develop-main/1.15.0-build.1035/test-results/integrationTest/1648477166/
> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
> Test report artifacts from this job are available at:
> http://files.apachegeode-ci.info/builds/apache-develop-main/1.15.0-build.1035/test-artifacts/1648477166/integrationtestfiles-openjdk8-1.15.0-build.1035.tgz{noformat}
> The only test in the "started" state is:
>  
> {noformat}
>   |2.3.1| bburcham-a01 in 
> ~/Downloads/integrationtestfiles-openjdk8-1.15.0-build.1035
> ○ → progress -s started
> org.apache.geode.internal.cache.DiskRandomOperationsAndRecoveryJUnitTest.testRollingEnabledRecoverValuesTruePersistWithOverFlowWithEarlyTerminationOfRoller
>     Iteration: 1
>     Start:     2022-03-28 13:41:07.109 +
>     End:       0001-01-01 00:00:00.000 +
>     Duration:  0s
>     Status:    started
> {noformat}
> That JUnit test takes about 20s to run on a MacBook Pro.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (GEODE-10192) CI hang: testRollingEnabledRecoverValuesTruePersistWithOverFlowWithEarlyTerminationOfRoller

2022-03-29 Thread Bill Burcham (Jira)
Bill Burcham created GEODE-10192:


 Summary: CI hang: 
testRollingEnabledRecoverValuesTruePersistWithOverFlowWithEarlyTerminationOfRoller
 Key: GEODE-10192
 URL: https://issues.apache.org/jira/browse/GEODE-10192
 Project: Geode
  Issue Type: Bug
  Components: persistence
Affects Versions: 1.15.0
Reporter: Bill Burcham


Hung here: [https://hydradb.hdb.gemfire-ci.info/hdb/testresult/14291020]

The only test in the "started" state is:

 
{noformat}
  |2.3.1| bburcham-a01 in 
~/Downloads/integrationtestfiles-openjdk8-1.15.0-build.1035
○ → progress -s started
org.apache.geode.internal.cache.DiskRandomOperationsAndRecoveryJUnitTest.testRollingEnabledRecoverValuesTruePersistWithOverFlowWithEarlyTerminationOfRoller
    Iteration: 1
    Start:     2022-03-28 13:41:07.109 +
    End:       0001-01-01 00:00:00.000 +
    Duration:  0s
    Status:    started
{noformat}
That JUnit test takes about 20s to run on a MacBook Pro.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Comment Edited] (GEODE-10188) AvailablePortHelperIntegrationTest > initializeUniquePortRange_returnSamePortsForSameRange gets different ports on subsequent tries

2022-03-28 Thread Bill Burcham (Jira)


[ 
https://issues.apache.org/jira/browse/GEODE-10188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17513744#comment-17513744
 ] 

Bill Burcham edited comment on GEODE-10188 at 3/29/22, 12:49 AM:
-

A theory about what happened (thanks [~demery] ):
 # The 10 Keepers created by a previous test (returnsUniqueKeepers) held onto 
their ports for a little longer than usual.
 # The failing test called getRandomAvailableTCPPorts, which skipped those 10 
ports because they were still in use, and instead picked up the next ten ports 
in the initialized range.
 # Then the Keepers released their ports.
 # Then the failing test called getRandomAvailableTCPPorts again, and picked up 
the first ports in the initialized range.


was (Author: bburcham):
A theory about what happened from Dale Emery:
 # The 10 Keepers created by a previous test (returnsUniqueKeepers) held onto 
their ports for a little longer than usual.
 # The failing test called getRandomAvailableTCPPorts, which skipped those 10 
ports because they were still in use, and instead picked up the next ten ports 
in the initialized range.
 # Then the Keepers released their ports.
 # Then the failing test called getRandomAvailableTCPPorts again, and picked up 
the first ports in the initialized range.

> AvailablePortHelperIntegrationTest > 
> initializeUniquePortRange_returnSamePortsForSameRange gets different ports on 
> subsequent tries
> ---
>
> Key: GEODE-10188
> URL: https://issues.apache.org/jira/browse/GEODE-10188
> Project: Geode
>  Issue Type: Bug
>  Components: tests
>Affects Versions: 1.13.9
>Reporter: Bill Burcham
>Priority: Major
>  Labels: needsTriage
>
> Failed here: [https://hydradb.hdb.gemfire-ci.info/hdb/testresult/14294054]
>  
> {noformat}
> > Task :geode-core:integrationTest
> org.apache.geode.internal.AvailablePortHelperIntegrationTest > 
> initializeUniquePortRange_returnSamePortsForSameRange(useMembershipPortRange=true)
>  FAILED
> org.junit.ComparisonFailure: expected:<[460[10, 46011, 4601]2]> but 
> was:<[460[00, 46001, 4600]2]>
> at 
> jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at 
> jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at 
> org.apache.geode.internal.AvailablePortHelperIntegrationTest.initializeUniquePortRange_returnSamePortsForSameRange(AvailablePortHelperIntegrationTest.java:322)
> 4023 tests completed, 1 failed, 82 skipped
> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=  Test Results URI 
> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
> http://files.apachegeode-ci.info/builds/apache-support-1-13-main/1.13.9-build.0668/test-results/integrationTest/1648509410/
> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
> Test report artifacts from this job are available at:
> http://files.apachegeode-ci.info/builds/apache-support-1-13-main/1.13.9-build.0668/test-artifacts/1648509410/integrationtestfiles-openjdk11-1.13.9-build.0668.tgz
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Comment Edited] (GEODE-10188) AvailablePortHelperIntegrationTest > initializeUniquePortRange_returnSamePortsForSameRange gets different ports on subsequent tries

2022-03-28 Thread Bill Burcham (Jira)


[ 
https://issues.apache.org/jira/browse/GEODE-10188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17513744#comment-17513744
 ] 

Bill Burcham edited comment on GEODE-10188 at 3/29/22, 12:46 AM:
-

A theory about what happened from Dale Emery:
 # The 10 Keepers created by a previous test (returnsUniqueKeepers) held onto 
their ports for a little longer than usual.
 # The failing test called getRandomAvailableTCPPorts, which skipped those 10 
ports because they were still in use, and instead picked up the next ten ports 
in the initialized range.
 # Then the Keepers released their ports.
 # Then the failing test called getRandomAvailableTCPPorts again, and picked up 
the first ports in the initialized range.


was (Author: bburcham):
A theory about what happened @dale:
 # The 10 Keepers created by a previous test (returnsUniqueKeepers) held onto 
their ports for a little longer than usual.
 # The failing test called getRandomAvailableTCPPorts, which skipped those 10 
ports because they were still in use, and instead picked up the next ten ports 
in the initialized range.
 # Then the Keepers released their ports.
 # Then the failing test called getRandomAvailableTCPPorts again, and picked up 
the first ports in the initialized range.

> AvailablePortHelperIntegrationTest > 
> initializeUniquePortRange_returnSamePortsForSameRange gets different ports on 
> subsequent tries
> ---
>
> Key: GEODE-10188
> URL: https://issues.apache.org/jira/browse/GEODE-10188
> Project: Geode
>  Issue Type: Bug
>  Components: tests
>Affects Versions: 1.13.9
>Reporter: Bill Burcham
>Priority: Major
>  Labels: needsTriage
>
> Failed here: [https://hydradb.hdb.gemfire-ci.info/hdb/testresult/14294054]
>  
> {noformat}
> > Task :geode-core:integrationTest
> org.apache.geode.internal.AvailablePortHelperIntegrationTest > 
> initializeUniquePortRange_returnSamePortsForSameRange(useMembershipPortRange=true)
>  FAILED
> org.junit.ComparisonFailure: expected:<[460[10, 46011, 4601]2]> but 
> was:<[460[00, 46001, 4600]2]>
> at 
> jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at 
> jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at 
> org.apache.geode.internal.AvailablePortHelperIntegrationTest.initializeUniquePortRange_returnSamePortsForSameRange(AvailablePortHelperIntegrationTest.java:322)
> 4023 tests completed, 1 failed, 82 skipped
> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=  Test Results URI 
> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
> http://files.apachegeode-ci.info/builds/apache-support-1-13-main/1.13.9-build.0668/test-results/integrationTest/1648509410/
> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
> Test report artifacts from this job are available at:
> http://files.apachegeode-ci.info/builds/apache-support-1-13-main/1.13.9-build.0668/test-artifacts/1648509410/integrationtestfiles-openjdk11-1.13.9-build.0668.tgz
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (GEODE-10188) AvailablePortHelperIntegrationTest > initializeUniquePortRange_returnSamePortsForSameRange gets different ports on subsequent tries

2022-03-28 Thread Bill Burcham (Jira)


[ 
https://issues.apache.org/jira/browse/GEODE-10188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17513744#comment-17513744
 ] 

Bill Burcham commented on GEODE-10188:
--

A theory about what happened @dale:
 # The 10 Keepers created by a previous test (returnsUniqueKeepers) held onto 
their ports for a little longer than usual.
 # The failing test called getRandomAvailableTCPPorts, which skipped those 10 
ports because they were still in use, and instead picked up the next ten ports 
in the initialized range.
 # Then the Keepers released their ports.
 # Then the failing test called getRandomAvailableTCPPorts again, and picked up 
the first ports in the initialized range.
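
To make that interleaving concrete, a JDK-only sketch of the mechanism is shown 
below; the base port and the linear scan are illustrative assumptions, not 
Geode's AvailablePortHelper implementation:

{noformat}
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;

final class PortRaceSketch {
  static final int BASE_PORT = 46000;  // hypothetical start of the range

  // Returns the first port at or above BASE_PORT that can currently be bound.
  static int firstFreePort() throws IOException {
    for (int port = BASE_PORT;; port++) {
      try (ServerSocket probe =
          new ServerSocket(port, 0, InetAddress.getLoopbackAddress())) {
        return port;
      } catch (IOException stillHeld) {
        // port is in use (e.g. by a lingering Keeper); try the next one
      }
    }
  }

  public static void main(String[] args) throws IOException {
    // A Keeper from the earlier test is still holding the first port...
    ServerSocket keeper =
        new ServerSocket(BASE_PORT, 0, InetAddress.getLoopbackAddress());
    int first = firstFreePort();   // skips BASE_PORT, returns BASE_PORT + 1
    keeper.close();                // ...then the Keeper releases its port...
    int second = firstFreePort();  // ...and the next scan returns BASE_PORT
    System.out.println(first + " vs " + second);  // two different answers
  }
}
{noformat}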

> AvailablePortHelperIntegrationTest > 
> initializeUniquePortRange_returnSamePortsForSameRange gets different ports on 
> subsequent tries
> ---
>
> Key: GEODE-10188
> URL: https://issues.apache.org/jira/browse/GEODE-10188
> Project: Geode
>  Issue Type: Bug
>  Components: tests
>Affects Versions: 1.13.9
>Reporter: Bill Burcham
>Priority: Major
>  Labels: needsTriage
>
> Failed here: [https://hydradb.hdb.gemfire-ci.info/hdb/testresult/14294054]
>  
> {noformat}
> > Task :geode-core:integrationTest
> org.apache.geode.internal.AvailablePortHelperIntegrationTest > 
> initializeUniquePortRange_returnSamePortsForSameRange(useMembershipPortRange=true)
>  FAILED
> org.junit.ComparisonFailure: expected:<[460[10, 46011, 4601]2]> but 
> was:<[460[00, 46001, 4600]2]>
> at 
> jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at 
> jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at 
> org.apache.geode.internal.AvailablePortHelperIntegrationTest.initializeUniquePortRange_returnSamePortsForSameRange(AvailablePortHelperIntegrationTest.java:322)
> 4023 tests completed, 1 failed, 82 skipped
> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=  Test Results URI 
> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
> http://files.apachegeode-ci.info/builds/apache-support-1-13-main/1.13.9-build.0668/test-results/integrationTest/1648509410/
> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
> Test report artifacts from this job are available at:
> http://files.apachegeode-ci.info/builds/apache-support-1-13-main/1.13.9-build.0668/test-artifacts/1648509410/integrationtestfiles-openjdk11-1.13.9-build.0668.tgz
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (GEODE-10187) PutAllGlobalDUnitTest > testputAllGlobalRemoteVM fails to receive expected TimeoutException

2022-03-28 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-10187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham updated GEODE-10187:
-
Affects Version/s: 1.14.5
   (was: 1.15.0)

> PutAllGlobalDUnitTest > testputAllGlobalRemoteVM fails to receive expected 
> TimeoutException
> ---
>
> Key: GEODE-10187
> URL: https://issues.apache.org/jira/browse/GEODE-10187
> Project: Geode
>  Issue Type: Bug
>  Components: regions
>Affects Versions: 1.14.5
>Reporter: Bill Burcham
>Priority: Major
>  Labels: needsTriage
>
> Failed here: [https://hydradb.hdb.gemfire-ci.info/hdb/testresult/14277444]
> {noformat}
> > Task :geode-core:distributedTest
> org.apache.geode.internal.cache.PutAllGlobalDUnitTest > 
> testputAllGlobalRemoteVM FAILED
> java.lang.AssertionError: async2 failed
> at org.apache.geode.test.dunit.Assert.fail(Assert.java:66)
> at 
> org.apache.geode.internal.cache.PutAllGlobalDUnitTest.testputAllGlobalRemoteVM(PutAllGlobalDUnitTest.java:215)
> Caused by:
> java.lang.AssertionError: Should have thrown TimeoutException
> at org.junit.Assert.fail(Assert.java:89)
> at 
> org.apache.geode.internal.cache.PutAllGlobalDUnitTest$2.run2(PutAllGlobalDUnitTest.java:193)
> 8805 tests completed, 1 failed, 455 skipped
> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=  Test Results URI 
> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
> http://files.apachegeode-ci.info/builds/apache-support-1-14-main/1.14.5-build.0942/test-results/distributedTest/1648360227/
> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
> Test report artifacts from this job are available at:
> http://files.apachegeode-ci.info/builds/apache-support-1-14-main/1.14.5-build.0942/test-artifacts/1648360227/distributedtestfiles-openjdk11-1.14.5-build.0942.tgz{noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (GEODE-10188) AvailablePortHelperIntegrationTest > initializeUniquePortRange_returnSamePortsForSameRange gets different ports on subsequent tries

2022-03-28 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-10188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham updated GEODE-10188:
-
Affects Version/s: 1.13.9
   (was: 1.15.0)

> AvailablePortHelperIntegrationTest > 
> initializeUniquePortRange_returnSamePortsForSameRange gets different ports on 
> subsequent tries
> ---
>
> Key: GEODE-10188
> URL: https://issues.apache.org/jira/browse/GEODE-10188
> Project: Geode
>  Issue Type: Bug
>  Components: tests
>Affects Versions: 1.13.9
>Reporter: Bill Burcham
>Priority: Major
>  Labels: needsTriage
>
> Failed here: [https://hydradb.hdb.gemfire-ci.info/hdb/testresult/14294054]
>  
> {noformat}
> > Task :geode-core:integrationTest
> org.apache.geode.internal.AvailablePortHelperIntegrationTest > 
> initializeUniquePortRange_returnSamePortsForSameRange(useMembershipPortRange=true)
>  FAILED
> org.junit.ComparisonFailure: expected:<[460[10, 46011, 4601]2]> but 
> was:<[460[00, 46001, 4600]2]>
> at 
> jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at 
> jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at 
> org.apache.geode.internal.AvailablePortHelperIntegrationTest.initializeUniquePortRange_returnSamePortsForSameRange(AvailablePortHelperIntegrationTest.java:322)
> 4023 tests completed, 1 failed, 82 skipped
> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=  Test Results URI 
> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
> http://files.apachegeode-ci.info/builds/apache-support-1-13-main/1.13.9-build.0668/test-results/integrationTest/1648509410/
> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
> Test report artifacts from this job are available at:
> http://files.apachegeode-ci.info/builds/apache-support-1-13-main/1.13.9-build.0668/test-artifacts/1648509410/integrationtestfiles-openjdk11-1.13.9-build.0668.tgz
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (GEODE-10188) AvailablePortHelperIntegrationTest > initializeUniquePortRange_returnSamePortsForSameRange gets different ports on subsequent tries

2022-03-28 Thread Bill Burcham (Jira)
Bill Burcham created GEODE-10188:


 Summary: AvailablePortHelperIntegrationTest > 
initializeUniquePortRange_returnSamePortsForSameRange gets different ports on 
subsequent tries
 Key: GEODE-10188
 URL: https://issues.apache.org/jira/browse/GEODE-10188
 Project: Geode
  Issue Type: Bug
  Components: tests
Affects Versions: 1.15.0
Reporter: Bill Burcham


Failed here: [https://hydradb.hdb.gemfire-ci.info/hdb/testresult/14294054]

 
{noformat}
> Task :geode-core:integrationTest

org.apache.geode.internal.AvailablePortHelperIntegrationTest > 
initializeUniquePortRange_returnSamePortsForSameRange(useMembershipPortRange=true)
 FAILED
org.junit.ComparisonFailure: expected:<[460[10, 46011, 4601]2]> but 
was:<[460[00, 46001, 4600]2]>
at 
jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at 
org.apache.geode.internal.AvailablePortHelperIntegrationTest.initializeUniquePortRange_returnSamePortsForSameRange(AvailablePortHelperIntegrationTest.java:322)

4023 tests completed, 1 failed, 82 skipped

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=  Test Results URI 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
http://files.apachegeode-ci.info/builds/apache-support-1-13-main/1.13.9-build.0668/test-results/integrationTest/1648509410/
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Test report artifacts from this job are available at:

http://files.apachegeode-ci.info/builds/apache-support-1-13-main/1.13.9-build.0668/test-artifacts/1648509410/integrationtestfiles-openjdk11-1.13.9-build.0668.tgz
{noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (GEODE-10187) PutAllGlobalDUnitTest > testputAllGlobalRemoteVM fails to receive expected TimeoutException

2022-03-28 Thread Bill Burcham (Jira)
Bill Burcham created GEODE-10187:


 Summary: PutAllGlobalDUnitTest > testputAllGlobalRemoteVM fails to 
receive expected TimeoutException
 Key: GEODE-10187
 URL: https://issues.apache.org/jira/browse/GEODE-10187
 Project: Geode
  Issue Type: Bug
  Components: regions
Affects Versions: 1.15.0
Reporter: Bill Burcham


Failed here: [https://hydradb.hdb.gemfire-ci.info/hdb/testresult/14277444]
{noformat}
> Task :geode-core:distributedTest

org.apache.geode.internal.cache.PutAllGlobalDUnitTest > 
testputAllGlobalRemoteVM FAILED
java.lang.AssertionError: async2 failed
at org.apache.geode.test.dunit.Assert.fail(Assert.java:66)
at 
org.apache.geode.internal.cache.PutAllGlobalDUnitTest.testputAllGlobalRemoteVM(PutAllGlobalDUnitTest.java:215)

Caused by:
java.lang.AssertionError: Should have thrown TimeoutException
at org.junit.Assert.fail(Assert.java:89)
at 
org.apache.geode.internal.cache.PutAllGlobalDUnitTest$2.run2(PutAllGlobalDUnitTest.java:193)

8805 tests completed, 1 failed, 455 skipped

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=  Test Results URI 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
http://files.apachegeode-ci.info/builds/apache-support-1-14-main/1.14.5-build.0942/test-results/distributedTest/1648360227/
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Test report artifacts from this job are available at:

http://files.apachegeode-ci.info/builds/apache-support-1-14-main/1.14.5-build.0942/test-artifacts/1648360227/distributedtestfiles-openjdk11-1.14.5-build.0942.tgz{noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (GEODE-10186) CI failure: RedundancyLevelPart1DUnitTest > testRedundancySpecifiedNonPrimaryEPFailsDetectionByCCU times out waiting for getClientProxies() to return more than 0 objects

2022-03-28 Thread Bill Burcham (Jira)
Bill Burcham created GEODE-10186:


 Summary: CI failure: RedundancyLevelPart1DUnitTest > 
testRedundancySpecifiedNonPrimaryEPFailsDetectionByCCU times out waiting for 
getClientProxies() to return more than 0 objects
 Key: GEODE-10186
 URL: https://issues.apache.org/jira/browse/GEODE-10186
 Project: Geode
  Issue Type: Bug
  Components: client queues
Affects Versions: 1.15.0
Reporter: Bill Burcham


Failed here: [https://hydradb.hdb.gemfire-ci.info/hdb/testresult/14277358]

 
{noformat}
> Task :geode-core:distributedTest

RedundancyLevelPart1DUnitTest > 
testRedundancySpecifiedNonPrimaryEPFailsDetectionByCCU FAILED
org.apache.geode.test.dunit.RMIException: While invoking 
org.apache.geode.internal.cache.tier.sockets.RedundancyLevelPart1DUnitTest$$Lambda$543/510122765.run
 in VM 2 running on Host 
heavy-lifter-f58561da-caf9-5bc0-a7fa-f938c3fd1e51.c.apachegeode-ci.internal 
with 4 VMs
at org.apache.geode.test.dunit.VM.executeMethodOnObject(VM.java:631)
at org.apache.geode.test.dunit.VM.invoke(VM.java:448)
at 
org.apache.geode.internal.cache.tier.sockets.RedundancyLevelPart1DUnitTest.testRedundancySpecifiedNonPrimaryEPFailsDetectionByCCU(RedundancyLevelPart1DUnitTest.java:284)

Caused by:
org.awaitility.core.ConditionTimeoutException: Assertion condition 
defined as a lambda expression in 
org.apache.geode.internal.cache.tier.sockets.RedundancyLevelPart1DUnitTest that 
uses org.apache.geode.internal.cache.tier.sockets.CacheClientNotifier 
Expecting actual:
  0
to be greater than:
  0
 within 5 minutes.
at 
org.awaitility.core.ConditionAwaiter.await(ConditionAwaiter.java:167)
at 
org.awaitility.core.AssertionCondition.await(AssertionCondition.java:119)
at 
org.awaitility.core.AssertionCondition.await(AssertionCondition.java:31)
at 
org.awaitility.core.ConditionFactory.until(ConditionFactory.java:985)
at 
org.awaitility.core.ConditionFactory.untilAsserted(ConditionFactory.java:769)
at 
org.apache.geode.internal.cache.tier.sockets.RedundancyLevelPart1DUnitTest.verifyInterestRegistration(RedundancyLevelPart1DUnitTest.java:505)

Caused by:
java.lang.AssertionError: 
Expecting actual:
  0
to be greater than:
  0
at 
org.apache.geode.internal.cache.tier.sockets.RedundancyLevelPart1DUnitTest.lambda$verifyInterestRegistration$19(RedundancyLevelPart1DUnitTest.java:506)

8352 tests completed, 1 failed, 414 skipped

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=  Test Results URI 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
http://files.apachegeode-ci.info/builds/apache-develop-mass-test-run/1.15.0-build.1033/test-results/distributedTest/1648331031/
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Test report artifacts from this job are available at:

http://files.apachegeode-ci.info/builds/apache-develop-mass-test-run/1.15.0-build.1033/test-artifacts/1648331031/distributedtestfiles-openjdk8-1.15.0-build.1033.tgz
{noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (GEODE-10184) CI failure on windows: non-zero exit status on gfsh command in DeployWithLargeJarTest > deployLargeSetOfJars

2022-03-28 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-10184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham updated GEODE-10184:
-
Summary: CI failure on windows: non-zero exit status on gfsh command in 
DeployWithLargeJarTest > deployLargeSetOfJars  (was: CI failure: non-zero exit 
status on gfsh command in DeployWithLargeJarTest > deployLargeSetOfJars)

> CI failure on windows: non-zero exit status on gfsh command in 
> DeployWithLargeJarTest > deployLargeSetOfJars
> 
>
> Key: GEODE-10184
> URL: https://issues.apache.org/jira/browse/GEODE-10184
> Project: Geode
>  Issue Type: Bug
>  Components: gfsh
>Affects Versions: 1.15.0
>Reporter: Bill Burcham
>Priority: Major
>  Labels: needsTriage
>
> Deploy large jar test fails due to a non-zero exit status on a gfsh command on 
> Windows.
>  
> [https://hydradb.hdb.gemfire-ci.info/hdb/testresult/14291025]
>  
> {noformat}
> > Task :geode-assembly:acceptanceTest
> DeployWithLargeJarTest > deployLargeSetOfJars FAILED
> org.opentest4j.AssertionFailedError: [Exit value from process started by 
> [e66e7d3e01750dd9: gfsh -e start locator --name=locator --max-heap=128m -e 
> start server --name=server --max-heap=128m --server-port=0 -e sleep --time=1 
> -e deploy 
> --jars=C:\\Users\\geode\\geode\\geode-assembly\\build\\install\\apache-geode\\lib\\commons-beanutils-1.9.4.jar,C:\\Users\\geode\\geode\\geode-assembly\\build\\install\\apache-geode\\lib\\commons-codec-1.15.jar,C:\\Users\\geode\\geode\\geode-assembly\\build\\install\\apache-geode\\lib\\commons-collections-3.2.2.jar,C:\\Users\\geode\\geode\\geode-assembly\\build\\install\\apache-geode\\lib\\commons-digester-2.1.jar,C:\\Users\\geode\\geode\\geode-assembly\\build\\install\\apache-geode\\lib\\commons-io-2.11.0.jar,C:\\Users\\geode\\geode\\geode-assembly\\build\\install\\apache-geode\\lib\\commons-lang3-3.12.0.jar,C:\\Users\\geode\\geode\\geode-assembly\\build\\install\\apache-geode\\lib\\commons-logging-1.2.jar,C:\\Users\\geode\\geode\\geode-assembly\\build\\install\\apache-geode\\lib\\commons-modeler-2.0.1.jar,C:\\Users\\geode\\geode\\geode-assembly\\build\\install\\apache-geode\\lib\\commons-validator-1.7.jar]]
>  
> expected: 0
>  but was: 1
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at 
> org.apache.geode.test.junit.rules.gfsh.GfshExecution.awaitTermination(GfshExecution.java:103)
> at 
> org.apache.geode.test.junit.rules.gfsh.GfshRule.execute(GfshRule.java:154)
> at 
> org.apache.geode.test.junit.rules.gfsh.GfshRule.execute(GfshRule.java:163)
> at 
> org.apache.geode.test.junit.rules.gfsh.GfshScript.execute(GfshScript.java:153)
> at 
> org.apache.geode.management.internal.cli.commands.DeployWithLargeJarTest.deployLargeSetOfJars(DeployWithLargeJarTest.java:41)
> 176 tests completed, 1 failed, 18 skipped
> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=  Test Results URI 
> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
> http://files.apachegeode-ci.info/builds/apache-develop-main/1.15.0-build.1035/test-results/acceptanceTest/1648482211/
> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
> Test report artifacts from this job are available at:
> http://files.apachegeode-ci.info/builds/apache-develop-main/1.15.0-build.1035/test-artifacts/1648482211/windows-acceptancetestfiles-openjdk8-1.15.0-build.1035.tgz{noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (GEODE-10184) CI failure: non-zero exit status on gfsh command in DeployWithLargeJarTest > deployLargeSetOfJars

2022-03-28 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-10184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham updated GEODE-10184:
-
Summary: CI failure: non-zero exit status on gfsh command in 
DeployWithLargeJarTest > deployLargeSetOfJars  (was: non-zero exit status on 
gfsh command in DeployWithLargeJarTest > deployLargeSetOfJars)

> CI failure: non-zero exit status on gfsh command in DeployWithLargeJarTest > 
> deployLargeSetOfJars
> -
>
> Key: GEODE-10184
> URL: https://issues.apache.org/jira/browse/GEODE-10184
> Project: Geode
>  Issue Type: Bug
>  Components: gfsh
>Affects Versions: 1.15.0
>Reporter: Bill Burcham
>Priority: Major
>  Labels: needsTriage
>
> Deploy large jar test fails due to a non-zero exit status on a gfsh command on 
> Windows.
>  
> [https://hydradb.hdb.gemfire-ci.info/hdb/testresult/14291025]
>  
> {noformat}
> > Task :geode-assembly:acceptanceTest
> DeployWithLargeJarTest > deployLargeSetOfJars FAILED
> org.opentest4j.AssertionFailedError: [Exit value from process started by 
> [e66e7d3e01750dd9: gfsh -e start locator --name=locator --max-heap=128m -e 
> start server --name=server --max-heap=128m --server-port=0 -e sleep --time=1 
> -e deploy 
> --jars=C:\\Users\\geode\\geode\\geode-assembly\\build\\install\\apache-geode\\lib\\commons-beanutils-1.9.4.jar,C:\\Users\\geode\\geode\\geode-assembly\\build\\install\\apache-geode\\lib\\commons-codec-1.15.jar,C:\\Users\\geode\\geode\\geode-assembly\\build\\install\\apache-geode\\lib\\commons-collections-3.2.2.jar,C:\\Users\\geode\\geode\\geode-assembly\\build\\install\\apache-geode\\lib\\commons-digester-2.1.jar,C:\\Users\\geode\\geode\\geode-assembly\\build\\install\\apache-geode\\lib\\commons-io-2.11.0.jar,C:\\Users\\geode\\geode\\geode-assembly\\build\\install\\apache-geode\\lib\\commons-lang3-3.12.0.jar,C:\\Users\\geode\\geode\\geode-assembly\\build\\install\\apache-geode\\lib\\commons-logging-1.2.jar,C:\\Users\\geode\\geode\\geode-assembly\\build\\install\\apache-geode\\lib\\commons-modeler-2.0.1.jar,C:\\Users\\geode\\geode\\geode-assembly\\build\\install\\apache-geode\\lib\\commons-validator-1.7.jar]]
>  
> expected: 0
>  but was: 1
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at 
> org.apache.geode.test.junit.rules.gfsh.GfshExecution.awaitTermination(GfshExecution.java:103)
> at 
> org.apache.geode.test.junit.rules.gfsh.GfshRule.execute(GfshRule.java:154)
> at 
> org.apache.geode.test.junit.rules.gfsh.GfshRule.execute(GfshRule.java:163)
> at 
> org.apache.geode.test.junit.rules.gfsh.GfshScript.execute(GfshScript.java:153)
> at 
> org.apache.geode.management.internal.cli.commands.DeployWithLargeJarTest.deployLargeSetOfJars(DeployWithLargeJarTest.java:41)
> 176 tests completed, 1 failed, 18 skipped
> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=  Test Results URI 
> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
> http://files.apachegeode-ci.info/builds/apache-develop-main/1.15.0-build.1035/test-results/acceptanceTest/1648482211/
> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
> Test report artifacts from this job are available at:
> http://files.apachegeode-ci.info/builds/apache-develop-main/1.15.0-build.1035/test-artifacts/1648482211/windows-acceptancetestfiles-openjdk8-1.15.0-build.1035.tgz{noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (GEODE-10184) non-zero exit status on gfsh command in DeployWithLargeJarTest > deployLargeSetOfJars

2022-03-28 Thread Bill Burcham (Jira)
Bill Burcham created GEODE-10184:


 Summary: non-zero exit status on gfsh command in 
DeployWithLargeJarTest > deployLargeSetOfJars
 Key: GEODE-10184
 URL: https://issues.apache.org/jira/browse/GEODE-10184
 Project: Geode
  Issue Type: Bug
  Components: gfsh
Affects Versions: 1.15.0
Reporter: Bill Burcham


Deploy large jar test fails due to a non-zero exit status on a gfsh command on 
Windows.

 

[https://hydradb.hdb.gemfire-ci.info/hdb/testresult/14291025]

 
{noformat}
> Task :geode-assembly:acceptanceTest

DeployWithLargeJarTest > deployLargeSetOfJars FAILED
org.opentest4j.AssertionFailedError: [Exit value from process started by 
[e66e7d3e01750dd9: gfsh -e start locator --name=locator --max-heap=128m -e 
start server --name=server --max-heap=128m --server-port=0 -e sleep --time=1 -e 
deploy 
--jars=C:\\Users\\geode\\geode\\geode-assembly\\build\\install\\apache-geode\\lib\\commons-beanutils-1.9.4.jar,C:\\Users\\geode\\geode\\geode-assembly\\build\\install\\apache-geode\\lib\\commons-codec-1.15.jar,C:\\Users\\geode\\geode\\geode-assembly\\build\\install\\apache-geode\\lib\\commons-collections-3.2.2.jar,C:\\Users\\geode\\geode\\geode-assembly\\build\\install\\apache-geode\\lib\\commons-digester-2.1.jar,C:\\Users\\geode\\geode\\geode-assembly\\build\\install\\apache-geode\\lib\\commons-io-2.11.0.jar,C:\\Users\\geode\\geode\\geode-assembly\\build\\install\\apache-geode\\lib\\commons-lang3-3.12.0.jar,C:\\Users\\geode\\geode\\geode-assembly\\build\\install\\apache-geode\\lib\\commons-logging-1.2.jar,C:\\Users\\geode\\geode\\geode-assembly\\build\\install\\apache-geode\\lib\\commons-modeler-2.0.1.jar,C:\\Users\\geode\\geode\\geode-assembly\\build\\install\\apache-geode\\lib\\commons-validator-1.7.jar]]
 
expected: 0
 but was: 1
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at 
org.apache.geode.test.junit.rules.gfsh.GfshExecution.awaitTermination(GfshExecution.java:103)
at 
org.apache.geode.test.junit.rules.gfsh.GfshRule.execute(GfshRule.java:154)
at 
org.apache.geode.test.junit.rules.gfsh.GfshRule.execute(GfshRule.java:163)
at 
org.apache.geode.test.junit.rules.gfsh.GfshScript.execute(GfshScript.java:153)
at 
org.apache.geode.management.internal.cli.commands.DeployWithLargeJarTest.deployLargeSetOfJars(DeployWithLargeJarTest.java:41)

176 tests completed, 1 failed, 18 skipped

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=  Test Results URI 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
http://files.apachegeode-ci.info/builds/apache-develop-main/1.15.0-build.1035/test-results/acceptanceTest/1648482211/
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Test report artifacts from this job are available at:

http://files.apachegeode-ci.info/builds/apache-develop-main/1.15.0-build.1035/test-artifacts/1648482211/windows-acceptancetestfiles-openjdk8-1.15.0-build.1035.tgz{noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (GEODE-10122) With TLSv1.3 and GCM-based cipher (the default), P2P Messaging Fails When Encrypted Data Limit is Reached

2022-03-18 Thread Bill Burcham (Jira)


[ 
https://issues.apache.org/jira/browse/GEODE-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17509131#comment-17509131
 ] 

Bill Burcham commented on GEODE-10122:
--

Made progress on the PR: the JUnit ("Integration") test fails reliably, sending 
2 bytes of encoded (TLS) data.

Next step is to make the test pass!

> With TLSv1.3 and GCM-based cipher (the default), P2P Messaging Fails When 
> Encrypted Data Limit is Reached
> -
>
> Key: GEODE-10122
> URL: https://issues.apache.org/jira/browse/GEODE-10122
> Project: Geode
>  Issue Type: Bug
>  Components: messaging
>Affects Versions: 1.13.7, 1.14.3, 1.15.0
>Reporter: Bill Burcham
>Assignee: Bill Burcham
>Priority: Major
>  Labels: blocks-1.15.0, pull-request-available
> Attachments: patch-P2PMessagingConcurrencyDUnitTest.txt
>
>
> TLSv1.3 introduced [1] the ability to set per-algorithm limits on symmetric 
> key usage lifetimes. Once a certain number of bytes have been encrypted, a 
> KeyUpdate post-handshake message [2] is sent.
> With default settings, on Liberica JDK 11, Geode's P2P framework will 
> negotiate TLSv1.3 with the TLS_AES_256_GCM_SHA384 cipher suite. Geode P2P 
> messaging will eventually fail, with a "Tag mismatch!" IOException in shared 
> ordered receivers, after a session has been in heavy use for days.
> We have not seen this failure on TLSv1.2.
> The implementation of TLSv1.3 in the Java runtime provides a security 
> property [3] to configure the encrypted data limit. The attached patch to 
> P2PMessagingConcurrencyDUnitTest configures the limit large enough that the 
> test makes it through the (P2P) TLS handshake but small enough so that the 
> "Tag mismatch!" exception is encountered less than a minute later.
> The bug is caused by Geode’s NioSslEngine class’ ignorance of the 
> “rehandshaking” phase of the TLS protocol [4]:
>     Creation - ready to be configured.
>     Initial handshaking - perform authentication and negotiate communication 
> parameters.
>     Application data - ready for application exchange.
>     *Rehandshaking* - renegotiate communications parameters/authentication; 
> handshaking data may be mixed with application data.
>     Closure - ready to shut down connection.
> Geode's tcp.Connection and NioSslEngine classes (particularly wrap() and 
> unwrap()), as they are currently implemented, fail to fully attend to the 
> handshake status from javax.net.ssl.SSLEngine. As a result these Geode 
> classes fail to respond to the KeyUpdate message, resulting in the "Tag 
> mismatch!" IOException.
> When that exception is encountered, the Connection is destroyed and a new one 
> created in its place. But users of the old Connection, waiting for 
> acknowledgements, will never receive them. This can result in cluster-wide 
> hangs.
> [1] [https://datatracker.ietf.org/doc/html/rfc8446#section-5.5]
> [2] 
> [https://www.ibm.com/docs/en/sdk-java-technology/8?topic=handshake-post-messages]
>  
> [3] 
> [https://docs.oracle.com/en/java/javase/11/security/java-secure-socket-extension-jsse-reference-guide.html#GUID-B970ADD6-1E9F-4C18-A26E-0679B50CC946]
> [4] [https://www.ibm.com/docs/en/sdk-java-technology/7.1?topic=sslengine-]
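
For context on what "attend to the handshake status" means, below is a 
bare-bones javax.net.ssl sketch. It is not Geode's NioSslEngine code and it 
elides buffer sizing and flushing; it only shows the SSLEngine states [4] that 
an unwrap path must service for a post-handshake message such as KeyUpdate to 
be processed rather than surfacing as a "Tag mismatch!" error:

{noformat}
import java.nio.ByteBuffer;
import javax.net.ssl.SSLEngine;
import javax.net.ssl.SSLEngineResult;
import javax.net.ssl.SSLEngineResult.HandshakeStatus;
import javax.net.ssl.SSLException;

final class UnwrapSketch {
  // Decrypts netIn into appIn, servicing any handshake work (delegated tasks,
  // outbound handshake records such as a KeyUpdate reply) the engine reports.
  static void unwrapAll(SSLEngine engine, ByteBuffer netIn, ByteBuffer netOut,
      ByteBuffer appIn) throws SSLException {
    while (netIn.hasRemaining()) {
      SSLEngineResult result = engine.unwrap(netIn, appIn);
      HandshakeStatus hs = result.getHandshakeStatus();
      if (hs == HandshakeStatus.NEED_TASK) {
        Runnable task;
        while ((task = engine.getDelegatedTask()) != null) {
          task.run();                        // key derivation, etc.
        }
      } else if (hs == HandshakeStatus.NEED_WRAP) {
        // The engine has handshake data to send (e.g. the KeyUpdate response);
        // the caller must flush netOut to the peer afterwards.
        engine.wrap(ByteBuffer.allocate(0), netOut);
      }
      if (result.getStatus() != SSLEngineResult.Status.OK) {
        break;  // BUFFER_UNDERFLOW/OVERFLOW/CLOSED: caller must refill or drain
      }
    }
  }
}
{noformat}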



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (GEODE-10122) With TLSv1.3 and GCM-based cipher (the default), P2P Messaging Fails When Encrypted Data Limit is Reached

2022-03-11 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham updated GEODE-10122:
-
Description: 
TLSv1.3 introduced [1] the ability to set per-algorithm limits on symmetric key 
usage lifetimes. Once a certain number of bytes have been encrypted, a 
KeyUpdate post-handshake message [2] is sent.

With default settings, on Liberica JDK 11, Geode's P2P framework will negotiate 
TLSv1.3 with the TLS_AES_256_GCM_SHA384 cipher suite. Geode P2P messaging will 
eventually fail, with a "Tag mismatch!" IOException in shared ordered 
receivers, after a session has been in heavy use for days.

We have not seen this failure on TLSv1.2.

The implementation of TLSv1.3 in the Java runtime provides a security property 
[3] to configure the encrypted data limit. The attached patch to 
P2PMessagingConcurrencyDUnitTest configures the limit large enough that the 
test makes it through the (P2P) TLS handshake but small enough so that the "Tag 
mismatch!" exception is encountered less than a minute later.

The bug is caused by Geode’s NioSslEngine class’ ignorance of the 
“rehandshaking” phase of the TLS protocol [4]:

    Creation - ready to be configured.

    Initial handshaking - perform authentication and negotiate communication 
parameters.

    Application data - ready for application exchange.

    *Rehandshaking* - renegotiate communications parameters/authentication; 
handshaking data may be mixed with application data.

    Closure - ready to shut down connection.

Geode's tcp.Connection and NioSslEngine classes (particularly wrap() and 
unwrap()), as they are currently implemented, fail to fully attend to the 
handshake status from javax.net.ssl.SSLEngine. As a result these Geode classes 
fail to respond to the KeyUpdate message, resulting in the "Tag mismatch!" 
IOException.

When that exception is encountered, the Connection is destroyed and a new one 
created in its place. But users of the old Connection, waiting for 
acknowledgements, will never receive them. This can result in cluster-wide 
hangs.

[1] [https://datatracker.ietf.org/doc/html/rfc8446#section-5.5]

[2] 
[https://www.ibm.com/docs/en/sdk-java-technology/8?topic=handshake-post-messages]
 

[3] 
[https://docs.oracle.com/en/java/javase/11/security/java-secure-socket-extension-jsse-reference-guide.html#GUID-B970ADD6-1E9F-4C18-A26E-0679B50CC946]

[4] [https://www.ibm.com/docs/en/sdk-java-technology/7.1?topic=sslengine-]

  was:
TLSv1.3 introduced [1] the ability to set per-algorithm limits on symmetric key 
usage lifetimes. Once a certain number of bytes have been encrypted, a 
KeyUpdate post-handshake message is sent.

With default settings, on Liberica JDK 11, Geode's P2P framework will negotiate 
TLSv1.3 with the TLS_AES_256_GCM_SHA384 cipher suite. Geode P2P messaging will 
eventually fail, with a "Tag mismatch!" IOException in shared ordered 
receivers, after a session has been in heavy use for days.

We have not seen this failure on TLSv1.2.

The implementation of TLSv1.3 in the Java runtime provides a security property 
[2] to configure the encrypted data limit. The attached patch to 
P2PMessagingConcurrencyDUnitTest configures the limit large enough that the 
test makes it through the (P2P) TLS handshake but small enough so that the "Tag 
mismatch!" exception is encountered less than a minute later.

The bug is caused by Geode’s NioSslEngine class’ ignorance of the 
“rehandshaking” phase of the TLS protocol [3]:

    Creation - ready to be configured.

    Initial handshaking - perform authentication and negotiate communication 
parameters.

    Application data - ready for application exchange.

    *Rehandshaking* - renegotiate communications parameters/authentication; 
handshaking data may be mixed with application data.

    Closure - ready to shut down connection.

Geode's tcp.Connection and NioSslEngine classes (particularly wrap() and 
unwrap()), as they are currently implemented, fail to fully attend to the 
handshake status from javax.net.ssl.SSLEngine. As a result these Geode classes 
fail to respond to the KeyUpdate message, resulting in the "Tag mismatch!" 
IOException.

When that exception is encountered, the Connection is destroyed and a new one 
created in its place. But users of the old Connection, waiting for 
acknowledgements, will never receive them. This can result in cluster-wide 
hangs.

[1] [https://datatracker.ietf.org/doc/html/rfc8446#section-5.5]

[2] 
[https://docs.oracle.com/en/java/javase/11/security/java-secure-socket-extension-jsse-reference-guide.html#GUID-B970ADD6-1E9F-4C18-A26E-0679B50CC946]
 

[3] [https://www.ibm.com/docs/en/sdk-java-technology/7.1?topic=sslengine-]


> With TLSv1.3 and GCM-based cipher (the default), P2P Messaging Fails When 
> Encrypted Data Limit is Reached
> -
>
>   

[jira] [Updated] (GEODE-10122) With TLSv1.3 and GCM-based cipher (the default), P2P Messaging Fails When Encrypted Data Limit is Reached

2022-03-11 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham updated GEODE-10122:
-
Component/s: messaging

> With TLSv1.3 and GCM-based cipher (the default), P2P Messaging Fails When 
> Encrypted Data Limit is Reached
> -
>
> Key: GEODE-10122
> URL: https://issues.apache.org/jira/browse/GEODE-10122
> Project: Geode
>  Issue Type: Bug
>  Components: messaging
>Affects Versions: 1.13.7, 1.14.3, 1.15.0, 1.16.0
>Reporter: Bill Burcham
>Assignee: Bill Burcham
>Priority: Major
> Attachments: patch-P2PMessagingConcurrencyDUnitTest.txt
>
>
> TLSv1.3 introduced [1] the ability to set per-algorithm limits on symmetric 
> key usage lifetimes. Once a certain number of bytes have been encrypted, a 
> KeyUpdate post-handshake message is sent.
> With default settings, on Liberica JDK 11, Geode's P2P framework will 
> negotiate TLSv1.3 with the TLS_AES_256_GCM_SHA384 cipher suite. Geode P2P 
> messaging will eventually fail, with a "Tag mismatch!" IOException in shared 
> ordered receivers, after a session has been in heavy use for days.
> We have not seen this failure on TLSv1.2.
> The implementation of TLSv1.3 in the Java runtime provides a security 
> property [2] to configure the encrypted data limit. The attached patch to 
> P2PMessagingConcurrencyDUnitTest configures the limit large enough that the 
> test makes it through the (P2P) TLS handshake but small enough so that the 
> "Tag mismatch!" exception is encountered less than a minute later.
> The bug is caused by Geode’s NioSslEngine class’ ignorance of the 
> “rehandshaking” phase of the TLS protocol [3]:
>     Creation - ready to be configured.
>     Initial handshaking - perform authentication and negotiate communication 
> parameters.
>     Application data - ready for application exchange.
>     *Rehandshaking* - renegotiate communications parameters/authentication; 
> handshaking data may be mixed with application data.
>     Closure - ready to shut down connection.
> Geode's tcp.Connection and NioSslEngine classes (particularly wrap() and 
> unwrap()), as they are currently implemented, fail to fully attend to the 
> handshake status from javax.net.ssl.SSLEngine. As a result these Geode 
> classes fail to respond to the KeyUpdate message, resulting in the "Tag 
> mismatch!" IOException.
> When that exception is encountered, the Connection is destroyed and a new one 
> created in its place. But users of the old Connection, waiting for 
> acknowledgements, will never receive them. This can result in cluster-wide 
> hangs.
> [1] [https://datatracker.ietf.org/doc/html/rfc8446#section-5.5]
> [2] 
> [https://docs.oracle.com/en/java/javase/11/security/java-secure-socket-extension-jsse-reference-guide.html#GUID-B970ADD6-1E9F-4C18-A26E-0679B50CC946]
>  
> [3] [https://www.ibm.com/docs/en/sdk-java-technology/7.1?topic=sslengine-]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (GEODE-10122) With TLSv1.3 and GCM-based cipher (the default), P2P Messaging Fails When Encrypted Data Limit is Reached

2022-03-11 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham updated GEODE-10122:
-
Labels:   (was: needsTriage)

> With TLSv1.3 and GCM-based cipher (the default), P2P Messaging Fails When 
> Encrypted Data Limit is Reached
> -
>
> Key: GEODE-10122
> URL: https://issues.apache.org/jira/browse/GEODE-10122
> Project: Geode
>  Issue Type: Bug
>Affects Versions: 1.13.7, 1.14.3, 1.15.0, 1.16.0
>Reporter: Bill Burcham
>Assignee: Bill Burcham
>Priority: Major
> Attachments: patch-P2PMessagingConcurrencyDUnitTest.txt
>
>
> TLSv1.3 introduced [1] the ability to set per-algorithm limits on symmetric 
> key usage lifetimes. Once a certain number of bytes have been encrypted, a 
> KeyUpdate post-handshake message is sent.
> With default settings, on Liberica JDK 11, Geode's P2P framework will 
> negotiate TLSv1.3 with the TLS_AES_256_GCM_SHA384 cipher suite. Geode P2P 
> messaging will eventually fail, with a "Tag mismatch!" IOException in shared 
> ordered receivers, after a session has been in heavy use for days.
> We have not seen this failure on TLSv1.2.
> The implementation of TLSv1.3 in the Java runtime provides a security 
> property [2] to configure the encrypted data limit. The attached patch to 
> P2PMessagingConcurrencyDUnitTest configures the limit large enough that the 
> test makes it through the (P2P) TLS handshake but small enough so that the 
> "Tag mismatch!" exception is encountered less than a minute later.
> The bug is caused by Geode’s NioSslEngine class’ ignorance of the 
> “rehandshaking” phase of the TLS protocol [3]:
>     Creation - ready to be configured.
>     Initial handshaking - perform authentication and negotiate communication 
> parameters.
>     Application data - ready for application exchange.
>     *Rehandshaking* - renegotiate communications parameters/authentication; 
> handshaking data may be mixed with application data.
>     Closure - ready to shut down connection.
> Geode's tcp.Connection and NioSslEngine classes (particularly wrap() and 
> unwrap()), as they are currently implemented, fail to fully attend to the 
> handshake status from javax.net.ssl.SSLEngine. As a result these Geode 
> classes fail to respond to the KeyUpdate message, resulting in the "Tag 
> mismatch!" IOException.
> When that exception is encountered, the Connection is destroyed and a new one 
> created in its place. But users of the old Connection, waiting for 
> acknowledgements, will never receive them. This can result in cluster-wide 
> hangs.
> [1] [https://datatracker.ietf.org/doc/html/rfc8446#section-5.5]
> [2] 
> [https://docs.oracle.com/en/java/javase/11/security/java-secure-socket-extension-jsse-reference-guide.html#GUID-B970ADD6-1E9F-4C18-A26E-0679B50CC946]
>  
> [3] [https://www.ibm.com/docs/en/sdk-java-technology/7.1?topic=sslengine-]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (GEODE-10122) With TLSv1.3 and GCM-based cipher (the default), P2P Messaging Fails When Encrypted Data Limit is Reached

2022-03-11 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham reassigned GEODE-10122:


Assignee: Bill Burcham

> With TLSv1.3 and GCM-based cipher (the default), P2P Messaging Fails When 
> Encrypted Data Limit is Reached
> -
>
> Key: GEODE-10122
> URL: https://issues.apache.org/jira/browse/GEODE-10122
> Project: Geode
>  Issue Type: Bug
>Affects Versions: 1.13.7, 1.14.3, 1.15.0, 1.16.0
>Reporter: Bill Burcham
>Assignee: Bill Burcham
>Priority: Major
>  Labels: needsTriage
> Attachments: patch-P2PMessagingConcurrencyDUnitTest.txt
>
>
> TLSv1.3 introduced [1] the ability to set per-algorithm limits on symmetric 
> key usage lifetimes. Once a certain number of bytes have been encrypted, a 
> KeyUpdate post-handshake message is sent.
> With default settings, on Liberica JDK 11, Geode's P2P framework will 
> negotiate TLSv1.3 with the TLS_AES_256_GCM_SHA384 cipher suite. Geode P2P 
> messaging will eventually fail, with a "Tag mismatch!" IOException in shared 
> ordered receivers, after a session has been in heavy use for days.
> We have not seen this failure on TLSv1.2.
> The implementation of TLSv1.3 in the Java runtime provides a security 
> property [2] to configure the encrypted data limit. The attached patch to 
> P2PMessagingConcurrencyDUnitTest configures the limit large enough that the 
> test makes it through the (P2P) TLS handshake but small enough so that the 
> "Tag mismatch!" exception is encountered less than a minute later.
> The bug is caused by Geode’s NioSslEngine class’ ignorance of the 
> “rehandshaking” phase of the TLS protocol [3]:
>     Creation - ready to be configured.
>     Initial handshaking - perform authentication and negotiate communication 
> parameters.
>     Application data - ready for application exchange.
>     *Rehandshaking* - renegotiate communications parameters/authentication; 
> handshaking data may be mixed with application data.
>     Closure - ready to shut down connection.
> Geode's tcp.Connection and NioSslEngine classes (particularly wrap() and 
> unwrap()), as they are currently implemented, fail to fully attend to the 
> handshake status from javax.net.ssl.SSLEngine. As a result these Geode 
> classes fail to respond to the KeyUpdate message, resulting in the "Tag 
> mismatch!" IOException.
> When that exception is encountered, the Connection is destroyed and a new one 
> created in its place. But users of the old Connection, waiting for 
> acknowledgements, will never receive them. This can result in cluster-wide 
> hangs.
> [1] [https://datatracker.ietf.org/doc/html/rfc8446#section-5.5]
> [2] 
> [https://docs.oracle.com/en/java/javase/11/security/java-secure-socket-extension-jsse-reference-guide.html#GUID-B970ADD6-1E9F-4C18-A26E-0679B50CC946]
>  
> [3] [https://www.ibm.com/docs/en/sdk-java-technology/7.1?topic=sslengine-]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (GEODE-10122) With TLSv1.3 and GCM-based cipher (the default), P2P Messaging Fails When Encrypted Data Limit is Reached

2022-03-11 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham updated GEODE-10122:
-
Attachment: patch-P2PMessagingConcurrencyDUnitTest.txt

> With TLSv1.3 and GCM-based cipher (the default), P2P Messaging Fails When 
> Encrypted Data Limit is Reached
> -
>
> Key: GEODE-10122
> URL: https://issues.apache.org/jira/browse/GEODE-10122
> Project: Geode
>  Issue Type: Bug
>Affects Versions: 1.13.7, 1.14.3, 1.15.0, 1.16.0
>Reporter: Bill Burcham
>Priority: Major
>  Labels: needsTriage
> Attachments: patch-P2PMessagingConcurrencyDUnitTest.txt
>
>
> TLSv1.3 introduced [1] the ability to set per-algorithm limits on symmetric 
> key usage lifetimes. Once a certain number of bytes have been encrypted, a 
> KeyUpdate post-handshake message is sent.
> With default settings, on Liberica JDK 11, Geode's P2P framework will 
> negotiate TLSv1.3 with the TLS_AES_256_GCM_SHA384 cipher suite. Geode P2P 
> messaging will eventually fail, with a "Tag mismatch!" IOException in shared 
> ordered receivers, after a session has been in heavy use for days.
> We have not seen this failure on TLSv1.2.
> The implementation of TLSv1.3 in the Java runtime provides a security 
> property [2] to configure the encrypted data limit. The attached patch to 
> P2PMessagingConcurrencyDUnitTest configures the limit large enough that the 
> test makes it through the (P2P) TLS handshake but small enough so that the 
> "Tag mismatch!" exception is encountered less than a minute later.
> The bug is caused by Geode’s NioSslEngine class’ ignorance of the 
> “rehandshaking” phase of the TLS protocol [3]:
>     Creation - ready to be configured.
>     Initial handshaking - perform authentication and negotiate communication 
> parameters.
>     Application data - ready for application exchange.
>     *Rehandshaking* - renegotiate communications parameters/authentication; 
> handshaking data may be mixed with application data.
>     Closure - ready to shut down connection.
> Geode's tcp.Connection and NioSslEngine classes (particularly wrap() and 
> unwrap()), as they are currently implemented, fail to fully attend to the 
> handshake status from javax.net.ssl.SSLEngine. As a result these Geode 
> classes fail to respond to the KeyUpdate message, resulting in the "Tag 
> mismatch!" IOException.
> When that exception is encountered, the Connection is destroyed and a new one 
> created in its place. But users of the old Connection, waiting for 
> acknowledgements, will never receive them. This can result in cluster-wide 
> hangs.
> [1] [https://datatracker.ietf.org/doc/html/rfc8446#section-5.5]
> [2] 
> [https://docs.oracle.com/en/java/javase/11/security/java-secure-socket-extension-jsse-reference-guide.html#GUID-B970ADD6-1E9F-4C18-A26E-0679B50CC946]
>  
> [3] [https://www.ibm.com/docs/en/sdk-java-technology/7.1?topic=sslengine-]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (GEODE-10122) With TLSv1.3 and GCM-based cipher (the default), P2P Messaging Fails When Encrypted Data Limit is Reached

2022-03-11 Thread Bill Burcham (Jira)
Bill Burcham created GEODE-10122:


 Summary: With TLSv1.3 and GCM-based cipher (the default), P2P 
Messaging Fails When Encrypted Data Limit is Reached
 Key: GEODE-10122
 URL: https://issues.apache.org/jira/browse/GEODE-10122
 Project: Geode
  Issue Type: Bug
Affects Versions: 1.14.3, 1.13.7, 1.15.0, 1.16.0
Reporter: Bill Burcham
 Attachments: patch-P2PMessagingConcurrencyDUnitTest.txt

TLSv1.3 introduced [1] the ability to set per-algorithm limits on symmetric key 
usage lifetimes. Once a certain number of bytes have been encrypted, a 
KeyUpdate post-handshake message is sent.

With default settings, on Liberica JDK 11, Geode's P2P framework will negotiate 
TLSv1.3 with the TLS_AES_256_GCM_SHA384 cipher suite. Geode P2P messaging will 
eventually fail, with a "Tag mismatch!" IOException in shared ordered 
receivers, after a session has been in heavy use for days.

We have not seen this failure on TLSv1.2.

The implementation of TLSv1.3 in the Java runtime provides a security property 
[2] to configure the encrypted data limit. The attached patch to 
P2PMessagingConcurrencyDUnitTest configures the limit large enough that the 
test makes it through the (P2P) TLS handshake but small enough so that the "Tag 
mismatch!" exception is encountered less than a minute later.

The bug is caused by Geode’s NioSslEngine class’ ignorance of the 
“rehandshaking” phase of the TLS protocol [3]:

    Creation - ready to be configured.

    Initial handshaking - perform authentication and negotiate communication 
parameters.

    Application data - ready for application exchange.

    *Rehandshaking* - renegotiate communications parameters/authentication; 
handshaking data may be mixed with application data.

    Closure - ready to shut down connection.

Geode's tcp.Connection and NioSslEngine classes (particularly wrap() and 
unwrap()), as they are currently implemented, fail to fully attend to the 
handshake status from javax.net.ssl.SSLEngine. As a result these Geode classes 
fail to respond to the KeyUpdate message, resulting in the "Tag mismatch!" 
IOException.
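
For illustration only (not the Geode fix): a minimal sketch of an unwrap path that 
keeps consulting the handshake status from javax.net.ssl.SSLEngine, which is what 
allows a post-handshake KeyUpdate to be processed instead of ignored. The buffer 
names and the missing socket flush are placeholders:
{code:java}
import java.nio.ByteBuffer;
import javax.net.ssl.SSLEngine;
import javax.net.ssl.SSLEngineResult;
import javax.net.ssl.SSLEngineResult.HandshakeStatus;
import javax.net.ssl.SSLException;

final class UnwrapSketch {
  // After every unwrap(), re-check the handshake status: with TLSv1.3 a KeyUpdate
  // arrives as post-handshake data, so NEED_TASK/NEED_WRAP can appear long after
  // the initial handshake finished.
  static void unwrapAndServiceHandshake(SSLEngine engine, ByteBuffer netIn,
      ByteBuffer appIn, ByteBuffer netOut) throws SSLException {
    SSLEngineResult result = engine.unwrap(netIn, appIn);
    HandshakeStatus hs = result.getHandshakeStatus();
    while (hs != HandshakeStatus.NOT_HANDSHAKING && hs != HandshakeStatus.FINISHED) {
      switch (hs) {
        case NEED_TASK: {
          Runnable task;
          while ((task = engine.getDelegatedTask()) != null) {
            task.run(); // run delegated tasks inline for simplicity
          }
          break;
        }
        case NEED_WRAP:
          // the engine wants to send handshake data (e.g. its own KeyUpdate)
          engine.wrap(ByteBuffer.allocate(0), netOut);
          // a real implementation would flush netOut to the socket here
          break;
        case NEED_UNWRAP:
          return; // need more bytes from the peer before continuing
        default:
          return;
      }
      hs = engine.getHandshakeStatus();
    }
  }
}
{code}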

When that exception is encountered, the Connection is destroyed and a new one 
created in its place. But users of the old Connection, waiting for 
acknowledgements, will never receive them. This can result in cluster-wide 
hangs.

[1] [https://datatracker.ietf.org/doc/html/rfc8446#section-5.5]

[2] 
[https://docs.oracle.com/en/java/javase/11/security/java-secure-socket-extension-jsse-reference-guide.html#GUID-B970ADD6-1E9F-4C18-A26E-0679B50CC946]
 

[3] [https://www.ibm.com/docs/en/sdk-java-technology/7.1?topic=sslengine-]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (GEODE-9680) Newly Started/Restarted Locators are Susceptible to Split-Brains

2022-01-09 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-9680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham updated GEODE-9680:

Description: 
The issues described here are present in all versions of Geode (this is not new 
to 1.15.0)…

Geode is built on the assumption that views progress linearly in a sequence. If 
that sequence ever forks into two or more parallel lines then we have a "split 
brain".

In a split brain condition, each of the parallel views are independent. It's as 
if you have more than one system running concurrently. It's possible e.g. for 
some clients to connect to members of one view and other clients to connect to 
members of another view. Updates to members in one view are not seen by members 
of a parallel view.

Geode views are produced by a coordinator. As long as only a single coordinator 
is running, there is no possibility of a split brain. Split brain arises when 
more than one coordinator is producing views at the same time.

Each Geode member (peer) is started with the {{locators}} configuration 
parameter. That parameter specifies locator(s) to use to find the (already 
running!) coordinator (member) to join with.

When a locator (member) starts, it goes through this sequence to find the 
coordinator:
 # it first tries to find the coordinator through one of the (other) configured 
locators
 # if it can't contact any of those, it tries contacting non-locator (cache 
server) members it has retrieved from the "view persistence" ({{{}.dat{}}}) file

If it hasn't found a coordinator to join with, then the locator may _become_ a 
coordinator.

Sometimes this is ok. If no other coordinator is currently running then this 
behavior is fine. An example is when an [administrator is starting up a brand 
new 
cluster|https://geode.apache.org/docs/guide/114/configuring/running/running_the_locator.html].
 In that case we want the very first locator we start to become the coordinator.

But there are a number of situations where there may already be another 
coordinator running but it cannot be reached:
 * if the administrator/operator wants to *start up a brand new cluster* with 
multiple locators and…
 ** maybe Geode is running in a managed environment like Kubernetes and the 
locators' hostnames are not (yet) resolvable in DNS
 ** maybe there is a network partition between the starting locators so they 
can't communicate
 ** maybe the existing locators or coordinator are running very slowly or the 
network is degraded. This is effectively the same as the network partition just 
mentioned
 * if a cluster is already running and the administrator/operator wants to 
*scale it up* by starting/adding a new locator, Geode is susceptible to the same 
issues just mentioned
 * if a cluster is already running and the administrator/operator needs to 
*restart* a locator, e.g. for a rolling upgrade, if none of the locators in the 
{{locators}} configuration parameter are reachable (maybe they are not running, 
or maybe there is a network partition) and…
 ** if the "view persistence" {{.dat}} file is missing or deleted
 ** or if the current set of running Geode members has evolved so far that the 
coordinates (host+port) in the {{.dat}} file are completely out of date

In each of those cases, the newly starting locator will become a coordinator 
and will start producing views. Now we'll have the old coordinator producing 
views at the same time as the new one.
h2. When This Ticket is Complete

There are a number of possible solutions to these problems. Here is one 
possible solution…

Geode will offer a locator startup mode (via TBD {{LocatorLauncher}} startup 
parameter) that prevents that locator from becoming a coordinator. In that 
mode, it will be possible for an administrator/operator to avoid many of the 
problematic scenarios mentioned above, while retaining the ability (via some 
_other_ mode) to start a first locator which is allowed to become a coordinator.

For purposes of discussion we'll call the startup mode that allows the locator 
to become a coordinator "seed" mode, and we'll call the new startup mode that 
prevents the locator from becoming a coordinator before first joining, 
"join-only" mode.

After this mode split is implemented, it is envisioned that to start a brand 
new cluster, an administrator/operator will start the first locator in "seed" 
mode. After that the operator will start all subsequent locators in "join only" 
mode. If network partitions occur during startup, those newly started 
("join-only") nodes will exit with a failure status—under no circumstances will 
they ever become coordinators.

To add a locator to a running cluster, an operator starts it in "join only" 
mode. The new member will similarly either join with an existing coordinator or 
exit with a failure status, thereby avoiding split brains.
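
Purely illustrative, since the ticket leaves the actual parameter TBD: a 
hypothetical startup sequence, with a made-up --membership-mode option standing in 
for whatever LocatorLauncher/gfsh flag is eventually chosen:
{noformat}
# Hypothetical flag; the real LocatorLauncher/gfsh parameter name is TBD in this ticket.
start locator --name=locator1 --port=10334 --membership-mode=seed       # may become coordinator
start locator --name=locator2 --port=10335 --membership-mode=join-only  # joins or exits with failure
start locator --name=locator3 --port=10336 --membership-mode=join-only  # joins or exits with failure
{noformat}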

When an operator restarts a locator, e.g. during a rolling upgrade, they will 
restart it in

[jira] [Resolved] (GEODE-9822) Split-brain Certain During Network Partition in Two-Locator Cluster

2021-12-09 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham resolved GEODE-9822.
-
Fix Version/s: 1.15.0
   Resolution: Fixed

> Split-brain Certain During Network Partition in Two-Locator Cluster
> ---
>
> Key: GEODE-9822
> URL: https://issues.apache.org/jira/browse/GEODE-9822
> Project: Geode
>  Issue Type: Bug
>  Components: membership
>Reporter: Bill Burcham
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.15.0
>
>
> In a two-locator cluster with default member weights and default setting 
> (true) of enable-network-partition-detection, if a long-lived network 
> partition separates the two members, a split-brain will arise: there will be 
> two coordinators at the same time.
> The reason for this can be found in the GMSJoinLeave.isNetworkPartition() 
> method. That method's name is misleading. A name like isMajorityLost() would 
> probably be more apt. It needs to return true iff the weight of "crashed" 
> members (in the prospective view) is greater-than-or-equal-to half (50%) of 
> the total weight (of all members in the current view).
> What the method actually does is return true iff the weight of "crashed" 
> members is greater-than 51% of the total weight. As a result, if we have two 
> members of equal weight, and the coordinator sees that the non-coordinator is 
> "crashed", the coordinator will keep running. If a network partition is 
> happening, and the non-coordinator is still running, then it will become a 
> coordinator and start producing views. Now we'll have two coordinators 
> producing views concurrently.
> For this discussion "crashed" members are members for which the coordinator 
> has received a RemoveMemberRequest message. These are members that the 
> failure detector has deemed failed. Keep in mind the failure detector is 
> imperfect (it's not always right), and that's kind of the whole point of this 
> ticket: we've lost contact with the non-coordinator member, but that doesn't 
> mean it can't still be running (on the other side of a partition).
> This bug is not limited to the two-locator scenario. Any set of members that 
> can be partitioned into two equal sets is susceptible. In fact it's even a 
> little worse than that. Any partitioning of the membership into more than one 
> set, in which two or more of the resulting sets each retain 49% or more of the 
> total weight, will result in a split-brain.
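
Purely as an illustration of the arithmetic above (made-up helpers, not the actual 
GMSJoinLeave code):
{code:java}
final class MajorityCheckSketch {
  // Behavior described above: only a loss of MORE than 51% of the total weight
  // counts, so a 50/50 split never trips the check and both sides keep running.
  static boolean isNetworkPartitionAsImplemented(int crashedWeight, int totalWeight) {
    return crashedWeight > totalWeight * 0.51;
  }

  // Behavior the ticket asks for ("isMajorityLost"): losing half or more of the
  // total weight must count, so the side that lost its majority stops.
  static boolean isMajorityLost(int crashedWeight, int totalWeight) {
    return 2 * crashedWeight >= totalWeight;
  }

  public static void main(String[] args) {
    // Two members of equal weight, one on each side of the partition:
    System.out.println(isNetworkPartitionAsImplemented(1, 2)); // false -> both keep producing views
    System.out.println(isMajorityLost(1, 2));                  // true  -> the partitioned side stops
  }
}
{code}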



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (GEODE-9822) Split-brain Certain During Network Partition in Two-Locator Cluster

2021-12-09 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham updated GEODE-9822:

Description: 
In a two-locator cluster with default member weights and default setting (true) 
of enable-network-partition-detection, if a long-lived network partition 
separates the two members, a split-brain will arise: there will be two 
coordinators at the same time.

The reason for this can be found in the GMSJoinLeave.isNetworkPartition() 
method. That method's name is misleading. A name like isMajorityLost() would 
probably be more apt. It needs to return true iff the weight of "crashed" 
members (in the prospective view) is greater-than-or-equal-to half (50%) of the 
total weight (of all members in the current view).

What the method actually does is return true iff the weight of "crashed" 
members is greater-than 51% of the total weight. As a result, if we have two 
members of equal weight, and the coordinator sees that the non-coordinator is 
"crashed", the coordinator will keep running. If a network partition is 
happening, and the non-coordinator is still running, then it will become a 
coordinator and start producing views. Now we'll have two coordinators 
producing views concurrently.

For this discussion "crashed" members are members for which the coordinator has 
received a RemoveMemberRequest message. These are members that the failure 
detector has deemed failed. Keep in mind the failure detector is imperfect 
(it's not always right), and that's kind of the whole point of this ticket: 
we've lost contact with the non-coordinator member, but that doesn't mean it 
can't still be running (on the other side of a partition).

This bug is not limited to the two-locator scenario. Any set of members that 
can be partitioned into two equal sets is susceptible. In fact it's even a 
little worse than that. Any partitioning of the membership into more than one 
set, in which two or more of the resulting sets each retain 49% or more of the 
total weight, will result in a split-brain.

  was:
In a two-locator cluster with default member weights and default setting (true) 
of enable-network-partition-detection, if a long-lived network partition 
separates the two members, a split-brain will arise: there will be two 
coordinators at the same time.

The reason for this can be found in the GMSJoinLeave.isNetworkPartition() 
method. That method's name is misleading. A name like isMajorityLost() would 
probably be more apt. It needs to return true iff the weight of "crashed" 
members (in the prospective view) is greater-than-or-equal-to half (50%) of the 
total weight (of all members in the current view).

What the method actually does is return true iff the weight of "crashed" 
members is greater-than 51% of the total weight. As a result, if we have two 
members of equal weight, and the coordinator sees that the non-coordinator is 
"crashed", the coordinator will keep running. If a network partition is 
happening, and the non-coordinator is still running, then it will become a 
coordinator and start producing views. Now we'll have two coordinators 
producing views concurrently.

For this discussion "crashed" members are members for which the coordinator has 
received a RemoveMemberRequest message. These are members that the failure 
detector has deemed failed. Keep in mind the failure detector is imperfect 
(it's not always right), and that's kind of the whole point of this ticket: 
we've lost contact with the non-coordinator member, but that doesn't mean it 
can't still be running (on the other side of a partition).

This bug is not limited to the two-locator scenario. Any set of members that 
can be partitioned into two equal sets is susceptible. In fact it's even a 
little worse than that. Any set of members that can be partitioned into two 
sets, both of which still have 49% or more of the total weight, will result in 
a split-brain.


> Split-brain Certain During Network Partition in Two-Locator Cluster
> ---
>
> Key: GEODE-9822
> URL: https://issues.apache.org/jira/browse/GEODE-9822
> Project: Geode
>  Issue Type: Bug
>  Components: membership
>Reporter: Bill Burcham
>Priority: Major
>  Labels: pull-request-available
>
> In a two-locator cluster with default member weights and default setting 
> (true) of enable-network-partition-detection, if a long-lived network 
> partition separates the two members, a split-brain will arise: there will be 
> two coordinators at the same time.
> The reason for this can be found in the GMSJoinLeave.isNetworkPartition() 
> method. That method's name is misleading. A name like isMajorityLost() would 
> probably be more apt. It needs to return true iff the weight of "crashed" 
> members (in the prospective view) is gre

[jira] [Updated] (GEODE-9880) Cluster with multiple locators in an environment with no host name resolution, leads to null pointer exception

2021-12-09 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-9880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham updated GEODE-9880:

Component/s: membership

> Cluster with multiple locators in an environment with no host name 
> resolution, leads to null pointer exception
> --
>
> Key: GEODE-9880
> URL: https://issues.apache.org/jira/browse/GEODE-9880
> Project: Geode
>  Issue Type: Bug
>  Components: locator, membership
>Affects Versions: 1.12.5
>Reporter: Tigran Ghahramanyan
>Priority: Major
>  Labels: membership
>
> In our use case we have two locators that are initially configured with IP 
> addresses, but _AutoConnectionSourceImpl.UpdateLocatorList()_ flow keeps on 
> adding their corresponding host names to the locators list, while these host 
> names are not resolvable.
> Later in {_}AutoConnectionSourceImpl.queryLocators(){_}, whenever a client 
> tries to use such a non-resolvable host name to connect to a locator, it tries 
> to establish a connection to {_}socketaddr=0.0.0.0{_}, as written in 
> {_}SocketCreator.connect(){_}, which seems strange.
> Then, if there is no locator running on the same host, the next locator in 
> the list is contacted, until a locator contact configured with an IP address 
> is reached, which eventually succeeds.
> But when there happens to be a locator listening on the same host, we get a 
> null pointer exception in the second line below, because _inetadd=null_:
> _socket.connect(sockaddr, Math.max(timeout, 0)); // sockaddr=0.0.0.0, 
> connects to a locator listening on the same host_
> _configureClientSSLSocket(socket, inetadd.getHostName(), timeout); // inetadd 
> = null_
>  
> As a result, the cluster comes to a failed state, unable to recover.
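
Not the actual Geode fix, just a sketch of the kind of null-safe fallback the 
report suggests, reusing the identifiers quoted above (socket, sockaddr, inetadd, 
configureClientSSLSocket, timeout) and assuming sockaddr is an InetSocketAddress:
{code:java}
// Hypothetical sketch, not SocketCreator's actual code: avoid dereferencing a null
// InetAddress when the locator host name did not resolve, by falling back to the
// address we actually connected to.
String sslHostName = (inetadd != null)
    ? inetadd.getHostName()
    : ((java.net.InetSocketAddress) sockaddr).getHostString();
configureClientSSLSocket(socket, sslHostName, timeout);
{code}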



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (GEODE-9822) Split-brain Certain During Network Partition in Two-Locator Cluster

2021-12-09 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham updated GEODE-9822:

Description: 
In a two-locator cluster with default member weights and default setting (true) 
of enable-network-partition-detection, if a long-lived network partition 
separates the two members, a split-brain will arise: there will be two 
coordinators at the same time.

The reason for this can be found in the GMSJoinLeave.isNetworkPartition() 
method. That method's name is misleading. A name like isMajorityLost() would 
probably be more apt. It needs to return true iff the weight of "crashed" 
members (in the prospective view) is greater-than-or-equal-to half (50%) of the 
total weight (of all members in the current view).

What the method actually does is return true iff the weight of "crashed" 
members is greater-than 51% of the total weight. As a result, if we have two 
members of equal weight, and the coordinator sees that the non-coordinator is 
"crashed", the coordinator will keep running. If a network partition is 
happening, and the non-coordinator is still running, then it will become a 
coordinator and start producing views. Now we'll have two coordinators 
producing views concurrently.

For this discussion "crashed" members are members for which the coordinator has 
received a RemoveMemberRequest message. These are members that the failure 
detector has deemed failed. Keep in mind the failure detector is imperfect 
(it's not always right), and that's kind of the whole point of this ticket: 
we've lost contact with the non-coordinator member, but that doesn't mean it 
can't still be running (on the other side of a partition).

This bug is not limited to the two-locator scenario. Any set of members that 
can be partitioned into two equal sets is susceptible. In fact it's even a 
little worse than that. Any set of members that can be partitioned into two 
sets, both of which still have 49% or more of the total weight, will result in 
a split-brain.

  was:
In a two-locator cluster with default member weights and default setting (true) 
of enable-network-partition-detection, if a long-lived network partition 
separates the two members, a split-brain will arise: there will be two 
coordinators at the same time.

The reason for this can be found in the GMSJoinLeave.isNetworkPartition() 
method. That method's name is misleading. A name like isMajorityLost() would 
probably be more apt. It needs to return true iff the weight of "crashed" 
members (in the prospective view) is greater-than-or-equal-to half (50%) of the 
total weight (of all members in the current view).

What the method actually does is return true iff the weight of "crashed" 
members is greater-than 51% of the total weight. As a result, if we have two 
members of equal weight, and the coordinator sees that the non-coordinator is 
"crashed", the coordinator will keep running. If a network partition is 
happening, and the non-coordinator is still running, then it will become a 
coordinator and start producing views. Now we'll have two coordinators 
producing views concurrently.

For this discussion "crashed" members are members for which the coordinator has 
received a RemoveMemberRequest message. These are members that the failure 
detector has deemed failed. Keep in mind the failure detector is imperfect 
(it's not always right), and that's kind of the whole point of this ticket: 
we've lost contact with the non-coordinator member, but that doesn't mean it 
can't still be running (on the other side of a partition).


> Split-brain Certain During Network Partition in Two-Locator Cluster
> ---
>
> Key: GEODE-9822
> URL: https://issues.apache.org/jira/browse/GEODE-9822
> Project: Geode
>  Issue Type: Bug
>  Components: membership
>Reporter: Bill Burcham
>Priority: Major
>  Labels: pull-request-available
>
> In a two-locator cluster with default member weights and default setting 
> (true) of enable-network-partition-detection, if a long-lived network 
> partition separates the two members, a split-brain will arise: there will be 
> two coordinators at the same time.
> The reason for this can be found in the GMSJoinLeave.isNetworkPartition() 
> method. That method's name is misleading. A name like isMajorityLost() would 
> probably be more apt. It needs to return true iff the weight of "crashed" 
> members (in the prospective view) is greater-than-or-equal-to half (50%) of 
> the total weight (of all members in the current view).
> What the method actually does is return true iff the weight of "crashed" 
> members is greater-than 51% of the total weight. As a result, if we have two 
> members of equal weight, and the coordinator sees that the non-coordinator is 
> "crashed", the coordina

[jira] [Updated] (GEODE-9822) Split-brain Certain During Network Partition in Two-Locator Cluster

2021-12-08 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham updated GEODE-9822:

Summary: Split-brain Certain During Network Partition in Two-Locator 
Cluster  (was: Split-brain Possible During Network Partition in Two-Locator 
Cluster)

> Split-brain Certain During Network Partition in Two-Locator Cluster
> ---
>
> Key: GEODE-9822
> URL: https://issues.apache.org/jira/browse/GEODE-9822
> Project: Geode
>  Issue Type: Bug
>  Components: membership
>Reporter: Bill Burcham
>Priority: Major
>  Labels: pull-request-available
>
> In a two-locator cluster with default member weights and default setting 
> (true) of enable-network-partition-detection, if a long-lived network 
> partition separates the two members, a split-brain will arise: there will be 
> two coordinators at the same time.
> The reason for this can be found in the GMSJoinLeave.isNetworkPartition() 
> method. That method's name is misleading. A name like isMajorityLost() would 
> probably be more apt. It needs to return true iff the weight of "crashed" 
> members (in the prospective view) is greater-than-or-equal-to half (50%) of 
> the total weight (of all members in the current view).
> What the method actually does is return true iff the weight of "crashed" 
> members is greater-than 51% of the total weight. As a result, if we have two 
> members of equal weight, and the coordinator sees that the non-coordinator is 
> "crashed", the coordinator will keep running. If a network partition is 
> happening, and the non-coordinator is still running, then it will become a 
> coordinator and start producing views. Now we'll have two coordinators 
> producing views concurrently.
> For this discussion "crashed" members are members for which the coordinator 
> has received a RemoveMemberRequest message. These are members that the 
> failure detector has deemed failed. Keep in mind the failure detector is 
> imperfect (it's not always right), and that's kind of the whole point of this 
> ticket: we've lost contact with the non-coordinator member, but that doesn't 
> mean it can't still be running (on the other side of a partition).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (GEODE-9872) DistTXPersistentDebugDUnitTest tests fail because "cluster configuration service not available"

2021-12-06 Thread Bill Burcham (Jira)
Bill Burcham created GEODE-9872:
---

 Summary: DistTXPersistentDebugDUnitTest tests fail because 
"cluster configuration service not available"
 Key: GEODE-9872
 URL: https://issues.apache.org/jira/browse/GEODE-9872
 Project: Geode
  Issue Type: Bug
  Components: tests
Reporter: Bill Burcham


I suspect this failure is due to something in the test framework, or perhaps 
one or more tests failing to manage ports correctly, allowing two or more tests 
to interfere with one another.

In this distributed test: 
[https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-develop-mass-test-run/jobs/distributed-test-openjdk8/builds/388]
 we see two failures. Here's the first full stack trace:

 

 
{code:java}
[error 2021/12/04 20:40:53.796 UTC  
tid=33] org.apache.geode.GemFireConfigException: cluster configuration service 
not available
at 
org.junit.vintage.engine.execution.TestRun.getStoredResultOrSuccessful(TestRun.java:196)
at 
org.junit.vintage.engine.execution.RunListenerAdapter.fireExecutionFinished(RunListenerAdapter.java:226)
at 
org.junit.vintage.engine.execution.RunListenerAdapter.testFinished(RunListenerAdapter.java:192)
at 
org.junit.vintage.engine.execution.RunListenerAdapter.testFinished(RunListenerAdapter.java:79)
at 
org.junit.runner.notification.SynchronizedRunListener.testFinished(SynchronizedRunListener.java:87)
at 
org.junit.runner.notification.RunNotifier$9.notifyListener(RunNotifier.java:225)
at 
org.junit.runner.notification.RunNotifier$SafeNotifier.run(RunNotifier.java:72)
at 
org.junit.runner.notification.RunNotifier.fireTestFinished(RunNotifier.java:222)
at 
org.junit.internal.runners.model.EachTestNotifier.fireTestFinished(EachTestNotifier.java:38)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:372)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
at org.junit.runner.JUnitCore.run(JUnitCore.java:115)
at 
org.junit.vintage.engine.execution.RunnerExecutor.execute(RunnerExecutor.java:43)
at 
java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
at 
java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
at java.util.Iterator.forEachRemaining(Iterator.java:116)
at 
java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
at 
java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
at 
java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
at 
java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at 
java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485)
at 
org.junit.vintage.engine.VintageTestEngine.executeAllChildren(VintageTestEngine.java:82)
at 
org.junit.vintage.engine.VintageTestEngine.execute(VintageTestEngine.java:73)
at 
org.junit.platform.launcher.core.EngineExecutionOrchestrator.execute(EngineExecutionOrchestrator.java:108)
at 
org.junit.platform.launcher.core.EngineExecutionOrchestrator.execute(EngineExecutionOrchestrator.java:88)
at 
org.junit.platform.launcher.core.EngineExecutionOrchestrator.lambda$execute$0(EngineExecutionOrchestrator.java:54)
at 
org.junit.platform.launcher.core.EngineExecutionOrchestrator.withInterceptedStreams(EngineExecutionOrchestrator.java:67)
at 
org.junit.platform.launcher.core.EngineExecutionOrchestrator.execute(EngineExecutionOrchestrator.java:52)
at 
org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:96)
at 
org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:75)
at 
org.gradle.api.internal.tasks.testing.junitplatform.JUnitPlatformTestClassProcessor$CollectAllTestClassesExecutor.processAllTestClasses(JUnitPlatformTes

[jira] [Created] (GEODE-9871) CI failure: InfoStatsIntegrationTest > networkKiloBytesReadOverLastSecond_shouldBeCloseToBytesReadOverLastSecond

2021-12-06 Thread Bill Burcham (Jira)
Bill Burcham created GEODE-9871:
---

 Summary: CI failure: InfoStatsIntegrationTest > 
networkKiloBytesReadOverLastSecond_shouldBeCloseToBytesReadOverLastSecond
 Key: GEODE-9871
 URL: https://issues.apache.org/jira/browse/GEODE-9871
 Project: Geode
  Issue Type: Bug
  Components: redis, statistics
Affects Versions: 1.15.0
Reporter: Bill Burcham


link: 
[https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-develop-main/jobs/integration-test-openjdk8/builds/38]


stack trace:
{code:java}
InfoStatsIntegrationTest > 
networkKiloBytesReadOverLastSecond_shouldBeCloseToBytesReadOverLastSecond FAILED
org.opentest4j.AssertionFailedError: 
expected: 0.0
 but was: 0.01
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at 
org.apache.geode.redis.internal.commands.executor.server.AbstractRedisInfoStatsIntegrationTest.networkKiloBytesReadOverLastSecond_shouldBeCloseToBytesReadOverLastSecond(AbstractRedisInfoStatsIntegrationTest.java:228)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
at 
org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at 
org.apache.geode.test.junit.rules.serializable.SerializableExternalResource$1.evaluate(SerializableExternalResource.java:38)
at org.junit.rules.RunRules.evaluate(RunRules.java:20)
at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
at org.junit.runner.JUnitCore.run(JUnitCore.java:115)
at 
org.junit.vintage.engine.execution.RunnerExecutor.execute(RunnerExecutor.java:43)
at 
java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
at 
java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
at java.util.Iterator.forEachRemaining(Iterator.java:116)
at 
java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
at 
java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
at 
java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
at 
java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at 
java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485)
at 
org.junit.vintage.engine.VintageTestEngine.executeAllChildren(VintageTestEngine.java:82)
at 
org.junit.vintage.engine.VintageTestEngine.execute(VintageTestEngine.java:73)
at 
org.junit.platform.launcher.core.EngineExecutionOrchestrator.execute(EngineExecutionOrchestrator.java:108)
at 
org.junit.platform.launcher.core.EngineExecutionOrchestrator.execute(EngineExecutionOrchestrator.java:88)
at 
org.junit.platform.launc

[jira] [Reopened] (GEODE-9866) CI Failure : MemoryStatsIntegrationTest > usedMemory_shouldIncrease_givenAdditionalValuesAdded FAILED

2021-12-06 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-9866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham reopened GEODE-9866:
-

Seen again: 
https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-develop-main/jobs/integration-test-openjdk8/builds/37

> CI Failure : MemoryStatsIntegrationTest > 
> usedMemory_shouldIncrease_givenAdditionalValuesAdded FAILED
> -
>
> Key: GEODE-9866
> URL: https://issues.apache.org/jira/browse/GEODE-9866
> Project: Geode
>  Issue Type: Bug
>  Components: redis, statistics
>Reporter: Nabarun Nag
>Assignee: Jens Deppe
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.15.0
>
>
> link : 
> [https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-develop-main/jobs/windows-integration-test-openjdk8/builds/31]
> Bug Report:
> {noformat}
> MemoryStatsIntegrationTest > 
> usedMemory_shouldIncrease_givenAdditionalValuesAdded FAILED
> java.lang.AssertionError: 
> Expecting actual:
>   61121264L
> to be greater than:
>   105070472L
> at 
> org.apache.geode.redis.internal.commands.executor.server.AbstractRedisMemoryStatsIntegrationTest.usedMemory_shouldIncrease_givenAdditionalValuesAdded(AbstractRedisMemoryStatsIntegrationTest.java:80)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
> at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
> at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
> at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
> at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
> at 
> org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
> at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
> at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
> at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
> at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
> at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
> at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
> at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
> at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
> at 
> org.apache.geode.test.junit.rules.serializable.SerializableExternalResource$1.evaluate(SerializableExternalResource.java:38)
> at org.junit.rules.RunRules.evaluate(RunRules.java:20)
> at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
> at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
> at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
> at org.junit.runner.JUnitCore.run(JUnitCore.java:115)
> at 
> org.junit.vintage.engine.execution.RunnerExecutor.execute(RunnerExecutor.java:43)
> at 
> java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
> at 
> java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
> at java.util.Iterator.forEachRemaining(Iterator.java:116)
> at 
> java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
> at 
> java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
> at 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
> at 
> java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
> at 
> java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
> at 
> java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> at 
> java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485)
> at 
> org.junit.vintage.engine.VintageTestEngine.executeAllChildren(VintageTestEngine.java:82)
> at 
> org.junit.vintage.engine.VintageTestEngine.execute(VintageTestEngine.java:73)
> at 
> org.junit.platform.launcher.core.Engi

[jira] [Updated] (GEODE-9870) JedisMovedDataException exception in testReconnectionWithAuthAndServerRestarts

2021-12-06 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-9870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham updated GEODE-9870:

Description: 
CI failure here 
[https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-develop-mass-test-run/jobs/distributed-test-openjdk8/builds/315|https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-develop-mass-test-run/jobs/distributed-test-openjdk8/builds/315]:

 
{code:java}
AuthWhileServersRestartDUnitTest > testReconnectionWithAuthAndServerRestarts 
FAILED
redis.clients.jedis.exceptions.JedisMovedDataException: MOVED 12539 
127.0.0.1:26259
at redis.clients.jedis.Protocol.processError(Protocol.java:119)
at redis.clients.jedis.Protocol.process(Protocol.java:169)
at redis.clients.jedis.Protocol.read(Protocol.java:223)
at 
redis.clients.jedis.Connection.readProtocolWithCheckingBroken(Connection.java:352)
at 
redis.clients.jedis.Connection.getStatusCodeReply(Connection.java:270)
at redis.clients.jedis.BinaryJedis.flushAll(BinaryJedis.java:826)
at 
org.apache.geode.test.dunit.rules.RedisClusterStartupRule.flushAll(RedisClusterStartupRule.java:147)
at 
org.apache.geode.test.dunit.rules.RedisClusterStartupRule.flushAll(RedisClusterStartupRule.java:131)
at 
org.apache.geode.redis.internal.executor.auth.AuthWhileServersRestartDUnitTest.after(AuthWhileServersRestartDUnitTest.java:88){code}

  was:
CI failure:

 
{code:java}
AuthWhileServersRestartDUnitTest > testReconnectionWithAuthAndServerRestarts 
FAILED
redis.clients.jedis.exceptions.JedisMovedDataException: MOVED 12539 
127.0.0.1:26259
at redis.clients.jedis.Protocol.processError(Protocol.java:119)
at redis.clients.jedis.Protocol.process(Protocol.java:169)
at redis.clients.jedis.Protocol.read(Protocol.java:223)
at 
redis.clients.jedis.Connection.readProtocolWithCheckingBroken(Connection.java:352)
at 
redis.clients.jedis.Connection.getStatusCodeReply(Connection.java:270)
at redis.clients.jedis.BinaryJedis.flushAll(BinaryJedis.java:826)
at 
org.apache.geode.test.dunit.rules.RedisClusterStartupRule.flushAll(RedisClusterStartupRule.java:147)
at 
org.apache.geode.test.dunit.rules.RedisClusterStartupRule.flushAll(RedisClusterStartupRule.java:131)
at 
org.apache.geode.redis.internal.executor.auth.AuthWhileServersRestartDUnitTest.after(AuthWhileServersRestartDUnitTest.java:88){code}


> JedisMovedDataException exception in testReconnectionWithAuthAndServerRestarts
> --
>
> Key: GEODE-9870
> URL: https://issues.apache.org/jira/browse/GEODE-9870
> Project: Geode
>  Issue Type: Bug
>  Components: redis
>Affects Versions: 1.15.0
>Reporter: Bill Burcham
>Priority: Major
>
> CI failure here 
> [https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-develop-mass-test-run/jobs/distributed-test-openjdk8/builds/315|https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-develop-mass-test-run/jobs/distributed-test-openjdk8/builds/315]:
>  
> {code:java}
> AuthWhileServersRestartDUnitTest > testReconnectionWithAuthAndServerRestarts 
> FAILED
> redis.clients.jedis.exceptions.JedisMovedDataException: MOVED 12539 
> 127.0.0.1:26259
> at redis.clients.jedis.Protocol.processError(Protocol.java:119)
> at redis.clients.jedis.Protocol.process(Protocol.java:169)
> at redis.clients.jedis.Protocol.read(Protocol.java:223)
> at 
> redis.clients.jedis.Connection.readProtocolWithCheckingBroken(Connection.java:352)
> at 
> redis.clients.jedis.Connection.getStatusCodeReply(Connection.java:270)
> at redis.clients.jedis.BinaryJedis.flushAll(BinaryJedis.java:826)
> at 
> org.apache.geode.test.dunit.rules.RedisClusterStartupRule.flushAll(RedisClusterStartupRule.java:147)
> at 
> org.apache.geode.test.dunit.rules.RedisClusterStartupRule.flushAll(RedisClusterStartupRule.java:131)
> at 
> org.apache.geode.redis.internal.executor.auth.AuthWhileServersRestartDUnitTest.after(AuthWhileServersRestartDUnitTest.java:88){code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (GEODE-9870) JedisMovedDataException exception in testReconnectionWithAuthAndServerRestarts

2021-12-06 Thread Bill Burcham (Jira)
Bill Burcham created GEODE-9870:
---

 Summary: JedisMovedDataException exception in 
testReconnectionWithAuthAndServerRestarts
 Key: GEODE-9870
 URL: https://issues.apache.org/jira/browse/GEODE-9870
 Project: Geode
  Issue Type: Bug
  Components: redis
Affects Versions: 1.15.0
Reporter: Bill Burcham


CI failure:

 
{code:java}
AuthWhileServersRestartDUnitTest > testReconnectionWithAuthAndServerRestarts 
FAILED
redis.clients.jedis.exceptions.JedisMovedDataException: MOVED 12539 
127.0.0.1:26259
at redis.clients.jedis.Protocol.processError(Protocol.java:119)
at redis.clients.jedis.Protocol.process(Protocol.java:169)
at redis.clients.jedis.Protocol.read(Protocol.java:223)
at 
redis.clients.jedis.Connection.readProtocolWithCheckingBroken(Connection.java:352)
at 
redis.clients.jedis.Connection.getStatusCodeReply(Connection.java:270)
at redis.clients.jedis.BinaryJedis.flushAll(BinaryJedis.java:826)
at 
org.apache.geode.test.dunit.rules.RedisClusterStartupRule.flushAll(RedisClusterStartupRule.java:147)
at 
org.apache.geode.test.dunit.rules.RedisClusterStartupRule.flushAll(RedisClusterStartupRule.java:131)
at 
org.apache.geode.redis.internal.executor.auth.AuthWhileServersRestartDUnitTest.after(AuthWhileServersRestartDUnitTest.java:88){code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (GEODE-9396) Upgrades using SSL fail with mismatch of hostname in certificates

2021-11-30 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-9396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham resolved GEODE-9396.
-
Fix Version/s: 1.15.0
   Resolution: Fixed

> Upgrades using SSL fail with mismatch of hostname in certificates
> -
>
> Key: GEODE-9396
> URL: https://issues.apache.org/jira/browse/GEODE-9396
> Project: Geode
>  Issue Type: Bug
>  Components: membership
>Affects Versions: 1.15.0
>Reporter: Ernest Burghardt
>Assignee: Bill Burcham
>Priority: Major
>  Labels: pull-request-available, release-blocker
> Fix For: 1.15.0
>
>
> When upgrading from a previous version (prior to 1.14) the ssl handshake will 
> fail to complete in cases where the Certificate contains a symbolic name that 
> doesn't match the hostname used by the sslengine.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (GEODE-9396) Upgrades using SSL fail with mismatch of hostname in certificates

2021-11-29 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-9396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham reassigned GEODE-9396:
---

Assignee: Bill Burcham  (was: Kamilla Aslami)

> Upgrades using SSL fail with mismatch of hostname in certificates
> -
>
> Key: GEODE-9396
> URL: https://issues.apache.org/jira/browse/GEODE-9396
> Project: Geode
>  Issue Type: Bug
>  Components: membership
>Affects Versions: 1.15.0
>Reporter: Ernest Burghardt
>Assignee: Bill Burcham
>Priority: Major
>  Labels: pull-request-available, release-blocker
>
> When upgrading from a previous version (prior to 1.14) the ssl handshake will 
> fail to complete in cases where the Certificate contains a symbolic name that 
> doesn't match the hostname used by the sslengine.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (GEODE-9825) Disparate socket-buffer-size Results in "IOException: Unknown header byte" and Hangs

2021-11-24 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-9825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham resolved GEODE-9825.
-
Resolution: Fixed

> Disparate socket-buffer-size Results in "IOException: Unknown header byte" 
> and Hangs
> 
>
> Key: GEODE-9825
> URL: https://issues.apache.org/jira/browse/GEODE-9825
> Project: Geode
>  Issue Type: Bug
>  Components: messaging
>Affects Versions: 1.12.4, 1.15.0
>Reporter: Bill Burcham
>Assignee: Bill Burcham
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.12.6, 1.13.5, 1.14.1, 1.15.0
>
>
> GEODE-9141 introduced a bug that causes {{IOException: "Unknown header 
> byte..."}} and hangs if members are configured with different 
> {{socket-buffer-size}} settings.
> h2. Reproduction
> To reproduce this bug turn off TLS and set socket-buffer-size on sender to be 
> 64KB and set socket-buffer-size on receiver to be 32KB. See associated PR for 
> an example.
> h2. Analysis
> The problem is in {{{}Connection.processInputBuffer(){}}}. When that method has 
> read all the messages it can from the current input buffer, it considers whether 
> the buffer needs expansion. If it does, then:
> {code:java}
> inputBuffer = inputSharing.expandReadBufferIfNeeded(allocSize); {code}
> is executed and the method returns. The caller then expects to be able to 
> _write_ bytes into {{{}inputBuffer{}}}.
> The problem, it seems, is that 
> {{ByteBufferSharingInternalImpl.expandReadBufferIfNeeded()}} does not leave 
> the {{ByteBuffer}} in the proper state. It leaves the buffer ready to be 
> _read_, not written.
> Before the changes for GEODE-9141 were introduced, the line of code 
> referenced above used to be this snippet in 
> {{Connection.compactOrResizeBuffer(int messageLength)}} (that method has 
> since been removed):
> {code:java}
>      // need a bigger buffer
>     logger.info("Allocating larger network read buffer, new size is {} old 
> size was {}.",
>         allocSize, oldBufferSize);
>     ByteBuffer oldBuffer = inputBuffer;
>     inputBuffer = getBufferPool().acquireDirectReceiveBuffer(allocSize);    
> if (oldBuffer != null) {
>       int oldByteCount = oldBuffer.remaining();
>       inputBuffer.put(oldBuffer);
>       inputBuffer.position(oldByteCount);
>       getBufferPool().releaseReceiveBuffer(oldBuffer);
>     } {code}
> Notice how this method leaves {{inputBuffer}} ready to be _written_ to.
> But the code inside 
> {{ByteBufferSharingInternalImpl.expandReadBufferIfNeeded()}} is doing 
> something like:
> {code:java}
> newBuffer.clear();
> newBuffer.put(existing);
> newBuffer.flip();
> releaseBuffer(type, existing);
> return newBuffer; {code}
> A solution (shown in the associated PR) is to add logic after the call to 
> {{expandReadBufferIfNeeded(allocSize)}} to leave the buffer in a _writeable_ 
> state:
> {code:java}
> inputBuffer = inputSharing.expandReadBufferIfNeeded(allocSize);
> // we're returning to the caller (done == true) so make buffer writeable
> inputBuffer.position(inputBuffer.limit());
> inputBuffer.limit(inputBuffer.capacity()); {code}
> h2. Resolution
> When this ticket is complete the bug will be fixed and 
> {{P2PMessagingConcurrencyDUnitTest}} will be enhanced to test at least these 
> combinations:
> [security, sender/locator socket-buffer-size, receiver socket-buffer-size]
> [TLS, (default), (default)]  this is what the test currently does
> [no TLS, 64 * 1024, 32 * 1024] *new: this illustrates this bug*
> [no TLS, (default), (default)] *new*
> We might want to mix in conserve-sockets true/false in there too while we're 
> at it (the test currently holds it at true).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (GEODE-9825) Disparate socket-buffer-size Results in "IOException: Unknown header byte" and Hangs

2021-11-24 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-9825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham updated GEODE-9825:

Fix Version/s: 1.12.6
   1.13.5
   1.14.1
   1.15.0

> Disparate socket-buffer-size Results in "IOException: Unknown header byte" 
> and Hangs
> 
>
> Key: GEODE-9825
> URL: https://issues.apache.org/jira/browse/GEODE-9825
> Project: Geode
>  Issue Type: Bug
>  Components: messaging
>Affects Versions: 1.12.4, 1.15.0
>Reporter: Bill Burcham
>Assignee: Bill Burcham
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.12.6, 1.13.5, 1.14.1, 1.15.0
>
>
> GEODE-9141 introduced a bug that causes {{IOException: "Unknown header 
> byte..."}} and hangs if members are configured with different 
> {{socket-buffer-size}} settings.
> h2. Reproduction
> To reproduce this bug turn off TLS and set socket-buffer-size on sender to be 
> 64KB and set socket-buffer-size on receiver to be 32KB. See associated PR for 
> an example.
> h2. Analysis
> The problem is in {{{}Connection.processInputBuffer(){}}}. When that method has 
> read all the messages it can from the current input buffer, it considers whether 
> the buffer needs expansion. If it does, then:
> {code:java}
> inputBuffer = inputSharing.expandReadBufferIfNeeded(allocSize); {code}
> is executed and the method returns. The caller then expects to be able to 
> _write_ bytes into {{{}inputBuffer{}}}.
> The problem, it seems, is that 
> {{ByteBufferSharingInternalImpl.expandReadBufferIfNeeded()}} does not leave 
> the {{ByteBuffer}} in the proper state. It leaves the buffer ready to be 
> _read_, not written.
> Before the changes for GEODE-9141 were introduced, the line of code 
> referenced above used to be this snippet in 
> {{Connection.compactOrResizeBuffer(int messageLength)}} (that method has 
> since been removed):
> {code:java}
>      // need a bigger buffer
>     logger.info("Allocating larger network read buffer, new size is {} old 
> size was {}.",
>         allocSize, oldBufferSize);
>     ByteBuffer oldBuffer = inputBuffer;
>     inputBuffer = getBufferPool().acquireDirectReceiveBuffer(allocSize);    
> if (oldBuffer != null) {
>       int oldByteCount = oldBuffer.remaining();
>       inputBuffer.put(oldBuffer);
>       inputBuffer.position(oldByteCount);
>       getBufferPool().releaseReceiveBuffer(oldBuffer);
>     } {code}
> Notice how this method leaves {{inputBuffer}} ready to be _written_ to.
> But the code inside 
> {{ByteBufferSharingInternalImpl.expandReadBufferIfNeeded()}} is doing 
> something like:
> {code:java}
> newBuffer.clear();
> newBuffer.put(existing);
> newBuffer.flip();
> releaseBuffer(type, existing);
> return newBuffer; {code}
> A solution (shown in the associated PR) is to add logic after the call to 
> {{expandReadBufferIfNeeded(allocSize)}} to leave the buffer in a _writeable_ 
> state:
> {code:java}
> inputBuffer = inputSharing.expandReadBufferIfNeeded(allocSize);
> // we're returning to the caller (done == true) so make buffer writeable
> inputBuffer.position(inputBuffer.limit());
> inputBuffer.limit(inputBuffer.capacity()); {code}
> h2. Resolution
> When this ticket is complete the bug will be fixed and 
> {{P2PMessagingConcurrencyDUnitTest}} will be enhanced to test at least these 
> combinations:
> [security, sender/locator socket-buffer-size, receiver socket-buffer-size]
> [TLS, (default), (default)]  this is what the test currently does
> [no TLS, 64 * 1024, 32 * 1024] *new: this illustrates this bug*
> [no TLS, (default), (default)] *new*
> We might want to mix in conserve-sockets true/false in there too while we're 
> at it (the test currently holds it at true).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (GEODE-9825) Disparate socket-buffer-size Results in "IOException: Unknown header byte" and Hangs

2021-11-23 Thread Bill Burcham (Jira)


[ 
https://issues.apache.org/jira/browse/GEODE-9825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17448309#comment-17448309
 ] 

Bill Burcham commented on GEODE-9825:
-

Merged to {{{}develop{}}}. Back-port PR to 1.14 is ready to merge. A flaky test 
failed in the PR for 1.13 (wrote a new ticket GEODE-9850 and re-initiated the 
test).

The back-port to 1.12 has a problem. I had to back-port the PR for GEODE-9713 
(test framework enhancement). Unfortunately, it relies on a newer version 
(4.1.0) of Awaitility (we were at 3.1.6). Bumping just that version in 
DependencyConstraints.groovy was not sufficient, as something (TBD) depends on 
Awaitility 2.0.0 and that version is taking precedence. I want to work this 
out and then merge all three PRs together (in close succession).

> Disparate socket-buffer-size Results in "IOException: Unknown header byte" 
> and Hangs
> 
>
> Key: GEODE-9825
> URL: https://issues.apache.org/jira/browse/GEODE-9825
> Project: Geode
>  Issue Type: Bug
>  Components: messaging
>Affects Versions: 1.12.4, 1.15.0
>Reporter: Bill Burcham
>Assignee: Bill Burcham
>Priority: Major
>  Labels: pull-request-available
>
> GEODE-9141 introduced a bug that causes {{IOException: "Unknown header 
> byte..."}} and hangs if members are configured with different 
> {{socket-buffer-size}} settings.
> h2. Reproduction
> To reproduce this bug turn off TLS and set socket-buffer-size on sender to be 
> 64KB and set socket-buffer-size on receiver to be 32KB. See associated PR for 
> an example.
> h2. Analysis
> In {{{}Connection.processInputBuffer(){}}}, when that method has read all the 
> messages it can from the current input buffer, it considers whether the 
> buffer needs expansion. If it does, this line is executed and the method 
> returns:
> {code:java}
> inputBuffer = inputSharing.expandReadBufferIfNeeded(allocSize); {code}
> The caller then expects to be able to _write_ bytes into {{{}inputBuffer{}}}.
> The problem, it seems, is that 
> {{ByteBufferSharingInternalImpl.expandReadBufferIfNeeded()}} does not leave 
> the {{ByteBuffer}} in the proper state. It leaves the buffer ready to be 
> _read_, not written.
> Before the changes for GEODE-9141 were introduced, the line of code 
> referenced above used to be this snippet in 
> {{Connection.compactOrResizeBuffer(int messageLength)}} (that method has 
> since been removed):
> {code:java}
>      // need a bigger buffer
>     logger.info("Allocating larger network read buffer, new size is {} old 
> size was {}.",
>         allocSize, oldBufferSize);
>     ByteBuffer oldBuffer = inputBuffer;
>     inputBuffer = getBufferPool().acquireDirectReceiveBuffer(allocSize);    
> if (oldBuffer != null) {
>       int oldByteCount = oldBuffer.remaining();
>       inputBuffer.put(oldBuffer);
>       inputBuffer.position(oldByteCount);
>       getBufferPool().releaseReceiveBuffer(oldBuffer);
>     } {code}
> Notice how this method leaves {{inputBuffer}} ready to be _written_ to.
> But the code inside 
> {{ByteBufferSharingInternalImpl.expandReadBufferIfNeeded()}} is doing 
> something like:
> {code:java}
> newBuffer.clear();
> newBuffer.put(existing);
> newBuffer.flip();
> releaseBuffer(type, existing);
> return newBuffer; {code}
> A solution (shown in the associated PR) is to add logic after the call to 
> {{expandReadBufferIfNeeded(allocSize)}} to leave the buffer in a _writeable_ 
> state:
> {code:java}
> inputBuffer = inputSharing.expandReadBufferIfNeeded(allocSize);
> // we're returning to the caller (done == true) so make buffer writeable
> inputBuffer.position(inputBuffer.limit());
> inputBuffer.limit(inputBuffer.capacity()); {code}
> h2. Resolution
> When this ticket is complete the bug will be fixed and 
> {{P2PMessagingConcurrencyDUnitTest}} will be enhanced to test at least these 
> combinations:
> [security, sender/locator socket-buffer-size, receiver socket-buffer-size]
> [TLS, (default), (default)]  this is what the test currently does
> [no TLS, 64 * 1024, 32 * 1024] *new: this illustrates this bug*
> [no TLS, (default), (default)] *new*
> We might want to mix in conserve-sockets true/false in there too while we're 
> at it (the test currently holds it at true).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (GEODE-9850) flaky test: testGetOldestTombstoneTimeForReplicateTombstoneSweeper

2021-11-23 Thread Bill Burcham (Jira)
Bill Burcham created GEODE-9850:
---

 Summary: flaky test: 
testGetOldestTombstoneTimeForReplicateTombstoneSweeper
 Key: GEODE-9850
 URL: https://issues.apache.org/jira/browse/GEODE-9850
 Project: Geode
  Issue Type: Bug
  Components: tests
Affects Versions: 1.13.5
Reporter: Bill Burcham


First saw this failure in PR pipeline on support/1.13 here: 
[https://concourse.apachegeode-ci.info/builds/3912569]


{code:java}
org.apache.geode.internal.cache.versions.TombstoneDUnitTest > 
testGetOldestTombstoneTimeForReplicateTombstoneSweeper FAILED
org.apache.geode.test.dunit.RMIException: While invoking 
org.apache.geode.internal.cache.versions.TombstoneDUnitTest$$Lambda$42/2046302475.run
 in VM 0 running on Host 9a305b2d7db7 with 4 VMs
at org.apache.geode.test.dunit.VM.executeMethodOnObject(VM.java:610)
at org.apache.geode.test.dunit.VM.invoke(VM.java:437)
at 
org.apache.geode.internal.cache.versions.TombstoneDUnitTest.testGetOldestTombstoneTimeForReplicateTombstoneSweeper(TombstoneDUnitTest.java:228)

Caused by:
java.lang.AssertionError: 
Expecting:
 <-1637701703343L>
to be greater than:
 <0L> 
at 
org.apache.geode.internal.cache.versions.TombstoneDUnitTest.lambda$testGetOldestTombstoneTimeForReplicateTombstoneSweeper$bb17a952$3(TombstoneDUnitTest.java:237)
 {code}
I believe the fix is to wrap this assertion in an Awaitility call.
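
For illustration, a minimal sketch of that change (assuming AssertJ and 
Awaitility, which the DUnit tests already use; the supplier parameter is a 
stand-in for whatever the test actually reads from the sweeper, not the 
test's real code):
{code:java}
import static org.assertj.core.api.Assertions.assertThat;
import static org.awaitility.Awaitility.await;

import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

// Poll until the observed oldest-tombstone time becomes positive, instead of
// asserting it exactly once and racing the sweeper.
void awaitPositiveOldestTombstoneTime(LongSupplier oldestTombstoneTime) {
  await().atMost(30, TimeUnit.SECONDS)
      .untilAsserted(() -> assertThat(oldestTombstoneTime.getAsLong()).isGreaterThan(0L));
}
{code}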



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (GEODE-9764) Request-Response Messaging Should Time Out

2021-11-22 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-9764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham updated GEODE-9764:

Description: 
There is a weakness in the P2P/DirectChannel messaging architecture, in that it 
never gives up on a request (in a request-response scenario). As a result a bug 
(software fault) anywhere from the point where the requesting thread hands off 
the {{DistributionMessage}} e.g. to 
{{{}ClusterDistributionManager.putOutgoing(DistributionMessage){}}}, to the 
point where that request is ultimately fulfilled on a (one) receiver, can 
result in a hang (of some task on the send side, which is waiting for a 
response).

Well it's a little worse than that because any code in the return (response) 
path can also cause disruption of the (response) flow, thereby leaving the 
requesting task hanging.

If the code in the request path (primarily in P2P messaging) and the code in 
the response path (P2P messaging and TBD higher-level code) were perfect this 
might not be a problem. But there is a fair amount of code there and we have 
some evidence that it is currently not perfect, nor do we expect it to become 
perfect and stay that way.

This is a sketch of the situation. The left-most column is the request path or 
the originating member. The middle column is the server-side of the 
request-response path. And the right-most column is the response path back on 
the originating member.

!image-2021-11-22-12-14-59-117.png!

You can see that Geode product code, JDK code, and hardware components all lie 
in the end-to-end request-response messaging path.

That being the case it seems prudent to institute response timeouts so that 
bugs of this sort (which disrupt request-response message flow) don't result in 
hangs.

It's TBD if we want to go a step further and institute retries. The latter 
would entail introducing duplicate-suppression (conflation) in P2P messaging. 
We might also add exponential backoff (open-loop) or back-pressure 
(closed-loop) to prevent a flood of retries when the system is at or near the 
point of thrashing.

But even without retries, a configurable timeout might have good ROI as a first 
step. This would entail:
 * adding a configuration parameter to specify the timeout value
 * changing ReplyProcessor21 and others TBD to "give up" after the timeout has 
elapsed
 * changing higher-level code dependent on request-reply messaging so it 
properly handles the situations where we might have to "give up"

This issue affects all versions of Geode.
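
As a rough illustration of the "give up after a configurable timeout" idea 
(names here are hypothetical, not existing Geode classes), the basic shape is 
a bounded wait rather than an unbounded one:
{code:java}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: a reply collector that stops waiting after a
// configurable timeout instead of blocking forever.
class ReplyWait {
  private final CountDownLatch repliesReceived;

  ReplyWait(int expectedReplies) {
    repliesReceived = new CountDownLatch(expectedReplies);
  }

  void replyArrived() {
    repliesReceived.countDown();
  }

  void waitForReplies(long timeoutMillis) throws InterruptedException, TimeoutException {
    if (!repliesReceived.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
      throw new TimeoutException("gave up waiting for replies after " + timeoutMillis + " ms");
    }
  }
}
{code}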
h2. Counterpoint

Not everybody thinks timeouts are a good idea. This section has the highlights.
h3. Timeouts Will Result in Data-Inconsistency

If we leave most the surrounding code as-is and introduce timeouts, then we 
risk data inconsistency. TODO: describe in detail why data inconsistency is 
_inherent_ in using timeouts.
h3. Narrow The Vulnerability Cross-Section Without Timeouts

The proposal (above) seeks to solve the problem using end-to-end timeouts since 
any component in the path can, in general, have faults. An alternative 
approach, would be to assume that _some_ of the components can be made "good 
enough" (without adding timeouts) and that those "good enough" components can 
protect themselves (and user applications) from faults in the remaining 
components.

With this approach, the Cluster Distribution Manager, and P2P / TCP Conduit / 
Direct Channel framework would be enhanced so that it was less susceptible to 
bugs in:
 * the 341 Distribution Message classes
 * the 68 Reply Message classes
 * the 95 Reply Processor classes

The question is: what form would that enhancement take, and would it be 
sufficient to overcome faults in the remaining components (the JDK, and the 
host+network layers)?
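
As a sketch of that direction (all names hypothetical, not the existing 
framework code), the framework rather than each message class would guarantee 
that some reply is always sent:
{code:java}
// Hypothetical sketch: the dispatcher owns the reply obligation, so a bug in
// one message's process() cannot leave the requester waiting forever.
interface ProcessableMessage {
  Object process() throws Exception;

  void sendReply(Object result);

  void sendErrorReply(Throwable cause);
}

final class GuardedDispatcher {
  void dispatch(ProcessableMessage message) {
    try {
      message.sendReply(message.process());
    } catch (Throwable t) {
      // Even on an unexpected fault, the requester still gets a response.
      message.sendErrorReply(t);
    }
  }
}
{code}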
h2. Alternatives Discussed

These alternatives have been discussed, to varying degrees.

 * Baseline: no timeouts; members waiting for replies do "the right thing" if 
recipient departs view
 * Give-up-after-timeout
 * Retry-after-timeout-and-eventually-give-up
 * Retry-after-forcing-receiver-out-of-view

  was:
There is a weakness in the P2P/DirectChannel messaging architecture, in that it 
never gives up on a request (in a request-response scenario). As a result a bug 
(software fault) anywhere from the point where the requesting thread hands off 
the {{DistributionMessage}} e.g. to 
{{{}ClusterDistributionManager.putOutgoing(DistributionMessage){}}}, to the 
point where that request is ultimately fulfilled on a (one) receiver, can 
result in a hang (of some task on the send side, which is waiting for a 
response).

Well it's a little worse than that because any code in the return (response) 
path can also cause disruption of the (response) flow, thereby leaving the 
requesting task hanging.

If the code in the request path (primarily in P2P messaging) and the code in 
the response path (P2P messaging and TBD higher-level code) were perfect this 
might n

[jira] [Updated] (GEODE-9764) Request-Response Messaging Should Time Out

2021-11-22 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-9764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham updated GEODE-9764:

Attachment: image-2021-11-22-12-14-59-117.png

> Request-Response Messaging Should Time Out
> --
>
> Key: GEODE-9764
> URL: https://issues.apache.org/jira/browse/GEODE-9764
> Project: Geode
>  Issue Type: Improvement
>  Components: messaging
>Reporter: Bill Burcham
>Assignee: Bill Burcham
>Priority: Major
> Attachments: image-2021-11-22-11-52-23-586.png, 
> image-2021-11-22-12-14-59-117.png
>
>
> There is a weakness in the P2P/DirectChannel messaging architecture, in that 
> it never gives up on a request (in a request-response scenario). As a result 
> a bug (software fault) anywhere from the point where the requesting thread 
> hands off the {{DistributionMessage}} e.g. to 
> {{{}ClusterDistributionManager.putOutgoing(DistributionMessage){}}}, to the 
> point where that request is ultimately fulfilled on a (one) receiver, can 
> result in a hang (of some task on the send side, which is waiting for a 
> response).
> Well it's a little worse than that because any code in the return (response) 
> path can also cause disruption of the (response) flow, thereby leaving the 
> requesting task hanging.
> If the code in the request path (primarily in P2P messaging) and the code in 
> the response path (P2P messaging and TBD higher-level code) were perfect this 
> might not be a problem. But there is a fair amount of code there and we have 
> some evidence that it is currently not perfect, nor do we expect it to become 
> perfect and stay that way. That being the case it seems prudent to institute 
> response timeouts so that bugs of this sort (which disrupt request-response 
> message flow) don't result in hangs.
> It's TBD if we want to go a step further and institute retries. The latter 
> would entail introducing duplicate-suppression (conflation) in P2P messaging. 
> We might also add exponential backoff (open-loop) or back-pressure 
> (closed-loop) to prevent a flood of retries when the system is at or near the 
> point of thrashing.
> But even without retries, a configurable timeout might have good ROI as a 
> first step. This would entail:
>  * adding a configuration parameter to specify the timeout value
>  * changing ReplyProcessor21 and others TBD to "give up" after the timeout 
> has elapsed
>  * changing higher-level code dependent on request-reply messaging so it 
> properly handles the situations where we might have to "give up"
> This issue affects all versions of Geode.
> h2. Counterpoint
> Not everybody thinks timeouts are a good idea. Here are some alternative ideas:
>  
> Make the request-response primitive better: make it so that only bugs in our 
> core messaging framework could cause a lack of response, rather than our 
> current approach where a bug in a class like “RemotePutMessage” could cause a 
> lack of a response.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (GEODE-9764) Request-Response Messaging Should Time Out

2021-11-22 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-9764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham updated GEODE-9764:

Description: 
There is a weakness in the P2P/DirectChannel messaging architecture, in that it 
never gives up on a request (in a request-response scenario). As a result a bug 
(software fault) anywhere from the point where the requesting thread hands off 
the {{DistributionMessage}} e.g. to 
{{{}ClusterDistributionManager.putOutgoing(DistributionMessage){}}}, to the 
point where that request is ultimately fulfilled on a (one) receiver, can 
result in a hang (of some task on the send side, which is waiting for a 
response).

Well it's a little worse than that because any code in the return (response) 
path can also cause disruption of the (response) flow, thereby leaving the 
requesting task hanging.

If the code in the request path (primarily in P2P messaging) and the code in 
the response path (P2P messaging and TBD higher-level code) were perfect this 
might not be a problem. But there is a fair amount of code there and we have 
some evidence that it is currently not perfect, nor do we expect it to become 
perfect and stay that way.

This is a sketch of the situation. The left-most column is the request path or 
the originating member. The middle column is the server-side of the 
request-response path. And the right-most column is the response path back on 
the originating member.

!image-2021-11-22-12-14-59-117.png!

You can see that Geode product code, JDK code, and hardware components all lie 
in the end-to-end request-response messaging path.

That being the case it seems prudent to institute response timeouts so that 
bugs of this sort (which disrupt request-response message flow) don't result in 
hangs.

It's TBD if we want to go a step further and institute retries. The latter 
would entail introducing duplicate-suppression (conflation) in P2P messaging. 
We might also add exponential backoff (open-loop) or back-pressure 
(closed-loop) to prevent a flood of retries when the system is at or near the 
point of thrashing.

But even without retries, a configurable timeout might have good ROI as a first 
step. This would entail:
 * adding a configuration parameter to specify the timeout value
 * changing ReplyProcessor21 and others TBD to "give up" after the timeout has 
elapsed
 * changing higher-level code dependent on request-reply messaging so it 
properly handles the situations where we might have to "give up"

This issue affects all versions of Geode.
h2. Counterpoint

Not everybody thinks timeouts are a good idea. Here are some alternative ideas:

The proposal (above) seeks to solve the problem using end-to-end timeouts since 
any component in the path can, in general, have faults. An alternative 
approach, would be to assume that _some_ of the components can be made "good 
enough" (without adding timeouts) and that those "good enough" components can 
protect themselves (and user applications) from faults in the remaining 
components.

With this approach, the Cluster Distribution Manager, and P2P / TCP Conduit / 
Direct Channel framework would be enhanced so that it was less susceptible to 
bugs in:
 * the 341 Distribution Message classes
 * the 68 Reply Message classes
 * the 95 Reply Processor classes

The question is: what form would that enhancement take, and would it be 
sufficient to overcome faults in the remaining components (the JDK, and the 
host+network layers)?

 

  was:
There is a weakness in the P2P/DirectChannel messaging architecture, in that it 
never gives up on a request (in a request-response scenario). As a result a bug 
(software fault) anywhere from the point where the requesting thread hands off 
the {{DistributionMessage}} e.g. to 
{{{}ClusterDistributionManager.putOutgoing(DistributionMessage){}}}, to the 
point where that request is ultimately fulfilled on a (one) receiver, can 
result in a hang (of some task on the send side, which is waiting for a 
response).

Well it's a little worse than that because any code in the return (response) 
path can also cause disruption of the (response) flow, thereby leaving the 
requesting task hanging.

If the code in the request path (primarily in P2P messaging) and the code in 
the response path (P2P messaging and TBD higher-level code) were perfect this 
might not be a problem. But there is a fair amount of code there and we have 
some evidence that it is currently not perfect, nor do we expect it to become 
perfect and stay that way. That being the case it seems prudent to institute 
response timeouts so that bugs of this sort (which disrupt request-response 
message flow) don't result in hangs.

It's TBD if we want to go a step further and institute retries. The latter 
would entail introducing duplicate-suppression (conflation) in P2P messaging. 
We might also add exponential backoff (open-loop) or back-pressure 
(closed-loop) to pre

[jira] [Updated] (GEODE-9764) Request-Response Messaging Should Time Out

2021-11-22 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-9764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham updated GEODE-9764:

Attachment: image-2021-11-22-11-52-23-586.png

> Request-Response Messaging Should Time Out
> --
>
> Key: GEODE-9764
> URL: https://issues.apache.org/jira/browse/GEODE-9764
> Project: Geode
>  Issue Type: Improvement
>  Components: messaging
>Reporter: Bill Burcham
>Assignee: Bill Burcham
>Priority: Major
> Attachments: image-2021-11-22-11-52-23-586.png
>
>
> There is a weakness in the P2P/DirectChannel messaging architecture, in that 
> it never gives up on a request (in a request-response scenario). As a result 
> a bug (software fault) anywhere from the point where the requesting thread 
> hands off the {{DistributionMessage}} e.g. to 
> {{{}ClusterDistributionManager.putOutgoing(DistributionMessage){}}}, to the 
> point where that request is ultimately fulfilled on a (one) receiver, can 
> result in a hang (of some task on the send side, which is waiting for a 
> response).
> Well it's a little worse than that because any code in the return (response) 
> path can also cause disruption of the (response) flow, thereby leaving the 
> requesting task hanging.
> If the code in the request path (primarily in P2P messaging) and the code in 
> the response path (P2P messaging and TBD higher-level code) were perfect this 
> might not be a problem. But there is a fair amount of code there and we have 
> some evidence that it is currently not perfect, nor do we expect it to become 
> perfect and stay that way. That being the case it seems prudent to institute 
> response timeouts so that bugs of this sort (which disrupt request-response 
> message flow) don't result in hangs.
> It's TBD if we want to go a step further and institute retries. The latter 
> would entail introducing duplicate-suppression (conflation) in P2P messaging. 
> We might also add exponential backoff (open-loop) or back-pressure 
> (closed-loop) to prevent a flood of retries when the system is at or near the 
> point of thrashing.
> But even without retries, a configurable timeout might have good ROI as a 
> first step. This would entail:
>  * adding a configuration parameter to specify the timeout value
>  * changing ReplyProcessor21 and others TBD to "give up" after the timeout 
> has elapsed
>  * changing higher-level code dependent on request-reply messaging so it 
> properly handles the situations where we might have to "give up"
> This issue affects all versions of Geode.
> h2. Counterpoint
> Not everybody thinks timeouts are a good idea. Here are some alternative ideas:
>  
> Make the request-response primitive better: make it so that only bugs in our 
> core messaging framework could cause a lack of response, rather than our 
> current approach where a bug in a class like “RemotePutMessage” could cause a 
> lack of a response.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (GEODE-9764) Request-Response Messaging Should Time Out

2021-11-22 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-9764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham updated GEODE-9764:

Description: 
There is a weakness in the P2P/DirectChannel messaging architecture, in that it 
never gives up on a request (in a request-response scenario). As a result a bug 
(software fault) anywhere from the point where the requesting thread hands off 
the {{DistributionMessage}} e.g. to 
{{{}ClusterDistributionManager.putOutgoing(DistributionMessage){}}}, to the 
point where that request is ultimately fulfilled on a (one) receiver, can 
result in a hang (of some task on the send side, which is waiting for a 
response).

Well it's a little worse than that because any code in the return (response) 
path can also cause disruption of the (response) flow, thereby leaving the 
requesting task hanging.

If the code in the request path (primarily in P2P messaging) and the code in 
the response path (P2P messaging and TBD higher-level code) were perfect this 
might not be a problem. But there is a fair amount of code there and we have 
some evidence that it is currently not perfect, nor do we expect it to become 
perfect and stay that way. That being the case it seems prudent to institute 
response timeouts so that bugs of this sort (which disrupt request-response 
message flow) don't result in hangs.

It's TBD if we want to go a step further and institute retries. The latter 
would entail introducing duplicate-suppression (conflation) in P2P messaging. 
We might also add exponential backoff (open-loop) or back-pressure 
(closed-loop) to prevent a flood of retries when the system is at or near the 
point of thrashing.

But even without retries, a configurable timeout might have good ROI as a first 
step. This would entail:
 * adding a configuration parameter to specify the timeout value
 * changing ReplyProcessor21 and others TBD to "give up" after the timeout has 
elapsed
 * changing higher-level code dependent on request-reply messaging so it 
properly handles the situations where we might have to "give up"

This issue affects all versions of Geode.
h2. Counterpoint

Not everybody thinks timeouts are a good idea. Here are some alternative ideas:

Make the request-response primitive better: make it so that only bugs in our core 
messaging framework could cause a lack of response, rather than our current 
approach where a bug in a class like “RemotePutMessage” could cause a lack of a 
response.

  was:
There is a weakness in the P2P/DirectChannel messaging architecture, in that it 
never gives up on a request (in a request-response scenario). As a result a bug 
(software fault) anywhere from the point where the requesting thread hands off 
the {{DistributionMessage}} e.g. to 
{{ClusterDistributionManager.putOutgoing(DistributionMessage)}}, to the point 
where that request is ultimately fulfilled on a (one) receiver, can result in a 
hang (of some task on the send side, which is waiting for a response).

Well it's a little worse than that because any code in the return (response) 
path can also cause disruption of the (response) flow, thereby leaving the 
requesting task hanging.

If the code in the request path (primarily in P2P messaging) and the code in 
the response path (P2P messaging and TBD higher-level code) were perfect this 
might not be a problem. But there is a fair amount of code there and we have 
some evidence that it is currently not perfect, nor do we expect it to become 
perfect and stay that way. That being the case it seems prudent to institute 
response timeouts so that bugs of this sort (which disrupt request-response 
message flow) don't result in hangs.

It's TBD if we want to go a step further and institute retries. The latter 
would entail introducing duplicate-suppression (conflation) in P2P messaging. 
We might also add exponential backoff (open-loop) or back-pressure 
(closed-loop) to prevent a flood of retries when the system is at or near the 
point of thrashing.

But even without retries, a configurable timeout might have good ROI as a first 
step. This would entail:

* adding a configuration parameter to specify the timeout value
* changing ReplyProcessor21 and others TBD to "give up" after the timeout has 
elapsed
* changing higher-level code dependent on request-reply messaging so it 
properly handles the situations where we might have to "give up"

This issue affects all versions of Geode.


> Request-Response Messaging Should Time Out
> --
>
> Key: GEODE-9764
> URL: https://issues.apache.org/jira/browse/GEODE-9764
> Project: Geode
>  Issue Type: Improvement
>  Components: messaging
>Reporter: Bill Burcham
>Assignee: Bill Burcham
>Priority: Major
>
> There is a weakness in the P2P/DirectChannel messaging architecture, in that 
> it never gives up on a r

[jira] [Assigned] (GEODE-9825) Disparate socket-buffer-size Results in "IOException: Unknown header byte" and Hangs

2021-11-22 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-9825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham reassigned GEODE-9825:
---

Assignee: Bill Burcham

> Disparate socket-buffer-size Results in "IOException: Unknown header byte" 
> and Hangs
> 
>
> Key: GEODE-9825
> URL: https://issues.apache.org/jira/browse/GEODE-9825
> Project: Geode
>  Issue Type: Bug
>  Components: messaging
>Affects Versions: 1.12.4, 1.15.0
>Reporter: Bill Burcham
>Assignee: Bill Burcham
>Priority: Major
>  Labels: pull-request-available
>
> GEODE-9141 introduced a bug that causes {{IOException: "Unknown header 
> byte..."}} and hangs if members are configured with different 
> {{socket-buffer-size}} settings.
> h2. Reproduction
> To reproduce this bug turn off TLS and set socket-buffer-size on sender to be 
> 64KB and set socket-buffer-size on receiver to be 32KB. See associated PR for 
> an example.
> h2. Analysis
> In {{{}Connection.processInputBuffer(){}}}, when that method has read all the 
> messages it can from the current input buffer, it considers whether the 
> buffer needs expansion. If it does, this line is executed and the method 
> returns:
> {code:java}
> inputBuffer = inputSharing.expandReadBufferIfNeeded(allocSize); {code}
> The caller then expects to be able to _write_ bytes into {{{}inputBuffer{}}}.
> The problem, it seems, is that 
> {{ByteBufferSharingInternalImpl.expandReadBufferIfNeeded()}} does not leave 
> the {{ByteBuffer}} in the proper state. It leaves the buffer ready to be 
> _read_, not written.
> Before the changes for GEODE-9141 were introduced, the line of code 
> referenced above used to be this snippet in 
> {{Connection.compactOrResizeBuffer(int messageLength)}} (that method has 
> since been removed):
> {code:java}
>      // need a bigger buffer
>     logger.info("Allocating larger network read buffer, new size is {} old 
> size was {}.",
>         allocSize, oldBufferSize);
>     ByteBuffer oldBuffer = inputBuffer;
>     inputBuffer = getBufferPool().acquireDirectReceiveBuffer(allocSize);    
> if (oldBuffer != null) {
>       int oldByteCount = oldBuffer.remaining();
>       inputBuffer.put(oldBuffer);
>       inputBuffer.position(oldByteCount);
>       getBufferPool().releaseReceiveBuffer(oldBuffer);
>     } {code}
> Notice how this method leaves {{inputBuffer}} ready to be _written_ to.
> But the code inside 
> {{ByteBufferSharingInternalImpl.expandReadBufferIfNeeded()}} is doing 
> something like:
> {code:java}
> newBuffer.clear();
> newBuffer.put(existing);
> newBuffer.flip();
> releaseBuffer(type, existing);
> return newBuffer; {code}
> A solution (shown in the associated PR) is to add logic after the call to 
> {{expandReadBufferIfNeeded(allocSize)}} to leave the buffer in a _writeable_ 
> state:
> {code:java}
> inputBuffer = inputSharing.expandReadBufferIfNeeded(allocSize);
> // we're returning to the caller (done == true) so make buffer writeable
> inputBuffer.position(inputBuffer.limit());
> inputBuffer.limit(inputBuffer.capacity()); {code}
> h2. Resolution
> When this ticket is complete the bug will be fixed and 
> {{P2PMessagingConcurrencyDUnitTest}} will be enhanced to test at least these 
> combinations:
> [security, sender/locator socket-buffer-size, receiver socket-buffer-size]
> [TLS, (default), (default)]  this is what the test currently does
> [no TLS, 64 * 1024, 32 * 1024] *new: this illustrates this bug*
> [no TLS, (default), (default)] *new*
> We might want to mix in conserve-sockets true/false in there too while we're 
> at it (the test currently holds it at true).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (GEODE-9825) Disparate socket-buffer-size Results in "IOException: Unknown header byte" and Hangs

2021-11-19 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-9825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham updated GEODE-9825:

Attachment: (was: GEODE-9825-demo.patch)

> Disparate socket-buffer-size Results in "IOException: Unknown header byte" 
> and Hangs
> 
>
> Key: GEODE-9825
> URL: https://issues.apache.org/jira/browse/GEODE-9825
> Project: Geode
>  Issue Type: Bug
>  Components: messaging
>Affects Versions: 1.12.4, 1.15.0
>Reporter: Bill Burcham
>Priority: Major
>  Labels: pull-request-available
>
> GEODE-9141 introduced a bug that causes {{IOException: "Unknown header 
> byte..."}} and hangs if members are configured with different 
> {{socket-buffer-size}} settings.
> h2. Reproduction
> To reproduce this bug turn off TLS and set socket-buffer-size on sender to be 
> 64KB and set socket-buffer-size on receiver to be 32KB. See associated PR for 
> an example.
> h2. Analysis
> In {{{}Connection.processInputBuffer(){}}}, when that method has read all the 
> messages it can from the current input buffer, it considers whether the 
> buffer needs expansion. If it does, this line is executed and the method 
> returns:
> {code:java}
> inputBuffer = inputSharing.expandReadBufferIfNeeded(allocSize); {code}
> The caller then expects to be able to _write_ bytes into {{{}inputBuffer{}}}.
> The problem, it seems, is that 
> {{ByteBufferSharingInternalImpl.expandReadBufferIfNeeded()}} does not leave 
> the {{ByteBuffer}} in the proper state. It leaves the buffer ready to be 
> _read_, not written.
> Before the changes for GEODE-9141 were introduced, the line of code 
> referenced above used to be this snippet in 
> {{Connection.compactOrResizeBuffer(int messageLength)}} (that method has 
> since been removed):
> {code:java}
>      // need a bigger buffer
>     logger.info("Allocating larger network read buffer, new size is {} old 
> size was {}.",
>         allocSize, oldBufferSize);
>     ByteBuffer oldBuffer = inputBuffer;
>     inputBuffer = getBufferPool().acquireDirectReceiveBuffer(allocSize);    
> if (oldBuffer != null) {
>       int oldByteCount = oldBuffer.remaining();
>       inputBuffer.put(oldBuffer);
>       inputBuffer.position(oldByteCount);
>       getBufferPool().releaseReceiveBuffer(oldBuffer);
>     } {code}
> Notice how this method leaves {{inputBuffer}} ready to be _written_ to.
> But the code inside 
> {{ByteBufferSharingInternalImpl.expandReadBufferIfNeeded()}} is doing 
> something like:
> {code:java}
> newBuffer.clear();
> newBuffer.put(existing);
> newBuffer.flip();
> releaseBuffer(type, existing);
> return newBuffer; {code}
> A solution (shown in the associated PR) is to add logic after the call to 
> {{expandReadBufferIfNeeded(allocSize)}} to leave the buffer in a _writeable_ 
> state:
> {code:java}
> inputBuffer = inputSharing.expandReadBufferIfNeeded(allocSize);
> // we're returning to the caller (done == true) so make buffer writeable
> inputBuffer.position(inputBuffer.limit());
> inputBuffer.limit(inputBuffer.capacity()); {code}
> h2. Resolution
> When this ticket is complete the bug will be fixed and 
> {{P2PMessagingConcurrencyDUnitTest}} will be enhanced to test at least these 
> combinations:
> [security, sender/locator socket-buffer-size, receiver socket-buffer-size]
> [TLS, (default), (default)]  this is what the test currently does
> [no TLS, 64 * 1024, 32 * 1024] *new: this illustrates this bug*
> [no TLS, (default), (default)] *new*
> We might want to mix in conserve-sockets true/false in there too while we're 
> at it (the test currently holds it at true).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (GEODE-9825) Disparate socket-buffer-size Results in "IOException: Unknown header byte" and Hangs

2021-11-19 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-9825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham updated GEODE-9825:

Description: 
GEODE-9141 introduced a bug that causes {{IOException: "Unknown header 
byte..."}} and hangs if members are configured with different 
{{socket-buffer-size}} settings.
h2. Reproduction

To reproduce this bug turn off TLS and set socket-buffer-size on sender to be 
64KB and set socket-buffer-size on receiver to be 32KB. See associated PR for 
an example.
h2. Analysis

In {{{}Connection.processInputBuffer(){}}}, when that method has read all the 
messages it can from the current input buffer, it considers whether the 
buffer needs expansion. If it does, this line is executed and the method 
returns:
{code:java}
inputBuffer = inputSharing.expandReadBufferIfNeeded(allocSize); {code}
The caller then expects to be able to _write_ bytes into {{{}inputBuffer{}}}.

The problem, it seems, is that 
{{ByteBufferSharingInternalImpl.expandReadBufferIfNeeded()}} does not leave the 
{{ByteBuffer}} in the proper state. It leaves the buffer ready to be _read_, 
not written.

Before the changes for GEODE-9141 were introduced, the line of code referenced 
above used to be this snippet in {{Connection.compactOrResizeBuffer(int 
messageLength)}} (that method has since been removed):
{code:java}
     // need a bigger buffer
    logger.info("Allocating larger network read buffer, new size is {} old size 
was {}.",
        allocSize, oldBufferSize);
    ByteBuffer oldBuffer = inputBuffer;
    inputBuffer = getBufferPool().acquireDirectReceiveBuffer(allocSize);    
if (oldBuffer != null) {
      int oldByteCount = oldBuffer.remaining();
      inputBuffer.put(oldBuffer);
      inputBuffer.position(oldByteCount);
      getBufferPool().releaseReceiveBuffer(oldBuffer);
    } {code}
Notice how this method leaves {{inputBuffer}} ready to be _written_ to.

But the code inside 
{{ByteBufferSharingInternalImpl.expandReadBufferIfNeeded()}} is doing something 
like:
{code:java}
newBuffer.clear();
newBuffer.put(existing);
newBuffer.flip();
releaseBuffer(type, existing);
return newBuffer; {code}
A solution (shown in the associated PR) is to add logic after the call to 
{{expandReadBufferIfNeeded(allocSize)}} to leave the buffer in a _writeable_ 
state:
{code:java}
inputBuffer = inputSharing.expandReadBufferIfNeeded(allocSize);
// we're returning to the caller (done == true) so make buffer writeable
inputBuffer.position(inputBuffer.limit());
inputBuffer.limit(inputBuffer.capacity()); {code}
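
For reference, a small JDK-only demonstration (standalone, not Geode code) of 
the read-mode vs. write-mode distinction and of the position/limit restoration 
used above:
{code:java}
import java.nio.ByteBuffer;

public class BufferModeDemo {
  public static void main(String[] args) {
    ByteBuffer buf = ByteBuffer.allocate(8);
    buf.put(new byte[] {1, 2, 3});   // write mode: position=3, limit=8
    buf.flip();                      // read mode: position=0, limit=3
    // A buffer handed back in read mode cannot safely be appended to:
    // writes would start at position 0 and stop at limit 3.
    // Restore write mode without losing the already-buffered bytes:
    buf.position(buf.limit());       // position=3
    buf.limit(buf.capacity());       // limit=8
    buf.put((byte) 4);               // appends after the existing data
    System.out.println("position=" + buf.position() + ", limit=" + buf.limit());
  }
}
{code}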
h2. Resolution

When this ticket is complete the bug will be fixed and 
{{P2PMessagingConcurrencyDUnitTest}} will be enhanced to test at least these 
combinations:

[security, sender/locator socket-buffer-size, receiver socket-buffer-size]

[TLS, (default), (default)]  this is what the test currently does
[no TLS, 64 * 1024, 32 * 1024] *new: this illustrates this bug*
[no TLS, (default), (default)] *new*

We might want to mix in conserve-sockets true/false in there too while we're at 
it (the test currently holds it at true).
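
One possible shape for that matrix (a sketch only, assuming JUnit 4's 
{{Parameterized}} runner; the real enhancement belongs in 
{{P2PMessagingConcurrencyDUnitTest}}):
{code:java}
import java.util.Arrays;
import java.util.List;

import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.Parameterized;
import org.junit.runners.Parameterized.Parameter;
import org.junit.runners.Parameterized.Parameters;

@RunWith(Parameterized.class)
public class BufferSizeMatrixSketch {

  @Parameters(name = "tls={0}, senderBufferBytes={1}, receiverBufferBytes={2}")
  public static List<Object[]> combinations() {
    // 0 stands for "leave socket-buffer-size at its default"
    return Arrays.asList(new Object[][] {
        {true, 0, 0},                  // TLS, default, default (what the test does today)
        {false, 64 * 1024, 32 * 1024}, // no TLS, mismatched sizes (this bug)
        {false, 0, 0},                 // no TLS, default, default
    });
  }

  @Parameter(0) public boolean tls;
  @Parameter(1) public int senderBufferBytes;
  @Parameter(2) public int receiverBufferBytes;

  @Test
  public void messagingSurvivesBufferExpansion() {
    // The real test would configure members with these settings and run the
    // existing concurrent-messaging workload.
  }
}
{code}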

  was:
GEODE-9141 introduced a bug that causes {{IOException: "Unknown header 
byte..."}} and hangs if members are configured with different 
{{socket-buffer-size}} settings.
h2. Reproduction

To reproduce this bug, modify {{P2PMessagingConcurrencyDUnitTest}} so that 
sender and locator and receiver use different configuration parameters. Set 
{{socket-buffer-size}} to 212992 for the sender and 32 * 1024 for the receiver. 
Oh and just skip the call to {{{}securityProperties(){}}}—we want to induce the 
"Unknown header byte" exception—we don't want the TLS framework throwing 
exceptions. See attached patch file GEODE-9825-demo.patch for an example.
h2. Analysis

In {{{}Connection.processInputBuffer(){}}}, when that method has read all the 
messages it can from the current input buffer, it considers whether the 
buffer needs expansion. If it does, this line is executed and the method 
returns:
{code:java}
inputBuffer = inputSharing.expandReadBufferIfNeeded(allocSize); {code}
The caller then expects to be able to _write_ bytes into {{{}inputBuffer{}}}.

The problem, it seems, is that 
{{ByteBufferSharingInternalImpl.expandReadBufferIfNeeded()}} does not leave the 
{{ByteBuffer}} in the proper state. It leaves the buffer ready to be _read_, 
not written.

Before the changes for GEODE-9141 were introduced, the line of code referenced 
above used to be this snippet in {{Connection.compactOrResizeBuffer(int 
messageLength)}} (that method has since been removed):
{code:java}
     // need a bigger buffer
    logger.info("Allocating larger network read buffer, new size is {} old size 
was {}.",
        allocSize, oldBufferSize);
    ByteBuffer oldBuffer = inputBuffer;
    inputBuffer = getBufferPool().acquireDirectReceiveBuffer(allocSize);    
if (oldBuffer != null) {
      int oldByteCount = oldBuffer.remaining();
     

[jira] [Updated] (GEODE-9825) Disparate socket-buffer-size Results in "IOException: Unknown header byte" and Hangs

2021-11-19 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-9825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham updated GEODE-9825:

Description: 
GEODE-9141 introduced a bug that causes {{IOException: "Unknown header 
byte..."}} and hangs if members are configured with different 
{{socket-buffer-size}} settings.
h2. Reproduction

To reproduce this bug, modify {{P2PMessagingConcurrencyDUnitTest}} so that 
sender and locator and receiver use different configuration parameters. Set 
{{socket-buffer-size}} to 212992 for the sender and 32 * 1024 for the receiver. 
Oh and just skip the call to {{{}securityProperties(){}}}—we want to induce the 
"Unknown header byte" exception—we don't want the TLS framework throwing 
exceptions. See attached patch file GEODE-9825-demo.patch for an example.
h2. Analysis

In {{{}Connection.processInputBuffer(){}}}, when that method has read all the 
messages it can from the current input buffer, it considers whether the 
buffer needs expansion. If it does, this line is executed and the method 
returns:
{code:java}
inputBuffer = inputSharing.expandReadBufferIfNeeded(allocSize); {code}
The caller then expects to be able to _write_ bytes into {{{}inputBuffer{}}}.

The problem, it seems, is that 
{{ByteBufferSharingInternalImpl.expandReadBufferIfNeeded()}} does not leave the 
{{ByteBuffer}} in the proper state. It leaves the buffer ready to be _read_, 
not written.

Before the changes for GEODE-9141 were introduced, the line of code referenced 
above used to be this snippet in {{Connection.compactOrResizeBuffer(int 
messageLength)}} (that method has since been removed):
{code:java}
     // need a bigger buffer
    logger.info("Allocating larger network read buffer, new size is {} old size 
was {}.",
        allocSize, oldBufferSize);
    ByteBuffer oldBuffer = inputBuffer;
    inputBuffer = getBufferPool().acquireDirectReceiveBuffer(allocSize);    
if (oldBuffer != null) {
      int oldByteCount = oldBuffer.remaining();
      inputBuffer.put(oldBuffer);
      inputBuffer.position(oldByteCount);
      getBufferPool().releaseReceiveBuffer(oldBuffer);
    } {code}
Notice how this method leaves {{inputBuffer}} ready to be _written_ to.

But the code inside 
{{ByteBufferSharingInternalImpl.expandReadBufferIfNeeded()}} is doing something 
like:
{code:java}
newBuffer.clear();
newBuffer.put(existing);
newBuffer.flip();
releaseBuffer(type, existing);
return newBuffer; {code}
The solution (shown in the attached patch file GEODE-9825-demo.patch) is to 
add logic after the call to {{expandReadBufferIfNeeded(allocSize)}} to leave 
the buffer in a _writeable_ state:
{code:java}
inputBuffer = inputSharing.expandReadBufferIfNeeded(allocSize);
// we're returning to the caller (done == true) so make buffer writeable
inputBuffer.position(inputBuffer.limit());
inputBuffer.limit(inputBuffer.capacity()); {code}
h2. Resolution

When this ticket is complete the bug will be fixed and 
{{P2PMessagingConcurrencyDUnitTest}} will be enhanced to test these 
combinations:

[security, sender/locator socket-buffer-size, receiver socket-buffer-size]

[TLS, (default), (default)]  this is what the test currently does
[no TLS, 64 * 1024, 32 * 1024] *new: this illustrates this bug*
[no TLS, (default), (default)] *new*

We might want to mix in conserve-sockets true/false in there too while we're at 
it (the test currently holds it at true).

The attached patch file GEODE-9825-demo.patch shows a quick hack to 
{{P2PMessagingConcurrencyDUnitTest}} to illustrate the bug. The patch also 
includes a fix.

  was:
GEODE-9141 introduced a bug that causes {{IOException: "Unknown header 
byte..."}} and hangs if members are configured with different 
{{socket-buffer-size}} settings.
h2. Reproduction

To reproduce this bug, modify {{P2PMessagingConcurrencyDUnitTest}} so that 
sender and locator and receiver use different configuration parameters. Set 
{{socket-buffer-size}} to 212992 for the sender and 32 * 1024 for the receiver. 
Oh and just skip the call to {{{}securityProperties(){}}}—we want to induce the 
"Unknown header byte" exception—we don't want the TLS framework throwing 
exceptions. See attached patch file GEODE-9825-demo.patch for an example.
h2. Analysis

In {{{}Connection.processInputBuffer(){}}}, when that method has read all the 
messages it can from the current input buffer, it considers whether the 
buffer needs expansion. If it does, this line is executed and the method 
returns:
{code:java}
inputBuffer = inputSharing.expandReadBufferIfNeeded(allocSize); {code}
The caller then expects to be able to _write_ bytes into {{{}inputBuffer{}}}.

The problem, it seems, is that 
{{ByteBufferSharingInternalImpl.expandReadBufferIfNeeded()}} does not leave the 
{{ByteBuffer}} in the proper state. It leaves the buffer ready to be _read_, 
not written.

Before the changes for GEODE-9141 were introduced, the line of code referenced 
above used to be t

[jira] [Updated] (GEODE-9825) Disparate socket-buffer-size Results in "IOException: Unknown header byte" and Hangs

2021-11-19 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-9825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham updated GEODE-9825:

Attachment: GEODE-9825-demo.patch

> Disparate socket-buffer-size Results in "IOException: Unknown header byte" 
> and Hangs
> 
>
> Key: GEODE-9825
> URL: https://issues.apache.org/jira/browse/GEODE-9825
> Project: Geode
>  Issue Type: Bug
>  Components: messaging
>Affects Versions: 1.12.4, 1.15.0
>Reporter: Bill Burcham
>Priority: Major
> Attachments: GEODE-9825-demo.patch
>
>
> GEODE-9141 introduced a bug that causes {{IOException: "Unknown header 
> byte..."}} and hangs if members are configured with different 
> {{socket-buffer-size}} settings.
> h2. Reproduction
> To reproduce this bug, modify {{P2PMessagingConcurrencyDUnitTest}} so that 
> sender and locator and receiver use different configuration parameters. Set 
> {{socket-buffer-size}} to 212992 for the sender and 32 * 1024 for the 
> receiver. Oh and just skip the call to {{{}securityProperties(){}}}—we want 
> to induce the "Unknown header byte" exception—we don't want the TLS framework 
> throwing exceptions. See attached patch file GEODE-9825-demo.patch for an 
> example.
> h2. Analysis
> In {{{}Connection.processInputBuffer(){}}}, when that method has read all the 
> messages it can from the current input buffer, it considers whether the 
> buffer needs expansion. If it does, this line is executed and the method 
> returns:
> {code:java}
> inputBuffer = inputSharing.expandReadBufferIfNeeded(allocSize); {code}
> The caller then expects to be able to _write_ bytes into {{{}inputBuffer{}}}.
> The problem, it seems, is that 
> {{ByteBufferSharingInternalImpl.expandReadBufferIfNeeded()}} does not leave 
> the {{ByteBuffer}} in the proper state. It leaves the buffer ready to be 
> _read_, not written.
> Before the changes for GEODE-9141 were introduced, the line of code 
> referenced above used to be this snippet in 
> {{Connection.compactOrResizeBuffer(int messageLength)}} (that method has 
> since been removed):
> {code:java}
>      // need a bigger buffer
>     logger.info("Allocating larger network read buffer, new size is {} old 
> size was {}.",
>         allocSize, oldBufferSize);
>     ByteBuffer oldBuffer = inputBuffer;
>     inputBuffer = getBufferPool().acquireDirectReceiveBuffer(allocSize);    
> if (oldBuffer != null) {
>       int oldByteCount = oldBuffer.remaining();
>       inputBuffer.put(oldBuffer);
>       inputBuffer.position(oldByteCount);
>       getBufferPool().releaseReceiveBuffer(oldBuffer);
>     } {code}
> Notice how this method leaves {{inputBuffer}} ready to be _written_ to.
> But the code inside 
> {{ByteBufferSharingInternalImpl.expandReadBufferIfNeeded()}} is doing 
> something like:
> {code:java}
> newBuffer.clear();
> newBuffer.put(existing);
> newBuffer.flip();
> releaseBuffer(type, existing);
> return newBuffer; {code}
> It's not clear to me, exactly, what the difference is between the old and new 
> code. It's not sufficient to simply call {{flip()}} on the inputBuffer before 
> returning it (I tried it and it didn't fix the bug). More work is needed.
> h2. Resolution
> When this ticket is complete the bug will be fixed and 
> {{P2PMessagingConcurrencyDUnitTest}} will be enhanced to test these 
> combinations:
> [security, sender/locator socket-buffer-size, receiver socket-buffer-size]
> [TLS, (default), (default)]  this is what the test currently does
> [no TLS, 64 * 1024, 32 * 1024] *new: this illustrates this bug*
> [no TLS, (default), (default)] *new*
> We might want to mix in conserve-sockets true/false in there too while we're 
> at it (the test currently holds it at true).
> The attached patch file GEODE-9825-demo.patch shows a quick hack to 
> {{P2PMessagingConcurrencyDUnitTest}} to illustrate the bug.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (GEODE-9825) Disparate socket-buffer-size Results in "IOException: Unknown header byte" and Hangs

2021-11-19 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-9825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham updated GEODE-9825:

Attachment: (was: GEODE-9825-demo.patch)

> Disparate socket-buffer-size Results in "IOException: Unknown header byte" 
> and Hangs
> 
>
> Key: GEODE-9825
> URL: https://issues.apache.org/jira/browse/GEODE-9825
> Project: Geode
>  Issue Type: Bug
>  Components: messaging
>Affects Versions: 1.12.4, 1.15.0
>Reporter: Bill Burcham
>Priority: Major
> Attachments: GEODE-9825-demo.patch
>
>
> GEODE-9141 introduced a bug that causes {{IOException: "Unknown header 
> byte..."}} and hangs if members are configured with different 
> {{socket-buffer-size}} settings.
> h2. Reproduction
> To reproduce this bug, modify {{P2PMessagingConcurrencyDUnitTest}} so that 
> sender and locator and receiver use different configuration parameters. Set 
> {{socket-buffer-size}} to 212992 for the sender and 32 * 1024 for the 
> receiver. Oh and just skip the call to {{{}securityProperties(){}}}—we want 
> to induce the "Unknown header byte" exception—we don't want the TLS framework 
> throwing exceptions. See attached patch file GEODE-9825-demo.patch for an 
> example.
> h2. Analysis
> In {{{}Connection.processInputBuffer(){}}}, when that method has read all the 
> messages it can from the current input buffer, it considers whether the 
> buffer needs expansion. If it does, this line is executed and the method 
> returns:
> {code:java}
> inputBuffer = inputSharing.expandReadBufferIfNeeded(allocSize); {code}
> The caller then expects to be able to _write_ bytes into {{{}inputBuffer{}}}.
> The problem, it seems, is that 
> {{ByteBufferSharingInternalImpl.expandReadBufferIfNeeded()}} does not leave 
> the {{ByteBuffer}} in the proper state. It leaves the buffer ready to be 
> _read_, not written.
> Before the changes for GEODE-9141 were introduced, the line of code 
> referenced above used to be this snippet in 
> {{Connection.compactOrResizeBuffer(int messageLength)}} (that method has 
> since been removed):
> {code:java}
>      // need a bigger buffer
>     logger.info("Allocating larger network read buffer, new size is {} old 
> size was {}.",
>         allocSize, oldBufferSize);
>     ByteBuffer oldBuffer = inputBuffer;
>     inputBuffer = getBufferPool().acquireDirectReceiveBuffer(allocSize);    
> if (oldBuffer != null) {
>       int oldByteCount = oldBuffer.remaining();
>       inputBuffer.put(oldBuffer);
>       inputBuffer.position(oldByteCount);
>       getBufferPool().releaseReceiveBuffer(oldBuffer);
>     } {code}
> Notice how this method leaves {{inputBuffer}} ready to be _written_ to.
> But the code inside 
> {{ByteBufferSharingInternalImpl.expandReadBufferIfNeeded()}} is doing 
> something like:
> {code:java}
> newBuffer.clear();
> newBuffer.put(existing);
> newBuffer.flip();
> releaseBuffer(type, existing);
> return newBuffer; {code}
> It's not clear to me, exactly, what the difference is between the old and new 
> code. It's not sufficient to simply call {{flip()}} on the inputBuffer before 
> returning it (I tried it and it didn't fix the bug). More work is needed.
> h2. Resolution
> When this ticket is complete the bug will be fixed and 
> {{P2PMessagingConcurrencyDUnitTest}} will be enhanced to test these 
> combinations:
> [security, sender/locator socket-buffer-size, receiver socket-buffer-size]
> [TLS, (default), (default)]  this is what the test currently does
> [no TLS, 64 * 1024, 32 * 1024] *new: this illustrates this bug*
> [no TLS, (default), (default)] *new*
> We might want to mix in conserve-sockets true/false in there too while we're 
> at it (the test currently holds it at true).
> The attached patch file GEODE-9825-demo.patch shows a quick hack to 
> {{P2PMessagingConcurrencyDUnitTest}} to illustrate the bug.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (GEODE-9825) Disparate socket-buffer-size Results in "IOException: Unknown header byte" and Hangs

2021-11-18 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-9825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham updated GEODE-9825:

Description: 
GEODE-9141 introduced a bug that causes {{IOException: "Unknown header 
byte..."}} and hangs if members are configured with different 
{{socket-buffer-size}} settings.
h2. Reproduction

To reproduce this bug, modify {{P2PMessagingConcurrencyDUnitTest}} so that 
sender and locator and receiver use different configuration parameters. Set 
{{socket-buffer-size}} to 212992 for the sender and 32 * 1024 for the receiver. 
Oh and just skip the call to {{{}securityProperties(){}}}—we want to induce the 
"Unknown header byte" exception—we don't want the TLS framework throwing 
exceptions. See attached patch file GEODE-9825-demo.patch for an example.
h2. Analysis

In {{{}Connection.processInputBuffer(){}}}, when that method has read all the 
messages it can from the current input buffer, it considers whether the 
buffer needs expansion. If it does, this line is executed and the method 
returns:
{code:java}
inputBuffer = inputSharing.expandReadBufferIfNeeded(allocSize); {code}
The caller then expects to be able to _write_ bytes into {{{}inputBuffer{}}}.

The problem, it seems, is that 
{{ByteBufferSharingInternalImpl.expandReadBufferIfNeeded()}} does not leave the 
{{ByteBuffer}} in the proper state. It leaves the buffer ready to be _read_, 
not written.

Before the changes for GEODE-9141 were introduced, the line of code referenced 
above used to be this snippet in {{Connection.compactOrResizeBuffer(int 
messageLength)}} (that method has since been removed):
{code:java}
     // need a bigger buffer
    logger.info("Allocating larger network read buffer, new size is {} old size 
was {}.",
        allocSize, oldBufferSize);
    ByteBuffer oldBuffer = inputBuffer;
    inputBuffer = getBufferPool().acquireDirectReceiveBuffer(allocSize);    
if (oldBuffer != null) {
      int oldByteCount = oldBuffer.remaining();
      inputBuffer.put(oldBuffer);
      inputBuffer.position(oldByteCount);
      getBufferPool().releaseReceiveBuffer(oldBuffer);
    } {code}
Notice how this method leaves {{inputBuffer}} ready to be _written_ to.

But the code inside 
{{ByteBufferSharingInternalImpl.expandReadBufferIfNeeded()}} is doing something 
like:
{code:java}
newBuffer.clear();
newBuffer.put(existing);
newBuffer.flip();
releaseBuffer(type, existing);
return newBuffer; {code}
It's not clear to me, exactly, what the difference is between the old and new 
code. It's not sufficient to simply call {{flip()}} on the inputBuffer before 
returning it (I tried it and it didn't fix the bug). More work is needed.
h2. Resolution

When this ticket is complete the bug will be fixed and 
{{P2PMessagingConcurrencyDUnitTest}} will be enhanced to test these 
combinations:

[security, sender/locator socket-buffer-size, receiver socket-buffer-size]

[TLS, (default), (default)]  this is what the test currently does
[no TLS, 64 * 1024, 32 * 1024] *new: this illustrates this bug*
[no TLS, (default), (default)] *new*

We might want to mix in conserve-sockets true/false in there too while we're at 
it (the test currently holds it at true).

The attached patch file GEODE-9825-demo.patch shows a quick hack to 
{{P2PMessagingConcurrencyDUnitTest}} to illustrate the bug.
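
A hypothetical sketch of how that matrix might be expressed with plain JUnit 4 
parameterization (the real test uses Geode's DUnit machinery and its own 
configuration helpers, so treat this only as an outline of the combinations):
{code:java}
import java.util.Arrays;
import java.util.Collection;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.Parameterized;
import org.junit.runners.Parameterized.Parameters;

@RunWith(Parameterized.class)
public class BufferSizeMatrixSketch {

  // null means "leave socket-buffer-size at its default"
  private static final Integer DEFAULT = null;

  @Parameters(name = "tls={0}, senderBytes={1}, receiverBytes={2}")
  public static Collection<Object[]> combinations() {
    return Arrays.asList(new Object[][] {
        {true, DEFAULT, DEFAULT},       // TLS, defaults: what the test does today
        {false, 64 * 1024, 32 * 1024},  // no TLS, mismatched sizes: reproduces this bug
        {false, DEFAULT, DEFAULT},      // no TLS, defaults
    });
  }

  private final boolean tls;
  private final Integer senderBytes;
  private final Integer receiverBytes;

  public BufferSizeMatrixSketch(boolean tls, Integer senderBytes, Integer receiverBytes) {
    this.tls = tls;
    this.senderBytes = senderBytes;
    this.receiverBytes = receiverBytes;
  }

  @Test
  public void messagingSurvivesThisCombination() {
    // The real test would start a locator, sender, and receiver with these
    // settings and drive concurrent P2P messaging through them.
  }
} {code}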

  was:
GEODE-9141 introduced a bug that causes {{IOException: "Unknown header 
byte..."}} and hangs if members are configured with different 
{{socket-buffer-size}} settings.
h2. Reproduction

To reproduce this bug, modify {{P2PMessagingConcurrencyDUnitTest}} so that the 
sender, locator, and receiver use different configuration parameters. Set 
{{socket-buffer-size}} to 212992 for the sender and 32 * 1024 for the receiver. 
Also skip the call to {{securityProperties()}}: we want to induce the "Unknown 
header byte" exception, not have the TLS framework throw exceptions. See the 
attached patch file GEODE-9825-demo.patch for an example.
h2. Analysis

The trouble starts in {{Connection.processInputBuffer()}}. When that method has 
read all the messages it can from the current input buffer, it considers whether 
the buffer needs expansion. If it does, then:
{code:java}
inputBuffer = inputSharing.expandReadBufferIfNeeded(allocSize); {code}
is executed and the method returns. The caller then expects to be able to 
_write_ bytes into {{inputBuffer}}.

The problem, it seems, is that 
{{ByteBufferSharingInternalImpl.expandReadBufferIfNeeded()}} does not leave the 
{{ByteBuffer}} in the proper state. It leaves the buffer ready to be _read_, not 
written.

Before the changes for GEODE-9141 were introduced, the line of code referenced 
above used to be this snippet in {{Connection.compactOrResizeBuffer(int 
messageLength)}} (that method has since been removed):
{code:java}
     // need a bigger buffer
    logger.info("Allocating larger network read buffer, new size is {} old size 
was

[jira] [Updated] (GEODE-9825) Disparate socket-buffer-size Results in "IOException: Unknown header byte" and Hangs

2021-11-18 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-9825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham updated GEODE-9825:

Description: 
GEODE-9141 introduced a bug that causes {{IOException: "Unknown header 
byte..."}} and hangs if members are configured with different 
{{socket-buffer-size}} settings.
h2. Reproduction

To reproduce this bug, modify {{P2PMessagingConcurrencyDUnitTest}} so that the 
sender, locator, and receiver use different configuration parameters. Set 
{{socket-buffer-size}} to 212992 for the sender and 32 * 1024 for the receiver. 
Also skip the call to {{securityProperties()}}: we want to induce the "Unknown 
header byte" exception, not have the TLS framework throw exceptions. See the 
attached patch file GEODE-9825-demo.patch for an example.
h2. Analysis

The trouble starts in {{Connection.processInputBuffer()}}. When that method has 
read all the messages it can from the current input buffer, it considers whether 
the buffer needs expansion. If it does, then:
{code:java}
inputBuffer = inputSharing.expandReadBufferIfNeeded(allocSize); {code}
is executed and the method returns. The caller then expects to be able to 
_write_ bytes into {{inputBuffer}}.

The problem, it seems, is that 
{{ByteBufferSharingInternalImpl.expandReadBufferIfNeeded()}} does not leave the 
{{ByteBuffer}} in the proper state. It leaves the buffer ready to be _read_, not 
written.

Before the changes for GEODE-9141 were introduced, the line of code referenced 
above used to be this snippet in {{Connection.compactOrResizeBuffer(int 
messageLength)}} (that method has since been removed):
{code:java}
    // need a bigger buffer
    logger.info("Allocating larger network read buffer, new size is {} old size was {}.",
        allocSize, oldBufferSize);
    ByteBuffer oldBuffer = inputBuffer;
    inputBuffer = getBufferPool().acquireDirectReceiveBuffer(allocSize);
    if (oldBuffer != null) {
      int oldByteCount = oldBuffer.remaining();
      inputBuffer.put(oldBuffer);
      inputBuffer.position(oldByteCount);
      getBufferPool().releaseReceiveBuffer(oldBuffer);
    } {code}
Notice how this method leaves {{inputBuffer}} ready to be _written_ to.

But the code inside 
{{ByteBufferSharingInternalImpl.expandReadBufferIfNeeded()}} is doing something 
like:
{code:java}
newBuffer.clear();
newBuffer.put(existing);
newBuffer.flip();
releaseBuffer(type, existing);
return newBuffer; {code}
It's not clear to me exactly what the difference is between the old and new 
code. It's not sufficient simply to call {{flip()}} on the inputBuffer before 
returning it (I tried that and it didn't fix the bug). More work is needed.
h2. Resolution

When this ticket is complete the bug will be fixed and 
{{P2PMessagingConcurrencyDUnitTest}} will be enhanced to test these 
combinations:

[security, sender/locator socket-buffer-size, receiver socket-buffer-size]

[TLS, (default), (default)]  this is what the test currently does
[no TLS, 64 * 1024, 32 * 1024] *new: this illustrates this bug*
[no TLS, (default), (default)] *new*

We might want to mix in conserve-sockets true/false in there too while we're at 
it (the test currently holds it at true).

The attached patch file GEODE-9825-demo.patch shows a quick hack to 
{{P2PMessagingConcurrencyDUnitTest}} to illustrate the bug.

  was:
GEODE-9141 introduced a bug that causes {{IOException: "Unknown header 
byte..."}} and hangs if members are configured with different 
{{socket-buffer-size}} settings.
h2. Reproduction

To reproduce this bug, modify {{P2PMessagingConcurrencyDUnitTest}} so that the 
sender, locator, and receiver use different configuration parameters. Set 
{{socket-buffer-size}} to 212992 for the sender and 32 * 1024 for the receiver. 
Also skip the call to {{securityProperties()}}: we want to induce the "Unknown 
header byte" exception, not have the TLS framework throw exceptions. See the 
attached patch file GEODE-9825-demo.patch for an example.
h2. Analysis

The trouble starts in {{Connection.processInputBuffer()}}. When that method has 
read all the messages it can from the current input buffer, it considers whether 
the buffer needs expansion. If it does, then:
{code:java}
inputBuffer = inputSharing.expandReadBufferIfNeeded(allocSize); {code}
is executed and the method returns. The caller then expects to be able to 
_write_ bytes into {{inputBuffer}}.

The problem, it seems, is that 
{{ByteBufferSharingInternalImpl.expandReadBufferIfNeeded()}} does not leave the 
{{ByteBuffer}} in the proper state. It leaves the buffer ready to be _read_, not 
written.

Before the changes for GEODE-9141 were introduced, the line of code referenced 
above used to be this method in {{Connection}} (which has since been removed):
{code:java}
private void compactOrResizeBuffer(int messageLength) {
  final int oldBufferSize = inputBuffer.capacity();
  int allocSize = messageLength + MSG_HEADER_BYTES;
  i

[jira] [Updated] (GEODE-9825) Disparate socket-buffer-size Results in "IOException: Unknown header byte" and Hangs

2021-11-18 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-9825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham updated GEODE-9825:

Description: 
GEODE-9141 introduced a bug that causes {{IOException: "Unknown header 
byte..."}} and hangs if members are configured with different 
{{socket-buffer-size}} settings.
h2. Reproduction

To reproduce this bug, modify {{P2PMessagingConcurrencyDUnitTest}} so that the 
sender, locator, and receiver use different configuration parameters. Set 
{{socket-buffer-size}} to 212992 for the sender and 32 * 1024 for the receiver. 
Also skip the call to {{securityProperties()}}: we want to induce the "Unknown 
header byte" exception, not have the TLS framework throw exceptions. See the 
attached patch file GEODE-9825-demo.patch for an example.
h2. Analysis

The trouble starts in {{Connection.processInputBuffer()}}. When that method has 
read all the messages it can from the current input buffer, it considers whether 
the buffer needs expansion. If it does, then:
{code:java}
inputBuffer = inputSharing.expandReadBufferIfNeeded(allocSize); {code}
is executed and the method returns. The caller then expects to be able to 
_write_ bytes into {{inputBuffer}}.

The problem, it seems, is that 
{{ByteBufferSharingInternalImpl.expandReadBufferIfNeeded()}} does not leave the 
{{ByteBuffer}} in the proper state. It leaves the buffer ready to be _read_, not 
written.

Before the changes for GEODE-9141 were introduced, the line of code referenced 
above used to be this method in {{Connection}} (which has since been removed):
{code:java}
private void compactOrResizeBuffer(int messageLength) {
  final int oldBufferSize = inputBuffer.capacity();
  int allocSize = messageLength + MSG_HEADER_BYTES;
  if (oldBufferSize < allocSize) {
    // need a bigger buffer
    logger.info("Allocating larger network read buffer, new size is {} old size was {}.",
        allocSize, oldBufferSize);
    ByteBuffer oldBuffer = inputBuffer;
    inputBuffer = getBufferPool().acquireDirectReceiveBuffer(allocSize);

    if (oldBuffer != null) {
      int oldByteCount = oldBuffer.remaining();
      inputBuffer.put(oldBuffer);
      inputBuffer.position(oldByteCount);
      getBufferPool().releaseReceiveBuffer(oldBuffer);
    }
  } else {
    if (inputBuffer.position() != 0) {
      inputBuffer.compact();
    } else {
      inputBuffer.position(inputBuffer.limit());
      inputBuffer.limit(inputBuffer.capacity());
    }
  }
} {code}
Notice how this method leaves {{inputBuffer}} ready to be _written_ to.

It's not sufficient to simply call {{flip()}} on the inputBuffer before 
returning it (I tried it and it didn't fix the bug). More work is needed.
h2. Resolution

When this ticket is complete the bug will be fixed and 
{{P2PMessagingConcurrencyDUnitTest}} will be enhanced to test these 
combinations:

[security, sender/locator socket-buffer-size, receiver socket-buffer-size]

[TLS, (default), (default)]  this is what the test currently does
[no TLS, 64 * 1024, 32 * 1024] *new: this illustrates this bug*
[no TLS, (default), (default)] *new*

We might want to mix in conserve-sockets true/false in there too while we're at 
it (the test currently holds it at true).

The attached patch file GEODE-9825-demo.patch shows a quick hack to 
{{P2PMessagingConcurrencyDUnitTest}} to illustrate the bug.

  was:
GEODE-9141 introduced a bug that causes {{IOException: "Unknown header 
byte..."}} and hangs if members are configured with different 
{{socket-buffer-size}} settings.
h2. Reproduction

To reproduce this bug, modify {{P2PMessagingConcurrencyDUnitTest}} so that the 
sender, locator, and receiver use different configuration parameters. Set 
{{socket-buffer-size}} to 212992 for the sender and 32 * 1024 for the receiver. 
Also skip the call to {{securityProperties()}}: we want to induce the "Unknown 
header byte" exception, not have the TLS framework throw exceptions. See the 
attached patch file GEODE-9825-demo.patch for an example.
h2. Analysis

The trouble starts in {{Connection.processInputBuffer()}}. When that method has 
read all the messages it can from the current input buffer, it considers whether 
the buffer needs expansion. If it does, then:
{code:java}
inputBuffer = inputSharing.expandReadBufferIfNeeded(allocSize); {code}
is executed and the method returns. The caller then expects to be able to 
_write_ bytes into {{inputBuffer}}.

The problem, it seems, is that 
{{ByteBufferSharingInternalImpl.expandReadBufferIfNeeded()}} does not leave the 
{{ByteBuffer}} in the proper state. It leaves the buffer ready to be _read_, not 
written.

The line of code referenced above used to be this method in {{Connection}} 
(which has since been removed):
{code:java}
private void compactOrResizeBuffer(int messageLength) {
  final int oldBufferSize = inputBuffer.capacity();
  int allocSize = messageLength + MSG_HEADER_BYTES;
  if (oldBufferSize < allocSize)

[jira] [Updated] (GEODE-9825) Disparate socket-buffer-size Results in "IOException: Unknown header byte" and Hangs

2021-11-18 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-9825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham updated GEODE-9825:

Description: 
GEODE-9141 introduced a bug that causes {{IOException: "Unknown header 
byte..."}} and hangs if members are configured with different 
{{socket-buffer-size}} settings.
h2. Reproduction

To reproduce this bug, modify {{P2PMessagingConcurrencyDUnitTest}} so that the 
sender, locator, and receiver use different configuration parameters. Set 
{{socket-buffer-size}} to 212992 for the sender and 32 * 1024 for the receiver. 
Also skip the call to {{securityProperties()}}: we want to induce the "Unknown 
header byte" exception, not have the TLS framework throw exceptions. See the 
attached patch file GEODE-9825-demo.patch for an example.
h2. Analysis

The trouble starts in {{Connection.processInputBuffer()}}. When that method has 
read all the messages it can from the current input buffer, it considers whether 
the buffer needs expansion. If it does, then:
{code:java}
inputBuffer = inputSharing.expandReadBufferIfNeeded(allocSize); {code}
is executed and the method returns. The caller then expects to be able to 
_write_ bytes into {{inputBuffer}}.

The problem, it seems, is that 
{{ByteBufferSharingInternalImpl.expandReadBufferIfNeeded()}} does not leave the 
{{ByteBuffer}} in the proper state. It leaves the buffer ready to be _read_, not 
written.

The line of code referenced above used to be this method in {{Connection}} 
(which has since been removed):
{code:java}
private void compactOrResizeBuffer(int messageLength) {
  final int oldBufferSize = inputBuffer.capacity();
  int allocSize = messageLength + MSG_HEADER_BYTES;
  if (oldBufferSize < allocSize) {
    // need a bigger buffer
    logger.info("Allocating larger network read buffer, new size is {} old size was {}.",
        allocSize, oldBufferSize);
    ByteBuffer oldBuffer = inputBuffer;
    inputBuffer = getBufferPool().acquireDirectReceiveBuffer(allocSize);

    if (oldBuffer != null) {
      int oldByteCount = oldBuffer.remaining();
      inputBuffer.put(oldBuffer);
      inputBuffer.position(oldByteCount);
      getBufferPool().releaseReceiveBuffer(oldBuffer);
    }
  } else {
    if (inputBuffer.position() != 0) {
      inputBuffer.compact();
    } else {
      inputBuffer.position(inputBuffer.limit());
      inputBuffer.limit(inputBuffer.capacity());
    }
  }
} {code}
Notice how this method leaves {{inputBuffer}} ready to be _written_ to.

It's not sufficient to simply call {{flip()}} on the inputBuffer before 
returning it (I tried it and it didn't fix the bug). More work is needed.
h2. Resolution

When this ticket is complete the bug will be fixed and 
{{P2PMessagingConcurrencyDUnitTest}} will be enhanced to test these 
combinations:

[security, sender/locator socket-buffer-size, receiver socket-buffer-size]

[TLS, (default), (default)]  this is what the test currently does
[no TLS, 64 * 1024, 32 * 1024] *new: this illustrates this bug*
[no TLS, (default), (default)] *new*

We might want to mix in conserve-sockets true/false in there too while we're at 
it (the test currently holds it at true).

The attached patch file GEODE-9825-demo.patch shows a quick hack to 
{{P2PMessagingConcurrencyDUnitTest}} to illustrate the bug.

  was:
GEODE-9141 introduced a bug that causes {{IOException: "Unknown header 
byte..."}} and hangs if members are configured with different 
{{socket-buffer-size}} settings.
h2. Reproduction

To reproduce this bug, modify {{P2PMessagingConcurrencyDUnitTest}} so that the 
sender, locator, and receiver use different configuration parameters. Set 
{{socket-buffer-size}} to 212992 for the sender and 32 * 1024 for the receiver. 
Also skip the call to {{securityProperties()}}: we want to induce the "Unknown 
header byte" exception, not have the TLS framework throw exceptions.
h2. Analysis

The trouble starts in {{Connection.processInputBuffer()}}. When that method has 
read all the messages it can from the current input buffer, it considers whether 
the buffer needs expansion. If it does, then:
{code:java}
inputBuffer = inputSharing.expandReadBufferIfNeeded(allocSize); {code}
is executed and the method returns. The caller then expects to be able to 
_write_ bytes into {{inputBuffer}}.

The problem, it seems, is that 
{{ByteBufferSharingInternalImpl.expandReadBufferIfNeeded()}} does not leave the 
{{ByteBuffer}} in the proper state. It leaves the buffer ready to be _read_, not 
written.

The line of code referenced above used to be this method in {{Connection}} 
(which has since been removed):
{code:java}
private void compactOrResizeBuffer(int messageLength) {
  final int oldBufferSize = inputBuffer.capacity();
  int allocSize = messageLength + MSG_HEADER_BYTES;
  if (oldBufferSize < allocSize) {
// need a bigger buffer
logger.info("Allocating larger network read buffer, new size is {} old size 
w

[jira] [Updated] (GEODE-9825) Disparate socket-buffer-size Results in "IOException: Unknown header byte" and Hangs

2021-11-18 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-9825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham updated GEODE-9825:

Description: 
GEODE-9141 introduced a bug that causes {{IOException: "Unknown header 
byte..."}} and hangs if members are configured with different 
{{socket-buffer-size}} settings.
h2. Reproduction

To reproduce this bug, modify {{P2PMessagingConcurrencyDUnitTest}} so that the 
sender, locator, and receiver use different configuration parameters. Set 
{{socket-buffer-size}} to 212992 for the sender and 32 * 1024 for the receiver. 
Also skip the call to {{securityProperties()}}: we want to induce the "Unknown 
header byte" exception, not have the TLS framework throw exceptions.
h2. Analysis

The trouble starts in {{Connection.processInputBuffer()}}. When that method has 
read all the messages it can from the current input buffer, it considers whether 
the buffer needs expansion. If it does, then:
{code:java}
inputBuffer = inputSharing.expandReadBufferIfNeeded(allocSize); {code}
is executed and the method returns. The caller then expects to be able to 
_write_ bytes into {{inputBuffer}}.

The problem, it seems, is that 
{{ByteBufferSharingInternalImpl.expandReadBufferIfNeeded()}} does not leave the 
{{ByteBuffer}} in the proper state. It leaves the buffer ready to be _read_, not 
written.

The line of code referenced above used to be this method in {{Connection}} 
(which has since been removed):
{code:java}
private void compactOrResizeBuffer(int messageLength) {
  final int oldBufferSize = inputBuffer.capacity();
  int allocSize = messageLength + MSG_HEADER_BYTES;
  if (oldBufferSize < allocSize) {
    // need a bigger buffer
    logger.info("Allocating larger network read buffer, new size is {} old size was {}.",
        allocSize, oldBufferSize);
    ByteBuffer oldBuffer = inputBuffer;
    inputBuffer = getBufferPool().acquireDirectReceiveBuffer(allocSize);

    if (oldBuffer != null) {
      int oldByteCount = oldBuffer.remaining();
      inputBuffer.put(oldBuffer);
      inputBuffer.position(oldByteCount);
      getBufferPool().releaseReceiveBuffer(oldBuffer);
    }
  } else {
    if (inputBuffer.position() != 0) {
      inputBuffer.compact();
    } else {
      inputBuffer.position(inputBuffer.limit());
      inputBuffer.limit(inputBuffer.capacity());
    }
  }
} {code}
Notice how this method leaves {{inputBuffer}} ready to be _written_ to.

It's not sufficient to simply call {{flip()}} on the inputBuffer before 
returning it (I tried it and it didn't fix the bug). More work is needed.
h2. Resolution

When this ticket is complete the bug will be fixed and 
{{P2PMessagingConcurrencyDUnitTest}} will be enhanced to test these 
combinations:

[security, sender/locator socket-buffer-size, receiver socket-buffer-size]

[TLS, (default), (default)]  this is what the test currently does
[no TLS, 64 * 1024, 32 * 1024] *new: this illustrates this bug*
[no TLS, (default), (default)] *new*

We might want to mix in conserve-sockets true/false in there too while we're at 
it (the test currently holds it at true).

The attached patch file GEODE-9825-demo.patch shows a quick hack to 
{{P2PMessagingConcurrencyDUnitTest}} to illustrate the bug.

  was:
GEODE-9141 introduced a bug that causes {{IOException: "Unknown header 
byte..."}} and hangs if members are configured with different 
{{socket-buffer-size}} settings.
h2. Reproduction

To reproduce this bug, modify {{P2PMessagingConcurrencyDUnitTest}} so that the 
sender, locator, and receiver use different configuration parameters. Set 
{{socket-buffer-size}} to 212992 for the sender and 32 * 1024 for the receiver. 
Also skip the call to {{securityProperties()}}: we want to induce the "Unknown 
header byte" exception, not have the TLS framework throw exceptions.
h2. Analysis

The trouble starts in {{Connection.processInputBuffer()}}. When that method has 
read all the messages it can from the current input buffer, it considers whether 
the buffer needs expansion. If it does, then:
{code:java}
inputBuffer = inputSharing.expandReadBufferIfNeeded(allocSize); {code}
is executed and the method returns. The caller then expects to be able to 
_write_ bytes into {{inputBuffer}}.

The problem, it seems, is that 
{{ByteBufferSharingInternalImpl.expandReadBufferIfNeeded()}} does not leave the 
{{ByteBuffer}} in the proper state. It leaves the buffer ready to be _read_, not 
written.

The line of code referenced above used to be this method in {{Connection}} 
(which has since been removed):
{code:java}
private void compactOrResizeBuffer(int messageLength) {
  final int oldBufferSize = inputBuffer.capacity();
  int allocSize = messageLength + MSG_HEADER_BYTES;
  if (oldBufferSize < allocSize) {
// need a bigger buffer
logger.info("Allocating larger network read buffer, new size is {} old size 
was {}.",
allocSize, oldBufferSize);
ByteBuffer old

[jira] [Updated] (GEODE-9825) Disparate socket-buffer-size Results in "IOException: Unknown header byte" and Hangs

2021-11-18 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-9825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham updated GEODE-9825:

Attachment: GEODE-9825-demo.patch

> Disparate socket-buffer-size Results in "IOException: Unknown header byte" 
> and Hangs
> 
>
> Key: GEODE-9825
> URL: https://issues.apache.org/jira/browse/GEODE-9825
> Project: Geode
>  Issue Type: Bug
>  Components: messaging
>Affects Versions: 1.12.4, 1.15.0
>Reporter: Bill Burcham
>Priority: Major
> Attachments: GEODE-9825-demo.patch
>
>
> GEODE-9141 introduced a bug that causes {{IOException: "Unknown header 
> byte..."}} and hangs if members are configured with different 
> {{socket-buffer-size}} settings.
> h2. Reproduction
> To reproduce this bug, modify {{P2PMessagingConcurrencyDUnitTest}} so that the 
> sender, locator, and receiver use different configuration parameters. Set 
> {{socket-buffer-size}} to 212992 for the sender and 32 * 1024 for the 
> receiver. Also skip the call to {{securityProperties()}}: we want to induce 
> the "Unknown header byte" exception, not have the TLS framework throw 
> exceptions.
> h2. Analysis
> The trouble starts in {{Connection.processInputBuffer()}}. When that method 
> has read all the messages it can from the current input buffer, it considers 
> whether the buffer needs expansion. If it does, then:
> {code:java}
> inputBuffer = inputSharing.expandReadBufferIfNeeded(allocSize); {code}
> is executed and the method returns. The caller then expects to be able to 
> _write_ bytes into {{inputBuffer}}.
> The problem, it seems, is that 
> {{ByteBufferSharingInternalImpl.expandReadBufferIfNeeded()}} does not leave 
> the {{ByteBuffer}} in the proper state. It leaves the buffer ready to be 
> _read_, not written.
> The line of code referenced above used to be this method in {{Connection}} 
> (which has since been removed):
> {code:java}
> private void compactOrResizeBuffer(int messageLength) {
>   final int oldBufferSize = inputBuffer.capacity();
>   int allocSize = messageLength + MSG_HEADER_BYTES;
>   if (oldBufferSize < allocSize) {
>     // need a bigger buffer
>     logger.info("Allocating larger network read buffer, new size is {} old size was {}.",
>         allocSize, oldBufferSize);
>     ByteBuffer oldBuffer = inputBuffer;
>     inputBuffer = getBufferPool().acquireDirectReceiveBuffer(allocSize);
>     if (oldBuffer != null) {
>       int oldByteCount = oldBuffer.remaining();
>       inputBuffer.put(oldBuffer);
>       inputBuffer.position(oldByteCount);
>       getBufferPool().releaseReceiveBuffer(oldBuffer);
>     }
>   } else {
>     if (inputBuffer.position() != 0) {
>       inputBuffer.compact();
>     } else {
>       inputBuffer.position(inputBuffer.limit());
>       inputBuffer.limit(inputBuffer.capacity());
>     }
>   }
> } {code}
> Notice how this method leaves {{inputBuffer}} ready to be _written_ to.
> It's not sufficient to simply call {{flip()}} on the inputBuffer before 
> returning it (I tried it and it didn't fix the bug). More work is needed.
> h2. Resolution
> When this ticket is complete the bug will be fixed and 
> {{P2PMessagingConcurrencyDUnitTest}} will be enhanced to test these 
> combinations:
> [security, sender/locator socket-buffer-size, receiver socket-buffer-size]
> [TLS, (default), (default)]  this is what the test currently does
> [no TLS, 64 * 1024, 32 * 1024] *new: this illustrates this bug*
> [no TLS, (default), (default)] *new*
> We might want to mix in conserve-sockets true/false in there too while we're 
> at it (the test currently holds it at true).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (GEODE-9825) Disparate socket-buffer-size Results in "IOException: Unknown header byte" and Hangs

2021-11-18 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-9825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham updated GEODE-9825:

Description: 
GEODE-9141 introduced a bug that causes {{IOException: "Unknown header 
byte..."}} and hangs if members are configured with different 
{{socket-buffer-size}} settings.
h2. Reproduction

To reproduce this bug, modify {{P2PMessagingConcurrencyDUnitTest}} so that the 
sender, locator, and receiver use different configuration parameters. Set 
{{socket-buffer-size}} to 212992 for the sender and 32 * 1024 for the receiver. 
Also skip the call to {{securityProperties()}}: we want to induce the "Unknown 
header byte" exception, not have the TLS framework throw exceptions.
h2. Analysis

The trouble starts in {{Connection.processInputBuffer()}}. When that method has 
read all the messages it can from the current input buffer, it considers whether 
the buffer needs expansion. If it does, then:
{code:java}
inputBuffer = inputSharing.expandReadBufferIfNeeded(allocSize); {code}
is executed and the method returns. The caller then expects to be able to 
_write_ bytes into {{inputBuffer}}.

The problem, it seems, is that 
{{ByteBufferSharingInternalImpl.expandReadBufferIfNeeded()}} does not leave the 
{{ByteBuffer}} in the proper state. It leaves the buffer ready to be _read_, not 
written.

The line of code referenced above used to be this method in {{Connection}} 
(which has since been removed):
{code:java}
private void compactOrResizeBuffer(int messageLength) {
  final int oldBufferSize = inputBuffer.capacity();
  int allocSize = messageLength + MSG_HEADER_BYTES;
  if (oldBufferSize < allocSize) {
    // need a bigger buffer
    logger.info("Allocating larger network read buffer, new size is {} old size was {}.",
        allocSize, oldBufferSize);
    ByteBuffer oldBuffer = inputBuffer;
    inputBuffer = getBufferPool().acquireDirectReceiveBuffer(allocSize);

    if (oldBuffer != null) {
      int oldByteCount = oldBuffer.remaining();
      inputBuffer.put(oldBuffer);
      inputBuffer.position(oldByteCount);
      getBufferPool().releaseReceiveBuffer(oldBuffer);
    }
  } else {
    if (inputBuffer.position() != 0) {
      inputBuffer.compact();
    } else {
      inputBuffer.position(inputBuffer.limit());
      inputBuffer.limit(inputBuffer.capacity());
    }
  }
} {code}
Notice how this method leaves {{inputBuffer}} ready to be _written_ to.

It's not sufficient to simply call {{flip()}} on the inputBuffer before 
returning it (I tried it and it didn't fix the bug). More work is needed.
h2. Resolution

When this ticket is complete the bug will be fixed and 
{{P2PMessagingConcurrencyDUnitTest}} will be enhanced to test these 
combinations:

[security, sender/locator socket-buffer-size, receiver socket-buffer-size]

[TLS, (default), (default)]  this is what the test currently does
[no TLS, 64 * 1024, 32 * 1024] *new: this illustrates this bug*
[no TLS, (default), (default)] *new*

We might want to mix in conserve-sockets true/false in there too while we're at 
it (the test currently holds it at true).

  was:
GEODE-9141 introduced a bug that causes hangs

The trouble starts in {{Connection.processInputBuffer()}}. When that method has 
read all the messages it can from the current input buffer, it considers whether 
the buffer needs expansion. If it does, then:
{code:java}
inputBuffer = inputSharing.expandReadBufferIfNeeded(allocSize); {code}
is executed and the method returns. The caller then expects to be able to 
_write_ bytes into {{inputBuffer}}.

The problem, it seems, is that 
{{ByteBufferSharingInternalImpl.expandReadBufferIfNeeded()}} does not leave the 
{{ByteBuffer}} in the proper state. It leaves the buffer ready to be _read_, not 
written.

The line of code referenced above used to be this method in {{Connection}} 
(which has since been removed):
{code:java}
private void compactOrResizeBuffer(int messageLength) {
  final int oldBufferSize = inputBuffer.capacity();
  int allocSize = messageLength + MSG_HEADER_BYTES;
  if (oldBufferSize < allocSize) {
    // need a bigger buffer
    logger.info("Allocating larger network read buffer, new size is {} old size was {}.",
        allocSize, oldBufferSize);
    ByteBuffer oldBuffer = inputBuffer;
    inputBuffer = getBufferPool().acquireDirectReceiveBuffer(allocSize);

    if (oldBuffer != null) {
      int oldByteCount = oldBuffer.remaining();
      inputBuffer.put(oldBuffer);
      inputBuffer.position(oldByteCount);
      getBufferPool().releaseReceiveBuffer(oldBuffer);
    }
  } else {
    if (inputBuffer.position() != 0) {
      inputBuffer.compact();
    } else {
      inputBuffer.position(inputBuffer.limit());
      inputBuffer.limit(inputBuffer.capacity());
    }
  }
} {code}
Notice how this method leaves {{inputBuffer}} ready to be _written_ to.

It's not sufficient to simply call {{flip()}} on the inputBuffer before 
returning it (I tri

[jira] [Updated] (GEODE-9825) Disparate socket-buffer-size Results in "IOException: Unknown header byte" and Hangs

2021-11-18 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-9825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham updated GEODE-9825:

Summary: Disparate socket-buffer-size Results in "IOException: Unknown 
header byte" and Hangs  (was: Disparate socket-buffer-size Results in "Unknown 
header byte" Exceptions and Hangs)

> Disparate socket-buffer-size Results in "IOException: Unknown header byte" 
> and Hangs
> 
>
> Key: GEODE-9825
> URL: https://issues.apache.org/jira/browse/GEODE-9825
> Project: Geode
>  Issue Type: Bug
>  Components: messaging
>Affects Versions: 1.12.4, 1.15.0
>Reporter: Bill Burcham
>Priority: Major
>
> GEODE-9141 introduced a bug that causes hangs
> The trouble starts in {{Connection.processInputBuffer()}}. When that method 
> has read all the messages it can from the current input buffer, it considers 
> whether the buffer needs expansion. If it does, then:
> {code:java}
> inputBuffer = inputSharing.expandReadBufferIfNeeded(allocSize); {code}
> is executed and the method returns. The caller then expects to be able to 
> _write_ bytes into {{inputBuffer}}.
> The problem, it seems, is that 
> {{ByteBufferSharingInternalImpl.expandReadBufferIfNeeded()}} does not leave 
> the {{ByteBuffer}} in the proper state. It leaves the buffer ready to be 
> _read_, not written.
> The line of code referenced above used to be this method in {{Connection}} 
> (which has since been removed):
> {code:java}
> private void compactOrResizeBuffer(int messageLength) {
>   final int oldBufferSize = inputBuffer.capacity();
>   int allocSize = messageLength + MSG_HEADER_BYTES;
>   if (oldBufferSize < allocSize) {
>     // need a bigger buffer
>     logger.info("Allocating larger network read buffer, new size is {} old size was {}.",
>         allocSize, oldBufferSize);
>     ByteBuffer oldBuffer = inputBuffer;
>     inputBuffer = getBufferPool().acquireDirectReceiveBuffer(allocSize);
>     if (oldBuffer != null) {
>       int oldByteCount = oldBuffer.remaining();
>       inputBuffer.put(oldBuffer);
>       inputBuffer.position(oldByteCount);
>       getBufferPool().releaseReceiveBuffer(oldBuffer);
>     }
>   } else {
>     if (inputBuffer.position() != 0) {
>       inputBuffer.compact();
>     } else {
>       inputBuffer.position(inputBuffer.limit());
>       inputBuffer.limit(inputBuffer.capacity());
>     }
>   }
> } {code}
> Notice how this method leaves {{inputBuffer}} ready to be _written_ to.
> It's not sufficient to simply call {{flip()}} on the inputBuffer before 
> returning it (I tried it and it didn't fix the bug). More work is needed.
> To reproduce this bug, modify {{P2PMessagingConcurrencyDUnitTest}} so that the 
> sender, locator, and receiver use different configuration parameters. Set 
> {{socket-buffer-size}} to 212992 for the sender and 32 * 1024 for the 
> receiver. Also skip the call to {{securityProperties()}}: we want to induce 
> the "Unknown header byte" exception, not have the TLS framework throw 
> exceptions.
> When this ticket is complete {{P2PMessagingConcurrencyDUnitTest}} will be 
> enhanced to test these combinations:
> [security, sender/locator socket-buffer-size, receiver socket-buffer-size]
> [TLS, (default), (default)]  this is what the test currently does
> [no TLS, 212992, 32 * 1024] *new: this illustrates this bug*
> [no TLS, (default), (default)] *new*
> We might want to mix in conserve-sockets true/false in there too while we're 
> at it (the test currently holds it at true).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (GEODE-9825) Disparate socket-buffer-size Results in "Unknown header byte" Exceptions and Hangs

2021-11-18 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-9825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham updated GEODE-9825:

Description: 
GEODE-9141 introduced a bug that causes hangs

The trouble starts in {{Connection.processInputBuffer()}}. When that method has 
read all the messages it can from the current input buffer, it considers whether 
the buffer needs expansion. If it does, then:
{code:java}
inputBuffer = inputSharing.expandReadBufferIfNeeded(allocSize); {code}
is executed and the method returns. The caller then expects to be able to 
_write_ bytes into {{inputBuffer}}.

The problem, it seems, is that 
{{ByteBufferSharingInternalImpl.expandReadBufferIfNeeded()}} does not leave the 
{{ByteBuffer}} in the proper state. It leaves the buffer ready to be _read_, not 
written.

The line of code referenced above used to be this method in {{Connection}} 
(which has since been removed):
{code:java}
private void compactOrResizeBuffer(int messageLength) {
  final int oldBufferSize = inputBuffer.capacity();
  int allocSize = messageLength + MSG_HEADER_BYTES;
  if (oldBufferSize < allocSize) {
    // need a bigger buffer
    logger.info("Allocating larger network read buffer, new size is {} old size was {}.",
        allocSize, oldBufferSize);
    ByteBuffer oldBuffer = inputBuffer;
    inputBuffer = getBufferPool().acquireDirectReceiveBuffer(allocSize);

    if (oldBuffer != null) {
      int oldByteCount = oldBuffer.remaining();
      inputBuffer.put(oldBuffer);
      inputBuffer.position(oldByteCount);
      getBufferPool().releaseReceiveBuffer(oldBuffer);
    }
  } else {
    if (inputBuffer.position() != 0) {
      inputBuffer.compact();
    } else {
      inputBuffer.position(inputBuffer.limit());
      inputBuffer.limit(inputBuffer.capacity());
    }
  }
} {code}
Notice how this method leaves {{inputBuffer}} ready to be _written_ to.

It's not sufficient to simply call {{flip()}} on the inputBuffer before 
returning it (I tried it and it didn't fix the bug). More work is needed.

To reproduce this bug, modify {{P2PMessagingConcurrencyDUnitTest}} so that the 
sender, locator, and receiver use different configuration parameters. Set 
{{socket-buffer-size}} to 212992 for the sender and 32 * 1024 for the receiver. 
Also skip the call to {{securityProperties()}}: we want to induce the "Unknown 
header byte" exception, not have the TLS framework throw exceptions.

When this ticket is complete {{P2PMessagingConcurrencyDUnitTest}} will be 
enhanced to test these combinations:

[security, sender/locator socket-buffer-size, receiver socket-buffer-size]

[TLS, (default), (default)]  this is what the test currently does
[no TLS, 212992, 32 * 1024] *new: this illustrates this bug*
[no TLS, (default), (default)] *new*

We might want to mix in conserve-sockets true/false in there too while we're at 
it (the test currently holds it at true).

  was:
GEODE-9141 introduced a bug in {{Connection.processInputBuffer()}}. When that 
method has read all the messages it can from the current input buffer, it 
considers whether the buffer needs expansion. If it does, then:
{code:java}
inputBuffer = inputSharing.expandReadBufferIfNeeded(allocSize); {code}
is executed and the method returns. The caller then expects to be able to 
_write_ bytes into {{inputBuffer}}.

The problem, it seems, is that 
{{ByteBufferSharingInternalImpl.expandReadBufferIfNeeded()}} does not leave the 
{{ByteBuffer}} in the proper state. It leaves the buffer ready to be _read_, not 
written.

The line of code referenced above used to be this method in {{Connection}} 
(which has since been removed):

 
{code:java}
private void compactOrResizeBuffer(int messageLength) {
  final int oldBufferSize = inputBuffer.capacity();
  int allocSize = messageLength + MSG_HEADER_BYTES;
  if (oldBufferSize < allocSize) {
    // need a bigger buffer
    logger.info("Allocating larger network read buffer, new size is {} old size was {}.",
        allocSize, oldBufferSize);
    ByteBuffer oldBuffer = inputBuffer;
    inputBuffer = getBufferPool().acquireDirectReceiveBuffer(allocSize);

    if (oldBuffer != null) {
      int oldByteCount = oldBuffer.remaining();
      inputBuffer.put(oldBuffer);
      inputBuffer.position(oldByteCount);
      getBufferPool().releaseReceiveBuffer(oldBuffer);
    }
  } else {
    if (inputBuffer.position() != 0) {
      inputBuffer.compact();
    } else {
      inputBuffer.position(inputBuffer.limit());
      inputBuffer.limit(inputBuffer.capacity());
    }
  }
} {code}
Notice how this method leaves {{inputBuffer}} ready to be _written_ to.

 

It's not sufficient to simply call {{flip()}} on the inputBuffer before 
returning it (I tried it and it didn't fix the bug). More work is needed.

To reproduce this bug, modify {{P2PMessagingConcurrencyDUnitTest}} so that the 
sender, locator, and receiver use different configuration parameters. 

[jira] [Created] (GEODE-9825) Disparate socket-buffer-size Results in "Unknown header byte" Exceptions and Hangs

2021-11-18 Thread Bill Burcham (Jira)
Bill Burcham created GEODE-9825:
---

 Summary: Disparate socket-buffer-size Results in "Unknown header 
byte" Exceptions and Hangs
 Key: GEODE-9825
 URL: https://issues.apache.org/jira/browse/GEODE-9825
 Project: Geode
  Issue Type: Bug
  Components: messaging
Affects Versions: 1.12.4, 1.15.0
Reporter: Bill Burcham


GEODE-9141 introduced a bug in {{Connection.processInputBuffer()}}. When that 
method has read all the messages it can from the current input buffer, it 
considers whether the buffer needs expansion. If it does, then:
{code:java}
inputBuffer = inputSharing.expandReadBufferIfNeeded(allocSize); {code}
is executed and the method returns. The caller then expects to be able to 
_write_ bytes into {{inputBuffer}}.

The problem, it seems, is that 
{{ByteBufferSharingInternalImpl.expandReadBufferIfNeeded()}} does not leave the 
{{ByteBuffer}} in the proper state. It leaves the buffer ready to be _read_, not 
written.

The line of code referenced above used to be this method in {{Connection}} 
(which has since been removed):

 
{code:java}
private void compactOrResizeBuffer(int messageLength) {
  final int oldBufferSize = inputBuffer.capacity();
  int allocSize = messageLength + MSG_HEADER_BYTES;
  if (oldBufferSize < allocSize) {
    // need a bigger buffer
    logger.info("Allocating larger network read buffer, new size is {} old size was {}.",
        allocSize, oldBufferSize);
    ByteBuffer oldBuffer = inputBuffer;
    inputBuffer = getBufferPool().acquireDirectReceiveBuffer(allocSize);

    if (oldBuffer != null) {
      int oldByteCount = oldBuffer.remaining();
      inputBuffer.put(oldBuffer);
      inputBuffer.position(oldByteCount);
      getBufferPool().releaseReceiveBuffer(oldBuffer);
    }
  } else {
    if (inputBuffer.position() != 0) {
      inputBuffer.compact();
    } else {
      inputBuffer.position(inputBuffer.limit());
      inputBuffer.limit(inputBuffer.capacity());
    }
  }
} {code}
Notice how this method leaves {{inputBuffer}} ready to be _written_ to.

 

It's not sufficient to simply call {{flip()}} on the inputBuffer before 
returning it (I tried it and it didn't fix the bug). More work is needed.

To reproduce this bug, modify {{P2PMessagingConcurrencyDUnitTest}} so that the 
sender, locator, and receiver use different configuration parameters. Set 
{{socket-buffer-size}} to 212992 for the sender and 32 * 1024 for the receiver. 
Also skip the call to {{securityProperties()}}: we want to induce the "Unknown 
header byte" exception, not have the TLS framework throw exceptions.

When this ticket is complete {{P2PMessagingConcurrencyDUnitTest}} will be 
enhanced to test these combinations:

[security, sender/locator socket-buffer-size, receiver socket-buffer-size]

[TLS, (default), (default)]  this is what the test currently does
[no TLS, 212992, 32 * 1024] *new: this illustrates this bug*
[no TLS, (default), (default)] *new*

We might want to mix in conserve-sockets true/false in there too while we're at 
it (the test currently holds it at true).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (GEODE-9822) Split-brain Possible During Network Partition in Two-Locator Cluster

2021-11-18 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham updated GEODE-9822:

Description: 
In a two-locator cluster with default member weights and default setting (true) 
of enable-network-partition-detection, if a long-lived network partition 
separates the two members, a split-brain will arise: there will be two 
coordinators at the same time.

The reason for this can be found in the GMSJoinLeave.isNetworkPartition() 
method. That method's name is misleading. A name like isMajorityLost() would 
probably be more apt. It needs to return true iff the weight of "crashed" 
members (in the prospective view) is greater-than-or-equal-to half (50%) of the 
total weight (of all members in the current view).

What the method actually does is return true iff the weight of "crashed" 
members is greater-than 51% of the total weight. As a result, if we have two 
members of equal weight, and the coordinator sees that the non-coordinator is 
"crashed", the coordinator will keep running. If a network partition is 
happening, and the non-coordinator is still running, then it will become a 
coordinator and start producing views. Now we'll have two coordinators 
producing views concurrently.

For this discussion "crashed" members are members for which the coordinator has 
received a RemoveMemberRequest message. These are members that the failure 
detector has deemed failed. Keep in mind the failure detector is imperfect 
(it's not always right), and that's kind of the whole point of this ticket: 
we've lost contact with the non-coordinator member, but that doesn't mean it 
can't still be running (on the other side of a partition).
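
To put numbers on that, here is a small illustrative sketch (the method and 
variable names are made up; the real logic lives in 
GMSJoinLeave.isNetworkPartition() and works on member weights from the view):
{code:java}
public class QuorumCheckSketch {

  // The check this ticket asks for: losing half or more of the total weight
  // means the surviving side may no longer claim a quorum.
  static boolean majorityLost(int crashedWeight, int totalWeight) {
    return crashedWeight * 2 >= totalWeight;
  }

  // The behavior described above: a partition is only declared when more than
  // 51% of the total weight has been lost.
  static boolean currentCheck(int crashedWeight, int totalWeight) {
    return crashedWeight > totalWeight * 0.51;
  }

  public static void main(String[] args) {
    // Two locators of equal weight (say 10 each): the coordinator sees exactly
    // half the weight "crashed" when the other locator becomes unreachable.
    int crashedWeight = 10;
    int totalWeight = 20;
    System.out.println(majorityLost(crashedWeight, totalWeight)); // true: declare a partition
    System.out.println(currentCheck(crashedWeight, totalWeight)); // false: keep running (split-brain risk)
  }
} {code}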

  was:
In a two-locator cluster with default member weights and default setting (true) 
of enable-network-partition-detection, if a long-lived network partition 
separates the two members, a split-brain will arise: there will be two 
coordinators at the same time.

The reason for this can be found in the GMSJoinLeave.isNetworkPartition() 
method. That method's name is misleading. A name like isMajorityLost() would 
probably be more apt. It needs to return true iff the weight of "crashed" 
members (in the prospective view) is greater-than-or-equal-to 50% of the total 
weight (of all members in the current view).

What the method actually does is return true iff the weight of "crashed" 
members is greater-than 51% of the total weight. As a result, if we have two 
members of equal weight, and the coordinator sees that the non-coordinator is 
"crashed", the coordinator will keep running. If a network partition is 
happening, and the non-coordinator is still running, then it will become a 
coordinator and start producing views. Now we'll have two coordinators 
producing views concurrently.

For this discussion "crashed" members are members for which the coordinator has 
received a RemoveMemberRequest message. These are members that the failure 
detector has deemed failed. Keep in mind the failure detector is imperfect 
(it's not always right), and that's kind of the whole point of this ticket: 
we've lost contact with the non-coordinator member, but that doesn't mean it 
can't still be running (on the other side of a partition).


> Split-brain Possible During Network Partition in Two-Locator Cluster
> 
>
> Key: GEODE-9822
> URL: https://issues.apache.org/jira/browse/GEODE-9822
> Project: Geode
>  Issue Type: Bug
>  Components: membership
>Reporter: Bill Burcham
>Priority: Major
>  Labels: pull-request-available
>
> In a two-locator cluster with default member weights and default setting 
> (true) of enable-network-partition-detection, if a long-lived network 
> partition separates the two members, a split-brain will arise: there will be 
> two coordinators at the same time.
> The reason for this can be found in the GMSJoinLeave.isNetworkPartition() 
> method. That method's name is misleading. A name like isMajorityLost() would 
> probably be more apt. It needs to return true iff the weight of "crashed" 
> members (in the prospective view) is greater-than-or-equal-to half (50%) of 
> the total weight (of all members in the current view).
> What the method actually does is return true iff the weight of "crashed" 
> members is greater-than 51% of the total weight. As a result, if we have two 
> members of equal weight, and the coordinator sees that the non-coordinator is 
> "crashed", the coordinator will keep running. If a network partition is 
> happening, and the non-coordinator is still running, then it will become a 
> coordinator and start producing views. Now we'll have two coordinators 
> producing views concurrently.
> For this discussion "crashed" members are members for which the coordinator 
> has received a Re

[jira] [Updated] (GEODE-9822) Split-brain Possible During Network Partition in Two-Locator Cluster

2021-11-18 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham updated GEODE-9822:

Description: 
In a two-locator cluster with default member weights and default setting (true) 
of enable-network-partition-detection, if a long-lived network partition 
separates the two members, a split-brain will arise: there will be two 
coordinators at the same time.

The reason for this can be found in the GMSJoinLeave.isNetworkPartition() 
method. That method's name is misleading. A name like isMajorityLost() would 
probably be more apt. It needs to return true iff the weight of "crashed" 
members (in the prospective view) is greater-than-or-equal-to 50% of the total 
weight (of all members in the current view).

What the method actually does is return true iff the weight of "crashed" 
members is greater-than 51% of the total weight. As a result, if we have two 
members of equal weight, and the coordinator sees that the non-coordinator is 
"crashed", the coordinator will keep running. If a network partition is 
happening, and the non-coordinator is still running, then it will become a 
coordinator and start producing views. Now we'll have two coordinators 
producing views concurrently.

For this discussion "crashed" members are members for which the coordinator has 
received a RemoveMemberRequest message. These are members that the failure 
detector has deemed failed. Keep in mind the failure detector is imperfect 
(it's not always right), and that's kind of the whole point of this ticket: 
we've lost contact with the non-coordinator member, but that doesn't mean it 
can't still be running (on the other side of a partition).

  was:
In a two-locator cluster with default member weights and default setting (true) 
of enable-network-partition-detection, if a long-lived network partition 
separates the two members, a split-brain will arise: there will be two 
coordinators at the same time.

The reason for this can be found in the GMSJoinLeave.isNetworkPartition() 
method. That method's name is misleading. A name like majorityLost() would 
probably be more apt. It needs to return true iff the weight of "crashed" 
members (in the prospective view) is greater-than-or-equal-to 50% of the total 
weight (of all members in the current view).

What the method actually does is return true iff the weight of "crashed" 
members is greater-than 51% of the total weight. As a result, if we have two 
members of equal weight, and the coordinator sees that the non-coordinator is 
"crashed", the coordinator will keep running. If a network partition is 
happening, and the non-coordinator is still running, then it will become a 
coordinator and start producing views. Now we'll have two coordinators 
producing views concurrently.

For this discussion "crashed" members are members for which the coordinator has 
received a RemoveMemberRequest message. These are members that the failure 
detector has deemed failed. Keep in mind the failure detector is imperfect 
(it's not always right), and that's kind of the whole point of this ticket: 
we've lost contact with the non-coordinator member, but that doesn't mean it 
can't still be running (on the other side of the partition).


> Split-brain Possible During Network Partition in Two-Locator Cluster
> 
>
> Key: GEODE-9822
> URL: https://issues.apache.org/jira/browse/GEODE-9822
> Project: Geode
>  Issue Type: Bug
>  Components: membership
>Reporter: Bill Burcham
>Priority: Major
>  Labels: pull-request-available
>
> In a two-locator cluster with default member weights and default setting 
> (true) of enable-network-partition-detection, if a long-lived network 
> partition separates the two members, a split-brain will arise: there will be 
> two coordinators at the same time.
> The reason for this can be found in the GMSJoinLeave.isNetworkPartition() 
> method. That method's name is misleading. A name like isMajorityLost() would 
> probably be more apt. It needs to return true iff the weight of "crashed" 
> members (in the prospective view) is greater-than-or-equal-to 50% of the 
> total weight (of all members in the current view).
> What the method actually does is return true iff the weight of "crashed" 
> members is greater-than 51% of the total weight. As a result, if we have two 
> members of equal weight, and the coordinator sees that the non-coordinator is 
> "crashed", the coordinator will keep running. If a network partition is 
> happening, and the non-coordinator is still running, then it will become a 
> coordinator and start producing views. Now we'll have two coordinators 
> producing views concurrently.
> For this discussion "crashed" members are members for which the coordinator 
> has received a RemoveMemberRequ

[jira] [Updated] (GEODE-9822) Split-brain Possible During Network Partition in Two-Locator Cluster

2021-11-18 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham updated GEODE-9822:

Description: 
In a two-locator cluster with default member weights and default setting (true) 
of enable-network-partition-detection, if a long-lived network partition 
separates the two members, a split-brain will arise: there will be two 
coordinators at the same time.

The reason for this can be found in the GMSJoinLeave.isNetworkPartition() 
method. That method's name is misleading. A name like majorityLost() would 
probably be more apt. It needs to return true iff the weight of "crashed" 
members (in the prospective view) is greater-than-or-equal-to 50% of the total 
weight (of all members in the current view).

What the method actually does is return true iff the weight of "crashed" 
members is greater-than 51% of the total weight. As a result, if we have two 
members of equal weight, and the coordinator sees that the non-coordinator is 
"crashed", the coordinator will keep running. If a network partition is 
happening, and the non-coordinator is still running, then it will become a 
coordinator and start producing views. Now we'll have two coordinators 
producing views concurrently.

For this discussion "crashed" members are members for which the coordinator has 
received a RemoveMemberRequest message. These are members that the failure 
detector has deemed failed. Keep in mind the failure detector is imperfect 
(it's not always right), and that's kind of the whole point of this ticket: 
we've lost contact with the non-coordinator member, but that doesn't mean it 
can't still be running (on the other side of the partition).

  was:
In a two-locator cluster with default member weights and default setting (true) 
of enable-network-partition-detection, if a long-lived network partition 
separates the two members, a split-brain will arise: there will be two 
coordinators at the same time.

The reason for this can be found in the GMSJoinLeave.isNetworkPartition() 
method. That method's name is misleading. A name like majorityLost() would 
probably be more apt. It needs to return true iff the weight of "crashed" 
members (in the prospective view) is greater-than-or-equal-to 50% of the total 
weight (of all members in the current view).

What the method actually does is return true iff the weight of "crashed" 
members is greater-than 51% of the total weight. As a result, if we have two 
members of equal weight, and the coordinator sees that the non-coordinator is 
"crashed", the coordinator will keep running. If a network partition is 
happening, and the non-coordinator is still running, then it will become a 
coordinator and start producing views. Now we'll have two coordinators 
producing views concurrently.


> Split-brain Possible During Network Partition in Two-Locator Cluster
> 
>
> Key: GEODE-9822
> URL: https://issues.apache.org/jira/browse/GEODE-9822
> Project: Geode
>  Issue Type: Bug
>  Components: membership
>Reporter: Bill Burcham
>Priority: Major
>
> In a two-locator cluster with default member weights and default setting 
> (true) of enable-network-partition-detection, if a long-lived network 
> partition separates the two members, a split-brain will arise: there will be 
> two coordinators at the same time.
> The reason for this can be found in the GMSJoinLeave.isNetworkPartition() 
> method. That method's name is misleading. A name like majorityLost() would 
> probably be more apt. It needs to return true iff the weight of "crashed" 
> members (in the prospective view) is greater-than-or-equal-to 50% of the 
> total weight (of all members in the current view).
> What the method actually does is return true iff the weight of "crashed" 
> members is greater-than 51% of the total weight. As a result, if we have two 
> members of equal weight, and the coordinator sees that the non-coordinator is 
> "crashed", the coordinator will keep running. If a network partition is 
> happening, and the non-coordinator is still running, then it will become a 
> coordinator and start producing views. Now we'll have two coordinators 
> producing views concurrently.
> For this discussion "crashed" members are members for which the coordinator 
> has received a RemoveMemberRequest message. These are members that the 
> failure detector has deemed failed. Keep in mind the failure detector is 
> imperfect (it's not always right), and that's kind of the whole point of this 
> ticket: we've lost contact with the non-coordinator member, but that doesn't 
> mean it can't still be running (on the other side of the partition).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (GEODE-9822) Split-brain Possible During Network Partition in Two-Locator Cluster

2021-11-18 Thread Bill Burcham (Jira)
Bill Burcham created GEODE-9822:
---

 Summary: Split-brain Possible During Network Partition in 
Two-Locator Cluster
 Key: GEODE-9822
 URL: https://issues.apache.org/jira/browse/GEODE-9822
 Project: Geode
  Issue Type: Bug
  Components: membership
Reporter: Bill Burcham


In a two-locator cluster with default member weights and default setting (true) 
of enable-network-partition-detection, if a long-lived network partition 
separates the two members, a split-brain will arise: there will be two 
coordinators at the same time.

The reason for this can be found in the GMSJoinLeave.isNetworkPartition() 
method. That method's name is misleading. A name like majorityLost() would 
probably be more apt. It needs to return true iff the weight of "crashed" 
members (in the prospective view) is greater-than-or-equal-to 50% of the total 
weight (of all members in the current view).

What the method actually does is return true iff the weight of "crashed" 
members is greater-than 51% of the total weight. As a result, if we have two 
members of equal weight, and the coordinator sees that the non-coordinator is 
"crashed", the coordinator will keep running. If a network partition is 
happening, and the non-coordinator is still running, then it will become a 
coordinator and start producing views. Now we'll have two coordinators 
producing views concurrently.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (GEODE-9738) CI failure: RollingUpgradeRollServersOnReplicatedRegion_dataserializable failed with DistributedSystemDisconnectedException

2021-11-16 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-9738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham updated GEODE-9738:

Attachment: GEODE-9738-short.log.all

> CI failure: RollingUpgradeRollServersOnReplicatedRegion_dataserializable 
> failed with DistributedSystemDisconnectedException
> ---
>
> Key: GEODE-9738
> URL: https://issues.apache.org/jira/browse/GEODE-9738
> Project: Geode
>  Issue Type: Bug
>  Components: membership, messaging
>Affects Versions: 1.15.0
>Reporter: Kamilla Aslami
>Assignee: Bill Burcham
>Priority: Major
>  Labels: needsTriage
> Attachments: GEODE-9738-short.log.all, controller.log, locator.log, 
> vm0.log, vm1.log, vm2.log, vm3.log
>
>
> {noformat}
> RollingUpgradeRollServersOnReplicatedRegion_dataserializable > 
> testRollServersOnReplicatedRegion_dataserializable[from_v1.13.4] FAILED
> java.lang.AssertionError: Suspicious strings were written to the log 
> during this run.
> Fix the strings or use IgnoredException.addIgnoredException to ignore.
> ---
> Found suspect string in 'dunit_suspect-vm2.log' at line 685[fatal 
> 2021/10/14 00:24:14.739 UTC  tid=115] Uncaught exception 
> in thread Thread[FederatingManager6,5,RMI Runtime]
> org.apache.geode.management.ManagementException: 
> org.apache.geode.distributed.DistributedSystemDisconnectedException: 
> Distribution manager on 
> heavy-lifter-10ae5f9d-2528-5e02-b707-d968eb54d50a(vm2:580278:locator):54751
>  started at Thu Oct 14 00:23:52 UTC 2021: Message distribution has terminated
> at 
> org.apache.geode.management.internal.FederatingManager.addMemberArtifacts(FederatingManager.java:486)
> at 
> org.apache.geode.management.internal.FederatingManager$AddMemberTask.call(FederatingManager.java:596)
> at 
> org.apache.geode.management.internal.FederatingManager.lambda$addMember$1(FederatingManager.java:199)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: 
> org.apache.geode.distributed.DistributedSystemDisconnectedException: 
> Distribution manager on 
> heavy-lifter-10ae5f9d-2528-5e02-b707-d968eb54d50a(vm2:580278:locator):54751
>  started at Thu Oct 14 00:23:52 UTC 2021: Message distribution has terminated
> at 
> org.apache.geode.distributed.internal.ClusterDistributionManager$Stopper.generateCancelledException(ClusterDistributionManager.java:2885)
> at 
> org.apache.geode.distributed.internal.InternalDistributedSystem$Stopper.generateCancelledException(InternalDistributedSystem.java:1177)
> at 
> org.apache.geode.internal.cache.GemFireCacheImpl$Stopper.generateCancelledException(GemFireCacheImpl.java:5212)
> at 
> org.apache.geode.CancelCriterion.checkCancelInProgress(CancelCriterion.java:83)
> at 
> org.apache.geode.internal.cache.CreateRegionProcessor.initializeRegion(CreateRegionProcessor.java:121)
> at 
> org.apache.geode.internal.cache.DistributedRegion.getInitialImageAndRecovery(DistributedRegion.java:1164)
> at 
> org.apache.geode.internal.cache.DistributedRegion.initialize(DistributedRegion.java:1095)
> at 
> org.apache.geode.internal.cache.GemFireCacheImpl.createVMRegion(GemFireCacheImpl.java:3108)
> at 
> org.apache.geode.internal.cache.InternalRegionFactory.create(InternalRegionFactory.java:78)
> at 
> org.apache.geode.management.internal.FederatingManager.addMemberArtifacts(FederatingManager.java:429)
> ... 5 more
> at org.junit.Assert.fail(Assert.java:89)
> at 
> org.apache.geode.test.dunit.internal.DUnitLauncher.closeAndCheckForSuspects(DUnitLauncher.java:420)
> at 
> org.apache.geode.test.dunit.internal.DUnitLauncher.closeAndCheckForSuspects(DUnitLauncher.java:436)
> at 
> org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.cleanupAllVms(JUnit4DistributedTestCase.java:551)
> at 
> org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.doTearDownDistributedTestCase(JUnit4DistributedTestCase.java:498)
> at 
> org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.tearDownDistributedTestCase(JUnit4DistributedTestCase.java:481)
> at jdk.internal.reflect.GeneratedMethodAccessor11.invoke(Unknown 
> Source)
> at 
> jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:566)
> a

[jira] [Comment Edited] (GEODE-9738) CI failure: RollingUpgradeRollServersOnReplicatedRegion_dataserializable failed with DistributedSystemDisconnectedException

2021-11-16 Thread Bill Burcham (Jira)


[ 
https://issues.apache.org/jira/browse/GEODE-9738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17441878#comment-17441878
 ] 

Bill Burcham edited comment on GEODE-9738 at 11/16/21, 10:55 PM:
-

The logs in the failing test run (previous comment) are all interleaved in the 
"standard output" section of the failing test. I have attached the individual 
logs to the ticket, so we can analyze them.

The attached logs (controller.log, locator.log, vm\{0-3}.log) each contain 
content for multiple tests. I've attached the stdout for just the test of 
interest as GEODE-9738-short.log.all. That needs to be split so we can see a 
more focused view of the various logs.


was (Author: bburcham):
The logs in the failing test run (previous comment) are all interleaved in the 
"standard output" section of the failing test. I have attached the individual 
logs to the ticket, so we can analyze them.

> CI failure: RollingUpgradeRollServersOnReplicatedRegion_dataserializable 
> failed with DistributedSystemDisconnectedException
> ---
>
> Key: GEODE-9738
> URL: https://issues.apache.org/jira/browse/GEODE-9738
> Project: Geode
>  Issue Type: Bug
>  Components: membership, messaging
>Affects Versions: 1.15.0
>Reporter: Kamilla Aslami
>Assignee: Bill Burcham
>Priority: Major
>  Labels: needsTriage
> Attachments: GEODE-9738-short.log.all, controller.log, locator.log, 
> vm0.log, vm1.log, vm2.log, vm3.log
>
>
> {noformat}
> RollingUpgradeRollServersOnReplicatedRegion_dataserializable > 
> testRollServersOnReplicatedRegion_dataserializable[from_v1.13.4] FAILED
> java.lang.AssertionError: Suspicious strings were written to the log 
> during this run.
> Fix the strings or use IgnoredException.addIgnoredException to ignore.
> ---
> Found suspect string in 'dunit_suspect-vm2.log' at line 685[fatal 
> 2021/10/14 00:24:14.739 UTC  tid=115] Uncaught exception 
> in thread Thread[FederatingManager6,5,RMI Runtime]
> org.apache.geode.management.ManagementException: 
> org.apache.geode.distributed.DistributedSystemDisconnectedException: 
> Distribution manager on 
> heavy-lifter-10ae5f9d-2528-5e02-b707-d968eb54d50a(vm2:580278:locator):54751
>  started at Thu Oct 14 00:23:52 UTC 2021: Message distribution has terminated
> at 
> org.apache.geode.management.internal.FederatingManager.addMemberArtifacts(FederatingManager.java:486)
> at 
> org.apache.geode.management.internal.FederatingManager$AddMemberTask.call(FederatingManager.java:596)
> at 
> org.apache.geode.management.internal.FederatingManager.lambda$addMember$1(FederatingManager.java:199)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: 
> org.apache.geode.distributed.DistributedSystemDisconnectedException: 
> Distribution manager on 
> heavy-lifter-10ae5f9d-2528-5e02-b707-d968eb54d50a(vm2:580278:locator):54751
>  started at Thu Oct 14 00:23:52 UTC 2021: Message distribution has terminated
> at 
> org.apache.geode.distributed.internal.ClusterDistributionManager$Stopper.generateCancelledException(ClusterDistributionManager.java:2885)
> at 
> org.apache.geode.distributed.internal.InternalDistributedSystem$Stopper.generateCancelledException(InternalDistributedSystem.java:1177)
> at 
> org.apache.geode.internal.cache.GemFireCacheImpl$Stopper.generateCancelledException(GemFireCacheImpl.java:5212)
> at 
> org.apache.geode.CancelCriterion.checkCancelInProgress(CancelCriterion.java:83)
> at 
> org.apache.geode.internal.cache.CreateRegionProcessor.initializeRegion(CreateRegionProcessor.java:121)
> at 
> org.apache.geode.internal.cache.DistributedRegion.getInitialImageAndRecovery(DistributedRegion.java:1164)
> at 
> org.apache.geode.internal.cache.DistributedRegion.initialize(DistributedRegion.java:1095)
> at 
> org.apache.geode.internal.cache.GemFireCacheImpl.createVMRegion(GemFireCacheImpl.java:3108)
> at 
> org.apache.geode.internal.cache.InternalRegionFactory.create(InternalRegionFactory.java:78)
> at 
> org.apache.geode.management.internal.FederatingManager.addMemberArtifacts(FederatingManager.java:429)
> ... 5 more
> at org.junit.Assert.fail(Assert.java:89)
> at 
> org.apache.geode.test.dunit.internal.DUnitLauncher.closeAndCheckForSuspects(DUnitLauncher.java:420)
> at 
> org.apache.geo

[jira] [Created] (GEODE-9808) Client ops fail with NoLocatorsAvailableException when all servers leave the DS

2021-11-15 Thread Bill Burcham (Jira)
Bill Burcham created GEODE-9808:
---

 Summary: Client ops fail with NoLocatorsAvailableException when 
all servers leave the DS 
 Key: GEODE-9808
 URL: https://issues.apache.org/jira/browse/GEODE-9808
 Project: Geode
  Issue Type: Bug
  Components: client/server
Affects Versions: 1.15.0
Reporter: Bill Burcham


When there are no cache servers (only locators) in a cluster, client operations 
will fail with a misleading exception:
{noformat}
org.apache.geode.cache.client.NoAvailableLocatorsException: Unable to connect 
to any locators in the list 
[gemfire-cluster-locator-0.gemfire-cluster-locator.namespace-1850250019.svc.cluster.local:10334,
 
gemfire-cluster-locator-1.gemfire-cluster-locator.namespace-1850250019.svc.cluster.local:10334,
 
gemfire-cluster-locator-2.gemfire-cluster-locator.namespace-1850250019.svc.cluster.local:10334]
    at 
org.apache.geode.cache.client.internal.AutoConnectionSourceImpl.findServer(AutoConnectionSourceImpl.java:174)
    at 
org.apache.geode.cache.client.internal.ConnectionFactoryImpl.createClientToServerConnection(ConnectionFactoryImpl.java:211)
    at 
org.apache.geode.cache.client.internal.pooling.ConnectionManagerImpl.createPooledConnection(ConnectionManagerImpl.java:196)
    at 
org.apache.geode.cache.client.internal.pooling.ConnectionManagerImpl.forceCreateConnection(ConnectionManagerImpl.java:227)
    at 
org.apache.geode.cache.client.internal.pooling.ConnectionManagerImpl.exchangeConnection(ConnectionManagerImpl.java:365)
    at 
org.apache.geode.cache.client.internal.OpExecutorImpl.execute(OpExecutorImpl.java:161)
    at 
org.apache.geode.cache.client.internal.OpExecutorImpl.execute(OpExecutorImpl.java:120)
    at 
org.apache.geode.cache.client.internal.PoolImpl.execute(PoolImpl.java:805)
    at org.apache.geode.cache.client.internal.PutOp.execute(PutOp.java:91)
{noformat}
Even though the client is able to connect to a locator, we encounter a 
NoAvailableLocatorsException with the message "Unable to connect to any 
locators in the list".

Investigating the product code we see:
 # If there are no cache servers in the cluster, ServerLocator.pickServer() 
will definitely construct a ClientConnectionResponse(null), which causes that 
object’s hasResult() to return false in the loop-termination check in 
AutoConnectionSourceImpl.queryLocators()

 # Not only is the exception wording misleading in 
AutoConnectionSourceImpl.findServer(); it is also misleading in at least two 
other calling locations in AutoConnectionSourceImpl: findReplacementServer() 
and findServersForQueue()

 # In each of those cases the calling method translates a null response from 
queryLocators() into a thrown NoAvailableLocatorsException

 # An appropriate exception, NoAvailableServersException, already exists for 
the case where we were able to contact a locator but the locator was not able 
to find any cache servers

 # According to my Git spelunking, queryLocators() has been obfuscating the 
true cause of the failure since at least 2015

Without analyzing ServerLocator.pickServer() 
(LocatorLoadSnapshot.getServerForConnection()) to discern why two locators 
might disagree on how many cache servers are in the cluster, it seems to me 
that we should modify AutoConnectionSourceImpl.queryLocators() so that:
 * if it gets a ServerLocationResponse with hasResult() true, it immediately 
returns that as it does now

 * otherwise it keeps trying and it keeps track of the last (non-null) 
ServerLocationResponse it has received

 * it returns the last non-null ServerLocationResponse it received (otherwise 
it returns null)

With that in hand, we can change the three call locations in 
AutoConnectionSourceImpl: findServer(), findReplacementServer(), and 
findServersForQueue() to each throw NoAvailableLocatorsException if no locator 
responded, or NoAvailableServersException if a locator responded with a 
ClientConnectionResponse for which hasResult() returns false.
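
As a rough sketch of that proposal (the ServerLocationResponse and 
LocatorClient types and the queryOneLocator name below are stand-ins, not the 
actual AutoConnectionSourceImpl internals), the control flow could look 
something like this, with callers distinguishing the two failure modes from 
the return value alone:
{code:java}
import java.util.List;

// Hypothetical sketch of the queryLocators() change proposed above.
class QueryLocatorsSketch {

  interface ServerLocationResponse {
    boolean hasResult();
  }

  interface LocatorClient {
    // Returns null if the locator could not be contacted at all.
    ServerLocationResponse queryOneLocator(String locatorAddress);
  }

  static ServerLocationResponse queryLocators(LocatorClient client, List<String> locators) {
    ServerLocationResponse lastResponse = null;
    for (String locator : locators) {
      ServerLocationResponse response = client.queryOneLocator(locator);
      if (response == null) {
        continue; // locator unreachable; keep trying the remaining locators
      }
      if (response.hasResult()) {
        return response; // a locator found a server: return immediately, as today
      }
      lastResponse = response; // locator reachable but knows of no servers; remember it
    }
    // A null return means no locator responded at all; callers would throw
    // NoAvailableLocatorsException. A non-null return with hasResult() false
    // means a locator responded but knew of no cache servers; callers would
    // throw NoAvailableServersException instead.
    return lastResponse;
  }
}
{code}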



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (GEODE-9738) CI failure: RollingUpgradeRollServersOnReplicatedRegion_dataserializable failed with DistributedSystemDisconnectedException

2021-11-10 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-9738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham updated GEODE-9738:

Attachment: controller.log
locator.log
vm3.log
vm2.log
vm1.log
vm0.log

> CI failure: RollingUpgradeRollServersOnReplicatedRegion_dataserializable 
> failed with DistributedSystemDisconnectedException
> ---
>
> Key: GEODE-9738
> URL: https://issues.apache.org/jira/browse/GEODE-9738
> Project: Geode
>  Issue Type: Bug
>  Components: membership, messaging
>Affects Versions: 1.15.0
>Reporter: Kamilla Aslami
>Assignee: Bill Burcham
>Priority: Major
>  Labels: needsTriage
> Attachments: controller.log, locator.log, vm0.log, vm1.log, vm2.log, 
> vm3.log
>
>
> {noformat}
> RollingUpgradeRollServersOnReplicatedRegion_dataserializable > 
> testRollServersOnReplicatedRegion_dataserializable[from_v1.13.4] FAILED
> java.lang.AssertionError: Suspicious strings were written to the log 
> during this run.
> Fix the strings or use IgnoredException.addIgnoredException to ignore.
> ---
> Found suspect string in 'dunit_suspect-vm2.log' at line 685[fatal 
> 2021/10/14 00:24:14.739 UTC  tid=115] Uncaught exception 
> in thread Thread[FederatingManager6,5,RMI Runtime]
> org.apache.geode.management.ManagementException: 
> org.apache.geode.distributed.DistributedSystemDisconnectedException: 
> Distribution manager on 
> heavy-lifter-10ae5f9d-2528-5e02-b707-d968eb54d50a(vm2:580278:locator):54751
>  started at Thu Oct 14 00:23:52 UTC 2021: Message distribution has terminated
> at 
> org.apache.geode.management.internal.FederatingManager.addMemberArtifacts(FederatingManager.java:486)
> at 
> org.apache.geode.management.internal.FederatingManager$AddMemberTask.call(FederatingManager.java:596)
> at 
> org.apache.geode.management.internal.FederatingManager.lambda$addMember$1(FederatingManager.java:199)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: 
> org.apache.geode.distributed.DistributedSystemDisconnectedException: 
> Distribution manager on 
> heavy-lifter-10ae5f9d-2528-5e02-b707-d968eb54d50a(vm2:580278:locator):54751
>  started at Thu Oct 14 00:23:52 UTC 2021: Message distribution has terminated
> at 
> org.apache.geode.distributed.internal.ClusterDistributionManager$Stopper.generateCancelledException(ClusterDistributionManager.java:2885)
> at 
> org.apache.geode.distributed.internal.InternalDistributedSystem$Stopper.generateCancelledException(InternalDistributedSystem.java:1177)
> at 
> org.apache.geode.internal.cache.GemFireCacheImpl$Stopper.generateCancelledException(GemFireCacheImpl.java:5212)
> at 
> org.apache.geode.CancelCriterion.checkCancelInProgress(CancelCriterion.java:83)
> at 
> org.apache.geode.internal.cache.CreateRegionProcessor.initializeRegion(CreateRegionProcessor.java:121)
> at 
> org.apache.geode.internal.cache.DistributedRegion.getInitialImageAndRecovery(DistributedRegion.java:1164)
> at 
> org.apache.geode.internal.cache.DistributedRegion.initialize(DistributedRegion.java:1095)
> at 
> org.apache.geode.internal.cache.GemFireCacheImpl.createVMRegion(GemFireCacheImpl.java:3108)
> at 
> org.apache.geode.internal.cache.InternalRegionFactory.create(InternalRegionFactory.java:78)
> at 
> org.apache.geode.management.internal.FederatingManager.addMemberArtifacts(FederatingManager.java:429)
> ... 5 more
> at org.junit.Assert.fail(Assert.java:89)
> at 
> org.apache.geode.test.dunit.internal.DUnitLauncher.closeAndCheckForSuspects(DUnitLauncher.java:420)
> at 
> org.apache.geode.test.dunit.internal.DUnitLauncher.closeAndCheckForSuspects(DUnitLauncher.java:436)
> at 
> org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.cleanupAllVms(JUnit4DistributedTestCase.java:551)
> at 
> org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.doTearDownDistributedTestCase(JUnit4DistributedTestCase.java:498)
> at 
> org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.tearDownDistributedTestCase(JUnit4DistributedTestCase.java:481)
> at jdk.internal.reflect.GeneratedMethodAccessor11.invoke(Unknown 
> Source)
> at 
> jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccesso

[jira] [Updated] (GEODE-9738) CI failure: RollingUpgradeRollServersOnReplicatedRegion_dataserializable failed with DistributedSystemDisconnectedException

2021-11-10 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-9738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham updated GEODE-9738:

Attachment: (was: controller.log)

> CI failure: RollingUpgradeRollServersOnReplicatedRegion_dataserializable 
> failed with DistributedSystemDisconnectedException
> ---
>
> Key: GEODE-9738
> URL: https://issues.apache.org/jira/browse/GEODE-9738
> Project: Geode
>  Issue Type: Bug
>  Components: membership, messaging
>Affects Versions: 1.15.0
>Reporter: Kamilla Aslami
>Assignee: Bill Burcham
>Priority: Major
>  Labels: needsTriage
>
> {noformat}
> RollingUpgradeRollServersOnReplicatedRegion_dataserializable > 
> testRollServersOnReplicatedRegion_dataserializable[from_v1.13.4] FAILED
> java.lang.AssertionError: Suspicious strings were written to the log 
> during this run.
> Fix the strings or use IgnoredException.addIgnoredException to ignore.
> ---
> Found suspect string in 'dunit_suspect-vm2.log' at line 685[fatal 
> 2021/10/14 00:24:14.739 UTC  tid=115] Uncaught exception 
> in thread Thread[FederatingManager6,5,RMI Runtime]
> org.apache.geode.management.ManagementException: 
> org.apache.geode.distributed.DistributedSystemDisconnectedException: 
> Distribution manager on 
> heavy-lifter-10ae5f9d-2528-5e02-b707-d968eb54d50a(vm2:580278:locator):54751
>  started at Thu Oct 14 00:23:52 UTC 2021: Message distribution has terminated
> at 
> org.apache.geode.management.internal.FederatingManager.addMemberArtifacts(FederatingManager.java:486)
> at 
> org.apache.geode.management.internal.FederatingManager$AddMemberTask.call(FederatingManager.java:596)
> at 
> org.apache.geode.management.internal.FederatingManager.lambda$addMember$1(FederatingManager.java:199)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: 
> org.apache.geode.distributed.DistributedSystemDisconnectedException: 
> Distribution manager on 
> heavy-lifter-10ae5f9d-2528-5e02-b707-d968eb54d50a(vm2:580278:locator):54751
>  started at Thu Oct 14 00:23:52 UTC 2021: Message distribution has terminated
> at 
> org.apache.geode.distributed.internal.ClusterDistributionManager$Stopper.generateCancelledException(ClusterDistributionManager.java:2885)
> at 
> org.apache.geode.distributed.internal.InternalDistributedSystem$Stopper.generateCancelledException(InternalDistributedSystem.java:1177)
> at 
> org.apache.geode.internal.cache.GemFireCacheImpl$Stopper.generateCancelledException(GemFireCacheImpl.java:5212)
> at 
> org.apache.geode.CancelCriterion.checkCancelInProgress(CancelCriterion.java:83)
> at 
> org.apache.geode.internal.cache.CreateRegionProcessor.initializeRegion(CreateRegionProcessor.java:121)
> at 
> org.apache.geode.internal.cache.DistributedRegion.getInitialImageAndRecovery(DistributedRegion.java:1164)
> at 
> org.apache.geode.internal.cache.DistributedRegion.initialize(DistributedRegion.java:1095)
> at 
> org.apache.geode.internal.cache.GemFireCacheImpl.createVMRegion(GemFireCacheImpl.java:3108)
> at 
> org.apache.geode.internal.cache.InternalRegionFactory.create(InternalRegionFactory.java:78)
> at 
> org.apache.geode.management.internal.FederatingManager.addMemberArtifacts(FederatingManager.java:429)
> ... 5 more
> at org.junit.Assert.fail(Assert.java:89)
> at 
> org.apache.geode.test.dunit.internal.DUnitLauncher.closeAndCheckForSuspects(DUnitLauncher.java:420)
> at 
> org.apache.geode.test.dunit.internal.DUnitLauncher.closeAndCheckForSuspects(DUnitLauncher.java:436)
> at 
> org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.cleanupAllVms(JUnit4DistributedTestCase.java:551)
> at 
> org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.doTearDownDistributedTestCase(JUnit4DistributedTestCase.java:498)
> at 
> org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.tearDownDistributedTestCase(JUnit4DistributedTestCase.java:481)
> at jdk.internal.reflect.GeneratedMethodAccessor11.invoke(Unknown 
> Source)
> at 
> jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:566)
> at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
> at 
> org.junit.i

[jira] [Updated] (GEODE-9738) CI failure: RollingUpgradeRollServersOnReplicatedRegion_dataserializable failed with DistributedSystemDisconnectedException

2021-11-10 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-9738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham updated GEODE-9738:

Attachment: (was: vm3.log)

> CI failure: RollingUpgradeRollServersOnReplicatedRegion_dataserializable 
> failed with DistributedSystemDisconnectedException
> ---
>
> Key: GEODE-9738
> URL: https://issues.apache.org/jira/browse/GEODE-9738
> Project: Geode
>  Issue Type: Bug
>  Components: membership, messaging
>Affects Versions: 1.15.0
>Reporter: Kamilla Aslami
>Assignee: Bill Burcham
>Priority: Major
>  Labels: needsTriage
>
> {noformat}
> RollingUpgradeRollServersOnReplicatedRegion_dataserializable > 
> testRollServersOnReplicatedRegion_dataserializable[from_v1.13.4] FAILED
> java.lang.AssertionError: Suspicious strings were written to the log 
> during this run.
> Fix the strings or use IgnoredException.addIgnoredException to ignore.
> ---
> Found suspect string in 'dunit_suspect-vm2.log' at line 685[fatal 
> 2021/10/14 00:24:14.739 UTC  tid=115] Uncaught exception 
> in thread Thread[FederatingManager6,5,RMI Runtime]
> org.apache.geode.management.ManagementException: 
> org.apache.geode.distributed.DistributedSystemDisconnectedException: 
> Distribution manager on 
> heavy-lifter-10ae5f9d-2528-5e02-b707-d968eb54d50a(vm2:580278:locator):54751
>  started at Thu Oct 14 00:23:52 UTC 2021: Message distribution has terminated
> at 
> org.apache.geode.management.internal.FederatingManager.addMemberArtifacts(FederatingManager.java:486)
> at 
> org.apache.geode.management.internal.FederatingManager$AddMemberTask.call(FederatingManager.java:596)
> at 
> org.apache.geode.management.internal.FederatingManager.lambda$addMember$1(FederatingManager.java:199)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: 
> org.apache.geode.distributed.DistributedSystemDisconnectedException: 
> Distribution manager on 
> heavy-lifter-10ae5f9d-2528-5e02-b707-d968eb54d50a(vm2:580278:locator):54751
>  started at Thu Oct 14 00:23:52 UTC 2021: Message distribution has terminated
> at 
> org.apache.geode.distributed.internal.ClusterDistributionManager$Stopper.generateCancelledException(ClusterDistributionManager.java:2885)
> at 
> org.apache.geode.distributed.internal.InternalDistributedSystem$Stopper.generateCancelledException(InternalDistributedSystem.java:1177)
> at 
> org.apache.geode.internal.cache.GemFireCacheImpl$Stopper.generateCancelledException(GemFireCacheImpl.java:5212)
> at 
> org.apache.geode.CancelCriterion.checkCancelInProgress(CancelCriterion.java:83)
> at 
> org.apache.geode.internal.cache.CreateRegionProcessor.initializeRegion(CreateRegionProcessor.java:121)
> at 
> org.apache.geode.internal.cache.DistributedRegion.getInitialImageAndRecovery(DistributedRegion.java:1164)
> at 
> org.apache.geode.internal.cache.DistributedRegion.initialize(DistributedRegion.java:1095)
> at 
> org.apache.geode.internal.cache.GemFireCacheImpl.createVMRegion(GemFireCacheImpl.java:3108)
> at 
> org.apache.geode.internal.cache.InternalRegionFactory.create(InternalRegionFactory.java:78)
> at 
> org.apache.geode.management.internal.FederatingManager.addMemberArtifacts(FederatingManager.java:429)
> ... 5 more
> at org.junit.Assert.fail(Assert.java:89)
> at 
> org.apache.geode.test.dunit.internal.DUnitLauncher.closeAndCheckForSuspects(DUnitLauncher.java:420)
> at 
> org.apache.geode.test.dunit.internal.DUnitLauncher.closeAndCheckForSuspects(DUnitLauncher.java:436)
> at 
> org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.cleanupAllVms(JUnit4DistributedTestCase.java:551)
> at 
> org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.doTearDownDistributedTestCase(JUnit4DistributedTestCase.java:498)
> at 
> org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.tearDownDistributedTestCase(JUnit4DistributedTestCase.java:481)
> at jdk.internal.reflect.GeneratedMethodAccessor11.invoke(Unknown 
> Source)
> at 
> jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:566)
> at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
> at 
> org.junit.internal

[jira] [Updated] (GEODE-9738) CI failure: RollingUpgradeRollServersOnReplicatedRegion_dataserializable failed with DistributedSystemDisconnectedException

2021-11-10 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-9738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham updated GEODE-9738:

Attachment: (was: vm2.log)

> CI failure: RollingUpgradeRollServersOnReplicatedRegion_dataserializable 
> failed with DistributedSystemDisconnectedException
> ---
>
> Key: GEODE-9738
> URL: https://issues.apache.org/jira/browse/GEODE-9738
> Project: Geode
>  Issue Type: Bug
>  Components: membership, messaging
>Affects Versions: 1.15.0
>Reporter: Kamilla Aslami
>Assignee: Bill Burcham
>Priority: Major
>  Labels: needsTriage
>
> {noformat}
> RollingUpgradeRollServersOnReplicatedRegion_dataserializable > 
> testRollServersOnReplicatedRegion_dataserializable[from_v1.13.4] FAILED
> java.lang.AssertionError: Suspicious strings were written to the log 
> during this run.
> Fix the strings or use IgnoredException.addIgnoredException to ignore.
> ---
> Found suspect string in 'dunit_suspect-vm2.log' at line 685[fatal 
> 2021/10/14 00:24:14.739 UTC  tid=115] Uncaught exception 
> in thread Thread[FederatingManager6,5,RMI Runtime]
> org.apache.geode.management.ManagementException: 
> org.apache.geode.distributed.DistributedSystemDisconnectedException: 
> Distribution manager on 
> heavy-lifter-10ae5f9d-2528-5e02-b707-d968eb54d50a(vm2:580278:locator):54751
>  started at Thu Oct 14 00:23:52 UTC 2021: Message distribution has terminated
> at 
> org.apache.geode.management.internal.FederatingManager.addMemberArtifacts(FederatingManager.java:486)
> at 
> org.apache.geode.management.internal.FederatingManager$AddMemberTask.call(FederatingManager.java:596)
> at 
> org.apache.geode.management.internal.FederatingManager.lambda$addMember$1(FederatingManager.java:199)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: 
> org.apache.geode.distributed.DistributedSystemDisconnectedException: 
> Distribution manager on 
> heavy-lifter-10ae5f9d-2528-5e02-b707-d968eb54d50a(vm2:580278:locator):54751
>  started at Thu Oct 14 00:23:52 UTC 2021: Message distribution has terminated
> at 
> org.apache.geode.distributed.internal.ClusterDistributionManager$Stopper.generateCancelledException(ClusterDistributionManager.java:2885)
> at 
> org.apache.geode.distributed.internal.InternalDistributedSystem$Stopper.generateCancelledException(InternalDistributedSystem.java:1177)
> at 
> org.apache.geode.internal.cache.GemFireCacheImpl$Stopper.generateCancelledException(GemFireCacheImpl.java:5212)
> at 
> org.apache.geode.CancelCriterion.checkCancelInProgress(CancelCriterion.java:83)
> at 
> org.apache.geode.internal.cache.CreateRegionProcessor.initializeRegion(CreateRegionProcessor.java:121)
> at 
> org.apache.geode.internal.cache.DistributedRegion.getInitialImageAndRecovery(DistributedRegion.java:1164)
> at 
> org.apache.geode.internal.cache.DistributedRegion.initialize(DistributedRegion.java:1095)
> at 
> org.apache.geode.internal.cache.GemFireCacheImpl.createVMRegion(GemFireCacheImpl.java:3108)
> at 
> org.apache.geode.internal.cache.InternalRegionFactory.create(InternalRegionFactory.java:78)
> at 
> org.apache.geode.management.internal.FederatingManager.addMemberArtifacts(FederatingManager.java:429)
> ... 5 more
> at org.junit.Assert.fail(Assert.java:89)
> at 
> org.apache.geode.test.dunit.internal.DUnitLauncher.closeAndCheckForSuspects(DUnitLauncher.java:420)
> at 
> org.apache.geode.test.dunit.internal.DUnitLauncher.closeAndCheckForSuspects(DUnitLauncher.java:436)
> at 
> org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.cleanupAllVms(JUnit4DistributedTestCase.java:551)
> at 
> org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.doTearDownDistributedTestCase(JUnit4DistributedTestCase.java:498)
> at 
> org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.tearDownDistributedTestCase(JUnit4DistributedTestCase.java:481)
> at jdk.internal.reflect.GeneratedMethodAccessor11.invoke(Unknown 
> Source)
> at 
> jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:566)
> at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
> at 
> org.junit.internal

[jira] [Updated] (GEODE-9738) CI failure: RollingUpgradeRollServersOnReplicatedRegion_dataserializable failed with DistributedSystemDisconnectedException

2021-11-10 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-9738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham updated GEODE-9738:

Attachment: (was: locator.log)

> CI failure: RollingUpgradeRollServersOnReplicatedRegion_dataserializable 
> failed with DistributedSystemDisconnectedException
> ---
>
> Key: GEODE-9738
> URL: https://issues.apache.org/jira/browse/GEODE-9738
> Project: Geode
>  Issue Type: Bug
>  Components: membership, messaging
>Affects Versions: 1.15.0
>Reporter: Kamilla Aslami
>Assignee: Bill Burcham
>Priority: Major
>  Labels: needsTriage
>
> {noformat}
> RollingUpgradeRollServersOnReplicatedRegion_dataserializable > 
> testRollServersOnReplicatedRegion_dataserializable[from_v1.13.4] FAILED
> java.lang.AssertionError: Suspicious strings were written to the log 
> during this run.
> Fix the strings or use IgnoredException.addIgnoredException to ignore.
> ---
> Found suspect string in 'dunit_suspect-vm2.log' at line 685[fatal 
> 2021/10/14 00:24:14.739 UTC  tid=115] Uncaught exception 
> in thread Thread[FederatingManager6,5,RMI Runtime]
> org.apache.geode.management.ManagementException: 
> org.apache.geode.distributed.DistributedSystemDisconnectedException: 
> Distribution manager on 
> heavy-lifter-10ae5f9d-2528-5e02-b707-d968eb54d50a(vm2:580278:locator):54751
>  started at Thu Oct 14 00:23:52 UTC 2021: Message distribution has terminated
> at 
> org.apache.geode.management.internal.FederatingManager.addMemberArtifacts(FederatingManager.java:486)
> at 
> org.apache.geode.management.internal.FederatingManager$AddMemberTask.call(FederatingManager.java:596)
> at 
> org.apache.geode.management.internal.FederatingManager.lambda$addMember$1(FederatingManager.java:199)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: 
> org.apache.geode.distributed.DistributedSystemDisconnectedException: 
> Distribution manager on 
> heavy-lifter-10ae5f9d-2528-5e02-b707-d968eb54d50a(vm2:580278:locator):54751
>  started at Thu Oct 14 00:23:52 UTC 2021: Message distribution has terminated
> at 
> org.apache.geode.distributed.internal.ClusterDistributionManager$Stopper.generateCancelledException(ClusterDistributionManager.java:2885)
> at 
> org.apache.geode.distributed.internal.InternalDistributedSystem$Stopper.generateCancelledException(InternalDistributedSystem.java:1177)
> at 
> org.apache.geode.internal.cache.GemFireCacheImpl$Stopper.generateCancelledException(GemFireCacheImpl.java:5212)
> at 
> org.apache.geode.CancelCriterion.checkCancelInProgress(CancelCriterion.java:83)
> at 
> org.apache.geode.internal.cache.CreateRegionProcessor.initializeRegion(CreateRegionProcessor.java:121)
> at 
> org.apache.geode.internal.cache.DistributedRegion.getInitialImageAndRecovery(DistributedRegion.java:1164)
> at 
> org.apache.geode.internal.cache.DistributedRegion.initialize(DistributedRegion.java:1095)
> at 
> org.apache.geode.internal.cache.GemFireCacheImpl.createVMRegion(GemFireCacheImpl.java:3108)
> at 
> org.apache.geode.internal.cache.InternalRegionFactory.create(InternalRegionFactory.java:78)
> at 
> org.apache.geode.management.internal.FederatingManager.addMemberArtifacts(FederatingManager.java:429)
> ... 5 more
> at org.junit.Assert.fail(Assert.java:89)
> at 
> org.apache.geode.test.dunit.internal.DUnitLauncher.closeAndCheckForSuspects(DUnitLauncher.java:420)
> at 
> org.apache.geode.test.dunit.internal.DUnitLauncher.closeAndCheckForSuspects(DUnitLauncher.java:436)
> at 
> org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.cleanupAllVms(JUnit4DistributedTestCase.java:551)
> at 
> org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.doTearDownDistributedTestCase(JUnit4DistributedTestCase.java:498)
> at 
> org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.tearDownDistributedTestCase(JUnit4DistributedTestCase.java:481)
> at jdk.internal.reflect.GeneratedMethodAccessor11.invoke(Unknown 
> Source)
> at 
> jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:566)
> at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
> at 
> org.junit.inte

[jira] [Updated] (GEODE-9738) CI failure: RollingUpgradeRollServersOnReplicatedRegion_dataserializable failed with DistributedSystemDisconnectedException

2021-11-10 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-9738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham updated GEODE-9738:

Attachment: (was: vm1.log)

> CI failure: RollingUpgradeRollServersOnReplicatedRegion_dataserializable 
> failed with DistributedSystemDisconnectedException
> ---
>
> Key: GEODE-9738
> URL: https://issues.apache.org/jira/browse/GEODE-9738
> Project: Geode
>  Issue Type: Bug
>  Components: membership, messaging
>Affects Versions: 1.15.0
>Reporter: Kamilla Aslami
>Assignee: Bill Burcham
>Priority: Major
>  Labels: needsTriage
>
> {noformat}
> RollingUpgradeRollServersOnReplicatedRegion_dataserializable > 
> testRollServersOnReplicatedRegion_dataserializable[from_v1.13.4] FAILED
> java.lang.AssertionError: Suspicious strings were written to the log 
> during this run.
> Fix the strings or use IgnoredException.addIgnoredException to ignore.
> ---
> Found suspect string in 'dunit_suspect-vm2.log' at line 685[fatal 
> 2021/10/14 00:24:14.739 UTC  tid=115] Uncaught exception 
> in thread Thread[FederatingManager6,5,RMI Runtime]
> org.apache.geode.management.ManagementException: 
> org.apache.geode.distributed.DistributedSystemDisconnectedException: 
> Distribution manager on 
> heavy-lifter-10ae5f9d-2528-5e02-b707-d968eb54d50a(vm2:580278:locator):54751
>  started at Thu Oct 14 00:23:52 UTC 2021: Message distribution has terminated
> at 
> org.apache.geode.management.internal.FederatingManager.addMemberArtifacts(FederatingManager.java:486)
> at 
> org.apache.geode.management.internal.FederatingManager$AddMemberTask.call(FederatingManager.java:596)
> at 
> org.apache.geode.management.internal.FederatingManager.lambda$addMember$1(FederatingManager.java:199)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: 
> org.apache.geode.distributed.DistributedSystemDisconnectedException: 
> Distribution manager on 
> heavy-lifter-10ae5f9d-2528-5e02-b707-d968eb54d50a(vm2:580278:locator):54751
>  started at Thu Oct 14 00:23:52 UTC 2021: Message distribution has terminated
> at 
> org.apache.geode.distributed.internal.ClusterDistributionManager$Stopper.generateCancelledException(ClusterDistributionManager.java:2885)
> at 
> org.apache.geode.distributed.internal.InternalDistributedSystem$Stopper.generateCancelledException(InternalDistributedSystem.java:1177)
> at 
> org.apache.geode.internal.cache.GemFireCacheImpl$Stopper.generateCancelledException(GemFireCacheImpl.java:5212)
> at 
> org.apache.geode.CancelCriterion.checkCancelInProgress(CancelCriterion.java:83)
> at 
> org.apache.geode.internal.cache.CreateRegionProcessor.initializeRegion(CreateRegionProcessor.java:121)
> at 
> org.apache.geode.internal.cache.DistributedRegion.getInitialImageAndRecovery(DistributedRegion.java:1164)
> at 
> org.apache.geode.internal.cache.DistributedRegion.initialize(DistributedRegion.java:1095)
> at 
> org.apache.geode.internal.cache.GemFireCacheImpl.createVMRegion(GemFireCacheImpl.java:3108)
> at 
> org.apache.geode.internal.cache.InternalRegionFactory.create(InternalRegionFactory.java:78)
> at 
> org.apache.geode.management.internal.FederatingManager.addMemberArtifacts(FederatingManager.java:429)
> ... 5 more
> at org.junit.Assert.fail(Assert.java:89)
> at 
> org.apache.geode.test.dunit.internal.DUnitLauncher.closeAndCheckForSuspects(DUnitLauncher.java:420)
> at 
> org.apache.geode.test.dunit.internal.DUnitLauncher.closeAndCheckForSuspects(DUnitLauncher.java:436)
> at 
> org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.cleanupAllVms(JUnit4DistributedTestCase.java:551)
> at 
> org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.doTearDownDistributedTestCase(JUnit4DistributedTestCase.java:498)
> at 
> org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.tearDownDistributedTestCase(JUnit4DistributedTestCase.java:481)
> at jdk.internal.reflect.GeneratedMethodAccessor11.invoke(Unknown 
> Source)
> at 
> jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:566)
> at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
> at 
> org.junit.internal

[jira] [Updated] (GEODE-9738) CI failure: RollingUpgradeRollServersOnReplicatedRegion_dataserializable failed with DistributedSystemDisconnectedException

2021-11-10 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-9738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham updated GEODE-9738:

Attachment: (was: vm0.log)

> CI failure: RollingUpgradeRollServersOnReplicatedRegion_dataserializable 
> failed with DistributedSystemDisconnectedException
> ---
>
> Key: GEODE-9738
> URL: https://issues.apache.org/jira/browse/GEODE-9738
> Project: Geode
>  Issue Type: Bug
>  Components: membership, messaging
>Affects Versions: 1.15.0
>Reporter: Kamilla Aslami
>Assignee: Bill Burcham
>Priority: Major
>  Labels: needsTriage
>
> {noformat}
> RollingUpgradeRollServersOnReplicatedRegion_dataserializable > 
> testRollServersOnReplicatedRegion_dataserializable[from_v1.13.4] FAILED
> java.lang.AssertionError: Suspicious strings were written to the log 
> during this run.
> Fix the strings or use IgnoredException.addIgnoredException to ignore.
> ---
> Found suspect string in 'dunit_suspect-vm2.log' at line 685[fatal 
> 2021/10/14 00:24:14.739 UTC  tid=115] Uncaught exception 
> in thread Thread[FederatingManager6,5,RMI Runtime]
> org.apache.geode.management.ManagementException: 
> org.apache.geode.distributed.DistributedSystemDisconnectedException: 
> Distribution manager on 
> heavy-lifter-10ae5f9d-2528-5e02-b707-d968eb54d50a(vm2:580278:locator):54751
>  started at Thu Oct 14 00:23:52 UTC 2021: Message distribution has terminated
> at 
> org.apache.geode.management.internal.FederatingManager.addMemberArtifacts(FederatingManager.java:486)
> at 
> org.apache.geode.management.internal.FederatingManager$AddMemberTask.call(FederatingManager.java:596)
> at 
> org.apache.geode.management.internal.FederatingManager.lambda$addMember$1(FederatingManager.java:199)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: 
> org.apache.geode.distributed.DistributedSystemDisconnectedException: 
> Distribution manager on 
> heavy-lifter-10ae5f9d-2528-5e02-b707-d968eb54d50a(vm2:580278:locator):54751
>  started at Thu Oct 14 00:23:52 UTC 2021: Message distribution has terminated
> at 
> org.apache.geode.distributed.internal.ClusterDistributionManager$Stopper.generateCancelledException(ClusterDistributionManager.java:2885)
> at 
> org.apache.geode.distributed.internal.InternalDistributedSystem$Stopper.generateCancelledException(InternalDistributedSystem.java:1177)
> at 
> org.apache.geode.internal.cache.GemFireCacheImpl$Stopper.generateCancelledException(GemFireCacheImpl.java:5212)
> at 
> org.apache.geode.CancelCriterion.checkCancelInProgress(CancelCriterion.java:83)
> at 
> org.apache.geode.internal.cache.CreateRegionProcessor.initializeRegion(CreateRegionProcessor.java:121)
> at 
> org.apache.geode.internal.cache.DistributedRegion.getInitialImageAndRecovery(DistributedRegion.java:1164)
> at 
> org.apache.geode.internal.cache.DistributedRegion.initialize(DistributedRegion.java:1095)
> at 
> org.apache.geode.internal.cache.GemFireCacheImpl.createVMRegion(GemFireCacheImpl.java:3108)
> at 
> org.apache.geode.internal.cache.InternalRegionFactory.create(InternalRegionFactory.java:78)
> at 
> org.apache.geode.management.internal.FederatingManager.addMemberArtifacts(FederatingManager.java:429)
> ... 5 more
> at org.junit.Assert.fail(Assert.java:89)
> at 
> org.apache.geode.test.dunit.internal.DUnitLauncher.closeAndCheckForSuspects(DUnitLauncher.java:420)
> at 
> org.apache.geode.test.dunit.internal.DUnitLauncher.closeAndCheckForSuspects(DUnitLauncher.java:436)
> at 
> org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.cleanupAllVms(JUnit4DistributedTestCase.java:551)
> at 
> org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.doTearDownDistributedTestCase(JUnit4DistributedTestCase.java:498)
> at 
> org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.tearDownDistributedTestCase(JUnit4DistributedTestCase.java:481)
> at jdk.internal.reflect.GeneratedMethodAccessor11.invoke(Unknown 
> Source)
> at 
> jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:566)
> at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
> at 
> org.junit.internal

[jira] [Comment Edited] (GEODE-9738) CI failure: RollingUpgradeRollServersOnReplicatedRegion_dataserializable failed with DistributedSystemDisconnectedException

2021-11-10 Thread Bill Burcham (Jira)


[ 
https://issues.apache.org/jira/browse/GEODE-9738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17441878#comment-17441878
 ] 

Bill Burcham edited comment on GEODE-9738 at 11/10/21, 5:59 PM:


The logs in the failing test run (previous comment) are all interleaved in the 
"standard output" section of the failing test. I have attached the individual 
logs to the ticket, so we can analyze them.


was (Author: bburcham):
The logs in the failing test run (previous comment) are all interleaved in the 
"standard output" section of the failing test. I have attached the separated 
logs to the ticket, so we can analyze them.

> CI failure: RollingUpgradeRollServersOnReplicatedRegion_dataserializable 
> failed with DistributedSystemDisconnectedException
> ---
>
> Key: GEODE-9738
> URL: https://issues.apache.org/jira/browse/GEODE-9738
> Project: Geode
>  Issue Type: Bug
>  Components: membership, messaging
>Affects Versions: 1.15.0
>Reporter: Kamilla Aslami
>Assignee: Bill Burcham
>Priority: Major
>  Labels: needsTriage
> Attachments: controller.log, locator.log, vm0.log, vm1.log, vm2.log, 
> vm3.log
>
>
> {noformat}
> RollingUpgradeRollServersOnReplicatedRegion_dataserializable > 
> testRollServersOnReplicatedRegion_dataserializable[from_v1.13.4] FAILED
> java.lang.AssertionError: Suspicious strings were written to the log 
> during this run.
> Fix the strings or use IgnoredException.addIgnoredException to ignore.
> ---
> Found suspect string in 'dunit_suspect-vm2.log' at line 685[fatal 
> 2021/10/14 00:24:14.739 UTC  tid=115] Uncaught exception 
> in thread Thread[FederatingManager6,5,RMI Runtime]
> org.apache.geode.management.ManagementException: 
> org.apache.geode.distributed.DistributedSystemDisconnectedException: 
> Distribution manager on 
> heavy-lifter-10ae5f9d-2528-5e02-b707-d968eb54d50a(vm2:580278:locator):54751
>  started at Thu Oct 14 00:23:52 UTC 2021: Message distribution has terminated
> at 
> org.apache.geode.management.internal.FederatingManager.addMemberArtifacts(FederatingManager.java:486)
> at 
> org.apache.geode.management.internal.FederatingManager$AddMemberTask.call(FederatingManager.java:596)
> at 
> org.apache.geode.management.internal.FederatingManager.lambda$addMember$1(FederatingManager.java:199)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: 
> org.apache.geode.distributed.DistributedSystemDisconnectedException: 
> Distribution manager on 
> heavy-lifter-10ae5f9d-2528-5e02-b707-d968eb54d50a(vm2:580278:locator):54751
>  started at Thu Oct 14 00:23:52 UTC 2021: Message distribution has terminated
> at 
> org.apache.geode.distributed.internal.ClusterDistributionManager$Stopper.generateCancelledException(ClusterDistributionManager.java:2885)
> at 
> org.apache.geode.distributed.internal.InternalDistributedSystem$Stopper.generateCancelledException(InternalDistributedSystem.java:1177)
> at 
> org.apache.geode.internal.cache.GemFireCacheImpl$Stopper.generateCancelledException(GemFireCacheImpl.java:5212)
> at 
> org.apache.geode.CancelCriterion.checkCancelInProgress(CancelCriterion.java:83)
> at 
> org.apache.geode.internal.cache.CreateRegionProcessor.initializeRegion(CreateRegionProcessor.java:121)
> at 
> org.apache.geode.internal.cache.DistributedRegion.getInitialImageAndRecovery(DistributedRegion.java:1164)
> at 
> org.apache.geode.internal.cache.DistributedRegion.initialize(DistributedRegion.java:1095)
> at 
> org.apache.geode.internal.cache.GemFireCacheImpl.createVMRegion(GemFireCacheImpl.java:3108)
> at 
> org.apache.geode.internal.cache.InternalRegionFactory.create(InternalRegionFactory.java:78)
> at 
> org.apache.geode.management.internal.FederatingManager.addMemberArtifacts(FederatingManager.java:429)
> ... 5 more
> at org.junit.Assert.fail(Assert.java:89)
> at 
> org.apache.geode.test.dunit.internal.DUnitLauncher.closeAndCheckForSuspects(DUnitLauncher.java:420)
> at 
> org.apache.geode.test.dunit.internal.DUnitLauncher.closeAndCheckForSuspects(DUnitLauncher.java:436)
> at 
> org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.cleanupAllVms(JUnit4DistributedTestCase.java:551)
> at 
> org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.doTe

[jira] [Commented] (GEODE-9738) CI failure: RollingUpgradeRollServersOnReplicatedRegion_dataserializable failed with DistributedSystemDisconnectedException

2021-11-10 Thread Bill Burcham (Jira)


[ 
https://issues.apache.org/jira/browse/GEODE-9738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17441878#comment-17441878
 ] 

Bill Burcham commented on GEODE-9738:
-

The logs in the failing test run (previous comment) are all interleaved in the 
"standard output" section of the failing test. I have attached the separated 
logs to the ticket, so we can analyze them.

> CI failure: RollingUpgradeRollServersOnReplicatedRegion_dataserializable 
> failed with DistributedSystemDisconnectedException
> ---
>
> Key: GEODE-9738
> URL: https://issues.apache.org/jira/browse/GEODE-9738
> Project: Geode
>  Issue Type: Bug
>  Components: membership, messaging
>Affects Versions: 1.15.0
>Reporter: Kamilla Aslami
>Assignee: Bill Burcham
>Priority: Major
>  Labels: needsTriage
> Attachments: controller.log, locator.log, vm0.log, vm1.log, vm2.log, 
> vm3.log
>
>
> {noformat}
> RollingUpgradeRollServersOnReplicatedRegion_dataserializable > 
> testRollServersOnReplicatedRegion_dataserializable[from_v1.13.4] FAILED
> java.lang.AssertionError: Suspicious strings were written to the log 
> during this run.
> Fix the strings or use IgnoredException.addIgnoredException to ignore.
> ---
> Found suspect string in 'dunit_suspect-vm2.log' at line 685[fatal 
> 2021/10/14 00:24:14.739 UTC  tid=115] Uncaught exception 
> in thread Thread[FederatingManager6,5,RMI Runtime]
> org.apache.geode.management.ManagementException: 
> org.apache.geode.distributed.DistributedSystemDisconnectedException: 
> Distribution manager on 
> heavy-lifter-10ae5f9d-2528-5e02-b707-d968eb54d50a(vm2:580278:locator):54751
>  started at Thu Oct 14 00:23:52 UTC 2021: Message distribution has terminated
> at 
> org.apache.geode.management.internal.FederatingManager.addMemberArtifacts(FederatingManager.java:486)
> at 
> org.apache.geode.management.internal.FederatingManager$AddMemberTask.call(FederatingManager.java:596)
> at 
> org.apache.geode.management.internal.FederatingManager.lambda$addMember$1(FederatingManager.java:199)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: 
> org.apache.geode.distributed.DistributedSystemDisconnectedException: 
> Distribution manager on 
> heavy-lifter-10ae5f9d-2528-5e02-b707-d968eb54d50a(vm2:580278:locator):54751
>  started at Thu Oct 14 00:23:52 UTC 2021: Message distribution has terminated
> at 
> org.apache.geode.distributed.internal.ClusterDistributionManager$Stopper.generateCancelledException(ClusterDistributionManager.java:2885)
> at 
> org.apache.geode.distributed.internal.InternalDistributedSystem$Stopper.generateCancelledException(InternalDistributedSystem.java:1177)
> at 
> org.apache.geode.internal.cache.GemFireCacheImpl$Stopper.generateCancelledException(GemFireCacheImpl.java:5212)
> at 
> org.apache.geode.CancelCriterion.checkCancelInProgress(CancelCriterion.java:83)
> at 
> org.apache.geode.internal.cache.CreateRegionProcessor.initializeRegion(CreateRegionProcessor.java:121)
> at 
> org.apache.geode.internal.cache.DistributedRegion.getInitialImageAndRecovery(DistributedRegion.java:1164)
> at 
> org.apache.geode.internal.cache.DistributedRegion.initialize(DistributedRegion.java:1095)
> at 
> org.apache.geode.internal.cache.GemFireCacheImpl.createVMRegion(GemFireCacheImpl.java:3108)
> at 
> org.apache.geode.internal.cache.InternalRegionFactory.create(InternalRegionFactory.java:78)
> at 
> org.apache.geode.management.internal.FederatingManager.addMemberArtifacts(FederatingManager.java:429)
> ... 5 more
> at org.junit.Assert.fail(Assert.java:89)
> at 
> org.apache.geode.test.dunit.internal.DUnitLauncher.closeAndCheckForSuspects(DUnitLauncher.java:420)
> at 
> org.apache.geode.test.dunit.internal.DUnitLauncher.closeAndCheckForSuspects(DUnitLauncher.java:436)
> at 
> org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.cleanupAllVms(JUnit4DistributedTestCase.java:551)
> at 
> org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.doTearDownDistributedTestCase(JUnit4DistributedTestCase.java:498)
> at 
> org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.tearDownDistributedTestCase(JUnit4DistributedTestCase.java:481)
> at jdk.internal.reflect.GeneratedMethodAccessor11.invoke(Unknown 
>

[jira] [Updated] (GEODE-9738) CI failure: RollingUpgradeRollServersOnReplicatedRegion_dataserializable failed with DistributedSystemDisconnectedException

2021-11-10 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-9738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham updated GEODE-9738:

Attachment: controller.log
locator.log
vm0.log
vm1.log
vm2.log
vm3.log

> CI failure: RollingUpgradeRollServersOnReplicatedRegion_dataserializable 
> failed with DistributedSystemDisconnectedException
> ---
>
> Key: GEODE-9738
> URL: https://issues.apache.org/jira/browse/GEODE-9738
> Project: Geode
>  Issue Type: Bug
>  Components: membership, messaging
>Affects Versions: 1.15.0
>Reporter: Kamilla Aslami
>Assignee: Bill Burcham
>Priority: Major
>  Labels: needsTriage
> Attachments: controller.log, locator.log, vm0.log, vm1.log, vm2.log, 
> vm3.log
>
>
> {noformat}
> RollingUpgradeRollServersOnReplicatedRegion_dataserializable > 
> testRollServersOnReplicatedRegion_dataserializable[from_v1.13.4] FAILED
> java.lang.AssertionError: Suspicious strings were written to the log 
> during this run.
> Fix the strings or use IgnoredException.addIgnoredException to ignore.
> ---
> Found suspect string in 'dunit_suspect-vm2.log' at line 685[fatal 
> 2021/10/14 00:24:14.739 UTC  tid=115] Uncaught exception 
> in thread Thread[FederatingManager6,5,RMI Runtime]
> org.apache.geode.management.ManagementException: 
> org.apache.geode.distributed.DistributedSystemDisconnectedException: 
> Distribution manager on 
> heavy-lifter-10ae5f9d-2528-5e02-b707-d968eb54d50a(vm2:580278:locator):54751
>  started at Thu Oct 14 00:23:52 UTC 2021: Message distribution has terminated
> at 
> org.apache.geode.management.internal.FederatingManager.addMemberArtifacts(FederatingManager.java:486)
> at 
> org.apache.geode.management.internal.FederatingManager$AddMemberTask.call(FederatingManager.java:596)
> at 
> org.apache.geode.management.internal.FederatingManager.lambda$addMember$1(FederatingManager.java:199)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: 
> org.apache.geode.distributed.DistributedSystemDisconnectedException: 
> Distribution manager on 
> heavy-lifter-10ae5f9d-2528-5e02-b707-d968eb54d50a(vm2:580278:locator):54751
>  started at Thu Oct 14 00:23:52 UTC 2021: Message distribution has terminated
> at 
> org.apache.geode.distributed.internal.ClusterDistributionManager$Stopper.generateCancelledException(ClusterDistributionManager.java:2885)
> at 
> org.apache.geode.distributed.internal.InternalDistributedSystem$Stopper.generateCancelledException(InternalDistributedSystem.java:1177)
> at 
> org.apache.geode.internal.cache.GemFireCacheImpl$Stopper.generateCancelledException(GemFireCacheImpl.java:5212)
> at 
> org.apache.geode.CancelCriterion.checkCancelInProgress(CancelCriterion.java:83)
> at 
> org.apache.geode.internal.cache.CreateRegionProcessor.initializeRegion(CreateRegionProcessor.java:121)
> at 
> org.apache.geode.internal.cache.DistributedRegion.getInitialImageAndRecovery(DistributedRegion.java:1164)
> at 
> org.apache.geode.internal.cache.DistributedRegion.initialize(DistributedRegion.java:1095)
> at 
> org.apache.geode.internal.cache.GemFireCacheImpl.createVMRegion(GemFireCacheImpl.java:3108)
> at 
> org.apache.geode.internal.cache.InternalRegionFactory.create(InternalRegionFactory.java:78)
> at 
> org.apache.geode.management.internal.FederatingManager.addMemberArtifacts(FederatingManager.java:429)
> ... 5 more
> at org.junit.Assert.fail(Assert.java:89)
> at 
> org.apache.geode.test.dunit.internal.DUnitLauncher.closeAndCheckForSuspects(DUnitLauncher.java:420)
> at 
> org.apache.geode.test.dunit.internal.DUnitLauncher.closeAndCheckForSuspects(DUnitLauncher.java:436)
> at 
> org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.cleanupAllVms(JUnit4DistributedTestCase.java:551)
> at 
> org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.doTearDownDistributedTestCase(JUnit4DistributedTestCase.java:498)
> at 
> org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.tearDownDistributedTestCase(JUnit4DistributedTestCase.java:481)
> at jdk.internal.reflect.GeneratedMethodAccessor11.invoke(Unknown 
> Source)
> at 
> jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccesso

[jira] [Resolved] (GEODE-9675) CI: ClusterDistributionManagerDUnitTest > testConnectAfterBeingShunned FAILED

2021-11-09 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-9675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham resolved GEODE-9675.
-
Fix Version/s: 1.15.0
   Resolution: Fixed

Fixed this test by deleting it.

> CI: ClusterDistributionManagerDUnitTest > testConnectAfterBeingShunned FAILED
> -
>
> Key: GEODE-9675
> URL: https://issues.apache.org/jira/browse/GEODE-9675
> Project: Geode
>  Issue Type: Bug
>  Components: membership
>Affects Versions: 1.15.0
>Reporter: Xiaojian Zhou
>Assignee: Bill Burcham
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.15.0
>
> Attachments: screenshot-1.png
>
>
> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-develop-mass-test-run/jobs/distributed-test-openjdk8/builds/1983
> {code:java}
> ClusterDistributionManagerDUnitTest > testConnectAfterBeingShunned FAILED
> org.apache.geode.SystemConnectException: Problem starting up membership 
> services
> at 
> org.apache.geode.distributed.internal.DistributionImpl.start(DistributionImpl.java:186)
> at 
> org.apache.geode.distributed.internal.DistributionImpl.createDistribution(DistributionImpl.java:222)
> at 
> org.apache.geode.distributed.internal.ClusterDistributionManager.(ClusterDistributionManager.java:466)
> at 
> org.apache.geode.distributed.internal.ClusterDistributionManager.(ClusterDistributionManager.java:499)
> at 
> org.apache.geode.distributed.internal.ClusterDistributionManager.create(ClusterDistributionManager.java:328)
> at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.initialize(InternalDistributedSystem.java:757)
> at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.access$200(InternalDistributedSystem.java:133)
> at 
> org.apache.geode.distributed.internal.InternalDistributedSystem$Builder.build(InternalDistributedSystem.java:3013)
> at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.connectInternal(InternalDistributedSystem.java:283)
> at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.connectInternal(InternalDistributedSystem.java:209)
> at 
> org.apache.geode.distributed.DistributedSystem.connect(DistributedSystem.java:159)
> at 
> org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.getSystem(JUnit4DistributedTestCase.java:180)
> at 
> org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.getSystem(JUnit4DistributedTestCase.java:256)
> at 
> org.apache.geode.distributed.internal.ClusterDistributionManagerDUnitTest.testConnectAfterBeingShunned(ClusterDistributionManagerDUnitTest.java:170)
> Caused by:
> 
> org.apache.geode.distributed.internal.membership.api.MemberStartupException: 
> unable to create jgroups channel
> at 
> org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger.start(JGroupsMessenger.java:401)
> at 
> org.apache.geode.distributed.internal.membership.gms.Services.start(Services.java:203)
> at 
> org.apache.geode.distributed.internal.membership.gms.GMSMembership.start(GMSMembership.java:1642)
> at 
> org.apache.geode.distributed.internal.DistributionImpl.start(DistributionImpl.java:171)
> ... 13 more
> Caused by:
> java.lang.Exception: failed to open a port in range 41003-41003
> at 
> org.jgroups.protocols.UDP.createMulticastSocketWithBindPort(UDP.java:503)
> at org.jgroups.protocols.UDP.createSockets(UDP.java:348)
> at org.jgroups.protocols.UDP.start(UDP.java:266)
> at 
> org.jgroups.stack.ProtocolStack.startStack(ProtocolStack.java:966)
> at org.jgroups.JChannel.startStack(JChannel.java:889)
> at org.jgroups.JChannel._preConnect(JChannel.java:553)
> at org.jgroups.JChannel.connect(JChannel.java:288)
> at org.jgroups.JChannel.connect(JChannel.java:279)
> at 
> org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger.start(JGroupsMessenger.java:397)
> ... 16 more
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (GEODE-9675) CI: ClusterDistributionManagerDUnitTest > testConnectAfterBeingShunned FAILED

2021-11-09 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-9675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham reassigned GEODE-9675:
---

Assignee: Bill Burcham

> CI: ClusterDistributionManagerDUnitTest > testConnectAfterBeingShunned FAILED
> -
>
> Key: GEODE-9675
> URL: https://issues.apache.org/jira/browse/GEODE-9675
> Project: Geode
>  Issue Type: Bug
>  Components: membership
>Affects Versions: 1.15.0
>Reporter: Xiaojian Zhou
>Assignee: Bill Burcham
>Priority: Major
>  Labels: pull-request-available
> Attachments: screenshot-1.png
>
>
> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-develop-mass-test-run/jobs/distributed-test-openjdk8/builds/1983
> {code:java}
> ClusterDistributionManagerDUnitTest > testConnectAfterBeingShunned FAILED
> org.apache.geode.SystemConnectException: Problem starting up membership 
> services
> at 
> org.apache.geode.distributed.internal.DistributionImpl.start(DistributionImpl.java:186)
> at 
> org.apache.geode.distributed.internal.DistributionImpl.createDistribution(DistributionImpl.java:222)
> at 
> org.apache.geode.distributed.internal.ClusterDistributionManager.(ClusterDistributionManager.java:466)
> at 
> org.apache.geode.distributed.internal.ClusterDistributionManager.(ClusterDistributionManager.java:499)
> at 
> org.apache.geode.distributed.internal.ClusterDistributionManager.create(ClusterDistributionManager.java:328)
> at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.initialize(InternalDistributedSystem.java:757)
> at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.access$200(InternalDistributedSystem.java:133)
> at 
> org.apache.geode.distributed.internal.InternalDistributedSystem$Builder.build(InternalDistributedSystem.java:3013)
> at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.connectInternal(InternalDistributedSystem.java:283)
> at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.connectInternal(InternalDistributedSystem.java:209)
> at 
> org.apache.geode.distributed.DistributedSystem.connect(DistributedSystem.java:159)
> at 
> org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.getSystem(JUnit4DistributedTestCase.java:180)
> at 
> org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.getSystem(JUnit4DistributedTestCase.java:256)
> at 
> org.apache.geode.distributed.internal.ClusterDistributionManagerDUnitTest.testConnectAfterBeingShunned(ClusterDistributionManagerDUnitTest.java:170)
> Caused by:
> 
> org.apache.geode.distributed.internal.membership.api.MemberStartupException: 
> unable to create jgroups channel
> at 
> org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger.start(JGroupsMessenger.java:401)
> at 
> org.apache.geode.distributed.internal.membership.gms.Services.start(Services.java:203)
> at 
> org.apache.geode.distributed.internal.membership.gms.GMSMembership.start(GMSMembership.java:1642)
> at 
> org.apache.geode.distributed.internal.DistributionImpl.start(DistributionImpl.java:171)
> ... 13 more
> Caused by:
> java.lang.Exception: failed to open a port in range 41003-41003
> at 
> org.jgroups.protocols.UDP.createMulticastSocketWithBindPort(UDP.java:503)
> at org.jgroups.protocols.UDP.createSockets(UDP.java:348)
> at org.jgroups.protocols.UDP.start(UDP.java:266)
> at 
> org.jgroups.stack.ProtocolStack.startStack(ProtocolStack.java:966)
> at org.jgroups.JChannel.startStack(JChannel.java:889)
> at org.jgroups.JChannel._preConnect(JChannel.java:553)
> at org.jgroups.JChannel.connect(JChannel.java:288)
> at org.jgroups.JChannel.connect(JChannel.java:279)
> at 
> org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger.start(JGroupsMessenger.java:397)
> ... 16 more
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (GEODE-9738) CI failure: RollingUpgradeRollServersOnReplicatedRegion_dataserializable failed with DistributedSystemDisconnectedException

2021-11-09 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-9738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham reassigned GEODE-9738:
---

Assignee: Bill Burcham

> CI failure: RollingUpgradeRollServersOnReplicatedRegion_dataserializable 
> failed with DistributedSystemDisconnectedException
> ---
>
> Key: GEODE-9738
> URL: https://issues.apache.org/jira/browse/GEODE-9738
> Project: Geode
>  Issue Type: Bug
>  Components: membership, messaging
>Affects Versions: 1.15.0
>Reporter: Kamilla Aslami
>Assignee: Bill Burcham
>Priority: Major
>  Labels: needsTriage
>
> {noformat}
> RollingUpgradeRollServersOnReplicatedRegion_dataserializable > 
> testRollServersOnReplicatedRegion_dataserializable[from_v1.13.4] FAILED
> java.lang.AssertionError: Suspicious strings were written to the log 
> during this run.
> Fix the strings or use IgnoredException.addIgnoredException to ignore.
> ---
> Found suspect string in 'dunit_suspect-vm2.log' at line 685[fatal 
> 2021/10/14 00:24:14.739 UTC  tid=115] Uncaught exception 
> in thread Thread[FederatingManager6,5,RMI Runtime]
> org.apache.geode.management.ManagementException: 
> org.apache.geode.distributed.DistributedSystemDisconnectedException: 
> Distribution manager on 
> heavy-lifter-10ae5f9d-2528-5e02-b707-d968eb54d50a(vm2:580278:locator):54751
>  started at Thu Oct 14 00:23:52 UTC 2021: Message distribution has terminated
> at 
> org.apache.geode.management.internal.FederatingManager.addMemberArtifacts(FederatingManager.java:486)
> at 
> org.apache.geode.management.internal.FederatingManager$AddMemberTask.call(FederatingManager.java:596)
> at 
> org.apache.geode.management.internal.FederatingManager.lambda$addMember$1(FederatingManager.java:199)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: 
> org.apache.geode.distributed.DistributedSystemDisconnectedException: 
> Distribution manager on 
> heavy-lifter-10ae5f9d-2528-5e02-b707-d968eb54d50a(vm2:580278:locator):54751
>  started at Thu Oct 14 00:23:52 UTC 2021: Message distribution has terminated
> at 
> org.apache.geode.distributed.internal.ClusterDistributionManager$Stopper.generateCancelledException(ClusterDistributionManager.java:2885)
> at 
> org.apache.geode.distributed.internal.InternalDistributedSystem$Stopper.generateCancelledException(InternalDistributedSystem.java:1177)
> at 
> org.apache.geode.internal.cache.GemFireCacheImpl$Stopper.generateCancelledException(GemFireCacheImpl.java:5212)
> at 
> org.apache.geode.CancelCriterion.checkCancelInProgress(CancelCriterion.java:83)
> at 
> org.apache.geode.internal.cache.CreateRegionProcessor.initializeRegion(CreateRegionProcessor.java:121)
> at 
> org.apache.geode.internal.cache.DistributedRegion.getInitialImageAndRecovery(DistributedRegion.java:1164)
> at 
> org.apache.geode.internal.cache.DistributedRegion.initialize(DistributedRegion.java:1095)
> at 
> org.apache.geode.internal.cache.GemFireCacheImpl.createVMRegion(GemFireCacheImpl.java:3108)
> at 
> org.apache.geode.internal.cache.InternalRegionFactory.create(InternalRegionFactory.java:78)
> at 
> org.apache.geode.management.internal.FederatingManager.addMemberArtifacts(FederatingManager.java:429)
> ... 5 more
> at org.junit.Assert.fail(Assert.java:89)
> at 
> org.apache.geode.test.dunit.internal.DUnitLauncher.closeAndCheckForSuspects(DUnitLauncher.java:420)
> at 
> org.apache.geode.test.dunit.internal.DUnitLauncher.closeAndCheckForSuspects(DUnitLauncher.java:436)
> at 
> org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.cleanupAllVms(JUnit4DistributedTestCase.java:551)
> at 
> org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.doTearDownDistributedTestCase(JUnit4DistributedTestCase.java:498)
> at 
> org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.tearDownDistributedTestCase(JUnit4DistributedTestCase.java:481)
> at jdk.internal.reflect.GeneratedMethodAccessor11.invoke(Unknown 
> Source)
> at 
> jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:566)
> at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
> at 
> org.junit.internal.

[jira] [Commented] (GEODE-9402) Automatic Reconnect Failure: Address already in use

2021-11-08 Thread Bill Burcham (Jira)


[ 
https://issues.apache.org/jira/browse/GEODE-9402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17440828#comment-17440828
 ] 

Bill Burcham commented on GEODE-9402:
-

h2. Summary

In each of the attached logs, we see the member that logged the BindException 
eventually joining the view (in 8 and 11 seconds, respectively).

My suspicion is that what we see here is nondeterminism in the time it takes 
for a port to become available after it is unbound.

Since the members in question do re-join the cluster successfully, I don't think 
this is a bug. What do you think, [~jjramos]?
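
If the root cause really is a delay in the port becoming bindable again after it 
is released, one possible mitigation on the reconnect path would be to retry the 
bind for a bounded interval instead of failing on the first BindException. The 
following is a minimal, hypothetical Java sketch of that idea, not Geode code; 
the class, the helper name bindWithRetry, the 500 ms back-off, and the 30-second 
budget are all illustrative assumptions:

{code:java}
import java.io.IOException;
import java.net.BindException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

public final class RebindWithRetry {

  // Hypothetical helper, not Geode code: bind a ServerSocket to 'port',
  // retrying until 'maxWaitMillis' elapses, to ride out the window in which
  // a recently released port has not yet become bindable again.
  static ServerSocket bindWithRetry(int port, long maxWaitMillis)
      throws IOException, InterruptedException {
    long deadline = System.currentTimeMillis() + maxWaitMillis;
    while (true) {
      ServerSocket socket = new ServerSocket();
      socket.setReuseAddress(true); // helps with sockets lingering in TIME_WAIT
      try {
        socket.bind(new InetSocketAddress(port));
        return socket; // bound successfully; caller owns the socket
      } catch (BindException e) {
        socket.close();
        if (System.currentTimeMillis() >= deadline) {
          throw e; // the port never freed up within the allotted time
        }
        Thread.sleep(500); // brief back-off before retrying the same port
      }
    }
  }

  public static void main(String[] args) throws Exception {
    // Example: wait up to 30 seconds for port 40404 (the default cache server
    // port) to become available, mimicking a bounded retry during reconnect.
    try (ServerSocket s = bindWithRetry(40404, 30_000)) {
      System.out.println("Bound to " + s.getLocalSocketAddress());
    }
  }
}
{code}
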
h2. Detailed Analysis of cluster_logs_gke_latest_54

Looking at cluster_logs_gke_latest_54, quorum loss happens:

[Entry id=4208, date=2021/06/23 15:55:48.119 GMT, level=fatal, thread=tid=0x92, 
emitter=Geode Membership View Creator, message=Possible loss of quorum due to 
the loss of 5 cache processes: 
[gemfire-cluster-server-3(gemfire-cluster-server-3:1):41000, 
gemfire-cluster-server-1(gemfire-cluster-server-1:1):41000, 
gemfire-cluster-locator-1(gemfire-cluster-locator-1:1:locator):41000, 
gemfire-cluster-server-2(gemfire-cluster-server-2:1):41000, 
gemfire-cluster-locator-0(gemfire-cluster-locator-0:1:locator):41000]
, Host=gemfire-cluster-server-0 , 
mergedFile=/Users/bburcham/Downloads/cluster_logs_gke_latest_54/gemfire-cluster-server-0/gemfire-cluster-server-0-01-01.log]

It takes about two minutes for the network partition to be healed and for a 
coordinator to be designated. It is still to be determined how much of those two 
minutes the test spent deliberately delaying the healing of the partition, versus 
how much was spent re-forming the cluster after the partition was healed. Here's 
the coordinator thread starting:

[Entry id=4925, date=2021/06/23 15:57:57.671 GMT, level=info, thread=tid=0x87, 
emitter=ReconnectThread, message=This member is becoming the membership 
coordinator with address 
gemfire-cluster-locator-0(gemfire-cluster-locator-0:1:locator):41000
, Host=gemfire-cluster-locator-0 , 
mergedFile=/Users/bburcham/Downloads/cluster_logs_gke_latest_54/gemfire-cluster-locator-0/gemfire-cluster-locator-0.log]

That point in time corresponds to view 21 (the pre-partition view sequence 
ended at view 5):

[Entry id=4960, date=2021/06/23 15:57:58.009 GMT, level=info, thread=tid=0xad, 
emitter=Geode Membership View Creator, message=sending new view 
View[gemfire-cluster-locator-0(gemfire-cluster-locator-0:1:locator):41000|21]
 members: 
[gemfire-cluster-locator-0(gemfire-cluster-locator-0:1:locator):41000, 
gemfire-cluster-server-0(gemfire-cluster-server-0:1):41000\{lead}, 
gemfire-cluster-server-1(gemfire-cluster-server-1:1):41000, 
gemfire-cluster-server-3(gemfire-cluster-server-3:1):41000, 
gemfire-cluster-server-2(gemfire-cluster-server-2:1):41000, 
gemfire-cluster-locator-1(gemfire-cluster-locator-1:1:locator):41000]  
crashed: 
[gemfire-cluster-locator-1(gemfire-cluster-locator-1:1:locator):41000, 
gemfire-cluster-server-2(gemfire-cluster-server-2:1):41000, 
gemfire-cluster-locator-0(gemfire-cluster-locator-0:1:locator):41000]
, Host=gemfire-cluster-locator-0 , 
mergedFile=/Users/bburcham/Downloads/cluster_logs_gke_latest_54/gemfire-cluster-locator-0/gemfire-cluster-locator-0.log]

About a minute later, server-0 logs the BindException while reconnecting:

[Entry id=5536, date=2021/06/23 16:00:31.491 GMT, level=error, thread=tid=0x94, 
emitter=ReconnectThread, message=Cache initialization for GemFireCache[id = 
1795575589; isClosing = false; isShutDownAll = false; created = Wed Jun 23 
15:58:29 GMT 2021; server = false; copyOnRead = false; lockLease = 120; 
lockTimeout = 60] failed because:
org.apache.geode.GemFireIOException: While starting cache server CacheServer on 
port=40404 client subscription config policy=none client subscription config 
capacity=1 client subscription config overflow directory=.
    at 
org.apache.geode.internal.cache.xmlcache.CacheCreation.startCacheServers(CacheCreation.java:800)
    at 
org.apache.geode.internal.cache.xmlcache.CacheCreation.create(CacheCreation.java:599)
    at 
org.apache.geode.internal.cache.xmlcache.CacheXmlParser.create(CacheXmlParser.java:339)
    at 
org.apache.geode.internal.cache.GemFireCacheImpl.loadCacheXml(GemFireCacheImpl.java:4207)
    at 
org.apache.geode.internal.cache.ClusterConfigurationLoader.applyClusterXmlConfiguration(ClusterConfigurationLoader.java:199)
    at 
org.apache.geode.internal.cache.GemFireCacheImpl.applyJarAndXmlFromClusterConfig(GemFireCacheImpl.java:1497)
    at 
org.apache.geode.internal.cache.GemFireCacheImpl.initialize(GemFireCacheImpl.java:1449)
    at 
org.apache.geode.internal.cache.InternalCacheBuilder.create(InternalCacheBuilder.java:191)
    at 
org.apache.geode.distributed.internal.InternalDistributedSystem.reconnect(InternalDistributedSystem.java:2668)
    at 
org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect(InternalDistribut

[jira] [Assigned] (GEODE-9402) Automatic Reconnect Failure: Address already in use

2021-10-28 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-9402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham reassigned GEODE-9402:
---

Assignee: Bill Burcham

> Automatic Reconnect Failure: Address already in use
> ---
>
> Key: GEODE-9402
> URL: https://issues.apache.org/jira/browse/GEODE-9402
> Project: Geode
>  Issue Type: Bug
>  Components: membership
>Reporter: Juan Ramos
>Assignee: Bill Burcham
>Priority: Major
> Attachments: cluster_logs_gke_latest_54.zip, cluster_logs_pks_121.zip
>
>
> There are 2 locators and 4 servers during the test, once they're all up and 
> running the test drops the network connectivity between all members to 
> generate a full network partition and cause all members to shutdown and go 
> into reconnect mode. Upon reaching the mentioned state, the test 
> automatically restores the network connectivity and expects all members to 
> automatically go up again and re-form the distributed system.
>  This works fine most of the time, and we see every member successfully 
> reconnecting to the distributed system:
> {noformat}
> [info 2021/06/23 15:58:12.981 GMT gemfire-cluster-locator-0  
> tid=0x87] Reconnect completed.
> [info 2021/06/23 15:58:14.726 GMT gemfire-cluster-locator-1  
> tid=0x86] Reconnect completed.
> [info 2021/06/23 15:58:46.702 GMT gemfire-cluster-server-0  
> tid=0x94] Reconnect completed.
> [info 2021/06/23 15:58:46.485 GMT gemfire-cluster-server-1  
> tid=0x96] Reconnect completed.
> [info 2021/06/23 15:58:46.273 GMT gemfire-cluster-server-2  
> tid=0x97] Reconnect completed.
> [info 2021/06/23 15:58:46.902 GMT gemfire-cluster-server-3  
> tid=0x95] Reconnect completed.
> {noformat}
> In some rare occasions, though, one of the servers fails during the reconnect 
> phase with the following exception:
> {noformat}
> [error 2021/06/09 18:48:52.872 GMT gemfire-cluster-server-1  
> tid=0x91] Cache initialization for GemFireCache[id = 575310555; isClosing = 
> false; isShutDownAll = false; created = Wed Jun 09 18:46:49 GMT 2021; server 
> = false; copyOnRead = false; lockLease = 120; lockTimeout = 60] failed 
> because:
> org.apache.geode.GemFireIOException: While starting cache server CacheServer 
> on port=40404 client subscription config policy=none client subscription 
> config capacity=1 client subscription config overflow directory=.
>   at 
> org.apache.geode.internal.cache.xmlcache.CacheCreation.startCacheServers(CacheCreation.java:800)
>   at 
> org.apache.geode.internal.cache.xmlcache.CacheCreation.create(CacheCreation.java:599)
>   at 
> org.apache.geode.internal.cache.xmlcache.CacheXmlParser.create(CacheXmlParser.java:339)
>   at 
> org.apache.geode.internal.cache.GemFireCacheImpl.loadCacheXml(GemFireCacheImpl.java:4207)
>   at 
> org.apache.geode.internal.cache.ClusterConfigurationLoader.applyClusterXmlConfiguration(ClusterConfigurationLoader.java:197)
>   at 
> org.apache.geode.internal.cache.GemFireCacheImpl.applyJarAndXmlFromClusterConfig(GemFireCacheImpl.java:1497)
>   at 
> org.apache.geode.internal.cache.GemFireCacheImpl.initialize(GemFireCacheImpl.java:1449)
>   at 
> org.apache.geode.internal.cache.InternalCacheBuilder.create(InternalCacheBuilder.java:191)
>   at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.reconnect(InternalDistributedSystem.java:2668)
>   at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect(InternalDistributedSystem.java:2426)
>   at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1277)
>   at 
> org.apache.geode.distributed.internal.ClusterDistributionManager$DMListener.membershipFailure(ClusterDistributionManager.java:2315)
>   at 
> org.apache.geode.distributed.internal.membership.gms.GMSMembership.uncleanShutdown(GMSMembership.java:1183)
>   at 
> org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.lambda$forceDisconnect$0(GMSMembership.java:1807)
>   at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: java.net.BindException: Address already in use (Bind failed)
>   at java.base/java.net.PlainSocketImpl.socketBind(Native Method)
>   at 
> java.base/java.net.AbstractPlainSocketImpl.bind(AbstractPlainSocketImpl.java:436)
>   at java.base/java.net.ServerSocket.bind(ServerSocket.java:395)
>   at 
> org.apache.geode.internal.net.SCClusterSocketCreator.createServerSocket(SCClusterSocketCreator.java:70)
>   at 
> org.apache.geode.internal.net.SocketCreator.createServerSocket(SocketCreator.java:529)
>   at 
> org.apache.geode.internal.cache.tier.sockets.AcceptorImpl.(AcceptorImpl.java:573)
>   at 
> org.apache.geode.internal.cache.tier.sockets.AcceptorBuilder.create(AcceptorB

[jira] [Assigned] (GEODE-9675) CI: ClusterDistributionManagerDUnitTest > testConnectAfterBeingShunned FAILED

2021-10-28 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-9675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham reassigned GEODE-9675:
---

Assignee: (was: Bill Burcham)

> CI: ClusterDistributionManagerDUnitTest > testConnectAfterBeingShunned FAILED
> -
>
> Key: GEODE-9675
> URL: https://issues.apache.org/jira/browse/GEODE-9675
> Project: Geode
>  Issue Type: Bug
>  Components: membership
>Affects Versions: 1.15.0
>Reporter: Xiaojian Zhou
>Priority: Major
> Attachments: screenshot-1.png
>
>
> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-develop-mass-test-run/jobs/distributed-test-openjdk8/builds/1983
> {code:java}
> ClusterDistributionManagerDUnitTest > testConnectAfterBeingShunned FAILED
> org.apache.geode.SystemConnectException: Problem starting up membership 
> services
> at 
> org.apache.geode.distributed.internal.DistributionImpl.start(DistributionImpl.java:186)
> at 
> org.apache.geode.distributed.internal.DistributionImpl.createDistribution(DistributionImpl.java:222)
> at 
> org.apache.geode.distributed.internal.ClusterDistributionManager.(ClusterDistributionManager.java:466)
> at 
> org.apache.geode.distributed.internal.ClusterDistributionManager.(ClusterDistributionManager.java:499)
> at 
> org.apache.geode.distributed.internal.ClusterDistributionManager.create(ClusterDistributionManager.java:328)
> at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.initialize(InternalDistributedSystem.java:757)
> at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.access$200(InternalDistributedSystem.java:133)
> at 
> org.apache.geode.distributed.internal.InternalDistributedSystem$Builder.build(InternalDistributedSystem.java:3013)
> at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.connectInternal(InternalDistributedSystem.java:283)
> at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.connectInternal(InternalDistributedSystem.java:209)
> at 
> org.apache.geode.distributed.DistributedSystem.connect(DistributedSystem.java:159)
> at 
> org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.getSystem(JUnit4DistributedTestCase.java:180)
> at 
> org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.getSystem(JUnit4DistributedTestCase.java:256)
> at 
> org.apache.geode.distributed.internal.ClusterDistributionManagerDUnitTest.testConnectAfterBeingShunned(ClusterDistributionManagerDUnitTest.java:170)
> Caused by:
> 
> org.apache.geode.distributed.internal.membership.api.MemberStartupException: 
> unable to create jgroups channel
> at 
> org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger.start(JGroupsMessenger.java:401)
> at 
> org.apache.geode.distributed.internal.membership.gms.Services.start(Services.java:203)
> at 
> org.apache.geode.distributed.internal.membership.gms.GMSMembership.start(GMSMembership.java:1642)
> at 
> org.apache.geode.distributed.internal.DistributionImpl.start(DistributionImpl.java:171)
> ... 13 more
> Caused by:
> java.lang.Exception: failed to open a port in range 41003-41003
> at 
> org.jgroups.protocols.UDP.createMulticastSocketWithBindPort(UDP.java:503)
> at org.jgroups.protocols.UDP.createSockets(UDP.java:348)
> at org.jgroups.protocols.UDP.start(UDP.java:266)
> at 
> org.jgroups.stack.ProtocolStack.startStack(ProtocolStack.java:966)
> at org.jgroups.JChannel.startStack(JChannel.java:889)
> at org.jgroups.JChannel._preConnect(JChannel.java:553)
> at org.jgroups.JChannel.connect(JChannel.java:288)
> at org.jgroups.JChannel.connect(JChannel.java:279)
> at 
> org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger.start(JGroupsMessenger.java:397)
> ... 16 more
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (GEODE-9675) CI: ClusterDistributionManagerDUnitTest > testConnectAfterBeingShunned FAILED

2021-10-28 Thread Bill Burcham (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-9675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham updated GEODE-9675:

Labels:   (was: needsTriage)

> CI: ClusterDistributionManagerDUnitTest > testConnectAfterBeingShunned FAILED
> -
>
> Key: GEODE-9675
> URL: https://issues.apache.org/jira/browse/GEODE-9675
> Project: Geode
>  Issue Type: Bug
>  Components: membership
>Affects Versions: 1.15.0
>Reporter: Xiaojian Zhou
>Assignee: Bill Burcham
>Priority: Major
> Attachments: screenshot-1.png
>
>
> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-develop-mass-test-run/jobs/distributed-test-openjdk8/builds/1983
> {code:java}
> ClusterDistributionManagerDUnitTest > testConnectAfterBeingShunned FAILED
> org.apache.geode.SystemConnectException: Problem starting up membership 
> services
> at 
> org.apache.geode.distributed.internal.DistributionImpl.start(DistributionImpl.java:186)
> at 
> org.apache.geode.distributed.internal.DistributionImpl.createDistribution(DistributionImpl.java:222)
> at 
> org.apache.geode.distributed.internal.ClusterDistributionManager.(ClusterDistributionManager.java:466)
> at 
> org.apache.geode.distributed.internal.ClusterDistributionManager.(ClusterDistributionManager.java:499)
> at 
> org.apache.geode.distributed.internal.ClusterDistributionManager.create(ClusterDistributionManager.java:328)
> at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.initialize(InternalDistributedSystem.java:757)
> at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.access$200(InternalDistributedSystem.java:133)
> at 
> org.apache.geode.distributed.internal.InternalDistributedSystem$Builder.build(InternalDistributedSystem.java:3013)
> at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.connectInternal(InternalDistributedSystem.java:283)
> at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.connectInternal(InternalDistributedSystem.java:209)
> at 
> org.apache.geode.distributed.DistributedSystem.connect(DistributedSystem.java:159)
> at 
> org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.getSystem(JUnit4DistributedTestCase.java:180)
> at 
> org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.getSystem(JUnit4DistributedTestCase.java:256)
> at 
> org.apache.geode.distributed.internal.ClusterDistributionManagerDUnitTest.testConnectAfterBeingShunned(ClusterDistributionManagerDUnitTest.java:170)
> Caused by:
> 
> org.apache.geode.distributed.internal.membership.api.MemberStartupException: 
> unable to create jgroups channel
> at 
> org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger.start(JGroupsMessenger.java:401)
> at 
> org.apache.geode.distributed.internal.membership.gms.Services.start(Services.java:203)
> at 
> org.apache.geode.distributed.internal.membership.gms.GMSMembership.start(GMSMembership.java:1642)
> at 
> org.apache.geode.distributed.internal.DistributionImpl.start(DistributionImpl.java:171)
> ... 13 more
> Caused by:
> java.lang.Exception: failed to open a port in range 41003-41003
> at 
> org.jgroups.protocols.UDP.createMulticastSocketWithBindPort(UDP.java:503)
> at org.jgroups.protocols.UDP.createSockets(UDP.java:348)
> at org.jgroups.protocols.UDP.start(UDP.java:266)
> at 
> org.jgroups.stack.ProtocolStack.startStack(ProtocolStack.java:966)
> at org.jgroups.JChannel.startStack(JChannel.java:889)
> at org.jgroups.JChannel._preConnect(JChannel.java:553)
> at org.jgroups.JChannel.connect(JChannel.java:288)
> at org.jgroups.JChannel.connect(JChannel.java:279)
> at 
> org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger.start(JGroupsMessenger.java:397)
> ... 16 more
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

