[jira] [Assigned] (GEODE-10250) The LockGrantor can grant a lock to a member that has left the distributed system
[ https://issues.apache.org/jira/browse/GEODE-10250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barrett Oglesby reassigned GEODE-10250: --- Assignee: Barrett Oglesby > The LockGrantor can grant a lock to a member that has left the distributed > system > - > > Key: GEODE-10250 > URL: https://issues.apache.org/jira/browse/GEODE-10250 > Project: Geode > Issue Type: Bug > Components: distributed lock service >Reporter: Barrett Oglesby >Assignee: Barrett Oglesby >Priority: Major > Labels: needsTriage > > If a member requests a distributed lock and then leaves the distributed > system, the grantor may grant that request and leave itself in a state where > the lock has been granted but the member has left. > Here are the steps: > # The lock requesting server requests a lock > # The grantor server is delayed in granting that lock > # The lock requesting server shuts down in the meantime > # The grantor server finally grants the lock after it has released all locks > and pending requests for the lock requesting server > # The lock requesting server receives the lock response but drops it since > the thread pool has already been shut down -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (GEODE-10250) The LockGrantor can grant a lock to a member that has left the distributed system
Barrett Oglesby created GEODE-10250: --- Summary: The LockGrantor can grant a lock to a member that has left the distributed system Key: GEODE-10250 URL: https://issues.apache.org/jira/browse/GEODE-10250 Project: Geode Issue Type: Bug Components: distributed lock service Reporter: Barrett Oglesby If a member requests a distributed lock and then leaves the distributed system, the grantor may grant that request and leave itself in a state where the lock has been granted but the member has left. Here are the steps: # The lock requesting server requests a lock # The grantor server is delayed in granting that lock # The lock requesting server shuts down in the meantime # The grantor server finally grants the lock after it has released all locks and pending requests for the lock requesting server # The lock requesting server receives the lock response but drops it since the thread pool has already been shut down -- This message was sent by Atlassian Jira (v8.20.7#820007)
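The race in the steps above can be sketched with plain JDK concurrency primitives. This is a hypothetical illustration, not Geode's actual distributed lock service code: a single-threaded executor stands in for the requesting server's response-processing pool, which has already been shut down when the grant finally arrives.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

// Minimal sketch (illustrative names, not Geode classes) of the race:
// the requester leaves before the delayed grant arrives, so the grant
// response is silently dropped by the already-shut-down pool.
public class GrantorRaceSketch {
    public static void main(String[] args) {
        ExecutorService requesterPool = Executors.newSingleThreadExecutor();
        // Steps 1-3: the lock request is in flight, the grantor is delayed,
        // and the requesting server shuts down in the meantime.
        requesterPool.shutdown();
        // Steps 4-5: the grantor finally grants; the late response is
        // rejected by the dead pool and dropped.
        boolean dropped = false;
        try {
            requesterPool.submit(() -> System.out.println("lock granted"));
        } catch (RejectedExecutionException e) {
            dropped = true; // grantor still believes the lock is held
        }
        System.out.println("response dropped: " + dropped);
    }
}
```

The sketch only shows why the late response is discarded; in the real failure, the grantor is additionally left believing the lock is held by a departed member.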
[jira] [Commented] (GEODE-10148) [CI Failure] : JMXMBeanFederationDUnitTest > MBeanFederationAddRemoveServer FAILED
[ https://issues.apache.org/jira/browse/GEODE-10148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17522937#comment-17522937 ] Barrett Oglesby commented on GEODE-10148: - I think this is where the problem is: {{LocalManager.startLocalManagement}} runs the {{ManagementTask}} once right when it starts. With logging added, the call to {{managementTask.get().run()}} returns right away. Even though the comment says it's a synchronous call, it isn't.
{noformat}
[vm3] [warn 2022/03/23 16:16:02.173 PDT server-3 tid=0x12] XXX LocalManager.startLocalManagement about to run managementTask
[vm3] [warn 2022/03/23 16:16:02.173 PDT server-3 tid=0x12] XXX LocalManager.startLocalManagement done managementTask
{noformat}
Then, {{LocalManager.markForFederation}} adds the mbeans to the {{federatedComponentMap}}:
{noformat}
[vm3] [warn 2022/03/23 16:16:02.209 PDT server-3 tid=0x12] XXX LocalManager.markForFederation about to add to federatedComponentMap objName=GemFire:type=Member,member=server-3
[vm3] [warn 2022/03/23 16:16:02.364 PDT server-3 tid=0x12] XXX LocalManager.markForFederation about to add to federatedComponentMap objName=GemFire:service=Region,name="/test-region-1",type=Member,member=server-3
[vm3] [warn 2022/03/23 16:16:02.437 PDT server-3 tid=0x12] XXX LocalManager.markForFederation about to add to federatedComponentMap objName=GemFire:service=CacheServer,port=20017,type=Member,member=server-3
{noformat}
The CacheServer mbean above is the one that is missing in the failed run. 
Then, the {{Management Task}} thread runs the {{ManagementTask}} started above to put the mbeans into the region:
{noformat}
[vm3] [warn 2022/03/23 16:16:04.177 PDT server-3 tid=0x46] XXX LocalManager.doManagementTask about to putAll replicaMap={GemFire:service=CacheServer,port=20017,type=Member,member=server-3=ObjectName = GemFire:service=CacheServer,port=20017,type=Member,member=server-3, GemFire:service=Region,name="/test-region-1",type=Member,member=server-3=ObjectName = GemFire:service=Region,name="/test-region-1",type=Member,member=server-3, GemFire:type=Member,member=server-3=ObjectName = GemFire:type=Member,member=server-3}
[vm3] [warn 2022/03/23 16:16:04.211 PDT server-3 tid=0x46] XXX LocalManager.doManagementTask done putAll replicaMap={GemFire:service=CacheServer,port=20017,type=Member,member=server-3=ObjectName = GemFire:service=CacheServer,port=20017,type=Member,member=server-3, GemFire:service=Region,name="/test-region-1",type=Member,member=server-3=ObjectName = GemFire:service=Region,name="/test-region-1",type=Member,member=server-3, GemFire:type=Member,member=server-3=ObjectName = GemFire:type=Member,member=server-3}
{noformat}
If the {{Management Task}} thread runs between the additions of the Region and CacheServer mbeans, this issue would reproduce. 
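The window described above can be sketched in a few lines. This is a hypothetical illustration mirroring the log messages, not Geode's actual LocalManager implementation: if the putAll snapshot is taken after the Region mbean is added but before the CacheServer mbean is, the replicated map misses the CacheServer mbean for that cycle.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch (not Geode code) of the markForFederation vs.
// ManagementTask race that would make the CacheServer mbean go missing.
public class FederationRaceSketch {
    public static void main(String[] args) {
        Map<String, String> federatedComponentMap = new ConcurrentHashMap<>();
        federatedComponentMap.put("GemFire:type=Member,member=server-3", "member");
        federatedComponentMap.put(
            "GemFire:service=Region,name=\"/test-region-1\",type=Member,member=server-3", "region");
        // The Management Task fires here, before the CacheServer add...
        Map<String, String> replicaMap = new HashMap<>(federatedComponentMap); // putAll snapshot
        // ...and only then is the CacheServer mbean marked for federation.
        federatedComponentMap.put(
            "GemFire:service=CacheServer,port=20017,type=Member,member=server-3", "cacheServer");
        boolean present = replicaMap.keySet().stream().anyMatch(k -> k.contains("CacheServer"));
        System.out.println("snapshot has CacheServer mbean: " + present);
    }
}
```

The snapshot taken in the gap is exactly the two-mbean state the failed run observed.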
> [CI Failure] : JMXMBeanFederationDUnitTest > MBeanFederationAddRemoveServer > FAILED > -- > > Key: GEODE-10148 > URL: https://issues.apache.org/jira/browse/GEODE-10148 > Project: Geode > Issue Type: Bug > Components: jmx >Affects Versions: 1.15.0 >Reporter: Nabarun Nag >Priority: Major > Labels: test-stability > > JMXMBeanFederationDUnitTest > MBeanFederationAddRemoveServer FAILED > java.lang.AssertionError: > Expecting actual: > ["GemFire:service=AccessControl,type=Distributed", > "GemFire:service=CacheServer,port=20842,type=Member,member=server-1", > "GemFire:service=CacheServer,port=20846,type=Member,member=server-2", > > "GemFire:service=DiskStore,name=cluster_config,type=Member,member=locator-one", > "GemFire:service=FileUploader,type=Distributed", > "GemFire:service=Locator,type=Member,member=locator-one", > > "GemFire:service=LockService,name=__CLUSTER_CONFIG_LS,type=Distributed", > > "GemFire:service=LockService,name=__CLUSTER_CONFIG_LS,type=Member,member=locator-one", > "GemFire:service=Manager,type=Member,member=locator-one", > "GemFire:service=Region,name="/test-region-1",type=Distributed", > > "GemFire:service=Region,name="/test-region-1",type=Member,member=server-1", > > "GemFire:service=Region,name="/test-region-1",type=Member,member=server-2", > > "GemFire:service=Region,name="/test-region-1",type=Member,member=server-3", > "GemFire:service=System,type=Distributed", > "GemFire:type=Member,member=locator-one", > "GemFire:type=Member,member=server-1", > "GemFire:type=Member,member=server-2", > "GemFire:type=Member,member=server-3"] > to contain exactly (and in same order): > ["GemFire:service=AccessControl,type=Distributed", > "GemFire:service=CacheServer,port=20842,type=Member,member=server-1", >
[jira] [Resolved] (GEODE-10212) In a WAN topology with 3 sites in a star pattern, stopping a sender between two of the sites causes an event to be dropped even though another path exists between the two sites
[ https://issues.apache.org/jira/browse/GEODE-10212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barrett Oglesby resolved GEODE-10212. - Resolution: Duplicate > In a WAN topology with 3 sites in a star pattern, stopping a sender between > two of the sites causes an event to be dropped even though another path > exists between the two sites > > > Key: GEODE-10212 > URL: https://issues.apache.org/jira/browse/GEODE-10212 > Project: Geode > Issue Type: Bug > Components: wan >Reporter: Barrett Oglesby >Assignee: Barrett Oglesby >Priority: Major > Labels: needsTriage > > A WAN topology in a star pattern means every site is connected to every other > site like: > {noformat} > site-A <--> site-B <--> siteC > ^_^ > {noformat} > If the sender from site-A to site-B is stopped and a put is done in site-A, > site-B doesn't receive the event even though site-A is connected to site-C > and site-C is connected to site-B. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (GEODE-10212) In a WAN topology with 3 sites in a star pattern, stopping a sender between two of the sites causes an event to be dropped even though another path exists between the two sites
[ https://issues.apache.org/jira/browse/GEODE-10212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barrett Oglesby reassigned GEODE-10212: --- Assignee: Barrett Oglesby > In a WAN topology with 3 sites in a star pattern, stopping a sender between > two of the sites causes an event to be dropped even though another path > exists between the two sites > > > Key: GEODE-10212 > URL: https://issues.apache.org/jira/browse/GEODE-10212 > Project: Geode > Issue Type: Bug > Components: wan >Reporter: Barrett Oglesby >Assignee: Barrett Oglesby >Priority: Major > Labels: needsTriage > > A WAN topology in a star pattern means every site is connected to every other > site like: > {noformat} > site-A <--> site-B <--> siteC > ^_^ > {noformat} > If the sender from site-A to site-B is stopped and a put is done in site-A, > site-B doesn't receive the event even though site-A is connected to site-C > and site-C is connected to site-B. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (GEODE-10212) In a WAN topology with 3 sites in a star pattern, stopping a sender between two of the sites causes an event to be dropped even though another path exists between the two sites
[ https://issues.apache.org/jira/browse/GEODE-10212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barrett Oglesby updated GEODE-10212: Description: A WAN topology in a star pattern means every site is connected to every other site like: {noformat} site-A <--> site-B <--> siteC ^_^ {noformat} If the sender from site-A to site-B is stopped and a put is done in site-A, site-B doesn't receive the event even though site-A is connected to site-C and site-C is connected to site-B. was: A WAN topology in a star pattern means every site is connected to every other site like: site-A <-> site-B <-> siteC ^_^ If the sender from site-A to site-B is stopped and a put is done in site-A, site-B doesn't receive the event even though site-A is connected to site-C and site-C is connected to site-B. > In a WAN topology with 3 sites in a star pattern, stopping a sender between > two of the sites causes an event to be dropped even though another path > exists between the two sites > > > Key: GEODE-10212 > URL: https://issues.apache.org/jira/browse/GEODE-10212 > Project: Geode > Issue Type: Bug > Components: wan >Reporter: Barrett Oglesby >Priority: Major > Labels: needsTriage > > A WAN topology in a star pattern means every site is connected to every other > site like: > {noformat} > site-A <--> site-B <--> siteC > ^_^ > {noformat} > If the sender from site-A to site-B is stopped and a put is done in site-A, > site-B doesn't receive the event even though site-A is connected to site-C > and site-C is connected to site-B. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (GEODE-10212) In a WAN topology with 3 sites in a star pattern, stopping a sender between two of the sites causes an event to be dropped even though another path exists between the two sites
Barrett Oglesby created GEODE-10212: --- Summary: In a WAN topology with 3 sites in a star pattern, stopping a sender between two of the sites causes an event to be dropped even though another path exists between the two sites Key: GEODE-10212 URL: https://issues.apache.org/jira/browse/GEODE-10212 Project: Geode Issue Type: Bug Components: wan Reporter: Barrett Oglesby A WAN topology in a star pattern means every site is connected to every other site like: site-A <-> site-B <-> site-C ^_^ If the sender from site-A to site-B is stopped and a put is done in site-A, site-B doesn't receive the event even though site-A is connected to site-C and site-C is connected to site-B. -- This message was sent by Atlassian Jira (v8.20.1#820001)
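The claim that another path still exists can be checked with a small reachability sketch over the sender graph. This is purely illustrative (not Geode code, and it says nothing about whether gateway senders actually forward events transitively): with the site-A to site-B sender removed, a path A -> C -> B remains.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch: model each site's senders as directed edges and check that
// site-B is still reachable from site-A after the A->B sender is stopped.
public class StarTopologySketch {
    public static void main(String[] args) {
        Map<String, List<String>> senders = new HashMap<>();
        senders.put("site-A", new ArrayList<>(List.of("site-B", "site-C")));
        senders.put("site-B", new ArrayList<>(List.of("site-A", "site-C")));
        senders.put("site-C", new ArrayList<>(List.of("site-A", "site-B")));
        senders.get("site-A").remove("site-B"); // stop the A->B sender
        System.out.println("site-B reachable from site-A: "
            + reachable(senders, "site-A", "site-B"));
    }

    static boolean reachable(Map<String, List<String>> graph, String from, String to) {
        Deque<String> stack = new ArrayDeque<>(List.of(from));
        Set<String> seen = new HashSet<>();
        while (!stack.isEmpty()) {
            String site = stack.pop();
            if (site.equals(to)) return true;
            if (seen.add(site)) stack.addAll(graph.getOrDefault(site, List.of()));
        }
        return false;
    }
}
```

The reported bug is that the event is dropped even though this alternate path exists in the topology.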
[jira] [Updated] (GEODE-10164) Revert wording change in rebalance result
[ https://issues.apache.org/jira/browse/GEODE-10164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barrett Oglesby updated GEODE-10164: Affects Version/s: 1.15.0 > Revert wording change in rebalance result > - > > Key: GEODE-10164 > URL: https://issues.apache.org/jira/browse/GEODE-10164 > Project: Geode > Issue Type: Bug > Components: gfsh >Affects Versions: 1.15.0 >Reporter: Barrett Oglesby >Assignee: Barrett Oglesby >Priority: Major > Labels: blocks-1.15.0, needsTriage, pull-request-available > Fix For: 1.12.10, 1.13.9, 1.14.5, 1.15.0 > > > I made a change to the wording of the rebalance command result > from: > {noformat} > Rebalanced partition regions {noformat} > to: > {noformat} > Rebalanced partitioned region {noformat} > This change caused hydra and other tests to fail, so I'm reverting it. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (GEODE-10164) Revert wording change in rebalance result
[ https://issues.apache.org/jira/browse/GEODE-10164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barrett Oglesby updated GEODE-10164: Labels: blocks-1.15.0 needsTriage pull-request-available (was: needsTriage pull-request-available) > Revert wording change in rebalance result > - > > Key: GEODE-10164 > URL: https://issues.apache.org/jira/browse/GEODE-10164 > Project: Geode > Issue Type: Bug > Components: gfsh >Reporter: Barrett Oglesby >Assignee: Barrett Oglesby >Priority: Major > Labels: blocks-1.15.0, needsTriage, pull-request-available > Fix For: 1.12.10, 1.13.9, 1.14.5, 1.15.0 > > > I made a change to the wording of the rebalance command result > from: > {noformat} > Rebalanced partition regions {noformat} > to: > {noformat} > Rebalanced partitioned region {noformat} > This change caused hydra and other tests to fail, so I'm reverting it. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (GEODE-10164) Revert wording change in rebalance result
[ https://issues.apache.org/jira/browse/GEODE-10164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barrett Oglesby resolved GEODE-10164. - Fix Version/s: 1.12.10 1.13.9 1.14.5 1.15.0 Resolution: Fixed > Revert wording change in rebalance result > - > > Key: GEODE-10164 > URL: https://issues.apache.org/jira/browse/GEODE-10164 > Project: Geode > Issue Type: Bug > Components: gfsh >Reporter: Barrett Oglesby >Assignee: Barrett Oglesby >Priority: Major > Labels: needsTriage, pull-request-available > Fix For: 1.12.10, 1.13.9, 1.14.5, 1.15.0 > > > I made a change to the wording of the rebalance command result > from: > {noformat} > Rebalanced partition regions {noformat} > to: > {noformat} > Rebalanced partitioned region {noformat} > This change caused hydra and other tests to fail, so I'm reverting it. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (GEODE-10164) Revert wording change in rebalance result
[ https://issues.apache.org/jira/browse/GEODE-10164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barrett Oglesby reassigned GEODE-10164: --- Assignee: Barrett Oglesby > Revert wording change in rebalance result > - > > Key: GEODE-10164 > URL: https://issues.apache.org/jira/browse/GEODE-10164 > Project: Geode > Issue Type: Bug > Components: gfsh >Reporter: Barrett Oglesby >Assignee: Barrett Oglesby >Priority: Major > Labels: needsTriage > > I made a change to the wording of the rebalance command result > from: > {noformat} > Rebalanced partition regions {noformat} > to: > {noformat} > Rebalanced partitioned region {noformat} > This change caused hydra and other tests to fail, so I'm reverting it. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (GEODE-10164) Revert wording change in rebalance result
Barrett Oglesby created GEODE-10164: --- Summary: Revert wording change in rebalance result Key: GEODE-10164 URL: https://issues.apache.org/jira/browse/GEODE-10164 Project: Geode Issue Type: Bug Components: gfsh Reporter: Barrett Oglesby I made a change to the wording of the rebalance command result from: {noformat} Rebalanced partition regions {noformat} to: {noformat} Rebalanced partitioned region {noformat} This change caused hydra and other tests to fail, so I'm reverting it. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (GEODE-10148) [CI Failure] : JMXMBeanFederationDUnitTest > MBeanFederationAddRemoveServer FAILED
[ https://issues.apache.org/jira/browse/GEODE-10148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17511520#comment-17511520 ] Barrett Oglesby commented on GEODE-10148: - The test is saying that the result of this call to the locator is missing the CacheServer MBean that exists in the expectedMBeans list. List intermediateMBeans = getFederatedGemfireBeansFrom(locator1); That mbean list in the locator is updated asynchronously by the ManagementTask in each member. See ManagementResourceRepo.putAllInLocalMonitoringRegion. The localMonitoringRegion is DISTRIBUTED_NO_ACK. > [CI Failure] : JMXMBeanFederationDUnitTest > MBeanFederationAddRemoveServer > FAILED > -- > > Key: GEODE-10148 > URL: https://issues.apache.org/jira/browse/GEODE-10148 > Project: Geode > Issue Type: Bug > Components: jmx >Affects Versions: 1.15.0 >Reporter: Nabarun Nag >Assignee: Owen Nichols >Priority: Major > Labels: needsTriage > > JMXMBeanFederationDUnitTest > MBeanFederationAddRemoveServer FAILED > java.lang.AssertionError: > Expecting actual: > ["GemFire:service=AccessControl,type=Distributed", > "GemFire:service=CacheServer,port=20842,type=Member,member=server-1", > "GemFire:service=CacheServer,port=20846,type=Member,member=server-2", > > "GemFire:service=DiskStore,name=cluster_config,type=Member,member=locator-one", > "GemFire:service=FileUploader,type=Distributed", > "GemFire:service=Locator,type=Member,member=locator-one", > > "GemFire:service=LockService,name=__CLUSTER_CONFIG_LS,type=Distributed", > > "GemFire:service=LockService,name=__CLUSTER_CONFIG_LS,type=Member,member=locator-one", > "GemFire:service=Manager,type=Member,member=locator-one", > "GemFire:service=Region,name="/test-region-1",type=Distributed", > > "GemFire:service=Region,name="/test-region-1",type=Member,member=server-1", > > "GemFire:service=Region,name="/test-region-1",type=Member,member=server-2", > > "GemFire:service=Region,name="/test-region-1",type=Member,member=server-3", > 
"GemFire:service=System,type=Distributed", > "GemFire:type=Member,member=locator-one", > "GemFire:type=Member,member=server-1", > "GemFire:type=Member,member=server-2", > "GemFire:type=Member,member=server-3"] > to contain exactly (and in same order): > ["GemFire:service=AccessControl,type=Distributed", > "GemFire:service=CacheServer,port=20842,type=Member,member=server-1", > "GemFire:service=CacheServer,port=20846,type=Member,member=server-2", > "GemFire:service=CacheServer,port=20850,type=Member,member=server-3", > > "GemFire:service=DiskStore,name=cluster_config,type=Member,member=locator-one", > "GemFire:service=FileUploader,type=Distributed", > "GemFire:service=Locator,type=Member,member=locator-one", > > "GemFire:service=LockService,name=__CLUSTER_CONFIG_LS,type=Distributed", > > "GemFire:service=LockService,name=__CLUSTER_CONFIG_LS,type=Member,member=locator-one", > "GemFire:service=Manager,type=Member,member=locator-one", > "GemFire:service=Region,name="/test-region-1",type=Distributed", > > "GemFire:service=Region,name="/test-region-1",type=Member,member=server-1", > > "GemFire:service=Region,name="/test-region-1",type=Member,member=server-2", > > "GemFire:service=Region,name="/test-region-1",type=Member,member=server-3", > "GemFire:service=System,type=Distributed", > "GemFire:type=Member,member=locator-one", > "GemFire:type=Member,member=server-1", > "GemFire:type=Member,member=server-2", > "GemFire:type=Member,member=server-3"] > but could not find the following elements: > ["GemFire:service=CacheServer,port=20850,type=Member,member=server-3"] > at > org.apache.geode.management.internal.JMXMBeanFederationDUnitTest.MBeanFederationAddRemoveServer(JMXMBeanFederationDUnitTest.java:130) > 8352 tests completed, 1 failed, 414 skipped -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (GEODE-10148) [CI Failure] : JMXMBeanFederationDUnitTest > MBeanFederationAddRemoveServer FAILED
[ https://issues.apache.org/jira/browse/GEODE-10148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barrett Oglesby reassigned GEODE-10148: --- Assignee: (was: Barrett Oglesby) > [CI Failure] : JMXMBeanFederationDUnitTest > MBeanFederationAddRemoveServer > FAILED > -- > > Key: GEODE-10148 > URL: https://issues.apache.org/jira/browse/GEODE-10148 > Project: Geode > Issue Type: Bug > Components: jmx >Affects Versions: 1.15.0 >Reporter: Nabarun Nag >Priority: Major > Labels: needsTriage > > JMXMBeanFederationDUnitTest > MBeanFederationAddRemoveServer FAILED > java.lang.AssertionError: > Expecting actual: > ["GemFire:service=AccessControl,type=Distributed", > "GemFire:service=CacheServer,port=20842,type=Member,member=server-1", > "GemFire:service=CacheServer,port=20846,type=Member,member=server-2", > > "GemFire:service=DiskStore,name=cluster_config,type=Member,member=locator-one", > "GemFire:service=FileUploader,type=Distributed", > "GemFire:service=Locator,type=Member,member=locator-one", > > "GemFire:service=LockService,name=__CLUSTER_CONFIG_LS,type=Distributed", > > "GemFire:service=LockService,name=__CLUSTER_CONFIG_LS,type=Member,member=locator-one", > "GemFire:service=Manager,type=Member,member=locator-one", > "GemFire:service=Region,name="/test-region-1",type=Distributed", > > "GemFire:service=Region,name="/test-region-1",type=Member,member=server-1", > > "GemFire:service=Region,name="/test-region-1",type=Member,member=server-2", > > "GemFire:service=Region,name="/test-region-1",type=Member,member=server-3", > "GemFire:service=System,type=Distributed", > "GemFire:type=Member,member=locator-one", > "GemFire:type=Member,member=server-1", > "GemFire:type=Member,member=server-2", > "GemFire:type=Member,member=server-3"] > to contain exactly (and in same order): > ["GemFire:service=AccessControl,type=Distributed", > "GemFire:service=CacheServer,port=20842,type=Member,member=server-1", > 
"GemFire:service=CacheServer,port=20846,type=Member,member=server-2", > "GemFire:service=CacheServer,port=20850,type=Member,member=server-3", > > "GemFire:service=DiskStore,name=cluster_config,type=Member,member=locator-one", > "GemFire:service=FileUploader,type=Distributed", > "GemFire:service=Locator,type=Member,member=locator-one", > > "GemFire:service=LockService,name=__CLUSTER_CONFIG_LS,type=Distributed", > > "GemFire:service=LockService,name=__CLUSTER_CONFIG_LS,type=Member,member=locator-one", > "GemFire:service=Manager,type=Member,member=locator-one", > "GemFire:service=Region,name="/test-region-1",type=Distributed", > > "GemFire:service=Region,name="/test-region-1",type=Member,member=server-1", > > "GemFire:service=Region,name="/test-region-1",type=Member,member=server-2", > > "GemFire:service=Region,name="/test-region-1",type=Member,member=server-3", > "GemFire:service=System,type=Distributed", > "GemFire:type=Member,member=locator-one", > "GemFire:type=Member,member=server-1", > "GemFire:type=Member,member=server-2", > "GemFire:type=Member,member=server-3"] > but could not find the following elements: > ["GemFire:service=CacheServer,port=20850,type=Member,member=server-3"] > at > org.apache.geode.management.internal.JMXMBeanFederationDUnitTest.MBeanFederationAddRemoveServer(JMXMBeanFederationDUnitTest.java:130) > 8352 tests completed, 1 failed, 414 skipped -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (GEODE-10148) [CI Failure] : JMXMBeanFederationDUnitTest > MBeanFederationAddRemoveServer FAILED
[ https://issues.apache.org/jira/browse/GEODE-10148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barrett Oglesby reassigned GEODE-10148: --- Assignee: Barrett Oglesby > [CI Failure] : JMXMBeanFederationDUnitTest > MBeanFederationAddRemoveServer > FAILED > -- > > Key: GEODE-10148 > URL: https://issues.apache.org/jira/browse/GEODE-10148 > Project: Geode > Issue Type: Bug > Components: jmx >Affects Versions: 1.15.0 >Reporter: Nabarun Nag >Assignee: Barrett Oglesby >Priority: Major > Labels: needsTriage > > JMXMBeanFederationDUnitTest > MBeanFederationAddRemoveServer FAILED > java.lang.AssertionError: > Expecting actual: > ["GemFire:service=AccessControl,type=Distributed", > "GemFire:service=CacheServer,port=20842,type=Member,member=server-1", > "GemFire:service=CacheServer,port=20846,type=Member,member=server-2", > > "GemFire:service=DiskStore,name=cluster_config,type=Member,member=locator-one", > "GemFire:service=FileUploader,type=Distributed", > "GemFire:service=Locator,type=Member,member=locator-one", > > "GemFire:service=LockService,name=__CLUSTER_CONFIG_LS,type=Distributed", > > "GemFire:service=LockService,name=__CLUSTER_CONFIG_LS,type=Member,member=locator-one", > "GemFire:service=Manager,type=Member,member=locator-one", > "GemFire:service=Region,name="/test-region-1",type=Distributed", > > "GemFire:service=Region,name="/test-region-1",type=Member,member=server-1", > > "GemFire:service=Region,name="/test-region-1",type=Member,member=server-2", > > "GemFire:service=Region,name="/test-region-1",type=Member,member=server-3", > "GemFire:service=System,type=Distributed", > "GemFire:type=Member,member=locator-one", > "GemFire:type=Member,member=server-1", > "GemFire:type=Member,member=server-2", > "GemFire:type=Member,member=server-3"] > to contain exactly (and in same order): > ["GemFire:service=AccessControl,type=Distributed", > "GemFire:service=CacheServer,port=20842,type=Member,member=server-1", > 
"GemFire:service=CacheServer,port=20846,type=Member,member=server-2", > "GemFire:service=CacheServer,port=20850,type=Member,member=server-3", > > "GemFire:service=DiskStore,name=cluster_config,type=Member,member=locator-one", > "GemFire:service=FileUploader,type=Distributed", > "GemFire:service=Locator,type=Member,member=locator-one", > > "GemFire:service=LockService,name=__CLUSTER_CONFIG_LS,type=Distributed", > > "GemFire:service=LockService,name=__CLUSTER_CONFIG_LS,type=Member,member=locator-one", > "GemFire:service=Manager,type=Member,member=locator-one", > "GemFire:service=Region,name="/test-region-1",type=Distributed", > > "GemFire:service=Region,name="/test-region-1",type=Member,member=server-1", > > "GemFire:service=Region,name="/test-region-1",type=Member,member=server-2", > > "GemFire:service=Region,name="/test-region-1",type=Member,member=server-3", > "GemFire:service=System,type=Distributed", > "GemFire:type=Member,member=locator-one", > "GemFire:type=Member,member=server-1", > "GemFire:type=Member,member=server-2", > "GemFire:type=Member,member=server-3"] > but could not find the following elements: > ["GemFire:service=CacheServer,port=20850,type=Member,member=server-3"] > at > org.apache.geode.management.internal.JMXMBeanFederationDUnitTest.MBeanFederationAddRemoveServer(JMXMBeanFederationDUnitTest.java:130) > 8352 tests completed, 1 failed, 414 skipped -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (GEODE-10144) Regression in geode-native test CqPlusAuthInitializeTest.reAuthenticateWithDurable
[ https://issues.apache.org/jira/browse/GEODE-10144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17511339#comment-17511339 ] Barrett Oglesby commented on GEODE-10144: - This issue comes down to a few factors:
- The default PoolFactory::DEFAULT_SUBSCRIPTION_ACK_INTERVAL of 100 seconds means all the events stay on the queue for the entire test. This means that every time the client disconnects, the server starts over again processing the queue from the beginning.
- The NC client is version GFE 9.0 (or earlier), so no ClientReAuthenticateMessage is sent to it when an AuthenticationExpiredException occurs. The server waits anyway in case new credentials are sent through another operation.
- With the new changes, the server waits 5 seconds to be notified of a re-auth. If no re-auth occurs, it waits the entire 5 seconds.
- With the old code, the server waits 200 ms before attempting to process the event again (which includes asking for authorization again). The SimulatedExpirationSecurityManager randomly decides whether to authorize the event. 99% of the time, it returns true. So, the second request will almost always return true.
- So without any external event (like new credentials):
-- With the old code, the Message Dispatcher processes the event successfully after 200 ms with no client disconnect
-- With the new code, the Message Dispatcher waits 5 seconds and then disconnects the client
> Regression in geode-native test > CqPlusAuthInitializeTest.reAuthenticateWithDurable > -- > > Key: GEODE-10144 > URL: https://issues.apache.org/jira/browse/GEODE-10144 > Project: Geode > Issue Type: Bug > Components: client/server >Affects Versions: 1.15.0 >Reporter: Blake Bender >Assignee: Jinmei Liao >Priority: Major > Labels: blocks-1.15.0, needsTriage > Fix For: 1.15.0 > > > This test is failing across the board in the `geode-native` PR pipeline. > Main develop pipeline is green only because nothing can get through the PR > pipeline to clear checkin gates. 
We have green CI runs with 1.15. build 918, > then it started failing when we picked up build 924. > > [~moleske] tracked this back to this commit: > [https://github.com/apache/geode/commit/2554f42b925f2b9b8ca7eee14c7a887436b1d9db|https://github.com/apache/geode/commit/2554f42b925f2b9b8ca7eee14c7a887436b1d9db]. > See his notes in `geode-native` PR # 947 > ([https://github.com/apache/geode-native/pull/947]) -- This message was sent by Atlassian Jira (v8.20.1#820001)
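The timing difference described in the comment above can be summarized numerically. This is a back-of-the-envelope sketch: the 200 ms retry, 5 second wait, and 99% authorization rate come from the comment; the class and field names are made up.

```java
// Hypothetical sketch comparing the old and new server behavior for a
// pre-9.1 client that never receives ClientReAuthenticateMessage.
public class ReauthTimingSketch {
    static final long OLD_RETRY_MS = 200;  // old code: sleep, then re-ask authorize()
    static final long NEW_WAIT_MS = 5000;  // new code: wait for a re-auth notification

    public static void main(String[] args) {
        // With a security manager that authorizes ~99% of the time, the old
        // path recovers after an expected 1/p retries (geometric expectation),
        // i.e. barely more than one 200 ms sleep. The new path, with no
        // notification ever arriving, always waits the full 5 seconds and
        // then disconnects the client.
        double pAuthorize = 0.99;
        long expectedOldDelayMs = Math.round(OLD_RETRY_MS / pAuthorize);
        System.out.println("expected old-code delay ms: " + expectedOldDelayMs);
        System.out.println("new-code delay ms: " + NEW_WAIT_MS);
    }
}
```

That roughly 25x difference in recovery time is what turns a previously passing test into a client disconnect.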
[jira] [Commented] (GEODE-10144) Regression in geode-native test CqPlusAuthInitializeTest.reAuthenticateWithDurable
[ https://issues.apache.org/jira/browse/GEODE-10144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17510954#comment-17510954 ] Barrett Oglesby commented on GEODE-10144: - Here is why this test passes with the previous server code: after the Client Message Dispatcher caught an AuthenticationExpiredException, it slept for 200 ms before trying again, repeating for up to 5 seconds before giving up. Each time it retried, it asked for authorization again. Here is a case where SimulatedExpirationSecurityManager.authorize throws an AuthenticationExpiredException:
{noformat}
[warn 2022/03/22 14:59:04.110 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX MessageDispatcher.runDispatcher about to dispatchMessage operation=AFTER_CREATE; key=key130
[warn 2022/03/22 14:59:04.110 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX SimulatedExpirationSecurityManager.authorize about to throw AuthenticationExpiredException
[warn 2022/03/22 14:59:04.110 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX MessageDispatcher.runDispatcher caught AuthenticationExpiredException
[warn 2022/03/22 14:59:04.110 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX MessageDispatcher.runDispatcher skipped sending ClientReAuthenticateMessage clientVersion=GFE 9.0
{noformat}
The Client Message Dispatcher sleeps for 200 ms:
{noformat}
[warn 2022/03/22 14:59:04.110 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX MessageDispatcher.runDispatcher about to sleep 1 for 200 ms
[warn 2022/03/22 14:59:04.311 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX MessageDispatcher.runDispatcher done sleep 1
{noformat}
When it wakes up, it checks for authorization again. 
This time, the SimulatedExpirationSecurityManager returns true, so the message is sent: {noformat} [warn 2022/03/22 14:59:04.311 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX MessageDispatcher.runDispatcher about to dispatchMessage operation=AFTER_CREATE; key=key130 [warn 2022/03/22 14:59:04.311 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX SimulatedExpirationSecurityManager.authorize about to return true [warn 2022/03/22 14:59:04.311 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX MessageDispatcher.runDispatcher done dispatchMessage operation=AFTER_CREATE; key=key130 {noformat} This path is not relying on outside operations to notify the Client Message Dispatcher. The SimulatedExpirationSecurityManager authorizes the operation after the sleep. So at the end of the run where there are no client operations, the Client Message Dispatcher is most likely only going to sleep 200 ms. There is never going to be a 5 second wait. 
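The pre-change retry loop described above (catch the AuthenticationExpiredException, sleep 200 ms, re-check authorization, give up after about 5 seconds) can be sketched as follows. This is a hypothetical illustration, not the actual MessageDispatcher code; the class, constants, and the `isAuthorized` callback are invented for the sketch:

```java
import java.util.function.BooleanSupplier;

// Sketch of the pre-change dispatcher retry loop: on an authorization
// failure, sleep 200 ms and re-check, giving up after ~5 seconds.
// All names here are hypothetical, not the actual Geode internals.
public final class ReauthRetrySketch {
  static final long RETRY_INTERVAL_MS = 200;
  static final long MAX_WAIT_MS = 5_000;

  /** Returns true if authorization succeeded within the retry window. */
  static boolean dispatchWithRetry(BooleanSupplier isAuthorized) {
    long waited = 0;
    while (true) {
      if (isAuthorized.getAsBoolean()) {
        return true; // authorized: dispatch the queued message
      }
      if (waited >= MAX_WAIT_MS) {
        return false; // gave up after ~5 seconds of retries
      }
      try {
        Thread.sleep(RETRY_INTERVAL_MS); // wait before asking again
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      }
      waited += RETRY_INTERVAL_MS;
    }
  }
}
```

Because the SimulatedExpirationSecurityManager authorizes the operation on the next check, a loop like this normally exits after a single 200 ms sleep, which is why the old code never came near the 5-second limit in this test.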
I did see a few times where the Client Message Dispatcher slept twice through the loop (so 400 ms): {noformat} [warn 2022/03/22 14:59:23.924 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX MessageDispatcher.runDispatcher about to dispatchMessage operation=AFTER_UPDATE; key=key4820 [warn 2022/03/22 14:59:23.924 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX SimulatedExpirationSecurityManager.authorize about to throw AuthenticationExpiredException [warn 2022/03/22 14:59:23.924 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX MessageDispatcher.runDispatcher caught AuthenticationExpiredException [warn 2022/03/22 14:59:23.924 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX MessageDispatcher.runDispatcher skipped sending ClientReAuthenticateMessage clientVersion=GFE 9.0 [warn 2022/03/22 14:59:23.924 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX MessageDispatcher.runDispatcher about to sleep 1 for 200 ms [warn 2022/03/22 14:59:24.124 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX MessageDispatcher.runDispatcher done sleep 1 [warn 2022/03/22 14:59:24.124 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX MessageDispatcher.runDispatcher about to dispatchMessage operation=AFTER_UPDATE; key=key4820 [warn 2022/03/22 14:59:24.124 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX SimulatedExpirationSecurityManager.authorize about to throw AuthenticationExpiredException [warn 2022/03/22 14:59:24.124 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX MessageDispatcher.runDispatcher caught AuthenticationExpiredException [warn 2022/03/22 14:59:24.124 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX MessageDispatcher.runDispatcher about to sleep 2 for
[jira] [Commented] (GEODE-10144) Regression in geode-native test CqPlusAuthInitializeTest.reAuthenticateWithDurable
[ https://issues.apache.org/jira/browse/GEODE-10144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17510953#comment-17510953 ] Barrett Oglesby commented on GEODE-10144: - Even though this JIRA is resolved, I did some analysis on it. I see what's going on in this test. At the beginning of the test, client cache operations occur simultaneously with message dispatching from the server to the client. Here is a ServerConnection processing puts: {noformat} [warn 2022/03/22 15:40:01.096 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x50] XXX Put70.cmdExecute operation=UPDATE; key=key50 [warn 2022/03/22 15:40:01.096 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x50] XXX SimulatedExpirationSecurityManager.authorize about to return true [warn 2022/03/22 15:40:01.099 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x50] XXX Put70.cmdExecute operation=UPDATE; key=key51 [warn 2022/03/22 15:40:01.099 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x50] XXX SimulatedExpirationSecurityManager.authorize about to return true [warn 2022/03/22 15:40:01.101 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x50] XXX Put70.cmdExecute operation=UPDATE; key=key52 [warn 2022/03/22 15:40:01.102 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x50] XXX SimulatedExpirationSecurityManager.authorize about to return true [warn 2022/03/22 15:40:01.104 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x50] XXX Put70.cmdExecute operation=UPDATE; key=key53 [warn 2022/03/22 15:40:01.104 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x50] XXX SimulatedExpirationSecurityManager.authorize about to return true [warn 2022/03/22 15:40:01.106 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x50] XXX Put70.cmdExecute operation=UPDATE; key=key54 [warn 2022/03/22 15:40:01.107 PDT 
CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x50] XXX SimulatedExpirationSecurityManager.authorize about to return true {noformat} At the same time, the Client Message Dispatcher is dispatching events to the client: {noformat} [warn 2022/03/22 15:40:01.098 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX MessageDispatcher.runDispatcher about to dispatchMessage operation=AFTER_CREATE; key=key50 [warn 2022/03/22 15:40:01.098 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX SimulatedExpirationSecurityManager.authorize about to return true [warn 2022/03/22 15:40:01.099 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX MessageDispatcher.runDispatcher done dispatchMessage operation=AFTER_CREATE; key=key50 [warn 2022/03/22 15:40:01.101 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX MessageDispatcher.runDispatcher about to dispatchMessage operation=AFTER_CREATE; key=key51 [warn 2022/03/22 15:40:01.101 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX SimulatedExpirationSecurityManager.authorize about to return true [warn 2022/03/22 15:40:01.101 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX MessageDispatcher.runDispatcher done dispatchMessage operation=AFTER_CREATE; key=key51 [warn 2022/03/22 15:40:01.103 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX MessageDispatcher.runDispatcher about to dispatchMessage operation=AFTER_CREATE; key=key52 [warn 2022/03/22 15:40:01.104 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX SimulatedExpirationSecurityManager.authorize about to return true [warn 2022/03/22 15:40:01.104 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX MessageDispatcher.runDispatcher done dispatchMessage operation=AFTER_CREATE; key=key52 [warn 2022/03/22 15:40:01.106 PDT 
CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX MessageDispatcher.runDispatcher about to dispatchMessage operation=AFTER_CREATE; key=key53 [warn 2022/03/22 15:40:01.106 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX SimulatedExpirationSecurityManager.authorize about to return true [warn 2022/03/22 15:40:01.106 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX MessageDispatcher.runDispatcher done dispatchMessage operation=AFTER_CREATE; key=key53 {noformat} While the Client Message Dispatcher is dispatching an event, it requests authorization, which fails. Since the NC is not the latest version, no ClientReAuthenticateMessage is sent to it. The dispatcher waits anyway in case another operation updates the credentials: {noformat} [warn 2022/03/22 15:40:01.109 PDT
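The version gating in these logs ("skipped sending ClientReAuthenticateMessage clientVersion=GFE 9.0") can be sketched as a simple ordinal check. The constant, the cutoff value, and the method name below are hypothetical illustrations, not the actual Geode version API:

```java
// Hypothetical sketch: older clients (e.g. "GFE 9.0" in the log) are not
// sent a ClientReAuthenticateMessage; for them the dispatcher just waits
// in case another operation refreshes the expired credentials.
public final class ReauthGateSketch {
  // Assumed minimum client version ordinal that understands re-authentication.
  static final int REAUTH_MIN_ORDINAL = 115;

  static boolean shouldSendReauthMessage(int clientOrdinal) {
    return clientOrdinal >= REAUTH_MIN_ORDINAL;
  }
}
```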
[jira] [Resolved] (GEODE-9910) Failure to auto-reconnect upon network partition
[ https://issues.apache.org/jira/browse/GEODE-9910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barrett Oglesby resolved GEODE-9910. Fix Version/s: 1.12.10 1.13.9 1.14.5 1.15.0 Resolution: Fixed > Failure to auto-reconnect upon network partition > > > Key: GEODE-9910 > URL: https://issues.apache.org/jira/browse/GEODE-9910 > Project: Geode > Issue Type: Bug >Affects Versions: 1.14.0 >Reporter: Surya Mudundi >Assignee: Barrett Oglesby >Priority: Major > Labels: GeodeOperationAPI, blocks-1.15.0, needsTriage, > pull-request-available > Fix For: 1.12.10, 1.13.9, 1.14.5, 1.15.0 > > Attachments: geode-logs.zip > > > Two node cluster with embedded locators failed to auto-reconnect when node-1 > experienced network outage for couple of minutes and when node-1 recovered > from the outage, node-2 failed to auto-reconnect. > node-2 tried to re-connect to node-1 as: > [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] > [] Attempting to reconnect to the distributed system. This is attempt #1. > [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] > [] Attempting to reconnect to the distributed system. This is attempt #2. > [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] > [] Attempting to reconnect to the distributed system. This is attempt #3. 
> Finally reported below error after 3 attempts as: > INFO > [org.apache.geode.logging.internal.LoggingProviderLoader]-[ReconnectThread] > [] Using org.apache.geode.logging.internal.SimpleLoggingProvider for service > org.apache.geode.logging.internal.spi.LoggingProvider > INFO [org.apache.geode.internal.InternalDataSerializer]-[ReconnectThread] [] > initializing InternalDataSerializer with 0 services > INFO > [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] > [] performing a quorum check to see if location services can be started early > INFO > [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] > [] Quorum check passed - allowing location services to start early > WARN > [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] > [] Exception occurred while trying to connect the system during reconnect > java.lang.IllegalStateException: A locator can not be created because one > already exists in this JVM. > at > org.apache.geode.distributed.internal.InternalLocator.createLocator(InternalLocator.java:298) > ~[geode-core-1.14.0.jar:?] > at > org.apache.geode.distributed.internal.InternalLocator.createLocator(InternalLocator.java:273) > ~[geode-core-1.14.0.jar:?] > at > org.apache.geode.distributed.internal.InternalDistributedSystem.startInitLocator(InternalDistributedSystem.java:916) > ~[geode-core-1.14.0.jar:?] > at > org.apache.geode.distributed.internal.InternalDistributedSystem.initialize(InternalDistributedSystem.java:768) > ~[geode-core-1.14.0.jar:?] > at > org.apache.geode.distributed.internal.InternalDistributedSystem.access$200(InternalDistributedSystem.java:135) > ~[geode-core-1.14.0.jar:?] > at > org.apache.geode.distributed.internal.InternalDistributedSystem$Builder.build(InternalDistributedSystem.java:3034) > ~[geode-core-1.14.0.jar:?] 
> at > org.apache.geode.distributed.internal.InternalDistributedSystem.connectInternal(InternalDistributedSystem.java:290) > ~[geode-core-1.14.0.jar:?] > at > org.apache.geode.distributed.internal.InternalDistributedSystem.reconnect(InternalDistributedSystem.java:2605) > ~[geode-core-1.14.0.jar:?] > at > org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect(InternalDistributedSystem.java:2424) > ~[geode-core-1.14.0.jar:?] > at > org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1275) > ~[geode-core-1.14.0.jar:?] > at > org.apache.geode.distributed.internal.ClusterDistributionManager$DMListener.membershipFailure(ClusterDistributionManager.java:2326) > ~[geode-core-1.14.0.jar:?] > at > org.apache.geode.distributed.internal.membership.gms.GMSMembership.uncleanShutdown(GMSMembership.java:1187) > ~[geode-membership-1.14.0.jar:?] > at > org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.lambda$forceDisconnect$0(GMSMembership.java:1811) > ~[geode-membership-1.14.0.jar:?] > at java.lang.Thread.run(Thread.java:829) [?:?] > -- This message was sent by Atlassian Jira (v8.20.1#820001)
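The failure mode in the stack trace above can be reduced to a small sketch: the reconnect path creates the locator early (right after the quorum check), a later reconnect attempt never stops it, and the JVM-wide singleton guard trips. The class, field, and method names below are hypothetical, not the actual InternalLocator code:

```java
// Minimal sketch of the reconnect failure: a JVM-wide locator singleton
// is created on one reconnect attempt and never torn down, so the next
// attempt fails. Names are hypothetical, not actual Geode classes.
public final class LocatorGuardSketch {
  private static Object locator; // simulates the JVM-wide singleton

  static void createLocator() {
    if (locator != null) {
      throw new IllegalStateException(
          "A locator can not be created because one already exists in this JVM.");
    }
    locator = new Object();
  }

  /** A reconnect attempt that forgets to stop the previous locator fails. */
  static boolean reconnectAttempt() {
    try {
      createLocator(); // second attempt hits the singleton guard
      return true;
    } catch (IllegalStateException e) {
      return false;
    }
  }
}
```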
[jira] [Resolved] (GEODE-10103) Rebalance with no setting for include-region doesn't work for subregions
[ https://issues.apache.org/jira/browse/GEODE-10103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barrett Oglesby resolved GEODE-10103. - Fix Version/s: 1.12.10 1.13.9 1.14.5 1.15.0 Resolution: Fixed > Rebalance with no setting for include-region doesn't work for subregions > > > Key: GEODE-10103 > URL: https://issues.apache.org/jira/browse/GEODE-10103 > Project: Geode > Issue Type: Bug > Components: gfsh >Reporter: Barrett Oglesby >Assignee: Barrett Oglesby >Priority: Major > Labels: needsTriage, pull-request-available > Fix For: 1.12.10, 1.13.9, 1.14.5, 1.15.0 > > > Executing a command like this produces no output for the rebalance command > even though a region exists to rebalance: > {noformat} > gfsh -e "connect --locator=localhost[23456]" -e "rebalance"{noformat} > Output: > {noformat} > ./rebalance.sh > (1) Executing - connect --locator=localhost[23456] > Connecting to Locator at [host=localhost, port=23456] .. > Connecting to Manager at [host=192.168.1.5, port=1099] .. > Successfully connected to: [host=192.168.1.5, port=1099] > You are connected to a cluster of version: 1.16.0-build.0 > (2) Executing - rebalance{noformat} > Running from gfsh directly does: > {noformat} > gfsh>rebalance > gfsh> {noformat} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (GEODE-10103) Rebalance with no setting for include-region doesn't work for subregions
[ https://issues.apache.org/jira/browse/GEODE-10103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barrett Oglesby reassigned GEODE-10103: --- Assignee: Barrett Oglesby > Rebalance with no setting for include-region doesn't work for subregions > > > Key: GEODE-10103 > URL: https://issues.apache.org/jira/browse/GEODE-10103 > Project: Geode > Issue Type: Bug > Components: gfsh >Reporter: Barrett Oglesby >Assignee: Barrett Oglesby >Priority: Major > Labels: needsTriage > > Executing a command like this produces no output for the rebalance command > even though a region exists to rebalance: > {noformat} > gfsh -e "connect --locator=localhost[23456]" -e "rebalance"{noformat} > Output: > {noformat} > ./rebalance.sh > (1) Executing - connect --locator=localhost[23456] > Connecting to Locator at [host=localhost, port=23456] .. > Connecting to Manager at [host=192.168.1.5, port=1099] .. > Successfully connected to: [host=192.168.1.5, port=1099] > You are connected to a cluster of version: 1.16.0-build.0 > (2) Executing - rebalance{noformat} > Running from gfsh directly does: > {noformat} > gfsh>rebalance > gfsh> {noformat} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (GEODE-10103) Rebalance with no setting for include-region doesn't work for subregions
Barrett Oglesby created GEODE-10103: --- Summary: Rebalance with no setting for include-region doesn't work for subregions Key: GEODE-10103 URL: https://issues.apache.org/jira/browse/GEODE-10103 Project: Geode Issue Type: Bug Components: gfsh Reporter: Barrett Oglesby Executing a command like this produces no output for the rebalance command even though a region exists to rebalance: {noformat} gfsh -e "connect --locator=localhost[23456]" -e "rebalance"{noformat} Output: {noformat} ./rebalance.sh (1) Executing - connect --locator=localhost[23456] Connecting to Locator at [host=localhost, port=23456] .. Connecting to Manager at [host=192.168.1.5, port=1099] .. Successfully connected to: [host=192.168.1.5, port=1099] You are connected to a cluster of version: 1.16.0-build.0 (2) Executing - rebalance{noformat} Running from gfsh directly does: {noformat} gfsh>rebalance gfsh> {noformat} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Comment Edited] (GEODE-9910) Failure to auto-reconnect upon network partition
[ https://issues.apache.org/jira/browse/GEODE-9910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17494856#comment-17494856 ] Barrett Oglesby edited comment on GEODE-9910 at 2/22/22, 6:35 PM: -- With a modification to the product to simulate a failed JoinRequestMessage, I can reproduce this issue. h3. Test # Start server 1 (becomes coordinator) # Start server 2 # Play dead server 1 # The servers disconnect from each other # Server 2 disconnects from the distributed system since it doesn't have quorum When server 2 reconnects, it: - establishes quorum - starts the locator - is unable to join the distributed system (due to the modification I made) - attempts to reconnect again - fails because the locator is already started was (Author: barry.oglesby): With a modification to the product to simulate a failed JoinRequestMessage, I can reproduce this issue. h3. Test # Start server 1 (becomes coordinator) # Start server 2 # Play dead server 2 # The servers disconnect from each other # Server 2 disconnects from the distributed system since it doesn't have quorum When server 2 reconnects, it: - establishes quorum - starts the locator - is unable to join the distributed system (due to the modification I made) - attempts to reconnect again - fails because the locator is already started > Failure to auto-reconnect upon network partition > > > Key: GEODE-9910 > URL: https://issues.apache.org/jira/browse/GEODE-9910 > Project: Geode > Issue Type: Bug >Affects Versions: 1.14.0 >Reporter: Surya Mudundi >Assignee: Barrett Oglesby >Priority: Major > Labels: GeodeOperationAPI, blocks-1.15.0, needsTriage > Attachments: geode-logs.zip > > > Two node cluster with embedded locators failed to auto-reconnect when node-1 > experienced network outage for couple of minutes and when node-1 recovered > from the outage, node-2 failed to auto-reconnect. 
> node-2 tried to re-connect to node-1 as: > [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] > [] Attempting to reconnect to the distributed system. This is attempt #1. > [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] > [] Attempting to reconnect to the distributed system. This is attempt #2. > [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] > [] Attempting to reconnect to the distributed system. This is attempt #3. > Finally reported below error after 3 attempts as: > INFO > [org.apache.geode.logging.internal.LoggingProviderLoader]-[ReconnectThread] > [] Using org.apache.geode.logging.internal.SimpleLoggingProvider for service > org.apache.geode.logging.internal.spi.LoggingProvider > INFO [org.apache.geode.internal.InternalDataSerializer]-[ReconnectThread] [] > initializing InternalDataSerializer with 0 services > INFO > [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] > [] performing a quorum check to see if location services can be started early > INFO > [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] > [] Quorum check passed - allowing location services to start early > WARN > [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] > [] Exception occurred while trying to connect the system during reconnect > java.lang.IllegalStateException: A locator can not be created because one > already exists in this JVM. > at > org.apache.geode.distributed.internal.InternalLocator.createLocator(InternalLocator.java:298) > ~[geode-core-1.14.0.jar:?] > at > org.apache.geode.distributed.internal.InternalLocator.createLocator(InternalLocator.java:273) > ~[geode-core-1.14.0.jar:?] > at > org.apache.geode.distributed.internal.InternalDistributedSystem.startInitLocator(InternalDistributedSystem.java:916) > ~[geode-core-1.14.0.jar:?] 
> at > org.apache.geode.distributed.internal.InternalDistributedSystem.initialize(InternalDistributedSystem.java:768) > ~[geode-core-1.14.0.jar:?] > at > org.apache.geode.distributed.internal.InternalDistributedSystem.access$200(InternalDistributedSystem.java:135) > ~[geode-core-1.14.0.jar:?] > at > org.apache.geode.distributed.internal.InternalDistributedSystem$Builder.build(InternalDistributedSystem.java:3034) > ~[geode-core-1.14.0.jar:?] > at > org.apache.geode.distributed.internal.InternalDistributedSystem.connectInternal(InternalDistributedSystem.java:290) > ~[geode-core-1.14.0.jar:?] > at > org.apache.geode.distributed.internal.InternalDistributedSystem.reconnect(InternalDistributedSystem.java:2605) > ~[geode-core-1.14.0.jar:?] > at >
[jira] [Commented] (GEODE-9910) Failure to auto-reconnect upon network partition
[ https://issues.apache.org/jira/browse/GEODE-9910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17494856#comment-17494856 ] Barrett Oglesby commented on GEODE-9910: With a modification to the product to simulate a failed JoinRequestMessage, I can reproduce this issue. h3. Test # Start server 1 (becomes coordinator) # Start server 2 # Play dead server 2 # The servers disconnect from each other # Server 2 disconnects from the distributed system since it doesn't have quorum When server 2 reconnects, it: - establishes quorum - starts the locator - is unable to join the distributed system (due to the modification I made) - attempts to reconnect again - fails because the locator is already started > Failure to auto-reconnect upon network partition > > > Key: GEODE-9910 > URL: https://issues.apache.org/jira/browse/GEODE-9910 > Project: Geode > Issue Type: Bug >Affects Versions: 1.14.0 >Reporter: Surya Mudundi >Assignee: Barrett Oglesby >Priority: Major > Labels: GeodeOperationAPI, blocks-1.15.0, needsTriage > Attachments: geode-logs.zip > > > Two node cluster with embedded locators failed to auto-reconnect when node-1 > experienced network outage for couple of minutes and when node-1 recovered > from the outage, node-2 failed to auto-reconnect. > node-2 tried to re-connect to node-1 as: > [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] > [] Attempting to reconnect to the distributed system. This is attempt #1. > [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] > [] Attempting to reconnect to the distributed system. This is attempt #2. > [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] > [] Attempting to reconnect to the distributed system. This is attempt #3. 
> Finally reported below error after 3 attempts as: > INFO > [org.apache.geode.logging.internal.LoggingProviderLoader]-[ReconnectThread] > [] Using org.apache.geode.logging.internal.SimpleLoggingProvider for service > org.apache.geode.logging.internal.spi.LoggingProvider > INFO [org.apache.geode.internal.InternalDataSerializer]-[ReconnectThread] [] > initializing InternalDataSerializer with 0 services > INFO > [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] > [] performing a quorum check to see if location services can be started early > INFO > [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] > [] Quorum check passed - allowing location services to start early > WARN > [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] > [] Exception occurred while trying to connect the system during reconnect > java.lang.IllegalStateException: A locator can not be created because one > already exists in this JVM. > at > org.apache.geode.distributed.internal.InternalLocator.createLocator(InternalLocator.java:298) > ~[geode-core-1.14.0.jar:?] > at > org.apache.geode.distributed.internal.InternalLocator.createLocator(InternalLocator.java:273) > ~[geode-core-1.14.0.jar:?] > at > org.apache.geode.distributed.internal.InternalDistributedSystem.startInitLocator(InternalDistributedSystem.java:916) > ~[geode-core-1.14.0.jar:?] > at > org.apache.geode.distributed.internal.InternalDistributedSystem.initialize(InternalDistributedSystem.java:768) > ~[geode-core-1.14.0.jar:?] > at > org.apache.geode.distributed.internal.InternalDistributedSystem.access$200(InternalDistributedSystem.java:135) > ~[geode-core-1.14.0.jar:?] > at > org.apache.geode.distributed.internal.InternalDistributedSystem$Builder.build(InternalDistributedSystem.java:3034) > ~[geode-core-1.14.0.jar:?] 
> at > org.apache.geode.distributed.internal.InternalDistributedSystem.connectInternal(InternalDistributedSystem.java:290) > ~[geode-core-1.14.0.jar:?] > at > org.apache.geode.distributed.internal.InternalDistributedSystem.reconnect(InternalDistributedSystem.java:2605) > ~[geode-core-1.14.0.jar:?] > at > org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect(InternalDistributedSystem.java:2424) > ~[geode-core-1.14.0.jar:?] > at > org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1275) > ~[geode-core-1.14.0.jar:?] > at > org.apache.geode.distributed.internal.ClusterDistributionManager$DMListener.membershipFailure(ClusterDistributionManager.java:2326) > ~[geode-core-1.14.0.jar:?] > at > org.apache.geode.distributed.internal.membership.gms.GMSMembership.uncleanShutdown(GMSMembership.java:1187) > ~[geode-membership-1.14.0.jar:?] > at >
[jira] [Comment Edited] (GEODE-9910) Failure to auto-reconnect upon network partition
[ https://issues.apache.org/jira/browse/GEODE-9910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17494855#comment-17494855 ] Barrett Oglesby edited comment on GEODE-9910 at 2/19/22, 12:42 AM: --- Here is some analysis of this issue. h3. Server Addresses node 1: membership: 10.196.55.141(15661):42000 locator: 10.196.55.141:10335 node 2: membership: 10.196.55.142(19002):42000 locator: 10.196.55.142:10335 h3. Node2 Initial Disconnect node2 lost connectivity with node1 and removed it: {noformat} 2021-11-28 04:03:45,084 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[Geode Failure Detection thread 9] [] Availability check failed for member 10.196.55.141(15661):42000 2021-11-28 04:03:45,084 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[Geode Failure Detection thread 9] [] Requesting removal of suspect member 10.196.55.141(15661):42000 2021-11-28 04:03:45,085 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[Geode Failure Detection thread 9] [] This member is becoming the membership coordinator with address 10.196.55.142(19002):42000 {noformat} It then realized that quorum had been lost (node1 was coordinator with weight=15; node2 was not coordinator with weight=10): {noformat} 2021-11-28 04:03:45,091 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[Geode Membership View Creator] [] View Creator thread is starting 2021-11-28 04:03:45,091 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[Geode Membership View Creator] [] 10.196.55.141(15661):42000 had a weight of 15 2021-11-28 04:03:45,092 WARN [org.apache.geode.distributed.internal.membership.gms.Services]-[Geode Membership View Creator] [] total weight lost in this view change is 15 of 25. Quorum has been lost! 
2021-11-28 04:03:45,092 FATAL [org.apache.geode.distributed.internal.membership.gms.Services]-[Geode Membership View Creator] [] Possible loss of quorum due to the loss of 1 cache processes: [10.196.55.141(15661):42000] {noformat} And disconnected itself from the distributed system: {noformat} 2021-11-28 04:03:46,093 FATAL [org.apache.geode.distributed.internal.membership.gms.Services]-[Geode Membership View Creator] [] Membership service failure: Exiting due to possible network partition event due to loss of 1 cache processes: [10.196.55.141(15661):42000] org.apache.geode.distributed.internal.membership.api.MemberDisconnectedException: Exiting due to possible network partition event due to loss of 1 cache processes: [10.196.55.141(15661):42000] at org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.forceDisconnect(GMSMembership.java:1787) [geode-membership-1.14.0.jar:?] at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.forceDisconnect(GMSJoinLeave.java:1122) [geode-membership-1.14.0.jar:?] at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.access$1300(GMSJoinLeave.java:80) [geode-membership-1.14.0.jar:?] at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave$ViewCreator.prepareAndSendView(GMSJoinLeave.java:2588) [geode-membership-1.14.0.jar:?] at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave$ViewCreator.sendInitialView(GMSJoinLeave.java:2204) [geode-membership-1.14.0.jar:?] at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave$ViewCreator.run(GMSJoinLeave.java:2286) [geode-membership-1.14.0.jar:?] {noformat} It stopped its locator: {noformat} 2021-11-28 04:03:46,794 INFO [org.apache.geode.distributed.internal.InternalLocator]-[ReconnectThread] [] Distribution Locator on vmw-hcs-248e71fd-dd76-4111-ba82-379151aabbb7-3000-1-node-2/10.196.55.142 is stopped{noformat} h3. 
Node2 Reconnect Attempt 1 {noformat} 2021-11-28 04:04:46,800 INFO [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] [] Attempting to reconnect to the distributed system. This is attempt #1. {noformat} The first reconnect attempt failed to get quorum (it needed a weight of 13 but had only 10): {noformat} 2021-11-28 04:04:46,810 INFO [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] [] performing a quorum check to see if location services can be started early 2021-11-28 04:04:46,810 INFO [org.apache.geode.distributed.internal.membership.gms.messenger.GMSQuorumChecker]-[ReconnectThread] [] beginning quorum check with GMSQuorumChecker on view View[10.196.55.141(15661):42000|1] members: [10.196.55.141(15661):42000{lead}, 10.196.55.142(19002):42000] 2021-11-28 04:04:46,810 INFO [org.apache.geode.distributed.internal.membership.gms.messenger.GMSQuorumChecker]-[ReconnectThread] [] quorum check: sending request to 10.196.55.141(15661):42000 2021-11-28
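The quorum arithmetic in these logs (node1, the lead member, weighs 15; node2 weighs 10; total 25; a strict majority of 13 is required) can be checked with a small sketch. The helper name is hypothetical and the weights are taken from the log, not computed from Geode's membership code:

```java
// Sketch of the quorum check in the logs above: losing node1 drops
// 15 of 25 weight, leaving node2 with 10, below the required majority
// of 13, so node2 declares a possible network partition.
public final class QuorumSketch {
  static boolean hasQuorum(int totalWeight, int lostWeight) {
    int remaining = totalWeight - lostWeight;
    int required = totalWeight / 2 + 1; // strict majority, e.g. 13 of 25
    return remaining >= required;
  }
}
```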
[jira] [Comment Edited] (GEODE-9910) Failure to auto-reconnect upon network partition
[ https://issues.apache.org/jira/browse/GEODE-9910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17494855#comment-17494855 ] Barrett Oglesby edited comment on GEODE-9910 at 2/19/22, 12:41 AM: --- Here is some analysis of this issue. h3. Server Addresses node 1: membership: 10.196.55.141(15661):42000 locator: 10.196.55.141:10335 node 2: membership: 10.196.55.142(19002):42000 locator: 10.196.55.142:10335 h3. Node2 Initial Disconnect node2 lost connectivity with node1 and removed it: {noformat} 2021-11-28 04:03:45,084 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[Geode Failure Detection thread 9] [] Availability check failed for member 10.196.55.141(15661):42000 2021-11-28 04:03:45,084 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[Geode Failure Detection thread 9] [] Requesting removal of suspect member 10.196.55.141(15661):42000 2021-11-28 04:03:45,085 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[Geode Failure Detection thread 9] [] This member is becoming the membership coordinator with address 10.196.55.142(19002):42000 {noformat} It then realized that quorum had been lost (node1 was coordinator with weight=15; node2 was not coordinator with weight=10): {noformat} 2021-11-28 04:03:45,091 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[Geode Membership View Creator] [] View Creator thread is starting 2021-11-28 04:03:45,091 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[Geode Membership View Creator] [] 10.196.55.141(15661):42000 had a weight of 15 2021-11-28 04:03:45,092 WARN [org.apache.geode.distributed.internal.membership.gms.Services]-[Geode Membership View Creator] [] total weight lost in this view change is 15 of 25. Quorum has been lost! 
2021-11-28 04:03:45,092 FATAL [org.apache.geode.distributed.internal.membership.gms.Services]-[Geode Membership View Creator] [] Possible loss of quorum due to the loss of 1 cache processes: [10.196.55.141(15661):42000] {noformat} And disconnected itself from the distributed system: {noformat} 2021-11-28 04:03:46,093 FATAL [org.apache.geode.distributed.internal.membership.gms.Services]-[Geode Membership View Creator] [] Membership service failure: Exiting due to possible network partition event due to loss of 1 cache processes: [10.196.55.141(15661):42000] org.apache.geode.distributed.internal.membership.api.MemberDisconnectedException: Exiting due to possible network partition event due to loss of 1 cache processes: [10.196.55.141(15661):42000] at org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.forceDisconnect(GMSMembership.java:1787) [geode-membership-1.14.0.jar:?] at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.forceDisconnect(GMSJoinLeave.java:1122) [geode-membership-1.14.0.jar:?] at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.access$1300(GMSJoinLeave.java:80) [geode-membership-1.14.0.jar:?] at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave$ViewCreator.prepareAndSendView(GMSJoinLeave.java:2588) [geode-membership-1.14.0.jar:?] at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave$ViewCreator.sendInitialView(GMSJoinLeave.java:2204) [geode-membership-1.14.0.jar:?] at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave$ViewCreator.run(GMSJoinLeave.java:2286) [geode-membership-1.14.0.jar:?] {noformat} It stopped its locator: {noformat} 2021-11-28 04:03:46,794 INFO [org.apache.geode.distributed.internal.InternalLocator]-[ReconnectThread] [] Distribution Locator on vmw-hcs-248e71fd-dd76-4111-ba82-379151aabbb7-3000-1-node-2/10.196.55.142 is stopped{noformat} h3. 
Node2 Reconnect Attempt 1 {noformat} 2021-11-28 04:04:46,800 INFO [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] [] Attempting to reconnect to the distributed system. This is attempt #1. {noformat} The first reconnect attempt failed to get quorum (it needed a weight of 13 but had only 10): {noformat} 2021-11-28 04:04:46,810 INFO [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] [] performing a quorum check to see if location services can be started early 2021-11-28 04:04:46,810 INFO [org.apache.geode.distributed.internal.membership.gms.messenger.GMSQuorumChecker]-[ReconnectThread] [] beginning quorum check with GMSQuorumChecker on view View[10.196.55.141(15661):42000|1] members: [10.196.55.141(15661):42000{lead}, 10.196.55.142(19002):42000] 2021-11-28 04:04:46,810 INFO [org.apache.geode.distributed.internal.membership.gms.messenger.GMSQuorumChecker]-[ReconnectThread] [] quorum check: sending request to 10.196.55.141(15661):42000 2021-11-28
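The weight arithmetic in the logs above can be sketched as follows. This is a minimal illustration only, assuming the weighting the log lines themselves suggest (10 per cache member plus 5 extra for the lead member) and a strict-majority quorum threshold; it is not Geode's actual implementation:

```java
// Minimal sketch of the quorum arithmetic in the logs above.
// Assumptions (not Geode's code): a cache member weighs 10, the lead
// member gets 5 extra, and quorum needs a strict majority of the
// previous view's total weight.
class QuorumSketch {
    static int memberWeight(boolean isLead) {
        return 10 + (isLead ? 5 : 0);
    }

    static int quorumThreshold(int totalWeight) {
        return totalWeight / 2 + 1; // strict majority
    }

    public static void main(String[] args) {
        int node1 = memberWeight(true);      // 15, matching "had a weight of 15"
        int node2 = memberWeight(false);     // 10
        int total = node1 + node2;           // 25, matching "15 of 25"
        int needed = quorumThreshold(total); // 13, matching "needed a weight of 13"
        System.out.println("surviving weight " + node2 + ", needed " + needed);
    }
}
```

Under these assumptions node2's surviving weight of 10 can never reach the threshold of 13, so every reconnect attempt's quorum check fails until node1 responds.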
[jira] [Commented] (GEODE-9910) Failure to auto-reconnect upon network partition
[ https://issues.apache.org/jira/browse/GEODE-9910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17493725#comment-17493725 ] Barrett Oglesby commented on GEODE-9910: Can you attach the full logs of both servers? > Failure to auto-reconnect upon network partition > > > Key: GEODE-9910 > URL: https://issues.apache.org/jira/browse/GEODE-9910 > Project: Geode > Issue Type: Bug >Affects Versions: 1.14.0 >Reporter: Surya Mudundi >Priority: Major > Labels: GeodeOperationAPI, blocks-1.15.0, needsTriage > > A two-node cluster with embedded locators failed to auto-reconnect when node-1 > experienced a network outage for a couple of minutes, and when node-1 recovered > from the outage, node-2 failed to auto-reconnect. > node-2 tried to re-connect to node-1 as: > [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] > [] Attempting to reconnect to the distributed system. This is attempt #1. > [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] > [] Attempting to reconnect to the distributed system. This is attempt #2. > [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] > [] Attempting to reconnect to the distributed system. This is attempt #3. 
> Finally reported below error after 3 attempts as: > INFO > [org.apache.geode.logging.internal.LoggingProviderLoader]-[ReconnectThread] > [] Using org.apache.geode.logging.internal.SimpleLoggingProvider for service > org.apache.geode.logging.internal.spi.LoggingProvider > INFO [org.apache.geode.internal.InternalDataSerializer]-[ReconnectThread] [] > initializing InternalDataSerializer with 0 services > INFO > [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] > [] performing a quorum check to see if location services can be started early > INFO > [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] > [] Quorum check passed - allowing location services to start early > WARN > [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] > [] Exception occurred while trying to connect the system during reconnect > java.lang.IllegalStateException: A locator can not be created because one > already exists in this JVM. > at > org.apache.geode.distributed.internal.InternalLocator.createLocator(InternalLocator.java:298) > ~[geode-core-1.14.0.jar:?] > at > org.apache.geode.distributed.internal.InternalLocator.createLocator(InternalLocator.java:273) > ~[geode-core-1.14.0.jar:?] > at > org.apache.geode.distributed.internal.InternalDistributedSystem.startInitLocator(InternalDistributedSystem.java:916) > ~[geode-core-1.14.0.jar:?] > at > org.apache.geode.distributed.internal.InternalDistributedSystem.initialize(InternalDistributedSystem.java:768) > ~[geode-core-1.14.0.jar:?] > at > org.apache.geode.distributed.internal.InternalDistributedSystem.access$200(InternalDistributedSystem.java:135) > ~[geode-core-1.14.0.jar:?] > at > org.apache.geode.distributed.internal.InternalDistributedSystem$Builder.build(InternalDistributedSystem.java:3034) > ~[geode-core-1.14.0.jar:?] 
> at > org.apache.geode.distributed.internal.InternalDistributedSystem.connectInternal(InternalDistributedSystem.java:290) > ~[geode-core-1.14.0.jar:?] > at > org.apache.geode.distributed.internal.InternalDistributedSystem.reconnect(InternalDistributedSystem.java:2605) > ~[geode-core-1.14.0.jar:?] > at > org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect(InternalDistributedSystem.java:2424) > ~[geode-core-1.14.0.jar:?] > at > org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1275) > ~[geode-core-1.14.0.jar:?] > at > org.apache.geode.distributed.internal.ClusterDistributionManager$DMListener.membershipFailure(ClusterDistributionManager.java:2326) > ~[geode-core-1.14.0.jar:?] > at > org.apache.geode.distributed.internal.membership.gms.GMSMembership.uncleanShutdown(GMSMembership.java:1187) > ~[geode-membership-1.14.0.jar:?] > at > org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.lambda$forceDisconnect$0(GMSMembership.java:1811) > ~[geode-membership-1.14.0.jar:?] > at java.lang.Thread.run(Thread.java:829) [?:?] > -- This message was sent by Atlassian Jira (v8.20.1#820001)
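The IllegalStateException in the stack trace above comes from a JVM-wide singleton guard in InternalLocator.createLocator. A stripped-down illustration of that failure mode (invented names, not Geode's actual code) shows why a reconnect path that does not clear the old locator reference cannot create a new one:

```java
import java.util.concurrent.atomic.AtomicReference;

// Stripped-down illustration (invented names, not Geode's code) of a
// JVM-wide singleton guard like InternalLocator's: if shutdown fails to
// clear the static reference, a later create() during reconnect must
// throw, which matches the stack trace above.
class LocatorGuardSketch {
    private static final AtomicReference<LocatorGuardSketch> INSTANCE =
        new AtomicReference<>();

    static LocatorGuardSketch create() {
        LocatorGuardSketch created = new LocatorGuardSketch();
        if (!INSTANCE.compareAndSet(null, created)) {
            throw new IllegalStateException(
                "A locator can not be created because one already exists in this JVM.");
        }
        return created;
    }

    // clearInstance=false models a stop path that forgets to release the guard.
    void stop(boolean clearInstance) {
        if (clearInstance) {
            INSTANCE.compareAndSet(this, null);
        }
    }
}
```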
[jira] [Assigned] (GEODE-9910) Failure to auto-reconnect upon network partition
[ https://issues.apache.org/jira/browse/GEODE-9910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barrett Oglesby reassigned GEODE-9910: -- Assignee: Barrett Oglesby > Failure to auto-reconnect upon network partition > > > Key: GEODE-9910 > URL: https://issues.apache.org/jira/browse/GEODE-9910 > Project: Geode > Issue Type: Bug >Affects Versions: 1.14.0 >Reporter: Surya Mudundi >Assignee: Barrett Oglesby >Priority: Major > Labels: GeodeOperationAPI, blocks-1.15.0, needsTriage > > A two-node cluster with embedded locators failed to auto-reconnect when node-1 > experienced a network outage for a couple of minutes, and when node-1 recovered > from the outage, node-2 failed to auto-reconnect. > node-2 tried to re-connect to node-1 as: > [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] > [] Attempting to reconnect to the distributed system. This is attempt #1. > [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] > [] Attempting to reconnect to the distributed system. This is attempt #2. > [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] > [] Attempting to reconnect to the distributed system. This is attempt #3. 
> Finally reported below error after 3 attempts as: > INFO > [org.apache.geode.logging.internal.LoggingProviderLoader]-[ReconnectThread] > [] Using org.apache.geode.logging.internal.SimpleLoggingProvider for service > org.apache.geode.logging.internal.spi.LoggingProvider > INFO [org.apache.geode.internal.InternalDataSerializer]-[ReconnectThread] [] > initializing InternalDataSerializer with 0 services > INFO > [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] > [] performing a quorum check to see if location services can be started early > INFO > [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] > [] Quorum check passed - allowing location services to start early > WARN > [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] > [] Exception occurred while trying to connect the system during reconnect > java.lang.IllegalStateException: A locator can not be created because one > already exists in this JVM. > at > org.apache.geode.distributed.internal.InternalLocator.createLocator(InternalLocator.java:298) > ~[geode-core-1.14.0.jar:?] > at > org.apache.geode.distributed.internal.InternalLocator.createLocator(InternalLocator.java:273) > ~[geode-core-1.14.0.jar:?] > at > org.apache.geode.distributed.internal.InternalDistributedSystem.startInitLocator(InternalDistributedSystem.java:916) > ~[geode-core-1.14.0.jar:?] > at > org.apache.geode.distributed.internal.InternalDistributedSystem.initialize(InternalDistributedSystem.java:768) > ~[geode-core-1.14.0.jar:?] > at > org.apache.geode.distributed.internal.InternalDistributedSystem.access$200(InternalDistributedSystem.java:135) > ~[geode-core-1.14.0.jar:?] > at > org.apache.geode.distributed.internal.InternalDistributedSystem$Builder.build(InternalDistributedSystem.java:3034) > ~[geode-core-1.14.0.jar:?] 
> at > org.apache.geode.distributed.internal.InternalDistributedSystem.connectInternal(InternalDistributedSystem.java:290) > ~[geode-core-1.14.0.jar:?] > at > org.apache.geode.distributed.internal.InternalDistributedSystem.reconnect(InternalDistributedSystem.java:2605) > ~[geode-core-1.14.0.jar:?] > at > org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect(InternalDistributedSystem.java:2424) > ~[geode-core-1.14.0.jar:?] > at > org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1275) > ~[geode-core-1.14.0.jar:?] > at > org.apache.geode.distributed.internal.ClusterDistributionManager$DMListener.membershipFailure(ClusterDistributionManager.java:2326) > ~[geode-core-1.14.0.jar:?] > at > org.apache.geode.distributed.internal.membership.gms.GMSMembership.uncleanShutdown(GMSMembership.java:1187) > ~[geode-membership-1.14.0.jar:?] > at > org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.lambda$forceDisconnect$0(GMSMembership.java:1811) > ~[geode-membership-1.14.0.jar:?] > at java.lang.Thread.run(Thread.java:829) [?:?] > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (GEODE-10009) The CacheClientProxy for a durable client can be terminated when it shouldn't be
[ https://issues.apache.org/jira/browse/GEODE-10009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barrett Oglesby resolved GEODE-10009. - Fix Version/s: 1.15.0 Resolution: Fixed > The CacheClientProxy for a durable client can be terminated when it shouldn't > be > > > Key: GEODE-10009 > URL: https://issues.apache.org/jira/browse/GEODE-10009 > Project: Geode > Issue Type: Bug > Components: client queues >Affects Versions: 1.15.0 >Reporter: Barrett Oglesby >Assignee: Barrett Oglesby >Priority: Major > Labels: blocks-1.15.0, pull-request-available > Fix For: 1.15.0 > > > When the client connection is closed but the server has not left or crashed > (e.g. in the re-authentication failure case), it's possible that two threads in > a durable client can interleave in a way that causes an extra durable task to > be created on the server that eventually causes the CacheClientProxy to be > terminated. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (GEODE-10009) The CacheClientProxy for a durable client can be terminated when it shouldn't be
[ https://issues.apache.org/jira/browse/GEODE-10009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barrett Oglesby reassigned GEODE-10009: --- Assignee: Barrett Oglesby > The CacheClientProxy for a durable client can be terminated when it shouldn't > be > > > Key: GEODE-10009 > URL: https://issues.apache.org/jira/browse/GEODE-10009 > Project: Geode > Issue Type: Bug > Components: client queues >Affects Versions: 1.15.0 >Reporter: Barrett Oglesby >Assignee: Barrett Oglesby >Priority: Major > Labels: needsTriage > > When the client connection is closed but the server has not left or crashed > (e.g. in the re-authentication failure case), it's possible that two threads in > a durable client can interleave in a way that causes an extra durable task to > be created on the server that eventually causes the CacheClientProxy to be > terminated. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (GEODE-10009) The CacheClientProxy for a durable client can be terminated when it shouldn't be
Barrett Oglesby created GEODE-10009: --- Summary: The CacheClientProxy for a durable client can be terminated when it shouldn't be Key: GEODE-10009 URL: https://issues.apache.org/jira/browse/GEODE-10009 Project: Geode Issue Type: Bug Components: client queues Affects Versions: 1.15.0 Reporter: Barrett Oglesby When the client connection is closed but the server has not left or crashed (e.g. in the re-authentication failure case), it's possible that two threads in a durable client can interleave in a way that causes an extra durable task to be created on the server that eventually causes the CacheClientProxy to be terminated. -- This message was sent by Atlassian Jira (v8.20.1#820001)
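The interleaving described in this issue is a classic check-then-act race. A hypothetical sketch (invented names, not the actual CacheClientProxy or durable-client code) of the unsafe pattern, alongside a synchronized variant that would prevent the extra task:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch (invented names, not Geode's code) of the
// check-then-act race: without synchronization, two threads can both
// observe taskScheduled == false and both create a durable task.
class DurableTaskSketch {
    private boolean taskScheduled = false;
    final AtomicInteger tasksCreated = new AtomicInteger();

    // Unsafe: the check and the update are not atomic, so two threads
    // interleaving between them can each schedule a task.
    void maybeScheduleUnsafe() {
        if (!taskScheduled) {
            taskScheduled = true;
            tasksCreated.incrementAndGet();
        }
    }

    // Safe: holding the lock makes the check-then-act atomic, so at
    // most one task is ever created.
    synchronized void maybeScheduleSafe() {
        if (!taskScheduled) {
            taskScheduled = true;
            tasksCreated.incrementAndGet();
        }
    }
}
```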
[jira] [Resolved] (GEODE-9913) A retried event can fail if the original event is still being processed and a new event for that same key occurs at the same time
[ https://issues.apache.org/jira/browse/GEODE-9913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barrett Oglesby resolved GEODE-9913. Fix Version/s: 1.15.0 Resolution: Fixed > A retried event can fail if the original event is still being processed and a > new event for that same key occurs at the same time > - > > Key: GEODE-9913 > URL: https://issues.apache.org/jira/browse/GEODE-9913 > Project: Geode > Issue Type: Bug > Components: client/server >Reporter: Barrett Oglesby >Assignee: Barrett Oglesby >Priority: Major > Labels: GeodeOperationAPI, pull-request-available > Fix For: 1.15.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (GEODE-9528) CI Failure: DistributionAdvisorIntegrationTest > verifyMembershipListenerIsRemovedAfterForceDisconnect
[ https://issues.apache.org/jira/browse/GEODE-9528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17478057#comment-17478057 ] Barrett Oglesby commented on GEODE-9528: I backported this change to support/1.14, support/1.13 and support/1.12. > CI Failure: DistributionAdvisorIntegrationTest > > verifyMembershipListenerIsRemovedAfterForceDisconnect > -- > > Key: GEODE-9528 > URL: https://issues.apache.org/jira/browse/GEODE-9528 > Project: Geode > Issue Type: Bug > Components: membership >Affects Versions: 1.12.5, 1.13.5, 1.14.0 >Reporter: Owen Nichols >Assignee: Barrett Oglesby >Priority: Major > Labels: pull-request-available > Fix For: 1.12.9, 1.13.7, 1.14.3, 1.15.0 > > > {noformat} > org.apache.geode.distributed.internal.DistributionAdvisorIntegrationTest > > verifyMembershipListenerIsRemovedAfterForceDisconnect FAILED > org.junit.ComparisonFailure: expected:<[fals]e> but was:<[tru]e> > at > jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at > org.apache.geode.distributed.internal.DistributionAdvisorIntegrationTest.verifyMembershipListenerIsRemovedAfterForceDisconnect(DistributionAdvisorIntegrationTest.java:57) > {noformat} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (GEODE-9528) CI Failure: DistributionAdvisorIntegrationTest > verifyMembershipListenerIsRemovedAfterForceDisconnect
[ https://issues.apache.org/jira/browse/GEODE-9528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barrett Oglesby updated GEODE-9528: --- Fix Version/s: 1.12.9 1.13.7 1.14.3 > CI Failure: DistributionAdvisorIntegrationTest > > verifyMembershipListenerIsRemovedAfterForceDisconnect > -- > > Key: GEODE-9528 > URL: https://issues.apache.org/jira/browse/GEODE-9528 > Project: Geode > Issue Type: Bug > Components: membership >Affects Versions: 1.12.5, 1.13.5, 1.14.0 >Reporter: Owen Nichols >Assignee: Barrett Oglesby >Priority: Major > Labels: pull-request-available > Fix For: 1.12.9, 1.13.7, 1.14.3, 1.15.0 > > > {noformat} > org.apache.geode.distributed.internal.DistributionAdvisorIntegrationTest > > verifyMembershipListenerIsRemovedAfterForceDisconnect FAILED > org.junit.ComparisonFailure: expected:<[fals]e> but was:<[tru]e> > at > jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at > org.apache.geode.distributed.internal.DistributionAdvisorIntegrationTest.verifyMembershipListenerIsRemovedAfterForceDisconnect(DistributionAdvisorIntegrationTest.java:57) > {noformat} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (GEODE-9913) A retried event can fail if the original event is still being processed and a new event for that same key occurs at the same time
Barrett Oglesby created GEODE-9913: -- Summary: A retried event can fail if the original event is still being processed and a new event for that same key occurs at the same time Key: GEODE-9913 URL: https://issues.apache.org/jira/browse/GEODE-9913 Project: Geode Issue Type: Bug Components: client/server Reporter: Barrett Oglesby -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (GEODE-9913) A retried event can fail if the original event is still being processed and a new event for that same key occurs at the same time
[ https://issues.apache.org/jira/browse/GEODE-9913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barrett Oglesby reassigned GEODE-9913: -- Assignee: Barrett Oglesby > A retried event can fail if the original event is still being processed and a > new event for that same key occurs at the same time > - > > Key: GEODE-9913 > URL: https://issues.apache.org/jira/browse/GEODE-9913 > Project: Geode > Issue Type: Bug > Components: client/server >Reporter: Barrett Oglesby >Assignee: Barrett Oglesby >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (GEODE-9865) ConnectionManagerImpl forceCreateConnection to a specific server increments the count regardless whether the connection is successful
[ https://issues.apache.org/jira/browse/GEODE-9865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barrett Oglesby resolved GEODE-9865. Fix Version/s: 1.15.0 Resolution: Fixed > ConnectionManagerImpl forceCreateConnection to a specific server increments > the count regardless whether the connection is successful > - > > Key: GEODE-9865 > URL: https://issues.apache.org/jira/browse/GEODE-9865 > Project: Geode > Issue Type: Bug > Components: client/server >Affects Versions: 1.14.0 >Reporter: Barrett Oglesby >Assignee: Barrett Oglesby >Priority: Major > Labels: pull-request-available > Fix For: 1.15.0 > > > *ConnectionManagerImpl forceCreateConnection* does: > {noformat} > private PooledConnection forceCreateConnection(ServerLocation serverLocation) > throws ServerRefusedConnectionException, ServerOperationException { > connectionAccounting.create(); > try { > return createPooledConnection(serverLocation); > } catch (GemFireSecurityException e) { > throw new ServerOperationException(e); > } > }{noformat} > The call to *connectionAccounting.create()* increments the count. If > *createPooledConnection* is unsuccessful, the count is not decremented. This > causes the client to think there are more connections than there actually are. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (GEODE-9865) ConnectionManagerImpl forceCreateConnection to a specific server increments the count regardless whether the connection is successful
Barrett Oglesby created GEODE-9865: -- Summary: ConnectionManagerImpl forceCreateConnection to a specific server increments the count regardless whether the connection is successful Key: GEODE-9865 URL: https://issues.apache.org/jira/browse/GEODE-9865 Project: Geode Issue Type: Bug Components: client/server Affects Versions: 1.14.0 Reporter: Barrett Oglesby *ConnectionManagerImpl forceCreateConnection* does: {noformat} private PooledConnection forceCreateConnection(ServerLocation serverLocation) throws ServerRefusedConnectionException, ServerOperationException { connectionAccounting.create(); try { return createPooledConnection(serverLocation); } catch (GemFireSecurityException e) { throw new ServerOperationException(e); } }{noformat} The call to *connectionAccounting.create()* increments the count. If *createPooledConnection* is unsuccessful, the count is not decremented. This causes the client to think there are more connections than there actually are. -- This message was sent by Atlassian Jira (v8.20.1#820001)
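One plausible shape of a fix for the leak described above (an assumption about the remedy, not the actual patch) is to roll back the optimistic increment whenever the create fails, for example with a try/finally:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of one plausible fix (not the actual patch): undo the
// optimistic connectionAccounting.create() increment when the pooled
// connection cannot be created, so failures do not inflate the count.
class ConnectionAccountingSketch {
    final AtomicInteger count = new AtomicInteger();

    Object forceCreateConnection(boolean createSucceeds) {
        count.incrementAndGet(); // stands in for connectionAccounting.create()
        boolean created = false;
        try {
            if (!createSucceeds) {
                // stands in for createPooledConnection(...) failing
                throw new RuntimeException("server refused connection");
            }
            created = true;
            return new Object(); // stands in for the pooled connection
        } finally {
            if (!created) {
                count.decrementAndGet(); // undo the increment on failure
            }
        }
    }
}
```

With this shape, a failed create leaves the count exactly where it started, so the client's view of its connection count stays accurate.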
[jira] [Assigned] (GEODE-9865) ConnectionManagerImpl forceCreateConnection to a specific server increments the count regardless whether the connection is successful
[ https://issues.apache.org/jira/browse/GEODE-9865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barrett Oglesby reassigned GEODE-9865: -- Assignee: Barrett Oglesby > ConnectionManagerImpl forceCreateConnection to a specific server increments > the count regardless whether the connection is successful > - > > Key: GEODE-9865 > URL: https://issues.apache.org/jira/browse/GEODE-9865 > Project: Geode > Issue Type: Bug > Components: client/server >Affects Versions: 1.14.0 >Reporter: Barrett Oglesby >Assignee: Barrett Oglesby >Priority: Major > > *ConnectionManagerImpl forceCreateConnection* does: > {noformat} > private PooledConnection forceCreateConnection(ServerLocation serverLocation) > throws ServerRefusedConnectionException, ServerOperationException { > connectionAccounting.create(); > try { > return createPooledConnection(serverLocation); > } catch (GemFireSecurityException e) { > throw new ServerOperationException(e); > } > }{noformat} > The call to *connectionAccounting.create()* increments the count. If > *createPooledConnection* is unsuccessful, the count is not decremented. This > causes the client to think there are more connections than there actually are. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (GEODE-9664) Two different clients with the same durable id will both connect to the servers and receive messages
[ https://issues.apache.org/jira/browse/GEODE-9664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17424732#comment-17424732 ] Barrett Oglesby commented on GEODE-9664: I checked the behavior of the client if there are no servers. I was thinking that these durable client scenarios should behave similarly to the no-servers scenario. When the Pool is created, QueueManagerImpl.initializeConnections attempts to create the connections. If there are no servers, the ConnectionList's primaryDiscoveryException is initialized like this: {noformat} List servers = findQueueServers(excludedServers, queuesNeeded, true, false, null); if (servers == null || servers.isEmpty()) { scheduleRedundancySatisfierIfNeeded(redundancyRetryInterval); synchronized (lock) { queueConnections = queueConnections.setPrimaryDiscoveryFailed(null); lock.notifyAll(); } return; } {noformat} And the empty ConnectionList is created here: {noformat} java.lang.Exception: Stack trace at java.lang.Thread.dumpStack(Thread.java:1333) at org.apache.geode.cache.client.internal.QueueManagerImpl$ConnectionList.(QueueManagerImpl.java:1318) at org.apache.geode.cache.client.internal.QueueManagerImpl$ConnectionList.setPrimaryDiscoveryFailed(QueueManagerImpl.java:1337) at org.apache.geode.cache.client.internal.QueueManagerImpl.initializeConnections(QueueManagerImpl.java:439) at org.apache.geode.cache.client.internal.QueueManagerImpl.start(QueueManagerImpl.java:293) at org.apache.geode.cache.client.internal.PoolImpl.start(PoolImpl.java:359) at org.apache.geode.cache.client.internal.PoolImpl.finishCreate(PoolImpl.java:183) at org.apache.geode.cache.client.internal.PoolImpl.create(PoolImpl.java:169) at org.apache.geode.internal.cache.PoolFactoryImpl.create(PoolFactoryImpl.java:378) {noformat} Then when Region.registerInterestForAllKeys is called, it invokes ServerRegionProxy.registerInterest, which: - adds the key to the RegisterInterestTracker - executes the RegisterInterestOp - removes the key from the
RegisterInterestTracker if the RegisterInterestOp fails. Here is the code in Region.registerInterestForAllKeys that does the above steps: {noformat} try { rit.addSingleInterest(region, key, interestType, policy, isDurable, receiveUpdatesAsInvalidates); result = RegisterInterestOp.execute(pool, regionName, key, interestType, policy, isDurable, receiveUpdatesAsInvalidates, regionDataPolicy); finished = true; return result; } finally { if (!finished) { rit.removeSingleInterest(region, key, interestType, isDurable, receiveUpdatesAsInvalidates); } } {noformat} The Connections are retrieved in QueueManagerImpl.getAllConnections. If there are none, a NoSubscriptionServersAvailableException wrapping the primaryDiscoveryException is thrown: {noformat} Exception in thread "main" org.apache.geode.cache.NoSubscriptionServersAvailableException: org.apache.geode.cache.NoSubscriptionServersAvailableException: Primary discovery failed. at org.apache.geode.cache.client.internal.QueueManagerImpl.getAllConnections(QueueManagerImpl.java:191) at org.apache.geode.cache.client.internal.OpExecutorImpl.executeOnQueuesAndReturnPrimaryResult(OpExecutorImpl.java:428) at org.apache.geode.cache.client.internal.PoolImpl.executeOnQueuesAndReturnPrimaryResult(PoolImpl.java:875) at org.apache.geode.cache.client.internal.RegisterInterestOp.execute(RegisterInterestOp.java:58) at org.apache.geode.cache.client.internal.ServerRegionProxy.registerInterest(ServerRegionProxy.java:364) at org.apache.geode.internal.cache.LocalRegion.processSingleInterest(LocalRegion.java:3815) at org.apache.geode.internal.cache.LocalRegion.registerInterestRegex(LocalRegion.java:3911) at org.apache.geode.internal.cache.LocalRegion.registerInterestRegex(LocalRegion.java:3890) at org.apache.geode.internal.cache.LocalRegion.registerInterestRegex(LocalRegion.java:3885) at org.apache.geode.cache.Region.registerInterestForAllKeys(Region.java:1709) {noformat} Here is logging that shows all this behavior: {noformat} [warn 2021/10/04
10:58:22.184 PDT client-a-1 tid=0x1] XXX ConnectionList. primaryDiscoveryException=org.apache.geode.cache.NoSubscriptionServersAvailableException: Primary discovery failed. [warn 2021/10/04 10:58:22.238 PDT client-a-1 tid=0x1] XXX RegisterInterestTracker.addSingleInterest key=.*; rieInterests={.*=KEYS_VALUES} [warn 2021/10/04 10:58:22.238 PDT client-a-1 tid=0x1] XXX ServerRegionProxy.registerInterest about to execute RegisterInterestOp [warn 2021/10/04 10:58:22.244 PDT client-a-1 tid=0x1] XXX QueueManagerImpl.getAllConnections about to throw exception=org.apache.geode.cache.NoSubscriptionServersAvailableException: org.apache.geode.cache.NoSubscriptionServersAvailableException: Primary
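The register-then-roll-back steps listed above (add the key to the tracker, execute the remote op, remove the key if the op fails) reduce to a small try/finally shape. The sketch below is illustrative only: a Set stands in for RegisterInterestTracker, and a Runnable stands in for RegisterInterestOp.execute.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of "add interest locally, execute the remote op, and undo the local
// add if the op fails". The Set stands in for RegisterInterestTracker; the
// Runnable stands in for RegisterInterestOp.execute. Illustrative names only.
public class InterestSketch {
    private final Set<String> tracker = new HashSet<>();

    public boolean registerInterest(String key, Runnable remoteOp) {
        tracker.add(key);            // optimistic local registration
        boolean finished = false;
        try {
            remoteOp.run();          // may throw, e.g. NoSubscriptionServersAvailableException
            finished = true;
            return true;
        } finally {
            if (!finished) {
                tracker.remove(key); // roll back on any abnormal exit path
            }
        }
    }

    // Convenience wrapper that swallows the failure and reports the outcome.
    public boolean tryRegisterInterest(String key, Runnable remoteOp) {
        try {
            return registerInterest(key, remoteOp);
        } catch (RuntimeException e) {
            return false;
        }
    }

    public boolean isTracked(String key) {
        return tracker.contains(key);
    }
}
```

The `finished` flag (the same idiom as in the Geode code above) avoids catching and rethrowing: any exit that is not the normal return triggers the rollback in the finally block.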
[jira] [Created] (GEODE-9664) Two different clients with the same durable id will both connect to the servers and receive messages
Barrett Oglesby created GEODE-9664: -- Summary: Two different clients with the same durable id will both connect to the servers and receive messages Key: GEODE-9664 URL: https://issues.apache.org/jira/browse/GEODE-9664 Project: Geode Issue Type: Bug Components: client queues Reporter: Barrett Oglesby There are two cases: # The number of queues is the same as the number of servers (e.g. client with subscription-redundancy=1 and 2 servers) # The number of queues is less than the number of servers (e.g. client with subscription-redundancy=0 and 2 servers) h2. Case 1 In this case, the client first attempts to connect to the primary and fails. {noformat} [warn 2021/10/01 14:37:56.209 PDT server-1 tid=0x4b] XXX CacheClientNotifier.registerClientInternal about to register clientProxyMembershipID=identity(127.0.0.1(client-a-2:89832:loner):61596:fad3ca3d:client-a-2,connection=2,durableAttributes=DurableClientAttributes[id=client-a; timeout=300]) [warn 2021/10/01 14:37:56.209 PDT server-1 tid=0x4b] XXX CacheClientNotifier.registerClientInternal existing proxy=CacheClientProxy[identity(127.0.0.1(client-a-1:89806:loner):61573:10a9ca3d:client-a-1,connection=2,durableAttributes=DurableClientAttributes[id=client-a; timeout=300]); port=61581; primary=true; version=GEODE 1.15.0] [warn 2021/10/01 14:37:56.210 PDT server-1 tid=0x4b] XXX CacheClientNotifier.registerClientInternal existing proxy isPaused=false [warn 2021/10/01 14:37:56.210 PDT server-1 tid=0x4b] The requested durable client has the same identifier ( client-a ) as an existing durable client ( CacheClientProxy[identity(127.0.0.1(client-a-1:89806:loner):61573:10a9ca3d:client-a-1,connection=2,durableAttributes=DurableClientAttributes[id=client-a; timeout=300]); port=61581; primary=true; version=GEODE 1.15.0] ). Duplicate durable clients are not allowed. 
[warn 2021/10/01 14:37:56.210 PDT server-1 tid=0x4b] CacheClientNotifier: Unsuccessfully registered client with identifier identity(127.0.0.1(client-a-2:89832:loner):61596:fad3ca3d:client-a-2,connection=2,durableAttributes=DurableClientAttributes[id=client-a; timeout=300]) and response code 64 {noformat} It then attempts to connect to the secondary and succeeds. {noformat} [warn 2021/10/01 14:37:56.215 PDT server-2 tid=0x47] XXX CacheClientNotifier.registerClientInternal about to register clientProxyMembershipID=identity(127.0.0.1(client-a-2:89832:loner):61596:fad3ca3d:client-a-2,connection=2,durableAttributes=DurableClientAttributes[id=client-a; timeout=300]) [warn 2021/10/01 14:37:56.215 PDT server-2 tid=0x47] XXX CacheClientNotifier.registerClientInternal existing proxy=CacheClientProxy[identity(127.0.0.1(client-a-1:89806:loner):61573:10a9ca3d:client-a-1,connection=2,durableAttributes=DurableClientAttributes[id=client-a; timeout=300]); port=61578; primary=false; version=GEODE 1.15.0] [warn 2021/10/01 14:37:56.216 PDT server-2 tid=0x47] XXX CacheClientNotifier.registerClientInternal existing proxy isPaused=true [warn 2021/10/01 14:37:56.217 PDT server-2 tid=0x47] XXX CacheClientNotifier.registerClientInternal reinitialized existing proxy=CacheClientProxy[identity(127.0.0.1(client-a-1:89806:loner):61573:10a9ca3d:client-a-1,connection=2,durableAttributes=DurableClientAttributes[id=client-a; timeout=300]); port=61578; primary=true; version=GEODE 1.15.0] {noformat} The previous secondary is reinitialized and made into a primary. Both queues will dispatch events. The CacheClientNotifier.registerClientInternal method invoked when a client connects does: {noformat} if (cacheClientProxy.isPaused()) { ... 
cacheClientProxy.reinitialize(...); } else { unsuccessfulMsg = String.format("The requested durable client has the same identifier ( %s ) as an existing durable client...); logger.warn(unsuccessfulMsg); } {noformat} The CacheClientProxy is paused when the durable client it represents has disconnected. Unfortunately, a secondary CacheClientProxy is also paused. So, this check is not good enough to prevent a duplicate durable client from connecting. There are a few things that can also be checked. One of them is: {noformat} cacheClientProxy.getCommBuffer() == null {noformat} With that check added, when the client attempts to connect to the secondary, it fails just as it does with the primary. The client then exits with this exception: {noformat} geode.cache.NoSubscriptionServersAvailableException: org.apache.geode.cache.NoSubscriptionServersAvailableException: Could not initialize a primary queue on startup. No queue servers available. at org.apache.geode.cache.client.internal.QueueManagerImpl.getAllConnections(QueueManagerImpl.java:191) at org.apache.geode.cache.client.internal.OpExecutorImpl.executeOnQueuesAndReturnPrimaryResult(OpExecutorImpl.java:428) at
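The stricter check described above treats a proxy as reinitializable only when it is paused and its communication buffer has been released. The sketch below captures that predicate; DurableProxyStub is a hypothetical stand-in for CacheClientProxy, and only the isPaused()/getCommBuffer() combination comes from the comment.

```java
// Sketch of the proposed duplicate-durable-client check. DurableProxyStub is a
// hypothetical stand-in for CacheClientProxy; only the isPaused()/getCommBuffer()
// combination is taken from the discussion above.
public class DurableRegistrationCheck {

    public static class DurableProxyStub {
        private final boolean paused;
        private final Object commBuffer; // null once the old client has actually disconnected

        public DurableProxyStub(boolean paused, Object commBuffer) {
            this.paused = paused;
            this.commBuffer = commBuffer;
        }

        public boolean isPaused() { return paused; }
        public Object getCommBuffer() { return commBuffer; }
    }

    // A durable reconnect should only reinitialize an existing proxy when it is
    // paused AND its comm buffer has been released. A paused secondary proxy for
    // a still-live client keeps its comm buffer, so it fails this test.
    public static boolean mayReinitialize(DurableProxyStub existing) {
        return existing.isPaused() && existing.getCommBuffer() == null;
    }
}
```

With this predicate, a duplicate durable client is rejected by the secondary for the same reason the primary rejects it, which is what produces the NoSubscriptionServersAvailableException shown above.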
[jira] [Created] (GEODE-9620) The CacheServerStats currentQueueConnections statistic is incremented and decremented twice per client queue
Barrett Oglesby created GEODE-9620: -- Summary: The CacheServerStats currentQueueConnections statistic is incremented and decremented twice per client queue Key: GEODE-9620 URL: https://issues.apache.org/jira/browse/GEODE-9620 Project: Geode Issue Type: Bug Components: client queues Reporter: Barrett Oglesby The CacheServerStats currentQueueConnections statistic is incremented and decremented twice per client queue. When a client with subscription enabled connects to the server, the CacheServerStats currentQueueConnections statistic is incremented twice. Once by the ServerConnection thread here: {noformat} [warn 2021/09/21 11:22:18.851 PDT server-1 tid=0x41] XXX CacheServerStats.incCurrentQueueConnections currentQueueConnectionsId=1 java.lang.Exception at org.apache.geode.internal.cache.tier.sockets.CacheServerStats.incCurrentQueueConnections(CacheServerStats.java:660) at org.apache.geode.internal.cache.tier.sockets.ServerConnection.handshakeAccepted(ServerConnection.java:705) at org.apache.geode.internal.cache.tier.sockets.ServerConnection.acceptHandShake(ServerConnection.java:682) at org.apache.geode.internal.cache.tier.sockets.ServerConnection.processHandShake(ServerConnection.java:613) at org.apache.geode.internal.cache.tier.sockets.ServerConnection.verifyClientConnection(ServerConnection.java:404) at org.apache.geode.internal.cache.tier.sockets.ServerConnection.doHandshake(ServerConnection.java:787) {noformat} And once by the Client Queue Initialization Thread here: {noformat} [warn 2021/09/21 11:22:18.884 PDT server-1 tid=0x44] XXX CacheServerStats.incCurrentQueueConnections currentQueueConnectionsId=2 java.lang.Exception at org.apache.geode.internal.cache.tier.sockets.CacheServerStats.incCurrentQueueConnections(CacheServerStats.java:660) at org.apache.geode.internal.cache.tier.sockets.CacheClientProxy.(CacheClientProxy.java:342) at org.apache.geode.internal.cache.tier.sockets.CacheClientProxy.(CacheClientProxy.java:306) at 
org.apache.geode.internal.cache.tier.sockets.CacheClientNotifier.registerClientInternal(CacheClientNotifier.java:379) at org.apache.geode.internal.cache.tier.sockets.CacheClientNotifier.registerClient(CacheClientNotifier.java:198) at org.apache.geode.internal.cache.tier.sockets.AcceptorImpl$ClientQueueInitializerTask.run(AcceptorImpl.java:1896) {noformat} When the client disconnects from the server, the CacheServerStats currentQueueConnections statistic is decremented twice. Once by the ServerConnection thread here: {noformat} [warn 2021/09/21 11:24:01.129 PDT server-1 tid=0x41] XXX CacheServerStats.decCurrentQueueConnections currentQueueConnectionsId=1 java.lang.Exception at org.apache.geode.internal.cache.tier.sockets.CacheServerStats.decCurrentQueueConnections(CacheServerStats.java:665) at org.apache.geode.internal.cache.tier.sockets.ServerConnection.handleTermination(ServerConnection.java:956) at org.apache.geode.internal.cache.tier.sockets.ServerConnection.handleTermination(ServerConnection.java:929) at org.apache.geode.internal.cache.tier.sockets.ServerConnection.run(ServerConnection.java:1289) {noformat} And once by a different ServerConnection thread here: {noformat} [warn 2021/09/21 11:24:01.135 PDT server-1 tid=0x42] XXX CacheServerStats.decCurrentQueueConnections currentQueueConnectionsId=0 java.lang.Exception at org.apache.geode.internal.cache.tier.sockets.CacheServerStats.decCurrentQueueConnections(CacheServerStats.java:665) at org.apache.geode.internal.cache.tier.sockets.CacheClientProxy.closeSocket(CacheClientProxy.java:939) at org.apache.geode.internal.cache.tier.sockets.CacheClientProxy.terminateDispatching(CacheClientProxy.java:895) at org.apache.geode.internal.cache.tier.sockets.CacheClientProxy.close(CacheClientProxy.java:773) at org.apache.geode.internal.cache.tier.sockets.CacheClientNotifier.closeDeadProxies(CacheClientNotifier.java:1558) at 
org.apache.geode.internal.cache.tier.sockets.CacheClientNotifier.unregisterClient(CacheClientNotifier.java:572) at org.apache.geode.internal.cache.tier.sockets.ClientHealthMonitor.unregisterClient(ClientHealthMonitor.java:268) at org.apache.geode.internal.cache.tier.sockets.ServerConnection.handleTermination(ServerConnection.java:1008) at org.apache.geode.internal.cache.tier.sockets.ServerConnection.handleTermination(ServerConnection.java:929) at org.apache.geode.internal.cache.tier.sockets.ServerConnection.run(ServerConnection.java:1289) {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
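One general way to keep a per-queue statistic from being bumped twice by two cooperating threads is to guard each increment/decrement pair with a per-queue AtomicBoolean. This is a generic sketch of that idea under assumed names, not Geode's actual fix:

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Generic sketch: make the per-queue inc/dec idempotent so that both the
// ServerConnection thread and the Client Queue Initialization Thread can call
// it without double counting. Names are illustrative; this is not Geode's fix.
public class QueueConnectionStat {
    private final AtomicInteger currentQueueConnections = new AtomicInteger();
    private final AtomicBoolean counted = new AtomicBoolean(false);

    public void incCurrentQueueConnections() {
        // Only the first caller for this queue actually increments.
        if (counted.compareAndSet(false, true)) {
            currentQueueConnections.incrementAndGet();
        }
    }

    public void decCurrentQueueConnections() {
        // Only the first caller for this queue actually decrements.
        if (counted.compareAndSet(true, false)) {
            currentQueueConnections.decrementAndGet();
        }
    }

    public int get() {
        return currentQueueConnections.get();
    }
}
```

The compareAndSet transition ensures each queue contributes exactly +1 on connect and -1 on disconnect regardless of which thread reports first.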
[jira] [Resolved] (GEODE-9528) CI Failure: DistributionAdvisorIntegrationTest > verifyMembershipListenerIsRemovedAfterForceDisconnect
[ https://issues.apache.org/jira/browse/GEODE-9528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barrett Oglesby resolved GEODE-9528. Resolution: Fixed > CI Failure: DistributionAdvisorIntegrationTest > > verifyMembershipListenerIsRemovedAfterForceDisconnect > -- > > Key: GEODE-9528 > URL: https://issues.apache.org/jira/browse/GEODE-9528 > Project: Geode > Issue Type: Bug > Components: membership >Affects Versions: 1.12.5, 1.13.5, 1.14.0 >Reporter: Owen Nichols >Assignee: Barrett Oglesby >Priority: Major > Labels: pull-request-available > > {noformat} > org.apache.geode.distributed.internal.DistributionAdvisorIntegrationTest > > verifyMembershipListenerIsRemovedAfterForceDisconnect FAILED > org.junit.ComparisonFailure: expected:<[fals]e> but was:<[tru]e> > at > jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at > org.apache.geode.distributed.internal.DistributionAdvisorIntegrationTest.verifyMembershipListenerIsRemovedAfterForceDisconnect(DistributionAdvisorIntegrationTest.java:57) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-9528) CI Failure: DistributionAdvisorIntegrationTest > verifyMembershipListenerIsRemovedAfterForceDisconnect
[ https://issues.apache.org/jira/browse/GEODE-9528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17404068#comment-17404068 ] Barrett Oglesby commented on GEODE-9528: This is a test issue. The test currently does: 1. force disconnect 2. assert the MembershipListener for the region is removed The finally block of GMSMembership.ManagerImpl.forceDisconnect spins off the DisconnectThread to do the actual disconnect. The MembershipListener is removed by that thread. {noformat} } finally { new LoggingThread("DisconnectThread", false, () -> { lifecycleListener.forcedDisconnect(); uncleanShutdown(reason, shutdownCause); }).start(); } {noformat} If there is a delay in the DisconnectThread processing, the test could fail. Here is the normal flow: 1. The Test worker thread invokes forceDisconnectMember 2. The DisconnectThread removes the MembershipListener 3. The Test worker thread asserts the listener is removed Here is logging that shows this behavior: {noformat} [warn 2021/08/24 13:38:47.445 PDT server tid=0xb] XXX DistributionAdvisorIntegrationTest.verifyMembershipListenerIsRemovedAfterForceDisconnect before forceDisconnectMember [warn 2021/08/24 13:38:47.452 PDT server tid=0x32] XXX ManagerImpl.forceDisconnect about to uncleanShutdown [warn 2021/08/24 13:38:47.472 PDT server tid=0x32] XXX DistributionAdvisor.close removeMembershipListener advisee=verifyMembershipListenerIsRemovedAfterForceDisconnect [warn 2021/08/24 13:38:47.566 PDT server tid=0xb] XXX DistributionAdvisorIntegrationTest.verifyMembershipListenerIsRemovedAfterForceDisconnect after forceDisconnectMember [warn 2021/08/24 13:38:47.567 PDT server tid=0xb] XXX DistributionAdvisorIntegrationTest.verifyMembershipListenerIsRemovedAfterForceDisconnect assert {noformat} If the DisconnectThread has any kind of delay in running, the order changes, and the test fails: {noformat} [warn 2021/08/24 14:05:29.270 PDT server tid=0xb] XXX 
DistributionAdvisorIntegrationTest.verifyMembershipListenerIsRemovedAfterForceDisconnect before forceDisconnectMember [warn 2021/08/24 14:05:29.382 PDT server tid=0x32] XXX ManagerImpl.forceDisconnect about to uncleanShutdown [warn 2021/08/24 14:05:29.392 PDT server tid=0xb] XXX DistributionAdvisorIntegrationTest.verifyMembershipListenerIsRemovedAfterForceDisconnect after forceDisconnectMember [warn 2021/08/24 14:05:29.393 PDT server tid=0xb] XXX DistributionAdvisorIntegrationTest.verifyMembershipListenerIsRemovedAfterForceDisconnect assert [warn 2021/08/24 14:05:29.403 PDT server tid=0x32] XXX DistributionAdvisor.close removeMembershipListener advisee=verifyMembershipListenerIsRemovedAfterForceDisconnect org.junit.ComparisonFailure: Expecting value to be false but was true expected:<[fals]e> but was:<[tru]e> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at org.apache.geode.distributed.internal.DistributionAdvisorIntegrationTest.verifyMembershipListenerIsRemovedAfterForceDisconnect(DistributionAdvisorIntegrationTest.java:62) {noformat} The fix is to add await().untilAsserted like: {noformat} await().untilAsserted( () -> assertThat(manager.getMembershipListeners().contains(listener)).isFalse()); {noformat} With the introduced delay, the test failed 80/100 times. With the await change, the test passed 100/100 times. 
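The await().untilAsserted call comes from the Awaitility library the Geode tests use. Its core idea is just re-evaluating the condition until it holds or a deadline passes, which can be sketched with the JDK alone (the class and method names here are invented for the sketch):

```java
import java.util.function.BooleanSupplier;

// JDK-only sketch of what await().untilAsserted amounts to: re-check the
// condition until it holds or the deadline passes. Awaitility adds better
// failure messages, configurable poll intervals, and timeout defaults.
public class AwaitSketch {
    public static boolean awaitTrue(BooleanSupplier condition, long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            if (condition.getAsBoolean()) {
                return true; // e.g. the DisconnectThread finally removed the listener
            }
            try {
                Thread.sleep(10); // poll interval
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return condition.getAsBoolean(); // one last check at the deadline
    }
}
```

Polling like this tolerates any scheduling delay in the DisconnectThread, which is exactly the race the single immediate assertion lost.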
[jira] [Assigned] (GEODE-9528) CI Failure: DistributionAdvisorIntegrationTest > verifyMembershipListenerIsRemovedAfterForceDisconnect
[ https://issues.apache.org/jira/browse/GEODE-9528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barrett Oglesby reassigned GEODE-9528: -- Assignee: Barrett Oglesby (was: Ernest Burghardt) > CI Failure: DistributionAdvisorIntegrationTest > > verifyMembershipListenerIsRemovedAfterForceDisconnect > -- > > Key: GEODE-9528 > URL: https://issues.apache.org/jira/browse/GEODE-9528 > Project: Geode > Issue Type: Bug > Components: membership >Affects Versions: 1.12.5, 1.13.5, 1.14.0 >Reporter: Owen Nichols >Assignee: Barrett Oglesby >Priority: Major > > {noformat} > org.apache.geode.distributed.internal.DistributionAdvisorIntegrationTest > > verifyMembershipListenerIsRemovedAfterForceDisconnect FAILED > org.junit.ComparisonFailure: expected:<[fals]e> but was:<[tru]e> > at > jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at > org.apache.geode.distributed.internal.DistributionAdvisorIntegrationTest.verifyMembershipListenerIsRemovedAfterForceDisconnect(DistributionAdvisorIntegrationTest.java:57) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (GEODE-9441) The NestedFunctionExecutionDistributedTest uses too many threads
[ https://issues.apache.org/jira/browse/GEODE-9441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barrett Oglesby reassigned GEODE-9441: -- Assignee: Dale Emery > The NestedFunctionExecutionDistributedTest uses too many threads > > > Key: GEODE-9441 > URL: https://issues.apache.org/jira/browse/GEODE-9441 > Project: Geode > Issue Type: Test > Components: tests >Reporter: Barrett Oglesby >Assignee: Dale Emery >Priority: Major > > The {{NestedFunctionExecutionDistributedTest}} uses {{OperationExecutors > MAX_FE_THREADS}} to configure both client function invocations and cache > server max connections. > It uses MAX_FE_THREADS * 2 for function executions which use Function > Execution Processor threads: > {noformat} > client.invoke(() -> executeFunction(new ParentFunction(), MAX_FE_THREADS * > 2)); > {noformat} > And potentially MAX_FE_THREADS * 3 for client connections which use > ServerConnection threads: > {noformat} > cacheServer.setMaxConnections(Math.max(CacheServer.DEFAULT_MAX_CONNECTIONS, > MAX_FE_THREADS * 3)); > {noformat} > MAX_FE_THREADS was changed recently to: > {noformat} > Math.max(Runtime.getRuntime().availableProcessors() * 16, 16)) > {noformat} > It doesn't need to use this many threads to test the behavior it is testing. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (GEODE-9441) The NestedFunctionExecutionDistributedTest uses too many threads
Barrett Oglesby created GEODE-9441: -- Summary: The NestedFunctionExecutionDistributedTest uses too many threads Key: GEODE-9441 URL: https://issues.apache.org/jira/browse/GEODE-9441 Project: Geode Issue Type: Test Components: tests Reporter: Barrett Oglesby The {{NestedFunctionExecutionDistributedTest}} uses {{OperationExecutors MAX_FE_THREADS}} to configure both client function invocations and cache server max connections. It uses MAX_FE_THREADS * 2 for function executions which use Function Execution Processor threads: {noformat} client.invoke(() -> executeFunction(new ParentFunction(), MAX_FE_THREADS * 2)); {noformat} And potentially MAX_FE_THREADS * 3 for client connections which use ServerConnection threads: {noformat} cacheServer.setMaxConnections(Math.max(CacheServer.DEFAULT_MAX_CONNECTIONS, MAX_FE_THREADS * 3)); {noformat} MAX_FE_THREADS was changed recently to: {noformat} Math.max(Runtime.getRuntime().availableProcessors() * 16, 16)) {noformat} It doesn't need to use this many threads to test the behavior it is testing. -- This message was sent by Atlassian Jira (v8.3.4#803005)
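To make the thread budget concrete, the new MAX_FE_THREADS formula quoted above scales with core count; a 16-core machine yields 256 function-execution threads, and the test then asks for two to three times that many:

```java
// The formula quoted above: Math.max(availableProcessors() * 16, 16).
// Extracted into a pure function here only to show how it scales.
public class MaxFeThreads {
    public static int maxFeThreads(int availableProcessors) {
        return Math.max(availableProcessors * 16, 16);
    }
}
```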
[jira] [Commented] (GEODE-9392) A gfsh query returning a Struct containing a PdxInstance behaves differently than one returning just the PdxInstance in some cases
[ https://issues.apache.org/jira/browse/GEODE-9392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17380896#comment-17380896 ] Barrett Oglesby commented on GEODE-9392: Code using ObjectMapper like above didn't address the issue. Since LocalDateTime is not supported by default by ObjectMapper, it throws this exception: {noformat} Caused by: java.lang.RuntimeException: Java 8 date/time type `java.time.LocalDateTime` not supported by default: add Module "com.fasterxml.jackson.datatype:jackson-datatype-jsr310" to enable handling at org.apache.geode.pdx.internal.json.PdxToJSON.getJSON(PdxToJSON.java:67) at org.apache.geode.pdx.JSONFormatter.fromPdxInstance(JSONFormatter.java:239) {noformat} I added jackson-datatype-jsr310-2.12.3.jar to the server's classpath and used this code in the last else clause in PdxToJSON.writeValue: {noformat} ObjectMapper mapper = new ObjectMapper(); mapper.findAndRegisterModules(); jg.writeString(mapper.writeValueAsString(value)); {noformat} And that worked: {noformat} Executing - query --query='select * from /data' Result : true Limit : 100 Rows : 1 productId | partnerProductId | onlineRelevance -- | | - 151895 | 151895 | {"value":"value1","valueChangeDate":"[2021,7,14,16,18,29,78400]"} {noformat} > A gfsh query returning a Struct containing a PdxInstance behaves differently > than one returning just the PdxInstance in some cases > -- > > Key: GEODE-9392 > URL: https://issues.apache.org/jira/browse/GEODE-9392 > Project: Geode > Issue Type: Bug > Components: gfsh >Reporter: Barrett Oglesby >Priority: Major > > This is true when the PdxInstance contains a data type that is not supported > by PdxToJSON (like Date or Character). > If objects like this are stored as PdxInstances: > {noformat} > public class Position { > private String id; > private Date tradeDate; > private Character type; > ... 
> } > {noformat} > A query like this is successful: > {noformat} > Executing - query --query='select * from /positions' > Result : true > Limit : 100 > Rows : 10 > tradeDate | id | type > - | -- | > 1624316618413 | 3 | "a" > 1624316618324 | 0 | "a" > 1624316618418 | 5 | "a" > 1624316618421 | 6 | "a" > 1624316618407 | 1 | "a" > 1624316618426 | 8 | "a" > 1624316618428 | 9 | "a" > 1624316618415 | 4 | "a" > 1624316618423 | 7 | "a" > 1624316618410 | 2 | "a" > {noformat} > But a query like this is not: > {noformat} > Executing - query --query="select key,value from /positions.entries where > value.id = '0'" > Result : false > Message : Could not create JSON document from PdxInstance > {noformat} > It fails with this exception in the server: > {noformat} > org.apache.geode.pdx.JSONFormatterException: Could not create JSON document > from PdxInstance > at > org.apache.geode.pdx.JSONFormatter.fromPdxInstance(JSONFormatter.java:241) > at org.apache.geode.pdx.JSONFormatter.toJSON(JSONFormatter.java:226) > at > org.apache.geode.management.internal.cli.domain.DataCommandResult$SelectResultRow.valueToJson(DataCommandResult.java:732) > at > org.apache.geode.management.internal.cli.domain.DataCommandResult$SelectResultRow.resolveStructToColumns(DataCommandResult.java:717) > at > org.apache.geode.management.internal.cli.domain.DataCommandResult$SelectResultRow.resolveObjectToColumns(DataCommandResult.java:692) > at > org.apache.geode.management.internal.cli.domain.DataCommandResult$SelectResultRow.createColumnValues(DataCommandResult.java:680) > at > org.apache.geode.management.internal.cli.domain.DataCommandResult$SelectResultRow.(DataCommandResult.java:663) > at > org.apache.geode.management.internal.cli.functions.DataCommandFunction.createSelectResultRow(DataCommandFunction.java:270) > at > org.apache.geode.management.internal.cli.functions.DataCommandFunction.select_SelectResults(DataCommandFunction.java:256) > at > 
org.apache.geode.management.internal.cli.functions.DataCommandFunction.select(DataCommandFunction.java:224) > at > org.apache.geode.management.internal.cli.functions.DataCommandFunction.select(DataCommandFunction.java:177) > at > org.apache.geode.management.internal.cli.functions.DataCommandFunction.execute(DataCommandFunction.java:126) > Caused by: java.lang.IllegalStateException: PdxInstance returns unknwon > pdxfield tradeDate for type Mon Jun 21 16:03:38 PDT 2021 > at > org.apache.geode.pdx.internal.json.PdxToJSON.writeValue(PdxToJSON.java:148) > at > org.apache.geode.pdx.internal.json.PdxToJSON.getJSONString(PdxToJSON.java:185) >
[jira] [Commented] (GEODE-9392) A gfsh query returning a Struct containing a PdxInstance behaves differently than one returning just the PdxInstance in some cases
[ https://issues.apache.org/jira/browse/GEODE-9392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17380890#comment-17380890 ] Barrett Oglesby commented on GEODE-9392: A {{select *}} query fails in the same way if a field of the value is a PdxInstance and that PdxInstance contains an unsupported data type. For example, if the region contains Product objects and a Product contains a Relevance object which contains a Java LocalDateTime. LocalDateTime is not supported by PdxToJSON, so the query fails with a stack similar to the Struct one above: {noformat} [info 2021/07/14 15:53:00.349 PDT server1 tid=0x3f] Exception occurred: org.apache.geode.pdx.JSONFormatterException: Could not create JSON document from PdxInstance at org.apache.geode.pdx.JSONFormatter.fromPdxInstance(JSONFormatter.java:241) at org.apache.geode.pdx.JSONFormatter.toJSON(JSONFormatter.java:226) at org.apache.geode.management.internal.cli.domain.DataCommandResult$SelectResultRow.valueToJson(DataCommandResult.java:731) at org.apache.geode.management.internal.cli.domain.DataCommandResult$SelectResultRow.resolvePdxToColumns(DataCommandResult.java:711) at org.apache.geode.management.internal.cli.domain.DataCommandResult$SelectResultRow.resolveObjectToColumns(DataCommandResult.java:688) at org.apache.geode.management.internal.cli.domain.DataCommandResult$SelectResultRow.createColumnValues(DataCommandResult.java:680) at org.apache.geode.management.internal.cli.domain.DataCommandResult$SelectResultRow.(DataCommandResult.java:663) at org.apache.geode.management.internal.cli.functions.DataCommandFunction.createSelectResultRow(DataCommandFunction.java:270) at org.apache.geode.management.internal.cli.functions.DataCommandFunction.select_SelectResults(DataCommandFunction.java:256) at org.apache.geode.management.internal.cli.functions.DataCommandFunction.select(DataCommandFunction.java:224) at 
org.apache.geode.management.internal.cli.functions.DataCommandFunction.select(DataCommandFunction.java:177) at org.apache.geode.management.internal.cli.functions.DataCommandFunction.execute(DataCommandFunction.java:126) Caused by: java.lang.IllegalStateException: The pdx field valueChangeDate has a value 2021-07-14T15:52:54.970 whose type class java.time.LocalDateTime can not be converted to JSON. at org.apache.geode.pdx.internal.json.PdxToJSON.writeValue(PdxToJSON.java:148) at org.apache.geode.pdx.internal.json.PdxToJSON.getJSONString(PdxToJSON.java:178) at org.apache.geode.pdx.internal.json.PdxToJSON.getJSON(PdxToJSON.java:60) {noformat} > A gfsh query returning a Struct containing a PdxInstance behaves differently > than one returning just the PdxInstance in some cases > -- > > Key: GEODE-9392 > URL: https://issues.apache.org/jira/browse/GEODE-9392 > Project: Geode > Issue Type: Bug > Components: gfsh >Reporter: Barrett Oglesby >Priority: Major > > This is true when the PdxInstance contains a data type that is not supported > by PdxToJSON (like Date or Character). > If objects like this are stored as PdxInstances: > {noformat} > public class Position { > private String id; > private Date tradeDate; > private Character type; > ... 
> } > {noformat} > A query like this is successful: > {noformat} > Executing - query --query='select * from /positions' > Result : true > Limit : 100 > Rows : 10 > tradeDate | id | type > - | -- | > 1624316618413 | 3 | "a" > 1624316618324 | 0 | "a" > 1624316618418 | 5 | "a" > 1624316618421 | 6 | "a" > 1624316618407 | 1 | "a" > 1624316618426 | 8 | "a" > 1624316618428 | 9 | "a" > 1624316618415 | 4 | "a" > 1624316618423 | 7 | "a" > 1624316618410 | 2 | "a" > {noformat} > But a query like this is not: > {noformat} > Executing - query --query="select key,value from /positions.entries where > value.id = '0'" > Result : false > Message : Could not create JSON document from PdxInstance > {noformat} > It fails with this exception in the server: > {noformat} > org.apache.geode.pdx.JSONFormatterException: Could not create JSON document > from PdxInstance > at > org.apache.geode.pdx.JSONFormatter.fromPdxInstance(JSONFormatter.java:241) > at org.apache.geode.pdx.JSONFormatter.toJSON(JSONFormatter.java:226) > at > org.apache.geode.management.internal.cli.domain.DataCommandResult$SelectResultRow.valueToJson(DataCommandResult.java:732) > at > org.apache.geode.management.internal.cli.domain.DataCommandResult$SelectResultRow.resolveStructToColumns(DataCommandResult.java:717) > at >
[jira] [Created] (GEODE-9392) A gfsh query returning a Struct containing a PdxInstance behaves differently than one returning just the PdxInstance in some cases
Barrett Oglesby created GEODE-9392: -- Summary: A gfsh query returning a Struct containing a PdxInstance behaves differently than one returning just the PdxInstance in some cases Key: GEODE-9392 URL: https://issues.apache.org/jira/browse/GEODE-9392 Project: Geode Issue Type: Bug Components: gfsh Reporter: Barrett Oglesby This is true when the PdxInstance contains a data type that is not supported by PdxToJSON (like Date or Character). If objects like this are stored as PdxInstances: {noformat} public class Position { private String id; private Date tradeDate; private Character type; ... } {noformat} A query like this is successful: {noformat} Executing - query --query='select * from /positions' Result : true Limit : 100 Rows : 10 tradeDate | id | type - | -- | 1624316618413 | 3 | "a" 1624316618324 | 0 | "a" 1624316618418 | 5 | "a" 1624316618421 | 6 | "a" 1624316618407 | 1 | "a" 1624316618426 | 8 | "a" 1624316618428 | 9 | "a" 1624316618415 | 4 | "a" 1624316618423 | 7 | "a" 1624316618410 | 2 | "a" {noformat} But a query like this is not: {noformat} Executing - query --query="select key,value from /positions.entries where value.id = '0'" Result : false Message : Could not create JSON document from PdxInstance {noformat} It fails with this exception in the server: {noformat} org.apache.geode.pdx.JSONFormatterException: Could not create JSON document from PdxInstance at org.apache.geode.pdx.JSONFormatter.fromPdxInstance(JSONFormatter.java:241) at org.apache.geode.pdx.JSONFormatter.toJSON(JSONFormatter.java:226) at org.apache.geode.management.internal.cli.domain.DataCommandResult$SelectResultRow.valueToJson(DataCommandResult.java:732) at org.apache.geode.management.internal.cli.domain.DataCommandResult$SelectResultRow.resolveStructToColumns(DataCommandResult.java:717) at org.apache.geode.management.internal.cli.domain.DataCommandResult$SelectResultRow.resolveObjectToColumns(DataCommandResult.java:692) at 
org.apache.geode.management.internal.cli.domain.DataCommandResult$SelectResultRow.createColumnValues(DataCommandResult.java:680) at org.apache.geode.management.internal.cli.domain.DataCommandResult$SelectResultRow.(DataCommandResult.java:663) at org.apache.geode.management.internal.cli.functions.DataCommandFunction.createSelectResultRow(DataCommandFunction.java:270) at org.apache.geode.management.internal.cli.functions.DataCommandFunction.select_SelectResults(DataCommandFunction.java:256) at org.apache.geode.management.internal.cli.functions.DataCommandFunction.select(DataCommandFunction.java:224) at org.apache.geode.management.internal.cli.functions.DataCommandFunction.select(DataCommandFunction.java:177) at org.apache.geode.management.internal.cli.functions.DataCommandFunction.execute(DataCommandFunction.java:126) Caused by: java.lang.IllegalStateException: PdxInstance returns unknwon pdxfield tradeDate for type Mon Jun 21 16:03:38 PDT 2021 at org.apache.geode.pdx.internal.json.PdxToJSON.writeValue(PdxToJSON.java:148) at org.apache.geode.pdx.internal.json.PdxToJSON.getJSONString(PdxToJSON.java:185) at org.apache.geode.pdx.internal.json.PdxToJSON.getJSON(PdxToJSON.java:61) at org.apache.geode.pdx.JSONFormatter.fromPdxInstance(JSONFormatter.java:239) {noformat} It's because of the difference in processing a PdxInstance (first query) and a Struct (second query) in resolveObjectToColumns: {noformat} private void resolveObjectToColumns(Map columnData, Object value) { if (value instanceof PdxInstance) { resolvePdxToColumns(columnData, (PdxInstance) value); } else if (value instanceof Struct) { resolveStructToColumns(columnData, (StructImpl) value); } ... } {noformat} They both end up in SelectResultRow.valueToJson: {noformat} private String valueToJson(Object value) { ... 
if (value instanceof String) { return (String) value; } if (value instanceof PdxInstance) { return JSONFormatter.toJSON((PdxInstance) value); } ObjectMapper mapper = new ObjectMapper(); try { return mapper.writeValueAsString(value); } catch (JsonProcessingException jex) { return jex.getMessage(); } } {noformat} In the PdxInstance case, the fields are passed in individually and handled by the first condition (String) and the ObjectMapper (Date, Character): {noformat} SelectResultRow.resolveObjectToColumns value=PDX[13681235,Position]{id=3, tradeDate=Mon Jun 21 16:03:38 PDT 2021, type=a}; valueClass=class org.apache.geode.pdx.internal.PdxInstanceImpl SelectResultRow.valueToJson value=Mon Jun 21 16:03:38 PDT 2021; valueClass=class java.util.Date SelectResultRow.valueToJson value=3; valueClass=class java.lang.String SelectResultRow.valueToJson value=a; valueClass=class
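The key difference is that the Struct path hands the whole PdxInstance to JSONFormatter (which rejects unsupported field types such as Date and Character), while the PdxInstance path renders each field with a per-type fallback. The sketch below is plain Java, not Geode code, and the method name valueToColumn is hypothetical; it only mirrors the per-field fallback that produces the first query's output (epoch millis for Date, quoted string for Character):

```java
import java.util.Date;

// Hypothetical sketch of per-field column rendering (not Geode code).
// It mirrors how the working query renders Date as epoch millis and
// Character as a quoted string instead of handing the whole object to a
// JSON formatter that rejects those types.
public class ColumnValueSketch {
    static String valueToColumn(Object value) {
        if (value instanceof String) {
            return (String) value;                          // pass through unquoted
        }
        if (value instanceof Date) {
            return Long.toString(((Date) value).getTime()); // e.g. 1624316618413
        }
        if (value instanceof Character) {
            return "\"" + value + "\"";                     // e.g. "a"
        }
        return String.valueOf(value);                       // generic fallback
    }

    public static void main(String[] args) {
        System.out.println(valueToColumn(new Date(1624316618413L))); // 1624316618413
        System.out.println(valueToColumn('a'));                      // "a"
        System.out.println(valueToColumn("3"));                      // 3
    }
}
```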
[jira] [Created] (GEODE-9390) DistributedSystem nodes is counted twice on each server member
Barrett Oglesby created GEODE-9390: -- Summary: DistributedSystem nodes is counted twice on each server member Key: GEODE-9390 URL: https://issues.apache.org/jira/browse/GEODE-9390 Project: Geode Issue Type: Bug Components: membership Reporter: Barrett Oglesby Once in ClusterDistributionManager.startThreads: {noformat} [warn 2021/06/20 16:20:16.152 HST server-1 tid=0x1] ClusterDistributionManager.handleManagerStartup id=192.168.1.8(server-1:58386):41001; kind=10 [warn 2021/06/20 16:20:16.153 HST server-1 tid=0x1] DistributionStats.incNodes nodes=1 java.lang.Exception at org.apache.geode.distributed.internal.DistributionStats.incNodes(DistributionStats.java:1362) at org.apache.geode.distributed.internal.ClusterDistributionManager.handleManagerStartup(ClusterDistributionManager.java:1809) at org.apache.geode.distributed.internal.ClusterDistributionManager.addNewMember(ClusterDistributionManager.java:1062) at org.apache.geode.distributed.internal.ClusterDistributionManager.startThreads(ClusterDistributionManager.java:691) at org.apache.geode.distributed.internal.ClusterDistributionManager.(ClusterDistributionManager.java:504) at org.apache.geode.distributed.internal.ClusterDistributionManager.create(ClusterDistributionManager.java:326) at org.apache.geode.distributed.internal.InternalDistributedSystem.initialize(InternalDistributedSystem.java:780) {noformat} And once in ClusterDistributionManager.create: {noformat} [warn 2021/06/20 16:20:16.155 HST server-1 tid=0x1] ClusterDistributionManager.handleManagerStartup id=192.168.1.8(server-1:58386):41001; kind=10 [warn 2021/06/20 16:20:16.156 HST server-1 tid=0x1] DistributionStats.incNodes nodes=2 java.lang.Exception at org.apache.geode.distributed.internal.DistributionStats.incNodes(DistributionStats.java:1362) at org.apache.geode.distributed.internal.ClusterDistributionManager.handleManagerStartup(ClusterDistributionManager.java:1809) at 
org.apache.geode.distributed.internal.ClusterDistributionManager.addNewMember(ClusterDistributionManager.java:1062) at org.apache.geode.distributed.internal.ClusterDistributionManager.create(ClusterDistributionManager.java:354) at org.apache.geode.distributed.internal.InternalDistributedSystem.initialize(InternalDistributedSystem.java:780) {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
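The two stack traces show handleManagerStartup (and therefore DistributionStats.incNodes) being reached from both ClusterDistributionManager.startThreads and ClusterDistributionManager.create for the same member. A minimal, hypothetical sketch of making the increment idempotent per member; illustrative only, these are not the Geode classes:

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch (not Geode code): track which member ids have already
// been counted so that a second registration of the same member is a no-op.
public class NodeCountSketch {
    private final Set<String> countedMembers = new HashSet<>();
    private int nodes = 0;

    // Returns true only when the member is counted for the first time.
    boolean addNewMember(String memberId) {
        if (countedMembers.add(memberId)) {
            nodes++;
            return true;
        }
        return false; // already counted, e.g. startThreads followed by create
    }

    int getNodes() { return nodes; }

    public static void main(String[] args) {
        NodeCountSketch stats = new NodeCountSketch();
        stats.addNewMember("192.168.1.8(server-1:58386):41001");
        stats.addNewMember("192.168.1.8(server-1:58386):41001"); // duplicate path
        System.out.println(stats.getNodes()); // 1
    }
}
```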
[jira] [Resolved] (GEODE-9372) DistributionStats needs a stat for create sender time to help diagnose data replication spikes
[ https://issues.apache.org/jira/browse/GEODE-9372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barrett Oglesby resolved GEODE-9372. Fix Version/s: 1.15.0 Resolution: Fixed > DistributionStats needs a stat for create sender time to help diagnose data > replication spikes > -- > > Key: GEODE-9372 > URL: https://issues.apache.org/jira/browse/GEODE-9372 > Project: Geode > Issue Type: Improvement > Components: statistics >Reporter: Barrett Oglesby >Assignee: Barrett Oglesby >Priority: Major > Labels: GeodeOperationAPI, pull-request-available > Fix For: 1.15.0 > > Attachments: > PartitionedRegionStats_sendReplicationTime_DistributionStats_sendersTO.gif > > > While debugging an issue with sendReplicationTime, we realized it was all due > to sender creation time. > A statistic for that time would have been very useful. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (GEODE-9372) DistributionStats needs a stat for create sender time to help diagnose data replication spikes
[ https://issues.apache.org/jira/browse/GEODE-9372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barrett Oglesby updated GEODE-9372: --- Labels: GeodeOperationAPI (was: ) > DistributionStats needs a stat for create sender time to help diagnose data > replication spikes > -- > > Key: GEODE-9372 > URL: https://issues.apache.org/jira/browse/GEODE-9372 > Project: Geode > Issue Type: Improvement > Components: statistics >Reporter: Barrett Oglesby >Assignee: Barrett Oglesby >Priority: Major > Labels: GeodeOperationAPI > Attachments: > PartitionedRegionStats_sendReplicationTime_DistributionStats_sendersTO.gif > > > While debugging an issue with sendReplicationTime, we realized it was all due > to sender creation time. > A statistic for that time would have been very useful. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (GEODE-9372) DistributionStats needs a stat for create sender time to help diagnose data replication spikes
[ https://issues.apache.org/jira/browse/GEODE-9372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barrett Oglesby updated GEODE-9372: --- Attachment: PartitionedRegionStats_sendReplicationTime_DistributionStats_sendersTO.gif > DistributionStats needs a stat for create sender time to help diagnose data > replication spikes > -- > > Key: GEODE-9372 > URL: https://issues.apache.org/jira/browse/GEODE-9372 > Project: Geode > Issue Type: Improvement > Components: statistics >Reporter: Barrett Oglesby >Assignee: Barrett Oglesby >Priority: Major > Attachments: > PartitionedRegionStats_sendReplicationTime_DistributionStats_sendersTO.gif > > > While debugging an issue with sendReplicationTime, we realized it was all due > to sender creation time. > A statistic for that time would have been very useful. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (GEODE-9372) DistributionStats needs a stat for create sender time to help diagnose data replication spikes
[ https://issues.apache.org/jira/browse/GEODE-9372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barrett Oglesby reassigned GEODE-9372: -- Assignee: Barrett Oglesby > DistributionStats needs a stat for create sender time to help diagnose data > replication spikes > -- > > Key: GEODE-9372 > URL: https://issues.apache.org/jira/browse/GEODE-9372 > Project: Geode > Issue Type: Improvement > Components: statistics >Reporter: Barrett Oglesby >Assignee: Barrett Oglesby >Priority: Major > > While debugging an issue with sendReplicationTime, we realized it was all due > to sender creation time. > A statistic for that time would have been very useful. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (GEODE-9372) DistributionStats needs a stat for create sender time to help diagnose data replication spikes
Barrett Oglesby created GEODE-9372: -- Summary: DistributionStats needs a stat for create sender time to help diagnose data replication spikes Key: GEODE-9372 URL: https://issues.apache.org/jira/browse/GEODE-9372 Project: Geode Issue Type: Improvement Components: statistics Reporter: Barrett Oglesby While debugging an issue with sendReplicationTime, we realized it was all due to sender creation time. A statistic for that time would have been very useful. -- This message was sent by Atlassian Jira (v8.3.4#803005)
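A sender-creation-time statistic like the one this ticket asks for could be captured by timing the creation call and accumulating the elapsed nanoseconds. This is a standalone sketch under assumed names (timeCreateSender, createSenderTimeNanos), not the actual DistributionStats API:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

// Hedged sketch: accumulate time spent creating senders in its own stat so
// that spikes in replication time can be attributed to connection setup.
public class SenderStatsSketch {
    private final AtomicLong createSenderTimeNanos = new AtomicLong();

    <T> T timeCreateSender(Supplier<T> createSender) {
        long start = System.nanoTime();
        try {
            return createSender.get();
        } finally {
            createSenderTimeNanos.addAndGet(System.nanoTime() - start);
        }
    }

    long getCreateSenderTimeNanos() { return createSenderTimeNanos.get(); }

    public static void main(String[] args) {
        SenderStatsSketch stats = new SenderStatsSketch();
        String sender = stats.timeCreateSender(() -> "connection"); // stand-in for real setup
        System.out.println(sender + " took " + stats.getCreateSenderTimeNanos() + " ns");
    }
}
```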
[jira] [Resolved] (GEODE-8825) CI failure: GatewayReceiverMBeanDUnitTest > testMBeanAndProxiesForGatewayReceiverAreRemovedOnDestroy
[ https://issues.apache.org/jira/browse/GEODE-8825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barrett Oglesby resolved GEODE-8825. Fix Version/s: 1.15.0 Resolution: Fixed > CI failure: GatewayReceiverMBeanDUnitTest > > testMBeanAndProxiesForGatewayReceiverAreRemovedOnDestroy > > > Key: GEODE-8825 > URL: https://issues.apache.org/jira/browse/GEODE-8825 > Project: Geode > Issue Type: Bug > Components: tests, wan >Reporter: Jianxia Chen >Assignee: Barrett Oglesby >Priority: Major > Labels: GeodeOperationAPI, flaky, pull-request-available > Fix For: 1.15.0 > > > {code:java} > org.apache.geode.internal.cache.wan.GatewayReceiverMBeanDUnitTest > > testMBeanAndProxiesForGatewayReceiverAreRemovedOnDestroy FAILED > org.apache.geode.test.dunit.RMIException: While invoking > org.apache.geode.internal.cache.wan.GatewayReceiverMBeanDUnitTest$$Lambda$202/0x0001008f0c40.run > in VM 0 running on Host c3e48bdac460 with 4 VMs > at org.apache.geode.test.dunit.VM.executeMethodOnObject(VM.java:623) > at org.apache.geode.test.dunit.VM.invoke(VM.java:447) > at > org.apache.geode.internal.cache.wan.GatewayReceiverMBeanDUnitTest.testMBeanAndProxiesForGatewayReceiverAreRemovedOnDestroy(GatewayReceiverMBeanDUnitTest.java:76) > Caused by: > java.lang.AssertionError: expected null, but was: GemFire:service=GatewayReceiver,type=Member,member=172.17.0.18(183)-41002> > at org.junit.Assert.fail(Assert.java:89) > at org.junit.Assert.failNotNull(Assert.java:756) > at org.junit.Assert.assertNull(Assert.java:738) > at org.junit.Assert.assertNull(Assert.java:748) > at > org.apache.geode.internal.cache.wan.GatewayReceiverMBeanDUnitTest.verifyMBeanProxiesDoesNotExist(GatewayReceiverMBeanDUnitTest.java:106) > at > org.apache.geode.internal.cache.wan.GatewayReceiverMBeanDUnitTest.lambda$testMBeanAndProxiesForGatewayReceiverAreRemovedOnDestroy$bb17a952$3(GatewayReceiverMBeanDUnitTest.java:76) > {code} > 
https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-develop-main/jobs/DistributedTestOpenJDK11/builds/704 > =-=-=-=-=-=-=-=-=-=-=-=-=-=-= Test Results URI > =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= > http://files.apachegeode-ci.info/builds/apache-develop-main/1.14.0-build.0601/test-results/distributedTest/1610390301/ > =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= > Test report artifacts from this job are available at: > http://files.apachegeode-ci.info/builds/apache-develop-main/1.14.0-build.0601/test-artifacts/1610390301/distributedtestfiles-OpenJDK11-1.14.0-build.0601.tgz -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (GEODE-8825) CI failure: GatewayReceiverMBeanDUnitTest > testMBeanAndProxiesForGatewayReceiverAreRemovedOnDestroy
[ https://issues.apache.org/jira/browse/GEODE-8825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barrett Oglesby reassigned GEODE-8825: -- Assignee: Barrett Oglesby > CI failure: GatewayReceiverMBeanDUnitTest > > testMBeanAndProxiesForGatewayReceiverAreRemovedOnDestroy > > > Key: GEODE-8825 > URL: https://issues.apache.org/jira/browse/GEODE-8825 > Project: Geode > Issue Type: Bug > Components: tests, wan >Reporter: Jianxia Chen >Assignee: Barrett Oglesby >Priority: Major > Labels: GeodeOperationAPI, flaky > > {code:java} > org.apache.geode.internal.cache.wan.GatewayReceiverMBeanDUnitTest > > testMBeanAndProxiesForGatewayReceiverAreRemovedOnDestroy FAILED > org.apache.geode.test.dunit.RMIException: While invoking > org.apache.geode.internal.cache.wan.GatewayReceiverMBeanDUnitTest$$Lambda$202/0x0001008f0c40.run > in VM 0 running on Host c3e48bdac460 with 4 VMs > at org.apache.geode.test.dunit.VM.executeMethodOnObject(VM.java:623) > at org.apache.geode.test.dunit.VM.invoke(VM.java:447) > at > org.apache.geode.internal.cache.wan.GatewayReceiverMBeanDUnitTest.testMBeanAndProxiesForGatewayReceiverAreRemovedOnDestroy(GatewayReceiverMBeanDUnitTest.java:76) > Caused by: > java.lang.AssertionError: expected null, but was: GemFire:service=GatewayReceiver,type=Member,member=172.17.0.18(183)-41002> > at org.junit.Assert.fail(Assert.java:89) > at org.junit.Assert.failNotNull(Assert.java:756) > at org.junit.Assert.assertNull(Assert.java:738) > at org.junit.Assert.assertNull(Assert.java:748) > at > org.apache.geode.internal.cache.wan.GatewayReceiverMBeanDUnitTest.verifyMBeanProxiesDoesNotExist(GatewayReceiverMBeanDUnitTest.java:106) > at > org.apache.geode.internal.cache.wan.GatewayReceiverMBeanDUnitTest.lambda$testMBeanAndProxiesForGatewayReceiverAreRemovedOnDestroy$bb17a952$3(GatewayReceiverMBeanDUnitTest.java:76) > {code} > https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-develop-main/jobs/DistributedTestOpenJDK11/builds/704 > 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-= Test Results URI > =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= > http://files.apachegeode-ci.info/builds/apache-develop-main/1.14.0-build.0601/test-results/distributedTest/1610390301/ > =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= > Test report artifacts from this job are available at: > http://files.apachegeode-ci.info/builds/apache-develop-main/1.14.0-build.0601/test-artifacts/1610390301/distributedtestfiles-OpenJDK11-1.14.0-build.0601.tgz -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (GEODE-8825) CI failure: GatewayReceiverMBeanDUnitTest > testMBeanAndProxiesForGatewayReceiverAreRemovedOnDestroy
[ https://issues.apache.org/jira/browse/GEODE-8825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barrett Oglesby updated GEODE-8825: --- Labels: GeodeOperationAPI flaky (was: flaky pull-request-available) > CI failure: GatewayReceiverMBeanDUnitTest > > testMBeanAndProxiesForGatewayReceiverAreRemovedOnDestroy > > > Key: GEODE-8825 > URL: https://issues.apache.org/jira/browse/GEODE-8825 > Project: Geode > Issue Type: Bug > Components: tests, wan >Reporter: Jianxia Chen >Priority: Major > Labels: GeodeOperationAPI, flaky > > {code:java} > org.apache.geode.internal.cache.wan.GatewayReceiverMBeanDUnitTest > > testMBeanAndProxiesForGatewayReceiverAreRemovedOnDestroy FAILED > org.apache.geode.test.dunit.RMIException: While invoking > org.apache.geode.internal.cache.wan.GatewayReceiverMBeanDUnitTest$$Lambda$202/0x0001008f0c40.run > in VM 0 running on Host c3e48bdac460 with 4 VMs > at org.apache.geode.test.dunit.VM.executeMethodOnObject(VM.java:623) > at org.apache.geode.test.dunit.VM.invoke(VM.java:447) > at > org.apache.geode.internal.cache.wan.GatewayReceiverMBeanDUnitTest.testMBeanAndProxiesForGatewayReceiverAreRemovedOnDestroy(GatewayReceiverMBeanDUnitTest.java:76) > Caused by: > java.lang.AssertionError: expected null, but was: GemFire:service=GatewayReceiver,type=Member,member=172.17.0.18(183)-41002> > at org.junit.Assert.fail(Assert.java:89) > at org.junit.Assert.failNotNull(Assert.java:756) > at org.junit.Assert.assertNull(Assert.java:738) > at org.junit.Assert.assertNull(Assert.java:748) > at > org.apache.geode.internal.cache.wan.GatewayReceiverMBeanDUnitTest.verifyMBeanProxiesDoesNotExist(GatewayReceiverMBeanDUnitTest.java:106) > at > org.apache.geode.internal.cache.wan.GatewayReceiverMBeanDUnitTest.lambda$testMBeanAndProxiesForGatewayReceiverAreRemovedOnDestroy$bb17a952$3(GatewayReceiverMBeanDUnitTest.java:76) > {code} > https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-develop-main/jobs/DistributedTestOpenJDK11/builds/704 > 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-= Test Results URI > =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= > http://files.apachegeode-ci.info/builds/apache-develop-main/1.14.0-build.0601/test-results/distributedTest/1610390301/ > =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= > Test report artifacts from this job are available at: > http://files.apachegeode-ci.info/builds/apache-develop-main/1.14.0-build.0601/test-artifacts/1610390301/distributedtestfiles-OpenJDK11-1.14.0-build.0601.tgz -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-8825) CI failure: GatewayReceiverMBeanDUnitTest > testMBeanAndProxiesForGatewayReceiverAreRemovedOnDestroy
[ https://issues.apache.org/jira/browse/GEODE-8825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17359662#comment-17359662 ] Barrett Oglesby commented on GEODE-8825: Here is some logging that shows the behavior: Creating the receiver causes it to get added to the federatedComponentMap: {noformat} [vm1] [warn 2021/06/08 16:36:38.288 PDT tid=0x13] XXX GatewayReceiverMBeanDUnitTest.testMBeanAndProxiesForGatewayReceiverAreRemovedOnDestroy about to create receiver [vm1] [warn 2021/06/08 16:36:38.370 PDT tid=0x13] XXX LocalManager.markForFederation added to federatedComponentMap objName=GemFire:service=GatewayReceiver [vm1] [warn 2021/06/08 16:36:38.376 PDT tid=0x13] XXX GatewayReceiverMBeanDUnitTest.testMBeanAndProxiesForGatewayReceiverAreRemovedOnDestroy created receiver {noformat} There is no logging here, but the Management Task hasn't run when the receiver is destroyed. The mbean is removed from federatedComponentMap, but since the monitoringRegion doesn't contain the mbean, it doesn't get removed from the region. The Management Task adds the mbean to that region (which is how it gets to the manager). {noformat} [vm1] [warn 2021/06/08 16:36:38.382 PDT tid=0x13] XXX GatewayReceiverMBeanDUnitTest.testMBeanAndProxiesForGatewayReceiverAreRemovedOnDestroy about to destroy receiver [vm1] [warn 2021/06/08 16:36:38.384 PDT tid=0x13] XXX LocalManager.unMarkForFederation removed from federatedComponentMap objName=GemFire:service=GatewayReceiver [vm1] [warn 2021/06/08 16:36:38.388 PDT tid=0x13] XXX LocalManager.unMarkForFederation monitoringRegionContains objName=GemFire:service=GatewayReceiver: false [vm1] [warn 2021/06/08 16:36:38.389 PDT tid=0x13] XXX GatewayReceiverMBeanDUnitTest.testMBeanAndProxiesForGatewayReceiverAreRemovedOnDestroy destroyed receiver {noformat} If I add a sleep between the create and destroy, I see better behavior. Here is some logging that shows that. 
The receiver is created the same as before: {noformat} [vm1] [warn 2021/06/08 16:35:40.970 PDT tid=0x13] XXX GatewayReceiverMBeanDUnitTest.testMBeanAndProxiesForGatewayReceiverAreRemovedOnDestroy about to create receiver [vm1] [warn 2021/06/08 16:35:41.054 PDT tid=0x13] XXX LocalManager.markForFederation added to federatedComponentMap objName=GemFire:service=GatewayReceiver,type=Member,member=192.168.1.4(12942)-41002 [vm1] [warn 2021/06/08 16:35:41.061 PDT tid=0x13] XXX GatewayReceiverMBeanDUnitTest.testMBeanAndProxiesForGatewayReceiverAreRemovedOnDestroy created receiver {noformat} The Management Task puts the map into the monitoringRegion, and the manager adds the proxy: {noformat} [vm1] [warn 2021/06/08 16:35:41.775 PDT tid=0x3e] XXX LocalManager.doManagementTask about to put replicaMap={GemFire:service=GatewayReceiver = GemFire:service=GatewayReceiver} [vm0] [warn 2021/06/08 16:35:41.782 PDT :41002 unshared ordered sender uid=6 dom #1 local port=60249 remote port=54707> tid=0x48] XXX MBeanAggregator.afterCreateProxy objectName=GemFire:service=GatewayReceiver,type=Member,member=192.168.1.4(12942)-41002 {noformat} The receiver is destroyed. 
This time, the monitoringRegion contains the mbean, so it is removed from it: {noformat} [vm1] [warn 2021/06/08 16:35:44.072 PDT tid=0x13] XXX GatewayReceiverMBeanDUnitTest.testMBeanAndProxiesForGatewayReceiverAreRemovedOnDestroy about to destroy receiver [vm1] [warn 2021/06/08 16:35:44.074 PDT tid=0x13] XXX ManagementAdapter.handleGatewayReceiverDestroy objectName=GemFire:service=GatewayReceiver [vm1] [warn 2021/06/08 16:35:44.075 PDT tid=0x13] XXX LocalManager.unMarkForFederation removed from federatedComponentMap objName=GemFire:service=GatewayReceiver [vm1] [warn 2021/06/08 16:35:44.075 PDT tid=0x13] XXX LocalManager.unMarkForFederation monitoringRegionContains objName=GemFire:service=GatewayReceiver: true [vm1] [warn 2021/06/08 16:35:44.079 PDT tid=0x13] XXX LocalManager.unMarkForFederation removed from monitoringRegion objName=GemFire:service=GatewayReceiver [vm1] [warn 2021/06/08 16:35:44.079 PDT tid=0x13] XXX GatewayReceiverMBeanDUnitTest.testMBeanAndProxiesForGatewayReceiverAreRemovedOnDestroy destroyed receiver {noformat} The proxy is removed from the manager: {noformat} [vm0] [warn 2021/06/08 16:35:44.082 PDT :41002 unshared ordered sender uid=4 dom #1 local port=60249 remote port=54691> tid=0x41] XXX MBeanAggregator.afterRemoveProxy objectName=GemFire:service=GatewayReceiver,type=Member,member=192.168.1.4(12942)-41002 {noformat} The test needs to be modified to wait for the manager to contain the proxy before destroying the receiver. > CI failure: GatewayReceiverMBeanDUnitTest > > testMBeanAndProxiesForGatewayReceiverAreRemovedOnDestroy > > > Key: GEODE-8825 > URL:
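The suggested fix of waiting for the manager to contain the proxy before destroying the receiver can be sketched as a plain poll-with-timeout. The real test would more likely use an await utility from the test framework; this is standalone Java with illustrative names only:

```java
import java.util.function.BooleanSupplier;

// Hedged sketch of the suggested fix: poll until a condition holds or a
// timeout expires before proceeding, analogous to waiting for the manager
// to report the GatewayReceiver mbean proxy before destroying the receiver.
public class AwaitSketch {
    static boolean await(BooleanSupplier condition, long timeoutMs, long pollMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            if (condition.getAsBoolean()) {
                return true;
            }
            Thread.sleep(pollMs);
        }
        return condition.getAsBoolean(); // one final check at the deadline
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.currentTimeMillis();
        // Condition becomes true after ~200 ms, simulating the federation delay.
        boolean federated = await(() -> System.currentTimeMillis() - start > 200, 5000, 50);
        System.out.println(federated); // true
    }
}
```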
[jira] [Commented] (GEODE-8825) CI failure: GatewayReceiverMBeanDUnitTest > testMBeanAndProxiesForGatewayReceiverAreRemovedOnDestroy
[ https://issues.apache.org/jira/browse/GEODE-8825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17359660#comment-17359660 ] Barrett Oglesby commented on GEODE-8825: This test doesn't really do anything. It's too fast for the verifyMBeanProxiesDoesNotExist method to verify anything properly. For each member, the test does: - create receiver - start receiver - stop receiver - destroy receiver Then it verifies in the manager that none of the mbean proxies exist. The mbean is created when the receiver is created, and destroyed when the receiver is destroyed. The problem is that creating the proxy in the manager is asynchronous to creating the mbean in the local member. There is a Management Task thread that runs (every 2 seconds) in each member and sends the mbeans to the manager. So, after the steps above are complete, the mbean hasn't even been sent to the manager yet. The test is almost always going to pass except in the case where the Management Task runs between the create and the verification. In that case, the proxies will exist, and the test will fail. 
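The asynchronous federation described in the comment above — a periodic task pushing locally registered mbeans to the manager — can be mimicked with a scheduled task. This is a standalone illustration, not Geode's implementation, with the 2-second interval shortened to 100 ms:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Standalone illustration (not Geode code) of the race described above: a
// periodic "management task" copies locally registered mbeans to the manager
// side, so a proxy only becomes visible after the task's next tick.
public class FederationSketch {
    public static void main(String[] args) throws Exception {
        Map<String, String> federated = new ConcurrentHashMap<>();  // member side
        Map<String, String> monitoring = new ConcurrentHashMap<>(); // manager side
        ScheduledExecutorService task = Executors.newSingleThreadScheduledExecutor();
        task.scheduleAtFixedRate(() -> monitoring.putAll(federated),
                100, 100, TimeUnit.MILLISECONDS);

        federated.put("GemFire:service=GatewayReceiver", "mbean"); // "create receiver"
        boolean visibleImmediately = monitoring.containsKey("GemFire:service=GatewayReceiver");
        Thread.sleep(300); // let the task run at least once
        boolean visibleEventually = monitoring.containsKey("GemFire:service=GatewayReceiver");
        task.shutdownNow();
        // Typically prints "false true": the proxy is absent until the task runs.
        System.out.println(visibleImmediately + " " + visibleEventually);
    }
}
```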
> CI failure: GatewayReceiverMBeanDUnitTest > > testMBeanAndProxiesForGatewayReceiverAreRemovedOnDestroy > > > Key: GEODE-8825 > URL: https://issues.apache.org/jira/browse/GEODE-8825 > Project: Geode > Issue Type: Bug > Components: tests, wan >Reporter: Jianxia Chen >Priority: Major > Labels: flaky > > {code:java} > org.apache.geode.internal.cache.wan.GatewayReceiverMBeanDUnitTest > > testMBeanAndProxiesForGatewayReceiverAreRemovedOnDestroy FAILED > org.apache.geode.test.dunit.RMIException: While invoking > org.apache.geode.internal.cache.wan.GatewayReceiverMBeanDUnitTest$$Lambda$202/0x0001008f0c40.run > in VM 0 running on Host c3e48bdac460 with 4 VMs > at org.apache.geode.test.dunit.VM.executeMethodOnObject(VM.java:623) > at org.apache.geode.test.dunit.VM.invoke(VM.java:447) > at > org.apache.geode.internal.cache.wan.GatewayReceiverMBeanDUnitTest.testMBeanAndProxiesForGatewayReceiverAreRemovedOnDestroy(GatewayReceiverMBeanDUnitTest.java:76) > Caused by: > java.lang.AssertionError: expected null, but was: GemFire:service=GatewayReceiver,type=Member,member=172.17.0.18(183)-41002> > at org.junit.Assert.fail(Assert.java:89) > at org.junit.Assert.failNotNull(Assert.java:756) > at org.junit.Assert.assertNull(Assert.java:738) > at org.junit.Assert.assertNull(Assert.java:748) > at > org.apache.geode.internal.cache.wan.GatewayReceiverMBeanDUnitTest.verifyMBeanProxiesDoesNotExist(GatewayReceiverMBeanDUnitTest.java:106) > at > org.apache.geode.internal.cache.wan.GatewayReceiverMBeanDUnitTest.lambda$testMBeanAndProxiesForGatewayReceiverAreRemovedOnDestroy$bb17a952$3(GatewayReceiverMBeanDUnitTest.java:76) > {code} > https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-develop-main/jobs/DistributedTestOpenJDK11/builds/704 > =-=-=-=-=-=-=-=-=-=-=-=-=-=-= Test Results URI > =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= > http://files.apachegeode-ci.info/builds/apache-develop-main/1.14.0-build.0601/test-results/distributedTest/1610390301/ > 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= > Test report artifacts from this job are available at: > http://files.apachegeode-ci.info/builds/apache-develop-main/1.14.0-build.0601/test-artifacts/1610390301/distributedtestfiles-OpenJDK11-1.14.0-build.0601.tgz -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (GEODE-9299) CI Failure: WANRollingUpgradeSecondaryEventsNotReprocessedAfterOldSiteMemberFailover > testSecondaryEventsNotReprocessedAfterOldSiteMemberFailover
[ https://issues.apache.org/jira/browse/GEODE-9299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barrett Oglesby resolved GEODE-9299. Fix Version/s: 1.15.0 Resolution: Fixed > CI Failure: > WANRollingUpgradeSecondaryEventsNotReprocessedAfterOldSiteMemberFailover > > testSecondaryEventsNotReprocessedAfterOldSiteMemberFailover > -- > > Key: GEODE-9299 > URL: https://issues.apache.org/jira/browse/GEODE-9299 > Project: Geode > Issue Type: Bug > Components: wan >Affects Versions: 1.15.0 >Reporter: Hale Bales >Assignee: Barrett Oglesby >Priority: Major > Labels: pull-request-available > Fix For: 1.15.0 > > > {code:java} > org.apache.geode.cache.wan.WANRollingUpgradeSecondaryEventsNotReprocessedAfterOldSiteMemberFailover > > testSecondaryEventsNotReprocessedAfterOldSiteMemberFailover[from_v1.12.2] > FAILED > java.lang.AssertionError: expected:<100> but was:<101> > at org.junit.Assert.fail(Assert.java:89) > at org.junit.Assert.failNotEquals(Assert.java:835) > at org.junit.Assert.assertEquals(Assert.java:647) > at org.junit.Assert.assertEquals(Assert.java:633) > at > org.apache.geode.cache.wan.WANRollingUpgradeDUnitTest.stopSenderAndVerifyEvents(WANRollingUpgradeDUnitTest.java:227) > at > org.apache.geode.cache.wan.WANRollingUpgradeSecondaryEventsNotReprocessedAfterOldSiteMemberFailover.testSecondaryEventsNotReprocessedAfterOldSiteMemberFailover(WANRollingUpgradeSecondaryEventsNotReprocessedAfterOldSiteMemberFailover.java:98) > {code} > CI Failure: > https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-develop-main/jobs/UpgradeTestOpenJDK11/builds/229#B > Artifacts Available here: > http://files.apachegeode-ci.info/builds/apache-develop-main/1.15.0-build.0253/test-results/upgradeTest/1621635640/ -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-9299) CI Failure: WANRollingUpgradeSecondaryEventsNotReprocessedAfterOldSiteMemberFailover > testSecondaryEventsNotReprocessedAfterOldSiteMemberFailover
[ https://issues.apache.org/jira/browse/GEODE-9299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17352894#comment-17352894 ] Barrett Oglesby commented on GEODE-9299: If I simulate this behavior with a sleep on key=5 in Put65, I see the same extra event in the queue. Keys 0-4 are processed normally in servers 1 and 2: Server 1: {noformat} ServerConnection on port 57561 Thread 1: Put65.cmdExecute processing key=0 ServerConnection on port 57561 Thread 1: ParallelGatewaySenderEventProcessor.enqueueEvent put dataKey=0; shadowKey=113 ServerConnection on port 57561 Thread 1: Put65.cmdExecute processing key=1 P2P message reader for 10.166.145.22(ln-2:85040):41003 unshared ordered uid=8 dom #2 port=57607: ParallelGatewaySenderEventProcessor.enqueueEvent put dataKey=1; shadowKey=114 ServerConnection on port 57561 Thread 1: Put65.cmdExecute processing key=2 P2P message reader for 10.166.145.22(ln-2:85040):41003 unshared ordered uid=8 dom #2 port=57607: ParallelGatewaySenderEventProcessor.enqueueEvent put dataKey=2; shadowKey=115 ServerConnection on port 57561 Thread 1: Put65.cmdExecute processing key=3 ServerConnection on port 57561 Thread 1: ParallelGatewaySenderEventProcessor.enqueueEvent put dataKey=3; shadowKey=116 ServerConnection on port 57561 Thread 1: Put65.cmdExecute processing key=4 P2P message reader for 10.166.145.22(ln-2:85040):41003 unshared ordered uid=8 dom #2 port=57607: ParallelGatewaySenderEventProcessor.enqueueEvent put dataKey=4; shadowKey=117 {noformat} Server 2: {noformat} P2P message reader for 10.166.145.22(ln-1:85023):41002 unshared ordered uid=8 dom #1 port=57606: ParallelGatewaySenderEventProcessor.enqueueEvent put dataKey=0; shadowKey=113 P2P message reader for 10.166.145.22(ln-1:85023):41002 unshared ordered uid=8 dom #1 port=57606: ParallelGatewaySenderEventProcessor.enqueueEvent put dataKey=1; shadowKey=114 P2P message reader for 10.166.145.22(ln-1:85023):41002 unshared ordered uid=8 dom #1 port=57606: 
ParallelGatewaySenderEventProcessor.enqueueEvent put dataKey=2; shadowKey=115 P2P message reader for 10.166.145.22(ln-1:85023):41002 unshared ordered uid=8 dom #1 port=57606: ParallelGatewaySenderEventProcessor.enqueueEvent put dataKey=3; shadowKey=116 P2P message reader for 10.166.145.22(ln-1:85023):41002 unshared ordered uid=8 dom #1 port=57606: ParallelGatewaySenderEventProcessor.enqueueEvent put dataKey=4; shadowKey=117 {noformat} The ServerConnection thread in server 1 sleeps before processing key=5: {noformat} ServerConnection on port 57561 Thread 1: Put65.cmdExecute processing key=5 ServerConnection on port 57561 Thread 1: Put65.cmdExecute sleeping key=5 {noformat} The client times out, fails over to server 2, retries key=5, and continues with keys 6-9. Notice the event with key=5 has shadowKey=118. That's the key in the queue. {noformat} ServerConnection on port 57587 Thread 2: Put65.cmdExecute processing retried key=5 P2P message reader for 10.166.145.22(ln-1:85023):41002 unshared ordered uid=10 dom #2 port=57668: ParallelGatewaySenderEventProcessor.enqueueEvent put dataKey=5; shadowKey=118 ServerConnection on port 57587 Thread 2: Put65.cmdExecute processing key=6 ServerConnection on port 57587 Thread 2: ParallelGatewaySenderEventProcessor.enqueueEvent put dataKey=6; shadowKey=119 ServerConnection on port 57587 Thread 2: Put65.cmdExecute processing key=7 P2P message reader for 10.166.145.22(ln-1:85023):41002 unshared ordered uid=10 dom #2 port=57668: ParallelGatewaySenderEventProcessor.enqueueEvent put dataKey=7; shadowKey=120 ServerConnection on port 57587 Thread 2: Put65.cmdExecute processing key=8 ServerConnection on port 57587 Thread 2: ParallelGatewaySenderEventProcessor.enqueueEvent put dataKey=8; shadowKey=121 ServerConnection on port 57587 Thread 2: Put65.cmdExecute processing key=9 P2P message reader for 10.166.145.22(ln-1:85023):41002 unshared ordered uid=10 dom #2 port=57668: ParallelGatewaySenderEventProcessor.enqueueEvent put dataKey=9; 
shadowKey=122 {noformat} Server 1 enqueues keys 5-9: {noformat} P2P message reader for 10.166.145.22(ln-2:85040):41003 unshared ordered uid=10 dom #1 port=57664: ParallelGatewaySenderEventProcessor.enqueueEvent put dataKey=5; shadowKey=118 P2P message reader for 10.166.145.22(ln-2:85040):41003 unshared ordered uid=10 dom #1 port=57664: ParallelGatewaySenderEventProcessor.enqueueEvent put dataKey=6; shadowKey=119 P2P message reader for 10.166.145.22(ln-2:85040):41003 unshared ordered uid=10 dom #1 port=57664: ParallelGatewaySenderEventProcessor.enqueueEvent put dataKey=7; shadowKey=120 P2P message reader for 10.166.145.22(ln-2:85040):41003 unshared ordered uid=10 dom #1 port=57664: ParallelGatewaySenderEventProcessor.enqueueEvent put dataKey=8; shadowKey=121 P2P message reader for 10.166.145.22(ln-2:85040):41003 unshared ordered uid=10 dom #1 port=57664: ParallelGatewaySenderEventProcessor.enqueueEvent put
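One way the queue can end up with more events than distinct puts, as in the trace above, can be shown with a toy queue that hands every enqueue a fresh, monotonically increasing shadow key. This is illustrative only; real shadow-key assignment in ParallelGatewaySenderEventProcessor is more involved:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy illustration (not Geode code) of how a retried put can leave one more
// event in the queue than there were distinct keys: every enqueue gets a
// fresh shadow key, so the delayed original put and its client retry both
// occupy a slot even though they carry the same data key.
public class ShadowKeySketch {
    private long nextShadowKey = 113; // starting value chosen to echo the logs
    private final Map<Long, Integer> queue = new LinkedHashMap<>();

    long enqueue(int dataKey) {
        long shadowKey = nextShadowKey++;
        queue.put(shadowKey, dataKey);
        return shadowKey;
    }

    int size() { return queue.size(); }

    public static void main(String[] args) {
        ShadowKeySketch q = new ShadowKeySketch();
        for (int key = 0; key <= 4; key++) q.enqueue(key); // keys 0-4 -> 113-117
        q.enqueue(5);                                      // client retry of key=5 -> 118
        for (int key = 6; key <= 9; key++) q.enqueue(key); // keys 6-9 -> 119-122
        q.enqueue(5);                                      // delayed original put wakes up
        System.out.println(q.size()); // 11 events for 10 distinct data keys
    }
}
```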
[jira] [Commented] (GEODE-9299) CI Failure: WANRollingUpgradeSecondaryEventsNotReprocessedAfterOldSiteMemberFailover > testSecondaryEventsNotReprocessedAfterOldSiteMemberFailover
[ https://issues.apache.org/jira/browse/GEODE-9299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17352893#comment-17352893 ] Barrett Oglesby commented on GEODE-9299: The failing assertion is verifying the number of entries in the local secondary queues is 100 (which matches the number of puts). Instead, it is 101. {noformat} int localServer1QueueSize = localServer1.invoke(() -> getQueueRegionSize(senderId, false)); int localServer2QueueSize = localServer2.invoke(() -> getQueueRegionSize(senderId, false)); assertEquals(numPuts, localServer1QueueSize + localServer2QueueSize); {noformat} Here is some logging that shows the behavior in this test. Client Starts: {noformat} [vm3_v1.12.2] [info 2021/05/21 21:12:16.982 GMT tid=0x22] Received method: org.apache.geode.cache.wan.WANRollingUpgradeDUnitTest$$Lambda$146/0x0001008afc40.run with 0 args on object: org.apache.geode.cache.wan.WANRollingUpgradeDUnitTest$$Lambda$146/0x0001008afc40@59079e6c [vm3_v1.12.2] [info 2021/05/21 21:12:17.599 GMT tid=0x22] Using org.apache.geode.logging.log4j.internal.impl.Log4jLoggingProvider from ServiceLoader for service org.apache.geode.logging.internal.spi.LoggingProvider [vm3_v1.12.2] [info 2021/05/21 21:12:24.490 GMT tid=0x32] Updating membership port. Port changed from 0 to 46166. 
ID is now 7e72072330df(13685:loner):0:6094c590 [vm3_v1.12.2] [info 2021/05/21 21:12:24.526 GMT tid=0x22] Got result: null [vm3_v1.12.2] from org.apache.geode.cache.wan.WANRollingUpgradeDUnitTest$$Lambda$146/0x0001008afc40.run with 0 args on object: org.apache.geode.cache.wan.WANRollingUpgradeDUnitTest$$Lambda$146/0x0001008afc40@59079e6c (took 7538 ms) {noformat} Client does 100 puts in 22069ms with a SocketTimeoutException: {noformat} [vm3_v1.12.2] [info 2021/05/21 21:12:24.567 GMT tid=0x22] Received method: org.apache.geode.cache.wan.WANRollingUpgradeDUnitTest$$Lambda$339/0x000100959840.run with 0 args on object: org.apache.geode.cache.wan.WANRollingUpgradeDUnitTest$$Lambda$339/0x000100959840@2e8f97c1 [vm3_v1.12.2] [warn 2021/05/21 21:12:42.233 GMT tid=0x22] Pool unexpected socket timed out on client connection=Pooled Connection to 7e72072330df:21250: Connection[7e72072330df:21250]@93891194) [vm3_v1.12.2] [info 2021/05/21 21:12:46.638 GMT tid=0x22] Got result: null [vm3_v1.12.2] from org.apache.geode.cache.wan.WANRollingUpgradeDUnitTest$$Lambda$339/0x000100959840.run with 0 args on object: org.apache.geode.cache.wan.WANRollingUpgradeDUnitTest$$Lambda$339/0x000100959840@2e8f97c1 (took 22069 ms) {noformat} The SocketTimeoutException means the client retried the put. That ends up being 2 puts for the same event. 
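The retry-produces-a-duplicate mechanism above can be sketched with a stdlib-only dedup tracker, loosely modeled on what {{DistributedEventTracker.hasSeenEvent}} does: the server remembers the highest sequence id applied per event source, so a retried put (same source, same sequence) is recognizable as a duplicate. All names here are illustrative, not Geode's actual API:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of sequence-based duplicate detection. A client retry
// after a SocketTimeoutException reuses the same event id, so the second put
// carries a sequence id the server has already applied.
public class EventDedupSketch {
  private final Map<String, Long> lastSequenceBySource = new HashMap<>();

  /** Returns true if this (source, sequence) pair was already applied. */
  public boolean hasSeenEvent(String sourceId, long sequenceId) {
    Long last = lastSequenceBySource.get(sourceId);
    if (last != null && sequenceId <= last) {
      return true; // client retry of an already-applied operation
    }
    lastSequenceBySource.put(sourceId, sequenceId);
    return false;
  }

  public static void main(String[] args) {
    EventDedupSketch tracker = new EventDedupSketch();
    System.out.println(tracker.hasSeenEvent("client-1", 1)); // false: first put
    System.out.println(tracker.hasSeenEvent("client-1", 1)); // true: retried put
  }
}
```

In the failing test, the duplicate landed in the secondary queue before any such check could drop it, which is why the queue size came out as 101 instead of 100.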
Server 1 returns secondary queue size: {noformat} [vm1_v1.12.2] [info 2021/05/21 21:12:46.668 GMT tid=0x22] Received method: org.apache.geode.cache.wan.WANRollingUpgradeDUnitTest$$Lambda$520/0x000100ad1040.run with 0 args on object: org.apache.geode.cache.wan.WANRollingUpgradeDUnitTest$$Lambda$520/0x000100ad1040@79d1c376 [vm1_v1.12.2] [info 2021/05/21 21:12:47.598 GMT tid=0x22] Got result: null [vm1_v1.12.2] from org.apache.geode.cache.wan.WANRollingUpgradeDUnitTest$$Lambda$520/0x000100ad1040.run with 0 args on object: org.apache.geode.cache.wan.WANRollingUpgradeDUnitTest$$Lambda$520/0x000100ad1040@79d1c376 (took 929 ms) {noformat} Server 2 returns secondary queue size: {noformat} [vm2_v1.12.2] [info 2021/05/21 21:12:47.617 GMT tid=0x22] Received method: org.apache.geode.cache.wan.WANRollingUpgradeDUnitTest$$Lambda$517/0x000100ae2c40.run with 0 args on object: org.apache.geode.cache.wan.WANRollingUpgradeDUnitTest$$Lambda$517/0x000100ae2c40@751350b6 [vm2_v1.12.2] [info 2021/05/21 21:12:47.782 GMT tid=0x22] Got result: null [vm2_v1.12.2] from org.apache.geode.cache.wan.WANRollingUpgradeDUnitTest$$Lambda$517/0x000100ae2c40.run with 0 args on object: org.apache.geode.cache.wan.WANRollingUpgradeDUnitTest$$Lambda$517/0x000100ae2c40@751350b6 (took 161 ms) {noformat} The assertEquals check fails right after this, and the test shuts down. Here is some more detail. 
Server 1 buckets are created: {noformat} [vm1_v1.12.2] [info 2021/05/21 21:12:24.771 GMT tid=0x39] Initializing region _B__testSecondaryEventsNotReprocessedAfterOldSiteMemberFailover[from__v1.12.2]__region_0 [vm1_v1.12.2] [info 2021/05/21 21:12:24.847 GMT tid=0x39] Initialization of region _B__testSecondaryEventsNotReprocessedAfterOldSiteMemberFailover[from__v1.12.2]__region_0 completed [vm1_v1.12.2] [info 2021/05/21 21:12:25.418 GMT tid=0x39] Initializing region _B__testSecondaryEventsNotReprocessedAfterOldSiteMemberFailover[from__v1.12.2]__region_1 [vm1_v1.12.2] [info 2021/05/21 21:12:25.439 GMT tid=0x39] Initialization of region _B__testSecondaryEventsNotReprocessedAfterOldSiteMemberFailover[from__v1.12.2]__region_1 completed [vm1_v1.12.2] [info 2021/05/21 21:12:26.012
[jira] [Assigned] (GEODE-9299) CI Failure: WANRollingUpgradeSecondaryEventsNotReprocessedAfterOldSiteMemberFailover > testSecondaryEventsNotReprocessedAfterOldSiteMemberFailover
[ https://issues.apache.org/jira/browse/GEODE-9299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barrett Oglesby reassigned GEODE-9299: -- Assignee: Barrett Oglesby > CI Failure: > WANRollingUpgradeSecondaryEventsNotReprocessedAfterOldSiteMemberFailover > > testSecondaryEventsNotReprocessedAfterOldSiteMemberFailover > -- > > Key: GEODE-9299 > URL: https://issues.apache.org/jira/browse/GEODE-9299 > Project: Geode > Issue Type: Bug > Components: wan >Affects Versions: 1.15.0 >Reporter: Hale Bales >Assignee: Barrett Oglesby >Priority: Major > > {code:java} > org.apache.geode.cache.wan.WANRollingUpgradeSecondaryEventsNotReprocessedAfterOldSiteMemberFailover > > testSecondaryEventsNotReprocessedAfterOldSiteMemberFailover[from_v1.12.2] > FAILED > java.lang.AssertionError: expected:<100> but was:<101> > at org.junit.Assert.fail(Assert.java:89) > at org.junit.Assert.failNotEquals(Assert.java:835) > at org.junit.Assert.assertEquals(Assert.java:647) > at org.junit.Assert.assertEquals(Assert.java:633) > at > org.apache.geode.cache.wan.WANRollingUpgradeDUnitTest.stopSenderAndVerifyEvents(WANRollingUpgradeDUnitTest.java:227) > at > org.apache.geode.cache.wan.WANRollingUpgradeSecondaryEventsNotReprocessedAfterOldSiteMemberFailover.testSecondaryEventsNotReprocessedAfterOldSiteMemberFailover(WANRollingUpgradeSecondaryEventsNotReprocessedAfterOldSiteMemberFailover.java:98) > {code} > CI Failure: > https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-develop-main/jobs/UpgradeTestOpenJDK11/builds/229#B > Artifacts Available here: > http://files.apachegeode-ci.info/builds/apache-develop-main/1.15.0-build.0253/test-results/upgradeTest/1621635640/ -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (GEODE-9307) When a server is force disconnected, its regions can still be referenced
[ https://issues.apache.org/jira/browse/GEODE-9307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barrett Oglesby reassigned GEODE-9307: -- Assignee: Barrett Oglesby > When a server is force disconnected, its regions can still be referenced > > > Key: GEODE-9307 > URL: https://issues.apache.org/jira/browse/GEODE-9307 > Project: Geode > Issue Type: Bug > Components: regions >Reporter: Barrett Oglesby >Assignee: Barrett Oglesby >Priority: Major > > When a server is force disconnected, any of its DistributedRegions will not > be GCed after they are closed. This is really only a problem if the > GemFireCacheImpl is referenced in something other than the > ClusterDistributionManager.cache field (in my test, I used a static field of > a Function) > The GemFireCacheImpl references a ClusterDistributionManager in the final > field called dm. > The DistributedRegion creates and references a DistributionAdvisor in the > final field called distAdvisor. The DistributionAdvisor creates a > MembershipListener and adds it to the ClusterDistributionManager's > membershipListeners. > When the GemFireCacheImpl is closed due to force disconnect, its regions are > also closed. > When a DistributedRegion is closed, its DistributionAdvisor is also closed. > DistributionAdvisor.close attempts to remove the MembershipListener > {noformat} > try { > getDistributionManager().removeMembershipListener(membershipListener); > } catch (CancelException e) { > // if distribution has stopped, above is a no-op. > } ... > {noformat} > That call fails with a CancelException, and the MembershipListener is not > removed, so the ClusterDistributionManager references both the > GemFireCacheImpl and the MembershipListener. The MembershipListener > references the DistributionAdvisor which references the DistributedRegion. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (GEODE-9307) When a server is force disconnected, its regions can still be referenced
Barrett Oglesby created GEODE-9307: -- Summary: When a server is force disconnected, its regions can still be referenced Key: GEODE-9307 URL: https://issues.apache.org/jira/browse/GEODE-9307 Project: Geode Issue Type: Bug Components: regions Reporter: Barrett Oglesby When a server is force disconnected, any of its DistributedRegions will not be GCed after they are closed. This is really only a problem if the GemFireCacheImpl is referenced in something other than the ClusterDistributionManager.cache field (in my test, I used a static field of a Function) The GemFireCacheImpl references a ClusterDistributionManager in the final field called dm. The DistributedRegion creates and references a DistributionAdvisor in the final field called distAdvisor. The DistributionAdvisor creates a MembershipListener and adds it to the ClusterDistributionManager's membershipListeners. When the GemFireCacheImpl is closed due to force disconnect, its regions are also closed. When a DistributedRegion is closed, its DistributionAdvisor is also closed. DistributionAdvisor.close attempts to remove the MembershipListener {noformat} try { getDistributionManager().removeMembershipListener(membershipListener); } catch (CancelException e) { // if distribution has stopped, above is a no-op. } ... {noformat} That call fails with a CancelException, and the MembershipListener is not removed, so the ClusterDistributionManager references both the GemFireCacheImpl and the MembershipListener. The MembershipListener references the DistributionAdvisor which references the DistributedRegion. -- This message was sent by Atlassian Jira (v8.3.4#803005)
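The reference chain described above can be reproduced with a stdlib-only sketch. The class and field names below mirror the description but are simplified stand-ins, and the "fixed" close is an assumption about one possible remedy, not Geode's actual patch:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the leak: when close() swallows the CancelException without any
// fallback, the listener stays reachable from the long-lived manager, which
// pins the advisor (and, in Geode, the region) in memory.
public class ListenerLeakSketch {
  static class CancelException extends RuntimeException {}

  static class Manager { // stands in for ClusterDistributionManager
    final List<Object> membershipListeners = new ArrayList<>();
    boolean cancelled;

    void removeMembershipListener(Object l) {
      if (cancelled) {
        throw new CancelException(); // analogous to "distribution has stopped"
      }
      membershipListeners.remove(l);
    }
  }

  static class Advisor { // stands in for DistributionAdvisor
    private final Manager manager;
    private final Object membershipListener = new Object();

    Advisor(Manager manager) {
      this.manager = manager;
      manager.membershipListeners.add(membershipListener);
    }

    // Mirrors the close() shown above: the exception is swallowed and the
    // listener is never removed.
    void closeLeaky() {
      try {
        manager.removeMembershipListener(membershipListener);
      } catch (CancelException e) {
        // listener (and everything it references) stays reachable
      }
    }

    // Hypothetical fix: on cancellation, break the reference chain directly.
    void closeFixed() {
      try {
        manager.removeMembershipListener(membershipListener);
      } catch (CancelException e) {
        manager.membershipListeners.remove(membershipListener);
      }
    }
  }

  public static void main(String[] args) {
    Manager m1 = new Manager();
    Advisor a1 = new Advisor(m1);
    m1.cancelled = true; // forced disconnect
    a1.closeLeaky();
    System.out.println("listeners after leaky close: " + m1.membershipListeners.size());

    Manager m2 = new Manager();
    Advisor a2 = new Advisor(m2);
    m2.cancelled = true;
    a2.closeFixed();
    System.out.println("listeners after fixed close: " + m2.membershipListeners.size());
  }
}
```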
[jira] [Resolved] (GEODE-9138) Add warning in server logs when data event is ignored as a duplicate
[ https://issues.apache.org/jira/browse/GEODE-9138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barrett Oglesby resolved GEODE-9138. Fix Version/s: 1.15.0 Resolution: Fixed > Add warning in server logs when data event is ignored as a duplicate > > > Key: GEODE-9138 > URL: https://issues.apache.org/jira/browse/GEODE-9138 > Project: Geode > Issue Type: Bug > Components: client/server, logging >Reporter: Diane Hardman >Assignee: Barrett Oglesby >Priority: Major > Labels: pull-request-available > Fix For: 1.15.0 > > > Under certain rare conditions, a client may send or resend a data event with > an eventId that causes the server to interpret it as a duplicate event and > discard it. > It is currently impossible to trace when this happens without extra logging > added. > From Barry: > No, if the server thinks it has seen the event, it silently eats it. See > DistributedEventTracker.hasSeenEvent. It has a trace log message but thats it. > The log message that tracks this behavior in the server needs to be added > permanently. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (GEODE-9104) REST query output displays non-ASCII characters using escapes
[ https://issues.apache.org/jira/browse/GEODE-9104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barrett Oglesby resolved GEODE-9104. Fix Version/s: 1.15.0 Resolution: Fixed > REST query output displays non-ASCII characters using escapes > - > > Key: GEODE-9104 > URL: https://issues.apache.org/jira/browse/GEODE-9104 > Project: Geode > Issue Type: Bug > Components: rest (dev) >Reporter: Barrett Oglesby >Assignee: Barrett Oglesby >Priority: Major > Labels: pull-request-available > Fix For: 1.15.0 > > > For example, if JSON containing Chinese characters is put: > {noformat} > curl -X PUT -H "Content-Type: application/json" > localhost:8081/geode/v1/customers/1 -d '{"id": "1", "firstName": "名", > "lastName": "姓"}' > {noformat} > The results of getting the entry are correct: > {noformat} > curl localhost:8081/geode/v1/customers/1 > { > "id" : "1", > "firstName" : "名", > "lastName" : "姓" > } > {noformat} > The results of querying the entry show the field values escaped: > {noformat} > curl -G http://localhost:8081/gemfire-api/v1/queries/adhoc --data-urlencode > "q=SELECT * FROM /customers where id='1'" > [ { > "id" : "1", > "firstName" : "\u540D", > "lastName" : "\u59D3" > } ] > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-9138) Add warning in server logs when data event is ignored as a duplicate
[ https://issues.apache.org/jira/browse/GEODE-9138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17335733#comment-17335733 ] Barrett Oglesby commented on GEODE-9138: A lot of this change is to determine when and when not to log the message. There were 3 HA cases where the message was logged validly: - posDup (for client message retries) - low bucket redundancy (when a server crashes) - recoveries in progress (when a server restarts) The first change was to convert the message from debug to info. With that change, Lynn ran the parReg/parRegHABridge.bt test to see how many of those messages were logged. This test does a variety of operations while killing and restarting servers. There were ~400 of those messages logged per test on the first run. These are all valid duplicates that we don't want to log if we can help it. We really only want to log this message in steady state (no HA). The first case I noticed was low bucket redundancy. I also noticed at that same time that sometimes posDup was true and sometimes not. I also realized every message in this low bucket redundancy state should have been posDup (they were all client retries). But posDup wasn't set on putAlls and removeAlls. So I made those changes and added the posDup case. After that, there were still a handful of messages logged. That was because the region was being recovered. The messages were logged for a specific bucket right after it was GIIed, but the region was still in recovery, so I added that case. I also ran some rebalance tests, but I didn't see any messages. I'm not 100% sure there aren't any, though. 
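The three HA cases above condense into a single gating condition: log the duplicate only in steady state. The method and parameter names below are illustrative, not the ones used in the actual change:

```java
// Hypothetical condensation of the suppression logic described above: the
// duplicate-event warning is logged only when none of the expected-duplicate
// HA conditions hold.
public class DuplicateLogGate {
  static boolean shouldLogDuplicate(boolean possibleDuplicate,
                                    boolean redundancySatisfied,
                                    boolean recoveryInProgress) {
    // possibleDuplicate (posDup): a client retry marked the event
    // !redundancySatisfied: a server crashed and bucket redundancy is low
    // recoveryInProgress: a restarted server is still recovering (post-GII)
    return !possibleDuplicate && redundancySatisfied && !recoveryInProgress;
  }

  public static void main(String[] args) {
    System.out.println(shouldLogDuplicate(false, true, false)); // steady state: log
    System.out.println(shouldLogDuplicate(true, true, false));  // client retry: suppress
  }
}
```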
> Add warning in server logs when data event is ignored as a duplicate > > > Key: GEODE-9138 > URL: https://issues.apache.org/jira/browse/GEODE-9138 > Project: Geode > Issue Type: Bug > Components: client/server, logging >Reporter: Diane Hardman >Assignee: Barrett Oglesby >Priority: Major > Labels: pull-request-available > > Under certain rare conditions, a client may send or resend a data event with > an eventId that causes the server to interpret it as a duplicate event and > discard it. > It is currently impossible to trace when this happens without extra logging > added. > From Barry: > No, if the server thinks it has seen the event, it silently eats it. See > DistributedEventTracker.hasSeenEvent. It has a trace log message but thats it. > The log message that tracks this behavior in the server needs to be added > permanently. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (GEODE-9138) Add warning in server logs when data event is ignored as a duplicate
[ https://issues.apache.org/jira/browse/GEODE-9138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barrett Oglesby reassigned GEODE-9138: -- Assignee: Barrett Oglesby > Add warning in server logs when data event is ignored as a duplicate > > > Key: GEODE-9138 > URL: https://issues.apache.org/jira/browse/GEODE-9138 > Project: Geode > Issue Type: Bug > Components: client/server, logging >Reporter: Diane Hardman >Assignee: Barrett Oglesby >Priority: Major > Labels: pull-request-available > > Under certain rare conditions, a client may send or resend a data event with > an eventId that causes the server to interpret it as a duplicate event and > discard it. > It is currently impossible to trace when this happens without extra logging > added. > From Barry: > No, if the server thinks it has seen the event, it silently eats it. See > DistributedEventTracker.hasSeenEvent. It has a trace log message but thats it. > The log message that tracks this behavior in the server needs to be added > permanently. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (GEODE-9174) The result of a gfsh query containing a UUID may not be displayed properly
[ https://issues.apache.org/jira/browse/GEODE-9174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barrett Oglesby updated GEODE-9174: --- Summary: The result of a gfsh query containing a UUID may not be displayed properly (was: A gfsh query with a UUID in the result may not be displayed properly) > The result of a gfsh query containing a UUID may not be displayed properly > -- > > Key: GEODE-9174 > URL: https://issues.apache.org/jira/browse/GEODE-9174 > Project: Geode > Issue Type: Bug > Components: gfsh, querying >Reporter: Barrett Oglesby >Priority: Major > > For example, if the key is a UUID, then a query like this won't show the > results even though there is one: > {noformat} > gfsh>query --query="select key from /data.entries where value.id = > '55e907b6-a1fe-42ea-90a2-6a5698e9b27c'" > Result : true > Limit : 100 > Rows : 1 > {noformat} > But a query like this will: > {noformat} > gfsh>query --query="select key,value from /data.entries where value.id = > '55e907b6-a1fe-42ea-90a2-6a5698e9b27c'" > Result : true > Limit : 100 > Rows : 1 > key | value > -- | > --- > "55e907b6-a1fe-42ea-90a2-6a5698e9b27c" | > {"id":"55e907b6-a1fe-42ea-90a2-6a5698e9b27c","cusip":"AAPL","shares":22,"price":352.32} > {noformat} > Thats because of the way {{DataCommandResult.resolveObjectToColumns}} works. > {noformat} > private void resolveObjectToColumns(Map columnData, Object > value) { > if (value instanceof PdxInstance) { > resolvePdxToColumns(columnData, (PdxInstance) value); > } else if (value instanceof Struct) { > resolveStructToColumns(columnData, (StructImpl) value); > } else { > ObjectMapper mapper = new ObjectMapper(); > JsonNode node = mapper.valueToTree(value); > node.fieldNames().forEachRemaining(field -> { > ... > columnData.put(field, mapper.writeValueAsString(node.get(field))); > }); > } > } > {noformat} > The value in the first query is a {{UUID}} so the last else clause is > invoked. 
In this case, a {{JsonNode}} is used to determine the columns. > {{ObjectMapper.valueToTree}} converts a {{UUID}} to a {{TextNode}}. > {{TextNodes}} have no fieldNames, and {{JsonNode.fieldNames}} returns an > {{EmptyIterator}} by default: > {noformat} > public Iterator fieldNames() { > return ClassUtil.emptyIterator(); > } > {noformat} > So, {{resolveObjectToColumns}} doesn't fill in columnData, which causes the > {{DataCommandResult.buildTable}} in the locator to not add any rows to the > table. > The value in the second query is a {{Struct}} so the second else clause is > invoked. The {{resolveStructToColumns}} method does: > {noformat} > private void resolveStructToColumns(Map columnData, > StructImpl struct) { > for (String field : struct.getFieldNames()) { > columnData.put(field, valueToJson(struct.get(field))); > } > } > {noformat} > I'm not sure if there is a way to make {{ObjectMapper.valueToTree}} handle > {{UUIDs}} differently, but they can easily be special-cased like > {{PdxInstances}} and {{Structs}}: > {noformat} > } else if (value instanceof UUID) { > columnData.put("uuid", valueToJson(value)); > {noformat} > I'm not sure if this is the best solution, but it works. With this clause > added, the query does: > {noformat} > gfsh>query --query="select key from /data.entries where value.id = > '55e907b6-a1fe-42ea-90a2-6a5698e9b27c'" > Result : true > Limit : 100 > Rows : 1 > uuid > -- > "55e907b6-a1fe-42ea-90a2-6a5698e9b27c" > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (GEODE-9174) A gfsh query with a UUID in the result may not be displayed properly
Barrett Oglesby created GEODE-9174: -- Summary: A gfsh query with a UUID in the result may not be displayed properly Key: GEODE-9174 URL: https://issues.apache.org/jira/browse/GEODE-9174 Project: Geode Issue Type: Bug Components: gfsh, querying Reporter: Barrett Oglesby For example, if the key is a UUID, then a query like this won't show the results even though there is one: {noformat} gfsh>query --query="select key from /data.entries where value.id = '55e907b6-a1fe-42ea-90a2-6a5698e9b27c'" Result : true Limit : 100 Rows : 1 {noformat} But a query like this will: {noformat} gfsh>query --query="select key,value from /data.entries where value.id = '55e907b6-a1fe-42ea-90a2-6a5698e9b27c'" Result : true Limit : 100 Rows : 1 key | value -- | --- "55e907b6-a1fe-42ea-90a2-6a5698e9b27c" | {"id":"55e907b6-a1fe-42ea-90a2-6a5698e9b27c","cusip":"AAPL","shares":22,"price":352.32} {noformat} That's because of the way {{DataCommandResult.resolveObjectToColumns}} works. {noformat} private void resolveObjectToColumns(Map columnData, Object value) { if (value instanceof PdxInstance) { resolvePdxToColumns(columnData, (PdxInstance) value); } else if (value instanceof Struct) { resolveStructToColumns(columnData, (StructImpl) value); } else { ObjectMapper mapper = new ObjectMapper(); JsonNode node = mapper.valueToTree(value); node.fieldNames().forEachRemaining(field -> { ... columnData.put(field, mapper.writeValueAsString(node.get(field))); }); } } {noformat} The value in the first query is a {{UUID}} so the last else clause is invoked. In this case, a {{JsonNode}} is used to determine the columns. {{ObjectMapper.valueToTree}} converts a {{UUID}} to a {{TextNode}}. 
{{TextNodes}} have no fieldNames, and {{JsonNode.fieldNames}} returns an {{EmptyIterator}} by default: {noformat} public Iterator fieldNames() { return ClassUtil.emptyIterator(); } {noformat} So, {{resolveObjectToColumns}} doesn't fill in columnData, which causes the {{DataCommandResult.buildTable}} in the locator to not add any rows to the table. The value in the second query is a {{Struct}} so the second else clause is invoked. The {{resolveStructToColumns}} method does: {noformat} private void resolveStructToColumns(Map columnData, StructImpl struct) { for (String field : struct.getFieldNames()) { columnData.put(field, valueToJson(struct.get(field))); } } {noformat} I'm not sure if there is a way to make {{ObjectMapper.valueToTree}} handle {{UUIDs}} differently, but they can easily be special-cased like {{PdxInstances}} and {{Structs}}: {noformat} } else if (value instanceof UUID) { columnData.put("uuid", valueToJson(value)); {noformat} I'm not sure if this is the best solution, but it works. With this clause added, the query does: {noformat} gfsh>query --query="select key from /data.entries where value.id = '55e907b6-a1fe-42ea-90a2-6a5698e9b27c'" Result : true Limit : 100 Rows : 1 uuid -- "55e907b6-a1fe-42ea-90a2-6a5698e9b27c" {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
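The proposed special case can be sketched without Jackson at all: a UUID has no mappable fields, so it gets a single hand-made column. The quoting here approximates what {{valueToJson}} would produce; this is a stdlib-only illustration, not the real DataCommandResult code:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;

// Sketch of the suggested else-if branch: instead of falling through to the
// generic JsonNode path (which yields no columns for a TextNode), a UUID is
// emitted as one "uuid" column.
public class UuidColumnSketch {
  static void resolveObjectToColumns(Map<String, String> columnData, Object value) {
    if (value instanceof UUID) {
      columnData.put("uuid", "\"" + value + "\""); // mirrors the added special case
    }
    // generic path (PdxInstance, Struct, JsonNode) omitted in this sketch
  }

  public static void main(String[] args) {
    Map<String, String> columns = new LinkedHashMap<>();
    resolveObjectToColumns(columns,
        UUID.fromString("55e907b6-a1fe-42ea-90a2-6a5698e9b27c"));
    System.out.println(columns); // {uuid="55e907b6-a1fe-42ea-90a2-6a5698e9b27c"}
  }
}
```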
[jira] [Assigned] (GEODE-9122) Setting group-transaction-events=true can cause ConcurrentModificationExceptions
[ https://issues.apache.org/jira/browse/GEODE-9122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barrett Oglesby reassigned GEODE-9122: -- Assignee: Alberto Gomez > Setting group-transaction-events=true can cause > ConcurrentModificationExceptions > > > Key: GEODE-9122 > URL: https://issues.apache.org/jira/browse/GEODE-9122 > Project: Geode > Issue Type: Bug > Components: wan >Reporter: Barrett Oglesby >Assignee: Alberto Gomez >Priority: Major > > The > SerialWANStatsDUnitTest.testReplicatedSerialPropagationHAWithGroupTransactionEvents > test can throw a ConcurrentModificationException like: > {noformat} > [warn 2021/04/04 02:55:53.253 GMT > tid=0x15d] An Exception occurred. The dispatcher will continue. > java.util.ConcurrentModificationException > at java.util.HashMap$HashIterator.nextNode(HashMap.java:1445) > at java.util.HashMap$KeyIterator.next(HashMap.java:1469) > at > org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderQueue.peekEventsFromIncompleteTransactions(SerialGatewaySenderQueue.java:476) > at > org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderQueue.peek(SerialGatewaySenderQueue.java:453) > at > org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor.processQueue(AbstractGatewaySenderEventProcessor.java:518) > at > org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderEventProcessor.run(SerialGatewaySenderEventProcessor.java:223) > {noformat} > If the SerialGatewaySenderQueue.peekEventsFromIncompleteTransactions contains > more than one TransactionId, and one of them is removed, the > ConcurrentModificationException will occur. > Both the SerialGatewaySenderQueue and ParallelGatewaySenderQueue > peekEventsFromIncompleteTransactions have the same implementation. > These methods do: > {noformat} >while (true) { > 1. ->for (TransactionId transactionId : incompleteTransactionIdsInBatch) { >... >if (...) { > ... > 2. 
-> incompleteTransactionIdsInBatch.remove(transactionId); >} > } >} > {noformat} > The for-each loop (1) cannot be paired with the remove from the > incompleteTransactionIdsInBatch set (2). As soon as the remove is called, the > ConcurrentModificationException will be thrown the next time through the > loop. Since this for loop is in a while (true) loop, it is an infinite loop. > One way to address this would be to use an Iterator and call remove on the > Iterator like: > {noformat} > 1. ->for (Iterator i = > incompleteTransactionIdsInBatch.iterator(); i.hasNext();) { >TransactionId transactionId = i.next(); >... > 2. -> i.remove(); > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (GEODE-9122) Setting group-transaction-events=true can cause ConcurrentModificationExceptions
Barrett Oglesby created GEODE-9122: -- Summary: Setting group-transaction-events=true can cause ConcurrentModificationExceptions Key: GEODE-9122 URL: https://issues.apache.org/jira/browse/GEODE-9122 Project: Geode Issue Type: Bug Components: wan Reporter: Barrett Oglesby The SerialWANStatsDUnitTest.testReplicatedSerialPropagationHAWithGroupTransactionEvents test can throw a ConcurrentModificationException like: {noformat} [warn 2021/04/04 02:55:53.253 GMT tid=0x15d] An Exception occurred. The dispatcher will continue. java.util.ConcurrentModificationException at java.util.HashMap$HashIterator.nextNode(HashMap.java:1445) at java.util.HashMap$KeyIterator.next(HashMap.java:1469) at org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderQueue.peekEventsFromIncompleteTransactions(SerialGatewaySenderQueue.java:476) at org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderQueue.peek(SerialGatewaySenderQueue.java:453) at org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor.processQueue(AbstractGatewaySenderEventProcessor.java:518) at org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderEventProcessor.run(SerialGatewaySenderEventProcessor.java:223) {noformat} If the SerialGatewaySenderQueue.peekEventsFromIncompleteTransactions contains more than one TransactionId, and one of them is removed, the ConcurrentModificationException will occur. Both the SerialGatewaySenderQueue and ParallelGatewaySenderQueue peekEventsFromIncompleteTransactions have the same implementation. These methods do: {noformat} while (true) { 1. ->for (TransactionId transactionId : incompleteTransactionIdsInBatch) { ... if (...) { ... 2. -> incompleteTransactionIdsInBatch.remove(transactionId); } } } {noformat} The for-each loop (1) cannot be paired with the remove from the incompleteTransactionIdsInBatch set (2). As soon as the remove is called, the ConcurrentModificationException will be thrown the next time through the loop. 
Since this for loop is in a while (true) loop, it is an infinite loop. One way to address this would be to use an Iterator and call remove on the Iterator like: {noformat} 1. ->for (Iterator i = incompleteTransactionIdsInBatch.iterator(); i.hasNext();) { TransactionId transactionId = i.next(); ... 2. -> i.remove(); {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
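Both the failure mode and the iterator-based fix can be demonstrated with a self-contained stdlib example, using plain strings in place of TransactionIds:

```java
import java.util.ConcurrentModificationException;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

// Removing from a HashSet inside a for-each over that same set throws
// ConcurrentModificationException on the next call to next(); removing
// through the Iterator is the safe equivalent.
public class IteratorRemoveDemo {
  public static void main(String[] args) {
    Set<String> incomplete = new HashSet<>();
    incomplete.add("tx-1");
    incomplete.add("tx-2");

    boolean sawCme = false;
    try {
      for (String tx : incomplete) {
        incomplete.remove(tx); // (2) structural modification mid-iteration
      }
    } catch (ConcurrentModificationException e) {
      sawCme = true; // thrown the next time through the loop
    }
    System.out.println("for-each remove threw CME: " + sawCme);

    Set<String> incomplete2 = new HashSet<>();
    incomplete2.add("tx-1");
    incomplete2.add("tx-2");
    for (Iterator<String> i = incomplete2.iterator(); i.hasNext();) {
      i.next();
      i.remove(); // safe: removal goes through the iterator
    }
    System.out.println("size after iterator removal: " + incomplete2.size());
  }
}
```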
[jira] [Commented] (GEODE-9104) REST query output displays non-ASCII characters using escapes
[ https://issues.apache.org/jira/browse/GEODE-9104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17311707#comment-17311707 ] Barrett Oglesby commented on GEODE-9104: The query executes this code path: {noformat} java.lang.Exception at org.apache.geode.rest.internal.web.util.JSONUtils.enableDisableJSONGeneratorFeature(JSONUtils.java:57) at org.apache.geode.rest.internal.web.util.JSONUtils.convertCollectionToJson(JSONUtils.java:141) at org.apache.geode.rest.internal.web.controllers.AbstractBaseController.processQueryResponse(AbstractBaseController.java:243) at org.apache.geode.rest.internal.web.controllers.QueryAccessController.runNamedQuery(QueryAccessController.java:262) {noformat} JSONUtils creates a JsonGenerator like: {noformat} getObjectMapper().getFactory().createGenerator((OutputStream) outputStream, JsonEncoding.UTF8) {noformat} It then enables the ESCAPE_NON_ASCII feature: {noformat} generator.enable(JsonWriteFeature.ESCAPE_NON_ASCII.mappedFeature()); {noformat} This is what causes the Chinese characters to be escaped. 
The get creates a RegionData in this code path: {noformat} java.lang.Exception: RegionData.RegionData at org.apache.geode.rest.internal.web.controllers.support.RegionData.(RegionData.java:59) at org.apache.geode.rest.internal.web.controllers.support.RegionEntryData.(RegionEntryData.java:48) at org.apache.geode.rest.internal.web.controllers.PdxBasedCrudController.getRegionKeys(PdxBasedCrudController.java:260) at org.apache.geode.rest.internal.web.controllers.PdxBasedCrudController.read(PdxBasedCrudController.java:243) {noformat} The RegionData is serialized here: {noformat} java.lang.Exception: RegionData.serialize at org.apache.geode.rest.internal.web.controllers.support.RegionData.serialize(RegionData.java:131) at com.fasterxml.jackson.databind.ser.std.SerializableSerializer.serialize(SerializableSerializer.java:39) at com.fasterxml.jackson.databind.ser.std.SerializableSerializer.serialize(SerializableSerializer.java:20) at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider._serialize(DefaultSerializerProvider.java:480) at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:319) at com.fasterxml.jackson.databind.ObjectWriter$Prefetch.serialize(ObjectWriter.java:1514) at com.fasterxml.jackson.databind.ObjectWriter.writeValue(ObjectWriter.java:1006) at org.springframework.http.converter.json.AbstractJackson2HttpMessageConverter.writeInternal(AbstractJackson2HttpMessageConverter.java:454) at org.springframework.http.converter.AbstractGenericHttpMessageConverter.write(AbstractGenericHttpMessageConverter.java:104) {noformat} AbstractJackson2HttpMessageConverter.writeInternal creates a JsonGenerator like this: {noformat} objectMapper.getFactory().createGenerator(outputStream, encoding) {noformat} This is the same as JSONUtils. The ESCAPE_NON_ASCII is not enabled in this case, though. 
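The effect of enabling ESCAPE_NON_ASCII on the query path can be illustrated without Jackson: every code point above 0x7F is written as a backslash-u escape of its hex value. This is a stdlib sketch of the transformation, not Jackson's implementation:

```java
// Stdlib-only illustration of what ESCAPE_NON_ASCII produces on the query
// path and the plain generator does not produce on the get path.
public class EscapeNonAsciiSketch {
  static String escapeNonAscii(String s) {
    StringBuilder out = new StringBuilder();
    for (char c : s.toCharArray()) {
      if (c > 0x7F) {
        // cast to int: Formatter's %X does not accept a char argument
        out.append(String.format("\\u%04X", (int) c));
      } else {
        out.append(c);
      }
    }
    return out.toString();
  }

  public static void main(String[] args) {
    // The two characters from the example above, U+540D and U+59D3,
    // come out as their six-character escape forms.
    System.out.println(escapeNonAsciiSafe("名"));
  }

  // Small wrapper so the demo reads clearly in main().
  static String escapeNonAsciiSafe(String s) {
    return escapeNonAscii(s);
  }
}
```

Running this on the example data reproduces the escaped query output shown in the issue description, which points at the generator feature rather than the data itself.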
> REST query output displays non-ASCII characters using escapes > - > > Key: GEODE-9104 > URL: https://issues.apache.org/jira/browse/GEODE-9104 > Project: Geode > Issue Type: Bug > Components: rest (dev) >Reporter: Barrett Oglesby >Assignee: Barrett Oglesby >Priority: Major > > For example, if JSON containing Chinese characters is put: > {noformat} > curl -X PUT -H "Content-Type: application/json" > localhost:8081/geode/v1/customers/1 -d '{"id": "1", "firstName": "名", > "lastName": "姓"}' > {noformat} > The results of getting the entry are correct: > {noformat} > curl localhost:8081/geode/v1/customers/1 > { > "id" : "1", > "firstName" : "名", > "lastName" : "姓" > } > {noformat} > The results of querying the entry show the field values escaped: > {noformat} > curl -G http://localhost:8081/gemfire-api/v1/queries/adhoc --data-urlencode > "q=SELECT * FROM /customers where id='1'" > [ { > "id" : "1", > "firstName" : "\u540D", > "lastName" : "\u59D3" > } ] > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (GEODE-9104) REST query output displays non-ASCII characters using escapes
[ https://issues.apache.org/jira/browse/GEODE-9104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barrett Oglesby reassigned GEODE-9104: -- Assignee: Barrett Oglesby > REST query output displays non-ASCII characters using escapes > - > > Key: GEODE-9104 > URL: https://issues.apache.org/jira/browse/GEODE-9104 > Project: Geode > Issue Type: Bug > Components: rest (dev) >Reporter: Barrett Oglesby >Assignee: Barrett Oglesby >Priority: Major > > For example, if JSON containing Chinese characters is put: > {noformat} > curl -X PUT -H "Content-Type: application/json" > localhost:8081/geode/v1/customers/1 -d '{"id": "1", "firstName": "名", > "lastName": "姓"}' > {noformat} > The results of getting the entry are correct: > {noformat} > curl localhost:8081/geode/v1/customers/1 > { > "id" : "1", > "firstName" : "名", > "lastName" : "姓" > } > {noformat} > The results of querying the entry show the field values escaped: > {noformat} > curl -G http://localhost:8081/gemfire-api/v1/queries/adhoc --data-urlencode > "q=SELECT * FROM /customers where id='1'" > [ { > "id" : "1", > "firstName" : "\u540D", > "lastName" : "\u59D3" > } ] > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (GEODE-9104) REST query output displays non-ASCII characters using escapes
Barrett Oglesby created GEODE-9104: -- Summary: REST query output displays non-ASCII characters using escapes Key: GEODE-9104 URL: https://issues.apache.org/jira/browse/GEODE-9104 Project: Geode Issue Type: Bug Components: rest (dev) Reporter: Barrett Oglesby For example, if JSON containing Chinese characters is put: {noformat} curl -X PUT -H "Content-Type: application/json" localhost:8081/geode/v1/customers/1 -d '{"id": "1", "firstName": "名", "lastName": "姓"}' {noformat} The results of getting the entry are correct: {noformat} curl localhost:8081/geode/v1/customers/1 { "id" : "1", "firstName" : "名", "lastName" : "姓" } {noformat} The results of querying the entry show the field values escaped: {noformat} curl -G http://localhost:8081/gemfire-api/v1/queries/adhoc --data-urlencode "q=SELECT * FROM /customers where id='1'" [ { "id" : "1", "firstName" : "\u540D", "lastName" : "\u59D3" } ] {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
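The escaped form in the query output matches a JSON writer that replaces every character above U+007F with a \uXXXX escape. A minimal standalone sketch of that escaping (hypothetical {{NonAsciiEscaper}} helper, not Geode's actual REST serializer):

```java
public class NonAsciiEscaper {
    // Escape every character above U+007F as a \uXXXX sequence,
    // mirroring the form seen in the adhoc query output above.
    public static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c > 0x7F) {
                sb.append(String.format("\\u%04X", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

Here escape("名") yields the six literal characters \u540D, the same form the adhoc query endpoint returns, whereas the GET path evidently leaves the raw UTF-8 characters alone.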
[jira] [Updated] (GEODE-9030) The PartitionedIndex arbitraryBucketIndex doesn't get reset when the BucketRegion defining it is moved
[ https://issues.apache.org/jira/browse/GEODE-9030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barrett Oglesby updated GEODE-9030: --- Labels: blocks-1.14.0 pull-request-available (was: pull-request-available) > The PartitionedIndex arbitraryBucketIndex doesn't get reset when the > BucketRegion defining it is moved > -- > > Key: GEODE-9030 > URL: https://issues.apache.org/jira/browse/GEODE-9030 > Project: Geode > Issue Type: Bug > Components: querying >Reporter: Barrett Oglesby >Assignee: Barrett Oglesby >Priority: Major > Labels: blocks-1.14.0, pull-request-available > > This causes a RegionDestroyedException like this when executing a query > containing a != clause: > {noformat} > Exception in thread "main" > org.apache.geode.cache.client.ServerOperationException: remote server on > 10.166.145.16(client:27461:loner):58776:dfd3ba27:client: While performing a > remote query > at > org.apache.geode.cache.client.internal.AbstractOp.processChunkedResponse(AbstractOp.java:342) > at > org.apache.geode.cache.client.internal.QueryOp$QueryOpImpl.processResponse(QueryOp.java:168) > at > org.apache.geode.cache.client.internal.AbstractOp.processResponse(AbstractOp.java:224) > at > org.apache.geode.cache.client.internal.AbstractOp.attemptReadResponse(AbstractOp.java:197) > at > org.apache.geode.cache.client.internal.AbstractOp.attempt(AbstractOp.java:384) > at > org.apache.geode.cache.client.internal.ConnectionImpl.execute(ConnectionImpl.java:284) > at > org.apache.geode.cache.client.internal.pooling.PooledConnection.execute(PooledConnection.java:355) > at > org.apache.geode.cache.client.internal.OpExecutorImpl.executeWithPossibleReAuthentication(OpExecutorImpl.java:756) > at > org.apache.geode.cache.client.internal.OpExecutorImpl.execute(OpExecutorImpl.java:142) > at > org.apache.geode.cache.client.internal.OpExecutorImpl.execute(OpExecutorImpl.java:112) > at > org.apache.geode.cache.client.internal.PoolImpl.execute(PoolImpl.java:797) > at > 
org.apache.geode.cache.client.internal.QueryOp.execute(QueryOp.java:59) > at > org.apache.geode.cache.client.internal.ServerProxy.query(ServerProxy.java:59) > at > org.apache.geode.cache.query.internal.DefaultQuery.executeOnServer(DefaultQuery.java:327) > at > org.apache.geode.cache.query.internal.DefaultQuery.execute(DefaultQuery.java:215) > at > org.apache.geode.cache.query.internal.DefaultQuery.execute(DefaultQuery.java:197) > Caused by: org.apache.geode.cache.query.QueryInvocationTargetException: The > Region on which query is executed may have been > destroyed.BucketRegion[path='/__PR/_B__trade_0;serial=12;primary=false] > at > org.apache.geode.internal.cache.PRQueryProcessor.executeQueryOnBuckets(PRQueryProcessor.java:264) > at > org.apache.geode.internal.cache.PRQueryProcessor.executeSequentially(PRQueryProcessor.java:214) > at > org.apache.geode.internal.cache.PRQueryProcessor.executeQuery(PRQueryProcessor.java:124) > at > org.apache.geode.internal.cache.partitioned.QueryMessage.operateOnPartitionedRegion(QueryMessage.java:210) > Caused by: org.apache.geode.cache.RegionDestroyedException: > BucketRegion[path='/__PR/_B__trade_0;serial=12;primary=false] > at > org.apache.geode.internal.cache.LocalRegion.checkRegionDestroyed(LocalRegion.java:7352) > at > org.apache.geode.internal.cache.LocalRegion.checkReadiness(LocalRegion.java:2757) > at > org.apache.geode.internal.cache.BucketRegion.checkReadiness(BucketRegion.java:1437) > at > org.apache.geode.internal.cache.LocalRegion.size(LocalRegion.java:8313) > at > org.apache.geode.cache.query.internal.index.CompactRangeIndex.getSizeEstimate(CompactRangeIndex.java:331) > at > org.apache.geode.cache.query.internal.CompiledComparison.getSizeEstimate(CompiledComparison.java:337) > at > org.apache.geode.cache.query.internal.GroupJunction.organizeOperands(GroupJunction.java:146) > at > org.apache.geode.cache.query.internal.AbstractGroupOrRangeJunction.filterEvaluate(AbstractGroupOrRangeJunction.java:148) > at > 
org.apache.geode.cache.query.internal.CompiledJunction.filterEvaluate(CompiledJunction.java:190) > at > org.apache.geode.cache.query.internal.CompiledSelect.evaluate(CompiledSelect.java:538) > at > org.apache.geode.cache.query.internal.CompiledSelect.evaluate(CompiledSelect.java:53) > at > org.apache.geode.cache.query.internal.DefaultQuery.executeUsingContext(DefaultQuery.java:357) > at > org.apache.geode.internal.cache.PRQueryProcessor.executeQueryOnBuckets(PRQueryProcessor.java:248) > {noformat} > Here is an
[jira] [Assigned] (GEODE-9030) The PartitionedIndex arbitraryBucketIndex doesn't get reset when the BucketRegion defining it is moved
[ https://issues.apache.org/jira/browse/GEODE-9030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barrett Oglesby reassigned GEODE-9030: -- Assignee: Barrett Oglesby > The PartitionedIndex arbitraryBucketIndex doesn't get reset when the > BucketRegion defining it is moved > -- > > Key: GEODE-9030 > URL: https://issues.apache.org/jira/browse/GEODE-9030 > Project: Geode > Issue Type: Bug > Components: querying >Reporter: Barrett Oglesby >Assignee: Barrett Oglesby >Priority: Major > Labels: pull-request-available > > This causes a RegionDestroyedException like this when executing a query > containing a != clause: > {noformat} > Exception in thread "main" > org.apache.geode.cache.client.ServerOperationException: remote server on > 10.166.145.16(client:27461:loner):58776:dfd3ba27:client: While performing a > remote query > at > org.apache.geode.cache.client.internal.AbstractOp.processChunkedResponse(AbstractOp.java:342) > at > org.apache.geode.cache.client.internal.QueryOp$QueryOpImpl.processResponse(QueryOp.java:168) > at > org.apache.geode.cache.client.internal.AbstractOp.processResponse(AbstractOp.java:224) > at > org.apache.geode.cache.client.internal.AbstractOp.attemptReadResponse(AbstractOp.java:197) > at > org.apache.geode.cache.client.internal.AbstractOp.attempt(AbstractOp.java:384) > at > org.apache.geode.cache.client.internal.ConnectionImpl.execute(ConnectionImpl.java:284) > at > org.apache.geode.cache.client.internal.pooling.PooledConnection.execute(PooledConnection.java:355) > at > org.apache.geode.cache.client.internal.OpExecutorImpl.executeWithPossibleReAuthentication(OpExecutorImpl.java:756) > at > org.apache.geode.cache.client.internal.OpExecutorImpl.execute(OpExecutorImpl.java:142) > at > org.apache.geode.cache.client.internal.OpExecutorImpl.execute(OpExecutorImpl.java:112) > at > org.apache.geode.cache.client.internal.PoolImpl.execute(PoolImpl.java:797) > at > org.apache.geode.cache.client.internal.QueryOp.execute(QueryOp.java:59) 
> at > org.apache.geode.cache.client.internal.ServerProxy.query(ServerProxy.java:59) > at > org.apache.geode.cache.query.internal.DefaultQuery.executeOnServer(DefaultQuery.java:327) > at > org.apache.geode.cache.query.internal.DefaultQuery.execute(DefaultQuery.java:215) > at > org.apache.geode.cache.query.internal.DefaultQuery.execute(DefaultQuery.java:197) > Caused by: org.apache.geode.cache.query.QueryInvocationTargetException: The > Region on which query is executed may have been > destroyed.BucketRegion[path='/__PR/_B__trade_0;serial=12;primary=false] > at > org.apache.geode.internal.cache.PRQueryProcessor.executeQueryOnBuckets(PRQueryProcessor.java:264) > at > org.apache.geode.internal.cache.PRQueryProcessor.executeSequentially(PRQueryProcessor.java:214) > at > org.apache.geode.internal.cache.PRQueryProcessor.executeQuery(PRQueryProcessor.java:124) > at > org.apache.geode.internal.cache.partitioned.QueryMessage.operateOnPartitionedRegion(QueryMessage.java:210) > Caused by: org.apache.geode.cache.RegionDestroyedException: > BucketRegion[path='/__PR/_B__trade_0;serial=12;primary=false] > at > org.apache.geode.internal.cache.LocalRegion.checkRegionDestroyed(LocalRegion.java:7352) > at > org.apache.geode.internal.cache.LocalRegion.checkReadiness(LocalRegion.java:2757) > at > org.apache.geode.internal.cache.BucketRegion.checkReadiness(BucketRegion.java:1437) > at > org.apache.geode.internal.cache.LocalRegion.size(LocalRegion.java:8313) > at > org.apache.geode.cache.query.internal.index.CompactRangeIndex.getSizeEstimate(CompactRangeIndex.java:331) > at > org.apache.geode.cache.query.internal.CompiledComparison.getSizeEstimate(CompiledComparison.java:337) > at > org.apache.geode.cache.query.internal.GroupJunction.organizeOperands(GroupJunction.java:146) > at > org.apache.geode.cache.query.internal.AbstractGroupOrRangeJunction.filterEvaluate(AbstractGroupOrRangeJunction.java:148) > at > 
org.apache.geode.cache.query.internal.CompiledJunction.filterEvaluate(CompiledJunction.java:190) > at > org.apache.geode.cache.query.internal.CompiledSelect.evaluate(CompiledSelect.java:538) > at > org.apache.geode.cache.query.internal.CompiledSelect.evaluate(CompiledSelect.java:53) > at > org.apache.geode.cache.query.internal.DefaultQuery.executeUsingContext(DefaultQuery.java:357) > at > org.apache.geode.internal.cache.PRQueryProcessor.executeQueryOnBuckets(PRQueryProcessor.java:248) > {noformat} > Here is an example query that fails: > {noformat} > SELECT * FROM /trade
[jira] [Resolved] (GEODE-9040) The SingleThreadColocationLogger executorService is not shut down when the server is stopped
[ https://issues.apache.org/jira/browse/GEODE-9040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barrett Oglesby resolved GEODE-9040. Fix Version/s: 1.14.0 Resolution: Fixed > The SingleThreadColocationLogger executorService is not shut down when the > server is stopped > --- > > Key: GEODE-9040 > URL: https://issues.apache.org/jira/browse/GEODE-9040 > Project: Geode > Issue Type: Bug > Components: logging >Reporter: Barrett Oglesby >Assignee: Barrett Oglesby >Priority: Major > Labels: pull-request-available > Fix For: 1.14.0 > > > When a server is shut down, its JVM remains alive because the ExecutorService > created by the SingleThreadColocationLogger is not terminated nor is its > thread a daemon: > {noformat} > "ColocationLogger for customer" #57 prio=5 os_prio=31 tid=0x7fb39d4e4000 > nid=0xb203 waiting on condition [0x7dc58000] >java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x000785268818> (a > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039) > at > java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) > at > java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1067) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {noformat} > The SingleThreadColocationLogger only gets created when there are missing > co-located regions. > We can either terminate the ExecutorService, make its thread a daemon, or do > both. -- This message was sent by Atlassian Jira (v8.3.4#803005)
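Both remedies the ticket mentions can be sketched with plain java.util.concurrent APIs (the class and method names here are hypothetical, not the actual Geode fix):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ColocationLoggerExecutor {
    // Remedy 1: a single-thread executor whose worker is a daemon, so a
    // forgotten shutdown can no longer keep the JVM alive.
    public static ExecutorService newDaemonSingleThreadExecutor(String name) {
        return Executors.newSingleThreadExecutor(runnable -> {
            Thread t = new Thread(runnable, name);
            t.setDaemon(true);
            return t;
        });
    }

    // Remedy 2: terminate the executor explicitly when the server stops.
    public static void stop(ExecutorService executor) throws InterruptedException {
        executor.shutdownNow();
        executor.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

Doing both is the safest: the daemon flag covers shutdown paths that miss the logger, while the explicit stop releases the thread promptly on a clean server stop.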
[jira] [Updated] (GEODE-9043) A register interest attempt from a newer client to an older server throws a NoSubscriptionServersAvailableException instead of a ServerRefusedConnectionException
[ https://issues.apache.org/jira/browse/GEODE-9043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barrett Oglesby updated GEODE-9043: --- Description: The exception in the register interest case is a bit confusing. If a 1.13.2 client attempts to connect to a 1.13.0 server and do a put, it throws this ServerRefusedConnectionException with the exact cause: {noformat} Exception in thread "main" org.apache.geode.cache.client.NoAvailableServersException: org.apache.geode.cache.client.ServerRefusedConnectionException: nn.nnn.nnn.nn(3047):41001(version:GEODE 1.13.0) refused connection: Peer or client version with ordinal 121 not supported. Highest known version is 1.13.0 Client: /nn.nnn.nnn.nn:64123. at org.apache.geode.cache.client.internal.pooling.ConnectionManagerImpl.createPooledConnection(ConnectionManagerImpl.java:200) at org.apache.geode.cache.client.internal.pooling.ConnectionManagerImpl.borrowConnection(ConnectionManagerImpl.java:273) at org.apache.geode.cache.client.internal.OpExecutorImpl.execute(OpExecutorImpl.java:128) at org.apache.geode.cache.client.internal.PoolImpl.execute(PoolImpl.java:796) at org.apache.geode.cache.client.internal.PutOp.execute(PutOp.java:91) {noformat} If the client attempts to registerInterest, it throws this NoSubscriptionServersAvailableException: {noformat} Exception in thread "main" org.apache.geode.cache.NoSubscriptionServersAvailableException: org.apache.geode.cache.NoSubscriptionServersAvailableException: Could not initialize a primary queue on startup. No queue servers available. 
at org.apache.geode.cache.client.internal.QueueManagerImpl.getAllConnections(QueueManagerImpl.java:190) at org.apache.geode.cache.client.internal.OpExecutorImpl.executeOnQueuesAndReturnPrimaryResult(OpExecutorImpl.java:432) at org.apache.geode.cache.client.internal.PoolImpl.executeOnQueuesAndReturnPrimaryResult(PoolImpl.java:870) at org.apache.geode.cache.Region.registerInterestForAllKeys(Region.java:1657) {noformat} The log does contain a message like the one below, from which the exact cause can be determined, but the exception itself does not include it: {noformat} [warn 2021/03/15 11:59:04.100 PDT client tid=0x1] Could not create a new connection to server: nn.nnn.nnn.nn(9838):41001(version:GEODE 1.13.0) refused connection: Peer or client version with ordinal 121 not supported. Highest known version is 1.13.0 Client: /nn.nnn.nnn.nn:65323. {noformat} > A register interest attempt from a newer client to an older server throws a > NoSubscriptionServersAvailableException
[jira] [Commented] (GEODE-9043) A register interest attempt from a newer client to an older server throws a NoSubscriptionServersAvailableException instead of a ServerRefusedConnectionException
[ https://issues.apache.org/jira/browse/GEODE-9043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302741#comment-17302741 ] Barrett Oglesby commented on GEODE-9043: There are a couple of ways to address this; the easiest appears to be adding ServerRefusedConnectionException to the exceptions rethrown in QueueManagerImpl.initializeConnections here: {noformat} for (ServerLocation server : servers) { Connection connection = null; try { connection = factory.createClientToServerConnection(server, true); exToLog = null; ->} catch (GemFireSecurityException | GemFireConfigException | ServerRefusedConnectionException e) { throw e; } catch (Exception e) { exToLog = e; } {noformat} That matches what happens with GemFireSecurityException or GemFireConfigException and causes an exception like: {noformat} Exception in thread "main" org.apache.geode.cache.client.ServerRefusedConnectionException: nn.nnn.nnn.nn(9838):41001(version:GEODE 1.13.0) refused connection: Peer or client version with ordinal 121 not supported. Highest known version is 1.13.0 Client: /nn.nnn.nnn.nn:65532. 
at org.apache.geode.internal.cache.tier.sockets.Handshake.readMessage(Handshake.java:331) at org.apache.geode.cache.client.internal.ClientSideHandshakeImpl.handshakeWithServer(ClientSideHandshakeImpl.java:233) at org.apache.geode.cache.client.internal.ConnectionImpl.connect(ConnectionImpl.java:107) at org.apache.geode.cache.client.internal.ConnectionConnector.connectClientToServer(ConnectionConnector.java:75) at org.apache.geode.cache.client.internal.ConnectionFactoryImpl.createClientToServerConnection(ConnectionFactoryImpl.java:118) at org.apache.geode.cache.client.internal.QueueManagerImpl.initializeConnections(QueueManagerImpl.java:456) at org.apache.geode.cache.client.internal.QueueManagerImpl.start(QueueManagerImpl.java:293) {noformat} > A register interest attempt from a newer client to an older server throws a > NoSubscriptionServersAvailableException instead of a > ServerRefusedConnectionException > - > > Key: GEODE-9043 > URL: https://issues.apache.org/jira/browse/GEODE-9043 > Project: Geode > Issue Type: Bug > Components: client/server >Reporter: Barrett Oglesby >Priority: Major > > The exception in the register interest case is a bit confusing. > If a 1.13.2 client attempts to connect to a 1.13.0 server and do a put, it > throws this ServerRefusedConnectionException with the exact cause: > {noformat} > Exception in thread "main" > org.apache.geode.cache.client.NoAvailableServersException: > org.apache.geode.cache.client.ServerRefusedConnectionException: > nn.nnn.nnn.nn(3047):41001(version:GEODE 1.13.0) refused connection: Peer > or client version with ordinal 121 not supported. Highest known version is > 1.13.0 Client: /nn.nnn.nnn.nn:64123. 
> at > org.apache.geode.cache.client.internal.pooling.ConnectionManagerImpl.createPooledConnection(ConnectionManagerImpl.java:200) > at > org.apache.geode.cache.client.internal.pooling.ConnectionManagerImpl.borrowConnection(ConnectionManagerImpl.java:273) > at > org.apache.geode.cache.client.internal.OpExecutorImpl.execute(OpExecutorImpl.java:128) > at > org.apache.geode.cache.client.internal.PoolImpl.execute(PoolImpl.java:796) > at org.apache.geode.cache.client.internal.PutOp.execute(PutOp.java:91) > {noformat} > If the client attempts to registerInterest, it throws this > NoSubscriptionServersAvailableException: > {noformat} > Exception in thread "main" > org.apache.geode.cache.NoSubscriptionServersAvailableException: > org.apache.geode.cache.NoSubscriptionServersAvailableException: Could not > initialize a primary queue on startup. No queue servers available. > at > org.apache.geode.cache.client.internal.QueueManagerImpl.getAllConnections(QueueManagerImpl.java:190) > at > org.apache.geode.cache.client.internal.OpExecutorImpl.executeOnQueuesAndReturnPrimaryResult(OpExecutorImpl.java:432) > at > org.apache.geode.cache.client.internal.PoolImpl.executeOnQueuesAndReturnPrimaryResult(PoolImpl.java:870) > at > org.apache.geode.cache.Region.registerInterestForAllKeys(Region.java:1657) > {noformat} > The log does contain a message like the one below, from which the exact > cause can be determined, but the exception itself does not include it: > {noformat} > [warn 2021/03/15 11:59:04.100 PDT client tid=0x1] Could not create a > new connection to server: nn.nnn.nnn.nn(9838):41001(version:GEODE 1.13.0) > refused connection: Peer or client version with ordinal 121 not supported. > Highest known version is 1.13.0 Client:
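The proposed change is an instance of a general retry-loop pattern: exceptions that no other server can fix are rethrown immediately, while transient ones are remembered and the loop moves on. A stripped-down sketch of that pattern (hypothetical {{FatalAwareRetry}} helper, standing in for the QueueManagerImpl connection loop):

```java
import java.util.List;
import java.util.concurrent.Callable;

public class FatalAwareRetry {
    // Try each candidate in turn. Exceptions of the "fatal" type are
    // rethrown immediately (as ServerRefusedConnectionException would be
    // in the proposed fix); anything else is remembered and the next
    // candidate is tried, mirroring the exToLog handling.
    public static <T> T firstSuccessful(List<Callable<T>> candidates,
                                        Class<? extends Exception> fatal) throws Exception {
        Exception lastNonFatal = null;
        for (Callable<T> candidate : candidates) {
            try {
                return candidate.call();
            } catch (Exception e) {
                if (fatal.isInstance(e)) {
                    throw e; // same treatment GemFireSecurityException gets today
                }
                lastNonFatal = e;
            }
        }
        throw lastNonFatal != null ? lastNonFatal : new IllegalStateException("no candidates");
    }
}
```

With this shape, a refused connection surfaces to the caller with its exact cause instead of being swallowed into a generic "no queue servers available" failure.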
[jira] [Created] (GEODE-9043) A register interest attempt from a newer client to an older server throws a NoSubscriptionServersAvailableException instead of a ServerRefusedConnectionException
Barrett Oglesby created GEODE-9043: -- Summary: A register interest attempt from a newer client to an older server throws a NoSubscriptionServersAvailableException instead of a ServerRefusedConnectionException Key: GEODE-9043 URL: https://issues.apache.org/jira/browse/GEODE-9043 Project: Geode Issue Type: Bug Components: client/server Reporter: Barrett Oglesby The exception in the register interest case is a bit confusing. If a 1.13.2 client attempts to connect to a 1.13.0 server and do a put, it throws this ServerRefusedConnectionException with the exact cause: {noformat} Exception in thread "main" org.apache.geode.cache.client.NoAvailableServersException: org.apache.geode.cache.client.ServerRefusedConnectionException: nn.nnn.nnn.nn(3047):41001(version:GEODE 1.13.0) refused connection: Peer or client version with ordinal 121 not supported. Highest known version is 1.13.0 Client: /nn.nnn.nnn.nn:64123. at org.apache.geode.cache.client.internal.pooling.ConnectionManagerImpl.createPooledConnection(ConnectionManagerImpl.java:200) at org.apache.geode.cache.client.internal.pooling.ConnectionManagerImpl.borrowConnection(ConnectionManagerImpl.java:273) at org.apache.geode.cache.client.internal.OpExecutorImpl.execute(OpExecutorImpl.java:128) at org.apache.geode.cache.client.internal.PoolImpl.execute(PoolImpl.java:796) at org.apache.geode.cache.client.internal.PutOp.execute(PutOp.java:91) {noformat} If the client attempts to registerInterest, it throws this NoSubscriptionServersAvailableException: {noformat} Exception in thread "main" org.apache.geode.cache.NoSubscriptionServersAvailableException: org.apache.geode.cache.NoSubscriptionServersAvailableException: Could not initialize a primary queue on startup. No queue servers available. 
at org.apache.geode.cache.client.internal.QueueManagerImpl.getAllConnections(QueueManagerImpl.java:190) at org.apache.geode.cache.client.internal.OpExecutorImpl.executeOnQueuesAndReturnPrimaryResult(OpExecutorImpl.java:432) at org.apache.geode.cache.client.internal.PoolImpl.executeOnQueuesAndReturnPrimaryResult(PoolImpl.java:870) at org.apache.geode.cache.Region.registerInterestForAllKeys(Region.java:1657) {noformat} The log does contain a message like the one below, from which the exact cause can be determined, but the exception itself does not include it: {noformat} [warn 2021/03/15 11:59:04.100 PDT client tid=0x1] Could not create a new connection to server: nn.nnn.nnn.nn(9838):41001(version:GEODE 1.13.0) refused connection: Peer or client version with ordinal 121 not supported. Highest known version is 1.13.0 Client: /nn.nnn.nnn.nn:65323. {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (GEODE-9040) The SingleThreadColocationLogger executorService is not shut down when the server is stopped
[ https://issues.apache.org/jira/browse/GEODE-9040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barrett Oglesby reassigned GEODE-9040: -- Assignee: Barrett Oglesby > The SingleThreadColocationLogger executorService is not shut down when the > server is stopped > --- > > Key: GEODE-9040 > URL: https://issues.apache.org/jira/browse/GEODE-9040 > Project: Geode > Issue Type: Bug > Components: logging >Reporter: Barrett Oglesby >Assignee: Barrett Oglesby >Priority: Major > > When a server is shut down, its JVM remains alive because the ExecutorService > created by the SingleThreadColocationLogger is not terminated nor is its > thread a daemon: > {noformat} > "ColocationLogger for customer" #57 prio=5 os_prio=31 tid=0x7fb39d4e4000 > nid=0xb203 waiting on condition [0x7dc58000] >java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x000785268818> (a > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039) > at > java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) > at > java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1067) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {noformat} > The SingleThreadColocationLogger only gets created when there are missing > co-located regions. > We can either terminate the ExecutorService, make its thread a daemon, or do > both. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (GEODE-9040) The SingleThreadColocationLogger executorService is not shut down when the server is stopped
Barrett Oglesby created GEODE-9040: -- Summary: The SingleThreadColocationLogger executorService is not shut down when the server is stopped Key: GEODE-9040 URL: https://issues.apache.org/jira/browse/GEODE-9040 Project: Geode Issue Type: Bug Components: logging Reporter: Barrett Oglesby When a server is shut down, its JVM remains alive because the ExecutorService created by the SingleThreadColocationLogger is not terminated nor is its thread a daemon: {noformat} "ColocationLogger for customer" #57 prio=5 os_prio=31 tid=0x7fb39d4e4000 nid=0xb203 waiting on condition [0x7dc58000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x000785268818> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039) at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1067) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {noformat} The SingleThreadColocationLogger only gets created when there are missing co-located regions. We can either terminate the ExecutorService, make its thread a daemon, or do both. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (GEODE-9030) The PartitionedIndex arbitraryBucketIndex doesn't get reset when the BucketRegion defining it is moved
Barrett Oglesby created GEODE-9030: -- Summary: The PartitionedIndex arbitraryBucketIndex doesn't get reset when the BucketRegion defining it is moved Key: GEODE-9030 URL: https://issues.apache.org/jira/browse/GEODE-9030 Project: Geode Issue Type: Bug Components: querying Reporter: Barrett Oglesby This causes a RegionDestroyedException like this when executing a query containing a != clause: {noformat} Exception in thread "main" org.apache.geode.cache.client.ServerOperationException: remote server on 10.166.145.16(client:27461:loner):58776:dfd3ba27:client: While performing a remote query at org.apache.geode.cache.client.internal.AbstractOp.processChunkedResponse(AbstractOp.java:342) at org.apache.geode.cache.client.internal.QueryOp$QueryOpImpl.processResponse(QueryOp.java:168) at org.apache.geode.cache.client.internal.AbstractOp.processResponse(AbstractOp.java:224) at org.apache.geode.cache.client.internal.AbstractOp.attemptReadResponse(AbstractOp.java:197) at org.apache.geode.cache.client.internal.AbstractOp.attempt(AbstractOp.java:384) at org.apache.geode.cache.client.internal.ConnectionImpl.execute(ConnectionImpl.java:284) at org.apache.geode.cache.client.internal.pooling.PooledConnection.execute(PooledConnection.java:355) at org.apache.geode.cache.client.internal.OpExecutorImpl.executeWithPossibleReAuthentication(OpExecutorImpl.java:756) at org.apache.geode.cache.client.internal.OpExecutorImpl.execute(OpExecutorImpl.java:142) at org.apache.geode.cache.client.internal.OpExecutorImpl.execute(OpExecutorImpl.java:112) at org.apache.geode.cache.client.internal.PoolImpl.execute(PoolImpl.java:797) at org.apache.geode.cache.client.internal.QueryOp.execute(QueryOp.java:59) at org.apache.geode.cache.client.internal.ServerProxy.query(ServerProxy.java:59) at org.apache.geode.cache.query.internal.DefaultQuery.executeOnServer(DefaultQuery.java:327) at org.apache.geode.cache.query.internal.DefaultQuery.execute(DefaultQuery.java:215) at 
org.apache.geode.cache.query.internal.DefaultQuery.execute(DefaultQuery.java:197) Caused by: org.apache.geode.cache.query.QueryInvocationTargetException: The Region on which query is executed may have been destroyed.BucketRegion[path='/__PR/_B__trade_0;serial=12;primary=false] at org.apache.geode.internal.cache.PRQueryProcessor.executeQueryOnBuckets(PRQueryProcessor.java:264) at org.apache.geode.internal.cache.PRQueryProcessor.executeSequentially(PRQueryProcessor.java:214) at org.apache.geode.internal.cache.PRQueryProcessor.executeQuery(PRQueryProcessor.java:124) at org.apache.geode.internal.cache.partitioned.QueryMessage.operateOnPartitionedRegion(QueryMessage.java:210) Caused by: org.apache.geode.cache.RegionDestroyedException: BucketRegion[path='/__PR/_B__trade_0;serial=12;primary=false] at org.apache.geode.internal.cache.LocalRegion.checkRegionDestroyed(LocalRegion.java:7352) at org.apache.geode.internal.cache.LocalRegion.checkReadiness(LocalRegion.java:2757) at org.apache.geode.internal.cache.BucketRegion.checkReadiness(BucketRegion.java:1437) at org.apache.geode.internal.cache.LocalRegion.size(LocalRegion.java:8313) at org.apache.geode.cache.query.internal.index.CompactRangeIndex.getSizeEstimate(CompactRangeIndex.java:331) at org.apache.geode.cache.query.internal.CompiledComparison.getSizeEstimate(CompiledComparison.java:337) at org.apache.geode.cache.query.internal.GroupJunction.organizeOperands(GroupJunction.java:146) at org.apache.geode.cache.query.internal.AbstractGroupOrRangeJunction.filterEvaluate(AbstractGroupOrRangeJunction.java:148) at org.apache.geode.cache.query.internal.CompiledJunction.filterEvaluate(CompiledJunction.java:190) at org.apache.geode.cache.query.internal.CompiledSelect.evaluate(CompiledSelect.java:538) at org.apache.geode.cache.query.internal.CompiledSelect.evaluate(CompiledSelect.java:53) at org.apache.geode.cache.query.internal.DefaultQuery.executeUsingContext(DefaultQuery.java:357) at 
org.apache.geode.internal.cache.PRQueryProcessor.executeQueryOnBuckets(PRQueryProcessor.java:248) {noformat} Here is an example query that fails: {noformat} SELECT * FROM /trade WHERE arrangementId = 'aId_1' AND tradeStatus.toString() != 'CLOSED' {noformat} Here is a test that reproduces it: * start one server with region configured as PARTITION with: ** 2 buckets ** PartitionResolver that puts the first entry in bucket 0, every other entry in bucket 1 * load N entries * the index in bucket 0 becomes the arbitraryBucketIndex * start a second server * rebalance * bucket 0 moves from the first server to the second server * run the
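The bucket-placement trick in the reproduction steps above can be sketched in isolation. This is a hypothetical, self-contained model, not the test's actual resolver: Geode places an entry in bucket |routingObject.hashCode() % totalNumBuckets|, so with 2 buckets, routing the first key to 0 and every other key to 1 splits the data as described (the class and key names below are invented for illustration):

```java
// Hypothetical sketch of the PartitionResolver used in the reproduction.
// With TOTAL_BUCKETS = 2, returning routing object 0 for the first key and 1
// for all others pins the first entry to bucket 0 and the rest to bucket 1.
public class TwoBucketResolverSketch {
    static final int TOTAL_BUCKETS = 2;

    // Returns the routing object for a key: first key -> 0, all others -> 1.
    static Integer getRoutingObject(String key, String firstKey) {
        return key.equals(firstKey) ? 0 : 1;
    }

    // Models Geode's routing-object-to-bucket mapping (hashCode mod buckets).
    static int bucketFor(Integer routingObject) {
        return Math.abs(routingObject.hashCode() % TOTAL_BUCKETS);
    }

    public static void main(String[] args) {
        System.out.println(bucketFor(getRoutingObject("key0", "key0"))); // 0
        System.out.println(bucketFor(getRoutingObject("key1", "key0"))); // 1
        System.out.println(bucketFor(getRoutingObject("key7", "key0"))); // 1
    }
}
```

With this layout, moving bucket 0 during rebalance moves exactly the bucket whose index became the arbitraryBucketIndex, which is what triggers the RegionDestroyedException above.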
[jira] [Resolved] (GEODE-8992) When a GatewaySenderEventImpl is serialized, its operationDetail field is not included
[ https://issues.apache.org/jira/browse/GEODE-8992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barrett Oglesby resolved GEODE-8992. Fix Version/s: 1.15.0 Resolution: Fixed > When a GatewaySenderEventImpl is serialized, its operationDetail field is not > included > -- > > Key: GEODE-8992 > URL: https://issues.apache.org/jira/browse/GEODE-8992 > Project: Geode > Issue Type: Bug > Components: wan >Reporter: Barrett Oglesby >Assignee: Barrett Oglesby >Priority: Major > Labels: blocks-1.15.0, pull-request-available > Fix For: 1.15.0 > > > This causes the operation to become less specific when the > {{GatewaySenderEventImpl}} is deserialized. > Here is an example. > If the original {{GatewaySenderEventImpl}} is a *PUTALL_CREATE* like: > {noformat} > GatewaySenderEventImpl[id=EventID[id=31 > bytes;threadID=0x10063|1;sequenceID=0;bucketId=99];action=0;operation=PUTALL_CREATE;region=/data;key=0;value=0;...] > {noformat} > Then, when the {{GatewaySenderEventImpl}} is serialized and deserialized, its > operation becomes a *CREATE*: > {noformat} > GatewaySenderEventImpl[id=EventID[id=31 > bytes;threadID=0x10063|1;sequenceID=0;bucketId=99];action=0;operation=CREATE;region=/data;key=0;value=0;...] > {noformat} > That's because {{GatewaySenderEventImpl.getOperation}} uses both *action* and > *operationDetail* to determine its operation: > {noformat} > public Operation getOperation() { > Operation op = null; > switch (this.action) { > case CREATE_ACTION: > switch (this.operationDetail) { > case ... > case OP_DETAIL_PUTALL: > op = Operation.PUTALL_CREATE; > break; > default: > op = Operation.CREATE; > break; > } > ... > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (GEODE-8992) When a GatewaySenderEventImpl is serialized, its operationDetail field is not included
[ https://issues.apache.org/jira/browse/GEODE-8992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barrett Oglesby reassigned GEODE-8992: -- Assignee: Barrett Oglesby > When a GatewaySenderEventImpl is serialized, its operationDetail field is not > included > -- > > Key: GEODE-8992 > URL: https://issues.apache.org/jira/browse/GEODE-8992 > Project: Geode > Issue Type: Bug > Components: wan >Reporter: Barrett Oglesby >Assignee: Barrett Oglesby >Priority: Major > Labels: blocks-1.15.0 > > This causes the operation to become less specific when the > {{GatewaySenderEventImpl}} is deserialized. > Here is an example. > If the original {{GatewaySenderEventImpl}} is a *PUTALL_CREATE* like: > {noformat} > GatewaySenderEventImpl[id=EventID[id=31 > bytes;threadID=0x10063|1;sequenceID=0;bucketId=99];action=0;operation=PUTALL_CREATE;region=/data;key=0;value=0;...] > {noformat} > Then, when the {{GatewaySenderEventImpl}} is serialized and deserialized, its > operation becomes a *CREATE*: > {noformat} > GatewaySenderEventImpl[id=EventID[id=31 > bytes;threadID=0x10063|1;sequenceID=0;bucketId=99];action=0;operation=CREATE;region=/data;key=0;value=0;...] > {noformat} > That's because {{GatewaySenderEventImpl.getOperation}} uses both *action* and > *operationDetail* to determine its operation: > {noformat} > public Operation getOperation() { > Operation op = null; > switch (this.action) { > case CREATE_ACTION: > switch (this.operationDetail) { > case ... > case OP_DETAIL_PUTALL: > op = Operation.PUTALL_CREATE; > break; > default: > op = Operation.CREATE; > break; > } > ... > {noformat}
[jira] [Created] (GEODE-8992) When a GatewaySenderEventImpl is serialized, its operationDetail field is not included
Barrett Oglesby created GEODE-8992: -- Summary: When a GatewaySenderEventImpl is serialized, its operationDetail field is not included Key: GEODE-8992 URL: https://issues.apache.org/jira/browse/GEODE-8992 Project: Geode Issue Type: Bug Components: wan Reporter: Barrett Oglesby This causes the operation to become less specific when the {{GatewaySenderEventImpl}} is deserialized. Here is an example. If the original {{GatewaySenderEventImpl}} is a *PUTALL_CREATE* like: {noformat} GatewaySenderEventImpl[id=EventID[id=31 bytes;threadID=0x10063|1;sequenceID=0;bucketId=99];action=0;operation=PUTALL_CREATE;region=/data;key=0;value=0;...] {noformat} Then, when the {{GatewaySenderEventImpl}} is serialized and deserialized, its operation becomes a *CREATE*: {noformat} GatewaySenderEventImpl[id=EventID[id=31 bytes;threadID=0x10063|1;sequenceID=0;bucketId=99];action=0;operation=CREATE;region=/data;key=0;value=0;...] {noformat} That's because {{GatewaySenderEventImpl.getOperation}} uses both *action* and *operationDetail* to determine its operation: {noformat} public Operation getOperation() { Operation op = null; switch (this.action) { case CREATE_ACTION: switch (this.operationDetail) { case ... case OP_DETAIL_PUTALL: op = Operation.PUTALL_CREATE; break; default: op = Operation.CREATE; break; } ... {noformat}
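The getOperation logic quoted above can be modeled in isolation to show why losing operationDetail during serialization degrades the operation. This is a simplified, self-contained sketch, not Geode's actual class; the int constants below are stand-ins for the real action/operationDetail fields:

```java
// Simplified stand-in for GatewaySenderEventImpl.getOperation: the resulting
// Operation depends on BOTH action and operationDetail, so a round trip that
// drops operationDetail collapses PUTALL_CREATE into plain CREATE.
public class OperationSketch {
    enum Operation { CREATE, PUTALL_CREATE }

    static final int CREATE_ACTION = 0;
    static final int OP_DETAIL_NONE = 0;   // what a deserialized event sees
    static final int OP_DETAIL_PUTALL = 1;

    static Operation getOperation(int action, int operationDetail) {
        if (action == CREATE_ACTION) {
            // Without the PUTALL detail, the default branch yields CREATE.
            return operationDetail == OP_DETAIL_PUTALL
                ? Operation.PUTALL_CREATE
                : Operation.CREATE;
        }
        throw new IllegalArgumentException("unhandled action: " + action);
    }

    public static void main(String[] args) {
        // Before serialization: both fields are present.
        System.out.println(getOperation(CREATE_ACTION, OP_DETAIL_PUTALL)); // PUTALL_CREATE
        // After a round trip that omitted operationDetail: less specific.
        System.out.println(getOperation(CREATE_ACTION, OP_DETAIL_NONE));   // CREATE
    }
}
```

The fix implied by the issue title is simply to include operationDetail in the event's serialized form so both inputs survive the round trip.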
[jira] [Commented] (GEODE-8926) CQ events can be missed while executing with initial results simultaneously with transactions
[ https://issues.apache.org/jira/browse/GEODE-8926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17281309#comment-17281309 ] Barrett Oglesby commented on GEODE-8926: I attached a sequence diagram showing the interleaved behavior that causes the issue. > CQ events can be missed while executing with initial results simultaneously > with transactions > - > > Key: GEODE-8926 > URL: https://issues.apache.org/jira/browse/GEODE-8926 > Project: Geode > Issue Type: Bug > Components: cq >Reporter: Barrett Oglesby >Priority: Major > Attachments: cq_with_transaction_behavior.png > > > In this case, the event is not in either the initial results or received in > the CqListener. > A test that shows the behavior is: > - 2 servers with: > - a root PR > - a colocated child PR > In a client, asynchronously: > - start a transaction that: > - does N puts into the root PR > - does 1 put into the child PR > - commit the transaction > In the client: > create N CQs with initial results with: 'select * from /childPR' > When the test succeeds, all the CQs either get the 1 event in their initial > results or in their CqListener. > When the test fails, one or more CQs don't see the event either way.
[jira] [Updated] (GEODE-8926) CQ events can be missed while executing with initial results simultaneously with transactions
[ https://issues.apache.org/jira/browse/GEODE-8926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barrett Oglesby updated GEODE-8926: --- Attachment: cq_with_transaction_behavior.png > CQ events can be missed while executing with initial results simultaneously > with transactions > - > > Key: GEODE-8926 > URL: https://issues.apache.org/jira/browse/GEODE-8926 > Project: Geode > Issue Type: Bug > Components: cq >Reporter: Barrett Oglesby >Priority: Major > Attachments: cq_with_transaction_behavior.png > > > In this case, the event is not in either the initial results or received in > the CqListener. > A test that shows the behavior is: > - 2 servers with: > - a root PR > - a colocated child PR > In a client, asynchronously: > - start a transaction that: > - does N puts into the root PR > - does 1 put into the child PR > - commit the transaction > In the client: > create N CQs with initial results with: 'select * from /childPR' > When the test succeeds, all the CQs either get the 1 event in their initial > results or in their CqListener. > When the test fails, one or more CQs don't see the event either way.
[jira] [Created] (GEODE-8926) CQ events can be missed while executing with initial results simultaneously with transactions
Barrett Oglesby created GEODE-8926: -- Summary: CQ events can be missed while executing with initial results simultaneously with transactions Key: GEODE-8926 URL: https://issues.apache.org/jira/browse/GEODE-8926 Project: Geode Issue Type: Bug Components: cq Reporter: Barrett Oglesby In this case, the event is not in either the initial results or received in the CqListener. A test that shows the behavior is: - 2 servers with: - a root PR - a colocated child PR In a client, asynchronously: - start a transaction that: - does N puts into the root PR - does 1 put into the child PR - commit the transaction In the client: create N CQs with initial results with: 'select * from /childPR' When the test succeeds, all the CQs either get the 1 event in their initial results or in their CqListener. When the test fails, one or more CQs don't see the event either way.
[jira] [Created] (GEODE-8916) The gfsh export stack traces command should include the locators
Barrett Oglesby created GEODE-8916: -- Summary: The gfsh export stack traces command should include the locators Key: GEODE-8916 URL: https://issues.apache.org/jira/browse/GEODE-8916 Project: Geode Issue Type: Bug Reporter: Barrett Oglesby The gfsh export stack traces command should include the locators, but only includes the servers. Here is an excerpt from a slack conversation showing the behavior: {noformat} Shelley Hughes-Godfrey 6:48 PM I have a question about gfsh export stack-traces ... "list members" shows me servers and locators ... gfsh>list members Member Count : 3 Name| Id - | gemfire-cluster-server-0 | xx.xx.x.xxx(gemfire-cluster-server-0:1):41000 gemfire-cluster-locator-0 | xx.xx.x.xxx(gemfire-cluster-locator-0:1:locator):41000 [Coordinator] gemfire-cluster-server-1 | xx.xx.x.xxx(gemfire-cluster-server-1:1):41000 But, if I don't specify members on the export stack-traces command, I just get the stacks for the servers. gfsh>export stack-traces stack-trace(s) exported to file: /path/stacktrace_1612316330340 On host : ... Specifying a locator returns "No Members found" gfsh>export stack-traces --member=gemfire-cluster-locator-0 No Members Found Barry Oglesby 2 hours ago That command excludes the locators. 
It uses this method in ManagementUtils to get just the normal members: public static Set getAllNormalMembers(InternalCache cache) { return new HashSet( cache.getDistributionManager().getNormalDistributionManagerIds()); } Shelley Hughes-Godfrey 1 hour ago So, I also ran "export logs" with --member= And that works gfsh>list members Member Count : 3 Name| Id - | gemfire-cluster-server-0 | xx.xx.x.xxx(gemfire-cluster-server-0:1):41000 gemfire-cluster-locator-0 | xx.xx.x.xxx(gemfire-cluster-locator-0:1:locator):41000 [Coordinator] gemfire-cluster-server-1 | xx.xx.x.xxx(gemfire-cluster-server-1:1):41000 gfsh>export logs --member=gemfire-cluster-locator-0 Logs exported to the connected member's file system: /path/exportedLogs_1612374651595.zip Barry Oglesby 44 minutes ago The ExportLogsCommand gets all the members including the locators: Set targetMembers = getMembersIncludingLocators(groups, memberIds); I tried a test by changing ExportStackTraceCommand.exportStackTrace: From: Set targetMembers = getMembers(group, memberNameOrId); To: Set targetMembers = getMembersIncludingLocators(group, memberNameOrId); And the locator stack was exported: *** Stack-trace for member locator at 2021/02/03 10:01:28.824 *** {noformat}
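The difference between the two member lookups in the conversation above can be sketched with a self-contained model. The Member record and cluster list below are invented stand-ins (the real methods operate on DistributedMember sets), but they show why a locator name matches nothing under getMembers while the one-line change to getMembersIncludingLocators makes it a valid target:

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical model of the member-selection difference: getMembers() only
// sees "normal" members (servers), getMembersIncludingLocators() sees both.
public class MemberSelectionSketch {
    record Member(String name, boolean locator) {}

    static final List<Member> CLUSTER = List.of(
        new Member("gemfire-cluster-server-0", false),
        new Member("gemfire-cluster-locator-0", true),
        new Member("gemfire-cluster-server-1", false));

    // Mirrors ManagementUtils.getAllNormalMembers: locators filtered out.
    static Set<String> getMembers() {
        return CLUSTER.stream().filter(m -> !m.locator())
            .map(Member::name).collect(Collectors.toSet());
    }

    // Mirrors getMembersIncludingLocators: every member is eligible.
    static Set<String> getMembersIncludingLocators() {
        return CLUSTER.stream().map(Member::name).collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        // export stack-traces today: locator name matches nothing.
        System.out.println(getMembers().contains("gemfire-cluster-locator-0"));                  // false
        // with the one-line change: locator is a valid target.
        System.out.println(getMembersIncludingLocators().contains("gemfire-cluster-locator-0")); // true
    }
}
```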
[jira] [Resolved] (GEODE-8827) The DiskRegionStats bytesOnlyOnDisk stat is not incremented during persistent region recovery
[ https://issues.apache.org/jira/browse/GEODE-8827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barrett Oglesby resolved GEODE-8827. Fix Version/s: 1.14.0 Resolution: Fixed > The DiskRegionStats bytesOnlyOnDisk stat is not incremented during persistent > region recovery > - > > Key: GEODE-8827 > URL: https://issues.apache.org/jira/browse/GEODE-8827 > Project: Geode > Issue Type: Bug > Components: persistence, statistics >Reporter: Barrett Oglesby >Assignee: Barrett Oglesby >Priority: Major > Labels: pull-request-available > Fix For: 1.14.0 > > Attachments: > DiskRegionStats_entriesOnlyOnDisk_bytesOnlyOnDisk_after_gets.gif, > DiskRegionStats_entriesOnlyOnDisk_bytesOnlyOnDisk_after_gets_with_change.gif, > DiskRegionStats_entriesOnlyOnDisk_bytesOnlyOnDisk_after_restart.gif, > DiskRegionStats_entriesOnlyOnDisk_bytesOnlyOnDisk_after_restart_with_change.gif, > DiskRegionStats_entriesOnlyOnDisk_bytesOnlyOnDisk_no_eviction.gif > > > With a test like: > - 2 servers with partitioned region configured like: > ** persistence enabled > ** heap eviction with overflow enabled > - load enough entries to cause overflow > - shut down the servers > - restart the servers > - execute a function to get all entries in each server > After the step to restart the servers, the bytesOnlyOnDisk stat is 0. > After the step to get all entries, the bytesOnlyOnDisk stat is negative. 
> The entriesInVM and entriesOnlyOnDisk stats are incremented as BucketRegions > are recovered from disk in LocalRegion.initializeStats here: > {noformat} > java.lang.Exception: Stack trace > at java.lang.Thread.dumpStack(Thread.java:1333) > at > org.apache.geode.internal.cache.LocalRegion.initializeStats(LocalRegion.java:10222) > at > org.apache.geode.internal.cache.BucketRegion.initializeStats(BucketRegion.java:2163) > at > org.apache.geode.internal.cache.AbstractDiskRegion.copyExistingRegionMap(AbstractDiskRegion.java:775) > at > org.apache.geode.internal.cache.DiskStoreImpl.initializeOwner(DiskStoreImpl.java:631) > at > org.apache.geode.internal.cache.DiskRegion.initializeOwner(DiskRegion.java:239) > at > org.apache.geode.internal.cache.DistributedRegion.initialize(DistributedRegion.java:1081) > at > org.apache.geode.internal.cache.BucketRegion.initialize(BucketRegion.java:262) > at > org.apache.geode.internal.cache.LocalRegion.createSubregion(LocalRegion.java:981) > at > org.apache.geode.internal.cache.PartitionedRegionDataStore.createBucketRegion(PartitionedRegionDataStore.java:785) > at > org.apache.geode.internal.cache.PartitionedRegionDataStore.grabFreeBucket(PartitionedRegionDataStore.java:460) > at > org.apache.geode.internal.cache.PartitionedRegionDataStore.grabFreeBucketRecursively(PartitionedRegionDataStore.java:319) > at > org.apache.geode.internal.cache.PartitionedRegionDataStore.grabBucket(PartitionedRegionDataStore.java:2896) > at > org.apache.geode.internal.cache.ProxyBucketRegion.recoverFromDisk(ProxyBucketRegion.java:441) > at > org.apache.geode.internal.cache.ProxyBucketRegion.recoverFromDiskRecursively(ProxyBucketRegion.java:407) > at > org.apache.geode.internal.cache.PRHARedundancyProvider$2.run2(PRHARedundancyProvider.java:1640) > at > org.apache.geode.internal.cache.partitioned.RecoveryRunnable.run(RecoveryRunnable.java:60) > at > org.apache.geode.internal.cache.PRHARedundancyProvider$2.run(PRHARedundancyProvider.java:1630) > at 
java.lang.Thread.run(Thread.java:745) > {noformat} > The current LocalRegion.initializeStats method implementation is: > {noformat} > public void initializeStats(long numEntriesInVM, long numOverflowOnDisk, > long numOverflowBytesOnDisk) { > getDiskRegion().getStats().incNumEntriesInVM(numEntriesInVM); > getDiskRegion().getStats().incNumOverflowOnDisk(numOverflowOnDisk); > } > {noformat} > Even though numOverflowBytesOnDisk is passed into this method, it is ignored > as this logging shows: > {noformat} > [warn 2021/01/12 11:19:11.785 PST > tid=0x49] XXX LocalRegion.initializeStats numOverflowBytesOnDisk=4546560; > bytesOnlyOnDiskFromStats=0 > [warn 2021/01/12 11:19:11.791 PST > tid=0x4f] XXX LocalRegion.initializeStats numOverflowBytesOnDisk=4536320; > bytesOnlyOnDiskFromStats=0 > [warn 2021/01/12 11:19:11.797 PST > tid=0x4c] XXX LocalRegion.initializeStats numOverflowBytesOnDisk=4526080; > bytesOnlyOnDiskFromStats=0 > [warn 2021/01/12 11:19:11.800 PST > tid=0x48] XXX LocalRegion.initializeStats numOverflowBytesOnDisk=4546560; > bytesOnlyOnDiskFromStats=0 > [warn 2021/01/12 11:19:11.801 PST > tid=0x4e] XXX LocalRegion.initializeStats
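Based on the initializeStats method shown above, the likely shape of the fix is to apply the third argument instead of dropping it. A self-contained sketch with a stand-in stats holder (not Geode's actual DiskRegionStats API) shows the missing increment:

```java
// Stand-in for DiskRegionStats with only the three counters relevant here.
public class InitializeStatsSketch {
    static class DiskStats {
        long entriesInVM;
        long entriesOnlyOnDisk;
        long bytesOnlyOnDisk;
    }

    // Mirrors the current LocalRegion.initializeStats, plus the missing line:
    // numOverflowBytesOnDisk was accepted but never applied, so bytesOnlyOnDisk
    // stayed 0 after recovery and went negative once entries were faulted in.
    static void initializeStats(DiskStats stats, long numEntriesInVM,
                                long numOverflowOnDisk, long numOverflowBytesOnDisk) {
        stats.entriesInVM += numEntriesInVM;
        stats.entriesOnlyOnDisk += numOverflowOnDisk;
        stats.bytesOnlyOnDisk += numOverflowBytesOnDisk; // the missing increment
    }

    public static void main(String[] args) {
        DiskStats stats = new DiskStats();
        // Values modeled on the logging above (4546560 overflow bytes).
        initializeStats(stats, 100, 4440, 4546560);
        System.out.println(stats.bytesOnlyOnDisk); // 4546560, no longer 0
    }
}
```

With the increment in place, later decrements for faulted-in entries subtract from a correctly initialized total instead of driving the stat negative.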
[jira] [Resolved] (GEODE-8278) Gateway sender queues using heap memory way above configured value after server restart
[ https://issues.apache.org/jira/browse/GEODE-8278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barrett Oglesby resolved GEODE-8278. Fix Version/s: 1.14.0 Resolution: Fixed > Gateway sender queues using heap memory way above configured value after > server restart > --- > > Key: GEODE-8278 > URL: https://issues.apache.org/jira/browse/GEODE-8278 > Project: Geode > Issue Type: Bug > Components: eviction >Reporter: Alberto Gomez >Assignee: Barrett Oglesby >Priority: Major > Labels: pull-request-available > Fix For: 1.14.0 > > > In a Geode system with the following characteristics: > * WAN replication > * partitioned redundant regions > * overflow configured for the gateway sender queues by means of persistence > and a maximum queue memory setting > * gateway receivers stopped in one site (B) > * operations sent to the site that does not have the gateway receivers > stopped (A) > When operations are sent to site A, the gateway sender queues start to grow > as expected, and the heap memory consumed by the queues does not grow > indefinitely because entries overflow to disk when the limit is reached. > However, if a server is restarted, the restarted server shows much higher > heap memory usage than this server used before it was restarted > or than the other servers use. > This can even prevent the server from being restarted if the heap memory > it requires is above the configured limit. > According to the memory analyzer, the entries taking up the memory are > subclasses of {{VMThinDiskLRURegionEntryHeap}}. > The number of instances of this type is the same in the restarted server > as in the non-restarted servers, but on the restarted server they take much > more memory. 
The reason seems to be that the {{value}} member attribute of > the instances in the restarted server contains > {{VMCachedDeserializable}} objects, while in the non-restarted > server the attribute contains either {{null}} or > {{GatewaySenderEventImpl}} objects, which use much less memory than the > {{VMCachedDeserializable}} ones. > If redundancy is not configured for the region, the problem does not > manifest, i.e. the heap memory used by the restarted server is similar to > what it was prior to the restart. > If the node that was not restarted is then restarted, the previously restarted node > seems to release the extra memory (my guess is that it is processing the > other process's queue). > Also, if traffic is sent to the Geode cluster again, eviction seems to > kick in, and after a short time the memory of the restarted server goes > down to the level it had before it was restarted. > In summary, the problem seems to be that if a server does GII > (getInitialImage) from another server, eviction does not occur for gateway > sender queue entries.
[jira] [Assigned] (GEODE-8278) Gateway sender queues using heap memory way above configured value after server restart
[ https://issues.apache.org/jira/browse/GEODE-8278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barrett Oglesby reassigned GEODE-8278: -- Assignee: Barrett Oglesby (was: Alberto Gomez) > Gateway sender queues using heap memory way above configured value after > server restart > --- > > Key: GEODE-8278 > URL: https://issues.apache.org/jira/browse/GEODE-8278 > Project: Geode > Issue Type: Bug > Components: eviction >Reporter: Alberto Gomez >Assignee: Barrett Oglesby >Priority: Major > Labels: pull-request-available > > In a Geode system with the following characteristics: > * WAN replication > * partitioned redundant regions > * overflow configured for the gateway sender queues by means of persistence > and a maximum queue memory setting > * gateway receivers stopped in one site (B) > * operations sent to the site that does not have the gateway receivers > stopped (A) > When operations are sent to site A, the gateway sender queues start to grow > as expected, and the heap memory consumed by the queues does not grow > indefinitely because entries overflow to disk when the limit is reached. > However, if a server is restarted, the restarted server shows much higher > heap memory usage than this server used before it was restarted > or than the other servers use. > This can even prevent the server from being restarted if the heap memory > it requires is above the configured limit. > According to the memory analyzer, the entries taking up the memory are > subclasses of {{VMThinDiskLRURegionEntryHeap}}. > The number of instances of this type is the same in the restarted server > as in the non-restarted servers, but on the restarted server they take much > more memory. 
The reason seems to be that the {{value}} member attribute of > the instances in the restarted server contains > {{VMCachedDeserializable}} objects, while in the non-restarted > server the attribute contains either {{null}} or > {{GatewaySenderEventImpl}} objects, which use much less memory than the > {{VMCachedDeserializable}} ones. > If redundancy is not configured for the region, the problem does not > manifest, i.e. the heap memory used by the restarted server is similar to > what it was prior to the restart. > If the node that was not restarted is then restarted, the previously restarted node > seems to release the extra memory (my guess is that it is processing the > other process's queue). > Also, if traffic is sent to the Geode cluster again, eviction seems to > kick in, and after a short time the memory of the restarted server goes > down to the level it had before it was restarted. > In summary, the problem seems to be that if a server does GII > (getInitialImage) from another server, eviction does not occur for gateway > sender queue entries.
[jira] [Updated] (GEODE-8827) The DiskRegionStats bytesOnlyOnDisk stat is not incremented during persistent region recovery
[ https://issues.apache.org/jira/browse/GEODE-8827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barrett Oglesby updated GEODE-8827: --- Attachment: DiskRegionStats_entriesOnlyOnDisk_bytesOnlyOnDisk_no_eviction.gif > The DiskRegionStats bytesOnlyOnDisk stat is not incremented during persistent > region recovery > - > > Key: GEODE-8827 > URL: https://issues.apache.org/jira/browse/GEODE-8827 > Project: Geode > Issue Type: Bug > Components: persistence, statistics >Reporter: Barrett Oglesby >Assignee: Barrett Oglesby >Priority: Major > Labels: pull-request-available > Attachments: > DiskRegionStats_entriesOnlyOnDisk_bytesOnlyOnDisk_after_gets.gif, > DiskRegionStats_entriesOnlyOnDisk_bytesOnlyOnDisk_after_gets_with_change.gif, > DiskRegionStats_entriesOnlyOnDisk_bytesOnlyOnDisk_after_restart.gif, > DiskRegionStats_entriesOnlyOnDisk_bytesOnlyOnDisk_after_restart_with_change.gif, > DiskRegionStats_entriesOnlyOnDisk_bytesOnlyOnDisk_no_eviction.gif > > > With a test like: > - 2 servers with partitioned region configured like: > ** persistence enabled > ** heap eviction with overflow enabled > - load enough entries to cause overflow > - shut down the servers > - restart the servers > - execute a function to get all entries in each server > After the step to restart the servers, the bytesOnlyOnDisk stat is 0. > After the step to get all entries, the bytesOnlyOnDisk stat is negative. 
> The entriesInVM and entriesOnlyOnDisk stats are incremented as BucketRegions > are recovered from disk in LocalRegion.initializeStats here: > {noformat} > java.lang.Exception: Stack trace > at java.lang.Thread.dumpStack(Thread.java:1333) > at > org.apache.geode.internal.cache.LocalRegion.initializeStats(LocalRegion.java:10222) > at > org.apache.geode.internal.cache.BucketRegion.initializeStats(BucketRegion.java:2163) > at > org.apache.geode.internal.cache.AbstractDiskRegion.copyExistingRegionMap(AbstractDiskRegion.java:775) > at > org.apache.geode.internal.cache.DiskStoreImpl.initializeOwner(DiskStoreImpl.java:631) > at > org.apache.geode.internal.cache.DiskRegion.initializeOwner(DiskRegion.java:239) > at > org.apache.geode.internal.cache.DistributedRegion.initialize(DistributedRegion.java:1081) > at > org.apache.geode.internal.cache.BucketRegion.initialize(BucketRegion.java:262) > at > org.apache.geode.internal.cache.LocalRegion.createSubregion(LocalRegion.java:981) > at > org.apache.geode.internal.cache.PartitionedRegionDataStore.createBucketRegion(PartitionedRegionDataStore.java:785) > at > org.apache.geode.internal.cache.PartitionedRegionDataStore.grabFreeBucket(PartitionedRegionDataStore.java:460) > at > org.apache.geode.internal.cache.PartitionedRegionDataStore.grabFreeBucketRecursively(PartitionedRegionDataStore.java:319) > at > org.apache.geode.internal.cache.PartitionedRegionDataStore.grabBucket(PartitionedRegionDataStore.java:2896) > at > org.apache.geode.internal.cache.ProxyBucketRegion.recoverFromDisk(ProxyBucketRegion.java:441) > at > org.apache.geode.internal.cache.ProxyBucketRegion.recoverFromDiskRecursively(ProxyBucketRegion.java:407) > at > org.apache.geode.internal.cache.PRHARedundancyProvider$2.run2(PRHARedundancyProvider.java:1640) > at > org.apache.geode.internal.cache.partitioned.RecoveryRunnable.run(RecoveryRunnable.java:60) > at > org.apache.geode.internal.cache.PRHARedundancyProvider$2.run(PRHARedundancyProvider.java:1630) > at 
java.lang.Thread.run(Thread.java:745) > {noformat} > The current LocalRegion.initializeStats method implementation is: > {noformat} > public void initializeStats(long numEntriesInVM, long numOverflowOnDisk, > long numOverflowBytesOnDisk) { > getDiskRegion().getStats().incNumEntriesInVM(numEntriesInVM); > getDiskRegion().getStats().incNumOverflowOnDisk(numOverflowOnDisk); > } > {noformat} > Even though numOverflowBytesOnDisk is passed into this method, it is ignored > as this logging shows: > {noformat} > [warn 2021/01/12 11:19:11.785 PST > tid=0x49] XXX LocalRegion.initializeStats numOverflowBytesOnDisk=4546560; > bytesOnlyOnDiskFromStats=0 > [warn 2021/01/12 11:19:11.791 PST > tid=0x4f] XXX LocalRegion.initializeStats numOverflowBytesOnDisk=4536320; > bytesOnlyOnDiskFromStats=0 > [warn 2021/01/12 11:19:11.797 PST > tid=0x4c] XXX LocalRegion.initializeStats numOverflowBytesOnDisk=4526080; > bytesOnlyOnDiskFromStats=0 > [warn 2021/01/12 11:19:11.800 PST > tid=0x48] XXX LocalRegion.initializeStats numOverflowBytesOnDisk=4546560; > bytesOnlyOnDiskFromStats=0 > [warn 2021/01/12 11:19:11.801 PST > tid=0x4e] XXX LocalRegion.initializeStats
[jira] [Commented] (GEODE-8827) The DiskRegionStats bytesOnlyOnDisk stat is not incremented during persistent region recovery
[ https://issues.apache.org/jira/browse/GEODE-8827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17264356#comment-17264356 ] Barrett Oglesby commented on GEODE-8827: Here is a simpler test that shows a negative bytesOnlyOnDisk after server restart: - start 1 server with persistent partitioned region - load entries - bounce server After the step to bounce the server, the bytesOnlyOnDisk stat is negative. The attached DiskRegionStats_entriesOnlyOnDisk_bytesOnlyOnDisk_no_eviction.gif chart shows this behavior. > The DiskRegionStats bytesOnlyOnDisk stat is not incremented during persistent > region recovery > - > > Key: GEODE-8827 > URL: https://issues.apache.org/jira/browse/GEODE-8827 > Project: Geode > Issue Type: Bug > Components: persistence, statistics >Reporter: Barrett Oglesby >Assignee: Barrett Oglesby >Priority: Major > Labels: pull-request-available > Attachments: > DiskRegionStats_entriesOnlyOnDisk_bytesOnlyOnDisk_after_gets.gif, > DiskRegionStats_entriesOnlyOnDisk_bytesOnlyOnDisk_after_gets_with_change.gif, > DiskRegionStats_entriesOnlyOnDisk_bytesOnlyOnDisk_after_restart.gif, > DiskRegionStats_entriesOnlyOnDisk_bytesOnlyOnDisk_after_restart_with_change.gif, > DiskRegionStats_entriesOnlyOnDisk_bytesOnlyOnDisk_no_eviction.gif > > > With a test like: > - 2 servers with partitioned region configured like: > ** persistence enabled > ** heap eviction with overflow enabled > - load enough entries to cause overflow > - shut down the servers > - restart the servers > - execute a function to get all entries in each server > After the step to restart the servers, the bytesOnlyOnDisk stat is 0. > After the step to get all entries, the bytesOnlyOnDisk stat is negative. 
> The entriesInVM and entriesOnlyOnDisk stats are incremented as BucketRegions > are recovered from disk in LocalRegion.initializeStats here: > {noformat} > java.lang.Exception: Stack trace > at java.lang.Thread.dumpStack(Thread.java:1333) > at > org.apache.geode.internal.cache.LocalRegion.initializeStats(LocalRegion.java:10222) > at > org.apache.geode.internal.cache.BucketRegion.initializeStats(BucketRegion.java:2163) > at > org.apache.geode.internal.cache.AbstractDiskRegion.copyExistingRegionMap(AbstractDiskRegion.java:775) > at > org.apache.geode.internal.cache.DiskStoreImpl.initializeOwner(DiskStoreImpl.java:631) > at > org.apache.geode.internal.cache.DiskRegion.initializeOwner(DiskRegion.java:239) > at > org.apache.geode.internal.cache.DistributedRegion.initialize(DistributedRegion.java:1081) > at > org.apache.geode.internal.cache.BucketRegion.initialize(BucketRegion.java:262) > at > org.apache.geode.internal.cache.LocalRegion.createSubregion(LocalRegion.java:981) > at > org.apache.geode.internal.cache.PartitionedRegionDataStore.createBucketRegion(PartitionedRegionDataStore.java:785) > at > org.apache.geode.internal.cache.PartitionedRegionDataStore.grabFreeBucket(PartitionedRegionDataStore.java:460) > at > org.apache.geode.internal.cache.PartitionedRegionDataStore.grabFreeBucketRecursively(PartitionedRegionDataStore.java:319) > at > org.apache.geode.internal.cache.PartitionedRegionDataStore.grabBucket(PartitionedRegionDataStore.java:2896) > at > org.apache.geode.internal.cache.ProxyBucketRegion.recoverFromDisk(ProxyBucketRegion.java:441) > at > org.apache.geode.internal.cache.ProxyBucketRegion.recoverFromDiskRecursively(ProxyBucketRegion.java:407) > at > org.apache.geode.internal.cache.PRHARedundancyProvider$2.run2(PRHARedundancyProvider.java:1640) > at > org.apache.geode.internal.cache.partitioned.RecoveryRunnable.run(RecoveryRunnable.java:60) > at > org.apache.geode.internal.cache.PRHARedundancyProvider$2.run(PRHARedundancyProvider.java:1630) > at 
java.lang.Thread.run(Thread.java:745) > {noformat} > The current LocalRegion.initializeStats method implementation is: > {noformat} > public void initializeStats(long numEntriesInVM, long numOverflowOnDisk, > long numOverflowBytesOnDisk) { > getDiskRegion().getStats().incNumEntriesInVM(numEntriesInVM); > getDiskRegion().getStats().incNumOverflowOnDisk(numOverflowOnDisk); > } > {noformat} > Even though numOverflowBytesOnDisk is passed into this method, it is ignored > as this logging shows: > {noformat} > [warn 2021/01/12 11:19:11.785 PST > tid=0x49] XXX LocalRegion.initializeStats numOverflowBytesOnDisk=4546560; > bytesOnlyOnDiskFromStats=0 > [warn 2021/01/12 11:19:11.791 PST > tid=0x4f] XXX LocalRegion.initializeStats numOverflowBytesOnDisk=4536320; > bytesOnlyOnDiskFromStats=0 > [warn 2021/01/12 11:19:11.797 PST > tid=0x4c] XXX