[ https://issues.apache.org/jira/browse/GEODE-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16816695#comment-16816695 ]
Shelley Lynn Hughes-Godfrey edited comment on GEODE-6646 at 4/12/19 10:35 PM:
------------------------------------------------------------------------------

In this test, we start a locator and two servers (server-1 and server-2). We then forcefully disconnect server-2 and the locator, wait for the locator to reconnect, and start server-3. We also expect server-2 to reconnect, but it looks like the locator, server-2 and server-3 form a new DS (without server-1).

{noformat}
@Test
public void serverRestartsAfterLocatorReconnects() throws Exception {
  IgnoredException.addIgnoredException("org.apache.geode.ForcedDisconnectException: for testing");
  IgnoredException.addIgnoredException("cluster configuration service not available");
  IgnoredException.addIgnoredException("This thread has been stalled");
  IgnoredException
      .addIgnoredException("member unexpectedly shut down shared, unordered connection");
  IgnoredException.addIgnoredException("Connection refused");

  MemberVM locator0 = rule.startLocatorVM(0);
  rule.startServerVM(1, locator0.getPort());
  MemberVM server2 = rule.startServerVM(2, locator0.getPort());

  addDisconnectListener(locator0);

  server2.forceDisconnect();
  locator0.forceDisconnect();

  waitForLocatorToReconnect(locator0);

  rule.startServerVM(3, locator0.getPort());

  gfsh.connectAndVerify(locator0);

  await()
      .untilAsserted(() -> gfsh.executeAndAssertThat("list members").statusIsSuccess()
          .tableHasColumnOnlyWithValues("Name", "locator-0", "server-1", "server-2", "server-3"));
}
{noformat}

The locator and server-2 are forcefully disconnected at 19:30:45, and it looks like server-1 tried to become the coordinator. In the end, it did not get any responses from the others, and they seem to have created their own DS.
{noformat}
[vm2] [info 2019/04/12 19:30:45.491 UTC <RMI TCP Connection(1)-172.17.0.2> tid=0x20] GroupMembershipService.beSick invoked for 172.17.0.2(server-2:249)<v2>:41003 - simulating sickness
[vm2] [info 2019/04/12 19:30:45.491 UTC <RMI TCP Connection(1)-172.17.0.2> tid=0x20] GroupMembershipService.playDead invoked for 172.17.0.2(server-2:249)<v2>:41003
[vm0] [info 2019/04/12 19:30:45.716 UTC <RMI TCP Connection(1)-172.17.0.2> tid=0x20] GroupMembershipService.beSick invoked for 172.17.0.2(locator-0:1011:locator)<ec><v0>:41001 - simulating sickness
[vm0] [info 2019/04/12 19:30:45.716 UTC <RMI TCP Connection(1)-172.17.0.2> tid=0x20] GroupMembershipService.playDead invoked for 172.17.0.2(locator-0:1011:locator)<ec><v0>:41001
{noformat}

vm1 reports the locator and server-2 as suspect and becomes the membership coordinator

{noformat}
[vm1] [info 2019/04/12 19:30:50.772 UTC <Geode Failure Detection thread 3> tid=0xca] Availability check failed for member 172.17.0.2(server-2:249)<v2>:41003
[vm1] [info 2019/04/12 19:30:50.773 UTC <Geode Failure Detection thread 3> tid=0xca] Requesting removal of suspect member 172.17.0.2(server-2:249)<v2>:41003
[vm1] [info 2019/04/12 19:30:50.772 UTC <Geode Failure Detection thread 2> tid=0xc9] Availability check failed for member 172.17.0.2(locator-0:1011:locator)<ec><v0>:41001
[vm1] [info 2019/04/12 19:30:50.776 UTC <Geode Failure Detection thread 2> tid=0xc9] Requesting removal of suspect member 172.17.0.2(locator-0:1011:locator)<ec><v0>:41001
[vm1] [info 2019/04/12 19:30:50.776 UTC <Geode Failure Detection thread 2> tid=0xc9] This member is becoming the membership coordinator with address 172.17.0.2(server-1:245)<v1>:41002
[vm1] [info 2019/04/12 19:30:50.777 UTC <Geode Failure Detection thread 2> tid=0xc9] ViewCreator starting on:172.17.0.2(server-1:245)<v1>:41002
[vm1] [info 2019/04/12 19:30:50.777 UTC <Geode Membership View Creator> tid=0xcb] View Creator thread is starting
[vm1] [info 2019/04/12 19:30:50.779 UTC <Geode Membership View Creator> tid=0xcb] preparing new view View[172.17.0.2(server-1:245)<v1>:41002|9] members: [172.17.0.2(server-1:245)<v1>:41002{lead}, 172.17.0.2(server-2:249)<v2>:41003] crashed: [172.17.0.2(locator-0:1011:locator)<ec><v0>:41001]
...
[vm1] [info 2019/04/12 19:31:41.970 UTC <Geode Membership View Creator> tid=0xcb] sending new view View[172.17.0.2(server-1:245)<v1>:41002|12] members: [172.17.0.2(server-1:245)<v1>:41002{lead}] crashed: [172.17.0.2(locator-0:1011:locator)<ec><v11>:41001, 172.17.0.2(server-2:249)<v11>:41003]
[vm2] [info 2019/04/12 19:31:41.970 UTC <unicast receiver,bba57c926507-60306> tid=0x8a] Ignoring the view View[172.17.0.2(server-1:245)<v1>:41002|12] members: [172.17.0.2(server-1:245)<v1>:41002{lead}] crashed: [172.17.0.2(server-2:249)<v11>:41003, 172.17.0.2(locator-0:1011:locator)<ec><v11>:41001] from member 172.17.0.2<v1>:41002, which is not in my current view View[172.17.0.2(locator-0:1011:locator)<ec><v0>:41001|1] members: [172.17.0.2(locator-0:1011:locator)<ec><v0>:41001, 172.17.0.2(server-2:249)<v1>:41003{lead}, 172.17.0.2(server-3:255)<v1>:41004]
[vm0] [info 2019/04/12 19:31:41.970 UTC <unicast receiver,bba57c926507-9474> tid=0x31] Ignoring the view View[172.17.0.2(server-1:245)<v1>:41002|12] members: [172.17.0.2(server-1:245)<v1>:41002{lead}] crashed: [172.17.0.2(server-2:249)<v11>:41003, 172.17.0.2(locator-0:1011:locator)<ec><v11>:41001] from member 172.17.0.2<v1>:41002, which is not in my current view View[172.17.0.2(locator-0:1011:locator)<ec><v0>:41001|1] members: [172.17.0.2(locator-0:1011:locator)<ec><v0>:41001, 172.17.0.2(server-2:249)<v1>:41003{lead}, 172.17.0.2(server-3:255)<v1>:41004]
[vm0] [info 2019/04/12 19:31:42.024 UTC <RMI TCP Connection(7)-172.17.0.2> tid=0x20] Executing command: list members
Command result for <list members>:
Name      | Id
--------- | --------------------------------------------------------------
locator-0 | 172.17.0.2(locator-0:1011:locator)<ec><v0>:41001 [Coordinator]
server-2  | 172.17.0.2(server-2:249)<v1>:41003
server-3  | 172.17.0.2(server-3:255)<v1>:41004
{noformat}

> CI:
> org.apache.geode.management.internal.configuration.ClusterConfigLocatorRestartDUnitTest
> > serverRestartsAfterLocatorReconnects FAILED
> -----------------------------------------------------------------------------------------------------------------------------------------
>
> Key: GEODE-6646
> URL: https://issues.apache.org/jira/browse/GEODE-6646
> Project: Geode
> Issue Type: Bug
> Components: gfsh, membership
> Affects Versions: 1.10.0
> Reporter: Shelley Lynn Hughes-Godfrey
> Priority: Major
> Labels: CI
>
> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-develop-main/jobs/DistributedTestOpenJDK8/builds/617
> {noformat}
> org.apache.geode.management.internal.configuration.ClusterConfigLocatorRestartDUnitTest
> > serverRestartsAfterLocatorReconnects FAILED
>     org.awaitility.core.ConditionTimeoutException: Assertion condition defined as a lambda expression in org.apache.geode.management.internal.configuration.ClusterConfigLocatorRestartDUnitTest
>
>     Expecting:
>       <["locator-0", "server-2", "server-3"]>
>     to contain only:
>       <["locator-0", "server-1", "server-2", "server-3"]>
>     but could not find the following elements:
>       <["server-1"]>
>      within 300 seconds.
>         at org.awaitility.core.ConditionAwaiter.await(ConditionAwaiter.java:145)
>         at org.awaitility.core.AssertionCondition.await(AssertionCondition.java:122)
>         at org.awaitility.core.AssertionCondition.await(AssertionCondition.java:32)
>         at org.awaitility.core.ConditionFactory.until(ConditionFactory.java:902)
>         at org.awaitility.core.ConditionFactory.untilAsserted(ConditionFactory.java:723)
>         at org.apache.geode.management.internal.configuration.ClusterConfigLocatorRestartDUnitTest.serverRestartsAfterLocatorReconnects(ClusterConfigLocatorRestartDUnitTest.java:81)
>         Caused by:
>         java.lang.AssertionError:
>         Expecting:
>           <["locator-0", "server-2", "server-3"]>
>         to contain only:
>           <["locator-0", "server-1", "server-2", "server-3"]>
>         but could not find the following elements:
>           <["server-1"]>
>             at org.apache.geode.test.junit.assertions.CommandResultAssert.tableHasColumnOnlyWithValues(CommandResultAssert.java:308)
>             at org.apache.geode.management.internal.configuration.ClusterConfigLocatorRestartDUnitTest.lambda$serverRestartsAfterLocatorReconnects$0(ClusterConfigLocatorRestartDUnitTest.java:82)
> {noformat}
> Artifacts available here:
> {noformat}
> =-=-=-=-=-=-=-=-=-=-=-=-=-=-= Test Results URI =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
> http://files.apachegeode-ci.info/builds/apache-develop-main/1.10.0-SNAPSHOT.0177/test-results/distributedTest/1555101232/
> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
> Test report artifacts from this job are available at:
> http://files.apachegeode-ci.info/builds/apache-develop-main/1.10.0-SNAPSHOT.0177/test-artifacts/1555101232/distributedtestfiles-OpenJDK8-1.10.0-SNAPSHOT.0177.tgz
> {noformat}