[jira] [Comment Edited] (GEODE-6646) CI: org.apache.geode.management.internal.configuration.ClusterConfigLocatorRestartDUnitTest > serverRestartsAfterLocatorReconnects FAILED

Shelley Lynn Hughes-Godfrey (JIRA) Fri, 12 Apr 2019 15:36:13 -0700


    [ https://issues.apache.org/jira/browse/GEODE-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16816695#comment-16816695 ]


Shelley Lynn Hughes-Godfrey edited comment on GEODE-6646 at 4/12/19 10:35 PM:
------------------------------------------------------------------------------

In this test, we start a locator and 2 servers (server-1 and server-2).
Then we forcefully disconnect server-2 and the locator, wait for the locator 
to reconnect, and start server-3.

The test also expects server-2 to reconnect, but it looks like the locator, 
server-2 and server-3 form a new DS (without server-1).

{noformat}
  @Test
  public void serverRestartsAfterLocatorReconnects() throws Exception {
    IgnoredException.addIgnoredException("org.apache.geode.ForcedDisconnectException: for testing");
    IgnoredException.addIgnoredException("cluster configuration service not available");
    IgnoredException.addIgnoredException("This thread has been stalled");
    IgnoredException
        .addIgnoredException("member unexpectedly shut down shared, unordered connection");
    IgnoredException.addIgnoredException("Connection refused");

    MemberVM locator0 = rule.startLocatorVM(0);

    rule.startServerVM(1, locator0.getPort());
    MemberVM server2 = rule.startServerVM(2, locator0.getPort());

    addDisconnectListener(locator0);

    server2.forceDisconnect();
    locator0.forceDisconnect();

    waitForLocatorToReconnect(locator0);

    rule.startServerVM(3, locator0.getPort());

    gfsh.connectAndVerify(locator0);

    await()
        .untilAsserted(() -> gfsh.executeAndAssertThat("list members").statusIsSuccess()
            .tableHasColumnOnlyWithValues("Name", "locator-0", "server-1", "server-2", "server-3"));
  }
{noformat}
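
The final assertion requires the Name column of "list members" to contain exactly the four expected names. As a rough illustration of that "contains only" check, here is a minimal plain-Java sketch using the expected names from the test and the actual names reported in the CI failure quoted at the end of this message. The class and method names (MemberNameCheck, difference) are made up for this example; this is not the AssertJ or Geode implementation.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Geode or AssertJ.
class MemberNameCheck {

  /** Returns the elements of left that do not appear in right, in encounter order. */
  static Set<String> difference(List<String> left, List<String> right) {
    Set<String> result = new LinkedHashSet<>(left);
    result.removeAll(right);
    return result;
  }

  public static void main(String[] args) {
    // Names the test expects in the "Name" column.
    List<String> expected = Arrays.asList("locator-0", "server-1", "server-2", "server-3");
    // Names actually reported by "list members" in the failing run.
    List<String> actual = Arrays.asList("locator-0", "server-2", "server-3");

    // "Contains only" holds when both differences are empty.
    System.out.println("missing:    " + difference(expected, actual)); // [server-1]
    System.out.println("unexpected: " + difference(actual, expected)); // []
  }
}
```

Only the "missing" side is non-empty here, which matches the "could not find the following elements" part of the failure message.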

The locator and server-2 are forcefully disconnected at 19:30:45. It looks like 
server-1 tried to become the coordinator, but in the end it didn't get any 
responses from the others, and they seem to have created their own DS.
{noformat}
[vm2] [info 2019/04/12 19:30:45.491 UTC <RMI TCP Connection(1)-172.17.0.2> tid=0x20] GroupMembershipService.beSick invoked for 172.17.0.2(server-2:249)<v2>:41003 - simulating sickness

[vm2] [info 2019/04/12 19:30:45.491 UTC <RMI TCP Connection(1)-172.17.0.2> tid=0x20] GroupMembershipService.playDead invoked for 172.17.0.2(server-2:249)<v2>:41003

[vm0] [info 2019/04/12 19:30:45.716 UTC <RMI TCP Connection(1)-172.17.0.2> tid=0x20] GroupMembershipService.beSick invoked for 172.17.0.2(locator-0:1011:locator)<ec><v0>:41001 - simulating sickness

[vm0] [info 2019/04/12 19:30:45.716 UTC <RMI TCP Connection(1)-172.17.0.2> tid=0x20] GroupMembershipService.playDead invoked for 172.17.0.2(locator-0:1011:locator)<ec><v0>:41001
{noformat}

vm1 reports the locator and server-2 as suspect and becomes the membership 
coordinator:
{noformat}
[vm1] [info 2019/04/12 19:30:50.772 UTC <Geode Failure Detection thread 3> tid=0xca] Availability check failed for member 172.17.0.2(server-2:249)<v2>:41003

[vm1] [info 2019/04/12 19:30:50.773 UTC <Geode Failure Detection thread 3> tid=0xca] Requesting removal of suspect member 172.17.0.2(server-2:249)<v2>:41003

[vm1] [info 2019/04/12 19:30:50.772 UTC <Geode Failure Detection thread 2> tid=0xc9] Availability check failed for member 172.17.0.2(locator-0:1011:locator)<ec><v0>:41001

[vm1] [info 2019/04/12 19:30:50.776 UTC <Geode Failure Detection thread 2> tid=0xc9] Requesting removal of suspect member 172.17.0.2(locator-0:1011:locator)<ec><v0>:41001

[vm1] [info 2019/04/12 19:30:50.776 UTC <Geode Failure Detection thread 2> tid=0xc9] This member is becoming the membership coordinator with address 172.17.0.2(server-1:245)<v1>:41002

[vm1] [info 2019/04/12 19:30:50.777 UTC <Geode Failure Detection thread 2> tid=0xc9] ViewCreator starting on:172.17.0.2(server-1:245)<v1>:41002

[vm1] [info 2019/04/12 19:30:50.777 UTC <Geode Membership View Creator> tid=0xcb] View Creator thread is starting

[vm1] [info 2019/04/12 19:30:50.779 UTC <Geode Membership View Creator> tid=0xcb] preparing new view View[172.17.0.2(server-1:245)<v1>:41002|9] members: [172.17.0.2(server-1:245)<v1>:41002{lead}, 172.17.0.2(server-2:249)<v2>:41003]  crashed: [172.17.0.2(locator-0:1011:locator)<ec><v0>:41001]

...

[vm1] [info 2019/04/12 19:31:41.970 UTC <Geode Membership View Creator> tid=0xcb] sending new view View[172.17.0.2(server-1:245)<v1>:41002|12] members: [172.17.0.2(server-1:245)<v1>:41002{lead}]  crashed: [172.17.0.2(locator-0:1011:locator)<ec><v11>:41001, 172.17.0.2(server-2:249)<v11>:41003]

[vm2] [info 2019/04/12 19:31:41.970 UTC <unicast receiver,bba57c926507-60306> tid=0x8a] Ignoring the view View[172.17.0.2(server-1:245)<v1>:41002|12] members: [172.17.0.2(server-1:245)<v1>:41002{lead}]  crashed: [172.17.0.2(server-2:249)<v11>:41003, 172.17.0.2(locator-0:1011:locator)<ec><v11>:41001] from member 172.17.0.2<v1>:41002, which is not in my current view View[172.17.0.2(locator-0:1011:locator)<ec><v0>:41001|1] members: [172.17.0.2(locator-0:1011:locator)<ec><v0>:41001, 172.17.0.2(server-2:249)<v1>:41003{lead}, 172.17.0.2(server-3:255)<v1>:41004]

[vm0] [info 2019/04/12 19:31:41.970 UTC <unicast receiver,bba57c926507-9474> tid=0x31] Ignoring the view View[172.17.0.2(server-1:245)<v1>:41002|12] members: [172.17.0.2(server-1:245)<v1>:41002{lead}]  crashed: [172.17.0.2(server-2:249)<v11>:41003, 172.17.0.2(locator-0:1011:locator)<ec><v11>:41001] from member 172.17.0.2<v1>:41002, which is not in my current view View[172.17.0.2(locator-0:1011:locator)<ec><v0>:41001|1] members: [172.17.0.2(locator-0:1011:locator)<ec><v0>:41001, 172.17.0.2(server-2:249)<v1>:41003{lead}, 172.17.0.2(server-3:255)<v1>:41004]

[vm0] [info 2019/04/12 19:31:42.024 UTC <RMI TCP Connection(7)-172.17.0.2> tid=0x20] Executing command: list members

Command result for <list members>:
  Name    | Id
--------- | --------------------------------------------------------------
locator-0 | 172.17.0.2(locator-0:1011:locator)<ec><v0>:41001 [Coordinator]
server-2  | 172.17.0.2(server-2:249)<v1>:41003
server-3  | 172.17.0.2(server-3:255)<v1>:41004
{noformat}
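
To get a feel for the timing, the intervals between a few of the log entries above can be computed from their timestamps. The sketch below is only an illustration: the class and method names (LogTimeline, millisBetween) are made up, and the timestamp pattern is an assumption inferred from how the log lines are printed.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper for illustration only; not part of Geode.
class LogTimeline {

  // Assumed pattern for timestamps such as "2019/04/12 19:30:45.716".
  private static final DateTimeFormatter FORMAT =
      DateTimeFormatter.ofPattern("uuuu/MM/dd HH:mm:ss.SSS");

  /** Milliseconds elapsed between two log timestamps in the assumed format. */
  static long millisBetween(String from, String to) {
    LocalDateTime start = LocalDateTime.parse(from, FORMAT);
    LocalDateTime end = LocalDateTime.parse(to, FORMAT);
    return Duration.between(start, end).toMillis();
  }

  public static void main(String[] args) {
    String locatorDisconnected = "2019/04/12 19:30:45.716"; // [vm0] playDead invoked
    String server1Coordinator = "2019/04/12 19:30:50.776";  // [vm1] becoming coordinator
    String server1FinalView = "2019/04/12 19:31:41.970";    // [vm1] sending view |12

    System.out.println("disconnect -> coordinator: "
        + millisBetween(locatorDisconnected, server1Coordinator) + " ms");
    System.out.println("coordinator -> view |12:   "
        + millisBetween(server1Coordinator, server1FinalView) + " ms");
  }
}
```

With these inputs, server-1 becomes coordinator about 5 seconds after the locator's forced disconnect, and about 51 seconds pass before it sends the view that contains only itself.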


> CI: org.apache.geode.management.internal.configuration.ClusterConfigLocatorRestartDUnitTest > serverRestartsAfterLocatorReconnects FAILED
> -----------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: GEODE-6646
>                 URL: https://issues.apache.org/jira/browse/GEODE-6646
>             Project: Geode
>          Issue Type: Bug
>          Components: gfsh, membership
>    Affects Versions: 1.10.0
>            Reporter: Shelley Lynn Hughes-Godfrey
>            Priority: Major
>              Labels: CI
>
> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-develop-main/jobs/DistributedTestOpenJDK8/builds/617
> {noformat}
> org.apache.geode.management.internal.configuration.ClusterConfigLocatorRestartDUnitTest > serverRestartsAfterLocatorReconnects FAILED
>     org.awaitility.core.ConditionTimeoutException: Assertion condition defined as a lambda expression in org.apache.geode.management.internal.configuration.ClusterConfigLocatorRestartDUnitTest
>     Expecting:
>       <["locator-0", "server-2", "server-3"]>
>     to contain only:
>       <["locator-0", "server-1", "server-2", "server-3"]>
>     but could not find the following elements:
>       <["server-1"]>
>      within 300 seconds.
>         at org.awaitility.core.ConditionAwaiter.await(ConditionAwaiter.java:145)
>         at org.awaitility.core.AssertionCondition.await(AssertionCondition.java:122)
>         at org.awaitility.core.AssertionCondition.await(AssertionCondition.java:32)
>         at org.awaitility.core.ConditionFactory.until(ConditionFactory.java:902)
>         at org.awaitility.core.ConditionFactory.untilAsserted(ConditionFactory.java:723)
>         at org.apache.geode.management.internal.configuration.ClusterConfigLocatorRestartDUnitTest.serverRestartsAfterLocatorReconnects(ClusterConfigLocatorRestartDUnitTest.java:81)
>         Caused by:
>         java.lang.AssertionError:
>         Expecting:
>           <["locator-0", "server-2", "server-3"]>
>         to contain only:
>           <["locator-0", "server-1", "server-2", "server-3"]>
>         but could not find the following elements:
>           <["server-1"]>
>             at org.apache.geode.test.junit.assertions.CommandResultAssert.tableHasColumnOnlyWithValues(CommandResultAssert.java:308)
>             at org.apache.geode.management.internal.configuration.ClusterConfigLocatorRestartDUnitTest.lambda$serverRestartsAfterLocatorReconnects$0(ClusterConfigLocatorRestartDUnitTest.java:82)
> {noformat}
> Artifacts available here:
> {noformat}
> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=  Test Results URI =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
> http://files.apachegeode-ci.info/builds/apache-develop-main/1.10.0-SNAPSHOT.0177/test-results/distributedTest/1555101232/
> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
> Test report artifacts from this job are available at:
> http://files.apachegeode-ci.info/builds/apache-develop-main/1.10.0-SNAPSHOT.0177/test-artifacts/1555101232/distributedtestfiles-OpenJDK8-1.10.0-SNAPSHOT.0177.tgz
> {noformat}


