[geode] 01/01: GEODE-3780 suspected member is never watched again after passing final check

2019-10-04 Thread bschuchardt
This is an automated email from the ASF dual-hosted git repository.

bschuchardt pushed a commit to branch feature/GEODE-3780
in repository https://gitbox.apache.org/repos/asf/geode.git

commit 977aaf2fcc5c0f74a38742aa51e0feeb5296472d
Author: Bruce Schuchardt 
AuthorDate: Fri Oct 4 10:19:49 2019 -0700

GEODE-3780 suspected member is never watched again after passing final check

This restores our original behavior (pre-1.0) behavior of performing a
final check on a member if UDP communications with that member fail.

We now also send exonoration messages to all other members if a suspect is
cleared.  We need to do that because another member may have sent a
Suspect message that was ignored because the suspect was already
undergoing a final check.

I also noticed that our tcp/ip final check loop was performing more than
one check in many cases because the nanosecond clock has a coarse
granularity.  A socket so-timeout based on the millisecond clock was
timing out but the nanosecond clock didn't line up with that timeout and
caused the "for" loop to make another attempt.  I changed that loop to
convert the nanosecond clock value to milliseconds.
---
 ...omcatSessionBackwardsCompatibilityTestBase.java |   8 +-
 .../membership/gms/fd/GMSHealthMonitor.java| 148 +
 .../membership/gms/messenger/JGroupsMessenger.java |  16 ++-
 .../geode/test/dunit/internal/ProcessManager.java  |  13 +-
 4 files changed, 106 insertions(+), 79 deletions(-)

diff --git 
a/geode-assembly/src/upgradeTest/java/org/apache/geode/session/tests/TomcatSessionBackwardsCompatibilityTestBase.java
 
b/geode-assembly/src/upgradeTest/java/org/apache/geode/session/tests/TomcatSessionBackwardsCompatibilityTestBase.java
index f4b8272..e629026 100644
--- 
a/geode-assembly/src/upgradeTest/java/org/apache/geode/session/tests/TomcatSessionBackwardsCompatibilityTestBase.java
+++ 
b/geode-assembly/src/upgradeTest/java/org/apache/geode/session/tests/TomcatSessionBackwardsCompatibilityTestBase.java
@@ -88,13 +88,7 @@ public abstract class 
TomcatSessionBackwardsCompatibilityTestBase {
 
   protected TomcatSessionBackwardsCompatibilityTestBase(String version) {
 VersionManager versionManager = VersionManager.getInstance();
-String installLocation = null;
-try {
-  installLocation = versionManager.getInstall(version);
-} finally {
-  System.out.println(
-  "BRUCE: for version " + version + " the installation directory is " 
+ installLocation);
-}
+String installLocation = installLocation = 
versionManager.getInstall(version);
 oldBuild = new File(installLocation);
 oldModules = new File(installLocation + "/tools/Modules/");
   }
diff --git 
a/geode-core/src/main/java/org/apache/geode/distributed/internal/membership/gms/fd/GMSHealthMonitor.java
 
b/geode-core/src/main/java/org/apache/geode/distributed/internal/membership/gms/fd/GMSHealthMonitor.java
index 9e90316..41a82b6 100644
--- 
a/geode-core/src/main/java/org/apache/geode/distributed/internal/membership/gms/fd/GMSHealthMonitor.java
+++ 
b/geode-core/src/main/java/org/apache/geode/distributed/internal/membership/gms/fd/GMSHealthMonitor.java
@@ -244,56 +244,63 @@ public class GMSHealthMonitor implements HealthMonitor {
 @Override
 public void run() {
 
-  if (GMSHealthMonitor.this.isStopping) {
-return;
+  GMSMember neighbor = nextNeighbor;
+  if (logger.isDebugEnabled()) {
+logger.debug("cluster health monitor invoked with {}", neighbor);
   }
+  try {
+if (GMSHealthMonitor.this.isStopping) {
+  return;
+}
 
-  GMSMember neighbour = nextNeighbor;
-
-  long currentTime = System.currentTimeMillis();
-  // this is the start of interval to record member activity
-  GMSHealthMonitor.this.currentTimeStamp = currentTime;
-
-
-  long oldTimeStamp = currentTimeStamp;
-  currentTimeStamp = System.currentTimeMillis();
-
-  GMSMembershipView myView = GMSHealthMonitor.this.currentView;
-  if (myView == null) {
-return;
-  }
+long currentTime = System.currentTimeMillis();
+// this is the start of interval to record member activity
+GMSHealthMonitor.this.currentTimeStamp = currentTime;
 
-  if (currentTimeStamp - oldTimeStamp > monitorInterval + 
MONITOR_DELAY_THRESHOLD) {
-// delay in running this task - don't suspect anyone for a while
-logger.info(
-"Failure detector has noticed a JVM pause and is giving all 
members a heartbeat in view {}",
-currentView);
-for (GMSMember member : myView.getMembers()) {
-  contactedBy(member);
-}
-return;
-  }
+long oldTimeStamp = currentTimeStamp;
+currentTimeStamp = System.currentTimeMillis();
 
-  if (neighbour != null) {
-TimeStamp nextNeighborTS;
-synchronized (GMSHealthMonitor.this) {
-

[geode] 01/01: GEODE-3780 suspected member is never watched again after passing final check (#3917)

2019-08-19 Thread bschuchardt
This is an automated email from the ASF dual-hosted git repository.

bschuchardt pushed a commit to branch feature/merge_geode_3780
in repository https://gitbox.apache.org/repos/asf/geode.git

commit 9975d1e10a905b040edeefa0ecb2210d1a1c1525
Author: Bruce Schuchardt 
AuthorDate: Thu Aug 15 13:55:46 2019 -0700

GEODE-3780 suspected member is never watched again after passing final 
check (#3917)

* GEODE-3780 suspected member is never watched again after passing final 
check

After passing a "final check" a member will be subject to suspect
processing again but we weren't processing the suspect message locally.
This caused JoinLeave to never be notified of the suspect so that
removal could be initiated.

I also noticed that a method in HealthMonitor was misnamed.  It claimed
to return the set of members that had failed availability checks but
instead it was returning a set of members currently under suspicion.  I
renamed the method for clarity.

* empty commit

* removing getSuspectMembers - it could kick out a suspect member too easily

* removing unused method and commented-out code

* revising test

(cherry picked from commit 8e9b04470264983d0aa1c7900f6e9be2374549d9)
---
 .../gms/fd/GMSHealthMonitorJUnitTest.java  | 25 
 .../gms/membership/GMSJoinLeaveJUnitTest.java  | 44 --
 .../membership/gms/fd/GMSHealthMonitor.java|  8 ++--
 .../membership/gms/interfaces/HealthMonitor.java   |  6 ---
 .../membership/gms/membership/GMSJoinLeave.java| 20 +-
 5 files changed, 70 insertions(+), 33 deletions(-)

diff --git 
a/geode-core/src/integrationTest/java/org/apache/geode/distributed/internal/membership/gms/fd/GMSHealthMonitorJUnitTest.java
 
b/geode-core/src/integrationTest/java/org/apache/geode/distributed/internal/membership/gms/fd/GMSHealthMonitorJUnitTest.java
index 2fb067f..a750fa9 100644
--- 
a/geode-core/src/integrationTest/java/org/apache/geode/distributed/internal/membership/gms/fd/GMSHealthMonitorJUnitTest.java
+++ 
b/geode-core/src/integrationTest/java/org/apache/geode/distributed/internal/membership/gms/fd/GMSHealthMonitorJUnitTest.java
@@ -500,6 +500,31 @@ public class GMSHealthMonitorJUnitTest {
 }
   }
 
+  @Test
+  public void testMemberIsExaminedAgainAfterPassingAvailabilityCheck() {
+// use the test health monitor's availability check for the first round of 
suspect processing
+// but then turn it off so that a subsequent round is performed and fails 
to get a heartbeat
+useGMSHealthMonitorTestClass = true;
+
+try {
+  NetView v = installAView();
+
+  setFailureDetectionPorts(v);
+
+  InternalDistributedMember memberToCheck = mockMembers.get(1);
+
+  boolean retVal = gmsHealthMonitor.checkIfAvailable(memberToCheck, "Not 
responding", true);
+  assertTrue("CheckIfAvailable should have return true", retVal);
+
+  // memberToCheck should be suspected again since it's not sending 
heartbeats and then
+  // it should fail an availability check and cause removal of the member
+  useGMSHealthMonitorTestClass = false;
+  await().untilAsserted(() -> verify(joinLeave, 
atLeastOnce()).remove(memberToCheck,
+  "Member isn't responding to heartbeat requests"));
+} finally {
+  useGMSHealthMonitorTestClass = false;
+}
+  }
 
   @Test
   public void testNeighborRemainsSameAfterSuccessfulFinalCheck() {
diff --git 
a/geode-core/src/integrationTest/java/org/apache/geode/distributed/internal/membership/gms/membership/GMSJoinLeaveJUnitTest.java
 
b/geode-core/src/integrationTest/java/org/apache/geode/distributed/internal/membership/gms/membership/GMSJoinLeaveJUnitTest.java
index cb605ea..2fe1a3e 100644
--- 
a/geode-core/src/integrationTest/java/org/apache/geode/distributed/internal/membership/gms/membership/GMSJoinLeaveJUnitTest.java
+++ 
b/geode-core/src/integrationTest/java/org/apache/geode/distributed/internal/membership/gms/membership/GMSJoinLeaveJUnitTest.java
@@ -716,12 +716,13 @@ public class GMSJoinLeaveJUnitTest {
 D = mockMembers[2],
 E = mockMembers[3];
 prepareAndInstallView(C, createMemberList(A, B, C, D, E));
-
when(healthMonitor.getMembersFailingAvailabilityCheck()).thenReturn(Collections.singleton(A));
 LeaveRequestMessage msg = new LeaveRequestMessage(B, C, "leaving for 
test");
 msg.setSender(C);
 gmsJoinLeave.processMessage(msg);
+RemoveMemberMessage removeMemberMessage = new RemoveMemberMessage(B, A, 
"removing for test");
+removeMemberMessage.setSender(B);
+gmsJoinLeave.processMessage(removeMemberMessage);
 assertTrue("Expected becomeCoordinator to be invoked", 
gmsJoinLeave.isCoordinator());
-await().until(() -> gmsJoinLeave.getView().size() == 1);
   }
 
   /**
@@ -738,18 +739,39 @@ public class GMSJoinLeaveJUnitTest {
 C = mockMembers[1],
 D = mockMembers[2],
 E = mockMembers[3];
+
 

[geode] 01/01: GEODE-3780 suspected member is never watched again after passing final check

2019-08-13 Thread bschuchardt
This is an automated email from the ASF dual-hosted git repository.

bschuchardt pushed a commit to branch feature/GEODE-3780
in repository https://gitbox.apache.org/repos/asf/geode.git

commit 50bcc90c779c5d74c20b75d84ebc4522dd421caa
Author: Bruce Schuchardt 
AuthorDate: Tue Aug 13 09:47:52 2019 -0700

GEODE-3780 suspected member is never watched again after passing final check

After passing a "final check" a member will be subject to suspect
processing again but we weren't processing the suspect message locally.
This caused JoinLeave to never be notified of the suspect so that
removal could be initiated.

I also noticed that a method in HealthMonitor was misnamed.  It claimed
to return the set of members that had failed availability checks but
instead it was returning a set of members currently under suspicion.  I
renamed the method for clarity.
---
 .../gms/fd/GMSHealthMonitorJUnitTest.java  | 25 ++
 .../gms/membership/GMSJoinLeaveJUnitTest.java  |  4 ++--
 .../membership/gms/fd/GMSHealthMonitor.java|  9 +---
 .../membership/gms/interfaces/HealthMonitor.java   |  4 ++--
 .../membership/gms/membership/GMSJoinLeave.java|  2 +-
 5 files changed, 36 insertions(+), 8 deletions(-)

diff --git 
a/geode-core/src/integrationTest/java/org/apache/geode/distributed/internal/membership/gms/fd/GMSHealthMonitorJUnitTest.java
 
b/geode-core/src/integrationTest/java/org/apache/geode/distributed/internal/membership/gms/fd/GMSHealthMonitorJUnitTest.java
index 264ac62..59d2c0e 100644
--- 
a/geode-core/src/integrationTest/java/org/apache/geode/distributed/internal/membership/gms/fd/GMSHealthMonitorJUnitTest.java
+++ 
b/geode-core/src/integrationTest/java/org/apache/geode/distributed/internal/membership/gms/fd/GMSHealthMonitorJUnitTest.java
@@ -498,6 +498,31 @@ public class GMSHealthMonitorJUnitTest {
 }
   }
 
+  @Test
+  public void testMemberIsExaminedAgainAfterPassingAvailabilityCheck() {
+// use the test health monitor's availability check for the first round of 
suspect processing
+// but then turn it off so that a subsequent round is performed and fails 
to get a heartbeat
+useGMSHealthMonitorTestClass = true;
+
+try {
+  GMSMembershipView v = installAView();
+
+  setFailureDetectionPorts(v);
+
+  GMSMember memberToCheck = mockMembers.get(1);
+
+  boolean retVal = gmsHealthMonitor.checkIfAvailable(memberToCheck, "Not 
responding", true);
+  assertTrue("CheckIfAvailable should have return true", retVal);
+
+  // memberToCheck should be suspected again since it's not sending 
heartbeats and then
+  // it should fail an availability check and cause removal of the member
+  useGMSHealthMonitorTestClass = false;
+  await().untilAsserted(() -> verify(joinLeave, 
atLeastOnce()).remove(memberToCheck,
+  "Member isn't responding to heartbeat requests"));
+} finally {
+  useGMSHealthMonitorTestClass = false;
+}
+  }
 
   @Test
   public void testNeighborRemainsSameAfterSuccessfulFinalCheck() {
diff --git 
a/geode-core/src/integrationTest/java/org/apache/geode/distributed/internal/membership/gms/membership/GMSJoinLeaveJUnitTest.java
 
b/geode-core/src/integrationTest/java/org/apache/geode/distributed/internal/membership/gms/membership/GMSJoinLeaveJUnitTest.java
index 8f18895..dcefbf3 100644
--- 
a/geode-core/src/integrationTest/java/org/apache/geode/distributed/internal/membership/gms/membership/GMSJoinLeaveJUnitTest.java
+++ 
b/geode-core/src/integrationTest/java/org/apache/geode/distributed/internal/membership/gms/membership/GMSJoinLeaveJUnitTest.java
@@ -707,7 +707,7 @@ public class GMSJoinLeaveJUnitTest {
 D = mockMembers[2],
 E = mockMembers[3];
 prepareAndInstallView(C, createMemberList(A, B, C, D, E));
-
when(healthMonitor.getMembersFailingAvailabilityCheck()).thenReturn(Collections.singleton(A));
+
when(healthMonitor.getSuspectedMembers()).thenReturn(Collections.singleton(A));
 LeaveRequestMessage msg = new LeaveRequestMessage(B, C, "leaving for 
test");
 msg.setSender(C);
 gmsJoinLeave.processMessage(msg);
@@ -730,7 +730,7 @@ public class GMSJoinLeaveJUnitTest {
 D = mockMembers[2],
 E = mockMembers[3];
 prepareAndInstallView(C, createMemberList(A, B, C, D));
-
when(healthMonitor.getMembersFailingAvailabilityCheck()).thenReturn(Collections.singleton(A));
+
when(healthMonitor.getSuspectedMembers()).thenReturn(Collections.singleton(A));
 E.setVmViewId(1);
 gmsJoinLeave.processMessage(new JoinRequestMessage(B, E, null, 1, 1));
 LeaveRequestMessage msg = new LeaveRequestMessage(B, C, "leaving for 
test");
diff --git 
a/geode-core/src/main/java/org/apache/geode/distributed/internal/membership/gms/fd/GMSHealthMonitor.java
 
b/geode-core/src/main/java/org/apache/geode/distributed/internal/membership/gms/fd/GMSHealthMonitor.java
index d085cd2..36827e3 100644
---