HA doesn't work for the sequence "locator1-server1 down; locator2-server2 down; locator1-server1 up"

Anton Mironenko Tue, 28 Nov 2017 06:55:16 -0800

Hello,
One use case can seriously affect the High Availability of Geode.
The topology is 2 hosts, with 1 locator and 1 GF server on each host.
"enable-cluster-configuration=true" is used.
Here is the flow:


1) The GF cluster was up and running;
2) host1 was brought down due to VM issues. As a result, locator1 and server1 
went down;
3) then host2 was brought down due to VM issues. As a result, locator2 and 
server2 went down;
4) host1 was brought back to life, but locator1 started with the following 
message:

"Cluster configuration service is waiting for other locators with newer shared 
configuration data.
This locator might have stale cluster configuration data.
Following locators contain potentially newer cluster configuration data"

server1 tried to join locator1 and exited with the following error in 
cacheserver.log:

[error 2017/11/28 17:44:34.417 MSK host1-server-1 <main> tid=0x1] org.apache.geode.GemFireConfigException: cluster configuration service not available

[severe 2017/11/28 17:44:34.428 MSK host1-server-1 <main> tid=0x1] Cache server error
org.apache.geode.GemFireConfigException: cluster configuration service not available
        at org.apache.geode.internal.cache.GemFireCacheImpl.requestSharedConfiguration(GemFireCacheImpl.java:1058)
        at org.apache.geode.internal.cache.GemFireCacheImpl.<init>(GemFireCacheImpl.java:817)
...
Caused by: org.apache.geode.internal.process.ClusterConfigurationNotAvailableException: Unable to retrieve cluster configuration from the locator.
        at org.apache.geode.internal.cache.ClusterConfigurationLoader.requestConfigurationFromLocators(ClusterConfigurationLoader.java:257)
        at org.apache.geode.internal.cache.GemFireCacheImpl.requestSharedConfiguration(GemFireCacheImpl.java:1021)
        ... 8 more

Since GemFire provides HA, how can we bring the first half of the cluster on 
host1 (locator1 and server1) back to life?
Let's say host2 will be down for 1 week; during this time we have to operate.
How can the second half of the cluster on host2 (locator2 and server2) then be 
joined?

How to reproduce this issue:
https://issues.apache.org/jira/secure/attachment/12870290/geode-host1.zip
https://issues.apache.org/jira/secure/attachment/12870291/geode-host2.zip
(this is from https://issues.apache.org/jira/browse/GEODE-3003 )

1) extract geode-host1.zip to host1 and geode-host2.zip to host2
2) in start-locator.sh, adjust the locator IPs to your values:
  --locators=10.50.3.38[20236],10.50.3.14[20236] \
3) run start-locator.sh on host1
4) run start-locator.sh on host2
5) run start-server.sh on host1
6) run start-server.sh on host2 - check for 4 members via "gfsh list members"; 
everything is fine here
7) kill locator-PID server-PID on host1
8) kill locator-PID server-PID on host2
9) run start-locator.sh on host1 - observe the "stale cluster configuration 
data" message
10) run start-server.sh on host1 - observe "cluster configuration service not 
available" and server exit
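
To confirm steps 9 and 10 without reading the logs by hand, something like the 
following could be used. It is only a minimal sketch: the function name and the 
symptom labels are made up for illustration, and the only strings it relies on 
are the two messages quoted in this report. Whitespace is collapsed before 
matching so that a message wrapped across lines is still found.

```python
import re
import sys

# The two messages quoted in this report. The dictionary keys are
# hypothetical labels chosen for this sketch, not Geode identifiers.
SYMPTOMS = {
    "stale_locator": "This locator might have stale cluster configuration data",
    "server_no_cluster_config": "cluster configuration service not available",
}


def find_symptoms(log_text):
    """Return the sorted labels of all symptoms whose message occurs in log_text."""
    # Collapse runs of whitespace (including newlines) and ignore case,
    # so wrapped log lines still match.
    normalized = re.sub(r"\s+", " ", log_text).lower()
    return sorted(
        label for label, message in SYMPTOMS.items()
        if message.lower() in normalized
    )


if __name__ == "__main__":
    # Usage: pass one or more log file paths on the command line.
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8", errors="replace") as handle:
            print(path, find_symptoms(handle.read()))
```

Saved under a name of your choice, it can be run against cacheserver.log on 
host1 after step 10, and against the locator log after step 9.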

Anton Mironenko
Software Architect
Amdocs ASP team

