Hello,
There is one use case which can seriously affect High Availability of Geode.
The topology is 2 hosts, with 1 locator and 1 GF server on each host, and
"enable-cluster-configuration=true" is set.
Here is the flow:
1) GF cluster was up and running;
2) host1 was brought down due to VM issues. As a result, locator1 and server1
were down;
3) then host2 was brought down due to VM issues. As a result, locator2 and
server2 were down;
4) host1 was brought back to life, but locator1 started with the following
message:
"Cluster configuration service is waiting for other locators with newer shared
configuration data.
This locator might have stale cluster configuration data.
Following locators contain potentially newer cluster configuration data"
server1 tried to join locator1 and exited with the following error in
cacheserver.log:
[error 2017/11/28 17:44:34.417 MSK host1-server-1 <main> tid=0x1] org.apache.geode.GemFireConfigException: cluster configuration service not available
[severe 2017/11/28 17:44:34.428 MSK host1-server-1 <main> tid=0x1] Cache server error
org.apache.geode.GemFireConfigException: cluster configuration service not available
    at org.apache.geode.internal.cache.GemFireCacheImpl.requestSharedConfiguration(GemFireCacheImpl.java:1058)
    at org.apache.geode.internal.cache.GemFireCacheImpl.<init>(GemFireCacheImpl.java:817)
    ...
Caused by: org.apache.geode.internal.process.ClusterConfigurationNotAvailableException: Unable to retrieve cluster configuration from the locator.
    at org.apache.geode.internal.cache.ClusterConfigurationLoader.requestConfigurationFromLocators(ClusterConfigurationLoader.java:257)
    at org.apache.geode.internal.cache.GemFireCacheImpl.requestSharedConfiguration(GemFireCacheImpl.java:1021)
    ... 8 more
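To make it easier to confirm both symptoms after a restart, here is a minimal shell sketch that scans a log file for the two messages quoted above. The function name check_symptoms and the output labels are hypothetical, chosen for illustration. The log path is passed in by the caller, since only cacheserver.log is named in this report and the locator log file name is an assumption you will need to fill in.

```shell
#!/bin/sh
# Sketch only: report which of the two startup symptoms from this report
# appear in a given log file. The search strings are copied from the
# messages quoted above; the labels printed here are made up for illustration.
check_symptoms() {
    log="$1"
    # Message printed by locator1 when it suspects its configuration is stale.
    if grep -q "This locator might have stale cluster configuration data" "$log"; then
        echo "stale-locator-config"
    fi
    # Error logged by server1 before it exits.
    if grep -q "cluster configuration service not available" "$log"; then
        echo "cluster-config-unavailable"
    fi
}
```

For example, `check_symptoms cacheserver.log` on host1 would be expected to print cluster-config-unavailable after step 4 of the flow.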
Since GemFire provides HA, how can we bring the first half of the cluster
(locator1 and server1 on host1) back to life?
Suppose host2 stays down for 1 week and we have to operate during that time.
How can the second half of the cluster (locator2 and server2 on host2) be
joined later?
How to reproduce this issue (the archives below are attached to
https://issues.apache.org/jira/browse/GEODE-3003 ):
https://issues.apache.org/jira/secure/attachment/12870290/geode-host1.zip
https://issues.apache.org/jira/secure/attachment/12870291/geode-host2.zip
1) extract geode-host1.zip to host1, geode-host2.zip to host2
2) in start-locator.sh, adjust the locator IPs to your values:
--locators=10.50.3.38[20236],10.50.3.14[20236] \
3) run start-locator.sh on host1
4) run start-locator.sh on host2
5) run start-server.sh on host1
6) run start-server.sh on host2, then check that there are 4 members via
gfsh "list members"; everything is fine at this point
7) kill locator-PID server-PID on host1
8) kill locator-PID server-PID on host2
9) run start-locator.sh on host1 - observe the "stale cluster configuration
data" message
10) run start-server.sh on host1 - observe "cluster configuration service not
available" and server exit
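For the member check in step 6, the following shell sketch splits a --locators value of the form shown in step 2 into host and port pairs and prints a gfsh command line for each locator. The function names split_locators and print_member_checks are hypothetical helpers, not part of the attached scripts. The commands are only printed, not executed, and they assume gfsh is on the PATH when you run them yourself.

```shell
#!/bin/sh
# Sketch only: split a --locators value such as
# "10.50.3.38[20236],10.50.3.14[20236]" into one "host port" pair per line.
# Entries that do not match host[port] are silently skipped.
split_locators() {
    printf '%s\n' "$1" | tr ',' '\n' |
        sed -n 's/^\([^[]*\)\[\([0-9][0-9]*\)]$/\1 \2/p'
}

# Print (do not run) one gfsh invocation per locator that connects to it
# and lists the cluster members it knows about.
print_member_checks() {
    split_locators "$1" | while read -r host port; do
        printf 'gfsh -e "connect --locator=%s[%s]" -e "list members"\n' "$host" "$port"
    done
}

# Example with the locator list from step 2; replace with your own values.
print_member_checks "10.50.3.38[20236],10.50.3.14[20236]"
```

Running the printed command against each locator after step 6 should show the same 4 members; after step 9 it can be used to see what locator1 reports on its own.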
Anton Mironenko
Software Architect
Amdocs ASP team