Hi there
Our setup:
We have a multi-node clustered Java application running in an ESXi environment.
Each cluster node embeds Geode via Spring Data for Apache Geode and runs its
own locator. Multiple replicated regions are shared among the nodes, and each
node has its own disk store.
* Java runtime version: 1.8.0_151
* Geode version: 1.0.0-incubating
* Spring Data Geode version: 1.0.0.INCUBATING-RELEASE
* Spring Data version: 1.12.1.RELEASE
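For reference, this is roughly how one of our replicated, persistent regions is
set up. The sketch below uses the plain Geode API for brevity (our actual wiring
goes through Spring Data for Apache Geode); all names, hosts, ports, and paths
are placeholders:

import java.io.File;

import org.apache.geode.cache.Cache;
import org.apache.geode.cache.CacheFactory;
import org.apache.geode.cache.Region;
import org.apache.geode.cache.RegionShortcut;

public class ExampleGeodeConfig {

    public static Region<String, Object> createExampleRegion() {
        // Embedded cache; each node also starts its own locator and lists the
        // locators of the other nodes (addresses are placeholders).
        Cache cache = new CacheFactory()
            .set("locators", "node-a[10334],node-b[10334],node-c[10334]")
            .set("start-locator", "localhost[10334]")
            .create();

        // Node-local disk store backing the persistent region.
        cache.createDiskStoreFactory()
            .setDiskDirs(new File[] { new File("/var/data/geode/exampleDiskStore") })
            .create("exampleDiskStore");

        // Replicated region persisted to the disk store above.
        return cache.<String, Object>createRegionFactory(RegionShortcut.REPLICATE_PERSISTENT)
            .setDiskStoreName("exampleDiskStore")
            .create("exampleRegion");
    }
}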
Our problem:
We had situations that caused our Geode processes to quit abruptly, e.g.:
* VM being abruptly powered off (no guest shutdown) or...
* ...CPU freezes caused by I/O degradation.
After restarting the cluster nodes (one after another or all at once), the
Geode logs on all nodes show the following:
Region /XXX has potentially stale data. It is waiting for another
member to recover the latest data.
My persistent id:
DiskStore ID: XXX
Name:
Location: /XXX
Members with potentially new data:
[
DiskStore ID: XXX
Name: XXX
Location: /XXX
]
The problem, however, is that each node waits for the other nodes to join even
though they are already started. No combination of starting/stopping the nodes
shown as "missing" seems to resolve the situation.
Our temporary solution:
We managed to "recover" from such a deadlock using gfsh:
* Revoke all missing disk stores except for one "chosen" (preferably the last
running) node.
* Delete those disk stores.
* Restart the nodes.
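Roughly, the gfsh session for the revoke step looks like this (the locator
address is a placeholder, and the disk store IDs are taken from the output of
show missing-disk-stores):

gfsh> connect --locator=localhost[10335]
gfsh> show missing-disk-stores
gfsh> revoke missing-disk-store --id=<disk-store-id-of-a-node-to-revoke>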
As of today we're not able to easily add the "Spring Shell" dependency to our
application, which is why we have to run gfsh with its own locator. This
requires us to configure such a "gfsh locator" on all of our cluster nodes.
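Concretely, that means starting something like the following on one host (name
and port are placeholders) and listing its address in the locators property of
every cluster node:

gfsh> start locator --name=gfsh-locator --port=10335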
What we're looking for:
Our temporary solution comes with some flaws: we're dependent on the gfsh
tooling with its own locator, and manual intervention is required. Getting the
cluster up and running again is complicated from an admin perspective. Is there
any way to detect and handle such a deadlock situation from within the
application? Are there any best practices you could recommend?
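For illustration, this is roughly the kind of programmatic handling we have in
mind. It is only a sketch: we don't know whether this is supported or safe, and
it assumes the member acts as a JMX manager so that the DistributedSystemMXBean
is available locally:

import org.apache.geode.cache.Cache;
import org.apache.geode.management.DistributedSystemMXBean;
import org.apache.geode.management.ManagementService;
import org.apache.geode.management.PersistentMemberDetails;

public class MissingDiskStoreHandler {

    // Sketch: detect the disk stores the cluster is waiting for and, after
    // some (yet to be defined) policy decision, revoke them so the remaining
    // members can finish recovery.
    public static void revokeMissingDiskStores(Cache cache) {
        DistributedSystemMXBean dsBean =
            ManagementService.getManagementService(cache).getDistributedSystemMXBean();
        if (dsBean == null) {
            return; // only available on the member acting as JMX manager
        }

        PersistentMemberDetails[] missing = dsBean.listMissingDiskStores();
        if (missing == null) {
            return; // nothing reported as missing
        }

        for (PersistentMemberDetails details : missing) {
            // Deciding *which* disk stores are safe to revoke is exactly the
            // part we are unsure about.
            dsBean.revokeMissingDiskStores(details.getDiskStoreId());
        }
    }
}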
Thanks in advance for your help!
Best
Daniel