Is it possible to share the DiskStore IDs that are getting printed across the nodes? That would help us see the dependencies between the nodes/disk stores.
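
For reference, one way to collect those IDs in one place (instead of reading each node's log) is gfsh's missing-disk-stores view; a rough sketch, assuming gfsh can reach one of the cluster's locators (the host/port below is just a placeholder):

    gfsh> connect --locator=node1[10334]
    gfsh> show missing-disk-stores
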
Also, you don't need to start a separate locator for gfsh; you can connect to the running locator that is already part of the cluster, unless the JMX manager on that locator is turned off.

-Anil.

On Thu, Feb 22, 2018 at 11:14 AM, Darrel Schneider <[email protected]> wrote:
> As long as you start each member of your cluster in parallel, they should
> work out amongst themselves who has the latest copy of the data.
> You should not need to revoke disk stores that you still have. Since you
> are only using replicates, your temporary solution is safe as long as you
> pick the last one to write data as the winner.
> If you had partitioned regions, then it would not be safe to get rid of
> all disk stores except one.
>
> This issue may have been fixed. You are using 1.0.0-incubating. Have you
> considered upgrading to 1.4?
>
>
> On Thu, Feb 22, 2018 at 2:08 AM, Daniel Kojic <[email protected]>
> wrote:
>
>> Hi there
>>
>> Our setup:
>> We have a multi-node clustered Java application running in an ESXi
>> environment. Each cluster node has Geode embedded via Spring Data for
>> Apache Geode and has its own locator. Multiple replicated regions are
>> shared among the nodes, and each node has its own disk store.
>> * Java runtime version: 1.8.0_151
>> * Geode version: 1.0.0-incubating
>> * Spring Data Geode version: 1.0.0.INCUBATING-RELEASE
>> * Spring Data version: 1.12.1.RELEASE
>>
>> Our problem:
>> We had a situation that caused our Geode processes to quit abruptly, e.g.
>> * the VM being abruptly powered off (no guest shutdown) or...
>> * ...CPU freezes caused by I/O degradation.
>> After restarting the cluster nodes (one after another or all at once),
>> the Geode logs on all nodes show the following:
>> Region /XXX has potentially stale data. It is waiting for another
>> member to recover the latest data.
>> My persistent id:
>> DiskStore ID: XXX
>> Name:
>> Location: /XXX
>> Members with potentially new data:
>> [
>> DiskStore ID: XXX
>> Name: XXX
>> Location: /XXX
>> ]
>> The problem, however, is that each node is waiting for the other nodes
>> to join although they are already started. Any combination of starting
>> and stopping the nodes that are shown as "missing" doesn't seem to help.
>>
>> Our temporary solution:
>> We managed to "recover" from such a deadlock using gfsh:
>> * Revoke the missing disk stores of all nodes except one "chosen" node
>> (preferably the last one that was running).
>> * Delete those disk stores.
>> * Restart the nodes.
>> As of today we're not able to easily add the "Spring Shell" dependency
>> to our application, which is why we have to run gfsh with its own
>> locator. This requires us to define such a "gfsh locator" on all of our
>> cluster nodes.
>>
>> What we're looking for:
>> Our temporary solution comes with some flaws: we're dependent on the
>> gfsh tooling with its own locator, and manual intervention is required.
>> Getting the cluster up and running again is complicated from an admin
>> perspective. Is there any way to detect/handle such a deadlock situation
>> from within the application? Are there any best practices that you could
>> recommend?
>>
>> Thanks in advance for your help!
>>
>> Best
>> Daniel
>> persistent security in a changing world.
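
Regarding Daniel's question about detecting this from within the application: a minimal sketch along the lines below, assuming the member it runs on is also the JMX manager, should surface the same "missing disk stores" list through the management API. The MBean and accessor names here are from memory and should be verified against the Geode version in use:

    import org.apache.geode.cache.Cache;
    import org.apache.geode.management.DistributedSystemMXBean;
    import org.apache.geode.management.ManagementService;
    import org.apache.geode.management.PersistentMemberDetails;

    public class MissingDiskStoreCheck {

        // Logs the disk stores the cluster is still waiting on.
        // Only the JMX-manager member hosts the distributed-system MBean,
        // so on other members dsBean will be null.
        public static void logMissingDiskStores(Cache cache) {
            DistributedSystemMXBean dsBean = ManagementService
                .getManagementService(cache).getDistributedSystemMXBean();
            if (dsBean == null) {
                return; // not the JMX manager
            }
            PersistentMemberDetails[] missing = dsBean.listMissingDiskStores();
            if (missing == null || missing.length == 0) {
                System.out.println("No missing disk stores reported.");
                return;
            }
            for (PersistentMemberDetails m : missing) {
                System.out.println("Waiting on DiskStore " + m.getDiskStoreId()
                    + " (host " + m.getHost() + ", dir " + m.getDirectory() + ")");
            }
        }
    }

The same MBean also exposes an operation to revoke a missing disk store by ID, but as Darrel points out, revoking should not be needed as long as the members that actually hold the data are all brought back.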
