Thanks for the quick responses!
“Is it possible to share the DiskStoreIDs getting printed across the nodes?
This will help us to see the dependencies between the nodes/diskstores.”
== Node A:
Region /Configuration has potentially stale data. It is waiting for another
member to recover the latest data.
My persistent id:
DiskStore ID: a7ba00dd-b21e-4ee7-a06b-10f93f593180
Name: XXX
Location: XXX/Configuration
Members with potentially new data:
[
DiskStore ID: b4b3f9e3-3abc-4ad1-8871-6c6d61df027f
Name:
Location: XXX/Configuration
]
== Node B:
Region /PdxTypes has potentially stale data. It is waiting for another member
to recover the latest data.
My persistent id:
DiskStore ID: ca383b42-b7aa-44c1-986e-1af2433fc3c2
Name:
Location: XXX/PDX
Members with potentially new data:
[
DiskStore ID: e4e53819-35a1-4f14-8731-79a2f6ff
Name: XXX
Location: XXX/PDX
]
The nodes wait for different diskstores. Does the order in which the diskstores
are defined in the application play a role?
“Also, you don't need to start a locator for gfsh, you could connect to the
running locator (part of the cluster); unless the JMX-manager on that locator
is turned-off.”
As far as we understand, gfsh cannot use that JMX manager unless the Spring
Shell library is on the application’s classpath (please correct me if I’m
wrong), and we are running an OSGi application into which we cannot easily
integrate that dependency.
“As long as you start each member of your cluster, in parallel, they should
work out amongst themselves who has the latest copy of the data.”
That’s what we thought, too. However, this does not seem to be the case when
the VM is abruptly “powered off”. In all other cases, where we shut the process
down gracefully or even kill -9ed it, this never happened. The deadlock also
occurs when we start all nodes at once.
Best
Daniel
From: Anilkumar Gingade [mailto:aging...@pivotal.io]
Sent: Thursday, February 22, 2018 20:47
To: user@geode.apache.org
Subject: Re: Geode: Deadlock situation upon startup
Is it possible to share the DiskStoreIDs getting printed across the nodes?
This will help us to see the dependencies between the nodes/diskstores.
Also, you don't need to start a locator for gfsh, you could connect to the
running locator (part of the cluster); unless the JMX-manager on that locator
is turned-off.
-Anil.
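For reference, connecting gfsh to an already-running locator (rather than
starting a dedicated one) would look roughly like the session below. The
hostname and ports are placeholders for your environment; 10334 is the default
locator port and assumes the JMX manager on that locator is enabled:

```shell
# Start gfsh and connect through the running cluster locator
# (hostname/port below are placeholders, not taken from this thread).
gfsh
gfsh> connect --locator=locator-host[10334]

# Alternatively, connect straight to the JMX manager endpoint if known:
gfsh> connect --jmx-manager=locator-host[1099]
```

Once connected this way, cluster commands such as `list members` or
`show missing-disk-stores` run against the live cluster without any extra
locator process.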
On Thu, Feb 22, 2018 at 11:14 AM, Darrel Schneider
<dschnei...@pivotal.io> wrote:
As long as you start each member of your cluster, in parallel, they should work
out amongst themselves who has the latest copy of the data.
You should not need to revoke disk stores that you still have. Since you are
only using replicates, your temporary solution is safe as long as you pick the
last one to write data as the winner.
If you had partitioned regions then it would not be safe to get rid of all disk
stores except one.
This issue may have been fixed. You are using 1.0.0-incubating. Have you
considered upgrading to 1.4?
On Thu, Feb 22, 2018 at 2:08 AM, Daniel Kojic
<daniel.ko...@ispin.ch> wrote:
Hi there
Our setup:
We have a multi-node clustered Java application running in an ESXi environment.
Each cluster node embeds Geode via Spring Data for Apache Geode and has its own
locator. Multiple replicated regions are shared among the nodes, and each node
has its own disk store.
* Java runtime version: 1.8.0_151
* Geode version: 1.0.0-incubating
* Spring Data Geode version: 1.0.0.INCUBATING-RELEASE
* Spring Data version: 1.12.1.RELEASE
Our problem:
We had situations that caused our Geode processes to quit abruptly, e.g.:
* the VM being powered off abruptly (no guest shutdown) or...
* ...CPU freezes caused by I/O degradation.
After restarting the cluster nodes (one after another or all at once), the
Geode logs on all nodes show the following:
Region /XXX has potentially stale data. It is waiting for another
member to recover the latest data.
My persistent id:
DiskStore ID: XXX
Name:
Location: /XXX
Members with potentially new data:
[
DiskStore ID: XXX
Name: XXX
Location: /XXX
]
The problem, however, is that each node waits for the other nodes to join even
though they are already started. No combination of starting/stopping the nodes
shown as "missing" seems to resolve this.
Our temporary solution:
We managed to "recover" from such a deadlock using gfsh:
* Revoke the missing disk stores of all nodes except for one "chosen" node
(preferably the last one that was running).
* Delete those disk stores.
* Restart the nodes.
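The recovery steps above can be sketched as a gfsh session. The disk-store ID
below is a placeholder; in practice `show missing-disk-stores` lists the actual
IDs to revoke, and the revoked stores' files are then deleted on disk before
restarting:

```shell
# List the disk stores the cluster members are currently waiting for.
gfsh> show missing-disk-stores

# Revoke every missing disk store except the one belonging to the
# "chosen" node (the UUID below is a placeholder from that listing).
gfsh> revoke missing-disk-store --id=b4b3f9e3-3abc-4ad1-8871-6c6d61df027f

# Afterwards: remove the revoked disk-store directories on the affected
# nodes' filesystems, then restart the nodes.
```

As Darrel notes above, this is only safe with replicated regions, where any
surviving member holds a full copy of the data.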
As of today, we cannot easily add the "Spring Shell" dependency to our
application, which is why we have to run gfsh with its own locator. This
requires us to define such a "gfsh locator" on all of our cluster nodes.
What we're looking for:
Our temporary solution comes