Is it possible to share the DiskStore IDs that get printed on each of the
nodes? That will help us see the dependencies between the nodes/disk stores.
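
For reference, once gfsh is connected to the cluster, something like the
following should list the missing disk stores for all members in one place
(with their IDs, hosts and directories, if I recall the output correctly):

        gfsh> show missing-disk-stores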

Also, you don't need to start a separate locator for gfsh; you can connect
to the locator that is already running as part of the cluster, unless the
JMX manager on that locator is turned off.
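
A minimal sketch of connecting to a running locator (the host name and port
are placeholders; 10334 is the default locator port):

        gfsh> connect --locator=locator-host[10334]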

-Anil.

On Thu, Feb 22, 2018 at 11:14 AM, Darrel Schneider <[email protected]>
wrote:

> As long as you start each member of your cluster, in parallel, they should
> work out amongst themselves who has the latest copy of the data.
> You should not need to revoke disk stores that you still have. Since you
> are only using replicates, your temporary solution is safe as long as you
> pick the member that last wrote data as the winner.
> If you had partitioned regions then it would not be safe to get rid of all
> disk stores except one.
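
For reference, if a disk store really is gone for good, revoking it from
gfsh looks roughly like this (the ID is a placeholder, taken from the
show missing-disk-stores listing):

        gfsh> revoke missing-disk-store --id=<DiskStore ID>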
>
> This issue may have been fixed. You are using 1.0.0-incubating. Have you
> considered upgrading to 1.4?
>
>
> On Thu, Feb 22, 2018 at 2:08 AM, Daniel Kojic <[email protected]>
> wrote:
>
>> Hi there
>>
>> Our setup:
>> We have a multi-node clustered Java application running in an ESXi
>> environment. Each cluster node has Geode embedded via Spring Data for
>> Apache Geode and has its own locator. Multiple replicated regions are
>> shared among the nodes, and each node has its own disk store.
>> * Java runtime version: 1.8.0_151
>> * Geode version: 1.0.0-incubating
>> * Spring Data Geode version: 1.0.0.INCUBATING-RELEASE
>> * Spring Data version: 1.12.1.RELEASE
>>
>> Our problem:
>> We had situations that caused our Geode processes to quit abruptly, e.g.:
>> * VM being abruptly powered off (no guest shutdown) or...
>> * ...CPU freezes caused by IO degradation.
>> After restarting the cluster nodes (one after another or all at once),
>> geode logs on all nodes show the following:
>>         Region /XXX has potentially stale data. It is waiting for another
>> member to recover the latest data.
>>         My persistent id:
>>           DiskStore ID: XXX
>>           Name:
>>           Location: /XXX
>>         Members with potentially new data:
>>         [
>>           DiskStore ID: XXX
>>           Name: XXX
>>           Location: /XXX
>>         ]
>> The problem, however, is that each node is waiting for the other nodes to
>> join even though they are already started. No combination of
>> starting/stopping the nodes that are shown as "missing" seems to resolve
>> the situation.
>>
>> Our temporary solution:
>> We managed to "recover" from such a deadlock using gfsh:
>> * Revoke the missing disk stores of all nodes except one "chosen" node
>> (preferably the last one that was running).
>> * Delete those disk stores.
>> * Restart the nodes.
>> As of today we're not able to easily add the "Spring Shell" dependency to
>> our application, which is why we have to run gfsh with its own locator.
>> This requires us to define such a "gfsh locator" on all of our cluster
>> nodes.
>>
>> What we're looking for:
>> Our temporary solution comes with some flaws: we depend on the gfsh
>> tooling with its own locator, and manual intervention is required. Getting
>> the cluster up and running again is complicated from an admin perspective.
>> Is there any way to detect/handle such a deadlock situation from within
>> the application? Are there any best practices that you could recommend?
>>
>> Thanks in advance for your help!
>>
>> Best
>> Daniel
>> persistent security in a changing world.
>>
>>
>
