Re: Geode: Deadlock situation upon startup

2018-02-22 Thread Anilkumar Gingade
Is it possible to share the DiskStore IDs that are printed across the nodes?
That will help us see the dependencies between the nodes/disk stores.

Also, you don't need to start a separate locator for gfsh; you can connect to
the locator that is already running as part of the cluster, unless the
JMX manager on that locator is turned off.
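
For example, assuming the cluster locator runs on a host named node1 (a
placeholder) on the default locator port 10334 and has its JMX manager
enabled, connecting would look roughly like this:

    gfsh> connect --locator=node1[10334]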

-Anil.


Re: Geode: Deadlock situation upon startup

2018-02-22 Thread Darrel Schneider
As long as you start each member of your cluster in parallel, they should
work out amongst themselves who has the latest copy of the data.
You should not need to revoke disk stores that you still have. Since you are
only using replicates, your temporary solution is safe as long as you pick
the member that was the last to write data as the winner.
If you had partitioned regions, it would not be safe to get rid of all disk
stores except one.
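
For illustration, once gfsh is connected to the cluster, the disk stores that
members are still waiting for can be listed with a command along these lines
(the IDs in its output correspond to the DiskStore IDs in the log messages):

    gfsh> show missing-disk-stores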

This issue may have been fixed. You are using 1.0.0-incubating. Have you
considered upgrading to 1.4?


Geode: Deadlock situation upon startup

2018-02-22 Thread Daniel Kojic
Hi there

Our setup:
We have a multi-node clustered Java application running in an ESXi environment. 
Each cluster node has Geode embedded via Spring Data for Apache Geode and runs
its own locator. Multiple replicated regions are shared among the nodes, and
each node has its own disk store.
* Java runtime version: 1.8.0_151
* Geode version: 1.0.0-incubating
* Spring Data Geode version: 1.0.0.INCUBATING-RELEASE
* Spring Data version: 1.12.1.RELEASE
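
For illustration, a setup like the one described above corresponds roughly to
the following sketch in the plain Geode API (not the Spring Data configuration
actually in use; host, port, region, disk-store and directory names are
placeholders):

    import java.io.File;

    import org.apache.geode.cache.Cache;
    import org.apache.geode.cache.CacheFactory;
    import org.apache.geode.cache.Region;
    import org.apache.geode.cache.RegionShortcut;

    public class EmbeddedNodeSketch {
        public static void main(String[] args) {
            // Embedded peer that starts (and points to) its own locator.
            Cache cache = new CacheFactory()
                .set("locators", "localhost[10334]")
                .set("start-locator", "localhost[10334]")
                .create();

            // Each node persists to its own disk store; the directory must exist.
            cache.createDiskStoreFactory()
                .setDiskDirs(new File[] { new File("/var/geode/diskstore") })
                .create("nodeDiskStore");

            // Replicated region whose data is persisted to that disk store.
            Region<String, Object> region = cache
                .<String, Object>createRegionFactory(RegionShortcut.REPLICATE_PERSISTENT)
                .setDiskStoreName("nodeDiskStore")
                .create("exampleRegion");

            region.put("example-key", "example-value");
        }
    }

With several such peers pointing at the same locators, each keeps its own
persistent copy of the replicated regions, which is why Geode checks at
startup which copy is the most recent.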

Our problem:
We had situations that caused our Geode processes to quit abruptly, e.g.:
* the VM being powered off abruptly (no guest shutdown), or...
* ...CPU freezes caused by I/O degradation.
After restarting the cluster nodes (one after another or all at once), the
Geode logs on all nodes show the following:
Region /XXX has potentially stale data. It is waiting for another 
member to recover the latest data.
My persistent id:
  DiskStore ID: XXX
  Name:
  Location: /XXX
Members with potentially new data:
[
  DiskStore ID: XXX
  Name: XXX
  Location: /XXX
]
The problem, however, is that each node waits for the other nodes to join even
though they are already started. No combination of starting/stopping the nodes
that are shown as "missing" seems to help.

Our temporary solution:
We managed to "recover" from such a deadlock using gfsh:
* Revoke the missing disk stores of all nodes except one "chosen" node
(preferably the node that was running last); example gfsh commands are
sketched after this list.
* Delete those disk stores.
* Restart the nodes.
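
For reference, the revoke step maps to gfsh commands roughly like the
following, where the ID is a placeholder taken from the missing-disk-stores
listing (or from the DiskStore IDs in the log messages):

    gfsh> show missing-disk-stores
    gfsh> revoke missing-disk-store --id=<disk-store-id>

The revoked disk-store files are then deleted on the file system before the
nodes are restarted, as in the steps above.
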
As of today we cannot easily add the "Spring Shell" dependency to our
application, which is why we have to run gfsh with its own locator. This
requires us to define such a "gfsh locator" on all of our cluster nodes.

What we're looking for:
Our temporary solution comes with some flaws: we depend on the gfsh tooling
with its own locator, and manual intervention is required. Getting the cluster
up and running again is complicated from an admin perspective. Is there any
way to detect/handle such a deadlock situation from within the application?
Are there any best practices that you could recommend?
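
For illustration only, one possible direction (a sketch that assumes Geode's
management API, i.e. DistributedSystemMXBean and PersistentMemberDetails, is
usable from the application, and that the code runs on the member hosting the
JMX manager, since the MXBean is null elsewhere) would be to poll for missing
disk stores programmatically:

    import org.apache.geode.cache.Cache;
    import org.apache.geode.management.DistributedSystemMXBean;
    import org.apache.geode.management.ManagementService;
    import org.apache.geode.management.PersistentMemberDetails;

    public final class MissingDiskStoreCheck {

        // Logs any disk stores the cluster is still waiting for. Returns true
        // if at least one disk store is reported missing.
        public static boolean logMissingDiskStores(Cache cache) {
            DistributedSystemMXBean dsBean = ManagementService
                .getManagementService(cache)
                .getDistributedSystemMXBean();
            if (dsBean == null) {
                return false; // this member does not host the JMX manager
            }
            PersistentMemberDetails[] missing = dsBean.listMissingDiskStores();
            if (missing == null || missing.length == 0) {
                return false;
            }
            for (PersistentMemberDetails details : missing) {
                System.out.println("Waiting for disk store " + details.getDiskStoreId()
                    + " on host " + details.getHost()
                    + " in directory " + details.getDirectory());
            }
            return true;
        }
    }

Acting on this automatically (for example by revoking) would still need care
about which member holds the most recent copy of the data.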

Thanks in advance for your help!

Best
Daniel
persistent security in a changing world.