I was able to reproduce the same behavior Darrel described, which answers my first scenario.

Steps:

Console1:
start locator --name=locator1 --port=10334 --locators=localhost[10334],localhost[10335] --enable-cluster-configuration=true --log-level=config
stop locator --name=locator1

Console2:
start locator --name=locator2 --port=10335 --locators=localhost[10334],localhost[10335] --enable-cluster-configuration=true --log-level=config
stop locator --name=locator2

Console1:
start locator --name=locator1 --port=10334 --locators=localhost[10334],localhost[10335] --enable-cluster-configuration=true --log-level=config

Console2:
start locator --name=locator2 --port=10335 --locators=localhost[10334],localhost[10335] --enable-cluster-configuration=true --log-level=config

At that point gfsh reports "Cluster configuration service failed to start, please check the log file for errors", with the same error in the locator log file as in this email thread.

But as explained in my second scenario, what if this happens because of a machine crash? For future releases, could this be avoided by introducing more communication between the locators? For example, the locators could talk to each other and automatically determine which one should delete the older/corrupted/invalid cluster configuration from its own "work" directory and re-import a fresh cluster configuration from the other locator.

Thanks & Regards,
Dharam

From: Thacker, Dharam
Sent: Tuesday, June 06, 2017 9:53 AM
To: '[email protected]'
Subject: RE: How to deal with cluster configuration service failure

Thanks everyone for the suggestions! I will note it down and follow a strict start/stop sequence as well. My startup script already has information about all locators, and I can see the property below active in my locator.properties file on both hosts too:

locators=member001[10334],member002[10334]

Regarding Darrel's point, let me present two scenarios.

Darrel wrote: "One way this could happen is if you have just one locator running and it writes the cluster config to its disk-store. You then shut that locator down and start up a different one. It would have no knowledge of the other locator that you shut down so it would create a brand new cluster config in its disk-store. If at some point these two locators finally see each other the second one to start will throw a ConflictingPersistentDataException."

Let's say that is actually the case, as Darrel explained, where I have started only one locator but with the property --locators=member001[10334],member002[10334]:

1. Locator1 is running on member001 with --locators=member001[10334],member002[10334].
   -> It creates its own cluster config in its disk store.
   -> Then I shut down Locator1 on member001.
2. I start Locator2 on member002 with --locators=member001[10334],member002[10334].
   -> It creates its own fresh cluster config in its disk store.
3. Now I start Locator1 on member001 again with --locators=member001[10334],member002[10334].

Is that really a problem even if one locator knows about the other via the --locators property? Of course, the other locator is not running at the same time. I believe that might be the same case as mine. Could you confirm once again?

But if the above is true, then let me tweak this scenario and present it in a different way:

1. Locator1 is running on member001 with --locators=member001[10334],member002[10334].
   -> It creates its own cluster config in its disk store.
   -> Then after some time Locator1 crashes / the machine goes down. [In the previous case I was shutting it down manually.]
2. Locator2 starts on member002 with --locators=member001[10334],member002[10334], via automated jobs scheduled with, say, autosys.
3. After some time, machine member001 suddenly comes back up, the business process validator determines that Locator1 should be running, and it attempts to start Locator1 on member001.

Ah, then it would break the system once again. Any points there?

Thanks & Regards,
Dharam

From: Kirk Lund [mailto:[email protected]]
Sent: Monday, June 05, 2017 10:39 PM
To: [email protected]
Subject: Re: How to deal with cluster configuration service failure

Two locators in the same cluster use the DistributedLockService to determine which one is the primary for cluster config. If two locators don't know about each other, then they are not part of the same cluster, and a server cannot join two clusters.

On Mon, Jun 5, 2017 at 9:48 AM, Mark Secrist <[email protected]> wrote:

I also wonder if it could be that way if the two locators are started without knowledge of each other (via the locators property).

On Mon, Jun 5, 2017 at 10:45 AM, Darrel Schneider <[email protected]> wrote:

A ConflictingPersistentDataException indicates that two copies of a disk-store were written independently of each other. When using cluster configuration the locator uses a disk-store to write the cluster configuration to disk. It looks like that is the disk-store that is throwing the ConflictingPersistentDataException.

One way this could happen is if you have just one locator running and it writes the cluster config to its disk-store. You then shut that locator down and start up a different one. It would have no knowledge of the other locator that you shut down, so it would create a brand new cluster config in its disk-store. If at some point these two locators finally see each other, the second one to start will throw a ConflictingPersistentDataException.

In this case you need to pick which one of these disk-stores you want to be the winner and remove the other disk store. To pick the best winner, each locator also writes some cache.xml files that show you in plain text what is in the binary disk-store files. This could also help you determine what configuration you will lose when you remove one of these disk-stores. You can get that missing config back by rerunning the same gfsh commands (for example, create region).

Another option would be to use the gfsh import/export commands. Before deleting either disk-store, start them up one at a time and export the cluster config. Then you can start fresh by importing the config. You might hit a problem in which one of these disk-stores now knows about the other, so when you try to start it by itself it fails saying it is waiting for the other to start up; then when you do that, you get the ConflictingPersistentDataException. In that case you would not be able to start them up one at a time to do the export, so you need to fall back to the cache.xml files. Someone who knows more about cluster config might be able to help you more.

You should be able to avoid this in the future by making sure you start both locators before doing your first gfsh create command. That way both disk-stores will know about each other and will be kept in sync.

On Mon, Jun 5, 2017 at 8:07 AM, Jinmei Liao <[email protected]> wrote:

Is this related to https://issues.apache.org/jira/browse/GEODE-3003?
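
A rough sketch of the export/import route Darrel describes above, run from the shell with gfsh. The choice of Locator1 as the configuration to keep, the /tmp zip path, and the disk-store paths (taken from the log below) are illustrative only, and exact gfsh option names may differ slightly between Geode versions:

# On member001: start only the locator whose configuration you want to keep, export its
# cluster configuration, then stop it. Run this from Locator1's existing working directory
# (or pass --dir) so its existing ConfigDiskDir is recovered; gfsh connects to the locator
# it just started, so the export can run in the same invocation.
gfsh -e "start locator --name=Locator1 --port=10334 --enable-cluster-configuration=true" \
     -e "export cluster-configuration --zip-file-name=/tmp/cluster-config.zip" \
     -e "stop locator --name=Locator1"

# Remove the cluster-configuration disk stores so both locators start fresh
# (consider backing them up first; paths come from the log below).
rm -rf /local/apps/shared/geode/members/Locator1/work/ConfigDiskDir_Locator1
rm -rf /local/apps/shared/geode/members/Locator2/work/ConfigDiskDir_Locator2

# After restarting both locators (on member001 and member002) with their usual options,
# and before starting any cache servers, re-import the saved configuration:
gfsh -e "connect --locator=member001[10334]" \
     -e "import cluster-configuration --zip-file-name=/tmp/cluster-config.zip"

If Locator1 refuses to start alone because its disk store is waiting for Locator2, the export step is not possible and the cache.xml files Darrel mentions are the fallback.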
On Sun, Jun 4, 2017 at 11:39 PM, Thacker, Dharam <[email protected]> wrote:

Hi Team,

Could someone help me understand how to deal with the scenario below, where the cluster configuration service fails to start in another locator? Which corrective action should we take to rectify this?

Note:
member001.IP.MASKED – IP address of member001
member002.IP.MASKED – IP address of member002

Locator logs on member002:

[info 2017/06/05 02:07:11.941 EDT Locator2 <Pooled Message Processor 1> tid=0x3d] Initializing region _ConfigurationRegion

[warning 2017/06/05 02:07:11.951 EDT Locator2 <Pooled Message Processor 1> tid=0x3d] Initialization failed for Region /_ConfigurationRegion
org.apache.geode.cache.persistence.ConflictingPersistentDataException: Region /_ConfigurationRegion refusing to initialize from member member001(Locator1:5160:locator)<ec><v0>:1024 with persistent data /member001.IP.MASKED:/local/apps/shared/geode/members/Locator1/work/ConfigDiskDir_Locator1 created at timestamp 1496241336712 version 0 diskStoreId 31efa18230134865-b4fd0fcbde63ade6 name Locator1 which was offline when the local data from /member002.IP.MASKED:/local/apps/shared/geode/members/Locator2/work/ConfigDiskDir_Locator2 created at timestamp 1496241344046 version 0 diskStoreId df94511d0f3d4295-91ec9286a18aaa75 name Locator2 was last online
    at org.apache.geode.internal.cache.persistence.PersistenceAdvisorImpl.checkMyStateOnMembers(PersistenceAdvisorImpl.java:751)
    at org.apache.geode.internal.cache.persistence.PersistenceAdvisorImpl.getInitialImageAdvice(PersistenceAdvisorImpl.java:812)
    at org.apache.geode.internal.cache.persistence.CreatePersistentRegionProcessor.getInitialImageAdvice(CreatePersistentRegionProcessor.java:52)
    at org.apache.geode.internal.cache.DistributedRegion.getInitialImageAndRecovery(DistributedRegion.java:1267)
    at org.apache.geode.internal.cache.DistributedRegion.initialize(DistributedRegion.java:1101)
    at org.apache.geode.internal.cache.GemFireCacheImpl.createVMRegion(GemFireCacheImpl.java:3308)
    at org.apache.geode.distributed.internal.ClusterConfigurationService.getConfigurationRegion(ClusterConfigurationService.java:709)
    at org.apache.geode.distributed.internal.ClusterConfigurationService.initSharedConfiguration(ClusterConfigurationService.java:426)
    at org.apache.geode.distributed.internal.InternalLocator$SharedConfigurationRunnable.run(InternalLocator.java:649)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at org.apache.geode.distributed.internal.DistributionManager.runUntilShutdown(DistributionManager.java:621)
    at org.apache.geode.distributed.internal.DistributionManager$4$1.run(DistributionManager.java:878)
    at java.lang.Thread.run(Thread.java:745)

[error 2017/06/05 02:07:11.959 EDT Locator2 <Pooled Message Processor 1> tid=0x3d] Error occurred while initializing cluster configuration
java.lang.RuntimeException: Error occurred while initializing cluster configuration
    at org.apache.geode.distributed.internal.ClusterConfigurationService.getConfigurationRegion(ClusterConfigurationService.java:722)
    at org.apache.geode.distributed.internal.ClusterConfigurationService.initSharedConfiguration(ClusterConfigurationService.java:426)
    at org.apache.geode.distributed.internal.InternalLocator$SharedConfigurationRunnable.run(InternalLocator.java:649)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at org.apache.geode.distributed.internal.DistributionManager.runUntilShutdown(DistributionManager.java:621)
    at org.apache.geode.distributed.internal.DistributionManager$4$1.run(DistributionManager.java:878)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.geode.cache.persistence.ConflictingPersistentDataException: Region /_ConfigurationRegion refusing to initialize from member member001(Locator1:5160:locator)<ec><v0>:1024 with persistent data /member001.IP.MASKED:/local/apps/shared/geode/members/Locator1/work/ConfigDiskDir_Locator1 created at timestamp 1496241336712 version 0 diskStoreId 31efa18230134865-b4fd0fcbde63ade6 name Locator1 which was offline when the local data from /member002.IP.MASKED:/local/apps/shared/geode/members/Locator2/work/ConfigDiskDir_Locator2 created at timestamp 1496241344046 version 0 diskStoreId df94511d0f3d4295-91ec9286a18aaa75 name Locator2 was last online
    at org.apache.geode.internal.cache.persistence.PersistenceAdvisorImpl.checkMyStateOnMembers(PersistenceAdvisorImpl.java:751)
    at org.apache.geode.internal.cache.persistence.PersistenceAdvisorImpl.getInitialImageAdvice(PersistenceAdvisorImpl.java:812)
    at org.apache.geode.internal.cache.persistence.CreatePersistentRegionProcessor.getInitialImageAdvice(CreatePersistentRegionProcessor.java:52)
    at org.apache.geode.internal.cache.DistributedRegion.getInitialImageAndRecovery(DistributedRegion.java:1267)
    at org.apache.geode.internal.cache.DistributedRegion.initialize(DistributedRegion.java:1101)
    at org.apache.geode.internal.cache.GemFireCacheImpl.createVMRegion(GemFireCacheImpl.java:3308)
    at org.apache.geode.distributed.internal.ClusterConfigurationService.getConfigurationRegion(ClusterConfigurationService.java:709)
    ... 7 more

Thanks & Regards,
Dharam

--
Cheers
Jinmei

--
Mark Secrist | Sr Manager, Global Education Delivery
[email protected] | pivotal.io
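
For completeness, a sketch of the startup order Darrel recommends above to avoid the split configuration in the first place, using the host names and ports from this thread (the region name is only an example):

# On member001:
gfsh -e "start locator --name=Locator1 --port=10334 --locators=member001[10334],member002[10334] --enable-cluster-configuration=true"

# On member002:
gfsh -e "start locator --name=Locator2 --port=10334 --locators=member001[10334],member002[10334] --enable-cluster-configuration=true"

# Only after both locators are up and see each other, run the first configuration-changing
# command, so both cluster-configuration disk stores record it and stay in sync:
gfsh -e "connect --locator=member001[10334]" -e "create region --name=exampleRegion --type=REPLICATE"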
