I was able to reproduce the same behavior Darrel described, which answers my first scenario.

Steps:

Console1:
start locator --name=locator1 --port=10334 --locators=localhost[10334],localhost[10335] --enable-cluster-configuration=true --log-level=config
stop locator --name=locator1

Console2:
start locator --name=locator2 --port=10335 --locators=localhost[10334],localhost[10335] --enable-cluster-configuration=true --log-level=config
stop locator --name=locator2

Console1:
start locator --name=locator1 --port=10334 --locators=localhost[10334],localhost[10335] --enable-cluster-configuration=true --log-level=config

Console2:
start locator --name=locator2 --port=10335 --locators=localhost[10334],localhost[10335] --enable-cluster-configuration=true --log-level=config

At that point gfsh reports "Cluster configuration service failed to start, please check the log file for errors", with the same error in the locator log file as in this email thread.

But as explained in my second scenario, what if this happens because of a machine crash? For future releases, could this be avoided by introducing more communication between the locators? For example, the locators could talk to each other and automatically determine which one should delete the older/corrupted/invalid cluster configuration from its own "work" directory and re-import a fresh cluster configuration from the other locator.

Thanks & Regards,
Dharam

From: Thacker, Dharam
Sent: Tuesday, June 06, 2017 9:53 AM
To: '[email protected]'
Subject: RE: How to deal with cluster configuration service failure

Thanks everyone for the suggestions! I will note it down and follow a strict start/stop sequence as well. My startup script already has information about all locators, and I can see the property below active in my locator.properties file on both hosts too:

locators=member001[10334],member002[10334]

Regarding Darrel's point, let me present two scenarios.

Darrel wrote: "One way this could happen is if you have just one locator running and it writes the cluster config to its disk-store. You then shut that locator down and start up a different one. It would have no knowledge of the other locator that you shut down so it would create a brand new cluster config in its disk-store. If at some point these two locators finally see each other the second one to start will throw a ConflictingPersistentDataException."

Let's say that is actually the case, as Darrel explained, where I have started only one locator but with the property --locators=member001[10334],member002[10334]:

1. Locator1 is running on member001 with --locators=member001[10334],member002[10334].
   -> It creates its own cluster config in its disk store.
   -> Then I shut down Locator1 on member001.
2. I start Locator2 on member002 with --locators=member001[10334],member002[10334].
   -> It creates its own fresh cluster config in its disk store.
3. Now I start Locator1 on member001 again with --locators=member001[10334],member002[10334].

Is that really a problem even if one locator knows about the other via the --locators property? Of course, the other locator is not running at the same time. I believe that might be the same case as mine. Could you confirm once again?

But if the above is true, then let me tweak this scenario and present it in a different way:

1. Locator1 is running on member001 with --locators=member001[10334],member002[10334].
   -> It creates its own cluster config in its disk store.
   -> Then after some time Locator1 crashes / the machine goes down. [In the previous case I was shutting it down manually.]
2. Locator2 starts on member002 with --locators=member001[10334],member002[10334], via automated jobs scheduled with, say, autosys.
3. After some time, machine member001 suddenly comes back up, the business process validator determines that Locator1 should be running, and it attempts to start Locator1 on member001.

Ah, then it would break the system once again. Any points there?

Thanks & Regards,
Dharam

From: Kirk Lund [mailto:[email protected]]
Sent: Monday, June 05, 2017 10:39 PM
To: [email protected]
Subject: Re: How to deal with cluster configuration service failure

Two locators in the same cluster use the DistributedLockService to determine which one is the primary for cluster config. If two locators don't know about each other, then they are not part of the same cluster, and a server cannot join two clusters.

On Mon, Jun 5, 2017 at 9:48 AM, Mark Secrist <[email protected]> wrote:

I also wonder if it could be that way if the two locators are started without knowledge of each other (via the locators property).

On Mon, Jun 5, 2017 at 10:45 AM, Darrel Schneider <[email protected]> wrote:

A ConflictingPersistentDataException indicates that two copies of a disk-store were written independently of each other. When using cluster configuration the locator uses a disk-store to write the cluster configuration to disk. It looks like that is the disk-store that is throwing the ConflictingPersistentDataException.

One way this could happen is if you have just one locator running and it writes the cluster config to its disk-store. You then shut that locator down and start up a different one. It would have no knowledge of the other locator that you shut down, so it would create a brand new cluster config in its disk-store. If at some point these two locators finally see each other, the second one to start will throw a ConflictingPersistentDataException.

In this case you need to pick which one of these disk-stores you want to be the winner and remove the other disk store. To pick the best winner, each locator also writes some cache.xml files that show you in plain text what is in the binary disk-store files. This could also help you determine what configuration you will lose when you remove one of these disk-stores. You can get that missing config back by rerunning the same gfsh commands (for example, create region).

Another option would be to use the gfsh import/export commands. Before deleting either disk-store, start them up one at a time and export the cluster config. Then you can start fresh by importing the config. You might hit a problem in which one of these disk-stores now knows about the other, so when you try to start it by itself it fails saying it is waiting for the other to start up; then when you do that, you get the ConflictingPersistentDataException. In that case you would not be able to start them up one at a time to do the export, so you need to fall back to the cache.xml files. Someone who knows more about cluster config might be able to help you more.

You should be able to avoid this in the future by making sure you start both locators before doing your first gfsh create command. That way both disk-stores will know about each other and will be kept in sync.

On Mon, Jun 5, 2017 at 8:07 AM, Jinmei Liao <[email protected]> wrote:

Is this related to https://issues.apache.org/jira/browse/GEODE-3003?
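
A rough sketch of the export/import route Darrel describes above, run from the shell with gfsh. The choice of Locator1 as the configuration to keep, the /tmp zip path, and the disk-store paths (taken from the log below) are illustrative only, and exact gfsh option names may differ slightly between Geode versions:

# On member001: start only the locator whose configuration you want to keep, export its
# cluster configuration, then stop it. Run this from Locator1's existing working directory
# (or pass --dir) so its existing ConfigDiskDir is recovered; gfsh connects to the locator
# it just started, so the export can run in the same invocation.
gfsh -e "start locator --name=Locator1 --port=10334 --enable-cluster-configuration=true" \
     -e "export cluster-configuration --zip-file-name=/tmp/cluster-config.zip" \
     -e "stop locator --name=Locator1"

# Remove the cluster-configuration disk stores so both locators start fresh
# (consider backing them up first; paths come from the log below).
rm -rf /local/apps/shared/geode/members/Locator1/work/ConfigDiskDir_Locator1
rm -rf /local/apps/shared/geode/members/Locator2/work/ConfigDiskDir_Locator2

# After restarting both locators (on member001 and member002) with their usual options,
# and before starting any cache servers, re-import the saved configuration:
gfsh -e "connect --locator=member001[10334]" \
     -e "import cluster-configuration --zip-file-name=/tmp/cluster-config.zip"

If Locator1 refuses to start alone because its disk store is waiting for Locator2, the export step is not possible and the cache.xml files Darrel mentions are the fallback.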
On Sun, Jun 4, 2017 at 11:39 PM, Thacker, Dharam <[email protected]> wrote:

Hi Team,

Could someone help me understand how to deal with the scenario below, where the cluster configuration service fails to start in another locator? Which corrective action should we take to rectify this?

Note:
member001.IP.MASKED – IP address of member001
member002.IP.MASKED – IP address of member002

Locator logs on member002:

[info 2017/06/05 02:07:11.941 EDT Locator2 <Pooled Message Processor 1> tid=0x3d] Initializing region _ConfigurationRegion

[warning 2017/06/05 02:07:11.951 EDT Locator2 <Pooled Message Processor 1> tid=0x3d] Initialization failed for Region /_ConfigurationRegion
org.apache.geode.cache.persistence.ConflictingPersistentDataException: Region /_ConfigurationRegion refusing to initialize from member member001(Locator1:5160:locator)<ec><v0>:1024 with persistent data /member001.IP.MASKED:/local/apps/shared/geode/members/Locator1/work/ConfigDiskDir_Locator1 created at timestamp 1496241336712 version 0 diskStoreId 31efa18230134865-b4fd0fcbde63ade6 name Locator1 which was offline when the local data from /member002.IP.MASKED:/local/apps/shared/geode/members/Locator2/work/ConfigDiskDir_Locator2 created at timestamp 1496241344046 version 0 diskStoreId df94511d0f3d4295-91ec9286a18aaa75 name Locator2 was last online
    at org.apache.geode.internal.cache.persistence.PersistenceAdvisorImpl.checkMyStateOnMembers(PersistenceAdvisorImpl.java:751)
    at org.apache.geode.internal.cache.persistence.PersistenceAdvisorImpl.getInitialImageAdvice(PersistenceAdvisorImpl.java:812)
    at org.apache.geode.internal.cache.persistence.CreatePersistentRegionProcessor.getInitialImageAdvice(CreatePersistentRegionProcessor.java:52)
    at org.apache.geode.internal.cache.DistributedRegion.getInitialImageAndRecovery(DistributedRegion.java:1267)
    at org.apache.geode.internal.cache.DistributedRegion.initialize(DistributedRegion.java:1101)
    at org.apache.geode.internal.cache.GemFireCacheImpl.createVMRegion(GemFireCacheImpl.java:3308)
    at org.apache.geode.distributed.internal.ClusterConfigurationService.getConfigurationRegion(ClusterConfigurationService.java:709)
    at org.apache.geode.distributed.internal.ClusterConfigurationService.initSharedConfiguration(ClusterConfigurationService.java:426)
    at org.apache.geode.distributed.internal.InternalLocator$SharedConfigurationRunnable.run(InternalLocator.java:649)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at org.apache.geode.distributed.internal.DistributionManager.runUntilShutdown(DistributionManager.java:621)
    at org.apache.geode.distributed.internal.DistributionManager$4$1.run(DistributionManager.java:878)
    at java.lang.Thread.run(Thread.java:745)

[error 2017/06/05 02:07:11.959 EDT Locator2 <Pooled Message Processor 1> tid=0x3d] Error occurred while initializing cluster configuration
java.lang.RuntimeException: Error occurred while initializing cluster configuration
    at org.apache.geode.distributed.internal.ClusterConfigurationService.getConfigurationRegion(ClusterConfigurationService.java:722)
    at org.apache.geode.distributed.internal.ClusterConfigurationService.initSharedConfiguration(ClusterConfigurationService.java:426)
    at org.apache.geode.distributed.internal.InternalLocator$SharedConfigurationRunnable.run(InternalLocator.java:649)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at org.apache.geode.distributed.internal.DistributionManager.runUntilShutdown(DistributionManager.java:621)
    at org.apache.geode.distributed.internal.DistributionManager$4$1.run(DistributionManager.java:878)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.geode.cache.persistence.ConflictingPersistentDataException: Region /_ConfigurationRegion refusing to initialize from member member001(Locator1:5160:locator)<ec><v0>:1024 with persistent data /member001.IP.MASKED:/local/apps/shared/geode/members/Locator1/work/ConfigDiskDir_Locator1 created at timestamp 1496241336712 version 0 diskStoreId 31efa18230134865-b4fd0fcbde63ade6 name Locator1 which was offline when the local data from /member002.IP.MASKED:/local/apps/shared/geode/members/Locator2/work/ConfigDiskDir_Locator2 created at timestamp 1496241344046 version 0 diskStoreId df94511d0f3d4295-91ec9286a18aaa75 name Locator2 was last online
    at org.apache.geode.internal.cache.persistence.PersistenceAdvisorImpl.checkMyStateOnMembers(PersistenceAdvisorImpl.java:751)
    at org.apache.geode.internal.cache.persistence.PersistenceAdvisorImpl.getInitialImageAdvice(PersistenceAdvisorImpl.java:812)
    at org.apache.geode.internal.cache.persistence.CreatePersistentRegionProcessor.getInitialImageAdvice(CreatePersistentRegionProcessor.java:52)
    at org.apache.geode.internal.cache.DistributedRegion.getInitialImageAndRecovery(DistributedRegion.java:1267)
    at org.apache.geode.internal.cache.DistributedRegion.initialize(DistributedRegion.java:1101)
    at org.apache.geode.internal.cache.GemFireCacheImpl.createVMRegion(GemFireCacheImpl.java:3308)
    at org.apache.geode.distributed.internal.ClusterConfigurationService.getConfigurationRegion(ClusterConfigurationService.java:709)
    ... 7 more

Thanks & Regards,
Dharam

--
Cheers
Jinmei

--
Mark Secrist | Sr Manager, Global Education Delivery
[email protected] | pivotal.io
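
For completeness, a sketch of the startup order Darrel recommends above to avoid the split configuration in the first place, using the host names and ports from this thread (the region name is only an example):

# On member001:
gfsh -e "start locator --name=Locator1 --port=10334 --locators=member001[10334],member002[10334] --enable-cluster-configuration=true"

# On member002:
gfsh -e "start locator --name=Locator2 --port=10334 --locators=member001[10334],member002[10334] --enable-cluster-configuration=true"

# Only after both locators are up and see each other, run the first configuration-changing
# command, so both cluster-configuration disk stores record it and stay in sync:
gfsh -e "connect --locator=member001[10334]" -e "create region --name=exampleRegion --type=REPLICATE"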
