Hello,

Thank you very much for the commit https://issues.apache.org/jira/browse/GEODE-3052?focusedCommentId=16043697&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16043697. I'm looking forward to it being included in 1.2.
Unfortunately, we cannot wait: we have to deliver our corporate release earlier than the Geode 1.2 release, so we need to come up with a workaround. Once the Geode 1.2 release is available, we will remove the workaround.

As I understand it, there are two workarounds for GEODE-3003 <https://issues.apache.org/jira/browse/GEODE-3003>:

1. Pause for more than 2 seconds between locator starts.
2. Remove the locator*.dat file before starting a locator.

"I would really like people to not get into the habit of deleting these files. If you accidentally delete them while servers are still up you will be forced to do a complete shutdown."

I've played a bit with the artifacts from GEODE-3003 <https://issues.apache.org/jira/browse/GEODE-3003> and found that the side effect of deleting locator*.dat is the following: if we start a locator whose locator*.dat file has been deleted while there are no other locators running, the servers are gone from the membership view, so the servers have to be restarted.

In more detail (a command-line sketch of this sequence follows after this message):

1) Start a locator on host1, start a locator on host2.
2) Start a server on host1, start a server on host2.
3) Stop the locator on host1: kill [host1-locator-PID].
4) Remove the locator*.dat file on host1.
5) Stop the locator on host2: kill [host2-locator-PID].
6) Start the locator on host1, start the locator on host2.
7) In gfsh we see that the cluster now consists only of the locators; the servers are gone. To join the servers back into the cluster, we have to restart them!

So if we skip step 4), we do not run into the side effect of step 7). If what I've described is the only side effect, we are OK to go with this temporary workaround of removing the locator*.dat file.

Anton Mironenko
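For reference, here is the numbered sequence above expressed as commands. This is only a sketch and not part of the original mail: the member names, the port 10334, and the locator1/ working-directory path are illustrative.

host1> gfsh -e "start locator --name=locator1 --port=10334"
host2> gfsh -e "start locator --name=locator2 --port=10334"
host1> gfsh -e "start server --name=server1 --locators=host1[10334],host2[10334]"
host2> gfsh -e "start server --name=server2 --locators=host1[10334],host2[10334]"
host1> kill [host1-locator-PID]            # step 3)
host1> rm locator1/locator*.dat            # step 4), the step that causes the trouble
host2> kill [host2-locator-PID]            # step 5)
host1> gfsh -e "start locator --name=locator1 --port=10334"
host2> gfsh -e "start locator --name=locator2 --port=10334"
host1> gfsh -e "connect --locator=host1[10334]" -e "list members"   # step 7): only the locators are listed

Skipping the rm in step 4) avoids the empty membership view seen in step 7).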
From: Bruce Schuchardt [mailto:[email protected]]
Sent: Thursday, June 08, 2017 20:06
To: [email protected]
Subject: Re: What functionality do we lose, if we delete locator*view.dat before starting a locator

The split-brain issue is easily reproduced in LocatorDUnitTest.testStartTwoLocators by duplicating the last line in the method:

  public void testStartTwoLocators() throws Exception {
    disconnectAllFromDS();
    Host host = Host.getHost(0);
    VM loc1 = host.getVM(1);
    VM loc2 = host.getVM(2);
    int ports[] = AvailablePortHelper.getRandomAvailableTCPPorts(2);
    final int port1 = ports[0];
    this.port1 = port1;
    final int port2 = ports[1];
    this.port2 = port2; // for cleanup in tearDown2
    DistributedTestUtils.deleteLocatorStateFile(port1);
    DistributedTestUtils.deleteLocatorStateFile(port2);
    final String host0 = NetworkUtils.getServerHostName(host);
    final String locators = host0 + "[" + port1 + "]," + host0 + "[" + port2 + "]";
    final Properties properties = new Properties();
    properties.put(MCAST_PORT, "0");
    properties.put(LOCATORS, locators);
    properties.put(ENABLE_NETWORK_PARTITION_DETECTION, "false");
    properties.put(DISABLE_AUTO_RECONNECT, "true");
    properties.put(MEMBER_TIMEOUT, "2000");
    properties.put(LOG_LEVEL, LogWriterUtils.getDUnitLogLevel());
    properties.put(ENABLE_CLUSTER_CONFIGURATION, "false");
    addDSProps(properties);
    startVerifyAndStopLocator(loc1, loc2, port1, port2, properties);
    startVerifyAndStopLocator(loc1, loc2, port1, port2, properties);
  }

It fails every time on the second startVerifyAndStopLocator invocation. The fix for this is pretty simple and I'll try to get it into the upcoming 1.2 release. Then you won't have to delete the locator view.dat files or stagger startup anymore.

I would really like people to not get into the habit of deleting these files. If you accidentally delete them while servers are still up you will be forced to do a complete shutdown.

On 6/8/17 9:45 AM, Udo Kohlmeyer wrote:

Dharam,

Thank you for testing this out as well. Using Anton's guidance, I've managed to reproduce the issue by restarting the two locators within 1s of each other (try for sub-second if possible). Anton did describe that he did not see this behavior when the gap between restarting the two locators was more than 2s.

--Udo

On 6/8/17 09:29, Dharam Thacker wrote:

Hi Anton,

I also tried to reproduce your scenario on my local Ubuntu machine with Geode 1.1.1, but I was able to restart the cluster safely as explained below.

host1> start locator1
host2> start locator2
host1> start server1
host2> start server2
host1> stop server1
host2> stop server2
host1> stop locator1
host2> stop locator2

Verify that all members shut down well...

host1> start locator2   [even though I terminated this last in the sequence above, I am starting it as the first member]
host2> start locator1

Of course, the start of locator2 gave me the same warning as I have highlighted in red below. I then waited more than 10s before starting the second locator (locator1 had been stopped earlier than locator2). But as soon as locator1 started, locator2 detected it and started up the cluster configuration service. The cluster was re-formed after that. Logs below for verification:

[info 2017/06/08 21:47:10.911 IST locator2 <Pooled Message Processor 1> tid=0x2d] Region /_ConfigurationRegion has potentially stale data. It is waiting for another member to recover the latest data.
My persistent id:
  DiskStore ID: a267d876-40c8-4c85-848a-5a397adb5e5b
  Name: locator2
  Location: /192.168.1.12:/home/dharam/Downloads/apache-geode/locator2/ConfigDiskDir_locator2
Members with potentially new data:
[
  DiskStore ID: 39d28da8-6b2c-414c-9608-3550219b624d
  Name: locator1
  Location: /192.168.1.12:/home/dharam/Downloads/apache-geode/locator1/ConfigDiskDir_locator1
]
Use the "gfsh show missing-disk-stores" command to see all disk stores that are being waited on by other members.

[warning 2017/06/08 21:47:45.606 IST locator2 <WAN Locator Discovery Thread> tid=0x2f] Locator discovery task could not exchange locator information 192.168.1.12[10335] with localhost[10334] after 6 retry attempts. Retrying in 10,000 ms.
[info 2017/06/08 21:48:02.886 IST locator2 <unicast receiver,dharam-ThinkPad-Edge-E431-1183> tid=0x1c] received join request from 192.168.1.12(locator1:10969:locator)<ec>:1025

[info 2017/06/08 21:48:03.187 IST locator2 <Geode Membership View Creator> tid=0x22] View Creator is processing 1 requests for the next membership view

[info 2017/06/08 21:48:03.188 IST locator2 <Geode Membership View Creator> tid=0x22] preparing new view View[192.168.1.12(locator2:10853:locator)<ec><v0>:1024|1] members: [192.168.1.12(locator2:10853:locator)<ec><v0>:1024, 192.168.1.12(locator1:10969:locator)<ec><v1>:1025] failure detection ports: 14001 42428

[info 2017/06/08 21:48:03.221 IST locator2 <Geode Membership View Creator> tid=0x22] finished waiting for responses to view preparation

[info 2017/06/08 21:48:03.221 IST locator2 <Geode Membership View Creator> tid=0x22] received new view: View[192.168.1.12(locator2:10853:locator)<ec><v0>:1024|1] members: [192.168.1.12(locator2:10853:locator)<ec><v0>:1024, 192.168.1.12(locator1:10969:locator)<ec><v1>:1025] old view is: View[192.168.1.12(locator2:10853:locator)<ec><v0>:1024|0] members: [192.168.1.12(locator2:10853:locator)<ec><v0>:1024]

[info 2017/06/08 21:48:03.222 IST locator2 <Geode Membership View Creator> tid=0x22] Peer locator received new membership view: View[192.168.1.12(locator2:10853:locator)<ec><v0>:1024|1] members: [192.168.1.12(locator2:10853:locator)<ec><v0>:1024, 192.168.1.12(locator1:10969:locator)<ec><v1>:1025]

[info 2017/06/08 21:48:03.228 IST locator2 <Geode Membership View Creator> tid=0x22] sending new view View[192.168.1.12(locator2:10853:locator)<ec><v0>:1024|1] members: [192.168.1.12(locator2:10853:locator)<ec><v0>:1024, 192.168.1.12(locator1:10969:locator)<ec><v1>:1025] failure detection ports: 14001 42428

[info 2017/06/08 21:48:03.232 IST locator2 <View Message Processor> tid=0x48] Membership: Processing addition < 192.168.1.12(locator1:10969:locator)<ec><v1>:1025 >

[info 2017/06/08 21:48:03.233 IST locator2 <View Message Processor> tid=0x48] Admitting member <192.168.1.12(locator1:10969:locator)<ec><v1>:1025>. Now there are 2 non-admin member(s).

[info 2017/06/08 21:48:03.242 IST locator2 <pool-3-thread-1> tid=0x4a] Initializing region _monitoringRegion_192.168.1.12<v1>1025

[info 2017/06/08 21:48:03.275 IST locator2 <Pooled High Priority Message Processor 1> tid=0x4e] Member 192.168.1.12(locator1:10969:locator)<ec><v1>:1025 is equivalent or in the same redundancy zone.

[info 2017/06/08 21:48:03.326 IST locator2 <pool-3-thread-1> tid=0x4a] Initialization of region _monitoringRegion_192.168.1.12<v1>1025 completed

[info 2017/06/08 21:48:03.336 IST locator2 <pool-3-thread-1> tid=0x4a] Initializing region _notificationRegion_192.168.1.12<v1>1025

[info 2017/06/08 21:48:03.338 IST locator2 <pool-3-thread-1> tid=0x4a] Initialization of region _notificationRegion_192.168.1.12<v1>1025 completed

[info 2017/06/08 21:48:04.611 IST locator2 <Pooled Message Processor 1> tid=0x2d] Region _ConfigurationRegion requesting initial image from 192.168.1.12(locator1:10969:locator)<ec><v1>:1025

[info 2017/06/08 21:48:04.615 IST locator2 <Pooled Message Processor 1> tid=0x2d] _ConfigurationRegion is done getting image from 192.168.1.12(locator1:10969:locator)<ec><v1>:1025. isDeltaGII is true
[info 2017/06/08 21:48:04.616 IST locator2 <Pooled Message Processor 1> tid=0x2d] Region _ConfigurationRegion initialized persistent id: /192.168.1.12:/home/dharam/Downloads/apache-geode/locator2/ConfigDiskDir_locator2 created at timestamp 1496938615755 version 0 diskStoreId a267d87640c84c85-848a5a397adb5e5b name locator2 with data from 192.168.1.12(locator1:10969:locator)<ec><v1>:1025.

[info 2017/06/08 21:48:04.617 IST locator2 <Pooled Message Processor 1> tid=0x2d] Initialization of region _ConfigurationRegion completed

[info 2017/06/08 21:48:04.637 IST locator2 <Pooled Message Processor 1> tid=0x2d] ConfigRequestHandler installed

[info 2017/06/08 21:48:04.637 IST locator2 <Pooled Message Processor 1> tid=0x2d] Cluster configuration service start up completed successfully and is now running

....

[info 2017/06/08 21:48:05.692 IST locator2 <WAN Locator Discovery Thread> tid=0x2f] Locator discovery task exchanged locator information 192.168.1.12[10335] with localhost[10334]: {-1=[192.168.1.12[10335], 192.168.1.12[10334]]}.

Thanks,
Dharam

- Dharam Thacker

On Thu, Jun 8, 2017 at 9:25 PM, Bruce Schuchardt <[email protected]> wrote:

The locator view file exists to allow locators to be bounced without shutting down the rest of the cluster. On startup a locator will try to find the current membership coordinator of the cluster from an existing locator and join the system using that information. If there is no existing locator that knows who the coordinator might be, then the new locator will try to find the coordinator using the membership "view" that is stored in the view file. If there is no view file, the locator will not be able to join the existing cluster.

If you've done a full shutdown of the cluster, it is safe to delete the locator*view.dat files. When there is no .dat file, the locators will use a concurrent-startup algorithm to form a unified system.

On 6/8/17 7:48 AM, Anton Mironenko wrote:

Hello,

We found out that deleting "locator*view.dat" before starting a locator fixes the first part of the issue https://issues.apache.org/jira/browse/GEODE-3003 "Geode doesn't start after cluster restart when using cluster-configuration": "The second start goes wrong: the locator on the first host always doesn't join the rest of the cluster, with the error in the locator log: 'Region /_ConfigurationRegion has potentially stale data. It is waiting for another member to recover the latest data.'"

What is the side effect of deleting the file "locator0/locator*view.dat"? What functionality do we lose? A use case with some example would be great.

Anton Mironenko
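To illustrate Bruce's point above about when it is safe to delete the view files, here is a sketch. It is not part of the original thread: the member names, the port 10334, and the working-directory paths are illustrative.

# Safe: the view files are deleted only after a FULL shutdown of the cluster.
host1> gfsh -e "connect --locator=host1[10334]" -e "shutdown --include-locators=true"
host1> rm locator1/locator*view.dat
host2> rm locator2/locator*view.dat
host1> gfsh -e "start locator --name=locator1 --port=10334"
host2> gfsh -e "start locator --name=locator2 --port=10334"
# With no view file, the locators use the concurrent-startup algorithm to form one unified system.

# Unsafe: deleting the view files while the servers are still running. A restarting locator that
# finds no other running locator then has no stored view to fall back on, cannot rejoin the
# existing cluster, and a complete shutdown is forced.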
