Hello,

Thank you very much for the commit https://issues.apache.org/jira/browse/GEODE-3052?focusedCommentId=16043697&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16043697. I'm looking forward to it being included in 1.2.
Unfortunately, we cannot wait: we have to deliver our corporate release earlier than the Geode 1.2 release, so we need to come up with a workaround. Once the Geode 1.2 release is available, we will remove the workaround.

As I understand it, there are two workarounds for GEODE-3003 <https://issues.apache.org/jira/browse/GEODE-3003>:

1. Pause for more than 2 seconds between locator starts.
2. Remove the locator*.dat file before starting a locator.

"I would really like people to not get into the habit of deleting these files. If you accidentally delete them while servers are still up you will be forced to do a complete shutdown."

I've played a bit with the artifacts from GEODE-3003 <https://issues.apache.org/jira/browse/GEODE-3003> and found that the side effect of deleting locator*.dat is the following: if we start a locator whose locator*.dat file has been deleted while there are no other locators running, the servers are gone from the membership view, so the servers have to be restarted.

In more detail (a command-line sketch of this sequence follows after this message):

1) Start a locator on host1, start a locator on host2.
2) Start a server on host1, start a server on host2.
3) Stop the locator on host1: kill [host1-locator-PID].
4) Remove the locator*.dat file on host1.
5) Stop the locator on host2: kill [host2-locator-PID].
6) Start the locator on host1, start the locator on host2.
7) In gfsh we see that the cluster now consists only of the locators; the servers are gone. To join the servers back into the cluster, we have to restart them!

So if we skip step 4), we do not run into the side effect of step 7). If what I've described is the only side effect, we are OK to go with this temporary workaround of removing the locator*.dat file.

Anton Mironenko
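For reference, here is the numbered sequence above expressed as commands. This is only a sketch and not part of the original mail: the member names, the port 10334, and the locator1/ working-directory path are illustrative.

host1> gfsh -e "start locator --name=locator1 --port=10334"
host2> gfsh -e "start locator --name=locator2 --port=10334"
host1> gfsh -e "start server --name=server1 --locators=host1[10334],host2[10334]"
host2> gfsh -e "start server --name=server2 --locators=host1[10334],host2[10334]"
host1> kill [host1-locator-PID]            # step 3)
host1> rm locator1/locator*.dat            # step 4), the step that causes the trouble
host2> kill [host2-locator-PID]            # step 5)
host1> gfsh -e "start locator --name=locator1 --port=10334"
host2> gfsh -e "start locator --name=locator2 --port=10334"
host1> gfsh -e "connect --locator=host1[10334]" -e "list members"   # step 7): only the locators are listed

Skipping the rm in step 4) avoids the empty membership view seen in step 7).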
From: Bruce Schuchardt [mailto:[email protected]]
Sent: Thursday, June 08, 2017 20:06
To: [email protected]
Subject: Re: What functionality do we lose, if we delete locator*view.dat before starting a locator

The split-brain issue is easily reproduced in LocatorDUnitTest.testStartTwoLocators by duplicating the last line in the method:

  public void testStartTwoLocators() throws Exception {
    disconnectAllFromDS();
    Host host = Host.getHost(0);
    VM loc1 = host.getVM(1);
    VM loc2 = host.getVM(2);
    int ports[] = AvailablePortHelper.getRandomAvailableTCPPorts(2);
    final int port1 = ports[0];
    this.port1 = port1;
    final int port2 = ports[1];
    this.port2 = port2; // for cleanup in tearDown2
    DistributedTestUtils.deleteLocatorStateFile(port1);
    DistributedTestUtils.deleteLocatorStateFile(port2);
    final String host0 = NetworkUtils.getServerHostName(host);
    final String locators = host0 + "[" + port1 + "]," + host0 + "[" + port2 + "]";
    final Properties properties = new Properties();
    properties.put(MCAST_PORT, "0");
    properties.put(LOCATORS, locators);
    properties.put(ENABLE_NETWORK_PARTITION_DETECTION, "false");
    properties.put(DISABLE_AUTO_RECONNECT, "true");
    properties.put(MEMBER_TIMEOUT, "2000");
    properties.put(LOG_LEVEL, LogWriterUtils.getDUnitLogLevel());
    properties.put(ENABLE_CLUSTER_CONFIGURATION, "false");
    addDSProps(properties);
    startVerifyAndStopLocator(loc1, loc2, port1, port2, properties);
    startVerifyAndStopLocator(loc1, loc2, port1, port2, properties);
  }

It fails every time on the second startVerifyAndStopLocator invocation. The fix for this is pretty simple and I'll try to get it into the upcoming 1.2 release. Then you won't have to delete the locator view.dat files or stagger startup anymore.

I would really like people to not get into the habit of deleting these files. If you accidentally delete them while servers are still up you will be forced to do a complete shutdown.

On 6/8/17 9:45 AM, Udo Kohlmeyer wrote:

Dharam,

Thank you for testing this out as well. Using Anton's guidance, I've managed to reproduce the issue by restarting the two locators within 1s of each other (try for sub-second if possible). Anton did describe that he did not see this behavior when the gap between restarting the two locators was more than 2s.

--Udo

On 6/8/17 09:29, Dharam Thacker wrote:

Hi Anton,

I also tried to reproduce your scenario on my local Ubuntu machine with Geode 1.1.1, but I was able to restart the cluster safely as explained below.

host1> start locator1
host2> start locator2
host1> start server1
host2> start server2
host1> stop server1
host2> stop server2
host1> stop locator1
host2> stop locator2

Verify that all members shut down well...

host1> start locator2   [even though I terminated this last in the sequence above, I am starting it as the first member]
host2> start locator1

Of course, the start of locator2 gave me the same warning as I have highlighted in red below. I then waited more than 10s before starting the second locator (locator1 had been stopped earlier than locator2). But as soon as locator1 started, locator2 detected it and started up the cluster configuration service. The cluster was re-formed after that. Logs below for verification:

[info 2017/06/08 21:47:10.911 IST locator2 <Pooled Message Processor 1> tid=0x2d] Region /_ConfigurationRegion has potentially stale data. It is waiting for another member to recover the latest data.
My persistent id:
  DiskStore ID: a267d876-40c8-4c85-848a-5a397adb5e5b
  Name: locator2
  Location: /192.168.1.12:/home/dharam/Downloads/apache-geode/locator2/ConfigDiskDir_locator2
Members with potentially new data:
[
  DiskStore ID: 39d28da8-6b2c-414c-9608-3550219b624d
  Name: locator1
  Location: /192.168.1.12:/home/dharam/Downloads/apache-geode/locator1/ConfigDiskDir_locator1
]
Use the "gfsh show missing-disk-stores" command to see all disk stores that are being waited on by other members.

[warning 2017/06/08 21:47:45.606 IST locator2 <WAN Locator Discovery Thread> tid=0x2f] Locator discovery task could not exchange locator information 192.168.1.12[10335] with localhost[10334] after 6 retry attempts. Retrying in 10,000 ms.
[info 2017/06/08 21:48:02.886 IST locator2 <unicast receiver,dharam-ThinkPad-Edge-E431-1183> tid=0x1c] received join request from 192.168.1.12(locator1:10969:locator)<ec>:1025

[info 2017/06/08 21:48:03.187 IST locator2 <Geode Membership View Creator> tid=0x22] View Creator is processing 1 requests for the next membership view

[info 2017/06/08 21:48:03.188 IST locator2 <Geode Membership View Creator> tid=0x22] preparing new view View[192.168.1.12(locator2:10853:locator)<ec><v0>:1024|1] members: [192.168.1.12(locator2:10853:locator)<ec><v0>:1024, 192.168.1.12(locator1:10969:locator)<ec><v1>:1025] failure detection ports: 14001 42428

[info 2017/06/08 21:48:03.221 IST locator2 <Geode Membership View Creator> tid=0x22] finished waiting for responses to view preparation

[info 2017/06/08 21:48:03.221 IST locator2 <Geode Membership View Creator> tid=0x22] received new view: View[192.168.1.12(locator2:10853:locator)<ec><v0>:1024|1] members: [192.168.1.12(locator2:10853:locator)<ec><v0>:1024, 192.168.1.12(locator1:10969:locator)<ec><v1>:1025] old view is: View[192.168.1.12(locator2:10853:locator)<ec><v0>:1024|0] members: [192.168.1.12(locator2:10853:locator)<ec><v0>:1024]

[info 2017/06/08 21:48:03.222 IST locator2 <Geode Membership View Creator> tid=0x22] Peer locator received new membership view: View[192.168.1.12(locator2:10853:locator)<ec><v0>:1024|1] members: [192.168.1.12(locator2:10853:locator)<ec><v0>:1024, 192.168.1.12(locator1:10969:locator)<ec><v1>:1025]

[info 2017/06/08 21:48:03.228 IST locator2 <Geode Membership View Creator> tid=0x22] sending new view View[192.168.1.12(locator2:10853:locator)<ec><v0>:1024|1] members: [192.168.1.12(locator2:10853:locator)<ec><v0>:1024, 192.168.1.12(locator1:10969:locator)<ec><v1>:1025] failure detection ports: 14001 42428

[info 2017/06/08 21:48:03.232 IST locator2 <View Message Processor> tid=0x48] Membership: Processing addition < 192.168.1.12(locator1:10969:locator)<ec><v1>:1025 >

[info 2017/06/08 21:48:03.233 IST locator2 <View Message Processor> tid=0x48] Admitting member <192.168.1.12(locator1:10969:locator)<ec><v1>:1025>. Now there are 2 non-admin member(s).

[info 2017/06/08 21:48:03.242 IST locator2 <pool-3-thread-1> tid=0x4a] Initializing region _monitoringRegion_192.168.1.12<v1>1025

[info 2017/06/08 21:48:03.275 IST locator2 <Pooled High Priority Message Processor 1> tid=0x4e] Member 192.168.1.12(locator1:10969:locator)<ec><v1>:1025 is equivalent or in the same redundancy zone.

[info 2017/06/08 21:48:03.326 IST locator2 <pool-3-thread-1> tid=0x4a] Initialization of region _monitoringRegion_192.168.1.12<v1>1025 completed

[info 2017/06/08 21:48:03.336 IST locator2 <pool-3-thread-1> tid=0x4a] Initializing region _notificationRegion_192.168.1.12<v1>1025

[info 2017/06/08 21:48:03.338 IST locator2 <pool-3-thread-1> tid=0x4a] Initialization of region _notificationRegion_192.168.1.12<v1>1025 completed

[info 2017/06/08 21:48:04.611 IST locator2 <Pooled Message Processor 1> tid=0x2d] Region _ConfigurationRegion requesting initial image from 192.168.1.12(locator1:10969:locator)<ec><v1>:1025

[info 2017/06/08 21:48:04.615 IST locator2 <Pooled Message Processor 1> tid=0x2d] _ConfigurationRegion is done getting image from 192.168.1.12(locator1:10969:locator)<ec><v1>:1025. isDeltaGII is true
[info 2017/06/08 21:48:04.616 IST locator2 <Pooled Message Processor 1> tid=0x2d] Region _ConfigurationRegion initialized persistent id: /192.168.1.12:/home/dharam/Downloads/apache-geode/locator2/ConfigDiskDir_locator2 created at timestamp 1496938615755 version 0 diskStoreId a267d87640c84c85-848a5a397adb5e5b name locator2 with data from 192.168.1.12(locator1:10969:locator)<ec><v1>:1025.

[info 2017/06/08 21:48:04.617 IST locator2 <Pooled Message Processor 1> tid=0x2d] Initialization of region _ConfigurationRegion completed

[info 2017/06/08 21:48:04.637 IST locator2 <Pooled Message Processor 1> tid=0x2d] ConfigRequestHandler installed

[info 2017/06/08 21:48:04.637 IST locator2 <Pooled Message Processor 1> tid=0x2d] Cluster configuration service start up completed successfully and is now running

....

[info 2017/06/08 21:48:05.692 IST locator2 <WAN Locator Discovery Thread> tid=0x2f] Locator discovery task exchanged locator information 192.168.1.12[10335] with localhost[10334]: {-1=[192.168.1.12[10335], 192.168.1.12[10334]]}.

Thanks,
Dharam

- Dharam Thacker

On Thu, Jun 8, 2017 at 9:25 PM, Bruce Schuchardt <[email protected]> wrote:

The locator view file exists to allow locators to be bounced without shutting down the rest of the cluster. On startup a locator will try to find the current membership coordinator of the cluster from an existing locator and join the system using that information. If there is no existing locator that knows who the coordinator might be, then the new locator will try to find the coordinator using the membership "view" that is stored in the view file. If there is no view file, the locator will not be able to join the existing cluster.

If you've done a full shutdown of the cluster, it is safe to delete the locator*view.dat files. When there is no .dat file, the locators will use a concurrent-startup algorithm to form a unified system.

On 6/8/17 7:48 AM, Anton Mironenko wrote:

Hello,

We found out that deleting "locator*view.dat" before starting a locator fixes the first part of the issue https://issues.apache.org/jira/browse/GEODE-3003 "Geode doesn't start after cluster restart when using cluster-configuration": "The second start goes wrong: the locator on the first host always doesn't join the rest of the cluster, with the error in the locator log: 'Region /_ConfigurationRegion has potentially stale data. It is waiting for another member to recover the latest data.'"

What is the side effect of deleting the file "locator0/locator*view.dat"? What functionality do we lose? A use case with some example would be great.

Anton Mironenko
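To illustrate Bruce's point above about when it is safe to delete the view files, here is a sketch. It is not part of the original thread: the member names, the port 10334, and the working-directory paths are illustrative.

# Safe: the view files are deleted only after a FULL shutdown of the cluster.
host1> gfsh -e "connect --locator=host1[10334]" -e "shutdown --include-locators=true"
host1> rm locator1/locator*view.dat
host2> rm locator2/locator*view.dat
host1> gfsh -e "start locator --name=locator1 --port=10334"
host2> gfsh -e "start locator --name=locator2 --port=10334"
# With no view file, the locators use the concurrent-startup algorithm to form one unified system.

# Unsafe: deleting the view files while the servers are still running. A restarting locator that
# finds no other running locator then has no stored view to fall back on, cannot rejoin the
# existing cluster, and a complete shutdown is forced.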
