[GitHub] [zookeeper] symat edited a comment on issue #1048: ZOOKEEPER-3188: Improve resilience to network

2019-11-13 Thread GitBox
symat edited a comment on issue #1048: ZOOKEEPER-3188: Improve resilience to 
network
URL: https://github.com/apache/zookeeper/pull/1048#issuecomment-553575587
 
 
   I was trying to reproduce with docker the behaviour mentioned by @anmolnar 
above. So far I haven't succeeded, but I found another bug: 
   
   When I disabled the 'actively used' ethernet interface of the current 
leader, the follower noticed. During the new leader election it tried to 
reconnect in parallel to all the registered election addresses of the old 
leader. Waiting for the connection attempt to fail on the unreachable address 
caused a timeout in the connection to the reachable address. This was a flaky 
situation, usually causing 2-3 subsequent leader elections, but after 10-15 
seconds the quorum became stable.
   
   I solved this by filtering the reachable hosts before trying to establish 
connections to the Leader in the Learners.
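
   A minimal sketch of the filtering idea, not the code from the patch. The 
class `ReachableAddressFilter` and its methods are hypothetical names, and the 
probe based on `InetAddress.isReachable` is only one possible reachability 
check (it may behave differently depending on OS permissions and firewalls):

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical helper, for illustration only (not a ZooKeeper class).
final class ReachableAddressFilter {

    // Keeps only the candidates accepted by the probe, preserving their order.
    static List<InetSocketAddress> filterReachable(
            List<InetSocketAddress> candidates,
            Predicate<InetSocketAddress> probe) {
        List<InetSocketAddress> reachable = new ArrayList<>();
        for (InetSocketAddress candidate : candidates) {
            if (probe.test(candidate)) {
                reachable.add(candidate);
            }
        }
        return reachable;
    }

    // One possible probe: treats unresolved hosts and I/O errors as unreachable.
    static Predicate<InetSocketAddress> hostProbe(int timeoutMillis) {
        return address -> {
            InetAddress host = address.getAddress();
            if (host == null) {
                return false; // host name could not be resolved
            }
            try {
                return host.isReachable(timeoutMillis);
            } catch (IOException e) {
                return false;
            }
        };
    }

    public static void main(String[] args) {
        // Example addresses are assumptions; 3888 is just a sample port.
        List<InetSocketAddress> electionAddresses = Arrays.asList(
                new InetSocketAddress("127.0.0.1", 3888),
                InetSocketAddress.createUnresolved("unreachable.invalid", 3888));
        // Only addresses that pass the probe would be used for connecting.
        System.out.println(filterReachable(electionAddresses, hostProbe(500)));
    }
}
```

   Passing the probe as a `Predicate` keeps the filtering separate from the 
actual network check, so a test can plug in a fake probe.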


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [zookeeper] symat edited a comment on issue #1048: ZOOKEEPER-3188: Improve resilience to network

2019-10-09 Thread GitBox
symat edited a comment on issue #1048: ZOOKEEPER-3188: Improve resilience to 
network
URL: https://github.com/apache/zookeeper/pull/1048#issuecomment-540004845
 
 
   In my last commit I uploaded a fix for the BindException issue @anmolnar 
found (I implemented his proposal in the Leader's constructor). I also modified 
a unit test to cover this case as well.
   
   We did some manual testing on the latest version. The patch is working: now 
we can pull out and plug back the different cables / wifi and the quorum 
survives. However, the recovery is a bit long (around 1 minute). The recovery 
takes much less time (~10-15 seconds) when executing the same tests with linux 
in docker with virtual networks and interfaces (using the same config). It 
looks like in the docker/linux test the socket in 
`QuorumCnxManager.RecvWorker` dies much quicker with a `SocketException: Socket 
closed`, while in the same test with real mac notebooks the same socket dies 
later due to `SocketException: Operation timed out (Read failed)`.
   
   We think we have found a way to detect the failure quicker in the second 
case, but that still needs to be tested. I will work on this later (although I 
think this might have a lower priority; we can even close this PR without such 
an optimization).
   
   I think the upgrade / TLS / kerberos related manual tests are more important 
at the moment.




[GitHub] [zookeeper] symat edited a comment on issue #1048: ZOOKEEPER-3188: Improve resilience to network

2019-08-12 Thread GitBox
symat edited a comment on issue #1048: ZOOKEEPER-3188: Improve resilience to 
network
URL: https://github.com/apache/zookeeper/pull/1048#issuecomment-520381529
 
 
   I created a simple docker config with multiple virtual networks and managed 
to test the situation when some of the containers lose access to one of 
the virtual networks. I uploaded the docker related scripts / configs here: 
https://github.com/symat/zookeeper-docker-test
   
   During these manual tests I found some situations where the previous patch 
didn't work. 
   - The InitialMessage sent during the leader election contained only a single 
election address. If this address was not reachable by the recipient of the 
InitialMessage, then the connection was never successfully initiated. I changed 
the format of the InitialMessage to send all the election addresses, and the 
other side will use only the one which is reachable.
   - When an existing tcp connection to an electionAddress is broken, the 
server will try to send notification messages re-using the existing SendWorker 
threads. I would assume that the SendWorker.send() method should die when it 
tries to flush the output stream on a socket whose destination is already 
unreachable. However, for some reason it doesn't die (this could be 
investigated further). Anyway, I added a small piece of logic to the connection 
initiation to verify whether the existing destination in SendWorker is still 
reachable. If the destination is unreachable in the SendWorker thread, then we 
can gracefully finish it, and during the next connection attempt we will choose 
a destination that is reachable. (I fixed this part in a second commit.)
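
   To illustrate the first point, here is a rough sketch of a message that 
carries several election addresses and a recipient that picks the first one it 
can reach. This is not the real InitialMessage wire format: the class 
`MultiAddressMessage`, the `host:port|host:port` encoding and the example 
addresses are all assumptions made for the illustration.

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

// Hypothetical illustration; the actual InitialMessage format differs.
final class MultiAddressMessage {

    // Joins all election addresses as "host:port|host:port|...".
    static String encode(List<InetSocketAddress> addresses) {
        List<String> parts = new ArrayList<>();
        for (InetSocketAddress address : addresses) {
            parts.add(address.getHostString() + ":" + address.getPort());
        }
        return String.join("|", parts);
    }

    // Parses the payload back into unresolved socket addresses.
    static List<InetSocketAddress> decode(String payload) {
        List<InetSocketAddress> addresses = new ArrayList<>();
        for (String part : payload.split("\\|")) {
            int colon = part.lastIndexOf(':');
            if (colon <= 0) {
                throw new IllegalArgumentException("Malformed address: " + part);
            }
            String host = part.substring(0, colon);
            int port = Integer.parseInt(part.substring(colon + 1));
            addresses.add(InetSocketAddress.createUnresolved(host, port));
        }
        return addresses;
    }

    // The recipient uses the first address its probe reports as reachable.
    static Optional<InetSocketAddress> firstReachable(
            List<InetSocketAddress> addresses,
            Predicate<InetSocketAddress> probe) {
        return addresses.stream().filter(probe).findFirst();
    }

    public static void main(String[] args) {
        List<InetSocketAddress> sent = Arrays.asList(
                InetSocketAddress.createUnresolved("10.0.0.5", 3888),
                InetSocketAddress.createUnresolved("10.0.1.5", 3888));
        String payload = encode(sent);

        // Stub probe: pretend that only the 10.0.1.x network is reachable.
        Predicate<InetSocketAddress> probe =
                address -> address.getHostString().startsWith("10.0.1.");

        firstReachable(decode(payload), probe)
                .ifPresent(a -> System.out.println(a.getHostString() + ":" + a.getPort()));
    }
}
```

   With a single address in the message, the recipient has no fallback when 
that address is on a network it has lost; sending the whole list lets each 
recipient choose independently.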
   
   With these modifications I was able to test the following situation 
successfully:
   
   1. starting a zookeeper ensemble, each server listening on two addresses 
(on two separate virtual networks)
   2. waiting for the initial leader election to happen
   3. removing the current leader from the virtual network that is used by the 
others as destination
   4. it took a few seconds until all the servers recognised the loss of 
connections, and in 5-10 seconds the connections were re-established and the 
new leader election finished
   
   I will think about how to unit test these features (or should we create some 
docker-based automated integration test?).
   
   In the meantime I would appreciate a deep review of these changes, as I am 
quite new to the Zookeeper code...



