[jira] [Comment Edited] (IGNITE-13465) Ignite cluster falls apart if two nodes segmented sequentially

2020-09-22 Thread Vladimir Steshin (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-13465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17199980#comment-17199980
 ] 

Vladimir Steshin edited comment on IGNITE-13465 at 9/22/20, 7:43 PM:
-

The problem is that connectionRecoveryTimeout can be wholly spent on a single next 
node. If two nodes fail in a row at the same time, the previous nodes may become 
segmented one by one. 

I suggest slicing connectionRecoveryTimeout in order to traverse several next 
nodes in an attempt to reconnect to the ring. 
To avoid too-small per-node timeouts, we should introduce a constant like 
100 ms as the minimal timeout for an attempt to connect to one next node in the ring.
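The slicing idea above can be sketched as follows. This is an illustrative sketch only: the class, method, and constant names are hypothetical, not the actual TcpDiscoverySpi implementation.

```java
/** Sketch of slicing connectionRecoveryTimeout across several next nodes.
    All names here are illustrative, not actual Ignite code. */
public class RecoverySlicing {
    /** Hypothetical minimal timeout per one next node, in milliseconds. */
    static final long MIN_PER_NODE_TIMEOUT = 100;

    /** Splits the whole recovery timeout between {@code nextNodes} candidates,
        never going below the minimal per-node timeout. */
    static long perNodeTimeout(long connRecoveryTimeout, int nextNodes) {
        return Math.max(MIN_PER_NODE_TIMEOUT, connRecoveryTimeout / nextNodes);
    }

    public static void main(String[] args) {
        // A 2000 ms recovery budget over 4 next nodes gives 500 ms each.
        System.out.println(perNodeTimeout(2000, 4));

        // 300 ms over 5 nodes would give 60 ms; the floor keeps it at 100 ms.
        System.out.println(perNodeTimeout(300, 5));
    }
}
```

With the floor in place, a node always gets a meaningful window per attempt, while still being able to try more than one next node before the recovery budget runs out.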



was (Author: vladsz83):
The problem is that connectionRecoveryTimeout can be wholly spent on a single next 
node. If two nodes fail at the same time, the previous nodes may become segmented 
one by one. 

I suggest slicing connectionRecoveryTimeout in order to traverse several next 
nodes in an attempt to reconnect to the ring. 
To avoid too-small per-node timeouts, I suggest introducing a constant like 
100 ms as the minimal timeout for an attempt to connect to one next node in the ring.


> Ignite cluster falls apart if two nodes segmented sequentially
> --
>
> Key: IGNITE-13465
> URL: https://issues.apache.org/jira/browse/IGNITE-13465
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 2.9
>Reporter: Aleksey Plekhanov
>Assignee: Vladimir Steshin
>Priority: Blocker
> Fix For: 2.9
>
> Attachments: GridSequentionNodesFailureTest.java
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> After ticket IGNITE-13134, sequential node segmentation leads to segmentation 
> of other nodes in the cluster.
> Reproducer attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (IGNITE-13465) Ignite cluster falls apart if two nodes segmented sequentially

2020-09-22 Thread Vladimir Steshin (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-13465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17199980#comment-17199980
 ] 

Vladimir Steshin edited comment on IGNITE-13465 at 9/22/20, 11:24 AM:
--

The problem is that connectionRecoveryTimeout can be wholly spent on a single next 
node. If two nodes fail at the same time, the previous nodes may become segmented 
one by one. 

I suggest slicing connectionRecoveryTimeout in order to traverse several next 
nodes in an attempt to reconnect to the ring. 
To avoid too-small per-node timeouts, I suggest introducing a constant like 
100 ms as the minimal timeout for an attempt to connect to one next node in the ring.



was (Author: vladsz83):
The problem is that connectionRecoveryTimeout can be wholly spent on a single next 
node. If two nodes fail at the same time, the previous nodes may become segmented 
one by one. 

I suggest slicing connectionRecoveryTimeout in order to traverse several next 
nodes in an attempt to reconnect to the ring. We should consider the maximum 
reasonable number of nodes to reconnect to as `servers/2 + 1`. If we cannot 
connect to half of the ring, this can be considered a major malfunction of the 
network, and segmentation.
To avoid too-small per-node timeouts, I suggest introducing a constant like 
100 ms as the minimal timeout for an attempt to connect to one next node in the ring.


> Ignite cluster falls apart if two nodes segmented sequentially
> --
>
> Key: IGNITE-13465
> URL: https://issues.apache.org/jira/browse/IGNITE-13465
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 2.9
>Reporter: Aleksey Plekhanov
>Assignee: Vladimir Steshin
>Priority: Blocker
> Fix For: 2.9
>
> Attachments: GridSequentionNodesFailureTest.java
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> After ticket IGNITE-13134, sequential node segmentation leads to segmentation 
> of other nodes in the cluster.
> Reproducer attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (IGNITE-13465) Ignite cluster falls apart if two nodes segmented sequentially

2020-09-22 Thread Vladimir Steshin (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-13465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17199980#comment-17199980
 ] 

Vladimir Steshin commented on IGNITE-13465:
---

The problem is that connectionRecoveryTimeout can be wholly spent on a single next 
node. If two nodes fail at the same time, the previous nodes may become segmented 
one by one. 

I suggest slicing connectionRecoveryTimeout in order to traverse several next 
nodes in an attempt to reconnect to the ring. We should consider the maximum 
reasonable number of nodes to reconnect to as `servers/2 + 1`. If we cannot 
connect to half of the ring, this can be considered a major malfunction of the 
network, and segmentation.
To avoid too-small per-node timeouts, I suggest introducing a constant like 
100 ms as the minimal timeout for an attempt to connect to one next node in the ring.
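The `servers/2 + 1` cap combined with the minimal per-node timeout can be sketched as below; the names are hypothetical, not the actual Ignite code.

```java
/** Sketch: cap the number of next nodes to try at servers/2 + 1, then slice
    the recovery timeout between them. Illustrative names only. */
public class RecoveryAttempts {
    /** Hypothetical minimal per-node timeout, in milliseconds. */
    static final long MIN_PER_NODE_TIMEOUT = 100;

    /** Maximum reasonable number of next nodes to try reconnecting to.
        Failing to reach half of the ring suggests segmentation. */
    static int maxNodesToTry(int servers) {
        return servers / 2 + 1;
    }

    /** Per-node slice of the recovery budget, floored at the minimum. */
    static long perNodeTimeout(long connRecoveryTimeout, int servers) {
        return Math.max(MIN_PER_NODE_TIMEOUT,
            connRecoveryTimeout / maxNodesToTry(servers));
    }

    public static void main(String[] args) {
        // An 8-server ring: try at most 8/2 + 1 = 5 next nodes.
        System.out.println(maxNodesToTry(8));

        // A 2000 ms budget over those 5 attempts gives 400 ms each.
        System.out.println(perNodeTimeout(2000, 8));
    }
}
```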


> Ignite cluster falls apart if two nodes segmented sequentially
> --
>
> Key: IGNITE-13465
> URL: https://issues.apache.org/jira/browse/IGNITE-13465
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 2.9
>Reporter: Aleksey Plekhanov
>Assignee: Vladimir Steshin
>Priority: Blocker
> Fix For: 2.9
>
> Attachments: GridSequentionNodesFailureTest.java
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> After ticket IGNITE-13134, sequential node segmentation leads to segmentation 
> of other nodes in the cluster.
> Reproducer attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (IGNITE-13465) Ignite cluster falls apart if two nodes segmented sequentially

2020-09-21 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin reassigned IGNITE-13465:
-

Assignee: Vladimir Steshin

> Ignite cluster falls apart if two nodes segmented sequentially
> --
>
> Key: IGNITE-13465
> URL: https://issues.apache.org/jira/browse/IGNITE-13465
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 2.9
>Reporter: Aleksey Plekhanov
>Assignee: Vladimir Steshin
>Priority: Blocker
> Fix For: 2.9
>
> Attachments: GridSequentionNodesFailureTest.java
>
>
> After ticket IGNITE-13134, sequential node segmentation leads to segmentation 
> of other nodes in the cluster.
> Reproducer attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (IGNITE-13040) Remove unused parameter from TcpDiscoverySpi.writeToSocket()

2020-08-18 Thread Vladimir Steshin (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-13040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179492#comment-17179492
 ] 

Vladimir Steshin commented on IGNITE-13040:
---

[~Kurinov], you still need a visa for this ticket. Have all the tests passed?

> Remove unused parameter from TcpDiscoverySpi.writeToSocket()
> 
>
> Key: IGNITE-13040
> URL: https://issues.apache.org/jira/browse/IGNITE-13040
> Project: Ignite
>  Issue Type: Improvement
> Environment:  
>Reporter: Vladimir Steshin
>Assignee: Aleksey Kurinov
>Priority: Trivial
>  Labels: newbie
>
> Unused parameter
> {code:java}
> TcpDiscoveryAbstractMessage msg{code}
> should be removed from
> {code:java}
> TcpDiscoverySpi.writeToSocket(Socket sock, TcpDiscoveryAbstractMessage msg, 
> byte[] data, long timeout){code}
> This method seems to send raw data, not a message.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (IGNITE-13040) Remove unused parameter from TcpDiscoverySpi.writeToSocket()

2020-08-13 Thread Vladimir Steshin (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-13040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17177134#comment-17177134
 ] 

Vladimir Steshin edited comment on IGNITE-13040 at 8/13/20, 5:49 PM:
-

[~Kurinov], you can check certain tests locally. But you should launch them on 
Linux/Mac. As an example,
{code:java}
TcpClientDiscoverySpiSelfTest.testReconnectSegmentedAfterJoinTimeoutNetworkError()
{code} fails in this PR and has to be fixed after the code change.

Also, you can try to update master, merge it into the ticket's branch and 
re-run the tests. Then keep re-running blockers. It should help. You may have 
to re-run blockers up to 3-5 times. If they keep failing, try checking the 
failed tests locally. They might have been broken by the PR. 

Btw, you can find me in Slack: Vladimir St.


was (Author: vladsz83):
[~Kurinov], you can check certain tests locally. But you should launch them on 
Linux/Mac. As an example,
{code:java}
TcpClientDiscoverySpiSelfTest.testReconnectSegmentedAfterJoinTimeoutNetworkError()
{code} fails in this PR and has to be fixed after the code change.

Also, you can try to update master, merge it into the ticket's branch and 
re-run the tests. Then keep re-running blockers. It should help. You may have 
to re-run blockers up to 3-5 times. If they keep failing, try checking the 
failed tests locally. They might have been broken by the PR. 


> Remove unused parameter from TcpDiscoverySpi.writeToSocket()
> 
>
> Key: IGNITE-13040
> URL: https://issues.apache.org/jira/browse/IGNITE-13040
> Project: Ignite
>  Issue Type: Improvement
> Environment:  
>Reporter: Vladimir Steshin
>Assignee: Aleksey Kurinov
>Priority: Trivial
>  Labels: newbie
>
> Unused parameter
> {code:java}
> TcpDiscoveryAbstractMessage msg{code}
> should be removed from
> {code:java}
> TcpDiscoverySpi.writeToSocket(Socket sock, TcpDiscoveryAbstractMessage msg, 
> byte[] data, long timeout){code}
> This method seems to send raw data, not a message.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (IGNITE-13040) Remove unused parameter from TcpDiscoverySpi.writeToSocket()

2020-08-13 Thread Vladimir Steshin (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-13040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17177134#comment-17177134
 ] 

Vladimir Steshin edited comment on IGNITE-13040 at 8/13/20, 5:47 PM:
-

[~Kurinov], you can check certain tests locally. But you should launch them on 
Linux/Mac. As an example,
{code:java}
TcpClientDiscoverySpiSelfTest.testReconnectSegmentedAfterJoinTimeoutNetworkError()
{code} fails in this PR and has to be fixed after the code change.

Also, you can try to update master, merge it into the ticket's branch and 
re-run the tests. Then keep re-running blockers. It should help. You may have 
to re-run blockers up to 3-5 times. If they keep failing, try checking the 
failed tests locally. They might have been broken by the PR. 



was (Author: vladsz83):
[~Kurinov], try to update master, merge it into the ticket's branch and re-run 
the tests. Then keep re-running blockers. It should help. You may have to 
re-run blockers up to about 5 times.
Also, you can check certain tests locally. But you should launch them on 
Linux/Mac.

> Remove unused parameter from TcpDiscoverySpi.writeToSocket()
> 
>
> Key: IGNITE-13040
> URL: https://issues.apache.org/jira/browse/IGNITE-13040
> Project: Ignite
>  Issue Type: Improvement
> Environment:  
>Reporter: Vladimir Steshin
>Assignee: Aleksey Kurinov
>Priority: Trivial
>  Labels: newbie
>
> Unused parameter
> {code:java}
> TcpDiscoveryAbstractMessage msg{code}
> should be removed from
> {code:java}
> TcpDiscoverySpi.writeToSocket(Socket sock, TcpDiscoveryAbstractMessage msg, 
> byte[] data, long timeout){code}
> This method seems to send raw data, not a message.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (IGNITE-13040) Remove unused parameter from TcpDiscoverySpi.writeToSocket()

2020-08-13 Thread Vladimir Steshin (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-13040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17177134#comment-17177134
 ] 

Vladimir Steshin edited comment on IGNITE-13040 at 8/13/20, 5:37 PM:
-

[~Kurinov], try to update master, merge it into the ticket's branch and re-run 
the tests. Then keep re-running blockers. It should help. You may have to 
re-run blockers up to about 5 times.
Also, you can check certain tests locally. But you should launch them on 
Linux/Mac.


was (Author: vladsz83):
[~Kurinov], try to update master, merge it into the ticket's branch and re-run 
the tests. Then keep re-running blockers. It should help. You may have to 
re-run blockers up to about 5 times.

> Remove unused parameter from TcpDiscoverySpi.writeToSocket()
> 
>
> Key: IGNITE-13040
> URL: https://issues.apache.org/jira/browse/IGNITE-13040
> Project: Ignite
>  Issue Type: Improvement
> Environment:  
>Reporter: Vladimir Steshin
>Assignee: Aleksey Kurinov
>Priority: Trivial
>  Labels: newbie
>
> Unused parameter
> {code:java}
> TcpDiscoveryAbstractMessage msg{code}
> should be removed from
> {code:java}
> TcpDiscoverySpi.writeToSocket(Socket sock, TcpDiscoveryAbstractMessage msg, 
> byte[] data, long timeout){code}
> This method seems to send raw data, not a message.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (IGNITE-13040) Remove unused parameter from TcpDiscoverySpi.writeToSocket()

2020-08-13 Thread Vladimir Steshin (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-13040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17177134#comment-17177134
 ] 

Vladimir Steshin commented on IGNITE-13040:
---

[~Kurinov], try to update master, merge it into the ticket's branch and re-run 
the tests. Then keep re-running blockers. It should help. You may have to 
re-run blockers up to about 5 times.

> Remove unused parameter from TcpDiscoverySpi.writeToSocket()
> 
>
> Key: IGNITE-13040
> URL: https://issues.apache.org/jira/browse/IGNITE-13040
> Project: Ignite
>  Issue Type: Improvement
> Environment:  
>Reporter: Vladimir Steshin
>Assignee: Aleksey Kurinov
>Priority: Trivial
>  Labels: newbie
>
> Unused parameter
> {code:java}
> TcpDiscoveryAbstractMessage msg{code}
> should be removed from
> {code:java}
> TcpDiscoverySpi.writeToSocket(Socket sock, TcpDiscoveryAbstractMessage msg, 
> byte[] data, long timeout){code}
> This method seems to send raw data, not a message.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (IGNITE-13012) Fix failure detection timeout. Simplify node ping routine.

2020-07-31 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13012:
--
Release Note: Fixed processing of the failure detection timeout in 
TcpDiscoverySpi. If a node fails to send a message or ping, it now drops the 
current connection strictly within this timeout and begins to establish a new 
connection much faster.  (was: Fixed processing of failure detection timeout in 
TcpDiscoverySpi. If a node fails to send a message or ping, now it drops 
current connection within this timeout and begins to establish new connection 
much faster.)

> Fix failure detection timeout. Simplify node ping routine.
> --
>
> Key: IGNITE-13012
> URL: https://issues.apache.org/jira/browse/IGNITE-13012
> Project: Ignite
>  Issue Type: Improvement
>Affects Versions: 2.8.1
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Major
>  Labels: iep-45
> Fix For: 2.9
>
> Attachments: IGNITE-13012-patch.patch
>
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> Connection failure may not be detected within 
> IgniteConfiguration.failureDetectionTimeout. The actual worst-case delay is: 
> ServerImpl.CON_CHECK_INTERVAL + IgniteConfiguration.failureDetectionTimeout. 
> The node ping routine is duplicated.
> We should fix:
> 1. The failure detection timeout should take into account the last sent 
> message. The current ping is bound to its own time:
> {code:java}ServerImpl.RingMessageWorker.lastTimeConnCheckMsgSent{code}
> This is weird because any discovery message checks the connection. 
> 2. Make the connection check interval depend on the failure detection timeout 
> (FDT). The current value is a constant:
> {code:java}static int ServerImpl.CON_CHECK_INTERVAL = 500{code}
> 3. Remove the additional, quickened connection checking. Once we do fix 1, 
> this will become even more useless.
> Although TCP discovery has a period of connection checking, it may send a ping 
> before this period expires. This premature ping also relies on the time of 
> any received message for some reason. 
> 4. Do not worry the user with “Node seems disconnected” when everything is OK. 
> Once we do fixes 1 and 3, this will become even more useless. 
> A node may log on INFO: “Local node seems to be disconnected from topology …” 
> whereas it is not actually disconnected at all.
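The worst-case delay described in the issue can be illustrated with a small calculation. The 500 ms constant is quoted from the description; the 10 000 ms timeout and all names below are illustrative examples, not Ignite code.

```java
/** Sketch of the worst-case connection-failure detection delay described in
    the issue: the constant check interval stacks on top of the configured
    failure detection timeout. Names and values are illustrative. */
public class WorstCaseDelay {
    /** The constant check interval cited in the issue description, ms. */
    static final long CON_CHECK_INTERVAL = 500;

    static long worstCaseDetectionDelay(long failureDetectionTimeout) {
        // The last ping may have gone out just before the link died, so up to
        // a full check interval passes before the next ping is even attempted;
        // only then does the failure detection timeout start to count down.
        return CON_CHECK_INTERVAL + failureDetectionTimeout;
    }

    public static void main(String[] args) {
        // With an example 10 000 ms failure detection timeout, the failure
        // may go unnoticed for up to 10 500 ms.
        System.out.println(worstCaseDetectionDelay(10_000));
    }
}
```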



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (IGNITE-13012) Fix failure detection timeout. Simplify node ping routine.

2020-07-31 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13012:
--
Release Note: Fixed processing of the failure detection timeout in 
TcpDiscoverySpi. If a node fails to send a message or ping, it now drops the 
current connection within this timeout and begins to establish a new connection 
much faster.  (was: Fixed processing of failure detection timeout in 
TcpDiscoverySpi. If a node fails to send a message or ping, it drops connection 
now within this timeout and begins to establish new connection much faster.)

> Fix failure detection timeout. Simplify node ping routine.
> --
>
> Key: IGNITE-13012
> URL: https://issues.apache.org/jira/browse/IGNITE-13012
> Project: Ignite
>  Issue Type: Improvement
>Affects Versions: 2.8.1
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Major
>  Labels: iep-45
> Fix For: 2.9
>
> Attachments: IGNITE-13012-patch.patch
>
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> Connection failure may not be detected within 
> IgniteConfiguration.failureDetectionTimeout. The actual worst-case delay is: 
> ServerImpl.CON_CHECK_INTERVAL + IgniteConfiguration.failureDetectionTimeout. 
> The node ping routine is duplicated.
> We should fix:
> 1. The failure detection timeout should take into account the last sent 
> message. The current ping is bound to its own time:
> {code:java}ServerImpl.RingMessageWorker.lastTimeConnCheckMsgSent{code}
> This is weird because any discovery message checks the connection. 
> 2. Make the connection check interval depend on the failure detection timeout 
> (FDT). The current value is a constant:
> {code:java}static int ServerImpl.CON_CHECK_INTERVAL = 500{code}
> 3. Remove the additional, quickened connection checking. Once we do fix 1, 
> this will become even more useless.
> Although TCP discovery has a period of connection checking, it may send a ping 
> before this period expires. This premature ping also relies on the time of 
> any received message for some reason. 
> 4. Do not worry the user with “Node seems disconnected” when everything is OK. 
> Once we do fixes 1 and 3, this will become even more useless. 
> A node may log on INFO: “Local node seems to be disconnected from topology …” 
> whereas it is not actually disconnected at all.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (IGNITE-13012) Fix failure detection timeout. Simplify node ping routine.

2020-07-31 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13012:
--
Release Note: Fixed processing of the failure detection timeout in 
TcpDiscoverySpi. If a node fails to send a message or ping, it now drops the 
current connection strictly within this timeout and begins establishing a new 
connection much faster.  (was: Fixed processing of failure detection timeout in 
TcpDiscoverySpi. If a node fails to send a message or ping, now it drops 
current connection strictly within this timeout and begins to establish new 
connection much faster.)

> Fix failure detection timeout. Simplify node ping routine.
> --
>
> Key: IGNITE-13012
> URL: https://issues.apache.org/jira/browse/IGNITE-13012
> Project: Ignite
>  Issue Type: Improvement
>Affects Versions: 2.8.1
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Major
>  Labels: iep-45
> Fix For: 2.9
>
> Attachments: IGNITE-13012-patch.patch
>
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> Connection failure may not be detected within 
> IgniteConfiguration.failureDetectionTimeout. The actual worst-case delay is: 
> ServerImpl.CON_CHECK_INTERVAL + IgniteConfiguration.failureDetectionTimeout. 
> The node ping routine is duplicated.
> We should fix:
> 1. The failure detection timeout should take into account the last sent 
> message. The current ping is bound to its own time:
> {code:java}ServerImpl.RingMessageWorker.lastTimeConnCheckMsgSent{code}
> This is weird because any discovery message checks the connection. 
> 2. Make the connection check interval depend on the failure detection timeout 
> (FDT). The current value is a constant:
> {code:java}static int ServerImpl.CON_CHECK_INTERVAL = 500{code}
> 3. Remove the additional, quickened connection checking. Once we do fix 1, 
> this will become even more useless.
> Although TCP discovery has a period of connection checking, it may send a ping 
> before this period expires. This premature ping also relies on the time of 
> any received message for some reason. 
> 4. Do not worry the user with “Node seems disconnected” when everything is OK. 
> Once we do fixes 1 and 3, this will become even more useless. 
> A node may log on INFO: “Local node seems to be disconnected from topology …” 
> whereas it is not actually disconnected at all.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (IGNITE-13012) Fix failure detection timeout. Simplify node ping routine.

2020-07-31 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13012:
--
Release Note: Fixed processing of the failure detection timeout in 
TcpDiscoverySpi. If a node fails to send a message or ping, it now drops the 
connection within this timeout and begins to establish a new connection much faster.

> Fix failure detection timeout. Simplify node ping routine.
> --
>
> Key: IGNITE-13012
> URL: https://issues.apache.org/jira/browse/IGNITE-13012
> Project: Ignite
>  Issue Type: Improvement
>Affects Versions: 2.8.1
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Major
>  Labels: iep-45
> Fix For: 2.9
>
> Attachments: IGNITE-13012-patch.patch
>
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> Connection failure may not be detected within 
> IgniteConfiguration.failureDetectionTimeout. The actual worst-case delay is: 
> ServerImpl.CON_CHECK_INTERVAL + IgniteConfiguration.failureDetectionTimeout. 
> The node ping routine is duplicated.
> We should fix:
> 1. The failure detection timeout should take into account the last sent 
> message. The current ping is bound to its own time:
> {code:java}ServerImpl.RingMessageWorker.lastTimeConnCheckMsgSent{code}
> This is weird because any discovery message checks the connection. 
> 2. Make the connection check interval depend on the failure detection timeout 
> (FDT). The current value is a constant:
> {code:java}static int ServerImpl.CON_CHECK_INTERVAL = 500{code}
> 3. Remove the additional, quickened connection checking. Once we do fix 1, 
> this will become even more useless.
> Although TCP discovery has a period of connection checking, it may send a ping 
> before this period expires. This premature ping also relies on the time of 
> any received message for some reason. 
> 4. Do not worry the user with “Node seems disconnected” when everything is OK. 
> Once we do fixes 1 and 3, this will become even more useless. 
> A node may log on INFO: “Local node seems to be disconnected from topology …” 
> whereas it is not actually disconnected at all.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (IGNITE-13012) Fix failure detection timeout. Simplify node ping routine.

2020-07-31 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13012:
--
Ignite Flags: Release Notes Required  (was: Docs Required)

> Fix failure detection timeout. Simplify node ping routine.
> --
>
> Key: IGNITE-13012
> URL: https://issues.apache.org/jira/browse/IGNITE-13012
> Project: Ignite
>  Issue Type: Improvement
>Affects Versions: 2.8.1
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Major
>  Labels: iep-45
> Fix For: 2.9
>
> Attachments: IGNITE-13012-patch.patch
>
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> Connection failure may not be detected within 
> IgniteConfiguration.failureDetectionTimeout. The actual worst-case delay is: 
> ServerImpl.CON_CHECK_INTERVAL + IgniteConfiguration.failureDetectionTimeout. 
> The node ping routine is duplicated.
> We should fix:
> 1. The failure detection timeout should take into account the last sent 
> message. The current ping is bound to its own time:
> {code:java}ServerImpl.RingMessageWorker.lastTimeConnCheckMsgSent{code}
> This is weird because any discovery message checks the connection. 
> 2. Make the connection check interval depend on the failure detection timeout 
> (FDT). The current value is a constant:
> {code:java}static int ServerImpl.CON_CHECK_INTERVAL = 500{code}
> 3. Remove the additional, quickened connection checking. Once we do fix 1, 
> this will become even more useless.
> Although TCP discovery has a period of connection checking, it may send a ping 
> before this period expires. This premature ping also relies on the time of 
> any received message for some reason. 
> 4. Do not worry the user with “Node seems disconnected” when everything is OK. 
> Once we do fixes 1 and 3, this will become even more useless. 
> A node may log on INFO: “Local node seems to be disconnected from topology …” 
> whereas it is not actually disconnected at all.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (IGNITE-13134) Fix connection recovery timeout.

2020-07-31 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13134:
--
Release Note: Fixed processing of the connection recovery timeout in 
TcpDiscoverySpi. If a node loses connection, it now strictly obtains a new 
connection to the ring or gets segmented within this timeout.  (was: Fixed 
TcpDiscoverySpi.connRecoveryTimeout. If a node loses connection, now it 
strictly obtains new connection to the ring of gets segmented within this 
timeout.)

> Fix connection recovery timeout.
> 
>
> Key: IGNITE-13134
> URL: https://issues.apache.org/jira/browse/IGNITE-13134
> Project: Ignite
>  Issue Type: Improvement
>Affects Versions: 2.8.1
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Critical
>  Labels: iep-45
> Fix For: 2.9
>
> Attachments: IGNITE-130134-patch.patch
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> If a node experiences connection issues, it must establish a new connection or 
> fail within failureDetectionTimeout + connectionRecoveryTimeout.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (IGNITE-13134) Fix connection recovery timeout.

2020-07-31 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13134:
--
Release Note: Fixed TcpDiscoverySpi.connRecoveryTimeout. If a node loses 
connection, it now strictly obtains a new connection to the ring or gets 
segmented within this timeout.

> Fix connection recovery timeout.
> 
>
> Key: IGNITE-13134
> URL: https://issues.apache.org/jira/browse/IGNITE-13134
> Project: Ignite
>  Issue Type: Improvement
>Affects Versions: 2.8.1
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Critical
>  Labels: iep-45
> Fix For: 2.9
>
> Attachments: IGNITE-130134-patch.patch
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> If a node experiences connection issues, it must establish a new connection or 
> fail within failureDetectionTimeout + connectionRecoveryTimeout.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (IGNITE-13134) Fix connection recovery timeout.

2020-07-31 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13134:
--
Ignite Flags: Release Notes Required

> Fix connection recovery timeout.
> 
>
> Key: IGNITE-13134
> URL: https://issues.apache.org/jira/browse/IGNITE-13134
> Project: Ignite
>  Issue Type: Improvement
>Affects Versions: 2.8.1
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Critical
>  Labels: iep-45
> Fix For: 2.9
>
> Attachments: IGNITE-130134-patch.patch
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> If a node experiences connection issues, it must establish a new connection or 
> fail within failureDetectionTimeout + connectionRecoveryTimeout.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (IGNITE-13208) Simplify IgniteSpiOperationTimeoutHelper

2020-07-31 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13208:
--
Fix Version/s: 2.10

> Simplify IgniteSpiOperationTimeoutHelper
> 
>
> Key: IGNITE-13208
> URL: https://issues.apache.org/jira/browse/IGNITE-13208
> Project: Ignite
>  Issue Type: Task
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Minor
> Fix For: 2.10
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> IgniteSpiOperationTimeoutHelper has many timeout fields. It should be 
> simplified.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (IGNITE-13040) Remove unused parameter from TcpDiscoverySpi.writeToSocket()

2020-07-28 Thread Vladimir Steshin (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-13040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166254#comment-17166254
 ] 

Vladimir Steshin commented on IGNITE-13040:
---

[~Kurinov], hi. I took a look at your PR. Please check it. And yes, you need a 
visa, as Vyacheslav mentioned.

> Remove unused parameter from TcpDiscoverySpi.writeToSocket()
> 
>
> Key: IGNITE-13040
> URL: https://issues.apache.org/jira/browse/IGNITE-13040
> Project: Ignite
>  Issue Type: Improvement
> Environment:  
>Reporter: Vladimir Steshin
>Assignee: Aleksey Kurinov
>Priority: Trivial
>  Labels: newbie
>
> Unused parameter
> {code:java}
> TcpDiscoveryAbstractMessage msg{code}
> should be removed from
> {code:java}
> TcpDiscoverySpi.writeToSocket(Socket sock, TcpDiscoveryAbstractMessage msg, 
> byte[] data, long timeout){code}
> This method seems to send raw data, not a message.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (IGNITE-13282) Fix TcpDiscoveryCoordinatorFailureTest.testClusterFailedNewCoordinatorInitialized()

2020-07-22 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13282:
--
Ignite Flags:   (was: Docs Required,Release Notes Required)

> Fix 
> TcpDiscoveryCoordinatorFailureTest.testClusterFailedNewCoordinatorInitialized()
> ---
>
> Key: IGNITE-13282
> URL: https://issues.apache.org/jira/browse/IGNITE-13282
> Project: Ignite
>  Issue Type: Bug
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Major
>






[jira] [Created] (IGNITE-13282) Fix TcpDiscoveryCoordinatorFailureTest.testClusterFailedNewCoordinatorInitialized()

2020-07-21 Thread Vladimir Steshin (Jira)
Vladimir Steshin created IGNITE-13282:
-

 Summary: Fix 
TcpDiscoveryCoordinatorFailureTest.testClusterFailedNewCoordinatorInitialized()
 Key: IGNITE-13282
 URL: https://issues.apache.org/jira/browse/IGNITE-13282
 Project: Ignite
  Issue Type: Bug
Reporter: Vladimir Steshin
Assignee: Vladimir Steshin








[jira] [Updated] (IGNITE-13016) Fix backward checking of failed node.

2020-07-21 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13016:
--
Ignite Flags:   (was: Release Notes Required)

> Fix backward checking of failed node.
> -
>
> Key: IGNITE-13016
> URL: https://issues.apache.org/jira/browse/IGNITE-13016
> Project: Ignite
>  Issue Type: Sub-task
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Major
>  Labels: iep-45
> Fix For: 2.9
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Backward node connection checking looks weird. What we might improve:
> 1) Address checking could be done in parallel, not sequentially.
> {code:java}
> for (InetSocketAddress addr : nodeAddrs) {
> // Connection refused may be got if node doesn't listen
> // (or blocked by firewall, but anyway assume it is dead).
> if (!isConnectionRefused(addr)) {
> liveAddr = addr;
> break;
> }
> }
> {code}
> 2) Any IOException should be considered a failed connection, not only 
> connection-refused:
> {code:java}
> catch (ConnectException e) {
> return true;
> }
> catch (IOException e) {
> return false;
> }
> {code}
> 3) The timeout on connection checking should not be a hardcoded constant:
> {code:java}
> sock.connect(addr, 100);
> {code}
> 4) The decision to check the connection should rely on the configured exchange 
> timeout, not on the ping interval:
> {code:java}
> // We got message from previous in less than double connection check interval.
> boolean ok = rcvdTime + U.millisToNanos(connCheckInterval) * 2 >= now;
> {code}
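Item 1 above could be sketched as a concurrent probe of all addresses. This is an illustrative sketch only, not Ignite code: `ParallelAddrCheck`, `firstLiveAddr` and the predicate (standing in for `TcpDiscoverySpi`'s `isConnectionRefused()`) are assumed names.

```java
import java.net.InetSocketAddress;
import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.function.Predicate;

/** Illustrative sketch, not Ignite code: probe all node addresses concurrently. */
public class ParallelAddrCheck {
    /**
     * Returns the first address whose probe does not report "connection refused",
     * or null if every probe refuses or the overall timeout elapses.
     * The predicate stands in for TcpDiscoverySpi's isConnectionRefused().
     */
    static InetSocketAddress firstLiveAddr(List<InetSocketAddress> addrs,
        Predicate<InetSocketAddress> refused, long timeoutMs) {
        ExecutorService pool = Executors.newFixedThreadPool(addrs.size());
        CompletionService<InetSocketAddress> cs = new ExecutorCompletionService<>(pool);

        try {
            // Launch one probe per address instead of looping sequentially.
            for (InetSocketAddress addr : addrs)
                cs.submit(() -> refused.test(addr) ? null : addr);

            // Consume results in completion order; the first live address wins.
            for (int i = 0; i < addrs.size(); i++) {
                Future<InetSocketAddress> fut = cs.poll(timeoutMs, TimeUnit.MILLISECONDS);

                if (fut == null)
                    return null; // Overall timeout exceeded: assume the node is dead.

                InetSocketAddress live = fut.get();

                if (live != null)
                    return live;
            }

            return null; // Every address refused the connection: assume the node is dead.
        }
        catch (Exception e) {
            return null; // Treat any probe failure as a dead node.
        }
        finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        List<InetSocketAddress> addrs = List.of(
            InetSocketAddress.createUnresolved("10.0.0.1", 47500),
            InetSocketAddress.createUnresolved("10.0.0.2", 47500));

        // Stub predicate: pretend only 10.0.0.2 accepts connections.
        InetSocketAddress live = firstLiveAddr(addrs,
            a -> a.getHostString().equals("10.0.0.1"), 1000);

        System.out.println(live == null ? "dead" : live.getHostString());
    }
}
```

With the stub predicate the slow sequential loop collapses to the latency of the fastest successful probe, which is the point of item 1.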





[jira] [Updated] (IGNITE-13206) Represent in the documenttion affection of several node addresses on failure detection.

2020-07-06 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13206:
--
Description: 
Current TcpDiscoverySpi can prolong detection of failure of a node that has 
several IP addresses. This happens because most of the timeouts, like 
failureDetectionTimeout, sockTimeout and ackTimeout, work per address. The 
actual failure detection delay is failureDetectionTimeout * addressesNumber, 
and the node addresses are tried sequentially. This effect on failure 
detection should be noted in the documentation.

The suggestion is to describe this behavior in 
https://apacheignite.readme.io/docs/tcpip-discovery. The text might be:

"_You should assign multiple addresses to a node only if they represent real 
physical connections that can improve reliability. Providing several 
addresses can prolong failure detection of the current node. The timeouts and 
settings on network operations (_failureDetectionTimeout(), sockTimeout, 
ackTimeout, maxAckTimeout, reconCnt_) work per connection/address. The 
exception is _connRecoveryTimeout_. Node addresses are tried sequentially.
 Example: if you use _failureDetectionTimeout_ and have set 3 IP addresses 
for this node, the previous node in the ring can take up to 
'failureDetectionTimeout * 3' to detect failure of the current node_."



  was:
Current TcpDiscoverySpi can prolong detection of node failure which has several 
IP addresses. This happens because most of the timeouts like 
failureDetectionTimeout, sockTimeout, ackTimeout work per address. Actual 
failure detection delay is: failureDetectionTimeout*addressesNumber. And the 
node addresses are sorted out consistently. This affection on failure detection 
should be noted in the documentation.

The suggestion is to represent this behavior in 
https://apacheignite.readme.io/docs/tcpip-discovery. The text might be:

"_You should assing multiple addresses to a node only if they represent some 
real physical connections which can give more reliability. Providing several 
addresses can prolong failure detection of current node. The timeouts and 
settings on network operations (_failureDetectionTimeout(), sockTimeout, 
ackTimeout, maxAckTimeout, reconCnt_) work per connection/address. The 
exception is _connRecoveryTimeout_. And node addresses are sorted out 
sequentially.
 Example: if you use _failureDetectionTimeout _and have set 3 ip addresses 
for this node, previous node iт  the ring can take up to 
'failureDetectionTimeout * 3' to detect failure of current node_."




> Represent in the documenttion affection of several node addresses on failure 
> detection.
> ---
>
> Key: IGNITE-13206
> URL: https://issues.apache.org/jira/browse/IGNITE-13206
> Project: Ignite
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Vladimir Steshin
>Assignee: Denis A. Magda
>Priority: Minor
>  Labels: iep-45
>
> Current TcpDiscoverySpi can prolong detection of failure of a node that has 
> several IP addresses. This happens because most of the timeouts, like 
> failureDetectionTimeout, sockTimeout and ackTimeout, work per address. The 
> actual failure detection delay is failureDetectionTimeout * addressesNumber, 
> and the node addresses are tried sequentially. This effect on failure 
> detection should be noted in the documentation.
> The suggestion is to describe this behavior in 
> https://apacheignite.readme.io/docs/tcpip-discovery. The text might be:
> "_You should assign multiple addresses to a node only if they represent real 
> physical connections that can improve reliability. Providing several 
> addresses can prolong failure detection of the current node. The timeouts and 
> settings on network operations (_failureDetectionTimeout(), sockTimeout, 
> ackTimeout, maxAckTimeout, reconCnt_) work per connection/address. The 
> exception is _connRecoveryTimeout_. Node addresses are tried sequentially.
>  Example: if you use _failureDetectionTimeout_ and have set 3 IP 
> addresses for this node, the previous node in the ring can take up to 
> 'failureDetectionTimeout * 3' to detect failure of the current node_."
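The worst-case arithmetic stated above (failureDetectionTimeout multiplied by the number of addresses, tried one after another) can be written down directly. The class and method names below are illustrative only, not Ignite API.

```java
/** Illustrative arithmetic only; names are assumptions, not Ignite API. */
public class DetectionDelay {
    /** Each address is tried sequentially and each may consume the full timeout. */
    static long worstCaseDelayMs(long failureDetectionTimeoutMs, int addressesNumber) {
        return failureDetectionTimeoutMs * addressesNumber;
    }

    public static void main(String[] args) {
        // 10 s failure detection timeout, 3 configured addresses: up to 30 s.
        System.out.println(worstCaseDelayMs(10_000, 3)); // prints 30000
    }
}
```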





[jira] [Assigned] (IGNITE-13206) Represent in the documenttion affection of several node addresses on failure detection.

2020-07-06 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin reassigned IGNITE-13206:
-

Assignee: Denis A. Magda  (was: Vladimir Steshin)

> Represent in the documenttion affection of several node addresses on failure 
> detection.
> ---
>
> Key: IGNITE-13206
> URL: https://issues.apache.org/jira/browse/IGNITE-13206
> Project: Ignite
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Vladimir Steshin
>Assignee: Denis A. Magda
>Priority: Minor
>  Labels: iep-45
>
> Current TcpDiscoverySpi can prolong detection of failure of a node that has 
> several IP addresses. This happens because most of the timeouts, like 
> failureDetectionTimeout, sockTimeout and ackTimeout, work per address. The 
> actual failure detection delay is failureDetectionTimeout * addressesNumber, 
> and the node addresses are tried sequentially. This effect on failure 
> detection should be noted in the documentation.
> The suggestion is to describe this behavior in 
> https://apacheignite.readme.io/docs/tcpip-discovery. The text might be:
> "_You should assign multiple addresses to a node only if they represent real 
> physical connections that can improve reliability. Providing several 
> addresses can prolong failure detection of the current node. The timeouts and 
> settings on network operations (_failureDetectionTimeout(), sockTimeout, 
> ackTimeout, maxAckTimeout, reconCnt_) work per connection/address. The 
> exception is _connRecoveryTimeout_. Node addresses are tried sequentially.
>  Example: if you use _failureDetectionTimeout_ and have set 3 IP 
> addresses for this node, the previous node in the ring can take up to 
> 'failureDetectionTimeout * 3' to detect failure of the current node_."





[jira] [Updated] (IGNITE-13206) Represent in the documenttion affection of several node addresses on failure detection.

2020-07-06 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13206:
--
Description: 
Current TcpDiscoverySpi can prolong detection of failure of a node that has 
several IP addresses. This happens because most of the timeouts, like 
failureDetectionTimeout, sockTimeout and ackTimeout, work per address. The 
actual failure detection delay is failureDetectionTimeout * addressesNumber, 
and the node addresses are tried sequentially. This effect on failure 
detection should be noted in the documentation.

The suggestion is to describe this behavior in 
https://apacheignite.readme.io/docs/tcpip-discovery. The text might be:

"You should assign multiple addresses to a node only if they represent real 
physical connections that can improve reliability. Providing several 
addresses can prolong failure detection of the current node. The timeouts and 
settings on network operations (_failureDetectionTimeout(), sockTimeout, 
ackTimeout, maxAckTimeout, reconCnt_) work per connection/address. The 
exception is _connRecoveryTimeout_. Node addresses are tried sequentially.
 Example: if you use _failureDetectionTimeout_ and have set 3 IP addresses 
for this node, the previous node in the ring can take up to 
'failureDetectionTimeout * 3' to detect failure of the current node."



  was:
Current TcpDiscoverySpi can prolong detection of node failure which has several 
IP addresses. This happens because most of the timeouts like 
failureDetectionTimeout, sockTimeout, ackTimeout work per address. Actual 
failure detection delay is: failureDetectionTimeout*addressesNumber (1). And 
the node addresses are sorted out consistently. This affection on failure 
detection should be noted in the documentation.

*1: addressesNumber - addresses number of next node in the ring.

The suggestion is to represent this behavior in 
https://apacheignite.readme.io/docs/tcpip-discovery. The text might be:

"You should assing multiple addresses to a node only if they represent some 
real physical connections which can give more reliability. Providing several 
addresses can prolong failure detection of current node. The timeouts and 
settings on network operations (_failureDetectionTimeout(), sockTimeout, 
ackTimeout, maxAckTimeout, reconCnt_) work per connection/address. The 
exception is _connRecoveryTimeout_. And node addresses are sorted out 
sequentially.
 Example: if you use _failureDetectionTimeout _and have set 3 ip addresses 
for this node, previous node iт  the ring can take up to 
'failureDetectionTimeout * 3' to detect failure of current node."




> Represent in the documenttion affection of several node addresses on failure 
> detection.
> ---
>
> Key: IGNITE-13206
> URL: https://issues.apache.org/jira/browse/IGNITE-13206
> Project: Ignite
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Minor
>  Labels: iep-45
>
> Current TcpDiscoverySpi can prolong detection of failure of a node that has 
> several IP addresses. This happens because most of the timeouts, like 
> failureDetectionTimeout, sockTimeout and ackTimeout, work per address. The 
> actual failure detection delay is failureDetectionTimeout * addressesNumber, 
> and the node addresses are tried sequentially. This effect on failure 
> detection should be noted in the documentation.
> The suggestion is to describe this behavior in 
> https://apacheignite.readme.io/docs/tcpip-discovery. The text might be:
> "You should assign multiple addresses to a node only if they represent real 
> physical connections that can improve reliability. Providing several 
> addresses can prolong failure detection of the current node. The timeouts and 
> settings on network operations (_failureDetectionTimeout(), sockTimeout, 
> ackTimeout, maxAckTimeout, reconCnt_) work per connection/address. The 
> exception is _connRecoveryTimeout_. Node addresses are tried sequentially.
>  Example: if you use _failureDetectionTimeout_ and have set 3 IP 
> addresses for this node, the previous node in the ring can take up to 
> 'failureDetectionTimeout * 3' to detect failure of the current node."





[jira] [Updated] (IGNITE-13206) Represent in the documenttion affection of several node addresses on failure detection.

2020-07-06 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13206:
--
Description: 
Current TcpDiscoverySpi can prolong detection of failure of a node that has 
several IP addresses. This happens because most of the timeouts, like 
failureDetectionTimeout, sockTimeout and ackTimeout, work per address. The 
actual failure detection delay is failureDetectionTimeout * addressesNumber, 
and the node addresses are tried sequentially. This effect on failure 
detection should be noted in the documentation.

The suggestion is to describe this behavior in 
https://apacheignite.readme.io/docs/tcpip-discovery. The text might be:

"_You should assign multiple addresses to a node only if they represent real 
physical connections that can improve reliability. Providing several 
addresses can prolong failure detection of the current node. The timeouts and 
settings on network operations (_failureDetectionTimeout(), sockTimeout, 
ackTimeout, maxAckTimeout, reconCnt_) work per connection/address. The 
exception is _connRecoveryTimeout_. Node addresses are tried sequentially.
 Example: if you use _failureDetectionTimeout_ and have set 3 IP addresses 
for this node, the previous node in the ring can take up to 
'failureDetectionTimeout * 3' to detect failure of the current node_."



  was:
Current TcpDiscoverySpi can prolong detection of node failure which has several 
IP addresses. This happens because most of the timeouts like 
failureDetectionTimeout, sockTimeout, ackTimeout work per address. Actual 
failure detection delay is: failureDetectionTimeout*addressesNumber. And the 
node addresses are sorted out consistently. This affection on failure detection 
should be noted in the documentation.

The suggestion is to represent this behavior in 
https://apacheignite.readme.io/docs/tcpip-discovery. The text might be:

"You should assing multiple addresses to a node only if they represent some 
real physical connections which can give more reliability. Providing several 
addresses can prolong failure detection of current node. The timeouts and 
settings on network operations (_failureDetectionTimeout(), sockTimeout, 
ackTimeout, maxAckTimeout, reconCnt_) work per connection/address. The 
exception is _connRecoveryTimeout_. And node addresses are sorted out 
sequentially.
 Example: if you use _failureDetectionTimeout _and have set 3 ip addresses 
for this node, previous node iт  the ring can take up to 
'failureDetectionTimeout * 3' to detect failure of current node."




> Represent in the documenttion affection of several node addresses on failure 
> detection.
> ---
>
> Key: IGNITE-13206
> URL: https://issues.apache.org/jira/browse/IGNITE-13206
> Project: Ignite
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Minor
>  Labels: iep-45
>
> Current TcpDiscoverySpi can prolong detection of failure of a node that has 
> several IP addresses. This happens because most of the timeouts, like 
> failureDetectionTimeout, sockTimeout and ackTimeout, work per address. The 
> actual failure detection delay is failureDetectionTimeout * addressesNumber, 
> and the node addresses are tried sequentially. This effect on failure 
> detection should be noted in the documentation.
> The suggestion is to describe this behavior in 
> https://apacheignite.readme.io/docs/tcpip-discovery. The text might be:
> "_You should assign multiple addresses to a node only if they represent real 
> physical connections that can improve reliability. Providing several 
> addresses can prolong failure detection of the current node. The timeouts and 
> settings on network operations (_failureDetectionTimeout(), sockTimeout, 
> ackTimeout, maxAckTimeout, reconCnt_) work per connection/address. The 
> exception is _connRecoveryTimeout_. Node addresses are tried sequentially.
>  Example: if you use _failureDetectionTimeout_ and have set 3 IP 
> addresses for this node, the previous node in the ring can take up to 
> 'failureDetectionTimeout * 3' to detect failure of the current node_."





[jira] [Updated] (IGNITE-13206) Represent in the documenttion affection of several node addresses on failure detection.

2020-07-06 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13206:
--
Component/s: documentation

> Represent in the documenttion affection of several node addresses on failure 
> detection.
> ---
>
> Key: IGNITE-13206
> URL: https://issues.apache.org/jira/browse/IGNITE-13206
> Project: Ignite
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Minor
>  Labels: iep-45
>
> Current TcpDiscoverySpi can prolong detection of failure of a node that has 
> several IP addresses. This happens because most of the timeouts, like 
> failureDetectionTimeout, sockTimeout and ackTimeout, work per address. The 
> actual failure detection delay is failureDetectionTimeout * addressesNumber 
> (1), and the node addresses are tried sequentially. This effect on failure 
> detection should be noted in the documentation.
> *1: addressesNumber is the number of addresses of the next node in the ring.
> The suggestion is to describe this behavior in 
> https://apacheignite.readme.io/docs/tcpip-discovery. The text might be:
> "You should assign multiple addresses to a node only if they represent real 
> physical connections that can improve reliability. Providing several 
> addresses can prolong failure detection of the current node. The timeouts and 
> settings on network operations (_failureDetectionTimeout(), sockTimeout, 
> ackTimeout, maxAckTimeout, reconCnt_) work per connection/address. The 
> exception is _connRecoveryTimeout_. Node addresses are tried sequentially.
>  Example: if you use _failureDetectionTimeout_ and have set 3 IP 
> addresses for this node, the previous node in the ring can take up to 
> 'failureDetectionTimeout * 3' to detect failure of the current node."





[jira] [Updated] (IGNITE-13208) Simplify IgniteSpiOperationTimeoutHelper

2020-07-06 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13208:
--
Priority: Minor  (was: Trivial)

> Simplify IgniteSpiOperationTimeoutHelper
> 
>
> Key: IGNITE-13208
> URL: https://issues.apache.org/jira/browse/IGNITE-13208
> Project: Ignite
>  Issue Type: Task
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> IgniteSpiOperationTimeoutHelper has many timeout fields. It should be 
> simplified.





[jira] [Updated] (IGNITE-13208) Simplify IgniteSpiOperationTimeoutHelper

2020-07-06 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13208:
--
Priority: Trivial  (was: Minor)

> Simplify IgniteSpiOperationTimeoutHelper
> 
>
> Key: IGNITE-13208
> URL: https://issues.apache.org/jira/browse/IGNITE-13208
> Project: Ignite
>  Issue Type: Task
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Trivial
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> IgniteSpiOperationTimeoutHelper has many timeout fields. It should be 
> simplified.





[jira] [Updated] (IGNITE-13016) Fix backward checking of failed node.

2020-07-06 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13016:
--
Description: 
Backward node connection checking looks weird. What we might improve:

1) Address checking could be done in parallel, not sequentially.
{code:java}
for (InetSocketAddress addr : nodeAddrs) {
// Connection refused may be got if node doesn't listen
// (or blocked by firewall, but anyway assume it is dead).
if (!isConnectionRefused(addr)) {
liveAddr = addr;

break;
}
}
{code}

2) Any IOException should be considered a failed connection, not only 
connection-refused:
{code:java}
catch (ConnectException e) {
return true;
}
catch (IOException e) {
return false;
}
{code}

3) The timeout on connection checking should not be a hardcoded constant:
{code:java}
sock.connect(addr, 100);
{code}

4) The decision to check the connection should rely on the configured exchange 
timeout, not on the ping interval:

{code:java}
// We got message from previous in less than double connection check interval.
boolean ok = rcvdTime + U.millisToNanos(connCheckInterval) * 2 >= now;
{code}





  was:
Backward node connection checking looks wierd. What might be improved are:

1) Addresses checking could be done in parrallel, not sequentially.
{code:java}
for (InetSocketAddress addr : nodeAddrs) {
// Connection refused may be got if node doesn't listen
// (or blocked by firewall, but anyway assume it is dead).
if (!isConnectionRefused(addr)) {
liveAddr = addr;

break;
}
}
{code}

2) Any io-exception should be considered as failed connection, not only 
connection-refused:
{code:java}
catch (ConnectException e) {
return true;
}
catch (IOException e) {
return false;
}
{code}

3) Timeout on connection checking should not be constand or hardcoced:
{code:java}
sock.connect(addr, 100);
{code}

4) Decision to check connection should rely on configured exchange timeout, no 
on the ping interval

{code:java}
// We got message from previous in less than double connection check interval.
boolean ok = rcvdTime + U.millisToNanos(connCheckInterval) * 2 >= now;
{code}






> Fix backward checking of failed node.
> -
>
> Key: IGNITE-13016
> URL: https://issues.apache.org/jira/browse/IGNITE-13016
> Project: Ignite
>  Issue Type: Sub-task
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Major
>  Labels: iep-45
> Fix For: 2.9
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Backward node connection checking looks weird. What we might improve:
> 1) Address checking could be done in parallel, not sequentially.
> {code:java}
> for (InetSocketAddress addr : nodeAddrs) {
> // Connection refused may be got if node doesn't listen
> // (or blocked by firewall, but anyway assume it is dead).
> if (!isConnectionRefused(addr)) {
> liveAddr = addr;
> break;
> }
> }
> {code}
> 2) Any IOException should be considered a failed connection, not only 
> connection-refused:
> {code:java}
> catch (ConnectException e) {
> return true;
> }
> catch (IOException e) {
> return false;
> }
> {code}
> 3) The timeout on connection checking should not be a hardcoded constant:
> {code:java}
> sock.connect(addr, 100);
> {code}
> 4) The decision to check the connection should rely on the configured exchange 
> timeout, not on the ping interval:
> {code:java}
> // We got message from previous in less than double connection check interval.
> boolean ok = rcvdTime + U.millisToNanos(connCheckInterval) * 2 >= now;
> {code}
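Item 4's recency check above can be sketched as follows, with the threshold taken as a parameter so it can derive from the configured exchange/failure-detection timeout rather than the ping interval. Class and method names here are illustrative assumptions, not Ignite API.

```java
import java.util.concurrent.TimeUnit;

/** Illustrative sketch of item 4; names are assumptions, not Ignite API. */
public class RecencyCheck {
    /**
     * True if the last message from the previous node arrived within the
     * threshold. The point of item 4 is that this threshold should derive
     * from the configured exchange/failure-detection timeout, not from the
     * connection check (ping) interval.
     */
    static boolean heardRecently(long rcvdTimeNanos, long nowNanos, long thresholdMs) {
        return rcvdTimeNanos + TimeUnit.MILLISECONDS.toNanos(thresholdMs) >= nowNanos;
    }

    public static void main(String[] args) {
        long now = System.nanoTime();
        long rcvd = now - TimeUnit.MILLISECONDS.toNanos(150); // heard 150 ms ago

        System.out.println(heardRecently(rcvd, now, 200)); // true: within 200 ms
        System.out.println(heardRecently(rcvd, now, 100)); // false: older than 100 ms
    }
}
```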





[jira] [Updated] (IGNITE-13016) Fix backward checking of failed node.

2020-07-06 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13016:
--
Description: 
Backward node connection checking looks weird. What might be improved:

1) Address checking could be done in parallel, not sequentially.
{code:java}
for (InetSocketAddress addr : nodeAddrs) {
// Connection refused may be got if node doesn't listen
// (or blocked by firewall, but anyway assume it is dead).
if (!isConnectionRefused(addr)) {
liveAddr = addr;

break;
}
}
{code}

2) Any IOException should be considered a failed connection, not only 
connection-refused:
{code:java}
catch (ConnectException e) {
return true;
}
catch (IOException e) {
return false;
}
{code}

3) The timeout on connection checking should not be a hardcoded constant:
{code:java}
sock.connect(addr, 100);
{code}

4) The decision to check the connection should rely on the configured exchange 
timeout, not on the ping interval:

{code:java}
// We got message from previous in less than double connection check interval.
boolean ok = rcvdTime + U.millisToNanos(connCheckInterval) * 2 >= now;
{code}





  was:
Backward node connection checking looks wierd. What might be improved are:

1) Addresses checking could be done in parrallel, not serializably
{code:java}
for (InetSocketAddress addr : nodeAddrs) {
// Connection refused may be got if node doesn't listen
// (or blocked by firewall, but anyway assume it is dead).
if (!isConnectionRefused(addr)) {
liveAddr = addr;

break;
}
}
{code}

2) Any io-exception should be considered as failed connection, not only 
connection-refused:
{code:java}
catch (ConnectException e) {
return true;
}
catch (IOException e) {
return false;
}
{code}

3) Timeout on connection checking should not be constand or hardcoced:
{code:java}
sock.connect(addr, 100);
{code}

4) Decision to check connection should rely on configured exchange timeout, no 
on the ping interval

{code:java}
// We got message from previous in less than double connection check interval.
boolean ok = rcvdTime + U.millisToNanos(connCheckInterval) * 2 >= now;
{code}






> Fix backward checking of failed node.
> -
>
> Key: IGNITE-13016
> URL: https://issues.apache.org/jira/browse/IGNITE-13016
> Project: Ignite
>  Issue Type: Sub-task
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Major
>  Labels: iep-45
> Fix For: 2.9
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Backward node connection checking looks weird. What might be improved:
> 1) Address checking could be done in parallel, not sequentially.
> {code:java}
> for (InetSocketAddress addr : nodeAddrs) {
> // Connection refused may be got if node doesn't listen
> // (or blocked by firewall, but anyway assume it is dead).
> if (!isConnectionRefused(addr)) {
> liveAddr = addr;
> break;
> }
> }
> {code}
> 2) Any IOException should be considered a failed connection, not only 
> connection-refused:
> {code:java}
> catch (ConnectException e) {
> return true;
> }
> catch (IOException e) {
> return false;
> }
> {code}
> 3) The timeout on connection checking should not be a hardcoded constant:
> {code:java}
> sock.connect(addr, 100);
> {code}
> 4) The decision to check the connection should rely on the configured exchange 
> timeout, not on the ping interval:
> {code:java}
> // We got message from previous in less than double connection check interval.
> boolean ok = rcvdTime + U.millisToNanos(connCheckInterval) * 2 >= now;
> {code}





[jira] [Updated] (IGNITE-13134) Fix connection recovery timeout.

2020-07-06 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13134:
--
Priority: Critical  (was: Minor)

> Fix connection recovery timeout.
> 
>
> Key: IGNITE-13134
> URL: https://issues.apache.org/jira/browse/IGNITE-13134
> Project: Ignite
>  Issue Type: Improvement
>Affects Versions: 2.8.1
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Critical
>  Labels: iep-45
> Fix For: 2.9
>
> Attachments: IGNITE-130134-patch.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> If a node experiences connection issues, it must establish a new connection or 
> fail within failureDetectionTimeout + connectionRecoveryTimeout.
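The bound stated above gives a node a fixed recovery budget before it must fail itself. A minimal sketch of that budget check, with assumed class and method names (not Ignite API):

```java
import java.util.concurrent.TimeUnit;

/** Illustrative sketch of the stated bound; names are assumptions, not Ignite API. */
public class RecoveryDeadline {
    /**
     * True while the node is still inside its recovery budget of
     * failureDetectionTimeout + connRecoveryTimeout; once the budget is
     * exhausted, the node must consider itself failed (segmented).
     */
    static boolean withinRecoveryWindow(long troubleStartNanos, long nowNanos,
        long failureDetectionTimeoutMs, long connRecoveryTimeoutMs) {
        long budgetNanos = TimeUnit.MILLISECONDS.toNanos(
            failureDetectionTimeoutMs + connRecoveryTimeoutMs);

        return nowNanos - troubleStartNanos <= budgetNanos;
    }

    public static void main(String[] args) {
        long budget = TimeUnit.MILLISECONDS.toNanos(10_000 + 5_000);

        System.out.println(withinRecoveryWindow(0, budget, 10_000, 5_000));     // true
        System.out.println(withinRecoveryWindow(0, budget + 1, 10_000, 5_000)); // false
    }
}
```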





[jira] [Updated] (IGNITE-13134) Fix connection recovery timeout.

2020-07-06 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13134:
--
Priority: Minor  (was: Major)

> Fix connection recovery timeout.
> 
>
> Key: IGNITE-13134
> URL: https://issues.apache.org/jira/browse/IGNITE-13134
> Project: Ignite
>  Issue Type: Improvement
>Affects Versions: 2.8.1
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Minor
>  Labels: iep-45
> Fix For: 2.9
>
> Attachments: IGNITE-130134-patch.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> If a node experiences connection issues, it must establish a new connection or 
> fail within failureDetectionTimeout + connectionRecoveryTimeout.





[jira] [Updated] (IGNITE-13206) Represent in the documenttion affection of several node addresses on failure detection.

2020-07-03 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13206:
--
Description: 
Current TcpDiscoverySpi can prolong detection of failure of a node that has 
several IP addresses. This happens because most of the timeouts, like 
failureDetectionTimeout, sockTimeout and ackTimeout, work per address. The 
actual failure detection delay is failureDetectionTimeout * addressesNumber 
(1), and the node addresses are tried sequentially. This effect on failure 
detection should be noted in the documentation.

*1: addressesNumber is the number of addresses of the next node in the ring.

The suggestion is to describe this behavior in 
https://apacheignite.readme.io/docs/tcpip-discovery. The text might be:

"You should assign multiple addresses to a node only if they represent real 
physical connections that can improve reliability. Providing several 
addresses can prolong failure detection of the current node. The timeouts and 
settings on network operations (_failureDetectionTimeout(), sockTimeout, 
ackTimeout, maxAckTimeout, reconCnt_) work per connection/address. The 
exception is _connRecoveryTimeout_. Node addresses are tried sequentially.
 Example: if you use _failureDetectionTimeout_ and have set 3 IP addresses 
for this node, the previous node in the ring can take up to 
'failureDetectionTimeout * 3' to detect failure of the current node."



  was:
Current TcpDiscoverySpi can prolong detection of the failure of a node which 
has several IP addresses. This happens because most of the timeouts like 
failureDetectionTimeout, sockTimeout, ackTimeout work per address. The actual 
failure detection delay is failureDetectionTimeout*addressesNumber (1), and 
the node addresses are sorted out consistently. This effect on failure 
detection should be noted in the documentation.

*1: addressesNumber - the number of addresses of the next node in the ring.

The suggestion is to describe this behavior in 
https://apacheignite.readme.io/docs/tcpip-discovery. The text might be:

"You should assign multiple addresses to a node only if they represent real 
physical connections that can give more reliability. Providing several 
addresses can prolong failure detection of the current node. The timeouts and 
settings on network operations (_failureDetectionTimeout, sockTimeout, 
ackTimeout, maxAckTimeout, reconCnt_) work per connection/address. The 
exception is _connRecoveryTimeout_. And node addresses are sorted out 
consistently.
 Example: if you use _failureDetectionTimeout_ and have set 3 IP addresses 
for this node, the previous node in the ring can take up to 
'failureDetectionTimeout * 3' to detect failure of the current node."




> Represent in the documenttion affection of several node addresses on failure 
> detection.
> ---
>
> Key: IGNITE-13206
> URL: https://issues.apache.org/jira/browse/IGNITE-13206
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Minor
>  Labels: iep-45
>
> Current TcpDiscoverySpi can prolong detection of the failure of a node which 
> has several IP addresses. This happens because most of the timeouts like 
> failureDetectionTimeout, sockTimeout, ackTimeout work per address. The 
> actual failure detection delay is failureDetectionTimeout*addressesNumber 
> (1), and the node addresses are sorted out consistently. This effect on 
> failure detection should be noted in the documentation.
> *1: addressesNumber - the number of addresses of the next node in the ring.
> The suggestion is to describe this behavior in 
> https://apacheignite.readme.io/docs/tcpip-discovery. The text might be:
> "You should assign multiple addresses to a node only if they represent real 
> physical connections that can give more reliability. Providing several 
> addresses can prolong failure detection of the current node. The timeouts 
> and settings on network operations (_failureDetectionTimeout, sockTimeout, 
> ackTimeout, maxAckTimeout, reconCnt_) work per connection/address. The 
> exception is _connRecoveryTimeout_. And node addresses are sorted out 
> sequentially.
>  Example: if you use _failureDetectionTimeout_ and have set 3 IP addresses 
> for this node, the previous node in the ring can take up to 
> 'failureDetectionTimeout * 3' to detect failure of the current node."
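
The per-address timeout behavior described above can be sketched as follows; the class and `tryConnect` are hypothetical stand-ins for illustration, not Ignite's implementation:

```java
import java.util.List;

// Sketch of why failure detection can take up to
// failureDetectionTimeout * addressesNumber: each address of the next node
// is tried in turn, and each attempt may consume the full timeout.
// Hypothetical code for illustration only, not Ignite's implementation.
public class PerAddressTimeout {
    // Stand-in for a connection attempt; here the node is assumed down,
    // so every attempt fails after consuming the whole timeout.
    static boolean tryConnect(String addr, long timeoutMs) {
        return false;
    }

    // Worst-case time spent before declaring the next node failed.
    static long worstCaseDetectionDelayMs(List<String> addrs,
                                          long failureDetectionTimeoutMs) {
        long elapsed = 0;
        for (String addr : addrs) {
            if (tryConnect(addr, failureDetectionTimeoutMs))
                return elapsed; // connected: the node is alive
            elapsed += failureDetectionTimeoutMs;
        }
        return elapsed; // all addresses exhausted: node considered failed
    }

    public static void main(String[] args) {
        // 3 addresses and failureDetectionTimeout = 10s
        // -> up to 30s to detect the failure.
        System.out.println(worstCaseDetectionDelayMs(
            List.of("10.0.0.1", "10.0.0.2", "10.0.0.3"), 10_000)); // 30000
    }
}
```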





[jira] [Updated] (IGNITE-13208) Simplify IgniteSpiOperationTimeoutHelper

2020-07-03 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13208:
--
Summary: Simplify IgniteSpiOperationTimeoutHelper  (was: Refactoring of 
IgniteSpiOperationTimeoutHelper)

> Simplify IgniteSpiOperationTimeoutHelper
> 
>
> Key: IGNITE-13208
> URL: https://issues.apache.org/jira/browse/IGNITE-13208
> Project: Ignite
>  Issue Type: Task
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Minor
>
> IgniteSpiOperationTimeoutHelper has many timeout fields. It should get 
> simplified.





[jira] [Updated] (IGNITE-13208) Refactoring of IgniteSpiOperationTimeoutHelper

2020-07-03 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13208:
--
Ignite Flags:   (was: Docs Required,Release Notes Required)

> Refactoring of IgniteSpiOperationTimeoutHelper
> --
>
> Key: IGNITE-13208
> URL: https://issues.apache.org/jira/browse/IGNITE-13208
> Project: Ignite
>  Issue Type: Task
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Minor
>
> IgniteSpiOperationTimeoutHelper has many timeout fields. It should get 
> simplified.





[jira] [Updated] (IGNITE-13208) Refactoring of IgniteSpiOperationTimeoutHelper

2020-07-03 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13208:
--
Description: IgniteSpiOperationTimeoutHelper has many timeout fields. It 
should get simplified.  (was: IgniteSpiOperationTimeoutHelper has many timeout 
fields. It looks like to get simplified.)

> Refactoring of IgniteSpiOperationTimeoutHelper
> --
>
> Key: IGNITE-13208
> URL: https://issues.apache.org/jira/browse/IGNITE-13208
> Project: Ignite
>  Issue Type: Task
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Minor
>
> IgniteSpiOperationTimeoutHelper has many timeout fields. It should get 
> simplified.





[jira] [Created] (IGNITE-13208) Refactoring of IgniteSpiOperationTimeoutHelper

2020-07-02 Thread Vladimir Steshin (Jira)
Vladimir Steshin created IGNITE-13208:
-

 Summary: Refactoring of IgniteSpiOperationTimeoutHelper
 Key: IGNITE-13208
 URL: https://issues.apache.org/jira/browse/IGNITE-13208
 Project: Ignite
  Issue Type: Task
Reporter: Vladimir Steshin
Assignee: Vladimir Steshin


IgniteSpiOperationTimeoutHelper has many timeout fields. It looks like it 
could be simplified.





[jira] [Updated] (IGNITE-13206) Represent in the documenttion affection of several node addresses on failure detection.

2020-07-02 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13206:
--
Description: 
Current TcpDiscoverySpi can prolong detection of the failure of a node which 
has several IP addresses. This happens because most of the timeouts like 
failureDetectionTimeout, sockTimeout, ackTimeout work per address. The actual 
failure detection delay is failureDetectionTimeout*addressesNumber (1), and 
the node addresses are sorted out consistently. This effect on failure 
detection should be noted in the documentation.

*1: addressesNumber - the number of addresses of the next node in the ring.

The suggestion is to describe this behavior in 
https://apacheignite.readme.io/docs/tcpip-discovery. The text might be:

"You should assign multiple addresses to a node only if they represent real 
physical connections that can give more reliability. Providing several 
addresses can prolong failure detection of the current node. The timeouts and 
settings on network operations (_failureDetectionTimeout, sockTimeout, 
ackTimeout, maxAckTimeout, reconCnt_) work per connection/address. The 
exception is _connRecoveryTimeout_. And node addresses are sorted out 
consistently.
 Example: if you use _failureDetectionTimeout_ and have set 3 IP addresses 
for this node, the previous node in the ring can take up to 
'failureDetectionTimeout * 3' to detect failure of the current node."



  was:
Current TcpDiscoverySpi can prolong detection of the failure of a node which 
has several IP addresses. This happens because most of the timeouts like 
failureDetectionTimeout, sockTimeout, ackTimeout work per address. The actual 
failure detection delay is failureDetectionTimeout*addressesNumber (1), and 
the node addresses are sorted out consistently. This effect on failure 
detection should be noted in the documentation.

*1: addressesNumber - the number of addresses of the next node in the ring.

The suggestion is to describe this behavior in 
https://apacheignite.readme.io/docs/tcpip-discovery. The text might be:

You should assign multiple addresses to a node only if they represent real 
physical connections that can give more reliability. Providing several 
addresses can prolong failure detection of the current node. The timeouts and 
settings on network operations (_failureDetectionTimeout, sockTimeout, 
ackTimeout, maxAckTimeout, reconCnt_) work per connection/address. The 
exception is _connRecoveryTimeout_. And node addresses are sorted out 
consistently.
 Example: if you use _failureDetectionTimeout_ and have set 3 IP addresses 
for this node, the previous node in the ring can take up to 
'failureDetectionTimeout * 3' to detect failure of the current node.




> Represent in the documenttion affection of several node addresses on failure 
> detection.
> ---
>
> Key: IGNITE-13206
> URL: https://issues.apache.org/jira/browse/IGNITE-13206
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Minor
>  Labels: iep-45
>
> Current TcpDiscoverySpi can prolong detection of the failure of a node which 
> has several IP addresses. This happens because most of the timeouts like 
> failureDetectionTimeout, sockTimeout, ackTimeout work per address. The 
> actual failure detection delay is failureDetectionTimeout*addressesNumber 
> (1), and the node addresses are sorted out consistently. This effect on 
> failure detection should be noted in the documentation.
> *1: addressesNumber - the number of addresses of the next node in the ring.
> The suggestion is to describe this behavior in 
> https://apacheignite.readme.io/docs/tcpip-discovery. The text might be:
> "You should assign multiple addresses to a node only if they represent real 
> physical connections that can give more reliability. Providing several 
> addresses can prolong failure detection of the current node. The timeouts 
> and settings on network operations (_failureDetectionTimeout, sockTimeout, 
> ackTimeout, maxAckTimeout, reconCnt_) work per connection/address. The 
> exception is _connRecoveryTimeout_. And node addresses are sorted out 
> consistently.
>  Example: if you use _failureDetectionTimeout_ and have set 3 IP addresses 
> for this node, the previous node in the ring can take up to 
> 'failureDetectionTimeout * 3' to detect failure of the current node."





[jira] [Updated] (IGNITE-13206) Represent in the documenttion affection of several node addresses on failure detection.

2020-07-02 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13206:
--
Description: 
Current TcpDiscoverySpi can prolong detection of the failure of a node which 
has several IP addresses. This happens because most of the timeouts like 
failureDetectionTimeout, sockTimeout, ackTimeout work per address. The actual 
failure detection delay is failureDetectionTimeout*addressesNumber (1), and 
the node addresses are sorted out consistently. This effect on failure 
detection should be noted in the documentation.

*1: addressesNumber - the number of addresses of the next node in the ring.

The suggestion is to describe this behavior in 
https://apacheignite.readme.io/docs/tcpip-discovery. The text might be:

You should assign multiple addresses to a node only if they represent real 
physical connections that can give more reliability. Providing several 
addresses can prolong failure detection of the current node. The timeouts and 
settings on network operations (_failureDetectionTimeout, sockTimeout, 
ackTimeout, maxAckTimeout, reconCnt_) work per connection/address. The 
exception is _connRecoveryTimeout_. And node addresses are sorted out 
consistently.
 Example: if you use _failureDetectionTimeout_ and have set 3 IP addresses 
for this node, the previous node in the ring can take up to 
'failureDetectionTimeout * 3' to detect failure of the current node.



  was:
Current TcpDiscoverySpi can prolong detection of the failure of a node which 
has several IP addresses. This happens because most of the timeouts like 
failureDetectionTimeout, sockTimeout, ackTimeout work per address. The actual 
failure detection delay is failureDetectionTimeout*addressesNumber (1), and 
the node addresses are sorted out consistently. This effect on failure 
detection should be noted in the documentation.

*1: addressesNumber - the number of addresses of the next node in the ring.


> Represent in the documenttion affection of several node addresses on failure 
> detection.
> ---
>
> Key: IGNITE-13206
> URL: https://issues.apache.org/jira/browse/IGNITE-13206
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Minor
>  Labels: iep-45
>
> Current TcpDiscoverySpi can prolong detection of the failure of a node which 
> has several IP addresses. This happens because most of the timeouts like 
> failureDetectionTimeout, sockTimeout, ackTimeout work per address. The 
> actual failure detection delay is failureDetectionTimeout*addressesNumber 
> (1), and the node addresses are sorted out consistently. This effect on 
> failure detection should be noted in the documentation.
> *1: addressesNumber - the number of addresses of the next node in the ring.
> The suggestion is to describe this behavior in 
> https://apacheignite.readme.io/docs/tcpip-discovery. The text might be:
> You should assign multiple addresses to a node only if they represent real 
> physical connections that can give more reliability. Providing several 
> addresses can prolong failure detection of the current node. The timeouts 
> and settings on network operations (_failureDetectionTimeout, sockTimeout, 
> ackTimeout, maxAckTimeout, reconCnt_) work per connection/address. The 
> exception is _connRecoveryTimeout_. And node addresses are sorted out 
> consistently.
>  Example: if you use _failureDetectionTimeout_ and have set 3 IP addresses 
> for this node, the previous node in the ring can take up to 
> 'failureDetectionTimeout * 3' to detect failure of the current node.





[jira] [Updated] (IGNITE-13206) Represent in the documenttion affection of several node addresses on failure detection.

2020-07-02 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13206:
--
Ignite Flags: Docs Required  (was: Docs Required,Release Notes Required)

> Represent in the documenttion affection of several node addresses on failure 
> detection.
> ---
>
> Key: IGNITE-13206
> URL: https://issues.apache.org/jira/browse/IGNITE-13206
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Minor
>  Labels: iep-45
>
> Current TcpDiscoverySpi can prolong detection of the failure of a node which 
> has several IP addresses. This happens because most of the timeouts like 
> failureDetectionTimeout, sockTimeout, ackTimeout work per address. The 
> actual failure detection delay is failureDetectionTimeout*addressesNumber 
> (1), and the node addresses are sorted out consistently. This effect on 
> failure detection should be noted in the documentation.
> *1: addressesNumber - the number of addresses of the next node in the ring.





[jira] [Updated] (IGNITE-13206) Represent in the documenttion affection of several node addresses on failure detection.

2020-07-02 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13206:
--
Description: 
Current TcpDiscoverySpi can prolong detection of the failure of a node which 
has several IP addresses. This happens because most of the timeouts like 
failureDetectionTimeout, sockTimeout, ackTimeout work per address. The actual 
failure detection delay is failureDetectionTimeout*addressesNumber (1), and 
the node addresses are sorted out consistently. This effect on failure 
detection should be noted in the documentation.

*1: addressesNumber - the number of addresses of the next node in the ring.

  was:
Current TcpDiscoverySpi can prolong detection of the failure of a node which 
has several IP addresses. This happens because most of the timeouts like 
failureDetectionTimeout, sockTimeout, ackTimeout work per address. The actual 
failure detection delay is failureDetectionTimeout*addressesNumber (1), and 
the node addresses are sorted out serially. This effect on failure detection 
should be noted in the documentation.

*1: addressesNumber - the number of addresses of the next node in the ring.


> Represent in the documenttion affection of several node addresses on failure 
> detection.
> ---
>
> Key: IGNITE-13206
> URL: https://issues.apache.org/jira/browse/IGNITE-13206
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Minor
>  Labels: iep-45
>
> Current TcpDiscoverySpi can prolong detection of the failure of a node which 
> has several IP addresses. This happens because most of the timeouts like 
> failureDetectionTimeout, sockTimeout, ackTimeout work per address. The 
> actual failure detection delay is failureDetectionTimeout*addressesNumber 
> (1), and the node addresses are sorted out consistently. This effect on 
> failure detection should be noted in the documentation.
> *1: addressesNumber - the number of addresses of the next node in the ring.





[jira] [Updated] (IGNITE-13205) Represent in logs, javadoc affection of several node addresses on failure detection.

2020-07-02 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13205:
--
Description: 
Current TcpDiscoverySpi can prolong detection of the failure of a node which 
has several IP addresses. This happens because most of the timeouts like 
_failureDetectionTimeout, sockTimeout, ackTimeout_ work per address. The 
actual failure detection delay is _failureDetectionTimeout*addressesNumber_ 
(1), and the node addresses are sorted out consistently. This effect on 
failure detection should be noted in logs and javadocs.

*1: addressesNumber - the number of addresses of the next node in the ring.

  was:
Current TcpDiscoverySpi can prolong detection of the failure of a node which 
has several IP addresses. This happens because most of the timeouts like 
_failureDetectionTimeout, sockTimeout, ackTimeout_ work per address. The 
actual failure detection delay is _failureDetectionTimeout*addressesNumber_ 
(1), and the node addresses are sorted out serially. This effect on failure 
detection should be noted in logs and javadocs.

*1: addressesNumber - the number of addresses of the next node in the ring.


> Represent in logs, javadoc affection of several node addresses on failure 
> detection.
> 
>
> Key: IGNITE-13205
> URL: https://issues.apache.org/jira/browse/IGNITE-13205
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Minor
>  Labels: iep-45
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Current TcpDiscoverySpi can prolong detection of the failure of a node which 
> has several IP addresses. This happens because most of the timeouts like 
> _failureDetectionTimeout, sockTimeout, ackTimeout_ work per address. The 
> actual failure detection delay is _failureDetectionTimeout*addressesNumber_ 
> (1), and the node addresses are sorted out consistently. This effect on 
> failure detection should be noted in logs and javadocs.
> *1: addressesNumber - the number of addresses of the next node in the ring.





[jira] [Updated] (IGNITE-13205) Represent in logs, javadoc affection of several node addresses on failure detection.

2020-07-02 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13205:
--
Priority: Minor  (was: Major)

> Represent in logs, javadoc affection of several node addresses on failure 
> detection.
> 
>
> Key: IGNITE-13205
> URL: https://issues.apache.org/jira/browse/IGNITE-13205
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Minor
>  Labels: iep-45
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Current TcpDiscoverySpi can prolong detection of the failure of a node which 
> has several IP addresses. This happens because most of the timeouts like 
> _failureDetectionTimeout, sockTimeout, ackTimeout_ work per address. The 
> actual failure detection delay is _failureDetectionTimeout*addressesNumber_ 
> (1), and the node addresses are sorted out serially. This effect on failure 
> detection should be noted in logs and javadocs.
> *1: addressesNumber - the number of addresses of the next node in the ring.





[jira] [Updated] (IGNITE-13206) Represent in the documenttion affection of several node addresses on failure detection.

2020-07-02 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13206:
--
Labels: iep-45  (was: )

> Represent in the documenttion affection of several node addresses on failure 
> detection.
> ---
>
> Key: IGNITE-13206
> URL: https://issues.apache.org/jira/browse/IGNITE-13206
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Minor
>  Labels: iep-45
>
> Current TcpDiscoverySpi can prolong detection of the failure of a node which 
> has several IP addresses. This happens because most of the timeouts like 
> failureDetectionTimeout, sockTimeout, ackTimeout work per address. The 
> actual failure detection delay is failureDetectionTimeout*addressesNumber 
> (1), and the node addresses are sorted out serially. This effect on failure 
> detection should be noted in the documentation.
> *1: addressesNumber - the number of addresses of the next node in the ring.





[jira] [Updated] (IGNITE-13206) Represent in the documenttion affection of several node addresses on failure detection.

2020-07-02 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13206:
--
Description: 
Current TcpDiscoverySpi can prolong detection of the failure of a node which 
has several IP addresses. This happens because most of the timeouts like 
failureDetectionTimeout, sockTimeout, ackTimeout work per address. The actual 
failure detection delay is failureDetectionTimeout*addressesNumber (1), and 
the node addresses are sorted out serially. This effect on failure detection 
should be noted in the documentation.

*1: addressesNumber - the number of addresses of the next node in the ring.
Summary: Represent in the documenttion affection of several node 
addresses on failure detection.  (was: Represent in the doc affection of 
several node addresses on failure detection.)

> Represent in the documenttion affection of several node addresses on failure 
> detection.
> ---
>
> Key: IGNITE-13206
> URL: https://issues.apache.org/jira/browse/IGNITE-13206
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Minor
>
> Current TcpDiscoverySpi can prolong detection of the failure of a node which 
> has several IP addresses. This happens because most of the timeouts like 
> failureDetectionTimeout, sockTimeout, ackTimeout work per address. The 
> actual failure detection delay is failureDetectionTimeout*addressesNumber 
> (1), and the node addresses are sorted out serially. This effect on failure 
> detection should be noted in the documentation.
> *1: addressesNumber - the number of addresses of the next node in the ring.





[jira] [Created] (IGNITE-13206) Represent in the doc affection of several node addresses on failure detection.

2020-07-02 Thread Vladimir Steshin (Jira)
Vladimir Steshin created IGNITE-13206:
-

 Summary: Represent in the doc affection of several node addresses 
on failure detection.
 Key: IGNITE-13206
 URL: https://issues.apache.org/jira/browse/IGNITE-13206
 Project: Ignite
  Issue Type: Improvement
Reporter: Vladimir Steshin
Assignee: Vladimir Steshin








[jira] [Updated] (IGNITE-13205) Represent in logs, javadoc affection of several node addresses on failure detection.

2020-07-02 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13205:
--
Labels: iep-45  (was: )

> Represent in logs, javadoc affection of several node addresses on failure 
> detection.
> 
>
> Key: IGNITE-13205
> URL: https://issues.apache.org/jira/browse/IGNITE-13205
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Major
>  Labels: iep-45
>
> Current TcpDiscoverySpi can prolong detection of the failure of a node which 
> has several IP addresses. This happens because most of the timeouts like 
> _failureDetectionTimeout, sockTimeout, ackTimeout_ work per address. The 
> actual failure detection delay is _failureDetectionTimeout*addressesNumber_ 
> (1), and the node addresses are sorted out serially. This effect on failure 
> detection should be noted in logs and javadocs.
> *1: addressesNumber - the number of addresses of the next node in the ring.





[jira] [Updated] (IGNITE-13205) Represent in logs, javadoc affection of several node addresses on failure detection.

2020-07-02 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13205:
--
Description: 
Current TcpDiscoverySpi can prolong detection of the failure of a node which 
has several IP addresses. This happens because most of the timeouts like 
_failureDetectionTimeout, sockTimeout, ackTimeout_ work per address. The 
actual failure detection delay is _failureDetectionTimeout*addressesNumber_ 
(1), and the node addresses are sorted out serially. This effect on failure 
detection should be noted in logs and javadocs.

*1: addressesNumber - the number of addresses of the next node in the ring.

  was:
Current TcpDiscoverySpi can prolong detection of the failure of a node which 
has several IP addresses. This happens because most of the timeouts like 
_failureDetectionTimeout, sockTimeout, ackTimeout_ work per address. The 
actual failure detection delay is _failureDetectionTimeout*addressesNumber_ 
(1), and the node addresses are sorted out serially. This effect on failure 
detection should be noted in logs and javadocs.

*1: addressesNumber - the number of addresses of the next node.


> Represent in logs, javadoc affection of several node addresses on failure 
> detection.
> 
>
> Key: IGNITE-13205
> URL: https://issues.apache.org/jira/browse/IGNITE-13205
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Major
>
> Current TcpDiscoverySpi can prolong detection of the failure of a node which 
> has several IP addresses. This happens because most of the timeouts like 
> _failureDetectionTimeout, sockTimeout, ackTimeout_ work per address. The 
> actual failure detection delay is _failureDetectionTimeout*addressesNumber_ 
> (1), and the node addresses are sorted out serially. This effect on failure 
> detection should be noted in logs and javadocs.
> *1: addressesNumber - the number of addresses of the next node in the ring.





[jira] [Updated] (IGNITE-13205) Represent in logs, javadoc affection of several node addresses on failure detection.

2020-07-02 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13205:
--
Description: 
Current TcpDiscoverySpi can prolong detection of the failure of a node which 
has several IP addresses. This happens because most of the timeouts like 
_failureDetectionTimeout, sockTimeout, ackTimeout_ work per address. The 
actual failure detection delay is _failureDetectionTimeout*addressesNumber_ 
(1), and the node addresses are sorted out serially. This effect on failure 
detection should be noted in logs and javadocs.

*1: addressesNumber - the number of addresses of the next node.

  was:Current TcpDiscoverySpi can prolong detection of the failure of a node 
which has several IP addresses. This happens because most of the timeouts 
like failureDetectionTimeout, sockTimeout, ackTimeout work per address. And 
the node addresses are sorted out serially. This effect on failure detection 
should be noted in logs and javadocs.


> Represent in logs, javadoc affection of several node addresses on failure 
> detection.
> 
>
> Key: IGNITE-13205
> URL: https://issues.apache.org/jira/browse/IGNITE-13205
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Major
>
> Current TcpDiscoverySpi can prolong detection of the failure of a node which 
> has several IP addresses. This happens because most of the timeouts like 
> _failureDetectionTimeout, sockTimeout, ackTimeout_ work per address. The 
> actual failure detection delay is _failureDetectionTimeout*addressesNumber_ 
> (1), and the node addresses are sorted out serially. This effect on failure 
> detection should be noted in logs and javadocs.
> *1: addressesNumber - the number of addresses of the next node.





[jira] [Updated] (IGNITE-13205) Represent in logs, javadoc affection of several node addresses on failure detection.

2020-07-02 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13205:
--
Description: Current TcpDiscoverySpi can prolong detection of the failure of 
a node which has several IP addresses. This happens because most of the 
timeouts like failureDetectionTimeout, sockTimeout, ackTimeout work per 
address. And the node addresses are sorted out serially. This effect on 
failure detection should be noted in logs and javadocs.  (was: Current 
TcpDiscoverySpi can prolong detection of the failure of a node which has 
several IP addresses. This happens because most of the timeouts like 
failureDetectionTimeout, sockTimeout, ackTimeout works per address. And the 
node addresses are sorted out serially. This effect on failure detection 
should be noted in logs and javadocs.)

> Represent in logs, javadoc affection of several node addresses on failure 
> detection.
> 
>
> Key: IGNITE-13205
> URL: https://issues.apache.org/jira/browse/IGNITE-13205
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Major
>
> Current TcpDiscoverySpi can prolong detection of the failure of a node which 
> has several IP addresses. This happens because most of the timeouts like 
> failureDetectionTimeout, sockTimeout, ackTimeout work per address. And the 
> node addresses are sorted out serially. This effect on failure detection 
> should be noted in logs and javadocs.





[jira] [Updated] (IGNITE-13205) Represent in logs, javadoc affection of several node addresses on failure detection.

2020-07-02 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13205:
--
Ignite Flags:   (was: Docs Required,Release Notes Required)

> Represent in logs, javadoc affection of several node addresses on failure 
> detection.
> 
>
> Key: IGNITE-13205
> URL: https://issues.apache.org/jira/browse/IGNITE-13205
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Major
>
> Current TcpDiscoverySpi can prolong detection of failure of a node that has 
> several IP addresses. This happens because most of the timeouts, like 
> failureDetectionTimeout, sockTimeout and ackTimeout, work per address, and the 
> node addresses are tried serially. This effect on failure detection should be 
> noted in the logs and javadocs.





[jira] [Created] (IGNITE-13205) Represent in logs, javadoc affection of several node addresses on failure detection.

2020-07-02 Thread Vladimir Steshin (Jira)
Vladimir Steshin created IGNITE-13205:
-

 Summary: Represent in logs, javadoc affection of several node 
addresses on failure detection.
 Key: IGNITE-13205
 URL: https://issues.apache.org/jira/browse/IGNITE-13205
 Project: Ignite
  Issue Type: Improvement
Reporter: Vladimir Steshin
Assignee: Vladimir Steshin


Current TcpDiscoverySpi can prolong detection of failure of a node that has several 
IP addresses. This happens because most of the timeouts, like 
failureDetectionTimeout, sockTimeout and ackTimeout, work per address, and the 
node addresses are tried serially. This effect on failure detection should be 
noted in the logs and javadocs.





[jira] [Updated] (IGNITE-13016) Fix backward checking of failed node.

2020-06-30 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13016:
--
Description: 
Backward node connection checking looks weird. What might be improved:

1) Address checking could be done in parallel, not serially:
{code:java}
for (InetSocketAddress addr : nodeAddrs) {
    // Connection refused may be received if the node doesn't listen
    // (or is blocked by a firewall, but in either case assume it is dead).
    if (!isConnectionRefused(addr)) {
        liveAddr = addr;

        break;
    }
}
{code}

2) Any IO exception should be considered a failed connection, not only 
connection-refused:
{code:java}
catch (ConnectException e) {
    return true;
}
catch (IOException e) {
    return false;
}
{code}

3) The timeout on connection checking should not be constant or hardcoded:
{code:java}
sock.connect(addr, 100);
{code}

4) The decision to check the connection should rely on the configured exchange 
timeout, not on the ping interval:

{code:java}
// We got a message from the previous node in less than double the connection check interval.
boolean ok = rcvdTime + U.millisToNanos(connCheckInterval) * 2 >= now;
{code}
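A minimal sketch of the parallel variant suggested in item 1. The isConnectionRefused() below is a stand-in with the same contract as the ServerImpl method, and firstLiveAddr() is a hypothetical helper, not Ignite code:

```java
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelAddrCheck {
    /** Stand-in: {@code true} only when nothing listens on the address (node is surely dead). */
    static boolean isConnectionRefused(InetSocketAddress addr, int timeoutMs) {
        try (Socket sock = new Socket()) {
            sock.connect(addr, timeoutMs);

            return false;
        }
        catch (java.net.ConnectException e) {
            return true;
        }
        catch (java.io.IOException e) {
            return false;
        }
    }

    /** Checks all addresses concurrently; returns the first live one, or {@code null}. */
    static InetSocketAddress firstLiveAddr(List<InetSocketAddress> addrs, int timeoutMs)
        throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(addrs.size());

        try {
            CompletionService<InetSocketAddress> svc = new ExecutorCompletionService<>(pool);

            // Submit one probe per address; all probes run at the same time.
            for (InetSocketAddress addr : addrs)
                svc.submit(() -> isConnectionRefused(addr, timeoutMs) ? null : addr);

            for (int i = 0; i < addrs.size(); i++) {
                InetSocketAddress live = svc.take().get();

                if (live != null)
                    return live; // First address that did not refuse the connection.
            }

            return null; // All addresses refused: consider the node dead.
        }
        catch (ExecutionException e) {
            return null;
        }
        finally {
            pool.shutdownNow();
        }
    }
}
```

With this approach the worst-case check time is roughly one timeout instead of timeout multiplied by the number of addresses.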





  was:
Backward node connection checking looks wierd. What might be improved are:

1) Addresses checking could be done in parrallel, not serializably
{code:java}
for (InetSocketAddress addr : nodeAddrs) {
// Connection refused may be got if node doesn't listen
// (or blocked by firewall, but anyway assume it is dead).
if (!isConnectionRefused(addr)) {
liveAddr = addr;

break;
}
}
{code}

2) Any io-exception should be considered as failed connection, not only 
connection-refused:
{code:java}
catch (ConnectException e) {
return true;
}
catch (IOException e) {
return false;
}
{code}

3) Timeout on connection checking should not be constand or hardcoced:
{code:java}
sock.connect(addr, 100);
{code}





> Fix backward checking of failed node.
> -
>
> Key: IGNITE-13016
> URL: https://issues.apache.org/jira/browse/IGNITE-13016
> Project: Ignite
>  Issue Type: Sub-task
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Major
>  Labels: iep-45
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Backward node connection checking looks weird. What might be improved:
> 1) Address checking could be done in parallel, not serially:
> {code:java}
> for (InetSocketAddress addr : nodeAddrs) {
>     // Connection refused may be received if the node doesn't listen
>     // (or is blocked by a firewall, but in either case assume it is dead).
>     if (!isConnectionRefused(addr)) {
>         liveAddr = addr;
>         break;
>     }
> }
> {code}
> 2) Any IO exception should be considered a failed connection, not only 
> connection-refused:
> {code:java}
> catch (ConnectException e) {
>     return true;
> }
> catch (IOException e) {
>     return false;
> }
> {code}
> 3) The timeout on connection checking should not be constant or hardcoded:
> {code:java}
> sock.connect(addr, 100);
> {code}
> 4) The decision to check the connection should rely on the configured exchange 
> timeout, not on the ping interval:
> {code:java}
> // We got a message from the previous node in less than double the connection check interval.
> boolean ok = rcvdTime + U.millisToNanos(connCheckInterval) * 2 >= now;
> {code}





[jira] [Updated] (IGNITE-13016) Fix backward checking of failed node.

2020-06-30 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13016:
--
Description: 
Backward node connection checking looks weird. What might be improved:

1) Address checking could be done in parallel, not serially:
{code:java}
for (InetSocketAddress addr : nodeAddrs) {
    // Connection refused may be received if the node doesn't listen
    // (or is blocked by a firewall, but in either case assume it is dead).
    if (!isConnectionRefused(addr)) {
        liveAddr = addr;

        break;
    }
}
{code}

2) Any IO exception should be considered a failed connection, not only 
connection-refused:
{code:java}
catch (ConnectException e) {
    return true;
}
catch (IOException e) {
    return false;
}
{code}

3) The timeout on connection checking should not be constant or hardcoded:
{code:java}
sock.connect(addr, 100);
{code}




  was:
We should fix several drawbacks in the backward checking of failed node. They 
prolong node failure detection upto: 
ServerImpl.CON_CHECK_INTERVAL + 2 * IgniteConfiguretion.failureDetectionTimeout 
+ 300ms. 

See:
* ‘_NodeFailureResearch.patch_'. It creates test 'FailureDetectionResearch' 
which emulates long answears on a failed node and measures failure detection 
delays.
* '_FailureDetectionResearch.txt_' - results of the test.
* '_FailureDetectionResearch_fixed.txt_' - results of the test after this fix.
* '_WostCaseStepByStep.txt_' - description how the worst case happens.


*Suggestions:*

1) We should replace hardcoded timeout 100ms with a parameter like 
failureDetectionTimeout:
{code:java}
private boolean ServerImpl.isConnectionRefused(SocketAddress addr) {
   ...
sock.connect(addr, 100); // Make it rely on failureDetectionTimeout.
   ...
}
{code}

2) Any negative result of the connection checking should be considered as node 
failed. Currently, we look only at refused connection. Any other exceptions, 
including a timeout, are treated as living connection: 

{code:java}
private boolean ServerImpl.isConnectionRefused(SocketAddress addr) {
   ...
   catch (ConnectException e) {
  return true;
   }
   catch (IOException e) {
  return false; // Make any error mean lost connection.
   }

   return false;
}
{code}

3) Maximal interval to check previous node should rely on actual failure 
detection timeout:
{code:java}
   TcpDiscoveryHandshakeResponse res = new TcpDiscoveryHandshakeResponse(...);
   ...
   // We got message from previous in less than double connection check 
interval.
   boolean ok = rcvdTime + CON_CHECK_INTERVAL * 2 >= now; // Here should be a 
timeout of failure detection.

   if (ok) {
  // Check case when previous node suddenly died. This will speed up
  // node failing.
  ...
}

res.previousNodeAlive(ok);
{code}



> Fix backward checking of failed node.
> -
>
> Key: IGNITE-13016
> URL: https://issues.apache.org/jira/browse/IGNITE-13016
> Project: Ignite
>  Issue Type: Sub-task
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Major
>  Labels: iep-45
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Backward node connection checking looks weird. What might be improved:
> 1) Address checking could be done in parallel, not serially:
> {code:java}
> for (InetSocketAddress addr : nodeAddrs) {
>     // Connection refused may be received if the node doesn't listen
>     // (or is blocked by a firewall, but in either case assume it is dead).
>     if (!isConnectionRefused(addr)) {
>         liveAddr = addr;
>         break;
>     }
> }
> {code}
> 2) Any IO exception should be considered a failed connection, not only 
> connection-refused:
> {code:java}
> catch (ConnectException e) {
>     return true;
> }
> catch (IOException e) {
>     return false;
> }
> {code}
> 3) The timeout on connection checking should not be constant or hardcoded:
> {code:java}
> sock.connect(addr, 100);
> {code}





[jira] [Updated] (IGNITE-13016) Fix backward checking of failed node.

2020-06-30 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13016:
--
Attachment: (was: WostCaseStepByStep.txt)

> Fix backward checking of failed node.
> -
>
> Key: IGNITE-13016
> URL: https://issues.apache.org/jira/browse/IGNITE-13016
> Project: Ignite
>  Issue Type: Sub-task
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Major
>  Labels: iep-45
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We should fix several drawbacks in the backward checking of a failed node. They 
> prolong node failure detection up to: 
> ServerImpl.CON_CHECK_INTERVAL + 2 * 
> IgniteConfiguration.failureDetectionTimeout + 300ms. 
> See:
> * '_NodeFailureResearch.patch_'. It creates the test 'FailureDetectionResearch', 
> which emulates long answers from a failed node and measures failure detection 
> delays.
> * '_FailureDetectionResearch.txt_' - results of the test.
> * '_FailureDetectionResearch_fixed.txt_' - results of the test after this fix.
> * '_WostCaseStepByStep.txt_' - description of how the worst case happens.
> *Suggestions:*
> 1) We should replace hardcoded timeout 100ms with a parameter like 
> failureDetectionTimeout:
> {code:java}
> private boolean ServerImpl.isConnectionRefused(SocketAddress addr) {
>...
> sock.connect(addr, 100); // Make it rely on failureDetectionTimeout.
>...
> }
> {code}
> 2) Any negative result of the connection checking should be considered a 
> node failure. Currently, we look only at a refused connection. Any other 
> exception, including a timeout, is treated as a live connection: 
> {code:java}
> private boolean ServerImpl.isConnectionRefused(SocketAddress addr) {
>...
>catch (ConnectException e) {
>   return true;
>}
>catch (IOException e) {
>   return false; // Make any error mean lost connection.
>}
>return false;
> }
> {code}
> 3) The maximal interval to check the previous node should rely on the actual 
> failure detection timeout:
> {code:java}
>TcpDiscoveryHandshakeResponse res = new TcpDiscoveryHandshakeResponse(...);
>...
>// We got message from previous in less than double connection check 
> interval.
>boolean ok = rcvdTime + CON_CHECK_INTERVAL * 2 >= now; // Here should be a 
> timeout of failure detection.
>if (ok) {
>   // Check case when previous node suddenly died. This will speed up
>   // node failing.
>   ...
> }
> res.previousNodeAlive(ok);
> {code}





[jira] [Updated] (IGNITE-13016) Fix backward checking of failed node.

2020-06-30 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13016:
--
Attachment: (was: FailureDetectionResearch.txt)

> Fix backward checking of failed node.
> -
>
> Key: IGNITE-13016
> URL: https://issues.apache.org/jira/browse/IGNITE-13016
> Project: Ignite
>  Issue Type: Sub-task
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Major
>  Labels: iep-45
> Attachments: WostCaseStepByStep.txt
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We should fix several drawbacks in the backward checking of a failed node. They 
> prolong node failure detection up to: 
> ServerImpl.CON_CHECK_INTERVAL + 2 * 
> IgniteConfiguration.failureDetectionTimeout + 300ms. 
> See:
> * '_NodeFailureResearch.patch_'. It creates the test 'FailureDetectionResearch', 
> which emulates long answers from a failed node and measures failure detection 
> delays.
> * '_FailureDetectionResearch.txt_' - results of the test.
> * '_FailureDetectionResearch_fixed.txt_' - results of the test after this fix.
> * '_WostCaseStepByStep.txt_' - description of how the worst case happens.
> *Suggestions:*
> 1) We should replace hardcoded timeout 100ms with a parameter like 
> failureDetectionTimeout:
> {code:java}
> private boolean ServerImpl.isConnectionRefused(SocketAddress addr) {
>...
> sock.connect(addr, 100); // Make it rely on failureDetectionTimeout.
>...
> }
> {code}
> 2) Any negative result of the connection checking should be considered a 
> node failure. Currently, we look only at a refused connection. Any other 
> exception, including a timeout, is treated as a live connection: 
> {code:java}
> private boolean ServerImpl.isConnectionRefused(SocketAddress addr) {
>...
>catch (ConnectException e) {
>   return true;
>}
>catch (IOException e) {
>   return false; // Make any error mean lost connection.
>}
>return false;
> }
> {code}
> 3) The maximal interval to check the previous node should rely on the actual 
> failure detection timeout:
> {code:java}
>TcpDiscoveryHandshakeResponse res = new TcpDiscoveryHandshakeResponse(...);
>...
>// We got message from previous in less than double connection check 
> interval.
>boolean ok = rcvdTime + CON_CHECK_INTERVAL * 2 >= now; // Here should be a 
> timeout of failure detection.
>if (ok) {
>   // Check case when previous node suddenly died. This will speed up
>   // node failing.
>   ...
> }
> res.previousNodeAlive(ok);
> {code}





[jira] [Updated] (IGNITE-13016) Fix backward checking of failed node.

2020-06-30 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13016:
--
Attachment: (was: FailureDetectionResearch.patch)

> Fix backward checking of failed node.
> -
>
> Key: IGNITE-13016
> URL: https://issues.apache.org/jira/browse/IGNITE-13016
> Project: Ignite
>  Issue Type: Sub-task
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Major
>  Labels: iep-45
> Attachments: WostCaseStepByStep.txt
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We should fix several drawbacks in the backward checking of a failed node. They 
> prolong node failure detection up to: 
> ServerImpl.CON_CHECK_INTERVAL + 2 * 
> IgniteConfiguration.failureDetectionTimeout + 300ms. 
> See:
> * '_NodeFailureResearch.patch_'. It creates the test 'FailureDetectionResearch', 
> which emulates long answers from a failed node and measures failure detection 
> delays.
> * '_FailureDetectionResearch.txt_' - results of the test.
> * '_FailureDetectionResearch_fixed.txt_' - results of the test after this fix.
> * '_WostCaseStepByStep.txt_' - description of how the worst case happens.
> *Suggestions:*
> 1) We should replace hardcoded timeout 100ms with a parameter like 
> failureDetectionTimeout:
> {code:java}
> private boolean ServerImpl.isConnectionRefused(SocketAddress addr) {
>...
> sock.connect(addr, 100); // Make it rely on failureDetectionTimeout.
>...
> }
> {code}
> 2) Any negative result of the connection checking should be considered a 
> node failure. Currently, we look only at a refused connection. Any other 
> exception, including a timeout, is treated as a live connection: 
> {code:java}
> private boolean ServerImpl.isConnectionRefused(SocketAddress addr) {
>...
>catch (ConnectException e) {
>   return true;
>}
>catch (IOException e) {
>   return false; // Make any error mean lost connection.
>}
>return false;
> }
> {code}
> 3) The maximal interval to check the previous node should rely on the actual 
> failure detection timeout:
> {code:java}
>TcpDiscoveryHandshakeResponse res = new TcpDiscoveryHandshakeResponse(...);
>...
>// We got message from previous in less than double connection check 
> interval.
>boolean ok = rcvdTime + CON_CHECK_INTERVAL * 2 >= now; // Here should be a 
> timeout of failure detection.
>if (ok) {
>   // Check case when previous node suddenly died. This will speed up
>   // node failing.
>   ...
> }
> res.previousNodeAlive(ok);
> {code}





[jira] [Updated] (IGNITE-13016) Fix backward checking of failed node.

2020-06-30 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13016:
--
Attachment: (was: FailureDetectionResearch_fixed.txt)

> Fix backward checking of failed node.
> -
>
> Key: IGNITE-13016
> URL: https://issues.apache.org/jira/browse/IGNITE-13016
> Project: Ignite
>  Issue Type: Sub-task
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Major
>  Labels: iep-45
> Attachments: WostCaseStepByStep.txt
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We should fix several drawbacks in the backward checking of a failed node. They 
> prolong node failure detection up to: 
> ServerImpl.CON_CHECK_INTERVAL + 2 * 
> IgniteConfiguration.failureDetectionTimeout + 300ms. 
> See:
> * '_NodeFailureResearch.patch_'. It creates the test 'FailureDetectionResearch', 
> which emulates long answers from a failed node and measures failure detection 
> delays.
> * '_FailureDetectionResearch.txt_' - results of the test.
> * '_FailureDetectionResearch_fixed.txt_' - results of the test after this fix.
> * '_WostCaseStepByStep.txt_' - description of how the worst case happens.
> *Suggestions:*
> 1) We should replace hardcoded timeout 100ms with a parameter like 
> failureDetectionTimeout:
> {code:java}
> private boolean ServerImpl.isConnectionRefused(SocketAddress addr) {
>...
> sock.connect(addr, 100); // Make it rely on failureDetectionTimeout.
>...
> }
> {code}
> 2) Any negative result of the connection checking should be considered a 
> node failure. Currently, we look only at a refused connection. Any other 
> exception, including a timeout, is treated as a live connection: 
> {code:java}
> private boolean ServerImpl.isConnectionRefused(SocketAddress addr) {
>...
>catch (ConnectException e) {
>   return true;
>}
>catch (IOException e) {
>   return false; // Make any error mean lost connection.
>}
>return false;
> }
> {code}
> 3) The maximal interval to check the previous node should rely on the actual 
> failure detection timeout:
> {code:java}
>TcpDiscoveryHandshakeResponse res = new TcpDiscoveryHandshakeResponse(...);
>...
>// We got message from previous in less than double connection check 
> interval.
>boolean ok = rcvdTime + CON_CHECK_INTERVAL * 2 >= now; // Here should be a 
> timeout of failure detection.
>if (ok) {
>   // Check case when previous node suddenly died. This will speed up
>   // node failing.
>   ...
> }
> res.previousNodeAlive(ok);
> {code}





[jira] [Commented] (IGNITE-13134) Fix connection recovery timout.

2020-06-30 Thread Vladimir Steshin (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-13134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148574#comment-17148574
 ] 

Vladimir Steshin commented on IGNITE-13134:
---

The patch creates 
{code:java}
JmhNodeFailureDetection
{code}
and
{code:java}
TcpDiscoveryNetworkIssuesTest.testConnectionRecoveryTimeoutSmallValues()
TcpDiscoveryNetworkIssuesTest.testConnectionRecoveryTimeoutMediumValues()
TcpDiscoveryNetworkIssuesTest.testConnectionRecoveryTimeoutLongValues()
{code}

The benchmark shows the delay of node segmentation. Expected: 
failureDetectionTimeout + connRecoveryTimeout. 
You can find it in the output (example):
Master: Detection delay: *1923*. Failure detection timeout: 1000, connection 
recovery timeout: 500
Fixed: Detection delay: 1408. Failure detection timeout: 1000, connection 
recovery timeout: 500
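The expected figure quoted above is just the sum of the two timeouts; a trivial arithmetic check (the helper name is ours, not Ignite API):

```java
public class SegmentationDelayBound {
    /** Expected upper bound on segmentation delay: detection time plus recovery time. */
    static long expectedBoundMs(long failureDetectionTimeoutMs, long connRecoveryTimeoutMs) {
        return failureDetectionTimeoutMs + connRecoveryTimeoutMs;
    }

    public static void main(String[] args) {
        // Values from the example output above.
        long bound = expectedBoundMs(1_000, 500);

        System.out.println(bound); // 1500: master's 1923 ms exceeds it, the fixed 1408 ms fits.
    }
}
```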

> Fix connection recovery timout.
> ---
>
> Key: IGNITE-13134
> URL: https://issues.apache.org/jira/browse/IGNITE-13134
> Project: Ignite
>  Issue Type: Improvement
>Affects Versions: 2.8.1
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Major
>  Labels: iep-45
> Attachments: IGNITE-130134-patch.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> If a node experiences connection issues, it must establish a new connection or 
> fail within failureDetectionTimeout + connectionRecoveryTimeout.





[jira] [Updated] (IGNITE-13134) Fix connection recovery timout.

2020-06-30 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13134:
--
Ignite Flags:   (was: Release Notes Required)

> Fix connection recovery timout.
> ---
>
> Key: IGNITE-13134
> URL: https://issues.apache.org/jira/browse/IGNITE-13134
> Project: Ignite
>  Issue Type: Improvement
>Affects Versions: 2.8.1
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Major
>  Labels: iep-45
> Attachments: IGNITE-130134-patch.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> If a node experiences connection issues, it must establish a new connection or 
> fail within failureDetectionTimeout + connectionRecoveryTimeout.





[jira] [Updated] (IGNITE-13194) Fix testNodeWithIncompatibleMetadataIsProhibitedToJoinTheCluster()

2020-06-30 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13194:
--
Ignite Flags:   (was: Docs Required,Release Notes Required)

> Fix testNodeWithIncompatibleMetadataIsProhibitedToJoinTheCluster()
> --
>
> Key: IGNITE-13194
> URL: https://issues.apache.org/jira/browse/IGNITE-13194
> Project: Ignite
>  Issue Type: Bug
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Critical
> Fix For: 2.9
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Test
> {code:java}
> IgnitePdsBinaryMetadataOnClusterRestartTest.testNodeWithIncompatibleMetadataIsProhibitedToJoinTheCluster()
> {code}
> fails in master: a changed error message is checked incorrectly in the test. 
> The check became incorrect after IGNITE-13154.





[jira] [Comment Edited] (IGNITE-13194) Fix testNodeWithIncompatibleMetadataIsProhibitedToJoinTheCluster()

2020-06-30 Thread Vladimir Steshin (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-13194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148470#comment-17148470
 ] 

Vladimir Steshin edited comment on IGNITE-13194 at 6/30/20, 9:29 AM:
-

[~tledkov-gridgain], it looks like IGNITE-13154 caused this one. Could you 
please take a look? Is the fix correct?


was (Author: vladsz83):
[~tledkov-gridgain], could you, please, take a look. Is the fix correct?

> Fix testNodeWithIncompatibleMetadataIsProhibitedToJoinTheCluster()
> --
>
> Key: IGNITE-13194
> URL: https://issues.apache.org/jira/browse/IGNITE-13194
> Project: Ignite
>  Issue Type: Bug
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Critical
> Fix For: 2.9
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Test
> {code:java}
> IgnitePdsBinaryMetadataOnClusterRestartTest.testNodeWithIncompatibleMetadataIsProhibitedToJoinTheCluster()
> {code}
> fails in master: a changed error message is checked incorrectly in the test. 
> The check became incorrect after IGNITE-13154.





[jira] [Commented] (IGNITE-13194) Fix testNodeWithIncompatibleMetadataIsProhibitedToJoinTheCluster()

2020-06-30 Thread Vladimir Steshin (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-13194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148470#comment-17148470
 ] 

Vladimir Steshin commented on IGNITE-13194:
---

[~tledkov-gridgain], could you please take a look? Is the fix correct?

> Fix testNodeWithIncompatibleMetadataIsProhibitedToJoinTheCluster()
> --
>
> Key: IGNITE-13194
> URL: https://issues.apache.org/jira/browse/IGNITE-13194
> Project: Ignite
>  Issue Type: Bug
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Critical
> Fix For: 2.9
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Test
> {code:java}
> IgnitePdsBinaryMetadataOnClusterRestartTest.testNodeWithIncompatibleMetadataIsProhibitedToJoinTheCluster()
> {code}
> fails in master: a changed error message is checked incorrectly in the test. 
> The check became incorrect after IGNITE-13154.





[jira] [Updated] (IGNITE-13194) Fix testNodeWithIncompatibleMetadataIsProhibitedToJoinTheCluster()

2020-06-30 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13194:
--
Description: 
Test
{code:java}
IgnitePdsBinaryMetadataOnClusterRestartTest.testNodeWithIncompatibleMetadataIsProhibitedToJoinTheCluster()
{code}
fails in master: a changed error message is checked incorrectly in the test. 
The check became incorrect after IGNITE-13154.


  was:
Test
{code:java}
IgnitePdsBinaryMetadataOnClusterRestartTest.testNodeWithIncompatibleMetadataIsProhibitedToJoinTheCluster()
{code}
fails in master. Changed error message is incorrectly checked in the test. 
Became incorrect in IGNITE-13154.

[~tledkov-gridgain], could you, please, take a look. Is the fix correct?


> Fix testNodeWithIncompatibleMetadataIsProhibitedToJoinTheCluster()
> --
>
> Key: IGNITE-13194
> URL: https://issues.apache.org/jira/browse/IGNITE-13194
> Project: Ignite
>  Issue Type: Bug
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Critical
> Fix For: 2.9
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Test
> {code:java}
> IgnitePdsBinaryMetadataOnClusterRestartTest.testNodeWithIncompatibleMetadataIsProhibitedToJoinTheCluster()
> {code}
> fails in master: a changed error message is checked incorrectly in the test. 
> The check became incorrect after IGNITE-13154.





[jira] [Updated] (IGNITE-13194) Fix testNodeWithIncompatibleMetadataIsProhibitedToJoinTheCluster()

2020-06-30 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13194:
--
Description: 
Test
{code:java}
IgnitePdsBinaryMetadataOnClusterRestartTest.testNodeWithIncompatibleMetadataIsProhibitedToJoinTheCluster()
{code}
fails in master: a changed error message is checked incorrectly in the test. 
The check became incorrect after IGNITE-13154.

[~tledkov-gridgain], could you please take a look? Is the fix correct?

  was:
Test
{code:java}
IgnitePdsBinaryMetadataOnClusterRestartTest.testNodeWithIncompatibleMetadataIsProhibitedToJoinTheCluster()
{code}
fails in master. Changed error message is incorrectly checked in the test.



> Fix testNodeWithIncompatibleMetadataIsProhibitedToJoinTheCluster()
> --
>
> Key: IGNITE-13194
> URL: https://issues.apache.org/jira/browse/IGNITE-13194
> Project: Ignite
>  Issue Type: Bug
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Critical
> Fix For: 2.9
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Test
> {code:java}
> IgnitePdsBinaryMetadataOnClusterRestartTest.testNodeWithIncompatibleMetadataIsProhibitedToJoinTheCluster()
> {code}
> fails in master: a changed error message is checked incorrectly in the test. 
> The check became incorrect after IGNITE-13154.
> [~tledkov-gridgain], could you please take a look? Is the fix correct?





[jira] [Updated] (IGNITE-13194) Fix testNodeWithIncompatibleMetadataIsProhibitedToJoinTheCluster()

2020-06-29 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13194:
--
Description: 
Test
{code:java}
IgnitePdsBinaryMetadataOnClusterRestartTest.testNodeWithIncompatibleMetadataIsProhibitedToJoinTheCluster()
{code}
fails in master: a changed error message is checked incorrectly in the test.


> Fix testNodeWithIncompatibleMetadataIsProhibitedToJoinTheCluster()
> --
>
> Key: IGNITE-13194
> URL: https://issues.apache.org/jira/browse/IGNITE-13194
> Project: Ignite
>  Issue Type: Bug
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Critical
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Test
> {code:java}
> IgnitePdsBinaryMetadataOnClusterRestartTest.testNodeWithIncompatibleMetadataIsProhibitedToJoinTheCluster()
> {code}
> fails in master: a changed error message is checked incorrectly in the test.





[jira] [Created] (IGNITE-13194) Fix testNodeWithIncompatibleMetadataIsProhibitedToJoinTheCluster()

2020-06-29 Thread Vladimir Steshin (Jira)
Vladimir Steshin created IGNITE-13194:
-

 Summary: Fix 
testNodeWithIncompatibleMetadataIsProhibitedToJoinTheCluster()
 Key: IGNITE-13194
 URL: https://issues.apache.org/jira/browse/IGNITE-13194
 Project: Ignite
  Issue Type: Bug
Reporter: Vladimir Steshin
Assignee: Vladimir Steshin








[jira] [Resolved] (IGNITE-13090) Add parameter of connection check period to TcpDiscoverySpi

2020-06-25 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin resolved IGNITE-13090.
---
Resolution: Won't Fix

Another solution is implemented in IGNITE-13012: the period is part of the message 
exchange timeout.

> Add parameter of connection check period to TcpDiscoverySpi
> ---
>
> Key: IGNITE-13090
> URL: https://issues.apache.org/jira/browse/IGNITE-13090
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Minor
>
> We should add a connection check period parameter to TcpDiscoverySpi. If it 
> isn't automatically set by IgniteConfiguration.setFailureDetectionTimeout(), 
> the user should be able to tune it. 
> Similar params:
> {code:java}
> TcpDiscoverySpi.setReconnectCount()
> TcpDiscoverySpi.setAckTimeout()
> TcpDiscoverySpi.setSocketTimeout()
> {code}





[jira] [Closed] (IGNITE-13090) Add parameter of connection check period to TcpDiscoverySpi

2020-06-25 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin closed IGNITE-13090.
-
Ignite Flags:   (was: Docs Required,Release Notes Required)



[jira] [Updated] (IGNITE-13134) Fix connection recovery timeout.

2020-06-23 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13134:
--
Attachment: IGNITE-130134-patch.patch

> Fix connection recovery timeout.
> ---
>
> Key: IGNITE-13134
> URL: https://issues.apache.org/jira/browse/IGNITE-13134
> Project: Ignite
>  Issue Type: Improvement
>Affects Versions: 2.8.1
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Major
>  Labels: iep-45
> Attachments: IGNITE-130134-patch.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> If a node experiences connection issues, it must establish a new connection 
> or fail within failureDetectionTimeout + connectionRecoveryTimeout.
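The invariant above can be stated as a one-line bound; this is only a sketch of the arithmetic, and the 10-second values are illustrative, not Ignite defaults:

```java
public class RecoveryDeadline {
    // Worst-case bound from the ticket: under connection trouble a node
    // must either restore the ring connection or fail within this time.
    static long worstCaseMillis(long failureDetectionTimeout, long connRecoveryTimeout) {
        return failureDetectionTimeout + connRecoveryTimeout;
    }

    public static void main(String[] args) {
        // Illustrative values: 10 s detection timeout, 10 s recovery timeout.
        System.out.println(worstCaseMillis(10_000, 10_000)); // 20000
    }
}
```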





[jira] [Updated] (IGNITE-13134) Fix connection recovery timeout.

2020-06-23 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13134:
--
Attachment: (was: IGNITE-130134-patch.patch)



[jira] [Updated] (IGNITE-13134) Fix connection recovery timeout.

2020-06-23 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13134:
--
Attachment: IGNITE-130134-patch.patch



[jira] [Updated] (IGNITE-13012) Fix failure detection timeout. Simplify node ping routine.

2020-06-22 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13012:
--
Attachment: (was: IGNITE-13012-patch.patch)

> Fix failure detection timeout. Simplify node ping routine.
> --
>
> Key: IGNITE-13012
> URL: https://issues.apache.org/jira/browse/IGNITE-13012
> Project: Ignite
>  Issue Type: Improvement
>Affects Versions: 2.8.1
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Major
>  Labels: iep-45
> Attachments: IGNITE-13012-patch.patch
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> Connection failure may not be detected within 
> IgniteConfiguration.failureDetectionTimeout. The actual worst-case delay is 
> ServerImpl.CON_CHECK_INTERVAL + IgniteConfiguration.failureDetectionTimeout. 
> The node ping routine is duplicated.
> We should fix:
> 1. The failure detection timeout should take into account the last sent 
> message. The current ping is bound to its own timer:
> {code:java}ServerImpl.RingMessageWorker.lastTimeConnCheckMsgSent{code}
> This is weird because any discovery message checks the connection. 
> 2. Make the connection check interval depend on the failure detection 
> timeout (FDT). The current value is a constant:
> {code:java}static int ServerImpl.CON_CHECK_INTERVAL = 500{code}
> 3. Remove the additional, quickened connection checking. Once we do fix 1, 
> this becomes redundant.
> Although TCP discovery has a connection check period, it may send a ping 
> before this period expires. This premature ping also relies on the time of 
> any received message, for no clear reason. 
> 4. Do not alarm the user with “Node seems disconnected” when everything is 
> OK. Once we do fixes 1 and 3, this also becomes redundant. 
> The node may log at INFO: “Local node seems to be disconnected from topology …” 
> whereas it is not actually disconnected at all.
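Point 2 above can be sketched as a small derivation: bound the check interval by the failure detection timeout instead of hard-coding 500 ms. The cap and the divisor below are assumptions for illustration, not the actual Ignite fix:

```java
public class ConnCheckInterval {
    // Legacy behavior: a fixed 500 ms constant, independent of the
    // configured failure detection timeout.
    static final long LEGACY_INTERVAL_MS = 500;

    // Sketch of the proposed fix: never check slower than the legacy
    // default, but check more often when the detection timeout is short,
    // so several checks always fit inside one detection window.
    static long connCheckInterval(long failureDetectionTimeout) {
        return Math.min(LEGACY_INTERVAL_MS, failureDetectionTimeout / 3);
    }

    public static void main(String[] args) {
        System.out.println(connCheckInterval(10_000)); // 500
        System.out.println(connCheckInterval(900));    // 300
    }
}
```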





[jira] [Updated] (IGNITE-13012) Fix failure detection timeout. Simplify node ping routine.

2020-06-22 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13012:
--
Attachment: IGNITE-13012-patch.patch



[jira] [Updated] (IGNITE-13012) Fix failure detection timeout. Simplify node ping routine.

2020-06-22 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13012:
--
Attachment: IGNITE-13012-patch.patch



[jira] [Updated] (IGNITE-13012) Fix failure detection timeout. Simplify node ping routine.

2020-06-22 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13012:
--
Attachment: (was: IGNITE-13012-patch.patch)



[jira] [Updated] (IGNITE-13012) Fix failure detection timeout. Simplify node ping routine.

2020-06-19 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13012:
--
Description: 
Connection failure may not be detected within 
IgniteConfiguration.failureDetectionTimeout. The actual worst-case delay is 
ServerImpl.CON_CHECK_INTERVAL + IgniteConfiguration.failureDetectionTimeout. 
The node ping routine is duplicated.

We should fix:

1. The failure detection timeout should take into account the last sent 
message. The current ping is bound to its own timer:
{code:java}ServerImpl.RingMessageWorker.lastTimeConnCheckMsgSent{code}
This is weird because any discovery message checks the connection. 

2. Make the connection check interval depend on the failure detection timeout 
(FDT). The current value is a constant:
{code:java}static int ServerImpl.CON_CHECK_INTERVAL = 500{code}

3. Remove the additional, quickened connection checking. Once we do fix 1, 
this becomes redundant.
Although TCP discovery has a connection check period, it may send a ping 
before this period expires. This premature ping also relies on the time of any 
received message, for no clear reason. 

4. Do not alarm the user with “Node seems disconnected” when everything is OK. 
Once we do fixes 1 and 3, this also becomes redundant. 
The node may log at INFO: “Local node seems to be disconnected from topology …” 
whereas it is not actually disconnected at all.

  was:
Connection failure may not be detected within 
IgniteConfiguration.failureDetectionTimeout. The actual worst-case delay is 
ServerImpl.CON_CHECK_INTERVAL + IgniteConfiguration.failureDetectionTimeout. 
The node ping routine is duplicated.

We should fix:

1. The failure detection timeout should take into account the last sent 
message. The current ping is bound to its own timer:
{code:java}ServerImpl.RingMessageWorker.lastTimeConnCheckMsgSent{code}
This is weird because any discovery message checks the connection. 

2. Make the connection check interval depend on the failure detection timeout 
(FDT). The current value is a constant:
{code:java}static int ServerImpl.CON_CHECK_INTERVAL = 500{code}

3. Remove the additional, quickened connection checking. Once we do fix 1, 
this becomes redundant.
Although TCP discovery has a connection check period, it may send a ping 
before this period expires. This premature node ping relies on the time of any 
sent, or even any received, message. 

4. Do not alarm the user with “Node seems disconnected” when everything is OK. 
Once we do fixes 1 and 3, this also becomes redundant. 
The node may log at INFO: “Local node seems to be disconnected from topology …” 
whereas it is not actually disconnected at all.




[jira] [Updated] (IGNITE-13012) Fix failure detection timeout. Simplify node ping routine.

2020-06-18 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13012:
--
Attachment: (was: IGNITE-13012-patch.patch)



[jira] [Updated] (IGNITE-13012) Fix failure detection timeout. Simplify node ping routine.

2020-06-18 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13012:
--
Attachment: IGNITE-13012-patch.patch



[jira] [Updated] (IGNITE-13012) Fix failure detection timeout. Simplify node ping routine.

2020-06-18 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13012:
--
Attachment: IGNITE-13012-patch.patch



[jira] [Updated] (IGNITE-13012) Fix failure detection timeout. Simplify node ping routine.

2020-06-18 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13012:
--
Attachment: (was: IGNITE-13012-patch.patch)



[jira] [Updated] (IGNITE-13012) Fix failure detection timeout. Simplify node ping routine.

2020-06-18 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13012:
--
Attachment: IGNITE-13012-patch.patch



[jira] [Updated] (IGNITE-13012) Fix failure detection timeout. Simplify node ping routine.

2020-06-18 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13012:
--
Attachment: (was: IGNITE-13012-patch.patch)



[jira] [Updated] (IGNITE-13012) Fix failure detection timeout. Simplify node ping routine.

2020-06-18 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13012:
--
Attachment: (was: IGNITE-13012-patch.patch)



[jira] [Updated] (IGNITE-13012) Fix failure detection timeout. Simplify node ping routine.

2020-06-18 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13012:
--
Attachment: IGNITE-13012-patch.patch

> Fix failure detection timeout. Simplify node ping routine.
> --
>
> Key: IGNITE-13012
> URL: https://issues.apache.org/jira/browse/IGNITE-13012
> Project: Ignite
>  Issue Type: Improvement
>Affects Versions: 2.8.1
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Major
>  Labels: iep-45
> Attachments: IGNITE-13012-patch.patch
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> Connection failure may not be detected within 
> IgniteConfiguration.failureDetectionTimeout. The actual worst-case delay is 
> ServerImpl.CON_CHECK_INTERVAL + IgniteConfiguration.failureDetectionTimeout, 
> and the node ping routine is duplicated.
> We should fix:
> 1. The failure detection timeout should take into account the last sent message. 
> The current ping is bound to its own timer:
> {code:java}ServerImpl.RingMessageWorker.lastTimeConnCheckMsgSent{code}
> This is odd because any discovery message checks the connection. 
> 2. Make the connection check interval depend on the failure detection timeout (FDT). 
> The current value is a constant:
> {code:java}static int ServerImpl.CON_CHECK_INTERVAL = 500{code}
> 3. Remove the additional, quickened connection checking. Once we fix point 1, this 
> becomes unnecessary.
> Although TCP discovery has a connection-check period, it may send a ping 
> before this period elapses. This premature node ping relies on the time of 
> any sent, or even any received, message. 
> 4. Do not alarm the user with “Node seems disconnected” when everything is OK. 
> Once we fix points 1 and 3, this message becomes redundant. 
> A node may log at INFO: “Local node seems to be disconnected from topology …” 
> whereas it is not actually disconnected at all.
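
Point 2 above could be sketched as deriving the check interval from the configured timeout instead of the fixed 500 ms constant. The divisor and the 50 ms floor below are illustrative assumptions, not the actual Ignite formula:

```java
// Hypothetical sketch: a connection-check interval derived from
// failureDetectionTimeout instead of the constant CON_CHECK_INTERVAL = 500.
public class ConnCheckIntervalSketch {
    /** Check several times within the timeout, but never more often than every 50 ms. */
    static long connCheckInterval(long failureDetectionTimeoutMs) {
        return Math.max(50, failureDetectionTimeoutMs / 3);
    }

    public static void main(String[] args) {
        // With Ignite's default 10_000 ms failure detection timeout this
        // yields a check roughly every 3333 ms; tiny timeouts are floored.
        System.out.println(connCheckInterval(10_000)); // prints 3333
        System.out.println(connCheckInterval(90));     // prints 50
    }
}
```

With such a rule, shortening failureDetectionTimeout automatically makes the liveness check more frequent, instead of the check lagging behind a small timeout.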



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (IGNITE-13111) Simplify backward checking of node connection.

2020-06-17 Thread Vladimir Steshin (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-13111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17138300#comment-17138300
 ] 

Vladimir Steshin edited comment on IGNITE-13111 at 6/17/20, 10:06 AM:
--

I find IGNITE-13016 a better solution. We cannot rely on the ping interval because 
two nodes are involved in backward connection checking, and they work with the same 
but shifted ping intervals. If node N asks N+2 to check N+1, N+2 waits for the rest 
of its failureDetectionTimeout. But the ping and failureDetectionTimeout on N are 
shifted in comparison with N+2, so N can fail before N+2 has finished waiting for a 
ping from N+1.
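
The shifted-timer race can be illustrated with a small sketch (all timestamps below are made-up values for illustration, not taken from ServerImpl):

```java
// Illustration of the race described above: N's failure-detection window
// and N+2's backward-check wait start at different (shifted) moments.
public class ShiftedTimersSketch {
    public static void main(String[] args) {
        long fdt = 1_000;               // failureDetectionTimeout, same on all nodes
        long lastMsgSeenByN = 0;        // N last heard from its neighbours at t=0
        long backwardCheckStart = 600;  // N+2 starts waiting for N+1's ping at t=600

        long nGivesUpAt = lastMsgSeenByN + fdt;           // t=1000: N's window expires
        long n2StopsWaitingAt = backwardCheckStart + fdt; // t=1600: N+2 finishes its wait

        // N can fail before N+2 has finished waiting for a ping from N+1.
        System.out.println(nGivesUpAt < n2StopsWaitingAt); // prints true
    }
}
```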


was (Author: vladsz83):
I find IGNITE-13016 or IGNITE-13014 a better solution. We cannot rely on the ping 
interval because two nodes are involved in backward connection checking, and they 
work with the same but shifted ping intervals. If node N asks N+2 to check N+1, N+2 
waits for the rest of its failureDetectionTimeout. But the ping and 
failureDetectionTimeout on N are shifted in comparison with N+2, so N can fail 
before N+2 has finished waiting for a ping from N+1.

> Simplify backward checking of node connection.
> --
>
> Key: IGNITE-13111
> URL: https://issues.apache.org/jira/browse/IGNITE-13111
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Major
>  Labels: iep-45
> Attachments: FailureDetectionResearch.patch, 
> FailureDetectionResearch.txt, FailureDetectionResearch_fixed.txt, 
> WostCaseStepByStep.txt
>
>
> We should fix several drawbacks in the backward checking of a failed node. They 
> prolong node failure detection up to: 
> ServerImpl.CON_CHECK_INTERVAL + 2 * 
> IgniteConfiguration.failureDetectionTimeout + 300 ms. 
> See:
> * '_NodeFailureResearch.patch_'. It creates the test 'FailureDetectionResearch', 
> which emulates long answers on a failed node and measures failure detection 
> delays.
> * '_FailureDetectionResearch.txt_' - results of the test.
> * '_FailureDetectionResearch_fixed.txt_' - results of the test after this fix.
> * '_WostCaseStepByStep.txt_' - description of how the worst case happens.
> *Suggestion:*
> 1) We can simplify backward connection checking once we implement IGNITE-13012. 
> Once we get a robust, predictable connection ping, we don't need to check the 
> previous node, because we can see whether it sent a ping to the current node 
> within the failure detection timeout. If not, the previous node can be 
> considered lost.
> Instead of:
> {code:java}
> // Node cannot connect to its next (for the local node, that is its previous).
> // Need to check connectivity to it.
> long rcvdTime = lastRingMsgReceivedTime;
> long now = U.currentTimeMillis();
>
> // We got a message from the previous node in less than double the
> // connection check interval.
> boolean ok = rcvdTime + effectiveExchangeTimeout() >= now;
>
> TcpDiscoveryNode previous = null;
>
> if (ok) {
>     // Check case when the previous node suddenly died. This will speed up
>     // node failing.
>     ... // checking connection to the previous node
> }
> {code}
> we could wait for a ping from the previous node. Scenario:
> * n1 (Node1) failed to connect to n2.
> * n1 asks n3 to establish a connection instead of n2.
> * n3 waits for a ping from n2 for the rest of the failure detection timeout.
> * If n3 receives a ping from n2, it connects with n1, or answers n1 that n2 is 
> considered alive.
> 2) Then it seems we can remove:
> {code:java}
> ServerImpl.SocketReader.isConnectionRefused(SocketAddress addr);
> {code}
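
The wait-for-ping idea in the scenario above might be sketched like this (class and method names are assumptions; the real ServerImpl API differs):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: node n3, asked by n1 to check n2, waits out the
// remainder of the failure detection timeout for any message from n2.
public class BackwardCheckSketch {
    private final CountDownLatch pingFromPrev = new CountDownLatch(1);

    /** Called by the message worker on any discovery message from n2. */
    public void onMessageFromPrevious() {
        pingFromPrev.countDown();
    }

    /**
     * Called when n1 asks us to check n2. Returns true (n2 seems alive)
     * if n2 pings us within what is left of the failure detection timeout.
     */
    public boolean previousSeemsAlive(long remainingTimeoutMs) throws InterruptedException {
        return pingFromPrev.await(remainingTimeoutMs, TimeUnit.MILLISECONDS);
    }
}
```

If previousSeemsAlive returns false, n3 would report to n1 that n2 looks dead; otherwise it answers n1 that n2 is still considered alive.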



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (IGNITE-13111) Simplify backward checking of node connection.

2020-06-17 Thread Vladimir Steshin (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-13111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17138300#comment-17138300
 ] 

Vladimir Steshin edited comment on IGNITE-13111 at 6/17/20, 10:06 AM:
--

I find IGNITE-13016 or IGNITE-13014 a better solution. We cannot rely on the ping 
interval because two nodes are involved in backward connection checking, and they 
work with the same but shifted ping intervals. If node N asks N+2 to check N+1, N+2 
waits for the rest of its failureDetectionTimeout. But the ping and 
failureDetectionTimeout on N are shifted in comparison with N+2, so N can fail 
before N+2 has finished waiting for a ping from N+1.


was (Author: vladsz83):
I find IGNITE-13016 a better solution. We cannot rely on the ping interval because 
two nodes are involved in backward connection checking, and they work with the same 
but shifted ping intervals. If node N asks N+2 to check N+1, N+2 waits for the rest 
of its failureDetectionTimeout. But the ping and failureDetectionTimeout on N are 
shifted in comparison with N+2, so N can fail before N+2 has finished waiting for a 
ping from N+1.

> Simplify backward checking of node connection.
> --
>
> Key: IGNITE-13111
> URL: https://issues.apache.org/jira/browse/IGNITE-13111
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Major
>  Labels: iep-45
> Attachments: FailureDetectionResearch.patch, 
> FailureDetectionResearch.txt, FailureDetectionResearch_fixed.txt, 
> WostCaseStepByStep.txt
>
>
> We should fix several drawbacks in the backward checking of a failed node. They 
> prolong node failure detection up to: 
> ServerImpl.CON_CHECK_INTERVAL + 2 * 
> IgniteConfiguration.failureDetectionTimeout + 300 ms. 
> See:
> * '_NodeFailureResearch.patch_'. It creates the test 'FailureDetectionResearch', 
> which emulates long answers on a failed node and measures failure detection 
> delays.
> * '_FailureDetectionResearch.txt_' - results of the test.
> * '_FailureDetectionResearch_fixed.txt_' - results of the test after this fix.
> * '_WostCaseStepByStep.txt_' - description of how the worst case happens.
> *Suggestion:*
> 1) We can simplify backward connection checking once we implement IGNITE-13012. 
> Once we get a robust, predictable connection ping, we don't need to check the 
> previous node, because we can see whether it sent a ping to the current node 
> within the failure detection timeout. If not, the previous node can be 
> considered lost.
> Instead of:
> {code:java}
> // Node cannot connect to its next (for the local node, that is its previous).
> // Need to check connectivity to it.
> long rcvdTime = lastRingMsgReceivedTime;
> long now = U.currentTimeMillis();
>
> // We got a message from the previous node in less than double the
> // connection check interval.
> boolean ok = rcvdTime + effectiveExchangeTimeout() >= now;
>
> TcpDiscoveryNode previous = null;
>
> if (ok) {
>     // Check case when the previous node suddenly died. This will speed up
>     // node failing.
>     ... // checking connection to the previous node
> }
> {code}
> we could wait for a ping from the previous node. Scenario:
> * n1 (Node1) failed to connect to n2.
> * n1 asks n3 to establish a connection instead of n2.
> * n3 waits for a ping from n2 for the rest of the failure detection timeout.
> * If n3 receives a ping from n2, it connects with n1, or answers n1 that n2 is 
> considered alive.
> 2) Then it seems we can remove:
> {code:java}
> ServerImpl.SocketReader.isConnectionRefused(SocketAddress addr);
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (IGNITE-13111) Simplify backward checking of node connection.

2020-06-17 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13111:
--
Ignite Flags:   (was: Release Notes Required)

> Simplify backward checking of node connection.
> --
>
> Key: IGNITE-13111
> URL: https://issues.apache.org/jira/browse/IGNITE-13111
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Major
>  Labels: iep-45
> Attachments: FailureDetectionResearch.patch, 
> FailureDetectionResearch.txt, FailureDetectionResearch_fixed.txt, 
> WostCaseStepByStep.txt
>
>
> We should fix several drawbacks in the backward checking of a failed node. They 
> prolong node failure detection up to: 
> ServerImpl.CON_CHECK_INTERVAL + 2 * 
> IgniteConfiguration.failureDetectionTimeout + 300 ms. 
> See:
> * '_NodeFailureResearch.patch_'. It creates the test 'FailureDetectionResearch', 
> which emulates long answers on a failed node and measures failure detection 
> delays.
> * '_FailureDetectionResearch.txt_' - results of the test.
> * '_FailureDetectionResearch_fixed.txt_' - results of the test after this fix.
> * '_WostCaseStepByStep.txt_' - description of how the worst case happens.
> *Suggestion:*
> 1) We can simplify backward connection checking once we implement IGNITE-13012. 
> Once we get a robust, predictable connection ping, we don't need to check the 
> previous node, because we can see whether it sent a ping to the current node 
> within the failure detection timeout. If not, the previous node can be 
> considered lost.
> Instead of:
> {code:java}
> // Node cannot connect to its next (for the local node, that is its previous).
> // Need to check connectivity to it.
> long rcvdTime = lastRingMsgReceivedTime;
> long now = U.currentTimeMillis();
>
> // We got a message from the previous node in less than double the
> // connection check interval.
> boolean ok = rcvdTime + effectiveExchangeTimeout() >= now;
>
> TcpDiscoveryNode previous = null;
>
> if (ok) {
>     // Check case when the previous node suddenly died. This will speed up
>     // node failing.
>     ... // checking connection to the previous node
> }
> {code}
> we could wait for a ping from the previous node. Scenario:
> * n1 (Node1) failed to connect to n2.
> * n1 asks n3 to establish a connection instead of n2.
> * n3 waits for a ping from n2 for the rest of the failure detection timeout.
> * If n3 receives a ping from n2, it connects with n1, or answers n1 that n2 is 
> considered alive.
> 2) Then it seems we can remove:
> {code:java}
> ServerImpl.SocketReader.isConnectionRefused(SocketAddress addr);
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (IGNITE-13111) Simplify backward checking of node connection.

2020-06-17 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin closed IGNITE-13111.
-

> Simplify backward checking of node connection.
> --
>
> Key: IGNITE-13111
> URL: https://issues.apache.org/jira/browse/IGNITE-13111
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Major
>  Labels: iep-45
> Attachments: FailureDetectionResearch.patch, 
> FailureDetectionResearch.txt, FailureDetectionResearch_fixed.txt, 
> WostCaseStepByStep.txt
>
>
> We should fix several drawbacks in the backward checking of a failed node. They 
> prolong node failure detection up to: 
> ServerImpl.CON_CHECK_INTERVAL + 2 * 
> IgniteConfiguration.failureDetectionTimeout + 300 ms. 
> See:
> * '_NodeFailureResearch.patch_'. It creates the test 'FailureDetectionResearch', 
> which emulates long answers on a failed node and measures failure detection 
> delays.
> * '_FailureDetectionResearch.txt_' - results of the test.
> * '_FailureDetectionResearch_fixed.txt_' - results of the test after this fix.
> * '_WostCaseStepByStep.txt_' - description of how the worst case happens.
> *Suggestion:*
> 1) We can simplify backward connection checking once we implement IGNITE-13012. 
> Once we get a robust, predictable connection ping, we don't need to check the 
> previous node, because we can see whether it sent a ping to the current node 
> within the failure detection timeout. If not, the previous node can be 
> considered lost.
> Instead of:
> {code:java}
> // Node cannot connect to its next (for the local node, that is its previous).
> // Need to check connectivity to it.
> long rcvdTime = lastRingMsgReceivedTime;
> long now = U.currentTimeMillis();
>
> // We got a message from the previous node in less than double the
> // connection check interval.
> boolean ok = rcvdTime + effectiveExchangeTimeout() >= now;
>
> TcpDiscoveryNode previous = null;
>
> if (ok) {
>     // Check case when the previous node suddenly died. This will speed up
>     // node failing.
>     ... // checking connection to the previous node
> }
> {code}
> we could wait for a ping from the previous node. Scenario:
> * n1 (Node1) failed to connect to n2.
> * n1 asks n3 to establish a connection instead of n2.
> * n3 waits for a ping from n2 for the rest of the failure detection timeout.
> * If n3 receives a ping from n2, it connects with n1, or answers n1 that n2 is 
> considered alive.
> 2) Then it seems we can remove:
> {code:java}
> ServerImpl.SocketReader.isConnectionRefused(SocketAddress addr);
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (IGNITE-13111) Simplify backward checking of node connection.

2020-06-17 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin resolved IGNITE-13111.
---
Resolution: Won't Fix

I find IGNITE-13016 a better solution. We cannot rely on the ping interval because 
two nodes are involved in backward connection checking, and they work with the same 
but shifted ping intervals. If node N asks N+2 to check N+1, N+2 waits for the rest 
of its failureDetectionTimeout. But the ping and failureDetectionTimeout on N are 
shifted in comparison with N+2, so N can fail before N+2 has finished waiting for a 
ping from N+1.

> Simplify backward checking of node connection.
> --
>
> Key: IGNITE-13111
> URL: https://issues.apache.org/jira/browse/IGNITE-13111
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Major
>  Labels: iep-45
> Attachments: FailureDetectionResearch.patch, 
> FailureDetectionResearch.txt, FailureDetectionResearch_fixed.txt, 
> WostCaseStepByStep.txt
>
>
> We should fix several drawbacks in the backward checking of a failed node. They 
> prolong node failure detection up to: 
> ServerImpl.CON_CHECK_INTERVAL + 2 * 
> IgniteConfiguration.failureDetectionTimeout + 300 ms. 
> See:
> * '_NodeFailureResearch.patch_'. It creates the test 'FailureDetectionResearch', 
> which emulates long answers on a failed node and measures failure detection 
> delays.
> * '_FailureDetectionResearch.txt_' - results of the test.
> * '_FailureDetectionResearch_fixed.txt_' - results of the test after this fix.
> * '_WostCaseStepByStep.txt_' - description of how the worst case happens.
> *Suggestion:*
> 1) We can simplify backward connection checking once we implement IGNITE-13012. 
> Once we get a robust, predictable connection ping, we don't need to check the 
> previous node, because we can see whether it sent a ping to the current node 
> within the failure detection timeout. If not, the previous node can be 
> considered lost.
> Instead of:
> {code:java}
> // Node cannot connect to its next (for the local node, that is its previous).
> // Need to check connectivity to it.
> long rcvdTime = lastRingMsgReceivedTime;
> long now = U.currentTimeMillis();
>
> // We got a message from the previous node in less than double the
> // connection check interval.
> boolean ok = rcvdTime + effectiveExchangeTimeout() >= now;
>
> TcpDiscoveryNode previous = null;
>
> if (ok) {
>     // Check case when the previous node suddenly died. This will speed up
>     // node failing.
>     ... // checking connection to the previous node
> }
> {code}
> we could wait for a ping from the previous node. Scenario:
> * n1 (Node1) failed to connect to n2.
> * n1 asks n3 to establish a connection instead of n2.
> * n3 waits for a ping from n2 for the rest of the failure detection timeout.
> * If n3 receives a ping from n2, it connects with n1, or answers n1 that n2 is 
> considered alive.
> 2) Then it seems we can remove:
> {code:java}
> ServerImpl.SocketReader.isConnectionRefused(SocketAddress addr);
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (IGNITE-13012) Fix failure detection timeout. Simplify node ping routine.

2020-06-16 Thread Vladimir Steshin (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-13012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136824#comment-17136824
 ] 

Vladimir Steshin edited comment on IGNITE-13012 at 6/16/20, 5:03 PM:
-

[~avinogradov], I've put the patch. It creates:

# JmhNodeFailureDetection. Not an ordinary JMH benchmark, I believe: because we 
have to start/wait/fail a node, the detection time is only a small piece of each 
run, so the fixed/not-fixed results are close. I made my own runs and collected 
timings to prepare the output. 

You can find in the output of the fix (example):
{code:java}
Detection delay: 294. Failure detection timeout: 300
Total detection delay: 5477

Run complete. Total time: 00:01:23
Benchmark  Mode  Cnt   Score   Error
Units
JmhNodeFailureDetection.measureTotalForTheOutput  thrpt   10,954  
ops/min
{code}

vs not-fixed:

{code:java}
Detection delay: 539. Failure detection timeout: 300
Total detection delay: 11370

Run complete. Total time: 00:01:41

Benchmark  Mode  Cnt  Score   Error
Units
JmhNodeFailureDetection.measureTotalForTheOutput  thrpt   5,276  
ops/min
{code}

# A test which fails in unfixed code.
{code:java}TcpDiscoveryNetworkIssuesTest.testNodeFailureDetectedWithinConfiguredTimeout(){code}


was (Author: vladsz83):
[~avinogradov], I've put the patch. It creates:

# JmhNodeFailureDetection. Not an ordinary JMH benchmark, I believe: because we 
have to start/wait/fail a node, the detection time is only a small piece of each 
run, so the fixed/not-fixed results are close. I made my own runs and collected 
timings to prepare the output. 

You can find in the output of the fix (example):
{code:java}
Detection delay: 294. Failure detection timeout: 300
Total detection delay: 5477

# Run complete. Total time: 00:01:23
Benchmark  Mode  Cnt   Score   Error
Units
JmhNodeFailureDetection.measureTotalForTheOutput  thrpt   10,954  
ops/min
{code}

vs not-fixed:

{code:java}
Detection delay: 539. Failure detection timeout: 300
Total detection delay: 11370

# Run complete. Total time: 00:01:41

Benchmark  Mode  Cnt  Score   Error
Units
JmhNodeFailureDetection.measureTotalForTheOutput  thrpt   5,276  
ops/min
{code}

# A test which fails in unfixed code.
{code:java}TcpDiscoveryNetworkIssuesTest.testNodeFailureDetectedWithinConfiguredTimeout(){code}

> Fix failure detection timeout. Simplify node ping routine.
> --
>
> Key: IGNITE-13012
> URL: https://issues.apache.org/jira/browse/IGNITE-13012
> Project: Ignite
>  Issue Type: Improvement
>Affects Versions: 2.8.1
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Major
>  Labels: iep-45
> Attachments: IGNITE-13012-patch.patch
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> Connection failure may not be detected within 
> IgniteConfiguration.failureDetectionTimeout. The actual worst-case delay is 
> ServerImpl.CON_CHECK_INTERVAL + IgniteConfiguration.failureDetectionTimeout, 
> and the node ping routine is duplicated.
> We should fix:
> 1. The failure detection timeout should take into account the last sent message. 
> The current ping is bound to its own timer:
> {code:java}ServerImpl.RingMessageWorker.lastTimeConnCheckMsgSent{code}
> This is odd because any discovery message checks the connection. 
> 2. Make the connection check interval depend on the failure detection timeout (FDT). 
> The current value is a constant:
> {code:java}static int ServerImpl.CON_CHECK_INTERVAL = 500{code}
> 3. Remove the additional, quickened connection checking. Once we fix point 1, this 
> becomes unnecessary.
> Although TCP discovery has a connection-check period, it may send a ping 
> before this period elapses. This premature node ping relies on the time of 
> any sent, or even any received, message. 
> 4. Do not alarm the user with “Node seems disconnected” when everything is OK. 
> Once we fix points 1 and 3, this message becomes redundant. 
> A node may log at INFO: “Local node seems to be disconnected from topology …” 
> whereas it is not actually disconnected at all.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (IGNITE-13012) Fix failure detection timeout. Simplify node ping routine.

2020-06-16 Thread Vladimir Steshin (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-13012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136824#comment-17136824
 ] 

Vladimir Steshin edited comment on IGNITE-13012 at 6/16/20, 5:03 PM:
-

[~avinogradov], I've put the patch. It creates:

* JmhNodeFailureDetection. Not an ordinary JMH benchmark, I believe: because we 
have to start/wait/fail a node, the detection time is only a small piece of each 
run, so the fixed/not-fixed results are close. I made my own runs and collected 
timings to prepare the output. 

You can find in the output of the fix (example):
{code:java}
Detection delay: 294. Failure detection timeout: 300
Total detection delay: 5477

# Run complete. Total time: 00:01:23
Benchmark  Mode  Cnt   Score   Error
Units
JmhNodeFailureDetection.measureTotalForTheOutput  thrpt   10,954  
ops/min
{code}

vs not-fixed:

{code:java}
Detection delay: 539. Failure detection timeout: 300
Total detection delay: 11370

# Run complete. Total time: 00:01:41

Benchmark  Mode  Cnt  Score   Error
Units
JmhNodeFailureDetection.measureTotalForTheOutput  thrpt   5,276  
ops/min
{code}

* A test which fails in unfixed code.
{code:java}TcpDiscoveryNetworkIssuesTest.testNodeFailureDetectedWithinConfiguredTimeout(){code}


was (Author: vladsz83):
[~avinogradov], I've put the patch. It creates:

# JmhNodeFailureDetection. Not an ordinary JMH benchmark, I believe: because we 
have to start/wait/fail a node, the detection time is only a small piece of each 
run, so the fixed/not-fixed results are close. I made my own runs and collected 
timings to prepare the output. 

You can find in the output of the fix (example):
{code:java}
Detection delay: 294. Failure detection timeout: 300
Total detection delay: 5477

Run complete. Total time: 00:01:23
Benchmark  Mode  Cnt   Score   Error
Units
JmhNodeFailureDetection.measureTotalForTheOutput  thrpt   10,954  
ops/min
{code}

vs not-fixed:

{code:java}
Detection delay: 539. Failure detection timeout: 300
Total detection delay: 11370

Run complete. Total time: 00:01:41

Benchmark  Mode  Cnt  Score   Error
Units
JmhNodeFailureDetection.measureTotalForTheOutput  thrpt   5,276  
ops/min
{code}

# A test which fails in unfixed code.
{code:java}TcpDiscoveryNetworkIssuesTest.testNodeFailureDetectedWithinConfiguredTimeout(){code}

> Fix failure detection timeout. Simplify node ping routine.
> --
>
> Key: IGNITE-13012
> URL: https://issues.apache.org/jira/browse/IGNITE-13012
> Project: Ignite
>  Issue Type: Improvement
>Affects Versions: 2.8.1
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Major
>  Labels: iep-45
> Attachments: IGNITE-13012-patch.patch
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> Connection failure may not be detected within 
> IgniteConfiguration.failureDetectionTimeout. The actual worst-case delay is 
> ServerImpl.CON_CHECK_INTERVAL + IgniteConfiguration.failureDetectionTimeout, 
> and the node ping routine is duplicated.
> We should fix:
> 1. The failure detection timeout should take into account the last sent message. 
> The current ping is bound to its own timer:
> {code:java}ServerImpl.RingMessageWorker.lastTimeConnCheckMsgSent{code}
> This is odd because any discovery message checks the connection. 
> 2. Make the connection check interval depend on the failure detection timeout (FDT). 
> The current value is a constant:
> {code:java}static int ServerImpl.CON_CHECK_INTERVAL = 500{code}
> 3. Remove the additional, quickened connection checking. Once we fix point 1, this 
> becomes unnecessary.
> Although TCP discovery has a connection-check period, it may send a ping 
> before this period elapses. This premature node ping relies on the time of 
> any sent, or even any received, message. 
> 4. Do not alarm the user with “Node seems disconnected” when everything is OK. 
> Once we fix points 1 and 3, this message becomes redundant. 
> A node may log at INFO: “Local node seems to be disconnected from topology …” 
> whereas it is not actually disconnected at all.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (IGNITE-13012) Fix failure detection timeout. Simplify node ping routine.

2020-06-16 Thread Vladimir Steshin (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-13012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136824#comment-17136824
 ] 

Vladimir Steshin edited comment on IGNITE-13012 at 6/16/20, 5:02 PM:
-

[~avinogradov], I've put the patch. It creates:

# JmhNodeFailureDetection. Not an ordinary JMH benchmark, I believe: because we 
have to start/wait/fail a node, the detection time is only a small piece of each 
run, so the fixed/not-fixed results are close. I made my own runs and collected 
timings to prepare the output. 

You can find in the output of the fix (example):
{code:java}
Detection delay: 294. Failure detection timeout: 300
Total detection delay: 5477

# Run complete. Total time: 00:01:23
Benchmark  Mode  Cnt   Score   Error
Units
JmhNodeFailureDetection.measureTotalForTheOutput  thrpt   10,954  
ops/min
{code}

vs not-fixed:

{code:java}
Detection delay: 539. Failure detection timeout: 300
Total detection delay: 11370

# Run complete. Total time: 00:01:41

Benchmark  Mode  Cnt  Score   Error
Units
JmhNodeFailureDetection.measureTotalForTheOutput  thrpt   5,276  
ops/min
{code}

# A test which fails in unfixed code.
{code:java}TcpDiscoveryNetworkIssuesTest.testNodeFailureDetectedWithinConfiguredTimeout(){code}


was (Author: vladsz83):
[~avinogradov], I've put the patch. It creates:

* JmhNodeFailureDetection. Not an ordinary JMH benchmark, I believe: because we 
have to start/wait/fail a node, the detection time is only a small piece of each 
run, so the fixed/not-fixed results are close. I made my own runs and collected 
timings to prepare the output. 

You can find in the output of the fix (example):
{code:java}
Detection delay: 294. Failure detection timeout: 300
Total detection delay: 5477

# Run complete. Total time: 00:01:23
Benchmark  Mode  Cnt   Score   Error
Units
JmhNodeFailureDetection.measureTotalForTheOutput  thrpt   10,954  
ops/min
{code}

vs not-fixed:

{code:java}
Detection delay: 539. Failure detection timeout: 300
Total detection delay: 11370

# Run complete. Total time: 00:01:41

Benchmark  Mode  Cnt  Score   Error
Units
JmhNodeFailureDetection.measureTotalForTheOutput  thrpt   5,276  
ops/min
{code}

* 
{code:java}TcpDiscoveryNetworkIssuesTest.testNodeFailureDetectedWithinConfiguredTimeout(){code}

> Fix failure detection timeout. Simplify node ping routine.
> --
>
> Key: IGNITE-13012
> URL: https://issues.apache.org/jira/browse/IGNITE-13012
> Project: Ignite
>  Issue Type: Improvement
>Affects Versions: 2.8.1
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Major
>  Labels: iep-45
> Attachments: IGNITE-13012-patch.patch
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> Connection failure may not be detected within 
> IgniteConfiguration.failureDetectionTimeout. The actual worst-case delay is 
> ServerImpl.CON_CHECK_INTERVAL + IgniteConfiguration.failureDetectionTimeout, 
> and the node ping routine is duplicated.
> We should fix:
> 1. The failure detection timeout should take into account the last sent message. 
> The current ping is bound to its own timer:
> {code:java}ServerImpl.RingMessageWorker.lastTimeConnCheckMsgSent{code}
> This is odd because any discovery message checks the connection. 
> 2. Make the connection check interval depend on the failure detection timeout (FDT). 
> The current value is a constant:
> {code:java}static int ServerImpl.CON_CHECK_INTERVAL = 500{code}
> 3. Remove the additional, quickened connection checking. Once we fix point 1, this 
> becomes unnecessary.
> Although TCP discovery has a connection-check period, it may send a ping 
> before this period elapses. This premature node ping relies on the time of 
> any sent, or even any received, message. 
> 4. Do not alarm the user with “Node seems disconnected” when everything is OK. 
> Once we fix points 1 and 3, this message becomes redundant. 
> A node may log at INFO: “Local node seems to be disconnected from topology …” 
> whereas it is not actually disconnected at all.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (IGNITE-13012) Fix failure detection timeout. Simplify node ping routine.

2020-06-16 Thread Vladimir Steshin (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-13012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136824#comment-17136824
 ] 

Vladimir Steshin edited comment on IGNITE-13012 at 6/16/20, 5:02 PM:
-

[~avinogradov], I've put the patch. It creates:

# JmhNodeFailureDetection. Not an ordinary JMH benchmark, I believe: because we 
have to start/wait/fail a node, the detection time is only a small piece of each 
run, so the fixed/not-fixed results are close. I made my own runs and collected 
timings to prepare the output. 

You can find in the output of the fix (example):
{code:java}
Detection delay: 294. Failure detection timeout: 300
Total detection delay: 5477

# Run complete. Total time: 00:01:23
Benchmark  Mode  Cnt   Score   Error
Units
JmhNodeFailureDetection.measureTotalForTheOutput  thrpt   10,954  
ops/min
{code}

vs not-fixed:

{code:java}
Detection delay: 539. Failure detection timeout: 300
Total detection delay: 11370

# Run complete. Total time: 00:01:41

Benchmark  Mode  Cnt  Score   Error
Units
JmhNodeFailureDetection.measureTotalForTheOutput  thrpt   5,276  
ops/min
{code}

# A test which fails in unfixed code.
{code:java}TcpDiscoveryNetworkIssuesTest.testNodeFailureDetectedWithinConfiguredTimeout(){code}


was (Author: vladsz83):
[~avinogradov], I've put the patch. It creates:

# JmhNodeFailureDetection. Not an ordinary JMH benchmark, I believe: because we 
have to start/wait/fail a node, the detection time is only a small piece of each 
run, so the fixed/not-fixed results are close. I made my own runs and collected 
timings to prepare the output. 

You can find in the output of the fix (example):
{code:java}
Detection delay: 294. Failure detection timeout: 300
Total detection delay: 5477

# Run complete. Total time: 00:01:23
Benchmark  Mode  Cnt   Score   Error
Units
JmhNodeFailureDetection.measureTotalForTheOutput  thrpt   10,954  
ops/min
{code}

vs not-fixed:

{code:java}
Detection delay: 539. Failure detection timeout: 300
Total detection delay: 11370

# Run complete. Total time: 00:01:41

Benchmark  Mode  Cnt  Score   Error
Units
JmhNodeFailureDetection.measureTotalForTheOutput  thrpt   5,276  
ops/min
{code}

# A test which fails in unfixed code.
{code:java}TcpDiscoveryNetworkIssuesTest.testNodeFailureDetectedWithinConfiguredTimeout(){code}

> Fix failure detection timeout. Simplify node ping routine.
> --
>
> Key: IGNITE-13012
> URL: https://issues.apache.org/jira/browse/IGNITE-13012
> Project: Ignite
>  Issue Type: Improvement
>Affects Versions: 2.8.1
>Reporter: Vladimir Steshin
>Assignee: Vladimir Steshin
>Priority: Major
>  Labels: iep-45
> Attachments: IGNITE-13012-patch.patch
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> Connection failure may not be detected within 
> IgniteConfiguration.failureDetectionTimeout. Actual worst delay is: 
> ServerImpl.CON_CHECK_INTERVAL + IgniteConfiguration.failureDetectionTimeout. 
> Node ping routine is duplicated.
> We should fix:
> 1. The failure detection timeout should take into account the last sent message. 
> The current ping is bound to its own timer:
> {code:java}ServerImpl.RingMessageWorker.lastTimeConnCheckMsgSent{code}
> This is weird because any discovery message checks the connection. 
> 2. Make the connection check interval depend on the failure detection timeout 
> (FDT). The current value is a constant:
> {code:java}static int ServerImpl.CON_CHECK_INTERVAL = 500{code}
> 3. Remove the additional, quickened connection checking. Once we do fix 1, this 
> becomes redundant.
> Despite having a period of connection checking, TCP discovery may send a ping 
> before this period expires. This premature node ping relies on the time of 
> any sent or even any received message. 
> 4. Do not worry the user with “Node seems disconnected” when everything is OK. 
> Once we do fixes 1 and 3, this also becomes redundant. 
> A node may log at INFO: “Local node seems to be disconnected from topology …” 
> whereas it is not actually disconnected at all.
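Point 2 of the description above can be sketched as follows (a hypothetical illustration; the method name, the divisor of 3, and the 100 ms floor are assumptions, not taken from the actual patch):

```java
public class ConnCheckIntervalSketch {
    /** Floor so a very small failure detection timeout does not flood the ring with pings. */
    static final long MIN_CON_CHECK_INTERVAL = 100;

    /** Derive the ping interval from the configured timeout instead of a fixed 500 ms. */
    static long connCheckInterval(long failureDetectionTimeout) {
        return Math.max(failureDetectionTimeout / 3, MIN_CON_CHECK_INTERVAL);
    }

    public static void main(String[] args) {
        System.out.println(connCheckInterval(300));    // floored to 100
        System.out.println(connCheckInterval(10000));  // 3333
    }
}
```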



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (IGNITE-13012) Fix failure detection timeout. Simplify node ping routine.

2020-06-16 Thread Vladimir Steshin (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-13012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136824#comment-17136824
 ] 

Vladimir Steshin edited comment on IGNITE-13012 at 6/16/20, 5:01 PM:
-

[~avinogradov], I've posted the patch. It adds:

* JmhNodeFailureDetection. Not an ordinary JMH benchmark, I believe: because we have to 
start/wait/fail a node, the detection time is only a small piece of each run, so the 
fixed/not-fixed results are close. I made my own runs and collected the timings to 
prepare the output.

An example of the output with the fix:
{code:java}
Detection delay: 294. Failure detection timeout: 300
Total detection delay: 5477

# Run complete. Total time: 00:01:23
Benchmark                                          Mode  Cnt   Score   Error  Units
JmhNodeFailureDetection.measureTotalForTheOutput  thrpt        10,954         ops/min
{code}

vs the unfixed code:

{code:text}
Detection delay: 539. Failure detection timeout: 300
Total detection delay: 11370

# Run complete. Total time: 00:01:41

Benchmark                                          Mode  Cnt  Score   Error  Units
JmhNodeFailureDetection.measureTotalForTheOutput  thrpt        5,276          ops/min
{code}

* 
{code:java}TcpDiscoveryNetworkIssuesTest.testNodeFailureDetectedWithinConfiguredTimeout(){code}


was (Author: vladsz83):
[~avinogradov], I've posted the patch. It adds:

* JmhNodeFailureDetection. Not an ordinary JMH benchmark, I believe: because we have to 
start/wait/fail a node, the detection time is only a small piece of each run, so the 
fixed/not-fixed results are close. I made my own runs and collected the timings to 
prepare the output.

An example of the output with the fix:
{code:text}
Detection delay: 294. Failure detection timeout: 300
Total detection delay: 5477

# Run complete. Total time: 00:01:23
Benchmark                                          Mode  Cnt   Score   Error  Units
JmhNodeFailureDetection.measureTotalForTheOutput  thrpt        10,954         ops/min
{code}

vs the unfixed code:

{code:text}
Detection delay: 539. Failure detection timeout: 300
Total detection delay: 11370

# Run complete. Total time: 00:01:41

Benchmark                                          Mode  Cnt  Score   Error  Units
JmhNodeFailureDetection.measureTotalForTheOutput  thrpt        5,276          ops/min
{code}

* 
{code:java}TcpDiscoveryNetworkIssuesTest.testNodeFailureDetectedWithinConfiguredTimeout(){code}








[jira] [Commented] (IGNITE-13012) Fix failure detection timeout. Simplify node ping routine.

2020-06-16 Thread Vladimir Steshin (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-13012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136824#comment-17136824
 ] 

Vladimir Steshin commented on IGNITE-13012:
---

[~avinogradov], I've posted the patch. It adds:

* JmhNodeFailureDetection. Not an ordinary JMH benchmark, I believe: because we have to 
start/wait/fail a node, the detection time is only a small piece of each run, so the 
fixed/not-fixed results are close. I made my own runs and collected the timings to 
prepare the output.

An example of the output with the fix:
{code:text}
Detection delay: 294. Failure detection timeout: 300
Total detection delay: 5477

# Run complete. Total time: 00:01:23
Benchmark                                          Mode  Cnt   Score   Error  Units
JmhNodeFailureDetection.measureTotalForTheOutput  thrpt        10,954         ops/min
{code}

vs the unfixed code:

{code:text}
Detection delay: 539. Failure detection timeout: 300
Total detection delay: 11370

# Run complete. Total time: 00:01:41

Benchmark                                          Mode  Cnt  Score   Error  Units
JmhNodeFailureDetection.measureTotalForTheOutput  thrpt        5,276          ops/min
{code}

* 
{code:java}TcpDiscoveryNetworkIssuesTest.testNodeFailureDetectedWithinConfiguredTimeout(){code}






[jira] [Updated] (IGNITE-13012) Fix failure detection timeout. Simplify node ping routine.

2020-06-16 Thread Vladimir Steshin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13012:
--
Attachment: IGNITE-13012-patch.patch





