[jira] [Commented] (IGNITE-13016) Fix backward checking of failed node.

2020-07-21 Thread Aleksey Plekhanov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-13016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17162076#comment-17162076
 ] 

Aleksey Plekhanov commented on IGNITE-13016:


Cherry-picked to 2.9

> Fix backward checking of failed node.
> -
>
> Key: IGNITE-13016
> URL: https://issues.apache.org/jira/browse/IGNITE-13016
> Project: Ignite
> Issue Type: Sub-task
> Reporter: Vladimir Steshin
> Assignee: Vladimir Steshin
> Priority: Major
> Labels: iep-45
> Fix For: 2.9
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> Backward node connection checking looks weird. What we might improve:
> 1) Address checking could be done in parallel, not sequentially.
> {code:java}
> for (InetSocketAddress addr : nodeAddrs) {
>     // Connection refused may be got if node doesn't listen
>     // (or blocked by firewall, but anyway assume it is dead).
>     if (!isConnectionRefused(addr)) {
>         liveAddr = addr;
>         break;
>     }
> }
> {code}
> 2) Any IOException should be treated as a failed connection, not only a
> refused connection:
> {code:java}
> catch (ConnectException e) {
>     return true;
> }
> catch (IOException e) {
>     return false;
> }
> {code}
> 3) The connection-check timeout should not be a hardcoded constant:
> {code:java}
> sock.connect(addr, 100);
> {code}
> 4) The decision to check the connection should rely on the configured exchange
> timeout, not on the ping interval:
> {code:java}
> // We got message from previous in less than double connection check interval.
> boolean ok = rcvdTime + U.millisToNanos(connCheckInterval) * 2 >= now;
> {code}
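
A minimal sketch of how points 1-3 might fit together. It is not the code from the actual patch. Every address is probed concurrently, any IOException counts as a failed connection, and the connect timeout is a parameter instead of the literal 100. The names ParallelAddressCheck, isAlive, findLiveAddress and timeoutMs are hypothetical; only standard java.net and java.util.concurrent calls are used.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical helper, not part of Ignite. */
final class ParallelAddressCheck {
    /** Returns true only if a TCP connection is established within timeoutMs. */
    static boolean isAlive(InetSocketAddress addr, int timeoutMs) {
        try (Socket sock = new Socket()) {
            sock.connect(addr, timeoutMs);

            return true;
        }
        catch (IOException e) {
            // Refused, timed out, unreachable: all count as a failed connection.
            return false;
        }
    }

    /** Probes all addresses concurrently; returns one live address, or null if none answered. */
    static InetSocketAddress findLiveAddress(Collection<InetSocketAddress> addrs, int timeoutMs)
        throws InterruptedException {
        if (addrs.isEmpty())
            return null;

        ExecutorService pool = Executors.newFixedThreadPool(addrs.size());

        try {
            List<Callable<InetSocketAddress>> tasks = new ArrayList<>();

            for (InetSocketAddress addr : addrs) {
                tasks.add(() -> {
                    if (isAlive(addr, timeoutMs))
                        return addr;

                    // A failed probe must throw so that invokeAny keeps waiting for the others.
                    throw new IOException("Unreachable: " + addr);
                });
            }

            // Returns the result of the first probe that completes successfully.
            return pool.invokeAny(tasks);
        }
        catch (ExecutionException e) {
            // Every probe failed.
            return null;
        }
        finally {
            pool.shutdownNow();
        }
    }
}
```

With invokeAny the caller returns as soon as one probe connects, and waits roughly one timeoutMs (not one per address) when every probe fails.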



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (IGNITE-13016) Fix backward checking of failed node.

2020-07-21 Thread Sergey Chugunov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-13016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17161933#comment-17161933
 ] 

Sergey Chugunov commented on IGNITE-13016:
--

[~vladsz83],

The patch looks good to me. I merged it into the master branch in commit 
*03ee85695014ff6aaa87e256d330d32342d34224*.

Thank you for the contribution!



[jira] [Commented] (IGNITE-13016) Fix backward checking of failed node.

2020-07-16 Thread Ignite TC Bot (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-13016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158994#comment-17158994
 ] 

Ignite TC Bot commented on IGNITE-13016:


{panel:title=Branch: [pull/7838/head] Base: [master] : No blockers 
found!|borderStyle=dashed|borderColor=#ccc|titleBGColor=#D6F7C1}{panel}
{panel:title=Branch: [pull/7838/head] Base: [master] : New Tests 
(8)|borderStyle=dashed|borderColor=#ccc|titleBGColor=#D6F7C1}
{color:#8b}Service Grid{color} [[tests 
4|https://ci.ignite.apache.org/viewLog.html?buildId=5462169]]
* {color:#013220}IgniteServiceGridTestSuite: 
ServiceDeploymentProcessIdSelfTest.topologyVersion[Test event=IgniteBiTuple 
[val1=DiscoveryEvent [evtNode=eb37c448-46e6-47f1-90c3-e2c1232ce3e0, topVer=0, 
nodeId8=07632a1b, msg=, type=NODE_JOINED, tstamp=1594738092959], 
val2=AffinityTopologyVersion [topVer=6570610929897650521, minorTopVer=0]]] - 
PASSED{color}
* {color:#013220}IgniteServiceGridTestSuite: 
ServiceDeploymentProcessIdSelfTest.requestId[Test event=IgniteBiTuple 
[val1=DiscoveryEvent [evtNode=eb37c448-46e6-47f1-90c3-e2c1232ce3e0, topVer=0, 
nodeId8=07632a1b, msg=, type=NODE_JOINED, tstamp=1594738092959], 
val2=AffinityTopologyVersion [topVer=6570610929897650521, minorTopVer=0]]] - 
PASSED{color}
* {color:#013220}IgniteServiceGridTestSuite: 
ServiceDeploymentProcessIdSelfTest.topologyVersion[Test event=IgniteBiTuple 
[val1=DiscoveryCustomEvent [customMsg=ServiceChangeBatchRequest 
[id=90486dd4371-1cb86db9-6955-413c-a8bc-184a9dae4984, reqs=SingletonList 
[ServiceUndeploymentRequest []]], affTopVer=null, super=DiscoveryEvent 
[evtNode=dc276d8e-243d-4c5a-ae8e-ab4ab9525610, topVer=0, nodeId8=dc276d8e, 
msg=null, type=DISCOVERY_CUSTOM_EVT, tstamp=1594738092959]], 
val2=AffinityTopologyVersion [topVer=6331264228940651374, minorTopVer=0]]] - 
PASSED{color}
* {color:#013220}IgniteServiceGridTestSuite: 
ServiceDeploymentProcessIdSelfTest.requestId[Test event=IgniteBiTuple 
[val1=DiscoveryCustomEvent [customMsg=ServiceChangeBatchRequest 
[id=90486dd4371-1cb86db9-6955-413c-a8bc-184a9dae4984, reqs=SingletonList 
[ServiceUndeploymentRequest []]], affTopVer=null, super=DiscoveryEvent 
[evtNode=dc276d8e-243d-4c5a-ae8e-ab4ab9525610, topVer=0, nodeId8=dc276d8e, 
msg=null, type=DISCOVERY_CUSTOM_EVT, tstamp=1594738092959]], 
val2=AffinityTopologyVersion [topVer=6331264228940651374, minorTopVer=0]]] - 
PASSED{color}

{color:#8b}Service Grid (legacy mode){color} [[tests 
4|https://ci.ignite.apache.org/viewLog.html?buildId=5462170]]
* {color:#013220}IgniteServiceGridTestSuite: 
ServiceDeploymentProcessIdSelfTest.topologyVersion[Test event=IgniteBiTuple 
[val1=DiscoveryEvent [evtNode=b5e153ec-563e-4f60-8a4c-10d4a07635be, topVer=0, 
nodeId8=cb4a3e0c, msg=, type=NODE_JOINED, tstamp=1594738174507], 
val2=AffinityTopologyVersion [topVer=2298789891461322797, minorTopVer=0]]] - 
PASSED{color}
* {color:#013220}IgniteServiceGridTestSuite: 
ServiceDeploymentProcessIdSelfTest.requestId[Test event=IgniteBiTuple 
[val1=DiscoveryEvent [evtNode=b5e153ec-563e-4f60-8a4c-10d4a07635be, topVer=0, 
nodeId8=cb4a3e0c, msg=, type=NODE_JOINED, tstamp=1594738174507], 
val2=AffinityTopologyVersion [topVer=2298789891461322797, minorTopVer=0]]] - 
PASSED{color}
* {color:#013220}IgniteServiceGridTestSuite: 
ServiceDeploymentProcessIdSelfTest.topologyVersion[Test event=IgniteBiTuple 
[val1=DiscoveryCustomEvent [customMsg=ServiceChangeBatchRequest 
[id=f2a6dcd4371-42ed2f0b-c0c4-4ad3-be48-9a06bb65c82d, reqs=SingletonList 
[ServiceUndeploymentRequest []]], affTopVer=null, super=DiscoveryEvent 
[evtNode=feefc9bc-3741-48da-8a92-9d7d5732f937, topVer=0, nodeId8=feefc9bc, 
msg=null, type=DISCOVERY_CUSTOM_EVT, tstamp=1594738174507]], 
val2=AffinityTopologyVersion [topVer=1483823403640769579, minorTopVer=0]]] - 
PASSED{color}
* {color:#013220}IgniteServiceGridTestSuite: 
ServiceDeploymentProcessIdSelfTest.requestId[Test event=IgniteBiTuple 
[val1=DiscoveryCustomEvent [customMsg=ServiceChangeBatchRequest 
[id=f2a6dcd4371-42ed2f0b-c0c4-4ad3-be48-9a06bb65c82d, reqs=SingletonList 
[ServiceUndeploymentRequest []]], affTopVer=null, super=DiscoveryEvent 
[evtNode=feefc9bc-3741-48da-8a92-9d7d5732f937, topVer=0, nodeId8=feefc9bc, 
msg=null, type=DISCOVERY_CUSTOM_EVT, tstamp=1594738174507]], 
val2=AffinityTopologyVersion [topVer=1483823403640769579, minorTopVer=0]]] - 
PASSED{color}

{panel}
[TeamCity *--> Run :: All* 
Results|https://ci.ignite.apache.org/viewLog.html?buildId=5462190&buildTypeId=IgniteTests24Java8_RunAll]

> Fix backward checking of failed node.
> -
>
> Key: IGNITE-13016
> URL: https://issues.apache.org/jira/browse/IGNITE-13016
> Project: Ignite
> Issue Type: Sub-task
> Reporter: Vladimir Steshin
> Assignee: Vladimir Steshin
> Priority: Major
> Labels: iep-45
> Fix For: 2.9
>
> 

[jira] [Commented] (IGNITE-13016) Fix backward checking of failed node.

2020-06-30 Thread Ignite TC Bot (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-13016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17148825#comment-17148825
 ] 

Ignite TC Bot commented on IGNITE-13016:


{panel:title=Branch: [pull/7838/head] Base: [master] : No blockers 
found!|borderStyle=dashed|borderColor=#ccc|titleBGColor=#D6F7C1}{panel}
{panel:title=Branch: [pull/7838/head] Base: [master] : New Tests 
(8)|borderStyle=dashed|borderColor=#ccc|titleBGColor=#D6F7C1}
{color:#8b}Service Grid{color} [tests 4]
* {color:#013220}IgniteServiceGridTestSuite: 
ServiceDeploymentProcessIdSelfTest.requestId[Test event=IgniteBiTuple 
[val1=DiscoveryCustomEvent [customMsg=ServiceChangeBatchRequest 
[id=63e53550371-b7bbaed4-9174-4bd0-832e-86bb88791e4f, reqs=SingletonList 
[ServiceUndeploymentRequest []]], affTopVer=null, super=DiscoveryEvent 
[evtNode=f8d1a9be-a5ec-4dee-8c71-dbb945007756, topVer=0, nodeId8=f8d1a9be, 
msg=null, type=DISCOVERY_CUSTOM_EVT, tstamp=1593522216498]], 
val2=AffinityTopologyVersion [topVer=479273715644047619, minorTopVer=0]]] - 
PASSED{color}
* {color:#013220}IgniteServiceGridTestSuite: 
ServiceDeploymentProcessIdSelfTest.topologyVersion[Test event=IgniteBiTuple 
[val1=DiscoveryEvent [evtNode=d429993e-eff2-4659-bdf9-4bed09fca199, topVer=0, 
nodeId8=1eff0d87, msg=, type=NODE_JOINED, tstamp=1593522216498], 
val2=AffinityTopologyVersion [topVer=4418461583898486663, minorTopVer=0]]] - 
PASSED{color}
* {color:#013220}IgniteServiceGridTestSuite: 
ServiceDeploymentProcessIdSelfTest.requestId[Test event=IgniteBiTuple 
[val1=DiscoveryEvent [evtNode=d429993e-eff2-4659-bdf9-4bed09fca199, topVer=0, 
nodeId8=1eff0d87, msg=, type=NODE_JOINED, tstamp=1593522216498], 
val2=AffinityTopologyVersion [topVer=4418461583898486663, minorTopVer=0]]] - 
PASSED{color}
* {color:#013220}IgniteServiceGridTestSuite: 
ServiceDeploymentProcessIdSelfTest.topologyVersion[Test event=IgniteBiTuple 
[val1=DiscoveryCustomEvent [customMsg=ServiceChangeBatchRequest 
[id=63e53550371-b7bbaed4-9174-4bd0-832e-86bb88791e4f, reqs=SingletonList 
[ServiceUndeploymentRequest []]], affTopVer=null, super=DiscoveryEvent 
[evtNode=f8d1a9be-a5ec-4dee-8c71-dbb945007756, topVer=0, nodeId8=f8d1a9be, 
msg=null, type=DISCOVERY_CUSTOM_EVT, tstamp=1593522216498]], 
val2=AffinityTopologyVersion [topVer=479273715644047619, minorTopVer=0]]] - 
PASSED{color}

{color:#8b}Service Grid (legacy mode){color} [tests 4]
* {color:#013220}IgniteServiceGridTestSuite: 
ServiceDeploymentProcessIdSelfTest.topologyVersion[Test event=IgniteBiTuple 
[val1=DiscoveryEvent [evtNode=ac7fbe8c-3709-480f-8a9b-a17e9cb4e307, topVer=0, 
nodeId8=5a4dfb3a, msg=, type=NODE_JOINED, tstamp=1593522268422], 
val2=AffinityTopologyVersion [topVer=1174591130455611368, minorTopVer=0]]] - 
PASSED{color}
* {color:#013220}IgniteServiceGridTestSuite: 
ServiceDeploymentProcessIdSelfTest.requestId[Test event=IgniteBiTuple 
[val1=DiscoveryEvent [evtNode=ac7fbe8c-3709-480f-8a9b-a17e9cb4e307, topVer=0, 
nodeId8=5a4dfb3a, msg=, type=NODE_JOINED, tstamp=1593522268422], 
val2=AffinityTopologyVersion [topVer=1174591130455611368, minorTopVer=0]]] - 
PASSED{color}
* {color:#013220}IgniteServiceGridTestSuite: 
ServiceDeploymentProcessIdSelfTest.topologyVersion[Test event=IgniteBiTuple 
[val1=DiscoveryCustomEvent [customMsg=ServiceChangeBatchRequest 
[id=d6a6e550371-c62106f4-0bc0-4061-b0c1-735533169ef7, reqs=SingletonList 
[ServiceUndeploymentRequest []]], affTopVer=null, super=DiscoveryEvent 
[evtNode=90c60780-81e5-40b9-ae84-eb7166a59ae5, topVer=0, nodeId8=90c60780, 
msg=null, type=DISCOVERY_CUSTOM_EVT, tstamp=1593522268422]], 
val2=AffinityTopologyVersion [topVer=110843890637889783, minorTopVer=0]]] - 
PASSED{color}
* {color:#013220}IgniteServiceGridTestSuite: 
ServiceDeploymentProcessIdSelfTest.requestId[Test event=IgniteBiTuple 
[val1=DiscoveryCustomEvent [customMsg=ServiceChangeBatchRequest 
[id=d6a6e550371-c62106f4-0bc0-4061-b0c1-735533169ef7, reqs=SingletonList 
[ServiceUndeploymentRequest []]], affTopVer=null, super=DiscoveryEvent 
[evtNode=90c60780-81e5-40b9-ae84-eb7166a59ae5, topVer=0, nodeId8=90c60780, 
msg=null, type=DISCOVERY_CUSTOM_EVT, tstamp=1593522268422]], 
val2=AffinityTopologyVersion [topVer=110843890637889783, minorTopVer=0]]] - 
PASSED{color}

{panel}
[TeamCity *--> Run :: All* 
Results|https://ci.ignite.apache.org/viewLog.html?buildId=5430065&buildTypeId=IgniteTests24Java8_RunAll]

> Fix backward checking of failed node.
> -
>
> Key: IGNITE-13016
> URL: https://issues.apache.org/jira/browse/IGNITE-13016
> Project: Ignite
> Issue Type: Sub-task
> Reporter: Vladimir Steshin
> Assignee: Vladimir Steshin
> Priority: Major
> Labels: iep-45
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Backward node connection checking looks weird. What might be improved:
> 1) Addresses checking could

[jira] [Commented] (IGNITE-13016) Fix backward checking of failed node.

2020-05-29 Thread Ignite TC Bot (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-13016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17119489#comment-17119489
 ] 

Ignite TC Bot commented on IGNITE-13016:


{panel:title=Branch: [pull/7838/head] Base: [master] : No blockers 
found!|borderStyle=dashed|borderColor=#ccc|titleBGColor=#D6F7C1}{panel}
[TeamCity *--> Run :: All* 
Results|https://ci.ignite.apache.org/viewLog.html?buildId=5346497&buildTypeId=IgniteTests24Java8_RunAll]

> Fix backward checking of failed node.
> -
>
> Key: IGNITE-13016
> URL: https://issues.apache.org/jira/browse/IGNITE-13016
> Project: Ignite
> Issue Type: Sub-task
> Reporter: Vladimir Steshin
> Assignee: Vladimir Steshin
> Priority: Major
> Labels: iep-45
> Attachments: FailureDetectionResearch.txt, 
> FailureDetectionResearch_fixed.txt, NodeFailureResearch.patch, 
> WostCaseStepByStep.txt
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> We should fix several drawbacks in the backward checking of a failed node. They 
> prolong node failure detection up to: 
> ServerImpl.CON_CHECK_INTERVAL + 2 * 
> IgniteConfiguration.failureDetectionTimeout + 300ms. 
> See:
> * '_NodeFailureResearch.patch_'. It creates the test 'FailureDetectionResearch', 
> which emulates long answers from a failed node and measures failure detection 
> delays.
> * '_FailureDetectionResearch.txt_' - results of the test.
> * '_FailureDetectionResearch_fixed.txt_' - results of the test after this fix.
> * '_WostCaseStepByStep.txt_' - description of how the worst case happens.
> *Suggestions:*
> 1) We should replace the hardcoded 100ms timeout with a parameter like 
> failureDetectionTimeout:
> {code:java}
> private boolean ServerImpl.isConnectionRefused(SocketAddress addr) {
>    ...
>    sock.connect(addr, 100); // Make it rely on failureDetectionTimeout.
>    ...
> }
> {code}
> 2) Any negative result of the connection check should be treated as a failed 
> node. Currently, we look only at a refused connection. Any other 
> exception, including a timeout, is treated as a live connection: 
> {code:java}
> private boolean ServerImpl.isConnectionRefused(SocketAddress addr) {
>    ...
>    catch (ConnectException e) {
>       return true;
>    }
>    catch (IOException e) {
>       return false; // Make any error mean lost connection.
>    }
>    return false;
> }
> {code}
> 3) The maximal interval for checking the previous node should rely on the actual 
> failure detection timeout:
> {code:java}
> TcpDiscoveryHandshakeResponse res = new TcpDiscoveryHandshakeResponse(...);
> ...
> // We got message from previous in less than double connection check interval.
> boolean ok = rcvdTime + CON_CHECK_INTERVAL * 2 >= now; // Here should be a timeout of failure detection.
> if (ok) {
>    // Check case when previous node suddenly died. This will speed up
>    // node failing.
>    ...
> }
> res.previousNodeAlive(ok);
> {code}
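
To illustrate suggestion 3, here is a hedged sketch of the same check with the window passed in as a parameter rather than derived from CON_CHECK_INTERVAL. The names PreviousNodeFreshness, recentlyHeard and windowMs are hypothetical and are not Ignite API. Which configured timeout to pass is left to the caller.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper, not part of Ignite. */
final class PreviousNodeFreshness {
    /**
     * @param lastRcvdNanos System.nanoTime() reading taken when the last message from the previous node arrived.
     * @param nowNanos Current System.nanoTime() reading.
     * @param windowMs Configured window in milliseconds (for example, a failure detection timeout).
     * @return true if the previous node was heard from within the window.
     */
    static boolean recentlyHeard(long lastRcvdNanos, long nowNanos, long windowMs) {
        // Compare the difference, not the sum: nanoTime differences stay valid even if the counter wraps.
        return nowNanos - lastRcvdNanos <= TimeUnit.MILLISECONDS.toNanos(windowMs);
    }
}
```

For example, recentlyHeard(t0, t0 + 5_000_000L, 10) is true (5 ms elapsed, 10 ms window), while recentlyHeard(t0, t0 + 20_000_000L, 10) is false.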





[jira] [Commented] (IGNITE-13016) Fix backward checking of failed node.

2020-05-28 Thread Ignite TC Bot (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-13016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17118541#comment-17118541
 ] 

Ignite TC Bot commented on IGNITE-13016:


{panel:title=Branch: [pull/7838/head] Base: [master] : No blockers 
found!|borderStyle=dashed|borderColor=#ccc|titleBGColor=#D6F7C1}{panel}
[TeamCity *--> Run :: All* 
Results|https://ci.ignite.apache.org/viewLog.html?buildId=5339425&buildTypeId=IgniteTests24Java8_RunAll]

> Fix backward checking of failed node.
> -
>
> Key: IGNITE-13016
> URL: https://issues.apache.org/jira/browse/IGNITE-13016
> Project: Ignite
> Issue Type: Sub-task
> Reporter: Vladimir Steshin
> Assignee: Vladimir Steshin
> Priority: Major
> Labels: iep-45
> Attachments: FailureDetectionResearch.txt, 
> FailureDetectionResearch_fixed.txt, NodeFailureResearch.patch, 
> WostCaseStepByStep.txt
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> We should fix several drawbacks in the backward checking of a failed node. They 
> prolong node failure detection up to: 
> ServerImpl.CON_CHECK_INTERVAL + 2 * 
> IgniteConfiguration.failureDetectionTimeout + 300ms. 
> See:
> * '_NodeFailureResearch.patch_'. It creates the test 'FailureDetectionResearch', 
> which emulates long answers from a failed node and measures failure detection 
> delays.
> * '_FailureDetectionResearch.txt_' - results of the test.
> * '_FailureDetectionResearch_fixed.txt_' - results of the test after this fix.
> * '_WostCaseStepByStep.txt_' - description of how the worst case happens.
> *Suggestions:*
> 1) We should replace the hardcoded 100ms timeout with a parameter like 
> failureDetectionTimeout:
> {code:java}
> private boolean ServerImpl.isConnectionRefused(SocketAddress addr) {
>    ...
>    sock.connect(addr, 100); // Make it rely on failureDetectionTimeout.
>    ...
> }
> {code}
> 2) Any negative result of the connection check should be treated as a failed 
> node. Currently, we look only at a refused connection. Any other 
> exception, including a timeout, is treated as a live connection: 
> {code:java}
> private boolean ServerImpl.isConnectionRefused(SocketAddress addr) {
>    ...
>    catch (ConnectException e) {
>       return true;
>    }
>    catch (IOException e) {
>       return false; // Make any error mean lost connection.
>    }
>    return false;
> }
> {code}
> 3) The maximal interval for checking the previous node should rely on the actual 
> failure detection timeout:
> {code:java}
> TcpDiscoveryHandshakeResponse res = new TcpDiscoveryHandshakeResponse(...);
> ...
> // We got message from previous in less than double connection check interval.
> boolean ok = rcvdTime + CON_CHECK_INTERVAL * 2 >= now; // Here should be a timeout of failure detection.
> if (ok) {
>    // Check case when previous node suddenly died. This will speed up
>    // node failing.
>    ...
> }
> res.previousNodeAlive(ok);
> {code}
> 4) Remove the hardcoded 200ms sleep when marking the previous node alive:
> {code:java}
> ServerImpl.CrossRingMessageSendState.markLastFailedNodeAlive() {
>    ...
>    try {
>       Thread.sleep(200);
>    }
>    catch (InterruptedException e) {
>       Thread.currentThread().interrupt();
>    }
>    ...
> }
> {code}
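
The worst-case bound quoted at the top of the description can be worked through with sample numbers. The inputs below (500 ms for the connection check interval and 10 000 ms for failureDetectionTimeout) are assumed example values, not figures taken from this issue. WorstCaseDetectionDelay and worstCaseMs are hypothetical names.

```java
/** Hypothetical helper that evaluates the bound from the issue description. */
final class WorstCaseDetectionDelay {
    /** Bound from the description: connCheckInterval + 2 * failureDetectionTimeout + 300 ms. */
    static long worstCaseMs(long connCheckIntervalMs, long failureDetectionTimeoutMs) {
        return connCheckIntervalMs + 2 * failureDetectionTimeoutMs + 300;
    }

    public static void main(String[] args) {
        // Assumed example inputs: 500 ms check interval, 10 000 ms failure detection timeout.
        System.out.println(worstCaseMs(500, 10_000) + " ms"); // prints "20800 ms"
    }
}
```

Under these assumed inputs the detection delay can reach about twice the configured failure detection timeout, which is the drawback the suggestions above aim to reduce.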


