[jira] [Commented] (IGNITE-13016) Fix backward checking of failed node.
[ https://issues.apache.org/jira/browse/IGNITE-13016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17162076#comment-17162076 ]

Aleksey Plekhanov commented on IGNITE-13016:
--------------------------------------------

Cherry-picked to 2.9

> Fix backward checking of failed node.
> -------------------------------------
>
>                 Key: IGNITE-13016
>                 URL: https://issues.apache.org/jira/browse/IGNITE-13016
>             Project: Ignite
>          Issue Type: Sub-task
>            Reporter: Vladimir Steshin
>            Assignee: Vladimir Steshin
>            Priority: Major
>              Labels: iep-45
>             Fix For: 2.9
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Backward node connection checking looks weird. What we might improve:
> 1) Address checking could be done in parallel, not sequentially.
> {code:java}
>     for (InetSocketAddress addr : nodeAddrs) {
>         // Connection refused may be got if node doesn't listen
>         // (or blocked by firewall, but anyway assume it is dead).
>         if (!isConnectionRefused(addr)) {
>             liveAddr = addr;
>             break;
>         }
>     }
> {code}
> 2) Any I/O exception should be considered a failed connection, not only connection-refused:
> {code:java}
>     catch (ConnectException e) {
>         return true;
>     }
>     catch (IOException e) {
>         return false;
>     }
> {code}
> 3) The timeout on connection checking should not be constant or hardcoded:
> {code:java}
>     sock.connect(addr, 100);
> {code}
> 4) The decision to check the connection should rely on the configured exchange timeout, not on the ping interval:
> {code:java}
>     // We got message from previous in less than double connection check interval.
>     boolean ok = rcvdTime + U.millisToNanos(connCheckInterval) * 2 >= now;
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
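Suggestions 1-3 in the description above could be combined roughly as follows. This is only a sketch under stated assumptions, not the code from the merged patch: the class ParallelAddressCheckSketch, the methods isReachable and findLiveAddress, and the timeoutMs parameter are all hypothetical names. It probes every address of a node concurrently, treats any IOException as a failed connection, and takes the connect timeout as a parameter instead of the hardcoded 100.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical sketch; not part of Ignite's ServerImpl. */
final class ParallelAddressCheckSketch {
    /** Returns true only if a TCP connection to addr opens within timeoutMs. */
    static boolean isReachable(InetSocketAddress addr, int timeoutMs) {
        try (Socket sock = new Socket()) {
            // Suggestion 3: the timeout is supplied by the caller, not hardcoded.
            sock.connect(addr, timeoutMs);

            return true;
        }
        catch (IOException e) {
            // Suggestion 2: refused, timed out, unresolved host and so on all count as failure.
            return false;
        }
    }

    /** Probes all addresses concurrently; returns one live address, or null if none answered. */
    static InetSocketAddress findLiveAddress(Collection<InetSocketAddress> addrs, int timeoutMs) {
        if (addrs.isEmpty())
            return null;

        ExecutorService pool = Executors.newFixedThreadPool(addrs.size());

        try {
            List<Callable<InetSocketAddress>> probes = new ArrayList<>();

            for (InetSocketAddress addr : addrs) {
                probes.add(() -> {
                    if (isReachable(addr, timeoutMs))
                        return addr;

                    // A failed probe must throw so that invokeAny keeps waiting for the others.
                    throw new IOException("Unreachable: " + addr);
                });
            }

            // Suggestion 1: returns the result of the first probe that completes successfully.
            return pool.invokeAny(probes);
        }
        catch (ExecutionException e) {
            return null; // Every probe failed.
        }
        catch (InterruptedException e) {
            Thread.currentThread().interrupt();

            return null;
        }
        finally {
            pool.shutdownNow();
        }
    }
}
```

With this shape the wait is bounded by roughly one timeoutMs rather than by the sum of the timeouts over all addresses. Note that java.net.Socket treats a timeout of 0 as infinite, so a caller would need to pass a positive value.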
[jira] [Commented] (IGNITE-13016) Fix backward checking of failed node.
[ https://issues.apache.org/jira/browse/IGNITE-13016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17161933#comment-17161933 ]

Sergey Chugunov commented on IGNITE-13016:
------------------------------------------

[~vladsz83],

The patch looks good to me. I merged it into the master branch in commit *03ee85695014ff6aaa87e256d330d32342d34224*.

Thank you for the contribution!
[jira] [Commented] (IGNITE-13016) Fix backward checking of failed node.
[ https://issues.apache.org/jira/browse/IGNITE-13016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158994#comment-17158994 ]

Ignite TC Bot commented on IGNITE-13016:
----------------------------------------

{panel:title=Branch: [pull/7838/head] Base: [master] : No blockers found!|borderStyle=dashed|borderColor=#ccc|titleBGColor=#D6F7C1}{panel}
{panel:title=Branch: [pull/7838/head] Base: [master] : New Tests (8)|borderStyle=dashed|borderColor=#ccc|titleBGColor=#D6F7C1}
{color:#8b}Service Grid{color} [[tests 4|https://ci.ignite.apache.org/viewLog.html?buildId=5462169]]
* {color:#013220}IgniteServiceGridTestSuite: ServiceDeploymentProcessIdSelfTest.topologyVersion[Test event=IgniteBiTuple [val1=DiscoveryEvent [evtNode=eb37c448-46e6-47f1-90c3-e2c1232ce3e0, topVer=0, nodeId8=07632a1b, msg=, type=NODE_JOINED, tstamp=1594738092959], val2=AffinityTopologyVersion [topVer=6570610929897650521, minorTopVer=0]]] - PASSED{color}
* {color:#013220}IgniteServiceGridTestSuite: ServiceDeploymentProcessIdSelfTest.requestId[Test event=IgniteBiTuple [val1=DiscoveryEvent [evtNode=eb37c448-46e6-47f1-90c3-e2c1232ce3e0, topVer=0, nodeId8=07632a1b, msg=, type=NODE_JOINED, tstamp=1594738092959], val2=AffinityTopologyVersion [topVer=6570610929897650521, minorTopVer=0]]] - PASSED{color}
* {color:#013220}IgniteServiceGridTestSuite: ServiceDeploymentProcessIdSelfTest.topologyVersion[Test event=IgniteBiTuple [val1=DiscoveryCustomEvent [customMsg=ServiceChangeBatchRequest [id=90486dd4371-1cb86db9-6955-413c-a8bc-184a9dae4984, reqs=SingletonList [ServiceUndeploymentRequest []]], affTopVer=null, super=DiscoveryEvent [evtNode=dc276d8e-243d-4c5a-ae8e-ab4ab9525610, topVer=0, nodeId8=dc276d8e, msg=null, type=DISCOVERY_CUSTOM_EVT, tstamp=1594738092959]], val2=AffinityTopologyVersion [topVer=6331264228940651374, minorTopVer=0]]] - PASSED{color}
* {color:#013220}IgniteServiceGridTestSuite: ServiceDeploymentProcessIdSelfTest.requestId[Test event=IgniteBiTuple [val1=DiscoveryCustomEvent [customMsg=ServiceChangeBatchRequest [id=90486dd4371-1cb86db9-6955-413c-a8bc-184a9dae4984, reqs=SingletonList [ServiceUndeploymentRequest []]], affTopVer=null, super=DiscoveryEvent [evtNode=dc276d8e-243d-4c5a-ae8e-ab4ab9525610, topVer=0, nodeId8=dc276d8e, msg=null, type=DISCOVERY_CUSTOM_EVT, tstamp=1594738092959]], val2=AffinityTopologyVersion [topVer=6331264228940651374, minorTopVer=0]]] - PASSED{color}

{color:#8b}Service Grid (legacy mode){color} [[tests 4|https://ci.ignite.apache.org/viewLog.html?buildId=5462170]]
* {color:#013220}IgniteServiceGridTestSuite: ServiceDeploymentProcessIdSelfTest.topologyVersion[Test event=IgniteBiTuple [val1=DiscoveryEvent [evtNode=b5e153ec-563e-4f60-8a4c-10d4a07635be, topVer=0, nodeId8=cb4a3e0c, msg=, type=NODE_JOINED, tstamp=1594738174507], val2=AffinityTopologyVersion [topVer=2298789891461322797, minorTopVer=0]]] - PASSED{color}
* {color:#013220}IgniteServiceGridTestSuite: ServiceDeploymentProcessIdSelfTest.requestId[Test event=IgniteBiTuple [val1=DiscoveryEvent [evtNode=b5e153ec-563e-4f60-8a4c-10d4a07635be, topVer=0, nodeId8=cb4a3e0c, msg=, type=NODE_JOINED, tstamp=1594738174507], val2=AffinityTopologyVersion [topVer=2298789891461322797, minorTopVer=0]]] - PASSED{color}
* {color:#013220}IgniteServiceGridTestSuite: ServiceDeploymentProcessIdSelfTest.topologyVersion[Test event=IgniteBiTuple [val1=DiscoveryCustomEvent [customMsg=ServiceChangeBatchRequest [id=f2a6dcd4371-42ed2f0b-c0c4-4ad3-be48-9a06bb65c82d, reqs=SingletonList [ServiceUndeploymentRequest []]], affTopVer=null, super=DiscoveryEvent [evtNode=feefc9bc-3741-48da-8a92-9d7d5732f937, topVer=0, nodeId8=feefc9bc, msg=null, type=DISCOVERY_CUSTOM_EVT, tstamp=1594738174507]], val2=AffinityTopologyVersion [topVer=1483823403640769579, minorTopVer=0]]] - PASSED{color}
* {color:#013220}IgniteServiceGridTestSuite: ServiceDeploymentProcessIdSelfTest.requestId[Test event=IgniteBiTuple [val1=DiscoveryCustomEvent [customMsg=ServiceChangeBatchRequest [id=f2a6dcd4371-42ed2f0b-c0c4-4ad3-be48-9a06bb65c82d, reqs=SingletonList [ServiceUndeploymentRequest []]], affTopVer=null, super=DiscoveryEvent [evtNode=feefc9bc-3741-48da-8a92-9d7d5732f937, topVer=0, nodeId8=feefc9bc, msg=null, type=DISCOVERY_CUSTOM_EVT, tstamp=1594738174507]], val2=AffinityTopologyVersion [topVer=1483823403640769579, minorTopVer=0]]] - PASSED{color}
{panel}
[TeamCity *--> Run :: All* Results|https://ci.ignite.apache.org/viewLog.html?buildId=5462190&buildTypeId=IgniteTests24Java8_RunAll]
[jira] [Commented] (IGNITE-13016) Fix backward checking of failed node.
[ https://issues.apache.org/jira/browse/IGNITE-13016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17148825#comment-17148825 ]

Ignite TC Bot commented on IGNITE-13016:
----------------------------------------

{panel:title=Branch: [pull/7838/head] Base: [master] : No blockers found!|borderStyle=dashed|borderColor=#ccc|titleBGColor=#D6F7C1}{panel}
{panel:title=Branch: [pull/7838/head] Base: [master] : New Tests (8)|borderStyle=dashed|borderColor=#ccc|titleBGColor=#D6F7C1}
{color:#8b}Service Grid{color} [tests 4]
* {color:#013220}IgniteServiceGridTestSuite: ServiceDeploymentProcessIdSelfTest.requestId[Test event=IgniteBiTuple [val1=DiscoveryCustomEvent [customMsg=ServiceChangeBatchRequest [id=63e53550371-b7bbaed4-9174-4bd0-832e-86bb88791e4f, reqs=SingletonList [ServiceUndeploymentRequest []]], affTopVer=null, super=DiscoveryEvent [evtNode=f8d1a9be-a5ec-4dee-8c71-dbb945007756, topVer=0, nodeId8=f8d1a9be, msg=null, type=DISCOVERY_CUSTOM_EVT, tstamp=1593522216498]], val2=AffinityTopologyVersion [topVer=479273715644047619, minorTopVer=0]]] - PASSED{color}
* {color:#013220}IgniteServiceGridTestSuite: ServiceDeploymentProcessIdSelfTest.topologyVersion[Test event=IgniteBiTuple [val1=DiscoveryEvent [evtNode=d429993e-eff2-4659-bdf9-4bed09fca199, topVer=0, nodeId8=1eff0d87, msg=, type=NODE_JOINED, tstamp=1593522216498], val2=AffinityTopologyVersion [topVer=4418461583898486663, minorTopVer=0]]] - PASSED{color}
* {color:#013220}IgniteServiceGridTestSuite: ServiceDeploymentProcessIdSelfTest.requestId[Test event=IgniteBiTuple [val1=DiscoveryEvent [evtNode=d429993e-eff2-4659-bdf9-4bed09fca199, topVer=0, nodeId8=1eff0d87, msg=, type=NODE_JOINED, tstamp=1593522216498], val2=AffinityTopologyVersion [topVer=4418461583898486663, minorTopVer=0]]] - PASSED{color}
* {color:#013220}IgniteServiceGridTestSuite: ServiceDeploymentProcessIdSelfTest.topologyVersion[Test event=IgniteBiTuple [val1=DiscoveryCustomEvent [customMsg=ServiceChangeBatchRequest [id=63e53550371-b7bbaed4-9174-4bd0-832e-86bb88791e4f, reqs=SingletonList [ServiceUndeploymentRequest []]], affTopVer=null, super=DiscoveryEvent [evtNode=f8d1a9be-a5ec-4dee-8c71-dbb945007756, topVer=0, nodeId8=f8d1a9be, msg=null, type=DISCOVERY_CUSTOM_EVT, tstamp=1593522216498]], val2=AffinityTopologyVersion [topVer=479273715644047619, minorTopVer=0]]] - PASSED{color}

{color:#8b}Service Grid (legacy mode){color} [tests 4]
* {color:#013220}IgniteServiceGridTestSuite: ServiceDeploymentProcessIdSelfTest.topologyVersion[Test event=IgniteBiTuple [val1=DiscoveryEvent [evtNode=ac7fbe8c-3709-480f-8a9b-a17e9cb4e307, topVer=0, nodeId8=5a4dfb3a, msg=, type=NODE_JOINED, tstamp=1593522268422], val2=AffinityTopologyVersion [topVer=1174591130455611368, minorTopVer=0]]] - PASSED{color}
* {color:#013220}IgniteServiceGridTestSuite: ServiceDeploymentProcessIdSelfTest.requestId[Test event=IgniteBiTuple [val1=DiscoveryEvent [evtNode=ac7fbe8c-3709-480f-8a9b-a17e9cb4e307, topVer=0, nodeId8=5a4dfb3a, msg=, type=NODE_JOINED, tstamp=1593522268422], val2=AffinityTopologyVersion [topVer=1174591130455611368, minorTopVer=0]]] - PASSED{color}
* {color:#013220}IgniteServiceGridTestSuite: ServiceDeploymentProcessIdSelfTest.topologyVersion[Test event=IgniteBiTuple [val1=DiscoveryCustomEvent [customMsg=ServiceChangeBatchRequest [id=d6a6e550371-c62106f4-0bc0-4061-b0c1-735533169ef7, reqs=SingletonList [ServiceUndeploymentRequest []]], affTopVer=null, super=DiscoveryEvent [evtNode=90c60780-81e5-40b9-ae84-eb7166a59ae5, topVer=0, nodeId8=90c60780, msg=null, type=DISCOVERY_CUSTOM_EVT, tstamp=1593522268422]], val2=AffinityTopologyVersion [topVer=110843890637889783, minorTopVer=0]]] - PASSED{color}
* {color:#013220}IgniteServiceGridTestSuite: ServiceDeploymentProcessIdSelfTest.requestId[Test event=IgniteBiTuple [val1=DiscoveryCustomEvent [customMsg=ServiceChangeBatchRequest [id=d6a6e550371-c62106f4-0bc0-4061-b0c1-735533169ef7, reqs=SingletonList [ServiceUndeploymentRequest []]], affTopVer=null, super=DiscoveryEvent [evtNode=90c60780-81e5-40b9-ae84-eb7166a59ae5, topVer=0, nodeId8=90c60780, msg=null, type=DISCOVERY_CUSTOM_EVT, tstamp=1593522268422]], val2=AffinityTopologyVersion [topVer=110843890637889783, minorTopVer=0]]] - PASSED{color}
{panel}
[TeamCity *--> Run :: All* Results|https://ci.ignite.apache.org/viewLog.html?buildId=5430065&buildTypeId=IgniteTests24Java8_RunAll]

> Fix backward checking of failed node.
> -------------------------------------
>
>                 Key: IGNITE-13016
>                 URL: https://issues.apache.org/jira/browse/IGNITE-13016
>             Project: Ignite
>          Issue Type: Sub-task
>            Reporter: Vladimir Steshin
>            Assignee: Vladimir Steshin
>            Priority: Major
>              Labels: iep-45
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Backward node connection checking looks weird. What might be improved are:
> 1) Addresses checking could
[jira] [Commented] (IGNITE-13016) Fix backward checking of failed node.
[ https://issues.apache.org/jira/browse/IGNITE-13016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17119489#comment-17119489 ]

Ignite TC Bot commented on IGNITE-13016:
----------------------------------------

{panel:title=Branch: [pull/7838/head] Base: [master] : No blockers found!|borderStyle=dashed|borderColor=#ccc|titleBGColor=#D6F7C1}{panel}
[TeamCity *--> Run :: All* Results|https://ci.ignite.apache.org/viewLog.html?buildId=5346497&buildTypeId=IgniteTests24Java8_RunAll]

> Fix backward checking of failed node.
> -------------------------------------
>
>                 Key: IGNITE-13016
>                 URL: https://issues.apache.org/jira/browse/IGNITE-13016
>             Project: Ignite
>          Issue Type: Sub-task
>            Reporter: Vladimir Steshin
>            Assignee: Vladimir Steshin
>            Priority: Major
>              Labels: iep-45
>         Attachments: FailureDetectionResearch.txt, FailureDetectionResearch_fixed.txt, NodeFailureResearch.patch, WostCaseStepByStep.txt
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> We should fix several drawbacks in the backward checking of a failed node. They
> prolong node failure detection up to:
> ServerImpl.CON_CHECK_INTERVAL + 2 * IgniteConfiguration.failureDetectionTimeout + 300ms.
> See:
> * '_NodeFailureResearch.patch_'. It creates the test 'FailureDetectionResearch', which emulates long answers from a failed node and measures failure detection delays.
> * '_FailureDetectionResearch.txt_' - results of the test.
> * '_FailureDetectionResearch_fixed.txt_' - results of the test after this fix.
> * '_WostCaseStepByStep.txt_' - description of how the worst case happens.
> *Suggestions:*
> 1) We should replace the hardcoded 100ms timeout with a parameter like failureDetectionTimeout:
> {code:java}
> private boolean ServerImpl.isConnectionRefused(SocketAddress addr) {
>    ...
>    sock.connect(addr, 100); // Make it rely on failureDetectionTimeout.
>    ...
> }
> {code}
> 2) Any negative result of the connection check should be treated as a node failure. Currently, we look only at a refused connection. Any other exception, including a timeout, is treated as a living connection:
> {code:java}
> private boolean ServerImpl.isConnectionRefused(SocketAddress addr) {
>    ...
>    catch (ConnectException e) {
>       return true;
>    }
>    catch (IOException e) {
>       return false; // Make any error mean lost connection.
>    }
>    return false;
> }
> {code}
> 3) The maximal interval to check the previous node should rely on the actual failure detection timeout:
> {code:java}
>    TcpDiscoveryHandshakeResponse res = new TcpDiscoveryHandshakeResponse(...);
>    ...
>    // We got message from previous in less than double connection check interval.
>    boolean ok = rcvdTime + CON_CHECK_INTERVAL * 2 >= now; // Here should be a timeout of failure detection.
>    if (ok) {
>       // Check case when previous node suddenly died. This will speed up
>       // node failing.
>       ...
>    }
>    res.previousNodeAlive(ok);
> {code}
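Suggestion 3 above amounts to changing the width of the time window in which the previous node is still presumed alive. A minimal sketch of such a check, assuming times taken from System.nanoTime() and a timeout in milliseconds; the class PreviousNodeCheckSketch and the method heardRecently are hypothetical names, not Ignite code:

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch; not part of Ignite's ServerImpl. */
final class PreviousNodeCheckSketch {
    /**
     * Returns true if the last message from the previous node arrived within
     * the failure detection timeout, so the node is still worth probing.
     */
    static boolean heardRecently(long rcvdTimeNanos, long nowNanos, long failureDetectionTimeoutMs) {
        // Subtracting first keeps the comparison valid even if nanoTime values wrap around.
        return nowNanos - rcvdTimeNanos <= TimeUnit.MILLISECONDS.toNanos(failureDetectionTimeoutMs);
    }

    public static void main(String[] args) {
        long now = System.nanoTime();
        long rcvd = now - TimeUnit.MILLISECONDS.toNanos(3_000); // Last message 3 s ago.

        System.out.println(heardRecently(rcvd, now, 10_000)); // prints true: inside a 10 s window
        System.out.println(heardRecently(rcvd, now, 1_000));  // prints false: outside a 1 s window
    }
}
```

The window here is the timeout itself; whether the real fix uses the timeout directly or some multiple of it is not stated in this message.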
[jira] [Commented] (IGNITE-13016) Fix backward checking of failed node.
[ https://issues.apache.org/jira/browse/IGNITE-13016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17118541#comment-17118541 ]

Ignite TC Bot commented on IGNITE-13016:
----------------------------------------

{panel:title=Branch: [pull/7838/head] Base: [master] : No blockers found!|borderStyle=dashed|borderColor=#ccc|titleBGColor=#D6F7C1}{panel}
[TeamCity *--> Run :: All* Results|https://ci.ignite.apache.org/viewLog.html?buildId=5339425&buildTypeId=IgniteTests24Java8_RunAll]

> Fix backward checking of failed node.
> -------------------------------------
>
>                 Key: IGNITE-13016
>                 URL: https://issues.apache.org/jira/browse/IGNITE-13016
>             Project: Ignite
>          Issue Type: Sub-task
>            Reporter: Vladimir Steshin
>            Assignee: Vladimir Steshin
>            Priority: Major
>              Labels: iep-45
>         Attachments: FailureDetectionResearch.txt, FailureDetectionResearch_fixed.txt, NodeFailureResearch.patch, WostCaseStepByStep.txt
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> We should fix several drawbacks in the backward checking of a failed node. They
> prolong node failure detection up to:
> ServerImpl.CON_CHECK_INTERVAL + 2 * IgniteConfiguration.failureDetectionTimeout + 300ms.
> See:
> * '_NodeFailureResearch.patch_'. It creates the test 'FailureDetectionResearch', which emulates long answers from a failed node and measures failure detection delays.
> * '_NodeFailureResearch.txt_' - results of the test.
> * 'NodeFailureResearch_fixed.txt' - results of the test after this fix.
> * '_WostCaseStepByStep.txt_' - description of how the worst case happens.
> *Suggestions:*
> 1) We should replace the hardcoded 100ms timeout with a parameter like failureDetectionTimeout:
> {code:java}
> private boolean ServerImpl.isConnectionRefused(SocketAddress addr) {
>    ...
>    sock.connect(addr, 100); // Make it rely on failureDetectionTimeout.
>    ...
> }
> {code}
> 2) Any negative result of the connection check should be treated as a node failure. Currently, we look only at a refused connection. Any other exception, including a timeout, is treated as a living connection:
> {code:java}
> private boolean ServerImpl.isConnectionRefused(SocketAddress addr) {
>    ...
>    catch (ConnectException e) {
>       return true;
>    }
>    catch (IOException e) {
>       return false; // Make any error mean lost connection.
>    }
>    return false;
> }
> {code}
> 3) The maximal interval to check the previous node should rely on the actual failure detection timeout:
> {code:java}
>    TcpDiscoveryHandshakeResponse res = new TcpDiscoveryHandshakeResponse(...);
>    ...
>    // We got message from previous in less than double connection check interval.
>    boolean ok = rcvdTime + CON_CHECK_INTERVAL * 2 >= now; // Here should be a timeout of failure detection.
>    if (ok) {
>       // Check case when previous node suddenly died. This will speed up
>       // node failing.
>       ...
>    }
>    res.previousNodeAlive(ok);
> {code}
> 4) Remove the hardcoded sleep of 200ms when marking the previous node alive:
> {code:java}
> ServerImpl.CrossRingMessageSendState.markLastFailedNodeAlive(){
>    ...
>    try {
>       Thread.sleep(200);
>    }
>    catch (InterruptedException e) {
>       Thread.currentThread().interrupt();
>    }
>    ...
> }
> {code}
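To make the worst-case bound quoted in the description above concrete, the arithmetic can be written out as below. This is an illustration only: the class DetectionDelaySketch and its method are hypothetical, and the inputs in main (a 500 ms check interval and a 10 000 ms failure detection timeout) are example values assumed for the calculation, not figures taken from this message. The 300 ms term is copied from the formula as-is; it equals the 100 ms connect timeout plus the 200 ms sleep listed in the suggestions, but the message does not state that breakdown.

```java
/** Hypothetical sketch of the worst-case formula quoted in the issue description. */
final class DetectionDelaySketch {
    /** Constant term of the quoted formula, in milliseconds. */
    static final long FIXED_EXTRA_MS = 300;

    /** CON_CHECK_INTERVAL + 2 * failureDetectionTimeout + 300ms, all in milliseconds. */
    static long worstCaseDelayMs(long conCheckIntervalMs, long failureDetectionTimeoutMs) {
        return conCheckIntervalMs + 2 * failureDetectionTimeoutMs + FIXED_EXTRA_MS;
    }

    public static void main(String[] args) {
        // Example inputs only (assumed): 500 ms interval, 10 000 ms timeout.
        System.out.println(worstCaseDelayMs(500, 10_000)); // prints 20800
    }
}
```

With these example inputs the detection delay is roughly twice the configured timeout, which is the gap the suggestions aim to close.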