[jira] [Commented] (IGNITE-13193) Implement fallback to full partition rebalancing in case historical supplier failed to read all necessary data updates from WAL
[ https://issues.apache.org/jira/browse/IGNITE-13193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17151022#comment-17151022 ]

Vladislav Pyatkov commented on IGNITE-13193:

LGTM.

> Implement fallback to full partition rebalancing in case historical supplier
> failed to read all necessary data updates from WAL
> ---
>
> Key: IGNITE-13193
> URL: https://issues.apache.org/jira/browse/IGNITE-13193
> Project: Ignite
> Issue Type: Improvement
> Affects Versions: 2.8.1
> Reporter: Vyacheslav Koptilin
> Assignee: Vyacheslav Koptilin
> Priority: Major
> Time Spent: 40m
> Remaining Estimate: 0h
>
> Historical rebalance may fail for several reasons:
> 1) WAL on supplier node is corrupted - the supplier will trigger a failure
> handler in the current implementation.
> 2) After iteration over WAL demander node didn't receive all updates to make
> MOVING partition up-to-date (resulting update counter didn't converge with
> expected update counter of OWNING partition) - demander will silently ignore
> lack of updates in the current implementation.
> Such behavior negatively affects the stability of the cluster: an
> inappropriate state of historical WAL is not a reason to fail a supplier node.
> The more proper way to handle this scenario is:
> - Either try to rebalance partition historically from another supplier
> - Or use full partition rebalance for problem partition
> Once the supplier fails to provide data from part of the WAL, its
> corresponding sequence of checkpoints should be marked as inapplicable for
> historical rebalance in order to prevent further errors.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
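
For readers of this thread, the handling proposed in the issue description can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the class HistoricalRebalanceFallback, the WalSupplier interface, the replayWal method and the string-keyed "inapplicable" set are all hypothetical names invented for this example, and none of them are Apache Ignite APIs. The sketch assumes a supplier reports the update counter the demander reached after WAL replay, and treats a counter that is at least the expected one as success.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Illustrative only: none of these types or methods are Apache Ignite APIs. */
class HistoricalRebalanceFallback {
    /** How a partition ended up being rebalanced. */
    enum Mode { HISTORICAL, FULL }

    /** Hypothetical supplier: replays WAL for a partition and returns the update counter reached. */
    interface WalSupplier {
        long replayWal(int partId, long fromCntr) throws Exception;
    }

    /** Supplier/partition pairs whose WAL history is marked inapplicable for historical rebalance. */
    private final Set<String> inapplicable = new HashSet<>();

    /**
     * Tries each supplier historically, in iteration order; falls back to full
     * rebalance when none of them can bring the partition to the expected counter.
     */
    Mode rebalance(int partId, long fromCntr, long expectedCntr, Map<String, WalSupplier> suppliers) {
        for (Map.Entry<String, WalSupplier> e : suppliers.entrySet()) {
            String key = key(e.getKey(), partId);

            // Skip suppliers that already failed for this partition.
            if (inapplicable.contains(key))
                continue;

            try {
                long reached = e.getValue().replayWal(partId, fromCntr);

                // Case 2 from the description: counters must converge, otherwise updates are missing.
                if (reached >= expectedCntr)
                    return Mode.HISTORICAL;
            }
            catch (Exception ex) {
                // Case 1 from the description: unreadable WAL is not a reason to fail the supplier node.
            }

            // Remember the failure so this history is not used again for this partition.
            inapplicable.add(key);
        }

        // No supplier could provide the history: use full partition rebalance.
        return Mode.FULL;
    }

    /** Returns true if the given supplier was marked inapplicable for the partition. */
    boolean isInapplicable(String supplierId, int partId) {
        return inapplicable.contains(key(supplierId, partId));
    }

    private static String key(String supplierId, int partId) {
        return supplierId + ':' + partId;
    }
}
```

With a LinkedHashMap of two suppliers where the first throws on replay and the second reaches the expected counter, rebalance(...) returns HISTORICAL and only the first supplier is marked inapplicable. If every supplier falls short, the result is FULL. Marking per supplier and partition is a simplification of the "sequence of checkpoints" wording in the description.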
[jira] [Commented] (IGNITE-13193) Implement fallback to full partition rebalancing in case historical supplier failed to read all necessary data updates from WAL
[ https://issues.apache.org/jira/browse/IGNITE-13193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17151020#comment-17151020 ]

Ignite TC Bot commented on IGNITE-13193:

{panel:title=Branch: [pull/7971/head] Base: [master] : Possible Blockers (1)|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1}
{color:#d04437}PDS (Indexing){color} [[tests 0 Exit Code |https://ci.ignite.apache.org/viewLog.html?buildId=5436543]]
{panel}
{panel:title=Branch: [pull/7971/head] Base: [master] : New Tests (8)|borderStyle=dashed|borderColor=#ccc|titleBGColor=#D6F7C1}
{color:#8b}Service Grid{color} [tests 4]
* {color:#013220}IgniteServiceGridTestSuite: ServiceDeploymentProcessIdSelfTest.requestId[Test event=IgniteBiTuple [val1=DiscoveryEvent [evtNode=fab90e3a-7a80-42d5-aace-d8bcd082be28, topVer=0, nodeId8=de225e46, msg=, type=NODE_JOINED, tstamp=1593735466437], val2=AffinityTopologyVersion [topVer=2720052317725509699, minorTopVer=0]]] - PASSED{color}
* {color:#013220}IgniteServiceGridTestSuite: ServiceDeploymentProcessIdSelfTest.topologyVersion[Test event=IgniteBiTuple [val1=DiscoveryCustomEvent [customMsg=ServiceChangeBatchRequest [id=9cd49021371-6d964edf-1554-43c4-9610-ba4d1175766c, reqs=SingletonList [ServiceUndeploymentRequest []]], affTopVer=null, super=DiscoveryEvent [evtNode=b6e73219-cc3c-4851-b05e-8c9f4d4e04fc, topVer=0, nodeId8=b6e73219, msg=null, type=DISCOVERY_CUSTOM_EVT, tstamp=1593735466437]], val2=AffinityTopologyVersion [topVer=-9005077219046006491, minorTopVer=0]]] - PASSED{color}
* {color:#013220}IgniteServiceGridTestSuite: ServiceDeploymentProcessIdSelfTest.requestId[Test event=IgniteBiTuple [val1=DiscoveryCustomEvent [customMsg=ServiceChangeBatchRequest [id=9cd49021371-6d964edf-1554-43c4-9610-ba4d1175766c, reqs=SingletonList [ServiceUndeploymentRequest []]], affTopVer=null, super=DiscoveryEvent [evtNode=b6e73219-cc3c-4851-b05e-8c9f4d4e04fc, topVer=0, nodeId8=b6e73219, msg=null, type=DISCOVERY_CUSTOM_EVT, tstamp=1593735466437]], val2=AffinityTopologyVersion [topVer=-9005077219046006491, minorTopVer=0]]] - PASSED{color}
* {color:#013220}IgniteServiceGridTestSuite: ServiceDeploymentProcessIdSelfTest.topologyVersion[Test event=IgniteBiTuple [val1=DiscoveryEvent [evtNode=fab90e3a-7a80-42d5-aace-d8bcd082be28, topVer=0, nodeId8=de225e46, msg=, type=NODE_JOINED, tstamp=1593735466437], val2=AffinityTopologyVersion [topVer=2720052317725509699, minorTopVer=0]]] - PASSED{color}
{color:#8b}Service Grid (legacy mode){color} [tests 4]
* {color:#013220}IgniteServiceGridTestSuite: ServiceDeploymentProcessIdSelfTest.topologyVersion[Test event=IgniteBiTuple [val1=DiscoveryEvent [evtNode=d6e5484b-0c04-4958-af8c-87060631298e, topVer=0, nodeId8=0b6cb16a, msg=, type=NODE_JOINED, tstamp=1593735530785], val2=AffinityTopologyVersion [topVer=8840357433858611253, minorTopVer=0]]] - PASSED{color}
* {color:#013220}IgniteServiceGridTestSuite: ServiceDeploymentProcessIdSelfTest.requestId[Test event=IgniteBiTuple [val1=DiscoveryEvent [evtNode=d6e5484b-0c04-4958-af8c-87060631298e, topVer=0, nodeId8=0b6cb16a, msg=, type=NODE_JOINED, tstamp=1593735530785], val2=AffinityTopologyVersion [topVer=8840357433858611253, minorTopVer=0]]] - PASSED{color}
* {color:#013220}IgniteServiceGridTestSuite: ServiceDeploymentProcessIdSelfTest.topologyVersion[Test event=IgniteBiTuple [val1=DiscoveryCustomEvent [customMsg=ServiceChangeBatchRequest [id=d1db4121371-52eedea3-cb4b-4f87-ad96-23a96c33063b, reqs=SingletonList [ServiceUndeploymentRequest []]], affTopVer=null, super=DiscoveryEvent [evtNode=39549e86-8440-4fda-aa43-8f588a7e8ac8, topVer=0, nodeId8=39549e86, msg=null, type=DISCOVERY_CUSTOM_EVT, tstamp=1593735530785]], val2=AffinityTopologyVersion [topVer=2935041773177708175, minorTopVer=0]]] - PASSED{color}
* {color:#013220}IgniteServiceGridTestSuite: ServiceDeploymentProcessIdSelfTest.requestId[Test event=IgniteBiTuple [val1=DiscoveryCustomEvent [customMsg=ServiceChangeBatchRequest [id=d1db4121371-52eedea3-cb4b-4f87-ad96-23a96c33063b, reqs=SingletonList [ServiceUndeploymentRequest []]], affTopVer=null, super=DiscoveryEvent [evtNode=39549e86-8440-4fda-aa43-8f588a7e8ac8, topVer=0, nodeId8=39549e86, msg=null, type=DISCOVERY_CUSTOM_EVT, tstamp=1593735530785]], val2=AffinityTopologyVersion [topVer=2935041773177708175, minorTopVer=0]]] - PASSED{color}
{panel}
[TeamCity *-- Run :: All* Results|https://ci.ignite.apache.org/viewLog.html?buildId=5436032buildTypeId=IgniteTests24Java8_RunAll]

> Implement fallback to full partition rebalancing in case historical supplier
> failed to read all necessary data updates from WAL
> ---
>
> Key: IGNITE-13193
> URL: https://issues.apache.org/jira/browse/IGNITE-13193
> Project: Ignite
> Issue Type: Improvement
>
[jira] [Commented] (IGNITE-13193) Implement fallback to full partition rebalancing in case historical supplier failed to read all necessary data updates from WAL
[ https://issues.apache.org/jira/browse/IGNITE-13193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17151021#comment-17151021 ]

Vyacheslav Koptilin commented on IGNITE-13193:
--

Hello [~v.pyatkov],

I have addressed your comments at PR. Please take a look.

> Implement fallback to full partition rebalancing in case historical supplier
> failed to read all necessary data updates from WAL
> ---
>
> Key: IGNITE-13193
> URL: https://issues.apache.org/jira/browse/IGNITE-13193
> Project: Ignite
> Issue Type: Improvement
> Affects Versions: 2.8.1
> Reporter: Vyacheslav Koptilin
> Assignee: Vyacheslav Koptilin
> Priority: Major
> Time Spent: 40m
> Remaining Estimate: 0h
[jira] [Commented] (IGNITE-13193) Implement fallback to full partition rebalancing in case historical supplier failed to read all necessary data updates from WAL
[ https://issues.apache.org/jira/browse/IGNITE-13193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17150228#comment-17150228 ]

Vladislav Pyatkov commented on IGNITE-13193:

[~slava.koptilin] I left three comments in PR. Please look at those.

> Implement fallback to full partition rebalancing in case historical supplier
> failed to read all necessary data updates from WAL
> ---
>
> Key: IGNITE-13193
> URL: https://issues.apache.org/jira/browse/IGNITE-13193
> Project: Ignite
> Issue Type: Improvement
> Affects Versions: 2.8.1
> Reporter: Vyacheslav Koptilin
> Assignee: Vyacheslav Koptilin
> Priority: Major
> Time Spent: 20m
> Remaining Estimate: 0h
[jira] [Commented] (IGNITE-13193) Implement fallback to full partition rebalancing in case historical supplier failed to read all necessary data updates from WAL
[ https://issues.apache.org/jira/browse/IGNITE-13193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148704#comment-17148704 ]

Ignite TC Bot commented on IGNITE-13193:

{panel:title=Branch: [pull/7971/head] Base: [master] : No blockers found!|borderStyle=dashed|borderColor=#ccc|titleBGColor=#D6F7C1}{panel}
{panel:title=Branch: [pull/7971/head] Base: [master] : New Tests (12)|borderStyle=dashed|borderColor=#ccc|titleBGColor=#D6F7C1}
{color:#8b}PDS (Indexing){color} [tests 4]
* {color:#013220}IgnitePdsWithIndexingCoreTestSuite: IgniteWalRebalanceTest.testSwitchHistoricalRebalanceToFullAndClientJoin - PASSED{color}
* {color:#013220}IgnitePdsWithIndexingCoreTestSuite: IgniteWalRebalanceTest.testMultipleNodesFailHistoricalRebalance - PASSED{color}
* {color:#013220}IgnitePdsWithIndexingCoreTestSuite: IgniteWalRebalanceTest.testSwitchHistoricalRebalanceToFullDueToFailOnCreatingWalIterator - PASSED{color}
* {color:#013220}IgnitePdsWithIndexingCoreTestSuite: IgniteWalRebalanceTest.testSwitchHistoricalRebalanceToFullWhileIteratingOverWAL - PASSED{color}
{color:#8b}Service Grid{color} [tests 4]
* {color:#013220}IgniteServiceGridTestSuite: ServiceDeploymentProcessIdSelfTest.topologyVersion[Test event=IgniteBiTuple [val1=DiscoveryEvent [evtNode=d4fb8cb4-f5b4-40c5-aa31-78a86f176a39, topVer=0, nodeId8=988d3ef2, msg=, type=NODE_JOINED, tstamp=1593484888764], val2=AffinityTopologyVersion [topVer=2250924009422792828, minorTopVer=0]]] - PASSED{color}
* {color:#013220}IgniteServiceGridTestSuite: ServiceDeploymentProcessIdSelfTest.requestId[Test event=IgniteBiTuple [val1=DiscoveryEvent [evtNode=d4fb8cb4-f5b4-40c5-aa31-78a86f176a39, topVer=0, nodeId8=988d3ef2, msg=, type=NODE_JOINED, tstamp=1593484888764], val2=AffinityTopologyVersion [topVer=2250924009422792828, minorTopVer=0]]] - PASSED{color}
* {color:#013220}IgniteServiceGridTestSuite: ServiceDeploymentProcessIdSelfTest.topologyVersion[Test event=IgniteBiTuple [val1=DiscoveryCustomEvent [customMsg=ServiceChangeBatchRequest [id=0cac9130371-e91bd767-ce93-405f-8c4c-e65a6af284f1, reqs=SingletonList [ServiceUndeploymentRequest []]], affTopVer=null, super=DiscoveryEvent [evtNode=f96e45f8-ee71-45c8-b086-4f12b58b4e47, topVer=0, nodeId8=f96e45f8, msg=null, type=DISCOVERY_CUSTOM_EVT, tstamp=1593484888764]], val2=AffinityTopologyVersion [topVer=6194541553269355410, minorTopVer=0]]] - PASSED{color}
* {color:#013220}IgniteServiceGridTestSuite: ServiceDeploymentProcessIdSelfTest.requestId[Test event=IgniteBiTuple [val1=DiscoveryCustomEvent [customMsg=ServiceChangeBatchRequest [id=0cac9130371-e91bd767-ce93-405f-8c4c-e65a6af284f1, reqs=SingletonList [ServiceUndeploymentRequest []]], affTopVer=null, super=DiscoveryEvent [evtNode=f96e45f8-ee71-45c8-b086-4f12b58b4e47, topVer=0, nodeId8=f96e45f8, msg=null, type=DISCOVERY_CUSTOM_EVT, tstamp=1593484888764]], val2=AffinityTopologyVersion [topVer=6194541553269355410, minorTopVer=0]]] - PASSED{color}
{color:#8b}Service Grid (legacy mode){color} [tests 4]
* {color:#013220}IgniteServiceGridTestSuite: ServiceDeploymentProcessIdSelfTest.topologyVersion[Test event=IgniteBiTuple [val1=DiscoveryEvent [evtNode=7edce5fc-46de-4493-a2cf-e515e4a06cb3, topVer=0, nodeId8=dae95a9e, msg=, type=NODE_JOINED, tstamp=1593485037885], val2=AffinityTopologyVersion [topVer=-5513012272294030853, minorTopVer=0]]] - PASSED{color}
* {color:#013220}IgniteServiceGridTestSuite: ServiceDeploymentProcessIdSelfTest.requestId[Test event=IgniteBiTuple [val1=DiscoveryEvent [evtNode=7edce5fc-46de-4493-a2cf-e515e4a06cb3, topVer=0, nodeId8=dae95a9e, msg=, type=NODE_JOINED, tstamp=1593485037885], val2=AffinityTopologyVersion [topVer=-5513012272294030853, minorTopVer=0]]] - PASSED{color}
* {color:#013220}IgniteServiceGridTestSuite: ServiceDeploymentProcessIdSelfTest.topologyVersion[Test event=IgniteBiTuple [val1=DiscoveryCustomEvent [customMsg=ServiceChangeBatchRequest [id=f9156230371-9f284914-20cf-402c-a951-de72c3dce064, reqs=SingletonList [ServiceUndeploymentRequest []]], affTopVer=null, super=DiscoveryEvent [evtNode=3c1ef5dc-b557-424e-84c4-ef739a37e3fb, topVer=0, nodeId8=3c1ef5dc, msg=null, type=DISCOVERY_CUSTOM_EVT, tstamp=1593485037885]], val2=AffinityTopologyVersion [topVer=4163126190682535866, minorTopVer=0]]] - PASSED{color}
* {color:#013220}IgniteServiceGridTestSuite: ServiceDeploymentProcessIdSelfTest.requestId[Test event=IgniteBiTuple [val1=DiscoveryCustomEvent [customMsg=ServiceChangeBatchRequest [id=f9156230371-9f284914-20cf-402c-a951-de72c3dce064, reqs=SingletonList [ServiceUndeploymentRequest []]], affTopVer=null, super=DiscoveryEvent [evtNode=3c1ef5dc-b557-424e-84c4-ef739a37e3fb, topVer=0, nodeId8=3c1ef5dc, msg=null, type=DISCOVERY_CUSTOM_EVT, tstamp=1593485037885]], val2=AffinityTopologyVersion [topVer=4163126190682535866, minorTopVer=0]]] - PASSED{color}
{panel}
[TeamCity *-- Run :: All*