[jira] [Commented] (IGNITE-17457) Cluster locks after the transaction recovery procedure if the tx primary node fail
[ https://issues.apache.org/jira/browse/IGNITE-17457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581378#comment-17581378 ]

Anton Vinogradov commented on IGNITE-17457:
---

Merged to the master.

> Cluster locks after the transaction recovery procedure if the tx primary node fail
> --
>
>                 Key: IGNITE-17457
>                 URL: https://issues.apache.org/jira/browse/IGNITE-17457
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Sergey Korotkov
>            Assignee: Sergey Korotkov
>            Priority: Major
>             Fix For: 2.14
>
>          Time Spent: 8h 20m
>  Remaining Estimate: 0h
>
> An Ignite cluster may become locked (all client operations block) after the
> tx recovery procedure is executed on failure of the tx near & primary node.
> The prepared transaction may remain uncommitted on the backup node after the
> tx recovery, so the partition exchange never completes and the cluster stays
> locked.
>
> The immediate reason is a race condition in the method:
> {code:java}
> org.apache.ignite.internal.processors.cache.transactions.IgniteTxAdapter::markFinalizing(RECOVERY_FINISH){code}
> If 2 or more backups are configured, it may be called concurrently for the
> same transaction both from the recovery procedure:
> {code:java}
> IgniteTxManager::commitIfPrepared{code}
> and from the tx recovery request handler:
> {code:java}
> IgniteTxHandler::processCheckPreparedTxRequest{code}
> The problem occurs if the thread context is switched between reading the old
> finalization status and updating it.
>
> The problematic sequence of events is as follows (the lock will be on
> node1):
> 1. Start a cluster with 3 nodes (node0, node1, node2) and a cache with 2 backups.
> 2. On node2, start and prepare a transaction, choosing a key whose primary
> partition is stored on node2.
> 3. Kill node2.
> 4. The tx recovery procedure is started on both node0 and node1.
> 5. In scope of the recovery procedure, node0 sends a tx recovery request to node1.
> 6. The following steps are executed on node1 in two threads ("procedure",
> a system pool thread executing the tx recovery procedure, and "handler",
> a striped pool thread processing the tx recovery request sent from node0):
> - tx.finalization == NONE
> - "procedure": calls markFinalizing(RECOVERY_FINISH)
> - "handler": calls markFinalizing(RECOVERY_FINISH)
> - "procedure": gets old tx.finalization - it's NONE
> - "handler": gets old tx.finalization - it's NONE
> - "handler": updates tx.finalization - now it's RECOVERY_FINISH
> - "procedure": tries to update tx.finalization via compareAndSet and fails
> because the expected value no longer matches.
> - "procedure": stops transaction processing and does not try to commit it.
> - The transaction remains unfinished on node1.
>
> Reproducer is in the pull request.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
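The interleaving listed in step 6 of the issue description can be replayed deterministically with a small model. This is only a sketch under stated assumptions: it assumes markFinalizing(RECOVERY_FINISH) reads the old status and then calls compareAndSet with that value, as the description outlines. The names MarkFinalizingRaceSketch, TxModel, readOld and update are hypothetical and are not Ignite classes or methods. The two threads are replaced by a scripted sequence of calls so the outcome does not depend on scheduling.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical, simplified model of the race described in IGNITE-17457.
// It is NOT the Ignite implementation; all names here are illustrative.
class MarkFinalizingRaceSketch {
    enum FinalizationStatus { NONE, RECOVERY_FINISH }

    // Stands in for the transaction's finalization field.
    static final class TxModel {
        private final AtomicReference<FinalizationStatus> finalization =
            new AtomicReference<>(FinalizationStatus.NONE);

        // First half of the assumed markFinalizing: read the old status.
        FinalizationStatus readOld() {
            return finalization.get();
        }

        // Second half: succeed only if the status still equals what the caller read.
        boolean update(FinalizationStatus expected, FinalizationStatus status) {
            return finalization.compareAndSet(expected, status);
        }
    }

    // Replays the interleaving from step 6; returns {procedureMarked, handlerMarked}.
    static boolean[] replay() {
        TxModel tx = new TxModel();

        // Both "threads" read the old status before either one updates it.
        FinalizationStatus oldSeenByProcedure = tx.readOld(); // NONE
        FinalizationStatus oldSeenByHandler = tx.readOld();   // NONE

        // "handler" updates first and wins.
        boolean handlerMarked =
            tx.update(oldSeenByHandler, FinalizationStatus.RECOVERY_FINISH);

        // "procedure" still expects NONE, so its compareAndSet fails.
        boolean procedureMarked =
            tx.update(oldSeenByProcedure, FinalizationStatus.RECOVERY_FINISH);

        return new boolean[] {procedureMarked, handlerMarked};
    }

    public static void main(String[] args) {
        boolean[] res = replay();
        System.out.println("procedure marked: " + res[0]); // prints "procedure marked: false"
        System.out.println("handler marked: " + res[1]);   // prints "handler marked: true"
    }
}
```

In this model the "procedure" caller gets false even though the status it asked for is already set. Per the description, that is the point where the recovery procedure stops processing the transaction without committing it.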
[jira] [Commented] (IGNITE-17457) Cluster locks after the transaction recovery procedure if the tx primary node fail
[ https://issues.apache.org/jira/browse/IGNITE-17457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580822#comment-17580822 ]

Ignite TC Bot commented on IGNITE-17457:

{panel:title=Branch: [pull/10178/head] Base: [master] : No blockers found!|borderStyle=dashed|borderColor=#ccc|titleBGColor=#D6F7C1}{panel}
{panel:title=Branch: [pull/10178/head] Base: [master] : New Tests (1)|borderStyle=dashed|borderColor=#ccc|titleBGColor=#D6F7C1}
{color:#8b}Cache 12{color} [[tests 1|https://ci.ignite.apache.org/viewLog.html?buildId=6735088]]
* {color:#013220}IgniteCacheTestSuite12: TxRecoveryConcurrentTest.testRecoveryNotDeadLockOnNearAndPrimaryFail - PASSED{color}
{panel}
[TeamCity *-- Run :: All* Results|https://ci.ignite.apache.org/viewLog.html?buildId=6735189buildTypeId=IgniteTests24Java8_RunAll]

> Cluster locks after the transaction recovery procedure if the tx primary node fail
> --
>
>                 Key: IGNITE-17457
>                 URL: https://issues.apache.org/jira/browse/IGNITE-17457
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Sergey Korotkov
>            Assignee: Sergey Korotkov
>            Priority: Major
>          Time Spent: 8h 10m
>  Remaining Estimate: 0h
>
> An Ignite cluster may become locked (all client operations block) after the
> tx recovery procedure is executed on the tx primary node failure.
> The prepared transaction may remain uncommitted on the backup node after the
> tx recovery, so the partition exchange never completes and the cluster stays
> locked.
[jira] [Commented] (IGNITE-17457) Cluster locks after the transaction recovery procedure if the tx primary node fail
[ https://issues.apache.org/jira/browse/IGNITE-17457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17577473#comment-17577473 ]

Ignite TC Bot commented on IGNITE-17457:

{panel:title=Branch: [pull/10178/head] Base: [master] : No blockers found!|borderStyle=dashed|borderColor=#ccc|titleBGColor=#D6F7C1}{panel}
{panel:title=Branch: [pull/10178/head] Base: [master] : New Tests (1)|borderStyle=dashed|borderColor=#ccc|titleBGColor=#D6F7C1}
{color:#8b}Cache 12{color} [[tests 1|https://ci.ignite.apache.org/viewLog.html?buildId=6720371]]
* {color:#013220}IgniteCacheTestSuite12: TxRecoveryConcurrentOnPrimaryFailTest.testRecoveryNotDeadLockOnPrimaryFail - PASSED{color}
{panel}
[TeamCity *-- Run :: All* Results|https://ci.ignite.apache.org/viewLog.html?buildId=6710922buildTypeId=IgniteTests24Java8_RunAll]

> Cluster locks after the transaction recovery procedure if the tx primary node fail
> --
>
>                 Key: IGNITE-17457
>                 URL: https://issues.apache.org/jira/browse/IGNITE-17457
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Sergey Korotkov
>            Assignee: Sergey Korotkov
>            Priority: Major
>          Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> An Ignite cluster may become locked (all client operations block) after the
> tx recovery procedure is executed on the tx primary node failure.
> The prepared transaction may remain uncommitted on the backup node after the
> tx recovery, so the partition exchange never completes and the cluster stays
> locked.
[jira] [Commented] (IGNITE-17457) Cluster locks after the transaction recovery procedure if the tx primary node fail
[ https://issues.apache.org/jira/browse/IGNITE-17457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17576532#comment-17576532 ]

Ignite TC Bot commented on IGNITE-17457:

{panel:title=Branch: [pull/10178/head] Base: [master] : Possible Blockers (1)|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1}
{color:#d04437}Cache 12{color} [[tests 1|https://ci.ignite.apache.org/viewLog.html?buildId=6717474]]
* IgniteCacheTestSuite12: TxRecoveryWithConcurrentRollbackTest.testRecoveryNotDeadLockOnPrimaryFail - New test duration 96s is more than 1 minute
{panel}
{panel:title=Branch: [pull/10178/head] Base: [master] : New Tests (1)|borderStyle=dashed|borderColor=#ccc|titleBGColor=#D6F7C1}
{color:#8b}Cache 12{color} [[tests 1|https://ci.ignite.apache.org/viewLog.html?buildId=6717474]]
* {color:#013220}IgniteCacheTestSuite12: TxRecoveryWithConcurrentRollbackTest.testRecoveryNotDeadLockOnPrimaryFail - PASSED{color}
{panel}
[TeamCity *-- Run :: All* Results|https://ci.ignite.apache.org/viewLog.html?buildId=6710922buildTypeId=IgniteTests24Java8_RunAll]

> Cluster locks after the transaction recovery procedure if the tx primary node fail
> --
>
>                 Key: IGNITE-17457
>                 URL: https://issues.apache.org/jira/browse/IGNITE-17457
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Sergey Korotkov
>            Assignee: Sergey Korotkov
>            Priority: Major
>          Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> An Ignite cluster may become locked (all client operations block) after the
> tx recovery procedure is executed on the tx primary node failure.
> The prepared transaction may remain uncommitted on the backup node after the
> tx recovery, so the partition exchange never completes and the cluster stays
> locked.
[jira] [Commented] (IGNITE-17457) Cluster locks after the transaction recovery procedure if the tx primary node fail
[ https://issues.apache.org/jira/browse/IGNITE-17457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17574993#comment-17574993 ]

Ignite TC Bot commented on IGNITE-17457:

{panel:title=Branch: [pull/10178/head] Base: [master] : No blockers found!|borderStyle=dashed|borderColor=#ccc|titleBGColor=#D6F7C1}{panel}
{panel:title=Branch: [pull/10178/head] Base: [master] : New Tests (1)|borderStyle=dashed|borderColor=#ccc|titleBGColor=#D6F7C1}
{color:#8b}Cache 12{color} [[tests 1|https://ci.ignite.apache.org/viewLog.html?buildId=6710160]]
* {color:#013220}IgniteCacheTestSuite12: TxRecoveryWithConcurrentRollbackTest.testRecoveryNotDeadLockOnPrimaryFail - PASSED{color}
{panel}
[TeamCity *-- Run :: All* Results|https://ci.ignite.apache.org/viewLog.html?buildId=6710922buildTypeId=IgniteTests24Java8_RunAll]

> Cluster locks after the transaction recovery procedure if the tx primary node fail
> --
>
>                 Key: IGNITE-17457
>                 URL: https://issues.apache.org/jira/browse/IGNITE-17457
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Sergey Korotkov
>            Assignee: Sergey Korotkov
>            Priority: Major
>
> An Ignite cluster may become locked (all client operations block) after the
> tx recovery procedure is executed on the tx primary node failure.
> The prepared transaction may remain uncommitted on the backup node after the
> tx recovery, so the partition exchange never completes and the cluster stays
> locked.