[jira] [Updated] (IGNITE-22094) Add removeAll method to tx state storage
[ https://issues.apache.org/jira/browse/IGNITE-22094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Denis Chudov updated IGNITE-22094:
----------------------------------
    Fix Version/s: 3.0.0-beta2

> Add removeAll method to tx state storage
> ----------------------------------------
>
>                 Key: IGNITE-22094
>                 URL: https://issues.apache.org/jira/browse/IGNITE-22094
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Denis Chudov
>            Assignee: Denis Chudov
>            Priority: Major
>              Labels: ignite-3
>             Fix For: 3.0.0-beta2
>
> *Motivation*
> The tx state vacuum should be able to remove multiple tx states at once: all of those that meet the requirements for removal.
>
> *Definition of done*
> TxStateStorage#removeAll is added, along with corresponding tests.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
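The *Definition of done* above asks for a `TxStateStorage#removeAll` plus tests. As a rough illustration only — the class name, method signatures, and UUID keying below are assumptions made for the sketch, not the actual Ignite 3 tx state storage API — a batch removal over an in-memory store could look like this:

```java
import java.util.List;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical sketch of a tx state storage with batch removal; not the real Ignite 3 API. */
class TxStateStorageSketch {
    /** Tx states keyed by transaction id; the String value stands in for the serialized tx meta. */
    private final ConcurrentMap<UUID, String> states = new ConcurrentHashMap<>();

    void put(UUID txId, String meta) {
        states.put(txId, meta);
    }

    String get(UUID txId) {
        return states.get(txId);
    }

    /**
     * Removes all given tx states in one call, so the tx state vacuum can drop
     * every state that meets the removal criteria at once instead of one by one.
     */
    void removeAll(List<UUID> txIds) {
        txIds.forEach(states::remove);
    }

    int size() {
        return states.size();
    }
}
```

The point of the batch form is that the vacuum can first collect every transaction id that qualifies for removal and then issue a single `removeAll` call, rather than a separate remove per transaction.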
[jira] [Updated] (IGNITE-22206) Unmute disabled ItTxResourcesVacuumTest#testRecoveryAfterPersistentStateVacuumized
[ https://issues.apache.org/jira/browse/IGNITE-22206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Denis Chudov updated IGNITE-22206:
----------------------------------
    Fix Version/s: 3.0.0-beta2

> Unmute disabled ItTxResourcesVacuumTest#testRecoveryAfterPersistentStateVacuumized
> ----------------------------------------------------------------------------------
>
>                 Key: IGNITE-22206
>                 URL: https://issues.apache.org/jira/browse/IGNITE-22206
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Denis Chudov
>            Assignee: Denis Chudov
>            Priority: Major
>              Labels: ignite-3
>             Fix For: 3.0.0-beta2
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Was fixed under IGNITE-22147 but left muted for some reason.
[jira] [Updated] (IGNITE-22206) Unmute disabled ItTxResourcesVacuumTest#testRecoveryAfterPersistentStateVacuumized
[ https://issues.apache.org/jira/browse/IGNITE-22206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Denis Chudov updated IGNITE-22206:
----------------------------------
    Description: Was fixed under IGNITE-22147 but left muted for some reason.  (was: subj)

> Unmute disabled ItTxResourcesVacuumTest#testRecoveryAfterPersistentStateVacuumized
> ----------------------------------------------------------------------------------
>
>                 Key: IGNITE-22206
>                 URL: https://issues.apache.org/jira/browse/IGNITE-22206
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Denis Chudov
>            Assignee: Denis Chudov
>            Priority: Major
>              Labels: ignite-3
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Was fixed under IGNITE-22147 but left muted for some reason.
[jira] [Commented] (IGNITE-22206) Unmute disabled ItTxResourcesVacuumTest#testRecoveryAfterPersistentStateVacuumized
[ https://issues.apache.org/jira/browse/IGNITE-22206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845334#comment-17845334 ]

Denis Chudov commented on IGNITE-22206:
---------------------------------------
https://ci.ignite.apache.org/test/7228314194201887527?currentProjectId=ApacheIgnite3xGradle_Test_IntegrationTests=pull%2F3738=true

> Unmute disabled ItTxResourcesVacuumTest#testRecoveryAfterPersistentStateVacuumized
> ----------------------------------------------------------------------------------
>
>                 Key: IGNITE-22206
>                 URL: https://issues.apache.org/jira/browse/IGNITE-22206
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Denis Chudov
>            Assignee: Denis Chudov
>            Priority: Major
>              Labels: ignite-3
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> subj
[jira] [Created] (IGNITE-22206) Unmute disabled ItTxResourcesVacuumTest#testRecoveryAfterPersistentStateVacuumized
Denis Chudov created IGNITE-22206:
----------------------------------

             Summary: Unmute disabled ItTxResourcesVacuumTest#testRecoveryAfterPersistentStateVacuumized
                 Key: IGNITE-22206
                 URL: https://issues.apache.org/jira/browse/IGNITE-22206
             Project: Ignite
          Issue Type: Bug
            Reporter: Denis Chudov
            Assignee: Denis Chudov

subj
[jira] [Created] (IGNITE-22147) ItTxResourcesVacuumTest.testRecoveryAfterPersistentStateVacuumized is flaky
Denis Chudov created IGNITE-22147:
----------------------------------

             Summary: ItTxResourcesVacuumTest.testRecoveryAfterPersistentStateVacuumized is flaky
                 Key: IGNITE-22147
                 URL: https://issues.apache.org/jira/browse/IGNITE-22147
             Project: Ignite
          Issue Type: Bug
            Reporter: Denis Chudov

https://ci.ignite.apache.org/project.html?projectId=ApacheIgnite3xGradle_Test_IntegrationTests=7228314194201887527=testDetails
[jira] [Assigned] (IGNITE-22094) Add removeAll method to tx state storage
[ https://issues.apache.org/jira/browse/IGNITE-22094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Denis Chudov reassigned IGNITE-22094:
-------------------------------------
    Assignee: Denis Chudov

> Add removeAll method to tx state storage
> ----------------------------------------
>
>                 Key: IGNITE-22094
>                 URL: https://issues.apache.org/jira/browse/IGNITE-22094
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Denis Chudov
>            Assignee: Denis Chudov
>            Priority: Major
>              Labels: ignite-3
>
> *Motivation*
> The tx state vacuum should be able to remove multiple tx states at once: all of those that meet the requirements for removal.
>
> *Definition of done*
> TxStateStorage#removeAll is added, along with corresponding tests.
[jira] [Created] (IGNITE-22094) Add removeAll method to tx state storage
Denis Chudov created IGNITE-22094:
----------------------------------

             Summary: Add removeAll method to tx state storage
                 Key: IGNITE-22094
                 URL: https://issues.apache.org/jira/browse/IGNITE-22094
             Project: Ignite
          Issue Type: Improvement
            Reporter: Denis Chudov

*Motivation*
The tx state vacuum should be able to remove multiple tx states at once: all of those that meet the requirements for removal.

*Definition of done*
TxStateStorage#removeAll is added, along with corresponding tests.
[jira] [Updated] (IGNITE-22024) ItSqlClientSynchronousApiTest#runtimeErrorInDmlCausesTransactionToFail is flaky
[ https://issues.apache.org/jira/browse/IGNITE-22024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Denis Chudov updated IGNITE-22024:
----------------------------------
    Fix Version/s: 3.0.0-beta2

> ItSqlClientSynchronousApiTest#runtimeErrorInDmlCausesTransactionToFail is flaky
> -------------------------------------------------------------------------------
>
>                 Key: IGNITE-22024
>                 URL: https://issues.apache.org/jira/browse/IGNITE-22024
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Vladislav Pyatkov
>            Assignee: Denis Chudov
>            Priority: Major
>              Labels: ignite-3
>             Fix For: 3.0.0-beta2
>         Attachments: screenshot-1.png
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> h3. Motivation
> A transaction can be committed only once; this is a basic transaction guarantee. The test shows that this guarantee is violated for thin clients.
> {noformat}
> java.lang.AssertionError: Exception has not been thrown.
>     at org.apache.ignite.internal.testframework.IgniteTestUtils.assertThrowsWithCode(IgniteTestUtils.java:314)
>     at org.apache.ignite.internal.sql.api.ItSqlApiBaseTest.runtimeErrorInDmlCausesTransactionToFail(ItSqlApiBaseTest.java:648)
>     at org.apache.ignite.internal.sql.api.ItSqlClientSynchronousApiTest.runtimeErrorInDmlCausesTransactionToFail(ItSqlClientSynchronousApiTest.java:65)
>     at java.base/java.lang.reflect.Method.invoke(Method.java:566)
>     at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
>     at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
>     at java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:177)
>     at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
>     at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
>     at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
>     at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
>     at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
>     at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
>     at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
>     at java.base/java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
>     at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
>     at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
>     at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
>     at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
>     at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>     at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:497)
>     at java.base/java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:274)
>     at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
>     at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
>     at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
>     at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1654)
>     at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
>     at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
>     at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
>     at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
>     at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>     at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:497)
>     at java.base/java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:274)
>     at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1654)
>     at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
>     at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
>     at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
>     at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
>     at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>     at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:497)
>     at java.base/java.util.ArrayList.forEach(ArrayList.java:1540)
>     at java.base/java.util.ArrayList.forEach(ArrayList.java:1540)
> {noformat}
> h3. Definition of done
[jira] [Updated] (IGNITE-22024) ItSqlClientSynchronousApiTest#runtimeErrorInDmlCausesTransactionToFail is flaky
[ https://issues.apache.org/jira/browse/IGNITE-22024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Denis Chudov updated IGNITE-22024:
----------------------------------
    Reviewer: Vladislav Pyatkov

> ItSqlClientSynchronousApiTest#runtimeErrorInDmlCausesTransactionToFail is flaky
> -------------------------------------------------------------------------------
>
>                 Key: IGNITE-22024
>                 URL: https://issues.apache.org/jira/browse/IGNITE-22024
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Vladislav Pyatkov
>            Assignee: Denis Chudov
>            Priority: Major
>              Labels: ignite-3
>         Attachments: screenshot-1.png
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> h3. Motivation
> A transaction can be committed only once; this is a basic transaction guarantee. The test shows that this guarantee is violated for thin clients.
> (The quoted description includes the same "Exception has not been thrown" stack trace as the other IGNITE-22024 notifications in this digest.)
> h3. Definition of done
> Any transaction
[jira] [Assigned] (IGNITE-22024) ItSqlClientSynchronousApiTest#runtimeErrorInDmlCausesTransactionToFail is flaky
[ https://issues.apache.org/jira/browse/IGNITE-22024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Denis Chudov reassigned IGNITE-22024:
-------------------------------------
    Assignee: Denis Chudov

> ItSqlClientSynchronousApiTest#runtimeErrorInDmlCausesTransactionToFail is flaky
> -------------------------------------------------------------------------------
>
>                 Key: IGNITE-22024
>                 URL: https://issues.apache.org/jira/browse/IGNITE-22024
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Vladislav Pyatkov
>            Assignee: Denis Chudov
>            Priority: Major
>              Labels: ignite-3
>         Attachments: screenshot-1.png
>
> h3. Motivation
> A transaction can be committed only once; this is a basic transaction guarantee. The test shows that this guarantee is violated for thin clients.
> (The quoted description includes the same "Exception has not been thrown" stack trace as the other IGNITE-22024 notifications in this digest.)
> h3. Definition of done
> Any transaction operation must notify the user that the transaction is
[jira] [Updated] (IGNITE-22067) Make lease distribution more even
[ https://issues.apache.org/jira/browse/IGNITE-22067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Denis Chudov updated IGNITE-22067:
----------------------------------
    Description:
*Motivation*
Currently, if we have a cluster of 3 nodes and a zone with 5 partitions, there is a relatively high chance that some node will hold no primary replica for any partition located on it. For 10 partitions this chance is much lower, but it still exists. This shows that the distribution produced by LeaseUpdater#nextLeaseHolder is far from even.

*Definition of done*
The initial lease distribution should be made reasonably even. It may not stay even over the lifetime of the cluster, though, because leases do not have to move every time the topology changes.

*Implementation notes*
We can base the distribution on node priority: nodes holding fewer leases get higher priority. Later, this approach can be extended to calculate the node priority from user load, hot-data metrics, etc. This will help us with IGNITE-18879 as well.

> Make lease distribution more even
> ---------------------------------
>
>                 Key: IGNITE-22067
>                 URL: https://issues.apache.org/jira/browse/IGNITE-22067
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Denis Chudov
>            Priority: Major
>              Labels: ignite-3
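The implementation note above (priority driven by the current lease count per node) can be sketched in a self-contained way. `LeasePlacementSketch` and its `nextLeaseHolder` are hypothetical stand-ins for LeaseUpdater's internals, not the real code:

```java
import java.util.Collection;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Simplified model of priority-based leaseholder selection: fewer leases means higher priority. */
class LeasePlacementSketch {
    /** Number of leases currently granted to each node. */
    private final Map<String, Integer> leasesPerNode = new HashMap<>();

    LeasePlacementSketch(Collection<String> nodes) {
        nodes.forEach(n -> leasesPerNode.put(n, 0));
    }

    /** Picks the assignment node holding the fewest leases (ties broken by name for determinism). */
    String nextLeaseHolder(Collection<String> assignments) {
        String best = assignments.stream()
                .min(Comparator.comparingInt((String n) -> leasesPerNode.getOrDefault(n, 0))
                        .thenComparing(n -> n))
                .orElseThrow();
        leasesPerNode.merge(best, 1, Integer::sum);
        return best;
    }

    Map<String, Integer> distribution() {
        return leasesPerNode;
    }
}
```

With every node eligible for every partition, this greedy rule yields an even split (e.g. 6 leases over 3 nodes gives 2 each). Swapping the raw lease count for a weighted score is the natural hook for the load- and hot-data-based priorities mentioned in the ticket.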
[jira] [Updated] (IGNITE-22067) Make lease distribution more even
[ https://issues.apache.org/jira/browse/IGNITE-22067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Denis Chudov updated IGNITE-22067:
----------------------------------
    Description:
*Motivation*
Currently, if we have a cluster of 3 nodes and a zone with 5 partitions, there is a relatively high chance that some node will hold no primary replica for any partition located on it. For 10 partitions this chance is much lower, but it still exists. This shows that the distribution produced by LeaseUpdater#nextLeaseHolder is far from even.

*Definition of done*
The initial lease distribution should be made reasonably even. It may not stay even over the lifetime of the cluster, though, because leases do not have to move every time the topology changes.

*Implementation notes*
We can base the distribution on node priority: nodes holding fewer leases get higher priority. Later, this approach can be extended to calculate the node priority using user load, data

  was:
Currently, if we have a cluster of 3 nodes and a zone with 5 partitions, there is a relatively high chance that some node will hold no primary replica for any partition located on it. For 10 partitions this chance is much lower, but it still exists. This shows that the distribution produced by LeaseUpdater#nextLeaseHolder is far from even.

> Make lease distribution more even
> ---------------------------------
>
>                 Key: IGNITE-22067
>                 URL: https://issues.apache.org/jira/browse/IGNITE-22067
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Denis Chudov
>            Priority: Major
>              Labels: ignite-3
[jira] [Created] (IGNITE-22067) Make lease distribution more even
Denis Chudov created IGNITE-22067:
----------------------------------

             Summary: Make lease distribution more even
                 Key: IGNITE-22067
                 URL: https://issues.apache.org/jira/browse/IGNITE-22067
             Project: Ignite
          Issue Type: Bug
            Reporter: Denis Chudov

Currently, if we have a cluster of 3 nodes and a zone with 5 partitions, there is a relatively high chance that some node will hold no primary replica for any partition located on it. For 10 partitions this chance is much lower, but it still exists. This shows that the distribution produced by LeaseUpdater#nextLeaseHolder is far from even.
[jira] [Created] (IGNITE-22051) Rack awareness
Denis Chudov created IGNITE-22051:
----------------------------------

             Summary: Rack awareness
                 Key: IGNITE-22051
                 URL: https://issues.apache.org/jira/browse/IGNITE-22051
             Project: Ignite
          Issue Type: Epic
            Reporter: Denis Chudov

Provide a way to ensure that backups are placed in different availability zones. In GG 8 this is done via ClusterNodeAttributeAffinityBackupFilter.

Example: I have 3 copies and two AZs, az1 and az2. I need to configure rack awareness (aka AZ awareness) by telling the cluster to use the "zone" attribute of my nodes. Of the three copies, two need to be on nodes with zone=az1, and one needs to be on a node with zone=az2.
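For reference, Ignite 2.x / GG 8 ships org.apache.ignite.cache.affinity.rendezvous.ClusterNodeAttributeAffinityBackupFilter, which is plugged into the rendezvous affinity function. The snippet below is only a self-contained model of that filter's semantics, preferring backup candidates whose "zone" attribute differs from already-chosen copies and falling back when zones are exhausted; the `Node` record and `place()` helper are inventions for the sketch, not Ignite API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

/**
 * Self-contained model of an attribute-based backup filter (semantics inspired by the
 * Ignite 2.x ClusterNodeAttributeAffinityBackupFilter, but NOT the real class): a backup
 * candidate is preferred only if its "zone" attribute differs from every selected copy.
 */
class ZoneAwarePlacement {
    record Node(String name, String zone) {}

    /** Tests a candidate against the previously selected copies (primary first). */
    static final BiPredicate<Node, List<Node>> ZONE_FILTER =
            (candidate, selected) ->
                    selected.stream().noneMatch(n -> n.zone().equals(candidate.zone()));

    /** Greedily picks primary + backups, preferring candidates that pass the zone filter. */
    static List<Node> place(List<Node> candidates, int copies) {
        List<Node> selected = new ArrayList<>();
        selected.add(candidates.get(0)); // primary: first node in (simulated) rendezvous order

        for (Node n : candidates) {
            if (selected.size() == copies) break;
            if (!selected.contains(n) && ZONE_FILTER.test(n, selected)) selected.add(n);
        }
        // Best effort: if distinct zones are exhausted, fill the rest ignoring the filter.
        for (Node n : candidates) {
            if (selected.size() == copies) break;
            if (!selected.contains(n)) selected.add(n);
        }
        return selected;
    }
}
```

With nodes n1(az1), n2(az1), n3(az2) and 3 copies, the model puts the first backup in the other zone before falling back, matching the "two in az1, one in az2" layout from the epic's example.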
[jira] [Assigned] (IGNITE-22033) Replace PlacementDriver#currentLease with #getPrimaryReplica in ReadWriteTxContext#waitReadyToFinish
[ https://issues.apache.org/jira/browse/IGNITE-22033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Denis Chudov reassigned IGNITE-22033:
-------------------------------------
    Assignee: Denis Chudov

> Replace PlacementDriver#currentLease with #getPrimaryReplica in ReadWriteTxContext#waitReadyToFinish
> ----------------------------------------------------------------------------------------------------
>
>                 Key: IGNITE-22033
>                 URL: https://issues.apache.org/jira/browse/IGNITE-22033
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Denis Chudov
>            Assignee: Denis Chudov
>            Priority: Major
>              Labels: ignite-3
>         Attachments: _Integration_Tests_Module_Runner_24658_.log
>
> #currentLease can return null in the case when there is no lease information on the current node yet, while the lease may already exist on another node. This can lead to PrimaryReplicaExpiredException.
> It seems we have already seen such exceptions on TC:
> {code:java}
> Caused by: org.apache.ignite.internal.tx.impl.PrimaryReplicaExpiredException: IGN-TX-13 TraceId:2766fa1f-a00e-4c53-b556-7d06fc116229 Primary replica has expired, transaction will be rolled back: [groupId = 59_part_11, expected enlistment consistency token = 112211838526816298, commit timestamp = null, current primary replica = null]
>     at app//org.apache.ignite.internal.tx.impl.TransactionInflights$ReadWriteTxContext.waitReadyToFinish(TransactionInflights.java:271)
>     at app//org.apache.ignite.internal.tx.impl.TransactionInflights$ReadWriteTxContext.performFinish(TransactionInflights.java:229)
>     at app//org.apache.ignite.internal.tx.impl.TxManagerImpl.finish(TxManagerImpl.java:501)
>     at app//org.apache.ignite.internal.tx.impl.ReadWriteTransactionImpl.finishInternal(ReadWriteTransactionImpl.java:161)
>     at app//org.apache.ignite.internal.tx.impl.ReadWriteTransactionImpl.finish(ReadWriteTransactionImpl.java:140)
>     at app//org.apache.ignite.internal.tx.impl.IgniteAbstractTransactionImpl.commitAsync(IgniteAbstractTransactionImpl.java:98)
>     at app//org.apache.ignite.internal.sql.engine.tx.QueryTransactionWrapperImpl.commitImplicit(QueryTransactionWrapperImpl.java:46)
>     at app//org.apache.ignite.internal.sql.engine.AsyncSqlCursorImpl.lambda$closeAsync$3(AsyncSqlCursorImpl.java:132)
>     at java.base@11.0.17/java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1106)
>     at java.base@11.0.17/java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2235)
>     at app//org.apache.ignite.internal.sql.engine.AsyncSqlCursorImpl.closeAsync(AsyncSqlCursorImpl.java:132)
>     at app//org.apache.ignite.internal.sql.engine.AsyncSqlCursorImpl.lambda$requestNextAsync$2(AsyncSqlCursorImpl.java:101)
>     at java.base@11.0.17/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930)
>     at java.base@11.0.17/java.util.concurrent.CompletableFuture.uniHandleStage(CompletableFuture.java:946)
>     at java.base@11.0.17/java.util.concurrent.CompletableFuture.handle(CompletableFuture.java:2266)
>     at app//org.apache.ignite.internal.sql.engine.AsyncSqlCursorImpl.requestNextAsync(AsyncSqlCursorImpl.java:94)
>     at app//org.apache.ignite.internal.sql.api.IgniteSqlImpl.lambda$executeAsyncInternal$4(IgniteSqlImpl.java:360)
>     at java.base@11.0.17/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1072)
>     at java.base@11.0.17/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
>     at java.base@11.0.17/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2073)
>     at app//org.apache.ignite.internal.sql.engine.SqlQueryProcessor$PrefetchCallback.onPrefetchComplete(SqlQueryProcessor.java:1050)
>     at app//org.apache.ignite.internal.sql.engine.prepare.KeyValueModifyPlan.lambda$execute$3(KeyValueModifyPlan.java:141)
>     at java.base@11.0.17/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859)
>     at java.base@11.0.17/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837)
>     at java.base@11.0.17/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)
>     at app//org.apache.ignite.internal.sql.engine.exec.ExecutionContext.lambda$execute$0(ExecutionContext.java:325)
>     at app//org.apache.ignite.internal.sql.engine.exec.QueryTaskExecutorImpl.lambda$execute$0(QueryTaskExecutorImpl.java:83){code}
> Full log attached.
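The fix direction in the title, awaiting the primary replica instead of reading a possibly-absent local lease, can be illustrated with a toy model. `PlacementDriverSketch` and both method signatures are simplified assumptions for this sketch, not the real Ignite 3 placement driver API:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical placement driver whose local lease cache may lag behind the cluster. */
class PlacementDriverSketch {
    /** What this node has observed so far; may be missing leases that already exist. */
    private final ConcurrentMap<String, String> localLeaseCache = new ConcurrentHashMap<>();

    /** The authoritative, cluster-wide lease view. */
    private final ConcurrentMap<String, String> clusterLeases = new ConcurrentHashMap<>();

    /** Returns the locally known lease, or null if this node has not observed it yet. */
    String currentLease(String groupId) {
        return localLeaseCache.get(groupId);
    }

    /**
     * Resolves the primary replica from the cluster-wide view. The real method would
     * await a lease event asynchronously; here the authoritative map is queried directly.
     */
    CompletableFuture<String> getPrimaryReplica(String groupId) {
        return CompletableFuture.supplyAsync(() -> clusterLeases.get(groupId));
    }

    void grantLease(String groupId, String node) {
        clusterLeases.put(groupId, node);
    }
}
```

The failure mode in the ticket corresponds to `currentLease` returning null for a group whose lease already exists elsewhere, so the commit path wrongly concludes the primary replica expired; resolving through `getPrimaryReplica` avoids trusting the stale local cache.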
[jira] [Updated] (IGNITE-22033) Replace PlacementDriver#currentLease with #getPrimaryReplica in ReadWriteTxContext#waitReadyToFinish
[ https://issues.apache.org/jira/browse/IGNITE-22033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denis Chudov updated IGNITE-22033: -- Attachment: _Integration_Tests_Module_Runner_24658_.log > Replace PlacementDriver#currentLease with #getPrimaryReplica in > ReadWriteTxContext#waitReadyToFinish > > > Key: IGNITE-22033 > URL: https://issues.apache.org/jira/browse/IGNITE-22033 > Project: Ignite > Issue Type: Bug >Reporter: Denis Chudov >Priority: Major > Labels: ignite-3 > Attachments: _Integration_Tests_Module_Runner_24658_.log > > > #currentLease can return null in a case when there is no lease information on > the current node yet, while the lease may already exist on another node. This > can lead to > PrimaryReplicaExpiredException. > Seems that we've already seen such exceptions on TC: > > {code:java} > Caused by: org.apache.ignite.internal.tx.impl.PrimaryReplicaExpiredException: > IGN-TX-13 TraceId:2766fa1f-a00e-4c53-b556-7d06fc116229 Primary replica has > expired, transaction will be rolled back: [groupId = 59_part_11, expected > enlistment consistency token = 112211838526816298, commit timestamp = null, > current primary replica = null] > at > app//org.apache.ignite.internal.tx.impl.TransactionInflights$ReadWriteTxContext.waitReadyToFinish(TransactionInflights.java:271) > at > app//org.apache.ignite.internal.tx.impl.TransactionInflights$ReadWriteTxContext.performFinish(TransactionInflights.java:229) > at > app//org.apache.ignite.internal.tx.impl.TxManagerImpl.finish(TxManagerImpl.java:501) > at > app//org.apache.ignite.internal.tx.impl.ReadWriteTransactionImpl.finishInternal(ReadWriteTransactionImpl.java:161) > at > app//org.apache.ignite.internal.tx.impl.ReadWriteTransactionImpl.finish(ReadWriteTransactionImpl.java:140) > at > app//org.apache.ignite.internal.tx.impl.IgniteAbstractTransactionImpl.commitAsync(IgniteAbstractTransactionImpl.java:98) > at > 
app//org.apache.ignite.internal.sql.engine.tx.QueryTransactionWrapperImpl.commitImplicit(QueryTransactionWrapperImpl.java:46) > at > app//org.apache.ignite.internal.sql.engine.AsyncSqlCursorImpl.lambda$closeAsync$3(AsyncSqlCursorImpl.java:132) > at > java.base@11.0.17/java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1106) > at > java.base@11.0.17/java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2235) > at > app//org.apache.ignite.internal.sql.engine.AsyncSqlCursorImpl.closeAsync(AsyncSqlCursorImpl.java:132) > at > app//org.apache.ignite.internal.sql.engine.AsyncSqlCursorImpl.lambda$requestNextAsync$2(AsyncSqlCursorImpl.java:101) > at > java.base@11.0.17/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930) > at > java.base@11.0.17/java.util.concurrent.CompletableFuture.uniHandleStage(CompletableFuture.java:946) > at > java.base@11.0.17/java.util.concurrent.CompletableFuture.handle(CompletableFuture.java:2266) > at > app//org.apache.ignite.internal.sql.engine.AsyncSqlCursorImpl.requestNextAsync(AsyncSqlCursorImpl.java:94) > at > app//org.apache.ignite.internal.sql.api.IgniteSqlImpl.lambda$executeAsyncInternal$4(IgniteSqlImpl.java:360) > at > java.base@11.0.17/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1072) > at > java.base@11.0.17/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) > at > java.base@11.0.17/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2073) > at > app//org.apache.ignite.internal.sql.engine.SqlQueryProcessor$PrefetchCallback.onPrefetchComplete(SqlQueryProcessor.java:1050) > at > app//org.apache.ignite.internal.sql.engine.prepare.KeyValueModifyPlan.lambda$execute$3(KeyValueModifyPlan.java:141) > at > java.base@11.0.17/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) > at > 
java.base@11.0.17/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837) > at > java.base@11.0.17/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478) > at > app//org.apache.ignite.internal.sql.engine.exec.ExecutionContext.lambda$execute$0(ExecutionContext.java:325) > at > app//org.apache.ignite.internal.sql.engine.exec.QueryTaskExecutorImpl.lambda$execute$0(QueryTaskExecutorImpl.java:83){code} > Full log attached. -- This message was sent by Atlassian Jira (v8.20.10#820010)
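The race described in IGNITE-22033 above can be sketched with hypothetical stand-ins. The method names `currentLease` and `getPrimaryReplica` mirror the PlacementDriver methods mentioned in the ticket, but everything else here (the `AtomicReference` local view, the polling loop, the node name) is illustrative and not Ignite's actual implementation:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class PrimaryReplicaSketch {
    // This node's local view of the lease; null until lease info propagates here.
    static final AtomicReference<String> localLease = new AtomicReference<>();

    // Mirrors #currentLease: a non-blocking read of the local view, which may
    // be null even though a lease already exists on another node.
    static String currentLease() {
        return localLease.get();
    }

    // Mirrors #getPrimaryReplica: completes once lease info becomes available.
    // (A real implementation would subscribe to events rather than poll.)
    static CompletableFuture<String> getPrimaryReplica() {
        return CompletableFuture.supplyAsync(() -> {
            String lease;
            while ((lease = localLease.get()) == null) {
                try {
                    TimeUnit.MILLISECONDS.sleep(10);
                } catch (InterruptedException e) {
                    throw new IllegalStateException(e);
                }
            }
            return lease;
        });
    }

    public static void main(String[] args) throws Exception {
        // Lease info has not reached this node yet: a currentLease-style read
        // sees null, which is what produces the spurious
        // PrimaryReplicaExpiredException ("current primary replica = null").
        System.out.println("currentLease before propagation: " + currentLease());

        // Simulate the lease arriving from the placement driver shortly after.
        CompletableFuture.delayedExecutor(50, TimeUnit.MILLISECONDS)
                .execute(() -> localLease.set("node-1"));

        // A getPrimaryReplica-style await tolerates the propagation delay.
        System.out.println("primary = " + getPrimaryReplica().get(5, TimeUnit.SECONDS));
    }
}
```

The point of the proposed replacement is exactly this difference: awaiting the primary replica turns a transient "no lease known yet" state into a short wait instead of a rollback.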
[jira] [Updated] (IGNITE-22033) Replace PlacementDriver#currentLease with #getPrimaryReplica in ReadWriteTxContext#waitReadyToFinish
[ https://issues.apache.org/jira/browse/IGNITE-22033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denis Chudov updated IGNITE-22033: -- Description: #currentLease can return null in a case when there is no lease information on the current node yet, while the lease may already exist on another node. This can lead to PrimaryReplicaExpiredException. Seems that we've already seen such exceptions on TC: {code:java} Caused by: org.apache.ignite.internal.tx.impl.PrimaryReplicaExpiredException: IGN-TX-13 TraceId:2766fa1f-a00e-4c53-b556-7d06fc116229 Primary replica has expired, transaction will be rolled back: [groupId = 59_part_11, expected enlistment consistency token = 112211838526816298, commit timestamp = null, current primary replica = null] at app//org.apache.ignite.internal.tx.impl.TransactionInflights$ReadWriteTxContext.waitReadyToFinish(TransactionInflights.java:271) at app//org.apache.ignite.internal.tx.impl.TransactionInflights$ReadWriteTxContext.performFinish(TransactionInflights.java:229) at app//org.apache.ignite.internal.tx.impl.TxManagerImpl.finish(TxManagerImpl.java:501) at app//org.apache.ignite.internal.tx.impl.ReadWriteTransactionImpl.finishInternal(ReadWriteTransactionImpl.java:161) at app//org.apache.ignite.internal.tx.impl.ReadWriteTransactionImpl.finish(ReadWriteTransactionImpl.java:140) at app//org.apache.ignite.internal.tx.impl.IgniteAbstractTransactionImpl.commitAsync(IgniteAbstractTransactionImpl.java:98) at app//org.apache.ignite.internal.sql.engine.tx.QueryTransactionWrapperImpl.commitImplicit(QueryTransactionWrapperImpl.java:46) at app//org.apache.ignite.internal.sql.engine.AsyncSqlCursorImpl.lambda$closeAsync$3(AsyncSqlCursorImpl.java:132) at java.base@11.0.17/java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1106) at java.base@11.0.17/java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2235) at 
app//org.apache.ignite.internal.sql.engine.AsyncSqlCursorImpl.closeAsync(AsyncSqlCursorImpl.java:132) at app//org.apache.ignite.internal.sql.engine.AsyncSqlCursorImpl.lambda$requestNextAsync$2(AsyncSqlCursorImpl.java:101) at java.base@11.0.17/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930) at java.base@11.0.17/java.util.concurrent.CompletableFuture.uniHandleStage(CompletableFuture.java:946) at java.base@11.0.17/java.util.concurrent.CompletableFuture.handle(CompletableFuture.java:2266) at app//org.apache.ignite.internal.sql.engine.AsyncSqlCursorImpl.requestNextAsync(AsyncSqlCursorImpl.java:94) at app//org.apache.ignite.internal.sql.api.IgniteSqlImpl.lambda$executeAsyncInternal$4(IgniteSqlImpl.java:360) at java.base@11.0.17/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1072) at java.base@11.0.17/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) at java.base@11.0.17/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2073) at app//org.apache.ignite.internal.sql.engine.SqlQueryProcessor$PrefetchCallback.onPrefetchComplete(SqlQueryProcessor.java:1050) at app//org.apache.ignite.internal.sql.engine.prepare.KeyValueModifyPlan.lambda$execute$3(KeyValueModifyPlan.java:141) at java.base@11.0.17/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) at java.base@11.0.17/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837) at java.base@11.0.17/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478) at app//org.apache.ignite.internal.sql.engine.exec.ExecutionContext.lambda$execute$0(ExecutionContext.java:325) at app//org.apache.ignite.internal.sql.engine.exec.QueryTaskExecutorImpl.lambda$execute$0(QueryTaskExecutorImpl.java:83){code} was: #currentLease can return null in a case when there is no lease information on the current node yet, while the lease may already exist on 
another node. This can lead to PrimaryReplicaExpiredException. Seems that we've already seen such exceptions on TC: {code:java} Caused by: org.apache.ignite.internal.tx.impl.PrimaryReplicaExpiredException: IGN-TX-13 TraceId:2766fa1f-a00e-4c53-b556-7d06fc116229 Primary replica has expired, transaction will be rolled back: [groupId = 59_part_11, expected enlistment consistency token = 112211838526816298, commit timestamp = null, current primary replica = null] at app//org.apache.ignite.internal.tx.impl.TransactionInflights$ReadWriteTxContext.waitReadyToFinish(TransactionInflights.java:271) at app//org.apache.ignite.internal.tx.impl.TransactionInflights$ReadWriteTxContext.performFinish(TransactionInflights.java:229) at app//org.apache.ignite.internal.tx.impl.TxManagerImpl.finish(TxManagerImpl.java:501) at
[jira] [Updated] (IGNITE-22033) Replace PlacementDriver#currentLease with #getPrimaryReplica in ReadWriteTxContext#waitReadyToFinish
[ https://issues.apache.org/jira/browse/IGNITE-22033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denis Chudov updated IGNITE-22033: -- Description: #currentLease can return null in a case when there is no lease information on the current node yet, while the lease may already exist on another node. This can lead to PrimaryReplicaExpiredException. Seems that we've already seen such exceptions on TC: {code:java} Caused by: org.apache.ignite.internal.tx.impl.PrimaryReplicaExpiredException: IGN-TX-13 TraceId:2766fa1f-a00e-4c53-b556-7d06fc116229 Primary replica has expired, transaction will be rolled back: [groupId = 59_part_11, expected enlistment consistency token = 112211838526816298, commit timestamp = null, current primary replica = null] at app//org.apache.ignite.internal.tx.impl.TransactionInflights$ReadWriteTxContext.waitReadyToFinish(TransactionInflights.java:271) at app//org.apache.ignite.internal.tx.impl.TransactionInflights$ReadWriteTxContext.performFinish(TransactionInflights.java:229) at app//org.apache.ignite.internal.tx.impl.TxManagerImpl.finish(TxManagerImpl.java:501) at app//org.apache.ignite.internal.tx.impl.ReadWriteTransactionImpl.finishInternal(ReadWriteTransactionImpl.java:161) at app//org.apache.ignite.internal.tx.impl.ReadWriteTransactionImpl.finish(ReadWriteTransactionImpl.java:140) at app//org.apache.ignite.internal.tx.impl.IgniteAbstractTransactionImpl.commitAsync(IgniteAbstractTransactionImpl.java:98) at app//org.apache.ignite.internal.sql.engine.tx.QueryTransactionWrapperImpl.commitImplicit(QueryTransactionWrapperImpl.java:46) at app//org.apache.ignite.internal.sql.engine.AsyncSqlCursorImpl.lambda$closeAsync$3(AsyncSqlCursorImpl.java:132) at java.base@11.0.17/java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1106) at java.base@11.0.17/java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2235) at 
app//org.apache.ignite.internal.sql.engine.AsyncSqlCursorImpl.closeAsync(AsyncSqlCursorImpl.java:132) at app//org.apache.ignite.internal.sql.engine.AsyncSqlCursorImpl.lambda$requestNextAsync$2(AsyncSqlCursorImpl.java:101) at java.base@11.0.17/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930) at java.base@11.0.17/java.util.concurrent.CompletableFuture.uniHandleStage(CompletableFuture.java:946) at java.base@11.0.17/java.util.concurrent.CompletableFuture.handle(CompletableFuture.java:2266) at app//org.apache.ignite.internal.sql.engine.AsyncSqlCursorImpl.requestNextAsync(AsyncSqlCursorImpl.java:94) at app//org.apache.ignite.internal.sql.api.IgniteSqlImpl.lambda$executeAsyncInternal$4(IgniteSqlImpl.java:360) at java.base@11.0.17/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1072) at java.base@11.0.17/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) at java.base@11.0.17/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2073) at app//org.apache.ignite.internal.sql.engine.SqlQueryProcessor$PrefetchCallback.onPrefetchComplete(SqlQueryProcessor.java:1050) at app//org.apache.ignite.internal.sql.engine.prepare.KeyValueModifyPlan.lambda$execute$3(KeyValueModifyPlan.java:141) at java.base@11.0.17/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) at java.base@11.0.17/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837) at java.base@11.0.17/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478) at app//org.apache.ignite.internal.sql.engine.exec.ExecutionContext.lambda$execute$0(ExecutionContext.java:325) at app//org.apache.ignite.internal.sql.engine.exec.QueryTaskExecutorImpl.lambda$execute$0(QueryTaskExecutorImpl.java:83){code} Full log attached. 
was: #currentLease can return null in a case when there is no lease information on the current node yet, while the lease may already exist on another node. This can lead to PrimaryReplicaExpiredException. Seems that we've already seen such exceptions on TC: {code:java} Caused by: org.apache.ignite.internal.tx.impl.PrimaryReplicaExpiredException: IGN-TX-13 TraceId:2766fa1f-a00e-4c53-b556-7d06fc116229 Primary replica has expired, transaction will be rolled back: [groupId = 59_part_11, expected enlistment consistency token = 112211838526816298, commit timestamp = null, current primary replica = null] at app//org.apache.ignite.internal.tx.impl.TransactionInflights$ReadWriteTxContext.waitReadyToFinish(TransactionInflights.java:271) at app//org.apache.ignite.internal.tx.impl.TransactionInflights$ReadWriteTxContext.performFinish(TransactionInflights.java:229) at app//org.apache.ignite.internal.tx.impl.TxManagerImpl.finish(TxManagerImpl.java:501) at
[jira] [Created] (IGNITE-22033) Replace PlacementDriver#currentLease with #getPrimaryReplica in ReadWriteTxContext#waitReadyToFinish
Denis Chudov created IGNITE-22033: - Summary: Replace PlacementDriver#currentLease with #getPrimaryReplica in ReadWriteTxContext#waitReadyToFinish Key: IGNITE-22033 URL: https://issues.apache.org/jira/browse/IGNITE-22033 Project: Ignite Issue Type: Bug Reporter: Denis Chudov #currentLease can return null in a case when there is no lease information on the current node yet, while the lease may already exist on another node. This can lead to PrimaryReplicaExpiredException. Seems that we've already seen such exceptions on TC: {code:java} Caused by: org.apache.ignite.internal.tx.impl.PrimaryReplicaExpiredException: IGN-TX-13 TraceId:2766fa1f-a00e-4c53-b556-7d06fc116229 Primary replica has expired, transaction will be rolled back: [groupId = 59_part_11, expected enlistment consistency token = 112211838526816298, commit timestamp = null, current primary replica = null] at app//org.apache.ignite.internal.tx.impl.TransactionInflights$ReadWriteTxContext.waitReadyToFinish(TransactionInflights.java:271) at app//org.apache.ignite.internal.tx.impl.TransactionInflights$ReadWriteTxContext.performFinish(TransactionInflights.java:229) at app//org.apache.ignite.internal.tx.impl.TxManagerImpl.finish(TxManagerImpl.java:501) at app//org.apache.ignite.internal.tx.impl.ReadWriteTransactionImpl.finishInternal(ReadWriteTransactionImpl.java:161) at app//org.apache.ignite.internal.tx.impl.ReadWriteTransactionImpl.finish(ReadWriteTransactionImpl.java:140) at app//org.apache.ignite.internal.tx.impl.IgniteAbstractTransactionImpl.commitAsync(IgniteAbstractTransactionImpl.java:98) at app//org.apache.ignite.internal.sql.engine.tx.QueryTransactionWrapperImpl.commitImplicit(QueryTransactionWrapperImpl.java:46) at app//org.apache.ignite.internal.sql.engine.AsyncSqlCursorImpl.lambda$closeAsync$3(AsyncSqlCursorImpl.java:132) at java.base@11.0.17/java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1106) at 
java.base@11.0.17/java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2235) at app//org.apache.ignite.internal.sql.engine.AsyncSqlCursorImpl.closeAsync(AsyncSqlCursorImpl.java:132) at app//org.apache.ignite.internal.sql.engine.AsyncSqlCursorImpl.lambda$requestNextAsync$2(AsyncSqlCursorImpl.java:101) at java.base@11.0.17/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930) at java.base@11.0.17/java.util.concurrent.CompletableFuture.uniHandleStage(CompletableFuture.java:946) at java.base@11.0.17/java.util.concurrent.CompletableFuture.handle(CompletableFuture.java:2266) at app//org.apache.ignite.internal.sql.engine.AsyncSqlCursorImpl.requestNextAsync(AsyncSqlCursorImpl.java:94) at app//org.apache.ignite.internal.sql.api.IgniteSqlImpl.lambda$executeAsyncInternal$4(IgniteSqlImpl.java:360) at java.base@11.0.17/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1072) at java.base@11.0.17/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) at java.base@11.0.17/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2073) at app//org.apache.ignite.internal.sql.engine.SqlQueryProcessor$PrefetchCallback.onPrefetchComplete(SqlQueryProcessor.java:1050) at app//org.apache.ignite.internal.sql.engine.prepare.KeyValueModifyPlan.lambda$execute$3(KeyValueModifyPlan.java:141) at java.base@11.0.17/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) at java.base@11.0.17/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837) at java.base@11.0.17/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478) at app//org.apache.ignite.internal.sql.engine.exec.ExecutionContext.lambda$execute$0(ExecutionContext.java:325) at app//org.apache.ignite.internal.sql.engine.exec.QueryTaskExecutorImpl.lambda$execute$0(QueryTaskExecutorImpl.java:83){code} -- This message was sent by Atlassian Jira 
(v8.20.10#820010)
[jira] [Comment Edited] (IGNITE-20365) Add ability to intentionally change primary replica
[ https://issues.apache.org/jira/browse/IGNITE-20365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17835638#comment-17835638 ] Denis Chudov edited comment on IGNITE-20365 at 4/10/24 8:45 AM: Fixed by IGNITE-21382 . The primary replica can be changed using org.apache.ignite.internal.table.NodeUtils#transferPrimary. was (Author: denis chudov): Fixed by IGNITE-21382 . > Add ability to intentionally change primary replica > --- > > Key: IGNITE-20365 > URL: https://issues.apache.org/jira/browse/IGNITE-20365 > Project: Ignite > Issue Type: Bug >Reporter: Alexander Lapin >Assignee: Alexander Lapin >Priority: Major > Labels: ignite-3 > > Some tests, e.g. testTxStateReplicaRequestMissLeaderMiss, expect the primary > replica to be changed. Earlier, when the primary replica was collocated with the > leader, refreshAndGetLeaderWithTerm was used to change the leader and > thus the primary replica. Now that the placement driver assigns the primary replica, this is > no longer the case. All in all, some PlacementDriver#changePrimaryReplica or > similar will be useful, at least within tests. > > *Implementation Details* > +Important note:+ The lease contract prohibits intersecting leases. We don't > want to break this contract, so we will have to wait until the current lease > ends before another replica becomes primary. > There are two ways to implement this functionality - either extend > {{PlacementDriver}} in the product or change only the test code. Looks like > the second approach is not enough if we start a test ignite instance using an > {{IgniteImpl}} class. So we might need to consider extending the production > code. Moreover, such a change might become the first step towards a graceful > cluster reconfiguration. > The code that is responsible for managing leases resides in {{LeaseTracker}} > and {{LeaseUpdater}}. To make the required change, we can add a pending lease > with the start time in the future. 
We should make sure that both these > places, as well as any recovery code, account for it. Currently the next lease is > added ONLY when the current one ends. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (IGNITE-20365) Add ability to intentionally change primary replica
[ https://issues.apache.org/jira/browse/IGNITE-20365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denis Chudov resolved IGNITE-20365. --- Resolution: Duplicate Fixed by IGNITE-21382 . > Add ability to intentionally change primary replica > --- > > Key: IGNITE-20365 > URL: https://issues.apache.org/jira/browse/IGNITE-20365 > Project: Ignite > Issue Type: Bug >Reporter: Alexander Lapin >Assignee: Alexander Lapin >Priority: Major > Labels: ignite-3 > > Some tests, e.g. testTxStateReplicaRequestMissLeaderMiss, expect the primary > replica to be changed. Earlier, when the primary replica was collocated with the > leader, refreshAndGetLeaderWithTerm was used to change the leader and > thus the primary replica. Now that the placement driver assigns the primary replica, this is > no longer the case. All in all, some PlacementDriver#changePrimaryReplica or > similar will be useful, at least within tests. > > *Implementation Details* > +Important note:+ The lease contract prohibits intersecting leases. We don't > want to break this contract, so we will have to wait until the current lease > ends before another replica becomes primary. > There are two ways to implement this functionality - either extend > {{PlacementDriver}} in the product or change only the test code. Looks like > the second approach is not enough if we start a test ignite instance using an > {{IgniteImpl}} class. So we might need to consider extending the production > code. Moreover, such a change might become the first step towards a graceful > cluster reconfiguration. > The code that is responsible for managing leases resides in {{LeaseTracker}} > and {{LeaseUpdater}}. To make the required change, we can add a pending lease > with the start time in the future. We should make sure that both these > places, as well as any recovery code, account for it. Currently the next lease is > added ONLY when the current one ends. -- This message was sent by Atlassian Jira (v8.20.10#820010)
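The "no intersecting leases" invariant noted in the IGNITE-20365 description above can be sketched as a simple schedulability check. This is a hypothetical illustration only: the `Lease` shape and `canSchedule` name are invented here and do not reflect Ignite's actual `LeaseTracker`/`LeaseUpdater` code:

```java
import java.time.Instant;

public class LeaseSketch {
    // Minimal stand-in for a lease: who holds it and for what interval.
    static class Lease {
        final String holder;
        final Instant start;
        final Instant expiration;

        Lease(String holder, Instant start, Instant expiration) {
            this.holder = holder;
            this.start = start;
            this.expiration = expiration;
        }
    }

    // A pending lease may be scheduled only if it starts at or after the
    // current lease's expiration, so the two intervals never intersect.
    static boolean canSchedule(Lease current, Lease pending) {
        return !pending.start.isBefore(current.expiration);
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        Lease current = new Lease("node-A", now, now.plusSeconds(120));

        // Promoting another replica before the current lease ends would
        // violate the contract, so the transfer has to be deferred.
        Lease tooEarly = new Lease("node-B", now.plusSeconds(60), now.plusSeconds(180));
        Lease deferred = new Lease("node-B", now.plusSeconds(120), now.plusSeconds(240));

        System.out.println("tooEarly allowed: " + canSchedule(current, tooEarly));
        System.out.println("deferred allowed: " + canSchedule(current, deferred));
    }
}
```

This is why the ticket proposes a *pending* lease with a start time in the future rather than an immediate handover: the earliest legal start for the new primary is the current lease's expiration.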
[jira] [Commented] (IGNITE-21418) ItTxDistributedTestThreeNodesThreeReplicas#testDeleteUpsertAllRollback is flaky
[ https://issues.apache.org/jira/browse/IGNITE-21418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17834247#comment-17834247 ] Denis Chudov commented on IGNITE-21418: --- IGNITE-21572 is a possible reason. It is resolved, but due to the rare occurrence of this error, we need to monitor the teamcity for some time (about a month). After that, if this error is no longer reproduced, we can close this ticket. > ItTxDistributedTestThreeNodesThreeReplicas#testDeleteUpsertAllRollback is > flaky > --- > > Key: IGNITE-21418 > URL: https://issues.apache.org/jira/browse/IGNITE-21418 > Project: Ignite > Issue Type: Bug >Reporter: Alexander Lapin >Priority: Major > Labels: ignite-3 > Time Spent: 10m > Remaining Estimate: 0h > > {code:java} > java.lang.NullPointerException at > org.apache.ignite.internal.table.TxAbstractTest.testDeleteUpsertAllRollback(TxAbstractTest.java:233) > at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native > Method) at > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.base/java.lang.reflect.Method.invoke(Method.java:566) at > org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:727) > {code} > [https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_RunAllTests/7814256?expandCode+Inspection=true=true=false=true=false=true] > Flaky rate is low. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (IGNITE-20628) testDropColumn and testMergeChangesAddDropAdd in ItSchemaChangeKvViewTest are disabled
[ https://issues.apache.org/jira/browse/IGNITE-20628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17834245#comment-17834245 ] Denis Chudov commented on IGNITE-20628: --- IGNITE-21572 is resolved, but due to the rare occurrence of this error, we need to monitor the teamcity for some time (about a month). After that, if this error is no longer reproduced, we can close this ticket. > testDropColumn and testMergeChangesAddDropAdd in ItSchemaChangeKvViewTest are > disabled > -- > > Key: IGNITE-20628 > URL: https://issues.apache.org/jira/browse/IGNITE-20628 > Project: Ignite > Issue Type: Bug >Reporter: Roman Puchkovskiy >Priority: Major > Labels: ignite-3, tech-debt > Fix For: 3.0.0-beta2 > > > It was supposed that IGNITE-17931 was the culprit, but even after removing > the blocking code the tests are still flaky. > The tests fail with one of 3 symptoms: > # An NPE happens in the test method code: a value by a key for which a put > is made earlier is not found when using the same key. This is probably caused > by a transactional protocol implementation bug, maybe this: IGNITE-20116 > # A PrimaryReplicaAwaitTimeoutException > # A ReplicationTimeoutException > Items 2 and 3 need to be investigated. > h2. A stacktrace for 1 > java.lang.NullPointerException > at > org.apache.ignite.internal.runner.app.ItSchemaChangeKvViewTest.testDropColumn(ItSchemaChangeKvViewTest.java:58) > h2. 
A stacktrace for 2 > org.apache.ignite.tx.TransactionException: IGN-PLACEMENTDRIVER-1 > TraceId:0a32c369-b9ca-4091-b8de-af15d65a1f52 Failed to get the primary > replica [tablePartitionId=3_part_5, awaitTimestamp=HybridTimestamp > [time=111220884095959043, physical=1697096009765, logical=3]] > > at > org.apache.ignite.internal.util.ExceptionUtils.lambda$withCause$1(ExceptionUtils.java:400) > at > org.apache.ignite.internal.util.ExceptionUtils.withCauseInternal(ExceptionUtils.java:461) > at > org.apache.ignite.internal.util.ExceptionUtils.withCause(ExceptionUtils.java:400) > at > org.apache.ignite.internal.table.distributed.storage.InternalTableImpl.lambda$enlist$71(InternalTableImpl.java:1659) > at > java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930) > at > java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907) > at > java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) > at > java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088) > at > java.base/java.util.concurrent.CompletableFuture$Timeout.run(CompletableFuture.java:2792) > at > java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) > at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) > at > java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.base/java.lang.Thread.run(Thread.java:834) > Caused by: java.util.concurrent.CompletionException: > org.apache.ignite.internal.placementdriver.PrimaryReplicaAwaitTimeoutException: > IGN-PLACEMENTDRIVER-1 TraceId:0a32c369-b9ca-4091-b8de-af15d65a1f52 The > primary replica await timed out 
[replicationGroupId=3_part_5, > referenceTimestamp=HybridTimestamp [time=111220884095959043, > physical=1697096009765, logical=3], currentLease=Lease > [leaseholder=isckvt_tmcada_3346, accepted=false, startTime=HybridTimestamp > [time=111220884127809550, physical=1697096010251, logical=14], > expirationTime=HybridTimestamp [time=111220891992129536, > physical=1697096130251, logical=0], prolongable=false, > replicationGroupId=3_part_5]] > at > java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314) > at > java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319) > at > java.base/java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:990) > at > java.base/java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:970) > ... 9 more > Caused by: > org.apache.ignite.internal.placementdriver.PrimaryReplicaAwaitTimeoutException: > IGN-PLACEMENTDRIVER-1 TraceId:0a32c369-b9ca-4091-b8de-af15d65a1f52 The > primary replica await timed out [replicationGroupId=3_part_5, > referenceTimestamp=HybridTimestamp [time=111220884095959043, > physical=1697096009765, logical=3], currentLease=Lease > [leaseholder=isckvt_tmcada_3346, accepted=false, startTime=HybridTimestamp >
[jira] [Commented] (IGNITE-21307) Call failure handler in case of failure in WatchProcessor
[ https://issues.apache.org/jira/browse/IGNITE-21307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833934#comment-17833934 ] Denis Chudov commented on IGNITE-21307: --- [~slava.koptilin] LGTM. > Call failure handler in case of failure in WatchProcessor > - > > Key: IGNITE-21307 > URL: https://issues.apache.org/jira/browse/IGNITE-21307 > Project: Ignite > Issue Type: Improvement >Reporter: Denis Chudov >Assignee: Vyacheslav Koptilin >Priority: Major > Labels: ignite-3 > Fix For: 3.0.0-beta2 > > Time Spent: 10m > Remaining Estimate: 0h > > For the linearized watch processing, we have > WatchProcessor#notificationFuture that is rewritten for each revision > processing and meta storage safe time advance. If some watch processor > completes exceptionally, this means that no further updates will be > processed, because they need the previous updates to be processed > successfully. This is implemented in futures chaining like this: > > {code:java} > notificationFuture = notificationFuture > .thenRunAsync(() -> revisionCallback.onSafeTimeAdvanced(time), > watchExecutor) > .whenComplete((ignored, e) -> { > if (e != null) { > LOG.error("Error occurred when notifying safe time advanced > callback", e); > } > }); {code} > For now, we dont have any failure handing of exceptionally completed > notification future. It leads to the endless log records with the same > exception's stack trace, caused by meta storage safe time advances: > > {code:java} > [2024-01-16T21:42:35,515][ERROR][%isot_n_0%JRaft-FSMCaller-Disruptor-metastorage-_stripe_0-0][WatchProcessor] > Error occurred when notifying safe time advanced callback > java.util.concurrent.CompletionException: > org.apache.ignite.internal.lang.IgniteInternalException: IGN-CMN-65535 > TraceId:3877e098-6a1b-4f30-88a8-a4c13411d573 Peers are not ready > [groupId=5_part_0] > at > java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314) > ~[?:?] 
> at > java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319) > ~[?:?] > at > java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1081) > ~[?:?] > at > java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478) > ~[?:?] > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > ~[?:?] > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > ~[?:?] > at java.lang.Thread.run(Thread.java:834) [?:?] > Caused by: org.apache.ignite.internal.lang.IgniteInternalException: Peers are > not ready [groupId=5_part_0] > at > org.apache.ignite.internal.raft.RaftGroupServiceImpl.randomNode(RaftGroupServiceImpl.java:725) > ~[ignite-raft-9.0.127-SNAPSHOT.jar:?] > at > org.apache.ignite.internal.raft.RaftGroupServiceImpl.randomNode(RaftGroupServiceImpl.java:709) > ~[ignite-raft-9.0.127-SNAPSHOT.jar:?] > at > org.apache.ignite.internal.raft.RaftGroupServiceImpl.refreshLeader(RaftGroupServiceImpl.java:234) > ~[ignite-raft-9.0.127-SNAPSHOT.jar:?] > at > org.apache.ignite.internal.raft.RaftGroupServiceImpl.start(RaftGroupServiceImpl.java:190) > ~[ignite-raft-9.0.127-SNAPSHOT.jar:?] > at > org.apache.ignite.internal.raft.client.TopologyAwareRaftGroupService.start(TopologyAwareRaftGroupService.java:187) > ~[ignite-replicator-9.0.127-SNAPSHOT.jar:?] > at > org.apache.ignite.internal.raft.client.TopologyAwareRaftGroupServiceFactory.startRaftGroupService(TopologyAwareRaftGroupServiceFactory.java:73) > ~[ignite-replicator-9.0.127-SNAPSHOT.jar:?] > at > org.apache.ignite.internal.raft.Loza.startRaftGroupService(Loza.java:350) > ~[ignite-raft-9.0.127-SNAPSHOT.jar:?] > at > org.apache.ignite.internal.table.distributed.TableManager.lambda$startPartitionAndStartClient$27(TableManager.java:917) > ~[ignite-table-9.0.127-SNAPSHOT.jar:?] > at > org.apache.ignite.internal.util.IgniteUtils.inBusyLock(IgniteUtils.java:827) > ~[ignite-core-9.0.127-SNAPSHOT.jar:?] 
> at > org.apache.ignite.internal.table.distributed.TableManager.lambda$startPartitionAndStartClient$28(TableManager.java:913) > ~[ignite-table-9.0.127-SNAPSHOT.jar:?] > at > java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1072) > ~[?:?] > ... 4 more {code} > So, the node can't operate properly and just produces tons of logs. Such > nodes should be halted. > UPD: > We decided to just add invocation of {{failureProcessor.process}} in all > places in {{org.apache.ignite.internal.metastorage.server.WatchProcessor}} > where exceptions happen, like > {code:java} >
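The `{code:java}` snippet at the end of the IGNITE-21307 description above is truncated in this digest and cannot be recovered. As a separate, hypothetical sketch of the failure-handler wiring the ticket proposes (the `processFailure` method stands in for `failureProcessor.process`; the rest is illustrative):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;

public class WatchChainSketch {
    // Stand-in for the failure handler the ticket proposes to invoke.
    static final AtomicBoolean failureHandlerCalled = new AtomicBoolean();

    static void processFailure(Throwable t) {
        failureHandlerCalled.set(true);
        // t is a CompletionException wrapping the original failure.
        System.out.println("failure handler invoked: " + t.getCause().getMessage());
    }

    public static void main(String[] args) {
        CompletableFuture<Void> notificationFuture = CompletableFuture.completedFuture(null);

        // One step of the chained notification pipeline fails...
        notificationFuture = notificationFuture
                .thenRun(() -> {
                    throw new IllegalStateException("Peers are not ready");
                })
                .whenComplete((ignored, e) -> {
                    if (e != null) {
                        // ...and instead of only logging (which leaves the chain
                        // permanently broken and spams the log on every safe-time
                        // advance), the failure handler is notified so the node
                        // can be halted.
                        processFailure(e);
                    }
                });

        // Swallow the propagated exception just for this demo.
        notificationFuture.exceptionally(e -> null).join();
        System.out.println("handler called: " + failureHandlerCalled.get());
    }
}
```

The key point from the ticket: because every later notification is chained onto `notificationFuture`, one unhandled exceptional completion silently wedges all subsequent watch processing, so routing the error to a failure handler is preferable to logging it in a loop.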
[jira] [Updated] (IGNITE-21933) Fix TxStateStorage#leaseStartTime possible inconsistency with partition storage
[ https://issues.apache.org/jira/browse/IGNITE-21933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denis Chudov updated IGNITE-21933: -- Description: Tx state storage may be inconsistent with partition storage during recovery, which may corrupt data written by 1-phase txns. Lease start time should be moved to partition storage. > Fix TxStateStorage#leaseStartTime possible inconsistency with partition > storage > --- > > Key: IGNITE-21933 > URL: https://issues.apache.org/jira/browse/IGNITE-21933 > Project: Ignite > Issue Type: Bug >Reporter: Denis Chudov >Assignee: Denis Chudov >Priority: Major > Labels: ignite-3 > Time Spent: 10m > Remaining Estimate: 0h > > Tx state storage may be inconsistent with partition storage during recovery, > which may corrupt data written by 1-phase txns. Lease start time should be > moved to partition storage. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (IGNITE-21933) Fix TxStateStorage#leaseStartTime possible inconsistency with partition storage
Denis Chudov created IGNITE-21933: - Summary: Fix TxStateStorage#leaseStartTime possible inconsistency with partition storage Key: IGNITE-21933 URL: https://issues.apache.org/jira/browse/IGNITE-21933 Project: Ignite Issue Type: Bug Reporter: Denis Chudov Assignee: Denis Chudov -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (IGNITE-21868) Move the sql RO inflights handling from SqlQueryProcessor to QueryTransactionContext#getOrStartImplicit/QueryTransactionWrapper#commitImplicit
[ https://issues.apache.org/jira/browse/IGNITE-21868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denis Chudov updated IGNITE-21868: -- Description: Handling the transaction inflights in the SqlQueryProcessor is not the best option, (was: *) > Move the sql RO inflights handling from SqlQueryProcessor to > QueryTransactionContext#getOrStartImplicit/QueryTransactionWrapper#commitImplicit > -- > > Key: IGNITE-21868 > URL: https://issues.apache.org/jira/browse/IGNITE-21868 > Project: Ignite > Issue Type: Bug >Reporter: Denis Chudov >Assignee: Denis Chudov >Priority: Major > Labels: ignite-3 > > Handling the transaction inflights in the SqlQueryProcessor is not the best > option, -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (IGNITE-21868) Move the sql RO inflights handling from SqlQueryProcessor to QueryTransactionContext#getOrStartImplicit/QueryTransactionWrapper#commitImplicit
[ https://issues.apache.org/jira/browse/IGNITE-21868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denis Chudov updated IGNITE-21868: -- Description: Handling the transaction inflights in the SqlQueryProcessor is not the best option, should be moved to QueryTransactionContext and QueryTransactionWrapper (was: Handling the transaction inflights in the SqlQueryProcessor is not the best option, ) > Move the sql RO inflights handling from SqlQueryProcessor to > QueryTransactionContext#getOrStartImplicit/QueryTransactionWrapper#commitImplicit > -- > > Key: IGNITE-21868 > URL: https://issues.apache.org/jira/browse/IGNITE-21868 > Project: Ignite > Issue Type: Bug >Reporter: Denis Chudov >Assignee: Denis Chudov >Priority: Major > Labels: ignite-3 > > Handling the transaction inflights in the SqlQueryProcessor is not the best > option, should be moved to QueryTransactionContext and QueryTransactionWrapper -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (IGNITE-21382) Test ItPrimaryReplicaChoiceTest.testPrimaryChangeLongHandling is flaky
[ https://issues.apache.org/jira/browse/IGNITE-21382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denis Chudov updated IGNITE-21382: -- Reviewer: Vladislav Pyatkov > Test ItPrimaryReplicaChoiceTest.testPrimaryChangeLongHandling is flaky > -- > > Key: IGNITE-21382 > URL: https://issues.apache.org/jira/browse/IGNITE-21382 > Project: Ignite > Issue Type: Bug >Reporter: Vladislav Pyatkov >Assignee: Denis Chudov >Priority: Major > Labels: ignite-3 > Time Spent: 0.5h > Remaining Estimate: 0h > > The test fails while waiting for the primary replica change. The issue also > reproduces locally, at least once in five runs. > {code} > assertThat(primaryChangeTask, willCompleteSuccessfully()); > {code} > {noformat} > java.lang.AssertionError: java.util.concurrent.TimeoutException > at > org.apache.ignite.internal.testframework.matchers.CompletableFutureMatcher.matchesSafely(CompletableFutureMatcher.java:78) > at > org.apache.ignite.internal.testframework.matchers.CompletableFutureMatcher.matchesSafely(CompletableFutureMatcher.java:35) > at org.hamcrest.TypeSafeMatcher.matches(TypeSafeMatcher.java:67) > at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:10) > at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:6) > at > org.apache.ignite.internal.placementdriver.ItPrimaryReplicaChoiceTest.testPrimaryChangeLongHandling(ItPrimaryReplicaChoiceTest.java:179) > {noformat} > This test will be muted on TC to prevent future failures. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (IGNITE-21763) Adjust TxnResourceVacuumTask in order to vacuum persistent txn state
[ https://issues.apache.org/jira/browse/IGNITE-21763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denis Chudov reassigned IGNITE-21763: - Assignee: Denis Chudov (was: Alexander Lapin) > Adjust TxnResourceVacuumTask in order to vacuum persistent txn state > > > Key: IGNITE-21763 > URL: https://issues.apache.org/jira/browse/IGNITE-21763 > Project: Ignite > Issue Type: Improvement >Reporter: Alexander Lapin >Assignee: Denis Chudov >Priority: Major > Labels: ignite-3 > > h3. Definition of Done > * TxnResourceVacuumTask is adjusted in a way that > ** txnState is removed from txnStateVolatileMap if > {*}max{*}(cleanupCompletionTimestamp, initialVacuumObservationTimestamp) + > txnResourcesTTL < vacuumObservationTimestamp > ** If cleanupCompletionTimestamp has a value, then prior to removing the > txnState from the volatile map, the corresponding record must be removed > from the persistent txn state storage. -- This message was sent by Atlassian Jira (v8.20.10#820010)
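The removal condition above (max of the two observation timestamps plus the TTL compared against the current vacuum observation timestamp) can be sketched as follows. This is a minimal illustration only; the class and field names (`TxnState`, `eligibleForVacuum`, etc.) are hypothetical and are not the actual Ignite `TxnResourceVacuumTask` API.

```java
public class VacuumSketch {
    /** Hypothetical volatile txn-state entry; timestamps are plain longs for brevity. */
    public static class TxnState {
        public final long initialVacuumObservationTimestamp;
        public final Long cleanupCompletionTimestamp; // null until cleanup has completed

        public TxnState(long initialVacuumObservationTimestamp, Long cleanupCompletionTimestamp) {
            this.initialVacuumObservationTimestamp = initialVacuumObservationTimestamp;
            this.cleanupCompletionTimestamp = cleanupCompletionTimestamp;
        }
    }

    /**
     * Returns true if the state may be removed:
     * max(cleanupCompletionTimestamp, initialVacuumObservationTimestamp) + ttl < now.
     * When cleanup has not completed yet, only the initial observation timestamp is used.
     */
    public static boolean eligibleForVacuum(TxnState s, long ttl, long now) {
        long base = s.cleanupCompletionTimestamp == null
                ? s.initialVacuumObservationTimestamp
                : Math.max(s.cleanupCompletionTimestamp, s.initialVacuumObservationTimestamp);
        return base + ttl < now;
    }

    public static void main(String[] args) {
        // Cleanup finished long ago: 150 + 200 < 1000, so the state can be vacuumed.
        System.out.println(eligibleForVacuum(new TxnState(100, 150L), 200, 1000));
        // Cleanup finished recently: 990 + 200 >= 1000, so the state must be kept.
        System.out.println(eligibleForVacuum(new TxnState(100, 990L), 200, 1000));
    }
}
```

Note that per the second bullet, a state with a non-null cleanupCompletionTimestamp would also require deleting the persistent record before the volatile entry is dropped; that ordering is not shown here.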
[jira] [Created] (IGNITE-21868) Move the sql RO inflights handling from SqlQueryProcessor to QueryTransactionContext#getOrStartImplicit/QueryTransactionWrapper#commitImplicit
Denis Chudov created IGNITE-21868: - Summary: Move the sql RO inflights handling from SqlQueryProcessor to QueryTransactionContext#getOrStartImplicit/QueryTransactionWrapper#commitImplicit Key: IGNITE-21868 URL: https://issues.apache.org/jira/browse/IGNITE-21868 Project: Ignite Issue Type: Bug Reporter: Denis Chudov Assignee: Denis Chudov * -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (IGNITE-21861) Unexpected "Transaction is already finished" exception
[ https://issues.apache.org/jira/browse/IGNITE-21861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denis Chudov updated IGNITE-21861: -- Description: Exception in log: {code:java} [2024-03-27T01:24:46,636][WARN ][%idt_n_1%partition-operations-4][ReplicaManager] Failed to process replica request [request=ReadWriteScanRetrieveBatchReplicaRequestImpl [batchSize=512, columnsToInclude=null, commitPartitionId=TablePartitionIdMessageImpl [partitionId=17, tableId=90], coordinatorId=125b397c-0404-4dcf-a28b-625fe010ecef, enlistmentConsistencyToken=112165039282455690, exactKey=null, flags=0, full=false, groupId=92_part_7, indexToUse=null, lowerBoundPrefix=null, scanId=20361, timestampLong=112165039967305730, transactionId=018e7d82-647b-0030-63a2-6a190001, upperBoundPrefix=null]]. java.util.concurrent.CompletionException: org.apache.ignite.tx.TransactionException: IGN-TX-14 TraceId:6612dad8-4a32-4453-8af0-0139e336aad9 Transaction is already finished. at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331) ~[?:?] at java.base/java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1099) ~[?:?] at java.base/java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2235) ~[?:?] at org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.processOperationRequest(PartitionReplicaListener.java:660) ~[ignite-table-3.0.0-SNAPSHOT.jar:?] at org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.processOperationRequestWithTxRwCounter(PartitionReplicaListener.java:3860) ~[ignite-table-3.0.0-SNAPSHOT.jar:?] at org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.lambda$processRequest$5(PartitionReplicaListener.java:436) ~[ignite-table-3.0.0-SNAPSHOT.jar:?] at java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1072) ~[?:?] 
at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) [?:?] at java.base/java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:610) [?:?] at java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:649) [?:?] at java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478) [?:?] at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?] at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?] at java.base/java.lang.Thread.run(Thread.java:834) [?:?] Caused by: org.apache.ignite.tx.TransactionException: Transaction is already finished. at org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.appendTxCommand(PartitionReplicaListener.java:1937) ~[ignite-table-3.0.0-SNAPSHOT.jar:?] at org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.processOperationRequest(PartitionReplicaListener.java:659) ~[ignite-table-3.0.0-SNAPSHOT.jar:?] ... 10 more{code} It happens in PartitionReplicaListener because the local volatile tx state is null or final when trying to compute a value for txCleanupReadyFutures map: {code:java} txCleanupReadyFutures.compute(txId, (id, txOps) -> { // First check whether the transaction has already been finished. // And complete cleanupReadyFut with exception if it is the case. TxStateMeta txStateMeta = txManager.stateMeta(txId); if (txStateMeta == null || isFinalState(txStateMeta.txState())) { cleanupReadyFut.completeExceptionally(new Exception()); return txOps; } // Otherwise collect cleanupReadyFut in the transaction's futures. 
if (txOps == null) { txOps = new TxCleanupReadyFutureList(); } txOps.futures.computeIfAbsent(cmdType, type -> new HashMap<>()).put(opId, cleanupReadyFut); return txOps; }); if (cleanupReadyFut.isCompletedExceptionally()) { return failedFuture(new TransactionException(TX_ALREADY_FINISHED_ERR, "Transaction is already finished.")); }{code} The first problem is that we don't actually know the real state from this exception. The second is the exception itself, because it shouldn't happen. We shouldn't encounter a null state, because it's updated to pending just before, and it can be vacuumized only after it becomes final. A Committed state is also not possible, because we wait for all in-flights before the state transition. An Aborted state is possible here, but there should be no exception in the logs in that case. In our case, the transaction was most likely aborted because of a replication timeout exception that happened earlier (it would be nice to see a tx id in this exception as well). Full log is attached. *Definition of done:* * no TransactionException in the log in case of an aborted transaction
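The first problem described above (the generic exception hiding the real state) can be illustrated with a minimal sketch. This is not the Ignite implementation; `TxState`, `stateMeta`, and `registerOp` are hypothetical stand-ins showing how including the observed state and tx id in the message would make the log diagnosable.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

public class CleanupSketch {
    public enum TxState { PENDING, COMMITTED, ABORTED }

    public static boolean isFinalState(TxState s) {
        return s == TxState.COMMITTED || s == TxState.ABORTED;
    }

    /** Hypothetical stand-in for the volatile tx state map (TxManager#stateMeta). */
    public static final ConcurrentHashMap<String, TxState> stateMeta = new ConcurrentHashMap<>();

    /** Registers an operation; fails fast when the tx is missing or already finished. */
    public static CompletableFuture<Void> registerOp(String txId) {
        TxState state = stateMeta.get(txId);

        if (state == null || isFinalState(state)) {
            // Carry the observed state and tx id instead of a bare
            // "Transaction is already finished." message.
            return CompletableFuture.failedFuture(new IllegalStateException(
                    "Transaction is already finished [txId=" + txId + ", state=" + state + "]"));
        }

        return CompletableFuture.completedFuture(null);
    }

    public static void main(String[] args) {
        stateMeta.put("tx1", TxState.ABORTED);
        // The failed future's cause now tells us the tx was ABORTED, not just "finished".
        System.out.println(registerOp("tx1").isCompletedExceptionally());
    }
}
```

With the state embedded in the message, the Aborted case (expected, should not be logged as an error) becomes distinguishable from the null and Committed cases, which the description argues should be impossible.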
[jira] [Updated] (IGNITE-21861) Unexpected "Transaction is already finished" exception
[ https://issues.apache.org/jira/browse/IGNITE-21861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denis Chudov updated IGNITE-21861: -- Description: Exception in log: {code:java} [2024-03-27T01:24:46,636][WARN ][%idt_n_1%partition-operations-4][ReplicaManager] Failed to process replica request [request=ReadWriteScanRetrieveBatchReplicaRequestImpl [batchSize=512, columnsToInclude=null, commitPartitionId=TablePartitionIdMessageImpl [partitionId=17, tableId=90], coordinatorId=125b397c-0404-4dcf-a28b-625fe010ecef, enlistmentConsistencyToken=112165039282455690, exactKey=null, flags=0, full=false, groupId=92_part_7, indexToUse=null, lowerBoundPrefix=null, scanId=20361, timestampLong=112165039967305730, transactionId=018e7d82-647b-0030-63a2-6a190001, upperBoundPrefix=null]]. java.util.concurrent.CompletionException: org.apache.ignite.tx.TransactionException: IGN-TX-14 TraceId:6612dad8-4a32-4453-8af0-0139e336aad9 Transaction is already finished. at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331) ~[?:?] at java.base/java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1099) ~[?:?] at java.base/java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2235) ~[?:?] at org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.processOperationRequest(PartitionReplicaListener.java:660) ~[ignite-table-3.0.0-SNAPSHOT.jar:?] at org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.processOperationRequestWithTxRwCounter(PartitionReplicaListener.java:3860) ~[ignite-table-3.0.0-SNAPSHOT.jar:?] at org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.lambda$processRequest$5(PartitionReplicaListener.java:436) ~[ignite-table-3.0.0-SNAPSHOT.jar:?] at java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1072) ~[?:?] 
at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) [?:?] at java.base/java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:610) [?:?] at java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:649) [?:?] at java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478) [?:?] at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?] at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?] at java.base/java.lang.Thread.run(Thread.java:834) [?:?] Caused by: org.apache.ignite.tx.TransactionException: Transaction is already finished. at org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.appendTxCommand(PartitionReplicaListener.java:1937) ~[ignite-table-3.0.0-SNAPSHOT.jar:?] at org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.processOperationRequest(PartitionReplicaListener.java:659) ~[ignite-table-3.0.0-SNAPSHOT.jar:?] ... 10 more{code} It happens in PartitionReplicaListener because the local volatile tx state is null or final when trying to compute a value for txCleanupReadyFutures map: {code:java} txCleanupReadyFutures.compute(txId, (id, txOps) -> { // First check whether the transaction has already been finished. // And complete cleanupReadyFut with exception if it is the case. TxStateMeta txStateMeta = txManager.stateMeta(txId); if (txStateMeta == null || isFinalState(txStateMeta.txState())) { cleanupReadyFut.completeExceptionally(new Exception()); return txOps; } // Otherwise collect cleanupReadyFut in the transaction's futures. 
if (txOps == null) { txOps = new TxCleanupReadyFutureList(); } txOps.futures.computeIfAbsent(cmdType, type -> new HashMap<>()).put(opId, cleanupReadyFut); return txOps; }); if (cleanupReadyFut.isCompletedExceptionally()) { return failedFuture(new TransactionException(TX_ALREADY_FINISHED_ERR, "Transaction is already finished.")); }{code} The first problem is that we don't actually know the real state from this exception. The second is the exception itself, because it shouldn't happen. We shouldn't encounter a null state, because it's updated to pending just before, and it can be vacuumized only after it becomes final. A Committed state is also not possible, because we wait for all in-flights before the state transition. An Aborted state is possible here, but there should be no exception in the logs in that case. In our case, the transaction was most likely aborted because of a replication timeout exception that happened earlier (it would be nice to see a tx id in this exception as well). Full log is attached. *Definition of done:* * no TransactionException in the log in case of an aborted transaction
[jira] [Updated] (IGNITE-21861) Unexpected "Transaction is already finished" exception
[ https://issues.apache.org/jira/browse/IGNITE-21861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denis Chudov updated IGNITE-21861: -- Description: Exception in log: {code:java} [2024-03-27T01:24:46,636][WARN ][%idt_n_1%partition-operations-4][ReplicaManager] Failed to process replica request [request=ReadWriteScanRetrieveBatchReplicaRequestImpl [batchSize=512, columnsToInclude=null, commitPartitionId=TablePartitionIdMessageImpl [partitionId=17, tableId=90], coordinatorId=125b397c-0404-4dcf-a28b-625fe010ecef, enlistmentConsistencyToken=112165039282455690, exactKey=null, flags=0, full=false, groupId=92_part_7, indexToUse=null, lowerBoundPrefix=null, scanId=20361, timestampLong=112165039967305730, transactionId=018e7d82-647b-0030-63a2-6a190001, upperBoundPrefix=null]]. java.util.concurrent.CompletionException: org.apache.ignite.tx.TransactionException: IGN-TX-14 TraceId:6612dad8-4a32-4453-8af0-0139e336aad9 Transaction is already finished. at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331) ~[?:?] at java.base/java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1099) ~[?:?] at java.base/java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2235) ~[?:?] at org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.processOperationRequest(PartitionReplicaListener.java:660) ~[ignite-table-3.0.0-SNAPSHOT.jar:?] at org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.processOperationRequestWithTxRwCounter(PartitionReplicaListener.java:3860) ~[ignite-table-3.0.0-SNAPSHOT.jar:?] at org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.lambda$processRequest$5(PartitionReplicaListener.java:436) ~[ignite-table-3.0.0-SNAPSHOT.jar:?] at java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1072) ~[?:?] 
at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) [?:?] at java.base/java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:610) [?:?] at java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:649) [?:?] at java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478) [?:?] at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?] at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?] at java.base/java.lang.Thread.run(Thread.java:834) [?:?] Caused by: org.apache.ignite.tx.TransactionException: Transaction is already finished. at org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.appendTxCommand(PartitionReplicaListener.java:1937) ~[ignite-table-3.0.0-SNAPSHOT.jar:?] at org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.processOperationRequest(PartitionReplicaListener.java:659) ~[ignite-table-3.0.0-SNAPSHOT.jar:?] ... 10 more{code} It happens in PartitionReplicaListener because the local volatile tx state is null or final when trying to compute a value for txCleanupReadyFutures map: {code:java} txCleanupReadyFutures.compute(txId, (id, txOps) -> { // First check whether the transaction has already been finished. // And complete cleanupReadyFut with exception if it is the case. TxStateMeta txStateMeta = txManager.stateMeta(txId); if (txStateMeta == null || isFinalState(txStateMeta.txState())) { cleanupReadyFut.completeExceptionally(new Exception()); return txOps; } // Otherwise collect cleanupReadyFut in the transaction's futures. 
if (txOps == null) { txOps = new TxCleanupReadyFutureList(); } txOps.futures.computeIfAbsent(cmdType, type -> new HashMap<>()).put(opId, cleanupReadyFut); return txOps; }); if (cleanupReadyFut.isCompletedExceptionally()) { return failedFuture(new TransactionException(TX_ALREADY_FINISHED_ERR, "Transaction is already finished.")); }{code} The first problem is that we don't actually know the real state from this exception. The second is the exception itself, because it shouldn't happen. We shouldn't encounter a null state, because it's updated to pending just before, and it can be vacuumized only after it becomes final. A Committed state is also not possible, because we wait for all in-flights before the state transition. An Aborted state is possible here, but there should be no exception in the logs in that case. In our case, the transaction was most likely aborted because of a replication timeout exception that happened earlier (it would be nice to see a tx id in this exception as well). Full log is attached. *Definition of done:* * no TransactionException in the log in case of an aborted transaction
[jira] [Updated] (IGNITE-21861) Unexpected "Transaction is already finished" exception
[ https://issues.apache.org/jira/browse/IGNITE-21861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denis Chudov updated IGNITE-21861: -- Attachment: _Integration_Tests_Module_SQL_Engine_4133_.log > Unexpected "Transaction is already finished" exception > --- > > Key: IGNITE-21861 > URL: https://issues.apache.org/jira/browse/IGNITE-21861 > Project: Ignite > Issue Type: Bug >Reporter: Denis Chudov >Priority: Major > Labels: ignite-3 > Attachments: _Integration_Tests_Module_SQL_Engine_4133_.log > > > Exception in log: > {code:java} > [2024-03-27T01:24:46,636][WARN > ][%idt_n_1%partition-operations-4][ReplicaManager] Failed to process replica > request [request=ReadWriteScanRetrieveBatchReplicaRequestImpl [batchSize=512, > columnsToInclude=null, commitPartitionId=TablePartitionIdMessageImpl > [partitionId=17, tableId=90], > coordinatorId=125b397c-0404-4dcf-a28b-625fe010ecef, > enlistmentConsistencyToken=112165039282455690, exactKey=null, flags=0, > full=false, groupId=92_part_7, indexToUse=null, lowerBoundPrefix=null, > scanId=20361, timestampLong=112165039967305730, > transactionId=018e7d82-647b-0030-63a2-6a190001, upperBoundPrefix=null]]. > java.util.concurrent.CompletionException: > org.apache.ignite.tx.TransactionException: IGN-TX-14 > TraceId:6612dad8-4a32-4453-8af0-0139e336aad9 Transaction is already finished. > at > java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331) > ~[?:?] > at > java.base/java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1099) > ~[?:?] > at > java.base/java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2235) > ~[?:?] > at > org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.processOperationRequest(PartitionReplicaListener.java:660) > ~[ignite-table-3.0.0-SNAPSHOT.jar:?] 
> at > org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.processOperationRequestWithTxRwCounter(PartitionReplicaListener.java:3860) > ~[ignite-table-3.0.0-SNAPSHOT.jar:?] > at > org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.lambda$processRequest$5(PartitionReplicaListener.java:436) > ~[ignite-table-3.0.0-SNAPSHOT.jar:?] > at > java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1072) > ~[?:?] > at > java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) > [?:?] > at > java.base/java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:610) > [?:?] > at > java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:649) > [?:?] > at > java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478) > [?:?] > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > [?:?] > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > [?:?] > at java.base/java.lang.Thread.run(Thread.java:834) [?:?] > Caused by: org.apache.ignite.tx.TransactionException: Transaction is already > finished. > at > org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.appendTxCommand(PartitionReplicaListener.java:1937) > ~[ignite-table-3.0.0-SNAPSHOT.jar:?] > at > org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.processOperationRequest(PartitionReplicaListener.java:659) > ~[ignite-table-3.0.0-SNAPSHOT.jar:?] > ... 10 more{code} > It happens in PartitionReplicaListener because the local volatile tx state is > null or final when trying to compute a value for txCleanupReadyFutures map: > {code:java} > txCleanupReadyFutures.compute(txId, (id, txOps) -> { > // First check whether the transaction has already been finished. 
> // And complete cleanupReadyFut with exception if it is the case. > TxStateMeta txStateMeta = txManager.stateMeta(txId); > if (txStateMeta == null || isFinalState(txStateMeta.txState())) { > cleanupReadyFut.completeExceptionally(new Exception()); > return txOps; > } > // Otherwise collect cleanupReadyFut in the transaction's futures. > if (txOps == null) { > txOps = new TxCleanupReadyFutureList(); > } > txOps.futures.computeIfAbsent(cmdType, type -> new HashMap<>()).put(opId, > cleanupReadyFut); > return txOps; > }); > if (cleanupReadyFut.isCompletedExceptionally()) { > return failedFuture(new TransactionException(TX_ALREADY_FINISHED_ERR, > "Transaction is already finished.")); > }{code} > First problem is that we don't actually know the real state from this > exception.
[jira] [Updated] (IGNITE-21861) Unexpected "Transaction is already finished" exception
[ https://issues.apache.org/jira/browse/IGNITE-21861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denis Chudov updated IGNITE-21861: -- Description: Exception in log: {code:java} [2024-03-27T01:24:46,636][WARN ][%idt_n_1%partition-operations-4][ReplicaManager] Failed to process replica request [request=ReadWriteScanRetrieveBatchReplicaRequestImpl [batchSize=512, columnsToInclude=null, commitPartitionId=TablePartitionIdMessageImpl [partitionId=17, tableId=90], coordinatorId=125b397c-0404-4dcf-a28b-625fe010ecef, enlistmentConsistencyToken=112165039282455690, exactKey=null, flags=0, full=false, groupId=92_part_7, indexToUse=null, lowerBoundPrefix=null, scanId=20361, timestampLong=112165039967305730, transactionId=018e7d82-647b-0030-63a2-6a190001, upperBoundPrefix=null]]. java.util.concurrent.CompletionException: org.apache.ignite.tx.TransactionException: IGN-TX-14 TraceId:6612dad8-4a32-4453-8af0-0139e336aad9 Transaction is already finished. at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331) ~[?:?] at java.base/java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1099) ~[?:?] at java.base/java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2235) ~[?:?] at org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.processOperationRequest(PartitionReplicaListener.java:660) ~[ignite-table-3.0.0-SNAPSHOT.jar:?] at org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.processOperationRequestWithTxRwCounter(PartitionReplicaListener.java:3860) ~[ignite-table-3.0.0-SNAPSHOT.jar:?] at org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.lambda$processRequest$5(PartitionReplicaListener.java:436) ~[ignite-table-3.0.0-SNAPSHOT.jar:?] at java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1072) ~[?:?] 
at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) [?:?] at java.base/java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:610) [?:?] at java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:649) [?:?] at java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478) [?:?] at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?] at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?] at java.base/java.lang.Thread.run(Thread.java:834) [?:?] Caused by: org.apache.ignite.tx.TransactionException: Transaction is already finished. at org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.appendTxCommand(PartitionReplicaListener.java:1937) ~[ignite-table-3.0.0-SNAPSHOT.jar:?] at org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.processOperationRequest(PartitionReplicaListener.java:659) ~[ignite-table-3.0.0-SNAPSHOT.jar:?] ... 10 more{code} It happens in PartitionReplicaListener because the local volatile tx state is null or final when trying to compute a value for txCleanupReadyFutures map: {code:java} txCleanupReadyFutures.compute(txId, (id, txOps) -> { // First check whether the transaction has already been finished. // And complete cleanupReadyFut with exception if it is the case. TxStateMeta txStateMeta = txManager.stateMeta(txId); if (txStateMeta == null || isFinalState(txStateMeta.txState())) { cleanupReadyFut.completeExceptionally(new Exception()); return txOps; } // Otherwise collect cleanupReadyFut in the transaction's futures. 
if (txOps == null) { txOps = new TxCleanupReadyFutureList(); } txOps.futures.computeIfAbsent(cmdType, type -> new HashMap<>()).put(opId, cleanupReadyFut); return txOps; }); if (cleanupReadyFut.isCompletedExceptionally()) { return failedFuture(new TransactionException(TX_ALREADY_FINISHED_ERR, "Transaction is already finished.")); }{code} The first problem is that we don't actually know the real state from this exception. The second is the exception itself, because it shouldn't happen. We shouldn't encounter a null state, because it's updated to pending just before, and it can be vacuumized only after it becomes final. A Committed state is also not possible, because we wait for all in-flights before the state transition. An Aborted state is possible here, but there should be no exception in the logs in that case. In our case, the transaction was most likely aborted because of a replication timeout exception that happened earlier (it would be nice to see a tx id in this exception as well). Full log is attached. was: Exception in log: {code:java} [2024-03-27T01:24:46,636][WARN
[jira] [Updated] (IGNITE-21861) Unexpected "Transaction is already finished" exception
[ https://issues.apache.org/jira/browse/IGNITE-21861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denis Chudov updated IGNITE-21861: -- Description: Exception in log: {code:java} [2024-03-27T01:24:46,636][WARN ][%idt_n_1%partition-operations-4][ReplicaManager] Failed to process replica request [request=ReadWriteScanRetrieveBatchReplicaRequestImpl [batchSize=512, columnsToInclude=null, commitPartitionId=TablePartitionIdMessageImpl [partitionId=17, tableId=90], coordinatorId=125b397c-0404-4dcf-a28b-625fe010ecef, enlistmentConsistencyToken=112165039282455690, exactKey=null, flags=0, full=false, groupId=92_part_7, indexToUse=null, lowerBoundPrefix=null, scanId=20361, timestampLong=112165039967305730, transactionId=018e7d82-647b-0030-63a2-6a190001, upperBoundPrefix=null]]. java.util.concurrent.CompletionException: org.apache.ignite.tx.TransactionException: IGN-TX-14 TraceId:6612dad8-4a32-4453-8af0-0139e336aad9 Transaction is already finished. at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331) ~[?:?] at java.base/java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1099) ~[?:?] at java.base/java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2235) ~[?:?] at org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.processOperationRequest(PartitionReplicaListener.java:660) ~[ignite-table-3.0.0-SNAPSHOT.jar:?] at org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.processOperationRequestWithTxRwCounter(PartitionReplicaListener.java:3860) ~[ignite-table-3.0.0-SNAPSHOT.jar:?] at org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.lambda$processRequest$5(PartitionReplicaListener.java:436) ~[ignite-table-3.0.0-SNAPSHOT.jar:?] at java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1072) ~[?:?] 
at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) [?:?] at java.base/java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:610) [?:?] at java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:649) [?:?] at java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478) [?:?] at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?] at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?] at java.base/java.lang.Thread.run(Thread.java:834) [?:?] Caused by: org.apache.ignite.tx.TransactionException: Transaction is already finished. at org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.appendTxCommand(PartitionReplicaListener.java:1937) ~[ignite-table-3.0.0-SNAPSHOT.jar:?] at org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.processOperationRequest(PartitionReplicaListener.java:659) ~[ignite-table-3.0.0-SNAPSHOT.jar:?] ... 10 more{code} It happens in PartitionReplicaListener because the local volatile tx state is null or final when trying to compute a value for txCleanupReadyFutures map: {code:java} txCleanupReadyFutures.compute(txId, (id, txOps) -> { // First check whether the transaction has already been finished. // And complete cleanupReadyFut with exception if it is the case. TxStateMeta txStateMeta = txManager.stateMeta(txId); if (txStateMeta == null || isFinalState(txStateMeta.txState())) { cleanupReadyFut.completeExceptionally(new Exception()); return txOps; } // Otherwise collect cleanupReadyFut in the transaction's futures. 
if (txOps == null) { txOps = new TxCleanupReadyFutureList(); } txOps.futures.computeIfAbsent(cmdType, type -> new HashMap<>()).put(opId, cleanupReadyFut); return txOps; }); if (cleanupReadyFut.isCompletedExceptionally()) { return failedFuture(new TransactionException(TX_ALREADY_FINISHED_ERR, "Transaction is already finished.")); }{code} The first problem is that we don't actually know the real state from this exception. The second one is the exception itself, because it shouldn't happen: we shouldn't encounter a null state, because it's updated to pending just before, and it can be vacuumized only after it becomes final. was: Exception in log: {code:java} [2024-03-27T01:24:46,636][WARN ][%idt_n_1%partition-operations-4][ReplicaManager] Failed to process replica request [request=ReadWriteScanRetrieveBatchReplicaRequestImpl [batchSize=512, columnsToInclude=null, commitPartitionId=TablePartitionIdMessageImpl [partitionId=17, tableId=90], coordinatorId=125b397c-0404-4dcf-a28b-625fe010ecef, enlistmentConsistencyToken=112165039282455690, exactKey=null, flags=0, full=false,
[jira] [Created] (IGNITE-21861) Unexpected "Transaction is already finished" exception
Denis Chudov created IGNITE-21861: - Summary: Unexpected "Transaction is already finished" exception Key: IGNITE-21861 URL: https://issues.apache.org/jira/browse/IGNITE-21861 Project: Ignite Issue Type: Bug Reporter: Denis Chudov Exception in log: {code:java} [2024-03-27T01:24:46,636][WARN ][%idt_n_1%partition-operations-4][ReplicaManager] Failed to process replica request [request=ReadWriteScanRetrieveBatchReplicaRequestImpl [batchSize=512, columnsToInclude=null, commitPartitionId=TablePartitionIdMessageImpl [partitionId=17, tableId=90], coordinatorId=125b397c-0404-4dcf-a28b-625fe010ecef, enlistmentConsistencyToken=112165039282455690, exactKey=null, flags=0, full=false, groupId=92_part_7, indexToUse=null, lowerBoundPrefix=null, scanId=20361, timestampLong=112165039967305730, transactionId=018e7d82-647b-0030-63a2-6a190001, upperBoundPrefix=null]].java.util.concurrent.CompletionException: org.apache.ignite.tx.TransactionException: IGN-TX-14 TraceId:6612dad8-4a32-4453-8af0-0139e336aad9 Transaction is already finished. at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331) ~[?:?] at java.base/java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1099) ~[?:?] at java.base/java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2235) ~[?:?] at org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.processOperationRequest(PartitionReplicaListener.java:660) ~[ignite-table-3.0.0-SNAPSHOT.jar:?] at org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.processOperationRequestWithTxRwCounter(PartitionReplicaListener.java:3860) ~[ignite-table-3.0.0-SNAPSHOT.jar:?] at org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.lambda$processRequest$5(PartitionReplicaListener.java:436) ~[ignite-table-3.0.0-SNAPSHOT.jar:?] 
at java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1072) ~[?:?] at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) [?:?] at java.base/java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:610) [?:?] at java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:649) [?:?] at java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478) [?:?] at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?] at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?] at java.base/java.lang.Thread.run(Thread.java:834) [?:?]Caused by: org.apache.ignite.tx.TransactionException: Transaction is already finished. at org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.appendTxCommand(PartitionReplicaListener.java:1937) ~[ignite-table-3.0.0-SNAPSHOT.jar:?] at org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.processOperationRequest(PartitionReplicaListener.java:659) ~[ignite-table-3.0.0-SNAPSHOT.jar:?] ... 10 more {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
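The race described above boils down to a check-then-register step inside a `compute` callback: an operation may register its cleanup future only while the transaction state is neither missing nor final. A minimal sketch of that pattern follows; all names (`TxOpRegistry`, `enlist`, the `TxState` enum) are illustrative stand-ins, not the real `PartitionReplicaListener` API.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Simplified model of the check-then-register step in appendTxCommand.
class TxOpRegistry {
    enum TxState { PENDING, COMMITTED, ABORTED }

    private final Map<UUID, TxState> stateMap = new ConcurrentHashMap<>();
    private final Map<UUID, CompletableFuture<Void>> cleanupFutures = new ConcurrentHashMap<>();

    void setState(UUID txId, TxState state) {
        stateMap.put(txId, state);
    }

    /** Returns a failed future if the tx state is unknown (null) or already final. */
    CompletableFuture<Void> enlist(UUID txId) {
        CompletableFuture<Void> fut = new CompletableFuture<>();
        cleanupFutures.compute(txId, (id, existing) -> {
            TxState st = stateMap.get(id);
            // This is the branch that produced the unexpected error in the ticket:
            // the state was null (or final) even though the operation believed
            // the transaction to be live.
            if (st == null || st == TxState.COMMITTED || st == TxState.ABORTED) {
                fut.completeExceptionally(new IllegalStateException("Transaction is already finished."));
                return existing;
            }
            return fut; // register the cleanup-ready future
        });
        return fut;
    }
}
```

The `compute` call makes the state check and the registration atomic with respect to the map entry, but, as the ticket notes, it cannot help if the state map itself was vacuumized or never populated.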
[jira] [Updated] (IGNITE-21572) One phase transaction protocol is inconsistent in case of primary replica expirations
[ https://issues.apache.org/jira/browse/IGNITE-21572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denis Chudov updated IGNITE-21572: -- Reviewer: Alexander Lapin > One phase transaction protocol is inconsistent in case of primary replica > expirations > > > Key: IGNITE-21572 > URL: https://issues.apache.org/jira/browse/IGNITE-21572 > Project: Ignite > Issue Type: Bug >Reporter: Alexander Lapin >Assignee: Denis Chudov >Priority: Critical > Labels: ignite-3 > Time Spent: 10m > Remaining Estimate: 0h > > h3. Motivation > Consider the following scenario: > # A full (1PC) transaction tx1 starts on PrimaryReplica1 [leaseholder='X', > startTime='t1', endTime='t10'] > # Within the given 1PC transaction a two-phase operation is evaluated over key1, > e.g. replace or increment (we do not have an increment operation, however it's > easy to explain the problem with it, so let's assume that we have one). > # Within increment processing, the processor acquires a lock on key1, reads the > corresponding value and is about to write the new one. > # At this point, the PrimaryReplica lease of leaseholder='X' expires. > # Another transaction tx2 starts on the new PrimaryReplica2 [leaseholder='Y', > startTime='t11', endTime='t21']. > # Within tx2 the user also calls increment, thus also acquires the lock, reads the old > value and writes the new one. > # tx2 finishes. > # tx1 successfully writes tx1.newValue, overriding the value from tx2. > All in all, because tx2 didn't see tx1's locks (the primary was changed), > instead of (key1++)++ the transactions will finish with (key1)++, which is of > course not valid. > h3. Definition of Done > * Bug is fixed. > h3. Implementation Notes > * As a fast fix we should use 1PC (full) transactions only in case of a > one-phase operation, like put. All two-phase operations like replace, > deleteExact, etc. should be evaluated within a common 2PC transaction. 
> * Besides the fast fix, we should consider supporting invoke as a raft command > that will effectively convert read+write into an atomic operation. -- This message was sent by Atlassian Jira (v8.20.10#820010)
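The lost-update anomaly from the scenario above can be reduced to a toy, single-threaded trace. This is purely illustrative (no real replicas or locks); it just replays the interleaving in which tx2 never observes tx1's lock because the primary changed between tx1's read and write.

```java
// Toy replay of the anomaly: two increments run, but only one survives.
class LostUpdateDemo {
    static int run() {
        int key1 = 0;
        int tx1Read = key1; // tx1: read on PrimaryReplica1; its lock dies with the lease
        int tx2Read = key1; // tx2: read on PrimaryReplica2, sees no lock from tx1
        key1 = tx2Read + 1; // tx2 commits its increment
        key1 = tx1Read + 1; // tx1's stale 1PC write overrides tx2's value
        return key1;        // (key1++)++ should be 2, but the result is 1
    }
}
```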
[jira] [Commented] (IGNITE-18879) Leaseholder candidates balancing
[ https://issues.apache.org/jira/browse/IGNITE-18879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831109#comment-17831109 ] Denis Chudov commented on IGNITE-18879: --- [~yexiaowei] all nodes in the cluster are learners for the meta storage group, and the information about leases is distributed to all learners. We use the fact that if a lease is accepted, it can't be revoked, and the intervals of leases are disjoint. Hence the outdated data doesn't break anything. Speaking about #currentLease, it is used only in the context of the local node and, in fact, is just the currently known information; but the lease is unique cluster-wide within its interval, so it can't break the distributed mechanisms. > Leaseholder candidates balancing > > > Key: IGNITE-18879 > URL: https://issues.apache.org/jira/browse/IGNITE-18879 > Project: Ignite > Issue Type: Improvement >Reporter: Denis Chudov >Priority: Major > Labels: ignite-3 > > *Motivation* > Primary replicas (leaseholders) should be evenly distributed over the cluster to > balance the transactional load between nodes. As the placement driver assigns > primary replicas, balancing the primary replicas is also its responsibility. > A naive implementation of balancing should choose a node as the leaseholder > candidate in a way that preserves an even lease distribution over all nodes. In a real > cluster, it may take into account slow nodes, hot table records, etc. If > a lease candidate declines a LeaseGrantMessage from the placement driver, the > balancer should decide to choose another candidate for the given primary > replica or enforce the previously chosen one. The balancing algorithm should be > pluggable, so that we have the ability to improve/replace/compare it with > others. > *Definition of done* > An interface for the lease candidates balancer is introduced, along with a simple > implementation sustaining an even lease distribution, which is used by the placement > driver by default. No public or internal configuration is needed at this stage. 
> *Implementation notes* > The lease candidates balancer should have at least 2 methods: > - {_}get(group, ignoredNodes){_}: returns a candidate for the given group; a > node from the ignoredNodes set can't be chosen as a candidate > - {_}considerRedirectProposal(group, candidate, proposedCandidate){_}: > processes a redirect proposal for the given group provided by the given candidate > (previously chosen using the _get_ method); proposedCandidate is the alternative > candidate. Returns the candidate that should be enforced by the placement driver. -- This message was sent by Atlassian Jira (v8.20.10#820010)
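The two methods from the implementation notes can be sketched as a small interface plus the naive even-distribution strategy the ticket asks for. All types here are placeholders (groups and nodes as plain strings), not the real Ignite placement driver API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hedged sketch of the pluggable balancer contract described above.
interface LeaseCandidateBalancer {
    /** Returns a candidate for the group; nodes from ignoredNodes must not be chosen. */
    String get(String group, Set<String> ignoredNodes);

    /** Decides between the previously chosen candidate and the node it proposed instead. */
    String considerRedirectProposal(String group, String candidate, String proposedCandidate);
}

// Naive implementation preserving an even distribution: pick the node
// that currently holds the fewest leases.
class EvenDistributionBalancer implements LeaseCandidateBalancer {
    private final Map<String, Integer> leasesPerNode = new HashMap<>();

    EvenDistributionBalancer(Set<String> nodes) {
        for (String n : nodes) {
            leasesPerNode.put(n, 0);
        }
    }

    @Override
    public String get(String group, Set<String> ignoredNodes) {
        String best = null;
        for (Map.Entry<String, Integer> e : leasesPerNode.entrySet()) {
            if (ignoredNodes.contains(e.getKey())) {
                continue; // declined/unavailable nodes are never chosen
            }
            if (best == null || e.getValue() < leasesPerNode.get(best)) {
                best = e.getKey();
            }
        }
        if (best != null) {
            leasesPerNode.merge(best, 1, Integer::sum);
        }
        return best;
    }

    @Override
    public String considerRedirectProposal(String group, String candidate, String proposedCandidate) {
        // Simplest policy: accept the redirect; the declining candidate
        // knows best that it cannot serve as primary.
        return proposedCandidate;
    }
}
```

A smarter implementation could weigh node load or hot partitions behind the same interface, which is exactly why the ticket wants the algorithm pluggable.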
[jira] [Assigned] (IGNITE-21382) Test ItPrimaryReplicaChoiceTest.testPrimaryChangeLongHandling is flaky
[ https://issues.apache.org/jira/browse/IGNITE-21382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denis Chudov reassigned IGNITE-21382: - Assignee: Denis Chudov > Test ItPrimaryReplicaChoiceTest.testPrimaryChangeLongHandling is flaky > -- > > Key: IGNITE-21382 > URL: https://issues.apache.org/jira/browse/IGNITE-21382 > Project: Ignite > Issue Type: Bug >Reporter: Vladislav Pyatkov >Assignee: Denis Chudov >Priority: Major > Labels: ignite-3 > Time Spent: 0.5h > Remaining Estimate: 0h > > The test fails while waiting for the primary replica change. This issue is > also reproduced locally, at least once per five runs. > {code} > assertThat(primaryChangeTask, willCompleteSuccessfully()); > {code} > {noformat} > java.lang.AssertionError: java.util.concurrent.TimeoutException > at > org.apache.ignite.internal.testframework.matchers.CompletableFutureMatcher.matchesSafely(CompletableFutureMatcher.java:78) > at > org.apache.ignite.internal.testframework.matchers.CompletableFutureMatcher.matchesSafely(CompletableFutureMatcher.java:35) > at org.hamcrest.TypeSafeMatcher.matches(TypeSafeMatcher.java:67) > at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:10) > at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:6) > at > org.apache.ignite.internal.placementdriver.ItPrimaryReplicaChoiceTest.testPrimaryChangeLongHandling(ItPrimaryReplicaChoiceTest.java:179) > {noformat} > This test will be muted on TC to prevent future failures. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (IGNITE-21572) One phase transaction protocol is inconsistent in case of primary replica expirations
[ https://issues.apache.org/jira/browse/IGNITE-21572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denis Chudov reassigned IGNITE-21572: - Assignee: Denis Chudov > One phase transaction protocol is inconsistent in case of primary replica > expirations > > > Key: IGNITE-21572 > URL: https://issues.apache.org/jira/browse/IGNITE-21572 > Project: Ignite > Issue Type: Bug >Reporter: Alexander Lapin >Assignee: Denis Chudov >Priority: Critical > Labels: ignite-3 > > h3. Motivation > Consider the following scenario: > # A full (1PC) transaction tx1 starts on PrimaryReplica1 [leaseholder='X', > startTime='t1', endTime='t10'] > # Within the given 1PC transaction a two-phase operation is evaluated over key1, > e.g. replace or increment (we do not have an increment operation, however it's > easy to explain the problem with it, so let's assume that we have one). > # Within increment processing, the processor acquires a lock on key1, reads the > corresponding value and is about to write the new one. > # At this point, the PrimaryReplica lease of leaseholder='X' expires. > # Another transaction tx2 starts on the new PrimaryReplica2 [leaseholder='Y', > startTime='t11', endTime='t21']. > # Within tx2 the user also calls increment, thus also acquires the lock, reads the old > value and writes the new one. > # tx2 finishes. > # tx1 successfully writes tx1.newValue, overriding the value from tx2. > All in all, because tx2 didn't see tx1's locks (the primary was changed), > instead of (key1++)++ the transactions will finish with (key1)++, which is of > course not valid. > h3. Definition of Done > * Bug is fixed. > h3. Implementation Notes > * As a fast fix we should use 1PC (full) transactions only in case of a > one-phase operation, like put. All two-phase operations like replace, > deleteExact, etc. should be evaluated within a common 2PC transaction. > * Besides the fast fix, we should consider supporting invoke as a raft command > that will effectively convert read+write into an atomic operation. 
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (IGNITE-21348) Trigger the lease negotiation retry in case when the lease candidate is no more contained in assignments
[ https://issues.apache.org/jira/browse/IGNITE-21348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denis Chudov updated IGNITE-21348: -- Reviewer: Vladislav Pyatkov > Trigger the lease negotiation retry in case when the lease candidate is no > longer contained in assignments > > > Key: IGNITE-21348 > URL: https://issues.apache.org/jira/browse/IGNITE-21348 > Project: Ignite > Issue Type: Bug >Reporter: Denis Chudov >Assignee: Denis Chudov >Priority: Major > Labels: ignite-3 > Time Spent: 10m > Remaining Estimate: 0h > > On receiving the "lease granted" message, the candidate replica tries to > catch up with the actual storage state; in order to do that it makes a read index > request. But if this candidate is no longer a member of the assignments > (and the replication group), this request fails and is retried until the lease > negotiation interval is exceeded. This makes no sense because such retries will > not be successful, and the current candidate is not a good candidate anymore: > although the leaseholder need not be a part of the replication group, > preferably it should be, and should be its leader. > Assignment changes, when some of the current candidates and leaseholders are > no longer included in the new assignment set, should be detected on the placement > driver active actor, and the current lease should be revoked (if negotiation > is in progress) or not prolonged. The new negotiation will be triggered > automatically by the lease updater. > *Implementation notes* > This assignment changes detection should be done on the placement driver side, > because the events of assignment changes can be processed on different nodes > at different times, and there is already an assignments tracker as a part of > the placement driver. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (IGNITE-21712) Hybrid time is not adjusted when handling some of transaction non-replica requests
[ https://issues.apache.org/jira/browse/IGNITE-21712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824821#comment-17824821 ] Denis Chudov commented on IGNITE-21712: --- Most likely the time adjustment is not needed for FinishedTransactionsBatchMessage and TxCleanupMessage. We should think about it. > Hybrid time is not adjusted when handling some of transaction non-replica > requests > -- > > Key: IGNITE-21712 > URL: https://issues.apache.org/jira/browse/IGNITE-21712 > Project: Ignite > Issue Type: Bug >Reporter: Denis Chudov >Priority: Major > Labels: ignite-3 > > For example, TxStateResponse extends the TimestampAware interface and is a part > of the transaction flow; the hybrid time should be adjusted when handling > TxStateResponse, but it doesn't happen. > We should also check classes extending TimestampAware in order to ensure that > the timestamp is adjusted in every case. > Other message interfaces extending TimestampAware but having no time > adjustment: > {code:java} > FinishedTransactionsBatchMessage > TxCleanupMessage > TxStateResponse {code} > Also, these interfaces are unused and maybe they can be deleted: > {code:java} > TxCleanupMessageResponse > TxCleanupMessageErrorResponse > TxFinishResponse > {code} > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
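What "adjusting the hybrid time" means when a TimestampAware message arrives can be shown with a minimal hybrid-logical-clock sketch: the receiver's clock must jump forward to at least the timestamp carried by the message, so causally later events get later timestamps. This is a deliberate simplification (a single long instead of separate physical and logical parts) and not the real Ignite HybridClock implementation.

```java
// Minimal hybrid-logical-clock sketch: receive-side adjustment only.
class SimpleHybridClock {
    private long now;

    SimpleHybridClock(long start) {
        this.now = start;
    }

    /** Tick for a local event and return the new timestamp. */
    synchronized long nowLong() {
        return ++now;
    }

    /** Adjust on receiving a message carrying a remote timestamp. */
    synchronized void update(long remoteTs) {
        // Never move backwards; only jump forward past the observed timestamp.
        now = Math.max(now, remoteTs);
    }
}
```

Skipping the `update` call on receive, which is what the ticket reports for TxStateResponse and friends, lets the local clock lag behind the sender's, breaking the happens-before ordering of timestamps.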
[jira] [Commented] (IGNITE-18879) Leaseholder candidates balancing
[ https://issues.apache.org/jira/browse/IGNITE-18879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824789#comment-17824789 ] Denis Chudov commented on IGNITE-18879: --- [~yexiaowei] the current PD implementation doesn't include any heartbeats sent to PD leaders, and I am not sure that such heartbeats will be added in order to measure the load of the nodes and adjust the lease distribution. It would most likely be over-engineering, but the exact implementation is not designed yet. > Leaseholder candidates balancing > > > Key: IGNITE-18879 > URL: https://issues.apache.org/jira/browse/IGNITE-18879 > Project: Ignite > Issue Type: Improvement >Reporter: Denis Chudov >Priority: Major > Labels: ignite-3 > > *Motivation* > Primary replicas (leaseholders) should be evenly distributed over the cluster to > balance the transactional load between nodes. As the placement driver assigns > primary replicas, balancing the primary replicas is also its responsibility. > A naive implementation of balancing should choose a node as the leaseholder > candidate in a way that preserves an even lease distribution over all nodes. In a real > cluster, it may take into account slow nodes, hot table records, etc. If > a lease candidate declines a LeaseGrantMessage from the placement driver, the > balancer should decide to choose another candidate for the given primary > replica or enforce the previously chosen one. The balancing algorithm should be > pluggable, so that we have the ability to improve/replace/compare it with > others. > *Definition of done* > An interface for the lease candidates balancer is introduced, along with a simple > implementation sustaining an even lease distribution, which is used by the placement > driver by default. No public or internal configuration is needed at this stage. 
> *Implementation notes* > The lease candidates balancer should have at least 2 methods: > - {_}get(group, ignoredNodes){_}: returns a candidate for the given group; a > node from the ignoredNodes set can't be chosen as a candidate > - {_}considerRedirectProposal(group, candidate, proposedCandidate){_}: > processes a redirect proposal for the given group provided by the given candidate > (previously chosen using the _get_ method); proposedCandidate is the alternative > candidate. Returns the candidate that should be enforced by the placement driver. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (IGNITE-21382) Test ItPrimaryReplicaChoiceTest.testPrimaryChangeLongHandling is flaky
[ https://issues.apache.org/jira/browse/IGNITE-21382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824772#comment-17824772 ] Denis Chudov edited comment on IGNITE-21382 at 3/8/24 12:49 PM: The problem is that NodeUtils#transferPrimary is not completed within 30 seconds. I would propose to rewrite this method without using RaftGroupService#transferLeadership. The primary replica doesn't have to be colocated with the raft leader, and we can use this in tests. We have StopLeaseProlongationMessage that is intended to stop lease prolongation for a replica which has lost its ability to serve as a primary (at least, a preferred one), and NodeUtils#transferPrimary can be reworked in a way that it sends this message to the corresponding node that is the placement driver active actor (or, which is simpler for tests, just to every node; this message will be ignored on other nodes). The only problem is that StopLeaseProlongationMessage#redirectProposal is not handled by the placement driver correctly - this is a bug and should be fixed. After that we will obtain the ability to propose any node as the new primary and so choose the new primary deliberately. Until IGNITE-18879 is done, the LeaseUpdater chooses the proposed leaseholder every time it is present; it never enforces another node, so that possibility can be neglected. After that, IGNITE-20365 might be closed as well. was (Author: denis chudov): The problem is that NodeUtils#transferPrimary is not completed within 30 seconds. I would propose to rewrite this method without using RaftGroupService#transferLeadership. The primary replica doesn't have to be colocated with the raft leader, and we can use this in tests. 
We have StopLeaseProlongationMessage that is intended to stop lease prolongation for a replica which has lost its ability to serve as a primary (at least, a preferred one), and NodeUtils#transferPrimary can be reworked in a way that it sends this message to the corresponding node that is the placement driver active actor (or, which is simpler for tests, just to every node; this message will be ignored on other nodes). The only problem is that StopLeaseProlongationMessage#redirectProposal is not handled by the placement driver correctly - this is a bug and should be fixed. After that we will obtain the ability to propose any node as the new primary and so choose the new primary deliberately. After that, IGNITE-20365 might be closed as well. > Test ItPrimaryReplicaChoiceTest.testPrimaryChangeLongHandling is flaky > -- > > Key: IGNITE-21382 > URL: https://issues.apache.org/jira/browse/IGNITE-21382 > Project: Ignite > Issue Type: Bug >Reporter: Vladislav Pyatkov >Priority: Major > Labels: ignite-3 > Time Spent: 20m > Remaining Estimate: 0h > > The test fails while waiting for the primary replica change. This issue is > also reproduced locally, at least once per five runs. 
> {code} > assertThat(primaryChangeTask, willCompleteSuccessfully()); > {code} > {noformat} > java.lang.AssertionError: java.util.concurrent.TimeoutException > at > org.apache.ignite.internal.testframework.matchers.CompletableFutureMatcher.matchesSafely(CompletableFutureMatcher.java:78) > at > org.apache.ignite.internal.testframework.matchers.CompletableFutureMatcher.matchesSafely(CompletableFutureMatcher.java:35) > at org.hamcrest.TypeSafeMatcher.matches(TypeSafeMatcher.java:67) > at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:10) > at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:6) > at > org.apache.ignite.internal.placementdriver.ItPrimaryReplicaChoiceTest.testPrimaryChangeLongHandling(ItPrimaryReplicaChoiceTest.java:179) > {noformat} > This test will be muted on TC to prevent future failures. -- This message was sent by Atlassian Jira (v8.20.10#820010)
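The broadcast idea from the comment above, sending the stop-prolongation message to every node and letting non-actors ignore it, can be sketched as follows. All types here (`StopProlongation`, `Node`) are simplified stand-ins for the real StopLeaseProlongationMessage and placement driver machinery, not the Ignite API.

```java
// Sketch of the reworked transfer idea: broadcast a "stop lease prolongation"
// message naming the preferred new primary; only the placement driver's
// active actor acts on it, every other node ignores it.
record StopProlongation(String group, String redirectProposal) {}

class Node {
    final String name;
    final boolean activeActor;
    String enforcedCandidate; // set only on the active actor

    Node(String name, boolean activeActor) {
        this.name = name;
        this.activeActor = activeActor;
    }

    void onMessage(StopProlongation msg) {
        if (!activeActor) {
            return; // ignored everywhere except the placement driver active actor
        }
        // Honour the redirect proposal, i.e. the deliberately chosen new primary.
        enforcedCandidate = msg.redirectProposal();
    }
}
```

The appeal of the broadcast variant for tests is that the sender does not need to know which node currently hosts the active actor; correctness relies only on the actor being unique.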
[jira] [Commented] (IGNITE-21382) Test ItPrimaryReplicaChoiceTest.testPrimaryChangeLongHandling is flaky
[ https://issues.apache.org/jira/browse/IGNITE-21382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824772#comment-17824772 ] Denis Chudov commented on IGNITE-21382: --- The problem is that NodeUtils#transferPrimary is not completed within 30 seconds. I would propose to rewrite this method without using RaftGroupService#transferLeadership. The primary replica doesn't have to be colocated with the raft leader, and we can use this in tests. We have StopLeaseProlongationMessage that is intended to stop lease prolongation for a replica which has lost its ability to serve as a primary (at least, a preferred one), and NodeUtils#transferPrimary can be reworked in a way that it sends this message to the corresponding node that is the placement driver active actor (or, which is simpler for tests, just to every node; this message will be ignored on other nodes). The only problem is that StopLeaseProlongationMessage#redirectProposal is not handled by the placement driver correctly - this is a bug and should be fixed. After that we will obtain the ability to propose any node as the new primary and so choose the new primary deliberately. After that, IGNITE-20365 might be closed as well. > Test ItPrimaryReplicaChoiceTest.testPrimaryChangeLongHandling is flaky > -- > > Key: IGNITE-21382 > URL: https://issues.apache.org/jira/browse/IGNITE-21382 > Project: Ignite > Issue Type: Bug >Reporter: Vladislav Pyatkov >Priority: Major > Labels: ignite-3 > Time Spent: 20m > Remaining Estimate: 0h > > The test fails while waiting for the primary replica change. This issue is > also reproduced locally, at least once per five runs. 
> {code} > assertThat(primaryChangeTask, willCompleteSuccessfully()); > {code} > {noformat} > java.lang.AssertionError: java.util.concurrent.TimeoutException > at > org.apache.ignite.internal.testframework.matchers.CompletableFutureMatcher.matchesSafely(CompletableFutureMatcher.java:78) > at > org.apache.ignite.internal.testframework.matchers.CompletableFutureMatcher.matchesSafely(CompletableFutureMatcher.java:35) > at org.hamcrest.TypeSafeMatcher.matches(TypeSafeMatcher.java:67) > at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:10) > at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:6) > at > org.apache.ignite.internal.placementdriver.ItPrimaryReplicaChoiceTest.testPrimaryChangeLongHandling(ItPrimaryReplicaChoiceTest.java:179) > {noformat} > This test will be muted on TC to prevent future failures. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (IGNITE-21712) Hybrid time is not adjusted when handling some of transaction non-replica requests
[ https://issues.apache.org/jira/browse/IGNITE-21712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denis Chudov updated IGNITE-21712: -- Description: For example, TxStateResponse extends the TimestampAware interface and is a part of the transaction flow; the hybrid time should be adjusted when handling TxStateResponse, but it doesn't happen. We should also check classes extending TimestampAware in order to ensure that the timestamp is adjusted in every case. Other message interfaces extending TimestampAware but having no time adjustment: {code:java} FinishedTransactionsBatchMessage TxCleanupMessage TxStateResponse {code} Also, these interfaces are unused and maybe they can be deleted: {code:java} TxCleanupMessageResponse TxCleanupMessageErrorResponse TxFinishResponse {code} was: For example, TxStateResponse extends the TimestampAware interface and is a part of the transaction flow; the hybrid time should be adjusted when handling TxStateResponse, but it doesn't happen. We should also check classes extending TimestampAware in order to ensure that the timestamp is adjusted in every case. Other message interfaces extending TimestampAware but having no time adjustment: {code:java} FinishedTransactionsBatchMessage TxCleanupMessage TxStateResponse {code} Also, these interfaces are unused and maybe they can be deleted: {code:java} TxCleanupMessageResponse TxCleanupMessageErrorResponse TxFinishResponse {code} > Hybrid time is not adjusted when handling some of transaction non-replica > requests > -- > > Key: IGNITE-21712 > URL: https://issues.apache.org/jira/browse/IGNITE-21712 > Project: Ignite > Issue Type: Bug >Reporter: Denis Chudov >Priority: Major > Labels: ignite-3 > > For example, TxStateResponse extends the TimestampAware interface and is a part > of the transaction flow; the hybrid time should be adjusted when handling > TxStateResponse, but it doesn't happen. > We should also check classes extending TimestampAware in order to ensure that > the timestamp is adjusted in every case. 
> Other message interfaces extending TimestampAware but having no time > adjustment: > {code:java} > FinishedTransactionsBatchMessage > TxCleanupMessage > TxStateResponse {code} > Also, these interfaces are unused and maybe they can be deleted: > {code:java} > TxCleanupMessageResponse > TxCleanupMessageErrorResponse > TxFinishResponse > {code} > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (IGNITE-21712) Hybrid time is not adjusted when handling some of transaction non-replica requests
[ https://issues.apache.org/jira/browse/IGNITE-21712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denis Chudov updated IGNITE-21712: -- Description: For example, TxStateResponse extends the TimestampAware interface and is a part of the transaction flow; the hybrid time should be adjusted when handling TxStateResponse, but it doesn't happen. We should also check classes extending TimestampAware in order to ensure that the timestamp is adjusted in every case. Other message interfaces extending TimestampAware but having no time adjustment: {code:java} FinishedTransactionsBatchMessage TxCleanupMessage TxStateResponse {code} Also, these interfaces are unused and maybe they can be deleted: {code:java} TxCleanupMessageResponse TxCleanupMessageErrorResponse TxFinishResponse {code} was: TxStateResponse extends the TimestampAware interface and is a part of the transaction flow; the hybrid time should be adjusted when handling TxStateResponse, but it doesn't happen. We should also check classes extending TimestampAware in order to ensure that the timestamp is adjusted in every case. > Hybrid time is not adjusted when handling some of transaction non-replica > requests > -- > > Key: IGNITE-21712 > URL: https://issues.apache.org/jira/browse/IGNITE-21712 > Project: Ignite > Issue Type: Bug >Reporter: Denis Chudov >Priority: Major > Labels: ignite-3 > > For example, TxStateResponse extends the TimestampAware interface and is a part > of the transaction flow; the hybrid time should be adjusted when handling > TxStateResponse, but it doesn't happen. > We should also check classes extending TimestampAware in order to ensure that > the timestamp is adjusted in every case. 
> Other message interfaces extending TimestampAware but having no time > adjustment: > {code:java} > FinishedTransactionsBatchMessage > TxCleanupMessage > TxStateResponse {code} > Also, these interfaces are unused and maybe they can be deleted: > > {code:java} > TxCleanupMessageResponse > TxCleanupMessageErrorResponse > TxFinishResponse > {code} > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (IGNITE-21712) Hybrid time is not adjusted when handling some of transaction non-replica requests
[ https://issues.apache.org/jira/browse/IGNITE-21712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denis Chudov updated IGNITE-21712: -- Summary: Hybrid time is not adjusted when handling some of transaction non-replica requests (was: Hybrid time is not adjusted when handling TxStateResponse) > Hybrid time is not adjusted when handling some of transaction non-replica > requests > -- > > Key: IGNITE-21712 > URL: https://issues.apache.org/jira/browse/IGNITE-21712 > Project: Ignite > Issue Type: Bug >Reporter: Denis Chudov >Priority: Major > Labels: ignite-3 > > TxStateResponse extends the TimestampAware interface and is a part of the transaction > flow; the hybrid time should be adjusted when handling TxStateResponse, but it > doesn't happen. > We should also check classes extending TimestampAware in order to ensure that > the timestamp is adjusted in every case. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (IGNITE-21712) Hybrid time is not adjusted when handling TxStateResponse
[ https://issues.apache.org/jira/browse/IGNITE-21712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denis Chudov updated IGNITE-21712: -- Description: TxStateResponse extends the TimestampAware interface and is a part of the transaction flow; the hybrid time should be adjusted when handling TxStateResponse, but it doesn't happen. We should also check classes extending TimestampAware in order to ensure that the timestamp is adjusted in every case. was: TxStateResponse extends the TimestampAware interface and is a part of the transaction flow; the hybrid time should be adjusted when handling TxStateResponse, but it doesn't happen. > Hybrid time is not adjusted when handling TxStateResponse > - > > Key: IGNITE-21712 > URL: https://issues.apache.org/jira/browse/IGNITE-21712 > Project: Ignite > Issue Type: Bug >Reporter: Denis Chudov >Priority: Major > Labels: ignite-3 > > TxStateResponse extends the TimestampAware interface and is a part of the transaction > flow; the hybrid time should be adjusted when handling TxStateResponse, but it > doesn't happen. > We should also check classes extending TimestampAware in order to ensure that > the timestamp is adjusted in every case. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (IGNITE-21348) Trigger the lease negotiation retry in case when the lease candidate is no more contained in assignments
[ https://issues.apache.org/jira/browse/IGNITE-21348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denis Chudov reassigned IGNITE-21348: - Assignee: Denis Chudov > Trigger the lease negotiation retry in case when the lease candidate is no > more contained in assignments > > > Key: IGNITE-21348 > URL: https://issues.apache.org/jira/browse/IGNITE-21348 > Project: Ignite > Issue Type: Bug >Reporter: Denis Chudov >Assignee: Denis Chudov >Priority: Major > Labels: ignite-3 > > On receiving the "lease granted" message, the candidate replica tries to > catch up with the actual storage state; in order to do that it makes a read index > request. But when this candidate is no longer a member of the assignments > (and replication group), this request fails and is retried until the lease > negotiation interval expires. This makes no sense, because such retries will > not be successful, and the current candidate is not a good candidate anymore: > although the leaseholder may not be a part of the replication group, > preferably it should be, and ideally it should be its leader. > Assignment changes in which some of the current candidates and leaseholders are > no longer included in the new assignment set should be detected on the placement > driver active actor, and the current lease should be revoked (if negotiation > is in progress) or not prolonged. The new negotiation will be triggered > automatically by the lease updater. > *Implementation notes* > This assignment change detection should be done on the placement driver side, > because assignment change events can be processed on different nodes > at different times, and there is already an assignments tracker as a part of > the placement driver. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
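The detection step described above reduces to a simple decision on the placement driver's active actor: when the assignment set changes, a lease whose holder (or candidate) is no longer in the set is either revoked or simply not prolonged. A hypothetical sketch of that decision (names and the `Action` enum are illustrative, not the actual placement driver API):

```java
// Sketch of the assignment-change check described in the ticket above.
// On the placement driver, a lease candidate/holder that fell out of the new
// assignment set gets its negotiation revoked, or its lease not prolonged.
import java.util.Set;

public class LeaseCheckSketch {
    enum Action { KEEP, REVOKE, DO_NOT_PROLONG }

    static Action onAssignmentsChanged(String leaseholder, boolean negotiationInProgress, Set<String> newAssignments) {
        if (newAssignments.contains(leaseholder)) {
            return Action.KEEP; // holder is still a replication group member
        }
        // Holder fell out of the assignment set: an in-progress negotiation is
        // revoked; an accepted lease simply expires without prolongation.
        return negotiationInProgress ? Action.REVOKE : Action.DO_NOT_PROLONG;
    }

    public static void main(String[] args) {
        Set<String> assignments = Set.of("A", "C");
        System.out.println(onAssignmentsChanged("B", true, assignments));  // B left: revoke
        System.out.println(onAssignmentsChanged("A", false, assignments)); // A stays: keep
    }
}
```

The lease updater then triggers a fresh negotiation with a candidate taken from the new assignment set, which is why no explicit retry is needed here.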
[jira] [Created] (IGNITE-21712) Hybrid time is not adjusted when handling TxStateResponse
Denis Chudov created IGNITE-21712: - Summary: Hybrid time is not adjusted when handling TxStateResponse Key: IGNITE-21712 URL: https://issues.apache.org/jira/browse/IGNITE-21712 Project: Ignite Issue Type: Bug Reporter: Denis Chudov TxStateResponse extends the TimestampAware interface and is a part of the transaction flow, so the hybrid time should be adjusted when handling TxStateResponse, but this doesn't happen. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (IGNITE-21634) NPE in HeapLockManager
[ https://issues.apache.org/jira/browse/IGNITE-21634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denis Chudov reassigned IGNITE-21634: - Assignee: Denis Chudov > NPE in HeapLockManager > -- > > Key: IGNITE-21634 > URL: https://issues.apache.org/jira/browse/IGNITE-21634 > Project: Ignite > Issue Type: Bug >Reporter: Denis Chudov >Assignee: Denis Chudov >Priority: Major > Labels: ignite-3 > > {code:java} > Caused by: java.lang.NullPointerException at > org.apache.ignite.internal.tx.impl.HeapLockManager.lambda$lockState$4(HeapLockManager.java:297) > ~[main/:?] at > java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1908) > ~[?:?] at > org.apache.ignite.internal.tx.impl.HeapLockManager.lockState(HeapLockManager.java:291) > ~[main/:?] at > org.apache.ignite.internal.tx.impl.HeapLockManager.acquire(HeapLockManager.java:172) > ~[main/:?] at > org.apache.ignite.internal.table.distributed.SortedIndexLocker.lambda$locksForInsert$4(SortedIndexLocker.java:169) > ~[main/:?] at > java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1106) > ~[?:?] ... 
29 more{code} > on the line {{v.markedForRemove = false;}} > {code:java} > private LockState lockState(LockKey key) { > int h = spread(key.hashCode()); > int index = h & (slots.length - 1); > LockState[] res = new LockState[1]; > locks.compute(key, (k, v) -> { > if (v == null) { > if (empty.isEmpty()) { > res[0] = slots[index]; > } else { > v = empty.poll(); > v.markedForRemove = false; > v.key = k; > res[0] = v; > } > } else { > res[0] = v; > } > return v; > }); > return res[0]; > } {code} > The problem can be reproduced on main (71b4fb34) with the following test > (fsync should probably be turned off): > {code} > @Test > void test() { > sql("CREATE TABLE test(" > + "c1 INT PRIMARY KEY, c2 INT, c3 INT, c4 INT, c5 INT," > + "c6 INT, c7 INT, c8 INT, c9 INT, c10 INT)" > ); > for (int i = 2; i <= 10; i++) { > sql(format("CREATE INDEX c{}_idx ON test (c{})", i, i)); > } > sql("INSERT INTO test" > + " SELECT x as c1, x as c2, x as c3, x as c4, x as c5, " > + "x as c6, x as c7, x as c8, x as c9, x as c10" > + " FROM TABLE (system_range(1, 10))"); > } > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
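A likely mechanism for the NPE quoted above is a check-then-act race: `empty.isEmpty()` and `empty.poll()` are two separate operations on a concurrent queue, so another thread can drain the queue in between and `poll()` then returns null, making `v.markedForRemove = false` throw. A minimal sketch of the safer pattern, polling once and null-checking the result (simplified stand-in types, not the actual HeapLockManager code):

```java
// Sketch of the suspected race behind the NPE: isEmpty() + poll() is not
// atomic on a concurrent queue. Checking poll()'s result directly closes the
// gap. LockState here is a simplified placeholder type.
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class PollRaceSketch {
    static final class LockState { boolean markedForRemove; }

    static LockState takeReusable(Queue<LockState> empty, LockState fallback) {
        LockState v = empty.poll(); // single atomic step instead of isEmpty() then poll()
        if (v == null) {
            return fallback; // queue was (or concurrently became) empty: use the slot
        }
        v.markedForRemove = false; // safe: v is known non-null here
        return v;
    }

    public static void main(String[] args) {
        Queue<LockState> empty = new ConcurrentLinkedQueue<>();
        LockState fallback = new LockState();
        System.out.println(takeReusable(empty, fallback) == fallback); // empty queue: fallback
        empty.add(new LockState());
        System.out.println(takeReusable(empty, fallback) == fallback); // reused entry instead
    }
}
```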
[jira] [Assigned] (IGNITE-21641) OOM in PartitionReplicaListenerTest
[ https://issues.apache.org/jira/browse/IGNITE-21641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denis Chudov reassigned IGNITE-21641: - Assignee: Denis Chudov > OOM in PartitionReplicaListenerTest > --- > > Key: IGNITE-21641 > URL: https://issues.apache.org/jira/browse/IGNITE-21641 > Project: Ignite > Issue Type: Bug >Reporter: Mirza Aliev >Assignee: Denis Chudov >Priority: Major > Labels: ignite-3 > Attachments: image-2024-03-01-12-22-32-053.png, > image-2024-03-01-20-36-08-577.png > > > TC run failed with OOM > Problem occurred after > PartitionReplicaListenerTest.testReadOnlyGetAfterRowRewrite run, > {noformat} > [2024-03-01T05:12:50,629][INFO ][Test worker][PartitionReplicaListenerTest] > >>> Starting test: > PartitionReplicaListenerTest#testReadOnlyGetAfterRowRewrite, displayName: > [14] true, true, false, true > [2024-03-01T05:12:50,629][INFO ][Test worker][PartitionReplicaListenerTest] > workDir: > build/work/PartitionReplicaListenerTest/testReadOnlyGetAfterRowRewrite_33496469368142283 > [2024-03-01T05:12:50,638][INFO ][Test worker][PartitionReplicaListenerTest] > >>> Stopping test: > PartitionReplicaListenerTest#testReadOnlyGetAfterRowRewrite, displayName: > [14] true, true, false, true, cost: 8ms. 
> [05:12:50] : [testReadOnlyGetAfterRowRewrite(boolean, > boolean, boolean, boolean)] > org.apache.ignite.internal.table.distributed.replication.PartitionReplicaListenerTest.testReadOnlyGetAfterRowRewrite([15] > true, true, true, false) (10m:22s) > [05:12:50] : [:ignite-table:test] PartitionReplicaListenerTest > > testReadOnlyGetAfterRowRewrite(boolean, boolean, boolean, boolean) > [15] > true, true, true, false STANDARD_OUT > [05:12:50] : [:ignite-table:test] > [2024-03-01T05:12:50,648][INFO ][Test worker][PartitionReplicaListenerTest] > >>> Starting test: > PartitionReplicaListenerTest#testReadOnlyGetAfterRowRewrite, displayName: > [15] true, true, true, false > [05:12:50] : [:ignite-table:test] > [2024-03-01T05:12:50,648][INFO ][Test worker][PartitionReplicaListenerTest] > workDir: > build/work/PartitionReplicaListenerTest/testReadOnlyGetAfterRowRewrite_33496469386328241 > [05:18:42] : [:ignite-table:test] java.lang.OutOfMemoryError: Java > heap space > [05:18:42] : [:ignite-table:test] Dumping heap to > java_pid2349600.hprof ... > [05:19:06] : [:ignite-table:test] Heap dump file created > [3645526743 bytes in 24.038 secs] > {noformat} > https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_RunAllTests/7898564?hideTestsFromDependencies=false=false+Inspection=true=true=true=false > After analysing the heap dump, it appears that the cause of the OOM is a problem with > Mockito. > !image-2024-03-01-12-22-32-053.png! > We need to investigate the root cause of the Mockito problem -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (IGNITE-21633) Get rid of RemotelyTriggeredResourceRegistry#remoteHostsToResources
[ https://issues.apache.org/jira/browse/IGNITE-21633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denis Chudov updated IGNITE-21633: -- Description: *Motivation* RemotelyTriggeredResourceRegistry has an API that allows closing resources using the following parameters: * #close(UUID contextId) * #close(FullyQualifiedResourceId resourceId) * #close(String remoteHostId) - used when the remote host is no longer in the topology and we can close all resources that it has triggered, because they are no longer needed In IGNITE-21293 a map _RemotelyTriggeredResourceRegistry#remoteHostsToResource_ was added which, in fact, groups the resources by remote host; this is needed to implement the last method without iterating over all resources. The main map ({_}#resources{_}, an ordered map of FullyQualifiedResourceId to RemotelyTriggeredResource objects), which is used to store the resources, cannot provide resources by remote host id, because FullyQualifiedResourceId does not contain the remote host id. The context id is included in the FullyQualifiedResourceId, but the transaction id (which is the contextId in the case of a cursor resource) does not contain a node identifier, only an integer hash code of the coordinator node name. *Definition of done* The _RemotelyTriggeredResourceRegistry#remoteHostsToResource_ map is removed. *Implementation notes* We can change the transaction id generation to replace the node name hash with the order in which the node joined the cluster; then we will be able to determine the transaction coordinator from the transaction id alone. This will also require postulating that context id generation for every type of resource follows this rule. After that we will be able to get a submap of resources created by some node from the _#resources_ map (FullyQualifiedResourceId to RemotelyTriggeredResource object). 
As one possible implementation, to get all resources triggered by nodes that are no longer in the topology, we can iterate over the currently online nodes (in their join order) and take a submap of resources belonging to the space between each two of them. As the number of nodes is significantly less than the number of resources, this operation should be more efficient than iterating over the whole map. For example: * there were 3 nodes: A (join order 0), B (join order 1), C (join order 2); * node B left the topology; * there are 1000 resources, 200 of them created by A, 500 by B and 300 by C; * iterating over the pairs of existing nodes yields the following intervals: (MIN_ORDER; 0) - submap is empty, (0; 2) - submap includes the 500 resources created by B, (2; MAX_ORDER) - submap is empty. was: *Motivation* RemotelyTriggeredResourceRegistry has an API that allows closing the resources using following parameters: * #close(UUID contextId) * #close(FullyQualifiedResourceId resourceId) * #close(String remoteHostId) - is used when the remote host is no longer in topology and we can close all resources that it has triggered, because they are no longer needed In IGNITE-21293 there was added a map _RemotelyTriggeredResourceRegistry#remoteHostsToResource_ which, in fact, is grouping the resources by remote hosts which is needed to implement the last method without iterating over all resources. The main map ({_}#resources{_}, ordered map of FullyQualifiedResourceId to RemotelyTriggeredResource object) which is used to store the resources is not able to provide some resources by remote host id, because FullyQualifiedResourceId does not contain the remote host id. The context id is included into the FullyQualifiedResourceId , but the transaction id (which is contextId in case of cursor resource) does not contain node identifier, only an integer hash code of the coordinator node name. 
*Definition of done* The _RemotelyTriggeredResourceRegistry#remoteHostsToResource_ is removed. *Implementation notes* We can change the transaction id generation to replace the node name hash with the order in which the node joined the cluster, then we will be able to evaluate the transaction coordinator having only the transaction id. This will also require to postulate that context id generation for every type of resources should follow this rule. After that we will be able to get a submap of resources created by some node from _#resources_ map (FullyQualifiedResourceId to RemotelyTriggeredResource object). To get all resources triggered by the nodes that are no longer in topology, we can iterate over the currently online nodes (their order in which they joined) and get a submap of resources belonging to the space between each two of them. As the number of nodes is significantly less that the number of resources, this operation should be more effective that iterating over the whole map. For example: * there were 3 nodes: A (join order 0), B (join
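The interval idea described above (and its worked A/B/C example) can be sketched directly against an ordered map keyed by the creator's join order: resources of departed nodes are exactly those in the submaps strictly between the join orders of the nodes still online. The key and value types below are simplified placeholders, not the actual registry types:

```java
// Sketch of collecting resources of departed nodes via join-order submaps,
// mirroring the A/B/C example in the ticket. Types are simplified placeholders.
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

public class StaleResourcesSketch {
    public static void main(String[] args) {
        // joinOrder -> resource ids created by that node (flattened for brevity).
        NavigableMap<Long, List<String>> resources = new ConcurrentSkipListMap<>();
        resources.put(0L, List.of("a1", "a2"));       // node A, join order 0
        resources.put(1L, List.of("b1", "b2", "b3")); // node B, join order 1 (B left)
        resources.put(2L, List.of("c1"));             // node C, join order 2

        long[] online = {0L, 2L}; // sorted join orders of nodes still in topology
        List<String> stale = new ArrayList<>();
        long lower = Long.MIN_VALUE;
        for (long order : online) {
            // Strictly between two online nodes: everything here belongs to departed nodes.
            resources.subMap(lower, false, order, false).values().forEach(stale::addAll);
            lower = order;
        }
        resources.tailMap(lower, false).values().forEach(stale::addAll); // (last; MAX_ORDER)
        System.out.println(stale); // only node B's resources are collected
    }
}
```

The loop performs one submap lookup per online node, which matches the ticket's efficiency argument: cost scales with the node count, not the resource count.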
[jira] [Updated] (IGNITE-21633) Get rid of RemotelyTriggeredResourceRegistry#remoteHostsToResources
[ https://issues.apache.org/jira/browse/IGNITE-21633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denis Chudov updated IGNITE-21633: -- Description: *Motivation* RemotelyTriggeredResourceRegistry has an API that allows closing resources using the following parameters: * #close(UUID contextId) * #close(FullyQualifiedResourceId resourceId) * #close(String remoteHostId) - used when the remote host is no longer in the topology and we can close all resources that it has triggered, because they are no longer needed In IGNITE-21293 a map _RemotelyTriggeredResourceRegistry#remoteHostsToResource_ was added which, in fact, groups the resources by remote host; this is needed to implement the last method without iterating over all resources. The main map ({_}#resources{_}, an ordered map of FullyQualifiedResourceId to RemotelyTriggeredResource objects), which is used to store the resources, cannot provide resources by remote host id, because FullyQualifiedResourceId does not contain the remote host id. The context id is included in the FullyQualifiedResourceId, but the transaction id (which is the contextId in the case of a cursor resource) does not contain a node identifier, only an integer hash code of the coordinator node name. *Definition of done* The _RemotelyTriggeredResourceRegistry#remoteHostsToResource_ map is removed. *Implementation notes* We can change the transaction id generation to replace the node name hash with the order in which the node joined the cluster; then we will be able to determine the transaction coordinator from the transaction id alone. This will also require postulating that context id generation for every type of resource follows this rule. After that we will be able to get a submap of resources created by some node from the _#resources_ map (FullyQualifiedResourceId to RemotelyTriggeredResource object). 
To get all resources triggered by nodes that are no longer in the topology, we can iterate over the currently online nodes (in their join order) and take a submap of resources belonging to the space between each two of them. As the number of nodes is significantly less than the number of resources, this operation should be more efficient than iterating over the whole map. For example: * there were 3 nodes: A (join order 0), B (join order 1), C (join order 2); * node B left the topology; * there are 1000 resources, 200 of them created by A, 500 by B and 300 by C; * iterating over the pairs of existing nodes yields the following intervals: (MIN_ORDER; 0) - submap is empty, (0; 2) - submap includes the 500 resources created by B, (2; MAX_ORDER) - submap is empty. was: Motivation In IGNITE-21293 there was added a map > Get rid of RemotelyTriggeredResourceRegistry#remoteHostsToResources > --- > > Key: IGNITE-21633 > URL: https://issues.apache.org/jira/browse/IGNITE-21633 > Project: Ignite > Issue Type: Improvement >Reporter: Denis Chudov >Priority: Major > Labels: ignite-3 > > *Motivation* > RemotelyTriggeredResourceRegistry has an API that allows closing > resources using the following parameters: > * #close(UUID contextId) > * #close(FullyQualifiedResourceId resourceId) > * #close(String remoteHostId) - used when the remote host is no longer in > the topology and we can close all resources that it has triggered, because they > are no longer needed > In IGNITE-21293 a map > _RemotelyTriggeredResourceRegistry#remoteHostsToResource_ was added which, in fact, > groups the resources by remote host; this is needed to implement the last > method without iterating over all resources. The main map ({_}#resources{_}, > an ordered map of FullyQualifiedResourceId to RemotelyTriggeredResource objects), > which is used to store the resources, cannot provide resources by > remote host id, because FullyQualifiedResourceId does not contain the remote > host id. 
> The context id is included in the FullyQualifiedResourceId, but the > transaction id (which is the contextId in the case of a cursor resource) does not > contain a node identifier, only an integer hash code of the coordinator node > name. > *Definition of done* > The _RemotelyTriggeredResourceRegistry#remoteHostsToResource_ map is removed. > *Implementation notes* > We can change the transaction id generation to replace the node name hash > with the order in which the node joined the cluster; then we will be able to > determine the transaction coordinator from the transaction id alone. This > will also require postulating that context id generation for every type of > resource follows this rule. > After that we will be able to get a submap of resources created by some node > from the _#resources_ map (FullyQualifiedResourceId to
[jira] [Updated] (IGNITE-21633) Get rid of RemotelyTriggeredResourceRegistry#remoteHostsToResources
[ https://issues.apache.org/jira/browse/IGNITE-21633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denis Chudov updated IGNITE-21633: -- Description: Motivation In IGNITE-21293 there was added a map was:TBD > Get rid of RemotelyTriggeredResourceRegistry#remoteHostsToResources > --- > > Key: IGNITE-21633 > URL: https://issues.apache.org/jira/browse/IGNITE-21633 > Project: Ignite > Issue Type: Improvement >Reporter: Denis Chudov >Priority: Major > Labels: ignite-3 > > Motivation > In IGNITE-21293 there was added a map -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (IGNITE-21634) NPE in HeapLockManager
[ https://issues.apache.org/jira/browse/IGNITE-21634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denis Chudov updated IGNITE-21634: -- Description: {code:java} Caused by: java.lang.NullPointerException at org.apache.ignite.internal.tx.impl.HeapLockManager.lambda$lockState$4(HeapLockManager.java:297) ~[main/:?] at java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1908) ~[?:?] at org.apache.ignite.internal.tx.impl.HeapLockManager.lockState(HeapLockManager.java:291) ~[main/:?] at org.apache.ignite.internal.tx.impl.HeapLockManager.acquire(HeapLockManager.java:172) ~[main/:?] at org.apache.ignite.internal.table.distributed.SortedIndexLocker.lambda$locksForInsert$4(SortedIndexLocker.java:169) ~[main/:?] at java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1106) ~[?:?] ... 29 more{code} on the line {{v.markedForRemove = false;}} {code:java} private LockState lockState(LockKey key) { int h = spread(key.hashCode()); int index = h & (slots.length - 1); LockState[] res = new LockState[1]; locks.compute(key, (k, v) -> { if (v == null) { if (empty.isEmpty()) { res[0] = slots[index]; } else { v = empty.poll(); v.markedForRemove = false; v.key = k; res[0] = v; } } else { res[0] = v; } return v; }); return res[0]; } {code} was: {code:java} Caused by: java.lang.NullPointerException at org.apache.ignite.internal.tx.impl.HeapLockManager.lambda$lockState$4(HeapLockManager.java:297) ~[main/:?] at java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1908) ~[?:?] at org.apache.ignite.internal.tx.impl.HeapLockManager.lockState(HeapLockManager.java:291) ~[main/:?] at org.apache.ignite.internal.tx.impl.HeapLockManager.acquire(HeapLockManager.java:172) ~[main/:?] at org.apache.ignite.internal.table.distributed.SortedIndexLocker.lambda$locksForInsert$4(SortedIndexLocker.java:169) ~[main/:?] at java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1106) ~[?:?] ... 
29 more{code} > NPE in HeapLockManager > -- > > Key: IGNITE-21634 > URL: https://issues.apache.org/jira/browse/IGNITE-21634 > Project: Ignite > Issue Type: Improvement >Reporter: Denis Chudov >Priority: Major > Labels: ignite-3 > > {code:java} > Caused by: java.lang.NullPointerException at > org.apache.ignite.internal.tx.impl.HeapLockManager.lambda$lockState$4(HeapLockManager.java:297) > ~[main/:?] at > java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1908) > ~[?:?] at > org.apache.ignite.internal.tx.impl.HeapLockManager.lockState(HeapLockManager.java:291) > ~[main/:?] at > org.apache.ignite.internal.tx.impl.HeapLockManager.acquire(HeapLockManager.java:172) > ~[main/:?] at > org.apache.ignite.internal.table.distributed.SortedIndexLocker.lambda$locksForInsert$4(SortedIndexLocker.java:169) > ~[main/:?] at > java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1106) > ~[?:?] ... 29 more{code} > on the line {{v.markedForRemove = false;}} > {code:java} > private LockState lockState(LockKey key) { > int h = spread(key.hashCode()); > int index = h & (slots.length - 1); > LockState[] res = new LockState[1]; > locks.compute(key, (k, v) -> { > if (v == null) { > if (empty.isEmpty()) { > res[0] = slots[index]; > } else { > v = empty.poll(); > v.markedForRemove = false; > v.key = k; > res[0] = v; > } > } else { > res[0] = v; > } > return v; > }); > return res[0]; > } {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (IGNITE-21634) NPE in HeapLockManager
Denis Chudov created IGNITE-21634: - Summary: NPE in HeapLockManager Key: IGNITE-21634 URL: https://issues.apache.org/jira/browse/IGNITE-21634 Project: Ignite Issue Type: Improvement Reporter: Denis Chudov {code:java} Caused by: java.lang.NullPointerException at org.apache.ignite.internal.tx.impl.HeapLockManager.lambda$lockState$4(HeapLockManager.java:297) ~[main/:?] at java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1908) ~[?:?] at org.apache.ignite.internal.tx.impl.HeapLockManager.lockState(HeapLockManager.java:291) ~[main/:?] at org.apache.ignite.internal.tx.impl.HeapLockManager.acquire(HeapLockManager.java:172) ~[main/:?] at org.apache.ignite.internal.table.distributed.SortedIndexLocker.lambda$locksForInsert$4(SortedIndexLocker.java:169) ~[main/:?] at java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1106) ~[?:?] ... 29 more{code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (IGNITE-21633) Get rid of RemotelyTriggeredResourceRegistry#remoteHostsToResources
Denis Chudov created IGNITE-21633: - Summary: Get rid of RemotelyTriggeredResourceRegistry#remoteHostsToResources Key: IGNITE-21633 URL: https://issues.apache.org/jira/browse/IGNITE-21633 Project: Ignite Issue Type: Improvement Reporter: Denis Chudov TBD -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (IGNITE-21618) In-flights for read-only transactions
[ https://issues.apache.org/jira/browse/IGNITE-21618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denis Chudov reassigned IGNITE-21618: - Assignee: Denis Chudov > In-flights for read-only transactions > - > > Key: IGNITE-21618 > URL: https://issues.apache.org/jira/browse/IGNITE-21618 > Project: Ignite > Issue Type: Improvement >Reporter: Denis Chudov >Assignee: Denis Chudov >Priority: Major > Labels: ignite-3 > > *Motivation* > We need a solid mechanism for closing read-only transactions' resources > (scan cursors, etc.) on remote servers after tx finish. Resources are > supposed to be closed by requests from the coordinator, sent from a separate > cleanup thread after the tx is finished, to maximise the performance of the > tx finish itself and because these requests are needed only for resource > cleanup. But we need to prevent a race, such as: > * a tx request that is supposed to create a scan cursor on a remote server is sent > * the tx is finished > * the cleanup thread sends a cleanup request > * the cleanup request reaches the remote server > * the tx request reaches the remote server and opens a cursor that will never be > closed. > We need to ensure that the cleanup request is not sent until the coordinator > receives responses for all requests sent before tx finish, and that no > requests are allowed after tx finish. Something similar to the RW in-flight > requests counter is to be done for RO. > *Definition of done* > The cleanup request from the cleanup thread is not sent until the coordinator > receives responses for all requests sent before tx finish, and no > requests are allowed after tx finish. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (IGNITE-21618) In-flights for read-only transactions
[ https://issues.apache.org/jira/browse/IGNITE-21618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denis Chudov updated IGNITE-21618: -- Description: *Motivation* We need a solid mechanism for closing read-only transactions' resources (scan cursors, etc.) on remote servers after tx finish. Resources are supposed to be closed by requests from the coordinator, sent from a separate cleanup thread after the tx is finished, to maximise the performance of the tx finish itself and because these requests are needed only for resource cleanup. But we need to prevent a race, such as: * a tx request that is supposed to create a scan cursor on a remote server is sent * the tx is finished * the cleanup thread sends a cleanup request * the cleanup request reaches the remote server * the tx request reaches the remote server and opens a cursor that will never be closed. We need to ensure that the cleanup request is not sent until the coordinator receives responses for all requests sent before tx finish, and that no requests are allowed after tx finish. Something similar to the RW in-flight requests counter is to be done for RO. *Definition of done* The cleanup request from the cleanup thread is not sent until the coordinator receives responses for all requests sent before tx finish, and no requests are allowed after tx finish. was: *Motivation* We need to make solid mechanism of closing read-only transactions' resources (cursors, etc.) on remote servers after tx finish. > In-flights for read-only transactions > - > > Key: IGNITE-21618 > URL: https://issues.apache.org/jira/browse/IGNITE-21618 > Project: Ignite > Issue Type: Improvement >Reporter: Denis Chudov >Priority: Major > Labels: ignite-3 > > *Motivation* > We need a solid mechanism for closing read-only transactions' resources > (scan cursors, etc.) on remote servers after tx finish. 
Resources are > supposed to be closed by requests from the coordinator, sent from a separate > cleanup thread after the tx is finished, to maximise the performance of the > tx finish itself and because these requests are needed only for resource > cleanup. But we need to prevent a race, such as: > * a tx request that is supposed to create a scan cursor on a remote server is sent > * the tx is finished > * the cleanup thread sends a cleanup request > * the cleanup request reaches the remote server > * the tx request reaches the remote server and opens a cursor that will never be > closed. > We need to ensure that the cleanup request is not sent until the coordinator > receives responses for all requests sent before tx finish, and that no > requests are allowed after tx finish. Something similar to the RW in-flight > requests counter is to be done for RO. > *Definition of done* > The cleanup request from the cleanup thread is not sent until the coordinator > receives responses for all requests sent before tx finish, and no > requests are allowed after tx finish. -- This message was sent by Atlassian Jira (v8.20.10#820010)
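The in-flight gating described in the ticket above can be sketched with a counter and a finished flag: new requests are rejected once the tx is finished, and the cleanup action fires only when the last pre-finish request completes. The class and method names are illustrative, not Ignite's actual API:

```java
// Sketch of the RO in-flights idea: cleanup is deferred until all requests
// sent before tx finish have completed, and no request may start afterwards.
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

public class RoInflightsSketch {
    private final AtomicInteger inflights = new AtomicInteger();
    private final AtomicBoolean finished = new AtomicBoolean();
    private final AtomicBoolean cleanedUp = new AtomicBoolean();

    boolean tryStartRequest() {
        if (finished.get()) return false; // no new requests after tx finish
        inflights.incrementAndGet();
        if (finished.get()) { // finish raced with us: roll back and reject
            completeRequest();
            return false;
        }
        return true;
    }

    void completeRequest() {
        // Last pre-finish response arrived after finish: cleanup may fire now.
        if (inflights.decrementAndGet() == 0 && finished.get()) cleanup();
    }

    void finish() {
        finished.set(true);
        if (inflights.get() == 0) cleanup(); // nothing in flight: cleanup immediately
    }

    private void cleanup() {
        if (cleanedUp.compareAndSet(false, true)) System.out.println("cleanup sent");
    }

    public static void main(String[] args) {
        RoInflightsSketch tx = new RoInflightsSketch();
        boolean started = tx.tryStartRequest(); // request in flight before finish
        tx.finish();                            // cleanup deferred: one request pending
        tx.completeRequest();                   // last response arrives -> cleanup fires
        System.out.println(started + " " + !tx.tryStartRequest()); // late request rejected
    }
}
```

This mirrors the ticket's ordering guarantee: the cursor-creating request from the race scenario either completes before cleanup is sent, or is rejected outright.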
[jira] [Updated] (IGNITE-21618) In-flights for read-only transactions
[ https://issues.apache.org/jira/browse/IGNITE-21618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denis Chudov updated IGNITE-21618: -- Epic Link: IGNITE-21221 (was: IGNITE-21174) > In-flights for read-only transactions > - > > Key: IGNITE-21618 > URL: https://issues.apache.org/jira/browse/IGNITE-21618 > Project: Ignite > Issue Type: Improvement >Reporter: Denis Chudov >Priority: Major > Labels: ignite-3 > > *Motivation* > We need a solid mechanism for closing read-only transactions' resources > (scan cursors, etc.) on remote servers after tx finish. Resources are > supposed to be closed by requests from the coordinator, sent from a separate > cleanup thread after the tx is finished, to maximise the performance of the > tx finish itself and because these requests are needed only for resource > cleanup. But we need to prevent a race, such as: > * a tx request that is supposed to create a scan cursor on a remote server is sent > * the tx is finished > * the cleanup thread sends a cleanup request > * the cleanup request reaches the remote server > * the tx request reaches the remote server and opens a cursor that will never be > closed. > We need to ensure that the cleanup request is not sent until the coordinator > receives responses for all requests sent before tx finish, and that no > requests are allowed after tx finish. Something similar to the RW in-flight > requests counter is to be done for RO. > *Definition of done* > The cleanup request from the cleanup thread is not sent until the coordinator > receives responses for all requests sent before tx finish, and no > requests are allowed after tx finish. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (IGNITE-21618) In-flights for read-only transactions
[ https://issues.apache.org/jira/browse/IGNITE-21618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denis Chudov updated IGNITE-21618: -- Description: *Motivation* We need a solid mechanism for closing read-only transactions' resources (cursors, etc.) on remote servers after tx finish. was:TBD > In-flights for read-only transactions > - > > Key: IGNITE-21618 > URL: https://issues.apache.org/jira/browse/IGNITE-21618 > Project: Ignite > Issue Type: Improvement >Reporter: Denis Chudov >Priority: Major > Labels: ignite-3 > > *Motivation* > We need a solid mechanism for closing read-only transactions' resources > (cursors, etc.) on remote servers after tx finish. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (IGNITE-21618) In-flights for read-only transactions
Denis Chudov created IGNITE-21618: - Summary: In-flights for read-only transactions Key: IGNITE-21618 URL: https://issues.apache.org/jira/browse/IGNITE-21618 Project: Ignite Issue Type: Improvement Reporter: Denis Chudov TBD -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (IGNITE-21545) Introduce a cursor manager
[ https://issues.apache.org/jira/browse/IGNITE-21545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17817913#comment-17817913 ] Denis Chudov commented on IGNITE-21545: --- Split out from IGNITE-21293 > Introduce a cursor manager > -- > > Key: IGNITE-21545 > URL: https://issues.apache.org/jira/browse/IGNITE-21545 > Project: Ignite > Issue Type: Improvement >Reporter: Denis Chudov >Assignee: Denis Chudov >Priority: Major > Labels: ignite-3 > > Introduce a cursor manager that would maintain all cursors created on a node, > instead of maintaining them in partition replica listeners. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (IGNITE-21545) Introduce a cursor manager
Denis Chudov created IGNITE-21545: - Summary: Introduce a cursor manager Key: IGNITE-21545 URL: https://issues.apache.org/jira/browse/IGNITE-21545 Project: Ignite Issue Type: Improvement Reporter: Denis Chudov Assignee: Denis Chudov Introduce a cursor manager that would maintain all cursors created on a node, instead of maintaining them in partition replica listeners. -- This message was sent by Atlassian Jira (v8.20.10#820010)
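The node-wide cursor manager proposed above can be sketched as a single registry keyed per transaction, so cursors no longer live inside individual partition replica listeners and can be closed in bulk. All names and types here are illustrative placeholders, not the actual Ignite API:

```java
// Illustrative sketch of a node-wide cursor registry: one map per node,
// keyed by (txId, cursorId), with bulk close per transaction.
// AutoCloseable stands in for a real cursor type.
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

public class CursorRegistrySketch {
    // txId -> (cursorId -> cursor)
    private final Map<UUID, Map<Long, AutoCloseable>> cursors = new ConcurrentHashMap<>();

    void register(UUID txId, long cursorId, AutoCloseable cursor) {
        cursors.computeIfAbsent(txId, id -> new ConcurrentHashMap<>()).put(cursorId, cursor);
    }

    // Closes and forgets all cursors of a transaction; returns how many were closed.
    int closeAll(UUID txId) throws Exception {
        Map<Long, AutoCloseable> txCursors = cursors.remove(txId);
        if (txCursors == null) return 0;
        for (AutoCloseable c : txCursors.values()) c.close();
        return txCursors.size();
    }

    public static void main(String[] args) throws Exception {
        CursorRegistrySketch registry = new CursorRegistrySketch();
        UUID tx = UUID.randomUUID();
        registry.register(tx, 1L, () -> {});
        registry.register(tx, 2L, () -> {});
        System.out.println(registry.closeAll(tx)); // both cursors closed
        System.out.println(registry.closeAll(tx)); // idempotent: nothing left
    }
}
```

Centralizing cursors this way is what makes a per-transaction (or per-remote-host) bulk close cheap, which the related resource-registry tickets above rely on.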
[jira] [Updated] (IGNITE-21513) ActiveActorTest#testChangeLeaderForce is flaky
[ https://issues.apache.org/jira/browse/IGNITE-21513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denis Chudov updated IGNITE-21513: -- Description: {code:java} [05:19:12]F: [org.apache.ignite.internal.placementdriver.ActiveActorTest.testChangeLeaderForce(TestInfo)] org.opentest4j.AssertionFailedError: expected: but was: at app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151) at app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132) at app//org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63) at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36) at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:31) at app//org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:180) at app//org.apache.ignite.internal.placementdriver.ActiveActorTest.testChangeLeaderForce(ActiveActorTest.java:370){code} See IGNITE-21381 for more details. This ticket is about fixing the flaky test and removing the code duplication between ActiveActorTest and TopologyAwareRaftGroupServiceTest. The actual problem of the test was a race due to the lack of joins on the futures from #subscribeLeader(). 
was: {code:java} [05:19:12]F: [org.apache.ignite.internal.placementdriver.ActiveActorTest.testChangeLeaderForce(TestInfo)] org.opentest4j.AssertionFailedError: expected: but was: at app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151) at app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132) at app//org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63) at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36) at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:31) at app//org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:180) at app//org.apache.ignite.internal.placementdriver.ActiveActorTest.testChangeLeaderForce(ActiveActorTest.java:370){code} See IGNITE-21381 for more details. This ticket is about fixing flaky test and removing the code duplication between ActiveActorTest and TopologyAwareRaftGroupServiceTest. > ActiveActorTest#testChangeLeaderForce is flaky > -- > > Key: IGNITE-21513 > URL: https://issues.apache.org/jira/browse/IGNITE-21513 > Project: Ignite > Issue Type: Bug >Reporter: Denis Chudov >Assignee: Denis Chudov >Priority: Major > Labels: ignite-3 > Time Spent: 0.5h > Remaining Estimate: 0h > > {code:java} > [05:19:12]F: > [org.apache.ignite.internal.placementdriver.ActiveActorTest.testChangeLeaderForce(TestInfo)] > org.opentest4j.AssertionFailedError: expected: but was: > at > app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151) > > at > app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132) > > at app//org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63) > at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36) > at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:31) > at app//org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:180) > at > 
app//org.apache.ignite.internal.placementdriver.ActiveActorTest.testChangeLeaderForce(ActiveActorTest.java:370){code} > See IGNITE-21381 for more details. This ticket is about fixing the flaky test and > removing the code duplication between ActiveActorTest and > TopologyAwareRaftGroupServiceTest. > The actual problem of the test was a race due to the lack of joins on the > futures from #subscribeLeader(). -- This message was sent by Atlassian Jira (v8.20.10#820010)
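The race described in this ticket (asserting on the observed leader before the futures from #subscribeLeader() complete) can be sketched in miniature. This is a hypothetical illustration, not the actual test code; all names below are invented:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicReference;

public class JoinBeforeAssert {
    static final AtomicReference<String> observedLeader = new AtomicReference<>();

    // Stand-in for the per-node subscription that asynchronously observes the leader.
    public static CompletableFuture<Void> subscribeLeader(String node) {
        return CompletableFuture.runAsync(() -> observedLeader.set(node));
    }

    public static String awaitLeader(List<CompletableFuture<Void>> subscriptions) {
        // The fix: join every subscription future before asserting on the leader.
        // Without these joins, the assertion races with the async callbacks.
        subscriptions.forEach(CompletableFuture::join);
        return observedLeader.get();
    }

    public static void main(String[] args) {
        String leader = awaitLeader(List.of(subscribeLeader("aat_tclf_1234")));
        System.out.println("observed leader: " + leader);
    }
}
```

Joining the future returned by runAsync establishes a happens-before edge, so the subsequent read of observedLeader is guaranteed to see the callback's write.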
[jira] [Comment Edited] (IGNITE-21381) ActiveActorTest#testChangeLeaderForce has problems with resource cleanup
[ https://issues.apache.org/jira/browse/IGNITE-21381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816547#comment-17816547 ] Denis Chudov edited comment on IGNITE-21381 at 2/12/24 3:12 PM: As we have a PR already that is intended to fix the resource leak ( [https://github.com/apache/ignite-3/pull/3150] ), I created a new ticket to fix the flaky test and do the refactoring: IGNITE-21513 TC logs appeared to be misleading because of reordered messages (see the timestamps). The actual problem of the test was a race due to the lack of joins on the futures from #subscribeLeader(). was (Author: denis chudov): As we have a PR already that is intended to fix resource leak ( https://github.com/apache/ignite-3/pull/3150 ), I created a new ticket to fix the flaky test and make a refactoring IGNITE-21513 > ActiveActorTest#testChangeLeaderForce has problems with resource cleanup > > > Key: IGNITE-21381 > URL: https://issues.apache.org/jira/browse/IGNITE-21381 > Project: Ignite > Issue Type: Bug >Reporter: Mirza Aliev >Priority: Major > Labels: ignite-3 > Attachments: screenshot-1.png, screenshot-2.png > > Time Spent: 40m > Remaining Estimate: 0h > > {{ActiveActorTest#testChangeLeaderForce}} started to be flaky on TC with > {noformat} > [05:19:12]F: > [org.apache.ignite.internal.placementdriver.ActiveActorTest.testChangeLeaderForce(TestInfo)] > org.opentest4j.AssertionFailedError: expected: but was: > at > app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151) > at > app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132) > at app//org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63) > at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36) > at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:31) > at app//org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:180) > at > 
app//org.apache.ignite.internal.placementdriver.ActiveActorTest.testChangeLeaderForce(ActiveActorTest.java:370) > {noformat} > From the log we can see that the transfer of leadership, which was supposed to be > successful, does not happen. The behaviour is the following: > 1) Current leader is {{Leader: ClusterNodeImpl > [id=e99210fb-f872-4e08-a99c-53f9512da20e, name=aat_tclf_1235}} > 2) We want to transfer leadership to {{Peer to transfer leader: Peer > [consistentId=aat_tclf_1234, idx=0]}} > 3) The transfer process is started > 4) We receive a warning about an error during {{GetLeaderRequestImpl}}: > {noformat} > [2024-01-29T05:19:08,855][WARN > ][CompletableFutureDelayScheduler][RaftGroupServiceImpl] Recoverable error > during the request occurred (will be retried on the randomly selected node) > [request=GetLeaderRequestImpl [groupId=TestReplicationGroup, > peerId=aat_tclf_1235], peer=Peer [consistentId=aat_tclf_1235, idx=0], > newPeer=Peer [consistentId=aat_tclf_1234, idx=0]]. > java.util.concurrent.CompletionException: > java.util.concurrent.TimeoutException > at > java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:367) > ~[?:?] > at > java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:376) > ~[?:?] > at > java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:1019) > ~[?:?] > at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) > [?:?] > at > java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088) > [?:?] > at > java.util.concurrent.CompletableFuture$Timeout.run(CompletableFuture.java:2792) > [?:?] > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?] > at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?] > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) > [?:?] 
> at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > [?:?] > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > [?:?] > at java.lang.Thread.run(Thread.java:834) [?:?] > Caused by: java.util.concurrent.TimeoutException > ... 7 more > {noformat} > 5) After that we see that node {{aat_tclf_1236}} sends an invalid > {{RequestVoteResponse}} because it thinks that it is the leader: > {noformat} > [2024-01-29T05:19:11,370][WARN > ][%aat_tclf_1234%JRaft-Response-Processor-15][NodeImpl] Node > received invalid RequestVoteResponse > from aat_tclf_1236, state not in STATE_CANDIDATE but STATE_LEADER. > {noformat} > > Tests
[jira] [Updated] (IGNITE-21513) ActiveActorTest#testChangeLeaderForce is flaky
[ https://issues.apache.org/jira/browse/IGNITE-21513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denis Chudov updated IGNITE-21513: -- Description: {code:java} [05:19:12]F: [org.apache.ignite.internal.placementdriver.ActiveActorTest.testChangeLeaderForce(TestInfo)] org.opentest4j.AssertionFailedError: expected: but was: at app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151) at app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132) at app//org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63) at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36) at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:31) at app//org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:180) at app//org.apache.ignite.internal.placementdriver.ActiveActorTest.testChangeLeaderForce(ActiveActorTest.java:370){code} See IGNITE-21381 for more details. This ticket is about fixing the flaky test and removing the code duplication between ActiveActorTest and TopologyAwareRaftGroupServiceTest. 
was: {code:java} [05:19:12]F: [org.apache.ignite.internal.placementdriver.ActiveActorTest.testChangeLeaderForce(TestInfo)] org.opentest4j.AssertionFailedError: expected: but was: at app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151) at app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132) at app//org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63) at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36) at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:31) at app//org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:180) at app//org.apache.ignite.internal.placementdriver.ActiveActorTest.testChangeLeaderForce(ActiveActorTest.java:370){code} > ActiveActorTest#testChangeLeaderForce is flaky > -- > > Key: IGNITE-21513 > URL: https://issues.apache.org/jira/browse/IGNITE-21513 > Project: Ignite > Issue Type: Bug >Reporter: Denis Chudov >Assignee: Denis Chudov >Priority: Major > Labels: ignite-3 > Time Spent: 10m > Remaining Estimate: 0h > > {code:java} > [05:19:12]F: > [org.apache.ignite.internal.placementdriver.ActiveActorTest.testChangeLeaderForce(TestInfo)] > org.opentest4j.AssertionFailedError: expected: but was: > at > app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151) > > at > app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132) > > at app//org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63) > at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36) > at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:31) > at app//org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:180) > at > app//org.apache.ignite.internal.placementdriver.ActiveActorTest.testChangeLeaderForce(ActiveActorTest.java:370){code} > See IGNITE-21381 for more details. 
This ticket is about fixing the flaky test and > removing the code duplication between ActiveActorTest and > TopologyAwareRaftGroupServiceTest. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (IGNITE-21513) ActiveActorTest#testChangeLeaderForce is flaky
[ https://issues.apache.org/jira/browse/IGNITE-21513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denis Chudov reassigned IGNITE-21513: - Assignee: Denis Chudov > ActiveActorTest#testChangeLeaderForce is flaky > -- > > Key: IGNITE-21513 > URL: https://issues.apache.org/jira/browse/IGNITE-21513 > Project: Ignite > Issue Type: Bug >Reporter: Denis Chudov >Assignee: Denis Chudov >Priority: Major > Labels: ignite-3 > > {code:java} > [05:19:12]F: > [org.apache.ignite.internal.placementdriver.ActiveActorTest.testChangeLeaderForce(TestInfo)] > org.opentest4j.AssertionFailedError: expected: but was: > at > app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151) > > at > app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132) > > at app//org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63) > at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36) > at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:31) > at app//org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:180) > at > app//org.apache.ignite.internal.placementdriver.ActiveActorTest.testChangeLeaderForce(ActiveActorTest.java:370){code} > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (IGNITE-21381) ActiveActorTest#testChangeLeaderForce has problems with resource cleanup
[ https://issues.apache.org/jira/browse/IGNITE-21381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816547#comment-17816547 ] Denis Chudov commented on IGNITE-21381: --- As we have a PR already that is intended to fix the resource leak ( https://github.com/apache/ignite-3/pull/3150 ), I created a new ticket to fix the flaky test and do the refactoring: IGNITE-21513 > ActiveActorTest#testChangeLeaderForce has problems with resource cleanup > > > Key: IGNITE-21381 > URL: https://issues.apache.org/jira/browse/IGNITE-21381 > Project: Ignite > Issue Type: Bug >Reporter: Mirza Aliev >Priority: Major > Labels: ignite-3 > Attachments: screenshot-1.png, screenshot-2.png > > Time Spent: 0.5h > Remaining Estimate: 0h > > {{ActiveActorTest#testChangeLeaderForce}} started to be flaky on TC with > {noformat} > [05:19:12]F: > [org.apache.ignite.internal.placementdriver.ActiveActorTest.testChangeLeaderForce(TestInfo)] > org.opentest4j.AssertionFailedError: expected: but was: > at > app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151) > at > app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132) > at app//org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63) > at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36) > at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:31) > at app//org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:180) > at > app//org.apache.ignite.internal.placementdriver.ActiveActorTest.testChangeLeaderForce(ActiveActorTest.java:370) > {noformat} > From the log we can see that the transfer of leadership, which was supposed to be > successful, does not happen. 
The behaviour is the following: > 1) Current leader is {{Leader: ClusterNodeImpl > [id=e99210fb-f872-4e08-a99c-53f9512da20e, name=aat_tclf_1235}} > 2) We want to transfer leadership to {{Peer to transfer leader: Peer > [consistentId=aat_tclf_1234, idx=0]}} > 3) The transfer process is started > 4) We receive a warning about an error during {{GetLeaderRequestImpl}}: > {noformat} > [2024-01-29T05:19:08,855][WARN > ][CompletableFutureDelayScheduler][RaftGroupServiceImpl] Recoverable error > during the request occurred (will be retried on the randomly selected node) > [request=GetLeaderRequestImpl [groupId=TestReplicationGroup, > peerId=aat_tclf_1235], peer=Peer [consistentId=aat_tclf_1235, idx=0], > newPeer=Peer [consistentId=aat_tclf_1234, idx=0]]. > java.util.concurrent.CompletionException: > java.util.concurrent.TimeoutException > at > java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:367) > ~[?:?] > at > java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:376) > ~[?:?] > at > java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:1019) > ~[?:?] > at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) > [?:?] > at > java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088) > [?:?] > at > java.util.concurrent.CompletableFuture$Timeout.run(CompletableFuture.java:2792) > [?:?] > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?] > at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?] > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) > [?:?] > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > [?:?] > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > [?:?] > at java.lang.Thread.run(Thread.java:834) [?:?] > Caused by: java.util.concurrent.TimeoutException > ... 
7 more > {noformat} > 5) After that we see that node {{aat_tclf_1236}} sends an invalid > {{RequestVoteResponse}} because it thinks that it is the leader: > {noformat} > [2024-01-29T05:19:11,370][WARN > ][%aat_tclf_1234%JRaft-Response-Processor-15][NodeImpl] Node > received invalid RequestVoteResponse > from aat_tclf_1236, state not in STATE_CANDIDATE but STATE_LEADER. > {noformat} > > Tests {{ActiveActorTest#testChangeLeaderForce}} and > {{TopologyAwareRaftGroupServiceTest#testChangeLeaderForce}} were muted. > Also there are some other problems with these tests: they incorrectly clean up > resources in case of failure. The cluster is stopped in the test itself, meaning that > if some assertion fails, the rest of the test won't be evaluated, > hence the cluster won't be stopped. > The next problem is that if we run this test several times, even if
[jira] [Created] (IGNITE-21513) ActiveActorTest#testChangeLeaderForce is flaky
Denis Chudov created IGNITE-21513: - Summary: ActiveActorTest#testChangeLeaderForce is flaky Key: IGNITE-21513 URL: https://issues.apache.org/jira/browse/IGNITE-21513 Project: Ignite Issue Type: Bug Reporter: Denis Chudov {code:java} [05:19:12]F: [org.apache.ignite.internal.placementdriver.ActiveActorTest.testChangeLeaderForce(TestInfo)] org.opentest4j.AssertionFailedError: expected: but was: at app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151) at app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132) at app//org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63) at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36) at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:31) at app//org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:180) at app//org.apache.ignite.internal.placementdriver.ActiveActorTest.testChangeLeaderForce(ActiveActorTest.java:370){code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (IGNITE-21381) ActiveActorTest#testChangeLeaderForce has problems with resource cleanup
[ https://issues.apache.org/jira/browse/IGNITE-21381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denis Chudov updated IGNITE-21381: -- Summary: ActiveActorTest#testChangeLeaderForce has problems with resource cleanup (was: ActiveActorTest#testChangeLeaderForce is flaky ) > ActiveActorTest#testChangeLeaderForce has problems with resource cleanup > > > Key: IGNITE-21381 > URL: https://issues.apache.org/jira/browse/IGNITE-21381 > Project: Ignite > Issue Type: Bug >Reporter: Mirza Aliev >Priority: Major > Labels: ignite-3 > Attachments: screenshot-1.png, screenshot-2.png > > Time Spent: 0.5h > Remaining Estimate: 0h > > {{ActiveActorTest#testChangeLeaderForce}} started to be flaky on TC with > {noformat} > [05:19:12]F: > [org.apache.ignite.internal.placementdriver.ActiveActorTest.testChangeLeaderForce(TestInfo)] > org.opentest4j.AssertionFailedError: expected: but was: > at > app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151) > at > app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132) > at app//org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63) > at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36) > at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:31) > at app//org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:180) > at > app//org.apache.ignite.internal.placementdriver.ActiveActorTest.testChangeLeaderForce(ActiveActorTest.java:370) > {noformat} > From the log we can see that the transfer of leadership, which was supposed to be > successful, does not happen. 
The behaviour is the following: > 1) Current leader is {{Leader: ClusterNodeImpl > [id=e99210fb-f872-4e08-a99c-53f9512da20e, name=aat_tclf_1235}} > 2) We want to transfer leadership to {{Peer to transfer leader: Peer > [consistentId=aat_tclf_1234, idx=0]}} > 3) The transfer process is started > 4) We receive a warning about an error during {{GetLeaderRequestImpl}}: > {noformat} > [2024-01-29T05:19:08,855][WARN > ][CompletableFutureDelayScheduler][RaftGroupServiceImpl] Recoverable error > during the request occurred (will be retried on the randomly selected node) > [request=GetLeaderRequestImpl [groupId=TestReplicationGroup, > peerId=aat_tclf_1235], peer=Peer [consistentId=aat_tclf_1235, idx=0], > newPeer=Peer [consistentId=aat_tclf_1234, idx=0]]. > java.util.concurrent.CompletionException: > java.util.concurrent.TimeoutException > at > java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:367) > ~[?:?] > at > java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:376) > ~[?:?] > at > java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:1019) > ~[?:?] > at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) > [?:?] > at > java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088) > [?:?] > at > java.util.concurrent.CompletableFuture$Timeout.run(CompletableFuture.java:2792) > [?:?] > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?] > at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?] > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) > [?:?] > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > [?:?] > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > [?:?] > at java.lang.Thread.run(Thread.java:834) [?:?] > Caused by: java.util.concurrent.TimeoutException > ... 
7 more > {noformat} > 5) After that we see that node {{aat_tclf_1236}} sends an invalid > {{RequestVoteResponse}} because it thinks that it is the leader: > {noformat} > [2024-01-29T05:19:11,370][WARN > ][%aat_tclf_1234%JRaft-Response-Processor-15][NodeImpl] Node > received invalid RequestVoteResponse > from aat_tclf_1236, state not in STATE_CANDIDATE but STATE_LEADER. > {noformat} > > Tests {{ActiveActorTest#testChangeLeaderForce}} and > {{TopologyAwareRaftGroupServiceTest#testChangeLeaderForce}} were muted. > Also there are some other problems with these tests: they incorrectly clean up > resources in case of failure. The cluster is stopped in the test itself, meaning that > if some assertion fails, the rest of the test won't be evaluated, > hence the cluster won't be stopped. > The next problem is that if we run this test several times, even if they > pass successfully, we can see that at some point a new test cannot be run > because
[jira] [Created] (IGNITE-21500) Retry implicit full transactions in case of exceptions related to primary replica failure or move
Denis Chudov created IGNITE-21500: - Summary: Retry implicit full transactions in case of exceptions related to primary replica failure or move Key: IGNITE-21500 URL: https://issues.apache.org/jira/browse/IGNITE-21500 Project: Ignite Issue Type: Improvement Reporter: Denis Chudov *Motivation* Implicit transactions are usually "full" and include just one transactional request, which can be safely retried in case of primary replica related exceptions (PrimaryReplicaMissException, etc.). So users will never see these exceptions in case of primary replica failures. *Definition of done* PrimaryReplicaMissException, PrimaryReplicaAwaitException, TransactionExceptions with messages like "Failed to get the primary replica", "Failed to resolve the primary replica node" are not propagated to the users in the case of implicit full transactions. If it is not possible to await the primary replica, PrimaryReplicaAwaitException or transaction exception with timeout is still possible, -- This message was sent by Atlassian Jira (v8.20.10#820010)
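A one-shot ("full") implicit transaction carries no client-visible partial state, which is what makes a transparent retry safe. A minimal sketch of such a retry loop follows; the names are hypothetical and this is not the actual Ignite implementation:

```java
import java.util.function.Supplier;

public class ImplicitTxRetry {
    // Hypothetical stand-in for the real exception type mentioned in the ticket.
    public static class PrimaryReplicaMissException extends RuntimeException {}

    // Retry the whole one-shot request a bounded number of times: since the
    // full transaction is a single request, re-running it is safe.
    public static <T> T runWithRetry(Supplier<T> fullTxRequest, int maxAttempts) {
        PrimaryReplicaMissException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return fullTxRequest.get();
            } catch (PrimaryReplicaMissException e) {
                last = e; // primary moved or failed: retry the full request
            }
        }
        throw last; // retry budget exhausted: surface the exception after all
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // First attempt fails with a simulated primary replica miss; the retry succeeds.
        int result = runWithRetry(() -> {
            if (calls[0]++ == 0) throw new PrimaryReplicaMissException();
            return 42;
        }, 3);
        System.out.println(result); // prints 42
    }
}
```

A bounded attempt count keeps the "still possible" timeout behaviour from the definition of done: once attempts are exhausted, the exception propagates to the user.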
[jira] [Updated] (IGNITE-21293) Scan cursors should be closed if the tx coordinator is absent
[ https://issues.apache.org/jira/browse/IGNITE-21293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denis Chudov updated IGNITE-21293: -- Summary: Scan cursors should be closed if the tx coordinator is absent (was: Scan cursors do not close on transaction recovery) > Scan cursors should be closed if the tx coordinator is absent > - > > Key: IGNITE-21293 > URL: https://issues.apache.org/jira/browse/IGNITE-21293 > Project: Ignite > Issue Type: Bug >Reporter: Vladislav Pyatkov >Assignee: Denis Chudov >Priority: Major > Labels: ignite-3 > > h3. Motivation > Open cursors require extra memory on the server side. Hence, resources > cannot be stored for a long time. > h3. Implementation notes > During the recovery procedure, the server receives a cleanup message (the > message releases locks). On the message processing, we update the local > transaction state, and it should also close all the cursors related to this > transaction. > h3. Definition of done > All cursors should be closed on the RW transaction recovery and if a > coordinator of RO transaction leaves the cluster. > h3. Possible solution > The reason why the cursors are not being closed during the recovery is that > the normal way of closing them is implemented in > {{WriteIntentSwitchReplicaRequest}} handler, but for the recovery case we > don't have the collection of enlisted partitions, thus no write intent switch > is triggered. > We could follow the same approach as the lock manager uses, but we need > node-wide access to all the cursors opened in the current transaction. There > is another way - instead of closing the cursors directly we can shift the > responsibility to the partition listener itself. > Each node has an in-memory txnState map, tracking the state of the > transactions. If we add listeners to this map, then on registering a new > cursor a partition listener will be able to check current transaction state > and add a listener for a terminal one. 
> When the tx state is changed to a terminal one, the cursors will be closed. > We can also create a cleanup thread which would check the coordinator node ids > associated with cursors, and if those nodes are absent, the corresponding cursors would > have to be closed. -- This message was sent by Atlassian Jira (v8.20.10#820010)
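The listener-based approach proposed above can be sketched roughly as follows. All names are hypothetical, not the actual Ignite types; the sketch shows only the two invariants from the ticket: a terminal tx state closes every registered cursor, and registering against an already-terminal transaction closes the cursor immediately:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

public class TxCursorRegistry {
    public enum TxState { PENDING, COMMITTED, ABORTED }

    // Node-wide state: the in-memory txnState map and the cursors per transaction.
    private final Map<UUID, TxState> txnState = new ConcurrentHashMap<>();
    private final Map<UUID, List<AutoCloseable>> cursors = new ConcurrentHashMap<>();

    private static boolean isTerminal(TxState s) {
        return s != TxState.PENDING;
    }

    // Pitfall from the ticket: a cursor may be registered after the transaction
    // already reached a terminal state, so check the state first and close eagerly.
    public void register(UUID txId, AutoCloseable cursor) throws Exception {
        if (isTerminal(txnState.getOrDefault(txId, TxState.PENDING))) {
            cursor.close();
            return;
        }
        cursors.computeIfAbsent(txId, k -> new ArrayList<>()).add(cursor);
    }

    // Listener on the txnState map: a terminal state closes all cursors of the tx.
    public void onStateChange(UUID txId, TxState newState) throws Exception {
        txnState.put(txId, newState);
        if (isTerminal(newState)) {
            List<AutoCloseable> toClose = cursors.remove(txId);
            if (toClose != null) {
                for (AutoCloseable c : toClose) {
                    c.close();
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        TxCursorRegistry registry = new TxCursorRegistry();
        UUID txId = UUID.randomUUID();
        boolean[] closed = {false};
        registry.register(txId, () -> closed[0] = true);
        registry.onStateChange(txId, TxState.COMMITTED);
        System.out.println("cursor closed: " + closed[0]); // prints true
    }
}
```

A real implementation would need atomicity between register and onStateChange (e.g. compute-based updates on a single map); the sketch keeps the two paths separate for readability.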
[jira] [Updated] (IGNITE-21293) Scan cursors do not close on transaction recovery
[ https://issues.apache.org/jira/browse/IGNITE-21293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denis Chudov updated IGNITE-21293: -- Description: h3. Motivation Open cursors require extra memory on the server side. Hence, resources cannot be stored for a long time. h3. Implementation notes During the recovery procedure, the server receives a cleanup message (the message releases locks). On the message processing, we update the local transaction state, and it should also close all the cursors related to this transaction. h3. Definition of done All cursors should be closed on the RW transaction recovery and if a coordinator of RO transaction leaves the cluster. h3. Possible solution The reason why the cursors are not being closed during the recovery is that the normal way of closing them is implemented in {{WriteIntentSwitchReplicaRequest}} handler, but for the recovery case we don't have the collection of enlisted partitions, thus no write intent switch is triggered. We could follow the same approach as the lock manager uses, but we need node-wide access to all the cursors opened in the current transaction. There is another way - instead of closing the cursors directly we can shift the responsibility to the partition listener itself. Each node has an in-memory txnState map, tracking the state of the transactions. If we add listeners to this map, then on registering a new cursor a partition listener will be able to check current transaction state and add a listener for a terminal one. When the tx state is changed to a terminal one, the cursors will be closed. We can also create a cleanup thread which would check the coordinator node ids associated with cursors, and if those nodes are absent, the corresponding cursors would have to be closed. was: h3. Motivation Open cursors required extra memory on the server side. Hence, resources cannot be stored for a long time. h3. 
Implementation notes During the recovery procedure, the server receives a cleanup message (the message releases locks). On the message processing, we update the local transaction state, and it should also close all the cursors related to this transaction. h3. Definition of done All cursors should be closed on the RW transaction recovery. h3. Possible solution The reason why the cursors are not being closed during the recovery is that the normal way of closing them is implemented in {{WriteIntentSwitchReplicaRequest}} handler, but for the recovery case we don't have the collection of enlisted partitions, thus no write intent switch is triggered. We could follow the same approach as the lock manager uses, but we need a node-wide access to all the cursors opened in the current transaction. There is another way - instead of closing the cursors directly we can shift the responsibility to the partition listener itself. Each node has an in-memory txnState map, tracking the state of the transactions. If we add listeners to this map, then on registering a new cursor a partition listener will be able to check current transaction state and add a listener for a terminal one. When the tx state is changed to a terminal one, the cursors will be closed. h4. Pitfalls Currently the tx cursors are closed before ensuring the completion of read and update futures. There is a chance that one opens a new cursor after the "close cursors" stage. Checking TX state before registering a cursor should fix this - if the transaction is already in the terminal state - the cursor should be closed immediately. Another one: the tx state is updated from different places - {{PartitionReplicaListener}}, raft's {{PartitionListener}}. Need to make sure the tx cleanup flow is correct. 
> Scan cursors do not close on transaction recovery > - > > Key: IGNITE-21293 > URL: https://issues.apache.org/jira/browse/IGNITE-21293 > Project: Ignite > Issue Type: Bug >Reporter: Vladislav Pyatkov >Assignee: Denis Chudov >Priority: Major > Labels: ignite-3 > > h3. Motivation > Open cursors require extra memory on the server side. Hence, resources > cannot be stored for a long time. > h3. Implementation notes > During the recovery procedure, the server receives a cleanup message (the > message releases locks). On the message processing, we update the local > transaction state, and it should also close all the cursors related to this > transaction. > h3. Definition of done > All cursors should be closed on the RW transaction recovery and if a > coordinator of RO transaction leaves the cluster. > h3. Possible solution > The reason why the cursors are not being closed during the recovery is that > the normal way of closing them is implemented in > {{WriteIntentSwitchReplicaRequest}} handler, but for the recovery case we > don't have the
[jira] [Comment Edited] (IGNITE-21247) Log enhancements for LeaseUpdater
[ https://issues.apache.org/jira/browse/IGNITE-21247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17815167#comment-17815167 ] Denis Chudov edited comment on IGNITE-21247 at 2/7/24 8:54 AM: --- Under this ticket, printing of lease statistics was added: {code:java} [2024-02-07T10:47:31,387][INFO ][%iinrt_dosor_0%lease-updater-1][LeaseUpdater] Leases updated (printed once per 10 iteration(s)): [inCurrentIteration=LeaseStats [leasesCreated=3, leasesPublished=0, leasesProlonged=0, leasesWithoutCandidates=0], active=0, currentAssignmentsSize=3].{code} * {*}inCurrentIteration{*}: leases processed in the LeaseUpdater iteration that printed this log message. {*}leasesCreated{*}: how many leases were created (and negotiation started); {*}leasesPublished{*}: how many of them were published after successful negotiation; {*}leasesProlonged{*}: how many were prolonged; {*}leasesWithoutCandidates{*}: leases that had to be created or prolonged but had no leaseholder candidate. * {*}active{*}: active leases (accepted and not outdated). * {*}currentAssignmentsSize{*}: total size of the processed assignments list (the number of replication groups). 
was (Author: denis chudov): Under this ticket the printing of lease statistics was added: {code:java} [2024-02-07T10:47:31,387][INFO ][%iinrt_dosor_0%lease-updater-1][LeaseUpdater] Leases updated (printed once per 10 iteration(s)): [inCurrentIteration=LeaseStats [leasesCreated=3, leasesPublished=0, leasesProlonged=0, leasesWithoutCandidates=0], active=0, currentAssignmentsSize=3].{code} > Log enhancements for LeaseUpdater > - > > Key: IGNITE-21247 > URL: https://issues.apache.org/jira/browse/IGNITE-21247 > Project: Ignite > Issue Type: Improvement >Reporter: Denis Chudov >Assignee: Denis Chudov >Priority: Major > Labels: ignite-3 > Time Spent: 50m > Remaining Estimate: 0h > > In > [https://ci.ignite.apache.org/viewLog.html?buildId=7754161=ApacheIgnite3xGradle_Test_RunAllTests] > , test failure of > {{{}org.apache.ignite.internal.raftsnapshot.ItTableRaftSnapshotsTest: > leaderFeedsFollowerWithSnapshot{}}}, we see that there are no log messages on > replica about lease negotiation, which means that it didn't even started on > the placement driver active actor's side. But the active actor has started > before. Log doesn't provide any information about the detail what happened on > LeaseUpdater. > The suggestion is to add logging to know whether some exception happened in > {{{}updateLeaseBatchInternal{}}}, or in {{{}LeaseNegotiator#negotiate{}}}, > and logging of lease updating statistics (how many groups without > leaseholders were detected, how many negotiations are in progress, how many > leases are prolonged). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (IGNITE-21247) Log enhancements for LeaseUpdater
[ https://issues.apache.org/jira/browse/IGNITE-21247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17815167#comment-17815167 ] Denis Chudov commented on IGNITE-21247: --- Under this ticket, printing of lease statistics was added: {code:java} [2024-02-07T10:47:31,387][INFO ][%iinrt_dosor_0%lease-updater-1][LeaseUpdater] Leases updated (printed once per 10 iteration(s)): [inCurrentIteration=LeaseStats [leasesCreated=3, leasesPublished=0, leasesProlonged=0, leasesWithoutCandidates=0], active=0, currentAssignmentsSize=3].{code} > Log enhancements for LeaseUpdater > - > > Key: IGNITE-21247 > URL: https://issues.apache.org/jira/browse/IGNITE-21247 > Project: Ignite > Issue Type: Improvement >Reporter: Denis Chudov >Assignee: Denis Chudov >Priority: Major > Labels: ignite-3 > Time Spent: 50m > Remaining Estimate: 0h > > In > [https://ci.ignite.apache.org/viewLog.html?buildId=7754161=ApacheIgnite3xGradle_Test_RunAllTests] > , in the test failure of > {{{}org.apache.ignite.internal.raftsnapshot.ItTableRaftSnapshotsTest: > leaderFeedsFollowerWithSnapshot{}}}, we see that there are no log messages on > the replica about lease negotiation, which means that it didn't even start on > the placement driver active actor's side, even though the active actor had > started earlier. The log doesn't provide any details about what happened in > LeaseUpdater. > The suggestion is to add logging to know whether some exception happened in > {{{}updateLeaseBatchInternal{}}} or in {{{}LeaseNegotiator#negotiate{}}}, > and logging of lease update statistics (how many groups without > leaseholders were detected, how many negotiations are in progress, how many > leases were prolonged). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (IGNITE-21473) Some transactional tests are not necessary for three node tests
Denis Chudov created IGNITE-21473: - Summary: Some transactional tests are not necessary for three node tests Key: IGNITE-21473 URL: https://issues.apache.org/jira/browse/IGNITE-21473 Project: Ignite Issue Type: Bug Reporter: Denis Chudov The following tests: {code:java} TxAbstractTest#testTransactionMultiThreadedCommit TxAbstractTest#testTransactionMultiThreadedCommitEmpty TxAbstractTest#testTransactionMultiThreadedRollback TxAbstractTest#testTransactionMultiThreadedRollbackEmpty TxAbstractTest#testTransactionMultiThreadedMixed TxAbstractTest#testTransactionMultiThreadedMixedEmpty {code} take significant time on TC but are not actually necessary for all implementations of TxAbstractTest. They can be moved to another subclass. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (IGNITE-21473) Some transactional tests are not necessary for three node tests
[ https://issues.apache.org/jira/browse/IGNITE-21473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denis Chudov reassigned IGNITE-21473: - Ignite Flags: (was: Docs Required,Release Notes Required) Assignee: Denis Chudov Labels: ignite-3 (was: ) > Some transactional tests are not necessary for three node tests > --- > > Key: IGNITE-21473 > URL: https://issues.apache.org/jira/browse/IGNITE-21473 > Project: Ignite > Issue Type: Bug >Reporter: Denis Chudov >Assignee: Denis Chudov >Priority: Major > Labels: ignite-3 > > The following tests: > > {code:java} > TxAbstractTest#testTransactionMultiThreadedCommit > TxAbstractTest#testTransactionMultiThreadedCommitEmpty > TxAbstractTest#testTransactionMultiThreadedRollback > TxAbstractTest#testTransactionMultiThreadedRollbackEmpty > TxAbstractTest#testTransactionMultiThreadedMixed > TxAbstractTest#testTransactionMultiThreadedMixedEmpty > {code} > take significant time on TC but are not actually necessary for all > implementations of TxAbstractTest. They can be moved to another subclass. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (IGNITE-21247) Log enhancements for LeaseUpdater
[ https://issues.apache.org/jira/browse/IGNITE-21247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denis Chudov updated IGNITE-21247: -- Reviewer: Vladislav Pyatkov > Log enhancements for LeaseUpdater > - > > Key: IGNITE-21247 > URL: https://issues.apache.org/jira/browse/IGNITE-21247 > Project: Ignite > Issue Type: Improvement >Reporter: Denis Chudov >Assignee: Denis Chudov >Priority: Major > Labels: ignite-3 > Time Spent: 10m > Remaining Estimate: 0h > > In > [https://ci.ignite.apache.org/viewLog.html?buildId=7754161=ApacheIgnite3xGradle_Test_RunAllTests] > , test failure of > {{{}org.apache.ignite.internal.raftsnapshot.ItTableRaftSnapshotsTest: > leaderFeedsFollowerWithSnapshot{}}}, we see that there are no log messages on > replica about lease negotiation, which means that it didn't even started on > the placement driver active actor's side. But the active actor has started > before. Log doesn't provide any information about the detail what happened on > LeaseUpdater. > The suggestion is to add logging to know whether some exception happened in > {{{}updateLeaseBatchInternal{}}}, or in {{{}LeaseNegotiator#negotiate{}}}, > and logging of lease updating statistics (how many groups without > leaseholders were detected, how many negotiations are in progress, how many > leases are prolonged). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (IGNITE-15226) Print original exception when SSLException occurs
[ https://issues.apache.org/jira/browse/IGNITE-15226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denis Chudov reassigned IGNITE-15226: - Assignee: (was: Denis Chudov) > Print original exception when SSLException occurs > - > > Key: IGNITE-15226 > URL: https://issues.apache.org/jira/browse/IGNITE-15226 > Project: Ignite > Issue Type: Improvement >Reporter: Denis Chudov >Priority: Major > > We have to print original message when SSLException occurs. > {noformat} > 2021-02-23 03:23:35.579 [2021-02-23 03:23:35,579][WARN > ][grid-nio-worker-client-listener-0-#150][ClientListenerProcessor] Closing > NIO session because of unhandled exception [cls=class > o.a.i.i.util.nio.GridNioException, msg=Failed to decode SSL data: > GridSelectorNioSessionImpl [worker=GridWorker > [name=grid-nio-worker-client-listener-0, igniteInstanceName=null, > finished=false, heartbeatTs=1614039815570, hashCode=1938562251, > interrupted=false, > runner=grid-nio-worker-client-listener-0-#150]AbstractNioClientWorker [idx=0, > bytesRcvd=0, bytesSent=0, bytesRcvd0=0, bytesSent0=0, select=true, > super=]ByteBufferNioClientWorker [readBuf=java.nio.HeapByteBuffer[pos=517 > lim=517 cap=8192], super=], writeBuf=null, readBuf=null, inRecovery=null, > outRecovery=null, super=GridNioSessionImpl [locAddr=IP, rmtAddr=IP, > createTime=1614039815116, closeTime=0, bytesSent=7268, bytesRcvd=7785, > bytesSent0=7268, bytesRcvd0=7785, sndSchedTime=1614039815560, > lastSndTime=1614039815570, lastRcvTime=1614039815570, readsPaused=false, > filterChain=GridNioCodecFilter [parser=ClientListenerBufferedParser, > directMode=false]FilterChain[filters=[GridNioAsyncNotifyFilter, , SSL > filter], accepted=true, markedForClose=false]]]{noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (IGNITE-21411) Prohibit operations on the finished RO transactions
[ https://issues.apache.org/jira/browse/IGNITE-21411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denis Chudov updated IGNITE-21411: -- Description: *Motivation* For now we don't have any mechanism in the RO transactions implementation to prohibit further gets/scans after finishing the transaction. At the same time, the ReadOnlyTransactionImpl#finish method updates the observable timestamp tracker, which is necessary for implicit RO transactions, and completes the RO tx future, which unblocks the low watermark in order to move it forward. Cursors that were opened within a transaction should also be closed, but this is out of scope of this ticket, see IGNITE-21291. *Definition of done* Any operations on finished RO transactions are prohibited, as is done for RW transactions. *Implementation notes* An RW lock is used to check and prohibit enlists to RW transactions; something simpler can be used for RO transactions. There is a ReadOnlyTransactionImpl#finishGuard - it may be enough for the purpose of locking the transaction against new operations. was: *Motivation* For now we don't have any mechanism in RO transactions implementation to prohibit further gets/scans after finishing this transaction. In the same time, the ReadOnlyTransactionImpl#finish method updates the observable timestamp tracker which is necessary for implicit RO transactions, and completes the RO tx future which unblocks the low watermark in order to move it forward. Cursors that were opened within a transaction should be also closed on this transaction finish. *Definition of done* Any operations on the finished RO transactions are prohibited like it is done on RW transactions. *Implementation notes* RW-lock is used to check and prohibit enlists to RW transactions, something more simple can be used for RO transactions. There is a ReadOnlyTransactionImpl#finishGuard - it may appear to be enough for the purpose of locking the transaction for new operations. 
> Prohibit operations on the finished RO transactions > --- > > Key: IGNITE-21411 > URL: https://issues.apache.org/jira/browse/IGNITE-21411 > Project: Ignite > Issue Type: Improvement >Reporter: Denis Chudov >Priority: Major > Labels: ignite-3 > > *Motivation* > For now we don't have any mechanism in the RO transactions implementation to > prohibit further gets/scans after finishing the transaction. At the same > time, the ReadOnlyTransactionImpl#finish method updates the observable > timestamp tracker, which is necessary for implicit RO transactions, and > completes the RO tx future, which unblocks the low watermark in order to move > it forward. > Cursors that were opened within a transaction should also be closed, but this > is out of scope of this ticket, see IGNITE-21291. > *Definition of done* > Any operations on finished RO transactions are prohibited, as is done > for RW transactions. > *Implementation notes* > An RW lock is used to check and prohibit enlists to RW transactions; something > simpler can be used for RO transactions. There is a > ReadOnlyTransactionImpl#finishGuard - it may be enough for the > purpose of locking the transaction against new operations. -- This message was sent by Atlassian Jira (v8.20.10#820010)
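As an illustration of the guard idea, a single atomic flag is enough to make finish win exactly once and to reject later operations. The names below (RoTxGuard, tryFinish, checkNotFinished) are hypothetical; the real ReadOnlyTransactionImpl#finishGuard may differ.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of guarding an RO transaction against operations
// after finish(), in the spirit of ReadOnlyTransactionImpl#finishGuard.
class RoTxGuard {
    private final AtomicBoolean finished = new AtomicBoolean();

    // Called by finish(); only the first caller wins, so the tracker update
    // and tx-future completion run exactly once.
    boolean tryFinish() {
        return finished.compareAndSet(false, true);
    }

    // Called at the start of every get/scan on the transaction.
    void checkNotFinished() {
        if (finished.get()) {
            throw new IllegalStateException("Transaction is already finished");
        }
    }
}
```

Unlike the RW case, no read-write lock is needed here, since RO operations don't enlist partitions; a flag checked at operation start is sufficient for rejection.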
[jira] [Updated] (IGNITE-21411) Prohibit operations on the finished RO transactions
[ https://issues.apache.org/jira/browse/IGNITE-21411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denis Chudov updated IGNITE-21411: -- Description: *Motivation* For now we don't have any mechanism in RO transactions implementation to prohibit further gets/scans after finishing this transaction. In the same time, the ReadOnlyTransactionImpl#finish method updates the observable timestamp tracker which is necessary for implicit RO transactions, and completes the RO tx future which unblocks the low watermark in order to move it forward. Cursors that were opened within a transaction should be also closed on this transaction finish. *Definition of done* Any operations on the finished RO transactions are prohibited like it is done on RW transactions. *Implementation notes* RW-lock is used to check and prohibit enlists to RW transactions, something more simple can be used for RO transactions. There is a ReadOnlyTransactionImpl#finishGuard - it may appear to be enough for the purpose of locking the transaction for new operations. was: *Motivation* For now we don't have any mechanism in RO transactions implementation to prohibit further gets/scans after finishing this transaction. In the same time, the ReadOnlyTransactionImpl#finish method updates the observable timestamp tracker which is necessary for implicit RO transactions, and completes the RO tx future which unblocks the low watermark in order to move it forward. *Definition of done* Any operations on the finished RO transactions are prohibited like it is done on RW transactions. *Implementation notes* RW-lock is used to check and prohibit enlists to RW transactions, something more simple can be used for RO transactions. There is a ReadOnlyTransactionImpl#finishGuard - it may appear to be enough for this purpose. 
> Prohibit operations on the finished RO transactions > --- > > Key: IGNITE-21411 > URL: https://issues.apache.org/jira/browse/IGNITE-21411 > Project: Ignite > Issue Type: Improvement >Reporter: Denis Chudov >Priority: Major > Labels: ignite-3 > > *Motivation* > For now we don't have any mechanism in RO transactions implementation to > prohibit further gets/scans after finishing this transaction. In the same > time, the ReadOnlyTransactionImpl#finish method updates the observable > timestamp tracker which is necessary for implicit RO transactions, and > completes the RO tx future which unblocks the low watermark in order to move > it forward. > Cursors that were opened within a transaction should be also closed on this > transaction finish. > *Definition of done* > Any operations on the finished RO transactions are prohibited like it is done > on RW transactions. > *Implementation notes* > RW-lock is used to check and prohibit enlists to RW transactions, something > more simple can be used for RO transactions. There is a > ReadOnlyTransactionImpl#finishGuard - it may appear to be enough for the > purpose of locking the transaction for new operations. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (IGNITE-21411) Prohibit operations on the finished RO transactions
[ https://issues.apache.org/jira/browse/IGNITE-21411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denis Chudov updated IGNITE-21411: -- Description: *Motivation* For now we don't have any mechanism in RO transactions implementation to prohibit further gets/scans after finishing this transaction. In the same time, the ReadOnlyTransactionImpl#finish method updates the observable timestamp tracker which is necessary for implicit RO transactions, and completes the RO tx future which unblocks the low watermark in order to move it forward. *Definition of done* Any operations on the finished RO transactions are prohibited like it is done on RW transactions. *Implementation notes* RW-lock is used to check and prohibit enlists to RW transactions, something more simple can be used for RO transactions. There is a ReadOnlyTransactionImpl#finishGuard - it may appear to be enough for this purpose. was: Motivation For now we don't have any mechanism in RO transactions implementation to prohibit further gets/scans after finishing this transaction. In the same time, the ReadOnlyTransactionImpl#finish method updates the observable timestamp tracker which is necessary for implicit RO transactions, and completes the RO tx future which unblocks the low watermark in order to move it forward. *Definition of done* Any operations on the finished RO transactions are prohibited like it is done on RW transactions. > Prohibit operations on the finished RO transactions > --- > > Key: IGNITE-21411 > URL: https://issues.apache.org/jira/browse/IGNITE-21411 > Project: Ignite > Issue Type: Improvement >Reporter: Denis Chudov >Priority: Major > Labels: ignite-3 > > *Motivation* > For now we don't have any mechanism in RO transactions implementation to > prohibit further gets/scans after finishing this transaction. 
In the same > time, the ReadOnlyTransactionImpl#finish method updates the observable > timestamp tracker which is necessary for implicit RO transactions, and > completes the RO tx future which unblocks the low watermark in order to move > it forward. > *Definition of done* > Any operations on the finished RO transactions are prohibited like it is done > on RW transactions. > *Implementation notes* > RW-lock is used to check and prohibit enlists to RW transactions, something > more simple can be used for RO transactions. There is a > ReadOnlyTransactionImpl#finishGuard - it may appear to be enough for this > purpose. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (IGNITE-21411) Prohibit operations on the finished RO transactions
[ https://issues.apache.org/jira/browse/IGNITE-21411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denis Chudov updated IGNITE-21411: -- Description: Motivation For now we don't have any mechanism in RO transactions implementation to prohibit further gets/scans after finishing this transaction. In the same time, the ReadOnlyTransactionImpl#finish method updates the observable timestamp tracker which is necessary for implicit RO transactions, and completes the RO tx future which unblocks the low watermark in order to move it forward. *Definition of done* Any operations on the finished RO transactions are prohibited like it is done on RW transactions. was: Motivation For now we don't have any mechanism in RO transactions implementation to prohibit further gets/scans after finishing this transaction. In the same time, the ReadOnlyTransactionImpl#finish method updates the observable timestamp tracker which is necessary for implicit RO transactions, and completes the RO tx future which unblocks the low watermark in order to move it forward. Any operations > Prohibit operations on the finished RO transactions > --- > > Key: IGNITE-21411 > URL: https://issues.apache.org/jira/browse/IGNITE-21411 > Project: Ignite > Issue Type: Improvement >Reporter: Denis Chudov >Priority: Major > Labels: ignite-3 > > Motivation > For now we don't have any mechanism in RO transactions implementation to > prohibit further gets/scans after finishing this transaction. In the same > time, the ReadOnlyTransactionImpl#finish method updates the observable > timestamp tracker which is necessary for implicit RO transactions, and > completes the RO tx future which unblocks the low watermark in order to move > it forward. > *Definition of done* > Any operations on the finished RO transactions are prohibited like it is done > on RW transactions. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (IGNITE-21411) Prohibit operations on the finished RO transactions
[ https://issues.apache.org/jira/browse/IGNITE-21411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denis Chudov updated IGNITE-21411: -- Summary: Prohibit operations on the finished RO transactions (was: Prohibit enlists to the finished RO transactions) > Prohibit operations on the finished RO transactions > --- > > Key: IGNITE-21411 > URL: https://issues.apache.org/jira/browse/IGNITE-21411 > Project: Ignite > Issue Type: Improvement >Reporter: Denis Chudov >Priority: Major > Labels: ignite-3 > > Motivation > For now we don't have any mechanism in RO transactions implementation to > prohibit further gets/scans after finishing this transaction. In the same > time, the ReadOnlyTransactionImpl#finish method updates the observable > timestamp tracker which is necessary for implicit RO transactions, and > completes the RO tx future which unblocks the low watermark in order to move > it forward. > Any operations -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (IGNITE-21411) Prohibit enlists to the finished RO transactions
[ https://issues.apache.org/jira/browse/IGNITE-21411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denis Chudov updated IGNITE-21411: -- Description: Motivation For now we don't have any mechanism in RO transactions implementation to prohibit further gets/scans after finishing this transaction. In the same time, the ReadOnlyTransactionImpl#finish method updates the observable timestamp tracker which is necessary for implicit RO transactions, and completes the RO tx future which unblocks the low watermark in order to move it forward. Any operations was: Motivation For now we don't have any mechanism in RO transactions implementation to prohibit further gets/scans after finishing this transaction. In the same time, > Prohibit enlists to the finished RO transactions > > > Key: IGNITE-21411 > URL: https://issues.apache.org/jira/browse/IGNITE-21411 > Project: Ignite > Issue Type: Improvement >Reporter: Denis Chudov >Priority: Major > Labels: ignite-3 > > Motivation > For now we don't have any mechanism in RO transactions implementation to > prohibit further gets/scans after finishing this transaction. In the same > time, the ReadOnlyTransactionImpl#finish method updates the observable > timestamp tracker which is necessary for implicit RO transactions, and > completes the RO tx future which unblocks the low watermark in order to move > it forward. > Any operations -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (IGNITE-21411) Prohibit enlists to the finished RO transactions
[ https://issues.apache.org/jira/browse/IGNITE-21411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denis Chudov updated IGNITE-21411: -- Description: Motivation For now we don't have any mechanism in RO transactions implementation to prohibit further gets/scans after finishing this transaction. In the same time, was:TBD > Prohibit enlists to the finished RO transactions > > > Key: IGNITE-21411 > URL: https://issues.apache.org/jira/browse/IGNITE-21411 > Project: Ignite > Issue Type: Improvement >Reporter: Denis Chudov >Priority: Major > Labels: ignite-3 > > Motivation > For now we don't have any mechanism in RO transactions implementation to > prohibit further gets/scans after finishing this transaction. In the same > time, -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (IGNITE-21415) Remote nodes are not added to NodeManager
[ https://issues.apache.org/jira/browse/IGNITE-21415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denis Chudov updated IGNITE-21415: -- Description: org.apache.ignite.raft.jraft.NodeManager has internal data structures (mappings of groups to lists of nodes, etc.) that are used for different purposes, including processing requests. But remote nodes are not added to these mappings. See the usages of the NodeManager#add method: it is called only from RaftGroupService#start to add the local node. > Remote nodes are not added to NodeManager > - > > Key: IGNITE-21415 > URL: https://issues.apache.org/jira/browse/IGNITE-21415 > Project: Ignite > Issue Type: Bug >Reporter: Denis Chudov >Priority: Major > Labels: ignite-3 > > org.apache.ignite.raft.jraft.NodeManager has internal data structures > (mappings of groups to lists of nodes, etc.) that are used for different > purposes, including processing requests. But remote nodes are not > added to these mappings. See the usages of the NodeManager#add method: it is > called only from RaftGroupService#start to add the local node. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (IGNITE-21415) Remote nodes are not added to NodeManager
Denis Chudov created IGNITE-21415: - Summary: Remote nodes are not added to NodeManager Key: IGNITE-21415 URL: https://issues.apache.org/jira/browse/IGNITE-21415 Project: Ignite Issue Type: Bug Reporter: Denis Chudov -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (IGNITE-21411) Prohibit enlists to the finished RO transactions
Denis Chudov created IGNITE-21411: - Summary: Prohibit enlists to the finished RO transactions Key: IGNITE-21411 URL: https://issues.apache.org/jira/browse/IGNITE-21411 Project: Ignite Issue Type: Improvement Reporter: Denis Chudov TBD -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (IGNITE-21394) TimeoutException in the listener of pending assignments change shouldn't fail the watch processor
[ https://issues.apache.org/jira/browse/IGNITE-21394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denis Chudov updated IGNITE-21394: -- Description: *Motivation* The handler of the pending assignments change event ( TableManager#handleChangePendingAssignmentEvent() ) tries to do changePeerAsync after starting the partition and client. In order to know whether calling changePeerAsync is needed, it tries to get the current leader of the corresponding raft group. This call of RaftGroupService#refreshAndGetLeaderWithTerm can fail with TimeoutException. For example, there is no known leader on the node that the GetLeader request is sent to, or that node is no longer in the raft group, etc., while at the same time that node is the only known peer of the raft group: in these cases the GetLeader request will be constantly retried in the hope of eventually getting a response with a leader once one is elected, but this may never happen. So, the TimeoutException is expected in this case. This exception should be handled within the mentioned listener of the pending assignments change event; otherwise it fails the watch processor, making it unable to handle further meta storage updates (and making the node inoperable). A timeout means that, most likely, the current node is not the leader of the raft group and changePeers shouldn't be done, or it has not caught up with the current assignment events; in that case some requests to this node for this partition will fail, but the node will remain operable. 
*Definition of done* TimeoutException in the listener of pending assignments change doesn't fail the watch processor and doesn't lead to multiple exceptions like this: {code:java} [2024-01-29T22:00:58,658][ERROR][%isckvt_tmccd_3344%Raft-Group-Client-5][WatchProcessor] Error occurred when notifying safe time advanced callback java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331) ~[?:?] at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:346) ~[?:?] at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:632) ~[?:?] at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[?:?] at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088) ~[?:?] at org.apache.ignite.internal.raft.RaftGroupServiceImpl.sendWithRetry(RaftGroupServiceImpl.java:546) ~[ignite-raft-3.0.0-SNAPSHOT.jar:?] at org.apache.ignite.internal.raft.RaftGroupServiceImpl.lambda$handleErrorResponse$42(RaftGroupServiceImpl.java:635) ~[ignite-raft-3.0.0-SNAPSHOT.jar:?] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?] at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?] at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?] at java.lang.Thread.run(Thread.java:834) [?:?] Caused by: java.util.concurrent.TimeoutException ... 8 more{code} *Implementation notes* It can be reproduced in integration tests like ItSchemaChangeKvViewTest#testMergeChangesColumnDefault when there are 3 nodes starting, then a table with 25 partitions/1 replica created. 
During the table start, rebalance is possible, for example: * a replication group is moved from node A to node B * some node C tries to perform GetLeader, and has only node A in its local peers * node A thinks it is the only member of the replication group, and is not the leader, so it sends an "Unknown leader" response to C * node C constantly retries the request to node A. was: *Motivation* The handler of pending assignments event change ( TableManager#handleChangePendingAssignmentEvent() ) tries to do changePeerAsync after starting the partition and client. In order to know whether the calling of changePeerAsync is needed, it tries to get the current leader of corresponding raft group. This call of RaftGroupService#refreshAndGetLeaderWithTerm can fail with TimeoutException. For example, there is no known leader on the node that the GetLeader request is sent to, or that node is no more in the raft group, etc., and in the same time that node is the only known peer of the raft group: in these cases the GetLeader request will be constantly retried in hope to get a response with leader finally, when it's elected, but this can never happen. So, the TimeoutException is expected in this case. This exception should be handled within the mentioned listener of pending assignments event change. otherwise it fails the watch processor, making it unable to handle the
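A rough sketch of the proposed handling: the listener treats TimeoutException from the leader lookup as an expected outcome instead of letting it propagate and fail the watch processor. All names here are illustrative; the actual TableManager/RaftGroupService code differs.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeoutException;

// Illustrative sketch: a pending-assignments listener that swallows the
// expected TimeoutException from a leader lookup (skipping changePeers)
// rather than failing the metastorage watch processor.
class PendingAssignmentsHandler {
    CompletableFuture<Void> handle(CompletableFuture<String> leaderFuture) {
        return leaderFuture
            .thenAccept(leader -> {
                // Leader is known: decide here whether changePeers is needed.
            })
            .exceptionally(e -> {
                Throwable cause = e instanceof CompletionException ? e.getCause() : e;
                if (cause instanceof TimeoutException) {
                    // Expected: no leader could be resolved. Most likely this
                    // node is not the leader; skip changePeers, stay operable.
                    return null;
                }
                throw new CompletionException(cause); // unexpected: rethrow
            });
    }
}
```

The key point is that only the expected TimeoutException is swallowed; any other failure still propagates so genuine bugs are not hidden.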
[jira] [Updated] (IGNITE-21394) TimeoutException in the listener of pending assignments change shouldn't fail the watch processor
[ https://issues.apache.org/jira/browse/IGNITE-21394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denis Chudov updated IGNITE-21394: -- Description: *Motivation* The handler of the pending assignments change event ( TableManager#handleChangePendingAssignmentEvent() ) tries to do changePeerAsync after starting the partition and client. In order to know whether calling changePeerAsync is needed, it tries to get the current leader of the corresponding raft group. This call of RaftGroupService#refreshAndGetLeaderWithTerm can fail with TimeoutException. For example, there is no known leader on the node that the GetLeader request is sent to, or that node is no longer in the raft group, etc., while at the same time that node is the only known peer of the raft group: in these cases the GetLeader request will be constantly retried in the hope of eventually getting a response with a leader once one is elected, but this may never happen. So, the TimeoutException is expected in this case. This exception should be handled within the mentioned listener of the pending assignments change event; otherwise it fails the watch processor, making it unable to handle further meta storage updates (and making the node inoperable). A timeout means that, most likely, the current node is not the leader of the raft group and changePeers shouldn't be done, or it has not caught up with the current assignment events; in that case some client requests to this node for this partition will fail, but the node will remain operable. 
*Definition of done* TimeoutException in the listener of pending assignments change doesn't fail the watch processor and doesn't lead to multiple exceptions like this: {code:java} [2024-01-29T22:00:58,658][ERROR][%isckvt_tmccd_3344%Raft-Group-Client-5][WatchProcessor] Error occurred when notifying safe time advanced callback java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331) ~[?:?] at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:346) ~[?:?] at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:632) ~[?:?] at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[?:?] at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088) ~[?:?] at org.apache.ignite.internal.raft.RaftGroupServiceImpl.sendWithRetry(RaftGroupServiceImpl.java:546) ~[ignite-raft-3.0.0-SNAPSHOT.jar:?] at org.apache.ignite.internal.raft.RaftGroupServiceImpl.lambda$handleErrorResponse$42(RaftGroupServiceImpl.java:635) ~[ignite-raft-3.0.0-SNAPSHOT.jar:?] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?] at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?] at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?] at java.lang.Thread.run(Thread.java:834) [?:?] Caused by: java.util.concurrent.TimeoutException ... 8 more{code} *Implementation notes* It can be reproduced in integration tests such as ItSchemaChangeKvViewTest#testMergeChangesColumnDefault, when 3 nodes are started and then a table with 25 partitions / 1 replica is created.
During the table start, a rebalance is possible, for example:
* a replication group is moved from node A to node B;
* some node C tries to perform GetLeader and has only node A in its local peers;
* node A thinks it is the only member of the replication group, is not the leader, and sends an "Unknown leader" response to C;
* node C constantly retries the request to node A.
[jira] [Created] (IGNITE-21394) TimeoutException in the listener of pending assignments change shouldn't fail the watch processor
Denis Chudov created IGNITE-21394: - Summary: TimeoutException in the listener of pending assignments change shouldn't fail the watch processor Key: IGNITE-21394 URL: https://issues.apache.org/jira/browse/IGNITE-21394 Project: Ignite Issue Type: Bug Reporter: Denis Chudov *Motivation* The handler of the pending assignments change event ( TableManager#handleChangePendingAssignmentEvent() ) tries to do changePeerAsync after starting the partition and the client. To know whether calling changePeerAsync is needed, it tries to get the current leader of the corresponding raft group. This call of RaftGroupService#refreshAndGetLeaderWithTerm can fail with TimeoutException: for example, when the node that the GetLeader request is sent to has no known leader, or is no longer in the raft group, and is at the same time the only known peer of the raft group. In these cases the GetLeader request is retried constantly in the hope of eventually getting a response with a leader once one is elected, but that may never happen, so the TimeoutException is expected here. This exception should be handled within the mentioned listener of the pending assignments change event; otherwise it fails the watch processor, making it unable to handle further meta storage updates (and making the node inoperable). A timeout means that, most likely, the current node is not the leader of the raft group and changePeers shouldn't be done, or that it has not caught up with the current assignments events; some client requests to this node for this partition will fail, but the node will remain operable.
*Definition of done* TimeoutException in the listener of pending assignments change doesn't fail the watch processor and doesn't lead to multiple exceptions like this: {code:java} [2024-01-29T22:00:58,658][ERROR][%isckvt_tmccd_3344%Raft-Group-Client-5][WatchProcessor] Error occurred when notifying safe time advanced callback java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331) ~[?:?] at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:346) ~[?:?] at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:632) ~[?:?] at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[?:?] at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088) ~[?:?] at org.apache.ignite.internal.raft.RaftGroupServiceImpl.sendWithRetry(RaftGroupServiceImpl.java:546) ~[ignite-raft-3.0.0-SNAPSHOT.jar:?] at org.apache.ignite.internal.raft.RaftGroupServiceImpl.lambda$handleErrorResponse$42(RaftGroupServiceImpl.java:635) ~[ignite-raft-3.0.0-SNAPSHOT.jar:?] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?] at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?] at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?] at java.lang.Thread.run(Thread.java:834) [?:?] Caused by: java.util.concurrent.TimeoutException ... 8 more {code} *Implementation notes* It can be reproduced in integration tests such as ItSchemaChangeKvViewTest#testMergeChangesColumnDefault, when 3 nodes are started and then a table with 25 partitions / 1 replica is created.
During the table start, a rebalance is possible, for example:
* a replication group is moved from node A to node B;
* some node C tries to perform GetLeader and has only node A in its local peers;
* node A thinks it is the only member of the replication group, is not the leader, and sends an "Unknown leader" response to C;
* node C constantly retries the request to node A.
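The retry behavior in the scenario above can be sketched as a bounded retry loop (a hypothetical sketch; the names below are not the real Ignite API, and null models an "Unknown leader" response). Without a deadline, a peer that always answers "Unknown leader" keeps the caller retrying forever, which is exactly why the resulting TimeoutException has to be treated as an expected outcome:

```java
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Hypothetical sketch of a GetLeader retry loop bounded by a deadline.
public class BoundedRetrySketch {

    /** Retries a request that may keep answering "unknown leader" (null) until the deadline. */
    static String getLeaderWithDeadline(Supplier<String> request, long deadlineNanos)
            throws TimeoutException {
        while (System.nanoTime() < deadlineNanos) {
            String leader = request.get(); // null models an "Unknown leader" response
            if (leader != null) {
                return leader;
            }
        }
        // Without this bound, the retry loop in the scenario above never terminates.
        throw new TimeoutException("No leader elected before deadline");
    }
}
```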
[jira] [Updated] (IGNITE-21394) TimeoutException in the listener of pending assignments change shouldn't fail the watch processor
[ https://issues.apache.org/jira/browse/IGNITE-21394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denis Chudov updated IGNITE-21394: -- Description: *Motivation* The handler of the pending assignments change event ( TableManager#handleChangePendingAssignmentEvent() ) tries to do changePeerAsync after starting the partition and the client. To know whether calling changePeerAsync is needed, it tries to get the current leader of the corresponding raft group. This call of RaftGroupService#refreshAndGetLeaderWithTerm can fail with TimeoutException: for example, when the node that the GetLeader request is sent to has no known leader, or is no longer in the raft group, and is at the same time the only known peer of the raft group. In these cases the GetLeader request is retried constantly in the hope of eventually getting a response with a leader once one is elected, but that may never happen, so the TimeoutException is expected here. This exception should be handled within the mentioned listener of the pending assignments change event; otherwise it fails the watch processor, making it unable to handle further meta storage updates (and making the node inoperable). A timeout means that, most likely, the current node is not the leader of the raft group and changePeers shouldn't be done, or that it has not caught up with the current assignments events; some client requests to this node for this partition will fail, but the node will remain operable.
*Definition of done* TimeoutException in the listener of pending assignments change doesn't fail the watch processor and doesn't lead to multiple exceptions like this: {code:java} [2024-01-29T22:00:58,658][ERROR][%isckvt_tmccd_3344%Raft-Group-Client-5][WatchProcessor] Error occurred when notifying safe time advanced callback java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331) ~[?:?] at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:346) ~[?:?] at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:632) ~[?:?] at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[?:?] at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088) ~[?:?] at org.apache.ignite.internal.raft.RaftGroupServiceImpl.sendWithRetry(RaftGroupServiceImpl.java:546) ~[ignite-raft-3.0.0-SNAPSHOT.jar:?] at org.apache.ignite.internal.raft.RaftGroupServiceImpl.lambda$handleErrorResponse$42(RaftGroupServiceImpl.java:635) ~[ignite-raft-3.0.0-SNAPSHOT.jar:?] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?] at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?] at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?] at java.lang.Thread.run(Thread.java:834) [?:?] Caused by: java.util.concurrent.TimeoutException ... 8 more{code} *Implementation notes* It can be reproduced in integration tests such as ItSchemaChangeKvViewTest#testMergeChangesColumnDefault, when 3 nodes are started and then a table with 25 partitions / 1 replica is created.
During the table start, a rebalance is possible, for example:
* a replication group is moved from node A to node B;
* some node C tries to perform GetLeader and has only node A in its local peers;
* node A thinks it is the only member of the replication group, is not the leader, and sends an "Unknown leader" response to C;
* node C constantly retries the request to node A.
[jira] [Assigned] (IGNITE-21181) Failure to resolve a primary replica after stopping a node
[ https://issues.apache.org/jira/browse/IGNITE-21181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denis Chudov reassigned IGNITE-21181: - Assignee: Denis Chudov > Failure to resolve a primary replica after stopping a node > -- > > Key: IGNITE-21181 > URL: https://issues.apache.org/jira/browse/IGNITE-21181 > Project: Ignite > Issue Type: Bug >Reporter: Roman Puchkovskiy >Assignee: Denis Chudov >Priority: Major > Labels: ignite-3 > Fix For: 3.0.0-beta2 > > > The scenario is that the cluster consists of 3 nodes (0, 1, 2). Primary > replica of the sole partition is on node 0. Then node 0 is stopped and an > attempt to do a put via node 2 is done. The partition still has majority, but > the put results in the following: > > {code:java} > org.apache.ignite.tx.TransactionException: IGN-REP-5 > TraceId:55c59c96-17d1-4efc-8e3c-cca81b8b41ad Failed to resolve the primary > replica node [consistentId=itrst_ncisasiti_0] > > at > org.apache.ignite.internal.table.distributed.storage.InternalTableImpl.lambda$enlist$69(InternalTableImpl.java:1749) > at > java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930) > at > java.base/java.util.concurrent.CompletableFuture.uniHandleStage(CompletableFuture.java:946) > at > java.base/java.util.concurrent.CompletableFuture.handle(CompletableFuture.java:2266) > at > org.apache.ignite.internal.table.distributed.storage.InternalTableImpl.enlist(InternalTableImpl.java:1739) > at > org.apache.ignite.internal.table.distributed.storage.InternalTableImpl.enlistWithRetry(InternalTableImpl.java:480) > at > org.apache.ignite.internal.table.distributed.storage.InternalTableImpl.enlistInTx(InternalTableImpl.java:301) > at > org.apache.ignite.internal.table.distributed.storage.InternalTableImpl.upsert(InternalTableImpl.java:965) > at > org.apache.ignite.internal.table.KeyValueViewImpl.lambda$putAsync$10(KeyValueViewImpl.java:196) > at > 
org.apache.ignite.internal.table.AbstractTableView.lambda$withSchemaSync$1(AbstractTableView.java:111) > at > java.base/java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1106) > at > java.base/java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2235) > at > org.apache.ignite.internal.table.AbstractTableView.withSchemaSync(AbstractTableView.java:111) > at > org.apache.ignite.internal.table.AbstractTableView.withSchemaSync(AbstractTableView.java:102) > at > org.apache.ignite.internal.table.KeyValueViewImpl.putAsync(KeyValueViewImpl.java:193) > at > org.apache.ignite.internal.table.KeyValueViewImpl.put(KeyValueViewImpl.java:185) > at > org.apache.ignite.internal.raftsnapshot.ItTableRaftSnapshotsTest.putToNode(ItTableRaftSnapshotsTest.java:257) > at > org.apache.ignite.internal.raftsnapshot.ItTableRaftSnapshotsTest.putToNode(ItTableRaftSnapshotsTest.java:253) > at > org.apache.ignite.internal.raftsnapshot.ItTableRaftSnapshotsTest.nodeCanInstallSnapshotsAfterSnapshotInstalledToIt(ItTableRaftSnapshotsTest.java:473){code} > > This can be reproduced using > ItTableRaftSnapshotsTest#nodeCanInstallSnapshotsAfterSnapshotInstalledToIt(). > The reason is that, according to the test, the leadership of the partition group is > transferred to node 0, which means that this node will most probably be > selected as primary; after that node 0 is stopped, and then the > transaction is started. Node 0 is still the leaseholder in the current time > interval, but it has already left the topology. > We can fix the test to make it await the new primary, which would be present > in the cluster, or add retries on the very first transactional request. > In the latter case, we need to ensure that the request is actually the first > and only one, with no other request sent in any parallel thread; otherwise we > can't retry the request on another primary.
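The test-side fix mentioned above, awaiting the new primary before the first operation, can be sketched as a simple poll loop. This is a hedged sketch using only the JDK: the class, the awaitNewPrimary method, and the resolver supplier are hypothetical stand-ins for the real placement-driver call:

```java
import java.time.Duration;
import java.util.function.Supplier;

// Hypothetical sketch: poll primary resolution until the lease leaves the stopped node.
public class AwaitPrimarySketch {

    /** Polls primary resolution until a node other than the stopped one holds the lease. */
    static String awaitNewPrimary(Supplier<String> resolvePrimary, String stoppedNode, Duration timeout) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (System.nanoTime() < deadline) {
            String primary = resolvePrimary.get();
            if (primary != null && !primary.equals(stoppedNode)) {
                return primary; // a live node now holds the lease; safe to start the transaction
            }
            try {
                Thread.sleep(50); // back off before re-resolving
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return null;
            }
        }
        return null; // the lease did not move to a live node in time
    }
}
```

In the test, the put would be issued only after this returns a non-null node, so the stale leaseholder is never targeted.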