[jira] [Updated] (IGNITE-22094) Add removeAll method to tx state storage

2024-05-15 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-22094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov updated IGNITE-22094:
--
Fix Version/s: 3.0.0-beta2

> Add removeAll method to tx state storage
> 
>
> Key: IGNITE-22094
> URL: https://issues.apache.org/jira/browse/IGNITE-22094
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Denis Chudov
>Assignee: Denis Chudov
>Priority: Major
>  Labels: ignite-3
> Fix For: 3.0.0-beta2
>
>
> *Motivation*
> Tx state vacuum should be able to remove multiple tx states at once, namely
> all states that meet the requirements for removal.
> *Definition of done*
> TxStateStorage#removeAll is added, along with corresponding tests.
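
For illustration, a minimal sketch of what the new method could look like. Only
the name TxStateStorage#removeAll comes from the ticket; the parameter type and
the surrounding interface shape are assumptions, not the actual Ignite 3 API.

{code:java}
import java.util.Collection;
import java.util.UUID;

// Sketch only: the real TxStateStorage has more members, and the actual
// removeAll signature may differ (the Collection<UUID> parameter is assumed).
interface TxStateStorage {
    /** Removes the stored state of every transaction with the given ids. */
    void removeAll(Collection<UUID> txIds);
}
{code}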





[jira] [Updated] (IGNITE-22206) Unmute disabled ItTxResourcesVacuumTest#testRecoveryAfterPersistentStateVacuumized

2024-05-10 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-22206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov updated IGNITE-22206:
--
Fix Version/s: 3.0.0-beta2

> Unmute disabled 
> ItTxResourcesVacuumTest#testRecoveryAfterPersistentStateVacuumized
> --
>
> Key: IGNITE-22206
> URL: https://issues.apache.org/jira/browse/IGNITE-22206
> Project: Ignite
>  Issue Type: Bug
>Reporter: Denis Chudov
>Assignee: Denis Chudov
>Priority: Major
>  Labels: ignite-3
> Fix For: 3.0.0-beta2
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The test was fixed under IGNITE-22147 but was left muted for some reason.





[jira] [Updated] (IGNITE-22206) Unmute disabled ItTxResourcesVacuumTest#testRecoveryAfterPersistentStateVacuumized

2024-05-10 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-22206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov updated IGNITE-22206:
--
Description: The test was fixed under IGNITE-22147 but was left muted for some reason.  
(was: subj)

> Unmute disabled 
> ItTxResourcesVacuumTest#testRecoveryAfterPersistentStateVacuumized
> --
>
> Key: IGNITE-22206
> URL: https://issues.apache.org/jira/browse/IGNITE-22206
> Project: Ignite
>  Issue Type: Bug
>Reporter: Denis Chudov
>Assignee: Denis Chudov
>Priority: Major
>  Labels: ignite-3
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The test was fixed under IGNITE-22147 but was left muted for some reason.





[jira] [Commented] (IGNITE-22206) Unmute disabled ItTxResourcesVacuumTest#testRecoveryAfterPersistentStateVacuumized

2024-05-10 Thread Denis Chudov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-22206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845334#comment-17845334
 ] 

Denis Chudov commented on IGNITE-22206:
---

https://ci.ignite.apache.org/test/7228314194201887527?currentProjectId=ApacheIgnite3xGradle_Test_IntegrationTests=pull%2F3738=true

> Unmute disabled 
> ItTxResourcesVacuumTest#testRecoveryAfterPersistentStateVacuumized
> --
>
> Key: IGNITE-22206
> URL: https://issues.apache.org/jira/browse/IGNITE-22206
> Project: Ignite
>  Issue Type: Bug
>Reporter: Denis Chudov
>Assignee: Denis Chudov
>Priority: Major
>  Labels: ignite-3
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> subj





[jira] [Created] (IGNITE-22206) Unmute disabled ItTxResourcesVacuumTest#testRecoveryAfterPersistentStateVacuumized

2024-05-10 Thread Denis Chudov (Jira)
Denis Chudov created IGNITE-22206:
-

 Summary: Unmute disabled 
ItTxResourcesVacuumTest#testRecoveryAfterPersistentStateVacuumized
 Key: IGNITE-22206
 URL: https://issues.apache.org/jira/browse/IGNITE-22206
 Project: Ignite
  Issue Type: Bug
Reporter: Denis Chudov
Assignee: Denis Chudov


subj





[jira] [Created] (IGNITE-22147) ItTxResourcesVacuumTest.testRecoveryAfterPersistentStateVacuumized is flaky

2024-04-30 Thread Denis Chudov (Jira)
Denis Chudov created IGNITE-22147:
-

 Summary: 
ItTxResourcesVacuumTest.testRecoveryAfterPersistentStateVacuumized is flaky
 Key: IGNITE-22147
 URL: https://issues.apache.org/jira/browse/IGNITE-22147
 Project: Ignite
  Issue Type: Bug
Reporter: Denis Chudov


https://ci.ignite.apache.org/project.html?projectId=ApacheIgnite3xGradle_Test_IntegrationTests=7228314194201887527=testDetails





[jira] [Assigned] (IGNITE-22094) Add removeAll method to tx state storage

2024-04-23 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-22094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov reassigned IGNITE-22094:
-

Assignee: Denis Chudov

> Add removeAll method to tx state storage
> 
>
> Key: IGNITE-22094
> URL: https://issues.apache.org/jira/browse/IGNITE-22094
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Denis Chudov
>Assignee: Denis Chudov
>Priority: Major
>  Labels: ignite-3
>
> *Motivation*
> Tx state vacuum should be able to remove multiple tx states at once, namely
> all states that meet the requirements for removal.
> *Definition of done*
> TxStateStorage#removeAll is added, along with corresponding tests.





[jira] [Created] (IGNITE-22094) Add removeAll method to tx state storage

2024-04-23 Thread Denis Chudov (Jira)
Denis Chudov created IGNITE-22094:
-

 Summary: Add removeAll method to tx state storage
 Key: IGNITE-22094
 URL: https://issues.apache.org/jira/browse/IGNITE-22094
 Project: Ignite
  Issue Type: Improvement
Reporter: Denis Chudov


*Motivation*

Tx state vacuum should be able to remove multiple tx states at once, namely all
states that meet the requirements for removal.

*Definition of done*

TxStateStorage#removeAll is added, along with corresponding tests.





[jira] [Updated] (IGNITE-22024) ItSqlClientSynchronousApiTest#runtimeErrorInDmlCausesTransactionToFail is flaky

2024-04-23 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-22024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov updated IGNITE-22024:
--
Fix Version/s: 3.0.0-beta2

> ItSqlClientSynchronousApiTest#runtimeErrorInDmlCausesTransactionToFail is 
> flaky
> ---
>
> Key: IGNITE-22024
> URL: https://issues.apache.org/jira/browse/IGNITE-22024
> Project: Ignite
>  Issue Type: Bug
>Reporter: Vladislav Pyatkov
>Assignee: Denis Chudov
>Priority: Major
>  Labels: ignite-3
> Fix For: 3.0.0-beta2
>
> Attachments: screenshot-1.png
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> h3. Motivation
> Committing a transaction at most once is a basic transactional guarantee. The
> test shows that this guarantee is violated for thin clients.
> {noformat}
> java.lang.AssertionError: Exception has not been thrown.
>  
> at 
> org.apache.ignite.internal.testframework.IgniteTestUtils.assertThrowsWithCode(IgniteTestUtils.java:314)
> at 
> org.apache.ignite.internal.sql.api.ItSqlApiBaseTest.runtimeErrorInDmlCausesTransactionToFail(ItSqlApiBaseTest.java:648)
> at 
> org.apache.ignite.internal.sql.api.ItSqlClientSynchronousApiTest.runtimeErrorInDmlCausesTransactionToFail(ItSqlClientSynchronousApiTest.java:65)
> at java.base/java.lang.reflect.Method.invoke(Method.java:566)
> at 
> java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
> at 
> java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
> at 
> java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:177)
> at 
> java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
> at 
> java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
> at 
> java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
> at 
> java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
> at 
> java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
> at 
> java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
> at 
> java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
> at 
> java.base/java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
> at 
> java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
> at 
> java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
> at 
> java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
> at 
> java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
> at 
> java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> at 
> java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:497)
> at 
> java.base/java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:274)
> at 
> java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
> at 
> java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
> at 
> java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
> at 
> java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1654)
> at 
> java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
> at 
> java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
> at 
> java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
> at 
> java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
> at 
> java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> at 
> java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:497)
> at 
> java.base/java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:274)
> at 
> java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1654)
> at 
> java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
> at 
> java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
> at 
> java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
> at 
> java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
> at 
> java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> at 
> java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:497)
> at java.base/java.util.ArrayList.forEach(ArrayList.java:1540)
> at java.base/java.util.ArrayList.forEach(ArrayList.java:1540)
> {noformat}
> h3. 

[jira] [Updated] (IGNITE-22024) ItSqlClientSynchronousApiTest#runtimeErrorInDmlCausesTransactionToFail is flaky

2024-04-23 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-22024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov updated IGNITE-22024:
--
Reviewer: Vladislav Pyatkov

> ItSqlClientSynchronousApiTest#runtimeErrorInDmlCausesTransactionToFail is 
> flaky
> ---
>
> Key: IGNITE-22024
> URL: https://issues.apache.org/jira/browse/IGNITE-22024
> Project: Ignite
>  Issue Type: Bug
>Reporter: Vladislav Pyatkov
>Assignee: Denis Chudov
>Priority: Major
>  Labels: ignite-3
> Attachments: screenshot-1.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> h3. Motivation
> Committing a transaction at most once is a basic transactional guarantee. The
> test shows that this guarantee is violated for thin clients.
> {noformat}
> java.lang.AssertionError: Exception has not been thrown.
>  
> at 
> org.apache.ignite.internal.testframework.IgniteTestUtils.assertThrowsWithCode(IgniteTestUtils.java:314)
> at 
> org.apache.ignite.internal.sql.api.ItSqlApiBaseTest.runtimeErrorInDmlCausesTransactionToFail(ItSqlApiBaseTest.java:648)
> at 
> org.apache.ignite.internal.sql.api.ItSqlClientSynchronousApiTest.runtimeErrorInDmlCausesTransactionToFail(ItSqlClientSynchronousApiTest.java:65)
> at java.base/java.lang.reflect.Method.invoke(Method.java:566)
> at 
> java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
> at 
> java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
> at 
> java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:177)
> at 
> java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
> at 
> java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
> at 
> java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
> at 
> java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
> at 
> java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
> at 
> java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
> at 
> java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
> at 
> java.base/java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
> at 
> java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
> at 
> java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
> at 
> java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
> at 
> java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
> at 
> java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> at 
> java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:497)
> at 
> java.base/java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:274)
> at 
> java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
> at 
> java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
> at 
> java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
> at 
> java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1654)
> at 
> java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
> at 
> java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
> at 
> java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
> at 
> java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
> at 
> java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> at 
> java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:497)
> at 
> java.base/java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:274)
> at 
> java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1654)
> at 
> java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
> at 
> java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
> at 
> java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
> at 
> java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
> at 
> java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> at 
> java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:497)
> at java.base/java.util.ArrayList.forEach(ArrayList.java:1540)
> at java.base/java.util.ArrayList.forEach(ArrayList.java:1540)
> {noformat}
> h3. Definition of done
> Any transaction 

[jira] [Assigned] (IGNITE-22024) ItSqlClientSynchronousApiTest#runtimeErrorInDmlCausesTransactionToFail is flaky

2024-04-23 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-22024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov reassigned IGNITE-22024:
-

Assignee: Denis Chudov

> ItSqlClientSynchronousApiTest#runtimeErrorInDmlCausesTransactionToFail is 
> flaky
> ---
>
> Key: IGNITE-22024
> URL: https://issues.apache.org/jira/browse/IGNITE-22024
> Project: Ignite
>  Issue Type: Bug
>Reporter: Vladislav Pyatkov
>Assignee: Denis Chudov
>Priority: Major
>  Labels: ignite-3
> Attachments: screenshot-1.png
>
>
> h3. Motivation
> Committing a transaction at most once is a basic transactional guarantee. The
> test shows that this guarantee is violated for thin clients.
> {noformat}
> java.lang.AssertionError: Exception has not been thrown.
>  
> at 
> org.apache.ignite.internal.testframework.IgniteTestUtils.assertThrowsWithCode(IgniteTestUtils.java:314)
> at 
> org.apache.ignite.internal.sql.api.ItSqlApiBaseTest.runtimeErrorInDmlCausesTransactionToFail(ItSqlApiBaseTest.java:648)
> at 
> org.apache.ignite.internal.sql.api.ItSqlClientSynchronousApiTest.runtimeErrorInDmlCausesTransactionToFail(ItSqlClientSynchronousApiTest.java:65)
> at java.base/java.lang.reflect.Method.invoke(Method.java:566)
> at 
> java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
> at 
> java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
> at 
> java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:177)
> at 
> java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
> at 
> java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
> at 
> java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
> at 
> java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
> at 
> java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
> at 
> java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
> at 
> java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
> at 
> java.base/java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
> at 
> java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
> at 
> java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
> at 
> java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
> at 
> java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
> at 
> java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> at 
> java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:497)
> at 
> java.base/java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:274)
> at 
> java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
> at 
> java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
> at 
> java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
> at 
> java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1654)
> at 
> java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
> at 
> java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
> at 
> java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
> at 
> java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
> at 
> java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> at 
> java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:497)
> at 
> java.base/java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:274)
> at 
> java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1654)
> at 
> java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
> at 
> java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
> at 
> java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
> at 
> java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
> at 
> java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> at 
> java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:497)
> at java.base/java.util.ArrayList.forEach(ArrayList.java:1540)
> at java.base/java.util.ArrayList.forEach(ArrayList.java:1540)
> {noformat}
> h3. Definition of done
> Any transaction operation must notify the user that the transaction is 
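
To make the expected behavior concrete, here is an editor-supplied sketch of
the guarantee under test. The SQL text and helper calls are hypothetical and
are not taken from the actual ItSqlApiBaseTest code.

{code:java}
// Illustrative sketch: after a statement fails at runtime inside an explicit
// transaction, the transaction must be unusable, and every subsequent
// operation on it must throw (hypothetical API shapes).
Transaction tx = ignite.transactions().begin();
try {
    sql.execute(tx, "UPDATE t SET v = v / (k - k) WHERE k = 1"); // division by zero
} catch (RuntimeException expected) {
    // The DML fails at runtime; the transaction is now marked for rollback.
}
assertThrows(RuntimeException.class, () -> sql.execute(tx, "SELECT * FROM t"));
{code}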

[jira] [Updated] (IGNITE-22067) Make lease distribution more even

2024-04-19 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-22067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov updated IGNITE-22067:
--
Description: 
*Motivation*

Currently, if we have a cluster of 3 nodes and a zone of 5 partitions, there is
a relatively high chance that some node will hold no primary replica for any
partition located on it.

For 10 partitions this chance is much lower, but it still exists.

This shows that LeaseUpdater#nextLeaseHolder produces a distribution that is
far from even.

*Definition of done*

The initial lease distribution has to be made decently even. However, it may
not be preserved during the lifetime of the cluster, because leases don't have
to be moved every time the topology changes.

*Implementation notes*

We can base the distribution on node priority: nodes holding fewer leases will
have higher priority. Later, this approach can be extended to calculate node
priority from user load, hot data metrics, etc. This will help us with
IGNITE-18879 as well.

  was:
*Motivation*

Currently, if we have a cluster of 3 nodes and a zone of 5 partitions, there is
a relatively high chance that some node will hold no primary replica for any
partition located on it.

For 10 partitions this chance is much lower, but it still exists.

This shows that LeaseUpdater#nextLeaseHolder produces a distribution that is
far from even.

*Definition of done*

The initial lease distribution has to be made decently even. However, it may
not be preserved during the lifetime of the cluster, because leases don't have
to be moved every time the topology changes.

*Implementation notes*

We can base the distribution on node priority: nodes holding fewer leases will
have higher priority. Later, this approach can be extended to calculate node
priority from user load, data 


> Make lease distribution more even
> -
>
> Key: IGNITE-22067
> URL: https://issues.apache.org/jira/browse/IGNITE-22067
> Project: Ignite
>  Issue Type: Bug
>Reporter: Denis Chudov
>Priority: Major
>  Labels: ignite-3
>
> *Motivation*
> Currently, if we have a cluster of 3 nodes and a zone of 5 partitions, there
> is a relatively high chance that some node will hold no primary replica for
> any partition located on it.
> For 10 partitions this chance is much lower, but it still exists.
> This shows that LeaseUpdater#nextLeaseHolder produces a distribution that is
> far from even.
> *Definition of done*
> The initial lease distribution has to be made decently even. However, it may
> not be preserved during the lifetime of the cluster, because leases don't
> have to be moved every time the topology changes.
> *Implementation notes*
> We can base the distribution on node priority: nodes holding fewer leases
> will have higher priority. Later, this approach can be extended to calculate
> node priority from user load, hot data metrics, etc. This will help us with
> IGNITE-18879 as well.
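
As a concrete illustration of the priority idea, an editor-supplied sketch (all
names are hypothetical; this is not the actual LeaseUpdater code): pick the
candidate currently holding the fewest leases, with a deterministic tie-break.

{code:java}
import java.util.Comparator;
import java.util.List;
import java.util.Map;

final class LeaseBalancer {
    // The node with the fewest leases wins; ties are broken by node name so
    // repeated calls with the same inputs are deterministic.
    static String nextLeaseHolder(List<String> candidates, Map<String, Integer> leaseCounts) {
        return candidates.stream()
                .min(Comparator.<String>comparingInt(n -> leaseCounts.getOrDefault(n, 0))
                        .thenComparing(Comparator.naturalOrder()))
                .orElseThrow();
    }
}
{code}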





[jira] [Updated] (IGNITE-22067) Make lease distribution more even

2024-04-19 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-22067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov updated IGNITE-22067:
--
Description: 
*Motivation*

Currently, if we have a cluster of 3 nodes and a zone of 5 partitions, there is
a relatively high chance that some node will hold no primary replica for any
partition located on it.

For 10 partitions this chance is much lower, but it still exists.

This shows that LeaseUpdater#nextLeaseHolder produces a distribution that is
far from even.

*Definition of done*

The initial lease distribution has to be made decently even. However, it may
not be preserved during the lifetime of the cluster, because leases don't have
to be moved every time the topology changes.

*Implementation notes*

We can base the distribution on node priority: nodes holding fewer leases will
have higher priority. Later, this approach can be extended to calculate node
priority from user load, data 

  was:
Currently, if we have a cluster of 3 nodes and a zone of 5 partitions, there is
a relatively high chance that some node will hold no primary replica for any
partition located on it.

For 10 partitions this chance is much lower, but it still exists.

This shows that LeaseUpdater#nextLeaseHolder produces a distribution that is
far from even.


> Make lease distribution more even
> -
>
> Key: IGNITE-22067
> URL: https://issues.apache.org/jira/browse/IGNITE-22067
> Project: Ignite
>  Issue Type: Bug
>Reporter: Denis Chudov
>Priority: Major
>  Labels: ignite-3
>
> *Motivation*
> Currently, if we have a cluster of 3 nodes and a zone of 5 partitions, there
> is a relatively high chance that some node will hold no primary replica for
> any partition located on it.
> For 10 partitions this chance is much lower, but it still exists.
> This shows that LeaseUpdater#nextLeaseHolder produces a distribution that is
> far from even.
> *Definition of done*
> The initial lease distribution has to be made decently even. However, it may
> not be preserved during the lifetime of the cluster, because leases don't
> have to be moved every time the topology changes.
> *Implementation notes*
> We can base the distribution on node priority: nodes holding fewer leases
> will have higher priority. Later, this approach can be extended to calculate
> node priority from user load, data 





[jira] [Created] (IGNITE-22067) Make lease distribution more even

2024-04-18 Thread Denis Chudov (Jira)
Denis Chudov created IGNITE-22067:
-

 Summary: Make lease distribution more even
 Key: IGNITE-22067
 URL: https://issues.apache.org/jira/browse/IGNITE-22067
 Project: Ignite
  Issue Type: Bug
Reporter: Denis Chudov


Currently, if we have a cluster of 3 nodes and a zone of 5 partitions, there is
a relatively high chance that some node will hold no primary replica for any
partition located on it.

For 10 partitions this chance is much lower, but it still exists.

This shows that LeaseUpdater#nextLeaseHolder produces a distribution that is
far from even.





[jira] [Created] (IGNITE-22051) Rack awareness

2024-04-16 Thread Denis Chudov (Jira)
Denis Chudov created IGNITE-22051:
-

 Summary: Rack awareness
 Key: IGNITE-22051
 URL: https://issues.apache.org/jira/browse/IGNITE-22051
 Project: Ignite
  Issue Type: Epic
Reporter: Denis Chudov


Provide a way to ensure that backups are placed in different availability 
zones. In GG 8 this is done via ClusterNodeAttributeAffinityBackupFilter.

Example: I have 3 copies and two AZs, az1 and az2. I need to configure
rack-awareness (aka AZ-awareness) by telling the cluster to use the "zone"
attribute of my nodes. Of the three copies, two need to be on nodes with
zone=az1, and one needs to be on a node with zone=az2.
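
For reference, this is roughly how the 2.x filter mentioned above is wired up.
This is a sketch from the Ignite 2.x API; double-check it against the actual
ClusterNodeAttributeAffinityBackupFilter javadoc before relying on it.

{code:java}
import org.apache.ignite.cache.affinity.rendezvous.ClusterNodeAttributeAffinityBackupFilter;
import org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction;
import org.apache.ignite.configuration.CacheConfiguration;

class ZoneAwareCacheConfig {
    static CacheConfiguration<Integer, String> zoneAwareCache() {
        CacheConfiguration<Integer, String> cfg = new CacheConfiguration<>("myCache");
        cfg.setBackups(2); // 3 copies total: 1 primary + 2 backups

        RendezvousAffinityFunction aff = new RendezvousAffinityFunction();
        // Prefer placing backups on nodes whose "zone" attribute differs from
        // the nodes that already hold copies of the partition.
        aff.setAffinityBackupFilter(new ClusterNodeAttributeAffinityBackupFilter("zone"));
        cfg.setAffinity(aff);
        return cfg;
    }
}
{code}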





[jira] [Assigned] (IGNITE-22033) Replace PlacementDriver#currentLease with #getPrimaryReplica in ReadWriteTxContext#waitReadyToFinish

2024-04-15 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-22033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov reassigned IGNITE-22033:
-

Assignee: Denis Chudov

> Replace PlacementDriver#currentLease with #getPrimaryReplica in 
> ReadWriteTxContext#waitReadyToFinish
> 
>
> Key: IGNITE-22033
> URL: https://issues.apache.org/jira/browse/IGNITE-22033
> Project: Ignite
>  Issue Type: Bug
>Reporter: Denis Chudov
>Assignee: Denis Chudov
>Priority: Major
>  Labels: ignite-3
> Attachments: _Integration_Tests_Module_Runner_24658_.log
>
>
> #currentLease can return null when there is no lease information on the
> current node yet, even though the lease may already exist on another node.
> This can lead to a PrimaryReplicaExpiredException.
> It seems that we've already seen such exceptions on TC:
>  
> {code:java}
> Caused by: org.apache.ignite.internal.tx.impl.PrimaryReplicaExpiredException: 
> IGN-TX-13 TraceId:2766fa1f-a00e-4c53-b556-7d06fc116229 Primary replica has 
> expired, transaction will be rolled back: [groupId = 59_part_11, expected 
> enlistment consistency token = 112211838526816298, commit timestamp = null, 
> current primary replica = null]
>     at 
> app//org.apache.ignite.internal.tx.impl.TransactionInflights$ReadWriteTxContext.waitReadyToFinish(TransactionInflights.java:271)
>     at 
> app//org.apache.ignite.internal.tx.impl.TransactionInflights$ReadWriteTxContext.performFinish(TransactionInflights.java:229)
>     at 
> app//org.apache.ignite.internal.tx.impl.TxManagerImpl.finish(TxManagerImpl.java:501)
>     at 
> app//org.apache.ignite.internal.tx.impl.ReadWriteTransactionImpl.finishInternal(ReadWriteTransactionImpl.java:161)
>     at 
> app//org.apache.ignite.internal.tx.impl.ReadWriteTransactionImpl.finish(ReadWriteTransactionImpl.java:140)
>     at 
> app//org.apache.ignite.internal.tx.impl.IgniteAbstractTransactionImpl.commitAsync(IgniteAbstractTransactionImpl.java:98)
>     at 
> app//org.apache.ignite.internal.sql.engine.tx.QueryTransactionWrapperImpl.commitImplicit(QueryTransactionWrapperImpl.java:46)
>     at 
> app//org.apache.ignite.internal.sql.engine.AsyncSqlCursorImpl.lambda$closeAsync$3(AsyncSqlCursorImpl.java:132)
>     at 
> java.base@11.0.17/java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1106)
>     at 
> java.base@11.0.17/java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2235)
>     at 
> app//org.apache.ignite.internal.sql.engine.AsyncSqlCursorImpl.closeAsync(AsyncSqlCursorImpl.java:132)
>     at 
> app//org.apache.ignite.internal.sql.engine.AsyncSqlCursorImpl.lambda$requestNextAsync$2(AsyncSqlCursorImpl.java:101)
>     at 
> java.base@11.0.17/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930)
>     at 
> java.base@11.0.17/java.util.concurrent.CompletableFuture.uniHandleStage(CompletableFuture.java:946)
>     at 
> java.base@11.0.17/java.util.concurrent.CompletableFuture.handle(CompletableFuture.java:2266)
>     at 
> app//org.apache.ignite.internal.sql.engine.AsyncSqlCursorImpl.requestNextAsync(AsyncSqlCursorImpl.java:94)
>     at 
> app//org.apache.ignite.internal.sql.api.IgniteSqlImpl.lambda$executeAsyncInternal$4(IgniteSqlImpl.java:360)
>     at 
> java.base@11.0.17/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1072)
>     at 
> java.base@11.0.17/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
>     at 
> java.base@11.0.17/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2073)
>     at 
> app//org.apache.ignite.internal.sql.engine.SqlQueryProcessor$PrefetchCallback.onPrefetchComplete(SqlQueryProcessor.java:1050)
>     at 
> app//org.apache.ignite.internal.sql.engine.prepare.KeyValueModifyPlan.lambda$execute$3(KeyValueModifyPlan.java:141)
>     at 
> java.base@11.0.17/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859)
>     at 
> java.base@11.0.17/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837)
>     at 
> java.base@11.0.17/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)
>     at 
> app//org.apache.ignite.internal.sql.engine.exec.ExecutionContext.lambda$execute$0(ExecutionContext.java:325)
>     at 
> app//org.apache.ignite.internal.sql.engine.exec.QueryTaskExecutorImpl.lambda$execute$0(QueryTaskExecutorImpl.java:83){code}
>  Full log attached.
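
A hedged before/after sketch of the proposed replacement; the signatures are
assumed from the method names in the ticket title and are not the actual patch.

{code:java}
// Before: a purely local, synchronous lookup. It returns null when this node
// has not observed the lease yet, even though one may exist elsewhere.
Lease lease = placementDriver.currentLease(groupId);

// After: an awaitable lookup that completes once primary-replica metadata is
// available, avoiding the spurious PrimaryReplicaExpiredException.
CompletableFuture<ReplicaMeta> primary =
        placementDriver.getPrimaryReplica(groupId, clock.now());
{code}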





[jira] [Updated] (IGNITE-22033) Replace PlacementDriver#currentLease with #getPrimaryReplica in ReadWriteTxContext#waitReadyToFinish

2024-04-12 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-22033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov updated IGNITE-22033:
--
Attachment: _Integration_Tests_Module_Runner_24658_.log

> Replace PlacementDriver#currentLease with #getPrimaryReplica in 
> ReadWriteTxContext#waitReadyToFinish
> 
>
> Key: IGNITE-22033
> URL: https://issues.apache.org/jira/browse/IGNITE-22033
> Project: Ignite
>  Issue Type: Bug
>Reporter: Denis Chudov
>Priority: Major
>  Labels: ignite-3
> Attachments: _Integration_Tests_Module_Runner_24658_.log
>
>
> #currentLease can return null when there is no lease information on the
> current node yet, even though the lease may already exist on another node.
> This can lead to a PrimaryReplicaExpiredException.
> It seems that we've already seen such exceptions on TC:
>  
> {code:java}
> Caused by: org.apache.ignite.internal.tx.impl.PrimaryReplicaExpiredException: 
> IGN-TX-13 TraceId:2766fa1f-a00e-4c53-b556-7d06fc116229 Primary replica has 
> expired, transaction will be rolled back: [groupId = 59_part_11, expected 
> enlistment consistency token = 112211838526816298, commit timestamp = null, 
> current primary replica = null]
>     at 
> app//org.apache.ignite.internal.tx.impl.TransactionInflights$ReadWriteTxContext.waitReadyToFinish(TransactionInflights.java:271)
>     at 
> app//org.apache.ignite.internal.tx.impl.TransactionInflights$ReadWriteTxContext.performFinish(TransactionInflights.java:229)
>     at 
> app//org.apache.ignite.internal.tx.impl.TxManagerImpl.finish(TxManagerImpl.java:501)
>     at 
> app//org.apache.ignite.internal.tx.impl.ReadWriteTransactionImpl.finishInternal(ReadWriteTransactionImpl.java:161)
>     at 
> app//org.apache.ignite.internal.tx.impl.ReadWriteTransactionImpl.finish(ReadWriteTransactionImpl.java:140)
>     at 
> app//org.apache.ignite.internal.tx.impl.IgniteAbstractTransactionImpl.commitAsync(IgniteAbstractTransactionImpl.java:98)
>     at 
> app//org.apache.ignite.internal.sql.engine.tx.QueryTransactionWrapperImpl.commitImplicit(QueryTransactionWrapperImpl.java:46)
>     at 
> app//org.apache.ignite.internal.sql.engine.AsyncSqlCursorImpl.lambda$closeAsync$3(AsyncSqlCursorImpl.java:132)
>     at 
> java.base@11.0.17/java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1106)
>     at 
> java.base@11.0.17/java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2235)
>     at 
> app//org.apache.ignite.internal.sql.engine.AsyncSqlCursorImpl.closeAsync(AsyncSqlCursorImpl.java:132)
>     at 
> app//org.apache.ignite.internal.sql.engine.AsyncSqlCursorImpl.lambda$requestNextAsync$2(AsyncSqlCursorImpl.java:101)
>     at 
> java.base@11.0.17/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930)
>     at 
> java.base@11.0.17/java.util.concurrent.CompletableFuture.uniHandleStage(CompletableFuture.java:946)
>     at 
> java.base@11.0.17/java.util.concurrent.CompletableFuture.handle(CompletableFuture.java:2266)
>     at 
> app//org.apache.ignite.internal.sql.engine.AsyncSqlCursorImpl.requestNextAsync(AsyncSqlCursorImpl.java:94)
>     at 
> app//org.apache.ignite.internal.sql.api.IgniteSqlImpl.lambda$executeAsyncInternal$4(IgniteSqlImpl.java:360)
>     at 
> java.base@11.0.17/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1072)
>     at 
> java.base@11.0.17/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
>     at 
> java.base@11.0.17/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2073)
>     at 
> app//org.apache.ignite.internal.sql.engine.SqlQueryProcessor$PrefetchCallback.onPrefetchComplete(SqlQueryProcessor.java:1050)
>     at 
> app//org.apache.ignite.internal.sql.engine.prepare.KeyValueModifyPlan.lambda$execute$3(KeyValueModifyPlan.java:141)
>     at 
> java.base@11.0.17/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859)
>     at 
> java.base@11.0.17/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837)
>     at 
> java.base@11.0.17/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)
>     at 
> app//org.apache.ignite.internal.sql.engine.exec.ExecutionContext.lambda$execute$0(ExecutionContext.java:325)
>     at 
> app//org.apache.ignite.internal.sql.engine.exec.QueryTaskExecutorImpl.lambda$execute$0(QueryTaskExecutorImpl.java:83){code}
>  Full log attached.





[jira] [Updated] (IGNITE-22033) Replace PlacementDriver#currentLease with #getPrimaryReplica in ReadWriteTxContext#waitReadyToFinish

2024-04-12 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-22033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov updated IGNITE-22033:
--
Description: 
#currentLease can return null when there is no lease information on the current
node yet, even though the lease may already exist on another node. This can
lead to a PrimaryReplicaExpiredException.
It seems that we've already seen such exceptions on TC:
 
{code:java}
Caused by: org.apache.ignite.internal.tx.impl.PrimaryReplicaExpiredException: 
IGN-TX-13 TraceId:2766fa1f-a00e-4c53-b556-7d06fc116229 Primary replica has 
expired, transaction will be rolled back: [groupId = 59_part_11, expected 
enlistment consistency token = 112211838526816298, commit timestamp = null, 
current primary replica = null]
    at 
app//org.apache.ignite.internal.tx.impl.TransactionInflights$ReadWriteTxContext.waitReadyToFinish(TransactionInflights.java:271)
    at 
app//org.apache.ignite.internal.tx.impl.TransactionInflights$ReadWriteTxContext.performFinish(TransactionInflights.java:229)
    at 
app//org.apache.ignite.internal.tx.impl.TxManagerImpl.finish(TxManagerImpl.java:501)
    at 
app//org.apache.ignite.internal.tx.impl.ReadWriteTransactionImpl.finishInternal(ReadWriteTransactionImpl.java:161)
    at 
app//org.apache.ignite.internal.tx.impl.ReadWriteTransactionImpl.finish(ReadWriteTransactionImpl.java:140)
    at 
app//org.apache.ignite.internal.tx.impl.IgniteAbstractTransactionImpl.commitAsync(IgniteAbstractTransactionImpl.java:98)
    at 
app//org.apache.ignite.internal.sql.engine.tx.QueryTransactionWrapperImpl.commitImplicit(QueryTransactionWrapperImpl.java:46)
    at 
app//org.apache.ignite.internal.sql.engine.AsyncSqlCursorImpl.lambda$closeAsync$3(AsyncSqlCursorImpl.java:132)
    at 
java.base@11.0.17/java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1106)
    at 
java.base@11.0.17/java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2235)
    at 
app//org.apache.ignite.internal.sql.engine.AsyncSqlCursorImpl.closeAsync(AsyncSqlCursorImpl.java:132)
    at 
app//org.apache.ignite.internal.sql.engine.AsyncSqlCursorImpl.lambda$requestNextAsync$2(AsyncSqlCursorImpl.java:101)
    at 
java.base@11.0.17/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930)
    at 
java.base@11.0.17/java.util.concurrent.CompletableFuture.uniHandleStage(CompletableFuture.java:946)
    at 
java.base@11.0.17/java.util.concurrent.CompletableFuture.handle(CompletableFuture.java:2266)
    at 
app//org.apache.ignite.internal.sql.engine.AsyncSqlCursorImpl.requestNextAsync(AsyncSqlCursorImpl.java:94)
    at 
app//org.apache.ignite.internal.sql.api.IgniteSqlImpl.lambda$executeAsyncInternal$4(IgniteSqlImpl.java:360)
    at 
java.base@11.0.17/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1072)
    at 
java.base@11.0.17/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
    at 
java.base@11.0.17/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2073)
    at 
app//org.apache.ignite.internal.sql.engine.SqlQueryProcessor$PrefetchCallback.onPrefetchComplete(SqlQueryProcessor.java:1050)
    at 
app//org.apache.ignite.internal.sql.engine.prepare.KeyValueModifyPlan.lambda$execute$3(KeyValueModifyPlan.java:141)
    at 
java.base@11.0.17/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859)
    at 
java.base@11.0.17/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837)
    at 
java.base@11.0.17/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)
    at 
app//org.apache.ignite.internal.sql.engine.exec.ExecutionContext.lambda$execute$0(ExecutionContext.java:325)
    at 
app//org.apache.ignite.internal.sql.engine.exec.QueryTaskExecutorImpl.lambda$execute$0(QueryTaskExecutorImpl.java:83){code}
 
 

  was:
#currentLease can return null when there is no lease information on the current
node yet, even though the lease may already exist on another node. This can
lead to a PrimaryReplicaExpiredException.
It seems that we've already seen such exceptions on TC:
 
{code:java}
Caused by: org.apache.ignite.internal.tx.impl.PrimaryReplicaExpiredException: 
IGN-TX-13 TraceId:2766fa1f-a00e-4c53-b556-7d06fc116229 Primary replica has 
expired, transaction will be rolled back: [groupId = 59_part_11, expected 
enlistment consistency token = 112211838526816298, commit timestamp = null, 
current primary replica = null]
     at 
app//org.apache.ignite.internal.tx.impl.TransactionInflights$ReadWriteTxContext.waitReadyToFinish(TransactionInflights.java:271)
     at 
app//org.apache.ignite.internal.tx.impl.TransactionInflights$ReadWriteTxContext.performFinish(TransactionInflights.java:229)
     at 
app//org.apache.ignite.internal.tx.impl.TxManagerImpl.finish(TxManagerImpl.java:501)
     at 

[jira] [Updated] (IGNITE-22033) Replace PlacementDriver#currentLease with #getPrimaryReplica in ReadWriteTxContext#waitReadyToFinish

2024-04-12 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-22033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov updated IGNITE-22033:
--
Description: 
#currentLease can return null when there is no lease information on the current
node yet, even though the lease may already exist on another node. This can
lead to a PrimaryReplicaExpiredException.
It seems that we've already seen such exceptions on TC:
 
{code:java}
Caused by: org.apache.ignite.internal.tx.impl.PrimaryReplicaExpiredException: 
IGN-TX-13 TraceId:2766fa1f-a00e-4c53-b556-7d06fc116229 Primary replica has 
expired, transaction will be rolled back: [groupId = 59_part_11, expected 
enlistment consistency token = 112211838526816298, commit timestamp = null, 
current primary replica = null]
    at 
app//org.apache.ignite.internal.tx.impl.TransactionInflights$ReadWriteTxContext.waitReadyToFinish(TransactionInflights.java:271)
    at 
app//org.apache.ignite.internal.tx.impl.TransactionInflights$ReadWriteTxContext.performFinish(TransactionInflights.java:229)
    at 
app//org.apache.ignite.internal.tx.impl.TxManagerImpl.finish(TxManagerImpl.java:501)
    at 
app//org.apache.ignite.internal.tx.impl.ReadWriteTransactionImpl.finishInternal(ReadWriteTransactionImpl.java:161)
    at 
app//org.apache.ignite.internal.tx.impl.ReadWriteTransactionImpl.finish(ReadWriteTransactionImpl.java:140)
    at 
app//org.apache.ignite.internal.tx.impl.IgniteAbstractTransactionImpl.commitAsync(IgniteAbstractTransactionImpl.java:98)
    at 
app//org.apache.ignite.internal.sql.engine.tx.QueryTransactionWrapperImpl.commitImplicit(QueryTransactionWrapperImpl.java:46)
    at 
app//org.apache.ignite.internal.sql.engine.AsyncSqlCursorImpl.lambda$closeAsync$3(AsyncSqlCursorImpl.java:132)
    at 
java.base@11.0.17/java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1106)
    at 
java.base@11.0.17/java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2235)
    at 
app//org.apache.ignite.internal.sql.engine.AsyncSqlCursorImpl.closeAsync(AsyncSqlCursorImpl.java:132)
    at 
app//org.apache.ignite.internal.sql.engine.AsyncSqlCursorImpl.lambda$requestNextAsync$2(AsyncSqlCursorImpl.java:101)
    at 
java.base@11.0.17/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930)
    at 
java.base@11.0.17/java.util.concurrent.CompletableFuture.uniHandleStage(CompletableFuture.java:946)
    at 
java.base@11.0.17/java.util.concurrent.CompletableFuture.handle(CompletableFuture.java:2266)
    at 
app//org.apache.ignite.internal.sql.engine.AsyncSqlCursorImpl.requestNextAsync(AsyncSqlCursorImpl.java:94)
    at 
app//org.apache.ignite.internal.sql.api.IgniteSqlImpl.lambda$executeAsyncInternal$4(IgniteSqlImpl.java:360)
    at 
java.base@11.0.17/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1072)
    at 
java.base@11.0.17/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
    at 
java.base@11.0.17/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2073)
    at 
app//org.apache.ignite.internal.sql.engine.SqlQueryProcessor$PrefetchCallback.onPrefetchComplete(SqlQueryProcessor.java:1050)
    at 
app//org.apache.ignite.internal.sql.engine.prepare.KeyValueModifyPlan.lambda$execute$3(KeyValueModifyPlan.java:141)
    at 
java.base@11.0.17/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859)
    at 
java.base@11.0.17/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837)
    at 
java.base@11.0.17/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)
    at 
app//org.apache.ignite.internal.sql.engine.exec.ExecutionContext.lambda$execute$0(ExecutionContext.java:325)
    at 
app//org.apache.ignite.internal.sql.engine.exec.QueryTaskExecutorImpl.lambda$execute$0(QueryTaskExecutorImpl.java:83){code}
 Full log attached.

  was:
#currentLease can return null when there is no lease information on the current
node yet, even though the lease may already exist on another node. This can
lead to a PrimaryReplicaExpiredException.
It seems that we've already seen such exceptions on TC:
 
{code:java}
Caused by: org.apache.ignite.internal.tx.impl.PrimaryReplicaExpiredException: 
IGN-TX-13 TraceId:2766fa1f-a00e-4c53-b556-7d06fc116229 Primary replica has 
expired, transaction will be rolled back: [groupId = 59_part_11, expected 
enlistment consistency token = 112211838526816298, commit timestamp = null, 
current primary replica = null]
    at 
app//org.apache.ignite.internal.tx.impl.TransactionInflights$ReadWriteTxContext.waitReadyToFinish(TransactionInflights.java:271)
    at 
app//org.apache.ignite.internal.tx.impl.TransactionInflights$ReadWriteTxContext.performFinish(TransactionInflights.java:229)
    at 
app//org.apache.ignite.internal.tx.impl.TxManagerImpl.finish(TxManagerImpl.java:501)
    at 

[jira] [Created] (IGNITE-22033) Replace PlacementDriver#currentLease with #getPrimaryReplica in ReadWriteTxContext#waitReadyToFinish

2024-04-12 Thread Denis Chudov (Jira)
Denis Chudov created IGNITE-22033:
-

 Summary: Replace PlacementDriver#currentLease with 
#getPrimaryReplica in ReadWriteTxContext#waitReadyToFinish
 Key: IGNITE-22033
 URL: https://issues.apache.org/jira/browse/IGNITE-22033
 Project: Ignite
  Issue Type: Bug
Reporter: Denis Chudov


#currentLease can return null when there is no lease information on the current
node yet, even though the lease may already exist on another node. This can
lead to a PrimaryReplicaExpiredException.
It seems that we've already seen such exceptions on TC:
 
{code:java}
Caused by: org.apache.ignite.internal.tx.impl.PrimaryReplicaExpiredException: 
IGN-TX-13 TraceId:2766fa1f-a00e-4c53-b556-7d06fc116229 Primary replica has 
expired, transaction will be rolled back: [groupId = 59_part_11, expected 
enlistment consistency token = 112211838526816298, commit timestamp = null, 
current primary replica = null]
     at 
app//org.apache.ignite.internal.tx.impl.TransactionInflights$ReadWriteTxContext.waitReadyToFinish(TransactionInflights.java:271)
     at 
app//org.apache.ignite.internal.tx.impl.TransactionInflights$ReadWriteTxContext.performFinish(TransactionInflights.java:229)
     at 
app//org.apache.ignite.internal.tx.impl.TxManagerImpl.finish(TxManagerImpl.java:501)
     at 
app//org.apache.ignite.internal.tx.impl.ReadWriteTransactionImpl.finishInternal(ReadWriteTransactionImpl.java:161)
     at 
app//org.apache.ignite.internal.tx.impl.ReadWriteTransactionImpl.finish(ReadWriteTransactionImpl.java:140)
     at 
app//org.apache.ignite.internal.tx.impl.IgniteAbstractTransactionImpl.commitAsync(IgniteAbstractTransactionImpl.java:98)
     at 
app//org.apache.ignite.internal.sql.engine.tx.QueryTransactionWrapperImpl.commitImplicit(QueryTransactionWrapperImpl.java:46)
     at 
app//org.apache.ignite.internal.sql.engine.AsyncSqlCursorImpl.lambda$closeAsync$3(AsyncSqlCursorImpl.java:132)
     at 
java.base@11.0.17/java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1106)
     at 
java.base@11.0.17/java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2235)
     at 
app//org.apache.ignite.internal.sql.engine.AsyncSqlCursorImpl.closeAsync(AsyncSqlCursorImpl.java:132)
     at 
app//org.apache.ignite.internal.sql.engine.AsyncSqlCursorImpl.lambda$requestNextAsync$2(AsyncSqlCursorImpl.java:101)
     at 
java.base@11.0.17/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930)
     at 
java.base@11.0.17/java.util.concurrent.CompletableFuture.uniHandleStage(CompletableFuture.java:946)
     at 
java.base@11.0.17/java.util.concurrent.CompletableFuture.handle(CompletableFuture.java:2266)
     at 
app//org.apache.ignite.internal.sql.engine.AsyncSqlCursorImpl.requestNextAsync(AsyncSqlCursorImpl.java:94)
     at 
app//org.apache.ignite.internal.sql.api.IgniteSqlImpl.lambda$executeAsyncInternal$4(IgniteSqlImpl.java:360)
     at 
java.base@11.0.17/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1072)
     at 
java.base@11.0.17/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
     at 
java.base@11.0.17/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2073)
     at 
app//org.apache.ignite.internal.sql.engine.SqlQueryProcessor$PrefetchCallback.onPrefetchComplete(SqlQueryProcessor.java:1050)
     at 
app//org.apache.ignite.internal.sql.engine.prepare.KeyValueModifyPlan.lambda$execute$3(KeyValueModifyPlan.java:141)
     at 
java.base@11.0.17/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859)
     at 
java.base@11.0.17/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837)
     at 
java.base@11.0.17/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)
     at 
app//org.apache.ignite.internal.sql.engine.exec.ExecutionContext.lambda$execute$0(ExecutionContext.java:325)
     at 
app//org.apache.ignite.internal.sql.engine.exec.QueryTaskExecutorImpl.lambda$execute$0(QueryTaskExecutorImpl.java:83){code}
 
 





[jira] [Comment Edited] (IGNITE-20365) Add ability to intentionally change primary replica

2024-04-10 Thread Denis Chudov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-20365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17835638#comment-17835638
 ] 

Denis Chudov edited comment on IGNITE-20365 at 4/10/24 8:45 AM:


Fixed by IGNITE-21382.

The primary replica can be changed using
org.apache.ignite.internal.table.NodeUtils#transferPrimary.
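
For archive readers, a hypothetical usage sketch; the exact signature of
NodeUtils#transferPrimary is not shown in this thread, so the arguments below
are assumptions.

{code:java}
// Hypothetical call shape (not the verified signature): move the primary
// replica of the given replication group to a chosen node from a test.
NodeUtils.transferPrimary(cluster.runningNodes(), partitionGroupId, targetNodeName);
{code}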


was (Author: denis chudov):
Fixed by IGNITE-21382 .

> Add ability to intentionally change primary replica
> ---
>
> Key: IGNITE-20365
> URL: https://issues.apache.org/jira/browse/IGNITE-20365
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexander Lapin
>Assignee: Alexander Lapin
>Priority: Major
>  Labels: ignite-3
>
> Some tests, e.g. testTxStateReplicaRequestMissLeaderMiss, expect the primary
> replica to be changed. Earlier, when the primary replica was collocated with
> the leader, refreshAndGetLeaderWithTerm was used to change the leader and
> thus the primary replica. Now that the placement driver assigns the primary
> replica, this is no longer the case. All in all, some
> PlacementDriver#changePrimaryReplica or similar will be useful, at least
> within tests.
>  
> *Implementation Details*
> +Important note:+ The lease contract prohibits intersecting leases. We don't 
> want to break this contract, so we will have to wait until the current lease 
> ends before another replica becomes primary.
> There are two ways to implement this functionality: either extend
> {{PlacementDriver}} in the product or change only the test code. It looks
> like the second approach is not enough if we start a test Ignite instance
> using an {{IgniteImpl}} class, so we might need to consider extending the
> production code. Moreover, such a change might become the first step towards
> graceful cluster reconfiguration.
> The code responsible for managing leases resides in {{LeaseTracker}} and
> {{LeaseUpdater}}. To make the required change, we can add a pending lease
> with a start time in the future. We should make sure that both these places,
> as well as any recovery code, account for it. Currently, the next lease is
> added ONLY when the current one ends.
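
As a sketch of the non-intersecting-lease idea above (editor-supplied; all
types and method names are hypothetical): the pending lease must start strictly
after the current one expires.

{code:java}
// Hypothetical types: only the interval arithmetic matters here. Because
// newStart > currentLease.getExpirationTime(), the two leases never intersect.
HybridTimestamp newStart = currentLease.getExpirationTime().addPhysicalTime(1);
Lease pending = new Lease(targetNode, newStart, newStart.addPhysicalTime(leaseIntervalMs));
{code}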





[jira] [Resolved] (IGNITE-20365) Add ability to intentionally change primary replica

2024-04-10 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-20365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov resolved IGNITE-20365.
---
Resolution: Duplicate

Fixed by IGNITE-21382 .

> Add ability to intentionally change primary replica
> ---
>
> Key: IGNITE-20365
> URL: https://issues.apache.org/jira/browse/IGNITE-20365
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexander Lapin
>Assignee: Alexander Lapin
>Priority: Major
>  Labels: ignite-3
>
> Some tests, e.g. testTxStateReplicaRequestMissLeaderMiss, expect the primary
> replica to be changed. Earlier, when the primary replica was collocated with
> the leader, refreshAndGetLeaderWithTerm was used to change the leader and
> thus the primary replica. Now that the placement driver assigns the primary
> replica, this is no longer the case. All in all, some
> PlacementDriver#changePrimaryReplica or similar will be useful, at least
> within tests.
>  
> *Implementation Details*
> +Important note:+ The lease contract prohibits intersecting leases. We don't 
> want to break this contract, so we will have to wait until the current lease 
> ends before another replica becomes primary.
> There are two ways to implement this functionality: either extend
> {{PlacementDriver}} in the product or change only the test code. It looks
> like the second approach is not enough if we start a test Ignite instance
> using an {{IgniteImpl}} class, so we might need to consider extending the
> production code. Moreover, such a change might become the first step towards
> graceful cluster reconfiguration.
> The code responsible for managing leases resides in {{LeaseTracker}} and
> {{LeaseUpdater}}. To make the required change, we can add a pending lease
> with a start time in the future. We should make sure that both these places,
> as well as any recovery code, account for it. Currently, the next lease is
> added ONLY when the current one ends.





[jira] [Commented] (IGNITE-21418) ItTxDistributedTestThreeNodesThreeReplicas#testDeleteUpsertAllRollback is flaky

2024-04-05 Thread Denis Chudov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17834247#comment-17834247
 ] 

Denis Chudov commented on IGNITE-21418:
---

IGNITE-21572 is a possible cause. It is resolved, but due to the rare
occurrence of this error, we need to monitor TeamCity for some time (about a
month). After that, if the error is no longer reproduced, we can close this
ticket.

> ItTxDistributedTestThreeNodesThreeReplicas#testDeleteUpsertAllRollback is 
> flaky
> ---
>
> Key: IGNITE-21418
> URL: https://issues.apache.org/jira/browse/IGNITE-21418
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexander Lapin
>Priority: Major
>  Labels: ignite-3
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> {code:java}
> java.lang.NullPointerException  at 
> org.apache.ignite.internal.table.TxAbstractTest.testDeleteUpsertAllRollback(TxAbstractTest.java:233)
>   at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method)  at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.base/java.lang.reflect.Method.invoke(Method.java:566)  at 
> org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:727)
>  {code}
> [https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_RunAllTests/7814256?expandCode+Inspection=true=true=false=true=false=true]
> Flaky rate is low.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (IGNITE-20628) testDropColumn and testMergeChangesAddDropAdd in ItSchemaChangeKvViewTest are disabled

2024-04-05 Thread Denis Chudov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-20628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17834245#comment-17834245
 ] 

Denis Chudov commented on IGNITE-20628:
---

IGNITE-21572 is resolved, but due to the rare occurrence of this error, we need 
to monitor TeamCity for some time (about a month). After that, if this error is 
no longer reproduced, we can close this ticket.

> testDropColumn and testMergeChangesAddDropAdd in ItSchemaChangeKvViewTest are 
> disabled
> --
>
> Key: IGNITE-20628
> URL: https://issues.apache.org/jira/browse/IGNITE-20628
> Project: Ignite
>  Issue Type: Bug
>Reporter: Roman Puchkovskiy
>Priority: Major
>  Labels: ignite-3, tech-debt
> Fix For: 3.0.0-beta2
>
>
> It was supposed that IGNITE-17931 was the culprit, but even after removing 
> the blocking code the tests are still flaky.
> The tests fail with one of 3 symptoms:
>  # An NPE happens in the test method code: a value that was put earlier is 
> not found when read with the same key. This is probably caused by a 
> transactional protocol implementation bug, maybe this one: IGNITE-20116
>  # A PrimaryReplicaAwaitTimeoutException
>  # A ReplicationTimeoutException
> Items 2 and 3 need to be investigated.
> h2. A stacktrace for 1
> java.lang.NullPointerException
>     at 
> org.apache.ignite.internal.runner.app.ItSchemaChangeKvViewTest.testDropColumn(ItSchemaChangeKvViewTest.java:58)
> h2. A stacktrace for 2
> org.apache.ignite.tx.TransactionException: IGN-PLACEMENTDRIVER-1 
> TraceId:0a32c369-b9ca-4091-b8de-af15d65a1f52 Failed to get the primary 
> replica [tablePartitionId=3_part_5, awaitTimestamp=HybridTimestamp 
> [time=111220884095959043, physical=1697096009765, logical=3]]
>  
> at 
> org.apache.ignite.internal.util.ExceptionUtils.lambda$withCause$1(ExceptionUtils.java:400)
> at 
> org.apache.ignite.internal.util.ExceptionUtils.withCauseInternal(ExceptionUtils.java:461)
> at 
> org.apache.ignite.internal.util.ExceptionUtils.withCause(ExceptionUtils.java:400)
> at 
> org.apache.ignite.internal.table.distributed.storage.InternalTableImpl.lambda$enlist$71(InternalTableImpl.java:1659)
> at 
> java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930)
> at 
> java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907)
> at 
> java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
> at 
> java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
> at 
> java.base/java.util.concurrent.CompletableFuture$Timeout.run(CompletableFuture.java:2792)
> at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
> at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> at 
> java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.base/java.lang.Thread.run(Thread.java:834)
> Caused by: java.util.concurrent.CompletionException: 
> org.apache.ignite.internal.placementdriver.PrimaryReplicaAwaitTimeoutException:
>  IGN-PLACEMENTDRIVER-1 TraceId:0a32c369-b9ca-4091-b8de-af15d65a1f52 The 
> primary replica await timed out [replicationGroupId=3_part_5, 
> referenceTimestamp=HybridTimestamp [time=111220884095959043, 
> physical=1697096009765, logical=3], currentLease=Lease 
> [leaseholder=isckvt_tmcada_3346, accepted=false, startTime=HybridTimestamp 
> [time=111220884127809550, physical=1697096010251, logical=14], 
> expirationTime=HybridTimestamp [time=111220891992129536, 
> physical=1697096130251, logical=0], prolongable=false, 
> replicationGroupId=3_part_5]]
> at 
> java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314)
> at 
> java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319)
> at 
> java.base/java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:990)
> at 
> java.base/java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:970)
> ... 9 more
> Caused by: 
> org.apache.ignite.internal.placementdriver.PrimaryReplicaAwaitTimeoutException:
>  IGN-PLACEMENTDRIVER-1 TraceId:0a32c369-b9ca-4091-b8de-af15d65a1f52 The 
> primary replica await timed out [replicationGroupId=3_part_5, 
> referenceTimestamp=HybridTimestamp [time=111220884095959043, 
> physical=1697096009765, logical=3], currentLease=Lease 
> [leaseholder=isckvt_tmcada_3346, accepted=false, startTime=HybridTimestamp 
> 

[jira] [Commented] (IGNITE-21307) Call failure handler in case of failure in WatchProcessor

2024-04-04 Thread Denis Chudov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833934#comment-17833934
 ] 

Denis Chudov commented on IGNITE-21307:
---

[~slava.koptilin] LGTM.

> Call failure handler in case of failure in WatchProcessor
> -
>
> Key: IGNITE-21307
> URL: https://issues.apache.org/jira/browse/IGNITE-21307
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Denis Chudov
>Assignee: Vyacheslav Koptilin
>Priority: Major
>  Labels: ignite-3
> Fix For: 3.0.0-beta2
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> For the linearized watch processing, we have 
> WatchProcessor#notificationFuture that is rewritten for each revision 
> processing and meta storage safe time advance. If some watch processor 
> completes exceptionally, this means that no further updates will be 
> processed, because they need the previous updates to be processed 
> successfully. This is implemented in futures chaining like this:
>  
> {code:java}
> notificationFuture = notificationFuture
> .thenRunAsync(() -> revisionCallback.onSafeTimeAdvanced(time), 
> watchExecutor)
> .whenComplete((ignored, e) -> {
> if (e != null) {
> LOG.error("Error occurred when notifying safe time advanced 
> callback", e);
> }
> }); {code}
> For now, we don't have any failure handling for an exceptionally completed 
> notification future. This leads to endless log records with the same 
> exception stack trace, caused by meta storage safe time advances:
>  
> {code:java}
> [2024-01-16T21:42:35,515][ERROR][%isot_n_0%JRaft-FSMCaller-Disruptor-metastorage-_stripe_0-0][WatchProcessor]
>  Error occurred when notifying safe time advanced callback
> java.util.concurrent.CompletionException: 
> org.apache.ignite.internal.lang.IgniteInternalException: IGN-CMN-65535 
> TraceId:3877e098-6a1b-4f30-88a8-a4c13411d573 Peers are not ready 
> [groupId=5_part_0]
>     at 
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314)
>  ~[?:?]
>     at 
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319)
>  ~[?:?]
>     at 
> java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1081)
>  ~[?:?]
>     at 
> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)
>  ~[?:?]
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>  ~[?:?]
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>  ~[?:?]
>     at java.lang.Thread.run(Thread.java:834) [?:?]
> Caused by: org.apache.ignite.internal.lang.IgniteInternalException: Peers are 
> not ready [groupId=5_part_0]
>     at 
> org.apache.ignite.internal.raft.RaftGroupServiceImpl.randomNode(RaftGroupServiceImpl.java:725)
>  ~[ignite-raft-9.0.127-SNAPSHOT.jar:?]
>     at 
> org.apache.ignite.internal.raft.RaftGroupServiceImpl.randomNode(RaftGroupServiceImpl.java:709)
>  ~[ignite-raft-9.0.127-SNAPSHOT.jar:?]
>     at 
> org.apache.ignite.internal.raft.RaftGroupServiceImpl.refreshLeader(RaftGroupServiceImpl.java:234)
>  ~[ignite-raft-9.0.127-SNAPSHOT.jar:?]
>     at 
> org.apache.ignite.internal.raft.RaftGroupServiceImpl.start(RaftGroupServiceImpl.java:190)
>  ~[ignite-raft-9.0.127-SNAPSHOT.jar:?]
>     at 
> org.apache.ignite.internal.raft.client.TopologyAwareRaftGroupService.start(TopologyAwareRaftGroupService.java:187)
>  ~[ignite-replicator-9.0.127-SNAPSHOT.jar:?]
>     at 
> org.apache.ignite.internal.raft.client.TopologyAwareRaftGroupServiceFactory.startRaftGroupService(TopologyAwareRaftGroupServiceFactory.java:73)
>  ~[ignite-replicator-9.0.127-SNAPSHOT.jar:?]
>     at 
> org.apache.ignite.internal.raft.Loza.startRaftGroupService(Loza.java:350) 
> ~[ignite-raft-9.0.127-SNAPSHOT.jar:?]
>     at 
> org.apache.ignite.internal.table.distributed.TableManager.lambda$startPartitionAndStartClient$27(TableManager.java:917)
>  ~[ignite-table-9.0.127-SNAPSHOT.jar:?]
>     at 
> org.apache.ignite.internal.util.IgniteUtils.inBusyLock(IgniteUtils.java:827) 
> ~[ignite-core-9.0.127-SNAPSHOT.jar:?]
>     at 
> org.apache.ignite.internal.table.distributed.TableManager.lambda$startPartitionAndStartClient$28(TableManager.java:913)
>  ~[ignite-table-9.0.127-SNAPSHOT.jar:?]
>     at 
> java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1072)
>  ~[?:?]
>     ... 4 more {code}
> So, the node can't operate properly and just produces tons of logs. Such 
> nodes should be halted.
> UPD:
> We decided to just add an invocation of {{failureProcessor.process}} in all 
> places in {{org.apache.ignite.internal.metastorage.server.WatchProcessor}} 
> where exceptions happen, like 
> {code:java}
> 
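The quoted description is truncated here. A minimal sketch of the agreed fix - 
routing exceptions from the notification chain into a failure processor 
instead of only logging them - could look like the following; the 
{{FailureProcessor}} signature is an assumption for illustration, not the real 
API:

{code:java}
import java.util.concurrent.CompletableFuture;

final class WatchProcessorSketch {
    interface FailureProcessor {
        void process(Throwable failure);
    }

    private final FailureProcessor failureProcessor;

    // Rewritten on every revision / safe time advance, as described above.
    private CompletableFuture<Void> notificationFuture =
            CompletableFuture.completedFuture(null);

    WatchProcessorSketch(FailureProcessor failureProcessor) {
        this.failureProcessor = failureProcessor;
    }

    void onSafeTimeAdvanced(Runnable callback) {
        notificationFuture = notificationFuture
                .thenRunAsync(callback)
                .whenComplete((ignored, e) -> {
                    if (e != null) {
                        // Instead of endlessly logging the same stack trace on
                        // every advance, hand the node to the failure handler.
                        failureProcessor.process(e);
                    }
                });
    }
}
{code}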

[jira] [Updated] (IGNITE-21933) Fix TxStateStorage#leaseStartTime possible inconsistency with partition storage

2024-04-03 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov updated IGNITE-21933:
--
Description: Tx state storage may be inconsistent with partition storage 
during recovery, which may corrupt data written by 1-phase txns. Lease start 
time should be moved to partition storage.

> Fix TxStateStorage#leaseStartTime possible inconsistency with partition 
> storage
> ---
>
> Key: IGNITE-21933
> URL: https://issues.apache.org/jira/browse/IGNITE-21933
> Project: Ignite
>  Issue Type: Bug
>Reporter: Denis Chudov
>Assignee: Denis Chudov
>Priority: Major
>  Labels: ignite-3
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Tx state storage may be inconsistent with partition storage during recovery, 
> which may corrupt data written by 1-phase txns. Lease start time should be 
> moved to partition storage.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (IGNITE-21933) Fix TxStateStorage#leaseStartTime possible inconsistency with partition storage

2024-04-03 Thread Denis Chudov (Jira)
Denis Chudov created IGNITE-21933:
-

 Summary: Fix TxStateStorage#leaseStartTime possible inconsistency 
with partition storage
 Key: IGNITE-21933
 URL: https://issues.apache.org/jira/browse/IGNITE-21933
 Project: Ignite
  Issue Type: Bug
Reporter: Denis Chudov
Assignee: Denis Chudov






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21868) Move the sql RO inflights handling from SqlQueryProcessor to QueryTransactionContext#getOrStartImplicit/QueryTransactionWrapper#commitImplicit

2024-03-28 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov updated IGNITE-21868:
--
Description: Handling  the transaction inflights in the SqlQueryProcessor 
is not the best option,   (was: *)

> Move the sql RO inflights handling from SqlQueryProcessor to 
> QueryTransactionContext#getOrStartImplicit/QueryTransactionWrapper#commitImplicit
> --
>
> Key: IGNITE-21868
> URL: https://issues.apache.org/jira/browse/IGNITE-21868
> Project: Ignite
>  Issue Type: Bug
>Reporter: Denis Chudov
>Assignee: Denis Chudov
>Priority: Major
>  Labels: ignite-3
>
> Handling  the transaction inflights in the SqlQueryProcessor is not the best 
> option, 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21868) Move the sql RO inflights handling from SqlQueryProcessor to QueryTransactionContext#getOrStartImplicit/QueryTransactionWrapper#commitImplicit

2024-03-28 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov updated IGNITE-21868:
--
Description: Handling  the transaction inflights in the SqlQueryProcessor 
is not the best option, should be moved to QueryTransactionContext and 
QueryTransactionWrapper  (was: Handling  the transaction inflights in the 
SqlQueryProcessor is not the best option, )

> Move the sql RO inflights handling from SqlQueryProcessor to 
> QueryTransactionContext#getOrStartImplicit/QueryTransactionWrapper#commitImplicit
> --
>
> Key: IGNITE-21868
> URL: https://issues.apache.org/jira/browse/IGNITE-21868
> Project: Ignite
>  Issue Type: Bug
>Reporter: Denis Chudov
>Assignee: Denis Chudov
>Priority: Major
>  Labels: ignite-3
>
> Handling the transaction inflights in the SqlQueryProcessor is not the best 
> option; it should be moved to QueryTransactionContext and 
> QueryTransactionWrapper



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21382) Test ItPrimaryReplicaChoiceTest.testPrimaryChangeLongHandling is flaky

2024-03-28 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov updated IGNITE-21382:
--
Reviewer: Vladislav Pyatkov

> Test ItPrimaryReplicaChoiceTest.testPrimaryChangeLongHandling is flaky
> --
>
> Key: IGNITE-21382
> URL: https://issues.apache.org/jira/browse/IGNITE-21382
> Project: Ignite
>  Issue Type: Bug
>Reporter: Vladislav Pyatkov
>Assignee: Denis Chudov
>Priority: Major
>  Labels: ignite-3
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The test fails while waiting for the primary replica change. This issue is 
> also reproduced locally, at least once per five runs.
> {code}
> assertThat(primaryChangeTask, willCompleteSuccessfully());
> {code}
> {noformat}
> java.lang.AssertionError: java.util.concurrent.TimeoutException
>   at 
> org.apache.ignite.internal.testframework.matchers.CompletableFutureMatcher.matchesSafely(CompletableFutureMatcher.java:78)
>   at 
> org.apache.ignite.internal.testframework.matchers.CompletableFutureMatcher.matchesSafely(CompletableFutureMatcher.java:35)
>   at org.hamcrest.TypeSafeMatcher.matches(TypeSafeMatcher.java:67)
>   at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:10)
>   at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:6)
>   at 
> org.apache.ignite.internal.placementdriver.ItPrimaryReplicaChoiceTest.testPrimaryChangeLongHandling(ItPrimaryReplicaChoiceTest.java:179)
> {noformat}
> This test will be muted on TC to prevent future failures.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (IGNITE-21763) Adjust TxnResourceVacuumTask in order to vacuum persistent txn state

2024-03-28 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov reassigned IGNITE-21763:
-

Assignee: Denis Chudov  (was: Alexander Lapin)

> Adjust TxnResourceVacuumTask in order to vacuum persistent txn state
> 
>
> Key: IGNITE-21763
> URL: https://issues.apache.org/jira/browse/IGNITE-21763
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexander Lapin
>Assignee: Denis Chudov
>Priority: Major
>  Labels: ignite-3
>
> h3. Definition of Done
>  * TxnResourceVacuumTask is adjusted in a way that 
>  ** txnState is removed from txnStateVolatileMap if 
> {*}max{*}(cleanupCompletionTimestamp, initialVacuumObservationTimestamp) + 
> txnResourcesTTL < vacuumObservationTimestamp
>  ** If there's a value in cleanupCompletionTimestamp, then prior to removing 
> the txnState from the volatile map it's required to remove the corresponding 
> record from the txn persistent state storage (see the sketch below).
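A sketch of the vacuum condition above, under assumed names 
(txnStateVolatileMap and persistentTxStateStorage are stand-ins, not the real 
API). Note that the persistent record is removed before the volatile one, as 
the second bullet requires:

{code:java}
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class VacuumSketch {
    // cleanupCompletionTimestamp is null until cleanup has completed.
    record TxnState(Long cleanupCompletionTimestamp, long initialVacuumObservationTimestamp) {}

    private final Map<String, TxnState> txnStateVolatileMap = new ConcurrentHashMap<>();
    private final Set<String> persistentTxStateStorage = ConcurrentHashMap.newKeySet();

    void vacuum(long vacuumObservationTimestamp, long txnResourcesTtl) {
        txnStateVolatileMap.forEach((txId, state) -> {
            long base = state.cleanupCompletionTimestamp() == null
                    ? state.initialVacuumObservationTimestamp()
                    : Math.max(state.cleanupCompletionTimestamp(),
                            state.initialVacuumObservationTimestamp());

            if (base + txnResourcesTtl < vacuumObservationTimestamp) {
                // Remove the persistent record first, then the volatile state.
                if (state.cleanupCompletionTimestamp() != null) {
                    persistentTxStateStorage.remove(txId);
                }

                txnStateVolatileMap.remove(txId);
            }
        });
    }
}
{code}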



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (IGNITE-21868) Move the sql RO inflights handling from SqlQueryProcessor to QueryTransactionContext#getOrStartImplicit/QueryTransactionWrapper#commitImplicit

2024-03-28 Thread Denis Chudov (Jira)
Denis Chudov created IGNITE-21868:
-

 Summary: Move the sql RO inflights handling from SqlQueryProcessor 
to 
QueryTransactionContext#getOrStartImplicit/QueryTransactionWrapper#commitImplicit
 Key: IGNITE-21868
 URL: https://issues.apache.org/jira/browse/IGNITE-21868
 Project: Ignite
  Issue Type: Bug
Reporter: Denis Chudov
Assignee: Denis Chudov


*



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21861) Unexpected "Transaction is already finished" exception

2024-03-28 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov updated IGNITE-21861:
--
Description: 
Exception in log:
{code:java}
[2024-03-27T01:24:46,636][WARN 
][%idt_n_1%partition-operations-4][ReplicaManager] Failed to process replica 
request [request=ReadWriteScanRetrieveBatchReplicaRequestImpl [batchSize=512, 
columnsToInclude=null, commitPartitionId=TablePartitionIdMessageImpl 
[partitionId=17, tableId=90], 
coordinatorId=125b397c-0404-4dcf-a28b-625fe010ecef, 
enlistmentConsistencyToken=112165039282455690, exactKey=null, flags=0, 
full=false, groupId=92_part_7, indexToUse=null, lowerBoundPrefix=null, 
scanId=20361, timestampLong=112165039967305730, 
transactionId=018e7d82-647b-0030-63a2-6a190001, upperBoundPrefix=null]].
java.util.concurrent.CompletionException: 
org.apache.ignite.tx.TransactionException: IGN-TX-14 
TraceId:6612dad8-4a32-4453-8af0-0139e336aad9 Transaction is already finished.
    at 
java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331)
 ~[?:?]
    at 
java.base/java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1099)
 ~[?:?]
    at 
java.base/java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2235)
 ~[?:?]
    at 
org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.processOperationRequest(PartitionReplicaListener.java:660)
 ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
    at 
org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.processOperationRequestWithTxRwCounter(PartitionReplicaListener.java:3860)
 ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
    at 
org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.lambda$processRequest$5(PartitionReplicaListener.java:436)
 ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
    at 
java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1072)
 ~[?:?]
    at 
java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
 [?:?]
    at 
java.base/java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:610)
 [?:?]
    at 
java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:649)
 [?:?]
    at 
java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)
 [?:?]
    at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
 [?:?]
    at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
 [?:?]
    at java.base/java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: org.apache.ignite.tx.TransactionException: Transaction is already 
finished.
    at 
org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.appendTxCommand(PartitionReplicaListener.java:1937)
 ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
    at 
org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.processOperationRequest(PartitionReplicaListener.java:659)
 ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
    ... 10 more{code}
 

It happens in PartitionReplicaListener because the local volatile tx state is 
null or final when trying to compute a value for the txCleanupReadyFutures map:
{code:java}
txCleanupReadyFutures.compute(txId, (id, txOps) -> {
// First check whether the transaction has already been finished.
// And complete cleanupReadyFut with exception if it is the case.
TxStateMeta txStateMeta = txManager.stateMeta(txId);

if (txStateMeta == null || isFinalState(txStateMeta.txState())) {
cleanupReadyFut.completeExceptionally(new Exception());

return txOps;
}

// Otherwise collect cleanupReadyFut in the transaction's futures.
if (txOps == null) {
txOps = new TxCleanupReadyFutureList();
}

txOps.futures.computeIfAbsent(cmdType, type -> new HashMap<>()).put(opId, 
cleanupReadyFut);

return txOps;
});

if (cleanupReadyFut.isCompletedExceptionally()) {
return failedFuture(new TransactionException(TX_ALREADY_FINISHED_ERR, 
"Transaction is already finished."));
}{code}
The first problem is that we don't actually know the real state from this 
exception.

The second one is the exception itself, because it shouldn't happen. We 
shouldn't meet a null state, because it's updated to pending just before, and 
it can be vacuumed only after it becomes final. 

A committed state is also not possible because we wait for all in-flights 
before the state transition. It can be the Aborted state here, but there 
should be no exception in the logs in this case.

In our case, the transaction is most likely aborted because of a replication 
timeout exception that happened before (it would be nice to see the tx id in 
this exception as well).

Full log is attached.

*Definition of done:*
 * no TransactionException in log in case of aborted 
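To address the first problem, here is a hedged sketch of carrying the tx id 
and the observed state in the exception text (stand-in types below, not the 
real Ignite API), so the log at least distinguishes a vacuumed (null) state 
from ABORTED/COMMITTED:

{code:java}
import java.util.concurrent.CompletableFuture;

final class TxAlreadyFinishedSketch {
    enum TxState { PENDING, COMMITTED, ABORTED }

    static CompletableFuture<Void> failAlreadyFinished(Object txId, TxState observed) {
        // observed == null would mean the volatile state was already vacuumed.
        return CompletableFuture.failedFuture(new IllegalStateException(
                "Transaction is already finished [txId=" + txId
                        + ", state=" + observed + "]."));
    }
}
{code}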

[jira] [Updated] (IGNITE-21861) Unexpected "Transaction is already finished" exception

2024-03-28 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov updated IGNITE-21861:
--
Description: 
Exception in log:
{code:java}
[2024-03-27T01:24:46,636][WARN 
][%idt_n_1%partition-operations-4][ReplicaManager] Failed to process replica 
request [request=ReadWriteScanRetrieveBatchReplicaRequestImpl [batchSize=512, 
columnsToInclude=null, commitPartitionId=TablePartitionIdMessageImpl 
[partitionId=17, tableId=90], 
coordinatorId=125b397c-0404-4dcf-a28b-625fe010ecef, 
enlistmentConsistencyToken=112165039282455690, exactKey=null, flags=0, 
full=false, groupId=92_part_7, indexToUse=null, lowerBoundPrefix=null, 
scanId=20361, timestampLong=112165039967305730, 
transactionId=018e7d82-647b-0030-63a2-6a190001, upperBoundPrefix=null]].
java.util.concurrent.CompletionException: 
org.apache.ignite.tx.TransactionException: IGN-TX-14 
TraceId:6612dad8-4a32-4453-8af0-0139e336aad9 Transaction is already finished.
    at 
java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331)
 ~[?:?]
    at 
java.base/java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1099)
 ~[?:?]
    at 
java.base/java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2235)
 ~[?:?]
    at 
org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.processOperationRequest(PartitionReplicaListener.java:660)
 ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
    at 
org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.processOperationRequestWithTxRwCounter(PartitionReplicaListener.java:3860)
 ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
    at 
org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.lambda$processRequest$5(PartitionReplicaListener.java:436)
 ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
    at 
java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1072)
 ~[?:?]
    at 
java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
 [?:?]
    at 
java.base/java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:610)
 [?:?]
    at 
java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:649)
 [?:?]
    at 
java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)
 [?:?]
    at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
 [?:?]
    at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
 [?:?]
    at java.base/java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: org.apache.ignite.tx.TransactionException: Transaction is already 
finished.
    at 
org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.appendTxCommand(PartitionReplicaListener.java:1937)
 ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
    at 
org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.processOperationRequest(PartitionReplicaListener.java:659)
 ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
    ... 10 more{code}
 

It happens in PartitionReplicaListener because the local volatile tx state is 
null or final when trying to compute a value for the txCleanupReadyFutures map:
{code:java}
txCleanupReadyFutures.compute(txId, (id, txOps) -> {
// First check whether the transaction has already been finished.
// And complete cleanupReadyFut with exception if it is the case.
TxStateMeta txStateMeta = txManager.stateMeta(txId);

if (txStateMeta == null || isFinalState(txStateMeta.txState())) {
cleanupReadyFut.completeExceptionally(new Exception());

return txOps;
}

// Otherwise collect cleanupReadyFut in the transaction's futures.
if (txOps == null) {
txOps = new TxCleanupReadyFutureList();
}

txOps.futures.computeIfAbsent(cmdType, type -> new HashMap<>()).put(opId, 
cleanupReadyFut);

return txOps;
});

if (cleanupReadyFut.isCompletedExceptionally()) {
return failedFuture(new TransactionException(TX_ALREADY_FINISHED_ERR, 
"Transaction is already finished."));
}{code}
The first problem is that we don't actually know the real state from this 
exception.

The second one is the exception itself, because it shouldn't happen. We 
shouldn't meet a null state, because it's updated to pending just before, and 
it can be vacuumed only after it becomes final. 

A committed state is also not possible because we wait for all in-flights 
before the state transition. It can be the Aborted state here, but there 
should be no exception in the logs in this case.

In our case, the transaction is most likely aborted because of a replication 
timeout exception that happened before (it would be nice to see the tx id in 
this exception as well).

Full log is attached.

*Definition of done:*
 * no TransactionException in log in case of aborted 

[jira] [Updated] (IGNITE-21861) Unexpected "Transaction is already finished" exception

2024-03-28 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov updated IGNITE-21861:
--
Description: 
Exception in log:
{code:java}
[2024-03-27T01:24:46,636][WARN 
][%idt_n_1%partition-operations-4][ReplicaManager] Failed to process replica 
request [request=ReadWriteScanRetrieveBatchReplicaRequestImpl [batchSize=512, 
columnsToInclude=null, commitPartitionId=TablePartitionIdMessageImpl 
[partitionId=17, tableId=90], 
coordinatorId=125b397c-0404-4dcf-a28b-625fe010ecef, 
enlistmentConsistencyToken=112165039282455690, exactKey=null, flags=0, 
full=false, groupId=92_part_7, indexToUse=null, lowerBoundPrefix=null, 
scanId=20361, timestampLong=112165039967305730, 
transactionId=018e7d82-647b-0030-63a2-6a190001, upperBoundPrefix=null]].
java.util.concurrent.CompletionException: 
org.apache.ignite.tx.TransactionException: IGN-TX-14 
TraceId:6612dad8-4a32-4453-8af0-0139e336aad9 Transaction is already finished.
    at 
java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331)
 ~[?:?]
    at 
java.base/java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1099)
 ~[?:?]
    at 
java.base/java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2235)
 ~[?:?]
    at 
org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.processOperationRequest(PartitionReplicaListener.java:660)
 ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
    at 
org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.processOperationRequestWithTxRwCounter(PartitionReplicaListener.java:3860)
 ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
    at 
org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.lambda$processRequest$5(PartitionReplicaListener.java:436)
 ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
    at 
java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1072)
 ~[?:?]
    at 
java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
 [?:?]
    at 
java.base/java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:610)
 [?:?]
    at 
java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:649)
 [?:?]
    at 
java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)
 [?:?]
    at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
 [?:?]
    at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
 [?:?]
    at java.base/java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: org.apache.ignite.tx.TransactionException: Transaction is already 
finished.
    at 
org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.appendTxCommand(PartitionReplicaListener.java:1937)
 ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
    at 
org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.processOperationRequest(PartitionReplicaListener.java:659)
 ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
    ... 10 more{code}
 

It happens in PartitionReplicaListener because the local volatile tx state is 
null or final when trying to compute a value for the txCleanupReadyFutures map:
{code:java}
txCleanupReadyFutures.compute(txId, (id, txOps) -> {
// First check whether the transaction has already been finished.
// And complete cleanupReadyFut with exception if it is the case.
TxStateMeta txStateMeta = txManager.stateMeta(txId);

if (txStateMeta == null || isFinalState(txStateMeta.txState())) {
cleanupReadyFut.completeExceptionally(new Exception());

return txOps;
}

// Otherwise collect cleanupReadyFut in the transaction's futures.
if (txOps == null) {
txOps = new TxCleanupReadyFutureList();
}

txOps.futures.computeIfAbsent(cmdType, type -> new HashMap<>()).put(opId, 
cleanupReadyFut);

return txOps;
});

if (cleanupReadyFut.isCompletedExceptionally()) {
return failedFuture(new TransactionException(TX_ALREADY_FINISHED_ERR, 
"Transaction is already finished."));
}{code}
The first problem is that we don't actually know the real state from this 
exception.

The second one is the exception itself, because it shouldn't happen. We 
shouldn't meet a null state, because it's updated to pending just before, and 
it can be vacuumed only after it becomes final. 

A committed state is also not possible because we wait for all in-flights 
before the state transition. It can be the Aborted state here, but there 
should be no exception in the logs in this case.

In our case, the transaction is most likely aborted because of a replication 
timeout exception that happened before (it would be nice to see the tx id in 
this exception as well).

Full log is attached.

*Definition of done:*
 * no TransactionException in log in case of aborted 

[jira] [Updated] (IGNITE-21861) Unexpected "Transaction is already finished" exception

2024-03-28 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov updated IGNITE-21861:
--
Attachment: _Integration_Tests_Module_SQL_Engine_4133_.log

> Unexpected "Transaction is already finished" exception 
> ---
>
> Key: IGNITE-21861
> URL: https://issues.apache.org/jira/browse/IGNITE-21861
> Project: Ignite
>  Issue Type: Bug
>Reporter: Denis Chudov
>Priority: Major
>  Labels: ignite-3
> Attachments: _Integration_Tests_Module_SQL_Engine_4133_.log
>
>
> Exception in log:
> {code:java}
> [2024-03-27T01:24:46,636][WARN 
> ][%idt_n_1%partition-operations-4][ReplicaManager] Failed to process replica 
> request [request=ReadWriteScanRetrieveBatchReplicaRequestImpl [batchSize=512, 
> columnsToInclude=null, commitPartitionId=TablePartitionIdMessageImpl 
> [partitionId=17, tableId=90], 
> coordinatorId=125b397c-0404-4dcf-a28b-625fe010ecef, 
> enlistmentConsistencyToken=112165039282455690, exactKey=null, flags=0, 
> full=false, groupId=92_part_7, indexToUse=null, lowerBoundPrefix=null, 
> scanId=20361, timestampLong=112165039967305730, 
> transactionId=018e7d82-647b-0030-63a2-6a190001, upperBoundPrefix=null]].
> java.util.concurrent.CompletionException: 
> org.apache.ignite.tx.TransactionException: IGN-TX-14 
> TraceId:6612dad8-4a32-4453-8af0-0139e336aad9 Transaction is already finished.
>     at 
> java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331)
>  ~[?:?]
>     at 
> java.base/java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1099)
>  ~[?:?]
>     at 
> java.base/java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2235)
>  ~[?:?]
>     at 
> org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.processOperationRequest(PartitionReplicaListener.java:660)
>  ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
>     at 
> org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.processOperationRequestWithTxRwCounter(PartitionReplicaListener.java:3860)
>  ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
>     at 
> org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.lambda$processRequest$5(PartitionReplicaListener.java:436)
>  ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
>     at 
> java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1072)
>  ~[?:?]
>     at 
> java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
>  [?:?]
>     at 
> java.base/java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:610)
>  [?:?]
>     at 
> java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:649)
>  [?:?]
>     at 
> java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)
>  [?:?]
>     at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>  [?:?]
>     at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>  [?:?]
>     at java.base/java.lang.Thread.run(Thread.java:834) [?:?]
> Caused by: org.apache.ignite.tx.TransactionException: Transaction is already 
> finished.
>     at 
> org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.appendTxCommand(PartitionReplicaListener.java:1937)
>  ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
>     at 
> org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.processOperationRequest(PartitionReplicaListener.java:659)
>  ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
>     ... 10 more{code}
> It happens in PartitionReplicaListener because the local volatile tx state is 
> null or final when trying to compute a value for the txCleanupReadyFutures 
> map:
> {code:java}
> txCleanupReadyFutures.compute(txId, (id, txOps) -> {
> // First check whether the transaction has already been finished.
> // And complete cleanupReadyFut with exception if it is the case.
> TxStateMeta txStateMeta = txManager.stateMeta(txId);
> if (txStateMeta == null || isFinalState(txStateMeta.txState())) {
> cleanupReadyFut.completeExceptionally(new Exception());
> return txOps;
> }
> // Otherwise collect cleanupReadyFut in the transaction's futures.
> if (txOps == null) {
> txOps = new TxCleanupReadyFutureList();
> }
> txOps.futures.computeIfAbsent(cmdType, type -> new HashMap<>()).put(opId, 
> cleanupReadyFut);
> return txOps;
> });
> if (cleanupReadyFut.isCompletedExceptionally()) {
> return failedFuture(new TransactionException(TX_ALREADY_FINISHED_ERR, 
> "Transaction is already finished."));
> }{code}
> The first problem is that we don't actually know the real state from this 
> exception.

[jira] [Updated] (IGNITE-21861) Unexpected "Transaction is already finished" exception

2024-03-28 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov updated IGNITE-21861:
--
Description: 
Exception in log:
{code:java}
[2024-03-27T01:24:46,636][WARN 
][%idt_n_1%partition-operations-4][ReplicaManager] Failed to process replica 
request [request=ReadWriteScanRetrieveBatchReplicaRequestImpl [batchSize=512, 
columnsToInclude=null, commitPartitionId=TablePartitionIdMessageImpl 
[partitionId=17, tableId=90], 
coordinatorId=125b397c-0404-4dcf-a28b-625fe010ecef, 
enlistmentConsistencyToken=112165039282455690, exactKey=null, flags=0, 
full=false, groupId=92_part_7, indexToUse=null, lowerBoundPrefix=null, 
scanId=20361, timestampLong=112165039967305730, 
transactionId=018e7d82-647b-0030-63a2-6a190001, upperBoundPrefix=null]].
java.util.concurrent.CompletionException: 
org.apache.ignite.tx.TransactionException: IGN-TX-14 
TraceId:6612dad8-4a32-4453-8af0-0139e336aad9 Transaction is already finished.
    at 
java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331)
 ~[?:?]
    at 
java.base/java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1099)
 ~[?:?]
    at 
java.base/java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2235)
 ~[?:?]
    at 
org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.processOperationRequest(PartitionReplicaListener.java:660)
 ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
    at 
org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.processOperationRequestWithTxRwCounter(PartitionReplicaListener.java:3860)
 ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
    at 
org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.lambda$processRequest$5(PartitionReplicaListener.java:436)
 ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
    at 
java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1072)
 ~[?:?]
    at 
java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
 [?:?]
    at 
java.base/java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:610)
 [?:?]
    at 
java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:649)
 [?:?]
    at 
java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)
 [?:?]
    at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
 [?:?]
    at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
 [?:?]
    at java.base/java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: org.apache.ignite.tx.TransactionException: Transaction is already 
finished.
    at 
org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.appendTxCommand(PartitionReplicaListener.java:1937)
 ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
    at 
org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.processOperationRequest(PartitionReplicaListener.java:659)
 ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
    ... 10 more{code}
It happens in PartitionReplicaListener because the local volatile tx state is 
null or final when trying to compute a value for the txCleanupReadyFutures map:
{code:java}
txCleanupReadyFutures.compute(txId, (id, txOps) -> {
// First check whether the transaction has already been finished.
// And complete cleanupReadyFut with exception if it is the case.
TxStateMeta txStateMeta = txManager.stateMeta(txId);

if (txStateMeta == null || isFinalState(txStateMeta.txState())) {
cleanupReadyFut.completeExceptionally(new Exception());

return txOps;
}

// Otherwise collect cleanupReadyFut in the transaction's futures.
if (txOps == null) {
txOps = new TxCleanupReadyFutureList();
}

txOps.futures.computeIfAbsent(cmdType, type -> new HashMap<>()).put(opId, 
cleanupReadyFut);

return txOps;
});

if (cleanupReadyFut.isCompletedExceptionally()) {
return failedFuture(new TransactionException(TX_ALREADY_FINISHED_ERR, 
"Transaction is already finished."));
}{code}
The first problem is that we don't actually know the real state from this 
exception.

The second one is the exception itself, because it shouldn't happen. We 
shouldn't meet a null state, because it's updated to pending just before, and 
it can be vacuumed only after it becomes final. 

A committed state is also not possible because we wait for all in-flights 
before the state transition. It can be the Aborted state here, but there 
should be no exception in the logs in this case.

In our case, the transaction is most likely aborted because of a replication 
timeout exception that happened before (it would be nice to see the tx id in 
this exception as well).

Full log is attached.

  was:
Exception in log:
{code:java}
[2024-03-27T01:24:46,636][WARN 

[jira] [Updated] (IGNITE-21861) Unexpected "Transaction is already finished" exception

2024-03-28 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov updated IGNITE-21861:
--
Description: 
Exception in log:
{code:java}
[2024-03-27T01:24:46,636][WARN 
][%idt_n_1%partition-operations-4][ReplicaManager] Failed to process replica 
request [request=ReadWriteScanRetrieveBatchReplicaRequestImpl [batchSize=512, 
columnsToInclude=null, commitPartitionId=TablePartitionIdMessageImpl 
[partitionId=17, tableId=90], 
coordinatorId=125b397c-0404-4dcf-a28b-625fe010ecef, 
enlistmentConsistencyToken=112165039282455690, exactKey=null, flags=0, 
full=false, groupId=92_part_7, indexToUse=null, lowerBoundPrefix=null, 
scanId=20361, timestampLong=112165039967305730, 
transactionId=018e7d82-647b-0030-63a2-6a190001, upperBoundPrefix=null]].
java.util.concurrent.CompletionException: 
org.apache.ignite.tx.TransactionException: IGN-TX-14 
TraceId:6612dad8-4a32-4453-8af0-0139e336aad9 Transaction is already finished.
    at 
java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331)
 ~[?:?]
    at 
java.base/java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1099)
 ~[?:?]
    at 
java.base/java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2235)
 ~[?:?]
    at 
org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.processOperationRequest(PartitionReplicaListener.java:660)
 ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
    at 
org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.processOperationRequestWithTxRwCounter(PartitionReplicaListener.java:3860)
 ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
    at 
org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.lambda$processRequest$5(PartitionReplicaListener.java:436)
 ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
    at 
java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1072)
 ~[?:?]
    at 
java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
 [?:?]
    at 
java.base/java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:610)
 [?:?]
    at 
java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:649)
 [?:?]
    at 
java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)
 [?:?]
    at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
 [?:?]
    at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
 [?:?]
    at java.base/java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: org.apache.ignite.tx.TransactionException: Transaction is already 
finished.
    at 
org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.appendTxCommand(PartitionReplicaListener.java:1937)
 ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
    at 
org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.processOperationRequest(PartitionReplicaListener.java:659)
 ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
    ... 10 more{code}
It happens in PartitionReplicaListener because the local volatile tx state is 
null or final when trying to compute a value for the txCleanupReadyFutures map:
{code:java}
txCleanupReadyFutures.compute(txId, (id, txOps) -> {
// First check whether the transaction has already been finished.
// And complete cleanupReadyFut with exception if it is the case.
TxStateMeta txStateMeta = txManager.stateMeta(txId);

if (txStateMeta == null || isFinalState(txStateMeta.txState())) {
cleanupReadyFut.completeExceptionally(new Exception());

return txOps;
}

// Otherwise collect cleanupReadyFut in the transaction's futures.
if (txOps == null) {
txOps = new TxCleanupReadyFutureList();
}

txOps.futures.computeIfAbsent(cmdType, type -> new HashMap<>()).put(opId, 
cleanupReadyFut);

return txOps;
});

if (cleanupReadyFut.isCompletedExceptionally()) {
return failedFuture(new TransactionException(TX_ALREADY_FINISHED_ERR, 
"Transaction is already finished."));
}{code}
The first problem is that we don't actually know the real state from this 
exception.

The second one is the exception itself, because it shouldn't happen. We 
shouldn't meet a null state, because it's updated to pending just before, and 
it can be vacuumed only after it becomes final. 

 

  was:
Exception in log:
{code:java}
[2024-03-27T01:24:46,636][WARN 
][%idt_n_1%partition-operations-4][ReplicaManager] Failed to process replica 
request [request=ReadWriteScanRetrieveBatchReplicaRequestImpl [batchSize=512, 
columnsToInclude=null, commitPartitionId=TablePartitionIdMessageImpl 
[partitionId=17, tableId=90], 
coordinatorId=125b397c-0404-4dcf-a28b-625fe010ecef, 
enlistmentConsistencyToken=112165039282455690, exactKey=null, flags=0, 
full=false, 

[jira] [Created] (IGNITE-21861) Unexpected "Transaction is already finished" exception

2024-03-28 Thread Denis Chudov (Jira)
Denis Chudov created IGNITE-21861:
-

 Summary: Unexpected "Transaction is already finished" exception 
 Key: IGNITE-21861
 URL: https://issues.apache.org/jira/browse/IGNITE-21861
 Project: Ignite
  Issue Type: Bug
Reporter: Denis Chudov


Exception in log:
{code:java}
[2024-03-27T01:24:46,636][WARN 
][%idt_n_1%partition-operations-4][ReplicaManager] Failed to process replica 
request [request=ReadWriteScanRetrieveBatchReplicaRequestImpl [batchSize=512, 
columnsToInclude=null, commitPartitionId=TablePartitionIdMessageImpl 
[partitionId=17, tableId=90], 
coordinatorId=125b397c-0404-4dcf-a28b-625fe010ecef, 
enlistmentConsistencyToken=112165039282455690, exactKey=null, flags=0, 
full=false, groupId=92_part_7, indexToUse=null, lowerBoundPrefix=null, 
scanId=20361, timestampLong=112165039967305730, 
transactionId=018e7d82-647b-0030-63a2-6a190001, upperBoundPrefix=null]].
java.util.concurrent.CompletionException: 
org.apache.ignite.tx.TransactionException: IGN-TX-14 
TraceId:6612dad8-4a32-4453-8af0-0139e336aad9 Transaction is already finished.
    at 
java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331)
 ~[?:?]
    at 
java.base/java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1099)
 ~[?:?]
    at 
java.base/java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2235)
 ~[?:?]
    at 
org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.processOperationRequest(PartitionReplicaListener.java:660)
 ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
    at 
org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.processOperationRequestWithTxRwCounter(PartitionReplicaListener.java:3860)
 ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
    at 
org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.lambda$processRequest$5(PartitionReplicaListener.java:436)
 ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
    at 
java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1072)
 ~[?:?]
    at 
java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
 [?:?]
    at 
java.base/java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:610)
 [?:?]
    at 
java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:649)
 [?:?]
    at 
java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)
 [?:?]
    at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
 [?:?]
    at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
 [?:?]
    at java.base/java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: org.apache.ignite.tx.TransactionException: Transaction is already 
finished.
    at 
org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.appendTxCommand(PartitionReplicaListener.java:1937)
 ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
    at 
org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.processOperationRequest(PartitionReplicaListener.java:659)
 ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
    ... 10 more{code}
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21572) One phase transaction protocol is inconsistent in case of primary replica expirations

2024-03-27 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov updated IGNITE-21572:
--
Reviewer: Alexander Lapin

> One phase transaction protocol is inconsistent in case of primary replica 
> expirations
> 
>
> Key: IGNITE-21572
> URL: https://issues.apache.org/jira/browse/IGNITE-21572
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexander Lapin
>Assignee: Denis Chudov
>Priority: Critical
>  Labels: ignite-3
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> h3. Motivation
> Consider the following scenario:
>  # A full (1PC) transaction tx1 starts on PrimaryReplica1 [leaseholder='X', 
> startTime='t1', endTime='t10'].
>  # Within the given 1PC transaction a 2-phase operation is evaluated over 
> key1, e.g. replace or increment (we do not have an increment operation; 
> however, it's easy to explain the problem with it, so let's assume that we 
> have one).
>  # Within increment processing, the processor acquires a lock on key1, reads 
> the corresponding value and is about to write the new one.
>  # At this point, PrimaryReplica leaseholder='X' expires.
>  # Another transaction tx2 starts on the new PrimaryReplica2 
> [leaseholder='Y', startTime='t11', endTime='t21'].
>  # Within tx2 the user also calls increment, thus also acquires the lock, 
> reads the old value and writes a new one.
>  # tx2 finishes.
>  # tx1 successfully writes tx1.newValue, overriding the value from tx2.
> All in all, because tx2 didn't see tx1's locks (the primary was changed), the 
> two increments collapse into one: instead of key1 being incremented twice, 
> the transactions will finish with key1 incremented only once, which is of 
> course not valid.
> h3. Definition of Done
>  * Bug is fixed.
> h3. Implementation Notes
>  * As a fast fix we should use 1PC (full) transactions only in case of 
> one-phase operations, like put. All two-phase operations like replace, 
> deleteExact, etc. should be evaluated within a common 2PC transaction (see 
> the sketch after this quote).
>  * Besides the fast fix, we should consider supporting invoke as a raft 
> command that will effectively convert read+write to an atomic operation.
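A minimal sketch of the proposed fast fix, with assumed operation names: only 
blind single-phase writes may take the 1PC (full) path, while read-modify-write 
operations fall back to a common 2PC transaction:

{code:java}
final class CommitPathSketch {
    enum Op { PUT, DELETE, REPLACE, DELETE_EXACT, GET_AND_PUT }

    static boolean onePhaseAllowed(Op op) {
        switch (op) {
            case PUT:
            case DELETE:
                // Blind writes: no read happens under the lock, so a lease
                // change between read and write cannot lose an update.
                return true;
            default:
                // Read+write pairs must go through the common 2PC protocol.
                return false;
        }
    }
}
{code}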



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (IGNITE-18879) Leaseholder candidates balancing

2024-03-26 Thread Denis Chudov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-18879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831109#comment-17831109
 ] 

Denis Chudov commented on IGNITE-18879:
---

[~yexiaowei] all nodes in the cluster are learners for the meta storage group, 
and the information about leases is distributed to all learners. We use the 
fact that if a lease is accepted, it can't be revoked, and that lease intervals 
are disjoint. Hence outdated data doesn't break anything. Speaking about 
#currentLease, it is used only in the context of the local node and is, in 
fact, just the currently known information; but the lease is unique 
cluster-wide within its interval, so it can't break the distributed mechanisms.

> Leaseholder candidates balancing
> 
>
> Key: IGNITE-18879
> URL: https://issues.apache.org/jira/browse/IGNITE-18879
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Denis Chudov
>Priority: Major
>  Labels: ignite-3
>
> *Motivation*
> Primary replicas (leaseholders) should be evenly distributed over the cluster 
> to balance the transactional load between nodes. As the placement driver 
> assigns primary replicas, balancing them is also its responsibility. A naive 
> implementation of balancing should choose a node as the leaseholder candidate 
> in a way that preserves an even lease distribution over all nodes. In a real 
> cluster, it may take into account slow nodes, hot table records, etc. If a 
> lease candidate declines a LeaseGrantMessage from the placement driver, the 
> balancer should decide whether to choose another candidate for the given 
> primary replica or to enforce the previously chosen one. So the balancing 
> algorithm should be pluggable, so that we can improve/replace/compare it with 
> others.
> *Definition of done*
> An interface for the lease candidate balancer is introduced, along with a 
> simple implementation sustaining an even lease distribution, which is used by 
> the placement driver by default. No public or internal configuration is 
> needed at this stage.
> *Implementation notes*
> The lease candidate balancer should have at least 2 methods (see the sketch 
> below):
>  - {_}get(group, ignoredNodes){_}: returns a candidate for the given group; a 
> node from the ignoredNodes set can't be chosen as a candidate
>  - {_}considerRedirectProposal(group, candidate, proposedCandidate){_}: 
> processes a redirect proposal for the given group provided by the given 
> candidate (previously chosen using the _get_ method); proposedCandidate is 
> the alternative candidate. Returns the candidate that should be enforced by 
> the placement driver.
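A sketch of the balancer interface following the implementation notes above; 
the type parameters and exact signatures are assumptions for illustration:

{code:java}
import java.util.Set;

interface LeaseholderBalancer<G, N> {
    /** Returns a leaseholder candidate for the group; nodes from ignoredNodes must not be chosen. */
    N get(G group, Set<N> ignoredNodes);

    /**
     * Processes a redirect proposal for the given group, provided by the
     * candidate previously chosen via {@link #get}. Returns the candidate the
     * placement driver should enforce: either the proposed alternative or the
     * originally chosen one.
     */
    N considerRedirectProposal(G group, N candidate, N proposedCandidate);
}
{code}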



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (IGNITE-21382) Test ItPrimaryReplicaChoiceTest.testPrimaryChangeLongHandling is flaky

2024-03-26 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov reassigned IGNITE-21382:
-

Assignee: Denis Chudov

> Test ItPrimaryReplicaChoiceTest.testPrimaryChangeLongHandling is flaky
> --
>
> Key: IGNITE-21382
> URL: https://issues.apache.org/jira/browse/IGNITE-21382
> Project: Ignite
>  Issue Type: Bug
>Reporter: Vladislav Pyatkov
>Assignee: Denis Chudov
>Priority: Major
>  Labels: ignite-3
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The test fails while waiting for the primary replica change. This issue is 
> also reproduced locally, at least once per five passes.
> {code}
> assertThat(primaryChangeTask, willCompleteSuccessfully());
> {code}
> {noformat}
> java.lang.AssertionError: java.util.concurrent.TimeoutException
>   at 
> org.apache.ignite.internal.testframework.matchers.CompletableFutureMatcher.matchesSafely(CompletableFutureMatcher.java:78)
>   at 
> org.apache.ignite.internal.testframework.matchers.CompletableFutureMatcher.matchesSafely(CompletableFutureMatcher.java:35)
>   at org.hamcrest.TypeSafeMatcher.matches(TypeSafeMatcher.java:67)
>   at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:10)
>   at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:6)
>   at 
> org.apache.ignite.internal.placementdriver.ItPrimaryReplicaChoiceTest.testPrimaryChangeLongHandling(ItPrimaryReplicaChoiceTest.java:179)
> {noformat}
> This test will be muted on TC to prevent future failures.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (IGNITE-21572) One phase transaction protocol is inconsistent in case of primary replica expirations

2024-03-13 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov reassigned IGNITE-21572:
-

Assignee: Denis Chudov

> One phase transaction protocol is inconsistent in case of primary replica 
> expirations
> 
>
> Key: IGNITE-21572
> URL: https://issues.apache.org/jira/browse/IGNITE-21572
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexander Lapin
>Assignee: Denis Chudov
>Priority: Critical
>  Labels: ignite-3
>
> h3. Motivation
> Consider the following scenario:
>  # A full (1PC) transaction tx1 starts on PrimaryReplica1 [leaseholder='X', 
> startTime='t1', endTime='t10'].
>  # Within the given 1PC transaction, a two-phase operation is evaluated over 
> key1, e.g. replace or increment (we do not have an increment operation; 
> however, it's easy to explain the problem with it, so let's assume that we 
> have one).
>  # Within increment processing, the processor acquires a lock on key1, reads 
> the corresponding value and is about to write the new one.
>  # At this point, the PrimaryReplica1 lease [leaseholder='X'] expires.
>  # Another transaction tx2 starts on the new PrimaryReplica2 
> [leaseholder='Y', startTime='t11', endTime='t21'].
>  # Within tx2 the user also calls increment, thus also acquires the lock, 
> reads the old value and writes a new one.
>  # tx2 finishes.
>  # tx1 successfully writes tx1.newValue, overriding the value from tx2.
> All in all, because tx2 didn't see tx1's locks (the primary was changed), 
> instead of (key1++)++ the transactions will finish with (key1)++, which is of 
> course not valid.
> h3. Definition of Done
>  * Bug is fixed.
> h3. Implementation Notes
>  * As a fast fix we should use 1PC (full) transactions only in case of 
> one-phase operations, like put. All two-phase operations, like replace, 
> deleteExact, etc., should be evaluated within a common 2PC transaction (a 
> sketch follows below).
>  * Besides the fast fix, we should consider supporting invoke as a raft 
> command that will effectively convert read+write into an atomic operation.
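A minimal sketch of the fast-fix gate described above; the operation enum below 
is hypothetical and only illustrates the rule:
{code:java}
// Sketch only: a blind single-phase write may take the 1PC (full) path, while
// any read+write operation must go through ordinary 2PC so that its locks
// survive a primary replica change.
enum Op { PUT, REPLACE, DELETE_EXACT }

static boolean onePhaseCommitAllowed(Op op) {
    switch (op) {
        case PUT:
            return true;  // blind write: no read under the acquired lock
        default:
            return false; // replace, deleteExact, ...: read + write, 2PC only
    }
}
{code}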



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21348) Trigger the lease negotiation retry in case when the lease candidate is no more contained in assignments

2024-03-12 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov updated IGNITE-21348:
--
Reviewer: Vladislav Pyatkov

> Trigger the lease negotiation retry in case when the lease candidate is no 
> more contained in assignments
> 
>
> Key: IGNITE-21348
> URL: https://issues.apache.org/jira/browse/IGNITE-21348
> Project: Ignite
>  Issue Type: Bug
>Reporter: Denis Chudov
>Assignee: Denis Chudov
>Priority: Major
>  Labels: ignite-3
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> On receiving the "lease granted" message, the candidate replica tries to 
> catch up with the actual storage state; in order to do that, it makes a read 
> index request. But when this candidate is no longer a member of the 
> assignments (and of the replication group), this request fails and is retried 
> until the lease negotiation interval is exceeded. This makes no sense, 
> because such retries will not be successful, and the current candidate is not 
> a good candidate anymore: although the leaseholder may not be a part of the 
> replication group, preferably it should be, and it should be its leader.
> Assignment changes in which some of the current candidates and leaseholders 
> are no longer included in the new assignment set should be detected on the 
> placement driver active actor, and the current lease should be revoked (if 
> negotiation is in progress) or not prolonged. A new negotiation will be 
> triggered automatically by the lease updater.
> *Implementation notes*
> This assignment change detection should be done on the placement driver side 
> (a sketch follows below), because the events of assignment changes can be 
> processed on different nodes at different times, and there is already an 
> assignments tracker as a part of the placement driver.
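A hypothetical sketch of the detection on the placement driver active actor; 
leaseTracker, revokeLease and denyProlongation are illustrative names, not 
existing APIs:
{code:java}
import java.util.Set;

// Sketch only: when new assignments arrive, a lease whose holder is no longer
// in the assignment set is revoked or not prolonged.
void onAssignmentsChanged(ReplicationGroupId groupId, Set<String> newAssignments) {
    Lease lease = leaseTracker.getLease(groupId);

    if (lease != null && !newAssignments.contains(lease.getLeaseholder())) {
        if (!lease.isAccepted()) {
            revokeLease(groupId);      // negotiation in progress: cancel it
        } else {
            denyProlongation(groupId); // accepted lease: just let it expire
        }
        // The lease updater will trigger a new negotiation automatically.
    }
}
{code}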



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (IGNITE-21712) Hybrid time is not adjusted when handling some of transaction non-replica requests

2024-03-08 Thread Denis Chudov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824821#comment-17824821
 ] 

Denis Chudov commented on IGNITE-21712:
---

Most likely, the time adjustment is not needed for 
FinishedTransactionsBatchMessage and TxCleanupMessage. We should think about it.

> Hybrid time is not adjusted when handling some of transaction non-replica 
> requests
> --
>
> Key: IGNITE-21712
> URL: https://issues.apache.org/jira/browse/IGNITE-21712
> Project: Ignite
>  Issue Type: Bug
>Reporter: Denis Chudov
>Priority: Major
>  Labels: ignite-3
>
> For example, TxStateResponse extends the TimestampAware interface and is a 
> part of the transaction flow; the hybrid time should be adjusted when 
> handling TxStateResponse, but it doesn't happen (a sketch of the expected 
> adjustment follows below).
> We should also check classes extending TimestampAware in order to ensure that 
> the timestamp is adjusted in every case.
> Other message interfaces extending TimestampAware but having no time 
> adjustment:
> {code:java}
> FinishedTransactionsBatchMessage
> TxCleanupMessage
> TxStateResponse {code}
> Also, these interfaces are unused and maybe they can be deleted:
> {code:java}
> TxCleanupMessageResponse
> TxCleanupMessageErrorResponse
> TxFinishResponse
> {code}
>  
>  
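A sketch of the missing adjustment, assuming the handler has access to the 
node's hybrid clock; the handler method itself is illustrative:
{code:java}
// Sketch only: a handler of any TimestampAware message should advance the
// local hybrid clock to the timestamp carried by the message.
void onReceived(NetworkMessage message) {
    if (message instanceof TimestampAware) {
        HybridTimestamp ts = ((TimestampAware) message).timestamp();

        if (ts != null) {
            clock.update(ts); // adjust local hybrid time before processing
        }
    }

    process(message);
}
{code}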



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (IGNITE-18879) Leaseholder candidates balancing

2024-03-08 Thread Denis Chudov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-18879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824789#comment-17824789
 ] 

Denis Chudov commented on IGNITE-18879:
---

[~yexiaowei] the current PD implementation doesn't include any heartbeats sent 
to PD leaders, and I am not sure that such heartbeats will be added in order to 
measure the load of the nodes and adjust the lease distribution. Most likely it 
would be overengineering, but the exact implementation is not designed yet.

> Leaseholder candidates balancing
> 
>
> Key: IGNITE-18879
> URL: https://issues.apache.org/jira/browse/IGNITE-18879
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Denis Chudov
>Priority: Major
>  Labels: ignite-3
>
> *Motivation*
> Primary replicas (leaseholders) should be evenly distributed over the cluster 
> to balance the transactional load between nodes. As the placement driver 
> assigns primary replicas, balancing the primary replicas is also its 
> responsibility. A naive implementation of balancing should choose a node as a 
> leaseholder candidate in a way that preserves an even lease distribution over 
> all nodes. In a real cluster, it may take into account slow nodes, hot table 
> records, etc. If a lease candidate declines the LeaseGrantMessage from the 
> placement driver, the balancer should decide whether to choose another 
> candidate for the given primary replica or to enforce the previously chosen 
> one. So the balancing algorithm should be pluggable, so that we have the 
> ability to improve/replace/compare it with others.
> *Definition of done*
> An interface for the lease candidate balancer is introduced, along with a 
> simple implementation sustaining an even lease distribution, which is used by 
> the placement driver by default. No public or internal configuration is 
> needed at this stage.
> *Implementation notes*
> The lease candidate balancer should have at least 2 methods:
>  - {_}get(group, ignoredNodes){_}: returns a candidate for the given group; a 
> node from the ignoredNodes set can't be chosen as a candidate
>  - {_}considerRedirectProposal(group, candidate, proposedCandidate){_}: 
> processes a redirect proposal for the given group, provided by the given 
> candidate (previously chosen using the _get_ method); proposedCandidate is 
> the alternative candidate. Returns the candidate that should be enforced by 
> the placement driver.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (IGNITE-21382) Test ItPrimaryReplicaChoiceTest.testPrimaryChangeLongHandling is flaky

2024-03-08 Thread Denis Chudov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824772#comment-17824772
 ] 

Denis Chudov edited comment on IGNITE-21382 at 3/8/24 12:49 PM:


The problem is that NodeUtils#transferPrimary is not completed in 30 seconds. I 
would propose to rewrite this method without using 
RaftGroupService#transferLeadership. The primary replica doesn't have to be 
colocated with the raft leader, and we can use this in tests. We have 
StopLeaseProlongationMessage, which is intended to stop lease prolongation for 
a replica that has lost its ability to serve as a primary (at least, a 
preferred one), and NodeUtils#transferPrimary can be reworked in a way that it 
sends this message to the corresponding node that is the placement driver 
active actor (or, which is simpler for tests, just to every node; the message 
will be ignored on the other nodes).

The only problem is that StopLeaseProlongationMessage#redirectProposal is not 
handled by the placement driver correctly - this is a bug and should be fixed. 
After that we will obtain the ability to propose any node as the new primary 
and thus choose the new primary deliberately. Until IGNITE-18879 is done, the 
LeaseUpdater chooses the proposed leaseholder whenever it is present; it never 
enforces another node, and the possibility of that can be neglected.

After that, IGNITE-20365 might be closed as well.


was (Author: denis chudov):
The problem is that NodeUtils#transferPrimary is not completed in 30 seconds. I 
would propose to rewrite this method without using 
RaftGroupService#transferLeadership. The primary replica doesn't have to be 
colocated with the raft leader, and we can use this in tests. We have 
StopLeaseProlongationMessage, which is intended to stop lease prolongation for 
a replica that has lost its ability to serve as a primary (at least, a 
preferred one), and NodeUtils#transferPrimary can be reworked in a way that it 
sends this message to the corresponding node that is the placement driver 
active actor (or, which is simpler for tests, just to every node; the message 
will be ignored on the other nodes).

The only problem is that StopLeaseProlongationMessage#redirectProposal is not 
handled by the placement driver correctly - this is a bug and should be fixed. 
After that we will obtain the ability to propose any node as the new primary 
and thus choose the new primary deliberately.

After that, IGNITE-20365 might be closed as well.

> Test ItPrimaryReplicaChoiceTest.testPrimaryChangeLongHandling is flaky
> --
>
> Key: IGNITE-21382
> URL: https://issues.apache.org/jira/browse/IGNITE-21382
> Project: Ignite
>  Issue Type: Bug
>Reporter: Vladislav Pyatkov
>Priority: Major
>  Labels: ignite-3
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The test fails while waiting for the primary replica change. This issue is 
> also reproduced locally, at least once per five passes.
> {code}
> assertThat(primaryChangeTask, willCompleteSuccessfully());
> {code}
> {noformat}
> java.lang.AssertionError: java.util.concurrent.TimeoutException
>   at 
> org.apache.ignite.internal.testframework.matchers.CompletableFutureMatcher.matchesSafely(CompletableFutureMatcher.java:78)
>   at 
> org.apache.ignite.internal.testframework.matchers.CompletableFutureMatcher.matchesSafely(CompletableFutureMatcher.java:35)
>   at org.hamcrest.TypeSafeMatcher.matches(TypeSafeMatcher.java:67)
>   at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:10)
>   at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:6)
>   at 
> org.apache.ignite.internal.placementdriver.ItPrimaryReplicaChoiceTest.testPrimaryChangeLongHandling(ItPrimaryReplicaChoiceTest.java:179)
> {noformat}
> This test will be muted on TC to prevent future failures.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (IGNITE-21382) Test ItPrimaryReplicaChoiceTest.testPrimaryChangeLongHandling is flaky

2024-03-08 Thread Denis Chudov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824772#comment-17824772
 ] 

Denis Chudov commented on IGNITE-21382:
---

The problem is that NodeUtils#transferPrimary is not completed in 30 seconds. I 
would propose to rewrite this method without using 
RaftGroupService#transferLeadership. The primary replica doesn't have to be 
colocated with the raft leader, and we can use this in tests. We have 
StopLeaseProlongationMessage, which is intended to stop lease prolongation for 
a replica that has lost its ability to serve as a primary (at least, a 
preferred one), and NodeUtils#transferPrimary can be reworked in a way that it 
sends this message to the corresponding node that is the placement driver 
active actor (or, which is simpler for tests, just to every node; the message 
will be ignored on the other nodes).

The only problem is that StopLeaseProlongationMessage#redirectProposal is not 
handled by the placement driver correctly - this is a bug and should be fixed. 
After that we will obtain the ability to propose any node as the new primary 
and thus choose the new primary deliberately (a rough sketch follows below).

After that, IGNITE-20365 might be closed as well.
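A rough sketch of the proposed rework; sendStopLeaseProlongation() is a 
hypothetical helper, since the real message construction API is not shown here:
{code:java}
// Rough sketch only: broadcast StopLeaseProlongationMessage with a redirect
// proposal; only the placement driver active actor reacts, other nodes
// ignore the message.
static void transferPrimary(Collection<Node> nodes, ReplicationGroupId groupId, String proposedPrimary) {
    for (Node node : nodes) {
        // StopLeaseProlongationMessage with redirectProposal = proposedPrimary.
        sendStopLeaseProlongation(node, groupId, proposedPrimary); // hypothetical helper
    }
    // Then wait until the placement driver announces proposedPrimary as the
    // new primary replica for the group.
}
{code}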

> Test ItPrimaryReplicaChoiceTest.testPrimaryChangeLongHandling is flaky
> --
>
> Key: IGNITE-21382
> URL: https://issues.apache.org/jira/browse/IGNITE-21382
> Project: Ignite
>  Issue Type: Bug
>Reporter: Vladislav Pyatkov
>Priority: Major
>  Labels: ignite-3
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The test fails while waiting for the primary replica change. This issue is 
> also reproduced locally, at least once per five passes.
> {code}
> assertThat(primaryChangeTask, willCompleteSuccessfully());
> {code}
> {noformat}
> java.lang.AssertionError: java.util.concurrent.TimeoutException
>   at 
> org.apache.ignite.internal.testframework.matchers.CompletableFutureMatcher.matchesSafely(CompletableFutureMatcher.java:78)
>   at 
> org.apache.ignite.internal.testframework.matchers.CompletableFutureMatcher.matchesSafely(CompletableFutureMatcher.java:35)
>   at org.hamcrest.TypeSafeMatcher.matches(TypeSafeMatcher.java:67)
>   at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:10)
>   at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:6)
>   at 
> org.apache.ignite.internal.placementdriver.ItPrimaryReplicaChoiceTest.testPrimaryChangeLongHandling(ItPrimaryReplicaChoiceTest.java:179)
> {noformat}
> This test will be muted on TC to prevent future failures.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21712) Hybrid time is not adjusted when handling some of transaction non-replica requests

2024-03-08 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov updated IGNITE-21712:
--
Description: 
For example, TxStateResponse extends the TimestampAware interface and is a part 
of the transaction flow; the hybrid time should be adjusted when handling 
TxStateResponse, but it doesn't happen.

We should also check classes extending TimestampAware in order to ensure that 
the timestamp is adjusted in every case.

Other message interfaces extending TimestampAware but having no time adjustment:
{code:java}
FinishedTransactionsBatchMessage
TxCleanupMessage
TxStateResponse {code}
Also, these interfaces are unused and maybe they can be deleted:
{code:java}
TxCleanupMessageResponse
TxCleanupMessageErrorResponse
TxFinishResponse
{code}
 

 

  was:
For example, TxStateResponse extends the TimestampAware interface and is a part 
of the transaction flow; the hybrid time should be adjusted when handling 
TxStateResponse, but it doesn't happen.

We should also check classes extending TimestampAware in order to ensure that 
the timestamp is adjusted in every case.

Other message interfaces extending TimestampAware but having no time adjustment:
{code:java}
FinishedTransactionsBatchMessage
TxCleanupMessage
TxStateResponse {code}
Also, these interfaces are unused and maybe they can be deleted:

 
{code:java}
TxCleanupMessageResponse
TxCleanupMessageErrorResponse
TxFinishResponse
{code}
 

 


> Hybrid time is not adjusted when handling some of transaction non-replica 
> requests
> --
>
> Key: IGNITE-21712
> URL: https://issues.apache.org/jira/browse/IGNITE-21712
> Project: Ignite
>  Issue Type: Bug
>Reporter: Denis Chudov
>Priority: Major
>  Labels: ignite-3
>
> For example, TxStateResponse extends the TimestampAware interface and is a 
> part of the transaction flow; the hybrid time should be adjusted when 
> handling TxStateResponse, but it doesn't happen.
> We should also check classes extending TimestampAware in order to ensure that 
> the timestamp is adjusted in every case.
> Other message interfaces extending TimestampAware but having no time 
> adjustment:
> {code:java}
> FinishedTransactionsBatchMessage
> TxCleanupMessage
> TxStateResponse {code}
> Also, these interfaces are unused and maybe they can be deleted:
> {code:java}
> TxCleanupMessageResponse
> TxCleanupMessageErrorResponse
> TxFinishResponse
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21712) Hybrid time is not adjusted when handling some of transaction non-replica requests

2024-03-08 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov updated IGNITE-21712:
--
Description: 
For example, TxStateResponse extends the TimestampAware interface and is a part 
of the transaction flow; the hybrid time should be adjusted when handling 
TxStateResponse, but it doesn't happen.

We should also check classes extending TimestampAware in order to ensure that 
the timestamp is adjusted in every case.

Other message interfaces extending TimestampAware but having no time adjustment:
{code:java}
FinishedTransactionsBatchMessage
TxCleanupMessage
TxStateResponse {code}
Also, these interfaces are unused and maybe they can be deleted:

 
{code:java}
TxCleanupMessageResponse
TxCleanupMessageErrorResponse
TxFinishResponse
{code}
 

 

  was:
TxStateResponse extends the TimestampAware interface and is a part of the 
transaction flow; the hybrid time should be adjusted when handling 
TxStateResponse, but it doesn't happen.

We should also check classes extending TimestampAware in order to ensure that 
the timestamp is adjusted in every case.


> Hybrid time is not adjusted when handling some of transaction non-replica 
> requests
> --
>
> Key: IGNITE-21712
> URL: https://issues.apache.org/jira/browse/IGNITE-21712
> Project: Ignite
>  Issue Type: Bug
>Reporter: Denis Chudov
>Priority: Major
>  Labels: ignite-3
>
> For example, TxStateResponse extends the TimestampAware interface and is a 
> part of the transaction flow; the hybrid time should be adjusted when 
> handling TxStateResponse, but it doesn't happen.
> We should also check classes extending TimestampAware in order to ensure that 
> the timestamp is adjusted in every case.
> Other message interfaces extending TimestampAware but having no time 
> adjustment:
> {code:java}
> FinishedTransactionsBatchMessage
> TxCleanupMessage
> TxStateResponse {code}
> Also, these interfaces are unused and maybe they can be deleted:
>  
> {code:java}
> TxCleanupMessageResponse
> TxCleanupMessageErrorResponse
> TxFinishResponse
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21712) Hybrid time is not adjusted when handling some of transaction non-replica requests

2024-03-08 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov updated IGNITE-21712:
--
Summary: Hybrid time is not adjusted when handling some of transaction 
non-replica requests  (was: Hybrid time is not adjusted when handling 
TxStateResponse)

> Hybrid time is not adjusted when handling some of transaction non-replica 
> requests
> --
>
> Key: IGNITE-21712
> URL: https://issues.apache.org/jira/browse/IGNITE-21712
> Project: Ignite
>  Issue Type: Bug
>Reporter: Denis Chudov
>Priority: Major
>  Labels: ignite-3
>
> TxStateResponse extends the TimestampAware interface and is a part of the 
> transaction flow; the hybrid time should be adjusted when handling 
> TxStateResponse, but it doesn't happen.
> We should also check classes extending TimestampAware in order to ensure that 
> the timestamp is adjusted in every case.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21712) Hybrid time is not adjusted when handling TxStateResponse

2024-03-08 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov updated IGNITE-21712:
--
Description: 
TxStateResponse extends the TimestampAware interface and is a part of the 
transaction flow; the hybrid time should be adjusted when handling 
TxStateResponse, but it doesn't happen.

We should also check classes extending TimestampAware in order to ensure that 
the timestamp is adjusted in every case.

  was:TxStateResponse extends the TimestampAware interface and is a part of 
the transaction flow; the hybrid time should be adjusted when handling 
TxStateResponse, but it doesn't happen.


> Hybrid time is not adjusted when handling TxStateResponse
> -
>
> Key: IGNITE-21712
> URL: https://issues.apache.org/jira/browse/IGNITE-21712
> Project: Ignite
>  Issue Type: Bug
>Reporter: Denis Chudov
>Priority: Major
>  Labels: ignite-3
>
> TxStateResponse extends the TimestampAware interface and is a part of the 
> transaction flow; the hybrid time should be adjusted when handling 
> TxStateResponse, but it doesn't happen.
> We should also check classes extending TimestampAware in order to ensure that 
> the timestamp is adjusted in every case.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (IGNITE-21348) Trigger the lease negotiation retry in case when the lease candidate is no more contained in assignments

2024-03-07 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov reassigned IGNITE-21348:
-

Assignee: Denis Chudov

> Trigger the lease negotiation retry in case when the lease candidate is no 
> more contained in assignments
> 
>
> Key: IGNITE-21348
> URL: https://issues.apache.org/jira/browse/IGNITE-21348
> Project: Ignite
>  Issue Type: Bug
>Reporter: Denis Chudov
>Assignee: Denis Chudov
>Priority: Major
>  Labels: ignite-3
>
> On receiving the "lease granted" message, the candidate replica tries to 
> catch up with the actual storage state; in order to do that, it makes a read 
> index request. But when this candidate is no longer a member of the 
> assignments (and of the replication group), this request fails and is retried 
> until the lease negotiation interval is exceeded. This makes no sense, 
> because such retries will not be successful, and the current candidate is not 
> a good candidate anymore: although the leaseholder may not be a part of the 
> replication group, preferably it should be, and it should be its leader.
> Assignment changes in which some of the current candidates and leaseholders 
> are no longer included in the new assignment set should be detected on the 
> placement driver active actor, and the current lease should be revoked (if 
> negotiation is in progress) or not prolonged. A new negotiation will be 
> triggered automatically by the lease updater.
> *Implementation notes*
> This assignment change detection should be done on the placement driver side, 
> because the events of assignment changes can be processed on different nodes 
> at different times, and there is already an assignments tracker as a part of 
> the placement driver.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (IGNITE-21712) Hybrid time is not adjusted when handling TxStateResponse

2024-03-07 Thread Denis Chudov (Jira)
Denis Chudov created IGNITE-21712:
-

 Summary: Hybrid time is not adjusted when handling TxStateResponse
 Key: IGNITE-21712
 URL: https://issues.apache.org/jira/browse/IGNITE-21712
 Project: Ignite
  Issue Type: Bug
Reporter: Denis Chudov


TxStateResponse extends the TimestampAware interface and is a part of the 
transaction flow; the hybrid time should be adjusted when handling 
TxStateResponse, but it doesn't happen.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (IGNITE-21634) NPE in HeapLockManager

2024-03-07 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov reassigned IGNITE-21634:
-

Assignee: Denis Chudov

> NPE in HeapLockManager
> --
>
> Key: IGNITE-21634
> URL: https://issues.apache.org/jira/browse/IGNITE-21634
> Project: Ignite
>  Issue Type: Bug
>Reporter: Denis Chudov
>Assignee: Denis Chudov
>Priority: Major
>  Labels: ignite-3
>
> {code:java}
> Caused by: java.lang.NullPointerException at 
> org.apache.ignite.internal.tx.impl.HeapLockManager.lambda$lockState$4(HeapLockManager.java:297)
>  ~[main/:?] at 
> java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1908) 
> ~[?:?] at 
> org.apache.ignite.internal.tx.impl.HeapLockManager.lockState(HeapLockManager.java:291)
>  ~[main/:?] at 
> org.apache.ignite.internal.tx.impl.HeapLockManager.acquire(HeapLockManager.java:172)
>  ~[main/:?] at 
> org.apache.ignite.internal.table.distributed.SortedIndexLocker.lambda$locksForInsert$4(SortedIndexLocker.java:169)
>  ~[main/:?] at 
> java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1106)
>  ~[?:?] ... 29 more{code}
> on the line {{v.markedForRemove = false;}}
> {code:java}
> private LockState lockState(LockKey key) {
> int h = spread(key.hashCode());
> int index = h & (slots.length - 1);
> LockState[] res = new LockState[1];
> locks.compute(key, (k, v) -> {
> if (v == null) {
> if (empty.isEmpty()) {
> res[0] = slots[index];
> } else {
> v = empty.poll();
> v.markedForRemove = false;
> v.key = k;
> res[0] = v;
> }
> } else {
> res[0] = v;
> }
> return v;
> });
> return res[0];
> } {code}
> The problem can be reproduced on main (71b4fb34) with the following test 
> (probably, fsync should be turned off):
> {code}
> @Test
> void test() {
> sql("CREATE TABLE test("
> + "c1 INT PRIMARY KEY, c2 INT, c3 INT, c4 INT, c5 INT,"
> + "c6 INT, c7 INT, c8 INT, c9 INT, c10 INT)"
> );
> for (int i = 2; i <= 10; i++) {
> sql(format("CREATE INDEX c{}_idx ON test (c{})", i, i));
> }
> sql("INSERT INTO test"
> + " SELECT x as c1, x as c2, x as c3, x as c4, x as c5, "
> + "x as c6, x as c7, x as c8, x as c9, x as c10"
> + "   FROM TABLE (system_range(1, 10))");
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (IGNITE-21641) OOM in PartitionReplicaListenerTest

2024-03-05 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov reassigned IGNITE-21641:
-

Assignee: Denis Chudov

> OOM in PartitionReplicaListenerTest
> ---
>
> Key: IGNITE-21641
> URL: https://issues.apache.org/jira/browse/IGNITE-21641
> Project: Ignite
>  Issue Type: Bug
>Reporter: Mirza Aliev
>Assignee: Denis Chudov
>Priority: Major
>  Labels: ignite-3
> Attachments: image-2024-03-01-12-22-32-053.png, 
> image-2024-03-01-20-36-08-577.png
>
>
> TC run failed with OOM
> The problem occurred after the 
> PartitionReplicaListenerTest.testReadOnlyGetAfterRowRewrite run:
> {noformat}
> [2024-03-01T05:12:50,629][INFO ][Test worker][PartitionReplicaListenerTest] 
> >>> Starting test: 
> PartitionReplicaListenerTest#testReadOnlyGetAfterRowRewrite, displayName: 
> [14] true, true, false, true
> [2024-03-01T05:12:50,629][INFO ][Test worker][PartitionReplicaListenerTest] 
> workDir: 
> build/work/PartitionReplicaListenerTest/testReadOnlyGetAfterRowRewrite_33496469368142283
> [2024-03-01T05:12:50,638][INFO ][Test worker][PartitionReplicaListenerTest] 
> >>> Stopping test: 
> PartitionReplicaListenerTest#testReadOnlyGetAfterRowRewrite, displayName: 
> [14] true, true, false, true, cost: 8ms.
> [05:12:50] :   [testReadOnlyGetAfterRowRewrite(boolean, 
> boolean, boolean, boolean)] 
> org.apache.ignite.internal.table.distributed.replication.PartitionReplicaListenerTest.testReadOnlyGetAfterRowRewrite([15]
>  true, true, true, false) (10m:22s)
> [05:12:50] :   [:ignite-table:test] PartitionReplicaListenerTest > 
> testReadOnlyGetAfterRowRewrite(boolean, boolean, boolean, boolean) > [15] 
> true, true, true, false STANDARD_OUT
> [05:12:50] :   [:ignite-table:test] 
> [2024-03-01T05:12:50,648][INFO ][Test worker][PartitionReplicaListenerTest] 
> >>> Starting test: 
> PartitionReplicaListenerTest#testReadOnlyGetAfterRowRewrite, displayName: 
> [15] true, true, true, false
> [05:12:50] :   [:ignite-table:test] 
> [2024-03-01T05:12:50,648][INFO ][Test worker][PartitionReplicaListenerTest] 
> workDir: 
> build/work/PartitionReplicaListenerTest/testReadOnlyGetAfterRowRewrite_33496469386328241
> [05:18:42] :   [:ignite-table:test] java.lang.OutOfMemoryError: Java 
> heap space
> [05:18:42] :   [:ignite-table:test] Dumping heap to 
> java_pid2349600.hprof ...
> [05:19:06] :   [:ignite-table:test] Heap dump file created 
> [3645526743 bytes in 24.038 secs]
> {noformat}
> https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_RunAllTests/7898564?hideTestsFromDependencies=false=false+Inspection=true=true=true=false
> After analysing the heap dump, it appears that the reason for the OOM is a 
> problem with Mockito.
>  !image-2024-03-01-12-22-32-053.png! 
> We need to investigate the reason for the problem with Mockito.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21633) Get rid of RemotelyTriggeredResourceRegistry#remoteHostsToResources

2024-03-01 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov updated IGNITE-21633:
--
Description: 
*Motivation*

RemotelyTriggeredResourceRegistry has an API that allows closing the resources 
using the following parameters:
 * #close(UUID contextId)
 * #close(FullyQualifiedResourceId resourceId)
 * #close(String remoteHostId) - used when the remote host is no longer in the 
topology and we can close all resources that it has triggered, because they are 
no longer needed

In IGNITE-21293, a map 
_RemotelyTriggeredResourceRegistry#remoteHostsToResource_ was added which, in 
fact, groups the resources by remote host; this is needed to implement the last 
method without iterating over all resources. The main map ({_}#resources{_}, an 
ordered map of FullyQualifiedResourceId to RemotelyTriggeredResource objects), 
which is used to store the resources, cannot provide the resources for a given 
remote host id, because FullyQualifiedResourceId does not contain the remote 
host id.

The context id is included in the FullyQualifiedResourceId, but the transaction 
id (which is the contextId in the case of a cursor resource) does not contain a 
node identifier, only an integer hash code of the coordinator node name.

*Definition of done*

The _RemotelyTriggeredResourceRegistry#remoteHostsToResource_ map is removed.

*Implementation notes*

We can change the transaction id generation to replace the node name hash with 
the order in which the node joined the cluster; then we will be able to 
evaluate the transaction coordinator having only the transaction id. This will 
also require postulating that context id generation for every type of resource 
should follow this rule.

After that we will be able to get a submap of resources created by some node 
from the _#resources_ map (FullyQualifiedResourceId to 
RemotelyTriggeredResource objects).

As one possible implementation to get all resources triggered by nodes that are 
no longer in the topology, we can iterate over the currently online nodes (in 
the order in which they joined) and get the submap of resources belonging to 
the space between each two of them; a sketch follows the example below. As the 
number of nodes is significantly less than the number of resources, this should 
be more efficient than iterating over the whole map.

For example:
 * there were 3 nodes: A (join order 0), B (join order 1), C (join order 2);
 * node B left the topology;
 * there are 1000 resources: 200 of them were created by A, 500 by B and 300 by 
C;
 * iterating over the existing node pairs yields the following intervals: 
(MIN_ORDER; 0) - the submap is empty, (0; 2) - the submap includes the 500 
resources created by B, (2; MAX_ORDER) - the submap is empty.
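A minimal sketch of the proposed lookup, assuming resource ids sort by the 
creator's join order; minResourceId()/maxResourceId() and MIN_ORDER are 
hypothetical helpers producing the boundary ids for a given join order:
{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.SortedSet;

// Sketch only: keys strictly between two consecutive online nodes can only
// belong to nodes that have left the topology.
List<RemotelyTriggeredResource> resourcesOfDepartedNodes(
        NavigableMap<FullyQualifiedResourceId, RemotelyTriggeredResource> resources,
        SortedSet<Long> onlineJoinOrders) {
    List<RemotelyTriggeredResource> result = new ArrayList<>();
    long prev = MIN_ORDER;

    for (long order : onlineJoinOrders) {
        // Everything after all of prev's resources and before any of order's.
        result.addAll(resources.subMap(maxResourceId(prev), false, minResourceId(order), false).values());
        prev = order;
    }

    // Tail: resources of departed nodes that joined after the last online one.
    result.addAll(resources.tailMap(maxResourceId(prev), false).values());
    return result;
}
{code}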

  was:
*Motivation*

RemotelyTriggeredResourceRegistry has an API that allows closing the resources 
using the following parameters:
 * #close(UUID contextId)
 * #close(FullyQualifiedResourceId resourceId)
 * #close(String remoteHostId) - used when the remote host is no longer in the 
topology and we can close all resources that it has triggered, because they are 
no longer needed

In IGNITE-21293, a map 
_RemotelyTriggeredResourceRegistry#remoteHostsToResource_ was added which, in 
fact, groups the resources by remote host; this is needed to implement the last 
method without iterating over all resources. The main map ({_}#resources{_}, an 
ordered map of FullyQualifiedResourceId to RemotelyTriggeredResource objects), 
which is used to store the resources, cannot provide the resources for a given 
remote host id, because FullyQualifiedResourceId does not contain the remote 
host id.

The context id is included in the FullyQualifiedResourceId, but the transaction 
id (which is the contextId in the case of a cursor resource) does not contain a 
node identifier, only an integer hash code of the coordinator node name.

*Definition of done*

The _RemotelyTriggeredResourceRegistry#remoteHostsToResource_ map is removed.

*Implementation notes*

We can change the transaction id generation to replace the node name hash with 
the order in which the node joined the cluster; then we will be able to 
evaluate the transaction coordinator having only the transaction id. This will 
also require postulating that context id generation for every type of resource 
should follow this rule.

After that we will be able to get a submap of resources created by some node 
from the _#resources_ map (FullyQualifiedResourceId to 
RemotelyTriggeredResource objects).

To get all resources triggered by the nodes that are no longer in the topology, 
we can iterate over the currently online nodes (in the order in which they 
joined) and get the submap of resources belonging to the space between each two 
of them. As the number of nodes is significantly less than the number of 
resources, this operation should be more efficient than iterating over the 
whole map.

For example:
 * there were 3 nodes: A (join order 0), B (join 

[jira] [Updated] (IGNITE-21633) Get rid of RemotelyTriggeredResourceRegistry#remoteHostsToResources

2024-03-01 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov updated IGNITE-21633:
--
Description: 
*Motivation*

RemotelyTriggeredResourceRegistry has an API that allows closing the resources 
using the following parameters:
 * #close(UUID contextId)
 * #close(FullyQualifiedResourceId resourceId)
 * #close(String remoteHostId) - used when the remote host is no longer in the 
topology and we can close all resources that it has triggered, because they are 
no longer needed

In IGNITE-21293, a map 
_RemotelyTriggeredResourceRegistry#remoteHostsToResource_ was added which, in 
fact, groups the resources by remote host; this is needed to implement the last 
method without iterating over all resources. The main map ({_}#resources{_}, an 
ordered map of FullyQualifiedResourceId to RemotelyTriggeredResource objects), 
which is used to store the resources, cannot provide the resources for a given 
remote host id, because FullyQualifiedResourceId does not contain the remote 
host id.

The context id is included in the FullyQualifiedResourceId, but the transaction 
id (which is the contextId in the case of a cursor resource) does not contain a 
node identifier, only an integer hash code of the coordinator node name.

*Definition of done*

The _RemotelyTriggeredResourceRegistry#remoteHostsToResource_ map is removed.

*Implementation notes*

We can change the transaction id generation to replace the node name hash with 
the order in which the node joined the cluster; then we will be able to 
evaluate the transaction coordinator having only the transaction id. This will 
also require postulating that context id generation for every type of resource 
should follow this rule.

After that we will be able to get a submap of resources created by some node 
from the _#resources_ map (FullyQualifiedResourceId to 
RemotelyTriggeredResource objects).

To get all resources triggered by the nodes that are no longer in the topology, 
we can iterate over the currently online nodes (in the order in which they 
joined) and get the submap of resources belonging to the space between each two 
of them. As the number of nodes is significantly less than the number of 
resources, this operation should be more efficient than iterating over the 
whole map.

For example:
 * there were 3 nodes: A (join order 0), B (join order 1), C (join order 2);
 * node B left the topology;
 * there are 1000 resources: 200 of them were created by A, 500 by B and 300 by 
C;
 * iterating over the existing node pairs yields the following intervals: 
(MIN_ORDER; 0) - the submap is empty, (0; 2) - the submap includes the 500 
resources created by B, (2; MAX_ORDER) - the submap is empty.

  was:
Motivation

In IGNITE-21293 there was added a map 


> Get rid of RemotelyTriggeredResourceRegistry#remoteHostsToResources
> ---
>
> Key: IGNITE-21633
> URL: https://issues.apache.org/jira/browse/IGNITE-21633
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Denis Chudov
>Priority: Major
>  Labels: ignite-3
>
> *Motivation*
> RemotelyTriggeredResourceRegistry has an API that allows closing the 
> resources using the following parameters:
>  * #close(UUID contextId)
>  * #close(FullyQualifiedResourceId resourceId)
>  * #close(String remoteHostId) - used when the remote host is no longer in 
> the topology and we can close all resources that it has triggered, because 
> they are no longer needed
> In IGNITE-21293, a map 
> _RemotelyTriggeredResourceRegistry#remoteHostsToResource_ was added which, in 
> fact, groups the resources by remote host; this is needed to implement the 
> last method without iterating over all resources. The main map 
> ({_}#resources{_}, an ordered map of FullyQualifiedResourceId to 
> RemotelyTriggeredResource objects), which is used to store the resources, 
> cannot provide the resources for a given remote host id, because 
> FullyQualifiedResourceId does not contain the remote host id.
> The context id is included in the FullyQualifiedResourceId, but the 
> transaction id (which is the contextId in the case of a cursor resource) does 
> not contain a node identifier, only an integer hash code of the coordinator 
> node name.
> *Definition of done*
> The _RemotelyTriggeredResourceRegistry#remoteHostsToResource_ map is removed.
> *Implementation notes*
> We can change the transaction id generation to replace the node name hash 
> with the order in which the node joined the cluster; then we will be able to 
> evaluate the transaction coordinator having only the transaction id. This 
> will also require postulating that context id generation for every type of 
> resource should follow this rule.
> After that we will be able to get a submap of resources created by some node 
> from _#resources_ map (FullyQualifiedResourceId to 

[jira] [Updated] (IGNITE-21633) Get rid of RemotelyTriggeredResourceRegistry#remoteHostsToResources

2024-03-01 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov updated IGNITE-21633:
--
Description: 
Motivation

In IGNITE-21293 there was added a map 

  was:TBD


> Get rid of RemotelyTriggeredResourceRegistry#remoteHostsToResources
> ---
>
> Key: IGNITE-21633
> URL: https://issues.apache.org/jira/browse/IGNITE-21633
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Denis Chudov
>Priority: Major
>  Labels: ignite-3
>
> Motivation
> In IGNITE-21293 there was added a map 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21634) NPE in HeapLockManager

2024-02-29 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov updated IGNITE-21634:
--
Description: 
{code:java}
Caused by: java.lang.NullPointerException at 
org.apache.ignite.internal.tx.impl.HeapLockManager.lambda$lockState$4(HeapLockManager.java:297)
 ~[main/:?] at 
java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1908) 
~[?:?] at 
org.apache.ignite.internal.tx.impl.HeapLockManager.lockState(HeapLockManager.java:291)
 ~[main/:?] at 
org.apache.ignite.internal.tx.impl.HeapLockManager.acquire(HeapLockManager.java:172)
 ~[main/:?] at 
org.apache.ignite.internal.table.distributed.SortedIndexLocker.lambda$locksForInsert$4(SortedIndexLocker.java:169)
 ~[main/:?] at 
java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1106)
 ~[?:?] ... 29 more{code}
on the line {{v.markedForRemove = false;}}
{code:java}
private LockState lockState(LockKey key) {
int h = spread(key.hashCode());
int index = h & (slots.length - 1);

LockState[] res = new LockState[1];

locks.compute(key, (k, v) -> {
if (v == null) {
if (empty.isEmpty()) {
res[0] = slots[index];
} else {
v = empty.poll();
v.markedForRemove = false;
v.key = k;
res[0] = v;
}
} else {
res[0] = v;
}

return v;
});

return res[0];
} {code}

  was:
{code:java}
Caused by: java.lang.NullPointerException at 
org.apache.ignite.internal.tx.impl.HeapLockManager.lambda$lockState$4(HeapLockManager.java:297)
 ~[main/:?] at 
java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1908) 
~[?:?] at 
org.apache.ignite.internal.tx.impl.HeapLockManager.lockState(HeapLockManager.java:291)
 ~[main/:?] at 
org.apache.ignite.internal.tx.impl.HeapLockManager.acquire(HeapLockManager.java:172)
 ~[main/:?] at 
org.apache.ignite.internal.table.distributed.SortedIndexLocker.lambda$locksForInsert$4(SortedIndexLocker.java:169)
 ~[main/:?] at 
java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1106)
 ~[?:?] ... 29 more{code}


> NPE in HeapLockManager
> --
>
> Key: IGNITE-21634
> URL: https://issues.apache.org/jira/browse/IGNITE-21634
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Denis Chudov
>Priority: Major
>  Labels: ignite-3
>
> {code:java}
> Caused by: java.lang.NullPointerException at 
> org.apache.ignite.internal.tx.impl.HeapLockManager.lambda$lockState$4(HeapLockManager.java:297)
>  ~[main/:?] at 
> java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1908) 
> ~[?:?] at 
> org.apache.ignite.internal.tx.impl.HeapLockManager.lockState(HeapLockManager.java:291)
>  ~[main/:?] at 
> org.apache.ignite.internal.tx.impl.HeapLockManager.acquire(HeapLockManager.java:172)
>  ~[main/:?] at 
> org.apache.ignite.internal.table.distributed.SortedIndexLocker.lambda$locksForInsert$4(SortedIndexLocker.java:169)
>  ~[main/:?] at 
> java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1106)
>  ~[?:?] ... 29 more{code}
> on the line {{v.markedForRemove = false;}}
> {code:java}
> private LockState lockState(LockKey key) {
> int h = spread(key.hashCode());
> int index = h & (slots.length - 1);
> LockState[] res = new LockState[1];
> locks.compute(key, (k, v) -> {
> if (v == null) {
> if (empty.isEmpty()) {
> res[0] = slots[index];
> } else {
> v = empty.poll();
> v.markedForRemove = false;
> v.key = k;
> res[0] = v;
> }
> } else {
> res[0] = v;
> }
> return v;
> });
> return res[0];
> } {code}
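A plausible cause (an assumption from the snippet above, not confirmed in the 
ticket): the empty.isEmpty()/empty.poll() pair is a check-then-act race, so 
poll() can return null when another thread drains the queue in between. A 
null-safe variant of the same logic:
{code:java}
locks.compute(key, (k, v) -> {
    if (v == null) {
        v = empty.poll(); // atomic: a recycled state or null, no isEmpty() race

        if (v == null) {
            res[0] = slots[index]; // pool drained: fall back to the shared slot
        } else {
            v.markedForRemove = false;
            v.key = k;
            res[0] = v;
        }
    } else {
        res[0] = v;
    }

    return v;
});
{code}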



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (IGNITE-21634) NPE in HeapLockManager

2024-02-29 Thread Denis Chudov (Jira)
Denis Chudov created IGNITE-21634:
-

 Summary: NPE in HeapLockManager
 Key: IGNITE-21634
 URL: https://issues.apache.org/jira/browse/IGNITE-21634
 Project: Ignite
  Issue Type: Improvement
Reporter: Denis Chudov


{code:java}
Caused by: java.lang.NullPointerException at 
org.apache.ignite.internal.tx.impl.HeapLockManager.lambda$lockState$4(HeapLockManager.java:297)
 ~[main/:?] at 
java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1908) 
~[?:?] at 
org.apache.ignite.internal.tx.impl.HeapLockManager.lockState(HeapLockManager.java:291)
 ~[main/:?] at 
org.apache.ignite.internal.tx.impl.HeapLockManager.acquire(HeapLockManager.java:172)
 ~[main/:?] at 
org.apache.ignite.internal.table.distributed.SortedIndexLocker.lambda$locksForInsert$4(SortedIndexLocker.java:169)
 ~[main/:?] at 
java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1106)
 ~[?:?] ... 29 more{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (IGNITE-21633) Get rid of RemotelyTriggeredResourceRegistry#remoteHostsToResources

2024-02-29 Thread Denis Chudov (Jira)
Denis Chudov created IGNITE-21633:
-

 Summary: Get rid of 
RemotelyTriggeredResourceRegistry#remoteHostsToResources
 Key: IGNITE-21633
 URL: https://issues.apache.org/jira/browse/IGNITE-21633
 Project: Ignite
  Issue Type: Improvement
Reporter: Denis Chudov


TBD



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (IGNITE-21618) In-flights for read-only transactions

2024-02-28 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov reassigned IGNITE-21618:
-

Assignee: Denis Chudov

> In-flights for read-only transactions
> -
>
> Key: IGNITE-21618
> URL: https://issues.apache.org/jira/browse/IGNITE-21618
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Denis Chudov
>Assignee: Denis Chudov
>Priority: Major
>  Labels: ignite-3
>
> *Motivation*
> We need to make a solid mechanism for closing read-only transactions' 
> resources (scan cursors, etc.) on remote servers after tx finish. Resources 
> are supposed to be closed by requests from the coordinator, sent from a 
> separate cleanup thread after the tx is finished, to maximise the performance 
> of the tx finish itself and because these requests are needed only for 
> resource cleanup. But we need to prevent a race, such as:
>  * a tx request that is supposed to create a scan cursor on a remote server 
> is sent
>  * the tx is finished
>  * the cleanup thread sends a cleanup request
>  * the cleanup request reaches the remote server
>  * the tx request reaches the remote server and opens a cursor that will 
> never be closed.
> We need to ensure that the cleanup request is not sent until the coordinator 
> receives responses for all requests that were sent before the tx finish, and 
> that no requests are allowed after the tx finish. Something similar to the RW 
> in-flight request counter is to be done for RO (a sketch follows below).
> *Definition of done*
> The cleanup request from the cleanup thread is not sent until the coordinator 
> receives responses for all requests that were sent before the tx finish, and 
> no requests are allowed after the tx finish.
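A minimal sketch of such in-flight tracking, with illustrative names; the 
cleanup thread would wait for the future returned by finish(), so a cleanup 
request can never overtake a request sent before the tx finish:
{code:java}
import java.util.concurrent.CompletableFuture;

// Sketch only: gates the cleanup request on a future that completes when the
// tx is finished and every request sent before the finish has been answered.
class RoInflights {
    private int inflights;
    private boolean finished;
    private final CompletableFuture<Void> drained = new CompletableFuture<>();

    /** Called before sending a tx request; false means the tx has finished. */
    synchronized boolean tryRegisterRequest() {
        if (finished) {
            return false; // no requests are allowed after tx finish
        }
        inflights++;
        return true;
    }

    /** Called when the response for a registered request arrives. */
    synchronized void onResponse() {
        if (--inflights == 0 && finished) {
            drained.complete(null);
        }
    }

    /** Called on tx finish; completes when all in-flights are answered. */
    synchronized CompletableFuture<Void> finish() {
        finished = true;
        if (inflights == 0) {
            drained.complete(null);
        }
        return drained;
    }
}
{code}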



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21618) In-flights for read-only transactions

2024-02-27 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov updated IGNITE-21618:
--
Description: 
*Motivation*

We need to make a solid mechanism for closing read-only transactions' resources 
(scan cursors, etc.) on remote servers after tx finish. Resources are supposed 
to be closed by requests from the coordinator, sent from a separate cleanup 
thread after the tx is finished, to maximise the performance of the tx finish 
itself and because these requests are needed only for resource cleanup. But we 
need to prevent a race, such as:
 * a tx request that is supposed to create a scan cursor on a remote server is 
sent
 * the tx is finished
 * the cleanup thread sends a cleanup request
 * the cleanup request reaches the remote server
 * the tx request reaches the remote server and opens a cursor that will never 
be closed.

We need to ensure that the cleanup request is not sent until the coordinator 
receives responses for all requests that were sent before the tx finish, and 
that no requests are allowed after the tx finish. Something similar to the RW 
in-flight request counter is to be done for RO.

*Definition of done*

The cleanup request from the cleanup thread is not sent until the coordinator 
receives responses for all requests that were sent before the tx finish, and no 
requests are allowed after the tx finish.

  was:
*Motivation*

We need to make a solid mechanism for closing read-only transactions' resources 
(cursors, etc.) on remote servers after tx finish. 


> In-flights for read-only transactions
> -
>
> Key: IGNITE-21618
> URL: https://issues.apache.org/jira/browse/IGNITE-21618
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Denis Chudov
>Priority: Major
>  Labels: ignite-3
>
> *Motivation*
> We need to make a solid mechanism for closing read-only transactions' 
> resources (scan cursors, etc.) on remote servers after tx finish. Resources 
> are supposed to be closed by requests from the coordinator, sent from a 
> separate cleanup thread after the tx is finished, to maximise the performance 
> of the tx finish itself and because these requests are needed only for 
> resource cleanup. But we need to prevent a race, such as:
>  * a tx request that is supposed to create a scan cursor on a remote server 
> is sent
>  * the tx is finished
>  * the cleanup thread sends a cleanup request
>  * the cleanup request reaches the remote server
>  * the tx request reaches the remote server and opens a cursor that will 
> never be closed.
> We need to ensure that the cleanup request is not sent until the coordinator 
> receives responses for all requests that were sent before the tx finish, and 
> that no requests are allowed after the tx finish. Something similar to the RW 
> in-flight request counter is to be done for RO.
> *Definition of done*
> The cleanup request from the cleanup thread is not sent until the coordinator 
> receives responses for all requests that were sent before the tx finish, and 
> no requests are allowed after the tx finish.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21618) In-flights for read-only transactions

2024-02-27 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov updated IGNITE-21618:
--
Epic Link: IGNITE-21221  (was: IGNITE-21174)

> In-flights for read-only transactions
> -
>
> Key: IGNITE-21618
> URL: https://issues.apache.org/jira/browse/IGNITE-21618
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Denis Chudov
>Priority: Major
>  Labels: ignite-3
>
> *Motivation*
> We need to make a solid mechanism for closing read-only transactions' 
> resources (scan cursors, etc.) on remote servers after tx finish. Resources 
> are supposed to be closed by requests from the coordinator, sent from a 
> separate cleanup thread after the tx is finished, to maximise the performance 
> of the tx finish itself and because these requests are needed only for 
> resource cleanup. But we need to prevent a race, such as:
>  * a tx request that is supposed to create a scan cursor on a remote server 
> is sent
>  * the tx is finished
>  * the cleanup thread sends a cleanup request
>  * the cleanup request reaches the remote server
>  * the tx request reaches the remote server and opens a cursor that will 
> never be closed.
> We need to ensure that the cleanup request is not sent until the coordinator 
> receives responses for all requests that were sent before the tx finish, and 
> that no requests are allowed after the tx finish. Something similar to the RW 
> in-flight request counter is to be done for RO.
> *Definition of done*
> The cleanup request from the cleanup thread is not sent until the coordinator 
> receives responses for all requests that were sent before the tx finish, and 
> no requests are allowed after the tx finish.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21618) In-flights for read-only transactions

2024-02-27 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov updated IGNITE-21618:
--
Description: 
*Motivation*

We need to make a solid mechanism for closing read-only transactions' resources 
(cursors, etc.) on remote servers after tx finish. 

  was:TBD


> In-flights for read-only transactions
> -
>
> Key: IGNITE-21618
> URL: https://issues.apache.org/jira/browse/IGNITE-21618
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Denis Chudov
>Priority: Major
>  Labels: ignite-3
>
> *Motivation*
> We need to make a solid mechanism for closing read-only transactions' 
> resources (cursors, etc.) on remote servers after tx finish. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (IGNITE-21618) In-flights for read-only transactions

2024-02-27 Thread Denis Chudov (Jira)
Denis Chudov created IGNITE-21618:
-

 Summary: In-flights for read-only transactions
 Key: IGNITE-21618
 URL: https://issues.apache.org/jira/browse/IGNITE-21618
 Project: Ignite
  Issue Type: Improvement
Reporter: Denis Chudov


TBD



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (IGNITE-21545) Introduce a cursor manager

2024-02-16 Thread Denis Chudov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17817913#comment-17817913
 ] 

Denis Chudov commented on IGNITE-21545:
---

Split out from IGNITE-21293.

> Introduce a cursor manager
> --
>
> Key: IGNITE-21545
> URL: https://issues.apache.org/jira/browse/IGNITE-21545
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Denis Chudov
>Assignee: Denis Chudov
>Priority: Major
>  Labels: ignite-3
>
> Introduce a cursor manager that would maintain all cursors created on a node, 
> instead of maintaining them in partition replica listeners.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (IGNITE-21545) Introduce a cursor manager

2024-02-16 Thread Denis Chudov (Jira)
Denis Chudov created IGNITE-21545:
-

 Summary: Introduce a cursor manager
 Key: IGNITE-21545
 URL: https://issues.apache.org/jira/browse/IGNITE-21545
 Project: Ignite
  Issue Type: Improvement
Reporter: Denis Chudov
Assignee: Denis Chudov


Introduce a cursor manager that would maintain all cursors created on a node, 
instead of maintaining them in partition replica listeners.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21513) ActiveActorTest#testChangeLeaderForce is flaky

2024-02-13 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov updated IGNITE-21513:
--
Description: 
{code:java}
[05:19:12]F: 
[org.apache.ignite.internal.placementdriver.ActiveActorTest.testChangeLeaderForce(TestInfo)]
 org.opentest4j.AssertionFailedError: expected:  but was: 
at 
app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
   
at 
app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
   
at app//org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)
at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36) 
at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:31) 
at app//org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:180)
at 
app//org.apache.ignite.internal.placementdriver.ActiveActorTest.testChangeLeaderForce(ActiveActorTest.java:370){code}
See IGNITE-21381 for more details. This ticket is about fixing the flaky test 
and removing the code duplication between ActiveActorTest and 
TopologyAwareRaftGroupServiceTest.

The actual problem of the test was a race due to the lack of joins on the 
futures returned by #subscribeLeader().

  was:
{code:java}
[05:19:12]F: 
[org.apache.ignite.internal.placementdriver.ActiveActorTest.testChangeLeaderForce(TestInfo)]
 org.opentest4j.AssertionFailedError: expected:  but was: 
at 
app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
   
at 
app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
   
at app//org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)
at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36) 
at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:31) 
at app//org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:180)
at 
app//org.apache.ignite.internal.placementdriver.ActiveActorTest.testChangeLeaderForce(ActiveActorTest.java:370){code}
See IGNITE-21381 for more details. This ticket is about fixing the flaky test 
and removing the code duplication between ActiveActorTest and 
TopologyAwareRaftGroupServiceTest.


> ActiveActorTest#testChangeLeaderForce is flaky
> --
>
> Key: IGNITE-21513
> URL: https://issues.apache.org/jira/browse/IGNITE-21513
> Project: Ignite
>  Issue Type: Bug
>Reporter: Denis Chudov
>Assignee: Denis Chudov
>Priority: Major
>  Labels: ignite-3
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> {code:java}
> [05:19:12]F:   
> [org.apache.ignite.internal.placementdriver.ActiveActorTest.testChangeLeaderForce(TestInfo)]
>  org.opentest4j.AssertionFailedError: expected:  but was: 
> at 
> app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
>  
> at 
> app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
>  
> at app//org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)  
> at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36)   
> at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:31)   
> at app//org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:180)  
> at 
> app//org.apache.ignite.internal.placementdriver.ActiveActorTest.testChangeLeaderForce(ActiveActorTest.java:370){code}
> See IGNITE-21381 for more details. This ticket is about fixing the flaky test 
> and removing the code duplication between ActiveActorTest and 
> TopologyAwareRaftGroupServiceTest.
> The actual problem of the test was a race due to the lack of joins on the 
> futures returned by #subscribeLeader() (see the sketch below).
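
A sketch of the fix in essence, with illustrative names (the exact service and 
listener types used in the test may differ):

{code:java}
// Join the future returned by subscribeLeader() before transferring
// leadership; otherwise the listener may not be installed yet when the
// leader changes, and the assertion observes a stale state.
CompletableFuture<Void> subscribed = service.subscribeLeader(onLeaderElected);
subscribed.get(10, TimeUnit.SECONDS); // join: the subscription is in place now
// Only after this point transfer leadership and assert on the listener.
{code}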



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (IGNITE-21381) ActiveActorTest#testChangeLeaderForce has problems with resource cleanup

2024-02-12 Thread Denis Chudov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816547#comment-17816547
 ] 

Denis Chudov edited comment on IGNITE-21381 at 2/12/24 3:12 PM:


As we already have a PR intended to fix the resource leak ( 
[https://github.com/apache/ignite-3/pull/3150] ), I created a new ticket, 
IGNITE-21513, to fix the flaky test and do the refactoring.

The TC logs appeared to be misleading because of reordered messages (see the 
timestamps). The actual problem of the test was a race due to the lack of 
joins on the futures returned by #subscribeLeader().


was (Author: denis chudov):
As we already have a PR intended to fix the resource leak ( 
https://github.com/apache/ignite-3/pull/3150 ), I created a new ticket, 
IGNITE-21513, to fix the flaky test and do the refactoring.

> ActiveActorTest#testChangeLeaderForce has problems with resource cleanup
> 
>
> Key: IGNITE-21381
> URL: https://issues.apache.org/jira/browse/IGNITE-21381
> Project: Ignite
>  Issue Type: Bug
>Reporter: Mirza Aliev
>Priority: Major
>  Labels: ignite-3
> Attachments: screenshot-1.png, screenshot-2.png
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> {{ActiveActorTest#testChangeLeaderForce}} started to be flaky on TC with 
> {noformat}
> [05:19:12]F:   
> [org.apache.ignite.internal.placementdriver.ActiveActorTest.testChangeLeaderForce(TestInfo)]
>  org.opentest4j.AssertionFailedError: expected:  but was: 
>   at 
> app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
>   at 
> app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
>   at app//org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)
>   at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36)
>   at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:31)
>   at app//org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:180)
>   at 
> app//org.apache.ignite.internal.placementdriver.ActiveActorTest.testChangeLeaderForce(ActiveActorTest.java:370)
> {noformat}
> From the log we can see that the leadership transfer, which was supposed to 
> be successful, does not happen. The behaviour is the following:
> 1) Current leader is {{Leader: ClusterNodeImpl 
> [id=e99210fb-f872-4e08-a99c-53f9512da20e, name=aat_tclf_1235}}
> 2) We want to transfer leadership to {{Peer to transfer leader: Peer 
> [consistentId=aat_tclf_1234, idx=0]}}
> 3) Process of transfer is started
> 4) We receive a warning about an error during {{GetLeaderRequestImpl}}:
> {noformat}
> [2024-01-29T05:19:08,855][WARN 
> ][CompletableFutureDelayScheduler][RaftGroupServiceImpl] Recoverable error 
> during the request occurred (will be retried on the randomly selected node) 
> [request=GetLeaderRequestImpl [groupId=TestReplicationGroup, 
> peerId=aat_tclf_1235], peer=Peer [consistentId=aat_tclf_1235, idx=0], 
> newPeer=Peer [consistentId=aat_tclf_1234, idx=0]].
> java.util.concurrent.CompletionException: 
> java.util.concurrent.TimeoutException
>   at 
> java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:367)
>  ~[?:?]
>   at 
> java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:376)
>  ~[?:?]
>   at 
> java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:1019)
>  ~[?:?]
>   at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
>  [?:?]
>   at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
>  [?:?]
>   at 
> java.util.concurrent.CompletableFuture$Timeout.run(CompletableFuture.java:2792)
>  [?:?]
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
>   at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
>  [?:?]
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>  [?:?]
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>  [?:?]
>   at java.lang.Thread.run(Thread.java:834) [?:?]
> Caused by: java.util.concurrent.TimeoutException
>   ... 7 more
> {noformat}
> 5) After that we see that node {{aat_tclf_1236}} sends an invalid 
> {{RequestVoteResponse}} because it thinks that it is the leader:
> {noformat}
> [2024-01-29T05:19:11,370][WARN 
> ][%aat_tclf_1234%JRaft-Response-Processor-15][NodeImpl] Node 
>  received invalid RequestVoteResponse 
> from aat_tclf_1236, state not in STATE_CANDIDATE but STATE_LEADER.
> {noformat}
>  
> Tests 

[jira] [Updated] (IGNITE-21513) ActiveActorTest#testChangeLeaderForce is flaky

2024-02-12 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov updated IGNITE-21513:
--
Description: 
{code:java}
[05:19:12]F: 
[org.apache.ignite.internal.placementdriver.ActiveActorTest.testChangeLeaderForce(TestInfo)]
 org.opentest4j.AssertionFailedError: expected:  but was: 
at 
app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
   
at 
app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
   
at app//org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)
at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36) 
at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:31) 
at app//org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:180)
at 
app//org.apache.ignite.internal.placementdriver.ActiveActorTest.testChangeLeaderForce(ActiveActorTest.java:370){code}
See IGNITE-21381 for more details. This ticket is about fixing flaky test and 
removing the code duplication between ActiveActorTest and 
TopologyAwareRaftGroupServiceTest.

  was:
{code:java}
[05:19:12]F: 
[org.apache.ignite.internal.placementdriver.ActiveActorTest.testChangeLeaderForce(TestInfo)]
 org.opentest4j.AssertionFailedError: expected:  but was: 
at 
app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
   
at 
app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
   
at app//org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)
at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36) 
at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:31) 
at app//org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:180)
at 
app//org.apache.ignite.internal.placementdriver.ActiveActorTest.testChangeLeaderForce(ActiveActorTest.java:370){code}
 


> ActiveActorTest#testChangeLeaderForce is flaky
> --
>
> Key: IGNITE-21513
> URL: https://issues.apache.org/jira/browse/IGNITE-21513
> Project: Ignite
>  Issue Type: Bug
>Reporter: Denis Chudov
>Assignee: Denis Chudov
>Priority: Major
>  Labels: ignite-3
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> {code:java}
> [05:19:12]F:   
> [org.apache.ignite.internal.placementdriver.ActiveActorTest.testChangeLeaderForce(TestInfo)]
>  org.opentest4j.AssertionFailedError: expected:  but was: 
> at 
> app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
>  
> at 
> app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
>  
> at app//org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)  
> at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36)   
> at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:31)   
> at app//org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:180)  
> at 
> app//org.apache.ignite.internal.placementdriver.ActiveActorTest.testChangeLeaderForce(ActiveActorTest.java:370){code}
> See IGNITE-21381 for more details. This ticket is about fixing flaky test and 
> removing the code duplication between ActiveActorTest and 
> TopologyAwareRaftGroupServiceTest.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (IGNITE-21513) ActiveActorTest#testChangeLeaderForce is flaky

2024-02-12 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov reassigned IGNITE-21513:
-

Assignee: Denis Chudov

> ActiveActorTest#testChangeLeaderForce is flaky
> --
>
> Key: IGNITE-21513
> URL: https://issues.apache.org/jira/browse/IGNITE-21513
> Project: Ignite
>  Issue Type: Bug
>Reporter: Denis Chudov
>Assignee: Denis Chudov
>Priority: Major
>  Labels: ignite-3
>
> {code:java}
> [05:19:12]F:   
> [org.apache.ignite.internal.placementdriver.ActiveActorTest.testChangeLeaderForce(TestInfo)]
>  org.opentest4j.AssertionFailedError: expected:  but was: 
> at 
> app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
>  
> at 
> app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
>  
> at app//org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)  
> at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36)   
> at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:31)   
> at app//org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:180)  
> at 
> app//org.apache.ignite.internal.placementdriver.ActiveActorTest.testChangeLeaderForce(ActiveActorTest.java:370){code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (IGNITE-21381) ActiveActorTest#testChangeLeaderForce has problems with resource cleanup

2024-02-12 Thread Denis Chudov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816547#comment-17816547
 ] 

Denis Chudov commented on IGNITE-21381:
---

As we have a PR already that is intended to fix resource leak ( 
https://github.com/apache/ignite-3/pull/3150 ), I created a new ticket to fix 
the flaky test and make a refactoring IGNITE-21513

> ActiveActorTest#testChangeLeaderForce has problems with resource cleanup
> 
>
> Key: IGNITE-21381
> URL: https://issues.apache.org/jira/browse/IGNITE-21381
> Project: Ignite
>  Issue Type: Bug
>Reporter: Mirza Aliev
>Priority: Major
>  Labels: ignite-3
> Attachments: screenshot-1.png, screenshot-2.png
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> {{ActiveActorTest#testChangeLeaderForce}} started to be flaky on TC with 
> {noformat}
> [05:19:12]F:   
> [org.apache.ignite.internal.placementdriver.ActiveActorTest.testChangeLeaderForce(TestInfo)]
>  org.opentest4j.AssertionFailedError: expected:  but was: 
>   at 
> app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
>   at 
> app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
>   at app//org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)
>   at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36)
>   at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:31)
>   at app//org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:180)
>   at 
> app//org.apache.ignite.internal.placementdriver.ActiveActorTest.testChangeLeaderForce(ActiveActorTest.java:370)
> {noformat}
> From the log we can see that the leadership transfer, which was supposed to 
> be successful, does not happen. The behaviour is the following:
> 1) Current leader is {{Leader: ClusterNodeImpl 
> [id=e99210fb-f872-4e08-a99c-53f9512da20e, name=aat_tclf_1235}}
> 2) We want to transfer leadership to {{Peer to transfer leader: Peer 
> [consistentId=aat_tclf_1234, idx=0]}}
> 3) Process of transfer is started
> 4) We receive a warning about an error during {{GetLeaderRequestImpl}}:
> {noformat}
> [2024-01-29T05:19:08,855][WARN 
> ][CompletableFutureDelayScheduler][RaftGroupServiceImpl] Recoverable error 
> during the request occurred (will be retried on the randomly selected node) 
> [request=GetLeaderRequestImpl [groupId=TestReplicationGroup, 
> peerId=aat_tclf_1235], peer=Peer [consistentId=aat_tclf_1235, idx=0], 
> newPeer=Peer [consistentId=aat_tclf_1234, idx=0]].
> java.util.concurrent.CompletionException: 
> java.util.concurrent.TimeoutException
>   at 
> java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:367)
>  ~[?:?]
>   at 
> java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:376)
>  ~[?:?]
>   at 
> java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:1019)
>  ~[?:?]
>   at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
>  [?:?]
>   at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
>  [?:?]
>   at 
> java.util.concurrent.CompletableFuture$Timeout.run(CompletableFuture.java:2792)
>  [?:?]
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
>   at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
>  [?:?]
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>  [?:?]
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>  [?:?]
>   at java.lang.Thread.run(Thread.java:834) [?:?]
> Caused by: java.util.concurrent.TimeoutException
>   ... 7 more
> {noformat}
> 5) After that we see that node {{aat_tclf_1236}} sends an invalid 
> {{RequestVoteResponse}} because it thinks that it is the leader:
> {noformat}
> [2024-01-29T05:19:11,370][WARN 
> ][%aat_tclf_1234%JRaft-Response-Processor-15][NodeImpl] Node 
>  received invalid RequestVoteResponse 
> from aat_tclf_1236, state not in STATE_CANDIDATE but STATE_LEADER.
> {noformat}
>  
> Tests {{ActiveActorTest#testChangeLeaderForce}} and 
> {{TopologyAwareRaftGroupServiceTest#testChangeLeaderForce}} were muted.
> Also there are some other problems with these tests: they incorrectly clean 
> up resources in case of failure. The cluster is stopped in the test itself, 
> meaning that if some assertion fails, the rest of the test won't be 
> evaluated, hence the cluster won't be stopped.
> The next problem is that if we run this test several times, even if 

[jira] [Created] (IGNITE-21513) ActiveActorTest#testChangeLeaderForce is flaky

2024-02-12 Thread Denis Chudov (Jira)
Denis Chudov created IGNITE-21513:
-

 Summary: ActiveActorTest#testChangeLeaderForce is flaky
 Key: IGNITE-21513
 URL: https://issues.apache.org/jira/browse/IGNITE-21513
 Project: Ignite
  Issue Type: Bug
Reporter: Denis Chudov


{code:java}
[05:19:12]F: 
[org.apache.ignite.internal.placementdriver.ActiveActorTest.testChangeLeaderForce(TestInfo)]
 org.opentest4j.AssertionFailedError: expected:  but was: 
at 
app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
   
at 
app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
   
at app//org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)
at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36) 
at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:31) 
at app//org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:180)
at 
app//org.apache.ignite.internal.placementdriver.ActiveActorTest.testChangeLeaderForce(ActiveActorTest.java:370){code}
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21381) ActiveActorTest#testChangeLeaderForce has problems with resource cleanup

2024-02-12 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov updated IGNITE-21381:
--
Summary: ActiveActorTest#testChangeLeaderForce has problems with resource 
cleanup  (was: ActiveActorTest#testChangeLeaderForce is flaky )

> ActiveActorTest#testChangeLeaderForce has problems with resource cleanup
> 
>
> Key: IGNITE-21381
> URL: https://issues.apache.org/jira/browse/IGNITE-21381
> Project: Ignite
>  Issue Type: Bug
>Reporter: Mirza Aliev
>Priority: Major
>  Labels: ignite-3
> Attachments: screenshot-1.png, screenshot-2.png
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> {{ActiveActorTest#testChangeLeaderForce}} started to be flaky on TC with 
> {noformat}
> [05:19:12]F:   
> [org.apache.ignite.internal.placementdriver.ActiveActorTest.testChangeLeaderForce(TestInfo)]
>  org.opentest4j.AssertionFailedError: expected:  but was: 
>   at 
> app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
>   at 
> app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
>   at app//org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)
>   at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36)
>   at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:31)
>   at app//org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:180)
>   at 
> app//org.apache.ignite.internal.placementdriver.ActiveActorTest.testChangeLeaderForce(ActiveActorTest.java:370)
> {noformat}
> From the log we can see that the leadership transfer, which was supposed to 
> be successful, does not happen. The behaviour is the following:
> 1) Current leader is {{Leader: ClusterNodeImpl 
> [id=e99210fb-f872-4e08-a99c-53f9512da20e, name=aat_tclf_1235}}
> 2) We want to transfer leadership to {{Peer to transfer leader: Peer 
> [consistentId=aat_tclf_1234, idx=0]}}
> 3) Process of transfer is started
> 4) We receive a warning about an error during {{GetLeaderRequestImpl}}:
> {noformat}
> [2024-01-29T05:19:08,855][WARN 
> ][CompletableFutureDelayScheduler][RaftGroupServiceImpl] Recoverable error 
> during the request occurred (will be retried on the randomly selected node) 
> [request=GetLeaderRequestImpl [groupId=TestReplicationGroup, 
> peerId=aat_tclf_1235], peer=Peer [consistentId=aat_tclf_1235, idx=0], 
> newPeer=Peer [consistentId=aat_tclf_1234, idx=0]].
> java.util.concurrent.CompletionException: 
> java.util.concurrent.TimeoutException
>   at 
> java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:367)
>  ~[?:?]
>   at 
> java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:376)
>  ~[?:?]
>   at 
> java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:1019)
>  ~[?:?]
>   at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
>  [?:?]
>   at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
>  [?:?]
>   at 
> java.util.concurrent.CompletableFuture$Timeout.run(CompletableFuture.java:2792)
>  [?:?]
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
>   at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
>  [?:?]
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>  [?:?]
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>  [?:?]
>   at java.lang.Thread.run(Thread.java:834) [?:?]
> Caused by: java.util.concurrent.TimeoutException
>   ... 7 more
> {noformat}
> 5) After that we see that node {{aat_tclf_1236}} sends an invalid 
> {{RequestVoteResponse}} because it thinks that it is the leader:
> {noformat}
> [2024-01-29T05:19:11,370][WARN 
> ][%aat_tclf_1234%JRaft-Response-Processor-15][NodeImpl] Node 
>  received invalid RequestVoteResponse 
> from aat_tclf_1236, state not in STATE_CANDIDATE but STATE_LEADER.
> {noformat}
>  
> Tests {{ActiveActorTest#testChangeLeaderForce}} and 
> {{TopologyAwareRaftGroupServiceTest#testChangeLeaderForce}} were muted.
> Also there are some other problems with these tests: they incorrectly clean 
> up resources in case of failure. The cluster is stopped in the test itself, 
> meaning that if some assertion fails, the rest of the test won't be 
> evaluated, hence the cluster won't be stopped.
> The next problem is that if we run this test several times, even if they 
> pass successfully, we can see that at some point a new test cannot be run 
> because 

[jira] [Created] (IGNITE-21500) Retry implicit full transactions in case of exceptions related to primary replica failure or move

2024-02-08 Thread Denis Chudov (Jira)
Denis Chudov created IGNITE-21500:
-

 Summary: Retry implicit full transactions in case of exceptions 
related to primary replica failure or move
 Key: IGNITE-21500
 URL: https://issues.apache.org/jira/browse/IGNITE-21500
 Project: Ignite
  Issue Type: Improvement
Reporter: Denis Chudov


*Motivation* 

Implicit transactions are usually "full" and include just one transactional 
request, which can be safely retried in case of primary replica related 
exceptions (PrimaryReplicaMissException, etc.). So users will never see these 
exceptions in case of primary replica failures.

*Definition of done*

PrimaryReplicaMissException, PrimaryReplicaAwaitException, and 
TransactionExceptions with messages like "Failed to get the primary replica" or 
"Failed to resolve the primary replica node" are not propagated to the users in 
the case of implicit full transactions. If it is not possible to await the 
primary replica, a PrimaryReplicaAwaitException or a transaction exception with 
a timeout is still possible.
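
A minimal sketch of the retry loop, assuming a hypothetical helper around the 
single-request operation (the exception names are taken from this ticket and 
assumed to be unchecked; this is not the actual Ignite internals):

{code:java}
// Retrying is safe because an implicit "full" transaction consists of exactly
// one request: the next attempt re-resolves the primary replica and re-runs
// the whole operation from scratch.
static <T> T runImplicitWithRetry(Supplier<T> operation, int maxRetries) {
    for (int attempt = 0; ; attempt++) {
        try {
            return operation.get();
        } catch (PrimaryReplicaMissException | PrimaryReplicaAwaitException e) {
            if (attempt >= maxRetries) {
                throw e; // give up: a timeout-like exception is still possible
            }
            // The primary replica failed or moved; fall through and retry.
        }
    }
}
{code}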



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21293) Scan cursors should be closed if the tx coordinator is absent

2024-02-08 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov updated IGNITE-21293:
--
Summary: Scan cursors should be closed if the tx coordinator is absent  
(was: Scan cursors do not close on transaction recovery)

> Scan cursors should be closed if the tx coordinator is absent
> -
>
> Key: IGNITE-21293
> URL: https://issues.apache.org/jira/browse/IGNITE-21293
> Project: Ignite
>  Issue Type: Bug
>Reporter: Vladislav Pyatkov
>Assignee: Denis Chudov
>Priority: Major
>  Labels: ignite-3
>
> h3. Motivation
> Open cursors require extra memory on the server side. Hence, these resources 
> cannot be kept for a long time.
> h3. Implementation notes
> During the recovery procedure, the server receives a cleanup message (the 
> message releases locks). On processing the message, we update the local 
> transaction state, and this should also close all the cursors related to 
> this transaction.
> h3. Definition of done
> All cursors should be closed on the RW transaction recovery and if a 
> coordinator of RO transaction leaves the cluster.
> h3. Possible solution
> The reason why the cursors are not being closed during the recovery is that 
> the normal way of closing them is implemented in the 
> {{WriteIntentSwitchReplicaRequest}} handler, but for the recovery case we 
> don't have the collection of enlisted partitions, thus no write intent switch 
> is triggered.
> We could follow the same approach as the lock manager uses, but we would need 
> node-wide access to all the cursors opened in the current transaction. There 
> is another way - instead of closing the cursors directly we can shift the 
> responsibility to the partition listener itself.
> Each node has an in-memory txnState map, tracking the state of the 
> transactions. If we add listeners to this map, then on registering a new 
> cursor a partition listener will be able to check the current transaction 
> state and add a listener for a terminal one (see the sketch below).
> When the tx state is changed to a terminal one, the cursors will be closed.
> We can also create a cleanup thread which would check the coordinator node id 
> associated with each cursor, and if the coordinator is absent, the cursor 
> would have to be closed.
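
A sketch of the listener-based approach, assuming hypothetical registry and 
tx-state types (the real Ignite types differ):

{code:java}
import java.util.Map;
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical view of the in-memory txnState map.
interface TxStateMap {
    boolean isTerminal(UUID txId);

    void onTerminal(UUID txId, Runnable callback);
}

// Cursors are tracked node-wide per transaction; a callback on the txnState
// map closes them once the tx reaches a terminal state. If the tx is already
// terminal at registration time, the cursor is closed immediately.
class TxCursorRegistry {
    private final Map<UUID, Set<AutoCloseable>> cursorsByTx = new ConcurrentHashMap<>();

    void register(UUID txId, AutoCloseable cursor, TxStateMap txnState) {
        if (txnState.isTerminal(txId)) {
            closeQuietly(cursor); // tx already finished: don't leak the cursor
            return;
        }
        cursorsByTx.computeIfAbsent(txId, id -> ConcurrentHashMap.newKeySet()).add(cursor);
        // A real implementation must also handle the race between this
        // registration and a concurrent transition to a terminal state.
        txnState.onTerminal(txId, () -> closeAll(txId));
    }

    private void closeAll(UUID txId) {
        Set<AutoCloseable> cursors = cursorsByTx.remove(txId);
        if (cursors != null) {
            cursors.forEach(TxCursorRegistry::closeQuietly);
        }
    }

    private static void closeQuietly(AutoCloseable cursor) {
        try {
            cursor.close();
        } catch (Exception ignored) {
            // Best effort on cleanup.
        }
    }
}
{code}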



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21293) Scan cursors do not close on transaction recovery

2024-02-08 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov updated IGNITE-21293:
--
Description: 
h3. Motivation

Open cursors require extra memory on the server side. Hence, these resources 
cannot be kept for a long time.
h3. Implementation notes

During the recovery procedure, the server receives a cleanup message (the 
message releases locks). On processing the message, we update the local 
transaction state, and this should also close all the cursors related to this 
transaction.
h3. Definition of done

All cursors should be closed on the RW transaction recovery and if a 
coordinator of RO transaction leaves the cluster.
h3. Possible solution

The reason why the cursors are not being closed during the recovery is that the 
normal way of closing them is implemented in the 
{{WriteIntentSwitchReplicaRequest}} handler, but for the recovery case we don't 
have the collection of enlisted partitions, thus no write intent switch is 
triggered.
We could follow the same approach as the lock manager uses, but we would need 
node-wide access to all the cursors opened in the current transaction. There is 
another way - instead of closing the cursors directly we can shift the 
responsibility to the partition listener itself.
Each node has an in-memory txnState map, tracking the state of the 
transactions. If we add listeners to this map, then on registering a new cursor 
a partition listener will be able to check the current transaction state and 
add a listener for a terminal one. 
When the tx state is changed to a terminal one, the cursors will be closed.

We can also create a cleanup thread which would check the coordinator node id 
associated with each cursor, and if the coordinator is absent, the cursor would 
have to be closed.

  was:
h3. Motivation
Open cursors require extra memory on the server side. Hence, these resources 
cannot be kept for a long time.

h3. Implementation notes
During the recovery procedure, the server receives a cleanup message (the 
message releases locks). On processing the message, we update the local 
transaction state, and this should also close all the cursors related to this 
transaction.

h3. Definition of done
All cursors should be closed on the RW transaction recovery.

h3. Possible solution
The reason why the cursors are not being closed during the recovery is that the 
normal way of closing them is implemented in the 
{{WriteIntentSwitchReplicaRequest}} handler, but for the recovery case we don't 
have the collection of enlisted partitions, thus no write intent switch is 
triggered.
We could follow the same approach as the lock manager uses, but we would need 
node-wide access to all the cursors opened in the current transaction. There is 
another way - instead of closing the cursors directly we can shift the 
responsibility to the partition listener itself.
Each node has an in-memory txnState map, tracking the state of the 
transactions. If we add listeners to this map, then on registering a new cursor 
a partition listener will be able to check the current transaction state and 
add a listener for a terminal one. 
When the tx state is changed to a terminal one, the cursors will be closed.

h4. Pitfalls
Currently the tx cursors are closed before ensuring the completion of the read 
and update futures. There is a chance that one opens a new cursor after the 
"close cursors" stage. Checking the TX state before registering a cursor should 
fix this: if the transaction is already in a terminal state, the cursor should 
be closed immediately.

Another one: the tx state is updated from different places - 
{{PartitionReplicaListener}} and raft's {{PartitionListener}}. We need to make 
sure the tx cleanup flow is correct.



> Scan cursors do not close on transaction recovery
> -
>
> Key: IGNITE-21293
> URL: https://issues.apache.org/jira/browse/IGNITE-21293
> Project: Ignite
>  Issue Type: Bug
>Reporter: Vladislav Pyatkov
>Assignee: Denis Chudov
>Priority: Major
>  Labels: ignite-3
>
> h3. Motivation
> Open cursors require extra memory on the server side. Hence, these resources 
> cannot be kept for a long time.
> h3. Implementation notes
> During the recovery procedure, the server receives a cleanup message (the 
> message releases locks). On processing the message, we update the local 
> transaction state, and this should also close all the cursors related to 
> this transaction.
> h3. Definition of done
> All cursors should be closed on the RW transaction recovery and if a 
> coordinator of RO transaction leaves the cluster.
> h3. Possible solution
> The reason why the cursors are not being closed during the recovery is that 
> the normal way of closing them is implemented in the 
> {{WriteIntentSwitchReplicaRequest}} handler, but for the recovery case we 
> don't have the 

[jira] [Comment Edited] (IGNITE-21247) Log enhancements for LeaseUpdater

2024-02-07 Thread Denis Chudov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17815167#comment-17815167
 ] 

Denis Chudov edited comment on IGNITE-21247 at 2/7/24 8:54 AM:
---

Under this ticket the printing of lease statistics was added:

 
{code:java}
[2024-02-07T10:47:31,387][INFO ][%iinrt_dosor_0%lease-updater-1][LeaseUpdater] 
Leases updated (printed once per 10 iteration(s)): 
[inCurrentIteration=LeaseStats [leasesCreated=3, leasesPublished=0, 
leasesProlonged=0, leasesWithoutCandidates=0], active=0, 
currentAssignmentsSize=3].{code}
 

 
 * {*}inCurrentIteration{*}: leases that were processed in the iteration of 
LeaseUpdater that printed this log message. {*}leasesCreated{*}: how many 
leases were created (and negotiation started); {*}leasesPublished{*}: how many 
of them were published after successful negotiation; {*}leasesProlonged{*}: how 
many were prolonged; {*}leasesWithoutCandidates{*}: leases that had to be 
created or prolonged but had no leaseholder candidate.
 * {*}active{*}: active leases (accepted and not outdated).
 * {*}currentAssignmentsSize{*}: the total size of the processed assignments 
list (the number of replication groups).


was (Author: denis chudov):
Under this ticket the printing of lease statistics was added:
{code:java}
[2024-02-07T10:47:31,387][INFO ][%iinrt_dosor_0%lease-updater-1][LeaseUpdater] 
Leases updated (printed once per 10 iteration(s)): 
[inCurrentIteration=LeaseStats [leasesCreated=3, leasesPublished=0, 
leasesProlonged=0, leasesWithoutCandidates=0], active=0, 
currentAssignmentsSize=3].{code}

> Log enhancements for LeaseUpdater
> -
>
> Key: IGNITE-21247
> URL: https://issues.apache.org/jira/browse/IGNITE-21247
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Denis Chudov
>Assignee: Denis Chudov
>Priority: Major
>  Labels: ignite-3
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> In 
> [https://ci.ignite.apache.org/viewLog.html?buildId=7754161=ApacheIgnite3xGradle_Test_RunAllTests]
>  , test failure of 
> {{{}org.apache.ignite.internal.raftsnapshot.ItTableRaftSnapshotsTest: 
> leaderFeedsFollowerWithSnapshot{}}}, we see that there are no log messages on 
> the replica about lease negotiation, which means that it hadn't even started 
> on the placement driver active actor's side. But the active actor had started 
> before. The log doesn't provide any information about what happened in 
> LeaseUpdater.
> The suggestion is to add logging to know whether some exception happened in 
> {{{}updateLeaseBatchInternal{}}}, or in {{{}LeaseNegotiator#negotiate{}}}, 
> and logging of lease updating statistics (how many groups without 
> leaseholders were detected, how many negotiations are in progress, how many 
> leases are prolonged). 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (IGNITE-21247) Log enhancements for LeaseUpdater

2024-02-07 Thread Denis Chudov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17815167#comment-17815167
 ] 

Denis Chudov commented on IGNITE-21247:
---

Under this ticket the printing of lease statistics was added:
{code:java}
[2024-02-07T10:47:31,387][INFO ][%iinrt_dosor_0%lease-updater-1][LeaseUpdater] 
Leases updated (printed once per 10 iteration(s)): 
[inCurrentIteration=LeaseStats [leasesCreated=3, leasesPublished=0, 
leasesProlonged=0, leasesWithoutCandidates=0], active=0, 
currentAssignmentsSize=3].{code}

> Log enhancements for LeaseUpdater
> -
>
> Key: IGNITE-21247
> URL: https://issues.apache.org/jira/browse/IGNITE-21247
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Denis Chudov
>Assignee: Denis Chudov
>Priority: Major
>  Labels: ignite-3
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> In 
> [https://ci.ignite.apache.org/viewLog.html?buildId=7754161=ApacheIgnite3xGradle_Test_RunAllTests]
>  , test failure of 
> {{{}org.apache.ignite.internal.raftsnapshot.ItTableRaftSnapshotsTest: 
> leaderFeedsFollowerWithSnapshot{}}}, we see that there are no log messages on 
> the replica about lease negotiation, which means that it hadn't even started 
> on the placement driver active actor's side. But the active actor had started 
> before. The log doesn't provide any information about what happened in 
> LeaseUpdater.
> The suggestion is to add logging to know whether some exception happened in 
> {{{}updateLeaseBatchInternal{}}}, or in {{{}LeaseNegotiator#negotiate{}}}, 
> and logging of lease updating statistics (how many groups without 
> leaseholders were detected, how many negotiations are in progress, how many 
> leases are prolonged). 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (IGNITE-21473) Some transactional tests are not necessary for three node tests

2024-02-06 Thread Denis Chudov (Jira)
Denis Chudov created IGNITE-21473:
-

 Summary: Some transactional tests are not necessary for three node 
tests
 Key: IGNITE-21473
 URL: https://issues.apache.org/jira/browse/IGNITE-21473
 Project: Ignite
  Issue Type: Bug
Reporter: Denis Chudov


Following tests:

 
{code:java}
TxAbstractTest#testTransactionMultiThreadedCommit
TxAbstractTest#testTransactionMultiThreadedCommitEmpty
TxAbstractTest#testTransactionMultiThreadedRollback
TxAbstractTest#testTransactionMultiThreadedRollbackEmpty
TxAbstractTest#testTransactionMultiThreadedMixed
TxAbstractTest#testTransactionMultiThreadedMixedEmpty
{code}
take significant time on TC but are not actually necessary for all 
implementations of TxAbstractTest. They can be moved to another subclass.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (IGNITE-21473) Some transactional tests are not necessary for three node tests

2024-02-06 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov reassigned IGNITE-21473:
-

Ignite Flags:   (was: Docs Required,Release Notes Required)
Assignee: Denis Chudov
  Labels: ignite-3  (was: )

> Some transactional tests are not necessary for three node tests
> ---
>
> Key: IGNITE-21473
> URL: https://issues.apache.org/jira/browse/IGNITE-21473
> Project: Ignite
>  Issue Type: Bug
>Reporter: Denis Chudov
>Assignee: Denis Chudov
>Priority: Major
>  Labels: ignite-3
>
> Following tests:
>  
> {code:java}
> TxAbstractTest#testTransactionMultiThreadedCommit
> TxAbstractTest#testTransactionMultiThreadedCommitEmpty
> TxAbstractTest#testTransactionMultiThreadedRollback
> TxAbstractTest#testTransactionMultiThreadedRollbackEmpty
> TxAbstractTest#testTransactionMultiThreadedMixed
> TxAbstractTest#testTransactionMultiThreadedMixedEmpty
> {code}
> take significant time on TC but are not actually necessary for all 
> implementations of TxAbstractTest. They can be moved to another subclass.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21247) Log enhancements for LeaseUpdater

2024-02-02 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov updated IGNITE-21247:
--
Reviewer: Vladislav Pyatkov

> Log enhancements for LeaseUpdater
> -
>
> Key: IGNITE-21247
> URL: https://issues.apache.org/jira/browse/IGNITE-21247
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Denis Chudov
>Assignee: Denis Chudov
>Priority: Major
>  Labels: ignite-3
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In 
> [https://ci.ignite.apache.org/viewLog.html?buildId=7754161=ApacheIgnite3xGradle_Test_RunAllTests]
>  , test failure of 
> {{{}org.apache.ignite.internal.raftsnapshot.ItTableRaftSnapshotsTest: 
> leaderFeedsFollowerWithSnapshot{}}}, we see that there are no log messages on 
> the replica about lease negotiation, which means that it hadn't even started 
> on the placement driver active actor's side. But the active actor had started 
> before. The log doesn't provide any information about what happened in 
> LeaseUpdater.
> The suggestion is to add logging to know whether some exception happened in 
> {{{}updateLeaseBatchInternal{}}}, or in {{{}LeaseNegotiator#negotiate{}}}, 
> and logging of lease updating statistics (how many groups without 
> leaseholders were detected, how many negotiations are in progress, how many 
> leases are prolonged). 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (IGNITE-15226) Print original exception when SSLException occurs

2024-02-01 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-15226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov reassigned IGNITE-15226:
-

Assignee: (was: Denis Chudov)

> Print original exception when SSLException occurs
> -
>
> Key: IGNITE-15226
> URL: https://issues.apache.org/jira/browse/IGNITE-15226
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Denis Chudov
>Priority: Major
>
> We have to print the original message when an SSLException occurs.
> {noformat}
> 2021-02-23 03:23:35.579   [2021-02-23 03:23:35,579][WARN 
> ][grid-nio-worker-client-listener-0-#150][ClientListenerProcessor] Closing 
> NIO session because of unhandled exception [cls=class 
> o.a.i.i.util.nio.GridNioException, msg=Failed to decode SSL data: 
> GridSelectorNioSessionImpl [worker=GridWorker 
> [name=grid-nio-worker-client-listener-0, igniteInstanceName=null, 
> finished=false, heartbeatTs=1614039815570, hashCode=1938562251, 
> interrupted=false, 
> runner=grid-nio-worker-client-listener-0-#150]AbstractNioClientWorker [idx=0, 
> bytesRcvd=0, bytesSent=0, bytesRcvd0=0, bytesSent0=0, select=true, 
> super=]ByteBufferNioClientWorker [readBuf=java.nio.HeapByteBuffer[pos=517 
> lim=517 cap=8192], super=], writeBuf=null, readBuf=null, inRecovery=null, 
> outRecovery=null, super=GridNioSessionImpl [locAddr=IP, rmtAddr=IP, 
> createTime=1614039815116, closeTime=0, bytesSent=7268, bytesRcvd=7785, 
> bytesSent0=7268, bytesRcvd0=7785, sndSchedTime=1614039815560, 
> lastSndTime=1614039815570, lastRcvTime=1614039815570, readsPaused=false, 
> filterChain=GridNioCodecFilter [parser=ClientListenerBufferedParser, 
> directMode=false]FilterChain[filters=[GridNioAsyncNotifyFilter, , SSL 
> filter], accepted=true, markedForClose=false]]]{noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21411) Prohibit operations on the finished RO transactions

2024-01-31 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov updated IGNITE-21411:
--
Description: 
*Motivation*

For now we don't have any mechanism in the RO transactions implementation to 
prohibit further gets/scans after this transaction is finished. At the same 
time, the ReadOnlyTransactionImpl#finish method updates the observable 
timestamp tracker, which is necessary for implicit RO transactions, and 
completes the RO tx future, which unblocks the low watermark so it can move 
forward.

Cursors that were opened within a transaction should also be closed, but this 
is out of the scope of this ticket, see IGNITE-21291.

*Definition of done*

Any operations on the finished RO transactions are prohibited like it is done 
on RW transactions.

*Implementation notes*

An RW-lock is used to check and prohibit enlists to RW transactions; something 
simpler can be used for RO transactions. There is a 
ReadOnlyTransactionImpl#finishGuard - it may turn out to be enough for the 
purpose of locking the transaction against new operations.

  was:
*Motivation*

For now we don't have any mechanism in the RO transactions implementation to 
prohibit further gets/scans after this transaction is finished. At the same 
time, the ReadOnlyTransactionImpl#finish method updates the observable 
timestamp tracker, which is necessary for implicit RO transactions, and 
completes the RO tx future, which unblocks the low watermark so it can move 
forward.

Cursors that were opened within a transaction should also be closed on this 
transaction finish.

*Definition of done*

Any operations on the finished RO transactions are prohibited like it is done 
on RW transactions.

*Implementation notes*

An RW-lock is used to check and prohibit enlists to RW transactions; something 
simpler can be used for RO transactions. There is a 
ReadOnlyTransactionImpl#finishGuard - it may turn out to be enough for the 
purpose of locking the transaction against new operations.


> Prohibit operations on the finished RO transactions
> ---
>
> Key: IGNITE-21411
> URL: https://issues.apache.org/jira/browse/IGNITE-21411
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Denis Chudov
>Priority: Major
>  Labels: ignite-3
>
> *Motivation*
> For now we don't have any mechanism in the RO transactions implementation to 
> prohibit further gets/scans after this transaction is finished. At the same 
> time, the ReadOnlyTransactionImpl#finish method updates the observable 
> timestamp tracker, which is necessary for implicit RO transactions, and 
> completes the RO tx future, which unblocks the low watermark so it can move 
> forward.
> Cursors that were opened within a transaction should also be closed, but this 
> is out of the scope of this ticket, see IGNITE-21291.
> *Definition of done*
> Any operations on the finished RO transactions are prohibited like it is done 
> on RW transactions.
> *Implementation notes*
> An RW-lock is used to check and prohibit enlists to RW transactions; 
> something simpler can be used for RO transactions. There is a 
> ReadOnlyTransactionImpl#finishGuard - it may turn out to be enough for the 
> purpose of locking the transaction against new operations (see the sketch 
> below).
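
A minimal sketch of the finish-guard idea, with hypothetical names (a simple 
flag may be enough for RO transactions, compared to the RW enlist lock):

{code:java}
import java.util.concurrent.atomic.AtomicBoolean;

// Every get/scan checks the flag before doing any work; finish() flips it
// exactly once, after which all further operations are rejected.
class ReadOnlyFinishGuard {
    private final AtomicBoolean finished = new AtomicBoolean();

    void checkNotFinished() {
        if (finished.get()) {
            throw new IllegalStateException("Transaction is already finished.");
        }
    }

    // Returns true only for the caller that actually performs the finish.
    boolean tryFinish() {
        return finished.compareAndSet(false, true);
    }
}
{code}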



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21411) Prohibit operations on the finished RO transactions

2024-01-31 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov updated IGNITE-21411:
--
Description: 
*Motivation*

For now we don't have any mechanism in the RO transactions implementation to 
prohibit further gets/scans after this transaction is finished. At the same 
time, the ReadOnlyTransactionImpl#finish method updates the observable 
timestamp tracker, which is necessary for implicit RO transactions, and 
completes the RO tx future, which unblocks the low watermark so it can move 
forward.

Cursors that were opened within a transaction should also be closed on this 
transaction finish.

*Definition of done*

Any operations on the finished RO transactions are prohibited like it is done 
on RW transactions.

*Implementation notes*

An RW-lock is used to check and prohibit enlists to RW transactions; something 
simpler can be used for RO transactions. There is a 
ReadOnlyTransactionImpl#finishGuard - it may turn out to be enough for the 
purpose of locking the transaction against new operations.

  was:
*Motivation*

For now we don't have any mechanism in the RO transactions implementation to 
prohibit further gets/scans after this transaction is finished. At the same 
time, the ReadOnlyTransactionImpl#finish method updates the observable 
timestamp tracker, which is necessary for implicit RO transactions, and 
completes the RO tx future, which unblocks the low watermark so it can move 
forward.

*Definition of done*

Any operations on the finished RO transactions are prohibited like it is done 
on RW transactions.

*Implementation notes*

An RW-lock is used to check and prohibit enlists to RW transactions; something 
simpler can be used for RO transactions. There is a 
ReadOnlyTransactionImpl#finishGuard - it may turn out to be enough for this 
purpose.


> Prohibit operations on the finished RO transactions
> ---
>
> Key: IGNITE-21411
> URL: https://issues.apache.org/jira/browse/IGNITE-21411
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Denis Chudov
>Priority: Major
>  Labels: ignite-3
>
> *Motivation*
> For now we don't have any mechanism in the RO transactions implementation to 
> prohibit further gets/scans after this transaction is finished. At the same 
> time, the ReadOnlyTransactionImpl#finish method updates the observable 
> timestamp tracker, which is necessary for implicit RO transactions, and 
> completes the RO tx future, which unblocks the low watermark so it can move 
> forward.
> Cursors that were opened within a transaction should also be closed on this 
> transaction finish.
> *Definition of done*
> Any operations on the finished RO transactions are prohibited like it is done 
> on RW transactions.
> *Implementation notes*
> An RW-lock is used to check and prohibit enlists to RW transactions; 
> something simpler can be used for RO transactions. There is a 
> ReadOnlyTransactionImpl#finishGuard - it may turn out to be enough for the 
> purpose of locking the transaction against new operations.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21411) Prohibit operations on the finished RO transactions

2024-01-31 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov updated IGNITE-21411:
--
Description: 
*Motivation*

For now we don't have any mechanism in the RO transactions implementation to 
prohibit further gets/scans after this transaction is finished. At the same 
time, the ReadOnlyTransactionImpl#finish method updates the observable 
timestamp tracker, which is necessary for implicit RO transactions, and 
completes the RO tx future, which unblocks the low watermark so it can move 
forward.

*Definition of done*

Any operations on the finished RO transactions are prohibited like it is done 
on RW transactions.

*Implementation notes*

An RW-lock is used to check and prohibit enlists to RW transactions; something 
simpler can be used for RO transactions. There is a 
ReadOnlyTransactionImpl#finishGuard - it may turn out to be enough for this 
purpose.

  was:
Motivation

For now we don't have any mechanism in the RO transactions implementation to 
prohibit further gets/scans after this transaction is finished. At the same 
time, the ReadOnlyTransactionImpl#finish method updates the observable 
timestamp tracker, which is necessary for implicit RO transactions, and 
completes the RO tx future, which unblocks the low watermark so it can move 
forward.

*Definition of done*

Any operations on the finished RO transactions are prohibited like it is done 
on RW transactions.


> Prohibit operations on the finished RO transactions
> ---
>
> Key: IGNITE-21411
> URL: https://issues.apache.org/jira/browse/IGNITE-21411
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Denis Chudov
>Priority: Major
>  Labels: ignite-3
>
> *Motivation*
> For now we don't have any mechanism in the RO transactions implementation to 
> prohibit further gets/scans after this transaction is finished. At the same 
> time, the ReadOnlyTransactionImpl#finish method updates the observable 
> timestamp tracker, which is necessary for implicit RO transactions, and 
> completes the RO tx future, which unblocks the low watermark so it can move 
> forward.
> *Definition of done*
> Any operations on the finished RO transactions are prohibited like it is done 
> on RW transactions.
> *Implementation notes*
> An RW-lock is used to check and prohibit enlists to RW transactions; 
> something simpler can be used for RO transactions. There is a 
> ReadOnlyTransactionImpl#finishGuard - it may turn out to be enough for this 
> purpose.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21411) Prohibit operations on the finished RO transactions

2024-01-31 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov updated IGNITE-21411:
--
Description: 
Motivation

For now we don't have any mechanism in the RO transactions implementation to 
prohibit further gets/scans after this transaction is finished. At the same 
time, the ReadOnlyTransactionImpl#finish method updates the observable 
timestamp tracker, which is necessary for implicit RO transactions, and 
completes the RO tx future, which unblocks the low watermark so it can move 
forward.

*Definition of done*

Any operations on the finished RO transactions are prohibited like it is done 
on RW transactions.

  was:
Motivation

For now we don't have any mechanism in the RO transactions implementation to 
prohibit further gets/scans after this transaction is finished. At the same 
time, the ReadOnlyTransactionImpl#finish method updates the observable 
timestamp tracker, which is necessary for implicit RO transactions, and 
completes the RO tx future, which unblocks the low watermark so it can move 
forward.

Any operations 


> Prohibit operations on the finished RO transactions
> ---
>
> Key: IGNITE-21411
> URL: https://issues.apache.org/jira/browse/IGNITE-21411
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Denis Chudov
>Priority: Major
>  Labels: ignite-3
>
> Motivation
> For now we don't have any mechanism in the RO transactions implementation to 
> prohibit further gets/scans after this transaction is finished. At the same 
> time, the ReadOnlyTransactionImpl#finish method updates the observable 
> timestamp tracker, which is necessary for implicit RO transactions, and 
> completes the RO tx future, which unblocks the low watermark so it can move 
> forward.
> *Definition of done*
> Any operations on the finished RO transactions are prohibited like it is done 
> on RW transactions.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21411) Prohibit operations on the finished RO transactions

2024-01-31 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov updated IGNITE-21411:
--
Summary: Prohibit operations on the finished RO transactions  (was: 
Prohibit enlists to the finished RO transactions)

> Prohibit operations on the finished RO transactions
> ---
>
> Key: IGNITE-21411
> URL: https://issues.apache.org/jira/browse/IGNITE-21411
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Denis Chudov
>Priority: Major
>  Labels: ignite-3
>
> Motivation
> For now we don't have any mechanism in the RO transactions implementation to 
> prohibit further gets/scans after this transaction is finished. At the same 
> time, the ReadOnlyTransactionImpl#finish method updates the observable 
> timestamp tracker, which is necessary for implicit RO transactions, and 
> completes the RO tx future, which unblocks the low watermark so it can move 
> forward.
> Any operations 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21411) Prohibit enlists to the finished RO transactions

2024-01-31 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov updated IGNITE-21411:
--
Description: 
Motivation

For now we don't have any mechanism in the RO transactions implementation to 
prohibit further gets/scans after this transaction is finished. At the same 
time, the ReadOnlyTransactionImpl#finish method updates the observable 
timestamp tracker, which is necessary for implicit RO transactions, and 
completes the RO tx future, which unblocks the low watermark so it can move 
forward.

Any operations 

  was:
Motivation

For now, we don't have any mechanism in the RO transaction implementation to 
prohibit further gets/scans after the transaction is finished. At the same time, 


> Prohibit enlists to the finished RO transactions
> 
>
> Key: IGNITE-21411
> URL: https://issues.apache.org/jira/browse/IGNITE-21411
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Denis Chudov
>Priority: Major
>  Labels: ignite-3
>
> Motivation
> For now, we don't have any mechanism in the RO transaction implementation to 
> prohibit further gets/scans after the transaction is finished. At the same 
> time, the ReadOnlyTransactionImpl#finish method updates the observable 
> timestamp tracker, which is necessary for implicit RO transactions, and 
> completes the RO tx future, which unblocks the low watermark so that it can 
> move forward.
> Any operations 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21411) Prohibit enlists to the finished RO transactions

2024-01-31 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov updated IGNITE-21411:
--
Description: 
Motivation

For now, we don't have any mechanism in the RO transaction implementation to 
prohibit further gets/scans after the transaction is finished. At the same time, 

  was:TBD


> Prohibit enlists to the finished RO transactions
> 
>
> Key: IGNITE-21411
> URL: https://issues.apache.org/jira/browse/IGNITE-21411
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Denis Chudov
>Priority: Major
>  Labels: ignite-3
>
> Motivation
> For now, we don't have any mechanism in the RO transaction implementation to 
> prohibit further gets/scans after the transaction is finished. At the same 
> time, 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21415) Remote nodes are not added to NodeManager

2024-01-31 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov updated IGNITE-21415:
--
Description: org.apache.ignite.raft.jraft.NodeManager has internal data 
structures (mappings of groups to lists of nodes, etc.) that are used for 
different purposes, including request processing. However, remote nodes are 
never added to these mappings. See the usages of the NodeManager#add method: 
it is called only from RaftGroupService#start, to add the local node.
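
A minimal sketch of the kind of registration that appears to be missing (the 
integration point and the hook name below are assumptions for illustration):
{code:java}
import java.util.List;

import org.apache.ignite.raft.jraft.Node;
import org.apache.ignite.raft.jraft.NodeManager;

// Hypothetical sketch: besides the local node registered in
// RaftGroupService#start, remote peers would also need to be added to
// NodeManager's mappings once group membership is known. The hook
// onGroupMembershipKnown is an assumption for illustration.
class RemoteNodeRegistrationSketch {
    void onGroupMembershipKnown(NodeManager nodeManager, List<Node> remoteNodes) {
        for (Node remote : remoteNodes) {
            nodeManager.add(remote);
        }
    }
}
{code}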

> Remote nodes are not added to NodeManager
> -
>
> Key: IGNITE-21415
> URL: https://issues.apache.org/jira/browse/IGNITE-21415
> Project: Ignite
>  Issue Type: Bug
>Reporter: Denis Chudov
>Priority: Major
>  Labels: ignite-3
>
> org.apache.ignite.raft.jraft.NodeManager has internal data structures 
> (mappings of groups to lists of nodes, etc.) that are used for different 
> purposes, including request processing. However, remote nodes are never 
> added to these mappings. See the usages of the NodeManager#add method: it is 
> called only from RaftGroupService#start, to add the local node.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (IGNITE-21415) Remote nodes are not added to NodeManager

2024-01-31 Thread Denis Chudov (Jira)
Denis Chudov created IGNITE-21415:
-

 Summary: Remote nodes are not added to NodeManager
 Key: IGNITE-21415
 URL: https://issues.apache.org/jira/browse/IGNITE-21415
 Project: Ignite
  Issue Type: Bug
Reporter: Denis Chudov






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (IGNITE-21411) Prohibit enlists to the finished RO transactions

2024-01-31 Thread Denis Chudov (Jira)
Denis Chudov created IGNITE-21411:
-

 Summary: Prohibit enlists to the finished RO transactions
 Key: IGNITE-21411
 URL: https://issues.apache.org/jira/browse/IGNITE-21411
 Project: Ignite
  Issue Type: Improvement
Reporter: Denis Chudov


TBD



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21394) TimeoutException in the listener of pending assignments change shouldn't fail the watch processor

2024-01-30 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov updated IGNITE-21394:
--
Description: 
*Motivation*

The handler of the pending assignments change event 
( TableManager#handleChangePendingAssignmentEvent() ) tries to call 
changePeerAsync after starting the partition and the client. In order to know 
whether calling changePeerAsync is needed, it tries to get the current leader 
of the corresponding raft group. This call to 
RaftGroupService#refreshAndGetLeaderWithTerm can fail with a TimeoutException. 
For example, there may be no known leader on the node that the GetLeader 
request is sent to, or that node may no longer be in the raft group, etc., 
while at the same time that node is the only known peer of the raft group: in 
these cases the GetLeader request is constantly retried in the hope of 
eventually getting a response with the elected leader, but this may never 
happen. So, the TimeoutException is expected in this case.

This exception should be handled within the mentioned listener of the pending 
assignments change event; otherwise it fails the watch processor, making it 
unable to handle further meta storage updates (and making the node inoperable). 
A timeout most likely means that either the current node is not the leader of 
the raft group and changePeers shouldn't be done, or it has not caught up with 
the current assignments events; in the latter case some requests to this node 
for this partition will fail, but the node will remain operable.

*Definition of done*

TimeoutException in the listener of pending assignments change doesn't fail the 
watch processor and doesn't lead to multiple exceptions like this:
{code:java}
[2024-01-29T22:00:58,658][ERROR][%isckvt_tmccd_3344%Raft-Group-Client-5][WatchProcessor]
 Error occurred when notifying safe time advanced callback
 java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException
    at 
java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331)
 ~[?:?]
    at 
java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:346)
 ~[?:?]
    at 
java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:632)
 ~[?:?]
    at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) 
~[?:?]
    at 
java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
 ~[?:?]
    at 
org.apache.ignite.internal.raft.RaftGroupServiceImpl.sendWithRetry(RaftGroupServiceImpl.java:546)
 ~[ignite-raft-3.0.0-SNAPSHOT.jar:?]
    at 
org.apache.ignite.internal.raft.RaftGroupServiceImpl.lambda$handleErrorResponse$42(RaftGroupServiceImpl.java:635)
 ~[ignite-raft-3.0.0-SNAPSHOT.jar:?]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) 
[?:?]
    at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
    at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
 [?:?]
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) 
[?:?]
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) 
[?:?]
    at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: java.util.concurrent.TimeoutException
    ... 8 more{code}
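
A minimal sketch of the intended handling (hedged: the helper below is 
illustrative, not the actual TableManager code):
{code:java}
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: swallow the expected leader-resolution timeout inside
// the pending assignments listener so that it cannot fail the watch processor.
class PendingAssignmentsTimeoutSketch {
    static <T> CompletableFuture<Void> ignoreLeaderResolutionTimeout(
            CompletableFuture<T> refreshAndGetLeaderFuture
    ) {
        return refreshAndGetLeaderFuture.handle((leaderWithTerm, ex) -> {
            if (ex != null) {
                Throwable cause = ex instanceof CompletionException ? ex.getCause() : ex;

                if (cause instanceof TimeoutException) {
                    // Expected: most likely this node is not the leader, or it
                    // has not caught up with the assignments events yet. Skip
                    // changePeerAsync and keep the node operable.
                    return (Void) null;
                }

                throw new CompletionException(cause);
            }

            // ... leader resolved: proceed with changePeerAsync if needed ...
            return (Void) null;
        });
    }
}
{code}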
 

*Implementation notes*

It can be reproduced in integration tests like 
ItSchemaChangeKvViewTest#testMergeChangesColumnDefault, where 3 nodes are 
started and then a table with 25 partitions / 1 replica is created. During the 
table start, a rebalance is possible, for example:
 * a replication group is moved from node A to node B
 * some node C tries to perform GetLeader, and has only node A in its local 
peers
 * node A thinks it is the only member of the replication group and is not the 
leader, so it sends an "Unknown leader" response to C
 * node C constantly retries the request to node A.

  was:
*Motivation*

The handler of the pending assignments change event 
( TableManager#handleChangePendingAssignmentEvent() ) tries to call 
changePeerAsync after starting the partition and the client. In order to know 
whether calling changePeerAsync is needed, it tries to get the current leader 
of the corresponding raft group. This call to 
RaftGroupService#refreshAndGetLeaderWithTerm can fail with a TimeoutException. 
For example, there may be no known leader on the node that the GetLeader 
request is sent to, or that node may no longer be in the raft group, etc., 
while at the same time that node is the only known peer of the raft group: in 
these cases the GetLeader request is constantly retried in the hope of 
eventually getting a response with the elected leader, but this may never 
happen. So, the TimeoutException is expected in this case.

This exception should be handled within the mentioned listener of the pending 
assignments change event; otherwise it fails the watch processor, making it 
unable to handle the 

[jira] [Updated] (IGNITE-21394) TimeoutException in the listener of pending assignments change shouldn't fail the watch processor

2024-01-30 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov updated IGNITE-21394:
--
Description: 
*Motivation*

The handler of the pending assignments change event 
( TableManager#handleChangePendingAssignmentEvent() ) tries to call 
changePeerAsync after starting the partition and the client. In order to know 
whether calling changePeerAsync is needed, it tries to get the current leader 
of the corresponding raft group. This call to 
RaftGroupService#refreshAndGetLeaderWithTerm can fail with a TimeoutException. 
For example, there may be no known leader on the node that the GetLeader 
request is sent to, or that node may no longer be in the raft group, etc., 
while at the same time that node is the only known peer of the raft group: in 
these cases the GetLeader request is constantly retried in the hope of 
eventually getting a response with the elected leader, but this may never 
happen. So, the TimeoutException is expected in this case.

This exception should be handled within the mentioned listener of the pending 
assignments change event; otherwise it fails the watch processor, making it 
unable to handle further meta storage updates (and making the node inoperable). 
A timeout most likely means that either the current node is not the leader of 
the raft group and changePeers shouldn't be done, or it has not caught up with 
the current assignments events; in the latter case some client requests to 
this node for this partition will fail, but the node will remain operable.

*Definition of done*

TimeoutException in the listener of pending assignments change doesn't fail the 
watch processor and doesn't lead to multiple exceptions like this:
{code:java}
[2024-01-29T22:00:58,658][ERROR][%isckvt_tmccd_3344%Raft-Group-Client-5][WatchProcessor]
 Error occurred when notifying safe time advanced callback
 java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException
    at 
java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331)
 ~[?:?]
    at 
java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:346)
 ~[?:?]
    at 
java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:632)
 ~[?:?]
    at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) 
~[?:?]
    at 
java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
 ~[?:?]
    at 
org.apache.ignite.internal.raft.RaftGroupServiceImpl.sendWithRetry(RaftGroupServiceImpl.java:546)
 ~[ignite-raft-3.0.0-SNAPSHOT.jar:?]
    at 
org.apache.ignite.internal.raft.RaftGroupServiceImpl.lambda$handleErrorResponse$42(RaftGroupServiceImpl.java:635)
 ~[ignite-raft-3.0.0-SNAPSHOT.jar:?]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) 
[?:?]
    at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
    at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
 [?:?]
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) 
[?:?]
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) 
[?:?]
    at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: java.util.concurrent.TimeoutException
    ... 8 more{code}
 

*Implementation notes*

It can be reproduced in integration tests like 
ItSchemaChangeKvViewTest#testMergeChangesColumnDefault, where 3 nodes are 
started and then a table with 25 partitions / 1 replica is created. During the 
table start, a rebalance is possible, for example:
 * a replication group is moved from node A to node B
 * some node C tries to perform GetLeader, and has only node A in its local 
peers
 * node A thinks it is the only member of the replication group and is not the 
leader, so it sends an "Unknown leader" response to C
 * node C constantly retries the request to node A.

  was:
*Motivation*

The handler of the pending assignments change event 
( TableManager#handleChangePendingAssignmentEvent() ) tries to call 
changePeerAsync after starting the partition and the client. In order to know 
whether calling changePeerAsync is needed, it tries to get the current leader 
of the corresponding raft group. This call to 
RaftGroupService#refreshAndGetLeaderWithTerm can fail with a TimeoutException. 
For example, there may be no known leader on the node that the GetLeader 
request is sent to, or that node may no longer be in the raft group, etc., 
while at the same time that node is the only known peer of the raft group: in 
these cases the GetLeader request is constantly retried in the hope of 
eventually getting a response with the elected leader, but this may never 
happen. So, the TimeoutException is expected in this case.

This exception should be handled within the mentioned listener of the pending 
assignments change event; otherwise it fails the watch processor, making it 
unable to handle the 

[jira] [Created] (IGNITE-21394) TimeoutException in the listener of pending assignments change shouldn't fail the watch processor

2024-01-30 Thread Denis Chudov (Jira)
Denis Chudov created IGNITE-21394:
-

 Summary: TimeoutException in the listener of pending assignments 
change shouldn't fail the watch processor
 Key: IGNITE-21394
 URL: https://issues.apache.org/jira/browse/IGNITE-21394
 Project: Ignite
  Issue Type: Bug
Reporter: Denis Chudov


*Motivation*

The handler of the pending assignments change event 
( TableManager#handleChangePendingAssignmentEvent() ) tries to call 
changePeerAsync after starting the partition and the client. In order to know 
whether calling changePeerAsync is needed, it tries to get the current leader 
of the corresponding raft group. This call to 
RaftGroupService#refreshAndGetLeaderWithTerm can fail with a TimeoutException. 
For example, there may be no known leader on the node that the GetLeader 
request is sent to, or that node may no longer be in the raft group, etc., 
while at the same time that node is the only known peer of the raft group: in 
these cases the GetLeader request is constantly retried in the hope of 
eventually getting a response with the elected leader, but this may never 
happen. So, the TimeoutException is expected in this case.

This exception should be handled within the mentioned listener of the pending 
assignments change event; otherwise it fails the watch processor, making it 
unable to handle further meta storage updates (and making the node inoperable). 
A timeout most likely means that either the current node is not the leader of 
the raft group and changePeers shouldn't be done, or it has not caught up with 
the current assignments events; in the latter case some client requests to 
this node for this partition will fail, but the node will remain operable.

*Definition of done*

TimeoutException in the listener of pending assignments change doesn't fail the 
watch processor and doesn't lead to multiple exceptions like this:

 
{code:java}
[2024-01-29T22:00:58,658][ERROR][%isckvt_tmccd_3344%Raft-Group-Client-5][WatchProcessor]
 Error occurred when notifying safe time advanced callback
 java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException
    at 
java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331)
 ~[?:?]
    at 
java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:346)
 ~[?:?]
    at 
java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:632)
 ~[?:?]
    at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) 
~[?:?]
    at 
java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
 ~[?:?]
    at 
org.apache.ignite.internal.raft.RaftGroupServiceImpl.sendWithRetry(RaftGroupServiceImpl.java:546)
 ~[ignite-raft-3.0.0-SNAPSHOT.jar:?]
    at 
org.apache.ignite.internal.raft.RaftGroupServiceImpl.lambda$handleErrorResponse$42(RaftGroupServiceImpl.java:635)
 ~[ignite-raft-3.0.0-SNAPSHOT.jar:?]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) 
[?:?]
    at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
    at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
 [?:?]
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) 
[?:?]
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) 
[?:?]
    at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: java.util.concurrent.TimeoutException
    ... 8 more{code}

*Implementation notes*

It can be reproduced in integration tests like 
ItSchemaChangeKvViewTest#testMergeChangesColumnDefault, where 3 nodes are 
started and then a table with 25 partitions / 1 replica is created. During the 
table start, a rebalance is possible, for example:
 * a replication group is moved from node A to node B
 * some node C tries to perform GetLeader, and has only node A in its local 
peers
 * node A thinks it is the only member of the replication group and is not the 
leader, so it sends an "Unknown leader" response to C
 * node C constantly retries the request to node A.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21394) TimeoutException in the listener of pending assignments change shouldn't fail the watch processor

2024-01-30 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov updated IGNITE-21394:
--
Description: 
*Motivation*

The handler of the pending assignments change event 
( TableManager#handleChangePendingAssignmentEvent() ) tries to call 
changePeerAsync after starting the partition and the client. In order to know 
whether calling changePeerAsync is needed, it tries to get the current leader 
of the corresponding raft group. This call to 
RaftGroupService#refreshAndGetLeaderWithTerm can fail with a TimeoutException. 
For example, there may be no known leader on the node that the GetLeader 
request is sent to, or that node may no longer be in the raft group, etc., 
while at the same time that node is the only known peer of the raft group: in 
these cases the GetLeader request is constantly retried in the hope of 
eventually getting a response with the elected leader, but this may never 
happen. So, the TimeoutException is expected in this case.

This exception should be handled within the mentioned listener of the pending 
assignments change event; otherwise it fails the watch processor, making it 
unable to handle further meta storage updates (and making the node inoperable). 
A timeout most likely means that either the current node is not the leader of 
the raft group and changePeers shouldn't be done, or it has not caught up with 
the current assignments events; in the latter case some client requests to 
this node for this partition will fail, but the node will remain operable.

*Definition of done*

TimeoutException in the listener of pending assignments change doesn't fail the 
watch processor and doesn't lead to multiple exceptions like this:
{code:java}
[2024-01-29T22:00:58,658][ERROR][%isckvt_tmccd_3344%Raft-Group-Client-5][WatchProcessor]
 Error occurred when notifying safe time advanced callback
 java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException
    at 
java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331)
 ~[?:?]
    at 
java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:346)
 ~[?:?]
    at 
java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:632)
 ~[?:?]
    at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) 
~[?:?]
    at 
java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
 ~[?:?]
    at 
org.apache.ignite.internal.raft.RaftGroupServiceImpl.sendWithRetry(RaftGroupServiceImpl.java:546)
 ~[ignite-raft-3.0.0-SNAPSHOT.jar:?]
    at 
org.apache.ignite.internal.raft.RaftGroupServiceImpl.lambda$handleErrorResponse$42(RaftGroupServiceImpl.java:635)
 ~[ignite-raft-3.0.0-SNAPSHOT.jar:?]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) 
[?:?]
    at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
    at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
 [?:?]
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) 
[?:?]
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) 
[?:?]
    at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: java.util.concurrent.TimeoutException
    ... 8 more{code}
*Implementation notes*

It can be reproduced in integration tests like 
ItSchemaChangeKvViewTest#testMergeChangesColumnDefault, where 3 nodes are 
started and then a table with 25 partitions / 1 replica is created. During the 
table start, a rebalance is possible, for example:
 * a replication group is moved from node A to node B
 * some node C tries to perform GetLeader, and has only node A in its local 
peers
 * node A thinks it is the only member of the replication group and is not the 
leader, so it sends an "Unknown leader" response to C
 * node C constantly retries the request to node A.

  was:
*Motivation*

The handler of the pending assignments change event 
( TableManager#handleChangePendingAssignmentEvent() ) tries to call 
changePeerAsync after starting the partition and the client. In order to know 
whether calling changePeerAsync is needed, it tries to get the current leader 
of the corresponding raft group. This call to 
RaftGroupService#refreshAndGetLeaderWithTerm can fail with a TimeoutException. 
For example, there may be no known leader on the node that the GetLeader 
request is sent to, or that node may no longer be in the raft group, etc., 
while at the same time that node is the only known peer of the raft group: in 
these cases the GetLeader request is constantly retried in the hope of 
eventually getting a response with the elected leader, but this may never 
happen. So, the TimeoutException is expected in this case.

This exception should be handled within the mentioned listener of the pending 
assignments change event; otherwise it fails the watch processor, making it 
unable to handle the 

[jira] [Assigned] (IGNITE-21181) Failure to resolve a primary replica after stopping a node

2024-01-29 Thread Denis Chudov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov reassigned IGNITE-21181:
-

Assignee: Denis Chudov

> Failure to resolve a primary replica after stopping a node
> --
>
> Key: IGNITE-21181
> URL: https://issues.apache.org/jira/browse/IGNITE-21181
> Project: Ignite
>  Issue Type: Bug
>Reporter: Roman Puchkovskiy
>Assignee: Denis Chudov
>Priority: Major
>  Labels: ignite-3
> Fix For: 3.0.0-beta2
>
>
> The scenario is that the cluster consists of 3 nodes (0, 1, 2). The primary 
> replica of the sole partition is on node 0. Then node 0 is stopped, and an 
> attempt is made to do a put via node 2. The partition still has a majority, 
> but the put results in the following:
>  
> {code:java}
> org.apache.ignite.tx.TransactionException: IGN-REP-5 
> TraceId:55c59c96-17d1-4efc-8e3c-cca81b8b41ad Failed to resolve the primary 
> replica node [consistentId=itrst_ncisasiti_0]
>  
> at 
> org.apache.ignite.internal.table.distributed.storage.InternalTableImpl.lambda$enlist$69(InternalTableImpl.java:1749)
> at 
> java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930)
> at 
> java.base/java.util.concurrent.CompletableFuture.uniHandleStage(CompletableFuture.java:946)
> at 
> java.base/java.util.concurrent.CompletableFuture.handle(CompletableFuture.java:2266)
> at 
> org.apache.ignite.internal.table.distributed.storage.InternalTableImpl.enlist(InternalTableImpl.java:1739)
> at 
> org.apache.ignite.internal.table.distributed.storage.InternalTableImpl.enlistWithRetry(InternalTableImpl.java:480)
> at 
> org.apache.ignite.internal.table.distributed.storage.InternalTableImpl.enlistInTx(InternalTableImpl.java:301)
> at 
> org.apache.ignite.internal.table.distributed.storage.InternalTableImpl.upsert(InternalTableImpl.java:965)
> at 
> org.apache.ignite.internal.table.KeyValueViewImpl.lambda$putAsync$10(KeyValueViewImpl.java:196)
> at 
> org.apache.ignite.internal.table.AbstractTableView.lambda$withSchemaSync$1(AbstractTableView.java:111)
> at 
> java.base/java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1106)
> at 
> java.base/java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2235)
> at 
> org.apache.ignite.internal.table.AbstractTableView.withSchemaSync(AbstractTableView.java:111)
> at 
> org.apache.ignite.internal.table.AbstractTableView.withSchemaSync(AbstractTableView.java:102)
> at 
> org.apache.ignite.internal.table.KeyValueViewImpl.putAsync(KeyValueViewImpl.java:193)
> at 
> org.apache.ignite.internal.table.KeyValueViewImpl.put(KeyValueViewImpl.java:185)
> at 
> org.apache.ignite.internal.raftsnapshot.ItTableRaftSnapshotsTest.putToNode(ItTableRaftSnapshotsTest.java:257)
> at 
> org.apache.ignite.internal.raftsnapshot.ItTableRaftSnapshotsTest.putToNode(ItTableRaftSnapshotsTest.java:253)
> at 
> org.apache.ignite.internal.raftsnapshot.ItTableRaftSnapshotsTest.nodeCanInstallSnapshotsAfterSnapshotInstalledToIt(ItTableRaftSnapshotsTest.java:473){code}
>  
> This can be reproduced using 
> ItTableRaftSnapshotsTest#nodeCanInstallSnapshotsAfterSnapshotInstalledToIt().
> The reason is that, according to the test, the leader of the partition group 
> is transferred to node 0, which means that this node will most probably be 
> selected as primary; after that, node 0 is stopped and then the transaction 
> is started. Node 0 is still the leaseholder in the current time interval, 
> but it has already left the topology.
> We can fix the test to make it await a new primary that is present in the 
> cluster, or retry the very first transactional request. In the latter case, 
> we need to ensure that the request is actually the first and only one, i.e. 
> that no other request was sent in any parallel thread; otherwise we can't 
> retry the request on another primary.
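
As an illustration of the first option, a minimal test-side sketch of awaiting 
a primary that is still alive (the resolvePrimary supplier and aliveNodeNames 
set are assumptions, not the actual test API):
{code:java}
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical sketch: before the put, poll until the resolved primary
// replica is a node that is still in the topology.
final class AwaitAlivePrimarySketch {
    static void awaitAlivePrimary(Supplier<String> resolvePrimary, Set<String> aliveNodeNames)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + 30_000;

        while (System.currentTimeMillis() < deadline) {
            String primary = resolvePrimary.get();

            if (primary != null && aliveNodeNames.contains(primary)) {
                return; // A live node holds the lease; the put can proceed.
            }

            // The stopped node may still hold the lease for the current
            // interval; wait for a new lease to be granted.
            Thread.sleep(100);
        }

        throw new AssertionError("Primary replica was not reassigned to a live node in time");
    }
}
{code}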



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


  1   2   3   4   5   6   7   8   9   10   >