[jira] [Updated] (IGNITE-21806) Design one PrimaryReplica for many tables partitions inside the one zone

2024-04-27 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-21806:
-
Issue Type: Task  (was: Improvement)

> Design one PrimaryReplica for many tables partitions inside the one zone
> 
>
> Key: IGNITE-21806
> URL: https://issues.apache.org/jira/browse/IGNITE-21806
> Project: Ignite
>  Issue Type: Task
>Reporter: Kirill Gusakov
>Assignee: Alexander Lapin
>Priority: Major
>  Labels: ignite-3
>
> *Motivation*
> Once IGNITE-18991 is done, we will have ZonePartitionId-oriented management 
> inside the PlacementDriver, but the number of replicas will still equal the 
> number of tables per partition in the zone. As an iterative step on the road to 
> "zone-based collocation" (one RAFT group per zone partition), we must 
> implement another approach: one PrimaryReplica for many table partitions 
> inside the zone. So, in the case of 2 tables inside 1 zone, we must have 1 
> PrimaryReplica per partition.
> *Definition of done*
> Prepare a design and the corresponding ticket breakdown for the feature implementation



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (IGNITE-21806) Design one PrimaryReplica for many tables partitions inside the one zone

2024-04-27 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin resolved IGNITE-21806.
--
Resolution: Fixed

> Design one PrimaryReplica for many tables partitions inside the one zone
> 
>
> Key: IGNITE-21806
> URL: https://issues.apache.org/jira/browse/IGNITE-21806
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Kirill Gusakov
>Assignee: Alexander Lapin
>Priority: Major
>  Labels: ignite-3
>
> *Motivation*
> Once IGNITE-18991 is done, we will have ZonePartitionId-oriented management 
> inside the PlacementDriver, but the number of replicas will still equal the 
> number of tables per partition in the zone. As an iterative step on the road to 
> "zone-based collocation" (one RAFT group per zone partition), we must 
> implement another approach: one PrimaryReplica for many table partitions 
> inside the zone. So, in the case of 2 tables inside 1 zone, we must have 1 
> PrimaryReplica per partition.
> *Definition of done*
> Prepare a design and the corresponding ticket breakdown for the feature implementation



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21881) Deal with retry send metastorage raft commands after a timeout

2024-04-26 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-21881:
-
Description: 
As a result of the analysis and reproduction of IGNITE-21142, it was found that 
a metastorage raft command can be re-sent if it does not complete within the 
timeout, which may not be good and may lead to hidden negative consequences, 
such as in IGNITE-21142.

Here we need to find out the reasons for this decision (retry on timeout) and 
understand what to do next. I think we should use an infinite timeout.
h3. Upd#1

As discussed, it's required to detect whether an InvokeCommand was already 
processed on the server and, if so, resend the original response instead of 
reprocessing. First of all, it's not only about invoke but also about 
multiInvoke. Worth mentioning though that this relates only to MS and maybe CMG 
but not to Partitions: within partitions, the tx protocol, along with returning 
results from indexes instead of from raft, protects us from non-idempotent 
command processing.

All in all, the following solution is expected to be implemented (a sketch 
follows below):
 * A new interface NonIdempotentCommand is introduced with an id field.
 * All MS non-idempotent commands, such as InvokeCommand and MultiInvokeCommand, 
implement the aforementioned interface.
 * On the client side, an identifier is added to the command. Two options are 
possible here:
 ** It's possible to set the id on the command at creation time. This is the 
easiest way, but it requires extra effort on the server side to track command 
time. In that case it's possible to use LongCounter + nodeId as an id.
 ** Or it's possible to adjust the command with an id within the retry loop; in 
that case we may use the id as a "command time", which of course also means 
that a clock or System.currentTime<> should be used as the id. I strongly 
believe the first option is better for now.
 * On the server side, precisely within the MS state machine, a new 
nonIdempotentCommandCache is introduced: commandId -> (commandResult, 
commandStartTime).
 * For each NonIdempotentCommand, the following logic should be implemented:
 ** As an initial step, check whether there's a command with the given id in the 
cache; if so, just return the cached result without reprocessing the command.
 ** If the given command is not in the cache, process it and populate the cache 
with the result.

Basically that's all. Both cache persistence and recovery on group restart, as 
well as cache cleanup, will be covered within separate tickets.
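
A minimal, self-contained sketch of the proposed flow. All names here 
(NonIdempotentCommand, CommandIdGenerator, IdempotencyGuard) and the cache 
shape are illustrative assumptions, not the actual implementation:
{code:java}
// Sketch only: mirrors the flow above (id assigned on the client,
// cache lookup on the server before applying the command).
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Function;

interface NonIdempotentCommand {
    String id();
}

/** Client side, option 1: the id is assigned once, on command creation. */
class CommandIdGenerator {
    private final AtomicLong counter = new AtomicLong(); // the "LongCounter" part
    private final UUID nodeId;

    CommandIdGenerator(UUID nodeId) {
        this.nodeId = nodeId;
    }

    String nextId() {
        return nodeId + "-" + counter.incrementAndGet();
    }
}

/** Server side: consulted by the MS state machine before applying a command. */
class IdempotencyGuard {
    private static class CachedResult {
        final Object result;
        final long commandStartTime;

        CachedResult(Object result, long commandStartTime) {
            this.result = result;
            this.commandStartTime = commandStartTime;
        }
    }

    // commandId -> (commandResult, commandStartTime)
    private final ConcurrentHashMap<String, CachedResult> cache = new ConcurrentHashMap<>();

    Object process(NonIdempotentCommand cmd, Function<NonIdempotentCommand, Object> handler) {
        CachedResult cached = cache.get(cmd.id());
        if (cached != null) {
            // Duplicate delivery (e.g. a retry): return the original response.
            return cached.result;
        }
        Object result = handler.apply(cmd);
        cache.put(cmd.id(), new CachedResult(result, System.currentTimeMillis()));
        return result;
    }
}
{code}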

  was:
As a result of the analysis and reproduction of IGNITE-21142, it was found that 
the metastorage raft command can be re-sent if it does not time out, which may 
not be good and lead to hidden negative consequences, such as in IGNITE-21142.

Here we need to find out the reasons for this decision (with re-try by timeout) 
and understand what to do next. I think we should use an infinite timeout.

As a result of the analysis and reproduction of IGNITE-21142, it was found that 
the metastorage raft command can be re-sent if it does not time out, which may 
not be good and lead to hidden negative consequences, such as in IGNITE-21142.

Here we need to find out the reasons for this decision (with re-try by timeout) 
and understand what to do next. I think we should use an infinite timeout.
h3. Upd#1

As discussed, it's required to detect whether InvokeCommand was already 
processed on a server and resend original response if true instead of 
reprocessing. First of all it's not only about invoke but about all 
non-idempotent commands like getAndPut, getAndPutAll, getAndRemove, etc. Worth 
mentioning though that it relates only to MS and maybe CMG but not Partitions: 
within partitions, tx protocol along with returning result from indexes instead 
of returning result from raft, protects us from non-idempotent command 
processing.

All in all following solution is expected to be implemented:
 * New interface NonIdempotentCommand is introduced with an id field.
 * All MS non-idempotent commands like InvokeCommand, GetAndRemoveCommand, etc 
implement aforementioned interface.
 * On the client side, an identifier is added to the command. Two options are 
possible here:
 ** It's possible to set id to the the command on command creation. Easiest 
way, but it will required extra effort on the server side to track command 
time. In that case it's possible to use LongCounter + nodeId as an id. 
 ** Or it's possible to adjust command with an id within retry loop, in that 
case we may use id as a "command time", of 

[jira] [Created] (IGNITE-22115) Data collocation: single Primary, single Raft

2024-04-25 Thread Alexander Lapin (Jira)
Alexander Lapin created IGNITE-22115:


 Summary: Data collocation: single Primary, single Raft
 Key: IGNITE-22115
 URL: https://issues.apache.org/jira/browse/IGNITE-22115
 Project: Ignite
  Issue Type: Epic
Reporter: Alexander Lapin






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (IGNITE-22113) Remove unused MetaStorageManagerImpl getAnd<> methods

2024-04-25 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-22113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin reassigned IGNITE-22113:


Assignee:  Kirill Sizov

> Remove unused MetaStorageManagerImpl getAnd<> methods
> -
>
> Key: IGNITE-22113
> URL: https://issues.apache.org/jira/browse/IGNITE-22113
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexander Lapin
>Assignee:  Kirill Sizov
>Priority: Major
>  Labels: ignite-3
>
> h3. Motivation
> There is a bunch of getAnd<> methods in MetaStorageManagerImpl which aren't 
> used. Moreover, they aren't even declared in the MetaStorageManager interface. 
> Worth mentioning that their support would require additional effort due to the 
> idempotency issue.
> All in all, it's better to remove them.
> h3. Definition of Done
>  * Unused getAnd<> methods are removed from MetaStorageManagerImpl and 
> MetastorageService, along with corresponding tests etc.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-22113) Remove unused MetaStorageManagerImpl getAnd<> methods

2024-04-25 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-22113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-22113:
-
Description: 
h3. Motivation

There is a bunch of getAnd<> methods in MetaStorageManagerImpl which aren't 
used. Moreover, they aren't even declared in the MetaStorageManager interface. 
Worth mentioning that their support would require additional effort due to the 
idempotency issue.
All in all, it's better to remove them.
h3. Definition of Done
 * Unused getAnd<> methods are removed from MetaStorageManagerImpl and 
MetastorageService, along with corresponding tests etc.

  was:
h3. Motivation

There is bunch of {{getAnd<>}} methods in {{MetaStorageManagerImpl}} which 
aren't used. Moreover they aren't even declared in the {{MetaStorageManager}} 
interface. Worth mentioning, that their support will require additional efforts 
due to idempotency issue.
All in all, It's better to remove them.
h3. Definition of Done
 * Unused {{getAnd<>}} are removed from {{MetaStorageManagerImpl, 
}}{{MetastorageService }}along with corresponding tests etc.{{{}{}}}


> Remove unused MetaStorageManagerImpl getAnd<> methods
> -
>
> Key: IGNITE-22113
> URL: https://issues.apache.org/jira/browse/IGNITE-22113
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexander Lapin
>Priority: Major
>
> h3. Motivation
> There is a bunch of getAnd<> methods in MetaStorageManagerImpl which aren't 
> used. Moreover, they aren't even declared in the MetaStorageManager interface. 
> Worth mentioning that their support would require additional effort due to the 
> idempotency issue.
> All in all, it's better to remove them.
> h3. Definition of Done
>  * Unused getAnd<> methods are removed from MetaStorageManagerImpl and 
> MetastorageService, along with corresponding tests etc.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-22113) Remove unused MetaStorageManagerImpl getAnd<> methods

2024-04-25 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-22113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-22113:
-
Labels: ignite-3  (was: )

> Remove unused MetaStorageManagerImpl getAnd<> methods
> -
>
> Key: IGNITE-22113
> URL: https://issues.apache.org/jira/browse/IGNITE-22113
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexander Lapin
>Priority: Major
>  Labels: ignite-3
>
> h3. Motivation
> There is a bunch of getAnd<> methods in MetaStorageManagerImpl which aren't 
> used. Moreover, they aren't even declared in the MetaStorageManager interface. 
> Worth mentioning that their support would require additional effort due to the 
> idempotency issue.
> All in all, it's better to remove them.
> h3. Definition of Done
>  * Unused getAnd<> methods are removed from MetaStorageManagerImpl and 
> MetastorageService, along with corresponding tests etc.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-22113) Remove unused MetaStorageManagerImpl getAnd<> methods

2024-04-25 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-22113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-22113:
-
Summary: Remove unused MetaStorageManagerImpl getAnd<> methods  (was: 
Remove  unused MetaStorageManagerImpl getAnd<> methods)

> Remove unused MetaStorageManagerImpl getAnd<> methods
> -
>
> Key: IGNITE-22113
> URL: https://issues.apache.org/jira/browse/IGNITE-22113
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexander Lapin
>Priority: Major
>
> h3. Motivation
> There is a bunch of {{getAnd<>}} methods in {{MetaStorageManagerImpl}} which 
> aren't used. Moreover, they aren't even declared in the {{MetaStorageManager}} 
> interface. Worth mentioning that their support would require additional 
> effort due to the idempotency issue.
> All in all, it's better to remove them.
> h3. Definition of Done
>  * Unused {{getAnd<>}} methods are removed from {{MetaStorageManagerImpl}} and 
> {{MetastorageService}}, along with corresponding tests etc.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (IGNITE-22113) Remove unused MetaStorageManagerImpl getAnd<> methods

2024-04-25 Thread Alexander Lapin (Jira)
Alexander Lapin created IGNITE-22113:


 Summary: Remove  unused MetaStorageManagerImpl getAnd<> methods
 Key: IGNITE-22113
 URL: https://issues.apache.org/jira/browse/IGNITE-22113
 Project: Ignite
  Issue Type: Improvement
Reporter: Alexander Lapin


h3. Motivation

There is a bunch of {{getAnd<>}} methods in {{MetaStorageManagerImpl}} which 
aren't used. Moreover, they aren't even declared in the {{MetaStorageManager}} 
interface. Worth mentioning that their support would require additional effort 
due to the idempotency issue.
All in all, it's better to remove them.
h3. Definition of Done
 * Unused {{getAnd<>}} methods are removed from {{MetaStorageManagerImpl}} and 
{{MetastorageService}}, along with corresponding tests etc.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21881) Deal with retry send metastorage raft commands after a timeout

2024-04-25 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-21881:
-
Description: 
As a result of the analysis and reproduction of IGNITE-21142, it was found that 
a metastorage raft command can be re-sent if it does not complete within the 
timeout, which may not be good and may lead to hidden negative consequences, 
such as in IGNITE-21142.

Here we need to find out the reasons for this decision (retry on timeout) and 
understand what to do next. I think we should use an infinite timeout.
h3. Upd#1

As discussed, it's required to detect whether an InvokeCommand was already 
processed on the server and, if so, resend the original response instead of 
reprocessing. First of all, it's not only about invoke but about all 
non-idempotent commands like getAndPut, getAndPutAll, getAndRemove, etc. Worth 
mentioning though that this relates only to MS and maybe CMG but not to 
Partitions: within partitions, the tx protocol, along with returning results 
from indexes instead of from raft, protects us from non-idempotent command 
processing.

All in all, the following solution is expected to be implemented:
 * A new interface NonIdempotentCommand is introduced with an id field.
 * All MS non-idempotent commands like InvokeCommand, GetAndRemoveCommand, etc. 
implement the aforementioned interface.
 * On the client side, an identifier is added to the command. Two options are 
possible here:
 ** It's possible to set the id on the command at creation time. This is the 
easiest way, but it requires extra effort on the server side to track command 
time. In that case it's possible to use LongCounter + nodeId as an id.
 ** Or it's possible to adjust the command with an id within the retry loop; in 
that case we may use the id as a "command time", which of course also means 
that a clock or System.currentTime<> should be used as the id. I strongly 
believe the first option is better for now.
 * On the server side, precisely within the MS state machine, a new 
nonIdempotentCommandCache is introduced: commandId -> (commandResult, 
commandStartTime).
 * For each NonIdempotentCommand, the following logic should be implemented:
 ** As an initial step, check whether there's a command with the given id in the 
cache; if so, just return the cached result without reprocessing the command.
 ** If the given command is not in the cache, process it and populate the cache 
with the result.

Basically that's all. Both cache persistence and recovery on group restart, as 
well as cache cleanup, will be covered within separate tickets.

  was:
As a result of the analysis and reproduction of IGNITE-21142, it was found that 
the metastorage raft command can be re-sent if it does not time out, which may 
not be good and lead to hidden negative consequences, such as in IGNITE-21142.

Here we need to find out the reasons for this decision (with re-try by timeout) 
and understand what to do next. I think we should use an infinite timeout.

As a result of the analysis and reproduction of IGNITE-21142, it was found that 
the metastorage raft command can be re-sent if it does not time out, which may 
not be good and lead to hidden negative consequences, such as in IGNITE-21142.

Here we need to find out the reasons for this decision (with re-try by timeout) 
and understand what to do next. I think we should use an infinite timeout.
h3. Upd#1

As discussed, it's required to detect whether InvokeCommand was already 
processed on a server and resend original response if true instead of 
reprocessing. First of all it's not only about invoke but about all 
non-idempotent commands like getAndPut, getAndPutAll, getAndRemove, etc. Worth 
mentioning though that it relates only to MS and maybe CMG but not Partitions: 
within partitions, tx protocol along with returning result from indexes instead 
of returning result from raft, protects us from non-idempotent command 
processing.

All in all following solution is expected to be implemented:
 * New interface NonIdempotentCommand is introduced with an id field.
 * All MS non-idempotent commands like InvokeCommand, GetAndRemoveCommand, etc 
implement aforementioned interface NonIdempotentCommand.
 * On the client side, an identifier is added to the command. Two options are 
possible here:
 ** It's possible to set id to the the command on command creation. Easiest 
way, but it will required extra effort on the server side to track command 
time. In that case it's possible to use LongCounter + nodeId as an id. 
 ** Or it's possible to adjust command 

[jira] [Updated] (IGNITE-21881) Deal with retry send metastorage raft commands after a timeout

2024-04-25 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-21881:
-
Description: 
As a result of the analysis and reproduction of IGNITE-21142, it was found that 
a metastorage raft command can be re-sent if it does not complete within the 
timeout, which may not be good and may lead to hidden negative consequences, 
such as in IGNITE-21142.

Here we need to find out the reasons for this decision (retry on timeout) and 
understand what to do next. I think we should use an infinite timeout.
h3. Upd#1

As discussed, it's required to detect whether an InvokeCommand was already 
processed on the server and, if so, resend the original response instead of 
reprocessing. First of all, it's not only about invoke but about all 
non-idempotent commands like getAndPut, getAndPutAll, getAndRemove, etc. Worth 
mentioning though that this relates only to MS and maybe CMG but not to 
Partitions: within partitions, the tx protocol, along with returning results 
from indexes instead of from raft, protects us from non-idempotent command 
processing.

All in all, the following solution is expected to be implemented:
 * A new interface NonIdempotentCommand is introduced with an id field.
 * All MS non-idempotent commands like InvokeCommand, GetAndRemoveCommand, etc. 
implement the aforementioned NonIdempotentCommand interface.
 * On the client side, an identifier is added to the command. Two options are 
possible here:
 ** It's possible to set the id on the command at creation time. This is the 
easiest way, but it requires extra effort on the server side to track command 
time. In that case it's possible to use LongCounter + nodeId as an id.
 ** Or it's possible to adjust the command with an id within the retry loop; in 
that case we may use the id as a "command time", which of course also means 
that a clock or System.currentTime<> should be used as the identifier. I 
strongly believe the first option is better for now.
 * On the server side, precisely within the MS state machine, a new 
nonIdempotentCommandCache is introduced: commandId -> (commandResult, 
commandStartTime).
 * For each NonIdempotentCommand, the following logic should be implemented:
 ** As an initial step, check whether there's a command with the given id in the 
cache; if so, just return the cached result without reprocessing the command.
 ** If the given command is not in the cache, process it and populate the cache 
with the result.

Basically that's all. Both cache persistence and recovery on group restart, as 
well as cache cleanup, will be covered within separate tickets.

  was:
As a result of the analysis and reproduction of IGNITE-21142, it was found that 
the metastorage raft command can be re-sent if it does not time out, which may 
not be good and lead to hidden negative consequences, such as in IGNITE-21142.

Here we need to find out the reasons for this decision (with re-try by timeout) 
and understand what to do next. I think we should use an infinite timeout.

As a result of the analysis and reproduction of IGNITE-21142, it was found that 
the metastorage raft command can be re-sent if it does not time out, which may 
not be good and lead to hidden negative consequences, such as in IGNITE-21142.

Here we need to find out the reasons for this decision (with re-try by timeout) 
and understand what to do next. I think we should use an infinite timeout.
h3. Upd#1

As discussed, it's required to detect whether InvokeCommand was already 
processed on a server and resend original response if true instead of 
reprocessing. First of all it's not only about invoke but about all 
non-idempotent commands like getAndPut, getAndPutAll, getAndRemove, etc. Worth 
mentioning though that it relates only to MS and maybe CMG but not Partitions: 
within partitions, tx protocol along with returning result from indexes instead 
of returning result from raft, protects us from non-idempotent command 
processing.

All in all following solution is expected to be implemented.
 * New interface NonIdempotentCommand is introduced with an id field.
 * All MS non-idempotent commands like InvokeCommand, GetAndRemoveCommand, etc 
implement aforementioned interface NonIdempotentCommand.
 * On the client side, an identifier is added to the command. Two options are 
possible here:
 ** It's possible to set id to the the command on command creation. Easiest 
way, but it will required extra effort on the server side to track command 
time. In that case it's possible to use LongCounter + nodeId as an id. 
 ** Or it's 

[jira] [Updated] (IGNITE-21881) Deal with retry send metastorage raft commands after a timeout

2024-04-25 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-21881:
-
Description: 
As a result of the analysis and reproduction of IGNITE-21142, it was found that 
a metastorage raft command can be re-sent if it does not complete within the 
timeout, which may not be good and may lead to hidden negative consequences, 
such as in IGNITE-21142.

Here we need to find out the reasons for this decision (retry on timeout) and 
understand what to do next. I think we should use an infinite timeout.
h3. Upd#1

As discussed, it's required to detect whether an InvokeCommand was already 
processed on the server and, if so, resend the original response instead of 
reprocessing. First of all, it's not only about invoke but about all 
non-idempotent commands like getAndPut, getAndPutAll, getAndRemove, etc. Worth 
mentioning though that this relates only to MS and maybe CMG but not to 
Partitions: within partitions, the tx protocol, along with returning results 
from indexes instead of from raft, protects us from non-idempotent command 
processing.

All in all, the following solution is expected to be implemented:
 * A new interface NonIdempotentCommand is introduced with an id field.
 * All MS non-idempotent commands like InvokeCommand, GetAndRemoveCommand, etc. 
implement the aforementioned NonIdempotentCommand interface.
 * On the client side, an identifier is added to the command. Two options are 
possible here:
 ** It's possible to set the id on the command at creation time. This is the 
easiest way, but it requires extra effort on the server side to track command 
time. In that case it's possible to use LongCounter + nodeId as an id.
 ** Or it's possible to adjust the command with an id within the retry loop; in 
that case we may use the id as a "command time", which of course also means 
that a clock or System.currentTime<> should be used as the identifier. I 
strongly believe the first option is better for now.
 * On the server side, precisely within the MS state machine, a new 
nonIdempotentCommandCache is introduced: commandId -> (commandResult, 
commandStartTime).
 * For each NonIdempotentCommand, the following logic should be implemented:
 ** As an initial step, check whether there's a command with the given id in the 
cache; if so, just return the cached result without reprocessing the command.
 ** If the given command is not in the cache, process it and populate the cache 
with the result.

Basically that's all. Both cache persistence and recovery on group restart, as 
well as cache cleanup, will be covered within separate tickets.

  was:
As a result of the analysis and reproduction of IGNITE-21142, it was found that 
the metastorage raft command can be re-sent if it does not time out, which may 
not be good and lead to hidden negative consequences, such as in IGNITE-21142.

Here we need to find out the reasons for this decision (with re-try by timeout) 
and understand what to do next. I think we should use an infinite timeout.


> Deal with retry send metastorage raft commands after a timeout
> --
>
> Key: IGNITE-21881
> URL: https://issues.apache.org/jira/browse/IGNITE-21881
> Project: Ignite
>  Issue Type: Bug
>Reporter: Kirill Tkalenko
>Priority: Major
>  Labels: ignite-3
> Fix For: 3.0.0-beta2
>
>
> As a result of the analysis and reproduction of IGNITE-21142, it was found 
> that a metastorage raft command can be re-sent if it does not complete within 
> the timeout, which may not be good and may lead to hidden negative 
> consequences, such as in IGNITE-21142.
> Here we need to find out the reasons for this decision (retry on timeout) and 
> understand what to do next. I think we should use an infinite timeout.
> h3. Upd#1
> As discussed, it's required to detect whether an InvokeCommand was already 
> processed on the server and, if so, resend the original response instead of 
> reprocessing. First of all, it's not only about invoke but about all 
> non-idempotent commands like getAndPut, 

[jira] [Updated] (IGNITE-22082) Investigate HeapLockManager locks performance

2024-04-19 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-22082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-22082:
-
Labels: ignite-3  (was: )

> Investigate HeapLockManager locks performance
> -
>
> Key: IGNITE-22082
> URL: https://issues.apache.org/jira/browse/IGNITE-22082
> Project: Ignite
>  Issue Type: Task
>Reporter: Alexander Lapin
>Priority: Major
>  Labels: ignite-3
>
> https://issues.apache.org/jira/browse/IGNITE-22054 adjusted the lock 
> behaviour, eliminating the race. Unfortunately, we've faced performance 
> degradation after that.
> LockManagerBenchmark results:
> Was 
> {code:java}
> Benchmark(concTxns)  Mode  Cnt   Score   Error  Units
> LockManagerBenchmark.lockCommit 200  avgt   66.696  us/op 
> {code}
> Now
> {code:java}
> Benchmark(concTxns)  Mode  Cnt   Score   Error  Units
> LockManagerBenchmark.lockCommit 200  avgt   74.360  us/op 
> {code}
> That might be reasonable, because the race was eliminated; however, it's 
> required to add a multithreaded benchmark and precisely explain why we have 
> the degradation. That may lead to an adjustment of the locking logic.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-22082) Investigate HeapLockManager locks performance

2024-04-19 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-22082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-22082:
-
Description: 
https://issues.apache.org/jira/browse/IGNITE-22054 adjusted the lock behaviour, 
eliminating the race. Unfortunately, we've faced performance degradation after 
that.

LockManagerBenchmark results:

Was 
{code:java}
Benchmark(concTxns)  Mode  Cnt   Score   Error  Units
LockManagerBenchmark.lockCommit 200  avgt   66.696  us/op 
{code}
Now
{code:java}
Benchmark(concTxns)  Mode  Cnt   Score   Error  Units
LockManagerBenchmark.lockCommit 200  avgt   74.360  us/op 
{code}
That might be reasonable, because the race was eliminated; however, it's 
required to add a multithreaded benchmark (a sketch follows below) and 
precisely explain why we have the degradation. That may lead to an adjustment 
of the locking logic.
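
A hedged sketch of what such a multithreaded JMH benchmark could look like; the 
per-key ReentrantLock map below is a simplified stand-in, not the actual 
HeapLockManager API:
{code:java}
// Illustrative multithreaded JMH benchmark. Only the contention pattern
// (many threads, small key space) is the point; the lock table is a stand-in.
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.locks.ReentrantLock;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Threads;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Benchmark)
public class MultithreadedLockBenchmark {
    private final ConcurrentHashMap<Integer, ReentrantLock> locks = new ConcurrentHashMap<>();

    @Benchmark
    @Threads(8) // run the benchmark method from 8 threads concurrently
    public void lockUnlockContended() {
        int key = ThreadLocalRandom.current().nextInt(16); // small key space -> contention
        ReentrantLock lock = locks.computeIfAbsent(key, k -> new ReentrantLock());
        lock.lock();
        try {
            Blackhole.consumeCPU(10); // simulate a short critical section
        } finally {
            lock.unlock();
        }
    }
}
{code}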

> Investigate HeapLockManager locks performance
> -
>
> Key: IGNITE-22082
> URL: https://issues.apache.org/jira/browse/IGNITE-22082
> Project: Ignite
>  Issue Type: Task
>Reporter: Alexander Lapin
>Priority: Major
>
> https://issues.apache.org/jira/browse/IGNITE-22054 adjusted the lock 
> behaviour, eliminating the race. Unfortunately, we've faced performance 
> degradation after that.
> LockManagerBenchmark results:
> Was 
> {code:java}
> Benchmark(concTxns)  Mode  Cnt   Score   Error  Units
> LockManagerBenchmark.lockCommit 200  avgt   66.696  us/op 
> {code}
> Now
> {code:java}
> Benchmark(concTxns)  Mode  Cnt   Score   Error  Units
> LockManagerBenchmark.lockCommit 200  avgt   74.360  us/op 
> {code}
> That might be reasonable, because the race was eliminated; however, it's 
> required to add a multithreaded benchmark and precisely explain why we have 
> the degradation. That may lead to an adjustment of the locking logic.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (IGNITE-22082) Investigate HeapLockManager locks performance

2024-04-19 Thread Alexander Lapin (Jira)
Alexander Lapin created IGNITE-22082:


 Summary: Investigate HeapLockManager locks performance
 Key: IGNITE-22082
 URL: https://issues.apache.org/jira/browse/IGNITE-22082
 Project: Ignite
  Issue Type: Task
Reporter: Alexander Lapin






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (IGNITE-15568) Striped Disruptor doesn't work with JRaft event handlers properly

2024-04-19 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-15568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin reassigned IGNITE-15568:


Assignee: Vladislav Pyatkov  (was: Ivan Bessonov)

> Striped Disruptor doesn't work with JRaft event handlers properly
> -
>
> Key: IGNITE-15568
> URL: https://issues.apache.org/jira/browse/IGNITE-15568
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexey Scherbakov
>Assignee: Vladislav Pyatkov
>Priority: Major
>  Labels: ignite-3, performance
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> The following scenario is broken:
>  # Two raft groups are started and mapped to the same stripe.
>  # Two LogEntryAndClosure events are added in quick succession, so they form a 
> disruptor batch: the first for group 1, the second for group 2.
> The first event is delivered to group 1 with endOfBatch=false, so it's cached in 
> org.apache.ignite.raft.jraft.core.NodeImpl.LogEntryAndClosureHandler#tasks 
> and is not processed.
> The second event is delivered to group 2 with endOfBatch=true and processed, but 
> the first event will remain in the queue unprocessed forever, because 
> LogEntryAndClosureHandler instances are different per raft group.
> The possible WA for this is to set 
> org.apache.ignite.raft.jraft.option.RaftOptions#applyBatch=1
> Reproducible by 
> org.apache.ignite.internal.table.TxDistributedTest_1_1_1#testCrossTable + 
> applyBatch=32 in ignite-15085 branch
> *Implementation notes*
> My proposal is built around the striped disruptor. The striped disruptor 
> implementation has an interceptor that routes an event to a specific handler. 
> Only the last event in the batch carries the batch-completion flag. For the 
> other RAFT groups that were notified within the striped disruptor, it is 
> required to create an event that closes the batch for that specific group. The 
> new event will be created in the common striped disruptor interceptor and sent 
> to the specific handler with the batch-completion flag set (a generic sketch 
> follows the snippets below).
> The rule for handling the new event differs between handlers:
> {code:java|title=ApplyTaskHandler (FSMCallerImpl#runApplyTask)}
> if (maxCommittedIndex >= 0) {
>   doCommitted(maxCommittedIndex);
>   return -1;
> }
> {code}
> {code:java|title=LogEntryAndClosureHandler(LogEntryAndClosureHandler#onEvent)}
> if (this.tasks.size() > 0) {
>   executeApplyingTasks(this.tasks);
>   this.tasks.clear();
> }
> {code}
> {code:java|title=ReadIndexEventHandler(ReadIndexEventHandler#onEvent)}
> if (this.events.size() > 0) {
>   executeReadIndexEvents(this.events);
>   this.events.clear();
> }
> {code}
> {code:java|title=StableClosureEventHandler(StableClosureEventHandler#onEvent)}
> if (this.ab.size > 0) {
>   this.lastId = this.ab.flush();
>   setDiskId(this.lastId);
> }
> {code}
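> A self-contained sketch of the dispatch idea described above; the names below 
> are hypothetical and this is not jraft code:
> {code:java}
> // Hypothetical stripe dispatcher: when the last event of a batch is delivered
> // to one group, every other group that received events in the same batch gets
> // a synthetic batch-completion event so its buffered work is flushed.
> import java.util.HashMap;
> import java.util.HashSet;
> import java.util.Map;
> import java.util.Set;
> 
> interface GroupHandler {
>     void onEvent(Object event, boolean endOfBatch);
> }
> 
> class StripeDispatcher {
>     static final Object BATCH_COMPLETION = new Object();
> 
>     private final Map<String, GroupHandler> handlers = new HashMap<>();
>     private final Set<String> groupsInBatch = new HashSet<>();
> 
>     void register(String groupId, GroupHandler handler) {
>         handlers.put(groupId, handler);
>     }
> 
>     void dispatch(String groupId, Object event, boolean endOfBatch) {
>         groupsInBatch.add(groupId);
>         handlers.get(groupId).onEvent(event, endOfBatch);
> 
>         if (endOfBatch) {
>             // Flush every other group that buffered events of this batch.
>             for (String other : groupsInBatch) {
>                 if (!other.equals(groupId)) {
>                     handlers.get(other).onEvent(BATCH_COMPLETION, true);
>                 }
>             }
>             groupsInBatch.clear();
>         }
>     }
> }
> {code}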
> Also, in the bounds of this issue, it is required to rerun the benchmarks. 
> Those are expected to show an increase in the case of high parallelism in one 
> partition.
> There is [an example of the 
> benchmark|https://github.com/gridgain/apache-ignite-3/tree/4b9de922caa4aef97a5e8e159d5db76a3fc7a3ad/modules/runner/src/test/java/org/apache/ignite/internal/benchmark].
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-22001) Throw specific exception if during writeTableAssignmentsToMetastore process was interrupted

2024-04-18 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-22001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-22001:
-
Fix Version/s: 3.0.0-beta2

> Throw specific exception if during writeTableAssignmentsToMetastore process 
> was interrupted
> ---
>
> Key: IGNITE-22001
> URL: https://issues.apache.org/jira/browse/IGNITE-22001
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Mikhail Efremov
>Assignee: Mikhail Efremov
>Priority: Minor
>  Labels: ignite-3
> Fix For: 3.0.0-beta2
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> h2. The problem
> In {{TableManager#writeTableAssignmentsToMetastore:752}} the resulting 
> {{CompletableFuture}} may complete with a {{null}} value. At the call sites of 
> {{writeTableAssignmentsToMetastore}} this leads to a sudden 
> {{NullPointerException}} without a clear understanding of the reasons for the 
> situation.
> h2. The solution
> Instead of returning a {{null}} value, throw a more specific exception that 
> carries the assignments list to be written to the metastorage and the table 
> identifier; this should help to investigate cases where the method is 
> interrupted (a sketch follows below).
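> A minimal sketch of such an exception; the type name and message shape are 
> hypothetical, only the carried context (table id, assignments) follows the 
> description above:
> {code:java}
> import java.util.List;
> 
> // Hypothetical exception type: surfaces the table id and the assignments
> // that were being written instead of letting a null future value escape.
> class AssignmentsWriteInterruptedException extends RuntimeException {
>     AssignmentsWriteInterruptedException(int tableId, List<?> assignments, Throwable cause) {
>         super("Write of table assignments to metastore was interrupted"
>                 + " [tableId=" + tableId + ", assignments=" + assignments + "]", cause);
>     }
> }
> {code}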



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-22057) destruction_does_not_update_data is flaky

2024-04-17 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-22057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-22057:
-
Description: 
Seems that `destruction_does_not_update_data` may hang forever on tx rollback, 
which causes:
{code:java}
The build [Test]::> Run :: C++ Linux Tests #24053 {buildId=8040877} has been 
running for more than 15 minutes. Terminating...{code}
{code:java}
[08:15:21] :     [Step 8/12] [ RUN      ] 
transactions_test.rollback_does_not_update_data
[08:15:21] :     [Step 8/12] [          ] [ INFO ]    Established connection 
with remote host 127.0.0.1:10942
[08:15:21] :     [Step 8/12] [          ] [ DEBUG ]   Connection ID: 1
[08:15:21] :     [Step 8/12] [          ] [ DEBUG ]   Handshake sent 
successfully
[08:15:21] :     [Step 8/12] [          ] [ DEBUG ]   Message sent successfully 
on Connection ID 1
[08:15:21] :     [Step 8/12] [          ] [ DEBUG ]   Message sent successfully 
on Connection ID 1
[08:15:21] :     [Step 8/12] [          ] [ DEBUG ]   Message on Connection ID 
1, size: 149
[08:15:21] :     [Step 8/12] [          ] [ DEBUG ]   Got handshake response
[08:15:21] :     [Step 8/12] [          ] [ DEBUG ]   Server-side protocol 
version: 3.0.0
[08:15:21] :     [Step 8/12] [          ] [ DEBUG ]   Performing request: 
op=50, req_id=0
[08:15:21] :     [Step 8/12] [          ] [ DEBUG ]   Message sent successfully 
on Connection ID 1
[08:15:21] :     [Step 8/12] [          ] [ DEBUG ]   Message sent successfully 
on Connection ID 1
[08:15:21] :     [Step 8/12] [          ] [ DEBUG ]   Message on Connection ID 
1, size: 26
[08:15:21] :     [Step 8/12] [          ] [ DEBUG ]   Closed Connection ID 1, 
error=Client stopped
[08:15:21] :     [Step 8/12] [          ] [ INFO ]    Established connection 
with remote host 127.0.0.1:10942
[08:15:21] :     [Step 8/12] [          ] [ DEBUG ]   Connection ID: 1
[08:15:21] :     [Step 8/12] [          ] [ DEBUG ]   Handshake sent 
successfully
[08:15:21] :     [Step 8/12] [          ] [ DEBUG ]   Message sent successfully 
on Connection ID 1
[08:15:21] :     [Step 8/12] [          ] [ DEBUG ]   Message sent successfully 
on Connection ID 1
[08:15:21] :     [Step 8/12] [          ] [ DEBUG ]   Message on Connection ID 
1, size: 149
[08:15:21] :     [Step 8/12] [          ] [ DEBUG ]   Got handshake response
[08:15:21] :     [Step 8/12] [          ] [ DEBUG ]   Server-side protocol 
version: 3.0.0
[08:15:21] :     [Step 8/12] [          ] [ DEBUG ]   Performing request: op=4, 
req_id=0
[08:15:21] :     [Step 8/12] [          ] [ DEBUG ]   Message sent successfully 
on Connection ID 1
[08:15:21] :     [Step 8/12] [          ] [ DEBUG ]   Message sent successfully 
on Connection ID 1
[08:15:21] :     [Step 8/12] [          ] [ DEBUG ]   Message on Connection ID 
1, size: 21
[08:15:21] :     [Step 8/12] [          ] [ DEBUG ]   Performing request: 
op=43, req_id=1
[08:15:21] :     [Step 8/12] [          ] [ DEBUG ]   Message sent successfully 
on Connection ID 1
[08:15:21] :     [Step 8/12] [          ] [ DEBUG ]   Message sent successfully 
on Connection ID 1
[08:15:21] :     [Step 8/12] [          ] [ DEBUG ]   Message on Connection ID 
1, size: 12
[08:15:21] :     [Step 8/12] [          ] [ DEBUG ]   Performing request: op=5, 
req_id=2
[08:15:21] :     [Step 8/12] [          ] [ DEBUG ]   Message sent successfully 
on Connection ID 1
[08:15:21] :     [Step 8/12] [          ] [ DEBUG ]   Message sent successfully 
on Connection ID 1
[08:15:21] :     [Step 8/12] [          ] [ DEBUG ]   Message on Connection ID 
1, size: 36
[08:15:21] :     [Step 8/12] [          ] [ DEBUG ]   Performing request: 
op=10, req_id=3
[08:15:21] :     [Step 8/12] [          ] [ DEBUG ]   Message sent successfully 
on Connection ID 1
[08:15:21] :     [Step 8/12] [          ] [ DEBUG ]   Message sent successfully 
on Connection ID 1
[08:15:21] :     [Step 8/12] [          ] [ DEBUG ]   Message on Connection ID 
1, size: 12
[08:15:21] :     [Step 8/12] [          ] [ DEBUG ]   Performing request: 
op=45, req_id=4
[08:15:21] :     [Step 8/12] [          ] [ DEBUG ]   Message sent successfully 
on Connection ID 1
[08:15:21] :     [Step 8/12] [          ] [ DEBUG ]   Message sent successfully 
on Connection ID 1
[08:15:21] :     [Step 8/12] [          ] [ DEBUG ]   Message on Connection ID 
1, size: 11
[08:15:21] :     [Step 8/12] [          ] [ DEBUG ]   Performing request: 
op=12, req_id=5
[08:15:21] :     [Step 8/12] [          ] [ DEBUG ]   Message sent successfully 
on Connection ID 1
[08:15:21] :     [Step 8/12] [          ] [ DEBUG ]   Message sent successfully 
on Connection ID 1
[08:15:21] :     [Step 8/12] [          ] [ DEBUG ]   Message on Connection ID 
1, size: 13
[08:15:21] :     [Step 8/12] [          ] [ INFO ]    Established connection 
with remote host 127.0.0.1:10942
[08:15:21] :     [Step 8/12] [          ] [ DEBUG ]   Connection ID: 1
[08:15:21] :     

[jira] [Updated] (IGNITE-22057) destruction_does_not_update_data is flaky

2024-04-17 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-22057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-22057:
-
Description: Seems that 

> destruction_does_not_update_data is flaky
> -
>
> Key: IGNITE-22057
> URL: https://issues.apache.org/jira/browse/IGNITE-22057
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexander Lapin
>Priority: Major
>  Labels: ignite-3
>
> Seems that 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-22057) destruction_does_not_update_data is flaky

2024-04-17 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-22057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-22057:
-
Labels: ignite-3  (was: )

> destruction_does_not_update_data is flaky
> -
>
> Key: IGNITE-22057
> URL: https://issues.apache.org/jira/browse/IGNITE-22057
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexander Lapin
>Priority: Major
>  Labels: ignite-3
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (IGNITE-22057) destruction_does_not_update_data is flaky

2024-04-17 Thread Alexander Lapin (Jira)
Alexander Lapin created IGNITE-22057:


 Summary: destruction_does_not_update_data is flaky
 Key: IGNITE-22057
 URL: https://issues.apache.org/jira/browse/IGNITE-22057
 Project: Ignite
  Issue Type: Bug
Reporter: Alexander Lapin






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-22054) ItMultipleLocksTest#test is flaky

2024-04-16 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-22054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-22054:
-
Epic Link: IGNITE-21389

> ItMultipleLocksTest#test is flaky
> -
>
> Key: IGNITE-22054
> URL: https://issues.apache.org/jira/browse/IGNITE-22054
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexander Lapin
>Assignee: Alexander Lapin
>Priority: Major
>  Labels: ignite-3
>
>  
> {code:java}
> org.apache.ignite.sql.SqlException: IGN-REP-3 
> TraceId:3f2326ea-e777-43c7-a281-8c7d84bc03e8 Replication is timed out 
> [replicaGrpId=7_part_24]  at 
> java.base@11.0.17/java.lang.invoke.MethodHandle.invokeWithArguments(MethodHandle.java:710)
>  {code}
> [TC 
> link|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_RunAllTests/8025534?hideProblemsFromDependencies=false=false+Inspection=true=true=true=false=true]
> h3. Upd#1
> We hang forever while trying to acquire put locks on indexes in 
> `org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener#takePutLockOnIndexes`.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-22054) ItMultipleLocksTest#test is flaky

2024-04-16 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-22054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-22054:
-
Description: 
 
{code:java}
org.apache.ignite.sql.SqlException: IGN-REP-3 
TraceId:3f2326ea-e777-43c7-a281-8c7d84bc03e8 Replication is timed out 
[replicaGrpId=7_part_24]  at 
java.base@11.0.17/java.lang.invoke.MethodHandle.invokeWithArguments(MethodHandle.java:710)
 {code}
[TC 
link|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_RunAllTests/8025534?hideProblemsFromDependencies=false=false+Inspection=true=true=true=false=true]
h3. Upd#1

We hang forever while trying to acquire put locks on indexes in 
`org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener#takePutLockOnIndexes`.

  was:
 
{code:java}
org.apache.ignite.sql.SqlException: IGN-REP-3 
TraceId:3f2326ea-e777-43c7-a281-8c7d84bc03e8 Replication is timed out 
[replicaGrpId=7_part_24]  at 
java.base@11.0.17/java.lang.invoke.MethodHandle.invokeWithArguments(MethodHandle.java:710)
 {code}
[TC 
link|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_RunAllTests/8025534?hideProblemsFromDependencies=false=false+Inspection=true=true=true=false=true]

 


> ItMultipleLocksTest#test is flaky
> -
>
> Key: IGNITE-22054
> URL: https://issues.apache.org/jira/browse/IGNITE-22054
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexander Lapin
>Priority: Major
>  Labels: ignite-3
>
>  
> {code:java}
> org.apache.ignite.sql.SqlException: IGN-REP-3 
> TraceId:3f2326ea-e777-43c7-a281-8c7d84bc03e8 Replication is timed out 
> [replicaGrpId=7_part_24]  at 
> java.base@11.0.17/java.lang.invoke.MethodHandle.invokeWithArguments(MethodHandle.java:710)
>  {code}
> [TC 
> link|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_RunAllTests/8025534?hideProblemsFromDependencies=false=false+Inspection=true=true=true=false=true]
> h3. Upd#1
> We hang forever while trying to acquire put locks on indexes in 
> `org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener#takePutLockOnIndexes`.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (IGNITE-22054) ItMultipleLocksTest#test is flaky

2024-04-16 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-22054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin reassigned IGNITE-22054:


Assignee: Alexander Lapin

> ItMultipleLocksTest#test is flaky
> -
>
> Key: IGNITE-22054
> URL: https://issues.apache.org/jira/browse/IGNITE-22054
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexander Lapin
>Assignee: Alexander Lapin
>Priority: Major
>  Labels: ignite-3
>
>  
> {code:java}
> org.apache.ignite.sql.SqlException: IGN-REP-3 
> TraceId:3f2326ea-e777-43c7-a281-8c7d84bc03e8 Replication is timed out 
> [replicaGrpId=7_part_24]  at 
> java.base@11.0.17/java.lang.invoke.MethodHandle.invokeWithArguments(MethodHandle.java:710)
>  {code}
> [TC 
> link|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_RunAllTests/8025534?hideProblemsFromDependencies=false=false+Inspection=true=true=true=false=true]
> h3. Upd#1
> We hang forever while trying to acquire put locks on indexes in 
> `org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener#takePutLockOnIndexes`.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-22054) ItMultipleLocksTest#test is flaky

2024-04-16 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-22054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-22054:
-
Description: 
 
{code:java}
org.apache.ignite.sql.SqlException: IGN-REP-3 
TraceId:3f2326ea-e777-43c7-a281-8c7d84bc03e8 Replication is timed out 
[replicaGrpId=7_part_24]  at 
java.base@11.0.17/java.lang.invoke.MethodHandle.invokeWithArguments(MethodHandle.java:710)
 {code}
[TC 
link|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_RunAllTests/8025534?hideProblemsFromDependencies=false=false+Inspection=true=true=true=false=true]

 

> ItMultipleLocksTest#test is flaky
> -
>
> Key: IGNITE-22054
> URL: https://issues.apache.org/jira/browse/IGNITE-22054
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexander Lapin
>Priority: Major
>  Labels: ignite-3
>
>  
> {code:java}
> org.apache.ignite.sql.SqlException: IGN-REP-3 
> TraceId:3f2326ea-e777-43c7-a281-8c7d84bc03e8 Replication is timed out 
> [replicaGrpId=7_part_24]  at 
> java.base@11.0.17/java.lang.invoke.MethodHandle.invokeWithArguments(MethodHandle.java:710)
>  {code}
> [TC 
> link|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_RunAllTests/8025534?hideProblemsFromDependencies=false=false+Inspection=true=true=true=false=true]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-22054) ItMultipleLocksTest#test is flaky

2024-04-16 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-22054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-22054:
-
Labels: ignite-3  (was: )

> ItMultipleLocksTest#test is flaky
> -
>
> Key: IGNITE-22054
> URL: https://issues.apache.org/jira/browse/IGNITE-22054
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexander Lapin
>Priority: Major
>  Labels: ignite-3
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (IGNITE-22054) ItMultipleLocksTest#test is flaky

2024-04-16 Thread Alexander Lapin (Jira)
Alexander Lapin created IGNITE-22054:


 Summary: ItMultipleLocksTest#test is flaky
 Key: IGNITE-22054
 URL: https://issues.apache.org/jira/browse/IGNITE-22054
 Project: Ignite
  Issue Type: Bug
Reporter: Alexander Lapin






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (IGNITE-21913) Change API usage of Placement driver in Transaction module from TablePartitionId to ZonePartitionId

2024-04-16 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin reassigned IGNITE-21913:


Assignee: Mirza Aliev

> Change API usage of Placement driver in Transaction module from 
> TablePartitionId to ZonePartitionId
> ---
>
> Key: IGNITE-21913
> URL: https://issues.apache.org/jira/browse/IGNITE-21913
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Mirza Aliev
>Assignee: Mirza Aliev
>Priority: Major
>  Labels: ignite-3
>
> In https://issues.apache.org/jira/browse/IGNITE-21858 we agreed to decompose 
> the original task into several subtasks.
> In this ticket we need to use the previously created decorator for the 
> Placement Driver from https://issues.apache.org/jira/browse/IGNITE-21911 in 
> all places in the Transaction module where the PD was used before. See the 
> spreadsheet from https://issues.apache.org/jira/browse/IGNITE-21858 for 
> details about the places to change.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (IGNITE-22038) Error while sending partition idle safe time request breaks sending for all replicas

2024-04-12 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-22038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin resolved IGNITE-22038.
--
Resolution: Fixed

> Error while sending partition idle safe time request breaks sending for all 
> replicas
> 
>
> Key: IGNITE-22038
> URL: https://issues.apache.org/jira/browse/IGNITE-22038
> Project: Ignite
>  Issue Type: Bug
>Reporter: Roman Puchkovskiy
>Assignee: Roman Puchkovskiy
>Priority: Major
>  Labels: ignite-3
> Fix For: 3.0.0-beta2
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> This happens because, if a runnable passed to 
> ScheduledExecutorService#scheduleAtFixedRate() throws an exception, tasks stop 
> being executed (see the sketch below).
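> A minimal sketch of the JDK behaviour referred to above: an exception escaping 
> the task body suppresses all subsequent executions, so the periodic body 
> should catch everything (the send method below is a hypothetical placeholder):
> {code:java}
> import java.util.concurrent.Executors;
> import java.util.concurrent.ScheduledExecutorService;
> import java.util.concurrent.TimeUnit;
> 
> public class FixedRateGotcha {
>     public static void main(String[] args) {
>         ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
> 
>         executor.scheduleAtFixedRate(() -> {
>             try {
>                 sendSafeTimeRequests(); // may throw
>             } catch (Exception e) {
>                 // Log and swallow: an escaping exception would silently
>                 // cancel all subsequent executions of this periodic task.
>             }
>         }, 0, 100, TimeUnit.MILLISECONDS);
>     }
> 
>     private static void sendSafeTimeRequests() {
>         // hypothetical placeholder for the per-replica send loop
>     }
> }
> {code}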



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (IGNITE-22005) 'Failed to process replica request' error under load with balance transfer scenario

2024-04-12 Thread Alexander Lapin (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-22005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17836619#comment-17836619
 ] 

Alexander Lapin commented on IGNITE-22005:
--

Postponed, not enough capacity.

> 'Failed to process replica request' error under load with balance transfer 
> scenario
> ---
>
> Key: IGNITE-22005
> URL: https://issues.apache.org/jira/browse/IGNITE-22005
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 3.0.0-beta2
> Environment: Cluster of 3 nodes
>Reporter: Nikita Sivkov
>Priority: Major
>  Labels: ignite-3
> Attachments: transfer_ign3.yaml
>
>
> *Steps to reproduce:*
> Perform a long (about 2 hours) load test with a balance transfer scenario 
> (see scenario pseudo code in attachments).
> *Expected result:*
> No errors happen.
> *Actual result:*
> Get error in server logs - {{Failed to process replica request}}
> {code:java}
> 2024-04-05 17:50:55:802 +0300 
> [WARNING][%poc-tester-SERVER-192.168.1.97-id-0%JRaft-AppendEntries-Processor-2][NodeImpl]
>  Node <193_part_15/poc-tester-SERVER-192.168.1.97-id-0> is not in active 
> state, currTerm=2.
> 2024-04-05 17:50:55:805 +0300 
> [WARNING][%poc-tester-SERVER-192.168.1.97-id-0%Raft-Group-Client-19][ReplicaManager]
>  Failed to process replica request [request=TxFinishReplicaRequestImpl 
> [commit=false, commitTimestampLong=0, 
> enlistmentConsistencyToken=112218720633356321, groupId=123_part_21, 
> groups=HashMap {141_part_13=poc-tester-SERVER-192.168.1.27-id-0, 
> 139_part_9=poc-tester-SERVER-192.168.1.97-id-0, 
> 193_part_3=poc-tester-SERVER-192.168.1.27-id-0, 
> 19_part_23=poc-tester-SERVER-192.168.1.27-id-0, 
> 117_part_17=poc-tester-SERVER-192.168.1.18-id-0, 
> 45_part_9=poc-tester-SERVER-192.168.1.18-id-0, 
> 39_part_3=poc-tester-SERVER-192.168.1.18-id-0, 
> 77_part_4=poc-tester-SERVER-192.168.1.18-id-0, 
> 105_part_4=poc-tester-SERVER-192.168.1.18-id-0, 
> 123_part_21=poc-tester-SERVER-192.168.1.97-id-0, 
> 103_part_9=poc-tester-SERVER-192.168.1.18-id-0, 
> 161_part_15=poc-tester-SERVER-192.168.1.27-id-0, 
> 103_part_22=poc-tester-SERVER-192.168.1.27-id-0, 
> 89_part_10=poc-tester-SERVER-192.168.1.18-id-0, 
> 39_part_19=poc-tester-SERVER-192.168.1.27-id-0, 
> 149_part_13=poc-tester-SERVER-192.168.1.27-id-0, 
> 97_part_24=poc-tester-SERVER-192.168.1.97-id-0, 
> 83_part_9=poc-tester-SERVER-192.168.1.27-id-0, 
> 209_part_10=poc-tester-SERVER-192.168.1.27-id-0, 
> 185_part_5=poc-tester-SERVER-192.168.1.18-id-0, 
> 117_part_9=poc-tester-SERVER-192.168.1.27-id-0, 
> 105_part_22=poc-tester-SERVER-192.168.1.18-id-0}, 
> timestampLong=112219170129903617, txId=018eaebd-88ba-0001-606d-62250001]].
> java.util.concurrent.CompletionException: 
> org.apache.ignite.tx.TransactionException: IGN-TX-7 
> TraceId:cb1577e6-ec35-47f0-ab7d-56a0687344ed 
> java.util.concurrent.TimeoutException
>     at 
> java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314)
>     at 
> java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319)
>     at 
> java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:932)
>     at 
> java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907)
>     at 
> java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
>     at 
> java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
>     at 
> org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.lambda$applyCmdWithRetryOnSafeTimeReorderException$126(PartitionReplicaListener.java:2806)
>     at 
> java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859)
>     at 
> java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837)
>     at 
> java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
>     at 
> java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
>     at 
> org.apache.ignite.internal.raft.RaftGroupServiceImpl.sendWithRetry(RaftGroupServiceImpl.java:550)
>     at 
> org.apache.ignite.internal.raft.RaftGroupServiceImpl.lambda$handleErrorResponse$44(RaftGroupServiceImpl.java:653)
>     at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
>     at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>     at 
> java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
>     at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>     at 
> 

[jira] [Commented] (IGNITE-22006) 'Failed to process the lease granted message' error under load with balance transfer scenario

2024-04-12 Thread Alexander Lapin (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-22006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17836618#comment-17836618
 ] 

Alexander Lapin commented on IGNITE-22006:
--

Postponed, not enough capacity.

> 'Failed to process the lease granted message' error under load with balance 
> transfer scenario
> -
>
> Key: IGNITE-22006
> URL: https://issues.apache.org/jira/browse/IGNITE-22006
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 3.0.0-beta2
> Environment: Cluster of 3 nodes
>Reporter: Nikita Sivkov
>Priority: Major
>  Labels: ignite-3
> Attachments: transfer_ign3.yaml
>
>
> *Steps to reproduce:*
> Perform a long (about 2 hours) load test with a balance transfer scenario 
> (see scenario pseudo code in attachments).
> *Expected result:*
> No errors happen.
> *Actual result:*
> Get error in server logs - {{Failed to process the lease granted message}}
> {code:java}
> 2024-04-05 17:50:39:180 +0300 
> [WARNING][%poc-tester-SERVER-192.168.1.97-id-0%JRaft-Request-Processor-13][NodeImpl]
>  Node <127_part_16/poc-tester-SERVER-192.168.1.97-id-0> is not in active 
> state, currTerm=3.
> 2024-04-05 17:50:39:187 +0300 
> [WARNING][CompletableFutureDelayScheduler][ReplicaManager] Failed to process 
> the lease granted message [msg=LeaseGrantedMessageImpl [force=true, 
> groupId=77_part_14, leaseExpirationTimeLong=112219169697759232, 
> leaseStartTimeLong=112219161833439373]].
> java.util.concurrent.TimeoutException
>     at 
> java.base/java.util.concurrent.CompletableFuture$Timeout.run(CompletableFuture.java:2792)
>     at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
>     at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>     at 
> java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
>     at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>     at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>     at java.base/java.lang.Thread.run(Thread.java:829)
> 2024-04-05 17:50:39:190 +0300 
> [WARNING][%poc-tester-SERVER-192.168.1.97-id-0%JRaft-Request-Processor-34][NodeImpl]
>  Node <213_part_14/poc-tester-SERVER-192.168.1.97-id-0> is not in active 
> state, currTerm=2.{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (IGNITE-22004) 'Failed to process delayed response' error under load with balance transfer scenario

2024-04-12 Thread Alexander Lapin (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-22004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17836617#comment-17836617
 ] 

Alexander Lapin commented on IGNITE-22004:
--

Postponed, not enough capacity.

> 'Failed to process delayed response' error under load with balance transfer 
> scenario
> 
>
> Key: IGNITE-22004
> URL: https://issues.apache.org/jira/browse/IGNITE-22004
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 3.0.0-beta2
> Environment: Cluster of 3 nodes
>Reporter: Nikita Sivkov
>Priority: Major
>  Labels: ignite-3
> Attachments: transfer_ign3.yaml
>
>
> *Steps to reproduce:*
> Perform a long (about 2 hours) load test with a balance transfer scenario 
> (see scenario pseudo code in attachments).
> *Expected result:*
> No errors happen.
> *Actual result:*
> Get error in server logs - {{Failed to process delayed response}}
> {code:java}
> 2024-04-05 17:50:50:776 +0300 
> [WARNING][%poc-tester-SERVER-192.168.1.97-id-0%JRaft-Request-Processor-1][NodeImpl]
>  Node <27_part_23/poc-tester-SERVER-192.168.1.97-id-0> is not in active 
> state, currTerm=2.
> 2024-04-05 17:50:50:778 +0300 
> [WARNING][%poc-tester-SERVER-192.168.1.97-id-0%Raft-Group-Client-5][ReplicaManager]
>  Failed to process delayed response 
> [request=ReadWriteSingleRowReplicaRequestImpl 
> [commitPartitionId=TablePartitionIdMessageImpl [partitionId=21, tableId=123], 
> coordinatorId=3de6f999-7ab9-4405-aff0-ee0c7e4886ce, 
> enlistmentConsistencyToken=112218720633356321, full=false, 
> groupId=123_part_21, requestType=RW_UPSERT, schemaVersion=1, 
> timestampLong=112219169796915211, 
> transactionId=018eaebd-88ba-0001-606d-62250001]]
> java.util.concurrent.CompletionException: 
> java.util.concurrent.TimeoutException
>     at 
> java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331)
>     at 
> java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:346)
>     at 
> java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:632)
>     at 
> java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
>     at 
> java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
>     at 
> org.apache.ignite.internal.raft.RaftGroupServiceImpl.sendWithRetry(RaftGroupServiceImpl.java:550)
>     at 
> org.apache.ignite.internal.raft.RaftGroupServiceImpl.lambda$handleErrorResponse$44(RaftGroupServiceImpl.java:653)
>     at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
>     at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>     at 
> java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
>     at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>     at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>     at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: java.util.concurrent.TimeoutException
>     ... 8 more
> 2024-04-05 17:50:50:780 +0300 
> [WARNING][%poc-tester-SERVER-192.168.1.97-id-0%JRaft-Request-Processor-27][NodeImpl]
>  Node <99_part_6/poc-tester-SERVER-192.168.1.97-id-0> is not in active state, 
> currTerm=3. {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (IGNITE-22027) ItPrimaryReplicaChoiceTest#testPrimaryChangeLongHandling is flaky

2024-04-11 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-22027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin reassigned IGNITE-22027:


Assignee: Alexander Lapin

> ItPrimaryReplicaChoiceTest#testPrimaryChangeLongHandling is flaky
> -
>
> Key: IGNITE-22027
> URL: https://issues.apache.org/jira/browse/IGNITE-22027
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexander Lapin
>Assignee: Alexander Lapin
>Priority: Major
>  Labels: ignite-3
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
>  
> {code:java}
> org.opentest4j.AssertionFailedError: expected: <false> but was: <true>  at 
> app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
>   at 
> app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
>   at app//org.junit.jupiter.api.AssertFalse.failNotFalse(AssertFalse.java:63) 
>  at app//org.junit.jupiter.api.AssertFalse.assertFalse(AssertFalse.java:36)  
> at app//org.junit.jupiter.api.AssertFalse.assertFalse(AssertFalse.java:31)  
> at app//org.junit.jupiter.api.Assertions.assertFalse(Assertions.java:231)  at 
> app//org.apache.ignite.internal.placementdriver.ItPrimaryReplicaChoiceTest.testPrimaryChangeLongHandling(ItPrimaryReplicaChoiceTest.java:189)
>  {code}
> Long story short, it's because of a minor bug in NodeUtils#transferPrimary.
> While choosing a new primary in the case of a null preferablePrimary, the 
> following logic was used:
> {code:java}
> if (preferablePrimary == null) {
> preferablePrimary = nodes.stream()
> .map(IgniteImpl::name)
> .filter(n -> n.equals(currentLeaseholder.getLeaseholder()))
> .findFirst()
> .orElseThrow();
> } {code}
> which always selects the current primary as the new one. Apparently a "!" was missing in
> {code:java}
> .filter(n -> n.equals(currentLeaseholder.getLeaseholder())){code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-22027) ItPrimaryReplicaChoiceTest#testPrimaryChangeLongHandling is flaky

2024-04-11 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-22027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-22027:
-
Description: 
 
{code:java}
org.opentest4j.AssertionFailedError: expected: <false> but was: <true>  at 
app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
  at 
app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
  at app//org.junit.jupiter.api.AssertFalse.failNotFalse(AssertFalse.java:63)  
at app//org.junit.jupiter.api.AssertFalse.assertFalse(AssertFalse.java:36)  at 
app//org.junit.jupiter.api.AssertFalse.assertFalse(AssertFalse.java:31)  at 
app//org.junit.jupiter.api.Assertions.assertFalse(Assertions.java:231)  at 
app//org.apache.ignite.internal.placementdriver.ItPrimaryReplicaChoiceTest.testPrimaryChangeLongHandling(ItPrimaryReplicaChoiceTest.java:189)
 {code}
Long story short, it's because of a minor bug in NodeUtils#transferPrimary.

While choosing a new primary in the case of a null preferablePrimary, the 
following logic was used:
{code:java}
if (preferablePrimary == null) {
preferablePrimary = nodes.stream()
.map(IgniteImpl::name)
.filter(n -> n.equals(currentLeaseholder.getLeaseholder()))
.findFirst()
.orElseThrow();
} {code}
which always selects the current primary as the new one. Apparently a "!" was missing in
{code:java}
.filter(n -> n.equals(currentLeaseholder.getLeaseholder())){code}
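For clarity, the corrected selection with the negation restored (same context as the snippet above):
{code:java}
if (preferablePrimary == null) {
    preferablePrimary = nodes.stream()
            .map(IgniteImpl::name)
            // The restored "!" excludes the current leaseholder, so a
            // different node is chosen as the new primary.
            .filter(n -> !n.equals(currentLeaseholder.getLeaseholder()))
            .findFirst()
            .orElseThrow();
}
{code}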
 

 

> ItPrimaryReplicaChoiceTest#testPrimaryChangeLongHandling is flaky
> -
>
> Key: IGNITE-22027
> URL: https://issues.apache.org/jira/browse/IGNITE-22027
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexander Lapin
>Priority: Major
>  Labels: ignite-3
>
>  
> {code:java}
> org.opentest4j.AssertionFailedError: expected: <false> but was: <true>  at 
> app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
>   at 
> app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
>   at app//org.junit.jupiter.api.AssertFalse.failNotFalse(AssertFalse.java:63) 
>  at app//org.junit.jupiter.api.AssertFalse.assertFalse(AssertFalse.java:36)  
> at app//org.junit.jupiter.api.AssertFalse.assertFalse(AssertFalse.java:31)  
> at app//org.junit.jupiter.api.Assertions.assertFalse(Assertions.java:231)  at 
> app//org.apache.ignite.internal.placementdriver.ItPrimaryReplicaChoiceTest.testPrimaryChangeLongHandling(ItPrimaryReplicaChoiceTest.java:189)
>  {code}
> Long story short, it's because of a minor bug in NodeUtils#transferPrimary.
> While choosing a new primary in the case of a null preferablePrimary, the 
> following logic was used:
> {code:java}
> if (preferablePrimary == null) {
> preferablePrimary = nodes.stream()
> .map(IgniteImpl::name)
> .filter(n -> n.equals(currentLeaseholder.getLeaseholder()))
> .findFirst()
> .orElseThrow();
> } {code}
> which always selects the current primary as the new one. Apparently a "!" was missing in
> {code:java}
> .filter(n -> n.equals(currentLeaseholder.getLeaseholder())){code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-22027) ItPrimaryReplicaChoiceTest#testPrimaryChangeLongHandling is flaky

2024-04-11 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-22027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-22027:
-
Labels: ignite-3  (was: )

> ItPrimaryReplicaChoiceTest#testPrimaryChangeLongHandling is flaky
> -
>
> Key: IGNITE-22027
> URL: https://issues.apache.org/jira/browse/IGNITE-22027
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexander Lapin
>Priority: Major
>  Labels: ignite-3
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (IGNITE-22027) ItPrimaryReplicaChoiceTest#testPrimaryChangeLongHandling is flaky

2024-04-11 Thread Alexander Lapin (Jira)
Alexander Lapin created IGNITE-22027:


 Summary: ItPrimaryReplicaChoiceTest#testPrimaryChangeLongHandling 
is flaky
 Key: IGNITE-22027
 URL: https://issues.apache.org/jira/browse/IGNITE-22027
 Project: Ignite
  Issue Type: Bug
Reporter: Alexander Lapin






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-22027) ItPrimaryReplicaChoiceTest#testPrimaryChangeLongHandling is flaky

2024-04-11 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-22027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-22027:
-
Ignite Flags:   (was: Docs Required,Release Notes Required)

> ItPrimaryReplicaChoiceTest#testPrimaryChangeLongHandling is flaky
> -
>
> Key: IGNITE-22027
> URL: https://issues.apache.org/jira/browse/IGNITE-22027
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexander Lapin
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-22027) ItPrimaryReplicaChoiceTest#testPrimaryChangeLongHandling is flaky

2024-04-11 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-22027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-22027:
-
Epic Link: IGNITE-21389

> ItPrimaryReplicaChoiceTest#testPrimaryChangeLongHandling is flaky
> -
>
> Key: IGNITE-22027
> URL: https://issues.apache.org/jira/browse/IGNITE-22027
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexander Lapin
>Priority: Major
>  Labels: ignite-3
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-22002) AssertionError: Updated lease start time should be greater than current

2024-04-09 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-22002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-22002:
-
Epic Link: IGNITE-21389

> AssertionError: Updated lease start time should be greater than current
> ---
>
> Key: IGNITE-22002
> URL: https://issues.apache.org/jira/browse/IGNITE-22002
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexander Lapin
>Assignee: Alexander Lapin
>Priority: Major
>  Labels: ignite-3
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> PrimaryReplicaChangeCommandImpl commands might be reordered like any other 
> raft commands. In the case of PrimaryReplicaChangeCommand there's no need to 
> force the order; it's enough to skip the inconsistent one. Thus, let's 
> substitute the assertion with a corresponding check and stop command 
> evaluation in case of a miss.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (IGNITE-22002) AssertionError: Updated lease start time should be greater than current

2024-04-09 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-22002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin reassigned IGNITE-22002:


Assignee: Alexander Lapin

> AssertionError: Updated lease start time should be greater than current
> ---
>
> Key: IGNITE-22002
> URL: https://issues.apache.org/jira/browse/IGNITE-22002
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexander Lapin
>Assignee: Alexander Lapin
>Priority: Major
>  Labels: ignite-3
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> PrimaryReplicaChangeCommandImpl commands might be reordered like any other 
> raft commands. In the case of PrimaryReplicaChangeCommand there's no need to 
> force the order; it's enough to skip the inconsistent one. Thus, let's 
> substitute the assertion with a corresponding check and stop command 
> evaluation in case of a miss.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (IGNITE-21400) ItDataSchemaSyncTest.checkSchemasCorrectlyRestore() is flaky with The query was cancelled while executing

2024-04-09 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin resolved IGNITE-21400.
--
Resolution: Cannot Reproduce

Was fixed in https://issues.apache.org/jira/browse/IGNITE-20883

> ItDataSchemaSyncTest.checkSchemasCorrectlyRestore() is flaky with The query 
> was cancelled while executing
> -
>
> Key: IGNITE-21400
> URL: https://issues.apache.org/jira/browse/IGNITE-21400
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexander Lapin
>Assignee: Alexander Lapin
>Priority: Major
>  Labels: ignite-3
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
>  
> {code:java}
> org.apache.ignite.sql.SqlException: IGN-SQL-8 
> TraceId:9ddffd30-18d6-4006-ae81-f72867235290 The query was cancelled while 
> executing.  at 
> java.base@11.0.17/java.lang.invoke.MethodHandle.invokeWithArguments(MethodHandle.java:710)
>  at 
> app//org.apache.ignite.internal.util.ExceptionUtils$1.copy(ExceptionUtils.java:765)
>  at 
> app//org.apache.ignite.internal.util.ExceptionUtils$ExceptionFactory.createCopy(ExceptionUtils.java:699)
>  at 
> app//org.apache.ignite.internal.util.ExceptionUtils.copyExceptionWithCause(ExceptionUtils.java:525)
>  at 
> app//org.apache.ignite.internal.util.ExceptionUtils.copyExceptionWithCauseInternal(ExceptionUtils.java:634)
>  at 
> app//org.apache.ignite.internal.util.ExceptionUtils.copyExceptionWithCause(ExceptionUtils.java:476)
>  at 
> app//org.apache.ignite.internal.sql.AbstractSession.execute(AbstractSession.java:63)
>  at 
> app//org.apache.ignite.internal.runner.app.ItDataSchemaSyncTest.checkSchemasCorrectlyRestore(ItDataSchemaSyncTest.java:252)
>  at 
> java.base@11.0.17/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native
>  Method)     at 
> java.base@11.0.17/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> java.base@11.0.17/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.base@11.0.17/java.lang.reflect.Method.invoke(Method.java:566)    
>  at 
> app//org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:727)
>  at 
> app//org.junit.jupiter.engine.execution.MethodInvocation.proceed(MethodInvocation.java:60)
>  at 
> app//org.junit.jupiter.engine.execution.InvocationInterceptorChain$ValidatingInvocation.proceed(InvocationInterceptorChain.java:131)
>  at 
> app//org.junit.jupiter.engine.extension.SameThreadTimeoutInvocation.proceed(SameThreadTimeoutInvocation.java:45)
>  at 
> app//org.junit.jupiter.engine.extension.TimeoutExtension.intercept(TimeoutExtension.java:156)
>  at 
> app//org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestableMethod(TimeoutExtension.java:147)
>  at 
> app//org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestMethod(TimeoutExtension.java:86)
>{code}
> Seems to be exactly the same as 
> https://issues.apache.org/jira/browse/IGNITE-20883
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (IGNITE-20918) Leases expire after a node has been restarted

2024-04-09 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-20918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin resolved IGNITE-20918.
--
Release Note: Was fixed in other issues, where a nodeId matching check was 
added. Mainly https://issues.apache.org/jira/browse/IGNITE-20883.
  Resolution: Cannot Reproduce

> Leases expire after a node has been restarted
> -
>
> Key: IGNITE-20918
> URL: https://issues.apache.org/jira/browse/IGNITE-20918
> Project: Ignite
>  Issue Type: Bug
>Reporter: Aleksandr Polovtcev
>Assignee: Alexander Lapin
>Priority: Critical
>  Labels: ignite-3
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> IGNITE-20910 introduces a test that inserts some data after restarting a 
> node. For some reason, after some time, I can see the following messages in 
> the log:
> {noformat}
> [2023-11-22T10:00:17,056][INFO 
> ][%isnt_tmpar_0%metastorage-watch-executor-3][PartitionReplicaListener] 
> Primary replica expired [grp=5_part_19]
> [2023-11-22T10:00:17,057][INFO 
> ][%isnt_tmpar_0%metastorage-watch-executor-3][PartitionReplicaListener] 
> Primary replica expired [grp=5_part_0]
> [2023-11-22T10:00:17,057][INFO 
> ][%isnt_tmpar_0%metastorage-watch-executor-3][PartitionReplicaListener] 
> Primary replica expired [grp=5_part_9]
> [2023-11-22T10:00:17,057][INFO 
> ][%isnt_tmpar_0%metastorage-watch-executor-3][PartitionReplicaListener] 
> Primary replica expired [grp=5_part_10]
> {noformat}
> After that, the test fails with a {{PrimaryReplicaMissException}}. The 
> problem here is that a single node is expected to never have expired leases; 
> they should be prolonged automatically. I think that this happens because the 
> initial lease that was issued before the node was restarted is still accepted 
> by the node after restart.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-20918) Leases expire after a node has been restarted

2024-04-09 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-20918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-20918:
-
Release Note:   (was: Was fixed in other issues, where a nodeId matching 
check was added. Mainly https://issues.apache.org/jira/browse/IGNITE-20883.)

> Leases expire after a node has been restarted
> -
>
> Key: IGNITE-20918
> URL: https://issues.apache.org/jira/browse/IGNITE-20918
> Project: Ignite
>  Issue Type: Bug
>Reporter: Aleksandr Polovtcev
>Assignee: Alexander Lapin
>Priority: Critical
>  Labels: ignite-3
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> IGNITE-20910 introduces a test that inserts some data after restarting a 
> node. For some reason, after some time, I can see the following messages in 
> the log:
> {noformat}
> [2023-11-22T10:00:17,056][INFO 
> ][%isnt_tmpar_0%metastorage-watch-executor-3][PartitionReplicaListener] 
> Primary replica expired [grp=5_part_19]
> [2023-11-22T10:00:17,057][INFO 
> ][%isnt_tmpar_0%metastorage-watch-executor-3][PartitionReplicaListener] 
> Primary replica expired [grp=5_part_0]
> [2023-11-22T10:00:17,057][INFO 
> ][%isnt_tmpar_0%metastorage-watch-executor-3][PartitionReplicaListener] 
> Primary replica expired [grp=5_part_9]
> [2023-11-22T10:00:17,057][INFO 
> ][%isnt_tmpar_0%metastorage-watch-executor-3][PartitionReplicaListener] 
> Primary replica expired [grp=5_part_10]
> {noformat}
> After that, the test fails with a {{PrimaryReplicaMissException}}. The 
> problem here is that a single node is expected to never have expired leases; 
> they should be prolonged automatically. I think that this happens because the 
> initial lease that was issued before the node was restarted is still accepted 
> by the node after restart.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (IGNITE-20918) Leases expire after a node has been restarted

2024-04-09 Thread Alexander Lapin (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-20918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17835185#comment-17835185
 ] 

Alexander Lapin commented on IGNITE-20918:
--

Was fixed in other issues, where a nodeId matching check was added. Mainly 
https://issues.apache.org/jira/browse/IGNITE-20883.

> Leases expire after a node has been restarted
> -
>
> Key: IGNITE-20918
> URL: https://issues.apache.org/jira/browse/IGNITE-20918
> Project: Ignite
>  Issue Type: Bug
>Reporter: Aleksandr Polovtcev
>Assignee: Alexander Lapin
>Priority: Critical
>  Labels: ignite-3
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> IGNITE-20910 introduces a test that inserts some data after restarting a 
> node. For some reason, after some time, I can see the following messages in 
> the log:
> {noformat}
> [2023-11-22T10:00:17,056][INFO 
> ][%isnt_tmpar_0%metastorage-watch-executor-3][PartitionReplicaListener] 
> Primary replica expired [grp=5_part_19]
> [2023-11-22T10:00:17,057][INFO 
> ][%isnt_tmpar_0%metastorage-watch-executor-3][PartitionReplicaListener] 
> Primary replica expired [grp=5_part_0]
> [2023-11-22T10:00:17,057][INFO 
> ][%isnt_tmpar_0%metastorage-watch-executor-3][PartitionReplicaListener] 
> Primary replica expired [grp=5_part_9]
> [2023-11-22T10:00:17,057][INFO 
> ][%isnt_tmpar_0%metastorage-watch-executor-3][PartitionReplicaListener] 
> Primary replica expired [grp=5_part_10]
> {noformat}
> After that, the test fails with a {{PrimaryReplicaMissException}}. The 
> problem here is that a single node is expected to never have expired leases; 
> they should be prolonged automatically. I think that this happens because the 
> initial lease that was issued before the node was restarted is still accepted 
> by the node after restart.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21892) ItPlacementDriverReplicaSideTest testNotificationToPlacementDriverAboutChangeLeader is flaky

2024-04-09 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-21892:
-
Epic Link: IGNITE-21389

> ItPlacementDriverReplicaSideTest 
> testNotificationToPlacementDriverAboutChangeLeader is flaky
> 
>
> Key: IGNITE-21892
> URL: https://issues.apache.org/jira/browse/IGNITE-21892
> Project: Ignite
>  Issue Type: Bug
>Reporter: Maksim Zhuravkov
>Assignee: Alexander Lapin
>Priority: Major
>  Labels: ignite-3
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This test is flaky. Build error:
> {code}
>   java.lang.AssertionError: There are replicas alive [replicas=[group_1]]
> at 
> org.apache.ignite.internal.replicator.ReplicaManager.stop(ReplicaManager.java:658)
> at 
> org.apache.ignite.internal.replicator.ItPlacementDriverReplicaSideTest.lambda$beforeTest$3(ItPlacementDriverReplicaSideTest.java:200)
> at 
> org.apache.ignite.internal.util.IgniteUtils.lambda$closeAll$0(IgniteUtils.java:559)
> at 
> java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
> at 
> java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:177)
> at 
> java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1655)
> at 
> java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
> at 
> java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
> at 
> java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
> at 
> java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
> at 
> java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> at 
> java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:497)
> at 
> org.apache.ignite.internal.util.IgniteUtils.closeAll(IgniteUtils.java:557)
> at 
> org.apache.ignite.internal.util.IgniteUtils.closeAll(IgniteUtils.java:580)
> at 
> org.apache.ignite.internal.replicator.ItPlacementDriverReplicaSideTest.afterTest(ItPlacementDriverReplicaSideTest.java:214)
> at java.base/java.lang.reflect.Method.invoke(Method.java:566)
> at java.base/java.util.ArrayList.forEach(ArrayList.java:1541)
> at java.base/java.util.ArrayList.forEach(ArrayList.java:1541)
> {code}
> https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_IntegrationTests_ModuleReplicator/7987165?expandBuildDeploymentsSection=false=false=true=false=true=true=7987165_489_86.470.489=debug=flowAware
> I was not able to reproduce the same error locally; I got an error on the 
> following line instead:
> {code}
> assertTrue(waitForCondition(() -> nodesToReceivedDeclineMsg.size() == 
> placementDriverNodeNames.size(), 10_000));
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21892) ItPlacementDriverReplicaSideTest testNotificationToPlacementDriverAboutChangeLeader is flaky

2024-04-09 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-21892:
-
Priority: Major  (was: Minor)

> ItPlacementDriverReplicaSideTest 
> testNotificationToPlacementDriverAboutChangeLeader is flaky
> 
>
> Key: IGNITE-21892
> URL: https://issues.apache.org/jira/browse/IGNITE-21892
> Project: Ignite
>  Issue Type: Bug
>Reporter: Maksim Zhuravkov
>Assignee: Alexander Lapin
>Priority: Major
>  Labels: ignite-3
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This test is flaky. Build error:
> {code}
>   java.lang.AssertionError: There are replicas alive [replicas=[group_1]]
> at 
> org.apache.ignite.internal.replicator.ReplicaManager.stop(ReplicaManager.java:658)
> at 
> org.apache.ignite.internal.replicator.ItPlacementDriverReplicaSideTest.lambda$beforeTest$3(ItPlacementDriverReplicaSideTest.java:200)
> at 
> org.apache.ignite.internal.util.IgniteUtils.lambda$closeAll$0(IgniteUtils.java:559)
> at 
> java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
> at 
> java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:177)
> at 
> java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1655)
> at 
> java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
> at 
> java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
> at 
> java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
> at 
> java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
> at 
> java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> at 
> java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:497)
> at 
> org.apache.ignite.internal.util.IgniteUtils.closeAll(IgniteUtils.java:557)
> at 
> org.apache.ignite.internal.util.IgniteUtils.closeAll(IgniteUtils.java:580)
> at 
> org.apache.ignite.internal.replicator.ItPlacementDriverReplicaSideTest.afterTest(ItPlacementDriverReplicaSideTest.java:214)
> at java.base/java.lang.reflect.Method.invoke(Method.java:566)
> at java.base/java.util.ArrayList.forEach(ArrayList.java:1541)
> at java.base/java.util.ArrayList.forEach(ArrayList.java:1541)
> {code}
> https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_IntegrationTests_ModuleReplicator/7987165?expandBuildDeploymentsSection=false=false=true=false=true=true=7987165_489_86.470.489=debug=flowAware
> I was not able to reproduce the same error locally; I got an error on the 
> following line instead:
> {code}
> assertTrue(waitForCondition(() -> nodesToReceivedDeclineMsg.size() == 
> placementDriverNodeNames.size(), 10_000));
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-22002) AssertionError: Updated lease start time should be greater than current

2024-04-08 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-22002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-22002:
-
Ignite Flags:   (was: Docs Required,Release Notes Required)

> AssertionError: Updated lease start time should be greater than current
> ---
>
> Key: IGNITE-22002
> URL: https://issues.apache.org/jira/browse/IGNITE-22002
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexander Lapin
>Priority: Major
>  Labels: ignite-3
>
> PrimaryReplicaChangeCommandImpl commands might be reordered like any other 
> raft commands. In the case of PrimaryReplicaChangeCommand there's no need to 
> force the order; it's enough to skip the inconsistent one. Thus, let's 
> substitute the assertion with a corresponding check and stop command 
> evaluation in case of a miss.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-22002) AssertionError: Updated lease start time should be greater than current

2024-04-08 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-22002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-22002:
-
Labels: ignite-3  (was: )

> AssertionError: Updated lease start time should be greater than current
> ---
>
> Key: IGNITE-22002
> URL: https://issues.apache.org/jira/browse/IGNITE-22002
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexander Lapin
>Priority: Major
>  Labels: ignite-3
>
> PrimaryReplicaChangeCommandImpl commands might be reordered like any other 
> raft commands. In the case of PrimaryReplicaChangeCommand there's no need to 
> force the order; it's enough to skip the inconsistent one. Thus, let's 
> substitute the assertion with a corresponding check and stop command 
> evaluation in case of a miss.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (IGNITE-22002) AssertionError: Updated lease start time should be greater than current

2024-04-08 Thread Alexander Lapin (Jira)
Alexander Lapin created IGNITE-22002:


 Summary: AssertionError: Updated lease start time should be 
greater than current
 Key: IGNITE-22002
 URL: https://issues.apache.org/jira/browse/IGNITE-22002
 Project: Ignite
  Issue Type: Bug
Reporter: Alexander Lapin


PrimaryReplicaChangeCommandImpl commands might be reordered like any other raft 
commands. In the case of PrimaryReplicaChangeCommand there's no need to force 
the order; it's enough to skip the inconsistent one. Thus, let's substitute the 
assertion with a corresponding check and stop command evaluation in case of a miss.
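A minimal sketch of the proposed change (class, field and method names are hypothetical):
{code:java}
// Sketch only: skip a reordered (stale) command instead of failing the
// "Updated lease start time should be greater than current" assertion.
class PrimaryReplicaStateSketch {
    private long currentLeaseStartTime;

    void handlePrimaryReplicaChange(long newLeaseStartTime) {
        if (newLeaseStartTime <= currentLeaseStartTime) {
            return; // Reordered/stale command: stop evaluation, keep current state.
        }

        currentLeaseStartTime = newLeaseStartTime;
        // ... apply the rest of the primary replica change ...
    }
}
{code}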



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21918) ItPlacementDriverReplicaSideTest#testNotificationToPlacementDriverAboutChangeLeader is flaky

2024-04-03 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-21918:
-
Description: 
{code:java}
java.lang.AssertionError: There are replicas alive [replicas=[group_1]]
  at 
org.apache.ignite.internal.replicator.ReplicaManager.stop(ReplicaManager.java:658)
  at 
org.apache.ignite.internal.replicator.ItPlacementDriverReplicaSideTest.lambda$beforeTest$3(ItPlacementDriverReplicaSideTest.java:200)
 {code}
Flaky rate is relatively high, latest failures:
 # 
[02/04/24|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_IntegrationTests_ModuleReplicator/7987165?expandBuildDeploymentsSection=false=false=true=false=true=true]
 # 
[29/03/24|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_IntegrationTests_ModuleReplicator/7996550?expandBuildDeploymentsSection=false=false=true=false=true]
 # 
[26/03/24|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_IntegrationTests_ModuleReplicator/7973057?expandBuildDeploymentsSection=false=false=false=true=true=true]

The exception may differ a bit, but semantically it's the same:
{code:java}
org.opentest4j.AssertionFailedError: expected: <true> but was: <false>
  at 
app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
  at 
app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
  at app//org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)
  at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36)
  at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:31)
  at app//org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:183)
  at 
app//org.apache.ignite.internal.replicator.ItPlacementDriverReplicaSideTest.testNotificationToPlacementDriverAboutChangeLeader(ItPlacementDriverReplicaSideTest.java:301)
  at java.base@11.0.17/java.lang.reflect.Method.invoke(Method.java:566)
  at java.base@11.0.17/java.util.ArrayList.forEach(ArrayList.java:1541)
  at java.base@11.0.17/java.util.ArrayList.forEach(ArrayList.java:1541)
  Suppressed: java.lang.AssertionError: There are replicas alive 
[replicas=[group_1]] {code}
Reproduced locally in 1 run out of 20.
h3. Upd#1

Seems that it's a test issue. It's expected that three replicas will be 
started: one per test node
{code:java}
[2024-04-02T11:00:46,956][INFO ][Test worker][ItPlacementDriverReplicaSideTest] 
Replication group is based on [ipdrst_tntpdacl_1238, ipdrst_tntpdacl_1236, 
ipdrst_tntpdacl_1234] {code}
However, there are only two log entries with "Replica is about to start":
{code:java}
[2024-04-02T11:00:48,404][INFO 
][%ipdrst_tntpdacl_1238%JRaft-Request-Processor-8][ReplicaManager] Replica is 
about to start [replicationGroupId=group_1].

[2024-04-02T11:00:49,071][INFO 
][%ipdrst_tntpdacl_1234%Raft-Group-Client-2][ReplicaManager] Replica is about 
to start [replicationGroupId=group_1]. {code}
Basically there's no "Replica is about to start" on the 1236 node. 

 

Seems that there's no linearization between replica start and the rest of the 
test flow, including the replicaManager stop in @AfterEach.
{code:java}
private CompletableFuture createReplicationGroup(
ReplicationGroupId groupId,
Set<String> nodes
) throws Exception {
var res = new CompletableFuture();

for (String nodeName : nodes) {
var replicaManager = replicaManagers.get(nodeName);
var raftManager = raftManagers.get(nodeName);

assertNotNull(replicaManager);
assertNotNull(raftManager);

var peer = new Peer(nodeName);

var rftNodeId = new RaftNodeId(groupId, peer);

CompletableFuture raftClientFut = 
raftManager.startRaftGroupNode(
rftNodeId,
fromConsistentIds(nodes),
new TestRaftGroupListener(),
RaftGroupEventsListener.noopLsnr,
RaftGroupOptions.defaults(),
raftClientFactory.get(nodeName)
);

raftClientFut.thenAccept(raftClient -> {
try {
if (!res.isDone()) {
res.complete(raftClient);
}

replicaManager.startReplica(
groupId,
(request, senderId) -> {
log.info("Handle request [type={}]", 
request.getClass().getSimpleName());

return 
raftClient.run(REPLICA_MESSAGES_FACTORY.safeTimeSyncCommand().build())
.thenApply(ignored -> new 
ReplicaResult(null, null));
},
raftClient,
new PendingComparableValuesTracker<>(Long.MAX_VALUE));
} catch (NodeStoppingException e) {
fail("Can not start replica [groupId=" + groupId + ']');
}
});
}

return res;
} {code}
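One possible direction, sketched below with hypothetical names rather than the actual fix: make the helper complete only after startReplica has been invoked on every node, so the test body and @AfterEach cannot outrun replica startup.
{code:java}
import java.util.List;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;
import java.util.stream.Collectors;

// Sketch only. startReplicaOn stands in for the per-node
// "start raft node, then startReplica" chain from the test above.
class ReplicaStartSyncSketch {
    static CompletableFuture<Void> startAll(
            Set<String> nodes,
            Function<String, CompletableFuture<Void>> startReplicaOn
    ) {
        List<CompletableFuture<Void>> starts = nodes.stream()
                .map(startReplicaOn) // one start future per node
                .collect(Collectors.toList());

        // Completes only when every node has finished starting its replica,
        // instead of after the first raft client as in createReplicationGroup().
        return CompletableFuture.allOf(starts.toArray(CompletableFuture[]::new));
    }
}
{code}
With such a helper, the test could join on the returned future before proceeding, removing the race with @AfterEach.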

[jira] [Updated] (IGNITE-21918) ItPlacementDriverReplicaSideTest#testNotificationToPlacementDriverAboutChangeLeader is flaky

2024-04-03 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-21918:
-
Description: 
{code:java}
java.lang.AssertionError: There are replicas alive [replicas=[group_1]]
  at 
org.apache.ignite.internal.replicator.ReplicaManager.stop(ReplicaManager.java:658)
  at 
org.apache.ignite.internal.replicator.ItPlacementDriverReplicaSideTest.lambda$beforeTest$3(ItPlacementDriverReplicaSideTest.java:200)
 {code}
Flaky rate is relatively high, latest failures:
 # 
[02/04/24|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_IntegrationTests_ModuleReplicator/7987165?expandBuildDeploymentsSection=false=false=true=false=true=true]
 # 
[29/03/24|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_IntegrationTests_ModuleReplicator/7996550?expandBuildDeploymentsSection=false=false=true=false=true]
 # 
[26/03/24|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_IntegrationTests_ModuleReplicator/7973057?expandBuildDeploymentsSection=false=false=false=true=true=true]

The exception may differ a bit, but semantically it's the same:
{code:java}
org.opentest4j.AssertionFailedError: expected: <true> but was: <false>
  at 
app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
  at 
app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
  at app//org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)
  at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36)
  at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:31)
  at app//org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:183)
  at 
app//org.apache.ignite.internal.replicator.ItPlacementDriverReplicaSideTest.testNotificationToPlacementDriverAboutChangeLeader(ItPlacementDriverReplicaSideTest.java:301)
  at java.base@11.0.17/java.lang.reflect.Method.invoke(Method.java:566)
  at java.base@11.0.17/java.util.ArrayList.forEach(ArrayList.java:1541)
  at java.base@11.0.17/java.util.ArrayList.forEach(ArrayList.java:1541)
  Suppressed: java.lang.AssertionError: There are replicas alive 
[replicas=[group_1]] {code}
Reproduced locally in 1 run out of 20.
h3. Upd#1

Seems that it's a test issue. It's expected that three replicas will be 
started: one per test node
{code:java}
[2024-04-02T11:00:46,956][INFO ][Test worker][ItPlacementDriverReplicaSideTest] 
Replication group is based on [ipdrst_tntpdacl_1238, ipdrst_tntpdacl_1236, 
ipdrst_tntpdacl_1234] {code}
However, there are only two log entries with "Replica is about to start":
{code:java}
[2024-04-02T11:00:48,404][INFO 
][%ipdrst_tntpdacl_1238%JRaft-Request-Processor-8][ReplicaManager] Replica is 
about to start [replicationGroupId=group_1].

[2024-04-02T11:00:49,071][INFO 
][%ipdrst_tntpdacl_1234%Raft-Group-Client-2][ReplicaManager] Replica is about 
to start [replicationGroupId=group_1]. {code}
Basically there's no "Replica is about to start" on the 1236 node. 

Probably that means that there was an exception during the test flow, meaning 
that the replica stop wasn't called; stopReplicationGroup is the last line in 
the test:
{code:java}
stopReplicationGroup(GROUP_ID, grpNodes);
} 
..

private void stopReplicationGroup(ReplicationGroupId testGrpId, Set<String> 
grpNodes) throws NodeStoppingException {
for (String nodeName : grpNodes) {
var raftManager = raftManagers.get(nodeName);
var replicaManager = replicaManagers.get(nodeName);

assertNotNull(raftManager);
assertNotNull(replicaManager);

replicaManager.stopReplica(testGrpId).join();
raftManager.stopRaftNodes(testGrpId);
}
}{code}
Unfortunately, we do not log a successful replica stop, so it's hard to tell 
from the logs whether stop was actually called. All in all, that means that 
@AfterEach was called prior to stopReplicationGroup; however, it's not clear 
whether that was because of an exception in the test itself or because of a 
race with the previous test's @AfterEach.
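A trivial sketch of the missing diagnostics (the log call and its message are hypothetical, mirroring the existing "Replica is about to start" entry), which would make it visible in CI logs whether stopReplicationGroup actually ran:
{code:java}
// Sketch in the context of the test above (same fields and helpers assumed).
private void stopReplicationGroup(ReplicationGroupId testGrpId, Set<String> grpNodes) throws NodeStoppingException {
    for (String nodeName : grpNodes) {
        var raftManager = raftManagers.get(nodeName);
        var replicaManager = replicaManagers.get(nodeName);

        assertNotNull(raftManager);
        assertNotNull(replicaManager);

        replicaManager.stopReplica(testGrpId).join();
        raftManager.stopRaftNodes(testGrpId);

        // Hypothetical addition: log the successful stop per node.
        log.info("Replica stopped [replicationGroupId={}, node={}]", testGrpId, nodeName);
    }
}
{code}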

 

  was:
{code:java}
java.lang.AssertionError: There are replicas alive [replicas=[group_1]]
  at 
org.apache.ignite.internal.replicator.ReplicaManager.stop(ReplicaManager.java:658)
  at 
org.apache.ignite.internal.replicator.ItPlacementDriverReplicaSideTest.lambda$beforeTest$3(ItPlacementDriverReplicaSideTest.java:200)
 {code}
Flaky rate is relatively high, latest failures:
 # 
[02/04/24|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_IntegrationTests_ModuleReplicator/7987165?expandBuildDeploymentsSection=false=false=true=false=true=true]
 # 
[29/03/24|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_IntegrationTests_ModuleReplicator/7996550?expandBuildDeploymentsSection=false=false=true=false=true]
 # 

[jira] [Updated] (IGNITE-21918) ItPlacementDriverReplicaSideTest#testNotificationToPlacementDriverAboutChangeLeader is flaky

2024-04-03 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-21918:
-
Description: 
{code:java}
java.lang.AssertionError: There are replicas alive [replicas=[group_1]]
  at 
org.apache.ignite.internal.replicator.ReplicaManager.stop(ReplicaManager.java:658)
  at 
org.apache.ignite.internal.replicator.ItPlacementDriverReplicaSideTest.lambda$beforeTest$3(ItPlacementDriverReplicaSideTest.java:200)
 {code}
Flaky rate is relatively high, latest failures:
 # 
[02/04/24|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_IntegrationTests_ModuleReplicator/7987165?expandBuildDeploymentsSection=false=false=true=false=true=true]
 # 
[29/03/24|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_IntegrationTests_ModuleReplicator/7996550?expandBuildDeploymentsSection=false=false=true=false=true]
 # 
[26/03/24|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_IntegrationTests_ModuleReplicator/7973057?expandBuildDeploymentsSection=false=false=false=true=true=true]

The exception may differ a bit, but semantically it's the same:
{code:java}
org.opentest4j.AssertionFailedError: expected: <true> but was: <false>
  at 
app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
  at 
app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
  at app//org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)
  at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36)
  at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:31)
  at app//org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:183)
  at 
app//org.apache.ignite.internal.replicator.ItPlacementDriverReplicaSideTest.testNotificationToPlacementDriverAboutChangeLeader(ItPlacementDriverReplicaSideTest.java:301)
  at java.base@11.0.17/java.lang.reflect.Method.invoke(Method.java:566)
  at java.base@11.0.17/java.util.ArrayList.forEach(ArrayList.java:1541)
  at java.base@11.0.17/java.util.ArrayList.forEach(ArrayList.java:1541)
  Suppressed: java.lang.AssertionError: There are replicas alive 
[replicas=[group_1]] {code}
Reproduced locally in 1 run out of 20.
h3. Upd#1

Seems that it's a test issue. It's expected that three replicas will be 
started: one per test node
{code:java}
[2024-04-02T11:00:46,956][INFO ][Test worker][ItPlacementDriverReplicaSideTest] 
Replication group is based on [ipdrst_tntpdacl_1238, ipdrst_tntpdacl_1236, 
ipdrst_tntpdacl_1234] {code}
However, there are only two log entries with "Replica is about to start":
{code:java}
[2024-04-02T11:00:48,404][INFO 
][%ipdrst_tntpdacl_1238%JRaft-Request-Processor-8][ReplicaManager] Replica is 
about to start [replicationGroupId=group_1].

[2024-04-02T11:00:49,071][INFO 
][%ipdrst_tntpdacl_1234%Raft-Group-Client-2][ReplicaManager] Replica is about 
to start [replicationGroupId=group_1]. {code}
Basically there's no "Replica is about to start" on the 1236 node. 

Probably that means that there was an exception during the test flow, meaning 
that the replica stop wasn't called; stopReplicationGroup is the last line in 
the test:
{code:java}
stopReplicationGroup(GROUP_ID, grpNodes);
} 
..

private void stopReplicationGroup(ReplicationGroupId testGrpId, Set<String> 
grpNodes) throws NodeStoppingException {
for (String nodeName : grpNodes) {
var raftManager = raftManagers.get(nodeName);
var replicaManager = replicaManagers.get(nodeName);

assertNotNull(raftManager);
assertNotNull(replicaManager);

replicaManager.stopReplica(testGrpId).join();
raftManager.stopRaftNodes(testGrpId);
}
}{code}

  was:
{code:java}
java.lang.AssertionError: There are replicas alive [replicas=[group_1]]
  at 
org.apache.ignite.internal.replicator.ReplicaManager.stop(ReplicaManager.java:658)
  at 
org.apache.ignite.internal.replicator.ItPlacementDriverReplicaSideTest.lambda$beforeTest$3(ItPlacementDriverReplicaSideTest.java:200)
 {code}
Flaky rate is relatively high, latest failures:
 # 
[02/04/24|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_IntegrationTests_ModuleReplicator/7987165?expandBuildDeploymentsSection=false=false=true=false=true=true]
 # 
[29/03/24|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_IntegrationTests_ModuleReplicator/7996550?expandBuildDeploymentsSection=false=false=true=false=true]
 # 
[26/03/24|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_IntegrationTests_ModuleReplicator/7973057?expandBuildDeploymentsSection=false=false=false=true=true=true]

The exception may differ a bit, but semantically it's the same:
{code:java}
org.opentest4j.AssertionFailedError: expected: <true> but was: <false>
  at 
app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
  at 

[jira] [Updated] (IGNITE-21918) ItPlacementDriverReplicaSideTest#testNotificationToPlacementDriverAboutChangeLeader is flaky

2024-04-02 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-21918:
-
Description: 
{code:java}
java.lang.AssertionError: There are replicas alive [replicas=[group_1]]
  at 
org.apache.ignite.internal.replicator.ReplicaManager.stop(ReplicaManager.java:658)
  at 
org.apache.ignite.internal.replicator.ItPlacementDriverReplicaSideTest.lambda$beforeTest$3(ItPlacementDriverReplicaSideTest.java:200)
 {code}
Flaky rate is relatively high, latest failures:
 # 
[02/04/24|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_IntegrationTests_ModuleReplicator/7987165?expandBuildDeploymentsSection=false=false=true=false=true=true]
 # 
[29/03/24|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_IntegrationTests_ModuleReplicator/7996550?expandBuildDeploymentsSection=false=false=true=false=true]
 # 
[26/03/24|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_IntegrationTests_ModuleReplicator/7973057?expandBuildDeploymentsSection=false=false=false=true=true=true]

The exception may differ a bit, but semantically it's the same:
{code:java}
org.opentest4j.AssertionFailedError: expected: <true> but was: <false>
  at 
app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
  at 
app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
  at app//org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)
  at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36)
  at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:31)
  at app//org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:183)
  at 
app//org.apache.ignite.internal.replicator.ItPlacementDriverReplicaSideTest.testNotificationToPlacementDriverAboutChangeLeader(ItPlacementDriverReplicaSideTest.java:301)
  at java.base@11.0.17/java.lang.reflect.Method.invoke(Method.java:566)
  at java.base@11.0.17/java.util.ArrayList.forEach(ArrayList.java:1541)
  at java.base@11.0.17/java.util.ArrayList.forEach(ArrayList.java:1541)
  Suppressed: java.lang.AssertionError: There are replicas alive 
[replicas=[group_1]] {code}
Reproduced locally in 1 run out of 20.

  was:
{code:java}
java.lang.AssertionError: There are replicas alive [replicas=[group_1]]
  at 
org.apache.ignite.internal.replicator.ReplicaManager.stop(ReplicaManager.java:658)
  at 
org.apache.ignite.internal.replicator.ItPlacementDriverReplicaSideTest.lambda$beforeTest$3(ItPlacementDriverReplicaSideTest.java:200)
 {code}
Flaky rate is relatively high, latest failures:
 # 
[02/04/24|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_IntegrationTests_ModuleReplicator/7987165?expandBuildDeploymentsSection=false=false=true=false=true=true]
 # 
[29/03/24|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_IntegrationTests_ModuleReplicator/7996550?expandBuildDeploymentsSection=false=false=true=false=true]
 # 
[26/03/24|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_IntegrationTests_ModuleReplicator/7973057?expandBuildDeploymentsSection=false=false=false=true=true=true]


> ItPlacementDriverReplicaSideTest#testNotificationToPlacementDriverAboutChangeLeader
>  is flaky
> 
>
> Key: IGNITE-21918
> URL: https://issues.apache.org/jira/browse/IGNITE-21918
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexander Lapin
>Priority: Major
>  Labels: ignite-3
>
> {code:java}
> java.lang.AssertionError: There are replicas alive [replicas=[group_1]]
>   at 
> org.apache.ignite.internal.replicator.ReplicaManager.stop(ReplicaManager.java:658)
>   at 
> org.apache.ignite.internal.replicator.ItPlacementDriverReplicaSideTest.lambda$beforeTest$3(ItPlacementDriverReplicaSideTest.java:200)
>  {code}
> Flaky rate is relatively high, latest failures:
>  # 
> [02/04/24|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_IntegrationTests_ModuleReplicator/7987165?expandBuildDeploymentsSection=false=false=true=false=true=true]
>  # 
> [29/03/24|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_IntegrationTests_ModuleReplicator/7996550?expandBuildDeploymentsSection=false=false=true=false=true]
>  # 
> [26/03/24|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_IntegrationTests_ModuleReplicator/7973057?expandBuildDeploymentsSection=false=false=false=true=true=true]
> The exception may differ a bit, but semantically it's the same:
> {code:java}
> org.opentest4j.AssertionFailedError: expected: <true> but was: <false>
>   at 
> app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
>   at 
> 

[jira] [Updated] (IGNITE-21918) ItPlacementDriverReplicaSideTest#testNotificationToPlacementDriverAboutChangeLeader is flaky

2024-04-02 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-21918:
-
Description: 
{code:java}
java.lang.AssertionError: There are replicas alive [replicas=[group_1]]
  at 
org.apache.ignite.internal.replicator.ReplicaManager.stop(ReplicaManager.java:658)
  at 
org.apache.ignite.internal.replicator.ItPlacementDriverReplicaSideTest.lambda$beforeTest$3(ItPlacementDriverReplicaSideTest.java:200)
 {code}
Flaky rate is relatively high, latest failures:
 # 
[02/04/24|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_IntegrationTests_ModuleReplicator/7987165?expandBuildDeploymentsSection=false=false=true=false=true=true]
 # 
[29/03/24|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_IntegrationTests_ModuleReplicator/7996550?expandBuildDeploymentsSection=false=false=true=false=true]

  was:
{code:java}
java.lang.AssertionError: There are replicas alive [replicas=[group_1]]
  at 
org.apache.ignite.internal.replicator.ReplicaManager.stop(ReplicaManager.java:658)
  at 
org.apache.ignite.internal.replicator.ItPlacementDriverReplicaSideTest.lambda$beforeTest$3(ItPlacementDriverReplicaSideTest.java:200)
 {code}
The flaky rate is relatively high; the most recent failures:

[02/04/24|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_IntegrationTests_ModuleReplicator/7987165?expandBuildDeploymentsSection=false=false=true=false=true=true]


> ItPlacementDriverReplicaSideTest#testNotificationToPlacementDriverAboutChangeLeader
>  is flaky
> 
>
> Key: IGNITE-21918
> URL: https://issues.apache.org/jira/browse/IGNITE-21918
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexander Lapin
>Priority: Major
>  Labels: ignite-3
>
> {code:java}
> java.lang.AssertionError: There are replicas alive [replicas=[group_1]]
>   at 
> org.apache.ignite.internal.replicator.ReplicaManager.stop(ReplicaManager.java:658)
>   at 
> org.apache.ignite.internal.replicator.ItPlacementDriverReplicaSideTest.lambda$beforeTest$3(ItPlacementDriverReplicaSideTest.java:200)
>  {code}
> The flaky rate is relatively high; the most recent failures:
>  # 
> [02/04/24|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_IntegrationTests_ModuleReplicator/7987165?expandBuildDeploymentsSection=false=false=true=false=true=true]
>  # 
> [29/03/24|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_IntegrationTests_ModuleReplicator/7996550?expandBuildDeploymentsSection=false=false=true=false=true]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21918) ItPlacementDriverReplicaSideTest#testNotificationToPlacementDriverAboutChangeLeader is flaky

2024-04-02 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-21918:
-
Description: 
{code:java}
java.lang.AssertionError: There are replicas alive [replicas=[group_1]]
  at 
org.apache.ignite.internal.replicator.ReplicaManager.stop(ReplicaManager.java:658)
  at 
org.apache.ignite.internal.replicator.ItPlacementDriverReplicaSideTest.lambda$beforeTest$3(ItPlacementDriverReplicaSideTest.java:200)
 {code}
The flaky rate is relatively high; the most recent failures:
 # 
[02/04/24|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_IntegrationTests_ModuleReplicator/7987165?expandBuildDeploymentsSection=false=false=true=false=true=true]
 # 
[29/03/24|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_IntegrationTests_ModuleReplicator/7996550?expandBuildDeploymentsSection=false=false=true=false=true]
 # 
[26/03/24|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_IntegrationTests_ModuleReplicator/7973057?expandBuildDeploymentsSection=false=false=false=true=true=true]

  was:
{code:java}
java.lang.AssertionError: There are replicas alive [replicas=[group_1]]
  at 
org.apache.ignite.internal.replicator.ReplicaManager.stop(ReplicaManager.java:658)
  at 
org.apache.ignite.internal.replicator.ItPlacementDriverReplicaSideTest.lambda$beforeTest$3(ItPlacementDriverReplicaSideTest.java:200)
 {code}
The flaky rate is relatively high; the most recent failures:
 # 
[02/04/24|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_IntegrationTests_ModuleReplicator/7987165?expandBuildDeploymentsSection=false=false=true=false=true=true]
 # 
[29/03/24|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_IntegrationTests_ModuleReplicator/7996550?expandBuildDeploymentsSection=false=false=true=false=true]


> ItPlacementDriverReplicaSideTest#testNotificationToPlacementDriverAboutChangeLeader
>  is flaky
> 
>
> Key: IGNITE-21918
> URL: https://issues.apache.org/jira/browse/IGNITE-21918
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexander Lapin
>Priority: Major
>  Labels: ignite-3
>
> {code:java}
> java.lang.AssertionError: There are replicas alive [replicas=[group_1]]
>   at 
> org.apache.ignite.internal.replicator.ReplicaManager.stop(ReplicaManager.java:658)
>   at 
> org.apache.ignite.internal.replicator.ItPlacementDriverReplicaSideTest.lambda$beforeTest$3(ItPlacementDriverReplicaSideTest.java:200)
>  {code}
> The flaky rate is relatively high; the most recent failures:
>  # 
> [02/04/24|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_IntegrationTests_ModuleReplicator/7987165?expandBuildDeploymentsSection=false=false=true=false=true=true]
>  # 
> [29/03/24|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_IntegrationTests_ModuleReplicator/7996550?expandBuildDeploymentsSection=false=false=true=false=true]
>  # 
> [26/03/24|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_IntegrationTests_ModuleReplicator/7973057?expandBuildDeploymentsSection=false=false=false=true=true=true]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21918) ItPlacementDriverReplicaSideTest#testNotificationToPlacementDriverAboutChangeLeader is flaky

2024-04-02 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-21918:
-
Description: 
{code:java}
java.lang.AssertionError: There are replicas alive [replicas=[group_1]]
  at 
org.apache.ignite.internal.replicator.ReplicaManager.stop(ReplicaManager.java:658)
  at 
org.apache.ignite.internal.replicator.ItPlacementDriverReplicaSideTest.lambda$beforeTest$3(ItPlacementDriverReplicaSideTest.java:200)
 {code}
The flaky rate is relatively high; the most recent failures:

[02/04/24|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_IntegrationTests_ModuleReplicator/7987165?expandBuildDeploymentsSection=false=false=true=false=true=true]

  was:
{code:java}
java.lang.AssertionError: There are replicas alive [replicas=[group_1]]
  at 
org.apache.ignite.internal.replicator.ReplicaManager.stop(ReplicaManager.java:658)
  at 
org.apache.ignite.internal.replicator.ItPlacementDriverReplicaSideTest.lambda$beforeTest$3(ItPlacementDriverReplicaSideTest.java:200)
 {code}
The flaky rate is relatively high; the most recent failures:

[02.04.24|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_IntegrationTests_ModuleReplicator/7987165?expandBuildDeploymentsSection=false=false=true=false=true=true
]


> ItPlacementDriverReplicaSideTest#testNotificationToPlacementDriverAboutChangeLeader
>  is flaky
> 
>
> Key: IGNITE-21918
> URL: https://issues.apache.org/jira/browse/IGNITE-21918
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexander Lapin
>Priority: Major
>  Labels: ignite-3
>
> {code:java}
> java.lang.AssertionError: There are replicas alive [replicas=[group_1]]
>   at 
> org.apache.ignite.internal.replicator.ReplicaManager.stop(ReplicaManager.java:658)
>   at 
> org.apache.ignite.internal.replicator.ItPlacementDriverReplicaSideTest.lambda$beforeTest$3(ItPlacementDriverReplicaSideTest.java:200)
>  {code}
> The flaky rate is relatively high; the most recent failures:
> [02/04/24|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_IntegrationTests_ModuleReplicator/7987165?expandBuildDeploymentsSection=false=false=true=false=true=true]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21918) ItPlacementDriverReplicaSideTest#testNotificationToPlacementDriverAboutChangeLeader is flaky

2024-04-02 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-21918:
-
Description: 
{code:java}
java.lang.AssertionError: There are replicas alive [replicas=[group_1]]
  at 
org.apache.ignite.internal.replicator.ReplicaManager.stop(ReplicaManager.java:658)
  at 
org.apache.ignite.internal.replicator.ItPlacementDriverReplicaSideTest.lambda$beforeTest$3(ItPlacementDriverReplicaSideTest.java:200)
 {code}
The flaky rate is relatively high; the most recent failures:

[02.04.24|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_IntegrationTests_ModuleReplicator/7987165?expandBuildDeploymentsSection=false=false=true=false=true=true
]

  was:
{code:java}
java.lang.AssertionError: There are replicas alive [replicas=[group_1]]
  at 
org.apache.ignite.internal.replicator.ReplicaManager.stop(ReplicaManager.java:658)
  at 
org.apache.ignite.internal.replicator.ItPlacementDriverReplicaSideTest.lambda$beforeTest$3(ItPlacementDriverReplicaSideTest.java:200)
 {code}


> ItPlacementDriverReplicaSideTest#testNotificationToPlacementDriverAboutChangeLeader
>  is flaky
> 
>
> Key: IGNITE-21918
> URL: https://issues.apache.org/jira/browse/IGNITE-21918
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexander Lapin
>Priority: Major
>  Labels: ignite-3
>
> {code:java}
> java.lang.AssertionError: There are replicas alive [replicas=[group_1]]
>   at 
> org.apache.ignite.internal.replicator.ReplicaManager.stop(ReplicaManager.java:658)
>   at 
> org.apache.ignite.internal.replicator.ItPlacementDriverReplicaSideTest.lambda$beforeTest$3(ItPlacementDriverReplicaSideTest.java:200)
>  {code}
> The flaky rate is relatively high; the most recent failures:
> [02.04.24|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_IntegrationTests_ModuleReplicator/7987165?expandBuildDeploymentsSection=false=false=true=false=true=true
> ]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21918) ItPlacementDriverReplicaSideTest#testNotificationToPlacementDriverAboutChangeLeader is flaky

2024-04-02 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-21918:
-
Description: 
{code:java}
java.lang.AssertionError: There are replicas alive [replicas=[group_1]]
  at 
org.apache.ignite.internal.replicator.ReplicaManager.stop(ReplicaManager.java:658)
  at 
org.apache.ignite.internal.replicator.ItPlacementDriverReplicaSideTest.lambda$beforeTest$3(ItPlacementDriverReplicaSideTest.java:200)
 {code}

> ItPlacementDriverReplicaSideTest#testNotificationToPlacementDriverAboutChangeLeader
>  is flaky
> 
>
> Key: IGNITE-21918
> URL: https://issues.apache.org/jira/browse/IGNITE-21918
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexander Lapin
>Priority: Major
>  Labels: ignite-3
>
> {code:java}
> java.lang.AssertionError: There are replicas alive [replicas=[group_1]]
>   at 
> org.apache.ignite.internal.replicator.ReplicaManager.stop(ReplicaManager.java:658)
>   at 
> org.apache.ignite.internal.replicator.ItPlacementDriverReplicaSideTest.lambda$beforeTest$3(ItPlacementDriverReplicaSideTest.java:200)
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21918) ItPlacementDriverReplicaSideTest#testNotificationToPlacementDriverAboutChangeLeader is flaky

2024-04-02 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-21918:
-
Ignite Flags:   (was: Docs Required,Release Notes Required)

> ItPlacementDriverReplicaSideTest#testNotificationToPlacementDriverAboutChangeLeader
>  is flaky
> 
>
> Key: IGNITE-21918
> URL: https://issues.apache.org/jira/browse/IGNITE-21918
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexander Lapin
>Priority: Major
>  Labels: ignite-3
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21918) ItPlacementDriverReplicaSideTest#testNotificationToPlacementDriverAboutChangeLeader is flaky

2024-04-02 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-21918:
-
Labels: ignite-3  (was: )

> ItPlacementDriverReplicaSideTest#testNotificationToPlacementDriverAboutChangeLeader
>  is flaky
> 
>
> Key: IGNITE-21918
> URL: https://issues.apache.org/jira/browse/IGNITE-21918
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexander Lapin
>Priority: Major
>  Labels: ignite-3
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21918) ItPlacementDriverReplicaSideTest#testNotificationToPlacementDriverAboutChangeLeader is flaky

2024-04-02 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-21918:
-
Epic Link: IGNITE-21389

> ItPlacementDriverReplicaSideTest#testNotificationToPlacementDriverAboutChangeLeader
>  is flaky
> 
>
> Key: IGNITE-21918
> URL: https://issues.apache.org/jira/browse/IGNITE-21918
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexander Lapin
>Priority: Major
>  Labels: ignite-3
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (IGNITE-21918) ItPlacementDriverReplicaSideTest#testNotificationToPlacementDriverAboutChangeLeader is flaky

2024-04-02 Thread Alexander Lapin (Jira)
Alexander Lapin created IGNITE-21918:


 Summary: 
ItPlacementDriverReplicaSideTest#testNotificationToPlacementDriverAboutChangeLeader
 is flaky
 Key: IGNITE-21918
 URL: https://issues.apache.org/jira/browse/IGNITE-21918
 Project: Ignite
  Issue Type: Bug
Reporter: Alexander Lapin






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21761) Within commitPartition mark txnState with cleanup replication finished timestamp

2024-04-02 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-21761:
-
Reviewer: Denis Chudov

> Within commitPartition mark txnState with cleanup replication finished 
> timestamp
> 
>
> Key: IGNITE-21761
> URL: https://issues.apache.org/jira/browse/IGNITE-21761
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexander Lapin
>Assignee:  Kirill Sizov
>Priority: Major
>  Labels: ignite-3
> Fix For: 3.0.0-beta2
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> h3. Motivation
> Volatile txn state removal is implemented in IGNITE-21759. Worth mentioning 
> that having a volatile txnState is mainly an optimization, just because it might 
> be lost on node restart.
> This ticket is about marking the txn persistent state as ready for removal. 
> Basically, it is ready for removal when the state is either COMMITTED or 
> ABORTED and cleanup is fully replicated over the majority of all enlisted 
> partitions, including pre-cleanup within the primary replica. There are some 
> nuances related to the txn recovery case that will be covered in other Jira tickets.
> h3. Definition of Done
>  * In a durable manner (meaning retry on failure) await cleanup replication 
> over the majority of all enlisted partitions along with the corresponding pre-cleanup 
> within the primary replica.
>  * On cleanup done, compute the txnState in txnStateVolatileMap by setting 
> txnState.cleanupCompletionTimestamp to System.currentTimeMillis(). Pay 
> attention that, depending on whether the given txnState was already removed by 
> TxnResourceVacuumTask or not, this is either an adjustment of an existing entry in 
> txnStateVolatileMap with the timestamp or the writing of a brand new record. In the latter 
> case another iteration of TxnStateCleanupTask will remove the newly created 
> entry. Besides that, if the volatile state was removed, it won't be possible to 
> restore all the meta, which is fine, because all we need is txnId, txnState 
> and System.currentTimeMillis() as the cleanupReplicationFinished timestamp.
>  * There will be another Jira that will consider such timestamps and remove 
> the corresponding states from txnStatePersistentStorage. For now it's out of 
> scope.
>  * There will be another task that will redo the cleanup on node startup in 
> order to restore (or formally re-evaluate) the cleanup replication finished 
> timestamp. For now it's out of scope.
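
To make the second Definition of Done bullet above concrete, here is a minimal, 
hypothetical sketch of the marking step. TxStateMeta, txnStateVolatileMap and the 
field names are assumptions derived from the ticket text, not the actual Ignite 3 
classes:
{code:java}
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Minimal stand-ins for the ticket's transaction state meta; names are assumptions.
enum TxState { COMMITTED, ABORTED }

class TxStateMeta {
    final TxState state;
    volatile long cleanupCompletionTimestamp; // 0 means "not yet marked".

    TxStateMeta(TxState state, long cleanupCompletionTimestamp) {
        this.state = state;
        this.cleanupCompletionTimestamp = cleanupCompletionTimestamp;
    }
}

class TxnStateMarker {
    private final ConcurrentMap<UUID, TxStateMeta> txnStateVolatileMap = new ConcurrentHashMap<>();

    /** Called after cleanup is replicated over the majority of all enlisted partitions. */
    void markCleanupReplicated(UUID txId, TxState finalState) {
        long now = System.currentTimeMillis();

        txnStateVolatileMap.compute(txId, (id, meta) -> {
            if (meta == null) {
                // The volatile state was already vacuumed: write a brand new minimal
                // record; txnId, the final state and the timestamp are all we need.
                return new TxStateMeta(finalState, now);
            }

            // Adjust the existing entry with the cleanup completion timestamp.
            meta.cleanupCompletionTimestamp = now;
            return meta;
        });
    }
}
{code}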



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21910) Implement write intent resolution primary replica path

2024-04-02 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-21910:
-
Epic Link: IGNITE-21758

> Implement write intent resolution primary replica path
> --
>
> Key: IGNITE-21910
> URL: https://issues.apache.org/jira/browse/IGNITE-21910
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexander Lapin
>Priority: Major
>  Labels: ignite-3
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21910) Implement write intent resolution primary replica path

2024-04-02 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-21910:
-
Description: TBD

> Implement write intent resolution primary replica path
> --
>
> Key: IGNITE-21910
> URL: https://issues.apache.org/jira/browse/IGNITE-21910
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexander Lapin
>Priority: Major
>  Labels: ignite-3
>
> TBD



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21910) Implement write intent resolution primary replica path

2024-04-02 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-21910:
-
Labels: ignite-3  (was: )

> Implement write intent resolution primary replica path
> --
>
> Key: IGNITE-21910
> URL: https://issues.apache.org/jira/browse/IGNITE-21910
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexander Lapin
>Priority: Major
>  Labels: ignite-3
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (IGNITE-21910) Implement write intent resolution primary replica path

2024-04-02 Thread Alexander Lapin (Jira)
Alexander Lapin created IGNITE-21910:


 Summary: Implement write intent resolution primary replica path
 Key: IGNITE-21910
 URL: https://issues.apache.org/jira/browse/IGNITE-21910
 Project: Ignite
  Issue Type: Improvement
Reporter: Alexander Lapin






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21867) Add new ability to configure ReplicaService#RPC_TIMEOUT and TxMessageSender#RPC_TIMEOUT and increase the default values

2024-03-29 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-21867:
-
Reviewer:  Kirill Sizov

> Add new ability to configure ReplicaService#RPC_TIMEOUT and 
> TxMessageSender#RPC_TIMEOUT and increase the default values
> ---
>
> Key: IGNITE-21867
> URL: https://issues.apache.org/jira/browse/IGNITE-21867
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexander Lapin
>Assignee: Alexander Lapin
>Priority: Major
>  Labels: ignite-3
> Fix For: 3.0.0-beta2
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> h3. Motivation
> RPC_TIMEOUT was mistakenly treated as a network-layer timeout, but it is 
> actually a business-operation one. Within the context of replicaService, there 
> are legitimate reasons for an operation to take longer than the current 
> timeout of 3 seconds; basically, it should be possible to wait indefinitely 
> until the transaction times out. But since we don't have a transaction timeout 
> yet, and because of possible bugs, it seems reasonable to have an 
> operation-processing timeout "watchdog" with a finite but relatively large 
> configurable value.
> h3. Definition of Done
>  * Add ability to configure ReplicaService#RPC_TIMEOUT and 
> TxMessageSender#RPC_TIMEOUT and increase the default values.
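
To illustrate the watchdog idea above: a minimal sketch applying a configurable 
operation timeout to an asynchronous invocation via CompletableFuture.orTimeout. 
The class and parameter names are assumptions, not the actual ReplicaService API:
{code:java}
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

class ReplicaRequestSender {
    // Hypothetical configurable watchdog, replacing the hardcoded 3-second RPC_TIMEOUT.
    private final long rpcTimeoutMillis;

    ReplicaRequestSender(long rpcTimeoutMillis) {
        this.rpcTimeoutMillis = rpcTimeoutMillis;
    }

    <T> CompletableFuture<T> sendWithWatchdog(CompletableFuture<T> invocation) {
        // Completes exceptionally with a TimeoutException if the business operation
        // takes longer than the configured (finite but relatively large) value.
        return invocation.orTimeout(rpcTimeoutMillis, TimeUnit.MILLISECONDS);
    }
}
{code}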



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21867) Add new ability to configure ReplicaService#RPC_TIMEOUT and TxMessageSender#RPC_TIMEOUT and increase the default values

2024-03-28 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-21867:
-
Summary: Add new ability to configure ReplicaService#RPC_TIMEOUT and 
TxMessageSender#RPC_TIMEOUT and increase the default values  (was: Add an 
ability to configure ReplicaService#RPC_TIMEOUT and TxMessageSender#RPC_TIMEOUT 
and increase the default values)

> Add new ability to configure ReplicaService#RPC_TIMEOUT and 
> TxMessageSender#RPC_TIMEOUT and increase the default values
> ---
>
> Key: IGNITE-21867
> URL: https://issues.apache.org/jira/browse/IGNITE-21867
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexander Lapin
>Assignee: Alexander Lapin
>Priority: Major
>  Labels: ignite-3
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> h3. Motivation
> RPC_TIMEOUT was mistakenly treated as a network-layer timeout, but it is 
> actually a business-operation one. Within the context of replicaService, there 
> are legitimate reasons for an operation to take longer than the current 
> timeout of 3 seconds; basically, it should be possible to wait indefinitely 
> until the transaction times out. But since we don't have a transaction timeout 
> yet, and because of possible bugs, it seems reasonable to have an 
> operation-processing timeout "watchdog" with a finite but relatively large 
> configurable value.
> h3. Definition of Done
>  * Add ability to configure ReplicaService#RPC_TIMEOUT and 
> TxMessageSender#RPC_TIMEOUT and increase the default values.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21867) Add an ability to configure ReplicaService#RPC_TIMEOUT and TxMessageSender#RPC_TIMEOUT and increase the default values

2024-03-28 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-21867:
-
Description: 
h3. Motivation

RPC_TIMEOUT was mistakenly treated as a network-layer timeout, but it is actually 
a business-operation one. Within the context of replicaService, there are 
legitimate reasons for an operation to take longer than the current timeout of 
3 seconds; basically, it should be possible to wait indefinitely until the 
transaction times out. But since we don't have a transaction timeout yet, and 
because of possible bugs, it seems reasonable to have an operation-processing 
timeout "watchdog" with a finite but relatively large configurable value.
h3. Definition of Done
 * Add ability to configure ReplicaService#RPC_TIMEOUT and 
TxMessageSender#RPC_TIMEOUT and increase the default values.

  was:
h3. Motivation

RPC_TIMEOUT was mistakenly treated as a network-layer timeout, but it is actually 
a business-operation one. Within the context of replicaService, there are 
legitimate reasons for an operation to take longer than the current timeout of 
3 seconds; basically, it should be possible to wait indefinitely until the 
transaction times out. But since we don't have a transaction timeout yet, and 
because of possible bugs, it seems reasonable to have an operation-processing 
timeout "watchdog" with a finite but relatively large configurable value.
h3. Definition of Done
 * Add ability to configure ReplicaService#RPC_TIMEOUT and 
TxMessageSender#RPC_TIMEOUT and increase the default values


> Add an ability to configure ReplicaService#RPC_TIMEOUT and 
> TxMessageSender#RPC_TIMEOUT and increase the default values
> --
>
> Key: IGNITE-21867
> URL: https://issues.apache.org/jira/browse/IGNITE-21867
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexander Lapin
>Assignee: Alexander Lapin
>Priority: Major
>  Labels: ignite-3
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> h3. Motivation
> RPC_TIMEOUT was mistakenly treated as a network-layer timeout, but it is 
> actually a business-operation one. Within the context of replicaService, there 
> are legitimate reasons for an operation to take longer than the current 
> timeout of 3 seconds; basically, it should be possible to wait indefinitely 
> until the transaction times out. But since we don't have a transaction timeout 
> yet, and because of possible bugs, it seems reasonable to have an 
> operation-processing timeout "watchdog" with a finite but relatively large 
> configurable value.
> h3. Definition of Done
>  * Add ability to configure ReplicaService#RPC_TIMEOUT and 
> TxMessageSender#RPC_TIMEOUT and increase the default values.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21867) Add an ability to configure ReplicaService#RPC_TIMEOUT and TxMessageSender#RPC_TIMEOUT and increase the default values

2024-03-28 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-21867:
-
Summary: Add an ability to configure ReplicaService#RPC_TIMEOUT and 
TxMessageSender#RPC_TIMEOUT and increase the default values  (was: Add ability 
to configure ReplicaService#RPC_TIMEOUT and TxMessageSender#RPC_TIMEOUT and 
increase the default values)

> Add an ability to configure ReplicaService#RPC_TIMEOUT and 
> TxMessageSender#RPC_TIMEOUT and increase the default values
> --
>
> Key: IGNITE-21867
> URL: https://issues.apache.org/jira/browse/IGNITE-21867
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexander Lapin
>Assignee: Alexander Lapin
>Priority: Major
>  Labels: ignite-3
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> h3. Motivation
> RPC_TIMEOUT was mistakenly treated as a network-layer timeout, but it is 
> actually a business-operation one. Within the context of replicaService, there 
> are legitimate reasons for an operation to take longer than the current 
> timeout of 3 seconds; basically, it should be possible to wait indefinitely 
> until the transaction times out. But since we don't have a transaction timeout 
> yet, and because of possible bugs, it seems reasonable to have an 
> operation-processing timeout "watchdog" with a finite but relatively large 
> configurable value.
> h3. Definition of Done
>  * Add ability to configure ReplicaService#RPC_TIMEOUT and 
> TxMessageSender#RPC_TIMEOUT and increase the default values



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (IGNITE-21867) Add ability to configure ReplicaService#RPC_TIMEOUT and TxMessageSender#RPC_TIMEOUT and increase the default values

2024-03-28 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin reassigned IGNITE-21867:


Assignee: Alexander Lapin

> Add ability to configure ReplicaService#RPC_TIMEOUT and 
> TxMessageSender#RPC_TIMEOUT and increase the default values
> ---
>
> Key: IGNITE-21867
> URL: https://issues.apache.org/jira/browse/IGNITE-21867
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexander Lapin
>Assignee: Alexander Lapin
>Priority: Major
>  Labels: ignite-3
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> h3. Motivation
> RPC_TIMEOUT was mistakenly treated as a network-layer timeout, but it is 
> actually a business-operation one. Within the context of replicaService, there 
> are legitimate reasons for an operation to take longer than the current 
> timeout of 3 seconds; basically, it should be possible to wait indefinitely 
> until the transaction times out. But since we don't have a transaction timeout 
> yet, and because of possible bugs, it seems reasonable to have an 
> operation-processing timeout "watchdog" with a finite but relatively large 
> configurable value.
> h3. Definition of Done
>  * Add ability to configure ReplicaService#RPC_TIMEOUT and 
> TxMessageSender#RPC_TIMEOUT and increase the default values



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21867) Add ability to configure ReplicaService#RPC_TIMEOUT and TxMessageSender#RPC_TIMEOUT and increase the default values

2024-03-28 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-21867:
-
Description: 
h3. Motivation

RPC_TIMEOUT was mistakenly treated as a network-layer timeout, but it is actually 
a business-operation one. Within the context of replicaService, there are 
legitimate reasons for an operation to take longer than the current timeout of 
3 seconds; basically, it should be possible to wait indefinitely until the 
transaction times out. But since we don't have a transaction timeout yet, and 
because of possible bugs, it seems reasonable to have an operation-processing 
timeout "watchdog" with a finite but relatively large configurable value.
h3. Definition of Done
 * Add ability to configure ReplicaService#RPC_TIMEOUT and 
TxMessageSender#RPC_TIMEOUT and increase the default values

  was:
h3. Motivation

RPC_TIMEOUT was mistakenly treated as a network-layer timeout, but it is actually 
a business-operation one. Within the context of replicaService, there are 
legitimate reasons for an operation to take longer than the current timeout of 
3 seconds; basically, it should be possible to wait indefinitely until the 
transaction times out. But since we don't have a transaction timeout yet, and 
because of possible bugs, it seems reasonable to have an operation-processing 
timeout "watchdog" with a finite but relatively large configurable value.
h3. Definition of Done


> Add ability to configure ReplicaService#RPC_TIMEOUT and 
> TxMessageSender#RPC_TIMEOUT and increase the default values
> ---
>
> Key: IGNITE-21867
> URL: https://issues.apache.org/jira/browse/IGNITE-21867
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexander Lapin
>Priority: Major
>
> h3. Motivation
> RPC_TIMEOUT was mistakenly treated as a network-layer timeout, but it is 
> actually a business-operation one. Within the context of replicaService, there 
> are legitimate reasons for an operation to take longer than the current 
> timeout of 3 seconds; basically, it should be possible to wait indefinitely 
> until the transaction times out. But since we don't have a transaction timeout 
> yet, and because of possible bugs, it seems reasonable to have an 
> operation-processing timeout "watchdog" with a finite but relatively large 
> configurable value.
> h3. Definition of Done
>  * Add ability to configure ReplicaService#RPC_TIMEOUT and 
> TxMessageSender#RPC_TIMEOUT and increase the default values



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21867) Add ability to configure ReplicaService#RPC_TIMEOUT and TxMessageSender#RPC_TIMEOUT and increase the default values

2024-03-28 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-21867:
-
Epic Link: IGNITE-21389

> Add ability to configure ReplicaService#RPC_TIMEOUT and 
> TxMessageSender#RPC_TIMEOUT and increase the default values
> ---
>
> Key: IGNITE-21867
> URL: https://issues.apache.org/jira/browse/IGNITE-21867
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexander Lapin
>Priority: Major
>  Labels: ignite-3
>
> h3. Motivation
> RPC_TIMEOUT was mistakenly treated as a network-layer timeout, but it is 
> actually a business-operation one. Within the context of replicaService, there 
> are legitimate reasons for an operation to take longer than the current 
> timeout of 3 seconds; basically, it should be possible to wait indefinitely 
> until the transaction times out. But since we don't have a transaction timeout 
> yet, and because of possible bugs, it seems reasonable to have an 
> operation-processing timeout "watchdog" with a finite but relatively large 
> configurable value.
> h3. Definition of Done
>  * Add ability to configure ReplicaService#RPC_TIMEOUT and 
> TxMessageSender#RPC_TIMEOUT and increase the default values



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21867) Add ability to configure ReplicaService#RPC_TIMEOUT and TxMessageSender#RPC_TIMEOUT and increase the default values

2024-03-28 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-21867:
-
Description: 
h3. Motivation

RPC_TIMEOUT was mistakenly treated as a network-layer timeout, but it is actually 
a business-operation one. Within the context of replicaService, there are 
legitimate reasons for an operation to take longer than the current timeout of 
3 seconds; basically, it should be possible to wait indefinitely until the 
transaction times out. But since we don't have a transaction timeout yet, and 
because of possible bugs, it seems reasonable to have an operation-processing 
timeout "watchdog" with a finite but relatively large configurable value.
h3. Definition of Done

  was:
h3. Motivation

RPC_TIMEOUT is actually a business-operation timeout; thus it should be 
configured separately depending on the operation type.


> Add ability to configure ReplicaService#RPC_TIMEOUT and 
> TxMessageSender#RPC_TIMEOUT and increase the default values
> ---
>
> Key: IGNITE-21867
> URL: https://issues.apache.org/jira/browse/IGNITE-21867
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexander Lapin
>Priority: Major
>
> h3. Motivation
> RPC_TIMEOUT was mistakenly treated as a network-layer timeout, but it is 
> actually a business-operation one. Within the context of replicaService, there 
> are legitimate reasons for an operation to take longer than the current 
> timeout of 3 seconds; basically, it should be possible to wait indefinitely 
> until the transaction times out. But since we don't have a transaction timeout 
> yet, and because of possible bugs, it seems reasonable to have an 
> operation-processing timeout "watchdog" with a finite but relatively large 
> configurable value.
> h3. Definition of Done



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21867) Add ability to configure ReplicaService#RPC_TIMEOUT and TxMessageSender#RPC_TIMEOUT and increase the default values

2024-03-28 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-21867:
-
Labels: ignite-3  (was: )

> Add ability to configure ReplicaService#RPC_TIMEOUT and 
> TxMessageSender#RPC_TIMEOUT and increase the default values
> ---
>
> Key: IGNITE-21867
> URL: https://issues.apache.org/jira/browse/IGNITE-21867
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexander Lapin
>Priority: Major
>  Labels: ignite-3
>
> h3. Motivation
> RPC_TIMEOUT was mistakenly treated as a network-layer timeout, but it is 
> actually a business-operation one. Within the context of replicaService, there 
> are legitimate reasons for an operation to take longer than the current 
> timeout of 3 seconds; basically, it should be possible to wait indefinitely 
> until the transaction times out. But since we don't have a transaction timeout 
> yet, and because of possible bugs, it seems reasonable to have an 
> operation-processing timeout "watchdog" with a finite but relatively large 
> configurable value.
> h3. Definition of Done
>  * Add ability to configure ReplicaService#RPC_TIMEOUT and 
> TxMessageSender#RPC_TIMEOUT and increase the default values



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21867) Add ability to configure ReplicaService#RPC_TIMEOUT and TxMessageSender#RPC_TIMEOUT and increase the default values

2024-03-28 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-21867:
-
Description: 
h3. Motivation

RPC_TIMEOUT is actually a business-operation timeout; thus it should be 
configured separately depending on the operation type.

> Add ability to configure ReplicaService#RPC_TIMEOUT and 
> TxMessageSender#RPC_TIMEOUT and increase the default values
> ---
>
> Key: IGNITE-21867
> URL: https://issues.apache.org/jira/browse/IGNITE-21867
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexander Lapin
>Priority: Major
>
> h3. Motivation
> RPC_TIMEOUT is actually a business-operation timeout; thus it should be 
> configured separately depending on the operation type.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (IGNITE-21867) Add ability to configure ReplicaService#RPC_TIMEOUT and TxMessageSender#RPC_TIMEOUT and increase the default values

2024-03-28 Thread Alexander Lapin (Jira)
Alexander Lapin created IGNITE-21867:


 Summary: Add ability to configure ReplicaService#RPC_TIMEOUT and 
TxMessageSender#RPC_TIMEOUT and increase the default values
 Key: IGNITE-21867
 URL: https://issues.apache.org/jira/browse/IGNITE-21867
 Project: Ignite
  Issue Type: Improvement
Reporter: Alexander Lapin






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (IGNITE-21855) Fix ItIndexAndIndexStorageDestructionTest

2024-03-27 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin resolved IGNITE-21855.
--
Resolution: Fixed

> Fix ItIndexAndIndexStorageDestructionTest
> -
>
> Key: IGNITE-21855
> URL: https://issues.apache.org/jira/browse/IGNITE-21855
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Roman Puchkovskiy
>Assignee: Roman Puchkovskiy
>Priority: Major
>  Labels: ignite-3
> Fix For: 3.0.0-beta2
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21566) Multiple tests fail with Failed to get the primary replica caused by PrimaryReplicaAwaitTimeoutException because of unaccepted lease

2024-03-27 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-21566:
-
Reviewer: Vladislav Pyatkov

> Multiple tests fail with Failed to get the primary replica caused by 
> PrimaryReplicaAwaitTimeoutException because of unaccepted lease
> 
>
> Key: IGNITE-21566
> URL: https://issues.apache.org/jira/browse/IGNITE-21566
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexander Lapin
>Assignee: Alexander Lapin
>Priority: Blocker
>  Labels: ignite-3
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Multiple tests fail with either `Failed to get the primary replica` or 
> because the corresponding lease is not accepted.
> {code:java}
> Caused by: 
> org.apache.ignite.internal.placementdriver.PrimaryReplicaAwaitTimeoutException:
>  IGN-PLACEMENTDRIVER-1 TraceId:11dd6a63-2686-4dfe-b1d4-ea85f3859816 The 
> primary replica await timed out [replicationGroupId=7_part_0, 
> referenceTimestamp=HybridTimestamp [physical=2024-02-19 11:58:08:158 +, 
> logical=3, composite=111958025054322691], currentLease=Lease 
> [leaseholder=itrst_sirot_0, 
> leaseholderId=f43432e9-f374-41e1-b8b6-9813bb59b8a1, accepted=false, 
> startTime=HybridTimestamp [physical=2024-02-19 11:58:07:936 +, logical=2, 
> composite=111958025039773698], expirationTime=HybridTimestamp 
> [physical=2024-02-19 12:00:07:936 +, logical=0, 
> composite=111958032904093696], prolongable=false, 
> replicationGroupId=7_part_0]]{code}
> [ItTableRaftSnapshotsTest#snapshotInstallationRepeatsOnTimeout|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_RunAllTests/7866010?expandBuildDeploymentsSection=false=false=false+Inspection=true=true=true=true]
> h3. Upd#1
> While starting replicas we have the following code:
> {code:java}
> if (localMemberAssignment == null || !startedRaftNode || 
> replicaMgr.isReplicaStarted(replicaGrpId)) {
> return;
> }
> try {
> startReplicaWithNewListener(
> ... {code}
> which won't actually start a replica if 
> {{replicaMgr.isReplicaStarted(replicaGrpId)}} returns true; that method 
> basically checks
> {code:java}
> public boolean isReplicaStarted(ReplicationGroupId replicaGrpId) {
> return replicas.containsKey(replicaGrpId);
> } {code}
> where {{replicas}} is
> {code:java}
> private final ConcurrentHashMap<ReplicationGroupId, CompletableFuture<Replica>> replicas = new ConcurrentHashMap<>(); {code}
> On the other hand, the common replica message processing pattern assumes that 
> a replica may not be ready yet; in that case it populates {{replicas}} with a 
> future and awaits it. E.g. for leaseGrantMessage
>  
> {code:java}
> CompletableFuture<Replica> replicaFut = 
> replicas.computeIfAbsent(msg.groupId(), k -> new CompletableFuture<>());
> replicaFut.thenCompose(replica -> replica.processPlacementDriverMessage(msg)) 
> {code}
> All in all, that means that if a leaseGrantMessage is a bit faster than the 
> replica start (which is fine), it will populate {{replicas}} with a future, 
> and the replica start flow won't actually start the replica or complete the 
> given future, but will instead incorrectly assume that the replica is already 
> started.
>  
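
One conceivable way to close the race described above (a hedged sketch under 
assumptions, not the actual fix): the start path reuses and completes the 
pre-created future instead of treating its mere presence as "already started". 
String and Object stand in for ReplicationGroupId and Replica; concurrent-start 
coordination is omitted for brevity:
{code:java}
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

class ReplicaStarter {
    // Same shape as the replicas map in ReplicaManager discussed above.
    private final ConcurrentHashMap<String, CompletableFuture<Object>> replicas = new ConcurrentHashMap<>();

    void startReplica(String replicaGrpId) {
        CompletableFuture<Object> replicaFut =
                replicas.computeIfAbsent(replicaGrpId, id -> new CompletableFuture<>());

        if (replicaFut.isDone()) {
            return; // The replica was genuinely started before.
        }

        // The future may have been pre-created by an early message (e.g. a lease
        // grant); starting the replica and completing the future unblocks waiters.
        Object replica = startReplicaInternal(replicaGrpId);
        replicaFut.complete(replica);
    }

    private Object startReplicaInternal(String replicaGrpId) {
        return new Object(); // Placeholder for the real start logic.
    }
}
{code}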



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21566) Multiple tests fail with Failed to get the primary replica caused by PrimaryReplicaAwaitTimeoutException because of unaccepted lease

2024-03-26 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-21566:
-
Description: 
Multiple tests fail with either `Failed to get the primary replica` or 
because the corresponding lease is not accepted.
{code:java}
Caused by: 
org.apache.ignite.internal.placementdriver.PrimaryReplicaAwaitTimeoutException: 
IGN-PLACEMENTDRIVER-1 TraceId:11dd6a63-2686-4dfe-b1d4-ea85f3859816 The primary 
replica await timed out [replicationGroupId=7_part_0, 
referenceTimestamp=HybridTimestamp [physical=2024-02-19 11:58:08:158 +, 
logical=3, composite=111958025054322691], currentLease=Lease 
[leaseholder=itrst_sirot_0, leaseholderId=f43432e9-f374-41e1-b8b6-9813bb59b8a1, 
accepted=false, startTime=HybridTimestamp [physical=2024-02-19 11:58:07:936 
+, logical=2, composite=111958025039773698], expirationTime=HybridTimestamp 
[physical=2024-02-19 12:00:07:936 +, logical=0, 
composite=111958032904093696], prolongable=false, 
replicationGroupId=7_part_0]]{code}
[ItTableRaftSnapshotsTest#snapshotInstallationRepeatsOnTimeout|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_RunAllTests/7866010?expandBuildDeploymentsSection=false=false=false+Inspection=true=true=true=true]

 
h3. Upd#1
While starting replicas we have the following code:
{code:java}
if (localMemberAssignment == null || !startedRaftNode || 
replicaMgr.isReplicaStarted(replicaGrpId)) {
return;
}

try {
startReplicaWithNewListener(
... {code}
which won't actually start a replica if 
{{replicaMgr.isReplicaStarted(replicaGrpId)}} returns true; that method 
basically checks
{code:java}
public boolean isReplicaStarted(ReplicationGroupId replicaGrpId) {
return replicas.containsKey(replicaGrpId);
} {code}
where {{replicas}} is
{code:java}
private final ConcurrentHashMap<ReplicationGroupId, CompletableFuture<Replica>> 
replicas = new ConcurrentHashMap<>(); {code}
On the other hand, the common replica message processing pattern assumes that a 
replica may not be ready yet; in that case it populates {{replicas}} with a 
future and awaits it. E.g. for leaseGrantMessage
 
{code:java}
CompletableFuture<Replica> replicaFut = replicas.computeIfAbsent(msg.groupId(), 
k -> new CompletableFuture<>());

replicaFut.thenCompose(replica -> replica.processPlacementDriverMessage(msg)) 
{code}
All in all, that means that if a leaseGrantMessage is a bit faster than the 
replica start (which is fine), it will populate {{replicas}} with a future, and 
the replica start flow won't actually start the replica or complete the given 
future, but will instead incorrectly assume that the replica is already started.
 

  was:
Multiple tests fail with either `Failed to get the primary replica` or 
because the corresponding lease is not accepted.
{code:java}
Caused by: 
org.apache.ignite.internal.placementdriver.PrimaryReplicaAwaitTimeoutException: 
IGN-PLACEMENTDRIVER-1 TraceId:11dd6a63-2686-4dfe-b1d4-ea85f3859816 The primary 
replica await timed out [replicationGroupId=7_part_0, 
referenceTimestamp=HybridTimestamp [physical=2024-02-19 11:58:08:158 +, 
logical=3, composite=111958025054322691], currentLease=Lease 
[leaseholder=itrst_sirot_0, leaseholderId=f43432e9-f374-41e1-b8b6-9813bb59b8a1, 
accepted=false, startTime=HybridTimestamp [physical=2024-02-19 11:58:07:936 
+, logical=2, composite=111958025039773698], expirationTime=HybridTimestamp 
[physical=2024-02-19 12:00:07:936 +, logical=0, 
composite=111958032904093696], prolongable=false, 
replicationGroupId=7_part_0]]{code}
[ItTableRaftSnapshotsTest#snapshotInstallationRepeatsOnTimeout|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_RunAllTests/7866010?expandBuildDeploymentsSection=false=false=false+Inspection=true=true=true=true]


> Multiple tests fail with Failed to get the primary replica caused by 
> PrimaryReplicaAwaitTimeoutException because of unaccepted lease
> 
>
> Key: IGNITE-21566
> URL: https://issues.apache.org/jira/browse/IGNITE-21566
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexander Lapin
>Assignee: Alexander Lapin
>Priority: Blocker
>  Labels: ignite-3
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Multiple tests fail with either `Failed to get the primary replica` or 
> because the corresponding lease is not accepted.
> {code:java}
> Caused by: 
> org.apache.ignite.internal.placementdriver.PrimaryReplicaAwaitTimeoutException:
>  IGN-PLACEMENTDRIVER-1 TraceId:11dd6a63-2686-4dfe-b1d4-ea85f3859816 The 
> primary replica await timed out [replicationGroupId=7_part_0, 
> referenceTimestamp=HybridTimestamp [physical=2024-02-19 11:58:08:158 +, 
> logical=3, composite=111958025054322691], currentLease=Lease 
> [leaseholder=itrst_sirot_0, 
> leaseholderId=f43432e9-f374-41e1-b8b6-9813bb59b8a1, 

[jira] [Updated] (IGNITE-21566) Multiple tests fail with Failed to get the primary replica caused by PrimaryReplicaAwaitTimeoutException because of unaccepted lease

2024-03-26 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-21566:
-
Description: 
Multiple tests fail with either `Failed to get the primary replica` or 
because the corresponding lease is not accepted.
{code:java}
Caused by: 
org.apache.ignite.internal.placementdriver.PrimaryReplicaAwaitTimeoutException: 
IGN-PLACEMENTDRIVER-1 TraceId:11dd6a63-2686-4dfe-b1d4-ea85f3859816 The primary 
replica await timed out [replicationGroupId=7_part_0, 
referenceTimestamp=HybridTimestamp [physical=2024-02-19 11:58:08:158 +, 
logical=3, composite=111958025054322691], currentLease=Lease 
[leaseholder=itrst_sirot_0, leaseholderId=f43432e9-f374-41e1-b8b6-9813bb59b8a1, 
accepted=false, startTime=HybridTimestamp [physical=2024-02-19 11:58:07:936 
+, logical=2, composite=111958025039773698], expirationTime=HybridTimestamp 
[physical=2024-02-19 12:00:07:936 +, logical=0, 
composite=111958032904093696], prolongable=false, 
replicationGroupId=7_part_0]]{code}
[ItTableRaftSnapshotsTest#snapshotInstallationRepeatsOnTimeout|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_RunAllTests/7866010?expandBuildDeploymentsSection=false=false=false+Inspection=true=true=true=true]
h3. Upd#1

While starting replicas we have the following code:
{code:java}
if (localMemberAssignment == null || !startedRaftNode || 
replicaMgr.isReplicaStarted(replicaGrpId)) {
return;
}

try {
startReplicaWithNewListener(
... {code}
which won't actually start a replica if 
{{replicaMgr.isReplicaStarted(replicaGrpId)}} returns true; that method 
basically checks
{code:java}
public boolean isReplicaStarted(ReplicationGroupId replicaGrpId) {
return replicas.containsKey(replicaGrpId);
} {code}
where {{replicas}} is
{code:java}
private final ConcurrentHashMap<ReplicationGroupId, CompletableFuture<Replica>> 
replicas = new ConcurrentHashMap<>(); {code}
On the other hand, the common replica message processing pattern assumes that a 
replica may not be ready yet; in that case it populates {{replicas}} with a 
future and awaits it. E.g. for leaseGrantMessage
 
{code:java}
CompletableFuture<Replica> replicaFut = replicas.computeIfAbsent(msg.groupId(), 
k -> new CompletableFuture<>());

replicaFut.thenCompose(replica -> replica.processPlacementDriverMessage(msg)) 
{code}
All in all, that means that if a leaseGrantMessage is a bit faster than the 
replica start (which is fine), it will populate {{replicas}} with a future, and 
the replica start flow won't actually start the replica or complete the given 
future, but will instead incorrectly assume that the replica is already started.
 

  was:
Multiple tests fail with either `Failed to get the primary replica` or 
because the corresponding lease is not accepted.
{code:java}
Caused by: 
org.apache.ignite.internal.placementdriver.PrimaryReplicaAwaitTimeoutException: 
IGN-PLACEMENTDRIVER-1 TraceId:11dd6a63-2686-4dfe-b1d4-ea85f3859816 The primary 
replica await timed out [replicationGroupId=7_part_0, 
referenceTimestamp=HybridTimestamp [physical=2024-02-19 11:58:08:158 +, 
logical=3, composite=111958025054322691], currentLease=Lease 
[leaseholder=itrst_sirot_0, leaseholderId=f43432e9-f374-41e1-b8b6-9813bb59b8a1, 
accepted=false, startTime=HybridTimestamp [physical=2024-02-19 11:58:07:936 
+, logical=2, composite=111958025039773698], expirationTime=HybridTimestamp 
[physical=2024-02-19 12:00:07:936 +, logical=0, 
composite=111958032904093696], prolongable=false, 
replicationGroupId=7_part_0]]{code}
[ItTableRaftSnapshotsTest#snapshotInstallationRepeatsOnTimeout|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_RunAllTests/7866010?expandBuildDeploymentsSection=false=false=false+Inspection=true=true=true=true]

 
h3. Upd#1
While starting replicas we have the following code:
{code:java}
if (localMemberAssignment == null || !startedRaftNode || 
replicaMgr.isReplicaStarted(replicaGrpId)) {
return;
}

try {
startReplicaWithNewListener(
... {code}
which won't actually start a replica if 
{{replicaMgr.isReplicaStarted(replicaGrpId)}} returns true; that method 
basically checks
{code:java}
public boolean isReplicaStarted(ReplicationGroupId replicaGrpId) {
return replicas.containsKey(replicaGrpId);
} {code}
where {{replicas}} is
{code:java}
private final ConcurrentHashMap<ReplicationGroupId, CompletableFuture<Replica>> 
replicas = new ConcurrentHashMap<>(); {code}
On the other hand, the common replica message processing pattern assumes that a 
replica may not be ready yet; in that case it populates {{replicas}} with a 
future and awaits it. E.g. for leaseGrantMessage
 
{code:java}
CompletableFuture<Replica> replicaFut = replicas.computeIfAbsent(msg.groupId(), 
k -> new CompletableFuture<>());

replicaFut.thenCompose(replica -> replica.processPlacementDriverMessage(msg)) 
{code}
All in all, that means that if a leaseGrantMessage is a bit faster than the 
replica start (which is fine), it will populate {{replicas}} with a future, and 
the replica start flow won't actually start the replica or complete the given 
future, but will instead incorrectly assume that the replica is already started.

[jira] [Updated] (IGNITE-21566) Multiple tests fail with Failed to get the primary replica caused by PrimaryReplicaAwaitTimeoutException because of unaccepted lease

2024-03-26 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-21566:
-
Summary: Multiple tests fail with Failed to get the primary replica caused 
by PrimaryReplicaAwaitTimeoutException because of unaccepted lease  (was: 
Multiple tests fail with PrimaryReplicaAwaitTimeoutException because of 
unaccepted lease)

> Multiple tests fail with Failed to get the primary replica caused by 
> PrimaryReplicaAwaitTimeoutException because of unaccepted lease
> 
>
> Key: IGNITE-21566
> URL: https://issues.apache.org/jira/browse/IGNITE-21566
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexander Lapin
>Assignee: Alexander Lapin
>Priority: Blocker
>  Labels: ignite-3
>
> Multiple tests fail with either `Failed to get the primary replica` or 
> because the corresponding lease is not accepted.
> {code:java}
> Caused by: 
> org.apache.ignite.internal.placementdriver.PrimaryReplicaAwaitTimeoutException:
>  IGN-PLACEMENTDRIVER-1 TraceId:11dd6a63-2686-4dfe-b1d4-ea85f3859816 The 
> primary replica await timed out [replicationGroupId=7_part_0, 
> referenceTimestamp=HybridTimestamp [physical=2024-02-19 11:58:08:158 +, 
> logical=3, composite=111958025054322691], currentLease=Lease 
> [leaseholder=itrst_sirot_0, 
> leaseholderId=f43432e9-f374-41e1-b8b6-9813bb59b8a1, accepted=false, 
> startTime=HybridTimestamp [physical=2024-02-19 11:58:07:936 +, logical=2, 
> composite=111958025039773698], expirationTime=HybridTimestamp 
> [physical=2024-02-19 12:00:07:936 +, logical=0, 
> composite=111958032904093696], prolongable=false, 
> replicationGroupId=7_part_0]]{code}
> [ItTableRaftSnapshotsTest#snapshotInstallationRepeatsOnTimeout|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_RunAllTests/7866010?expandBuildDeploymentsSection=false=false=false+Inspection=true=true=true=true]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21759) Prepare general txn vacuum logic by introducing TxnResourceVacuumTask

2024-03-22 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-21759:
-
Reviewer: Denis Chudov

> Prepare general txn vacuum logic by introducing TxnResourceVacuumTask
> -
>
> Key: IGNITE-21759
> URL: https://issues.apache.org/jira/browse/IGNITE-21759
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexander Lapin
>Assignee: Alexander Lapin
>Priority: Major
>  Labels: ignite-3
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> h3. Motivation
> Within one of the txn cursors cleanup sub-tickets, a cursor cleanup scheduler 
> was introduced. Now it's time to generalize it to handle the txn resource 
> vacuum task. Thus it's expected that TxnResourceVacuumTask will be introduced 
> with the following logic inside (this logic will be adjusted later on in other 
> Jira tickets):
>  # TxnResourceVacuumTask evaluates vacuumObservationTimestamp as 
> System.currentTimeMillis(). There's no point in using the hybrid clock here, 
> as that would only increase the contention level on the clock.
>  # TxnResourceVacuumTask scans txnStateVolatileMap and for each finished 
> (COMMITTED or ABORTED) transaction it
>  ## Removes it from txnStateVolatileMap if txnResourcesTTL == 0 (more details 
> a bit later) or if txnState.initialVacuumObservationTimestamp + 
> txnResourcesTTL < vacuumObservationTimestamp.
>  ## Updates txnState.initialVacuumObservationTimestamp by setting it to 
> vacuumObservationTimestamp if it's not already initialized.
> txnResourcesTTL is a new cluster configuration property that defines the 
> *minimum* lifetime of a transaction state in milliseconds. The real 
> txnResourcesTTL lifetime will also depend on cleanupTimer intervals, which is, 
> of course, not suitable for test purposes; thus we should introduce a manual 
> ability to run TxnResourceVacuumTask immediately. At least we will use it 
> within tests.
> h3. Definition of Done
>  * New configuration property txnResourceTTL is introduced with a default of 
> 30_000 milliseconds.
>  * TxnResourceVacuumTask is introduced with the aforementioned logic. The 
> given task should be thread safe.
>  * Cleanup scheduler is generalized and adjusted to run TxnResourceVacuumTask 
> on every iteration. 
>  * Cleanup scheduler along with subclasses and helpers is renamed to 
> *Vacuum*:
>  ** Cleanup is part of tx finish flow.
>  ** Vacuum is for removing obsolete resources. Let's be consistent with 
> naming.
>  * Special trigger for TxnResourceVacuumTask introduced within TxManager, at 
> least for testing purposes.
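
A compact, hypothetical sketch of the scan logic described in the Motivation 
above; the map and meta shapes are assumptions taken from the ticket, not the 
actual classes:
{code:java}
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class TxnResourceVacuumTask implements Runnable {
    enum State { PENDING, COMMITTED, ABORTED }

    static class VolatileTxState {
        volatile State state = State.PENDING;
        volatile Long initialVacuumObservationTimestamp; // null until first observed.
    }

    private final ConcurrentMap<UUID, VolatileTxState> txnStateVolatileMap = new ConcurrentHashMap<>();
    private final long txnResourcesTtlMillis;

    TxnResourceVacuumTask(long txnResourcesTtlMillis) {
        this.txnResourcesTtlMillis = txnResourcesTtlMillis;
    }

    @Override
    public void run() {
        // Plain wall-clock time: no need to touch the hybrid clock here.
        long vacuumObservationTimestamp = System.currentTimeMillis();

        txnStateVolatileMap.forEach((txId, txState) -> {
            if (txState.state == State.PENDING) {
                return; // Only finished (COMMITTED or ABORTED) transactions are vacuumed.
            }

            Long initial = txState.initialVacuumObservationTimestamp;

            if (txnResourcesTtlMillis == 0
                    || (initial != null && initial + txnResourcesTtlMillis < vacuumObservationTimestamp)) {
                txnStateVolatileMap.remove(txId, txState);
            } else if (initial == null) {
                txState.initialVacuumObservationTimestamp = vacuumObservationTimestamp;
            }
        });
    }
}
{code}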



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (IGNITE-21829) Trigger failureProcessor in case of exceptions in ResourceVacuumManager

2024-03-22 Thread Alexander Lapin (Jira)
Alexander Lapin created IGNITE-21829:


 Summary: Trigger failureProcessor in case of exceptions in 
ResourceVacuumManager
 Key: IGNITE-21829
 URL: https://issues.apache.org/jira/browse/IGNITE-21829
 Project: Ignite
  Issue Type: Improvement
Reporter: Alexander Lapin






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21540) Handle lock exception for transaction operations

2024-03-22 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-21540:
-
Reviewer: Denis Chudov

> Handle lock exception for transaction operations
> 
>
> Key: IGNITE-21540
> URL: https://issues.apache.org/jira/browse/IGNITE-21540
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Vladislav Pyatkov
>Assignee: Vladislav Pyatkov
>Priority: Major
>  Labels: ignite-3
> Fix For: 3.0.0-beta2
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> h3. Motivation
> Deadlock prevention can throw a lock exception, but whether it does depends on 
> the situation. After several retries, the operation may succeed because the 
> conflicting transaction has already released its locks.
> h3. Implementation notes
> We need to consider all kinds of implicit operations that lead to the 
> creation of RW transactions.
> h3. Definition of done
> Implicit operations never throw the lock exception.
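> A hedged sketch of the implied retry wrapper for implicit operations; the 
> method name, LockException, and the retry bound are illustrative assumptions, 
> not the actual Ignite 3 API:
> {code:java}
> // Illustrative only: retry an implicit RW operation on lock conflicts.
> <T> T runWithLockRetries(Supplier<T> operation, int maxRetries) {
>     for (int attempt = 0; ; attempt++) {
>         try {
>             return operation.get();
>         } catch (LockException e) {
>             if (attempt >= maxRetries) {
>                 throw e; // The conflicting tx still holds its locks; give up.
>             }
>             // Retry: the conflicting transaction may have released its locks.
>         }
>     }
> }
> {code}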



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-18617) Clear rw tx context and cleanup ready futures on tx finish

2024-03-21 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-18617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-18617:
-
Reviewer: Alexander Lapin

> Clear rw tx context and cleanup ready futures on tx finish
> --
>
> Key: IGNITE-18617
> URL: https://issues.apache.org/jira/browse/IGNITE-18617
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Denis Chudov
>Assignee:  Kirill Sizov
>Priority: Major
>  Labels: ignite-3
> Fix For: 3.0.0-beta2
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> The map PartitionReplicaListener#txCleanupReadyFutures is not completely 
> cleared on tx finish. There should be no mapping once the TX is closed.
> Also, the coordinator should clear the TX context (inflights, enlistments) 
> when the corresponding transaction is finished.
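> A minimal sketch of the expected cleanup, assuming illustrative field names on 
> PartitionReplicaListener and the coordinator's tx context map:
> {code:java}
> // Illustrative only: drop per-tx bookkeeping once the transaction is finished.
> void onTxFinish(UUID txId) {
>     txCleanupReadyFutures.remove(txId); // No mapping may survive a closed tx.
>     txContext.remove(txId);             // Coordinator: clear inflights/enlistments.
> }
> {code}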



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (IGNITE-21759) Prepare general txn vacuum logic by introducing TxnResourceVacuumTask

2024-03-20 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin reassigned IGNITE-21759:


Assignee: Alexander Lapin  (was: Denis Chudov)

> Prepare general txn vacuum logic by introducing TxnResourceVacuumTask
> -
>
> Key: IGNITE-21759
> URL: https://issues.apache.org/jira/browse/IGNITE-21759
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexander Lapin
>Assignee: Alexander Lapin
>Priority: Major
>  Labels: ignite-3
>
> h3. Motivation
> Within one of the txn cursors cleanup sub-tickets, a cursor cleanup scheduler 
> was introduced. Now it's time to generalize it to handle the txn resource 
> vacuum task. Thus it's expected that TxnResourceVacuumTask will be introduced 
> with the following logic inside (this logic will be adjusted later on in other 
> jiras):
>  # TxnResourceVacuumTask evaluates vacuumObservationTimestamp as 
> System.currentTimeMillis(). There's no sense in using the hybrid clock, in 
> order not to increase the contention level on the clock.
>  # TxnResourceVacuumTask scans txnStateVolatileMap and for each finished 
> (COMMITTED or ABORTED) transaction it
>  ## Removes it from txnStateVolatileMap if txnResourcesTTL == 0 (more details 
> a bit later) or if txnState.initialVacuumObservationTimestamp + 
> txnResourcesTTL < vacuumObservationTimestamp.
>  ## Updates txnState.initialVacuumObservationTimestamp by setting it to 
> vacuumObservationTimestamp if it's not already initialized.
> txnResourcesTTL is a new cluster configuration property that defines the 
> *minimum* lifetime of a transaction state in milliseconds. The real 
> txnResourcesTTL lifetime will also depend on the cleanupTimer intervals, which 
> is of course not suitable for test purposes; thus we should introduce a manual 
> ability to run TxnResourceVacuumTask immediately. At least we will use it 
> within tests.
> h3. Definition of Done
>  * A new configuration property, txnResourceTTL, is introduced with a default 
> of 30_000 milliseconds.
>  * TxnResourceVacuumTask is introduced with the aforementioned logic. The given 
> task should be thread safe.
>  * The cleanup scheduler is generalized and adjusted to run 
> TxnResourceVacuumTask on every iteration.
>  * The cleanup scheduler along with its subclasses and helpers is renamed to 
> *Vacuum*:
>  ** Cleanup is part of the tx finish flow.
>  ** Vacuum is for removing obsolete resources. Let's be consistent with the 
> naming.
>  * A special trigger for TxnResourceVacuumTask is introduced within TxManager, 
> at least for testing purposes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21618) In-flights for read-only transactions

2024-03-19 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-21618:
-
Reviewer:  Kirill Sizov

> In-flights for read-only transactions
> -
>
> Key: IGNITE-21618
> URL: https://issues.apache.org/jira/browse/IGNITE-21618
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Denis Chudov
>Assignee: Denis Chudov
>Priority: Major
>  Labels: ignite-3
> Fix For: 3.0.0-beta2
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> *Motivation*
> We need to make a solid mechanism of closing read-only transactions' resources 
> (scan cursors, etc.) on remote servers after tx finish. Resources are supposed 
> to be closed by requests from the coordinator sent from a separate cleanup 
> thread after the tx is finished, to maximise the performance of the tx finish 
> itself and because these requests are needed only for resource cleanup. But we 
> need to prevent a race such as:
>  * a tx request that is supposed to create a scan cursor on a remote server is 
> sent
>  * the tx is finished
>  * the cleanup thread sends a cleanup request
>  * the cleanup request reaches the remote server
>  * the tx request reaches the remote server and opens a cursor that will never 
> be closed.
> We need to ensure that the cleanup request is not sent until the coordinator 
> receives responses for all requests sent before tx finish, and that no 
> requests are allowed after tx finish. Something similar to the RW inflight 
> requests counter is to be done for RO.
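> A hedged sketch of such an RO inflights gate (class and method names are 
> assumptions, not the actual implementation): requests register before being 
> sent and deregister on response; finish() blocks new requests, and the cleanup 
> request is sent only once the counter drains to zero.
> {code:java}
> // Illustrative only: not the actual Ignite 3 implementation.
> class RoInflights {
>     private final AtomicInteger inflights = new AtomicInteger();
>     private final CompletableFuture<Void> drained = new CompletableFuture<>();
>     private volatile boolean finished;
> 
>     boolean tryRegister() {
>         if (finished) {
>             return false; // No requests are allowed after tx finish.
>         }
>         inflights.incrementAndGet();
>         if (finished) { // Re-check to close the race with finish().
>             deregister();
>             return false;
>         }
>         return true;
>     }
> 
>     void deregister() {
>         if (inflights.decrementAndGet() == 0 && finished) {
>             drained.complete(null);
>         }
>     }
> 
>     CompletableFuture<Void> finish() {
>         finished = true;
>         if (inflights.get() == 0) {
>             drained.complete(null);
>         }
>         return drained; // Send the cleanup request when this completes.
>     }
> }
> {code}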
> *Definition of done*
> The cleanup request from the cleanup thread is not sent until the coordinator 
> receives responses for all requests sent before tx finish, and no requests are 
> allowed after tx finish.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (IGNITE-21795) Unconditionally update storage with proper raft index within PartitionListener

2024-03-19 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin reassigned IGNITE-21795:


Assignee: Alexander Lapin

> Unconditionally update storage with proper raft index within PartitionListener
> --
>
> Key: IGNITE-21795
> URL: https://issues.apache.org/jira/browse/IGNITE-21795
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexander Lapin
>Assignee: Alexander Lapin
>Priority: Major
> Fix For: 3.0.0-beta2
>
>
> Despite the fact that it's reasonable to skip the data update if it was already 
> executed on the collocated primary replica storage, it's required to 
> unconditionally update the storage with the proper raft index within 
> PartitionListener. For more details please see:
> {code:java}
> synchronized (safeTime) {
> if (cmd.safeTime().compareTo(safeTime.current()) > 0) {
> storageUpdateHandler.handleUpdate( {code}
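> A hedged sketch of the intended shape of the fix (storage API names such as 
> lastApplied, and the commandIndex/commandTerm variables, are assumptions): the 
> data update may still be skipped, but the applied raft index/term must advance 
> unconditionally:
> {code:java}
> // Illustrative only: sketch of the intended ordering.
> synchronized (safeTime) {
>     if (cmd.safeTime().compareTo(safeTime.current()) > 0) {
>         // The data update may be skipped when it was already executed
>         // on the collocated primary replica storage.
>         storageUpdateHandler.handleUpdate(/* ... */);
>     }
>     // Record the raft index/term even when the update above was skipped,
>     // so the storage's applied index never lags behind the log.
>     storage.lastApplied(commandIndex, commandTerm);
> }
> {code}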



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21795) Unconditionally update storage with proper raft index within PartitionListener

2024-03-19 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-21795:
-
Fix Version/s: 3.0.0-beta2

> Unconditionally update storage with proper raft index within PartitionListener
> --
>
> Key: IGNITE-21795
> URL: https://issues.apache.org/jira/browse/IGNITE-21795
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexander Lapin
>Priority: Major
> Fix For: 3.0.0-beta2
>
>
> Despite the fact that it's reasonable to skip the data update if it was already 
> executed on the collocated primary replica storage, it's required to 
> unconditionally update the storage with the proper raft index within 
> PartitionListener. For more details please see:
> {code:java}
> synchronized (safeTime) {
> if (cmd.safeTime().compareTo(safeTime.current()) > 0) {
> storageUpdateHandler.handleUpdate( {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21795) Unconditionally update storage with proper raft index within PartitionListener

2024-03-19 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-21795:
-
Description: 
Despite the fact that it's reasonable to skip the data update if it was already 
executed on the collocated primary replica storage, it's required to 
unconditionally update the storage with the proper raft index within 
PartitionListener. For more details please see:
{code:java}
synchronized (safeTime) {
if (cmd.safeTime().compareTo(safeTime.current()) > 0) {
storageUpdateHandler.handleUpdate( {code}

> Unconditionally update storage with proper raft index within PartitionListener
> --
>
> Key: IGNITE-21795
> URL: https://issues.apache.org/jira/browse/IGNITE-21795
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexander Lapin
>Priority: Major
>
> Despite the fact that it's reasonable to skip the data update if it was already 
> executed on the collocated primary replica storage, it's required to 
> unconditionally update the storage with the proper raft index within 
> PartitionListener. For more details please see:
> {code:java}
> synchronized (safeTime) {
> if (cmd.safeTime().compareTo(safeTime.current()) > 0) {
> storageUpdateHandler.handleUpdate( {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (IGNITE-21795) Unconditionally update storage with proper raft index within PartitionListener

2024-03-19 Thread Alexander Lapin (Jira)
Alexander Lapin created IGNITE-21795:


 Summary: Unconditionally update storage with proper raft index 
within PartitionListener
 Key: IGNITE-21795
 URL: https://issues.apache.org/jira/browse/IGNITE-21795
 Project: Ignite
  Issue Type: Bug
Reporter: Alexander Lapin






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21795) Unconditionally update storage with proper raft index within PartitionListener

2024-03-19 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-21795:
-
Ignite Flags:   (was: Docs Required,Release Notes Required)

> Unconditionally update storage with proper raft index within PartitionListener
> --
>
> Key: IGNITE-21795
> URL: https://issues.apache.org/jira/browse/IGNITE-21795
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexander Lapin
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (IGNITE-21354) java.lang.AssertionError: TABLE2 part 0 in ItIgniteNodeRestartTest#testCfgGapWithoutData

2024-03-19 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin resolved IGNITE-21354.
--
Fix Version/s: 3.0.0-beta2
 Assignee: Alexander Lapin
   Resolution: Cannot Reproduce

> java.lang.AssertionError: TABLE2 part 0 in 
> ItIgniteNodeRestartTest#testCfgGapWithoutData
> 
>
> Key: IGNITE-21354
> URL: https://issues.apache.org/jira/browse/IGNITE-21354
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexander Lapin
>Assignee: Alexander Lapin
>Priority: Major
>  Labels: ignite-3
> Fix For: 3.0.0-beta2
>
>
> {code:java}
> [2024-01-26T07:13:05,030][ERROR][%iinrt_tcgwd_2%JRaft-FSMCaller-Disruptor-_stripe_0-0][PartitionListener]
>  Unknown error while processing command [commandIndex=3, commandTerm=3, 
> command=UpdateAllCommandImpl [full=false, messageRowsToUpdate=HashMap 
> {83afc1ea-6b45-4d16-891f-9b89811bbd23=TimedBinaryRowMessageImpl 
> [binaryRowMessage=BinaryRowMessageImpl 
> [binaryTuple=java.nio.HeapByteBuffer[pos=0 lim=9 cap=9], schemaVersion=1], 
> timestamp=0]}, requiredCatalogVersion=7, safeTimeLong=111821008665772036, 
> tablePartitionId=TablePartitionIdMessageImpl [partitionId=1, tableId=13], 
> txCoordinatorId=ae0ba53c-1e6f-439a-b1b0-7d3f09835631, 
> txId=018d449d-66e8--214b-4fe70001]]
> java.lang.AssertionError: TABLE20 part 0
>   at 
> org.apache.ignite.internal.table.IndexWrapper$HashIndexWrapper.getStorage(IndexWrapper.java:106)
>  ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
>   at org.apache.ignite.internal.table.TableImpl$1.get(TableImpl.java:223) 
> ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
>   at 
> org.apache.ignite.internal.table.distributed.index.IndexUpdateHandler.waitIndexes(IndexUpdateHandler.java:151)
>  ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
>   at 
> org.apache.ignite.internal.table.distributed.StorageUpdateHandler.handleUpdateAll(StorageUpdateHandler.java:158)
>  ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
>   at 
> org.apache.ignite.internal.table.distributed.raft.PartitionListener.handleUpdateAllCommand(PartitionListener.java:296)
>  ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
>   at 
> org.apache.ignite.internal.table.distributed.raft.PartitionListener.lambda$onWrite$1(PartitionListener.java:202)
>  ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
>   at java.util.Iterator.forEachRemaining(Iterator.java:133) [?:?] {code}
> [https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_IntegrationTests_ModuleRunner/7801857?expandBuildProblemsSection=true=false=false=true]
> h3. Upd#1
> The aforementioned assertion fires during the after-test cleanup actions on 
> stop. It seems there's a race between storage stopping and in-raft command 
> processing.
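> A hedged sketch of one way to close such a race, guarding command processing 
> with a busy lock (whether the actual fix looks like this is an assumption; 
> IgniteSpinBusyLock is used here illustratively):
> {code:java}
> // Illustrative only: guard storage access against concurrent stop.
> private final IgniteSpinBusyLock busyLock = new IgniteSpinBusyLock();
> 
> void onWrite(Iterator<CommandClosure<WriteCommand>> iterator) {
>     if (!busyLock.enterBusy()) {
>         return; // The storage is stopping; do not touch it.
>     }
>     try {
>         iterator.forEachRemaining(this::handleCommand);
>     } finally {
>         busyLock.leaveBusy();
>     }
> }
> 
> void stop() {
>     busyLock.block(); // Blocks new entrants, waits out in-flight batches.
> }
> {code}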



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (IGNITE-21391) ItNodeTest#testAppendEntriesWhenFollowerIsInErrorState is flaky

2024-03-19 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin resolved IGNITE-21391.
--
Fix Version/s: 3.0.0-beta2
   Resolution: Cannot Reproduce

> ItNodeTest#testAppendEntriesWhenFollowerIsInErrorState is flaky
> ---
>
> Key: IGNITE-21391
> URL: https://issues.apache.org/jira/browse/IGNITE-21391
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexander Lapin
>Assignee: Alexander Lapin
>Priority: Major
>  Labels: ignite-3
> Fix For: 3.0.0-beta2
>
> Attachments: test.log
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
>  
> {code:java}
> org.opentest4j.AssertionFailedError: expected: <true> but was: <false>
>   at 
> app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
>   at 
> app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
>   at app//org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)
>   at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36)
>   at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:31)
>   at app//org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:180)
>   at 
> app//org.apache.ignite.raft.jraft.core.TestCluster.ensureSame(TestCluster.java:558)
>   at 
> app//org.apache.ignite.raft.jraft.core.TestCluster.ensureSame(TestCluster.java:530)
>   at 
> app//org.apache.ignite.raft.jraft.core.ItNodeTest.testAppendEntriesWhenFollowerIsInErrorState(ItNodeTest.java:2568){code}
> The appropriate line of code that fails: 
> [https://github.com/apache/ignite-3/blob/f3e9af88192f5aea41a3f6a8d4cac2c891141205/modules/raft/src/integrationTest/java/org/apache/ignite/raft/jraft/core/ItNodeTest.java#L2570]
> {code:java}
> cluster.ensureSame();{code}
> Didn't manage to reproduce the issue locally; however, there are several 
> failures on TC with the same reason.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21792) ItNodeTest#testFollowerStartStopFollowing is flaky

2024-03-19 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-21792:
-
Epic Link: IGNITE-21389

> ItNodeTest#testFollowerStartStopFollowing is flaky
> --
>
> Key: IGNITE-21792
> URL: https://issues.apache.org/jira/browse/IGNITE-21792
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexander Lapin
>Priority: Major
>  Labels: ignite-3
>
> ItNodeTest#testFollowerStartStopFollowing is flaky with
> {code:java}
> org.opentest4j.AssertionFailedError: expected: <true> but was: <false>  at 
> app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
>   at 
> app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
>   at app//org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)  
> at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36)  at 
> app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:31)  at 
> app//org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:183)  at 
> app//org.apache.ignite.raft.jraft.core.ItNodeTest.testFollowerStartStopFollowing(ItNodeTest.java:2636)
>   at java.base@11.0.17/java.lang.reflect.Method.invoke(Method.java:566)  at 
> java.base@11.0.17/java.util.ArrayList.forEach(ArrayList.java:1541)  at 
> java.base@11.0.17/java.util.ArrayList.forEach(ArrayList.java:1541) {code}
> [TC failure 
> link|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_RunAllTests/7944770?expandBuildDeploymentsSection=false=false=true=false+Inspection=true=true=true]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21792) ItNodeTest#testFollowerStartStopFollowing is flaky

2024-03-19 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-21792:
-
Labels: ignite-3  (was: )

> ItNodeTest#testFollowerStartStopFollowing is flaky
> --
>
> Key: IGNITE-21792
> URL: https://issues.apache.org/jira/browse/IGNITE-21792
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexander Lapin
>Priority: Major
>  Labels: ignite-3
>
> ItNodeTest#testFollowerStartStopFollowing is flaky with
> {code:java}
> org.opentest4j.AssertionFailedError: expected: <true> but was: <false>  at 
> app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
>   at 
> app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
>   at app//org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)  
> at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36)  at 
> app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:31)  at 
> app//org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:183)  at 
> app//org.apache.ignite.raft.jraft.core.ItNodeTest.testFollowerStartStopFollowing(ItNodeTest.java:2636)
>   at java.base@11.0.17/java.lang.reflect.Method.invoke(Method.java:566)  at 
> java.base@11.0.17/java.util.ArrayList.forEach(ArrayList.java:1541)  at 
> java.base@11.0.17/java.util.ArrayList.forEach(ArrayList.java:1541) {code}
> [TC failure 
> link|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_RunAllTests/7944770?expandBuildDeploymentsSection=false=false=true=false+Inspection=true=true=true]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21792) ItNodeTest#testFollowerStartStopFollowing is flaky

2024-03-19 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-21792:
-
Ignite Flags:   (was: Docs Required,Release Notes Required)

> ItNodeTest#testFollowerStartStopFollowing is flaky
> --
>
> Key: IGNITE-21792
> URL: https://issues.apache.org/jira/browse/IGNITE-21792
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexander Lapin
>Priority: Major
>
> ItNodeTest#testFollowerStartStopFollowing is flaky with
> {code:java}
> org.opentest4j.AssertionFailedError: expected: <true> but was: <false>  at 
> app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
>   at 
> app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
>   at app//org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)  
> at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36)  at 
> app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:31)  at 
> app//org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:183)  at 
> app//org.apache.ignite.raft.jraft.core.ItNodeTest.testFollowerStartStopFollowing(ItNodeTest.java:2636)
>   at java.base@11.0.17/java.lang.reflect.Method.invoke(Method.java:566)  at 
> java.base@11.0.17/java.util.ArrayList.forEach(ArrayList.java:1541)  at 
> java.base@11.0.17/java.util.ArrayList.forEach(ArrayList.java:1541) {code}
> [TC failure 
> link|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_RunAllTests/7944770?expandBuildDeploymentsSection=false=false=true=false+Inspection=true=true=true]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21792) ItNodeTest#testFollowerStartStopFollowing is flaky

2024-03-19 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-21792:
-
Description: 
ItNodeTest#testFollowerStartStopFollowing is flaky with
{code:java}
org.opentest4j.AssertionFailedError: expected: <true> but was: <false>  at 
app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
  at 
app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
  at app//org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)  at 
app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36)  at 
app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:31)  at 
app//org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:183)  at 
app//org.apache.ignite.raft.jraft.core.ItNodeTest.testFollowerStartStopFollowing(ItNodeTest.java:2636)
  at java.base@11.0.17/java.lang.reflect.Method.invoke(Method.java:566)  at 
java.base@11.0.17/java.util.ArrayList.forEach(ArrayList.java:1541)  at 
java.base@11.0.17/java.util.ArrayList.forEach(ArrayList.java:1541) {code}
[TC failure 
link|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_RunAllTests/7944770?expandBuildDeploymentsSection=false=false=true=false+Inspection=true=true=true]

 

  was:
ItNodeTest#testFollowerStartStopFollowing is flaky with

 
{code:java}
org.opentest4j.AssertionFailedError: expected: <true> but was: <false>  at 
app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
  at 
app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
  at app//org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)  at 
app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36)  at 
app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:31)  at 
app//org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:183)  at 
app//org.apache.ignite.raft.jraft.core.ItNodeTest.testFollowerStartStopFollowing(ItNodeTest.java:2636)
  at java.base@11.0.17/java.lang.reflect.Method.invoke(Method.java:566)  at 
java.base@11.0.17/java.util.ArrayList.forEach(ArrayList.java:1541)  at 
java.base@11.0.17/java.util.ArrayList.forEach(ArrayList.java:1541) {code}
[TC failure 
link|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_RunAllTests/7944770?expandBuildDeploymentsSection=false=false=true=false+Inspection=true=true=true]

 


> ItNodeTest#testFollowerStartStopFollowing is flaky
> --
>
> Key: IGNITE-21792
> URL: https://issues.apache.org/jira/browse/IGNITE-21792
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexander Lapin
>Priority: Major
>
> ItNodeTest#testFollowerStartStopFollowing is flaky with
> {code:java}
> org.opentest4j.AssertionFailedError: expected: <true> but was: <false>  at 
> app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
>   at 
> app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
>   at app//org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)  
> at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36)  at 
> app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:31)  at 
> app//org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:183)  at 
> app//org.apache.ignite.raft.jraft.core.ItNodeTest.testFollowerStartStopFollowing(ItNodeTest.java:2636)
>   at java.base@11.0.17/java.lang.reflect.Method.invoke(Method.java:566)  at 
> java.base@11.0.17/java.util.ArrayList.forEach(ArrayList.java:1541)  at 
> java.base@11.0.17/java.util.ArrayList.forEach(ArrayList.java:1541) {code}
> [TC failure 
> link|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_RunAllTests/7944770?expandBuildDeploymentsSection=false=false=true=false+Inspection=true=true=true]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

