[jira] [Comment Edited] (IGNITE-9913) Prevent data updates blocking in case of backup BLT server node leave

2019-12-12 Thread Alexei Scherbakov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-9913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16994406#comment-16994406
 ] 

Alexei Scherbakov edited comment on IGNITE-9913 at 12/12/19 9:27 AM:
-

[~avinogradov]

1. I've come to the conclusion that having the rebalanced state calculated on the 
coordinator is the most robust way to determine that the grid is rebalanced. Let's keep it.
2. I've left two comments in your PR regarding the change.
3. ok.


was (Author: ascherbakov):
[~avinogradov]

1. I've came to a conclusion having rebalanced state calculated on coordinator 
is the most robust way to say the grid is rebalanced. Let's keep it.
2. I've left two comments in your PR regarding the change.
3. ok.

> Prevent data updates blocking in case of backup BLT server node leave
> -
>
> Key: IGNITE-9913
> URL: https://issues.apache.org/jira/browse/IGNITE-9913
> Project: Ignite
>  Issue Type: Improvement
>  Components: general
>Reporter: Ivan Rakov
>Assignee: Anton Vinogradov
>Priority: Major
> Attachments: 9913_yardstick.png, master_yardstick.png
>
>  Time Spent: 9h 10m
>  Remaining Estimate: 0h
>
> An Ignite cluster performs distributed partition map exchange when any server 
> node leaves or joins the topology.
> Distributed PME blocks all updates and may take a long time. If all 
> partitions are assigned according to the baseline topology and a server node 
> leaves, there's no actual need to perform distributed PME: every cluster node 
> is able to recalculate the new affinity assignments and partition states locally. 
> If we implement such a lightweight PME and handle mapping and lock requests 
> on the new topology version correctly, updates won't be stopped (except updates 
> of partitions that lost their primary copy).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (IGNITE-9913) Prevent data updates blocking in case of backup BLT server node leave

2019-12-12 Thread Alexei Scherbakov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-9913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16994406#comment-16994406
 ] 

Alexei Scherbakov commented on IGNITE-9913:
---

[~avinogradov]

1. I've come to the conclusion that having the rebalanced state calculated on the 
coordinator is the most robust way to determine that the grid is rebalanced. Let's keep it.
2. I've left two comments in your PR regarding the change.
3. ok.

> Prevent data updates blocking in case of backup BLT server node leave
> -
>
> Key: IGNITE-9913
> URL: https://issues.apache.org/jira/browse/IGNITE-9913
> Project: Ignite
>  Issue Type: Improvement
>  Components: general
>Reporter: Ivan Rakov
>Assignee: Anton Vinogradov
>Priority: Major
> Attachments: 9913_yardstick.png, master_yardstick.png
>
>  Time Spent: 9h 10m
>  Remaining Estimate: 0h
>
> An Ignite cluster performs distributed partition map exchange when any server 
> node leaves or joins the topology.
> Distributed PME blocks all updates and may take a long time. If all 
> partitions are assigned according to the baseline topology and a server node 
> leaves, there's no actual need to perform distributed PME: every cluster node 
> is able to recalculate the new affinity assignments and partition states locally. 
> If we implement such a lightweight PME and handle mapping and lock requests 
> on the new topology version correctly, updates won't be stopped (except updates 
> of partitions that lost their primary copy).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (IGNITE-9913) Prevent data updates blocking in case of backup BLT server node leave

2019-12-10 Thread Alexei Scherbakov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-9913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992697#comment-16992697
 ] 

Alexei Scherbakov commented on IGNITE-9913:
---

[~avinogradov]

I've reviewed the PR. Overall, the idea and the implementation look valid.

My questions are:

1. You've introduced a flag _rebalanced_ indicating that the previous exchange 
future was completed after everything was rebalanced.
The flag seems unnecessary. The _rebalanced_ state can be derived from these 
conditions (see the sketch after this comment):
a) the exchange is triggered by CacheAffinityChangeMessage 
b) for this exchange forceAffReassignment=true and 
GridDhtPartitionsFullMessage#idealAffinityDiff().isEmpty()

Can we get rid of the flag?

2. It seems CacheAffinityChangeMessage no longer contains any useful 
assignments when it is triggered by switching from the late to the ideal state.
Can we get rid of sending any assignments for protocol v3?

Also, could you add a test where all owners of a partition leave one by one 
under load, and make sure updates to other partitions work as expected 
without PME, using different partition loss policy modes and backup counts?
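
A minimal sketch of the derivation proposed in item 1, assuming the three
conditions are exposed as plain booleans (the class and parameter names below
are hypothetical, not the actual exchange-future API):

{noformat}
/** Sketch only: derive the "rebalanced" state instead of storing a dedicated flag. */
final class RebalancedStateSketch {
    /**
     * @param triggeredByAffChangeMsg Exchange was triggered by CacheAffinityChangeMessage.
     * @param forceAffReassignment forceAffReassignment was set for this exchange.
     * @param idealAffDiffEmpty GridDhtPartitionsFullMessage#idealAffinityDiff() is empty.
     * @return Whether the grid can be considered rebalanced after this exchange.
     */
    static boolean rebalanced(boolean triggeredByAffChangeMsg,
                              boolean forceAffReassignment,
                              boolean idealAffDiffEmpty) {
        return triggeredByAffChangeMsg && forceAffReassignment && idealAffDiffEmpty;
    }
}
{noformat}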

> Prevent data updates blocking in case of backup BLT server node leave
> -
>
> Key: IGNITE-9913
> URL: https://issues.apache.org/jira/browse/IGNITE-9913
> Project: Ignite
>  Issue Type: Improvement
>  Components: general
>Reporter: Ivan Rakov
>Assignee: Anton Vinogradov
>Priority: Major
> Attachments: 9913_yardstick.png, master_yardstick.png
>
>  Time Spent: 9h 10m
>  Remaining Estimate: 0h
>
> An Ignite cluster performs distributed partition map exchange when any server 
> node leaves or joins the topology.
> Distributed PME blocks all updates and may take a long time. If all 
> partitions are assigned according to the baseline topology and a server node 
> leaves, there's no actual need to perform distributed PME: every cluster node 
> is able to recalculate the new affinity assignments and partition states locally. 
> If we implement such a lightweight PME and handle mapping and lock requests 
> on the new topology version correctly, updates won't be stopped (except updates 
> of partitions that lost their primary copy).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (IGNITE-12429) Rework bytes-based WAL archive size management logic to make historical rebalance more predictable

2019-12-09 Thread Alexei Scherbakov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992226#comment-16992226
 ] 

Alexei Scherbakov edited comment on IGNITE-12429 at 12/10/19 7:10 AM:
--

[~ivan.glukos]

I have some objections.

1. I don't think this is right. Having the ability to specify history in 
checkpoints is the same as setting a duration equal to checkpointFreq * 
walHistSize (see the sketch after this comment).
To me, this is a good thing to have. Probably we should change the property to 
be measured in time units, or just add a javadoc explaining how it is 
translated to a duration.

2. For me the root cause is the wrong treatment of histMap when calculating 
the available history for reservation. 
We already have a caching mechanism for checkpoint entries [1].
It looks possible to keep all the history in the heap (actually storing only 
references) using lazy loading/unloading when needed, and to get rid of 
IGNITE_PDS_MAX_CHECKPOINT_MEMORY_HISTORY_SIZE (or maybe use it as a hint for 
caching).
Also, I do not understand how having a sparse map will help us, because we need 
all entries for the history calculation.

[1] 
org.apache.ignite.internal.processors.cache.persistence.checkpoint.CheckpointEntry.GroupStateLazyStore
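
A small sketch of the equivalence from item 1: walHistSize expressed in checkpoints is
effectively a duration once the checkpoint frequency is fixed (the helper below is
illustrative, not an Ignite API; the 3-minute value is the default checkpoint frequency):

{noformat}
/** Sketch: a history of N checkpoints covers roughly N * checkpointFreq of wall-clock time. */
final class WalHistoryDurationSketch {
    static long approxHistoryDurationMs(long checkpointFreqMs, int walHistSize) {
        return checkpointFreqMs * walHistSize;
    }

    public static void main(String[] args) {
        // Defaults: checkpointFreq = 3 min, walHistSize = 100 -> 300 minutes of history.
        System.out.println(approxHistoryDurationMs(3 * 60_000L, 100) / 60_000L + " min");
    }
}
{noformat}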






was (Author: ascherbakov):
[~ivan.glukos]

I have some objections.

1. I don't think this is right. Having an ability to specify history in 
checkpoints is the same as setting a duration equal to checkpointFreq * 
walHistSize.
This is a good thing to have for me. Probably we should change the property to 
be measured in time units or just add a javadoc explaining how this is 
translated to a duration.

2. For me the root cause is wrong threatment of histMap when calculating 
available history for reservation. 
We already have a caching mechanics for checkpoint entries [1].
Looks like it's possible to keep all the history in the heap (only store 
references actually) using lazy loading/unloading when needed and get reid of 
IGNITE_PDS_MAX_CHECKPOINT_MEMORY_HISTORY_SIZE (or maybe use it as a hint for 
caching).
Also I do not understand how having sparse map will help us because we need all 
entries for history calculation.

[1] 
org.apache.ignite.internal.processors.cache.persistence.checkpoint.CheckpointEntry.GroupStateLazyStore





> Rework bytes-based WAL archive size management logic to make historical 
> rebalance more predictable
> --
>
> Key: IGNITE-12429
> URL: https://issues.apache.org/jira/browse/IGNITE-12429
> Project: Ignite
>  Issue Type: Improvement
>Affects Versions: 2.7, 2.7.5, 2.7.6
>Reporter: Ivan Rakov
>Priority: Major
>
> Since 2.7, DataStorageConfiguration allows specifying the size of the WAL archive in 
> bytes (see DataStorageConfiguration#maxWalArchiveSize), which is much more 
> transparent to the user. 
> Unfortunately, the new logic may be unpredictable when it comes to historical 
> rebalance. The WAL archive is truncated when one of the following conditions 
> occurs:
> 1. The total number of checkpoints in the WAL archive is bigger than 
> DataStorageConfiguration#walHistSize
> 2. The total size of the WAL archive is bigger than 
> DataStorageConfiguration#maxWalArchiveSize
> Independently, the in-memory checkpoint history contains only a fixed number of 
> the last checkpoints (can be changed with 
> IGNITE_PDS_MAX_CHECKPOINT_MEMORY_HISTORY_SIZE, 100 by default).
> All these peculiarities make it hard for the user to control the usage of 
> historical rebalance. Imagine the case when the user has a light load (WAL gets 
> rotated very slowly) and the default checkpoint frequency. After 100 * 3 = 300 
> minutes, updates in the WAL will be impossible to receive via historical 
> rebalance even if:
> 1. The user has configured a large DataStorageConfiguration#maxWalArchiveSize
> 2. The user has configured a large DataStorageConfiguration#walHistSize
> At the same time, setting a large IGNITE_PDS_MAX_CHECKPOINT_MEMORY_HISTORY_SIZE 
> will help (only combined with the previous two points), but Ignite node heap 
> usage may increase dramatically.
> I propose to change the WAL history management logic in the following way:
> 1. *Don't cut* the WAL archive when the number of checkpoints exceeds 
> DataStorageConfiguration#walHistSize. WAL history should be managed only 
> based on DataStorageConfiguration#maxWalArchiveSize.
> 2. Checkpoint history should contain a fixed number of entries, but should 
> cover the whole stored WAL archive (not only its most recent part with the 
> IGNITE_PDS_MAX_CHECKPOINT_MEMORY_HISTORY_SIZE last checkpoints). This can be 
> achieved by making the checkpoint history sparse: some intermediate checkpoints 
> *may not be present in history*, but a fixed number of checkpoints can be 
> positioned either uniformly (trying to keep a fixed number of 
> bytes between 

[jira] [Comment Edited] (IGNITE-12429) Rework bytes-based WAL archive size management logic to make historical rebalance more predictable

2019-12-09 Thread Alexei Scherbakov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992226#comment-16992226
 ] 

Alexei Scherbakov edited comment on IGNITE-12429 at 12/10/19 7:10 AM:
--

[~ivan.glukos]

I have some objections.

1. I don't think this is right. Having the ability to specify history in 
checkpoints is the same as setting a duration equal to checkpointFreq * 
walHistSize.
To me, this is a good thing to have. Probably we should change the property to 
be measured in time units, or just add a javadoc explaining how it is 
translated to a duration.

2. For me the root cause is the wrong treatment of histMap when calculating 
the available history for reservation. 
We already have a caching mechanism for checkpoint entries [1].
It looks possible to keep all the history in the heap (actually storing only 
references) using lazy loading/unloading when needed, and to get rid of 
IGNITE_PDS_MAX_CHECKPOINT_MEMORY_HISTORY_SIZE (or maybe use it as a hint for 
caching).
Also, I do not understand how having a sparse map will help us, because we need 
all entries for the history calculation.

[1] 
org.apache.ignite.internal.processors.cache.persistence.checkpoint.CheckpointEntry.GroupStateLazyStore






was (Author: ascherbakov):
[~ivan.glukos]

I have some objections.

1. I don't think this is right. Having an ability to specify history in 
checkpoints is the same as setting a duration equal to checkpointFreq * 
walHistSize.
This is a good thing to have for me. Probably we should change the property to 
be measured in time units or just add a javadoc explaining how this is 
transalated to duration.

2. For me the root cause is wrong threatment of histMap when calculating 
available history for reservation. 
We already have a caching mechanics for checkpoint entries [1].
Looks like it's possible to keep all the history in the heap (only store 
references actually) using lazy loading/unloading when needed and get reid of 
IGNITE_PDS_MAX_CHECKPOINT_MEMORY_HISTORY_SIZE (or maybe use it as a hint for 
caching).
Also I do not understand how having sparse map will help us because we need all 
entries for history calculation.

[1] 
org.apache.ignite.internal.processors.cache.persistence.checkpoint.CheckpointEntry.GroupStateLazyStore





> Rework bytes-based WAL archive size management logic to make historical 
> rebalance more predictable
> --
>
> Key: IGNITE-12429
> URL: https://issues.apache.org/jira/browse/IGNITE-12429
> Project: Ignite
>  Issue Type: Improvement
>Affects Versions: 2.7, 2.7.5, 2.7.6
>Reporter: Ivan Rakov
>Priority: Major
>
> Since 2.7, DataStorageConfiguration allows specifying the size of the WAL archive in 
> bytes (see DataStorageConfiguration#maxWalArchiveSize), which is much more 
> transparent to the user. 
> Unfortunately, the new logic may be unpredictable when it comes to historical 
> rebalance. The WAL archive is truncated when one of the following conditions 
> occurs:
> 1. The total number of checkpoints in the WAL archive is bigger than 
> DataStorageConfiguration#walHistSize
> 2. The total size of the WAL archive is bigger than 
> DataStorageConfiguration#maxWalArchiveSize
> Independently, the in-memory checkpoint history contains only a fixed number of 
> the last checkpoints (can be changed with 
> IGNITE_PDS_MAX_CHECKPOINT_MEMORY_HISTORY_SIZE, 100 by default).
> All these peculiarities make it hard for the user to control the usage of 
> historical rebalance. Imagine the case when the user has a light load (WAL gets 
> rotated very slowly) and the default checkpoint frequency. After 100 * 3 = 300 
> minutes, updates in the WAL will be impossible to receive via historical 
> rebalance even if:
> 1. The user has configured a large DataStorageConfiguration#maxWalArchiveSize
> 2. The user has configured a large DataStorageConfiguration#walHistSize
> At the same time, setting a large IGNITE_PDS_MAX_CHECKPOINT_MEMORY_HISTORY_SIZE 
> will help (only combined with the previous two points), but Ignite node heap 
> usage may increase dramatically.
> I propose to change the WAL history management logic in the following way:
> 1. *Don't cut* the WAL archive when the number of checkpoints exceeds 
> DataStorageConfiguration#walHistSize. WAL history should be managed only 
> based on DataStorageConfiguration#maxWalArchiveSize.
> 2. Checkpoint history should contain a fixed number of entries, but should 
> cover the whole stored WAL archive (not only its most recent part with the 
> IGNITE_PDS_MAX_CHECKPOINT_MEMORY_HISTORY_SIZE last checkpoints). This can be 
> achieved by making the checkpoint history sparse: some intermediate checkpoints 
> *may not be present in history*, but a fixed number of checkpoints can be 
> positioned either uniformly (trying to keep a fixed number of 
> bytes between 

[jira] [Commented] (IGNITE-12429) Rework bytes-based WAL archive size management logic to make historical rebalance more predictable

2019-12-09 Thread Alexei Scherbakov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992226#comment-16992226
 ] 

Alexei Scherbakov commented on IGNITE-12429:


[~ivan.glukos]

I have some objections.

1. I don't think this is right. Having the ability to specify history in 
checkpoints is the same as setting a duration equal to checkpointFreq * 
walHistSize.
To me, this is a good thing to have. Probably we should change the property to 
be measured in time units, or just add a javadoc explaining how it is 
translated to a duration.

2. For me the root cause is the wrong treatment of histMap when calculating 
the available history for reservation. 
We already have a caching mechanism for checkpoint entries [1].
It looks possible to keep all the history in the heap (actually storing only 
references) using lazy loading/unloading when needed, and to get rid of 
IGNITE_PDS_MAX_CHECKPOINT_MEMORY_HISTORY_SIZE (or maybe use it as a hint for 
caching).
Also, I do not understand how having a sparse map will help us, because we need 
all entries for the history calculation.

[1] 
org.apache.ignite.internal.processors.cache.persistence.checkpoint.CheckpointEntry.GroupStateLazyStore





> Rework bytes-based WAL archive size management logic to make historical 
> rebalance more predictable
> --
>
> Key: IGNITE-12429
> URL: https://issues.apache.org/jira/browse/IGNITE-12429
> Project: Ignite
>  Issue Type: Improvement
>Affects Versions: 2.7, 2.7.5, 2.7.6
>Reporter: Ivan Rakov
>Priority: Major
>
> Since 2.7, DataStorageConfiguration allows specifying the size of the WAL archive in 
> bytes (see DataStorageConfiguration#maxWalArchiveSize), which is much more 
> transparent to the user. 
> Unfortunately, the new logic may be unpredictable when it comes to historical 
> rebalance. The WAL archive is truncated when one of the following conditions 
> occurs:
> 1. The total number of checkpoints in the WAL archive is bigger than 
> DataStorageConfiguration#walHistSize
> 2. The total size of the WAL archive is bigger than 
> DataStorageConfiguration#maxWalArchiveSize
> Independently, the in-memory checkpoint history contains only a fixed number of 
> the last checkpoints (can be changed with 
> IGNITE_PDS_MAX_CHECKPOINT_MEMORY_HISTORY_SIZE, 100 by default).
> All these peculiarities make it hard for the user to control the usage of 
> historical rebalance. Imagine the case when the user has a light load (WAL gets 
> rotated very slowly) and the default checkpoint frequency. After 100 * 3 = 300 
> minutes, updates in the WAL will be impossible to receive via historical 
> rebalance even if:
> 1. The user has configured a large DataStorageConfiguration#maxWalArchiveSize
> 2. The user has configured a large DataStorageConfiguration#walHistSize
> At the same time, setting a large IGNITE_PDS_MAX_CHECKPOINT_MEMORY_HISTORY_SIZE 
> will help (only combined with the previous two points), but Ignite node heap 
> usage may increase dramatically.
> I propose to change the WAL history management logic in the following way:
> 1. *Don't cut* the WAL archive when the number of checkpoints exceeds 
> DataStorageConfiguration#walHistSize. WAL history should be managed only 
> based on DataStorageConfiguration#maxWalArchiveSize.
> 2. Checkpoint history should contain a fixed number of entries, but should 
> cover the whole stored WAL archive (not only its most recent part with the 
> IGNITE_PDS_MAX_CHECKPOINT_MEMORY_HISTORY_SIZE last checkpoints). This can be 
> achieved by making the checkpoint history sparse: some intermediate checkpoints 
> *may not be present in history*, but a fixed number of checkpoints can be 
> positioned either uniformly (trying to keep a fixed number of bytes between 
> two neighbouring checkpoints) or exponentially (trying to keep a fixed ratio 
> between [the size of WAL from checkpoint(N-1) to the current write pointer] and 
> [the size of WAL from checkpoint(N) to the current write pointer]).
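
A minimal sketch of the uniform-spacing variant from point 2, assuming each checkpoint is
identified here simply by its absolute WAL offset in bytes (the types and names are
illustrative, not the actual checkpoint-history API):

{noformat}
import java.util.ArrayList;
import java.util.List;

/** Sketch: thin checkpoint history out to roughly maxEntries entries spaced evenly in WAL bytes. */
final class SparseCheckpointHistorySketch {
    static List<Long> sparseHistory(List<Long> cpWalOffsets, int maxEntries) {
        if (cpWalOffsets.size() <= maxEntries || maxEntries < 2)
            return new ArrayList<>(cpWalOffsets);

        long first = cpWalOffsets.get(0);
        long last = cpWalOffsets.get(cpWalOffsets.size() - 1);
        long step = Math.max(1, (last - first) / (maxEntries - 1));

        List<Long> res = new ArrayList<>();
        long nextTarget = first;

        for (long off : cpWalOffsets) {
            // Keep the first checkpoint that crosses the next byte target.
            if (off >= nextTarget) {
                res.add(off);
                nextTarget = off + step;
            }
        }

        // Always retain the most recent checkpoint.
        if (res.get(res.size() - 1) != last)
            res.add(last);

        return res;
    }
}
{noformat}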



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (IGNITE-11857) Investigate performance drop after IGNITE-10078

2019-12-06 Thread Alexei Scherbakov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-11857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989896#comment-16989896
 ] 

Alexei Scherbakov commented on IGNITE-11857:


[~alex_pl]

OK, let's proceed with the Map version.

> Investigate performance drop after IGNITE-10078
> ---
>
> Key: IGNITE-11857
> URL: https://issues.apache.org/jira/browse/IGNITE-11857
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexei Scherbakov
>Assignee: Aleksey Plekhanov
>Priority: Major
> Attachments: ignite-config.xml, 
> run.properties.tx-optimistic-put-b-backup
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> After IGNITE-10078, yardstick tests show a performance drop of up to 8% in some 
> scenarios:
> * tx-optim-repRead-put-get
> * tx-optimistic-put
> * tx-putAll
> This is partially due to the new update counter implementation, but not only. 
> Investigation is required.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (IGNITE-12422) Clean up GG-XXX internal ticket references from the code base.

2019-12-06 Thread Alexei Scherbakov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-12422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexei Scherbakov updated IGNITE-12422:
---
Description: 
Replace with Apache Ignite equivalent if possible.

Also, it's desirable to implement a checkstyle rule to prevent foreign links in 
TODOs [1].

[1] https://checkstyle.sourceforge.io/config_misc.html#TodoComment

  was:Replace with Apache Ignite equivalent if possible.


> Clean up GG-XXX internal ticket references from the code base.
> --
>
> Key: IGNITE-12422
> URL: https://issues.apache.org/jira/browse/IGNITE-12422
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexei Scherbakov
>Assignee: Alexei Scherbakov
>Priority: Major
> Fix For: 2.9
>
>
> Replace with Apache Ignite equivalent if possible.
> Also, it's desirable to implement a checkstyle rule to prevent foreign links in 
> TODOs [1].
> [1] https://checkstyle.sourceforge.io/config_misc.html#TodoComment



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (IGNITE-11857) Investigate performance drop after IGNITE-10078

2019-12-06 Thread Alexei Scherbakov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-11857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989717#comment-16989717
 ] 

Alexei Scherbakov edited comment on IGNITE-11857 at 12/6/19 1:10 PM:
-

[~alex_pl]

You have only measured heap allocation (GC pressure), which seems to be very low 
for both implementations.
You should also measure the resident size of both structures.
Long values can be replaced with Integer, because no tx with a batch size 
close to or larger than Integer.MAX_VALUE is viable.
For the most frequent use cases, object creation will be handled by the Integer 
boxing cache (a small illustration follows below).

I think I'm OK with the proposed improvement; just make sure we couldn't do better.
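
A small illustration of the boxing-cache point above: autoboxing goes through
Integer.valueOf, which caches values in -128..127 by default, so frequent small
counter values produce no new objects (the upper bound can be raised with
-XX:AutoBoxCacheMax / java.lang.Integer.IntegerCache.high):

{noformat}
public class BoxingCacheDemo {
    public static void main(String[] args) {
        Integer a = 100, b = 100;   // both resolve to the same cached instance
        Integer c = 1000, d = 1000; // outside the default cache: two distinct objects

        System.out.println(a == b); // true
        System.out.println(c == d); // false with default settings (identity is not guaranteed)
    }
}
{noformat}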





was (Author: ascherbakov):
[~alex_pl]

You have only measured heap allocation (GC pressure) which seems to be very low 
for both implementation.
You also should measure resident size of both structures.
Long for value can be replaced with Integer because no tx with batch of size 
close to or larger than Integer.MAX_VALUE is viable.
For most frequent use cases object creation will be handled by Integer boxing 
cache.

I think I'm ok this proposed improvement, just make sure we couldn't do better.




> Investigate performance drop after IGNITE-10078
> ---
>
> Key: IGNITE-11857
> URL: https://issues.apache.org/jira/browse/IGNITE-11857
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexei Scherbakov
>Assignee: Aleksey Plekhanov
>Priority: Major
> Attachments: ignite-config.xml, 
> run.properties.tx-optimistic-put-b-backup
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> After IGNITE-10078, yardstick tests show a performance drop of up to 8% in some 
> scenarios:
> * tx-optim-repRead-put-get
> * tx-optimistic-put
> * tx-putAll
> This is partially due to the new update counter implementation, but not only. 
> Investigation is required.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (IGNITE-11857) Investigate performance drop after IGNITE-10078

2019-12-06 Thread Alexei Scherbakov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-11857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989717#comment-16989717
 ] 

Alexei Scherbakov edited comment on IGNITE-11857 at 12/6/19 12:39 PM:
--

[~alex_pl]

You have only measured heap allocation (GC pressure), which seems to be very low 
for both implementations.
You should also measure the resident size of both structures.
Long values can be replaced with Integer, because no tx with a batch size 
close to or larger than Integer.MAX_VALUE is viable.
For the most frequent use cases, object creation will be handled by the Integer 
boxing cache.

I think I'm OK with this proposed improvement; just make sure we couldn't do better.





was (Author: ascherbakov):
[~alex_pl]

You have only measured heap allocation (GC pressure) which seems to be very low 
for both implementation.
You also should measure resident size of both structures.
Long for value can be replaced with Integer because no tx with batch of size 
Integer.MAX_VALUE is viable.
For most frequent use cases object creation will be handled by Integer boxing 
cache.

I think I'm ok this proposed improvement, just make sure we couldn't do better.




> Investigate performance drop after IGNITE-10078
> ---
>
> Key: IGNITE-11857
> URL: https://issues.apache.org/jira/browse/IGNITE-11857
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexei Scherbakov
>Assignee: Aleksey Plekhanov
>Priority: Major
> Attachments: ignite-config.xml, 
> run.properties.tx-optimistic-put-b-backup
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> After IGNITE-10078, yardstick tests show a performance drop of up to 8% in some 
> scenarios:
> * tx-optim-repRead-put-get
> * tx-optimistic-put
> * tx-putAll
> This is partially due to the new update counter implementation, but not only. 
> Investigation is required.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (IGNITE-11857) Investigate performance drop after IGNITE-10078

2019-12-06 Thread Alexei Scherbakov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-11857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989717#comment-16989717
 ] 

Alexei Scherbakov commented on IGNITE-11857:


[~alex_pl]

You have only measured heap allocation (GC pressure), which seems to be very low 
for both implementations.
You should also measure the resident size of both structures.
Long values can be replaced with Integer, because no tx with a batch size of 
Integer.MAX_VALUE is viable.
For the most frequent use cases, object creation will be handled by the Integer 
boxing cache.

I think I'm OK with this proposed improvement; just make sure we couldn't do better.




> Investigate performance drop after IGNITE-10078
> ---
>
> Key: IGNITE-11857
> URL: https://issues.apache.org/jira/browse/IGNITE-11857
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexei Scherbakov
>Assignee: Aleksey Plekhanov
>Priority: Major
> Attachments: ignite-config.xml, 
> run.properties.tx-optimistic-put-b-backup
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> After IGNITE-10078, yardstick tests show a performance drop of up to 8% in some 
> scenarios:
> * tx-optim-repRead-put-get
> * tx-optimistic-put
> * tx-putAll
> This is partially due to the new update counter implementation, but not only. 
> Investigation is required.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (IGNITE-11857) Investigate performance drop after IGNITE-10078

2019-12-06 Thread Alexei Scherbakov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-11857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989525#comment-16989525
 ] 

Alexei Scherbakov commented on IGNITE-11857:


[~alex_pl]

I've looked at your contribution.

Changing TreeSet to TreeMap looks like a very minor change. I think you can go 
further and get rid of the Item class. Out-of-order updates can be kept in a 
SortedMap where the key is the range start and the value is the range (or even in 
a sorted array of primitive tuples); a minimal sketch follows below. Another 
possibility is storing missing updates in a bitmap.

You should also check the new solution for heap usage in comparison to the old 
one. For configurations with many partitions, lower heap usage could be a more 
significant advantage than the minor performance boost.

Also, I have a slight concern about the robustness of the fix. It might be risky 
to merge it into 2.8 without extensive testing.

So, I would postpone the change and improve the patch first.
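
A minimal sketch of the range-based idea above (key = range start, value = range end,
exclusive); this is illustrative only and does not mirror the actual update-counter
implementation:

{noformat}
import java.util.Map;
import java.util.TreeMap;

/** Sketch: track out-of-order applied update ranges as [start, end) entries in a sorted map. */
final class OutOfOrderRangesSketch {
    /** start -> end (exclusive). */
    private final TreeMap<Long, Long> ranges = new TreeMap<>();

    /** Records that counters [start, start + delta) were applied out of order, merging neighbours. */
    void add(long start, long delta) {
        long end = start + delta;

        // Merge with a preceding range that touches or overlaps the new one.
        Map.Entry<Long, Long> prev = ranges.floorEntry(start);
        if (prev != null && prev.getValue() >= start) {
            start = prev.getKey();
            end = Math.max(end, prev.getValue());
        }

        // Absorb following ranges that begin inside the merged interval.
        Map.Entry<Long, Long> next;
        while ((next = ranges.ceilingEntry(start)) != null && next.getKey() <= end) {
            end = Math.max(end, next.getValue());
            ranges.remove(next.getKey());
        }

        ranges.put(start, end);
    }

    /** @return Whether counter {@code cntr} falls into some recorded range. */
    boolean contains(long cntr) {
        Map.Entry<Long, Long> e = ranges.floorEntry(cntr);
        return e != null && cntr < e.getValue();
    }
}
{noformat}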



> Investigate performance drop after IGNITE-10078
> ---
>
> Key: IGNITE-11857
> URL: https://issues.apache.org/jira/browse/IGNITE-11857
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexei Scherbakov
>Assignee: Aleksey Plekhanov
>Priority: Major
> Attachments: ignite-config.xml, 
> run.properties.tx-optimistic-put-b-backup
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> After IGNITE-10078, yardstick tests show a performance drop of up to 8% in some 
> scenarios:
> * tx-optim-repRead-put-get
> * tx-optimistic-put
> * tx-putAll
> This is partially due to the new update counter implementation, but not only. 
> Investigation is required.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (IGNITE-12422) Clean up GG-XXX internal ticket references from the code base.

2019-12-06 Thread Alexei Scherbakov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-12422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexei Scherbakov updated IGNITE-12422:
---
Summary: Clean up GG-XXX internal ticket references from the code base.  
(was: Clean up GG-XXX internal ticket references from code base.)

> Clean up GG-XXX internal ticket references from the code base.
> --
>
> Key: IGNITE-12422
> URL: https://issues.apache.org/jira/browse/IGNITE-12422
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexei Scherbakov
>Assignee: Alexei Scherbakov
>Priority: Major
> Fix For: 2.9
>
>
> Replace with Apache Ignite equivalent if possible.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (IGNITE-12422) Clean up GG-XXX internal ticket references from code base.

2019-12-05 Thread Alexei Scherbakov (Jira)
Alexei Scherbakov created IGNITE-12422:
--

 Summary: Clean up GG-XXX internal ticket references from code base.
 Key: IGNITE-12422
 URL: https://issues.apache.org/jira/browse/IGNITE-12422
 Project: Ignite
  Issue Type: Improvement
Reporter: Alexei Scherbakov
Assignee: Alexei Scherbakov
 Fix For: 2.9


Replace with Apache Ignite equivalent if possible.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (IGNITE-11704) Write tombstones during rebalance to get rid of deferred delete buffer

2019-12-05 Thread Alexei Scherbakov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-11704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16988954#comment-16988954
 ] 

Alexei Scherbakov commented on IGNITE-11704:


It's better to store the full log here in case of TC history cleanup.

{noformat}
java.lang.AssertionError: Failed to wait for tombstone cleanup: 
distributed.CacheRemoveWithTombstonesLoadTest2 expected:<0> but was:<1>
at 
org.apache.ignite.internal.processors.cache.distributed.CacheRemoveWithTombstonesLoadTest.waitTombstoneCleanup(CacheRemoveWithTombstonesLoadTest.java:335)
at 
org.apache.ignite.internal.processors.cache.distributed.CacheRemoveWithTombstonesLoadTest.removeAndRebalance(CacheRemoveWithTombstonesLoadTest.java:250)
--- Stdout: ---
[12:28:21]__   
[12:28:21]   /  _/ ___/ |/ /  _/_  __/ __/ 
[12:28:21]  _/ // (7 7// /  / / / _/   
[12:28:21] /___/\___/_/|_/___/ /_/ /___/  
[12:28:21] 
[12:28:21] ver. 2.8.0-SNAPSHOT#20191203-sha1:DEV
[12:28:21] 2019 Copyright(C) Apache Software Foundation
[12:28:21] 
[12:28:21] Ignite documentation: http://ignite.apache.org
[12:28:21] 
[12:28:21] Quiet mode.
[12:28:21]   ^-- Logging by 'GridTestLog4jLogger [quiet=true, config=null]'
[12:28:21]   ^-- To see **FULL** console log here add -DIGNITE_QUIET=false or 
"-v" to ignite.{sh|bat}
[12:28:21] 
[12:28:21] OS: Linux 4.15.0-54-generic amd64
[12:28:21] VM information: Java(TM) SE Runtime Environment 1.8.0_212-b10 Oracle 
Corporation Java HotSpot(TM) 64-Bit Server VM 25.212-b10
[12:28:21] Configured plugins:
[12:28:21]   ^-- StanByClusterTestProvider 1.0
[12:28:21]   ^-- null
[12:28:21] 
[12:28:21]   ^-- PageMemory tracker plugin 1.0
[12:28:21]   ^-- 
[12:28:21] 
[12:28:21]   ^-- TestDistibutedConfigurationPlugin 1.0
[12:28:21]   ^-- 
[12:28:21] 
[12:28:21]   ^-- NodeValidationPluginProvider 1.0
[12:28:21]   ^-- 
[12:28:21] 
[12:28:21] Configured failure handler: [hnd=NoOpFailureHandler 
[super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet 
[SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT
[12:28:21] Message queue limit is set to 0 which may lead to potential OOMEs 
when running cache operations in FULL_ASYNC or PRIMARY_SYNC modes due to 
message queues growth on sender and receiver sides.
[12:28:21] Security status [authentication=off, tls/ssl=off]
[12:28:21] To start Console Management & Monitoring run ignitevisorcmd.{sh|bat}
[12:28:21] Data Regions Configured:
[12:28:21]   ^-- default [initSize=256.0 MiB, maxSize=18.9 GiB, 
persistence=false, lazyMemoryAllocation=true]
[12:28:21] 
[12:28:21] Ignite node started OK (id=a34d6c79, instance 
name=distributed.CacheRemoveWithTombstonesLoadTest0)
[12:28:21] Topology snapshot [ver=1, locNode=a34d6c79, servers=1, clients=0, 
state=ACTIVE, CPUs=5, offheap=19.0GB, heap=2.0GB]
[12:28:23]__   
[12:28:23]   /  _/ ___/ |/ /  _/_  __/ __/ 
[12:28:23]  _/ // (7 7// /  / / / _/   
[12:28:23] /___/\___/_/|_/___/ /_/ /___/  
[12:28:23] 
[12:28:23] ver. 2.8.0-SNAPSHOT#20191203-sha1:DEV
[12:28:23] 2019 Copyright(C) Apache Software Foundation
[12:28:23] 
[12:28:23] Ignite documentation: http://ignite.apache.org
[12:28:23] 
[12:28:23] Quiet mode.
[12:28:23]   ^-- Logging by 'GridTestLog4jLogger [quiet=true, config=null]'
[12:28:23]   ^-- To see **FULL** console log here add -DIGNITE_QUIET=false or 
"-v" to ignite.{sh|bat}
[12:28:23] 
[12:28:23] OS: Linux 4.15.0-54-generic amd64
[12:28:23] VM information: Java(TM) SE Runtime Environment 1.8.0_212-b10 Oracle 
Corporation Java HotSpot(TM) 64-Bit Server VM 25.212-b10
[12:28:23] Configured plugins:
[12:28:23]   ^-- StanByClusterTestProvider 1.0
[12:28:23]   ^-- null
[12:28:23] 
[12:28:23]   ^-- PageMemory tracker plugin 1.0
[12:28:23]   ^-- 
[12:28:23] 
[12:28:23]   ^-- TestDistibutedConfigurationPlugin 1.0
[12:28:23]   ^-- 
[12:28:23] 
[12:28:23]   ^-- NodeValidationPluginProvider 1.0
[12:28:23]   ^-- 
[12:28:23] 
[12:28:23] Configured failure handler: [hnd=NoOpFailureHandler 
[super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet 
[SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT
[12:28:23] Message queue limit is set to 0 which may lead to potential OOMEs 
when running cache operations in FULL_ASYNC or PRIMARY_SYNC modes due to 
message queues growth on sender and receiver sides.
[12:28:23] Security status [authentication=off, tls/ssl=off]
[12:28:23] Joining node doesn't have encryption data 
[node=3205dca1-2c61-4f76-8475-e0c5c0f1]
[12:28:23] Topology snapshot [ver=2, locNode=a34d6c79, servers=2, clients=0, 
state=ACTIVE, CPUs=5, offheap=19.0GB, heap=2.0GB]
[12:28:29] To start Console Management & Monitoring run ignitevisorcmd.{sh|bat}
[12:28:29] Data Regions Configured:
[12:28:29]   ^-- default [initSize=256.0 MiB, maxSize=18.9 GiB, 
persistence=false, lazyMemoryAllocation=true]
[12:28:29] 
[12:28:29] Ignite node started OK (id=3205dca1, instance 

[jira] [Commented] (IGNITE-12049) Allow custom authenticators to use SSL certificates

2019-12-02 Thread Alexei Scherbakov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16986102#comment-16986102
 ] 

Alexei Scherbakov commented on IGNITE-12049:


[~SomeFire]

Sounds good.

Attributes for JDBC/ODBC can be passed to the driver as Base64-encoded strings 
(a small sketch follows below); the factory approach is also fine.
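
A small sketch of the Base64 option, using only standard JDK APIs (how the driver
would name and transport the resulting attribute is left open here):

{noformat}
import java.security.cert.CertificateEncodingException;
import java.security.cert.X509Certificate;
import java.util.Base64;

final class CertAttributeSketch {
    /** Encodes a client certificate so it can travel as a plain string attribute. */
    static String toBase64(X509Certificate cert) throws CertificateEncodingException {
        return Base64.getEncoder().encodeToString(cert.getEncoded());
    }
}
{noformat}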

> Allow custom authenticators to use SSL certificates
> ---
>
> Key: IGNITE-12049
> URL: https://issues.apache.org/jira/browse/IGNITE-12049
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Ryabov Dmitrii
>Assignee: Ryabov Dmitrii
>Priority: Minor
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Add SSL certificates to AuthenticationContext, so, authenticators can make 
> additional checks based on SSL certificates.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (IGNITE-11797) Fix consistency issues for atomic and mixed tx-atomic cache groups.

2019-11-27 Thread Alexei Scherbakov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-11797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexei Scherbakov reassigned IGNITE-11797:
--

Assignee: Alexei Scherbakov

> Fix consistency issues for atomic and mixed tx-atomic cache groups.
> ---
>
> Key: IGNITE-11797
> URL: https://issues.apache.org/jira/browse/IGNITE-11797
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexei Scherbakov
>Assignee: Alexei Scherbakov
>Priority: Major
>
> IGNITE-10078 only solves consistency problems for tx mode.
> For atomic caches the rebalance consistency issues still remain and should be 
> fixed together with an improvement of the atomic cache protocol consistency.
> Also, we need to disable dynamic start of an atomic cache in a group having only 
> tx caches, because it does not work in its current state.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (IGNITE-12049) Allow custom authenticators to use SSL certificates

2019-11-18 Thread Alexei Scherbakov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976630#comment-16976630
 ] 

Alexei Scherbakov commented on IGNITE-12049:


[~SomeFire]

1. A user can put any value into node attributes: any number of certificates, etc. 
I still do not see the importance of the proposed change, because this can be done 
right now for normal clients by passing certificate(s) via node attributes (a sketch 
follows below). Besides, thin clients do not have node attributes at all, and 
putting only a certificate into the map looks hacky.

3. TestSslSecurityProcessor does nothing besides checking certificate 
existence. I think providing a more realistic example with a description would 
be useful for anyone who might wish to use the feature, and would make it more 
valuable for the community.
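
A minimal sketch of the "pass certificate(s) via node attributes" alternative for normal
(thick) client nodes, using the public IgniteConfiguration#setUserAttributes API; the
attribute name and the way the certificate file is read are assumptions:

{noformat}
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;
import java.util.Collections;

import org.apache.ignite.configuration.IgniteConfiguration;

final class NodeCertAttributeSketch {
    static IgniteConfiguration withClientCert(String derCertPath) throws Exception {
        // Read the DER-encoded certificate and store it as a Base64 node attribute.
        byte[] der = Files.readAllBytes(Paths.get(derCertPath));

        IgniteConfiguration cfg = new IgniteConfiguration();

        cfg.setUserAttributes(Collections.singletonMap(
            "auth.client.cert",                               // hypothetical attribute name
            Base64.getEncoder().encodeToString(der)));

        return cfg;
    }
}
{noformat}

A custom authenticator can then read the value back via ClusterNode#attribute on the
joining node's representation.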

> Allow custom authenticators to use SSL certificates
> ---
>
> Key: IGNITE-12049
> URL: https://issues.apache.org/jira/browse/IGNITE-12049
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Ryabov Dmitrii
>Assignee: Ryabov Dmitrii
>Priority: Minor
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Add SSL certificates to AuthenticationContext, so, authenticators can make 
> additional checks based on SSL certificates.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (IGNITE-12049) Allow custom authenticators to use SSL certificates

2019-11-05 Thread Alexei Scherbakov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16967400#comment-16967400
 ] 

Alexei Scherbakov commented on IGNITE-12049:


[~SomeFire]

I left comments on the PR, please address them.

Some general questions:

1. For "normal" cluster nodes, attributes are already available via 
ClusterNode#attributes, and a user can just set any attribute and use it in a custom 
authenticator without any changes in the core by implementing [1].

Do I understand correctly that the fix is only relevant for thin clients, 
which are authenticated using [2] and have no associated local attributes? 
Shouldn't we instead provide the ability for thin clients to have attributes 
and avoid changing IgniteConfiguration?

2. Why is the new attribute not available during authentication for JDBC/ODBC 
client types?

3. Can you create an example of using a custom authenticator with certificates?

[1] 
org.apache.ignite.internal.processors.security.GridSecurityProcessor#authenticateNode
[2] 
org.apache.ignite.internal.processors.security.GridSecurityProcessor#authenticate










> Allow custom authenticators to use SSL certificates
> ---
>
> Key: IGNITE-12049
> URL: https://issues.apache.org/jira/browse/IGNITE-12049
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Ryabov Dmitrii
>Assignee: Ryabov Dmitrii
>Priority: Minor
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Add SSL certificates to AuthenticationContext, so, authenticators can make 
> additional checks based on SSL certificates.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (IGNITE-12049) Allow custom authenticators to use SSL certificates

2019-11-01 Thread Alexei Scherbakov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-12049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexei Scherbakov updated IGNITE-12049:
---
Reviewer: Alexei Scherbakov

> Allow custom authenticators to use SSL certificates
> ---
>
> Key: IGNITE-12049
> URL: https://issues.apache.org/jira/browse/IGNITE-12049
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Ryabov Dmitrii
>Assignee: Ryabov Dmitrii
>Priority: Minor
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Add SSL certificates to AuthenticationContext, so, authenticators can make 
> additional checks based on SSL certificates.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (IGNITE-12329) Invalid handling of remote entries causes partition desync and transaction hanging in COMMITTING state.

2019-10-31 Thread Alexei Scherbakov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16963904#comment-16963904
 ] 

Alexei Scherbakov commented on IGNITE-12329:


Fixed.

> Invalid handling of remote entries causes partition desync and transaction 
> hanging in COMMITTING state.
> ---
>
> Key: IGNITE-12329
> URL: https://issues.apache.org/jira/browse/IGNITE-12329
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 2.7.6
>Reporter: Alexei Scherbakov
>Assignee: Alexei Scherbakov
>Priority: Major
> Fix For: 2.8
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> This can happen if a transaction is mapped on a partition which is about to be 
> evicted on a backup.
> Due to bugs, an entry belonging to another cache may be excluded from commit, or 
> an entry containing a lock can be removed without lock release, causing dependent 
> transactions to hang.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (IGNITE-12329) Invalid handling of remote entries causes partition desync and transaction hanging in COMMITTING state.

2019-10-31 Thread Alexei Scherbakov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-12329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexei Scherbakov updated IGNITE-12329:
---
Description: 
This can happen if a transaction is mapped on a partition which is about to be 
evicted on a backup.

Due to bugs, an entry belonging to another cache may be excluded from commit, or an 
entry containing a lock can be removed without lock release, causing dependent 
transactions to hang.

  was:
This can happen if transaction is mapped on a partition which is about to be 
evicted on backup.

Due to bugs entry belonging to other cache may be excluded from commit or entry 
containing a lock can be removed without lock release causes depending 
transactions to hang.


> Invalid handling of remote entries causes partition desync and transaction 
> hanging in COMMITTING state.
> ---
>
> Key: IGNITE-12329
> URL: https://issues.apache.org/jira/browse/IGNITE-12329
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 2.7.6
>Reporter: Alexei Scherbakov
>Assignee: Alexei Scherbakov
>Priority: Major
> Fix For: 2.8
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> This can happen if a transaction is mapped on a partition which is about to be 
> evicted on a backup.
> Due to bugs, an entry belonging to another cache may be excluded from commit, or 
> an entry containing a lock can be removed without lock release, causing dependent 
> transactions to hang.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (IGNITE-12329) Invalid handling of remote entries causes partition desync and transaction hanging in COMMITTING state.

2019-10-29 Thread Alexei Scherbakov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962306#comment-16962306
 ] 

Alexei Scherbakov commented on IGNITE-12329:


The contribution also includes a fix for GridDhtLocalPartition equals and 
hashCode.

[~ivan.glukos] Ready for review.

> Invalid handling of remote entries causes partition desync and transaction 
> hanging in COMMITTING state.
> ---
>
> Key: IGNITE-12329
> URL: https://issues.apache.org/jira/browse/IGNITE-12329
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 2.7.6
>Reporter: Alexei Scherbakov
>Assignee: Alexei Scherbakov
>Priority: Major
> Fix For: 2.8
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This can happen if a transaction is mapped on a partition which is about to be 
> evicted on a backup.
> Due to bugs, an entry belonging to another cache may be excluded from commit, or 
> an entry containing a lock can be removed without lock release, causing dependent 
> transactions to hang.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (IGNITE-12328) IgniteException "Failed to resolve nodes topology" during cache.removeAll() and constantly changing topology

2019-10-29 Thread Alexei Scherbakov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961819#comment-16961819
 ] 

Alexei Scherbakov commented on IGNITE-12328:


The contribution also includes fixes for:
1. Pessimistic tx lock request processing over an incomplete topology.
2. Atomic cache remapping on a compatible topology.

[~irakov] Ready for review.

> IgniteException "Failed to resolve nodes topology" during cache.removeAll() 
> and constantly changing topology
> 
>
> Key: IGNITE-12328
> URL: https://issues.apache.org/jira/browse/IGNITE-12328
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 2.7.6
>Reporter: Alexei Scherbakov
>Assignee: Alexei Scherbakov
>Priority: Major
> Fix For: 2.8
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> {noformat}
> [2019-09-25 13:13:58,339][ERROR][TxThread-threadNum-3] Failed to complete 
> transaction.
> org.apache.ignite.IgniteException: Failed to resolve nodes topology 
> [cacheGrp=cache_group_36, topVer=AffinityTopologyVersion [topVer=16, 
> minorTopVer=0], history=[AffinityTopologyVersion [topVer=13, minorTopVer=0], 
> AffinityTopologyVersion [topVer=14, minorTopVer=0], AffinityTopologyVersion 
> [topVer=15, minorTopVer=0]], snap=Snapshot [topVer=AffinityTopologyVersion 
> [topVer=15, minorTopVer=0]], locNode=TcpDiscoveryNode 
> [id=6cbf7666-9a8c-4b61-8b3f-6351ef44bd4a, 
> consistentId=poc-tester-client-172.25.1.21-id-0, addrs=ArrayList 
> [172.25.1.21], sockAddrs=HashSet [lab21.gridgain.local/172.25.1.21:0], 
> discPort=0, order=13, intOrder=0, lastExchangeTime=1569406379934, loc=true, 
> ver=2.5.10#20190922-sha1:02133315, isClient=true]]
>   at 
> org.apache.ignite.internal.managers.discovery.GridDiscoveryManager.resolveDiscoCache(GridDiscoveryManager.java:2125)
>  ~[ignite-core-2.5.10.jar:2.5.10]
>   at 
> org.apache.ignite.internal.managers.discovery.GridDiscoveryManager.cacheGroupAffinityNodes(GridDiscoveryManager.java:2007)
>  ~[ignite-core-2.5.10.jar:2.5.10]
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheUtils.affinityNodes(GridCacheUtils.java:465)
>  ~[ignite-core-2.5.10.jar:2.5.10]
>   at 
> org.apache.ignite.internal.processors.cache.distributed.dht.colocated.GridDhtColocatedLockFuture.map0(GridDhtColocatedLockFuture.java:939)
>  ~[ignite-core-2.5.10.jar:2.5.10]
>   at 
> org.apache.ignite.internal.processors.cache.distributed.dht.colocated.GridDhtColocatedLockFuture.map(GridDhtColocatedLockFuture.java:911)
>  ~[ignite-core-2.5.10.jar:2.5.10]
>   at 
> org.apache.ignite.internal.processors.cache.distributed.dht.colocated.GridDhtColocatedLockFuture.map(GridDhtColocatedLockFuture.java:811)
>  ~[ignite-core-2.5.10.jar:2.5.10]
>   at 
> org.apache.ignite.internal.processors.cache.distributed.dht.colocated.GridDhtColocatedCache.lockAllAsync(GridDhtColocatedCache.java:656)
>  ~[ignite-core-2.5.10.jar:2.5.10]
>   at 
> org.apache.ignite.internal.processors.cache.distributed.GridDistributedCacheAdapter.txLockAsync(GridDistributedCacheAdapter.java:109)
>  ~[ignite-core-2.5.10.jar:2.5.10]
>   at 
> org.apache.ignite.internal.processors.cache.distributed.near.GridNearTxLocal.removeAllAsync0(GridNearTxLocal.java:1648)
>  ~[ignite-core-2.5.10.jar:2.5.10]
>   at 
> org.apache.ignite.internal.processors.cache.distributed.near.GridNearTxLocal.removeAllAsync(GridNearTxLocal.java:521)
>  ~[ignite-core-2.5.10.jar:2.5.10]
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheAdapter$33.inOp(GridCacheAdapter.java:2619)
>  ~[ignite-core-2.5.10.jar:2.5.10]
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheAdapter$SyncInOp.op(GridCacheAdapter.java:4701)
>  ~[ignite-core-2.5.10.jar:2.5.10]
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheAdapter.syncOp(GridCacheAdapter.java:3780)
>  ~[ignite-core-2.5.10.jar:2.5.10]
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheAdapter.removeAll0(GridCacheAdapter.java:2617)
>  ~[ignite-core-2.5.10.jar:2.5.10]
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheAdapter.removeAll(GridCacheAdapter.java:2606)
>  ~[ignite-core-2.5.10.jar:2.5.10]
>   at 
> org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.removeAll(IgniteCacheProxyImpl.java:1553)
>  ~[ignite-core-2.5.10.jar:2.5.10]
>   at 
> org.apache.ignite.internal.processors.cache.GatewayProtectedCacheProxy.removeAll(GatewayProtectedCacheProxy.java:1026)
>  ~[ignite-core-2.5.10.jar:2.5.10]
>   at 
> org.apache.ignite.scenario.TxBalanceTask$TxBody.doTxRemoveAll(TxBalanceTask.java:291)
>  ~[poc-tester-0.1.0-SNAPSHOT.jar:?]
>   at 
> org.apache.ignite.scenario.TxBalanceTask$TxBody.call(TxBalanceTask.java:93) 
> 

[jira] [Updated] (IGNITE-12317) Add EvictionFilter factory support in IgniteConfiguration.

2019-10-29 Thread Alexei Scherbakov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-12317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexei Scherbakov updated IGNITE-12317:
---
Fix Version/s: (was: 2.9)
   2.8

> Add EvictionFilter factory support in IgniteConfiguration.
> --
>
> Key: IGNITE-12317
> URL: https://issues.apache.org/jira/browse/IGNITE-12317
> Project: Ignite
>  Issue Type: Sub-task
>  Components: cache
>Reporter: Nikolai Kulagin
>Assignee: Nikolai Kulagin
>Priority: Major
> Fix For: 2.8
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Some entities in the cache configuration are configured via factories, while 
> others are set directly, for example, the eviction policy and the eviction filter. 
> We need to add new configuration properties for the eviction filter factory and 
> deprecate the old ones (do not remove them, for compatibility).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (IGNITE-12332) Fix flaky test GridCacheAtomicClientInvalidPartitionHandlingSelfTest#testPrimaryFullAsync

2019-10-28 Thread Alexei Scherbakov (Jira)
Alexei Scherbakov created IGNITE-12332:
--

 Summary: Fix flaky test 
GridCacheAtomicClientInvalidPartitionHandlingSelfTest#testPrimaryFullAsync
 Key: IGNITE-12332
 URL: https://issues.apache.org/jira/browse/IGNITE-12332
 Project: Ignite
  Issue Type: Bug
Affects Versions: 2.7.6
Reporter: Alexei Scherbakov
 Fix For: 2.8


Can be reproduced locally with range = 10_000



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (IGNITE-12264) Private application data should not be lit in the logs, exceptions, ERROR, WARN etc.

2019-10-28 Thread Alexei Scherbakov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961095#comment-16961095
 ] 

Alexei Scherbakov commented on IGNITE-12264:


[~KPushenko]

This feature has existed for 3 years [1].
Have you tried enabling it using -DIGNITE_TO_STRING_INCLUDE_SENSITIVE=false (see 
the sketch below)?

[1] https://issues.apache.org/jira/browse/IGNITE-4167
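
For reference, a sketch of enabling the flag programmatically; it is equivalent to
the JVM option above and must be set before the node starts:

{noformat}
import org.apache.ignite.IgniteSystemProperties;
import org.apache.ignite.Ignition;

final class SensitiveDataOffSketch {
    public static void main(String[] args) {
        // Same effect as -DIGNITE_TO_STRING_INCLUDE_SENSITIVE=false on the command line.
        System.setProperty(IgniteSystemProperties.IGNITE_TO_STRING_INCLUDE_SENSITIVE, "false");

        Ignition.start();
    }
}
{noformat}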

> Private application data should not be lit in the logs, exceptions, ERROR, 
> WARN etc.
> 
>
> Key: IGNITE-12264
> URL: https://issues.apache.org/jira/browse/IGNITE-12264
> Project: Ignite
>  Issue Type: Improvement
>Affects Versions: 2.7.6
>Reporter: Pushenko Kirill
>Priority: Major
>
> Private application data should not be exposed in logs, exceptions, ERROR and 
> WARN messages, etc.
> The exceptions contained values that included card numbers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (IGNITE-12329) Invalid handling of remote entries causes partition desync and transaction hanging in COMMITTING state.

2019-10-28 Thread Alexei Scherbakov (Jira)
Alexei Scherbakov created IGNITE-12329:
--

 Summary: Invalid handling of remote entries causes partition 
desync and transaction hanging in COMMITTING state.
 Key: IGNITE-12329
 URL: https://issues.apache.org/jira/browse/IGNITE-12329
 Project: Ignite
  Issue Type: Bug
Affects Versions: 2.7.6
Reporter: Alexei Scherbakov
Assignee: Alexei Scherbakov
 Fix For: 2.8


This can happen if a transaction is mapped on a partition which is about to be 
evicted on a backup.

Due to bugs, an entry belonging to another cache may be excluded from commit, or an 
entry containing a lock can be removed without lock release, causing dependent 
transactions to hang.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (IGNITE-12328) IgniteException "Failed to resolve nodes topology" during cache.removeAll() and constantly changing topology

2019-10-28 Thread Alexei Scherbakov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960816#comment-16960816
 ] 

Alexei Scherbakov commented on IGNITE-12328:


Steps to reproduce (a minimal sketch of the transactional part follows below):

* Start a tx from a client node (A). This client node sees topology version X.

* Start another client node (B) to increment the topology version on the server 
nodes. After that action the topology version on the server nodes will be X + 1.

* The discovery event of node B's join should be delivered to A with a delay.

* Perform a tx put (1,1) from A. Node A sees topology version X, while the server 
nodes see topology version X + 1.

* This put will result in telling node A that the tx should be remapped to 
version X + 1.

* The topology version for this tx on node A is set to X + 1, while node A still 
hasn't received the discovery event for node B's join (version X + 1).

* Perform a tx remove (1) from A. IMPORTANT: we should use a key that is already 
used in the transaction. Otherwise the tx will wait for affinity version X + 1.

* This tx remove results in the assertion mentioned in the ticket, because A 
doesn't see the discovery event of X + 1 or the exchange future corresponding to 
this event, and tries to get the discovery cache of X + 1 that doesn't exist on A yet.
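
A heavily simplified sketch of the transactional part of the scenario (the delayed
delivery of the discovery event needs test-framework hooks and is omitted); only public
Ignite APIs are used, and the cache name and key are arbitrary:

{noformat}
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.transactions.Transaction;
import org.apache.ignite.transactions.TransactionConcurrency;
import org.apache.ignite.transactions.TransactionIsolation;

final class RemoveOnStaleTopologySketch {
    static void run(Ignite clientA) {
        IgniteCache<Integer, Integer> cache = clientA.getOrCreateCache("test");

        // Client A still sees topology version X; a concurrent client join moves servers to X + 1.
        try (Transaction tx = clientA.transactions().txStart(
            TransactionConcurrency.PESSIMISTIC, TransactionIsolation.REPEATABLE_READ)) {

            cache.put(1, 1);  // server replies that the tx must be remapped to X + 1
            cache.remove(1);  // same key: no wait for affinity X + 1, hits the missing disco cache

            tx.commit();
        }
    }
}
{noformat}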

> IgniteException "Failed to resolve nodes topology" during cache.removeAll() 
> and constantly changing topology
> 
>
> Key: IGNITE-12328
> URL: https://issues.apache.org/jira/browse/IGNITE-12328
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 2.7.6
>Reporter: Alexei Scherbakov
>Assignee: Alexei Scherbakov
>Priority: Major
> Fix For: 2.8
>
>
> {noformat}
> [2019-09-25 13:13:58,339][ERROR][TxThread-threadNum-3] Failed to complete 
> transaction.
> org.apache.ignite.IgniteException: Failed to resolve nodes topology 
> [cacheGrp=cache_group_36, topVer=AffinityTopologyVersion [topVer=16, 
> minorTopVer=0], history=[AffinityTopologyVersion [topVer=13, minorTopVer=0], 
> AffinityTopologyVersion [topVer=14, minorTopVer=0], AffinityTopologyVersion 
> [topVer=15, minorTopVer=0]], snap=Snapshot [topVer=AffinityTopologyVersion 
> [topVer=15, minorTopVer=0]], locNode=TcpDiscoveryNode 
> [id=6cbf7666-9a8c-4b61-8b3f-6351ef44bd4a, 
> consistentId=poc-tester-client-172.25.1.21-id-0, addrs=ArrayList 
> [172.25.1.21], sockAddrs=HashSet [lab21.gridgain.local/172.25.1.21:0], 
> discPort=0, order=13, intOrder=0, lastExchangeTime=1569406379934, loc=true, 
> ver=2.5.10#20190922-sha1:02133315, isClient=true]]
>   at 
> org.apache.ignite.internal.managers.discovery.GridDiscoveryManager.resolveDiscoCache(GridDiscoveryManager.java:2125)
>  ~[ignite-core-2.5.10.jar:2.5.10]
>   at 
> org.apache.ignite.internal.managers.discovery.GridDiscoveryManager.cacheGroupAffinityNodes(GridDiscoveryManager.java:2007)
>  ~[ignite-core-2.5.10.jar:2.5.10]
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheUtils.affinityNodes(GridCacheUtils.java:465)
>  ~[ignite-core-2.5.10.jar:2.5.10]
>   at 
> org.apache.ignite.internal.processors.cache.distributed.dht.colocated.GridDhtColocatedLockFuture.map0(GridDhtColocatedLockFuture.java:939)
>  ~[ignite-core-2.5.10.jar:2.5.10]
>   at 
> org.apache.ignite.internal.processors.cache.distributed.dht.colocated.GridDhtColocatedLockFuture.map(GridDhtColocatedLockFuture.java:911)
>  ~[ignite-core-2.5.10.jar:2.5.10]
>   at 
> org.apache.ignite.internal.processors.cache.distributed.dht.colocated.GridDhtColocatedLockFuture.map(GridDhtColocatedLockFuture.java:811)
>  ~[ignite-core-2.5.10.jar:2.5.10]
>   at 
> org.apache.ignite.internal.processors.cache.distributed.dht.colocated.GridDhtColocatedCache.lockAllAsync(GridDhtColocatedCache.java:656)
>  ~[ignite-core-2.5.10.jar:2.5.10]
>   at 
> org.apache.ignite.internal.processors.cache.distributed.GridDistributedCacheAdapter.txLockAsync(GridDistributedCacheAdapter.java:109)
>  ~[ignite-core-2.5.10.jar:2.5.10]
>   at 
> org.apache.ignite.internal.processors.cache.distributed.near.GridNearTxLocal.removeAllAsync0(GridNearTxLocal.java:1648)
>  ~[ignite-core-2.5.10.jar:2.5.10]
>   at 
> org.apache.ignite.internal.processors.cache.distributed.near.GridNearTxLocal.removeAllAsync(GridNearTxLocal.java:521)
>  ~[ignite-core-2.5.10.jar:2.5.10]
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheAdapter$33.inOp(GridCacheAdapter.java:2619)
>  ~[ignite-core-2.5.10.jar:2.5.10]
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheAdapter$SyncInOp.op(GridCacheAdapter.java:4701)
>  ~[ignite-core-2.5.10.jar:2.5.10]
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheAdapter.syncOp(GridCacheAdapter.java:3780)
>  ~[ignite-core-2.5.10.jar:2.5.10]
>   at 
> 

[jira] [Created] (IGNITE-12328) IgniteException "Failed to resolve nodes topology" during cache.removeAll() and constantly changing topology

2019-10-28 Thread Alexei Scherbakov (Jira)
Alexei Scherbakov created IGNITE-12328:
--

 Summary: IgniteException "Failed to resolve nodes topology" during 
cache.removeAll() and constantly changing topology
 Key: IGNITE-12328
 URL: https://issues.apache.org/jira/browse/IGNITE-12328
 Project: Ignite
  Issue Type: Bug
Affects Versions: 2.7.6
Reporter: Alexei Scherbakov
Assignee: Alexei Scherbakov
 Fix For: 2.8


{noformat}
[2019-09-25 13:13:58,339][ERROR][TxThread-threadNum-3] Failed to complete 
transaction.
org.apache.ignite.IgniteException: Failed to resolve nodes topology 
[cacheGrp=cache_group_36, topVer=AffinityTopologyVersion [topVer=16, 
minorTopVer=0], history=[AffinityTopologyVersion [topVer=13, minorTopVer=0], 
AffinityTopologyVersion [topVer=14, minorTopVer=0], AffinityTopologyVersion 
[topVer=15, minorTopVer=0]], snap=Snapshot [topVer=AffinityTopologyVersion 
[topVer=15, minorTopVer=0]], locNode=TcpDiscoveryNode 
[id=6cbf7666-9a8c-4b61-8b3f-6351ef44bd4a, 
consistentId=poc-tester-client-172.25.1.21-id-0, addrs=ArrayList [172.25.1.21], 
sockAddrs=HashSet [lab21.gridgain.local/172.25.1.21:0], discPort=0, order=13, 
intOrder=0, lastExchangeTime=1569406379934, loc=true, 
ver=2.5.10#20190922-sha1:02133315, isClient=true]]
at 
org.apache.ignite.internal.managers.discovery.GridDiscoveryManager.resolveDiscoCache(GridDiscoveryManager.java:2125)
 ~[ignite-core-2.5.10.jar:2.5.10]
at 
org.apache.ignite.internal.managers.discovery.GridDiscoveryManager.cacheGroupAffinityNodes(GridDiscoveryManager.java:2007)
 ~[ignite-core-2.5.10.jar:2.5.10]
at 
org.apache.ignite.internal.processors.cache.GridCacheUtils.affinityNodes(GridCacheUtils.java:465)
 ~[ignite-core-2.5.10.jar:2.5.10]
at 
org.apache.ignite.internal.processors.cache.distributed.dht.colocated.GridDhtColocatedLockFuture.map0(GridDhtColocatedLockFuture.java:939)
 ~[ignite-core-2.5.10.jar:2.5.10]
at 
org.apache.ignite.internal.processors.cache.distributed.dht.colocated.GridDhtColocatedLockFuture.map(GridDhtColocatedLockFuture.java:911)
 ~[ignite-core-2.5.10.jar:2.5.10]
at 
org.apache.ignite.internal.processors.cache.distributed.dht.colocated.GridDhtColocatedLockFuture.map(GridDhtColocatedLockFuture.java:811)
 ~[ignite-core-2.5.10.jar:2.5.10]
at 
org.apache.ignite.internal.processors.cache.distributed.dht.colocated.GridDhtColocatedCache.lockAllAsync(GridDhtColocatedCache.java:656)
 ~[ignite-core-2.5.10.jar:2.5.10]
at 
org.apache.ignite.internal.processors.cache.distributed.GridDistributedCacheAdapter.txLockAsync(GridDistributedCacheAdapter.java:109)
 ~[ignite-core-2.5.10.jar:2.5.10]
at 
org.apache.ignite.internal.processors.cache.distributed.near.GridNearTxLocal.removeAllAsync0(GridNearTxLocal.java:1648)
 ~[ignite-core-2.5.10.jar:2.5.10]
at 
org.apache.ignite.internal.processors.cache.distributed.near.GridNearTxLocal.removeAllAsync(GridNearTxLocal.java:521)
 ~[ignite-core-2.5.10.jar:2.5.10]
at 
org.apache.ignite.internal.processors.cache.GridCacheAdapter$33.inOp(GridCacheAdapter.java:2619)
 ~[ignite-core-2.5.10.jar:2.5.10]
at 
org.apache.ignite.internal.processors.cache.GridCacheAdapter$SyncInOp.op(GridCacheAdapter.java:4701)
 ~[ignite-core-2.5.10.jar:2.5.10]
at 
org.apache.ignite.internal.processors.cache.GridCacheAdapter.syncOp(GridCacheAdapter.java:3780)
 ~[ignite-core-2.5.10.jar:2.5.10]
at 
org.apache.ignite.internal.processors.cache.GridCacheAdapter.removeAll0(GridCacheAdapter.java:2617)
 ~[ignite-core-2.5.10.jar:2.5.10]
at 
org.apache.ignite.internal.processors.cache.GridCacheAdapter.removeAll(GridCacheAdapter.java:2606)
 ~[ignite-core-2.5.10.jar:2.5.10]
at 
org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.removeAll(IgniteCacheProxyImpl.java:1553)
 ~[ignite-core-2.5.10.jar:2.5.10]
at 
org.apache.ignite.internal.processors.cache.GatewayProtectedCacheProxy.removeAll(GatewayProtectedCacheProxy.java:1026)
 ~[ignite-core-2.5.10.jar:2.5.10]
at 
org.apache.ignite.scenario.TxBalanceTask$TxBody.doTxRemoveAll(TxBalanceTask.java:291)
 ~[poc-tester-0.1.0-SNAPSHOT.jar:?]
at 
org.apache.ignite.scenario.TxBalanceTask$TxBody.call(TxBalanceTask.java:93) 
~[poc-tester-0.1.0-SNAPSHOT.jar:?]
at 
org.apache.ignite.scenario.TxBalanceTask$TxBody.call(TxBalanceTask.java:70) 
~[poc-tester-0.1.0-SNAPSHOT.jar:?]
at 
org.apache.ignite.scenario.internal.AbstractTxTask.doInTransaction(AbstractTxTask.java:290)
 ~[poc-tester-0.1.0-SNAPSHOT.jar:?]
at 
org.apache.ignite.scenario.internal.AbstractTxTask.access$400(AbstractTxTask.java:56)
 ~[poc-tester-0.1.0-SNAPSHOT.jar:?]
at 
org.apache.ignite.scenario.internal.AbstractTxTask$TxRunner.call(AbstractTxTask.java:470)
 [poc-tester-0.1.0-SNAPSHOT.jar:?]
at 

[jira] [Resolved] (IGNITE-12327) Cross-cache tx is mapped on wrong primary when enlisted caches have incompatible assignments.

2019-10-28 Thread Alexei Scherbakov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-12327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexei Scherbakov resolved IGNITE-12327.

Resolution: Won't Fix

Already fixed in IGNITE-12038

> Cross-cache tx is mapped on wrong primary when enlisted caches have 
> incompatible assignments.
> -
>
> Key: IGNITE-12327
> URL: https://issues.apache.org/jira/browse/IGNITE-12327
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 2.7.6
>Reporter: Alexei Scherbakov
>Assignee: Alexei Scherbakov
>Priority: Major
> Fix For: 2.8
>
>
> This happens when a supplier node leaves while rebalancing is only partially 
> completed on the demander.
> Suppose we have 2 cache groups and rebalancing is in progress: for the first group 
> rebalancing is done, and for the second group it is partially done (some 
> partitions are still MOVING).
> At this moment the supplier node dies and the corresponding topology version is (N,0).
> The new assignment is computed using the current state of partitions; for the first 
> group it will be ideal and the same as for the next topology (N,1), which will be 
> triggered after all rebalancing is completed by CacheAffinityChangeMessage.
> For the second group the affinity will not be ideal.
> If a transaction is started while PME is in progress (N,0)->(N,1), the first lock 
> request will pass the remap check if it enlists the rebalanced group. All 
> subsequent lock requests will use the invalid topology from the previous assignment.
> Possible fix: return the actual locked topology version from the first lock request 
> and use it for all subsequent requests.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (IGNITE-12327) Cross-cache tx is mapped on wrong primary when enlisted caches have incompatible assignments.

2019-10-25 Thread Alexei Scherbakov (Jira)
Alexei Scherbakov created IGNITE-12327:
--

 Summary: Cross-cache tx is mapped on wrong primary when enlisted 
caches have incompatible assignments.
 Key: IGNITE-12327
 URL: https://issues.apache.org/jira/browse/IGNITE-12327
 Project: Ignite
  Issue Type: Bug
Affects Versions: 2.7.6
Reporter: Alexei Scherbakov
Assignee: Alexei Scherbakov
 Fix For: 2.8


This happens when a supplier node leaves while rebalancing is only partially 
completed on the demander.

Suppose we have 2 cache groups and rebalancing is in progress: for the first group 
rebalancing is done, and for the second group it is partially done (some 
partitions are still MOVING).
At this moment the supplier node dies and the corresponding topology version is (N,0).
The new assignment is computed using the current state of partitions; for the first 
group it will be ideal and the same as for the next topology (N,1), which will be 
triggered after all rebalancing is completed by CacheAffinityChangeMessage.
For the second group the affinity will not be ideal.

If a transaction is started while PME is in progress (N,0)->(N,1), the first lock 
request will pass the remap check if it enlists the rebalanced group. All subsequent 
lock requests will use the invalid topology from the previous assignment.

Possible fix: return the actual locked topology version from the first lock request and 
use it for all subsequent requests.
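
For illustration, a minimal sketch of the proposed fix under stated assumptions: the 
class and method names below are hypothetical, not Ignite internals (only 
AffinityTopologyVersion is a real Ignite class). The first lock response pins the 
topology version actually locked on the primary, and all subsequent lock requests 
reuse it instead of the per-group affinity version:
{noformat}
import org.apache.ignite.internal.processors.affinity.AffinityTopologyVersion;

// Hypothetical helper, not actual Ignite code: pins the topology version returned
// by the first lock request and reuses it for every subsequent request of the tx.
class TxLockedTopologyPin {
    private volatile AffinityTopologyVersion lockedVer;

    void onFirstLockResponse(AffinityTopologyVersion verFromPrimary) {
        lockedVer = verFromPrimary; // Pin once, on the first successful lock.
    }

    AffinityTopologyVersion versionForNextLockRequest(AffinityTopologyVersion computed) {
        return lockedVer != null ? lockedVer : computed;
    }
}
{noformat}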



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (IGNITE-11704) Write tombstones during rebalance to get rid of deferred delete buffer

2019-10-17 Thread Alexei Scherbakov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-11704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16953592#comment-16953592
 ] 

Alexei Scherbakov commented on IGNITE-11704:


[~jokser]

Looks good.

> Write tombstones during rebalance to get rid of deferred delete buffer
> --
>
> Key: IGNITE-11704
> URL: https://issues.apache.org/jira/browse/IGNITE-11704
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexey Goncharuk
>Assignee: Pavel Kovalenko
>Priority: Major
>  Labels: rebalance
> Fix For: 2.8
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently Ignite relies on deferred delete buffer in order to handle 
> write-remove conflicts during rebalance. Given the limit size of the buffer, 
> this approach is fundamentally flawed, especially in case when persistence is 
> enabled.
> I suggest to extend the logic of data storage to be able to store key 
> tombstones - to keep version for deleted entries. The tombstones will be 
> stored when rebalance is in progress and should be cleaned up when rebalance 
> is completed.
> Later this approach may be used to implement fast partition rebalance based 
> on merkle trees (in this case, tombstones should be written on an incomplete 
> baseline).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (IGNITE-11704) Write tombstones during rebalance to get rid of deferred delete buffer

2019-10-15 Thread Alexei Scherbakov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-11704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16952020#comment-16952020
 ] 

Alexei Scherbakov commented on IGNITE-11704:


[~jokser]

4. Typo in the comment. My understanding is that the code will be called when the 
data streamer initiates the first update for an entry, is that true ?
6. 
* It looks like it's not necessary to preload 256k keys for historical rebalance; 
you need only one update in each partition. 
* The test looks similar, but my idea is to delay each batch, remove all keys 
contained in the batch, and then release the batch. Such a scenario should turn 
all partition keys into tombstones and looks interesting.

In other aspects it looks good.


> Write tombstones during rebalance to get rid of deferred delete buffer
> --
>
> Key: IGNITE-11704
> URL: https://issues.apache.org/jira/browse/IGNITE-11704
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexey Goncharuk
>Assignee: Pavel Kovalenko
>Priority: Major
>  Labels: rebalance
> Fix For: 2.8
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently Ignite relies on deferred delete buffer in order to handle 
> write-remove conflicts during rebalance. Given the limit size of the buffer, 
> this approach is fundamentally flawed, especially in case when persistence is 
> enabled.
> I suggest to extend the logic of data storage to be able to store key 
> tombstones - to keep version for deleted entries. The tombstones will be 
> stored when rebalance is in progress and should be cleaned up when rebalance 
> is completed.
> Later this approach may be used to implement fast partition rebalance based 
> on merkle trees (in this case, tombstones should be written on an incomplete 
> baseline).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (IGNITE-9913) Prevent data updates blocking in case of backup BLT server node leave

2019-10-12 Thread Alexei Scherbakov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-9913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16950019#comment-16950019
 ] 

Alexei Scherbakov edited comment on IGNITE-9913 at 10/12/19 12:14 PM:
--

[~avinogradov]

I've reviewed the changes. It seems to follow the architecture we discussed privately.
I left comments in the PR.
Besides that, you should definitely add more tests.
Important scenarios might be (a rough sketch of scenario 2 is below):
1. A baseline node leaves under tx load while rebalancing is in progress 
(rebalancing is due to other nodes joining).
2. Owners leave one by one under tx load until a subset of partitions has a 
single owner.
3. Owners leave one by one under tx load until a subset of partitions has no 
owner. Validate partition loss.

All tests should check partition integrity: see 
org.apache.ignite.testframework.junits.common.GridCommonAbstractTest#assertPartitionsSame

Do you have plans to implement non-blocking mapping for transactions not 
affected by the topology change in the same ticket ?

Let me know if some personal discussion is required.
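
For scenario 2, a rough public-API sketch of the intended load pattern. Instance 
names, key ranges and iteration counts are arbitrary, and the partition-integrity 
check via assertPartitionsSame belongs to the internal test framework and is not shown:
{noformat}
import java.util.concurrent.ThreadLocalRandom;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CacheAtomicityMode;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.transactions.Transaction;

public class OwnersLeaveUnderTxLoadSketch {
    public static void main(String[] args) {
        Ignite[] srvs = new Ignite[3];

        for (int i = 0; i < srvs.length; i++)
            srvs[i] = Ignition.start(new IgniteConfiguration().setIgniteInstanceName("srv-" + i));

        IgniteCache<Integer, Integer> cache = srvs[0].getOrCreateCache(
            new CacheConfiguration<Integer, Integer>("c")
                .setBackups(1)
                .setAtomicityMode(CacheAtomicityMode.TRANSACTIONAL));

        // Alternate bursts of tx load with stopping one owner at a time,
        // until a subset of partitions is left with a single owner.
        for (int stopped = srvs.length - 1; stopped >= 1; stopped--) {
            for (int i = 0; i < 1_000; i++) {
                try (Transaction tx = srvs[0].transactions().txStart()) {
                    cache.put(ThreadLocalRandom.current().nextInt(10_000), i);

                    tx.commit();
                }
            }

            Ignition.stop("srv-" + stopped, true);

            // Here the test would verify partition integrity (assertPartitionsSame).
        }
    }
}
{noformat}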




was (Author: ascherbakov):
[~avinogradov]

I've reviewed the changes. It seems to follow the architecture we discussed privately.
I left comments in the PR.
Besides that, you should definitely add more tests.
Important scenarios might be:
1. A baseline node leaves under tx load while rebalancing is in progress 
(rebalancing is due to other nodes joining).
2. Owners leave one by one under tx load until a subset of partitions has a 
single owner.
3. Owners leave one by one under tx load until a subset of partitions has no 
owner. Validate partition loss.

All tests should check partition integrity: see 
org.apache.ignite.testframework.junits.common.GridCommonAbstractTest#assertPartitionsSame

Do you have plans to implement non-blocking mapping for transactions not 
affected by the topology change in the same ticket ?




> Prevent data updates blocking in case of backup BLT server node leave
> -
>
> Key: IGNITE-9913
> URL: https://issues.apache.org/jira/browse/IGNITE-9913
> Project: Ignite
>  Issue Type: Improvement
>  Components: general
>Reporter: Ivan Rakov
>Assignee: Anton Vinogradov
>Priority: Major
> Fix For: 2.8
>
> Attachments: 9913_yardstick.png, master_yardstick.png
>
>  Time Spent: 6h 50m
>  Remaining Estimate: 0h
>
> Ignite cluster performs distributed partition map exchange when any server 
> node leaves or joins the topology.
> Distributed PME blocks all updates and may take a long time. If all 
> partitions are assigned according to the baseline topology and server node 
> leaves, there's no actual need to perform distributed PME: every cluster node 
> is able to recalculate new affinity assigments and partition states locally. 
> If we'll implement such lightweight PME and handle mapping and lock requests 
> on new topology version correctly, updates won't be stopped (except updates 
> of partitions that lost their primary copy).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (IGNITE-9913) Prevent data updates blocking in case of backup BLT server node leave

2019-10-12 Thread Alexei Scherbakov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-9913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16950019#comment-16950019
 ] 

Alexei Scherbakov commented on IGNITE-9913:
---

[~avinogradov]

I've reviewed the changes. It seems to follow the architecture we discussed privately.
I left comments in the PR.
Besides that, you should definitely add more tests.
Important scenarios might be:
1. A baseline node leaves under tx load while rebalancing is in progress 
(rebalancing is due to other nodes joining).
2. Owners leave one by one under tx load until a subset of partitions has a 
single owner.
3. Owners leave one by one under tx load until a subset of partitions has no 
owner. Validate partition loss.

All tests should check partition integrity: see 
org.apache.ignite.testframework.junits.common.GridCommonAbstractTest#assertPartitionsSame

Do you have plans to implement non-blocking mapping for transactions not 
affected by the topology change in the same ticket ?




> Prevent data updates blocking in case of backup BLT server node leave
> -
>
> Key: IGNITE-9913
> URL: https://issues.apache.org/jira/browse/IGNITE-9913
> Project: Ignite
>  Issue Type: Improvement
>  Components: general
>Reporter: Ivan Rakov
>Assignee: Anton Vinogradov
>Priority: Major
> Fix For: 2.8
>
> Attachments: 9913_yardstick.png, master_yardstick.png
>
>  Time Spent: 6h 50m
>  Remaining Estimate: 0h
>
> Ignite cluster performs distributed partition map exchange when any server 
> node leaves or joins the topology.
> Distributed PME blocks all updates and may take a long time. If all 
> partitions are assigned according to the baseline topology and server node 
> leaves, there's no actual need to perform distributed PME: every cluster node 
> is able to recalculate new affinity assigments and partition states locally. 
> If we'll implement such lightweight PME and handle mapping and lock requests 
> on new topology version correctly, updates won't be stopped (except updates 
> of partitions that lost their primary copy).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (IGNITE-12117) Historical rebalance should NOT be processed in striped way

2019-10-11 Thread Alexei Scherbakov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-12117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexei Scherbakov reassigned IGNITE-12117:
--

Assignee: Alexei Scherbakov

> Historical rebalance should NOT be processed in striped way
> ---
>
> Key: IGNITE-12117
> URL: https://issues.apache.org/jira/browse/IGNITE-12117
> Project: Ignite
>  Issue Type: Task
>Reporter: Anton Vinogradov
>Assignee: Alexei Scherbakov
>Priority: Major
>  Labels: iep-16
> Fix For: 2.9
>
>
> Test 
> {{org.apache.ignite.internal.processors.cache.transactions.TxPartitionCounterStateConsistencyTest#testPartitionConsistencyWithBackupsRestart}}
>  has a failure on an attempt to handle historical rebalance using an un-striped pool.
> You can reproduce it by replacing
> {noformat}
>  if (historical) // Can not be reordered.
> 
> ctx.kernalContext().getStripedRebalanceExecutorService().execute(r, 
> Math.abs(nodeId.hashCode()));
> {noformat}
> with
> {noformat}
>  if (historical) // Can be reordered?
> ctx.kernalContext().getRebalanceExecutorService().execute(r);
> {noformat}
> and you will get the following
> {noformat}
> java.lang.AssertionError: idle_verify failed on 1 node.
> idle_verify check has finished, found 7 conflict partitions: 
> [counterConflicts=0, hashConflicts=7]
> Hash conflicts:
> Conflict partition: PartitionKeyV2 [grpId=1544803905, grpName=default, 
> partId=23]
> Partition instances: [PartitionHashRecordV2 [isPrimary=false, 
> consistentId=nodetransactions.TxPartitionCounterStateConsistencyHistoryRebalanceTest1,
>  updateCntr=707143, partitionState=OWNING, size=495, partHash=-1503789370], 
> PartitionHashRecordV2 [isPrimary=false, 
> consistentId=nodetransactions.TxPartitionCounterStateConsistencyHistoryRebalanceTest2,
>  updateCntr=707143, partitionState=OWNING, size=494, partHash=-1538739200]]
> Conflict partition: PartitionKeyV2 [grpId=1544803905, grpName=default, 
> partId=8]
> 
> {noformat}
> So, we need to investigate reasons and provide proper historical rebalance 
> refactoring to use the unstriped pool, if possible.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (IGNITE-11704) Write tombstones during rebalance to get rid of deferred delete buffer

2019-10-10 Thread Alexei Scherbakov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-11704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16948360#comment-16948360
 ] 

Alexei Scherbakov commented on IGNITE-11704:


6. GridDhtLocalPartition.clearTombstones looks very similar to 
GridDhtLocalPartition.clearAll. 
Could we avoid code duplication ?

> Write tombstones during rebalance to get rid of deferred delete buffer
> --
>
> Key: IGNITE-11704
> URL: https://issues.apache.org/jira/browse/IGNITE-11704
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexey Goncharuk
>Assignee: Pavel Kovalenko
>Priority: Major
>  Labels: rebalance
> Fix For: 2.8
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently Ignite relies on deferred delete buffer in order to handle 
> write-remove conflicts during rebalance. Given the limit size of the buffer, 
> this approach is fundamentally flawed, especially in case when persistence is 
> enabled.
> I suggest to extend the logic of data storage to be able to store key 
> tombstones - to keep version for deleted entries. The tombstones will be 
> stored when rebalance is in progress and should be cleaned up when rebalance 
> is completed.
> Later this approach may be used to implement fast partition rebalance based 
> on merkle trees (in this case, tombstones should be written on an incomplete 
> baseline).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (IGNITE-11704) Write tombstones during rebalance to get rid of deferred delete buffer

2019-10-08 Thread Alexei Scherbakov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-11704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16946592#comment-16946592
 ] 

Alexei Scherbakov commented on IGNITE-11704:


5. I would add one more load test scenario:
Start a node, backups=1.
Load many keys (like 100k).
Join another node, triggering rebalance.
Delay each batch, remove the keys supplied in the batch, then release the batch.
Validate that the cache is empty and tombstones are cleared.
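
A rough skeleton of this scenario using only the public API (names are arbitrary; 
intercepting and delaying individual rebalance supply batches requires test-only 
communication SPI hooks and is not shown, and tombstone cleanup would be verified via 
internal metrics):
{noformat}
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.IgniteDataStreamer;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CacheAtomicityMode;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class TombstoneLoadScenarioSketch {
    public static void main(String[] args) {
        Ignite node1 = Ignition.start(new IgniteConfiguration().setIgniteInstanceName("node1"));

        IgniteCache<Integer, Integer> cache = node1.getOrCreateCache(
            new CacheConfiguration<Integer, Integer>("c")
                .setBackups(1)
                .setAtomicityMode(CacheAtomicityMode.TRANSACTIONAL));

        // Load many keys (100k in the scenario above).
        try (IgniteDataStreamer<Integer, Integer> streamer = node1.dataStreamer("c")) {
            for (int i = 0; i < 100_000; i++)
                streamer.addData(i, i);
        }

        // The joining node triggers rebalance; each supply batch would be delayed here.
        Ignite node2 = Ignition.start(new IgniteConfiguration().setIgniteInstanceName("node2"));

        // Remove the keys supplied in the (delayed) batch, then release the batch.
        for (int i = 0; i < 100_000; i++)
            cache.remove(i);

        // Validate: the cache is empty; tombstone cleanup is checked separately.
        assert cache.size() == 0;
    }
}
{noformat}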


> Write tombstones during rebalance to get rid of deferred delete buffer
> --
>
> Key: IGNITE-11704
> URL: https://issues.apache.org/jira/browse/IGNITE-11704
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexey Goncharuk
>Assignee: Pavel Kovalenko
>Priority: Major
>  Labels: rebalance
> Fix For: 2.8
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently Ignite relies on deferred delete buffer in order to handle 
> write-remove conflicts during rebalance. Given the limit size of the buffer, 
> this approach is fundamentally flawed, especially in case when persistence is 
> enabled.
> I suggest to extend the logic of data storage to be able to store key 
> tombstones - to keep version for deleted entries. The tombstones will be 
> stored when rebalance is in progress and should be cleaned up when rebalance 
> is completed.
> Later this approach may be used to implement fast partition rebalance based 
> on merkle trees (in this case, tombstones should be written on an incomplete 
> baseline).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (IGNITE-11704) Write tombstones during rebalance to get rid of deferred delete buffer

2019-10-08 Thread Alexei Scherbakov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-11704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16946165#comment-16946165
 ] 

Alexei Scherbakov edited comment on IGNITE-11704 at 10/8/19 7:10 AM:
-

[~jokser] [~sboikov]

I've reviewed the changes. Overall it looks good, but I still have some questions.

1. My main concern is the necessity of the 5-byte tombstoneBytes object. 
It seems possible to implement a tombstone by treating the absence of a value as a 
tombstone.
For example, valLen=0 could be treated as tombstone presence. Doing so, we can 
get rid of the 5-byte comparison and instead do a simple length check:
{noformat}
private Boolean isTombstone(ByteBuffer buf, int offset) {
    int valLen = buf.getInt(buf.position() + offset);

    if (valLen != tombstoneBytes.length)
        return Boolean.FALSE;

    ...
}
{noformat}

Instead we can do something like {{if (valLen == 0) return Boolean.TRUE}} (see the 
sketch after this list).

2. With the new changes in PartitionsEvictManager it's possible to have two tasks 
of different types for the same partition.
Consider a scenario: 
* a node finishes rebalancing and starts to clear tombstones
* another node joins the topology and becomes an owner of the partition being cleared
* eviction is started for the partition that is already being cleared

Probably this should not be allowed.

3. I see changes having no obvious relation to the contribution, for example: 
static String cacheGroupMetricsRegistryName(String cacheGrp)
DropCacheContextDuringEvictionTest.java
GridCommandHandlerIndexingTest.java

What's the purpose of these ?

4. Could you clarify the change in 
org.apache.ignite.internal.processors.cache.GridCacheMapEntry#initialValue:

update0 |= (!preload && val == null); ?
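
A minimal sketch of the alternative suggested in point 1 (the method is hypothetical 
and not the actual PR code; it assumes a zero-length stored value is used as the 
tombstone marker):
{noformat}
// Hypothetical variant of the check above: a zero-length value marks a tombstone,
// so no comparison against a dedicated 5-byte tombstoneBytes array is needed.
private boolean isTombstone(ByteBuffer buf, int offset) {
    int valLen = buf.getInt(buf.position() + offset);

    return valLen == 0;
}
{noformat}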




was (Author: ascherbakov):
[~jokser] [~sboikov]

I've reviewed the changes. Overall it looks good, but I still have some questions.

1. My main concern is the necessity of the 5-byte tombstoneBytes object. 
It seems possible to implement a tombstone by treating the absence of a value as a 
tombstone.
For example, valLen=0 could be treated as tombstone presence. Doing so, we can 
get rid of the 5-byte comparison and instead do a simple length check:
{noformat}
private Boolean isTombstone(ByteBuffer buf, int offset) {
    int valLen = buf.getInt(buf.position() + offset);

    if (valLen != tombstoneBytes.length)
        return Boolean.FALSE;

    ...
}
{noformat}

Instead we can do something like {{if (valLen == 0) return true}}

2. With the new changes in PartitionsEvictManager it's possible to have two tasks 
of different types for the same partition.
Consider a scenario: 
* a node finishes rebalancing and starts to clear tombstones
* another node joins the topology and becomes an owner of the partition being cleared
* eviction is started for the partition that is already being cleared

Probably this should not be allowed.

3. I see changes having no obvious relation to the contribution, for example: 
static String cacheGroupMetricsRegistryName(String cacheGrp)
DropCacheContextDuringEvictionTest.java
GridCommandHandlerIndexingTest.java

What's the purpose of these ?

4. Could you clarify the change in 
org.apache.ignite.internal.processors.cache.GridCacheMapEntry#initialValue:

update0 |= (!preload && val == null); ?



> Write tombstones during rebalance to get rid of deferred delete buffer
> --
>
> Key: IGNITE-11704
> URL: https://issues.apache.org/jira/browse/IGNITE-11704
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexey Goncharuk
>Assignee: Pavel Kovalenko
>Priority: Major
>  Labels: rebalance
> Fix For: 2.8
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently Ignite relies on deferred delete buffer in order to handle 
> write-remove conflicts during rebalance. Given the limit size of the buffer, 
> this approach is fundamentally flawed, especially in case when persistence is 
> enabled.
> I suggest to extend the logic of data storage to be able to store key 
> tombstones - to keep version for deleted entries. The tombstones will be 
> stored when rebalance is in progress and should be cleaned up when rebalance 
> is completed.
> Later this approach may be used to implement fast partition rebalance based 
> on merkle trees (in this case, tombstones should be written on an incomplete 
> baseline).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (IGNITE-11704) Write tombstones during rebalance to get rid of deferred delete buffer

2019-10-08 Thread Alexei Scherbakov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-11704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16946165#comment-16946165
 ] 

Alexei Scherbakov edited comment on IGNITE-11704 at 10/8/19 7:04 AM:
-

[~jokser] [~sboikov]

I've reviewed the changes. Overall it looks good, but I still have some questions.

1. My main concern is the necessity of the 5-byte tombstoneBytes object. 
It seems possible to implement a tombstone by treating the absence of a value as a 
tombstone.
For example, valLen=0 could be treated as tombstone presence. Doing so, we can 
get rid of the 5-byte comparison and instead do a simple length check:
{noformat}
private Boolean isTombstone(ByteBuffer buf, int offset) {
    int valLen = buf.getInt(buf.position() + offset);

    if (valLen != tombstoneBytes.length)
        return Boolean.FALSE;

    ...
}
{noformat}

Instead we can do something like {{if (valLen == 0) return true}}

2. With the new changes in PartitionsEvictManager it's possible to have two tasks 
of different types for the same partition.
Consider a scenario: 
* a node finishes rebalancing and starts to clear tombstones
* another node joins the topology and becomes an owner of the partition being cleared
* eviction is started for the partition that is already being cleared

Probably this should not be allowed.

3. I see changes having no obvious relation to the contribution, for example: 
static String cacheGroupMetricsRegistryName(String cacheGrp)
DropCacheContextDuringEvictionTest.java
GridCommandHandlerIndexingTest.java

What's the purpose of these ?

4. Could you clarify the change in 
org.apache.ignite.internal.processors.cache.GridCacheMapEntry#initialValue:

update0 |= (!preload && val == null); ?




was (Author: ascherbakov):
[~jokser] [~sboikov]

I've reviewed the changes. Overall it looks good, but I still have some questions.

1. My main concern is the necessity of the 5-byte tombstoneBytes object. 
It seems possible to implement a tombstone by treating the absence of a value as a 
tombstone.
For example, valLen=0 could be treated as tombstone presence. Doing so, we can 
get rid of the 5-byte comparison and instead do a simple length check:
{noformat}
private Boolean isTombstone(ByteBuffer buf, int offset) {
    int valLen = buf.getInt(buf.position() + offset);

    if (valLen != tombstoneBytes.length)
        return Boolean.FALSE;

    ...
}
{noformat}

Instead we can do something like {{if (valLen == 0) return true}}

2. With the new changes in PartitionsEvictManager it's possible to have two tasks 
of different types for the same partition.
Consider a scenario: 
* a node finishes rebalancing and starts to clear tombstones
* another node joins the topology and becomes an owner of the partition being cleared
* eviction is started for the partition that is already being cleared

Probably this should not be allowed.

3. I see changes having no obvious relation to the contribution, for example: 
static String cacheGroupMetricsRegistryName(String cacheGrp)
DropCacheContextDuringEvictionTest.java
GridCommandHandlerIndexingTest.java

What's the purpose of these ?

4. Could you explain the modification in 
org.apache.ignite.internal.processors.cache.GridCacheMapEntry#initialValue:

update0 |= (!preload && val == null); ?



> Write tombstones during rebalance to get rid of deferred delete buffer
> --
>
> Key: IGNITE-11704
> URL: https://issues.apache.org/jira/browse/IGNITE-11704
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexey Goncharuk
>Assignee: Pavel Kovalenko
>Priority: Major
>  Labels: rebalance
> Fix For: 2.8
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently Ignite relies on deferred delete buffer in order to handle 
> write-remove conflicts during rebalance. Given the limit size of the buffer, 
> this approach is fundamentally flawed, especially in case when persistence is 
> enabled.
> I suggest to extend the logic of data storage to be able to store key 
> tombstones - to keep version for deleted entries. The tombstones will be 
> stored when rebalance is in progress and should be cleaned up when rebalance 
> is completed.
> Later this approach may be used to implement fast partition rebalance based 
> on merkle trees (in this case, tombstones should be written on an incomplete 
> baseline).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (IGNITE-11704) Write tombstones during rebalance to get rid of deferred delete buffer

2019-10-07 Thread Alexei Scherbakov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-11704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16946165#comment-16946165
 ] 

Alexei Scherbakov commented on IGNITE-11704:


[~jokser] [~sboikov]

I've reviewed the changes. Overall it looks good, but I still have some questions.

1. My main concern is the necessity of the 5-byte tombstoneBytes object. 
It seems possible to implement a tombstone by treating the absence of a value as a 
tombstone.
For example, valLen=0 could be treated as tombstone presence. Doing so, we can 
get rid of the 5-byte comparison and instead do a simple length check:
{noformat}
private Boolean isTombstone(ByteBuffer buf, int offset) {
    int valLen = buf.getInt(buf.position() + offset);

    if (valLen != tombstoneBytes.length)
        return Boolean.FALSE;

    ...
}
{noformat}

Instead we can do something like {{if (valLen == 0) return true}}

2. With the new changes in PartitionsEvictManager it's possible to have two tasks 
of different types for the same partition.
Consider a scenario: 
* a node finishes rebalancing and starts to clear tombstones
* another node joins the topology and becomes an owner of the partition being cleared
* eviction is started for the partition that is already being cleared

Probably this should not be allowed.

3. I see changes having no obvious relation to the contribution, for example: 
static String cacheGroupMetricsRegistryName(String cacheGrp)
DropCacheContextDuringEvictionTest.java
GridCommandHandlerIndexingTest.java

What's the purpose of these ?

4. Could you explain the modification in 
org.apache.ignite.internal.processors.cache.GridCacheMapEntry#initialValue:

update0 |= (!preload && val == null); ?



> Write tombstones during rebalance to get rid of deferred delete buffer
> --
>
> Key: IGNITE-11704
> URL: https://issues.apache.org/jira/browse/IGNITE-11704
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexey Goncharuk
>Assignee: Pavel Kovalenko
>Priority: Major
>  Labels: rebalance
> Fix For: 2.8
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently Ignite relies on deferred delete buffer in order to handle 
> write-remove conflicts during rebalance. Given the limit size of the buffer, 
> this approach is fundamentally flawed, especially in case when persistence is 
> enabled.
> I suggest to extend the logic of data storage to be able to store key 
> tombstones - to keep version for deleted entries. The tombstones will be 
> stored when rebalance is in progress and should be cleaned up when rebalance 
> is completed.
> Later this approach may be used to implement fast partition rebalance based 
> on merkle trees (in this case, tombstones should be written on an incomplete 
> baseline).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (IGNITE-7083) Reduce memory usage of CachePartitionFullCountersMap

2019-10-03 Thread Alexei Scherbakov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-7083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16943621#comment-16943621
 ] 

Alexei Scherbakov commented on IGNITE-7083:
---

[~mmuzaf]

It looks like the issue is no longer relevant because we have had cache groups for a 
long time now.
The CachePartitionFullCountersMap size is expected to be reduced by a factor of two 
after IGNITE-11794.
Closing the issue.

> Reduce memory usage of CachePartitionFullCountersMap
> 
>
> Key: IGNITE-7083
> URL: https://issues.apache.org/jira/browse/IGNITE-7083
> Project: Ignite
>  Issue Type: Improvement
>  Components: cache
>Affects Versions: 2.3
> Environment: Any
>Reporter: Sunny Chan
>Assignee: Alexey Goncharuk
>Priority: Major
> Fix For: 2.9
>
>
> The Cache Partition Exchange Manager kept a copy of the already completed 
> exchange. However, we have found that it uses a significant amount of memory. 
> Upon further investigation using a heap dump we have found that a large amount 
> of memory is used by the CachePartitionFullCountersMap. We have also observed 
> that in most cases these maps contain only 0s.
> Therefore I propose an optimization for this: initially the long arrays that 
> store the initial update counter and the update counter in the CPFCM will be null, 
> and when you get the value and see these tables are null we will return 
> 0 for the counter. We only allocate the long arrays when there is any 
> non-zero update to the map.
> In our tests, the amount of heap used by GridCachePartitionExchangeManager 
> was around 70MB (67 copies of these CPFCM); after we apply the optimization 
> it drops to around 9MB.
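
For illustration, a minimal sketch of the lazy-allocation idea described above (the 
class is hypothetical, not the actual CachePartitionFullCountersMap implementation): 
the counter array stays null until the first non-zero write, and reads against a null 
array return 0.
{noformat}
// Hypothetical sketch, not Ignite code: lazily allocated per-partition counters.
class LazyCountersMap {
    private final int parts;

    private long[] updCntrs; // Allocated only on the first non-zero update.

    LazyCountersMap(int parts) {
        this.parts = parts;
    }

    long updateCounter(int part) {
        return updCntrs == null ? 0 : updCntrs[part];
    }

    void updateCounter(int part, long val) {
        if (val == 0 && updCntrs == null)
            return; // Still all zeros, nothing to allocate.

        if (updCntrs == null)
            updCntrs = new long[parts];

        updCntrs[part] = val;
    }
}
{noformat}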



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (IGNITE-7083) Reduce memory usage of CachePartitionFullCountersMap

2019-10-03 Thread Alexei Scherbakov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-7083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexei Scherbakov resolved IGNITE-7083.
---
Resolution: Won't Fix

> Reduce memory usage of CachePartitionFullCountersMap
> 
>
> Key: IGNITE-7083
> URL: https://issues.apache.org/jira/browse/IGNITE-7083
> Project: Ignite
>  Issue Type: Improvement
>  Components: cache
>Affects Versions: 2.3
> Environment: Any
>Reporter: Sunny Chan
>Assignee: Alexey Goncharuk
>Priority: Major
> Fix For: 2.9
>
>
> The Cache Partition Exchange Manager kept a copy of the already completed 
> exchange. However, we have found that it uses a significant amount of memory. 
> Upon further investigation using a heap dump we have found that a large amount 
> of memory is used by the CachePartitionFullCountersMap. We have also observed 
> that in most cases these maps contain only 0s.
> Therefore I propose an optimization for this: initially the long arrays that 
> store the initial update counter and the update counter in the CPFCM will be null, 
> and when you get the value and see these tables are null we will return 
> 0 for the counter. We only allocate the long arrays when there is any 
> non-zero update to the map.
> In our tests, the amount of heap used by GridCachePartitionExchangeManager 
> was around 70MB (67 copies of these CPFCM); after we apply the optimization 
> it drops to around 9MB.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (IGNITE-12209) Transaction system view

2019-09-30 Thread Alexei Scherbakov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16941082#comment-16941082
 ] 

Alexei Scherbakov commented on IGNITE-12209:


[~nizhikov]
Looks good.

> Transaction system view
> ---
>
> Key: IGNITE-12209
> URL: https://issues.apache.org/jira/browse/IGNITE-12209
> Project: Ignite
>  Issue Type: Sub-task
>Affects Versions: 2.7.6
>Reporter: Nikolay Izhikov
>Assignee: Nikolay Izhikov
>Priority: Major
>  Labels: IEP-35
> Fix For: 2.8
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> IGNITE-12145 finished
> We should add transactions to the system views.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (IGNITE-12209) Transaction system view

2019-09-30 Thread Alexei Scherbakov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16940887#comment-16940887
 ] 

Alexei Scherbakov commented on IGNITE-12209:


allEntries has multiple implementations. Which one are you talking about ? Are 
you sure all implementations are safe ?

In fact, both can fail if something changes in the future in the underlying 
implementation.
Writing code on the assumption that this will not happen is bad.

I would add a try ... catch block to avoid issues.




> Transaction system view
> ---
>
> Key: IGNITE-12209
> URL: https://issues.apache.org/jira/browse/IGNITE-12209
> Project: Ignite
>  Issue Type: Sub-task
>Affects Versions: 2.7.6
>Reporter: Nikolay Izhikov
>Assignee: Nikolay Izhikov
>Priority: Major
>  Labels: IEP-35
> Fix For: 2.8
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> IGNITE-12145 finished
> We should add transactions to the system views.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (IGNITE-12209) Transaction system view

2019-09-30 Thread Alexei Scherbakov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16940834#comment-16940834
 ] 

Alexei Scherbakov commented on IGNITE-12209:


[~nizhikov]

There is absolutely no guarantee this will always work. 
Having a catch block is 100% safe.


> Transaction system view
> ---
>
> Key: IGNITE-12209
> URL: https://issues.apache.org/jira/browse/IGNITE-12209
> Project: Ignite
>  Issue Type: Sub-task
>Affects Versions: 2.7.6
>Reporter: Nikolay Izhikov
>Assignee: Nikolay Izhikov
>Priority: Major
>  Labels: IEP-35
> Fix For: 2.8
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> IGNITE-12145 finished
> We should add transactions to the system views.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (IGNITE-12209) Transaction system view

2019-09-30 Thread Alexei Scherbakov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16940709#comment-16940709
 ] 

Alexei Scherbakov edited comment on IGNITE-12209 at 9/30/19 7:37 AM:
-

[~nizhikov]

Note that 
org.apache.ignite.internal.processors.cache.transactions.IgniteTxState#allEntries 
and 
org.apache.ignite.internal.processors.cache.transactions.IgniteTxState#cacheIds 
are unsynchronized and the tx state can be concurrently updated if a transaction 
enlists keys at the moment the view is produced.

So the current implementation is unsafe but will probably work somehow. 
I suggest enclosing the methods in try .. catch(Throwable) to implement a fallback 
in case something goes wrong.
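
A minimal sketch of the suggested fallback (the helper class and method names are 
hypothetical; it only assumes the IgniteTxState#allEntries method mentioned above): if 
the tx state mutates while a view row is produced, report an unknown value instead of 
failing the whole system view.
{noformat}
import org.apache.ignite.internal.processors.cache.transactions.IgniteTxState;

// Hypothetical helper, not the actual view implementation.
final class TxViewSafeAccess {
    static int enlistedEntriesSafe(IgniteTxState txState) {
        try {
            return txState.allEntries().size();
        }
        catch (Throwable ignored) {
            return -1; // Keys were enlisted concurrently; the exact count is unavailable.
        }
    }
}
{noformat}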



was (Author: ascherbakov):
[~nizhikov]

Note that 
org.apache.ignite.internal.processors.cache.transactions.IgniteTxState#allEntries 
and 
org.apache.ignite.internal.processors.cache.transactions.IgniteTxState#cacheIds 
are unsynchronized and can be concurrently updated if a transaction enlists 
keys at the moment the view is produced.

So the current implementation is unsafe but will probably work somehow. 
I suggest enclosing the methods in try .. catch(Throwable) to implement a fallback 
in case something goes wrong.


> Transaction system view
> ---
>
> Key: IGNITE-12209
> URL: https://issues.apache.org/jira/browse/IGNITE-12209
> Project: Ignite
>  Issue Type: Sub-task
>Affects Versions: 2.7.6
>Reporter: Nikolay Izhikov
>Assignee: Nikolay Izhikov
>Priority: Major
>  Labels: IEP-35
> Fix For: 2.8
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> IGNITE-12145 finished
> We should add transactions to the system views.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (IGNITE-12209) Transaction system view

2019-09-30 Thread Alexei Scherbakov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16940709#comment-16940709
 ] 

Alexei Scherbakov commented on IGNITE-12209:


[~nizhikov]

Note that 
org.apache.ignite.internal.processors.cache.transactions.IgniteTxState#allEntries 
and 
org.apache.ignite.internal.processors.cache.transactions.IgniteTxState#cacheIds 
are unsynchronized and can be concurrently updated if a transaction enlists 
keys at the moment the view is produced.

So the current implementation is unsafe but will probably work somehow. 
I suggest enclosing the methods in try .. catch(Throwable) to implement a fallback 
in case something goes wrong.


> Transaction system view
> ---
>
> Key: IGNITE-12209
> URL: https://issues.apache.org/jira/browse/IGNITE-12209
> Project: Ignite
>  Issue Type: Sub-task
>Affects Versions: 2.7.6
>Reporter: Nikolay Izhikov
>Assignee: Nikolay Izhikov
>Priority: Major
>  Labels: IEP-35
> Fix For: 2.8
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> IGNITE-12145 finished
> We should add transactions to the system views.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (IGNITE-12209) Transaction system view

2019-09-28 Thread Alexei Scherbakov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16940018#comment-16940018
 ] 

Alexei Scherbakov commented on IGNITE-12209:


[~nizhikov]

I do not understand why we couldn't have local views and run distributed queries 
against them.
SQL is great for such analytic tasks.

Enlisted cache ids are held in 
org.apache.ignite.internal.processors.cache.transactions.IgniteTxState#cacheIds. 
There is no need to traverse the entries.

> Transaction system view
> ---
>
> Key: IGNITE-12209
> URL: https://issues.apache.org/jira/browse/IGNITE-12209
> Project: Ignite
>  Issue Type: Sub-task
>Affects Versions: 2.7.6
>Reporter: Nikolay Izhikov
>Assignee: Nikolay Izhikov
>Priority: Major
>  Labels: IEP-35
> Fix For: 2.8
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> IGNITE-12145 finished
> We should add transactions to the system views.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (IGNITE-12209) Transaction system view

2019-09-27 Thread Alexei Scherbakov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939577#comment-16939577
 ] 

Alexei Scherbakov commented on IGNITE-12209:


[~nizhikov]

This is sad. 
Such a feature would allow producing interesting views on transaction snapshots. 
I don't know the implementation details, but in theory it should work out of the box 
using the current SQL engine and distributed joins.
Do you think it's possible to implement grid-global system views in the future ?

> Transaction system view
> ---
>
> Key: IGNITE-12209
> URL: https://issues.apache.org/jira/browse/IGNITE-12209
> Project: Ignite
>  Issue Type: Sub-task
>Affects Versions: 2.7.6
>Reporter: Nikolay Izhikov
>Assignee: Nikolay Izhikov
>Priority: Major
>  Labels: IEP-35
> Fix For: 2.8
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> IGNITE-12145 finished
> We should add transactions to the system views.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (IGNITE-12209) Transaction system view

2019-09-27 Thread Alexei Scherbakov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939567#comment-16939567
 ] 

Alexei Scherbakov commented on IGNITE-12209:


[~nizhikov]

I think 

* parent node id (tx.originatingNodeId), 
* near node id (tx.otherNodeId),
* mapped topVer, 
* duration, 
* number of currently enlisted keys, 
* cache ids/names 

must be added to the output.

Note that tx.nodeId is not the originating node but the local node for the tx. You 
should update the javadoc.

Would it be possible to reconstruct the whole distributed transaction using SQL joins 
(joining by the parent and local node) ?








> Transaction system view
> ---
>
> Key: IGNITE-12209
> URL: https://issues.apache.org/jira/browse/IGNITE-12209
> Project: Ignite
>  Issue Type: Sub-task
>Affects Versions: 2.7.6
>Reporter: Nikolay Izhikov
>Assignee: Nikolay Izhikov
>Priority: Major
>  Labels: IEP-35
> Fix For: 2.8
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> IGNITE-12145 finished
> We should add transactions to the system views.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (IGNITE-12133) O(log n) partition exchange

2019-09-04 Thread Alexei Scherbakov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16922809#comment-16922809
 ] 

Alexei Scherbakov edited comment on IGNITE-12133 at 9/4/19 8:03 PM:


The PME protocol itself doesn't leverage the ring and uses direct node-to-node 
communication for sending partition maps (except for a special case), but the ring is 
used by the discovery protocol, which "discovers" topology changes and delivers the 
event to grid nodes, which triggers PME due to topology changes, for example 
"node left" or "node added".
Also, the discovery protocol provides "guaranteed ordered message delivery", which is 
extensively used by Ignite internals and cannot be replaced easily.

Actually, PME consists of three phases:

1. Discovery phase, having O( n ) complexity for the default TcpDiscoverySpi 
implementation.
2. Topology unlock waiting.
3. PME phase, having k * O(m) complexity where m is the number of I/O sender 
threads and k depends on the topology size.

So the total PME complexity is the sum of 1, 2 and 3.
To speed up PME we should improve 1, 2 and 3.

How to improve 1 ?
Initially the ring was designed for small topologies and still works very well for 
such cases with default settings.
Specifically for large topologies, ZooKeeper-based discovery was introduced, which 
has better complexity.
So, for small topologies I suggest using the defaults.
For large topologies ZooKeeper discovery should be used.

How to improve 2 ?
This is discussed on the dev list in the lightweight PME topic.

How to improve 3 ?
For small topologies, same as 1: use the defaults.
For large topologies we could use [~mnk]'s proposal and a tree-like message 
propagation pattern to achieve log(N) complexity.
I agree with [~ivan.glukos] about the increased failover complexity, but I think it's 
doable.
NOTE: the same idea could be used to increase replicated cache performance on 
large topologies. We have a long-known issue with performance degradation when the 
topology is large.

[~Jokser] 
The gossip idea looks interesting, but it looks like a complicated change and 
reinventing the wheel. Why not stick to ZooKeeper?
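
For illustration only (not Ignite code): with a coordinator-rooted k-ary tree, each 
node forwards the message to at most "fanout" children, so full propagation takes 
O(log n) hops instead of O(n) around the ring. The indexing scheme below is an 
arbitrary example:
{noformat}
import java.util.ArrayList;
import java.util.List;

final class TreeFanout {
    /** Children of node {@code idx} in a complete k-ary tree over {@code total} nodes (0 = coordinator). */
    static List<Integer> childrenOf(int idx, int fanout, int total) {
        List<Integer> children = new ArrayList<>();

        for (int i = 1; i <= fanout; i++) {
            int child = idx * fanout + i;

            if (child < total)
                children.add(child);
        }

        return children;
    }
}
{noformat}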







was (Author: ascherbakov):
The PME protocol itself doesn't leverage the ring and uses direct node-to-node 
communication for sending partition maps (except for a special case), but the ring is 
used by the discovery protocol, which "discovers" topology changes and delivers the 
event to grid nodes, which triggers PME due to topology changes, for example 
"node left" or "node added".
Also, the discovery protocol provides "guaranteed ordered message delivery", which is 
extensively used by Ignite internals and cannot be replaced easily.

Actually, PME consists of three phases:

1. Discovery phase, having O( n ) complexity for the default TcpDiscoverySpi 
implementation.
2. Topology unlock waiting (out of this post's scope).
3. PME phase, having k * O(m) complexity where m is the number of I/O sender 
threads and k depends on the topology size.

So the total PME complexity is the sum of 1, 2 and 3.
To speed up PME we should improve 1 and 3.

How to improve 1 ?
Initially the ring was designed for small topologies and still works very well for 
such cases with default settings.
Specifically for large topologies, ZooKeeper-based discovery was introduced, which 
has better complexity.
So, for small topologies I suggest using the defaults.
For large topologies ZooKeeper discovery should be used.

How to improve 3 ?
For small topologies, same as 1: use the defaults.
For large topologies we could use [~mnk]'s proposal and a tree-like message 
propagation pattern to achieve log(N) complexity.
I agree with [~ivan.glukos] about the increased failover complexity, but I think it's 
doable.
NOTE: the same idea could be used to increase replicated cache performance on 
large topologies. We have a long-known issue with performance degradation when the 
topology is large.

[~Jokser] 
The gossip idea looks interesting, but it looks like a complicated change and 
reinventing the wheel. Why not stick to ZooKeeper?






> O(log n) partition exchange
> ---
>
> Key: IGNITE-12133
> URL: https://issues.apache.org/jira/browse/IGNITE-12133
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Moti Nisenson-Ken
>Priority: Major
>
> Currently, partition exchange leverages a ring. This means that 
> communications is O\(n) in number of nodes. It also means that if 
> non-coordinator nodes hang it can take much longer to successfully resolve 
> the topology.
> Instead, why not use something like a skip-list where the coordinator is 
> first. The coordinator can notify the first node at each level of the 
> skip-list. Each node then notifies all of its "near-neighbours" in the 
> skip-list, where node B is a near-neighbour of node-A, if max-level(nodeB) <= 
> max-level(nodeA), and nodeB is the first node at its level when traversing 
> from nodeA in the direction of nodeB, skipping 

[jira] [Comment Edited] (IGNITE-12133) O(log n) partition exchange

2019-09-04 Thread Alexei Scherbakov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16922809#comment-16922809
 ] 

Alexei Scherbakov edited comment on IGNITE-12133 at 9/4/19 8:02 PM:


The PME protocol itself doesn't leverage the ring and uses direct node-to-node 
communication for sending partition maps (except for a special case), but the ring is 
used by the discovery protocol, which "discovers" topology changes and delivers the 
event to grid nodes, which triggers PME due to topology changes, for example 
"node left" or "node added".
Also, the discovery protocol provides "guaranteed ordered message delivery", which is 
extensively used by Ignite internals and cannot be replaced easily.

Actually, PME consists of three phases:

1. Discovery phase, having O( n ) complexity for the default TcpDiscoverySpi 
implementation.
2. Topology unlock waiting (out of this post's scope).
3. PME phase, having k * O(m) complexity where m is the number of I/O sender 
threads and k depends on the topology size.

So the total PME complexity is the sum of 1, 2 and 3.
To speed up PME we should improve 1 and 3.

How to improve 1 ?
Initially the ring was designed for small topologies and still works very well for 
such cases with default settings.
Specifically for large topologies, ZooKeeper-based discovery was introduced, which 
has better complexity.
So, for small topologies I suggest using the defaults.
For large topologies ZooKeeper discovery should be used.

How to improve 3 ?
For small topologies, same as 1: use the defaults.
For large topologies we could use [~mnk]'s proposal and a tree-like message 
propagation pattern to achieve log(N) complexity.
I agree with [~ivan.glukos] about the increased failover complexity, but I think it's 
doable.
NOTE: the same idea could be used to increase replicated cache performance on 
large topologies. We have a long-known issue with performance degradation when the 
topology is large.

[~Jokser] 
The gossip idea looks interesting, but it looks like a complicated change and 
reinventing the wheel. Why not stick to ZooKeeper?







was (Author: ascherbakov):
The PME protocol itself doesn't leverage the ring and uses direct node-to-node 
communication for sending partition maps (except for a special case), but the ring is 
used by the discovery protocol, which "discovers" topology changes and delivers the 
event to grid nodes, which triggers PME due to topology changes, for example 
"node left" or "node added".
Also, the discovery protocol provides "guaranteed ordered message delivery", which is 
extensively used by Ignite internals and cannot be replaced easily.

Actually, PME consists of three phases:

1. Discovery phase, having O( n ) complexity for the default TcpDiscoverySpi 
implementation.
2. Topology unlock waiting (out of this post's scope).
3. PME phase, having k * O(m) complexity where m is the number of I/O sender 
threads and k depends on the topology size.

So the total PME complexity is the sum of 1 and 3.
To speed up PME we should improve 1 and 3.

How to improve 1 ?
Initially the ring was designed for small topologies and still works very well for 
such cases with default settings.
Specifically for large topologies, ZooKeeper-based discovery was introduced, which 
has better complexity.
So, for small topologies I suggest using the defaults.
For large topologies ZooKeeper discovery should be used.

How to improve 3 ?
For small topologies, same as 1: use the defaults.
For large topologies we could use [~mnk]'s proposal and a tree-like message 
propagation pattern to achieve log(N) complexity.
I agree with [~ivan.glukos] about the increased failover complexity, but I think it's 
doable.
NOTE: the same idea could be used to increase replicated cache performance on 
large topologies. We have a long-known issue with performance degradation when the 
topology is large.

[~Jokser] 
The gossip idea looks interesting, but it looks like a complicated change and 
reinventing the wheel. Why not stick to ZooKeeper?






> O(log n) partition exchange
> ---
>
> Key: IGNITE-12133
> URL: https://issues.apache.org/jira/browse/IGNITE-12133
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Moti Nisenson-Ken
>Priority: Major
>
> Currently, partition exchange leverages a ring. This means that 
> communications is O\(n) in number of nodes. It also means that if 
> non-coordinator nodes hang it can take much longer to successfully resolve 
> the topology.
> Instead, why not use something like a skip-list where the coordinator is 
> first. The coordinator can notify the first node at each level of the 
> skip-list. Each node then notifies all of its "near-neighbours" in the 
> skip-list, where node B is a near-neighbour of node-A, if max-level(nodeB) <= 
> max-level(nodeA), and nodeB is the first node at its level when traversing 
> from nodeA in the direction of nodeB, skipping over nodes C which have 
> max-level(C) > 

[jira] [Commented] (IGNITE-12133) O(log n) partition exchange

2019-09-04 Thread Alexei Scherbakov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16922809#comment-16922809
 ] 

Alexei Scherbakov commented on IGNITE-12133:


The PME protocol itself doesn't leverage the ring and uses direct node-to-node 
communication for sending partition maps (except for a special case). The ring is 
used by the discovery protocol, which "discovers" topology changes and delivers 
events to grid nodes; these events, for example "node left" or "node added", 
trigger PME.
The discovery protocol also provides "guaranteed ordered message delivery", which 
is extensively used by Ignite internals and cannot be replaced easily.

Actually, PME consists of three phases:

1. Discovery phase, which has O(n) complexity for the default TcpDiscoverySpi 
implementation.
2. Topology unlock waiting (out of this post's scope).
3. PME phase, which has k * O(m) complexity, where m is the number of I/O sender 
threads and k depends on the topology size.

So the total PME complexity is the sum of 1 and 3.
To speed up PME we should improve 1 and 3.

How to improve 1?
The ring was initially designed for small topologies and still works very well 
for such cases with the default settings.
Specifically for large topologies, ZooKeeper-based discovery was introduced, 
which has better complexity.
So, for small topologies I suggest using the defaults.
For large topologies, ZooKeeper discovery should be used.

How to improve 3?
For small topologies, same as 1: use the defaults.
For large topologies we could follow [~mnk]'s proposal and use a tree-like 
message propagation pattern to achieve log(N) complexity.
I agree with [~ivan.glukos] that this increases failover complexity, but I think 
it's doable.
NOTE: the same idea could be used to increase the performance of replicated 
caches on large topologies. We have a long-known issue with performance 
degradation when the topology is large.

[~Jokser]
The gossip idea looks interesting, but it seems like a complicated change and 
reinventing the wheel. Why not stick to ZooKeeper?






> O(log n) partition exchange
> ---
>
> Key: IGNITE-12133
> URL: https://issues.apache.org/jira/browse/IGNITE-12133
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Moti Nisenson-Ken
>Priority: Major
>
> Currently, partition exchange leverages a ring. This means that 
> communication is O\(n) in the number of nodes. It also means that if 
> non-coordinator nodes hang, it can take much longer to successfully resolve 
> the topology.
> Instead, why not use something like a skip-list where the coordinator is 
> first. The coordinator can notify the first node at each level of the 
> skip-list. Each node then notifies all of its "near-neighbours" in the 
> skip-list, where node B is a near-neighbour of node-A, if max-level(nodeB) <= 
> max-level(nodeA), and nodeB is the first node at its level when traversing 
> from nodeA in the direction of nodeB, skipping over nodes C which have 
> max-level(C) > max-level(A). 
> 1
> 1 .  .  .3
> 1        3 . .  . 5
> 1 . 2 . 3 . 4 . 5 . 6
> In the above 1 would notify 2 and 3, 3 would notify 4 and 5, 2 -> 4, and 4 -> 
> 6, and 5 -> 6.
> One can achieve better redundancy by having each node traverse in both 
> directions, and having the coordinator also notify the last node in the list 
> at each level. This way in the above example if 2 and 3 were both down, 4 
> would still get notified from 5 and 6 (in the backwards direction).
>  
> The idea is that each individual node has O(log n) nodes to notify - so the 
> overall time is reduced. Additionally, we can deal well with at least 1 node 
> failure - if one includes the option of processing backwards, 2 consecutive 
> node failures can be handled as well. By taking this kind of an approach, 
> then the coordinator can basically treat any nodes it didn't receive a 
> message from as not-connected, and update the topology as well (disconnecting 
> any nodes that it didn't get a notification from). While there are some edge 
> cases here (e.g. 2 disconnected nodes, then 1 connected node, then 2 
> disconnected nodes - the connected node would be wrongly ejected from the 
> topology), these would generally be too rare to need explicit handling for.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Comment Edited] (IGNITE-12133) O(log n) partition exchange

2019-09-04 Thread Alexei Scherbakov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16922809#comment-16922809
 ] 

Alexei Scherbakov edited comment on IGNITE-12133 at 9/4/19 8:00 PM:


The PME protocol itself doesn't leverage the ring and uses direct node-to-node 
communication for sending partition maps (except for a special case). The ring is 
used by the discovery protocol, which "discovers" topology changes and delivers 
events to grid nodes; these events, for example "node left" or "node added", 
trigger PME.
The discovery protocol also provides "guaranteed ordered message delivery", which 
is extensively used by Ignite internals and cannot be replaced easily.

Actually, PME consists of three phases:

1. Discovery phase, which has O(n) complexity for the default TcpDiscoverySpi 
implementation.
2. Topology unlock waiting (out of this post's scope).
3. PME phase, which has k * O(m) complexity, where m is the number of I/O sender 
threads and k depends on the topology size.

So the total PME complexity is the sum of 1 and 3.
To speed up PME we should improve 1 and 3.

How to improve 1?
The ring was initially designed for small topologies and still works very well 
for such cases with the default settings.
Specifically for large topologies, ZooKeeper-based discovery was introduced, 
which has better complexity.
So, for small topologies I suggest using the defaults.
For large topologies, ZooKeeper discovery should be used.

How to improve 3?
For small topologies, same as 1: use the defaults.
For large topologies we could follow [~mnk]'s proposal and use a tree-like 
message propagation pattern to achieve log(N) complexity.
I agree with [~ivan.glukos] that this increases failover complexity, but I think 
it's doable.
NOTE: the same idea could be used to increase the performance of replicated 
caches on large topologies. We have a long-known issue with performance 
degradation when the topology is large.

[~Jokser]
The gossip idea looks interesting, but it seems like a complicated change and 
reinventing the wheel. Why not stick to ZooKeeper?







was (Author: ascherbakov):
PME protocol itself doesn't leverage ring and uses direct node to node 
communication for sending partition maps (except for special case), but ring is 
used by discovery protocol, which "discovers" topology changes and delivers 
event to grid nodes, which triggers PME due to topology changes, for example 
"node left" or "node added".
Also discovery protocol provides "guaranteed ordered messages delivery" which 
is extensively used by Ignite internals and cannot be replaced easily.

Actually, PME consists of three phases:

1. Discovery phase, having O(n) complexity for default TcpDiscoverySpi 
implementation.
2. Topology unlock waiting (out of this post's scope).
3. PME phase having k * O(m) complextity where m is number of I/O sender 
threads and k depends on topology size.

So total PME complexity is sum of 1 and 3.
To speed up PME we should improve 1 and 3.

How to improve 1 ?
Initially ring was designed for small topologies and still works very well for 
such cases with default settings.
Specially for large topologies zookeeper based discovery was introduced, which 
have better complexity.
So, for small topologies I suggest to use defaults.
For large topologies zookeeper discovery should be used.

How to improve 3 ?
For small topologies same as 1, use defaults.
For large topologies we could use [~mnk]'s proposal and use tree-like message 
propagation pattern to achieve log(N) complexity.
I agree with [~ivan.glukos] on increasing failover complexity, but I think it's 
doable.
NOTE: same idea could be used for increasing replicated caches performance on 
large topologies. We have long time known issue with performance degradation if 
topology is large.

[~Jokser] 
Gossip idea looks interesting, but looks like complicated change and 
reinventing the wheel. Why not stick to zookeeper?






> O(log n) partition exchange
> ---
>
> Key: IGNITE-12133
> URL: https://issues.apache.org/jira/browse/IGNITE-12133
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Moti Nisenson-Ken
>Priority: Major
>
> Currently, partition exchange leverages a ring. This means that 
> communications is O\(n) in number of nodes. It also means that if 
> non-coordinator nodes hang it can take much longer to successfully resolve 
> the topology.
> Instead, why not use something like a skip-list where the coordinator is 
> first. The coordinator can notify the first node at each level of the 
> skip-list. Each node then notifies all of its "near-neighbours" in the 
> skip-list, where node B is a near-neighbour of node-A, if max-level(nodeB) <= 
> max-level(nodeA), and nodeB is the first node at its level when traversing 
> from nodeA in the direction of nodeB, skipping over nodes C which have 
> max-level(C) > max-level(A). 

[jira] [Commented] (IGNITE-12038) Fix several failing tests after IGNITE-10078

2019-08-29 Thread Alexei Scherbakov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918720#comment-16918720
 ] 

Alexei Scherbakov commented on IGNITE-12038:


The contribution contains a bunch of fixes related to the new partition counter 
implementation (IGNITE-10078) discovered during private testing.
It should also fix the tests mentioned above.

List of fixes:

Fixed issues related to incorrect clearing of partitions in OWNING state.
Fixed prevention of the RENTING->EVICTED partition state change.
CheckpointReadLock() could hang during node stop - fixed.
Fixed an invalid topology version assertion thrown on 
PartitionCountersNeighborcastRequest.
Fixed an issue where a cross-cache tx is mapped on a wrong primary when enlisted 
caches have incompatible assignments.
Transactions are now rolled back if they are preparing on an invalid primary node.
Stabilized LocalWalModeChangeDuringRebalancingSelfTest.

[~ivan.glukos] could you do a review?





> Fix several failing tests after IGNITE-10078
> 
>
> Key: IGNITE-12038
> URL: https://issues.apache.org/jira/browse/IGNITE-12038
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexei Scherbakov
>Assignee: Alexei Scherbakov
>Priority: Major
> Fix For: 2.8
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
>  *New stable failure of a flaky test in master 
> LocalWalModeChangeDuringRebalancingSelfTest.testWithExchangesMerge 
> https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8=-6585115376754732686=%3Cdefault%3E=testDetails
>  *New stable failure of a flaky test in master 
> GridCacheRebalancingWithAsyncClearingMvccTest.testPartitionClearingNotBlockExchange
>  
> https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8=-7007912051428984819=%3Cdefault%3E=testDetails
>  *New stable failure of a flaky test in master 
> GridCacheRebalancingAsyncSelfTest.testComplexRebalancing 
> https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8=-8829809273565657995=%3Cdefault%3E=testDetails



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (IGNITE-11857) Investigate performance drop after IGNITE-10078

2019-08-28 Thread Alexei Scherbakov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-11857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918035#comment-16918035
 ] 

Alexei Scherbakov commented on IGNITE-11857:


[~alex_pl]

Haven't yet, but it's in my queue.

> Investigate performance drop after IGNITE-10078
> ---
>
> Key: IGNITE-11857
> URL: https://issues.apache.org/jira/browse/IGNITE-11857
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexei Scherbakov
>Assignee: Aleksey Plekhanov
>Priority: Major
> Attachments: ignite-config.xml, 
> run.properties.tx-optimistic-put-b-backup
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> After IGNITE-10078 yardstick tests show a performance drop of up to 8% in some 
> scenarios:
> * tx-optim-repRead-put-get
> * tx-optimistic-put
> * tx-putAll
> Partially this is due to the new update counter implementation, but not only. 
> Investigation is required.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (IGNITE-3195) Rebalancing: IgniteConfiguration.rebalanceThreadPoolSize is wrongly treated

2019-08-28 Thread Alexei Scherbakov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-3195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917802#comment-16917802
 ] 

Alexei Scherbakov commented on IGNITE-3195:
---

[~avinogradov]

1. This theoretically should work. I'm going to contribute a bunch of follow-up 
fixes via IGNITE-12038 very soon; let's check TC again afterwards.
2. OK, it looks like ordered messages should provide the necessary ordering.

No more objections from me, thanks.

> Rebalancing: IgniteConfiguration.rebalanceThreadPoolSize is wrongly treated
> ---
>
> Key: IGNITE-3195
> URL: https://issues.apache.org/jira/browse/IGNITE-3195
> Project: Ignite
>  Issue Type: Bug
>  Components: cache
>Reporter: Denis Magda
>Assignee: Anton Vinogradov
>Priority: Major
>  Labels: iep-16
> Fix For: 2.8
>
>  Time Spent: 4h
>  Remaining Estimate: 0h
>
> Presently it's considered that the maximum number of threads that has to 
> process all demand and supply messages coming from all the nodes must not be 
> bigger than {{IgniteConfiguration.rebalanceThreadPoolSize}}.
> Current implementation relies on ordered messages functionality creating a 
> number of topics equal to {{IgniteConfiguration.rebalanceThreadPoolSize}}.
> However, the implementation doesn't take into account that ordered messages, 
> that correspond to a particular topic, are processed in parallel for 
> different nodes. Refer to the implementation of 
> {{GridIoManager.processOrderedMessage}} to see that for every topic there 
> will be a unique {{GridCommunicationMessageSet}} for every node.
> Also to prove that this is true you can refer to this execution stack 
> {noformat}
> java.lang.RuntimeException: HAPPENED DEMAND
>   at 
> org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:378)
>   at 
> org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:364)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:622)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:320)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$300(GridCacheIoManager.java:81)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager$OrderedMessageListener.onMessage(GridCacheIoManager.java:1125)
>   at 
> org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1219)
>   at 
> org.apache.ignite.internal.managers.communication.GridIoManager.access$1600(GridIoManager.java:105)
>   at 
> org.apache.ignite.internal.managers.communication.GridIoManager$GridCommunicationMessageSet.unwind(GridIoManager.java:2456)
>   at 
> org.apache.ignite.internal.managers.communication.GridIoManager.unwindMessageSet(GridIoManager.java:1179)
>   at 
> org.apache.ignite.internal.managers.communication.GridIoManager.access$1900(GridIoManager.java:105)
>   at 
> org.apache.ignite.internal.managers.communication.GridIoManager$6.run(GridIoManager.java:1148)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> All this means that in fact the number of threads that will be busy with 
> replication activity will be equal to 
> {{IgniteConfiguration.rebalanceThreadPoolSize}} x 
> number_of_nodes_participated_in_rebalancing



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Comment Edited] (IGNITE-3195) Rebalancing: IgniteConfiguration.rebalanceThreadPoolSize is wrongly treated

2019-08-27 Thread Alexei Scherbakov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-3195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916855#comment-16916855
 ] 

Alexei Scherbakov edited comment on IGNITE-3195 at 8/27/19 4:41 PM:


[~avinogradov]

Overall the fix looks good, but I think we could improve it.

1. It looks like it's safe to remove ordering for historical rebalance, because 
after IGNITE-10078 the rmvQueue of a partition is no longer cleared during 
rebalance and removals cannot be lost.
Given that, we could use a single thread pool for historical and full rebalance 
and parallelize historical rebalance on the supplier side the same way as full 
rebalance.
This is the right thing to do because, from the user's point of view, there is no 
difference between full and historical rebalance, and they can happen 
simultaneously.
Note that a proper fix for writing tombstones is on the way [1]

2. It looks like the current implementation for detecting partition completion 
during concurrent processing using the *queued* and *processed* counters is 
flawed.
Consider the scenario:
The demander sends an initial demand request for a single partition.
The supplier replies with 2 supply messages in total, which start to be processed 
in parallel.
The 2nd message is the last one.
The 2nd message starts processing first and increments *queued* to N (the number 
of entries in the message).
The 2nd message finishes processing, incrementing *processed* to N.
Because this is the last message, the partition will be owned before the other 
messages are applied.

[1] https://issues.apache.org/jira/browse/IGNITE-11704
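
A hypothetical, self-contained illustration of the race described in point 2 
above (the names are made up; this is not the actual demander code): the 
completion check "last message seen and queued == processed" can hold while the 
entries of an earlier supply message are still unapplied.
{noformat}
import java.util.concurrent.atomic.AtomicLong;

public class QueuedProcessedRace {
    static final AtomicLong queued = new AtomicLong();
    static final AtomicLong processed = new AtomicLong();

    public static void main(String[] args) {
        int entriesPerMsg = 100;

        // Supply message #2 (the last one) happens to be picked up first.
        queued.addAndGet(entriesPerMsg);
        processed.addAndGet(entriesPerMsg);
        boolean lastMessageSeen = true;

        // Flawed completion check: it holds here although message #1
        // has not been applied yet.
        if (lastMessageSeen && queued.get() == processed.get())
            System.out.println("Partition owned prematurely: message #1 not applied");

        // Message #1 is applied only afterwards.
        queued.addAndGet(entriesPerMsg);
        processed.addAndGet(entriesPerMsg);
    }
}
{noformat}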


was (Author: ascherbakov):
[~avinogradov]

Overall fix looks good, but I think we could improve it.

1. Looks like it's safe to remove ordering for historical rebalance because 
after IGNITE-10078 rmvQeue for partition is no longer cleared during rebalance 
and removals cannot be lost.
Given what, we could use single thread pool for historical and full rebalance 
and parallelize historical rebalance on supplier side same as full.
This is right thing to do because from user side of view there is no difference 
between full and historical rebalance and they can happen simultaneously.
Note, proper fix for writing tombstones is on the way [1]

2. Looks like current implementation for detecting partition completion on 
concurrent processing using *queued*and *processed* is flawed.
Consider the scenario:
Demander sends initial demand request for single partition.
Supplier replies with 2 total supply messages which are starting to process in 
parallel.
2-d message is last.
2-d message started to process first, increments *queued* to N (number of 
entries in message)
2-d message finished processing, incrementing *processed* to N.
Because this is last message partition will be owned before other messages are 
applied.

[1] https://issues.apache.org/jira/browse/IGNITE-11704

> Rebalancing: IgniteConfiguration.rebalanceThreadPoolSize is wrongly treated
> ---
>
> Key: IGNITE-3195
> URL: https://issues.apache.org/jira/browse/IGNITE-3195
> Project: Ignite
>  Issue Type: Bug
>  Components: cache
>Reporter: Denis Magda
>Assignee: Anton Vinogradov
>Priority: Major
>  Labels: iep-16
> Fix For: 2.8
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> Presently it's considered that the maximum number of threads that has to 
> process all demand and supply messages coming from all the nodes must not be 
> bigger than {{IgniteConfiguration.rebalanceThreadPoolSize}}.
> Current implementation relies on ordered messages functionality creating a 
> number of topics equal to {{IgniteConfiguration.rebalanceThreadPoolSize}}.
> However, the implementation doesn't take into account that ordered messages, 
> that correspond to a particular topic, are processed in parallel for 
> different nodes. Refer to the implementation of 
> {{GridIoManager.processOrderedMessage}} to see that for every topic there 
> will be a unique {{GridCommunicationMessageSet}} for every node.
> Also to prove that this is true you can refer to this execution stack 
> {noformat}
> java.lang.RuntimeException: HAPPENED DEMAND
>   at 
> org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:378)
>   at 
> org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:364)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:622)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:320)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$300(GridCacheIoManager.java:81)
>   at 
> 

[jira] [Comment Edited] (IGNITE-3195) Rebalancing: IgniteConfiguration.rebalanceThreadPoolSize is wrongly treated

2019-08-27 Thread Alexei Scherbakov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-3195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916855#comment-16916855
 ] 

Alexei Scherbakov edited comment on IGNITE-3195 at 8/27/19 4:40 PM:


[~avinogradov]

Overall the fix looks good, but I think we could improve it.

1. It looks like it's safe to remove ordering for historical rebalance, because 
after IGNITE-10078 the rmvQueue of a partition is no longer cleared during 
rebalance and removals cannot be lost.
Given that, we could use a single thread pool for historical and full rebalance 
and parallelize historical rebalance on the supplier side the same way as full 
rebalance.
This is the right thing to do because, from the user's point of view, there is no 
difference between full and historical rebalance, and they can happen 
simultaneously.
Note that a proper fix for writing tombstones is on the way [1]

2. It looks like the current implementation for detecting partition completion 
during concurrent processing using the *queued* and *processed* counters is 
flawed.
Consider the scenario:
The demander sends an initial demand request for a single partition.
The supplier replies with 2 supply messages in total, which start to be processed 
in parallel.
The 2nd message is the last one.
The 2nd message starts processing first and increments *queued* to N (the number 
of entries in the message).
The 2nd message finishes processing, incrementing *processed* to N.
Because this is the last message, the partition will be owned before the other 
messages are applied.

[1] https://issues.apache.org/jira/browse/IGNITE-11704


was (Author: ascherbakov):
[~avinogradov]

Overall fix looks good, but I think we could improve it.

1. Looks like it's safe to remove ordering for historical rebalance because 
after IGNITE-10078 rmvQeue for partition is no longer cleared during rebalance 
and removals cannot be lost.
Given what, we could use single thread pool for historical and full rebalance 
and parallelize historical rebalance on supplier side same as full.
This is right thing to do because from user side of view there is no difference 
between full and historical rebalance and they can happen simultaneously.
Note, proper fix for writing tombstones is on the way [1]

2. Looks like current implementation for detecting partition completion on 
concurrent processing using *queued *and *processed *is flawed.
Consider the scenario:
Demander sends initial demand request for single partition.
Supplier replies with 2 total supply messages which are starting to process in 
parallel.
2-d message is last.
2-d message started to process first, increments *queued *to N (number of 
entries in message)
2-d message finished processing, incrementing *processed *to N.
Because this is last message partition will be owned before other messages are 
applied.

[1] https://issues.apache.org/jira/browse/IGNITE-11704

> Rebalancing: IgniteConfiguration.rebalanceThreadPoolSize is wrongly treated
> ---
>
> Key: IGNITE-3195
> URL: https://issues.apache.org/jira/browse/IGNITE-3195
> Project: Ignite
>  Issue Type: Bug
>  Components: cache
>Reporter: Denis Magda
>Assignee: Anton Vinogradov
>Priority: Major
>  Labels: iep-16
> Fix For: 2.8
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> Presently it's considered that the maximum number of threads that has to 
> process all demand and supply messages coming from all the nodes must not be 
> bigger than {{IgniteConfiguration.rebalanceThreadPoolSize}}.
> Current implementation relies on ordered messages functionality creating a 
> number of topics equal to {{IgniteConfiguration.rebalanceThreadPoolSize}}.
> However, the implementation doesn't take into account that ordered messages, 
> that correspond to a particular topic, are processed in parallel for 
> different nodes. Refer to the implementation of 
> {{GridIoManager.processOrderedMessage}} to see that for every topic there 
> will be a unique {{GridCommunicationMessageSet}} for every node.
> Also to prove that this is true you can refer to this execution stack 
> {noformat}
> java.lang.RuntimeException: HAPPENED DEMAND
>   at 
> org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:378)
>   at 
> org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:364)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:622)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:320)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$300(GridCacheIoManager.java:81)
>   at 
> 

[jira] [Commented] (IGNITE-3195) Rebalancing: IgniteConfiguration.rebalanceThreadPoolSize is wrongly treated

2019-08-27 Thread Alexei Scherbakov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-3195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916855#comment-16916855
 ] 

Alexei Scherbakov commented on IGNITE-3195:
---

[~avinogradov]

Overall the fix looks good, but I think we could improve it.

1. It looks like it's safe to remove ordering for historical rebalance, because 
after IGNITE-10078 the rmvQueue of a partition is no longer cleared during 
rebalance and removals cannot be lost.
Given that, we could use a single thread pool for historical and full rebalance 
and parallelize historical rebalance on the supplier side the same way as full 
rebalance.
This is the right thing to do because, from the user's point of view, there is no 
difference between full and historical rebalance, and they can happen 
simultaneously.
Note that a proper fix for writing tombstones is on the way [1]

2. It looks like the current implementation for detecting partition completion 
during concurrent processing using the *queued* and *processed* counters is 
flawed.
Consider the scenario:
The demander sends an initial demand request for a single partition.
The supplier replies with 2 supply messages in total, which start to be processed 
in parallel.
The 2nd message is the last one.
The 2nd message starts processing first and increments *queued* to N (the number 
of entries in the message).
The 2nd message finishes processing, incrementing *processed* to N.
Because this is the last message, the partition will be owned before the other 
messages are applied.

[1] https://issues.apache.org/jira/browse/IGNITE-11704

> Rebalancing: IgniteConfiguration.rebalanceThreadPoolSize is wrongly treated
> ---
>
> Key: IGNITE-3195
> URL: https://issues.apache.org/jira/browse/IGNITE-3195
> Project: Ignite
>  Issue Type: Bug
>  Components: cache
>Reporter: Denis Magda
>Assignee: Anton Vinogradov
>Priority: Major
>  Labels: iep-16
> Fix For: 2.8
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> Presently it's considered that the maximum number of threads that has to 
> process all demand and supply messages coming from all the nodes must not be 
> bigger than {{IgniteConfiguration.rebalanceThreadPoolSize}}.
> Current implementation relies on ordered messages functionality creating a 
> number of topics equal to {{IgniteConfiguration.rebalanceThreadPoolSize}}.
> However, the implementation doesn't take into account that ordered messages, 
> that correspond to a particular topic, are processed in parallel for 
> different nodes. Refer to the implementation of 
> {{GridIoManager.processOrderedMessage}} to see that for every topic there 
> will be a unique {{GridCommunicationMessageSet}} for every node.
> Also to prove that this is true you can refer to this execution stack 
> {noformat}
> java.lang.RuntimeException: HAPPENED DEMAND
>   at 
> org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:378)
>   at 
> org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:364)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:622)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:320)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$300(GridCacheIoManager.java:81)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager$OrderedMessageListener.onMessage(GridCacheIoManager.java:1125)
>   at 
> org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1219)
>   at 
> org.apache.ignite.internal.managers.communication.GridIoManager.access$1600(GridIoManager.java:105)
>   at 
> org.apache.ignite.internal.managers.communication.GridIoManager$GridCommunicationMessageSet.unwind(GridIoManager.java:2456)
>   at 
> org.apache.ignite.internal.managers.communication.GridIoManager.unwindMessageSet(GridIoManager.java:1179)
>   at 
> org.apache.ignite.internal.managers.communication.GridIoManager.access$1900(GridIoManager.java:105)
>   at 
> org.apache.ignite.internal.managers.communication.GridIoManager$6.run(GridIoManager.java:1148)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> All this means that in fact the number of threads that will be busy with 
> replication activity will be equal to 
> {{IgniteConfiguration.rebalanceThreadPoolSize}} x 
> number_of_nodes_participated_in_rebalancing



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (IGNITE-3195) Rebalancing: IgniteConfiguration.rebalanceThreadPoolSize is wrongly treated

2019-08-27 Thread Alexei Scherbakov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-3195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916567#comment-16916567
 ] 

Alexei Scherbakov commented on IGNITE-3195:
---

[~avinogradov]

I'll take a look.

> Rebalancing: IgniteConfiguration.rebalanceThreadPoolSize is wrongly treated
> ---
>
> Key: IGNITE-3195
> URL: https://issues.apache.org/jira/browse/IGNITE-3195
> Project: Ignite
>  Issue Type: Bug
>  Components: cache
>Reporter: Denis Magda
>Assignee: Anton Vinogradov
>Priority: Major
>  Labels: iep-16
> Fix For: 2.8
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> Presently it's considered that the maximum number of threads that has to 
> process all demand and supply messages coming from all the nodes must not be 
> bigger than {{IgniteConfiguration.rebalanceThreadPoolSize}}.
> Current implementation relies on ordered messages functionality creating a 
> number of topics equal to {{IgniteConfiguration.rebalanceThreadPoolSize}}.
> However, the implementation doesn't take into account that ordered messages, 
> that correspond to a particular topic, are processed in parallel for 
> different nodes. Refer to the implementation of 
> {{GridIoManager.processOrderedMessage}} to see that for every topic there 
> will be a unique {{GridCommunicationMessageSet}} for every node.
> Also to prove that this is true you can refer to this execution stack 
> {noformat}
> java.lang.RuntimeException: HAPPENED DEMAND
>   at 
> org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:378)
>   at 
> org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:364)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:622)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:320)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$300(GridCacheIoManager.java:81)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager$OrderedMessageListener.onMessage(GridCacheIoManager.java:1125)
>   at 
> org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1219)
>   at 
> org.apache.ignite.internal.managers.communication.GridIoManager.access$1600(GridIoManager.java:105)
>   at 
> org.apache.ignite.internal.managers.communication.GridIoManager$GridCommunicationMessageSet.unwind(GridIoManager.java:2456)
>   at 
> org.apache.ignite.internal.managers.communication.GridIoManager.unwindMessageSet(GridIoManager.java:1179)
>   at 
> org.apache.ignite.internal.managers.communication.GridIoManager.access$1900(GridIoManager.java:105)
>   at 
> org.apache.ignite.internal.managers.communication.GridIoManager$6.run(GridIoManager.java:1148)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> All this means that in fact the number of threads that will be busy with 
> replication activity will be equal to 
> {{IgniteConfiguration.rebalanceThreadPoolSize}} x 
> number_of_nodes_participated_in_rebalancing



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (IGNITE-12093) Initial rebalance should be performed as efficiently as possible

2019-08-21 Thread Alexei Scherbakov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912283#comment-16912283
 ] 

Alexei Scherbakov commented on IGNITE-12093:


[~avinogradov]

Keep in mind that during the initial rebalance the demander node also receives 
updates to moving partitions and is enlisted in transactions.
Having all threads perform rebalancing may hurt the performance of normal 
transactions.
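
As a hedged sketch of that trade-off (the sizing formula is illustrative, not a 
recommendation): capping the rebalance pool explicitly leaves headroom for the 
transactional load on the demander node.
{noformat}
import org.apache.ignite.configuration.IgniteConfiguration;

public class RebalancePoolConfig {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Use only a fraction of the available cores for rebalancing so that
        // normal transactions keep enough threads; the divisor is arbitrary.
        int cores = Runtime.getRuntime().availableProcessors();
        cfg.setRebalanceThreadPoolSize(Math.max(1, cores / 4));
    }
}
{noformat}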

> Initial rebalance should be performed as efficiently as possible
> 
>
> Key: IGNITE-12093
> URL: https://issues.apache.org/jira/browse/IGNITE-12093
> Project: Ignite
>  Issue Type: Task
>Reporter: Anton Vinogradov
>Priority: Major
>  Labels: iep-16
>
> {{rebalanceThreadPoolSize}} setting should be ignored on initial rebalance 
> for demanders.
> Maximum suitable thread pool size should be used during the initial rebalance 
> to perform it asap.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (IGNITE-12038) Fix several failing tests after IGNITE-10078

2019-08-02 Thread Alexei Scherbakov (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-12038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexei Scherbakov updated IGNITE-12038:
---
Description: 
 *New stable failure of a flaky test in master 
LocalWalModeChangeDuringRebalancingSelfTest.testWithExchangesMerge 
https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8=-6585115376754732686=%3Cdefault%3E=testDetails

 *New stable failure of a flaky test in master 
GridCacheRebalancingWithAsyncClearingMvccTest.testPartitionClearingNotBlockExchange
 
https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8=-7007912051428984819=%3Cdefault%3E=testDetails

 *New stable failure of a flaky test in master 
GridCacheRebalancingAsyncSelfTest.testComplexRebalancing 
https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8=-8829809273565657995=%3Cdefault%3E=testDetails

  was:
 *New stable failure of a flaky test in master 
LocalWalModeChangeDuringRebalancingSelfTest.testWithExchangesMerge 
https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8=-6585115376754732686=%3Cdefault%3E=testDetails
 Changes may lead to failure were done by 
 - alexey.scherbak...@gmail.com 
https://ci.ignite.apache.org/viewModification.html?modId=886764
 - ptupit...@apache.org 
https://ci.ignite.apache.org/viewModification.html?modId=886762

 *New stable failure of a flaky test in master 
GridCacheRebalancingWithAsyncClearingMvccTest.testPartitionClearingNotBlockExchange
 
https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8=-7007912051428984819=%3Cdefault%3E=testDetails
 Changes may lead to failure were done by 
 - alexey.scherbak...@gmail.com 
https://ci.ignite.apache.org/viewModification.html?modId=886764
 - ptupit...@apache.org 
https://ci.ignite.apache.org/viewModification.html?modId=886762

 *New stable failure of a flaky test in master 
GridCacheRebalancingAsyncSelfTest.testComplexRebalancing 
https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8=-8829809273565657995=%3Cdefault%3E=testDetails


> Fix several failing tests after IGNITE-10078
> 
>
> Key: IGNITE-12038
> URL: https://issues.apache.org/jira/browse/IGNITE-12038
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexei Scherbakov
>Assignee: Alexei Scherbakov
>Priority: Major
> Fix For: 2.8
>
>
>  *New stable failure of a flaky test in master 
> LocalWalModeChangeDuringRebalancingSelfTest.testWithExchangesMerge 
> https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8=-6585115376754732686=%3Cdefault%3E=testDetails
>  *New stable failure of a flaky test in master 
> GridCacheRebalancingWithAsyncClearingMvccTest.testPartitionClearingNotBlockExchange
>  
> https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8=-7007912051428984819=%3Cdefault%3E=testDetails
>  *New stable failure of a flaky test in master 
> GridCacheRebalancingAsyncSelfTest.testComplexRebalancing 
> https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8=-8829809273565657995=%3Cdefault%3E=testDetails



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (IGNITE-12038) Fix several failing tests after IGNITE-10078

2019-08-02 Thread Alexei Scherbakov (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-12038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexei Scherbakov updated IGNITE-12038:
---
Description: 
 *New stable failure of a flaky test in master 
LocalWalModeChangeDuringRebalancingSelfTest.testWithExchangesMerge 
https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8=-6585115376754732686=%3Cdefault%3E=testDetails
 Changes may lead to failure were done by 
 - alexey.scherbak...@gmail.com 
https://ci.ignite.apache.org/viewModification.html?modId=886764
 - ptupit...@apache.org 
https://ci.ignite.apache.org/viewModification.html?modId=886762

 *New stable failure of a flaky test in master 
GridCacheRebalancingWithAsyncClearingMvccTest.testPartitionClearingNotBlockExchange
 
https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8=-7007912051428984819=%3Cdefault%3E=testDetails
 Changes may lead to failure were done by 
 - alexey.scherbak...@gmail.com 
https://ci.ignite.apache.org/viewModification.html?modId=886764
 - ptupit...@apache.org 
https://ci.ignite.apache.org/viewModification.html?modId=886762

 *New stable failure of a flaky test in master 
GridCacheRebalancingAsyncSelfTest.testComplexRebalancing 
https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8=-8829809273565657995=%3Cdefault%3E=testDetails

> Fix several failing tests after IGNITE-10078
> 
>
> Key: IGNITE-12038
> URL: https://issues.apache.org/jira/browse/IGNITE-12038
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexei Scherbakov
>Assignee: Alexei Scherbakov
>Priority: Major
> Fix For: 2.8
>
>
>  *New stable failure of a flaky test in master 
> LocalWalModeChangeDuringRebalancingSelfTest.testWithExchangesMerge 
> https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8=-6585115376754732686=%3Cdefault%3E=testDetails
>  Changes may lead to failure were done by 
>- alexey.scherbak...@gmail.com 
> https://ci.ignite.apache.org/viewModification.html?modId=886764
>- ptupit...@apache.org 
> https://ci.ignite.apache.org/viewModification.html?modId=886762
>  *New stable failure of a flaky test in master 
> GridCacheRebalancingWithAsyncClearingMvccTest.testPartitionClearingNotBlockExchange
>  
> https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8=-7007912051428984819=%3Cdefault%3E=testDetails
>  Changes may lead to failure were done by 
>- alexey.scherbak...@gmail.com 
> https://ci.ignite.apache.org/viewModification.html?modId=886764
>- ptupit...@apache.org 
> https://ci.ignite.apache.org/viewModification.html?modId=886762
>  *New stable failure of a flaky test in master 
> GridCacheRebalancingAsyncSelfTest.testComplexRebalancing 
> https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8=-8829809273565657995=%3Cdefault%3E=testDetails



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (IGNITE-12038) Fix several failing tests after IGNITE-10078

2019-08-02 Thread Alexei Scherbakov (JIRA)
Alexei Scherbakov created IGNITE-12038:
--

 Summary: Fix several failing tests after IGNITE-10078
 Key: IGNITE-12038
 URL: https://issues.apache.org/jira/browse/IGNITE-12038
 Project: Ignite
  Issue Type: Bug
Reporter: Alexei Scherbakov
Assignee: Alexei Scherbakov
 Fix For: 2.8






--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (IGNITE-11857) Investigate performance drop after IGNITE-10078

2019-08-02 Thread Alexei Scherbakov (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-11857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16898804#comment-16898804
 ] 

Alexei Scherbakov commented on IGNITE-11857:


[~alex_pl], 

I will take a look in the next couple of days.

> Investigate performance drop after IGNITE-10078
> ---
>
> Key: IGNITE-11857
> URL: https://issues.apache.org/jira/browse/IGNITE-11857
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexei Scherbakov
>Assignee: Aleksey Plekhanov
>Priority: Major
> Attachments: ignite-config.xml, 
> run.properties.tx-optimistic-put-b-backup
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> After IGNITE-10078 yardstick tests show a performance drop of up to 8% in some 
> scenarios:
> * tx-optim-repRead-put-get
> * tx-optimistic-put
> * tx-putAll
> Partially this is due to the new update counter implementation, but not only. 
> Investigation is required.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (IGNITE-11799) Do not always clear partition in MOVING state before exchange

2019-07-18 Thread Alexei Scherbakov (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-11799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887952#comment-16887952
 ] 

Alexei Scherbakov commented on IGNITE-11799:


[~Mmuzaf]

This is still relevant. I still haven't donated several follow-up fixes from 
GridGain CE, where the comment is removed.
Currently I'm on vacation and expect to donate them at the beginning of August.

> Do not always clear partition in MOVING state before exchange
> -
>
> Key: IGNITE-11799
> URL: https://issues.apache.org/jira/browse/IGNITE-11799
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexei Scherbakov
>Assignee: Alexei Scherbakov
>Priority: Major
>
> After IGNITE-10078, if a partition was in MOVING state before the exchange and 
> is chosen for full rebalance (for example, this will happen if any minor PME 
> cancels a previous rebalance), we will always clear it to avoid desync issues 
> if some removals were not delivered to the demander.
> This is not necessary if the previous rebalance was full.
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (IGNITE-8873) Optimize cache scans with enabled persistence.

2019-06-25 Thread Alexei Scherbakov (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-8873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872682#comment-16872682
 ] 

Alexei Scherbakov commented on IGNITE-8873:
---

[~dmagda]

This method was added to address exactly the case where a huge (tens of 
terabytes) cache has to be efficiently fully scanned.
As far as I know, it has already been used successfully in production by some 
Ignite users.
The main idea behind the per-partition preloading API is the same as for other 
methods working with partitions: affinity run/call, scan by partition, SQL query 
by partition(s).
I suggest keeping this method for advanced use cases and adding some more "high 
level" APIs like you have proposed.

> Optimize cache scans with enabled persistence.
> --
>
> Key: IGNITE-8873
> URL: https://issues.apache.org/jira/browse/IGNITE-8873
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexei Scherbakov
>Assignee: Alexei Scherbakov
>Priority: Major
> Fix For: 2.8
>
>
> Currently cache scans with enabled persistence involve link resolution, which 
> can lead to random disk access, resulting in bad performance on SAS disks.
> One possibility is to preload cache data pages to remove slow random disk 
> access.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (IGNITE-11867) Fix flaky test GridCacheRebalancingWithAsyncClearingTest#testCorrectRebalancingCurrentlyRentingPartitions

2019-06-21 Thread Alexei Scherbakov (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-11867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16869471#comment-16869471
 ] 

Alexei Scherbakov commented on IGNITE-11867:


[~ivan.glukos]

Ready for review.

> Fix flaky test 
> GridCacheRebalancingWithAsyncClearingTest#testCorrectRebalancingCurrentlyRentingPartitions
> -
>
> Key: IGNITE-11867
> URL: https://issues.apache.org/jira/browse/IGNITE-11867
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexei Scherbakov
>Assignee: Alexei Scherbakov
>Priority: Major
> Fix For: 2.8
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> {noformat}
> java.lang.AssertionError: Value for 4 is null
> at org.junit.Assert.fail(Assert.java:88)
> at org.junit.Assert.assertTrue(Assert.java:41)
> at org.junit.Assert.assertNotNull(Assert.java:621)
> at 
> org.apache.ignite.internal.processors.cache.distributed.rebalancing.GridCacheRebalancingWithAsyncClearingTest.testCorrectRebalancingCurrentlyRentingPartitions(GridCacheRebalancingWithAsyncClearingTest.java:280)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
> at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
> at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
> at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
> at 
> org.apache.ignite.testframework.junits.GridAbstractTest$6.run(GridAbstractTest.java:2148)
> at java.lang.Thread.run(Thread.java:748){noformat}
> EDIT: The issue is related to the recently contributed IGNITE-10078. In a 
> specific scenario, due to a race, partition clearing could be started while 
> the partition is passing through an ongoing rebalance started on a previous 
> topology version.
> I fixed it by preventing partition clearing on a newer topology version. In 
> such a case, if the rebalance finishes and the partition goes to OWNING state, 
> further clearing is not needed any more; otherwise the partition should be 
> scheduled for clearing again.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-11939) IgnitePdsTxHistoricalRebalancingTest.testTopologyChangesWithConstantLoad test failure

2019-06-21 Thread Alexei Scherbakov (JIRA)
Alexei Scherbakov created IGNITE-11939:
--

 Summary:  
IgnitePdsTxHistoricalRebalancingTest.testTopologyChangesWithConstantLoad test 
failure
 Key: IGNITE-11939
 URL: https://issues.apache.org/jira/browse/IGNITE-11939
 Project: Ignite
  Issue Type: Bug
Reporter: Alexei Scherbakov


Caused by exception on releasing reserved segments:
{noformat}
[12:51:23]W: [org.apache.ignite:ignite-indexing] [2019-06-21 
12:51:23,967][ERROR][exchange-worker-#33825%persistence.IgnitePdsTxHistoricalRebalancingTest1%][GridDhtPartitionsExchangeFuture]
 Failed to reinitialize local partitions (rebalancing will be stopped)
: GridDhtPartitionExchangeId [topVer=AffinityTopologyVersion [topVer=7, 
minorTopVer=1], discoEvt=DiscoveryCustomEvent 
[customMsg=CacheAffinityChangeMessage 
[id=08de0ff7b61-276ac575-e4dc-4525-b24b-d0a5d1d7633d, 
topVer=AffinityTopologyVersion [topVer=7, minorTopVer=0], exc
hId=null, partsMsg=null, exchangeNeeded=true], 
affTopVer=AffinityTopologyVersion [topVer=7, minorTopVer=1], 
super=DiscoveryEvent [evtNode=TcpDiscoveryNode 
[id=97e46568-6aa0-4a4b-864c-f05415c0, 
consistentId=persistence.IgnitePdsTxHistoricalRebalancingTest0, addrs=Arra
yList [127.0.0.1], sockAddrs=HashSet [/127.0.0.1:47500], discPort=47500, 
order=1, intOrder=1, lastExchangeTime=1561110643882, loc=false, 
ver=2.8.0#20190621-sha1:, isClient=false], topVer=7, nodeId8=0ff3354e, 
msg=null, type=DISCOVERY_CUSTOM_EVT, tstamp=15611106839
58]], nodeId=97e46568, evt=DISCOVERY_CUSTOM_EVT]
[12:51:23]W: [org.apache.ignite:ignite-indexing] 
java.lang.AssertionError: cur=null, absIdx=0
[12:51:23]W: [org.apache.ignite:ignite-indexing]at 
org.apache.ignite.internal.processors.cache.persistence.wal.aware.SegmentReservationStorage.release(SegmentReservationStorage.java:55)
[12:51:23]W: [org.apache.ignite:ignite-indexing]at 
org.apache.ignite.internal.processors.cache.persistence.wal.aware.SegmentAware.release(SegmentAware.java:207)
[12:51:23]W: [org.apache.ignite:ignite-indexing]at 
org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager.release(FileWriteAheadLogManager.java:983)
[12:51:23]W: [org.apache.ignite:ignite-indexing]at 
org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.releaseHistoryForPreloading(GridCacheDatabaseSharedManager.java:1844)
[12:51:23]W: [org.apache.ignite:ignite-indexing]at 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.distributedExchange(GridDhtPartitionsExchangeFuture.java:1431)
[12:51:23]W: [org.apache.ignite:ignite-indexing]at 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:862)
[12:51:23]W: [org.apache.ignite:ignite-indexing]at 
org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body0(GridCachePartitionExchangeManager.java:3079)
[12:51:23]W: [org.apache.ignite:ignite-indexing]at 
org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:2928)
[12:51:23]W: [org.apache.ignite:ignite-indexing]at 
org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
[12:51:23]W: [org.apache.ignite:ignite-indexing]at 
java.lang.Thread.run(Thread.java:748)
[12:51:23]W: [org.apache.ignite:ignite-indexing] [12:51:23] (err) 
Failed to notify listener: 
o.a.i.i.processors.timeout.GridTimeoutProcessor$2...@79ba1907java.lang.AssertionError:
 cur=null, absIdx=0
[12:51:23]W: [org.apache.ignite:ignite-indexing]at 
org.apache.ignite.internal.processors.cache.persistence.wal.aware.SegmentReservationStorage.release(SegmentReservationStorage.java:55)
[12:51:23]W: [org.apache.ignite:ignite-indexing]at 
org.apache.ignite.internal.processors.cache.persistence.wal.aware.SegmentAware.release(SegmentAware.java:207)
[12:51:23]W: [org.apache.ignite:ignite-indexing]at 
org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager.release(FileWriteAheadLogManager.java:983)
[12:51:23]W: [org.apache.ignite:ignite-indexing]at 
org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.releaseHistoryForPreloading(GridCacheDatabaseSharedManager.java:1844)
[12:51:23]W: [org.apache.ignite:ignite-indexing]at 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.distributedExchange(GridDhtPartitionsExchangeFuture.java:1431)
[12:51:23]W: [org.apache.ignite:ignite-indexing]at 

[jira] [Created] (IGNITE-11937) Fix MVCC PDS flaky suites timeout

2019-06-20 Thread Alexei Scherbakov (JIRA)
Alexei Scherbakov created IGNITE-11937:
--

 Summary: Fix MVCC PDS flaky suites timeout
 Key: IGNITE-11937
 URL: https://issues.apache.org/jira/browse/IGNITE-11937
 Project: Ignite
  Issue Type: Bug
Reporter: Alexei Scherbakov


Currently we have a non-zero failure rate for some MVCC PDS suites in master.

It seems this is due to failure [1] in the testRebalancingDuringLoad* test group, 
which leads to dumping WAL and lock states in time proportional to the current 
WAL length, increasing the test duration by a random amount depending on the WAL 
length.

Worse, the test remains green despite throwing a critical exception.

[1]  Stacktrace
{noformat}
[2019-06-19 
15:56:53,386][ERROR][sys-stripe-6-#134%persistence.IgnitePdsContinuousRestartTestWithSharedGroupAndIndexes3%][IgniteTestResources]
 Critical system error detected. Will be handled accordingly to configured 
handler [hnd=NoOpFailureHandler [super=AbstractFailure
Handler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, 
SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext 
[type=CRITICAL_ERROR, err=class 
o.a.i.i.processors.cache.persistence.tree.CorruptedTreeException: B+Tree is 
corrupted [pages(groupId, page
Id)=[IgniteBiTuple [val1=81227264, val2=844420635164676]], msg=Runtime failure 
on search row: TxKey [major=1560948946388, minor=17286
class 
org.apache.ignite.internal.processors.cache.persistence.tree.CorruptedTreeException:
 B+Tree is corrupted [pages(groupId, pageId)=[IgniteBiTuple [val1=81227264, 
val2=844420635164676]], msg=Runtime failure on search row: TxKey 
[major=1560948946388, minor=17286]]
at 
org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.corruptedTreeException(BPlusTree.java:5909)
at 
org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.invoke(BPlusTree.java:1859)
at 
org.apache.ignite.internal.processors.cache.mvcc.txlog.TxLog.put(TxLog.java:293)
at 
org.apache.ignite.internal.processors.cache.mvcc.MvccProcessorImpl.updateState(MvccProcessorImpl.java:699)
at 
org.apache.ignite.internal.processors.cache.transactions.IgniteTxManager.setMvccState(IgniteTxManager.java:2570)
at 
org.apache.ignite.internal.processors.cache.transactions.IgniteTxAdapter.state(IgniteTxAdapter.java:1228)
at 
org.apache.ignite.internal.processors.cache.transactions.IgniteTxAdapter.state(IgniteTxAdapter.java:1070)
at 
org.apache.ignite.internal.processors.cache.distributed.GridDistributedTxRemoteAdapter.prepareRemoteTx(GridDistributedTxRemoteAdapter.java:421)
at 
org.apache.ignite.internal.processors.cache.transactions.IgniteTxHandler.startRemoteTx(IgniteTxHandler.java:1837)
at 
org.apache.ignite.internal.processors.cache.transactions.IgniteTxHandler.processDhtTxPrepareRequest(IgniteTxHandler.java:1198)
at 
org.apache.ignite.internal.processors.cache.transactions.IgniteTxHandler.access$400(IgniteTxHandler.java:118)
at 
org.apache.ignite.internal.processors.cache.transactions.IgniteTxHandler$5.apply(IgniteTxHandler.java:224)
at 
org.apache.ignite.internal.processors.cache.transactions.IgniteTxHandler$5.apply(IgniteTxHandler.java:222)
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1141)
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:591)
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:392)
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:318)
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$100(GridCacheIoManager.java:109)
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:308)
at 
org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1558)
at 
org.apache.ignite.internal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1186)
at 
org.apache.ignite.internal.managers.communication.GridIoManager.access$4200(GridIoManager.java:125)
at 
org.apache.ignite.internal.managers.communication.GridIoManager$8.run(GridIoManager.java:1083)
at 
org.apache.ignite.internal.util.StripedExecutor$Stripe.body(StripedExecutor.java:559)
at 
org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:119)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalStateException: Unexpected new transaction state. 
[currState=2, newState=1, cntr=17286]
at 
org.apache.ignite.internal.processors.cache.mvcc.txlog.TxLog$TxLogUpdateClosure.invalid(TxLog.java:629)
at 

[jira] [Comment Edited] (IGNITE-11799) Do not always clear partition in MOVING state before exchange

2019-06-20 Thread Alexei Scherbakov (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-11799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16868401#comment-16868401
 ] 

Alexei Scherbakov edited comment on IGNITE-11799 at 6/20/19 9:46 AM:
-

Actually, clearing is required in all cases if the new rebalance is FULL.
Consider the scenario:

2 nodes, N1 - supplier, N2 - demander.
N1 has keys k1,k2,k3.
N2 joins and starts any type of rebalancing.
N1 removes k1.
N2 dies after receiving k1,k2 in supply but before receiving the removal of k1 as 
an update to the MOVING partition.
N2 rejoins, starts a full rebalance and loads k2,k3.
Without clearing, N2 will contain keys k1,k2,k3 while N1 will contain only k2,k3, 
causing partition desync.
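For illustration only, a rough sketch of this scenario against the public Ignite API (node restarts and rebalance timing are heavily simplified, and the class and instance names are made up, so treat it as an outline rather than an exact reproducer):
{noformat}
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class MovingPartitionDesyncSketch {
    public static void main(String[] args) {
        // N1 owns k1, k2, k3.
        Ignite n1 = Ignition.start(new IgniteConfiguration().setIgniteInstanceName("n1"));
        IgniteCache<Integer, String> cache = n1.getOrCreateCache("c");
        cache.put(1, "k1"); cache.put(2, "k2"); cache.put(3, "k3");

        // N2 joins and starts rebalancing the MOVING partition from N1.
        Ignite n2 = Ignition.start(new IgniteConfiguration().setIgniteInstanceName("n2"));

        // N1 removes k1 while N2 is still MOVING.
        cache.remove(1);

        // N2 "dies" after receiving k1, k2 but before applying the removal of k1.
        n2.close();

        // N2 rejoins and runs a full rebalance of k2, k3; without clearing the MOVING
        // partition first, the stale k1 would survive on N2 -> partition desync.
        Ignition.start(new IgniteConfiguration().setIgniteInstanceName("n2"));
    }
}
{noformat}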




was (Author: ascherbakov):
Actually clearing required in all cases if new rebalance is FULL.
Consider the scenario:

2 nodes, N1 - supplier, N2 - demander.
N1 has keys k1,k2,k3
N2 joins and start any type of rebalancing
N1 removes k1
N2 dies after receiving k1,k2 in supply but before receiving removal of k1
N2 joins, starts full rebalance and loads k2,k3
N2 will contain keys 1,2,3 while N1 will contain keys 1,2 causing partition 
desync.



> Do not always clear partition in MOVING state before exchange
> -
>
> Key: IGNITE-11799
> URL: https://issues.apache.org/jira/browse/IGNITE-11799
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexei Scherbakov
>Assignee: Alexei Scherbakov
>Priority: Major
>
> After IGNITE-10078, if a partition was in MOVING state before exchange and was 
> chosen for full rebalance (for example, this will happen if any minor PME 
> cancels the previous rebalance), we will always clear it to avoid desync issues 
> in case some removals were not delivered to the demander.
> This is not necessary if the previous rebalance was full.
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (IGNITE-11799) Do not always clear partition in MOVING state before exchange

2019-06-20 Thread Alexei Scherbakov (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-11799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16868401#comment-16868401
 ] 

Alexei Scherbakov edited comment on IGNITE-11799 at 6/20/19 9:46 AM:
-

Actually, clearing is required in all cases if the new rebalance is FULL.
Consider the scenario:

2 nodes, N1 - supplier, N2 - demander.
N1 has keys k1,k2,k3.
N2 joins and starts any type of rebalancing.
N1 removes k1.
N2 dies after receiving k1,k2 in supply but before receiving the removal of k1 as 
an update to the MOVING partition.
N2 rejoins, starts a full rebalance and loads k2,k3.
Without clearing, N2 will now contain keys k1,k2,k3 while N1 will contain only 
k2,k3, causing partition desync.




was (Author: ascherbakov):
Actually clearing required in all cases if new rebalance is FULL.
Consider the scenario:

2 nodes, N1 - supplier, N2 - demander.
N1 has keys k1,k2,k3
N2 joins and start any type of rebalancing
N1 removes k1
N2 dies after receiving k1,k2 in supply but before receiving removal of k1 as 
update to MOVING partition.
N2 joins, starts full rebalance and loads k2,k3
N2 will contain keys 1,2,3 while N1 will contain keys 1,2 causing partition 
desync.



> Do not always clear partition in MOVING state before exchange
> -
>
> Key: IGNITE-11799
> URL: https://issues.apache.org/jira/browse/IGNITE-11799
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexei Scherbakov
>Assignee: Alexei Scherbakov
>Priority: Major
>
> After IGNITE-10078, if a partition was in MOVING state before exchange and was 
> chosen for full rebalance (for example, this will happen if any minor PME 
> cancels the previous rebalance), we will always clear it to avoid desync issues 
> in case some removals were not delivered to the demander.
> This is not necessary if the previous rebalance was full.
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (IGNITE-11799) Do not always clear partition in MOVING state before exchange

2019-06-20 Thread Alexei Scherbakov (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-11799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16868401#comment-16868401
 ] 

Alexei Scherbakov edited comment on IGNITE-11799 at 6/20/19 9:45 AM:
-

Actually, clearing is required in all cases if the new rebalance is FULL.
Consider the scenario:

2 nodes, N1 - supplier, N2 - demander.
N1 has keys k1,k2,k3.
N2 joins and starts any type of rebalancing.
N1 removes k1.
N2 dies after receiving k1,k2 in supply but before receiving the removal of k1.
N2 rejoins, starts a full rebalance and loads k2,k3.
Without clearing, N2 will contain keys k1,k2,k3 while N1 will contain only k2,k3, 
causing partition desync.




was (Author: ascherbakov):
Actually clearing required in all cases if new rebalance is FULL.
Consider the scenario:

2 nodes, N1 - supplier, N2 - demander.
N1 has keys k1,k2,k3
N2 joins and start any type of rebalancing
N1 removes k1
N2 dies after receiving k1,k2 but before receiving removal of k1
N2 joins, starts full rebalance and loads k2,k3
N2 will contain keys 1,2,3 while N1 will contain keys 1,2 causing partition 
desync.



> Do not always clear partition in MOVING state before exchange
> -
>
> Key: IGNITE-11799
> URL: https://issues.apache.org/jira/browse/IGNITE-11799
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexei Scherbakov
>Assignee: Alexei Scherbakov
>Priority: Major
>
> After IGNITE-10078, if a partition was in MOVING state before exchange and was 
> chosen for full rebalance (for example, this will happen if any minor PME 
> cancels the previous rebalance), we will always clear it to avoid desync issues 
> in case some removals were not delivered to the demander.
> This is not necessary if the previous rebalance was full.
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (IGNITE-11799) Do not always clear partition in MOVING state before exchange

2019-06-20 Thread Alexei Scherbakov (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-11799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexei Scherbakov resolved IGNITE-11799.

Resolution: Won't Fix

Actually, clearing is required in all cases if the new rebalance is FULL.
Consider the scenario:

2 nodes, N1 - supplier, N2 - demander.
N1 has keys k1,k2,k3.
N2 joins and starts any type of rebalancing.
N1 removes k1.
N2 dies after receiving k1,k2 but before receiving the removal of k1.
N2 rejoins, starts a full rebalance and loads k2,k3.
Without clearing, N2 will contain keys k1,k2,k3 while N1 will contain only k2,k3, 
causing partition desync.



> Do not always clear partition in MOVING state before exchange
> -
>
> Key: IGNITE-11799
> URL: https://issues.apache.org/jira/browse/IGNITE-11799
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexei Scherbakov
>Assignee: Alexei Scherbakov
>Priority: Major
>
> After IGNITE-10078, if a partition was in MOVING state before exchange and was 
> chosen for full rebalance (for example, this will happen if any minor PME 
> cancels the previous rebalance), we will always clear it to avoid desync issues 
> in case some removals were not delivered to the demander.
> This is not necessary if the previous rebalance was full.
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (IGNITE-11799) Do not always clear partition in MOVING state before exchange

2019-06-20 Thread Alexei Scherbakov (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-11799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexei Scherbakov reassigned IGNITE-11799:
--

Assignee: Alexei Scherbakov

> Do not always clear partition in MOVING state before exchange
> -
>
> Key: IGNITE-11799
> URL: https://issues.apache.org/jira/browse/IGNITE-11799
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexei Scherbakov
>Assignee: Alexei Scherbakov
>Priority: Major
>
> After IGNITE-10078, if a partition was in MOVING state before exchange and was 
> chosen for full rebalance (for example, this will happen if any minor PME 
> cancels the previous rebalance), we will always clear it to avoid desync issues 
> in case some removals were not delivered to the demander.
> This is not necessary if the previous rebalance was full.
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (IGNITE-11848) [IEP-35] Monitoring Phase 1

2019-06-12 Thread Alexei Scherbakov (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-11848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862120#comment-16862120
 ] 

Alexei Scherbakov commented on IGNITE-11848:


[~NIzhikov] I have left some (mostly minor) comments under the PR; please address 
them.

In general I'm OK with the changes.

> [IEP-35] Monitoring Phase 1
> --
>
> Key: IGNITE-11848
> URL: https://issues.apache.org/jira/browse/IGNITE-11848
> Project: Ignite
>  Issue Type: Task
>Affects Versions: 2.7
>Reporter: Nikolay Izhikov
>Assignee: Nikolay Izhikov
>Priority: Major
>  Labels: IEP-35
> Fix For: 2.8
>
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> Umbrella ticket for the IEP-35. Monitoring and profiling.
> Phase 1 should include:
>  * NextGen monitoring subsystem implementation to manage
>  ** metrics
>  ** -lists- (will be implemented in the following tickets)
>  ** exporters
>  * JMX, SQLView, Log exporters
>  * Migration of existing metrics to new manager
>  * -Lists for all Ignite user API-



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (IGNITE-11867) Fix flaky test GridCacheRebalancingWithAsyncClearingTest#testCorrectRebalancingCurrentlyRentingPartitions

2019-06-06 Thread Alexei Scherbakov (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-11867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16857479#comment-16857479
 ] 

Alexei Scherbakov commented on IGNITE-11867:


[~ivan.glukos] [~Jokser] 

Please review.

The main idea of the fix is to enforce a happens-before relation: the current 
rebalance completes before the next partition clearing starts, which prevents the 
race between rebalancing and clearing.

I checked the timed-out runs and do not see any obvious relation to the patch.
MVCC PDS3 and PDS4 also time out in the base (master) branch.
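A minimal sketch of the ordering idea, using plain java.util.concurrent primitives rather than the actual Ignite internals (all names below are illustrative, not real Ignite classes):
{noformat}
import java.util.concurrent.CompletableFuture;

public class RebalanceThenClearSketch {
    public static void main(String[] args) {
        // The rebalance for the current topology version is represented by a future.
        CompletableFuture<Boolean> rebalanceFut =
            CompletableFuture.supplyAsync(RebalanceThenClearSketch::runRebalance);

        // Clearing for the next topology version is chained strictly after the
        // rebalance future completes, so the two can never run concurrently.
        rebalanceFut.thenAccept(owned -> {
            if (!owned)
                clearPartition(); // rebalance failed: clear and schedule rebalance again
        }).join();
    }

    private static boolean runRebalance() { return true; }   // placeholder: supply/demand exchange
    private static void clearPartition() { /* placeholder: remove stale entries */ }
}
{noformat}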




> Fix flaky test 
> GridCacheRebalancingWithAsyncClearingTest#testCorrectRebalancingCurrentlyRentingPartitions
> -
>
> Key: IGNITE-11867
> URL: https://issues.apache.org/jira/browse/IGNITE-11867
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexei Scherbakov
>Assignee: Alexei Scherbakov
>Priority: Major
> Fix For: 2.8
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> {noformat}
> java.lang.AssertionError: Value for 4 is null
> at org.junit.Assert.fail(Assert.java:88)
> at org.junit.Assert.assertTrue(Assert.java:41)
> at org.junit.Assert.assertNotNull(Assert.java:621)
> at 
> org.apache.ignite.internal.processors.cache.distributed.rebalancing.GridCacheRebalancingWithAsyncClearingTest.testCorrectRebalancingCurrentlyRentingPartitions(GridCacheRebalancingWithAsyncClearingTest.java:280)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
> at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
> at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
> at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
> at 
> org.apache.ignite.testframework.junits.GridAbstractTest$6.run(GridAbstractTest.java:2148)
> at java.lang.Thread.run(Thread.java:748){noformat}
> EDIT: The issue is related to the recently contributed IGNITE-10078. In a specific 
> scenario, due to a race, partition clearing could be started while the partition is 
> still passing through a rebalance started on a previous topology version.
> I fixed it by preventing partition clearing on a newer topology version. In that 
> case, if the rebalance finishes and the partition moves to OWNING state, further 
> clearing is not needed any more; otherwise, the partition should be scheduled for 
> clearing again.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (IGNITE-11887) Add more test scenarios for OWNING -> RENTING -> MOVING scenario

2019-05-31 Thread Alexei Scherbakov (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-11887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexei Scherbakov updated IGNITE-11887:
---
Description: 
Relevant test 
GridCacheRebalancingWithAsyncClearingTest#testCorrectRebalancingCurrentlyRentingPartitions.
Need to extend with 
1. in-memory
2. under load

  was:Relevant test 
GridCacheRebalancingWithAsyncClearingTest#testCorrectRebalancingCurrentlyRentingPartitions


> Add more test scenarios for OWNING -> RENTING -> MOVING scenario
> -
>
> Key: IGNITE-11887
> URL: https://issues.apache.org/jira/browse/IGNITE-11887
> Project: Ignite
>  Issue Type: Test
>Reporter: Alexei Scherbakov
>Assignee: Alexei Scherbakov
>Priority: Major
>
> Relevant test 
> GridCacheRebalancingWithAsyncClearingTest#testCorrectRebalancingCurrentlyRentingPartitions.
> Need to extend with 
> 1. in-memory
> 2. under load



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-11887) Add more test scenarios for OWNING -> RENTING -> MOVING scenario

2019-05-31 Thread Alexei Scherbakov (JIRA)
Alexei Scherbakov created IGNITE-11887:
--

 Summary: Add more test scenarios for OWNING -> RENTING -> MOVING 
scenario
 Key: IGNITE-11887
 URL: https://issues.apache.org/jira/browse/IGNITE-11887
 Project: Ignite
  Issue Type: Test
Reporter: Alexei Scherbakov
Assignee: Alexei Scherbakov


Relevant test 
GridCacheRebalancingWithAsyncClearingTest#testCorrectRebalancingCurrentlyRentingPartitions



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (IGNITE-11867) Fix flaky test GridCacheRebalancingWithAsyncClearingTest#testCorrectRebalancingCurrentlyRentingPartitions

2019-05-30 Thread Alexei Scherbakov (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-11867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16851750#comment-16851750
 ] 

Alexei Scherbakov commented on IGNITE-11867:


[~ivan.glukos] 

The failing suite is not related to the changes.
Please review. 

> Fix flaky test 
> GridCacheRebalancingWithAsyncClearingTest#testCorrectRebalancingCurrentlyRentingPartitions
> -
>
> Key: IGNITE-11867
> URL: https://issues.apache.org/jira/browse/IGNITE-11867
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexei Scherbakov
>Assignee: Alexei Scherbakov
>Priority: Major
> Fix For: 2.8
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> {noformat}
> java.lang.AssertionError: Value for 4 is null
> at org.junit.Assert.fail(Assert.java:88)
> at org.junit.Assert.assertTrue(Assert.java:41)
> at org.junit.Assert.assertNotNull(Assert.java:621)
> at 
> org.apache.ignite.internal.processors.cache.distributed.rebalancing.GridCacheRebalancingWithAsyncClearingTest.testCorrectRebalancingCurrentlyRentingPartitions(GridCacheRebalancingWithAsyncClearingTest.java:280)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
> at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
> at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
> at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
> at 
> org.apache.ignite.testframework.junits.GridAbstractTest$6.run(GridAbstractTest.java:2148)
> at java.lang.Thread.run(Thread.java:748){noformat}
> EDIT: The issue is related to the recently contributed IGNITE-10078. In a specific 
> scenario, due to a race, partition clearing could be started while the partition is 
> still passing through a rebalance started on a previous topology version.
> I fixed it by preventing partition clearing on a newer topology version. In that 
> case, if the rebalance finishes and the partition moves to OWNING state, further 
> clearing is not needed any more; otherwise, the partition should be scheduled for 
> clearing again.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (IGNITE-11867) Fix flaky test GridCacheRebalancingWithAsyncClearingTest#testCorrectRebalancingCurrentlyRentingPartitions

2019-05-27 Thread Alexei Scherbakov (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-11867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexei Scherbakov updated IGNITE-11867:
---
Description: 
{noformat}
java.lang.AssertionError: Value for 4 is null
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.assertTrue(Assert.java:41)
at org.junit.Assert.assertNotNull(Assert.java:621)
at 
org.apache.ignite.internal.processors.cache.distributed.rebalancing.GridCacheRebalancingWithAsyncClearingTest.testCorrectRebalancingCurrentlyRentingPartitions(GridCacheRebalancingWithAsyncClearingTest.java:280)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at 
org.apache.ignite.testframework.junits.GridAbstractTest$6.run(GridAbstractTest.java:2148)
at java.lang.Thread.run(Thread.java:748){noformat}

EDIT: The issue is related to the recently contributed IGNITE-10078. In a specific 
scenario, due to a race, partition clearing could be started while the partition is 
still passing through a rebalance started on a previous topology version.

I fixed it by preventing partition clearing on a newer topology version. In that 
case, if the rebalance finishes and the partition moves to OWNING state, further 
clearing is not needed any more; otherwise, the partition should be scheduled for 
clearing again.


  was:
{noformat}
java.lang.AssertionError: Value for 4 is null
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.assertTrue(Assert.java:41)
at org.junit.Assert.assertNotNull(Assert.java:621)
at 
org.apache.ignite.internal.processors.cache.distributed.rebalancing.GridCacheRebalancingWithAsyncClearingTest.testCorrectRebalancingCurrentlyRentingPartitions(GridCacheRebalancingWithAsyncClearingTest.java:280)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at 
org.apache.ignite.testframework.junits.GridAbstractTest$6.run(GridAbstractTest.java:2148)
at java.lang.Thread.run(Thread.java:748){noformat}

EDIT: The issue is related to recently contributed IGNITE-10078. In specific 
scenario due to race partition clearing could be started while partition is 
passing through ongoing rebalancing started on previous topology version.

I fixed it by preventing partition clearing on newer topology versions. In such 
case if rebalance will be finished and partition will go in OWNING state 
further clearing is not needed any more, otherwise partition will be scheduled 
for clearing again.



> Fix flaky test 
> GridCacheRebalancingWithAsyncClearingTest#testCorrectRebalancingCurrentlyRentingPartitions
> -
>
> Key: IGNITE-11867
> URL: https://issues.apache.org/jira/browse/IGNITE-11867
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexei Scherbakov
>Assignee: Alexei Scherbakov
>Priority: Major
> Fix For: 2.8
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> {noformat}
> java.lang.AssertionError: Value for 4 is null
> at org.junit.Assert.fail(Assert.java:88)
> at org.junit.Assert.assertTrue(Assert.java:41)
> at org.junit.Assert.assertNotNull(Assert.java:621)
> at 
> org.apache.ignite.internal.processors.cache.distributed.rebalancing.GridCacheRebalancingWithAsyncClearingTest.testCorrectRebalancingCurrentlyRentingPartitions(GridCacheRebalancingWithAsyncClearingTest.java:280)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> 

[jira] [Updated] (IGNITE-11867) Fix flaky test GridCacheRebalancingWithAsyncClearingTest#testCorrectRebalancingCurrentlyRentingPartitions

2019-05-27 Thread Alexei Scherbakov (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-11867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexei Scherbakov updated IGNITE-11867:
---
Description: 
{noformat}
java.lang.AssertionError: Value for 4 is null
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.assertTrue(Assert.java:41)
at org.junit.Assert.assertNotNull(Assert.java:621)
at 
org.apache.ignite.internal.processors.cache.distributed.rebalancing.GridCacheRebalancingWithAsyncClearingTest.testCorrectRebalancingCurrentlyRentingPartitions(GridCacheRebalancingWithAsyncClearingTest.java:280)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at 
org.apache.ignite.testframework.junits.GridAbstractTest$6.run(GridAbstractTest.java:2148)
at java.lang.Thread.run(Thread.java:748){noformat}

EDIT: The issue is related to recently contributed IGNITE-10078. In specific 
scenario due to race partition clearing could be started while partition is 
passing through ongoing rebalancing started on previous topology version.

I fixed it by preventing partition clearing on newer topology versions. In such 
case if rebalance will be finished and partition will go in OWNING state 
further clearing is not needed any more, otherwise partition will be scheduled 
for clearing again.


  was:
{noformat}
java.lang.AssertionError: Value for 4 is null
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.assertTrue(Assert.java:41)
at org.junit.Assert.assertNotNull(Assert.java:621)
at 
org.apache.ignite.internal.processors.cache.distributed.rebalancing.GridCacheRebalancingWithAsyncClearingTest.testCorrectRebalancingCurrentlyRentingPartitions(GridCacheRebalancingWithAsyncClearingTest.java:280)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at 
org.apache.ignite.testframework.junits.GridAbstractTest$6.run(GridAbstractTest.java:2148)
at java.lang.Thread.run(Thread.java:748){noformat}


> Fix flaky test 
> GridCacheRebalancingWithAsyncClearingTest#testCorrectRebalancingCurrentlyRentingPartitions
> -
>
> Key: IGNITE-11867
> URL: https://issues.apache.org/jira/browse/IGNITE-11867
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexei Scherbakov
>Assignee: Alexei Scherbakov
>Priority: Major
> Fix For: 2.8
>
>
> {noformat}
> java.lang.AssertionError: Value for 4 is null
> at org.junit.Assert.fail(Assert.java:88)
> at org.junit.Assert.assertTrue(Assert.java:41)
> at org.junit.Assert.assertNotNull(Assert.java:621)
> at 
> org.apache.ignite.internal.processors.cache.distributed.rebalancing.GridCacheRebalancingWithAsyncClearingTest.testCorrectRebalancingCurrentlyRentingPartitions(GridCacheRebalancingWithAsyncClearingTest.java:280)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
> at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
> at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
> at 

[jira] [Commented] (IGNITE-11256) Implement read-only mode for grid

2019-05-24 Thread Alexei Scherbakov (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-11256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16847688#comment-16847688
 ] 

Alexei Scherbakov commented on IGNITE-11256:


[~antonovsergey93]

I left some minor comments under the PR.

 

> Implement read-only mode for grid
> -
>
> Key: IGNITE-11256
> URL: https://issues.apache.org/jira/browse/IGNITE-11256
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexei Scherbakov
>Assignee: Sergey Antonov
>Priority: Major
> Fix For: 2.8
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Should be triggered from control.sh utility.
> Useful for maintenance work, for example checking partition consistency 
> (idle_verify)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (IGNITE-11867) Fix flaky test GridCacheRebalancingWithAsyncClearingTest#testCorrectRebalancingCurrentlyRentingPartitions

2019-05-23 Thread Alexei Scherbakov (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-11867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexei Scherbakov reassigned IGNITE-11867:
--

Assignee: Alexei Scherbakov

> Fix flaky test 
> GridCacheRebalancingWithAsyncClearingTest#testCorrectRebalancingCurrentlyRentingPartitions
> -
>
> Key: IGNITE-11867
> URL: https://issues.apache.org/jira/browse/IGNITE-11867
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexei Scherbakov
>Assignee: Alexei Scherbakov
>Priority: Major
> Fix For: 2.8
>
>
> {noformat}
> java.lang.AssertionError: Value for 4 is null
> at org.junit.Assert.fail(Assert.java:88)
> at org.junit.Assert.assertTrue(Assert.java:41)
> at org.junit.Assert.assertNotNull(Assert.java:621)
> at 
> org.apache.ignite.internal.processors.cache.distributed.rebalancing.GridCacheRebalancingWithAsyncClearingTest.testCorrectRebalancingCurrentlyRentingPartitions(GridCacheRebalancingWithAsyncClearingTest.java:280)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
> at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
> at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
> at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
> at 
> org.apache.ignite.testframework.junits.GridAbstractTest$6.run(GridAbstractTest.java:2148)
> at java.lang.Thread.run(Thread.java:748){noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (IGNITE-11867) Fix flaky test GridCacheRebalancingWithAsyncClearingTest#testCorrectRebalancingCurrentlyRentingPartitions

2019-05-23 Thread Alexei Scherbakov (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-11867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexei Scherbakov updated IGNITE-11867:
---
Description: 
{noformat}
java.lang.AssertionError: Value for 4 is null
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.assertTrue(Assert.java:41)
at org.junit.Assert.assertNotNull(Assert.java:621)
at 
org.apache.ignite.internal.processors.cache.distributed.rebalancing.GridCacheRebalancingWithAsyncClearingTest.testCorrectRebalancingCurrentlyRentingPartitions(GridCacheRebalancingWithAsyncClearingTest.java:280)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at 
org.apache.ignite.testframework.junits.GridAbstractTest$6.run(GridAbstractTest.java:2148)
at java.lang.Thread.run(Thread.java:748){noformat}

> Fix flaky test 
> GridCacheRebalancingWithAsyncClearingTest#testCorrectRebalancingCurrentlyRentingPartitions
> -
>
> Key: IGNITE-11867
> URL: https://issues.apache.org/jira/browse/IGNITE-11867
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexei Scherbakov
>Priority: Major
> Fix For: 2.8
>
>
> {noformat}
> java.lang.AssertionError: Value for 4 is null
> at org.junit.Assert.fail(Assert.java:88)
> at org.junit.Assert.assertTrue(Assert.java:41)
> at org.junit.Assert.assertNotNull(Assert.java:621)
> at 
> org.apache.ignite.internal.processors.cache.distributed.rebalancing.GridCacheRebalancingWithAsyncClearingTest.testCorrectRebalancingCurrentlyRentingPartitions(GridCacheRebalancingWithAsyncClearingTest.java:280)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
> at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
> at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
> at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
> at 
> org.apache.ignite.testframework.junits.GridAbstractTest$6.run(GridAbstractTest.java:2148)
> at java.lang.Thread.run(Thread.java:748){noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-11867) Fix flaky test GridCacheRebalancingWithAsyncClearingTest#testCorrectRebalancingCurrentlyRentingPartitions

2019-05-23 Thread Alexei Scherbakov (JIRA)
Alexei Scherbakov created IGNITE-11867:
--

 Summary: Fix flaky test 
GridCacheRebalancingWithAsyncClearingTest#testCorrectRebalancingCurrentlyRentingPartitions
 Key: IGNITE-11867
 URL: https://issues.apache.org/jira/browse/IGNITE-11867
 Project: Ignite
  Issue Type: Bug
Reporter: Alexei Scherbakov
 Fix For: 2.8






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (IGNITE-10078) Node failure during concurrent partition updates may cause partition desync between primary and backup.

2019-05-22 Thread Alexei Scherbakov (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16845931#comment-16845931
 ] 

Alexei Scherbakov edited comment on IGNITE-10078 at 5/22/19 2:42 PM:
-

IgniteCache150ClientsTest from Cache 6 has also frequently timed out in master.

[~ivan.glukos] All comments are fixed, ready for merge.


was (Author: ascherbakov):
_IgniteCache150ClientsTest from Cache 6 also oftenly timed out in master_ 

[~ivan.glukos] All comments are fixed, ready for merge.

> Node failure during concurrent partition updates may cause partition desync 
> between primary and backup.
> ---
>
> Key: IGNITE-10078
> URL: https://issues.apache.org/jira/browse/IGNITE-10078
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexei Scherbakov
>Assignee: Alexei Scherbakov
>Priority: Major
> Fix For: 2.8
>
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> This is possible if some updates are not written to WAL before node failure. 
> They will not be applied by rebalancing due to equal partition counters in the 
> following scenario:
> 1. Start a grid with 3 nodes, 2 backups.
> 2. Preload some data to partition P.
> 3. Start two concurrent transactions, each writing a single key to the same 
> partition P; the keys are different:
> {noformat}
> try(Transaction tx = client.transactions().txStart(PESSIMISTIC, 
> REPEATABLE_READ, 0, 1)) {
>   client.cache(DEFAULT_CACHE_NAME).put(k, v);
>   tx.commit();
> }
> {noformat}
> 4. Order the updates on the backup so that the update with the higher partition 
> counter is written to WAL, while the update with the lower partition counter 
> fails (the failure handler is triggered) before it is added to WAL.
> 5. Return the failed node to the grid and observe that no rebalancing happens 
> due to equal partition counters.
> Possible solution: detect gaps in update counters on recovery and force 
> rebalance from a node without gaps if detected.
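As an aside, a minimal, self-contained sketch of the gap-detection idea from the "Possible solution" above; this is not the actual Ignite partition update counter implementation, the names and structure are assumed for illustration only:
{noformat}
import java.util.TreeSet;

public class GapTrackingCounterSketch {
    private long applied;                               // highest counter applied without gaps
    private final TreeSet<Long> outOfOrder = new TreeSet<>();

    public synchronized void update(long cntr) {
        if (cntr <= applied)
            return;                                     // stale or duplicate update
        outOfOrder.add(cntr);
        while (outOfOrder.remove(applied + 1))          // advance while updates are contiguous
            applied++;
    }

    /** @return true if some updates below the max seen counter were never applied (a gap). */
    public synchronized boolean hasGaps() {
        return !outOfOrder.isEmpty();
    }

    public synchronized long applied() {
        return applied;
    }
}
{noformat}
On recovery, a node whose counter reports hasGaps() would be a candidate for forced full rebalance from a gap-free owner.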



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (IGNITE-10078) Node failure during concurrent partition updates may cause partition desync between primary and backup.

2019-05-22 Thread Alexei Scherbakov (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16845931#comment-16845931
 ] 

Alexei Scherbakov commented on IGNITE-10078:


_IgniteCache150ClientsTest from Cache 6 also oftenly timed out in master_ 

[~ivan.glukos] All comments are fixed, ready for merge.

> Node failure during concurrent partition updates may cause partition desync 
> between primary and backup.
> ---
>
> Key: IGNITE-10078
> URL: https://issues.apache.org/jira/browse/IGNITE-10078
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexei Scherbakov
>Assignee: Alexei Scherbakov
>Priority: Major
> Fix For: 2.8
>
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> This is possible if some updates are not written to WAL before node failure. 
> They will not be applied by rebalancing due to equal partition counters in the 
> following scenario:
> 1. Start a grid with 3 nodes, 2 backups.
> 2. Preload some data to partition P.
> 3. Start two concurrent transactions, each writing a single key to the same 
> partition P; the keys are different:
> {noformat}
> try(Transaction tx = client.transactions().txStart(PESSIMISTIC, 
> REPEATABLE_READ, 0, 1)) {
>   client.cache(DEFAULT_CACHE_NAME).put(k, v);
>   tx.commit();
> }
> {noformat}
> 4. Order the updates on the backup so that the update with the higher partition 
> counter is written to WAL, while the update with the lower partition counter 
> fails (the failure handler is triggered) before it is added to WAL.
> 5. Return the failed node to the grid and observe that no rebalancing happens 
> due to equal partition counters.
> Possible solution: detect gaps in update counters on recovery and force 
> rebalance from a node without gaps if detected.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (IGNITE-11862) Cache stopping on supplier during rebalance causes NPE and supplying failure.

2019-05-22 Thread Alexei Scherbakov (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-11862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexei Scherbakov updated IGNITE-11862:
---
Fix Version/s: 2.8

> Cache stopping on supplier during rebalance causes NPE and supplying failure.
> -
>
> Key: IGNITE-11862
> URL: https://issues.apache.org/jira/browse/IGNITE-11862
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexei Scherbakov
>Priority: Major
> Fix For: 2.8
>
>
> {noformat}
> [21:12:14]W: [org.apache.ignite:ignite-core] [2019-05-20 
> 21:12:14,376][ERROR][sys-#60310%distributed.CacheParallelStartTest0%][GridDhtPartitionSupplier]
>  Failed to continue supplying [grp=static-cache-group45, 
> demander=ed1c0109-8721-4cd8-80d9-d36e8251, top
> Ver=AffinityTopologyVersion [topVer=2, minorTopVer=0], topic=0]
> [21:12:14]W: [org.apache.ignite:ignite-core] java.lang.NullPointerException
> [21:12:14]W: [org.apache.ignite:ignite-core] at 
> org.apache.ignite.internal.processors.cache.CacheGroupContext.addRebalanceSupplyEvent(CacheGroupContext.java:525)
> [21:12:14]W: [org.apache.ignite:ignite-core] at 
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionSupplier.handleDemandMessage(GridDhtPartitionSupplier.java:422)
> [21:12:14]W: [org.apache.ignite:ignite-core] at 
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader.handleDemandMessage(GridDhtPreloader.java:397)
> [21:12:14]W: [org.apache.ignite:ignite-core] at 
> org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:455)
> [21:12:14]W: [org.apache.ignite:ignite-core] at 
> org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:440)
> [21:12:14]W: [org.apache.ignite:ignite-core] at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1141)
> [21:12:14]W: [org.apache.ignite:ignite-core] at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:591)
> [21:12:14]W: [org.apache.ignite:ignite-core] at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$800(GridCacheIoManager.java:109)
> [21:12:14]W: [org.apache.ignite:ignite-core] at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager$OrderedMessageListener.onMessage(GridCacheIoManager.java:1706)
> [21:12:14]W: [org.apache.ignite:ignite-core] at 
> org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1566)
> [21:12:14]W: [org.apache.ignite:ignite-core] at 
> org.apache.ignite.internal.managers.communication.GridIoManager.access$4200(GridIoManager.java:129)
> [21:12:14]W: [org.apache.ignite:ignite-core] at 
> org.apache.ignite.internal.managers.communication.GridIoManager$GridCommunicationMessageSet.unwind(GridIoManager.java:2795)
> [21:12:14]W: [org.apache.ignite:ignite-core] at 
> org.apache.ignite.internal.managers.communication.GridIoManager.unwindMessageSet(GridIoManager.java:1523)
> [21:12:14]W: [org.apache.ignite:ignite-core] at 
> org.apache.ignite.internal.managers.communication.GridIoManager.access$4500(GridIoManager.java:129)
> [21:12:14]W: [org.apache.ignite:ignite-core] at 
> org.apache.ignite.internal.managers.communication.GridIoManager$9.run(GridIoManager.java:1492)
> [21:12:14]W: [org.apache.ignite:ignite-core] at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> [21:12:14]W: [org.apache.ignite:ignite-core] at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> [21:12:14]W: [org.apache.ignite:ignite-core] at 
> java.lang.Thread.run(Thread.java:748)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-11862) Cache stopping on supplier during rebalance causes NPE and supplying failure.

2019-05-22 Thread Alexei Scherbakov (JIRA)
Alexei Scherbakov created IGNITE-11862:
--

 Summary: Cache stopping on supplier during rebalance causes NPE 
and supplying failure.
 Key: IGNITE-11862
 URL: https://issues.apache.org/jira/browse/IGNITE-11862
 Project: Ignite
  Issue Type: Bug
Reporter: Alexei Scherbakov


{noformat}
[21:12:14]W: [org.apache.ignite:ignite-core] [2019-05-20 
21:12:14,376][ERROR][sys-#60310%distributed.CacheParallelStartTest0%][GridDhtPartitionSupplier]
 Failed to continue supplying [grp=static-cache-group45, 
demander=ed1c0109-8721-4cd8-80d9-d36e8251, top
Ver=AffinityTopologyVersion [topVer=2, minorTopVer=0], topic=0]
[21:12:14]W: [org.apache.ignite:ignite-core] java.lang.NullPointerException
[21:12:14]W: [org.apache.ignite:ignite-core] at 
org.apache.ignite.internal.processors.cache.CacheGroupContext.addRebalanceSupplyEvent(CacheGroupContext.java:525)
[21:12:14]W: [org.apache.ignite:ignite-core] at 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionSupplier.handleDemandMessage(GridDhtPartitionSupplier.java:422)
[21:12:14]W: [org.apache.ignite:ignite-core] at 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader.handleDemandMessage(GridDhtPreloader.java:397)
[21:12:14]W: [org.apache.ignite:ignite-core] at 
org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:455)
[21:12:14]W: [org.apache.ignite:ignite-core] at 
org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:440)
[21:12:14]W: [org.apache.ignite:ignite-core] at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1141)
[21:12:14]W: [org.apache.ignite:ignite-core] at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:591)
[21:12:14]W: [org.apache.ignite:ignite-core] at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$800(GridCacheIoManager.java:109)
[21:12:14]W: [org.apache.ignite:ignite-core] at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager$OrderedMessageListener.onMessage(GridCacheIoManager.java:1706)
[21:12:14]W: [org.apache.ignite:ignite-core] at 
org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1566)
[21:12:14]W: [org.apache.ignite:ignite-core] at 
org.apache.ignite.internal.managers.communication.GridIoManager.access$4200(GridIoManager.java:129)
[21:12:14]W: [org.apache.ignite:ignite-core] at 
org.apache.ignite.internal.managers.communication.GridIoManager$GridCommunicationMessageSet.unwind(GridIoManager.java:2795)
[21:12:14]W: [org.apache.ignite:ignite-core] at 
org.apache.ignite.internal.managers.communication.GridIoManager.unwindMessageSet(GridIoManager.java:1523)
[21:12:14]W: [org.apache.ignite:ignite-core] at 
org.apache.ignite.internal.managers.communication.GridIoManager.access$4500(GridIoManager.java:129)
[21:12:14]W: [org.apache.ignite:ignite-core] at 
org.apache.ignite.internal.managers.communication.GridIoManager$9.run(GridIoManager.java:1492)
[21:12:14]W: [org.apache.ignite:ignite-core] at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[21:12:14]W: [org.apache.ignite:ignite-core] at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[21:12:14]W: [org.apache.ignite:ignite-core] at 
java.lang.Thread.run(Thread.java:748)
{noformat}
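The stack trace suggests the supplier touches the cache group's event recorder after the cache has already been stopped. A hypothetical guard, with all names made up for illustration (not the real Ignite internals), would look roughly like:
{noformat}
final class SupplyEventGuardSketch {
    interface CacheGroupHandle {
        boolean stopped();
        EventRecorder events();
    }

    interface EventRecorder {
        void record(String type, int partId);
    }

    /** Skip event recording if the cache group was stopped concurrently, avoiding the NPE. */
    static void recordSupplyEvent(CacheGroupHandle grp, int partId) {
        if (grp == null || grp.stopped())
            return;
        grp.events().record("REBALANCE_PART_SUPPLIED", partId);
    }
}
{noformat}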



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (IGNITE-11791) Fix IgnitePdsContinuousRestartTestWithExpiryPolicy

2019-05-22 Thread Alexei Scherbakov (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-11791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexei Scherbakov updated IGNITE-11791:
---
Description: Test reproduces partition counter validation errors (but 
passes nevertheless).  (was: Test reproduces partition counter validation 
errors.)

> Fix IgnitePdsContinuousRestartTestWithExpiryPolicy 
> ---
>
> Key: IGNITE-11791
> URL: https://issues.apache.org/jira/browse/IGNITE-11791
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexei Scherbakov
>Priority: Major
>
> Test reproduces partition counter validation errors (but passes nevertheless).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (IGNITE-11820) Add persistence to IgniteCacheGroupTest

2019-05-22 Thread Alexei Scherbakov (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-11820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexei Scherbakov updated IGNITE-11820:
---
Summary: Add persistence to IgniteCacheGroupTest  (was: Add partition 
consistency tests for multiple caches in group.)

> Add persistence to IgniteCacheGroupTest
> ---
>
> Key: IGNITE-11820
> URL: https://issues.apache.org/jira/browse/IGNITE-11820
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexei Scherbakov
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (IGNITE-11857) Investigate performance drop after IGNITE-10078

2019-05-21 Thread Alexei Scherbakov (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-11857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexei Scherbakov updated IGNITE-11857:
---
Description: 
After IGNITE-10078, yardstick tests show a performance drop of up to 8% in some 
scenarios:

* tx-optim-repRead-put-get

* tx-optimistic-put

* tx-putAll

This is partially due to the new update counter implementation, but not only that; 
further investigation is required.

  was:
After IGNITE-1078 yardstick tests show performance drop up to 8% in some 
scenarios:

* tx-optim-repRead-put-get

* tx-optimistic-put

* tx-putAll

Partially this is due new update counter implementation, but not only. 
Investigation is required.


> Investigate performance drop after IGNITE-10078
> ---
>
> Key: IGNITE-11857
> URL: https://issues.apache.org/jira/browse/IGNITE-11857
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexei Scherbakov
>Priority: Major
>
> After IGNITE-10078, yardstick tests show a performance drop of up to 8% in some 
> scenarios:
> * tx-optim-repRead-put-get
> * tx-optimistic-put
> * tx-putAll
> This is partially due to the new update counter implementation, but not only 
> that; further investigation is required.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-11857) Investigate performance drop after IGNITE-10078

2019-05-21 Thread Alexei Scherbakov (JIRA)
Alexei Scherbakov created IGNITE-11857:
--

 Summary: Investigate performance drop after IGNITE-10078
 Key: IGNITE-11857
 URL: https://issues.apache.org/jira/browse/IGNITE-11857
 Project: Ignite
  Issue Type: Improvement
Reporter: Alexei Scherbakov


After IGNITE-1078 yardstick tests show performance drop up to 8% in some 
scenarios:

* tx-optim-repRead-put-get

* tx-optimistic-put

* tx-putAll

Partially this is due new update counter implementation, but not only. 
Investigation is required.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (IGNITE-10078) Node failure during concurrent partition updates may cause partition desync between primary and backup.

2019-05-17 Thread Alexei Scherbakov (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16842230#comment-16842230
 ] 

Alexei Scherbakov commented on IGNITE-10078:


[~ivan.glukos], please do final review.

> Node failure during concurrent partition updates may cause partition desync 
> between primary and backup.
> ---
>
> Key: IGNITE-10078
> URL: https://issues.apache.org/jira/browse/IGNITE-10078
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexei Scherbakov
>Assignee: Alexei Scherbakov
>Priority: Major
> Fix For: 2.8
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> This is possible if some updates are not written to WAL before node failure. 
> They will not be applied by rebalancing due to equal partition counters in the 
> following scenario:
> 1. Start a grid with 3 nodes, 2 backups.
> 2. Preload some data to partition P.
> 3. Start two concurrent transactions, each writing a single key to the same 
> partition P; the keys are different:
> {noformat}
> try(Transaction tx = client.transactions().txStart(PESSIMISTIC, 
> REPEATABLE_READ, 0, 1)) {
>   client.cache(DEFAULT_CACHE_NAME).put(k, v);
>   tx.commit();
> }
> {noformat}
> 4. Order the updates on the backup so that the update with the higher partition 
> counter is written to WAL, while the update with the lower partition counter 
> fails (the failure handler is triggered) before it is added to WAL.
> 5. Return the failed node to the grid and observe that no rebalancing happens 
> due to equal partition counters.
> Possible solution: detect gaps in update counters on recovery and force 
> rebalance from a node without gaps if detected.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (IGNITE-10078) Node failure during concurrent partition updates may cause partition desync between primary and backup.

2019-05-17 Thread Alexei Scherbakov (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16842091#comment-16842091
 ] 

Alexei Scherbakov commented on IGNITE-10078:


Contribution seems to be ready for merging.

> Node failure during concurrent partition updates may cause partition desync 
> between primary and backup.
> ---
>
> Key: IGNITE-10078
> URL: https://issues.apache.org/jira/browse/IGNITE-10078
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexei Scherbakov
>Assignee: Alexei Scherbakov
>Priority: Major
> Fix For: 2.8
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> This is possible if some updates are not written to WAL before node failure. 
> They will not be applied by rebalancing due to equal partition counters in the 
> following scenario:
> 1. Start a grid with 3 nodes, 2 backups.
> 2. Preload some data to partition P.
> 3. Start two concurrent transactions, each writing a single key to the same 
> partition P; the keys are different:
> {noformat}
> try(Transaction tx = client.transactions().txStart(PESSIMISTIC, 
> REPEATABLE_READ, 0, 1)) {
>   client.cache(DEFAULT_CACHE_NAME).put(k, v);
>   tx.commit();
> }
> {noformat}
> 4. Order the updates on the backup so that the update with the higher partition 
> counter is written to WAL, while the update with the lower partition counter 
> fails (the failure handler is triggered) before it is added to WAL.
> 5. Return the failed node to the grid and observe that no rebalancing happens 
> due to equal partition counters.
> Possible solution: detect gaps in update counters on recovery and force 
> rebalance from a node without gaps if detected.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (IGNITE-11256) Implement read-only mode for grid

2019-05-08 Thread Alexei Scherbakov (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-11256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835577#comment-16835577
 ] 

Alexei Scherbakov commented on IGNITE-11256:


[~antonovsergey93]

I reviewed your contribution. My comments:

1. No need to implement metrics aggregation for readOnlyMode and 
readOnlyModeDuration. They will be almost the same for all nodes. Better to move 
them to IgniteMXBean and, in addition, implement a readOnly(boolean) method to 
allow switching read-only mode from JMX. Look at 
{{org.apache.ignite.mxbean.IgniteMXBean#active(boolean)}}.

2. It might be good to have a way to activate the grid in a read-only state. This 
could be achieved by adding a new configuration property like 
readOnlyAfterActivation and something like --activate read-only in control.sh.

3. Fix logging like: log("Cluster is active" + (readOnly ? " (read-only)" : 
""));

4. Fix logging like: log("Read-only mode is " + (readOnly ? "enabled" : 
"disabled"));

5. Fix messages like: Failed to perform cache operation (cluster is in read-only 
mode).

6. U.hasCause is redundant and should be removed. We already have 
{{org.apache.ignite.internal.util.typedef.X#hasCause(java.lang.Throwable, 
java.lang.String, java.lang.Class...)}}.

7. Documentation on the new public methods {{IgniteCluster.readOnly*}} could be 
improved.

8. You should create a ticket for the missing bindings in the .NET module.

Otherwise looks good. [~tledkov-gridgain], could you review the SQL-related 
changes?
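For item 1, the JMX-facing shape being suggested might look roughly like the sketch below (a hypothetical interface for illustration, not the actual IgniteMXBean contract):
{noformat}
public interface ReadOnlyModeMXBeanSketch {
    /** @return {@code true} if the cluster is currently in read-only mode. */
    boolean readOnly();

    /** Enables or disables cluster-wide read-only mode, analogous to active(boolean). */
    void readOnly(boolean readOnly);
}
{noformat}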

> Implement read-only mode for grid
> -
>
> Key: IGNITE-11256
> URL: https://issues.apache.org/jira/browse/IGNITE-11256
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexei Scherbakov
>Assignee: Sergey Antonov
>Priority: Major
> Fix For: 2.8
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Should be triggered from control.sh utility.
> Useful for maintenance work, for example checking partition consistency 
> (idle_verify)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

