[jira] [Comment Edited] (IGNITE-9913) Prevent data updates blocking in case of backup BLT server node leave
[ https://issues.apache.org/jira/browse/IGNITE-9913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16994406#comment-16994406 ] Alexei Scherbakov edited comment on IGNITE-9913 at 12/12/19 9:27 AM:
[~avinogradov]
1. I've come to the conclusion that having the rebalanced state calculated on the coordinator is the most robust way to say the grid is rebalanced. Let's keep it.
2. I've left two comments in your PR regarding the change.
3. OK.
> Prevent data updates blocking in case of backup BLT server node leave
> Key: IGNITE-9913
> URL: https://issues.apache.org/jira/browse/IGNITE-9913
> Project: Ignite
> Issue Type: Improvement
> Components: general
> Reporter: Ivan Rakov
> Assignee: Anton Vinogradov
> Priority: Major
> Attachments: 9913_yardstick.png, master_yardstick.png
> Time Spent: 9h 10m
> Remaining Estimate: 0h
>
> Ignite cluster performs distributed partition map exchange when any server node leaves or joins the topology. Distributed PME blocks all updates and may take a long time. If all partitions are assigned according to the baseline topology and a server node leaves, there's no actual need to perform a distributed PME: every cluster node is able to recalculate the new affinity assignments and partition states locally. If we implement such a lightweight PME and handle mapping and lock requests on the new topology version correctly, updates won't be stopped (except updates of partitions that lost their primary copy).
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (IGNITE-9913) Prevent data updates blocking in case of backup BLT server node leave
[ https://issues.apache.org/jira/browse/IGNITE-9913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16994406#comment-16994406 ] Alexei Scherbakov commented on IGNITE-9913:
[~avinogradov]
1. I've come to the conclusion that having the rebalanced state calculated on the coordinator is the most robust way to say the grid is rebalanced. Let's keep it.
2. I've left two comments in your PR regarding the change.
3. OK.
[jira] [Commented] (IGNITE-9913) Prevent data updates blocking in case of backup BLT server node leave
[ https://issues.apache.org/jira/browse/IGNITE-9913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992697#comment-16992697 ] Alexei Scherbakov commented on IGNITE-9913:
[~avinogradov] I've reviewed the PR. Overall, the idea and the implementation look valid. My questions are:
1. You've introduced a flag _rebalanced_ indicating that the previous exchange future was completed after everything was rebalanced. The flag seems unnecessary: the _rebalanced_ state can be derived from the conditions a) this exchange is triggered by CacheAffinityChangeMessage, and b) for this exchange forceAffReassignment=true and GridDhtPartitionsFullMessage#idealAffinityDiff().isEmpty(). Can we get rid of the flag?
2. It seems CacheAffinityChangeMessage no longer contains any useful assignments when it is triggered by switching from the late to the ideal state. Can we get rid of sending any assignments for protocol v3?
Also, could you add a test where all owners of a partition leave one by one under load, and make sure that updates to other partitions work as expected without PME, using different loss policy modes and numbers of backups?
[jira] [Comment Edited] (IGNITE-12429) Rework bytes-based WAL archive size management logic to make historical rebalance more predictable
[ https://issues.apache.org/jira/browse/IGNITE-12429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992226#comment-16992226 ] Alexei Scherbakov edited comment on IGNITE-12429 at 12/10/19 7:10 AM:
[~ivan.glukos] I have some objections.
1. I don't think this is right. Having the ability to specify history in checkpoints is the same as setting a duration equal to checkpointFreq * walHistSize. This is a good thing to have, in my opinion. Probably we should change the property to be measured in time units, or just add a javadoc explaining how it is translated to a duration.
2. For me the root cause is the wrong treatment of histMap when calculating the available history for reservation. We already have a caching mechanism for checkpoint entries [1]. It looks possible to keep all the history in the heap (actually storing only references), using lazy loading/unloading when needed, and get rid of IGNITE_PDS_MAX_CHECKPOINT_MEMORY_HISTORY_SIZE (or maybe use it as a hint for caching). Also, I do not understand how having a sparse map will help us, because we need all entries for the history calculation.
[1] org.apache.ignite.internal.processors.cache.persistence.checkpoint.CheckpointEntry.GroupStateLazyStore
[jira] [Commented] (IGNITE-12429) Rework bytes-based WAL archive size management logic to make historical rebalance more predictable
[ https://issues.apache.org/jira/browse/IGNITE-12429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992226#comment-16992226 ] Alexei Scherbakov commented on IGNITE-12429:
[~ivan.glukos] I have some objections.
1. I don't think this is right. Having the ability to specify history in checkpoints is the same as setting a duration equal to checkpointFreq * walHistSize. This is a good thing to have, in my opinion. Probably we should change the property to be measured in time units, or just add a javadoc explaining how it is translated to a duration.
2. For me the root cause is the wrong treatment of histMap when calculating the available history for reservation. We already have a caching mechanism for checkpoint entries [1]. It looks possible to keep all the history in the heap (actually storing only references), using lazy loading/unloading when needed, and get rid of IGNITE_PDS_MAX_CHECKPOINT_MEMORY_HISTORY_SIZE (or maybe use it as a hint for caching). Also, I do not understand how having a sparse map will help us, because we need all entries for the history calculation.
[1] org.apache.ignite.internal.processors.cache.persistence.checkpoint.CheckpointEntry.GroupStateLazyStore
> Rework bytes-based WAL archive size management logic to make historical rebalance more predictable
> Key: IGNITE-12429
> URL: https://issues.apache.org/jira/browse/IGNITE-12429
> Project: Ignite
> Issue Type: Improvement
> Affects Versions: 2.7, 2.7.5, 2.7.6
> Reporter: Ivan Rakov
> Priority: Major
>
> Since 2.7, DataStorageConfiguration allows specifying the size of the WAL archive in bytes (see DataStorageConfiguration#maxWalArchiveSize), which is much more transparent to the user.
> Unfortunately, the new logic may be unpredictable when it comes to historical rebalance. The WAL archive is truncated when one of the following conditions occurs:
> 1. The total number of checkpoints in the WAL archive is bigger than DataStorageConfiguration#walHistSize
> 2. The total size of the WAL archive is bigger than DataStorageConfiguration#maxWalArchiveSize
> Independently, the in-memory checkpoint history contains only a fixed number of the last checkpoints (can be changed with IGNITE_PDS_MAX_CHECKPOINT_MEMORY_HISTORY_SIZE, 100 by default).
> All these particular qualities make it hard for the user to control usage of historical rebalance. Imagine the case when the user has light load (WAL gets rotated very slowly) and the default checkpoint frequency. After 100 * 3 = 300 minutes, all updates in the WAL will be impossible to receive via historical rebalance even if:
> 1. The user has configured a large DataStorageConfiguration#maxWalArchiveSize
> 2. The user has configured a large DataStorageConfiguration#walHistSize
> At the same time, setting a large IGNITE_PDS_MAX_CHECKPOINT_MEMORY_HISTORY_SIZE will help (only combined with the previous two points), but Ignite node heap usage may increase dramatically.
> I propose to change the WAL history management logic in the following way:
> 1. *Don't cut* the WAL archive when the number of checkpoints exceeds DataStorageConfiguration#walHistSize. WAL history should be managed only based on DataStorageConfiguration#maxWalArchiveSize.
> 2. Checkpoint history should contain a fixed number of entries, but should cover the whole stored WAL archive (not only its most recent part with the IGNITE_PDS_MAX_CHECKPOINT_MEMORY_HISTORY_SIZE last checkpoints). This can be achieved by making checkpoint history sparse: some intermediate checkpoints *may not be present in history*, but a fixed number of checkpoints can be positioned either uniformly (trying to keep a fixed number of bytes between two neighbour checkpoints) or exponentially (trying to keep a fixed ratio between [size of WAL from checkpoint(N-1) to the current write pointer] and [size of WAL from checkpoint(N) to the current write pointer]).
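The walHistSize-to-duration equivalence discussed in the comment (and the "100 * 3 = 300 minutes" estimate in the description) can be checked with a quick calculation. This is a sketch, not Ignite code; the 180 000 ms default checkpoint frequency and the 100-entry default history size are taken from the description above.

```java
public class WalHistoryDuration {
    // Rough duration of history covered by N retained checkpoints:
    // under light load, consecutive checkpoints are ~checkpointFreq apart.
    static long coveredMinutes(long checkpointFreqMs, int histSize) {
        return checkpointFreqMs * histSize / 60_000L;
    }

    public static void main(String[] args) {
        long checkpointFreqMs = 180_000L; // default checkpoint frequency: 3 minutes
        int histSize = 100;               // default in-memory checkpoint history size
        // 100 checkpoints * 3 minutes = 300 minutes of reachable history
        System.out.println(coveredMinutes(checkpointFreqMs, histSize) + " minutes");
    }
}
```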
[jira] [Commented] (IGNITE-11857) Investigate performance drop after IGNITE-10078
[ https://issues.apache.org/jira/browse/IGNITE-11857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989896#comment-16989896 ] Alexei Scherbakov commented on IGNITE-11857:
[~alex_pl] OK, let's proceed with the Map version.
> Investigate performance drop after IGNITE-10078
> Key: IGNITE-11857
> URL: https://issues.apache.org/jira/browse/IGNITE-11857
> Project: Ignite
> Issue Type: Improvement
> Reporter: Alexei Scherbakov
> Assignee: Aleksey Plekhanov
> Priority: Major
> Attachments: ignite-config.xml, run.properties.tx-optimistic-put-b-backup
> Time Spent: 20m
> Remaining Estimate: 0h
>
> After IGNITE-10078, yardstick tests show a performance drop of up to 8% in some scenarios:
> * tx-optim-repRead-put-get
> * tx-optimistic-put
> * tx-putAll
> This is partially due to the new update counter implementation, but not only. Investigation is required.
[jira] [Updated] (IGNITE-12422) Clean up GG-XXX internal ticket references from the code base.
[ https://issues.apache.org/jira/browse/IGNITE-12422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexei Scherbakov updated IGNITE-12422:
Description: Replace with the Apache Ignite equivalent if possible. Also, it's desirable to implement a checkstyle rule to prevent foreign links in TODOs [1]
[1] https://checkstyle.sourceforge.io/config_misc.html#TodoComment
was: Replace with Apache Ignite equivalent if possible.
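A hedged sketch of such a rule: checkstyle's TodoComment check (see the link above) matches comments against a configurable pattern. Placed under TreeWalker, something like the following could flag TODOs that reference internal GG-XXX tickets; the exact regex is an assumption, not a tested configuration.

```xml
<module name="TreeWalker">
  <!-- Flag TODO/FIXME comments that reference internal GG-XXX tickets. -->
  <module name="TodoComment">
    <property name="format" value="(TODO|FIXME).*\bGG-\d+"/>
  </module>
</module>
```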
[jira] [Comment Edited] (IGNITE-11857) Investigate performance drop after IGNITE-10078
[ https://issues.apache.org/jira/browse/IGNITE-11857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989717#comment-16989717 ] Alexei Scherbakov edited comment on IGNITE-11857 at 12/6/19 1:10 PM:
[~alex_pl] You have only measured heap allocation (GC pressure), which seems to be very low for both implementations. You should also measure the resident size of both structures. Long for the value can be replaced with Integer, because no tx with a batch of size close to or larger than Integer.MAX_VALUE is viable. For the most frequent use cases, object creation will be handled by the Integer boxing cache. I think I'm OK with the proposed improvement; just make sure we couldn't do better.
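The Integer boxing cache mentioned above is standard JDK behavior: Integer.valueOf (used by autoboxing) returns cached instances for values in [-128, 127] by default, so small per-batch counters avoid fresh allocations. A quick illustration, assuming a default JVM (no -XX:AutoBoxCacheMax override):

```java
public class BoxingCacheDemo {
    public static void main(String[] args) {
        Integer a = Integer.valueOf(100);
        Integer b = Integer.valueOf(100);
        // Same cached instance for small values: reference equality holds.
        System.out.println(a == b);   // true

        Integer c = Integer.valueOf(1000);
        Integer d = Integer.valueOf(1000);
        // Outside the default cache range: distinct objects are allocated.
        System.out.println(c == d);   // false
    }
}
```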
[jira] [Commented] (IGNITE-11857) Investigate performance drop after IGNITE-10078
[ https://issues.apache.org/jira/browse/IGNITE-11857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989525#comment-16989525 ] Alexei Scherbakov commented on IGNITE-11857:
[~alex_pl] I've looked at your contribution. Changing TreeSet to TreeMap looks like a very minor change. I think you can go further and get rid of the Item class. Out-of-order updates can be kept in a SortedMap where the key is a start and the value is a range (or even in a sorted array of primitive tuples). Another possibility is storing missing updates in a bitmap. You should also check the new solution for heap usage in comparison to the old one. For configurations with many partitions, lower heap usage could be a more significant advantage than the minor performance boost. Also, I have a little concern about the robustness of the fix. It might be risky to merge it into 2.8 without extensive testing, so I would postpone the change and improve the patch first.
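A minimal sketch of the SortedMap idea suggested above (the class name, method shape, and merge policy are assumptions for illustration, not the actual PR code): out-of-order update ranges are stored as start -> length entries in a TreeMap, merging with adjacent ranges on insert so the map stays compact.

```java
import java.util.Map;
import java.util.TreeMap;

public class OutOfOrderRanges {
    // Each entry maps the start of a contiguous out-of-order range to its length.
    private final TreeMap<Long, Long> ranges = new TreeMap<>();

    // Add a non-overlapping range [start, start + cnt), merging with
    // adjacent ranges on both sides.
    public void add(long start, long cnt) {
        // Merge with the following range if it begins exactly at our end.
        Long next = ranges.remove(start + cnt);
        if (next != null)
            cnt += next;

        // Merge with the preceding range if it ends exactly at our start.
        Map.Entry<Long, Long> prev = ranges.floorEntry(start);
        if (prev != null && prev.getKey() + prev.getValue() == start)
            ranges.put(prev.getKey(), prev.getValue() + cnt);
        else
            ranges.put(start, cnt);
    }

    public TreeMap<Long, Long> view() {
        return ranges;
    }

    public static void main(String[] args) {
        OutOfOrderRanges r = new OutOfOrderRanges();
        r.add(5, 2);  // {5=2}
        r.add(10, 3); // {5=2, 10=3}
        r.add(7, 3);  // fills the gap: everything collapses into one range
        System.out.println(r.view()); // {5=8}
    }
}
```

A bitmap (e.g. java.util.BitSet) trades this O(log n) merge for O(1) bit sets at the cost of memory proportional to the counter gap, which is the alternative weighed in the comment.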
[jira] [Updated] (IGNITE-12422) Clean up GG-XXX internal ticket references from the code base.
[ https://issues.apache.org/jira/browse/IGNITE-12422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexei Scherbakov updated IGNITE-12422:
Summary: Clean up GG-XXX internal ticket references from the code base. (was: Clean up GG-XXX internal ticket references from code base.)
[jira] [Created] (IGNITE-12422) Clean up GG-XXX internal ticket references from code base.
Alexei Scherbakov created IGNITE-12422:
Summary: Clean up GG-XXX internal ticket references from code base.
Key: IGNITE-12422
URL: https://issues.apache.org/jira/browse/IGNITE-12422
Project: Ignite
Issue Type: Improvement
Reporter: Alexei Scherbakov
Assignee: Alexei Scherbakov
Fix For: 2.9
Replace with Apache Ignite equivalent if possible.
[jira] [Commented] (IGNITE-11704) Write tombstones during rebalance to get rid of deferred delete buffer
[ https://issues.apache.org/jira/browse/IGNITE-11704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16988954#comment-16988954 ] Alexei Scherbakov commented on IGNITE-11704:
Better to store the full log, in case of TC history cleanup.
{noformat}
java.lang.AssertionError: Failed to wait for tombstone cleanup: distributed.CacheRemoveWithTombstonesLoadTest2 expected:<0> but was:<1>
	at org.apache.ignite.internal.processors.cache.distributed.CacheRemoveWithTombstonesLoadTest.waitTombstoneCleanup(CacheRemoveWithTombstonesLoadTest.java:335)
	at org.apache.ignite.internal.processors.cache.distributed.CacheRemoveWithTombstonesLoadTest.removeAndRebalance(CacheRemoveWithTombstonesLoadTest.java:250)
{noformat}
[jira] [Commented] (IGNITE-12049) Allow custom authenticators to use SSL certificates
[ https://issues.apache.org/jira/browse/IGNITE-12049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16986102#comment-16986102 ] Alexei Scherbakov commented on IGNITE-12049: [~SomeFire] Sounds good. Attributes for JDBC/ODBC can be passed to the driver as Base64-encoded strings; the factory approach is also fine. > Allow custom authenticators to use SSL certificates > --- > > Key: IGNITE-12049 > URL: https://issues.apache.org/jira/browse/IGNITE-12049 > Project: Ignite > Issue Type: Improvement >Reporter: Ryabov Dmitrii >Assignee: Ryabov Dmitrii >Priority: Minor > Time Spent: 1.5h > Remaining Estimate: 0h > > Add SSL certificates to AuthenticationContext, so, authenticators can make > additional checks based on SSL certificates. -- This message was sent by Atlassian Jira (v8.3.4#803005)
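The Base64 round trip the comment refers to can be sketched with nothing but the JDK. This is an illustration only: the `CertAttributeCodec` class and its method names are hypothetical, not part of Ignite or its JDBC driver.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical sketch: encode raw certificate bytes into a Base64 string that
// could travel as a plain string connection attribute, then decode it back on
// the receiving side. Names here are illustrative, not Ignite API.
public class CertAttributeCodec {
    /** Encodes certificate bytes into a Base64 attribute value. */
    public static String encode(byte[] certBytes) {
        return Base64.getEncoder().encodeToString(certBytes);
    }

    /** Decodes the attribute value back into certificate bytes. */
    public static byte[] decode(String attrValue) {
        return Base64.getDecoder().decode(attrValue);
    }

    public static void main(String[] args) {
        // A stand-in for real DER/PEM certificate bytes.
        byte[] cert = "-----BEGIN CERTIFICATE-----".getBytes(StandardCharsets.UTF_8);
        String attr = encode(cert);
        System.out.println(new String(decode(attr), StandardCharsets.UTF_8));
    }
}
```

Since the attribute value is plain text after encoding, it can be carried by any string-keyed property map a driver already supports.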
[jira] [Assigned] (IGNITE-11797) Fix consistency issues for atomic and mixed tx-atomic cache groups.
[ https://issues.apache.org/jira/browse/IGNITE-11797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexei Scherbakov reassigned IGNITE-11797: -- Assignee: Alexei Scherbakov > Fix consistency issues for atomic and mixed tx-atomic cache groups. > --- > > Key: IGNITE-11797 > URL: https://issues.apache.org/jira/browse/IGNITE-11797 > Project: Ignite > Issue Type: Bug >Reporter: Alexei Scherbakov >Assignee: Alexei Scherbakov >Priority: Major > > IGNITE-10078 only solves consistency problems for tx mode. > For atomic caches the rebalance consistency issues still remain and should be > fixed together with improvement of atomic cache protocol consistency. > Also, need to disable dynamic start of an atomic cache in a group having only tx > caches because it does not work in its current state. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (IGNITE-12049) Allow custom authenticators to use SSL certificates
[ https://issues.apache.org/jira/browse/IGNITE-12049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976630#comment-16976630 ] Alexei Scherbakov commented on IGNITE-12049: [~SomeFire] 1. A user can put any value into node attributes: any number of certificates, etc. I still do not see the importance of the proposed change, because this can already be done for normal clients by passing certificate(s) through node attributes. Besides, thin clients do not have node attributes at all, and putting only a certificate into the map looks hacky. 3. TestSslSecurityProcessor does nothing besides checking that a certificate exists. I think a more realistic example with a description would be useful for anyone who wishes to use the feature and would make it more valuable for the community. > Allow custom authenticators to use SSL certificates > --- > > Key: IGNITE-12049 > URL: https://issues.apache.org/jira/browse/IGNITE-12049 > Project: Ignite > Issue Type: Improvement >Reporter: Ryabov Dmitrii >Assignee: Ryabov Dmitrii >Priority: Minor > Time Spent: 1.5h > Remaining Estimate: 0h > > Add SSL certificates to AuthenticationContext, so, authenticators can make > additional checks based on SSL certificates. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (IGNITE-12049) Allow custom authenticators to use SSL certificates
[ https://issues.apache.org/jira/browse/IGNITE-12049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16967400#comment-16967400 ] Alexei Scherbakov commented on IGNITE-12049: [~SomeFire] I left comments on the PR, please address them. Some general questions: 1. For "normal" cluster nodes, attributes are already available via ClusterNode.attributes(), and a user can set any attribute and use it in a custom authenticator without any changes in the core by implementing [1]. Do I understand correctly that the fix is only relevant for thin clients, which are authenticated using [2] and have no associated local attributes? Shouldn't we instead provide the ability for thin clients to have attributes and avoid changing IgniteConfiguration? 2. Why is the new attribute not available during authentication for JDBC/ODBC client types? 3. Can you create an example of using a custom authenticator with certificates? [1] org.apache.ignite.internal.processors.security.GridSecurityProcessor#authenticateNode [2] org.apache.ignite.internal.processors.security.GridSecurityProcessor#authenticate > Allow custom authenticators to use SSL certificates > --- > > Key: IGNITE-12049 > URL: https://issues.apache.org/jira/browse/IGNITE-12049 > Project: Ignite > Issue Type: Improvement >Reporter: Ryabov Dmitrii >Assignee: Ryabov Dmitrii >Priority: Minor > Time Spent: 1h 10m > Remaining Estimate: 0h > > Add SSL certificates to AuthenticationContext, so, authenticators can make > additional checks based on SSL certificates. -- This message was sent by Atlassian Jira (v8.3.4#803005)
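As an illustration of the attribute-based alternative from point 1, a node could publish a certificate fingerprint as a plain user attribute and a custom authenticator could check it against an allow-list. The `CertFingerprint` class below is a hypothetical stdlib-only sketch, not Ignite code; only the SHA-256 hashing itself is real JDK API.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch: compute a SHA-256 fingerprint of (stand-in) certificate
// bytes, formatted as lowercase hex. The resulting string is small and plain,
// so it fits naturally into a string-valued user attribute map.
public class CertFingerprint {
    public static String sha256Hex(byte[] certBytes) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(certBytes);
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest)
                sb.append(String.format("%02x", b));
            return sb.toString();
        }
        catch (NoSuchAlgorithmException e) {
            // SHA-256 is guaranteed to be present in every JDK.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] cert = "stand-in-certificate-bytes".getBytes(java.nio.charset.StandardCharsets.UTF_8);
        System.out.println(sha256Hex(cert));
    }
}
```

In this scheme the authenticator never needs the certificate object itself, only the fingerprint string, which is why plain attributes suffice for "normal" nodes.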
[jira] [Updated] (IGNITE-12049) Allow custom authenticators to use SSL certificates
[ https://issues.apache.org/jira/browse/IGNITE-12049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexei Scherbakov updated IGNITE-12049: --- Reviewer: Alexei Scherbakov > Allow custom authenticators to use SSL certificates > --- > > Key: IGNITE-12049 > URL: https://issues.apache.org/jira/browse/IGNITE-12049 > Project: Ignite > Issue Type: Improvement >Reporter: Ryabov Dmitrii >Assignee: Ryabov Dmitrii >Priority: Minor > Time Spent: 50m > Remaining Estimate: 0h > > Add SSL certificates to AuthenticationContext, so, authenticators can make > additional checks based on SSL certificates. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (IGNITE-12329) Invalid handling of remote entries causes partition desync and transaction hanging in COMMITTING state.
[ https://issues.apache.org/jira/browse/IGNITE-12329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16963904#comment-16963904 ] Alexei Scherbakov commented on IGNITE-12329: Fixed. > Invalid handling of remote entries causes partition desync and transaction > hanging in COMMITTING state. > --- > > Key: IGNITE-12329 > URL: https://issues.apache.org/jira/browse/IGNITE-12329 > Project: Ignite > Issue Type: Bug >Affects Versions: 2.7.6 >Reporter: Alexei Scherbakov >Assignee: Alexei Scherbakov >Priority: Major > Fix For: 2.8 > > Time Spent: 20m > Remaining Estimate: 0h > > This can happen if transaction is mapped on a partition which is about to be > evicted on backup. > Due to bugs entry belonging to other cache may be excluded from commit or > entry containing a lock can be removed without lock release causes depending > transactions to hang. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (IGNITE-12329) Invalid handling of remote entries causes partition desync and transaction hanging in COMMITTING state.
[ https://issues.apache.org/jira/browse/IGNITE-12329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexei Scherbakov updated IGNITE-12329: --- Description: This can happen if transaction is mapped on a partition which is about to be evicted on backup. Due to bugs entry belonging to other cache may be excluded from commit or entry containing a lock can be removed without lock release causing depending transactions to hang. was: This can happen if transaction is mapped on a partition which is about to be evicted on backup. Due to bugs entry belonging to other cache may be excluded from commit or entry containing a lock can be removed without lock release causes depending transactions to hang. > Invalid handling of remote entries causes partition desync and transaction > hanging in COMMITTING state. > --- > > Key: IGNITE-12329 > URL: https://issues.apache.org/jira/browse/IGNITE-12329 > Project: Ignite > Issue Type: Bug >Affects Versions: 2.7.6 >Reporter: Alexei Scherbakov >Assignee: Alexei Scherbakov >Priority: Major > Fix For: 2.8 > > Time Spent: 20m > Remaining Estimate: 0h > > This can happen if transaction is mapped on a partition which is about to be > evicted on backup. > Due to bugs entry belonging to other cache may be excluded from commit or > entry containing a lock can be removed without lock release causing depending > transactions to hang. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (IGNITE-12329) Invalid handling of remote entries causes partition desync and transaction hanging in COMMITTING state.
[ https://issues.apache.org/jira/browse/IGNITE-12329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962306#comment-16962306 ] Alexei Scherbakov commented on IGNITE-12329: The contribution also includes fix for GridDhtLocalPartition equals and hashCode. [~ivan.glukos] Ready for review. > Invalid handling of remote entries causes partition desync and transaction > hanging in COMMITTING state. > --- > > Key: IGNITE-12329 > URL: https://issues.apache.org/jira/browse/IGNITE-12329 > Project: Ignite > Issue Type: Bug >Affects Versions: 2.7.6 >Reporter: Alexei Scherbakov >Assignee: Alexei Scherbakov >Priority: Major > Fix For: 2.8 > > Time Spent: 10m > Remaining Estimate: 0h > > This can happen if transaction is mapped on a partition which is about to be > evicted on backup. > Due to bugs entry belonging to other cache may be excluded from commit or > entry containing a lock can be removed without lock release causes depending > transactions to hang. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (IGNITE-12328) IgniteException "Failed to resolve nodes topology" during cache.removeAll() and constantly changing topology
[ https://issues.apache.org/jira/browse/IGNITE-12328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961819#comment-16961819 ] Alexei Scherbakov commented on IGNITE-12328: The contribution also includes fixes: 1. pessimistic tx lock request processing over an incomplete topology. 2. atomic cache is remapped on a compatible topology. [~irakov] Ready for review. > IgniteException "Failed to resolve nodes topology" during cache.removeAll() > and constantly changing topology > > > Key: IGNITE-12328 > URL: https://issues.apache.org/jira/browse/IGNITE-12328 > Project: Ignite > Issue Type: Bug >Affects Versions: 2.7.6 >Reporter: Alexei Scherbakov >Assignee: Alexei Scherbakov >Priority: Major > Fix For: 2.8 > > Time Spent: 10m > Remaining Estimate: 0h > > {noformat} > [2019-09-25 13:13:58,339][ERROR][TxThread-threadNum-3] Failed to complete > transaction. > org.apache.ignite.IgniteException: Failed to resolve nodes topology > [cacheGrp=cache_group_36, topVer=AffinityTopologyVersion [topVer=16, > minorTopVer=0], history=[AffinityTopologyVersion [topVer=13, minorTopVer=0], > AffinityTopologyVersion [topVer=14, minorTopVer=0], AffinityTopologyVersion > [topVer=15, minorTopVer=0]], snap=Snapshot [topVer=AffinityTopologyVersion > [topVer=15, minorTopVer=0]], locNode=TcpDiscoveryNode > [id=6cbf7666-9a8c-4b61-8b3f-6351ef44bd4a, > consistentId=poc-tester-client-172.25.1.21-id-0, addrs=ArrayList > [172.25.1.21], sockAddrs=HashSet [lab21.gridgain.local/172.25.1.21:0], > discPort=0, order=13, intOrder=0, lastExchangeTime=1569406379934, loc=true, > ver=2.5.10#20190922-sha1:02133315, isClient=true]] > at > org.apache.ignite.internal.managers.discovery.GridDiscoveryManager.resolveDiscoCache(GridDiscoveryManager.java:2125) > ~[ignite-core-2.5.10.jar:2.5.10] > at > org.apache.ignite.internal.managers.discovery.GridDiscoveryManager.cacheGroupAffinityNodes(GridDiscoveryManager.java:2007) > ~[ignite-core-2.5.10.jar:2.5.10] > at > 
org.apache.ignite.internal.processors.cache.GridCacheUtils.affinityNodes(GridCacheUtils.java:465) > ~[ignite-core-2.5.10.jar:2.5.10] > at > org.apache.ignite.internal.processors.cache.distributed.dht.colocated.GridDhtColocatedLockFuture.map0(GridDhtColocatedLockFuture.java:939) > ~[ignite-core-2.5.10.jar:2.5.10] > at > org.apache.ignite.internal.processors.cache.distributed.dht.colocated.GridDhtColocatedLockFuture.map(GridDhtColocatedLockFuture.java:911) > ~[ignite-core-2.5.10.jar:2.5.10] > at > org.apache.ignite.internal.processors.cache.distributed.dht.colocated.GridDhtColocatedLockFuture.map(GridDhtColocatedLockFuture.java:811) > ~[ignite-core-2.5.10.jar:2.5.10] > at > org.apache.ignite.internal.processors.cache.distributed.dht.colocated.GridDhtColocatedCache.lockAllAsync(GridDhtColocatedCache.java:656) > ~[ignite-core-2.5.10.jar:2.5.10] > at > org.apache.ignite.internal.processors.cache.distributed.GridDistributedCacheAdapter.txLockAsync(GridDistributedCacheAdapter.java:109) > ~[ignite-core-2.5.10.jar:2.5.10] > at > org.apache.ignite.internal.processors.cache.distributed.near.GridNearTxLocal.removeAllAsync0(GridNearTxLocal.java:1648) > ~[ignite-core-2.5.10.jar:2.5.10] > at > org.apache.ignite.internal.processors.cache.distributed.near.GridNearTxLocal.removeAllAsync(GridNearTxLocal.java:521) > ~[ignite-core-2.5.10.jar:2.5.10] > at > org.apache.ignite.internal.processors.cache.GridCacheAdapter$33.inOp(GridCacheAdapter.java:2619) > ~[ignite-core-2.5.10.jar:2.5.10] > at > org.apache.ignite.internal.processors.cache.GridCacheAdapter$SyncInOp.op(GridCacheAdapter.java:4701) > ~[ignite-core-2.5.10.jar:2.5.10] > at > org.apache.ignite.internal.processors.cache.GridCacheAdapter.syncOp(GridCacheAdapter.java:3780) > ~[ignite-core-2.5.10.jar:2.5.10] > at > org.apache.ignite.internal.processors.cache.GridCacheAdapter.removeAll0(GridCacheAdapter.java:2617) > ~[ignite-core-2.5.10.jar:2.5.10] > at > 
org.apache.ignite.internal.processors.cache.GridCacheAdapter.removeAll(GridCacheAdapter.java:2606) > ~[ignite-core-2.5.10.jar:2.5.10] > at > org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.removeAll(IgniteCacheProxyImpl.java:1553) > ~[ignite-core-2.5.10.jar:2.5.10] > at > org.apache.ignite.internal.processors.cache.GatewayProtectedCacheProxy.removeAll(GatewayProtectedCacheProxy.java:1026) > ~[ignite-core-2.5.10.jar:2.5.10] > at > org.apache.ignite.scenario.TxBalanceTask$TxBody.doTxRemoveAll(TxBalanceTask.java:291) > ~[poc-tester-0.1.0-SNAPSHOT.jar:?] > at > org.apache.ignite.scenario.TxBalanceTask$TxBody.call(TxBalanceTask.java:93) >
[jira] [Updated] (IGNITE-12317) Add EvictionFilter factory support in IgniteConfiguration.
[ https://issues.apache.org/jira/browse/IGNITE-12317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexei Scherbakov updated IGNITE-12317: --- Fix Version/s: (was: 2.9) 2.8 > Add EvictionFilter factory support in IgniteConfiguration. > -- > > Key: IGNITE-12317 > URL: https://issues.apache.org/jira/browse/IGNITE-12317 > Project: Ignite > Issue Type: Sub-task > Components: cache >Reporter: Nikolai Kulagin >Assignee: Nikolai Kulagin >Priority: Major > Fix For: 2.8 > > Time Spent: 10m > Remaining Estimate: 0h > > Some entities on cache configuration are configured via factories, while > others are set directly, for example, eviction policy and eviction filter. > Need to add new configuration properties for eviction filter factory and > deprecate old ones (do not remove for compatibility). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (IGNITE-12332) Fix flaky test GridCacheAtomicClientInvalidPartitionHandlingSelfTest#testPrimaryFullAsync
Alexei Scherbakov created IGNITE-12332: -- Summary: Fix flaky test GridCacheAtomicClientInvalidPartitionHandlingSelfTest#testPrimaryFullAsync Key: IGNITE-12332 URL: https://issues.apache.org/jira/browse/IGNITE-12332 Project: Ignite Issue Type: Bug Affects Versions: 2.7.6 Reporter: Alexei Scherbakov Fix For: 2.8 Can be reproduced locally with range = 10_000 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (IGNITE-12264) Private application data should not be lit in the logs, exceptions, ERROR, WARN etc.
[ https://issues.apache.org/jira/browse/IGNITE-12264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961095#comment-16961095 ] Alexei Scherbakov commented on IGNITE-12264: [~KPushenko] This feature has existed for 3 years [1]. Have you tried enabling it using -DIGNITE_TO_STRING_INCLUDE_SENSITIVE=false ? [1] https://issues.apache.org/jira/browse/IGNITE-4167 > Private application data should not be lit in the logs, exceptions, ERROR, > WARN etc. > > > Key: IGNITE-12264 > URL: https://issues.apache.org/jira/browse/IGNITE-12264 > Project: Ignite > Issue Type: Improvement >Affects Versions: 2.7.6 >Reporter: Pushenko Kirill >Priority: Major > > Private application data should not be lit in the logs, exceptions, ERROR, > WARN etc. > The exceptions contained a value that included card numbers. -- This message was sent by Atlassian Jira (v8.3.4#803005)
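The flag mentioned in the comment is an ordinary JVM system property, so besides passing `-DIGNITE_TO_STRING_INCLUDE_SENSITIVE=false` on the command line it can be set programmatically before the node starts. The sketch below demonstrates only the property mechanics; the `SensitiveToggle` class is illustrative, and the assumption that Ignite treats an absent property as `true` is hedged in the comments.

```java
// Illustrative sketch of the system-property mechanics only. In a real
// deployment the property must be set before Ignition.start() is called
// (or passed as -DIGNITE_TO_STRING_INCLUDE_SENSITIVE=false on the JVM
// command line); no Ignite node is started here.
public class SensitiveToggle {
    static final String PROP = "IGNITE_TO_STRING_INCLUDE_SENSITIVE";

    /** Disables sensitive values in toString() output of Ignite internals. */
    public static void disableSensitiveOutput() {
        System.setProperty(PROP, "false");
    }

    /** Reads the flag back, assuming (as Ignite does) a default of "true". */
    public static boolean sensitiveIncluded() {
        return Boolean.parseBoolean(System.getProperty(PROP, "true"));
    }

    public static void main(String[] args) {
        disableSensitiveOutput();
        System.out.println(sensitiveIncluded()); // prints false after the toggle
    }
}
```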
[jira] [Created] (IGNITE-12329) Invalid handling of remote entries causes partition desync and transaction hanging in COMMITTING state.
Alexei Scherbakov created IGNITE-12329: -- Summary: Invalid handling of remote entries causes partition desync and transaction hanging in COMMITTING state. Key: IGNITE-12329 URL: https://issues.apache.org/jira/browse/IGNITE-12329 Project: Ignite Issue Type: Bug Affects Versions: 2.7.6 Reporter: Alexei Scherbakov Assignee: Alexei Scherbakov Fix For: 2.8 This can happen if transaction is mapped on a partition which is about to be evicted on backup. Due to bugs entry belonging to other cache may be excluded from commit or entry containing a lock can be removed without lock release causes depending transactions to hang. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (IGNITE-12328) IgniteException "Failed to resolve nodes topology" during cache.removeAll() and constantly changing topology
[ https://issues.apache.org/jira/browse/IGNITE-12328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960816#comment-16960816 ] Alexei Scherbakov commented on IGNITE-12328: Step-by-step reproduction: * Start a tx from a client node (A). This client node sees topology version X. * Start another client node (B) to increment the topology version on the server nodes. After that, the topology version on the server nodes is X + 1. * The discovery event of node B's join should be delivered to A with a delay. * Perform tx put (1,1) from A. Node A sees topology version X, while the server nodes see topology version X + 1. * This put results in node A being told that the tx should be remapped to version X + 1. * The topology version for this tx on node A is set to X + 1, while node A still hasn't received the discovery event for node B's join (version X + 1). * Perform tx remove (1) from A. IMPORTANT: we should use a key that is already used in the transaction; otherwise the tx will wait for affinity version X + 1. * This tx remove results in the assertion mentioned in the ticket, because A doesn't yet see the discovery event of X + 1 or the exchange future corresponding to it, and tries to get the discovery cache of X + 1 that doesn't exist on A yet. > IgniteException "Failed to resolve nodes topology" during cache.removeAll() > and constantly changing topology > > > Key: IGNITE-12328 > URL: https://issues.apache.org/jira/browse/IGNITE-12328 > Project: Ignite > Issue Type: Bug >Affects Versions: 2.7.6 >Reporter: Alexei Scherbakov >Assignee: Alexei Scherbakov >Priority: Major > Fix For: 2.8 > > > {noformat} > [2019-09-25 13:13:58,339][ERROR][TxThread-threadNum-3] Failed to complete > transaction. 
> org.apache.ignite.IgniteException: Failed to resolve nodes topology > [cacheGrp=cache_group_36, topVer=AffinityTopologyVersion [topVer=16, > minorTopVer=0], history=[AffinityTopologyVersion [topVer=13, minorTopVer=0], > AffinityTopologyVersion [topVer=14, minorTopVer=0], AffinityTopologyVersion > [topVer=15, minorTopVer=0]], snap=Snapshot [topVer=AffinityTopologyVersion > [topVer=15, minorTopVer=0]], locNode=TcpDiscoveryNode > [id=6cbf7666-9a8c-4b61-8b3f-6351ef44bd4a, > consistentId=poc-tester-client-172.25.1.21-id-0, addrs=ArrayList > [172.25.1.21], sockAddrs=HashSet [lab21.gridgain.local/172.25.1.21:0], > discPort=0, order=13, intOrder=0, lastExchangeTime=1569406379934, loc=true, > ver=2.5.10#20190922-sha1:02133315, isClient=true]] > at > org.apache.ignite.internal.managers.discovery.GridDiscoveryManager.resolveDiscoCache(GridDiscoveryManager.java:2125) > ~[ignite-core-2.5.10.jar:2.5.10] > at > org.apache.ignite.internal.managers.discovery.GridDiscoveryManager.cacheGroupAffinityNodes(GridDiscoveryManager.java:2007) > ~[ignite-core-2.5.10.jar:2.5.10] > at > org.apache.ignite.internal.processors.cache.GridCacheUtils.affinityNodes(GridCacheUtils.java:465) > ~[ignite-core-2.5.10.jar:2.5.10] > at > org.apache.ignite.internal.processors.cache.distributed.dht.colocated.GridDhtColocatedLockFuture.map0(GridDhtColocatedLockFuture.java:939) > ~[ignite-core-2.5.10.jar:2.5.10] > at > org.apache.ignite.internal.processors.cache.distributed.dht.colocated.GridDhtColocatedLockFuture.map(GridDhtColocatedLockFuture.java:911) > ~[ignite-core-2.5.10.jar:2.5.10] > at > org.apache.ignite.internal.processors.cache.distributed.dht.colocated.GridDhtColocatedLockFuture.map(GridDhtColocatedLockFuture.java:811) > ~[ignite-core-2.5.10.jar:2.5.10] > at > org.apache.ignite.internal.processors.cache.distributed.dht.colocated.GridDhtColocatedCache.lockAllAsync(GridDhtColocatedCache.java:656) > ~[ignite-core-2.5.10.jar:2.5.10] > at > 
org.apache.ignite.internal.processors.cache.distributed.GridDistributedCacheAdapter.txLockAsync(GridDistributedCacheAdapter.java:109) > ~[ignite-core-2.5.10.jar:2.5.10] > at > org.apache.ignite.internal.processors.cache.distributed.near.GridNearTxLocal.removeAllAsync0(GridNearTxLocal.java:1648) > ~[ignite-core-2.5.10.jar:2.5.10] > at > org.apache.ignite.internal.processors.cache.distributed.near.GridNearTxLocal.removeAllAsync(GridNearTxLocal.java:521) > ~[ignite-core-2.5.10.jar:2.5.10] > at > org.apache.ignite.internal.processors.cache.GridCacheAdapter$33.inOp(GridCacheAdapter.java:2619) > ~[ignite-core-2.5.10.jar:2.5.10] > at > org.apache.ignite.internal.processors.cache.GridCacheAdapter$SyncInOp.op(GridCacheAdapter.java:4701) > ~[ignite-core-2.5.10.jar:2.5.10] > at > org.apache.ignite.internal.processors.cache.GridCacheAdapter.syncOp(GridCacheAdapter.java:3780) > ~[ignite-core-2.5.10.jar:2.5.10] > at >
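The failing lookup in the last reproduction step above can be modeled in a few lines of plain Java. This is purely an illustration of the race, not Ignite code: the `TopologyHistory` class, its methods, and the version numbers mimic (but are not) the behavior of GridDiscoveryManager.resolveDiscoCache.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration of the race: the client keeps discovery data per topology
// version; the servers remap the tx to version X + 1 before the client has
// received the discovery event for X + 1, so the client's lookup fails.
public class TopologyHistory {
    private final Map<Integer, String> discoCacheByVer = new HashMap<>();

    /** Called when a discovery event for the given topology version arrives. */
    void onDiscoveryEvent(int topVer, String discoData) {
        discoCacheByVer.put(topVer, discoData);
    }

    /** Mimics a resolveDiscoCache-style lookup for a topology version. */
    String resolveDiscoCache(int topVer) {
        String data = discoCacheByVer.get(topVer);
        if (data == null)
            throw new IllegalStateException("Failed to resolve nodes topology [topVer=" + topVer + "]");
        return data;
    }

    public static void main(String[] args) {
        TopologyHistory client = new TopologyHistory();
        client.onDiscoveryEvent(15, "topology X");   // client has seen version X
        int remappedVer = 16;                        // servers told the tx to remap to X + 1
        try {
            client.resolveDiscoCache(remappedVer);   // event for X + 1 not delivered yet
        }
        catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The fix direction described in the ticket amounts to never looking up a version the local node has not yet observed.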
[jira] [Created] (IGNITE-12328) IgniteException "Failed to resolve nodes topology" during cache.removeAll() and constantly changing topology
Alexei Scherbakov created IGNITE-12328: -- Summary: IgniteException "Failed to resolve nodes topology" during cache.removeAll() and constantly changing topology Key: IGNITE-12328 URL: https://issues.apache.org/jira/browse/IGNITE-12328 Project: Ignite Issue Type: Bug Affects Versions: 2.7.6 Reporter: Alexei Scherbakov Assignee: Alexei Scherbakov Fix For: 2.8 {noformat} [2019-09-25 13:13:58,339][ERROR][TxThread-threadNum-3] Failed to complete transaction. org.apache.ignite.IgniteException: Failed to resolve nodes topology [cacheGrp=cache_group_36, topVer=AffinityTopologyVersion [topVer=16, minorTopVer=0], history=[AffinityTopologyVersion [topVer=13, minorTopVer=0], AffinityTopologyVersion [topVer=14, minorTopVer=0], AffinityTopologyVersion [topVer=15, minorTopVer=0]], snap=Snapshot [topVer=AffinityTopologyVersion [topVer=15, minorTopVer=0]], locNode=TcpDiscoveryNode [id=6cbf7666-9a8c-4b61-8b3f-6351ef44bd4a, consistentId=poc-tester-client-172.25.1.21-id-0, addrs=ArrayList [172.25.1.21], sockAddrs=HashSet [lab21.gridgain.local/172.25.1.21:0], discPort=0, order=13, intOrder=0, lastExchangeTime=1569406379934, loc=true, ver=2.5.10#20190922-sha1:02133315, isClient=true]] at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager.resolveDiscoCache(GridDiscoveryManager.java:2125) ~[ignite-core-2.5.10.jar:2.5.10] at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager.cacheGroupAffinityNodes(GridDiscoveryManager.java:2007) ~[ignite-core-2.5.10.jar:2.5.10] at org.apache.ignite.internal.processors.cache.GridCacheUtils.affinityNodes(GridCacheUtils.java:465) ~[ignite-core-2.5.10.jar:2.5.10] at org.apache.ignite.internal.processors.cache.distributed.dht.colocated.GridDhtColocatedLockFuture.map0(GridDhtColocatedLockFuture.java:939) ~[ignite-core-2.5.10.jar:2.5.10] at org.apache.ignite.internal.processors.cache.distributed.dht.colocated.GridDhtColocatedLockFuture.map(GridDhtColocatedLockFuture.java:911) ~[ignite-core-2.5.10.jar:2.5.10] at 
org.apache.ignite.internal.processors.cache.distributed.dht.colocated.GridDhtColocatedLockFuture.map(GridDhtColocatedLockFuture.java:811) ~[ignite-core-2.5.10.jar:2.5.10] at org.apache.ignite.internal.processors.cache.distributed.dht.colocated.GridDhtColocatedCache.lockAllAsync(GridDhtColocatedCache.java:656) ~[ignite-core-2.5.10.jar:2.5.10] at org.apache.ignite.internal.processors.cache.distributed.GridDistributedCacheAdapter.txLockAsync(GridDistributedCacheAdapter.java:109) ~[ignite-core-2.5.10.jar:2.5.10] at org.apache.ignite.internal.processors.cache.distributed.near.GridNearTxLocal.removeAllAsync0(GridNearTxLocal.java:1648) ~[ignite-core-2.5.10.jar:2.5.10] at org.apache.ignite.internal.processors.cache.distributed.near.GridNearTxLocal.removeAllAsync(GridNearTxLocal.java:521) ~[ignite-core-2.5.10.jar:2.5.10] at org.apache.ignite.internal.processors.cache.GridCacheAdapter$33.inOp(GridCacheAdapter.java:2619) ~[ignite-core-2.5.10.jar:2.5.10] at org.apache.ignite.internal.processors.cache.GridCacheAdapter$SyncInOp.op(GridCacheAdapter.java:4701) ~[ignite-core-2.5.10.jar:2.5.10] at org.apache.ignite.internal.processors.cache.GridCacheAdapter.syncOp(GridCacheAdapter.java:3780) ~[ignite-core-2.5.10.jar:2.5.10] at org.apache.ignite.internal.processors.cache.GridCacheAdapter.removeAll0(GridCacheAdapter.java:2617) ~[ignite-core-2.5.10.jar:2.5.10] at org.apache.ignite.internal.processors.cache.GridCacheAdapter.removeAll(GridCacheAdapter.java:2606) ~[ignite-core-2.5.10.jar:2.5.10] at org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.removeAll(IgniteCacheProxyImpl.java:1553) ~[ignite-core-2.5.10.jar:2.5.10] at org.apache.ignite.internal.processors.cache.GatewayProtectedCacheProxy.removeAll(GatewayProtectedCacheProxy.java:1026) ~[ignite-core-2.5.10.jar:2.5.10] at org.apache.ignite.scenario.TxBalanceTask$TxBody.doTxRemoveAll(TxBalanceTask.java:291) ~[poc-tester-0.1.0-SNAPSHOT.jar:?] 
at org.apache.ignite.scenario.TxBalanceTask$TxBody.call(TxBalanceTask.java:93) ~[poc-tester-0.1.0-SNAPSHOT.jar:?] at org.apache.ignite.scenario.TxBalanceTask$TxBody.call(TxBalanceTask.java:70) ~[poc-tester-0.1.0-SNAPSHOT.jar:?] at org.apache.ignite.scenario.internal.AbstractTxTask.doInTransaction(AbstractTxTask.java:290) ~[poc-tester-0.1.0-SNAPSHOT.jar:?] at org.apache.ignite.scenario.internal.AbstractTxTask.access$400(AbstractTxTask.java:56) ~[poc-tester-0.1.0-SNAPSHOT.jar:?] at org.apache.ignite.scenario.internal.AbstractTxTask$TxRunner.call(AbstractTxTask.java:470) [poc-tester-0.1.0-SNAPSHOT.jar:?] at
[jira] [Resolved] (IGNITE-12327) Cross-cache tx is mapped on wrong primary when enlisted caches have incompatible assignments.
[ https://issues.apache.org/jira/browse/IGNITE-12327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexei Scherbakov resolved IGNITE-12327. Resolution: Won't Fix Already fixed in IGNITE-12038 > Cross-cache tx is mapped on wrong primary when enlisted caches have > incompatible assignments. > - > > Key: IGNITE-12327 > URL: https://issues.apache.org/jira/browse/IGNITE-12327 > Project: Ignite > Issue Type: Bug >Affects Versions: 2.7.6 >Reporter: Alexei Scherbakov >Assignee: Alexei Scherbakov >Priority: Major > Fix For: 2.8 > > > This is happening when supplier node is left while rebalancing is partially > completed on demander. > Suppose we have 2 cache groups, rebalance is in progress and for first group > rebalance is done and for second group rebalance is partially done (some > partitions are still MOVING). > At this moment supplier node dies and corresponding topology version is (N,0). > New assignment is computed using current state of partitions and for first > group will be ideal and the same as for next topology (N,1) which will be > triggered after all rebalancing is completed by CacheAffinityChangeMessage. > For second group affinity will not be ideal. > If transaction is started while PME is in progress (N, 0)->(N,1), first lock > request will pass remap check if it enlists the rebalanced group. All > subsequent lock requests will use invalid topology from previous assignment. > Possible fix: return actual locked topology version from first lock request > and use it for all subsequent requests. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (IGNITE-12327) Cross-cache tx is mapped on wrong primary when enlisted caches have incompatible assignments.
Alexei Scherbakov created IGNITE-12327: -- Summary: Cross-cache tx is mapped on wrong primary when enlisted caches have incompatible assignments. Key: IGNITE-12327 URL: https://issues.apache.org/jira/browse/IGNITE-12327 Project: Ignite Issue Type: Bug Affects Versions: 2.7.6 Reporter: Alexei Scherbakov Assignee: Alexei Scherbakov Fix For: 2.8 This is happening when supplier node is left while rebalancing is partially completed on demander. Suppose we have 2 cache groups, rebalance is in progress and for first group rebalance is done and for second group rebalance is partially done (some partitions are still MOVING). At this moment supplier node dies and corresponding topology version is (N,0). New assignment is computed using current state of partitions and for first group will be ideal and the same as for next topology (N,1) which will be triggered after all rebalancing is completed by CacheAffinityChangeMessage. For second group affinity will not be ideal. If transaction is started while PME is in progress (N, 0)->(N,1), first lock request will pass remap check if it enlists the rebalanced group. All subsequent lock requests will use invalid topology from previous assignment. Possible fix: return actual locked topology version from first lock request and use it for all subsequent requests. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (IGNITE-11704) Write tombstones during rebalance to get rid of deferred delete buffer
[ https://issues.apache.org/jira/browse/IGNITE-11704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16953592#comment-16953592 ] Alexei Scherbakov commented on IGNITE-11704: [~jokser] Looks good. > Write tombstones during rebalance to get rid of deferred delete buffer > -- > > Key: IGNITE-11704 > URL: https://issues.apache.org/jira/browse/IGNITE-11704 > Project: Ignite > Issue Type: Improvement >Reporter: Alexey Goncharuk >Assignee: Pavel Kovalenko >Priority: Major > Labels: rebalance > Fix For: 2.8 > > Time Spent: 10m > Remaining Estimate: 0h > > Currently Ignite relies on deferred delete buffer in order to handle > write-remove conflicts during rebalance. Given the limit size of the buffer, > this approach is fundamentally flawed, especially in case when persistence is > enabled. > I suggest to extend the logic of data storage to be able to store key > tombstones - to keep version for deleted entries. The tombstones will be > stored when rebalance is in progress and should be cleaned up when rebalance > is completed. > Later this approach may be used to implement fast partition rebalance based > on merkle trees (in this case, tombstones should be written on an incomplete > baseline). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (IGNITE-11704) Write tombstones during rebalance to get rid of deferred delete buffer
[ https://issues.apache.org/jira/browse/IGNITE-11704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16952020#comment-16952020 ] Alexei Scherbakov commented on IGNITE-11704: [~jokser] 4. Typo in the comment. My understanding is that the code will be called when the data streamer initiates the first update for an entry, is that true? 6. * It looks like it's not necessary to preload 256k keys for historical rebalance; you need only one update in each partition. * The test looks similar, but my idea is to delay each batch, remove all keys contained in the batch, then release the batch. Such a scenario should turn all partition keys into tombstones and looks interesting. In other respects it looks good.
[jira] [Comment Edited] (IGNITE-9913) Prevent data updates blocking in case of backup BLT server node leave
[ https://issues.apache.org/jira/browse/IGNITE-9913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16950019#comment-16950019 ] Alexei Scherbakov edited comment on IGNITE-9913 at 10/12/19 12:14 PM: -- [~avinogradov] I've reviewed the changes. They seem to follow the architecture we discussed privately. I left comments in the PR. Besides that, you should definitely add more tests. Important scenarios might be: 1. A baseline node leaves under tx load while rebalancing is in progress (rebalancing is due to another node joining). 2. Owners leave one by one under tx load until a subset of partitions has a single owner. 3. Owners leave one by one under tx load until a subset of partitions has no owner. Validate partition loss. All tests should check partition integrity: see org.apache.ignite.testframework.junits.common.GridCommonAbstractTest#assertPartitionsSame Do you have plans to implement non-blocking mapping for transactions not affected by topology change in the same ticket? Let me know if a personal discussion is required.
> Prevent data updates blocking in case of backup BLT server node leave > - > > Key: IGNITE-9913 > URL: https://issues.apache.org/jira/browse/IGNITE-9913 > Project: Ignite > Issue Type: Improvement > Components: general >Reporter: Ivan Rakov >Assignee: Anton Vinogradov >Priority: Major > Fix For: 2.8 > > Attachments: 9913_yardstick.png, master_yardstick.png > > Time Spent: 6h 50m > Remaining Estimate: 0h > > Ignite cluster performs distributed partition map exchange when any server > node leaves or joins the topology. > Distributed PME blocks all updates and may take a long time. If all > partitions are assigned according to the baseline topology and server node > leaves, there's no actual need to perform distributed PME: every cluster node > is able to recalculate new affinity assigments and partition states locally. > If we'll implement such lightweight PME and handle mapping and lock requests > on new topology version correctly, updates won't be stopped (except updates > of partitions that lost their primary copy). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (IGNITE-12117) Historical rebalance should NOT be processed in striped way
[ https://issues.apache.org/jira/browse/IGNITE-12117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexei Scherbakov reassigned IGNITE-12117: -- Assignee: Alexei Scherbakov > Historical rebalance should NOT be processed in striped way > --- > > Key: IGNITE-12117 > URL: https://issues.apache.org/jira/browse/IGNITE-12117 > Project: Ignite > Issue Type: Task >Reporter: Anton Vinogradov >Assignee: Alexei Scherbakov >Priority: Major > Labels: iep-16 > Fix For: 2.9 > > > The test > {{org.apache.ignite.internal.processors.cache.transactions.TxPartitionCounterStateConsistencyTest#testPartitionConsistencyWithBackupsRestart}} > fails on an attempt to handle historical rebalance using the un-striped pool. > You can reproduce it by replacing > {noformat} > if (historical) // Can not be reordered. > > ctx.kernalContext().getStripedRebalanceExecutorService().execute(r, > Math.abs(nodeId.hashCode())); > {noformat} > with > {noformat} > if (historical) // Can be reordered? > ctx.kernalContext().getRebalanceExecutorService().execute(r); > {noformat} > and you will get the following > {noformat} > java.lang.AssertionError: idle_verify failed on 1 node. 
> idle_verify check has finished, found 7 conflict partitions: > [counterConflicts=0, hashConflicts=7] > Hash conflicts: > Conflict partition: PartitionKeyV2 [grpId=1544803905, grpName=default, > partId=23] > Partition instances: [PartitionHashRecordV2 [isPrimary=false, > consistentId=nodetransactions.TxPartitionCounterStateConsistencyHistoryRebalanceTest1, > updateCntr=707143, partitionState=OWNING, size=495, partHash=-1503789370], > PartitionHashRecordV2 [isPrimary=false, > consistentId=nodetransactions.TxPartitionCounterStateConsistencyHistoryRebalanceTest2, > updateCntr=707143, partitionState=OWNING, size=494, partHash=-1538739200]] > Conflict partition: PartitionKeyV2 [grpId=1544803905, grpName=default, > partId=8] > > {noformat} > So, we need to investigate the reasons and refactor historical rebalance to use the unstriped pool, if possible.
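The difference between the striped and un-striped pools above comes down to ordering: routing every task for the same key (here, the supplier node id) to one single-threaded stripe guarantees supply messages are applied in submission order, while a shared pool may reorder them. A minimal, Ignite-independent sketch of that idea (class and method names are illustrative only):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Minimal striped executor sketch: tasks with the same key always land on the
// same single-threaded stripe, so they cannot be reordered relative to each
// other -- the property the historical rebalance code relies on.
public class StripedExec {
    private final ExecutorService[] stripes;

    public StripedExec(int n) {
        stripes = new ExecutorService[n];
        for (int i = 0; i < n; i++)
            stripes[i] = Executors.newSingleThreadExecutor();
    }

    /** Executes the task on the stripe chosen by the key's hash. */
    public void execute(Object key, Runnable task) {
        // floorMod avoids the negative-hash pitfall of Math.abs(hashCode()).
        stripes[Math.floorMod(key.hashCode(), stripes.length)].execute(task);
    }

    public void shutdown() {
        for (ExecutorService s : stripes) {
            s.shutdown();
            try {
                s.awaitTermination(10, TimeUnit.SECONDS);
            }
            catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}
```

With a shared multi-threaded pool there is no such per-key guarantee, which is consistent with the idle_verify hash conflicts shown above.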
[jira] [Commented] (IGNITE-11704) Write tombstones during rebalance to get rid of deferred delete buffer
[ https://issues.apache.org/jira/browse/IGNITE-11704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16948360#comment-16948360 ] Alexei Scherbakov commented on IGNITE-11704: 6. GridDhtLocalPartition.clearTombstones looks very similar to GridDhtLocalPartition.clearAll. Could we avoid code duplication ?
[jira] [Commented] (IGNITE-11704) Write tombstones during rebalance to get rid of deferred delete buffer
[ https://issues.apache.org/jira/browse/IGNITE-11704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16946592#comment-16946592 ] Alexei Scherbakov commented on IGNITE-11704: 5. I would add one more load test scenario: Start a node, backups=1. Load many keys (like 100k). Join another node, triggering rebalance. Delay each batch. Remove the keys supplied in the batch. Release the batch. Validate that the cache is empty and tombstones are cleared.
[jira] [Comment Edited] (IGNITE-11704) Write tombstones during rebalance to get rid of deferred delete buffer
[ https://issues.apache.org/jira/browse/IGNITE-11704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16946165#comment-16946165 ] Alexei Scherbakov edited comment on IGNITE-11704 at 10/8/19 7:10 AM: - [~jokser] [~sboikov] I've reviewed the changes. Overall it looks good, but I still have some questions. 1. My main concern is the necessity of the 5-byte tombstoneBytes object. It seems possible to implement a tombstone by treating the absence of a value as a tombstone. For example, valLen=0 could be treated as tombstone presence. Doing so, we can get rid of the 5-byte comparison and instead do a null check: {noformat} private Boolean isTombstone(ByteBuffer buf, int offset) { int valLen = buf.getInt(buf.position() + offset); if (valLen != tombstoneBytes.length) return Boolean.FALSE; ... } {noformat} Instead we can do something like {{if (valLen == 0) return Boolean.TRUE}} 2. With the new changes in PartitionsEvictManager it's possible to have two tasks of different types for the same partition. Consider a scenario: * a node finished rebalancing and starts to clear tombstones * another node joins the topology and becomes an owner of the clearing partition * eviction is started for the partition that is already being cleared. Probably this should not be allowed. 3. I see changes having no obvious relation to the contribution, for example: static String cacheGroupMetricsRegistryName(String cacheGrp) DropCacheContextDuringEvictionTest.java GridCommandHandlerIndexingTest.java What's the purpose of these? 4. Could you clarify the change in org.apache.ignite.internal.processors.cache.GridCacheMapEntry#initialValue: update0 |= (!preload && val == null); ?
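The zero-length-value idea from point 1 of the review above can be illustrated with plain JDK {{ByteBuffer}} code. This is only a sketch of a length-prefixed record, not the actual Ignite storage layout, and {{TombstoneCodec}} is a hypothetical name:

```java
import java.nio.ByteBuffer;

// Sketch of the proposal: encode a tombstone as a zero-length value, so
// detecting it is a single length check instead of comparing a 5-byte marker.
public class TombstoneCodec {
    /** Writes a record as a 4-byte value length followed by the value bytes. */
    static ByteBuffer write(byte[] val) {
        ByteBuffer buf = ByteBuffer.allocate(4 + val.length);
        buf.putInt(val.length);
        buf.put(val);
        buf.flip();
        return buf;
    }

    /** A zero-length value means the entry is a tombstone. */
    static boolean isTombstone(ByteBuffer buf, int offset) {
        return buf.getInt(buf.position() + offset) == 0;
    }
}
```

Since no real value can be zero bytes long in this encoding, the check is unambiguous and no marker bytes need to be stored or compared.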
[jira] [Commented] (IGNITE-7083) Reduce memory usage of CachePartitionFullCountersMap
[ https://issues.apache.org/jira/browse/IGNITE-7083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16943621#comment-16943621 ] Alexei Scherbakov commented on IGNITE-7083: --- [~mmuzaf] It looks like the issue is no longer relevant because we have had cache groups for a long time now. The CachePartitionFullCountersMap size is expected to be halved after IGNITE-11794. Closing the issue. > Reduce memory usage of CachePartitionFullCountersMap > > > Key: IGNITE-7083 > URL: https://issues.apache.org/jira/browse/IGNITE-7083 > Project: Ignite > Issue Type: Improvement > Components: cache >Affects Versions: 2.3 > Environment: Any >Reporter: Sunny Chan >Assignee: Alexey Goncharuk >Priority: Major > Fix For: 2.9 > > > The Cache Partition Exchange Manager keeps a copy of the already completed > exchange. However, we have found that it uses a significant amount of memory. > Upon further investigation using a heap dump we have found that a large amount > of memory is used by the CachePartitionFullCountersMap. We have also observed > that in most cases these maps contain only 0s. > Therefore I propose an optimization: initially the long arrays that store the > initial update counter and update counter in the CPFCM will be null, and when > you get a value and see these arrays are null, we will return 0 for the > counter. We only allocate the long arrays when there are any non-zero updates > to the map. > In our tests, the amount of heap used by GridCachePartitionExchangeManager > was around 70MB (67 copies of these CPFCM); after we apply the optimization > it drops to around 9MB.
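The lazy-allocation optimization proposed in the ticket (null counter arrays until a non-zero update arrives, reads of a null array returning 0) can be sketched as follows; `LazyCountersMap` is a hypothetical name, not the actual CPFCM implementation:

```java
// Sketch of the proposed optimization: the counter array stays null (all
// counters implicitly 0) until the first non-zero update forces allocation,
// so the many all-zero map copies cost only a null reference each.
public class LazyCountersMap {
    private final int parts;
    private long[] updCntrs; // null until a non-zero counter is set

    public LazyCountersMap(int parts) {
        this.parts = parts;
    }

    /** Returns 0 for every partition while the array is unallocated. */
    public long updateCounter(int part) {
        return updCntrs == null ? 0 : updCntrs[part];
    }

    /** Allocates the array lazily on the first non-zero update. */
    public void updateCounter(int part, long val) {
        if (val == 0 && updCntrs == null)
            return; // still all zeros, nothing to store

        if (updCntrs == null)
            updCntrs = new long[parts];

        updCntrs[part] = val;
    }
}
```

The same scheme applies to the initial-update-counter array; each array is allocated independently only when it first becomes non-trivial.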
[jira] [Resolved] (IGNITE-7083) Reduce memory usage of CachePartitionFullCountersMap
[ https://issues.apache.org/jira/browse/IGNITE-7083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexei Scherbakov resolved IGNITE-7083. --- Resolution: Won't Fix
[jira] [Commented] (IGNITE-12209) Transaction system view
[ https://issues.apache.org/jira/browse/IGNITE-12209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16941082#comment-16941082 ] Alexei Scherbakov commented on IGNITE-12209: [~nizhikov] Looks good. > Transaction system view > --- > > Key: IGNITE-12209 > URL: https://issues.apache.org/jira/browse/IGNITE-12209 > Project: Ignite > Issue Type: Sub-task >Affects Versions: 2.7.6 >Reporter: Nikolay Izhikov >Assignee: Nikolay Izhikov >Priority: Major > Labels: IEP-35 > Fix For: 2.8 > > Time Spent: 1h 50m > Remaining Estimate: 0h > > IGNITE-12145 finished > We should add transactions to the system views. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (IGNITE-12209) Transaction system view
[ https://issues.apache.org/jira/browse/IGNITE-12209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16940887#comment-16940887 ] Alexei Scherbakov commented on IGNITE-12209: allEntries has multiple implementations. Which one are you talking about? Are you sure all implementations are safe? In fact, both can fail if something changes in the future in the underlying implementation. Writing code on the assumption that this will not happen is bad. I would add a try ... catch block to avoid issues.
[jira] [Commented] (IGNITE-12209) Transaction system view
[ https://issues.apache.org/jira/browse/IGNITE-12209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16940834#comment-16940834 ] Alexei Scherbakov commented on IGNITE-12209: [~nizhikov] There is absolutely no guarantee this will always work. Having a catch block is 100% safe.
[jira] [Comment Edited] (IGNITE-12209) Transaction system view
[ https://issues.apache.org/jira/browse/IGNITE-12209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16940709#comment-16940709 ] Alexei Scherbakov edited comment on IGNITE-12209 at 9/30/19 7:37 AM: - [~nizhikov] Note that org.apache.ignite.internal.processors.cache.transactions.IgniteTxState#allEntries and org.apache.ignite.internal.processors.cache.transactions.IgniteTxState#cacheIds are unsynchronized, and the tx state can be concurrently updated if a transaction enlists keys at the moment the view is produced. So the current implementation is unsafe, though it will probably work somehow. I suggest enclosing the methods in try ... catch (Throwable) to implement a fallback in case something goes wrong.
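The try ... catch (Throwable) fallback suggested above can be sketched with a plain map standing in for the unsynchronized transaction state; `TxViewSnapshot` and `entriesSnapshot` are hypothetical names for illustration:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Sketch of the suggested fallback: the view copies the unsynchronized tx
// entries inside try/catch, so a concurrent enlist at worst drops one row
// instead of failing the whole system view query.
public class TxViewSnapshot {
    /** Best-effort snapshot of entries that another thread may be mutating. */
    static List<String> entriesSnapshot(Map<Integer, String> txEntries) {
        try {
            // May throw (e.g. ConcurrentModificationException) mid-copy.
            return new ArrayList<>(txEntries.values());
        }
        catch (Throwable e) {
            // Tx state changed while we were reading; skip this row.
            return Collections.emptyList();
        }
    }
}
```

The view stays best-effort by design: it never blocks transactions and never propagates an exception caused by a concurrent enlist.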
[jira] [Commented] (IGNITE-12209) Transaction system view
[ https://issues.apache.org/jira/browse/IGNITE-12209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16940018#comment-16940018 ] Alexei Scherbakov commented on IGNITE-12209: [~nizhikov] I do not understand why we couldn't have local views and run distributed queries against them. SQL is great for such analytic tasks. Enlisted cache ids are held in org.apache.ignite.internal.processors.cache.transactions.IgniteTxState#cacheIds. No need to traverse entries. > Transaction system view > --- > > Key: IGNITE-12209 > URL: https://issues.apache.org/jira/browse/IGNITE-12209 > Project: Ignite > Issue Type: Sub-task >Affects Versions: 2.7.6 >Reporter: Nikolay Izhikov >Assignee: Nikolay Izhikov >Priority: Major > Labels: IEP-35 > Fix For: 2.8 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > IGNITE-12145 finished > We should add transactions to the system views.
[jira] [Commented] (IGNITE-12209) Transaction system view
[ https://issues.apache.org/jira/browse/IGNITE-12209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939577#comment-16939577 ] Alexei Scherbakov commented on IGNITE-12209: [~nizhikov] This is sad. Such a feature would allow producing interesting views on transaction snapshots. I don't know the implementation details, but in theory it should work out of the box using the current SQL engine and distributed joins. Do you think it's possible to implement grid-global system views in the future? > Transaction system view > --- > > Key: IGNITE-12209 > URL: https://issues.apache.org/jira/browse/IGNITE-12209 > Project: Ignite > Issue Type: Sub-task >Affects Versions: 2.7.6 >Reporter: Nikolay Izhikov >Assignee: Nikolay Izhikov >Priority: Major > Labels: IEP-35 > Fix For: 2.8 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > IGNITE-12145 finished > We should add transactions to the system views.
[jira] [Commented] (IGNITE-12209) Transaction system view
[ https://issues.apache.org/jira/browse/IGNITE-12209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939567#comment-16939567 ] Alexei Scherbakov commented on IGNITE-12209: [~nizhikov] I think * parent node id (tx.originatingNodeId), * near node id (tx.otherNodeId), * mapped topVer, * duration, * number of currently enlisted keys, * cache ids/names must be added to the output. Note that tx.nodeId is not the originating node but the local node for the tx. You should change the javadoc. Would it be possible to construct the whole distributed transaction using SQL joins (joining by parent and local node)? > Transaction system view > --- > > Key: IGNITE-12209 > URL: https://issues.apache.org/jira/browse/IGNITE-12209 > Project: Ignite > Issue Type: Sub-task >Affects Versions: 2.7.6 >Reporter: Nikolay Izhikov >Assignee: Nikolay Izhikov >Priority: Major > Labels: IEP-35 > Fix For: 2.8 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > IGNITE-12145 finished > We should add transactions to the system views.
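The "join by parent and local node" question above can be illustrated with plain collections. This is a hypothetical sketch: `TxRow`, its fields, and the grouping helper are illustrative stand-ins for the proposed view columns (`originatingNodeId` mirrors the suggested parent node id column), not the actual view schema.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: reconstructing one distributed transaction by
// matching every per-node tx row to its parent via the originating node id,
// the same join the comment proposes doing in SQL.
public class TxJoinSketch {
    static final class TxRow {
        final String localNodeId;
        final String originatingNodeId;

        TxRow(String localNodeId, String originatingNodeId) {
            this.localNodeId = localNodeId;
            this.originatingNodeId = originatingNodeId;
        }
    }

    /** Groups per-node rows into logical transactions, keyed by the
     * originating (near) node id. */
    static Map<String, List<TxRow>> groupByOriginator(List<TxRow> rows) {
        Map<String, List<TxRow>> grouped = new HashMap<>();

        for (TxRow row : rows)
            grouped.computeIfAbsent(row.originatingNodeId, k -> new java.util.ArrayList<>()).add(row);

        return grouped;
    }

    public static void main(String[] args) {
        List<TxRow> rows = List.of(
            new TxRow("near", "near"),  // near (originating) part
            new TxRow("dht1", "near"),  // remote part on node dht1
            new TxRow("dht2", "near")); // remote part on node dht2

        // All three rows belong to the same logical transaction.
        System.out.println(groupByOriginator(rows).get("near").size()); // prints 3
    }
}
```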
[jira] [Comment Edited] (IGNITE-12133) O(log n) partition exchange
[ https://issues.apache.org/jira/browse/IGNITE-12133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16922809#comment-16922809 ] Alexei Scherbakov edited comment on IGNITE-12133 at 9/4/19 8:03 PM: The PME protocol itself doesn't leverage the ring and uses direct node-to-node communication for sending partition maps (except for a special case), but the ring is used by the discovery protocol, which "discovers" topology changes and delivers events to grid nodes, which triggers PME due to topology changes, for example "node left" or "node added". Also, the discovery protocol provides "guaranteed ordered message delivery", which is extensively used by Ignite internals and cannot be replaced easily. Actually, PME consists of three phases: 1. Discovery phase, having O( n ) complexity for the default TcpDiscoverySpi implementation. 2. Topology unlock waiting. 3. PME phase, having k * O(m) complexity, where m is the number of I/O sender threads and k depends on the topology size. So total PME complexity is the sum of 1, 2 and 3. To speed up PME we should improve 1, 2 and 3. How to improve 1 ? Initially the ring was designed for small topologies and still works very well for such cases with default settings. Especially for large topologies, zookeeper-based discovery was introduced, which has better complexity. So, for small topologies I suggest using the defaults. For large topologies zookeeper discovery should be used. How to improve 2 ? This is discussed on the dev list in the lightweight PME topic. How to improve 3 ? For small topologies, same as 1: use the defaults. For large topologies we could use [~mnk]'s proposal and a tree-like message propagation pattern to achieve log(N) complexity. I agree with [~ivan.glukos] on the increased failover complexity, but I think it's doable. NOTE: the same idea could be used for increasing replicated cache performance on large topologies. We have a long-known issue with performance degradation when the topology is large. 
[~Jokser] The Gossip idea looks interesting, but it looks like a complicated change and reinventing the wheel. Why not stick to zookeeper? was (Author: ascherbakov): The PME protocol itself doesn't leverage the ring and uses direct node-to-node communication for sending partition maps (except for a special case), but the ring is used by the discovery protocol, which "discovers" topology changes and delivers events to grid nodes, which triggers PME due to topology changes, for example "node left" or "node added". Also, the discovery protocol provides "guaranteed ordered message delivery", which is extensively used by Ignite internals and cannot be replaced easily. Actually, PME consists of three phases: 1. Discovery phase, having O( n ) complexity for the default TcpDiscoverySpi implementation. 2. Topology unlock waiting (out of this post's scope). 3. PME phase, having k * O(m) complexity, where m is the number of I/O sender threads and k depends on the topology size. So total PME complexity is the sum of 1, 2 and 3. To speed up PME we should improve 1 and 3. How to improve 1 ? Initially the ring was designed for small topologies and still works very well for such cases with default settings. Especially for large topologies, zookeeper-based discovery was introduced, which has better complexity. So, for small topologies I suggest using the defaults. For large topologies zookeeper discovery should be used. How to improve 3 ? For small topologies, same as 1: use the defaults. For large topologies we could use [~mnk]'s proposal and a tree-like message propagation pattern to achieve log(N) complexity. I agree with [~ivan.glukos] on the increased failover complexity, but I think it's doable. NOTE: the same idea could be used for increasing replicated cache performance on large topologies. We have a long-known issue with performance degradation when the topology is large. [~Jokser] The Gossip idea looks interesting, but it looks like a complicated change and reinventing the wheel. Why not stick to zookeeper? 
> O(log n) partition exchange > --- > > Key: IGNITE-12133 > URL: https://issues.apache.org/jira/browse/IGNITE-12133 > Project: Ignite > Issue Type: Improvement >Reporter: Moti Nisenson-Ken >Priority: Major > > Currently, partition exchange leverages a ring. This means that > communications is O\(n) in number of nodes. It also means that if > non-coordinator nodes hang it can take much longer to successfully resolve > the topology. > Instead, why not use something like a skip-list where the coordinator is > first. The coordinator can notify the first node at each level of the > skip-list. Each node then notifies all of its "near-neighbours" in the > skip-list, where node B is a near-neighbour of node-A, if max-level(nodeB) <= > max-level(nodeA), and nodeB is the first node at its level when traversing > from nodeA in the direction of nodeB, skipping
[jira] [Commented] (IGNITE-12133) O(log n) partition exchange
[ https://issues.apache.org/jira/browse/IGNITE-12133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16922809#comment-16922809 ] Alexei Scherbakov commented on IGNITE-12133: The PME protocol itself doesn't leverage the ring and uses direct node-to-node communication for sending partition maps (except for a special case), but the ring is used by the discovery protocol, which "discovers" topology changes and delivers events to grid nodes, which triggers PME due to topology changes, for example "node left" or "node added". Also, the discovery protocol provides "guaranteed ordered message delivery", which is extensively used by Ignite internals and cannot be replaced easily. Actually, PME consists of three phases: 1. Discovery phase, having O(n) complexity for the default TcpDiscoverySpi implementation. 2. Topology unlock waiting (out of this post's scope). 3. PME phase, having k * O(m) complexity, where m is the number of I/O sender threads and k depends on the topology size. So total PME complexity is the sum of 1 and 3. To speed up PME we should improve 1 and 3. How to improve 1 ? Initially the ring was designed for small topologies and still works very well for such cases with default settings. Especially for large topologies, zookeeper-based discovery was introduced, which has better complexity. So, for small topologies I suggest using the defaults. For large topologies zookeeper discovery should be used. How to improve 3 ? For small topologies, same as 1: use the defaults. For large topologies we could use [~mnk]'s proposal and a tree-like message propagation pattern to achieve log(N) complexity. I agree with [~ivan.glukos] on the increased failover complexity, but I think it's doable. NOTE: the same idea could be used for increasing replicated cache performance on large topologies. We have a long-known issue with performance degradation when the topology is large. [~Jokser] The Gossip idea looks interesting, but it looks like a complicated change and reinventing the wheel. Why not stick to zookeeper? 
> O(log n) partition exchange > --- > > Key: IGNITE-12133 > URL: https://issues.apache.org/jira/browse/IGNITE-12133 > Project: Ignite > Issue Type: Improvement >Reporter: Moti Nisenson-Ken >Priority: Major > > Currently, partition exchange leverages a ring. This means that communications is O\(n) in the number of nodes. It also means that if non-coordinator nodes hang it can take much longer to successfully resolve the topology. > Instead, why not use something like a skip-list where the coordinator is first. The coordinator can notify the first node at each level of the skip-list. Each node then notifies all of its "near-neighbours" in the skip-list, where node B is a near-neighbour of node A if max-level(nodeB) <= max-level(nodeA), and nodeB is the first node at its level when traversing from nodeA in the direction of nodeB, skipping over nodes C which have max-level(C) > max-level(A).
> 1
> 1 . . . 3
> 1 3 . . . 5
> 1 . 2 . 3 . 4 . 5 . 6
> In the above, 1 would notify 2 and 3, 3 would notify 4 and 5, 2 -> 4, 4 -> 6, and 5 -> 6. > One can achieve better redundancy by having each node traverse in both directions, and having the coordinator also notify the last node in the list at each level. This way, in the above example, if 2 and 3 were both down, 4 would still get notified from 5 and 6 (in the backwards direction). > > The idea is that each individual node has O(log n) nodes to notify - so the overall time is reduced. Additionally, we can deal well with at least 1 node failure - if one includes the option of processing backwards, 2 consecutive node failures can be handled as well. By taking this kind of an approach, the coordinator can basically treat any nodes it didn't receive a message from as not-connected, and update the topology as well (disconnecting any nodes that it didn't get a notification from). While there are some edge cases here (e.g. 2 disconnected nodes, then 1 connected node, then 2 disconnected nodes - the connected node would be wrongly ejected from the topology), these would generally be too rare to need explicit handling for. -- This message was sent by Atlassian Jira (v8.3.2#803003)
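The log-vs-linear argument behind the proposal above can be illustrated with a small simulation. The fan-out of 2 is an assumed simplification chosen only to show the growth rates, not the exact skip-list notification schedule described in the issue:

```java
// Sketch: rounds needed to notify n nodes when each already-notified node
// forwards to two more per round (tree-like propagation), versus a ring
// where a message travels node-to-node sequentially. Illustrative only.
public class PropagationSketch {
    /** Rounds for tree-like propagation with fan-out 2: informed count
     * triples each round, so rounds grow as O(log n). */
    static int treeRounds(int n) {
        int informed = 1, rounds = 0;

        while (informed < n) {
            informed += informed * 2; // each informed node notifies 2 more
            rounds++;
        }

        return rounds;
    }

    /** Hops for ring propagation: the message visits nodes one by one. */
    static int ringRounds(int n) {
        return n - 1;
    }

    public static void main(String[] args) {
        System.out.println(treeRounds(1024)); // prints 7
        System.out.println(ringRounds(1024)); // prints 1023
    }
}
```

For 1024 nodes the tree-like scheme finishes in 7 rounds while the ring needs 1023 hops, which is the intuition behind both the skip-list proposal and the tree-like propagation suggested in the comments.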
[jira] [Commented] (IGNITE-12038) Fix several failing tests after IGNITE-10078
[ https://issues.apache.org/jira/browse/IGNITE-12038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918720#comment-16918720 ] Alexei Scherbakov commented on IGNITE-12038: The contribution contains a bunch of fixes related to the new partition counter implementation (IGNITE-10078) discovered during private testing. This should also fix the tests mentioned above. List of fixes: Fixed issues related to incorrect partition clearing in the OWNING state. Fixed RENTING->EVICTED partition state change prevention. CheckpointReadLock() may hang during node stop - fixed. Fixed an invalid topology version assertion thrown on PartitionCountersNeighborcastRequest. Fixed an issue when a cross-cache tx is mapped on a wrong primary when enlisted caches have incompatible assignments. Now transactions will be rolled back if they are preparing on an invalid primary node. Stabilized LocalWalModeChangeDuringRebalancingSelfTest. [~ivan.glukos] could you do a review? > Fix several failing tests after IGNITE-10078 > > > Key: IGNITE-12038 > URL: https://issues.apache.org/jira/browse/IGNITE-12038 > Project: Ignite > Issue Type: Bug >Reporter: Alexei Scherbakov >Assignee: Alexei Scherbakov >Priority: Major > Fix For: 2.8 > > Time Spent: 20m > Remaining Estimate: 0h > > *New stable failure of a flaky test in master > LocalWalModeChangeDuringRebalancingSelfTest.testWithExchangesMerge > https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8=-6585115376754732686=%3Cdefault%3E=testDetails > *New stable failure of a flaky test in master > GridCacheRebalancingWithAsyncClearingMvccTest.testPartitionClearingNotBlockExchange > > https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8=-7007912051428984819=%3Cdefault%3E=testDetails > *New stable failure of a flaky test in master > GridCacheRebalancingAsyncSelfTest.testComplexRebalancing > https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8=-8829809273565657995=%3Cdefault%3E=testDetails
[jira] [Commented] (IGNITE-11857) Investigate performance drop after IGNITE-10078
[ https://issues.apache.org/jira/browse/IGNITE-11857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918035#comment-16918035 ] Alexei Scherbakov commented on IGNITE-11857: [~alex_pl] Haven't yet, but it's in my queue. > Investigate performance drop after IGNITE-10078 > --- > > Key: IGNITE-11857 > URL: https://issues.apache.org/jira/browse/IGNITE-11857 > Project: Ignite > Issue Type: Improvement >Reporter: Alexei Scherbakov >Assignee: Aleksey Plekhanov >Priority: Major > Attachments: ignite-config.xml, > run.properties.tx-optimistic-put-b-backup > > Time Spent: 20m > Remaining Estimate: 0h > > After IGNITE-10078 yardstick tests show performance drop up to 8% in some > scenarios: > * tx-optim-repRead-put-get > * tx-optimistic-put > * tx-putAll > Partially this is due new update counter implementation, but not only. > Investigation is required. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (IGNITE-3195) Rebalancing: IgniteConfiguration.rebalanceThreadPoolSize is wrongly treated
[ https://issues.apache.org/jira/browse/IGNITE-3195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917802#comment-16917802 ] Alexei Scherbakov commented on IGNITE-3195: --- [~avinogradov] 1. This theoretically should work. I'm going to contribute a bunch of follow-up fixes by IGNITE-12038 very soon, let's check TC again after. 2. OK, looks like ordered messages should provide necessary ordering. No more objections from me, thanks. > Rebalancing: IgniteConfiguration.rebalanceThreadPoolSize is wrongly treated > --- > > Key: IGNITE-3195 > URL: https://issues.apache.org/jira/browse/IGNITE-3195 > Project: Ignite > Issue Type: Bug > Components: cache >Reporter: Denis Magda >Assignee: Anton Vinogradov >Priority: Major > Labels: iep-16 > Fix For: 2.8 > > Time Spent: 4h > Remaining Estimate: 0h > > Presently it's considered that the maximum number of threads that has to > process all demand and supply messages coming from all the nodes must not be > bigger than {{IgniteConfiguration.rebalanceThreadPoolSize}}. > Current implementation relies on ordered messages functionality creating a > number of topics equal to {{IgniteConfiguration.rebalanceThreadPoolSize}}. > However, the implementation doesn't take into account that ordered messages, > that correspond to a particular topic, are processed in parallel for > different nodes. Refer to the implementation of > {{GridIoManager.processOrderedMessage}} to see that for every topic there > will be a unique {{GridCommunicationMessageSet}} for every node. 
> Also to prove that this is true you can refer to this execution stack > {noformat} > java.lang.RuntimeException: HAPPENED DEMAND > at > org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:378) > at > org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:364) > at > org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:622) > at > org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:320) > at > org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$300(GridCacheIoManager.java:81) > at > org.apache.ignite.internal.processors.cache.GridCacheIoManager$OrderedMessageListener.onMessage(GridCacheIoManager.java:1125) > at > org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1219) > at > org.apache.ignite.internal.managers.communication.GridIoManager.access$1600(GridIoManager.java:105) > at > org.apache.ignite.internal.managers.communication.GridIoManager$GridCommunicationMessageSet.unwind(GridIoManager.java:2456) > at > org.apache.ignite.internal.managers.communication.GridIoManager.unwindMessageSet(GridIoManager.java:1179) > at > org.apache.ignite.internal.managers.communication.GridIoManager.access$1900(GridIoManager.java:105) > at > org.apache.ignite.internal.managers.communication.GridIoManager$6.run(GridIoManager.java:1148) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {noformat} > All this means that in fact the number of threads that will be busy with > replication activity will be equal to > {{IgniteConfiguration.rebalanceThreadPoolSize}} x > number_of_nodes_participated_in_rebalancing -- This 
message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Comment Edited] (IGNITE-3195) Rebalancing: IgniteConfiguration.rebalanceThreadPoolSize is wrongly treated
[ https://issues.apache.org/jira/browse/IGNITE-3195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916855#comment-16916855 ] Alexei Scherbakov edited comment on IGNITE-3195 at 8/27/19 4:41 PM: [~avinogradov] Overall the fix looks good, but I think we could improve it. 1. Looks like it's safe to remove ordering for historical rebalance, because after IGNITE-10078 the rmvQueue for a partition is no longer cleared during rebalance and removals cannot be lost. Given that, we could use a single thread pool for historical and full rebalance and parallelize historical rebalance on the supplier side, same as full. This is the right thing to do because from the user's point of view there is no difference between full and historical rebalance, and they can happen simultaneously. Note, a proper fix for writing tombstones is on the way [1]. 2. Looks like the current implementation for detecting partition completion on concurrent processing using *queued* and *processed* is flawed. Consider the scenario: The demander sends an initial demand request for a single partition. The supplier replies with 2 total supply messages, which start processing in parallel. The 2nd message is last. The 2nd message starts processing first and increments *queued* to N (the number of entries in the message). The 2nd message finishes processing, incrementing *processed* to N. Because this is the last message, the partition will be owned before the other messages are applied. [1] https://issues.apache.org/jira/browse/IGNITE-11704 was (Author: ascherbakov): [~avinogradov] Overall the fix looks good, but I think we could improve it. 1. Looks like it's safe to remove ordering for historical rebalance, because after IGNITE-10078 the rmvQueue for a partition is no longer cleared during rebalance and removals cannot be lost. Given that, we could use a single thread pool for historical and full rebalance and parallelize historical rebalance on the supplier side, same as full. 
This is the right thing to do because from the user's point of view there is no difference between full and historical rebalance, and they can happen simultaneously. Note, a proper fix for writing tombstones is on the way [1]. 2. Looks like the current implementation for detecting partition completion on concurrent processing using *queued* and *processed* is flawed. Consider the scenario: The demander sends an initial demand request for a single partition. The supplier replies with 2 total supply messages, which start processing in parallel. The 2nd message is last. The 2nd message starts processing first and increments *queued* to N (the number of entries in the message). The 2nd message finishes processing, incrementing *processed* to N. Because this is the last message, the partition will be owned before the other messages are applied. [1] https://issues.apache.org/jira/browse/IGNITE-11704 > Rebalancing: IgniteConfiguration.rebalanceThreadPoolSize is wrongly treated > --- > > Key: IGNITE-3195 > URL: https://issues.apache.org/jira/browse/IGNITE-3195 > Project: Ignite > Issue Type: Bug > Components: cache >Reporter: Denis Magda >Assignee: Anton Vinogradov >Priority: Major > Labels: iep-16 > Fix For: 2.8 > > Time Spent: 3h 50m > Remaining Estimate: 0h > > Presently it's considered that the maximum number of threads that has to > process all demand and supply messages coming from all the nodes must not be > bigger than {{IgniteConfiguration.rebalanceThreadPoolSize}}. > Current implementation relies on ordered messages functionality creating a > number of topics equal to {{IgniteConfiguration.rebalanceThreadPoolSize}}. > However, the implementation doesn't take into account that ordered messages, > that correspond to a particular topic, are processed in parallel for > different nodes. Refer to the implementation of > {{GridIoManager.processOrderedMessage}} to see that for every topic there > will be a unique {{GridCommunicationMessageSet}} for every node. 
> Also to prove that this is true you can refer to this execution stack > {noformat} > java.lang.RuntimeException: HAPPENED DEMAND > at > org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:378) > at > org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:364) > at > org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:622) > at > org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:320) > at > org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$300(GridCacheIoManager.java:81) > at >
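The race described in item 2 of the comment above can be sketched as follows. This is a minimal model, not Ignite internals: the class and method names are hypothetical, and only the *queued*/*processed* counter logic is reproduced.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Minimal sketch (hypothetical names, not actual Ignite code) of the flawed
 * completion check: a partition is owned once the last supply message has been
 * seen and processed == queued. The check fires too early when the last
 * message happens to be processed before earlier ones.
 */
public class PartitionDemandSketch {
    final AtomicLong queued = new AtomicLong();    // entries queued for processing
    final AtomicLong processed = new AtomicLong(); // entries fully applied
    volatile boolean lastMsgSeen;

    /** Processes one supply message; returns true if the partition is owned. */
    boolean process(int entries, boolean last) {
        queued.addAndGet(entries); // incremented when processing starts
        if (last)
            lastMsgSeen = true;
        processed.addAndGet(entries); // incremented when processing finishes
        // Flawed: the counters also match when only the last message was applied.
        return lastMsgSeen && queued.get() == processed.get();
    }

    public static void main(String[] args) {
        PartitionDemandSketch part = new PartitionDemandSketch();
        // The 2nd (last) supply message happens to be processed first...
        boolean ownedEarly = part.process(100, true);
        // ...so the partition is "owned" before the 1st message is applied.
        System.out.println("Owned before all messages applied: " + ownedEarly);
        part.process(100, false); // 1st message is applied too late
    }
}
```

Running the sketch prints `Owned before all messages applied: true`, which is exactly the premature ownership the comment warns about.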
[jira] [Commented] (IGNITE-3195) Rebalancing: IgniteConfiguration.rebalanceThreadPoolSize is wrongly treated
[ https://issues.apache.org/jira/browse/IGNITE-3195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916855#comment-16916855 ] Alexei Scherbakov commented on IGNITE-3195: --- [~avinogradov] Overall the fix looks good, but I think we could improve it. 1. It looks safe to remove ordering for historical rebalance: after IGNITE-10078 the partition's rmvQeue is no longer cleared during rebalance, so removals cannot be lost. Given that, we could use a single thread pool for both historical and full rebalance and parallelize historical rebalance on the supplier side the same way as full rebalance. This is the right thing to do because, from the user's point of view, there is no difference between full and historical rebalance, and they can happen simultaneously. Note that a proper fix for writing tombstones is on the way [1] 2. The current implementation for detecting partition completion under concurrent processing using the *queued* and *processed* counters looks flawed. Consider the scenario: the demander sends an initial demand request for a single partition. The supplier replies with 2 supply messages in total, which start to be processed in parallel. The 2nd message is the last one. The 2nd message starts processing first and increments *queued* to N (the number of entries in the message). The 2nd message finishes processing, incrementing *processed* to N. Because this is the last message, the partition will be owned before the other messages are applied. 
[1] https://issues.apache.org/jira/browse/IGNITE-11704 
> Also to prove that this is true you can refer to this execution stack > {noformat} > java.lang.RuntimeException: HAPPENED DEMAND > at > org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:378) > at > org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:364) > at > org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:622) > at > org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:320) > at > org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$300(GridCacheIoManager.java:81) > at > org.apache.ignite.internal.processors.cache.GridCacheIoManager$OrderedMessageListener.onMessage(GridCacheIoManager.java:1125) > at > org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1219) > at > org.apache.ignite.internal.managers.communication.GridIoManager.access$1600(GridIoManager.java:105) > at > org.apache.ignite.internal.managers.communication.GridIoManager$GridCommunicationMessageSet.unwind(GridIoManager.java:2456) > at > org.apache.ignite.internal.managers.communication.GridIoManager.unwindMessageSet(GridIoManager.java:1179) > at > org.apache.ignite.internal.managers.communication.GridIoManager.access$1900(GridIoManager.java:105) > at > org.apache.ignite.internal.managers.communication.GridIoManager$6.run(GridIoManager.java:1148) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {noformat} > All this means that in fact the number of threads that will be busy with > replication activity will be equal to > {{IgniteConfiguration.rebalanceThreadPoolSize}} x > number_of_nodes_participated_in_rebalancing -- This 
message was sent by Atlassian Jira (v8.3.2#803003)
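The multiplication at the end of the quoted description can be made concrete with a small sketch. The names below are illustrative; the only assumption carried over from the issue text is that one `GridCommunicationMessageSet` exists per (topic, node) pair, so per-topic ordering does not cap total parallelism across nodes.

```java
/**
 * Illustrative arithmetic for the quoted conclusion: ordered messages are
 * serialized only per (topic, node) pair, so with T topics (where T equals
 * rebalanceThreadPoolSize) and S supplier nodes, up to T * S rebalance
 * messages can be processed concurrently, not T.
 */
public class RebalanceParallelism {
    static int worstCaseBusyThreads(int rebalanceThreadPoolSize, int supplierNodes) {
        // One message set per (topic, node) pair, each unwound independently.
        return rebalanceThreadPoolSize * supplierNodes;
    }

    public static void main(String[] args) {
        // A pool size of 4 with 8 suppliers can keep 32 threads busy.
        System.out.println(worstCaseBusyThreads(4, 8)); // prints 32
    }
}
```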
[jira] [Commented] (IGNITE-3195) Rebalancing: IgniteConfiguration.rebalanceThreadPoolSize is wrongly treated
[ https://issues.apache.org/jira/browse/IGNITE-3195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916567#comment-16916567 ] Alexei Scherbakov commented on IGNITE-3195: --- [~avinogradov] I'll take a look. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (IGNITE-12093) Initial rebalance should be performed as efficiently as possible
[ https://issues.apache.org/jira/browse/IGNITE-12093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912283#comment-16912283 ] Alexei Scherbakov commented on IGNITE-12093: [~avinogradov] Keep in mind that during initial rebalance the demander node also receives updates to moving partitions and is enlisted in transactions. Having all threads perform rebalance may hurt the performance of normal transactions. > Initial rebalance should be performed as efficiently as possible > > > Key: IGNITE-12093 > URL: https://issues.apache.org/jira/browse/IGNITE-12093 > Project: Ignite > Issue Type: Task >Reporter: Anton Vinogradov >Priority: Major > Labels: iep-16 > > {{rebalanceThreadPoolSize}} setting should be ignored on initial rebalance > for demanders. > Maximum suitable thread pool size should be used during the initial rebalance > to perform it asap. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Updated] (IGNITE-12038) Fix several failing tests after IGNITE-10078
[ https://issues.apache.org/jira/browse/IGNITE-12038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexei Scherbakov updated IGNITE-12038: --- Description: *New stable failure of a flaky test in master LocalWalModeChangeDuringRebalancingSelfTest.testWithExchangesMerge https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8=-6585115376754732686=%3Cdefault%3E=testDetails *New stable failure of a flaky test in master GridCacheRebalancingWithAsyncClearingMvccTest.testPartitionClearingNotBlockExchange https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8=-7007912051428984819=%3Cdefault%3E=testDetails *New stable failure of a flaky test in master GridCacheRebalancingAsyncSelfTest.testComplexRebalancing https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8=-8829809273565657995=%3Cdefault%3E=testDetails was: *New stable failure of a flaky test in master LocalWalModeChangeDuringRebalancingSelfTest.testWithExchangesMerge https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8=-6585115376754732686=%3Cdefault%3E=testDetails Changes may lead to failure were done by - alexey.scherbak...@gmail.com https://ci.ignite.apache.org/viewModification.html?modId=886764 - ptupit...@apache.org https://ci.ignite.apache.org/viewModification.html?modId=886762 *New stable failure of a flaky test in master GridCacheRebalancingWithAsyncClearingMvccTest.testPartitionClearingNotBlockExchange https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8=-7007912051428984819=%3Cdefault%3E=testDetails Changes may lead to failure were done by - alexey.scherbak...@gmail.com https://ci.ignite.apache.org/viewModification.html?modId=886764 - ptupit...@apache.org https://ci.ignite.apache.org/viewModification.html?modId=886762 *New stable failure of a flaky test in master GridCacheRebalancingAsyncSelfTest.testComplexRebalancing 
https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8=-8829809273565657995=%3Cdefault%3E=testDetails > Fix several failing tests after IGNITE-10078 > > > Key: IGNITE-12038 > URL: https://issues.apache.org/jira/browse/IGNITE-12038 > Project: Ignite > Issue Type: Bug >Reporter: Alexei Scherbakov >Assignee: Alexei Scherbakov >Priority: Major > Fix For: 2.8 > > > *New stable failure of a flaky test in master > LocalWalModeChangeDuringRebalancingSelfTest.testWithExchangesMerge > https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8=-6585115376754732686=%3Cdefault%3E=testDetails > *New stable failure of a flaky test in master > GridCacheRebalancingWithAsyncClearingMvccTest.testPartitionClearingNotBlockExchange > > https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8=-7007912051428984819=%3Cdefault%3E=testDetails > *New stable failure of a flaky test in master > GridCacheRebalancingAsyncSelfTest.testComplexRebalancing > https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8=-8829809273565657995=%3Cdefault%3E=testDetails -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (IGNITE-12038) Fix several failing tests after IGNITE-10078
[ https://issues.apache.org/jira/browse/IGNITE-12038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexei Scherbakov updated IGNITE-12038: --- Description: *New stable failure of a flaky test in master LocalWalModeChangeDuringRebalancingSelfTest.testWithExchangesMerge https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8=-6585115376754732686=%3Cdefault%3E=testDetails Changes may lead to failure were done by - alexey.scherbak...@gmail.com https://ci.ignite.apache.org/viewModification.html?modId=886764 - ptupit...@apache.org https://ci.ignite.apache.org/viewModification.html?modId=886762 *New stable failure of a flaky test in master GridCacheRebalancingWithAsyncClearingMvccTest.testPartitionClearingNotBlockExchange https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8=-7007912051428984819=%3Cdefault%3E=testDetails Changes may lead to failure were done by - alexey.scherbak...@gmail.com https://ci.ignite.apache.org/viewModification.html?modId=886764 - ptupit...@apache.org https://ci.ignite.apache.org/viewModification.html?modId=886762 *New stable failure of a flaky test in master GridCacheRebalancingAsyncSelfTest.testComplexRebalancing https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8=-8829809273565657995=%3Cdefault%3E=testDetails > Fix several failing tests after IGNITE-10078 > > > Key: IGNITE-12038 > URL: https://issues.apache.org/jira/browse/IGNITE-12038 > Project: Ignite > Issue Type: Bug >Reporter: Alexei Scherbakov >Assignee: Alexei Scherbakov >Priority: Major > Fix For: 2.8 > > > *New stable failure of a flaky test in master > LocalWalModeChangeDuringRebalancingSelfTest.testWithExchangesMerge > https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8=-6585115376754732686=%3Cdefault%3E=testDetails > Changes may lead to failure were done by >- alexey.scherbak...@gmail.com > https://ci.ignite.apache.org/viewModification.html?modId=886764 >- ptupit...@apache.org > 
https://ci.ignite.apache.org/viewModification.html?modId=886762 > *New stable failure of a flaky test in master > GridCacheRebalancingWithAsyncClearingMvccTest.testPartitionClearingNotBlockExchange > > https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8=-7007912051428984819=%3Cdefault%3E=testDetails > Changes may lead to failure were done by >- alexey.scherbak...@gmail.com > https://ci.ignite.apache.org/viewModification.html?modId=886764 >- ptupit...@apache.org > https://ci.ignite.apache.org/viewModification.html?modId=886762 > *New stable failure of a flaky test in master > GridCacheRebalancingAsyncSelfTest.testComplexRebalancing > https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8=-8829809273565657995=%3Cdefault%3E=testDetails -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (IGNITE-12038) Fix several failing tests after IGNITE-10078
Alexei Scherbakov created IGNITE-12038: -- Summary: Fix several failing tests after IGNITE-10078 Key: IGNITE-12038 URL: https://issues.apache.org/jira/browse/IGNITE-12038 Project: Ignite Issue Type: Bug Reporter: Alexei Scherbakov Assignee: Alexei Scherbakov Fix For: 2.8 -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (IGNITE-11857) Investigate performance drop after IGNITE-10078
[ https://issues.apache.org/jira/browse/IGNITE-11857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16898804#comment-16898804 ] Alexei Scherbakov commented on IGNITE-11857: [~alex_pl], I will take a look in the next couple of days. > Investigate performance drop after IGNITE-10078 > --- > > Key: IGNITE-11857 > URL: https://issues.apache.org/jira/browse/IGNITE-11857 > Project: Ignite > Issue Type: Improvement >Reporter: Alexei Scherbakov >Assignee: Aleksey Plekhanov >Priority: Major > Attachments: ignite-config.xml, > run.properties.tx-optimistic-put-b-backup > > Time Spent: 20m > Remaining Estimate: 0h > > After IGNITE-10078 yardstick tests show performance drop up to 8% in some > scenarios: > * tx-optim-repRead-put-get > * tx-optimistic-put > * tx-putAll > Partially this is due to the new update counter implementation, but not only. > Investigation is required. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (IGNITE-11799) Do not always clear partition in MOVING state before exchange
[ https://issues.apache.org/jira/browse/IGNITE-11799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887952#comment-16887952 ] Alexei Scherbakov commented on IGNITE-11799: [~Mmuzaf] This is still relevant. I haven't yet donated several follow-up fixes from GridGain CE, where the comment is removed. I'm currently on vacation and expect to donate them at the start of August. > Do not always clear partition in MOVING state before exchange > - > > Key: IGNITE-11799 > URL: https://issues.apache.org/jira/browse/IGNITE-11799 > Project: Ignite > Issue Type: Improvement >Reporter: Alexei Scherbakov >Assignee: Alexei Scherbakov >Priority: Major > > After IGNITE-10078, if a partition was in MOVING state before exchange and > was chosen for full rebalance (for example, this will happen if any minor PME > cancels a previous rebalance), we will always clear it to avoid desync issues if > some removals were not delivered to the demander. > This is not necessary if the previous rebalance was full. > > > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (IGNITE-8873) Optimize cache scans with enabled persistence.
[ https://issues.apache.org/jira/browse/IGNITE-8873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872682#comment-16872682 ] Alexei Scherbakov commented on IGNITE-8873: --- [~dmagda] This method was added to address exactly the case where a huge (tens of terabytes) cache has to be fully scanned efficiently. It has already been used successfully in production by some Ignite users, as far as I know. The main idea behind the per-partition preloading API is the same as for other methods working with partitions: affinity run/call, scan by partition, SQL query by partition(s). I suggest keeping this method for advanced use cases and adding some more "high level" APIs like you have proposed. > Optimize cache scans with enabled persistence. > -- > > Key: IGNITE-8873 > URL: https://issues.apache.org/jira/browse/IGNITE-8873 > Project: Ignite > Issue Type: Improvement >Reporter: Alexei Scherbakov >Assignee: Alexei Scherbakov >Priority: Major > Fix For: 2.8 > > > Currently cache scans with persistence enabled involve link resolution, which > can lead to random disk access, resulting in bad performance on SAS disks. > One possibility is to preload cache data pages to remove slow random disk > access. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (IGNITE-11867) Fix flaky test GridCacheRebalancingWithAsyncClearingTest#testCorrectRebalancingCurrentlyRentingPartitions
[ https://issues.apache.org/jira/browse/IGNITE-11867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16869471#comment-16869471 ] Alexei Scherbakov commented on IGNITE-11867: [~ivan.glukos] Ready for review. > Fix flaky test > GridCacheRebalancingWithAsyncClearingTest#testCorrectRebalancingCurrentlyRentingPartitions > - > > Key: IGNITE-11867 > URL: https://issues.apache.org/jira/browse/IGNITE-11867 > Project: Ignite > Issue Type: Bug >Reporter: Alexei Scherbakov >Assignee: Alexei Scherbakov >Priority: Major > Fix For: 2.8 > > Time Spent: 10m > Remaining Estimate: 0h > > {noformat} > java.lang.AssertionError: Value for 4 is null > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.assertTrue(Assert.java:41) > at org.junit.Assert.assertNotNull(Assert.java:621) > at > org.apache.ignite.internal.processors.cache.distributed.rebalancing.GridCacheRebalancingWithAsyncClearingTest.testCorrectRebalancingCurrentlyRentingPartitions(GridCacheRebalancingWithAsyncClearingTest.java:280) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) > at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) > at > org.apache.ignite.testframework.junits.GridAbstractTest$6.run(GridAbstractTest.java:2148) > at java.lang.Thread.run(Thread.java:748){noformat} > EDIT: The issue is related to recently contributed 
IGNITE-10078. In a specific > scenario, due to a race, partition clearing could be started while the partition > was passing through an ongoing rebalance started on a previous topology version. > I fixed it by preventing partition clearing on a newer topology version. In > that case, if the rebalance finishes and the partition goes to OWNING state, > further clearing is not needed any more; otherwise the partition should be > scheduled for clearing again. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
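The guard described in the EDIT above can be sketched as follows. This is a simplification under stated assumptions: the class name, method name, and the long-typed topology version are hypothetical, not the actual Ignite API.

```java
/**
 * Sketch (simplified, hypothetical API) of the fix described above: a clearing
 * request scheduled on an older topology version is skipped, because rebalance
 * on the newer version either ends with the partition OWNING (so no clearing is
 * needed) or schedules clearing again itself.
 */
public class ClearingGuard {
    /** @return true if clearing scheduled on {@code requestedTopVer} may run now. */
    static boolean mayClear(long requestedTopVer, long currentTopVer) {
        return requestedTopVer == currentTopVer; // stale requests are dropped
    }

    public static void main(String[] args) {
        System.out.println(mayClear(5, 5)); // current request: allowed, prints true
        System.out.println(mayClear(4, 5)); // stale request: skipped, prints false
    }
}
```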
[jira] [Created] (IGNITE-11939) IgnitePdsTxHistoricalRebalancingTest.testTopologyChangesWithConstantLoad test failure
Alexei Scherbakov created IGNITE-11939: -- Summary: IgnitePdsTxHistoricalRebalancingTest.testTopologyChangesWithConstantLoad test failure Key: IGNITE-11939 URL: https://issues.apache.org/jira/browse/IGNITE-11939 Project: Ignite Issue Type: Bug Reporter: Alexei Scherbakov Caused by exception on releasing reserved segments: {noformat} [12:51:23]W: [org.apache.ignite:ignite-indexing] [2019-06-21 12:51:23,967][ERROR][exchange-worker-#33825%persistence.IgnitePdsTxHistoricalRebalancingTest1%][GridDhtPartitionsExchangeFuture] Failed to reinitialize local partitions (rebalancing will be stopped) : GridDhtPartitionExchangeId [topVer=AffinityTopologyVersion [topVer=7, minorTopVer=1], discoEvt=DiscoveryCustomEvent [customMsg=CacheAffinityChangeMessage [id=08de0ff7b61-276ac575-e4dc-4525-b24b-d0a5d1d7633d, topVer=AffinityTopologyVersion [topVer=7, minorTopVer=0], exc hId=null, partsMsg=null, exchangeNeeded=true], affTopVer=AffinityTopologyVersion [topVer=7, minorTopVer=1], super=DiscoveryEvent [evtNode=TcpDiscoveryNode [id=97e46568-6aa0-4a4b-864c-f05415c0, consistentId=persistence.IgnitePdsTxHistoricalRebalancingTest0, addrs=Arra yList [127.0.0.1], sockAddrs=HashSet [/127.0.0.1:47500], discPort=47500, order=1, intOrder=1, lastExchangeTime=1561110643882, loc=false, ver=2.8.0#20190621-sha1:, isClient=false], topVer=7, nodeId8=0ff3354e, msg=null, type=DISCOVERY_CUSTOM_EVT, tstamp=15611106839 58]], nodeId=97e46568, evt=DISCOVERY_CUSTOM_EVT] [12:51:23]W: [org.apache.ignite:ignite-indexing] java.lang.AssertionError: cur=null, absIdx=0 [12:51:23]W: [org.apache.ignite:ignite-indexing]at org.apache.ignite.internal.processors.cache.persistence.wal.aware.SegmentReservationStorage.release(SegmentReservationStorage.java:55) [12:51:23]W: [org.apache.ignite:ignite-indexing]at org.apache.ignite.internal.processors.cache.persistence.wal.aware.SegmentAware.release(SegmentAware.java:207) [12:51:23]W: [org.apache.ignite:ignite-indexing]at 
org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager.release(FileWriteAheadLogManager.java:983) [12:51:23]W: [org.apache.ignite:ignite-indexing]at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.releaseHistoryForPreloading(GridCacheDatabaseSharedManager.java:1844) [12:51:23]W: [org.apache.ignite:ignite-indexing]at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.distributedExchange(GridDhtPartitionsExchangeFuture.java:1431) [12:51:23]W: [org.apache.ignite:ignite-indexing]at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:862) [12:51:23]W: [org.apache.ignite:ignite-indexing]at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body0(GridCachePartitionExchangeManager.java:3079) [12:51:23]W: [org.apache.ignite:ignite-indexing]at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:2928) [12:51:23]W: [org.apache.ignite:ignite-indexing]at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) [12:51:23]W: [org.apache.ignite:ignite-indexing]at java.lang.Thread.run(Thread.java:748) [12:51:23]W: [org.apache.ignite:ignite-indexing] [12:51:23] (err) Failed to notify listener: o.a.i.i.processors.timeout.GridTimeoutProcessor$2...@79ba1907java.lang.AssertionError: cur=null, absIdx=0 [12:51:23]W: [org.apache.ignite:ignite-indexing]at org.apache.ignite.internal.processors.cache.persistence.wal.aware.SegmentReservationStorage.release(SegmentReservationStorage.java:55) [12:51:23]W: [org.apache.ignite:ignite-indexing]at org.apache.ignite.internal.processors.cache.persistence.wal.aware.SegmentAware.release(SegmentAware.java:207) [12:51:23]W: [org.apache.ignite:ignite-indexing]at 
org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager.release(FileWriteAheadLogManager.java:983) [12:51:23]W: [org.apache.ignite:ignite-indexing]at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.releaseHistoryForPreloading(GridCacheDatabaseSharedManager.java:1844) [12:51:23]W: [org.apache.ignite:ignite-indexing]at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.distributedExchange(GridDhtPartitionsExchangeFuture.java:1431) [12:51:23]W: [org.apache.ignite:ignite-indexing]at
[jira] [Created] (IGNITE-11937) Fix MVCC PDS flaky suites timeout
Alexei Scherbakov created IGNITE-11937: -- Summary: Fix MVCC PDS flaky suites timeout Key: IGNITE-11937 URL: https://issues.apache.org/jira/browse/IGNITE-11937 Project: Ignite Issue Type: Bug Reporter: Alexei Scherbakov Currently we have a non-zero failure rate for some MVCC PDS suites in master. It seems this is due to failure [1] in the testRebalancingDuringLoad* test group, which leads to dumping WAL and lock states in time proportional to the current WAL length, increasing test duration by a random amount depending on WAL length. Worse, the test remains green despite throwing a critical exception. [1] Stacktrace {noformat}
[2019-06-19 15:56:53,386][ERROR][sys-stripe-6-#134%persistence.IgnitePdsContinuousRestartTestWithSharedGroupAndIndexes3%][IgniteTestResources] Critical system error detected. Will be handled accordingly to configured handler [hnd=NoOpFailureHandler [super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=CRITICAL_ERROR, err=class o.a.i.i.processors.cache.persistence.tree.CorruptedTreeException: B+Tree is corrupted [pages(groupId, pageId)=[IgniteBiTuple [val1=81227264, val2=844420635164676]], msg=Runtime failure on search row: TxKey [major=1560948946388, minor=17286]]]
class org.apache.ignite.internal.processors.cache.persistence.tree.CorruptedTreeException: B+Tree is corrupted [pages(groupId, pageId)=[IgniteBiTuple [val1=81227264, val2=844420635164676]], msg=Runtime failure on search row: TxKey [major=1560948946388, minor=17286]]
	at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.corruptedTreeException(BPlusTree.java:5909)
	at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.invoke(BPlusTree.java:1859)
	at org.apache.ignite.internal.processors.cache.mvcc.txlog.TxLog.put(TxLog.java:293)
	at org.apache.ignite.internal.processors.cache.mvcc.MvccProcessorImpl.updateState(MvccProcessorImpl.java:699)
	at org.apache.ignite.internal.processors.cache.transactions.IgniteTxManager.setMvccState(IgniteTxManager.java:2570)
	at org.apache.ignite.internal.processors.cache.transactions.IgniteTxAdapter.state(IgniteTxAdapter.java:1228)
	at org.apache.ignite.internal.processors.cache.transactions.IgniteTxAdapter.state(IgniteTxAdapter.java:1070)
	at org.apache.ignite.internal.processors.cache.distributed.GridDistributedTxRemoteAdapter.prepareRemoteTx(GridDistributedTxRemoteAdapter.java:421)
	at org.apache.ignite.internal.processors.cache.transactions.IgniteTxHandler.startRemoteTx(IgniteTxHandler.java:1837)
	at org.apache.ignite.internal.processors.cache.transactions.IgniteTxHandler.processDhtTxPrepareRequest(IgniteTxHandler.java:1198)
	at org.apache.ignite.internal.processors.cache.transactions.IgniteTxHandler.access$400(IgniteTxHandler.java:118)
	at org.apache.ignite.internal.processors.cache.transactions.IgniteTxHandler$5.apply(IgniteTxHandler.java:224)
	at org.apache.ignite.internal.processors.cache.transactions.IgniteTxHandler$5.apply(IgniteTxHandler.java:222)
	at org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1141)
	at org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:591)
	at org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:392)
	at org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:318)
	at org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$100(GridCacheIoManager.java:109)
	at org.apache.ignite.internal.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:308)
	at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1558)
	at org.apache.ignite.internal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1186)
	at org.apache.ignite.internal.managers.communication.GridIoManager.access$4200(GridIoManager.java:125)
	at org.apache.ignite.internal.managers.communication.GridIoManager$8.run(GridIoManager.java:1083)
	at org.apache.ignite.internal.util.StripedExecutor$Stripe.body(StripedExecutor.java:559)
	at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:119)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalStateException: Unexpected new transaction state. [currState=2, newState=1, cntr=17286]
	at org.apache.ignite.internal.processors.cache.mvcc.txlog.TxLog$TxLogUpdateClosure.invalid(TxLog.java:629)
	at
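The "remains green despite throwing a critical exception" part happens because the test uses NoOpFailureHandler, which swallows the error. One way to avoid that is a handler that records every critical failure so the test can check for them at teardown. A minimal sketch; the class name and callbacks below are hypothetical, not Ignite's actual test-framework API:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Sketch of a test failure handler: instead of silently ignoring critical
// errors (as a no-op handler does), it records them so the test can fail
// in tearDown. All names here are invented for illustration.
public class RecordingFailureHandler {
    private final List<Throwable> criticalErrors = new CopyOnWriteArrayList<>();

    /** Node-side callback on a critical error; false = do not stop the node. */
    public boolean onFailure(Throwable err) {
        criticalErrors.add(err);
        return false;
    }

    /** Call from tearDown: turns any recorded critical error into a test failure. */
    public void assertNoCriticalErrors() {
        if (!criticalErrors.isEmpty())
            throw new AssertionError("Critical errors during test: " + criticalErrors);
    }
}
```

With such a handler installed, the CorruptedTreeException above would fail the test instead of only inflating its duration.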
[jira] [Comment Edited] (IGNITE-11799) Do not always clear partition in MOVING state before exchange
[ https://issues.apache.org/jira/browse/IGNITE-11799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16868401#comment-16868401 ] Alexei Scherbakov edited comment on IGNITE-11799 at 6/20/19 9:46 AM: - Actually, clearing is required in all cases if the new rebalance is FULL. Consider the scenario: 2 nodes, N1 - supplier, N2 - demander. N1 has keys k1,k2,k3. N2 joins and starts any type of rebalancing. N1 removes k1. N2 dies after receiving k1,k2 in supply but before receiving the removal of k1 as an update to the MOVING partition. N2 joins, starts full rebalance and loads k2,k3. N2 will contain keys k1,k2,k3 while N1 will contain keys k2,k3, causing partition desync. was (Author: ascherbakov): Actually clearing required in all cases if new rebalance is FULL. Consider the scenario: 2 nodes, N1 - supplier, N2 - demander. N1 has keys k1,k2,k3 N2 joins and start any type of rebalancing N1 removes k1 N2 dies after receiving k1,k2 in supply but before receiving removal of k1 N2 joins, starts full rebalance and loads k2,k3 N2 will contain keys 1,2,3 while N1 will contain keys 1,2 causing partition desync. > Do not always clear partition in MOVING state before exchange > - > > Key: IGNITE-11799 > URL: https://issues.apache.org/jira/browse/IGNITE-11799 > Project: Ignite > Issue Type: Improvement >Reporter: Alexei Scherbakov >Assignee: Alexei Scherbakov >Priority: Major > > After IGNITE-10078, if a partition was in MOVING state before exchange and > chosen for full rebalance (for example, this will happen if any minor PME > cancels the previous rebalance), we will always clear it to avoid desync issues if > some removals were not delivered to the demander. > This is not necessary if the previous rebalance was full. > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
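The scenario above can be simulated with plain sets (a toy model, not Ignite code; all names are invented): without clearing the keys retained in the MOVING partition, a FULL rebalance merges stale data and the two copies diverge.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of the desync scenario: N1 is the supplier, N2 the demander
// that died mid-rebalance and kept keys k1,k2 in its MOVING partition.
public class DesyncSketch {
    /** FULL rebalance copies the supplier's current keys onto the demander. */
    public static Set<String> fullRebalance(Set<String> supplier,
                                            Set<String> demander,
                                            boolean clearFirst) {
        Set<String> res = clearFirst ? new HashSet<>() : new HashSet<>(demander);
        res.addAll(supplier);
        return res;
    }

    public static void main(String[] args) {
        Set<String> n1 = new HashSet<>(Set.of("k1", "k2", "k3"));
        n1.remove("k1"); // removal that never reached N2

        Set<String> n2 = new HashSet<>(Set.of("k1", "k2")); // N2's state after the crash

        // Without clearing: stale k1 survives the FULL rebalance -> desync.
        assert !fullRebalance(n1, n2, false).equals(n1);
        // With clearing before the FULL rebalance: the copies converge.
        assert fullRebalance(n1, n2, true).equals(n1);
    }
}
```

This is why the ticket was resolved as Won't Fix: skipping the clearing step is unsafe even when the previous rebalance was FULL.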
[jira] [Resolved] (IGNITE-11799) Do not always clear partition in MOVING state before exchange
[ https://issues.apache.org/jira/browse/IGNITE-11799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexei Scherbakov resolved IGNITE-11799. Resolution: Won't Fix Actually, clearing is required in all cases if the new rebalance is FULL. Consider the scenario: 2 nodes, N1 - supplier, N2 - demander. N1 has keys k1,k2,k3. N2 joins and starts any type of rebalancing. N1 removes k1. N2 dies after receiving k1,k2 but before receiving the removal of k1. N2 joins, starts full rebalance and loads k2,k3. N2 will contain keys k1,k2,k3 while N1 will contain keys k2,k3, causing partition desync. > Do not always clear partition in MOVING state before exchange > - > > Key: IGNITE-11799 > URL: https://issues.apache.org/jira/browse/IGNITE-11799 > Project: Ignite > Issue Type: Improvement >Reporter: Alexei Scherbakov >Assignee: Alexei Scherbakov >Priority: Major > > After IGNITE-10078, if a partition was in MOVING state before exchange and > chosen for full rebalance (for example, this will happen if any minor PME > cancels the previous rebalance), we will always clear it to avoid desync issues if > some removals were not delivered to the demander. > This is not necessary if the previous rebalance was full. > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (IGNITE-11799) Do not always clear partition in MOVING state before exchange
[ https://issues.apache.org/jira/browse/IGNITE-11799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexei Scherbakov reassigned IGNITE-11799: -- Assignee: Alexei Scherbakov > Do not always clear partition in MOVING state before exchange > - > > Key: IGNITE-11799 > URL: https://issues.apache.org/jira/browse/IGNITE-11799 > Project: Ignite > Issue Type: Improvement >Reporter: Alexei Scherbakov >Assignee: Alexei Scherbakov >Priority: Major > > After IGNITE-10078, if a partition was in MOVING state before exchange and > chosen for full rebalance (for example, this will happen if any minor PME > cancels the previous rebalance), we will always clear it to avoid desync issues if > some removals were not delivered to the demander. > This is not necessary if the previous rebalance was full. > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (IGNITE-11848) [IEP-35] Monitoring Phase 1
[ https://issues.apache.org/jira/browse/IGNITE-11848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862120#comment-16862120 ] Alexei Scherbakov commented on IGNITE-11848: [~NIzhikov] I have left some (mostly minor) comments under the PR, please address them. In general I'm OK with the changes. > [IEP-35] Monitoring Phase 1 > -- > > Key: IGNITE-11848 > URL: https://issues.apache.org/jira/browse/IGNITE-11848 > Project: Ignite > Issue Type: Task >Affects Versions: 2.7 >Reporter: Nikolay Izhikov >Assignee: Nikolay Izhikov >Priority: Major > Labels: IEP-35 > Fix For: 2.8 > > Time Spent: 4h 20m > Remaining Estimate: 0h > > Umbrella ticket for the IEP-35. Monitoring and profiling. > Phase 1 should include: > * NextGen monitoring subsystem implementation to manage > ** metrics > ** -lists- (will be implemented in the following tickets) > ** exporters > * JMX, SQLView, Log exporters > * Migration of existing metrics to new manager > * -Lists for all Ignite user API- -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (IGNITE-11867) Fix flaky test GridCacheRebalancingWithAsyncClearingTest#testCorrectRebalancingCurrentlyRentingPartitions
[ https://issues.apache.org/jira/browse/IGNITE-11867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16857479#comment-16857479 ] Alexei Scherbakov commented on IGNITE-11867: [~ivan.glukos] [~Jokser] Please review. The main idea of the fix is to enforce an ordering: the current rebalance happens before the next partition clearing, preventing a race between rebalancing and clearing. I checked the timed-out runs and do not see any obvious relation to the patch. MVCC PDS3 and PDS4 also time out in the base (master) branch. > Fix flaky test > GridCacheRebalancingWithAsyncClearingTest#testCorrectRebalancingCurrentlyRentingPartitions > - > > Key: IGNITE-11867 > URL: https://issues.apache.org/jira/browse/IGNITE-11867 > Project: Ignite > Issue Type: Bug >Reporter: Alexei Scherbakov >Assignee: Alexei Scherbakov >Priority: Major > Fix For: 2.8 > > Time Spent: 10m > Remaining Estimate: 0h > > {noformat} > java.lang.AssertionError: Value for 4 is null > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.assertTrue(Assert.java:41) > at org.junit.Assert.assertNotNull(Assert.java:621) > at > org.apache.ignite.internal.processors.cache.distributed.rebalancing.GridCacheRebalancingWithAsyncClearingTest.testCorrectRebalancingCurrentlyRentingPartitions(GridCacheRebalancingWithAsyncClearingTest.java:280) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) > at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) > at > org.apache.ignite.testframework.junits.GridAbstractTest$6.run(GridAbstractTest.java:2148) > at java.lang.Thread.run(Thread.java:748){noformat} > EDIT: The issue is related to the recently contributed IGNITE-10078. In a specific > scenario, due to a race, partition clearing could be started while the partition is > passing through an ongoing rebalance started on a previous topology version. > I fixed it by preventing partition clearing on a newer topology version. In > such a case, if the rebalance finishes and the partition goes to the OWNING state, > further clearing is not needed any more; otherwise the partition should be > scheduled for clearing again. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
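The ordering rule of the fix can be sketched with a topology-version check (a toy model with invented names, not the actual GridDhtLocalPartition code): a clearing request tagged with an older topology version than the current rebalance is dropped, and clearing is re-scheduled only if the rebalance did not move the partition to OWNING.

```java
// Toy model of the fix: clearing is skipped when a rebalance on a newer
// topology version has started; if that rebalance fails to move the
// partition to OWNING, clearing is scheduled again.
public class ClearingSketch {
    public enum State { MOVING, OWNING }

    long rebalanceTopVer;       // topology version of the current rebalance
    State state = State.MOVING;
    int clearCount;             // how many times clearing actually ran

    /** Clearing request tagged with the topology version that scheduled it. */
    public void requestClear(long topVer) {
        if (topVer < rebalanceTopVer)
            return; // stale request: a newer rebalance owns this partition's fate

        clearCount++;
    }

    /** Rebalance on {@code topVer} finished; {@code owned} = partition is consistent. */
    public void onRebalanceDone(long topVer, boolean owned) {
        if (owned)
            state = State.OWNING;
        else
            requestClear(topVer); // schedule clearing again for a retry
    }
}
```

The key property is that a clearing scheduled on an old topology version can never race with (and wipe data under) a rebalance that started later.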
[jira] [Updated] (IGNITE-11887) Add more test scenarios for OWNING -> RENTING -> MOVING scenario
[ https://issues.apache.org/jira/browse/IGNITE-11887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexei Scherbakov updated IGNITE-11887: --- Description: Relevant test GridCacheRebalancingWithAsyncClearingTest#testCorrectRebalancingCurrentlyRentingPartitions. Need to extend with 1. in-memory 2. under load was:Relevant test GridCacheRebalancingWithAsyncClearingTest#testCorrectRebalancingCurrentlyRentingPartitions > Add more test scenarios for OWNING -> RENTING -> MOVING scenario > - > > Key: IGNITE-11887 > URL: https://issues.apache.org/jira/browse/IGNITE-11887 > Project: Ignite > Issue Type: Test >Reporter: Alexei Scherbakov >Assignee: Alexei Scherbakov >Priority: Major > > Relevant test > GridCacheRebalancingWithAsyncClearingTest#testCorrectRebalancingCurrentlyRentingPartitions. > Need to extend with > 1. in-memory > 2. under load -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (IGNITE-11887) Add more test scenarios for OWNING -> RENTING -> MOVING scenario
Alexei Scherbakov created IGNITE-11887: -- Summary: Add more test scenarios for OWNING -> RENTING -> MOVING scenario Key: IGNITE-11887 URL: https://issues.apache.org/jira/browse/IGNITE-11887 Project: Ignite Issue Type: Test Reporter: Alexei Scherbakov Assignee: Alexei Scherbakov Relevant test GridCacheRebalancingWithAsyncClearingTest#testCorrectRebalancingCurrentlyRentingPartitions -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (IGNITE-11867) Fix flaky test GridCacheRebalancingWithAsyncClearingTest#testCorrectRebalancingCurrentlyRentingPartitions
[ https://issues.apache.org/jira/browse/IGNITE-11867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16851750#comment-16851750 ] Alexei Scherbakov commented on IGNITE-11867: [~ivan.glukos] Failing suite is not related to changes. Please review. > Fix flaky test > GridCacheRebalancingWithAsyncClearingTest#testCorrectRebalancingCurrentlyRentingPartitions > - > > Key: IGNITE-11867 > URL: https://issues.apache.org/jira/browse/IGNITE-11867 > Project: Ignite > Issue Type: Bug >Reporter: Alexei Scherbakov >Assignee: Alexei Scherbakov >Priority: Major > Fix For: 2.8 > > Time Spent: 10m > Remaining Estimate: 0h > > {noformat} > java.lang.AssertionError: Value for 4 is null > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.assertTrue(Assert.java:41) > at org.junit.Assert.assertNotNull(Assert.java:621) > at > org.apache.ignite.internal.processors.cache.distributed.rebalancing.GridCacheRebalancingWithAsyncClearingTest.testCorrectRebalancingCurrentlyRentingPartitions(GridCacheRebalancingWithAsyncClearingTest.java:280) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) > at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) > at > org.apache.ignite.testframework.junits.GridAbstractTest$6.run(GridAbstractTest.java:2148) > at java.lang.Thread.run(Thread.java:748){noformat} > EDIT: The issue is 
related to the recently contributed IGNITE-10078. In a specific > scenario, due to a race, partition clearing could be started while the partition is > passing through an ongoing rebalance started on a previous topology version. > I fixed it by preventing partition clearing on a newer topology version. In > such a case, if the rebalance finishes and the partition goes to the OWNING state, > further clearing is not needed any more; otherwise the partition should be > scheduled for clearing again. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (IGNITE-11867) Fix flaky test GridCacheRebalancingWithAsyncClearingTest#testCorrectRebalancingCurrentlyRentingPartitions
[ https://issues.apache.org/jira/browse/IGNITE-11867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexei Scherbakov updated IGNITE-11867: --- Description: {noformat} java.lang.AssertionError: Value for 4 is null at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertNotNull(Assert.java:621) at org.apache.ignite.internal.processors.cache.distributed.rebalancing.GridCacheRebalancingWithAsyncClearingTest.testCorrectRebalancingCurrentlyRentingPartitions(GridCacheRebalancingWithAsyncClearingTest.java:280) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) at org.apache.ignite.testframework.junits.GridAbstractTest$6.run(GridAbstractTest.java:2148) at java.lang.Thread.run(Thread.java:748){noformat} EDIT: The issue is related to the recently contributed IGNITE-10078. In a specific scenario, due to a race, partition clearing could be started while the partition is passing through an ongoing rebalance started on a previous topology version. I fixed it by preventing partition clearing on a newer topology version. In such a case, if the rebalance finishes and the partition goes to the OWNING state, further clearing is not needed any more; otherwise the partition should be scheduled for clearing again. 
was: {noformat} java.lang.AssertionError: Value for 4 is null at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertNotNull(Assert.java:621) at org.apache.ignite.internal.processors.cache.distributed.rebalancing.GridCacheRebalancingWithAsyncClearingTest.testCorrectRebalancingCurrentlyRentingPartitions(GridCacheRebalancingWithAsyncClearingTest.java:280) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) at org.apache.ignite.testframework.junits.GridAbstractTest$6.run(GridAbstractTest.java:2148) at java.lang.Thread.run(Thread.java:748){noformat} EDIT: The issue is related to recently contributed IGNITE-10078. In specific scenario due to race partition clearing could be started while partition is passing through ongoing rebalancing started on previous topology version. I fixed it by preventing partition clearing on newer topology versions. In such case if rebalance will be finished and partition will go in OWNING state further clearing is not needed any more, otherwise partition will be scheduled for clearing again. 
> Fix flaky test > GridCacheRebalancingWithAsyncClearingTest#testCorrectRebalancingCurrentlyRentingPartitions > - > > Key: IGNITE-11867 > URL: https://issues.apache.org/jira/browse/IGNITE-11867 > Project: Ignite > Issue Type: Bug >Reporter: Alexei Scherbakov >Assignee: Alexei Scherbakov >Priority: Major > Fix For: 2.8 > > Time Spent: 10m > Remaining Estimate: 0h > > {noformat} > java.lang.AssertionError: Value for 4 is null > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.assertTrue(Assert.java:41) > at org.junit.Assert.assertNotNull(Assert.java:621) > at > org.apache.ignite.internal.processors.cache.distributed.rebalancing.GridCacheRebalancingWithAsyncClearingTest.testCorrectRebalancingCurrentlyRentingPartitions(GridCacheRebalancingWithAsyncClearingTest.java:280) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at >
[jira] [Updated] (IGNITE-11867) Fix flaky test GridCacheRebalancingWithAsyncClearingTest#testCorrectRebalancingCurrentlyRentingPartitions
[ https://issues.apache.org/jira/browse/IGNITE-11867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexei Scherbakov updated IGNITE-11867: --- Description: {noformat} java.lang.AssertionError: Value for 4 is null at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertNotNull(Assert.java:621) at org.apache.ignite.internal.processors.cache.distributed.rebalancing.GridCacheRebalancingWithAsyncClearingTest.testCorrectRebalancingCurrentlyRentingPartitions(GridCacheRebalancingWithAsyncClearingTest.java:280) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) at org.apache.ignite.testframework.junits.GridAbstractTest$6.run(GridAbstractTest.java:2148) at java.lang.Thread.run(Thread.java:748){noformat} EDIT: The issue is related to recently contributed IGNITE-10078. In specific scenario due to race partition clearing could be started while partition is passing through ongoing rebalancing started on previous topology version. I fixed it by preventing partition clearing on newer topology versions. In such case if rebalance will be finished and partition will go in OWNING state further clearing is not needed any more, otherwise partition will be scheduled for clearing again. 
was: {noformat} java.lang.AssertionError: Value for 4 is null at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertNotNull(Assert.java:621) at org.apache.ignite.internal.processors.cache.distributed.rebalancing.GridCacheRebalancingWithAsyncClearingTest.testCorrectRebalancingCurrentlyRentingPartitions(GridCacheRebalancingWithAsyncClearingTest.java:280) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) at org.apache.ignite.testframework.junits.GridAbstractTest$6.run(GridAbstractTest.java:2148) at java.lang.Thread.run(Thread.java:748){noformat} > Fix flaky test > GridCacheRebalancingWithAsyncClearingTest#testCorrectRebalancingCurrentlyRentingPartitions > - > > Key: IGNITE-11867 > URL: https://issues.apache.org/jira/browse/IGNITE-11867 > Project: Ignite > Issue Type: Bug >Reporter: Alexei Scherbakov >Assignee: Alexei Scherbakov >Priority: Major > Fix For: 2.8 > > > {noformat} > java.lang.AssertionError: Value for 4 is null > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.assertTrue(Assert.java:41) > at org.junit.Assert.assertNotNull(Assert.java:621) > at > 
org.apache.ignite.internal.processors.cache.distributed.rebalancing.GridCacheRebalancingWithAsyncClearingTest.testCorrectRebalancingCurrentlyRentingPartitions(GridCacheRebalancingWithAsyncClearingTest.java:280) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) > at
[jira] [Commented] (IGNITE-11256) Implement read-only mode for grid
[ https://issues.apache.org/jira/browse/IGNITE-11256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16847688#comment-16847688 ] Alexei Scherbakov commented on IGNITE-11256: [~antonovsergey93] I left some minor comments under the PR. > Implement read-only mode for grid > - > > Key: IGNITE-11256 > URL: https://issues.apache.org/jira/browse/IGNITE-11256 > Project: Ignite > Issue Type: Improvement >Reporter: Alexei Scherbakov >Assignee: Sergey Antonov >Priority: Major > Fix For: 2.8 > > Time Spent: 10m > Remaining Estimate: 0h > > Should be triggered from the control.sh utility. > Useful for maintenance work, for example checking partition consistency > (idle_verify). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (IGNITE-11867) Fix flaky test GridCacheRebalancingWithAsyncClearingTest#testCorrectRebalancingCurrentlyRentingPartitions
[ https://issues.apache.org/jira/browse/IGNITE-11867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexei Scherbakov reassigned IGNITE-11867: -- Assignee: Alexei Scherbakov > Fix flaky test > GridCacheRebalancingWithAsyncClearingTest#testCorrectRebalancingCurrentlyRentingPartitions > - > > Key: IGNITE-11867 > URL: https://issues.apache.org/jira/browse/IGNITE-11867 > Project: Ignite > Issue Type: Bug >Reporter: Alexei Scherbakov >Assignee: Alexei Scherbakov >Priority: Major > Fix For: 2.8 > > > {noformat} > java.lang.AssertionError: Value for 4 is null > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.assertTrue(Assert.java:41) > at org.junit.Assert.assertNotNull(Assert.java:621) > at > org.apache.ignite.internal.processors.cache.distributed.rebalancing.GridCacheRebalancingWithAsyncClearingTest.testCorrectRebalancingCurrentlyRentingPartitions(GridCacheRebalancingWithAsyncClearingTest.java:280) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) > at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) > at > org.apache.ignite.testframework.junits.GridAbstractTest$6.run(GridAbstractTest.java:2148) > at java.lang.Thread.run(Thread.java:748){noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (IGNITE-11867) Fix flaky test GridCacheRebalancingWithAsyncClearingTest#testCorrectRebalancingCurrentlyRentingPartitions
[ https://issues.apache.org/jira/browse/IGNITE-11867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexei Scherbakov updated IGNITE-11867: --- Description: {noformat} java.lang.AssertionError: Value for 4 is null at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertNotNull(Assert.java:621) at org.apache.ignite.internal.processors.cache.distributed.rebalancing.GridCacheRebalancingWithAsyncClearingTest.testCorrectRebalancingCurrentlyRentingPartitions(GridCacheRebalancingWithAsyncClearingTest.java:280) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) at org.apache.ignite.testframework.junits.GridAbstractTest$6.run(GridAbstractTest.java:2148) at java.lang.Thread.run(Thread.java:748){noformat} > Fix flaky test > GridCacheRebalancingWithAsyncClearingTest#testCorrectRebalancingCurrentlyRentingPartitions > - > > Key: IGNITE-11867 > URL: https://issues.apache.org/jira/browse/IGNITE-11867 > Project: Ignite > Issue Type: Bug >Reporter: Alexei Scherbakov >Priority: Major > Fix For: 2.8 > > > {noformat} > java.lang.AssertionError: Value for 4 is null > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.assertTrue(Assert.java:41) > at org.junit.Assert.assertNotNull(Assert.java:621) > at > 
org.apache.ignite.internal.processors.cache.distributed.rebalancing.GridCacheRebalancingWithAsyncClearingTest.testCorrectRebalancingCurrentlyRentingPartitions(GridCacheRebalancingWithAsyncClearingTest.java:280) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) > at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) > at > org.apache.ignite.testframework.junits.GridAbstractTest$6.run(GridAbstractTest.java:2148) > at java.lang.Thread.run(Thread.java:748){noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (IGNITE-11867) Fix flaky test GridCacheRebalancingWithAsyncClearingTest#testCorrectRebalancingCurrentlyRentingPartitions
Alexei Scherbakov created IGNITE-11867: -- Summary: Fix flaky test GridCacheRebalancingWithAsyncClearingTest#testCorrectRebalancingCurrentlyRentingPartitions Key: IGNITE-11867 URL: https://issues.apache.org/jira/browse/IGNITE-11867 Project: Ignite Issue Type: Bug Reporter: Alexei Scherbakov Fix For: 2.8 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (IGNITE-10078) Node failure during concurrent partition updates may cause partition desync between primary and backup.
[ https://issues.apache.org/jira/browse/IGNITE-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16845931#comment-16845931 ] Alexei Scherbakov edited comment on IGNITE-10078 at 5/22/19 2:42 PM: - IgniteCache150ClientsTest from Cache 6 also often timed out in master [~ivan.glukos] All comments are fixed, ready for merge. was (Author: ascherbakov): _IgniteCache150ClientsTest from Cache 6 also often timed out in master_ [~ivan.glukos] All comments are fixed, ready for merge. > Node failure during concurrent partition updates may cause partition desync > between primary and backup. > --- > > Key: IGNITE-10078 > URL: https://issues.apache.org/jira/browse/IGNITE-10078 > Project: Ignite > Issue Type: Bug >Reporter: Alexei Scherbakov >Assignee: Alexei Scherbakov >Priority: Major > Fix For: 2.8 > > Time Spent: 2h 50m > Remaining Estimate: 0h > > This is possible if some updates are not written to WAL before node failure. > They will not be applied by rebalancing because the partition counters are the same, in the > following scenario: > 1. Start a grid with 3 nodes, 2 backups. > 2. Preload some data to partition P. > 3. Start two concurrent transactions, each writing a single key to the same partition > P; the keys are different: > {noformat} > try(Transaction tx = client.transactions().txStart(PESSIMISTIC, > REPEATABLE_READ, 0, 1)) { > client.cache(DEFAULT_CACHE_NAME).put(k, v); > tx.commit(); > } > {noformat} > 4. Order the updates on the backup so that the update with the greater partition counter > is written to WAL, while the update with the lesser partition counter fails because the failure handler (FH) is > triggered before it's added to WAL. > 5. Return the failed node to the grid and observe that no rebalancing happens due to the identical partition > counters. > Possible solution: detect gaps in update counters on recovery and, if detected, force > rebalance from a node without gaps. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (IGNITE-10078) Node failure during concurrent partition updates may cause partition desync between primary and backup.
[ https://issues.apache.org/jira/browse/IGNITE-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16845931#comment-16845931 ] Alexei Scherbakov commented on IGNITE-10078: _IgniteCache150ClientsTest from Cache 6 also often timed out in master_ [~ivan.glukos] All comments are fixed, ready for merge. > Node failure during concurrent partition updates may cause partition desync > between primary and backup. > --- > > Key: IGNITE-10078 > URL: https://issues.apache.org/jira/browse/IGNITE-10078 > Project: Ignite > Issue Type: Bug >Reporter: Alexei Scherbakov >Assignee: Alexei Scherbakov >Priority: Major > Fix For: 2.8 > > Time Spent: 2h 50m > Remaining Estimate: 0h > > This is possible if some updates are not written to WAL before node failure. > They will not be applied by rebalancing because the partition counters are the same, in the > following scenario: > 1. Start a grid with 3 nodes, 2 backups. > 2. Preload some data to partition P. > 3. Start two concurrent transactions, each writing a single key to the same partition > P; the keys are different: > {noformat} > try(Transaction tx = client.transactions().txStart(PESSIMISTIC, > REPEATABLE_READ, 0, 1)) { > client.cache(DEFAULT_CACHE_NAME).put(k, v); > tx.commit(); > } > {noformat} > 4. Order the updates on the backup so that the update with the greater partition counter > is written to WAL, while the update with the lesser partition counter fails because the failure handler (FH) is > triggered before it's added to WAL. > 5. Return the failed node to the grid and observe that no rebalancing happens due to the identical partition > counters. > Possible solution: detect gaps in update counters on recovery and, if detected, force > rebalance from a node without gaps. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (IGNITE-11862) Cache stopping on supplier during rebalance causes NPE and supplying failure.
[ https://issues.apache.org/jira/browse/IGNITE-11862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexei Scherbakov updated IGNITE-11862: --- Fix Version/s: 2.8 > Cache stopping on supplier during rebalance causes NPE and supplying failure. > - > > Key: IGNITE-11862 > URL: https://issues.apache.org/jira/browse/IGNITE-11862 > Project: Ignite > Issue Type: Bug >Reporter: Alexei Scherbakov >Priority: Major > Fix For: 2.8 > > > {noformat} > [21:12:14]W: [org.apache.ignite:ignite-core] [2019-05-20 > 21:12:14,376][ERROR][sys-#60310%distributed.CacheParallelStartTest0%][GridDhtPartitionSupplier] > Failed to continue supplying [grp=static-cache-group45, > demander=ed1c0109-8721-4cd8-80d9-d36e8251, top > Ver=AffinityTopologyVersion [topVer=2, minorTopVer=0], topic=0] > [21:12:14]W: [org.apache.ignite:ignite-core] java.lang.NullPointerException > [21:12:14]W: [org.apache.ignite:ignite-core] at > org.apache.ignite.internal.processors.cache.CacheGroupContext.addRebalanceSupplyEvent(CacheGroupContext.java:525) > [21:12:14]W: [org.apache.ignite:ignite-core] at > org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionSupplier.handleDemandMessage(GridDhtPartitionSupplier.java:422) > [21:12:14]W: [org.apache.ignite:ignite-core] at > org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader.handleDemandMessage(GridDhtPreloader.java:397) > [21:12:14]W: [org.apache.ignite:ignite-core] at > org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:455) > [21:12:14]W: [org.apache.ignite:ignite-core] at > org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:440) > [21:12:14]W: [org.apache.ignite:ignite-core] at > org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1141) > [21:12:14]W: [org.apache.ignite:ignite-core] at > 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:591) > [21:12:14]W: [org.apache.ignite:ignite-core] at > org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$800(GridCacheIoManager.java:109) > [21:12:14]W: [org.apache.ignite:ignite-core] at > org.apache.ignite.internal.processors.cache.GridCacheIoManager$OrderedMessageListener.onMessage(GridCacheIoManager.java:1706) > [21:12:14]W: [org.apache.ignite:ignite-core] at > org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1566) > [21:12:14]W: [org.apache.ignite:ignite-core] at > org.apache.ignite.internal.managers.communication.GridIoManager.access$4200(GridIoManager.java:129) > [21:12:14]W: [org.apache.ignite:ignite-core] at > org.apache.ignite.internal.managers.communication.GridIoManager$GridCommunicationMessageSet.unwind(GridIoManager.java:2795) > [21:12:14]W: [org.apache.ignite:ignite-core] at > org.apache.ignite.internal.managers.communication.GridIoManager.unwindMessageSet(GridIoManager.java:1523) > [21:12:14]W: [org.apache.ignite:ignite-core] at > org.apache.ignite.internal.managers.communication.GridIoManager.access$4500(GridIoManager.java:129) > [21:12:14]W: [org.apache.ignite:ignite-core] at > org.apache.ignite.internal.managers.communication.GridIoManager$9.run(GridIoManager.java:1492) > [21:12:14]W: [org.apache.ignite:ignite-core] at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > [21:12:14]W: [org.apache.ignite:ignite-core] at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > [21:12:14]W: [org.apache.ignite:ignite-core] at > java.lang.Thread.run(Thread.java:748) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (IGNITE-11862) Cache stopping on supplier during rebalance causes NPE and supplying failure.
Alexei Scherbakov created IGNITE-11862: -- Summary: Cache stopping on supplier during rebalance causes NPE and supplying failure. Key: IGNITE-11862 URL: https://issues.apache.org/jira/browse/IGNITE-11862 Project: Ignite Issue Type: Bug Reporter: Alexei Scherbakov {noformat} [21:12:14]W: [org.apache.ignite:ignite-core] [2019-05-20 21:12:14,376][ERROR][sys-#60310%distributed.CacheParallelStartTest0%][GridDhtPartitionSupplier] Failed to continue supplying [grp=static-cache-group45, demander=ed1c0109-8721-4cd8-80d9-d36e8251, top Ver=AffinityTopologyVersion [topVer=2, minorTopVer=0], topic=0] [21:12:14]W: [org.apache.ignite:ignite-core] java.lang.NullPointerException [21:12:14]W: [org.apache.ignite:ignite-core] at org.apache.ignite.internal.processors.cache.CacheGroupContext.addRebalanceSupplyEvent(CacheGroupContext.java:525) [21:12:14]W: [org.apache.ignite:ignite-core] at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionSupplier.handleDemandMessage(GridDhtPartitionSupplier.java:422) [21:12:14]W: [org.apache.ignite:ignite-core] at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader.handleDemandMessage(GridDhtPreloader.java:397) [21:12:14]W: [org.apache.ignite:ignite-core] at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:455) [21:12:14]W: [org.apache.ignite:ignite-core] at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:440) [21:12:14]W: [org.apache.ignite:ignite-core] at org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1141) [21:12:14]W: [org.apache.ignite:ignite-core] at org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:591) [21:12:14]W: [org.apache.ignite:ignite-core] at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$800(GridCacheIoManager.java:109) [21:12:14]W: [org.apache.ignite:ignite-core] at org.apache.ignite.internal.processors.cache.GridCacheIoManager$OrderedMessageListener.onMessage(GridCacheIoManager.java:1706) [21:12:14]W: [org.apache.ignite:ignite-core] at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1566) [21:12:14]W: [org.apache.ignite:ignite-core] at org.apache.ignite.internal.managers.communication.GridIoManager.access$4200(GridIoManager.java:129) [21:12:14]W: [org.apache.ignite:ignite-core] at org.apache.ignite.internal.managers.communication.GridIoManager$GridCommunicationMessageSet.unwind(GridIoManager.java:2795) [21:12:14]W: [org.apache.ignite:ignite-core] at org.apache.ignite.internal.managers.communication.GridIoManager.unwindMessageSet(GridIoManager.java:1523) [21:12:14]W: [org.apache.ignite:ignite-core] at org.apache.ignite.internal.managers.communication.GridIoManager.access$4500(GridIoManager.java:129) [21:12:14]W: [org.apache.ignite:ignite-core] at org.apache.ignite.internal.managers.communication.GridIoManager$9.run(GridIoManager.java:1492) [21:12:14]W: [org.apache.ignite:ignite-core] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [21:12:14]W: [org.apache.ignite:ignite-core] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [21:12:14]W: [org.apache.ignite:ignite-core] at java.lang.Thread.run(Thread.java:748) {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
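The NPE above is thrown from CacheGroupContext.addRebalanceSupplyEvent when the cache group is stopped concurrently with an incoming demand message. The guard pattern such a fix might use can be modeled in plain Java; all names below (CacheGroupModel, SupplierGuardSketch, handleDemand) are hypothetical illustrations, not Ignite's real internal API:

```java
/** Hypothetical stand-in for a cache group that may be stopped concurrently. */
class CacheGroupModel {
    private final boolean stopped;
    private int supplyEvents;

    CacheGroupModel(boolean stopped) { this.stopped = stopped; }

    boolean stopped() { return stopped; }

    void addRebalanceSupplyEvent() { supplyEvents++; }

    int supplyEvents() { return supplyEvents; }
}

class SupplierGuardSketch {
    /** Returns true if the demand message was supplied, false if it was skipped. */
    static boolean handleDemand(CacheGroupModel grp) {
        // Guard: the cache group may already be stopped (or unresolvable) when
        // the demand message arrives; without this check the event-recording
        // call dereferences stopped state and throws an NPE.
        if (grp == null || grp.stopped())
            return false;

        grp.addRebalanceSupplyEvent();

        return true;
    }
}
```

The point of the sketch is only that the supplier should skip (not crash on) demand messages for groups that no longer exist.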
[jira] [Updated] (IGNITE-11791) Fix IgnitePdsContinuousRestartTestWithExpiryPolicy
[ https://issues.apache.org/jira/browse/IGNITE-11791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexei Scherbakov updated IGNITE-11791: --- Description: Test reproduces partition counter validation errors (but passes nevertheless). (was: Test reproduces partition counter validation errors.) > Fix IgnitePdsContinuousRestartTestWithExpiryPolicy > --- > > Key: IGNITE-11791 > URL: https://issues.apache.org/jira/browse/IGNITE-11791 > Project: Ignite > Issue Type: Improvement >Reporter: Alexei Scherbakov >Priority: Major > > Test reproduces partition counter validation errors (but passes nevertheless). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (IGNITE-11820) Add persistence to IgniteCacheGroupTest
[ https://issues.apache.org/jira/browse/IGNITE-11820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexei Scherbakov updated IGNITE-11820: --- Summary: Add persistence to IgniteCacheGroupTest (was: Add partition consistency tests for multiple caches in group.) > Add persistence to IgniteCacheGroupTest > --- > > Key: IGNITE-11820 > URL: https://issues.apache.org/jira/browse/IGNITE-11820 > Project: Ignite > Issue Type: Improvement >Reporter: Alexei Scherbakov >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (IGNITE-11857) Investigate performance drop after IGNITE-10078
[ https://issues.apache.org/jira/browse/IGNITE-11857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexei Scherbakov updated IGNITE-11857: --- Description: After IGNITE-10078, yardstick tests show a performance drop of up to 8% in some scenarios: * tx-optim-repRead-put-get * tx-optimistic-put * tx-putAll This is partially due to the new update counter implementation, but not only that; investigation is required. was: After IGNITE-1078, yardstick tests show a performance drop of up to 8% in some scenarios: * tx-optim-repRead-put-get * tx-optimistic-put * tx-putAll This is partially due to the new update counter implementation, but not only that; investigation is required. > Investigate performance drop after IGNITE-10078 > --- > > Key: IGNITE-11857 > URL: https://issues.apache.org/jira/browse/IGNITE-11857 > Project: Ignite > Issue Type: Improvement >Reporter: Alexei Scherbakov >Priority: Major > > After IGNITE-10078, yardstick tests show a performance drop of up to 8% in some > scenarios: > * tx-optim-repRead-put-get > * tx-optimistic-put > * tx-putAll > This is partially due to the new update counter implementation, but not only > that; investigation is required. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (IGNITE-11857) Investigate performance drop after IGNITE-10078
Alexei Scherbakov created IGNITE-11857: -- Summary: Investigate performance drop after IGNITE-10078 Key: IGNITE-11857 URL: https://issues.apache.org/jira/browse/IGNITE-11857 Project: Ignite Issue Type: Improvement Reporter: Alexei Scherbakov After IGNITE-1078, yardstick tests show a performance drop of up to 8% in some scenarios: * tx-optim-repRead-put-get * tx-optimistic-put * tx-putAll This is partially due to the new update counter implementation, but not only that; investigation is required. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (IGNITE-10078) Node failure during concurrent partition updates may cause partition desync between primary and backup.
[ https://issues.apache.org/jira/browse/IGNITE-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16842230#comment-16842230 ] Alexei Scherbakov commented on IGNITE-10078: [~ivan.glukos], please do a final review. > Node failure during concurrent partition updates may cause partition desync > between primary and backup. > --- > > Key: IGNITE-10078 > URL: https://issues.apache.org/jira/browse/IGNITE-10078 > Project: Ignite > Issue Type: Bug >Reporter: Alexei Scherbakov >Assignee: Alexei Scherbakov >Priority: Major > Fix For: 2.8 > > Time Spent: 2.5h > Remaining Estimate: 0h > > This is possible if some updates are not written to WAL before node failure. > They will not be applied by rebalancing because the partition counters are the same, in the > following scenario: > 1. Start a grid with 3 nodes, 2 backups. > 2. Preload some data to partition P. > 3. Start two concurrent transactions, each writing a single key to the same partition > P; the keys are different: > {noformat} > try(Transaction tx = client.transactions().txStart(PESSIMISTIC, > REPEATABLE_READ, 0, 1)) { > client.cache(DEFAULT_CACHE_NAME).put(k, v); > tx.commit(); > } > {noformat} > 4. Order the updates on the backup so that the update with the greater partition counter > is written to WAL, while the update with the lesser partition counter fails because the failure handler (FH) is > triggered before it's added to WAL. > 5. Return the failed node to the grid and observe that no rebalancing happens due to the identical partition > counters. > Possible solution: detect gaps in update counters on recovery and, if detected, force > rebalance from a node without gaps. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (IGNITE-10078) Node failure during concurrent partition updates may cause partition desync between primary and backup.
[ https://issues.apache.org/jira/browse/IGNITE-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16842091#comment-16842091 ] Alexei Scherbakov commented on IGNITE-10078: The contribution seems to be ready for merging. > Node failure during concurrent partition updates may cause partition desync > between primary and backup. > --- > > Key: IGNITE-10078 > URL: https://issues.apache.org/jira/browse/IGNITE-10078 > Project: Ignite > Issue Type: Bug >Reporter: Alexei Scherbakov >Assignee: Alexei Scherbakov >Priority: Major > Fix For: 2.8 > > Time Spent: 2.5h > Remaining Estimate: 0h > > This is possible if some updates are not written to WAL before node failure. > They will not be applied by rebalancing because the partition counters are the same, in the > following scenario: > 1. Start a grid with 3 nodes, 2 backups. > 2. Preload some data to partition P. > 3. Start two concurrent transactions, each writing a single key to the same partition > P; the keys are different: > {noformat} > try(Transaction tx = client.transactions().txStart(PESSIMISTIC, > REPEATABLE_READ, 0, 1)) { > client.cache(DEFAULT_CACHE_NAME).put(k, v); > tx.commit(); > } > {noformat} > 4. Order the updates on the backup so that the update with the greater partition counter > is written to WAL, while the update with the lesser partition counter fails because the failure handler (FH) is > triggered before it's added to WAL. > 5. Return the failed node to the grid and observe that no rebalancing happens due to the identical partition > counters. > Possible solution: detect gaps in update counters on recovery and, if detected, force > rebalance from a node without gaps. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
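The "detect gaps in update counters on recovery" idea proposed in the issue above can be sketched in plain Java. This is an editorial illustration with hypothetical names (UpdateCounterGapDetector, onApplied, hasGaps), not Ignite's actual update-counter implementation; it assumes counters are assigned contiguously starting at 1:

```java
import java.util.TreeSet;

/**
 * Sketch of gap detection over partition update counters recovered from WAL.
 * A node whose counters have gaps lost updates before they reached WAL and
 * must be rebalanced from a node without gaps, even though the maximum
 * counters on both nodes match.
 */
class UpdateCounterGapDetector {
    /** Counters whose updates were found durably written in WAL. */
    private final TreeSet<Long> applied = new TreeSet<>();

    /** Records a counter recovered from WAL. */
    void onApplied(long counter) {
        applied.add(counter);
    }

    /** True if some counter below the highest applied one is missing. */
    boolean hasGaps() {
        if (applied.isEmpty())
            return false;

        // With unique counters 1..max, a gap exists iff fewer than max were applied.
        return applied.last() != applied.size();
    }
}
```

The sketch captures why comparing only the maximum counter (as in the reproduction scenario) misses lost updates: two nodes can agree on the maximum while one of them has holes below it.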
[jira] [Commented] (IGNITE-11256) Implement read-only mode for grid
[ https://issues.apache.org/jira/browse/IGNITE-11256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835577#comment-16835577 ] Alexei Scherbakov commented on IGNITE-11256: [~antonovsergey93] I reviewed your contribution. My comments: 1. No need to implement metrics aggregation for readOnlyMode and readOnlyModeDuration. They will be almost the same on all nodes. Better move them to IgniteMXBean and, in addition, implement a readOnly(boolean) method to allow read-only mode switching from JMX. See {{org.apache.ignite.mxbean.IgniteMXBean#active(boolean)}}. 2. It might be good to have a way to activate the grid in a read-only state. This could be achieved by adding a new configuration property like readOnlyAfterActivation and something like --activate read-only in control.sh. 3. Fix logging like: log("Cluster is active" + (readOnly ? " (read-only)" : "")); 4. Fix logging like: log("Read-only mode is " + (readOnly ? "enabled" : "disabled")); 5. Fix the message like: Failed to perform cache operation (cluster is in read-only mode) 6. U.hasCause is redundant and should be removed. We already have {{org.apache.ignite.internal.util.typedef.X#hasCause(java.lang.Throwable, java.lang.String, java.lang.Class...)}} 7. Documentation on the new public methods {{IgniteCluster.readOnly*}} could be improved. 8. You should create a ticket for the missing bindings in the .NET module. Otherwise looks good. [~tledkov-gridgain] could you review the SQL-related changes? > Implement read-only mode for grid > - > > Key: IGNITE-11256 > URL: https://issues.apache.org/jira/browse/IGNITE-11256 > Project: Ignite > Issue Type: Improvement >Reporter: Alexei Scherbakov >Assignee: Sergey Antonov >Priority: Major > Fix For: 2.8 > > Time Spent: 10m > Remaining Estimate: 0h > > Should be triggered from the control.sh utility. > Useful for maintenance work, for example checking partition consistency > (idle_verify) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
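The readOnly(boolean) switch from review point 1, combined with the log and error messages suggested in points 4 and 5, can be sketched as follows. This is a hedged illustration with invented names (ReadOnlyModeSwitch, checkWrite); it is not the actual IgniteMXBean implementation:

```java
/**
 * Minimal model of a JMX-style read-only toggle for the cluster, in the
 * spirit of IgniteMXBean#active(boolean). Write operations consult the flag
 * and fail fast with the error message proposed in the review.
 */
class ReadOnlyModeSwitch {
    /** Volatile so that concurrent cache operations see the latest state. */
    private volatile boolean readOnly;

    /** Toggles the mode and returns the log line in the suggested format. */
    String readOnly(boolean enable) {
        readOnly = enable;

        return "Read-only mode is " + (readOnly ? "enabled" : "disabled");
    }

    /** Guard called by write operations before mutating any cache. */
    void checkWrite() {
        if (readOnly)
            throw new IllegalStateException(
                "Failed to perform cache operation (cluster is in read-only mode)");
    }
}
```

A maintenance task such as idle_verify would flip the switch on, run its consistency checks while writes are rejected, then flip it back off.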