[jira] [Updated] (IGNITE-12325) GridCacheMapEntry reservation mechanism is broken with enabled cache store

2019-10-24 Thread Pavel Kovalenko (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-12325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Kovalenko updated IGNITE-12325:
-
Ignite Flags:   (was: Docs Required,Release Notes Required)

> GridCacheMapEntry reservation mechanism is broken with enabled cache store
> --
>
> Key: IGNITE-12325
> URL: https://issues.apache.org/jira/browse/IGNITE-12325
> Project: Ignite
>  Issue Type: Bug
>  Components: cache
>Affects Versions: 2.8
>Reporter: Pavel Kovalenko
>Assignee: Pavel Kovalenko
>Priority: Major
> Fix For: 2.8
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Entry deferred deletion was disabled for transactional caches in 
> https://issues.apache.org/jira/browse/IGNITE-11704. 
> However, if a cache store is enabled, there is a race between reserving a cache 
> entry after a transactional remove and clearing that reservation after a cache load:
> {noformat}
> java.lang.AssertionError: GridDhtCacheEntry [rdrs=ReaderId[] [ReaderId 
> [nodeId=96c87c98-2524-4f9e-8a2f-6cfceda5, msgId=22663371, txFut=null], 
> ReaderId [nodeId=68130805-0dc8-4ef4-abf7-7e7cde86, msgId=22663375, 
> txFut=null], ReaderId [nodeId=b4a8abce-8d0e-4459-b93a-a734ad64, 
> msgId=22663370, txFut=null]], part=8, super=GridDistributedCacheEntry 
> [super=GridCacheMapEntry [key=KeyCacheObjectImpl [part=8, val=8, 
> hasValBytes=true], val=null, ver=GridCacheVersion [topVer=0, order=0, 
> nodeOrder=0], hash=8, extras=null, flags=2]]]
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheMapEntry.clearReserveForLoad(GridCacheMapEntry.java:3616)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheAdapter.clearReservationsIfNeeded(GridCacheAdapter.java:2429)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheAdapter.access$400(GridCacheAdapter.java:179)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheAdapter$18.call(GridCacheAdapter.java:2309)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheAdapter$18.call(GridCacheAdapter.java:2217)
>   at 
> org.apache.ignite.internal.util.IgniteUtils.wrapThreadLoader(IgniteUtils.java:6963)
>   at 
> org.apache.ignite.internal.processors.closure.GridClosureProcessor$2.body(GridClosureProcessor.java:967)
>   at 
> org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
>   at java.base/java.lang.Thread.run(Thread.java:844)
> {noformat}
> The issue can be resolved by enabling deferred delete when a cache store 
> is configured.
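A minimal, illustrative sketch of the deferred-delete idea behind the proposed fix (class and member names are assumptions, not Ignite's actual GridCacheMapEntry code): instead of unmapping an entry right after a transactional remove, the entry is only marked deleted, so a concurrent clear of a load reservation still finds it.

{code:java}
// Illustrative sketch of deferred delete: the entry stays in the map, marked as deleted,
// so reservation/clear-reservation logic never races with physical removal of the entry.
final class DeferredDeleteEntrySketch {
    private boolean deleted;          // set on transactional remove instead of unmapping the entry
    private boolean reservedForLoad;  // set while a cache-store load is in progress

    synchronized void markDeleted() {
        deleted = true; // physical removal is deferred until no reservation can reference the entry
    }

    synchronized boolean reserveForLoad() {
        if (deleted)
            return false; // the load must not overwrite a removed value

        reservedForLoad = true;

        return true;
    }

    synchronized void clearReserveForLoad() {
        // With deferred delete the entry is still present here, so no assertion can fail.
        reservedForLoad = false;
    }
}
{code}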



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (IGNITE-12325) GridCacheMapEntry reservation mechanism is broken with enabled cache store

2019-10-24 Thread Pavel Kovalenko (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-12325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Kovalenko resolved IGNITE-12325.
--
Resolution: Fixed

Merged to master.

> GridCacheMapEntry reservation mechanism is broken with enabled cache store
> --
>
> Key: IGNITE-12325
> URL: https://issues.apache.org/jira/browse/IGNITE-12325
> Project: Ignite
>  Issue Type: Bug
>  Components: cache
>Affects Versions: 2.8
>Reporter: Pavel Kovalenko
>Assignee: Pavel Kovalenko
>Priority: Major
> Fix For: 2.8
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Entry deferred deletion was disabled for transactional caches in 
> https://issues.apache.org/jira/browse/IGNITE-11704. 
> However, if a cache store is enabled, there is a race between reserving a cache 
> entry after a transactional remove and clearing that reservation after a cache load:
> {noformat}
> java.lang.AssertionError: GridDhtCacheEntry [rdrs=ReaderId[] [ReaderId 
> [nodeId=96c87c98-2524-4f9e-8a2f-6cfceda5, msgId=22663371, txFut=null], 
> ReaderId [nodeId=68130805-0dc8-4ef4-abf7-7e7cde86, msgId=22663375, 
> txFut=null], ReaderId [nodeId=b4a8abce-8d0e-4459-b93a-a734ad64, 
> msgId=22663370, txFut=null]], part=8, super=GridDistributedCacheEntry 
> [super=GridCacheMapEntry [key=KeyCacheObjectImpl [part=8, val=8, 
> hasValBytes=true], val=null, ver=GridCacheVersion [topVer=0, order=0, 
> nodeOrder=0], hash=8, extras=null, flags=2]]]
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheMapEntry.clearReserveForLoad(GridCacheMapEntry.java:3616)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheAdapter.clearReservationsIfNeeded(GridCacheAdapter.java:2429)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheAdapter.access$400(GridCacheAdapter.java:179)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheAdapter$18.call(GridCacheAdapter.java:2309)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheAdapter$18.call(GridCacheAdapter.java:2217)
>   at 
> org.apache.ignite.internal.util.IgniteUtils.wrapThreadLoader(IgniteUtils.java:6963)
>   at 
> org.apache.ignite.internal.processors.closure.GridClosureProcessor$2.body(GridClosureProcessor.java:967)
>   at 
> org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
>   at java.base/java.lang.Thread.run(Thread.java:844)
> {noformat}
> The issue can be resolved by enabling deferred delete when a cache store 
> is configured.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (IGNITE-12325) GridCacheMapEntry reservation mechanism is broken with enabled cache store

2019-10-23 Thread Pavel Kovalenko (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-12325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Kovalenko reassigned IGNITE-12325:


Assignee: Pavel Kovalenko

> GridCacheMapEntry reservation mechanism is broken with enabled cache store
> --
>
> Key: IGNITE-12325
> URL: https://issues.apache.org/jira/browse/IGNITE-12325
> Project: Ignite
>  Issue Type: Bug
>  Components: cache
>Affects Versions: 2.8
>Reporter: Pavel Kovalenko
>Assignee: Pavel Kovalenko
>Priority: Major
> Fix For: 2.8
>
>
> Entry deferred deletion was disabled for transactional caches in 
> https://issues.apache.org/jira/browse/IGNITE-11704. 
> However, if a cache store is enabled, there is a race between reserving a cache 
> entry after a transactional remove and clearing that reservation after a cache load:
> {noformat}
> java.lang.AssertionError: GridDhtCacheEntry [rdrs=ReaderId[] [ReaderId 
> [nodeId=96c87c98-2524-4f9e-8a2f-6cfceda5, msgId=22663371, txFut=null], 
> ReaderId [nodeId=68130805-0dc8-4ef4-abf7-7e7cde86, msgId=22663375, 
> txFut=null], ReaderId [nodeId=b4a8abce-8d0e-4459-b93a-a734ad64, 
> msgId=22663370, txFut=null]], part=8, super=GridDistributedCacheEntry 
> [super=GridCacheMapEntry [key=KeyCacheObjectImpl [part=8, val=8, 
> hasValBytes=true], val=null, ver=GridCacheVersion [topVer=0, order=0, 
> nodeOrder=0], hash=8, extras=null, flags=2]]]
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheMapEntry.clearReserveForLoad(GridCacheMapEntry.java:3616)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheAdapter.clearReservationsIfNeeded(GridCacheAdapter.java:2429)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheAdapter.access$400(GridCacheAdapter.java:179)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheAdapter$18.call(GridCacheAdapter.java:2309)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheAdapter$18.call(GridCacheAdapter.java:2217)
>   at 
> org.apache.ignite.internal.util.IgniteUtils.wrapThreadLoader(IgniteUtils.java:6963)
>   at 
> org.apache.ignite.internal.processors.closure.GridClosureProcessor$2.body(GridClosureProcessor.java:967)
>   at 
> org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
>   at java.base/java.lang.Thread.run(Thread.java:844)
> {noformat}
> The issue can be resolved by enabling deferred delete when a cache store 
> is configured.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (IGNITE-12325) GridCacheMapEntry reservation mechanism is broken with enabled cache store

2019-10-23 Thread Pavel Kovalenko (Jira)
Pavel Kovalenko created IGNITE-12325:


 Summary: GridCacheMapEntry reservation mechanism is broken with 
enabled cache store
 Key: IGNITE-12325
 URL: https://issues.apache.org/jira/browse/IGNITE-12325
 Project: Ignite
  Issue Type: Bug
  Components: cache
Affects Versions: 2.8
Reporter: Pavel Kovalenko
 Fix For: 2.8


Entry deferred deletion was disabled for transactional caches in 
https://issues.apache.org/jira/browse/IGNITE-11704. 
However, if a cache store is enabled, there is a race between reserving a cache entry 
after a transactional remove and clearing that reservation after a cache load:

{noformat}
java.lang.AssertionError: GridDhtCacheEntry [rdrs=ReaderId[] [ReaderId 
[nodeId=96c87c98-2524-4f9e-8a2f-6cfceda5, msgId=22663371, txFut=null], 
ReaderId [nodeId=68130805-0dc8-4ef4-abf7-7e7cde86, msgId=22663375, 
txFut=null], ReaderId [nodeId=b4a8abce-8d0e-4459-b93a-a734ad64, 
msgId=22663370, txFut=null]], part=8, super=GridDistributedCacheEntry 
[super=GridCacheMapEntry [key=KeyCacheObjectImpl [part=8, val=8, 
hasValBytes=true], val=null, ver=GridCacheVersion [topVer=0, order=0, 
nodeOrder=0], hash=8, extras=null, flags=2]]]
at 
org.apache.ignite.internal.processors.cache.GridCacheMapEntry.clearReserveForLoad(GridCacheMapEntry.java:3616)
at 
org.apache.ignite.internal.processors.cache.GridCacheAdapter.clearReservationsIfNeeded(GridCacheAdapter.java:2429)
at 
org.apache.ignite.internal.processors.cache.GridCacheAdapter.access$400(GridCacheAdapter.java:179)
at 
org.apache.ignite.internal.processors.cache.GridCacheAdapter$18.call(GridCacheAdapter.java:2309)
at 
org.apache.ignite.internal.processors.cache.GridCacheAdapter$18.call(GridCacheAdapter.java:2217)
at 
org.apache.ignite.internal.util.IgniteUtils.wrapThreadLoader(IgniteUtils.java:6963)
at 
org.apache.ignite.internal.processors.closure.GridClosureProcessor$2.body(GridClosureProcessor.java:967)
at 
org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135)
at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:844)

{noformat}

The issue can be resolved by enabling deferred delete when a cache store 
is configured.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (IGNITE-12299) Store tombstone links into separate BPlus tree to avoid partition full-scan during tombstones remove

2019-10-17 Thread Pavel Kovalenko (Jira)
Pavel Kovalenko created IGNITE-12299:


 Summary: Store tombstone links into separate BPlus tree to avoid 
partition full-scan during tombstones remove
 Key: IGNITE-12299
 URL: https://issues.apache.org/jira/browse/IGNITE-12299
 Project: Ignite
  Issue Type: Improvement
  Components: cache
Affects Versions: 2.8
Reporter: Pavel Kovalenko
 Fix For: 2.9


Currently, we can't quickly identify which keys in a partition are tombstones. 
To collect tombstones we need to perform a full scan of the partition's BPlus tree. This can 
slow down node performance when rebalance has finished and tombstone cleanup is needed. 
We can introduce a separate BPlus tree inside the partition (similar to the TTL tree) where 
we store links to tombstone keys. When tombstone cleanup is needed, we can then 
scan only the subset of keys stored in this tree.
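A minimal sketch of the idea, using a plain sorted set in place of a per-partition BPlus tree (types and method names are illustrative; the real implementation would store page links inside the partition, not boxed longs):

{code:java}
import java.util.NavigableSet;
import java.util.concurrent.ConcurrentSkipListSet;
import java.util.function.LongConsumer;

// Illustrative only: a per-partition index of tombstone links kept next to the main tree,
// so cleanup can iterate tombstones directly instead of scanning the whole partition.
final class TombstoneIndexSketch {
    private final NavigableSet<Long> tombstoneLinks = new ConcurrentSkipListSet<>();

    void onTombstoneWritten(long link) {
        tombstoneLinks.add(link);
    }

    void onTombstoneCleared(long link) {
        tombstoneLinks.remove(link);
    }

    /** Cleanup visits only the tombstone subset instead of performing a full partition scan. */
    void cleanup(LongConsumer removeByLink) {
        for (Long link : tombstoneLinks) {
            removeByLink.accept(link);

            tombstoneLinks.remove(link);
        }
    }
}
{code}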



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (IGNITE-12298) Write tombstones on incomplete baseline to get rid of partition cleanup

2019-10-17 Thread Pavel Kovalenko (Jira)
Pavel Kovalenko created IGNITE-12298:


 Summary: Write tombstones on incomplete baseline to get rid of 
partition cleanup
 Key: IGNITE-12298
 URL: https://issues.apache.org/jira/browse/IGNITE-12298
 Project: Ignite
  Issue Type: Improvement
  Components: cache
Affects Versions: 2.8
Reporter: Pavel Kovalenko
 Fix For: 2.9


After tombstone objects are introduced in 
https://issues.apache.org/jira/browse/IGNITE-11704
we can write tombstones on OWNING nodes when the baseline is incomplete (some of 
the backup nodes have left). When the baseline becomes complete and the old nodes return, 
we can avoid cleaning up partitions on those nodes before rebalance: we can 
transfer the whole OWNING partition state, including tombstones, which will clear 
the data that was removed while a node was offline.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (IGNITE-11704) Write tombstones during rebalance to get rid of deferred delete buffer

2019-10-17 Thread Pavel Kovalenko (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-11704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Kovalenko updated IGNITE-11704:
-
Ignite Flags:   (was: Docs Required)

> Write tombstones during rebalance to get rid of deferred delete buffer
> --
>
> Key: IGNITE-11704
> URL: https://issues.apache.org/jira/browse/IGNITE-11704
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexey Goncharuk
>Assignee: Pavel Kovalenko
>Priority: Major
>  Labels: rebalance
> Fix For: 2.8
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently Ignite relies on the deferred delete buffer to handle 
> write-remove conflicts during rebalance. Given the limited size of the buffer, 
> this approach is fundamentally flawed, especially when persistence is 
> enabled.
> I suggest extending the data storage logic to be able to store key 
> tombstones, i.e. to keep the version of deleted entries. The tombstones will be 
> written while rebalance is in progress and should be cleaned up when rebalance 
> is completed.
> Later this approach may be used to implement fast partition rebalance based 
> on Merkle trees (in this case, tombstones should be written on an incomplete 
> baseline).
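A minimal sketch of how a tombstone resolves a write-remove conflict during rebalance, assuming a simplified versioned-entry model and Java 16+ records (names and types are illustrative, not Ignite's internal API):

{code:java}
// Illustrative conflict resolution: a tombstone keeps the remove version, so a stale
// rebalanced value with an older version is rejected instead of resurrecting the entry.
final class TombstoneConflictSketch {
    record Entry(byte[] value, long version, boolean tombstone) {}

    /** Creates a tombstone that remembers the delete version until rebalance completes. */
    static Entry tombstone(long removeVersion) {
        return new Entry(null, removeVersion, true);
    }

    /** Returns the entry that should win: the one carrying the newer version. */
    static Entry resolve(Entry local, Entry incoming) {
        if (local == null)
            return incoming;

        return incoming.version() > local.version() ? incoming : local;
    }
}
{code}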



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (IGNITE-10226) Partition may restore wrong MOVING state during crash recovery

2019-10-16 Thread Pavel Kovalenko (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16952767#comment-16952767
 ] 

Pavel Kovalenko commented on IGNITE-10226:
--

[~matijap] It looks like lost partition detection is not invoked during cluster 
activation, which is definitely a bug. I've filed the corresponding ticket 
https://issues.apache.org/jira/browse/IGNITE-12297
Yes, in a 1-node cluster with a partition in MOVING state, resetLostPartitions will 
not have an effect. However, you can return the partition to OWNING state with 
the following trick:
1) Start another node; this is a topology event that will trigger lost 
partition detection.
2) Stop the started node.
3) If your partition loss policy is != IGNORE, explicitly trigger 
`resetLostPartitions`.
This should return the partition to OWNING state. A minimal sketch of step 3 follows.
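A minimal sketch of step 3, assuming a hypothetical cache name "myCache" and client configuration path, using the public org.apache.ignite.Ignite API:

{code:java}
import java.util.Collections;

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;

public class ResetLostPartitionsExample {
    public static void main(String[] args) {
        // Connect to the running cluster; the configuration path is an assumption.
        try (Ignite ignite = Ignition.start("config/client.xml")) {
            // Reset LOST partitions of the affected cache back to normal (OWNING) state.
            ignite.resetLostPartitions(Collections.singleton("myCache"));
        }
    }
}
{code}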


> Partition may restore wrong MOVING state during crash recovery
> --
>
> Key: IGNITE-10226
> URL: https://issues.apache.org/jira/browse/IGNITE-10226
> Project: Ignite
>  Issue Type: Bug
>  Components: cache
>Affects Versions: 2.4
>Reporter: Pavel Kovalenko
>Assignee: Pavel Kovalenko
>Priority: Major
> Fix For: 2.8
>
>
> The way to reproduce it exists only in versions that don't have IGNITE-9420:
> 1) Start a cache, upload some data to partitions, forceCheckpoint.
> 2) Start uploading additional data. Kill the node. The node should be killed 
> skipping the last checkpoint, or during the checkpoint mark phase.
> 3) Restart the node. The crash recovery process for partitions starts. When we 
> create a partition during crash recovery (topology().forceCreatePartition()) we 
> log its initial state to WAL. If we have any logical update related to the 
> partition, we'll log a wrong MOVING state to the end of the current WAL. This state 
> will be considered the last valid one when we process PartitionMetaStateRecord 
> records during logical recovery. In the "restorePartitionsState" phase this 
> state will be chosen as final and the partition will change to MOVING, even 
> if in page memory it is OWNING or something else.
> To fix this problem in the 2.4 - 2.7 versions, the additional logging of the partition 
> state change to WAL during crash recovery (logical recovery) should be removed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (IGNITE-12297) Detect lost partitions is not happened during cluster activation

2019-10-16 Thread Pavel Kovalenko (Jira)
Pavel Kovalenko created IGNITE-12297:


 Summary: Detect lost partitions is not happened during cluster 
activation
 Key: IGNITE-12297
 URL: https://issues.apache.org/jira/browse/IGNITE-12297
 Project: Ignite
  Issue Type: Bug
  Components: cache
Affects Versions: 2.4
Reporter: Pavel Kovalenko
 Fix For: 2.8


We invoke `detectLostPartitions` during PME only when a server node joins or 
leaves.
However, we can activate a persistent cluster where a partition may have 
MOVING status on all nodes. In this case, the partition may stay in MOVING state 
forever until some other topology event happens. 





--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (IGNITE-10226) Partition may restore wrong MOVING state during crash recovery

2019-10-16 Thread Pavel Kovalenko (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16952716#comment-16952716
 ] 

Pavel Kovalenko commented on IGNITE-10226:
--

[~matijap] How did you identify that you hit the issue? If a partition is 
restored to MOVING state and the cache has a backup factor > 0, the 
partition will be automatically rebalanced from an existing owner. If there 
are no backups, it can be marked as LOST if your partition loss policy is != 
IGNORE. In the worst case, if you still have a LOST partition, you can reset its 
state to OWNING by calling:
{noformat}
org.apache.ignite.Ignite#resetLostPartitions
{noformat}


> Partition may restore wrong MOVING state during crash recovery
> --
>
> Key: IGNITE-10226
> URL: https://issues.apache.org/jira/browse/IGNITE-10226
> Project: Ignite
>  Issue Type: Bug
>  Components: cache
>Affects Versions: 2.4
>Reporter: Pavel Kovalenko
>Assignee: Pavel Kovalenko
>Priority: Major
> Fix For: 2.8
>
>
> The way to reproduce it exists only in versions that don't have IGNITE-9420:
> 1) Start a cache, upload some data to partitions, forceCheckpoint.
> 2) Start uploading additional data. Kill the node. The node should be killed 
> skipping the last checkpoint, or during the checkpoint mark phase.
> 3) Restart the node. The crash recovery process for partitions starts. When we 
> create a partition during crash recovery (topology().forceCreatePartition()) we 
> log its initial state to WAL. If we have any logical update related to the 
> partition, we'll log a wrong MOVING state to the end of the current WAL. This state 
> will be considered the last valid one when we process PartitionMetaStateRecord 
> records during logical recovery. In the "restorePartitionsState" phase this 
> state will be chosen as final and the partition will change to MOVING, even 
> if in page memory it is OWNING or something else.
> To fix this problem in the 2.4 - 2.7 versions, the additional logging of the partition 
> state change to WAL during crash recovery (logical recovery) should be removed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (IGNITE-10771) Print troubleshooting hint when exchange latch got stucked

2019-10-15 Thread Pavel Kovalenko (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-10771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Kovalenko resolved IGNITE-10771.
--
Resolution: Fixed

> Print troubleshooting hint when exchange latch got stucked
> --
>
> Key: IGNITE-10771
> URL: https://issues.apache.org/jira/browse/IGNITE-10771
> Project: Ignite
>  Issue Type: Improvement
>  Components: cache
>Affects Versions: 2.5
>Reporter: Pavel Kovalenko
>Assignee: Pavel Kovalenko
>Priority: Minor
>  Labels: usability
> Fix For: 2.8
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Sometimes users face a problem where the exchange latch can't be completed:
> {noformat}
> 2018-12-12 07:07:57:563 [exchange-worker-#42] WARN 
> o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture:488 - Unable to await 
> partitions release latch within timeout: ClientLatch 
> [coordinator=ZookeeperClusterNode [id=6b9fc6e4-5b6a-4a98-be4d-6bc1aa5c014c, 
> addrs=[172.17.0.1, 10.0.230.117, 0:0:0:0:0:0:0:1%lo, 127.0.0.1], order=3, 
> loc=false, client=false], ackSent=true, super=CompletableLatch [id=exchange, 
> topVer=AffinityTopologyVersion [topVer=45, minorTopVer=1]]] 
> {noformat}
> It may indicate that some node in the cluster can't finish partitions release 
> (i.e. finish all ongoing operations at the previous topology version), or it can be a 
> silent network problem.
> We should print a hint to the log on how to troubleshoot it, to reduce the number of 
> questions about this problem.
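A minimal sketch of the kind of hint that could be printed, assuming a hypothetical helper (names and wording are illustrative, not the actual implementation):

{code:java}
// Illustrative only: a troubleshooting hint appended to the
// "Unable to await partitions release latch" warning.
final class LatchHintSketch {
    static String hint(String latchInfo) {
        return "Unable to await partitions release latch within timeout: " + latchInfo +
            ". Possible reasons: a node cannot finish ongoing cache operations on the previous " +
            "topology version (check long-running transactions, atomic updates and locks), " +
            "or communication with the latch coordinator is silently broken " +
            "(check network connectivity and long GC pauses on the participating nodes).";
    }
}
{code}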



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (IGNITE-10771) Print troubleshooting hint when exchange latch got stucked

2019-10-15 Thread Pavel Kovalenko (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-10771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16952000#comment-16952000
 ] 

Pavel Kovalenko commented on IGNITE-10771:
--

Merged to master.

> Print troubleshooting hint when exchange latch got stucked
> --
>
> Key: IGNITE-10771
> URL: https://issues.apache.org/jira/browse/IGNITE-10771
> Project: Ignite
>  Issue Type: Improvement
>  Components: cache
>Affects Versions: 2.5
>Reporter: Pavel Kovalenko
>Assignee: Pavel Kovalenko
>Priority: Minor
>  Labels: usability
> Fix For: 2.8
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Sometimes users face a problem where the exchange latch can't be completed:
> {noformat}
> 2018-12-12 07:07:57:563 [exchange-worker-#42] WARN 
> o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture:488 - Unable to await 
> partitions release latch within timeout: ClientLatch 
> [coordinator=ZookeeperClusterNode [id=6b9fc6e4-5b6a-4a98-be4d-6bc1aa5c014c, 
> addrs=[172.17.0.1, 10.0.230.117, 0:0:0:0:0:0:0:1%lo, 127.0.0.1], order=3, 
> loc=false, client=false], ackSent=true, super=CompletableLatch [id=exchange, 
> topVer=AffinityTopologyVersion [topVer=45, minorTopVer=1]]] 
> {noformat}
> It may indicate that some node in the cluster can't finish partitions release 
> (i.e. finish all ongoing operations at the previous topology version), or it can be a 
> silent network problem.
> We should print a hint to the log on how to troubleshoot it, to reduce the number of 
> questions about this problem.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (IGNITE-10771) Print troubleshooting hint when exchange latch got stucked

2019-10-15 Thread Pavel Kovalenko (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-10771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Kovalenko updated IGNITE-10771:
-
Ignite Flags:   (was: Docs Required)

> Print troubleshooting hint when exchange latch got stucked
> --
>
> Key: IGNITE-10771
> URL: https://issues.apache.org/jira/browse/IGNITE-10771
> Project: Ignite
>  Issue Type: Improvement
>  Components: cache
>Affects Versions: 2.5
>Reporter: Pavel Kovalenko
>Assignee: Pavel Kovalenko
>Priority: Minor
>  Labels: usability
> Fix For: 2.8
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Sometimes users face a problem where the exchange latch can't be completed:
> {noformat}
> 2018-12-12 07:07:57:563 [exchange-worker-#42] WARN 
> o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture:488 - Unable to await 
> partitions release latch within timeout: ClientLatch 
> [coordinator=ZookeeperClusterNode [id=6b9fc6e4-5b6a-4a98-be4d-6bc1aa5c014c, 
> addrs=[172.17.0.1, 10.0.230.117, 0:0:0:0:0:0:0:1%lo, 127.0.0.1], order=3, 
> loc=false, client=false], ackSent=true, super=CompletableLatch [id=exchange, 
> topVer=AffinityTopologyVersion [topVer=45, minorTopVer=1]]] 
> {noformat}
> It may indicate that some node in the cluster can't finish partitions release 
> (i.e. finish all ongoing operations at the previous topology version), or it can be a 
> silent network problem.
> We should print a hint to the log on how to troubleshoot it, to reduce the number of 
> questions about this problem.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (IGNITE-11704) Write tombstones during rebalance to get rid of deferred delete buffer

2019-10-15 Thread Pavel Kovalenko (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-11704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16951983#comment-16951983
 ] 

Pavel Kovalenko commented on IGNITE-11704:
--

[~ascherbakov]
I've fixed several of your concerns; the others are commented on below:
1. Tombstone object storage has been optimized: 
any object (key or value) has a header (object length + object type). The object 
type can be regular, binary or byte array. In the previous version, a 
tombstone was a regular cache object with a marshalled "null" value. In the current 
version, I introduced a special object type, Tombstone, that doesn't store 
any value, only the header. All tombstone checks have been optimized accordingly.
2. I think it's fine. Every clear-tombstones task periodically checks whether the 
partition has left the OWNING state. In that case, the clear-tombstones 
operation is stopped. Yes, there can be a window of time where both tombstone 
clearing and eviction can happen, but it shouldn't be long.
3. DropCacheContextDuringEvictionTest is reworked to reuse the test 
PartitionsEvictManagerAbstractTest for checking tombstone failures.
cacheGroupMetricsRegistryName is added as a utility method as part of the cache 
group tombstone metrics.
GridCommandHandlerIndexingTest is a merge artifact and should be ignored.
4. I've added a comment explaining when this condition is true.
5. Such a test already exists 
(org.apache.ignite.internal.processors.cache.distributed.CacheRemoveWithTombstonesTest#testRemoveAndRebalanceRaceTx).
6. I've reworked the code, and now clearAll and clearTombstones share a common 
codebase.

Could you please review it again?
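A minimal sketch of the header-only tombstone encoding described in point 1, assuming a simplified binary layout (the type constant and layout are illustrative, not Ignite's actual on-disk format):

{code:java}
import java.nio.ByteBuffer;

// Illustrative encoding: length (4 bytes) + type (1 byte), with no value payload for tombstones.
final class TombstoneEncodingSketch {
    static final byte TYPE_REGULAR = 0;
    static final byte TYPE_BINARY = 1;
    static final byte TYPE_BYTE_ARR = 2;
    static final byte TYPE_TOMBSTONE = 3; // hypothetical constant for the new object type

    /** Writes a tombstone as a header-only object: no marshalled "null" value is stored. */
    static void writeTombstone(ByteBuffer buf) {
        buf.putInt(0);           // object length: a tombstone carries no value bytes
        buf.put(TYPE_TOMBSTONE); // object type header
    }

    /** The tombstone check becomes a header-byte comparison instead of unmarshalling a value. */
    static boolean isTombstone(ByteBuffer buf, int headerOffset) {
        return buf.get(headerOffset + 4) == TYPE_TOMBSTONE;
    }
}
{code}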

> Write tombstones during rebalance to get rid of deferred delete buffer
> --
>
> Key: IGNITE-11704
> URL: https://issues.apache.org/jira/browse/IGNITE-11704
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexey Goncharuk
>Assignee: Pavel Kovalenko
>Priority: Major
>  Labels: rebalance
> Fix For: 2.8
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently Ignite relies on the deferred delete buffer to handle 
> write-remove conflicts during rebalance. Given the limited size of the buffer, 
> this approach is fundamentally flawed, especially when persistence is 
> enabled.
> I suggest extending the data storage logic to be able to store key 
> tombstones, i.e. to keep the version of deleted entries. The tombstones will be 
> written while rebalance is in progress and should be cleaned up when rebalance 
> is completed.
> Later this approach may be used to implement fast partition rebalance based 
> on Merkle trees (in this case, tombstones should be written on an incomplete 
> baseline).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (IGNITE-11852) Assertion errors when changing PME coordinator to locally joining node

2019-10-14 Thread Pavel Kovalenko (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-11852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16951011#comment-16951011
 ] 

Pavel Kovalenko commented on IGNITE-11852:
--

[~mmuzaf] I've merged the latest master into the development branch. I will get a 
TeamCity visa and merge the changes soon.

> Assertion errors when changing PME coordinator to locally joining node
> --
>
> Key: IGNITE-11852
> URL: https://issues.apache.org/jira/browse/IGNITE-11852
> Project: Ignite
>  Issue Type: Bug
>  Components: cache
>Affects Versions: 2.5, 2.7
>Reporter: Pavel Kovalenko
>Assignee: Pavel Kovalenko
>Priority: Critical
> Fix For: 2.8
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When the PME coordinator changes to a locally joining node, several assertion 
> errors may occur:
> 1. When some other joining nodes have finished PME:
> {noformat}
> [13:49:58] (err) Failed to notify listener: 
> o.a.i.i.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$8$1$1...@27296181java.lang.AssertionError:
>  AffinityTopologyVersion [topVer=2, minorTopVer=0]
>   at 
> org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager$11.applyx(CacheAffinitySharedManager.java:1546)
>   at 
> org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager$11.applyx(CacheAffinitySharedManager.java:1535)
>   at 
> org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager.lambda$forAllRegisteredCacheGroups$e0a6939d$1(CacheAffinitySharedManager.java:1281)
>   at 
> org.apache.ignite.internal.util.IgniteUtils.doInParallel(IgniteUtils.java:10929)
>   at 
> org.apache.ignite.internal.util.IgniteUtils.doInParallel(IgniteUtils.java:10831)
>   at 
> org.apache.ignite.internal.util.IgniteUtils.doInParallel(IgniteUtils.java:10811)
>   at 
> org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager.forAllRegisteredCacheGroups(CacheAffinitySharedManager.java:1280)
>   at 
> org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager.onLocalJoin(CacheAffinitySharedManager.java:1535)
>   at 
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.processFullMessage(GridDhtPartitionsExchangeFuture.java:4189)
>   at 
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onBecomeCoordinator(GridDhtPartitionsExchangeFuture.java:4731)
>   at 
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.access$3400(GridDhtPartitionsExchangeFuture.java:145)
>   at 
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$8$1$1.apply(GridDhtPartitionsExchangeFuture.java:4622)
>   at 
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$8$1$1.apply(GridDhtPartitionsExchangeFuture.java:4611)
>   at 
> org.apache.ignite.internal.util.future.GridFutureAdapter.notifyListener(GridFutureAdapter.java:398)
>   at 
> org.apache.ignite.internal.util.future.GridFutureAdapter.unblock(GridFutureAdapter.java:346)
>   at 
> org.apache.ignite.internal.util.future.GridFutureAdapter.unblockAll(GridFutureAdapter.java:334)
>   at 
> org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:510)
>   at 
> org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:489)
>   at 
> org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:466)
>   at 
> org.apache.ignite.internal.util.future.GridCompoundFuture.checkComplete(GridCompoundFuture.java:281)
>   at 
> org.apache.ignite.internal.util.future.GridCompoundFuture.apply(GridCompoundFuture.java:143)
>   at 
> org.apache.ignite.internal.util.future.GridCompoundFuture.apply(GridCompoundFuture.java:44)
>   at 
> org.apache.ignite.internal.util.future.GridFutureAdapter.notifyListener(GridFutureAdapter.java:398)
>   at 
> org.apache.ignite.internal.util.future.GridFutureAdapter.unblock(GridFutureAdapter.java:346)
>   at 
> org.apache.ignite.internal.util.future.GridFutureAdapter.unblockAll(GridFutureAdapter.java:334)
>   at 
> org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:510)
>   at 
> org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:489)
>   at 
> org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:455)
>   at 
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.InitNewCoordinatorFuture.onMessage(InitNewCoordinatorFuture.java:253)
>   at 
> 

[jira] [Updated] (IGNITE-11852) Assertion errors when changing PME coordinator to locally joining node

2019-10-14 Thread Pavel Kovalenko (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-11852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Kovalenko updated IGNITE-11852:
-
Ignite Flags:   (was: Docs Required)

> Assertion errors when changing PME coordinator to locally joining node
> --
>
> Key: IGNITE-11852
> URL: https://issues.apache.org/jira/browse/IGNITE-11852
> Project: Ignite
>  Issue Type: Bug
>  Components: cache
>Affects Versions: 2.5, 2.7
>Reporter: Pavel Kovalenko
>Assignee: Pavel Kovalenko
>Priority: Critical
> Fix For: 2.8
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When the PME coordinator changes to a locally joining node, several assertion 
> errors may occur:
> 1. When some other joining nodes have finished PME:
> {noformat}
> [13:49:58] (err) Failed to notify listener: 
> o.a.i.i.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$8$1$1...@27296181java.lang.AssertionError:
>  AffinityTopologyVersion [topVer=2, minorTopVer=0]
>   at 
> org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager$11.applyx(CacheAffinitySharedManager.java:1546)
>   at 
> org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager$11.applyx(CacheAffinitySharedManager.java:1535)
>   at 
> org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager.lambda$forAllRegisteredCacheGroups$e0a6939d$1(CacheAffinitySharedManager.java:1281)
>   at 
> org.apache.ignite.internal.util.IgniteUtils.doInParallel(IgniteUtils.java:10929)
>   at 
> org.apache.ignite.internal.util.IgniteUtils.doInParallel(IgniteUtils.java:10831)
>   at 
> org.apache.ignite.internal.util.IgniteUtils.doInParallel(IgniteUtils.java:10811)
>   at 
> org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager.forAllRegisteredCacheGroups(CacheAffinitySharedManager.java:1280)
>   at 
> org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager.onLocalJoin(CacheAffinitySharedManager.java:1535)
>   at 
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.processFullMessage(GridDhtPartitionsExchangeFuture.java:4189)
>   at 
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onBecomeCoordinator(GridDhtPartitionsExchangeFuture.java:4731)
>   at 
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.access$3400(GridDhtPartitionsExchangeFuture.java:145)
>   at 
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$8$1$1.apply(GridDhtPartitionsExchangeFuture.java:4622)
>   at 
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$8$1$1.apply(GridDhtPartitionsExchangeFuture.java:4611)
>   at 
> org.apache.ignite.internal.util.future.GridFutureAdapter.notifyListener(GridFutureAdapter.java:398)
>   at 
> org.apache.ignite.internal.util.future.GridFutureAdapter.unblock(GridFutureAdapter.java:346)
>   at 
> org.apache.ignite.internal.util.future.GridFutureAdapter.unblockAll(GridFutureAdapter.java:334)
>   at 
> org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:510)
>   at 
> org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:489)
>   at 
> org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:466)
>   at 
> org.apache.ignite.internal.util.future.GridCompoundFuture.checkComplete(GridCompoundFuture.java:281)
>   at 
> org.apache.ignite.internal.util.future.GridCompoundFuture.apply(GridCompoundFuture.java:143)
>   at 
> org.apache.ignite.internal.util.future.GridCompoundFuture.apply(GridCompoundFuture.java:44)
>   at 
> org.apache.ignite.internal.util.future.GridFutureAdapter.notifyListener(GridFutureAdapter.java:398)
>   at 
> org.apache.ignite.internal.util.future.GridFutureAdapter.unblock(GridFutureAdapter.java:346)
>   at 
> org.apache.ignite.internal.util.future.GridFutureAdapter.unblockAll(GridFutureAdapter.java:334)
>   at 
> org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:510)
>   at 
> org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:489)
>   at 
> org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:455)
>   at 
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.InitNewCoordinatorFuture.onMessage(InitNewCoordinatorFuture.java:253)
>   at 
> 

[jira] [Updated] (IGNITE-10771) Print troubleshooting hint when exchange latch got stucked

2019-10-11 Thread Pavel Kovalenko (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-10771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Kovalenko updated IGNITE-10771:
-
Fix Version/s: 2.8

> Print troubleshooting hint when exchange latch got stucked
> --
>
> Key: IGNITE-10771
> URL: https://issues.apache.org/jira/browse/IGNITE-10771
> Project: Ignite
>  Issue Type: Improvement
>  Components: cache
>Affects Versions: 2.5
>Reporter: Pavel Kovalenko
>Assignee: Pavel Kovalenko
>Priority: Minor
>  Labels: usability
> Fix For: 2.8
>
>
> Sometimes users face a problem where the exchange latch can't be completed:
> {noformat}
> 2018-12-12 07:07:57:563 [exchange-worker-#42] WARN 
> o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture:488 - Unable to await 
> partitions release latch within timeout: ClientLatch 
> [coordinator=ZookeeperClusterNode [id=6b9fc6e4-5b6a-4a98-be4d-6bc1aa5c014c, 
> addrs=[172.17.0.1, 10.0.230.117, 0:0:0:0:0:0:0:1%lo, 127.0.0.1], order=3, 
> loc=false, client=false], ackSent=true, super=CompletableLatch [id=exchange, 
> topVer=AffinityTopologyVersion [topVer=45, minorTopVer=1]]] 
> {noformat}
> It may indicate that some node in the cluster can't finish partitions release 
> (i.e. finish all ongoing operations at the previous topology version), or it can be a 
> silent network problem.
> We should print a hint to the log on how to troubleshoot it, to reduce the number of 
> questions about this problem.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (IGNITE-10771) Print troubleshooting hint when exchange latch got stucked

2019-10-11 Thread Pavel Kovalenko (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-10771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Kovalenko reassigned IGNITE-10771:


Assignee: Pavel Kovalenko

> Print troubleshooting hint when exchange latch got stucked
> --
>
> Key: IGNITE-10771
> URL: https://issues.apache.org/jira/browse/IGNITE-10771
> Project: Ignite
>  Issue Type: Improvement
>  Components: cache
>Affects Versions: 2.5
>Reporter: Pavel Kovalenko
>Assignee: Pavel Kovalenko
>Priority: Minor
>  Labels: usability
>
> Sometimes users face a problem where the exchange latch can't be completed:
> {noformat}
> 2018-12-12 07:07:57:563 [exchange-worker-#42] WARN 
> o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture:488 - Unable to await 
> partitions release latch within timeout: ClientLatch 
> [coordinator=ZookeeperClusterNode [id=6b9fc6e4-5b6a-4a98-be4d-6bc1aa5c014c, 
> addrs=[172.17.0.1, 10.0.230.117, 0:0:0:0:0:0:0:1%lo, 127.0.0.1], order=3, 
> loc=false, client=false], ackSent=true, super=CompletableLatch [id=exchange, 
> topVer=AffinityTopologyVersion [topVer=45, minorTopVer=1]]] 
> {noformat}
> It may indicate that some node in the cluster can't finish partitions release 
> (i.e. finish all ongoing operations at the previous topology version), or it can be a 
> silent network problem.
> We should print a hint to the log on how to troubleshoot it, to reduce the number of 
> questions about this problem.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (IGNITE-5829) Writing entry contents to WAL only single time

2019-10-03 Thread Pavel Kovalenko (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-5829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16943697#comment-16943697
 ] 

Pavel Kovalenko commented on IGNITE-5829:
-

[~mmuzaf] The issue is still relevant. We should move it to the next release.

> Writing entry contents to WAL only single time
> --
>
> Key: IGNITE-5829
> URL: https://issues.apache.org/jira/browse/IGNITE-5829
> Project: Ignite
>  Issue Type: Improvement
>  Components: cache
>Affects Versions: 2.1
>Reporter: Pavel Kovalenko
>Assignee: Pavel Kovalenko
>Priority: Major
> Fix For: 2.8
>
>
> Currently we write entry contents twice: once in the logical record and once 
> again when we write data page update records. We should do that only once. In 
> data page update records we can write only a reference to the logical update record 
> instead of the whole entry contents.
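A minimal sketch of the proposed WAL layout, assuming a simplified record model and Java 16+ records (types and fields are illustrative): the entry payload is written once in the logical record, and the data page update record only references it by WAL pointer.

{code:java}
// Illustrative WAL record model: the payload lives only in the logical record.
record WalPointer(long segmentIdx, int fileOffset) {}

record LogicalDataRecord(byte[] key, byte[] value, long version) {}

// The page delta record stores a reference to the logical record instead of the entry contents.
record DataPageUpdateRecord(long pageId, int itemId, WalPointer logicalRecordPtr) {}
{code}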



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (IGNITE-10010) Node halted if second node was stopped, then cache destroyed, then second node returned

2019-10-03 Thread Pavel Kovalenko (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-10010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16943696#comment-16943696
 ] 

Pavel Kovalenko commented on IGNITE-10010:
--

[~mmuzaf] There is a fail-fast approach implemented in 
https://issues.apache.org/jira/browse/IGNITE-9562 (that ticket is similar to the 
current one). The changes disallow a node from joining if it has persisted caches 
that are not present in the cluster. I think we can decrease the priority to Critical at least.

> Node halted if second node was stopped, then cache destroyed, then second 
> node returned
> ---
>
> Key: IGNITE-10010
> URL: https://issues.apache.org/jira/browse/IGNITE-10010
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 2.7
>Reporter: Sergey Kozlov
>Assignee: Alexey Goncharuk
>Priority: Blocker
> Fix For: 2.8
>
> Attachments: PersistenceNodeRestartAfterCacheDropSelfTest.java, 
> ignite-gridparitition-nullpointer.zip
>
>
> 1. Start 2 nodes with PDS
> 2. Activate cluster
> 3. Connect sqlline.
> 4. Create table {{create table t1(a int, b varchar, primary key(a)) with 
> "ATOMICITY=TRANSACTIONAL_SNAPSHOT,backups=1";}}
> 5. Stop node 1
> 6. Drop table {{drop table t1;}}
> 7. Start node 1
> 8. Node 2 is stopped by the failure handler:
> {noformat}
> c:\Work\apache-ignite-2.7.0-SNAPSHOT-bin>bin\ignite.bat server.xml -v -J-DID=1
> Ignite Command Line Startup, ver. 2.7.0-SNAPSHOT#19700101-sha1:DEV
> 2018 Copyright(C) Apache Software Foundation
> [18:04:22,745][INFO][main][IgniteKernal]
> >>>__  
> >>>   /  _/ ___/ |/ /  _/_  __/ __/
> >>>  _/ // (7 7// /  / / / _/
> >>> /___/\___/_/|_/___/ /_/ /___/
> >>>
> >>> ver. 2.7.0-SNAPSHOT#19700101-sha1:DEV
> >>> 2018 Copyright(C) Apache Software Foundation
> >>>
> >>> Ignite documentation: http://ignite.apache.org
> [18:04:22,745][INFO][main][IgniteKernal] Config URL: 
> file:/c:/Work/apache-ignite-2.7.0-SNAPSHOT-bin/server.xml
> [18:04:22,760][INFO][main][IgniteKernal] IgniteConfiguration 
> [igniteInstanceName=null, pubPoolSize=8, svcPoolSize=8, cal
> lbackPoolSize=8, stripedPoolSize=8, sysPoolSize=8, mgmtPoolSize=4, 
> igfsPoolSize=8, dataStreamerPoolSize=8, utilityCacheP
> oolSize=8, utilityCacheKeepAliveTime=6, p2pPoolSize=2, qryPoolSize=8, 
> igniteHome=c:\Work\apache-ignite-2.7.0-SNAPSHO
> T-bin, igniteWorkDir=c:\Work\apache-ignite-2.7.0-SNAPSHOT-bin\work, 
> mbeanSrv=com.sun.jmx.mbeanserver.JmxMBeanServer@6f94
> fa3e, nodeId=d02069db-6d0b-4a40-b185-1d95fa330853, marsh=BinaryMarshaller [], 
> marshLocJobs=false, daemon=false, p2pEnabl
> ed=false, netTimeout=5000, sndRetryDelay=1000, sndRetryCnt=3, 
> metricsHistSize=1, metricsUpdateFreq=2000, metricsExpT
> ime=9223372036854775807, discoSpi=TcpDiscoverySpi [addrRslvr=null, 
> sockTimeout=0, ackTimeout=0, marsh=null, reconCnt=10,
>  reconDelay=2000, maxAckTimeout=60, forceSrvMode=false, 
> clientReconnectDisabled=false, internalLsnr=null], segPlc=ST
> OP, segResolveAttempts=2, waitForSegOnStart=true, allResolversPassReq=true, 
> segChkFreq=1, commSpi=TcpCommunicationSp
> i [connectGate=null, 
> connPlc=org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$FirstConnectionPolicy@22ff4249,
>  enableForcibleNodeKill=false, enableTroubleshootingLog=false, locAddr=null, 
> locHost=null, locPort=47100, locPortRange=1
> 00, shmemPort=-1, directBuf=true, directSndBuf=false, idleConnTimeout=60, 
> connTimeout=5000, maxConnTimeout=60, r
> econCnt=10, sockSndBuf=32768, sockRcvBuf=32768, msgQueueLimit=0, 
> slowClientQueueLimit=0, nioSrvr=null, shmemSrv=null, us
> ePairedConnections=false, connectionsPerNode=1, tcpNoDelay=true, 
> filterReachableAddresses=false, ackSndThreshold=32, una
> ckedMsgsBufSize=0, sockWriteTimeout=2000, boundTcpPort=-1, 
> boundTcpShmemPort=-1, selectorsCnt=4, selectorSpins=0, addrRs
> lvr=null, ctxInitLatch=java.util.concurrent.CountDownLatch@2d1ef81a[Count = 
> 1], stopping=false], evtSpi=org.apache.ignit
> e.spi.eventstorage.NoopEventStorageSpi@4c402120, colSpi=NoopCollisionSpi [], 
> deploySpi=LocalDeploymentSpi [], indexingSp
> i=org.apache.ignite.spi.indexing.noop.NoopIndexingSpi@815b41f, 
> addrRslvr=null, encryptionSpi=org.apache.ignite.spi.encry
> ption.noop.NoopEncryptionSpi@5542c4ed, clientMode=false, 
> rebalanceThreadPoolSize=1, txCfg=TransactionConfiguration [txSe
> rEnabled=false, dfltIsolation=REPEATABLE_READ, dfltConcurrency=PESSIMISTIC, 
> dfltTxTimeout=0, txTimeoutOnPartitionMapExch
> ange=0, pessimisticTxLogSize=0, pessimisticTxLogLinger=1, 
> tmLookupClsName=null, txManagerFactory=null, useJtaSync=fa
> lse], cacheSanityCheckEnabled=true, discoStartupDelay=6, 
> deployMode=SHARED, p2pMissedCacheSize=100, locHost=127.0.0.
> 1, timeSrvPortBase=31100, timeSrvPortRange=100, 
> failureDetectionTimeout=1, 

[jira] [Created] (IGNITE-12255) Cache affinity fetching and calculation on client nodes may be broken in some cases

2019-10-03 Thread Pavel Kovalenko (Jira)
Pavel Kovalenko created IGNITE-12255:


 Summary: Cache affinity fetching and calculation on client nodes 
may be broken in some cases
 Key: IGNITE-12255
 URL: https://issues.apache.org/jira/browse/IGNITE-12255
 Project: Ignite
  Issue Type: Bug
  Components: cache
Affects Versions: 2.7, 2.5
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
 Fix For: 2.8


We have a cluster with server and client nodes.
We dynamically start several caches in the cluster.
Periodically we create and destroy a temporary cache in the cluster to advance the 
cluster topology version.
At the same time, a random client node chooses a random existing cache and 
performs operations on that cache.
This leads to an exception on the client node saying that affinity is not initialized for a 
cache during a cache operation, like:
Affinity for topology version is not initialized [topVer = 8:10, head = 8:2]

This exception means that the last affinity for the cache was calculated on 
version [8,2], which is the cache start version. It happens because, while 
creating/destroying a temporary cache, we don't re-calculate affinity for caches that 
exist but have not yet been accessed on client nodes. Re-calculation in this 
case is cheap: we just copy the affinity assignment and increment the topology version 
(a minimal sketch is shown after this description).

As a solution, we need to fetch affinity for all caches on client node join. 
Also, we need to re-calculate affinity for all affinity holders (not only for 
started caches or only configured caches) for all topology events that happen 
in the cluster, on every client node.

This solution exposed an existing race between client node join and concurrent 
cache destroy.

The race is the following:

A client node (with some configured caches) joins the cluster, sending a 
SingleMessage to the coordinator during client PME. This SingleMessage contains 
affinity fetch requests for all cluster caches. While the SingleMessage is in flight, 
server nodes finish the client PME and also process and finish a cache destroy PME. 
When a cache is destroyed, the affinity for that cache is cleared. When the 
SingleMessage is delivered to the coordinator, the coordinator has no affinity for the 
requested cache because the cache is already destroyed. This leads to an assertion error on 
the coordinator and unpredictable behavior on the client node.

The race may be fixed with the following change:

If the coordinator doesn't have affinity for a cache requested by the client 
node, it doesn't break PME with an assertion error; it just doesn't send affinity for 
that cache to the client node. When the client node receives the FullMessage and sees 
that affinity for some requested cache doesn't exist, it closes the cache 
proxy for user interactions, which throws a CacheStopped exception for every 
attempt to use that cache. This is safe behavior because the cache destroy event 
should happen on the client node soon and destroy that cache completely.
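A minimal sketch of the "cheap re-calculation" mentioned above, assuming a simplified affinity model and Java 16+ records (names and types are illustrative, not Ignite's internal API):

{code:java}
import java.util.List;
import java.util.UUID;

// Illustrative model: on a client node, affinity for a not-yet-accessed cache is "re-calculated"
// for a new topology version by reusing the previous assignment as-is.
final class ClientAffinitySketch {
    record Assignment(long topVer, int minorTopVer, List<List<UUID>> partitionOwners) {}

    /** Copies the last known assignment to the new topology version without recomputation. */
    static Assignment advance(Assignment last, long newTopVer, int newMinorTopVer) {
        return new Assignment(newTopVer, newMinorTopVer, last.partitionOwners());
    }
}
{code}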



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (IGNITE-12230) Partition eviction during cache stop / deactivation may cause errors leading to node failure and storage corruption

2019-10-01 Thread Pavel Kovalenko (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-12230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Kovalenko updated IGNITE-12230:
-
Ignite Flags: Release Notes Required  (was: Docs Required,Release Notes 
Required)

> Partition eviction during cache stop / deactivation may cause errors leading 
> to node failure and storage corruption
> ---
>
> Key: IGNITE-12230
> URL: https://issues.apache.org/jira/browse/IGNITE-12230
> Project: Ignite
>  Issue Type: Bug
>  Components: cache
>Reporter: Stepachev Maksim
>Assignee: Stepachev Maksim
>Priority: Major
> Fix For: 2.8
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> PartitionEvictionTask may produce a NullPointerException if the cache / cache group 
> / cluster is stopping / deactivating.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (IGNITE-12230) Partition eviction during cache stop / deactivation may cause errors leading to node failure and storage corruption

2019-10-01 Thread Pavel Kovalenko (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-12230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Kovalenko updated IGNITE-12230:
-
Release Note: Fixed possible exceptions during simultaneous cache group 
stop and partition eviction

> Partition eviction during cache stop / deactivation may cause errors leading 
> to node failure and storage corruption
> ---
>
> Key: IGNITE-12230
> URL: https://issues.apache.org/jira/browse/IGNITE-12230
> Project: Ignite
>  Issue Type: Bug
>  Components: cache
>Reporter: Stepachev Maksim
>Assignee: Stepachev Maksim
>Priority: Major
> Fix For: 2.8
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> PartitionEvictionTask may produce a NullPointerException if the cache / cache group 
> / cluster is stopping / deactivating.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (IGNITE-12230) Partition eviction during cache stop / deactivation may cause errors leading to node failure and storage corruption

2019-10-01 Thread Pavel Kovalenko (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16942013#comment-16942013
 ] 

Pavel Kovalenko commented on IGNITE-12230:
--

[~mstepachev] Thank you for the contribution. Merged to master.

> Partition eviction during cache stop / deactivation may cause errors leading 
> to node failure and storage corruption
> ---
>
> Key: IGNITE-12230
> URL: https://issues.apache.org/jira/browse/IGNITE-12230
> Project: Ignite
>  Issue Type: Bug
>  Components: cache
>Reporter: Stepachev Maksim
>Assignee: Stepachev Maksim
>Priority: Major
> Fix For: 2.8
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> PartitionEvictionTask may produce a NullPointerException if the cache / cache group 
> / cluster is stopping / deactivating.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (IGNITE-12230) Partition eviction during cache stop / deactivation may cause errors leading to node failure and storage corruption

2019-10-01 Thread Pavel Kovalenko (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-12230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Kovalenko resolved IGNITE-12230.
--
Resolution: Fixed

> Partition eviction during cache stop / deactivation may cause errors leading 
> to node failure and storage corruption
> ---
>
> Key: IGNITE-12230
> URL: https://issues.apache.org/jira/browse/IGNITE-12230
> Project: Ignite
>  Issue Type: Bug
>  Components: cache
>Reporter: Stepachev Maksim
>Assignee: Stepachev Maksim
>Priority: Major
> Fix For: 2.8
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> PartitionEvictionTask may produce a NullPointerException if the cache / cache group 
> / cluster is stopping / deactivating.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (IGNITE-12197) Incorrect way for getting value of persistent enabled in CacheGroupMetricsImpl

2019-09-27 Thread Pavel Kovalenko (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939443#comment-16939443
 ] 

Pavel Kovalenko commented on IGNITE-12197:
--

[~agura] Looks good to me. Please proceed with the merge.

> Incorrect way for getting value of persistent enabled in CacheGroupMetricsImpl
> --
>
> Key: IGNITE-12197
> URL: https://issues.apache.org/jira/browse/IGNITE-12197
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Andrey N. Gura
>Assignee: Andrey N. Gura
>Priority: Minor
>  Labels: IEP-35
> Fix For: 2.8
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> IGNITE-12027 introduces a possible bug due to an incorrect way of getting the value 
> of the persistence-enabled property in {{CacheGroupMetricsImpl}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (IGNITE-11094) Add SSL support for ignite zookeeper SPI

2019-09-27 Thread Pavel Kovalenko (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-11094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939439#comment-16939439
 ] 

Pavel Kovalenko commented on IGNITE-11094:
--

[~SomeFire] Merged the dependency fix for the Kafka module.

> Add SSL support for ignite zookeeper SPI
> 
>
> Key: IGNITE-11094
> URL: https://issues.apache.org/jira/browse/IGNITE-11094
> Project: Ignite
>  Issue Type: Improvement
>Affects Versions: 2.7
>Reporter: Sergey Kozlov
>Assignee: Ryabov Dmitrii
>Priority: Major
> Fix For: 2.8
>
>  Time Spent: 7h
>  Remaining Estimate: 0h
>
> ZK 3.5.4-beta already supports SSL [1]. We should add SSL support to ZK 
> server connections if the Zookeeper SPI is used.
> 1. 
> https://cwiki.apache.org/confluence/display/ZOOKEEPER/ZooKeeper+SSL+User+Guide



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (IGNITE-12206) Partition state validation warns are not printed to log

2019-09-24 Thread Pavel Kovalenko (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936946#comment-16936946
 ] 

Pavel Kovalenko commented on IGNITE-12206:
--

[~mstepachev] Thank you for the contribution. Merged to master.

> Partition state validation warns are not printed to log
> ---
>
> Key: IGNITE-12206
> URL: https://issues.apache.org/jira/browse/IGNITE-12206
> Project: Ignite
>  Issue Type: Bug
>Reporter: Stepachev Maksim
>Assignee: Stepachev Maksim
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> GridDhtPartitionsExchangeFuture.java
>  
> {code:java}
>  if (grpCtx == null
> || grpCtx.config().isReadThrough()
> || grpCtx.config().isWriteThrough()
> || grpCtx.config().getCacheStoreFactory() != null
> || grpCtx.config().getRebalanceDelay() == -1
> || grpCtx.config().getRebalanceMode() == 
> CacheRebalanceMode.NONE
> || grpCtx.config().getExpiryPolicyFactory() == null
> || SKIP_PARTITION_SIZE_VALIDATION)
> return null;{code}
>  
> Looks like a typo, probably it should be 
> grpCtx.config().getExpiryPolicyFactory() != null



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (IGNITE-12206) Partition state validation warns are not printed to log

2019-09-24 Thread Pavel Kovalenko (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-12206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Kovalenko updated IGNITE-12206:
-
Affects Version/s: 2.7

> Partition state validation warns are not printed to log
> ---
>
> Key: IGNITE-12206
> URL: https://issues.apache.org/jira/browse/IGNITE-12206
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 2.7
>Reporter: Stepachev Maksim
>Assignee: Stepachev Maksim
>Priority: Major
> Fix For: 2.8
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> GridDhtPartitionsExchangeFuture.java
>  
> {code:java}
>  if (grpCtx == null
> || grpCtx.config().isReadThrough()
> || grpCtx.config().isWriteThrough()
> || grpCtx.config().getCacheStoreFactory() != null
> || grpCtx.config().getRebalanceDelay() == -1
> || grpCtx.config().getRebalanceMode() == 
> CacheRebalanceMode.NONE
> || grpCtx.config().getExpiryPolicyFactory() == null
> || SKIP_PARTITION_SIZE_VALIDATION)
> return null;{code}
>  
> Looks like a typo; it should probably be 
> grpCtx.config().getExpiryPolicyFactory() != null



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (IGNITE-12088) Cache or template name should be validated before attempt to start

2019-09-19 Thread Pavel Kovalenko (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933509#comment-16933509
 ] 

Pavel Kovalenko commented on IGNITE-12088:
--

[~kcheng.mvp] Thank you. I've left comments in your PR. To verify your PR you 
need to run all test suites on TeamCity and get a green visa from the TeamCity 
bot. You can find your PR on the https://mtcga.gridgain.com/prs.html page and use 
"Trigger build" for convenience.
Please adjust your changes according to the Ignite code style - 
https://cwiki.apache.org/confluence/display/IGNITE/Coding+Guidelines
Information about the TeamCity bot - 
https://cwiki.apache.org/confluence/display/IGNITE/Apache+Ignite+Teamcity+Bot

> Cache or template name should be validated before attempt to start
> --
>
> Key: IGNITE-12088
> URL: https://issues.apache.org/jira/browse/IGNITE-12088
> Project: Ignite
>  Issue Type: Bug
>  Components: cache
>Reporter: Pavel Kovalenko
>Assignee: kcheng.mvp
>Priority: Critical
>  Labels: usability
> Fix For: 2.8
>
>
> If a cache name is too long, it can make it impossible to create the work 
> directory for that cache:
> {noformat}
> [2019-08-20 
> 19:35:42,139][ERROR][exchange-worker-#172%node1%][IgniteTestResources] 
> Critical system error detected. Will be handled accordingly to configured 
> handler [hnd=NoOpFailureHandler [super=AbstractFailureHandler 
> [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, 
> SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext 
> [type=CRITICAL_ERROR, err=class o.a.i.IgniteCheckedException: Failed to 
> initialize cache working directory (failed to create, make sure the work 
> folder has correct permissions): 
> /home/gridgain/projects/incubator-ignite/work/db/node1/cache-CacheConfiguration
>  [name=ccfg3staticTemplate*, grpName=null, memPlcName=null, 
> storeConcurrentLoadAllThreshold=5, rebalancePoolSize=1, 
> rebalanceTimeout=1, evictPlc=null, evictPlcFactory=null, 
> onheapCache=false, sqlOnheapCache=false, sqlOnheapCacheMaxSize=0, 
> evictFilter=null, eagerTtl=true, dfltLockTimeout=0, nearCfg=null, 
> writeSync=null, storeFactory=null, storeKeepBinary=false, loadPrevVal=false, 
> aff=null, cacheMode=PARTITIONED, atomicityMode=null, backups=6, 
> invalidate=false, tmLookupClsName=null, rebalanceMode=ASYNC, 
> rebalanceOrder=0, rebalanceBatchSize=524288, rebalanceBatchesPrefetchCnt=2, 
> maxConcurrentAsyncOps=500, sqlIdxMaxInlineSize=-1, writeBehindEnabled=false, 
> writeBehindFlushSize=10240, writeBehindFlushFreq=5000, 
> writeBehindFlushThreadCnt=1, writeBehindBatchSize=512, 
> writeBehindCoalescing=true, maxQryIterCnt=1024, affMapper=null, 
> rebalanceDelay=0, rebalanceThrottle=0, interceptor=null, 
> longQryWarnTimeout=3000, qryDetailMetricsSz=0, readFromBackup=true, 
> nodeFilter=null, sqlSchema=null, sqlEscapeAll=false, cpOnRead=true, 
> topValidator=null, partLossPlc=IGNORE, qryParallelism=1, evtsDisabled=false, 
> encryptionEnabled=false, diskPageCompression=null, 
> diskPageCompressionLevel=null]0]]
> class org.apache.ignite.IgniteCheckedException: Failed to initialize cache 
> working directory (failed to create, make sure the work folder has correct 
> permissions): 
> /home/gridgain/projects/incubator-ignite/work/db/node1/cache-CacheConfiguration
>  [name=ccfg3staticTemplate*, grpName=null, memPlcName=null, 
> storeConcurrentLoadAllThreshold=5, rebalancePoolSize=1, 
> rebalanceTimeout=1, evictPlc=null, evictPlcFactory=null, 
> onheapCache=false, sqlOnheapCache=false, sqlOnheapCacheMaxSize=0, 
> evictFilter=null, eagerTtl=true, dfltLockTimeout=0, nearCfg=null, 
> writeSync=null, storeFactory=null, storeKeepBinary=false, loadPrevVal=false, 
> aff=null, cacheMode=PARTITIONED, atomicityMode=null, backups=6, 
> invalidate=false, tmLookupClsName=null, rebalanceMode=ASYNC, 
> rebalanceOrder=0, rebalanceBatchSize=524288, rebalanceBatchesPrefetchCnt=2, 
> maxConcurrentAsyncOps=500, sqlIdxMaxInlineSize=-1, writeBehindEnabled=false, 
> writeBehindFlushSize=10240, writeBehindFlushFreq=5000, 
> writeBehindFlushThreadCnt=1, writeBehindBatchSize=512, 
> writeBehindCoalescing=true, maxQryIterCnt=1024, affMapper=null, 
> rebalanceDelay=0, rebalanceThrottle=0, interceptor=null, 
> longQryWarnTimeout=3000, qryDetailMetricsSz=0, readFromBackup=true, 
> nodeFilter=null, sqlSchema=null, sqlEscapeAll=false, cpOnRead=true, 
> topValidator=null, partLossPlc=IGNORE, qryParallelism=1, evtsDisabled=false, 
> encryptionEnabled=false, diskPageCompression=null, 
> diskPageCompressionLevel=null]0
>   at 
> org.apache.ignite.internal.processors.cache.persistence.file.FilePageStoreManager.checkAndInitCacheWorkDir(FilePageStoreManager.java:769)
>   at 
> 

[jira] [Commented] (IGNITE-11094) Add SSL support for ignite zookeeper SPI

2019-09-19 Thread Pavel Kovalenko (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-11094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933355#comment-16933355
 ] 

Pavel Kovalenko commented on IGNITE-11094:
--

[~SomeFire] Thank you for your work on the issue. Now the changes are ready to 
be merged. Please re-run the Zookeeper suites to ensure that everything works well.

> Add SSL support for ignite zookeeper SPI
> 
>
> Key: IGNITE-11094
> URL: https://issues.apache.org/jira/browse/IGNITE-11094
> Project: Ignite
>  Issue Type: Improvement
>Affects Versions: 2.7
>Reporter: Sergey Kozlov
>Assignee: Ryabov Dmitrii
>Priority: Major
> Fix For: 2.8
>
>  Time Spent: 6.5h
>  Remaining Estimate: 0h
>
> ZK 3.5.4-beta already supports SSL [1]. We should add SSL support to ZK 
> server connections if the Zookeeper SPI is used.
> 1. 
> https://cwiki.apache.org/confluence/display/ZOOKEEPER/ZooKeeper+SSL+User+Guide



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (IGNITE-12088) Cache or template name should be validated before attempt to start

2019-09-18 Thread Pavel Kovalenko (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932714#comment-16932714
 ] 

Pavel Kovalenko commented on IGNITE-12088:
--

[~kcheng.mvp] Thank you for taking on this issue. Most filesystems allow 
no more than 255 characters in a filename or directory name. I advise 
limiting the cache name and cache group name length to 235 characters (20 
characters are reserved for internal prefixes, suffixes, or extensions).
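
For illustration only, a minimal sketch of the kind of name-length validation suggested 
here; the class, method, and limit constant are hypothetical, not the actual Ignite API:

{code:java}
/** Hypothetical validator for cache/cache group names, illustrating the 235-char limit. */
public final class CacheNameValidator {
    /** 255 (common filesystem limit) minus ~20 chars reserved for internal prefixes/suffixes. */
    private static final int MAX_CACHE_NAME_LEN = 235;

    private CacheNameValidator() {
        // No-op.
    }

    /**
     * @param name Cache or cache group name to validate.
     * @throws IllegalArgumentException If the name is empty or too long.
     */
    public static void validate(String name) {
        if (name == null || name.isEmpty())
            throw new IllegalArgumentException("Cache name must not be null or empty.");

        if (name.length() > MAX_CACHE_NAME_LEN)
            throw new IllegalArgumentException("Cache name is too long (max " + MAX_CACHE_NAME_LEN +
                " characters): " + name.substring(0, 32) + "...");
    }
}
{code}

The point is simply to fail fast with a clear message before any work directory is created.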

> Cache or template name should be validated before attempt to start
> --
>
> Key: IGNITE-12088
> URL: https://issues.apache.org/jira/browse/IGNITE-12088
> Project: Ignite
>  Issue Type: Bug
>  Components: cache
>Reporter: Pavel Kovalenko
>Assignee: kcheng.mvp
>Priority: Critical
>  Labels: usability
> Fix For: 2.8
>
>
> If a cache name is too long, it can make it impossible to create the work 
> directory for that cache:
> {noformat}
> [2019-08-20 
> 19:35:42,139][ERROR][exchange-worker-#172%node1%][IgniteTestResources] 
> Critical system error detected. Will be handled accordingly to configured 
> handler [hnd=NoOpFailureHandler [super=AbstractFailureHandler 
> [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, 
> SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext 
> [type=CRITICAL_ERROR, err=class o.a.i.IgniteCheckedException: Failed to 
> initialize cache working directory (failed to create, make sure the work 
> folder has correct permissions): 
> /home/gridgain/projects/incubator-ignite/work/db/node1/cache-CacheConfiguration
>  [name=ccfg3staticTemplate*, grpName=null, memPlcName=null, 
> storeConcurrentLoadAllThreshold=5, rebalancePoolSize=1, 
> rebalanceTimeout=1, evictPlc=null, evictPlcFactory=null, 
> onheapCache=false, sqlOnheapCache=false, sqlOnheapCacheMaxSize=0, 
> evictFilter=null, eagerTtl=true, dfltLockTimeout=0, nearCfg=null, 
> writeSync=null, storeFactory=null, storeKeepBinary=false, loadPrevVal=false, 
> aff=null, cacheMode=PARTITIONED, atomicityMode=null, backups=6, 
> invalidate=false, tmLookupClsName=null, rebalanceMode=ASYNC, 
> rebalanceOrder=0, rebalanceBatchSize=524288, rebalanceBatchesPrefetchCnt=2, 
> maxConcurrentAsyncOps=500, sqlIdxMaxInlineSize=-1, writeBehindEnabled=false, 
> writeBehindFlushSize=10240, writeBehindFlushFreq=5000, 
> writeBehindFlushThreadCnt=1, writeBehindBatchSize=512, 
> writeBehindCoalescing=true, maxQryIterCnt=1024, affMapper=null, 
> rebalanceDelay=0, rebalanceThrottle=0, interceptor=null, 
> longQryWarnTimeout=3000, qryDetailMetricsSz=0, readFromBackup=true, 
> nodeFilter=null, sqlSchema=null, sqlEscapeAll=false, cpOnRead=true, 
> topValidator=null, partLossPlc=IGNORE, qryParallelism=1, evtsDisabled=false, 
> encryptionEnabled=false, diskPageCompression=null, 
> diskPageCompressionLevel=null]0]]
> class org.apache.ignite.IgniteCheckedException: Failed to initialize cache 
> working directory (failed to create, make sure the work folder has correct 
> permissions): 
> /home/gridgain/projects/incubator-ignite/work/db/node1/cache-CacheConfiguration
>  [name=ccfg3staticTemplate*, grpName=null, memPlcName=null, 
> storeConcurrentLoadAllThreshold=5, rebalancePoolSize=1, 
> rebalanceTimeout=1, evictPlc=null, evictPlcFactory=null, 
> onheapCache=false, sqlOnheapCache=false, sqlOnheapCacheMaxSize=0, 
> evictFilter=null, eagerTtl=true, dfltLockTimeout=0, nearCfg=null, 
> writeSync=null, storeFactory=null, storeKeepBinary=false, loadPrevVal=false, 
> aff=null, cacheMode=PARTITIONED, atomicityMode=null, backups=6, 
> invalidate=false, tmLookupClsName=null, rebalanceMode=ASYNC, 
> rebalanceOrder=0, rebalanceBatchSize=524288, rebalanceBatchesPrefetchCnt=2, 
> maxConcurrentAsyncOps=500, sqlIdxMaxInlineSize=-1, writeBehindEnabled=false, 
> writeBehindFlushSize=10240, writeBehindFlushFreq=5000, 
> writeBehindFlushThreadCnt=1, writeBehindBatchSize=512, 
> writeBehindCoalescing=true, maxQryIterCnt=1024, affMapper=null, 
> rebalanceDelay=0, rebalanceThrottle=0, interceptor=null, 
> longQryWarnTimeout=3000, qryDetailMetricsSz=0, readFromBackup=true, 
> nodeFilter=null, sqlSchema=null, sqlEscapeAll=false, cpOnRead=true, 
> topValidator=null, partLossPlc=IGNORE, qryParallelism=1, evtsDisabled=false, 
> encryptionEnabled=false, diskPageCompression=null, 
> diskPageCompressionLevel=null]0
>   at 
> org.apache.ignite.internal.processors.cache.persistence.file.FilePageStoreManager.checkAndInitCacheWorkDir(FilePageStoreManager.java:769)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.file.FilePageStoreManager.checkAndInitCacheWorkDir(FilePageStoreManager.java:748)
>   at 
> org.apache.ignite.internal.processors.cache.CachesRegistry.persistCacheConfigurations(CachesRegistry.java:289)
>   at 
> 

[jira] [Commented] (IGNITE-11094) Add SSL support for ignite zookeeper SPI

2019-09-12 Thread Pavel Kovalenko (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-11094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928391#comment-16928391
 ] 

Pavel Kovalenko commented on IGNITE-11094:
--

[~SomeFire] I have several minor concerns about the change and have left comments in 
the PR. Please resolve them.

> Add SSL support for ignite zookeeper SPI
> 
>
> Key: IGNITE-11094
> URL: https://issues.apache.org/jira/browse/IGNITE-11094
> Project: Ignite
>  Issue Type: Improvement
>Affects Versions: 2.7
>Reporter: Sergey Kozlov
>Assignee: Ryabov Dmitrii
>Priority: Major
> Fix For: 2.8
>
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> ZK 3.5.4-beta already supports SSL [1]. We should add SSL support to ZK 
> server connections if the Zookeeper SPI is used.
> 1. 
> https://cwiki.apache.org/confluence/display/ZOOKEEPER/ZooKeeper+SSL+User+Guide



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (IGNITE-12133) O(log n) partition exchange

2019-09-04 Thread Pavel Kovalenko (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16922393#comment-16922393
 ] 

Pavel Kovalenko commented on IGNITE-12133:
--

[~ivan.glukos] I guess the general solution to this problem would be to 
introduce a Gossip protocol, as in Cassandra. In that case, we wouldn't need 
a pre-determined skip-list topology for the nodes. The probabilistic nature of 
Gossip also gives us ~log(N) rounds for the dissemination of such messages.
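
As an illustration only (not Ignite code), a tiny push-gossip simulation showing that 
a rumor reaches all N nodes in roughly log(N) rounds; the cluster size and fanout are 
arbitrary example values:

{code:java}
import java.util.BitSet;
import java.util.Random;

/** Toy push-gossip simulation: counts rounds until all nodes are informed. */
public class GossipRoundsDemo {
    public static void main(String[] args) {
        int n = 1000;   // Cluster size.
        int fanout = 2; // Peers contacted by each informed node per round.
        Random rnd = new Random(42);

        BitSet informed = new BitSet(n);
        informed.set(0); // Node 0 starts the rumor (e.g. a topology update).

        int rounds = 0;

        while (informed.cardinality() < n) {
            BitSet next = (BitSet)informed.clone();

            // Every informed node pushes the rumor to 'fanout' random peers.
            for (int node = informed.nextSetBit(0); node >= 0; node = informed.nextSetBit(node + 1))
                for (int i = 0; i < fanout; i++)
                    next.set(rnd.nextInt(n));

            informed = next;
            rounds++;
        }

        // Typically prints a number on the order of log(1000), i.e. around 10 rounds.
        System.out.println("Nodes: " + n + ", rounds to full dissemination: " + rounds);
    }
}
{code}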

> O(log n) partition exchange
> ---
>
> Key: IGNITE-12133
> URL: https://issues.apache.org/jira/browse/IGNITE-12133
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Moti Nisenson-Ken
>Priority: Major
>
> Currently, partition exchange leverages a ring. This means that 
> communication is O(n) in the number of nodes. It also means that if 
> non-coordinator nodes hang, it can take much longer to successfully resolve 
> the topology.
> Instead, why not use something like a skip-list where the coordinator is 
> first? The coordinator can notify the first node at each level of the 
> skip-list. Each node then notifies all of its "near-neighbours" in the 
> skip-list, where node B is a near-neighbour of node A if max-level(nodeB) <= 
> max-level(nodeA), and nodeB is the first node at its level when traversing 
> from nodeA in the direction of nodeB, skipping over nodes C which have 
> max-level(C) > max-level(A). 
> 1
> 1 . . . 3
> 1 . . . 3 . . . 5
> 1 . 2 . 3 . 4 . 5 . 6
> In the above 1 would notify 2 and 3, 3 would notify 4 and 5, 2 -> 4, and 4 -> 
> 6, and 5 -> 6.
> One can achieve better redundancy by having each node traverse in both 
> directions, and having the coordinator also notify the last node in the list 
> at each level. This way in the above example if 2 and 3 were both down, 4 
> would still get notified from 5 and 6 (in the backwards direction).
>  
> The idea is that each individual node has O(log n) nodes to notify - so the 
> overall time is reduced. Additionally, we can deal well with at least 1 node 
> failure - if one includes the option of processing backwards, 2 consecutive 
> node failures can be handled as well. By taking this kind of approach, 
> the coordinator can basically treat any nodes it didn't receive a 
> message from as not-connected, and update the topology as well (disconnecting 
> any nodes that it didn't get a notification from). While there are some edge 
> cases here (e.g. 2 disconnected nodes, then 1 connected node, then 2 
> disconnected nodes - the connected node would be wrongly ejected from the 
> topology), these would generally be too rare to need explicit handling for.
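
A rough sketch (illustrative only, not Ignite code) of computing the forward 
"near-neighbours" described above for a fixed skip-list level assignment; the node ids 
and levels match the example in the description:

{code:java}
import java.util.ArrayList;
import java.util.List;

/** Computes notification targets ("near-neighbours") for each node in a skip-list layout. */
public class SkipListNeighboursDemo {
    public static void main(String[] args) {
        // Levels for nodes 1..6 from the example: 1 -> 4, 2 -> 1, 3 -> 3, 4 -> 1, 5 -> 2, 6 -> 1.
        int[] levels = {4, 1, 3, 1, 2, 1};

        // Prints: node 1 -> [2, 3], node 2 -> [4], node 3 -> [4, 5], node 4 -> [6], node 5 -> [6].
        for (int a = 0; a < levels.length; a++)
            System.out.println("Node " + (a + 1) + " notifies " + neighbours(levels, a));
    }

    /**
     * @param levels Skip-list level of each node, ordered by position in the list.
     * @param a Index of the notifying node.
     * @return Node ids (1-based) notified by node {@code a} when traversing forward.
     */
    static List<Integer> neighbours(int[] levels, int a) {
        List<Integer> res = new ArrayList<>();

        int maxBetween = 0; // Highest level seen among non-skipped nodes after 'a'.

        for (int b = a + 1; b < levels.length; b++) {
            if (levels[b] > levels[a])
                continue; // Skip nodes with a higher level than the notifier.

            if (levels[b] > maxBetween) {
                res.add(b + 1); // First reachable node at its level: notify it.
                maxBetween = levels[b];
            }
        }

        return res;
    }
}
{code}

The backward traversal mentioned in the description for redundancy would simply be the 
mirror image of this loop.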



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (IGNITE-11704) Write tombstones during rebalance to get rid of deferred delete buffer

2019-09-02 Thread Pavel Kovalenko (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-11704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920882#comment-16920882
 ] 

Pavel Kovalenko commented on IGNITE-11704:
--

[~sboikov] If you don't mind, I'll take over the ticket and finish it.

> Write tombstones during rebalance to get rid of deferred delete buffer
> --
>
> Key: IGNITE-11704
> URL: https://issues.apache.org/jira/browse/IGNITE-11704
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexey Goncharuk
>Assignee: Pavel Kovalenko
>Priority: Major
>  Labels: rebalance
>
> Currently Ignite relies on a deferred delete buffer in order to handle 
> write-remove conflicts during rebalance. Given the limited size of the buffer, 
> this approach is fundamentally flawed, especially when persistence is 
> enabled.
> I suggest extending the data storage logic to be able to store key 
> tombstones - keeping the version for deleted entries. The tombstones would be 
> written while rebalance is in progress and cleaned up when rebalance 
> is completed.
> Later this approach may be used to implement fast partition rebalance based 
> on Merkle trees (in this case, tombstones should be written on an incomplete 
> baseline).
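
An illustrative sketch of the tombstone idea (not the actual Ignite storage code; all 
names are hypothetical): removes performed while rebalance is in progress are recorded 
with their version instead of being physically deleted, and tombstones are purged once 
rebalance completes:

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Toy key-value store that keeps versioned tombstones while rebalance is active. */
public class TombstoneStoreDemo<K, V> {
    /** Value holder: either a live value or a tombstone carrying only the remove version. */
    private static final class Entry<V> {
        final V val;    // Null for a tombstone.
        final long ver; // Update/remove version.

        Entry(V val, long ver) { this.val = val; this.ver = ver; }

        boolean tombstone() { return val == null; }
    }

    private final Map<K, Entry<V>> map = new ConcurrentHashMap<>();

    private volatile boolean rebalanceInProgress;

    public void put(K key, V val, long ver) { map.put(key, new Entry<>(val, ver)); }

    public void remove(K key, long ver) {
        if (rebalanceInProgress)
            map.put(key, new Entry<>(null, ver)); // Keep version to resolve write-remove conflicts.
        else
            map.remove(key);
    }

    /** A rebalanced entry is applied only if it is newer than what we have (including tombstones). */
    public void applyRebalanced(K key, V val, long ver) {
        map.merge(key, new Entry<>(val, ver), (cur, inc) -> cur.ver >= inc.ver ? cur : inc);
    }

    public void onRebalanceStarted() { rebalanceInProgress = true; }

    public void onRebalanceFinished() {
        rebalanceInProgress = false;
        map.entrySet().removeIf(e -> e.getValue().tombstone()); // Purge tombstones.
    }
}
{code}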



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Assigned] (IGNITE-11704) Write tombstones during rebalance to get rid of deferred delete buffer

2019-09-02 Thread Pavel Kovalenko (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-11704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Kovalenko reassigned IGNITE-11704:


Assignee: Pavel Kovalenko

> Write tombstones during rebalance to get rid of deferred delete buffer
> --
>
> Key: IGNITE-11704
> URL: https://issues.apache.org/jira/browse/IGNITE-11704
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexey Goncharuk
>Assignee: Pavel Kovalenko
>Priority: Major
>  Labels: rebalance
>
> Currently Ignite relies on a deferred delete buffer in order to handle 
> write-remove conflicts during rebalance. Given the limited size of the buffer, 
> this approach is fundamentally flawed, especially when persistence is 
> enabled.
> I suggest extending the data storage logic to be able to store key 
> tombstones - keeping the version for deleted entries. The tombstones would be 
> written while rebalance is in progress and cleaned up when rebalance 
> is completed.
> Later this approach may be used to implement fast partition rebalance based 
> on Merkle trees (in this case, tombstones should be written on an incomplete 
> baseline).



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (IGNITE-3195) Rebalancing: IgniteConfiguration.rebalanceThreadPoolSize is wrongly treated

2019-08-28 Thread Pavel Kovalenko (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-3195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917846#comment-16917846
 ] 

Pavel Kovalenko commented on IGNITE-3195:
-

[~avinogradov]
I have a question regarding the completed change. Why are 2 thread pools used 
for rebalancing? Why would it be worse to leave only the striped pool for all 
messages?

> Rebalancing: IgniteConfiguration.rebalanceThreadPoolSize is wrongly treated
> ---
>
> Key: IGNITE-3195
> URL: https://issues.apache.org/jira/browse/IGNITE-3195
> Project: Ignite
>  Issue Type: Bug
>  Components: cache
>Reporter: Denis Magda
>Assignee: Anton Vinogradov
>Priority: Major
>  Labels: iep-16
> Fix For: 2.8
>
>  Time Spent: 4h
>  Remaining Estimate: 0h
>
> Presently it's assumed that the maximum number of threads that have to 
> process all demand and supply messages coming from all the nodes must not be 
> bigger than {{IgniteConfiguration.rebalanceThreadPoolSize}}.
> The current implementation relies on the ordered messages functionality, creating a 
> number of topics equal to {{IgniteConfiguration.rebalanceThreadPoolSize}}.
> However, the implementation doesn't take into account that ordered messages 
> that correspond to a particular topic are processed in parallel for 
> different nodes. Refer to the implementation of 
> {{GridIoManager.processOrderedMessage}} to see that for every topic there 
> will be a unique {{GridCommunicationMessageSet}} for every node.
> Also to prove that this is true you can refer to this execution stack 
> {noformat}
> java.lang.RuntimeException: HAPPENED DEMAND
>   at 
> org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:378)
>   at 
> org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:364)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:622)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:320)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$300(GridCacheIoManager.java:81)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager$OrderedMessageListener.onMessage(GridCacheIoManager.java:1125)
>   at 
> org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1219)
>   at 
> org.apache.ignite.internal.managers.communication.GridIoManager.access$1600(GridIoManager.java:105)
>   at 
> org.apache.ignite.internal.managers.communication.GridIoManager$GridCommunicationMessageSet.unwind(GridIoManager.java:2456)
>   at 
> org.apache.ignite.internal.managers.communication.GridIoManager.unwindMessageSet(GridIoManager.java:1179)
>   at 
> org.apache.ignite.internal.managers.communication.GridIoManager.access$1900(GridIoManager.java:105)
>   at 
> org.apache.ignite.internal.managers.communication.GridIoManager$6.run(GridIoManager.java:1148)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> All this means that in fact the number of threads that will be busy with 
> replication activity will be equal to 
> {{IgniteConfiguration.rebalanceThreadPoolSize}} x 
> number_of_nodes_participated_in_rebalancing
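
For an illustrative sense of scale (example numbers, not taken from the ticket): with 
{{rebalanceThreadPoolSize}} = 4 and 16 nodes participating in rebalancing, up to 
4 x 16 = 64 threads may end up busy processing demand/supply messages instead of the 
intended 4.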



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (IGNITE-12088) Cache or template name should be validated before attempt to start

2019-08-20 Thread Pavel Kovalenko (Jira)
Pavel Kovalenko created IGNITE-12088:


 Summary: Cache or template name should be validated before attempt 
to start
 Key: IGNITE-12088
 URL: https://issues.apache.org/jira/browse/IGNITE-12088
 Project: Ignite
  Issue Type: Bug
  Components: cache
Reporter: Pavel Kovalenko
 Fix For: 2.8


If a cache name is too long, it can make it impossible to create the work 
directory for that cache:

{noformat}
[2019-08-20 
19:35:42,139][ERROR][exchange-worker-#172%node1%][IgniteTestResources] Critical 
system error detected. Will be handled accordingly to configured handler 
[hnd=NoOpFailureHandler [super=AbstractFailureHandler 
[ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, 
SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext 
[type=CRITICAL_ERROR, err=class o.a.i.IgniteCheckedException: Failed to 
initialize cache working directory (failed to create, make sure the work folder 
has correct permissions): 
/home/gridgain/projects/incubator-ignite/work/db/node1/cache-CacheConfiguration 
[name=ccfg3staticTemplate*, grpName=null, memPlcName=null, 
storeConcurrentLoadAllThreshold=5, rebalancePoolSize=1, rebalanceTimeout=1, 
evictPlc=null, evictPlcFactory=null, onheapCache=false, sqlOnheapCache=false, 
sqlOnheapCacheMaxSize=0, evictFilter=null, eagerTtl=true, dfltLockTimeout=0, 
nearCfg=null, writeSync=null, storeFactory=null, storeKeepBinary=false, 
loadPrevVal=false, aff=null, cacheMode=PARTITIONED, atomicityMode=null, 
backups=6, invalidate=false, tmLookupClsName=null, rebalanceMode=ASYNC, 
rebalanceOrder=0, rebalanceBatchSize=524288, rebalanceBatchesPrefetchCnt=2, 
maxConcurrentAsyncOps=500, sqlIdxMaxInlineSize=-1, writeBehindEnabled=false, 
writeBehindFlushSize=10240, writeBehindFlushFreq=5000, 
writeBehindFlushThreadCnt=1, writeBehindBatchSize=512, 
writeBehindCoalescing=true, maxQryIterCnt=1024, affMapper=null, 
rebalanceDelay=0, rebalanceThrottle=0, interceptor=null, 
longQryWarnTimeout=3000, qryDetailMetricsSz=0, readFromBackup=true, 
nodeFilter=null, sqlSchema=null, sqlEscapeAll=false, cpOnRead=true, 
topValidator=null, partLossPlc=IGNORE, qryParallelism=1, evtsDisabled=false, 
encryptionEnabled=false, diskPageCompression=null, 
diskPageCompressionLevel=null]0]]
class org.apache.ignite.IgniteCheckedException: Failed to initialize cache 
working directory (failed to create, make sure the work folder has correct 
permissions): 
/home/gridgain/projects/incubator-ignite/work/db/node1/cache-CacheConfiguration 
[name=ccfg3staticTemplate*, grpName=null, memPlcName=null, 
storeConcurrentLoadAllThreshold=5, rebalancePoolSize=1, rebalanceTimeout=1, 
evictPlc=null, evictPlcFactory=null, onheapCache=false, sqlOnheapCache=false, 
sqlOnheapCacheMaxSize=0, evictFilter=null, eagerTtl=true, dfltLockTimeout=0, 
nearCfg=null, writeSync=null, storeFactory=null, storeKeepBinary=false, 
loadPrevVal=false, aff=null, cacheMode=PARTITIONED, atomicityMode=null, 
backups=6, invalidate=false, tmLookupClsName=null, rebalanceMode=ASYNC, 
rebalanceOrder=0, rebalanceBatchSize=524288, rebalanceBatchesPrefetchCnt=2, 
maxConcurrentAsyncOps=500, sqlIdxMaxInlineSize=-1, writeBehindEnabled=false, 
writeBehindFlushSize=10240, writeBehindFlushFreq=5000, 
writeBehindFlushThreadCnt=1, writeBehindBatchSize=512, 
writeBehindCoalescing=true, maxQryIterCnt=1024, affMapper=null, 
rebalanceDelay=0, rebalanceThrottle=0, interceptor=null, 
longQryWarnTimeout=3000, qryDetailMetricsSz=0, readFromBackup=true, 
nodeFilter=null, sqlSchema=null, sqlEscapeAll=false, cpOnRead=true, 
topValidator=null, partLossPlc=IGNORE, qryParallelism=1, evtsDisabled=false, 
encryptionEnabled=false, diskPageCompression=null, 
diskPageCompressionLevel=null]0
at 
org.apache.ignite.internal.processors.cache.persistence.file.FilePageStoreManager.checkAndInitCacheWorkDir(FilePageStoreManager.java:769)
at 
org.apache.ignite.internal.processors.cache.persistence.file.FilePageStoreManager.checkAndInitCacheWorkDir(FilePageStoreManager.java:748)
at 
org.apache.ignite.internal.processors.cache.CachesRegistry.persistCacheConfigurations(CachesRegistry.java:289)
at 
org.apache.ignite.internal.processors.cache.CachesRegistry.registerAllCachesAndGroups(CachesRegistry.java:264)
at 
org.apache.ignite.internal.processors.cache.CachesRegistry.update(CachesRegistry.java:202)
at 
org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager.onCacheChangeRequest(CacheAffinitySharedManager.java:850)
at 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onCacheChangeRequest(GridDhtPartitionsExchangeFuture.java:1306)
at 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:846)
at 

[jira] [Updated] (IGNITE-12088) Cache or template name should be validated before attempt to start

2019-08-20 Thread Pavel Kovalenko (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-12088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Kovalenko updated IGNITE-12088:
-
Labels: usability  (was: )

> Cache or template name should be validated before attempt to start
> --
>
> Key: IGNITE-12088
> URL: https://issues.apache.org/jira/browse/IGNITE-12088
> Project: Ignite
>  Issue Type: Bug
>  Components: cache
>Reporter: Pavel Kovalenko
>Priority: Critical
>  Labels: usability
> Fix For: 2.8
>
>
> If a cache name is too long, it can make it impossible to create the work 
> directory for that cache:
> {noformat}
> [2019-08-20 
> 19:35:42,139][ERROR][exchange-worker-#172%node1%][IgniteTestResources] 
> Critical system error detected. Will be handled accordingly to configured 
> handler [hnd=NoOpFailureHandler [super=AbstractFailureHandler 
> [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, 
> SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext 
> [type=CRITICAL_ERROR, err=class o.a.i.IgniteCheckedException: Failed to 
> initialize cache working directory (failed to create, make sure the work 
> folder has correct permissions): 
> /home/gridgain/projects/incubator-ignite/work/db/node1/cache-CacheConfiguration
>  [name=ccfg3staticTemplate*, grpName=null, memPlcName=null, 
> storeConcurrentLoadAllThreshold=5, rebalancePoolSize=1, 
> rebalanceTimeout=1, evictPlc=null, evictPlcFactory=null, 
> onheapCache=false, sqlOnheapCache=false, sqlOnheapCacheMaxSize=0, 
> evictFilter=null, eagerTtl=true, dfltLockTimeout=0, nearCfg=null, 
> writeSync=null, storeFactory=null, storeKeepBinary=false, loadPrevVal=false, 
> aff=null, cacheMode=PARTITIONED, atomicityMode=null, backups=6, 
> invalidate=false, tmLookupClsName=null, rebalanceMode=ASYNC, 
> rebalanceOrder=0, rebalanceBatchSize=524288, rebalanceBatchesPrefetchCnt=2, 
> maxConcurrentAsyncOps=500, sqlIdxMaxInlineSize=-1, writeBehindEnabled=false, 
> writeBehindFlushSize=10240, writeBehindFlushFreq=5000, 
> writeBehindFlushThreadCnt=1, writeBehindBatchSize=512, 
> writeBehindCoalescing=true, maxQryIterCnt=1024, affMapper=null, 
> rebalanceDelay=0, rebalanceThrottle=0, interceptor=null, 
> longQryWarnTimeout=3000, qryDetailMetricsSz=0, readFromBackup=true, 
> nodeFilter=null, sqlSchema=null, sqlEscapeAll=false, cpOnRead=true, 
> topValidator=null, partLossPlc=IGNORE, qryParallelism=1, evtsDisabled=false, 
> encryptionEnabled=false, diskPageCompression=null, 
> diskPageCompressionLevel=null]0]]
> class org.apache.ignite.IgniteCheckedException: Failed to initialize cache 
> working directory (failed to create, make sure the work folder has correct 
> permissions): 
> /home/gridgain/projects/incubator-ignite/work/db/node1/cache-CacheConfiguration
>  [name=ccfg3staticTemplate*, grpName=null, memPlcName=null, 
> storeConcurrentLoadAllThreshold=5, rebalancePoolSize=1, 
> rebalanceTimeout=1, evictPlc=null, evictPlcFactory=null, 
> onheapCache=false, sqlOnheapCache=false, sqlOnheapCacheMaxSize=0, 
> evictFilter=null, eagerTtl=true, dfltLockTimeout=0, nearCfg=null, 
> writeSync=null, storeFactory=null, storeKeepBinary=false, loadPrevVal=false, 
> aff=null, cacheMode=PARTITIONED, atomicityMode=null, backups=6, 
> invalidate=false, tmLookupClsName=null, rebalanceMode=ASYNC, 
> rebalanceOrder=0, rebalanceBatchSize=524288, rebalanceBatchesPrefetchCnt=2, 
> maxConcurrentAsyncOps=500, sqlIdxMaxInlineSize=-1, writeBehindEnabled=false, 
> writeBehindFlushSize=10240, writeBehindFlushFreq=5000, 
> writeBehindFlushThreadCnt=1, writeBehindBatchSize=512, 
> writeBehindCoalescing=true, maxQryIterCnt=1024, affMapper=null, 
> rebalanceDelay=0, rebalanceThrottle=0, interceptor=null, 
> longQryWarnTimeout=3000, qryDetailMetricsSz=0, readFromBackup=true, 
> nodeFilter=null, sqlSchema=null, sqlEscapeAll=false, cpOnRead=true, 
> topValidator=null, partLossPlc=IGNORE, qryParallelism=1, evtsDisabled=false, 
> encryptionEnabled=false, diskPageCompression=null, 
> diskPageCompressionLevel=null]0
>   at 
> org.apache.ignite.internal.processors.cache.persistence.file.FilePageStoreManager.checkAndInitCacheWorkDir(FilePageStoreManager.java:769)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.file.FilePageStoreManager.checkAndInitCacheWorkDir(FilePageStoreManager.java:748)
>   at 
> org.apache.ignite.internal.processors.cache.CachesRegistry.persistCacheConfigurations(CachesRegistry.java:289)
>   at 
> org.apache.ignite.internal.processors.cache.CachesRegistry.registerAllCachesAndGroups(CachesRegistry.java:264)
>   at 
> org.apache.ignite.internal.processors.cache.CachesRegistry.update(CachesRegistry.java:202)
>   at 
> 

[jira] [Commented] (IGNITE-11704) Write tombstones during rebalance to get rid of deferred delete buffer

2019-08-05 Thread Pavel Kovalenko (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-11704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16900252#comment-16900252
 ] 

Pavel Kovalenko commented on IGNITE-11704:
--

[~sboikov]
Thank you for the contribution.
I have a couple of questions and suggestions regarding the change:
1) I think we should keep 
org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManager#partitionIterator
 with the default behaviour (withTombstones=false). It would help to avoid changing 
existing code that relies on the previous partitionIterator behaviour.
2) Unnecessary brackets in 
org/apache/ignite/internal/processors/cache/GridCacheMapEntry.java:1715
3) Why is the double-check in 
org/apache/ignite/internal/processors/cache/GridCacheMapEntry.java:1723 
needed?
4) Broken javadoc in 
org/apache/ignite/internal/processors/cache/GridCacheMapEntry.java:5859

My main concern about the change is that tombstones can remain in a partition 
forever if the partition is CASed to the OWNING state and the node is immediately 
shut down. In this case, after the node comes back, it will never clear the 
tombstones. I think the "tombstoneCreated" flag should be reflected in the partition 
meta information and saved during a checkpoint. The same information should be added 
to the appropriate WAL delta record. During recovery, we can notice that the partition 
has tombstones and run the cleaning process. Also, looking at the code, this flag is 
never reset.
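
To make the suggested lifecycle concrete, a schematic sketch (all class and method 
names here are hypothetical, not the actual partition meta or WAL APIs):

{code:java}
/** Hypothetical partition state showing how a persisted tombstone flag could drive cleanup after restart. */
class PartitionStateSketch {
    /** Would live in partition meta and be written on checkpoint / as a WAL delta record. */
    private boolean hasTombstones;

    /** Called when the first tombstone is written during rebalance. */
    void onTombstoneCreated() {
        hasTombstones = true; // Must survive a checkpoint, otherwise a restart loses this fact.
    }

    /** Called on node start when partition meta is read back from disk. */
    void onRecovery() {
        if (hasTombstones)
            scheduleTombstoneCleanup();
    }

    /** Called when the cleanup has removed all tombstones. */
    void onCleanupFinished() {
        hasTombstones = false; // Reset the flag, addressing the "never reset" concern.
    }

    private void scheduleTombstoneCleanup() {
        // Placeholder: a real implementation would iterate the partition and remove tombstone entries.
    }
}
{code}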

> Write tombstones during rebalance to get rid of deferred delete buffer
> --
>
> Key: IGNITE-11704
> URL: https://issues.apache.org/jira/browse/IGNITE-11704
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexey Goncharuk
>Priority: Major
>  Labels: rebalance
>
> Currently Ignite relies on a deferred delete buffer in order to handle 
> write-remove conflicts during rebalance. Given the limited size of the buffer, 
> this approach is fundamentally flawed, especially when persistence is 
> enabled.
> I suggest extending the data storage logic to be able to store key 
> tombstones - keeping the version for deleted entries. The tombstones would be 
> written while rebalance is in progress and cleaned up when rebalance 
> is completed.
> Later this approach may be used to implement fast partition rebalance based 
> on Merkle trees (in this case, tombstones should be written on an incomplete 
> baseline).



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (IGNITE-11848) [IEP-35] Monitoring Phase 1

2019-06-11 Thread Pavel Kovalenko (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-11848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861178#comment-16861178
 ] 

Pavel Kovalenko commented on IGNITE-11848:
--

[~NIzhikov] Thank you for the contribution and for actively participating in the 
review process. At the moment I have no objections regarding the change.

> [IEP-35] Monitoring Phase 1
> --
>
> Key: IGNITE-11848
> URL: https://issues.apache.org/jira/browse/IGNITE-11848
> Project: Ignite
>  Issue Type: Task
>Affects Versions: 2.7
>Reporter: Nikolay Izhikov
>Assignee: Nikolay Izhikov
>Priority: Major
>  Labels: IEP-35
> Fix For: 2.8
>
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> Umbrella ticket for the IEP-35. Monitoring and profiling.
> Phase 1 should include:
>  * NextGen monitoring subsystem implementation to manage
>  ** metrics
>  ** -lists- (will be implemented in the following tickets)
>  ** exporters
>  * JMX, SQLView, Log exporters
>  * Migration of existing metrics to new manager
>  * -Lists for all Ignite user API-



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (IGNITE-11592) NPE in case of continuing tx and cache stop operation.

2019-05-23 Thread Pavel Kovalenko (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-11592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16846894#comment-16846894
 ] 

Pavel Kovalenko commented on IGNITE-11592:
--

[~zstan] Thank you for the contribution. Merged to master.

> NPE in case of continuing tx and cache stop operation. 
> ---
>
> Key: IGNITE-11592
> URL: https://issues.apache.org/jira/browse/IGNITE-11592
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 2.7
>Reporter: Stanilovsky Evgeny
>Assignee: Stanilovsky Evgeny
>Priority: Major
> Fix For: 2.8
>
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> Parallel cache stop and tx operations may lead to NPE.
> {code}
> java.lang.NullPointerException
>   at 
> org.apache.ignite.internal.processors.cache.CacheObjectImpl.finishUnmarshal(CacheObjectImpl.java:129)
>   at 
> org.apache.ignite.internal.processors.cache.transactions.TxEntryValueHolder.unmarshal(TxEntryValueHolder.java:151)
>   at 
> org.apache.ignite.internal.processors.cache.transactions.IgniteTxEntry.unmarshal(IgniteTxEntry.java:964)
>   at 
> org.apache.ignite.internal.processors.cache.transactions.IgniteTxHandler.unmarshal(IgniteTxHandler.java:306)
>   at 
> org.apache.ignite.internal.processors.cache.transactions.IgniteTxHandler.prepareNearTx(IgniteTxHandler.java:338)
>   at 
> org.apache.ignite.internal.processors.cache.transactions.IgniteTxHandler.processNearTxPrepareRequest0(IgniteTxHandler.java:154)
>   at 
> org.apache.ignite.internal.processors.cache.transactions.IgniteTxHandler.lambda$null$0(IgniteTxHandler.java:580)
>   at 
> org.apache.ignite.internal.util.StripedExecutor$Stripe.body(StripedExecutor.java:496)
>   at 
> org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> I hope the correct decision would be to roll back transactions (on the exchange 
> phase) that participate in stopped caches.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (IGNITE-11592) NPE in case of continuing tx and cache stop operation.

2019-05-22 Thread Pavel Kovalenko (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-11592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16846001#comment-16846001
 ] 

Pavel Kovalenko commented on IGNITE-11592:
--

[~zstan] Looks good to me. Please proceed with merge.

> NPE in case of continuing tx and cache stop operation. 
> ---
>
> Key: IGNITE-11592
> URL: https://issues.apache.org/jira/browse/IGNITE-11592
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 2.7
>Reporter: Stanilovsky Evgeny
>Assignee: Stanilovsky Evgeny
>Priority: Major
> Fix For: 2.8
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> Parallel cache stop and tx operations may lead to NPE.
> {code}
> java.lang.NullPointerException
>   at 
> org.apache.ignite.internal.processors.cache.CacheObjectImpl.finishUnmarshal(CacheObjectImpl.java:129)
>   at 
> org.apache.ignite.internal.processors.cache.transactions.TxEntryValueHolder.unmarshal(TxEntryValueHolder.java:151)
>   at 
> org.apache.ignite.internal.processors.cache.transactions.IgniteTxEntry.unmarshal(IgniteTxEntry.java:964)
>   at 
> org.apache.ignite.internal.processors.cache.transactions.IgniteTxHandler.unmarshal(IgniteTxHandler.java:306)
>   at 
> org.apache.ignite.internal.processors.cache.transactions.IgniteTxHandler.prepareNearTx(IgniteTxHandler.java:338)
>   at 
> org.apache.ignite.internal.processors.cache.transactions.IgniteTxHandler.processNearTxPrepareRequest0(IgniteTxHandler.java:154)
>   at 
> org.apache.ignite.internal.processors.cache.transactions.IgniteTxHandler.lambda$null$0(IgniteTxHandler.java:580)
>   at 
> org.apache.ignite.internal.util.StripedExecutor$Stripe.body(StripedExecutor.java:496)
>   at 
> org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> I hope the correct decision would be to roll back transactions (on the exchange 
> phase) that participate in stopped caches.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (IGNITE-11592) NPE in case of continuing tx and cache stop operation.

2019-05-20 Thread Pavel Kovalenko (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-11592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844012#comment-16844012
 ] 

Pavel Kovalenko commented on IGNITE-11592:
--

[~zstan] It's not clear what the cause of the problem is and what the 
step-by-step scenario to get into such a situation looks like.
Could you please add the scenario to the ticket description?

> NPE in case of continuing tx and cache stop operation. 
> ---
>
> Key: IGNITE-11592
> URL: https://issues.apache.org/jira/browse/IGNITE-11592
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 2.7
>Reporter: Stanilovsky Evgeny
>Assignee: Stanilovsky Evgeny
>Priority: Major
> Fix For: 2.8
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> Parallel cache stop and tx operations may lead to NPE.
> {code}
> java.lang.NullPointerException
>   at 
> org.apache.ignite.internal.processors.cache.CacheObjectImpl.finishUnmarshal(CacheObjectImpl.java:129)
>   at 
> org.apache.ignite.internal.processors.cache.transactions.TxEntryValueHolder.unmarshal(TxEntryValueHolder.java:151)
>   at 
> org.apache.ignite.internal.processors.cache.transactions.IgniteTxEntry.unmarshal(IgniteTxEntry.java:964)
>   at 
> org.apache.ignite.internal.processors.cache.transactions.IgniteTxHandler.unmarshal(IgniteTxHandler.java:306)
>   at 
> org.apache.ignite.internal.processors.cache.transactions.IgniteTxHandler.prepareNearTx(IgniteTxHandler.java:338)
>   at 
> org.apache.ignite.internal.processors.cache.transactions.IgniteTxHandler.processNearTxPrepareRequest0(IgniteTxHandler.java:154)
>   at 
> org.apache.ignite.internal.processors.cache.transactions.IgniteTxHandler.lambda$null$0(IgniteTxHandler.java:580)
>   at 
> org.apache.ignite.internal.util.StripedExecutor$Stripe.body(StripedExecutor.java:496)
>   at 
> org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> I hope the correct decision would be to roll back transactions (on the exchange 
> phase) that participate in stopped caches.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-11852) Assertion errors when changing PME coordinator to locally joining node

2019-05-14 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-11852:


 Summary: Assertion errors when changing PME coordinator to locally 
joining node
 Key: IGNITE-11852
 URL: https://issues.apache.org/jira/browse/IGNITE-11852
 Project: Ignite
  Issue Type: Bug
  Components: cache
Affects Versions: 2.7, 2.5
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
 Fix For: 2.8


When the PME coordinator is changed to a locally joining node, several assertion errors 
may occur:
1. When some other joining nodes have finished PME:

{noformat}
[13:49:58] (err) Failed to notify listener: 
o.a.i.i.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$8$1$1...@27296181java.lang.AssertionError:
 AffinityTopologyVersion [topVer=2, minorTopVer=0]
at 
org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager$11.applyx(CacheAffinitySharedManager.java:1546)
at 
org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager$11.applyx(CacheAffinitySharedManager.java:1535)
at 
org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager.lambda$forAllRegisteredCacheGroups$e0a6939d$1(CacheAffinitySharedManager.java:1281)
at 
org.apache.ignite.internal.util.IgniteUtils.doInParallel(IgniteUtils.java:10929)
at 
org.apache.ignite.internal.util.IgniteUtils.doInParallel(IgniteUtils.java:10831)
at 
org.apache.ignite.internal.util.IgniteUtils.doInParallel(IgniteUtils.java:10811)
at 
org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager.forAllRegisteredCacheGroups(CacheAffinitySharedManager.java:1280)
at 
org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager.onLocalJoin(CacheAffinitySharedManager.java:1535)
at 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.processFullMessage(GridDhtPartitionsExchangeFuture.java:4189)
at 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onBecomeCoordinator(GridDhtPartitionsExchangeFuture.java:4731)
at 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.access$3400(GridDhtPartitionsExchangeFuture.java:145)
at 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$8$1$1.apply(GridDhtPartitionsExchangeFuture.java:4622)
at 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$8$1$1.apply(GridDhtPartitionsExchangeFuture.java:4611)
at 
org.apache.ignite.internal.util.future.GridFutureAdapter.notifyListener(GridFutureAdapter.java:398)
at 
org.apache.ignite.internal.util.future.GridFutureAdapter.unblock(GridFutureAdapter.java:346)
at 
org.apache.ignite.internal.util.future.GridFutureAdapter.unblockAll(GridFutureAdapter.java:334)
at 
org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:510)
at 
org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:489)
at 
org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:466)
at 
org.apache.ignite.internal.util.future.GridCompoundFuture.checkComplete(GridCompoundFuture.java:281)
at 
org.apache.ignite.internal.util.future.GridCompoundFuture.apply(GridCompoundFuture.java:143)
at 
org.apache.ignite.internal.util.future.GridCompoundFuture.apply(GridCompoundFuture.java:44)
at 
org.apache.ignite.internal.util.future.GridFutureAdapter.notifyListener(GridFutureAdapter.java:398)
at 
org.apache.ignite.internal.util.future.GridFutureAdapter.unblock(GridFutureAdapter.java:346)
at 
org.apache.ignite.internal.util.future.GridFutureAdapter.unblockAll(GridFutureAdapter.java:334)
at 
org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:510)
at 
org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:489)
at 
org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:455)
at 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.InitNewCoordinatorFuture.onMessage(InitNewCoordinatorFuture.java:253)
at 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onReceiveSingleMessage(GridDhtPartitionsExchangeFuture.java:2731)
at 
org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager.processSinglePartitionUpdate(GridCachePartitionExchangeManager.java:1917)
at 
org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager.access$1300(GridCachePartitionExchangeManager.java:162)
at 

[jira] [Commented] (IGNITE-11762) Test testClientStartCloseServersRestart causes hang of the whole Cache 2 suite in master

2019-04-19 Thread Pavel Kovalenko (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-11762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16821959#comment-16821959
 ] 

Pavel Kovalenko commented on IGNITE-11762:
--

Investigation showed that the failure is not related to IGNITE-10799.
Here is the stack trace that causes the transaction to hang:

{noformat}
java.lang.AssertionError:
at 
org.apache.ignite.internal.processors.cache.transactions.IgniteTxLocalAdapter.calculatePartitionUpdateCounters(IgniteTxLocalAdapter.java:498)
at 
org.apache.ignite.internal.processors.cache.transactions.IgniteTxLocalAdapter.userPrepare(IgniteTxLocalAdapter.java:438)
at 
org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTxLocal.prepareAsync(GridDhtTxLocal.java:403)
at 
org.apache.ignite.internal.processors.cache.transactions.IgniteTxHandler.prepareNearTx(IgniteTxHandler.java:570)
at 
org.apache.ignite.internal.processors.cache.transactions.IgniteTxHandler.prepareNearTx(IgniteTxHandler.java:367)
at 
org.apache.ignite.internal.processors.cache.transactions.IgniteTxHandler.processNearTxPrepareRequest0(IgniteTxHandler.java:178)
at 
org.apache.ignite.internal.processors.cache.transactions.IgniteTxHandler.processNearTxPrepareRequest(IgniteTxHandler.java:156)
at 
org.apache.ignite.internal.processors.cache.transactions.IgniteTxHandler.access$000(IgniteTxHandler.java:118)
at 
org.apache.ignite.internal.processors.cache.transactions.IgniteTxHandler$1.apply(IgniteTxHandler.java:198)
at 
org.apache.ignite.internal.processors.cache.transactions.IgniteTxHandler$1.apply(IgniteTxHandler.java:196)
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1141)
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:591)
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:392)
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:318)
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$100(GridCacheIoManager.java:109)
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:308)
at 
org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1561)
at 
org.apache.ignite.internal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1189)
at 
org.apache.ignite.internal.managers.communication.GridIoManager.access$4200(GridIoManager.java:127)
at 
org.apache.ignite.internal.managers.communication.GridIoManager$8.run(GridIoManager.java:1086)
at 
org.apache.ignite.internal.util.StripedExecutor$Stripe.body(StripedExecutor.java:550)
at 
org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
{noformat}

It means that our transaction is mapped to a primary node with a partition in the 
MOVING state, which is generally impossible.
How can we get into such a situation? Here is the scenario:

Environment:
Cache with 1 backup, in-memory mode, partition loss policy is IGNORE.
Nodes: Crd, Node1, Node2.

1. The partition has the following state:
Node1 (OWNING) [Primary], Node2 (MOVING)
Node2 is currently rebalancing data from Node1.

2. Node1 leaves the topology, rebalancing is cancelled.

3. The Crd node is assigned this partition by affinity and creates it in the MOVING 
state here:
Method:
{code:java}
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture#finishExchangeOnCoordinator
{code}
Section:
{code:java}
doInParallel(
    parallelismLvl,
    cctx.kernalContext().getSystemExecutorService(),
    cctx.affinity().cacheGroups().values(),
    desc -> {
        if (desc.config().getCacheMode() == CacheMode.LOCAL)
            return null;

        CacheGroupContext grp = cctx.cache().cacheGroup(desc.groupId());

        GridDhtPartitionTopology top = grp != null ? grp.topology() :
            cctx.exchange().clientTopology(desc.groupId(), events().discoveryCache());

        top.beforeExchange(this, true, true);

        return null;
    });
{code}

4. After that, the partition has the following state:
Node2 (MOVING) [Primary], Crd (MOVING)

5. This partition is immediately owned on the coordinator node due to the loss 
policy in the following code block:
{code:java}
if (exchCtx.events().hasServerLeft())
detectLostPartitions(resTopVer);
{code}

6. FullMap with OWNED partition is sent to Node2.

7. Node2 can't mark this partition as LOST, because it 

[jira] [Commented] (IGNITE-11743) Stopping caches concurrently with node join may lead to crash of the node

2019-04-18 Thread Pavel Kovalenko (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-11743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16821237#comment-16821237
 ] 

Pavel Kovalenko commented on IGNITE-11743:
--

The Cache 2 failure is related to https://issues.apache.org/jira/browse/IGNITE-11762
The JDBC Driver failure is already fixed there: 
https://issues.apache.org/jira/browse/IGNITE-11773
Javadocs are already broken in master.

> Stopping caches concurrently with node join may lead to crash of the node
> -
>
> Key: IGNITE-11743
> URL: https://issues.apache.org/jira/browse/IGNITE-11743
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 2.7
>Reporter: Sergey Chugunov
>Assignee: Pavel Kovalenko
>Priority: Major
> Fix For: 2.8
>
> Attachments: IgnitePdsNodeRestartCacheCreateTest.java
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When an existing cache is stopped (e.g. via a call to Ignite#destroyCache(String 
> name)), this action is distributed across the cluster by the discovery mechanism (and 
> is processed from the *disco-notifier-worker* thread).
> At the same time, a joining node prepares to start caches from the *exchange-worker* 
> thread.
> If a cache stop request arrives at the new node right in the middle of cache 
> start preparation, it may lead to an exception in FilePageStoreManager like the one 
> below and a node crash.
> A test reproducing the issue is attached.
> {noformat}
> class org.apache.ignite.IgniteCheckedException: Failed to get page store for 
> the given cache ID (cache has not been started): -1422502786
>   at 
> org.apache.ignite.internal.processors.cache.persistence.file.FilePageStoreManager.getStore(FilePageStoreManager.java:1132)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.file.FilePageStoreManager.read(FilePageStoreManager.java:482)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.file.FilePageStoreManager.read(FilePageStoreManager.java:469)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.acquirePage(PageMemoryImpl.java:854)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.acquirePage(PageMemoryImpl.java:681)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager.getOrAllocateCacheMetas(GridCacheOffheapManager.java:869)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager.initDataStructures(GridCacheOffheapManager.java:128)
>   at 
> org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.start(IgniteCacheOffheapManagerImpl.java:193)
>   at 
> org.apache.ignite.internal.processors.cache.CacheGroupContext.start(CacheGroupContext.java:1043)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheProcessor.startCacheGroup(GridCacheProcessor.java:2829)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheProcessor.getOrCreateCacheGroupContext(GridCacheProcessor.java:2557)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheProcessor.prepareCacheContext(GridCacheProcessor.java:2387)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheProcessor.lambda$null$6a5b31b9$1(GridCacheProcessor.java:2209)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheProcessor.lambda$prepareStartCaches$5(GridCacheProcessor.java:2130)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheProcessor.lambda$prepareStartCaches$926b6886$1(GridCacheProcessor.java:2206)
>   at 
> org.apache.ignite.internal.util.IgniteUtils.lambda$null$1(IgniteUtils.java:10874)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> {noformat}
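
A generic illustration of the race (deliberately simplified, not Ignite's actual 
managers or APIs): one thread removes a per-cache resource while another thread, mid 
way through start preparation, still expects to find it:

{code:java}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Simplified model of the stop-vs-start race: the store for a cache id disappears mid-preparation. */
public class StopStartRaceDemo {
    private static final ConcurrentMap<Integer, String> pageStores = new ConcurrentHashMap<>();

    public static void main(String[] args) throws Exception {
        int cacheId = -1422502786;
        pageStores.put(cacheId, "store-for-" + cacheId);

        // "exchange-worker": prepares cache start and only later reads the page store.
        Thread exchangeWorker = new Thread(() -> {
            try {
                Thread.sleep(10); // Some preparation work before the store is actually needed.
            }
            catch (InterruptedException ignored) {
                // No-op.
            }

            String store = pageStores.get(cacheId);

            if (store == null)
                throw new IllegalStateException(
                    "Failed to get page store for the given cache ID (cache has not been started): " + cacheId);
        });

        // "disco-notifier-worker": processes the cache stop request concurrently.
        Thread discoWorker = new Thread(() -> pageStores.remove(cacheId));

        exchangeWorker.start();
        discoWorker.start();

        exchangeWorker.join();
        discoWorker.join();
    }
}
{code}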



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (IGNITE-11767) GridDhtPartitionsFullMessage retains huge maps on heap in exchange history

2019-04-18 Thread Pavel Kovalenko (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-11767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16821177#comment-16821177
 ] 

Pavel Kovalenko commented on IGNITE-11767:
--

[~ilyak] The overall approach to solving the problem looks good. The partition sizes 
map in the FullMessage is read only once, during the topology update, so it's safe to 
keep it in serialized form all the time.
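
A sketch of that idea (hypothetical names, not the real GridDhtPartitionsFullMessage 
fields): keep the rarely-read map only in its serialized form and deserialize it on the 
single read path:

{code:java}
import java.io.*;
import java.util.HashMap;
import java.util.Map;

/** Illustrates holding a large map in serialized form so it stays compact in exchange history. */
public class SerializedMapHolder {
    /** Compact representation retained in history instead of the heap map. */
    private final byte[] partSizesBytes;

    public SerializedMapHolder(Map<Integer, Long> partSizes) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();

        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeObject(new HashMap<>(partSizes));
        }

        partSizesBytes = bos.toByteArray();
    }

    /** Deserializes on demand; acceptable because the map is read only once, during topology update. */
    @SuppressWarnings("unchecked")
    public Map<Integer, Long> partSizes() throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(partSizesBytes))) {
            return (Map<Integer, Long>)in.readObject();
        }
    }
}
{code}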

> GridDhtPartitionsFullMessage retains huge maps on heap in exchange history
> ---
>
> Key: IGNITE-11767
> URL: https://issues.apache.org/jira/browse/IGNITE-11767
> Project: Ignite
>  Issue Type: Bug
>  Components: cache
>Affects Versions: 2.7
>Reporter: Ilya Kasnacheev
>Assignee: Ilya Kasnacheev
>Priority: Blocker
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> ExchangeHistory keeps a FinishState for every topology version.
> FinishState contains msg, which contains at least two huge maps:
> partCntrs2 and partsSizesBytes.
> We should probably strip msg, removing those two data structures before 
> putting msg in exchFuts linked list to be stowed away.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-11773) JDBC suite hangs due to cleared non-serializable proxy objects

2019-04-18 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-11773:


 Summary: JDBC suite hangs due to cleared non-serializable proxy 
objects
 Key: IGNITE-11773
 URL: https://issues.apache.org/jira/browse/IGNITE-11773
 Project: Ignite
  Issue Type: Bug
  Components: cache
Affects Versions: 2.8
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
 Fix For: 2.8



{noformat}
[01:53:02]W: [org.apache.ignite:ignite-clients] 
java.lang.AssertionError
[01:53:02]W: [org.apache.ignite:ignite-clients] at 
org.apache.ignite.testframework.junits.GridAbstractTest$SerializableProxy.readResolve(GridAbstractTest.java:2419)
[01:53:02]W: [org.apache.ignite:ignite-clients] at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[01:53:02]W: [org.apache.ignite:ignite-clients] at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
[01:53:02]W: [org.apache.ignite:ignite-clients] at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[01:53:02]W: [org.apache.ignite:ignite-clients] at 
java.lang.reflect.Method.invoke(Method.java:498)
[01:53:02]W: [org.apache.ignite:ignite-clients] at 
java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1260)
[01:53:02]W: [org.apache.ignite:ignite-clients] at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2078)
[01:53:02]W: [org.apache.ignite:ignite-clients] at 
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
[01:53:02]W: [org.apache.ignite:ignite-clients] at 
java.io.ObjectInputStream.readObject(ObjectInputStream.java:431)
[01:53:02]W: [org.apache.ignite:ignite-clients] at 
org.apache.ignite.marshaller.jdk.JdkMarshaller.unmarshal0(JdkMarshaller.java:141)
[01:53:02]W: [org.apache.ignite:ignite-clients] at 
org.apache.ignite.marshaller.AbstractNodeNameAwareMarshaller.unmarshal(AbstractNodeNameAwareMarshaller.java:93)
[01:53:02]W: [org.apache.ignite:ignite-clients] at 
org.apache.ignite.marshaller.jdk.JdkMarshaller.unmarshal0(JdkMarshaller.java:163)
[01:53:02]W: [org.apache.ignite:ignite-clients] at 
org.apache.ignite.marshaller.AbstractNodeNameAwareMarshaller.unmarshal(AbstractNodeNameAwareMarshaller.java:81)
[01:53:02]W: [org.apache.ignite:ignite-clients] at 
org.apache.ignite.internal.util.IgniteUtils.unmarshal(IgniteUtils.java:10039)
[01:53:02]W: [org.apache.ignite:ignite-clients] at 
org.apache.ignite.internal.processors.cache.CacheConfigurationEnricher.deserialize(CacheConfigurationEnricher.java:151)
[01:53:02]W: [org.apache.ignite:ignite-clients] at 
org.apache.ignite.internal.processors.cache.CacheConfigurationEnricher.enrich(CacheConfigurationEnricher.java:122)
[01:53:02]W: [org.apache.ignite:ignite-clients] at 
org.apache.ignite.internal.processors.cache.CacheConfigurationEnricher.enrichFully(CacheConfigurationEnricher.java:143)
[01:53:02]W: [org.apache.ignite:ignite-clients] at 
org.apache.ignite.internal.processors.cache.GridCacheProcessor.getConfigFromTemplate(GridCacheProcessor.java:3776)
[01:53:02]W: [org.apache.ignite:ignite-clients] at 
org.apache.ignite.internal.processors.query.GridQueryProcessor.dynamicTableCreate(GridQueryProcessor.java:1549)
[01:53:02]W: [org.apache.ignite:ignite-clients] at 
org.apache.ignite.internal.processors.query.h2.CommandProcessor.runCommandH2(CommandProcessor.java:437)
[01:53:02]W: [org.apache.ignite:ignite-clients] at 
org.apache.ignite.internal.processors.query.h2.CommandProcessor.runCommand(CommandProcessor.java:195)
[01:53:02]W: [org.apache.ignite:ignite-clients] at 
org.apache.ignite.internal.processors.query.h2.IgniteH2Indexing.executeCommand(IgniteH2Indexing.java:954)
[01:53:02]W: [org.apache.ignite:ignite-clients] at 
org.apache.ignite.internal.processors.query.h2.IgniteH2Indexing.querySqlFields(IgniteH2Indexing.java:1038)
[01:53:02]W: [org.apache.ignite:ignite-clients] at 
org.apache.ignite.internal.processors.query.GridQueryProcessor$3.applyx(GridQueryProcessor.java:2292)
[01:53:02]W: [org.apache.ignite:ignite-clients] at 
org.apache.ignite.internal.processors.query.GridQueryProcessor$3.applyx(GridQueryProcessor.java:2288)
[01:53:02]W: [org.apache.ignite:ignite-clients] at 
org.apache.ignite.internal.util.lang.IgniteOutClosureX.apply(IgniteOutClosureX.java:36)
[01:53:02]W: [org.apache.ignite:ignite-clients] at 
org.apache.ignite.internal.processors.query.GridQueryProcessor.executeQuery(GridQueryProcessor.java:2804)
[01:53:02]W: 

[jira] [Assigned] (IGNITE-11743) Stopping caches concurrently with node join may lead to crash of the node

2019-04-17 Thread Pavel Kovalenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-11743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Kovalenko reassigned IGNITE-11743:


Assignee: Pavel Kovalenko  (was: Sergey Chugunov)

> Stopping caches concurrently with node join may lead to crash of the node
> -
>
> Key: IGNITE-11743
> URL: https://issues.apache.org/jira/browse/IGNITE-11743
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 2.7
>Reporter: Sergey Chugunov
>Assignee: Pavel Kovalenko
>Priority: Major
> Fix For: 2.8
>
> Attachments: IgnitePdsNodeRestartCacheCreateTest.java
>
>
> When an existing cache is stopped (e.g. via a call to Ignite#destroyCache(String 
> name)), this action is distributed across the cluster by the discovery mechanism (and 
> is processed from the *disco-notifier-worker* thread).
> At the same time, a joining node prepares to start caches from the *exchange-worker* 
> thread.
> If a cache stop request arrives at the new node right in the middle of cache 
> start preparation, it may lead to an exception in FilePageStoreManager like the one 
> below and crash the node.
> Test reproducing the issue is attached.
> {noformat}
> class org.apache.ignite.IgniteCheckedException: Failed to get page store for 
> the given cache ID (cache has not been started): -1422502786
>   at 
> org.apache.ignite.internal.processors.cache.persistence.file.FilePageStoreManager.getStore(FilePageStoreManager.java:1132)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.file.FilePageStoreManager.read(FilePageStoreManager.java:482)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.file.FilePageStoreManager.read(FilePageStoreManager.java:469)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.acquirePage(PageMemoryImpl.java:854)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.acquirePage(PageMemoryImpl.java:681)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager.getOrAllocateCacheMetas(GridCacheOffheapManager.java:869)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager.initDataStructures(GridCacheOffheapManager.java:128)
>   at 
> org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.start(IgniteCacheOffheapManagerImpl.java:193)
>   at 
> org.apache.ignite.internal.processors.cache.CacheGroupContext.start(CacheGroupContext.java:1043)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheProcessor.startCacheGroup(GridCacheProcessor.java:2829)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheProcessor.getOrCreateCacheGroupContext(GridCacheProcessor.java:2557)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheProcessor.prepareCacheContext(GridCacheProcessor.java:2387)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheProcessor.lambda$null$6a5b31b9$1(GridCacheProcessor.java:2209)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheProcessor.lambda$prepareStartCaches$5(GridCacheProcessor.java:2130)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheProcessor.lambda$prepareStartCaches$926b6886$1(GridCacheProcessor.java:2206)
>   at 
> org.apache.ignite.internal.util.IgniteUtils.lambda$null$1(IgniteUtils.java:10874)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (IGNITE-11732) Multi-merged partitions exchange future may hang if node left event is received last

2019-04-12 Thread Pavel Kovalenko (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-11732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16816374#comment-16816374
 ] 

Pavel Kovalenko commented on IGNITE-11732:
--

[~agoncharuk] Looks good to me. Please proceed with merge.

> Multi-merged partitions exchange future may hang if node left event is 
> received last
> 
>
> Key: IGNITE-11732
> URL: https://issues.apache.org/jira/browse/IGNITE-11732
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexey Goncharuk
>Assignee: Alexey Goncharuk
>Priority: Major
> Fix For: 2.8
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The scenario is similar to IGNITE-11204, but several exchanges should be 
> merged. If in this case a merged-exchange node leaves and all other nodes' 
> messages are already received, the exchange will not be completed because 
> {{F.isEmpty(mergedJoinExchMsgs)}} is {{false}}. 
> Looks like we should decrement {{awaitMergedMsgs}} and check this field 
> for {{0}}.
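
A minimal sketch of the decrement-and-check idea described above (field and callback names are illustrative, not the actual GridDhtPartitionsExchangeFuture members): the same counter is decremented both when a merged message arrives and when its sender leaves, and the exchange completes as soon as it reaches zero.

{code:java}
import java.util.concurrent.atomic.AtomicInteger;

/** Minimal sketch: complete the exchange when every awaited merged message is accounted for. */
class MergedMessagesTracker {
    private final AtomicInteger awaitMergedMsgs;
    private final Runnable onAllReceived;

    MergedMessagesTracker(int awaited, Runnable onAllReceived) {
        this.awaitMergedMsgs = new AtomicInteger(awaited);
        this.onAllReceived = onAllReceived;
    }

    /** Called when a merged-join message arrives from a node. */
    void onMessageReceived() { decrementAndCheck(); }

    /** Called when a node whose message we were waiting for leaves the cluster. */
    void onNodeLeft() { decrementAndCheck(); }

    private void decrementAndCheck() {
        if (awaitMergedMsgs.decrementAndGet() == 0)
            onAllReceived.run();
    }
}
{code}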



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (IGNITE-10799) Optimize affinity initialization/re-calculation

2019-04-01 Thread Pavel Kovalenko (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-10799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16806922#comment-16806922
 ] 

Pavel Kovalenko commented on IGNITE-10799:
--

Blockers from Ignite TC Bot are not related to my changes.
[~agoncharuk] Your comments have been addressed. Could you please take a look at the change again?

> Optimize affinity initialization/re-calculation
> ---
>
> Key: IGNITE-10799
> URL: https://issues.apache.org/jira/browse/IGNITE-10799
> Project: Ignite
>  Issue Type: Improvement
>  Components: cache
>Affects Versions: 2.4
>Reporter: Pavel Kovalenko
>Assignee: Pavel Kovalenko
>Priority: Major
> Fix For: 2.8
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In case of persistence enabled and a baseline is set we have 2 main 
> approaches to recalculate affinity:
> {noformat}
> org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager#onServerJoinWithExchangeMergeProtocol
> org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager#onServerLeftWithExchangeMergeProtocol
> {noformat}
> Both of them follow the same recalculation approach:
> 1) Take the current baseline (ideal assignment).
> 2) Filter out offline nodes from it.
> 3) Choose new primary nodes if the previous ones went away.
> 4) Place temporal primary nodes into the late affinity assignment set.
> Looking at the implementation details, we may notice that we do a lot of 
> unnecessary online-node cache lookups and array list copies. The performance 
> becomes too slow when we recalculate affinity for replicated caches (it 
> takes P * N on each node, where P is the partition count and N is the number 
> of nodes in the cluster). In case of a large partition count or a large cluster, 
> it may take a few seconds, which is unacceptable, because this process happens 
> during PME and freezes ongoing cluster operations.
> We should investigate possible bottlenecks and improve the performance of 
> affinity recalculation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (IGNITE-9913) Prevent data updates blocking in case of backup BLT server node leave

2019-04-01 Thread Pavel Kovalenko (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-9913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16806597#comment-16806597
 ] 

Pavel Kovalenko commented on IGNITE-9913:
-

[~NSAmelchev]
I've reviewed your changes. I have a question regarding the conditions under which this 
optimization is disabled.
Why is local affinity calculation turned off when there are moving 
partitions in the topology and affinity assignments are not equal to the ideal ones?


> Prevent data updates blocking in case of backup BLT server node leave
> -
>
> Key: IGNITE-9913
> URL: https://issues.apache.org/jira/browse/IGNITE-9913
> Project: Ignite
>  Issue Type: Improvement
>  Components: general
>Reporter: Ivan Rakov
>Assignee: Amelchev Nikita
>Priority: Major
> Fix For: 2.8
>
> Attachments: 9913_yardstick.png, master_yardstick.png
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> The Ignite cluster performs distributed partition map exchange when any server 
> node leaves or joins the topology.
> Distributed PME blocks all updates and may take a long time. If all 
> partitions are assigned according to the baseline topology and a server node 
> leaves, there's no actual need to perform distributed PME: every cluster node 
> is able to recalculate new affinity assignments and partition states locally. 
> If we implement such a lightweight PME and handle mapping and lock requests 
> on the new topology version correctly, updates won't be stopped (except updates 
> of partitions that lost their primary copy).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (IGNITE-11011) Initialize components with grid disco data when NodeAddFinished message is received

2019-03-28 Thread Pavel Kovalenko (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-11011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16803653#comment-16803653
 ] 

Pavel Kovalenko commented on IGNITE-11011:
--

[~sergey-chugunov] Thank you for contribution. Merged to master.

> Initialize components with grid disco data when NodeAddFinished message is 
> received
> ---
>
> Key: IGNITE-11011
> URL: https://issues.apache.org/jira/browse/IGNITE-11011
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Sergey Chugunov
>Assignee: Sergey Chugunov
>Priority: Major
> Fix For: 2.8
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> There is an issue when the CacheProcessor on a fresh coordinator (the very first 
> node in a new topology) receives grid discovery data from another cluster that 
> died before this node joined its topology but after sending the NodeAdded 
> message.
> IGNITE-10878 fixes it by filtering cache descriptors and cache groups in 
> GridCacheProcessor, which is not a generic solution.
> To fix the issue in a truly generic way, the node should initialize its components 
> (including the cache processor) not on receiving the NodeAdded message but on 
> receiving the NodeAddFinished message.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (IGNITE-11465) Multiple client leave/join events may wipe affinity assignment history and cause transactions fail

2019-03-25 Thread Pavel Kovalenko (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-11465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16800531#comment-16800531
 ] 

Pavel Kovalenko commented on IGNITE-11465:
--

[~ivan.glukos] Thank you for contribution. Looks good to me. Please proceed 
with merge.

> Multiple client leave/join events may wipe affinity assignment history and 
> cause transactions fail
> --
>
> Key: IGNITE-11465
> URL: https://issues.apache.org/jira/browse/IGNITE-11465
> Project: Ignite
>  Issue Type: Bug
>Reporter: Ivan Rakov
>Assignee: Ivan Rakov
>Priority: Major
> Fix For: 2.8
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We keep a history of GridAffinityAssignmentCache#MAX_HIST_SIZE affinity 
> assignments. However, a flood of client joins/leaves may wipe it out entirely 
> and cause a transaction that was started before the flood to fail or hang due to 
> the following exception:
> {code:java}
> if (cache == null || cache.topologyVersion().compareTo(topVer) > 
> 0) {
> throw new IllegalStateException("Getting affinity for 
> topology version earlier than affinity is " +
> "calculated [locNode=" + ctx.discovery().localNode() +
> ", grp=" + cacheOrGrpName +
> ", topVer=" + topVer +
> ", head=" + head.get().topologyVersion() +
> ", history=" + affCache.keySet() +
> ']');
> }
> {code}
> The history is limited in order to prevent JVM heap overflow. At the same time, 
> only "server event" affinity assignments are heavy: "client event" 
> assignments are just shallow copies of "server event" assignments.
> I suggest limiting the history by the number of "server event" assignments.
> Also, considering the provided fix, I don't see any need to keep 500 items in 
> the history. I propose changing the history size to 50.
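
A minimal sketch of the proposed eviction policy (class, field and constant names are illustrative, not the actual GridAffinityAssignmentCache code): only heavy "server event" entries count against the limit, so a flood of lightweight "client event" entries can never wipe out the history on its own.

{code:java}
import java.util.Map;
import java.util.TreeMap;

/** Minimal sketch: affinity history limited by the number of heavy "server event" entries. */
class AffinityHistory<V> {
    private static final int MAX_SERVER_EVT_HIST = 50;   // proposed limit from the issue

    private final TreeMap<Long, Entry<V>> hist = new TreeMap<>();
    private int serverEvtCnt;

    static final class Entry<V> {
        final V assignment;
        final boolean serverEvt;   // true for heavy server-event assignments
        Entry(V assignment, boolean serverEvt) { this.assignment = assignment; this.serverEvt = serverEvt; }
    }

    void add(long topVer, V assignment, boolean serverEvt) {
        hist.put(topVer, new Entry<>(assignment, serverEvt));
        if (serverEvt)
            serverEvtCnt++;

        // Evict oldest entries only while there are too many heavy server-event assignments.
        while (serverEvtCnt > MAX_SERVER_EVT_HIST) {
            Map.Entry<Long, Entry<V>> oldest = hist.pollFirstEntry();
            if (oldest.getValue().serverEvt)
                serverEvtCnt--;
        }
    }
}
{code}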



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (IGNITE-1903) Cache configuration is serialized to nodes whether they require it or not

2019-03-20 Thread Pavel Kovalenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-1903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Kovalenko reassigned IGNITE-1903:
---

Assignee: Pavel Kovalenko

> Cache configuration is serialized to nodes whether they require it or not
> -
>
> Key: IGNITE-1903
> URL: https://issues.apache.org/jira/browse/IGNITE-1903
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 1.5.0.final
>Reporter: Michael Griggs
>Assignee: Pavel Kovalenko
>Priority: Major
>  Labels: community, usability
>
> See User discussion thread:  
> http://apache-ignite-users.70518.x6.nabble.com/CacheStore-being-serialized-to-client-td1931.html
> Brief summary:  When a grid client joins the grid (clientMode=true) it 
> receives a message from the server node(s) on the grid that contains the 
> serialized CacheStore implementation object.  If the client does not have 
> this class on its CLASSPATH (and there is no reason it should, as it is a 
> client) then the de-serialization of this message will fail, causing this 
> exception:
> {code}SEVERE: Failed to unmarshal discovery data for component: 1 
> class org.apache.ignite.IgniteCheckedException: Failed to find class with 
> given class loader for unmarshalling (make sure same versions of all classes 
> are available on all nodes or enable peer-class-loading): 
> sun.misc.Launcher$AppClassLoader@14dad5dc 
> at 
> org.apache.ignite.marshaller.jdk.JdkMarshaller.unmarshal(JdkMarshaller.java:104)
>  
> at 
> org.apache.ignite.marshaller.AbstractMarshaller.unmarshal(AbstractMarshaller.java:67)
>  
> at 
> org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.onExchange(TcpDiscoverySpi.java:1529)
>  
> at 
> org.apache.ignite.spi.discovery.tcp.ClientImpl$MessageWorker.processNodeAddFinishedMessage(ClientImpl.java:1317)
>  
> at 
> org.apache.ignite.spi.discovery.tcp.ClientImpl$MessageWorker.processDiscoveryMessage(ClientImpl.java:1229)
>  
> at 
> org.apache.ignite.spi.discovery.tcp.ClientImpl$MessageWorker.body(ClientImpl.java:1199)
>  
> at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62) 
> Caused by: java.lang.ClassNotFoundException: 
> c.g.r.cachewrapper.ignite.CacheMissHandlerIgnite
> {code}
> where {{c.g.r.cachewrapper.ignite.CacheMissHandlerIgnite}} is the CacheStore 
> implementation.
> The ostensible reason for the CacheStore serialization is so that clients of 
> a TRANSACTIONAL cache can begin the transaction on the underlying store.  
> The only current solution to this is to add the grid node's CacheStore 
> implementation class definition to the CLASSPATH of the client.  This creates 
> an *undesirable coupling* between server and client.
> ---
> *UPDATE (copy-paste from comment below)*
> This is actually a more generic issue. When a new node joins (client or 
> server), all existing cache configurations (which include cache stores) are 
> sent to this node. It then deserializes them during startup, which can cause 
> exceptions on clients or servers where the cache is not supposed to be deployed, 
> as defined by the node filter.
> As a solution we can do the following:
> * During discovery, send the node filter and the cache store factory in binary format 
> along with the cache configuration, not as parts of it.
> * On the server, check the node filter first and deserialize the cache configuration and 
> cache store only if it returns true. In case of error, STOP the join process (now 
> we just print the exception in the log and go on, which is very error-prone).
> * On the client, do not deserialize the cache configuration and cache store until 
> the user's code tries to actually use the cache (calls Ignite.cache()). If the cache is 
> ATOMIC, never deserialize the cache store.
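
A minimal, self-contained sketch of the lazy-deserialization idea from the bullets above (the class and the node-filter representation are assumptions for illustration, not Ignite's actual wiring): the store factory travels as bytes, and a node only deserializes it if its node filter accepts the local node.

{code:java}
import java.io.*;
import java.util.function.Predicate;

/** Minimal sketch: ship the cache store factory serialized; deserialize only where needed. */
class LazyStoreFactoryHolder {
    private final byte[] factoryBytes;   // serialized store factory, never touched on filtered-out nodes
    private volatile Object factory;     // deserialized lazily, at most once

    LazyStoreFactoryHolder(Serializable storeFactory) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(storeFactory);
        }
        factoryBytes = bos.toByteArray();
    }

    /** Deserializes the factory only if the node filter says this node runs the cache. */
    Object factoryFor(String consistentId, Predicate<String> nodeFilter) throws IOException, ClassNotFoundException {
        if (!nodeFilter.test(consistentId))
            return null;                 // filtered-out node: the store class is never loaded here

        if (factory == null) {
            try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(factoryBytes))) {
                factory = ois.readObject();
            }
        }
        return factory;
    }
}
{code}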



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (IGNITE-1903) Cache configuration is serialized to nodes whether they require it or not

2019-03-20 Thread Pavel Kovalenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-1903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Kovalenko updated IGNITE-1903:

Fix Version/s: 2.8

> Cache configuration is serialized to nodes whether they require it or not
> -
>
> Key: IGNITE-1903
> URL: https://issues.apache.org/jira/browse/IGNITE-1903
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 1.5.0.final
>Reporter: Michael Griggs
>Assignee: Pavel Kovalenko
>Priority: Major
>  Labels: community, usability
> Fix For: 2.8
>
>
> See User discussion thread:  
> http://apache-ignite-users.70518.x6.nabble.com/CacheStore-being-serialized-to-client-td1931.html
> Brief summary:  When a grid client joins the grid (clientMode=true) it 
> receives a message from the server node(s) on the grid that contains the 
> serialized CacheStore implementation object.  If the client does not have 
> this class on its CLASSPATH (and there is no reason it should, as it is a 
> client) then the de-serialization of this message will fail, causing this 
> exception:
> {code}SEVERE: Failed to unmarshal discovery data for component: 1 
> class org.apache.ignite.IgniteCheckedException: Failed to find class with 
> given class loader for unmarshalling (make sure same versions of all classes 
> are available on all nodes or enable peer-class-loading): 
> sun.misc.Launcher$AppClassLoader@14dad5dc 
> at 
> org.apache.ignite.marshaller.jdk.JdkMarshaller.unmarshal(JdkMarshaller.java:104)
>  
> at 
> org.apache.ignite.marshaller.AbstractMarshaller.unmarshal(AbstractMarshaller.java:67)
>  
> at 
> org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.onExchange(TcpDiscoverySpi.java:1529)
>  
> at 
> org.apache.ignite.spi.discovery.tcp.ClientImpl$MessageWorker.processNodeAddFinishedMessage(ClientImpl.java:1317)
>  
> at 
> org.apache.ignite.spi.discovery.tcp.ClientImpl$MessageWorker.processDiscoveryMessage(ClientImpl.java:1229)
>  
> at 
> org.apache.ignite.spi.discovery.tcp.ClientImpl$MessageWorker.body(ClientImpl.java:1199)
>  
> at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62) 
> Caused by: java.lang.ClassNotFoundException: 
> c.g.r.cachewrapper.ignite.CacheMissHandlerIgnite
> {code}
> where {{c.g.r.cachewrapper.ignite.CacheMissHandlerIgnite}} is the CacheStore 
> implementation.
> The ostensible reason for the CacheStore serialization is so that clients of 
> a TRANSACTIONAL cache can begin the transaction on the underlying store.  
> The only current solution to this is to add the grid node's CacheStore 
> implementation class definition to the CLASSPATH of the client.  This creates 
> an *undesirable coupling* between server and client.
> ---
> *UPDATE (copy-paste from comment below)*
> This is actually a more generic issue. When a new node joins (client or 
> server), all existing cache configurations (which include cache stores) are 
> sent to this node. It then deserializes them during startup, which can cause 
> exceptions on clients or servers where the cache is not supposed to be deployed, 
> as defined by the node filter.
> As a solution we can do the following:
> * During discovery, send the node filter and the cache store factory in binary format 
> along with the cache configuration, not as parts of it.
> * On the server, check the node filter first and deserialize the cache configuration and 
> cache store only if it returns true. In case of error, STOP the join process (now 
> we just print the exception in the log and go on, which is very error-prone).
> * On the client, do not deserialize the cache configuration and cache store until 
> the user's code tries to actually use the cache (calls Ignite.cache()). If the cache is 
> ATOMIC, never deserialize the cache store.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (IGNITE-11555) Unable to await partitions release latch caused by coordinator failover

2019-03-19 Thread Pavel Kovalenko (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-11555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16795845#comment-16795845
 ] 

Pavel Kovalenko commented on IGNITE-11555:
--

[~agoncharuk] LGTM. Please proceed with merge.

> Unable to await partitions release latch caused by coordinator failover
> ---
>
> Key: IGNITE-11555
> URL: https://issues.apache.org/jira/browse/IGNITE-11555
> Project: Ignite
>  Issue Type: Bug
>Reporter: Alexey Goncharuk
>Assignee: Alexey Goncharuk
>Priority: Critical
> Fix For: 2.8
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently exchange latches (both server and client) are deleted when the 
> latch is completed. This leads to a hang in the following scenario:
> 1) A grid with several nodes starts exchange latch sync
> 2) All nodes send acks to the coordinator
> 3) The coordinator processes the acks and sends final acks to some of the nodes
> 4) These nodes process the acks, complete and delete client latches
> 5) The coordinator fails
> 6) Nodes which did not receive final acks re-send the ack to the new coordinator
> 7) Since the new coordinator already completed and deleted the client latch, 
> it does not process the re-sent ack correctly and only adds it to the pending 
> messages.
> Looks like the root cause of this issue is latch deletion on final ack. We 
> can safely delete the latch only when all nodes are guaranteed to process the 
> messages. Luckily, since the latch is tied to the exchange process, we can 
> safely delete the latch when the corresponding exchange completes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (IGNITE-8459) Searching checkpoint history for WAL rebalance is broken

2019-03-07 Thread Pavel Kovalenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-8459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Kovalenko resolved IGNITE-8459.
-
Resolution: Won't Fix

The problem is already fixed by IGNITE-10078.

> Searching checkpoint history for WAL rebalance is broken
> 
>
> Key: IGNITE-8459
> URL: https://issues.apache.org/jira/browse/IGNITE-8459
> Project: Ignite
>  Issue Type: Bug
>  Components: cache
>Affects Versions: 2.5
>Reporter: Pavel Kovalenko
>Priority: Critical
> Fix For: 2.8
>
>
> Currently, the mechanism that searches for available checkpoint records in the WAL to build 
> history for WAL rebalance is broken. It means that WAL (historical) rebalance 
> will never find history for rebalance and full rebalance will always be used.
> This mechanism was broken in 
> https://github.com/apache/ignite/commit/ec04cd174ed5476fba83e8682214390736321b37
>  for unclear reasons.
> If we swap the following two code blocks (database().beforeExchange() and 
> exchCtx if block):
> {noformat}
> /* It is necessary to run database callback before all topology 
> callbacks.
>In case of persistent store is enabled we first restore partitions 
> presented on disk.
>We need to guarantee that there are no partition state changes 
> logged to WAL before this callback
>to make sure that we correctly restored last actual states. */
> cctx.database().beforeExchange(this);
> if (!exchCtx.mergeExchanges()) {
> for (CacheGroupContext grp : cctx.cache().cacheGroups()) {
> if (grp.isLocal() || cacheGroupStopping(grp.groupId()))
> continue;
> // It is possible affinity is not initialized yet if node 
> joins to cluster.
> if (grp.affinity().lastVersion().topologyVersion() > 0)
> grp.topology().beforeExchange(this, !centralizedAff && 
> !forceAffReassignment, false);
> }
> }
> {noformat}
> the search mechanism starts to work correctly. It is currently unclear 
> why this happens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-11455) Introduce free lists rebuild mechanism

2019-02-28 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-11455:


 Summary: Introduce free lists rebuild mechanism
 Key: IGNITE-11455
 URL: https://issues.apache.org/jira/browse/IGNITE-11455
 Project: Ignite
  Issue Type: Improvement
  Components: cache
Affects Versions: 2.0
Reporter: Pavel Kovalenko
 Fix For: 2.8


Sometimes the state of free lists becomes invalid, as in 
https://issues.apache.org/jira/browse/IGNITE-10669 . This leads the node to an 
unrecoverable state. At the same time, free lists don't hold any critical 
information and can be built from scratch using existing data pages. It 
may be useful to introduce a mechanism to rebuild free lists using an optimal 
algorithm for scanning partition data pages.
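
A minimal sketch of the rebuild idea (the data model is a deliberate simplification and an assumption; it is not Ignite's page format or free-list implementation): scan every data page of a partition, read how much free space it has left, and re-bucket the page into a freshly built free list.

{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Minimal sketch: rebuild a free list from scratch by re-bucketing pages by free space. */
class FreeListRebuilder {
    static final int BUCKETS = 8;
    static final int PAGE_SIZE = 4096;

    /**
     * @param pageFreeSpace pageId -> free bytes left on the page (what a real scan
     *                      would read from the page headers).
     * @return bucket index -> list of page ids, i.e. a rebuilt free list.
     */
    static Map<Integer, List<Long>> rebuild(Map<Long, Integer> pageFreeSpace) {
        Map<Integer, List<Long>> freeList = new HashMap<>();

        for (Map.Entry<Long, Integer> e : pageFreeSpace.entrySet()) {
            int bucket = Math.min(BUCKETS - 1, e.getValue() * BUCKETS / PAGE_SIZE);
            freeList.computeIfAbsent(bucket, b -> new ArrayList<>()).add(e.getKey());
        }
        return freeList;
    }
}
{code}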



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (IGNITE-10821) Caching affinity with affinity similarity key is broken

2019-01-14 Thread Pavel Kovalenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-10821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Kovalenko reassigned IGNITE-10821:


Assignee: (was: Pavel Kovalenko)

> Caching affinity with affinity similarity key is broken
> ---
>
> Key: IGNITE-10821
> URL: https://issues.apache.org/jira/browse/IGNITE-10821
> Project: Ignite
>  Issue Type: Bug
>  Components: cache
>Affects Versions: 2.8
>Reporter: Pavel Kovalenko
>Priority: Major
> Fix For: 2.8
>
>
> When some cache groups have the same affinity function, number of partitions, 
> backups and the same node filter, they can use the same affinity distribution 
> without the need for explicit recalculation. These parameters are called the 
> "affinity similarity key". 
> During affinity recalculation, caching affinity using this key may 
> speed up the process.
> However, after the https://issues.apache.org/jira/browse/IGNITE-9561 merge this 
> mechanism became broken, because parallel execution of affinity 
> recalculation for the similar affinity groups leads to caching affinity 
> misses.
> To fix it we should couple together similar affinity groups and run affinity 
> recalculation for them in one thread, caching previous results.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (IGNITE-10821) Caching affinity with affinity similarity key is broken

2019-01-14 Thread Pavel Kovalenko (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-10821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16742752#comment-16742752
 ] 

Pavel Kovalenko edited comment on IGNITE-10821 at 1/15/19 5:30 AM:
---

The issue is partially completed. What is done at the moment?
1) Introduced the IdealAffinityAssignment abstraction. This abstraction remembers 
primary partition nodes.
2) Introduced the method "calculateWithCachedIdealAffinity". In this method, the ideal 
affinity assignment for cache groups with similar affinity is calculated once 
and cached. This saves time and memory.
3) All affinity recalculation methods are reworked with the newly introduced 
methods.
4) Non-coordinator node affinity recalculations now use 
calculateWithCachedIdealAffinity where possible.
5) initAffinityOnNodeJoin time consumption is optimized. Now it works in O(Px 
* N), where Px is the total number of partitions that only changed the primary 
node, instead of the whole O(P * N).

What should be completed?
1) The calculateWithCachedIdealAffinity method should be used wherever possible. 
Currently, it's not used in coordinator node affinity manipulations due to 
CacheGroupHolder type restrictions. This should be fixed.
2) onServerLeftWithExchangeMergeProtocol shouldn't always use the general 
(partitions availability) approach, as it is slow. If the exchange has only server-node-left 
events, we can keep the ideal assignment and change the assignment only for 
partitions where the left nodes were primaries, using topology owners information 
(similar to the initAffinityOnNodeJoin approach).
3) If the exchange has only server-left events, the affinity message for other 
nodes in the full message should contain only the changed primary nodes for partitions, 
instead of the whole assignment list. NOTE: there can be problems with backward 
compatibility.
4) In the rare case when the exchange has both server-left and server-join events, 
onReassignmentEnforced may be used (just for simplification); this should also 
be optimized and fixed in the near future. 
5) Introduce tests checking that for similar affinity cache groups the 
IdealAffinityAssignment object is the same.


was (Author: jokser):
The issue is partially completed. What is done at the moment?
1) Introduced IdealAffinityAssignment abstraction. This abstraction remembers 
primary partition nodes.
2) Introduced method "calculateWithCachedIdealAffinity". In this method, ideal 
affinity assignment for cache groups with similar affinity is calculated once 
and cached. This saves time/memory consumption.
3) All affinity recalculation methods are reworked with newly introduced 
methods.
4) Non-cooridnator node affinity recalculations are now using 
calculateWithCachedIdealAffinity where it's possible.
5) initAffinityOnNodeJoin time consumption is optimized. Now it works in O (Px 
* N), where Px - total number of partitions that changed the primary node, 
instead of whole (P * N).

What should be completed?
1) calculateWithCachedIdealAffinity method should be used where it's possible. 
Currently, it's not used in coordinator node affinity manipulations due to 
CacheGroupHolder type restrictions. This should be fixed.
2) onServerLeftWithExchangeMergeProtocol shouldn't always use general 
(partitions availability) approach as it slow. If Exchange has only server node 
left events, we can keep ideal assignment and change assignment only for 
partitions where left nodes were primaries using topology owners information 
(similar to initAffinityOnNodeJoin approach).
3) If Exchange has the only server left events, affinity message for other 
nodes in full message should contain only changed primary nodes for partitions, 
instead of whole assignment list. NOTE: there can be problems with backward 
compatibility.
4) In the rare case when Exchange has both server left/server join events 
onReassignmentEnforced may be used (just for simplification), this should also 
be optimized and fixed in nearest future. 
5) Introduce tests checking that for similar affinity cache groups 
IdealAffinityAssignment object is the same.

> Caching affinity with affinity similarity key is broken
> ---
>
> Key: IGNITE-10821
> URL: https://issues.apache.org/jira/browse/IGNITE-10821
> Project: Ignite
>  Issue Type: Bug
>  Components: cache
>Affects Versions: 2.8
>Reporter: Pavel Kovalenko
>Priority: Major
> Fix For: 2.8
>
>
> When some cache groups have the same affinity function, number of partitions, 
> backups and the same node filter, they can use the same affinity distribution 
> without the need for explicit recalculation. These parameters are called the 
> "affinity similarity key". 
> During affinity recalculation, caching affinity using this key may 
> speed up the process.
> However, after 

[jira] [Commented] (IGNITE-10821) Caching affinity with affinity similarity key is broken

2019-01-14 Thread Pavel Kovalenko (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-10821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16742752#comment-16742752
 ] 

Pavel Kovalenko commented on IGNITE-10821:
--

The issue is partially completed. What is done at the moment?
1) Introduced the IdealAffinityAssignment abstraction. This abstraction remembers 
primary partition nodes.
2) Introduced the method "calculateWithCachedIdealAffinity". In this method, the ideal 
affinity assignment for cache groups with similar affinity is calculated once 
and cached. This saves time and memory.
3) All affinity recalculation methods are reworked with the newly introduced 
methods.
4) Non-coordinator node affinity recalculations now use 
calculateWithCachedIdealAffinity where possible.
5) initAffinityOnNodeJoin time consumption is optimized. Now it works in O(Px 
* N), where Px is the total number of partitions that changed the primary node, 
instead of the whole O(P * N).

What should be completed?
1) The calculateWithCachedIdealAffinity method should be used wherever possible. 
Currently, it's not used in coordinator node affinity manipulations due to 
CacheGroupHolder type restrictions. This should be fixed.
2) onServerLeftWithExchangeMergeProtocol shouldn't always use the general 
(partitions availability) approach, as it is slow. If the exchange has only server-node-left 
events, we can keep the ideal assignment and change the assignment only for 
partitions where the left nodes were primaries, using topology owners information 
(similar to the initAffinityOnNodeJoin approach).
3) If the exchange has only server-left events, the affinity message for other 
nodes in the full message should contain only the changed primary nodes for partitions, 
instead of the whole assignment list. NOTE: there can be problems with backward 
compatibility.
4) In the rare case when the exchange has both server-left and server-join events, 
onReassignmentEnforced may be used (just for simplification); this should also 
be optimized and fixed in the near future. 
5) Introduce tests checking that for similar affinity cache groups the 
IdealAffinityAssignment object is the same.
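
A minimal sketch of the ideal-assignment caching idea from points 2 and 4 above (the class and method names here are illustrative, not the actual CacheAffinitySharedManager code): cache groups with an equal "affinity similarity key" compute the ideal assignment once and reuse it.

{code:java}
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.function.Supplier;

/** Minimal sketch: calculate the ideal assignment once per affinity similarity key. */
class IdealAssignmentCache {
    /** Similarity key -> ideal assignment (partition -> ordered list of nodes). */
    private final Map<Object, List<List<UUID>>> cache = new HashMap<>();

    /** Groups with equal similarity keys reuse the already calculated assignment. */
    List<List<UUID>> calculateWithCache(Object similarityKey, Supplier<List<List<UUID>>> calc) {
        return cache.computeIfAbsent(similarityKey, k -> calc.get());
    }

    /** Called when a topology change invalidates previously calculated assignments. */
    void reset() { cache.clear(); }
}
{code}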

> Caching affinity with affinity similarity key is broken
> ---
>
> Key: IGNITE-10821
> URL: https://issues.apache.org/jira/browse/IGNITE-10821
> Project: Ignite
>  Issue Type: Bug
>  Components: cache
>Affects Versions: 2.8
>Reporter: Pavel Kovalenko
>Assignee: Pavel Kovalenko
>Priority: Major
> Fix For: 2.8
>
>
> When some cache groups have the same affinity function, number of partitions, 
> backups and the same node filter, they can use the same affinity distribution 
> without the need for explicit recalculation. These parameters are called the 
> "affinity similarity key". 
> During affinity recalculation, caching affinity using this key may 
> speed up the process.
> However, after the https://issues.apache.org/jira/browse/IGNITE-9561 merge this 
> mechanism became broken, because parallel execution of affinity 
> recalculation for the similar affinity groups leads to caching affinity 
> misses.
> To fix it we should couple together similar affinity groups and run affinity 
> recalculation for them in one thread, caching previous results.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (IGNITE-10815) NullPointerException in InitNewCoordinatorFuture.init() leads to cluster hang

2018-12-27 Thread Pavel Kovalenko (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-10815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16729441#comment-16729441
 ] 

Pavel Kovalenko commented on IGNITE-10815:
--

[~agoncharuk] Thank you for review. Merged to master.

> NullPointerException in InitNewCoordinatorFuture.init() leads to cluster hang
> -
>
> Key: IGNITE-10815
> URL: https://issues.apache.org/jira/browse/IGNITE-10815
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 2.4
>Reporter: Anton Kurbanov
>Assignee: Pavel Kovalenko
>Priority: Critical
> Fix For: 2.8
>
>
> Possible scenario to reproduce:
> 1. Force a few consecutive exchange merges and finish them.
> 2. Trigger an exchange.
> 3. Shut down the coordinator node before sending/receiving the full partitions message.
>  
> Stacktrace:
> {code:java}
> 2018-12-24 15:54:02,664 sys-#48%gg% ERROR 
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture
>  - Failed to init new coordinator future: bd74f7ed-6984-4f78-9941-480df673ab77
> java.lang.NullPointerException: null
> at 
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.events(GridDhtPartitionsExchangeFuture.java:534)
>  ~[ignite-core-2.4.13.b4.jar:2.4.13.b4]
> at 
> org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager$18.applyx(CacheAffinitySharedManager.java:1790)
>  ~[ignite-core-2.4.13.b4.jar:2.4.13.b4]
> at 
> org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager$18.applyx(CacheAffinitySharedManager.java:1738)
>  ~[ignite-core-2.4.13.b4.jar:2.4.13.b4]
> at 
> org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager.forAllRegisteredCacheGroups(CacheAffinitySharedManager.java:1107)
>  ~[ignite-core-2.4.13.b4.jar:2.4.13.b4]
> at 
> org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager.initCoordinatorCaches(CacheAffinitySharedManager.java:1738)
>  ~[ignite-core-2.4.13.b4.jar:2.4.13.b4]
> at 
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.InitNewCoordinatorFuture.init(InitNewCoordinatorFuture.java:104)
>  ~[ignite-core-2.4.13.b4.jar:2.4.13.b4]
> at 
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$8$1.call(GridDhtPartitionsExchangeFuture.java:3439)
>  [ignite-core-2.4.13.b4.jar:2.4.13.b4]
> at 
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$8$1.call(GridDhtPartitionsExchangeFuture.java:3435)
>  [ignite-core-2.4.13.b4.jar:2.4.13.b4]
> at 
> org.apache.ignite.internal.util.IgniteUtils.wrapThreadLoader(IgniteUtils.java:6720)
>  [ignite-core-2.4.13.b4.jar:2.4.13.b4]
> at 
> org.apache.ignite.internal.processors.closure.GridClosureProcessor$2.body(GridClosureProcessor.java:967)
>  [ignite-core-2.4.13.b4.jar:2.4.13.b4]
> at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110) 
> [ignite-core-2.4.13.b4.jar:2.4.13.b4]
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  [?:1.8.0_171]
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  [?:1.8.0_171]
> at java.lang.Thread.run(Thread.java:748) [?:1.8.0_171]
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-10821) Caching affinity with affinity similarity key is broken

2018-12-26 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-10821:


 Summary: Caching affinity with affinity similarity key is broken
 Key: IGNITE-10821
 URL: https://issues.apache.org/jira/browse/IGNITE-10821
 Project: Ignite
  Issue Type: Bug
  Components: cache
Affects Versions: 2.8
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
 Fix For: 2.8


When some cache groups have the same affinity function, number of partitions, 
backups and the same node filter, they can use the same affinity distribution 
without the need for explicit recalculation. These parameters are called the 
"affinity similarity key". 

During affinity recalculation, caching affinity using this key may speed up 
the process.

However, after the https://issues.apache.org/jira/browse/IGNITE-9561 merge this 
mechanism became broken, because parallel execution of affinity recalculation 
for the similar affinity groups leads to caching affinity misses.

To fix it we should couple together similar affinity groups and run affinity 
recalculation for them in one thread, caching previous results.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (IGNITE-10815) NullPointerException in InitNewCoordinatorFuture.init() leads to cluster hang

2018-12-25 Thread Pavel Kovalenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-10815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Kovalenko updated IGNITE-10815:
-
Fix Version/s: 2.8

> NullPointerException in InitNewCoordinatorFuture.init() leads to cluster hang
> -
>
> Key: IGNITE-10815
> URL: https://issues.apache.org/jira/browse/IGNITE-10815
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 2.4
>Reporter: Anton Kurbanov
>Assignee: Pavel Kovalenko
>Priority: Critical
> Fix For: 2.8
>
>
> Possible scenario to reproduce:
> 1. Force a few consecutive exchange merges and finish them.
> 2. Trigger an exchange.
> 3. Shut down the coordinator node before sending/receiving the full partitions message.
>  
> Stacktrace:
> {code:java}
> 2018-12-24 15:54:02,664 sys-#48%gg% ERROR 
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture
>  - Failed to init new coordinator future: bd74f7ed-6984-4f78-9941-480df673ab77
> java.lang.NullPointerException: null
> at 
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.events(GridDhtPartitionsExchangeFuture.java:534)
>  ~[ignite-core-2.4.13.b4.jar:2.4.13.b4]
> at 
> org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager$18.applyx(CacheAffinitySharedManager.java:1790)
>  ~[ignite-core-2.4.13.b4.jar:2.4.13.b4]
> at 
> org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager$18.applyx(CacheAffinitySharedManager.java:1738)
>  ~[ignite-core-2.4.13.b4.jar:2.4.13.b4]
> at 
> org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager.forAllRegisteredCacheGroups(CacheAffinitySharedManager.java:1107)
>  ~[ignite-core-2.4.13.b4.jar:2.4.13.b4]
> at 
> org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager.initCoordinatorCaches(CacheAffinitySharedManager.java:1738)
>  ~[ignite-core-2.4.13.b4.jar:2.4.13.b4]
> at 
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.InitNewCoordinatorFuture.init(InitNewCoordinatorFuture.java:104)
>  ~[ignite-core-2.4.13.b4.jar:2.4.13.b4]
> at 
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$8$1.call(GridDhtPartitionsExchangeFuture.java:3439)
>  [ignite-core-2.4.13.b4.jar:2.4.13.b4]
> at 
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$8$1.call(GridDhtPartitionsExchangeFuture.java:3435)
>  [ignite-core-2.4.13.b4.jar:2.4.13.b4]
> at 
> org.apache.ignite.internal.util.IgniteUtils.wrapThreadLoader(IgniteUtils.java:6720)
>  [ignite-core-2.4.13.b4.jar:2.4.13.b4]
> at 
> org.apache.ignite.internal.processors.closure.GridClosureProcessor$2.body(GridClosureProcessor.java:967)
>  [ignite-core-2.4.13.b4.jar:2.4.13.b4]
> at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110) 
> [ignite-core-2.4.13.b4.jar:2.4.13.b4]
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  [?:1.8.0_171]
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  [?:1.8.0_171]
> at java.lang.Thread.run(Thread.java:748) [?:1.8.0_171]
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (IGNITE-10815) NullPointerException in InitNewCoordinatorFuture.init() leads to cluster hang

2018-12-25 Thread Pavel Kovalenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-10815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Kovalenko reassigned IGNITE-10815:


Assignee: Pavel Kovalenko

> NullPointerException in InitNewCoordinatorFuture.init() leads to cluster hang
> -
>
> Key: IGNITE-10815
> URL: https://issues.apache.org/jira/browse/IGNITE-10815
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 2.4
>Reporter: Anton Kurbanov
>Assignee: Pavel Kovalenko
>Priority: Critical
> Fix For: 2.8
>
>
> Possible scenario to reproduce:
> 1. Force a few consecutive exchange merges and finish them.
> 2. Trigger an exchange.
> 3. Shut down the coordinator node before sending/receiving the full partitions message.
>  
> Stacktrace:
> {code:java}
> 2018-12-24 15:54:02,664 sys-#48%gg% ERROR 
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture
>  - Failed to init new coordinator future: bd74f7ed-6984-4f78-9941-480df673ab77
> java.lang.NullPointerException: null
> at 
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.events(GridDhtPartitionsExchangeFuture.java:534)
>  ~[ignite-core-2.4.13.b4.jar:2.4.13.b4]
> at 
> org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager$18.applyx(CacheAffinitySharedManager.java:1790)
>  ~[ignite-core-2.4.13.b4.jar:2.4.13.b4]
> at 
> org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager$18.applyx(CacheAffinitySharedManager.java:1738)
>  ~[ignite-core-2.4.13.b4.jar:2.4.13.b4]
> at 
> org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager.forAllRegisteredCacheGroups(CacheAffinitySharedManager.java:1107)
>  ~[ignite-core-2.4.13.b4.jar:2.4.13.b4]
> at 
> org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager.initCoordinatorCaches(CacheAffinitySharedManager.java:1738)
>  ~[ignite-core-2.4.13.b4.jar:2.4.13.b4]
> at 
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.InitNewCoordinatorFuture.init(InitNewCoordinatorFuture.java:104)
>  ~[ignite-core-2.4.13.b4.jar:2.4.13.b4]
> at 
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$8$1.call(GridDhtPartitionsExchangeFuture.java:3439)
>  [ignite-core-2.4.13.b4.jar:2.4.13.b4]
> at 
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$8$1.call(GridDhtPartitionsExchangeFuture.java:3435)
>  [ignite-core-2.4.13.b4.jar:2.4.13.b4]
> at 
> org.apache.ignite.internal.util.IgniteUtils.wrapThreadLoader(IgniteUtils.java:6720)
>  [ignite-core-2.4.13.b4.jar:2.4.13.b4]
> at 
> org.apache.ignite.internal.processors.closure.GridClosureProcessor$2.body(GridClosureProcessor.java:967)
>  [ignite-core-2.4.13.b4.jar:2.4.13.b4]
> at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110) 
> [ignite-core-2.4.13.b4.jar:2.4.13.b4]
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  [?:1.8.0_171]
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  [?:1.8.0_171]
> at java.lang.Thread.run(Thread.java:748) [?:1.8.0_171]
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (IGNITE-9493) Communication error resolver shouldn't be invoked if connection with client breaks unexpectedly

2018-12-25 Thread Pavel Kovalenko (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-9493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16728629#comment-16728629
 ] 

Pavel Kovalenko commented on IGNITE-9493:
-

[~zstan] Thank you for contribution. Merged to master.

> Communication error resolver shouldn't be invoked if connection with client 
> breaks unexpectedly
> ---
>
> Key: IGNITE-9493
> URL: https://issues.apache.org/jira/browse/IGNITE-9493
> Project: Ignite
>  Issue Type: Bug
>  Components: cache, zookeeper
>Affects Versions: 2.5
>Reporter: Pavel Kovalenko
>Assignee: Stanilovsky Evgeny
>Priority: Major
> Fix For: 2.8
>
>
> Currently, we initiate the communication error resolving process even if a 
> connection between a server and a client breaks unexpectedly.
> This is an unnecessary action because client nodes are not important for cluster 
> stability. We should ignore communication errors for client and daemon nodes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (IGNITE-10799) Optimize affinity initialization/re-calculation

2018-12-24 Thread Pavel Kovalenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-10799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Kovalenko updated IGNITE-10799:
-
Affects Version/s: (was: 2.1)
   2.4

> Optimize affinity initialization/re-calculation
> ---
>
> Key: IGNITE-10799
> URL: https://issues.apache.org/jira/browse/IGNITE-10799
> Project: Ignite
>  Issue Type: Improvement
>  Components: cache
>Affects Versions: 2.4
>Reporter: Pavel Kovalenko
>Assignee: Pavel Kovalenko
>Priority: Major
> Fix For: 2.8
>
>
> In case of persistence enabled and a baseline is set we have 2 main 
> approaches to recalculate affinity:
> {noformat}
> org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager#onServerJoinWithExchangeMergeProtocol
> org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager#onServerLeftWithExchangeMergeProtocol
> {noformat}
> Both of them follow the same recalculation approach:
> 1) Take the current baseline (ideal assignment).
> 2) Filter out offline nodes from it.
> 3) Choose new primary nodes if the previous ones went away.
> 4) Place temporal primary nodes into the late affinity assignment set.
> Looking at the implementation details, we may notice that we do a lot of 
> unnecessary online-node cache lookups and array list copies. The performance 
> becomes too slow when we recalculate affinity for replicated caches (it 
> takes P * N on each node, where P is the partition count and N is the number 
> of nodes in the cluster). In case of a large partition count or a large cluster, 
> it may take a few seconds, which is unacceptable, because this process happens 
> during PME and freezes ongoing cluster operations.
> We should investigate possible bottlenecks and improve the performance of 
> affinity recalculation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-10799) Optimize affinity initialization/re-calculation

2018-12-24 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-10799:


 Summary: Optimize affinity initialization/re-calculation
 Key: IGNITE-10799
 URL: https://issues.apache.org/jira/browse/IGNITE-10799
 Project: Ignite
  Issue Type: Improvement
  Components: cache
Affects Versions: 2.1
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
 Fix For: 2.8


In case of persistence enabled and a baseline is set we have 2 main approaches 
to recalculate affinity:

{noformat}
org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager#onServerJoinWithExchangeMergeProtocol
org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager#onServerLeftWithExchangeMergeProtocol
{noformat}

Both of them follow the same recalculation approach:
1) Take the current baseline (ideal assignment).
2) Filter out offline nodes from it.
3) Choose new primary nodes if the previous ones went away.
4) Place temporal primary nodes into the late affinity assignment set.

Looking at the implementation details, we may notice that we do a lot of unnecessary 
online-node cache lookups and array list copies. The performance becomes too 
slow when we recalculate affinity for replicated caches (it takes P * N on 
each node, where P is the partition count and N is the number of nodes in the cluster). 
In case of a large partition count or a large cluster, it may take a few seconds, 
which is unacceptable, because this process happens during PME and freezes 
ongoing cluster operations.

We should investigate possible bottlenecks and improve the performance of 
affinity recalculation.
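
A minimal, self-contained sketch of the recalculation steps listed above (illustrative only, not the actual CacheAffinitySharedManager code): the ideal (baseline) assignment is filtered against a set of alive nodes once, using O(1) set lookups rather than repeated per-partition list scans and copies.

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.UUID;

/** Minimal sketch: filter the baseline assignment against alive nodes without extra copies. */
class BaselineAffinityRecalculation {
    /** @return per-partition assignment with offline nodes removed; first element is the primary. */
    static List<List<UUID>> recalculate(List<List<UUID>> idealAssignment, Set<UUID> aliveNodes) {
        List<List<UUID>> res = new ArrayList<>(idealAssignment.size());

        for (List<UUID> idealPart : idealAssignment) {
            List<UUID> part = new ArrayList<>(idealPart.size());

            for (UUID node : idealPart) {
                if (aliveNodes.contains(node))   // O(1) lookup instead of scanning node lists
                    part.add(node);
            }

            // If the ideal primary went away, the next alive owner becomes the temporal primary
            // and would be recorded for the late affinity assignment set (omitted here).
            res.add(part);
        }
        return res;
    }
}
{code}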



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (IGNITE-10003) Raise SYSTEM_WORKER_BLOCKED instead of CRITICAL_ERROR when checkpoint read lock timeout detected

2018-12-21 Thread Pavel Kovalenko (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-10003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16726568#comment-16726568
 ] 

Pavel Kovalenko commented on IGNITE-10003:
--

[~andrey-kuznetsov] Thank you for the change. Looks good to me.

> Raise SYSTEM_WORKER_BLOCKED instead of CRITICAL_ERROR when checkpoint read 
> lock timeout detected
> 
>
> Key: IGNITE-10003
> URL: https://issues.apache.org/jira/browse/IGNITE-10003
> Project: Ignite
>  Issue Type: Task
>Affects Versions: 2.7
>Reporter: Andrey Kuznetsov
>Assignee: Andrey Kuznetsov
>Priority: Trivial
> Fix For: 2.8
>
>
> {{GridCacheDatabaseSharedManager#failCheckpointReadLock}} should report 
> {{SYSTEM_WORKER_BLOCKED}} to the failure handler: it is closer to the truth and 
> the default consequences are not as severe as with {{CRITICAL_ERROR}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-10771) Print troubleshooting hint when exchange latch got stucked

2018-12-20 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-10771:


 Summary: Print troubleshooting hint when exchange latch got stucked
 Key: IGNITE-10771
 URL: https://issues.apache.org/jira/browse/IGNITE-10771
 Project: Ignite
  Issue Type: Improvement
  Components: cache
Affects Versions: 2.5
Reporter: Pavel Kovalenko
 Fix For: 2.8


Sometimes users face a problem where the exchange latch can't be completed:
{noformat}
2018-12-12 07:07:57:563 [exchange-worker-#42] WARN 
o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture:488 - Unable to await 
partitions release latch within timeout: ClientLatch 
[coordinator=ZookeeperClusterNode [id=6b9fc6e4-5b6a-4a98-be4d-6bc1aa5c014c, 
addrs=[172.17.0.1, 10.0.230.117, 0:0:0:0:0:0:0:1%lo, 127.0.0.1], order=3, 
loc=false, client=false], ackSent=true, super=CompletableLatch [id=exchange, 
topVer=AffinityTopologyVersion [topVer=45, minorTopVer=1]]] 
{noformat}
It may indicate that some node in the cluster cannot finish partitions release 
(i.e. finish all ongoing operations at the previous topology version), or it can 
be a silent network problem.
We should print a troubleshooting hint to the log to reduce the number of 
questions about such problems.
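
For illustration, a minimal sketch of such a hint (hypothetical names and 
wording; not the actual GridDhtPartitionsExchangeFuture code):

{code:java}
import java.util.logging.Logger;

class LatchTimeoutHintSketch {
    /** Logs the timeout together with a troubleshooting hint instead of a bare warning. */
    static void warnLatchTimeout(Logger log, String latchDescription) {
        log.warning("Unable to await partitions release latch within timeout: " + latchDescription);

        // Proposed hint: name the usual suspects so users can start troubleshooting themselves.
        log.warning("Possible reasons: (1) some node cannot finish operations " +
            "(long-running transactions or atomic updates) started on the previous topology version; " +
            "(2) a silent network problem between this node and the latch coordinator. " +
            "Check thread dumps and long-running operations on the nodes that have not acknowledged the latch.");
    }
}
{code}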





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (IGNITE-10003) Raise SYSTEM_WORKER_BLOCKED instead of CRITICAL_ERROR when checkpoint read lock timeout detected

2018-12-20 Thread Pavel Kovalenko (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-10003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725865#comment-16725865
 ] 

Pavel Kovalenko commented on IGNITE-10003:
--

[~andrey-kuznetsov] Overall looks good. But in my opinion the test can be 
written in a much simpler way:
1) Disable automatic checkpoints by timeout.
2) Introduce checkpointWriteLock/Unlock methods in 
GridCacheDatabaseSharedManager and explicitly acquire the write lock in the test code.
3) Try to acquire the checkpoint read lock and wait until it fails with a 
timeout exception.
4) Check that the failure handler was called and the failure type was as expected.
With this approach there is no need to "cross fingers" :)
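
A sketch of that flow (the proposed checkpointWriteLock()/checkpointWriteUnlock() 
accessors and the failure-handler hook are assumptions from the steps above, not 
existing APIs):

{code:java}
// Pseudo-test sketch; Ignite and JUnit imports omitted, names of the proposed
// write-lock accessors and the failure-handler hook are assumptions.
public void testReadLockTimeoutReportsSystemWorkerBlocked() throws Exception {
    IgniteEx ignite = startGrid(0);                        // (1) checkpoints by timeout disabled in config

    GridCacheDatabaseSharedManager db =
        (GridCacheDatabaseSharedManager)ignite.context().cache().context().database();

    db.checkpointWriteLock();                              // (2) hold the checkpoint write lock from the test

    try {
        db.checkpointReadLock();                           // (3) must fail with a timeout

        fail("Checkpoint read lock acquisition was expected to time out");
    }
    catch (Exception expected) {
        // (4) the failure handler must have been called with SYSTEM_WORKER_BLOCKED
        assertEquals(FailureType.SYSTEM_WORKER_BLOCKED, lastRecordedFailureType());
    }
    finally {
        db.checkpointWriteUnlock();
    }
}
{code}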

> Raise SYSTEM_WORKER_BLOCKED instead of CRITICAL_ERROR when checkpoint read 
> lock timeout detected
> 
>
> Key: IGNITE-10003
> URL: https://issues.apache.org/jira/browse/IGNITE-10003
> Project: Ignite
>  Issue Type: Task
>Affects Versions: 2.7
>Reporter: Andrey Kuznetsov
>Assignee: Andrey Kuznetsov
>Priority: Trivial
> Fix For: 2.8
>
>
> {{GridCacheDatabaseSharedManager#failCheckpointReadLock}} should report 
> {{SYSTEM_WORKER_BLOCKED}} to the failure handler: it is closer to the truth, and 
> the default consequences are not as severe as those of {{CRITICAL_ERROR}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (IGNITE-10493) Refactor exchange stages time measurements

2018-12-20 Thread Pavel Kovalenko (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725820#comment-16725820
 ] 

Pavel Kovalenko commented on IGNITE-10493:
--

[~agoncharuk] Thank you for the review. Merged to master.

> Refactor exchange stages time measurements
> --
>
> Key: IGNITE-10493
> URL: https://issues.apache.org/jira/browse/IGNITE-10493
> Project: Ignite
>  Issue Type: Improvement
>  Components: cache
>Affects Versions: 2.7
>Reporter: Pavel Kovalenko
>Assignee: Pavel Kovalenko
>Priority: Major
> Fix For: 2.8
>
>
> In the current implementation we don't cover and measure all code paths 
> that influence PME time. Instead, we just measure the hottest separate parts 
> with the following hardcoded pattern:
> {noformat}
> long time = currentTime();
> ... // some code block
> print ("Stage name performed in " + (currentTime() - time));
> {noformat}
> This approach can be improved. Instead of declaring a time variable and printing 
> the message to the log immediately, we can introduce a utility class (TimesBag) 
> that holds all stages and their times. The content of TimesBag can be printed 
> when the exchange future is done.
> As exchange is a linear process whose init stage runs in the exchange worker 
> and whose finish stage runs in one of the sys threads, we can easily cover the 
> whole exchange code base with time cutoffs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (IGNITE-10493) Refactor exchange stages time measurements

2018-12-20 Thread Pavel Kovalenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Kovalenko updated IGNITE-10493:
-
Ignite Flags:   (was: Docs Required)

> Refactor exchange stages time measurements
> --
>
> Key: IGNITE-10493
> URL: https://issues.apache.org/jira/browse/IGNITE-10493
> Project: Ignite
>  Issue Type: Improvement
>  Components: cache
>Affects Versions: 2.7
>Reporter: Pavel Kovalenko
>Assignee: Pavel Kovalenko
>Priority: Major
> Fix For: 2.8
>
>
> In the current implementation we don't cover and measure all code paths 
> that influence PME time. Instead, we just measure the hottest separate parts 
> with the following hardcoded pattern:
> {noformat}
> long time = currentTime();
> ... // some code block
> print ("Stage name performed in " + (currentTime() - time));
> {noformat}
> This approach can be improved. Instead of declaring a time variable and printing 
> the message to the log immediately, we can introduce a utility class (TimesBag) 
> that holds all stages and their times. The content of TimesBag can be printed 
> when the exchange future is done.
> As exchange is a linear process whose init stage runs in the exchange worker 
> and whose finish stage runs in one of the sys threads, we can easily cover the 
> whole exchange code base with time cutoffs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-10749) Improve speed of checkpoint finalization on binary memory recovery

2018-12-20 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-10749:


 Summary: Improve speed of checkpoint finalization on binary memory 
recovery
 Key: IGNITE-10749
 URL: https://issues.apache.org/jira/browse/IGNITE-10749
 Project: Ignite
  Issue Type: Improvement
  Components: cache
Affects Versions: 2.0
Reporter: Pavel Kovalenko
 Fix For: 2.8


Stopping a node during a checkpoint leads to binary memory recovery after the 
node starts.
When binary memory is restored, the node performs a checkpoint that fixes the 
consistent state of the page memory.
It happens here:

{noformat}
org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager#finalizeCheckpointOnRecovery
{noformat}

Looking at the implementation of this method, we can notice that it performs 
finalization in a single thread, which is not optimal. This process can be sped 
up by parallelizing the collection of checkpoint pages, as is done for regular 
checkpoints.
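
A rough sketch of the parallelization idea (hypothetical structure; the real 
finalizeCheckpointOnRecovery works on page memory, not plain lists): split the 
page collection across a thread pool instead of walking everything in one thread.

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelCollectSketch {
    /** Collects "checkpoint pages" in several threads instead of one. */
    static List<Long> collectPages(List<Long> dirtyPageIds, int parallelism) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);

        try {
            int chunk = Math.max(1, dirtyPageIds.size() / parallelism);
            List<Future<List<Long>>> futs = new ArrayList<>();

            for (int i = 0; i < dirtyPageIds.size(); i += chunk) {
                List<Long> slice = dirtyPageIds.subList(i, Math.min(i + chunk, dirtyPageIds.size()));

                // The real code would copy/sort page ids and prepare page buffers here.
                Callable<List<Long>> task = () -> new ArrayList<>(slice);

                futs.add(pool.submit(task));
            }

            List<Long> res = new ArrayList<>(dirtyPageIds.size());

            for (Future<List<Long>> f : futs)
                res.addAll(f.get());

            return res;
        }
        finally {
            pool.shutdown();
        }
    }
}
{code}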



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (IGNITE-10465) TTL Worker can fail on node start due to a race.

2018-12-20 Thread Pavel Kovalenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-10465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Kovalenko resolved IGNITE-10465.
--
Resolution: Fixed

Now the TTL cleanup worker waits for the local join future to complete.
Merged to master.

> TTL Worker can fail on node start due to a race.
> -
>
> Key: IGNITE-10465
> URL: https://issues.apache.org/jira/browse/IGNITE-10465
> Project: Ignite
>  Issue Type: Bug
>  Components: cache, persistence
>Reporter: Andrew Mashenkov
>Assignee: Pavel Kovalenko
>Priority: Critical
>  Labels: MakeTeamcityGreenAgain
> Fix For: 2.8
>
>
> PDS 3 suite times out sporadically on TC if TC is under high load.
> It seems there is a race and the TtlWorker starts before the node has joined. 
> Here is failure dump:
> {noformat}
> [17:32:47]W: [org.apache.ignite:ignite-core] at 
> org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.getGridStartTime(TcpDiscoverySpi.java:1456)
> [17:32:47]W: [org.apache.ignite:ignite-core] at 
> org.apache.ignite.internal.managers.discovery.GridDiscoveryManager.gridStartTime(GridDiscoveryManager.java:2245)
> [17:32:47]W: [org.apache.ignite:ignite-core] at 
> org.apache.ignite.internal.processors.cache.version.GridCacheVersionManager.next(GridCacheVersionManager.java:279)
> [17:32:47]W: [org.apache.ignite:ignite-core] at 
> org.apache.ignite.internal.processors.cache.version.GridCacheVersionManager.next(GridCacheVersionManager.java:201)
> [17:32:47]W: [org.apache.ignite:ignite-core] at 
> org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.purgeExpiredInternal(GridCacheOffheapM
> [17:32:47]W: [org.apache.ignite:ignite-core] at 
> org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.purgeExpired(GridCacheOffheapManager.j
> [17:32:47]W: [org.apache.ignite:ignite-core] at 
> org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager.expire(GridCacheOffheapManager.java:986)
> [17:32:47]W: [org.apache.ignite:ignite-core] at 
> org.apache.ignite.internal.processors.cache.GridCacheTtlManager.expire(GridCacheTtlManager.java:207)
> [17:32:47]W: [org.apache.ignite:ignite-core] at 
> org.apache.ignite.internal.processors.cache.GridCacheSharedTtlCleanupManager$CleanupWorker.body(GridCacheSharedTtlCleanupManager.java:141
> [17:32:47]W: [org.apache.ignite:ignite-core] at 
> org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120){noformat}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (IGNITE-10731) ZookeeperDiscoverySpiTestSuite4: IgniteCacheReplicatedQuerySelfTest.testNodeLeft fails

2018-12-19 Thread Pavel Kovalenko (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-10731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725006#comment-16725006
 ] 

Pavel Kovalenko commented on IGNITE-10731:
--

[~VitaliyB] Thank you for the contribution. The change looks good to me. Merged 
to master.

> ZookeeperDiscoverySpiTestSuite4: 
> IgniteCacheReplicatedQuerySelfTest.testNodeLeft fails
> --
>
> Key: IGNITE-10731
> URL: https://issues.apache.org/jira/browse/IGNITE-10731
> Project: Ignite
>  Issue Type: Improvement
>Affects Versions: 2.7
>Reporter: Vitaliy Biryukov
>Assignee: Vitaliy Biryukov
>Priority: Major
>  Labels: MakeTeamcityGreenAgain
> Fix For: 2.8
>
>
> {noformat}
> junit.framework.AssertionFailedError: 
> Expected :0
> Actual   :312
>  
>   at junit.framework.Assert.fail(Assert.java:57)
>   at junit.framework.Assert.failNotEquals(Assert.java:329)
>   at junit.framework.Assert.assertEquals(Assert.java:78)
>   at junit.framework.Assert.assertEquals(Assert.java:234)
>   at junit.framework.Assert.assertEquals(Assert.java:241)
>   at junit.framework.TestCase.assertEquals(TestCase.java:409)
>   at 
> org.apache.ignite.internal.processors.cache.distributed.replicated.IgniteCacheReplicatedQuerySelfTest.testNodeLeft(IgniteCacheReplicatedQuerySelfTest.java:348)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at junit.framework.TestCase.runTest(TestCase.java:176)
>   at 
> org.apache.ignite.testframework.junits.GridAbstractTest.access$001(GridAbstractTest.java:151)
>   at 
> org.apache.ignite.testframework.junits.GridAbstractTest$6.evaluate(GridAbstractTest.java:2102)
>   at 
> org.apache.ignite.testframework.junits.GridAbstractTest$7.run(GridAbstractTest.java:2117)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (IGNITE-10493) Refactor exchange stages time measurements

2018-12-19 Thread Pavel Kovalenko (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16724911#comment-16724911
 ] 

Pavel Kovalenko commented on IGNITE-10493:
--

Timings in the logs look like this:

{noformat}
[2018-12-19 13:33:09,942][INFO 
][sys-#322%database.IgniteDbMultiNodePutGetTest0%][GridDhtPartitionsExchangeFuture]
 Exchange timings [startVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], 
resVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], stage="Waiting in 
exchange queue" (0 ms), stage="Exchange parameters initialization" (0 ms), 
stage="Update caches registry" (144 ms), stage="Start caches" (56 ms), 
stage="Affinity initialization on cache group start" (11 ms), stage="Exchange 
type determination" (0 ms), stage="Preloading notification" (0 ms), stage="WAL 
history reservation" (0 ms), stage="Wait partitions release" (0 ms), 
stage="After states restored callback" (220 ms), stage="Waiting for all single 
messages" (27 ms), stage="Affinity recalculation (crd)" (2 ms), stage="Collect 
update counters and create affinity messages" (0 ms), stage="Validate 
partitions states" (0 ms), stage="Assign partitions states" (1 ms), 
stage="Ideal affinity diff calculation (enforced)" (6 ms), stage="Apply update 
counters" (0 ms), stage="Full message preparing" (5 ms), stage="Full message 
sending" (12 ms), stage="State finish message sending" (8 ms), stage="Exchange 
done" (65 ms), stage="Total time" (557 ms), Discovery lag / Clocks discrepancy 
= 13 ms.]
[2018-12-19 13:33:09,943][INFO 
][sys-#322%database.IgniteDbMultiNodePutGetTest0%][GridDhtPartitionsExchangeFuture]
 Exchange longest local stages [startVer=AffinityTopologyVersion [topVer=3, 
minorTopVer=1], resVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], 
stage="Affinity initialization on cache group start [grp=tiny]" (0 ms) 
(parent=Affinity initialization on cache group start), stage="Affinity 
initialization on cache group start [grp=non-primitive]" (0 ms) 
(parent=Affinity initialization on cache group start), stage="Affinity 
initialization on cache group start [grp=large]" (0 ms) (parent=Affinity 
initialization on cache group start), stage="Affinity centralized 
initialization (crd) [grp=tiny]" (0 ms) (parent=Exchange type determination), 
stage="Affinity centralized initialization (crd) [grp=non-primitive]" (0 ms) 
(parent=Exchange type determination), stage="Affinity centralized 
initialization (crd) [grp=large]" (0 ms) (parent=Exchange type determination), 
stage="Restore partition states" (0 ms) (parent=After states restored 
callback), stage="Affinity recalculation (partitions availability) [grp=tiny]" 
(0 ms) (parent=Ideal affinity diff calculation (enforced)), stage="Affinity 
recalculation (partitions availability) [grp=non-primitive]" (0 ms) 
(parent=Ideal affinity diff calculation (enforced)), stage="Affinity 
recalculation (partitions availability) [grp=large]" (0 ms) (parent=Ideal 
affinity diff calculation (enforced))]

{noformat}
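
For illustration, a minimal sketch of the kind of TimesBag utility that could 
produce such a line (simplified and hypothetical; the real class may differ):

{code:java}
import java.util.ArrayList;
import java.util.List;

/** Collects named stage durations during exchange and prints them in one shot. */
class TimesBag {
    private final List<String> stages = new ArrayList<>();
    private long lastTs = System.currentTimeMillis();

    /** Records the time elapsed since the previous stage finished. */
    synchronized void finishStage(String name) {
        long now = System.currentTimeMillis();

        stages.add("stage=\"" + name + "\" (" + (now - lastTs) + " ms)");

        lastTs = now;
    }

    /** Dumps all collected stages when the exchange future is done. */
    synchronized String dump() {
        return "Exchange timings [" + String.join(", ", stages) + "]";
    }
}
{code}

Each stage would call finishStage(...) as it completes, and the exchange future 
would log dump() once when it is done.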


> Refactor exchange stages time measurements
> --
>
> Key: IGNITE-10493
> URL: https://issues.apache.org/jira/browse/IGNITE-10493
> Project: Ignite
>  Issue Type: Improvement
>  Components: cache
>Affects Versions: 2.7
>Reporter: Pavel Kovalenko
>Assignee: Pavel Kovalenko
>Priority: Major
> Fix For: 2.8
>
>
> In the current implementation we don't cover and measure all code paths 
> that influence PME time. Instead, we just measure the hottest separate parts 
> with the following hardcoded pattern:
> {noformat}
> long time = currentTime();
> ... // some code block
> print ("Stage name performed in " + (currentTime() - time));
> {noformat}
> This approach can be improved. Instead of declaring a time variable and printing 
> the message to the log immediately, we can introduce a utility class (TimesBag) 
> that holds all stages and their times. The content of TimesBag can be printed 
> when the exchange future is done.
> As exchange is a linear process whose init stage runs in the exchange worker 
> and whose finish stage runs in one of the sys threads, we can easily cover the 
> whole exchange code base with time cutoffs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (IGNITE-10669) NPE in freelist.PagesList.findTailIndex

2018-12-17 Thread Pavel Kovalenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-10669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Kovalenko reassigned IGNITE-10669:


Assignee: Pavel Kovalenko

> NPE in freelist.PagesList.findTailIndex
> ---
>
> Key: IGNITE-10669
> URL: https://issues.apache.org/jira/browse/IGNITE-10669
> Project: Ignite
>  Issue Type: Bug
>  Components: data structures
>Affects Versions: 2.7
> Environment: Windows
>Reporter: ARomantsov
>Assignee: Pavel Kovalenko
>Priority: Critical
> Fix For: 2.8
>
>
> Run a node with 1 cache and put data into it.
> Kill the node and try to start it again - it breaks on start
> {code:java}
> [22:40:10,916][INFO][main][GridCacheDatabaseSharedManager] Applying lost 
> cache updates since last checkpoint record [lastMarked=FileWALPointer [idx=2, 
> fileOff=14706, len=21409], 
> lastCheckpointId=2f9202e9-c9d7-47ca-9dcc-299a959bb2e0]
> [22:40:10,922][SEVERE][main][IgniteKernal] Exception during start processors, 
> node will be stopped and close connections
> java.lang.NullPointerException
>   at 
> org.apache.ignite.internal.processors.cache.persistence.freelist.PagesList.findTailIndex(PagesList.java:502)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.freelist.PagesList.updateTail(PagesList.java:458)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.freelist.PagesList.mergeNoNext(PagesList.java:1330)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.freelist.PagesList.removeDataPage(PagesList.java:1281)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.freelist.AbstractFreeList$RemoveRowHandler.run(AbstractFreeList.java:305)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.freelist.AbstractFreeList$RemoveRowHandler.run(AbstractFreeList.java:261)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.tree.util.PageHandler.writePage(PageHandler.java:279)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.DataStructure.write(DataStructure.java:256)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.freelist.AbstractFreeList.removeDataRowByLink(AbstractFreeList.java:571)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.metastorage.MetastorageRowStore.removeRow(MetastorageRowStore.java:57)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.metastorage.MetaStorage.putData(MetaStorage.java:253)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.metastorage.MetaStorage.applyUpdate(MetaStorage.java:492)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.applyLogicalUpdates(GridCacheDatabaseSharedManager.java:2420)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.startMemoryRestore(GridCacheDatabaseSharedManager.java:1909)
>   at org.apache.ignite.internal.IgniteKernal.start(IgniteKernal.java:1056)
>   at 
> org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start0(IgnitionEx.java:2040)
>   at 
> org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start(IgnitionEx.java:1732)
>   at org.apache.ignite.internal.IgnitionEx.start0(IgnitionEx.java:1158)
>   at 
> org.apache.ignite.internal.IgnitionEx.startConfigurations(IgnitionEx.java:1076)
>   at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:962)
>   at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:861)
>   at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:731)
>   at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:700)
>   at org.apache.ignite.Ignition.start(Ignition.java:348)
>   at 
> org.apache.ignite.startup.cmdline.CommandLineStartup.main(CommandLineStartup.java:301)
> [22:40:10,922][SEVERE][main][IgniteKernal] Got exception while starting (will 
> rollback startup routine).
> java.lang.NullPointerException
>   at 
> org.apache.ignite.internal.processors.cache.persistence.freelist.PagesList.findTailIndex(PagesList.java:502)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.freelist.PagesList.updateTail(PagesList.java:458)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.freelist.PagesList.mergeNoNext(PagesList.java:1330)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.freelist.PagesList.removeDataPage(PagesList.java:1281)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.freelist.AbstractFreeList$RemoveRowHandler.run(AbstractFreeList.java:305)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.freelist.AbstractFreeList$RemoveRowHandler.run(AbstractFreeList.java:261)
>   at 
> 

[jira] [Commented] (IGNITE-10624) Cache deployment id may be different than cluster-wide after recovery

2018-12-17 Thread Pavel Kovalenko (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-10624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16722768#comment-16722768
 ] 

Pavel Kovalenko commented on IGNITE-10624:
--

Merged to master

> Cache deployment id may be different than cluster-wide after recovery
> -
>
> Key: IGNITE-10624
> URL: https://issues.apache.org/jira/browse/IGNITE-10624
> Project: Ignite
>  Issue Type: Bug
>  Components: cache, sql
>Affects Versions: 2.8
>Reporter: Pavel Kovalenko
>Assignee: Pavel Kovalenko
>Priority: Major
> Fix For: 2.8
>
>
> When schema for a cache is changing 
> (GridQueryProcessor#processSchemaOperationLocal),
> it may produce false-negative "CACHE_NOT_FOUND" message if a cache was 
> started during recovery while cluster-wide descriptor was changed.
> {noformat}
> if (cacheInfo == null || !F.eq(depId, 
> cacheInfo.dynamicDeploymentId()))
> throw new 
> SchemaOperationException(SchemaOperationException.CODE_CACHE_NOT_FOUND, 
> cacheName); 
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (IGNITE-4210) CacheLoadingConcurrentGridStartSelfTest.testLoadCacheFromStore() test loses data.

2018-12-14 Thread Pavel Kovalenko (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-4210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16721420#comment-16721420
 ] 

Pavel Kovalenko commented on IGNITE-4210:
-

[~Alexey Kuznetsov] 
I've reviewed your solution and have several concerns about its implementation 
and the test.
1) The current implementation of the cache loading mechanism doesn't fit the 
requirement that there should be no ongoing update operations during PME.
2) We wait for such operations to finish in 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture#waitPartitionRelease.
 Cache load operations are not covered by the partitions release future 
and have their own custom workflow, which is a big miss and should be fixed.
3) GridDhtInvalidPartitionException is thrown only when the affinity for that 
partition has changed. What happens when some partitions are moved to 
other nodes and some are not? In that case some of the cache loads will finish 
with an exception and some will not, which will cause data inconsistency between 
nodes after PME is finished.
4) In my view the test doesn't sufficiently cover the case when the cluster 
topology changes. The future that starts new nodes can complete before you run 
the load into the cache store. Moreover, the test doesn't check that an event 
with ClusterTopologyCheckedException is always fired when the topology changes. 
It means the test sometimes checks the negative (expected) scenario and 
sometimes the positive one, which leads to test flakiness. 

From my point of view the correct solution should be implemented as follows (a 
rough sketch is given below):
1) Introduce a new cache future in 
org.apache.ignite.internal.processors.cache.GridCacheMvccManager (alongside the 
locks, transactions and atomic updates futures) that indicates that cache 
loading is in progress.
2) Create and register this future at the beginning of the 
org.apache.ignite.internal.processors.cache.store.CacheStoreManager#loadAll 
method.
3) The future should be cancellable. Add it to the partitions release future 
and cancel it when the waitPartitionsRelease event happens.
4) Split the whole key set into micro-batches in loadAll. At the end of each 
micro-batch, check that the cache load future has not been cancelled by 
waitPartitionsRelease.
5) If the future was cancelled, immediately complete it to unblock 
waitPartitionsRelease and throw an appropriate exception to the user saying that 
the topology has changed and the operation should be retried.
6) Fix the test so that it is guaranteed to check the negative scenario with a 
cluster topology change.
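
A rough sketch of steps 1-5 under those assumptions (hypothetical names; not the 
GridCacheMvccManager or CacheStoreManager API):

{code:java}
import java.util.List;
import java.util.concurrent.CompletableFuture;

/** Sketch of a cancellable "cache loading in progress" future registered for PME. */
class CacheLoadFutureSketch {
    /** Registered in the MVCC manager and added to the partitions release future (steps 1-3). */
    final CompletableFuture<Void> loadFut = new CompletableFuture<>();

    /** Called by waitPartitionsRelease when the topology changes (step 3). */
    void cancelOnTopologyChange() {
        loadFut.cancel(false);
    }

    /** Step 4: load the store in micro-batches, checking for cancellation between batches. */
    void loadAll(List<List<Object>> microBatches) {
        try {
            for (List<Object> batch : microBatches) {
                if (loadFut.isCancelled())    // step 5: stop immediately, PME is waiting
                    throw new IllegalStateException("Topology changed during cache loading, retry the operation");

                loadBatch(batch);
            }

            loadFut.complete(null);           // unblocks waitPartitionsRelease on normal completion
        }
        catch (RuntimeException e) {
            loadFut.completeExceptionally(e); // no-op if already cancelled, otherwise unblocks PME too
            throw e;
        }
    }

    private void loadBatch(List<Object> batch) { /* store.loadCache(...) for this slice */ }
}
{code}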

> CacheLoadingConcurrentGridStartSelfTest.testLoadCacheFromStore() test loses 
> data.
> 
>
> Key: IGNITE-4210
> URL: https://issues.apache.org/jira/browse/IGNITE-4210
> Project: Ignite
>  Issue Type: Bug
>Reporter: Anton Vinogradov
>Assignee: Alexey Kuznetsov
>Priority: Major
>  Labels: MakeTeamcityGreenAgain
> Fix For: 2.8
>
>
> org.apache.ignite.internal.processors.cache.distributed.CacheLoadingConcurrentGridStartSelfTest#testLoadCacheFromStore
>  sometimes have failures.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (IGNITE-10669) NPE in freelist.PagesList.findTailIndex

2018-12-13 Thread Pavel Kovalenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-10669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Kovalenko reassigned IGNITE-10669:


Assignee: (was: Pavel Kovalenko)

> NPE in freelist.PagesList.findTailIndex
> ---
>
> Key: IGNITE-10669
> URL: https://issues.apache.org/jira/browse/IGNITE-10669
> Project: Ignite
>  Issue Type: Bug
>  Components: data structures
>Affects Versions: 2.7
> Environment: Windows
>Reporter: ARomantsov
>Priority: Critical
> Fix For: 2.8
>
>
> Run a node with 1 cache and put data into it.
> Kill the node and try to start it again - it breaks on start
> {code:java}
> [22:40:10,916][INFO][main][GridCacheDatabaseSharedManager] Applying lost 
> cache updates since last checkpoint record [lastMarked=FileWALPointer [idx=2, 
> fileOff=14706, len=21409], 
> lastCheckpointId=2f9202e9-c9d7-47ca-9dcc-299a959bb2e0]
> [22:40:10,922][SEVERE][main][IgniteKernal] Exception during start processors, 
> node will be stopped and close connections
> java.lang.NullPointerException
>   at 
> org.apache.ignite.internal.processors.cache.persistence.freelist.PagesList.findTailIndex(PagesList.java:502)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.freelist.PagesList.updateTail(PagesList.java:458)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.freelist.PagesList.mergeNoNext(PagesList.java:1330)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.freelist.PagesList.removeDataPage(PagesList.java:1281)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.freelist.AbstractFreeList$RemoveRowHandler.run(AbstractFreeList.java:305)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.freelist.AbstractFreeList$RemoveRowHandler.run(AbstractFreeList.java:261)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.tree.util.PageHandler.writePage(PageHandler.java:279)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.DataStructure.write(DataStructure.java:256)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.freelist.AbstractFreeList.removeDataRowByLink(AbstractFreeList.java:571)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.metastorage.MetastorageRowStore.removeRow(MetastorageRowStore.java:57)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.metastorage.MetaStorage.putData(MetaStorage.java:253)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.metastorage.MetaStorage.applyUpdate(MetaStorage.java:492)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.applyLogicalUpdates(GridCacheDatabaseSharedManager.java:2420)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.startMemoryRestore(GridCacheDatabaseSharedManager.java:1909)
>   at org.apache.ignite.internal.IgniteKernal.start(IgniteKernal.java:1056)
>   at 
> org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start0(IgnitionEx.java:2040)
>   at 
> org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start(IgnitionEx.java:1732)
>   at org.apache.ignite.internal.IgnitionEx.start0(IgnitionEx.java:1158)
>   at 
> org.apache.ignite.internal.IgnitionEx.startConfigurations(IgnitionEx.java:1076)
>   at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:962)
>   at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:861)
>   at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:731)
>   at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:700)
>   at org.apache.ignite.Ignition.start(Ignition.java:348)
>   at 
> org.apache.ignite.startup.cmdline.CommandLineStartup.main(CommandLineStartup.java:301)
> [22:40:10,922][SEVERE][main][IgniteKernal] Got exception while starting (will 
> rollback startup routine).
> java.lang.NullPointerException
>   at 
> org.apache.ignite.internal.processors.cache.persistence.freelist.PagesList.findTailIndex(PagesList.java:502)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.freelist.PagesList.updateTail(PagesList.java:458)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.freelist.PagesList.mergeNoNext(PagesList.java:1330)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.freelist.PagesList.removeDataPage(PagesList.java:1281)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.freelist.AbstractFreeList$RemoveRowHandler.run(AbstractFreeList.java:305)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.freelist.AbstractFreeList$RemoveRowHandler.run(AbstractFreeList.java:261)
>   at 
> 

[jira] [Assigned] (IGNITE-10669) NPE in freelist.PagesList.findTailIndex

2018-12-13 Thread Pavel Kovalenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-10669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Kovalenko reassigned IGNITE-10669:


Assignee: Pavel Kovalenko

> NPE in freelist.PagesList.findTailIndex
> ---
>
> Key: IGNITE-10669
> URL: https://issues.apache.org/jira/browse/IGNITE-10669
> Project: Ignite
>  Issue Type: Bug
>  Components: data structures
>Affects Versions: 2.7
> Environment: Windows
>Reporter: ARomantsov
>Assignee: Pavel Kovalenko
>Priority: Critical
> Fix For: 2.8
>
>
> Run a node with 1 cache and put data into it.
> Kill the node and try to start it again - it breaks on start
> {code:java}
> [22:40:10,916][INFO][main][GridCacheDatabaseSharedManager] Applying lost 
> cache updates since last checkpoint record [lastMarked=FileWALPointer [idx=2, 
> fileOff=14706, len=21409], 
> lastCheckpointId=2f9202e9-c9d7-47ca-9dcc-299a959bb2e0]
> [22:40:10,922][SEVERE][main][IgniteKernal] Exception during start processors, 
> node will be stopped and close connections
> java.lang.NullPointerException
>   at 
> org.apache.ignite.internal.processors.cache.persistence.freelist.PagesList.findTailIndex(PagesList.java:502)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.freelist.PagesList.updateTail(PagesList.java:458)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.freelist.PagesList.mergeNoNext(PagesList.java:1330)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.freelist.PagesList.removeDataPage(PagesList.java:1281)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.freelist.AbstractFreeList$RemoveRowHandler.run(AbstractFreeList.java:305)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.freelist.AbstractFreeList$RemoveRowHandler.run(AbstractFreeList.java:261)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.tree.util.PageHandler.writePage(PageHandler.java:279)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.DataStructure.write(DataStructure.java:256)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.freelist.AbstractFreeList.removeDataRowByLink(AbstractFreeList.java:571)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.metastorage.MetastorageRowStore.removeRow(MetastorageRowStore.java:57)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.metastorage.MetaStorage.putData(MetaStorage.java:253)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.metastorage.MetaStorage.applyUpdate(MetaStorage.java:492)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.applyLogicalUpdates(GridCacheDatabaseSharedManager.java:2420)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.startMemoryRestore(GridCacheDatabaseSharedManager.java:1909)
>   at org.apache.ignite.internal.IgniteKernal.start(IgniteKernal.java:1056)
>   at 
> org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start0(IgnitionEx.java:2040)
>   at 
> org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start(IgnitionEx.java:1732)
>   at org.apache.ignite.internal.IgnitionEx.start0(IgnitionEx.java:1158)
>   at 
> org.apache.ignite.internal.IgnitionEx.startConfigurations(IgnitionEx.java:1076)
>   at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:962)
>   at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:861)
>   at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:731)
>   at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:700)
>   at org.apache.ignite.Ignition.start(Ignition.java:348)
>   at 
> org.apache.ignite.startup.cmdline.CommandLineStartup.main(CommandLineStartup.java:301)
> [22:40:10,922][SEVERE][main][IgniteKernal] Got exception while starting (will 
> rollback startup routine).
> java.lang.NullPointerException
>   at 
> org.apache.ignite.internal.processors.cache.persistence.freelist.PagesList.findTailIndex(PagesList.java:502)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.freelist.PagesList.updateTail(PagesList.java:458)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.freelist.PagesList.mergeNoNext(PagesList.java:1330)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.freelist.PagesList.removeDataPage(PagesList.java:1281)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.freelist.AbstractFreeList$RemoveRowHandler.run(AbstractFreeList.java:305)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.freelist.AbstractFreeList$RemoveRowHandler.run(AbstractFreeList.java:261)
>   at 
> 

[jira] [Updated] (IGNITE-10624) Cache deployment id may be different that cluster-wide after recovery

2018-12-13 Thread Pavel Kovalenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-10624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Kovalenko updated IGNITE-10624:
-
Ignite Flags:   (was: Docs Required)

> Cache deployment id may be different that cluster-wide after recovery
> -
>
> Key: IGNITE-10624
> URL: https://issues.apache.org/jira/browse/IGNITE-10624
> Project: Ignite
>  Issue Type: Bug
>  Components: cache, sql
>Affects Versions: 2.8
>Reporter: Pavel Kovalenko
>Assignee: Pavel Kovalenko
>Priority: Major
> Fix For: 2.8
>
>
> When schema for a cache is changing 
> (GridQueryProcessor#processSchemaOperationLocal),
> it may produce false-negative "CACHE_NOT_FOUND" message if a cache was 
> started during recovery while cluster-wide descriptor was changed.
> {noformat}
> if (cacheInfo == null || !F.eq(depId, 
> cacheInfo.dynamicDeploymentId()))
> throw new 
> SchemaOperationException(SchemaOperationException.CODE_CACHE_NOT_FOUND, 
> cacheName); 
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (IGNITE-10624) Cache deployment id may be different than cluster-wide after recovery

2018-12-13 Thread Pavel Kovalenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-10624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Kovalenko updated IGNITE-10624:
-
Summary: Cache deployment id may be different than cluster-wide after 
recovery  (was: Cache deployment id may be different that cluster-wide after 
recovery)

> Cache deployment id may be different than cluster-wide after recovery
> -
>
> Key: IGNITE-10624
> URL: https://issues.apache.org/jira/browse/IGNITE-10624
> Project: Ignite
>  Issue Type: Bug
>  Components: cache, sql
>Affects Versions: 2.8
>Reporter: Pavel Kovalenko
>Assignee: Pavel Kovalenko
>Priority: Major
> Fix For: 2.8
>
>
> When schema for a cache is changing 
> (GridQueryProcessor#processSchemaOperationLocal),
> it may produce false-negative "CACHE_NOT_FOUND" message if a cache was 
> started during recovery while cluster-wide descriptor was changed.
> {noformat}
> if (cacheInfo == null || !F.eq(depId, 
> cacheInfo.dynamicDeploymentId()))
> throw new 
> SchemaOperationException(SchemaOperationException.CODE_CACHE_NOT_FOUND, 
> cacheName); 
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (IGNITE-10556) Attempt to decrypt data records during read-only metastorage recovery leads to NPE

2018-12-12 Thread Pavel Kovalenko (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-10556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16719048#comment-16719048
 ] 

Pavel Kovalenko commented on IGNITE-10556:
--

[~DmitriyGovorukhin] Changes merged to master.

> Attempt to decrypt data records during read-only metastorage recovery leads 
> to NPE
> --
>
> Key: IGNITE-10556
> URL: https://issues.apache.org/jira/browse/IGNITE-10556
> Project: Ignite
>  Issue Type: Bug
>  Components: cache
>Affects Versions: 2.8
>Reporter: Pavel Kovalenko
>Assignee: Pavel Kovalenko
>Priority: Critical
> Fix For: 2.8
>
>
> Stacktrace:
> {noformat}
> Caused by: java.lang.NullPointerException
> at 
> org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager$RestoreStateContext.lambda$next$0(GridCacheDatabaseSharedManager.java:4795)
> at 
> java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:174)
> at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
> at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
> at 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
> at 
> java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
> at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
> at 
> org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager$RestoreStateContext.next(GridCacheDatabaseSharedManager.java:4799)
> at 
> org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager$RestoreLogicalState.next(GridCacheDatabaseSharedManager.java:4926)
> at 
> org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.applyLogicalUpdates(GridCacheDatabaseSharedManager.java:2370)
> at 
> org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.readMetastore(GridCacheDatabaseSharedManager.java:733)
> at 
> org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.notifyMetaStorageSubscribersOnReadyForRead(GridCacheDatabaseSharedManager.java:4493)
> at org.apache.ignite.internal.IgniteKernal.start(IgniteKernal.java:1048)
> ... 20 more
> {noformat}
> It happens because there is no encryption key for that cache group. 
> Encryption keys are initialized after read-only metastorage is ready. There 
> is a bug in RestoreStateContext which tries to filter out DataEntries in 
> DataRecord by group id during read-only metastorage recovery. We should 
> explicitly skip such records before filtering. As a possible solution, we 
> should provide more flexible records filter to RestoreStateContext if we do 
> recovery of read-only metastorage.
> We should also return something more meaningful instead of null if no 
> encryption key is found for DataRecord, as it can be a silent problem for 
> components iterating over WAL.
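
As an illustration of the proposed filtering (a hypothetical predicate, not the 
real RestoreStateContext or WAL iterator API), metastorage-only recovery would 
skip cache data records before any decryption is attempted:

{code:java}
import java.util.function.Predicate;

class RecoveryRecordFilterSketch {
    /** Subset of WAL record kinds relevant to this sketch. */
    enum RecordKind { METASTORE_DATA_RECORD, DATA_RECORD, CHECKPOINT_RECORD }

    /**
     * Filter used while only the metastorage is being recovered: cache DATA_RECORDs
     * are skipped outright, so no encryption key is ever needed at this stage.
     */
    static Predicate<RecordKind> metastorageOnlyFilter() {
        return kind -> kind == RecordKind.METASTORE_DATA_RECORD;
    }
}
{code}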



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (IGNITE-10556) Attempt to decrypt data records during read-only metastorage recovery leads to NPE

2018-12-12 Thread Pavel Kovalenko (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-10556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718896#comment-16718896
 ] 

Pavel Kovalenko commented on IGNITE-10556:
--

[~DmitriyGovorukhin] Thank you for the review. I've fixed the minor issues and 
returned the multicast IP finder to the example. Could you please take a look again?

> Attempt to decrypt data records during read-only metastorage recovery leads 
> to NPE
> --
>
> Key: IGNITE-10556
> URL: https://issues.apache.org/jira/browse/IGNITE-10556
> Project: Ignite
>  Issue Type: Bug
>  Components: cache
>Affects Versions: 2.8
>Reporter: Pavel Kovalenko
>Assignee: Pavel Kovalenko
>Priority: Critical
> Fix For: 2.8
>
>
> Stacktrace:
> {noformat}
> Caused by: java.lang.NullPointerException
> at 
> org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager$RestoreStateContext.lambda$next$0(GridCacheDatabaseSharedManager.java:4795)
> at 
> java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:174)
> at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
> at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
> at 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
> at 
> java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
> at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
> at 
> org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager$RestoreStateContext.next(GridCacheDatabaseSharedManager.java:4799)
> at 
> org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager$RestoreLogicalState.next(GridCacheDatabaseSharedManager.java:4926)
> at 
> org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.applyLogicalUpdates(GridCacheDatabaseSharedManager.java:2370)
> at 
> org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.readMetastore(GridCacheDatabaseSharedManager.java:733)
> at 
> org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.notifyMetaStorageSubscribersOnReadyForRead(GridCacheDatabaseSharedManager.java:4493)
> at org.apache.ignite.internal.IgniteKernal.start(IgniteKernal.java:1048)
> ... 20 more
> {noformat}
> It happens because there is no encryption key for that cache group. 
> Encryption keys are initialized after read-only metastorage is ready. There 
> is a bug in RestoreStateContext which tries to filter out DataEntries in 
> DataRecord by group id during read-only metastorage recovery. We should 
> explicitly skip such records before filtering. As a possible solution, we 
> should provide more flexible records filter to RestoreStateContext if we do 
> recovery of read-only metastorage.
> We should also return something more meaningful instead of null if no 
> encryption key is found for DataRecord, as it can be a silent problem for 
> components iterating over WAL.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (IGNITE-10556) Attempt to decrypt data records during read-only metastorage recovery leads to NPE

2018-12-11 Thread Pavel Kovalenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-10556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Kovalenko updated IGNITE-10556:
-
Ignite Flags:   (was: Docs Required)

> Attempt to decrypt data records during read-only metastorage recovery leads 
> to NPE
> --
>
> Key: IGNITE-10556
> URL: https://issues.apache.org/jira/browse/IGNITE-10556
> Project: Ignite
>  Issue Type: Bug
>  Components: cache
>Affects Versions: 2.8
>Reporter: Pavel Kovalenko
>Assignee: Pavel Kovalenko
>Priority: Critical
> Fix For: 2.8
>
>
> Stacktrace:
> {noformat}
> Caused by: java.lang.NullPointerException
> at 
> org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager$RestoreStateContext.lambda$next$0(GridCacheDatabaseSharedManager.java:4795)
> at 
> java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:174)
> at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
> at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
> at 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
> at 
> java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
> at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
> at 
> org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager$RestoreStateContext.next(GridCacheDatabaseSharedManager.java:4799)
> at 
> org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager$RestoreLogicalState.next(GridCacheDatabaseSharedManager.java:4926)
> at 
> org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.applyLogicalUpdates(GridCacheDatabaseSharedManager.java:2370)
> at 
> org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.readMetastore(GridCacheDatabaseSharedManager.java:733)
> at 
> org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.notifyMetaStorageSubscribersOnReadyForRead(GridCacheDatabaseSharedManager.java:4493)
> at org.apache.ignite.internal.IgniteKernal.start(IgniteKernal.java:1048)
> ... 20 more
> {noformat}
> It happens because there is no encryption key for that cache group. 
> Encryption keys are initialized after read-only metastorage is ready. There 
> is a bug in RestoreStateContext which tries to filter out DataEntries in 
> DataRecord by group id during read-only metastorage recovery. We should 
> explicitly skip such records before filtering. As a possible solution, we 
> should provide more flexible records filter to RestoreStateContext if we do 
> recovery of read-only metastorage.
> We should also return something more meaningful instead of null if no 
> encryption key is found for DataRecord, as it can be a silent problem for 
> components iterating over WAL.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-10625) Do first checkpoint on node start before join to topology

2018-12-10 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-10625:


 Summary: Do first checkpoint on node start before join to topology
 Key: IGNITE-10625
 URL: https://issues.apache.org/jira/browse/IGNITE-10625
 Project: Ignite
  Issue Type: Improvement
  Components: cache
Affects Versions: 2.4
Reporter: Pavel Kovalenko
 Fix For: 2.8


If a node joins an active cluster, we do the first checkpoint during PME when 
the partition states have been restored, here:
{code:java}
org.apache.ignite.internal.processors.cache.distributed.dht.topology.GridDhtPartitionTopology#afterStateRestored
 
{code}
In IGNITE-9420 we moved the logical recovery phase before joining the topology, 
so currently, when a node joins an active cluster, it already has all partitions 
recovered. It means that we can safely do the first checkpoint after all 
logical updates are applied. This change will accelerate the PME process if a 
lot of updates were applied during recovery.
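
An order-of-operations sketch of the proposal (method names are placeholders, 
not real Ignite startup APIs):

{code:java}
/** Illustrates the proposed start sequence only; all methods are no-op stubs. */
class NodeStartSketch {
    void start() {
        restoreBinaryMemory();   // existing phase: physical (binary) recovery
        applyLogicalUpdates();   // existing phase: logical WAL recovery, done before join since IGNITE-9420

        // Proposed change: persist the fully recovered state right away,
        // so the PME triggered by the join no longer has to run this checkpoint.
        forceCheckpoint("node start: recovery finished");

        joinTopology();          // join the active cluster with an already checkpointed state
    }

    void restoreBinaryMemory() { /* stub */ }
    void applyLogicalUpdates() { /* stub */ }
    void forceCheckpoint(String reason) { /* stub */ }
    void joinTopology() { /* stub */ }
}
{code}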




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-10624) Cache deployment id may be different that cluster-wide after recovery

2018-12-10 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-10624:


 Summary: Cache deployment id may be different that cluster-wide 
after recovery
 Key: IGNITE-10624
 URL: https://issues.apache.org/jira/browse/IGNITE-10624
 Project: Ignite
  Issue Type: Bug
  Components: cache, sql
Affects Versions: 2.8
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
 Fix For: 2.8


When the schema for a cache is being changed 
(GridQueryProcessor#processSchemaOperationLocal),
it may produce a false-negative "CACHE_NOT_FOUND" error if the cache was started 
during recovery while the cluster-wide descriptor was changed.

{noformat}
if (cacheInfo == null || !F.eq(depId, cacheInfo.dynamicDeploymentId()))
throw new 
SchemaOperationException(SchemaOperationException.CODE_CACHE_NOT_FOUND, 
cacheName); 
{noformat}




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (IGNITE-10556) Attempt to decrypt data records during read-only metastorage recovery leads to NPE

2018-12-10 Thread Pavel Kovalenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-10556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Kovalenko reassigned IGNITE-10556:


Assignee: Pavel Kovalenko

> Attempt to decrypt data records during read-only metastorage recovery leads 
> to NPE
> --
>
> Key: IGNITE-10556
> URL: https://issues.apache.org/jira/browse/IGNITE-10556
> Project: Ignite
>  Issue Type: Bug
>  Components: cache
>Affects Versions: 2.8
>Reporter: Pavel Kovalenko
>Assignee: Pavel Kovalenko
>Priority: Critical
> Fix For: 2.8
>
>
> Stacktrace:
> {noformat}
> Caused by: java.lang.NullPointerException
> at 
> org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager$RestoreStateContext.lambda$next$0(GridCacheDatabaseSharedManager.java:4795)
> at 
> java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:174)
> at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
> at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
> at 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
> at 
> java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
> at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
> at 
> org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager$RestoreStateContext.next(GridCacheDatabaseSharedManager.java:4799)
> at 
> org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager$RestoreLogicalState.next(GridCacheDatabaseSharedManager.java:4926)
> at 
> org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.applyLogicalUpdates(GridCacheDatabaseSharedManager.java:2370)
> at 
> org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.readMetastore(GridCacheDatabaseSharedManager.java:733)
> at 
> org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.notifyMetaStorageSubscribersOnReadyForRead(GridCacheDatabaseSharedManager.java:4493)
> at org.apache.ignite.internal.IgniteKernal.start(IgniteKernal.java:1048)
> ... 20 more
> {noformat}
> It happens because there is no encryption key for that cache group. 
> Encryption keys are initialized after read-only metastorage is ready. There 
> is a bug in RestoreStateContext which tries to filter out DataEntries in 
> DataRecord by group id during read-only metastorage recovery. We should 
> explicitly skip such records before filtering. As a possible solution, we 
> should provide more flexible records filter to RestoreStateContext if we do 
> recovery of read-only metastorage.
> We should also return something more meaningful instead of null if no 
> encryption key is found for DataRecord, as it can be a silent problem for 
> components iterating over WAL.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (IGNITE-8527) Show actual rebalance starting in logs

2018-12-07 Thread Pavel Kovalenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-8527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Kovalenko resolved IGNITE-8527.
-
Resolution: Won't Fix

Fixed by https://issues.apache.org/jira/browse/IGNITE-9649

> Show actual rebalance starting in logs
> --
>
> Key: IGNITE-8527
> URL: https://issues.apache.org/jira/browse/IGNITE-8527
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Pavel Kovalenko
>Assignee: Pavel Kovalenko
>Priority: Trivial
> Fix For: 2.8
>
>
> We should increase the logging level from DEBUG to INFO for this message:
> {noformat}
> if (log.isDebugEnabled())
> log.debug("Requested rebalancing [from node=" 
> + node.id() + ", listener index=" + topicId + " " + demandMsg.rebalanceId() + 
> ", partitions count=" + stripePartitions.get(topicId).size() + " (" + 
> stripePartitions.get(topicId).partitionsList() + ")]");
> {noformat}
> to capture the actual rebalancing start time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (IGNITE-8782) Wrong message may be printed during simultaneous deactivation and rebalance

2018-12-07 Thread Pavel Kovalenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-8782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Kovalenko resolved IGNITE-8782.
-
Resolution: Won't Fix

Fixed by https://issues.apache.org/jira/browse/IGNITE-10242

> Wrong message may be printed during simultaneous deactivation and rebalance
> ---
>
> Key: IGNITE-8782
> URL: https://issues.apache.org/jira/browse/IGNITE-8782
> Project: Ignite
>  Issue Type: Bug
>  Components: cache
>Affects Versions: 2.4
>Reporter: Pavel Kovalenko
>Priority: Minor
> Fix For: 2.8
>
>
> A message located at GridCachePartitionExchangeManager.java:394 may be 
> printed if the cache group doesn't exist while the rebalance process is still 
> finishing. This may happen after deactivation during rebalance.
> We should put this logging under an if (grp != null) block and print a 
> different message if the cache group was actually stopped.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (IGNITE-10242) NPE in GridDhtPartitionDemander#handleSupplyMessage when concurrently rebalancing and stopping cache in same cache group.

2018-12-06 Thread Pavel Kovalenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-10242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Kovalenko updated IGNITE-10242:
-
Ignite Flags:   (was: Docs Required)

> NPE in GridDhtPartitionDemander#handleSupplyMessage when concurrently 
> rebalancing and stopping cache in same cache group.
> -
>
> Key: IGNITE-10242
> URL: https://issues.apache.org/jira/browse/IGNITE-10242
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 2.5, 2.6
>Reporter: Ivan Daschinskiy
>Assignee: Ivan Daschinskiy
>Priority: Major
> Fix For: 2.8
>
> Attachments: IgniteDemanderOnStoppingCacheTest.java
>
>
> NPE in GridDhtPartitionDemander#handleSupplyMessage occurs when concurrently 
> rebalancing and stopping cache in same cache group. Reproducer is attached
> {noformat}
> java.lang.NullPointerException
>   at 
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionDemander.preloadEntry(GridDhtPartitionDemander.java:893)
>   at 
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionDemander.handleSupplyMessage(GridDhtPartitionDemander.java:772)
>   at 
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader.handleSupplyMessage(GridDhtPreloader.java:331)
>   at 
> org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:411)
>   at 
> org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:401)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1058)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:583)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$700(GridCacheIoManager.java:101)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (IGNITE-10242) NPE in GridDhtPartitionDemander#handleSupplyMessage when concurrently rebalancing and stopping cache in same cache group.

2018-12-06 Thread Pavel Kovalenko (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-10242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16711623#comment-16711623
 ] 

Pavel Kovalenko commented on IGNITE-10242:
--

[~ivandasch] Thank you for the contribution. The changes look good to me. Merged 
to master.

> NPE in GridDhtPartitionDemander#handleSupplyMessage when concurrently 
> rebalancing and stopping cache in same cache group.
> -
>
> Key: IGNITE-10242
> URL: https://issues.apache.org/jira/browse/IGNITE-10242
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 2.5, 2.6
>Reporter: Ivan Daschinskiy
>Assignee: Ivan Daschinskiy
>Priority: Major
> Fix For: 2.8
>
> Attachments: IgniteDemanderOnStoppingCacheTest.java
>
>
> NPE in GridDhtPartitionDemander#handleSupplyMessage occurs when a cache in 
> the same cache group is concurrently rebalanced and stopped. A reproducer is attached.
> {noformat}
> java.lang.NullPointerException
>   at 
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionDemander.preloadEntry(GridDhtPartitionDemander.java:893)
>   at 
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionDemander.handleSupplyMessage(GridDhtPartitionDemander.java:772)
>   at 
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader.handleSupplyMessage(GridDhtPreloader.java:331)
>   at 
> org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:411)
>   at 
> org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:401)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1058)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:583)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$700(GridCacheIoManager.java:101)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-10556) Attempt to decrypt data records during read-only metastorage recovery leads to NPE

2018-12-05 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-10556:


 Summary: Attempt to decrypt data records during read-only 
metastorage recovery leads to NPE
 Key: IGNITE-10556
 URL: https://issues.apache.org/jira/browse/IGNITE-10556
 Project: Ignite
  Issue Type: Bug
  Components: cache
Affects Versions: 2.8
Reporter: Pavel Kovalenko
 Fix For: 2.8


Stacktrace:
{noformat}
Caused by: java.lang.NullPointerException
at 
org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager$RestoreStateContext.lambda$next$0(GridCacheDatabaseSharedManager.java:4795)
at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:174)
at 
java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at 
java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at 
java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
at 
org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager$RestoreStateContext.next(GridCacheDatabaseSharedManager.java:4799)
at 
org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager$RestoreLogicalState.next(GridCacheDatabaseSharedManager.java:4926)
at 
org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.applyLogicalUpdates(GridCacheDatabaseSharedManager.java:2370)
at 
org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.readMetastore(GridCacheDatabaseSharedManager.java:733)
at 
org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.notifyMetaStorageSubscribersOnReadyForRead(GridCacheDatabaseSharedManager.java:4493)
at org.apache.ignite.internal.IgniteKernal.start(IgniteKernal.java:1048)
... 20 more
{noformat}

It happens because there is no encryption key for that cache group yet: 
encryption keys are initialized only after the read-only metastorage is ready. 
There is a bug in RestoreStateContext, which tries to filter DataEntries in a 
DataRecord by group id during read-only metastorage recovery. We should 
explicitly skip such records before filtering. As a possible solution, we 
could provide a more flexible records filter to RestoreStateContext when 
recovering the read-only metastorage.

We should also return something more meaningful than null if no encryption 
key is found for a DataRecord, as this can be a silent problem for components 
iterating over the WAL.
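A rough illustration of the "more flexible filter" idea (an editor's sketch 
only; RestoreStateContext does not accept such a predicate today, so this shows 
just the shape of the proposal):

{noformat}
// Editor's sketch: during read-only metastorage recovery, accept only metastore
// records and never touch DataRecords of (possibly encrypted) cache groups.
IgniteBiPredicate<WALRecord.RecordType, WALPointer> metastoreOnlyFilter =
    (type, ptr) -> type == WALRecord.RecordType.METASTORE_DATA_RECORD;
{noformat}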



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (IGNITE-10003) Raise SYSTEM_WORKER_BLOCKED instead of CRITICAL_ERROR when checkpoint read lock timeout detected

2018-12-05 Thread Pavel Kovalenko (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-10003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16709996#comment-16709996
 ] 

Pavel Kovalenko commented on IGNITE-10003:
--

[~andrey-kuznetsov] Thank you for the contribution.
I have a question regarding this change. Do we have any tests checking that a 
critical failure is raised when checkpointReadLock times out? If yes, please 
add a check that the failure type SYSTEM_WORKER_BLOCKED is used in that case. 
If not, please add such a test.
Also, your branch is a bit outdated compared to master, so it's hard to tell 
which test failures are expected. Could you please merge the latest master, 
re-run the tests and get a green visa from 
[MTCGA|https://mtcga.gridgain.com/]?
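For reference, the requested assertion could look roughly like this (an 
editor's sketch, not an existing test; how the checkpoint read lock timeout is 
provoked is omitted):

{noformat}
AtomicReference<FailureType> observed = new AtomicReference<>();

FailureHandler hnd = new AbstractFailureHandler() {
    @Override protected boolean handle(Ignite ignite, FailureContext failureCtx) {
        observed.set(failureCtx.type());

        return false; // Don't stop the node in the test.
    }
};

// ... start a node with this failure handler and provoke a checkpoint read lock timeout ...

assertEquals(FailureType.SYSTEM_WORKER_BLOCKED, observed.get());
{noformat}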

> Raise SYSTEM_WORKER_BLOCKED instead of CRITICAL_ERROR when checkpoint read 
> lock timeout detected
> 
>
> Key: IGNITE-10003
> URL: https://issues.apache.org/jira/browse/IGNITE-10003
> Project: Ignite
>  Issue Type: Task
>Affects Versions: 2.7
>Reporter: Andrey Kuznetsov
>Assignee: Andrey Kuznetsov
>Priority: Trivial
> Fix For: 2.8
>
>
> {{GridCacheDatabaseSharedManager#failCheckpointReadLock}} should report 
> {{SYSTEM_WORKER_BLOCKED}} to the failure handler: it is closer to the truth, 
> and its default consequences are not as severe as those of {{CRITICAL_ERROR}}.
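The proposed change boils down to something like the following (an editor's 
sketch; the surrounding code in 
GridCacheDatabaseSharedManager#failCheckpointReadLock is simplified and the 
ctx accessor name is an assumption):

{noformat}
IgniteException e = new IgniteException("Checkpoint read lock acquisition has been timed out.");

// Report a blocked system worker instead of an unconditionally fatal CRITICAL_ERROR.
ctx.failure().process(new FailureContext(FailureType.SYSTEM_WORKER_BLOCKED, e));
{noformat}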



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (IGNITE-10369) PDS 4 hangs on TC

2018-12-04 Thread Pavel Kovalenko (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-10369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708492#comment-16708492
 ] 

Pavel Kovalenko commented on IGNITE-10369:
--

[~ibessonov] Thank you for the contribution. The changes have been merged to master.

> PDS 4 hangs on TC
> -
>
> Key: IGNITE-10369
> URL: https://issues.apache.org/jira/browse/IGNITE-10369
> Project: Ignite
>  Issue Type: Test
>Reporter: Ivan Bessonov
>Assignee: Ivan Bessonov
>Priority: Major
>  Labels: MakeTeamcityGreenAgain
> Fix For: 2.8
>
>
> [https://ci.ignite.apache.org/viewLog.html?buildId=2365697=buildResultsDiv=IgniteTests24Java8_Pds4]
> org.apache.ignite.internal.processors.cache.IgniteClusterActivateDeactivateTestWithPersistenceAndMemoryReuse#testClientJoinsWhenActivationIsInProgress
>  hangs on client connection.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (IGNITE-10493) Refactor exchange stages time measurements

2018-12-03 Thread Pavel Kovalenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Kovalenko reassigned IGNITE-10493:


Assignee: Pavel Kovalenko

> Refactor exchange stages time measurements
> --
>
> Key: IGNITE-10493
> URL: https://issues.apache.org/jira/browse/IGNITE-10493
> Project: Ignite
>  Issue Type: Improvement
>  Components: cache
>Affects Versions: 2.7
>Reporter: Pavel Kovalenko
>Assignee: Pavel Kovalenko
>Priority: Major
> Fix For: 2.8
>
>
> In the current implementation, we don't cover and measure all code paths 
> that influence PME time. Instead, we only measure the hottest individual 
> parts with the following hardcoded pattern:
> {noformat}
> long time = currentTime();
> ... // some code block
> print("Stage name performed in " + (currentTime() - time));
> {noformat}
> This approach can be improved. Instead of declaring a time variable and 
> printing the message to the log immediately, we can introduce a utility 
> class (TimesBag) that holds all stages and their times. The content of the 
> TimesBag can be printed when the exchange future is done.
> Since exchange is a linear process whose init stage is executed by the 
> exchange worker and whose finish stage is executed by one of the sys 
> threads, we can easily cover the whole exchange code base with time cutoffs.
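A minimal sketch of what such a TimesBag could look like (an editor's 
illustration; the actual implementation for this ticket may differ):

{noformat}
import java.util.LinkedHashMap;
import java.util.Map;

/** Editor's sketch: collects per-stage timings and prints them once, when the exchange is done. */
public class TimesBag {
    /** Stage name -> duration in milliseconds, in insertion order. */
    private final Map<String, Long> stages = new LinkedHashMap<>();

    /** Timestamp of the previous cutoff. */
    private long lastTs = System.currentTimeMillis();

    /** Records the time spent since the previous cutoff under the given stage name. */
    public synchronized void cutoff(String stage) {
        long now = System.currentTimeMillis();

        stages.put(stage, now - lastTs);

        lastTs = now;
    }

    /** Dumps all collected stages, e.g. when the exchange future completes. */
    public synchronized String dump() {
        StringBuilder sb = new StringBuilder("Exchange stages:");

        for (Map.Entry<String, Long> e : stages.entrySet())
            sb.append(" [").append(e.getKey()).append('=').append(e.getValue()).append(" ms]");

        return sb.toString();
    }
}
{noformat}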



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

