[jira] [Commented] (IGNITE-7540) Sequential checkpoints cause overwrite of already cleaned & freed offheap page

2018-02-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-7540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358418#comment-16358418
 ] 

ASF GitHub Bot commented on IGNITE-7540:


GitHub user asfgit closed the pull request at:

https://github.com/apache/ignite/pull/3490


> Sequential checkpoints cause overwrite of already cleaned & freed offheap page
> --
>
> Key: IGNITE-7540
> URL: https://issues.apache.org/jira/browse/IGNITE-7540
> Project: Ignite
> Issue Type: Bug
> Components: persistence
> Affects Versions: 2.4
> Reporter: Ilya Kasnacheev
> Assignee: Alexey Goncharuk
> Priority: Major
> Attachments: IgnitePdsDestroyCacheTest.java
>
>
> The sequence of events is as follows:
> in GridCacheProcessor.onExchangeDone(), 
> sharedCtx.database().waitForCheckpoint("caches stop") is performed, and then 
> the cache is destroyed and all its pages are freed and cleared asynchronously.
> However, it is entirely possible that the next checkpoint will start 
> immediately after waitForCheckpoint() returns. This is typical when a lot of 
> data is being loaded into Ignite, leading to rapid checkpoint buffer 
> depletion, and also with an artificially increased checkpoint frequency, as 
> used in the reproducer.
> The checkpointer will then save (overwrite) the metadata page:
> {code:java}
>     at org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.writeUnlockPage(PageMemoryImpl.java:1330)
>     at org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.writeUnlock(PageMemoryImpl.java:428)
>     at org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.writeUnlock(PageMemoryImpl.java:422)
>     at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager.saveStoreMetadata(GridCacheOffheapManager.java:375)
>     at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager.onCheckpointBegin(GridCacheOffheapManager.java:163)
>     at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager$Checkpointer.markCheckpointBegin(GridCacheDatabaseSharedManager.java:2309)
>     at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager$Checkpointer.doCheckpoint(GridCacheDatabaseSharedManager.java:2088)
>     at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager$Checkpointer.body(GridCacheDatabaseSharedManager.java:2013)
>     at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
>     at java.lang.Thread.run(Thread.java:748){code}
> This happens after the cache is already destroyed, and even after the page 
> has already been zeroed by PageMemoryImpl$ClearSegmentRunnable.run().
> Then, when a new cache is created, pageMem.acquirePage() in 
> GridCacheOffheapManager$GridCacheDataStore.getOrAllocatePartitionMetas() 
> returns this page, which is expected to be zeroed but actually contains 
> metadata for the old cache's partition. The type == PageIO.T_PART_META check 
> then returns true and the following exception is thrown, leading to cache 
> state inconsistency and data loss:
> {code:java}
> Caused by: java.lang.IllegalStateException: Failed to get page IO instance (page content is corrupted)
>     at org.apache.ignite.internal.processors.cache.persistence.tree.io.IOVersions.forVersion(IOVersions.java:83)
>     at org.apache.ignite.internal.processors.cache.persistence.tree.io.IOVersions.forPage(IOVersions.java:95)
>     at org.apache.ignite.internal.processors.cache.persistence.freelist.PagesList.init(PagesList.java:175)
>     at org.apache.ignite.internal.processors.cache.persistence.freelist.FreeListImpl.<init>(FreeListImpl.java:370)
>     at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore$1.<init>(GridCacheOffheapManager.java:932)
>     at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.init0(GridCacheOffheapManager.java:929)
>     at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.invoke(GridCacheOffheapManager.java:1295)
>     at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.invoke(IgniteCacheOffheapManagerImpl.java:344)
>     at org.apache.ignite.internal.processors.cache.GridCacheMapEntry.storeValue(GridCacheMapEntry.java:3191)
>     at org.apache.ignite.internal.processors.cache.GridCacheMapEntry.initialValue(GridCacheMapEntry.java:2571)
>     at org.apache.ignite.internal.processors.datastreamer.DataStreamerImpl$IsolatedUpdater.receive(DataStreamerImpl.java:2096)
>     at 
> 

[jira] [Commented] (IGNITE-7540) Sequential checkpoints cause overwrite of already cleaned & freed offheap page

2018-02-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-7540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356941#comment-16356941
 ] 

ASF GitHub Bot commented on IGNITE-7540:


GitHub user Jokser opened a pull request:

https://github.com/apache/ignite/pull/3490

IGNITE-7540 Sequential checkpoints cause overwrite of already cleaned & 
freed offheap page



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gridgain/apache-ignite ignite-7540

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/ignite/pull/3490.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3490


commit 17bdf9459a3234eefbd379506a1d0895cd88d3a5
Author: Pavel Kovalenko 
Date:   2018-02-08T13:32:40Z

IGNITE-7540 Prevent page memory metadata corruption during checkpoint and 
group destroying.




> Sequential checkpoints cause overwrite of already cleaned & freed offheap page
> --
>
> Key: IGNITE-7540
> URL: https://issues.apache.org/jira/browse/IGNITE-7540
> Project: Ignite
> Issue Type: Bug
> Components: persistence
> Affects Versions: 2.4
> Reporter: Ilya Kasnacheev
> Assignee: Pavel Kovalenko
> Priority: Major
> Attachments: IgnitePdsDestroyCacheTest.java
>
>

[jira] [Commented] (IGNITE-7540) Sequential checkpoints cause overwrite of already cleaned & freed offheap page

2018-01-29 Thread Ilya Kasnacheev (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-7540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16343344#comment-16343344
 ] 

Ilya Kasnacheev commented on IGNITE-7540:
-

Please consider applying [https://github.com/apache/ignite/pull/3448] before 
starting this fix


[jira] [Commented] (IGNITE-7540) Sequential checkpoints cause overwrite of already cleaned & freed offheap page

2018-01-25 Thread Ilya Kasnacheev (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-7540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16339438#comment-16339438
 ] 

Ilya Kasnacheev commented on IGNITE-7540:
-

The proposed fix is to mark caches for destruction in 
GridCacheProcessor.onExchangeDone() before waiting for the checkpoint to 
finish, so that the next checkpoint does not touch these caches.
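
To illustrate the idea, here is a minimal single-threaded sketch of the ordering problem and of the proposed mark-before-wait approach. It is not Ignite code: the class DestroyMarkSketch, its fields, and the page constants are hypothetical stand-ins, and the real checkpointer and page clearing run on separate threads. The sketch only assumes that a checkpoint rewrites the metadata page of every cache it still sees as registered.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical model of the race; none of these names exist in Ignite. */
public class DestroyMarkSketch {
    static final int ZEROED = 0;     // page content after it was cleared
    static final int PART_META = 1;  // page content after metadata was saved

    // Stand-in for page memory: cache name -> content of its metadata page.
    private final Map<String, Integer> metaPages = new ConcurrentHashMap<>();
    private final Set<String> registered = ConcurrentHashMap.newKeySet();
    private final Set<String> markedForDestroy = ConcurrentHashMap.newKeySet();

    void createCache(String name) {
        registered.add(name);
        metaPages.put(name, ZEROED);
    }

    /** Models checkpoint begin: saves metadata for every cache not marked for destruction. */
    void checkpointBegin() {
        for (String name : registered) {
            if (markedForDestroy.contains(name))
                continue; // the proposed fix: skip caches that are being destroyed
            metaPages.put(name, PART_META);
        }
    }

    /** Returns the metadata page content left behind once the cache is destroyed. */
    int destroyCache(String name, boolean markFirst) {
        if (markFirst)
            markedForDestroy.add(name);  // mark BEFORE waiting for the checkpoint
        checkpointBegin();               // stands in for waitForCheckpoint("caches stop")
        metaPages.put(name, ZEROED);     // pages are freed and cleared
        checkpointBegin();               // the next checkpoint starts immediately
        registered.remove(name);
        markedForDestroy.remove(name);
        return metaPages.get(name);
    }

    static int pageAfterDestroy(boolean markFirst) {
        DestroyMarkSketch s = new DestroyMarkSketch();
        s.createCache("old");
        return s.destroyCache("old", markFirst);
    }

    public static void main(String[] args) {
        System.out.println("without mark: " + pageAfterDestroy(false)); // 1, stale metadata
        System.out.println("with mark:    " + pageAfterDestroy(true));  // 0, page stays zeroed
    }
}
```

Without the mark, the second checkpoint rewrites the already cleared page, which corresponds to the stale T_PART_META content described in the issue. With the mark set first, both checkpoints skip the cache and the page stays zeroed.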
