[jira] [Commented] (IGNITE-7540) Sequential checkpoints cause overwrite of already cleaned & freed offheap page
[ https://issues.apache.org/jira/browse/IGNITE-7540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358418#comment-16358418 ]

ASF GitHub Bot commented on IGNITE-7540:

Github user asfgit closed the pull request at:

    https://github.com/apache/ignite/pull/3490

> Sequential checkpoints cause overwrite of already cleaned & freed offheap page
>
>                 Key: IGNITE-7540
>                 URL: https://issues.apache.org/jira/browse/IGNITE-7540
>             Project: Ignite
>          Issue Type: Bug
>          Components: persistence
>    Affects Versions: 2.4
>            Reporter: Ilya Kasnacheev
>            Assignee: Alexey Goncharuk
>            Priority: Major
>         Attachments: IgnitePdsDestroyCacheTest.java
>
>
> The sequence of events is as follows:
> in GridCacheProcessor.onExchangeDone(),
> sharedCtx.database().waitForCheckpoint("caches stop") is performed, and then
> the cache is destroyed and all its pages are freed and cleared asynchronously.
> However, it is entirely possible that the next checkpoint will start
> immediately after waitForCheckpoint() returns. This is typical when a lot of
> data is being loaded into Ignite, leading to rapid checkpoint buffer
> depletion, as well as with an artificially increased checkpoint frequency, as
> used in the reproducer.
> Then the checkpointer will save (overwrite) the metadata page:
> {code:java}
>     at org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.writeUnlockPage(PageMemoryImpl.java:1330)
>     at org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.writeUnlock(PageMemoryImpl.java:428)
>     at org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.writeUnlock(PageMemoryImpl.java:422)
>     at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager.saveStoreMetadata(GridCacheOffheapManager.java:375)
>     at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager.onCheckpointBegin(GridCacheOffheapManager.java:163)
>     at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager$Checkpointer.markCheckpointBegin(GridCacheDatabaseSharedManager.java:2309)
>     at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager$Checkpointer.doCheckpoint(GridCacheDatabaseSharedManager.java:2088)
>     at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager$Checkpointer.body(GridCacheDatabaseSharedManager.java:2013)
>     at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
>     at java.lang.Thread.run(Thread.java:748){code}
> This will happen after the cache is already destroyed and even after the page
> has already been zeroed by PageMemoryImpl$ClearSegmentRunnable.run().
> Then, when a new cache is created, pageMem.acquirePage() in
> GridCacheOffheapManager$GridCacheDataStore.getOrAllocatePartitionMetas()
> will return this page, which is expected to be zeroed but actually contains
> metadata for the old cache's partition.
> Then the type == PageIO.T_PART_META check will return true and the following
> exception is thrown, leading to cache state inconsistency and data loss:
> {code:java}
> Caused by: java.lang.IllegalStateException: Failed to get page IO instance (page content is corrupted)
>     at org.apache.ignite.internal.processors.cache.persistence.tree.io.IOVersions.forVersion(IOVersions.java:83)
>     at org.apache.ignite.internal.processors.cache.persistence.tree.io.IOVersions.forPage(IOVersions.java:95)
>     at org.apache.ignite.internal.processors.cache.persistence.freelist.PagesList.init(PagesList.java:175)
>     at org.apache.ignite.internal.processors.cache.persistence.freelist.FreeListImpl.<init>(FreeListImpl.java:370)
>     at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore$1.<init>(GridCacheOffheapManager.java:932)
>     at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.init0(GridCacheOffheapManager.java:929)
>     at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.invoke(GridCacheOffheapManager.java:1295)
>     at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.invoke(IgniteCacheOffheapManagerImpl.java:344)
>     at org.apache.ignite.internal.processors.cache.GridCacheMapEntry.storeValue(GridCacheMapEntry.java:3191)
>     at org.apache.ignite.internal.processors.cache.GridCacheMapEntry.initialValue(GridCacheMapEntry.java:2571)
>     at org.apache.ignite.internal.processors.datastreamer.DataStreamerImpl$IsolatedUpdater.receive(DataStreamerImpl.java:2096)
>     at

[jira] [Commented] (IGNITE-7540) Sequential checkpoints cause overwrite of already cleaned & freed offheap page
[ https://issues.apache.org/jira/browse/IGNITE-7540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356941#comment-16356941 ]

ASF GitHub Bot commented on IGNITE-7540:

GitHub user Jokser opened a pull request:

    https://github.com/apache/ignite/pull/3490

    IGNITE-7540 Sequential checkpoints cause overwrite of already cleaned & freed offheap page

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gridgain/apache-ignite ignite-7540

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/ignite/pull/3490.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3490

commit 17bdf9459a3234eefbd379506a1d0895cd88d3a5
Author: Pavel Kovalenko
Date:   2018-02-08T13:32:40Z

    IGNITE-7540 Prevent page memory metadata corruption during checkpoint and group destroying.

> Sequential checkpoints cause overwrite of already cleaned & freed offheap page
>
>                 Key: IGNITE-7540
>                 URL: https://issues.apache.org/jira/browse/IGNITE-7540
>             Project: Ignite
>          Issue Type: Bug
>          Components: persistence
>    Affects Versions: 2.4
>            Reporter: Ilya Kasnacheev
>            Assignee: Pavel Kovalenko
>            Priority: Major
>         Attachments: IgnitePdsDestroyCacheTest.java

[jira] [Commented] (IGNITE-7540) Sequential checkpoints cause overwrite of already cleaned & freed offheap page
[ https://issues.apache.org/jira/browse/IGNITE-7540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16343344#comment-16343344 ]

Ilya Kasnacheev commented on IGNITE-7540:

Please consider applying [https://github.com/apache/ignite/pull/3448] before starting this fix.

[jira] [Commented] (IGNITE-7540) Sequential checkpoints cause overwrite of already cleaned & freed offheap page
[ https://issues.apache.org/jira/browse/IGNITE-7540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16339438#comment-16339438 ]

Ilya Kasnacheev commented on IGNITE-7540:

The proposed fix is to mark caches for destruction in GridCacheProcessor.onExchangeDone() before waiting for the checkpoint to finish, so that these caches are not touched during the next checkpoint.
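
The ordering problem in the issue description and the proposed "mark before waiting" fix can be illustrated with a small single-threaded model. This is only a sketch under stated assumptions: every class, method, and field name below is hypothetical and not an Ignite API, the two constants are stand-ins for a zeroed page and a page tagged as partition metadata, and the concurrent checkpointer and page-clearing threads are replaced by plain sequential calls so that the interleaving is deterministic.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of the destroy/checkpoint ordering; all names are hypothetical, not Ignite APIs. */
class CheckpointDestroySketch {
    /** Stand-in for a page that was cleared after its cache was destroyed. */
    static final int ZEROED = 0;

    /** Stand-in for a page carrying partition metadata. */
    static final int T_PART_META = 1;

    /** The single metadata page of the cache group being destroyed. */
    private int metaPage = T_PART_META;

    /** Groups marked for destruction; the checkpointer skips them. */
    private final Set<Integer> destroying = new HashSet<>();

    /** Models marking a group for destruction (the proposed fix does this before the wait). */
    void markForDestroy(int grpId) {
        destroying.add(grpId);
    }

    /** Models the asynchronous clearing of freed pages. */
    void clearFreedPage() {
        metaPage = ZEROED;
    }

    /** Models the metadata save performed when a checkpoint begins. */
    void onCheckpointBegin(int grpId) {
        if (destroying.contains(grpId))
            return; // Group is going away: do not touch its pages.

        metaPage = T_PART_META; // Overwrites whatever the page currently holds.
    }

    /** Replays the sequence from the report and returns the final page content. */
    static int pageTypeAfterDestroy(boolean markBeforeWait) {
        CheckpointDestroySketch s = new CheckpointDestroySketch();
        int grpId = 42; // Arbitrary group id for the illustration.

        if (markBeforeWait)
            s.markForDestroy(grpId);

        // The wait for the "caches stop" checkpoint would return here.
        s.clearFreedPage();         // Pages of the destroyed cache are zeroed.
        s.onCheckpointBegin(grpId); // The next checkpoint starts immediately.

        return s.metaPage;
    }

    public static void main(String[] args) {
        System.out.println("without mark: " + pageTypeAfterDestroy(false)); // prints "without mark: 1"
        System.out.println("with mark:    " + pageTypeAfterDestroy(true));  // prints "with mark:    0"
    }
}
```

Without the mark, the page ends up holding stale metadata even though it was zeroed, which matches the state a newly created cache would then observe. With the mark set before the wait, the checkpoint skips the group and the page stays zeroed. Whether the merged change takes exactly this form is not stated in this thread.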