[jira] [Commented] (IGNITE-9612) Improve checkpoint mark phase speed.
[ https://issues.apache.org/jira/browse/IGNITE-9612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16649075#comment-16649075 ] Alexei Scherbakov commented on IGNITE-9612: --- Yet another commit fixes slowdown on snapshot creation: 6041cda5a0b7331c33ec643d0c8ccbd77a120a86 > Improve checkpoint mark phase speed. > > > Key: IGNITE-9612 > URL: https://issues.apache.org/jira/browse/IGNITE-9612 > Project: Ignite > Issue Type: Improvement >Reporter: Alexei Scherbakov >Assignee: Alexei Scherbakov >Priority: Major > Fix For: 2.7 > > > I'm observing regular slow checkpoints due to long mark duration, which is > not related to dirty pages number: > {noformat} > 2018-09-01 14:55:20.408 [INFO > ][db-checkpoint-thread-#241%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.p.GridCacheDatabaseSharedManager] > Checkpoint started [checkpointId=01e0c7bf-842f-4ed6-8589-b4904063434f, > startPtr=FileWALPointer [idx=19814, fileOff=948996096, len=5233457], > checkpointLockWait=0ms, checkpointLockHoldTime=951ms, > walCpRecordFsyncDuration=39ms, pages=78477, reason='timeout'] > 2018-09-01 14:55:21.307 [INFO > ][db-checkpoint-thread-#241%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.p.GridCacheDatabaseSharedManager] > Checkpoint finished [cpId=01e0c7bf-842f-4ed6-8589-b4904063434f, pages=78477, > markPos=FileWALPointer [idx=19814, fileOff=948996096, len=5233457], > walSegmentsCleared=0, walSegmentsCovered=[], *markDuration=1002m*s, > pagesWrite=478ms, fsync=421ms, total=1901ms] > {noformat} > {noformat} > 2018-09-01 14:58:20.355 [INFO > ][db-checkpoint-thread-#241%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.p.GridCacheDatabaseSharedManager] > Checkpoint started [checkpointId=09d1f4bc-d3f3-4a16-b291-89d7fa745ea5, > startPtr=FileWALPointer [idx=19814, fileOff=124208, len=5233457], > checkpointLockWait=0ms, checkpointLockHoldTime=926ms, > walCpRecordFsyncDuration=14ms, pages=10837, reason='timeout'] > 2018-09-01 14:58:20.480 [INFO > ][db-checkpoint-thread-#241%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.p.GridCacheDatabaseSharedManager] > Checkpoint finished [cpId=09d1f4bc-d3f3-4a16-b291-89d7fa745ea5, pages=10837, > markPos=FileWALPointer [idx=19814, fileOff=124208, len=5233457], > walSegmentsCleared=0, walSegmentsCovered=[], *markDuration=943ms*, > pagesWrite=64ms, fsync=61ms, total=1068ms] > {noformat} > Debugging has revealed what this is due to large amount of work required to > save metadata for metapages and free/reuse lists. Because this is done under > checkpoint write lock, all other activities are blocked, resulting in > increased tx and atomic ops latency. > Simple solution: parallelize metadata processing during mark phase. > Best way to solve the problem is described in IGNITE-9520. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (IGNITE-9612) Improve checkpoint mark phase speed.
[ https://issues.apache.org/jira/browse/IGNITE-9612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632056#comment-16632056 ] ASF GitHub Bot commented on IGNITE-9612: Github user ascherbakoff closed the pull request at: https://github.com/apache/ignite/pull/4864 > Improve checkpoint mark phase speed. > > > Key: IGNITE-9612 > URL: https://issues.apache.org/jira/browse/IGNITE-9612 > Project: Ignite > Issue Type: Improvement >Reporter: Alexei Scherbakov >Assignee: Alexei Scherbakov >Priority: Major > Fix For: 2.7 > > > I'm observing regular slow checkpoints due to long mark duration, which is > not related to dirty pages number: > {noformat} > 2018-09-01 14:55:20.408 [INFO > ][db-checkpoint-thread-#241%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.p.GridCacheDatabaseSharedManager] > Checkpoint started [checkpointId=01e0c7bf-842f-4ed6-8589-b4904063434f, > startPtr=FileWALPointer [idx=19814, fileOff=948996096, len=5233457], > checkpointLockWait=0ms, checkpointLockHoldTime=951ms, > walCpRecordFsyncDuration=39ms, pages=78477, reason='timeout'] > 2018-09-01 14:55:21.307 [INFO > ][db-checkpoint-thread-#241%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.p.GridCacheDatabaseSharedManager] > Checkpoint finished [cpId=01e0c7bf-842f-4ed6-8589-b4904063434f, pages=78477, > markPos=FileWALPointer [idx=19814, fileOff=948996096, len=5233457], > walSegmentsCleared=0, walSegmentsCovered=[], *markDuration=1002m*s, > pagesWrite=478ms, fsync=421ms, total=1901ms] > {noformat} > {noformat} > 2018-09-01 14:58:20.355 [INFO > ][db-checkpoint-thread-#241%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.p.GridCacheDatabaseSharedManager] > Checkpoint started [checkpointId=09d1f4bc-d3f3-4a16-b291-89d7fa745ea5, > startPtr=FileWALPointer [idx=19814, fileOff=124208, len=5233457], > checkpointLockWait=0ms, checkpointLockHoldTime=926ms, > walCpRecordFsyncDuration=14ms, pages=10837, reason='timeout'] > 2018-09-01 14:58:20.480 [INFO > ][db-checkpoint-thread-#241%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.p.GridCacheDatabaseSharedManager] > Checkpoint finished [cpId=09d1f4bc-d3f3-4a16-b291-89d7fa745ea5, pages=10837, > markPos=FileWALPointer [idx=19814, fileOff=124208, len=5233457], > walSegmentsCleared=0, walSegmentsCovered=[], *markDuration=943ms*, > pagesWrite=64ms, fsync=61ms, total=1068ms] > {noformat} > Debugging has revealed what this is due to large amount of work required to > save metadata for metapages and free/reuse lists. Because this is done under > checkpoint write lock, all other activities are blocked, resulting in > increased tx and atomic ops latency. > Simple solution: parallelize metadata processing during mark phase. > Best way to solve the problem is described in IGNITE-9520. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (IGNITE-9612) Improve checkpoint mark phase speed.
[ https://issues.apache.org/jira/browse/IGNITE-9612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632055#comment-16632055 ] Alexei Scherbakov commented on IGNITE-9612: --- Another issue is related to snapshots. Fixed in commit 2f3b567f61c2a9a1080ee7fda57f3a18231b1dae > Improve checkpoint mark phase speed. > > > Key: IGNITE-9612 > URL: https://issues.apache.org/jira/browse/IGNITE-9612 > Project: Ignite > Issue Type: Improvement >Reporter: Alexei Scherbakov >Assignee: Alexei Scherbakov >Priority: Major > Fix For: 2.7 > > > I'm observing regular slow checkpoints due to long mark duration, which is > not related to dirty pages number: > {noformat} > 2018-09-01 14:55:20.408 [INFO > ][db-checkpoint-thread-#241%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.p.GridCacheDatabaseSharedManager] > Checkpoint started [checkpointId=01e0c7bf-842f-4ed6-8589-b4904063434f, > startPtr=FileWALPointer [idx=19814, fileOff=948996096, len=5233457], > checkpointLockWait=0ms, checkpointLockHoldTime=951ms, > walCpRecordFsyncDuration=39ms, pages=78477, reason='timeout'] > 2018-09-01 14:55:21.307 [INFO > ][db-checkpoint-thread-#241%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.p.GridCacheDatabaseSharedManager] > Checkpoint finished [cpId=01e0c7bf-842f-4ed6-8589-b4904063434f, pages=78477, > markPos=FileWALPointer [idx=19814, fileOff=948996096, len=5233457], > walSegmentsCleared=0, walSegmentsCovered=[], *markDuration=1002m*s, > pagesWrite=478ms, fsync=421ms, total=1901ms] > {noformat} > {noformat} > 2018-09-01 14:58:20.355 [INFO > ][db-checkpoint-thread-#241%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.p.GridCacheDatabaseSharedManager] > Checkpoint started [checkpointId=09d1f4bc-d3f3-4a16-b291-89d7fa745ea5, > startPtr=FileWALPointer [idx=19814, fileOff=124208, len=5233457], > checkpointLockWait=0ms, checkpointLockHoldTime=926ms, > walCpRecordFsyncDuration=14ms, pages=10837, reason='timeout'] > 2018-09-01 14:58:20.480 [INFO > ][db-checkpoint-thread-#241%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.p.GridCacheDatabaseSharedManager] > Checkpoint finished [cpId=09d1f4bc-d3f3-4a16-b291-89d7fa745ea5, pages=10837, > markPos=FileWALPointer [idx=19814, fileOff=124208, len=5233457], > walSegmentsCleared=0, walSegmentsCovered=[], *markDuration=943ms*, > pagesWrite=64ms, fsync=61ms, total=1068ms] > {noformat} > Debugging has revealed what this is due to large amount of work required to > save metadata for metapages and free/reuse lists. Because this is done under > checkpoint write lock, all other activities are blocked, resulting in > increased tx and atomic ops latency. > Simple solution: parallelize metadata processing during mark phase. > Best way to solve the problem is described in IGNITE-9520. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (IGNITE-9612) Improve checkpoint mark phase speed.
[ https://issues.apache.org/jira/browse/IGNITE-9612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632054#comment-16632054 ] Alexey Goncharuk commented on IGNITE-9612: -- Another fix with checkpoint listener logic is merged to master and ignite-2.7 branch: 2f3b567f61c2a9a1080ee7fda57f3a18231b1dae > Improve checkpoint mark phase speed. > > > Key: IGNITE-9612 > URL: https://issues.apache.org/jira/browse/IGNITE-9612 > Project: Ignite > Issue Type: Improvement >Reporter: Alexei Scherbakov >Assignee: Alexei Scherbakov >Priority: Major > Fix For: 2.7 > > > I'm observing regular slow checkpoints due to long mark duration, which is > not related to dirty pages number: > {noformat} > 2018-09-01 14:55:20.408 [INFO > ][db-checkpoint-thread-#241%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.p.GridCacheDatabaseSharedManager] > Checkpoint started [checkpointId=01e0c7bf-842f-4ed6-8589-b4904063434f, > startPtr=FileWALPointer [idx=19814, fileOff=948996096, len=5233457], > checkpointLockWait=0ms, checkpointLockHoldTime=951ms, > walCpRecordFsyncDuration=39ms, pages=78477, reason='timeout'] > 2018-09-01 14:55:21.307 [INFO > ][db-checkpoint-thread-#241%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.p.GridCacheDatabaseSharedManager] > Checkpoint finished [cpId=01e0c7bf-842f-4ed6-8589-b4904063434f, pages=78477, > markPos=FileWALPointer [idx=19814, fileOff=948996096, len=5233457], > walSegmentsCleared=0, walSegmentsCovered=[], *markDuration=1002m*s, > pagesWrite=478ms, fsync=421ms, total=1901ms] > {noformat} > {noformat} > 2018-09-01 14:58:20.355 [INFO > ][db-checkpoint-thread-#241%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.p.GridCacheDatabaseSharedManager] > Checkpoint started [checkpointId=09d1f4bc-d3f3-4a16-b291-89d7fa745ea5, > startPtr=FileWALPointer [idx=19814, fileOff=124208, len=5233457], > checkpointLockWait=0ms, checkpointLockHoldTime=926ms, > walCpRecordFsyncDuration=14ms, pages=10837, reason='timeout'] > 2018-09-01 14:58:20.480 [INFO > ][db-checkpoint-thread-#241%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.p.GridCacheDatabaseSharedManager] > Checkpoint finished [cpId=09d1f4bc-d3f3-4a16-b291-89d7fa745ea5, pages=10837, > markPos=FileWALPointer [idx=19814, fileOff=124208, len=5233457], > walSegmentsCleared=0, walSegmentsCovered=[], *markDuration=943ms*, > pagesWrite=64ms, fsync=61ms, total=1068ms] > {noformat} > Debugging has revealed what this is due to large amount of work required to > save metadata for metapages and free/reuse lists. Because this is done under > checkpoint write lock, all other activities are blocked, resulting in > increased tx and atomic ops latency. > Simple solution: parallelize metadata processing during mark phase. > Best way to solve the problem is described in IGNITE-9520. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (IGNITE-9612) Improve checkpoint mark phase speed.
[ https://issues.apache.org/jira/browse/IGNITE-9612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631585#comment-16631585 ] ASF GitHub Bot commented on IGNITE-9612: GitHub user ascherbakoff opened a pull request: https://github.com/apache/ignite/pull/4864 IGNITE-9612 test fix You can merge this pull request into a Git repository by running: $ git pull https://github.com/gridgain/apache-ignite ignite-9612 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/ignite/pull/4864.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4864 commit 7e3c341ff4b99d3ca6f10283da539464d45a1c17 Author: ascherbakoff Date: 2018-09-23T16:34:09Z IGNITE-9612 wip. commit 5ab5553e989df5971472978d36515cfe1565627b Author: Aleksei Scherbakov Date: 2018-09-24T10:49:59Z IGNITE-9612 wip. commit 5ad45b70e1fe6a921b49aec943bec874dcf64758 Author: Aleksei Scherbakov Date: 2018-09-25T10:30:18Z IGNITE-9612 wip. commit 53c99ba9bb02cd83ee9752224ca27a82312b731d Author: Aleksei Scherbakov Date: 2018-09-25T10:33:15Z IGNITE-9612 wip. commit 795f296cfa7cd0783f03d76b1b7066b835a3d4c2 Author: Aleksei Scherbakov Date: 2018-09-25T10:33:34Z Merge branch 'master' of https://github.com/apache/ignite into ignite-9612 commit 9a5b1519c333953fb24621742ff0ad6aa03a011a Author: Aleksei Scherbakov Date: 2018-09-26T09:48:25Z Merge branch 'master' of https://github.com/apache/ignite into ignite-9612 commit b105f7beb8643ad971d33352b8fd3fd90e6bb3f4 Author: Aleksei Scherbakov Date: 2018-09-26T17:20:48Z IGNITE-9612 wip. commit db622f6f709d758324f15b6aa582c40c3bbea13d Author: ascherbakoff Date: 2018-09-27T19:28:57Z Merge branch 'master' of https://github.com/apache/ignite into ignite-9612 commit 12c36d2771e870c08e6bfc2d3eef561f9ac6c8ca Author: Aleksei Scherbakov Date: 2018-09-28T08:34:05Z Merge branch 'master' of https://github.com/apache/ignite into ignite-9612 commit a2c6b48127f9d1a4660c43e05b9302c7c16dc2d1 Author: Aleksei Scherbakov Date: 2018-09-28T08:36:38Z Merge branch 'ignite-9612' of https://github.com/gridgain/apache-ignite into ignite-9612 commit 0b2b24a58bcd0a9a6f35536852628ddde4bcb846 Author: Aleksei Scherbakov Date: 2018-09-28T09:04:57Z IGNITE-9612 wip. > Improve checkpoint mark phase speed. > > > Key: IGNITE-9612 > URL: https://issues.apache.org/jira/browse/IGNITE-9612 > Project: Ignite > Issue Type: Improvement >Reporter: Alexei Scherbakov >Assignee: Alexei Scherbakov >Priority: Major > Fix For: 2.7 > > > I'm observing regular slow checkpoints due to long mark duration, which is > not related to dirty pages number: > {noformat} > 2018-09-01 14:55:20.408 [INFO > ][db-checkpoint-thread-#241%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.p.GridCacheDatabaseSharedManager] > Checkpoint started [checkpointId=01e0c7bf-842f-4ed6-8589-b4904063434f, > startPtr=FileWALPointer [idx=19814, fileOff=948996096, len=5233457], > checkpointLockWait=0ms, checkpointLockHoldTime=951ms, > walCpRecordFsyncDuration=39ms, pages=78477, reason='timeout'] > 2018-09-01 14:55:21.307 [INFO > ][db-checkpoint-thread-#241%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.p.GridCacheDatabaseSharedManager] > Checkpoint finished [cpId=01e0c7bf-842f-4ed6-8589-b4904063434f, pages=78477, > markPos=FileWALPointer [idx=19814, fileOff=948996096, len=5233457], > walSegmentsCleared=0, walSegmentsCovered=[], *markDuration=1002m*s, > pagesWrite=478ms, fsync=421ms, total=1901ms] > {noformat} > {noformat} > 2018-09-01 14:58:20.355 [INFO > ][db-checkpoint-thread-#241%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.p.GridCacheDatabaseSharedManager] > Checkpoint started [checkpointId=09d1f4bc-d3f3-4a16-b291-89d7fa745ea5, > startPtr=FileWALPointer [idx=19814, fileOff=124208, len=5233457], > checkpointLockWait=0ms, checkpointLockHoldTime=926ms, > walCpRecordFsyncDuration=14ms, pages=10837, reason='timeout'] > 2018-09-01 14:58:20.480 [INFO > ][db-checkpoint-thread-#241%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.p.GridCacheDatabaseSharedManager] > Checkpoint finished [cpId=09d1f4bc-d3f3-4a16-b291-89d7fa745ea5, pages=10837, > markPos=FileWALPointer [idx=19814, fileOff=124208, len=5233457], > walSegmentsCleared=0, walSegmentsCovered=[], *markDuration=943ms*, > pagesWrite=64ms, fsync=61ms, total=1068ms] > {noformat} > Debugging has revealed what this is due to large amount of work required to > save metadata for metapages and free/reuse lists. Because this is done under > checkpoint write lock, all other activities are blocked, resulting in > increased tx and atomic ops latency. > Simple solution: parallelize metadata
[jira] [Commented] (IGNITE-9612) Improve checkpoint mark phase speed.
[ https://issues.apache.org/jira/browse/IGNITE-9612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630721#comment-16630721 ] Alexei Scherbakov commented on IGNITE-9612: --- The ticket requires fix for concurrency issue with PartitionAllocationMap - it uses thread-unsafe collections Fixed with commit [1] [1] 4527fe6de8bddea31a8142e78a84eacb534d0384 > Improve checkpoint mark phase speed. > > > Key: IGNITE-9612 > URL: https://issues.apache.org/jira/browse/IGNITE-9612 > Project: Ignite > Issue Type: Improvement >Reporter: Alexei Scherbakov >Assignee: Alexei Scherbakov >Priority: Major > Fix For: 2.7 > > > I'm observing regular slow checkpoints due to long mark duration, which is > not related to dirty pages number: > {noformat} > 2018-09-01 14:55:20.408 [INFO > ][db-checkpoint-thread-#241%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.p.GridCacheDatabaseSharedManager] > Checkpoint started [checkpointId=01e0c7bf-842f-4ed6-8589-b4904063434f, > startPtr=FileWALPointer [idx=19814, fileOff=948996096, len=5233457], > checkpointLockWait=0ms, checkpointLockHoldTime=951ms, > walCpRecordFsyncDuration=39ms, pages=78477, reason='timeout'] > 2018-09-01 14:55:21.307 [INFO > ][db-checkpoint-thread-#241%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.p.GridCacheDatabaseSharedManager] > Checkpoint finished [cpId=01e0c7bf-842f-4ed6-8589-b4904063434f, pages=78477, > markPos=FileWALPointer [idx=19814, fileOff=948996096, len=5233457], > walSegmentsCleared=0, walSegmentsCovered=[], *markDuration=1002m*s, > pagesWrite=478ms, fsync=421ms, total=1901ms] > {noformat} > {noformat} > 2018-09-01 14:58:20.355 [INFO > ][db-checkpoint-thread-#241%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.p.GridCacheDatabaseSharedManager] > Checkpoint started [checkpointId=09d1f4bc-d3f3-4a16-b291-89d7fa745ea5, > startPtr=FileWALPointer [idx=19814, fileOff=124208, len=5233457], > checkpointLockWait=0ms, checkpointLockHoldTime=926ms, > walCpRecordFsyncDuration=14ms, pages=10837, reason='timeout'] > 2018-09-01 14:58:20.480 [INFO > ][db-checkpoint-thread-#241%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.p.GridCacheDatabaseSharedManager] > Checkpoint finished [cpId=09d1f4bc-d3f3-4a16-b291-89d7fa745ea5, pages=10837, > markPos=FileWALPointer [idx=19814, fileOff=124208, len=5233457], > walSegmentsCleared=0, walSegmentsCovered=[], *markDuration=943ms*, > pagesWrite=64ms, fsync=61ms, total=1068ms] > {noformat} > Debugging has revealed what this is due to large amount of work required to > save metadata for metapages and free/reuse lists. Because this is done under > checkpoint write lock, all other activities are blocked, resulting in > increased tx and atomic ops latency. > Simple solution: parallelize metadata processing during mark phase. > Best way to solve the problem is described in IGNITE-9520. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (IGNITE-9612) Improve checkpoint mark phase speed.
[ https://issues.apache.org/jira/browse/IGNITE-9612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630218#comment-16630218 ] ASF GitHub Bot commented on IGNITE-9612: Github user asfgit closed the pull request at: https://github.com/apache/ignite/pull/4813 > Improve checkpoint mark phase speed. > > > Key: IGNITE-9612 > URL: https://issues.apache.org/jira/browse/IGNITE-9612 > Project: Ignite > Issue Type: Improvement >Reporter: Alexei Scherbakov >Assignee: Alexei Scherbakov >Priority: Major > Fix For: 2.7 > > > I'm observing regular slow checkpoints due to long mark duration, which is > not related to dirty pages number: > {noformat} > 2018-09-01 14:55:20.408 [INFO > ][db-checkpoint-thread-#241%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.p.GridCacheDatabaseSharedManager] > Checkpoint started [checkpointId=01e0c7bf-842f-4ed6-8589-b4904063434f, > startPtr=FileWALPointer [idx=19814, fileOff=948996096, len=5233457], > checkpointLockWait=0ms, checkpointLockHoldTime=951ms, > walCpRecordFsyncDuration=39ms, pages=78477, reason='timeout'] > 2018-09-01 14:55:21.307 [INFO > ][db-checkpoint-thread-#241%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.p.GridCacheDatabaseSharedManager] > Checkpoint finished [cpId=01e0c7bf-842f-4ed6-8589-b4904063434f, pages=78477, > markPos=FileWALPointer [idx=19814, fileOff=948996096, len=5233457], > walSegmentsCleared=0, walSegmentsCovered=[], *markDuration=1002m*s, > pagesWrite=478ms, fsync=421ms, total=1901ms] > {noformat} > {noformat} > 2018-09-01 14:58:20.355 [INFO > ][db-checkpoint-thread-#241%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.p.GridCacheDatabaseSharedManager] > Checkpoint started [checkpointId=09d1f4bc-d3f3-4a16-b291-89d7fa745ea5, > startPtr=FileWALPointer [idx=19814, fileOff=124208, len=5233457], > checkpointLockWait=0ms, checkpointLockHoldTime=926ms, > walCpRecordFsyncDuration=14ms, pages=10837, reason='timeout'] > 2018-09-01 14:58:20.480 [INFO > ][db-checkpoint-thread-#241%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.p.GridCacheDatabaseSharedManager] > Checkpoint finished [cpId=09d1f4bc-d3f3-4a16-b291-89d7fa745ea5, pages=10837, > markPos=FileWALPointer [idx=19814, fileOff=124208, len=5233457], > walSegmentsCleared=0, walSegmentsCovered=[], *markDuration=943ms*, > pagesWrite=64ms, fsync=61ms, total=1068ms] > {noformat} > Debugging has revealed what this is due to large amount of work required to > save metadata for metapages and free/reuse lists. Because this is done under > checkpoint write lock, all other activities are blocked, resulting in > increased tx and atomic ops latency. > Simple solution: parallelize metadata processing during mark phase. > Best way to solve the problem is described in IGNITE-9520. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (IGNITE-9612) Improve checkpoint mark phase speed.
[ https://issues.apache.org/jira/browse/IGNITE-9612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630215#comment-16630215 ] Dmitriy Govorukhin commented on IGNITE-9612: [~ascherbakov] Changes are merged, thanks for the contribution! > Improve checkpoint mark phase speed. > > > Key: IGNITE-9612 > URL: https://issues.apache.org/jira/browse/IGNITE-9612 > Project: Ignite > Issue Type: Improvement >Reporter: Alexei Scherbakov >Assignee: Alexei Scherbakov >Priority: Major > Fix For: 2.7 > > > I'm observing regular slow checkpoints due to long mark duration, which is > not related to dirty pages number: > {noformat} > 2018-09-01 14:55:20.408 [INFO > ][db-checkpoint-thread-#241%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.p.GridCacheDatabaseSharedManager] > Checkpoint started [checkpointId=01e0c7bf-842f-4ed6-8589-b4904063434f, > startPtr=FileWALPointer [idx=19814, fileOff=948996096, len=5233457], > checkpointLockWait=0ms, checkpointLockHoldTime=951ms, > walCpRecordFsyncDuration=39ms, pages=78477, reason='timeout'] > 2018-09-01 14:55:21.307 [INFO > ][db-checkpoint-thread-#241%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.p.GridCacheDatabaseSharedManager] > Checkpoint finished [cpId=01e0c7bf-842f-4ed6-8589-b4904063434f, pages=78477, > markPos=FileWALPointer [idx=19814, fileOff=948996096, len=5233457], > walSegmentsCleared=0, walSegmentsCovered=[], *markDuration=1002m*s, > pagesWrite=478ms, fsync=421ms, total=1901ms] > {noformat} > {noformat} > 2018-09-01 14:58:20.355 [INFO > ][db-checkpoint-thread-#241%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.p.GridCacheDatabaseSharedManager] > Checkpoint started [checkpointId=09d1f4bc-d3f3-4a16-b291-89d7fa745ea5, > startPtr=FileWALPointer [idx=19814, fileOff=124208, len=5233457], > checkpointLockWait=0ms, checkpointLockHoldTime=926ms, > walCpRecordFsyncDuration=14ms, pages=10837, reason='timeout'] > 2018-09-01 14:58:20.480 [INFO > ][db-checkpoint-thread-#241%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.p.GridCacheDatabaseSharedManager] > Checkpoint finished [cpId=09d1f4bc-d3f3-4a16-b291-89d7fa745ea5, pages=10837, > markPos=FileWALPointer [idx=19814, fileOff=124208, len=5233457], > walSegmentsCleared=0, walSegmentsCovered=[], *markDuration=943ms*, > pagesWrite=64ms, fsync=61ms, total=1068ms] > {noformat} > Debugging has revealed what this is due to large amount of work required to > save metadata for metapages and free/reuse lists. Because this is done under > checkpoint write lock, all other activities are blocked, resulting in > increased tx and atomic ops latency. > Simple solution: parallelize metadata processing during mark phase. > Best way to solve the problem is described in IGNITE-9520. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (IGNITE-9612) Improve checkpoint mark phase speed.
[ https://issues.apache.org/jira/browse/IGNITE-9612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630125#comment-16630125 ] Ignite TC Bot commented on IGNITE-9612: --- {panel:title=No blockers found!|borderStyle=dashed|borderColor=#ccc|titleBGColor=#D6F7C1}{panel} [TeamCity Run All|http://ci.ignite.apache.org/viewLog.html?buildId=1939519buildTypeId=IgniteTests24Java8_RunAll] > Improve checkpoint mark phase speed. > > > Key: IGNITE-9612 > URL: https://issues.apache.org/jira/browse/IGNITE-9612 > Project: Ignite > Issue Type: Improvement >Reporter: Alexei Scherbakov >Assignee: Alexei Scherbakov >Priority: Major > Fix For: 2.7 > > > I'm observing regular slow checkpoints due to long mark duration, which is > not related to dirty pages number: > {noformat} > 2018-09-01 14:55:20.408 [INFO > ][db-checkpoint-thread-#241%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.p.GridCacheDatabaseSharedManager] > Checkpoint started [checkpointId=01e0c7bf-842f-4ed6-8589-b4904063434f, > startPtr=FileWALPointer [idx=19814, fileOff=948996096, len=5233457], > checkpointLockWait=0ms, checkpointLockHoldTime=951ms, > walCpRecordFsyncDuration=39ms, pages=78477, reason='timeout'] > 2018-09-01 14:55:21.307 [INFO > ][db-checkpoint-thread-#241%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.p.GridCacheDatabaseSharedManager] > Checkpoint finished [cpId=01e0c7bf-842f-4ed6-8589-b4904063434f, pages=78477, > markPos=FileWALPointer [idx=19814, fileOff=948996096, len=5233457], > walSegmentsCleared=0, walSegmentsCovered=[], *markDuration=1002m*s, > pagesWrite=478ms, fsync=421ms, total=1901ms] > {noformat} > {noformat} > 2018-09-01 14:58:20.355 [INFO > ][db-checkpoint-thread-#241%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.p.GridCacheDatabaseSharedManager] > Checkpoint started [checkpointId=09d1f4bc-d3f3-4a16-b291-89d7fa745ea5, > startPtr=FileWALPointer [idx=19814, fileOff=124208, len=5233457], > checkpointLockWait=0ms, checkpointLockHoldTime=926ms, > walCpRecordFsyncDuration=14ms, pages=10837, reason='timeout'] > 2018-09-01 14:58:20.480 [INFO > ][db-checkpoint-thread-#241%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.p.GridCacheDatabaseSharedManager] > Checkpoint finished [cpId=09d1f4bc-d3f3-4a16-b291-89d7fa745ea5, pages=10837, > markPos=FileWALPointer [idx=19814, fileOff=124208, len=5233457], > walSegmentsCleared=0, walSegmentsCovered=[], *markDuration=943ms*, > pagesWrite=64ms, fsync=61ms, total=1068ms] > {noformat} > Debugging has revealed what this is due to large amount of work required to > save metadata for metapages and free/reuse lists. Because this is done under > checkpoint write lock, all other activities are blocked, resulting in > increased tx and atomic ops latency. > Simple solution: parallelize metadata processing during mark phase. > Best way to solve the problem is described in IGNITE-9520. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (IGNITE-9612) Improve checkpoint mark phase speed.
[ https://issues.apache.org/jira/browse/IGNITE-9612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629952#comment-16629952 ] Ignite TC Bot commented on IGNITE-9612: --- {panel:title=Possible Blockers|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1} {color:#d04437}Hadoop{color} [[tests 0 TIMEOUT , Exit Code |https://ci.ignite.apache.org/viewLog.html?buildId=1939435]] * IgniteHadoopFileSystemClientBasedProxySelfTest.testClientReconnect (last started) {panel} [TeamCity Run All|http://ci.ignite.apache.org/viewLog.html?buildId=1939519buildTypeId=IgniteTests24Java8_RunAll] > Improve checkpoint mark phase speed. > > > Key: IGNITE-9612 > URL: https://issues.apache.org/jira/browse/IGNITE-9612 > Project: Ignite > Issue Type: Improvement >Reporter: Alexei Scherbakov >Assignee: Alexei Scherbakov >Priority: Major > Fix For: 2.7 > > > I'm observing regular slow checkpoints due to long mark duration, which is > not related to dirty pages number: > {noformat} > 2018-09-01 14:55:20.408 [INFO > ][db-checkpoint-thread-#241%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.p.GridCacheDatabaseSharedManager] > Checkpoint started [checkpointId=01e0c7bf-842f-4ed6-8589-b4904063434f, > startPtr=FileWALPointer [idx=19814, fileOff=948996096, len=5233457], > checkpointLockWait=0ms, checkpointLockHoldTime=951ms, > walCpRecordFsyncDuration=39ms, pages=78477, reason='timeout'] > 2018-09-01 14:55:21.307 [INFO > ][db-checkpoint-thread-#241%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.p.GridCacheDatabaseSharedManager] > Checkpoint finished [cpId=01e0c7bf-842f-4ed6-8589-b4904063434f, pages=78477, > markPos=FileWALPointer [idx=19814, fileOff=948996096, len=5233457], > walSegmentsCleared=0, walSegmentsCovered=[], *markDuration=1002m*s, > pagesWrite=478ms, fsync=421ms, total=1901ms] > {noformat} > {noformat} > 2018-09-01 14:58:20.355 [INFO > ][db-checkpoint-thread-#241%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.p.GridCacheDatabaseSharedManager] > Checkpoint started [checkpointId=09d1f4bc-d3f3-4a16-b291-89d7fa745ea5, > startPtr=FileWALPointer [idx=19814, fileOff=124208, len=5233457], > checkpointLockWait=0ms, checkpointLockHoldTime=926ms, > walCpRecordFsyncDuration=14ms, pages=10837, reason='timeout'] > 2018-09-01 14:58:20.480 [INFO > ][db-checkpoint-thread-#241%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.p.GridCacheDatabaseSharedManager] > Checkpoint finished [cpId=09d1f4bc-d3f3-4a16-b291-89d7fa745ea5, pages=10837, > markPos=FileWALPointer [idx=19814, fileOff=124208, len=5233457], > walSegmentsCleared=0, walSegmentsCovered=[], *markDuration=943ms*, > pagesWrite=64ms, fsync=61ms, total=1068ms] > {noformat} > Debugging has revealed what this is due to large amount of work required to > save metadata for metapages and free/reuse lists. Because this is done under > checkpoint write lock, all other activities are blocked, resulting in > increased tx and atomic ops latency. > Simple solution: parallelize metadata processing during mark phase. > Best way to solve the problem is described in IGNITE-9520. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (IGNITE-9612) Improve checkpoint mark phase speed.
[ https://issues.apache.org/jira/browse/IGNITE-9612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628858#comment-16628858 ] Alexei Scherbakov commented on IGNITE-9612: --- Looks like many test failures are related to new system thread liveness check functionality. I see many SYSTEM_WORKER_BLOCKED messages in logs resulting in hanging of failing tests. For example, look at [1], CpTriggeredWalDeltaConsistencyTest.testPutRemoveCacheDestroy The test always fails in master and in PR branch because fsync-ing of large amount of pages on checkpoint takes longer than 10 sec on average hardware (default failure detection timeout) I suggest to disable liveness checking during TC test runs due to possible thread block detection false positives because of TC overload or JVM freezes, whatever. [~agoncharuk], Please review. [1] https://ci.ignite.apache.org/viewLog.html?buildId=1939475=buildResultsDiv=IgniteTests24Java8_Pds1# > Improve checkpoint mark phase speed. > > > Key: IGNITE-9612 > URL: https://issues.apache.org/jira/browse/IGNITE-9612 > Project: Ignite > Issue Type: Improvement >Reporter: Alexei Scherbakov >Assignee: Alexei Scherbakov >Priority: Major > Fix For: 2.7 > > > I'm observing regular slow checkpoints due to long mark duration, which is > not related to dirty pages number: > {noformat} > 2018-09-01 14:55:20.408 [INFO > ][db-checkpoint-thread-#241%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.p.GridCacheDatabaseSharedManager] > Checkpoint started [checkpointId=01e0c7bf-842f-4ed6-8589-b4904063434f, > startPtr=FileWALPointer [idx=19814, fileOff=948996096, len=5233457], > checkpointLockWait=0ms, checkpointLockHoldTime=951ms, > walCpRecordFsyncDuration=39ms, pages=78477, reason='timeout'] > 2018-09-01 14:55:21.307 [INFO > ][db-checkpoint-thread-#241%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.p.GridCacheDatabaseSharedManager] > Checkpoint finished [cpId=01e0c7bf-842f-4ed6-8589-b4904063434f, pages=78477, > markPos=FileWALPointer [idx=19814, fileOff=948996096, len=5233457], > walSegmentsCleared=0, walSegmentsCovered=[], *markDuration=1002m*s, > pagesWrite=478ms, fsync=421ms, total=1901ms] > {noformat} > {noformat} > 2018-09-01 14:58:20.355 [INFO > ][db-checkpoint-thread-#241%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.p.GridCacheDatabaseSharedManager] > Checkpoint started [checkpointId=09d1f4bc-d3f3-4a16-b291-89d7fa745ea5, > startPtr=FileWALPointer [idx=19814, fileOff=124208, len=5233457], > checkpointLockWait=0ms, checkpointLockHoldTime=926ms, > walCpRecordFsyncDuration=14ms, pages=10837, reason='timeout'] > 2018-09-01 14:58:20.480 [INFO > ][db-checkpoint-thread-#241%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.p.GridCacheDatabaseSharedManager] > Checkpoint finished [cpId=09d1f4bc-d3f3-4a16-b291-89d7fa745ea5, pages=10837, > markPos=FileWALPointer [idx=19814, fileOff=124208, len=5233457], > walSegmentsCleared=0, walSegmentsCovered=[], *markDuration=943ms*, > pagesWrite=64ms, fsync=61ms, total=1068ms] > {noformat} > Debugging has revealed what this is due to large amount of work required to > save metadata for metapages and free/reuse lists. Because this is done under > checkpoint write lock, all other activities are blocked, resulting in > increased tx and atomic ops latency. > Simple solution: parallelize metadata processing during mark phase. > Best way to solve the problem is described in IGNITE-9520. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (IGNITE-9612) Improve checkpoint mark phase speed.
[ https://issues.apache.org/jira/browse/IGNITE-9612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16625175#comment-16625175 ] ASF GitHub Bot commented on IGNITE-9612: GitHub user ascherbakoff opened a pull request: https://github.com/apache/ignite/pull/4813 IGNITE-9612 Improve checkpoint mark phase speed. You can merge this pull request into a Git repository by running: $ git pull https://github.com/gridgain/apache-ignite ignite-9612 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/ignite/pull/4813.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4813 commit 7e3c341ff4b99d3ca6f10283da539464d45a1c17 Author: ascherbakoff Date: 2018-09-23T16:34:09Z IGNITE-9612 wip. > Improve checkpoint mark phase speed. > > > Key: IGNITE-9612 > URL: https://issues.apache.org/jira/browse/IGNITE-9612 > Project: Ignite > Issue Type: Improvement >Reporter: Alexei Scherbakov >Assignee: Alexei Scherbakov >Priority: Major > Fix For: 2.7 > > > I'm observing regular slow checkpoints due to long mark duration, which is > not related to dirty pages number: > {noformat} > 2018-09-01 14:55:20.408 [INFO > ][db-checkpoint-thread-#241%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.p.GridCacheDatabaseSharedManager] > Checkpoint started [checkpointId=01e0c7bf-842f-4ed6-8589-b4904063434f, > startPtr=FileWALPointer [idx=19814, fileOff=948996096, len=5233457], > checkpointLockWait=0ms, checkpointLockHoldTime=951ms, > walCpRecordFsyncDuration=39ms, pages=78477, reason='timeout'] > 2018-09-01 14:55:21.307 [INFO > ][db-checkpoint-thread-#241%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.p.GridCacheDatabaseSharedManager] > Checkpoint finished [cpId=01e0c7bf-842f-4ed6-8589-b4904063434f, pages=78477, > markPos=FileWALPointer [idx=19814, fileOff=948996096, len=5233457], > walSegmentsCleared=0, walSegmentsCovered=[], *markDuration=1002m*s, > pagesWrite=478ms, fsync=421ms, total=1901ms] > {noformat} > {noformat} > 2018-09-01 14:58:20.355 [INFO > ][db-checkpoint-thread-#241%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.p.GridCacheDatabaseSharedManager] > Checkpoint started [checkpointId=09d1f4bc-d3f3-4a16-b291-89d7fa745ea5, > startPtr=FileWALPointer [idx=19814, fileOff=124208, len=5233457], > checkpointLockWait=0ms, checkpointLockHoldTime=926ms, > walCpRecordFsyncDuration=14ms, pages=10837, reason='timeout'] > 2018-09-01 14:58:20.480 [INFO > ][db-checkpoint-thread-#241%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.p.GridCacheDatabaseSharedManager] > Checkpoint finished [cpId=09d1f4bc-d3f3-4a16-b291-89d7fa745ea5, pages=10837, > markPos=FileWALPointer [idx=19814, fileOff=124208, len=5233457], > walSegmentsCleared=0, walSegmentsCovered=[], *markDuration=943ms*, > pagesWrite=64ms, fsync=61ms, total=1068ms] > {noformat} > Debugging has revealed what this is due to large amount of work required to > save metadata for metapages and free/reuse lists. Because this is done under > checkpoint write lock, all other activities are blocked, resulting in > increased tx and atomic ops latency. > Simple solution: parallelize metadata processing during mark phase. > Best way to solve the problem is described in IGNITE-9520. -- This message was sent by Atlassian JIRA (v7.6.3#76005)