[jira] [Comment Edited] (IGNITE-19904) Assertion in defragmentation

2023-08-01 Thread Vladimir Steshin (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-19904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17749563#comment-17749563
 ] 

Vladimir Steshin edited comment on IGNITE-19904 at 8/1/23 11:43 AM:


Caused by the concurrent default checkpointer, which clears the shared counters via
{code:java}
CheckpointProgress#clearCounters()
{code}
and raises a hidden NPE in
{code:java}
@Override public void CheckpointProgressImpl#updateEvictedPages(int delta) {
    A.ensure(delta > 0, "param must be positive");

    // The counter is read twice: it can become null between the check and the call.
    if (evictedPagesCounter() != null)
        evictedPagesCounter().addAndGet(delta);
}
{code}
while flushing a replaced page in `PageMemoryImpl#allocatePage(int grpId, int 
partId, byte flags)`. See IGNITE-20047 and 'failure_with_root_npe_cause.log'.
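
To illustrate the check-then-act race described above, here is a minimal, self-contained sketch. It is not Ignite code: the class `EvictedPagesCounterSketch`, its field and its method names are hypothetical stand-ins, and reading the counter once into a local variable is only one possible way to avoid the NPE, not necessarily the fix applied in IGNITE-20047.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a checkpoint progress object; not Ignite's actual class. */
public class EvictedPagesCounterSketch {
    /** Shared counter that another thread may reset to null at any moment. */
    private volatile AtomicInteger evictedPagesCntr = new AtomicInteger();

    /** Mimics a concurrent checkpointer clearing the shared counters. */
    public void clearCounters() {
        evictedPagesCntr = null;
    }

    /** Racy variant: the field is read twice, so the second read may observe null. */
    public void updateEvictedPagesRacy(int delta) {
        if (evictedPagesCntr != null)
            evictedPagesCntr.addAndGet(delta); // NPE if clearCounters() ran after the check.
    }

    /**
     * Safer variant: the field is read once into a local variable.
     *
     * @return true if the counter was updated, false if it had already been cleared.
     */
    public boolean updateEvictedPagesSafe(int delta) {
        if (delta <= 0)
            throw new IllegalArgumentException("param must be positive");

        AtomicInteger cntr = evictedPagesCntr; // Single read of the shared field.

        if (cntr == null)
            return false;

        cntr.addAndGet(delta);

        return true;
    }

    /** @return current counter value, or -1 if the counters were cleared. */
    public int evictedPages() {
        AtomicInteger cntr = evictedPagesCntr;

        return cntr == null ? -1 : cntr.get();
    }
}
```

With the single read, a concurrent `clearCounters()` can at worst make the update a no-op, since the local reference can no longer turn null between the check and the call.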




was (Author: vladsz83):
Caused by the concurrent default checkpointer, which clears the shared counters via
{code:java}
CheckpointProgress#clearCounters()
{code}
and raises a hidden NPE in
{code:java}
@Override public void CheckpointProgressImpl#updateEvictedPages(int delta) {
    A.ensure(delta > 0, "param must be positive");

    if (evictedPagesCounter() != null)
        evictedPagesCounter().addAndGet(delta);
}
{code}
while flushing a replaced page in `PageMemoryImpl#allocatePage(int grpId, int 
partId, byte flags)`. See IGNITE-20047.



> Assertion in defragmentation
> 
>
> Key: IGNITE-19904
> URL: https://issues.apache.org/jira/browse/IGNITE-19904
> Project: Ignite
> Issue Type: Bug
> Affects Versions: 2.12
> Reporter: Vladimir Steshin
> Priority: Major
> Labels: ise
> Attachments: default-config.xml, failure2.16_with_thread_dump.log, 
> failure_with_root_npe_cause.log, ignite.log, jvm.opts
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Defragmentation fails with:
> {code:java}
> java.lang.AssertionError: Invalid state. Type is 0! pageId = 0001000d00024cbf
>   at 
> org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.copyPageForCheckpoint(PageMemoryImpl.java:1359)
>  ~[ignite-core-2.16.0-SNAPSHOT.jar:2.16.0-SNAPSHOT]
>   at 
> org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.checkpointWritePage(PageMemoryImpl.java:1277)
>  ~[ignite-core-2.16.0-SNAPSHOT.jar:2.16.0-SNAPSHOT]
>   at 
> org.apache.ignite.internal.processors.cache.persistence.checkpoint.CheckpointPagesWriter.writePages(CheckpointPagesWriter.java:208)
>  ~[ignite-core-2.16.0-SNAPSHOT.jar:2.16.0-SNAPSHOT]
>   at 
> org.apache.ignite.internal.processors.cache.persistence.checkpoint.CheckpointPagesWriter.run(CheckpointPagesWriter.java:150)
>  ~[ignite-core-2.16.0-SNAPSHOT.jar:2.16.0-SNAPSHOT]
> {code}
> Difficult to write a test. Can't reproduce on my computers :(. It appears 
> flakily on a server (4 core x 4 cpu) with 100G of test cache data and 
> a million+ pages to checkpoint during defragmentation. It occurs more often 
> with pageSize 1024 (to produce more pages).
> According to my diagnostic build, I suppose that a fresh, empty page is caught 
> in defragmentation. Here is a page dump with a test-extended PAGE_OVERHEAD 
> (=64) and the same error a bit before copyPageForCheckpoint():
> {code:java}
> org.apache.ignite.IgniteException: Wrong page type in checkpointWritePage1. 
> Page: Data region = 'defragPartitionsDataRegion'.
>  FullPageId [pageId=281878703760205, effectivePageId=403727049549, 
> grpId=-1368047378].
>  PageDump = page_id: 281878703760205, rel_id: 48603, cache_id: -1368047378, 
> pin: 0, lock: 65536, tmp_buf: 72057594037927935, test_val: 1. data_hex: 
> 

[jira] [Comment Edited] (IGNITE-19904) Assertion in defragmentation

2023-08-01 Thread Vladimir Steshin (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-19904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17749563#comment-17749563
 ] 

Vladimir Steshin edited comment on IGNITE-19904 at 8/1/23 8:36 AM:
---

Caused by the concurrent default checkpointer, which clears the shared counters via
{code:java}
CheckpointProgress#clearCounters()
{code}
and raises a hidden NPE in
{code:java}
@Override public void CheckpointProgressImpl#updateEvictedPages(int delta) {
    A.ensure(delta > 0, "param must be positive");

    if (evictedPagesCounter() != null)
        evictedPagesCounter().addAndGet(delta);
}
{code}
while flushing a replaced page in `PageMemoryImpl#allocatePage(int grpId, int 
partId, byte flags)`. See IGNITE-20047.




was (Author: vladsz83):
Caused by the concurrent default checkpointer, which clears the shared counters via
{code:java}
CheckpointProgress#clearCounters()
{code}
and raises a hidden NPE in `evictedPagesCounter()`:
{code:java}
@Override public void CheckpointProgressImpl#updateEvictedPages(int delta) {
    A.ensure(delta > 0, "param must be positive");

    if (evictedPagesCounter() != null)
        evictedPagesCounter().addAndGet(delta);
}
{code}
while flushing a replaced page in `PageMemoryImpl#allocatePage(int grpId, int partId, 
byte flags)`



> Assertion in defragmentation
> 
>
> Key: IGNITE-19904
> URL: https://issues.apache.org/jira/browse/IGNITE-19904
> Project: Ignite
> Issue Type: Bug
> Affects Versions: 2.12
> Reporter: Vladimir Steshin
> Priority: Major
> Labels: ise
> Attachments: default-config.xml, failure2.16_with_thread_dump.log, 
> ignite.log, jvm.opts
>
>
> Defragmentation fails with:
> {code:java}
> java.lang.AssertionError: Invalid state. Type is 0! pageId = 0001000d00024cbf
>   at 
> org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.copyPageForCheckpoint(PageMemoryImpl.java:1359)
>  ~[ignite-core-2.16.0-SNAPSHOT.jar:2.16.0-SNAPSHOT]
>   at 
> org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.checkpointWritePage(PageMemoryImpl.java:1277)
>  ~[ignite-core-2.16.0-SNAPSHOT.jar:2.16.0-SNAPSHOT]
>   at 
> org.apache.ignite.internal.processors.cache.persistence.checkpoint.CheckpointPagesWriter.writePages(CheckpointPagesWriter.java:208)
>  ~[ignite-core-2.16.0-SNAPSHOT.jar:2.16.0-SNAPSHOT]
>   at 
> org.apache.ignite.internal.processors.cache.persistence.checkpoint.CheckpointPagesWriter.run(CheckpointPagesWriter.java:150)
>  ~[ignite-core-2.16.0-SNAPSHOT.jar:2.16.0-SNAPSHOT]
> {code}
> Difficult to write a test. Can't reproduce on my computers :(. It appears 
> flakily on a server (4 core x 4 cpu) with 100G of test cache data and 
> a million+ pages to checkpoint during defragmentation. It occurs more often 
> with pageSize 1024 (to produce more pages).
> According to my diagnostic build, I suppose that a fresh, empty page is caught 
> in defragmentation. Here is a page dump with a test-extended PAGE_OVERHEAD 
> (=64) and the same error a bit before copyPageForCheckpoint():
> {code:java}
> org.apache.ignite.IgniteException: Wrong page type in checkpointWritePage1. 
> Page: Data region = 'defragPartitionsDataRegion'.
>  FullPageId [pageId=281878703760205, effectivePageId=403727049549, 
> grpId=-1368047378].
>  PageDump = page_id: 281878703760205, rel_id: 48603, cache_id: -1368047378, 
> pin: 0, lock: 65536, tmp_buf: 72057594037927935, test_val: 1. data_hex: 
>