[jira] [Commented] (IGNITE-10959) Memory leaks in continuous query handlers
[ https://issues.apache.org/jira/browse/IGNITE-10959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17114194#comment-17114194 ]

Zane Hu commented on IGNITE-10959:
----------------------------------

As I commented before, there are two confirmed cases of memory blowup in Ignite, both caused by too many cache-update events accumulating in pending buffers because an update event older than the buffered ones has not arrived yet.

# One is the per-partition TreeMap CacheContinuousQueryPartitionRecovery.pendingEvts in CacheContinuousQueryHandler.rcvs. It has an upper-bound safeguard of MAX_BUFF_SIZE.
# The other is the per-partition ConcurrentSkipListMap CacheContinuousQueryEventBuffer.pending. It has no upper-bound safeguard at all.

Attached is pseudo code of the main logic flow showing how these buffers are processed in Ignite; hopefully it helps people fix the problem. Both paths start in CacheContinuousQueryHandler.CacheContinuousQueryListener.onEntryUpdated() when a cache entry is updated.

[^Memory_blowup_in_Ignite_CacheContinuousQueryHandler.txt]

> Memory leaks in continuous query handlers
> -----------------------------------------
>
>                 Key: IGNITE-10959
>                 URL: https://issues.apache.org/jira/browse/IGNITE-10959
>             Project: Ignite
>          Issue Type: Bug
>    Affects Versions: 2.7
>            Reporter: Denis Mekhanikov
>            Assignee: Maxim Muzafarov
>            Priority: Critical
>         Attachments: CacheContinuousQueryMemoryUsageTest.java,
>                      CacheContinuousQueryMemoryUsageTest.result,
>                      CacheContinuousQueryMemoryUsageTest2.java,
>                      Memory_blowup_in_Ignite_CacheContinuousQueryHandler.txt,
>                      continuousquery_leak_profile.png
>
> Continuous query handlers don't clear internal data structures after cache events are processed.
> A test that reproduces the problem is attached.
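Both cases share the same ordering mechanism. For readers who don't open the attached pseudo code, here is a minimal self-contained sketch of that mechanism (not Ignite code; the class, field, and method names are illustrative only): an update is delivered only when its counter is the next expected one, otherwise it is parked in a per-partition pending map, and if the missing earlier counter never arrives the map just keeps growing.

{code:java}
import java.util.Map;
import java.util.TreeMap;

/** Illustrative sketch of counter-ordered event buffering; not actual Ignite code. */
class PendingEventSketch {
    /** Events parked until the missing earlier counter shows up, keyed by update counter. */
    private final TreeMap<Long, Object> pending = new TreeMap<>();

    /** Next update counter expected to be delivered in order. */
    private long expectedCntr = 1;

    void onEntryUpdated(long cntr, Object evt) {
        if (cntr != expectedCntr) {
            // Out of order: park it. If expectedCntr never arrives, this map only grows.
            pending.put(cntr, evt);

            return;
        }

        deliver(cntr, evt);
        expectedCntr++;

        // Drain any consecutive parked events that were waiting for this one.
        while (!pending.isEmpty() && pending.firstKey() == expectedCntr) {
            Map.Entry<Long, Object> e = pending.pollFirstEntry();

            deliver(e.getKey(), e.getValue());
            expectedCntr++;
        }
    }

    private void deliver(long cntr, Object evt) {
        System.out.println("deliver #" + cntr + ": " + evt);
    }
}
{code}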
[jira] [Updated] (IGNITE-10959) Memory leaks in continuous query handlers
[ https://issues.apache.org/jira/browse/IGNITE-10959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zane Hu updated IGNITE-10959:
-----------------------------
    Attachment: Memory_blowup_in_Ignite_CacheContinuousQueryHandler.txt
[jira] [Commented] (IGNITE-10959) Memory leaks in continuous query handlers
[ https://issues.apache.org/jira/browse/IGNITE-10959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16973903#comment-16973903 ]

Zane Hu commented on IGNITE-10959:
----------------------------------

In addition to adding an upper-bound limit that flushes and removes 10% of the events from CacheContinuousQueryEventBuffer.pending when the limit is reached, it would be nice to somehow inform the application that at least one event older than the flushed 10% of pending events was dropped because it did not arrive in time. That way the application gets a chance to handle the exception afterwards, for example by doing a full scan of the partition to which the dropped event belongs, if possible. We would like such exception handling for both CacheContinuousQueryPartitionRecovery.pendingEvts and CacheContinuousQueryEventBuffer.pending.
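As a rough illustration of the application-side recovery mentioned above (assuming the application were told which partition lost events; no such notification exists in Ignite today), a full rescan of that partition could use the public ScanQuery API with setPartition. The cache name, key/value types, and the reapply() helper below are made up for the example:

{code:java}
import javax.cache.Cache;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.cache.query.QueryCursor;
import org.apache.ignite.cache.query.ScanQuery;

public class DroppedEventRecovery {
    /**
     * Re-read every entry of the partition whose events were dropped, so the
     * application can rebuild its derived state for that partition.
     */
    static void rescanPartition(Ignite ignite, String cacheName, int part) {
        IgniteCache<Integer, String> cache = ignite.cache(cacheName);

        ScanQuery<Integer, String> qry = new ScanQuery<Integer, String>().setPartition(part);

        try (QueryCursor<Cache.Entry<Integer, String>> cur = cache.query(qry)) {
            for (Cache.Entry<Integer, String> e : cur)
                reapply(e.getKey(), e.getValue()); // Application-specific reconciliation.
        }
    }

    private static void reapply(Integer key, String val) {
        // Placeholder for application logic (hypothetical helper).
    }
}
{code}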
[jira] [Commented] (IGNITE-10959) Memory leaks in continuous query handlers
[ https://issues.apache.org/jira/browse/IGNITE-10959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16967866#comment-16967866 ]

Zane Hu commented on IGNITE-10959:
----------------------------------

We have observed two cases of huge memory usage in Ignite continuous queries, both caused by too many pending cache-update events accumulating because an event earlier than the pending ones has not arrived yet. BTW, we use Ignite 2.7.0.

* One is CacheContinuousQueryHandler.rcvs growing to 7.7 GB retained heap, seen in jmap/Memory Analyzer. We also saw "Pending events reached max of buffer size" in the Ignite log file. According to [https://github.com/apache/ignite/blob/ignite-2.7/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/query/continuous/CacheContinuousQueryPartitionRecovery.java#L196], this happens when the size of CacheContinuousQueryPartitionRecovery.pendingEvts reaches MAX_BUFF_SIZE (default 10,000). Ignite then flushes and removes 10% of the entries in pendingEvts, even though some earlier events that never arrived are dropped without notifying the listener. This MAX_BUFF_SIZE upper bound at least prevents the memory from growing further towards OOM.

* The other is CacheContinuousQueryEventBuffer.pending growing to 22 GB retained heap, seen in jmap/Memory Analyzer. According to [https://github.com/apache/ignite/blob/ignite-2.7/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/query/continuous/CacheContinuousQueryEventBuffer.java#L168], cache-update events are processed in batches of CacheContinuousQueryEventBuffer.Batch.entries[BUF_SIZE] (default BUF_SIZE is 1,000). If an event entry falls within the current batch (e.updateCounter() <= batch.endCntr), it is processed by batch.processEntry0(); otherwise it is put into CacheContinuousQueryEventBuffer.pending. However, according to [https://github.com/apache/ignite/blob/ignite-2.7/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/query/continuous/CacheContinuousQueryEventBuffer.java#L425], if any event within the current batch has not arrived, Ignite will not move on to the next batch and process the entries parked in CacheContinuousQueryEventBuffer.pending. Unlike the handling of CacheContinuousQueryPartitionRecovery.pendingEvts with MAX_BUFF_SIZE, there is NO upper-bound limit on CacheContinuousQueryEventBuffer.pending. So if an event earlier than the events in CacheContinuousQueryEventBuffer.pending never arrives for some reason (a high rate of events, high concurrency, timeouts, ...), CacheContinuousQueryEventBuffer.pending will grow until OOM.

To prevent this, I think Ignite at least needs an upper-bound limit and some processing here that flushes and removes 10% of the events from CacheContinuousQueryEventBuffer.pending, similar to what is done for CacheContinuousQueryPartitionRecovery.pendingEvts. In terms of exception handling, dropping some events is better than OOM.
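To make that proposal concrete, here is a minimal sketch of what an upper bound with a 10% flush on the pending map could look like. It is illustrative only, not a patch against CacheContinuousQueryEventBuffer; the class and method names are invented, and the 10,000 / 10% figures are simply borrowed from the MAX_BUFF_SIZE handling described above.

{code:java}
import java.util.Map;
import java.util.NavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

/** Illustrative sketch of a bounded pending buffer; not actual Ignite code. */
class BoundedPendingSketch {
    /** Hypothetical cap, mirroring MAX_BUFF_SIZE. */
    private static final int MAX_PENDING = 10_000;

    /** Out-of-order events keyed by update counter. */
    private final NavigableMap<Long, Object> pending = new ConcurrentSkipListMap<>();

    void park(long updateCntr, Object evt) {
        pending.put(updateCntr, evt);

        if (pending.size() >= MAX_PENDING) {
            int toFlush = MAX_PENDING / 10;

            // Force out the oldest 10% even though an earlier counter is still missing,
            // trading strict ordering for bounded memory.
            for (int i = 0; i < toFlush; i++) {
                Map.Entry<Long, Object> oldest = pending.pollFirstEntry();

                if (oldest == null)
                    break;

                deliverOutOfOrder(oldest.getKey(), oldest.getValue());
            }
        }
    }

    private void deliverOutOfOrder(long cntr, Object evt) {
        // In a real fix this would also tell the listener that ordering was broken.
    }
}
{code}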
[jira] [Commented] (IGNITE-10959) Memory leaks in continuous query handlers
[ https://issues.apache.org/jira/browse/IGNITE-10959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16958202#comment-16958202 ]

Zane Hu commented on IGNITE-10959:
----------------------------------

The test result in [^CacheContinuousQueryMemoryUsageTest.result] was produced by running the following program with Ignite 2.7.0: [^CacheContinuousQueryMemoryUsageTest2.java]
[jira] [Updated] (IGNITE-10959) Memory leaks in continuous query handlers
[ https://issues.apache.org/jira/browse/IGNITE-10959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zane Hu updated IGNITE-10959:
-----------------------------
    Attachment: CacheContinuousQueryMemoryUsageTest2.java
[jira] [Commented] (IGNITE-10959) Memory leaks in continuous query handlers
[ https://issues.apache.org/jira/browse/IGNITE-10959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16957176#comment-16957176 ]

Zane Hu commented on IGNITE-10959:
----------------------------------

Here is the test result: [^CacheContinuousQueryMemoryUsageTest.result]
[jira] [Updated] (IGNITE-10959) Memory leaks in continuous query handlers
[ https://issues.apache.org/jira/browse/IGNITE-10959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zane Hu updated IGNITE-10959:
-----------------------------
    Attachment: CacheContinuousQueryMemoryUsageTest.result
[jira] [Commented] (IGNITE-10959) Memory leaks in continuous query handlers
[ https://issues.apache.org/jira/browse/IGNITE-10959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955345#comment-16955345 ]

Zane Hu commented on IGNITE-10959:
----------------------------------

An error in the TransactionalPartitionedTwoBackupFullSync case shows up in the following log, which we got from a slightly modified CacheContinuousQueryMemoryUsageTest.java. We don't see this error for the TransactionalReplicatedTwoBackupFullSync or TransactionalPartitionedOneBackupFullSync cases.

{quote}
[ERROR] CacheContinuousQueryMemoryUsageTest>GridAbstractTest.access$000:143->GridAbstractTest.runTestInternal:2177->testTransactionalPartitionedTwoBackupFullSync:235->testContinuousQuery:355->assertEntriesReleased:423->assertEntriesReleased:435->checkEntryBuffers:466 Backup queue is not empty. Node: continuous.CacheContinuousQueryMemoryUsageTest0; cache: test-cache. expected:<0> but was:<1>
{quote}

Looking at the Ignite code, here is the relevant snippet of onEntryUpdated() in CacheContinuousQueryHandler.java:

{code:java}
if (primary || skipPrimaryCheck)
    // TransactionalReplicatedTwoBackupFullSync goes here:
    // notify the query client without putting evt.entry() into backupQ.
    onEntryUpdate(evt, notify, loc, recordIgniteEvt);
else
    // A backup node of TransactionalPartitionedTwoBackupFullSync goes here:
    // this will put evt.entry() into backupQ.
    handleBackupEntry(cctx, evt.entry());
{code}

After notifying the query client, there seems to be an ack message, CacheContinuousQueryBatchAck, sent to the CQ server side on backup nodes to clean up the entries in backupQ. There is even a periodic BackupCleaner task every 5 seconds that cleans up backupQ. The actual cleanup code is below:

{code:java}
/**
 * @param updateCntr Acknowledged counter.
 */
void cleanupBackupQueue(Long updateCntr) {
    Iterator<CacheContinuousQueryEntry> it = backupQ.iterator();

    while (it.hasNext()) {
        CacheContinuousQueryEntry backupEntry = it.next();

        // Remove backupEntry if its updateCounter <= acknowledged updateCntr.
        if (backupEntry.updateCounter() <= updateCntr)
            it.remove();
    }
}
{code}

So some questions are:
# Why is a backupEntry still left over in backupQ after all of this?
# Is it possible that the updateCounter and the acknowledged updateCntr are mis-calculated?
# Is it possible that the ack message is sent to only one of the two backup nodes?

The load was 1000 updates across 3 nodes in a stable network, so a message shouldn't have been dropped along the way. Please help look into this further, especially Ignite experts and developers. Thanks!
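One way the observed leftover entry can happen, consistent with the cleanup code above: if the last acknowledged counter never reaches an entry's updateCounter (for example, the ack only covered earlier updates, or never arrived on that backup node), cleanupBackupQueue skips that entry every time it runs. A tiny self-contained sketch of that effect, using simplified stand-ins rather than the real Ignite classes:

{code:java}
import java.util.Iterator;
import java.util.concurrent.ConcurrentLinkedDeque;

/** Simplified stand-in for the backup queue cleanup; not actual Ignite code. */
class BackupQueueSketch {
    static final class BackupEntry {
        final long updateCntr;

        BackupEntry(long updateCntr) {
            this.updateCntr = updateCntr;
        }
    }

    private final ConcurrentLinkedDeque<BackupEntry> backupQ = new ConcurrentLinkedDeque<>();

    /** Same predicate as cleanupBackupQueue(): only acknowledged counters are removed. */
    void cleanupBackupQueue(long ackedCntr) {
        for (Iterator<BackupEntry> it = backupQ.iterator(); it.hasNext(); ) {
            if (it.next().updateCntr <= ackedCntr)
                it.remove();
        }
    }

    public static void main(String[] args) {
        BackupQueueSketch q = new BackupQueueSketch();

        q.backupQ.add(new BackupEntry(999));
        q.backupQ.add(new BackupEntry(1000));

        q.cleanupBackupQueue(999);            // Ack covers counter 999 only.

        System.out.println(q.backupQ.size()); // Prints 1: entry 1000 is left behind.
    }
}
{code}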
[jira] [Commented] (IGNITE-10959) Memory leaks in continuous query handlers
[ https://issues.apache.org/jira/browse/IGNITE-10959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954721#comment-16954721 ]

Zane Hu commented on IGNITE-10959:
----------------------------------

We hit this issue too. Is it possible to have a quick-fix patch soon? Thanks!