[jira] [Commented] (IGNITE-3212) Servers get stuck with the warning "Failed to wait for initial partition map exchange" during failover test

2020-06-08 Thread Cong Guo (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-3212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17128626#comment-17128626
 ] 

Cong Guo commented on IGNITE-3212:
--

This issue still exists. Do you have any clue what might cause it?

We are using Ignite 2.7.0 and see this issue intermittently. We have three server 
nodes embedded in our application nodes. For example, with three nodes A, B, and C, 
when we restart A, the whole cluster gets stuck during the partition map exchange.

A logs:

"Failed to wait for initial partition map exchange."

and then repeats forever:

"Failed to wait for partition map exchange", followed by a dump of pending 
objects.

The coordinator is C. C logs:

"Unable to await partitions release latch within timeout: ServerLatch 
[permits=1, pendingAcks=[070d58f2-a1a8-4b5f-b986-06a63ac18fb2], 
super=CompletableLatch [id=exchange, topVer=AffinityTopologyVersion [topVer=87, 
minorTopVer=0]]]"

and "Failed to roll back transaction (cache may contain stale locks)"

The pending ACK should be from B (070d58f2-a1a8-4b5f-b986-06a63ac18fb2).

B logs:

"Failed to wait for partition release future" and dumps many pending 
transactions.

Our transaction timeout is 35 seconds, but the duration of many of these 
transactions is much longer than 35 seconds, and their state is MARKED_ROLLBACK 
with timeout=true.
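
For reference, a 35-second default transaction timeout can be configured roughly 
like this (a minimal sketch only, not our actual application code; the class name 
is made up):

{code:java}
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.configuration.TransactionConfiguration;
import org.apache.ignite.transactions.Transaction;

public class TxTimeoutExample {
    public static void main(String[] args) {
        // Default timeout for all transactions started on this node: 35 seconds.
        TransactionConfiguration txCfg = new TransactionConfiguration();
        txCfg.setDefaultTxTimeout(35_000);

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setTransactionConfiguration(txCfg);

        Ignite ignite = Ignition.start(cfg);

        // Transactions started without an explicit timeout inherit the default;
        // once it expires they should be marked for rollback (MARKED_ROLLBACK).
        try (Transaction tx = ignite.transactions().txStart()) {
            // cache operations ...
            tx.commit();
        }
    }
}
{code}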

> Servers get stuck with the warning "Failed to wait for initial partition map 
> exchange" during failover test
> --
>
> Key: IGNITE-3212
> URL: https://issues.apache.org/jira/browse/IGNITE-3212
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 1.6
>Reporter: Ksenia Rybakova
>Priority: Critical
> Fix For: 3.0, 2.9
>
>
> Servers being restarted during failover test get stuck after some time with 
> the warning "Failed to wait for initial partition map exchange". 
> {noformat}
> [08:44:41,303][INFO ][disco-event-worker-#80%null%][GridDiscoveryManager] 
> Added new node to topology: TcpDiscoveryNode 
> [id=db557f04-43b7-4e28-ae0d-d4dcf4139c89, addrs=
> [10.20.0.222, 127.0.0.1], sockAddrs=[fosters-222/10.20.0.222:47503, 
> /10.20.0.222:47503, /127.0.0.1:47503], discPort=47503, order=44, intOrder=32, 
> lastExchangeTime=1464
> 363880917, loc=false, ver=1.6.0#20160525-sha1:48321a40, isClient=false]
> [08:44:41,304][INFO ][disco-event-worker-#80%null%][GridDiscoveryManager] 
> Topology snapshot [ver=44, servers=19, clients=1, CPUs=64, heap=160.0GB]
> [08:45:11,455][INFO ][disco-event-worker-#80%null%][GridDiscoveryManager] 
> Added new node to topology: TcpDiscoveryNode 
> [id=6fae61a7-c1c1-40e5-8ad0-8bf5d6c86eb7, addrs=
> [10.20.0.223, 127.0.0.1], sockAddrs=[fosters-223/10.20.0.223:47503, 
> /10.20.0.223:47503, /127.0.0.1:47503], discPort=47503, order=45, intOrder=33, 
> lastExchangeTime=1464
> 363910999, loc=false, ver=1.6.0#20160525-sha1:48321a40, isClient=false]
> [08:45:11,455][INFO ][disco-event-worker-#80%null%][GridDiscoveryManager] 
> Topology snapshot [ver=45, servers=20, clients=1, CPUs=64, heap=170.0GB]
> [08:45:19,942][INFO ][ignite-update-notifier-timer][GridUpdateNotifier] 
> Update status is not available.
> [08:46:20,370][WARN ][main][GridCachePartitionExchangeManager] Failed to wait 
> for initial partition map exchange. Possible reasons are:
>   ^-- Transactions in deadlock.
>   ^-- Long running transactions (ignore if this is the case).
>   ^-- Unreleased explicit locks.
> [08:48:30,375][WARN ][main][GridCachePartitionExchangeManager] Still waiting 
> for initial partition map exchange ...
> {noformat}
> "Failed to wait for partition release future" warnings are on other nodes.
> {noformat}
> [08:09:45,822][WARN 
> ][exchange-worker-#82%null%][GridDhtPartitionsExchangeFuture] Failed to wait 
> for partition release future [topVer=AffinityTopologyVersion [topVer=29, 
> minorTopVer=0], node=cab5d0e0-7365-4774-8f99-d9f131c5d896]. Dumping pending 
> objects that might be the cause:
> [08:09:45,822][WARN 
> ][exchange-worker-#82%null%][GridCachePartitionExchangeManager] Ready 
> affinity version: AffinityTopologyVersion [topVer=28, minorTopVer=1]
> [08:09:45,826][WARN 
> ][exchange-worker-#82%null%][GridCachePartitionExchangeManager] Last exchange 
> future: GridDhtPartitionsExchangeFuture ...
> {noformat}
> Load config:
> - 1 client, 20 servers (5 servers per 1 host)
> - warmup 60
> - duration 66h
> - preload 5M
> - key range 10M
> - operations: PUT PUT_ALL GET GET_ALL INVOKE INVOKE_ALL REMOVE REMOVE_ALL 
> PUT_IF_ABSENT REPLACE
> - backups count 3
> - 3 servers restart every 15 min with 30 sec step, pause between stop and 
> start 5min



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (IGNITE-10959) Memory leaks in continuous query handlers

2020-05-28 Thread Cong Guo (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-10959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17118803#comment-17118803
 ] 

Cong Guo commented on IGNITE-10959:
---

Hi Maxim,

In my case, the fix under IGNITE-11970 does not solve the problem. The entries 
are kept in the pending buffer because the current batch does not move on. This is 
not just wasted memory as in IGNITE-11970; it is a memory leak, because there is 
no limit on the size of the pending buffer.

 

According to another comment above, this issue is related to the write 
synchronization mode and the number of backups. I have not tested all the cases, 
but you can easily reproduce it with three Ignite nodes and a TRANSACTIONAL, 
PARTITIONED cache with 2 backups (a minimal sketch of the setup follows the steps 
below).

 
 # Start a 3-node Ignite cluster, where each node runs a continuous query that 
watches for updates in the cache.
 # Send updates to all three nodes continuously. Simple puts are enough.
 # Restart one Ignite node; a memory leak can then be seen on the two other nodes.
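
The sketch below illustrates steps 1 and 2 (illustrative only; the cache name, key 
types, and listener are made up, and our actual application code differs):

{code:java}
import javax.cache.event.CacheEntryEvent;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CacheAtomicityMode;
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.cache.query.ContinuousQuery;
import org.apache.ignite.cache.query.QueryCursor;
import org.apache.ignite.configuration.CacheConfiguration;

public class ContinuousQueryLeakRepro {
    public static void main(String[] args) {
        // Step 1: run this on each of the three nodes; they join one cluster.
        Ignite ignite = Ignition.start();

        // Cache matching the reported case: TRANSACTIONAL + PARTITIONED + 2 backups.
        CacheConfiguration<Integer, String> cacheCfg =
            new CacheConfiguration<>("testCache");
        cacheCfg.setAtomicityMode(CacheAtomicityMode.TRANSACTIONAL);
        cacheCfg.setCacheMode(CacheMode.PARTITIONED);
        cacheCfg.setBackups(2);

        IgniteCache<Integer, String> cache = ignite.getOrCreateCache(cacheCfg);

        // Each node registers its own continuous query watching cache updates.
        ContinuousQuery<Integer, String> qry = new ContinuousQuery<>();
        qry.setLocalListener(events -> {
            for (CacheEntryEvent<? extends Integer, ? extends String> e : events)
                System.out.println("Updated: " + e.getKey());
        });

        // Cursor is intentionally kept open for the lifetime of the node.
        QueryCursor<?> cursor = cache.query(qry);

        // Step 2: keep the cluster busy with simple puts.
        // Step 3 is manual: restart one node and watch heap usage on the others.
        for (int i = 0; ; i++)
            cache.put(i % 10_000, "value-" + i);
    }
}
{code}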

 

I have attached a figure (referencepath.PNG) showing the accumulation point and 
the reference paths observed in my experiment.

 

!referencepath.PNG!  

> Memory leaks in continuous query handlers
> -
>
> Key: IGNITE-10959
> URL: https://issues.apache.org/jira/browse/IGNITE-10959
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 2.7
>Reporter: Denis Mekhanikov
>Assignee: Maxim Muzafarov
>Priority: Critical
> Attachments: CacheContinuousQueryMemoryUsageTest.java, 
> CacheContinuousQueryMemoryUsageTest.result, 
> CacheContinuousQueryMemoryUsageTest2.java, 
> Memory_blowup_in_Ignite_CacheContinuousQueryHandler.txt, 
> Memory_blowup_in_Ignite_CacheContinuousQueryHandler.txt, 
> Memory_blowup_in_Ignite_CacheContinuousQueryHandler.txt, 
> continuousquery_leak_profile.png, referencepath.PNG
>
>
> Continuous query handlers don't clear internal data structures after cache 
> events are processed.
> A test, that reproduces the problem, is attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (IGNITE-10959) Memory leaks in continuous query handlers

2020-05-28 Thread Cong Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-10959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cong Guo updated IGNITE-10959:
--
Attachment: referencepath.PNG

> Memory leaks in continuous query handlers
> -
>
> Key: IGNITE-10959
> URL: https://issues.apache.org/jira/browse/IGNITE-10959
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 2.7
>Reporter: Denis Mekhanikov
>Assignee: Maxim Muzafarov
>Priority: Critical
> Attachments: CacheContinuousQueryMemoryUsageTest.java, 
> CacheContinuousQueryMemoryUsageTest.result, 
> CacheContinuousQueryMemoryUsageTest2.java, 
> Memory_blowup_in_Ignite_CacheContinuousQueryHandler.txt, 
> Memory_blowup_in_Ignite_CacheContinuousQueryHandler.txt, 
> Memory_blowup_in_Ignite_CacheContinuousQueryHandler.txt, 
> continuousquery_leak_profile.png, referencepath.PNG
>
>
> Continuous query handlers don't clear internal data structures after cache 
> events are processed.
> A test, that reproduces the problem, is attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (IGNITE-10959) Memory leaks in continuous query handlers

2020-05-21 Thread Cong Guo (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-10959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17113213#comment-17113213
 ] 

Cong Guo commented on IGNITE-10959:
---

Hi,

Is this issue still open? Is there any patch that fixes it? I can reproduce the 
issue by restarting one node in a cluster.
 # Start a 3-node Ignite cluster, where each node runs a continuous query that 
watches for updates in the cache.
 # Send updates to all three nodes continuously.
 # Restart one Ignite node; a memory leak can then be seen on the two other nodes.

 

Based on the heap dump, the memory issue is in the entryBufs field of 
CacheContinuousQueryHandler. I already close the continuous query when I stop the 
Ignite node (see the sketch below). However, the pending buffer on the two other 
nodes still grows when the node starts again. Is there any way to discard the 
pending buffer when a node (the master node of a continuous query) leaves the 
cluster?
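
The shutdown sequence on each node is roughly the following (a minimal sketch, not 
the actual application code; the class and method names are made up, and the cursor 
is the one returned when the continuous query was registered):

{code:java}
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.query.QueryCursor;

public class NodeShutdown {
    /** cursor is the QueryCursor returned by cache.query(continuousQuery). */
    static void shutdown(QueryCursor<?> cursor) {
        cursor.close();        // unregister the continuous query on the local node
        Ignition.stop(false);  // stop the local node without cancelling ongoing jobs
    }
}
{code}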

 

!ObjectRef.PNG!


> Memory leaks in continuous query handlers
> -
>
> Key: IGNITE-10959
> URL: https://issues.apache.org/jira/browse/IGNITE-10959
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 2.7
>Reporter: Denis Mekhanikov
>Priority: Major
> Fix For: 2.9
>
> Attachments: CacheContinuousQueryMemoryUsageTest.java, 
> CacheContinuousQueryMemoryUsageTest.result, 
> CacheContinuousQueryMemoryUsageTest2.java, continuousquery_leak_profile.png
>
>
> Continuous query handlers don't clear internal data structures after cache 
> events are processed.
> A test, that reproduces the problem, is attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)