[jira] [Updated] (NIFI-5225) Leak in RingBufferEventRepository for frequently updated flows

2018-05-23 Thread Mark Payne (JIRA)

 [ 
https://issues.apache.org/jira/browse/NIFI-5225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Payne updated NIFI-5225:
-----------------------------
   Resolution: Fixed
Fix Version/s: 1.7.0
   Status: Resolved  (was: Patch Available)

> Leak in RingBufferEventRepository for frequently updated flows
> --------------------------------------------------------------
>
> Key: NIFI-5225
> URL: https://issues.apache.org/jira/browse/NIFI-5225
> Project: Apache NiFi
>  Issue Type: Bug
>  Components: Core Framework
> Environment: HDF-3.1.0.0
>Reporter: Frederik Petersen
>Priority: Major
>  Labels: performance
> Fix For: 1.7.0
>
>
> We use NiFi's API to change a part of our flow quite frequently. Over the 
> past weeks we have noticed that the performance of web requests degrades over 
> time, and we had a very hard time finding out why.
> Today I took a closer look. When using VisualVM to sample CPU usage, it 
> already stood out that the longer the cluster had been running, the more time 
> was spent in 'SecondPrecisionEventContainer.generateReport()' during web 
> requests. This method is already relied on heavily right after starting the 
> cluster (for big flows and process groups), but the time spent in it increases 
> (in our setup) the longer the cluster runs. This increases the latency of 
> almost every web request. Our flow reconfiguration script (calling many NiFi 
> API endpoints) went from 2 minutes to 20 minutes of run time within a few days.
>  Looking at the source code, I couldn't quite figure out why the run time 
> should increase over time, because the ring buffers always stay the same size 
> (301 entries / 5 minutes).
> When sampling memory I noticed quite a lot of EventSum instances, more than 
> there should have been. So I took a heap dump and ran it through the Memory 
> Analyzer tool. The "Leak Suspects" overview gave me the final hint as to what 
> was wrong. It reported:
> One instance of "java.util.concurrent.ConcurrentHashMap$Node[]" loaded by 
> "<system class loader>" occupies 5,649,926,328 (55.74%) bytes. The instance 
> is referenced by 
> org.apache.nifi.controller.repository.metrics.RingBufferEventRepository @ 
> 0x7f86c50cda40, loaded by "org.apache.nifi.nar.NarClassLoader @ 
> 0x7f86a000". The memory is accumulated in one instance of 
> "java.util.concurrent.ConcurrentHashMap$Node[]" loaded by "<system class loader>".
> The issue is:
> When we remove processors, connections, or process groups from the flow, their 
> data is not removed from the ConcurrentHashMap in RingBufferEventRepository. 
> There is a 'purgeTransferEvents' method, but it only calls an empty 
> 'purgeEvents' method on all 'SecondPrecisionEventContainer's in the map.
> This means that the map grows without bounds, and every time 
> 'reportTransferEvents' is called it iterates over all (meaning more and more 
> over time) entries of the map. This increases the latency of every web request 
> and also occupies a huge amount of memory.
> A rough idea to fix this:
> Remove the entry for each removed component (processor, process group, 
> connection, ...) using their onRemoved methods in the FlowController.
> This should stop the map from growing indefinitely for any flow where 
> components are removed frequently, especially when automated.
> Since this is quite urgent for us, I'll try to work on a fix for this and 
> provide a pull request if successful.
> Since no one noticed this before, I guess we are not the typical NiFi user: 
> we thought it was possible to heavily reconfigure flows using the API, but 
> with this performance issue, it's not.
> Please let me know if I can provide any more helpful detail about this problem.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (NIFI-5225) Leak in RingBufferEventRepository for frequently updated flows

2018-05-22 Thread Matt Burgess (JIRA)

 [ 
https://issues.apache.org/jira/browse/NIFI-5225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Burgess updated NIFI-5225:
-------------------------------
Affects Version/s: (was: 1.6.0)
   (was: 1.5.0)
   Status: Patch Available  (was: Open)

> Leak in RingBufferEventRepository for frequently updated flows
> --------------------------------------------------------------
>
> Key: NIFI-5225
> URL: https://issues.apache.org/jira/browse/NIFI-5225
> Project: Apache NiFi
>  Issue Type: Bug
>  Components: Core Framework
> Environment: HDF-3.1.0.0
>Reporter: Frederik Petersen
>Priority: Major
>  Labels: performance
>
> We use NiFi's API to change a part of our flow quite frequently. Over the 
> past weeks we have noticed that the performance of web requests degrades over 
> time, and we had a very hard time finding out why.
> Today I took a closer look. When using VisualVM to sample CPU usage, it 
> already stood out that the longer the cluster had been running, the more time 
> was spent in 'SecondPrecisionEventContainer.generateReport()' during web 
> requests. This method is already relied on heavily right after starting the 
> cluster (for big flows and process groups), but the time spent in it increases 
> (in our setup) the longer the cluster runs. This increases the latency of 
> almost every web request. Our flow reconfiguration script (calling many NiFi 
> API endpoints) went from 2 minutes to 20 minutes of run time within a few days.
>  Looking at the source code, I couldn't quite figure out why the run time 
> should increase over time, because the ring buffers always stay the same size 
> (301 entries / 5 minutes).
> When sampling memory I noticed quite a lot of EventSum instances, more than 
> there should have been. So I took a heap dump and ran it through the Memory 
> Analyzer tool. The "Leak Suspects" overview gave me the final hint as to what 
> was wrong. It reported:
> One instance of "java.util.concurrent.ConcurrentHashMap$Node[]" loaded by 
> "<system class loader>" occupies 5,649,926,328 (55.74%) bytes. The instance 
> is referenced by 
> org.apache.nifi.controller.repository.metrics.RingBufferEventRepository @ 
> 0x7f86c50cda40, loaded by "org.apache.nifi.nar.NarClassLoader @ 
> 0x7f86a000". The memory is accumulated in one instance of 
> "java.util.concurrent.ConcurrentHashMap$Node[]" loaded by "<system class loader>".
> The issue is:
> When we remove processors, connections, or process groups from the flow, their 
> data is not removed from the ConcurrentHashMap in RingBufferEventRepository. 
> There is a 'purgeTransferEvents' method, but it only calls an empty 
> 'purgeEvents' method on all 'SecondPrecisionEventContainer's in the map.
> This means that the map grows without bounds, and every time 
> 'reportTransferEvents' is called it iterates over all (meaning more and more 
> over time) entries of the map. This increases the latency of every web request 
> and also occupies a huge amount of memory.
> A rough idea to fix this:
> Remove the entry for each removed component (processor, process group, 
> connection, ...) using their onRemoved methods in the FlowController.
> This should stop the map from growing indefinitely for any flow where 
> components are removed frequently, especially when automated.
> Since this is quite urgent for us, I'll try to work on a fix for this and 
> provide a pull request if successful.
> Since no one noticed this before, I guess we are not the typical NiFi user: 
> we thought it was possible to heavily reconfigure flows using the API, but 
> with this performance issue, it's not.
> Please let me know if I can provide any more helpful detail about this problem.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (NIFI-5225) Leak in RingBufferEventRepository for frequently updated flows

2018-05-22 Thread Frederik Petersen (JIRA)

 [ 
https://issues.apache.org/jira/browse/NIFI-5225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frederik Petersen updated NIFI-5225:
------------------------------------
Description: 
We use NiFi's API to change a part of our flow quite frequently. Over the past 
weeks we have noticed that the performance of web requests degrades over time, 
and we had a very hard time finding out why.

Today I took a closer look. When using VisualVM to sample CPU usage, it already 
stood out that the longer the cluster had been running, the more time was spent 
in 'SecondPrecisionEventContainer.generateReport()' during web requests. This 
method is already relied on heavily right after starting the cluster (for big 
flows and process groups), but the time spent in it increases (in our setup) 
the longer the cluster runs. This increases the latency of almost every web 
request. Our flow reconfiguration script (calling many NiFi API endpoints) went 
from 2 minutes to 20 minutes of run time within a few days.
Looking at the source code, I couldn't quite figure out why the run time should 
increase over time, because the ring buffers always stay the same size (301 
entries / 5 minutes).
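
For context, 301 entries would be consistent with one one-second bucket for 
each second of a 5-minute window plus one bucket for the second currently 
being filled. This sizing is an assumption for illustration, not taken from 
the NiFi source:

    // Assumed ring-buffer sizing (illustrative only, not the actual constants):
    int windowMinutes = 5;
    int bucketsPerMinute = 60;                                  // one-second precision
    int ringBufferSize = windowMinutes * bucketsPerMinute + 1;  // = 301 entries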

When sampling memory I noticed quite a lot of EventSum instances, more than 
there should have been. So I took a heap dump and ran it through the Memory 
Analyzer tool. The "Leak Suspects" overview gave me the final hint as to what 
was wrong. It reported:

One instance of "java.util.concurrent.ConcurrentHashMap$Node[]" loaded by 
"<system class loader>" occupies 5,649,926,328 (55.74%) bytes. The instance is 
referenced by 
org.apache.nifi.controller.repository.metrics.RingBufferEventRepository @ 
0x7f86c50cda40, loaded by "org.apache.nifi.nar.NarClassLoader @ 
0x7f86a000". The memory is accumulated in one instance of 
"java.util.concurrent.ConcurrentHashMap$Node[]" loaded by "<system class loader>".

The issue is:

When we remove processors, connections, or process groups from the flow, their 
data is not removed from the ConcurrentHashMap in RingBufferEventRepository. 
There is a 'purgeTransferEvents' method, but it only calls an empty 
'purgeEvents' method on all 'SecondPrecisionEventContainer's in the map.

This means that the map grows without bounds, and every time 
'reportTransferEvents' is called it iterates over all (meaning more and more 
over time) entries of the map. This increases the latency of every web request 
and also occupies a huge amount of memory.
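
To make the reported pattern concrete, here is a minimal sketch of a 
per-component map that is only ever added to; all class, field, and method 
names below are placeholders, not the actual NiFi implementation:

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    // Sketch of the leak: entries are created per component ID but never removed,
    // so deleted components keep costing memory and make every report slower.
    class LeakyEventRepositorySketch {
        private final ConcurrentMap<String, long[]> sumsByComponentId = new ConcurrentHashMap<>();

        void recordTransferEvent(String componentId, long bytesTransferred) {
            // Adds a map entry the first time a component ID is seen...
            sumsByComponentId.computeIfAbsent(componentId, id -> new long[1])[0] += bytesTransferred;
        }

        long reportTransferEvents() {
            long total = 0;
            // ...and reporting iterates over every entry, including entries for
            // components that were removed from the flow long ago.
            for (long[] sums : sumsByComponentId.values()) {
                total += sums[0];
            }
            return total;
        }
        // Nothing ever calls sumsByComponentId.remove(componentId) - that is the leak.
    }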

A rough idea to fix this:

Remove the entry for each removed component (processor, process group, 
connection, ...) using their onRemoved methods in the FlowController.

This should stop the map from growing indefinitely for any flow where 
components are removed frequently, especially when automated.
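
A minimal sketch of that idea, assuming the repository gains a 
purge-by-component-id method that the FlowController invokes from its 
onRemoved handling; the signatures below are illustrative, not the actual 
NiFi API:

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    // Sketch: drop the map entry for a component as soon as it is removed.
    class PurgingEventRepositorySketch {
        private final ConcurrentMap<String, Object> containersByComponentId = new ConcurrentHashMap<>();

        // Hypothetical hook, called from the FlowController when a processor,
        // connection, or process group is removed from the flow.
        void purgeTransferEvents(String removedComponentId) {
            containersByComponentId.remove(removedComponentId);
        }
    }

    // Illustrative call site in the FlowController's onRemoved handling:
    //   flowFileEventRepository.purgeTransferEvents(component.getIdentifier());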

Since this is quite urgent for us, I'll try to work on a fix for this and 
provide a pull request if successful.

Since no one noticed this before, I guess we are not the typical NiFi user: 
we thought it was possible to heavily reconfigure flows using the API, but 
with this performance issue, it's not.

Please let me know if I can provide any more helpful detail about this problem.

 
