Mark Payne created NIFI-7992:
--------------------------------

             Summary: Content Repository can fail to clean up archive directory 
fast enough
                 Key: NIFI-7992
                 URL: https://issues.apache.org/jira/browse/NIFI-7992
             Project: Apache NiFi
          Issue Type: Bug
          Components: Core Framework
            Reporter: Mark Payne
            Assignee: Mark Payne


For the scenario where a user is generating many small FlowFiles and has the 
"nifi.content.claim.max.appendable.size" property set to a small value, we can 
encounter a situation where data is constantly archived but not cleaned up 
quickly enough. As a result, the Content Repository can run out of space.

The FileSystemRepository has a built-in backpressure mechanism intended to 
prevent this from happening, but under the above conditions it can sometimes 
fail to do so. The backpressure mechanism works by performing the following 
steps (a simplified sketch follows the list):
 # When a new Content Claim is created, the Content Repository determines which 
'container' to use.
 # Content Repository checks if the amount of storage space used for the 
container is greater than the configured backpressure threshold.
 # If so, the thread blocks until a background task completes cleanup of the 
archive directories.
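
The following is a minimal sketch of that flow. The names (chooseContainer, 
containerUsedSpace, waitForArchiveCleanup) and the threshold value are 
illustrative only, not FileSystemRepository's actual members:

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified illustration of the three backpressure steps; all names are hypothetical.
class BackpressureSketch {
    private final long backpressureThreshold = 50L * 1024 * 1024 * 1024;            // illustrative threshold
    private final Map<String, Long> containerUsedSpace = new ConcurrentHashMap<>();  // cached usage per container

    String createClaim() throws InterruptedException {
        final String container = chooseContainer();                                  // Step 1: pick a container
        final long cachedUsage = containerUsedSpace.getOrDefault(container, 0L);     // Step 2: read the *cached* usage
        if (cachedUsage >= backpressureThreshold) {
            waitForArchiveCleanup(container);                                         // Step 3: block until cleanup completes
        }
        return container + "/claim-" + System.nanoTime();
    }

    private String chooseContainer() {
        return "default";
    }

    private synchronized void waitForArchiveCleanup(final String container) throws InterruptedException {
        wait(); // woken when the background archive-cleanup task finishes
    }
}
{code}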

However, in Step #2 above, the repository determines the amount of space 
currently being used by looking at a cached member variable. That cached member 
variable is only updated on the first iteration and when said background task 
completes.

So, now consider a case where there are millions of files in the content 
repository archive. The background task could take a massive amount of time 
performing cleanup. Meanwhile, processors are able to write to the repository 
without any backpressure being applied because the background task hasn't 
updated the cached variable for the amount of space used. This continues until 
the content repository fills.
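
A sketch of that failure mode, again with hypothetical names: the cleanup pass 
removes archived files one at a time from an ArrayList and only refreshes the 
cached usage figure once the entire pass completes, so the check above keeps 
reading a stale value while the disk fills.

{code:java}
import java.io.File;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Illustration only: with millions of archived files this pass can run for a very long time,
// and cachedContainerUsage is not refreshed until the very end.
class ArchiveCleanupSketch {
    volatile long cachedContainerUsage; // the value read by the backpressure check

    void cleanupArchive(final List<File> archivedFiles) {
        final List<File> toDestroy = new ArrayList<>(archivedFiles);
        final Iterator<File> itr = toDestroy.iterator();
        while (itr.hasNext()) {
            final File file = itr.next();
            file.delete();
            itr.remove(); // on an ArrayList, each removal shifts the remaining elements: O(n) per file
        }
        cachedContainerUsage = recomputeUsage(); // cached value updated only after the whole pass
    }

    private long recomputeUsage() {
        return 0L; // placeholder; the real task would recompute the container's remaining usage
    }
}
{code}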

There are three important, very simple changes that should be made:
 # The background task should be faster in this case. While we cannot improve 
the amount of time it takes to destroy the files, we do create an ArrayList to 
hold all of the file info and then use an iterator, calling remove(). Under the 
hood, each call to remove() shifts the remaining elements of the backing array, 
so removal is O(n) per file. On my laptop, performing this procedure on an 
ArrayList with 1 million elements took approximately 1 minute. Changing to a 
LinkedList took 15 milliseconds but used much more heap. Keeping an ArrayList 
and then removing all of the elements at the end (via 
ArrayList.subList(0, n).clear()) resulted in performance similar to LinkedList 
with the memory footprint of ArrayList (see the first sketch after this list).
 # The check of whether the content repository's usage has crossed the 
threshold should not rely entirely on a cache that is populated by a process 
that can take a long time. It should periodically calculate the disk usage 
itself, perhaps once per minute (sketched below).
 # When backpressure does get applied, it can appear that the system has frozen 
and is not performing any work. The background task that is clearing space 
should periodically log its progress at INFO level so that users can see that 
this action is taking place (also sketched below).
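
For change #1, a sketch of the bulk-removal approach, assuming a hypothetical 
enoughSpaceReclaimed() cutoff: files are destroyed in place and the processed 
prefix of the ArrayList is cleared once at the end via subList(0, n).clear(), 
which compacts the backing array a single time instead of shifting it per 
element.

{code:java}
import java.io.File;
import java.util.ArrayList;

// Sketch of change #1: keep the ArrayList, but remove processed entries in one bulk operation.
class BulkRemovalSketch {
    void destroyArchivedFiles(final ArrayList<File> archivedFiles) {
        int destroyed = 0;
        for (final File file : archivedFiles) {
            if (enoughSpaceReclaimed()) {
                break;
            }
            file.delete();
            destroyed++;
        }
        // One bulk removal: LinkedList-like speed with the memory footprint of an ArrayList.
        archivedFiles.subList(0, destroyed).clear();
    }

    private boolean enoughSpaceReclaimed() {
        return false; // placeholder for the real stopping condition
    }
}
{code}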
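Changes #2 and #3 might look roughly like the following, where the one-minute 
refresh interval, the 10,000-file logging interval, and all names are 
illustrative choices rather than what NiFi actually uses:

{code:java}
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.logging.Logger;

// Sketches of changes #2 and #3; intervals and names are illustrative.
class UsageAndProgressSketch {
    private static final Logger logger = Logger.getLogger(UsageAndProgressSketch.class.getName());
    private static final long REFRESH_MILLIS = TimeUnit.MINUTES.toMillis(1);

    private volatile long cachedUsedBytes;
    private volatile long lastRefreshMillis;

    // Change #2: recalculate disk usage directly when the cached figure is more than a minute old.
    long getContainerUsage(final Path container) throws IOException {
        final long now = System.currentTimeMillis();
        if (now - lastRefreshMillis > REFRESH_MILLIS) {
            final FileStore store = Files.getFileStore(container);
            cachedUsedBytes = store.getTotalSpace() - store.getUsableSpace();
            lastRefreshMillis = now;
        }
        return cachedUsedBytes;
    }

    // Change #3: log progress at INFO while destroying archived files so the system does not appear frozen.
    void destroyWithProgress(final List<Path> archivedFiles) throws IOException {
        int destroyed = 0;
        for (final Path file : archivedFiles) {
            Files.deleteIfExists(file);
            destroyed++;
            if (destroyed % 10_000 == 0) {
                logger.info("Destroyed " + destroyed + " of " + archivedFiles.size() + " archived files");
            }
        }
    }
}
{code}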

 


