Here is a draft of the blog I've been meaning to put out for quite a while. The images referenced in the writeup don't really matter, so ignore them.
Subject: The flow looks good yet I ran out of disk space! Apache NiFi 1.16 to the rescue

Body:

The next two screenshots are extremely similar images of the very same flow. The first is running on Apache NiFi 1.15 and the second on the to-be-released Apache NiFi 1.16.2. Both run an identical flow that simulates a common scenario in NiFi: one path processing larger objects (in this case 5 MB blobs) and another processing smaller ones (in this case 125-byte blobs). In both cases you see, as a user, what looks like a really healthy system. There are only a few megabytes' worth of data still in processing, and the queues are reasonably manageable. Nothing to worry about... right?

To explain the problem, you'll first want to refresh your knowledge of how NiFi's content repository mechanism works, as described in this document [1]. In short, as NiFi writes content to the repository, we cannot simply write every flowfile's content as its own file on disk; that would be incredibly slow given how nearly all file systems work, and in general it isn't a good use of the awesome power of today's file systems, kernels, disks, and disk caching. Instead, NiFi writes a group of content to a given file on disk and holds claims that track offsets into it. By default, until now, we've always had this setting in place: 'nifi.content.claim.max.appendable.size=1 MB'. That means we group the content chunks of as many flowfiles as necessary into a single file until it holds at least 1 MB of content, and then we start a new file in the repository for any subsequent data. This greatly reduces the total number of files on disk and very often aligns well with disk caching mechanisms, yielding very strong performance on common flows.
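The grouping behavior described above can be sketched in a few lines. This is an illustrative model, not NiFi's actual implementation; names like `Claim` and `max_appendable_size` are chosen here for clarity and mirror the property name only loosely:

```python
# Hedged sketch of content-claim grouping: many flowfiles' content chunks
# are appended to one claim file until it reaches the appendable-size
# threshold, then a new claim file is started. Not NiFi's real code.

class Claim:
    def __init__(self):
        self.size = 0          # bytes written to this claim file so far
        self.offsets = []      # (offset, length) recorded per flowfile chunk

def write_content(chunk_sizes, max_appendable_size):
    """Return the claim files needed to store chunks of the given byte sizes."""
    claims = []
    current = None
    for length in chunk_sizes:
        # Roll a new claim file once the current one has reached the threshold.
        if current is None or current.size >= max_appendable_size:
            current = Claim()
            claims.append(current)
        current.offsets.append((current.size, length))
        current.size += length
    return claims

# 10,000 tiny 125-byte chunks with a 1 MB threshold land in just 2 claim
# files instead of 10,000 separate files on disk.
claims = write_content([125] * 10_000, max_appendable_size=1_000_000)
print(len(claims))  # 2
```

This is why the design is such a performance win for small objects: the file-per-flowfile cost disappears almost entirely.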
A consequence of this design tradeoff, though, is that we cannot delete a file on disk until all references to its content have been cleared, which only happens once no FlowFile remains in the flow that references that content. Generally speaking, that works extremely well. And of course we understood the previously stated design tradeoff. What we did not realize was how problematic this can be for certain usage patterns, including some that are pretty common for users, given the default value of 1 MB.

Nearly all users of NiFi have flows that operate on a mixture of large and small data, processing at various rates over time. But what if a user decides that certain objects, perhaps small objects, should be held onto for a while? They might do this for dead-letter/troubleshooting purposes, or simply because some object needs to stick around due to downstream system issues. As the images above show, that is precisely what happens in this flow. Both paths process quickly, with the path on the left handling 125-byte objects and the path on the right handling 5 MB objects. The 5 MB objects are processed and move on, but for the tiny flowfiles we take every 1000th object and hang onto it. Users do things like this in their flows all the time, and for good reasons. Yet it means you end up with lots of files on disk that contain both small objects and large objects, where only the reference to the large object was removed; the content repository therefore hangs onto both.

Some readers more familiar with NiFi might be thinking that turning off the content archive capability would help here. It does not. The problem has to do with actively referenceable content, so the archive doesn't even come into play yet. Now, so far I've talked about this problem happening when combining small objects. It turns out that in versions before Apache NiFi 1.16.2, this could even happen for flowfiles that referenced no content at all!
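The retention problem above boils down to reference counting at claim-file granularity. A minimal sketch, again with hypothetical names rather than NiFi's internals:

```python
# Hedged sketch of the retention problem: a claim file can only be deleted
# when *no* live flowfile references any content inside it, so one retained
# 125-byte flowfile pins every neighbor stored in the same claim file.

def reclaimable_bytes(claim_files, live_refs):
    """Bytes that can be freed: only claim files with zero live references."""
    freed = 0
    for claim_id, size in claim_files.items():
        if claim_id not in live_refs:
            freed += size
    return freed

# One claim file holding a 125-byte object alongside a 5 MB object.
claim_files = {"claim-1": 125 + 5_000_000}

# The 5 MB flowfile has finished, but the tiny one is parked in a queue,
# so the whole 5 MB rides along and nothing can be reclaimed.
live_refs = {"claim-1"}
print(reclaimable_bytes(claim_files, live_refs))  # 0
```

Drop the last tiny reference and the entire claim file, large neighbor included, becomes reclaimable at once; that is exactly the "clear the queue and the disk frees up" behavior described in the test scenario below.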
That scenario sounds odd, but it too is quite common. People routinely create marker flowfiles to kick off certain processes or to hold metadata they use within the flow.

So, going back to the images above: the first image was from Apache NiFi 1.15, which was slowly but surely filling up the content repository until eventually the flow would have been unable to progress at all.

**How the test scenario played out on Apache NiFi 1.15:
- On startup, the content repository shows 344 GB of disk space available.
- Let the flow run until the state shown in the first image. The content repository shows 269 GB of disk space available, with around 30,000 flowfiles held representing little more than 3 MB of actual content. So we retained more than 75 GB of disk space even though we only needed 3 MB of actual content!
- Stop the processors for the tiny-flowfile path and clear out the queued flowfiles. The content repository almost immediately shrinks back down to its original size plus some buffer capacity for the live flow, hovering around 320 GB of disk space available.

**How the test scenario played out on Apache NiFi 1.16:
- On startup, the content repository shows 344 GB of disk space available.
- Let the flow run until the state shown in the second image. The content repository grows a bit and shrinks, hovering around its original size plus some buffer, again around 320 GB of disk space available, except in this case we don't have to force-delete anything to get or stay there.

So what did we change in Apache NiFi 1.16.2? First, we changed the default from 1 MB to 50 KB. Our testing shows this offers a pretty good sweet spot: the unintended retention of data is greatly reduced (by approximately 20x) while performance remains strong. You might find that on your own flows and systems you want to tweak this. Time will tell, as this property and its default value are usually left unchanged by users.
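For reference, the property in question lives in nifi.properties. The value shown is the new default discussed above; older releases shipped 1 MB:

```
# nifi.properties: content claim grouping threshold
# 1.16.2+ default is 50 KB; releases before that shipped 1 MB
nifi.content.claim.max.appendable.size=50 KB
```

Lowering the value means fewer unrelated objects share a claim file, so a single retained flowfile pins far less neighboring content, at the cost of somewhat more files on disk.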
Second, we also fixed the issue that allowed flowfiles with zero content to hold claims on content on disk. Together, these two changes should result in correct behavior for common flows and eliminate any known cases whereby the content repository could become full without an obvious corresponding backlog of flowfiles in the flow.

[1] https://nifi.apache.org/docs/nifi-docs/html/nifi-in-depth.html#content-repository

On Tue, Sep 13, 2022 at 10:15 AM Joe Witt <[email protected]> wrote:
>
> read that again and hopefully it was obvious I was joking. But I am
> looking forward to hearing what you learn.
>
> Thanks
>
> On Tue, Sep 13, 2022 at 10:10 AM Joe Witt <[email protected]> wrote:
> >
> > Lars
> >
> > I need you to drive back to work because now I am very vested in the
> > outcome :)
> >
> > But yeah this was an annoying problem we saw hit some folks. Changing
> > that value after fixing the behavior was the answer. I owe the
> > community a blog on this....
> >
> > Thanks
> >
> > On Tue, Sep 13, 2022 at 9:57 AM Lars Winderling
> > <[email protected]> wrote:
> > >
> > > Sorry, misread the jira. We're still on the old default value. Thank you
> > > for being persistent about it. I will try it tomorrow with the lower
> > > value and get back to you. Not at work atm, so I can't paste the config
> > > values in detail.
> > >
> > > On 13 September 2022 16:45:30 CEST, Joe Witt <[email protected]> wrote:
> > >>
> > >> Lars
> > >>
> > >> You should not have to update to 1.17. While I'm always fond of
> > >> people being on the latest, the issue I mentioned is fixed in 1.16.3.
> > >>
> > >> HOWEVER, please do confirm your values. The one I'd really focus you
> > >> on is
> > >> nifi.content.claim.max.appendable.size=50 KB
> > >>
> > >> Our default before was like 1 MB, and what we'd see is we'd hang on to
> > >> large content way longer than we intended because some queue had one
> > >> tiny object in it. So that value became really important.
> > >>
> > >> If you're on 1 MB, change to 50 KB and see what happens.
> > >>
> > >> Thanks
> > >>
> > >> On Tue, Sep 13, 2022 at 9:40 AM Lars Winderling
> > >> <[email protected]> wrote:
> > >>>
> > >>> I guess the issue you linked is related. I have seen similar messages
> > >>> in the log occasionally, but didn't directly connect it. Our config is
> > >>> pretty similar to the defaults; none of it should directly cause the
> > >>> issue. Will give 1.17.0 a try and come back if the issue persists. Your
> > >>> help is really appreciated, thanks!
> > >>>
> > >>> On 13 September 2022 16:33:53 CEST, Joe Witt <[email protected]>
> > >>> wrote:
> > >>>>
> > >>>> Lars
> > >>>>
> > >>>> The issue that came to mind is
> > >>>> https://issues.apache.org/jira/browse/NIFI-10023 but that is fixed in
> > >>>> 1.16.2 and 1.17.0, so that is why I asked.
> > >>>>
> > >>>> What is in your nifi.properties for
> > >>>> # Content Repository
> > >>>> nifi.content.repository.implementation=org.apache.nifi.controller.repository.FileSystemRepository
> > >>>> nifi.content.claim.max.appendable.size=50 KB
> > >>>> nifi.content.repository.directory.default=./content_repository
> > >>>> nifi.content.repository.archive.max.retention.period=7 days
> > >>>> nifi.content.repository.archive.max.usage.percentage=50%
> > >>>> nifi.content.repository.archive.enabled=true
> > >>>> nifi.content.repository.always.sync=false
> > >>>>
> > >>>> Thanks
> > >>>>
> > >>>> On Tue, Sep 13, 2022 at 7:04 AM Lars Winderling
> > >>>> <[email protected]> wrote:
> > >>>>>
> > >>>>> I'm using 1.16.3 from upstream (no custom build) on Java 11
> > >>>>> Temurin, Debian 10, virtualized, no Docker setup.
> > >>>>>
> > >>>>> On 13 September 2022 13:37:15 CEST, Joe Witt <[email protected]>
> > >>>>> wrote:
> > >>>>>>
> > >>>>>> Lars
> > >>>>>>
> > >>>>>> What version are you using?
> > >>>>>>
> > >>>>>> Thanks
> > >>>>>>
> > >>>>>> On Tue, Sep 13, 2022 at 3:11 AM Lars Winderling
> > >>>>>> <[email protected]> wrote:
> > >>>>>>>
> > >>>>>>> Dear community,
> > >>>>>>>
> > >>>>>>> sometimes our content repository grows out of bounds. Since it
> > >>>>>>> has been separated on disk from the rest of NiFi, we can still use
> > >>>>>>> the NiFi UI and empty the respective queues. However, the disk
> > >>>>>>> remains jammed. Sometimes it gets cleaned up after a few minutes,
> > >>>>>>> but most of the time we need to restart NiFi manually for the
> > >>>>>>> cleanup to happen.
> > >>>>>>> So, is there any way of triggering the content eviction manually
> > >>>>>>> without restarting NiFi?
> > >>>>>>> Btw. the respective files on disk are not archived in the content
> > >>>>>>> repository (thus not below */archive/*).
> > >>>>>>>
> > >>>>>>> Thanks in advance for your support!
> > >>>>>>> Best,
> > >>>>>>> Lars
