Hi Mike,

I might have a few more pointers to offer once I can get unburied from some other work, but a couple of things jump to mind:
- I think for that many flowfiles, you will want to make sure you have separate disks set up for data provenance. We have several different types of flowfile profiles. For the ones where we didn't have too many flowfiles, we left most of the default settings alone, and we actually (against recommendation and better judgement) had everything hitting the same set of disks. When we had another, more real-time processing profile closer to the volume that you are talking about, we began to run into issues with provenance not being able to keep up. We created three separate disks and changed the accompanying config, and that helped a great deal. You'd need to make some changes around threading for that too. You can find some info on that here:
  https://community.cloudera.com/t5/Community-Articles/HDF-CFM-NIFI-Best-practices-for-setting-up-a-high/ta-p/244999

- I don't know what you've done with regard to the Maximum Timer Driven Thread Count, but the default is quite low (depending on the size of your machine). If I'm not mistaken (there is a best practices doc out there), you can set this to 2-4 times the number of cores that you have. We have been fairly aggressive and set it to 4x. Once we did that, we had some of the processors run multiple threads, but you have to be careful that you don't have one set of processors eating all of your available cycles. One of the sizing docs we used was this one:
  https://community.cloudera.com/t5/Community-Articles/NiFi-Sizing-Guide-Deployment-Best-Practices/ta-p/246781
  We used it to give some thought to our server size and the throughput we wanted.

In all, we found that there were some best practices, but it required some tuning and observation.

I hope that helps.

Craig

Craig S. Connell
CTO & Senior VP of Engineering
[email protected]
443-789-4842

On Fri, Sep 11, 2020 at 12:51 PM Mike Thomsen <[email protected]> wrote:

> What are the general recommended practices around tuning NiFi to
> safely handle flows that may drop in several million very small
> flowfiles (2k-10kb each) onto a single node? It's possible that some
> of the data dumps we're processing (and we can't control their size)
> will drop about 3.5-5M flowfiles the moment we expand them in the
> flow.
>
> (Let me emphasize again, it was not our idea to dump the data this way)
>
> Any pointers would be appreciated.
>
> Thanks,
>
> Mike
>
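
For the thread-count point in the reply above, the 2-4x rule of thumb is easy to turn into a quick calculation. This is only a rough sketch: the function name is made up for illustration, and the multipliers are the ones quoted from memory in the reply, so treat the result as a starting range to tune from rather than a recommended setting.

```python
import os


def timer_driven_thread_range(cores=None, low=2, high=4):
    """Return (min, max) candidate values for Maximum Timer Driven Thread Count.

    Uses the 2-4x-cores rule of thumb from the reply; `low` and `high`
    are assumptions you can adjust.
    """
    if cores is None:
        # Fall back to the core count of the machine running this script.
        cores = os.cpu_count() or 1
    if cores < 1:
        raise ValueError("cores must be >= 1")
    return cores * low, cores * high


if __name__ == "__main__":
    lo, hi = timer_driven_thread_range(16)
    print(f"16 cores: try between {lo} and {hi} timer-driven threads")
```

Whatever value you pick, keep an eye on per-processor concurrent tasks so that one group of processors cannot claim the whole pool.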

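
For the separate-provenance-disks point, here is a small sketch that prints the kind of `nifi.properties` lines involved. The mount points and the `prov1`, `prov2`, `prov3` suffixes are hypothetical, and the index thread value is an assumption for illustration, not what we run; check the admin guide for your NiFi version before copying anything.

```python
def provenance_repo_properties(mount_points, index_threads=None):
    """Build nifi.properties lines that spread provenance over several disks.

    `mount_points` are hypothetical paths, one per dedicated disk.
    The `provN` suffixes are arbitrary labels chosen for this sketch.
    """
    lines = []
    for i, mount in enumerate(mount_points, start=1):
        path = mount.rstrip("/") + "/provenance_repository"
        lines.append(f"nifi.provenance.repository.directory.prov{i}={path}")
    if index_threads is not None:
        # More disks usually go together with more indexing threads (assumed value).
        lines.append(f"nifi.provenance.repository.index.threads={index_threads}")
    return lines


if __name__ == "__main__":
    for line in provenance_repo_properties(["/prov1", "/prov2", "/prov3"], index_threads=4):
        print(line)
```

Pasting the output into `nifi.properties` replaces the single default provenance directory entry; NiFi needs a restart to pick it up.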