Hi Mike,

I might have a few more pointers to offer when I can get unburied from some
other work ... but a couple of things that jump to mind are the following:


   - I think for that many flowfiles, you will want to make sure you have
   separate disks set up for data provenance.  We have several different types
   of flowfile profiles.  For the ones where we didn't have too many
   flowfiles, we didn't change much from the default settings, and we
   actually (against recommendation and better judgement) had everything
   hitting the same set of disks.  When we had another, more real-time
   processing profile, closer to the volume that you are talking about, we
   began to run into issues with provenance being unable to keep up.  We
   created three separate disks and changed the accompanying config, and that
   helped a great deal.  You'd need to make some changes around threading for
   that too.  You can find some info on that here:
   
https://community.cloudera.com/t5/Community-Articles/HDF-CFM-NIFI-Best-practices-for-setting-up-a-high/ta-p/244999


   - I don't know what you've done with regard to the Maximum Timer Driven
   Thread Count, but the default is quite low (depending on the size of your
   machine).  If I'm not mistaken (there is a best practices doc out there),
   you can set this to 2-4 times the number of cores that you have.  We have
   been fairly aggressive and set it to 4 times.  Once we did that, we had
   some of the processors run multiple threads - but you have to be careful
   you don't have one set of processors eating all of your available cycles.
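
To make the two points above a bit more concrete, here is a rough Python
sketch (not our actual setup). It works out the 2-4x range for the thread
count from the core count, and prints nifi.properties lines that put one
provenance directory on each disk. The mount points and the prov1/prov2/prov3
suffixes are made up for illustration, and you should check the property
names, especially the index threads one, against the admin guide for your
NiFi version.

```python
import os


def timer_driven_thread_range(cores=None, low=2, high=4):
    """Return the (low, high) range for Maximum Timer Driven Thread Count.

    Uses the 2-4 times-the-cores rule of thumb from the text above.
    """
    cores = cores or os.cpu_count() or 1
    return cores * low, cores * high


def provenance_repo_properties(mount_points, index_threads=None):
    """Build nifi.properties lines, one provenance directory per disk.

    mount_points are hypothetical paths; the prov1, prov2, ... suffixes
    are arbitrary labels chosen for this example.
    """
    lines = [
        "nifi.provenance.repository.directory.prov%d=%s/provenance_repository"
        % (i, mount.rstrip("/"))
        for i, mount in enumerate(mount_points, 1)
    ]
    if index_threads is not None:
        # Assumed property name; verify it against your NiFi version.
        lines.append(
            "nifi.provenance.repository.index.threads=%d" % index_threads
        )
    return lines


if __name__ == "__main__":
    low, high = timer_driven_thread_range()
    print("Maximum Timer Driven Thread Count: try %d to %d" % (low, high))
    # Three hypothetical mount points, one per physical disk.
    for line in provenance_repo_properties(["/prov1", "/prov2", "/prov3"], 4):
        print(line)
```

The thread count itself is set in the NiFi UI under Controller Settings, not
in nifi.properties, so the first function only gives you a number to try.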

One of the sizing docs we used was this one:
https://community.cloudera.com/t5/Community-Articles/NiFi-Sizing-Guide-Deployment-Best-Practices/ta-p/246781
We used it to think through our server size and the throughput we wanted.

In all, we found that there were some best practices, but it required some
tuning and observation.

I hope that helps.

Craig

Craig S. Connell
CTO & Senior VP of Engineering
[email protected]
443-789-4842




On Fri, Sep 11, 2020 at 12:51 PM Mike Thomsen <[email protected]>
wrote:

> What are the general recommended practices around tuning NiFi to
> safely handle flows that may drop in several million very small
> flowfiles (2k-10kb each) onto a single node? It's possible that some
> of the data dumps we're processing (and we can't control their size)
> will drop about 3.5-5M flowfiles the moment we expand them in the
> flow.
>
> (Let me emphasize again, it was not our idea to dump the data this way)
>
> Any pointers would be appreciated.
>
> Thanks,
>
> Mike
>
