Re: Best way to tune NiFi for huge amounts of small flowfiles
We keep our queue limit at 20,000 to keep data from swapping between ArrayLists and Prioritized Queues. See bug: https://issues.apache.org/jira/browse/NIFI-7583

You can also adjust that limit up in nifi.properties.

On Sat, Sep 12, 2020 at 1:15 AM Chris Sampson wrote:
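For reference, the swap threshold mentioned above is a property in conf/nifi.properties; a minimal sketch, where the value shown is NiFi's shipped default:

```
# conf/nifi.properties -- FlowFile swapping
# When a single queue exceeds this many FlowFiles, NiFi starts
# swapping the excess to disk to bound heap usage.
nifi.queue.swap.threshold=20000
```

Raising it keeps more FlowFiles in memory (avoiding the swap behavior described above) at the cost of heap, so it should be sized against the JVM heap settings in bootstrap.conf.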
Re: Best way to tune NiFi for huge amounts of small flowfiles
One thing we've not done yet but I think might help is to stripe disks for each repo too, i.e. multiple disks for content, etc., which will help spread the disk I/O.

Cheers,

Chris Sampson

On Fri, 11 Sep 2020, 22:46 Mike Thomsen wrote:
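Striping the content repository across disks, as suggested above, is done by listing multiple directory properties in conf/nifi.properties; NiFi spreads content claims across all listed directories. The mount paths here are illustrative:

```
# conf/nifi.properties -- content repository striped across several disks
nifi.content.repository.directory.default=/disk1/content_repository
nifi.content.repository.directory.disk2=/disk2/content_repository
nifi.content.repository.directory.disk3=/disk3/content_repository
```

Note the suffix after `directory.` (e.g. `disk2`) is an arbitrary label; only the content and provenance repositories support multiple directories, while the flowfile repository takes a single one.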
Re: Best way to tune NiFi for huge amounts of small flowfiles
Craig and Jeremy,

Thanks. The point about using different disks for different repositories is definitely something to add to the list.

On Fri, Sep 11, 2020 at 3:11 PM Jeremy Dyer wrote:
Re: Best way to tune NiFi for huge amounts of small flowfiles
Hey Mike,

When you say "flows that may drop in several million ... flowfiles" I read that as a single node that might be inundated with tons of source data (local files, FTP, Kafka messages, etc.). Just my 2 cents, but if you don't have strict SLAs (and this kind of sounds like a one-time thing) I wouldn't even worry about it; just let the system back-pressure and process in time as designed. That process will be "safe", although maybe not fast. If you need speed, throw lots of NVMe mounts at it. We process well into the tens (sometimes hundreds) of millions of flowfiles a day on a 5-node cluster with no issues. However, our hardware is quite over the top.

Thanks,
Jeremy Dyer

On Fri, Sep 11, 2020 at 12:51 PM Mike Thomsen wrote:
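For context on "let the system back pressure": back pressure thresholds are set per connection in the NiFi UI, but the defaults applied to newly created connections come from conf/nifi.properties (NiFi 1.9+). A sketch with the shipped defaults:

```
# conf/nifi.properties -- back pressure defaults for NEW connections
# When a queue reaches either threshold, upstream processors stop
# being scheduled until the queue drains below it.
nifi.queue.backpressure.count=10000
nifi.queue.backpressure.size=1 GB
```

Changing these only affects connections created afterwards; existing connections keep whatever thresholds were configured on them.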
Re: Best way to tune NiFi for huge amounts of small flowfiles
Hi Mike,

I might have a few more pointers to offer when I can get unburied from some other work, but the couple of things that jump to mind are the following:

- I think for that many flowfiles, you will want to make sure you have separate disks set up for data provenance. We have several different types of flowfile profiles. For the ones where we didn't have too many flowfiles, we didn't do much to change the default settings, and we actually (against recommendation and better judgement) had everything hitting the same set of disks. When we had another, more real-time processing profile more akin to the volume that you are talking about, we began to run into issues with the ability of provenance to keep up. We created three separate disks and changed the accompanying config, and that helped a great deal. You'd need to make some changes around threading for that too. You can find some info on that here: https://community.cloudera.com/t5/Community-Articles/HDF-CFM-NIFI-Best-practices-for-setting-up-a-high/ta-p/244999

- I don't know what you've done with regard to the Maximum Timer Driven Thread Count, but the default is quite low (depending on the size of your machine). If I'm not mistaken (there is a best practices doc out there), you can set this to 2-4 times the number of cores that you have. We have been fairly aggressive and set it to 4x. Once we did that, we had some of the processors run multiple threads, but you have to be careful that one set of processors doesn't eat all of your available cycles. One of the sizing docs we used was this one: https://community.cloudera.com/t5/Community-Articles/NiFi-Sizing-Guide-Deployment-Best-Practices/ta-p/246781, which helped us think through our server size and the throughput we wanted.

In all, we found that there were some best practices, but it required some tuning and observation. I hope that helps.

Craig

Craig S. Connell
CTO & Senior VP of Engineering
csconn...@staq.com
443-789-4842

On Fri, Sep 11, 2020 at 12:51 PM Mike Thomsen wrote:
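A rough sketch of the kind of provenance setup Craig describes, using standard nifi.properties property names; the mount paths and thread counts here are illustrative, not his exact values:

```
# conf/nifi.properties -- provenance repository spread over three disks
# (WriteAheadProvenanceRepository is the default implementation in recent NiFi)
nifi.provenance.repository.implementation=org.apache.nifi.provenance.WriteAheadProvenanceRepository
nifi.provenance.repository.directory.default=/disk4/provenance_repository
nifi.provenance.repository.directory.disk5=/disk5/provenance_repository
nifi.provenance.repository.directory.disk6=/disk6/provenance_repository

# More index threads help provenance keep up under heavy flowfile volume
nifi.provenance.repository.index.threads=4
nifi.provenance.repository.query.threads=2
```

The Maximum Timer Driven Thread Count, by contrast, is not a nifi.properties setting; it is changed in the UI under Controller Settings.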
Best way to tune NiFi for huge amounts of small flowfiles
What are the general recommended practices around tuning NiFi to safely handle flows that may drop in several million very small flowfiles (2k-10kb each) onto a single node? It's possible that some of the data dumps we're processing (and we can't control their size) will drop about 3.5-5M flowfiles the moment we expand them in the flow.

(Let me emphasize again, it was not our idea to dump the data this way)

Any pointers would be appreciated.

Thanks,

Mike