Hi David,

Thanks for your reply.
I’m confident I’ve identified the root cause now, having been able to reproduce the symptoms on a test cluster. The issue appears to be not just having many RPGs, but having many of them pointing to the same input port (the one that leads to the generic audit logging flow). Under even low to moderate load, a point is reached after some time where the input port is not scheduled often enough, causing a chain reaction: every RPG that gets scheduled then spends 30 seconds holding its thread until the connection to the input port times out. Because the RPGs and input ports are on the same cluster, and even on the same node for roughly 1/3 of the files, this starves the input port (and frankly the whole NiFi instance) of threads, which increases the timeout issues, disconnects nodes, causes UI issues, etc.

The solution is still migrating to load-balanced connections and removing the RPGs, though I worry about the same chain reaction if we simply replace the RPGs with local output ports that all point to the same input port. That takes time to implement, so for now I’m looking at tweaking settings to keep the system running.

So my question this time around: what is supposed to happen if I set the concurrent tasks for an input port to a high number (say 10-20)? Will the port be scheduled with exactly that number of threads, or with as many threads as are available? And if 10 threads are not available, will the port be scheduled at all?

Regards,

Isha

From: David Handermann <[email protected]>
Sent: Wednesday, 23 February 2022 19:57
To: [email protected]
Subject: Re: Performance implications of RPGs for loadbalancing

Hi Isha,

Thanks for providing some background on the configuration and related issues. Based on the issues you highlighted, it sounds like you are running into several known problems. There are some potential workarounds, but refactoring the flow configuration to use standard connection load balancing is the best solution. Upgrading to NiFi 1.15.3 addresses a number of security and performance issues, including some of the items you mentioned.

Related to the first problem, RAW socket communication should be preferred for RPG communication. RAW socket communication is not subject to the Denial-of-Service filter timeout and also has less overhead than HTTP request processing, so ensuring that all Remote Process Groups use RAW socket communication should help.

When HTTP requests exceed the DoS filter timeout, Jetty terminates the connection, which can produce any number of errors, such as the End-of-File and Connection Closed issues you have observed. Using HTTP communication also consumes threads from the Jetty server, which can impact user interface performance. This might also be part of the explanation for cluster nodes getting out of sync, but there could be other factors involved.

NiFi 1.12.1 is affected by several known issues related to the Denial-of-Service filter and Site-to-Site communication, which have been addressed in more recent releases.
Here are a couple worth noting:

- https://issues.apache.org/jira/browse/NIFI-7912 - Added new nifi.web.request properties that can be used to change the default 30 second timeout and exclude IP addresses from filtering (see the nifi.properties sketch at the end of this thread)
- https://issues.apache.org/jira/browse/NIFI-9448 - Resolved potential IllegalStateException for S2S client communication
- https://issues.apache.org/jira/browse/NIFI-9481 - Exclude HTTP Site-to-Site communication from the DoS filter

The last issue is not yet part of a released version, but the other two are resolved in NiFi 1.15.3. Although upgrading and migrating to connection load balancing will take some work, it is the best path forward to address the issues you observed.

Regards,

David Handermann

On Wed, Feb 23, 2022 at 11:55 AM Isha Lamboo <[email protected]> wrote:

Hi all,

I’m hoping to get some perspective from people who run NiFi with a large number of Remote Process Groups.

I’m supporting a NiFi 1.12.1 (yes, I know) cluster of 3 nodes that has about 5k processors, with load balancing still done the pre-1.8 way: RPGs looping back to the local cluster. There are 500+ RPGs, of which only about 30 actually go to other NiFi clusters.

We’re having several problems:

- Input ports getting stuck when the RPG is set to the HTTP protocol and connections get killed by the Jetty DoS filter after 30 seconds. The standard is RAW, but sometimes an HTTP RPG still gets deployed.
- Intermittent errors such as EoF and connection closed on HTTP connections.
- The cluster being unable to sync changes made to the flow, resulting in disconnected nodes and sometimes uninheritable flow exceptions.

My idea is that the RPGs should be replaced by load-balanced connections and/or local ports, but developer resources are scarce, so I want to either make a business case or tune NiFi performance if 500 RPGs should not normally cause problems.

So, is this a known issue or something particular to my case? How can I identify and solve performance bottlenecks with RPGs?

Kind regards,

Isha Lamboo
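For reference, a minimal nifi.properties sketch of the nifi.web.request settings described in NIFI-7912, which could buy some headroom until the flow is refactored. The property names below are as documented in the NiFi 1.15.x Administration Guide as I understand them; the timeout value and IP addresses are only placeholders, so verify both against the target version, and note that nifi.properties changes only take effect after a node restart.

    # Raise the web request timeout beyond the default 30 seconds mentioned above
    # (placeholder value; uses the standard NiFi time period format)
    nifi.web.request.timeout=60 secs
    # Comma-separated list of addresses excluded from request filtering,
    # e.g. the cluster nodes' own addresses (placeholder IPs shown here)
    nifi.web.request.ip.whitelist=10.0.0.11,10.0.0.12,10.0.0.13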
