Hi David,

Thanks for your reply.
I’m confident I’ve identified the root cause now, having been able to reproduce the symptoms on a test cluster. The issue appears to be not just having many RPGs, but having many of them pointing to the same input port (the one that leads to the generic audit logging flow). Under even low to moderate load, a point is reached after some time where the input port is not scheduled often enough, causing a chain reaction: every RPG that gets scheduled then spends 30 seconds holding its thread until the connection to the input port times out. Because the RPGs and input ports are on the same cluster, and even on the same node for roughly 1/3 of the files, this starves the input port (and frankly the whole NiFi instance) of threads, which increases the timeout issues, disconnects nodes, causes UI issues, etc.

The solution is still migrating to load-balanced connections and removing the RPGs, though I worry about the same chain reaction if we simply replace the RPGs with local output ports that all point to the same input port. That takes time to implement, so for now I’m looking at tweaking settings to keep the system running.

So my question this time around: what is supposed to happen if I set the concurrent tasks for an input port to a high number (say 10-20)? Will the port be scheduled with exactly that number of threads, or with as many threads as are available? And if 10 threads are not available, will the port be scheduled at all?

Regards,

Isha

From: David Handermann <[email protected]>
Sent: Wednesday, 23 February 2022 19:57
To: [email protected]
Subject: Re: Performance implications of RPGs for loadbalancing

Hi Isha,

Thanks for providing some background on the configuration and related issues. Based on the issues you highlighted, it sounds like you are running into several known problems. There are some potential workarounds, but refactoring the flow configuration to use standard connection load balancing is the best solution. Upgrading to NiFi 1.15.3 addresses a number of security and performance issues, including some of the items you mentioned.

Related to the first problem, RAW socket communication should be preferred for RPG communication. RAW socket communication is not subject to the Denial-of-Service filter timeout and also has less overhead than HTTP request processing, so ensuring that all Remote Process Groups use RAW socket communication should help.

When HTTP requests exceed the DoS filter timeout, Jetty terminates the connection, which can produce any number of errors, such as the End-of-File and Connection Closed issues you have observed. Using HTTP communication also consumes threads from the Jetty server, which can impact user interface performance. This might also be part of the explanation for cluster nodes getting out of sync, but there could be other factors involved.

NiFi 1.12.1 is affected by several known issues related to the Denial-of-Service filter and Site-to-Site communication, which have been addressed in more recent releases.
Here are a couple worth noting:

- https://issues.apache.org/jira/browse/NIFI-7912 - Added new nifi.web.request properties that can be used to change the default 30 second timeout and exclude IP addresses from filtering (see the nifi.properties sketch at the end of this thread)
- https://issues.apache.org/jira/browse/NIFI-9448 - Resolved potential IllegalStateException for S2S client communication
- https://issues.apache.org/jira/browse/NIFI-9481 - Exclude HTTP Site-to-Site communication from the DoS filter

The last issue is not yet part of a released version, but the other two are resolved in NiFi 1.15.3. Although upgrading and migrating to connection load balancing will take some work, it is the best path forward to address the issues you observed.

Regards,

David Handermann

On Wed, Feb 23, 2022 at 11:55 AM Isha Lamboo <[email protected]> wrote:

Hi all,

I’m hoping to get some perspective from people who run NiFi with a large number of Remote Process Groups.

I’m supporting a NiFi 1.12.1 (yes, I know) cluster of 3 nodes that has about 5k processors, with load balancing still done the pre-1.8 way: RPGs looping back to the local cluster. There are 500+ RPGs, of which only about 30 actually go to other NiFi clusters.

We’re having several problems:

- Input ports getting stuck when the RPG is set to the HTTP protocol and connections get killed by the Jetty DoS filter after 30 seconds. The standard is RAW, but sometimes an HTTP RPG still gets deployed.
- Intermittent errors such as EoF and connection closed on HTTP connections.
- The cluster being unable to sync changes made to the flow, resulting in disconnected nodes and sometimes uninheritable flow exceptions.

My idea is that the RPGs should be replaced by load-balanced connections and/or local ports, but developer resources are scarce, so I want to either make a business case or tune NiFi performance if 500 RPGs should not normally cause problems.

So, is this a known issue or something particular to my case? How can I identify and solve performance bottlenecks with RPGs?

Kind regards,

Isha Lamboo
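For reference, a minimal nifi.properties sketch of the nifi.web.request settings described in NIFI-7912, which could buy some headroom until the flow is refactored. The property names below are as documented in the NiFi 1.15.x Administration Guide as I understand them; the timeout value and IP addresses are only placeholders, so verify both against the target version, and note that nifi.properties changes only take effect after a node restart.

    # Raise the web request timeout beyond the default 30 seconds mentioned above
    # (placeholder value; uses the standard NiFi time period format)
    nifi.web.request.timeout=60 secs
    # Comma-separated list of addresses excluded from request filtering,
    # e.g. the cluster nodes' own addresses (placeholder IPs shown here)
    nifi.web.request.ip.whitelist=10.0.0.11,10.0.0.12,10.0.0.13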
