Joe,

I did just get a PR up for this JIRA. If you are inclined to test the PR, 
please do and let us know how everything goes.

Thanks!
-Mark


On Jun 5, 2019, at 2:29 PM, Mark Payne <[email protected]> wrote:

Hey Joe,

Thanks for the feedback here on the logs and the analysis. I think you're very 
right - the
connection in the second flow appears to be causing your first flow to stop 
transmitting.
I have been able to replicate it pretty consistently and am starting to work on 
a fix. Hopefully
will have a PR up very shortly. If you're in a position to do so, it would be 
great if you have
a chance to test it out. I just created a JIRA to track the issue, NIFI-6353 
[1].

Thanks
-Mark

[1] https://issues.apache.org/jira/browse/NIFI-6353

On Jun 4, 2019, at 8:13 PM, Joe Gresock <[email protected]> wrote:

Ok, a couple of hours after the above restart, all the load balanced 
connections stopped sending again.

I enabled DEBUG on the above 2 classes, and found the following messages being 
spammed in the logs:
2019-06-04 23:39:15,497 DEBUG [Load-Balanced Client Thread-2] 
o.a.n.c.q.c.c.a.nio.LoadBalanceSession Will not communicate with Peer 
prod-6.ec2.internal:8443 for Connection e1d23323-5630-1703-0000-00000481bd04 
because session is penalized

The same message is also spammed for prod-7 on the same connection, but I don't 
see any other connections in the log.

Now, interestingly, these are the only messages I see for any of the 8 
"Load-Balanced Client Thread-X" threads, so this makes me wonder if this 
penalized session has consumed all of the available load balance threads 
(nifi.cluster.load.balance.max.thread.count=8), such that no other load 
balancing can occur for any of the other connections in the flow, at least from 
that server.
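To check that theory, the DEBUG lines can be tallied per client thread: if all 8 configured threads only ever log the penalized-session message for the same connection, that supports the starvation idea. A minimal sketch (the sample lines are copied from above; the regex is illustrative, not a NiFi API — in practice you would read logs/nifi-app.log):

```python
import re
from collections import Counter

# Two sample DEBUG lines of the form shown above; in practice, read
# these from logs/nifi-app.log instead.
log_lines = [
    "2019-06-04 23:39:15,497 DEBUG [Load-Balanced Client Thread-2] "
    "o.a.n.c.q.c.c.a.nio.LoadBalanceSession Will not communicate with Peer "
    "prod-6.ec2.internal:8443 for Connection e1d23323-5630-1703-0000-00000481bd04 "
    "because session is penalized",
    "2019-06-04 23:39:15,502 DEBUG [Load-Balanced Client Thread-5] "
    "o.a.n.c.q.c.c.a.nio.LoadBalanceSession Will not communicate with Peer "
    "prod-7.ec2.internal:8443 for Connection e1d23323-5630-1703-0000-00000481bd04 "
    "because session is penalized",
]

pattern = re.compile(
    r"\[(Load-Balanced Client Thread-\d+)\].*Peer (\S+) for Connection (\S+)"
)

# Count penalized-session messages per client thread. If every one of the
# nifi.cluster.load.balance.max.thread.count threads shows up here for the
# same connection, no threads remain for other connections.
per_thread = Counter()
for line in log_lines:
    match = pattern.search(line)
    if match:
        thread, peer, connection = match.groups()
        per_thread[thread] += 1

for thread, count in sorted(per_thread.items()):
    print(f"{thread}: {count} penalized-session message(s)")
```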

On that hunch, I changed this connection (e1d...) to "Not load balanced" and 
bled out all the flow files on it, and the spammed log message stopped right 
away.  At the same time, several of my other load balanced queues began sending 
their flow files, as if a dam was released.

At this point, I stopped to consider why this connection would be penalized, 
and realized it was backpressured for unrelated reasons (part of our flow is 
stopped, which leads to backpressure all the way back to this queue).

Could it be that if any one load balanced connection is back-pressured, it 
could consume all of the available load balancer threads such that no other 
load balanced connection can function?

Joe

On Tue, Jun 4, 2019 at 6:00 PM Joe Gresock <[email protected]> wrote:
Thanks!  I'm taking notes for next time.  For now, a full cluster restart 
appears to have resolved this case.

On Tue, Jun 4, 2019 at 5:55 PM Mark Payne <[email protected]> wrote:

Joe,


You may want to try enabling DEBUG logging for the following classes:


org.apache.nifi.controller.queue.clustered.client.async.nio.LoadBalanceSession

org.apache.nifi.controller.queue.clustered.client.async.nio.NioAsyncLoadBalanceClient

That may provide some interesting information, especially if grepping for 
specific nodes. But I'll warn you - the logging can certainly be quite verbose.
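If it helps, NiFi's logging is configured in conf/logback.xml, and adding logger entries along these lines should enable the DEBUG output (a sketch — match the style of the existing <logger> entries in your file; by default logback rescans the file periodically, so a restart shouldn't be needed):

```xml
<!-- conf/logback.xml (sketch; place alongside the existing <logger> entries) -->
<logger name="org.apache.nifi.controller.queue.clustered.client.async.nio.LoadBalanceSession" level="DEBUG"/>
<logger name="org.apache.nifi.controller.queue.clustered.client.async.nio.NioAsyncLoadBalanceClient" level="DEBUG"/>
```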

Thanks
-Mark



On Jun 4, 2019, at 12:29 PM, Mark Payne <[email protected]> wrote:

Well, that is certainly interesting. Thanks for the update. Please do let us 
know if it occurs again.

On Jun 4, 2019, at 12:23 PM, Joe Gresock <[email protected]> wrote:

Ok, I just tried disconnecting and reconnecting each node from the cluster, in 
turn.  The first three (prod-6, -7, and -8) didn't make a difference, but when 
I reconnected prod-5, the load balanced connection started flowing again.

I'll continue to monitor it and let you know if this happens again.

Thanks for the suggestions!

On Tue, Jun 4, 2019 at 4:14 PM Joe Gresock <[email protected]> wrote:
prod-5 and -6 don't appear to be receiving any data in that queue, based on the 
status history.  Is there anything I should see in the logs to confirm this?

On Tue, Jun 4, 2019 at 4:05 PM Mark Payne <[email protected]> wrote:
Joe,

So, from the Diagnostics info, it looks like there are currently 500 
FlowFiles queued up.
They all live on prod-8.ec2.internal:8443. Of those 500, 250 are waiting to go 
to prod-5.ec2.internal:8443,
and 250 are waiting to go to prod-6.ec2.internal:8443.

So this tells us that if there are any problems, they are likely occurring on 
one of those 3 nodes. It's also not
related to swapping if it's in this state with only 500 FlowFiles queued.

Are you able to confirm that you are indeed receiving data from the load 
balanced queue on both prod-5 and prod-6?


On Jun 4, 2019, at 11:47 AM, Joe Gresock <[email protected]> wrote:

Thanks Mark.

I'm running on Linux.  I've followed your suggestion and added an 
UpdateAttribute processor to the flow, and attached the diagnostics for it.

I also don't see any errors in the logs.

On Tue, Jun 4, 2019 at 3:34 PM Mark Payne <[email protected]> wrote:
Joe,

The first thing that comes to mind would be NIFI-6285, as Bryan points out. 
However,
that only would affect you if you are running on Windows. So, the first 
question is:
what operating system are you running on? :)

If it's not Windows, I would recommend getting some diagnostics info if 
possible. To do this,
you can go to 
http://<hostname>:<port>/nifi-api/processors/<processor-id>/diagnostics. For 
example,
if you get to nifi by going to http://nifi01:8080/nifi, and you want 
diagnostics for processor with ID 1234,
then try going to http://nifi01:8080/nifi-api/processors/1234/diagnostics in 
your browser.
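That URL construction can also be scripted; a small sketch using only the Python standard library (the hostname and processor ID are placeholders, exactly as in the example above):

```python
import json
import urllib.request

def diagnostics_url(ui_url: str, processor_id: str) -> str:
    """Turn the UI address (e.g. http://nifi01:8080/nifi) into the
    diagnostics endpoint URL for the given processor ID."""
    root = ui_url.rstrip("/")
    if root.endswith("/nifi"):
        root = root[: -len("/nifi")]
    return f"{root}/nifi-api/processors/{processor_id}/diagnostics"

def fetch_diagnostics(ui_url: str, processor_id: str) -> dict:
    """Fetch and parse the diagnostics JSON. As noted below, this only
    works against an unsecured instance; a secured one needs certs."""
    with urllib.request.urlopen(diagnostics_url(ui_url, processor_id)) as resp:
        return json.load(resp)

print(diagnostics_url("http://nifi01:8080/nifi", "1234"))
# -> http://nifi01:8080/nifi-api/processors/1234/diagnostics
```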

But a couple of caveats on the 'diagnostics' approach above. It will only work 
if you are running an insecure
NiFi instance, or if you are secured using certificates. We want the 
diagnostics for the Processor that is either
the source of the connection or the destination of the connection - it doesn't 
matter which. This will give us a
lot of information about the internal structure of the connection's FlowFile 
Queue. Of course, you said that your
connection is between two Process Groups, which means that neither the source 
nor the destination is a Processor,
so I would recommend creating a dummy Processor like UpdateAttribute and 
temporarily dragging the Connection
so that it points to that Processor, just to get the diagnostic information, 
then dragging the connection back.

Of course, it would also be helpful to look for any errors in the logs. But if 
you are able to get the diagnostics info
as described above, that's usually the best bet for debugging this sort of 
thing.

Thanks
-Mark


On Jun 4, 2019, at 11:13 AM, Bryan Bende <[email protected]> wrote:

Joe,

There are two known issues that seem possibly related...

The first was already addressed in 1.9.0, but the reason I mention it
is because it was specific to a connection between two ports:

https://issues.apache.org/jira/browse/NIFI-5919

The second is not in a release yet, but is addressed in master, and
has to do with swapping:

https://issues.apache.org/jira/browse/NIFI-6285

Seems like you wouldn't hit the first one since you are on 1.9.2, but it does 
seem odd that it's the same scenario.

Mark P probably knows best about debugging, but I'm guessing a thread dump 
taken while in this state would be helpful.

-Bryan

On Tue, Jun 4, 2019 at 10:56 AM Joe Gresock <[email protected]> wrote:

I have round robin load balanced connections working on one cluster, but on 
another, this type of connection seems to be stuck.

What would be the best way to debug this problem?  The connection is from one 
processor group to another, so it's from an Output Port to an Input Port.

My configuration is as follows:
nifi.cluster.load.balance.host=
nifi.cluster.load.balance.port=6342
nifi.cluster.load.balance.connections.per.node=4
nifi.cluster.load.balance.max.thread.count=8
nifi.cluster.load.balance.comms.timeout=30 sec

And I ensured port 6342 is open from one node to another using the cluster node 
addresses.
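Such a node-to-node reachability check can be sketched with just the standard library (the hostnames in the comment are placeholders for your cluster node addresses):

```python
import socket

def port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: run on each node, checking the load-balance port on every other
# node address.
# for node in ("prod-5.ec2.internal", "prod-6.ec2.internal"):
#     print(node, port_open(node, 6342))
```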

Is there some error that should appear in the logs if flow files get stuck here?

I suspect they are actually stuck, not just missing, because the remainder of 
the flow is back-pressured up until this point in the flow.

Thanks!
Joe




--
I know what it is to be in need, and I know what it is to have plenty.  I have 
learned the secret of being content in any and every situation, whether well 
fed or hungry, whether living in plenty or in want.  I can do all this through 
him who gives me strength.    -Philippians 4:12-13
<diagnostics.json.gz>












