Joe, I did just get a PR up for this JIRA. If you are inclined to test the PR, please do and let us know how everything goes.
Thanks!
-Mark

On Jun 5, 2019, at 2:29 PM, Mark Payne <[email protected]> wrote:

Hey Joe,

Thanks for the feedback here on the logs and the analysis. I think you're very right: the connection in the second flow appears to be causing your first flow to stop transmitting. I have been able to replicate it pretty consistently and am starting to work on a fix. Hopefully I will have a PR up very shortly. If you're in a position to do so, it would be great if you have a chance to test it out. I just created a JIRA to track the issue, NIFI-6353 [1].

Thanks
-Mark

[1] https://issues.apache.org/jira/browse/NIFI-6353

On Jun 4, 2019, at 8:13 PM, Joe Gresock <[email protected]> wrote:

Ok, after a couple of hours from the above restart, all the load balanced connections stopped sending again. I enabled DEBUG on the above two classes and found the following message being spammed in the logs:

2019-06-04 23:39:15,497 DEBUG [Load-Balanced Client Thread-2] o.a.n.c.q.c.c.a.nio.LoadBalanceSession Will not communicate with Peer prod-6.ec2.internal:8443 for Connection e1d23323-5630-1703-0000-00000481bd04 because session is penalized

The same message is also spammed for prod-7 on the same connection, but I don't see any other connections in the log. Interestingly, these are the only messages I see for any of the 8 "Load-Balanced Client Thread-X" threads, which makes me wonder whether this penalized session has consumed all of the available load balance threads (nifi.cluster.load.balance.max.thread.count=8), such that no other load balancing can occur for any of the other connections in the flow, at least from that server. On that hunch, I changed this connection (e1d...) to "Not load balanced" and bled out all the flow files on it, and the spammed log message stopped right away. At the same time, several of my other load balanced queues began sending their flow files, as if a dam had been released.
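Joe's hypothesis above, that penalized sessions can tie up every thread in a fixed-size load-balancing pool and starve all other connections, can be illustrated with a small simulation. This is purely hypothetical Python, not NiFi code; the worker loop that never releases its thread models the suspected bug, and the names are made up:

```python
import queue
import threading
import time

POOL_SIZE = 2   # stand-in for nifi.cluster.load.balance.max.thread.count
tasks = queue.Queue()
sent = []       # connections that successfully transmitted
released = threading.Event()

def worker():
    while True:
        conn = tasks.get()
        if conn is None:   # shutdown sentinel
            return
        if conn["penalized"]:
            # Models the suspected bug: the thread keeps retrying the
            # penalized session instead of moving on to other work.
            while not released.is_set():
                time.sleep(0.01)
        else:
            sent.append(conn["name"])

threads = [threading.Thread(target=worker) for _ in range(POOL_SIZE)]
for t in threads:
    t.start()

# Two penalized sessions occupy both pool threads; the healthy one starves.
tasks.put({"name": "e1d-to-prod-6", "penalized": True})
tasks.put({"name": "e1d-to-prod-7", "penalized": True})
tasks.put({"name": "healthy-conn", "penalized": False})

time.sleep(0.3)
print("sent while penalized sessions hold the pool:", sent)  # []

released.set()  # analogous to switching the connection to "Not load balanced"
for _ in threads:
    tasks.put(None)
for t in threads:
    t.join()
print("sent after releasing the threads:", sent)  # ['healthy-conn']
```

The "dam released" effect Joe describes matches the second print: as soon as the penalized work stops holding the pool, the queued healthy connection transmits.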
At this point, I stopped to consider why this connection would be penalized, and realized it was backpressured for unrelated reasons (part of our flow is stopped, which leads to backpressure all the way back to this queue). Could it be that if any one load balanced connection is back-pressured, it consumes all of the available load balancer threads, such that no other load balanced connection can function?

Joe

On Tue, Jun 4, 2019 at 6:00 PM Joe Gresock <[email protected]> wrote:

Thanks! I'm taking notes for next time. For now, a full cluster restart appears to have resolved this case.

On Tue, Jun 4, 2019 at 5:55 PM Mark Payne <[email protected]> wrote:

Joe,

You may want to try enabling DEBUG logging for the following classes:

org.apache.nifi.controller.queue.clustered.client.async.nio.LoadBalanceSession
org.apache.nifi.controller.queue.clustered.client.async.nio.NioAsyncLoadBalanceClient

That may provide some interesting information, especially if grepping for specific nodes. But I'll warn you: the logging can certainly be quite verbose.

Thanks
-Mark

On Jun 4, 2019, at 12:29 PM, Mark Payne <[email protected]> wrote:

Well, that is certainly interesting. Thanks for the update. Please do let us know if it occurs again.

On Jun 4, 2019, at 12:23 PM, Joe Gresock <[email protected]> wrote:

Ok, I just tried disconnecting each node from the cluster, in turn. The first three (prod-6, -7, and -8) didn't make a difference, but when I reconnected prod-5, the load balanced connection started flowing again. I'll continue to monitor it and let you know if this happens again. Thanks for the suggestions!

On Tue, Jun 4, 2019 at 4:14 PM Joe Gresock <[email protected]> wrote:

prod-5 and -6 don't appear to be receiving any data in that queue, based on the status history. Is there anything I should see in the logs to confirm this?
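For reference, the DEBUG loggers Mark suggests above are typically enabled by adding logger entries to NiFi's conf/logback.xml. This is a sketch; verify the element placement against your version's file. NiFi's default logback.xml scans for changes periodically, so a restart is usually not required:

```xml
<!-- Sketch: add inside the <configuration> element of conf/logback.xml -->
<logger name="org.apache.nifi.controller.queue.clustered.client.async.nio.LoadBalanceSession" level="DEBUG"/>
<logger name="org.apache.nifi.controller.queue.clustered.client.async.nio.NioAsyncLoadBalanceClient" level="DEBUG"/>
```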
On Tue, Jun 4, 2019 at 4:05 PM Mark Payne <[email protected]> wrote:

Joe,

From the diagnostics info, it looks like there are currently 500 FlowFiles queued up. They all live on prod-8.ec2.internal:8443. Of those 500, 250 are waiting to go to prod-5.ec2.internal:8443 and 250 are waiting to go to prod-6.ec2.internal:8443. This tells us that if there are any problems, they are likely occurring on one of those 3 nodes. It's also not related to swapping if it's in this state with only 500 FlowFiles queued. Are you able to confirm that you are indeed receiving data from the load balanced queue on both prod-5 and prod-6?

On Jun 4, 2019, at 11:47 AM, Joe Gresock <[email protected]> wrote:

Thanks, Mark. I'm running on Linux. I've followed your suggestion, added an UpdateAttribute processor to the flow, and attached the diagnostics for it. I also don't see any errors in the logs.

On Tue, Jun 4, 2019 at 3:34 PM Mark Payne <[email protected]> wrote:

Joe,

The first thing that comes to mind is NIFI-6285, as Bryan points out. However, that would only affect you if you are running on Windows. So the first question is: what operating system are you running on? :)

If it's not Windows, I would recommend getting some diagnostics info if possible. To do this, you can go to http://<hostname>:<port>/nifi-api/processors/<processor-id>/diagnostics. For example, if you get to NiFi by going to http://nifi01:8080/nifi, and you want diagnostics for the processor with ID 1234, then try going to http://nifi01:8080/nifi-api/processors/1234/diagnostics in your browser.

A couple of caveats on the 'diagnostics' approach above: it will only work if you are running an insecure NiFi instance, or if you are secured using certificates. We want the diagnostics for the Processor that is either the source of the connection or the destination of the connection - it doesn't matter which.
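The URL pattern Mark describes is mechanical enough to script when you need diagnostics for several processors. A minimal sketch, using the example host and processor ID from his message:

```python
def diagnostics_url(base_url: str, processor_id: str) -> str:
    """Build the per-processor diagnostics REST URL from the NiFi API base URL."""
    return f"{base_url.rstrip('/')}/nifi-api/processors/{processor_id}/diagnostics"

# Example values from Mark's message above.
print(diagnostics_url("http://nifi01:8080", "1234"))
# http://nifi01:8080/nifi-api/processors/1234/diagnostics
```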
This will give us a lot of information about the internal structure of the connection's FlowFile Queue. Of course, you said that your connection is between two Process Groups, which means that neither the source nor the destination is a Processor. So I would recommend creating a dummy Processor like UpdateAttribute and temporarily dragging the Connection so that it points to that Processor, just to get the diagnostic information, then dragging the connection back.

Of course, it would also be helpful to look for any errors in the logs. But if you are able to get the diagnostics info as described above, that's usually the best bet for debugging this sort of thing.

Thanks
-Mark

On Jun 4, 2019, at 11:13 AM, Bryan Bende <[email protected]> wrote:

Joe,

There are two known issues that possibly seem related. The first was already addressed in 1.9.0, but the reason I mention it is because it was specific to a connection between two ports: https://issues.apache.org/jira/browse/NIFI-5919

The second is not in a release yet, but is addressed in master, and has to do with swapping: https://issues.apache.org/jira/browse/NIFI-6285

It seems like you wouldn't hit the first one since you are on 1.9.2, but it does seem odd that it's the same scenario. Mark P probably knows best about debugging, but I'm guessing a thread dump while in this state would be helpful.

-Bryan

On Tue, Jun 4, 2019 at 10:56 AM Joe Gresock <[email protected]> wrote:

I have round robin load balanced connections working on one cluster, but on another, this type of connection seems to be stuck. What would be the best way to debug this problem? The connection is from one process group to another, so it's from an Output Port to an Input Port.
My configuration is as follows:

nifi.cluster.load.balance.host=
nifi.cluster.load.balance.port=6342
nifi.cluster.load.balance.connections.per.node=4
nifi.cluster.load.balance.max.thread.count=8
nifi.cluster.load.balance.comms.timeout=30 sec

And I ensured port 6342 is open from one node to another using the cluster node addresses. Is there some error that should appear in the logs if flow files get stuck here? I suspect they are actually stuck, not just missing, because the remainder of the flow is back-pressured up until this point in the flow.

Thanks!
Joe

--
I know what it is to be in need, and I know what it is to have plenty. I have learned the secret of being content in any and every situation, whether well fed or hungry, whether living in plenty or in want. I can do all this through him who gives me strength. -Philippians 4:12-13

<diagnostics.json.gz>
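Joe's check that port 6342 is reachable from node to node can be scripted. A small sketch; the helper is hypothetical, not part of NiFi, and simply attempts a TCP connection (the demo uses a local listener so it runs without a cluster):

```python
import socket

def port_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Demo against a local listener on an ephemeral port (no cluster needed).
listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
demo_port = listener.getsockname()[1]
print(port_reachable("127.0.0.1", demo_port))  # True
listener.close()
```

Run with each cluster node's address and the configured nifi.cluster.load.balance.port to confirm connectivity in both directions.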
