Ok, I just tried disconnecting each node from the cluster in turn. The first
three (prod-6, -7, and -8) didn't make a difference, but when I reconnected
prod-5, the load balanced connection started flowing again.
I'll continue to monitor it and let you know if this happens again. Thanks
for the suggestions!

On Tue, Jun 4, 2019 at 4:14 PM Joe Gresock <[email protected]> wrote:

> prod-5 and -6 don't appear to be receiving any data in that queue, based
> on the status history. Is there anything I should see in the logs to
> confirm this?
>
> On Tue, Jun 4, 2019 at 4:05 PM Mark Payne <[email protected]> wrote:
>
>> Joe,
>>
>> So it looks, from the Diagnostics info, like there are currently 500
>> FlowFiles queued up. They all live on prod-8.ec2.internal:8443. Of those
>> 500, 250 are waiting to go to prod-5.ec2.internal:8443, and 250 are
>> waiting to go to prod-6.ec2.internal:8443.
>>
>> So this tells us that if there are any problems, they are likely
>> occurring on one of those 3 nodes. It's also not related to swapping if
>> it's in this state with only 500 FlowFiles queued.
>>
>> Are you able to confirm that you are indeed receiving data from the load
>> balanced queue on both prod-5 and prod-6?
>>
>> On Jun 4, 2019, at 11:47 AM, Joe Gresock <[email protected]> wrote:
>>
>> Thanks Mark.
>>
>> I'm running on Linux. I've followed your suggestion and added an
>> UpdateAttribute processor to the flow, and attached the diagnostics
>> for it.
>>
>> I also don't see any errors in the logs.
>>
>> On Tue, Jun 4, 2019 at 3:34 PM Mark Payne <[email protected]> wrote:
>>
>>> Joe,
>>>
>>> The first thing that comes to mind would be NIFI-6285, as Bryan points
>>> out. However, that would only affect you if you are running on Windows.
>>> So, the first question is: what operating system are you running on? :)
>>>
>>> If it's not Windows, I would recommend getting some diagnostics info if
>>> possible. To do this, you can go to
>>> http://<hostname>:<port>/nifi-api/processors/<processor-id>/diagnostics.
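A command-line equivalent of the diagnostics endpoint above, sketched with
placeholder values (an unsecured instance is assumed; a secured one would
need client-certificate options such as curl's --cert and --key):

```shell
# Fetch the diagnostics for a processor and pretty-print the JSON.
# "nifi01:8080" and "1234" are placeholders -- substitute your own
# hostname, port, and processor ID.
curl -s "http://nifi01:8080/nifi-api/processors/1234/diagnostics" \
  | python3 -m json.tool
```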
>>> For example, if you get to NiFi by going to http://nifi01:8080/nifi,
>>> and you want diagnostics for the processor with ID 1234, then try going
>>> to http://nifi01:8080/nifi-api/processors/1234/diagnostics in your
>>> browser.
>>>
>>> But a couple of caveats on the 'diagnostics' approach above. It will
>>> only work if you are running an insecure NiFi instance, or if you are
>>> secured using certificates. We want the diagnostics for the Processor
>>> that is either the source of the connection or the destination of the
>>> connection - it doesn't matter which. This will give us a lot of
>>> information about the internal structure of the connection's FlowFile
>>> Queue. Of course, you said that your connection is between two Process
>>> Groups, which means that neither the source nor the destination is a
>>> Processor, so I would recommend creating a dummy Processor like
>>> UpdateAttribute and temporarily dragging the Connection so that it
>>> points to that Processor, just to get the diagnostic information, then
>>> dragging the connection back.
>>>
>>> Of course, it would also be helpful to look for any errors in the logs.
>>> But if you are able to get the diagnostics info as described above,
>>> that's usually the best bet for debugging this sort of thing.
>>>
>>> Thanks
>>> -Mark
>>>
>>> On Jun 4, 2019, at 11:13 AM, Bryan Bende <[email protected]> wrote:
>>>
>>> Joe,
>>>
>>> There are two known issues that possibly seem related...
>>>
>>> The first was already addressed in 1.9.0, but the reason I mention it
>>> is that it was specific to a connection between two ports:
>>>
>>> https://issues.apache.org/jira/browse/NIFI-5919
>>>
>>> The second is not in a release yet, but it is addressed in master and
>>> has to do with swapping:
>>>
>>> https://issues.apache.org/jira/browse/NIFI-6285
>>>
>>> It seems like you wouldn't hit the first one since you are on 1.9.2,
>>> but it does seem odd that it's the same scenario.
>>>
>>> Mark P probably knows best about debugging, but I'm guessing a thread
>>> dump taken while the cluster is in this state would possibly be helpful.
>>>
>>> -Bryan
>>>
>>> On Tue, Jun 4, 2019 at 10:56 AM Joe Gresock <[email protected]> wrote:
>>>
>>> I have round robin load balanced connections working on one cluster,
>>> but on another, this type of connection seems to be stuck.
>>>
>>> What would be the best way to debug this problem? The connection is
>>> from one process group to another, so it's from an Output Port to an
>>> Input Port.
>>>
>>> My configuration is as follows:
>>> nifi.cluster.load.balance.host=
>>> nifi.cluster.load.balance.port=6342
>>> nifi.cluster.load.balance.connections.per.node=4
>>> nifi.cluster.load.balance.max.thread.count=8
>>> nifi.cluster.load.balance.comms.timeout=30 sec
>>>
>>> And I ensured port 6342 is open from one node to another, using the
>>> cluster node addresses.
>>>
>>> Is there some error that should appear in the logs if flow files get
>>> stuck here?
>>>
>>> I suspect they are actually stuck, not just missing, because the
>>> remainder of the flow is back-pressured up until this point in the flow.
>>>
>>> Thanks!
>>> Joe
>>
>> --
>> I know what it is to be in need, and I know what it is to have plenty.
>> I have learned the secret of being content in any and every situation,
>> whether well fed or hungry, whether living in plenty or in want. I can
>> do all this through him who gives me strength. *-Philippians 4:12-13*
>> <diagnostics.json.gz>
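Two of the suggestions in the thread lend themselves to quick shell checks:
verifying that the load-balance port is actually reachable between nodes,
and capturing a thread dump while the queue is stuck. A sketch, assuming
the hostnames from the thread and a NiFi install under /opt/nifi (both are
placeholders):

```shell
# 1. Check that nifi.cluster.load.balance.port (6342 here) is reachable
#    from this node to each peer. Hostnames are placeholders taken from
#    the thread; adjust for your own cluster.
for host in prod-5.ec2.internal prod-6.ec2.internal prod-8.ec2.internal; do
  nc -zv -w 5 "$host" 6342
done

# 2. Capture a thread dump of the running NiFi JVM while the connection
#    is stuck; nifi.sh dump writes it to the given file. The install
#    path is an assumption.
/opt/nifi/bin/nifi.sh dump /tmp/nifi-thread-dump.txt
```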
