Thanks! I'm taking notes for next time. For now, a full cluster restart appears to have resolved this case.
On Tue, Jun 4, 2019 at 5:55 PM Mark Payne <[email protected]> wrote:

> Joe,
>
> You may want to try enabling DEBUG logging for the following classes:
>
> org.apache.nifi.controller.queue.clustered.client.async.nio.LoadBalanceSession
> org.apache.nifi.controller.queue.clustered.client.async.nio.NioAsyncLoadBalanceClient
>
> That may provide some interesting information, especially if grepping for
> specific nodes. But I'll warn you - the logging can certainly be quite
> verbose.
>
> Thanks
> -Mark
>
> On Jun 4, 2019, at 12:29 PM, Mark Payne <[email protected]> wrote:
>
> Well, that is certainly interesting. Thanks for the update. Please do let
> us know if it occurs again.
>
> On Jun 4, 2019, at 12:23 PM, Joe Gresock <[email protected]> wrote:
>
> Ok.. I just tried disconnecting each node from the cluster, in turn. The
> first three (prod-6, -7, and -8) didn't make a difference, but when I
> reconnected prod-5, the load-balanced connection started flowing again.
>
> I'll continue to monitor it and let you know if this happens again.
>
> Thanks for the suggestions!
>
> On Tue, Jun 4, 2019 at 4:14 PM Joe Gresock <[email protected]> wrote:
>
>> prod-5 and -6 don't appear to be receiving any data in that queue, based
>> on the status history. Is there anything I should see in the logs to
>> confirm this?
>>
>> On Tue, Jun 4, 2019 at 4:05 PM Mark Payne <[email protected]> wrote:
>>
>>> Joe,
>>>
>>> So it looks from the Diagnostics info like there are currently 500
>>> FlowFiles queued up. They all live on prod-8.ec2.internal:8443. Of
>>> those 500, 250 are waiting to go to prod-5.ec2.internal:8443, and 250
>>> are waiting to go to prod-6.ec2.internal:8443.
>>>
>>> So this tells us that if there are any problems, they are likely
>>> occurring on one of those 3 nodes. It's also not related to swapping
>>> if it's in this state with only 500 FlowFiles queued.
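For anyone wanting to flip on the DEBUG logging Mark suggests, it amounts to two logger entries in NiFi's conf/logback.xml. A sketch (the class names are the ones listed above; NiFi's default logback configuration rescans the file periodically, so a restart is usually not needed for this to take effect):

```xml
<!-- conf/logback.xml: add inside the root <configuration> element -->
<logger name="org.apache.nifi.controller.queue.clustered.client.async.nio.LoadBalanceSession" level="DEBUG"/>
<logger name="org.apache.nifi.controller.queue.clustered.client.async.nio.NioAsyncLoadBalanceClient" level="DEBUG"/>
```

The output goes to logs/nifi-app.log by default, so something like `grep prod-5 logs/nifi-app.log` helps narrow the (very verbose) output down to one node's sessions.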
>>> Are you able to confirm that you are indeed receiving data from the
>>> load balanced queue on both prod-5 and prod-6?
>>>
>>> On Jun 4, 2019, at 11:47 AM, Joe Gresock <[email protected]> wrote:
>>>
>>> Thanks Mark.
>>>
>>> I'm running on Linux. I've followed your suggestion and added an
>>> UpdateAttribute processor to the flow, and attached the diagnostics
>>> for it.
>>>
>>> I also don't see any errors in the logs.
>>>
>>> On Tue, Jun 4, 2019 at 3:34 PM Mark Payne <[email protected]> wrote:
>>>
>>>> Joe,
>>>>
>>>> The first thing that comes to mind would be NIFI-6285, as Bryan points
>>>> out. However, that would only affect you if you are running on
>>>> Windows. So the first question is: what operating system are you
>>>> running on? :)
>>>>
>>>> If it's not Windows, I would recommend getting some diagnostics info
>>>> if possible. To do this, you can go to
>>>> http://<hostname>:<port>/nifi-api/processors/<processor-id>/diagnostics.
>>>> For example, if you get to NiFi by going to http://nifi01:8080/nifi,
>>>> and you want diagnostics for the processor with ID 1234, then try
>>>> going to http://nifi01:8080/nifi-api/processors/1234/diagnostics in
>>>> your browser.
>>>>
>>>> But a couple of caveats on the 'diagnostics' approach above. It will
>>>> only work if you are running an insecure NiFi instance, or if you are
>>>> secured using certificates. We want the diagnostics for the Processor
>>>> that is either the source of the connection or the destination of the
>>>> connection - it doesn't matter which. This will give us a lot of
>>>> information about the internal structure of the connection's FlowFile
>>>> Queue.
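The URL pattern Mark gives can be captured in a tiny helper, which is handy if you need diagnostics for several processors (the function name is hypothetical; a sketch, assuming the same base URL you use to reach the NiFi UI):

```python
def diagnostics_url(base_url: str, processor_id: str) -> str:
    """Build the processor diagnostics endpoint from the NiFi base URL.

    base_url is the scheme://host:port used to reach the UI, e.g.
    "http://nifi01:8080"; processor_id is the processor's ID.
    """
    return f"{base_url}/nifi-api/processors/{processor_id}/diagnostics"

# Matches the example in the message above:
print(diagnostics_url("http://nifi01:8080", "1234"))
# -> http://nifi01:8080/nifi-api/processors/1234/diagnostics
```

On an unsecured instance the resulting URL is a plain GET (browser or curl); on a certificate-secured instance the same request needs your client certificate.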
>>>> Of course, you said that your connection is between two Process
>>>> Groups, which means that neither the source nor the destination is a
>>>> Processor, so I would recommend creating a dummy Processor like
>>>> UpdateAttribute and temporarily dragging the Connection so that it
>>>> points to that Processor, just to get the diagnostic information,
>>>> then dragging the connection back.
>>>>
>>>> Of course, it would also be helpful to look for any errors in the
>>>> logs. But if you are able to get the diagnostics info as described
>>>> above, that's usually the best bet for debugging this sort of thing.
>>>>
>>>> Thanks
>>>> -Mark
>>>>
>>>> On Jun 4, 2019, at 11:13 AM, Bryan Bende <[email protected]> wrote:
>>>>
>>>> Joe,
>>>>
>>>> There are two known issues that possibly seem related...
>>>>
>>>> The first was already addressed in 1.9.0, but the reason I mention it
>>>> is because it was specific to a connection between two ports:
>>>>
>>>> https://issues.apache.org/jira/browse/NIFI-5919
>>>>
>>>> The second is not in a release yet, but is addressed in master, and
>>>> has to do with swapping:
>>>>
>>>> https://issues.apache.org/jira/browse/NIFI-6285
>>>>
>>>> Seems like you wouldn't hit the first one since you are on 1.9.2, but
>>>> it does seem odd that it's the same scenario.
>>>>
>>>> Mark P probably knows best about debugging, but I'm guessing a thread
>>>> dump taken while in this state would possibly be helpful.
>>>>
>>>> -Bryan
>>>>
>>>> On Tue, Jun 4, 2019 at 10:56 AM Joe Gresock <[email protected]> wrote:
>>>>
>>>> I have round robin load balanced connections working on one cluster,
>>>> but on another, this type of connection seems to be stuck.
>>>>
>>>> What would be the best way to debug this problem? The connection is
>>>> from one Process Group to another, so it's from an Output Port to an
>>>> Input Port.
>>>> My configuration is as follows:
>>>>
>>>> nifi.cluster.load.balance.host=
>>>> nifi.cluster.load.balance.port=6342
>>>> nifi.cluster.load.balance.connections.per.node=4
>>>> nifi.cluster.load.balance.max.thread.count=8
>>>> nifi.cluster.load.balance.comms.timeout=30 sec
>>>>
>>>> And I ensured port 6342 is open from one node to another using the
>>>> cluster node addresses.
>>>>
>>>> Is there some error that should appear in the logs if flow files get
>>>> stuck here?
>>>>
>>>> I suspect they are actually stuck, not just missing, because the
>>>> remainder of the flow is back-pressured up until this point in the
>>>> flow.
>>>>
>>>> Thanks!
>>>> Joe
>>>
>>> --
>>> I know what it is to be in need, and I know what it is to have plenty.
>>> I have learned the secret of being content in any and every situation,
>>> whether well fed or hungry, whether living in plenty or in want. I can
>>> do all this through him who gives me strength. *-Philippians 4:12-13*
>>> <diagnostics.json.gz>
