Great, Mark. Not sure if this is related, but I'm seeing another error now:
Failed to receive FlowFiles for Load Balancing due to java.io.IOException:
Expected a Data Frame Indicator from Peer X but received a value of 145.

On Wed, Jun 5, 2019 at 6:29 PM Mark Payne <[email protected]> wrote:

> Hey Joe,
>
> Thanks for the feedback here on the logs and the analysis. I think you're
> very right - the connection in the second flow appears to be causing your
> first flow to stop transmitting. I have been able to replicate it pretty
> consistently and am starting to work on a fix. Hopefully I will have a PR
> up very shortly. If you're in a position to do so, it would be great if
> you have a chance to test it out. I just created a JIRA to track the
> issue, NIFI-6353 [1].
>
> Thanks
> -Mark
>
> [1] https://issues.apache.org/jira/browse/NIFI-6353
>
> On Jun 4, 2019, at 8:13 PM, Joe Gresock <[email protected]> wrote:
>
> Ok, after a couple of hours from the above restart, all the load balanced
> connections stopped sending again.
>
> I enabled DEBUG on the above 2 classes, and found the following messages
> being spammed in the logs:
>
> 2019-06-04 23:39:15,497 DEBUG [Load-Balanced Client Thread-2]
> o.a.n.c.q.c.c.a.nio.LoadBalanceSession Will not communicate with Peer
> prod-6.ec2.internal:8443 for Connection
> e1d23323-5630-1703-0000-00000481bd04 because session is penalized
>
> The same message is also spammed for prod-7 on the same connection, but I
> don't see any other connections in the log.
>
> Now, interestingly, these are the only messages I see for any of the 8
> "Load-Balanced Client Thread-X" threads, so this makes me wonder whether
> this penalized session has consumed all of the available load balance
> threads (nifi.cluster.load.balance.max.thread.count=8), such that no
> other load balancing can occur for any of the other connections in the
> flow, at least from that server.
>
> On that hunch, I changed this connection (e1d...)
> to "Not load balanced" and bled out all the flow files on it, and the
> spammed log message stopped right away. At the same time, several of my
> other load balanced queues began sending their flow files, as if a dam
> had been released.
>
> At this point, I stopped to consider why this connection would be
> penalized, and realized it was back-pressured for unrelated reasons
> (part of our flow is stopped, which leads to backpressure all the way
> back to this queue).
>
> Could it be that if any one load balanced connection is back-pressured,
> it could consume all of the available load balancer threads, such that
> no other load balanced connection can function?
>
> Joe
>
> On Tue, Jun 4, 2019 at 6:00 PM Joe Gresock <[email protected]> wrote:
>
>> Thanks! I'm taking notes for next time. For now, a full cluster restart
>> appears to have resolved this case.
>>
>> On Tue, Jun 4, 2019 at 5:55 PM Mark Payne <[email protected]> wrote:
>>
>>> Joe,
>>>
>>> You may want to try enabling DEBUG logging for the following classes:
>>>
>>> org.apache.nifi.controller.queue.clustered.client.async.nio.LoadBalanceSession
>>> org.apache.nifi.controller.queue.clustered.client.async.nio.NioAsyncLoadBalanceClient
>>>
>>> That may provide some interesting information, especially if grepping
>>> for specific nodes. But I'll warn you - the logging can certainly be
>>> quite verbose.
>>>
>>> Thanks
>>> -Mark
>>>
>>> On Jun 4, 2019, at 12:29 PM, Mark Payne <[email protected]> wrote:
>>>
>>> Well, that is certainly interesting. Thanks for the update. Please do
>>> let us know if it occurs again.
>>>
>>> On Jun 4, 2019, at 12:23 PM, Joe Gresock <[email protected]> wrote:
>>>
>>> Ok.. I just tried disconnecting each node from the cluster, in turn.
>>> The first three (prod-6, -7, and -8) didn't make a difference, but
>>> when I reconnected prod-5, the load balanced connection started
>>> flowing again.
>>>
>>> I'll continue to monitor it and let you know if this happens again.
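[Editor's note] Joe's starvation hypothesis above can be illustrated with a toy scheduler. This is an editorial sketch, not NiFi code: the `schedule` function, the task names, and the retry-at-the-front behavior are all hypothetical, chosen only to show how a fixed pool of worker slots (like the 8 load-balance client threads) can be monopolized by sessions that never make progress.

```python
from collections import deque

def schedule(tasks, thread_count, rounds):
    """Toy model of a fixed-size load-balance thread pool.

    Each round, up to `thread_count` tasks are taken from the queue.
    A healthy task completes; a penalized task makes no progress and is
    put back at the FRONT of the queue, modeling a session that is
    retried before anything else gets a turn.
    """
    queue = deque(tasks)
    completed = []
    for _ in range(rounds):
        running = [queue.popleft() for _ in range(min(thread_count, len(queue)))]
        for name, penalized in running:
            if penalized:
                queue.appendleft((name, penalized))  # retried immediately
            else:
                completed.append(name)
    return completed

# 8 sessions for the penalized connection (matching
# nifi.cluster.load.balance.max.thread.count=8), plus 4 healthy sessions.
tasks = [(f"penalized-{i}", True) for i in range(8)] + \
        [(f"healthy-{i}", False) for i in range(4)]

print(schedule(tasks, thread_count=8, rounds=100))  # -> [] : healthy tasks starve
```

With even one free slot (7 penalized tasks instead of 8) the healthy tasks drain, which lines up with Joe's observation that the other queues recovered the moment the penalized connection was set to "Not load balanced".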
>>>
>>> Thanks for the suggestions!
>>>
>>> On Tue, Jun 4, 2019 at 4:14 PM Joe Gresock <[email protected]> wrote:
>>>
>>>> prod-5 and -6 don't appear to be receiving any data in that queue,
>>>> based on the status history. Is there anything I should see in the
>>>> logs to confirm this?
>>>>
>>>> On Tue, Jun 4, 2019 at 4:05 PM Mark Payne <[email protected]> wrote:
>>>>
>>>>> Joe,
>>>>>
>>>>> So it looks, from the diagnostics info, like there are currently 500
>>>>> FlowFiles queued up. They all live on prod-8.ec2.internal:8443. Of
>>>>> those 500, 250 are waiting to go to prod-5.ec2.internal:8443, and
>>>>> 250 are waiting to go to prod-6.ec2.internal:8443.
>>>>>
>>>>> So this tells us that if there are any problems, they are likely
>>>>> occurring on one of those 3 nodes. It's also not related to swapping
>>>>> if it's in this state with only 500 FlowFiles queued.
>>>>>
>>>>> Are you able to confirm that you are indeed receiving data from the
>>>>> load balanced queue on both prod-5 and prod-6?
>>>>>
>>>>> On Jun 4, 2019, at 11:47 AM, Joe Gresock <[email protected]> wrote:
>>>>>
>>>>> Thanks Mark.
>>>>>
>>>>> I'm running on Linux. I've followed your suggestion and added an
>>>>> UpdateAttribute processor to the flow, and attached the diagnostics
>>>>> for it.
>>>>>
>>>>> I also don't see any errors in the logs.
>>>>>
>>>>> On Tue, Jun 4, 2019 at 3:34 PM Mark Payne <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Joe,
>>>>>>
>>>>>> The first thing that comes to mind would be NIFI-6285, as Bryan
>>>>>> points out. However, that would only affect you if you are running
>>>>>> on Windows. So, the first question is: what operating system are
>>>>>> you running on? :)
>>>>>>
>>>>>> If it's not Windows, I would recommend getting some diagnostics
>>>>>> info if possible. To do this, you can go to
>>>>>> http://<hostname>:<port>/nifi-api/processors/<processor-id>/diagnostics.
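[Editor's note] The URL template Mark gives can be captured in a small helper. A sketch only: `diagnostics_url` is a hypothetical convenience function, not part of NiFi or any NiFi client library; the REST path itself is the one Mark describes.

```python
def diagnostics_url(host, port, processor_id, scheme="http"):
    """Build the NiFi processor-diagnostics REST URL from the template
    http://<hostname>:<port>/nifi-api/processors/<processor-id>/diagnostics."""
    return f"{scheme}://{host}:{port}/nifi-api/processors/{processor_id}/diagnostics"

# Mark's example below: UI at http://nifi01:8080/nifi, processor ID 1234.
print(diagnostics_url("nifi01", 8080, "1234"))
# -> http://nifi01:8080/nifi-api/processors/1234/diagnostics
```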
>>>>>> For example, if you get to nifi by going to
>>>>>> http://nifi01:8080/nifi, and you want diagnostics for the processor
>>>>>> with ID 1234, then try going to
>>>>>> http://nifi01:8080/nifi-api/processors/1234/diagnostics in your
>>>>>> browser.
>>>>>>
>>>>>> But a couple of caveats on the 'diagnostics' approach above. It
>>>>>> will only work if you are running an insecure NiFi instance, or if
>>>>>> you are secured using certificates. We want the diagnostics for the
>>>>>> Processor that is either the source of the connection or the
>>>>>> destination of the connection - it doesn't matter which. This will
>>>>>> give us a lot of information about the internal structure of the
>>>>>> connection's FlowFile Queue. Of course, you said that your
>>>>>> connection is between two Process Groups, which means that neither
>>>>>> the source nor the destination is a Processor, so I would recommend
>>>>>> creating a dummy Processor like UpdateAttribute and temporarily
>>>>>> dragging the Connection so that it points to that Processor, just
>>>>>> to get the diagnostic information, then dragging the connection
>>>>>> back.
>>>>>>
>>>>>> Of course, it would also be helpful to look for any errors in the
>>>>>> logs. But if you are able to get the diagnostics info as described
>>>>>> above, that's usually the best bet for debugging this sort of
>>>>>> thing.
>>>>>>
>>>>>> Thanks
>>>>>> -Mark
>>>>>>
>>>>>> On Jun 4, 2019, at 11:13 AM, Bryan Bende <[email protected]> wrote:
>>>>>>
>>>>>> Joe,
>>>>>>
>>>>>> There are two known issues that possibly seem related...
>>>>>>
>>>>>> The first was already addressed in 1.9.0, but the reason I mention
>>>>>> it is because it was specific to a connection between two ports:
>>>>>>
>>>>>> https://issues.apache.org/jira/browse/NIFI-5919
>>>>>>
>>>>>> The second is not in a release yet, but is addressed in master, and
>>>>>> has to do with swapping:
>>>>>>
>>>>>> https://issues.apache.org/jira/browse/NIFI-6285
>>>>>>
>>>>>> It seems like you wouldn't hit the first one since you are on
>>>>>> 1.9.2, but it does seem odd that it is the same scenario.
>>>>>>
>>>>>> Mark P probably knows best about debugging, but I'm guessing a
>>>>>> thread dump while in this state would possibly be helpful.
>>>>>>
>>>>>> -Bryan
>>>>>>
>>>>>> On Tue, Jun 4, 2019 at 10:56 AM Joe Gresock <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>> I have round robin load balanced connections working on one
>>>>>> cluster, but on another, this type of connection seems to be stuck.
>>>>>>
>>>>>> What would be the best way to debug this problem? The connection is
>>>>>> from one process group to another, so it's from an Output Port to
>>>>>> an Input Port.
>>>>>>
>>>>>> My configuration is as follows:
>>>>>> nifi.cluster.load.balance.host=
>>>>>> nifi.cluster.load.balance.port=6342
>>>>>> nifi.cluster.load.balance.connections.per.node=4
>>>>>> nifi.cluster.load.balance.max.thread.count=8
>>>>>> nifi.cluster.load.balance.comms.timeout=30 sec
>>>>>>
>>>>>> And I ensured port 6342 is open from one node to another using the
>>>>>> cluster node addresses.
>>>>>>
>>>>>> Is there some error that should appear in the logs if flow files
>>>>>> get stuck here?
>>>>>>
>>>>>> I suspect they are actually stuck, not just missing, because the
>>>>>> remainder of the flow is back-pressured up until this point in the
>>>>>> flow.
>>>>>>
>>>>>> Thanks!
>>>>>> Joe
>>>>>
>>>>> --
>>>>> I know what it is to be in need, and I know what it is to have plenty.
>>>>> I have learned the secret of being content in any and every
>>>>> situation, whether well fed or hungry, whether living in plenty or
>>>>> in want. I can do all this through him who gives me strength.
>>>>> *-Philippians 4:12-13*
>>>>> <diagnostics.json.gz>
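
[Editor's note] In his original question, Joe mentions verifying that the load-balance port (6342 in his nifi.properties) is open from one node to another. A quick way to check this from each node is a plain TCP connect, equivalent to `nc -z <host> 6342`. The snippet below is a generic sketch; `port_open` is a hypothetical helper and the commented host names are placeholders for the cluster node addresses.

```python
import socket

def port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, or unresolvable
        return False

# Example (placeholder hosts; run from each node against every other node):
# for host in ("prod-5.ec2.internal", "prod-6.ec2.internal"):
#     print(host, port_open(host, 6342))
```

A successful connect only proves the port is reachable; it does not exercise the load-balance protocol itself, so errors like the "Data Frame Indicator" IOException above can still occur once real traffic flows.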
