Mark, looks great! No problems since yesterday. I'd say you're good to commit.
On Tue, Jun 11, 2019 at 7:15 PM Mark Payne <[email protected]> wrote:

> Thanks Joe! If all looks good then we can hopefully get this merged in quickly.
>
> On Jun 11, 2019, at 2:30 PM, Joe Gresock <[email protected]> wrote:
>
> I deployed the patch, Mark. So far, so good. I configured the flow as I had before, and haven't encountered this state yet. The load balancing logs look much better, and I don't see any spamming of the "will not communicate with..." message.
>
> I'll let it run for another day and report back.
>
> On Mon, Jun 10, 2019 at 8:08 PM Mark Payne <[email protected]> wrote:
>
>> Joe,
>>
>> I did just get a PR up for this JIRA. If you are inclined to test the PR, please do and let us know how everything goes.
>>
>> Thanks!
>> -Mark
>>
>> On Jun 5, 2019, at 2:29 PM, Mark Payne <[email protected]> wrote:
>>
>> Hey Joe,
>>
>> Thanks for the feedback here on the logs and the analysis. I think you're very right - the connection in the second flow appears to be causing your first flow to stop transmitting. I have been able to replicate it pretty consistently and am starting to work on a fix. Hopefully I will have a PR up very shortly. If you're in a position to do so, it would be great if you have a chance to test it out. I just created a JIRA to track the issue, NIFI-6353 [1].
>>
>> Thanks
>> -Mark
>>
>> [1] https://issues.apache.org/jira/browse/NIFI-6353
>>
>> On Jun 4, 2019, at 8:13 PM, Joe Gresock <[email protected]> wrote:
>>
>> Ok, after a couple hours from the above restart, all the load balanced connections stopped sending again.
>> I enabled DEBUG on the above 2 classes, and found the following messages being spammed in the logs:
>>
>> 2019-06-04 23:39:15,497 DEBUG [Load-Balanced Client Thread-2] o.a.n.c.q.c.c.a.nio.LoadBalanceSession Will not communicate with Peer prod-6.ec2.internal:8443 for Connection e1d23323-5630-1703-0000-00000481bd04 because session is penalized
>>
>> The same message is also spammed for prod-7 on the same connection, but I don't see any other connections in the log.
>>
>> Now, interestingly, these are the only messages I see for any of the 8 "Load-Balanced Client Thread-X" threads, so this makes me wonder if this penalized session has consumed all of the available load balance threads (nifi.cluster.load.balance.max.thread.count=8), such that no other load balancing can occur for any of the other connections in the flow, at least from that server.
>>
>> On that hunch, I changed this connection (e1d...) to "Not load balanced" and bled out all the flow files on it, and the spammed log message stopped right away. At the same time, several of my other load balanced queues began sending their flow files, as if a dam had been released.
>>
>> At this point, I stopped to consider why this connection would be penalized, and realized it was backpressured for unrelated reasons (part of our flow is stopped, which leads to backpressure all the way back to this queue).
>>
>> Could it be that if any one load balanced connection is back-pressured, it could consume all of the available load balancer threads, such that no other load balanced connection can function?
>>
>> Joe
>>
>> On Tue, Jun 4, 2019 at 6:00 PM Joe Gresock <[email protected]> wrote:
>>
>>> Thanks! I'm taking notes for next time. For now, a full cluster restart appears to have resolved this case.
>>> On Tue, Jun 4, 2019 at 5:55 PM Mark Payne <[email protected]> wrote:
>>>
>>>> Joe,
>>>>
>>>> You may want to try enabling DEBUG logging for the following classes:
>>>>
>>>> org.apache.nifi.controller.queue.clustered.client.async.nio.LoadBalanceSession
>>>> org.apache.nifi.controller.queue.clustered.client.async.nio.NioAsyncLoadBalanceClient
>>>>
>>>> That may provide some interesting information, especially if grepping for specific nodes. But I'll warn you - the logging can certainly be quite verbose.
>>>>
>>>> Thanks
>>>> -Mark
>>>>
>>>> On Jun 4, 2019, at 12:29 PM, Mark Payne <[email protected]> wrote:
>>>>
>>>> Well, that is certainly interesting. Thanks for the update. Please do let us know if it occurs again.
>>>>
>>>> On Jun 4, 2019, at 12:23 PM, Joe Gresock <[email protected]> wrote:
>>>>
>>>> Ok, I just tried disconnecting each node from the cluster, in turn. The first three (prod-6, -7, and -8) didn't make a difference, but when I reconnected prod-5, the load balanced connection started flowing again.
>>>>
>>>> I'll continue to monitor it and let you know if this happens again.
>>>>
>>>> Thanks for the suggestions!
>>>>
>>>> On Tue, Jun 4, 2019 at 4:14 PM Joe Gresock <[email protected]> wrote:
>>>>
>>>>> prod-5 and -6 don't appear to be receiving any data in that queue, based on the status history. Is there anything I should see in the logs to confirm this?
>>>>>
>>>>> On Tue, Jun 4, 2019 at 4:05 PM Mark Payne <[email protected]> wrote:
>>>>>
>>>>>> Joe,
>>>>>>
>>>>>> It looks, from the Diagnostics info, like there are currently 500 FlowFiles queued up. They all live on prod-8.ec2.internal:8443. Of those 500, 250 are waiting to go to prod-5.ec2.internal:8443, and 250 are waiting to go to prod-6.ec2.internal:8443.
>>>>>>
>>>>>> So this tells us that if there are any problems, they are likely occurring on one of those 3 nodes.
>>>>>> It's also not related to swapping if it's in this state with only 500 FlowFiles queued.
>>>>>>
>>>>>> Are you able to confirm that you are indeed receiving data from the load balanced queue on both prod-5 and prod-6?
>>>>>>
>>>>>> On Jun 4, 2019, at 11:47 AM, Joe Gresock <[email protected]> wrote:
>>>>>>
>>>>>> Thanks Mark.
>>>>>>
>>>>>> I'm running on Linux. I've followed your suggestion and added an UpdateAttribute processor to the flow, and attached the diagnostics for it.
>>>>>>
>>>>>> I also don't see any errors in the logs.
>>>>>>
>>>>>> On Tue, Jun 4, 2019 at 3:34 PM Mark Payne <[email protected]> wrote:
>>>>>>
>>>>>>> Joe,
>>>>>>>
>>>>>>> The first thing that comes to mind would be NIFI-6285, as Bryan points out. However, that only would affect you if you are running on Windows. So, the first question is: what operating system are you running on? :)
>>>>>>>
>>>>>>> If it's not Windows, I would recommend getting some diagnostics info if possible. To do this, you can go to http://<hostname>:<port>/nifi-api/processors/<processor-id>/diagnostics. For example, if you get to NiFi by going to http://nifi01:8080/nifi, and you want diagnostics for the processor with ID 1234, then try going to http://nifi01:8080/nifi-api/processors/1234/diagnostics in your browser.
>>>>>>>
>>>>>>> A couple of caveats on the 'diagnostics' approach above, though. It will only work if you are running an insecure NiFi instance, or if you are secured using certificates. We want the diagnostics for the Processor that is either the source of the connection or the destination of the connection - it doesn't matter which. This will give us a lot of information about the internal structure of the connection's FlowFile Queue.
>>>>>>> Of course, you said that your connection is between two Process Groups, which means that neither the source nor the destination is a Processor. So I would recommend creating a dummy Processor like UpdateAttribute and temporarily dragging the Connection so that it points to that Processor, just to get the diagnostic information, then dragging the connection back.
>>>>>>>
>>>>>>> Of course, it would also be helpful to look for any errors in the logs. But if you are able to get the diagnostics info as described above, that's usually the best bet for debugging this sort of thing.
>>>>>>>
>>>>>>> Thanks
>>>>>>> -Mark
>>>>>>>
>>>>>>> On Jun 4, 2019, at 11:13 AM, Bryan Bende <[email protected]> wrote:
>>>>>>>
>>>>>>> Joe,
>>>>>>>
>>>>>>> There are two known issues that possibly seem related...
>>>>>>>
>>>>>>> The first was already addressed in 1.9.0, but the reason I mention it is because it was specific to a connection between two ports:
>>>>>>>
>>>>>>> https://issues.apache.org/jira/browse/NIFI-5919
>>>>>>>
>>>>>>> The second is not in a release yet, but is addressed in master, and has to do with swapping:
>>>>>>>
>>>>>>> https://issues.apache.org/jira/browse/NIFI-6285
>>>>>>>
>>>>>>> It seems like you wouldn't hit the first one since you are on 1.9.2, but it does seem odd that it's the same scenario.
>>>>>>>
>>>>>>> Mark P probably knows best about debugging, but I'm guessing a thread dump while in this state would be helpful.
>>>>>>>
>>>>>>> -Bryan
>>>>>>>
>>>>>>> On Tue, Jun 4, 2019 at 10:56 AM Joe Gresock <[email protected]> wrote:
>>>>>>>
>>>>>>> I have round robin load balanced connections working on one cluster, but on another, this type of connection seems to be stuck.
>>>>>>>
>>>>>>> What would be the best way to debug this problem?
>>>>>>> The connection is from one processor group to another, so it's from an Output Port to an Input Port.
>>>>>>>
>>>>>>> My configuration is as follows:
>>>>>>>
>>>>>>> nifi.cluster.load.balance.host=
>>>>>>> nifi.cluster.load.balance.port=6342
>>>>>>> nifi.cluster.load.balance.connections.per.node=4
>>>>>>> nifi.cluster.load.balance.max.thread.count=8
>>>>>>> nifi.cluster.load.balance.comms.timeout=30 sec
>>>>>>>
>>>>>>> And I ensured port 6342 is open from one node to another using the cluster node addresses.
>>>>>>>
>>>>>>> Is there some error that should appear in the logs if flow files get stuck here?
>>>>>>>
>>>>>>> I suspect they are actually stuck, not just missing, because the remainder of the flow is back-pressured up until this point in the flow.
>>>>>>>
>>>>>>> Thanks!
>>>>>>> Joe
>>>>>>
>>>>>> --
>>>>>> I know what it is to be in need, and I know what it is to have plenty. I have learned the secret of being content in any and every situation, whether well fed or hungry, whether living in plenty or in want. I can do all this through him who gives me strength. *-Philippians 4:12-13*
>>>>>>
>>>>>> <diagnostics.json.gz>
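Mark's diagnostics endpoint above follows directly from the UI URL. A small sketch of that mapping (the hostnames and processor ID are just the placeholders from his example, not real instances):

```python
from urllib.parse import urlsplit

def diagnostics_url(nifi_ui_url: str, processor_id: str) -> str:
    """Build the processor diagnostics endpoint from the NiFi UI URL.

    E.g. a UI at http://nifi01:8080/nifi maps to
    http://nifi01:8080/nifi-api/processors/<id>/diagnostics.
    """
    parts = urlsplit(nifi_ui_url)
    return f"{parts.scheme}://{parts.netloc}/nifi-api/processors/{processor_id}/diagnostics"

# Mark's example: UI at http://nifi01:8080/nifi, processor ID 1234.
print(diagnostics_url("http://nifi01:8080/nifi", "1234"))
# Per the caveats above, fetching it directly only works on an insecure
# instance or with certificate-based auth.
```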
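The DEBUG logging Mark suggests is enabled by adding logger entries for the two classes to NiFi's `conf/logback.xml`. A minimal sketch, assuming the stock logback configuration shipped with NiFi (which rescans the file periodically, so a restart is usually not needed):

```xml
<!-- Add inside the <configuration> element of conf/logback.xml.
     DEBUG here is very verbose, as Mark warns; revert to INFO when done. -->
<logger name="org.apache.nifi.controller.queue.clustered.client.async.nio.LoadBalanceSession" level="DEBUG"/>
<logger name="org.apache.nifi.controller.queue.clustered.client.async.nio.NioAsyncLoadBalanceClient" level="DEBUG"/>
```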

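Joe's check that the load balance port is open between nodes can be scripted. A hedged sketch (the node names are hypothetical stand-ins for the cluster addresses in your `nifi.properties`; 6342 is the `nifi.cluster.load.balance.port` from the configuration above):

```python
import socket

def port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # Hypothetical node names; run this from each node against every other
    # node to confirm the load balance port is reachable cluster-wide.
    for node in ["prod-5.ec2.internal", "prod-6.ec2.internal"]:
        print(node, "open" if port_open(node, 6342, timeout=3.0) else "CLOSED")
```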