Mark, looks great!  No problems since yesterday.  I'd say you're good to
commit.

On Tue, Jun 11, 2019 at 7:15 PM Mark Payne <[email protected]> wrote:

> Thanks Joe! If all looks good then we can hopefully get this merged in
> quickly.
>
> On Jun 11, 2019, at 2:30 PM, Joe Gresock <[email protected]> wrote:
>
> I deployed the patch, Mark.  So far, so good.  I configured the flow as I
> had before, and haven't encountered this state yet.  The load balancing
> logs look much better, and I don't see any spamming of the "will not
> communicate with..." message.
>
> I'll let it run for another day and report back.
>
> On Mon, Jun 10, 2019 at 8:08 PM Mark Payne <[email protected]> wrote:
>
>> Joe,
>>
>> I did just get a PR up for this JIRA. If you are inclined to test the PR,
>> please do and let us know how everything goes.
>>
>> Thanks!
>> -Mark
>>
>>
>> On Jun 5, 2019, at 2:29 PM, Mark Payne <[email protected]> wrote:
>>
>> Hey Joe,
>>
>> Thanks for the feedback here on the logs and the analysis. I think you're
>> exactly right - the connection in the second flow appears to be causing
>> your first flow to stop transmitting.
>> I have been able to replicate it pretty consistently and am starting to
>> work on a fix. Hopefully I will have a PR up very shortly. If you're in a
>> position to do so, it would be great if you could test it out. I just
>> created a JIRA to track the issue, NIFI-6353 [1].
>>
>> Thanks
>> -Mark
>>
>> [1] https://issues.apache.org/jira/browse/NIFI-6353
>>
>> On Jun 4, 2019, at 8:13 PM, Joe Gresock <[email protected]> wrote:
>>
>> Ok, a couple of hours after the above restart, all the load balanced
>> connections stopped sending again.
>>
>> I enabled DEBUG on the above 2 classes, and found the following messages
>> being spammed in the logs:
>> 2019-06-04 23:39:15,497 DEBUG [Load-Balanced Client Thread-2]
>> o.a.n.c.q.c.c.a.nio.LoadBalanceSession Will not communicate with Peer
>> prod-6.ec2.internal:8443 for Connection
>> e1d23323-5630-1703-0000-00000481bd04 because session is penalized
>>
>> The same message is also spammed for prod-7 on the same connection, but I
>> don't see any other connections in the log.
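[Editor's note] A quick way to confirm which peers and connections the spam covers is to tally the messages out of nifi-app.log; a hypothetical sketch (the regex assumes the exact message format shown in the excerpt above):

```python
import re
from collections import Counter

# Tally "Will not communicate" messages by (peer, connection id).
# The pattern assumes the exact message format from the log excerpt.
PATTERN = re.compile(
    r"Will not communicate with Peer (\S+) for Connection (\S+) "
    r"because session is penalized"
)

def tally(log_lines):
    counts = Counter()
    for line in log_lines:
        match = PATTERN.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts

sample = [
    "2019-06-04 23:39:15,497 DEBUG [Load-Balanced Client Thread-2] "
    "o.a.n.c.q.c.c.a.nio.LoadBalanceSession Will not communicate with Peer "
    "prod-6.ec2.internal:8443 for Connection "
    "e1d23323-5630-1703-0000-00000481bd04 because session is penalized",
]
print(tally(sample))
```

In practice you would feed it the lines of nifi-app.log instead of the sample list.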
>>
>> Now, interestingly, these are the only messages I see for any of the 8
>> "Load-Balanced Client Thread-X" threads, so this makes me wonder if this
>> penalized session has consumed all of the available load balance threads
>> (nifi.cluster.load.balance.max.thread.count=8), such that no other load
>> balancing can occur for any of the other connections in the flow, at least
>> from that server.
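[Editor's note] That hypothesis - one penalized session tying up every thread in a fixed pool - can be illustrated with a toy simulation. This is purely a sketch of the suspected failure mode, not NiFi's actual implementation; all names below are invented:

```python
import queue
import threading
import time

# Toy model: a fixed pool of load-balance workers, where a worker that
# picks up a penalized connection keeps retrying it instead of
# returning to the pool.
POOL_SIZE = 2                  # small pool to make the effect obvious
connections = queue.Queue()
sent = []                      # connections that actually transferred data
stop = threading.Event()

def process(conn):
    if conn["penalized"]:
        # Back-pressured peer: retry until un-penalized, never releasing
        # the worker thread ("Will not communicate with Peer ...").
        while not stop.is_set():
            time.sleep(0.01)
    else:
        sent.append(conn["name"])

def worker():
    while not stop.is_set():
        try:
            conn = connections.get(timeout=0.05)
        except queue.Empty:
            continue
        process(conn)

# Two penalized connections are queued ahead of one healthy connection.
connections.put({"name": "e1d-to-prod-6", "penalized": True})
connections.put({"name": "e1d-to-prod-7", "penalized": True})
connections.put({"name": "healthy-conn", "penalized": False})

threads = [threading.Thread(target=worker) for _ in range(POOL_SIZE)]
for t in threads:
    t.start()
time.sleep(0.3)
stop.set()
for t in threads:
    t.join()

# Both workers are parked on penalized sessions, so the healthy
# connection never gets a thread.
print("connections served:", sent)
```

With both workers stuck on the penalized sessions, the healthy connection never transfers anything, which matches the observed behavior.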
>>
>> On that hunch, I changed this connection (e1d...) to "Not load balanced"
>> and bled out all the flow files on it, and the spammed log message stopped
>> right away.  At the same time, several of my other load balanced queues
>> began sending their flow files, as if a dam was released.
>>
>> At this point, I stopped to consider why this connection would be
>> penalized, and realized it was backpressured for unrelated reasons (part of
>> our flow is stopped, which leads to backpressure all the way back to this
>> queue).
>>
>> Could it be that if any one load balanced connection is back-pressured,
>> it could consume all of the available load balancer threads such that no
>> other load balanced connection can function?
>>
>> Joe
>>
>> On Tue, Jun 4, 2019 at 6:00 PM Joe Gresock <[email protected]> wrote:
>>
>>> Thanks!  I'm taking notes for next time.  For now, a full cluster
>>> restart appears to have resolved this case.
>>>
>>> On Tue, Jun 4, 2019 at 5:55 PM Mark Payne <[email protected]> wrote:
>>>
>>>> Joe,
>>>>
>>>>
>>>> You may want to try enabling DEBUG logging for the following classes:
>>>>
>>>>
>>>> org.apache.nifi.controller.queue.clustered.client.async.nio.LoadBalanceSession
>>>>
>>>> org.apache.nifi.controller.queue.clustered.client.async.nio.NioAsyncLoadBalanceClient
>>>>
>>>> That may provide some interesting information, especially if grepping
>>>> for specific nodes. But I'll warn you - the logging can certainly be quite
>>>> verbose.
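[Editor's note] For anyone following along: those loggers can typically be enabled by adding entries like the following to NiFi's conf/logback.xml (a sketch; the exact file location depends on your install, and logback usually picks up the change without a restart when configuration scanning is enabled):

```xml
<!-- Inside the <configuration> element of conf/logback.xml -->
<logger name="org.apache.nifi.controller.queue.clustered.client.async.nio.LoadBalanceSession"
        level="DEBUG"/>
<logger name="org.apache.nifi.controller.queue.clustered.client.async.nio.NioAsyncLoadBalanceClient"
        level="DEBUG"/>
```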
>>>>
>>>> Thanks
>>>> -Mark
>>>>
>>>>
>>>>
>>>> On Jun 4, 2019, at 12:29 PM, Mark Payne <[email protected]> wrote:
>>>>
>>>> Well, that is certainly interesting. Thanks for the update. Please do
>>>> let us know if it occurs again.
>>>>
>>>> On Jun 4, 2019, at 12:23 PM, Joe Gresock <[email protected]> wrote:
>>>>
>>>> Ok, I just tried disconnecting each node from the cluster, in turn.
>>>> The first three (prod-6, -7, and -8) didn't make a difference, but when I
>>>> reconnected prod-5, the load balanced connection started flowing again.
>>>>
>>>> I'll continue to monitor it and let you know if this happens again.
>>>>
>>>> Thanks for the suggestions!
>>>>
>>>> On Tue, Jun 4, 2019 at 4:14 PM Joe Gresock <[email protected]> wrote:
>>>>
>>>>> prod-5 and -6 don't appear to be receiving any data in that queue,
>>>>> based on the status history.  Is there anything I should see in the logs
>>>>> to confirm this?
>>>>>
>>>>> On Tue, Jun 4, 2019 at 4:05 PM Mark Payne <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Joe,
>>>>>>
>>>>>> So it looks, from the Diagnostics info, like there are currently
>>>>>> 500 FlowFiles queued up.
>>>>>> They all live on prod-8.ec2.internal:8443. Of those 500, 250 are
>>>>>> waiting to go to prod-5.ec2.internal:8443,
>>>>>> and 250 are waiting to go to prod-6.ec2.internal:8443.
>>>>>>
>>>>>> So this tells us that if there are any problems, they are likely
>>>>>> occurring on one of those 3 nodes. It's also not
>>>>>> related to swapping if it's in this state with only 500 FlowFiles
>>>>>> queued.
>>>>>>
>>>>>> Are you able to confirm that you are indeed receiving data from the
>>>>>> load balanced queue on both prod-5 and prod-6?
>>>>>>
>>>>>>
>>>>>> On Jun 4, 2019, at 11:47 AM, Joe Gresock <[email protected]> wrote:
>>>>>>
>>>>>> Thanks Mark.
>>>>>>
>>>>>> I'm running on Linux.  I've followed your suggestion and added an
>>>>>> UpdateAttribute processor to the flow, and attached the diagnostics for it.
>>>>>>
>>>>>> I also don't see any errors in the logs.
>>>>>>
>>>>>> On Tue, Jun 4, 2019 at 3:34 PM Mark Payne <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Joe,
>>>>>>>
>>>>>>> The first thing that comes to mind would be NIFI-6285, as Bryan
>>>>>>> points out. However,
>>>>>>> that only would affect you if you are running on Windows. So, the
>>>>>>> first question is:
>>>>>>> what operating system are you running on? :)
>>>>>>>
>>>>>>> If it's not Windows, I would recommend getting some diagnostics info
>>>>>>> if possible. To do this,
>>>>>>> you can go to 
>>>>>>> http://<hostname>:<port>/nifi-api/processors/<processor-id>/diagnostics.
>>>>>>> For example,
>>>>>>> if you get to nifi by going to http://nifi01:8080/nifi, and you
>>>>>>> want diagnostics for processor with ID 1234,
>>>>>>> then try going to
>>>>>>> http://nifi01:8080/nifi-api/processors/1234/diagnostics in your
>>>>>>> browser.
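[Editor's note] The URL construction Mark describes is mechanical enough to script; a minimal sketch (the hostname, port, and processor ID below are placeholders for your own instance's values):

```python
# Build the diagnostics URL for a processor from the base UI address.
# All values below are placeholders -- substitute your own.
def diagnostics_url(base_url: str, processor_id: str) -> str:
    # base_url is where you reach the NiFi UI, e.g. "http://nifi01:8080"
    return f"{base_url}/nifi-api/processors/{processor_id}/diagnostics"

print(diagnostics_url("http://nifi01:8080", "1234"))
# -> http://nifi01:8080/nifi-api/processors/1234/diagnostics
```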
>>>>>>>
>>>>>>> But a couple of caveats on the 'diagnostics' approach above. It will
>>>>>>> only work if you are running an insecure
>>>>>>> NiFi instance, or if you are secured using certificates. We want the
>>>>>>> diagnostics for the Processor that is either
>>>>>>> the source of the connection or the destination of the connection -
>>>>>>> it doesn't matter which. This will give us a
>>>>>>> lot of information about the internal structure of the connection's
>>>>>>> FlowFile Queue. Of course, you said that your
>>>>>>> connection is between two Process Groups, which means that neither
>>>>>>> the source nor the destination is a Processor,
>>>>>>> so I would recommend creating a dummy Processor like UpdateAttribute
>>>>>>> and temporarily dragging the Connection
>>>>>>> so that it points to that Processor, just to get the diagnostic
>>>>>>> information, then dragging the connection back.
>>>>>>>
>>>>>>> Of course, it would also be helpful to look for any errors in the
>>>>>>> logs. But if you are able to get the diagnostics info
>>>>>>> as described above, that's usually the best bet for debugging this
>>>>>>> sort of thing.
>>>>>>>
>>>>>>> Thanks
>>>>>>> -Mark
>>>>>>>
>>>>>>>
>>>>>>> On Jun 4, 2019, at 11:13 AM, Bryan Bende <[email protected]> wrote:
>>>>>>>
>>>>>>> Joe,
>>>>>>>
>>>>>>> There are two known issues that seem possibly related...
>>>>>>>
>>>>>>> The first was already addressed in 1.9.0, but the reason I mention it
>>>>>>> is because it was specific to a connection between two ports:
>>>>>>>
>>>>>>> https://issues.apache.org/jira/browse/NIFI-5919
>>>>>>>
>>>>>>> The second is not in a release yet, but is addressed in master, and
>>>>>>> has to do with swapping:
>>>>>>>
>>>>>>> https://issues.apache.org/jira/browse/NIFI-6285
>>>>>>>
>>>>>>> Seems like you wouldn't hit the first one since you are on 1.9.2, but
>>>>>>> it does seem odd that it's the same scenario.
>>>>>>>
>>>>>>> Mark P probably knows best about debugging, but I'm guessing a thread
>>>>>>> dump taken while in this state would be helpful.
>>>>>>>
>>>>>>> -Bryan
>>>>>>>
>>>>>>> On Tue, Jun 4, 2019 at 10:56 AM Joe Gresock <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>> I have round robin load balanced connections working on one cluster,
>>>>>>> but on another, this type of connection seems to be stuck.
>>>>>>>
>>>>>>> What would be the best way to debug this problem?  The connection is
>>>>>>> from one processor group to another, so it's from an Output Port to an
>>>>>>> Input Port.
>>>>>>>
>>>>>>> My configuration is as follows:
>>>>>>> nifi.cluster.load.balance.host=
>>>>>>> nifi.cluster.load.balance.port=6342
>>>>>>> nifi.cluster.load.balance.connections.per.node=4
>>>>>>> nifi.cluster.load.balance.max.thread.count=8
>>>>>>> nifi.cluster.load.balance.comms.timeout=30 sec
>>>>>>>
>>>>>>> And I ensured port 6342 is open from one node to another using the
>>>>>>> cluster node addresses.
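[Editor's note] A check like that can be scripted with a plain TCP connect, the same thing `nc -z` does; a sketch (the host names are placeholders):

```python
import socket

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    # Plain TCP connect check -- equivalent to `nc -z host port`.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g., verify the load-balance port is reachable from this node:
# for host in ("prod-5.ec2.internal", "prod-6.ec2.internal"):
#     print(host, port_open(host, 6342))
```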
>>>>>>>
>>>>>>> Is there some error that should appear in the logs if flow files get
>>>>>>> stuck here?
>>>>>>>
>>>>>>> I suspect they are actually stuck, not just missing, because the
>>>>>>> remainder of the flow is back-pressured up until this point in the flow.
>>>>>>>
>>>>>>> Thanks!
>>>>>>> Joe
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> I know what it is to be in need, and I know what it is to have
>>>>>> plenty.  I have learned the secret of being content in any and every
>>>>>> situation, whether well fed or hungry, whether living in plenty or
>>>>>> in want.  I can do all this through him who gives me strength.    
>>>>>> *-Philippians
>>>>>> 4:12-13*
>>>>>> <diagnostics.json.gz>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>>
>>
>>
>
>
>
>

