Thanks! I'm taking notes for next time. For now, a full cluster restart appears to have resolved this case.
On Tue, Jun 4, 2019 at 5:55 PM Mark Payne <[email protected]> wrote:

> Joe,
>
> You may want to try enabling DEBUG logging for the following classes:
>
> org.apache.nifi.controller.queue.clustered.client.async.nio.LoadBalanceSession
> org.apache.nifi.controller.queue.clustered.client.async.nio.NioAsyncLoadBalanceClient
>
> That may provide some interesting information, especially if grepping for
> specific nodes. But I'll warn you - the logging can certainly be quite
> verbose.
>
> Thanks
> -Mark
>
> On Jun 4, 2019, at 12:29 PM, Mark Payne <[email protected]> wrote:
>
> Well, that is certainly interesting. Thanks for the update. Please do let
> us know if it occurs again.
>
> On Jun 4, 2019, at 12:23 PM, Joe Gresock <[email protected]> wrote:
>
> Ok.. I just tried disconnecting each node from the cluster, in turn. The
> first three (prod-6, -7, and -8) didn't make a difference, but when I
> reconnected prod-5, the load-balanced connection started flowing again.
>
> I'll continue to monitor it and let you know if this happens again.
>
> Thanks for the suggestions!
>
> On Tue, Jun 4, 2019 at 4:14 PM Joe Gresock <[email protected]> wrote:
>
>> prod-5 and -6 don't appear to be receiving any data in that queue, based
>> on the status history. Is there anything I should see in the logs to
>> confirm this?
>>
>> On Tue, Jun 4, 2019 at 4:05 PM Mark Payne <[email protected]> wrote:
>>
>>> Joe,
>>>
>>> So it looks from the Diagnostics info like there are currently 500
>>> FlowFiles queued up. They all live on prod-8.ec2.internal:8443. Of
>>> those 500, 250 are waiting to go to prod-5.ec2.internal:8443, and 250
>>> are waiting to go to prod-6.ec2.internal:8443.
>>>
>>> So this tells us that if there are any problems, they are likely
>>> occurring on one of those 3 nodes. It's also not related to swapping
>>> if it's in this state with only 500 FlowFiles queued.
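For anyone wanting to flip on the DEBUG logging Mark suggests, it amounts to two logger entries in NiFi's conf/logback.xml. A sketch (the class names are the ones listed above; NiFi's default logback configuration rescans the file periodically, so a restart is usually not needed for this to take effect):

```xml
<!-- conf/logback.xml: add inside the root <configuration> element -->
<logger name="org.apache.nifi.controller.queue.clustered.client.async.nio.LoadBalanceSession" level="DEBUG"/>
<logger name="org.apache.nifi.controller.queue.clustered.client.async.nio.NioAsyncLoadBalanceClient" level="DEBUG"/>
```

The output goes to logs/nifi-app.log by default, so something like `grep prod-5 logs/nifi-app.log` helps narrow the (very verbose) output down to one node's sessions.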
>>> Are you able to confirm that you are indeed receiving data from the
>>> load balanced queue on both prod-5 and prod-6?
>>>
>>> On Jun 4, 2019, at 11:47 AM, Joe Gresock <[email protected]> wrote:
>>>
>>> Thanks Mark.
>>>
>>> I'm running on Linux. I've followed your suggestion and added an
>>> UpdateAttribute processor to the flow, and attached the diagnostics
>>> for it.
>>>
>>> I also don't see any errors in the logs.
>>>
>>> On Tue, Jun 4, 2019 at 3:34 PM Mark Payne <[email protected]> wrote:
>>>
>>>> Joe,
>>>>
>>>> The first thing that comes to mind would be NIFI-6285, as Bryan points
>>>> out. However, that would only affect you if you are running on
>>>> Windows. So the first question is: what operating system are you
>>>> running on? :)
>>>>
>>>> If it's not Windows, I would recommend getting some diagnostics info
>>>> if possible. To do this, you can go to
>>>> http://<hostname>:<port>/nifi-api/processors/<processor-id>/diagnostics.
>>>> For example, if you get to NiFi by going to http://nifi01:8080/nifi,
>>>> and you want diagnostics for the processor with ID 1234, then try
>>>> going to http://nifi01:8080/nifi-api/processors/1234/diagnostics in
>>>> your browser.
>>>>
>>>> But a couple of caveats on the 'diagnostics' approach above. It will
>>>> only work if you are running an insecure NiFi instance, or if you are
>>>> secured using certificates. We want the diagnostics for the Processor
>>>> that is either the source of the connection or the destination of the
>>>> connection - it doesn't matter which. This will give us a lot of
>>>> information about the internal structure of the connection's FlowFile
>>>> Queue.
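The URL pattern Mark gives can be captured in a tiny helper, which is handy if you need diagnostics for several processors (the function name is hypothetical; a sketch, assuming the same base URL you use to reach the NiFi UI):

```python
def diagnostics_url(base_url: str, processor_id: str) -> str:
    """Build the processor diagnostics endpoint from the NiFi base URL.

    base_url is the scheme://host:port used to reach the UI, e.g.
    "http://nifi01:8080"; processor_id is the processor's ID.
    """
    return f"{base_url}/nifi-api/processors/{processor_id}/diagnostics"

# Matches the example in the message above:
print(diagnostics_url("http://nifi01:8080", "1234"))
# -> http://nifi01:8080/nifi-api/processors/1234/diagnostics
```

On an unsecured instance the resulting URL is a plain GET (browser or curl); on a certificate-secured instance the same request needs your client certificate.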
>>>> Of course, you said that your connection is between two Process
>>>> Groups, which means that neither the source nor the destination is a
>>>> Processor, so I would recommend creating a dummy Processor like
>>>> UpdateAttribute and temporarily dragging the Connection so that it
>>>> points to that Processor, just to get the diagnostic information,
>>>> then dragging the connection back.
>>>>
>>>> Of course, it would also be helpful to look for any errors in the
>>>> logs. But if you are able to get the diagnostics info as described
>>>> above, that's usually the best bet for debugging this sort of thing.
>>>>
>>>> Thanks
>>>> -Mark
>>>>
>>>> On Jun 4, 2019, at 11:13 AM, Bryan Bende <[email protected]> wrote:
>>>>
>>>> Joe,
>>>>
>>>> There are two known issues that possibly seem related...
>>>>
>>>> The first was already addressed in 1.9.0, but the reason I mention it
>>>> is because it was specific to a connection between two ports:
>>>>
>>>> https://issues.apache.org/jira/browse/NIFI-5919
>>>>
>>>> The second is not in a release yet, but is addressed in master, and
>>>> has to do with swapping:
>>>>
>>>> https://issues.apache.org/jira/browse/NIFI-6285
>>>>
>>>> Seems like you wouldn't hit the first one since you are on 1.9.2, but
>>>> it does seem odd that it's the same scenario.
>>>>
>>>> Mark P probably knows best about debugging, but I'm guessing a thread
>>>> dump taken while in this state would possibly be helpful.
>>>>
>>>> -Bryan
>>>>
>>>> On Tue, Jun 4, 2019 at 10:56 AM Joe Gresock <[email protected]> wrote:
>>>>
>>>> I have round robin load balanced connections working on one cluster,
>>>> but on another, this type of connection seems to be stuck.
>>>>
>>>> What would be the best way to debug this problem? The connection is
>>>> from one Process Group to another, so it's from an Output Port to an
>>>> Input Port.
>>>> My configuration is as follows:
>>>>
>>>> nifi.cluster.load.balance.host=
>>>> nifi.cluster.load.balance.port=6342
>>>> nifi.cluster.load.balance.connections.per.node=4
>>>> nifi.cluster.load.balance.max.thread.count=8
>>>> nifi.cluster.load.balance.comms.timeout=30 sec
>>>>
>>>> And I ensured port 6342 is open from one node to another using the
>>>> cluster node addresses.
>>>>
>>>> Is there some error that should appear in the logs if flow files get
>>>> stuck here?
>>>>
>>>> I suspect they are actually stuck, not just missing, because the
>>>> remainder of the flow is back-pressured up until this point in the
>>>> flow.
>>>>
>>>> Thanks!
>>>> Joe
>>>
>>> --
>>> I know what it is to be in need, and I know what it is to have plenty.
>>> I have learned the secret of being content in any and every situation,
>>> whether well fed or hungry, whether living in plenty or in want. I can
>>> do all this through him who gives me strength. *-Philippians 4:12-13*
>>> <diagnostics.json.gz>
