In my case it was a combined disk/network issue. If you run on Linux, I can strongly recommend having a look at netdata, which is a five-minute deploy (and a clean removal) and gives you a great way to correlate various metrics without any drama.

Martijn
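For context, the ThreadPoolRequestReplicator DEBUG output quoted further down in this thread is enabled by adding a logger entry to NiFi's conf/logback.xml — a minimal sketch, assuming the stock logback.xml layout (which has scanning enabled by default, so the change should be picked up without a restart):

```xml
<!-- conf/logback.xml: add inside the <configuration> element on each node.
     Emits per-node request replication timings to nifi-app.log. -->
<logger name="org.apache.nifi.cluster.coordination.http.replication.ThreadPoolRequestReplicator"
        level="DEBUG"/>
```

Remember to set it back to INFO (or remove it) afterwards; this logger is chatty on a busy cluster.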
On Tue, 6 Nov 2018, at 16:53, Karthik Kothareddy (karthikk) [CONT - Type 2] wrote:
> Hello Martijn,
>
> No... we're still seeing these warnings very frequently, and sometimes node disconnects as well. We tried everything we could and were not able to get to the bottom of this. When you say "related to hardware", can you please point me to the specific issue? We double-checked all our boxes for any issues and everything seemed normal.
>
> -Karthik
>
> From: Martijn Dekkers [mailto:[email protected]]
> Sent: Tuesday, November 06, 2018 8:38 AM
> To: [email protected]
> Subject: Re: [EXT] Re: Cluster Warnings
>
> Did you ever get to the bottom of this, Karthik? I have seen a similar issue, and it turned out to be related to hardware not keeping up with the increased volume of work (which is why a node was added to begin with).
>
> On Fri, 26 Oct 2018, at 19:18, Karthik Kothareddy (karthikk) [CONT - Type 2] wrote:
>> Any light on this? We're still seeing warnings every 10 seconds, which is really annoying, and we have no idea what's triggering them.
>>
>> -Karthik
>>
>> From: Karthik Kothareddy (karthikk) [CONT - Type 2]
>> Sent: Tuesday, October 16, 2018 8:27 AM
>> To: [email protected]
>> Subject: RE: [EXT] Re: Cluster Warnings
>>
>> Below is some of the nifi-app.log with DEBUG mode enabled.
>>
>> 2018-10-16 14:21:11,577 DEBUG [Replicate Request Thread-3] o.a.n.c.c.h.r.ThreadPoolRequestReplicator Received response from NIFI01:8443 for GET /nifi-api/site-to-site
>> 2018-10-16 14:21:11,577 DEBUG [Replicate Request Thread-3] o.a.n.c.c.h.r.StandardAsyncClusterResponse Received response 4 out of 4 for c910f24a-96fe-4efc-a8b8-6ef8d2674524 from NIFI01:8443
>> 2018-10-16 14:21:11,577 DEBUG [Replicate Request Thread-3] o.a.n.c.c.h.r.StandardAsyncClusterResponse Notifying all that merged response is ready for c910f24a-96fe-4efc-a8b8-6ef8d2674524
>> 2018-10-16 14:21:11,578 DEBUG [Replicate Request Thread-3] o.a.n.c.c.h.r.ThreadPoolRequestReplicator For GET /nifi-api/site-to-site (Request ID c910f24a-96fe-4efc-a8b8-6ef8d2674524), minimum response time = 2, max = 224, average = 59.0 ms
>> 2018-10-16 14:21:11,578 DEBUG [Replicate Request Thread-3] o.a.n.c.c.h.r.ThreadPoolRequestReplicator Node Responses for GET /nifi-api/site-to-site (Request ID c910f24a-96fe-4efc-a8b8-6ef8d2674524):
>> NIFI04:8443: 2 millis
>> NIFI02:8443: 5 millis
>> NIFI01:8443: 224 millis
>> NIFI03:8443: 5 millis
>>
>> 2018-10-16 14:21:11,578 DEBUG [Replicate Request Thread-3] o.a.n.c.c.h.r.ThreadPoolRequestReplicator Notified monitor java.lang.Object@1da27fc4 because request GET https://NIFI03:8443/nifi-api/site-to-site has completed
>> 2018-10-16 14:21:11,578 DEBUG [NiFi Web Server-304048] o.a.n.c.c.h.r.ThreadPoolRequestReplicator Unlocked java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock@3d507660[Read locks = 0] after replication completed for GET https://NIFI03:8443/nifi-api/site-to-site
>> 2018-10-16 14:21:11,578 DEBUG [NiFi Web Server-304048] o.a.n.c.c.h.r.StandardAsyncClusterResponse Notifying all that merged response is complete for c910f24a-96fe-4efc-a8b8-6ef8d2674524
>> 2018-10-16 14:21:11,578 DEBUG [NiFi Web Server-304048] o.a.n.c.c.h.r.StandardAsyncClusterResponse For GET https://NIFI03:8443/nifi-api/site-to-site Timing Info is as follows:
>> Verify Cluster State for All Nodes took 0 millis
>> Wait for HTTP Request Replication to be triggered for NIFI04:8443 took 0 millis
>> Wait for HTTP Request Replication to be triggered for NIFI02:8443 took 0 millis
>> Wait for HTTP Request Replication to be triggered for NIFI01:8443 took 0 millis
>> Wait for HTTP Request Replication to be triggered for NIFI03:8443 took 0 millis
>> Perform HTTP Request for NIFI04:8443 took 2 millis
>> Perform HTTP Request for NIFI03:8443 took 5 millis
>> Perform HTTP Request for NIFI02:8443 took 5 millis
>> Perform HTTP Request for NIFI01:8443 took 224 millis
>> Phase Completed for All Nodes took 224 millis
>> Notifying All Threads that Request is Complete took 0 millis
>> Total Time for All Nodes took 224 millis
>> Map/Merge Responses for All Nodes took 0 millis
>>
>> -Karthik
>>
>> From: Joe Witt [mailto:[email protected]]
>> Sent: Tuesday, October 16, 2018 8:03 AM
>> To: [email protected]
>> Subject: Re: [EXT] Re: Cluster Warnings
>>
>> Karthik,
>>
>> Understood. Do you have those logs?
>>
>> On Tue, Oct 16, 2018, 9:59 AM Karthik Kothareddy (karthikk) [CONT - Type 2] <[email protected]> wrote:
>>> Joe,
>>>
>>> The slow node is Node04 in this case, but we get one such slow response from a random node (Node01, Node02, Node03) every time we see this warning.
>>>
>>> -Karthik
>>>
>>> From: Joe Witt [mailto:[email protected]]
>>> Sent: Tuesday, October 16, 2018 7:55 AM
>>> To: [email protected]
>>> Subject: [EXT] Re: Cluster Warnings
>>>
>>> The logs show the fourth node is the slowest by far in all cases. Possibly a DNS or other related issue?
>>> But definitely focus on that node as the outlier; presuming the NiFi config is identical, it suggests system/network differences from the other nodes.
>>>
>>> Thanks
>>>
>>> On Tue, Oct 16, 2018, 9:51 AM Karthik Kothareddy (karthikk) [CONT - Type 2] <[email protected]> wrote:
>>>>
>>>> Hello,
>>>>
>>>> We're running a 4-node cluster on NiFi 1.7.1. The fourth node was added recently, and as soon as we added it, we started seeing the warnings below:
>>>>
>>>> Response time from NODE2 was slow for each of the last 3 requests made. To see more information about timing, enable DEBUG logging for org.apache.nifi.cluster.coordination.http.replication.ThreadPoolRequestReplicator
>>>>
>>>> Initially we thought the problem was with the recently added node and cross-checked all the configs on the box, and everything seemed to be just fine. After enabling DEBUG mode for cluster logging, we noticed that the warning is not specific to any node: every time we see a warning like the above, there is one slow node which takes forever to send a response, like below (in this case the slow node is NIFI04). Sometimes these lead to node disconnects needing manual intervention.
>>>>
>>>> DEBUG [Replicate Request Thread-50] o.a.n.c.c.h.r.ThreadPoolRequestReplicator Node Responses for GET /nifi-api/site-to-site (Request ID b2c6e983-5233-4007-bd54-13d21b7068d5):
>>>> NIFI04:8443: 1386 millis
>>>> NIFI02:8443: 3 millis
>>>> NIFI01:8443: 5 millis
>>>> NIFI03:8443: 3 millis
>>>> DEBUG [Replicate Request Thread-41] o.a.n.c.c.h.r.ThreadPoolRequestReplicator Node Responses for GET /nifi-api/site-to-site (Request ID d182fdab-f1d4-4ac9-97fd-e24c41dc4622):
>>>> NIFI04:8443: 1143 millis
>>>> NIFI02:8443: 22 millis
>>>> NIFI01:8443: 3 millis
>>>> NIFI03:8443: 2 millis
>>>> DEBUG [Replicate Request Thread-31] o.a.n.c.c.h.r.ThreadPoolRequestReplicator Node Responses for GET /nifi-api/site-to-site (Request ID e4726027-27c7-4bbb-8ab6-d02bb41f1920):
>>>> NIFI04:8443: 1053 millis
>>>> NIFI02:8443: 3 millis
>>>> NIFI01:8443: 3 millis
>>>> NIFI03:8443: 2 millis
>>>>
>>>> We tried changing configurations in nifi.properties, like bumping up "nifi.cluster.node.protocol.max.threads", but none of them seem to help, and we're still stuck with slow communication between the nodes. We use an external ZooKeeper, as this is our production server. Below are some of our configs:
>>>>
>>>> # cluster node properties (only configure for cluster nodes) #
>>>> nifi.cluster.is.node=true
>>>> nifi.cluster.node.address=fslhdppnifi01.imfs.micron.com
>>>> nifi.cluster.node.protocol.port=11443
>>>> nifi.cluster.node.protocol.threads=100
>>>> nifi.cluster.node.protocol.max.threads=120
>>>> nifi.cluster.node.event.history.size=25
>>>> nifi.cluster.node.connection.timeout=90 sec
>>>> nifi.cluster.node.read.timeout=90 sec
>>>> nifi.cluster.node.max.concurrent.requests=1000
>>>> nifi.cluster.firewall.file=
>>>> nifi.cluster.flow.election.max.wait.time=30 sec
>>>> nifi.cluster.flow.election.max.candidates=
>>>>
>>>> Any thoughts on why this is happening?
>>>>
>>>> -Karthik
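To make the outlier obvious across many requests, here is a quick sketch (not an official NiFi tool) that aggregates the per-node timing lines from the DEBUG excerpts quoted in this thread; it assumes log lines shaped exactly like "NIFI04:8443: 1386 millis":

```python
import re
from collections import defaultdict

# Matches the per-node timing lines emitted by
# ThreadPoolRequestReplicator at DEBUG, e.g. "NIFI04:8443: 1386 millis".
TIMING = re.compile(r"^\s*(\S+:\d+): (\d+) millis\s*$")

def summarize(lines):
    """Return {node: (count, max_ms, avg_ms)} from DEBUG log lines."""
    samples = defaultdict(list)
    for line in lines:
        m = TIMING.match(line)
        if m:
            samples[m.group(1)].append(int(m.group(2)))
    return {node: (len(ms), max(ms), sum(ms) / len(ms))
            for node, ms in samples.items()}

if __name__ == "__main__":
    # Usage: python3 summarize_replication.py < nifi-app.log
    import sys
    report = summarize(sys.stdin)
    # Sort so the node with the worst max latency comes first.
    for node, (n, worst, avg) in sorted(report.items(),
                                        key=lambda kv: -kv[1][1]):
        print(f"{node}: {n} samples, max {worst} ms, avg {avg:.1f} ms")
```

Run over the three excerpts above, this would put NIFI04 at the top with a max over 1300 ms while the other nodes stay in single or low double digits.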
