In my case it was a combined disk/network issue. If you run on Linux, I can strongly recommend having a look at netdata, which is a five-minute deploy (and a clean removal) and gives you a great way to correlate various metrics without any drama.

Martijn
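For context, the ThreadPoolRequestReplicator DEBUG output quoted further down in this thread is enabled by adding a logger entry to NiFi's conf/logback.xml — a minimal sketch, assuming the stock logback.xml layout (which has scanning enabled by default, so the change should be picked up without a restart):

```xml
<!-- conf/logback.xml: add inside the <configuration> element on each node.
     Emits per-node request replication timings to nifi-app.log. -->
<logger name="org.apache.nifi.cluster.coordination.http.replication.ThreadPoolRequestReplicator"
        level="DEBUG"/>
```

Remember to set it back to INFO (or remove it) afterwards; this logger is chatty on a busy cluster.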
On Tue, 6 Nov 2018, at 16:53, Karthik Kothareddy (karthikk) [CONT - Type 2] wrote:
> Hello Martijn,
>
> No... we're still seeing these warnings very frequently, and sometimes node disconnects as well. We tried everything we could and were not able to get to the bottom of this. When you say "related to hardware", can you please point me to the specific issue? We double-checked all our boxes for any issues and everything seemed normal.
>
> -Karthik
>
> From: Martijn Dekkers [mailto:[email protected]]
> Sent: Tuesday, November 06, 2018 8:38 AM
> To: [email protected]
> Subject: Re: [EXT] Re: Cluster Warnings
>
> Did you ever get to the bottom of this, Karthik? I have seen a similar issue, and it turned out to be related to hardware not keeping up with the increased volume of work (which is why a node was added to begin with).
>
> On Fri, 26 Oct 2018, at 19:18, Karthik Kothareddy (karthikk) [CONT - Type 2] wrote:
>> Any light on this? We're still seeing warnings every 10 seconds, which is really annoying, and we have no idea what's triggering them.
>>
>> -Karthik
>>
>> From: Karthik Kothareddy (karthikk) [CONT - Type 2]
>> Sent: Tuesday, October 16, 2018 8:27 AM
>> To: [email protected]
>> Subject: RE: [EXT] Re: Cluster Warnings
>>
>> Below is some of the nifi-app.log with DEBUG mode enabled.
>>
>> 2018-10-16 14:21:11,577 DEBUG [Replicate Request Thread-3] o.a.n.c.c.h.r.ThreadPoolRequestReplicator Received response from NIFI01:8443 for GET /nifi-api/site-to-site
>> 2018-10-16 14:21:11,577 DEBUG [Replicate Request Thread-3] o.a.n.c.c.h.r.StandardAsyncClusterResponse Received response 4 out of 4 for c910f24a-96fe-4efc-a8b8-6ef8d2674524 from NIFI01:8443
>> 2018-10-16 14:21:11,577 DEBUG [Replicate Request Thread-3] o.a.n.c.c.h.r.StandardAsyncClusterResponse Notifying all that merged response is ready for c910f24a-96fe-4efc-a8b8-6ef8d2674524
>> 2018-10-16 14:21:11,578 DEBUG [Replicate Request Thread-3] o.a.n.c.c.h.r.ThreadPoolRequestReplicator For GET /nifi-api/site-to-site (Request ID c910f24a-96fe-4efc-a8b8-6ef8d2674524), minimum response time = 2, max = 224, average = 59.0 ms
>> 2018-10-16 14:21:11,578 DEBUG [Replicate Request Thread-3] o.a.n.c.c.h.r.ThreadPoolRequestReplicator Node Responses for GET /nifi-api/site-to-site (Request ID c910f24a-96fe-4efc-a8b8-6ef8d2674524):
>> NIFI04:8443: 2 millis
>> NIFI02:8443: 5 millis
>> NIFI01:8443: 224 millis
>> NIFI03:8443: 5 millis
>>
>> 2018-10-16 14:21:11,578 DEBUG [Replicate Request Thread-3] o.a.n.c.c.h.r.ThreadPoolRequestReplicator Notified monitor java.lang.Object@1da27fc4 because request GET https://NIFI03:8443/nifi-api/site-to-site has completed
>> 2018-10-16 14:21:11,578 DEBUG [NiFi Web Server-304048] o.a.n.c.c.h.r.ThreadPoolRequestReplicator Unlocked java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock@3d507660[Read locks = 0] after replication completed for GET https://NIFI03:8443/nifi-api/site-to-site
>> 2018-10-16 14:21:11,578 DEBUG [NiFi Web Server-304048] o.a.n.c.c.h.r.StandardAsyncClusterResponse Notifying all that merged response is complete for c910f24a-96fe-4efc-a8b8-6ef8d2674524
>> 2018-10-16 14:21:11,578 DEBUG [NiFi Web Server-304048] o.a.n.c.c.h.r.StandardAsyncClusterResponse For GET https://NIFI03:8443/nifi-api/site-to-site Timing Info is as follows:
>> Verify Cluster State for All Nodes took 0 millis
>> Wait for HTTP Request Replication to be triggered for NIFI04:8443 took 0 millis
>> Wait for HTTP Request Replication to be triggered for NIFI02:8443 took 0 millis
>> Wait for HTTP Request Replication to be triggered for NIFI01:8443 took 0 millis
>> Wait for HTTP Request Replication to be triggered for NIFI03:8443 took 0 millis
>> Perform HTTP Request for NIFI04:8443 took 2 millis
>> Perform HTTP Request for NIFI03:8443 took 5 millis
>> Perform HTTP Request for NIFI02:8443 took 5 millis
>> Perform HTTP Request for NIFI01:8443 took 224 millis
>> Phase Completed for All Nodes took 224 millis
>> Notifying All Threads that Request is Complete took 0 millis
>> Total Time for All Nodes took 224 millis
>> Map/Merge Responses for All Nodes took 0 millis
>>
>> -Karthik
>>
>> From: Joe Witt [mailto:[email protected]]
>> Sent: Tuesday, October 16, 2018 8:03 AM
>> To: [email protected]
>> Subject: Re: [EXT] Re: Cluster Warnings
>>
>> Karthik,
>>
>> Understood. Do you have those logs?
>>
>> On Tue, Oct 16, 2018, 9:59 AM Karthik Kothareddy (karthikk) [CONT - Type 2] <[email protected]> wrote:
>>> Joe,
>>>
>>> The slow node is Node04 in this case, but we get one such slow response from a random node (Node01, Node02, Node03) every time we see this warning.
>>>
>>> -Karthik
>>>
>>> From: Joe Witt [mailto:[email protected]]
>>> Sent: Tuesday, October 16, 2018 7:55 AM
>>> To: [email protected]
>>> Subject: [EXT] Re: Cluster Warnings
>>>
>>> The logs show the fourth node is the slowest by far in all cases. Possibly a DNS or other related issue?
>>> But definitely focus on that node as the outlier; presuming the NiFi config is identical, it suggests system/network differences from the other nodes.
>>>
>>> Thanks
>>>
>>> On Tue, Oct 16, 2018, 9:51 AM Karthik Kothareddy (karthikk) [CONT - Type 2] <[email protected]> wrote:
>>>>
>>>> Hello,
>>>>
>>>> We're running a 4-node cluster on NiFi 1.7.1. The fourth node was added recently, and as soon as we added it, we started seeing the warnings below:
>>>>
>>>> Response time from NODE2 was slow for each of the last 3 requests made. To see more information about timing, enable DEBUG logging for org.apache.nifi.cluster.coordination.http.replication.ThreadPoolRequestReplicator
>>>>
>>>> Initially we thought the problem was with the recently added node and cross-checked all the configs on the box, and everything seemed to be just fine. After enabling DEBUG mode for cluster logging, we noticed that the warning is not specific to any node: every time we see a warning like the above, there is one slow node which takes forever to send a response, like below (in this case the slow node is NIFI04). Sometimes these lead to node disconnects needing manual intervention.
>>>>
>>>> DEBUG [Replicate Request Thread-50] o.a.n.c.c.h.r.ThreadPoolRequestReplicator Node Responses for GET /nifi-api/site-to-site (Request ID b2c6e983-5233-4007-bd54-13d21b7068d5):
>>>> NIFI04:8443: 1386 millis
>>>> NIFI02:8443: 3 millis
>>>> NIFI01:8443: 5 millis
>>>> NIFI03:8443: 3 millis
>>>> DEBUG [Replicate Request Thread-41] o.a.n.c.c.h.r.ThreadPoolRequestReplicator Node Responses for GET /nifi-api/site-to-site (Request ID d182fdab-f1d4-4ac9-97fd-e24c41dc4622):
>>>> NIFI04:8443: 1143 millis
>>>> NIFI02:8443: 22 millis
>>>> NIFI01:8443: 3 millis
>>>> NIFI03:8443: 2 millis
>>>> DEBUG [Replicate Request Thread-31] o.a.n.c.c.h.r.ThreadPoolRequestReplicator Node Responses for GET /nifi-api/site-to-site (Request ID e4726027-27c7-4bbb-8ab6-d02bb41f1920):
>>>> NIFI04:8443: 1053 millis
>>>> NIFI02:8443: 3 millis
>>>> NIFI01:8443: 3 millis
>>>> NIFI03:8443: 2 millis
>>>>
>>>> We tried changing configurations in nifi.properties, like bumping up "nifi.cluster.node.protocol.max.threads", but none of them seem to help, and we're still stuck with slow communication between the nodes. We use an external ZooKeeper, as this is our production server. Below are some of our configs:
>>>>
>>>> # cluster node properties (only configure for cluster nodes) #
>>>> nifi.cluster.is.node=true
>>>> nifi.cluster.node.address=fslhdppnifi01.imfs.micron.com
>>>> nifi.cluster.node.protocol.port=11443
>>>> nifi.cluster.node.protocol.threads=100
>>>> nifi.cluster.node.protocol.max.threads=120
>>>> nifi.cluster.node.event.history.size=25
>>>> nifi.cluster.node.connection.timeout=90 sec
>>>> nifi.cluster.node.read.timeout=90 sec
>>>> nifi.cluster.node.max.concurrent.requests=1000
>>>> nifi.cluster.firewall.file=
>>>> nifi.cluster.flow.election.max.wait.time=30 sec
>>>> nifi.cluster.flow.election.max.candidates=
>>>>
>>>> Any thoughts on why this is happening?
>>>>
>>>> -Karthik
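To make the outlier obvious across many requests, here is a quick sketch (not an official NiFi tool) that aggregates the per-node timing lines from the DEBUG excerpts quoted in this thread; it assumes log lines shaped exactly like "NIFI04:8443: 1386 millis":

```python
import re
from collections import defaultdict

# Matches the per-node timing lines emitted by
# ThreadPoolRequestReplicator at DEBUG, e.g. "NIFI04:8443: 1386 millis".
TIMING = re.compile(r"^\s*(\S+:\d+): (\d+) millis\s*$")

def summarize(lines):
    """Return {node: (count, max_ms, avg_ms)} from DEBUG log lines."""
    samples = defaultdict(list)
    for line in lines:
        m = TIMING.match(line)
        if m:
            samples[m.group(1)].append(int(m.group(2)))
    return {node: (len(ms), max(ms), sum(ms) / len(ms))
            for node, ms in samples.items()}

if __name__ == "__main__":
    # Usage: python3 summarize_replication.py < nifi-app.log
    import sys
    report = summarize(sys.stdin)
    # Sort so the node with the worst max latency comes first.
    for node, (n, worst, avg) in sorted(report.items(),
                                        key=lambda kv: -kv[1][1]):
        print(f"{node}: {n} samples, max {worst} ms, avg {avg:.1f} ms")
```

Run over the three excerpts above, this would put NIFI04 at the top with a max over 1300 ms while the other nodes stay in single or low double digits.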
