Apologies in advance if I've got this completely wrong, but I recall seeing that error when I forget to raise the open-file limit on a heavily loaded install. It's more obvious via the UI, but the logs will contain error messages about too many open files.
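A quick way to check whether this is the problem (the pid-discovery pattern and the limit value below are just illustrative assumptions, adjust to your install):

```shell
# Show the soft limit on open file descriptors for the current shell/user.
ulimit -n

# Optionally, count descriptors the NiFi JVM actually has open right now
# (uncomment; the process-name pattern is an assumption):
# ls /proc/"$(pgrep -f 'org.apache.nifi' | head -n1)"/fd | wc -l

# Persistent fixes (illustrative value of 50000):
#   systemd unit:              LimitNOFILE=50000
#   /etc/security/limits.conf: nifi  -  nofile  50000
```

If the open-descriptor count is close to the limit under load, that's your culprit.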
On Wed, 22 Mar 2023, 16:49 Mark Payne, <[email protected]> wrote:

> OK. So changing the checkpoint interval to 300 seconds might help reduce
> I/O a bit. But it will cause the repo to become much larger, and it will
> take much longer to start up whenever you restart NiFi.
>
> The variance in size between nodes is likely due to how recently each has
> checkpointed. If one stays large, like 31 GB, while the others stay small,
> that would be interesting to know.
>
> Thanks
> -Mark
>
> On Mar 22, 2023, at 12:45 PM, Joe Obernberger <[email protected]> wrote:
>
> Thanks for this, Mark. I'm not seeing any large attributes at the moment,
> but I will go through this and verify - but I did have one queue that was
> set to 100k instead of 10k.
> I set nifi.cluster.node.connection.timeout to 30 seconds (up from 5)
> and nifi.flowfile.repository.checkpoint.interval to 300 seconds (up from 20).
>
> While it's running, the size of the flowfile repo varies (wildly?) on each
> of the nodes, from 1.5 GB to over 30 GB. Disk I/O is still very high, but
> it's running now and I can use the UI. Interestingly, at this point the UI
> shows 677k FlowFiles and 1.5 GB of flow. But disk usage on the flowfile
> repo is 31 GB, 3.7 GB, and 2.6 GB on the 3 nodes. I'd love to throw some
> SSDs at this problem. I can add more NiFi nodes.
>
> -Joe
>
> On 3/22/2023 11:08 AM, Mark Payne wrote:
>
> Joe,
>
> The errors noted indicate that NiFi cannot communicate with the Registry.
> Either the Registry is offline, NiFi's Registry Client is not configured
> properly, there's a firewall in the way, etc.
>
> A FlowFile repo of 35 GB is rather huge. This would imply one of 3 things:
> - You have a huge number of FlowFiles (doesn't seem to be the case)
> - FlowFiles have a huge number of attributes, or
> - FlowFiles have 1 or more huge attribute values
>
> Typically, FlowFile attributes should be kept minimal and should never
> contain chunks of content from the FlowFile content. Often when we see
> this type of behavior it's due to using something like ExtractText or
> EvaluateJsonPath to put large blocks of content into attributes.
>
> And in this case, setting the backpressure threshold above 10,000 is even
> more concerning, as it means even greater disk I/O.
>
> Thanks
> -Mark
>
> On Mar 22, 2023, at 11:01 AM, Joe Obernberger <[email protected]> wrote:
>
> Thank you, Mark. These are SATA drives - but there's no way for the
> flowfile repo to be on multiple spindles. It's not huge - maybe 35 GB per
> node.
> I do see a lot of messages like this in the log:
>
> 2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62] o.a.nifi.groups.StandardProcessGroup Failed to synchronize StandardProcessGroup[identifier=861d3b27-aace-186d-bbb7-870c6fa65243,name=TIKA Handle Extract Metadata] with Flow Registry because could not retrieve version 1 of flow with identifier d64e72b5-16ea-4a87-af09-72c5bbcd82bf in bucket 736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection refused (Connection refused)
> 2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62] o.a.nifi.groups.StandardProcessGroup Failed to synchronize StandardProcessGroup[identifier=bcc23c03-49ef-1e41-83cb-83f22630466d,name=WriteDB] with Flow Registry because could not retrieve version 2 of flow with identifier ff197063-af31-45df-9401-e9f8ba2e4b2b in bucket 736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection refused (Connection refused)
> 2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62] o.a.nifi.groups.StandardProcessGroup Failed to synchronize StandardProcessGroup[identifier=bc913ff1-06b1-1b76-a548-7525a836560a,name=TIKA Handle Extract Metadata] with Flow Registry because could not retrieve version 1 of flow with identifier d64e72b5-16ea-4a87-af09-72c5bbcd82bf in bucket 736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection refused (Connection refused)
> 2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62] o.a.nifi.groups.StandardProcessGroup Failed to synchronize StandardProcessGroup[identifier=920c3600-2954-1c8e-b121-6d7d3d393de6,name=Save Binary Data] with Flow Registry because could not retrieve version 1 of flow with identifier 7a8c82be-1707-4e7d-a5e7-bb3825e0a38f in bucket 736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection refused (Connection refused)
>
> A clue?
>
> -Joe
>
> On 3/22/2023 10:49 AM, Mark Payne wrote:
>
> Joe,
>
> 1.8 million FlowFiles is not a concern. But when you say "Should I reduce
> the queue sizes?" it makes me wonder if they're all in a single queue?
> Generally, you should leave the backpressure threshold at the default
> 10,000 FlowFile max. Increasing this can lead to huge amounts of swapping,
> which will drastically reduce performance and increase disk utilization
> very significantly.
>
> Also, from the diagnostics, it looks like you've got a lot of CPU cores,
> but you're not using much of them. And based on the amount of disk space
> available and the fact that you're seeing 100% utilization, I'm wondering
> if you're using spinning disks rather than SSDs? I would highly recommend
> always running NiFi with SSD/NVMe drives. Absent that, if you have
> multiple disk drives, you could also configure the content repository to
> span multiple disks, in order to spread that load.
>
> Thanks
> -Mark
>
> On Mar 22, 2023, at 10:41 AM, Joe Obernberger <[email protected]> wrote:
>
> Thank you. Was able to get in.
> Currently there are 1.8 million FlowFiles and 3.2 GB. Is this too much for
> a 3-node cluster with multiple spindles each (SATA drives)?
> Should I reduce the queue sizes?
>
> -Joe
>
> On 3/22/2023 10:23 AM, Phillip Lord wrote:
>
> Joe,
>
> If you need the UI to come back up, try setting the autoresume setting in
> nifi.properties to false and restarting the node(s). This will bring every
> component/controller service up stopped/disabled and may provide some
> breathing room for the UI to become available again.
>
> Phil
>
> On Mar 22, 2023 at 10:20 AM -0400, Joe Obernberger <[email protected]> wrote:
>
> atop shows the disk as being all red with I/O - 100% utilization. There
> are a lot of FlowFiles currently trying to run through, but I can't
> monitor it because... the UI won't load.
>
> -Joe
>
> On 3/22/2023 10:16 AM, Mark Payne wrote:
>
> Joe,
>
> I'd recommend taking a look at garbage collection. It is far more likely
> the culprit than disk I/O.
>
> Thanks
> -Mark
>
> On Mar 22, 2023, at 10:12 AM, Joe Obernberger <[email protected]> wrote:
>
> I'm getting "java.net.SocketTimeoutException: timeout" from the user
> interface of NiFi when load is heavy. This is 1.18.0 running on a 3-node
> cluster. Disk I/O is high, and when that happens, I can't get into the UI
> to stop any of the processors.
> Any ideas?
>
> I have put the flowfile repository and content repository on different
> disks on the 3 nodes, but disk usage is still so high that I can't get in.
> Thank you!
>
> -Joe
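For anyone following along, the settings discussed in this thread live in conf/nifi.properties. A sketch of the relevant entries, using the values mentioned above (the directory paths are placeholder assumptions; verify property names against your NiFi 1.18.0 install before applying):

```properties
# Cluster connection timeout (raised from the 5 sec default, per this thread)
nifi.cluster.node.connection.timeout=30 secs

# FlowFile repository checkpoint interval (raised from 20 secs; larger values
# reduce I/O but grow the repo and slow restarts, as Mark notes)
nifi.flowfile.repository.checkpoint.interval=300 secs

# Phil's tip: start all components stopped/disabled after a restart
nifi.flowcontroller.autoResumeState=false

# Mark's suggestion: span the content repository across multiple disks by
# adding named directory entries (example paths are assumptions)
nifi.content.repository.directory.default=/disk1/content_repository
nifi.content.repository.directory.content2=/disk2/content_repository
nifi.content.repository.directory.content3=/disk3/content_repository
```

Restart the node(s) after editing for the changes to take effect.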
