Apologies in advance if I've got this completely wrong, but I recall seeing
that error when I forget to increase the open-file limit for a heavily
loaded install. It is more obvious via the UI, but the logs will have error
messages about too many open files.
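
If it helps, a quick sketch of how I check and raise that limit (the user
name and values below are illustrative, not recommendations):

    # Check the current per-process open-file limit for the user running NiFi
    ulimit -n

    # Raise it persistently by adding lines like these to
    # /etc/security/limits.conf, then logging the user out and back in:
    #   nifi  soft  nofile  50000
    #   nifi  hard  nofile  50000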

On Wed, 22 Mar 2023, 16:49 Mark Payne, <[email protected]> wrote:

> OK. So changing the checkpoint interval to 300 seconds might help reduce
> IO a bit. But it will cause the repo to become much larger, and it will
> take much longer to start up whenever you restart NiFi.
>
> The variance in size between nodes is likely due to how recently each has
> checkpointed. If one stays large, like 31 GB, while the others stay small,
> that would be interesting to know.
>
> Thanks
> -Mark
>
>
> On Mar 22, 2023, at 12:45 PM, Joe Obernberger <
> [email protected]> wrote:
>
> Thanks for this, Mark.  I'm not seeing any large attributes at the moment,
> but I will go through this and verify.  I did have one queue that was set
> to 100k instead of 10k.
> I set the nifi.cluster.node.connection.timeout to 30 seconds (up from 5)
> and the nifi.flowfile.repository.checkpoint.interval to 300 seconds (up
> from 20).
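>
> For reference, those two entries in my nifi.properties now look like this
> (a sketch of my settings, not a general recommendation):
>
>     nifi.cluster.node.connection.timeout=30 secs
>     nifi.flowfile.repository.checkpoint.interval=300 secs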
>
> While it's running, the size of the flowfile repo varies (wildly?) on each
> of the nodes, from 1.5G to over 30G.  Disk IO is still very high, but it's
> running now and I can use the UI.  Interestingly, at this point the UI shows
> 677k files and 1.5G of flow, but disk usage on the flowfile repo is 31G,
> 3.7G, and 2.6G on the 3 nodes.  I'd love to throw some SSDs at this
> problem.  I can add more NiFi nodes.
>
> -Joe
> On 3/22/2023 11:08 AM, Mark Payne wrote:
>
> Joe,
>
> The errors noted indicate that NiFi cannot communicate with the Registry:
> either the Registry is offline, NiFi’s Registry Client is not configured
> properly, or there’s a firewall in the way.
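>
> One quick way to test from the NiFi host (a sketch; this assumes an
> unsecured Registry on its default HTTP port of 18080 - adjust host and
> port to your setup):
>
>     # Should return a JSON list of buckets if the Registry is reachable
>     curl -v http://<registry-host>:18080/nifi-registry-api/buckets
>
> A "Connection refused" here points at the Registry process or a firewall,
> not at NiFi's Registry Client configuration.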
>
> A FlowFile repo of 35 GB is rather huge. This would imply one of three things:
> - You have a huge number of FlowFiles (doesn’t seem to be the case)
> - FlowFiles have a huge number of attributes, or
> - FlowFiles have one or more huge attribute values.
>
> Typically, FlowFile attributes should be kept minimal and should never
> contain chunks of the FlowFile’s content. Often when we see this type of
> behavior, it’s due to using something like ExtractText or EvaluateJsonPath
> to put large blocks of content into attributes.
>
> And in this case, setting Backpressure Threshold above 10,000 is even more
> concerning, as it means even greater disk I/O.
>
> Thanks
> -Mark
>
>
> On Mar 22, 2023, at 11:01 AM, Joe Obernberger
> <[email protected]> wrote:
>
> Thank you, Mark.  These are SATA drives, but there's no way for the
> flowfile repo to be on multiple spindles.  It's not huge - maybe 35G per
> node.
> I do see a lot of messages like this in the log:
>
> 2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62]
> o.a.nifi.groups.StandardProcessGroup Failed to synchronize
> StandardProcessGroup[identifier=861d3b27-aace-186d-bbb7-870c6fa65243,name=TIKA
> Handle Extract Metadata] with Flow Registry because could not retrieve
> version 1 of flow with identifier d64e72b5-16ea-4a87-af09-72c5bbcd82bf in
> bucket 736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection refused
> (Connection refused)
> 2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62]
> o.a.nifi.groups.StandardProcessGroup Failed to synchronize
> StandardProcessGroup[identifier=bcc23c03-49ef-1e41-83cb-83f22630466d,name=WriteDB]
> with Flow Registry because could not retrieve version 2 of flow with
> identifier ff197063-af31-45df-9401-e9f8ba2e4b2b in bucket
> 736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection refused (Connection
> refused)
> 2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62]
> o.a.nifi.groups.StandardProcessGroup Failed to synchronize
> StandardProcessGroup[identifier=bc913ff1-06b1-1b76-a548-7525a836560a,name=TIKA
> Handle Extract Metadata] with Flow Registry because could not retrieve
> version 1 of flow with identifier d64e72b5-16ea-4a87-af09-72c5bbcd82bf in
> bucket 736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection refused
> (Connection refused)
> 2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62]
> o.a.nifi.groups.StandardProcessGroup Failed to synchronize
> StandardProcessGroup[identifier=920c3600-2954-1c8e-b121-6d7d3d393de6,name=Save
> Binary Data] with Flow Registry because could not retrieve version 1 of
> flow with identifier 7a8c82be-1707-4e7d-a5e7-bb3825e0a38f in bucket
> 736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection refused (Connection
> refused)
>
> A clue?
>
> -Joe
> On 3/22/2023 10:49 AM, Mark Payne wrote:
>
> Joe,
>
> 1.8 million FlowFiles is not a concern. But when you say “Should I reduce
> the queue sizes?” it makes me wonder whether they’re all in a single queue.
> Generally, you should leave the backpressure threshold at the default
> 10,000 FlowFile max. Increasing it can lead to huge amounts of swapping,
> which will drastically reduce performance and significantly increase disk
> utilization.
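>
> For context, the point at which a queue starts swapping FlowFiles to disk
> is controlled in nifi.properties; shown here at its default value (a
> sketch, not a suggestion to change it):
>
>     # FlowFiles beyond this count in a single queue are swapped out to disk
>     nifi.queue.swap.threshold=20000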
>
> Also, from the diagnostics, it looks like you’ve got a lot of CPU cores
> but aren’t using many of them. And based on the amount of disk space
> available and the fact that you’re seeing 100% utilization, I’m wondering
> if you’re using spinning disks rather than SSDs. I would highly recommend
> always running NiFi with SSD/NVMe drives. Absent that, if you have multiple
> disk drives, you could also configure the content repository to span
> multiple disks, in order to spread that load.
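>
> As a sketch of that layout in nifi.properties (the "disk1"/"disk2" suffixes
> and mount points are illustrative; NiFi will use every directory listed
> under this property prefix):
>
>     nifi.content.repository.directory.disk1=/mnt/disk1/content_repository
>     nifi.content.repository.directory.disk2=/mnt/disk2/content_repository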
>
> Thanks
> -Mark
>
> On Mar 22, 2023, at 10:41 AM, Joe Obernberger
> <[email protected]> wrote:
>
> Thank you.  Was able to get in.
> Currently there are 1.8 million FlowFiles and 3.2G.  Is this too much for
> a 3-node cluster with multiple spindles each (SATA drives)?
> Should I reduce the queue sizes?
>
> -Joe
> On 3/22/2023 10:23 AM, Phillip Lord wrote:
>
> Joe,
>
> If you need the UI to come back up, try setting the autoresume setting in
> nifi.properties to false and restarting the node(s).
> This will bring every component and controller service up stopped/disabled,
> which may provide some breathing room for the UI to become available again.
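>
> The property in question looks like this (a sketch; false means every
> component comes up stopped and every controller service disabled):
>
>     nifi.flowcontroller.autoResumeState=false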
>
> Phil
> On Mar 22, 2023 at 10:20 AM -0400, Joe Obernberger
> <[email protected]> wrote:
>
> atop shows the disk as being all red with IO - 100% utilization.  There
> are a lot of flowfiles currently trying to run through, but I can't
> monitor it because the UI won't load.
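>
> For anyone following along, something like this shows the same picture
> without atop (assumes the sysstat package is installed):
>
>     # Extended per-device stats every 5 seconds; watch the %util column
>     iostat -x 5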
>
> -Joe
>
> On 3/22/2023 10:16 AM, Mark Payne wrote:
>
> Joe,
>
> I’d recommend taking a look at garbage collection. It is far more likely
> the culprit than disk I/O.
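>
> One quick way to check (a sketch; assumes the JDK tools are on the PATH
> and that the NiFi JVM can be found by its main class name):
>
>     # Find the NiFi JVM's PID, then sample GC activity every 5 seconds
>     NIFI_PID=$(pgrep -f org.apache.nifi.NiFi)
>     jstat -gcutil "$NIFI_PID" 5000
>
> If the FGC/FGCT columns climb quickly, the node is spending its time in
> full garbage collections rather than doing work.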
>
> Thanks
> -Mark
>
> On Mar 22, 2023, at 10:12 AM, Joe Obernberger
> <[email protected]> wrote:
>
> I'm getting "java.net.SocketTimeoutException: timeout" from the NiFi user
> interface when load is heavy.  This is 1.18.0 running on a 3-node cluster.
> Disk IO is high, and when that happens I can't get into the UI to stop any
> of the processors.
> Any ideas?
>
> I have put the flowfile repository and content repository on different
> disks on the 3 nodes, but disk usage is still so high that I can't get in.
> Thank you!
>
> -Joe
>
>