Raising this thread from the dead...
Having issues with IO to the flowfile repository. NiFi shows 500k
flow files and a size of ~1.7G, but the size on disk on each of the 4
nodes is massive (over 100G), and disk IO to the flowfile spindle is
just pegged doing writes.
I do have ExtractText processors that take the flowfile content (.*) and
put it into an attribute, but those extracted values are maybe 10k each
at most. How can I find out which component (there are some 2,200 of
them) is causing the issue? I think I'm doing something fundamentally
wrong with NiFi. :)
Perhaps I should lower the backpressure thresholds on all the queues to
something less than 10k objects / 1G?
Under Cluster / FLOWFILE STORAGE, one of the nodes shows 3.74 TB of
usage, but it's actually ~150G on disk. The other nodes report correctly.
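For comparison, the actual on-disk size on each node can be checked with
plain du against whatever nifi.flowfile.repository.directory points to in
nifi.properties (the path below is just a placeholder):

    du -sh /data/nifi/flowfile_repository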
Ideas on what to debug?
Thank you!
-Joe (NiFi 1.18)
On 3/22/2023 12:49 PM, Mark Payne wrote:
OK. So changing the checkpoint interval to 300 seconds might help
reduce IO a bit. But it will cause the repo to become much larger, and
it will take much longer to start up whenever you restart NiFi.
The variance in size between nodes is likely due to how recently each
one has checkpointed. If one stays large, like 31 GB, while the others
stay small, that would be interesting to know.
Thanks
-Mark
On Mar 22, 2023, at 12:45 PM, Joe Obernberger
<[email protected]> wrote:
Thanks for this, Mark. I'm not seeing any large attributes at the
moment but will go through this and verify - but I did have one queue
that was set to 100k instead of 10k.
I set the nifi.cluster.node.connection.timeout to 30 seconds (up from
5) and the nifi.flowfile.repository.checkpoint.interval to 300
seconds (up from 20).
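For reference, the relevant nifi.properties entries now look roughly like
this (values as described above; the defaults were 5 secs and 20 secs):

    nifi.cluster.node.connection.timeout=30 secs
    nifi.flowfile.repository.checkpoint.interval=300 secs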
While it's running, the size of the flowfile repo varies (wildly?) on
each of the nodes, from 1.5G to over 30G. Disk IO is still very high,
but it's running now and I can use the UI. Interestingly, at this
point the UI shows 677k files and 1.5G of flow, but disk usage on the
flowfile repo is 31G, 3.7G, and 2.6G on the 3 nodes. I'd love to
throw some SSDs at this problem. I can add more NiFi nodes.
-Joe
On 3/22/2023 11:08 AM, Mark Payne wrote:
Joe,
The errors noted indicate that NiFi cannot communicate with the
Registry. Either the Registry is offline, NiFi’s Registry Client is
not configured properly, there’s a firewall in the way, etc.
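A quick way to verify connectivity from each NiFi node is to hit the
Registry's REST API directly; the host and port below are placeholders
(18080 is the Registry's default HTTP port):

    curl http://registry-host:18080/nifi-registry-api/buckets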
A FlowFile repo of 35 GB is rather huge. This would imply one of three
things:
- You have a huge number of FlowFiles (doesn’t seem to be the case),
- FlowFiles have a huge number of attributes, or
- FlowFiles have one or more huge attribute values.
Typically, FlowFile attributes should be kept minimal and should
never contain large chunks of the FlowFile content. Often when we see
this type of behavior it’s due to using something like ExtractText or
EvaluateJsonPath to put large blocks of content into attributes.
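As a sketch of what to check on those ExtractText processors (property
names as in the processor's documentation; the dynamic property and regex
below are just illustrative), the amount of content that can end up in an
attribute is bounded by the capture group length:

    Maximum Buffer Size           1 MB    <- only this much content is evaluated per FlowFile
    Maximum Capture Group Length  1024    <- caps the characters any capture group (and thus attribute) can hold
    my.extracted.value            (.*)    <- a dynamic property like this copies whatever the regex captures into attributes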
And in this case, setting Backpressure Threshold above 10,000 is
even more concerning, as it means even greater disk I/O.
Thanks
-Mark
On Mar 22, 2023, at 11:01 AM, Joe Obernberger
<[email protected]> wrote:
Thank you Mark. These are SATA drives - but there's no way for the
flowfile repo to be on multiple spindles. It's not huge - maybe
35G per node.
I do see a lot of messages like this in the log:
2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62]
o.a.nifi.groups.StandardProcessGroup Failed to synchronize
StandardProcessGroup[identifier=861d3b27-aace-186d-bbb7-870c6fa65243,name=TIKA
Handle Extract Metadata] with Flow Registry because could not
retrieve version 1 of flow with identifier
d64e72b5-16ea-4a87-af09-72c5bbcd82bf in bucket
736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection refused
(Connection refused)
2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62]
o.a.nifi.groups.StandardProcessGroup Failed to synchronize
StandardProcessGroup[identifier=bcc23c03-49ef-1e41-83cb-83f22630466d,name=WriteDB]
with Flow Registry because could not retrieve version 2 of flow
with identifier ff197063-af31-45df-9401-e9f8ba2e4b2b in bucket
736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection refused
(Connection refused)
2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62]
o.a.nifi.groups.StandardProcessGroup Failed to synchronize
StandardProcessGroup[identifier=bc913ff1-06b1-1b76-a548-7525a836560a,name=TIKA
Handle Extract Metadata] with Flow Registry because could not
retrieve version 1 of flow with identifier
d64e72b5-16ea-4a87-af09-72c5bbcd82bf in bucket
736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection refused
(Connection refused)
2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62]
o.a.nifi.groups.StandardProcessGroup Failed to synchronize
StandardProcessGroup[identifier=920c3600-2954-1c8e-b121-6d7d3d393de6,name=Save
Binary Data] with Flow Registry because could not retrieve version
1 of flow with identifier 7a8c82be-1707-4e7d-a5e7-bb3825e0a38f in
bucket 736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection
refused (Connection refused)
A clue?
-joe
On 3/22/2023 10:49 AM, Mark Payne wrote:
Joe,
1.8 million FlowFiles is not a concern. But when you say “Should I
reduce the queue sizes?” it makes me wonder if they’re all in a
single queue?
Generally, you should leave the backpressure threshold at the
default 10,000 FlowFile max. Increasing this can lead to huge
amounts of swapping, which will drastically reduce performance and
increase disk utilization very significantly.
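For reference, the point at which a single overfilled queue starts
swapping FlowFiles to disk is controlled in nifi.properties (default
shown below); the per-connection backpressure object threshold itself is
set on each connection in the UI:

    nifi.queue.swap.threshold=20000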
Also, from the diagnostics it looks like you’ve got a lot of CPU
cores but you’re not using much of them. And based on the amount of
disk space available and the fact that you’re seeing 100% utilization,
I’m wondering if you’re using spinning disks rather than SSDs? I
would highly recommend always running NiFi on SSD/NVMe drives.
Absent that, if you have multiple disk drives, you could also
configure the content repository to span multiple disks, in order
to spread that load.
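As a sketch, spanning the content repository across disks is done in
nifi.properties with one entry per disk; the mount points below are
placeholders and the suffix after "directory." is an arbitrary name:

    nifi.content.repository.directory.default=/disk1/content_repository
    nifi.content.repository.directory.disk2=/disk2/content_repository
    nifi.content.repository.directory.disk3=/disk3/content_repository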
Thanks
-Mark
On Mar 22, 2023, at 10:41 AM, Joe Obernberger
<[email protected]> wrote:
Thank you. I was able to get in.
Currently there are 1.8 million flow files and 3.2G. Is this too
much for a 3-node cluster with multiple spindles each (SATA drives)?
Should I reduce the queue sizes?
-Joe
On 3/22/2023 10:23 AM, Phillip Lord wrote:
Joe,
If you need the UI to come back up, try setting the autoresume
setting in nifi.properties to false and restarting the node(s).
This will bring every component/controller service up in a
stopped/disabled state and may provide some breathing room for the UI
to become available again.
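That setting, for reference (the default is true):

    nifi.flowcontroller.autoResumeState=false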
Phil
On Mar 22, 2023 at 10:20 AM -0400, Joe Obernberger
<[email protected]>, wrote:
atop shows the disk as being all red with IO - 100% utilization. There
are a lot of flowfiles currently trying to run through, but I can't
monitor it because... the UI won't load.
-Joe
On 3/22/2023 10:16 AM, Mark Payne wrote:
Joe,
I’d recommend taking a look at garbage collection. It is far
more likely the culprit than disk I/O.
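For a quick look at GC while the UI is unresponsive, jstat against the
NiFi JVM is enough; the pid is whatever bin/nifi.sh status (or ps)
reports, and 5000 is just a 5-second sampling interval:

    jstat -gcutil <nifi_pid> 5000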
Thanks
-Mark
On Mar 22, 2023, at 10:12 AM, Joe Obernberger
<[email protected]> wrote:
I'm getting "java.net.SocketTimeoutException: timeout" from
the user interface of NiFi when load is heavy. This is 1.18.0
running on a 3 node cluster. Disk IO is high and when that
happens, I can't get into the UI to stop any of the processors.
Any ideas?
I have put the flowfile repository and content repository on
different disks on the 3 nodes, but disk usage is still so
high that I can't get in.
Thank you!
-Joe