Re: UI SocketTimeoutException - heavy IO

2023-03-22 Thread James Srinivasan
Apologies in advance if I've got this completely wrong, but I recall that
error if I forget to increase the limit of open files for a heavily loaded
install. It is more obvious via the UI but the logs will have error
messages about too many open files.
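If that is what's happening, the check/fix is roughly the following (exact user name and
limits depend on the install; 50000 is the value the NiFi admin guide's
configuration best-practices section suggests):

    # as the user running NiFi, check the current open-file limit
    ulimit -n

    # /etc/security/limits.conf (or a drop-in under limits.d) - raise it
    nifi  soft  nofile  50000
    nifi  hard  nofile  50000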

On Wed, 22 Mar 2023, 16:49 Mark Payne,  wrote:

> OK. So changing the checkpoint interval to 300 seconds might help reduce
> IO a bit. But it will cause the repo to become much larger, and it will
> take much longer to startup whenever you restart NiFi.
>
> The variance in size between nodes is likely due to how recently it’s
> checkpointed. If it stays large like 31 GB while the others stay small, that
> would be interesting to know.
>
> Thanks
> -Mark
>
>
> On Mar 22, 2023, at 12:45 PM, Joe Obernberger <
> joseph.obernber...@gmail.com> wrote:
>
> Thanks for this Mark.  I'm not seeing any large attributes at the moment
> but will go through this and verify - but I did have one queue that was set
> to 100k instead of 10k.
> I set the nifi.cluster.node.connection.timeout to 30 seconds (up from 5)
> and the nifi.flowfile.repository.checkpoint.interval to 300 seconds (up
> from 20).
>
> While it's running the size of the flowfile repo varies (wildly?) on each
> of the nodes from 1.5G to over 30G.  Disk IO is still very high, but it's
> running now and I can use the UI.  Interestingly at this point the UI shows
> 677k files and 1.5G of flow.  But disk usage on the flowfile repo is 31G,
> 3.7G, and 2.6G on the 3 nodes.  I'd love to throw some SSDs at this
> problem.  I can add more nifi nodes.
>
> -Joe
> On 3/22/2023 11:08 AM, Mark Payne wrote:
>
> Joe,
>
> The errors noted are indicating that NiFi cannot communicate with
> registry. Either the registry is offline, NiFi’s Registry Client is not
> configured properly, there’s a firewall in the way, etc.
>
> A FlowFile repo of 35 GB is rather huge. This would imply one of 3 things:
> - You have a huge number of FlowFiles (doesn’t seem to be the case)
> - FlowFiles have a huge number of attributes
> or
> - FlowFiles have 1 or more huge attribute values.
>
> Typically, FlowFile attributes should be kept minimal and should never
> contain chunks of contents from the FlowFile content. Often when we see
> this type of behavior it’s due to using something like ExtractText or
> EvaluateJsonPath to put large blocks of content into attributes.
>
> And in this case, setting Backpressure Threshold above 10,000 is even more
> concerning, as it means even greater disk I/O.
>
> Thanks
> -Mark
>
>
> On Mar 22, 2023, at 11:01 AM, Joe Obernberger
>   wrote:
>
> Thank you Mark.  These are SATA drives - but there's no way for the
> flowfile repo to be on multiple spindles.  It's not huge - maybe 35G per
> node.
> I do see a lot of messages like this in the log:
>
> 2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62]
> o.a.nifi.groups.StandardProcessGroup Failed to synchronize
> StandardProcessGroup[identifier=861d3b27-aace-186d-bbb7-870c6fa65243,name=TIKA
> Handle Extract Metadata] with Flow Registry because could not retrieve
> version 1 of flow with identifier d64e72b5-16ea-4a87-af09-72c5bbcd82bf in
> bucket 736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection refused
> (Connection refused)
> 2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62]
> o.a.nifi.groups.StandardProcessGroup Failed to synchronize
> StandardProcessGroup[identifier=bcc23c03-49ef-1e41-83cb-83f22630466d,name=WriteDB]
> with Flow Registry because could not retrieve version 2 of flow with
> identifier ff197063-af31-45df-9401-e9f8ba2e4b2b in bucket
> 736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection refused (Connection
> refused)
> 2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62]
> o.a.nifi.groups.StandardProcessGroup Failed to synchronize
> StandardProcessGroup[identifier=bc913ff1-06b1-1b76-a548-7525a836560a,name=TIKA
> Handle Extract Metadata] with Flow Registry because could not retrieve
> version 1 of flow with identifier d64e72b5-16ea-4a87-af09-72c5bbcd82bf in
> bucket 736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection refused
> (Connection refused)
> 2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62]
> o.a.nifi.groups.StandardProcessGroup Failed to synchronize
> StandardProcessGroup[identifier=920c3600-2954-1c8e-b121-6d7d3d393de6,name=Save
> Binary Data] with Flow Registry because could not retrieve version 1 of
> flow with identifier 7a8c82be-1707-4e7d-a5e7-bb3825e0a38f in bucket
> 736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection refused (Connection
> refused)
>
> A clue?
>
> -joe
> On 3/22/2023 10:49 AM, Mark Payne wrote:
>
> Joe,
>
> 1.8 million FlowFiles is not a concern. But when you say “Should I reduce
> the queue sizes?” it makes me wonder if they’re all in a single queue?
> Generally, you should leave the backpressure threshold at the default
> 10,000 FlowFile max. Increasing this can lead to huge amounts of swapping,
> which will drastically reduce performance and increase disk utilization very significantly.

Re: UI SocketTimeoutException - heavy IO

2023-03-22 Thread Mark Payne
OK. So changing the checkpoint interval to 300 seconds might help reduce IO a 
bit. But it will cause the repo to become much larger, and it will take much 
longer to startup whenever you restart NiFi.

The variance in size between nodes is likely due to how recently it’s 
checkpointed. If it stays large like 31 GB while the others stay small, that 
would be interesting to know.

Thanks
-Mark


On Mar 22, 2023, at 12:45 PM, Joe Obernberger  
wrote:


Thanks for this Mark.  I'm not seeing any large attributes at the moment but 
will go through this and verify - but I did have one queue that was set to 100k 
instead of 10k.
I set the nifi.cluster.node.connection.timeout to 30 seconds (up from 5) and 
the nifi.flowfile.repository.checkpoint.interval to 300 seconds (up from 20).

While it's running the size of the flowfile repo varies (wildly?) on each of 
the nodes from 1.5G to over 30G.  Disk IO is still very high, but it's running 
now and I can use the UI.  Interestingly at this point the UI shows 677k files 
and 1.5G of flow.  But disk usage on the flowfile repo is 31G, 3.7G, and 2.6G 
on the 3 nodes.  I'd love to throw some SSDs at this problem.  I can add more 
nifi nodes.

-Joe

On 3/22/2023 11:08 AM, Mark Payne wrote:
Joe,

The errors noted are indicating that NiFi cannot communicate with registry. 
Either the registry is offline, NiFi’s Registry Client is not configured 
properly, there’s a firewall in the way, etc.

A FlowFile repo of 35 GB is rather huge. This would imply one of 3 things:
- You have a huge number of FlowFiles (doesn’t seem to be the case)
- FlowFiles have a huge number of attributes
or
- FlowFiles have 1 or more huge attribute values.

Typically, FlowFile attributes should be kept minimal and should never contain 
chunks of contents from the FlowFile content. Often when we see this type of 
behavior it’s due to using something like ExtractText or EvaluateJsonPath to 
put large blocks of content into attributes.

And in this case, setting Backpressure Threshold above 10,000 is even more 
concerning, as it means even greater disk I/O.

Thanks
-Mark


On Mar 22, 2023, at 11:01 AM, Joe Obernberger 
 wrote:


Thank you Mark.  These are SATA drives - but there's no way for the flowfile 
repo to be on multiple spindles.  It's not huge - maybe 35G per node.
I do see a lot of messages like this in the log:

2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62] 
o.a.nifi.groups.StandardProcessGroup Failed to synchronize 
StandardProcessGroup[identifier=861d3b27-aace-186d-bbb7-870c6fa65243,name=TIKA 
Handle Extract Metadata] with Flow Registry because could not retrieve version 
1 of flow with identifier d64e72b5-16ea-4a87-af09-72c5bbcd82bf in bucket 
736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection refused (Connection 
refused)
2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62] 
o.a.nifi.groups.StandardProcessGroup Failed to synchronize 
StandardProcessGroup[identifier=bcc23c03-49ef-1e41-83cb-83f22630466d,name=WriteDB]
 with Flow Registry because could not retrieve version 2 of flow with 
identifier ff197063-af31-45df-9401-e9f8ba2e4b2b in bucket 
736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection refused (Connection 
refused)
2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62] 
o.a.nifi.groups.StandardProcessGroup Failed to synchronize 
StandardProcessGroup[identifier=bc913ff1-06b1-1b76-a548-7525a836560a,name=TIKA 
Handle Extract Metadata] with Flow Registry because could not retrieve version 
1 of flow with identifier d64e72b5-16ea-4a87-af09-72c5bbcd82bf in bucket 
736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection refused (Connection 
refused)
2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62] 
o.a.nifi.groups.StandardProcessGroup Failed to synchronize 
StandardProcessGroup[identifier=920c3600-2954-1c8e-b121-6d7d3d393de6,name=Save 
Binary Data] with Flow Registry because could not retrieve version 1 of flow 
with identifier 7a8c82be-1707-4e7d-a5e7-bb3825e0a38f in bucket 
736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection refused (Connection 
refused)

A clue?

-joe

On 3/22/2023 10:49 AM, Mark Payne wrote:
Joe,

1.8 million FlowFiles is not a concern. But when you say “Should I reduce the 
queue sizes?” it makes me wonder if they’re all in a single queue?
Generally, you should leave the backpressure threshold at the default 10,000 
FlowFile max. Increasing this can lead to huge amounts of swapping, which will 
drastically reduce performance and increase disk utilization very significantly.

Also from the diagnostics, it looks like you’ve got a lot of CPU cores, but 
you’re not using much. And based on the amount of disk space available and the 
fact that you’re seeing 100% utilization, I’m wondering if you’re using 
spinning disks, rather than SSDs? I would highly recommend always running NiFi 
with ssd/nvme drives. Absent that, if you have multiple disk drives, you could 
also configure the content repository to span multiple disks, in order to spread that load.

Re: UI SocketTimeoutException - heavy IO

2023-03-22 Thread Joe Obernberger
Thanks for this Mark.  I'm not seeing any large attributes at the moment 
but will go through this and verify - but I did have one queue that was 
set to 100k instead of 10k.
I set the nifi.cluster.node.connection.timeout to 30 seconds (up from 5) 
and the nifi.flowfile.repository.checkpoint.interval to 300 seconds (up 
from 20).
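For anyone following along, those are plain nifi.properties entries, roughly like
this (the exact "sec"/"secs" syntax should match whatever the existing file uses):

    nifi.cluster.node.connection.timeout=30 sec
    nifi.flowfile.repository.checkpoint.interval=300 secs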


While it's running the size of the flowfile repo varies (wildly?) on 
each of the nodes from 1.5G to over 30G.  Disk IO is still very high, 
but it's running now and I can use the UI.  Interestingly at this point 
the UI shows 677k files and 1.5G of flow.  But disk usage on the 
flowfile repo is 31G, 3.7G, and 2.6G on the 3 nodes. I'd love to throw 
some SSDs at this problem.  I can add more nifi nodes.


-Joe

On 3/22/2023 11:08 AM, Mark Payne wrote:

Joe,

The errors noted are indicating that NiFi cannot communicate with 
registry. Either the registry is offline, NiFi’s Registry Client is 
not configured properly, there’s a firewall in the way, etc.


A FlowFile repo of 35 GB is rather huge. This would imply one of 3 things:
- You have a huge number of FlowFiles (doesn’t seem to be the case)
- FlowFiles have a huge number of attributes
or
- FlowFiles have 1 or more huge attribute values.

Typically, FlowFile attributes should be kept minimal and should never 
contain chunks of contents from the FlowFile content. Often when we 
see this type of behavior it’s due to using something like ExtractText 
or EvaluateJsonPath to put large blocks of content into attributes.


And in this case, setting Backpressure Threshold above 10,000 is even 
more concerning, as it means even greater disk I/O.


Thanks
-Mark


On Mar 22, 2023, at 11:01 AM, Joe Obernberger 
 wrote:


Thank you Mark.  These are SATA drives - but there's no way for the 
flowfile repo to be on multiple spindles.  It's not huge - maybe 35G 
per node.

I do see a lot of messages like this in the log:

2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62] 
o.a.nifi.groups.StandardProcessGroup Failed to synchronize 
StandardProcessGroup[identifier=861d3b27-aace-186d-bbb7-870c6fa65243,name=TIKA 
Handle Extract Metadata] with Flow Registry because could not 
retrieve version 1 of flow with identifier 
d64e72b5-16ea-4a87-af09-72c5bbcd82bf in bucket 
736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection refused 
(Connection refused)
2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62] 
o.a.nifi.groups.StandardProcessGroup Failed to synchronize 
StandardProcessGroup[identifier=bcc23c03-49ef-1e41-83cb-83f22630466d,name=WriteDB] 
with Flow Registry because could not retrieve version 2 of flow with 
identifier ff197063-af31-45df-9401-e9f8ba2e4b2b in bucket 
736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection refused 
(Connection refused)
2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62] 
o.a.nifi.groups.StandardProcessGroup Failed to synchronize 
StandardProcessGroup[identifier=bc913ff1-06b1-1b76-a548-7525a836560a,name=TIKA 
Handle Extract Metadata] with Flow Registry because could not 
retrieve version 1 of flow with identifier 
d64e72b5-16ea-4a87-af09-72c5bbcd82bf in bucket 
736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection refused 
(Connection refused)
2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62] 
o.a.nifi.groups.StandardProcessGroup Failed to synchronize 
StandardProcessGroup[identifier=920c3600-2954-1c8e-b121-6d7d3d393de6,name=Save 
Binary Data] with Flow Registry because could not retrieve version 1 
of flow with identifier 7a8c82be-1707-4e7d-a5e7-bb3825e0a38f in 
bucket 736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection 
refused (Connection refused)


A clue?

-joe

On 3/22/2023 10:49 AM, Mark Payne wrote:

Joe,

1.8 million FlowFiles is not a concern. But when you say “Should I 
reduce the queue sizes?” it makes me wonder if they’re all in a 
single queue?
Generally, you should leave the backpressure threshold at the 
default 10,000 FlowFile max. Increasing this can lead to huge 
amounts of swapping, which will drastically reduce performance and 
increase disk utilization very significantly.


Also from the diagnostics, it looks like you’ve got a lot of CPU 
cores, but you’re not using much. And based on the amount of disk 
space available and the fact that you’re seeing 100% utilization, 
I’m wondering if you’re using spinning disks, rather than SSDs? I 
would highly recommend always running NiFi with ssd/nvme drives. 
Absent that, if you have multiple disk drives, you could also 
configure the content repository to span multiple disks, in order to 
spread that load.


Thanks
-Mark

On Mar 22, 2023, at 10:41 AM, Joe Obernberger 
 wrote:


Thank you.  Was able to get in.
Currently there are 1.8 million flow files and 3.2G.  Is this too 
much for a 3 node cluster with multiple spindles each (SATA drives)?

Should I reduce the queue sizes?

-Joe

On 3/22/2023 10:23 AM, Phillip Lord wrote:

Joe,

If you need the UI to come back up, try setting the autoresume setting in nifi.properties to false and restart node(s).

Re: UI SocketTimeoutException - heavy IO

2023-03-22 Thread Mark Payne
Joe,

The errors noted are indicating that NiFi cannot communicate with registry. 
Either the registry is offline, NiFi’s Registry Client is not configured 
properly, there’s a firewall in the way, etc.
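A quick sanity check from a NiFi node is to hit the Registry REST API directly,
for example (assuming the default unsecured port 18080; adjust host, port, and
TLS to match your Registry Client configuration):

    curl -v http://registry-host:18080/nifi-registry-api/buckets

A "Connection refused" there points at the Registry process or a firewall rather
than at NiFi itself.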

A FlowFile repo of 35 GB is rather huge. This would imply one of 3 things:
- You have a huge number of FlowFiles (doesn’t seem to be the case)
- FlowFiles have a huge number of attributes
or
- FlowFiles have 1 or more huge attribute values.

Typically, FlowFile attributes should be kept minimal and should never contain 
chunks of contents from the FlowFile content. Often when we see this type of 
behavior it’s due to using something like ExtractText or EvaluateJsonPath to 
put large blocks of content into attributes.
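As a concrete illustration of the anti-pattern (the "payload" attribute name here
is made up): an EvaluateJsonPath processor configured with

    Destination = flowfile-attribute
    payload     = $

copies the entire JSON document into a "payload" attribute on every FlowFile, and
all of those copies end up in the FlowFile repository. Keeping
Destination = flowfile-content, or extracting only small scalar fields, avoids that.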

And in this case, setting Backpressure Threshold above 10,000 is even more 
concerning, as it means even greater disk I/O.

Thanks
-Mark


On Mar 22, 2023, at 11:01 AM, Joe Obernberger  
wrote:


Thank you Mark.  These are SATA drives - but there's no way for the flowfile 
repo to be on multiple spindles.  It's not huge - maybe 35G per node.
I do see a lot of messages like this in the log:

2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62] 
o.a.nifi.groups.StandardProcessGroup Failed to synchronize 
StandardProcessGroup[identifier=861d3b27-aace-186d-bbb7-870c6fa65243,name=TIKA 
Handle Extract Metadata] with Flow Registry because could not retrieve version 
1 of flow with identifier d64e72b5-16ea-4a87-af09-72c5bbcd82bf in bucket 
736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection refused (Connection 
refused)
2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62] 
o.a.nifi.groups.StandardProcessGroup Failed to synchronize 
StandardProcessGroup[identifier=bcc23c03-49ef-1e41-83cb-83f22630466d,name=WriteDB]
 with Flow Registry because could not retrieve version 2 of flow with 
identifier ff197063-af31-45df-9401-e9f8ba2e4b2b in bucket 
736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection refused (Connection 
refused)
2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62] 
o.a.nifi.groups.StandardProcessGroup Failed to synchronize 
StandardProcessGroup[identifier=bc913ff1-06b1-1b76-a548-7525a836560a,name=TIKA 
Handle Extract Metadata] with Flow Registry because could not retrieve version 
1 of flow with identifier d64e72b5-16ea-4a87-af09-72c5bbcd82bf in bucket 
736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection refused (Connection 
refused)
2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62] 
o.a.nifi.groups.StandardProcessGroup Failed to synchronize 
StandardProcessGroup[identifier=920c3600-2954-1c8e-b121-6d7d3d393de6,name=Save 
Binary Data] with Flow Registry because could not retrieve version 1 of flow 
with identifier 7a8c82be-1707-4e7d-a5e7-bb3825e0a38f in bucket 
736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection refused (Connection 
refused)

A clue?

-joe

On 3/22/2023 10:49 AM, Mark Payne wrote:
Joe,

1.8 million FlowFiles is not a concern. But when you say “Should I reduce the 
queue sizes?” it makes me wonder if they’re all in a single queue?
Generally, you should leave the backpressure threshold at the default 10,000 
FlowFile max. Increasing this can lead to huge amounts of swapping, which will 
drastically reduce performance and increase disk utilization very significantly.

Also from the diagnostics, it looks like you’ve got a lot of CPU cores, but 
you’re not using much. And based on the amount of disk space available and the 
fact that you’re seeing 100% utilization, I’m wondering if you’re using 
spinning disks, rather than SSDs? I would highly recommend always running NiFi 
with ssd/nvme drives. Absent that, if you have multiple disk drives, you could 
also configure the content repository to span multiple disks, in order to 
spread that load.

Thanks
-Mark

On Mar 22, 2023, at 10:41 AM, Joe Obernberger 
 wrote:


Thank you.  Was able to get in.
Currently there are 1.8 million flow files and 3.2G.  Is this too much for a 3 
node cluster with multiple spindles each (SATA drives)?
Should I reduce the queue sizes?

-Joe

On 3/22/2023 10:23 AM, Phillip Lord wrote:
Joe,

If you need the UI to come back up, try setting the autoresume setting in 
nifi.properties to false and restart node(s).
This will bring up every component/controllerService stopped/disabled and 
may provide some breathing room for the UI to become available again.

Phil
On Mar 22, 2023 at 10:20 AM -0400, Joe Obernberger 
, wrote:
atop shows the disk as being all red with IO - 100% utilization. There
are a lot of flowfiles currently trying to run through, but I can't
monitor it because the UI won't load.

-Joe

On 3/22/2023 10:16 AM, Mark Payne wrote:
Joe,

I’d recommend taking a look at garbage collection. It is far more likely the 
culprit than disk I/O.

Thanks
-Mark

On Mar 22, 2023, at 10:12 AM, Joe Obernberger 
 wrote:

I'm getting "java.net.SocketTimeoutException: timeout" from the user interface of NiFi when load is heavy.

Re: UI SocketTimeoutException - heavy IO

2023-03-22 Thread Joe Obernberger
Thank you Mark.  These are SATA drives - but there's no way for the 
flowfile repo to be on multiple spindles.  It's not huge - maybe 35G per 
node.

I do see a lot of messages like this in the log:

2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62] 
o.a.nifi.groups.StandardProcessGroup Failed to synchronize 
StandardProcessGroup[identifier=861d3b27-aace-186d-bbb7-870c6fa65243,name=TIKA 
Handle Extract Metadata] with Flow Registry because could not retrieve 
version 1 of flow with identifier d64e72b5-16ea-4a87-af09-72c5bbcd82bf 
in bucket 736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection 
refused (Connection refused)
2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62] 
o.a.nifi.groups.StandardProcessGroup Failed to synchronize 
StandardProcessGroup[identifier=bcc23c03-49ef-1e41-83cb-83f22630466d,name=WriteDB] 
with Flow Registry because could not retrieve version 2 of flow with 
identifier ff197063-af31-45df-9401-e9f8ba2e4b2b in bucket 
736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection refused 
(Connection refused)
2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62] 
o.a.nifi.groups.StandardProcessGroup Failed to synchronize 
StandardProcessGroup[identifier=bc913ff1-06b1-1b76-a548-7525a836560a,name=TIKA 
Handle Extract Metadata] with Flow Registry because could not retrieve 
version 1 of flow with identifier d64e72b5-16ea-4a87-af09-72c5bbcd82bf 
in bucket 736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection 
refused (Connection refused)
2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62] 
o.a.nifi.groups.StandardProcessGroup Failed to synchronize 
StandardProcessGroup[identifier=920c3600-2954-1c8e-b121-6d7d3d393de6,name=Save 
Binary Data] with Flow Registry because could not retrieve version 1 of 
flow with identifier 7a8c82be-1707-4e7d-a5e7-bb3825e0a38f in bucket 
736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection refused 
(Connection refused)


A clue?

-joe

On 3/22/2023 10:49 AM, Mark Payne wrote:

Joe,

1.8 million FlowFiles is not a concern. But when you say “Should I 
reduce the queue sizes?” it makes me wonder if they’re all in a single 
queue?
Generally, you should leave the backpressure threshold at the default 
10,000 FlowFile max. Increasing this can lead to huge amounts of 
swapping, which will drastically reduce performance and increase disk 
utilization very significantly.


Also from the diagnostics, it looks like you’ve got a lot of CPU 
cores, but you’re not using much. And based on the amount of disk 
space available and the fact that you’re seeing 100% utilization, I’m 
wondering if you’re using spinning disks, rather than SSDs? I would 
highly recommend always running NiFi with ssd/nvme drives. Absent 
that, if you have multiple disk drives, you could also configure the 
content repository to span multiple disks, in order to spread that load.


Thanks
-Mark

On Mar 22, 2023, at 10:41 AM, Joe Obernberger 
 wrote:


Thank you.  Was able to get in.
Currently there are 1.8 million flow files and 3.2G. Is this too much 
for a 3 node cluster with multiple spindles each (SATA drives)?

Should I reduce the queue sizes?

-Joe

On 3/22/2023 10:23 AM, Phillip Lord wrote:

Joe,

If you need the UI to come back up, try setting the autoresume 
setting in nifi.properties to false and restart node(s).
This will bring up every component/controllerService 
stopped/disabled and may provide some breathing room for the UI to 
become available again.


Phil
On Mar 22, 2023 at 10:20 AM -0400, Joe Obernberger 
, wrote:

atop shows the disk as being all red with IO - 100% utilization. There
are a lot of flowfiles currently trying to run through, but I can't
monitor it because the UI won't load.

-Joe

On 3/22/2023 10:16 AM, Mark Payne wrote:

Joe,

I’d recommend taking a look at garbage collection. It is far more 
likely the culprit than disk I/O.


Thanks
-Mark

On Mar 22, 2023, at 10:12 AM, Joe Obernberger 
 wrote:


I'm getting "java.net.SocketTimeoutException: timeout" from the 
user interface of NiFi when load is heavy. This is 1.18.0 running 
on a 3 node cluster. Disk IO is high and when that happens, I 
can't get into the UI to stop any of the processors.

Any ideas?

I have put the flowfile repository and content repository on 
different disks on the 3 nodes, but disk usage is still so high 
that I can't get in.

Thank you!

-Joe










Re: UI SocketTimeoutException - heavy IO

2023-03-22 Thread Joe Obernberger
I've since brought the node back up - no change.  Looks like IO is all 
related to the flowfile repository.  When it's running, CPU is pretty high - 
usually ~12 cores (i.e. top will show 1200%) per node.  I'm using the XFS 
filesystem; maybe some FS parameters would help?
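One low-risk mount tweak would be noatime on the repository filesystems, e.g. an
/etc/fstab line along these lines (device and mount point are just placeholders):

    /dev/sdb1  /data/nifi/flowfile_repository  xfs  defaults,noatime  0 0

so reads of repository files don't also generate access-time writes. Whether that
moves the needle here is another question.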


The big change is that I was using Kafka for queuing, and have re-done 
my flow so that it will use only NiFi's internal queuing. This was 
working great with a small amount of data (100k records), but bringing in 
8 million started causing this issue.  Even with everything off, as soon 
as I start one thing, I start getting timeouts and the disks just grind.


-Joe

On 3/22/2023 10:44 AM, Mark Payne wrote:

Sorry, apparently I dropped users@ from my previous reply.

Looking at the diagnostics, garbage collection looks very healthy. 
Overall CPU usage is also very low.
The one thing that did strike me as interesting, though, is that you 
have one node in the cluster shut down. While this shouldn’t cause 
issues (or if it does, only for a few seconds until all the nodes 
realize it’s disconnected), I’m curious if you started seeing issues only 
after that node was shut down?


I also noted that you have the read timeout set to 5 secs in 
nifi.properties:

nifi.cluster.node.read.timeout : 5 sec

That might be worth increasing.
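Something on the order of the connection timeout you already raised would be a
reasonable starting point, e.g.:

    nifi.cluster.node.read.timeout=30 sec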

> On Mar 22, 2023, at 10:24 AM, Joe Obernberger 
 wrote:

>
> Hi Mark - thank you so much for helping me.
> Any thoughts on the attached?
>
> -Joe
>
> On 3/22/2023 10:21 AM, Mark Payne wrote:
>> You can see how busy garbage collection is by running “nifi.sh 
diagnostics diag1.txt” and then looking at the diag1.txt file. It’ll 
contain a lot of information, including garbage collection details.

>>
>> Thanks
>> -Mark
>>
>>
>>> On Mar 22, 2023, at 10:19 AM, Joe Obernberger 
 wrote:

>>>
>>> atop shows the disk as being all red with IO - 100% utilization. 
There are a lot of flowfiles currently trying to run through, but I 
can't monitor it because the UI won't load.

>>>
>>> -Joe
>>>
>>> On 3/22/2023 10:16 AM, Mark Payne wrote:
 Joe,

 I’d recommend taking a look at garbage collection. It is far more 
likely the culprit than disk I/O.


 Thanks
 -Mark

> On Mar 22, 2023, at 10:12 AM, Joe Obernberger 
 wrote:

>
> I'm getting "java.net.SocketTimeoutException: timeout" from the 
user interface of NiFi when load is heavy.  This is 1.18.0 running on 
a 3 node cluster.  Disk IO is high and when that happens, I can't get 
into the UI to stop any of the processors.

> Any ideas?
>
> I have put the flowfile repository and content repository on 
different disks on the 3 nodes, but disk usage is still so high that I 
can't get in.

> Thank you!
>
> -Joe
>
>




Re: UI SocketTimeoutException - heavy IO

2023-03-22 Thread Mark Payne
Joe,

1.8 million FlowFiles is not a concern. But when you say “Should I reduce the 
queue sizes?” it makes me wonder if they’re all in a single queue?
Generally, you should leave the backpressure threshold at the default 10,000 
FlowFile max. Increasing this can lead to huge amounts of swapping, which will 
drastically reduce performance and increase disk utilization very significantly.

Also from the diagnostics, it looks like you’ve got a lot of CPU cores, but 
you’re not using much. And based on the amount of disk space available and the 
fact that you’re seeing 100% utilization, I’m wondering if you’re using 
spinning disks, rather than SSDs? I would highly recommend always running NiFi 
with ssd/nvme drives. Absent that, if you have multiple disk drives, you could 
also configure the content repository to span multiple disks, in order to 
spread that load.
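Spanning disks is just a matter of defining additional named content repository
directories in nifi.properties, roughly like this (paths are illustrative):

    nifi.content.repository.directory.default=/disk1/content_repository
    nifi.content.repository.directory.content2=/disk2/content_repository
    nifi.content.repository.directory.content3=/disk3/content_repository

NiFi will spread content claims across all of the configured directories.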

Thanks
-Mark

On Mar 22, 2023, at 10:41 AM, Joe Obernberger  
wrote:


Thank you.  Was able to get in.
Currently there are 1.8 million flow files and 3.2G.  Is this too much for a 3 
node cluster with multiple spindles each (SATA drives)?
Should I reduce the queue sizes?

-Joe

On 3/22/2023 10:23 AM, Phillip Lord wrote:
Joe,

If you need the UI to come back up, try setting the autoresume setting in 
nifi.properties to false and restart node(s).
This will bring up every component/controllerService stopped/disabled and 
may provide some breathing room for the UI to become available again.

Phil
On Mar 22, 2023 at 10:20 AM -0400, Joe Obernberger 
, wrote:
atop shows the disk as being all red with IO - 100% utilization. There
are a lot of flowfiles currently trying to run through, but I can't
monitor it because the UI won't load.

-Joe

On 3/22/2023 10:16 AM, Mark Payne wrote:
Joe,

I’d recommend taking a look at garbage collection. It is far more likely the 
culprit than disk I/O.

Thanks
-Mark

On Mar 22, 2023, at 10:12 AM, Joe Obernberger 
 wrote:

I'm getting "java.net.SocketTimeoutException: timeout" from the user interface 
of NiFi when load is heavy. This is 1.18.0 running on a 3 node cluster. Disk IO 
is high and when that happens, I can't get into the UI to stop any of the 
processors.
Any ideas?

I have put the flowfile repository and content repository on different disks on 
the 3 nodes, but disk usage is still so high that I can't get in.
Thank you!

-Joe






Re: UI SocketTimeoutException - heavy IO

2023-03-22 Thread Joe Obernberger

Thank you.  Was able to get in.
Currently there are 1.8 million flow files and 3.2G.  Is this too much 
for a 3 node cluster with multiple spindles each (SATA drives)?

Should I reduce the queue sizes?

-Joe

On 3/22/2023 10:23 AM, Phillip Lord wrote:

Joe,

If you need the UI to come back up, try setting the autoresume setting 
in nifi.properties to false and restart node(s).
This will bring up every component/controllerService 
stopped/disabled and may provide some breathing room for the UI to 
become available again.


Phil
On Mar 22, 2023 at 10:20 AM -0400, Joe Obernberger 
, wrote:

atop shows the disk as being all red with IO - 100% utilization. There
are a lot of flowfiles currently trying to run through, but I can't
monitor it because the UI won't load.

-Joe

On 3/22/2023 10:16 AM, Mark Payne wrote:

Joe,

I’d recommend taking a look at garbage collection. It is far more 
likely the culprit than disk I/O.


Thanks
-Mark

On Mar 22, 2023, at 10:12 AM, Joe Obernberger 
 wrote:


I'm getting "java.net.SocketTimeoutException: timeout" from the 
user interface of NiFi when load is heavy. This is 1.18.0 running 
on a 3 node cluster. Disk IO is high and when that happens, I can't 
get into the UI to stop any of the processors.

Any ideas?

I have put the flowfile repository and content repository on 
different disks on the 3 nodes, but disk usage is still so high 
that I can't get in.

Thank you!

-Joe



Re: UI SocketTimeoutException - heavy IO

2023-03-22 Thread Phillip Lord
Joe,

If you need the UI to come back up, try setting the autoresume setting in 
nifi.properties to false and restart node(s).
This will bring up every component/controllerService stopped/disabled and 
may provide some breathing room for the UI to become available again.
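If memory serves, the property in question is this one (flip it back to true once
things are stable):

    nifi.flowcontroller.autoResumeState=false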

Phil
On Mar 22, 2023 at 10:20 AM -0400, Joe Obernberger 
, wrote:
> atop shows the disk as being all red with IO - 100% utilization. There
> are a lot of flowfiles currently trying to run through, but I can't
> monitor it because the UI won't load.
>
> -Joe
>
> On 3/22/2023 10:16 AM, Mark Payne wrote:
> > Joe,
> >
> > I’d recommend taking a look at garbage collection. It is far more likely 
> > the culprit than disk I/O.
> >
> > Thanks
> > -Mark
> >
> > > On Mar 22, 2023, at 10:12 AM, Joe Obernberger 
> > >  wrote:
> > >
> > > I'm getting "java.net.SocketTimeoutException: timeout" from the user 
> > > interface of NiFi when load is heavy. This is 1.18.0 running on a 3 node 
> > > cluster. Disk IO is high and when that happens, I can't get into the UI 
> > > to stop any of the processors.
> > > Any ideas?
> > >
> > > I have put the flowfile repository and content repository on different 
> > > disks on the 3 nodes, but disk usage is still so high that I can't get in.
> > > Thank you!
> > >
> > > -Joe
> > >
> > >


Re: UI SocketTimeoutException - heavy IO

2023-03-22 Thread Joe Obernberger
atop shows the disk as being all red with IO - 100% utilization. There 
are a lot of flowfiles currently trying to run through, but I can't 
monitor it because the UI won't load.


-Joe

On 3/22/2023 10:16 AM, Mark Payne wrote:

Joe,

I’d recommend taking a look at garbage collection. It is far more likely the 
culprit than disk I/O.

Thanks
-Mark


On Mar 22, 2023, at 10:12 AM, Joe Obernberger  
wrote:

I'm getting "java.net.SocketTimeoutException: timeout" from the user interface 
of NiFi when load is heavy.  This is 1.18.0 running on a 3 node cluster.  Disk IO is high 
and when that happens, I can't get into the UI to stop any of the processors.
Any ideas?

I have put the flowfile repository and content repository on different disks on 
the 3 nodes, but disk usage is still so high that I can't get in.
Thank you!

-Joe




Re: UI SocketTimeoutException - heavy IO

2023-03-22 Thread Mark Payne
Joe,

I’d recommend taking a look at garbage collection. It is far more likely the 
culprit than disk I/O.
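A quick way to check GC pressure without restarting anything is jstat against the
NiFi JVM, e.g. (PID and sampling interval are placeholders):

    jstat -gcutil <nifi-pid> 5s

Long or frequent full-GC pauses there would explain UI timeouts even when the
disks look like the bottleneck.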

Thanks
-Mark

> On Mar 22, 2023, at 10:12 AM, Joe Obernberger  
> wrote:
> 
> I'm getting "java.net.SocketTimeoutException: timeout" from the user 
> interface of NiFi when load is heavy.  This is 1.18.0 running on a 3 node 
> cluster.  Disk IO is high and when that happens, I can't get into the UI to 
> stop any of the processors.
> Any ideas?
> 
> I have put the flowfile repository and content repository on different disks 
> on the 3 nodes, but disk usage is still so high that I can't get in.
> Thank you!
> 
> -Joe
> 
> 



UI SocketTimeoutException - heavy IO

2023-03-22 Thread Joe Obernberger
I'm getting "java.net.SocketTimeoutException: timeout" from the user 
interface of NiFi when load is heavy.  This is 1.18.0 running on a 3 
node cluster.  Disk IO is high and when that happens, I can't get into 
the UI to stop any of the processors.

Any ideas?

I have put the flowfile repository and content repository on different 
disks on the 3 nodes, but disk usage is still so high that I can't get in.
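(For reference, that separation is done with the repository directory properties
in nifi.properties; paths here are just examples:

    nifi.flowfile.repository.directory=/disk1/flowfile_repository
    nifi.content.repository.directory.default=/disk2/content_repository
    nifi.provenance.repository.directory.default=/disk3/provenance_repository
)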

Thank you!

-Joe

