Re: Large amounts of data in cluster state

Mark Payne Wed, 27 Oct 2021 10:06:48 -0700

Hi Isha,

ListHDFS does not store the listing of files that it has found in its state. 
Doing so would cause a lot of problems. Instead, it stores only two 
timestamps: that of the latest file it has found and that of the latest file 
it has listed/sent out. If peak loads are triggering instability, it likely has 
more to do with either overutilization of the CPU or excessive garbage 
collection because you’re running out of heap space. I would recommend 
monitoring both the CPU load and garbage collection. Also, what is the 
Scheduling Period set to for your ListHDFS (in the Settings tab)? It probably 
defaults to 0 sec but should be set to something like 1 min to avoid 
constantly hitting both HDFS and ZooKeeper.
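
To illustrate why timestamp-based state stays small no matter how many files 
are listed, here is a minimal Python sketch. It is not NiFi's actual 
implementation: the names ListingState, latest_seen_ms, latest_emitted_ms and 
list_new_files are hypothetical, and the handling of files that share the 
newest timestamp is deliberately simplified.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class ListingState:
    """Hypothetical stand-in for the processor state: two numbers, never a file list."""
    latest_seen_ms: int = -1     # newest modification time observed so far
    latest_emitted_ms: int = -1  # newest modification time already sent out


def list_new_files(files: Iterable[Tuple[str, int]], state: ListingState) -> List[str]:
    """Return paths modified after the last emitted timestamp and update the state.

    `files` is an iterable of (path, modification_time_ms) pairs.
    Simplification: a file whose timestamp equals latest_emitted_ms is skipped.
    """
    # Keep only files newer than what was already emitted, oldest first.
    new = sorted((mtime, path) for path, mtime in files
                 if mtime > state.latest_emitted_ms)
    for mtime, _ in new:
        state.latest_seen_ms = max(state.latest_seen_ms, mtime)
    if new:
        state.latest_emitted_ms = new[-1][0]
    return [path for _, path in new]


# Illustrative data only: the state holds two integers whether the directory
# contains three files or many millions.
state = ListingState()
listing = [("/data/a", 100), ("/data/b", 200)]
first_run = list_new_files(listing, state)
second_run = list_new_files(listing, state)
listing.append(("/data/c", 300))
third_run = list_new_files(listing, state)
```

In this sketch a repeated listing of an unchanged directory emits nothing, and 
only the newly added file is emitted on the third run.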


Also, there have been many improvements to cluster performance and stability 
since 1.9. It is probably worth looking into upgrading.

Thanks
-Mark



On Oct 27, 2021, at 12:56 PM, Isha Lamboo 
<[email protected]<mailto:[email protected]>> wrote:

Hi all,

I have a question that some of you must have tackled already. On a NiFi cluster 
(still 1.9 at the moment) that is normally very stable, the users sometimes 
trigger peak loads that cause disconnections or other issues either in NiFi 
itself or the external ZooKeeper cluster. The clearest example I found is a 
ListHDFS processor that maintains a large amount of state (many millions of 
files) being cleared and refilled, but I suspect it may just keep adding more 
and more to the state.

So far, we’ve increased the ZooKeeper initLimit and syncLimit and done some 
NiFi timeout tuning, but it’s hard to figure out a sensible value when the 
reported times are normally nowhere near the limits. The users also keep 
finding bigger data loads, which they repeatedly process through NiFi into some 
processing applications. Deleting the files is also not an option because of 
the re-processing. It seems to me that increasing timeouts from several seconds 
to what is going to be minutes must impact some other aspect of ZooKeeper.
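
For reference, ZooKeeper expresses initLimit and syncLimit in ticks, so the 
effective timeout is the limit multiplied by tickTime. Below is a small Python 
sketch of that arithmetic. The function names are hypothetical and the values 
are illustrative assumptions, not settings taken from this cluster.

```python
import math


def limit_to_seconds(tick_time_ms: int, limit_ticks: int) -> float:
    """Effective timeout in seconds for a limit expressed in ticks."""
    return tick_time_ms * limit_ticks / 1000.0


def ticks_for_timeout(tick_time_ms: int, timeout_s: float) -> int:
    """Smallest number of ticks that covers the requested timeout."""
    return math.ceil(timeout_s * 1000 / tick_time_ms)


# Assumed example values: tickTime=2000 ms, initLimit=10 ticks.
init_timeout_s = limit_to_seconds(2000, 10)

# Ticks needed to reach a 60 second timeout at the same tickTime.
ticks_for_60s = ticks_for_timeout(2000, 60)
```

With these assumed values, a 60 second timeout corresponds to a limit of 30 
ticks, three times the 10 ticks used in the example.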

Is there a flow design or tuning strategy that avoids large changes to the 
state in a short time like this?

Are ZooKeeper timeouts of 60+ seconds actually usual?

Kind regards,

Isha Lamboo
Data Engineer
+31 (0)6 20 50 15 91

[email protected]<mailto:[email protected]>

Edisonbaan 15
3439 MN Nieuwegein
www.virtualsciences.nl<http://www.virtualsciences.nl/>
www.conclusion.nl<http://www.conclusion.nl/>
View the general terms and conditions of 
Conclusion here<http://www.conclusion.nl/kleine-lettertjes/algemene-voorwaarden>
