Bryan,

Our minimum heap (-Xms) is set to 32 GB. Under normal conditions heap usage 
does not exceed roughly 50% of the 70 GB maximum, and is often lower. We 
collect and track these metrics, and over the last 30 days usage has been 
closer to 35%.
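
For reference, that heap range is just the standard JVM flags in 
bootstrap.conf. A minimal sketch (the java.arg numbering is illustrative and 
may differ in your file):

    java.arg.2=-Xms32g
    java.arg.3=-Xmx70g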

But during database maintenance we have to shut down a lot of processors. 
FlowFiles start to back up in the system across lots of different feeds. Then, 
when the database comes back online, the combined processing of all these 
separate feeds catching up on backlog (lots of different processors, not a 
single processor) causes heap usage to spike. What we saw in the GC logs was 
that we would reach 70 GB, GC would do a stop-the-world pause and bring us 
down to about 65 GB, then we'd reach 70 GB again and GC would only get us down 
to 68 GB. This repeated until GC was trimming off only a few MB and running 
full collections every few seconds, leaving the system inoperable.
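
For context, the GC logging that surfaced this is just the stock JVM options 
added to bootstrap.conf, roughly along these lines (exact flags depend on the 
Java version; the argument numbers and log path are illustrative):

    # Java 8
    java.arg.20=-XX:+PrintGCDetails
    java.arg.21=-XX:+PrintGCDateStamps
    java.arg.22=-Xloggc:/var/log/nifi/gc.log

    # Java 9+ (unified logging)
    java.arg.20=-Xlog:gc*:file=/var/log/nifi/gc.log:time,uptime:filecount=5,filesize=10m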

We brought our cluster back online by:
 1. Shutting everything down.
 2. Going onto a single node and setting NiFi not to auto-resume processor 
state; we also lowered the maximum thread count to 10 (property snippet below 
the list).
 3. Turning that single node on and verifying we could process a single feed 
without crashing. We then synchronized the flow to the rest of the nodes and 
brought them back online.
 4. Manually turning feeds on to flush out the backlogged data; of course, 
more data was backing up on our edge servers while we did this.
 5. Settling on 140 threads per node (significantly lower than the 1,500 
threads we used to have) and a 200 GB heap. We sized this at 2x threads per 
virtual core, plus enough threads to cover all of the site-to-site input ports 
(sketch below the list). It's odd, because NiFi used to happily run 1,000+ 
threads per node all the time, yet it keeps up just as well now with 140.
 6. With these settings in place we caught up on our backlog without running 
out of heap. We maxed out around 100 GB of heap usage per node.
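
For anyone following along: the auto-resume switch in step 2 is the standard 
nifi.properties flag, and the thread sizing in step 5 is simple arithmetic. A 
rough sketch (the core and port counts below are hypothetical, not our exact 
hardware):

    # nifi.properties - do not restart processors automatically on startup
    nifi.flowcontroller.autoResumeState=false

    # Max Timer Driven Thread Count (set in the UI under Controller Settings):
    #   2 threads per virtual core + 1 per site-to-site input port
    #   e.g. 2 x 60 vcores + 20 input ports = 140 threads per node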

--Peter

-----Original Message-----
From: Bryan Bende [mailto:bbe...@gmail.com] 
Sent: Friday, October 5, 2018 7:26 AM
To: users@nifi.apache.org
Subject: [EXT] Re: Maximum Memory for NiFi?

Generally, the larger the heap, the more likely you are to see long GC pauses.

I'm surprised that you would need a 70GB heap given NiFi's design where the 
content of the flow files is generally not held in memory, unless many of the 
processors you are using are not written in an optimal way to process the 
content in a streaming fashion.
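
To illustrate what I mean by streaming: a well-behaved processor moves flow 
file content through the ProcessSession callbacks a small buffer at a time, so 
heap usage stays roughly buffer-sized no matter how large the content is. A 
minimal, hypothetical sketch (not any particular processor from your flow):

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.util.Collections;
    import java.util.Set;

    import org.apache.nifi.flowfile.FlowFile;
    import org.apache.nifi.processor.AbstractProcessor;
    import org.apache.nifi.processor.ProcessContext;
    import org.apache.nifi.processor.ProcessSession;
    import org.apache.nifi.processor.Relationship;
    import org.apache.nifi.processor.exception.ProcessException;
    import org.apache.nifi.processor.io.StreamCallback;

    public class StreamingCopyProcessor extends AbstractProcessor {

        static final Relationship REL_SUCCESS =
                new Relationship.Builder().name("success").build();

        @Override
        public Set<Relationship> getRelationships() {
            return Collections.singleton(REL_SUCCESS);
        }

        @Override
        public void onTrigger(ProcessContext context, ProcessSession session)
                throws ProcessException {
            FlowFile flowFile = session.get();
            if (flowFile == null) {
                return;
            }
            // Content is streamed through an 8 KB buffer; the full payload is
            // never pulled onto the heap.
            flowFile = session.write(flowFile, new StreamCallback() {
                @Override
                public void process(InputStream in, OutputStream out) throws IOException {
                    final byte[] buffer = new byte[8192];
                    int len;
                    while ((len = in.read(buffer)) != -1) {
                        // transform bytes here rather than buffering everything
                        out.write(buffer, 0, len);
                    }
                }
            });
            session.transfer(flowFile, REL_SUCCESS);
        }
    }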

Did you initially start out lower than 70GB and have to increase it to that 
point? Just wondering what happens at lower levels, like maybe 32GB.

On Thu, Oct 4, 2018 at 4:20 PM Peter Wicks (pwicks) <pwi...@micron.com> wrote:
>
> We've had some more clustering issues, and found that some nodes are running 
> out of memory when we have unexpected spikes in data; we then run into GC 
> stop-the-world events... We lowered our thread count, and that has allowed 
> the cluster to stabilize for the time being.
>
>
>
> Our hardware is pretty robust; we usually have 1,000+ threads running on each 
> node in the cluster (cumulatively ~4,000 threads). Each node has about 500 GB 
> of RAM, but we've only been running NiFi with 70 GB of it, and NiFi usually 
> uses only about 50 GB.
>
>
>
> I enabled GC logging, and after analyzing the data we decided to increase the 
> heap size. We are experimenting with upping the max to 200 GB of heap to 
> better absorb spikes in data. We are using the default G1GC.
>
>
>
> Also, how much impact is there from doing GC logging all the time? The 
> metrics we are getting are really helpful for debugging/analyzing, but we 
> don't want to slow down the cluster too much.
>
>
>
> Thoughts on issues we might encounter? Things we should consider?
>
>
>
> --Peter
