Re: OOM killer and kernel cache reclamation rate limit in vm_pageout_scan()

2014-10-17 Thread Konstantin Belousov
On Wed, Oct 15, 2014 at 11:56:33PM -0600, Justin T. Gibbs wrote:
> avg pointed out the rate limiting code in vm_pageout_scan() during discussion 
> about PR 187594.  While it certainly can contribute to the problems discussed 
> in that PR, a bigger problem is that it can allow the OOM killer to be 
> triggered even though there is plenty of reclaimable memory available in the 
> system.  Any load that can consume enough pages within the polling interval 
> to hit the v_free_min threshold (e.g. multiple 'dd if=/dev/zero 
> of=/file/on/zfs') can make this happen.
> 
> The product I’m working on does not have swap configured and treats any OOM 
> trigger as fatal, so it is very obvious when this happens. :-)
> 
> I’ve tried several things to mitigate the problem.  The first was to ignore 
> rate limiting for pass 2.  However, even though ZFS is guaranteed to receive 
> some feedback prior to OOM being declared, my testing showed that a trivial 
> load (a couple dd operations) could still consume enough of the reclaimed 
> space to leave the system below its target at the end of pass 2.  After 
> removing the rate limiting entirely, I’ve so far been unable to kill the 
> system via a ZFS-induced load.
> 
> I understand the motivation behind the rate limiting, but the current 
> implementation seems too simplistic to be safe.  The documentation for the 
> Solaris slab allocator provides good motivation for their approach of using a 
> “sliding average” to rein in temporary bursts of usage without unduly 
> harming efficient service for the recorded steady-state memory demand.  
> Regardless of the approach taken, I believe that the OOM killer must be a 
> last resort and shouldn’t be called when there are caches that can be culled.
> 
> One other thing I’ve noticed in my testing with ZFS is that it needs feedback 
> and a little time to react to memory pressure.  Calling its lowmem handler 
> just once isn’t enough for it to limit in-flight writes so it can avoid reuse 
> of pages that it just freed up.  But, it doesn’t take too long to react (> 
> 1sec in the profiling I’ve done).  Is there a way in vm_pageout_scan() that 
> we can better record that progress is being made (pages were freed in the 
> pass, even if some/all of them were consumed again) and allow more passes 
> before the OOM killer is invoked in this case?
> 
> —
> Justin
https://docs.freebsd.org/cgi/getmsg.cgi?fetch=103436+0+/usr/local/www/db/text/2014/freebsd-hackers/20141012.freebsd-hackers
might have some relevance.
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: OOM killer and kernel cache reclamation rate limit in vm_pageout_scan()

2014-10-16 Thread Andriy Gapon
On 16/10/2014 12:08, Steven Hartland wrote:
> Unfortunately ZFS doesn't prevent new inflight writes until it
> hits zfs_dirty_data_max, so while what you're suggesting will
> help, if the writes come in quick enough I would expect it to
> still be able to outrun the pageout.

As I've mentioned, arc_memory_throttle() also plays a role in limiting the dirty 
data.

-- 
Andriy Gapon
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: OOM killer and kernel cache reclamation rate limit in vm_pageout_scan()

2014-10-16 Thread Steven Hartland

Unfortunately ZFS doesn't prevent new inflight writes until it
hits zfs_dirty_data_max, so while what you're suggesting will
help, if the writes come in quick enough I would expect it to
still be able to outrun the pageout.
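
To make that concrete, the behaviour is roughly this (a sketch, not the
actual ZFS transaction code; the tunables are real, the helper names are
invented):

    /*
     * New writes proceed unthrottled until the pool's outstanding dirty
     * data approaches zfs_dirty_data_max, however short of free pages
     * the VM system happens to be at that moment.
     */
    if (dirty_data >= zfs_dirty_data_max)
            block_until_dirty_data_drops();         /* hard limit */
    else if (dirty_data >
        zfs_dirty_data_max * zfs_delay_min_dirty_percent / 100)
            delay_writer(dirty_data);               /* graduated delay */
    /* else: the write is admitted immediately. */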

- Original Message - 
From: "Justin T. Gibbs" 

To: 
Cc: ; "Andriy Gapon" 
Sent: Thursday, October 16, 2014 6:56 AM
Subject: OOM killer and kernel cache reclamation rate limit in vm_pageout_scan()


avg pointed out the rate limiting code in vm_pageout_scan() during discussion about PR 187594.  While it certainly can contribute to 
the problems discussed in that PR, a bigger problem is that it can allow the OOM killer to be triggered even though there is plenty 
of reclaimable memory available in the system.  Any load that can consume enough pages within the polling interval to hit the 
v_free_min threshold (e.g. multiple 'dd if=/dev/zero of=/file/on/zfs') can make this happen.


The product I’m working on does not have swap configured and treats any OOM trigger as fatal, so it is very obvious when this 
happens. :-)


I’ve tried several things to mitigate the problem.  The first was to ignore rate limiting for pass 2.  However, even though ZFS is 
guaranteed to receive some feedback prior to OOM being declared, my testing showed that a trivial load (a couple dd operations) 
could still consume enough of the reclaimed space to leave the system below its target at the end of pass 2.  After removing the 
rate limiting entirely, I’ve so far been unable to kill the system via a ZFS-induced load.


I understand the motivation behind the rate limiting, but the current implementation seems too simplistic to be safe.  The 
documentation for the Solaris slab allocator provides good motivation for their approach of using a “sliding average” to rein in 
temporary bursts of usage without unduly harming efficient service for the recorded steady-state memory demand.  Regardless of the 
approach taken, I believe that the OOM killer must be a last resort and shouldn’t be called when there are caches that can be 
culled.


One other thing I’ve noticed in my testing with ZFS is that it needs feedback and a little time to react to memory pressure. 
Calling its lowmem handler just once isn’t enough for it to limit in-flight writes so it can avoid reuse of pages that it just 
freed up.  But, it doesn’t take too long to react (> 1sec in the profiling I’ve done).  Is there a way in vm_pageout_scan() that we 
can better record that progress is being made (pages were freed in the pass, even if some/all of them were consumed again) and allow 
more passes before the OOM killer is invoked in this case?


—
Justin


___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: OOM killer and kernel cache reclamation rate limit in vm_pageout_scan()

2014-10-15 Thread Andriy Gapon
On 16/10/2014 08:56, Justin T. Gibbs wrote:
> avg pointed out the rate limiting code in vm_pageout_scan() during discussion
> about PR 187594.  While it certainly can contribute to the problems discussed
> in that PR, a bigger problem is that it can allow the OOM killer to be
> triggered even though there is plenty of reclaimable memory available in the
> system.  Any load that can consume enough pages within the polling interval
> to hit the v_free_min threshold (e.g. multiple 'dd if=/dev/zero
> of=/file/on/zfs') can make this happen.
> 
> The product I’m working on does not have swap configured and treats any OOM
> trigger as fatal, so it is very obvious when this happens. :-)
> 
> I’ve tried several things to mitigate the problem.  The first was to ignore
> rate limiting for pass 2.  However, even though ZFS is guaranteed to receive
> some feedback prior to OOM being declared, my testing showed that a trivial
> load (a couple dd operations) could still consume enough of the reclaimed
> space to leave the system below its target at the end of pass 2.  After
> removing the rate limiting entirely, I’ve so far been unable to kill the
> system via a ZFS-induced load.
> 
> I understand the motivation behind the rate limiting, but the current
> implementation seems too simplistic to be safe.  The documentation for the
> Solaris slab allocator provides good motivation for their approach of using a
> “sliding average” to rein in temporary bursts of usage without unduly
> harming efficient service for the recorded steady-state memory demand.
> Regardless of the approach taken, I believe that the OOM killer must be a
> last resort and shouldn’t be called when there are caches that can be
> culled.

FWIW, I have this toy branch:
https://github.com/avg-I/freebsd/compare/experiment/uma-cache-trimming

Not all commits are relevant to the problem and some things are unfinished.
Not sure if the changes would help your case either...

> One other thing I’ve noticed in my testing with ZFS is that it needs feedback
> and a little time to react to memory pressure.  Calling its lowmem handler
> just once isn’t enough for it to limit in-flight writes so it can avoid reuse
> of pages that it just freed up.  But, it doesn’t take too long to react (>

I've been thinking about this and maybe we need to make arc_memory_throttle()
more aggressive on FreeBSD.  I can't say that I really follow the logic of that
code, though.
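
The gist, as far as I can tell (a heavily simplified paraphrase of the
illumos-derived code, not the source):

    /*
     * Writers reserving space in a txg are bounced with ERESTART
     * (retry later) once free memory looks scarce; this is the
     * feedback path that slows in-flight writes.
     */
    if (freemem > physmem * arc_lotsfree_percent / 100)
            return (0);             /* plenty of memory, no throttle */
    if (page_load > MAX(ptob(minfree), available_memory) / 4)
            return (SET_ERROR(ERESTART));   /* back off, retry the txg */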

> 1sec in the profiling I’ve done).  Is there a way in vm_pageout_scan() that
> we can better record that progress is being made (pages were freed in the
> pass, even if some/all of them were consumed again) and allow more passes
> before the OOM killer is invoked in this case?

-- 
Andriy Gapon
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


OOM killer and kernel cache reclamation rate limit in vm_pageout_scan()

2014-10-15 Thread Justin T. Gibbs
avg pointed out the rate limiting code in vm_pageout_scan() during discussion 
about PR 187594.  While it certainly can contribute to the problems discussed 
in that PR, a bigger problem is that it can allow the OOM killer to be 
triggered even though there is plenty of reclaimable memory available in the 
system.  Any load that can consume enough pages within the polling interval to 
hit the v_free_min threshold (e.g. multiple 'dd if=/dev/zero of=/file/on/zfs') 
can make this happen.
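
For context, the rate limiting in question looks roughly like this (a
paraphrase from memory of the vm_pageout.c of that era, not the literal
source; exact variable names may differ):

    /*
     * The vm_lowmem event and the UMA reclaim are gated on a
     * wall-clock period (vm.lowmem_period, 10 seconds by default),
     * no matter how short of pages the scan currently is.
     */
    if (pass > 0 && (time_uptime - lowmem_uptime) >= lowmem_period) {
            /* Ask registered caches (the ZFS ARC among them) to shrink. */
            EVENTHANDLER_INVOKE(vm_lowmem, 0);
            /* Drain the UMA caches after the handlers have run. */
            uma_reclaim();
            lowmem_uptime = time_uptime;
    }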

The product I’m working on does not have swap configured and treats any OOM 
trigger as fatal, so it is very obvious when this happens. :-)

I’ve tried several things to mitigate the problem.  The first was to ignore 
rate limiting for pass 2.  However, even though ZFS is guaranteed to receive 
some feedback prior to OOM being declared, my testing showed that a trivial 
load (a couple dd operations) could still consume enough of the reclaimed space 
to leave the system below its target at the end of pass 2.  After removing the 
rate limiting entirely, I’ve so far been unable to kill the system via a 
ZFS-induced load.

I understand the motivation behind the rate limiting, but the current 
implementation seems too simplistic to be safe.  The documentation for the 
Solaris slab allocator provides good motivation for their approach of using a 
“sliding average” to rein in temporary bursts of usage without unduly harming 
efficient service for the recorded steady-state memory demand.  Regardless of 
the approach taken, I believe that the OOM killer must be a last resort and 
shouldn’t be called when there are caches that can be culled.
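
To illustrate the idea (a sketch only, not code from Solaris or FreeBSD;
every name here is invented): rather than a fixed once-per-interval gate,
track a decaying average of reclaim demand so that a burst raises the
trim target only temporarily:

    /* Hypothetical EWMA of the pages demanded per pageout scan. */
    static u_long demand_avg;

    static u_long
    trim_target(u_long demand_this_scan)
    {
            /* avg = 7/8 avg + 1/8 sample: a burst decays over ~8 scans. */
            demand_avg = (demand_avg * 7 + demand_this_scan) / 8;
            return (demand_avg);
    }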

One other thing I’ve noticed in my testing with ZFS is that it needs feedback 
and a little time to react to memory pressure.  Calling its lowmem handler 
just once isn’t enough for it to limit in-flight writes so it can avoid reuse 
of pages that it just freed up.  But, it doesn’t take too long to react (> 1sec 
in the profiling I’ve done).  Is there a way in vm_pageout_scan() that we can 
better record that progress is being made (pages were freed in the pass, even 
if some/all of them were consumed again) and allow more passes before the OOM 
killer is invoked in this case?
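
One hypothetical shape for that (illustrative only; the counters here
are invented, though vm_pageout_oom() is the real entry point):

    /*
     * Count pages actually freed during the pass and treat a pass that
     * freed anything as progress, even if the pages were immediately
     * consumed again, deferring OOM for a bounded number of passes.
     */
    if (pages_freed_this_pass > 0 && extra_passes < max_extra_passes)
            extra_passes++;                 /* progress; scan again */
    else if (pass >= max_passes)
            vm_pageout_oom(VM_OOM_MEM);     /* truly out of options */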

—
Justin

___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"