Re: [PATCH] add extra free kbytes tunable
Hi Hugh,

On 03/02/2013 11:08 AM, Hugh Dickins wrote:
> On Sat, 2 Mar 2013, Simon Jeons wrote:
> > On 03/02/2013 09:42 AM, Hugh Dickins wrote:
> > > On Sat, 2 Mar 2013, Simon Jeons wrote:
> > > > In function __add_to_swap_cache if add to radix tree successfully will
> > > > result in increase NR_FILE_PAGES, why? This is anonymous page instead
> > > > of file backed page.
> > >
> > > Right, that's hard to understand without historical background.
> > >
> > > I think the quick answer would be that we used to (and still do) think
> > > of file-cache and swap-cache as two halves of page-cache. And then when
> >
> > shmem page should be treated as file-cache or swap-cache? It is strange
> > since it is consist of anonymous pages and these pages establish files.
>
> A shmem page is swap-backed file-cache, and it may get transferred to or
> from swap-cache: yes, it's a difficult and confusing case, as I said below.
>
> I would never call it "anonymous", but it is counted in /proc/meminfo's
> Active(anon) or Inactive(anon) rather than in (file), because "anon"
> there is shorthand for "swap-backed".

In read_swap_cache_async:

    SetPageSwapBacked(new_page);
    __add_to_swap_cache();
    swap_readpage();
    ClearPageSwapBacked(new_page);

Why clear PG_swapbacked flag?

> > > So you'll find that shmem and swap are counted as file in some places
> > > and anon in others, and it's hard to grasp which is where and why,
> > > without remembering the history.
>
> Hugh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] add extra free kbytes tunable
On 03/02/2013 11:08 AM, Hugh Dickins wrote:
> On Sat, 2 Mar 2013, Simon Jeons wrote:
> > On 03/02/2013 09:42 AM, Hugh Dickins wrote:
> > > On Sat, 2 Mar 2013, Simon Jeons wrote:
> > > > In function __add_to_swap_cache if add to radix tree successfully will
> > > > result in increase NR_FILE_PAGES, why? This is anonymous page instead
> > > > of file backed page.
> > >
> > > Right, that's hard to understand without historical background.
> > >
> > > I think the quick answer would be that we used to (and still do) think
> > > of file-cache and swap-cache as two halves of page-cache. And then when
> >
> > shmem page should be treated as file-cache or swap-cache? It is strange
> > since it is consist of anonymous pages and these pages establish files.
>
> A shmem page is swap-backed file-cache, and it may get transferred to or
> from swap-cache: yes, it's a difficult and confusing case, as I said below.
>
> I would never call it "anonymous", but it is counted in /proc/meminfo's
> Active(anon) or Inactive(anon) rather than in (file), because "anon"
> there is shorthand for "swap-backed".

Oh, I see. Thanks. :)

> > > So you'll find that shmem and swap are counted as file in some places
> > > and anon in others, and it's hard to grasp which is where and why,
> > > without remembering the history.
>
> Hugh
Re: [PATCH] add extra free kbytes tunable
On Sat, 2 Mar 2013, Simon Jeons wrote:
> On 03/02/2013 09:42 AM, Hugh Dickins wrote:
> > On Sat, 2 Mar 2013, Simon Jeons wrote:
> > > In function __add_to_swap_cache if add to radix tree successfully will
> > > result in increase NR_FILE_PAGES, why? This is anonymous page instead
> > > of file backed page.
> >
> > Right, that's hard to understand without historical background.
> >
> > I think the quick answer would be that we used to (and still do) think
> > of file-cache and swap-cache as two halves of page-cache. And then when
>
> shmem page should be treated as file-cache or swap-cache? It is strange since
> it is consist of anonymous pages and these pages establish files.

A shmem page is swap-backed file-cache, and it may get transferred to or
from swap-cache: yes, it's a difficult and confusing case, as I said below.

I would never call it "anonymous", but it is counted in /proc/meminfo's
Active(anon) or Inactive(anon) rather than in (file), because "anon"
there is shorthand for "swap-backed".

> > So you'll find that shmem and swap are counted as file in some places
> > and anon in others, and it's hard to grasp which is where and why,
> > without remembering the history.

Hugh
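[Editorial sketch, not part of the original message.] To look at the split Hugh describes on a live machine, something like the following may help. It parses /proc/meminfo-style text and puts the Shmem line next to the (anon) and (file) LRU lines. The helper names parse_meminfo and lru_summary are hypothetical, and the sample text uses invented numbers purely so the code can run on a machine without /proc.

```python
import re

# Illustrative sample only: these numbers are invented, not measured.
SAMPLE_MEMINFO = """\
MemTotal:       49451836 kB
Cached:         30123456 kB
SwapCached:         1024 kB
Active(anon):    2000000 kB
Inactive(anon):   500000 kB
Active(file):   20000000 kB
Inactive(file): 10000000 kB
Shmem:            300000 kB
"""


def parse_meminfo(text):
    """Return {field name: integer value} for each 'Name: N [kB]' line."""
    fields = {}
    for line in text.splitlines():
        m = re.match(r"^([^:]+):\s+(\d+)(?:\s+kB)?\s*$", line)
        if m:
            fields[m.group(1)] = int(m.group(2))
    return fields


def lru_summary(fields):
    """Group the LRU lines by the 'anon' (swap-backed) / 'file' split."""
    return {
        "swap_backed_lru_kb": fields.get("Active(anon)", 0)
        + fields.get("Inactive(anon)", 0),
        "file_backed_lru_kb": fields.get("Active(file)", 0)
        + fields.get("Inactive(file)", 0),
        # shmem is reported on its own line, yet its pages live on the
        # swap-backed ("anon") LRU lists rather than the file ones.
        "shmem_kb": fields.get("Shmem", 0),
    }


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            text = f.read()
    except OSError:
        text = SAMPLE_MEMINFO  # fall back to the invented sample
    print(lru_summary(parse_meminfo(text)))
```

Reading "anon" as "swap-backed" in the first key is the whole point: a tmpfs-heavy box can show a large Active(anon) figure without having much truly anonymous memory.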
Re: [PATCH] add extra free kbytes tunable
On 03/02/2013 09:42 AM, Hugh Dickins wrote:
> On Sat, 2 Mar 2013, Simon Jeons wrote:
> > In function __add_to_swap_cache if add to radix tree successfully will
> > result in increase NR_FILE_PAGES, why? This is anonymous page instead of
> > file backed page.
>
> Right, that's hard to understand without historical background.
>
> I think the quick answer would be that we used to (and still do) think
> of file-cache and swap-cache as two halves of page-cache. And then when

shmem page should be treated as file-cache or swap-cache? It is strange since
it is consist of anonymous pages and these pages establish files.

> someone changed the way stats were gathered, they couldn't very well
> name the stat for page-cache pages NR_PAGE_PAGES, so they called it
> NR_FILE_PAGES - but it still included swap.
>
> We have tried down the years to keep the info shown in /proc/meminfo
> (for example, but it is the prime example) consistent across releases,
> while adding new lines and new distinctions. But it has often been
> hard to find good enough short enough names for those new distinctions:
> when 2.6.28 split the LRUs between file-backed and swap-backed, it used
> "anon" for swap-backed in /proc/meminfo.
>
> So you'll find that shmem and swap are counted as file in some places
> and anon in others, and it's hard to grasp which is where and why,
> without remembering the history.
>
> I notice that fs/proc/meminfo.c:meminfo_proc_show() subtracts
> total_swapcache_pages from the NR_FILE_PAGES count for /proc/meminfo:
> so it's undoing what you observe __add_to_swap_cache() to be doing.
>
> It's quite possible that if you went through all the users of
> NR_FILE_PAGES, you'd find it makes much more sense to leave out the
> swap-cache pages, and just add those on where needed.
>
> But you might find a few places where it's hard to decide whether
> the swap-cache pages were ever intended to be included or not, and
> hard to decide if it's safe to change those numbers now or not.
>
> Hugh
Re: [PATCH] add extra free kbytes tunable
On Sat, 2 Mar 2013, Simon Jeons wrote:
>
> In function __add_to_swap_cache if add to radix tree successfully will result
> in increase NR_FILE_PAGES, why? This is anonymous page instead of file backed
> page.

Right, that's hard to understand without historical background.

I think the quick answer would be that we used to (and still do) think
of file-cache and swap-cache as two halves of page-cache. And then when
someone changed the way stats were gathered, they couldn't very well
name the stat for page-cache pages NR_PAGE_PAGES, so they called it
NR_FILE_PAGES - but it still included swap.

We have tried down the years to keep the info shown in /proc/meminfo
(for example, but it is the prime example) consistent across releases,
while adding new lines and new distinctions. But it has often been
hard to find good enough short enough names for those new distinctions:
when 2.6.28 split the LRUs between file-backed and swap-backed, it used
"anon" for swap-backed in /proc/meminfo.

So you'll find that shmem and swap are counted as file in some places
and anon in others, and it's hard to grasp which is where and why,
without remembering the history.

I notice that fs/proc/meminfo.c:meminfo_proc_show() subtracts
total_swapcache_pages from the NR_FILE_PAGES count for /proc/meminfo:
so it's undoing what you observe __add_to_swap_cache() to be doing.

It's quite possible that if you went through all the users of
NR_FILE_PAGES, you'd find it makes much more sense to leave out the
swap-cache pages, and just add those on where needed.

But you might find a few places where it's hard to decide whether
the swap-cache pages were ever intended to be included or not, and
hard to decide if it's safe to change those numbers now or not.

Hugh
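[Editorial sketch, not part of the original message.] The bookkeeping Hugh describes can be pictured with a toy model: swap-cache insertion bumps the same counter as file-cache insertion, and the /proc/meminfo side subtracts the swap-cache count again. This is plain Python standing in for the kernel counters, not kernel code; the class and method names are hypothetical, and only the one subtraction mentioned in the message is modelled.

```python
class PageCacheCounters:
    """Toy model of the two counters discussed in the message above."""

    def __init__(self):
        self.nr_file_pages = 0      # stands in for NR_FILE_PAGES
        self.swapcache_pages = 0    # stands in for total_swapcache_pages

    def add_to_page_cache(self):
        # A regular file page enters the page cache.
        self.nr_file_pages += 1

    def add_to_swap_cache(self):
        # As observed in the thread: inserting into the swap cache also
        # increments NR_FILE_PAGES, because swap-cache is treated as one
        # half of page-cache.
        self.nr_file_pages += 1
        self.swapcache_pages += 1

    def meminfo_cached(self):
        # Models only the subtraction Hugh points out in
        # meminfo_proc_show(); other adjustments are left out.
        return self.nr_file_pages - self.swapcache_pages


counters = PageCacheCounters()
for _ in range(3):
    counters.add_to_page_cache()
counters.add_to_swap_cache()
print(counters.nr_file_pages, counters.meminfo_cached())  # prints "4 3"
```

The raw counter says 4 while the reported figure says 3, which is the kind of mismatch that makes it hard to decide, user by user, whether swap-cache pages were meant to be included.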
Re: [PATCH] add extra free kbytes tunable
On 03/02/2013 06:33 AM, Hugh Dickins wrote:
> On Fri, 1 Mar 2013, Simon Jeons wrote:
> > On 03/01/2013 05:22 PM, Simon Jeons wrote:
> > > On 02/23/2013 01:56 AM, Johannes Weiner wrote:
> > > > Mapped file pages have to get scanned twice before they are reclaimed
> > > > because we don't have enough usage information after the first scan.
> > >
> > > It seems that just VM_EXEC mapped file pages are protected.
> > > Issue in page reclaim subsystem:
> > > static inline int page_is_file_cache(struct page *page)
> > > {
> > >         return !PageSwapBacked(page);
> > > }
> > > AFAIK, PG_swapbacked is set if anonymous page added to swap cache, and be
> > > cleaned if removed from swap cache. So anonymous pages which are reclaimed
> > > and add to swap cache won't have this flag, then they will be treated as
> >
> > s/are/aren't
>
> PG_swapbacked != PG_swapcache

Oh, I see. Thanks Hugh, thanks for your patience. :)

In function __add_to_swap_cache if add to radix tree successfully will result
in increase NR_FILE_PAGES, why? This is anonymous page instead of file backed
page.
Re: [PATCH] add extra free kbytes tunable
On Fri, 1 Mar 2013, Simon Jeons wrote:
> On 03/01/2013 05:22 PM, Simon Jeons wrote:
> > On 02/23/2013 01:56 AM, Johannes Weiner wrote:
> > > Mapped file pages have to get scanned twice before they are reclaimed
> > > because we don't have enough usage information after the first scan.
> >
> > It seems that just VM_EXEC mapped file pages are protected.
> > Issue in page reclaim subsystem:
> > static inline int page_is_file_cache(struct page *page)
> > {
> >         return !PageSwapBacked(page);
> > }
> > AFAIK, PG_swapbacked is set if anonymous page added to swap cache, and be
> > cleaned if removed from swap cache. So anonymous pages which are reclaimed
> > and add to swap cache won't have this flag, then they will be treated as
>
> s/are/aren't

PG_swapbacked != PG_swapcache
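[Editorial sketch, not part of the original message.] Hugh's one-line answer is that there are two separate flags. A small userspace model may make the distinction concrete. The flag names below mirror PG_swapbacked and PG_swapcache, but the enum, its values, and the three example pages are illustrative assumptions; in particular, the premise that an anonymous page is swap-backed whether or not it currently sits in the swap cache is my reading of the reply, not something spelled out in it.

```python
from enum import Flag


class PageFlag(Flag):
    NONE = 0
    SWAPBACKED = 1   # models PG_swapbacked: backing store is swap, not a file
    SWAPCACHE = 2    # models PG_swapcache: page is in the swap cache right now


def page_is_file_cache(flags):
    """Same test as the helper quoted above: file cache means not swap-backed."""
    return not (flags & PageFlag.SWAPBACKED)


# Hypothetical example pages.
anon_in_ram = PageFlag.SWAPBACKED                           # never swapped yet
anon_in_swapcache = PageFlag.SWAPBACKED | PageFlag.SWAPCACHE
file_page = PageFlag.NONE

for name, flags in [
    ("anon, not in swap cache", anon_in_ram),
    ("anon, in swap cache", anon_in_swapcache),
    ("regular file page", file_page),
]:
    print(f"{name}: page_is_file_cache = {page_is_file_cache(flags)}")
```

Under that reading, entering or leaving the swap cache toggles only SWAPCACHE, so the page_is_file_cache() answer for an anonymous page does not change during reclaim.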
Re: [PATCH] add extra free kbytes tunable
On 03/01/2013 05:22 PM, Simon Jeons wrote:
> Hi Johannes,
>
> On 02/23/2013 01:56 AM, Johannes Weiner wrote:
> > On Tue, Feb 19, 2013 at 09:19:27PM -0800, dormando wrote:
> > > > The problem is that adding this tunable will constrain future VM
> > > > implementations. We will forever need to at least retain the
> > > > pseudo-file. We will also need to make some effort to retain its
> > > > behaviour. It would of course be better to fix things so you don't
> > > > need to tweak VM internals to get acceptable behaviour.
> > >
> > > I sympathize with this. It's presently all that keeps us afloat though.
> > > I'll whine about it again later if nothing else pans out.
> > >
> > > > You said:
> > > >
> > > > : We have a server workload wherein machines with 100G+ of "free" memory
> > > > : (used by page cache), scattered but frequent random io reads from 12+
> > > > : SSD's, and 5gbps+ of internet traffic, will frequently hit direct reclaim
> > > > : in a few different ways.
> > > > :
> > > > : 1) It'll run into small amounts of reclaim randomly (a few hundred
> > > > : thousand).
> > > > :
> > > > : 2) A burst of reads or traffic can cause extra pressure, which kswapd
> > > > : occasionally responds to by freeing up 40g+ of the pagecache all at once
> > > > : (!) while pausing the system (Argh).
> > > > :
> > > > : 3) A blip in an upstream provider or failover from a peer causes the
> > > > : kernel to allocate massive amounts of memory for retransmission
> > > > : queues/etc, potentially along with buffered IO reads and (some, but not
> > > > : often a ton) of new allocations from an application. This paired with 2)
> > > > : can cause the box to stall for 15+ seconds.
> > > >
> > > > Can we prioritise these? 2) looks just awful - kswapd shouldn't just
> > > > go off and free 40G of pagecache. Do you know what's actually in that
> > > > pagecache? Large number of small files or small number of (very) large
> > > > files?
> > >
> > > We have a handful of huge files (6-12ish 200g+) that are mmap'ed and
> > > accessed via address. occasionally madvise (WILLNEED) applied to the
> > > address ranges before attempting to use them. There're a mix of other
> > > files but nothing significant. The mmap's are READONLY and writes are
> > > done via pwrite-ish functions.
> > >
> > > I could use some guidance on inspecting/tracing the problem. I've been
> > > trying to reproduce it in a lab, and respecting to 2)'s issue I've found:
> > >
> > > - The amount of memory freed back up is either a percentage of total
> > >   memory or a percentage of free memory. (a machine with 48G of ram will
> > >   "only" free up an extra 4-7g)
> > >
> > > - It's most likely to happen after a fresh boot, or if "3 > drop_caches"
> > >   is applied with the application down. As it fills it seems to get
> > >   itself into trouble, but becomes more stable after that. Unfortunately
> > >   1) and 3) still apply to a stable instance.
> > >
> > > - Protecting the DMA32 zone with something like "1 1 32" into
> > >   lowmem_reserve_ratio makes the mass-reclaiming less likely to happen.
> > >
> > > - While watching "sar -B 1" I'll see kswapd wake up, and scan up to a few
> > >   hundred thousand pages before finding anything it actually wants to
> > >   reclaim (low vmeff). I've only been able to reproduce this from a clean
> > >   start. It can take up to 3 seconds before kswapd starts actually
> > >   reclaiming pages.
> > >
> > > - So far as I can tell we're almost exclusively using 0 order
> > >   allocations. THP is disabled.
> > >
> > > There's not much dirty memory involved. It's not flushing out writes
> > > while reclaiming, it just kills off massive amount of cached memory.
> >
> > Mapped file pages have to get scanned twice before they are reclaimed
> > because we don't have enough usage information after the first scan.
>
> It seems that just VM_EXEC mapped file pages are protected.
> Issue in page reclaim subsystem:
> static inline int page_is_file_cache(struct page *page)
> {
>         return !PageSwapBacked(page);
> }
> AFAIK, PG_swapbacked is set if anonymous page added to swap cache, and be
> cleaned if removed from swap cache. So anonymous pages which are reclaimed
> and add to swap cache won't have this flag, then they will be treated as

s/are/aren't

> file backed pages? Is it buggy?
>
> In function __add_to_swap_cache if add to radix tree successfully will
> result in increase NR_FILE_PAGES, why?
>
> > In your case, when you start this workload after a fresh boot or
> > dropping the caches, there will be 48G of mapped file pages that have
> > never been scanned before and that need to be looked at twice.
> >
> > Unfortunately, if kswapd does not make progress (and it won't for some
> > time at first), it will scan more and more aggressively with
>
> Why kswapd does not make progress for some time at first?
>
> > increasing scan priority. And when the 48G of pages are finally
> > cycled, kswapd's scan window is a large percentage of your machine's
> > memory, and it will free every single page in it.
> >
> > I think we should think about capping kswapd zone reclaim cycles just
> > as we do for direct reclaim. It's a little ridiculous that it can run
> > unbounded and reclaim every page in a zone without ever checking back
> > against the watermark. We still increase the scan window evenly when
> > we don't make forward progress, but we are more carefully inching zone
> > levels back toward the watermarks.
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
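[Editorial sketch, not part of the original message.] The "scan window" that grows as kswapd fails to make progress can be illustrated with a little arithmetic. The rule used here, window = LRU size shifted right by the scan priority, with priority starting at 12 and dropping toward 0, is my recollection of how mm/vmscan.c of that era sized a scan pass; neither the rule nor the constant is stated in the thread, so treat both as assumptions, along with the 4 KiB page size and the function names.

```python
PAGE_SIZE = 4096      # assumption: 4 KiB pages
DEF_PRIORITY = 12     # assumption: starting scan priority


def scan_window_pages(lru_pages, priority):
    """Pages considered in one pass; a lower priority means a bigger window."""
    return lru_pages >> priority


def windows_by_priority(lru_bytes):
    """List (priority, window in pages) from the gentlest pass to the harshest."""
    lru_pages = lru_bytes // PAGE_SIZE
    return [
        (prio, scan_window_pages(lru_pages, prio))
        for prio in range(DEF_PRIORITY, -1, -1)
    ]


if __name__ == "__main__":
    # 48G of mapped file pages, as in the message above.
    for prio, pages in windows_by_priority(48 * 2**30):
        gib = pages * PAGE_SIZE / 2**30
        print(f"priority {prio:2d}: {pages:>9d} pages ({gib:.3f} GiB)")
```

Under these assumptions the first pass looks at about 3072 pages, while a pass at priority 0 covers the entire 48G, which matches the description of a window that ends up being a large percentage of the machine's memory.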
Re: [PATCH] add extra free kbytes tunable
Hi Johannes,

On 02/23/2013 01:56 AM, Johannes Weiner wrote:
> On Tue, Feb 19, 2013 at 09:19:27PM -0800, dormando wrote:
> > > The problem is that adding this tunable will constrain future VM
> > > implementations. We will forever need to at least retain the
> > > pseudo-file. We will also need to make some effort to retain its
> > > behaviour. It would of course be better to fix things so you don't
> > > need to tweak VM internals to get acceptable behaviour.
> >
> > I sympathize with this. It's presently all that keeps us afloat though.
> > I'll whine about it again later if nothing else pans out.
> >
> > > You said:
> > >
> > > : We have a server workload wherein machines with 100G+ of "free" memory
> > > : (used by page cache), scattered but frequent random io reads from 12+
> > > : SSD's, and 5gbps+ of internet traffic, will frequently hit direct reclaim
> > > : in a few different ways.
> > > :
> > > : 1) It'll run into small amounts of reclaim randomly (a few hundred
> > > : thousand).
> > > :
> > > : 2) A burst of reads or traffic can cause extra pressure, which kswapd
> > > : occasionally responds to by freeing up 40g+ of the pagecache all at once
> > > : (!) while pausing the system (Argh).
> > > :
> > > : 3) A blip in an upstream provider or failover from a peer causes the
> > > : kernel to allocate massive amounts of memory for retransmission
> > > : queues/etc, potentially along with buffered IO reads and (some, but not
> > > : often a ton) of new allocations from an application. This paired with 2)
> > > : can cause the box to stall for 15+ seconds.
> > >
> > > Can we prioritise these? 2) looks just awful - kswapd shouldn't just
> > > go off and free 40G of pagecache. Do you know what's actually in that
> > > pagecache? Large number of small files or small number of (very) large
> > > files?
> >
> > We have a handful of huge files (6-12ish 200g+) that are mmap'ed and
> > accessed via address. occasionally madvise (WILLNEED) applied to the
> > address ranges before attempting to use them. There're a mix of other
> > files but nothing significant. The mmap's are READONLY and writes are
> > done via pwrite-ish functions.
> >
> > I could use some guidance on inspecting/tracing the problem. I've been
> > trying to reproduce it in a lab, and respecting to 2)'s issue I've found:
> >
> > - The amount of memory freed back up is either a percentage of total
> >   memory or a percentage of free memory. (a machine with 48G of ram will
> >   "only" free up an extra 4-7g)
> >
> > - It's most likely to happen after a fresh boot, or if "3 > drop_caches"
> >   is applied with the application down. As it fills it seems to get
> >   itself into trouble, but becomes more stable after that. Unfortunately
> >   1) and 3) still apply to a stable instance.
> >
> > - Protecting the DMA32 zone with something like "1 1 32" into
> >   lowmem_reserve_ratio makes the mass-reclaiming less likely to happen.
> >
> > - While watching "sar -B 1" I'll see kswapd wake up, and scan up to a few
> >   hundred thousand pages before finding anything it actually wants to
> >   reclaim (low vmeff). I've only been able to reproduce this from a clean
> >   start. It can take up to 3 seconds before kswapd starts actually
> >   reclaiming pages.
> >
> > - So far as I can tell we're almost exclusively using 0 order
> >   allocations. THP is disabled.
> >
> > There's not much dirty memory involved. It's not flushing out writes
> > while reclaiming, it just kills off massive amount of cached memory.
>
> Mapped file pages have to get scanned twice before they are reclaimed
> because we don't have enough usage information after the first scan.

It seems that just VM_EXEC mapped file pages are protected.
Issue in page reclaim subsystem:

static inline int page_is_file_cache(struct page *page)
{
        return !PageSwapBacked(page);
}

AFAIK, PG_swapbacked is set if anonymous page added to swap cache, and be
cleaned if removed from swap cache. So anonymous pages which are reclaimed
and add to swap cache won't have this flag, then they will be treated as
file backed pages? Is it buggy?

In function __add_to_swap_cache if add to radix tree successfully will result
in increase NR_FILE_PAGES, why?

> In your case, when you start this workload after a fresh boot or
> dropping the caches, there will be 48G of mapped file pages that have
> never been scanned before and that need to be looked at twice.
>
> Unfortunately, if kswapd does not make progress (and it won't for some
> time at first), it will scan more and more aggressively with

Why kswapd does not make progress for some time at first?

> increasing scan priority. And when the 48G of pages are finally
> cycled, kswapd's scan window is a large percentage of your machine's
> memory, and it will free every single page in it.
>
> I think we should think about capping kswapd zone reclaim cycles just
> as we do for direct reclaim. It's a little ridiculous that it can run
> unbounded and reclaim every page in a zone without ever checking back
> against the watermark. We still increase the scan window evenly when
> we don't make forward progress, but we are more carefully inching zone
> levels back toward the watermarks.
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c4883eb..8a4c446 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2645,10 +2645,11 @@
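[Editorial sketch, not part of the original message.] The "low vmeff" that dormando sees in "sar -B 1" is, per the sysstat documentation, pages stolen divided by pages scanned. A rough way to compute a kswapd-only variant of that ratio from two /proc/vmstat snapshots is sketched below. The counter names differ between kernel versions, so the code sums whatever starts with "pgscan_kswapd" and "pgsteal_kswapd"; that prefix choice, the helper names, and the two sample snapshots (invented numbers) are all assumptions.

```python
def parse_vmstat(text):
    """Parse 'name value' lines as found in /proc/vmstat."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def kswapd_totals(counters):
    """Sum the kswapd scan and steal counters across zones."""
    scanned = sum(v for k, v in counters.items() if k.startswith("pgscan_kswapd"))
    stolen = sum(v for k, v in counters.items() if k.startswith("pgsteal_kswapd"))
    return scanned, stolen


def vmeff_percent(before, after):
    """Reclaim efficiency between two snapshots: stolen / scanned * 100."""
    scan0, steal0 = kswapd_totals(before)
    scan1, steal1 = kswapd_totals(after)
    scanned = scan1 - scan0
    if scanned <= 0:
        return None  # nothing was scanned in the interval
    return 100.0 * (steal1 - steal0) / scanned


# Invented snapshots: 300000 pages scanned, 3000 reclaimed in the interval.
BEFORE = "pgscan_kswapd_normal 1000\npgsteal_kswapd_normal 900\n"
AFTER = "pgscan_kswapd_normal 301000\npgsteal_kswapd_normal 3900\n"

print(vmeff_percent(parse_vmstat(BEFORE), parse_vmstat(AFTER)))  # prints 1.0
```

A figure this low corresponds to the situation described in the bullet list: kswapd walks a few hundred thousand pages before it finds anything it is willing to reclaim.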
Re: [PATCH] add extra free kbytes tunable
On Fri, 1 Mar 2013, Simon Jeons wrote:
> On 03/01/2013 05:22 PM, Simon Jeons wrote:
> > On 02/23/2013 01:56 AM, Johannes Weiner wrote:
> > > Mapped file pages have to get scanned twice before they are reclaimed
> > > because we don't have enough usage information after the first scan.
> >
> > It seems that just VM_EXEC mapped file pages are protected.
> > Issue in page reclaim subsystem:
> > static inline int page_is_file_cache(struct page *page)
> > {
> >         return !PageSwapBacked(page);
> > }
> > AFAIK, PG_swapbacked is set if anonymous page added to swap cache, and be
> > cleaned if removed from swap cache. So anonymous pages which are reclaimed
> > and add to swap cache won't have this flag, then they will be treated as
>
> s/are/aren't

PG_swapbacked != PG_swapcache
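Hugh's one-line answer can be illustrated with a small userspace model. This is only a sketch: the `MODEL_*` flag values, `struct model_page`, and `model_demo()` are hypothetical names invented here, not the kernel's definitions, and the assumption that an anonymous page carries the swap-backed flag before it ever reaches the swap cache is stated by the model rather than taken from the thread. The point it shows is that the quoted `page_is_file_cache()` helper looks only at the swap-backed bit, so being in or out of the swap cache does not change the answer.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical stand-ins for two distinct page flags. The real bits are
 * defined in the kernel's page-flags header and have different values.
 */
enum {
    MODEL_PG_SWAPBACKED = 1u << 0, /* page is backed by swap (anon, shmem) */
    MODEL_PG_SWAPCACHE  = 1u << 1, /* page currently sits in the swap cache */
};

struct model_page {
    unsigned int flags;
};

/* Same logic as the helper quoted in the thread: only swap-backed matters. */
bool model_page_is_file_cache(const struct model_page *page)
{
    return !(page->flags & MODEL_PG_SWAPBACKED);
}

/* Usage example: three pages, classified independently of swap-cache state. */
void model_demo(void)
{
    struct model_page file_page = { 0 };
    struct model_page anon_page = { MODEL_PG_SWAPBACKED };
    struct model_page anon_in_swap_cache = {
        MODEL_PG_SWAPBACKED | MODEL_PG_SWAPCACHE
    };

    assert(model_page_is_file_cache(&file_page));
    assert(!model_page_is_file_cache(&anon_page));
    assert(!model_page_is_file_cache(&anon_in_swap_cache));
}
```

Under these assumptions an anonymous page is never classified as file cache, whether or not it has been added to the swap cache, which is why the two flags must not be read as the same thing.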
Re: [PATCH] add extra free kbytes tunable
On 03/02/2013 06:33 AM, Hugh Dickins wrote:
> On Fri, 1 Mar 2013, Simon Jeons wrote:
> > On 03/01/2013 05:22 PM, Simon Jeons wrote:
> > > On 02/23/2013 01:56 AM, Johannes Weiner wrote:
> > > > Mapped file pages have to get scanned twice before they are reclaimed
> > > > because we don't have enough usage information after the first scan.
> > >
> > > It seems that just VM_EXEC mapped file pages are protected.
> > > Issue in page reclaim subsystem:
> > > static inline int page_is_file_cache(struct page *page)
> > > {
> > >         return !PageSwapBacked(page);
> > > }
> > > AFAIK, PG_swapbacked is set if anonymous page added to swap cache, and be
> > > cleaned if removed from swap cache. So anonymous pages which are reclaimed
> > > and add to swap cache won't have this flag, then they will be treated as
> >
> > s/are/aren't
>
> PG_swapbacked != PG_swapcache

Oh, I see. Thanks Hugh, thanks for your patience. :)

In function __add_to_swap_cache if add to radix tree successfully will result in increase NR_FILE_PAGES, why? This is anonymous page instead of file backed page.
Re: [PATCH] add extra free kbytes tunable
On Sat, 2 Mar 2013, Simon Jeons wrote:
> In function __add_to_swap_cache if add to radix tree successfully will result in increase NR_FILE_PAGES, why? This is anonymous page instead of file backed page.

Right, that's hard to understand without historical background.

I think the quick answer would be that we used to (and still do) think of file-cache and swap-cache as two halves of page-cache. And then when someone changed the way stats were gathered, they couldn't very well name the stat for page-cache pages NR_PAGE_PAGES, so they called it NR_FILE_PAGES - but it still included swap.

We have tried down the years to keep the info shown in /proc/meminfo (for example, but it is the prime example) consistent across releases, while adding new lines and new distinctions. But it has often been hard to find good enough short enough names for those new distinctions: when 2.6.28 split the LRUs between file-backed and swap-backed, it used "anon" for swap-backed in /proc/meminfo.

So you'll find that shmem and swap are counted as file in some places and anon in others, and it's hard to grasp which is where and why, without remembering the history.

I notice that fs/proc/meminfo.c:meminfo_proc_show() subtracts total_swapcache_pages from the NR_FILE_PAGES count for /proc/meminfo: so it's undoing what you observe __add_to_swap_cache() to be doing.

It's quite possible that if you went through all the users of NR_FILE_PAGES, you'd find it makes much more sense to leave out the swap-cache pages, and just add those on where needed. But you might find a few places where it's hard to decide whether the swap-cache pages were ever intended to be included or not, and hard to decide if it's safe to change those numbers now or not.

Hugh
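The subtraction Hugh describes can be sketched in userspace C. This is a model under stated assumptions, not the kernel source: `struct model_vmstat` and `model_cached_pages()` are hypothetical names, the counter values in the usage note are made up, and the extra subtraction of buffer pages plus the clamp at zero are added from memory of how kernels of that period computed the "Cached" line, so treat those two details as assumptions.

```c
#include <assert.h>

/* Hypothetical snapshot of the relevant counters, all in pages. */
struct model_vmstat {
    long nr_file_pages;   /* page cache as counted by NR_FILE_PAGES, swap cache included */
    long swapcache_pages; /* the figure total_swapcache_pages refers to */
    long buffer_pages;    /* block-device buffers (assumption: reported separately) */
};

/*
 * Model of the "Cached" figure: start from the file-page counter and undo
 * the swap-cache pages that __add_to_swap_cache() added to it.
 */
long model_cached_pages(const struct model_vmstat *s)
{
    long cached = s->nr_file_pages - s->swapcache_pages - s->buffer_pages;

    /* Counters are sampled without locking, so guard against going negative. */
    return cached < 0 ? 0 : cached;
}
```

For example, a snapshot with 1000 file pages, 100 of them in swap cache, and 50 buffer pages yields 850 cached pages in this model. The design question Hugh raises is whether callers would be better served by a counter that never included the swap-cache pages in the first place.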
Re: [PATCH] add extra free kbytes tunable
On 03/02/2013 09:42 AM, Hugh Dickins wrote:
> On Sat, 2 Mar 2013, Simon Jeons wrote:
> > In function __add_to_swap_cache if add to radix tree successfully will result in increase NR_FILE_PAGES, why? This is anonymous page instead of file backed page.
>
> Right, that's hard to understand without historical background.
>
> I think the quick answer would be that we used to (and still do) think of file-cache and swap-cache as two halves of page-cache. And then when

shmem page should be treated as file-cache or swap-cache? It is strange since it is consist of anonymous pages and these pages establish files.

> someone changed the way stats were gathered, they couldn't very well name the stat for page-cache pages NR_PAGE_PAGES, so they called it NR_FILE_PAGES - but it still included swap.
>
> We have tried down the years to keep the info shown in /proc/meminfo (for example, but it is the prime example) consistent across releases, while adding new lines and new distinctions. But it has often been hard to find good enough short enough names for those new distinctions: when 2.6.28 split the LRUs between file-backed and swap-backed, it used "anon" for swap-backed in /proc/meminfo.
>
> So you'll find that shmem and swap are counted as file in some places and anon in others, and it's hard to grasp which is where and why, without remembering the history.
>
> I notice that fs/proc/meminfo.c:meminfo_proc_show() subtracts total_swapcache_pages from the NR_FILE_PAGES count for /proc/meminfo: so it's undoing what you observe __add_to_swap_cache() to be doing.
>
> It's quite possible that if you went through all the users of NR_FILE_PAGES, you'd find it makes much more sense to leave out the swap-cache pages, and just add those on where needed. But you might find a few places where it's hard to decide whether the swap-cache pages were ever intended to be included or not, and hard to decide if it's safe to change those numbers now or not.
>
> Hugh
Re: [PATCH] add extra free kbytes tunable
On Sat, 2 Mar 2013, Simon Jeons wrote:
> On 03/02/2013 09:42 AM, Hugh Dickins wrote:
> > On Sat, 2 Mar 2013, Simon Jeons wrote:
> > > In function __add_to_swap_cache if add to radix tree successfully will result in increase NR_FILE_PAGES, why? This is anonymous page instead of file backed page.
> >
> > Right, that's hard to understand without historical background.
> >
> > I think the quick answer would be that we used to (and still do) think of file-cache and swap-cache as two halves of page-cache. And then when
>
> shmem page should be treated as file-cache or swap-cache? It is strange since it is consist of anonymous pages and these pages establish files.

A shmem page is swap-backed file-cache, and it may get transferred to or from swap-cache: yes, it's a difficult and confusing case, as I said below.

I would never call it "anonymous", but it is counted in /proc/meminfo's Active(anon) or Inactive(anon) rather than in (file), because "anon" there is shorthand for "swap-backed".

> > So you'll find that shmem and swap are counted as file in some places and anon in others, and it's hard to grasp which is where and why, without remembering the history.

Hugh
Re: [PATCH] add extra free kbytes tunable
On 03/02/2013 11:08 AM, Hugh Dickins wrote:
> On Sat, 2 Mar 2013, Simon Jeons wrote:
> > On 03/02/2013 09:42 AM, Hugh Dickins wrote:
> > > On Sat, 2 Mar 2013, Simon Jeons wrote:
> > > > In function __add_to_swap_cache if add to radix tree successfully will result in increase NR_FILE_PAGES, why? This is anonymous page instead of file backed page.
> > >
> > > Right, that's hard to understand without historical background.
> > >
> > > I think the quick answer would be that we used to (and still do) think of file-cache and swap-cache as two halves of page-cache. And then when
> >
> > shmem page should be treated as file-cache or swap-cache? It is strange since it is consist of anonymous pages and these pages establish files.
>
> A shmem page is swap-backed file-cache, and it may get transferred to or from swap-cache: yes, it's a difficult and confusing case, as I said below.
>
> I would never call it "anonymous", but it is counted in /proc/meminfo's Active(anon) or Inactive(anon) rather than in (file), because "anon" there is shorthand for "swap-backed".

Oh, I see. Thanks. :)

> > > So you'll find that shmem and swap are counted as file in some places and anon in others, and it's hard to grasp which is where and why, without remembering the history.
>
> Hugh
Re: [PATCH] add extra free kbytes tunable
On Tue, Feb 26, 2013 at 10:13:15AM -0500, Johannes Weiner wrote: > > > > > > I think we should think about capping kswapd zone reclaim cycles just > > > as we do for direct reclaim. It's a little ridiculous that it can run > > > unbounded and reclaim every page in a zone without ever checking back > > > against the watermark. We still increase the scan window evenly when > > > we don't make forward progress, but we are more carefully inching zone > > > levels back toward the watermarks. > > > > > > > While on the surface I think this will appear to work, I worry that it > > will cause kswapds priorities to continually reset even when it's under > > real pressure as opposed to "failing to reclaim because of use-once". > > With nr_to_reclaim always at SWAP_CLUSTER_MAX, we'll hit this check and > > reset after each zone scan. > > > > if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX) > > break; > > But we hit that check now as well...? Eventually yes. > I.e. unless there is a hard to > reclaim batch and kswapd is unable to make forward progress, priority > levels will always get reset after we scanned all zones and reclaimed > SWAP_CLUSTER_MAX or more in the process. > The reset happens after it has reclaimed a lot of pages. I agree with you that this is likely the wrong thing to do. I'm just pointing out that this simple patch changes behaviour in a big way. > All I'm arguing is that, if we hit a hard to reclaim batch we should > continue to increase the number of pages to scan, but still bail out > if we reclaimed a batch successfully. It does make sense to me to > look at more pages if we encounter unreclaimable ones. It makes less > sense to me, however, to increase the reclaim goal as well in that > case. > Bail out from the reclaim maybe but care should be taken to ensure we do not hammer slab on each "bail" or reset the scanning priorities if the watermark was not met by that batch of SWAP_CLUSTER_MAX reclaims. 
We also have to think about what it means for pressure being applied equally to each zone. We will still apply equal scanning pressure but not necessarily reclaim pressure. Does that matter? I don't know. > > It'll fail the watermark check and restart of course but it does mean we > > would call shrink_slab() for every SWAP_CLUSTER_MAX*nr_unbalaced_zones > > pages scanned which will have other consequences. It'll behave differently > > but not necessarily better. > > Right, I wasn't proposing to merge the patch as is. But I do think > it's not okay that a batch of immediately unreclaimable pages can > cause kswapd to grow its reclaim target exponentially and we should > probably think about capping it one way or another. > I agree with you. MMtest results I looked at over the weekend showed that kswapd tends to be extremely spiky. Doing nothing following by reclaiming an excessive amount of memory and going back to doing nothing. This partially explains it. > shrink_slab()'s action is already based on the ratio between the > number of scanned pages and the number of lru pages, so I don't see > this as a fundamental issue, although it may require some tweaking. > > > In general, IO causing anonymous workloads to stall has gotten a lot worse > > during the last few kernels without us properly realising it other than > > interactivity in the presence of IO has gone down the crapper again. 
Late > > last week I fixed up an mmtests that runs memcachetest as the primary > > workload while doing varying amounts of IO in the background and found this
> >
> > http://www.csn.ul.ie/~mel/postings/reclaim-20130221/global-dhp__parallelio-memcachetest-ext4/hydra/report.html
> >
> > Snippet looks like this;
> >                         3.0.56              3.6.10              3.7.4               3.8.0-rc4
> >                         mainline            mainline            mainline            mainline
> > Ops memcachetest-0M     10125.00 (  0.00%)  10091.00 ( -0.34%)  11038.00 (  9.02%)  10864.00 (  7.30%)
> > Ops memcachetest-749M   10097.00 (  0.00%)   8546.00 (-15.36%)   8770.00 (-13.14%)   4872.00 (-51.75%)
> > Ops memcachetest-1623M  10161.00 (  0.00%)   3149.00 (-69.01%)   3645.00 (-64.13%)   2760.00 (-72.84%)
> > Ops memcachetest-2498M   8095.00 (  0.00%)   2527.00 (-68.78%)   2461.00 (-69.60%)   2282.00 (-71.81%)
> > Ops memcachetest-3372M   7814.00 (  0.00%)   2369.00 (-69.68%)   2396.00 (-69.34%)   2323.00 (-70.27%)
> > Ops memcachetest-4247M   3818.00 (  0.00%)   2366.00 (-38.03%)   2391.00 (-37.38%)   2274.00 (-40.44%)
> > Ops memcachetest-5121M   3852.00 (  0.00%)   2335.00 (-39.38%)   2384.00 (-38.11%)
Re: [PATCH] add extra free kbytes tunable
On Tue, Feb 26, 2013 at 10:47:31AM +, Mel Gorman wrote: > On Fri, Feb 22, 2013 at 12:56:34PM -0500, Johannes Weiner wrote: > > > > > > > > > > > > : We have a server workload wherein machines with 100G+ of "free" memory > > > > : (used by page cache), scattered but frequent random io reads from 12+ > > > > : SSD's, and 5gbps+ of internet traffic, will frequently hit direct > > > > reclaim > > > > : in a few different ways. > > > > : > > > > : 1) It'll run into small amounts of reclaim randomly (a few hundred > > > > : thousand). > > > > : > > > > : 2) A burst of reads or traffic can cause extra pressure, which kswapd > > > > : occasionally responds to by freeing up 40g+ of the pagecache all at > > > > once > > > > : (!) while pausing the system (Argh). > > > > : > > > > : 3) A blip in an upstream provider or failover from a peer causes the > > > > : kernel to allocate massive amounts of memory for retransmission > > > > : queues/etc, potentially along with buffered IO reads and (some, but > > > > not > > > > : often a ton) of new allocations from an application. This paired with > > > > 2) > > > > : can cause the box to stall for 15+ seconds. > > > > > > > > Can we prioritise these? 2) looks just awful - kswapd shouldn't just > > > > go off and free 40G of pagecache. Do you know what's actually in that > > > > pagecache? Large number of small files or small number of (very) large > > > > files? > > > > > > We have a handful of huge files (6-12ish 200g+) that are mmap'ed and > > > accessed via address. occasionally madvise (WILLNEED) applied to the > > > address ranges before attempting to use them. There're a mix of other > > > files but nothing significant. The mmap's are READONLY and writes are done > > > via pwrite-ish functions. > > > > > > I could use some guidance on inspecting/tracing the problem. 
I've been > > > trying to reproduce it in a lab, and respecting to 2)'s issue I've found: > > > > > > - The amount of memory freed back up is either a percentage of total > > > memory or a percentage of free memory. (a machine with 48G of ram will > > > "only" free up an extra 4-7g) > > > > > > - It's most likely to happen after a fresh boot, or if "3 > drop_caches" > > > is applied with the application down. As it fills it seems to get itself > > > into trouble, but becomes more stable after that. Unfortunately 1) and 3) > > > still apply to a stable instance. > > > > > > - Protecting the DMA32 zone with something like "1 1 32" into > > > lowmem_reserve_ratio makes the mass-reclaiming less likely to happen. > > > > > > - While watching "sar -B 1" I'll see kswapd wake up, and scan up to a few > > > hundred thousand pages before finding anything it actually wants to > > > reclaim (low vmeff). I've only been able to reproduce this from a clean > > > start. It can take up to 3 seconds before kswapd starts actually > > > reclaiming pages. > > > > > > - So far as I can tell we're almost exclusively using 0 order allocations. > > > THP is disabled. > > > > > > There's not much dirty memory involved. It's not flushing out writes while > > > reclaiming, it just kills off massive amount of cached memory. > > > > Mapped file pages have to get scanned twice before they are reclaimed > > because we don't have enough usage information after the first scan. > > > > In your case, when you start this workload after a fresh boot or > > dropping the caches, there will be 48G of mapped file pages that have > > never been scanned before and that need to be looked at twice. > > > > Unfortunately, if kswapd does not make progress (and it won't for some > > time at first), it will scan more and more aggressively with > > increasing scan priority. 
And when the 48G of pages are finally > > cycled, kswapd's scan window is a large percentage of your machine's > > memory, and it will free every single page in it. > > > > I think we should think about capping kswapd zone reclaim cycles just > > as we do for direct reclaim. It's a little ridiculous that it can run > > unbounded and reclaim every page in a zone without ever checking back > > against the watermark. We still increase the scan window evenly when > > we don't make forward progress, but we are more carefully inching zone > > levels back toward the watermarks. > > > > While on the surface I think this will appear to work, I worry that it > will cause kswapds priorities to continually reset even when it's under > real pressure as opposed to "failing to reclaim because of use-once". > With nr_to_reclaim always at SWAP_CLUSTER_MAX, we'll hit this check and > reset after each zone scan. > > if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX) > break; But we hit that check now as well...? I.e. unless there is a hard to reclaim batch and kswapd is unable to make forward progress, priority levels will always get reset after we scanned all zones and reclaimed SWAP_CLUSTER_MAX or more in the process. All I'm arguing is
Re: [PATCH] add extra free kbytes tunable
On Fri, Feb 22, 2013 at 12:56:34PM -0500, Johannes Weiner wrote: > > > > > > > > > : We have a server workload wherein machines with 100G+ of "free" memory > > > : (used by page cache), scattered but frequent random io reads from 12+ > > > : SSD's, and 5gbps+ of internet traffic, will frequently hit direct > > > reclaim > > > : in a few different ways. > > > : > > > : 1) It'll run into small amounts of reclaim randomly (a few hundred > > > : thousand). > > > : > > > : 2) A burst of reads or traffic can cause extra pressure, which kswapd > > > : occasionally responds to by freeing up 40g+ of the pagecache all at once > > > : (!) while pausing the system (Argh). > > > : > > > : 3) A blip in an upstream provider or failover from a peer causes the > > > : kernel to allocate massive amounts of memory for retransmission > > > : queues/etc, potentially along with buffered IO reads and (some, but not > > > : often a ton) of new allocations from an application. This paired with 2) > > > : can cause the box to stall for 15+ seconds. > > > > > > Can we prioritise these? 2) looks just awful - kswapd shouldn't just > > > go off and free 40G of pagecache. Do you know what's actually in that > > > pagecache? Large number of small files or small number of (very) large > > > files? > > > > We have a handful of huge files (6-12ish 200g+) that are mmap'ed and > > accessed via address. occasionally madvise (WILLNEED) applied to the > > address ranges before attempting to use them. There're a mix of other > > files but nothing significant. The mmap's are READONLY and writes are done > > via pwrite-ish functions. > > > > I could use some guidance on inspecting/tracing the problem. I've been > > trying to reproduce it in a lab, and respecting to 2)'s issue I've found: > > > > - The amount of memory freed back up is either a percentage of total > > memory or a percentage of free memory. 
(a machine with 48G of ram will > > "only" free up an extra 4-7g) > > > > - It's most likely to happen after a fresh boot, or if "3 > drop_caches" > > is applied with the application down. As it fills it seems to get itself > > into trouble, but becomes more stable after that. Unfortunately 1) and 3) > > still apply to a stable instance. > > > > - Protecting the DMA32 zone with something like "1 1 32" into > > lowmem_reserve_ratio makes the mass-reclaiming less likely to happen. > > > > - While watching "sar -B 1" I'll see kswapd wake up, and scan up to a few > > hundred thousand pages before finding anything it actually wants to > > reclaim (low vmeff). I've only been able to reproduce this from a clean > > start. It can take up to 3 seconds before kswapd starts actually > > reclaiming pages. > > > > - So far as I can tell we're almost exclusively using 0 order allocations. > > THP is disabled. > > > > There's not much dirty memory involved. It's not flushing out writes while > > reclaiming, it just kills off massive amount of cached memory. > > Mapped file pages have to get scanned twice before they are reclaimed > because we don't have enough usage information after the first scan. > > In your case, when you start this workload after a fresh boot or > dropping the caches, there will be 48G of mapped file pages that have > never been scanned before and that need to be looked at twice. > > Unfortunately, if kswapd does not make progress (and it won't for some > time at first), it will scan more and more aggressively with > increasing scan priority. And when the 48G of pages are finally > cycled, kswapd's scan window is a large percentage of your machine's > memory, and it will free every single page in it. > > I think we should think about capping kswapd zone reclaim cycles just > as we do for direct reclaim. It's a little ridiculous that it can run > unbounded and reclaim every page in a zone without ever checking back > against the watermark. 
We still increase the scan window evenly when > we don't make forward progress, but we are more carefully inching zone > levels back toward the watermarks. > While on the surface I think this will appear to work, I worry that it will cause kswapds priorities to continually reset even when it's under real pressure as opposed to "failing to reclaim because of use-once". With nr_to_reclaim always at SWAP_CLUSTER_MAX, we'll hit this check and reset after each zone scan. if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX) break; It'll fail the watermark check and restart of course but it does mean we would call shrink_slab() for every SWAP_CLUSTER_MAX*nr_unbalaced_zones pages scanned which will have other consequences. It'll behave differently but not necessarily better. In general, IO causing anonymous workloads to stall has gotten a lot worse during the last few kernels without us properly realising it other than interactivity in the presence of IO has gone down the crapper again. Late last week I fixed up an mmtests that runs memcachetest as the primary
Re: [PATCH] add extra free kbytes tunable
On Fri, Feb 22, 2013 at 12:56:34PM -0500, Johannes Weiner wrote: SNIP : We have a server workload wherein machines with 100G+ of "free" memory : (used by page cache), scattered but frequent random io reads from 12+ : SSD's, and 5gbps+ of internet traffic, will frequently hit direct reclaim : in a few different ways. : : 1) It'll run into small amounts of reclaim randomly (a few hundred : thousand). : : 2) A burst of reads or traffic can cause extra pressure, which kswapd : occasionally responds to by freeing up 40g+ of the pagecache all at once : (!) while pausing the system (Argh). : : 3) A blip in an upstream provider or failover from a peer causes the : kernel to allocate massive amounts of memory for retransmission : queues/etc, potentially along with buffered IO reads and (some, but not : often a ton) of new allocations from an application. This paired with 2) : can cause the box to stall for 15+ seconds. Can we prioritise these? 2) looks just awful - kswapd shouldn't just go off and free 40G of pagecache. Do you know what's actually in that pagecache? Large number of small files or small number of (very) large files? We have a handful of huge files (6-12ish 200g+) that are mmap'ed and accessed via address. occasionally madvise (WILLNEED) applied to the address ranges before attempting to use them. There're a mix of other files but nothing significant. The mmap's are READONLY and writes are done via pwrite-ish functions. I could use some guidance on inspecting/tracing the problem. I've been trying to reproduce it in a lab, and respecting to 2)'s issue I've found: - The amount of memory freed back up is either a percentage of total memory or a percentage of free memory. (a machine with 48G of ram will "only" free up an extra 4-7g) - It's most likely to happen after a fresh boot, or if "3 > drop_caches" is applied with the application down. As it fills it seems to get itself into trouble, but becomes more stable after that. 
Unfortunately 1) and 3) still apply to a stable instance. - Protecting the DMA32 zone with something like 1 1 32 into lowmem_reserve_ratio makes the mass-reclaiming less likely to happen. - While watching sar -B 1 I'll see kswapd wake up, and scan up to a few hundred thousand pages before finding anything it actually wants to reclaim (low vmeff). I've only been able to reproduce this from a clean start. It can take up to 3 seconds before kswapd starts actually reclaiming pages. - So far as I can tell we're almost exclusively using 0 order allocations. THP is disabled. There's not much dirty memory involved. It's not flushing out writes while reclaiming, it just kills off massive amount of cached memory. Mapped file pages have to get scanned twice before they are reclaimed because we don't have enough usage information after the first scan. In your case, when you start this workload after a fresh boot or dropping the caches, there will be 48G of mapped file pages that have never been scanned before and that need to be looked at twice. Unfortunately, if kswapd does not make progress (and it won't for some time at first), it will scan more and more aggressively with increasing scan priority. And when the 48G of pages are finally cycled, kswapd's scan window is a large percentage of your machine's memory, and it will free every single page in it. I think we should think about capping kswapd zone reclaim cycles just as we do for direct reclaim. It's a little ridiculous that it can run unbounded and reclaim every page in a zone without ever checking back against the watermark. We still increase the scan window evenly when we don't make forward progress, but we are more carefully inching zone levels back toward the watermarks. While on the surface I think this will appear to work, I worry that it will cause kswapds priorities to continually reset even when it's under real pressure as opposed to failing to reclaim because of use-once. 
With nr_to_reclaim always at SWAP_CLUSTER_MAX, we'll hit this check and reset after each zone scan. if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX) break; It'll fail the watermark check and restart of course but it does mean we would call shrink_slab() for every SWAP_CLUSTER_MAX*nr_unbalaced_zones pages scanned which will have other consequences. It'll behave differently but not necessarily better. In general, IO causing anonymous workloads to stall has gotten a lot worse during the last few kernels without us properly realising it other than interactivity in the presence of IO has gone down the crapper again. Late last week I fixed up an mmtests that runs memcachetest as the primary workload while doing varying amounts of IO in the background and found this
Re: [PATCH] add extra free kbytes tunable
On Tue, Feb 26, 2013 at 10:47:31AM +, Mel Gorman wrote: On Fri, Feb 22, 2013 at 12:56:34PM -0500, Johannes Weiner wrote: SNIP : We have a server workload wherein machines with 100G+ of "free" memory : (used by page cache), scattered but frequent random io reads from 12+ : SSD's, and 5gbps+ of internet traffic, will frequently hit direct reclaim : in a few different ways. : : 1) It'll run into small amounts of reclaim randomly (a few hundred : thousand). : : 2) A burst of reads or traffic can cause extra pressure, which kswapd : occasionally responds to by freeing up 40g+ of the pagecache all at once : (!) while pausing the system (Argh). : : 3) A blip in an upstream provider or failover from a peer causes the : kernel to allocate massive amounts of memory for retransmission : queues/etc, potentially along with buffered IO reads and (some, but not : often a ton) of new allocations from an application. This paired with 2) : can cause the box to stall for 15+ seconds. Can we prioritise these? 2) looks just awful - kswapd shouldn't just go off and free 40G of pagecache. Do you know what's actually in that pagecache? Large number of small files or small number of (very) large files? We have a handful of huge files (6-12ish 200g+) that are mmap'ed and accessed via address. occasionally madvise (WILLNEED) applied to the address ranges before attempting to use them. There're a mix of other files but nothing significant. The mmap's are READONLY and writes are done via pwrite-ish functions. I could use some guidance on inspecting/tracing the problem. I've been trying to reproduce it in a lab, and respecting to 2)'s issue I've found: - The amount of memory freed back up is either a percentage of total memory or a percentage of free memory. (a machine with 48G of ram will "only" free up an extra 4-7g) - It's most likely to happen after a fresh boot, or if "3 > drop_caches" is applied with the application down. 
As it fills it seems to get itself into trouble, but becomes more stable after that. Unfortunately 1) and 3) still apply to a stable instance. - Protecting the DMA32 zone with something like 1 1 32 into lowmem_reserve_ratio makes the mass-reclaiming less likely to happen. - While watching sar -B 1 I'll see kswapd wake up, and scan up to a few hundred thousand pages before finding anything it actually wants to reclaim (low vmeff). I've only been able to reproduce this from a clean start. It can take up to 3 seconds before kswapd starts actually reclaiming pages. - So far as I can tell we're almost exclusively using 0 order allocations. THP is disabled. There's not much dirty memory involved. It's not flushing out writes while reclaiming, it just kills off massive amount of cached memory. Mapped file pages have to get scanned twice before they are reclaimed because we don't have enough usage information after the first scan. In your case, when you start this workload after a fresh boot or dropping the caches, there will be 48G of mapped file pages that have never been scanned before and that need to be looked at twice. Unfortunately, if kswapd does not make progress (and it won't for some time at first), it will scan more and more aggressively with increasing scan priority. And when the 48G of pages are finally cycled, kswapd's scan window is a large percentage of your machine's memory, and it will free every single page in it. I think we should think about capping kswapd zone reclaim cycles just as we do for direct reclaim. It's a little ridiculous that it can run unbounded and reclaim every page in a zone without ever checking back against the watermark. We still increase the scan window evenly when we don't make forward progress, but we are more carefully inching zone levels back toward the watermarks. 
While on the surface I think this will appear to work, I worry that it will cause kswapds priorities to continually reset even when it's under real pressure as opposed to failing to reclaim because of use-once. With nr_to_reclaim always at SWAP_CLUSTER_MAX, we'll hit this check and reset after each zone scan. if (sc.nr_reclaimed = SWAP_CLUSTER_MAX) break; But we hit that check now as well...? I.e. unless there is a hard to reclaim batch and kswapd is unable to make forward progress, priority levels will always get reset after we scanned all zones and reclaimed SWAP_CLUSTER_MAX or more in the process. All I'm arguing is that, if we hit a hard to reclaim batch we should continue to increase the number of pages to scan, but still bail out if we reclaimed a batch successfully. It does make sense to me to look at more pages if we encounter unreclaimable ones. It makes less sense to me,
Re: [PATCH] add extra free kbytes tunable
On Tue, Feb 26, 2013 at 10:13:15AM -0500, Johannes Weiner wrote:
SNIP
> > > I think we should think about capping kswapd zone reclaim cycles just
> > > as we do for direct reclaim. It's a little ridiculous that it can run
> > > unbounded and reclaim every page in a zone without ever checking back
> > > against the watermark. We still increase the scan window evenly when
> > > we don't make forward progress, but we are more carefully inching zone
> > > levels back toward the watermarks.
> >
> > While on the surface I think this will appear to work, I worry that it
> > will cause kswapd's priorities to continually reset even when it's under
> > real pressure as opposed to failing to reclaim because of use-once.
> > With nr_to_reclaim always at SWAP_CLUSTER_MAX, we'll hit this check and
> > reset after each zone scan.
> >
> >	if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX)
> >		break;
>
> But we hit that check now as well...?

Eventually yes.

> I.e. unless there is a hard to reclaim batch and kswapd is unable to
> make forward progress, priority levels will always get reset after we
> scanned all zones and reclaimed SWAP_CLUSTER_MAX or more in the
> process.

The reset happens after it has reclaimed a lot of pages. I agree with
you that this is likely the wrong thing to do. I'm just pointing out that
this simple patch changes behaviour in a big way.

> All I'm arguing is that, if we hit a hard to reclaim batch we should
> continue to increase the number of pages to scan, but still bail out
> if we reclaimed a batch successfully. It does make sense to me to
> look at more pages if we encounter unreclaimable ones. It makes less
> sense to me, however, to increase the reclaim goal as well in that
> case.

Bail out from the reclaim maybe but care should be taken to ensure we do
not hammer slab on each bail or reset the scanning priorities if the
watermark was not met by that batch of SWAP_CLUSTER_MAX reclaims. We also
have to think about what it means for pressure being applied equally to
each zone. We will still apply equal scanning pressure but not necessarily
reclaim pressure. Does that matter? I don't know.

> > It'll fail the watermark check and restart of course but it does mean
> > we would call shrink_slab() for every SWAP_CLUSTER_MAX*nr_unbalanced_zones
> > pages scanned which will have other consequences. It'll behave
> > differently but not necessarily better.
>
> Right, I wasn't proposing to merge the patch as is. But I do think
> it's not okay that a batch of immediately unreclaimable pages can
> cause kswapd to grow its reclaim target exponentially and we should
> probably think about capping it one way or another.

I agree with you. MMtest results I looked at over the weekend showed that
kswapd tends to be extremely spiky. Doing nothing followed by reclaiming
an excessive amount of memory and going back to doing nothing. This
partially explains it.

> shrink_slab()'s action is already based on the ratio between the
> number of scanned pages and the number of lru pages, so I don't see
> this as a fundamental issue, although it may require some tweaking.

In general, IO causing anonymous workloads to stall has gotten a lot worse
during the last few kernels without us properly realising it other than
interactivity in the presence of IO has gone down the crapper again. Late
last week I fixed up an mmtests that runs memcachetest as the primary
workload while doing varying amounts of IO in the background and found this

http://www.csn.ul.ie/~mel/postings/reclaim-20130221/global-dhp__parallelio-memcachetest-ext4/hydra/report.html

Snippet looks like this;

                                    3.0.56              3.6.10               3.7.4           3.8.0-rc4
                                  mainline            mainline            mainline            mainline
Ops memcachetest-0M     10125.00 (  0.00%)  10091.00 ( -0.34%)  11038.00 (  9.02%)  10864.00 (  7.30%)
Ops memcachetest-749M   10097.00 (  0.00%)   8546.00 (-15.36%)   8770.00 (-13.14%)   4872.00 (-51.75%)
Ops memcachetest-1623M  10161.00 (  0.00%)   3149.00 (-69.01%)   3645.00 (-64.13%)   2760.00 (-72.84%)
Ops memcachetest-2498M   8095.00 (  0.00%)   2527.00 (-68.78%)   2461.00 (-69.60%)   2282.00 (-71.81%)
Ops memcachetest-3372M   7814.00 (  0.00%)   2369.00 (-69.68%)   2396.00 (-69.34%)   2323.00 (-70.27%)
Ops memcachetest-4247M   3818.00 (  0.00%)   2366.00 (-38.03%)   2391.00 (-37.38%)   2274.00 (-40.44%)
Ops memcachetest-5121M   3852.00 (  0.00%)   2335.00 (-39.38%)   2384.00 (-38.11%)   2233.00 (-42.03%)

This is showing transactions/second -- more the better. 3.0.56 was pretty
bad in itself because a
Re: [PATCH] add extra free kbytes tunable
On Tue, Feb 19, 2013 at 09:19:27PM -0800, dormando wrote:
> >
> > The problem is that adding this tunable will constrain future VM
> > implementations. We will forever need to at least retain the
> > pseudo-file. We will also need to make some effort to retain its
> > behaviour.
> >
> > It would of course be better to fix things so you don't need to tweak
> > VM internals to get acceptable behaviour.
>
> I sympathize with this. It's presently all that keeps us afloat though.
> I'll whine about it again later if nothing else pans out.
>
> > You said:
> >
> > : We have a server workload wherein machines with 100G+ of "free" memory
> > : (used by page cache), scattered but frequent random io reads from 12+
> > : SSD's, and 5gbps+ of internet traffic, will frequently hit direct reclaim
> > : in a few different ways.
> > :
> > : 1) It'll run into small amounts of reclaim randomly (a few hundred
> > : thousand).
> > :
> > : 2) A burst of reads or traffic can cause extra pressure, which kswapd
> > : occasionally responds to by freeing up 40g+ of the pagecache all at once
> > : (!) while pausing the system (Argh).
> > :
> > : 3) A blip in an upstream provider or failover from a peer causes the
> > : kernel to allocate massive amounts of memory for retransmission
> > : queues/etc, potentially along with buffered IO reads and (some, but not
> > : often a ton) of new allocations from an application. This paired with 2)
> > : can cause the box to stall for 15+ seconds.
> >
> > Can we prioritise these? 2) looks just awful - kswapd shouldn't just
> > go off and free 40G of pagecache. Do you know what's actually in that
> > pagecache? Large number of small files or small number of (very) large
> > files?
>
> We have a handful of huge files (6-12ish 200g+) that are mmap'ed and
> accessed via address. occasionally madvise (WILLNEED) applied to the
> address ranges before attempting to use them. There're a mix of other
> files but nothing significant. The mmap's are READONLY and writes are done
> via pwrite-ish functions.
>
> I could use some guidance on inspecting/tracing the problem. I've been
> trying to reproduce it in a lab, and respecting to 2)'s issue I've found:
>
> - The amount of memory freed back up is either a percentage of total
> memory or a percentage of free memory. (a machine with 48G of ram will
> "only" free up an extra 4-7g)
>
> - It's most likely to happen after a fresh boot, or if "3 > drop_caches"
> is applied with the application down. As it fills it seems to get itself
> into trouble, but becomes more stable after that. Unfortunately 1) and 3)
> still apply to a stable instance.
>
> - Protecting the DMA32 zone with something like "1 1 32" into
> lowmem_reserve_ratio makes the mass-reclaiming less likely to happen.
>
> - While watching "sar -B 1" I'll see kswapd wake up, and scan up to a few
> hundred thousand pages before finding anything it actually wants to
> reclaim (low vmeff). I've only been able to reproduce this from a clean
> start. It can take up to 3 seconds before kswapd starts actually
> reclaiming pages.
>
> - So far as I can tell we're almost exclusively using 0 order allocations.
> THP is disabled.
>
> There's not much dirty memory involved. It's not flushing out writes while
> reclaiming, it just kills off massive amount of cached memory.

Mapped file pages have to get scanned twice before they are reclaimed
because we don't have enough usage information after the first scan.

In your case, when you start this workload after a fresh boot or
dropping the caches, there will be 48G of mapped file pages that have
never been scanned before and that need to be looked at twice.

Unfortunately, if kswapd does not make progress (and it won't for some
time at first), it will scan more and more aggressively with
increasing scan priority. And when the 48G of pages are finally
cycled, kswapd's scan window is a large percentage of your machine's
memory, and it will free every single page in it.

I think we should think about capping kswapd zone reclaim cycles just
as we do for direct reclaim. It's a little ridiculous that it can run
unbounded and reclaim every page in a zone without ever checking back
against the watermark. We still increase the scan window evenly when
we don't make forward progress, but we are more carefully inching zone
levels back toward the watermarks.

diff --git a/mm/vmscan.c b/mm/vmscan.c
index c4883eb..8a4c446 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2645,10 +2645,11 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 		.may_unmap = 1,
 		.may_swap = 1,
 		/*
-		 * kswapd doesn't want to be bailed out while reclaim. because
-		 * we want to put equal scanning pressure on each zone.
+		 * Even kswapd zone scans want to be bailed out after
+		 * reclaiming a good chunk of pages. It will just
+		 * come back if the watermarks are still not met.
 		 */
-		.nr_to_reclaim = ULONG_MAX,
+		.nr_to_reclaim =
Re: [PATCH] add extra free kbytes tunable
>
> The problem is that adding this tunable will constrain future VM
> implementations. We will forever need to at least retain the
> pseudo-file. We will also need to make some effort to retain its
> behaviour.
>
> It would of course be better to fix things so you don't need to tweak
> VM internals to get acceptable behaviour.

I sympathize with this. It's presently all that keeps us afloat though.
I'll whine about it again later if nothing else pans out.

> You said:
>
> : We have a server workload wherein machines with 100G+ of "free" memory
> : (used by page cache), scattered but frequent random io reads from 12+
> : SSD's, and 5gbps+ of internet traffic, will frequently hit direct reclaim
> : in a few different ways.
> :
> : 1) It'll run into small amounts of reclaim randomly (a few hundred
> : thousand).
> :
> : 2) A burst of reads or traffic can cause extra pressure, which kswapd
> : occasionally responds to by freeing up 40g+ of the pagecache all at once
> : (!) while pausing the system (Argh).
> :
> : 3) A blip in an upstream provider or failover from a peer causes the
> : kernel to allocate massive amounts of memory for retransmission
> : queues/etc, potentially along with buffered IO reads and (some, but not
> : often a ton) of new allocations from an application. This paired with 2)
> : can cause the box to stall for 15+ seconds.
>
> Can we prioritise these? 2) looks just awful - kswapd shouldn't just
> go off and free 40G of pagecache. Do you know what's actually in that
> pagecache? Large number of small files or small number of (very) large
> files?

We have a handful of huge files (6-12ish 200g+) that are mmap'ed and
accessed via address. Occasionally madvise (WILLNEED) is applied to the
address ranges before attempting to use them. There're a mix of other
files but nothing significant. The mmap's are READONLY and writes are done
via pwrite-ish functions.

I could use some guidance on inspecting/tracing the problem. I've been
trying to reproduce it in a lab, and with respect to 2)'s issue I've found:

- The amount of memory freed back up is either a percentage of total
memory or a percentage of free memory. (a machine with 48G of ram will
"only" free up an extra 4-7g)

- It's most likely to happen after a fresh boot, or if "3 > drop_caches"
is applied with the application down. As it fills it seems to get itself
into trouble, but becomes more stable after that. Unfortunately 1) and 3)
still apply to a stable instance.

- Protecting the DMA32 zone with something like "1 1 32" into
lowmem_reserve_ratio makes the mass-reclaiming less likely to happen.

- While watching "sar -B 1" I'll see kswapd wake up, and scan up to a few
hundred thousand pages before finding anything it actually wants to
reclaim (low vmeff). I've only been able to reproduce this from a clean
start. It can take up to 3 seconds before kswapd starts actually
reclaiming pages.

- So far as I can tell we're almost exclusively using 0 order allocations.
THP is disabled.

There's not much dirty memory involved. It's not flushing out writes while
reclaiming, it just kills off massive amounts of cached memory.

We're not running the machines particularly hard... Often less than 30%
CPU usage at peak.
Re: [PATCH] add extra free kbytes tunable
On Sun, 17 Feb 2013 15:48:31 -0800 (PST) dormando wrote:

> Add a userspace visible knob to tell the VM to keep an extra amount
> of memory free, by increasing the gap between each zone's min and
> low watermarks.

The problem is that adding this tunable will constrain future VM
implementations. We will forever need to at least retain the
pseudo-file. We will also need to make some effort to retain its
behaviour.

It would of course be better to fix things so you don't need to tweak
VM internals to get acceptable behaviour.

You said:

: We have a server workload wherein machines with 100G+ of "free" memory
: (used by page cache), scattered but frequent random io reads from 12+
: SSD's, and 5gbps+ of internet traffic, will frequently hit direct reclaim
: in a few different ways.
:
: 1) It'll run into small amounts of reclaim randomly (a few hundred
: thousand).
:
: 2) A burst of reads or traffic can cause extra pressure, which kswapd
: occasionally responds to by freeing up 40g+ of the pagecache all at once
: (!) while pausing the system (Argh).
:
: 3) A blip in an upstream provider or failover from a peer causes the
: kernel to allocate massive amounts of memory for retransmission
: queues/etc, potentially along with buffered IO reads and (some, but not
: often a ton) of new allocations from an application. This paired with 2)
: can cause the box to stall for 15+ seconds.

Can we prioritise these? 2) looks just awful - kswapd shouldn't just
go off and free 40G of pagecache. Do you know what's actually in that
pagecache? Large number of small files or small number of (very) large
files?
[PATCH] add extra free kbytes tunable
From: Rik van Riel

Add a userspace visible knob to tell the VM to keep an extra amount
of memory free, by increasing the gap between each zone's min and
low watermarks.

This is useful for realtime applications that call system calls
and have a bound on the number of allocations that happen in any
short time period. In this application, extra_free_kbytes would be
left at an amount equal to or larger than the maximum number of
allocations that happen in any burst.

It may also be useful to reduce the memory use of virtual
machines (temporarily?), in a way that does not cause memory
fragmentation like ballooning does.
---
 Documentation/sysctl/vm.txt | 16
 include/linux/mmzone.h      |  2 +-
 include/linux/swap.h        |  2 ++
 kernel/sysctl.c             | 11 +-
 mm/page_alloc.c             | 39 +-
 5 files changed, 57 insertions(+), 13 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 078701f..5d12bbd 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -28,6 +28,7 @@ Currently, these files are in /proc/sys/vm:
 - dirty_writeback_centisecs
 - drop_caches
 - extfrag_threshold
+- extra_free_kbytes
 - hugepages_treat_as_movable
 - hugetlb_shm_group
 - laptop_mode
@@ -167,6 +168,21 @@ fragmentation index is <= extfrag_threshold. The default value is 500.
 
 ==
 
+extra_free_kbytes
+
+This parameter tells the VM to keep extra free memory between the threshold
+where background reclaim (kswapd) kicks in, and the threshold where direct
+reclaim (by allocating processes) kicks in.
+
+This is useful for workloads that require low latency memory allocations
+and have a bounded burstiness in memory allocations, for example a
+realtime application that receives and transmits network traffic
+(causing in-kernel memory allocations) with a maximum total message burst
+size of 200MB may need 200MB of extra free memory to avoid direct reclaim
+related latencies.
+
+==
+
 hugepages_treat_as_movable
 
 This parameter is only useful when kernelcore= is specified at boot time to
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 73b64a3..7f8f883 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -881,7 +881,7 @@ static inline int is_dma(struct zone *zone)
 
 /* These two functions are used to setup the per zone pages min values */
 struct ctl_table;
-int min_free_kbytes_sysctl_handler(struct ctl_table *, int,
+int free_kbytes_sysctl_handler(struct ctl_table *, int,
 					void __user *, size_t *, loff_t *);
 extern int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1];
 int lowmem_reserve_ratio_sysctl_handler(struct ctl_table *, int,
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 68df9c1..66a12c4 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -215,6 +215,8 @@ struct swap_list_t {
 /* linux/mm/page_alloc.c */
 extern unsigned long totalram_pages;
 extern unsigned long totalreserve_pages;
+extern int min_free_kbytes;
+extern int extra_free_kbytes;
 extern unsigned long dirty_balance_reserve;
 extern unsigned int nr_free_buffer_pages(void);
 extern unsigned int nr_free_pagecache_pages(void);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index c88878d..102e9a1 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -104,7 +104,6 @@ extern char core_pattern[];
 extern unsigned int core_pipe_limit;
 #endif
 extern int pid_max;
-extern int min_free_kbytes;
 extern int pid_max_min, pid_max_max;
 extern int sysctl_drop_caches;
 extern int percpu_pagelist_fraction;
@@ -1246,10 +1245,18 @@ static struct ctl_table vm_table[] = {
 		.data		= &min_free_kbytes,
 		.maxlen		= sizeof(min_free_kbytes),
 		.mode		= 0644,
-		.proc_handler	= min_free_kbytes_sysctl_handler,
+		.proc_handler	= free_kbytes_sysctl_handler,
 		.extra1		= &zero,
 	},
 	{
+		.procname	= "extra_free_kbytes",
+		.data		= &extra_free_kbytes,
+		.maxlen		= sizeof(extra_free_kbytes),
+		.mode		= 0644,
+		.proc_handler	= free_kbytes_sysctl_handler,
+		.extra1		= &zero,
+	},
+	{
 		.procname	= "percpu_pagelist_fraction",
 		.data		= &percpu_pagelist_fraction,
 		.maxlen		= sizeof(percpu_pagelist_fraction),
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9673d96..5380d84 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -194,8 +194,21 @@ static char * const zone_names[MAX_NR_ZONES] = {
 	 "Movable",
 };
 
+/*
+ * Try to keep at least this much lowmem free. Do not allow normal
+ * allocations below this point, only high priority ones. Automatically