Re: [PATCH 2/3] mm: Ensure that mark_page_accessed moves pages to the active list
On 05/01/2013 04:06 PM, Mel Gorman wrote:
> On Wed, May 01, 2013 at 01:41:34PM +0800, Sam Ben wrote:
>> Hi Mel,
>> On 04/30/2013 12:31 AM, Mel Gorman wrote:
>>> If a page is on a pagevec then it is !PageLRU and mark_page_accessed()
>>> may fail to move a page to the active list as expected. Now that the
>>> LRU is selected at LRU drain time, mark pages PageActive if they are
>>> on a pagevec so it gets moved to the correct list at LRU drain time.
>>> Using a debugging patch it was found that for a simple git checkout
>>> based workload that pages were never added to the active file list in
>>
>> Could you show us the details of your workload?
>
> The workload is git checkouts of a fixed number of commits for the

Is there a script which you used?

> kernel git tree. It starts with a warm-up run that is not timed and
> then records the time for a number of iterations.

How do you record the time for a number of iterations? Does "iteration"
here mean an LRU scan?

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv6 0/8] zswap: compressed swap caching
Hi Seth,

On 02/22/2013 02:25 AM, Seth Jennings wrote:
> On 02/21/2013 09:50 AM, Dan Magenheimer wrote:
>>> From: Seth Jennings [mailto:sjenn...@linux.vnet.ibm.com]
>>> Subject: [PATCHv6 0/8] zswap: compressed swap caching
>>>
>>> Changelog:
>>> v6:
>>> * fix improper freeing of rbtree (Cody)
>>
>> Cody's bug fix reminded me of a rather fundamental question: Why does
>> zswap use an rbtree instead of a radix tree? Intuitively, I'd expect
>> that pgoff_t values would have a relatively high level of locality AND
>> that at any one time the set of stored pgoff_t values would be
>> relatively non-sparse. This would argue that a radix tree would result
>> in fewer nodes touched on average for lookup/insert/remove.
>
> I considered using a radix tree, but I don't think there is a
> compelling reason to choose a radix tree over a red-black tree in this
> case (explanation below).
>
> From a runtime standpoint, a radix tree might be faster. The swap
> offsets will be largely in linearly bunched groups over the indexed
> range. However, there are also memory constraints to consider in this
> particular situation. Using a radix tree could result in intermediate
> radix_tree_node allocations in the store (insert) path in addition to
> the zswap_entry allocation. Since we are under memory pressure, using
> the red-black

Then in which case is a radix tree preferable, and in which case is a
red-black tree better?

> tree, whose metadata is included in the struct zswap_entry, reduces
> the number of opportunities to fail.
>
> On my system, the radix_tree_node structure is 568 bytes. The
> radix_tree_node cache requires 4 pages per slab, an order-2 page
> allocation. Growing that cache will be difficult under the pressure.
> In my mind, the cost of even a single node allocation failure
> resulting in an additional page swapped to disk will more than wipe
> out any possible performance advantage a radix tree might have.
>
> Thanks,
> Seth

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
majord...@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: em...@kvack.org
Re: [PATCH] mm: swap: Mark swap pages writeback before queueing for direct IO
Hi Mel,

On 04/25/2013 02:57 AM, Mel Gorman wrote:
> As pointed out by Andrew Morton, the swap-over-NFS writeback is not
> setting PageWriteback before it is queued for direct IO. While swap
> pages do not

Before commit 62c230bc1 ("mm: add support for a filesystem to activate
swap files and use direct_IO for writing swap pages"), were swap pages
written to the page cache first and then written back?

> participate in BDI or process dirty accounting and the IO is
> synchronous, the writeback bit is still required and not setting it in
> this case was an oversight. swapoff depends on the page writeback to
> synchronise all pending writes on a swap page before it is reused.
> Swapcache freeing and reuse depend on checking PageWriteback under
> lock to ensure the page is safe to reuse.
>
> Direct IO handlers and the direct IO handler for NFS do not deal with
> PageWriteback as they are synchronous writes. In the case of NFS, it
> schedules pages (or a page in the case of swap) for IO and then waits
> synchronously for IO to complete in nfs_direct_write(). It is
> recognised that this is a slowdown from normal swap handling which is
> asynchronous and uses a completion handler. Shoving PageWriteback
> handling down into direct IO handlers looks like a bad fit to handle
> the swap case although it may have to be dealt with some day if swap
> is converted to use direct IO in general and bmap is finally done away
> with. At that point it will be necessary to refit asynchronous direct
> IO with completion handlers onto the swap subsystem.
>
> As swapcache currently depends on PageWriteback to protect against
> races, this patch sets PageWriteback under the page lock before
> queueing it for direct IO. It is cleared when the direct IO handler
> returns. IO errors are treated similarly to the direct-to-bio case
> except PageError is not set as in the case of swap-over-NFS, it is
> likely to be a transient error.
>
> It was asked what prevents such a page being reclaimed in parallel.
> With this patch applied, such a page will now be skipped (most of the
> time) or blocked until the writeback completes. Reclaim checks
> PageWriteback under the page lock before calling try_to_free_swap and
> the page lock should prevent the page being requeued for IO before it
> is freed.
>
> This and Jerome's related patch should be considered for -stable as
> far back as 3.6 when swap-over-NFS was introduced.
>
> Signed-off-by: Mel Gorman <mgor...@suse.de>
> ---
>  mm/page_io.c | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
>
> diff --git a/mm/page_io.c b/mm/page_io.c
> index 04ca00d..ec04247 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -214,6 +214,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
>  		kiocb.ki_left = PAGE_SIZE;
>  		kiocb.ki_nbytes = PAGE_SIZE;
>
> +		set_page_writeback(page);
>  		unlock_page(page);
>  		ret = mapping->a_ops->direct_IO(KERNEL_WRITE, &kiocb, &iov,
> @@ -223,8 +224,24 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
>  			count_vm_event(PSWPOUT);
>  			ret = 0;
>  		} else {
> +			/*
> +			 * In the case of swap-over-nfs, this can be a
> +			 * temporary failure if the system has limited
> +			 * memory for allocating transmit buffers.
> +			 * Mark the page dirty and avoid
> +			 * rotate_reclaimable_page but rate-limit the
> +			 * messages but do not flag PageError like
> +			 * the normal direct-to-bio case as it could
> +			 * be temporary.
> +			 */
>  			set_page_dirty(page);
> +			ClearPageReclaim(page);
> +			if (printk_ratelimit()) {
> +				pr_err("Write-error on dio swapfile (%Lu)\n",
> +					(unsigned long long)page_file_offset(page));
> +			}
>  		}
> +		end_page_writeback(page);
>  		return ret;
>  	}
Re: [PATCH 2/2] Make batch size for memory accounting configured according to size of memory
Hi Tim,

On 04/30/2013 01:12 AM, Tim Chen wrote:
> Currently the per cpu counter's batch size for memory accounting is
> configured as twice the number of cpus in the system. However, for
> systems with very large memory, it is more appropriate to make it
> proportional to the memory size per cpu in the system.
>
> For example, for a x86_64 system with 64 cpus and 128 GB of memory,
> the batch size is only 2*64 pages (0.5 MB). So any memory accounting
> changes of more than 0.5MB will overflow the per cpu counter into the
> global counter. Instead, for the new scheme, the batch size is
> configured to be 0.4% of the memory/cpu = 8MB (128 GB/64 /256),

Will a large batch size make the global counter more inaccurate?

> which is more in line with the memory size.
>
> Signed-off-by: Tim Chen <tim.c.c...@linux.intel.com>
> ---
>  mm/mmap.c  | 13 ++++++++++++-
>  mm/nommu.c | 13 ++++++++++++-
>  2 files changed, 24 insertions(+), 2 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 0db0de1..082836e 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -89,6 +89,7 @@ int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;
>   * other variables. It can be updated by several CPUs frequently.
>   */
>  struct percpu_counter vm_committed_as ____cacheline_aligned_in_smp;
> +int vm_committed_batchsz ____cacheline_aligned_in_smp;
>
>  /*
>   * The global memory commitment made in the system can be a metric
> @@ -3090,10 +3091,20 @@ void mm_drop_all_locks(struct mm_struct *mm)
>  /*
>   * initialise the VMA slab
>   */
> +static inline int mm_compute_batch(void)
> +{
> +	int nr = num_present_cpus();
> +
> +	/* batch size set to 0.4% of (total memory/#cpus) */
> +	return (int) (totalram_pages/nr) / 256;
> +}
> +
>  void __init mmap_init(void)
>  {
>  	int ret;
>
> -	ret = percpu_counter_init(&vm_committed_as, 0);
> +	vm_committed_batchsz = mm_compute_batch();
> +	ret = percpu_counter_and_batch_init(&vm_committed_as, 0,
> +					    &vm_committed_batchsz);
>  	VM_BUG_ON(ret);
>  }
>
> diff --git a/mm/nommu.c b/mm/nommu.c
> index 2f3ea74..a87a99c 100644
> --- a/mm/nommu.c
> +++ b/mm/nommu.c
> @@ -59,6 +59,7 @@ unsigned long max_mapnr;
>  unsigned long num_physpages;
>  unsigned long highest_memmap_pfn;
>  struct percpu_counter vm_committed_as;
> +int vm_committed_batchsz;
>  int sysctl_overcommit_memory = OVERCOMMIT_GUESS;  /* heuristic overcommit */
>  int sysctl_overcommit_ratio = 50; /* default is 50% */
>  int sysctl_max_map_count = DEFAULT_MAX_MAP_COUNT;
> @@ -526,11 +527,21 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
>  /*
>   * initialise the VMA and region record slabs
>   */
> +static inline int mm_compute_batch(void)
> +{
> +	int nr = num_present_cpus();
> +
> +	/* batch size set to 0.4% of (total memory/#cpus) */
> +	return (int) (totalram_pages/nr) / 256;
> +}
> +
>  void __init mmap_init(void)
>  {
>  	int ret;
>
> -	ret = percpu_counter_init(&vm_committed_as, 0);
> +	vm_committed_batchsz = mm_compute_batch();
> +	ret = percpu_counter_and_batch_init(&vm_committed_as, 0,
> +					    &vm_committed_batchsz);
>  	VM_BUG_ON(ret);
>  	vm_region_jar = KMEM_CACHE(vm_region, SLAB_PANIC);
>  }
Re: [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority
Ping Rik, I also want to know the answer. ;-)

On 04/11/2013 01:58 PM, Will Huck wrote:
> Hi Rik,
>
> On 03/22/2013 11:52 AM, Rik van Riel wrote:
>>> On 03/21/2013 08:05 PM, Will Huck wrote:
>>> One offline question, how to understand this in function
>>> balance_pgdat:
>>> /*
>>>  * Do some background aging of the anon list, to give
>>>  * pages a chance to be referenced before reclaiming.
>>>  */
>>> age_active_anon(zone, &sc);
>>
>> The anon lrus use a two-handed clock algorithm. New anonymous pages
>> start off on the active anon list. Older anonymous pages get moved
>> to the inactive anon list.

The downside of the page cache use-once replacement algorithm is
inter-reference distance, correct? Does it have any other downside?
What's the downside of the two-handed clock algorithm against anonymous
pages?

>> If they get referenced before they reach the end of the inactive
>> anon list, they get moved back to the active list.
>>
>> If we need to swap something out and find a non-referenced page at
>> the end of the inactive anon list, we will swap it out.
>>
>> In order to make good pageout decisions, pages need to stay on the
>> inactive anon list for a longer time, so they have plenty of time to
>> get referenced, before the reclaim code looks at them.
>>
>> To achieve that, we will move some active anon pages to the inactive
>> anon list even when we do not want to swap anything out - as long as
>> the inactive anon list is below its target size.
>>
>> Does that make sense?
Re: [RFC Patch 0/2] mm: Add parameters to make kernel behavior at memory error on dirty cache selectable
Hi Mitsuhiro,

On 04/11/2013 08:51 PM, Mitsuhiro Tanino wrote:
> (2013/04/11 12:53), Simon Jeons wrote:
>> One question about mce instead of the patchset. ;-) When is memory
>> checked for errors? Before memory access? Is there a process that
>> scans it periodically?
>
> Hi Simon-san,
> Yes, there is a process to scan memory periodically.
>
> At Intel Nehalem-EX and CPUs after the Nehalem-EX generation, MCA
> recovery is supported. MCA recovery provides error detection and
> isolation features to work together with the OS. One of the MCA
> recovery features is Memory Scrubbing. It periodically checks memory
> in the background of the OS.

Is Memory Scrubbing a kernel thread? Where is the code for memory
scrubbing?

> If Memory Scrubbing finds an uncorrectable error on a memory bit
> before the OS accesses it, MCA recovery notifies an SRAO error to the
> OS

Maybe it can't find a memory error in time, since it is sleeping when
the memory error occurs; can this case happen?

> and the OS handles the SRAO error using the hwpoison function.
>
> Regards,
> Mitsuhiro Tanino
Re: [PATCH 08/10] mm: vmscan: Have kswapd shrink slab only once per priority
Hi Mel,

On 04/11/2013 06:01 PM, Mel Gorman wrote:
> On Wed, Apr 10, 2013 at 02:21:42PM +0900, Joonsoo Kim wrote:
>>>>> @@ -2673,9 +2674,15 @@ static bool kswapd_shrink_zone(struct zone *zone,
>>>>>  	sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));
>>>>>  	shrink_zone(zone, sc);
>>>>>
>>>>> -	reclaim_state->reclaimed_slab = 0;
>>>>> -	nr_slab = shrink_slab(&shrink, sc->nr_scanned, lru_pages);
>>>>> -	sc->nr_reclaimed += reclaim_state->reclaimed_slab;
>>>>> +	/*
>>>>> +	 * Slabs are shrunk for each zone once per priority or if the zone
>>>>> +	 * being balanced is otherwise unreclaimable
>>>>> +	 */
>>>>> +	if (shrinking_slab || !zone_reclaimable(zone)) {
>>>>> +		reclaim_state->reclaimed_slab = 0;
>>>>> +		nr_slab = shrink_slab(&shrink, sc->nr_scanned, lru_pages);
>>>>> +		sc->nr_reclaimed += reclaim_state->reclaimed_slab;
>>>>> +	}
>>>>>
>>>>>  	if (nr_slab == 0 && !zone_reclaimable(zone))
>>>>>  		zone->all_unreclaimable = 1;
>>>>
>>>> Why is shrink_slab() called here?
>>>
>>> Preserves existing behaviour.
>>
>> Yes, but with this patch the existing behaviour is changed; that is,
>> we call shrink_slab() once per priority. For now, there is no reason
>> this function is called here. How about separating it out and
>> executing it outside of the zone loop?
>
> We are calling it fewer times but it's still receiving the same
> information from sc->nr_scanned that it received before. With the
> change you are suggesting it would be necessary to accumulate
> sc->nr_scanned for each zone shrunk and then pass the sum to
> shrink_slab() once per priority. While this is not necessarily wrong,
> there is little or no motivation to alter the shrinkers in this manner
> in this series.

Why is the result not the same?
Re: [PATCH] mm: page_alloc: Avoid marking zones full prematurely after zone_reclaim()
Hi Michal,

On 04/09/2013 06:14 PM, Michal Hocko wrote: On Tue 09-04-13 18:05:30, Simon Jeons wrote: [...] I try this in v3.9-rc5:

    dd if=/dev/sda of=/dev/null bs=1MB
    14813+0 records in
    14812+0 records out
    14812000000 bytes (15 GB) copied, 105.988 s, 140 MB/s

    free -m -s 1
                 total       used       free     shared    buffers     cached
    Mem:          7912       1181       6731          0        663        239
    -/+ buffers/cache:        277       7634
    Swap:         8011          0       8011

It seems that almost 15GB was copied before I stopped dd, but the used memory that I monitored during dd stayed around 1200MB. Weird, why? Sorry to waste your time, but the test result is weird, isn't it?

I am not sure which values you have been watching, but you have to realize that you are reading a _partition_, not a file, and those pages go into buffers rather than the page cache.

Interesting. ;-) What's the difference between buffers and page cache? Why don't buffers grow?
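The distinction Michal points at is visible directly in /proc/meminfo: reads from a raw block device are accounted under "Buffers" (the block device's own page cache), while reads through a filesystem land under "Cached". A quick illustration (Linux only; this is an editorial aside, not part of the thread):

```shell
# Print both accounting buckets (values in kB). Rerunning this after
# "dd if=/dev/sda of=/dev/null" would show Buffers growing, while
# reading a regular file grows Cached instead.
awk '/^Buffers:|^Cached:/ { print $1, $2, $3 }' /proc/meminfo
```

The `free` output in the thread is the same data: its "buffers" and "cached" columns come from these two /proc/meminfo fields.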
Re: zsmalloc defrag (Was: [PATCH] mm: remove compressed copy from zram in-memory)
Hi Minchan,

On 04/10/2013 08:50 AM, Minchan Kim wrote: On Tue, Apr 09, 2013 at 01:25:45PM -0700, Dan Magenheimer wrote: From: Minchan Kim [mailto:minc...@kernel.org] Subject: Re: zsmalloc defrag (Was: [PATCH] mm: remove compressed copy from zram in-memory)

Hi Dan, On Mon, Apr 08, 2013 at 09:32:38AM -0700, Dan Magenheimer wrote: From: Minchan Kim [mailto:minc...@kernel.org] Sent: Monday, April 08, 2013 12:01 AM Subject: [PATCH] mm: remove compressed copy from zram in-memory (patch removed)

The fragmentation ratio is almost the same, but memory consumption and compile time are better. I am working on adding a defragment function to zsmalloc.

Hi Minchan -- I would be very interested in your design thoughts on how you plan to add defragmentation for zsmalloc. In

What I can say now is just one word: "Compaction". As you know, zsmalloc has a transparent handle, so we can do whatever we want underneath the user. Of course, there is a tradeoff between performance and memory efficiency. I'm biased toward the latter for the embedded use case.

Have you designed or implemented this yet? I have a couple of concerns:

Not implemented yet; I have only had time to think about it briefly. Surely there are some obstacles, so I want to show the code and numbers after I make a prototype and test the performance. Of course, if it has a severe problem, I will drop it without wasting everyone's time.

1) The handle is transparent to the "user", but it is still a form of a "pointer" to a zpage. Are you planning on walking zram's tables and changing those pointers? That may be OK for zram, but for more complex data structures than tables (as in zswap and zcache) it may not be as easy, due to races, or as efficient, because you will have to walk potentially very large trees.

The rough concept is as follows: I'm considering having zsmalloc return a transparent fake handle, while maintaining the mapping to the real one internally. It could be done inside zsmalloc, so there isn't any race we need to consider.
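Minchan's "fake handle" idea is essentially a layer of indirection: the caller keeps an opaque handle, an internal table maps it to the real location, and compaction moves data and rewrites only the table. A toy sketch of that mechanism (invented names, not zsmalloc code):

```python
class HandleTable:
    """Opaque handles mapped to real slots. compact() slides live data
    into the lowest slots and fixes up only this internal table, so a
    caller's handle stays valid across compaction."""

    def __init__(self):
        self.next_handle = 1
        self.table = {}      # fake handle -> real slot index
        self.slots = {}      # real slot index -> payload (holes are absent)
        self.next_slot = 0

    def store(self, payload):
        handle = self.next_handle
        self.next_handle += 1
        self.table[handle] = self.next_slot
        self.slots[self.next_slot] = payload
        self.next_slot += 1
        return handle        # the caller only ever sees this value

    def load(self, handle):
        return self.slots[self.table[handle]]

    def free(self, handle):
        slot = self.table.pop(handle)
        del self.slots[slot]             # leaves a hole behind

    def compact(self):
        # Move live payloads to the lowest slots; callers are unaware.
        live = sorted(self.table.items(), key=lambda kv: kv[1])
        data = [self.slots[slot] for _handle, slot in live]
        self.slots = {}
        for new_slot, ((handle, _old), payload) in enumerate(zip(live, data)):
            self.table[handle] = new_slot
            self.slots[new_slot] = payload
        self.next_slot = len(live)
```

After freeing a middle allocation and compacting, the surviving handles still resolve, which is exactly why Dan's concern about walking zram's tables disappears: no user-visible pointer changes.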
2) Compaction in the kernel is heavily dependent on page migration, and page migration is dependent on using flags in the struct page. There's a lot of code in those two modules, and there are going to be a lot of implementation differences between compacting pages and compacting zpages.

The kernel's compaction is unrelated to zsmalloc's.

I'm also wondering if you will be implementing "variable length zspages". Without that, I'm not sure compaction will help enough. (And that is a good example of the difference between

Why do you think so? Variable-length zspages could be a further improvement, but they are not the only way to solve fragmentation.

the kernel page compaction design/code and zspage compaction.) particular, I am wondering if your design will also handle the requirements for zcache (especially for cleancache pages) and perhaps also for ramster.

I don't know the requirements for cleancache pages, but compaction is general, as you know well, so I expect you can get a benefit from it if you are concerned about memory efficiency. I'm not sure it's worthwhile to compact cleancache pages just to get more slots in RAM; sometimes just discarding them would be much better, IMHO.

Zcache has page reclaim. Zswap has zpage reclaim. I am concerned that these continue to work in the presence of compaction. With no reclaim at all, zram is a simpler use case, but if you implement compaction in a way that can't be used by either zcache or zswap, then zsmalloc is essentially forking.

Don't go too far. If it's really a problem for zswap and zcache, maybe we could add it optionally.

In https://lkml.org/lkml/2013/3/27/501 I suggested it would be good to work together on a common design, but you didn't reply. Are you thinking that zsmalloc

I saw the thread, but does explicit agreement really matter? I believe everybody wants it, even though they didn't reply. :) You can make the design and post it, or prototype it and post it. If there is a conflict with something in my brain, I will be happy to give feedback.
:) Anyway, I think my statement "COMPACTION" above should be enough to express my current thinking, avoid duplicated work, and let you catch up. I will get around to it after LSF/MM.

improvements should focus only on zram, in which case

Just focusing on zsmalloc.

Right. Again, I am asking if you are changing zsmalloc in a way that helps zram but hurts zswap and makes it impossible for zcache to ever use the improvements to zsmalloc.

As I said, I'm biased toward memory efficiency rather than performance. Of course, a severe performance drop would be a disaster, but a small drop will be acceptable for systems that care about memory efficiency.

If so, that's fine, but please make it clear that is your goal.

Simple: help memory-hungry systems. :)

Which kinds of systems are memory hungry?

we may -- and possibly should -- end up with a different allocator for frontswap-based/cleancache-based compression in zcache (and possibly zswap)? I'm just trying to determine if I
Re: zsmalloc defrag (Was: [PATCH] mm: remove compressed copy from zram in-memory)
Hi Dan,

On 04/10/2013 04:25 AM, Dan Magenheimer wrote: From: Minchan Kim [mailto:minc...@kernel.org] Subject: Re: zsmalloc defrag (Was: [PATCH] mm: remove compressed copy from zram in-memory)

Hi Dan, On Mon, Apr 08, 2013 at 09:32:38AM -0700, Dan Magenheimer wrote: From: Minchan Kim [mailto:minc...@kernel.org] Sent: Monday, April 08, 2013 12:01 AM Subject: [PATCH] mm: remove compressed copy from zram in-memory (patch removed)

The fragmentation ratio is almost the same, but memory consumption and compile time are better. I am working on adding a defragment function to zsmalloc.

Hi Minchan -- I would be very interested in your design thoughts on how you plan to add defragmentation for zsmalloc. In

What I can say now is just one word: "Compaction". As you know, zsmalloc has a transparent handle, so we can do whatever we want underneath the user. Of course, there is a tradeoff between performance and memory efficiency. I'm biased toward the latter for the embedded use case.

Have you designed or implemented this yet? I have a couple of concerns:

1) The handle is transparent to the "user", but it is still a form of a "pointer" to a zpage. Are you planning on walking zram's tables and changing those pointers? That may be OK for zram, but for more complex data structures than tables (as in zswap and zcache) it may not be as easy, due to races, or as efficient, because you will have to walk potentially very large trees.

2) Compaction in the kernel is heavily dependent on page migration, and page migration is dependent on using flags in the struct page.

Which flag?

There's a lot of code in those two modules, and there are going to be a lot of implementation differences between compacting pages and compacting zpages. I'm also wondering if you will be implementing "variable length zspages". Without that, I'm not sure compaction will help enough. (And that is a good example of the difference between the kernel page compaction design/code and zspage compaction.)
particular, I am wondering if your design will also handle the requirements for zcache (especially for cleancache pages) and perhaps also for ramster.

I don't know the requirements for cleancache pages, but compaction is general, as you know well, so I expect you can get a benefit from it if you are concerned about memory efficiency. I'm not sure it's worthwhile to compact cleancache pages just to get more slots in RAM; sometimes just discarding them would be much better, IMHO.

Zcache has page reclaim. Zswap has zpage reclaim. I am concerned that these continue to work in the presence of compaction. With no reclaim at all, zram is a simpler use case, but if you implement compaction in a way that can't be used by either zcache or zswap, then zsmalloc is essentially forking.

I fail to understand "then zsmalloc is essentially forking"; could you explain more?

In https://lkml.org/lkml/2013/3/27/501 I suggested it would be good to work together on a common design, but you didn't reply. Are you thinking that zsmalloc

I saw the thread, but does explicit agreement really matter? I believe everybody wants it, even though they didn't reply. :) You can make the design and post it, or prototype it and post it. If there is a conflict with something in my brain, I will be happy to give feedback. :)

Anyway, I think my statement "COMPACTION" above should be enough to express my current thinking, avoid duplicated work, and let you catch up. I will get around to it after LSF/MM.

improvements should focus only on zram, in which case

Just focusing on zsmalloc.

Right. Again, I am asking if you are changing zsmalloc in a way that helps zram but hurts zswap and makes it impossible for zcache to ever use the improvements to zsmalloc. If so, that's fine, but please make it clear that is your goal.

we may -- and possibly should -- end up with a different allocator for frontswap-based/cleancache-based compression in zcache (and possibly zswap)?
I'm just trying to determine if I should proceed separately with my design (with Bob Liu, who expressed interest) or if it would be beneficial to work together.

I'll just post it, and if it affects zsmalloc/zram/zswap and goes a way I don't want, I will join the discussion, because our product uses zram heavily and we are considering zswap, too. I really appreciate your enthusiastic collaboration model for finding the optimal solution!

My goal is to have compression be an integral part of Linux memory management. It may be tied to a config option, but the goal is that distros turn it on by default. I don't think zsmalloc meets that objective yet, but it may be fine for your needs. If so, it would be good to understand exactly why it doesn't meet the other zproject needs.
Re: [PATCH] mm: remove compressed copy from zram in-memory
Hi Minchan,

On 04/09/2013 09:02 AM, Minchan Kim wrote: Hi Andrew, On Mon, Apr 08, 2013 at 02:17:10PM -0700, Andrew Morton wrote: On Mon, 8 Apr 2013 15:01:02 +0900 Minchan Kim wrote:

The swap subsystem does lazy swap slot freeing, expecting the page to be swapped out again so we can avoid an unnecessary write.

Is that correct? How can it save a write?

Correct. add_to_swap makes the page dirty, and we pageout only if the page is dirty. If an anon page is already in the swapcache, we skip writing it out in shrink_page_list and just remove it from the swapcache and free it via __remove_mapping.

I have received the same question multiple times, so it would be a good idea to write this down somewhere in vmscan.c.

But the problem with in-memory swap (e.g. zram) is that it consumes memory space until the vm_swap_full() condition is met (i.e. half of the whole swap device is used). That can be bad if we use multiple swap devices (a small in-memory swap plus a big storage swap) or in-memory swap alone. This patch makes the swap subsystem free the swap slot as soon as the swap-read completes, and marks the swapcache page dirty so the page will be written out to the swap device at reclaim. It means we never lose it.

From my reading of the patch, that isn't how it works? It changed end_swap_bio_read() to call zram_slot_free_notify(), which appears to free the underlying compressed page. I have a feeling I'm hopelessly confused.

You understand it totally right. Talking about "selecting the swap slot" in my description was a total miss; I need to rewrite the description.

Freeing the swap slot and freeing the compressed page are the same thing, aren't they?

    --- a/mm/page_io.c
    +++ b/mm/page_io.c
    @@ -20,6 +20,7 @@
     #include <linux/buffer_head.h>
     #include <linux/writeback.h>
     #include <linux/frontswap.h>
    +#include <linux/blkdev.h>
     #include <asm/pgtable.h>

     static struct bio *get_swap_bio(gfp_t gfp_flags,
    @@ -81,8 +82,30 @@ void end_swap_bio_read(struct bio *bio, int err)
                            iminor(bio->bi_bdev->bd_inode),
                            (unsigned long long)bio->bi_sector);
            } else {
    +               /*
    +                * There is no reason to keep both uncompressed data and
    +                * compressed data in memory.
    +                */
    +               struct swap_info_struct *sis;
    +
                    SetPageUptodate(page);
    +               sis = page_swap_info(page);
    +               if (sis->flags & SWP_BLKDEV) {
    +                       struct gendisk *disk = sis->bdev->bd_disk;
    +                       if (disk->fops->swap_slot_free_notify) {
    +                               swp_entry_t entry;
    +                               unsigned long offset;
    +
    +                               entry.val = page_private(page);
    +                               offset = swp_offset(entry);
    +
    +                               SetPageDirty(page);
    +                               disk->fops->swap_slot_free_notify(sis->bdev,
    +                                               offset);
    +                       }
    +               }
            }
    +
            unlock_page(page);
            bio_put(bio);

The new code is wasted space if CONFIG_BLOCK=n, yes?

CONFIG_SWAP already depends on CONFIG_BLOCK.

Also, what's up with the SWP_BLKDEV test? zram doesn't support SWP_FILE? Why on earth not? Putting swap_slot_free_notify() into block_device_operations seems rather wrong. It precludes zram-over-swapfiles for all time and means that other subsystems cannot get notifications of swap slot freeing for swapfile-backed swap.

Zram is just a pseudo block device, so anyone can format it with any filesystem and swapon a file on it. In that case, he can't get the benefit of swap_slot_free_notify. But I don't think that's a severe problem, because there is no reason to use file-backed swap on zram. If anyone wants to use it, I'd like to know the reason. If it's reasonable, we would have to rethink the wheel, and that's another story, IMHO.
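The behavioral change under discussion can be modeled abstractly: in the lazy scheme the compressed copy stays in the swap device and the swapped-in page is clean, while with the patch the slot is freed at swap-in and the page must be marked dirty so a later reclaim writes it out again. A toy model of that accounting (invented names, not kernel code):

```python
def swap_in(page, swap_device, eager_free=True):
    """Toy model of end_swap_bio_read() with the patch applied.

    Lazy scheme (eager_free=False): the compressed copy is kept, the
    page comes back clean, and reclaim can drop it without a write.
    Eager scheme (the patch): the slot is freed immediately, as by
    swap_slot_free_notify(), so the page must be marked dirty or its
    contents would be lost on reclaim.
    """
    page['uptodate'] = True
    if eager_free and swap_device['slots'].get(page['offset']) is not None:
        # model of zram_slot_free_notify(): drop the compressed copy now
        del swap_device['slots'][page['offset']]
        page['dirty'] = True        # force a writeout on the next reclaim
    return page
```

The tradeoff Andrew's "how can it save a write?" question probes is visible here: lazy freeing saves a write when the page is reclaimed unmodified, while eager freeing stops zram from holding both copies at once.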
Re: [PATCH part2 v6 0/3] staging: zcache: Support zero-filled pages more efficiently
cc Bob

On 04/07/2013 05:03 PM, Wanpeng Li wrote: On Wed, Apr 03, 2013 at 06:16:20PM +0800, Wanpeng Li wrote:

Changelog:
v5 -> v6:
 * shove variables into debug.c and keep just an extern in debug.h, spotted by Konrad
 * update patch description, spotted by Konrad
v4 -> v5:
 * fix compile error, reported by Fengguang, Geert
 * add check for !is_ephemeral(pool), spotted by Bob
v3 -> v4:
 * handle duplication in page_is_zero_filled, spotted by Bob
 * fix zcache writeback in debugfs
 * fix pers_pageframes|_max not being exported in debugfs
 * fix static variables defined in debug.h but used in multiple C files
 * rebase on Greg's staging-next
v2 -> v3:
 * increment/decrement zcache_[eph|pers]_zpages for zero-filled pages, spotted by Dan
 * replace "zero" or "zero page" with "zero_filled_page", spotted by Dan
v1 -> v2:
 * avoid changing tmem.[ch] entirely, spotted by Dan
 * don't accumulate [eph|pers]pageframe and [eph|pers]zpages for zero-filled pages, spotted by Dan
 * clean up TODO list
 * add Dan's Acked-by

Hi Dan, some issues regarding ramster:

- Ramster, which builds on zcache, should also support zero-filled pages more efficiently, correct? It doesn't handle zero-filled pages well currently.
- Ramster debugfs counters are exported in /sys/kernel/mm/, but zcache/frontswap/cleancache are all exported in /sys/kernel/debug/; should we unify them?
- Should ramster also move its debugfs counters into a single file, as zcache does?

If you confirm these issues make sense to fix, I will start coding. ;-)

Regards, Wanpeng Li

Motivation:
- Seth Jennings points out that compressing zero-filled pages with LZO (a lossless data compression algorithm) wastes memory and results in fragmentation.
https://lkml.org/lkml/2012/8/14/347
- Dan Magenheimer added the "Support zero-filled pages more efficiently" feature to the zcache TODO list: https://lkml.org/lkml/2013/2/13/503

Design:
- On store, capture zero-filled pages (evicted clean page cache pages and swap pages) but don't compress them; set the pampd, which normally stores the zpage address, to 0x2 (0x0 and 0x1 are already occupied) to mark the special zero-filled case, and take advantage of the tmem infrastructure to transform the handle tuple (pool id, object id, and an index) into a pampd. Compressing a zero-filled page twice would contribute one accumulated zcache_[eph|pers]_pageframes count.
- On load, traverse the tmem hierarchy to transform the handle tuple into a pampd, identify the zero-filled case by the pampd being equal to 0x2 when the filesystem reads file pages or a page needs to be swapped in, then refill the page with zeroes and return.

Test:
 dd if=/dev/zero of=zerofile bs=1MB count=500
 vmtouch -t zerofile
 vmtouch -e zerofile

Formula:
- fragmentation level = (zcache_[eph|pers]_pageframes * PAGE_SIZE - zcache_[eph|pers]_zbytes) * 100 / (zcache_[eph|pers]_pageframes * PAGE_SIZE)
- memory occupied by zcache = zcache_[eph|pers]_zbytes

Result:
without zero-filled awareness:
- fragmentation level: 98%
- memory occupied by zcache: 238MB
with zero-filled awareness:
- fragmentation level: 0%
- memory occupied by zcache: 0MB

Wanpeng Li (3):
 staging: zcache: fix static variables defined in debug.h but used in multiple C files
 staging: zcache: introduce zero-filled page stat count
 staging: zcache: clean TODO list

 drivers/staging/zcache/TODO          |  3 +-
 drivers/staging/zcache/debug.c       | 35 +++
 drivers/staging/zcache/debug.h       | 79 -
 drivers/staging/zcache/zcache-main.c |  4 ++
 4 files changed, 88 insertions(+), 33 deletions(-)
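The zero-fill check and the fragmentation formula from the cover letter are easy to state in code. A sketch (Python for illustration; the helper names are invented, PAGE_SIZE assumed 4096):

```python
PAGE_SIZE = 4096

def page_is_zero_filled(page: bytes) -> bool:
    """A page of all zero bytes need not be compressed at all; zcache
    records it with a sentinel pampd value (0x2) instead of a zpage."""
    return page.count(0) == len(page)

def fragmentation_level(pageframes: int, zbytes: int) -> float:
    """fragmentation = (pageframes * PAGE_SIZE - zbytes) * 100
                       / (pageframes * PAGE_SIZE),
    per the formula in the cover letter."""
    total = pageframes * PAGE_SIZE
    if total == 0:
        return 0.0
    return (total - zbytes) * 100.0 / total
```

This makes the reported numbers concrete: storing compressed near-empty zpages yields many pageframes with few zbytes (fragmentation near 100%), while recognizing zero-filled pages stores no zbytes and no pageframes at all.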
Don't email: mailto:"d...@kvack.org;> em...@kvack.org -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH part2 v6 0/3] staging: zcache: Support zero-filled pages more efficiently
cc Bob On 04/07/2013 05:03 PM, Wanpeng Li wrote: On Wed, Apr 03, 2013 at 06:16:20PM +0800, Wanpeng Li wrote: Changelog: v5 -> v6: * shove variables into debug.c and just have an extern in debug.h, spotted by Konrad * update patch description, spotted by Konrad v4 -> v5: * fix compile error, reported by Fengguang, Geert * add check for !is_ephemeral(pool), spotted by Bob v3 -> v4: * handle duplication in page_is_zero_filled, spotted by Bob * fix zcache writeback in debugfs * fix pers_pageframes|_max not being exported in debugfs * fix static variables defined in debug.h but used in multiple C files * rebase on Greg's staging-next v2 -> v3: * increment/decrement zcache_[eph|pers]_zpages for zero-filled pages, spotted by Dan * replace "zero" or "zero page" with "zero_filled_page", spotted by Dan v1 -> v2: * avoid changing tmem.[ch] entirely, spotted by Dan * don't accumulate [eph|pers]pageframe and [eph|pers]zpages for zero-filled pages, spotted by Dan * clean up TODO list * add Dan's Acked-by. Hi Dan, Some issues against ramster: - Ramster, which takes advantage of zcache, should also support zero-filled pages more efficiently, correct? It doesn't handle zero-filled pages well currently. - Ramster's debugfs counters are exported under /sys/kernel/mm/, but zcache/frontswap/cleancache are all exported under /sys/kernel/debug/; should we unify them? - Should ramster also move its debugfs counters to a single file, as zcache does? If you confirm these issues make sense to fix, I will start coding. ;-) Regards, Wanpeng Li Motivation: - Seth Jennings points out that compressing zero-filled pages with LZO (a lossless data compression algorithm) wastes memory and results in fragmentation. 
https://lkml.org/lkml/2012/8/14/347 - Dan Magenheimer added the "Support zero-filled pages more efficiently" feature to the zcache TODO list https://lkml.org/lkml/2013/2/13/503 Design: - For a page store, capture zero-filled pages (evicted clean page cache pages and swap pages) but don't compress them; set the pampd, which stores the zpage address, to 0x2 (0x0 and 0x1 are already occupied) to mark the special zero-filled case, and take advantage of the tmem infrastructure to transform a handle-tuple (pool id, object id, and an index) into a pampd. Compressing zero-filled pages twice would contribute to one accumulated zcache_[eph|pers]_pageframes count. - For a page load, traverse the tmem hierarchy to transform the handle-tuple into a pampd and identify the zero-filled case by the pampd being equal to 0x2 when the filesystem reads file pages or a page needs to be swapped in, then refill the page with zeros and return. Test: dd if=/dev/zero of=zerofile bs=1MB count=500 vmtouch -t zerofile vmtouch -e zerofile Formula: - fragmentation level = (zcache_[eph|pers]_pageframes * PAGE_SIZE - zcache_[eph|pers]_zbytes) * 100 / (zcache_[eph|pers]_pageframes * PAGE_SIZE) - memory occupied by zcache = zcache_[eph|pers]_zbytes Result: without zero-filled awareness: - fragmentation level: 98% - memory occupied by zcache: 238MB with zero-filled awareness: - fragmentation level: 0% - memory occupied by zcache: 0MB Wanpeng Li (3): staging: zcache: fix static variables defined in debug.h but used in multiple C files staging: zcache: introduce zero-filled page stat count staging: zcache: clean TODO list drivers/staging/zcache/TODO | 3 +- drivers/staging/zcache/debug.c | 35 +++ drivers/staging/zcache/debug.h | 79 - drivers/staging/zcache/zcache-main.c | 4 ++ 4 files changed, 88 insertions(+), 33 deletions(-) -- 1.7.5.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majord...@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: em...@kvack.org -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv3, RFC 00/34] Transparent huge page cache
Hi Kirill, On 04/05/2013 07:59 PM, Kirill A. Shutemov wrote: From: "Kirill A. Shutemov" Here's the third RFC. Thanks everybody for the feedback. Could you answer my questions on your version two? The patchset is pretty big already and I want to stop generating new features to keep it reviewable. Next I'll concentrate on benchmarking and tuning. Therefore some features will be outside the initial transparent huge page cache implementation: - page collapsing; - migration; - tmpfs/shmem; There are a few features which are not implemented and could potentially block upstreaming: 1. Currently we allocate a 2M page even if we create only a 1 byte file on ramfs. I don't think it's a problem by itself. With anon THP pages we also try to allocate huge pages whenever possible. The problem is that ramfs pages are unevictable and we can't just split them and push them to swap as with anon THP. We will (at some point) have to have a mechanism to split the last page of the file under memory pressure to reclaim some memory. 2. We don't have knobs for disabling transparent huge page cache per-mount or per-file. Should we have a mount option and fadvise flags as part of the initial implementation? Any thoughts? The patchset is also on git: git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git thp/pagecache v3: - set RADIX_TREE_PRELOAD_NR to 512 only if we build with THP; - rewrite lru_add_page_tail() to address a few bugs; - memcg accounting; - represent file thp pages in meminfo and friends; - dump page order in filemap trace; - add missed flush_dcache_page() in zero_huge_user_segment; - random cleanups based on feedback. 
v2: - mmap(); - fix add_to_page_cache_locked() and delete_from_page_cache(); - introduce mapping_can_have_hugepages(); - call split_huge_page() only for head page in filemap_fault(); - wait_split_huge_page(): serialize over i_mmap_mutex too; - lru_add_page_tail: avoid PageUnevictable on active/inactive lru lists; - fix off-by-one in zero_huge_user_segment(); - THP_WRITE_ALLOC/THP_WRITE_FAILED counters; Kirill A. Shutemov (34): mm: drop actor argument of do_generic_file_read() block: implement add_bdi_stat() mm: implement zero_huge_user_segment and friends radix-tree: implement preload for multiple contiguous elements memcg, thp: charge huge cache pages thp, mm: avoid PageUnevictable on active/inactive lru lists thp, mm: basic defines for transparent huge page cache thp, mm: introduce mapping_can_have_hugepages() predicate thp: represent file thp pages in meminfo and friends thp, mm: rewrite add_to_page_cache_locked() to support huge pages mm: trace filemap: dump page order thp, mm: rewrite delete_from_page_cache() to support huge pages thp, mm: trigger bug in replace_page_cache_page() on THP thp, mm: locking tail page is a bug thp, mm: handle tail pages in page_cache_get_speculative() thp, mm: add event counters for huge page alloc on write to a file thp, mm: implement grab_thp_write_begin() thp, mm: naive support of thp in generic read/write routines thp, libfs: initial support of thp in simple_read/write_begin/write_end thp: handle file pages in split_huge_page() thp: wait_split_huge_page(): serialize over i_mmap_mutex too thp, mm: truncate support for transparent huge page cache thp, mm: split huge page on mmap file page ramfs: enable transparent huge page cache x86-64, mm: proper alignment mappings with hugepages mm: add huge_fault() callback to vm_operations_struct thp: prepare zap_huge_pmd() to uncharge file pages thp: move maybe_pmd_mkwrite() out of mk_huge_pmd() thp, mm: basic huge_fault implementation for generic_file_vm_ops thp: extract fallback path from 
do_huge_pmd_anonymous_page() to a function thp: initial implementation of do_huge_linear_fault() thp: handle write-protect exception to file-backed huge pages thp: call __vma_adjust_trans_huge() for file-backed VMA thp: map file-backed huge pages on fault arch/x86/kernel/sys_x86_64.c | 12 +- drivers/base/node.c | 10 + fs/libfs.c | 48 +++- fs/proc/meminfo.c | 6 + fs/ramfs/inode.c | 6 +- include/linux/backing-dev.h | 10 + include/linux/huge_mm.h | 36 ++- include/linux/mm.h | 8 + include/linux/mmzone.h | 1 + include/linux/pagemap.h | 33 ++- include/linux/radix-tree.h | 11 + include/linux/vm_event_item.h | 2 + include/trace/events/filemap.h | 7 +- lib/radix-tree.c | 33 ++- mm/filemap.c | 298 - mm/huge_memory.c | 474 +--- mm/memcontrol.c | 2 - mm/memory.c | 41 +++- mm/mmap.c | 3 + mm/page_alloc.c | 7 +- mm/swap.c | 20 +- mm/truncate.c
Re: [PATCHv2, RFC 12/30] thp, mm: add event counters for huge page alloc on write to a file
Hi Kirill, On 03/26/2013 04:40 PM, Kirill A. Shutemov wrote: Dave Hansen wrote: On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote: --- a/include/linux/vm_event_item.h +++ b/include/linux/vm_event_item.h @@ -71,6 +71,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, THP_FAULT_FALLBACK, THP_COLLAPSE_ALLOC, THP_COLLAPSE_ALLOC_FAILED, + THP_WRITE_ALLOC, + THP_WRITE_FAILED, THP_SPLIT, THP_ZERO_PAGE_ALLOC, THP_ZERO_PAGE_ALLOC_FAILED, I think these names are a bit terse. It's certainly not _writes_ that are failing, and "THP_WRITE_FAILED" makes it sound that way. Right. s/THP_WRITE_FAILED/THP_WRITE_ALLOC_FAILED/ Also, why do we need to differentiate these from the existing anon-hugepage vm stats? The alloc_pages() call seems to be doing the exact same thing in the end. Is one more likely to succeed than the other? The existing stats specify the source of the THP page: fault or collapse. When we allocate a new huge page with write(2) it's neither fault nor collapse. I think it's reasonable to introduce a new type of event for that. Why isn't allocating a new huge page with write(2) considered a write fault? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv2, RFC 11/30] thp, mm: handle tail pages in page_cache_get_speculative()
Hi Kirill, On 03/15/2013 01:50 AM, Kirill A. Shutemov wrote: From: "Kirill A. Shutemov" For a tail page we call __get_page_tail(). It has the same semantics, but for tail pages. Signed-off-by: Kirill A. Shutemov --- include/linux/pagemap.h |4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index 3521b0d..408c4e3 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -159,6 +159,9 @@ static inline int page_cache_get_speculative(struct page *page) What's the difference between page_cache_get_speculative and page_cache_get? { VM_BUG_ON(in_interrupt()); + if (unlikely(PageTail(page))) + return __get_page_tail(page); + #ifdef CONFIG_TINY_RCU # ifdef CONFIG_PREEMPT_COUNT VM_BUG_ON(!in_atomic()); @@ -185,7 +188,6 @@ static inline int page_cache_get_speculative(struct page *page) return 0; } #endif - VM_BUG_ON(PageTail(page)); return 1; } -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv2, RFC 07/30] thp, mm: introduce mapping_can_have_hugepages() predicate
On 04/05/2013 11:45 AM, Ric Mason wrote: Hi Kirill, On 03/15/2013 01:50 AM, Kirill A. Shutemov wrote: From: "Kirill A. Shutemov" Returns true if the mapping can have huge pages. Just check for __GFP_COMP in the gfp mask of the mapping for now. Signed-off-by: Kirill A. Shutemov --- include/linux/pagemap.h | 10 ++ 1 file changed, 10 insertions(+) diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index e3dea75..3521b0d 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -84,6 +84,16 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask) (__force unsigned long)mask; } +static inline bool mapping_can_have_hugepages(struct address_space *m) +{ +if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) { +gfp_t gfp_mask = mapping_gfp_mask(m); +return !!(gfp_mask & __GFP_COMP); I always see !! in the kernel, but why check directly instead of having a !! prefix? s/why/why not +} + +return false; +} + /* * The page cache can be done in larger chunks than * one page, because it allows for more efficient -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv2, RFC 05/30] thp, mm: avoid PageUnevictable on active/inactive lru lists
Hi Kirill, On 03/22/2013 06:11 PM, Kirill A. Shutemov wrote: Dave Hansen wrote: On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote: active/inactive lru lists can contain unevictable pages (i.e. ramfs pages that have been placed on the LRU lists when first allocated), but these pages must not have PageUnevictable set - otherwise shrink_active_list goes crazy: ... For lru_add_page_tail(), it means we should not set PageUnevictable() for tail pages unless we're sure that the tail page will go to LRU_UNEVICTABLE. The tail page will go to LRU_UNEVICTABLE if the head page is not on the LRU or if it's marked PageUnevictable() too. This is only an issue once you're using lru_add_page_tail() for non-anonymous pages, right? I'm not sure about that. Documentation/vm/unevictable-lru.txt: Some examples of these unevictable pages on the LRU lists are: (1) ramfs pages that have been placed on the LRU lists when first allocated. (2) SHM_LOCK'd shared memory pages. shmctl(SHM_LOCK) does not attempt to allocate or fault in the pages in the shared memory region. This happens when an application accesses the page the first time after SHM_LOCK'ing the segment. (3) mlocked pages that could not be isolated from the LRU and moved to the unevictable list in mlock_vma_page(). (4) Pages mapped into multiple VM_LOCKED VMAs, but try_to_munlock() couldn't acquire the VMA's mmap semaphore to test the flags and set PageMlocked. munlock_vma_page() was forced to let the page back on to the normal LRU list for vmscan to handle. diff --git a/mm/swap.c b/mm/swap.c index 92a9be5..31584d0 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -762,7 +762,8 @@ void lru_add_page_tail(struct page *page, struct page *page_tail, lru = LRU_INACTIVE_ANON; } } else { - SetPageUnevictable(page_tail); + if (!PageLRU(page) || PageUnevictable(page)) + SetPageUnevictable(page_tail); lru = LRU_UNEVICTABLE; } You were saying above that ramfs pages can get on the normal active/inactive lists. 
But, this will end up getting them on the unevictable list, right? So, we have normal ramfs pages on the active/inactive lists, but ramfs pages after a huge-page split on the unevictable list. That seems a bit inconsistent. Yeah, it's confusing. I was able to trigger another bug in this code: if page_evictable(page_tail) is false and PageLRU(page) is true, page_tail will go to the same lru as page, but nobody cares to sync page_tail's active/inactive state with page. So we can end up with an inactive page on the active lru... I've updated the patch for the next iteration. You can check it in git. It should be cleaner. The description needs to be updated. Hope you can send it out soon. ;-) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv2, RFC 04/30] radix-tree: implement preload for multiple contiguous elements
Hi Kirill, On 03/15/2013 01:50 AM, Kirill A. Shutemov wrote: From: "Kirill A. Shutemov" Currently radix_tree_preload() only guarantees enough nodes to insert one element. It's a hard limit. You cannot batch a number of inserts under one tree_lock. This patch introduces radix_tree_preload_count(). It allows preallocating enough nodes to insert a number of *contiguous* elements. Signed-off-by: Matthew Wilcox Signed-off-by: Kirill A. Shutemov --- include/linux/radix-tree.h |3 +++ lib/radix-tree.c | 32 +--- 2 files changed, 28 insertions(+), 7 deletions(-) diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h index ffc444c..81318cb 100644 --- a/include/linux/radix-tree.h +++ b/include/linux/radix-tree.h @@ -83,6 +83,8 @@ do { \ (root)->rnode = NULL;\ } while (0) +#define RADIX_TREE_PRELOAD_NR 512 /* For THP's benefit */ + /** * Radix-tree synchronization * @@ -231,6 +233,7 @@ unsigned long radix_tree_next_hole(struct radix_tree_root *root, unsigned long radix_tree_prev_hole(struct radix_tree_root *root, unsigned long index, unsigned long max_scan); int radix_tree_preload(gfp_t gfp_mask); +int radix_tree_preload_count(unsigned size, gfp_t gfp_mask); void radix_tree_init(void); void *radix_tree_tag_set(struct radix_tree_root *root, unsigned long index, unsigned int tag); diff --git a/lib/radix-tree.c b/lib/radix-tree.c index e796429..9bef0ac 100644 --- a/lib/radix-tree.c +++ b/lib/radix-tree.c @@ -81,16 +81,24 @@ static struct kmem_cache *radix_tree_node_cachep; * The worst case is a zero height tree with just a single item at index 0, * and then inserting an item at index ULONG_MAX. This requires 2 new branches * of RADIX_TREE_MAX_PATH size to be created, with only the root node shared. + * + * Worst case for adding N contiguous items is adding entries at indexes + * (ULONG_MAX - N) to ULONG_MAX. It requires the nodes needed to insert a single + * worst-case item, plus extra nodes if you cross the boundary from one node to the next. 
+ * What's the meaning of this comment? Could you explain it in detail? I also don't understand #define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1); why RADIX_TREE_MAX_PATH * 2 - 1? I fail to understand the comment above it. * Hence: */ -#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1) +#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1) +#define RADIX_TREE_PRELOAD_MAX \ + (RADIX_TREE_PRELOAD_MIN + \ +DIV_ROUND_UP(RADIX_TREE_PRELOAD_NR - 1, RADIX_TREE_MAP_SIZE)) /* * Per-cpu pool of preloaded nodes */ struct radix_tree_preload { int nr; - struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_SIZE]; + struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_MAX]; }; static DEFINE_PER_CPU(struct radix_tree_preload, radix_tree_preloads) = { 0, }; @@ -257,29 +265,34 @@ radix_tree_node_free(struct radix_tree_node *node) /* * Load up this CPU's radix_tree_node buffer with sufficient objects to - * ensure that the addition of a single element in the tree cannot fail. On - * success, return zero, with preemption disabled. On error, return -ENOMEM + * ensure that the addition of *contiguous* elements in the tree cannot fail. + * On success, return zero, with preemption disabled. On error, return -ENOMEM * with preemption not disabled. * * To make use of this facility, the radix tree must be initialised without * __GFP_WAIT being passed to INIT_RADIX_TREE(). 
*/ -int radix_tree_preload(gfp_t gfp_mask) +int radix_tree_preload_count(unsigned size, gfp_t gfp_mask) { struct radix_tree_preload *rtp; struct radix_tree_node *node; int ret = -ENOMEM; + int alloc = RADIX_TREE_PRELOAD_MIN + + DIV_ROUND_UP(size - 1, RADIX_TREE_MAP_SIZE); + + if (size > RADIX_TREE_PRELOAD_NR) + return -ENOMEM; preempt_disable(); rtp = &__get_cpu_var(radix_tree_preloads); - while (rtp->nr < ARRAY_SIZE(rtp->nodes)) { + while (rtp->nr < alloc) { preempt_enable(); node = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask); if (node == NULL) goto out; preempt_disable(); rtp = &__get_cpu_var(radix_tree_preloads); - if (rtp->nr < ARRAY_SIZE(rtp->nodes)) + if (rtp->nr < alloc) rtp->nodes[rtp->nr++] = node; else kmem_cache_free(radix_tree_node_cachep, node); @@ -288,6 +301,11 @@ int radix_tree_preload(gfp_t gfp_mask) out: return ret; } + +int radix_tree_preload(gfp_t gfp_mask) +{ + return radix_tree_preload_count(1, gfp_mask); +}
Re: [PATCH, RFC 00/16] Transparent huge page cache
Hi Hugh, On 01/29/2013 01:03 PM, Hugh Dickins wrote: On Mon, 28 Jan 2013, Kirill A. Shutemov wrote: From: "Kirill A. Shutemov" Here are the first steps towards huge pages in the page cache. The intent of the work is to get the code ready to enable transparent huge page cache for the simplest fs -- ramfs. It's not yet near feature-complete. It only provides basic infrastructure. At the moment we can read, write and truncate a file on ramfs with huge pages in the page cache. The most interesting part, mmap(), is not yet there. For now we split the huge page on an mmap() attempt. I can't say that I see the whole picture. I'm not sure if I understand the locking model around split_huge_page(). Probably not. Andrea, could you check if it looks correct? Next steps (not necessarily in this order): - mmap(); - migration (?); - collapse; - stats, knobs, etc.; - tmpfs/shmem enabling; - ... Kirill A. Shutemov (16): block: implement add_bdi_stat() mm: implement zero_huge_user_segment and friends mm: drop actor argument of do_generic_file_read() radix-tree: implement preload for multiple contiguous elements thp, mm: basic defines for transparent huge page cache thp, mm: rewrite add_to_page_cache_locked() to support huge pages thp, mm: rewrite delete_from_page_cache() to support huge pages thp, mm: locking tail page is a bug thp, mm: handle tail pages in page_cache_get_speculative() thp, mm: implement grab_cache_huge_page_write_begin() thp, mm: naive support of thp in generic read/write routines thp, libfs: initial support of thp in simple_read/write_begin/write_end thp: handle file pages in split_huge_page() thp, mm: truncate support for transparent huge page cache thp, mm: split huge page on mmap file page ramfs: enable transparent huge page cache fs/libfs.c | 54 +--- fs/ramfs/inode.c | 6 +- include/linux/backing-dev.h | 10 +++ include/linux/huge_mm.h | 8 ++ include/linux/mm.h | 15 include/linux/pagemap.h | 14 ++- include/linux/radix-tree.h | 3 + lib/radix-tree.c | 32 +-- mm/filemap.c | 204 +++ mm/huge_memory.c | 62 +++-- mm/memory.c | 22 + mm/truncate.c | 12 +++ 12 files changed, 375 insertions(+), 67 deletions(-) Interesting. I was starting to think about Transparent Huge Pagecache a few months ago, but then got washed away by incoming waves as usual. Certainly I don't have a line of code to show for it; but my first impression of your patches is that we have very different ideas of where to start. Perhaps that's good complementarity, or perhaps I'll disagree with your approach. I'll be taking a look at yours in the coming days, and trying to summon back up my own ideas to summarize them for you. Perhaps I was naive to imagine it, but I did intend to start out generically, independent of filesystem; but content to narrow down on tmpfs alone where it gets hard to support the others (writeback springs to mind). khugepaged would be migrating little pages into huge pages, where it saw that the mmaps of the file would benefit (and for testing I would hack the mmap alignment choice to favour it). I had arrived at a conviction that the first thing to change was the way that tail pages of a THP are refcounted, that it had been a mistake to use the compound page method of holding the THP together. But I'll have to enter a trance now to recall the arguments ;) One offline question: do you have any idea whether hugetlbfs pages could support swapping? Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majord...@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: em...@kvack.org -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv2, RFC 04/30] radix-tree: implement preload for multiple contiguous elements
Hi Kirill, On 03/15/2013 01:50 AM, Kirill A. Shutemov wrote: From: Kirill A. Shutemov kirill.shute...@linux.intel.com Currently radix_tree_preload() only guarantees enough nodes to insert one element. It's a hard limit. You cannot batch a number insert under one tree_lock. This patch introduces radix_tree_preload_count(). It allows to preallocate nodes enough to insert a number of *contiguous* elements. Signed-off-by: Matthew Wilcox matthew.r.wil...@intel.com Signed-off-by: Kirill A. Shutemov kirill.shute...@linux.intel.com --- include/linux/radix-tree.h |3 +++ lib/radix-tree.c | 32 +--- 2 files changed, 28 insertions(+), 7 deletions(-) diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h index ffc444c..81318cb 100644 --- a/include/linux/radix-tree.h +++ b/include/linux/radix-tree.h @@ -83,6 +83,8 @@ do { \ (root)-rnode = NULL;\ } while (0) +#define RADIX_TREE_PRELOAD_NR 512 /* For THP's benefit */ + /** * Radix-tree synchronization * @@ -231,6 +233,7 @@ unsigned long radix_tree_next_hole(struct radix_tree_root *root, unsigned long radix_tree_prev_hole(struct radix_tree_root *root, unsigned long index, unsigned long max_scan); int radix_tree_preload(gfp_t gfp_mask); +int radix_tree_preload_count(unsigned size, gfp_t gfp_mask); void radix_tree_init(void); void *radix_tree_tag_set(struct radix_tree_root *root, unsigned long index, unsigned int tag); diff --git a/lib/radix-tree.c b/lib/radix-tree.c index e796429..9bef0ac 100644 --- a/lib/radix-tree.c +++ b/lib/radix-tree.c @@ -81,16 +81,24 @@ static struct kmem_cache *radix_tree_node_cachep; * The worst case is a zero height tree with just a single item at index 0, * and then inserting an item at index ULONG_MAX. This requires 2 new branches * of RADIX_TREE_MAX_PATH size to be created, with only the root node shared. + * + * Worst case for adding N contiguous items is adding entries at indexes + * (ULONG_MAX - N) to ULONG_MAX. 
It requires the nodes needed to insert a single worst-case + * item plus extra nodes if you cross the boundary from one node to the next. + * What's the meaning of this comment? Could you explain it in detail? I also don't understand #define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1): why RADIX_TREE_MAX_PATH * 2 - 1? I fail to understand the comment above it. * Hence: */ -#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1) +#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1) +#define RADIX_TREE_PRELOAD_MAX \ + (RADIX_TREE_PRELOAD_MIN + \ + DIV_ROUND_UP(RADIX_TREE_PRELOAD_NR - 1, RADIX_TREE_MAP_SIZE)) /* * Per-cpu pool of preloaded nodes */ struct radix_tree_preload { int nr; - struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_SIZE]; + struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_MAX]; }; static DEFINE_PER_CPU(struct radix_tree_preload, radix_tree_preloads) = { 0, }; @@ -257,29 +265,34 @@ radix_tree_node_free(struct radix_tree_node *node) /* * Load up this CPU's radix_tree_node buffer with sufficient objects to - * ensure that the addition of a single element in the tree cannot fail. On - * success, return zero, with preemption disabled. On error, return -ENOMEM + * ensure that the addition of *contiguous* elements in the tree cannot fail. + * On success, return zero, with preemption disabled. On error, return -ENOMEM * with preemption not disabled. * * To make use of this facility, the radix tree must be initialised without * __GFP_WAIT being passed to INIT_RADIX_TREE().
*/ -int radix_tree_preload(gfp_t gfp_mask) +int radix_tree_preload_count(unsigned size, gfp_t gfp_mask) { struct radix_tree_preload *rtp; struct radix_tree_node *node; int ret = -ENOMEM; + int alloc = RADIX_TREE_PRELOAD_MIN + + DIV_ROUND_UP(size - 1, RADIX_TREE_MAP_SIZE); + + if (size > RADIX_TREE_PRELOAD_NR) + return -ENOMEM; preempt_disable(); rtp = &__get_cpu_var(radix_tree_preloads); - while (rtp->nr < ARRAY_SIZE(rtp->nodes)) { + while (rtp->nr < alloc) { preempt_enable(); node = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask); if (node == NULL) goto out; preempt_disable(); rtp = &__get_cpu_var(radix_tree_preloads); - if (rtp->nr < ARRAY_SIZE(rtp->nodes)) + if (rtp->nr < alloc) rtp->nodes[rtp->nr++] = node; else kmem_cache_free(radix_tree_node_cachep, node); @@ -288,6 +301,11 @@ int radix_tree_preload(gfp_t gfp_mask) out: return ret; } + +int radix_tree_preload(gfp_t gfp_mask) +{ + return
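The PRELOAD arithmetic Simon asks about can be checked numerically. A single worst-case insert (index ULONG_MAX into a tree whose only item is at index 0, i.e. height 0) forces two full root-to-leaf branches of RADIX_TREE_MAX_PATH nodes each that share only the root, hence 2 * RADIX_TREE_MAX_PATH - 1. A run of N contiguous indexes then needs at most DIV_ROUND_UP(N - 1, RADIX_TREE_MAP_SIZE) additional leaf-level nodes for the node boundaries the run can cross. The user-space sketch below assumes the common configuration (64-bit unsigned long, RADIX_TREE_MAP_SHIFT = 6):

```c
#include <assert.h>
#include <stddef.h>

#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

#define RADIX_TREE_MAP_SHIFT	6	/* default non-SMALL config        */
#define RADIX_TREE_MAP_SIZE	(1UL << RADIX_TREE_MAP_SHIFT)	/* 64 slots/node */
#define RADIX_TREE_INDEX_BITS	(8 * sizeof(unsigned long))	/* 64 on 64-bit  */
#define RADIX_TREE_MAX_PATH \
	DIV_ROUND_UP(RADIX_TREE_INDEX_BITS, RADIX_TREE_MAP_SHIFT)	/* = 11 */
#define RADIX_TREE_PRELOAD_NR	512	/* from the patch: one THP's worth */

/* one item, worst case: two full branches sharing only the root */
#define RADIX_TREE_PRELOAD_MIN	(RADIX_TREE_MAX_PATH * 2 - 1)	/* = 21 */
/* plus one extra leaf node per node boundary the contiguous run may cross */
#define RADIX_TREE_PRELOAD_MAX \
	(RADIX_TREE_PRELOAD_MIN + \
	 DIV_ROUND_UP(RADIX_TREE_PRELOAD_NR - 1, RADIX_TREE_MAP_SIZE))	/* = 29 */
```

So for a 512-entry preload the per-CPU nodes array only grows from 21 to 29 slots, which is why the patch can keep a static per-CPU array instead of allocating dynamically.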
Re: [PATCHv2, RFC 05/30] thp, mm: avoid PageUnevictable on active/inactive lru lists
Hi Kirill, On 03/22/2013 06:11 PM, Kirill A. Shutemov wrote: Dave Hansen wrote: On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote: active/inactive lru lists can contain unevicable pages (i.e. ramfs pages that have been placed on the LRU lists when first allocated), but these pages must not have PageUnevictable set - otherwise shrink_active_list goes crazy: ... For lru_add_page_tail(), it means we should not set PageUnevictable() for tail pages unless we're sure that it will go to LRU_UNEVICTABLE. The tail page will go LRU_UNEVICTABLE if head page is not on LRU or if it's marked PageUnevictable() too. This is only an issue once you're using lru_add_page_tail() for non-anonymous pages, right? I'm not sure about that. Documentation/vm/unevictable-lru.txt: Some examples of these unevictable pages on the LRU lists are: (1) ramfs pages that have been placed on the LRU lists when first allocated. (2) SHM_LOCK'd shared memory pages. shmctl(SHM_LOCK) does not attempt to allocate or fault in the pages in the shared memory region. This happens when an application accesses the page the first time after SHM_LOCK'ing the segment. (3) mlocked pages that could not be isolated from the LRU and moved to the unevictable list in mlock_vma_page(). (4) Pages mapped into multiple VM_LOCKED VMAs, but try_to_munlock() couldn't acquire the VMA's mmap semaphore to test the flags and set PageMlocked. munlock_vma_page() was forced to let the page back on to the normal LRU list for vmscan to handle. diff --git a/mm/swap.c b/mm/swap.c index 92a9be5..31584d0 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -762,7 +762,8 @@ void lru_add_page_tail(struct page *page, struct page *page_tail, lru = LRU_INACTIVE_ANON; } } else { - SetPageUnevictable(page_tail); + if (!PageLRU(page) || PageUnevictable(page)) + SetPageUnevictable(page_tail); lru = LRU_UNEVICTABLE; } You were saying above that ramfs pages can get on the normal active/inactive lists. 
But, this will end up getting them on the unevictable list, right? So, we have normal ramfs pages on the active/inactive lists, but ramfs pages after a huge-page-split on the unevictable list. That seems a bit inconsistent. Yeah, it's confusing. I was able to trigger another bug on this code: if page_evictable(page_tail) is false and PageLRU(page) is true, page_tail will go to the same lru as page, but nobody cares to sync page_tail active/inactive state with page. So we can end up with inactive page on active lru... I've updated the patch for the next interation. You can check it in git. It should be cleaner. Description need to be updated. Hope you can send out soon. ;-) -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
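The rule the patch converges on can be stated compactly. The snippet below is a user-space model of just the decision, not the kernel code: a tail page is marked unevictable only when the head is off the LRU entirely or is itself unevictable; an evictable head already on the LRU (e.g. a ramfs page sitting on the inactive list) must not leak its tails onto the unevictable list -- and, per the follow-up bug Kirill mentions, a tail that follows the head's list also has to inherit the head's active/inactive state.

```c
#include <assert.h>
#include <stdbool.h>

/* toy model of the two flags that matter here */
struct flags {
	bool on_lru;       /* models PageLRU()         */
	bool unevictable;  /* models PageUnevictable() */
};

/*
 * Mirrors the fixed condition in lru_add_page_tail():
 *   if (!PageLRU(page) || PageUnevictable(page))
 *           SetPageUnevictable(page_tail);
 */
static bool tail_set_unevictable(struct flags head)
{
	return !head.on_lru || head.unevictable;
}
```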
Re: [PATCHv2, RFC 07/30] thp, mm: introduce mapping_can_have_hugepages() predicate
Hi Kirill, On 03/15/2013 01:50 AM, Kirill A. Shutemov wrote: From: Kirill A. Shutemov kirill.shute...@linux.intel.com Returns true if mapping can have huge pages. Just check for __GFP_COMP in gfp mask of the mapping for now. Signed-off-by: Kirill A. Shutemov kirill.shute...@linux.intel.com --- include/linux/pagemap.h | 10 ++ 1 file changed, 10 insertions(+) diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index e3dea75..3521b0d 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -84,6 +84,16 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask) (__force unsigned long)mask; } +static inline bool mapping_can_have_hugepages(struct address_space *m) +{ + if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) { + gfp_t gfp_mask = mapping_gfp_mask(m); + return !!(gfp_mask & __GFP_COMP); I always see !! in kernel, but why check directly instead of have !! prefix? + } + + return false; +} + /* * The page cache can be done in larger chunks than * one page, because it allows for more efficient
Re: [PATCHv2, RFC 07/30] thp, mm: introduce mapping_can_have_hugepages() predicate
On 04/05/2013 11:45 AM, Ric Mason wrote: Hi Kirill, On 03/15/2013 01:50 AM, Kirill A. Shutemov wrote: From: Kirill A. Shutemov kirill.shute...@linux.intel.com Returns true if mapping can have huge pages. Just check for __GFP_COMP in gfp mask of the mapping for now. Signed-off-by: Kirill A. Shutemov kirill.shute...@linux.intel.com --- include/linux/pagemap.h | 10 ++ 1 file changed, 10 insertions(+) diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index e3dea75..3521b0d 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -84,6 +84,16 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask) (__force unsigned long)mask; } +static inline bool mapping_can_have_hugepages(struct address_space *m) +{ +if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) { +gfp_t gfp_mask = mapping_gfp_mask(m); +return !!(gfp_mask __GFP_COMP); I always see !! in kernel, but why check directly instead of have !! prefix? s/why/why not +} + +return false; +} + /* * The page cache can done in larger chunks than * one page, because it allows for more efficient -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
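To answer the `!!` question above: `gfp_mask & __GFP_COMP` evaluates to the raw flag bit (some large nonzero value), not to 0 or 1, and `!!` collapses it to exactly 0 or 1. With a C99 `bool` return type the implicit conversion would normalize the value anyway, so in this particular function `!!` is mostly a readability idiom; in plain `int` contexts it changes the result. A small sketch -- `TOY_GFP_COMP` is an illustrative value picked for the example, not the real kernel flag:

```c
#include <assert.h>

#define TOY_GFP_COMP 0x4000u	/* illustrative flag value, not the kernel's */

/* returns the raw bit: 0x4000 when set, 0 when clear */
static int has_comp_direct(unsigned int mask)
{
	return mask & TOY_GFP_COMP;
}

/* !! normalizes the bit test to exactly 0 or 1 */
static int has_comp_normalized(unsigned int mask)
{
	return !!(mask & TOY_GFP_COMP);
}
```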
Re: [PATCHv2, RFC 11/30] thp, mm: handle tail pages in page_cache_get_speculative()
Hi Kirill, On 03/15/2013 01:50 AM, Kirill A. Shutemov wrote: From: Kirill A. Shutemov kirill.shute...@linux.intel.com For tail page we call __get_page_tail(). It has the same semantics, but for tail page. Signed-off-by: Kirill A. Shutemov kirill.shute...@linux.intel.com --- include/linux/pagemap.h |4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index 3521b0d..408c4e3 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -159,6 +159,9 @@ static inline int page_cache_get_speculative(struct page *page) What's the different between page_cache_get_speculative and page_cache_get? { VM_BUG_ON(in_interrupt()); + if (unlikely(PageTail(page))) + return __get_page_tail(page); + #ifdef CONFIG_TINY_RCU # ifdef CONFIG_PREEMPT_COUNT VM_BUG_ON(!in_atomic()); @@ -185,7 +188,6 @@ static inline int page_cache_get_speculative(struct page *page) return 0; } #endif - VM_BUG_ON(PageTail(page)); return 1; } -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
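On the question above: page_cache_get() is a plain refcount increment on a page the caller already holds stable, while page_cache_get_speculative() is used during lockless (RCU-protected) lookup, where the page found in the tree may be freed and reused concurrently. It therefore only takes a reference if the refcount is still nonzero, and the caller must recheck afterwards that the slot still points at the same page. The core step can be modelled in user space with C11 atomics -- this is a sketch of the idea, not the kernel implementation:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Model of get_page_unless_zero(): take a reference only if the object is
 * still live (count != 0).  A plain increment here would be unsafe, since
 * a count of zero means the page may already be freed or reallocated.
 */
static bool toy_get_unless_zero(atomic_int *count)
{
	int old = atomic_load(count);

	while (old != 0) {
		/* on failure, old is reloaded and the loop retries */
		if (atomic_compare_exchange_weak(count, &old, old + 1))
			return true;	/* reference taken */
	}
	return false;			/* too late: page already gone */
}
```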
Re: [PATCHv2, RFC 12/30] thp, mm: add event counters for huge page alloc on write to a file
Hi Kirill, On 03/26/2013 04:40 PM, Kirill A. Shutemov wrote: Dave Hansen wrote: On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote: --- a/include/linux/vm_event_item.h +++ b/include/linux/vm_event_item.h @@ -71,6 +71,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, THP_FAULT_FALLBACK, THP_COLLAPSE_ALLOC, THP_COLLAPSE_ALLOC_FAILED, + THP_WRITE_ALLOC, + THP_WRITE_FAILED, THP_SPLIT, THP_ZERO_PAGE_ALLOC, THP_ZERO_PAGE_ALLOC_FAILED, I think these names are a bit terse. It's certainly not _writes_ that are failing and THP_WRITE_FAILED makes it sound that way. Right. s/THP_WRITE_FAILED/THP_WRITE_ALLOC_FAILED/ Also, why do we need to differentiate these from the existing anon-hugepage vm stats? The alloc_pages() call seems to be doing the exact same thing in the end. Is one more likely to succeed than the other? Existing stats specify the source of the thp page: fault or collapse. When we allocate a new huge page with write(2) it's neither fault nor collapse. I think it's reasonable to introduce a new type of event for that. Why is allocating a new huge page with write(2) not a write fault?
Re: [PATCH v2 3/4] introduce zero-filled page stat count
On 03/20/2013 12:41 AM, Konrad Rzeszutek Wilk wrote: On Sun, Mar 17, 2013 at 8:58 AM, Ric Mason wrote: Hi Konrad, On 03/16/2013 09:06 PM, Konrad Rzeszutek Wilk wrote: On Thu, Mar 14, 2013 at 06:08:16PM +0800, Wanpeng Li wrote: Introduce zero-filled page statistics to monitor the number of zero-filled pages. Hm, you must be using an older version of the driver. Please rebase it against Greg KH's staging tree. This is where most if not all of the DebugFS counters got moved to a different file. It seems that zcache debugfs in Greg's staging-next is buggy, Could you test it? Could you email me what the issue you are seeing? They have already fixed in Wanpeng's patchset v4. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v4 0/8] staging: zcache: Support zero-filled pages more efficiently
On 03/19/2013 05:25 PM, Wanpeng Li wrote: Hi Greg, Since you have already merged 1/8, feel free to merge 2/8~8/8, I have already rebased against staging-next. Changelog: v3 -> v4: * handle duplication in page_is_zero_filled, spotted by Bob * fix zcache writeback in debugfs * fix pers_pageframes|_max isn't exported in debugfs * fix static variable defined in debug.h but used in multiple C files * rebase on Greg's staging-next v2 -> v3: * increment/decrement zcache_[eph|pers]_zpages for zero-filled pages, spotted by Dan * replace "zero" or "zero page" by "zero_filled_page", spotted by Dan v1 -> v2: * avoid changing tmem.[ch] entirely, spotted by Dan. * don't accumulate [eph|pers]pageframe and [eph|pers]zpages for zero-filled pages, spotted by Dan * cleanup TODO list * add Dan Acked-by. Motivation: - Seth Jennings points out that compressing zero-filled pages with LZO (a lossless data compression algorithm) wastes memory and results in fragmentation. https://lkml.org/lkml/2012/8/14/347 - Dan Magenheimer added the "Support zero-filled pages more efficiently" feature to the zcache TODO list https://lkml.org/lkml/2013/2/13/503 Design: - For the store path, capture zero-filled pages (evicted clean page cache pages and swap pages), but don't compress them; set the pampd, which stores the zpage address, to 0x2 (since 0x0 and 0x1 are already occupied) to mark the special zero-filled case, and take advantage of the tmem infrastructure to transform the handle-tuple (pool id, object id, and an index) to a pampd. Storing a zero-filled page twice contributes to one accumulated zcache_[eph|pers]_pageframes count. - For the load path, traverse the tmem hierarchy to transform the handle-tuple to a pampd and identify the zero-filled case by the pampd being equal to 0x2 when the filesystem reads file pages or a page needs to be swapped in, then fill the page with zeros and return.
Test: dd if=/dev/zero of=zerofile bs=1MB count=500 vmtouch -t zerofile vmtouch -e zerofile formula: - fragmentation level = (zcache_[eph|pers]_pageframes * PAGE_SIZE - zcache_[eph|pers]_zbytes) * 100 / (zcache_[eph|pers]_pageframes * PAGE_SIZE) - memory zcache occupy = zcache_[eph|pers]_zbytes Result: without zero-filled awareness: - fragmentation level: 98% - memory zcache occupy: 238MB with zero-filled awareness: - fragmentation level: 0% - memory zcache occupy: 0MB Wanpeng Li (8): introduce zero filled pages handler zero-filled pages awareness handle zcache_[eph|pers]_pages for zero-filled page fix pers_pageframes|_max aren't exported in debugfs fix zcache writeback in debugfs fix static variables are defined in debug.h but use in multiple C files introduce zero-filled page stat count clean TODO list You can add Reviewed-by: Ric Mason to this patchset. drivers/staging/zcache/TODO |3 +- drivers/staging/zcache/debug.c |5 +- drivers/staging/zcache/debug.h | 77 +- drivers/staging/zcache/zcache-main.c | 147 ++ 4 files changed, 185 insertions(+), 47 deletions(-) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
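The store-side check that the whole design hinges on can be sketched as follows. This is an illustration of the idea, not the exact zcache page_is_zero_filled() code: scan the candidate page one word at a time and bail out on the first nonzero word, so that only fully zero pages take the cheap 0x2-pampd path and everything else falls through to normal compression.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Sketch of a zero-filled page detector: word-at-a-time scan, early exit
 * on the first nonzero word.  For the common nonzero page this costs only
 * a few loads; for a zero page it costs one pass but saves a compression
 * call and a zbud/zsmalloc allocation entirely.
 */
static bool page_is_zero_filled(const void *page, size_t page_size)
{
	const unsigned long *word = page;
	size_t i;

	for (i = 0; i < page_size / sizeof(*word); i++)
		if (word[i])
			return false;
	return true;
}
```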
Re: [PATCHv2, RFC 00/30] Transparent huge page cache
On 03/18/2013 07:42 PM, Kirill A. Shutemov wrote: Simon Jeons wrote: Hi Kirill, On 03/18/2013 07:19 PM, Kirill A. Shutemov wrote: Simon Jeons wrote: On 03/18/2013 12:03 PM, Simon Jeons wrote: Hi Kirill, On 03/15/2013 01:50 AM, Kirill A. Shutemov wrote: From: "Kirill A. Shutemov" Here's the second version of the patchset. The intend of the work is get code ready to enable transparent huge page cache for the most simple fs -- ramfs. We have read()/write()/mmap() functionality now. Still plenty work ahead. One offline question. Why set PG_mlocked to page_tail which be splited in function __split_huge_page_refcount? Not set, but copied from head page. Head page represents up-to-date sate of compound page, we need to copy it to all tail pages on split. I always see up-to-date state, could you conclude to me which state can be treated as up-to-date? :-) While we work with huge page we only alter flags (like mlocked or uptodate) of head page, but not tail, so we have to copy flags to all tail pages on split. We also need to distribute _count and _mapcount properly. Just read the code. Sorry, you can treat this question as an offline one and irrelevant thp. Which state of page can be treated as up-to-date? Also why can't find where _PAGE_SPLITTING and _PAGE_PSE flags are cleared in split_huge_page path? The pmd is invalidated and replaced with reference to page table at the end of __split_huge_page_map. Since pmd is populated by page table and new flag why need invalidated(clear present flag) before it? Comment just before pmdp_invalidate() in __split_huge_page_map() is fairly informative. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 3/4] introduce zero-filled page stat count
Hi Konrad, On 03/16/2013 09:06 PM, Konrad Rzeszutek Wilk wrote: On Thu, Mar 14, 2013 at 06:08:16PM +0800, Wanpeng Li wrote: Introduce zero-filled page statistics to monitor the number of zero-filled pages. Hm, you must be using an older version of the driver. Please rebase it against Greg KH's staging tree. This is where most if not all of the DebugFS counters got moved to a different file. It seems that zcache debugfs in Greg's staging-next is buggy, Could you test it? Acked-by: Dan Magenheimer Signed-off-by: Wanpeng Li --- drivers/staging/zcache/zcache-main.c |7 +++ 1 files changed, 7 insertions(+), 0 deletions(-) diff --git a/drivers/staging/zcache/zcache-main.c b/drivers/staging/zcache/zcache-main.c index db200b4..2091a4d 100644 --- a/drivers/staging/zcache/zcache-main.c +++ b/drivers/staging/zcache/zcache-main.c @@ -196,6 +196,7 @@ static ssize_t zcache_eph_nonactive_puts_ignored; static ssize_t zcache_pers_nonactive_puts_ignored; static ssize_t zcache_writtenback_pages; static ssize_t zcache_outstanding_writeback_pages; +static ssize_t zcache_pages_zero; #ifdef CONFIG_DEBUG_FS #include @@ -257,6 +258,7 @@ static int zcache_debugfs_init(void) zdfs("outstanding_writeback_pages", S_IRUGO, root, _outstanding_writeback_pages); zdfs("writtenback_pages", S_IRUGO, root, _writtenback_pages); + zdfs("pages_zero", S_IRUGO, root, _pages_zero); return 0; } #undefzdebugfs @@ -326,6 +328,7 @@ void zcache_dump(void) pr_info("zcache: outstanding_writeback_pages=%zd\n", zcache_outstanding_writeback_pages); pr_info("zcache: writtenback_pages=%zd\n", zcache_writtenback_pages); + pr_info("zcache: pages_zero=%zd\n", zcache_pages_zero); } #endif @@ -562,6 +565,7 @@ static void *zcache_pampd_eph_create(char *data, size_t size, bool raw, kunmap_atomic(user_mem); clen = 0; zero_filled = true; + zcache_pages_zero++; goto got_pampd; } kunmap_atomic(user_mem); @@ -645,6 +649,7 @@ static void *zcache_pampd_pers_create(char *data, size_t size, bool raw, kunmap_atomic(user_mem); clen = 0; 
zero_filled = true; + zcache_pages_zero++; goto got_pampd; } kunmap_atomic(user_mem); @@ -866,6 +871,7 @@ static int zcache_pampd_get_data_and_free(char *data, size_t *sizep, bool raw, zpages = 0; if (!raw) *sizep = PAGE_SIZE; + zcache_pages_zero--; goto zero_fill; } @@ -922,6 +928,7 @@ static void zcache_pampd_free(void *pampd, struct tmem_pool *pool, zero_filled = true; zsize = 0; zpages = 0; + zcache_pages_zero--; } if (pampd_is_remote(pampd) && !zero_filled) { -- 1.7.7.6 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majord...@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: mailto:"d...@kvack.org;> em...@kvack.org -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: mmap vs fs cache
Hi Johannes, On 03/09/2013 12:16 AM, Johannes Weiner wrote: On Fri, Mar 08, 2013 at 07:00:55AM -0800, Howard Chu wrote: Chris Friesen wrote: On 03/08/2013 03:40 AM, Howard Chu wrote: There is no way that a process that is accessing only 30GB of a mmap should be able to fill up 32GB of RAM. There's nothing else running on the machine, I've killed or suspended everything else in userland besides a couple shells running top and vmstat. When I manually drop_caches repeatedly, then eventually slapd RSS/SHR grows to 30GB and the physical I/O stops. Is it possible that the kernel is doing some sort of automatic readahead, but it ends up reading pages corresponding to data that isn't ever queried and so doesn't get mapped by the application? Yes, that's what I was thinking. I added a posix_madvise(..POSIX_MADV_RANDOM) but that had no effect on the test. First obvious conclusion - kswapd is being too aggressive. When free memory hits the low watermark, the reclaim shrinks slapd down from 25GB to 18-19GB, while the page cache still contains ~7GB of unmapped pages. Ideally I'd like a tuning knob so I can say to keep no more than 2GB of unmapped pages in the cache. (And the desired effect of that would be to allow user processes to grow to 30GB total, in this case.) We should find out where the unmapped page cache is coming from if you are only accessing mapped file cache and disabled readahead. How do you arrive at this number of unmapped page cache? What could happen is that previously used and activated pages do not get evicted anymore since there is a constant supply of younger If a user process exit, its file pages and anonymous pages will be freed immediately or go through page reclaim? reclaimable cache that is actually thrashing. Whenever you drop the caches, you get rid of those stale active pages and allow the previously thrashing cache to get activated. 
However, that would require that there is already a significant amount of active file Why do you emphasize a *significant* amount of active file pages? pages before your workload starts (check the nr_active_file number in /proc/vmstat before launching slapd, try sync; echo 3 > drop_caches before launching to eliminate this option) OR that the set of pages accessed during your workload changes and the combined set of pages accessed by your workload is bigger than available memory -- which you claimed would not happen because you only access the 30GB file area on that system.
Re: mmap vs fs cache
Hi Johannes, On 03/08/2013 10:08 AM, Johannes Weiner wrote: On Thu, Mar 07, 2013 at 04:43:12PM +0100, Jan Kara wrote: Added mm list to CC. On Tue 05-03-13 09:57:34, Howard Chu wrote: I'm testing our memory-mapped database code on a small VM. The machine has 32GB of RAM and the size of the DB on disk is ~44GB. The database library mmaps the entire file as a single region and starts accessing it as a tree of B+trees. Running on an Ubuntu 3.5.0-23 kernel, XFS on a local disk. If I start running read-only queries against the DB with a freshly started server, I see that my process (OpenLDAP slapd) quickly grows to an RSS of about 16GB in tandem with the FS cache. (I.e., "top" shows 16GB cached, and slapd is 16GB.) If I confine my queries to the first 20% of the data then it all fits in RAM and queries are nice and fast. If I extend the query range to cover more of the data, approaching the size of physical RAM, I see something strange - the FS cache keeps growing, but the slapd process size grows at a slower rate. This is rather puzzling to me since the only thing triggering reads is accesses through the mmap region. Eventually the FS cache grows to basically all of the 32GB of RAM (+/- some text/data space...) but the slapd process only reaches 25GB, at which point it actually starts to shrink - apparently the FS cache is now stealing pages from it. I find that a bit puzzling; if the pages are present in memory, and the only reason they were paged in was to satisfy an mmap reference, why aren't they simply assigned to the slapd process? The current behavior gets even more aggravating: I can run a test that spans exactly 30GB of the data. One would expect that the slapd process should simply grow to 30GB in size, and then remain static for the remainder of the test. Instead, the server grows to 25GB, the FS cache grows to 32GB, and starts stealing pages from the server, shrinking it back down to 19GB or so. 
If I do an "echo 1 > /proc/sys/vm/drop_caches" at the onset of this condition, the FS cache shrinks back to 25GB, matching the slapd process size. This then frees up enough RAM for slapd to grow further. If I don't do this, the test is constantly paging in data from disk. Even so, the FS cache continues to grow faster than the slapd process size, so the system may run out of free RAM again, and I have to drop caches multiple times before slapd finally grows to the full 30GB. Once it gets to that size the test runs entirely from RAM with zero I/Os, but it doesn't get there without a lot of babysitting. 2 questions: why is there data in the FS cache that isn't owned by (the mmap of) the process that caused it to be paged in in the first place? The filesystem cache is shared among processes because the filesystem is also shared among processes. If another task were to access the same file, we still should only have one copy of that data in memory. It sounds to me like slapd is itself caching all the data it reads. If that is true, shouldn't it really be using direct IO to prevent this double buffering of filesystem data in memory? When is using direct IO better, and when is using the page cache better? Is there a tunable knob to discourage the page cache from stealing from the process? Try reducing /proc/sys/vm/swappiness, which ranges from 0-100 and defaults to 60. Why reduce? IIUC, swappiness is used to determine how aggressively anonymous pages are reclaimed; if the value is higher, more anonymous pages will be reclaimed.
Re: [PATCH 2/7] ksm: treat unstable nid like in stable tree
Ping Hugh, :-) On 03/06/2013 06:18 PM, Ric Mason wrote: Hi Hugh, On 03/06/2013 01:05 PM, Hugh Dickins wrote: On Wed, 6 Mar 2013, Ric Mason wrote: [ I've deleted the context because that was about the unstable tree, and here you have moved to asking about a case in the stable tree. ] I think I can basically understand you; please correct me if something is wrong. For a ksm page: if one ksm page (in the old node) migrates to another (new) node (the ksm page is treated as the old page, and one new page is allocated in the new node), since we can't get the right lock at this time, we can't move the stable node to its new tree; the stable node stays in the old tree and stable_node->nid still stores the old node value. If ksmd later scans and compares another page in the old node, the stable tree search will figure out that the stable node's ksm page has migrated to the new node; the stable node will then be erased from the old node's stable tree and linked onto the migrate_nodes list. What's the life of the new page in the new node? The new page will be scanned by ksmd; it will search the stable tree in the new node, and if no matching stable node is found, the stable node is deleted from the migrate_nodes list and added to the new node's stable tree as a leaf; if a matching stable node is found in the stable tree, they will be merged. But in a special case, the ksm page backing that stable node can also have migrated, and a new stable node will replace the stable node whose page migrated. For an unstable tree page: if the search in the unstable tree finds that the tree page with equal content has been migrated, just stop the search and return, with nothing merged. The new page in the new node for this migrated unstable tree page will be inserted into the unstable tree of the new node. For the case where a ksm page is migrated to a different NUMA node, we migrate its stable node to the right tree and it may collide with an existing stable node. 
get_kpfn_nid(stable_node->kpfn) != NUMA(stable_node->nid) can capture nothing That's not so: as I've pointed out before, ksm_migrate_page() updates stable_node->kpfn for the new page on the new NUMA node; but it cannot (get the right locking to) move the stable_node to its new tree at that time. It's moved out once ksmd notices that it's in the wrong NUMA node tree - perhaps when one of its rmap_items reaches the head of cmp_and_merge_page(), or perhaps here in stable_tree_search() when it matches another page coming in to cmp_and_merge_page(). You may be concentrating on the case when that "another page" is a ksm page migrated from a different NUMA node; and overlooking the case of when the matching ksm page in this stable tree has itself been migrated. since stable_node is the node in the right stable tree, nothing has happened to it before this check. Did you intend to check get_kpfn_nid(page_node->kpfn) != NUMA(page_node->nid) ? Certainly not: page_node is usually NULL. But I could have checked get_kpfn_nid(stable_node->kpfn) != nid: I was duplicating the test from cmp_and_merge_page(), but here we do have local variable nid. Hugh
Re: Should a swapped out page be deleted from swap cache?
On 03/06/2013 07:04 PM, Ric Mason wrote: On 03/06/2013 01:34 PM, Li Haifeng wrote: 2013/2/20 Ric Mason : Hi Hugh, On 02/20/2013 02:56 AM, Hugh Dickins wrote: On Tue, 19 Feb 2013, Ric Mason wrote: There is a call of try_to_free_swap in function swap_writepage; if swap_writepage is called from the shrink_page_list path, PageSwapCache(page) == true, PageWriteback(page) may be false, page_swapcount(page) == 0, then it will delete the page from swap cache and free the swap slot. What do I miss? That's correct. PageWriteback is sure to be false there. page_swapcount usually won't be 0 there, but sometimes it will be, and in that case we do want to delete from swap cache and free the swap slot. 1) If PageSwapCache(page) == true, PageWriteback(page) == false, page_swapcount(page) == 0 in swap_writepage (shrink_page_list path), then it will delete the page from swap cache and free the swap slot, in function swap_writepage: if (try_to_free_swap(page)) { unlock_page(page); goto out; } writeback will not execute; that's wrong. What do I miss? when the page is deleted from swap cache and the corresponding swap slot is freed, the page is set dirty. The dirty page won't be reclaimed. It is not wrong. I don't think so. For dirty pages, there are two steps: 1) writeback 2) reclaim. Since PageSwapCache(page) == true && PageWriteback(page) == false && page_swapcount(page) == 0 in swap_writeback(), try_to_free_swap() will return true and writeback will be skipped. Then how can step one be executed? s/swap_writeback()/swap_writepage() Btw, Hi Hugh, could you explain more to us? :-) corresponding path lists as below. when swap_writepage() is called by pageout() in shrink_page_list(), pageout() will return PAGE_SUCCESS. For PAGE_SUCCESS, when PageDirty(page) is true, this reclaiming page will be kept in the inactive LRU list. shrink_page_list() { ... 
switch (pageout(page, mapping, sc)) {
case PAGE_KEEP:
	nr_congested++;
	goto keep_locked;
case PAGE_ACTIVATE:
	goto activate_locked;
case PAGE_SUCCESS:
	if (PageWriteback(page))
		goto keep_lumpy;
	if (PageDirty(page))
		goto keep;
...}
2) In the function pageout, the page will have the PG_Reclaim flag set; since this flag is set, end_swap_bio_write->end_page_writeback:
if (TestClearPageReclaim(page))
	rotate_reclaimable_page(page);
it means that the page will be added to the tail of the LRU list; the page is a clean anonymous page at this time and will be reclaimed to the buddy system soon, correct? correct If it is correct, what is the meaning of "rotate" here? Rotating here is to add the page to the tail of the inactive LRU list, so this page will be reclaimed ASAP while reclaiming. Hugh
Re: Should a swapped out page be deleted from swap cache?
On 03/06/2013 01:34 PM, Li Haifeng wrote: 2013/2/20 Ric Mason : Hi Hugh, On 02/20/2013 02:56 AM, Hugh Dickins wrote: On Tue, 19 Feb 2013, Ric Mason wrote: There is a call of try_to_free_swap in function swap_writepage, if swap_writepage is call from shrink_page_list path, PageSwapCache(page) == trure, PageWriteback(page) maybe false, page_swapcount(page) == 0, then will delete the page from swap cache and free swap slot, where I miss? That's correct. PageWriteback is sure to be false there. page_swapcount usually won't be 0 there, but sometimes it will be, and in that case we do want to delete from swap cache and free the swap slot. 1) If PageSwapCache(page) == true, PageWriteback(page) == false, page_swapcount(page) == 0 in swap_writepage(shrink_page_list path), then will delete the page from swap cache and free swap slot, in function swap_writepage: if (try_to_free_swap(page)) { unlock_page(page); goto out; } writeback will not execute, that's wrong. Where I miss? when the page is deleted from swap cache and corresponding swap slot is free, the page is set dirty. The dirty page won't be reclaimed. It is not wrong. I don't think so. For dirty pages, there are two steps: 1)writeback 2)reclaim. Since PageSwapCache(page) == true && PageWriteback(page) == false && page_swapcount(page) == 0 in swap_writeback(), try_to_free_swap() will return true and writeback will be skip. Then how can step one be executed? corresponding path lists as below. when swap_writepage() is called by pageout() in shrink_page_list(). pageout() will return PAGE_SUCCESS. For PAGE_SUCCESS, when PageDirty(page) is true, this reclaiming page will be keeped in the inactive LRU list. shrink_page_list() { ... 
904 switch (pageout(page, mapping, sc)) { 905 case PAGE_KEEP: 906 nr_congested++; 907 goto keep_locked; 908 case PAGE_ACTIVATE: 909 goto activate_locked; 910 case PAGE_SUCCESS: 911 if (PageWriteback(page)) 912 goto keep_lumpy; 913 if (PageDirty(page)) 914 goto keep; ...} 2) In the function pageout, page will be set PG_Reclaim flag, since this flag is set, end_swap_bio_write->end_page_writeback: if (TestClearPageReclaim(page)) rotate_reclaimable_page(page); it means that page will be add to the tail of lru list, page is clean anonymous page this time and will be reclaim to buddy system soon, correct? correct If is correct, what is the meaning of rotate here? Rotating here is to add the page to the tail of inactive LRU list. So this page will be reclaimed ASAP while reclaiming. Hugh -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/7] ksm: treat unstable nid like in stable tree
Hi Hugh, On 03/06/2013 01:05 PM, Hugh Dickins wrote: On Wed, 6 Mar 2013, Ric Mason wrote: [ I've deleted the context because that was about the unstable tree, and here you have moved to asking about a case in the stable tree. ] I think I can basically understand you, please correct me if something wrong. For ksm page: If one ksm page(in old node) migrate to another(new) node(ksm page is treated as old page, one new page allocated in another node now), since we can't get right lock in this time, we can't move stable node to its new tree at this time, stable node still in old node and stable_node->nid still store old node value. If ksmd scan and compare another page in old node and search stable tree will figure out that stable node relevant ksm page is migrated to new node, stable node will be erased from old node's stable tree and link to migrate_nodes list. What's the life of new page in new node? new page will be scaned by ksmd, it will search stable tree in new node and if doesn't find matched stable node, the new node is deleted from migrate_node list and add to new node's table tree as a leaf, if find stable node in stable tree, they will be merged. But in special case, the stable node relevant ksm page can also migrated, new stable node will replace the stable node which relevant page migrated this time. For unstable tree page: If search in unstable tree and find the tree page which has equal content is migrated, just stop search and return, nothing merged. The new page in new node for this migrated unstable tree page will be insert to unstable tree in new node. For the case of a ksm page is migrated to a different NUMA node and migrate its stable node to the right tree and collide with an existing stable node. 
get_kpfn_nid(stable_node->kpfn) != NUMA(stable_node->nid) can capture nothing That's not so: as I've pointed out before, ksm_migrate_page() updates stable_node->kpfn for the new page on the new NUMA node; but it cannot (get the right locking to) move the stable_node to its new tree at that time. It's moved out once ksmd notices that it's in the wrong NUMA node tree - perhaps when one its rmap_items reaches the head of cmp_and_merge_page(), or perhaps here in stable_tree_search() when it matches another page coming in to cmp_and_merge_page(). You may be concentrating on the case when that "another page" is a ksm page migrated from a different NUMA node; and overlooking the case of when the matching ksm page in this stable tree has itself been migrated. since stable_node is the node in the right stable tree, nothing happen to it before this check. Did you intend to check get_kpfn_nid(page_node->kpfn) != NUMA(page_node->nid) ? Certainly not: page_node is usually NULL. But I could have checked get_kpfn_nid(stable_node->kpfn) != nid: I was duplicating the test from cmp_and_merge_page(), but here we do have local variable nid. Hugh -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/7] ksm: treat unstable nid like in stable tree
Hi Hugh, On 03/06/2013 01:05 PM, Hugh Dickins wrote: On Wed, 6 Mar 2013, Ric Mason wrote: [ I've deleted the context because that was about the unstable tree, and here you have moved to asking about a case in the stable tree. ] I think I can basically understand you, please correct me if something wrong. For ksm page: If one ksm page(in old node) migrate to another(new) node(ksm page is treated as old page, one new page allocated in another node now), since we can't get right lock in this time, we can't move stable node to its new tree at this time, stable node still in old node and stable_node-nid still store old node value. If ksmd scan and compare another page in old node and search stable tree will figure out that stable node relevant ksm page is migrated to new node, stable node will be erased from old node's stable tree and link to migrate_nodes list. What's the life of new page in new node? new page will be scaned by ksmd, it will search stable tree in new node and if doesn't find matched stable node, the new node is deleted from migrate_node list and add to new node's table tree as a leaf, if find stable node in stable tree, they will be merged. But in special case, the stable node relevant ksm page can also migrated, new stable node will replace the stable node which relevant page migrated this time. For unstable tree page: If search in unstable tree and find the tree page which has equal content is migrated, just stop search and return, nothing merged. The new page in new node for this migrated unstable tree page will be insert to unstable tree in new node. For the case of a ksm page is migrated to a different NUMA node and migrate its stable node to the right tree and collide with an existing stable node. 
get_kpfn_nid(stable_node-kpfn) != NUMA(stable_node-nid) can capture nothing That's not so: as I've pointed out before, ksm_migrate_page() updates stable_node-kpfn for the new page on the new NUMA node; but it cannot (get the right locking to) move the stable_node to its new tree at that time. It's moved out once ksmd notices that it's in the wrong NUMA node tree - perhaps when one its rmap_items reaches the head of cmp_and_merge_page(), or perhaps here in stable_tree_search() when it matches another page coming in to cmp_and_merge_page(). You may be concentrating on the case when that another page is a ksm page migrated from a different NUMA node; and overlooking the case of when the matching ksm page in this stable tree has itself been migrated. since stable_node is the node in the right stable tree, nothing happen to it before this check. Did you intend to check get_kpfn_nid(page_node-kpfn) != NUMA(page_node-nid) ? Certainly not: page_node is usually NULL. But I could have checked get_kpfn_nid(stable_node-kpfn) != nid: I was duplicating the test from cmp_and_merge_page(), but here we do have local variable nid. Hugh -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Should a swapped out page be deleted from swap cache?
On 03/06/2013 01:34 PM, Li Haifeng wrote: 2013/2/20 Ric Mason ric.mas...@gmail.com: Hi Hugh, On 02/20/2013 02:56 AM, Hugh Dickins wrote: On Tue, 19 Feb 2013, Ric Mason wrote: There is a call of try_to_free_swap in function swap_writepage, if swap_writepage is call from shrink_page_list path, PageSwapCache(page) == trure, PageWriteback(page) maybe false, page_swapcount(page) == 0, then will delete the page from swap cache and free swap slot, where I miss? That's correct. PageWriteback is sure to be false there. page_swapcount usually won't be 0 there, but sometimes it will be, and in that case we do want to delete from swap cache and free the swap slot. 1) If PageSwapCache(page) == true, PageWriteback(page) == false, page_swapcount(page) == 0 in swap_writepage(shrink_page_list path), then will delete the page from swap cache and free swap slot, in function swap_writepage: if (try_to_free_swap(page)) { unlock_page(page); goto out; } writeback will not execute, that's wrong. Where I miss? when the page is deleted from swap cache and corresponding swap slot is free, the page is set dirty. The dirty page won't be reclaimed. It is not wrong. I don't think so. For dirty pages, there are two steps: 1)writeback 2)reclaim. Since PageSwapCache(page) == true PageWriteback(page) == false page_swapcount(page) == 0 in swap_writeback(), try_to_free_swap() will return true and writeback will be skip. Then how can step one be executed? corresponding path lists as below. when swap_writepage() is called by pageout() in shrink_page_list(). pageout() will return PAGE_SUCCESS. For PAGE_SUCCESS, when PageDirty(page) is true, this reclaiming page will be keeped in the inactive LRU list. shrink_page_list() { ... 
	switch (pageout(page, mapping, sc)) {
	case PAGE_KEEP:
		nr_congested++;
		goto keep_locked;
	case PAGE_ACTIVATE:
		goto activate_locked;
	case PAGE_SUCCESS:
		if (PageWriteback(page))
			goto keep_lumpy;
		if (PageDirty(page))
			goto keep;
	...
	}
2) In the function pageout, the PG_Reclaim flag will be set on the page; since this flag is set, end_swap_bio_write->end_page_writeback: if (TestClearPageReclaim(page)) rotate_reclaimable_page(page); it means that the page will be added to the tail of the LRU list; the page is a clean anonymous page this time and will be reclaimed to the buddy system soon, correct? correct If it is correct, what is the meaning of rotate here? Rotating here is to add the page to the tail of the inactive LRU list. So this page will be reclaimed ASAP while reclaiming. Hugh
Re: Should a swapped out page be deleted from swap cache?
On 03/06/2013 07:04 PM, Ric Mason wrote: On 03/06/2013 01:34 PM, Li Haifeng wrote: 2013/2/20 Ric Mason ric.mas...@gmail.com: Hi Hugh, On 02/20/2013 02:56 AM, Hugh Dickins wrote: On Tue, 19 Feb 2013, Ric Mason wrote: There is a call of try_to_free_swap in function swap_writepage; if swap_writepage is called from the shrink_page_list path, PageSwapCache(page) == true, PageWriteback(page) may be false, page_swapcount(page) == 0, then it will delete the page from swap cache and free the swap slot. What do I miss? That's correct. PageWriteback is sure to be false there. page_swapcount usually won't be 0 there, but sometimes it will be, and in that case we do want to delete from swap cache and free the swap slot. 1) If PageSwapCache(page) == true, PageWriteback(page) == false, page_swapcount(page) == 0 in swap_writepage (shrink_page_list path), then it will delete the page from swap cache and free the swap slot, in function swap_writepage: if (try_to_free_swap(page)) { unlock_page(page); goto out; } writeback will not execute; that's wrong. What do I miss? when the page is deleted from swap cache and the corresponding swap slot is freed, the page is set dirty. The dirty page won't be reclaimed. It is not wrong. I don't think so. For dirty pages, there are two steps: 1) writeback 2) reclaim. Since PageSwapCache(page) == true, PageWriteback(page) == false, page_swapcount(page) == 0 in swap_writeback(), try_to_free_swap() will return true and writeback will be skipped. Then how can step one be executed? s/swap_writeback()/swap_writepage() Btw, Hi Hugh, could you explain more to us? :-) The corresponding path is listed below: when swap_writepage() is called by pageout() in shrink_page_list(), pageout() will return PAGE_SUCCESS. For PAGE_SUCCESS, when PageDirty(page) is true, this reclaiming page will be kept in the inactive LRU list. shrink_page_list() { ... 
	switch (pageout(page, mapping, sc)) {
	case PAGE_KEEP:
		nr_congested++;
		goto keep_locked;
	case PAGE_ACTIVATE:
		goto activate_locked;
	case PAGE_SUCCESS:
		if (PageWriteback(page))
			goto keep_lumpy;
		if (PageDirty(page))
			goto keep;
	...
	}
2) In the function pageout, the PG_Reclaim flag will be set on the page; since this flag is set, end_swap_bio_write->end_page_writeback: if (TestClearPageReclaim(page)) rotate_reclaimable_page(page); it means that the page will be added to the tail of the LRU list; the page is a clean anonymous page this time and will be reclaimed to the buddy system soon, correct? correct If it is correct, what is the meaning of rotate here? Rotating here is to add the page to the tail of the inactive LRU list. So this page will be reclaimed ASAP while reclaiming. Hugh
Re: [PATCH 2/7] ksm: treat unstable nid like in stable tree
Hi Hugh, On 03/06/2013 01:05 PM, Hugh Dickins wrote: On Wed, 6 Mar 2013, Ric Mason wrote: [ I've deleted the context because that was about the unstable tree, and here you have moved to asking about a case in the stable tree. ] For the case where a ksm page is migrated to a different NUMA node and its stable node migrates to the right tree and collides with an existing stable node: get_kpfn_nid(stable_node->kpfn) != NUMA(stable_node->nid) can capture nothing That's not so: as I've pointed out before, ksm_migrate_page() updates stable_node->kpfn for the new page on the new NUMA node; but it cannot (get the right locking to) move the stable_node to its new tree at that time. It's moved out once ksmd notices that it's in the wrong NUMA node tree - perhaps when one of its rmap_items reaches the head of cmp_and_merge_page(), or perhaps here in stable_tree_search() when it matches another page coming in to cmp_and_merge_page(). You may be concentrating on the case when that "another page" is a ksm page migrated from a different NUMA node; and overlooking the case of when the matching ksm page in this stable tree has itself been migrated. since stable_node is the node in the right stable tree, nothing happens to it before this check. Did you intend to check get_kpfn_nid(page_node->kpfn) != NUMA(page_node->nid) ? Certainly not: page_node is usually NULL. But I could have checked Are you sure? list_del(&page_node->list) and DO_NUMA(page_node->nid = nid) will trigger panic now. get_kpfn_nid(stable_node->kpfn) != nid: I was duplicating the test from cmp_and_merge_page(), but here we do have local variable nid. Hugh
Re: [PATCH 2/7] ksm: treat unstable nid like in stable tree
On 03/02/2013 10:57 AM, Hugh Dickins wrote: On Sat, 2 Mar 2013, Ric Mason wrote: On 03/02/2013 04:03 AM, Hugh Dickins wrote: On Fri, 1 Mar 2013, Ric Mason wrote: I think the ksm implementation for NUMA awareness is buggy. Sorry, I just don't understand your comments below, but will try to answer or question them as best I can. For page migration stuff, the new page is allocated from the node *which the page is migrated to*. Yes, by definition. - when meeting a page from the wrong NUMA node in an unstable tree get_kpfn_nid(page_to_pfn(page)) *==* page_to_nid(tree_page) I thought you were writing of the wrong NUMA node case, but now you emphasize "*==*", which means the right NUMA node. Yes, I mean the wrong NUMA node. During page migration, the new page has already been allocated in the new node and the old page may be freed. So tree_page is the page in the new node's unstable tree, and page is also a new node page, so get_kpfn_nid(page_to_pfn(page)) *==* page_to_nid(tree_page). I don't understand; but here you seem to be describing a case where two pages from the same NUMA node get merged (after both have been migrated from another NUMA node?), and there's nothing wrong with that, so I won't worry about it further. For the case where a ksm page is migrated to a different NUMA node and its stable node migrates to the right tree and collides with an existing stable node: get_kpfn_nid(stable_node->kpfn) != NUMA(stable_node->nid) can capture nothing since stable_node is the node in the right stable tree, nothing happens to it before this check. Did you intend to check get_kpfn_nid(page_node->kpfn) != NUMA(page_node->nid) ? - meeting a page which is a ksm page before migration get_kpfn_nid(stable_node->kpfn) != NUMA(stable_node->nid) can't capture them since stable_node is for the tree page in the current stable tree. They are always equal. 
When we meet a ksm page in the stable tree before it's migrated to another NUMA node, yes, it will be on the right NUMA node (because we were careful only to merge pages from the right NUMA node there), and that test will not capture them. It's for capturing a ksm page in the stable tree after it has been migrated to another NUMA node. A ksm page migrated to another NUMA node is still not freed, why? Who takes the page count of it? The old page, the one which used to be a ksm page on the old NUMA node, should be freed very soon: since it was isolated from the lru, and its page count checked, I cannot think of anything to hold a reference to it, apart from migration itself - so it just needs to reach putback_lru_page(), and then may rest awhile on __lru_cache_add()'s pagevec before being freed. But I don't see where I said the old page was still not freed. If not freed, since the new page is allocated in the new node, it is the copy of the current ksm page, so the current ksm page doesn't change: get_kpfn_nid(stable_node->kpfn) *==* NUMA(stable_node->nid). But ksm_migrate_page() did VM_BUG_ON(stable_node->kpfn != page_to_pfn(oldpage)); stable_node->kpfn = page_to_pfn(newpage); without changing stable_node->nid. Hugh
Re: [PATCH 2/7] ksm: treat unstable nid like in stable tree
Hi Hugh, On 03/02/2013 04:03 AM, Hugh Dickins wrote: On Fri, 1 Mar 2013, Ric Mason wrote: I think the ksm implementation for NUMA awareness is buggy. Sorry, I just don't understand your comments below, but will try to answer or question them as best I can. For page migration stuff, the new page is allocated from the node *which the page is migrated to*. Yes, by definition. - when meeting a page from the wrong NUMA node in an unstable tree get_kpfn_nid(page_to_pfn(page)) *==* page_to_nid(tree_page) I thought you were writing of the wrong NUMA node case, but now you emphasize "*==*", which means the right NUMA node. Yes, I mean the wrong NUMA node. During page migration, the new page has already been allocated in the new node and the old page may be freed. So tree_page is the page in the new node's unstable tree, and page is also a new node page, so get_kpfn_nid(page_to_pfn(page)) *==* page_to_nid(tree_page). How can you say it's okay for comparisons, but not as a leaf for merging? Pages in the unstable tree are unstable (and it's not even accurate to say "pages in the unstable tree"), they and their content can change at any moment, so I cannot assert anything of them for sure. But if we suppose, as an approximation, that they are somewhat likely to remain stable (and the unstable tree would be useless without that assumption: it tends to work out), but subject to migration, then it makes sense to compare content, no matter what NUMA node it is on, in order to locate a page of the same content; but wrong to merge with that page if it's on the wrong NUMA node, if !merge_across_nodes tells us not to. - when meeting a page from the wrong NUMA node in a stable tree - meeting a normal page What does that line mean, and where does it fit in your argument? I distinguish pages in three kinds: - a ksm page which is already in the stable tree in the old node - a page in the unstable tree in the old node - a page not in any trees in the old node So by normal page here I mean a page not in any trees in the old node. 
- meeting a page which is a ksm page before migration get_kpfn_nid(stable_node->kpfn) != NUMA(stable_node->nid) can't capture them since stable_node is for the tree page in the current stable tree. They are always equal. When we meet a ksm page in the stable tree before it's migrated to another NUMA node, yes, it will be on the right NUMA node (because we were careful only to merge pages from the right NUMA node there), and that test will not capture them. It's for capturing a ksm page in the stable tree after it has been migrated to another NUMA node. A ksm page migrated to another NUMA node is still not freed, why? Who takes the page count of it? If not freed, since the new page is allocated in the new node, it is the copy of the current ksm page, so the current ksm page doesn't change: get_kpfn_nid(stable_node->kpfn) *==* NUMA(stable_node->nid). Hugh
Re: [RFC PATCH v2 1/2] mm: tuning hardcoded reserved memory
On 03/02/2013 06:41 AM, Andrew Shewmaker wrote: On Fri, Mar 01, 2013 at 10:40:43AM +0800, Ric Mason wrote: On 02/28/2013 11:48 AM, Andrew Shewmaker wrote: On Thu, Feb 28, 2013 at 02:12:00PM -0800, Andrew Morton wrote: On Wed, 27 Feb 2013 15:56:30 -0500 Andrew Shewmaker wrote: The following patches are against the mmotm git tree as of February 27th. The first patch only affects OVERCOMMIT_NEVER mode, entirely removing the 3% reserve for other user processes. The second patch affects both OVERCOMMIT_GUESS and OVERCOMMIT_NEVER modes, replacing the hardcoded 3% reserve for the root user with a tunable knob. Gee, it's been years since anyone thought about the overcommit code. Documentation/vm/overcommit-accounting says that OVERCOMMIT_ALWAYS is "Appropriate for some scientific applications", but doesn't say why. You're running a scientific cluster but you're using OVERCOMMIT_NEVER, I think? Is the documentation wrong? None of my scientists appeared to use sparse arrays as Alan described. My users would run jobs that appeared to initialize correctly. However, they wouldn't write to every page they malloced (and they wouldn't use calloc), so I saw jobs failing well into a computation once the simulation tried to access a page and the kernel couldn't give it to them. I think Roadrunner (http://en.wikipedia.org/wiki/IBM_Roadrunner) was the first cluster I put into OVERCOMMIT_NEVER mode. Jobs with infeasible memory requirements fail early and the OOM killer gets triggered much less often than in guess mode. More often than not the OOM killer seemed to kill the wrong thing, causing a subtle brokenness. Disabling overcommit worked so well during the stabilization and early user phases that we did the same with other clusters. Do you mean OVERCOMMIT_NEVER is more suitable for scientific applications than OVERCOMMIT_GUESS and OVERCOMMIT_ALWAYS? Or should it depend on the workload? 
Since your users would run jobs that wouldn't write to every page they malloced, why is OVERCOMMIT_GUESS not more suitable for you? It depends on the workload. They eventually wrote to every page, but not early in the life of the process, so they thought they were fine until the simulation crashed. Why is overcommit guess not suitable even if they eventually wrote to every page? It takes free pages, file pages, available swap pages, and reclaimable slab pages into consideration. In other words, these are all pages available, so why is overcommit not suitable? Actually, I'm confused: what's the root difference between overcommit guess and never? __vm_enough_memory reserves 3% of free pages with the default overcommit mode and 6% when overcommit is disabled. These hardcoded values have become less reasonable as memory sizes have grown. On scientific clusters, systems are generally dedicated to one user. Also, overcommit is sometimes disabled in order to prevent a long running job from suddenly failing days or weeks into a calculation. In this case, a user wishing to allocate as much memory as possible to one process may be prevented from using, for example, around 7GB out of 128GB. The effect is less, but still significant when a user starts a job with one process per core. I have repeatedly seen a set of processes requesting the same amount of memory fail because one of them could not allocate the amount of memory a user would expect to be able to allocate. ...
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -182,11 +182,6 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
 	allowed -= allowed / 32;
 	allowed += total_swap_pages;
 
-	/* Don't let a single process grow too big:
-	   leave 3% of the size of this process for other processes */
-	if (mm)
-		allowed -= mm->total_vm / 32;
-
 	if (percpu_counter_read_positive(&vm_committed_as) < allowed)
 		return 0;
So what might be the downside for this change? root can't log in, I assume. 
Have you actually tested for this scenario and observed the effects? If there *are* observable risks and/or to preserve back-compatibility, I guess we could create a fourth overcommit mode which provides the headroom which you desire. Also, should we be looking at removing root's 3% from OVERCOMMIT_GUESS as well? The downside of the first patch, which removes the "other" reserve (sorry about the confusing duplicated subject line), is that a user may not be able to kill their process, even if they have a shell prompt. When testing, I did sometimes get into a spot where I attempted to execute kill, but got: "bash: fork: Cannot allocate memory". Of course, a user can get in the same predicament with the current 3% reserve--they just have to start processes until 3% becomes negligible. With just the first patch, root still has a 3% reserve, so they can still log in. When I resubmit the second patch, adding a tunable rootuser_reserve_pages variable, I'll test both guess and never overcommit modes to see what
Re: [PATCHv5 2/8] zsmalloc: add documentation
On 02/25/2013 11:18 PM, Seth Jennings wrote: On 02/23/2013 06:37 PM, Ric Mason wrote: On 02/23/2013 05:02 AM, Seth Jennings wrote: On 02/21/2013 08:56 PM, Ric Mason wrote: On 02/21/2013 11:50 PM, Seth Jennings wrote: On 02/21/2013 02:49 AM, Ric Mason wrote: On 02/19/2013 03:16 AM, Seth Jennings wrote: On 02/16/2013 12:21 AM, Ric Mason wrote: On 02/14/2013 02:38 AM, Seth Jennings wrote: This patch adds a documentation file for zsmalloc at Documentation/vm/zsmalloc.txt Signed-off-by: Seth Jennings ---
 Documentation/vm/zsmalloc.txt | 68 +
 1 file changed, 68 insertions(+)
 create mode 100644 Documentation/vm/zsmalloc.txt

diff --git a/Documentation/vm/zsmalloc.txt b/Documentation/vm/zsmalloc.txt
new file mode 100644
index 000..85aa617
--- /dev/null
+++ b/Documentation/vm/zsmalloc.txt
@@ -0,0 +1,68 @@
+zsmalloc Memory Allocator
+
+Overview
+
+zsmalloc is a new slab-based memory allocator
+for storing compressed pages. It is designed for
+low fragmentation and high allocation success rate on
+large, but <= PAGE_SIZE, allocations.
+
+zsmalloc differs from the kernel slab allocator in two primary
+ways to achieve these design goals.
+
+zsmalloc never requires high order page allocations to back
+slabs, or "size classes" in zsmalloc terms. Instead it allows
+multiple single-order pages to be stitched together into a
+"zspage" which backs the slab. This allows for higher allocation
+success rate under memory pressure.
+
+Also, zsmalloc allows objects to span page boundaries within the
+zspage. This allows for lower fragmentation than could be had
+with the kernel slab allocator for objects between PAGE_SIZE/2
+and PAGE_SIZE. With the kernel slab allocator, if a page compresses
+to 60% of its original size, the memory savings gained through
+compression is lost in fragmentation because another object of
+the same size can't be stored in the leftover space.
+
+This ability to span pages results in zsmalloc allocations not being
+directly addressable by the user. 
The user is given a +non-dereferenceable handle in response to an allocation request. +That handle must be mapped, using zs_map_object(), which returns +a pointer to the mapped region that can be used. The mapping is +necessary since the object data may reside in two different +noncontiguous pages. Do you mean the reason a zsmalloc object must be mapped after allocation is that its data may reside in two different noncontiguous pages? Yes, that is one reason for the mapping. The other reason (more of an added bonus) is below. + +For 32-bit systems, zsmalloc has the added benefit of being +able to back slabs with HIGHMEM pages, something not possible What's the meaning of "back slabs with HIGHMEM pages"? By HIGHMEM, I'm referring to the HIGHMEM memory zone on 32-bit systems with more than 1GB (actually a little less) of RAM. The upper 3GB of the 4GB address space, depending on kernel build options, is not directly addressable by the kernel, but can be mapped into the kernel address space with functions like kmap() or kmap_atomic(). These pages can't be used by slab/slub because they are not permanently mapped into the kernel address space. However, since zsmalloc requires a mapping anyway to handle objects that span non-contiguous page boundaries, we do the kernel mapping as part of the process. So zspages, the conceptual slabs in zsmalloc backed by single-order pages, can include pages from the HIGHMEM zone as well. Thanks for the clarification. In http://lwn.net/Articles/537422/, your article about zswap on LWN, you wrote: "Additionally, the kernel slab allocator does not allow objects that are less than a page in size to span a page boundary. This means that if an object is PAGE_SIZE/2 + 1 bytes in size, it effectively uses an entire page, resulting in ~50% waste. Hence there are *no kmalloc() cache sizes* between PAGE_SIZE/2 and PAGE_SIZE." Are you sure? 
It seems that the kmalloc caches support big sizes; you can check include/linux/kmalloc_sizes.h. Yes, kmalloc can allocate large objects > PAGE_SIZE, but there are no cache sizes _between_ PAGE_SIZE/2 and PAGE_SIZE. For example, on a system with 4k pages, there are no caches between kmalloc-2048 and kmalloc-4096. A kmalloc object > PAGE_SIZE/2 or > PAGE_SIZE should also be allocated from a slab cache, correct? Then how can an object be allocated without a slab cache that contains objects of its size? I have to admit, I didn't understand the question. An object is allocated from a slab cache, correct? There are two kinds of slab caches: one is for general purpose, e.g. the kmalloc slab caches; the other is for special purpose, e.g. mm_struct, task_struct. A kmalloc object > PAGE_SIZE/2 or > PAGE_SIZE should also be allocated from a slab cache, correct? Then why did you say that there are no caches between kmalloc-2048 and kmalloc-4096? Ok, now I get it. Yes, I guess I should have qualified h
Re: [PATCH 2/7] ksm: treat unstable nid like in stable tree
Hi Hugh, On 02/23/2013 05:03 AM, Hugh Dickins wrote: On Fri, 22 Feb 2013, Ric Mason wrote: On 02/21/2013 04:20 PM, Hugh Dickins wrote: An inconsistency emerged in reviewing the NUMA node changes to KSM: when meeting a page from the wrong NUMA node in a stable tree, we say that it's okay for comparisons, but not as a leaf for merging; whereas when meeting a page from the wrong NUMA node in an unstable tree, we bail out immediately. IIUC - a ksm page from the wrong NUMA node will be added to the current node's stable tree Please forgive my late response. That should never happen (and when I was checking with a WARN_ON it did not happen). What can happen is that a node already in a stable tree has its page migrated away to another NUMA node. - a normal page from the wrong NUMA node will be merged into the current node's stable tree <- what am I missing here? I didn't see any special handling in function stable_tree_search for this case. nid = get_kpfn_nid(page_to_pfn(page)); root = root_stable_tree + nid; to choose the right tree for the page, and if (get_kpfn_nid(stable_node->kpfn) != NUMA(stable_node->nid)) { put_page(tree_page); goto replace; } to make sure that we don't latch on to a node whose page got migrated away. I think the ksm implementation for NUMA awareness is buggy. For the page migration case, the new page is allocated from the node *which the page is migrated to*. - when meeting a page from the wrong NUMA node in an unstable tree get_kpfn_nid(page_to_pfn(page)) *==* page_to_nid(tree_page) How can you say it's okay for comparisons, but not as a leaf for merging? - when meeting a page from the wrong NUMA node in a stable tree - meeting a normal page - meeting a page which was a ksm page before migration get_kpfn_nid(stable_node->kpfn) != NUMA(stable_node->nid) can't capture them, since stable_node is for the tree page in the current stable tree. They are always equal. 
- a normal page from the wrong NUMA node will be compared, but not used as a leaf for merging, after the patch I don't understand you there, but hope my remarks above resolve it. Hugh -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v2 1/2] mm: tuning hardcoded reserved memory
On 02/28/2013 11:48 AM, Andrew Shewmaker wrote: On Thu, Feb 28, 2013 at 02:12:00PM -0800, Andrew Morton wrote: On Wed, 27 Feb 2013 15:56:30 -0500 Andrew Shewmaker wrote: The following patches are against the mmotm git tree as of February 27th. The first patch only affects OVERCOMMIT_NEVER mode, entirely removing the 3% reserve for other user processes. The second patch affects both OVERCOMMIT_GUESS and OVERCOMMIT_NEVER modes, replacing the hardcoded 3% reserve for the root user with a tunable knob. Gee, it's been years since anyone thought about the overcommit code. Documentation/vm/overcommit-accounting says that OVERCOMMIT_ALWAYS is "Appropriate for some scientific applications", but doesn't say why. You're running a scientific cluster but you're using OVERCOMMIT_NEVER, I think? Is the documentation wrong? None of my scientists appeared to use sparse arrays as Alan described. My users would run jobs that appeared to initialize correctly. However, they wouldn't write to every page they malloced (and they wouldn't use calloc), so I saw jobs failing well into a computation once the simulation tried to access a page and the kernel couldn't give it to them. I think Roadrunner (http://en.wikipedia.org/wiki/IBM_Roadrunner) was the first cluster I put into OVERCOMMIT_NEVER mode. Jobs with infeasible memory requirements fail early and the OOM killer gets triggered much less often than in guess mode. More often than not the OOM killer seemed to kill the wrong thing, causing a subtle brokenness. Disabling overcommit worked so well during the stabilization and early user phases that we did the same with other clusters. Do you mean OVERCOMMIT_NEVER is more suitable for scientific applications than OVERCOMMIT_GUESS and OVERCOMMIT_ALWAYS? Or does it depend on the workload? Since your users ran jobs that wouldn't write to every page they malloced, why isn't OVERCOMMIT_GUESS more suitable for you? 
__vm_enough_memory reserves 3% of free pages with the default overcommit mode and 6% when overcommit is disabled. These hardcoded values have become less reasonable as memory sizes have grown. On scientific clusters, systems are generally dedicated to one user. Also, overcommit is sometimes disabled in order to prevent a long running job from suddenly failing days or weeks into a calculation. In this case, a user wishing to allocate as much memory as possible to one process may be prevented from using, for example, around 7GB out of 128GB. The effect is less, but still significant, when a user starts a job with one process per core. I have repeatedly seen a set of processes requesting the same amount of memory fail because one of them could not allocate the amount of memory a user would expect to be able to allocate. ... --- a/mm/mmap.c +++ b/mm/mmap.c @@ -182,11 +182,6 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin) allowed -= allowed / 32; allowed += total_swap_pages; - /* Don't let a single process grow too big: - leave 3% of the size of this process for other processes */ - if (mm) - allowed -= mm->total_vm / 32; - if (percpu_counter_read_positive(&vm_committed_as) < allowed) return 0; So what might be the downside for this change? root can't log in, I assume. Have you actually tested for this scenario and observed the effects? If there *are* observable risks and/or to preserve back-compatibility, I guess we could create a fourth overcommit mode which provides the headroom which you desire. Also, should we be looking at removing root's 3% from OVERCOMMIT_GUESS as well? The downside of the first patch, which removes the "other" reserve (sorry about the confusing duplicated subject line), is that a user may not be able to kill their process, even if they have a shell prompt. When testing, I did sometimes get into a spot where I attempted to execute kill, but got: "bash: fork: Cannot allocate memory". 
Of course, a user can get in the same predicament with the current 3% reserve--they just have to start processes until 3% becomes negligible. With just the first patch, root still has a 3% reserve, so they can still log in. When I resubmit the second patch, adding a tunable rootuser_reserve_pages variable, I'll test both guess and never overcommit modes to see what minimum initial values allow root to log in and kill a user's memory hogging process. This will be safer than the current behavior since root's reserve will never shrink to something useless in the case where a user has grabbed all available memory with many processes. The idea of two patches looks reasonable to me. As an estimate of a useful rootuser_reserve_pages, the rss+share size of Sorry for my silly question: do you mean the share size is not included in the rss size? sshd, bash, and top is about 16MB. Overcommit disabled mode would need closer to 360MB for the same processes. On a 128GB box 3% is 3.8GB, so the new tunable
Re: zsmalloc limitations and related topics
On 02/28/2013 07:24 AM, Dan Magenheimer wrote: Hi all -- I've been doing some experimentation on zsmalloc in preparation for my topic proposed for LSFMM13 and have run across some perplexing limitations. Those familiar with the intimate details of zsmalloc might be well aware of these limitations, but they aren't documented or immediately obvious, so I thought it would be worthwhile to air them publicly. I've also included some measurements from the experimentation and some related thoughts. (Some of the terms here are unusual and may be used inconsistently by different developers so a glossary of definitions of the terms used here is appended.) ZSMALLOC LIMITATIONS Zsmalloc is used for two zprojects: zram and the out-of-tree zswap. Zsmalloc can achieve high density when "full". But: 1) Zsmalloc has a worst-case density of 0.25 (one zpage per four pageframes). 2) When not full and especially when nearly-empty _after_ being full, density may fall below 1.0 as a result of fragmentation. What's the meaning of nearly-empty _after_ being full? 3) Zsmalloc has a density of exactly 1.0 for any number of zpages with zsize >= 0.8. 4) Zsmalloc contains several compile-time parameters; the best value of these parameters may be very workload dependent. If density == 1.0, that means we are paying the overhead of compression+decompression for no space advantage. If density < 1.0, that means using zsmalloc is detrimental, resulting in worse memory pressure than if it were not used. WORKLOAD ANALYSIS These limitations emphasize that the workload used to evaluate zsmalloc is very important. Benchmarks that measure data Could you share your benchmark, so that others can take advantage of it? throughput or CPU utilization are of questionable value because it is the _content_ of the data that is particularly relevant for compression. 
Even more precisely, it is the "entropy" of the data that is relevant, because the amount of compressibility in the data is related to the entropy: I.e. an entirely random pagefull of bits will compress poorly and a highly-regular pagefull of bits will compress well. Since the zprojects manage a large number of zpages, both the mean and distribution of zsize of the workload should be "representative". The workload most widely used to publish results for the various zprojects is a kernel-compile using "make -jN" where N is artificially increased to impose memory pressure. By adding some debug code to zswap, I was able to analyze this workload and found the following: 1) The average page compressed by almost a factor of six (mean zsize == 694, stddev == 474) What is stddev? 2) Almost eleven percent of the pages were zero pages. A zero page compresses to 28 bytes. 3) On average, 77% of the bytes (3156) in the pages-to-be- compressed contained a byte-value of zero. 4) Despite the above, mean density of zsmalloc was measured at 3.2 zpages/pageframe, presumably losing nearly half of available space to fragmentation. I have no clue if these measurements are representative of a wide range of workloads over the lifetime of a booted machine, but I am suspicious that they are not. For example, the lzo1x compression algorithm claims to compress data by about a factor of two. I would welcome ideas on how to evaluate workloads for "representativeness". Personally I don't believe we should be making decisions about selecting the "best" algorithms or merging code without an agreement on workloads. PAGEFRAME EVACUATION AND RECLAIM I've repeatedly stated the opinion that managing the number of pageframes containing compressed pages will be valuable for managing MM interaction/policy when compression is used in the kernel. 
After the experimentation above and some brainstorming, I still do not see an effective method for zsmalloc evacuating and reclaiming pageframes, because both are complicated by high density and page-crossing. In other words, zsmalloc's strengths may also be its Achilles heels. For zram, as far as I can see, pageframe evacuation/reclaim is irrelevant except perhaps as part of mass defragmentation. For zcache and zswap, where writethrough is used, pageframe evacuation/reclaim is very relevant. (Note: The writeback implemented in zswap does _zpage_ evacuation without pageframe reclaim.) CLOSING THOUGHT Since zsmalloc and zbud have different strengths and weaknesses, I wonder if some combination or hybrid might be more optimal? But unless/until we have and can measure a representative workload, only intuition can answer that. GLOSSARY zproject -- a kernel project using compression (zram, zcache, zswap) zpage -- a compressed sequence of PAGE_SIZE bytes zsize -- the number of bytes in a compressed page pageframe -- the term "page" is widely used both to describe either (1) PAGE_SIZE bytes of data, or (2) a physical RAM area with size=PAGE_SIZE which is PAGE_SIZE-aligned, as represented in the kernel
Re: [PATCH] staging/zcache: Fix/improve zcache writeback code, tie to a config option
On 03/01/2013 06:29 AM, Dan Magenheimer wrote: From: Ric Mason [mailto:ric.mas...@gmail.com] Subject: Re: [PATCH] staging/zcache: Fix/improve zcache writeback code, tie to a config option On 02/07/2013 02:27 AM, Dan Magenheimer wrote: It was observed by Andrea Arcangeli in 2011 that zcache can get "full" and there must be some way for compressed swap pages to be (uncompressed and then) sent through to the backing swap disk. A prototype of this functionality, called "unuse", was added in 2012 as part of a major update to zcache (aka "zcache2"), but was left unfinished due to the unfortunate temporary fork of zcache. This earlier version of the code had an unresolved memory leak and was anyway dependent on not-yet-upstream frontswap and mm changes. The code was meanwhile adapted by Seth Jennings for similar functionality in zswap (which he calls "flush"). Seth also made some clever simplifications which are herein ported back to zcache. As a result of those simplifications, the frontswap changes are no longer necessary, but a slightly different (and simpler) set of mm changes are still required [1]. The memory leak is also fixed. Due to feedback from akpm in a zswap thread, this functionality in zcache has now been renamed from "unuse" to "writeback". Although this zcache writeback code now works, there are open questions as to how best to handle the policy that drives it. As a result, this patch also ties writeback to a new config option. And, since the code still depends on not-yet-upstreamed mm patches, to avoid build problems, the config option added by this patch temporarily depends on "BROKEN"; this config dependency can be removed in trees that contain the necessary mm patches. 
[1] https://lkml.org/lkml/2013/1/29/540/ https://lkml.org/lkml/2013/1/29/539/ shrink_zcache_memory: while(nr_evict-- > 0) { page = zcache_evict_eph_pageframe(); if (page == NULL) break; zcache_free_page(page); } zcache_evict_eph_pageframe ->zbud_evict_pageframe_lru ->zbud_evict_tmem ->tmem_flush_page ->zcache_pampd_free ->zcache_free_page <- zbudpage has already been freed here Can the zcache_free_page() called in shrink_zcache_memory() be treated as a double free? Thanks for the code review and sorry for the delay... zcache_pampd_free() only calls zcache_free_page() if page is non-NULL, but in this code path I think when zcache_pampd_free() calls zbud_free_and_delist(), that function determines that the zbudpage is dying and returns NULL. So unless I am misunderstanding (or misreading the code), there is no double free. Oh, I see. Thanks for your response. :) Thanks, Dan
Re: [PATCH] staging/zcache: Fix/improve zcache writeback code, tie to a config option
On 03/01/2013 06:29 AM, Dan Magenheimer wrote: From: Ric Mason [mailto:ric.mas...@gmail.com] Subject: Re: [PATCH] staging/zcache: Fix/improve zcache writeback code, tie to a config option On 02/07/2013 02:27 AM, Dan Magenheimer wrote: It was observed by Andrea Arcangeli in 2011 that zcache can get full and there must be some way for compressed swap pages to be (uncompressed and then) sent through to the backing swap disk. A prototype of this functionality, called unuse, was added in 2012 as part of a major update to zcache (aka zcache2), but was left unfinished due to the unfortunate temporary fork of zcache. This earlier version of the code had an unresolved memory leak and was anyway dependent on not-yet-upstream frontswap and mm changes. The code was meanwhile adapted by Seth Jennings for similar functionality in zswap (which he calls flush). Seth also made some clever simplifications which are herein ported back to zcache. As a result of those simplifications, the frontswap changes are no longer necessary, but a slightly different (and simpler) set of mm changes are still required [1]. The memory leak is also fixed. Due to feedback from akpm in a zswap thread, this functionality in zcache has now been renamed from unuse to writeback. Although this zcache writeback code now works, there are open questions as how best to handle the policy that drives it. As a result, this patch also ties writeback to a new config option. And, since the code still depends on not-yet-upstreamed mm patches, to avoid build problems, the config option added by this patch temporarily depends on BROKEN; this config dependency can be removed in trees that contain the necessary mm patches. 
[1] https://lkml.org/lkml/2013/1/29/540/ https://lkml.org/lkml/2013/1/29/539/ shrink_zcache_memory: while(nr_evict-- 0) { page = zcache_evict_eph_pageframe(); if (page == NULL) break; zcache_free_page(page); } zcache_evict_eph_pageframe -zbud_evict_pageframe_lru -zbud_evict_tmem -tmem_flush_page -zcache_pampd_free -zcache_free_page - zbudpage has already been free here If the zcache_free_page called in shrink_zcache_memory can be treated as a double free? Thanks for the code review and sorry for the delay... zcache_pampd_free() only calls zcache_free_page() if page is non-NULL, but in this code path I think when zcache_pampd_free() calls zbud_free_and_delist(), that function determines that the zbudpage is dying and returns NULL. So unless I am misunderstanding (or misreading the code), there is no double free. Oh, I see. Thanks for your response. :) Thanks, Dan -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: zsmalloc limitations and related topics
On 02/28/2013 07:24 AM, Dan Magenheimer wrote: Hi all -- I've been doing some experimentation on zsmalloc in preparation for my topic proposed for LSFMM13 and have run across some perplexing limitations. Those familiar with the intimate details of zsmalloc might be well aware of these limitations, but they aren't documented or immediately obvious, so I thought it would be worthwhile to air them publicly. I've also included some measurements from the experimentation and some related thoughts. (Some of the terms here are unusual and may be used inconsistently by different developers so a glossary of definitions of the terms used here is appended.) ZSMALLOC LIMITATIONS Zsmalloc is used for two zprojects: zram and the out-of-tree zswap. Zsmalloc can achieve high density when full. But: 1) Zsmalloc has a worst-case density of 0.25 (one zpage per four pageframes). 2) When not full and especially when nearly-empty _after_ being full, density may fall below 1.0 as a result of fragmentation. What's the meaning of nearly-empty _after_ being full? 3) Zsmalloc has a density of exactly 1.0 for any number of zpages with zsize = 0.8. 4) Zsmalloc contains several compile-time parameters; the best value of these parameters may be very workload dependent. If density == 1.0, that means we are paying the overhead of compression+decompression for no space advantage. If density 1.0, that means using zsmalloc is detrimental, resulting in worse memory pressure than if it were not used. WORKLOAD ANALYSIS These limitations emphasize that the workload used to evaluate zsmalloc is very important. Benchmarks that measure data Could you share your benchmark? In order that other guys can take advantage of it. throughput or CPU utilization are of questionable value because it is the _content_ of the data that is particularly relevant for compression. 
Even more precisely, it is the entropy of the data that is relevant, because the amount of compressibility in the data is related to the entropy: I.e. an entirely random pagefull of bits will compress poorly and a highly-regular pagefull of bits will compress well. Since the zprojects manage a large number of zpages, both the mean and distribution of zsize of the workload should be representative. The workload most widely used to publish results for the various zprojects is a kernel-compile using make -jN where N is artificially increased to impose memory pressure. By adding some debug code to zswap, I was able to analyze this workload and found the following: 1) The average page compressed by almost a factor of six (mean zsize == 694, stddev == 474) stddev is what? 2) Almost eleven percent of the pages were zero pages. A zero page compresses to 28 bytes. 3) On average, 77% of the bytes (3156) in the pages-to-be- compressed contained a byte-value of zero. 4) Despite the above, mean density of zsmalloc was measured at 3.2 zpages/pageframe, presumably losing nearly half of available space to fragmentation. I have no clue if these measurements are representative of a wide range of workloads over the lifetime of a booted machine, but I am suspicious that they are not. For example, the lzo1x compression algorithm claims to compress data by about a factor of two. I would welcome ideas on how to evaluate workloads for representativeness. Personally I don't believe we should be making decisions about selecting the best algorithms or merging code without an agreement on workloads. PAGEFRAME EVACUATION AND RECLAIM I've repeatedly stated the opinion that managing the number of pageframes containing compressed pages will be valuable for managing MM interaction/policy when compression is used in the kernel. 
After the experimentation above and some brainstorming, I still do not see an effective method for zsmalloc evacuating and reclaiming pageframes, because both are complicated by high density and page-crossing. In other words, zsmalloc's strengths may also be its Achilles heels. For zram, as far as I can see, pageframe evacuation/reclaim is irrelevant except perhaps as part of mass defragmentation. For zcache and zswap, where writethrough is used, pageframe evacuation/reclaim is very relevant. (Note: The writeback implemented in zswap does _zpage_ evacuation without pageframe reclaim.) CLOSING THOUGHT Since zsmalloc and zbud have different strengths and weaknesses, I wonder if some combination or hybrid might be more optimal? But unless/until we have and can measure a representative workload, only intuition can answer that. GLOSSARY zproject -- a kernel project using compression (zram, zcache, zswap) zpage -- a compressed sequence of PAGE_SIZE bytes zsize -- the number of bytes in a compressed page pageframe -- the term page is widely used both to describe either (1) PAGE_SIZE bytes of data, or (2) a physical RAM area with size=PAGE_SIZE which is PAGE_SIZE-aligned, as represented in the kernel
Re: [RFC PATCH v2 1/2] mm: tuning hardcoded reserved memory
On 02/28/2013 11:48 AM, Andrew Shewmaker wrote: On Thu, Feb 28, 2013 at 02:12:00PM -0800, Andrew Morton wrote: On Wed, 27 Feb 2013 15:56:30 -0500 Andrew Shewmaker ags...@gmail.com wrote: The following patches are against the mmtom git tree as of February 27th. The first patch only affects OVERCOMMIT_NEVER mode, entirely removing the 3% reserve for other user processes. The second patch affects both OVERCOMMIT_GUESS and OVERCOMMIT_NEVER modes, replacing the hardcoded 3% reserve for the root user with a tunable knob. Gee, it's been years since anyone thought about the overcommit code. Documentation/vm/overcommit-accounting says that OVERCOMMIT_ALWAYS is Appropriate for some scientific applications, but doesn't say why. You're running a scientific cluster but you're using OVERCOMMIT_NEVER, I think? Is the documentation wrong? None of my scientists appeared to use sparse arrays as Alan described. My users would run jobs that appeared to initialize correctly. However, they wouldn't write to every page they malloced (and they wouldn't use calloc), so I saw jobs failing well into a computation once the simulation tried to access a page and the kernel couldn't give it to them. I think Roadrunner (http://en.wikipedia.org/wiki/IBM_Roadrunner) was the first cluster I put into OVERCOMMIT_NEVER mode. Jobs with infeasible memory requirements fail early and the OOM killer gets triggered much less often than in guess mode. More often than not the OOM killer seemed to kill the wrong thing causing a subtle brokenness. Disabling overcommit worked so well during the stabilization and early user phases that we did the same with other clusters. Do you mean OVERCOMMIT_NEVER is more suitable for scientific application than OVERCOMMIT_GUESS and OVERCOMMIT_ALWAYS? Or should depend on workload? Since your users would run jobs that wouldn't write to every page they malloced, so why OVERCOMMIT_GUESS is not more suitable for you? 
__vm_enough_memory reserves 3% of free pages with the default overcommit mode and 6% when overcommit is disabled. These hardcoded values have become less reasonable as memory sizes have grown. On scientific clusters, systems are generally dedicated to one user. Also, overcommit is sometimes disabled in order to prevent a long running job from suddenly failing days or weeks into a calculation. In this case, a user wishing to allocate as much memory as possible to one process may be prevented from using, for example, around 7GB out of 128GB. The effect is less, but still significant when a user starts a job with one process per core. I have repeatedly seen a set of processes requesting the same amount of memory fail because one of them could not allocate the amount of memory a user would expect to be able to allocate. ... --- a/mm/mmap.c +++ b/mm/mmap.c @@ -182,11 +182,6 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin) allowed -= allowed / 32; allowed += total_swap_pages; - /* Don't let a single process grow too big: - leave 3% of the size of this process for other processes */ - if (mm) - allowed -= mm-total_vm / 32; - if (percpu_counter_read_positive(vm_committed_as) allowed) return 0; So what might be the downside for this change? root can't log in, I assume. Have you actually tested for this scenario and observed the effects? If there *are* observable risks and/or to preserve back-compatibility, I guess we could create a fourth overcommit mode which provides the headroom which you desire. Also, should we be looking at removing root's 3% from OVERCOMMIT_GUESS as well? The downside of the first patch, which removes the other reserve (sorry about the confusing duplicated subject line), is that a user may not be able to kill their process, even if they have a shell prompt. When testing, I did sometimes get into spot where I attempted to execute kill, but got: bash: fork: Cannot allocate memory. 
Of course, a user can get in the same predicament with the current 3% reserve--they just have to start processes until 3% becomes negligible. With just the first patch, root still has a 3% reserve, so they can still log in. When I resubmit the second patch, adding a tunable rootuser_reserve_pages variable, I'll test both guess and never overcommit modes to see what minimum initial values allow root to login and kill a user's memory hogging process. This will be safer than the current behavior since root's reserve will never shrink to something useless in the case where a user has grabbed all available memory with many processes. The idea of two patches looks reasonable to me. As an estimate of a useful rootuser_reserve_pages, the rss+share size of Sorry for my silly, why you mean share size is not consist in rss size? sshd, bash, and top is about 16MB. Overcommit disabled mode would need closer to 360MB for the same processes. On a 128GB box 3% is 3.8GB, so the new
Re: [PATCH 2/7] ksm: treat unstable nid like in stable tree
Hi Hugh, On 02/23/2013 05:03 AM, Hugh Dickins wrote: On Fri, 22 Feb 2013, Ric Mason wrote: On 02/21/2013 04:20 PM, Hugh Dickins wrote: An inconsistency emerged in reviewing the NUMA node changes to KSM: when meeting a page from the wrong NUMA node in a stable tree, we say that it's okay for comparisons, but not as a leaf for merging; whereas when meeting a page from the wrong NUMA node in an unstable tree, we bail out immediately. IIUC - ksm page from the wrong NUMA node will be add to current node's stable tree Please forgive my late response. That should never happen (and when I was checking with a WARN_ON it did not happen). What can happen is that a node already in a stable tree has its page migrated away to another NUMA node. - normal page from the wrong NUMA node will be merged to current node's stable tree - where I miss here? I didn't see any special handling in function stable_tree_search for this case. nid = get_kpfn_nid(page_to_pfn(page)); root = root_stable_tree + nid; to choose the right tree for the page, and if (get_kpfn_nid(stable_node-kpfn) != NUMA(stable_node-nid)) { put_page(tree_page); goto replace; } to make sure that we don't latch on to a node whose page got migrated away. I think the ksm implementation for num awareness is buggy. For page migratyion stuff, new page is allocated from node *which page is migrated to*. - when meeting a page from the wrong NUMA node in an unstable tree get_kpfn_nid(page_to_pfn(page)) *==* page_to_nid(tree_page) How can say it's okay for comparisons, but not as a leaf for merging? - when meeting a page from the wrong NUMA node in an stable tree - meeting a normal page - meeting a page which is ksm page before migration get_kpfn_nid(stable_node-kpfn) != NUMA(stable_node-nid) can't capture them since stable_node is for tree page in current stable tree. They are always equal. 
- a normal page from the wrong NUMA node will be compared but not used as a leaf for merging after the patch I don't understand you there, but hope my remarks above resolve it. Hugh -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv5 2/8] zsmalloc: add documentation
On 02/25/2013 11:18 PM, Seth Jennings wrote: On 02/23/2013 06:37 PM, Ric Mason wrote: On 02/23/2013 05:02 AM, Seth Jennings wrote: On 02/21/2013 08:56 PM, Ric Mason wrote: On 02/21/2013 11:50 PM, Seth Jennings wrote: On 02/21/2013 02:49 AM, Ric Mason wrote: On 02/19/2013 03:16 AM, Seth Jennings wrote: On 02/16/2013 12:21 AM, Ric Mason wrote: On 02/14/2013 02:38 AM, Seth Jennings wrote: This patch adds a documentation file for zsmalloc at Documentation/vm/zsmalloc.txt Signed-off-by: Seth Jennings sjenn...@linux.vnet.ibm.com --- Documentation/vm/zsmalloc.txt | 68 + 1 file changed, 68 insertions(+) create mode 100644 Documentation/vm/zsmalloc.txt diff --git a/Documentation/vm/zsmalloc.txt b/Documentation/vm/zsmalloc.txt new file mode 100644 index 000..85aa617 --- /dev/null +++ b/Documentation/vm/zsmalloc.txt @@ -0,0 +1,68 @@ +zsmalloc Memory Allocator + +Overview + +zsmalloc is a new slab-based memory allocator +for storing compressed pages. It is designed for +low fragmentation and high allocation success rate on +large object, but <= PAGE_SIZE allocations. + +zsmalloc differs from the kernel slab allocator in two primary +ways to achieve these design goals. + +zsmalloc never requires high order page allocations to back +slabs, or "size classes" in zsmalloc terms. Instead it allows +multiple single-order pages to be stitched together into a +"zspage" which backs the slab. This allows for higher allocation +success rate under memory pressure. + +Also, zsmalloc allows objects to span page boundaries within the +zspage. This allows for lower fragmentation than could be had +with the kernel slab allocator for objects between PAGE_SIZE/2 +and PAGE_SIZE. With the kernel slab allocator, if a page compresses +to 60% of its original size, the memory savings gained through +compression is lost in fragmentation because another object of +the same size can't be stored in the leftover space. 
+ +This ability to span pages results in zsmalloc allocations not being +directly addressable by the user. The user is given a +non-dereferenceable handle in response to an allocation request. +That handle must be mapped, using zs_map_object(), which returns +a pointer to the mapped region that can be used. The mapping is +necessary since the object data may reside in two different +non-contiguous pages. Do you mean the reason a zsmalloc object must be mapped after allocation is that the object data may reside in two different non-contiguous pages? Yes, that is one reason for the mapping. The other reason (more of an added bonus) is below. + +For 32-bit systems, zsmalloc has the added benefit of being +able to back slabs with HIGHMEM pages, something not possible What's the meaning of "back slabs with HIGHMEM pages"? By HIGHMEM, I'm referring to the HIGHMEM memory zone on 32-bit systems with larger than 1GB (actually a little less) of RAM. The upper 3GB of the 4GB address space, depending on kernel build options, is not directly addressable by the kernel, but can be mapped into the kernel address space with functions like kmap() or kmap_atomic(). These pages can't be used by slab/slub because they are not continuously mapped into the kernel address space. However, since zsmalloc requires a mapping anyway to handle objects that span non-contiguous page boundaries, we do the kernel mapping as part of the process. So zspages, the conceptual slabs in zsmalloc backed by single-order pages, can include pages from the HIGHMEM zone as well. Thanks for your clarification; http://lwn.net/Articles/537422/ is your article about zswap on lwn. "Additionally, the kernel slab allocator does not allow objects that are less than a page in size to span a page boundary. This means that if an object is PAGE_SIZE/2 + 1 bytes in size, it effectively uses an entire page, resulting in ~50% waste. Hence there are *no kmalloc() cache sizes* between PAGE_SIZE/2 and PAGE_SIZE." Are you sure? 
It seems that kmalloc caches support big sizes; you can check include/linux/kmalloc_sizes.h Yes, kmalloc can allocate large objects > PAGE_SIZE, but there are no cache sizes _between_ PAGE_SIZE/2 and PAGE_SIZE. For example, on a system with 4k pages, there are no caches between kmalloc-2048 and kmalloc-4096. kmalloc objects > PAGE_SIZE/2 or > PAGE_SIZE should also be allocated from a slab cache, correct? Then how can an object be allocated without a slab cache which contains objects of this size? I have to admit, I didn't understand the question. An object is allocated from a slab cache, correct? There are two kinds of slab cache: one is for general purpose, e.g. the kmalloc slab caches; the other is for special purposes, e.g. mm_struct, task_struct. kmalloc objects > PAGE_SIZE/2 or > PAGE_SIZE should also be allocated from a slab cache, correct? Then why did you say that there are no caches between kmalloc-2048 and kmalloc-4096? Ok, now I get it. Yes, I guess I should have qualified here that there are no _kmalloc_ caches between PAGE_SIZE
Re: [PATCH] staging/zcache: Fix/improve zcache writeback code, tie to a config option
On 02/26/2013 01:29 AM, Dan Magenheimer wrote: From: Ric Mason [mailto:ric.mas...@gmail.com] Subject: Re: [PATCH] staging/zcache: Fix/improve zcache writeback code, tie to a config option On 02/07/2013 02:27 AM, Dan Magenheimer wrote: It was observed by Andrea Arcangeli in 2011 that zcache can get "full" and there must be some way for compressed swap pages to be (uncompressed and then) sent through to the backing swap disk. A prototype of this functionality, called "unuse", was added in 2012 as part of a major update to zcache (aka "zcache2"), but was left unfinished due to the unfortunate temporary fork of zcache. This earlier version of the code had an unresolved memory leak and was anyway dependent on not-yet-upstream frontswap and mm changes. The code was meanwhile adapted by Seth Jennings for similar functionality in zswap (which he calls "flush"). Seth also made some clever simplifications which are herein ported back to zcache. As a result of those simplifications, the frontswap changes are no longer necessary, but a slightly different (and simpler) set of mm changes are still required [1]. The memory leak is also fixed. Due to feedback from akpm in a zswap thread, this functionality in zcache has now been renamed from "unuse" to "writeback". Although this zcache writeback code now works, there are open questions as to how best to handle the policy that drives it. As a result, this patch also ties writeback to a new config option. And, since the code still depends on not-yet-upstreamed mm patches, to avoid build problems, the config option added by this patch temporarily depends on "BROKEN"; this config dependency can be removed in trees that contain the necessary mm patches. [1] https://lkml.org/lkml/2013/1/29/540/ https://lkml.org/lkml/2013/1/29/539/ This patch leads to the backend interacting with the core mm directly; shouldn't the core mm interact with the frontend instead of the backend? 
In addition, frontswap already has a shrink function; should we take advantage of it? Good questions! If you have ideas (or patches) that handle the interaction with the frontend instead of the backend, we can take a look at them. But for zcache (and zswap), the backend already interacts with the core mm, for example to allocate and free pageframes. The existing frontswap shrink function causes data pages to be sucked back from the backend. The data pages are put back in the swapcache and they aren't marked in any way, so it is possible the data page might soon (or immediately) be sent back to the backend. Then can frontswap shrink work well? This code is used for backends that can't "callback" the frontend, such as the Xen tmem backend and ramster. But I do agree that there might be a good use for the frontswap shrink function for zcache (and zswap). Any ideas? Dan
Re: [PATCH 0/7] ksm: responses to NUMA review
On 02/23/2013 04:38 AM, Hugh Dickins wrote: On Fri, 22 Feb 2013, Ric Mason wrote: On 02/21/2013 04:17 PM, Hugh Dickins wrote: Here's a second KSM series, based on mmotm 2013-02-19-17-20: partly in response to Mel's review feedback, partly fixes to issues that I found myself in doing more review and testing. None of the issues fixed are truly show-stoppers, though I would prefer them fixed sooner than later. Do you have any ideas about ksm supporting page cache and tmpfs? No. It's only been asked as a hypothetical question: I don't know of anyone actually needing it, and I wouldn't have time to do it myself. It would be significantly more invasive than just dealing with anonymous memory: with anon, we already have the infrastructure for read-only pages, but we don't at present have any notion of read-only pagecache. Just doing it in tmpfs? Well, yes, that might be easier: since v3.1's radix_tree rework, shmem/tmpfs mostly goes through its own interfaces to pagecache, so read-only pagecache, and hence KSM, might be easier to implement there than more generally. Ok, are there potential users who would take advantage of it? Hugh
Re: [PATCHv5 2/8] zsmalloc: add documentation
On 02/23/2013 05:02 AM, Seth Jennings wrote: On 02/21/2013 08:56 PM, Ric Mason wrote: On 02/21/2013 11:50 PM, Seth Jennings wrote: On 02/21/2013 02:49 AM, Ric Mason wrote: On 02/19/2013 03:16 AM, Seth Jennings wrote: On 02/16/2013 12:21 AM, Ric Mason wrote: On 02/14/2013 02:38 AM, Seth Jennings wrote: This patch adds a documentation file for zsmalloc at Documentation/vm/zsmalloc.txt Signed-off-by: Seth Jennings --- Documentation/vm/zsmalloc.txt | 68 + 1 file changed, 68 insertions(+) create mode 100644 Documentation/vm/zsmalloc.txt diff --git a/Documentation/vm/zsmalloc.txt b/Documentation/vm/zsmalloc.txt new file mode 100644 index 000..85aa617 --- /dev/null +++ b/Documentation/vm/zsmalloc.txt @@ -0,0 +1,68 @@ +zsmalloc Memory Allocator + +Overview + +zsmalloc is a new slab-based memory allocator +for storing compressed pages. It is designed for +low fragmentation and high allocation success rate on +large object, but <= PAGE_SIZE allocations. + +zsmalloc differs from the kernel slab allocator in two primary +ways to achieve these design goals. + +zsmalloc never requires high order page allocations to back +slabs, or "size classes" in zsmalloc terms. Instead it allows +multiple single-order pages to be stitched together into a +"zspage" which backs the slab. This allows for higher allocation +success rate under memory pressure. + +Also, zsmalloc allows objects to span page boundaries within the +zspage. This allows for lower fragmentation than could be had +with the kernel slab allocator for objects between PAGE_SIZE/2 +and PAGE_SIZE. With the kernel slab allocator, if a page compresses +to 60% of its original size, the memory savings gained through +compression is lost in fragmentation because another object of +the same size can't be stored in the leftover space. + +This ability to span pages results in zsmalloc allocations not being +directly addressable by the user. The user is given a +non-dereferenceable handle in response to an allocation request. 
+That handle must be mapped, using zs_map_object(), which returns +a pointer to the mapped region that can be used. The mapping is +necessary since the object data may reside in two different +non-contiguous pages. Do you mean the reason a zsmalloc object must be mapped after allocation is that the object data may reside in two different non-contiguous pages? Yes, that is one reason for the mapping. The other reason (more of an added bonus) is below. + +For 32-bit systems, zsmalloc has the added benefit of being +able to back slabs with HIGHMEM pages, something not possible What's the meaning of "back slabs with HIGHMEM pages"? By HIGHMEM, I'm referring to the HIGHMEM memory zone on 32-bit systems with larger than 1GB (actually a little less) of RAM. The upper 3GB of the 4GB address space, depending on kernel build options, is not directly addressable by the kernel, but can be mapped into the kernel address space with functions like kmap() or kmap_atomic(). These pages can't be used by slab/slub because they are not continuously mapped into the kernel address space. However, since zsmalloc requires a mapping anyway to handle objects that span non-contiguous page boundaries, we do the kernel mapping as part of the process. So zspages, the conceptual slabs in zsmalloc backed by single-order pages, can include pages from the HIGHMEM zone as well. Thanks for your clarification; http://lwn.net/Articles/537422/ is your article about zswap on lwn. "Additionally, the kernel slab allocator does not allow objects that are less than a page in size to span a page boundary. This means that if an object is PAGE_SIZE/2 + 1 bytes in size, it effectively uses an entire page, resulting in ~50% waste. Hence there are *no kmalloc() cache sizes* between PAGE_SIZE/2 and PAGE_SIZE." Are you sure? It seems that kmalloc caches support big sizes; you can check include/linux/kmalloc_sizes.h Yes, kmalloc can allocate large objects > PAGE_SIZE, but there are no cache sizes _between_ PAGE_SIZE/2 and PAGE_SIZE. 
For example, on a system with 4k pages, there are no caches between kmalloc-2048 and kmalloc-4096. kmalloc objects > PAGE_SIZE/2 or > PAGE_SIZE should also be allocated from a slab cache, correct? Then how can an object be allocated without a slab cache which contains objects of this size? I have to admit, I didn't understand the question. An object is allocated from a slab cache, correct? There are two kinds of slab cache: one is for general purpose, e.g. the kmalloc slab caches; the other is for special purposes, e.g. mm_struct, task_struct. kmalloc objects > PAGE_SIZE/2 or > PAGE_SIZE should also be allocated from a slab cache, correct? Then why did you say that there are no caches between kmalloc-2048 and kmalloc-4096? Thanks, Seth
Re: [PATCH 2/7] ksm: treat unstable nid like in stable tree
On 02/21/2013 04:20 PM, Hugh Dickins wrote: An inconsistency emerged in reviewing the NUMA node changes to KSM: when meeting a page from the wrong NUMA node in a stable tree, we say that it's okay for comparisons, but not as a leaf for merging; whereas when meeting a page from the wrong NUMA node in an unstable tree, we bail out immediately. IIUC - a ksm page from the wrong NUMA node will be added to the current node's stable tree - a normal page from the wrong NUMA node will be merged into the current node's stable tree <- where do I miss here? I didn't see any special handling in function stable_tree_search for this case. - a normal page from the wrong NUMA node will be compared but not used as a leaf for merging after the patch Now, it might be that a wrong NUMA node in an unstable tree is more likely to correlate with instability (different content, with rbnode now misplaced) than page migration; but even so, we are accustomed to instability in the unstable tree. Without strong evidence for which strategy is generally better, I'd rather be consistent with what's done in the stable tree: accept a page from the wrong NUMA node for comparison, but not as a leaf for merging. Signed-off-by: Hugh Dickins --- mm/ksm.c | 19 +-- 1 file changed, 9 insertions(+), 10 deletions(-)

--- mmotm.orig/mm/ksm.c 2013-02-20 22:28:23.584001392 -0800
+++ mmotm/mm/ksm.c 2013-02-20 22:28:27.288001480 -0800
@@ -1340,16 +1340,6 @@ struct rmap_item *unstable_tree_search_i
 			return NULL;
 		}
-		/*
-		 * If tree_page has been migrated to another NUMA node, it
-		 * will be flushed out and put into the right unstable tree
-		 * next time: only merge with it if merge_across_nodes.
-		 */
-		if (!ksm_merge_across_nodes && page_to_nid(tree_page) != nid) {
-			put_page(tree_page);
-			return NULL;
-		}
-
 		ret = memcmp_pages(page, tree_page);
 		parent = *new;
@@ -1359,6 +1349,15 @@ struct rmap_item *unstable_tree_search_i
 		} else if (ret > 0) {
 			put_page(tree_page);
 			new = &parent->rb_right;
+		} else if (!ksm_merge_across_nodes &&
+			   page_to_nid(tree_page) != nid) {
+			/*
+			 * If tree_page has been migrated to another NUMA node,
+			 * it will be flushed out and put in the right unstable
+			 * tree next time: only merge with it when across_nodes.
+			 */
+			put_page(tree_page);
+			return NULL;
 		} else {
 			*tree_pagep = tree_page;
 			return tree_rmap_item;

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majord...@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Re: [PATCH 1/7] ksm: add some comments
On 02/21/2013 04:19 PM, Hugh Dickins wrote: Added slightly more detail to the Documentation of merge_across_nodes, a few comments in areas indicated by review, and renamed get_ksm_page()'s argument from "locked" to "lock_it". No functional change. Signed-off-by: Hugh Dickins --- Documentation/vm/ksm.txt | 16 mm/ksm.c | 18 ++ 2 files changed, 26 insertions(+), 8 deletions(-) --- mmotm.orig/Documentation/vm/ksm.txt 2013-02-20 22:28:09.456001057 -0800 +++ mmotm/Documentation/vm/ksm.txt 2013-02-20 22:28:23.580001392 -0800 @@ -60,10 +60,18 @@ sleep_millisecs - how many milliseconds merge_across_nodes - specifies if pages from different numa nodes can be merged. When set to 0, ksm merges only pages which physically - reside in the memory area of same NUMA node. It brings - lower latency to access to shared page. Value can be - changed only when there is no ksm shared pages in system. - Default: 1 + reside in the memory area of same NUMA node. That brings + lower latency to access of shared pages. Systems with more + nodes, at significant NUMA distances, are likely to benefit + from the lower latency of setting 0. Smaller systems, which + need to minimize memory usage, are likely to benefit from + the greater sharing of setting 1 (default). You may wish to + compare how your system performs under each setting, before + deciding on which to use. merge_across_nodes setting can be + changed only when there are no ksm shared pages in system: + set run 2 to unmerge pages first, then to 1 after changing + merge_across_nodes, to remerge according to the new setting. What's the root reason that the merge_across_nodes setting can be changed only when there are no ksm shared pages in the system? Can they be unmerged and merged again during the ksmd scan? + Default: 1 (merging across nodes as in earlier releases) run - set 0 to stop ksmd from running but keep merged pages, set 1 to run ksmd e.g. 
"echo 1 > /sys/kernel/mm/ksm/run", --- mmotm.orig/mm/ksm.c 2013-02-20 22:28:09.456001057 -0800 +++ mmotm/mm/ksm.c 2013-02-20 22:28:23.584001392 -0800 @@ -87,6 +87,9 @@ *take 10 attempts to find a page in the unstable tree, once it is found, *it is secured in the stable tree. (When we scan a new page, we first *compare it against the stable tree, and then against the unstable tree.) + * + * If the merge_across_nodes tunable is unset, then KSM maintains multiple + * stable trees and multiple unstable trees: one of each for each NUMA node. */ /** @@ -524,7 +527,7 @@ static void remove_node_from_stable_tree * a page to put something that might look like our key in page->mapping. * is on its way to being freed; but it is an anomaly to bear in mind. */ -static struct page *get_ksm_page(struct stable_node *stable_node, bool locked) +static struct page *get_ksm_page(struct stable_node *stable_node, bool lock_it) { struct page *page; void *expected_mapping; @@ -573,7 +576,7 @@ again: goto stale; } - if (locked) { + if (lock_it) { lock_page(page); if (ACCESS_ONCE(page->mapping) != expected_mapping) { unlock_page(page); @@ -703,10 +706,17 @@ static int remove_stable_node(struct sta return 0; } - if (WARN_ON_ONCE(page_mapped(page))) + if (WARN_ON_ONCE(page_mapped(page))) { + /* +* This should not happen: but if it does, just refuse to let +* merge_across_nodes be switched - there is no need to panic. +*/ err = -EBUSY; - else { + } else { /* +* The stable node did not yet appear stale to get_ksm_page(), +* since that allows for an unmapped ksm page to be recognized +* right up until it is freed; but the node is safe to remove. * This page might be in a pagevec waiting to be freed, * or it might be PageSwapCache (perhaps under writeback), * or it might have been removed from swapcache a moment ago. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majord...@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Re: [PATCH] staging/zcache: Fix/improve zcache writeback code, tie to a config option
On 02/07/2013 02:27 AM, Dan Magenheimer wrote: It was observed by Andrea Arcangeli in 2011 that zcache can get "full" and there must be some way for compressed swap pages to be (uncompressed and then) sent through to the backing swap disk. A prototype of this functionality, called "unuse", was added in 2012 as part of a major update to zcache (aka "zcache2"), but was left unfinished due to the unfortunate temporary fork of zcache. This earlier version of the code had an unresolved memory leak and was anyway dependent on not-yet-upstream frontswap and mm changes. The code was meanwhile adapted by Seth Jennings for similar functionality in zswap (which he calls "flush"). Seth also made some clever simplifications which are herein ported back to zcache. As a result of those simplifications, the frontswap changes are no longer necessary, but a slightly different (and simpler) set of mm changes are still required [1]. The memory leak is also fixed. Due to feedback from akpm in a zswap thread, this functionality in zcache has now been renamed from "unuse" to "writeback". Although this zcache writeback code now works, there are open questions as how best to handle the policy that drives it. As a result, this patch also ties writeback to a new config option. And, since the code still depends on not-yet-upstreamed mm patches, to avoid build problems, the config option added by this patch temporarily depends on "BROKEN"; this config dependency can be removed in trees that contain the necessary mm patches. [1] https://lkml.org/lkml/2013/1/29/540/ https://lkml.org/lkml/2013/1/29/539/

shrink_zcache_memory:
	while (nr_evict-- > 0) {
		page = zcache_evict_eph_pageframe();
		if (page == NULL)
			break;
		zcache_free_page(page);
	}

zcache_evict_eph_pageframe
 -> zbud_evict_pageframe_lru
  -> zbud_evict_tmem
   -> tmem_flush_page
    -> zcache_pampd_free
     -> zcache_free_page  <- the zbudpage has already been freed here

Can the zcache_free_page() called in shrink_zcache_memory() then be treated as a double free? 
Signed-off-by: Dan Magenheimer --- drivers/staging/zcache/Kconfig | 17 ++ drivers/staging/zcache/zcache-main.c | 332 +++--- 2 files changed, 284 insertions(+), 65 deletions(-) diff --git a/drivers/staging/zcache/Kconfig b/drivers/staging/zcache/Kconfig index c1dbd04..7358270 100644 --- a/drivers/staging/zcache/Kconfig +++ b/drivers/staging/zcache/Kconfig @@ -24,3 +24,20 @@ config RAMSTER while minimizing total RAM across the cluster. RAMster, like zcache2, compresses swap pages into local RAM, but then remotifies the compressed pages to another node in the RAMster cluster. + +# Depends on not-yet-upstreamed mm patches to export end_swap_bio_write and +# __add_to_swap_cache, and implement __swap_writepage (which is swap_writepage +# without the frontswap call. When these are in-tree, the dependency on +# BROKEN can be removed +config ZCACHE_WRITEBACK + bool "Allow compressed swap pages to be writtenback to swap disk" + depends on ZCACHE=y && BROKEN + default n + help + Zcache caches compressed swap pages (and other data) in RAM which + often improves performance by avoiding I/O's due to swapping. + In some workloads with very long-lived large processes, it can + instead reduce performance. Writeback decompresses zcache-compressed + pages (in LRU order) when under memory pressure and writes them to + the backing swap disk to ameliorate this problem. Policy driving + writeback is still under development. 
diff --git a/drivers/staging/zcache/zcache-main.c b/drivers/staging/zcache/zcache-main.c index c1ac905..5bf14c3 100644 --- a/drivers/staging/zcache/zcache-main.c +++ b/drivers/staging/zcache/zcache-main.c @@ -22,6 +22,10 @@ #include #include #include +#include +#include +#include +#include #include #include @@ -55,6 +59,9 @@ static inline void frontswap_tmem_exclusive_gets(bool b) } #endif +/* enable (or fix code) when Seth's patches are accepted upstream */ +#define zcache_writeback_enabled 0 + static int zcache_enabled __read_mostly; static int disable_cleancache __read_mostly; static int disable_frontswap __read_mostly; @@ -181,6 +188,8 @@ static unsigned long zcache_last_active_anon_pageframes; static unsigned long zcache_last_inactive_anon_pageframes; static unsigned long zcache_eph_nonactive_puts_ignored; static unsigned long zcache_pers_nonactive_puts_ignored; +static unsigned long zcache_writtenback_pages; +static long zcache_outstanding_writeback_pages; #ifdef CONFIG_DEBUG_FS #include @@ -239,6 +248,9 @@ static int zcache_debugfs_init(void) zdfs64("eph_zbytes_max", S_IRUGO, root, _eph_zbytes_max); zdfs64("pers_zbytes", S_IRUGO, root, _pers_zbytes); zdfs64("pers_zbytes_max", S_IRUGO, root, _pers_zbytes_max); +
Re: [PATCH] staging/zcache: Fix/improve zcache writeback code, tie to a config option
On 02/07/2013 02:27 AM, Dan Magenheimer wrote: It was observed by Andrea Arcangeli in 2011 that zcache can get "full" and there must be some way for compressed swap pages to be (uncompressed and then) sent through to the backing swap disk. A prototype of this functionality, called "unuse", was added in 2012 as part of a major update to zcache (aka "zcache2"), but was left unfinished due to the unfortunate temporary fork of zcache. This earlier version of the code had an unresolved memory leak and was anyway dependent on not-yet-upstream frontswap and mm changes. The code was meanwhile adapted by Seth Jennings for similar functionality in zswap (which he calls "flush"). Seth also made some clever simplifications which are herein ported back to zcache. As a result of those simplifications, the frontswap changes are no longer necessary, but a slightly different (and simpler) set of mm changes are still required [1]. The memory leak is also fixed. Due to feedback from akpm in a zswap thread, this functionality in zcache has now been renamed from "unuse" to "writeback". Although this zcache writeback code now works, there are open questions as to how best to handle the policy that drives it. As a result, this patch also ties writeback to a new config option. And, since the code still depends on not-yet-upstreamed mm patches, to avoid build problems, the config option added by this patch temporarily depends on "BROKEN"; this config dependency can be removed in trees that contain the necessary mm patches. [1] https://lkml.org/lkml/2013/1/29/540/ https://lkml.org/lkml/2013/1/29/539/

This patch makes the backend interact with the core mm directly; shouldn't the core mm interact with the frontend rather than the backend? In addition, frontswap already has a shrink function; can we take advantage of it?
Signed-off-by: Dan Magenheimer --- drivers/staging/zcache/Kconfig | 17 ++ drivers/staging/zcache/zcache-main.c | 332 +++--- 2 files changed, 284 insertions(+), 65 deletions(-) diff --git a/drivers/staging/zcache/Kconfig b/drivers/staging/zcache/Kconfig index c1dbd04..7358270 100644 --- a/drivers/staging/zcache/Kconfig +++ b/drivers/staging/zcache/Kconfig @@ -24,3 +24,20 @@ config RAMSTER while minimizing total RAM across the cluster. RAMster, like zcache2, compresses swap pages into local RAM, but then remotifies the compressed pages to another node in the RAMster cluster. + +# Depends on not-yet-upstreamed mm patches to export end_swap_bio_write and +# __add_to_swap_cache, and implement __swap_writepage (which is swap_writepage +# without the frontswap call. When these are in-tree, the dependency on +# BROKEN can be removed +config ZCACHE_WRITEBACK + bool "Allow compressed swap pages to be writtenback to swap disk" + depends on ZCACHE=y && BROKEN + default n + help + Zcache caches compressed swap pages (and other data) in RAM which + often improves performance by avoiding I/O's due to swapping. + In some workloads with very long-lived large processes, it can + instead reduce performance. Writeback decompresses zcache-compressed + pages (in LRU order) when under memory pressure and writes them to + the backing swap disk to ameliorate this problem. Policy driving + writeback is still under development. 
diff --git a/drivers/staging/zcache/zcache-main.c b/drivers/staging/zcache/zcache-main.c index c1ac905..5bf14c3 100644 --- a/drivers/staging/zcache/zcache-main.c +++ b/drivers/staging/zcache/zcache-main.c @@ -22,6 +22,10 @@ #include #include #include +#include +#include +#include +#include #include #include @@ -55,6 +59,9 @@ static inline void frontswap_tmem_exclusive_gets(bool b) } #endif +/* enable (or fix code) when Seth's patches are accepted upstream */ +#define zcache_writeback_enabled 0 + static int zcache_enabled __read_mostly; static int disable_cleancache __read_mostly; static int disable_frontswap __read_mostly; @@ -181,6 +188,8 @@ static unsigned long zcache_last_active_anon_pageframes; static unsigned long zcache_last_inactive_anon_pageframes; static unsigned long zcache_eph_nonactive_puts_ignored; static unsigned long zcache_pers_nonactive_puts_ignored; +static unsigned long zcache_writtenback_pages; +static long zcache_outstanding_writeback_pages; #ifdef CONFIG_DEBUG_FS #include @@ -239,6 +248,9 @@ static int zcache_debugfs_init(void) zdfs64("eph_zbytes_max", S_IRUGO, root, _eph_zbytes_max); zdfs64("pers_zbytes", S_IRUGO, root, _pers_zbytes); zdfs64("pers_zbytes_max", S_IRUGO, root, _pers_zbytes_max); + zdfs("outstanding_writeback_pages", S_IRUGO, root, + _outstanding_writeback_pages); + zdfs("writtenback_pages", S_IRUGO, root, _writtenback_pages); return 0; } #undefzdebugfs @@ -285,6 +297,18 @@ void
Re: [PATCH 0/7] ksm: responses to NUMA review
On 02/21/2013 04:17 PM, Hugh Dickins wrote: Here's a second KSM series, based on mmotm 2013-02-19-17-20: partly in response to Mel's review feedback, partly fixes to issues that I found myself in doing more review and testing. None of the issues fixed are truly show-stoppers, though I would prefer them fixed sooner than later.

Do you have any plans for KSM to support page cache and tmpfs?

 1 ksm: add some comments
 2 ksm: treat unstable nid like in stable tree
 3 ksm: shrink 32-bit rmap_item back to 32 bytes
 4 mm,ksm: FOLL_MIGRATION do migration_entry_wait
 5 mm,ksm: swapoff might need to copy
 6 mm: cleanup "swapcache" in do_swap_page
 7 ksm: allocate roots when needed

 Documentation/vm/ksm.txt |  16 +++-
 include/linux/mm.h       |   1
 mm/ksm.c                 | 137 +++--
 mm/memory.c              |  38 +++---
 mm/swapfile.c            |  15 +++-
 5 files changed, 140 insertions(+), 67 deletions(-)

Thanks, Hugh

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majord...@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: em...@kvack.org
-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Question about swap_slot free and invalidate page
On 02/22/2013 05:42 AM, Dan Magenheimer wrote: From: Ric Mason [mailto:ric.mas...@gmail.com] Subject: Re: Question about swap_slot free and invalidate page On 02/19/2013 11:27 PM, Dan Magenheimer wrote: From: Ric Mason [mailto:ric.mas...@gmail.com] Hugh is right that handling the possibility of duplicates is part of the tmem ABI. If there is any possibility of duplicates, the ABI defines how a backend must handle them to avoid data coherency issues. The kernel implements an in-kernel API which implements the tmem ABI. If the frontend and backend can always agree that duplicate

Which ABI in zcache implements that?

https://oss.oracle.com/projects/tmem/dist/documentation/api/tmemspec-v001.pdf The in-kernel APIs are frontswap and cleancache. For more information about tmem, see http://lwn.net/Articles/454795/

But you mentioned that you have an in-kernel API which can handle duplicates. Do you mean zcache_cleancache/frontswap_put_page? I think they just overwrite instead of optionally flushing the page on the second (duplicate) put as mentioned in your tmem spec.

Maybe I am misunderstanding your question... The spec allows overwrite (and return success) OR flush the page (and return failure). Zcache does the latter (flush). The code that implements it is in tmem_put.

Thanks for pointing that out. Persistent pages can see duplicate puts since a swap cache page can be reused. Can ephemeral pages also see duplicate puts? If so, when can that happen?
Re: [PATCHv5 2/8] zsmalloc: add documentation
On 02/21/2013 11:50 PM, Seth Jennings wrote: On 02/21/2013 02:49 AM, Ric Mason wrote: On 02/19/2013 03:16 AM, Seth Jennings wrote: On 02/16/2013 12:21 AM, Ric Mason wrote: On 02/14/2013 02:38 AM, Seth Jennings wrote: This patch adds a documentation file for zsmalloc at Documentation/vm/zsmalloc.txt Signed-off-by: Seth Jennings --- Documentation/vm/zsmalloc.txt | 68 + 1 file changed, 68 insertions(+) create mode 100644 Documentation/vm/zsmalloc.txt diff --git a/Documentation/vm/zsmalloc.txt b/Documentation/vm/zsmalloc.txt new file mode 100644 index 000..85aa617 --- /dev/null +++ b/Documentation/vm/zsmalloc.txt @@ -0,0 +1,68 @@ +zsmalloc Memory Allocator + +Overview + +zsmalloc is a new slab-based memory allocator +for storing compressed pages. It is designed for +low fragmentation and high allocation success rate on +large objects, but <= PAGE_SIZE allocations. + +zsmalloc differs from the kernel slab allocator in two primary +ways to achieve these design goals. + +zsmalloc never requires high order page allocations to back +slabs, or "size classes" in zsmalloc terms. Instead it allows +multiple single-order pages to be stitched together into a +"zspage" which backs the slab. This allows for higher allocation +success rate under memory pressure. + +Also, zsmalloc allows objects to span page boundaries within the +zspage. This allows for lower fragmentation than could be had +with the kernel slab allocator for objects between PAGE_SIZE/2 +and PAGE_SIZE. With the kernel slab allocator, if a page compresses +to 60% of its original size, the memory savings gained through +compression is lost in fragmentation because another object of +the same size can't be stored in the leftover space. + +This ability to span pages results in zsmalloc allocations not being +directly addressable by the user. The user is given a +non-dereferencable handle in response to an allocation request. 
+That handle must be mapped, using zs_map_object(), which returns +a pointer to the mapped region that can be used. The mapping is +necessary since the object data may reside in two different +noncontiguous pages.

Do you mean that the reason a zsmalloc object must be mapped after allocation is that its data may reside in two different noncontiguous pages?

Yes, that is one reason for the mapping. The other reason (more of an added bonus) is below.

+ +For 32-bit systems, zsmalloc has the added benefit of being +able to back slabs with HIGHMEM pages, something not possible

What's the meaning of "back slabs with HIGHMEM pages"?

By HIGHMEM, I'm referring to the HIGHMEM memory zone on 32-bit systems with larger than 1GB (actually a little less) of RAM. The upper 3GB of the 4GB address space, depending on kernel build options, is not directly addressable by the kernel, but can be mapped into the kernel address space with functions like kmap() or kmap_atomic(). These pages can't be used by slab/slub because they are not continuously mapped into the kernel address space. However, since zsmalloc requires a mapping anyway to handle objects that span non-contiguous page boundaries, we do the kernel mapping as part of the process. So zspages, the conceptual slab in zsmalloc backed by single-order pages, can include pages from the HIGHMEM zone as well.

Thanks for your clarification. http://lwn.net/Articles/537422/ is your article about zswap on LWN: "Additionally, the kernel slab allocator does not allow objects that are less than a page in size to span a page boundary. This means that if an object is PAGE_SIZE/2 + 1 bytes in size, it effectively uses an entire page, resulting in ~50% waste. Hence there are *no kmalloc() cache sizes* between PAGE_SIZE/2 and PAGE_SIZE." Are you sure? It seems that kmalloc caches support big sizes; you can check include/linux/kmalloc_sizes.h.

Yes, kmalloc can allocate large objects > PAGE_SIZE, but there are no cache sizes _between_ PAGE_SIZE/2 and PAGE_SIZE. 
For example, on a system with 4k pages, there are no caches between kmalloc-2048 and kmalloc-4096.

Since slub caches can be merged, is that the root reason?

Seth
Re: [PATCHv5 2/8] zsmalloc: add documentation
On 02/21/2013 11:50 PM, Seth Jennings wrote: On 02/21/2013 02:49 AM, Ric Mason wrote: On 02/19/2013 03:16 AM, Seth Jennings wrote: On 02/16/2013 12:21 AM, Ric Mason wrote: On 02/14/2013 02:38 AM, Seth Jennings wrote: This patch adds a documentation file for zsmalloc at Documentation/vm/zsmalloc.txt Signed-off-by: Seth Jennings --- Documentation/vm/zsmalloc.txt | 68 + 1 file changed, 68 insertions(+) create mode 100644 Documentation/vm/zsmalloc.txt diff --git a/Documentation/vm/zsmalloc.txt b/Documentation/vm/zsmalloc.txt new file mode 100644 index 000..85aa617 --- /dev/null +++ b/Documentation/vm/zsmalloc.txt @@ -0,0 +1,68 @@ +zsmalloc Memory Allocator + +Overview + +zsmalloc is a new slab-based memory allocator +for storing compressed pages. It is designed for +low fragmentation and high allocation success rate on +large objects, but <= PAGE_SIZE allocations. + +zsmalloc differs from the kernel slab allocator in two primary +ways to achieve these design goals. + +zsmalloc never requires high order page allocations to back +slabs, or "size classes" in zsmalloc terms. Instead it allows +multiple single-order pages to be stitched together into a +"zspage" which backs the slab. This allows for higher allocation +success rate under memory pressure. + +Also, zsmalloc allows objects to span page boundaries within the +zspage. This allows for lower fragmentation than could be had +with the kernel slab allocator for objects between PAGE_SIZE/2 +and PAGE_SIZE. With the kernel slab allocator, if a page compresses +to 60% of its original size, the memory savings gained through +compression is lost in fragmentation because another object of +the same size can't be stored in the leftover space. + +This ability to span pages results in zsmalloc allocations not being +directly addressable by the user. The user is given a +non-dereferencable handle in response to an allocation request. 
+That handle must be mapped, using zs_map_object(), which returns +a pointer to the mapped region that can be used. The mapping is +necessary since the object data may reside in two different +noncontiguous pages.

Do you mean that the reason a zsmalloc object must be mapped after allocation is that its data may reside in two different noncontiguous pages?

Yes, that is one reason for the mapping. The other reason (more of an added bonus) is below.

+ +For 32-bit systems, zsmalloc has the added benefit of being +able to back slabs with HIGHMEM pages, something not possible

What's the meaning of "back slabs with HIGHMEM pages"?

By HIGHMEM, I'm referring to the HIGHMEM memory zone on 32-bit systems with larger than 1GB (actually a little less) of RAM. The upper 3GB of the 4GB address space, depending on kernel build options, is not directly addressable by the kernel, but can be mapped into the kernel address space with functions like kmap() or kmap_atomic(). These pages can't be used by slab/slub because they are not continuously mapped into the kernel address space. However, since zsmalloc requires a mapping anyway to handle objects that span non-contiguous page boundaries, we do the kernel mapping as part of the process. So zspages, the conceptual slab in zsmalloc backed by single-order pages, can include pages from the HIGHMEM zone as well.

Thanks for your clarification. http://lwn.net/Articles/537422/ is your article about zswap on LWN: "Additionally, the kernel slab allocator does not allow objects that are less than a page in size to span a page boundary. This means that if an object is PAGE_SIZE/2 + 1 bytes in size, it effectively uses an entire page, resulting in ~50% waste. Hence there are *no kmalloc() cache sizes* between PAGE_SIZE/2 and PAGE_SIZE." Are you sure? It seems that kmalloc caches support big sizes; you can check include/linux/kmalloc_sizes.h.

Yes, kmalloc can allocate large objects > PAGE_SIZE, but there are no cache sizes _between_ PAGE_SIZE/2 and PAGE_SIZE. 
For example, on a system with 4k pages, there are no caches between kmalloc-2048 and kmalloc-4096.

A kmalloc object > PAGE_SIZE/2 or > PAGE_SIZE should also be allocated from a slab cache, correct? Then how can an object be allocated without a slab cache that matches its size?

Seth
Re: [PATCHv5 2/8] zsmalloc: add documentation
On 02/21/2013 11:50 PM, Seth Jennings wrote: On 02/21/2013 02:49 AM, Ric Mason wrote: On 02/19/2013 03:16 AM, Seth Jennings wrote: On 02/16/2013 12:21 AM, Ric Mason wrote: On 02/14/2013 02:38 AM, Seth Jennings wrote: This patch adds a documentation file for zsmalloc at Documentation/vm/zsmalloc.txt Signed-off-by: Seth Jennings sjenn...@linux.vnet.ibm.com --- Documentation/vm/zsmalloc.txt | 68 + 1 file changed, 68 insertions(+) create mode 100644 Documentation/vm/zsmalloc.txt diff --git a/Documentation/vm/zsmalloc.txt b/Documentation/vm/zsmalloc.txt new file mode 100644 index 000..85aa617 --- /dev/null +++ b/Documentation/vm/zsmalloc.txt @@ -0,0 +1,68 @@ +zsmalloc Memory Allocator + +Overview + +zmalloc a new slab-based memory allocator, +zsmalloc, for storing compressed pages. It is designed for +low fragmentation and high allocation success rate on +large object, but = PAGE_SIZE allocations. + +zsmalloc differs from the kernel slab allocator in two primary +ways to achieve these design goals. + +zsmalloc never requires high order page allocations to back +slabs, or size classes in zsmalloc terms. Instead it allows +multiple single-order pages to be stitched together into a +zspage which backs the slab. This allows for higher allocation +success rate under memory pressure. + +Also, zsmalloc allows objects to span page boundaries within the +zspage. This allows for lower fragmentation than could be had +with the kernel slab allocator for objects between PAGE_SIZE/2 +and PAGE_SIZE. With the kernel slab allocator, if a page compresses +to 60% of it original size, the memory savings gained through +compression is lost in fragmentation because another object of +the same size can't be stored in the leftover space. + +This ability to span pages results in zsmalloc allocations not being +directly addressable by the user. The user is given an +non-dereferencable handle in response to an allocation request. 
+That handle must be mapped, using zs_map_object(), which returns +a pointer to the mapped region that can be used. The mapping is +necessary since the object data may reside in two different +noncontigious pages. Do you mean the reason of to use a zsmalloc object must map after malloc is object data maybe reside in two different nocontiguous pages? Yes, that is one reason for the mapping. The other reason (more of an added bonus) is below. + +For 32-bit systems, zsmalloc has the added benefit of being +able to back slabs with HIGHMEM pages, something not possible What's the meaning of back slabs with HIGHMEM pages? By HIGHMEM, I'm referring to the HIGHMEM memory zone on 32-bit systems with larger that 1GB (actually a little less) of RAM. The upper 3GB of the 4GB address space, depending on kernel build options, is not directly addressable by the kernel, but can be mapped into the kernel address space with functions like kmap() or kmap_atomic(). These pages can't be used by slab/slub because they are not continuously mapped into the kernel address space. However, since zsmalloc requires a mapping anyway to handle objects that span non-contiguous page boundaries, we do the kernel mapping as part of the process. So zspages, the conceptual slab in zsmalloc backed by single-order pages can include pages from the HIGHMEM zone as well. Thanks for your clarify, http://lwn.net/Articles/537422/, your article about zswap in lwn. Additionally, the kernel slab allocator does not allow objects that are less than a page in size to span a page boundary. This means that if an object is PAGE_SIZE/2 + 1 bytes in size, it effectively use an entire page, resulting in ~50% waste. Hense there are *no kmalloc() cache size* between PAGE_SIZE/2 and PAGE_SIZE. Are your sure? It seems that kmalloc cache support big size, your can check in include/linux/kmalloc_sizes.h Yes, kmalloc can allocate large objects PAGE_SIZE, but there are no cache sizes _between_ PAGE_SIZE/2 and PAGE_SIZE. 
For example, on a system with 4k pages, there are no caches between kmalloc-2048 and kmalloc-4096. A kmalloc object of PAGE_SIZE/2 or PAGE_SIZE is still allocated from a slab cache, correct? Then how can an object be allocated without a slab cache that holds objects of that size? Seth -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majord...@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: em...@kvack.org -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv5 2/8] zsmalloc: add documentation
On 02/21/2013 11:50 PM, Seth Jennings wrote: On 02/21/2013 02:49 AM, Ric Mason wrote: [...] Yes, kmalloc can allocate large objects > PAGE_SIZE, but there are no cache sizes _between_ PAGE_SIZE/2 and PAGE_SIZE. 
For example, on a system with 4k pages, there are no caches between kmalloc-2048 and kmalloc-4096. Since slub caches can be merged, is that the root reason? Seth
Re: Question about swap_slot free and invalidate page
On 02/22/2013 05:42 AM, Dan Magenheimer wrote: From: Ric Mason [mailto:ric.mas...@gmail.com] Subject: Re: Question about swap_slot free and invalidate page On 02/19/2013 11:27 PM, Dan Magenheimer wrote: From: Ric Mason [mailto:ric.mas...@gmail.com] Hugh is right that handling the possibility of duplicates is part of the tmem ABI. If there is any possibility of duplicates, the ABI defines how a backend must handle them to avoid data coherency issues. The kernel implements an in-kernel API which implements the tmem ABI. If the frontend and backend can always agree that duplicate Which ABI in zcache implements that? https://oss.oracle.com/projects/tmem/dist/documentation/api/tmemspec-v001.pdf The in-kernel APIs are frontswap and cleancache. For more information about tmem, see http://lwn.net/Articles/454795/ But you mentioned that you have an in-kernel API which can handle duplicates. Do you mean zcache_cleancache/frontswap_put_page? I think they just overwrite instead of optionally flushing the page on the second (duplicate) put as described in your tmem spec. Maybe I am misunderstanding your question... The spec allows overwrite (and return success) OR flush the page (and return failure). Zcache does the latter (flush). The code that implements it is in tmem_put. Thanks for pointing that out. Pers pages can see duplicate puts since a swap cache page can be reused. Can eph pages also see duplicate puts? If yes, when can that happen?
Re: [PATCH 0/7] ksm: responses to NUMA review
On 02/21/2013 04:17 PM, Hugh Dickins wrote: Here's a second KSM series, based on mmotm 2013-02-19-17-20: partly in response to Mel's review feedback, partly fixes to issues that I found myself in doing more review and testing. None of the issues fixed are truly show-stoppers, though I would prefer them fixed sooner than later. Do you have any plans for ksm to support page cache and tmpfs? 1 ksm: add some comments 2 ksm: treat unstable nid like in stable tree 3 ksm: shrink 32-bit rmap_item back to 32 bytes 4 mm,ksm: FOLL_MIGRATION do migration_entry_wait 5 mm,ksm: swapoff might need to copy 6 mm: cleanup swapcache in do_swap_page 7 ksm: allocate roots when needed Documentation/vm/ksm.txt | 16 +++- include/linux/mm.h | 1 mm/ksm.c | 137 +++-- mm/memory.c | 38 +++--- mm/swapfile.c | 15 +++- 5 files changed, 140 insertions(+), 67 deletions(-) Thanks, Hugh
Re: [PATCH] staging/zcache: Fix/improve zcache writeback code, tie to a config option
On 02/07/2013 02:27 AM, Dan Magenheimer wrote: It was observed by Andrea Arcangeli in 2011 that zcache can get full and there must be some way for compressed swap pages to be (uncompressed and then) sent through to the backing swap disk. A prototype of this functionality, called unuse, was added in 2012 as part of a major update to zcache (aka zcache2), but was left unfinished due to the unfortunate temporary fork of zcache. This earlier version of the code had an unresolved memory leak and was anyway dependent on not-yet-upstream frontswap and mm changes. The code was meanwhile adapted by Seth Jennings for similar functionality in zswap (which he calls flush). Seth also made some clever simplifications which are herein ported back to zcache. As a result of those simplifications, the frontswap changes are no longer necessary, but a slightly different (and simpler) set of mm changes are still required [1]. The memory leak is also fixed. Due to feedback from akpm in a zswap thread, this functionality in zcache has now been renamed from unuse to writeback. Although this zcache writeback code now works, there are open questions as to how best to handle the policy that drives it. As a result, this patch also ties writeback to a new config option. And, since the code still depends on not-yet-upstreamed mm patches, to avoid build problems, the config option added by this patch temporarily depends on BROKEN; this config dependency can be removed in trees that contain the necessary mm patches. [1] https://lkml.org/lkml/2013/1/29/540/ https://lkml.org/lkml/2013/1/29/539/ This patch has the backend interact with core mm directly; shouldn't core mm interact with the frontend rather than the backend? In addition, frontswap already has a shrink function; should we take advantage of it? 
Signed-off-by: Dan Magenheimer dan.magenhei...@oracle.com --- drivers/staging/zcache/Kconfig | 17 ++ drivers/staging/zcache/zcache-main.c | 332 +++--- 2 files changed, 284 insertions(+), 65 deletions(-) diff --git a/drivers/staging/zcache/Kconfig b/drivers/staging/zcache/Kconfig index c1dbd04..7358270 100644 --- a/drivers/staging/zcache/Kconfig +++ b/drivers/staging/zcache/Kconfig @@ -24,3 +24,20 @@ config RAMSTER while minimizing total RAM across the cluster. RAMster, like zcache2, compresses swap pages into local RAM, but then remotifies the compressed pages to another node in the RAMster cluster. + +# Depends on not-yet-upstreamed mm patches to export end_swap_bio_write and +# __add_to_swap_cache, and implement __swap_writepage (which is swap_writepage +# without the frontswap call). When these are in-tree, the dependency on +# BROKEN can be removed +config ZCACHE_WRITEBACK + bool "Allow compressed swap pages to be writtenback to swap disk" + depends on ZCACHE=y && BROKEN + default n + help + Zcache caches compressed swap pages (and other data) in RAM which + often improves performance by avoiding I/O's due to swapping. + In some workloads with very long-lived large processes, it can + instead reduce performance. Writeback decompresses zcache-compressed + pages (in LRU order) when under memory pressure and writes them to + the backing swap disk to ameliorate this problem. Policy driving + writeback is still under development. 
diff --git a/drivers/staging/zcache/zcache-main.c b/drivers/staging/zcache/zcache-main.c index c1ac905..5bf14c3 100644 --- a/drivers/staging/zcache/zcache-main.c +++ b/drivers/staging/zcache/zcache-main.c @@ -22,6 +22,10 @@ #include <linux/atomic.h> #include <linux/math64.h> #include <linux/crypto.h> +#include <linux/swap.h> +#include <linux/swapops.h> +#include <linux/pagemap.h> +#include <linux/writeback.h> #include <linux/cleancache.h> #include <linux/frontswap.h> @@ -55,6 +59,9 @@ static inline void frontswap_tmem_exclusive_gets(bool b) } #endif +/* enable (or fix code) when Seth's patches are accepted upstream */ +#define zcache_writeback_enabled 0 + static int zcache_enabled __read_mostly; static int disable_cleancache __read_mostly; static int disable_frontswap __read_mostly; @@ -181,6 +188,8 @@ static unsigned long zcache_last_active_anon_pageframes; static unsigned long zcache_last_inactive_anon_pageframes; static unsigned long zcache_eph_nonactive_puts_ignored; static unsigned long zcache_pers_nonactive_puts_ignored; +static unsigned long zcache_writtenback_pages; +static long zcache_outstanding_writeback_pages; #ifdef CONFIG_DEBUG_FS #include <linux/debugfs.h> @@ -239,6 +248,9 @@ static int zcache_debugfs_init(void) zdfs64("eph_zbytes_max", S_IRUGO, root, &zcache_eph_zbytes_max); zdfs64("pers_zbytes", S_IRUGO, root, &zcache_pers_zbytes); zdfs64("pers_zbytes_max", S_IRUGO, root, &zcache_pers_zbytes_max); + zdfs("outstanding_writeback_pages", S_IRUGO, root, +
Re: [PATCH] staging/zcache: Fix/improve zcache writeback code, tie to a config option
On 02/07/2013 02:27 AM, Dan Magenheimer wrote: [...] shrink_zcache_memory: while (nr_evict-- > 0) { page = zcache_evict_eph_pageframe(); if (page == NULL) break; zcache_free_page(page); } zcache_evict_eph_pageframe -> zbud_evict_pageframe_lru -> zbud_evict_tmem -> tmem_flush_page -> zcache_pampd_free -> zcache_free_page <- the zbudpage has already been freed here. Can the zcache_free_page() called in shrink_zcache_memory() then be treated as a double free? 
Signed-off-by: Dan Magenheimer dan.magenhei...@oracle.com [...]