Re: [PATCH 2/3] mm: Ensure that mark_page_accessed moves pages to the active list

2013-05-01 Thread Ric Mason

On 05/01/2013 04:06 PM, Mel Gorman wrote:

On Wed, May 01, 2013 at 01:41:34PM +0800, Sam Ben wrote:

Hi Mel,
On 04/30/2013 12:31 AM, Mel Gorman wrote:

If a page is on a pagevec then it is !PageLRU and mark_page_accessed()
may fail to move a page to the active list as expected. Now that the
LRU is selected at LRU drain time, mark pages PageActive if they are
on a pagevec so it gets moved to the correct list at LRU drain time.
Using a debugging patch it was found that for a simple git checkout
based workload that pages were never added to the active file list in

Could you show us the details of your workload?


The workload is git checkouts of a fixed number of commits for the


Is there a script which you used?


kernel git tree. It starts with a warm-up run that is not timed and then
records the time for a number of iterations.


How do you record the time for a number of iterations? Does "iteration" here
mean an LRU scan?
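For reference, below is a toy, userspace-only model of the pagevec behaviour
described in the changelog quoted at the top of this thread. It is not
mm/swap.c; the structures, the flag handling and the drain function are
simplified stand-ins for illustration only.

#include <stdbool.h>
#include <stdio.h>

struct page { int id; bool lru, active; };

#define PAGEVEC_SIZE 14
static struct page *pagevec[PAGEVEC_SIZE];
static int pvec_count;

/* simplified mark_page_accessed(): the interesting case is !PageLRU */
static void mark_page_accessed(struct page *page)
{
        if (page->lru) {
                /* already on an LRU list: move to the active list directly */
                page->active = true;
        } else {
                /* still on a pagevec: only set PageActive; the drain below
                 * will file the page on the correct list later */
                page->active = true;
        }
}

/* simplified lru_add_drain(): file each pagevec page on an LRU list */
static void lru_add_drain(void)
{
        for (int i = 0; i < pvec_count; i++) {
                struct page *page = pagevec[i];

                page->lru = true;
                printf("page %d -> %s list\n", page->id,
                       page->active ? "active" : "inactive");
        }
        pvec_count = 0;
}

int main(void)
{
        struct page p = { .id = 1 };

        pagevec[pvec_count++] = &p;     /* freshly added, not yet PageLRU */
        mark_page_accessed(&p);         /* previously lost for such pages */
        lru_add_drain();                /* now lands on the active list   */
        return 0;
}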








Re: [PATCHv6 0/8] zswap: compressed swap caching

2013-05-01 Thread Ric Mason

Hi Seth,
On 02/22/2013 02:25 AM, Seth Jennings wrote:

On 02/21/2013 09:50 AM, Dan Magenheimer wrote:

From: Seth Jennings [mailto:sjenn...@linux.vnet.ibm.com]
Subject: [PATCHv6 0/8] zswap: compressed swap caching

Changelog:

v6:
* fix improper freeing of rbtree (Cody)

Cody's bug fix reminded me of a rather fundamental question:

Why does zswap use a rbtree instead of a radix tree?

Intuitively, I'd expect that pgoff_t values would
have a relatively high level of locality AND at any one time
the set of stored pgoff_t values would be relatively non-sparse.
This would argue that a radix tree would result in fewer nodes
touched on average for lookup/insert/remove.

I considered using a radix tree, but I don't think there is a compelling
reason to choose a radix tree over a red-black tree in this case
(explanation below).

 From a runtime standpoint, a radix tree might be faster.  The swap
offsets will be largely in linearly bunched groups over the indexed
range.  However, there are also memory constraints to consider in this
particular situation.

Using a radix tree could result in intermediate radix_tree_node
allocations in the store (insert) path in addition to the zswap_entry
allocation.  Since we are under memory pressure, using the red-black


Then in which case is a radix tree preferable, and in which case is a red-black
tree better?



tree, whose metadata is included in the struct zswap_entry, reduces the
number of opportunities to fail.

On my system, the radix_tree_node structure is 568 bytes.  The
radix_tree_node cache requires 4 pages per slab, an order-2 page
allocation.  Growing that cache will be difficult under the pressure.

In my mind, the cost of even a single node allocation failure resulting in
an additional page swapped to disk will more than wipe out any possible
performance advantage a radix tree might have.

Thanks,
Seth
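For reference, a minimal compile-only sketch of the point above about
embedded index metadata. The field names are assumptions for illustration,
not the exact zswap layout.

#include <stddef.h>
#include <stdio.h>

/* stand-in for the kernel's struct rb_node */
struct rb_node_model {
        struct rb_node_model *left, *right, *parent;
};

/* one allocation carries both the payload metadata and the index node */
struct zswap_entry_model {
        struct rb_node_model rbnode;    /* red-black tree linkage, embedded  */
        unsigned long offset;           /* swap offset, used as the tree key */
        unsigned long handle;           /* zsmalloc handle of the zpage      */
        size_t length;                  /* compressed length                 */
};

int main(void)
{
        /* inserting an entry needs exactly one allocation (the entry itself),
         * whereas a radix tree insert may also need radix_tree_node(s) */
        printf("sizeof(struct zswap_entry_model) = %zu\n",
               sizeof(struct zswap_entry_model));
        return 0;
}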



Re: [PATCH] mm: swap: Mark swap pages writeback before queueing for direct IO

2013-05-01 Thread Ric Mason

Hi Mel,
On 04/25/2013 02:57 AM, Mel Gorman wrote:

As pointed out by Andrew Morton, the swap-over-NFS writeback is not setting
PageWriteback before it is queued for direct IO. While swap pages do not


Before commit 62c230bc1 ("mm: add support for a filesystem to
activate swap files and use direct_IO for writing swap pages"), were swap
pages first written to the page cache and then written back?



participate in BDI or process dirty accounting and the IO is synchronous,
the writeback bit is still required and not setting it in this case was
an oversight.  swapoff depends on the page writeback to synchronise all
pending writes on a swap page before it is reused. Swapcache freeing and
reuse depend on checking the PageWriteback under lock to ensure the page
is safe to reuse.

Direct IO handlers and the direct IO handler for NFS do not deal with
PageWriteback as they are synchronous writes. In the case of NFS, it
schedules pages (or a page in the case of swap) for IO and then waits
synchronously for IO to complete in nfs_direct_write(). It is recognised
that this is a slowdown from normal swap handling which is asynchronous
and uses a completion handler. Shoving PageWriteback handling down into
direct IO handlers looks like a bad fit to handle the swap case although
it may have to be dealt with some day if swap is converted to use direct
IO in general and bmap is finally done away with. At that point it will
be necessary to refit asynchronous direct IO with completion handlers onto
the swap subsystem.

As swapcache currently depends on PageWriteback to protect against races,
this patch sets PageWriteback under the page lock before queueing it for
direct IO. It is cleared when the direct IO handler returns. IO errors
are treated similarly to the direct-to-bio case except PageError is not
set as in the case of swap-over-NFS, it is likely to be a transient error.

It was asked what prevents such a page being reclaimed in parallel.
With this patch applied, such a page will now be skipped (most of the time)
or blocked until the writeback completes.  Reclaim checks PageWriteback
under the page lock before calling try_to_free_swap and the page lock
should prevent the page being requeued for IO before it is freed.

This and Jerome's related patch should be considered for -stable as far
back as 3.6 when swap-over-NFS was introduced.

Signed-off-by: Mel Gorman 
---
  mm/page_io.c | 17 +++++++++++++++++
  1 file changed, 17 insertions(+)

diff --git a/mm/page_io.c b/mm/page_io.c
index 04ca00d..ec04247 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -214,6 +214,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
 		kiocb.ki_left = PAGE_SIZE;
 		kiocb.ki_nbytes = PAGE_SIZE;
 
+		set_page_writeback(page);
 		unlock_page(page);
 		ret = mapping->a_ops->direct_IO(KERNEL_WRITE,
 						&kiocb, &iov,
@@ -223,8 +224,24 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
 			count_vm_event(PSWPOUT);
 			ret = 0;
 		} else {
+			/*
+			 * In the case of swap-over-nfs, this can be a
+			 * temporary failure if the system has limited
+			 * memory for allocating transmit buffers.
+			 * Mark the page dirty and avoid
+			 * rotate_reclaimable_page but rate-limit the
+			 * messages but do not flag PageError like
+			 * the normal direct-to-bio case as it could
+			 * be temporary.
+			 */
 			set_page_dirty(page);
+			ClearPageReclaim(page);
+			if (printk_ratelimit()) {
+				pr_err("Write-error on dio swapfile (%Lu)\n",
+					(unsigned long long)page_file_offset(page));
+			}
 		}
+		end_page_writeback(page);
 		return ret;
 	}


Re: [PATCH 2/2] Make batch size for memory accounting configured according to size of memory

2013-04-30 Thread Ric Mason

Hi Tim,
On 04/30/2013 01:12 AM, Tim Chen wrote:

Currently the per cpu counter's batch size for memory accounting is
configured as twice the number of cpus in the system.  However,
for system with very large memory, it is more appropriate to make it
proportional to the memory size per cpu in the system.

For example, for a x86_64 system with 64 cpus and 128 GB of memory,
the batch size is only 2*64 pages (0.5 MB).  So any memory accounting
changes of more than 0.5MB will overflow the per cpu counter into
the global counter.  Instead, for the new scheme, the batch size
is configured to be 0.4% of the memory/cpu = 8MB (128 GB/64 /256),


Will a large batch size make the global counter more inaccurate?


which is more inline with the memory size.

Signed-off-by: Tim Chen 
---
  mm/mmap.c  | 13 ++++++++++++-
  mm/nommu.c | 13 ++++++++++++-
  2 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 0db0de1..082836e 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -89,6 +89,7 @@ int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;
  * other variables. It can be updated by several CPUs frequently.
  */
 struct percpu_counter vm_committed_as ____cacheline_aligned_in_smp;
+int vm_committed_batchsz ____cacheline_aligned_in_smp;
 
 /*
  * The global memory commitment made in the system can be a metric
@@ -3090,10 +3091,20 @@ void mm_drop_all_locks(struct mm_struct *mm)
 /*
  * initialise the VMA slab
  */
+static inline int mm_compute_batch(void)
+{
+	int nr = num_present_cpus();
+
+	/* batch size set to 0.4% of (total memory/#cpus) */
+	return (int) (totalram_pages/nr) / 256;
+}
+
 void __init mmap_init(void)
 {
 	int ret;
 
-	ret = percpu_counter_init(&vm_committed_as, 0);
+	vm_committed_batchsz = mm_compute_batch();
+	ret = percpu_counter_and_batch_init(&vm_committed_as, 0,
+					    &vm_committed_batchsz);
 	VM_BUG_ON(ret);
 }
diff --git a/mm/nommu.c b/mm/nommu.c
index 2f3ea74..a87a99c 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -59,6 +59,7 @@ unsigned long max_mapnr;
 unsigned long num_physpages;
 unsigned long highest_memmap_pfn;
 struct percpu_counter vm_committed_as;
+int vm_committed_batchsz;
 int sysctl_overcommit_memory = OVERCOMMIT_GUESS; /* heuristic overcommit */
 int sysctl_overcommit_ratio = 50; /* default is 50% */
 int sysctl_max_map_count = DEFAULT_MAX_MAP_COUNT;
@@ -526,11 +527,21 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
 /*
  * initialise the VMA and region record slabs
  */
+static inline int mm_compute_batch(void)
+{
+	int nr = num_present_cpus();
+
+	/* batch size set to 0.4% of (total memory/#cpus) */
+	return (int) (totalram_pages/nr) / 256;
+}
+
 void __init mmap_init(void)
 {
 	int ret;
 
-	ret = percpu_counter_init(&vm_committed_as, 0);
+	vm_committed_batchsz = mm_compute_batch();
+	ret = percpu_counter_and_batch_init(&vm_committed_as, 0,
+					    &vm_committed_batchsz);
 	VM_BUG_ON(ret);
 	vm_region_jar = KMEM_CACHE(vm_region, SLAB_PANIC);
 }
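
As a back-of-envelope check of the numbers in the changelog, and of how much
the global counter can lag with a larger batch, here is a standalone
arithmetic sketch. It is illustrative only and not part of the patch.

#include <stdio.h>

int main(void)
{
        /* the 128 GB / 64 CPU example from the changelog, 4 KB pages */
        unsigned long long total_pages = (128ULL << 30) / 4096;
        int cpus = 64;

        long old_batch = 2 * cpus;                          /* current scheme  */
        long new_batch = (long)(total_pages / cpus) / 256;  /* 0.4% of mem/cpu */

        printf("old batch: %ld pages (%ld KB)\n", old_batch, old_batch * 4);
        printf("new batch: %ld pages (%ld MB)\n", new_batch,
               new_batch * 4 / 1024);
        /*
         * Each CPU may accumulate up to one batch locally before folding it
         * into the global count, so the global counter can deviate by about
         * batch * nr_cpus: a larger batch does trade accuracy for fewer
         * updates of the shared counter.
         */
        printf("worst-case global drift: %ld MB\n",
               new_batch * cpus * 4 / 1024);
        return 0;
}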


Re: [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority

2013-04-11 Thread Ric Mason

Ping Rik, I also want to know the answer. ;-)
On 04/11/2013 01:58 PM, Will Huck wrote:

Hi Rik,
On 03/22/2013 11:52 AM, Rik van Riel wrote:

On 03/21/2013 08:05 PM, Will Huck wrote:


One offline question: how should I understand this in balance_pgdat():
/*
  * Do some background aging of the anon list, to give
  * pages a chance to be referenced before reclaiming.
  */
age_active_anon(zone, &sc);


The anon lrus use a two-handed clock algorithm. New anonymous pages
start off on the active anon list. Older anonymous pages get moved
to the inactive anon list.


The downside of the page cache use-once replacement algorithm is
inter-reference distance, correct? Does it have any other downsides?
What is the downside of the two-handed clock algorithm for anonymous
pages?




If they get referenced before they reach the end of the inactive anon
list, they get moved back to the active list.

If we need to swap something out and find a non-referenced page at the
end of the inactive anon list, we will swap it out.

In order to make good pageout decisions, pages need to stay on the
inactive anon list for a longer time, so they have plenty of time to
get referenced, before the reclaim code looks at them.

To achieve that, we will move some active anon pages to the inactive
anon list even when we do not want to swap anything out - as long as
the inactive anon list is below its target size.

Does that make sense?
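
For reference, a toy userspace model of the two-handed clock behaviour
described above. The list sizes, names and the promotion rule are
simplifications for illustration, not the kernel's actual implementation.

#include <stdbool.h>
#include <stdio.h>

struct page { int id; bool referenced; };

#define NR_PAGES 8
static struct page active[NR_PAGES * 2], inactive[NR_PAGES];
static int nr_active = NR_PAGES, nr_inactive;

/* front hand: deactivate pages until the inactive list hits its target */
static void age_active_anon(int inactive_target)
{
        while (nr_inactive < inactive_target && nr_active > 0)
                inactive[nr_inactive++] = active[--nr_active];
}

/* back hand: scan the inactive list and decide what to swap out */
static void shrink_inactive_anon(void)
{
        for (int i = 0; i < nr_inactive; i++) {
                if (inactive[i].referenced) {
                        inactive[i].referenced = false;
                        active[nr_active++] = inactive[i];      /* promote */
                } else {
                        printf("swap out page %d\n", inactive[i].id);
                }
        }
        nr_inactive = 0;
}

int main(void)
{
        for (int i = 0; i < NR_PAGES; i++)
                active[i] = (struct page){ .id = i, .referenced = (i % 2 == 0) };

        age_active_anon(NR_PAGES / 2);  /* background aging, no reclaim yet */
        shrink_inactive_anon();         /* later: actual pageout decisions  */
        return 0;
}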





Re: [RFC Patch 0/2] mm: Add parameters to make kernel behavior at memory error on dirty cache selectable

2013-04-11 Thread Ric Mason

Hi Mitsuhiro,
On 04/11/2013 08:51 PM, Mitsuhiro Tanino wrote:

(2013/04/11 12:53), Simon Jeons wrote:

One question against mce instead of the patchset. ;-)

When is memory checked for being bad? Before memory access? Is there a process
that scans it periodically?

Hi Simon-san,

Yes, there is a process to scan memory periodically.

At Intel Nehalem-EX and CPUs after Nehalem-EX generation, MCA recovery
is supported. MCA recovery provides error detection and isolation
features to work together with OS.
One of the MCA Recovery features is Memory Scrubbing. It periodically
checks memory in the background of OS.


Is memory scrubbing a kernel thread? Where is the code for memory scrubbing?



If memory scrubbing finds an uncorrectable error in memory before the
OS accesses the affected bit, MCA recovery reports an SRAO error to the OS


It may not be able to find a memory error in time if it is sleeping when the
error occurs; can that case happen?



and the OS handles the SRAO error using the hwpoison function.

Regards,
Mitsuhiro Tanino



Re: [PATCH 08/10] mm: vmscan: Have kswapd shrink slab only once per priority

2013-04-11 Thread Ric Mason

Hi Mel,
On 04/11/2013 06:01 PM, Mel Gorman wrote:

On Wed, Apr 10, 2013 at 02:21:42PM +0900, Joonsoo Kim wrote:

@@ -2673,9 +2674,15 @@ static bool kswapd_shrink_zone(struct zone *zone,
 	sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));
 	shrink_zone(zone, sc);
 
-	reclaim_state->reclaimed_slab = 0;
-	nr_slab = shrink_slab(&shrink, sc->nr_scanned, lru_pages);
-	sc->nr_reclaimed += reclaim_state->reclaimed_slab;
+	/*
+	 * Slabs are shrunk for each zone once per priority or if the zone
+	 * being balanced is otherwise unreclaimable
+	 */
+	if (shrinking_slab || !zone_reclaimable(zone)) {
+		reclaim_state->reclaimed_slab = 0;
+		nr_slab = shrink_slab(&shrink, sc->nr_scanned, lru_pages);
+		sc->nr_reclaimed += reclaim_state->reclaimed_slab;
+	}
 
 	if (nr_slab == 0 && !zone_reclaimable(zone))
 		zone->all_unreclaimable = 1;

Why is shrink_slab() called here?

Preserves existing behaviour.

Yes, but with this patch the existing behaviour is changed, that is, we call
shrink_slab() once per priority. For now, there is no reason this function
is called here. How about separating it out and executing it outside of the
zone loop?


We are calling it fewer times but it's still receiving the same information
from sc->nr_scanned that it received before. With the change you are suggesting
it would be necessary to accumulate sc->nr_scanned for each zone shrunk
and then pass the sum to shrink_slab() once per priority. While this is not
necessarily wrong, there is little or no motivation to alter the shrinkers
in this manner in this series.


Why is the result not the same?
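
For reference, a toy model of why summing sc->nr_scanned over zones and
calling shrink_slab() once is not equivalent to per-zone calls. The scaling
formula is a simplification of how shrinkers derive pressure, and the
numbers are made up.

#include <stdio.h>

/* simplified: objects scanned ~ objects * nr_scanned / lru_pages */
static long slab_pressure(long nr_scanned, long lru_pages, long objects)
{
        return objects * nr_scanned / lru_pages;
}

int main(void)
{
        long objects = 100000;
        /* two zones with very different scan ratios */
        long scanned[2] = { 1000, 50 }, lru[2] = { 4000, 100000 };

        long per_zone = slab_pressure(scanned[0], lru[0], objects) +
                        slab_pressure(scanned[1], lru[1], objects);
        long once = slab_pressure(scanned[0] + scanned[1],
                                  lru[0] + lru[1], objects);

        printf("per-zone calls: %ld objects scanned\n", per_zone);
        printf("single call:    %ld objects scanned\n", once);
        return 0;
}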


Re: [PATCH] mm: page_alloc: Avoid marking zones full prematurely after zone_reclaim()

2013-04-09 Thread Ric Mason

Hi Michal,
On 04/09/2013 06:14 PM, Michal Hocko wrote:

On Tue 09-04-13 18:05:30, Simon Jeons wrote:
[...]

I try this in v3.9-rc5:
dd if=/dev/sda of=/dev/null bs=1MB
14813+0 records in
14812+0 records out
1481200 bytes (15 GB) copied, 105.988 s, 140 MB/s

free -m -s 1

             total       used       free     shared    buffers     cached
Mem:          7912       1181       6731          0        663        239
-/+ buffers/cache:         277       7634
Swap:         8011          0       8011

It seems that almost 15GB copied before I stop dd, but the used
pages which I monitor during dd always around 1200MB. Weird, why?


Sorry for waste your time, but the test result is weird, is it?

I am not sure which values you have been watching but you have to
realize that you are reading a _partition_ not a file and those pages
go into buffers rather than the page chache.


Interesting. ;-)

What's the difference between buffers and the page cache? Why don't the
buffers grow?
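
As an aside, one standalone way to watch the two values during a test like
the dd run above is to read Buffers and Cached from /proc/meminfo; block
device reads are accounted in the former, regular-file reads in the latter.
This snippet is illustrative only and not part of any patch.

#include <stdio.h>
#include <string.h>

int main(void)
{
        char line[256];
        FILE *f = fopen("/proc/meminfo", "r");

        if (!f)
                return 1;
        while (fgets(line, sizeof(line), f)) {
                /* print only the two fields of interest */
                if (!strncmp(line, "Buffers:", 8) || !strncmp(line, "Cached:", 7))
                        fputs(line, stdout);
        }
        fclose(f);
        return 0;
}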





Re: zsmalloc defrag (Was: [PATCH] mm: remove compressed copy from zram in-memory)

2013-04-09 Thread Ric Mason

Hi Minchan,
On 04/10/2013 08:50 AM, Minchan Kim wrote:

On Tue, Apr 09, 2013 at 01:25:45PM -0700, Dan Magenheimer wrote:

From: Minchan Kim [mailto:minc...@kernel.org]
Subject: Re: zsmalloc defrag (Was: [PATCH] mm: remove compressed copy from zram 
in-memory)

Hi Dan,

On Mon, Apr 08, 2013 at 09:32:38AM -0700, Dan Magenheimer wrote:

From: Minchan Kim [mailto:minc...@kernel.org]
Sent: Monday, April 08, 2013 12:01 AM
Subject: [PATCH] mm: remove compressed copy from zram in-memory

(patch removed)


Fragment ratio is almost same but memory consumption and compile time
is better. I am working to add defragment function of zsmalloc.

Hi Minchan --

I would be very interested in your design thoughts on
how you plan to add defragmentation for zsmalloc.  In

What I can say now about is only just a word "Compaction".
As you know, zsmalloc has a transparent handle so we can do whatever
under user. Of course, there is a tradeoff between performance
and memory efficiency. I'm biased to latter for embedded usecase.

Have you designed or implemented this yet?  I have a couple
of concerns:

Not yet implemented but just had a time to think about it, simply.
So surely, there are some obstacle so I want to uncase the code and
number after I make a prototype/test the performance.
Of course, if it has a severe problem, will drop it without wasting
many guys's time.


1) The handle is transparent to the "user", but it is still a form
of a "pointer" to a zpage.  Are you planning on walking zram's
tables and changing those pointers?  That may be OK for zram
but for more complex data structures than tables (as in zswap
and zcache) it may not be as easy, due to races, or as efficient
because you will have to walk potentially very large trees.

Rough concept is following as.

I'm considering for zsmalloc to return transparent fake handle
but we have to maintain it with real one.
It could be done in zsmalloc internal so there isn't any race we should 
consider.
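
For reference, a minimal userspace sketch of the "fake handle" indirection
idea described above. The table-based scheme and all names here are
assumptions for illustration, not the actual zsmalloc design.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_HANDLES 1024
#define OBJ_SIZE 16

/* fake (caller-visible) handle -> real location, owned by the allocator */
static void *real_location[MAX_HANDLES];
static unsigned long next_handle = 1;   /* 0 is reserved as "invalid" */

static unsigned long fake_alloc(void)
{
        unsigned long handle = next_handle++;

        real_location[handle] = malloc(OBJ_SIZE);
        return handle;          /* the caller never sees the raw pointer */
}

static void *fake_map(unsigned long handle)
{
        return real_location[handle];
}

/* allocator-internal compaction step: move the object, keep the handle */
static void compact_move(unsigned long handle)
{
        void *new_obj = malloc(OBJ_SIZE);

        memcpy(new_obj, real_location[handle], OBJ_SIZE);
        free(real_location[handle]);
        real_location[handle] = new_obj;
}

int main(void)
{
        unsigned long h = fake_alloc();

        strcpy(fake_map(h), "hello");
        printf("before: %p -> %s\n", fake_map(h), (char *)fake_map(h));
        compact_move(h);        /* the caller's handle is unchanged */
        printf("after:  %p -> %s\n", fake_map(h), (char *)fake_map(h));
        return 0;
}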



2) Compaction in the kernel is heavily dependent on page migration
and page migration is dependent on using flags in the struct page.
There's a lot of code in those two code modules and there
are going to be a lot of implementation differences between
compacting pages vs compacting zpages.

Compaction of kernel is never related to zsmalloc's one.


I'm also wondering if you will be implementing "variable length
zspages".  Without that, I'm not sure compaction will help
enough.  (And that is a good example of the difference between

Why do you think so?
variable lengh zspage could be further step to improve but it's not
only a solution to solve fragmentation.


the kernel page compaction design/code and zspage compaction.)

particular, I am wondering if your design will also
handle the requirements for zcache (especially for
cleancache pages) and perhaps also for ramster.

I don't know requirements for cleancache pages but compaction is
general as you know well so I expect you can get a benefit from it
if you are concern on memory efficiency but not sure it's valuable
to compact cleancache pages for getting more slot in RAM.
Sometime, just discarding would be much better, IMHO.

Zcache has page reclaim.  Zswap has zpage reclaim.  I am
concerned that these continue to work in the presence of
compaction.   With no reclaim at all, zram is a simpler use
case but if you implement compaction in a way that can't be
used by either zcache or zswap, then zsmalloc is essentially
forking.

Don't go too far. If it's really problem for zswap and zcache,
maybe, we could add it optionally.


In https://lkml.org/lkml/2013/3/27/501 I suggested it
would be good to work together on a common design, but
you didn't reply.  Are you thinking that zsmalloc

I saw the thread but explicit agreement is really matter?
I believe everybody want it although they didn't reply. :)

You can make the design/post it or prototyping/post it.
If there are some conflit with something in my brain,
I will be happy to feedback. :)

Anyway, I think my above statement "COMPACTION" would be enough to
express my current thought to avoid duplicated work and you can catch up.

I will get around to it after LSF/MM.


improvements should focus only on zram, in which case

Just focusing zsmalloc.

Right.  Again, I am asking if you are changing zsmalloc in
a way that helps zram but hurts zswap and makes it impossible
for zcache to ever use the improvements to zsmalloc.

As I said, I'm biased to memory efficiency rather than performace.
Of course, severe performance drop is disaster but small drop will
be acceptable for memory-efficiency concerning systems.


If so, that's fine, but please make it clear that is your goal.

Simple, help memory hungry system. :)


Which kinds of systems are memory hungry?




we may -- and possibly should -- end up with a different
allocator for frontswap-based/cleancache-based compression
in zcache (and possibly zswap)?
I'm just trying to determine if I 

Re: zsmalloc defrag (Was: [PATCH] mm: remove compressed copy from zram in-memory)

2013-04-09 Thread Ric Mason

Hi Dan,
On 04/10/2013 04:25 AM, Dan Magenheimer wrote:

From: Minchan Kim [mailto:minc...@kernel.org]
Subject: Re: zsmalloc defrag (Was: [PATCH] mm: remove compressed copy from zram 
in-memory)

Hi Dan,

On Mon, Apr 08, 2013 at 09:32:38AM -0700, Dan Magenheimer wrote:

From: Minchan Kim [mailto:minc...@kernel.org]
Sent: Monday, April 08, 2013 12:01 AM
Subject: [PATCH] mm: remove compressed copy from zram in-memory

(patch removed)


Fragment ratio is almost same but memory consumption and compile time
is better. I am working to add defragment function of zsmalloc.

Hi Minchan --

I would be very interested in your design thoughts on
how you plan to add defragmentation for zsmalloc.  In

What I can say now about is only just a word "Compaction".
As you know, zsmalloc has a transparent handle so we can do whatever
under user. Of course, there is a tradeoff between performance
and memory efficiency. I'm biased to latter for embedded usecase.

Have you designed or implemented this yet?  I have a couple
of concerns:

1) The handle is transparent to the "user", but it is still a form
of a "pointer" to a zpage.  Are you planning on walking zram's
tables and changing those pointers?  That may be OK for zram
but for more complex data structures than tables (as in zswap
and zcache) it may not be as easy, due to races, or as efficient
because you will have to walk potentially very large trees.
2) Compaction in the kernel is heavily dependent on page migration
and page migration is dependent on using flags in the struct page.


Which flag?


There's a lot of code in those two code modules and there
are going to be a lot of implementation differences between
compacting pages vs compacting zpages.

I'm also wondering if you will be implementing "variable length
zspages".  Without that, I'm not sure compaction will help
enough.  (And that is a good example of the difference between
the kernel page compaction design/code and zspage compaction.)


particular, I am wondering if your design will also
handle the requirements for zcache (especially for
cleancache pages) and perhaps also for ramster.

I don't know requirements for cleancache pages but compaction is
general as you know well so I expect you can get a benefit from it
if you are concern on memory efficiency but not sure it's valuable
to compact cleancache pages for getting more slot in RAM.
Sometime, just discarding would be much better, IMHO.

Zcache has page reclaim.  Zswap has zpage reclaim.  I am
concerned that these continue to work in the presence of
compaction.   With no reclaim at all, zram is a simpler use
case but if you implement compaction in a way that can't be
used by either zcache or zswap, then zsmalloc is essentially
forking.


I fail to understand "then zsmalloc is essentially forking"; could you
explain more?





In https://lkml.org/lkml/2013/3/27/501 I suggested it
would be good to work together on a common design, but
you didn't reply.  Are you thinking that zsmalloc

I saw the thread but explicit agreement is really matter?
I believe everybody want it although they didn't reply. :)

You can make the design/post it or prototyping/post it.
If there are some conflit with something in my brain,
I will be happy to feedback. :)

Anyway, I think my above statement "COMPACTION" would be enough to
express my current thought to avoid duplicated work and you can catch up.

I will get around to it after LSF/MM.


improvements should focus only on zram, in which case

Just focusing zsmalloc.

Right.  Again, I am asking if you are changing zsmalloc in
a way that helps zram but hurts zswap and makes it impossible
for zcache to ever use the improvements to zsmalloc.

If so, that's fine, but please make it clear that is your goal.


we may -- and possibly should -- end up with a different
allocator for frontswap-based/cleancache-based compression
in zcache (and possibly zswap)?
I'm just trying to determine if I should proceed separately
with my design (with Bob Liu, who expressed interest) or if
it would be beneficial to work together.

Just posting and if it affects zsmalloc/zram/zswap and goes the way
I don't want, I will involve the discussion because our product uses
zram heavily and consider zswap, too.

I really appreciate your enthusiastic collaboration model to find
optimal solution!

My goal is to have compression be an integral part of Linux
memory management.  It may be tied to a config option, but
the goal is that distros turn it on by default.  I don't think
zsmalloc meets that objective yet, but it may be fine for
your needs.  If so it would be good to understand exactly why
it doesn't meet the other zproject needs.


Re: [PATCH] mm: remove compressed copy from zram in-memory

2013-04-08 Thread Ric Mason

Hi Minchan,
On 04/09/2013 09:02 AM, Minchan Kim wrote:

Hi Andrew,

On Mon, Apr 08, 2013 at 02:17:10PM -0700, Andrew Morton wrote:

On Mon,  8 Apr 2013 15:01:02 +0900 Minchan Kim  wrote:


Swap subsystem does lazy swap slot free with expecting the page
would be swapped out again so we can avoid unnecessary write.

Is that correct?  How can it save a write?

Correct.

add_to_swap() makes the page dirty, and we must page it out only if the page
is dirty. If an anon page is already charged into the swapcache, we skip
writing the page out in shrink_page_list() and just remove the page from the
swapcache and free it via __remove_mapping().

I have received the same question multiple times, so it would be a good idea
to write it down somewhere in vmscan.c.
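
To illustrate that point, here is a rough sketch of the decision, as I read it (simplified and hypothetical, not the actual vmscan.c code; the real shrink_page_list() handles many more cases):

/*
 * Simplified sketch only: an anon page that already sits clean in the
 * swapcache needs no writeback, because the swap slot still holds its
 * data.  __remove_mapping() can then drop it from the swapcache and
 * free it directly.
 */
static bool can_free_without_writeback(struct page *page)
{
	/*
	 * add_to_swap() marks a newly added anon page dirty, so a clean
	 * swapcache page already has valid data in its swap slot.
	 */
	if (!PageSwapCache(page))
		return false;

	if (PageDirty(page))
		return false;	/* needs pageout to the swap device first */

	return true;		/* __remove_mapping() can drop and free it */
}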


But the problem with in-memory swap (e.g. zram) is that it consumes
memory space until the vm_swap_full() condition (i.e. half of the whole swap
device used) is met. That can be bad if we use multiple swap devices (a small
in-memory swap plus a big storage swap) or in-memory swap alone.

This patch makes the swap subsystem free the swap slot as soon as the
swap-read completes, and marks the swapcache page dirty so the page will be
written back out to the swap device to reclaim it.
It means we never lose it.

From my reading of the patch, that isn't how it works?  It changed
end_swap_bio_read() to call zram_slot_free_notify(), which appears to
free the underlying compressed page.  I have a feeling I'm hopelessly
confused.

You understand it totally right.
"Selecting swap slot" in my description was a total miss.
I need to rewrite the description.


Freeing the swap slot and freeing the compressed page are the same thing, aren't they?




--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -20,6 +20,7 @@
  #include <linux/buffer_head.h>
  #include <linux/writeback.h>
  #include <linux/frontswap.h>
+#include <linux/blkdev.h>
  #include <asm/pgtable.h>
  
  static struct bio *get_swap_bio(gfp_t gfp_flags,

@@ -81,8 +82,30 @@ void end_swap_bio_read(struct bio *bio, int err)
iminor(bio->bi_bdev->bd_inode),
(unsigned long long)bio->bi_sector);
} else {
+   /*
+* There is no reason to keep both uncompressed data and
+* compressed data in memory.
+*/
+   struct swap_info_struct *sis;
+
SetPageUptodate(page);
+   sis = page_swap_info(page);
+   if (sis->flags & SWP_BLKDEV) {
+   struct gendisk *disk = sis->bdev->bd_disk;
+   if (disk->fops->swap_slot_free_notify) {
+   swp_entry_t entry;
+   unsigned long offset;
+
+   entry.val = page_private(page);
+   offset = swp_offset(entry);
+
+   SetPageDirty(page);
+   disk->fops->swap_slot_free_notify(sis->bdev,
+   offset);
+   }
+   }
}
+
unlock_page(page);
bio_put(bio);

The new code is wasted space if CONFIG_BLOCK=n, yes?

CONFIG_SWAP is already dependent on CONFIG_BLOCK.


Also, what's up with the SWP_BLKDEV test?  zram doesn't support
SWP_FILE?  Why on earth not?

Putting swap_slot_free_notify() into block_device_operations seems
rather wrong.  It precludes zram-over-swapfiles for all time and means
that other subsystems cannot get notifications for swap slot freeing
for swapfile-backed swap.

Zram is just a pseudo block device, so anyone can format it with any FS
and swapon a file on it. In that case, they can't get the benefit of
swap_slot_free_notify. But I think it's not a severe problem, because
there is no reason to use file-backed swap on zram. If anyone wants to use
it, I'd like to know the reason. If it's reasonable, we have to rethink the
wheel, and that's another story, IMHO.
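
For reference, a rough sketch of what this hook amounts to on the zram side (hypothetical code, not quoted from the driver; field and helper names are assumptions):

/*
 * Hypothetical sketch of the zram block_device_operations hook: when the
 * swap layer reports that a slot is free, release the compressed object
 * (the zsmalloc handle) backing that swap offset.
 */
static void zram_slot_free_notify(struct block_device *bdev,
				  unsigned long index)
{
	struct zram *zram = bdev->bd_disk->private_data;

	down_write(&zram->lock);
	zram_free_page(zram, index);	/* drop the compressed copy */
	up_write(&zram->lock);
}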



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majord...@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: em...@kvack.org


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH part2 v6 0/3] staging: zcache: Support zero-filled pages more efficiently

2013-04-07 Thread Ric Mason

cc Bob
On 04/07/2013 05:03 PM, Wanpeng Li wrote:

On Wed, Apr 03, 2013 at 06:16:20PM +0800, Wanpeng Li wrote:

Changelog:
v5 -> v6:
  * shove variables in debug.c and in debug.h just have an extern, spotted by 
Konrad
  * update patch description, spotted by Konrad
v4 -> v5:
  * fix compile error, reported by Fengguang, Geert
  * add check for !is_ephemeral(pool), spotted by Bob
v3 -> v4:
  * handle duplication in page_is_zero_filled, spotted by Bob
  * fix zcache writeback in debugfs
  * fix pers_pageframes|_max isn't exported in debugfs
  * fix static variable defined in debug.h but used in multiple C files
  * rebase on Greg's staging-next
v2 -> v3:
  * increment/decrement zcache_[eph|pers]_zpages for zero-filled pages, spotted 
by Dan
  * replace "zero" or "zero page" by "zero_filled_page", spotted by Dan
v1 -> v2:
  * avoid changing tmem.[ch] entirely, spotted by Dan.
  * don't accumulate [eph|pers]pageframe and [eph|pers]zpages for
zero-filled pages, spotted by Dan
  * cleanup TODO list
  * add Dan Acked-by.


Hi Dan,

Some issues against Ramster:

- Ramster, which takes advantage of zcache, should also support zero-filled
   pages more efficiently, correct? It doesn't handle zero-filled pages well
   currently.
- Ramster's DebugFS counters are exported in /sys/kernel/mm/, but the
   zcache/frontswap/cleancache ones are all exported in /sys/kernel/debug/;
   should we unify them?
- Should ramster also move its DebugFS counters to a single file, like
   zcache does?

If you confirm these issues make sense to fix, I will start coding. ;-)

Regards,
Wanpeng Li


Motivation:

- Seth Jennings points out that compressing zero-filled pages with LZO (a
  lossless data compression algorithm) wastes memory and results in
  fragmentation.
  https://lkml.org/lkml/2012/8/14/347
- Dan Magenheimer added the "Support zero-filled pages more efficiently"
  feature to the zcache TODO list: https://lkml.org/lkml/2013/2/13/503

Design:

- On page store, capture zero-filled pages (evicted clean page cache pages and
  swap pages) but don't compress them; set the pampd, which stores the zpage
  address, to 0x2 (since 0x0 and 0x1 are already occupied) to mark the special
  zero-filled case, and take advantage of the tmem infrastructure to transform
  the handle tuple (pool id, object id, and an index) into a pampd. Compressing
  zero-filled pages twice will contribute to one accumulated
  zcache_[eph|pers]_pageframes count.
- On page load, traverse the tmem hierarchy to transform the handle tuple into
  a pampd, identify the zero-filled case by the pampd being equal to 0x2 when
  the filesystem reads file pages or a page needs to be swapped in, then refill
  the page with zeros and return.
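
A minimal sketch of the zero-filled check described above (illustrative only: page_is_zero_filled() is the helper named in the changelog, but this body and the sentinel macro name are assumptions, not the actual zcache code):

/* Sentinel pampd value for a zero-filled page; 0x0 and 0x1 are taken. */
#define ZERO_FILLED_PAMPD	((void *)2)

/* Illustrative helper: true if every word of the page is zero. */
static bool page_is_zero_filled(struct page *page)
{
	unsigned long *data = kmap_atomic(page);
	unsigned int i;
	bool zero = true;

	for (i = 0; i < PAGE_SIZE / sizeof(*data); i++) {
		if (data[i]) {
			zero = false;
			break;
		}
	}
	kunmap_atomic(data);
	return zero;
}

On store, a page that passes this check would record ZERO_FILLED_PAMPD instead of a compressed object; on load, seeing the sentinel means the page is simply cleared rather than decompressed.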

Test:

dd if=/dev/zero of=zerofile bs=1MB count=500
vmtouch -t zerofile
vmtouch -e zerofile

formula:
- fragmentation level = (zcache_[eph|pers]_pageframes * PAGE_SIZE - 
zcache_[eph|pers]_zbytes)
  * 100 / (zcache_[eph|pers]_pageframes * PAGE_SIZE)
- memory zcache occupy = zcache_[eph|pers]_zbytes

Result:

without zero-filled awareness:
- fragmentation level: 98%
- memory zcache occupy: 238MB
with zero-filled awareness:
- fragmentation level: 0%
- memory zcache occupy: 0MB

Wanpeng Li (3):
  staging: zcache: fix static variables defined in debug.h but used in
mutiple C files
  staging: zcache: introduce zero-filled page stat count
  staging: zcache: clean TODO list

drivers/staging/zcache/TODO  |3 +-
drivers/staging/zcache/debug.c   |   35 +++
drivers/staging/zcache/debug.h   |   79 -
drivers/staging/zcache/zcache-main.c |4 ++
4 files changed, 88 insertions(+), 33 deletions(-)

--
1.7.5.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majord...@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: em...@kvack.org


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCHv3, RFC 00/34] Transparent huge page cache

2013-04-06 Thread Ric Mason

Hi Kirill,
On 04/05/2013 07:59 PM, Kirill A. Shutemov wrote:

From: "Kirill A. Shutemov" 

Here's third RFC. Thanks everybody for feedback.


Could you answer my questions in your version two?



The patchset is pretty big already and I want to stop generating new
features to keep it reviewable. Next I'll concentrate on benchmarking and
tuning.

Therefore some features will be outside initial transparent huge page
cache implementation:
  - page collapsing;
  - migration;
  - tmpfs/shmem;

There are a few features which are not implemented yet and which could
potentially block upstreaming:

1. Currently we allocate a 2M page even if we create only a 1-byte file on
ramfs. I don't think that is a problem by itself. With anon THP pages we also
try to allocate huge pages whenever possible.
The problem is that ramfs pages are unevictable and we can't just split them
and push them to swap as with anon THP. We (at some point) have to have a
mechanism to split the last page of the file under memory pressure to reclaim
some memory.

2. We don't have knobs for disabling transparent huge page cache per-mount
or per-file. Should we have a mount option and fadvise flags as part of the
initial implementation?

Any thoughts?

The patchset is also on git:

git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git thp/pagecache

v3:
  - set RADIX_TREE_PRELOAD_NR to 512 only if we build with THP;
  - rewrite lru_add_page_tail() to address a few bugs;
  - memcg accounting;
  - represent file thp pages in meminfo and friends;
  - dump page order in filemap trace;
  - add missed flush_dcache_page() in zero_huge_user_segment;
  - random cleanups based on feedback.
v2:
  - mmap();
  - fix add_to_page_cache_locked() and delete_from_page_cache();
  - introduce mapping_can_have_hugepages();
  - call split_huge_page() only for head page in filemap_fault();
  - wait_split_huge_page(): serialize over i_mmap_mutex too;
  - lru_add_page_tail: avoid PageUnevictable on active/inactive lru lists;
  - fix off-by-one in zero_huge_user_segment();
  - THP_WRITE_ALLOC/THP_WRITE_FAILED counters;

Kirill A. Shutemov (34):
   mm: drop actor argument of do_generic_file_read()
   block: implement add_bdi_stat()
   mm: implement zero_huge_user_segment and friends
   radix-tree: implement preload for multiple contiguous elements
   memcg, thp: charge huge cache pages
   thp, mm: avoid PageUnevictable on active/inactive lru lists
   thp, mm: basic defines for transparent huge page cache
   thp, mm: introduce mapping_can_have_hugepages() predicate
   thp: represent file thp pages in meminfo and friends
   thp, mm: rewrite add_to_page_cache_locked() to support huge pages
   mm: trace filemap: dump page order
   thp, mm: rewrite delete_from_page_cache() to support huge pages
   thp, mm: trigger bug in replace_page_cache_page() on THP
   thp, mm: locking tail page is a bug
   thp, mm: handle tail pages in page_cache_get_speculative()
   thp, mm: add event counters for huge page alloc on write to a file
   thp, mm: implement grab_thp_write_begin()
   thp, mm: naive support of thp in generic read/write routines
   thp, libfs: initial support of thp in
 simple_read/write_begin/write_end
   thp: handle file pages in split_huge_page()
   thp: wait_split_huge_page(): serialize over i_mmap_mutex too
   thp, mm: truncate support for transparent huge page cache
   thp, mm: split huge page on mmap file page
   ramfs: enable transparent huge page cache
   x86-64, mm: proper alignment mappings with hugepages
   mm: add huge_fault() callback to vm_operations_struct
   thp: prepare zap_huge_pmd() to uncharge file pages
   thp: move maybe_pmd_mkwrite() out of mk_huge_pmd()
   thp, mm: basic huge_fault implementation for generic_file_vm_ops
   thp: extract fallback path from do_huge_pmd_anonymous_page() to a
 function
   thp: initial implementation of do_huge_linear_fault()
   thp: handle write-protect exception to file-backed huge pages
   thp: call __vma_adjust_trans_huge() for file-backed VMA
   thp: map file-backed huge pages on fault

  arch/x86/kernel/sys_x86_64.c   |   12 +-
  drivers/base/node.c|   10 +
  fs/libfs.c |   48 +++-
  fs/proc/meminfo.c  |6 +
  fs/ramfs/inode.c   |6 +-
  include/linux/backing-dev.h|   10 +
  include/linux/huge_mm.h|   36 ++-
  include/linux/mm.h |8 +
  include/linux/mmzone.h |1 +
  include/linux/pagemap.h|   33 ++-
  include/linux/radix-tree.h |   11 +
  include/linux/vm_event_item.h  |2 +
  include/trace/events/filemap.h |7 +-
  lib/radix-tree.c   |   33 ++-
  mm/filemap.c   |  298 -
  mm/huge_memory.c   |  474 +---
  mm/memcontrol.c|2 -
  mm/memory.c|   41 +++-
  mm/mmap.c  |3 +
  mm/page_alloc.c|7 +-
  mm/swap.c  |   20 +-
  mm/truncate.c   


Re: [PATCHv2, RFC 12/30] thp, mm: add event counters for huge page alloc on write to a file

2013-04-04 Thread Ric Mason

Hi Kirill,
On 03/26/2013 04:40 PM, Kirill A. Shutemov wrote:

Dave Hansen wrote:

On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:

--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -71,6 +71,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
THP_FAULT_FALLBACK,
THP_COLLAPSE_ALLOC,
THP_COLLAPSE_ALLOC_FAILED,
+   THP_WRITE_ALLOC,
+   THP_WRITE_FAILED,
THP_SPLIT,
THP_ZERO_PAGE_ALLOC,
THP_ZERO_PAGE_ALLOC_FAILED,

I think these names are a bit terse.  It's certainly not _writes_ that
are failing and "THP_WRITE_FAILED" makes it sound that way.

Right. s/THP_WRITE_FAILED/THP_WRITE_ALLOC_FAILED/


Also, why do we need to differentiate these from the existing anon-hugepage
vm stats?  The alloc_pages() call seems to be doing the exact same thing in
the end.  Is one more likely to succeed than the other?

Existing stats specify the source of a THP page: fault or collapse. When we
allocate a new huge page with write(2) it's neither a fault nor a collapse. I
think it's reasonable to introduce a new type of event for that.


Why is allocating a new huge page with write(2) not a write fault?
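
As I read the series, write(2) extends the file through the ->write_begin path (the series adds grab_thp_write_begin()), which allocates the page cache page directly; no page fault is taken unless the file is also mmap'ed. A hedged sketch of where such counters would fire; the helper name is made up, and THP_WRITE_ALLOC_FAILED follows the rename discussed above:

/*
 * Hypothetical helper, not taken from the patch: account a huge page
 * allocated on the write(2) path.  This path goes through ->write_begin
 * and never takes a page fault, so the THP_FAULT_* counters would be
 * misleading here.
 */
static struct page *thp_alloc_for_write(gfp_t gfp)
{
	struct page *page = alloc_pages(gfp | __GFP_COMP, HPAGE_PMD_ORDER);

	if (page)
		count_vm_event(THP_WRITE_ALLOC);
	else
		count_vm_event(THP_WRITE_ALLOC_FAILED);
	return page;
}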





--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCHv2, RFC 11/30] thp, mm: handle tail pages in page_cache_get_speculative()

2013-04-04 Thread Ric Mason

Hi Kirill,
On 03/15/2013 01:50 AM, Kirill A. Shutemov wrote:

From: "Kirill A. Shutemov" 

For a tail page we call __get_page_tail(). It has the same semantics, but
for a tail page.

Signed-off-by: Kirill A. Shutemov 
---
  include/linux/pagemap.h |4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 3521b0d..408c4e3 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -159,6 +159,9 @@ static inline int page_cache_get_speculative(struct page 
*page)


What's the difference between page_cache_get_speculative() and page_cache_get()?


  {
VM_BUG_ON(in_interrupt());
  
+	if (unlikely(PageTail(page)))

+   return __get_page_tail(page);
+
  #ifdef CONFIG_TINY_RCU
  # ifdef CONFIG_PREEMPT_COUNT
VM_BUG_ON(!in_atomic());
@@ -185,7 +188,6 @@ static inline int page_cache_get_speculative(struct page 
*page)
return 0;
}
  #endif
-   VM_BUG_ON(PageTail(page));
  
  	return 1;

  }
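
On the question above: as I understand it (background knowledge, not something defined in this patch), page_cache_get() is just get_page(), an unconditional reference taken by a caller that already holds the page stable, while page_cache_get_speculative() is meant for lockless lookups and can fail. A rough, simplified illustration:

/*
 * Rough illustration only.  A lockless lookup (e.g. find_get_page()
 * under RCU) may race with the page being freed and reused, so the
 * speculative get can return 0 and the caller must retry.  The real
 * find_get_page() also rechecks the radix tree slot after taking the
 * reference, which is omitted here.
 */
static struct page *lookup_sketch(struct address_space *mapping, pgoff_t index)
{
	struct page *page;

	rcu_read_lock();
	page = radix_tree_lookup(&mapping->page_tree, index);
	if (page && !page_cache_get_speculative(page))
		page = NULL;		/* lost the race; caller should retry */
	rcu_read_unlock();
	return page;
}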


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCHv2, RFC 07/30] thp, mm: introduce mapping_can_have_hugepages() predicate

2013-04-04 Thread Ric Mason

On 04/05/2013 11:45 AM, Ric Mason wrote:

Hi Kirill,
On 03/15/2013 01:50 AM, Kirill A. Shutemov wrote:

From: "Kirill A. Shutemov" 

Returns true if mapping can have huge pages. Just check for __GFP_COMP
in gfp mask of the mapping for now.

Signed-off-by: Kirill A. Shutemov 
---
  include/linux/pagemap.h |   10 ++
  1 file changed, 10 insertions(+)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index e3dea75..3521b0d 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -84,6 +84,16 @@ static inline void mapping_set_gfp_mask(struct 
address_space *m, gfp_t mask)

  (__force unsigned long)mask;
  }
  +static inline bool mapping_can_have_hugepages(struct address_space 
*m)

+{
+if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
+gfp_t gfp_mask = mapping_gfp_mask(m);
+return !!(gfp_mask & __GFP_COMP);


I always see !! in kernel, but why check directly instead of have !! 
prefix?


s/why/why not




+}
+
+return false;
+}
+
  /*
   * The page cache can done in larger chunks than
   * one page, because it allows for more efficient




--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCHv2, RFC 07/30] thp, mm: introduce mapping_can_have_hugepages() predicate

2013-04-04 Thread Ric Mason

Hi Kirill,
On 03/15/2013 01:50 AM, Kirill A. Shutemov wrote:

From: "Kirill A. Shutemov" 

Returns true if mapping can have huge pages. Just check for __GFP_COMP
in gfp mask of the mapping for now.

Signed-off-by: Kirill A. Shutemov 
---
  include/linux/pagemap.h |   10 ++
  1 file changed, 10 insertions(+)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index e3dea75..3521b0d 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -84,6 +84,16 @@ static inline void mapping_set_gfp_mask(struct address_space 
*m, gfp_t mask)
(__force unsigned long)mask;
  }
  
+static inline bool mapping_can_have_hugepages(struct address_space *m)

+{
+   if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
+   gfp_t gfp_mask = mapping_gfp_mask(m);
+   return !!(gfp_mask & __GFP_COMP);


I always see !! in kernel, but why check directly instead of have !! prefix?


+   }
+
+   return false;
+}
+
  /*
   * The page cache can done in larger chunks than
   * one page, because it allows for more efficient
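
On the !! question above: gfp_mask & __GFP_COMP evaluates to the raw bit value, not to 0 or 1. Because the function returns bool, the implicit conversion would normalize it anyway, so the !! is arguably redundant here, but it is common kernel idiom: it makes the 0/1 normalization explicit, and it matters wherever the result lands in a narrower integer type. A tiny stand-alone illustration (hypothetical values, used only as a stand-in for a gfp bit):

#include <stdbool.h>
#include <stdio.h>

/* Illustration only: '!!' collapses any non-zero masked value to 1. */
int main(void)
{
	unsigned int mask  = 0x4000;	/* stand-in for a gfp bit */
	unsigned int flags = 0x4010;

	bool as_bool = flags & mask;	/* implicit conversion: true */
	int  raw     = flags & mask;	/* 16384, not 1 */
	int  norm    = !!(flags & mask);	/* exactly 1 */

	printf("%d %d %d\n", as_bool, raw, norm);
	return 0;
}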


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCHv2, RFC 05/30] thp, mm: avoid PageUnevictable on active/inactive lru lists

2013-04-04 Thread Ric Mason

Hi Kirill,
On 03/22/2013 06:11 PM, Kirill A. Shutemov wrote:

Dave Hansen wrote:

On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:

active/inactive lru lists can contain unevictable pages (i.e. ramfs pages
that have been placed on the LRU lists when first allocated), but these
pages must not have PageUnevictable set - otherwise shrink_active_list
goes crazy:

...

For lru_add_page_tail(), it means we should not set PageUnevictable()
for tail pages unless we're sure that it will go to LRU_UNEVICTABLE.
The tail page will go LRU_UNEVICTABLE if head page is not on LRU or if
it's marked PageUnevictable() too.

This is only an issue once you're using lru_add_page_tail() for
non-anonymous pages, right?

I'm not sure about that. Documentation/vm/unevictable-lru.txt:

Some examples of these unevictable pages on the LRU lists are:

  (1) ramfs pages that have been placed on the LRU lists when first allocated.

  (2) SHM_LOCK'd shared memory pages.  shmctl(SHM_LOCK) does not attempt to
  allocate or fault in the pages in the shared memory region.  This happens
  when an application accesses the page the first time after SHM_LOCK'ing
  the segment.

  (3) mlocked pages that could not be isolated from the LRU and moved to the
  unevictable list in mlock_vma_page().

  (4) Pages mapped into multiple VM_LOCKED VMAs, but try_to_munlock() couldn't
  acquire the VMA's mmap semaphore to test the flags and set PageMlocked.
  munlock_vma_page() was forced to let the page back on to the normal LRU
  list for vmscan to handle.


diff --git a/mm/swap.c b/mm/swap.c
index 92a9be5..31584d0 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -762,7 +762,8 @@ void lru_add_page_tail(struct page *page, struct page 
*page_tail,
lru = LRU_INACTIVE_ANON;
}
} else {
-   SetPageUnevictable(page_tail);
+   if (!PageLRU(page) || PageUnevictable(page))
+   SetPageUnevictable(page_tail);
lru = LRU_UNEVICTABLE;
}

You were saying above that ramfs pages can get on the normal
active/inactive lists.  But, this will end up getting them on the
unevictable list, right?  So, we have normal ramfs pages on the
active/inactive lists, but ramfs pages after a huge-page-split on the
unevictable list.  That seems a bit inconsistent.

Yeah, it's confusing.

I was able to trigger another bug in this code:
if page_evictable(page_tail) is false and PageLRU(page) is true, page_tail
will go to the same lru as page, but nobody bothers to sync page_tail's
active/inactive state with page. So we can end up with an inactive page on
an active lru...

I've updated the patch for the next iteration. You can check it in git.
It should be cleaner. The description needs to be updated.
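
For readers following the thread, a hypothetical sketch of the kind of state sync being described (not the actual fix, which lives in the git tree mentioned above):

/*
 * Hypothetical sketch, not the actual fix: when the tail page follows
 * the head page onto the head's LRU list, mirror the head's active
 * state so an inactive-flagged tail cannot end up on an active list,
 * or vice versa.
 */
static void sync_tail_lru_state(struct page *page, struct page *page_tail)
{
	if (PageLRU(page) && !PageUnevictable(page)) {
		if (PageActive(page))
			SetPageActive(page_tail);
		else
			ClearPageActive(page_tail);
	}
}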


Hope you can send out soon. ;-)





--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCHv2, RFC 04/30] radix-tree: implement preload for multiple contiguous elements

2013-04-04 Thread Ric Mason

Hi Kirill,
On 03/15/2013 01:50 AM, Kirill A. Shutemov wrote:

From: "Kirill A. Shutemov" 

Currently radix_tree_preload() only guarantees enough nodes to insert
one element. It's a hard limit. You cannot batch a number of inserts under
one tree_lock.

This patch introduces radix_tree_preload_count(). It allows preallocating
enough nodes to insert a number of *contiguous* elements.

Signed-off-by: Matthew Wilcox 
Signed-off-by: Kirill A. Shutemov 
---
  include/linux/radix-tree.h |3 +++
  lib/radix-tree.c   |   32 +---
  2 files changed, 28 insertions(+), 7 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index ffc444c..81318cb 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -83,6 +83,8 @@ do {  
\
(root)->rnode = NULL;\
  } while (0)
  
+#define RADIX_TREE_PRELOAD_NR		512 /* For THP's benefit */

+
  /**
   * Radix-tree synchronization
   *
@@ -231,6 +233,7 @@ unsigned long radix_tree_next_hole(struct radix_tree_root 
*root,
  unsigned long radix_tree_prev_hole(struct radix_tree_root *root,
unsigned long index, unsigned long max_scan);
  int radix_tree_preload(gfp_t gfp_mask);
+int radix_tree_preload_count(unsigned size, gfp_t gfp_mask);
  void radix_tree_init(void);
  void *radix_tree_tag_set(struct radix_tree_root *root,
unsigned long index, unsigned int tag);
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index e796429..9bef0ac 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -81,16 +81,24 @@ static struct kmem_cache *radix_tree_node_cachep;
   * The worst case is a zero height tree with just a single item at index 0,
   * and then inserting an item at index ULONG_MAX. This requires 2 new branches
   * of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
+ *
+ * Worst case for adding N contiguous items is adding entries at indexes
+ * (ULONG_MAX - N) to ULONG_MAX. It requires nodes to insert single worst-case
+ * item plus extra nodes if you cross the boundary from one node to the next.
+ *


What's the meaning of these comments? Could you explain in detail? I
also don't understand #define RADIX_TREE_PRELOAD_SIZE
(RADIX_TREE_MAX_PATH * 2 - 1): why RADIX_TREE_MAX_PATH * 2 - 1? I fail
to understand the comments above it.
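
For what it's worth, my reading of the arithmetic, assuming the usual 64-bit defaults (not stated in this patch): RADIX_TREE_MAP_SHIFT = 6, so RADIX_TREE_MAP_SIZE = 64 and RADIX_TREE_MAX_PATH = DIV_ROUND_UP(64, 6) = 11.

/*
 * Assumed 64-bit defaults, worked through as a comment block.
 *
 * Single-item worst case: the tree holds only index 0 at height 0 and we
 * insert at index ULONG_MAX.  The tree has to grow to full height and a
 * complete new branch has to be built down to the new index; the old and
 * new paths share only the topmost node, hence
 *
 *	RADIX_TREE_PRELOAD_MIN = 2 * RADIX_TREE_MAX_PATH - 1 = 2 * 11 - 1 = 21
 *
 * For N contiguous indexes, the extra cost over the single-item case is at
 * most one more leaf-level node per 64-entry boundary crossed, which is
 * what the DIV_ROUND_UP term accounts for.  With RADIX_TREE_PRELOAD_NR = 512:
 *
 *	RADIX_TREE_PRELOAD_MAX = 21 + DIV_ROUND_UP(512 - 1, 64) = 21 + 8 = 29
 */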



   * Hence:
   */
-#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1)
+#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1)
+#define RADIX_TREE_PRELOAD_MAX \
+   (RADIX_TREE_PRELOAD_MIN + \
+DIV_ROUND_UP(RADIX_TREE_PRELOAD_NR - 1, RADIX_TREE_MAP_SIZE))
  
  /*

   * Per-cpu pool of preloaded nodes
   */
  struct radix_tree_preload {
int nr;
-   struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_SIZE];
+   struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_MAX];
  };
  static DEFINE_PER_CPU(struct radix_tree_preload, radix_tree_preloads) = { 0, 
};
  
@@ -257,29 +265,34 @@ radix_tree_node_free(struct radix_tree_node *node)
  
  /*

   * Load up this CPU's radix_tree_node buffer with sufficient objects to
- * ensure that the addition of a single element in the tree cannot fail.  On
- * success, return zero, with preemption disabled.  On error, return -ENOMEM
+ * ensure that the addition of *contiguous* elements in the tree cannot fail.
+ * On success, return zero, with preemption disabled.  On error, return -ENOMEM
   * with preemption not disabled.
   *
   * To make use of this facility, the radix tree must be initialised without
   * __GFP_WAIT being passed to INIT_RADIX_TREE().
   */
-int radix_tree_preload(gfp_t gfp_mask)
+int radix_tree_preload_count(unsigned size, gfp_t gfp_mask)
  {
struct radix_tree_preload *rtp;
struct radix_tree_node *node;
int ret = -ENOMEM;
+   int alloc = RADIX_TREE_PRELOAD_MIN +
+   DIV_ROUND_UP(size - 1, RADIX_TREE_MAP_SIZE);
+
+   if (size > RADIX_TREE_PRELOAD_NR)
+   return -ENOMEM;
  
  	preempt_disable();

rtp = &__get_cpu_var(radix_tree_preloads);
-   while (rtp->nr < ARRAY_SIZE(rtp->nodes)) {
+   while (rtp->nr < alloc) {
preempt_enable();
node = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
if (node == NULL)
goto out;
preempt_disable();
rtp = &__get_cpu_var(radix_tree_preloads);
-   if (rtp->nr < ARRAY_SIZE(rtp->nodes))
+   if (rtp->nr < alloc)
rtp->nodes[rtp->nr++] = node;
else
kmem_cache_free(radix_tree_node_cachep, node);
@@ -288,6 +301,11 @@ int radix_tree_preload(gfp_t gfp_mask)
  out:
return ret;
  }
+
+int radix_tree_preload(gfp_t gfp_mask)
+{
+   return radix_tree_preload_count(1, gfp_mask);
+}
  

Re: [PATCH, RFC 00/16] Transparent huge page cache

2013-04-04 Thread Ric Mason

Hi Hugh,
On 01/29/2013 01:03 PM, Hugh Dickins wrote:

On Mon, 28 Jan 2013, Kirill A. Shutemov wrote:

From: "Kirill A. Shutemov" 

Here's first steps towards huge pages in page cache.

The intent of the work is to get the code ready to enable transparent huge
page cache for the simplest fs -- ramfs.

It's not yet near feature-complete. It only provides basic infrastructure.
At the moment we can read, write and truncate a file on ramfs with huge pages
in the page cache. The most interesting part, mmap(), is not there yet. For now
we split the huge page on an mmap() attempt.

I can't say that I see the whole picture. I'm not sure if I understand the
locking model around split_huge_page(). Probably not.
Andrea, could you check if it looks correct?

Next steps (not necessary in this order):
  - mmap();
  - migration (?);
  - collapse;
  - stats, knobs, etc.;
  - tmpfs/shmem enabling;
  - ...

Kirill A. Shutemov (16):
   block: implement add_bdi_stat()
   mm: implement zero_huge_user_segment and friends
   mm: drop actor argument of do_generic_file_read()
   radix-tree: implement preload for multiple contiguous elements
   thp, mm: basic defines for transparent huge page cache
   thp, mm: rewrite add_to_page_cache_locked() to support huge pages
   thp, mm: rewrite delete_from_page_cache() to support huge pages
   thp, mm: locking tail page is a bug
   thp, mm: handle tail pages in page_cache_get_speculative()
   thp, mm: implement grab_cache_huge_page_write_begin()
   thp, mm: naive support of thp in generic read/write routines
   thp, libfs: initial support of thp in
 simple_read/write_begin/write_end
   thp: handle file pages in split_huge_page()
   thp, mm: truncate support for transparent huge page cache
   thp, mm: split huge page on mmap file page
   ramfs: enable transparent huge page cache

  fs/libfs.c  |   54 +---
  fs/ramfs/inode.c|6 +-
  include/linux/backing-dev.h |   10 +++
  include/linux/huge_mm.h |8 ++
  include/linux/mm.h  |   15 
  include/linux/pagemap.h |   14 ++-
  include/linux/radix-tree.h  |3 +
  lib/radix-tree.c|   32 +--
  mm/filemap.c|  204 +++
  mm/huge_memory.c|   62 +++--
  mm/memory.c |   22 +
  mm/truncate.c   |   12 +++
  12 files changed, 375 insertions(+), 67 deletions(-)

Interesting.

I was starting to think about Transparent Huge Pagecache a few
months ago, but then got washed away by incoming waves as usual.

Certainly I don't have a line of code to show for it; but my first
impression of your patches is that we have very different ideas of
where to start.

Perhaps that's good complementarity, or perhaps I'll disagree with
your approach.  I'll be taking a look at yours in the coming days,
and trying to summon back up my own ideas to summarize them for you.

Perhaps I was naive to imagine it, but I did intend to start out
generically, independent of filesystem; but content to narrow down
on tmpfs alone where it gets hard to support the others (writeback
springs to mind).  khugepaged would be migrating little pages into
huge pages, where it saw that the mmaps of the file would benefit
(and for testing I would hack mmap alignment choice to favour it).

I had arrived at a conviction that the first thing to change was
the way that tail pages of a THP are refcounted, that it had been a
mistake to use the compound page method of holding the THP together.
But I'll have to enter a trance now to recall the arguments ;)


One offline question: do you have any idea whether hugetlbfs pages support swapping?



Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majord...@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: em...@kvack.org


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH, RFC 00/16] Transparent huge page cache

2013-04-04 Thread Ric Mason

Hi Hugh,
On 01/29/2013 01:03 PM, Hugh Dickins wrote:

On Mon, 28 Jan 2013, Kirill A. Shutemov wrote:

From: Kirill A. Shutemov kirill.shute...@linux.intel.com

Here's first steps towards huge pages in page cache.

The intend of the work is get code ready to enable transparent huge page
cache for the most simple fs -- ramfs.

It's not yet near feature-complete. It only provides basic infrastructure.
At the moment we can read, write and truncate file on ramfs with huge pages in
page cache. The most interesting part, mmap(), is not yet there. For now
we split huge page on mmap() attempt.

I can't say that I see whole picture. I'm not sure if I understand locking
model around split_huge_page(). Probably, not.
Andrea, could you check if it looks correct?

Next steps (not necessary in this order):
  - mmap();
  - migration (?);
  - collapse;
  - stats, knobs, etc.;
  - tmpfs/shmem enabling;
  - ...

Kirill A. Shutemov (16):
   block: implement add_bdi_stat()
   mm: implement zero_huge_user_segment and friends
   mm: drop actor argument of do_generic_file_read()
   radix-tree: implement preload for multiple contiguous elements
   thp, mm: basic defines for transparent huge page cache
   thp, mm: rewrite add_to_page_cache_locked() to support huge pages
   thp, mm: rewrite delete_from_page_cache() to support huge pages
   thp, mm: locking tail page is a bug
   thp, mm: handle tail pages in page_cache_get_speculative()
   thp, mm: implement grab_cache_huge_page_write_begin()
   thp, mm: naive support of thp in generic read/write routines
   thp, libfs: initial support of thp in
 simple_read/write_begin/write_end
   thp: handle file pages in split_huge_page()
   thp, mm: truncate support for transparent huge page cache
   thp, mm: split huge page on mmap file page
   ramfs: enable transparent huge page cache

  fs/libfs.c  |   54 +---
  fs/ramfs/inode.c|6 +-
  include/linux/backing-dev.h |   10 +++
  include/linux/huge_mm.h |8 ++
  include/linux/mm.h  |   15 
  include/linux/pagemap.h |   14 ++-
  include/linux/radix-tree.h  |3 +
  lib/radix-tree.c|   32 +--
  mm/filemap.c|  204 +++
  mm/huge_memory.c|   62 +++--
  mm/memory.c |   22 +
  mm/truncate.c   |   12 +++
  12 files changed, 375 insertions(+), 67 deletions(-)

Interesting.

I was starting to think about Transparent Huge Pagecache a few
months ago, but then got washed away by incoming waves as usual.

Certainly I don't have a line of code to show for it; but my first
impression of your patches is that we have very different ideas of
where to start.

Perhaps that's good complementarity, or perhaps I'll disagree with
your approach.  I'll be taking a look at yours in the coming days,
and trying to summon back up my own ideas to summarize them for you.

Perhaps I was naive to imagine it, but I did intend to start out
generically, independent of filesystem; but content to narrow down
on tmpfs alone where it gets hard to support the others (writeback
springs to mind).  khugepaged would be migrating little pages into
huge pages, where it saw that the mmaps of the file would benefit
(and for testing I would hack mmap alignment choice to favour it).

I had arrived at a conviction that the first thing to change was
the way that tail pages of a THP are refcounted, that it had been a
mistake to use the compound page method of holding the THP together.
But I'll have to enter a trance now to recall the arguments ;)


One offline question, do you have any idea hugetlbfs pages support swapping?



Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majord...@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: a href=mailto:d...@kvack.org; em...@kvack.org /a


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCHv2, RFC 04/30] radix-tree: implement preload for multiple contiguous elements

2013-04-04 Thread Ric Mason

Hi Kirill,
On 03/15/2013 01:50 AM, Kirill A. Shutemov wrote:

From: Kirill A. Shutemov kirill.shute...@linux.intel.com

Currently radix_tree_preload() only guarantees enough nodes to insert
one element. It's a hard limit. You cannot batch a number insert under
one tree_lock.

This patch introduces radix_tree_preload_count(). It allows to
preallocate nodes enough to insert a number of *contiguous* elements.

Signed-off-by: Matthew Wilcox matthew.r.wil...@intel.com
Signed-off-by: Kirill A. Shutemov kirill.shute...@linux.intel.com
---
  include/linux/radix-tree.h |3 +++
  lib/radix-tree.c   |   32 +---
  2 files changed, 28 insertions(+), 7 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index ffc444c..81318cb 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -83,6 +83,8 @@ do {  
\
(root)-rnode = NULL;\
  } while (0)
  
+#define RADIX_TREE_PRELOAD_NR		512 /* For THP's benefit */

+
  /**
   * Radix-tree synchronization
   *
@@ -231,6 +233,7 @@ unsigned long radix_tree_next_hole(struct radix_tree_root 
*root,
  unsigned long radix_tree_prev_hole(struct radix_tree_root *root,
unsigned long index, unsigned long max_scan);
  int radix_tree_preload(gfp_t gfp_mask);
+int radix_tree_preload_count(unsigned size, gfp_t gfp_mask);
  void radix_tree_init(void);
  void *radix_tree_tag_set(struct radix_tree_root *root,
unsigned long index, unsigned int tag);
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index e796429..9bef0ac 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -81,16 +81,24 @@ static struct kmem_cache *radix_tree_node_cachep;
   * The worst case is a zero height tree with just a single item at index 0,
   * and then inserting an item at index ULONG_MAX. This requires 2 new branches
   * of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
+ *
+ * Worst case for adding N contiguous items is adding entries at indexes
+ * (ULONG_MAX - N) to ULONG_MAX. It requires nodes to insert single worst-case
+ * item plus extra nodes if you cross the boundary from one node to the next.
+ *


What's the meaning of this comments? Could you explain in details? I 
also don't understand #define RADIX_TREE_PRELOAD_SIZE 
(RADIX_TREE_MAX_PATH * 2 - 1), why RADIX_TREE_MAX_PATH * 2 - 1, I fail 
to understand comments above it.



   * Hence:
   */
-#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1)
+#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1)
+#define RADIX_TREE_PRELOAD_MAX \
+   (RADIX_TREE_PRELOAD_MIN + \
+DIV_ROUND_UP(RADIX_TREE_PRELOAD_NR - 1, RADIX_TREE_MAP_SIZE))
  
  /*

   * Per-cpu pool of preloaded nodes
   */
  struct radix_tree_preload {
int nr;
-   struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_SIZE];
+   struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_MAX];
  };
  static DEFINE_PER_CPU(struct radix_tree_preload, radix_tree_preloads) = { 0, 
};
  
@@ -257,29 +265,34 @@ radix_tree_node_free(struct radix_tree_node *node)
  
  /*

   * Load up this CPU's radix_tree_node buffer with sufficient objects to
- * ensure that the addition of a single element in the tree cannot fail.  On
- * success, return zero, with preemption disabled.  On error, return -ENOMEM
+ * ensure that the addition of *contiguous* elements in the tree cannot fail.
+ * On success, return zero, with preemption disabled.  On error, return -ENOMEM
   * with preemption not disabled.
   *
   * To make use of this facility, the radix tree must be initialised without
   * __GFP_WAIT being passed to INIT_RADIX_TREE().
   */
-int radix_tree_preload(gfp_t gfp_mask)
+int radix_tree_preload_count(unsigned size, gfp_t gfp_mask)
  {
struct radix_tree_preload *rtp;
struct radix_tree_node *node;
int ret = -ENOMEM;
+   int alloc = RADIX_TREE_PRELOAD_MIN +
+   DIV_ROUND_UP(size - 1, RADIX_TREE_MAP_SIZE);
+
+   if (size  RADIX_TREE_PRELOAD_NR)
+   return -ENOMEM;
  
  	preempt_disable();

rtp = __get_cpu_var(radix_tree_preloads);
-   while (rtp-nr  ARRAY_SIZE(rtp-nodes)) {
+   while (rtp-nr  alloc) {
preempt_enable();
node = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
if (node == NULL)
goto out;
preempt_disable();
rtp = __get_cpu_var(radix_tree_preloads);
-   if (rtp-nr  ARRAY_SIZE(rtp-nodes))
+   if (rtp-nr  alloc)
rtp-nodes[rtp-nr++] = node;
else
kmem_cache_free(radix_tree_node_cachep, node);
@@ -288,6 +301,11 @@ int radix_tree_preload(gfp_t gfp_mask)
  out:
return ret;
  }
+
+int radix_tree_preload(gfp_t gfp_mask)
+{
+   return 

Re: [PATCHv2, RFC 05/30] thp, mm: avoid PageUnevictable on active/inactive lru lists

2013-04-04 Thread Ric Mason

Hi Kirill,
On 03/22/2013 06:11 PM, Kirill A. Shutemov wrote:

Dave Hansen wrote:

On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:

active/inactive lru lists can contain unevicable pages (i.e. ramfs pages
that have been placed on the LRU lists when first allocated), but these
pages must not have PageUnevictable set - otherwise shrink_active_list
goes crazy:

...

For lru_add_page_tail(), it means we should not set PageUnevictable()
for tail pages unless we're sure that it will go to LRU_UNEVICTABLE.
The tail page will go LRU_UNEVICTABLE if head page is not on LRU or if
it's marked PageUnevictable() too.

This is only an issue once you're using lru_add_page_tail() for
non-anonymous pages, right?

I'm not sure about that. Documentation/vm/unevictable-lru.txt:

Some examples of these unevictable pages on the LRU lists are:

  (1) ramfs pages that have been placed on the LRU lists when first allocated.

  (2) SHM_LOCK'd shared memory pages.  shmctl(SHM_LOCK) does not attempt to
  allocate or fault in the pages in the shared memory region.  This happens
  when an application accesses the page the first time after SHM_LOCK'ing
  the segment.

  (3) mlocked pages that could not be isolated from the LRU and moved to the
  unevictable list in mlock_vma_page().

  (4) Pages mapped into multiple VM_LOCKED VMAs, but try_to_munlock() couldn't
  acquire the VMA's mmap semaphore to test the flags and set PageMlocked.
  munlock_vma_page() was forced to let the page back on to the normal LRU
  list for vmscan to handle.


diff --git a/mm/swap.c b/mm/swap.c
index 92a9be5..31584d0 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -762,7 +762,8 @@ void lru_add_page_tail(struct page *page, struct page 
*page_tail,
lru = LRU_INACTIVE_ANON;
}
} else {
-   SetPageUnevictable(page_tail);
+   if (!PageLRU(page) || PageUnevictable(page))
+   SetPageUnevictable(page_tail);
lru = LRU_UNEVICTABLE;
}

You were saying above that ramfs pages can get on the normal
active/inactive lists.  But, this will end up getting them on the
unevictable list, right?  So, we have normal ramfs pages on the
active/inactive lists, but ramfs pages after a huge-page-split on the
unevictable list.  That seems a bit inconsistent.

Yeah, it's confusing.

I was able to trigger another bug on this code:
if page_evictable(page_tail) is false and PageLRU(page) is true, page_tail
will go to the same lru as page, but nobody cares to sync page_tail
active/inactive state with page. So we can end up with inactive page on
active lru...

I've updated the patch for the next interation. You can check it in git.
It should be cleaner. Description need to be updated.


Hope you can send out soon. ;-)





--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCHv2, RFC 07/30] thp, mm: introduce mapping_can_have_hugepages() predicate

2013-04-04 Thread Ric Mason

Hi Kirill,
On 03/15/2013 01:50 AM, Kirill A. Shutemov wrote:

From: Kirill A. Shutemov kirill.shute...@linux.intel.com

Returns true if mapping can have huge pages. Just check for __GFP_COMP
in gfp mask of the mapping for now.

Signed-off-by: Kirill A. Shutemov kirill.shute...@linux.intel.com
---
  include/linux/pagemap.h |   10 ++
  1 file changed, 10 insertions(+)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index e3dea75..3521b0d 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -84,6 +84,16 @@ static inline void mapping_set_gfp_mask(struct address_space 
*m, gfp_t mask)
(__force unsigned long)mask;
  }
  
+static inline bool mapping_can_have_hugepages(struct address_space *m)

+{
+   if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
+   gfp_t gfp_mask = mapping_gfp_mask(m);
+   return !!(gfp_mask  __GFP_COMP);


I always see !! in kernel, but why check directly instead of have !! prefix?


+   }
+
+   return false;
+}
+
  /*
   * The page cache can done in larger chunks than
   * one page, because it allows for more efficient


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCHv2, RFC 07/30] thp, mm: introduce mapping_can_have_hugepages() predicate

2013-04-04 Thread Ric Mason

On 04/05/2013 11:45 AM, Ric Mason wrote:

Hi Kirill,
On 03/15/2013 01:50 AM, Kirill A. Shutemov wrote:

From: Kirill A. Shutemov kirill.shute...@linux.intel.com

Returns true if mapping can have huge pages. Just check for __GFP_COMP
in gfp mask of the mapping for now.

Signed-off-by: Kirill A. Shutemov kirill.shute...@linux.intel.com
---
  include/linux/pagemap.h |   10 ++
  1 file changed, 10 insertions(+)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index e3dea75..3521b0d 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -84,6 +84,16 @@ static inline void mapping_set_gfp_mask(struct 
address_space *m, gfp_t mask)

  (__force unsigned long)mask;
  }
  +static inline bool mapping_can_have_hugepages(struct address_space 
*m)

+{
+if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
+gfp_t gfp_mask = mapping_gfp_mask(m);
+return !!(gfp_mask  __GFP_COMP);


I always see !! in kernel, but why check directly instead of have !! 
prefix?


s/why/why not




+}
+
+return false;
+}
+
  /*
   * The page cache can done in larger chunks than
   * one page, because it allows for more efficient




--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCHv2, RFC 11/30] thp, mm: handle tail pages in page_cache_get_speculative()

2013-04-04 Thread Ric Mason

Hi Kirill,
On 03/15/2013 01:50 AM, Kirill A. Shutemov wrote:

From: Kirill A. Shutemov kirill.shute...@linux.intel.com

For tail page we call __get_page_tail(). It has the same semantics, but
for tail page.

Signed-off-by: Kirill A. Shutemov kirill.shute...@linux.intel.com
---
  include/linux/pagemap.h |4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 3521b0d..408c4e3 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -159,6 +159,9 @@ static inline int page_cache_get_speculative(struct page 
*page)


What's the difference between page_cache_get_speculative() and page_cache_get()?


  {
VM_BUG_ON(in_interrupt());
  
+	if (unlikely(PageTail(page)))

+   return __get_page_tail(page);
+
  #ifdef CONFIG_TINY_RCU
  # ifdef CONFIG_PREEMPT_COUNT
VM_BUG_ON(!in_atomic());
@@ -185,7 +188,6 @@ static inline int page_cache_get_speculative(struct page 
*page)
return 0;
}
  #endif
-   VM_BUG_ON(PageTail(page));
  
  	return 1;

  }
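
Not an answer from the thread, but a hedged userspace sketch of the distinction being asked about: a plain get is an unconditional reference bump (the caller already holds the page safely), while a speculative get is meant for lockless lookups, where the page could be freed concurrently, so it may only take a reference if the count is still non-zero and the caller must re-validate afterwards. The struct and function names below are illustrative, not the kernel's.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct fake_page {
	atomic_int count;		/* stand-in for the page refcount */
};

/* Plain get: caller already holds a stable reference, just increment. */
static void fake_page_get(struct fake_page *page)
{
	atomic_fetch_add(&page->count, 1);
}

/*
 * Speculative get: only succeed if the refcount is non-zero, because in a
 * lockless lookup the page might already be on its way to being freed.
 * The caller must then re-check that it still looked up the right page.
 */
static bool fake_page_get_speculative(struct fake_page *page)
{
	int old = atomic_load(&page->count);

	do {
		if (old == 0)
			return false;	/* too late, page is being freed */
	} while (!atomic_compare_exchange_weak(&page->count, &old, old + 1));

	return true;
}

int main(void)
{
	struct fake_page page = { .count = 1 };

	fake_page_get(&page);
	if (fake_page_get_speculative(&page))
		printf("refcount is now %d\n", atomic_load(&page.count));
	return 0;
}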


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCHv2, RFC 12/30] thp, mm: add event counters for huge page alloc on write to a file

2013-04-04 Thread Ric Mason

Hi Kirill,
On 03/26/2013 04:40 PM, Kirill A. Shutemov wrote:

Dave Hansen wrote:

On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:

--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -71,6 +71,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
THP_FAULT_FALLBACK,
THP_COLLAPSE_ALLOC,
THP_COLLAPSE_ALLOC_FAILED,
+   THP_WRITE_ALLOC,
+   THP_WRITE_FAILED,
THP_SPLIT,
THP_ZERO_PAGE_ALLOC,
THP_ZERO_PAGE_ALLOC_FAILED,

I think these names are a bit terse.  It's certainly not _writes_ that
are failing and THP_WRITE_FAILED makes it sound that way.

Right. s/THP_WRITE_FAILED/THP_WRITE_ALLOC_FAILED/


Also, why do we need to differentiate these from the existing anon-hugepage
vm stats?  The alloc_pages() call seems to be doing the exact same thing in
the end.  Is one more likely to succeed than the other?

Existing stats specify source of thp page: fault or collapse. When we
allocate a new huge page with write(2) it's neither fault nor collapse. I
think it's reasonable to introduce new type of event for that.


Why isn't allocating a new huge page with write(2) a write fault?





--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 3/4] introduce zero-filled page stat count

2013-03-19 Thread Ric Mason

On 03/20/2013 12:41 AM, Konrad Rzeszutek Wilk wrote:

On Sun, Mar 17, 2013 at 8:58 AM, Ric Mason  wrote:

Hi Konrad,

On 03/16/2013 09:06 PM, Konrad Rzeszutek Wilk wrote:

On Thu, Mar 14, 2013 at 06:08:16PM +0800, Wanpeng Li wrote:

Introduce zero-filled page statistics to monitor the number of
zero-filled pages.

Hm, you must be using an older version of the driver. Please
rebase it against Greg KH's staging tree. This is where most if not
all of the DebugFS counters got moved to a different file.


It seems that zcache debugfs in Greg's staging-next is buggy. Could you test
it?


Could you email me what issue you are seeing?

They have already been fixed in Wanpeng's patchset v4.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v4 0/8] staging: zcache: Support zero-filled pages more efficiently

2013-03-19 Thread Ric Mason

On 03/19/2013 05:25 PM, Wanpeng Li wrote:

Hi Greg,

Since you have already merged 1/8, feel free to merge 2/8~8/8; I have already
rebased against staging-next.

Changelog:
  v3 -> v4:
   * handle duplication in page_is_zero_filled, spotted by Bob
   * fix zcache writeback in debugfs
   * fix pers_pageframes|_max isn't exported in debugfs
   * fix static variable defined in debug.h but used in multiple C files
   * rebase on Greg's staging-next
  v2 -> v3:
   * increment/decrement zcache_[eph|pers]_zpages for zero-filled pages, 
spotted by Dan
   * replace "zero" or "zero page" by "zero_filled_page", spotted by Dan
  v1 -> v2:
   * avoid changing tmem.[ch] entirely, spotted by Dan.
   * don't accumulate [eph|pers]pageframe and [eph|pers]zpages for
 zero-filled pages, spotted by Dan
   * cleanup TODO list
   * add Dan Acked-by.

Motivation:

- Seth Jennings points out compress zero-filled pages with LZO(a lossless
   data compression algorithm) will waste memory and result in fragmentation.
   https://lkml.org/lkml/2012/8/14/347
- Dan Magenheimer add "Support zero-filled pages more efficiently" feature
   in zcache TODO list https://lkml.org/lkml/2013/2/13/503

Design:

- For store page, capture zero-filled pages(evicted clean page cache pages and
   swap pages), but don't compress them, set pampd which store zpage address to
   0x2 (since 0x0 and 0x1 have already been occupied) to mark the special zero-filled
   case and take advantage of tmem infrastructure to transform handle-tuple(pool
   id, object id, and an index) to a pampd. Twice compress zero-filled pages 
will
   contribute to one zcache_[eph|pers]_pageframes count accumulated.
- For load page, traverse the tmem hierarchy to transform handle-tuple to pampd
   and identify zero-filled case by pampd equal to 0x2 when filesystem reads
   file pages or a page needs to be swapped in, then refill the page to zero
   and return.

Test:

dd if=/dev/zero of=zerofile bs=1MB count=500
vmtouch -t zerofile
vmtouch -e zerofile

formula:
- fragmentation level = (zcache_[eph|pers]_pageframes * PAGE_SIZE - 
zcache_[eph|pers]_zbytes)
   * 100 / (zcache_[eph|pers]_pageframes * PAGE_SIZE)
- memory zcache occupy = zcache_[eph|pers]_zbytes

Result:

without zero-filled awareness:
- fragmentation level: 98%
- memory zcache occupy: 238MB
with zero-filled awareness:
- fragmentation level: 0%
- memory zcache occupy: 0MB

Wanpeng Li (8):
   introduce zero filled pages handler
   zero-filled pages awareness
   handle zcache_[eph|pers]_pages for zero-filled page
   fix pers_pageframes|_max aren't exported in debugfs
   fix zcache writeback in debugfs
   fix static variables are defined in debug.h but use in multiple C files
   introduce zero-filled page stat count
   clean TODO list


You can add Reviewed-by: Ric Mason  to this patchset.



  drivers/staging/zcache/TODO  |3 +-
  drivers/staging/zcache/debug.c   |5 +-
  drivers/staging/zcache/debug.h   |   77 +-
  drivers/staging/zcache/zcache-main.c |  147 ++
  4 files changed, 185 insertions(+), 47 deletions(-)
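
For reference, a minimal userspace sketch of the zero-filled detection the Design section above relies on. Assumptions: the helper name page_is_zero_filled and the word-by-word scan mirror the changelog's description, not the exact zcache code.

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define PAGE_SIZE 4096

/* A page counts as zero-filled only if every word in it is zero. */
static bool page_is_zero_filled(const void *mem)
{
	const unsigned long *word = mem;
	size_t i;

	for (i = 0; i < PAGE_SIZE / sizeof(*word); i++)
		if (word[i])
			return false;
	return true;
}

int main(void)
{
	static unsigned long page[PAGE_SIZE / sizeof(unsigned long)];	/* zero-initialized */

	printf("all-zero page: %d\n", page_is_zero_filled(page));
	((unsigned char *)page)[123] = 1;
	printf("dirtied page:  %d\n", page_is_zero_filled(page));
	return 0;
}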



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCHv2, RFC 00/30] Transparent huge page cache

2013-03-18 Thread Ric Mason

On 03/18/2013 07:42 PM, Kirill A. Shutemov wrote:

Simon Jeons wrote:

Hi Kirill,
On 03/18/2013 07:19 PM, Kirill A. Shutemov wrote:

Simon Jeons wrote:

On 03/18/2013 12:03 PM, Simon Jeons wrote:

Hi Kirill,
On 03/15/2013 01:50 AM, Kirill A. Shutemov wrote:

From: "Kirill A. Shutemov" 

Here's the second version of the patchset.

The intent of the work is to get the code ready to enable transparent huge page
cache for the most simple fs -- ramfs.

We have read()/write()/mmap() functionality now. Still plenty work
ahead.

One offline question.

Why is PG_mlocked set on page_tail, which is split out in the function
__split_huge_page_refcount()?

Not set, but copied from head page. Head page represents up-to-date sate
of compound page, we need to copy it to all tail pages on split.

I always see the up-to-date state mentioned; could you spell out for me which state
can be treated as up-to-date? :-)

While we work with huge page we only alter flags (like mlocked or
uptodate) of head page, but not tail, so we have to copy flags to all tail
pages on split. We also need to distribute _count and _mapcount properly.
Just read the code.


Sorry, you can treat this question as an offline one, unrelated to THP.
Which state of a page can be treated as up-to-date?




   

Also, why can't I find where the _PAGE_SPLITTING and _PAGE_PSE flags are
cleared in the split_huge_page path?
   
The pmd is invalidated and replaced with a reference to the page table at the end
of __split_huge_page_map.

Since the pmd is repopulated from the page table with the new flags anyway, why
does it need to be invalidated (present flag cleared) before that?

Comment just before pmdp_invalidate() in __split_huge_page_map() is fairly
informative.



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 3/4] introduce zero-filled page stat count

2013-03-17 Thread Ric Mason

Hi Konrad,
On 03/16/2013 09:06 PM, Konrad Rzeszutek Wilk wrote:

On Thu, Mar 14, 2013 at 06:08:16PM +0800, Wanpeng Li wrote:

Introduce zero-filled page statistics to monitor the number of
zero-filled pages.

Hm, you must be using an older version of the driver. Please
rebase it against Greg KH's staging tree. This is where most if not
all of the DebugFS counters got moved to a different file.


It seems that zcache debugfs in Greg's staging-next is buggy. Could you
test it?





Acked-by: Dan Magenheimer 
Signed-off-by: Wanpeng Li 
---
  drivers/staging/zcache/zcache-main.c |7 +++
  1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/drivers/staging/zcache/zcache-main.c 
b/drivers/staging/zcache/zcache-main.c
index db200b4..2091a4d 100644
--- a/drivers/staging/zcache/zcache-main.c
+++ b/drivers/staging/zcache/zcache-main.c
@@ -196,6 +196,7 @@ static ssize_t zcache_eph_nonactive_puts_ignored;
  static ssize_t zcache_pers_nonactive_puts_ignored;
  static ssize_t zcache_writtenback_pages;
  static ssize_t zcache_outstanding_writeback_pages;
+static ssize_t zcache_pages_zero;
  
  #ifdef CONFIG_DEBUG_FS

  #include <linux/debugfs.h>
@@ -257,6 +258,7 @@ static int zcache_debugfs_init(void)
	zdfs("outstanding_writeback_pages", S_IRUGO, root,
			&zcache_outstanding_writeback_pages);
	zdfs("writtenback_pages", S_IRUGO, root, &zcache_writtenback_pages);
+	zdfs("pages_zero", S_IRUGO, root, &zcache_pages_zero);
return 0;
  }
  #undef zdebugfs
@@ -326,6 +328,7 @@ void zcache_dump(void)
pr_info("zcache: outstanding_writeback_pages=%zd\n",
zcache_outstanding_writeback_pages);
pr_info("zcache: writtenback_pages=%zd\n", zcache_writtenback_pages);
+   pr_info("zcache: pages_zero=%zd\n", zcache_pages_zero);
  }
  #endif
  
@@ -562,6 +565,7 @@ static void *zcache_pampd_eph_create(char *data, size_t size, bool raw,

kunmap_atomic(user_mem);
clen = 0;
zero_filled = true;
+   zcache_pages_zero++;
goto got_pampd;
}
kunmap_atomic(user_mem);
@@ -645,6 +649,7 @@ static void *zcache_pampd_pers_create(char *data, size_t 
size, bool raw,
kunmap_atomic(user_mem);
clen = 0;
zero_filled = true;
+   zcache_pages_zero++;
goto got_pampd;
}
kunmap_atomic(user_mem);
@@ -866,6 +871,7 @@ static int zcache_pampd_get_data_and_free(char *data, 
size_t *sizep, bool raw,
zpages = 0;
if (!raw)
*sizep = PAGE_SIZE;
+   zcache_pages_zero--;
goto zero_fill;
}
  
@@ -922,6 +928,7 @@ static void zcache_pampd_free(void *pampd, struct tmem_pool *pool,

zero_filled = true;
zsize = 0;
zpages = 0;
+   zcache_pages_zero--;
}
  
  	if (pampd_is_remote(pampd) && !zero_filled) {

--
1.7.7.6


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majord...@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: mmap vs fs cache

2013-03-08 Thread Ric Mason

Hi Johannes,
On 03/09/2013 12:16 AM, Johannes Weiner wrote:

On Fri, Mar 08, 2013 at 07:00:55AM -0800, Howard Chu wrote:

Chris Friesen wrote:

On 03/08/2013 03:40 AM, Howard Chu wrote:


There is no way that a process that is accessing only 30GB of a mmap
should be able to fill up 32GB of RAM. There's nothing else running on
the machine, I've killed or suspended everything else in userland
besides a couple shells running top and vmstat. When I manually
drop_caches repeatedly, then eventually slapd RSS/SHR grows to 30GB and
the physical I/O stops.

Is it possible that the kernel is doing some sort of automatic
readahead, but it ends up reading pages corresponding to data that isn't
ever queried and so doesn't get mapped by the application?

Yes, that's what I was thinking. I added a
posix_madvise(..POSIX_MADV_RANDOM) but that had no effect on the
test.

First obvious conclusion - kswapd is being too aggressive. When free
memory hits the low watermark, the reclaim shrinks slapd down from
25GB to 18-19GB, while the page cache still contains ~7GB of
unmapped pages. Ideally I'd like a tuning knob so I can say to keep
no more than 2GB of unmapped pages in the cache. (And the desired
effect of that would be to allow user processes to grow to 30GB
total, in this case.)

We should find out where the unmapped page cache is coming from if you
are only accessing mapped file cache and disabled readahead.

How do you arrive at this number of unmapped page cache?

What could happen is that previously used and activated pages do not
get evicted anymore since there is a constant supply of younger


If a user process exits, will its file pages and anonymous pages be freed
immediately, or will they go through page reclaim?



reclaimable cache that is actually thrashing.  Whenever you drop the
caches, you get rid of those stale active pages and allow the
previously thrashing cache to get activated.  However, that would
require that there is already a significant amount of active file


Why do you emphasize a *significant* amount of active file pages?


pages before your workload starts (check the nr_active_file number in
/proc/vmstat before launching slapd, try sync; echo 3 >drop_caches
before launching to eliminate this option) OR that the set of pages
accessed during your workload changes and the combined set of pages
accessed by your workload is bigger than available memory -- which you
claimed would not happen because you only access the 30GB file area on
that system.
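
As an aside (not from the thread): the POSIX_MADV_RANDOM hint Howard mentions is applied per mapping. A minimal sketch, with a hypothetical file name, looks like the following; note that posix_madvise() returns an error number rather than setting errno.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
	const char *path = "data.mdb";		/* hypothetical DB file */
	struct stat st;
	void *map;
	int fd, err;

	fd = open(path, O_RDONLY);
	if (fd < 0 || fstat(fd, &st) < 0) {
		perror(path);
		return 1;
	}

	map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Hint that accesses will be random, so the kernel may skip readahead. */
	err = posix_madvise(map, st.st_size, POSIX_MADV_RANDOM);
	if (err)
		fprintf(stderr, "posix_madvise: error %d\n", err);

	/* ... walk the B+trees through 'map' ... */

	munmap(map, st.st_size);
	close(fd);
	return 0;
}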

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majord...@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: mmap vs fs cache

2013-03-08 Thread Ric Mason

Hi Johannes,
On 03/08/2013 10:08 AM, Johannes Weiner wrote:

On Thu, Mar 07, 2013 at 04:43:12PM +0100, Jan Kara wrote:

   Added mm list to CC.

On Tue 05-03-13 09:57:34, Howard Chu wrote:

I'm testing our memory-mapped database code on a small VM. The
machine has 32GB of RAM and the size of the DB on disk is ~44GB. The
database library mmaps the entire file as a single region and starts
accessing it as a tree of B+trees. Running on an Ubuntu 3.5.0-23
kernel, XFS on a local disk.

If I start running read-only queries against the DB with a freshly
started server, I see that my process (OpenLDAP slapd) quickly grows
to an RSS of about 16GB in tandem with the FS cache. (I.e., "top"
shows 16GB cached, and slapd is 16GB.)
If I confine my queries to the first 20% of the data then it all
fits in RAM and queries are nice and fast.

if I extend the query range to cover more of the data, approaching
the size of physical RAM, I see something strange - the FS cache
keeps growing, but the slapd process size grows at a slower rate.
This is rather puzzling to me since the only thing triggering reads
is accesses through the mmap region. Eventually the FS cache grows
to basically all of the 32GB of RAM (+/- some text/data space...)
but the slapd process only reaches 25GB, at which point it actually
starts to shrink - apparently the FS cache is now stealing pages
from it. I find that a bit puzzling; if the pages are present in
memory, and the only reason they were paged in was to satisfy an
mmap reference, why aren't they simply assigned to the slapd
process?

The current behavior gets even more aggravating: I can run a test
that spans exactly 30GB of the data. One would expect that the slapd
process should simply grow to 30GB in size, and then remain static
for the remainder of the test. Instead, the server grows to 25GB,
the FS cache grows to 32GB, and starts stealing pages from the
server, shrinking it back down to 19GB or so.

If I do an "echo 1 > /proc/sys/vm/drop_caches" at the onset of this
condition, the FS cache shrinks back to 25GB, matching the slapd
process size.
This then frees up enough RAM for slapd to grow further. If I don't
do this, the test is constantly paging in data from disk. Even so,
the FS cache continues to grow faster than the slapd process size,
so the system may run out of free RAM again, and I have to drop
caches multiple times before slapd finally grows to the full 30GB.
Once it gets to that size the test runs entirely from RAM with zero
I/Os, but it doesn't get there without a lot of babysitting.

2 questions:
   why is there data in the FS cache that isn't owned by (the mmap
of) the process that caused it to be paged in in the first place?

The filesystem cache is shared among processes because the filesystem
is also shared among processes.  If another task were to access the
same file, we still should only have one copy of that data in memory.

It sounds to me like slapd is itself caching all the data it reads.
If that is true, shouldn't it really be using direct IO to prevent
this double buffering of filesystem data in memory?


When is direct IO better, and when is the page cache better?




   is there a tunable knob to discourage the page cache from stealing
from the process?

Try reducing /proc/sys/vm/swappiness, which ranges from 0-100 and
defaults to 60.


Why reduce it? IIUC, swappiness determines how aggressively anonymous pages
are reclaimed; if the value is high, more anonymous pages will be reclaimed.
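
On the direct IO vs. page cache question above, an aside rather than an answer from the thread: direct IO bypasses the page cache entirely, which avoids the double buffering Johannes describes but also gives up readahead and cache hits, so it mainly pays off when the application does its own caching. A minimal sketch (hypothetical file name; O_DIRECT requires suitably aligned buffers and transfer sizes):

#define _GNU_SOURCE			/* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	const char *path = "data.mdb";	/* hypothetical file */
	void *buf;
	ssize_t n;
	int fd;

	/* O_DIRECT transfers must use an aligned buffer (4096 is usually safe). */
	if (posix_memalign(&buf, 4096, 4096)) {
		fprintf(stderr, "posix_memalign failed\n");
		return 1;
	}

	fd = open(path, O_RDONLY | O_DIRECT);
	if (fd < 0) {
		perror(path);
		return 1;
	}

	/* This read goes to the device, not through the page cache. */
	n = pread(fd, buf, 4096, 0);
	if (n < 0)
		perror("pread");
	else
		printf("read %zd bytes directly\n", n);

	close(fd);
	free(buf);
	return 0;
}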




--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majord...@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/7] ksm: treat unstable nid like in stable tree

2013-03-07 Thread Ric Mason

Ping Hugh, :-)
On 03/06/2013 06:18 PM, Ric Mason wrote:

Hi Hugh,
On 03/06/2013 01:05 PM, Hugh Dickins wrote:

On Wed, 6 Mar 2013, Ric Mason wrote:
[ I've deleted the context because that was about the unstable tree,
   and here you have moved to asking about a case in the stable tree. ]


I think I can basically understand you, please correct me if something 
wrong.


For ksm page:
If one ksm page(in old node) migrate to another(new) node(ksm page is 
treated as old page, one new page allocated in another node now), 
since we can't get right lock in this time, we can't move stable node 
to its new tree at this time, stable node still in old node and 
stable_node->nid still store old node value. If ksmd scan and compare 
another page in old node and search stable tree will figure out that 
stable node relevant ksm page is migrated to new node, stable node 
will be erased from old node's stable tree and link to migrate_nodes 
list. What's the life of new page in new node? new page will be scanned 
by ksmd, it will search stable tree in new node and if doesn't find 
matched stable node, the new node is deleted from migrate_node list 
and add to new node's stable tree as a leaf, if find stable node in 
stable tree, they will be merged. But in special case, the stable node 
relevant  ksm page can also migrated, new stable node will replace the 
stable node which relevant page migrated this time.

For unstable tree page:
If search in unstable tree and find the tree page which has equal 
content is migrated, just stop search and return, nothing merged. The 
new page in new node for this migrated unstable tree page will be 
insert to unstable tree in new node.


For the case of a ksm page is migrated to a different NUMA node and 
migrate
its stable node to  the right tree and collide with an existing 
stable node.
get_kpfn_nid(stable_node->kpfn) != NUMA(stable_node->nid) can 
capture nothing

That's not so: as I've pointed out before, ksm_migrate_page() updates
stable_node->kpfn for the new page on the new NUMA node; but it cannot
(get the right locking to) move the stable_node to its new tree at 
that time.


It's moved out once ksmd notices that it's in the wrong NUMA node tree -
perhaps when one its rmap_items reaches the head of 
cmp_and_merge_page(),

or perhaps here in stable_tree_search() when it matches another page
coming in to cmp_and_merge_page().

You may be concentrating on the case when that "another page" is a ksm
page migrated from a different NUMA node; and overlooking the case of
when the matching ksm page in this stable tree has itself been migrated.

since stable_node is the node in the right stable tree, nothing 
happens to it
before this check. Did you intend to check 
get_kpfn_nid(page_node->kpfn) !=

NUMA(page_node->nid) ?

Certainly not: page_node is usually NULL.  But I could have checked
get_kpfn_nid(stable_node->kpfn) != nid: I was duplicating the test
from cmp_and_merge_page(), but here we do have local variable nid.

Hugh




--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Should a swapped out page be deleted from swap cache?

2013-03-06 Thread Ric Mason

On 03/06/2013 07:04 PM, Ric Mason wrote:

On 03/06/2013 01:34 PM, Li Haifeng wrote:

2013/2/20 Ric Mason :

Hi Hugh,


On 02/20/2013 02:56 AM, Hugh Dickins wrote:

On Tue, 19 Feb 2013, Ric Mason wrote:

There is a call of try_to_free_swap in function swap_writepage, if
swap_writepage is called from the shrink_page_list path, 
PageSwapCache(page) ==
true, PageWriteback(page) may be false, page_swapcount(page) == 0, 
then

will
delete the page from swap cache and free swap slot, where I miss?
That's correct.  PageWriteback is sure to be false there. 
page_swapcount
usually won't be 0 there, but sometimes it will be, and in that 
case we

do want to delete from swap cache and free the swap slot.


1) If PageSwapCache(page)  == true, PageWriteback(page) == false,
page_swapcount(page) == 0  in swap_writepage(shrink_page_list path), 
then

will delete the page from swap cache and free swap slot, in function
swap_writepage:

if (try_to_free_swap(page)) {
 unlock_page(page);
 goto out;
}
writeback will not execute, that's wrong. Where I miss?

when the page is deleted from swap cache and corresponding swap slot
is free, the page is set dirty. The dirty page won't be reclaimed. It
is not wrong.


I don't think so. For dirty pages, there are two steps: 1)writeback 
2)reclaim. Since PageSwapCache(page) == true && PageWriteback(page) == 
false && page_swapcount(page) == 0 in swap_writeback(), 
try_to_free_swap() will return true and writeback will be skipped. Then 
how can step one be executed?


s/swap_writeback()/swap_writepage()

Btw, Hi Hugh, could you explain more to us? :-)





corresponding path lists as below.
when swap_writepage() is called by pageout() in shrink_page_list().
pageout() will return PAGE_SUCCESS. For PAGE_SUCCESS, when
PageDirty(page) is true, this reclaiming page will be keeped in the
inactive LRU list.
shrink_page_list()
{
...
  904 switch (pageout(page, mapping, sc)) {
  905 case PAGE_KEEP:
  906 nr_congested++;
  907 goto keep_locked;
  908 case PAGE_ACTIVATE:
  909 goto activate_locked;
  910 case PAGE_SUCCESS:
  911 if (PageWriteback(page))
  912 goto keep_lumpy;
  913 if (PageDirty(page))
  914 goto keep;
...}

2) In the function pageout, page will be set PG_Reclaim flag, since 
this

flag is set, end_swap_bio_write->end_page_writeback:
if (TestClearPageReclaim(page))
  rotate_reclaimable_page(page);
it means that the page will be added to the tail of the LRU list; the page is a clean
anonymous page at this point and will be reclaimed to the buddy system soon,
correct?

correct

If that is correct, what is the meaning of rotate here?

Rotating here is to add the page to the tail of inactive LRU list. So
this page will be reclaimed ASAP while reclaiming.


Hugh






--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Should a swapped out page be deleted from swap cache?

2013-03-06 Thread Ric Mason

On 03/06/2013 01:34 PM, Li Haifeng wrote:

2013/2/20 Ric Mason :

Hi Hugh,


On 02/20/2013 02:56 AM, Hugh Dickins wrote:

On Tue, 19 Feb 2013, Ric Mason wrote:

There is a call of try_to_free_swap in function swap_writepage, if
swap_writepage is called from the shrink_page_list path, PageSwapCache(page) ==
true, PageWriteback(page) may be false, page_swapcount(page) == 0, then
will
delete the page from swap cache and free swap slot, where I miss?

That's correct.  PageWriteback is sure to be false there.  page_swapcount
usually won't be 0 there, but sometimes it will be, and in that case we
do want to delete from swap cache and free the swap slot.


1) If PageSwapCache(page)  == true, PageWriteback(page) == false,
page_swapcount(page) == 0  in swap_writepage(shrink_page_list path), then
will delete the page from swap cache and free swap slot, in function
swap_writepage:

if (try_to_free_swap(page)) {
 unlock_page(page);
 goto out;
}
writeback will not execute, that's wrong. Where I miss?

when the page is deleted from swap cache and corresponding swap slot
is free, the page is set dirty. The dirty page won't be reclaimed. It
is not wrong.


I don't think so. For dirty pages, there are two steps: 1)writeback 
2)reclaim. Since PageSwapCache(page) == true && PageWriteback(page) == 
false && page_swapcount(page) == 0 in swap_writeback(), 
try_to_free_swap() will return true and writeback will be skipped. Then how 
can step one be executed?




corresponding path lists as below.
when swap_writepage() is called by pageout() in shrink_page_list().
pageout() will return PAGE_SUCCESS. For PAGE_SUCCESS, when
PageDirty(page) is true, this reclaiming page will be keeped in the
inactive LRU list.
shrink_page_list()
{
...
  904 switch (pageout(page, mapping, sc)) {
  905 case PAGE_KEEP:
  906 nr_congested++;
  907 goto keep_locked;
  908 case PAGE_ACTIVATE:
  909 goto activate_locked;
  910 case PAGE_SUCCESS:
  911 if (PageWriteback(page))
  912 goto keep_lumpy;
  913 if (PageDirty(page))
  914 goto keep;
...}


2) In the function pageout, page will be set PG_Reclaim flag, since this
flag is set, end_swap_bio_write->end_page_writeback:
if (TestClearPageReclaim(page))
  rotate_reclaimable_page(page);
it means that the page will be added to the tail of the LRU list; the page is a clean
anonymous page at this point and will be reclaimed to the buddy system soon, correct?

correct

If that is correct, what is the meaning of rotate here?

Rotating here is to add the page to the tail of inactive LRU list. So
this page will be reclaimed ASAP while reclaiming.


Hugh




--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/7] ksm: treat unstable nid like in stable tree

2013-03-06 Thread Ric Mason

Hi Hugh,
On 03/06/2013 01:05 PM, Hugh Dickins wrote:

On Wed, 6 Mar 2013, Ric Mason wrote:
[ I've deleted the context because that was about the unstable tree,
   and here you have moved to asking about a case in the stable tree. ]


I think I can basically understand you, please correct me if something 
wrong.


For ksm page:
If one ksm page(in old node) migrate to another(new) node(ksm page is 
treated as old page, one new page allocated in another node now), since 
we can't get right lock in this time, we can't move stable node to its 
new tree at this time, stable node still in old node and 
stable_node->nid still store old node value. If ksmd scan and compare 
another page in old node and search stable tree will figure out that 
stable node relevant ksm page is migrated to new node, stable node will 
be erased from old node's stable tree and link to migrate_nodes list. 
What's the life of new page in new node? new page will be scanned by 
ksmd, it will search stable tree in new node and if doesn't find matched 
stable node, the new node is deleted from migrate_node list and add to 
new node's stable tree as a leaf, if find stable node in stable tree, 
they will be merged. But in special case, the stable node relevant  ksm 
page can also migrated, new stable node will replace the stable node 
which relevant page migrated this time.

For unstable tree page:
If search in unstable tree and find the tree page which has equal 
content is migrated, just stop search and return, nothing merged. The 
new page in new node for this migrated unstable tree page will be insert 
to unstable tree in new node.



For the case of a ksm page is migrated to a different NUMA node and migrate
its stable node to  the right tree and collide with an existing stable node.
get_kpfn_nid(stable_node->kpfn) != NUMA(stable_node->nid) can capture nothing

That's not so: as I've pointed out before, ksm_migrate_page() updates
stable_node->kpfn for the new page on the new NUMA node; but it cannot
(get the right locking to) move the stable_node to its new tree at that time.

It's moved out once ksmd notices that it's in the wrong NUMA node tree -
perhaps when one its rmap_items reaches the head of cmp_and_merge_page(),
or perhaps here in stable_tree_search() when it matches another page
coming in to cmp_and_merge_page().

You may be concentrating on the case when that "another page" is a ksm
page migrated from a different NUMA node; and overlooking the case of
when the matching ksm page in this stable tree has itself been migrated.


since stable_node is the node in the right stable tree, nothing happens to it
before this check. Did you intend to check get_kpfn_nid(page_node->kpfn) !=
NUMA(page_node->nid) ?

Certainly not: page_node is usually NULL.  But I could have checked
get_kpfn_nid(stable_node->kpfn) != nid: I was duplicating the test
from cmp_and_merge_page(), but here we do have local variable nid.

Hugh


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/7] ksm: treat unstable nid like in stable tree

2013-03-06 Thread Ric Mason

Hi Hugh,
On 03/06/2013 01:05 PM, Hugh Dickins wrote:

On Wed, 6 Mar 2013, Ric Mason wrote:
[ I've deleted the context because that was about the unstable tree,
   and here you have moved to asking about a case in the stable tree. ]


I think I can basically understand you, please correct me if something 
wrong.


For ksm page:
If one ksm page(in old node) migrate to another(new) node(ksm page is 
treated as old page, one new page allocated in another node now), since 
we can't get right lock in this time, we can't move stable node to its 
new tree at this time, stable node still in old node and 
stable_node-nid still store old node value. If ksmd scan and compare 
another page in old node and search stable tree will figure out that 
stable node relevant ksm page is migrated to new node, stable node will 
be erased from old node's stable tree and link to migrate_nodes list. 
What's the life of new page in new node? new page will be scaned by 
ksmd, it will search stable tree in new node and if doesn't find matched 
stable node, the new node is deleted from migrate_node list and add to 
new node's table tree as a leaf, if find stable node in stable tree, 
they will be merged. But in special case, the stable node relevant  ksm 
page can also migrated, new stable node will replace the stable node 
which relevant page migrated this time.

For unstable tree page:
If search in unstable tree and find the tree page which has equal 
content is migrated, just stop search and return, nothing merged. The 
new page in new node for this migrated unstable tree page will be insert 
to unstable tree in new node.



For the case of a ksm page is migrated to a different NUMA node and migrate
its stable node to  the right tree and collide with an existing stable node.
get_kpfn_nid(stable_node-kpfn) != NUMA(stable_node-nid) can capture nothing

That's not so: as I've pointed out before, ksm_migrate_page() updates
stable_node-kpfn for the new page on the new NUMA node; but it cannot
(get the right locking to) move the stable_node to its new tree at that time.

It's moved out once ksmd notices that it's in the wrong NUMA node tree -
perhaps when one its rmap_items reaches the head of cmp_and_merge_page(),
or perhaps here in stable_tree_search() when it matches another page
coming in to cmp_and_merge_page().

You may be concentrating on the case when that another page is a ksm
page migrated from a different NUMA node; and overlooking the case of
when the matching ksm page in this stable tree has itself been migrated.


since stable_node is the node in the right stable tree, nothing happen to it
before this check. Did you intend to check get_kpfn_nid(page_node-kpfn) !=
NUMA(page_node-nid) ?

Certainly not: page_node is usually NULL.  But I could have checked
get_kpfn_nid(stable_node-kpfn) != nid: I was duplicating the test
from cmp_and_merge_page(), but here we do have local variable nid.

Hugh


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Should a swapped out page be deleted from swap cache?

2013-03-06 Thread Ric Mason

On 03/06/2013 01:34 PM, Li Haifeng wrote:

2013/2/20 Ric Mason ric.mas...@gmail.com:

Hi Hugh,


On 02/20/2013 02:56 AM, Hugh Dickins wrote:

On Tue, 19 Feb 2013, Ric Mason wrote:

There is a call of try_to_free_swap in function swap_writepage, if
swap_writepage is call from shrink_page_list path, PageSwapCache(page) ==
trure, PageWriteback(page) maybe false, page_swapcount(page) == 0, then
will
delete the page from swap cache and free swap slot, where I miss?

That's correct.  PageWriteback is sure to be false there.  page_swapcount
usually won't be 0 there, but sometimes it will be, and in that case we
do want to delete from swap cache and free the swap slot.


1) If PageSwapCache(page)  == true, PageWriteback(page) == false,
page_swapcount(page) == 0  in swap_writepage(shrink_page_list path), then
will delete the page from swap cache and free swap slot, in function
swap_writepage:

if (try_to_free_swap(page)) {
 unlock_page(page);
 goto out;
}
writeback will not execute, that's wrong. Where I miss?

when the page is deleted from swap cache and corresponding swap slot
is free, the page is set dirty. The dirty page won't be reclaimed. It
is not wrong.


I don't think so. For dirty pages, there are two steps: 1)writeback 
2)reclaim. Since PageSwapCache(page) == true  PageWriteback(page) == 
false  page_swapcount(page) == 0 in swap_writeback(), 
try_to_free_swap() will return true and writeback will be skip. Then how 
can step one be executed?




corresponding path lists as below.
when swap_writepage() is called by pageout() in shrink_page_list().
pageout() will return PAGE_SUCCESS. For PAGE_SUCCESS, when
PageDirty(page) is true, this reclaiming page will be keeped in the
inactive LRU list.
shrink_page_list()
{
...
  904 switch (pageout(page, mapping, sc)) {
  905 case PAGE_KEEP:
  906 nr_congested++;
  907 goto keep_locked;
  908 case PAGE_ACTIVATE:
  909 goto activate_locked;
  910 case PAGE_SUCCESS:
  911 if (PageWriteback(page))
  912 goto keep_lumpy;
  913 if (PageDirty(page))
  914 goto keep;
...}


2) In the function pageout, the page will have the PG_Reclaim flag set; since this
flag is set, end_swap_bio_write->end_page_writeback:
if (TestClearPageReclaim(page))
  rotate_reclaimable_page(page);
it means the page will be added to the tail of the LRU list; the page is a clean
anonymous page at this point and will be reclaimed to the buddy system soon, correct?

correct

If it is correct, what is the meaning of rotate here?

Rotating here is to add the page to the tail of inactive LRU list. So
this page will be reclaimed ASAP while reclaiming.


Hugh




--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
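
[Editorial aside: to make the answer above concrete, here is a heavily abridged,
from-memory sketch of try_to_free_swap() as it looked in this era (mm/swapfile.c);
it is illustrative rather than the verbatim source. The SetPageDirty() at the end
is the key point: when the swap slot is freed here the page is dirtied again, so
shrink_page_list() keeps it on the LRU and it is simply written out later to a
freshly allocated slot, which is why skipping the writeback is not wrong.]

int try_to_free_swap(struct page *page)
{
	VM_BUG_ON(!PageLocked(page));

	if (!PageSwapCache(page))
		return 0;
	if (PageWriteback(page))
		return 0;
	if (page_swapcount(page))
		return 0;
	/* (a hibernation-related check is omitted here) */

	delete_from_swap_cache(page);
	SetPageDirty(page);	/* redirty: the page stays reclaimable later */
	return 1;
}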


Re: Should a swapped out page be deleted from swap cache?

2013-03-06 Thread Ric Mason

On 03/06/2013 07:04 PM, Ric Mason wrote:

On 03/06/2013 01:34 PM, Li Haifeng wrote:

2013/2/20 Ric Mason ric.mas...@gmail.com:

Hi Hugh,


On 02/20/2013 02:56 AM, Hugh Dickins wrote:

On Tue, 19 Feb 2013, Ric Mason wrote:

There is a call of try_to_free_swap in function swap_writepage; if
swap_writepage is called from the shrink_page_list path,
PageSwapCache(page) == true, PageWriteback(page) may be false, and
page_swapcount(page) == 0, then it will delete the page from swap cache
and free the swap slot. What am I missing?
That's correct.  PageWriteback is sure to be false there. 
page_swapcount
usually won't be 0 there, but sometimes it will be, and in that 
case we

do want to delete from swap cache and free the swap slot.


1) If PageSwapCache(page)  == true, PageWriteback(page) == false,
page_swapcount(page) == 0  in swap_writepage(shrink_page_list path), 
then

will delete the page from swap cache and free swap slot, in function
swap_writepage:

if (try_to_free_swap(page)) {
 unlock_page(page);
 goto out;
}
writeback will not execute; that seems wrong. What am I missing?

when the page is deleted from swap cache and corresponding swap slot
is free, the page is set dirty. The dirty page won't be reclaimed. It
is not wrong.


I don't think so. For dirty pages, there are two steps: 1) writeback,
2) reclaim. Since PageSwapCache(page) == true && PageWriteback(page) ==
false && page_swapcount(page) == 0 in swap_writeback(),
try_to_free_swap() will return true and writeback will be skipped. Then
how can step one be executed?


s/swap_writeback()/swap_writepage()

Btw, Hi Hugh, could you explain more to us? :-)





corresponding path lists as below.
when swap_writepage() is called by pageout() in shrink_page_list().
pageout() will return PAGE_SUCCESS. For PAGE_SUCCESS, when
PageDirty(page) is true, this reclaiming page will be kept in the
inactive LRU list.
shrink_page_list()
{
...
  904 switch (pageout(page, mapping, sc)) {
  905 case PAGE_KEEP:
  906 nr_congested++;
  907 goto keep_locked;
  908 case PAGE_ACTIVATE:
  909 goto activate_locked;
  910 case PAGE_SUCCESS:
  911 if (PageWriteback(page))
  912 goto keep_lumpy;
  913 if (PageDirty(page))
  914 goto keep;
...}

2) In the function pageout, the page will have the PG_Reclaim flag set; since
this flag is set, end_swap_bio_write->end_page_writeback:
if (TestClearPageReclaim(page))
  rotate_reclaimable_page(page);
it means the page will be added to the tail of the LRU list; the page is a clean
anonymous page at this point and will be reclaimed to the buddy system soon,
correct?

correct

If it is correct, what is the meaning of rotate here?

Rotating here is to add the page to the tail of inactive LRU list. So
this page will be reclaimed ASAP while reclaiming.


Hugh






--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
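
[Editorial aside: the rotation being asked about happens on the write-completion
side. A condensed, from-memory sketch of the relevant fragments (mm/filemap.c and
mm/swap.c in this era), abridged and illustrative only:]

void end_page_writeback(struct page *page)
{
	if (TestClearPageReclaim(page))
		rotate_reclaimable_page(page);	/* queue for tail of inactive LRU */

	if (!test_clear_page_writeback(page))
		BUG();
	/* ... memory barrier and wake_up_page(page, PG_writeback) ... */
}

void rotate_reclaimable_page(struct page *page)
{
	/* only a clean, unlocked, inactive page still on the LRU is rotated */
	if (!PageLocked(page) && !PageDirty(page) && !PageActive(page) &&
	    !PageUnevictable(page) && PageLRU(page)) {
		/*
		 * ... take a reference and add the page to a per-cpu pagevec
		 * that moves pages to the tail of the inactive list, so that
		 * reclaim finds this now-clean page almost immediately ...
		 */
	}
}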


Re: [PATCH 2/7] ksm: treat unstable nid like in stable tree

2013-03-05 Thread Ric Mason

Hi Hugh,
On 03/06/2013 01:05 PM, Hugh Dickins wrote:

On Wed, 6 Mar 2013, Ric Mason wrote:
[ I've deleted the context because that was about the unstable tree,
   and here you have moved to asking about a case in the stable tree. ]

For the case where a ksm page is migrated to a different NUMA node and its
stable node is migrated to the right tree and collides with an existing stable node:
get_kpfn_nid(stable_node->kpfn) != NUMA(stable_node->nid) can capture nothing

That's not so: as I've pointed out before, ksm_migrate_page() updates
stable_node->kpfn for the new page on the new NUMA node; but it cannot
(get the right locking to) move the stable_node to its new tree at that time.

It's moved out once ksmd notices that it's in the wrong NUMA node tree -
perhaps when one of its rmap_items reaches the head of cmp_and_merge_page(),
or perhaps here in stable_tree_search() when it matches another page
coming in to cmp_and_merge_page().

You may be concentrating on the case when that "another page" is a ksm
page migrated from a different NUMA node; and overlooking the case of
when the matching ksm page in this stable tree has itself been migrated.


since stable_node is the node in the right stable tree; nothing happens to it
before this check. Did you intend to check get_kpfn_nid(page_node->kpfn) !=
NUMA(page_node->nid)?

Certainly not: page_node is usually NULL.  But I could have checked


Are you sure? list_del(&page_node->list) and DO_NUMA(page_node->nid = 
nid) will trigger panic now.

get_kpfn_nid(stable_node->kpfn) != nid: I was duplicating the test
from cmp_and_merge_page(), but here we do have local variable nid.

Hugh


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/7] ksm: treat unstable nid like in stable tree

2013-03-05 Thread Ric Mason

On 03/02/2013 10:57 AM, Hugh Dickins wrote:

On Sat, 2 Mar 2013, Ric Mason wrote:

On 03/02/2013 04:03 AM, Hugh Dickins wrote:

On Fri, 1 Mar 2013, Ric Mason wrote:

I think the ksm implementation for numa awareness is buggy.

Sorry, I just don't understand your comments below,
but will try to answer or question them as best I can.


For page migration stuff, a new page is allocated from the node *which the page is
migrated to*.

Yes, by definition.


- when meeting a page from the wrong NUMA node in an unstable tree
  get_kpfn_nid(page_to_pfn(page)) *==* page_to_nid(tree_page)

I thought you were writing of the wrong NUMA node case,
but now you emphasize "*==*", which means the right NUMA node.

Yes, I mean the wrong NUMA node. During page migration, the new page has already
been allocated on the new node and the old page may be freed.  So tree_page is the
page in the new node's unstable tree, and page is also a new-node page, so
get_kpfn_nid(page_to_pfn(page)) *==* page_to_nid(tree_page).

I don't understand; but here you seem to be describing a case where two
pages from the same NUMA node get merged (after both have been migrated
from another NUMA node?), and there's nothing wrong with that,
so I won't worry about it further.


For the case where a ksm page is migrated to a different NUMA node and 
its stable node is migrated to the right tree and collides with an existing 
stable node: get_kpfn_nid(stable_node->kpfn) != NUMA(stable_node->nid) 
can capture nothing, since stable_node is the node in the right stable 
tree; nothing happens to it before this check. Did you intend to check 
get_kpfn_nid(page_node->kpfn) != NUMA(page_node->nid)?





 - meeting a page which is a ksm page before migration
   get_kpfn_nid(stable_node->kpfn) != NUMA(stable_node->nid) can't capture
them since stable_node is for the tree page in the current stable tree. They are
always equal.

When we meet a ksm page in the stable tree before it's migrated to another
NUMA node, yes, it will be on the right NUMA node (because we were careful
only to merge pages from the right NUMA node there), and that test will not
capture them.  It's for capturing a ksm page in the stable tree after it has
been migrated to another NUMA node.

The ksm page migrated to another NUMA node is still not freed, why? Who holds a
page count on it?

The old page, the one which used to be a ksm page on the old NUMA node,
should be freed very soon: since it was isolated from lru, and its page
count checked, I cannot think of anything to hold a reference to it,
apart from migration itself - so it just needs to reach putback_lru_page(),
and then may rest awhile on __lru_cache_add()'s pagevec before being freed.

But I don't see where I said the old page was still not freed.


If it is not freed, since the new page is allocated on the new node and is
a copy of the current ksm page, the current ksm page doesn't change, so
get_kpfn_nid(stable_node->kpfn) *==* NUMA(stable_node->nid).

But ksm_migrate_page() did
VM_BUG_ON(stable_node->kpfn != page_to_pfn(oldpage));
stable_node->kpfn = page_to_pfn(newpage);
without changing stable_node->nid.

Hugh


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
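
[Editorial aside: a condensed, from-memory sketch of the two pieces of mm/ksm.c
that this exchange keeps returning to; abridged and illustrative, not the verbatim
source:]

void ksm_migrate_page(struct page *newpage, struct page *oldpage)
{
	struct stable_node *stable_node;

	stable_node = page_stable_node(newpage);
	if (stable_node) {
		VM_BUG_ON(stable_node->kpfn != page_to_pfn(oldpage));
		stable_node->kpfn = page_to_pfn(newpage);
		/*
		 * stable_node->nid is deliberately NOT updated here: the
		 * migration path cannot take the locks needed to move the
		 * node into the other NUMA node's stable tree, so the node
		 * is left stale until ksmd notices it.
		 */
	}
}

/*
 * ... and later, in stable_tree_search() (similarly in cmp_and_merge_page()),
 * ksmd detects such a stale node and takes it out of the tree:
 */
	if (get_kpfn_nid(stable_node->kpfn) != NUMA(stable_node->nid)) {
		put_page(tree_page);
		goto replace;
	}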




Re: [PATCH 2/7] ksm: treat unstable nid like in stable tree

2013-03-01 Thread Ric Mason


Hi Hugh,
On 03/02/2013 04:03 AM, Hugh Dickins wrote:

On Fri, 1 Mar 2013, Ric Mason wrote:

I think the ksm implementation for numa awareness is buggy.

Sorry, I just don't understand your comments below,
but will try to answer or question them as best I can.


For page migration stuff, a new page is allocated from the node *which the page is
migrated to*.

Yes, by definition.


- when meeting a page from the wrong NUMA node in an unstable tree
 get_kpfn_nid(page_to_pfn(page)) *==* page_to_nid(tree_page)

I thought you were writing of the wrong NUMA node case,
but now you emphasize "*==*", which means the right NUMA node.


Yes, I mean the wrong NUMA node. During page migration, the new page has 
already been allocated on the new node and the old page may be freed.  So 
tree_page is the page in the new node's unstable tree, and page is also a new-node 
page, so get_kpfn_nid(page_to_pfn(page)) *==* page_to_nid(tree_page).





 How can you say it's okay for comparisons, but not as a leaf for merging?

Pages in the unstable tree are unstable (and it's not even accurate to
say "pages in the unstable tree"), they and their content can change at
any moment, so I cannot assert anything of them for sure.

But if we suppose, as an approximation, that they are somewhat likely
to remain stable (and the unstable tree would be useless without that
assumption: it tends to work out), but subject to migration, then it makes
sense to compare content, no matter what NUMA node it is on, in order to
locate a page of the same content; but wrong to merge with that page if
it's on the wrong NUMA node, if !merge_across_nodes tells us not to.



- when meeting a page from the wrong NUMA node in an stable tree
- meeting a normal page

What does that line mean, and where does it fit in your argument?


I distinguish pages in three kinds.
- ksm page which already in stable tree in old node
- page in unstable tree in old node
- page not in any trees in old node

So normal page here I mean page not in any trees in old node.




- meeting a page which is ksm page before migration
  get_kpfn_nid(stable_node->kpfn) != NUMA(stable_node->nid) can't capture
them since stable_node is for tree page in current stable tree. They are
always equal.

When we meet a ksm page in the stable tree before it's migrated to another
NUMA node, yes, it will be on the right NUMA node (because we were careful
only to merge pages from the right NUMA node there), and that test will not
capture them.  It's for capturing a ksm page in the stable tree after it has
been migrated to another NUMA node.


The ksm page migrated to another NUMA node is still not freed, why? Who holds a 
page count on it? If it is not freed, since the new page is allocated on the new 
node and is a copy of the current ksm page, the current ksm page doesn't change, so 
get_kpfn_nid(stable_node->kpfn) *==* NUMA(stable_node->nid).




Hugh


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v2 1/2] mm: tuning hardcoded reserved memory

2013-03-01 Thread Ric Mason

On 03/02/2013 06:41 AM, Andrew Shewmaker wrote:

On Fri, Mar 01, 2013 at 10:40:43AM +0800, Ric Mason wrote:

On 02/28/2013 11:48 AM, Andrew Shewmaker wrote:

On Thu, Feb 28, 2013 at 02:12:00PM -0800, Andrew Morton wrote:

On Wed, 27 Feb 2013 15:56:30 -0500
Andrew Shewmaker  wrote:


The following patches are against the mmotm git tree as of February 27th.

The first patch only affects OVERCOMMIT_NEVER mode, entirely removing
the 3% reserve for other user processes.

The second patch affects both OVERCOMMIT_GUESS and OVERCOMMIT_NEVER
modes, replacing the hardcoded 3% reserve for the root user with a
tunable knob.


Gee, it's been years since anyone thought about the overcommit code.

Documentation/vm/overcommit-accounting says that OVERCOMMIT_ALWAYS is
"Appropriate for some scientific applications", but doesn't say why.
You're running a scientific cluster but you're using OVERCOMMIT_NEVER,
I think?  Is the documentation wrong?

None of my scientists appeared to use sparse arrays as Alan described.
My users would run jobs that appeared to initialize correctly. However,
they wouldn't write to every page they malloced (and they wouldn't use
calloc), so I saw jobs failing well into a computation once the
simulation tried to access a page and the kernel couldn't give it to them.

I think Roadrunner (http://en.wikipedia.org/wiki/IBM_Roadrunner) was
the first cluster I put into OVERCOMMIT_NEVER mode. Jobs with
infeasible memory requirements fail early and the OOM killer
gets triggered much less often than in guess mode. More often than not
the OOM killer seemed to kill the wrong thing causing a subtle brokenness.
Disabling overcommit worked so well during the stabilization and
early user phases that we did the same with other clusters.

Do you mean OVERCOMMIT_NEVER is more suitable for scientific
applications than OVERCOMMIT_GUESS and OVERCOMMIT_ALWAYS? Or should it
depend on the workload? Since your users would run jobs that wouldn't
write to every page they malloced, why is OVERCOMMIT_GUESS not
more suitable for you?

It depends on the workload. They eventually wrote to every page,
but not early in the life of the process, so they thought they
were fine until the simulation crashed.


Why is overcommit guess not suitable even if they eventually write to every 
page? It takes free pages, file pages, available swap pages, and reclaimable 
slab pages into consideration. In other words, these are all the available 
pages, so why is overcommit guess not suitable? Actually, I am confused about 
what the root difference between overcommit guess and never is.
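
[Editorial aside: the difference is easiest to see with a small user-space
experiment; this program is not part of the thread, and the exact behaviour
depends on RAM, swap, and vm.overcommit_memory. Under guess mode each chunk below
passes the per-allocation heuristic, so the failure usually appears only when the
pages are touched (often via the OOM killer); under never mode the cumulative
Committed_AS limit makes one of the malloc() calls fail early with ENOMEM.]

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK	(256UL << 20)		/* 256 MiB per allocation */
#define NCHUNKS	1024			/* ask for up to 256 GiB in total */

int main(void)
{
	static char *chunk[NCHUNKS];
	unsigned long i, got = 0;

	for (i = 0; i < NCHUNKS; i++) {
		chunk[i] = malloc(CHUNK);
		if (!chunk[i]) {
			perror("malloc");	/* early failure: typical of never mode */
			break;
		}
		got++;
	}
	printf("reserved %lu MiB, now touching the pages...\n", got * 256);

	for (i = 0; i < got; i++)
		memset(chunk[i], 1, CHUNK);	/* late failure point in guess mode */

	printf("touched %lu MiB successfully\n", got * 256);
	return 0;
}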





__vm_enough_memory reserves 3% of free pages with the default
overcommit mode and 6% when overcommit is disabled. These hardcoded
values have become less reasonable as memory sizes have grown.

On scientific clusters, systems are generally dedicated to one user.
Also, overcommit is sometimes disabled in order to prevent a long
running job from suddenly failing days or weeks into a calculation.
In this case, a user wishing to allocate as much memory as possible
to one process may be prevented from using, for example, around 7GB
out of 128GB.

The effect is less, but still significant when a user starts a job
with one process per core. I have repeatedly seen a set of processes
requesting the same amount of memory fail because one of them could
not allocate the amount of memory a user would expect to be able to
allocate.

...

--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -182,11 +182,6 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, 
int cap_sys_admin)
allowed -= allowed / 32;
allowed += total_swap_pages;
-   /* Don't let a single process grow too big:
-  leave 3% of the size of this process for other processes */
-   if (mm)
-   allowed -= mm->total_vm / 32;
-
if (percpu_counter_read_positive(&vm_committed_as) < allowed)
return 0;

So what might be the downside for this change?  root can't log in, I
assume.  Have you actually tested for this scenario and observed the
effects?

If there *are* observable risks and/or to preserve back-compatibility,
I guess we could create a fourth overcommit mode which provides the
headroom which you desire.

Also, should we be looking at removing root's 3% from OVERCOMMIT_GUESS
as well?

The downside of the first patch, which removes the "other" reserve
(sorry about the confusing duplicated subject line), is that a user
may not be able to kill their process, even if they have a shell prompt.
When testing, I did sometimes get into a spot where I attempted to execute
kill, but got: "bash: fork: Cannot allocate memory". Of course, a
user can get in the same predicament with the current 3% reserve--they
just have to start processes until 3% becomes negligible.

With just the first patch, root still has a 3% reserve, so they can
still log in.

When I resubmit the second patch, adding a tunable rootuser_reserve_pages
variable, I'll test both guess and never 



Re: [PATCHv5 2/8] zsmalloc: add documentation

2013-02-28 Thread Ric Mason

On 02/25/2013 11:18 PM, Seth Jennings wrote:

On 02/23/2013 06:37 PM, Ric Mason wrote:

On 02/23/2013 05:02 AM, Seth Jennings wrote:

On 02/21/2013 08:56 PM, Ric Mason wrote:

On 02/21/2013 11:50 PM, Seth Jennings wrote:

On 02/21/2013 02:49 AM, Ric Mason wrote:

On 02/19/2013 03:16 AM, Seth Jennings wrote:

On 02/16/2013 12:21 AM, Ric Mason wrote:

On 02/14/2013 02:38 AM, Seth Jennings wrote:

This patch adds a documentation file for zsmalloc at
Documentation/vm/zsmalloc.txt

Signed-off-by: Seth Jennings 
---
  Documentation/vm/zsmalloc.txt |   68
+
  1 file changed, 68 insertions(+)
  create mode 100644 Documentation/vm/zsmalloc.txt

diff --git a/Documentation/vm/zsmalloc.txt
b/Documentation/vm/zsmalloc.txt
new file mode 100644
index 000..85aa617
--- /dev/null
+++ b/Documentation/vm/zsmalloc.txt
@@ -0,0 +1,68 @@
+zsmalloc Memory Allocator
+
+Overview
+
+zmalloc a new slab-based memory allocator,
+zsmalloc, for storing compressed pages.  It is designed for
+low fragmentation and high allocation success rate on
+large object, but <= PAGE_SIZE allocations.
+
+zsmalloc differs from the kernel slab allocator in two primary
+ways to achieve these design goals.
+
+zsmalloc never requires high order page allocations to back
+slabs, or "size classes" in zsmalloc terms. Instead it allows
+multiple single-order pages to be stitched together into a
+"zspage" which backs the slab.  This allows for higher
allocation
+success rate under memory pressure.
+
+Also, zsmalloc allows objects to span page boundaries within the
+zspage.  This allows for lower fragmentation than could be had
+with the kernel slab allocator for objects between PAGE_SIZE/2
+and PAGE_SIZE.  With the kernel slab allocator, if a page
compresses
+to 60% of it original size, the memory savings gained through
+compression is lost in fragmentation because another object of
+the same size can't be stored in the leftover space.
+
+This ability to span pages results in zsmalloc allocations not
being
+directly addressable by the user.  The user is given an
+non-dereferencable handle in response to an allocation request.
+That handle must be mapped, using zs_map_object(), which returns
+a pointer to the mapped region that can be used.  The mapping is
+necessary since the object data may reside in two different
+noncontigious pages.

Do you mean the reason a zsmalloc object must be mapped after
malloc is that the object data may reside in two different noncontiguous
pages?

Yes, that is one reason for the mapping.  The other reason (more
of an
added bonus) is below.


+
+For 32-bit systems, zsmalloc has the added benefit of being
+able to back slabs with HIGHMEM pages, something not possible

What's the meaning of "back slabs with HIGHMEM pages"?

By HIGHMEM, I'm referring to the HIGHMEM memory zone on 32-bit
systems
with larger that 1GB (actually a little less) of RAM.  The upper
3GB
of the 4GB address space, depending on kernel build options, is not
directly addressable by the kernel, but can be mapped into the
kernel
address space with functions like kmap() or kmap_atomic().

These pages can't be used by slab/slub because they are not
continuously mapped into the kernel address space.  However, since
zsmalloc requires a mapping anyway to handle objects that span
non-contiguous page boundaries, we do the kernel mapping as part of
the process.

So zspages, the conceptual slab in zsmalloc backed by single-order
pages can include pages from the HIGHMEM zone as well.

Thanks for your clarification,
http://lwn.net/Articles/537422/, your article about zswap in lwn.
"Additionally, the kernel slab allocator does not allow
objects that
are less
than a page in size to span a page boundary. This means that if an
object is
PAGE_SIZE/2 + 1 bytes in size, it effectively uses an entire page,
resulting in
~50% waste. Hence there are *no kmalloc() cache sizes* between
PAGE_SIZE/2 and
PAGE_SIZE."
Are you sure? It seems that the kmalloc caches support big sizes; you can
check in
include/linux/kmalloc_sizes.h

Yes, kmalloc can allocate large objects > PAGE_SIZE, but there are no
cache sizes _between_ PAGE_SIZE/2 and PAGE_SIZE.  For example, on a
system with 4k pages, there are no caches between kmalloc-2048 and
kmalloc-4096.

A kmalloc object > PAGE_SIZE/2 or > PAGE_SIZE should also be allocated from a
slab cache, correct? Then how can such an object be allocated without a slab
cache that contains objects of this size?

I have to admit, I didn't understand the question.

An object is allocated from a slab cache, correct? There are two kinds of slab
cache: one is for general purposes, e.g. the kmalloc slab caches, and the other
is for special purposes, e.g. mm_struct and task_struct. A kmalloc object >
PAGE_SIZE/2 or > PAGE_SIZE should also be allocated from a slab cache,
correct? Then why did you say that there are no caches between
kmalloc-2048 and kmalloc-4096?

Ok, now I get it.  Yes, I guess I should qualified h
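
[Editorial aside: since the exchange above is about why zsmalloc hands out handles
rather than pointers, a minimal usage sketch may help. This is written from memory
of the staging-era API (drivers/staging/zsmalloc); the zs_create_pool() signature
in particular has changed across kernel versions, so treat the details as
assumptions rather than a reference.]

#include <linux/errno.h>
#include <linux/string.h>
#include <linux/zsmalloc.h>	/* staging path at the time: drivers/staging/zsmalloc/zsmalloc.h */

/* pool would come from zs_create_pool() at init time, e.g. zs_create_pool(GFP_KERNEL) */
static int store_compressed(struct zs_pool *pool, const void *src, size_t clen,
			    unsigned long *out_handle)
{
	unsigned long handle;
	void *dst;

	handle = zs_malloc(pool, clen);		/* clen <= PAGE_SIZE */
	if (!handle)
		return -ENOMEM;

	/*
	 * The handle is not a pointer: the object may straddle two
	 * non-contiguous (possibly HIGHMEM) pages, so it must be mapped
	 * before being written and unmapped afterwards.
	 */
	dst = zs_map_object(pool, handle, ZS_MM_WO);
	memcpy(dst, src, clen);
	zs_unmap_object(pool, handle);

	*out_handle = handle;			/* free later with zs_free(pool, handle) */
	return 0;
}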

Re: [PATCH 2/7] ksm: treat unstable nid like in stable tree

2013-02-28 Thread Ric Mason


Hi Hugh,
On 02/23/2013 05:03 AM, Hugh Dickins wrote:

On Fri, 22 Feb 2013, Ric Mason wrote:

On 02/21/2013 04:20 PM, Hugh Dickins wrote:

An inconsistency emerged in reviewing the NUMA node changes to KSM:
when meeting a page from the wrong NUMA node in a stable tree, we say
that it's okay for comparisons, but not as a leaf for merging; whereas
when meeting a page from the wrong NUMA node in an unstable tree, we
bail out immediately.

IIUC
- a ksm page from the wrong NUMA node will be added to the current node's stable tree


Please forgive my late response.


That should never happen (and when I was checking with a WARN_ON it did
not happen).  What can happen is that a node already in a stable tree
has its page migrated away to another NUMA node.


- a normal page from the wrong NUMA node will be merged into the current node's
stable tree  <- what am I missing here? I didn't see any special handling in
function stable_tree_search for this case.

nid = get_kpfn_nid(page_to_pfn(page));
root = root_stable_tree + nid;

to choose the right tree for the page, and

if (get_kpfn_nid(stable_node->kpfn) !=
NUMA(stable_node->nid)) {
put_page(tree_page);
goto replace;
}

to make sure that we don't latch on to a node whose page got migrated away.


I think the ksm implementation for numa awareness is buggy.

For page migration stuff, a new page is allocated from the node *which the page 
is migrated to*.

- when meeting a page from the wrong NUMA node in an unstable tree
get_kpfn_nid(page_to_pfn(page)) *==* page_to_nid(tree_page)
How can you say it's okay for comparisons, but not as a leaf for merging?
- when meeting a page from the wrong NUMA node in an stable tree
   - meeting a normal page
   - meeting a page which is ksm page before migration
 get_kpfn_nid(stable_node->kpfn) != NUMA(stable_node->nid) can't 
capture them since stable_node is for tree page in current stable tree. 
They are always equal.



- normal page from the wrong NUMA node will compare but not as a leaf for
merging after the patch

I don't understand you there, but hope my remarks above resolve it.

Hugh


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v2 1/2] mm: tuning hardcoded reserved memory

2013-02-28 Thread Ric Mason

On 02/28/2013 11:48 AM, Andrew Shewmaker wrote:

On Thu, Feb 28, 2013 at 02:12:00PM -0800, Andrew Morton wrote:

On Wed, 27 Feb 2013 15:56:30 -0500
Andrew Shewmaker  wrote:


The following patches are against the mmotm git tree as of February 27th.

The first patch only affects OVERCOMMIT_NEVER mode, entirely removing
the 3% reserve for other user processes.

The second patch affects both OVERCOMMIT_GUESS and OVERCOMMIT_NEVER
modes, replacing the hardcoded 3% reserve for the root user with a
tunable knob.


Gee, it's been years since anyone thought about the overcommit code.

Documentation/vm/overcommit-accounting says that OVERCOMMIT_ALWAYS is
"Appropriate for some scientific applications", but doesn't say why.
You're running a scientific cluster but you're using OVERCOMMIT_NEVER,
I think?  Is the documentation wrong?

None of my scientists appeared to use sparse arrays as Alan described.
My users would run jobs that appeared to initialize correctly. However,
they wouldn't write to every page they malloced (and they wouldn't use
calloc), so I saw jobs failing well into a computation once the
simulation tried to access a page and the kernel couldn't give it to them.

I think Roadrunner (http://en.wikipedia.org/wiki/IBM_Roadrunner) was
the first cluster I put into OVERCOMMIT_NEVER mode. Jobs with
infeasible memory requirements fail early and the OOM killer
gets triggered much less often than in guess mode. More often than not
the OOM killer seemed to kill the wrong thing causing a subtle brokenness.
Disabling overcommit worked so well during the stabilization and
early user phases that we did the same with other clusters.


Do you mean OVERCOMMIT_NEVER is more suitable for scientific applications 
than OVERCOMMIT_GUESS and OVERCOMMIT_ALWAYS? Or should it depend on the 
workload? Since your users would run jobs that wouldn't write to every 
page they malloced, why is OVERCOMMIT_GUESS not more suitable for you?





__vm_enough_memory reserves 3% of free pages with the default
overcommit mode and 6% when overcommit is disabled. These hardcoded
values have become less reasonable as memory sizes have grown.

On scientific clusters, systems are generally dedicated to one user.
Also, overcommit is sometimes disabled in order to prevent a long
running job from suddenly failing days or weeks into a calculation.
In this case, a user wishing to allocate as much memory as possible
to one process may be prevented from using, for example, around 7GB
out of 128GB.

The effect is less, but still significant when a user starts a job
with one process per core. I have repeatedly seen a set of processes
requesting the same amount of memory fail because one of them could
not allocate the amount of memory a user would expect to be able to
allocate.

...

--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -182,11 +182,6 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, 
int cap_sys_admin)
allowed -= allowed / 32;
allowed += total_swap_pages;
-   /* Don't let a single process grow too big:
-  leave 3% of the size of this process for other processes */
-   if (mm)
-   allowed -= mm->total_vm / 32;
-
if (percpu_counter_read_positive(&vm_committed_as) < allowed)
return 0;

So what might be the downside for this change?  root can't log in, I
assume.  Have you actually tested for this scenario and observed the
effects?

If there *are* observable risks and/or to preserve back-compatibility,
I guess we could create a fourth overcommit mode which provides the
headroom which you desire.

Also, should we be looking at removing root's 3% from OVERCOMMIT_GUESS
as well?

The downside of the first patch, which removes the "other" reserve
(sorry about the confusing duplicated subject line), is that a user
may not be able to kill their process, even if they have a shell prompt.
When testing, I did sometimes get into a spot where I attempted to execute
kill, but got: "bash: fork: Cannot allocate memory". Of course, a
user can get in the same predicament with the current 3% reserve--they
just have to start processes until 3% becomes negligible.

With just the first patch, root still has a 3% reserve, so they can
still log in.

When I resubmit the second patch, adding a tunable rootuser_reserve_pages
variable, I'll test both guess and never overcommit modes to see what
minimum initial values allow root to login and kill a user's memory
hogging process. This will be safer than the current behavior since
root's reserve will never shrink to something useless in the case where
a user has grabbed all available memory with many processes.


The idea of two patches looks reasonable to me.



As an estimate of a useful rootuser_reserve_pages, the rss+share size of


Sorry for my silly question: do you mean the share size is not included in the rss size?


sshd, bash, and top is about 16MB. Overcommit disabled mode would need
closer to 360MB for the same processes. On a 128GB box 3% is 3.8GB, so
the new tunable 
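
[Editorial arithmetic for the figures above: 3% of a 128 GB machine is about
3.8 GB, the number quoted for a single reserve; with overcommit disabled both the
root reserve and the per-process "other processes" reserve apply, roughly 6% or
about 7.7 GB, which is where the earlier "around 7GB out of 128GB" figure comes
from. Even the ~360 MB that never mode's strict accounting would want for sshd,
bash, and top is an order of magnitude smaller than the hardcoded reserve on such
a machine, which is the motivation for making it tunable.]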

Re: zsmalloc limitations and related topics

2013-02-28 Thread Ric Mason

On 02/28/2013 07:24 AM, Dan Magenheimer wrote:

Hi all --

I've been doing some experimentation on zsmalloc in preparation
for my topic proposed for LSFMM13 and have run across some
perplexing limitations.  Those familiar with the intimate details
of zsmalloc might be well aware of these limitations, but they
aren't documented or immediately obvious, so I thought it would
be worthwhile to air them publicly.  I've also included some
measurements from the experimentation and some related thoughts.

(Some of the terms here are unusual and may be used inconsistently
by different developers so a glossary of definitions of the terms
used here is appended.)

ZSMALLOC LIMITATIONS

Zsmalloc is used for two zprojects: zram and the out-of-tree
zswap.  Zsmalloc can achieve high density when "full".  But:

1) Zsmalloc has a worst-case density of 0.25 (one zpage per
four pageframes).
2) When not full and especially when nearly-empty _after_
being full, density may fall below 1.0 as a result of
fragmentation.


What's the meaning of nearly-empty _after_ being full?


3) Zsmalloc has a density of exactly 1.0 for any number of
zpages with zsize >= 0.8.
4) Zsmalloc contains several compile-time parameters;
the best value of these parameters may be very workload
dependent.

If density == 1.0, that means we are paying the overhead of
compression+decompression for no space advantage.  If
density < 1.0, that means using zsmalloc is detrimental,
resulting in worse memory pressure than if it were not used.

WORKLOAD ANALYSIS

These limitations emphasize that the workload used to evaluate
zsmalloc is very important.  Benchmarks that measure data


Could you share your benchmark, so that others can take advantage of it?



throughput or CPU utilization are of questionable value because
it is the _content_ of the data that is particularly relevant
for compression.  Even more precisely, it is the "entropy"
of the data that is relevant, because the amount of
compressibility in the data is related to the entropy:
I.e. an entirely random pagefull of bits will compress poorly
and a highly-regular pagefull of bits will compress well.
Since the zprojects manage a large number of zpages, both
the mean and distribution of zsize of the workload should
be "representative".

The workload most widely used to publish results for
the various zprojects is a kernel-compile using "make -jN"
where N is artificially increased to impose memory pressure.
By adding some debug code to zswap, I was able to analyze
this workload and found the following:

1) The average page compressed by almost a factor of six
(mean zsize == 694, stddev == 474)


stddev is what?


2) Almost eleven percent of the pages were zero pages.  A
zero page compresses to 28 bytes.
3) On average, 77% of the bytes (3156) in the pages-to-be-
compressed contained a byte-value of zero.
4) Despite the above, mean density of zsmalloc was measured at
3.2 zpages/pageframe, presumably losing nearly half of
available space to fragmentation.

I have no clue if these measurements are representative
of a wide range of workloads over the lifetime of a booted
machine, but I am suspicious that they are not.  For example,
the lzo1x compression algorithm claims to compress data by
about a factor of two.

I would welcome ideas on how to evaluate workloads for
"representativeness".  Personally I don't believe we should
be making decisions about selecting the "best" algorithms
or merging code without an agreement on workloads.

PAGEFRAME EVACUATION AND RECLAIM

I've repeatedly stated the opinion that managing the number of
pageframes containing compressed pages will be valuable for
managing MM interaction/policy when compression is used in
the kernel.  After the experimentation above and some brainstorming,
I still do not see an effective method for zsmalloc evacuating and
reclaiming pageframes, because both are complicated by high density
and page-crossing.  In other words, zsmalloc's strengths may
also be its Achilles heels.  For zram, as far as I can see,
pageframe evacuation/reclaim is irrelevant except perhaps
as part of mass defragmentation.  For zcache and zswap, where
writethrough is used, pageframe evacuation/reclaim is very relevant.
(Note: The writeback implemented in zswap does _zpage_ evacuation
without pageframe reclaim.)

CLOSING THOUGHT

Since zsmalloc and zbud have different strengths and weaknesses,
I wonder if some combination or hybrid might be more optimal?
But unless/until we have and can measure a representative workload,
only intuition can answer that.

GLOSSARY

zproject -- a kernel project using compression (zram, zcache, zswap)
zpage -- a compressed sequence of PAGE_SIZE bytes
zsize -- the number of bytes in a compressed page
pageframe -- the term "page" is widely used both to describe
 either (1) PAGE_SIZE bytes of data, or (2) a physical RAM
 area with size=PAGE_SIZE which is PAGE_SIZE-aligned,
 as represented 
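
[Editorial arithmetic for limitations 1 and 3 above, assuming a zspage links at
most four order-0 pages, as the allocator did in this era (the real code picks
between one and four pages per size class to minimise waste): a zspage is only
freed when its last object is freed, so the worst case is a four-page zspage left
holding a single zpage, i.e. 1 zpage / 4 pageframes = 0.25 density. Conversely, a
four-page zspage holds 4 x PAGE_SIZE bytes, so once zsize grows past roughly 4/5
of PAGE_SIZE (the "0.8" above) no more than four objects fit per zspage, at most
one zpage per pageframe, and density cannot rise above 1.0 no matter how full the
size class is.]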

Re: [PATCH] staging/zcache: Fix/improve zcache writeback code, tie to a config option

2013-02-28 Thread Ric Mason

On 03/01/2013 06:29 AM, Dan Magenheimer wrote:

From: Ric Mason [mailto:ric.mas...@gmail.com]
Subject: Re: [PATCH] staging/zcache: Fix/improve zcache writeback code, tie to 
a config option

On 02/07/2013 02:27 AM, Dan Magenheimer wrote:

It was observed by Andrea Arcangeli in 2011 that zcache can get "full"
and there must be some way for compressed swap pages to be (uncompressed
and then) sent through to the backing swap disk.  A prototype of this
functionality, called "unuse", was added in 2012 as part of a major update
to zcache (aka "zcache2"), but was left unfinished due to the unfortunate
temporary fork of zcache.

This earlier version of the code had an unresolved memory leak
and was anyway dependent on not-yet-upstream frontswap and mm changes.
The code was meanwhile adapted by Seth Jennings for similar
functionality in zswap (which he calls "flush").  Seth also made some
clever simplifications which are herein ported back to zcache.  As a
result of those simplifications, the frontswap changes are no longer
necessary, but a slightly different (and simpler) set of mm changes are
still required [1].  The memory leak is also fixed.

Due to feedback from akpm in a zswap thread, this functionality in zcache
has now been renamed from "unuse" to "writeback".

Although this zcache writeback code now works, there are open questions
as how best to handle the policy that drives it.  As a result, this
patch also ties writeback to a new config option.  And, since the
code still depends on not-yet-upstreamed mm patches, to avoid build
problems, the config option added by this patch temporarily depends
on "BROKEN"; this config dependency can be removed in trees that
contain the necessary mm patches.

[1] https://lkml.org/lkml/2013/1/29/540/ https://lkml.org/lkml/2013/1/29/539/

shrink_zcache_memory:

while(nr_evict-- > 0) {
  page = zcache_evict_eph_pageframe();
  if (page == NULL)
  break;
  zcache_free_page(page);
}

zcache_evict_eph_pageframe
->zbud_evict_pageframe_lru
  ->zbud_evict_tmem
  ->tmem_flush_page
  ->zcache_pampd_free
  ->zcache_free_page  <- zbudpage has already been free here

Can the zcache_free_page() called in shrink_zcache_memory() be treated as
a double free?

Thanks for the code review and sorry for the delay...

zcache_pampd_free() only calls zcache_free_page() if page is non-NULL,
but in this code path I think when zcache_pampd_free() calls
zbud_free_and_delist(), that function determines that the zbudpage
is dying and returns NULL.

So unless I am misunderstanding (or misreading the code), there
is no double free.


Oh, I see. Thanks for your response. :)



Thanks,
Dan


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] staging/zcache: Fix/improve zcache writeback code, tie to a config option

2013-02-28 Thread Ric Mason

On 03/01/2013 06:29 AM, Dan Magenheimer wrote:

From: Ric Mason [mailto:ric.mas...@gmail.com]
Subject: Re: [PATCH] staging/zcache: Fix/improve zcache writeback code, tie to 
a config option

On 02/07/2013 02:27 AM, Dan Magenheimer wrote:

It was observed by Andrea Arcangeli in 2011 that zcache can get full
and there must be some way for compressed swap pages to be (uncompressed
and then) sent through to the backing swap disk.  A prototype of this
functionality, called unuse, was added in 2012 as part of a major update
to zcache (aka zcache2), but was left unfinished due to the unfortunate
temporary fork of zcache.

This earlier version of the code had an unresolved memory leak
and was anyway dependent on not-yet-upstream frontswap and mm changes.
The code was meanwhile adapted by Seth Jennings for similar
functionality in zswap (which he calls flush).  Seth also made some
clever simplifications which are herein ported back to zcache.  As a
result of those simplifications, the frontswap changes are no longer
necessary, but a slightly different (and simpler) set of mm changes are
still required [1].  The memory leak is also fixed.

Due to feedback from akpm in a zswap thread, this functionality in zcache
has now been renamed from unuse to writeback.

Although this zcache writeback code now works, there are open questions
as how best to handle the policy that drives it.  As a result, this
patch also ties writeback to a new config option.  And, since the
code still depends on not-yet-upstreamed mm patches, to avoid build
problems, the config option added by this patch temporarily depends
on BROKEN; this config dependency can be removed in trees that
contain the necessary mm patches.

[1] https://lkml.org/lkml/2013/1/29/540/ https://lkml.org/lkml/2013/1/29/539/

shrink_zcache_memory:

while(nr_evict--  0) {
  page = zcache_evict_eph_pageframe();
  if (page == NULL)
  break;
  zcache_free_page(page);
}

zcache_evict_eph_pageframe
-zbud_evict_pageframe_lru
  -zbud_evict_tmem
  -tmem_flush_page
  -zcache_pampd_free
  -zcache_free_page  - zbudpage has already been free here

If the zcache_free_page called in shrink_zcache_memory can be treated as
a double free?

Thanks for the code review and sorry for the delay...

zcache_pampd_free() only calls zcache_free_page() if page is non-NULL,
but in this code path I think when zcache_pampd_free() calls
zbud_free_and_delist(), that function determines that the zbudpage
is dying and returns NULL.

So unless I am misunderstanding (or misreading the code), there
is no double free.


Oh, I see. Thanks for your response. :)



Thanks,
Dan


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: zsmalloc limitations and related topics

2013-02-28 Thread Ric Mason

On 02/28/2013 07:24 AM, Dan Magenheimer wrote:

Hi all --

I've been doing some experimentation on zsmalloc in preparation
for my topic proposed for LSFMM13 and have run across some
perplexing limitations.  Those familiar with the intimate details
of zsmalloc might be well aware of these limitations, but they
aren't documented or immediately obvious, so I thought it would
be worthwhile to air them publicly.  I've also included some
measurements from the experimentation and some related thoughts.

(Some of the terms here are unusual and may be used inconsistently
by different developers so a glossary of definitions of the terms
used here is appended.)

ZSMALLOC LIMITATIONS

Zsmalloc is used for two zprojects: zram and the out-of-tree
zswap.  Zsmalloc can achieve high density when full.  But:

1) Zsmalloc has a worst-case density of 0.25 (one zpage per
four pageframes).
2) When not full and especially when nearly-empty _after_
being full, density may fall below 1.0 as a result of
fragmentation.


What's the meaning of nearly-empty _after_ being full?


3) Zsmalloc has a density of exactly 1.0 for any number of
zpages with zsize = 0.8.
4) Zsmalloc contains several compile-time parameters;
the best value of these parameters may be very workload
dependent.

If density == 1.0, that means we are paying the overhead of
compression+decompression for no space advantage.  If
density  1.0, that means using zsmalloc is detrimental,
resulting in worse memory pressure than if it were not used.

WORKLOAD ANALYSIS

These limitations emphasize that the workload used to evaluate
zsmalloc is very important.  Benchmarks that measure data


Could you share your benchmark? In order that other guys can take 
advantage of it.



throughput or CPU utilization are of questionable value because
it is the _content_ of the data that is particularly relevant
for compression.  Even more precisely, it is the entropy
of the data that is relevant, because the amount of
compressibility in the data is related to the entropy:
I.e. an entirely random pagefull of bits will compress poorly
and a highly-regular pagefull of bits will compress well.
Since the zprojects manage a large number of zpages, both
the mean and distribution of zsize of the workload should
be representative.

The workload most widely used to publish results for
the various zprojects is a kernel-compile using make -jN
where N is artificially increased to impose memory pressure.
By adding some debug code to zswap, I was able to analyze
this workload and found the following:

1) The average page compressed by almost a factor of six
(mean zsize == 694, stddev == 474)


What is stddev here?


2) Almost eleven percent of the pages were zero pages.  A
zero page compresses to 28 bytes.
3) On average, 77% of the bytes (3156) in the pages-to-be-
compressed contained a byte-value of zero.
4) Despite the above, mean density of zsmalloc was measured at
3.2 zpages/pageframe, presumably losing nearly half of
available space to fragmentation.

I have no clue if these measurements are representative
of a wide range of workloads over the lifetime of a booted
machine, but I am suspicious that they are not.  For example,
the lzo1x compression algorithm claims to compress data by
about a factor of two.

I would welcome ideas on how to evaluate workloads for
representativeness.  Personally I don't believe we should
be making decisions about selecting the best algorithms
or merging code without an agreement on workloads.

PAGEFRAME EVACUATION AND RECLAIM

I've repeatedly stated the opinion that managing the number of
pageframes containing compressed pages will be valuable for
managing MM interaction/policy when compression is used in
the kernel.  After the experimentation above and some brainstorming,
I still do not see an effective method for zsmalloc evacuating and
reclaiming pageframes, because both are complicated by high density
and page-crossing.  In other words, zsmalloc's strengths may
also be its Achilles heels.  For zram, as far as I can see,
pageframe evacuation/reclaim is irrelevant except perhaps
as part of mass defragmentation.  For zcache and zswap, where
writethrough is used, pageframe evacuation/reclaim is very relevant.
(Note: The writeback implemented in zswap does _zpage_ evacuation
without pageframe reclaim.)

CLOSING THOUGHT

Since zsmalloc and zbud have different strengths and weaknesses,
I wonder if some combination or hybrid might be more optimal?
But unless/until we have and can measure a representative workload,
only intuition can answer that.

GLOSSARY

zproject -- a kernel project using compression (zram, zcache, zswap)
zpage -- a compressed sequence of PAGE_SIZE bytes
zsize -- the number of bytes in a compressed page
pageframe -- the term page is widely used both to describe
 either (1) PAGE_SIZE bytes of data, or (2) a physical RAM
 area with size=PAGE_SIZE which is PAGE_SIZE-aligned,
 as represented in the kernel 

Re: [RFC PATCH v2 1/2] mm: tuning hardcoded reserved memory

2013-02-28 Thread Ric Mason

On 02/28/2013 11:48 AM, Andrew Shewmaker wrote:

On Thu, Feb 28, 2013 at 02:12:00PM -0800, Andrew Morton wrote:

On Wed, 27 Feb 2013 15:56:30 -0500
Andrew Shewmaker ags...@gmail.com wrote:


The following patches are against the mmotm git tree as of February 27th.

The first patch only affects OVERCOMMIT_NEVER mode, entirely removing
the 3% reserve for other user processes.

The second patch affects both OVERCOMMIT_GUESS and OVERCOMMIT_NEVER
modes, replacing the hardcoded 3% reserve for the root user with a
tunable knob.


Gee, it's been years since anyone thought about the overcommit code.

Documentation/vm/overcommit-accounting says that OVERCOMMIT_ALWAYS is
Appropriate for some scientific applications, but doesn't say why.
You're running a scientific cluster but you're using OVERCOMMIT_NEVER,
I think?  Is the documentation wrong?

None of my scientists appeared to use sparse arrays as Alan described.
My users would run jobs that appeared to initialize correctly. However,
they wouldn't write to every page they malloced (and they wouldn't use
calloc), so I saw jobs failing well into a computation once the
simulation tried to access a page and the kernel couldn't give it to them.

I think Roadrunner (http://en.wikipedia.org/wiki/IBM_Roadrunner) was
the first cluster I put into OVERCOMMIT_NEVER mode. Jobs with
infeasible memory requirements fail early and the OOM killer
gets triggered much less often than in guess mode. More often than not
the OOM killer seemed to kill the wrong thing causing a subtle brokenness.
Disabling overcommit worked so well during the stabilization and
early user phases that we did the same with other clusters.


Do you mean OVERCOMMIT_NEVER is more suitable for scientific applications 
than OVERCOMMIT_GUESS and OVERCOMMIT_ALWAYS? Or does it depend on the 
workload? Since your users ran jobs that wouldn't write to every page they 
malloced, why isn't OVERCOMMIT_GUESS more suitable for you?





__vm_enough_memory reserves 3% of free pages with the default
overcommit mode and 6% when overcommit is disabled. These hardcoded
values have become less reasonable as memory sizes have grown.

On scientific clusters, systems are generally dedicated to one user.
Also, overcommit is sometimes disabled in order to prevent a long
running job from suddenly failing days or weeks into a calculation.
In this case, a user wishing to allocate as much memory as possible
to one process may be prevented from using, for example, around 7GB
out of 128GB.
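(A back-of-envelope, user-space version of that arithmetic; it mirrors the two divide-by-32 reserves discussed here rather than reproducing __vm_enough_memory() exactly.)

#include <stdio.h>

int main(void)
{
	double allowed = 128.0;		/* pretend CommitLimit, in GB */

	allowed -= allowed / 32;	/* ~3% held back for root */
	allowed -= allowed / 32;	/* ~3% of a process that tries to
					 * commit nearly everything
					 * (mm->total_vm / 32) */

	printf("usable: %.1f GB of 128 GB (%.1f GB out of reach)\n",
	       allowed, 128.0 - allowed);	/* ~120.1 and ~7.9 */
	return 0;
}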

The effect is less, but still significant when a user starts a job
with one process per core. I have repeatedly seen a set of processes
requesting the same amount of memory fail because one of them could
not allocate the amount of memory a user would expect to be able to
allocate.

...

--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -182,11 +182,6 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, 
int cap_sys_admin)
allowed -= allowed / 32;
allowed += total_swap_pages;
  
-	/* Don't let a single process grow too big:
-	   leave 3% of the size of this process for other processes */
-	if (mm)
-		allowed -= mm->total_vm / 32;
-
	if (percpu_counter_read_positive(&vm_committed_as) < allowed)
return 0;

So what might be the downside for this change?  root can't log in, I
assume.  Have you actually tested for this scenario and observed the
effects?

If there *are* observable risks and/or to preserve back-compatibility,
I guess we could create a fourth overcommit mode which provides the
headroom which you desire.

Also, should we be looking at removing root's 3% from OVERCOMMIT_GUESS
as well?

The downside of the first patch, which removes the other reserve
(sorry about the confusing duplicated subject line), is that a user
may not be able to kill their process, even if they have a shell prompt.
When testing, I did sometimes get into spot where I attempted to execute
kill, but got: bash: fork: Cannot allocate memory. Of course, a
user can get in the same predicament with the current 3% reserve--they
just have to start processes until 3% becomes negligible.

With just the first patch, root still has a 3% reserve, so they can
still log in.

When I resubmit the second patch, adding a tunable rootuser_reserve_pages
variable, I'll test both guess and never overcommit modes to see what
minimum initial values allow root to login and kill a user's memory
hogging process. This will be safer than the current behavior since
root's reserve will never shrink to something useless in the case where
a user has grabbed all available memory with many processes.


The idea of two patches looks reasonable to me.



As an estimate of a useful rootuser_reserve_pages, the rss+share size of


Sorry for the silly question: why isn't the shared size already included in the rss size?


sshd, bash, and top is about 16MB. Overcommit disabled mode would need
closer to 360MB for the same processes. On a 128GB box 3% is 3.8GB, so
the new 

Re: [PATCH 2/7] ksm: treat unstable nid like in stable tree

2013-02-28 Thread Ric Mason


Hi Hugh,
On 02/23/2013 05:03 AM, Hugh Dickins wrote:

On Fri, 22 Feb 2013, Ric Mason wrote:

On 02/21/2013 04:20 PM, Hugh Dickins wrote:

An inconsistency emerged in reviewing the NUMA node changes to KSM:
when meeting a page from the wrong NUMA node in a stable tree, we say
that it's okay for comparisons, but not as a leaf for merging; whereas
when meeting a page from the wrong NUMA node in an unstable tree, we
bail out immediately.

IIUC
- a ksm page from the wrong NUMA node will be added to the current node's stable tree


Please forgive my late response.


That should never happen (and when I was checking with a WARN_ON it did
not happen).  What can happen is that a node already in a stable tree
has its page migrated away to another NUMA node.


- a normal page from the wrong NUMA node will be merged into the current node's
stable tree  <- what am I missing here? I didn't see any special handling in
function stable_tree_search() for this case.

nid = get_kpfn_nid(page_to_pfn(page));
root = root_stable_tree + nid;

to choose the right tree for the page, and

if (get_kpfn_nid(stable_node->kpfn) !=
NUMA(stable_node->nid)) {
put_page(tree_page);
goto replace;
}

to make sure that we don't latch on to a node whose page got migrated away.


I think the ksm implementation for numa awareness is buggy.

For the page migration case, the new page is allocated from the node *which 
the page is migrated to*.

- when meeting a page from the wrong NUMA node in an unstable tree,
get_kpfn_nid(page_to_pfn(page)) *==* page_to_nid(tree_page)
How can we say it's okay for comparisons, but not as a leaf for merging?
- when meeting a page from the wrong NUMA node in a stable tree
   - meeting a normal page
   - meeting a page which was a ksm page before migration
 get_kpfn_nid(stable_node->kpfn) != NUMA(stable_node->nid) can't 
capture them, since stable_node is for the tree page in the current stable 
tree. They are always equal.



- after the patch, a normal page from the wrong NUMA node will be compared,
but not used as a leaf for merging

I don't understand you there, but hope my remarks above resolve it.

Hugh


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCHv5 2/8] zsmalloc: add documentation

2013-02-28 Thread Ric Mason

On 02/25/2013 11:18 PM, Seth Jennings wrote:

On 02/23/2013 06:37 PM, Ric Mason wrote:

On 02/23/2013 05:02 AM, Seth Jennings wrote:

On 02/21/2013 08:56 PM, Ric Mason wrote:

On 02/21/2013 11:50 PM, Seth Jennings wrote:

On 02/21/2013 02:49 AM, Ric Mason wrote:

On 02/19/2013 03:16 AM, Seth Jennings wrote:

On 02/16/2013 12:21 AM, Ric Mason wrote:

On 02/14/2013 02:38 AM, Seth Jennings wrote:

This patch adds a documentation file for zsmalloc at
Documentation/vm/zsmalloc.txt

Signed-off-by: Seth Jennings sjenn...@linux.vnet.ibm.com
---
  Documentation/vm/zsmalloc.txt |   68
+
  1 file changed, 68 insertions(+)
  create mode 100644 Documentation/vm/zsmalloc.txt

diff --git a/Documentation/vm/zsmalloc.txt
b/Documentation/vm/zsmalloc.txt
new file mode 100644
index 000..85aa617
--- /dev/null
+++ b/Documentation/vm/zsmalloc.txt
@@ -0,0 +1,68 @@
+zsmalloc Memory Allocator
+
+Overview
+
+zmalloc a new slab-based memory allocator,
+zsmalloc, for storing compressed pages.  It is designed for
+low fragmentation and high allocation success rate on
+large object, but <= PAGE_SIZE allocations.
+
+zsmalloc differs from the kernel slab allocator in two primary
+ways to achieve these design goals.
+
+zsmalloc never requires high order page allocations to back
+slabs, or size classes in zsmalloc terms. Instead it allows
+multiple single-order pages to be stitched together into a
+zspage which backs the slab.  This allows for higher
allocation
+success rate under memory pressure.
+
+Also, zsmalloc allows objects to span page boundaries within the
+zspage.  This allows for lower fragmentation than could be had
+with the kernel slab allocator for objects between PAGE_SIZE/2
+and PAGE_SIZE.  With the kernel slab allocator, if a page
compresses
+to 60% of it original size, the memory savings gained through
+compression is lost in fragmentation because another object of
+the same size can't be stored in the leftover space.
+
+This ability to span pages results in zsmalloc allocations not
being
+directly addressable by the user.  The user is given an
+non-dereferencable handle in response to an allocation request.
+That handle must be mapped, using zs_map_object(), which returns
+a pointer to the mapped region that can be used.  The mapping is
+necessary since the object data may reside in two different
+noncontigious pages.

Do you mean the reason a zsmalloc object must be mapped after allocation
is that the object data may reside in two different noncontiguous
pages?

Yes, that is one reason for the mapping.  The other reason (more
of an
added bonus) is below.


+
+For 32-bit systems, zsmalloc has the added benefit of being
+able to back slabs with HIGHMEM pages, something not possible

What's the meaning of "back slabs with HIGHMEM pages"?

By HIGHMEM, I'm referring to the HIGHMEM memory zone on 32-bit
systems
with larger that 1GB (actually a little less) of RAM.  The upper
3GB
of the 4GB address space, depending on kernel build options, is not
directly addressable by the kernel, but can be mapped into the
kernel
address space with functions like kmap() or kmap_atomic().

These pages can't be used by slab/slub because they are not
continuously mapped into the kernel address space.  However, since
zsmalloc requires a mapping anyway to handle objects that span
non-contiguous page boundaries, we do the kernel mapping as part of
the process.

So zspages, the conceptual slab in zsmalloc backed by single-order
pages can include pages from the HIGHMEM zone as well.
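(For illustration, a minimal sketch of the allocate/map/unmap cycle being described; exact function signatures and the header location vary between kernel versions, and zsmalloc lived under drivers/staging at this point, so treat it as a sketch rather than the definitive API.)

#include <linux/gfp.h>
#include <linux/string.h>
#include <linux/errno.h>
#include <linux/zsmalloc.h>	/* in-staging builds use the local zsmalloc.h */

static int store_compressed(const void *src, size_t clen)
{
	struct zs_pool *pool = zs_create_pool(GFP_KERNEL);
	unsigned long handle;
	void *dst;

	if (!pool)
		return -ENOMEM;

	/* the handle is opaque: it is not a pointer and cannot be dereferenced */
	handle = zs_malloc(pool, clen);
	if (!handle) {
		zs_destroy_pool(pool);
		return -ENOMEM;
	}

	/* map the object (it may span two pages and may live in HIGHMEM) ... */
	dst = zs_map_object(pool, handle, ZS_MM_WO);
	memcpy(dst, src, clen);
	/* ... and unmap again as soon as possible; mappings are short-lived */
	zs_unmap_object(pool, handle);

	/* a real user keeps the handle around; here we just clean up */
	zs_free(pool, handle);
	zs_destroy_pool(pool);
	return 0;
}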

Thanks for clarifying.
Regarding http://lwn.net/Articles/537422/, your article about zswap on LWN:
Additionally, the kernel slab allocator does not allow
objects that
are less
than a page in size to span a page boundary. This means that if an
object is
PAGE_SIZE/2 + 1 bytes in size, it effectively use an entire page,
resulting in
~50% waste. Hense there are *no kmalloc() cache size* between
PAGE_SIZE/2 and
PAGE_SIZE.
Are you sure? It seems that kmalloc caches support big sizes; you can
check in
include/linux/kmalloc_sizes.h

Yes, kmalloc can allocate large objects > PAGE_SIZE, but there are no
cache sizes _between_ PAGE_SIZE/2 and PAGE_SIZE.  For example, on a
system with 4k pages, there are no caches between kmalloc-2048 and
kmalloc-4096.

A kmalloc object > PAGE_SIZE/2 or > PAGE_SIZE should also be allocated from
a slab cache, correct? Then how can an object be allocated without a slab
cache that contains objects of this size?

I have to admit, I didn't understand the question.

Objects are allocated from slab caches, correct? There are two kinds of slab
caches: general-purpose ones, e.g. the kmalloc caches, and special-purpose
ones, e.g. for mm_struct or task_struct. A kmalloc object > PAGE_SIZE/2
or > PAGE_SIZE should also be allocated from a slab cache, correct? Then why
did you say that there are no caches between kmalloc-2048 and kmalloc-4096?

Ok, now I get it.  Yes, I guess I should have qualified here that there are
no _kmalloc_ caches between PAGE_SIZE

Re: [PATCH] staging/zcache: Fix/improve zcache writeback code, tie to a config option

2013-02-25 Thread Ric Mason

On 02/26/2013 01:29 AM, Dan Magenheimer wrote:

From: Ric Mason [mailto:ric.mas...@gmail.com]
Subject: Re: [PATCH] staging/zcache: Fix/improve zcache writeback code, tie to 
a config option

On 02/07/2013 02:27 AM, Dan Magenheimer wrote:

It was observed by Andrea Arcangeli in 2011 that zcache can get "full"
and there must be some way for compressed swap pages to be (uncompressed
and then) sent through to the backing swap disk.  A prototype of this
functionality, called "unuse", was added in 2012 as part of a major update
to zcache (aka "zcache2"), but was left unfinished due to the unfortunate
temporary fork of zcache.

This earlier version of the code had an unresolved memory leak
and was anyway dependent on not-yet-upstream frontswap and mm changes.
The code was meanwhile adapted by Seth Jennings for similar
functionality in zswap (which he calls "flush").  Seth also made some
clever simplifications which are herein ported back to zcache.  As a
result of those simplifications, the frontswap changes are no longer
necessary, but a slightly different (and simpler) set of mm changes are
still required [1].  The memory leak is also fixed.

Due to feedback from akpm in a zswap thread, this functionality in zcache
has now been renamed from "unuse" to "writeback".

Although this zcache writeback code now works, there are open questions
as how best to handle the policy that drives it.  As a result, this
patch also ties writeback to a new config option.  And, since the
code still depends on not-yet-upstreamed mm patches, to avoid build
problems, the config option added by this patch temporarily depends
on "BROKEN"; this config dependency can be removed in trees that
contain the necessary mm patches.

[1] https://lkml.org/lkml/2013/1/29/540/ https://lkml.org/lkml/2013/1/29/539/

This patch leads to the backend interacting with the core mm directly; shouldn't
the core mm interact with the frontend instead of the backend? In addition,
frontswap already has a shrink function; can we take advantage
of it?

Good questions!

If you have ideas (or patches) that handle the interaction with
the frontend instead of backend, we can take a look at them.
But for zcache (and zswap), the backend already interacts with
the core mm, for example to allocate and free pageframes.

The existing frontswap shrink function causes data pages to be sucked
back from the backend.  The data pages are put back in the swapcache
and they aren't marked in any way so it is possible the data page
might soon (or immediately) be sent back to the backend.


Then can frontswap shrink work well?



This code is used for backends that can't "callback" the frontend, such
as the Xen tmem backend and ramster.  But I do agree that there
might be a good use for the frontswap shrink function for zcache
(and zswap).  Any ideas?

Dan


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/7] ksm: responses to NUMA review

2013-02-23 Thread Ric Mason

On 02/23/2013 04:38 AM, Hugh Dickins wrote:

On Fri, 22 Feb 2013, Ric Mason wrote:

On 02/21/2013 04:17 PM, Hugh Dickins wrote:

Here's a second KSM series, based on mmotm 2013-02-19-17-20: partly in
response to Mel's review feedback, partly fixes to issues that I found
myself in doing more review and testing.  None of the issues fixed are
truly show-stoppers, though I would prefer them fixed sooner than later.

Do you have any ideas about ksm supporting page cache and tmpfs?

No.  It's only been asked as a hypothetical question: I don't know of
anyone actually needing it, and I wouldn't have time to do it myself.

It would be significantly more invasive than just dealing with anonymous
memory: with anon, we already have the infrastructure for read-only pages,
but we don't at present have any notion of read-only pagecache.

Just doing it in tmpfs?  Well, yes, that might be easier: since v3.1's
radix_tree rework, shmem/tmpfs mostly goes through its own interfaces
to pagecache, so read-only pagecache, and hence KSM, might be easier
to implement there than more generally.


Ok, are there potential users who would take advantage of it?



Hugh


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCHv5 2/8] zsmalloc: add documentation

2013-02-23 Thread Ric Mason

On 02/23/2013 05:02 AM, Seth Jennings wrote:

On 02/21/2013 08:56 PM, Ric Mason wrote:

On 02/21/2013 11:50 PM, Seth Jennings wrote:

On 02/21/2013 02:49 AM, Ric Mason wrote:

On 02/19/2013 03:16 AM, Seth Jennings wrote:

On 02/16/2013 12:21 AM, Ric Mason wrote:

On 02/14/2013 02:38 AM, Seth Jennings wrote:

This patch adds a documentation file for zsmalloc at
Documentation/vm/zsmalloc.txt

Signed-off-by: Seth Jennings 
---
 Documentation/vm/zsmalloc.txt |   68
+
 1 file changed, 68 insertions(+)
 create mode 100644 Documentation/vm/zsmalloc.txt

diff --git a/Documentation/vm/zsmalloc.txt
b/Documentation/vm/zsmalloc.txt
new file mode 100644
index 000..85aa617
--- /dev/null
+++ b/Documentation/vm/zsmalloc.txt
@@ -0,0 +1,68 @@
+zsmalloc Memory Allocator
+
+Overview
+
+zmalloc a new slab-based memory allocator,
+zsmalloc, for storing compressed pages.  It is designed for
+low fragmentation and high allocation success rate on
+large object, but <= PAGE_SIZE allocations.
+
+zsmalloc differs from the kernel slab allocator in two primary
+ways to achieve these design goals.
+
+zsmalloc never requires high order page allocations to back
+slabs, or "size classes" in zsmalloc terms. Instead it allows
+multiple single-order pages to be stitched together into a
+"zspage" which backs the slab.  This allows for higher allocation
+success rate under memory pressure.
+
+Also, zsmalloc allows objects to span page boundaries within the
+zspage.  This allows for lower fragmentation than could be had
+with the kernel slab allocator for objects between PAGE_SIZE/2
+and PAGE_SIZE.  With the kernel slab allocator, if a page
compresses
+to 60% of it original size, the memory savings gained through
+compression is lost in fragmentation because another object of
+the same size can't be stored in the leftover space.
+
+This ability to span pages results in zsmalloc allocations not
being
+directly addressable by the user.  The user is given an
+non-dereferencable handle in response to an allocation request.
+That handle must be mapped, using zs_map_object(), which returns
+a pointer to the mapped region that can be used.  The mapping is
+necessary since the object data may reside in two different
+noncontigious pages.

Do you mean the reason a zsmalloc object must be mapped after allocation
is that the object data may reside in two different noncontiguous
pages?

Yes, that is one reason for the mapping.  The other reason (more
of an
added bonus) is below.


+
+For 32-bit systems, zsmalloc has the added benefit of being
+able to back slabs with HIGHMEM pages, something not possible

What's the meaning of "back slabs with HIGHMEM pages"?

By HIGHMEM, I'm referring to the HIGHMEM memory zone on 32-bit
systems
with larger that 1GB (actually a little less) of RAM.  The upper 3GB
of the 4GB address space, depending on kernel build options, is not
directly addressable by the kernel, but can be mapped into the kernel
address space with functions like kmap() or kmap_atomic().

These pages can't be used by slab/slub because they are not
continuously mapped into the kernel address space.  However, since
zsmalloc requires a mapping anyway to handle objects that span
non-contiguous page boundaries, we do the kernel mapping as part of
the process.

So zspages, the conceptual slab in zsmalloc backed by single-order
pages can include pages from the HIGHMEM zone as well.

Thanks for clarifying.
   Regarding http://lwn.net/Articles/537422/, your article about zswap on LWN:
   "Additionally, the kernel slab allocator does not allow objects that
are less
than a page in size to span a page boundary. This means that if an
object is
PAGE_SIZE/2 + 1 bytes in size, it effectively use an entire page,
resulting in
~50% waste. Hense there are *no kmalloc() cache size* between
PAGE_SIZE/2 and
PAGE_SIZE."
Are you sure? It seems that kmalloc caches support big sizes; you can
check in
include/linux/kmalloc_sizes.h

Yes, kmalloc can allocate large objects > PAGE_SIZE, but there are no
cache sizes _between_ PAGE_SIZE/2 and PAGE_SIZE.  For example, on a
system with 4k pages, there are no caches between kmalloc-2048 and
kmalloc-4096.
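(A small illustration of that rounding, assuming a kernel-module context and 4k pages; ksize() reports the size of the slab object actually backing the allocation.)

#include <linux/kernel.h>
#include <linux/slab.h>

static void kmalloc_rounding_demo(void)
{
	/* just over half a page: there is no kmalloc cache between 2048 and
	 * 4096, so this is served from kmalloc-4096 */
	void *p = kmalloc(PAGE_SIZE / 2 + 1, GFP_KERNEL);

	if (p) {
		/* expect "requested 2049, backed by 4096": ~50% of the page
		 * is unusable, which is the gap zsmalloc is aiming at */
		pr_info("requested %lu, backed by %zu\n",
			PAGE_SIZE / 2 + 1, ksize(p));
		kfree(p);
	}
}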

A kmalloc object > PAGE_SIZE/2 or > PAGE_SIZE should also be allocated from
a slab cache, correct? Then how can an object be allocated without a slab
cache that contains objects of this size?

I have to admit, I didn't understand the question.


Objects are allocated from slab caches, correct? There are two kinds of slab 
caches: general-purpose ones, e.g. the kmalloc caches, and special-purpose 
ones, e.g. for mm_struct or task_struct. A kmalloc object > PAGE_SIZE/2 
or > PAGE_SIZE should also be allocated from a slab cache, correct? Then why 
did you say that there are no caches between kmalloc-2048 and kmalloc-4096?




Thanks,
Seth



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger


Re: [PATCH 2/7] ksm: treat unstable nid like in stable tree

2013-02-21 Thread Ric Mason

On 02/21/2013 04:20 PM, Hugh Dickins wrote:

An inconsistency emerged in reviewing the NUMA node changes to KSM:
when meeting a page from the wrong NUMA node in a stable tree, we say
that it's okay for comparisons, but not as a leaf for merging; whereas
when meeting a page from the wrong NUMA node in an unstable tree, we
bail out immediately.


IIUC
- a ksm page from the wrong NUMA node will be added to the current node's stable 
tree
- a normal page from the wrong NUMA node will be merged into the current node's 
stable tree  <- what am I missing here? I didn't see any special handling in 
function stable_tree_search() for this case.
- after the patch, a normal page from the wrong NUMA node will be compared, but 
not used as a leaf for merging




Now, it might be that a wrong NUMA node in an unstable tree is more
likely to correlate with instablility (different content, with rbnode
now misplaced) than page migration; but even so, we are accustomed to
instablility in the unstable tree.

Without strong evidence for which strategy is generally better, I'd
rather be consistent with what's done in the stable tree: accept a page
from the wrong NUMA node for comparison, but not as a leaf for merging.

Signed-off-by: Hugh Dickins 
---
  mm/ksm.c |   19 +--
  1 file changed, 9 insertions(+), 10 deletions(-)

--- mmotm.orig/mm/ksm.c 2013-02-20 22:28:23.584001392 -0800
+++ mmotm/mm/ksm.c  2013-02-20 22:28:27.288001480 -0800
@@ -1340,16 +1340,6 @@ struct rmap_item *unstable_tree_search_i
return NULL;
}
  
-		/*
-		 * If tree_page has been migrated to another NUMA node, it
-		 * will be flushed out and put into the right unstable tree
-		 * next time: only merge with it if merge_across_nodes.
-		 */
-		if (!ksm_merge_across_nodes && page_to_nid(tree_page) != nid) {
-			put_page(tree_page);
-			return NULL;
-		}
-
ret = memcmp_pages(page, tree_page);
  
  		parent = *new;

@@ -1359,6 +1349,15 @@ struct rmap_item *unstable_tree_search_i
} else if (ret > 0) {
put_page(tree_page);
new = &parent->rb_right;
+   } else if (!ksm_merge_across_nodes &&
+  page_to_nid(tree_page) != nid) {
+   /*
+* If tree_page has been migrated to another NUMA node,
+* it will be flushed out and put in the right unstable
+* tree next time: only merge with it when across_nodes.
+*/
+   put_page(tree_page);
+   return NULL;
} else {
*tree_pagep = tree_page;
return tree_rmap_item;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majord...@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: em...@kvack.org


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/7] ksm: add some comments

2013-02-21 Thread Ric Mason

On 02/21/2013 04:19 PM, Hugh Dickins wrote:

Added slightly more detail to the Documentation of merge_across_nodes,
a few comments in areas indicated by review, and renamed get_ksm_page()'s
argument from "locked" to "lock_it".  No functional change.

Signed-off-by: Hugh Dickins 
---
  Documentation/vm/ksm.txt |   16 
  mm/ksm.c |   18 ++
  2 files changed, 26 insertions(+), 8 deletions(-)

--- mmotm.orig/Documentation/vm/ksm.txt 2013-02-20 22:28:09.456001057 -0800
+++ mmotm/Documentation/vm/ksm.txt  2013-02-20 22:28:23.580001392 -0800
@@ -60,10 +60,18 @@ sleep_millisecs  - how many milliseconds
  
  merge_across_nodes - specifies if pages from different numa nodes can be merged.

 When set to 0, ksm merges only pages which physically
-   reside in the memory area of same NUMA node. It brings
-   lower latency to access to shared page. Value can be
-   changed only when there is no ksm shared pages in system.
-   Default: 1
+   reside in the memory area of same NUMA node. That brings
+   lower latency to access of shared pages. Systems with more
+   nodes, at significant NUMA distances, are likely to benefit
+   from the lower latency of setting 0. Smaller systems, which
+   need to minimize memory usage, are likely to benefit from
+   the greater sharing of setting 1 (default). You may wish to
+   compare how your system performs under each setting, before
+   deciding on which to use. merge_across_nodes setting can be
+   changed only when there are no ksm shared pages in system:
+   set run 2 to unmerge pages first, then to 1 after changing
+   merge_across_nodes, to remerge according to the new setting.


What's the root reason that the merge_across_nodes setting can be changed 
only when there are no ksm shared pages in the system? Can't they be unmerged 
and merged again during the ksmd scan?



+   Default: 1 (merging across nodes as in earlier releases)
  
  run  - set 0 to stop ksmd from running but keep merged pages,

 set 1 to run ksmd e.g. "echo 1 > /sys/kernel/mm/ksm/run",
--- mmotm.orig/mm/ksm.c 2013-02-20 22:28:09.456001057 -0800
+++ mmotm/mm/ksm.c  2013-02-20 22:28:23.584001392 -0800
@@ -87,6 +87,9 @@
   *take 10 attempts to find a page in the unstable tree, once it is found,
   *it is secured in the stable tree.  (When we scan a new page, we first
   *compare it against the stable tree, and then against the unstable tree.)
+ *
+ * If the merge_across_nodes tunable is unset, then KSM maintains multiple
+ * stable trees and multiple unstable trees: one of each for each NUMA node.
   */
  
  /**

@@ -524,7 +527,7 @@ static void remove_node_from_stable_tree
   * a page to put something that might look like our key in page->mapping.
   * is on its way to being freed; but it is an anomaly to bear in mind.
   */
-static struct page *get_ksm_page(struct stable_node *stable_node, bool locked)
+static struct page *get_ksm_page(struct stable_node *stable_node, bool lock_it)
  {
struct page *page;
void *expected_mapping;
@@ -573,7 +576,7 @@ again:
goto stale;
}
  
-	if (locked) {

+   if (lock_it) {
lock_page(page);
if (ACCESS_ONCE(page->mapping) != expected_mapping) {
unlock_page(page);
@@ -703,10 +706,17 @@ static int remove_stable_node(struct sta
return 0;
}
  
-	if (WARN_ON_ONCE(page_mapped(page)))

+   if (WARN_ON_ONCE(page_mapped(page))) {
+   /*
+* This should not happen: but if it does, just refuse to let
+* merge_across_nodes be switched - there is no need to panic.
+*/
err = -EBUSY;
-   else {
+   } else {
/*
+* The stable node did not yet appear stale to get_ksm_page(),
+* since that allows for an unmapped ksm page to be recognized
+* right up until it is freed; but the node is safe to remove.
 * This page might be in a pagevec waiting to be freed,
 * or it might be PageSwapCache (perhaps under writeback),
 * or it might have been removed from swapcache a moment ago.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majord...@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: em...@kvack.org


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  

Re: [PATCH] staging/zcache: Fix/improve zcache writeback code, tie to a config option

2013-02-21 Thread Ric Mason

On 02/07/2013 02:27 AM, Dan Magenheimer wrote:

It was observed by Andrea Arcangeli in 2011 that zcache can get "full"
and there must be some way for compressed swap pages to be (uncompressed
and then) sent through to the backing swap disk.  A prototype of this
functionality, called "unuse", was added in 2012 as part of a major update
to zcache (aka "zcache2"), but was left unfinished due to the unfortunate
temporary fork of zcache.

This earlier version of the code had an unresolved memory leak
and was anyway dependent on not-yet-upstream frontswap and mm changes.
The code was meanwhile adapted by Seth Jennings for similar
functionality in zswap (which he calls "flush").  Seth also made some
clever simplifications which are herein ported back to zcache.  As a
result of those simplifications, the frontswap changes are no longer
necessary, but a slightly different (and simpler) set of mm changes are
still required [1].  The memory leak is also fixed.

Due to feedback from akpm in a zswap thread, this functionality in zcache
has now been renamed from "unuse" to "writeback".

Although this zcache writeback code now works, there are open questions
as how best to handle the policy that drives it.  As a result, this
patch also ties writeback to a new config option.  And, since the
code still depends on not-yet-upstreamed mm patches, to avoid build
problems, the config option added by this patch temporarily depends
on "BROKEN"; this config dependency can be removed in trees that
contain the necessary mm patches.

[1] https://lkml.org/lkml/2013/1/29/540/ https://lkml.org/lkml/2013/1/29/539/


shrink_zcache_memory:

while(nr_evict-- > 0) {
page = zcache_evict_eph_pageframe();
if (page == NULL)
break;
zcache_free_page(page);
}

zcache_evict_eph_pageframe
->zbud_evict_pageframe_lru
->zbud_evict_tmem
->tmem_flush_page
->zcache_pampd_free
->zcache_free_page  <- zbudpage has already been freed here

Can the zcache_free_page() called in shrink_zcache_memory() be treated as 
a double free?




Signed-off-by: Dan Magenheimer 
---
  drivers/staging/zcache/Kconfig   |   17 ++
  drivers/staging/zcache/zcache-main.c |  332 +++---
  2 files changed, 284 insertions(+), 65 deletions(-)

diff --git a/drivers/staging/zcache/Kconfig b/drivers/staging/zcache/Kconfig
index c1dbd04..7358270 100644
--- a/drivers/staging/zcache/Kconfig
+++ b/drivers/staging/zcache/Kconfig
@@ -24,3 +24,20 @@ config RAMSTER
  while minimizing total RAM across the cluster.  RAMster, like
  zcache2, compresses swap pages into local RAM, but then remotifies
  the compressed pages to another node in the RAMster cluster.
+
+# Depends on not-yet-upstreamed mm patches to export end_swap_bio_write and
+# __add_to_swap_cache, and implement __swap_writepage (which is swap_writepage
+# without the frontswap call. When these are in-tree, the dependency on
+# BROKEN can be removed
+config ZCACHE_WRITEBACK
+   bool "Allow compressed swap pages to be writtenback to swap disk"
+   depends on ZCACHE=y && BROKEN
+   default n
+   help
+ Zcache caches compressed swap pages (and other data) in RAM which
+ often improves performance by avoiding I/O's due to swapping.
+ In some workloads with very long-lived large processes, it can
+ instead reduce performance.  Writeback decompresses zcache-compressed
+ pages (in LRU order) when under memory pressure and writes them to
+ the backing swap disk to ameliorate this problem.  Policy driving
+ writeback is still under development.
diff --git a/drivers/staging/zcache/zcache-main.c 
b/drivers/staging/zcache/zcache-main.c
index c1ac905..5bf14c3 100644
--- a/drivers/staging/zcache/zcache-main.c
+++ b/drivers/staging/zcache/zcache-main.c
@@ -22,6 +22,10 @@
  #include 
  #include 
  #include 
+#include 
+#include 
+#include 
+#include 
  
  #include 

  #include 
@@ -55,6 +59,9 @@ static inline void frontswap_tmem_exclusive_gets(bool b)
  }
  #endif
  
+/* enable (or fix code) when Seth's patches are accepted upstream */

+#define zcache_writeback_enabled 0
+
  static int zcache_enabled __read_mostly;
  static int disable_cleancache __read_mostly;
  static int disable_frontswap __read_mostly;
@@ -181,6 +188,8 @@ static unsigned long zcache_last_active_anon_pageframes;
  static unsigned long zcache_last_inactive_anon_pageframes;
  static unsigned long zcache_eph_nonactive_puts_ignored;
  static unsigned long zcache_pers_nonactive_puts_ignored;
+static unsigned long zcache_writtenback_pages;
+static long zcache_outstanding_writeback_pages;
  
  #ifdef CONFIG_DEBUG_FS

  #include 
@@ -239,6 +248,9 @@ static int zcache_debugfs_init(void)
zdfs64("eph_zbytes_max", S_IRUGO, root, &zcache_eph_zbytes_max);
zdfs64("pers_zbytes", S_IRUGO, root, &zcache_pers_zbytes);
zdfs64("pers_zbytes_max", S_IRUGO, root, &zcache_pers_zbytes_max);
+   

Re: [PATCH] staging/zcache: Fix/improve zcache writeback code, tie to a config option

2013-02-21 Thread Ric Mason

On 02/07/2013 02:27 AM, Dan Magenheimer wrote:

It was observed by Andrea Arcangeli in 2011 that zcache can get "full"
and there must be some way for compressed swap pages to be (uncompressed
and then) sent through to the backing swap disk.  A prototype of this
functionality, called "unuse", was added in 2012 as part of a major update
to zcache (aka "zcache2"), but was left unfinished due to the unfortunate
temporary fork of zcache.

This earlier version of the code had an unresolved memory leak
and was anyway dependent on not-yet-upstream frontswap and mm changes.
The code was meanwhile adapted by Seth Jennings for similar
functionality in zswap (which he calls "flush").  Seth also made some
clever simplifications which are herein ported back to zcache.  As a
result of those simplifications, the frontswap changes are no longer
necessary, but a slightly different (and simpler) set of mm changes are
still required [1].  The memory leak is also fixed.

Due to feedback from akpm in a zswap thread, this functionality in zcache
has now been renamed from "unuse" to "writeback".

Although this zcache writeback code now works, there are open questions
as how best to handle the policy that drives it.  As a result, this
patch also ties writeback to a new config option.  And, since the
code still depends on not-yet-upstreamed mm patches, to avoid build
problems, the config option added by this patch temporarily depends
on "BROKEN"; this config dependency can be removed in trees that
contain the necessary mm patches.

[1] https://lkml.org/lkml/2013/1/29/540/ https://lkml.org/lkml/2013/1/29/539/


This patch leads to the backend interacting with the core mm directly; shouldn't 
the core mm interact with the frontend instead of the backend? In addition, 
frontswap already has a shrink function; can we take advantage 
of it?




Signed-off-by: Dan Magenheimer 
---
  drivers/staging/zcache/Kconfig   |   17 ++
  drivers/staging/zcache/zcache-main.c |  332 +++---
  2 files changed, 284 insertions(+), 65 deletions(-)

diff --git a/drivers/staging/zcache/Kconfig b/drivers/staging/zcache/Kconfig
index c1dbd04..7358270 100644
--- a/drivers/staging/zcache/Kconfig
+++ b/drivers/staging/zcache/Kconfig
@@ -24,3 +24,20 @@ config RAMSTER
  while minimizing total RAM across the cluster.  RAMster, like
  zcache2, compresses swap pages into local RAM, but then remotifies
  the compressed pages to another node in the RAMster cluster.
+
+# Depends on not-yet-upstreamed mm patches to export end_swap_bio_write and
+# __add_to_swap_cache, and implement __swap_writepage (which is swap_writepage
+# without the frontswap call. When these are in-tree, the dependency on
+# BROKEN can be removed
+config ZCACHE_WRITEBACK
+   bool "Allow compressed swap pages to be writtenback to swap disk"
+   depends on ZCACHE=y && BROKEN
+   default n
+   help
+ Zcache caches compressed swap pages (and other data) in RAM which
+ often improves performance by avoiding I/O's due to swapping.
+ In some workloads with very long-lived large processes, it can
+ instead reduce performance.  Writeback decompresses zcache-compressed
+ pages (in LRU order) when under memory pressure and writes them to
+ the backing swap disk to ameliorate this problem.  Policy driving
+ writeback is still under development.
diff --git a/drivers/staging/zcache/zcache-main.c 
b/drivers/staging/zcache/zcache-main.c
index c1ac905..5bf14c3 100644
--- a/drivers/staging/zcache/zcache-main.c
+++ b/drivers/staging/zcache/zcache-main.c
@@ -22,6 +22,10 @@
  #include 
  #include 
  #include 
+#include 
+#include 
+#include 
+#include 
  
  #include 

  #include 
@@ -55,6 +59,9 @@ static inline void frontswap_tmem_exclusive_gets(bool b)
  }
  #endif
  
+/* enable (or fix code) when Seth's patches are accepted upstream */

+#define zcache_writeback_enabled 0
+
  static int zcache_enabled __read_mostly;
  static int disable_cleancache __read_mostly;
  static int disable_frontswap __read_mostly;
@@ -181,6 +188,8 @@ static unsigned long zcache_last_active_anon_pageframes;
  static unsigned long zcache_last_inactive_anon_pageframes;
  static unsigned long zcache_eph_nonactive_puts_ignored;
  static unsigned long zcache_pers_nonactive_puts_ignored;
+static unsigned long zcache_writtenback_pages;
+static long zcache_outstanding_writeback_pages;
  
  #ifdef CONFIG_DEBUG_FS

  #include 
@@ -239,6 +248,9 @@ static int zcache_debugfs_init(void)
zdfs64("eph_zbytes_max", S_IRUGO, root, &zcache_eph_zbytes_max);
zdfs64("pers_zbytes", S_IRUGO, root, &zcache_pers_zbytes);
zdfs64("pers_zbytes_max", S_IRUGO, root, &zcache_pers_zbytes_max);
+   zdfs("outstanding_writeback_pages", S_IRUGO, root,
+   &zcache_outstanding_writeback_pages);
+   zdfs("writtenback_pages", S_IRUGO, root, &zcache_writtenback_pages);
return 0;
  }
  #undefzdebugfs
@@ -285,6 +297,18 @@ void 

Re: [PATCH 0/7] ksm: responses to NUMA review

2013-02-21 Thread Ric Mason

On 02/21/2013 04:17 PM, Hugh Dickins wrote:

Here's a second KSM series, based on mmotm 2013-02-19-17-20: partly in
response to Mel's review feedback, partly fixes to issues that I found
myself in doing more review and testing.  None of the issues fixed are
truly show-stoppers, though I would prefer them fixed sooner than later.


Do you have any ideas about ksm supporting page cache and tmpfs?



1 ksm: add some comments
2 ksm: treat unstable nid like in stable tree
3 ksm: shrink 32-bit rmap_item back to 32 bytes
4 mm,ksm: FOLL_MIGRATION do migration_entry_wait
5 mm,ksm: swapoff might need to copy
6 mm: cleanup "swapcache" in do_swap_page
7 ksm: allocate roots when needed

  Documentation/vm/ksm.txt |   16 +++-
  include/linux/mm.h   |1
  mm/ksm.c |  137 +++--
  mm/memory.c  |   38 +++---
  mm/swapfile.c|   15 +++-
  5 files changed, 140 insertions(+), 67 deletions(-)

Thanks,
Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majord...@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: em...@kvack.org


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Questin about swap_slot free and invalidate page

2013-02-21 Thread Ric Mason

On 02/22/2013 05:42 AM, Dan Magenheimer wrote:

From: Ric Mason [mailto:ric.mas...@gmail.com]
Subject: Re: Questin about swap_slot free and invalidate page

On 02/19/2013 11:27 PM, Dan Magenheimer wrote:

From: Ric Mason [mailto:ric.mas...@gmail.com]

Hugh is right that handling the possibility of duplicates is
part of the tmem ABI.  If there is any possibility of duplicates,
the ABI defines how a backend must handle them to avoid data
coherency issues.

The kernel implements an in-kernel API which implements the tmem
ABI.  If the frontend and backend can always agree that duplicate

Which ABI in zcache implements that?

https://oss.oracle.com/projects/tmem/dist/documentation/api/tmemspec-v001.pdf

The in-kernel APIs are frontswap and cleancache.  For more information about
tmem, see http://lwn.net/Articles/454795/

But you mentioned that you have an in-kernel API which can handle
duplicates.  Do you mean zcache_cleancache/frontswap_put_page? I think
they just overwrite instead of optionally flushing the page on the
second (duplicate) put, as mentioned in your tmem spec.

Maybe I am misunderstanding your question...  The spec allows
overwrite (and return success) OR flush the page (and return
failure).  Zcache does the latter (flush).  The code that implements
it is in tmem_put.
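(A self-contained toy sketch of that policy choice, not the real tmem_put(): on a duplicate put the stale copy is flushed and the put reports failure; overwrite-and-succeed is the other option the spec permits.)

#include <stdio.h>

#define NSLOTS 16

static const char *slots[NSLOTS];	/* NULL means "no copy stored" */

/* returns 0 on success; -1 means "duplicate: old copy flushed, put refused" */
static int put(unsigned int index, const char *data)
{
	if (slots[index]) {		/* duplicate put */
		slots[index] = NULL;	/* flush the stale copy ... */
		return -1;		/* ... and tell the frontend it failed */
	}
	slots[index] = data;
	return 0;
}

int main(void)
{
	printf("first put:  %d\n", put(3, "v1"));	/* 0  */
	printf("second put: %d\n", put(3, "v2"));	/* -1 */
	printf("slot 3 now: %s\n", slots[3] ? slots[3] : "(empty)");
	return 0;
}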


Thanks for pointing that out.  Pers pages can have duplicate puts since a swap 
cache page can be reused. Can eph pages also have duplicate puts? If yes, when 
can that happen?






--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCHv5 2/8] zsmalloc: add documentation

2013-02-21 Thread Ric Mason

On 02/21/2013 11:50 PM, Seth Jennings wrote:

On 02/21/2013 02:49 AM, Ric Mason wrote:

On 02/19/2013 03:16 AM, Seth Jennings wrote:

On 02/16/2013 12:21 AM, Ric Mason wrote:

On 02/14/2013 02:38 AM, Seth Jennings wrote:

This patch adds a documentation file for zsmalloc at
Documentation/vm/zsmalloc.txt

Signed-off-by: Seth Jennings 
---
Documentation/vm/zsmalloc.txt |   68
+
1 file changed, 68 insertions(+)
create mode 100644 Documentation/vm/zsmalloc.txt

diff --git a/Documentation/vm/zsmalloc.txt
b/Documentation/vm/zsmalloc.txt
new file mode 100644
index 000..85aa617
--- /dev/null
+++ b/Documentation/vm/zsmalloc.txt
@@ -0,0 +1,68 @@
+zsmalloc Memory Allocator
+
+Overview
+
+zmalloc a new slab-based memory allocator,
+zsmalloc, for storing compressed pages.  It is designed for
+low fragmentation and high allocation success rate on
+large object, but <= PAGE_SIZE allocations.
+
+zsmalloc differs from the kernel slab allocator in two primary
+ways to achieve these design goals.
+
+zsmalloc never requires high order page allocations to back
+slabs, or "size classes" in zsmalloc terms. Instead it allows
+multiple single-order pages to be stitched together into a
+"zspage" which backs the slab.  This allows for higher allocation
+success rate under memory pressure.
+
+Also, zsmalloc allows objects to span page boundaries within the
+zspage.  This allows for lower fragmentation than could be had
+with the kernel slab allocator for objects between PAGE_SIZE/2
+and PAGE_SIZE.  With the kernel slab allocator, if a page compresses
+to 60% of it original size, the memory savings gained through
+compression is lost in fragmentation because another object of
+the same size can't be stored in the leftover space.
+
+This ability to span pages results in zsmalloc allocations not being
+directly addressable by the user.  The user is given an
+non-dereferencable handle in response to an allocation request.
+That handle must be mapped, using zs_map_object(), which returns
+a pointer to the mapped region that can be used.  The mapping is
+necessary since the object data may reside in two different
+noncontigious pages.

Do you mean the reason a zsmalloc object must be mapped after allocation
is that the object data may reside in two different noncontiguous pages?

Yes, that is one reason for the mapping.  The other reason (more of an
added bonus) is below.


+
+For 32-bit systems, zsmalloc has the added benefit of being
+able to back slabs with HIGHMEM pages, something not possible

What's the meaning of "back slabs with HIGHMEM pages"?

By HIGHMEM, I'm referring to the HIGHMEM memory zone on 32-bit systems
with more than 1GB (actually a little less) of RAM.  The upper ~3GB
of the 4GB address space, depending on kernel build options, is not
directly addressable by the kernel, but can be mapped into the kernel
address space with functions like kmap() or kmap_atomic().

These pages can't be used by slab/slub because they are not
continuously mapped into the kernel address space.  However, since
zsmalloc requires a mapping anyway to handle objects that span
non-contiguous page boundaries, we do the kernel mapping as part of
the process.

So zspages, the conceptual slab in zsmalloc backed by single-order
pages can include pages from the HIGHMEM zone as well.
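
A minimal sketch of that temporary mapping; kmap_atomic()/kunmap_atomic()
are the real interfaces, while the helper itself is invented purely for
illustration:

#include <linux/highmem.h>
#include <linux/string.h>

/* Copy data into a page that may live in ZONE_HIGHMEM. */
static void copy_to_possibly_highmem_page(struct page *page, size_t offset,
                                          const void *src, size_t len)
{
        /* Install a short-lived kernel mapping for the page... */
        void *vaddr = kmap_atomic(page);

        memcpy(vaddr + offset, src, len);

        /* ...and drop it as soon as the access is done. */
        kunmap_atomic(vaddr);
}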

Thanks for the clarification.
 In http://lwn.net/Articles/537422/, your article about zswap on LWN:
 "Additionally, the kernel slab allocator does not allow objects that
 are less than a page in size to span a page boundary. This means that
 if an object is PAGE_SIZE/2 + 1 bytes in size, it effectively uses an
 entire page, resulting in ~50% waste. Hence there are *no kmalloc()
 cache sizes* between PAGE_SIZE/2 and PAGE_SIZE."
Are you sure? It seems that the kmalloc caches support big sizes; you can
check include/linux/kmalloc_sizes.h

Yes, kmalloc can allocate large objects > PAGE_SIZE, but there are no
cache sizes _between_ PAGE_SIZE/2 and PAGE_SIZE.  For example, on a
system with 4k pages, there are no caches between kmalloc-2048 and
kmalloc-4096.
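
One way to see the gap is ksize(), which reports the size class an
allocation actually landed in.  The tiny module below is only an
illustrative sketch (module and function names are made up); on a
4k-page system it would typically report 4096 for a 2049-byte request:

#include <linux/init.h>
#include <linux/module.h>
#include <linux/slab.h>

static int __init kmalloc_gap_demo_init(void)
{
        /* Just over PAGE_SIZE/2 on a 4k-page system. */
        void *p = kmalloc(2049, GFP_KERNEL);

        if (!p)
                return -ENOMEM;
        /* ksize() reports the cache the object landed in -- typically 4096. */
        pr_info("requested 2049 bytes, ksize() = %zu\n", ksize(p));
        kfree(p);
        return 0;
}

static void __exit kmalloc_gap_demo_exit(void)
{
}

module_init(kmalloc_gap_demo_init);
module_exit(kmalloc_gap_demo_exit);
MODULE_LICENSE("GPL");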


Since slub caches can be merged, is that the root reason?



Seth

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majord...@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: em...@kvack.org


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCHv5 2/8] zsmalloc: add documentation

2013-02-21 Thread Ric Mason

On 02/21/2013 11:50 PM, Seth Jennings wrote:

On 02/21/2013 02:49 AM, Ric Mason wrote:

On 02/19/2013 03:16 AM, Seth Jennings wrote:

On 02/16/2013 12:21 AM, Ric Mason wrote:

On 02/14/2013 02:38 AM, Seth Jennings wrote:

This patch adds a documentation file for zsmalloc at
Documentation/vm/zsmalloc.txt

Signed-off-by: Seth Jennings <sjenn...@linux.vnet.ibm.com>
---
Documentation/vm/zsmalloc.txt |   68
+
1 file changed, 68 insertions(+)
create mode 100644 Documentation/vm/zsmalloc.txt

diff --git a/Documentation/vm/zsmalloc.txt
b/Documentation/vm/zsmalloc.txt
new file mode 100644
index 000..85aa617
--- /dev/null
+++ b/Documentation/vm/zsmalloc.txt
@@ -0,0 +1,68 @@
+zsmalloc Memory Allocator
+
+Overview
+
+zsmalloc is a new slab-based memory allocator
+for storing compressed pages.  It is designed for
+low fragmentation and a high allocation success rate on
+large, but <= PAGE_SIZE, allocations.
+
+zsmalloc differs from the kernel slab allocator in two primary
+ways to achieve these design goals.
+
+zsmalloc never requires high order page allocations to back
+slabs, or "size classes" in zsmalloc terms. Instead it allows
+multiple single-order pages to be stitched together into a
+"zspage" which backs the slab.  This allows for higher allocation
+success rate under memory pressure.
+
+Also, zsmalloc allows objects to span page boundaries within the
+zspage.  This allows for lower fragmentation than could be had
+with the kernel slab allocator for objects between PAGE_SIZE/2
+and PAGE_SIZE.  With the kernel slab allocator, if a page compresses
+to 60% of its original size, the memory savings gained through
+compression is lost in fragmentation because another object of
+the same size can't be stored in the leftover space.
+
+This ability to span pages results in zsmalloc allocations not being
+directly addressable by the user.  The user is given a
+non-dereferenceable handle in response to an allocation request.
+That handle must be mapped, using zs_map_object(), which returns
+a pointer to the mapped region that can be used.  The mapping is
+necessary since the object data may reside in two different
+noncontiguous pages.

Do you mean the reason a zsmalloc object must be mapped after
allocation is that the object data may reside in two different noncontiguous pages?

Yes, that is one reason for the mapping.  The other reason (more of an
added bonus) is below.


+
+For 32-bit systems, zsmalloc has the added benefit of being
+able to back slabs with HIGHMEM pages, something not possible

What's the meaning of "back slabs with HIGHMEM pages"?

By HIGHMEM, I'm referring to the HIGHMEM memory zone on 32-bit systems
with more than 1GB (actually a little less) of RAM.  The upper ~3GB
of the 4GB address space, depending on kernel build options, is not
directly addressable by the kernel, but can be mapped into the kernel
address space with functions like kmap() or kmap_atomic().

These pages can't be used by slab/slub because they are not
continuously mapped into the kernel address space.  However, since
zsmalloc requires a mapping anyway to handle objects that span
non-contiguous page boundaries, we do the kernel mapping as part of
the process.

So zspages, the conceptual slab in zsmalloc backed by single-order
pages can include pages from the HIGHMEM zone as well.

Thanks for the clarification.
 In http://lwn.net/Articles/537422/, your article about zswap on LWN:
 "Additionally, the kernel slab allocator does not allow objects that
 are less than a page in size to span a page boundary. This means that
 if an object is PAGE_SIZE/2 + 1 bytes in size, it effectively uses an
 entire page, resulting in ~50% waste. Hence there are *no kmalloc()
 cache sizes* between PAGE_SIZE/2 and PAGE_SIZE."
Are you sure? It seems that the kmalloc caches support big sizes; you can
check include/linux/kmalloc_sizes.h

Yes, kmalloc can allocate large objects > PAGE_SIZE, but there are no
cache sizes _between_ PAGE_SIZE/2 and PAGE_SIZE.  For example, on a
system with 4k pages, there are no caches between kmalloc-2048 and
kmalloc-4096.


Objects > PAGE_SIZE/2 or > PAGE_SIZE should also be allocated from a
slab cache, correct? Then how can an object be allocated without a slab
cache that holds objects of that size?




Seth

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majord...@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: em...@kvack.org


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCHv5 2/8] zsmalloc: add documentation

2013-02-21 Thread Ric Mason

On 02/19/2013 03:16 AM, Seth Jennings wrote:

On 02/16/2013 12:21 AM, Ric Mason wrote:

On 02/14/2013 02:38 AM, Seth Jennings wrote:

This patch adds a documentation file for zsmalloc at
Documentation/vm/zsmalloc.txt

Signed-off-by: Seth Jennings <sjenn...@linux.vnet.ibm.com>
---
   Documentation/vm/zsmalloc.txt |   68
+
   1 file changed, 68 insertions(+)
   create mode 100644 Documentation/vm/zsmalloc.txt

diff --git a/Documentation/vm/zsmalloc.txt
b/Documentation/vm/zsmalloc.txt
new file mode 100644
index 000..85aa617
--- /dev/null
+++ b/Documentation/vm/zsmalloc.txt
@@ -0,0 +1,68 @@
+zsmalloc Memory Allocator
+
+Overview
+
+zsmalloc is a new slab-based memory allocator
+for storing compressed pages.  It is designed for
+low fragmentation and a high allocation success rate on
+large, but <= PAGE_SIZE, allocations.
+
+zsmalloc differs from the kernel slab allocator in two primary
+ways to achieve these design goals.
+
+zsmalloc never requires high order page allocations to back
+slabs, or "size classes" in zsmalloc terms. Instead it allows
+multiple single-order pages to be stitched together into a
+"zspage" which backs the slab.  This allows for higher allocation
+success rate under memory pressure.
+
+Also, zsmalloc allows objects to span page boundaries within the
+zspage.  This allows for lower fragmentation than could be had
+with the kernel slab allocator for objects between PAGE_SIZE/2
+and PAGE_SIZE.  With the kernel slab allocator, if a page compresses
+to 60% of its original size, the memory savings gained through
+compression is lost in fragmentation because another object of
+the same size can't be stored in the leftover space.
+
+This ability to span pages results in zsmalloc allocations not being
+directly addressable by the user.  The user is given a
+non-dereferenceable handle in response to an allocation request.
+That handle must be mapped, using zs_map_object(), which returns
+a pointer to the mapped region that can be used.  The mapping is
+necessary since the object data may reside in two different
+noncontiguous pages.

Do you mean the reason a zsmalloc object must be mapped after
allocation is that the object data may reside in two different noncontiguous pages?

Yes, that is one reason for the mapping.  The other reason (more of an
added bonus) is below.


+
+For 32-bit systems, zsmalloc has the added benefit of being
+able to back slabs with HIGHMEM pages, something not possible

What's the meaning of "back slabs with HIGHMEM pages"?

By HIGHMEM, I'm referring to the HIGHMEM memory zone on 32-bit systems
with more than 1GB (actually a little less) of RAM.  The upper ~3GB
of the 4GB address space, depending on kernel build options, is not
directly addressable by the kernel, but can be mapped into the kernel
address space with functions like kmap() or kmap_atomic().

These pages can't be used by slab/slub because they are not
continuously mapped into the kernel address space.  However, since
zsmalloc requires a mapping anyway to handle objects that span
non-contiguous page boundaries, we do the kernel mapping as part of
the process.

So zspages, the conceptual slab in zsmalloc backed by single-order
pages can include pages from the HIGHMEM zone as well.


Thanks for the clarification.
 In http://lwn.net/Articles/537422/, your article about zswap on LWN:
 "Additionally, the kernel slab allocator does not allow objects that
 are less than a page in size to span a page boundary. This means that
 if an object is PAGE_SIZE/2 + 1 bytes in size, it effectively uses an
 entire page, resulting in ~50% waste. Hence there are *no kmalloc()
 cache sizes* between PAGE_SIZE/2 and PAGE_SIZE."
Are you sure? It seems that the kmalloc caches support big sizes; you can
check include/linux/kmalloc_sizes.h



Seth

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majord...@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: em...@kvack.org


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Question about swap_slot free and invalidate page

2013-02-21 Thread Ric Mason

On 02/22/2013 05:42 AM, Dan Magenheimer wrote:

From: Ric Mason [mailto:ric.mas...@gmail.com]
Subject: Re: Question about swap_slot free and invalidate page

On 02/19/2013 11:27 PM, Dan Magenheimer wrote:

From: Ric Mason [mailto:ric.mas...@gmail.com]

Hugh is right that handling the possibility of duplicates is
part of the tmem ABI.  If there is any possibility of duplicates,
the ABI defines how a backend must handle them to avoid data
coherency issues.

The kernel implements an in-kernel API which implements the tmem
ABI.  If the frontend and backend can always agree that duplicate

Which ABI call in zcache implements that?

https://oss.oracle.com/projects/tmem/dist/documentation/api/tmemspec-v001.pdf

The in-kernel APIs are frontswap and cleancache.  For more information about
tmem, see http://lwn.net/Articles/454795/

But you mentioned that there is an in-kernel API which can handle
duplicates.  Do you mean zcache_cleancache/frontswap_put_page? I think
they just overwrite instead of optionally flushing the page on the
second (duplicate) put, as mentioned in your tmem spec.

Maybe I am misunderstanding your question...  The spec allows
overwrite (and return success) OR flush the page (and return
failure).  Zcache does the latter (flush).  The code that implements
it is in tmem_put.
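
A rough sketch of that flush-on-duplicate rule; this is not the real
tmem_put(), and the struct and helpers here are invented purely to
illustrate the policy:

#include <linux/stddef.h>

/* Invented type -- the real tmem code keys slots by pool/object/index. */
struct tmem_slot_sketch {
        void *pampd;    /* opaque handle to the stored (compressed) page */
};

/* Returns 0 on success, -1 when a duplicate put is flushed and rejected. */
static int tmem_put_sketch(struct tmem_slot_sketch *slot, void *new_pampd,
                           void (*free_pampd)(void *))
{
        if (slot->pampd != NULL) {
                /* Duplicate put: flush the old copy and fail the new put,
                 * so the frontend falls back to the real swap device and a
                 * later get can never return stale data. */
                free_pampd(slot->pampd);
                slot->pampd = NULL;
                return -1;
        }
        slot->pampd = new_pampd;
        return 0;
}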


Thanks for pointing that out.  Persistent (pers) pages can see duplicate puts since a swap
cache page can be reused. Can ephemeral (eph) pages also see duplicate puts? If so,
when can that happen?






--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/7] ksm: responses to NUMA review

2013-02-21 Thread Ric Mason

On 02/21/2013 04:17 PM, Hugh Dickins wrote:

Here's a second KSM series, based on mmotm 2013-02-19-17-20: partly in
response to Mel's review feedback, partly fixes to issues that I found
myself in doing more review and testing.  None of the issues fixed are
truly show-stoppers, though I would prefer them fixed sooner than later.


Do you have any plans for KSM to support page cache and tmpfs pages?



1 ksm: add some comments
2 ksm: treat unstable nid like in stable tree
3 ksm: shrink 32-bit rmap_item back to 32 bytes
4 mm,ksm: FOLL_MIGRATION do migration_entry_wait
5 mm,ksm: swapoff might need to copy
6 mm: cleanup swapcache in do_swap_page
7 ksm: allocate roots when needed

  Documentation/vm/ksm.txt |   16 +++-
  include/linux/mm.h   |1
  mm/ksm.c |  137 +++--
  mm/memory.c  |   38 +++---
  mm/swapfile.c|   15 +++-
  5 files changed, 140 insertions(+), 67 deletions(-)

Thanks,
Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majord...@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: em...@kvack.org


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] staging/zcache: Fix/improve zcache writeback code, tie to a config option

2013-02-21 Thread Ric Mason

On 02/07/2013 02:27 AM, Dan Magenheimer wrote:

It was observed by Andrea Arcangeli in 2011 that zcache can get full
and there must be some way for compressed swap pages to be (uncompressed
and then) sent through to the backing swap disk.  A prototype of this
functionality, called unuse, was added in 2012 as part of a major update
to zcache (aka zcache2), but was left unfinished due to the unfortunate
temporary fork of zcache.

This earlier version of the code had an unresolved memory leak
and was anyway dependent on not-yet-upstream frontswap and mm changes.
The code was meanwhile adapted by Seth Jennings for similar
functionality in zswap (which he calls flush).  Seth also made some
clever simplifications which are herein ported back to zcache.  As a
result of those simplifications, the frontswap changes are no longer
necessary, but a slightly different (and simpler) set of mm changes are
still required [1].  The memory leak is also fixed.

Due to feedback from akpm in a zswap thread, this functionality in zcache
has now been renamed from unuse to writeback.

Although this zcache writeback code now works, there are open questions
as how best to handle the policy that drives it.  As a result, this
patch also ties writeback to a new config option.  And, since the
code still depends on not-yet-upstreamed mm patches, to avoid build
problems, the config option added by this patch temporarily depends
on BROKEN; this config dependency can be removed in trees that
contain the necessary mm patches.

[1] https://lkml.org/lkml/2013/1/29/540/ https://lkml.org/lkml/2013/1/29/539/


This patch leads to the backend interacting with the core mm directly;
shouldn't the core mm interact with the frontend instead of the backend?
In addition, frontswap already has a shrink function; can't we take
advantage of it?




Signed-off-by: Dan Magenheimer dan.magenhei...@oracle.com
---
  drivers/staging/zcache/Kconfig   |   17 ++
  drivers/staging/zcache/zcache-main.c |  332 +++---
  2 files changed, 284 insertions(+), 65 deletions(-)

diff --git a/drivers/staging/zcache/Kconfig b/drivers/staging/zcache/Kconfig
index c1dbd04..7358270 100644
--- a/drivers/staging/zcache/Kconfig
+++ b/drivers/staging/zcache/Kconfig
@@ -24,3 +24,20 @@ config RAMSTER
  while minimizing total RAM across the cluster.  RAMster, like
  zcache2, compresses swap pages into local RAM, but then remotifies
  the compressed pages to another node in the RAMster cluster.
+
+# Depends on not-yet-upstreamed mm patches to export end_swap_bio_write and
+# __add_to_swap_cache, and implement __swap_writepage (which is swap_writepage
+# without the frontswap call. When these are in-tree, the dependency on
+# BROKEN can be removed
+config ZCACHE_WRITEBACK
+   bool "Allow compressed swap pages to be writtenback to swap disk"
+   depends on ZCACHE=y && BROKEN
+   default n
+   help
+ Zcache caches compressed swap pages (and other data) in RAM which
+ often improves performance by avoiding I/O's due to swapping.
+ In some workloads with very long-lived large processes, it can
+ instead reduce performance.  Writeback decompresses zcache-compressed
+ pages (in LRU order) when under memory pressure and writes them to
+ the backing swap disk to ameliorate this problem.  Policy driving
+ writeback is still under development.
diff --git a/drivers/staging/zcache/zcache-main.c 
b/drivers/staging/zcache/zcache-main.c
index c1ac905..5bf14c3 100644
--- a/drivers/staging/zcache/zcache-main.c
+++ b/drivers/staging/zcache/zcache-main.c
@@ -22,6 +22,10 @@
  #include <linux/atomic.h>
  #include <linux/math64.h>
  #include <linux/crypto.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
+#include <linux/pagemap.h>
+#include <linux/writeback.h>
  
  #include <linux/cleancache.h>

  #include <linux/frontswap.h>
@@ -55,6 +59,9 @@ static inline void frontswap_tmem_exclusive_gets(bool b)
  }
  #endif
  
+/* enable (or fix code) when Seth's patches are accepted upstream */

+#define zcache_writeback_enabled 0
+
  static int zcache_enabled __read_mostly;
  static int disable_cleancache __read_mostly;
  static int disable_frontswap __read_mostly;
@@ -181,6 +188,8 @@ static unsigned long zcache_last_active_anon_pageframes;
  static unsigned long zcache_last_inactive_anon_pageframes;
  static unsigned long zcache_eph_nonactive_puts_ignored;
  static unsigned long zcache_pers_nonactive_puts_ignored;
+static unsigned long zcache_writtenback_pages;
+static long zcache_outstanding_writeback_pages;
  
  #ifdef CONFIG_DEBUG_FS

  #include <linux/debugfs.h>
@@ -239,6 +248,9 @@ static int zcache_debugfs_init(void)
zdfs64(eph_zbytes_max, S_IRUGO, root, zcache_eph_zbytes_max);
zdfs64(pers_zbytes, S_IRUGO, root, zcache_pers_zbytes);
zdfs64(pers_zbytes_max, S_IRUGO, root, zcache_pers_zbytes_max);
+   zdfs(outstanding_writeback_pages, S_IRUGO, root,
+   

Re: [PATCH] staging/zcache: Fix/improve zcache writeback code, tie to a config option

2013-02-21 Thread Ric Mason

On 02/07/2013 02:27 AM, Dan Magenheimer wrote:

It was observed by Andrea Arcangeli in 2011 that zcache can get full
and there must be some way for compressed swap pages to be (uncompressed
and then) sent through to the backing swap disk.  A prototype of this
functionality, called unuse, was added in 2012 as part of a major update
to zcache (aka zcache2), but was left unfinished due to the unfortunate
temporary fork of zcache.

This earlier version of the code had an unresolved memory leak
and was anyway dependent on not-yet-upstream frontswap and mm changes.
The code was meanwhile adapted by Seth Jennings for similar
functionality in zswap (which he calls flush).  Seth also made some
clever simplifications which are herein ported back to zcache.  As a
result of those simplifications, the frontswap changes are no longer
necessary, but a slightly different (and simpler) set of mm changes are
still required [1].  The memory leak is also fixed.

Due to feedback from akpm in a zswap thread, this functionality in zcache
has now been renamed from unuse to writeback.

Although this zcache writeback code now works, there are open questions
as how best to handle the policy that drives it.  As a result, this
patch also ties writeback to a new config option.  And, since the
code still depends on not-yet-upstreamed mm patches, to avoid build
problems, the config option added by this patch temporarily depends
on BROKEN; this config dependency can be removed in trees that
contain the necessary mm patches.

[1] https://lkml.org/lkml/2013/1/29/540/ https://lkml.org/lkml/2013/1/29/539/


shrink_zcache_memory:

while (nr_evict-- > 0) {
	page = zcache_evict_eph_pageframe();
	if (page == NULL)
		break;
	zcache_free_page(page);
}

zcache_evict_eph_pageframe
  -> zbud_evict_pageframe_lru
    -> zbud_evict_tmem
      -> tmem_flush_page
        -> zcache_pampd_free
          -> zcache_free_page  <- the zbudpage has already been freed here

Can the zcache_free_page() called in shrink_zcache_memory() then be
treated as a double free?




Signed-off-by: Dan Magenheimer dan.magenhei...@oracle.com
---
  drivers/staging/zcache/Kconfig   |   17 ++
  drivers/staging/zcache/zcache-main.c |  332 +++---
  2 files changed, 284 insertions(+), 65 deletions(-)

diff --git a/drivers/staging/zcache/Kconfig b/drivers/staging/zcache/Kconfig
index c1dbd04..7358270 100644
--- a/drivers/staging/zcache/Kconfig
+++ b/drivers/staging/zcache/Kconfig
@@ -24,3 +24,20 @@ config RAMSTER
  while minimizing total RAM across the cluster.  RAMster, like
  zcache2, compresses swap pages into local RAM, but then remotifies
  the compressed pages to another node in the RAMster cluster.
+
+# Depends on not-yet-upstreamed mm patches to export end_swap_bio_write and
+# __add_to_swap_cache, and implement __swap_writepage (which is swap_writepage
+# without the frontswap call. When these are in-tree, the dependency on
+# BROKEN can be removed
+config ZCACHE_WRITEBACK
+   bool "Allow compressed swap pages to be writtenback to swap disk"
+   depends on ZCACHE=y && BROKEN
+   default n
+   help
+ Zcache caches compressed swap pages (and other data) in RAM which
+ often improves performance by avoiding I/O's due to swapping.
+ In some workloads with very long-lived large processes, it can
+ instead reduce performance.  Writeback decompresses zcache-compressed
+ pages (in LRU order) when under memory pressure and writes them to
+ the backing swap disk to ameliorate this problem.  Policy driving
+ writeback is still under development.
diff --git a/drivers/staging/zcache/zcache-main.c 
b/drivers/staging/zcache/zcache-main.c
index c1ac905..5bf14c3 100644
--- a/drivers/staging/zcache/zcache-main.c
+++ b/drivers/staging/zcache/zcache-main.c
@@ -22,6 +22,10 @@
  #include <linux/atomic.h>
  #include <linux/math64.h>
  #include <linux/crypto.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
+#include <linux/pagemap.h>
+#include <linux/writeback.h>
  
  #include <linux/cleancache.h>

  #include <linux/frontswap.h>
@@ -55,6 +59,9 @@ static inline void frontswap_tmem_exclusive_gets(bool b)
  }
  #endif
  
+/* enable (or fix code) when Seth's patches are accepted upstream */

+#define zcache_writeback_enabled 0
+
  static int zcache_enabled __read_mostly;
  static int disable_cleancache __read_mostly;
  static int disable_frontswap __read_mostly;
@@ -181,6 +188,8 @@ static unsigned long zcache_last_active_anon_pageframes;
  static unsigned long zcache_last_inactive_anon_pageframes;
  static unsigned long zcache_eph_nonactive_puts_ignored;
  static unsigned long zcache_pers_nonactive_puts_ignored;
+static unsigned long zcache_writtenback_pages;
+static long zcache_outstanding_writeback_pages;
  
  #ifdef CONFIG_DEBUG_FS

  #include <linux/debugfs.h>
@@ -239,6 +248,9 @@ static int zcache_debugfs_init(void)
zdfs64(eph_zbytes_max, S_IRUGO, root, zcache_eph_zbytes_max);
   
