RE: [PATCHv13 3/4] zswap: add to mm/
> From: Bob Liu [mailto:lliu...@gmail.com]
> Subject: Re: [PATCHv13 3/4] zswap: add to mm/
>
> On Thu, Jun 20, 2013 at 10:23 PM, Seth Jennings
> <sjenn...@linux.vnet.ibm.com> wrote:
> > On Thu, Jun 20, 2013 at 05:42:04PM +0800, Bob Liu wrote:
> >> > Just made a mmtests run of my own and got very different results:
> >>
> >> It's strange, I'll update to rc6 and try again.
> >> By the way, are you using the 842 hardware compressor instead of lzo?
> >
> > My results were using lzo software compression.
>
> Thanks, and today I used another machine to test zswap.
> The total ram size of that machine is around 4G.
> This time the result is better:
>
>                                      rc6                   rc6
>                                    zswap                  base
> Ops memcachetest-0M        14619.00 (  0.00%)    15602.00 (   6.72%)
> Ops memcachetest-435M      14727.00 (  0.00%)    15860.00 (   7.69%)
> Ops memcachetest-944M      12452.00 (  0.00%)    11812.00 (  -5.14%)
> Ops memcachetest-1452M     12183.00 (  0.00%)     9829.00 ( -19.32%)
> Ops memcachetest-1961M     11953.00 (  0.00%)     9337.00 ( -21.89%)
> Ops memcachetest-2469M     11201.00 (  0.00%)     7509.00 ( -32.96%)
> Ops memcachetest-2978M      9738.00 (  0.00%)     5981.00 ( -38.58%)
> Ops io-duration-0M             0.00 (  0.00%)        0.00 (   0.00%)
> Ops io-duration-435M          10.00 (  0.00%)        6.00 (  40.00%)
> Ops io-duration-944M          19.00 (  0.00%)       19.00 (   0.00%)
> Ops io-duration-1452M         31.00 (  0.00%)       26.00 (  16.13%)
> Ops io-duration-1961M         40.00 (  0.00%)       35.00 (  12.50%)
> Ops io-duration-2469M         45.00 (  0.00%)       43.00 (   4.44%)
> Ops io-duration-2978M         58.00 (  0.00%)       53.00 (   8.62%)
> Ops swaptotal-0M           56711.00 (  0.00%)        8.00 (  99.99%)
> Ops swaptotal-435M         19218.00 (  0.00%)     2101.00 (  89.07%)
> Ops swaptotal-944M         53233.00 (  0.00%)    98055.00 ( -84.20%)
> Ops swaptotal-1452M        52064.00 (  0.00%)   145624.00 (-179.70%)
> Ops swaptotal-1961M        54960.00 (  0.00%)   153907.00 (-180.03%)
> Ops swaptotal-2469M        57485.00 (  0.00%)   176340.00 (-206.76%)
> Ops swaptotal-2978M        77704.00 (  0.00%)   182996.00 (-135.50%)
> Ops swapin-0M              24834.00 (  0.00%)        8.00 (  99.97%)
> Ops swapin-435M             9038.00 (  0.00%)        0.00 (   0.00%)
> Ops swapin-944M            26230.00 (  0.00%)    42953.00 ( -63.76%)
> Ops swapin-1452M           25766.00 (  0.00%)    68440.00 (-165.62%)
> Ops swapin-1961M           27258.00 (  0.00%)    68129.00 (-149.94%)
> Ops swapin-2469M           28508.00 (  0.00%)    82234.00 (-188.46%)
> Ops swapin-2978M           37970.00 (  0.00%)    89280.00 (-135.13%)
> Ops minorfaults-0M       1460163.00 (  0.00%)   927966.00 (  36.45%)
> Ops minorfaults-435M      954058.00 (  0.00%)   936182.00 (   1.87%)
> Ops minorfaults-944M      972818.00 (  0.00%)  1005956.00 (  -3.41%)
> Ops minorfaults-1452M     966597.00 (  0.00%)  1035465.00 (  -7.12%)
> Ops minorfaults-1961M     976158.00 (  0.00%)  1049441.00 (  -7.51%)
> Ops minorfaults-2469M     967815.00 (  0.00%)  1051752.00 (  -8.67%)
> Ops minorfaults-2978M     988712.00 (  0.00%)  1034615.00 (  -4.64%)
> Ops majorfaults-0M          5899.00 (  0.00%)        9.00 (  99.85%)
> Ops majorfaults-435M        2684.00 (  0.00%)       67.00 (  97.50%)
> Ops majorfaults-944M        4380.00 (  0.00%)     5790.00 ( -32.19%)
> Ops majorfaults-1452M       4161.00 (  0.00%)     9222.00 (-121.63%)
> Ops majorfaults-1961M       4435.00 (  0.00%)     8800.00 ( -98.42%)
> Ops majorfaults-2469M       4555.00 (  0.00%)    10541.00 (-131.42%)
> Ops majorfaults-2978M       6182.00 (  0.00%)    11618.00 ( -87.93%)
>
> But the performance of the first machine I used, whose total ram size
> is 2G, is still bad.
> I need more time to summarize those testing results.
>
> Maybe you can also have a try with lower total ram size.
>
> --
> Regards,
> --Bob

A very important factor that you are not considering, and that might
account for your different results, is the "initial conditions".  For
example, I always ran my benchmarks after a default-configured EL6 boot,
which launches many services at boot time, each of which creates many
anonymous pages, and these "service anonymous pages" are often the pages
that are selected by LRU for swapping, and compressed by zcache/zswap.
Someone else may run the benchmarks on a minimally-configured embedded
system, and someone else on a single-user system with no services
running at all.
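For anyone trying to reproduce the comparison, the percentage column in an
mmtests report like the one above can be recomputed from the two raw value
columns.  A small sketch; note the sign convention used here (gain relative
to the first column, inverted for lower-is-better metrics such as durations,
swap counts, and fault counts) is inferred from the numbers in the table,
not stated anywhere in the thread:

```shell
# pct_gain BASELINE OTHER HIGHER_IS_BETTER(1|0)
# Recomputes mmtests' percentage column: a positive result means OTHER
# improved over BASELINE.  For lower-is-better metrics (io-duration,
# swaptotal, swapin, minorfaults, majorfaults) pass 0 to invert the sign.
pct_gain() {
  awk -v b="$1" -v n="$2" -v h="$3" \
      'BEGIN { d = (h ? n - b : b - n); printf "%.2f\n", d / b * 100 }'
}

# Spot-check three rows from the table in the mail:
pct_gain 14619 15602 1   # memcachetest-0M   -> 6.72
pct_gain 56711 8     0   # swaptotal-0M      -> 99.99
pct_gain 4161  9222  0   # majorfaults-1452M -> -121.63
```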
RE: [PATCHv12 2/4] zbud: add to mm/
> From: Andrew Morton [mailto:a...@linux-foundation.org]
> Subject: Re: [PATCHv12 2/4] zbud: add to mm/
>
> On Wed, 29 May 2013 15:42:36 -0500 Seth Jennings wrote:
>
> > > > > I worry about any code which independently looks at the pageframe
> > > > > tables and expects to find page structs there.  One example is
> > > > > probably memory_failure() but there are probably others.
> > >
> > > ^^ this, please.  It could be kinda fatal.
> >
> > I'll look into this.
> >
> > The expected behavior is that memory_failure() should handle zbud pages
> > in the same way that it handles in-use slub/slab/slob pages and return
> > -EBUSY.
>
> memory_failure() is merely an example of a general problem: code which
> reads from the memmap[] array and expects its elements to be of type
> `struct page'.  Other examples might be memory hotplugging, memory leak
> checkers etc.  I have vague memories of out-of-tree patches
> (bigphysarea?) doing this as well.
>
> It's a general problem to which we need a general solution.

Obi-tmem Kenobe slowly materializes... use the force, Luke!

One could reasonably argue that any code that makes incorrect assumptions
about the contents of a struct page structure is buggy and should be
fixed.  Isn't the "general solution" already described in the following
comment, excerpted from include/linux/mm.h, which implies that "scribbling
on existing pageframes" [carefully] is fine?  (And, if not, shouldn't that
comment be fixed, or am I misreading it?)

(begin excerpt from include/linux/mm.h)

 * For the non-reserved pages, page_count(page) denotes a reference count.
 *   page_count() == 0 means the page is free. page->lru is then used for
 *   freelist management in the buddy allocator.
 *   page_count() > 0  means the page has been allocated.
 *
 * Pages are allocated by the slab allocator in order to provide memory
 * to kmalloc and kmem_cache_alloc. In this case, the management of the
 * page, and the fields in 'struct page' are the responsibility of mm/slab.c
 * unless a particular usage is carefully commented. (the responsibility of
 * freeing the kmalloc memory is the caller's, of course).
 *
 * A page may be used by anyone else who does a __get_free_page().
 * In this case, page_count still tracks the references, and should only
 * be used through the normal accessor functions. The top bits of page->flags
 * and page->virtual store page management information, but all other fields
 * are unused and could be used privately, carefully. The management of this
 * page is the responsibility of the one who allocated it, and those who have
 * subsequently been given references to it.
 *
 * The other pages (we may call them "pagecache pages") are completely
 * managed by the Linux memory manager: I/O, buffers, swapping etc.
 * The following discussion applies only to them.

(end excerpt)

Obi-tmem Kenobe slowly dematerializes

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Bye bye Mr tmem guy
Hi Linux kernel folks and Xen folks --

Effective July 5, I will be resigning from Oracle and "retiring" for a
minimum of 12-18 months and probably/hopefully much longer.  Between now
and July 5, I will be tying up loose ends related to my patches but also
using up accrued vacation days.  If you have a loose end you'd like to
see tied, please let me know ASAP and I will do my best.

After July 5, any email to me via first DOT last AT oracle DOT com will
go undelivered and may bounce.  Please send email related to my open
source patches and contributions to Konrad Wilk and/or Bob Liu.  Personal
email directed to me can be sent to first AT last DOT com.

Thanks much to everybody for the many educational opportunities, the
technical and political jousting, and the great times at conferences and
summits!  I wish you all the best of luck!  Or to quote Douglas Adams:
"So long and thanks for all the fish!"

Cheers,
Dan Magenheimer
The Transcendent Memory ("tmem") guy

Tmem-related historical webography:
http://lwn.net/Articles/454795/
http://lwn.net/Articles/475681/
http://lwn.net/Articles/545244/
https://oss.oracle.com/projects/tmem/
http://www.linux-kvm.org/wiki/images/d/d7/TmemNotVirt-Linuxcon2011-Final.pdf
http://lwn.net/Articles/465317/
http://lwn.net/Articles/340080/
http://lwn.net/Articles/386090/
http://www.xen.org/files/xensummit_oracle09/xensummit_transmemory.pdf
https://oss.oracle.com/projects/tmem/dist/documentation/presentations/TranscendentMemoryXenSummit2010.pdf
https://blogs.oracle.com/wim/entry/example_of_transcendent_memory_and
https://blogs.oracle.com/wim/entry/another_feature_hit_mainline_linux
https://blogs.oracle.com/wim/entry/from_the_research_department_ramster
http://streaming.oracle.com/ebn/podcasts/media/11663326_VM_Linux_042512.mp3
https://oss.oracle.com/projects/tmem/dist/documentation/papers/overcommit.pdf
http://static.usenix.org/event/wiov08/tech/full_papers/magenheimer/magenheimer_html/
RE: [PATCH] staging: ramster: add how-to document
Hey Greg --

Since this is documentation only and documents existing behavior, I'm not
clear whether it is acceptable for an rcN release in the current cycle or
must wait until the next window.  Since it is a new file, it should apply
to either so I'll leave the choice up to you.

Thanks,
Dan

> From: Dan Magenheimer [mailto:dan.magenhei...@oracle.com]
> Sent: Monday, May 20, 2013 8:52 AM
> To: de...@linuxdriverproject.org; linux-kernel@vger.kernel.org;
>     gre...@linuxfoundation.org; linux-m...@kvack.org;
>     konrad.w...@oracle.com; dan.magenhei...@oracle.com;
>     liw...@linux.vnet.ibm.com; bob@oracle.com
> Subject: [PATCH] staging: ramster: add how-to document
>
> Add how-to documentation that provides a step-by-step guide
> for configuring and trying out a ramster cluster.
>
> Signed-off-by: Dan Magenheimer
> ---
>  drivers/staging/zcache/ramster/ramster-howto.txt | 366 ++
>  1 files changed, 366 insertions(+), 0 deletions(-)
>  create mode 100644 drivers/staging/zcache/ramster/ramster-howto.txt
>
> diff --git a/drivers/staging/zcache/ramster/ramster-howto.txt b/drivers/staging/zcache/ramster/ramster-howto.txt
> new file mode 100644
> index 000..7b1ee3b
> --- /dev/null
> +++ b/drivers/staging/zcache/ramster/ramster-howto.txt
> @@ -0,0 +1,366 @@
> +			RAMSTER HOW-TO
> +
> +Author: Dan Magenheimer
> +Ramster maintainer: Konrad Wilk
> +
> +This is a HOWTO document for ramster which, as of this writing, is in
> +the kernel as a subdirectory of zcache in drivers/staging, called ramster.
> +(Zcache can be built with or without ramster functionality.)  If enabled
> +and properly configured, ramster allows memory capacity load balancing
> +across multiple machines in a cluster.  Further, the ramster code serves
> +as an example of asynchronous access for zcache (as well as cleancache and
> +frontswap) that may prove useful for future transcendent memory
> +implementations, such as KVM and NVRAM.  While ramster works today on
> +any network connection that supports kernel sockets, its features may
> +become more interesting on future high-speed fabrics/interconnects.
> +
> +Ramster requires both kernel and userland support.  The userland support,
> +called ramster-tools, is known to work with EL6-based distros, but is a
> +set of poorly-hacked slightly-modified cluster tools based on ocfs2, which
> +includes an init file, a config file, and a userland binary that interfaces
> +to the kernel.  This state of userland support reflects the abysmal userland
> +skills of this suitably-embarrassed author; any help/patches to turn
> +ramster-tools into more distributable rpms/debs useful for a wider range
> +of distros would be appreciated.  The source RPM that can be used as a
> +starting point is available at:
> +    http://oss.oracle.com/projects/tmem/files/RAMster/
> +
> +As a result of this author's ignorance, userland setup described in this
> +HOWTO assumes an EL6 distro and is described in EL6 syntax.  Apologies
> +if this offends anyone!
> +
> +Kernel support has only been tested on x86_64.  Systems with an active
> +ocfs2 filesystem should work, but since ramster leverages a lot of
> +code from ocfs2, there may be latent issues.  A kernel configuration that
> +includes CONFIG_OCFS2_FS should build OK, and should certainly run OK
> +if no ocfs2 filesystem is mounted.
> +
> +This HOWTO demonstrates memory capacity load balancing for a two-node
> +cluster, where one node called the "local" node becomes overcommitted
> +and the other node called the "remote" node provides additional RAM
> +capacity for use by the local node.  Ramster is capable of more complex
> +topologies; see the last section titled "ADVANCED RAMSTER TOPOLOGIES".
> +
> +If you find any terms in this HOWTO unfamiliar or don't understand the
> +motivation for ramster, the following LWN reading is recommended:
> +-- Transcendent Memory in a Nutshell (lwn.net/Articles/454795)
> +-- The future calculus of memory management (lwn.net/Articles/475681)
> +And since ramster is built on top of zcache, this article may be helpful:
> +-- In-kernel memory compression (lwn.net/Articles/545244)
> +
> +Now that you've memorized the contents of those articles, let's get started!
> +
> +A. PRELIMINARY
> +
> +1) Install two x86_64 Linux systems that are known to work when
> +   upgraded to a recent upstream Linux kernel version.
> +
> +On each system:
> +
> +2) Configure, build and install, then boot Linux, just to ensure it
> +   can be done with an unmodified upstream kernel.  Confirm you booted
> +   the upstream kernel with "uname -a".
> +
> +3) If you plan to do any performance testing or unless you plan to
> +   test only swapping, the "WasActive" patch is also highly recommended.
[PATCH] staging: ramster: add how-to document
Add how-to documentation that provides a step-by-step guide
for configuring and trying out a ramster cluster.

Signed-off-by: Dan Magenheimer <dan.magenhei...@oracle.com>
---
 drivers/staging/zcache/ramster/ramster-howto.txt | 366 ++
 1 files changed, 366 insertions(+), 0 deletions(-)
 create mode 100644 drivers/staging/zcache/ramster/ramster-howto.txt

diff --git a/drivers/staging/zcache/ramster/ramster-howto.txt b/drivers/staging/zcache/ramster/ramster-howto.txt
new file mode 100644
index 000..7b1ee3b
--- /dev/null
+++ b/drivers/staging/zcache/ramster/ramster-howto.txt
@@ -0,0 +1,366 @@
+			RAMSTER HOW-TO
+
+Author: Dan Magenheimer
+Ramster maintainer: Konrad Wilk <konrad.w...@oracle.com>
+
+This is a HOWTO document for ramster which, as of this writing, is in
+the kernel as a subdirectory of zcache in drivers/staging, called ramster.
+(Zcache can be built with or without ramster functionality.)  If enabled
+and properly configured, ramster allows memory capacity load balancing
+across multiple machines in a cluster.  Further, the ramster code serves
+as an example of asynchronous access for zcache (as well as cleancache and
+frontswap) that may prove useful for future transcendent memory
+implementations, such as KVM and NVRAM.  While ramster works today on
+any network connection that supports kernel sockets, its features may
+become more interesting on future high-speed fabrics/interconnects.
+
+Ramster requires both kernel and userland support.  The userland support,
+called ramster-tools, is known to work with EL6-based distros, but is a
+set of poorly-hacked slightly-modified cluster tools based on ocfs2, which
+includes an init file, a config file, and a userland binary that interfaces
+to the kernel.  This state of userland support reflects the abysmal userland
+skills of this suitably-embarrassed author; any help/patches to turn
+ramster-tools into more distributable rpms/debs useful for a wider range
+of distros would be appreciated.  The source RPM that can be used as a
+starting point is available at:
+    http://oss.oracle.com/projects/tmem/files/RAMster/
+
+As a result of this author's ignorance, userland setup described in this
+HOWTO assumes an EL6 distro and is described in EL6 syntax.  Apologies
+if this offends anyone!
+
+Kernel support has only been tested on x86_64.  Systems with an active
+ocfs2 filesystem should work, but since ramster leverages a lot of
+code from ocfs2, there may be latent issues.  A kernel configuration that
+includes CONFIG_OCFS2_FS should build OK, and should certainly run OK
+if no ocfs2 filesystem is mounted.
+
+This HOWTO demonstrates memory capacity load balancing for a two-node
+cluster, where one node called the "local" node becomes overcommitted
+and the other node called the "remote" node provides additional RAM
+capacity for use by the local node.  Ramster is capable of more complex
+topologies; see the last section titled "ADVANCED RAMSTER TOPOLOGIES".
+
+If you find any terms in this HOWTO unfamiliar or don't understand the
+motivation for ramster, the following LWN reading is recommended:
+-- Transcendent Memory in a Nutshell (lwn.net/Articles/454795)
+-- The future calculus of memory management (lwn.net/Articles/475681)
+And since ramster is built on top of zcache, this article may be helpful:
+-- In-kernel memory compression (lwn.net/Articles/545244)
+
+Now that you've memorized the contents of those articles, let's get started!
+
+A. PRELIMINARY
+
+1) Install two x86_64 Linux systems that are known to work when
+   upgraded to a recent upstream Linux kernel version.
+
+On each system:
+
+2) Configure, build and install, then boot Linux, just to ensure it
+   can be done with an unmodified upstream kernel.  Confirm you booted
+   the upstream kernel with "uname -a".
+
+3) If you plan to do any performance testing or unless you plan to
+   test only swapping, the "WasActive" patch is also highly recommended.
+   (Search lkml.org for WasActive, apply the patch, rebuild your kernel.)
+   For a demo or simple testing, the patch can be ignored.
+
+4) Install ramster-tools as root.  An x86_64 rpm for EL6-based systems
+   can be found at:
+    http://oss.oracle.com/projects/tmem/files/RAMster/
+   (Sorry but for now, non-EL6 users must recreate ramster-tools on
+   their own from source.  See above.)
+
+5) Ensure that debugfs is mounted at each boot.  Examples below assume it
+   is mounted at /sys/kernel/debug.
+
+B. BUILDING RAMSTER INTO THE KERNEL
+
+Do the following on each system:
+
+1) Using the kernel configuration mechanism of your choice, change
+   your config to include:
+
+	CONFIG_CLEANCACHE=y
+	CONFIG_FRONTSWAP=y
+	CONFIG_STAGING=y
+	CONFIG_CONFIGFS_FS=y   # NOTE: MUST BE y, not m
+	CONFIG_ZCACHE=y
+	CONFIG_RAMSTER=y
+
+   For a linux-3.10 or later kernel, you should also set:
+
+	CONFIG_ZCACHE_DEBUG=y
+	CONFIG_RAMSTER_DEBUG=y
+
+   Before building the kernel
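The config changes in section B, step 1 can also be made non-interactively.
A sketch using the kernel tree's own scripts/config helper (assumptions: you
are at the top of the kernel source tree, a .config already exists, and your
kernel is recent enough to have the `olddefconfig` make target):

```shell
# Enable the options from section B step 1 without opening menuconfig.
# scripts/config accepts option names with or without the CONFIG_ prefix;
# --enable sets each one to y (CONFIGFS_FS must be y, not m).
scripts/config --enable CLEANCACHE \
               --enable FRONTSWAP \
               --enable STAGING \
               --enable CONFIGFS_FS \
               --enable ZCACHE \
               --enable RAMSTER

# For a linux-3.10 or later kernel, also:
scripts/config --enable ZCACHE_DEBUG --enable RAMSTER_DEBUG

# Resolve any newly-exposed dependencies, then verify the result:
make olddefconfig
grep CONFIG_CONFIGFS_FS .config    # expect CONFIG_CONFIGFS_FS=y
```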
RE: [PATCH] staging: ramster: add how-to document
Hey Greg -- Since this is documentation only and documents existing behavior, I'm not clear whether it is acceptable for an rcN release in the current cycle or must wait until the next window.  Since it is a new file, it should apply to either so I'll leave the choice up to you.

Thanks,
Dan

From: Dan Magenheimer [mailto:dan.magenhei...@oracle.com]
Sent: Monday, May 20, 2013 8:52 AM
To: de...@linuxdriverproject.org; linux-kernel@vger.kernel.org; gre...@linuxfoundation.org; linux-m...@kvack.org; konrad.w...@oracle.com; dan.magenhei...@oracle.com; liw...@linux.vnet.ibm.com; bob@oracle.com
Subject: [PATCH] staging: ramster: add how-to document

Add how-to documentation that provides a step-by-step guide for
configuring and trying out a ramster cluster.

Signed-off-by: Dan Magenheimer <dan.magenhei...@oracle.com>
---
 drivers/staging/zcache/ramster/ramster-howto.txt | 366 ++
 1 files changed, 366 insertions(+), 0 deletions(-)
 create mode 100644 drivers/staging/zcache/ramster/ramster-howto.txt

diff --git a/drivers/staging/zcache/ramster/ramster-howto.txt b/drivers/staging/zcache/ramster/ramster-howto.txt
new file mode 100644
index 000..7b1ee3b
--- /dev/null
+++ b/drivers/staging/zcache/ramster/ramster-howto.txt
@@ -0,0 +1,366 @@
+			RAMSTER HOW-TO
+
+Author: Dan Magenheimer
+Ramster maintainer: Konrad Wilk <konrad.w...@oracle.com>
+
+This is a HOWTO document for ramster which, as of this writing, is in
+the kernel as a subdirectory of zcache in drivers/staging, called ramster.
+(Zcache can be built with or without ramster functionality.)  If enabled
+and properly configured, ramster allows memory capacity load balancing
+across multiple machines in a cluster.  Further, the ramster code serves
+as an example of asynchronous access for zcache (as well as cleancache and
+frontswap) that may prove useful for future transcendent memory
+implementations, such as KVM and NVRAM.  While ramster works today on
+any network connection that supports kernel sockets, its features may
+become more interesting on future high-speed fabrics/interconnects.
+
+Ramster requires both kernel and userland support.  The userland support,
+called ramster-tools, is known to work with EL6-based distros, but is a
+set of poorly-hacked slightly-modified cluster tools based on ocfs2, which
+includes an init file, a config file, and a userland binary that interfaces
+to the kernel.  This state of userland support reflects the abysmal userland
+skills of this suitably-embarrassed author; any help/patches to turn
+ramster-tools into more distributable rpms/debs useful for a wider range
+of distros would be appreciated.  The source RPM that can be used as a
+starting point is available at:
+    http://oss.oracle.com/projects/tmem/files/RAMster/
+
+As a result of this author's ignorance, userland setup described in this
+HOWTO assumes an EL6 distro and is described in EL6 syntax.  Apologies
+if this offends anyone!
+
+Kernel support has only been tested on x86_64.  Systems with an active
+ocfs2 filesystem should work, but since ramster leverages a lot of
+code from ocfs2, there may be latent issues.  A kernel configuration that
+includes CONFIG_OCFS2_FS should build OK, and should certainly run OK
+if no ocfs2 filesystem is mounted.
+
+This HOWTO demonstrates memory capacity load balancing for a two-node
+cluster, where one node called the local node becomes overcommitted
+and the other node called the remote node provides additional RAM
+capacity for use by the local node.  Ramster is capable of more complex
+topologies; see the last section titled ADVANCED RAMSTER TOPOLOGIES.
+
+If you find any terms in this HOWTO unfamiliar or don't understand the
+motivation for ramster, the following LWN reading is recommended:
+-- Transcendent Memory in a Nutshell (lwn.net/Articles/454795)
+-- The future calculus of memory management (lwn.net/Articles/475681)
+And since ramster is built on top of zcache, this article may be helpful:
+-- In-kernel memory compression (lwn.net/Articles/545244)
+
+Now that you've memorized the contents of those articles, let's get started!
+
+A. PRELIMINARY
+
+1) Install two x86_64 Linux systems that are known to work when
+   upgraded to a recent upstream Linux kernel version.
+
+On each system:
+
+2) Configure, build and install, then boot Linux, just to ensure it
+   can be done with an unmodified upstream kernel.  Confirm you booted
+   the upstream kernel with uname -a.
+
+3) If you plan to do any performance testing or unless you plan to
+   test only swapping, the WasActive patch is also highly recommended.
+   (Search lkml.org for WasActive, apply the patch, rebuild your kernel.)
+   For a demo or simple testing, the patch can be ignored.
+
+4) Install ramster-tools as root.  An x86_64 rpm for EL6-based systems
+   can be found at:
+    http://oss.oracle.com/projects/tmem
Bye bye Mr tmem guy
Hi Linux kernel folks and Xen folks --

Effective July 5, I will be resigning from Oracle and retiring for a minimum of 12-18 months and probably/hopefully much longer.  Between now and July 5, I will be tying up loose ends related to my patches but also using up accrued vacation days.  If you have a loose end you'd like to see tied, please let me know ASAP and I will do my best.

After July 5, any email to me via first DOT last AT oracle DOT com will go undelivered and may bounce.  Please send email related to my open source patches and contributions to Konrad Wilk and/or Bob Liu.  Personal email directed to me can be sent to first AT last DOT com.

Thanks much to everybody for the many educational opportunities, the technical and political jousting, and the great times at conferences and summits!  I wish you all the best of luck!  Or to quote Douglas Adams: So long and thanks for all the fish!

Cheers,
Dan Magenheimer
The Transcendent Memory (tmem) guy

Tmem-related historical webography:
http://lwn.net/Articles/454795/
http://lwn.net/Articles/475681/
http://lwn.net/Articles/545244/
https://oss.oracle.com/projects/tmem/
http://www.linux-kvm.org/wiki/images/d/d7/TmemNotVirt-Linuxcon2011-Final.pdf
http://lwn.net/Articles/465317/
http://lwn.net/Articles/340080/
http://lwn.net/Articles/386090/
http://www.xen.org/files/xensummit_oracle09/xensummit_transmemory.pdf
https://oss.oracle.com/projects/tmem/dist/documentation/presentations/TranscendentMemoryXenSummit2010.pdf
https://blogs.oracle.com/wim/entry/example_of_transcendent_memory_and
https://blogs.oracle.com/wim/entry/another_feature_hit_mainline_linux
https://blogs.oracle.com/wim/entry/from_the_research_department_ramster
http://streaming.oracle.com/ebn/podcasts/media/11663326_VM_Linux_042512.mp3
https://oss.oracle.com/projects/tmem/dist/documentation/papers/overcommit.pdf
http://static.usenix.org/event/wiov08/tech/full_papers/magenheimer/magenheimer_html/

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
RE: [PATCHv11 3/4] zswap: add to mm/
> From: Rik van Riel [mailto:r...@redhat.com]
> Sent: Wednesday, May 15, 2013 4:15 PM
> To: Dan Magenheimer
> Cc: Seth Jennings; Andrew Morton; Greg Kroah-Hartman; Nitin Gupta; Minchan Kim; Konrad Wilk; Robert Jennings; Jenifer Hopper; Mel Gorman; Johannes Weiner; Larry Woodman; Benjamin Herrenschmidt; Dave Hansen; Joe Perches; Joonsoo Kim; Cody P Schafer; Hugh Dickens; Paul Mackerras; linux...@kvack.org; linux-kernel@vger.kernel.org; de...@driverdev.osuosl.org
> Subject: Re: [PATCHv11 3/4] zswap: add to mm/
>
> On 05/14/2013 04:18 PM, Dan Magenheimer wrote:
>
> > It's unfortunate that my proposed topic for LSFMM was pre-empted
> > by the zsmalloc vs zbud discussion and zswap vs zcache, because
> > I think the real challenge of zswap (or zcache) and the value to
> > distros and end users requires us to get this right BEFORE users
> > start filing bugs about performance weirdness.  After which most
> > users and distros will simply default to 0% (i.e. turn zswap off)
> > because zswap unpredictably sometimes sucks.
>
> I'm not sure we can get it right before people actually start
> using it for real world setups, instead of just running benchmarks
> on it.
>
> The sooner we get the code out there, where users can play with
> it (even if it is disabled by default and needs a sysfs or
> sysctl config option to enable it), the sooner we will know how
> well it works, and what needs to be changed.

/me sets stage of first Star Wars (1977)
/me envisions self as Obi-Wan Kenobi, old and tired of fighting, in lightsaber battle with protege Darth Vader / Anakin Skywalker
/me sadly turns off lightsaber, holds useless handle at waist, takes a deep breath, and promptly gets sliced into oblivion.

Time for A New Hope(tm).

(/me cc's Jon Corbet for a longshot last chance of making LWN's Kernel Development Quotes of the Week.)
RE: [PATCHv11 3/4] zswap: add to mm/
> From: Rik van Riel [mailto:r...@redhat.com]
> Subject: Re: [PATCHv11 3/4] zswap: add to mm/
>
> On 05/15/2013 03:35 PM, Dan Magenheimer wrote:
> >> From: Konrad Rzeszutek Wilk
> >> Subject: Re: [PATCHv11 3/4] zswap: add to mm/
> >>
> >>> Sorry, but I don't think that's appropriate for a patch in the MM
> >>> subsystem.
> >>
> >> I am heading to the airport shortly so this email is a bit hastily typed.
> >>
> >> Perhaps a compromise can be reached where this code is merged as a driver
> >> not a core mm component.  There is a high bar to be in the MM - it has to
> >> work with many many different configurations.
> >>
> >> And drivers don't have such a high bar.  They just need to work on a specific
> >> issue and that is it.  If zswap ended up in say, drivers/mm that would make
> >> it more palpable I think.
> >>
> >> Thoughts?
> >
> > Hmmm...
> >
> > To me, that sounds like a really good compromise.
>
> Come on, we all know that is nonsense.
>
> Sure, the zswap and zbud code may not be in their final state yet,
> but they belong in the mm/ directory, together with the cleancache
> code and all the other related bits of code.
>
> Lets put them in their final destination, and hope the code attracts
> attention by as many MM developers as can spare the time to help
> improve it.

Hi Rik --

Seth has been hell-bent on getting SOME code into the kernel for over a year, since he found out that enabling zcache, a staging driver, resulted in a tainted kernel.  First it was promoting zcache+zsmalloc out of staging.  Then it was zswap+zsmalloc without writeback, then zswap+zsmalloc with writeback, and now zswap+zbud with writeback but without a sane policy for writeback.

All of that time, I've been arguing and trying to integrate compression more deeply and sensibly into MM, rather than just enabling compression as a toy that happens to speed up a few benchmarks.  (This, in a nutshell, was the feedback I got at LSFMM12 from Andrea and Mel... and I think also from you.)  Seth has resisted every step of the way, then integrated the functionality in question, adapted my code (or Nitin's), and called it his own.

If you disagree with any of my arguments earlier in this thread, please say so.  Else, please reinforce that the MM subsystem needs to dynamically adapt to a broad range of workloads, which zswap does not (yet) do.  Zswap is not simple, it is simplistic*.  IMHO, it may be OK for a driver to be ham-handed in its memory use, but that's not OK for something in mm/.

So I think merging zswap as a driver is a perfectly sensible compromise which lets Seth get his code upstream, allows users (and leading-edge distros) to experiment with compression, avoids these endless arguments, and allows those who care to move forward on how to deeply integrate compression into MM.

Dan

* simplistic, n., The tendency to oversimplify an issue or a problem by ignoring complexities or complications.
RE: [PATCHv11 3/4] zswap: add to mm/
> From: Dave Hansen [mailto:d...@sr71.net]
> Sent: Wednesday, May 15, 2013 2:24 PM
> To: Seth Jennings
> Cc: Konrad Rzeszutek Wilk; Dan Magenheimer; Andrew Morton; Greg Kroah-Hartman; Nitin Gupta; Minchan Kim; Robert Jennings; Jenifer Hopper; Mel Gorman; Johannes Weiner; Rik van Riel; Larry Woodman; Benjamin Herrenschmidt; Joe Perches; Joonsoo Kim; Cody P Schafer; Hugh Dickens; Paul Mackerras; linux-m...@kvack.org; linux-kernel@vger.kernel.org; de...@driverdev.osuosl.org
> Subject: Re: [PATCHv11 3/4] zswap: add to mm/
>
> On 05/15/2013 01:09 PM, Seth Jennings wrote:
> > On Wed, May 15, 2013 at 02:55:06PM -0400, Konrad Rzeszutek Wilk wrote:
> >>> Sorry, but I don't think that's appropriate for a patch in the MM
> >>> subsystem.
> >>
> >> Perhaps a compromise can be reached where this code is merged as a driver
> >> not a core mm component.  There is a high bar to be in the MM - it has to
> >> work with many many different configurations.
> >>
> >> And drivers don't have such a high bar.  They just need to work on a specific
> >> issue and that is it.  If zswap ended up in say, drivers/mm that would make
> >> it more palpable I think.
>
> The issue is not whether it is a loadable module or a driver.  Nobody
> here is stupid enough to say, "hey, now it's a driver/module, all of the
> complex VM interactions are finally fixed!"
>
> If folks don't want this in their system, there's a way to turn it off,
> today, with the sysfs tunables.  We don't need _another_ way to turn it
> off at runtime (unloading the module/driver).

The issue is we KNOW the complex VM interactions are NOT fixed and there has been very very little breadth testing (i.e. across a wide range of workloads), nor any attempts to show how much harm can come from enabling it.  That's (at least borderline) acceptable in a driver that can be unloaded, but not in MM code IMHO.
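For reference, the runtime off-switch Dave alludes to is exposed through zswap's module parameters; a sketch of toggling it, assuming the usual /sys/module/<name>/parameters layout for module_param entries (check your kernel's paths before relying on this):

```shell
# Sketch: inspect and flip zswap's runtime tunables (requires root and a
# kernel with zswap built in; paths follow the module_param convention).
grep . /sys/module/zswap/parameters/*            # show current settings
echo 0 > /sys/module/zswap/parameters/enabled    # stop compressing new pages
```

Note that disabling it this way stops new pages from entering the pool but does not drain pages already compressed.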
RE: [PATCHv11 3/4] zswap: add to mm/
> From: Seth Jennings [mailto:sjenn...@linux.vnet.ibm.com]
> Sent: Wednesday, May 15, 2013 2:10 PM
> To: Konrad Rzeszutek Wilk
> Cc: Dan Magenheimer; Andrew Morton; Greg Kroah-Hartman; Nitin Gupta; Minchan Kim; Robert Jennings; Jenifer Hopper; Mel Gorman; Johannes Weiner; Rik van Riel; Larry Woodman; Benjamin Herrenschmidt; Dave Hansen; Joe Perches; Joonsoo Kim; Cody P Schafer; Hugh Dickens; Paul Mackerras; linux...@kvack.org; linux-kernel@vger.kernel.org; de...@driverdev.osuosl.org
> Subject: Re: [PATCHv11 3/4] zswap: add to mm/
>
> On Wed, May 15, 2013 at 02:55:06PM -0400, Konrad Rzeszutek Wilk wrote:
> > > Sorry, but I don't think that's appropriate for a patch in the MM
> > > subsystem.
> >
> > I am heading to the airport shortly so this email is a bit hastily typed.
> >
> > Perhaps a compromise can be reached where this code is merged as a driver
> > not a core mm component.  There is a high bar to be in the MM - it has to
> > work with many many different configurations.
> >
> > And drivers don't have such a high bar.  They just need to work on a specific
> > issue and that is it.  If zswap ended up in say, drivers/mm that would make
> > it more palpable I think.
> >
> > Thoughts?
>
> zswap, the writeback code particularly, depends on a number of non-exported
> kernel symbols, namely:
>
>   swapcache_free
>   __swap_writepage
>   __add_to_swap_cache
>   swapcache_prepare
>   swapper_spaces
>
> So it can't currently be built as a module and I'm not sure what the MM
> folks would think about exporting them and making them part of the KABI.

It can be built as a module if writeback is disabled (or ifdef'd by a CONFIG_ZSWAP_WRITEBACK which depends on CONFIG_ZSWAP=y).  The folks at LSFMM who were planning to use zswap will be turning off writeback anyway, so an alternative is to pull writeback out of zswap completely for now, since you don't really have a good policy to manage it yet anyway.
RE: [PATCHv11 3/4] zswap: add to mm/
> From: Konrad Rzeszutek Wilk
> Subject: Re: [PATCHv11 3/4] zswap: add to mm/
>
> > Sorry, but I don't think that's appropriate for a patch in the MM subsystem.
>
> I am heading to the airport shortly so this email is a bit hastily typed.
>
> Perhaps a compromise can be reached where this code is merged as a driver
> not a core mm component.  There is a high bar to be in the MM - it has to
> work with many many different configurations.
>
> And drivers don't have such a high bar.  They just need to work on a specific
> issue and that is it.  If zswap ended up in say, drivers/mm that would make
> it more palpable I think.
>
> Thoughts?

Hmmm...

To me, that sounds like a really good compromise.  Then anyone who wants to experiment with compressed swap pages can do so by enabling the zswap driver.  And the harder problem of deeply integrating compression into the MM subsystem can proceed in parallel by leveraging and building on the best of zswap and zcache and zram.

Seth, if you want to re-post zswap as a driver... even a previous zswap version with zsmalloc and without writeback... I would be willing to ack it.  If I correctly understand Mel's concerns, I suspect he might feel the same.
RE: [PATCH] Fixes, cleanups, compile warning fixes, and documentation update for Xen tmem driver (v2).
> From: Konrad Rzeszutek [mailto:ketuzs...@gmail.com] On Behalf Of Konrad Rzeszutek Wilk
> Sent: Tuesday, May 14, 2013 12:09 PM
> To: bob@oracle.com; dan.magenhei...@oracle.com; linux-kernel@vger.kernel.org; akpm@linux-foundation.org; linux...@kvack.org; xen-de...@lists.xensource.com
> Subject: [PATCH] Fixes, cleanups, compile warning fixes, and documentation update for Xen tmem driver (v2).
>
> Heya,
>
> These nine patches fix the tmem driver to:
>  - not emit a compile warning anymore (reported by 0 day test compile tool)
>  - remove the various nofrontswap, nocleancache, noselfshrinking,
>    noselfballooning, selfballooning, selfshrinking bootup options.
>  - said options are now folded in the tmem driver as module options and are
>    much shorter (and also there are only four of them now).
>  - add documentation to explain these parameters in kernel-parameters.txt
>  - And lastly add some logic to not enable selfshrinking and selfballooning
>    if frontswap functionality is off.
>
> That is it.  Tested and ready to go.  If nobody objects will put on my queue
> for Linus on Monday.

FWIW, I've scanned all of these and they look sane and good.  So consider all:

Acked-by: Dan Magenheimer

>  Documentation/kernel-parameters.txt |   21
>  drivers/xen/Kconfig                 |    7 +--
>  drivers/xen/tmem.c                  |   87 ---
>  drivers/xen/xen-selfballoon.c       |   47 ++
>  4 files changed, 69 insertions(+), 93 deletions(-)
>
> (oh nice, more deletions!)
>
> Konrad Rzeszutek Wilk (9):
>   xen/tmem: Cleanup.  Remove the parts that say temporary.
>   xen/tmem: Move all of the boot and module parameters to the top of the file.
>   xen/tmem: Split out the different module/boot options.
>   xen/tmem: Fix compile warning.
>   xen/tmem: s/disable_// and change the logic.
>   xen/tmem: Remove the boot options and fold them in the tmem.X parameters.
>   xen/tmem: Remove the usage of 'noselfshrink' and use 'tmem.selfshrink' bool instead.
>   xen/tmem: Remove the usage of '[no|]selfballoon' and use 'tmem.selfballooning' bool instead.
>   xen/tmem: Don't use self[ballooning|shrinking] if frontswap is off.
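Based only on the patch titles above, the consolidated tmem.X options would be set on the kernel command line roughly like the fragment below; the exact parameter spellings are inferred from the series and should be checked against the updated kernel-parameters.txt:

```shell
# Hypothetical /etc/default/grub fragment: keep frontswap/cleancache
# active but turn off selfballooning and selfshrinking via the new
# consolidated tmem module parameters (names assumed from patch titles).
GRUB_CMDLINE_LINUX="tmem tmem.selfballooning=0 tmem.selfshrink=0"
```

After editing, regenerate the grub config and reboot for the boot parameters to take effect.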
RE: [PATCHv11 3/4] zswap: add to mm/
> From: Seth Jennings [mailto:sjenn...@linux.vnet.ibm.com]
> Subject: Re: [PATCHv11 3/4] zswap: add to mm/
>
> On Tue, May 14, 2013 at 01:18:48PM -0700, Dan Magenheimer wrote:
> > > From: Seth Jennings [mailto:sjenn...@linux.vnet.ibm.com]
> > > Subject: Re: [PATCHv11 3/4] zswap: add to mm/
> > >
> > > > > +/* The maximum percentage of memory that the compressed pool can occupy */
> > > > > +static unsigned int zswap_max_pool_percent = 20;
> > > > > +module_param_named(max_pool_percent,
> > > > > +			zswap_max_pool_percent, uint, 0644);
> > > >
> > > > This limit, along with the code that enforces it (by calling reclaim
> > > > when the limit is reached), is IMHO questionable.  Is there any
> > > > other kernel memory allocation that is constrained by a percentage
> > > > of total memory rather than dynamically according to current
> > > > system conditions?  As Mel pointed out (approx.), if this limit
> > > > is reached by a zswap-storm and filled with pages of long-running,
> > > > rarely-used processes, 20% of RAM (by default here) becomes forever
> > > > clogged.
> > >
> > > So there are two comments here: 1) dynamic pool limit and 2) writeback
> > > of pages in zswap that won't be faulted in or forced out by pressure.
> > >
> > > Comment 1 feeds from the point of view that compressed pages should just be
> > > another type of memory managed by the core MM.  While ideal, very hard to
> > > implement in practice.  We are starting to realize that even the policy
> > > governing the active vs inactive list is very hard to get right.  Then
> > > shrinkers add more complexity to the policy problem.  Throwing another type
> > > in the mix would be just that much more complex and hard to get right
> > > (assuming there even _is_ a "right" policy for everyone in such a complex
> > > system).
> > >
> > > This max_pool_percent policy is simple, works well, and provides a
> > > deterministic policy that users can understand.  Users can be assured that a
> > > dynamic policy heuristic won't go nuts and allow the compressed pool to grow
> > > unbounded or be so aggressively reclaimed that it offers no value.
> >
> > Hi Seth --
> >
> > Hmmm... I'm not sure how to politely say "bullshit". :-)
> >
> > The default 20% was randomly pulled out of the air long ago for zcache
> > experiments.  If you can explain why 20% is better than 19% or 21%, or
> > better than 10% or 30% or even 50%, that would be a start.  Then please try
> > to explain -- in terms an average sysadmin can understand -- under
> > what circumstances this number should be higher or lower, that would
> > be even better.  In fact if you can explain it in even very broadbrush
> > terms like "higher for embedded" and "lower for server" that would be
> > useful.  If the top Linux experts in compression can't answer these
> > questions (and the default is a random number, which it is), I don't
> > know how we can expect users to be "assured".
>
> 20% is a default maximum.  There really isn't a particular reason for the
> selection other than to supply a reasonable default to a tunable.  20% is enough
> to show the benefit while assuring the user zswap won't eat more than that
> amount of memory under any circumstance.  The point is to make it a tunable,
> not to launch an incredibly in-depth study on what the default should be.

My point is that a tunable is worthless -- and essentially the same as a fixed value -- unless you can clearly instruct target users how to change it to match their needs.

> As guidance on how to tune it, switching to zbud actually made the math simpler
> by bounding the best case to 2 and the expected density to very near 2.  I have
> two methods, one based on calculation and another based on experimentation.
>
> Yes, I understand that there are many things to consider, but for the sake of
> those that honestly care about the answer to the question, I'll answer it.
>
> Method 1:
>
> If you have a workload running on a machine with x GB of RAM and an anonymous
> working set of y GB of pages where x < y, a good starting point for
> max_pool_percent is ((y/x)-1)*100.
>
> For example, if you have 10GB of RAM and 12GB anon working set, (12/10-1)*100 =
> 20.  During operation there would be 8GB in uncompressed memory, and 4GB worth
> of compressed memory occupying 2GB (i.e.
> 20%) of RAM.  This will reduce swap I/O to near zero assuming the pages
> compress 50% on average.  Bear in mind that this formula provides a lower
> bound on max_pool_percent if you want to avoid swap I/O.  Setting
> max_pool_percent to 20 would produce the same situation.

OK, let's try to apply your method.  You personally have undoubtedly compiled the kernel hundreds, maybe thousands of times in the last year.  In the restricted environment where you and I have run benchmarks, this is a fairly stable and reproducible workload == stable and reproducible are somewhat rare in the real world.  Can you tell me what the anon working set is of compiling the kernel?  Have you, one of the top experts in Linux compression technology, ever even once changed the max_pool_percent in your
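Method 1 above reduces to simple arithmetic; a quick sanity-check sketch using the thread's own example numbers (10GB RAM, 12GB anonymous working set):

```shell
# Worked example of "Method 1" from the thread: the lower bound for
# zswap's max_pool_percent is ((y/x)-1)*100, where x = GB of RAM and
# y = GB of anonymous working set, with y > x.  Integer arithmetic:
x=10                           # GB of RAM (example value from the thread)
y=12                           # GB of anonymous working set
pct=$(( (y * 100 / x) - 100 )) # ((y/x)-1)*100 without floating point
echo "$pct"                    # prints 20
```

With that setting, 8GB of the working set stays uncompressed and the remaining 4GB compresses (at roughly 2:1 with zbud) into 2GB, i.e. 20% of RAM, matching the thread's figures.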
RE: [PATCHv11 3/4] zswap: add to mm/
> From: Konrad Rzeszutek Wilk
> Subject: Re: [PATCHv11 3/4] zswap: add to mm/
>
> Sorry, but I don't think that's appropriate for a patch in the MM
> subsystem.
>
> I am heading to the airport shortly so this email is a bit hastily
> typed. Perhaps a compromise can be reached where this code is merged
> as a driver, not a core mm component. There is a high bar to be in the
> MM - it has to work with many many different configurations. And
> drivers don't have such a high bar. They just need to work on a
> specific issue and that is it. If zswap ended up in say, drivers/mm
> that would make it more palatable I think. Thoughts?

Hmmm... To me, that sounds like a really good compromise. Then anyone
who wants to experiment with compressed swap pages can do so by
enabling the zswap driver. And the harder problem of deeply
integrating compression into the MM subsystem can proceed in parallel
by leveraging and building on the best of zswap and zcache and zram.

Seth, if you want to re-post zswap as a driver... even a previous
zswap version with zsmalloc and without writeback... I would be
willing to ack it. If I correctly understand Mel's concerns, I suspect
he might feel the same.
RE: [PATCHv11 3/4] zswap: add to mm/
From: Seth Jennings [mailto:sjenn...@linux.vnet.ibm.com]
Sent: Wednesday, May 15, 2013 2:10 PM
To: Konrad Rzeszutek Wilk
Cc: Dan Magenheimer; Andrew Morton; Greg Kroah-Hartman; Nitin Gupta; Minchan Kim; Robert Jennings; Jenifer Hopper; Mel Gorman; Johannes Weiner; Rik van Riel; Larry Woodman; Benjamin Herrenschmidt; Dave Hansen; Joe Perches; Joonsoo Kim; Cody P Schafer; Hugh Dickens; Paul Mackerras; linux...@kvack.org; linux-kernel@vger.kernel.org; de...@driverdev.osuosl.org
Subject: Re: [PATCHv11 3/4] zswap: add to mm/

On Wed, May 15, 2013 at 02:55:06PM -0400, Konrad Rzeszutek Wilk wrote:
> Sorry, but I don't think that's appropriate for a patch in the MM
> subsystem.
>
> I am heading to the airport shortly so this email is a bit hastily
> typed. Perhaps a compromise can be reached where this code is merged
> as a driver, not a core mm component. There is a high bar to be in the
> MM - it has to work with many many different configurations. And
> drivers don't have such a high bar. They just need to work on a
> specific issue and that is it. If zswap ended up in say, drivers/mm
> that would make it more palatable I think. Thoughts?

zswap, the writeback code particularly, depends on a number of
non-exported kernel symbols, namely:

  swapcache_free
  __swap_writepage
  __add_to_swap_cache
  swapcache_prepare
  swapper_spaces

So it can't currently be built as a module and I'm not sure what the
MM folks would think about exporting them and making them part of the
KABI.

It can be built as a module if writeback is disabled (or ifdef'd by a
CONFIG_ZSWAP_WRITEBACK which depends on CONFIG_ZSWAP=y). The folks at
LSFMM who were planning to use zswap will be turning off writeback
anyway, so an alternative is to pull writeback out of zswap completely
for now, since you don't really have a good policy to manage it yet
anyway.
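(Seth's suggested split could be sketched as the Kconfig fragment below. CONFIG_ZSWAP_WRITEBACK is his hypothetical option, not one in the patch as posted; the point is that writeback uses unexported swap symbols, so it can only exist when zswap is built in, while the non-writeback core could be tristate.)

```kconfig
config ZSWAP
	tristate "In-kernel swap page compression"
	depends on FRONTSWAP && CRYPTO
	select CRYPTO_LZO
	select ZBUD

# Hypothetical option per the suggestion above: the writeback path
# calls unexported symbols (__swap_writepage etc.), so it forces the
# built-in case.
config ZSWAP_WRITEBACK
	bool "Evict compressed pages back to the swap device"
	depends on ZSWAP=y
```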
RE: [PATCHv11 3/4] zswap: add to mm/
From: Dave Hansen [mailto:d...@sr71.net]
Sent: Wednesday, May 15, 2013 2:24 PM
To: Seth Jennings
Cc: Konrad Rzeszutek Wilk; Dan Magenheimer; Andrew Morton; Greg Kroah-Hartman; Nitin Gupta; Minchan Kim; Robert Jennings; Jenifer Hopper; Mel Gorman; Johannes Weiner; Rik van Riel; Larry Woodman; Benjamin Herrenschmidt; Joe Perches; Joonsoo Kim; Cody P Schafer; Hugh Dickens; Paul Mackerras; linux-m...@kvack.org; linux-kernel@vger.kernel.org; de...@driverdev.osuosl.org
Subject: Re: [PATCHv11 3/4] zswap: add to mm/

On 05/15/2013 01:09 PM, Seth Jennings wrote:
> On Wed, May 15, 2013 at 02:55:06PM -0400, Konrad Rzeszutek Wilk wrote:
> > Sorry, but I don't think that's appropriate for a patch in the MM
> > subsystem. Perhaps a compromise can be reached where this code is
> > merged as a driver, not a core mm component. There is a high bar to
> > be in the MM - it has to work with many many different
> > configurations. And drivers don't have such a high bar. They just
> > need to work on a specific issue and that is it. If zswap ended up
> > in say, drivers/mm that would make it more palatable I think.

The issue is not whether it is a loadable module or a driver. Nobody
here is stupid enough to say, hey, now it's a driver/module, all of
the complex VM interactions are finally fixed! If folks don't want
this in their system, there's a way to turn it off, today, with the
sysfs tunables. We don't need _another_ way to turn it off at runtime
(unloading the module/driver).

The issue is we KNOW the complex VM interactions are NOT fixed, and
there has been very very little breadth testing (i.e. across a wide
range of workloads, and any attempts to show how much harm can come
from enabling it). That's (at least borderline) acceptable in a
driver that can be unloaded, but not in MM code IMHO.
RE: [PATCHv11 3/4] zswap: add to mm/
> From: Rik van Riel [mailto:r...@redhat.com]
> Subject: Re: [PATCHv11 3/4] zswap: add to mm/
>
> On 05/15/2013 03:35 PM, Dan Magenheimer wrote:
> > > From: Konrad Rzeszutek Wilk
> > > Subject: Re: [PATCHv11 3/4] zswap: add to mm/
> > >
> > > Sorry, but I don't think that's appropriate for a patch in the MM
> > > subsystem. I am heading to the airport shortly so this email is a
> > > bit hastily typed. Perhaps a compromise can be reached where this
> > > code is merged as a driver, not a core mm component. There is a
> > > high bar to be in the MM - it has to work with many many different
> > > configurations. And drivers don't have such a high bar. They just
> > > need to work on a specific issue and that is it. If zswap ended up
> > > in say, drivers/mm that would make it more palatable I think.
> > > Thoughts?
> >
> > Hmmm... To me, that sounds like a really good compromise.
>
> Come on, we all know that is nonsense. Sure, the zswap and zbud code
> may not be in their final state yet, but they belong in the mm/
> directory, together with the cleancache code and all the other
> related bits of code. Let's put them in their final destination, and
> hope the code attracts attention by as many MM developers as can
> spare the time to help improve it.

Hi Rik --

Seth has been hell-bent on getting SOME code into the kernel for over
a year, since he found out that enabling zcache, a staging driver,
resulted in a tainted kernel. First it was promoting zcache+zsmalloc
out of staging. Then it was zswap+zsmalloc without writeback, then
zswap+zsmalloc with writeback, and now zswap+zbud with writeback but
without a sane policy for writeback. All of that time, I've been
arguing and trying to integrate compression more deeply and sensibly
into MM, rather than just enabling compression as a toy that happens
to speed up a few benchmarks. (This, in a nutshell, was the feedback
I got at LSFMM12 from Andrea and Mel... and I think also from you.)
Seth has resisted every step of the way, then integrated the
functionality in question, adapted my code (or Nitin's), and called
it his own.
If you disagree with any of my arguments earlier in this thread,
please say so. Else, please reinforce that the MM subsystem needs to
dynamically adapt to a broad range of workloads, which zswap does not
(yet) do. Zswap is not simple, it is simplistic*. IMHO, it may be OK
for a driver to be ham-handed in its memory use, but that's not OK
for something in mm/.

So I think merging zswap as a driver is a perfectly sensible
compromise which lets Seth get his code upstream, allows users (and
leading-edge distros) to experiment with compression, avoids these
endless arguments, and allows those who care to move forward on how
to deeply integrate compression into MM.

Dan

* simplistic, n., The tendency to oversimplify an issue or a problem
  by ignoring complexities or complications.
RE: [PATCHv11 3/4] zswap: add to mm/
> From: Seth Jennings [mailto:sjenn...@linux.vnet.ibm.com]
> Subject: Re: [PATCHv11 3/4] zswap: add to mm/
>
> On Tue, May 14, 2013 at 09:37:08AM -0700, Dan Magenheimer wrote:
> > > From: Seth Jennings [mailto:sjenn...@linux.vnet.ibm.com]
> > > Subject: Re: [PATCHv11 3/4] zswap: add to mm/
> > >
> > > On Tue, May 14, 2013 at 05:19:19PM +0800, Bob Liu wrote:
> > > > Hi Seth,
> > >
> > > Hi Bob, thanks for the review!
> > >
> > > > > + /* reclaim space if needed */
> > > > > + if (zswap_is_full()) {
> > > > > +	zswap_pool_limit_hit++;
> > > > > +	if (zbud_reclaim_page(tree->pool, 8)) {
> > > >
> > > > My idea is to wake up a kernel thread here to do the reclaim.
> > > > Once zswap is full (20% of total mem currently), the kernel
> > > > thread should reclaim pages from it. Not only reclaim one page;
> > > > it should depend on the current memory pressure. And then the
> > > > API in zbud may look like this:
> > > > zbud_reclaim_page(pool, nr_pages_to_reclaim, nr_retry);
> > >
> > > So kswapd for zswap. I'm not opposed to the idea if a case can be
> > > made for the complexity. I must say, I don't see that case though.
> > >
> > > The policy can evolve as deficiencies are demonstrated and
> > > solutions are found.
> >
> > Hmmm... it is fairly easy to demonstrate the deficiency if one
> > tries. I actually first saw it occur on a real (though early) EL6
> > system which started some graphics-related service that caused a
> > very brief swapstorm that was invisible during normal boot but
> > clogged up RAM with compressed pages, which later caused weird
> > reductions in benchmarking performance.
>
> Without any specifics, I'm not sure what I can do with this.

Well, I think it's customary for the author of a patch to know the
limitations of the patch. I suggest you synthesize a workload that
attempts to measure the worst case.
That's exactly what I did a year ago, and it led me to the realization
that zcache needed to solve some issues before it was ready to promote
out of staging.

> I'm hearing you say that the source of the benchmark degradation
> is the idle pages in zswap. In that case, the periodic writeback
> patches I have in the wings should address this.
>
> I think we are on the same page without realizing it. Right now
> zswap supports a kind of "direct reclaim" model at allocation time.
> The periodic writeback patches will handle the proactive writeback
> part to free up the zswap pool when it has idle pages in it.

I don't think we are on the same page, though maybe you are heading in
the same direction now. I won't repeat the comments from the previous
email.

> > I think Mel's unpredictability concern applies equally here...
> > this may be a "long-term source of bugs and strange memory
> > management behavior."
> >
> > > Can I get your ack on this pending the other changes?
> >
> > I'd like to hear Mel's feedback about this, but perhaps a
> > compromise to allow for zswap merging would be to add something
> > like the following to zswap's Kconfig comment:
> >
> > "Zswap reclaim policy is still primitive. Until it improves,
> > zswap should be considered experimental and is not recommended
> > for production use."
>
> Just for the record, an "experimental" tag in the Kconfig won't
> work for me.
>
> The reclaim policy for zswap is not primitive, it's simple. There
> is a difference. Plus zswap is already runtime disabled by default.
> If distros/customers enabled it, it is because they purposely
> enabled it.

Hmmm... I think you are proposing the following use model to
users/distros: "If zswap works for you, turn it on. If it sucks, turn
it off. I can't tell you in advance whether it will work or suck for
your distro/workload, but it will probably work so please try it."

That sounds awfully experimental to me. The problem is not simple.
Your solution is simple because you are simply pretending that the
harder parts of the problem don't exist.

Dan
RE: [PATCHv11 3/4] zswap: add to mm/
> From: Seth Jennings [mailto:sjenn...@linux.vnet.ibm.com]
> Subject: Re: [PATCHv11 3/4] zswap: add to mm/
>
> > > +/* The maximum percentage of memory that the compressed pool can occupy */
> > > +static unsigned int zswap_max_pool_percent = 20;
> > > +module_param_named(max_pool_percent,
> > > +		zswap_max_pool_percent, uint, 0644);
> >
> > This limit, along with the code that enforces it (by calling
> > reclaim when the limit is reached), is IMHO questionable. Is there
> > any other kernel memory allocation that is constrained by a
> > percentage of total memory rather than dynamically according to
> > current system conditions? As Mel pointed out (approx.), if this
> > limit is reached by a zswap-storm and filled with pages of
> > long-running, rarely-used processes, 20% of RAM (by default here)
> > becomes forever clogged.
>
> So there are two comments here: 1) dynamic pool limit and
> 2) writeback of pages in zswap that won't be faulted in or forced
> out by pressure.
>
> Comment 1 feeds from the point of view that compressed pages should
> just be another type of memory managed by the core MM. While ideal,
> that is very hard to implement in practice. We are starting to
> realize that even the policy governing the active vs inactive list
> is very hard to get right. Then shrinkers add more complexity to
> the policy problem. Throwing another type into the mix would be
> just that much more complex and hard to get right (assuming there
> even _is_ a "right" policy for everyone in such a complex system).
>
> This max_pool_percent policy is simple, works well, and provides a
> deterministic policy that users can understand. Users can be
> assured that a dynamic policy heuristic won't go nuts and allow the
> compressed pool to grow unbounded or be so aggressively reclaimed
> that it offers no value.

Hi Seth --

Hmmm... I'm not sure how to politely say "bullshit". :-)

The default 20% was randomly pulled out of the air long ago for
zcache experiments.
If you can explain why 20% is better than 19% or 21%, or better than
10% or 30% or even 50%, that would be a start. Then please try to
explain -- in terms an average sysadmin can understand -- under what
circumstances this number should be higher or lower; that would be
even better. In fact, if you can explain it even in very broad-brush
terms like "higher for embedded" and "lower for server", that would
be useful.

If the top Linux experts in compression can't answer these questions
(and the default is a random number, which it is), I don't know how
we can expect users to be "assured".

What you mean is "works well"... on the two benchmarks you've tried
it on. You say it's too hard to do dynamically... even though every
other significant RAM user in the kernel has to do it dynamically.
Workloads are dynamic and heavy users of RAM need to deal with that.
You don't see a limit on the number of anonymous pages in the MM
subsystem, and you don't see a limit on the number of inodes in
btrfs. Linus would rightfully barf all over those limits and (if he
was paying attention to this discussion) he would barf on this limit
too.

It's unfortunate that my proposed topic for LSFMM was pre-empted by
the zsmalloc vs zbud discussion and zswap vs zcache, because I think
the real challenge of zswap (or zcache), and the value to distros and
end users, requires us to get this right BEFORE users start filing
bugs about performance weirdness. After which most users and distros
will simply default to 0% (i.e. turn zswap off) because zswap
unpredictably sometimes sucks.

Sorry...

> Comment 2 I agree is an issue. I already have patches for a
> "periodic writeback" functionality that starts to shrink the zswap
> pool via writeback if zswap goes idle for a period of time. This
> addresses the issue with long-lived, never-accessed pages getting
> stuck in zswap forever.

Pulling the call out of zswap_frontswap_store() (and ensuring there
still aren't any new races) would be a good start.
But this is just a mechanism; you haven't said anything about the
policy or how you intend to enforce the policy. Which just gets us
back to Comment 1...

So Comment 1 and Comment 2 are really the same: How do we
appropriately manage the number of pages in the system that are used
for storing compressed pages?
RE: [PATCHv11 3/4] zswap: add to mm/
> From: Seth Jennings [mailto:sjenn...@linux.vnet.ibm.com]
> Subject: Re: [PATCHv11 3/4] zswap: add to mm/
>
> On Tue, May 14, 2013 at 05:19:19PM +0800, Bob Liu wrote:
> > Hi Seth,
>
> Hi Bob, thanks for the review!
>
> > > + /* reclaim space if needed */
> > > + if (zswap_is_full()) {
> > > +	zswap_pool_limit_hit++;
> > > +	if (zbud_reclaim_page(tree->pool, 8)) {
> >
> > My idea is to wake up a kernel thread here to do the reclaim.
> > Once zswap is full (20% of total mem currently), the kernel thread
> > should reclaim pages from it. Not only reclaim one page; it should
> > depend on the current memory pressure. And then the API in zbud may
> > look like this:
> > zbud_reclaim_page(pool, nr_pages_to_reclaim, nr_retry);
>
> So kswapd for zswap. I'm not opposed to the idea if a case can be
> made for the complexity. I must say, I don't see that case though.
>
> The policy can evolve as deficiencies are demonstrated and solutions
> are found.

Hmmm... it is fairly easy to demonstrate the deficiency if one tries.
I actually first saw it occur on a real (though early) EL6 system
which started some graphics-related service that caused a very brief
swapstorm that was invisible during normal boot but clogged up RAM
with compressed pages, which later caused weird reductions in
benchmarking performance.

I think Mel's unpredictability concern applies equally here... this
may be a "long-term source of bugs and strange memory management
behavior."

> Can I get your ack on this pending the other changes?

I'd like to hear Mel's feedback about this, but perhaps a compromise
to allow for zswap merging would be to add something like the
following to zswap's Kconfig comment:

"Zswap reclaim policy is still primitive. Until it improves, zswap
should be considered experimental and is not recommended for
production use."

If Mel agrees with the unpredictability and also agrees with the
Kconfig compromise, I am willing to ack.
RE: [PATCHv11 3/4] zswap: add to mm/
> From: Seth Jennings [mailto:sjenn...@linux.vnet.ibm.com]
> Subject: [PATCHv11 3/4] zswap: add to mm/
>
> zswap is a thin compression backend for frontswap. It receives pages
> from frontswap and attempts to store them in a compressed memory
> pool, resulting in an effective partial memory reclaim and
> dramatically reduced swap device I/O.
>
> Additionally, in most cases, pages can be retrieved from this
> compressed store much more quickly than reading from traditional swap
> devices, resulting in faster performance for many workloads.
>
> It also has support for evicting swap pages that are currently
> compressed in zswap to the swap device on an LRU(ish) basis. This
> functionality is very important and makes zswap a true cache in that,
> once the cache is full or can't grow due to memory pressure, the
> oldest pages can be moved out of zswap to the swap device so newer
> pages can be compressed and stored in zswap.
>
> This patch adds the zswap driver to mm/
>
> Signed-off-by: Seth Jennings

A couple of comments below...

> ---
>  mm/Kconfig  |  15 +
>  mm/Makefile |   1 +
>  mm/zswap.c  | 952 +++
>  3 files changed, 968 insertions(+)
>  create mode 100644 mm/zswap.c
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 908f41b..4042e07 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -487,3 +487,18 @@ config ZBUD
> 	  While this design limits storage density, it has simple and
> 	  deterministic reclaim properties that make it preferable to a higher
> 	  density approach when reclaim will be used.
> +
> +config ZSWAP
> +	bool "In-kernel swap page compression"
> +	depends on FRONTSWAP && CRYPTO
> +	select CRYPTO_LZO
> +	select ZBUD
> +	default n
> +	help
> +	  Zswap is a backend for the frontswap mechanism in the VMM.
> +	  It receives pages from frontswap and attempts to store them
> +	  in a compressed memory pool, resulting in an effective
> +	  partial memory reclaim.
> +	  In addition, pages can be retrieved
> +	  from this compressed store much faster than most traditional
> +	  swap devices, resulting in reduced I/O and faster performance
> +	  for many workloads.
> diff --git a/mm/Makefile b/mm/Makefile
> index 95f0197..f008033 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -32,6 +32,7 @@ obj-$(CONFIG_HAVE_MEMBLOCK) += memblock.o
>  obj-$(CONFIG_BOUNCE) += bounce.o
>  obj-$(CONFIG_SWAP) += page_io.o swap_state.o swapfile.o
>  obj-$(CONFIG_FRONTSWAP) += frontswap.o
> +obj-$(CONFIG_ZSWAP) += zswap.o
>  obj-$(CONFIG_HAS_DMA) += dmapool.o
>  obj-$(CONFIG_HUGETLBFS) += hugetlb.o
>  obj-$(CONFIG_NUMA) += mempolicy.o
> diff --git a/mm/zswap.c b/mm/zswap.c
> new file mode 100644
> index 000..b1070ca
> --- /dev/null
> +++ b/mm/zswap.c
> @@ -0,0 +1,952 @@
> +/*
> + * zswap.c - zswap driver file
> + *
> + * zswap is a backend for frontswap that takes pages that are in the
> + * process of being swapped out and attempts to compress them and store
> + * them in a RAM-based memory pool. This results in a significant I/O
> + * reduction on the real swap device and, in the case of a slow swap
> + * device, can also improve workload performance.
> + *
> + * Copyright (C) 2012 Seth Jennings
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; either version 2
> + * of the License, or (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + */
> +
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +
> +#include <linux/module.h>
> +#include <linux/cpu.h>
> +#include <linux/highmem.h>
> +#include <linux/slab.h>
> +#include <linux/spinlock.h>
> +#include <linux/types.h>
> +#include <linux/atomic.h>
> +#include <linux/frontswap.h>
> +#include <linux/rbtree.h>
> +#include <linux/swap.h>
> +#include <linux/crypto.h>
> +#include <linux/mempool.h>
> +#include <linux/zbud.h>
> +
> +#include <linux/mm_types.h>
> +#include <linux/page-flags.h>
> +#include <linux/swapops.h>
> +#include <linux/writeback.h>
> +#include <linux/pagemap.h>
> +
> +/*
> + * statistics
> + */
> +/* Number of memory pages used by the compressed pool */
> +static atomic_t zswap_pool_pages = ATOMIC_INIT(0);
> +/* The number of compressed pages currently stored in zswap */
> +static atomic_t zswap_stored_pages = ATOMIC_INIT(0);
> +
> +/*
> + * The statistics below are not protected from concurrent access for
> + * performance reasons so they may not be 100% accurate. However,
> + * they do provide useful information on roughly how many times a
> + * certain event is occurring.
> + */
> +static u64 zswap_pool_limit_hit;
> +static u64 zswap_written_back_pages;
> +static u64 zswap_reject_reclaim_fail;
> +static u64 zswap_reject_compress_poor;
> +static u64 zswap_reject_alloc_fail;
> +static u64 zswap_reject_kmemcache_fail;
> +static u64
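The reject counters above correspond to decision points in the store path; for instance, a page whose compressed form is not meaningfully smaller than a page is refused rather than stored. A minimal userspace sketch of that bookkeeping (the function name and the simple `compressed length >= page size` policy are illustrative assumptions, not zswap's exact logic):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define POOL_PAGE_SIZE 4096 /* stand-in for PAGE_SIZE; illustrative */

/* Counters mirroring the statistics above (sketch only). */
static unsigned long long reject_compress_poor;
static unsigned long long stored_pages;

/*
 * A page whose compressed form is no smaller than a page is not worth
 * keeping in the pool; count it as a poor-compression reject.
 */
static bool zswap_accept(size_t compressed_len)
{
	if (compressed_len >= POOL_PAGE_SIZE) {
		reject_compress_poor++;
		return false;
	}
	stored_pages++;
	return true;
}
```

In the real driver the cutoff is effectively enforced by the allocator as well, since storing an incompressible page can only waste pool space.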
RE: [PATCHv11 2/4] zbud: add to mm/
> From: Seth Jennings [mailto:sjenn...@linux.vnet.ibm.com]
> Sent: Monday, May 13, 2013 6:40 AM
> Subject: [PATCHv11 2/4] zbud: add to mm/

One comment about a questionable algorithm change (vs my original zbud code) below... I'll leave the detailed code review to others.

Dan

> zbud is a special-purpose allocator for storing compressed pages. It is
> designed to store up to two compressed pages per physical page. While this
> design limits storage density, it has simple and deterministic reclaim
> properties that make it preferable to a higher density approach when reclaim
> will be used.
>
> zbud works by storing compressed pages, or "zpages", together in pairs in a
> single memory page called a "zbud page". The first buddy is "left
> justified" at the beginning of the zbud page, and the last buddy is "right
> justified" at the end of the zbud page. The benefit is that if either
> buddy is freed, the freed buddy space, coalesced with whatever slack space
> existed between the buddies, results in the largest possible free region
> within the zbud page.
>
> zbud also provides an attractive lower bound on density. The ratio of zpages
> to zbud pages cannot be less than 1. This ensures that zbud can never "do
> harm" by using more pages to store zpages than the uncompressed zpages would
> have used on their own.
>
> This patch adds zbud to mm/ for later use by zswap.
> > Signed-off-by: Seth Jennings > --- > include/linux/zbud.h | 22 ++ > mm/Kconfig | 10 + > mm/Makefile |1 + > mm/zbud.c| 564 > ++ > 4 files changed, 597 insertions(+) > create mode 100644 include/linux/zbud.h > create mode 100644 mm/zbud.c > > diff --git a/include/linux/zbud.h b/include/linux/zbud.h > new file mode 100644 > index 000..954252b > --- /dev/null > +++ b/include/linux/zbud.h > @@ -0,0 +1,22 @@ > +#ifndef _ZBUD_H_ > +#define _ZBUD_H_ > + > +#include > + > +struct zbud_pool; > + > +struct zbud_ops { > + int (*evict)(struct zbud_pool *pool, unsigned long handle); > +}; > + > +struct zbud_pool *zbud_create_pool(gfp_t gfp, struct zbud_ops *ops); > +void zbud_destroy_pool(struct zbud_pool *pool); > +int zbud_alloc(struct zbud_pool *pool, int size, gfp_t gfp, > + unsigned long *handle); > +void zbud_free(struct zbud_pool *pool, unsigned long handle); > +int zbud_reclaim_page(struct zbud_pool *pool, unsigned int retries); > +void *zbud_map(struct zbud_pool *pool, unsigned long handle); > +void zbud_unmap(struct zbud_pool *pool, unsigned long handle); > +int zbud_get_pool_size(struct zbud_pool *pool); > + > +#endif /* _ZBUD_H_ */ > diff --git a/mm/Kconfig b/mm/Kconfig > index e742d06..908f41b 100644 > --- a/mm/Kconfig > +++ b/mm/Kconfig > @@ -477,3 +477,13 @@ config FRONTSWAP > and swap data is stored as normal on the matching swap device. > > If unsure, say Y to enable frontswap. > + > +config ZBUD > + tristate "Buddy allocator for compressed pages" > + default n > + help > + zbud is an special purpose allocator for storing compressed pages. > + It is designed to store up to two compressed pages per physical page. > + While this design limits storage density, it has simple and > + deterministic reclaim properties that make it preferable to a higher > + density approach when reclaim will be used. 
> diff --git a/mm/Makefile b/mm/Makefile > index 72c5acb..95f0197 100644 > --- a/mm/Makefile > +++ b/mm/Makefile > @@ -58,3 +58,4 @@ obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o > obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o > obj-$(CONFIG_CLEANCACHE) += cleancache.o > obj-$(CONFIG_MEMORY_ISOLATION) += page_isolation.o > +obj-$(CONFIG_ZBUD) += zbud.o > diff --git a/mm/zbud.c b/mm/zbud.c > new file mode 100644 > index 000..e5bd0e6 > --- /dev/null > +++ b/mm/zbud.c > @@ -0,0 +1,564 @@ > +/* > + * zbud.c - Buddy Allocator for Compressed Pages > + * > + * Copyright (C) 2013, Seth Jennings, IBM > + * > + * Concepts based on zcache internal zbud allocator by Dan Magenheimer. > + * > + * zbud is an special purpose allocator for storing compressed pages. It is > + * designed to store up to two compressed pages per physical page. While > this > + * design limits storage density, it has simple and deterministic reclaim > + * properties that make it preferable to a higher density approach when > reclaim > + * will be used. > + * > + * zbud works by storing compressed pages, or "zpages", together in pairs in > a > + * single memory page called a "zbud
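The left/right-justified layout described in the commit message reduces to simple offset arithmetic. A userspace sketch (the 4096-byte page size is an illustrative assumption, and the real zbud also rounds sizes up to chunk boundaries, which this ignores):

```c
#include <assert.h>
#include <stddef.h>

#define POOL_PAGE_SIZE 4096 /* stand-in for PAGE_SIZE; illustrative */

/* The right buddy is placed so that it ends exactly at the page end. */
static size_t right_buddy_offset(size_t right_size)
{
	return POOL_PAGE_SIZE - right_size;
}

/* Slack left between the two buddies while both are allocated. */
static size_t slack_between(size_t left_size, size_t right_size)
{
	return POOL_PAGE_SIZE - left_size - right_size;
}

/*
 * Freeing the right buddy coalesces its space with the slack, leaving
 * the largest possible contiguous free region after the left buddy.
 */
static size_t free_after_right_freed(size_t left_size)
{
	return POOL_PAGE_SIZE - left_size;
}
```

Because a zbud page holds at most two zpages and a lone zpage never costs more than the single page it would have occupied uncompressed, the zpage-to-page ratio stays at or above 1, which is the "do no harm" bound the commit message refers to.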
RE: zsmalloc zbud hybrid design discussion?
> From: Seth Jennings [mailto:sjenn...@linux.vnet.ibm.com] > Subject: Re: zsmalloc zbud hybrid design discussion? > > On Thu, Apr 11, 2013 at 04:28:19PM -0700, Dan Magenheimer wrote: > > (Bob Liu added) > > > > > From: Seth Jennings [mailto:sjenn...@linux.vnet.ibm.com] > > > Subject: Re: zsmalloc zbud hybrid design discussion? > > > > > > On Wed, Mar 27, 2013 at 01:04:25PM -0700, Dan Magenheimer wrote: > > > > Seth and all zproject folks -- > > > > > > > > I've been giving some deep thought as to how a zpage > > > > allocator might be designed that would incorporate the > > > > best of both zsmalloc and zbud. > > > > > > > > Rather than dive into coding, it occurs to me that the > > > > best chance of success would be if all interested parties > > > > could first discuss (on-list) and converge on a design > > > > that we can all agree on. If we achieve that, I don't > > > > care who writes the code and/or gets the credit or > > > > chooses the name. If we can't achieve consensus, at > > > > least it will be much clearer where our differences lie. > > > > > > > > Any thoughts? > > > > Hi Seth! > > > > > I'll put some thoughts, keeping in mind that I'm not throwing zsmalloc > > > under > > > the bus here. Just what I would do starting from scratch given all that > > > has > > > happened. > > > > Excellent. Good food for thought. I'll add some of my thinking > > too and we can talk more next week. > > > > BTW, I'm not throwing zsmalloc under the bus either. I'm OK with > > using zsmalloc as a "base" for an improved hybrid, and even calling > > the result "zsmalloc". I *am* however willing to throw the > > "generic" nature of zsmalloc away... I think the combined requirements > > of the zprojects are complex enough and the likelihood of zsmalloc > > being appropriate for future "users" is low enough, that we should > > accept that zsmalloc is highly tuned for zprojects and modify it > > as required. I.e. 
the API to zsmalloc need
> > not be exposed to and documented for the rest of the kernel.
> >
> > > Simplicity - the simpler the better
> >
> > Generally I agree. But only if the simplicity addresses the
> > whole problem. I'm specifically very concerned that we have
> > an allocator that works well across a wide variety of zsize distributions,
> > even if it adds complexity to the allocator.
> >
> > > High density - LZO best case is ~40 bytes. That's around 1/100th of a
> > > page. I'd say it should support up to at least 64 objects per page in
> > > the best case. (see Reclaim effectiveness before responding here)
> >
> > Hmmm... if you pre-check for zero pages, I would guess the percentage
> > of pages with zsize less than 64 is actually quite small. But 64 size
> > classes may be a good place to start as long as it doesn't overly
> > complicate or restrict other design points.
> >
> > > No slab - the slab approach limits LRU and swap slot locality within
> > > the pool pages. Also swap slots have a tendency to be freed in
> > > clusters. If we improve locality within each pool page, it is more
> > > likely that page will be freed sooner as the zpages it contains will
> > > likely be invalidated all together.
> >
> > "Pool page" =?= "pageframe used by zsmalloc"
>
> Yes.
>
> > Isn't it true that there is no correlation between whether a
> > page is in the same cluster and the zsize (and thus size class) of
> > the zpage? So every zpage may end up in a different pool page
> > and this theory wouldn't work. Or am I misunderstanding?
>
> I think so. I didn't say this outright and should have: I'm thinking along
> the lines of a first-fit type method. So you just stack zpages up in a page
> until the page is full, then allocate a new one. Searching for free slots
> would ideally be done in reverse LRU so that you put new zpages in the most
> recently allocated page that has room. I'm still thinking how to do that
> efficiently.

OK I see.
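The first-fit, reverse-LRU idea sketched in that paragraph — put each new zpage in the most recently allocated pool page that still has room, else grab a fresh page — can be prototyped in a few lines of userspace C. The fixed-size pool and byte-granular packing are simplifying assumptions for illustration:

```c
#include <assert.h>
#include <stddef.h>

#define POOL_PAGE_SIZE 4096 /* stand-in for PAGE_SIZE; illustrative */
#define MAX_PAGES 8

/* Bytes used in each pool page; index npages-1 is the newest page. */
static size_t used[MAX_PAGES];
static int npages;

/*
 * First-fit sketch: scan pool pages newest-first ("reverse LRU") and
 * place the zpage in the most recently allocated page with room,
 * falling back to a fresh page.  Returns the pool page index, or -1
 * when the pool is exhausted.
 */
static int alloc_zpage(size_t size)
{
	for (int i = npages - 1; i >= 0; i--) {
		if (used[i] + size <= POOL_PAGE_SIZE) {
			used[i] += size;
			return i;
		}
	}
	if (npages == MAX_PAGES)
		return -1;
	used[npages] = size;
	return npages++;
}
```

The open question raised above — finding a free slot efficiently without a linear scan — is exactly what the loop here glosses over.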
You probably know that the xvmalloc allocator did something like that. I didn't study that code much but Nitin thought zsmalloc was much superior to xvmalloc. > > > Also, take a note out of the zbud playbook at track LRU based on pool > > > pages, > > > not zpages. One would fill allocation requests from
> > the most recently used pool page. Yes, I'm also thinking that should be
> > in any hybrid solution. A "global LRU queue" (like in zbud) could also be
> > applicable to entire zspages; this is similar to pageframe-reclaim except
> > all the pageframes in a zspage would be claimed at the same time.
>
> This brings up another thing that I left out that might be the stickiest
> part: eviction and reclaim. We first have to figure out if eviction is going
> to be initiated by the user or by the allocator. If we do it in the
> allocator, then I think we are going to muck up the API, because you'll have
> to register an eviction notification function that the allocator can call,
> once for each zpage in the page frame the allocator is trying to
> reclaim/free. The locking might get hairy in that case (user - allocator -
> user). Additionally, the user would have to maintain a different lookup
> system for zpages by address/handle. Alternatively, you
RE: zsmalloc zbud hybrid design discussion?
(Bob Liu added) > From: Seth Jennings [mailto:sjenn...@linux.vnet.ibm.com] > Subject: Re: zsmalloc zbud hybrid design discussion? > > On Wed, Mar 27, 2013 at 01:04:25PM -0700, Dan Magenheimer wrote: > > Seth and all zproject folks -- > > > > I've been giving some deep thought as to how a zpage > > allocator might be designed that would incorporate the > > best of both zsmalloc and zbud. > > > > Rather than dive into coding, it occurs to me that the > > best chance of success would be if all interested parties > > could first discuss (on-list) and converge on a design > > that we can all agree on. If we achieve that, I don't > > care who writes the code and/or gets the credit or > > chooses the name. If we can't achieve consensus, at > > least it will be much clearer where our differences lie. > > > > Any thoughts? Hi Seth! > I'll put some thoughts, keeping in mind that I'm not throwing zsmalloc under > the bus here. Just what I would do starting from scratch given all that has > happened. Excellent. Good food for thought. I'll add some of my thinking too and we can talk more next week. BTW, I'm not throwing zsmalloc under the bus either. I'm OK with using zsmalloc as a "base" for an improved hybrid, and even calling the result "zsmalloc". I *am* however willing to throw the "generic" nature of zsmalloc away... I think the combined requirements of the zprojects are complex enough and the likelihood of zsmalloc being appropriate for future "users" is low enough, that we should accept that zsmalloc is highly tuned for zprojects and modify it as required. I.e. the API to zsmalloc need not be exposed to and documented for the rest of the kernel. > Simplicity - the simpler the better Generally I agree. But only if the simplicity addresses the whole problem. I'm specifically very concerned that we have an allocator that works well across a wide variety of zsize distributions, even if it adds complexity to the allocator. > High density - LZO best case is ~40 bytes. 
That's around 1/100th of a page.
> I'd say it should support up to at least 64 objects per page in the best
> case. (see Reclaim effectiveness before responding here)

Hmmm... if you pre-check for zero pages, I would guess the percentage of pages with zsize less than 64 is actually quite small. But 64 size classes may be a good place to start as long as it doesn't overly complicate or restrict other design points.

> No slab - the slab approach limits LRU and swap slot locality within the
> pool pages. Also swap slots have a tendency to be freed in clusters. If we
> improve locality within each pool page, it is more likely that page will be
> freed sooner as the zpages it contains will likely be invalidated all
> together.

"Pool page" =?= "pageframe used by zsmalloc"

Isn't it true that there is no correlation between whether a page is in the same cluster and the zsize (and thus size class) of the zpage? So every zpage may end up in a different pool page and this theory wouldn't work. Or am I misunderstanding?

> Also, take a note out of the zbud playbook and track LRU based on pool
> pages, not zpages. One would fill allocation requests from the most recently
> used pool page.

Yes, I'm also thinking that should be in any hybrid solution. A "global LRU queue" (like in zbud) could also be applicable to entire zspages; this is similar to pageframe-reclaim except all the pageframes in a zspage would be claimed at the same time.

> Reclaim effectiveness - conflicts with density. As the number of zpages per
> page increases, the odds decrease that all of those objects will be
> invalidated, which is necessary to free up the underlying page, since moving
> objects out of sparsely used pages would involve compaction (see next). One
> solution is to lower the density, but I think that is self-defeating, as we
> lose much of the compression benefit through fragmentation.
I think the better solution > is to improve the likelihood that the zpages in the page are likely to be > freed > together through increased locality. I do think we should seriously reconsider ZS_MAX_ZSPAGE_ORDER==2. The value vs ZS_MAX_ZSPAGE_ORDER==0 is enough for most cases and 1 is enough for the rest. If get_pages_per_zspage were "flexible", there might be a better tradeoff of density vs reclaim effectiveness. I've some ideas along the lines of a hybrid adaptively combining buddying and slab which might make it rarely necessary to have pages_per_zspage exceed 2. That also might make it much easier to have "variable sized" zspages (size is always one or two). > Not a requirement: > > Compaction - compaction would basically involve creating a virtual address > space
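For a sense of scale on the density point above: 4096 / 40 ≈ 102, which is where the "around 1/100th of a page" figure comes from. A trivial sketch of that upper bound (the 4096-byte page size is an illustrative assumption, and per-object metadata is ignored, which favors density):

```c
#include <assert.h>
#include <stddef.h>

#define POOL_PAGE_SIZE 4096 /* stand-in for PAGE_SIZE; illustrative */

/* Upper bound on how many zpages of a given size fit in one pool page. */
static size_t max_objects_per_page(size_t zsize)
{
	return POOL_PAGE_SIZE / zsize;
}
```

So "at least 64 objects per page in the best case" sits comfortably below the theoretical ~102 for 40-byte LZO output.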
RE: zsmalloc defrag (Was: [PATCH] mm: remove compressed copy from zram in-memory)
> From: Minchan Kim [mailto:minc...@kernel.org] > Subject: Re: zsmalloc defrag (Was: [PATCH] mm: remove compressed copy from > zram in-memory) > > Hi Seth, > > On Tue, Apr 09, 2013 at 03:52:36PM -0500, Seth Jennings wrote: > > On 04/08/2013 08:36 PM, Minchan Kim wrote: > > > On Tue, Apr 09, 2013 at 10:27:19AM +0900, Minchan Kim wrote: > > >> Hi Dan, > > >> > > >> On Mon, Apr 08, 2013 at 09:32:38AM -0700, Dan Magenheimer wrote: > > >>>> From: Minchan Kim [mailto:minc...@kernel.org] > > >>>> Sent: Monday, April 08, 2013 12:01 AM > > >>>> Subject: [PATCH] mm: remove compressed copy from zram in-memory > > >>> > > >>> (patch removed) > > >>> > > >>>> Fragment ratio is almost same but memory consumption and compile time > > >>>> is better. I am working to add defragment function of zsmalloc. > > >>> > > >>> Hi Minchan -- > > >>> > > >>> I would be very interested in your design thoughts on > > >>> how you plan to add defragmentation for zsmalloc. In > > >> > > >> What I can say now about is only just a word "Compaction". > > >> As you know, zsmalloc has a transparent handle so we can do whatever > > >> under user. Of course, there is a tradeoff between performance > > >> and memory efficiency. I'm biased to latter for embedded usecase. > > >> > > >> And I might post it because as you know well, zsmalloc > > > > > > Incomplete sentense, > > > > > > I might not post it until promoting zsmalloc because as you know well, > > > zsmalloc/zram's all new stuffs are blocked into staging tree. > > > Even if we could add it into staging, as you know well, staging is where > > > every mm guys ignore so we end up needing another round to promote it. > > > sigh. > > > > Yes. The lack of compaction/defragmentation support in zsmalloc has not > > been raised as an obstacle to mainline acceptance so I think we should > > wait to add new features to a yet-to-be accepted codebase. 
> > > > Also, I think this feature is more important to zram than it is to > > zswap/zcache as they can do writeback to free zpages. In other words, > > the fragmentation is a transient issue for zswap/zcache since writeback > > to the swap device is possible. > > Other benefit derived from compaction work is that we can pick a zpage > from zspage and move it into somewhere. It means core mm could control > pages in zsmalloc freely. I'm not sure I understand which is why I'd like to learn more about your proposed design. Are you suggesting that core mm would periodically call zsmalloc-compaction and see what pages get freed? I'm hoping for more control than that. More good discussion for next week! Dan -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: zsmalloc defrag (Was: [PATCH] mm: remove compressed copy from zram in-memory)
> From: Minchan Kim [mailto:minc...@kernel.org] > Subject: Re: zsmalloc defrag (Was: [PATCH] mm: remove compressed copy from > zram in-memory) > > On Tue, Apr 09, 2013 at 01:37:47PM -0700, Dan Magenheimer wrote: > > > From: Minchan Kim [mailto:minc...@kernel.org] > > > Subject: Re: zsmalloc defrag (Was: [PATCH] mm: remove compressed copy > > > from zram in-memory) > > > > > > On Tue, Apr 09, 2013 at 10:27:19AM +0900, Minchan Kim wrote: > > > > Hi Dan, > > > > > > > > On Mon, Apr 08, 2013 at 09:32:38AM -0700, Dan Magenheimer wrote: > > > > > > From: Minchan Kim [mailto:minc...@kernel.org] > > > > > > Sent: Monday, April 08, 2013 12:01 AM > > > > > > Subject: [PATCH] mm: remove compressed copy from zram in-memory > > > > > > > > > > (patch removed) > > > > > > > > > > > Fragment ratio is almost same but memory consumption and compile > > > > > > time > > > > > > is better. I am working to add defragment function of zsmalloc. > > > > > > > > > > Hi Minchan -- > > > > > > > > > > I would be very interested in your design thoughts on > > > > > how you plan to add defragmentation for zsmalloc. In > > > > > > > > What I can say now about is only just a word "Compaction". > > > > As you know, zsmalloc has a transparent handle so we can do whatever > > > > under user. Of course, there is a tradeoff between performance > > > > and memory efficiency. I'm biased to latter for embedded usecase. > > > > > > > > And I might post it because as you know well, zsmalloc > > > > > > Incomplete sentense, > > > > > > I might not post it until promoting zsmalloc because as you know well, > > > zsmalloc/zram's all new stuffs are blocked into staging tree. > > > Even if we could add it into staging, as you know well, staging is where > > > every mm guys ignore so we end up needing another round to promote it. > > > sigh. > > > > > > I hope it gets better after LSF/MM. > > > > If zsmalloc is moving in the direction of supporting only zram, > > why should it be promoted into mm, or even lib? 
Why not promote > > zram into drivers and put zsmalloc.c in the same directory? > > I don't want to make zsmalloc zram specific and will do best effort > to generalize it to all z* family. I'm glad to hear that. You may not know/remember that the split between "old zcache" and "new zcache" (and the fork to zswap) was started because some people refused to accept changes to zsmalloc to support a broader set of requirements. > If it is hard to reach out > agreement, yes, forking could be an easy solution like other embedded > product company but I don't want it. I don't want it either, so I think it is wise for us all to understand each others' objectives to see if we can avoid a fork. Or if the objectives are too different, then we have data to explain to other kernel developers why a fork is necessary. Thanks! Dan -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: zsmalloc defrag (Was: [PATCH] mm: remove compressed copy from zram in-memory)
> From: Minchan Kim [mailto:minc...@kernel.org] > Subject: Re: zsmalloc defrag (Was: [PATCH] mm: remove compressed copy from > zram in-memory) > > On Tue, Apr 09, 2013 at 01:25:45PM -0700, Dan Magenheimer wrote: > > > From: Minchan Kim [mailto:minc...@kernel.org] > > > Subject: Re: zsmalloc defrag (Was: [PATCH] mm: remove compressed copy > > > from zram in-memory) > > > > > > Hi Dan, > > > > > > On Mon, Apr 08, 2013 at 09:32:38AM -0700, Dan Magenheimer wrote: > > > > > From: Minchan Kim [mailto:minc...@kernel.org] > > > > > Sent: Monday, April 08, 2013 12:01 AM > > > > > Subject: [PATCH] mm: remove compressed copy from zram in-memory > > > > > > > > (patch removed) > > > > > > > > > Fragment ratio is almost same but memory consumption and compile time > > > > > is better. I am working to add defragment function of zsmalloc. > > > > > > > > Hi Minchan -- > > > > > > > > I would be very interested in your design thoughts on > > > > how you plan to add defragmentation for zsmalloc. In > > > > > > What I can say now about is only just a word "Compaction". > > > As you know, zsmalloc has a transparent handle so we can do whatever > > > under user. Of course, there is a tradeoff between performance > > > and memory efficiency. I'm biased to latter for embedded usecase. > > > > Have you designed or implemented this yet? I have a couple > > of concerns: > > Not yet implemented but just had a time to think about it, simply. > So surely, there are some obstacle so I want to uncase the code and > number after I make a prototype/test the performance. > Of course, if it has a severe problem, will drop it without wasting > many guys's time. OK. I have some ideas that may be similar or may be very different than yours. Likely different, since I am coming at it from the angle of zcache which has some different requirements. So I'm hoping that by discussing design we can incorporate some of the zcache requirements before coding. 
> > 1) The handle is transparent to the "user", but it is still a form > >of a "pointer" to a zpage. Are you planning on walking zram's > >tables and changing those pointers? That may be OK for zram > >but for more complex data structures than tables (as in zswap > >and zcache) it may not be as easy, due to races, or as efficient > >because you will have to walk potentially very large trees. > > Rough concept is following as. > > I'm considering for zsmalloc to return transparent fake handle > but we have to maintain it with real one. > It could be done in zsmalloc internal so there isn't any race we should > consider. That sounds very difficult because I think you will need an extra level of indirection to translate every fake handle to every real handle/pointer (like virtual-to-physical page tables). Or do you have some more clever idea? > > 2) Compaction in the kernel is heavily dependent on page migration > >and page migration is dependent on using flags in the struct page. > >There's a lot of code in those two code modules and there > >are going to be a lot of implementation differences between > >compacting pages vs compacting zpages. > > Compaction of kernel is never related to zsmalloc's one. OK. Compaction has certain meaning in the kernel. Defrag is usually used I think for what we are discussing here. So I thought you might be planning on doing exactly what the kernel does that it calls compaction. > > I'm also wondering if you will be implementing "variable length > > zspages". Without that, I'm not sure compaction will help > > enough. (And that is a good example of the difference between > > Why do you think so? > variable lengh zspage could be further step to improve but it's not > only a solution to solve fragmentation. In my partial-design-in-my-head, they are related, but I think I understand what you mean. You are planning to move zpages across zspage boundaries, and I am not. 
So I think your solution will result in better density but may be harder to implement. > > > > particular, I am wondering if your design will also > > > > handle the requirements for zcache (especially for > > > > cleancache pages) and perhaps also for ramster. > > > > > > I don't know requirements for cleancache pages but compaction is > > > general as you know well so I expect you can get a benefit from it > > > if you are concern on memory efficiency but not sure it's valuable > >
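The "extra level of indirection" discussed above can be sketched as a small handle table: the user-visible ("fake") handle is just an index, and compaction updates only the table entry when it moves an object. Everything here (the names, the fixed-size table, the two-field location) is hypothetical illustration, not zsmalloc's actual implementation:

```c
#include <assert.h>

/* Hypothetical handle table: the "fake" handle handed to zram/zswap/
 * zcache is just an index; the entry holds the real location. */
#define MAX_HANDLES 64

struct zloc {
    int page;     /* which pool page the compressed object lives in */
    int offset;   /* byte offset within that page */
};

struct zloc handle_tab[MAX_HANDLES];
int nr_handles;

/* Allocation returns an opaque handle, never a pointer. */
int fake_alloc(int page, int offset)
{
    handle_tab[nr_handles].page = page;
    handle_tab[nr_handles].offset = offset;
    return nr_handles++;
}

/* Every access translates handle -> real location through the table. */
struct zloc *fake_map(int handle)
{
    return &handle_tab[handle];
}

/* The compactor moves an object and fixes up only the table entry;
 * handles held by users stay valid across the move, so no user data
 * structures need to be walked. */
void compact_move(int handle, int new_page, int new_offset)
{
    handle_tab[handle].page = new_page;
    handle_tab[handle].offset = new_offset;
}
```

Dan's cost concern is also visible here: every map goes through one more memory reference, much like a virtual-to-physical page-table walk.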
RE: zsmalloc defrag (Was: [PATCH] mm: remove compressed copy from zram in-memory)
> From: Seth Jennings [mailto:sjenn...@linux.vnet.ibm.com] > Subject: Re: zsmalloc defrag (Was: [PATCH] mm: remove compressed copy from > zram in-memory) > > On 04/08/2013 08:36 PM, Minchan Kim wrote: > > On Tue, Apr 09, 2013 at 10:27:19AM +0900, Minchan Kim wrote: > >> Hi Dan, > >> > >> On Mon, Apr 08, 2013 at 09:32:38AM -0700, Dan Magenheimer wrote: > >>>> From: Minchan Kim [mailto:minc...@kernel.org] > >>>> Sent: Monday, April 08, 2013 12:01 AM > >>>> Subject: [PATCH] mm: remove compressed copy from zram in-memory > >>> > >>> (patch removed) > >>> > >>>> Fragment ratio is almost same but memory consumption and compile time > >>>> is better. I am working to add defragment function of zsmalloc. > >>> > >>> Hi Minchan -- > >>> > >>> I would be very interested in your design thoughts on > >>> how you plan to add defragmentation for zsmalloc. In > >> > >> What I can say now about is only just a word "Compaction". > >> As you know, zsmalloc has a transparent handle so we can do whatever > >> under user. Of course, there is a tradeoff between performance > >> and memory efficiency. I'm biased to latter for embedded usecase. > >> > >> And I might post it because as you know well, zsmalloc > > > > Incomplete sentense, > > > > I might not post it until promoting zsmalloc because as you know well, > > zsmalloc/zram's all new stuffs are blocked into staging tree. > > Even if we could add it into staging, as you know well, staging is where > > every mm guys ignore so we end up needing another round to promote it. sigh. > > Yes. The lack of compaction/defragmentation support in zsmalloc has not > been raised as an obstacle to mainline acceptance so I think we should > wait to add new features to a yet-to-be accepted codebase. Um, I explicitly raised as an obstacle the greatly reduced density for zsmalloc on active workloads and on zsize distributions that skew fat. 
Understanding that more deeply and hopefully fixing it is an issue, and compaction/defragmentation is a step in that direction. > Also, I think this feature is more important to zram than it is to > zswap/zcache as they can do writeback to free zpages. In other words, > the fragmentation is a transient issue for zswap/zcache since writeback > to the swap device is possible. Actually, I think I demonstrated that the zpage-based writeback in zswap makes fragmentation worse. Zcache doesn't use zsmalloc in part because it doesn't support pageframe writeback. If zsmalloc can fix this (and it may be easier to fix depending on the design and implementation of compaction/defrag, which is why I'm asking lots of questions), zcache may be able to make use of zsmalloc. Lots of good discussion fodder for next week! Dan
RE: [PATCH 00/10] staging: zcache/ramster: fix and ramster/debugfs improvement
> From: Wanpeng Li [mailto:liw...@linux.vnet.ibm.com] > Sent: Tuesday, April 09, 2013 6:26 PM > To: Greg Kroah-Hartman > Cc: Dan Magenheimer; Seth Jennings; Konrad Rzeszutek Wilk; Minchan Kim; > linux...@kvack.org; linux- > ker...@vger.kernel.org; Andrew Morton; Bob Liu; Wanpeng Li > Subject: [PATCH 00/10] staging: zcache/ramster: fix and ramster/debugfs > improvement > > Fix bugs in zcache and rips out the debug counters out of ramster.c and > sticks them in a debug.c file. Introduce accessory functions for counters > increase/decrease, they are available when config RAMSTER_DEBUG, otherwise > they are empty non-debug functions. Using an array to initialize/use debugfs > attributes to make them neater. Dan Magenheimer confirm these works > are needed. http://marc.info/?l=linux-mm&m=136535713106882&w=2 > > Patch 1~2 fix bugs in zcache > > Patch 3~8 rips out the debug counters out of ramster.c and sticks them > in a debug.c file > > Patch 9 fix coding style issue introduced in zcache2 cleanups > (s/int/bool + debugfs movement) patchset > > Patch 10 add how-to for ramster Note my preference to not apply patch 2of10 (which GregKH may choose to override), but for all, please add my: Acked-by: Dan Magenheimer <dan.magenhei...@oracle.com>
RE: [PATCH 02/10] staging: zcache: remove zcache_freeze
> From: Wanpeng Li [mailto:liw...@linux.vnet.ibm.com] > Subject: [PATCH 02/10] staging: zcache: remove zcache_freeze > > The default value of zcache_freeze is false and it won't be modified by > other codes. Remove zcache_freeze since no routine can disable zcache > during system running. > > Signed-off-by: Wanpeng Li <liw...@linux.vnet.ibm.com> I'd prefer to leave this code in place as it may be very useful if/when zcache becomes more tightly integrated into the MM subsystem and the rest of the kernel. And the subtleties for temporarily disabling zcache (which is what zcache_freeze does) are non-obvious and may cause data loss so if someone wants to add this functionality back in later and doesn't have this piece of code, it may take a lot of pain to get it working. Usage example: All CPUs are fully saturated so it is questionable whether spending CPU cycles for compression is wise. Kernel could disable zcache using zcache_freeze. (Yes, a new entry point would need to be added to enable/disable zcache_freeze.) My two cents... others are welcome to override. > --- > drivers/staging/zcache/zcache-main.c | 55 > +++--- > 1 file changed, 18 insertions(+), 37 deletions(-) > > diff --git a/drivers/staging/zcache/zcache-main.c > b/drivers/staging/zcache/zcache-main.c > index e23d814..fe6801a 100644 > --- a/drivers/staging/zcache/zcache-main.c > +++ b/drivers/staging/zcache/zcache-main.c > @@ -1118,15 +1118,6 @@ free_and_out: > #endif /* CONFIG_ZCACHE_WRITEBACK */ > > /* > - * When zcache is disabled ("frozen"), pools can be created and destroyed, > - * but all puts (and thus all other operations that require memory allocation) > - * must fail. If zcache is unfrozen, accepts puts, then frozen again, > - * data consistency requires all puts while frozen to be converted into > - * flushes. 
> - */ > -static bool zcache_freeze; > - > -/* > * This zcache shrinker interface reduces the number of ephemeral pageframes > * used by zcache to approximately the same as the total number of LRU_FILE > * pageframes in use, and now also reduces the number of persistent pageframes > @@ -1221,44 +1212,34 @@ int zcache_put_page(int cli_id, int pool_id, struct > tmem_oid *oidp, > { > struct tmem_pool *pool; > struct tmem_handle th; > - int ret = -1; > + int ret = 0; > void *pampd = NULL; > > BUG_ON(!irqs_disabled()); > pool = zcache_get_pool_by_id(cli_id, pool_id); > if (unlikely(pool == NULL)) > goto out; > - if (!zcache_freeze) { > - ret = 0; > - th.client_id = cli_id; > - th.pool_id = pool_id; > - th.oid = *oidp; > - th.index = index; > - pampd = zcache_pampd_create((char *)page, size, raw, > - ephemeral, &th); > - if (pampd == NULL) { > - ret = -ENOMEM; > - if (ephemeral) > - inc_zcache_failed_eph_puts(); > - else > - inc_zcache_failed_pers_puts(); > - } else { > - if (ramster_enabled) > - ramster_do_preload_flnode(pool); > - ret = tmem_put(pool, oidp, index, 0, pampd); > - if (ret < 0) > - BUG(); > - } > - zcache_put_pool(pool); > + > + th.client_id = cli_id; > + th.pool_id = pool_id; > + th.oid = *oidp; > + th.index = index; > + pampd = zcache_pampd_create((char *)page, size, raw, > + ephemeral, &th); > + if (pampd == NULL) { > + ret = -ENOMEM; > + if (ephemeral) > + inc_zcache_failed_eph_puts(); > + else > + inc_zcache_failed_pers_puts(); > } else { > - inc_zcache_put_to_flush(); > if (ramster_enabled) > ramster_do_preload_flnode(pool); > - if (atomic_read(&pool->obj_count) > 0) > - /* the put fails whether the flush succeeds or not */ > - (void)tmem_flush_page(pool, oidp, index); > - zcache_put_pool(pool); > + ret = tmem_put(pool, oidp, index, 0, pampd); > + if (ret < 0) > + BUG(); > } > + zcache_put_pool(pool); > out: > return ret; > } > -- > 1.7.10.4
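The semantics of the branch the patch deletes boil down to one invariant: while frozen, a put must both fail and flush any existing copy, so a later get can never return stale data. A toy model of that invariant (the names and the single-slot "pool" standing in for tmem are invented for illustration; this is not the zcache code):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy single-slot "pool" standing in for tmem; illustration only. */
bool frozen;        /* plays the role of zcache_freeze */
bool slot_valid;
int slot_data;

void set_frozen(bool f)
{
    frozen = f;
}

/* Put: store when unfrozen. While frozen, the put fails AND the old
 * copy is flushed; otherwise a later get could return stale data. */
int toy_put(int data)
{
    if (frozen) {
        slot_valid = false;   /* put-while-frozen becomes a flush */
        return -1;
    }
    slot_data = data;
    slot_valid = true;
    return 0;
}

/* Get: succeeds only if a valid copy is present. */
int toy_get(int *out)
{
    if (!slot_valid)
        return -1;
    *out = slot_data;
    return 0;
}
```

The non-obvious subtlety Dan mentions is exactly the flush: if a frozen put merely failed without invalidating the slot, a later get would hand back the earlier, now-stale copy.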
RE: zsmalloc defrag (Was: [PATCH] mm: remove compressed copy from zram in-memory)
> From: Minchan Kim [mailto:minc...@kernel.org] > Subject: Re: zsmalloc defrag (Was: [PATCH] mm: remove compressed copy from > zram in-memory) > > > > I don't know requirements for cleancache pages but compaction is > > > general as you know well so I expect you can get a benefit from it > > > if you are concern on memory efficiency but not sure it's valuable > > > to compact cleancache pages for getting more slot in RAM. > > > Sometime, just discarding would be much better, IMHO. > > > > Zcache has page reclaim. Zswap has zpage reclaim. I am concerned that > > these continue to work in the presence of compaction. With no reclaim at > > all, zram is a simpler use case but if you implement compaction in a way > > that can't be used by either zcache or zswap, then zsmalloc is > > essentially forking. > > Don't go too far. If it's really problem for zswap and zcache, maybe, > we could add it optionally. Good, I think it should be possible to do it optionally too. In https://lkml.org/lkml/2013/3/27/501 I suggested it would be good to work together on a common design, but you didn't reply. Are you thinking
RE: zsmalloc defrag (Was: [PATCH] mm: remove compressed copy from zram in-memory)
> From: Minchan Kim [mailto:minc...@kernel.org] > Subject: Re: zsmalloc defrag (Was: [PATCH] mm: remove compressed copy from > zram in-memory) > > Hi Seth, > > On Tue, Apr 09, 2013 at 03:52:36PM -0500, Seth Jennings wrote: > > Also, I think this feature is more important to zram than it is to > > zswap/zcache as they can do writeback to free zpages. In other words, > > the fragmentation is a transient issue for zswap/zcache since writeback > > to the swap device is possible. > > Other benefit derived from compaction work is that we can pick a zpage > from zspage and move it into somewhere. It means core mm could control > pages in zsmalloc freely. I'm not sure I understand which is why I'd like to learn more about your proposed design. Are you suggesting that core mm would periodically call zsmalloc-compaction and see what pages get freed? I'm hoping for more control than that. More good discussion for next week! Dan
RE: zsmalloc zbud hybrid design discussion?
(Bob Liu added)
> From: Seth Jennings [mailto:sjenn...@linux.vnet.ibm.com] > Subject: Re: zsmalloc zbud hybrid design discussion? > On Wed, Mar 27, 2013 at 01:04:25PM -0700, Dan Magenheimer wrote: > > Seth and all zproject folks -- I've been giving some deep thought as to how a zpage allocator might be designed that would incorporate the best of both zsmalloc and zbud. Rather than dive into coding, it occurs to me that the best chance of success would be if all interested parties could first discuss (on-list) and converge on a design that we can all agree on. If we achieve that, I don't care who writes the code and/or gets the credit or chooses the name. If we can't achieve consensus, at least it will be much clearer where our differences lie. Any thoughts?
Hi Seth!
> I'll put some thoughts, keeping in mind that I'm not throwing zsmalloc under the bus here. Just what I would do starting from scratch given all that has happened.
Excellent. Good food for thought. I'll add some of my thinking too and we can talk more next week. BTW, I'm not throwing zsmalloc under the bus either. I'm OK with using zsmalloc as a base for an improved hybrid, and even calling the result zsmalloc. I *am* however willing to throw the generic nature of zsmalloc away... I think the combined requirements of the zprojects are complex enough, and the likelihood of zsmalloc being appropriate for future users is low enough, that we should accept that zsmalloc is highly tuned for zprojects and modify it as required. I.e. the API to zsmalloc need not be exposed to and documented for the rest of the kernel.
> Simplicity - the simpler the better
Generally I agree. But only if the simplicity addresses the whole problem. I'm specifically very concerned that we have an allocator that works well across a wide variety of zsize distributions, even if it adds complexity to the allocator.
> High density - LZO best case is ~40 bytes. That's around 1/100th of a page. I'd say it should support up to at least 64 objects per page in the best case. (see Reclaim effectiveness before responding here)
Hmmm... if you pre-check for zero pages, I would guess the percentage of pages with zsize less than 64 is actually quite small. But 64 size classes may be a good place to start as long as it doesn't overly complicate or restrict other design points.
> No slab - the slab approach limits LRU and swap slot locality within the pool pages. Also swap slots have a tendency to be freed in clusters. If we improve locality within each pool page, it is more likely that page will be freed sooner as the zpages it contains will likely be invalidated all together.
Pool page =?= pageframe used by zsmalloc. Isn't it true that there is no correlation between whether a page is in the same cluster and the zsize (and thus size class) of the zpage? So every zpage may end up in a different pool page and this theory wouldn't work. Or am I misunderstanding?
> Also, take a note out of the zbud playbook and track LRU based on pool pages, not zpages. One would fill allocation requests from the most recently used pool page.
Yes, I'm also thinking that should be in any hybrid solution. A global LRU queue (like in zbud) could also be applicable to entire zspages; this is similar to pageframe-reclaim except all the pageframes in a zspage would be reclaimed at the same time.
> Reclaim effectiveness - conflicts with density. As the number of zpages per page increases, the odds decrease that all of those objects will be invalidated, which is necessary to free up the underlying page, since moving objects out of sparsely used pages would involve compaction (see next). One solution is to lower the density, but I think that is self-defeating as we lose much of the compression benefit through fragmentation. I think the better solution is to improve the likelihood that the zpages in the page are likely to be freed together through increased locality.
I do think we should seriously reconsider ZS_MAX_ZSPAGE_ORDER==2. The value ZS_MAX_ZSPAGE_ORDER==0 is enough for most cases and 1 is enough for the rest. If get_pages_per_zspage were flexible, there might be a better tradeoff of density vs reclaim effectiveness. I've some ideas along the lines of a hybrid adaptively combining buddying and slab which might make it rarely necessary to have pages_per_zspage exceed 2. That also might make it much easier to have variable sized zspages (size is always one or two).
> Not a requirement: Compaction - compaction would basically involve creating a virtual address space of sorts, which zsmalloc is capable of through its API with handles, not pointers. However, as Dan points out this requires a structure to maintain the mappings and adds to complexity. Additionally, the need for compaction diminishes as the allocations are short-lived with frontswap backends doing writeback and cleancache backends shrinking.
I have an idea
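The density-vs-zspage-size tradeoff around get_pages_per_zspage can be sketched numerically. Below is a toy Python model of the idea: for a given object size class, pick the zspage size (in pages) that wastes the least space. The constants and the exact selection rule are illustrative approximations of what zsmalloc does, not the kernel code:

```python
PAGE_SIZE = 4096
MAX_PAGES_PER_ZSPAGE = 4  # ZS_MAX_ZSPAGE_ORDER == 2 allows up to 4 pages

def pages_per_zspage(class_size):
    """Pick the zspage size (in pages) with the least internal waste
    for this size class (hypothetical model of get_pages_per_zspage)."""
    best_pages, best_used_pct = 1, 0
    for pages in range(1, MAX_PAGES_PER_ZSPAGE + 1):
        zspage_bytes = pages * PAGE_SIZE
        waste = zspage_bytes % class_size          # leftover tail bytes
        used_pct = (zspage_bytes - waste) * 100 // zspage_bytes
        if used_pct > best_used_pct:
            best_pages, best_used_pct = pages, used_pct
    return best_pages

# Size classes that divide a page evenly need only 1 page; awkward
# classes (e.g. ~2/3 or ~4/5 of a page) want larger zspages.
for size in (64, 2720, 3264, 4096):
    print(size, "->", pages_per_zspage(size), "page(s) per zspage")
```

This also shows why "flexible" pages_per_zspage matters for reclaim: the classes that benefit most from multi-page zspages are exactly the ones where freeing the underlying pageframes requires all co-resident zpages to die together.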
RE: zsmalloc defrag (Was: [PATCH] mm: remove compressed copy from zram in-memory)
> From: Minchan Kim [mailto:minc...@kernel.org] > Subject: Re: zsmalloc defrag (Was: [PATCH] mm: remove compressed copy from > zram in-memory) > > On Tue, Apr 09, 2013 at 10:27:19AM +0900, Minchan Kim wrote: > > Hi Dan, > > > > On Mon, Apr 08, 2013 at 09:32:38AM -0700, Dan Magenheimer wrote: > > > > From: Minchan Kim [mailto:minc...@kernel.org] > > > > Sent: Monday, April 08, 2013 12:01 AM > > > > Subject: [PATCH] mm: remove compressed copy from zram in-memory > > > > > > (patch removed) > > > > > > > Fragment ratio is almost same but memory consumption and compile time > > > > is better. I am working to add defragment function of zsmalloc. > > > > > > Hi Minchan -- > > > > > > I would be very interested in your design thoughts on > > > how you plan to add defragmentation for zsmalloc. In > > > > What I can say now about is only just a word "Compaction". > > As you know, zsmalloc has a transparent handle so we can do whatever > > under user. Of course, there is a tradeoff between performance > > and memory efficiency. I'm biased to latter for embedded usecase. > > > > And I might post it because as you know well, zsmalloc > > Incomplete sentense, > > I might not post it until promoting zsmalloc because as you know well, > zsmalloc/zram's all new stuffs are blocked into staging tree. > Even if we could add it into staging, as you know well, staging is where > every mm guys ignore so we end up needing another round to promote it. sigh. > > I hope it gets better after LSF/MM. If zsmalloc is moving in the direction of supporting only zram, why should it be promoted into mm, or even lib? Why not promote zram into drivers and put zsmalloc.c in the same directory?
RE: zsmalloc defrag (Was: [PATCH] mm: remove compressed copy from zram in-memory)
> From: Minchan Kim [mailto:minc...@kernel.org] > Subject: Re: zsmalloc defrag (Was: [PATCH] mm: remove compressed copy from > zram in-memory) > > Hi Dan, > > On Mon, Apr 08, 2013 at 09:32:38AM -0700, Dan Magenheimer wrote: > > > From: Minchan Kim [mailto:minc...@kernel.org] > > > Sent: Monday, April 08, 2013 12:01 AM > > > Subject: [PATCH] mm: remove compressed copy from zram in-memory > > > > (patch removed) > > > > > Fragment ratio is almost same but memory consumption and compile time > > > is better. I am working to add defragment function of zsmalloc. > > > > Hi Minchan -- > > > > I would be very interested in your design thoughts on > > how you plan to add defragmentation for zsmalloc. In > > What I can say now about is only just a word "Compaction". > As you know, zsmalloc has a transparent handle so we can do whatever > under user. Of course, there is a tradeoff between performance > and memory efficiency. I'm biased to latter for embedded usecase. Have you designed or implemented this yet? I have a couple of concerns: 1) The handle is transparent to the "user", but it is still a form of a "pointer" to a zpage. Are you planning on walking zram's tables and changing those pointers? That may be OK for zram but for more complex data structures than tables (as in zswap and zcache) it may not be as easy, due to races, or as efficient because you will have to walk potentially very large trees. 2) Compaction in the kernel is heavily dependent on page migration and page migration is dependent on using flags in the struct page. There's a lot of code in those two code modules and there are going to be a lot of implementation differences between compacting pages vs compacting zpages. I'm also wondering if you will be implementing "variable length zspages". Without that, I'm not sure compaction will help enough. (And that is a good example of the difference between the kernel page compaction design/code and zspage compaction.) 
> > particular, I am wondering if your design will also > > handle the requirements for zcache (especially for > > cleancache pages) and perhaps also for ramster. > > I don't know requirements for cleancache pages but compaction is > general as you know well so I expect you can get a benefit from it > if you are concern on memory efficiency but not sure it's valuable > to compact cleancache pages for getting more slot in RAM. > Sometime, just discarding would be much better, IMHO. Zcache has page reclaim. Zswap has zpage reclaim. I am concerned that these continue to work in the presence of compaction. With no reclaim at all, zram is a simpler use case but if you implement compaction in a way that can't be used by either zcache or zswap, then zsmalloc is essentially forking. > > In https://lkml.org/lkml/2013/3/27/501 I suggested it > > would be good to work together on a common design, but > > you didn't reply. Are you thinking that zsmalloc > > I saw the thread but explicit agreement is really matter? > I believe everybody want it although they didn't reply. :) > > You can make the design/post it or prototyping/post it. > If there are some conflit with something in my brain, > I will be happy to feedback. :) > > Anyway, I think my above statement "COMPACTION" would be enough to > express my current thought to avoid duplicated work and you can catch up. > > I will get around to it after LSF/MM. > > > improvements should focus only on zram, in which case > > Just focusing zsmalloc. Right. Again, I am asking if you are changing zsmalloc in a way that helps zram but hurts zswap and makes it impossible for zcache to ever use the improvements to zsmalloc. If so, that's fine, but please make it clear that is your goal. > > we may -- and possibly should -- end up with a different > > allocator for frontswap-based/cleancache-based compression > > in zcache (and possibly zswap)? 
> > > I'm just trying to determine if I should proceed separately > > with my design (with Bob Liu, who expressed interest) or if > > it would be beneficial to work together. > > Just posting and if it affects zsmalloc/zram/zswap and goes the way > I don't want, I will involve the discussion because our product uses > zram heavily and consider zswap, too. > > I really appreciate your enthusiastic collaboration model to find > optimal solution! My goal is to have compression be an integral part of Linux memory management. It may be tied to a config option, but the goal is that distros turn it on by default. I don't think zsmalloc meets that objective yet, but it may be fine for your needs. If so it would be good to understand exactly why it doesn't meet the other zproject needs.
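The "transparent handle" point in the exchange above can be made concrete with a toy model: because callers hold opaque handles rather than raw pointers, the allocator can relocate objects during compaction by updating only its own internal handle table, with no need to walk the caller's data structures. This is a hypothetical Python sketch of that indirection, not zsmalloc's actual layout or API:

```python
class HandleAllocator:
    """Toy allocator with transparent handles: users keep opaque
    integer handles; only the internal table maps a handle to its
    current storage slot, so compaction may move objects freely."""

    def __init__(self):
        self._next_handle = 0
        self._table = {}   # handle -> slot index (the only "pointer")
        self._slots = []   # backing storage; frees leave holes

    def alloc(self, data):
        handle = self._next_handle
        self._next_handle += 1
        self._slots.append(data)
        self._table[handle] = len(self._slots) - 1
        return handle

    def free(self, handle):
        self._slots[self._table.pop(handle)] = None  # leave a hole

    def read(self, handle):
        return self._slots[self._table[handle]]

    def compact(self):
        """Slide live objects down and rewrite only the handle table;
        user-held handles stay valid across the move."""
        live = sorted(self._table.items(), key=lambda kv: kv[1])
        new_slots = []
        for handle, idx in live:
            self._table[handle] = len(new_slots)
            new_slots.append(self._slots[idx])
        self._slots = new_slots

a = HandleAllocator()
h1, h2, h3 = a.alloc(b"x"), a.alloc(b"y"), a.alloc(b"z")
a.free(h2)
a.compact()
print(a.read(h1), a.read(h3), len(a._slots))  # handles survive compaction
```

Dan's concern maps onto the same model: if a backend had cached raw slot indices instead of handles, compact() would silently invalidate them, which is why compaction support depends on every user going through the handle indirection.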
zsmalloc defrag (Was: [PATCH] mm: remove compressed copy from zram in-memory)
> From: Minchan Kim [mailto:minc...@kernel.org] > Sent: Monday, April 08, 2013 12:01 AM > Subject: [PATCH] mm: remove compressed copy from zram in-memory (patch removed) > Fragment ratio is almost same but memory consumption and compile time > is better. I am working to add defragment function of zsmalloc. Hi Minchan -- I would be very interested in your design thoughts on how you plan to add defragmentation for zsmalloc. In particular, I am wondering if your design will also handle the requirements for zcache (especially for cleancache pages) and perhaps also for ramster. In https://lkml.org/lkml/2013/3/27/501 I suggested it would be good to work together on a common design, but you didn't reply. Are you thinking that zsmalloc improvements should focus only on zram, in which case we may -- and possibly should -- end up with a different allocator for frontswap-based/cleancache-based compression in zcache (and possibly zswap)? I'm just trying to determine if I should proceed separately with my design (with Bob Liu, who expressed interest) or if it would be beneficial to work together. Thanks, Dan
RE: [PATCH part2 v6 0/3] staging: zcache: Support zero-filled pages more efficiently
> From: Dan Magenheimer > Subject: RE: [PATCH part2 v6 0/3] staging: zcache: Support zero-filled pages > more efficiently > > > From: Wanpeng Li [mailto:liw...@linux.vnet.ibm.com] > > Subject: Re: [PATCH part2 v6 0/3] staging: zcache: Support zero-filled > > pages more efficiently > > > > Hi Dan, > > > > Some issues against Ramster: > > > > Sure! I am concerned about Konrad's patches adding debug.c as they > add many global variables. They are only required when ZCACHE_DEBUG > is enabled so they may be ok. If not, adding ramster variables > to debug.c may make the problem worse. Oops, I just noticed/remembered that ramster uses BOTH debugfs and sysfs. The sysfs variables are all currently required, i.e. for configuration so should not be tied to debugfs or a DEBUG config option. However, if there is a more acceptable way to implement the function of those sysfs variables, that would be fine. Thanks, Dan
RE: [PATCH part2 v6 0/3] staging: zcache: Support zero-filled pages more efficiently
> From: Wanpeng Li [mailto:liw...@linux.vnet.ibm.com] > Subject: Re: [PATCH part2 v6 0/3] staging: zcache: Support zero-filled pages > more efficiently > > Hi Dan, > > Some issues against Ramster: > > - Ramster who takes advantage of zcache also should support zero-filled > pages more efficiently, correct? It doesn't handle zero-filled pages well > currently. When you first posted your patchset I took a quick look at ramster and it looked like your patchset should work for ramster also. However I didn't actually run ramster to try it so there may be a bug. If it doesn't work, I would very much appreciate a patch. > - Ramster DebugFS counters are exported in /sys/kernel/mm/, but > zcache/frontswap/cleancache > all are exported in /sys/kernel/debug/, should we unify them? That would be great. > - If ramster also should move DebugFS counters to a single file like > zcache do? Sure! I am concerned about Konrad's patches adding debug.c as they add many global variables. They are only required when ZCACHE_DEBUG is enabled so they may be ok. If not, adding ramster variables to debug.c may make the problem worse. > If you confirm these issues are make sense to fix, I will start coding. ;-) That would be great. Note that I have a how-to for ramster here: https://oss.oracle.com/projects/tmem/dist/files/RAMster/HOWTO-120817 If when you are testing you find that this how-to has mistakes, please let me know. Or feel free to add the (corrected) how-to file as a patch in your patchset. Thanks very much, Wanpeng, for your great contributions! (Ric, since you have expressed interest in ramster, if you try it and find corrections to the how-to file above, your input would be very much appreciated also!) Dan
RE: [PATCHv8 5/8] mm: break up swap_writepage() for frontswap backends
> From: Seth Jennings [mailto:sjenn...@linux.vnet.ibm.com] > Subject: Re: [PATCHv8 5/8] mm: break up swap_writepage() for frontswap > backends > > On 04/04/2013 05:10 PM, Seth Jennings wrote: > > swap_writepage() is currently where frontswap hooks into the swap > > write path to capture pages with the frontswap_store() function. > > However, if a frontswap backend wants to "resume" the writeback of > > a page to the swap device, it can't call swap_writepage() as > > the page will simply reenter the backend. > > > > This patch separates swap_writepage() into a top and bottom half, the > > bottom half named __swap_writepage() to allow a frontswap backend, > > like zswap, to resume writeback beyond the frontswap_store() hook. > > > > __add_to_swap_cache() is also made non-static so that the page for > > which writeback is to be resumed can be added to the swap cache. > > > > Acked-by: Minchan Kim > > Signed-off-by: Seth Jennings > > Adding Cc Bob Liu. > > I just remembered that Bob had done a repost of the 5 and 6 patches, > outside the zswap thread, with a small change to avoid a checkpatch > warning. I didn't pull that change into my version, but I should have. > > It doesn't make a functional difference, so this patch can still go > forward and the checkpatch warning can be cleaned up in a subsequent > patch. If another revision of the patchset is needed for other > reasons, I'll pull this change into the next version. > > I think Dan and Bob would be ok with their tags being applied to 5 and 6: > > Acked-by: Bob Liu > Reviewed-by: Dan Magenheimer > > That ok? OK with me. I do support these two MM patches as candidates for the 3.10 window since both zswap AND in-tree zcache depend on them, but the silence from Andrew was a bit deafening. Seth, perhaps you could add a #ifdef CONFIG_ZSWAP_WRITEBACK to the zswap code and Kconfig (as zcache has done) and then these two patches in your patchset can be reviewed separately? 
RE: [PATCHv8 0/8] zswap: compressed swap caching
> From: Seth Jennings [mailto:sjenn...@linux.vnet.ibm.com] > Subject: [PATCHv8 0/8] zswap: compressed swap caching > > ... I am submitting this as a > candidate for merging in the v3.10 window... > : > I'll be attending the LSF/MM summit where there (hopefully) will be a > discussion this patchset and memory compression in general. IMHO it would be good to first have the discussion at LSF/MM.
RE: [PATCHv8 0/8] zswap: compressed swap caching
From: Seth Jennings [mailto:sjenn...@linux.vnet.ibm.com] Subject: [PATCHv8 0/8] zswap: compressed swap caching ... I am submitting this as a candidate for merging in the v3.10 window... : I'll be attending the LSF/MM summit where there (hopefully) will be a discussion this patchset and memory compression in general. IMHO it would be good to first have the discussion at LSF/MM. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCHv8 5/8] mm: break up swap_writepage() for frontswap backends
From: Seth Jennings [mailto:sjenn...@linux.vnet.ibm.com] Subject: Re: [PATCHv8 5/8] mm: break up swap_writepage() for frontswap backends On 04/04/2013 05:10 PM, Seth Jennings wrote: swap_writepage() is currently where frontswap hooks into the swap write path to capture pages with the frontswap_store() function. However, if a frontswap backend wants to resume the writeback of a page to the swap device, it can't call swap_writepage() as the page will simply reenter the backend. This patch separates swap_writepage() into a top and bottom half, the bottom half named __swap_writepage() to allow a frontswap backend, like zswap, to resume writeback beyond the frontswap_store() hook. __add_to_swap_cache() is also made non-static so that the page for which writeback is to be resumed can be added to the swap cache. Acked-by: Minchan Kim minc...@kernel.org Signed-off-by: Seth Jennings sjenn...@linux.vnet.ibm.com Adding Cc Bob Liu. I just remembered that Bob had done a repost of the 5 and 6 patches, outside the zswap thread, with a small change to avoid a checkpatch warning. I didn't pull that change into my version, but I should have. It doesn't make a functional difference, so this patch can still go forward and the checkpatch warning can be cleaned up in a subsequent patch. If another revision of the patchset is needed for other reasons, I'll pull this change into the next version. I think Dan and Bob would be ok with their tags being applied to 5 and 6: Acked-by: Bob Liu bob@oracle.com Reviewed-by: Dan Magenheimer dan.magenhei...@oracle.com That ok? OK with me. I do support these two MM patches as candidates for the 3.10 window since both zswap AND in-tree zcache depend on them, but the silence from Andrew was a bit deafening. Seth, perhaps you could add a #ifdef CONFIG_ZSWAP_WRITEBACK to the zswap code and Kconfig (as zcache has done) and then these two patches in your patchset can be reviewed separately? 
-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: zsmalloc/lzo compressibility vs entropy
> From: Dan Magenheimer
> Sent: Wednesday, March 27, 2013 3:42 PM
> To: Seth Jennings; Konrad Wilk; Minchan Kim; Bob Liu; Robert Jennings; Nitin Gupta; Wanpeng Li; Andrew Morton; Mel Gorman
> Cc: linux...@kvack.org; linux-kernel@vger.kernel.org
> Subject: zsmalloc/lzo compressibility vs entropy
>
> This might be obvious to those of you who are better
> mathematicians than I, but I ran some experiments
> to confirm the relationship between entropy and compressibility
> and thought I should report the results to the list.

A few new observations worth mentioning:

Since Seth long ago mentioned that the text of Moby Dick resulted in poor (but not horribly poor) compression, I thought I'd look at some ASCII data. I used the first sentence of the Gettysburg Address (91 characters) and repeated it to fill a page. Interestingly, LZO apparently discovered the repetition... the page compressed to 118 bytes even though the result had 15618 one-bits (fairly high entropy).

I used the full Gettysburg Address (1459 characters), again repeated to fill a page. LZO compressed this to 1070 bytes. (14568 one-bits.)

To fill a page with text, I added part of the Declaration of Independence. No repeating text now. This only compressed to 2754 bytes (which, I assume, is close to Seth's observations on Moby Dick). 14819 one-bits.

Last (for swap), to see if random ASCII would compress better than binary, I masked off the MSB in each byte of a random page. The mean zsize was 4116 bytes (larger than a page) with a stddev of 51. The one-bit mean was 14336 (7/16 of a page).

On a completely different track, I thought it would be relevant to look at the difference between frontswap (anonymous) page zsize distribution and cleancache (file) page zsize distribution. Running kernbench, zsize mean was 1974 (stddev 895). For a different benchmark, I did:

# find / | grep3

where grep3 is a simple bash script that does three separate greps on the first argument.
Since this fills the page cache and causes reclaiming, and reclaims are captured by cleancache and fed to zcache, this data page stream approximates random pages on the disk. This "benchmark" generated a zsize mean of 2265 with stddev 1008.

Also of note: only a fraction of a percent of cleancache pages are zero-filled, so Wanpeng's zcache patch to handle zero-filled pages more efficiently is very good for frontswap pages but may have little benefit for cleancache pages.

Bottom line conclusions:
(1) Entropy is probably less a factor for LZO-compressibility than data repetition.
(2) Cleancache data pages may have a very different zsize distribution than frontswap data pages, anecdotally skewed to much higher zsize.
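The repetition effect reported above is easy to reproduce in userspace. A rough sketch, with Python's zlib standing in for the kernel's LZO (so absolute sizes differ, but the ordering — repeated text compresses to a tiny fraction of a page, non-repeating bytes in the same ASCII range barely compress — holds):

```python
import random
import zlib

# zlib stands in for LZO here; the sentence is just any ~90-byte ASCII
# sentence in the spirit of the Gettysburg Address experiment.
PAGE_SIZE = 4096

sentence = (b"Four score and seven years ago our fathers brought forth "
            b"on this continent a new nation")
repeated_page = (sentence * (PAGE_SIZE // len(sentence) + 1))[:PAGE_SIZE]

random.seed(0)
# "Text-like" but non-repeating: printable ASCII bytes with no structure.
nonrepeating_page = bytes(random.randrange(32, 127) for _ in range(PAGE_SIZE))

z_repeated = len(zlib.compress(repeated_page))
z_nonrepeating = len(zlib.compress(nonrepeating_page))
one_bits = sum(bin(b).count("1") for b in repeated_page)

# The repeated page still has a high one-bit count (high bit-level
# "entropy") yet compresses to a tiny fraction of a page, while the
# non-repeating page stays close to its entropy limit.
```

This matches the observation that the dictionary-style match-finding in an LZ-family compressor cares about byte repetition far more than about the balance of one-bits.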
RE: [RFC] mm: remove swapcache page early
> From: Minchan Kim [mailto:minc...@kernel.org]
> Subject: Re: [RFC] mm: remove swapcache page early
>
> Hi Dan,
>
> On Wed, Mar 27, 2013 at 03:24:00PM -0700, Dan Magenheimer wrote:
> > > From: Hugh Dickins [mailto:hu...@google.com]
> > > Subject: Re: [RFC] mm: remove swapcache page early
> > >
> > > I believe the answer is for frontswap/zmem to invalidate the frontswap
> > > copy of the page (to free up the compressed memory when possible) and
> > > SetPageDirty on the PageUptodate PageSwapCache page when swapping in
> > > (setting page dirty so nothing will later go to read it from the
> > > unfreed location on backing swap disk, which was never written).
> >
> > There are two duplication issues: (1) When can the page be removed
> > from the swap cache after a call to frontswap_store; and (2) When
> > can the page be removed from the frontswap storage after it
> > has been brought back into memory via frontswap_load.
> >
> > This patch from Minchan addresses (1). The issue you are raising
>
> No. I am addressing (2).
>
> > here is (2). You may not know that (2) has recently been solved
> > in frontswap, at least for zcache. See frontswap_exclusive_gets_enabled.
> > If this is enabled (and it is for zcache but not yet for zswap),
> > what you suggest (SetPageDirty) is what happens.
>
> I am blind on zcache so I didn't see it. Anyway, I'd like to address it
> on zram and zswap.

Zswap can enable it trivially by adding a function call in init_zswap. (Note that it is not enabled by default for all frontswap backends because it is another complicated tradeoff of cpu time vs memory space that needs more study on a broad set of workloads.)

I wonder if something like this would have a similar result for zram? (Completely untested... snippet stolen from swap_entry_free with SetPageDirty added... doesn't compile yet, but should give you the idea.)
diff --git a/mm/page_io.c b/mm/page_io.c
index 56276fe..2d10988 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -81,7 +81,17 @@ void end_swap_bio_read(struct bio *bio, int err)
 			iminor(bio->bi_bdev->bd_inode),
 			(unsigned long long)bio->bi_sector);
 	} else {
+		struct swap_info_struct *sis;
+
 		SetPageUptodate(page);
+		sis = page_swap_info(page);
+		if (sis->flags & SWP_BLKDEV) {
+			struct gendisk *disk = sis->bdev->bd_disk;
+			if (disk->fops->swap_slot_free_notify) {
+				SetPageDirty(page);
+				disk->fops->swap_slot_free_notify(sis->bdev,
+							offset);
+			}
+		}
 	}
 	unlock_page(page);
 	bio_put(bio);
RE: [RFC] mm: remove swapcache page early
> From: Hugh Dickins [mailto:hu...@google.com]
> Subject: RE: [RFC] mm: remove swapcache page early
>
> On Wed, 27 Mar 2013, Dan Magenheimer wrote:
> > > From: Hugh Dickins [mailto:hu...@google.com]
> > > Subject: Re: [RFC] mm: remove swapcache page early
> >
> > The issue you are raising here is (2). You may not know that (2) has
> > recently been solved in frontswap, at least for zcache. See
> > frontswap_exclusive_gets_enabled. If this is enabled (and it is for
> > zcache but not yet for zswap), what you suggest (SetPageDirty) is
> > what happens.
>
> Ah, and I have a dim, perhaps mistaken, memory that I gave you
> input on that before, suggesting the SetPageDirty. Good, sounds
> like the solution is already in place, if not actually activated.
>
> Thanks, must dash,
> Hugh

Hi Hugh --

Credit where it is due... Yes, I do recall now that the idea was originally yours. It went on a to-do list where I eventually tried it and it worked... I'm sorry I had forgotten and neglected to give you credit! (BTW, it is activated for zcache in 3.9.)

Thanks,
Dan
RE: [RFC] mm: remove swapcache page early
> From: Hugh Dickins [mailto:hu...@google.com]
> Subject: Re: [RFC] mm: remove swapcache page early
>
> On Wed, 27 Mar 2013, Minchan Kim wrote:
> > Swap subsystem does lazy swap slot free with expecting the page
> > would be swapped out again so we can't avoid unnecessary write.
>
> so we can avoid unnecessary write.
>
> > But the problem in in-memory swap is that it consumes memory space
> > until vm_swap_full(ie, used half of all of swap device) condition
> > meet. It could be bad if we use multiple swap device, small in-memory swap
> > and big storage swap or in-memory swap alone.
>
> That is a very good realization: it's surprising that none of us
> thought of it before - no disrespect to you, well done, thank you.

Yes, my compliments also, Minchan. This problem has been thought of before, but this patch is the first to identify a possible solution.

> And I guess swap readahead is utterly unhelpful in this case too.

Yes... as is any "swap writeahead". Excuse my ignorance, but I think this is not done in the swap subsystem; instead the kernel assumes write-coalescing will be done in the block I/O subsystem, which means swap writeahead would affect zram but not zcache/zswap (since frontswap subverts the block I/O subsystem). However, I think a swap-readahead solution would be helpful to zram as well as zcache/zswap.

> > This patch changes vm_swap_full logic slightly so it could free
> > swap slot early if the backed device is really fast.
> > For it, I used SWP_SOLIDSTATE but It might be controversial.
>
> But I strongly disagree with almost everything in your patch :)
> I disagree with addressing it in vm_swap_full(), I disagree that
> it can be addressed by device, I disagree that it has anything to
> do with SWP_SOLIDSTATE.
>
> This is not a problem with swapping to /dev/ram0 or to /dev/zram0,
> is it? In those cases, a fixed amount of memory has been set aside
> for swap, and it works out just like with disk block devices. The
> memory set aside may be wasted, but that is accepted upfront.

It is (I believe) also a problem with swapping to ram. Two copies of the same page are kept in memory in different places, right? Fixed vs variable size is irrelevant, I think. Or am I misunderstanding something about swap-to-ram?

> Similarly, this is not a problem with swapping to SSD. There might
> or might not be other reasons for adjusting the vm_swap_full() logic
> for SSD or generally, but those have nothing to do with this issue.

I think it is at least highly related. The key issue is the tradeoff of the likelihood that the page will soon be read/written again while it is in swap cache vs the time/resource-usage necessary to "reconstitute" the page into swap cache. Reconstituting from disk requires a LOT of elapsed time. Reconstituting from an SSD likely takes much less time. Reconstituting from zcache/zram takes thousands of CPU cycles.

> The problem here is peculiar to frontswap, and the variably sized
> memory behind it, isn't it? We are accustomed to using swap to free
> up memory by transferring its data to some other, cheaper but slower
> resource.

Frontswap does make the problem more complex because some pages are in "fairly fast" storage (zcache, needs decompression) and some are on the actual (usually) rotating media. Fortunately, differentiating between these two cases is just a table lookup (see frontswap_test).

> But in the case of frontswap and zmem (I'll say that to avoid thinking
> through which backends are actually involved), it is not a cheaper and
> slower resource, but the very same memory we are trying to save: swap
> is stolen from the memory under reclaim, so any duplication becomes
> counter-productive (if we ignore cpu compression/decompression costs:
> I have no idea how fair it is to do so, but anyone who chooses zmem
> is prepared to pay some cpu price for that).

Exactly. There is some "robbing of Peter to pay Paul" and other complex resource tradeoffs. Presumably, though, it is not "the very same memory we are trying to save" but a fraction of it, saving the same page of data more efficiently in memory, using less than a page, at some CPU cost.

> And because it's a frontswap thing, we cannot decide this by device:
> frontswap may or may not stand in front of each device. There is no
> problem with swapcache duplicated on disk (until that area approaches
> being full or fragmented), but at the higher level we cannot see what
> is in zmem and what is on disk: we only want to free up the zmem dup.

I *think* frontswap_test(page) resolves this problem, as long as we have a specific page available to use as a parameter.

> I believe the answer is for frontswap/zmem to invalidate the frontswap
> copy of the page (to free up the compressed memory when possible) and
> SetPageDirty on the PageUptodate PageSwapCache page when swapping in
> (setting page dirty so nothing will later go to read it from the
> unfreed location on backing swap disk, which was never written).
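Hugh's invalidate-plus-SetPageDirty scheme can be sketched as a small toy model. This is illustrative Python, not kernel code; the class and method names only mirror the discussion: on swap-in, the in-memory backend hands back the data, drops its own copy, and the page is marked dirty so a later eviction re-stores it instead of reading a never-written slot on the backing swap device.

```python
# Toy model of "exclusive gets": frontswap_load invalidates the zmem
# copy and dirties the page. Names are illustrative, not kernel APIs.

class Page:
    def __init__(self, data=None):
        self.data = data
        self.dirty = False

class ZmemBackend:
    def __init__(self):
        self.slots = {}                 # swap offset -> stored data

    def frontswap_store(self, offset, page):
        self.slots[offset] = page.data

    def frontswap_load_exclusive(self, offset, page):
        page.data = self.slots.pop(offset)  # invalidate: free the zmem copy
        page.dirty = True                   # SetPageDirty: forces a re-store
                                            # rather than a read of the
                                            # never-written backing slot
```

The memory win is that only one copy (compressed or uncompressed) exists at a time; the cost is that a page which is evicted again must be recompressed.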
zsmalloc/lzo compressibility vs entropy
This might be obvious to those of you who are better mathematicians than I, but I ran some experiments to confirm the relationship between entropy and compressibility and thought I should report the results to the list.

Using the LZO code in the kernel via zsmalloc and some hacks in zswap, I measured the compression of pages generated by get_random_bytes() and then of pages where half the page is generated by get_random_bytes() and the other half-page is zero-filled.

For a fully random page, one would expect the number of zeroes and ones generated to be equal (highest entropy), and that proved true: the mean number of one-bits in the fully random page was 16384 (x86, so PAGE_SIZE=4096 * 8 bits/byte, i.e. half of the page's 32768 bits) with a stddev of 93 (sample size > 50). For this sample of pages, zsize had a mean of 4116 and a stddev of 16. So for fully random pages, LZO compression results in "negative" compression... the size of the compressed page is slightly larger than a page.

For a "half random" page -- a fully random page with the first half of the page overwritten with zeroes -- zsize mean is 2077 with a stddev of 6. So a half-random page compresses by about a factor of 2. (Just to be sure, I reran the experiment with the first half of the page overwritten with ones instead of zeroes, and the result was approximately the same.) For extra credit, I ran a "quarter random" page... zsize mean is 1052 with a stddev of 45.

For more extra credit, I tried a fully-random page with every OTHER byte forced to zero, so half the bytes are random and half are zero. The result: mean zsize is 3841 with a stddev of 33. Then I tried a fully-random page with every other PAIR of bytes forced to zero. The result: zsize mean is 4029 with a stddev of 67. (Worse!)

So LZO page compression works better when there are many more zeroes than ones in a page (or vice-versa), but works best when a long sequence of bits (bytes?) are the same.
All this still leaves open the question of what the page-entropy (and zsize distribution) will be over a large set of pages and over a large set of workloads AND across different classes of data (e.g. frontswap pages vs cleancache pages), but at least we have some theory to guide us.
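The zero-placement experiments above are straightforward to reproduce in userspace. A rough sketch with Python's zlib standing in for the kernel's LZO (absolute zsizes differ, but the relative ordering — half-zero page ≪ alternating-zero page < fully random page — matches the reported means of 2077, 3841, and 4116):

```python
import os
import zlib

# zlib stands in for LZO; os.urandom stands in for get_random_bytes().
PAGE_SIZE = 4096
rand = os.urandom(PAGE_SIZE)

full_random = rand                                         # reported mean zsize 4116
half_zero = bytes(PAGE_SIZE // 2) + rand[PAGE_SIZE // 2:]  # reported mean zsize 2077
every_other_zero = bytes(b if i % 2 else 0                 # reported mean zsize 3841
                         for i, b in enumerate(rand))

z_full = len(zlib.compress(full_random))
z_half = len(zlib.compress(half_zero))
z_alt = len(zlib.compress(every_other_zero))

# One long zero run compresses far better than the same number of zero
# bytes interleaved with random ones, echoing the LZO findings.
```

This supports the conclusion that an LZ-family compressor rewards long runs of identical bytes, not merely a skewed zero/one bit ratio.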
RE: [RFC] mm: remove swapcache page early
> From: Minchan Kim [mailto:minc...@kernel.org]
> Subject: [RFC] mm: remove swapcache page early
>
> Swap subsystem does lazy swap slot free with expecting the page
> would be swapped out again so we can't avoid unnecessary write.
>
> But the problem in in-memory swap is that it consumes memory space
> until vm_swap_full(ie, used half of all of swap device) condition
> meet. It could be bad if we use multiple swap device, small in-memory swap
> and big storage swap or in-memory swap alone.
>
> This patch changes vm_swap_full logic slightly so it could free
> swap slot early if the backed device is really fast.
> For it, I used SWP_SOLIDSTATE but It might be controversial.
> So let's add Ccing Shaohua and Hugh.
> If it's a problem for SSD, I'd like to create new type SWP_INMEMORY
> or something for z* family.
>
> Other problem is zram is block device so that it can set SWP_INMEMORY
> or SWP_SOLIDSTATE easily(ie, actually, zram is already done) but
> I have no idea to use it for frontswap.
>
> Any idea?
>
> Other optimize point is we remove it unconditionally when we
> found it's exclusive when swap in happen.
> It could help frontswap family, too.

By passing a struct page * to vm_swap_full() you can then call frontswap_test()... if it returns true, then vm_swap_full() can return true. Note that this precisely checks whether the page is in zcache/zswap or not, so Seth's concern that some pages may be in-memory and some may be in rotating storage is no longer an issue.

> What do you think about it?

By removing the page from swapcache, you are now increasing the risk that pages will "thrash" between uncompressed state (in swapcache) and compressed state (in z*). I think this is a better tradeoff, though, than keeping a copy of both the compressed page AND the uncompressed page in memory.

You should probably rename vm_swap_full() because you are now overloading it with other meanings. Maybe vm_swap_reclaimable()? Do you have any measurements? I think you are correct that it may help a LOT.

Thanks,
Dan
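Dan's suggestion amounts to a small policy change. A hedged sketch in Python (the kernel code would be C; vm_swap_reclaimable is only the name floated above, and the boolean argument stands in for a frontswap_test(page) call):

```python
def vm_swap_reclaimable(page_in_frontswap: bool,
                        swap_used: int, swap_total: int) -> bool:
    """Sketch of the proposed policy: free the swap slot (and its
    duplicate copy) immediately if the page's data is held by an
    in-memory frontswap backend; otherwise keep the classic
    vm_swap_full() heuristic of 'swap more than half full'."""
    if page_in_frontswap:          # stands in for frontswap_test(page)
        return True
    return swap_used * 2 > swap_total
```

Because the check is per-page rather than per-device, a pages-on-disk/pages-in-zmem mix on the same swap device is handled naturally.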
RE: [PATCH v2 1/4] introduce zero filled pages handler
> From: Konrad Rzeszutek Wilk [mailto:kon...@darnok.org] > Sent: Tuesday, March 19, 2013 10:44 AM > To: Dan Magenheimer > Cc: Wanpeng Li; Greg Kroah-Hartman; Andrew Morton; Seth Jennings; Minchan > Kim; linux...@kvack.org; > linux-kernel@vger.kernel.org > Subject: Re: [PATCH v2 1/4] introduce zero filled pages handler > > On Sat, Mar 16, 2013 at 2:24 PM, Dan Magenheimer > wrote: > >> From: Konrad Rzeszutek Wilk [mailto:kon...@darnok.org] > >> Subject: Re: [PATCH v2 1/4] introduce zero filled pages handler > >> > >> > + > >> > + for (pos = 0; pos < PAGE_SIZE / sizeof(*page); pos++) { > >> > + if (page[pos]) > >> > + return false; > >> > >> Perhaps allocate a static page filled with zeros and just do memcmp? > > > > That seems like a bad idea. Why compare two different > > memory locations when comparing one memory location > > to a register will do? > > Good point. I was hoping there was a fast memcmp that would > do fancy SSE registers. But it is memory against memory instead of > registers. > > Perhaps a cunning trick would be to check (as a shortcircuit) > against 'empty_zero_page', and if that check fails, then try > to do the check for each byte in the code? Curious about this, I added some code to check for this case. In my test run, the conditional "if (page == ZERO_PAGE(0))" was never true, for >20 pages passed through frontswap that were zero-filled. My test run is certainly not conclusive, but perhaps some other code in the swap subsystem disqualifies ZERO_PAGE as a candidate for swapping? Or maybe it is accessed frequently enough that it never falls out of the active-anonymous page queue? Dan P.S. In arch/x86/include/asm/pgtable.h: #define ZERO_PAGE(vaddr) (virt_to_page(empty_zero_page)) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH v2 1/4] introduce zero filled pages handler
> From: Konrad Rzeszutek Wilk [mailto:kon...@darnok.org] > Subject: Re: [PATCH v2 1/4] introduce zero filled pages handler > > > + > > + for (pos = 0; pos < PAGE_SIZE / sizeof(*page); pos++) { > > + if (page[pos]) > > + return false; > > Perhaps allocate a static page filled with zeros and just do memcmp? That seems like a bad idea. Why compare two different memory locations when comparing one memory location to a register will do?
RE: zsmalloc limitations and related topics
> From: Seth Jennings [mailto:sjenn...@linux.vnet.ibm.com] > Subject: Re: zsmalloc limitations and related topics > > On 03/14/2013 01:54 PM, Dan Magenheimer wrote: > >> From: Robert Jennings [mailto:r...@linux.vnet.ibm.com] > >> Subject: Re: zsmalloc limitations and related topics > >> > >> * Bob (bob@oracle.com) wrote: > >>> On 03/14/2013 06:59 AM, Seth Jennings wrote: > >>>> On 03/13/2013 03:02 PM, Dan Magenheimer wrote: > >>>>>> From: Robert Jennings [mailto:r...@linux.vnet.ibm.com] > >>>>>> Subject: Re: zsmalloc limitations and related topics > >>>>> > >> > >>>>> Yes. And add pageframe-reclaim to this list of things that > >>>>> zsmalloc should do but currently cannot do. > >>>> > >>>> The real question is why is pageframe-reclaim a requirement? What > >>>> operation needs this feature? > >>>> > >>>> AFAICT, the pageframe-reclaim requirements is derived from the > >>>> assumption that some external control path should be able to tell > >>>> zswap/zcache to evacuate a page, like the shrinker interface. But this > >>>> introduces a new and complex problem in designing a policy that doesn't > >>>> shrink the zpage pool so aggressively that it is useless. > >>>> > >>>> Unless there is another reason for this functionality I'm missing. > >>>>. > >>> > >>> Perhaps it's needed if the user want to enable/disable the memory > >>> compression feature dynamically. > >>> Eg, use it as a module instead of recompile the kernel or even > >>> reboot the system. > > > > It's worth thinking about: Under what circumstances would a user want > > to turn off compression? While unloading a compression module should > > certainly be allowed if it makes a user comfortable, in my opinion, > > if a user wants to do that, we have done our job poorly (or there > > is a bug). > > > >> To unload zswap all that is needed is to perform writeback on the pages > >> held in the cache, this can be done by extending the existing writeback > >> code. 
> > > > Actually, frontswap supports this directly. See frontswap_shrink. > > frontswap_shrink() is a best-effort attempt to fault in all the pages > stored in the backend. However, if there is not enough RAM to hold all > the pages, then it can not completely evacuate the backend. > > Module exit functions must return void, so there is no way to fail a > module unload. If you implement an exit function for your module, you > must ensure that it can always complete successfully. For this reason > frontswap_shrink() is unsuitable for module unloading. You'd need to > use a mechanism like writeback that could surely evacuate the backend > (barring I/O failures). A single call to frontswap_shrink may be unsuitable... multiple calls (do while zcache/zswap is not empty) may work fine. Writeback-until-empty should also work fine. In any case, it's a good point that module exit must succeed, and that if there is already heavy memory pressure when zcache/zswap module exit is invoked, module exit may be very very slow and cause many many swap disk writes, so the system may become unresponsive (and may even OOM). So if someone implements zcache/zswap module unload, a thorough test plan would be good.
RE: zsmalloc limitations and related topics
> From: Dan Magenheimer > Subject: RE: zsmalloc limitations and related topics > > > > I would welcome ideas on how to evaluate workloads for > > > "representativeness". Personally I don't believe we should > > > be making decisions about selecting the "best" algorithms > > > or merging code without an agreement on workloads. > > > > I'd argue that there is no such thing as a "representative workload". > > Instead, we try different workloads to validate the design and illustrate > > the performance characteristics and impacts. > > Sorry for repeatedly hammering my point in the above, but > there have been many design choices driven by what was presumed > to be representative (kernbench and now SPECjbb) workload > that may be entirely wrong for a different workload (as > Seth once pointed out using the text of Moby Dick as a source > data stream). > > Further, the value of different designs can't be measured here just > by the workload because the pages chosen to swap may be completely > independent of the intended workload-driver... i.e. if you track > the pid of the pages intended for swap, the pages can be mostly > pages from long-running or periodic system services, not pages > generated by kernbench or SPECjbb. So it is the workload PLUS the > environment that is being measured and evaluated. That makes > the problem especially tough. > > Just to clarify, I'm not suggesting that there is any single > workload that can be called representative, just that we may > need both a broad set of workloads (not silly benchmarks) AND > some theoretical analysis to drive design decisions. And, without > this, arguing about whether zsmalloc is better than zbud or not > is silly. Both zbud and zsmalloc have strengths and weaknesses. 
> > That said, it should also be pointed out that the stream of > pages-to-compress from cleancache ("file pages") may be dramatically > different than for frontswap ("anonymous pages"), so unless you > and Seth are going to argue upfront that cleancache pages should > NEVER be candidates for compression, the evaluation criteria > to drive design decisions needs to encompass both anonymous > and file pages. It is currently impossible to evaluate that > with zswap. Sorry to reply to myself here, but I realized last night that I left off another related important point: We have a tendency to run benchmarks on a "cold" system so that the results are reproducible. For compression however, this may unnaturally skew the entropy of data-pages-to-be-compressed and so also the density measurements. I can't prove it, but I suspect that soon after boot the number of anonymous pages containing all (or nearly all) zeroes is large, i.e. entropy is low. As the length of time grows since the system booted, more anonymous pages will be written with non-zero data, thus increasing entropy and decreasing compressibility. So, over time, the distribution of zsize may slowly skew right (toward PAGE_SIZE). If so, this effect may be very real but very hard to observe. Dan
RE: zsmalloc limitations and related topics
> From: Robert Jennings [mailto:r...@linux.vnet.ibm.com] > Sent: Thursday, March 14, 2013 7:21 AM > To: Bob > Cc: Seth Jennings; Dan Magenheimer; minc...@kernel.org; Nitin Gupta; Konrad > Wilk; linux...@kvack.org; > linux-kernel@vger.kernel.org; Bob Liu; Luigi Semenzato; Mel Gorman > Subject: Re: zsmalloc limitations and related topics > > * Bob (bob@oracle.com) wrote: > > On 03/14/2013 06:59 AM, Seth Jennings wrote: > > >On 03/13/2013 03:02 PM, Dan Magenheimer wrote: > > >>>From: Robert Jennings [mailto:r...@linux.vnet.ibm.com] > > >>>Subject: Re: zsmalloc limitations and related topics > > >> > > > >>Yes. And add pageframe-reclaim to this list of things that > > >>zsmalloc should do but currently cannot do. > > > > > >The real question is why is pageframe-reclaim a requirement? What > > >operation needs this feature? > > > > > >AFAICT, the pageframe-reclaim requirements is derived from the > > >assumption that some external control path should be able to tell > > >zswap/zcache to evacuate a page, like the shrinker interface. But this > > >introduces a new and complex problem in designing a policy that doesn't > > >shrink the zpage pool so aggressively that it is useless. > > > > > >Unless there is another reason for this functionality I'm missing. > > > > > > > Perhaps it's needed if the user want to enable/disable the memory > > compression feature dynamically. > > Eg, use it as a module instead of recompile the kernel or even > > reboot the system. It's worth thinking about: Under what circumstances would a user want to turn off compression? While unloading a compression module should certainly be allowed if it makes a user comfortable, in my opinion, if a user wants to do that, we have done our job poorly (or there is a bug). > To unload zswap all that is needed is to perform writeback on the pages > held in the cache, this can be done by extending the existing writeback > code. Actually, frontswap supports this directly. See frontswap_shrink. 
RE: zsmalloc limitations and related topics
> From: Seth Jennings [mailto:sjenn...@linux.vnet.ibm.com] > Subject: Re: zsmalloc limitations and related topics Hi Seth -- Thanks for the reply. I think it is very important to be having these conversations. > >>> 2) When not full and especially when nearly-empty _after_ > >>>being full, density may fall below 1.0 as a result of > >>>fragmentation. > >> > >> True and there are several ways to address this including > >> defragmentation, fewer class sizes in zsmalloc, aging, and/or writeback > >> of zpages in sparse zspages to free pageframes during normal writeback. > > > > Yes. And add pageframe-reclaim to this list of things that > > zsmalloc should do but currently cannot do. > > The real question is why is pageframe-reclaim a requirement? It is because pageframes are the currency of the MM subsystem. See more below. > What operation needs this feature? > AFAICT, the pageframe-reclaim requirement is derived from the > assumption that some external control path should be able to tell > zswap/zcache to evacuate a page, like the shrinker interface. But this > introduces a new and complex problem in designing a policy that doesn't > shrink the zpage pool so aggressively that it is useless. > > Unless there is another reason for this functionality I'm missing. That's the reason. IMHO, it is precisely this "new and complex" problem that we must solve. Otherwise, compression is just a cool toy that may (or may not) help your workload if you turn it on. Zcache already does implement "a policy that doesn't shrink the zpage pool so aggressively that it is useless". While I won't claim the policy is the right one, it is a policy, it is not particularly complex, and it is definitely not useless. And it depends on pageframe-reclaim. > >>> 3) Zsmalloc has a density of exactly 1.0 for any number of > >>>zpages with zsize >= 0.8. > >> > >> For this reason zswap does not cache pages which fall in this range.
> >> It is not enforced in the allocator because some users may be forced to > >> store these pages; users like zram. > > > > Again, without a "representative" workload, we don't know whether > > or not it is important to manage pages with zsize >= 0.8. You are > > simply dismissing it as unnecessary because zsmalloc can't handle > > them and because they don't appear at any measurable frequency > > in kernbench or SPECjbb. (Zbud _can_ efficiently handle these larger > > pages under many circumstances... but without a "representative" workload, > > we don't know whether or not those circumstances will occur.) > > The real question is not whether any workload would operate on pages > that don't compress to 80%. Any workload that operates on pages of > already compressed or encrypted data would do this. The question is, is > it worth it to store those pages in the compressed cache since the > effective reclaim efficiency approaches 0. You are letting the implementation of zsmalloc color your thinking. Zbud can quite efficiently store pages that compress up to zsize = ((63 * PAGE_SIZE) / 64) because it buddies highly compressible pages with poorly compressible pages. This is also, of course, very zsize-distribution-dependent. (These are not just already-compressed or encrypted data, although those are good examples. Compressibility is related to entropy, and there may be many anonymous pages that have high entropy. We really just don't know.) > >>> 4) Zsmalloc contains several compile-time parameters; > >>>the best value of these parameters may be very workload > >>>dependent. > >> > >> The parameters fall into two major areas, handle computation and class > >> size. The handle can be abstracted away, eliminating the compile-time > >> parameters. The class-size tunable could be changed to a default value > >> with the option for specifying an alternate value from the user during > >> pool creation. 
> > > > Perhaps my point here wasn't clear so let me be more blunt: > > There's no way in hell that even a very sophisticated user > > will know how to set these values. I think we need to > > ensure either that they are "always right" (which without > > a "representative workload"...) or, preferably, have some way > > so that they can dynamically adapt at runtime. > > I think you made the point that if this "representative workload" is > completely undefined, then having tunables for zsmalloc that are "always > right" is also not possible. The best we can hope for is "mostly right" > which, of course, is difficult to get everyone to agree on and will be > based on usage. I agree "always right" is impossible and, as I said, would prefer adaptable. I think zsmalloc and zbud address very different zsize-distributions so some combination may be better than either by itself. > >>> If density == 1.0, that means we are paying the overhead of > >>> compression+decompression for no space advantage. If > >>> density < 1.0, that means using zsmalloc is detrimental, > >>>
RE: [PATCH 4/4] zcache: add pageframes count once compress zero-filled pages twice
> From: Wanpeng Li [mailto:liw...@linux.vnet.ibm.com] > Sent: Wednesday, March 13, 2013 6:21 PM > To: Dan Magenheimer > Cc: Andrew Morton; Greg Kroah-Hartman; Dan Magenheimer; Seth Jennings; Konrad > Rzeszutek Wilk; Minchan > Kim; linux...@kvack.org; linux-kernel@vger.kernel.org > Subject: Re: [PATCH 4/4] zcache: add pageframes count once compress > zero-filled pages twice > > On Wed, Mar 13, 2013 at 09:42:16AM -0700, Dan Magenheimer wrote: > >> From: Wanpeng Li [mailto:liw...@linux.vnet.ibm.com] > >> Sent: Wednesday, March 13, 2013 1:05 AM > >> To: Andrew Morton > >> Cc: Greg Kroah-Hartman; Dan Magenheimer; Seth Jennings; Konrad Rzeszutek > >> Wilk; Minchan Kim; linux- > >> m...@kvack.org; linux-kernel@vger.kernel.org; Wanpeng Li > >> Subject: [PATCH 4/4] zcache: add pageframes count once compress > >> zero-filled pages twice > > > >Hi Wanpeng -- > > > >Thanks for taking on this task from the drivers/staging/zcache TODO list! > > > >> Since zbudpage consist of two zpages, two zero-filled pages compression > >> contribute to one [eph|pers]pageframe count accumulated. > > > > Hi Dan, > > >I'm not sure why this is necessary. The [eph|pers]pageframe count > >is supposed to be counting actual pageframes used by zcache. Since > >your patch eliminates the need to store zero pages, no pageframes > >are needed at all to store zero pages, so it's not necessary > >to increment zcache_[eph|pers]_pageframes when storing zero > >pages. > > > > Great point! It seems that we also don't need to caculate > zcache_[eph|pers]_zpages for zero-filled pages. I will fix > it in next version. :-) Hi Wanpeng -- I think we DO need to increment/decrement zcache_[eph|pers]_zpages for zero-filled pages. The main point of the counters for zpages and pageframes is to be able to calculate density == zpages/pageframes. A zero-filled page becomes a zpage that "compresses" to zero bytes and, as a result, requires zero pageframes for storage. 
So the zpages counter should be increased but the pageframes counter should not. If you are changing the patch anyway, I do like better the use of "zero_filled_page" rather than just "zero" or "zero page". So it might be good to change: handle_zero_page -> handle_zero_filled_page pages_zero -> zero_filled_pages zcache_pages_zero -> zcache_zero_filled_pages and maybe page_zero_filled -> page_is_zero_filled Thanks, Dan
RE: [PATCH 4/4] zcache: add pageframes count once compress zero-filled pages twice
From: Wanpeng Li [mailto:liw...@linux.vnet.ibm.com] Sent: Wednesday, March 13, 2013 6:21 PM To: Dan Magenheimer Cc: Andrew Morton; Greg Kroah-Hartman; Dan Magenheimer; Seth Jennings; Konrad Rzeszutek Wilk; Minchan Kim; linux...@kvack.org; linux-kernel@vger.kernel.org Subject: Re: [PATCH 4/4] zcache: add pageframes count once compress zero-filled pages twice On Wed, Mar 13, 2013 at 09:42:16AM -0700, Dan Magenheimer wrote: From: Wanpeng Li [mailto:liw...@linux.vnet.ibm.com] Sent: Wednesday, March 13, 2013 1:05 AM To: Andrew Morton Cc: Greg Kroah-Hartman; Dan Magenheimer; Seth Jennings; Konrad Rzeszutek Wilk; Minchan Kim; linux- m...@kvack.org; linux-kernel@vger.kernel.org; Wanpeng Li Subject: [PATCH 4/4] zcache: add pageframes count once compress zero-filled pages twice Hi Wanpeng -- Thanks for taking on this task from the drivers/staging/zcache TODO list! Since zbudpage consist of two zpages, two zero-filled pages compression contribute to one [eph|pers]pageframe count accumulated. Hi Dan, I'm not sure why this is necessary. The [eph|pers]pageframe count is supposed to be counting actual pageframes used by zcache. Since your patch eliminates the need to store zero pages, no pageframes are needed at all to store zero pages, so it's not necessary to increment zcache_[eph|pers]_pageframes when storing zero pages. Great point! It seems that we also don't need to caculate zcache_[eph|pers]_zpages for zero-filled pages. I will fix it in next version. :-) Hi Wanpeng -- I think we DO need to increment/decrement zcache_[eph|pers]_zpages for zero-filled pages. The main point of the counters for zpages and pageframes is to be able to calculate density == zpages/pageframes. A zero-filled page becomes a zpage that compresses to zero bytes and, as a result, requires zero pageframes for storage. So the zpages counter should be increased but the pageframes counter should not. 
If you are changing the patch anyway, I do like better the use of zero_filled_page rather than just zero or zero page. So it might be good to change: handle_zero_page - handle_zero_filled_page pages_zero - zero_filled_pages zcache_pages_zero - zcache_zero_filled_pages and maybe page_zero_filled - page_is_zero_filled Thanks, Dan -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: zsmalloc limitations and related topics
From: Seth Jennings [mailto:sjenn...@linux.vnet.ibm.com] Subject: Re: zsmalloc limitations and related topics Hi Seth -- Thanks for the reply. I think it is very important to be having these conversations. 2) When not full and especially when nearly-empty _after_ being full, density may fall below 1.0 as a result of fragmentation. True and there are several ways to address this including defragmentation, fewer class sizes in zsmalloc, aging, and/or writeback of zpages in sparse zspages to free pageframes during normal writeback. Yes. And add pageframe-reclaim to this list of things that zsmalloc should do but currently cannot do. The real question is why is pageframe-reclaim a requirement? It is because pageframes are the currency of the MM subsystem. See more below. What operation needs this feature? AFAICT, the pageframe-reclaim requirements is derived from the assumption that some external control path should be able to tell zswap/zcache to evacuate a page, like the shrinker interface. But this introduces a new and complex problem in designing a policy that doesn't shrink the zpage pool so aggressively that it is useless. Unless there is another reason for this functionality I'm missing. That's the reason. IMHO, it is precisely this new and complex problem that we must solve. Otherwise, compression is just a cool toy that may (or may not) help your workload if you turn it on. Zcache already does implement a policy that doesn't shrink the zpage pool so aggressively that it is useless. While I won't claim the policy is the right one, it is a policy, it is not particularly complex, and it is definitely not useless. And it depends on pageframe-reclaim. 3) Zsmalloc has a density of exactly 1.0 for any number of zpages with zsize = 0.8. For this reason zswap does not cache pages which in this range. It is not enforced in the allocator because some users may be forced to store these pages; users like zram. 
Again, without a representative workload, we don't know whether or not it is important to manage pages with zsize = 0.8. You are simply dismissing it as unnecessary because zsmalloc can't handle them and because they don't appear at any measurable frequency in kernbench or SPECjbb. (Zbud _can_ efficiently handle these larger pages under many circumstances... but without a representative workload, we don't know whether or not those circumstances will occur.) The real question is not whether any workload would operate on pages that don't compress to 80%. Any workload that operates on pages of already compressed or encrypted data would do this. The question is, is it worth it to store those pages in the compressed cache since the effective reclaim efficiency approaches 0. You are letting the implementation of zsmalloc color your thinking. Zbud can quite efficiently store pages that compress up to zsize = ((63 * PAGE_SIZE) / 64) because it buddies highly compressible pages with poorly compressible pages. This is also, of course, very zsize-distribution-dependent. (These are not just already-compressed or encrypted data, although those are good examples. Compressibility is related to entropy, and there may be many anonymous pages that have high entropy. We really just don't know.) 4) Zsmalloc contains several compile-time parameters; the best value of these parameters may be very workload dependent. The parameters fall into two major areas, handle computation and class size. The handle can be abstracted away, eliminating the compile-time parameters. The class-size tunable could be changed to a default value with the option for specifying an alternate value from the user during pool creation. Perhaps my point here wasn't clear so let me be more blunt: There's no way in hell that even a very sophisticated user will know how to set these values. I think we need to ensure either that they are always right (which without a representative workload...) 
or, preferably, have some way so that they can dynamically adapt at runtime.

> I think you made the point that if this representative workload is completely undefined, then having tunables for zsmalloc that are always right is also not possible. The best we can hope for is mostly right which, of course, is difficult to get everyone to agree on and will be based on usage.

I agree "always right" is impossible and, as I said, would prefer adaptable. I think zsmalloc and zbud address very different zsize-distributions so some combination may be better than either by itself.

> > If density == 1.0, that means we are paying the overhead of compression+decompression for no space advantage. If density < 1.0, that means using zsmalloc is detrimental, resulting in worse memory pressure than if it were not used.
> >
> > WORKLOAD ANALYSIS
> >
> > These limitations emphasize that the workload used to evaluate zsmalloc is very important.
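The zbud buddying argument above is easy to illustrate with a toy model. This is not the kernel code; the greedy first-fit pairing and the example zsize values are illustrative assumptions only:

```python
# Toy model of zbud-style buddying: each pageframe holds at most two
# zpages whose compressed sizes (zsize) fit together in one page.
# It shows how pairing highly compressible pages with poorly
# compressible ones can keep density at or above 1.0 even when many
# pages compress badly.
PAGE_SIZE = 4096

def zbud_pageframes(zsizes):
    """Greedy first-fit pairing; returns the number of pageframes used."""
    unbuddied = []   # zsizes still waiting for a buddy
    frames = 0
    for z in sorted(zsizes, reverse=True):
        for i, partner in enumerate(unbuddied):
            if partner + z <= PAGE_SIZE:
                unbuddied.pop(i)   # buddy into the existing frame
                break
        else:
            frames += 1            # open a new frame
            unbuddied.append(z)
    return frames

# 50 poorly compressible pages (zsize ~3500) buddied with 50 highly
# compressible ones (zsize ~500):
zsizes = [3500] * 50 + [500] * 50
frames = zbud_pageframes(zsizes)
print(len(zsizes) / frames)   # density = zpages / pageframes
```

For this sample every 3500-byte zpage pairs with a 500-byte one (3500 + 500 <= 4096), so 100 zpages land in 50 frames: density 2.0. The outcome is, as the thread notes, entirely zsize-distribution-dependent.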
RE: zsmalloc limitations and related topics
From: Robert Jennings [mailto:r...@linux.vnet.ibm.com] Sent: Thursday, March 14, 2013 7:21 AM To: Bob Cc: Seth Jennings; Dan Magenheimer; minc...@kernel.org; Nitin Gupta; Konrad Wilk; linux...@kvack.org; linux-kernel@vger.kernel.org; Bob Liu; Luigi Semenzato; Mel Gorman Subject: Re: zsmalloc limitations and related topics

* Bob (bob@oracle.com) wrote: On 03/14/2013 06:59 AM, Seth Jennings wrote: On 03/13/2013 03:02 PM, Dan Magenheimer wrote: From: Robert Jennings [mailto:r...@linux.vnet.ibm.com] Subject: Re: zsmalloc limitations and related topics [snip]

Yes. And add pageframe-reclaim to this list of things that zsmalloc should do but currently cannot do. The real question is why is pageframe-reclaim a requirement? What operation needs this feature? AFAICT, the pageframe-reclaim requirement is derived from the assumption that some external control path should be able to tell zswap/zcache to evacuate a page, like the shrinker interface. But this introduces a new and complex problem in designing a policy that doesn't shrink the zpage pool so aggressively that it is useless. Unless there is another reason for this functionality I'm missing.

Perhaps it's needed if the user wants to enable/disable the memory compression feature dynamically. Eg, use it as a module instead of recompiling the kernel or even rebooting the system.

It's worth thinking about: Under what circumstances would a user want to turn off compression? While unloading a compression module should certainly be allowed if it makes a user comfortable, in my opinion, if a user wants to do that, we have done our job poorly (or there is a bug).

To unload zswap all that is needed is to perform writeback on the pages held in the cache; this can be done by extending the existing writeback code. Actually, frontswap supports this directly. See frontswap_shrink.
-- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: zsmalloc limitations and related topics
From: Dan Magenheimer Subject: RE: zsmalloc limitations and related topics

> > I would welcome ideas on how to evaluate workloads for representativeness. Personally I don't believe we should be making decisions about selecting the best algorithms or merging code without an agreement on workloads.
>
> I'd argue that there is no such thing as a representative workload. Instead, we try different workloads to validate the design and illustrate the performance characteristics and impacts.

Sorry for repeatedly hammering my point in the above, but there have been many design choices driven by what was presumed to be a representative workload (kernbench and now SPECjbb) that may be entirely wrong for a different workload (as Seth once pointed out using the text of Moby Dick as a source data stream).

Further, the value of different designs can't be measured here just by the workload because the pages chosen to swap may be completely independent of the intended workload-driver... i.e. if you track the pid of the pages intended for swap, the pages can be mostly pages from long-running or periodic system services, not pages generated by kernbench or SPECjbb. So it is the workload PLUS the environment that is being measured and evaluated. That makes the problem especially tough.

Just to clarify, I'm not suggesting that there is any single workload that can be called representative, just that we may need both a broad set of workloads (not silly benchmarks) AND some theoretical analysis to drive design decisions. And, without this, arguing about whether zsmalloc is better than zbud or not is silly. Both zbud and zsmalloc have strengths and weaknesses.
That said, it should also be pointed out that the stream of pages-to-compress from cleancache (file pages) may be dramatically different than for frontswap (anonymous pages), so unless you and Seth are going to argue upfront that cleancache pages should NEVER be candidates for compression, the evaluation criteria to drive design decisions needs to encompass both anonymous and file pages. It is currently impossible to evaluate that with zswap.

Sorry to reply to myself here, but I realized last night that I left off another related important point: We have a tendency to run benchmarks on a cold system so that the results are reproducible. For compression, however, this may unnaturally skew the entropy of data-pages-to-be-compressed and so also the density measurements. I can't prove it, but I suspect that soon after boot the number of anonymous pages containing all (or nearly all) zeroes is large, i.e. entropy is low. As the length of time grows since the system booted, more anonymous pages will be written with non-zero data, thus increasing entropy and decreasing compressibility. So, over time, the distribution of zsize may slowly skew right (toward PAGE_SIZE). If so, this effect may be very real but very hard to observe.

Dan
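The entropy point above is easy to demonstrate in a few lines. This sketch uses Python's zlib as a stand-in for the kernel's lzo1x-1 (an assumption; absolute zsize numbers will differ between the two compressors), comparing a zero-filled page, a regular-text page, and a random page:

```python
import os
import zlib

PAGE_SIZE = 4096

zero_page = bytes(PAGE_SIZE)                           # minimal entropy
text_page = (b"Call me Ishmael. " * 256)[:PAGE_SIZE]   # highly regular text
random_page = os.urandom(PAGE_SIZE)                    # maximal entropy

for name, page in [("zero", zero_page), ("text", text_page),
                   ("random", random_page)]:
    zsize = len(zlib.compress(page))
    print(f"{name:6s} zsize={zsize:5d} ratio={PAGE_SIZE / zsize:7.1f}")
```

The zero page compresses to a handful of bytes, the regular text to well under 10% of PAGE_SIZE, and the random page does not compress at all (its "zsize" slightly exceeds PAGE_SIZE) -- which is exactly why benchmarks that ignore data content say little about compressed-memory behavior.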
RE: zsmalloc limitations and related topics
> From: Robert Jennings [mailto:r...@linux.vnet.ibm.com] > Subject: Re: zsmalloc limitations and related topics Hi Robert -- Thanks for the well-considered reply! > * Dan Magenheimer (dan.magenhei...@oracle.com) wrote: > > Hi all -- > > > > I've been doing some experimentation on zsmalloc in preparation > > for my topic proposed for LSFMM13 and have run across some > > perplexing limitations. Those familiar with the intimate details > > of zsmalloc might be well aware of these limitations, but they > > aren't documented or immediately obvious, so I thought it would > > be worthwhile to air them publicly. I've also included some > > measurements from the experimentation and some related thoughts. > > > > (Some of the terms here are unusual and may be used inconsistently > > by different developers so a glossary of definitions of the terms > > used here is appended.) > > > > ZSMALLOC LIMITATIONS > > > > Zsmalloc is used for two zprojects: zram and the out-of-tree > > zswap. Zsmalloc can achieve high density when "full". But: > > > > 1) Zsmalloc has a worst-case density of 0.25 (one zpage per > >four pageframes). > > The design of the allocator results in a trade-off between best case > density and the worst-case which is true for any allocator. For zsmalloc, > the best case density with a 4K page size is 32.0, or 177.0 for a 64K page > size, based on storing a set of zero-filled pages compressed by lzo1x-1. Right. Without a "representative workload", we have no idea whether either my worst-case or your best-case will be relevant. (As an aside, I'm measuring zsize=28 bytes for a zero page... Seth has repeatedly said 103 bytes and I think this is reflected in your computation above. Maybe it is 103 for your hardware compression engine? Else, I'm not sure why our numbers would be different.) > > 2) When not full and especially when nearly-empty _after_ > >being full, density may fall below 1.0 as a result of > >fragmentation. 
> True and there are several ways to address this including
> defragmentation, fewer class sizes in zsmalloc, aging, and/or writeback
> of zpages in sparse zspages to free pageframes during normal writeback.

Yes. And add pageframe-reclaim to this list of things that zsmalloc should do but currently cannot do.

> > 3) Zsmalloc has a density of exactly 1.0 for any number of
> >    zpages with zsize >= 0.8.
>
> For this reason zswap does not cache pages which fall in this range.
> It is not enforced in the allocator because some users may be forced to
> store these pages; users like zram.

Again, without a "representative" workload, we don't know whether or not it is important to manage pages with zsize >= 0.8. You are simply dismissing it as unnecessary because zsmalloc can't handle them and because they don't appear at any measurable frequency in kernbench or SPECjbb. (Zbud _can_ efficiently handle these larger pages under many circumstances... but without a "representative" workload, we don't know whether or not those circumstances will occur.)

> > 4) Zsmalloc contains several compile-time parameters;
> >    the best value of these parameters may be very workload
> >    dependent.
>
> The parameters fall into two major areas, handle computation and class
> size. The handle can be abstracted away, eliminating the compile-time
> parameters. The class-size tunable could be changed to a default value
> with the option for specifying an alternate value from the user during
> pool creation.

Perhaps my point here wasn't clear so let me be more blunt: There's no way in hell that even a very sophisticated user will know how to set these values. I think we need to ensure either that they are "always right" (which without a "representative workload"...) or, preferably, have some way so that they can dynamically adapt at runtime.

> > If density == 1.0, that means we are paying the overhead of
> > compression+decompression for no space advantage.
> > If density < 1.0, that means using zsmalloc is detrimental,
> > resulting in worse memory pressure than if it were not used.
> >
> > WORKLOAD ANALYSIS
> >
> > These limitations emphasize that the workload used to evaluate
> > zsmalloc is very important. Benchmarks that measure data
> > throughput or CPU utilization are of questionable value because
> > it is the _content_ of the data that is particularly relevant
> > for compression. Even more precisely, it is the "entropy"
> > of the data that is relevant, because the amount of
> > compressibility in the data is related to the entropy:
> > I.e. an entirely random pagefull of bits will compress poorly
> > and a highly-regular pagefull of bits will compress well.
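For reference, the worst-case and best-case densities quoted in this exchange reduce to simple arithmetic. The 128-byte class size below is an illustrative assumption, not a value taken from the zsmalloc source:

```python
PAGE_SIZE = 4096

# Worst case from the thread: a zspage spanning four pageframes ends up
# holding a single zpage.
worst_density = 1 / 4            # 0.25 zpages per pageframe

# Best case from the thread: zero-filled pages compress (per Robert's
# lzo1x-1 figure, ~103 bytes) into a small size class; assuming a
# 128-byte class, one pageframe holds PAGE_SIZE / 128 zpages.
best_density = PAGE_SIZE / 128   # 32.0 zpages per pageframe

print(worst_density, best_density)
```

The two-orders-of-magnitude spread between these bounds is exactly why the zsize distribution of the workload, not throughput numbers, determines whether zsmalloc helps or hurts.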
RE: [PATCH 2/4] zcache: zero-filled pages awareness
> From: Wanpeng Li [mailto:liw...@linux.vnet.ibm.com] > Subject: [PATCH 2/4] zcache: zero-filled pages awareness > > Compression of zero-filled pages can unneccessarily cause internal > fragmentation, and thus waste memory. This special case can be > optimized. > > This patch captures zero-filled pages, and marks their corresponding > zcache backing page entry as zero-filled. Whenever such zero-filled > page is retrieved, we fill the page frame with zero. > > Signed-off-by: Wanpeng Li > --- > drivers/staging/zcache/tmem.c|4 +- > drivers/staging/zcache/tmem.h|5 ++ > drivers/staging/zcache/zcache-main.c | 87 > ++ > 3 files changed, 85 insertions(+), 11 deletions(-) > > diff --git a/drivers/staging/zcache/tmem.c b/drivers/staging/zcache/tmem.c > index a2b7e03..62468ea 100644 > --- a/drivers/staging/zcache/tmem.c > +++ b/drivers/staging/zcache/tmem.c > @@ -597,7 +597,9 @@ int tmem_put(struct tmem_pool *pool, struct tmem_oid > *oidp, uint32_t index, > if (unlikely(ret == -ENOMEM)) > /* may have partially built objnode tree ("stump") */ > goto delete_and_free; > - (*tmem_pamops.create_finish)(pampd, is_ephemeral(pool)); > + if (pampd != (void *)ZERO_FILLED) > + (*tmem_pamops.create_finish)(pampd, is_ephemeral(pool)); > + > goto out; > > delete_and_free: > diff --git a/drivers/staging/zcache/tmem.h b/drivers/staging/zcache/tmem.h > index adbe5a8..6719dbd 100644 > --- a/drivers/staging/zcache/tmem.h > +++ b/drivers/staging/zcache/tmem.h > @@ -204,6 +204,11 @@ struct tmem_handle { > uint16_t client_id; > }; > > +/* > + * mark pampd to special vaule in order that later > + * retrieve will identify zero-filled pages > + */ > +#define ZERO_FILLED 0x2 You can avoid changing tmem.[ch] entirely by moving this definition into zcache-main.c and by moving the check comparing pampd against ZERO_FILLED into zcache_pampd_create_finish() I think that would be cleaner... 
If you change this and make the pageframe counter fix for PATCH 4/4, please add my ack for the next version:

Acked-by: Dan Magenheimer
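The zero-filled-page special case under discussion can be sketched outside the kernel like this. This is illustrative Python, not the zcache code; the "ZERO_FILLED" marker string stands in for the patch's special pampd value:

```python
PAGE_SIZE = 4096

def page_is_zero_filled(page: bytes) -> bool:
    # Analogous to scanning the mapped page before compressing it, as the
    # patch does: a zero-filled page is recorded with just a marker (no
    # zpage is allocated or compressed at all).
    return page == bytes(len(page))

def store(page: bytes):
    if page_is_zero_filled(page):
        return "ZERO_FILLED"              # marker only; nothing stored
    raise NotImplementedError("compress and store a real zpage here")

def retrieve(entry) -> bytes:
    if entry == "ZERO_FILLED":
        return bytes(PAGE_SIZE)           # refill the pageframe with zeroes
    raise NotImplementedError("decompress the stored zpage here")

entry = store(bytes(PAGE_SIZE))
print(retrieve(entry) == bytes(PAGE_SIZE))
```

This also shows why the special case saves memory: a zero page costs one marker rather than a compressed zpage plus allocator overhead.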
RE: [PATCH 4/4] zcache: add pageframes count once compress zero-filled pages twice
> From: Wanpeng Li [mailto:liw...@linux.vnet.ibm.com]
> Sent: Wednesday, March 13, 2013 1:05 AM
> To: Andrew Morton
> Cc: Greg Kroah-Hartman; Dan Magenheimer; Seth Jennings; Konrad Rzeszutek Wilk; Minchan Kim; linux-m...@kvack.org; linux-kernel@vger.kernel.org; Wanpeng Li
> Subject: [PATCH 4/4] zcache: add pageframes count once compress zero-filled pages twice

Hi Wanpeng --

Thanks for taking on this task from the drivers/staging/zcache TODO list!

> Since zbudpage consist of two zpages, two zero-filled pages compression
> contribute to one [eph|pers]pageframe count accumulated.

I'm not sure why this is necessary. The [eph|pers]pageframe count is supposed to be counting actual pageframes used by zcache. Since your patch eliminates the need to store zero pages, no pageframes are needed at all to store zero pages, so it's not necessary to increment zcache_[eph|pers]_pageframes when storing zero pages. Or am I misunderstanding your intent?

Thanks,
Dan

> Signed-off-by: Wanpeng Li
> ---
>  drivers/staging/zcache/zcache-main.c | 25 +++--
>  1 files changed, 23 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/staging/zcache/zcache-main.c b/drivers/staging/zcache/zcache-main.c
> index dd52975..7860ff0 100644
> --- a/drivers/staging/zcache/zcache-main.c
> +++ b/drivers/staging/zcache/zcache-main.c
> @@ -544,6 +544,8 @@ static struct page *zcache_evict_eph_pageframe(void);
>  static void *zcache_pampd_eph_create(char *data, size_t size, bool raw,
>  				struct tmem_handle *th)
>  {
> +	static ssize_t second_eph_zero_page;
> +	static atomic_t second_eph_zero_page_atomic = ATOMIC_INIT(0);
>  	void *pampd = NULL, *cdata = data;
>  	unsigned clen = size;
>  	bool zero_filled = false;
> @@ -561,7 +563,14 @@ static void *zcache_pampd_eph_create(char *data, size_t size, bool raw,
>  		clen = 0;
>  		zero_filled = true;
>  		zcache_pages_zero++;
> -		goto got_pampd;
> +		second_eph_zero_page = atomic_inc_return(
> +			&second_eph_zero_page_atomic);
> +		if (second_eph_zero_page % 2 == 1)
> +			goto got_pampd;
> +		else {
> +			atomic_sub(2, &second_eph_zero_page_atomic);
> +			goto count_zero_page;
> +		}
>  	}
>  	kunmap_atomic(user_mem);
>
> @@ -597,6 +606,7 @@ static void *zcache_pampd_eph_create(char *data, size_t size, bool raw,
>  create_in_new_page:
>  	pampd = (void *)zbud_create_prep(th, true, cdata, clen, newpage);
>  	BUG_ON(pampd == NULL);
> +count_zero_page:
>  	zcache_eph_pageframes =
>  		atomic_inc_return(&zcache_eph_pageframes_atomic);
>  	if (zcache_eph_pageframes > zcache_eph_pageframes_max)
> @@ -621,6 +631,8 @@ out:
>  static void *zcache_pampd_pers_create(char *data, size_t size, bool raw,
>  				struct tmem_handle *th)
>  {
> +	static ssize_t second_pers_zero_page;
> +	static atomic_t second_pers_zero_page_atomic = ATOMIC_INIT(0);
>  	void *pampd = NULL, *cdata = data;
>  	unsigned clen = size, zero_filled = 0;
>  	struct page *page = (struct page *)(data), *newpage;
> @@ -644,7 +656,15 @@ static void *zcache_pampd_pers_create(char *data, size_t size, bool raw,
>  		clen = 0;
>  		zero_filled = 1;
>  		zcache_pages_zero++;
> -		goto got_pampd;
> +		second_pers_zero_page = atomic_inc_return(
> +			&second_pers_zero_page_atomic);
> +		if (second_pers_zero_page % 2 == 1)
> +			goto got_pampd;
> +		else {
> +			atomic_sub(2, &second_pers_zero_page_atomic);
> +			goto count_zero_page;
> +		}
> +
>  	}
>  	kunmap_atomic(user_mem);
>
> @@ -698,6 +718,7 @@ create_pampd:
>  create_in_new_page:
>  	pampd = (void *)zbud_create_prep(th, false, cdata, clen, newpage);
>  	BUG_ON(pampd == NULL);
> +count_zero_page:
>  	zcache_pers_pageframes =
>  		atomic_inc_return(&zcache_pers_pageframes_atomic);
>  	if (zcache_pers_pageframes > zcache_pers_pageframes_max)
> --
> 1.7.7.6
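As I read the patch, the accounting it introduces can be modeled in a few lines. This is a toy model, not the kernel code, and it also illustrates Dan's objection that the increment is unnecessary:

```python
# The patch bumps the pageframe counter on every second zero-filled page,
# on the reasoning that a zbud pageframe holds two zpages.  Dan's point:
# zero-filled pages are stored only as a flag, so they occupy no zbud
# pageframe at all and the counter should not move for them.

def frames_counted_by_patch(n_zero_pages: int) -> int:
    frames = 0
    for i in range(1, n_zero_pages + 1):
        if i % 2 == 0:     # the patch's "else" branch: every even page
            frames += 1
    return frames

def frames_actually_used(n_zero_pages: int) -> int:
    return 0               # zero pages consume no pageframe in zcache

print(frames_counted_by_patch(10), frames_actually_used(10))
```

For ten zero-filled pages the patch's scheme reports five pageframes while zero are actually consumed, which is the discrepancy Dan's review points at.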
RE: zsmalloc limitations and related topics
From: Robert Jennings [mailto:r...@linux.vnet.ibm.com] Subject: Re: zsmalloc limitations and related topics Hi Robert -- Thanks for the well-considered reply! * Dan Magenheimer (dan.magenhei...@oracle.com) wrote: Hi all -- I've been doing some experimentation on zsmalloc in preparation for my topic proposed for LSFMM13 and have run across some perplexing limitations. Those familiar with the intimate details of zsmalloc might be well aware of these limitations, but they aren't documented or immediately obvious, so I thought it would be worthwhile to air them publicly. I've also included some measurements from the experimentation and some related thoughts. (Some of the terms here are unusual and may be used inconsistently by different developers so a glossary of definitions of the terms used here is appended.) ZSMALLOC LIMITATIONS Zsmalloc is used for two zprojects: zram and the out-of-tree zswap. Zsmalloc can achieve high density when full. But: 1) Zsmalloc has a worst-case density of 0.25 (one zpage per four pageframes). The design of the allocator results in a trade-off between best case density and the worst-case which is true for any allocator. For zsmalloc, the best case density with a 4K page size is 32.0, or 177.0 for a 64K page size, based on storing a set of zero-filled pages compressed by lzo1x-1. Right. Without a representative workload, we have no idea whether either my worst-case or your best-case will be relevant. (As an aside, I'm measuring zsize=28 bytes for a zero page... Seth has repeatedly said 103 bytes and I think this is reflected in your computation above. Maybe it is 103 for your hardware compression engine? Else, I'm not sure why our numbers would be different.) 2) When not full and especially when nearly-empty _after_ being full, density may fall below 1.0 as a result of fragmentation. 
True and there are several ways to address this including defragmentation, fewer class sizes in zsmalloc, aging, and/or writeback of zpages in sparse zspages to free pageframes during normal writeback. Yes. And add pageframe-reclaim to this list of things that zsmalloc should do but currently cannot do. 3) Zsmalloc has a density of exactly 1.0 for any number of zpages with zsize >= 0.8. For this reason zswap does not cache pages which fall in this range. It is not enforced in the allocator because some users may be forced to store these pages; users like zram. Again, without a representative workload, we don't know whether or not it is important to manage pages with zsize >= 0.8. You are simply dismissing it as unnecessary because zsmalloc can't handle them and because they don't appear at any measurable frequency in kernbench or SPECjbb. (Zbud _can_ efficiently handle these larger pages under many circumstances... but without a representative workload, we don't know whether or not those circumstances will occur.) 4) Zsmalloc contains several compile-time parameters; the best value of these parameters may be very workload dependent. The parameters fall into two major areas, handle computation and class size. The handle can be abstracted away, eliminating the compile-time parameters. The class-size tunable could be changed to a default value with the option for specifying an alternate value from the user during pool creation. Perhaps my point here wasn't clear so let me be more blunt: There's no way in hell that even a very sophisticated user will know how to set these values. I think we need to ensure either that they are always right (which without a representative workload...) or, preferably, have some way so that they can dynamically adapt at runtime. If density == 1.0, that means we are paying the overhead of compression+decompression for no space advantage.
If density < 1.0, that means using zsmalloc is detrimental, resulting in worse memory pressure than if it were not used.

WORKLOAD ANALYSIS

These limitations emphasize that the workload used to evaluate zsmalloc is very important. Benchmarks that measure data throughput or CPU utilization are of questionable value because it is the _content_ of the data that is particularly relevant for compression. Even more precisely, it is the entropy of the data that is relevant, because the amount of compressibility in the data is related to the entropy: I.e. an entirely random pagefull of bits will compress poorly and a highly-regular pagefull of bits will compress well. Since the zprojects manage a large number of zpages, both the mean and distribution of zsize of the workload should be representative. The workload most widely used to publish results for the various zprojects is a kernel-compile using make -jN where N is artificially increased to impose memory pressure. By adding some debug code to zswap, I was able to analyze this workload and found the following: 1
RE: [PATCHv7 4/8] zswap: add to mm/
> From: Seth Jennings [mailto:sjenn...@linux.vnet.ibm.com] > To: Dave Hansen > Subject: Re: [PATCHv7 4/8] zswap: add to mm/ > > On 03/07/2013 01:00 PM, Dave Hansen wrote: > > On 03/06/2013 07:52 AM, Seth Jennings wrote: > > ... > >> +**/ > >> +/* attempts to compress and store an single page */ > >> +static int zswap_frontswap_store(unsigned type, pgoff_t offset, > >> + struct page *page) > >> +{ > > ... > >> + /* store */ > >> + handle = zs_malloc(tree->pool, dlen, > >> + __GFP_NORETRY | __GFP_HIGHMEM | __GFP_NOMEMALLOC | > >> + __GFP_NOWARN); > >> + if (!handle) { > >> + zswap_reject_zsmalloc_fail++; > >> + ret = -ENOMEM; > >> + goto putcpu; > >> + } > >> + > > > > I think there needs to at least be some strong comments in here about > > why you're doing this kind of allocation. From some IRC discussion, it > > seems like you found some pathological case where zswap wasn't helping > > make reclaim progress and ended up draining the reserve pools and you > > did this to avoid draining the reserve pools. > > I'm currently doing some tests with fewer zsmalloc class sizes and > removing __GFP_NOMEMALLOC to see the effect. Zswap/zcache/frontswap are greedy, at times almost violently so. Using emergency reserves seems like a sure way to OOM depending on the workload (and luck). I did some class size experiments too without seeing much advantage. But without a range of "representative" data streams, it's very hard to claim any experiment is successful. I've got some ideas on combining the best of zsmalloc and zbud but they are still a little raw. > > I think the lack of progress doing reclaim is really the root cause you > > should be going after here instead of just working around the symptom. Dave, agreed. 
See http://marc.info/?l=linux-mm&m=136147977602561&w=2 and the PAGEFRAME EVACUATION subsection of http://marc.info/?l=linux-mm&m=136200745931284&w=2
RE: zsmalloc limitations and related topics
> From: Ric Mason [mailto:ric.mas...@gmail.com]
> Subject: Re: zsmalloc limitations and related topics
>
> On 02/28/2013 07:24 AM, Dan Magenheimer wrote:
> > Hi all --
> >
> > I've been doing some experimentation on zsmalloc in preparation
> > for my topic proposed for LSFMM13 and have run across some
> > perplexing limitations.  Those familiar with the intimate details
> > of zsmalloc might be well aware of these limitations, but they
> > aren't documented or immediately obvious, so I thought it would
> > be worthwhile to air them publicly.  I've also included some
> > measurements from the experimentation and some related thoughts.
> >
> > (Some of the terms here are unusual and may be used inconsistently
> > by different developers, so a glossary of definitions of the terms
> > used here is appended.)
> >
> > ZSMALLOC LIMITATIONS
> >
> > Zsmalloc is used for two zprojects: zram and the out-of-tree
> > zswap.  Zsmalloc can achieve high density when "full".  But:
> >
> > 1) Zsmalloc has a worst-case density of 0.25 (one zpage per
> >    four pageframes).
> > 2) When not full, and especially when nearly-empty _after_
> >    being full, density may fall below 1.0 as a result of
> >    fragmentation.
>
> What's the meaning of nearly-empty _after_ being full?

Step 1: Add a few (N) pages to zsmalloc.  It is "nearly empty".
Step 2: Now add many more pages to zsmalloc until allocation limits
are reached.  It is "full".
Step 3: Now remove many pages from zsmalloc until there are N pages
remaining.  It is now "nearly empty after being full".

Fragmentation characteristics are different comparing after Step 1
and after Step 3 even though, in both cases, zsmalloc contains N pages.

> > 3) Zsmalloc has a density of exactly 1.0 for any number of
> >    zpages with zsize >= 0.8 (i.e. a compressed size of at least
> >    80% of a page).
> > 4) Zsmalloc contains several compile-time parameters; the best
> >    value of these parameters may be very workload dependent.
> > If density == 1.0, that means we are paying the overhead of
> > compression+decompression for no space advantage.  If
> > density < 1.0, that means using zsmalloc is detrimental,
> > resulting in worse memory pressure than if it were not used.
> >
> > WORKLOAD ANALYSIS
> >
> > These limitations emphasize that the workload used to evaluate
> > zsmalloc is very important.  Benchmarks that measure data
>
> Could you share your benchmark, so that others can take advantage
> of it?

As Seth does, I just used "make" of a kernel.  I run it on a full
graphical installation of EL6.  To ensure there is memory pressure,
I limit physical memory to 1GB and use "make -j20".

> > throughput or CPU utilization are of questionable value because
> > it is the _content_ of the data that is particularly relevant
> > for compression.  Even more precisely, it is the "entropy"
> > of the data that is relevant, because the amount of
> > compressibility in the data is related to the entropy:
> > i.e. an entirely random pagefull of bits will compress poorly
> > and a highly-regular pagefull of bits will compress well.
> > Since the zprojects manage a large number of zpages, both
> > the mean and distribution of zsize of the workload should
> > be "representative".
> >
> > The workload most widely used to publish results for
> > the various zprojects is a kernel-compile using "make -jN",
> > where N is artificially increased to impose memory pressure.
> > By adding some debug code to zswap, I was able to analyze
> > this workload and found the following:
> >
> > 1) The average page compressed by almost a factor of six
> >    (mean zsize == 694, stddev == 474)
>
> stddev is what?

Standard deviation.
See: http://en.wikipedia.org/wiki/Standard_deviation