RE: [PATCHv13 3/4] zswap: add to mm/
> From: Bob Liu [mailto:lliu...@gmail.com]
> Subject: Re: [PATCHv13 3/4] zswap: add to mm/
>
> On Thu, Jun 20, 2013 at 10:23 PM, Seth Jennings
> <sjenn...@linux.vnet.ibm.com> wrote:
> > On Thu, Jun 20, 2013 at 05:42:04PM +0800, Bob Liu wrote:
> >> > Just made a mmtests run of my own and got very different results:
> >>
> >> It's strange, I'll update to rc6 and try again.
> >> By the way, are you using the 842 hardware compressor instead of lzo?
> >
> > My results were using lzo software compression.
>
> Thanks, and today I used another machine to test zswap.
> The total ram size of that machine is around 4G.
> This time the result is better:
>
>                                      rc6                   rc6
>                                    zswap                  base
> Ops memcachetest-0M        14619.00 (  0.00%)    15602.00 (   6.72%)
> Ops memcachetest-435M      14727.00 (  0.00%)    15860.00 (   7.69%)
> Ops memcachetest-944M      12452.00 (  0.00%)    11812.00 (  -5.14%)
> Ops memcachetest-1452M     12183.00 (  0.00%)     9829.00 ( -19.32%)
> Ops memcachetest-1961M     11953.00 (  0.00%)     9337.00 ( -21.89%)
> Ops memcachetest-2469M     11201.00 (  0.00%)     7509.00 ( -32.96%)
> Ops memcachetest-2978M      9738.00 (  0.00%)     5981.00 ( -38.58%)
> Ops io-duration-0M             0.00 (  0.00%)        0.00 (   0.00%)
> Ops io-duration-435M          10.00 (  0.00%)        6.00 (  40.00%)
> Ops io-duration-944M          19.00 (  0.00%)       19.00 (   0.00%)
> Ops io-duration-1452M         31.00 (  0.00%)       26.00 (  16.13%)
> Ops io-duration-1961M         40.00 (  0.00%)       35.00 (  12.50%)
> Ops io-duration-2469M         45.00 (  0.00%)       43.00 (   4.44%)
> Ops io-duration-2978M         58.00 (  0.00%)       53.00 (   8.62%)
> Ops swaptotal-0M           56711.00 (  0.00%)        8.00 (  99.99%)
> Ops swaptotal-435M         19218.00 (  0.00%)     2101.00 (  89.07%)
> Ops swaptotal-944M         53233.00 (  0.00%)    98055.00 ( -84.20%)
> Ops swaptotal-1452M        52064.00 (  0.00%)   145624.00 (-179.70%)
> Ops swaptotal-1961M        54960.00 (  0.00%)   153907.00 (-180.03%)
> Ops swaptotal-2469M        57485.00 (  0.00%)   176340.00 (-206.76%)
> Ops swaptotal-2978M        77704.00 (  0.00%)   182996.00 (-135.50%)
> Ops swapin-0M              24834.00 (  0.00%)        8.00 (  99.97%)
> Ops swapin-435M             9038.00 (  0.00%)        0.00 (   0.00%)
> Ops swapin-944M            26230.00 (  0.00%)    42953.00 ( -63.76%)
> Ops swapin-1452M           25766.00 (  0.00%)    68440.00 (-165.62%)
> Ops swapin-1961M           27258.00 (  0.00%)    68129.00 (-149.94%)
> Ops swapin-2469M           28508.00 (  0.00%)    82234.00 (-188.46%)
> Ops swapin-2978M           37970.00 (  0.00%)    89280.00 (-135.13%)
> Ops minorfaults-0M       1460163.00 (  0.00%)   927966.00 (  36.45%)
> Ops minorfaults-435M      954058.00 (  0.00%)   936182.00 (   1.87%)
> Ops minorfaults-944M      972818.00 (  0.00%)  1005956.00 (  -3.41%)
> Ops minorfaults-1452M     966597.00 (  0.00%)  1035465.00 (  -7.12%)
> Ops minorfaults-1961M     976158.00 (  0.00%)  1049441.00 (  -7.51%)
> Ops minorfaults-2469M     967815.00 (  0.00%)  1051752.00 (  -8.67%)
> Ops minorfaults-2978M     988712.00 (  0.00%)  1034615.00 (  -4.64%)
> Ops majorfaults-0M          5899.00 (  0.00%)        9.00 (  99.85%)
> Ops majorfaults-435M        2684.00 (  0.00%)       67.00 (  97.50%)
> Ops majorfaults-944M        4380.00 (  0.00%)     5790.00 ( -32.19%)
> Ops majorfaults-1452M       4161.00 (  0.00%)     9222.00 (-121.63%)
> Ops majorfaults-1961M       4435.00 (  0.00%)     8800.00 ( -98.42%)
> Ops majorfaults-2469M       4555.00 (  0.00%)    10541.00 (-131.42%)
> Ops majorfaults-2978M       6182.00 (  0.00%)    11618.00 ( -87.93%)
>
> But the performance of the first machine I used, whose total ram size
> is 2G, is still bad.
> I need more time to summarize those testing results.
>
> Maybe you can also have a try with lower total ram size.
>
> --
> Regards,
> --Bob

A very important factor that you are not considering, and that might
account for your different results, is the "initial conditions".  For
example, I always ran my benchmarks after a default-configured EL6 boot,
which launches many services at boot time, each of which creates many
anonymous pages, and these "service anonymous pages" are often the pages
that are selected by LRU for swapping, and compressed by zcache/zswap.
Someone else may run the benchmarks on a minimally-configured embedded
system, and someone else on a single-user system with no services
running at all.
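For anyone trying to reproduce the comparison, the percentage column in an
mmtests report like the one above can be recomputed from the two raw value
columns.  A small sketch; note the sign convention used here (gain relative
to the first column, inverted for lower-is-better metrics such as durations,
swap counts, and fault counts) is inferred from the numbers in the table,
not stated anywhere in the thread:

```shell
# pct_gain BASELINE OTHER HIGHER_IS_BETTER(1|0)
# Recomputes mmtests' percentage column: a positive result means OTHER
# improved over BASELINE.  For lower-is-better metrics (io-duration,
# swaptotal, swapin, minorfaults, majorfaults) pass 0 to invert the sign.
pct_gain() {
  awk -v b="$1" -v n="$2" -v h="$3" \
      'BEGIN { d = (h ? n - b : b - n); printf "%.2f\n", d / b * 100 }'
}

# Spot-check three rows from the table in the mail:
pct_gain 14619 15602 1   # memcachetest-0M   -> 6.72
pct_gain 56711 8     0   # swaptotal-0M      -> 99.99
pct_gain 4161  9222  0   # majorfaults-1452M -> -121.63
```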
RE: [PATCHv12 2/4] zbud: add to mm/
> From: Andrew Morton [mailto:a...@linux-foundation.org]
> Subject: Re: [PATCHv12 2/4] zbud: add to mm/
>
> On Wed, 29 May 2013 15:42:36 -0500 Seth Jennings wrote:
>
> > > > > I worry about any code which independently looks at the pageframe
> > > > > tables and expects to find page structs there.  One example is
> > > > > probably memory_failure() but there are probably others.
> > >
> > > ^^ this, please.  It could be kinda fatal.
> >
> > I'll look into this.
> >
> > The expected behavior is that memory_failure() should handle zbud pages
> > in the same way that it handles in-use slub/slab/slob pages and return
> > -EBUSY.
>
> memory_failure() is merely an example of a general problem: code which
> reads from the memmap[] array and expects its elements to be of type
> `struct page'.  Other examples might be memory hotplugging, memory leak
> checkers etc.  I have vague memories of out-of-tree patches
> (bigphysarea?) doing this as well.
>
> It's a general problem to which we need a general solution.

Obi-tmem Kenobe slowly materializes... use the force, Luke!

One could reasonably argue that any code that makes incorrect assumptions
about the contents of a struct page structure is buggy and should be
fixed.  Isn't the "general solution" already described in the following
comment, excerpted from include/linux/mm.h, which implies that "scribbling
on existing pageframes" [carefully] is fine?  (And, if not, shouldn't that
comment be fixed, or am I misreading it?)

(begin excerpt from include/linux/mm.h)

 * For the non-reserved pages, page_count(page) denotes a reference count.
 *   page_count() == 0 means the page is free. page->lru is then used for
 *   freelist management in the buddy allocator.
 *   page_count() > 0  means the page has been allocated.
 *
 * Pages are allocated by the slab allocator in order to provide memory
 * to kmalloc and kmem_cache_alloc. In this case, the management of the
 * page, and the fields in 'struct page' are the responsibility of mm/slab.c
 * unless a particular usage is carefully commented. (the responsibility of
 * freeing the kmalloc memory is the caller's, of course).
 *
 * A page may be used by anyone else who does a __get_free_page().
 * In this case, page_count still tracks the references, and should only
 * be used through the normal accessor functions. The top bits of page->flags
 * and page->virtual store page management information, but all other fields
 * are unused and could be used privately, carefully. The management of this
 * page is the responsibility of the one who allocated it, and those who have
 * subsequently been given references to it.
 *
 * The other pages (we may call them "pagecache pages") are completely
 * managed by the Linux memory manager: I/O, buffers, swapping etc.
 * The following discussion applies only to them.

(end excerpt)

Obi-tmem Kenobe slowly dematerializes

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Bye bye Mr tmem guy
Hi Linux kernel folks and Xen folks --

Effective July 5, I will be resigning from Oracle and "retiring" for a
minimum of 12-18 months and probably/hopefully much longer.  Between now
and July 5, I will be tying up loose ends related to my patches but also
using up accrued vacation days.  If you have a loose end you'd like to
see tied, please let me know ASAP and I will do my best.

After July 5, any email to me via first DOT last AT oracle DOT com will
go undelivered and may bounce.  Please send email related to my open
source patches and contributions to Konrad Wilk and/or Bob Liu.  Personal
email directed to me can be sent to first AT last DOT com.

Thanks much to everybody for the many educational opportunities, the
technical and political jousting, and the great times at conferences and
summits!  I wish you all the best of luck!  Or to quote Douglas Adams:
"So long and thanks for all the fish!"

Cheers,
Dan Magenheimer
The Transcendent Memory ("tmem") guy

Tmem-related historical webography:
http://lwn.net/Articles/454795/
http://lwn.net/Articles/475681/
http://lwn.net/Articles/545244/
https://oss.oracle.com/projects/tmem/
http://www.linux-kvm.org/wiki/images/d/d7/TmemNotVirt-Linuxcon2011-Final.pdf
http://lwn.net/Articles/465317/
http://lwn.net/Articles/340080/
http://lwn.net/Articles/386090/
http://www.xen.org/files/xensummit_oracle09/xensummit_transmemory.pdf
https://oss.oracle.com/projects/tmem/dist/documentation/presentations/TranscendentMemoryXenSummit2010.pdf
https://blogs.oracle.com/wim/entry/example_of_transcendent_memory_and
https://blogs.oracle.com/wim/entry/another_feature_hit_mainline_linux
https://blogs.oracle.com/wim/entry/from_the_research_department_ramster
http://streaming.oracle.com/ebn/podcasts/media/11663326_VM_Linux_042512.mp3
https://oss.oracle.com/projects/tmem/dist/documentation/papers/overcommit.pdf
http://static.usenix.org/event/wiov08/tech/full_papers/magenheimer/magenheimer_html/
RE: [PATCH] staging: ramster: add how-to document
Hey Greg --

Since this is documentation only and documents existing behavior, I'm not
clear whether it is acceptable for an rcN release in the current cycle or
must wait until the next window.  Since it is a new file, it should apply
to either so I'll leave the choice up to you.

Thanks,
Dan

> From: Dan Magenheimer [mailto:dan.magenhei...@oracle.com]
> Sent: Monday, May 20, 2013 8:52 AM
> To: de...@linuxdriverproject.org; linux-kernel@vger.kernel.org;
>     gre...@linuxfoundation.org; linux-m...@kvack.org;
>     konrad.w...@oracle.com; dan.magenhei...@oracle.com;
>     liw...@linux.vnet.ibm.com; bob@oracle.com
> Subject: [PATCH] staging: ramster: add how-to document
>
> Add how-to documentation that provides a step-by-step guide
> for configuring and trying out a ramster cluster.
>
> Signed-off-by: Dan Magenheimer
> ---
>  drivers/staging/zcache/ramster/ramster-howto.txt | 366 ++
>  1 files changed, 366 insertions(+), 0 deletions(-)
>  create mode 100644 drivers/staging/zcache/ramster/ramster-howto.txt
>
> diff --git a/drivers/staging/zcache/ramster/ramster-howto.txt b/drivers/staging/zcache/ramster/ramster-howto.txt
> new file mode 100644
> index 000..7b1ee3b
> --- /dev/null
> +++ b/drivers/staging/zcache/ramster/ramster-howto.txt
> @@ -0,0 +1,366 @@
> +			RAMSTER HOW-TO
> +
> +Author: Dan Magenheimer
> +Ramster maintainer: Konrad Wilk
> +
> +This is a HOWTO document for ramster which, as of this writing, is in
> +the kernel as a subdirectory of zcache in drivers/staging, called ramster.
> +(Zcache can be built with or without ramster functionality.)  If enabled
> +and properly configured, ramster allows memory capacity load balancing
> +across multiple machines in a cluster.  Further, the ramster code serves
> +as an example of asynchronous access for zcache (as well as cleancache and
> +frontswap) that may prove useful for future transcendent memory
> +implementations, such as KVM and NVRAM.  While ramster works today on
> +any network connection that supports kernel sockets, its features may
> +become more interesting on future high-speed fabrics/interconnects.
> +
> +Ramster requires both kernel and userland support.  The userland support,
> +called ramster-tools, is known to work with EL6-based distros, but is a
> +set of poorly-hacked slightly-modified cluster tools based on ocfs2, which
> +includes an init file, a config file, and a userland binary that interfaces
> +to the kernel.  This state of userland support reflects the abysmal userland
> +skills of this suitably-embarrassed author; any help/patches to turn
> +ramster-tools into more distributable rpms/debs useful for a wider range
> +of distros would be appreciated.  The source RPM that can be used as a
> +starting point is available at:
> +    http://oss.oracle.com/projects/tmem/files/RAMster/
> +
> +As a result of this author's ignorance, userland setup described in this
> +HOWTO assumes an EL6 distro and is described in EL6 syntax.  Apologies
> +if this offends anyone!
> +
> +Kernel support has only been tested on x86_64.  Systems with an active
> +ocfs2 filesystem should work, but since ramster leverages a lot of
> +code from ocfs2, there may be latent issues.  A kernel configuration that
> +includes CONFIG_OCFS2_FS should build OK, and should certainly run OK
> +if no ocfs2 filesystem is mounted.
> +
> +This HOWTO demonstrates memory capacity load balancing for a two-node
> +cluster, where one node called the "local" node becomes overcommitted
> +and the other node called the "remote" node provides additional RAM
> +capacity for use by the local node.  Ramster is capable of more complex
> +topologies; see the last section titled "ADVANCED RAMSTER TOPOLOGIES".
> +
> +If you find any terms in this HOWTO unfamiliar or don't understand the
> +motivation for ramster, the following LWN reading is recommended:
> +-- Transcendent Memory in a Nutshell (lwn.net/Articles/454795)
> +-- The future calculus of memory management (lwn.net/Articles/475681)
> +And since ramster is built on top of zcache, this article may be helpful:
> +-- In-kernel memory compression (lwn.net/Articles/545244)
> +
> +Now that you've memorized the contents of those articles, let's get started!
> +
> +A. PRELIMINARY
> +
> +1) Install two x86_64 Linux systems that are known to work when
> +   upgraded to a recent upstream Linux kernel version.
> +
> +On each system:
> +
> +2) Configure, build and install, then boot Linux, just to ensure it
> +   can be done with an unmodified upstream kernel.  Confirm you booted
> +   the upstream kernel with "uname -a".
> +
> +3) If you plan to do any performance testing or unless you plan to
> +   test only swapping, the "WasActive" patch is also highly recommended.
[PATCH] staging: ramster: add how-to document
Add how-to documentation that provides a step-by-step guide
for configuring and trying out a ramster cluster.

Signed-off-by: Dan Magenheimer <dan.magenhei...@oracle.com>
---
 drivers/staging/zcache/ramster/ramster-howto.txt | 366 ++
 1 files changed, 366 insertions(+), 0 deletions(-)
 create mode 100644 drivers/staging/zcache/ramster/ramster-howto.txt

diff --git a/drivers/staging/zcache/ramster/ramster-howto.txt b/drivers/staging/zcache/ramster/ramster-howto.txt
new file mode 100644
index 000..7b1ee3b
--- /dev/null
+++ b/drivers/staging/zcache/ramster/ramster-howto.txt
@@ -0,0 +1,366 @@
+			RAMSTER HOW-TO
+
+Author: Dan Magenheimer
+Ramster maintainer: Konrad Wilk <konrad.w...@oracle.com>
+
+This is a HOWTO document for ramster which, as of this writing, is in
+the kernel as a subdirectory of zcache in drivers/staging, called ramster.
+(Zcache can be built with or without ramster functionality.)  If enabled
+and properly configured, ramster allows memory capacity load balancing
+across multiple machines in a cluster.  Further, the ramster code serves
+as an example of asynchronous access for zcache (as well as cleancache and
+frontswap) that may prove useful for future transcendent memory
+implementations, such as KVM and NVRAM.  While ramster works today on
+any network connection that supports kernel sockets, its features may
+become more interesting on future high-speed fabrics/interconnects.
+
+Ramster requires both kernel and userland support.  The userland support,
+called ramster-tools, is known to work with EL6-based distros, but is a
+set of poorly-hacked slightly-modified cluster tools based on ocfs2, which
+includes an init file, a config file, and a userland binary that interfaces
+to the kernel.  This state of userland support reflects the abysmal userland
+skills of this suitably-embarrassed author; any help/patches to turn
+ramster-tools into more distributable rpms/debs useful for a wider range
+of distros would be appreciated.  The source RPM that can be used as a
+starting point is available at:
+    http://oss.oracle.com/projects/tmem/files/RAMster/
+
+As a result of this author's ignorance, userland setup described in this
+HOWTO assumes an EL6 distro and is described in EL6 syntax.  Apologies
+if this offends anyone!
+
+Kernel support has only been tested on x86_64.  Systems with an active
+ocfs2 filesystem should work, but since ramster leverages a lot of
+code from ocfs2, there may be latent issues.  A kernel configuration that
+includes CONFIG_OCFS2_FS should build OK, and should certainly run OK
+if no ocfs2 filesystem is mounted.
+
+This HOWTO demonstrates memory capacity load balancing for a two-node
+cluster, where one node called the "local" node becomes overcommitted
+and the other node called the "remote" node provides additional RAM
+capacity for use by the local node.  Ramster is capable of more complex
+topologies; see the last section titled "ADVANCED RAMSTER TOPOLOGIES".
+
+If you find any terms in this HOWTO unfamiliar or don't understand the
+motivation for ramster, the following LWN reading is recommended:
+-- Transcendent Memory in a Nutshell (lwn.net/Articles/454795)
+-- The future calculus of memory management (lwn.net/Articles/475681)
+And since ramster is built on top of zcache, this article may be helpful:
+-- In-kernel memory compression (lwn.net/Articles/545244)
+
+Now that you've memorized the contents of those articles, let's get started!
+
+A. PRELIMINARY
+
+1) Install two x86_64 Linux systems that are known to work when
+   upgraded to a recent upstream Linux kernel version.
+
+On each system:
+
+2) Configure, build and install, then boot Linux, just to ensure it
+   can be done with an unmodified upstream kernel.  Confirm you booted
+   the upstream kernel with "uname -a".
+
+3) If you plan to do any performance testing or unless you plan to
+   test only swapping, the "WasActive" patch is also highly recommended.
+   (Search lkml.org for WasActive, apply the patch, rebuild your kernel.)
+   For a demo or simple testing, the patch can be ignored.
+
+4) Install ramster-tools as root.  An x86_64 rpm for EL6-based systems
+   can be found at:
+    http://oss.oracle.com/projects/tmem/files/RAMster/
+   (Sorry but for now, non-EL6 users must recreate ramster-tools on
+   their own from source.  See above.)
+
+5) Ensure that debugfs is mounted at each boot.  Examples below assume it
+   is mounted at /sys/kernel/debug.
+
+B. BUILDING RAMSTER INTO THE KERNEL
+
+Do the following on each system:
+
+1) Using the kernel configuration mechanism of your choice, change
+   your config to include:
+
+	CONFIG_CLEANCACHE=y
+	CONFIG_FRONTSWAP=y
+	CONFIG_STAGING=y
+	CONFIG_CONFIGFS_FS=y   # NOTE: MUST BE y, not m
+	CONFIG_ZCACHE=y
+	CONFIG_RAMSTER=y
+
+   For a linux-3.10 or later kernel, you should also set:
+
+	CONFIG_ZCACHE_DEBUG=y
+	CONFIG_RAMSTER_DEBUG=y
+
+   Before building the kernel
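The config changes in section B, step 1 can also be made non-interactively.
A sketch using the kernel tree's own scripts/config helper (assumptions: you
are at the top of the kernel source tree, a .config already exists, and your
kernel is recent enough to have the `olddefconfig` make target):

```shell
# Enable the options from section B step 1 without opening menuconfig.
# scripts/config accepts option names with or without the CONFIG_ prefix;
# --enable sets each one to y (CONFIGFS_FS must be y, not m).
scripts/config --enable CLEANCACHE \
               --enable FRONTSWAP \
               --enable STAGING \
               --enable CONFIGFS_FS \
               --enable ZCACHE \
               --enable RAMSTER

# For a linux-3.10 or later kernel, also:
scripts/config --enable ZCACHE_DEBUG --enable RAMSTER_DEBUG

# Resolve any newly-exposed dependencies, then verify the result:
make olddefconfig
grep CONFIG_CONFIGFS_FS .config    # expect CONFIG_CONFIGFS_FS=y
```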
RE: [PATCH] staging: ramster: add how-to document
Hey Greg -- Since this is documentation only and documents existing behavior, I'm not clear whether it is acceptable for an rcN release in the current cycle or must wait until the next window.  Since it is a new file, it should apply to either so I'll leave the choice up to you.

Thanks,
Dan

From: Dan Magenheimer [mailto:dan.magenhei...@oracle.com]
Sent: Monday, May 20, 2013 8:52 AM
To: de...@linuxdriverproject.org; linux-kernel@vger.kernel.org; gre...@linuxfoundation.org; linux-m...@kvack.org; konrad.w...@oracle.com; dan.magenhei...@oracle.com; liw...@linux.vnet.ibm.com; bob@oracle.com
Subject: [PATCH] staging: ramster: add how-to document

Add how-to documentation that provides a step-by-step guide for
configuring and trying out a ramster cluster.

Signed-off-by: Dan Magenheimer <dan.magenhei...@oracle.com>
---
 drivers/staging/zcache/ramster/ramster-howto.txt | 366 ++
 1 files changed, 366 insertions(+), 0 deletions(-)
 create mode 100644 drivers/staging/zcache/ramster/ramster-howto.txt

diff --git a/drivers/staging/zcache/ramster/ramster-howto.txt b/drivers/staging/zcache/ramster/ramster-howto.txt
new file mode 100644
index 000..7b1ee3b
--- /dev/null
+++ b/drivers/staging/zcache/ramster/ramster-howto.txt
@@ -0,0 +1,366 @@
+			RAMSTER HOW-TO
+
+Author: Dan Magenheimer
+Ramster maintainer: Konrad Wilk <konrad.w...@oracle.com>
+
+This is a HOWTO document for ramster which, as of this writing, is in
+the kernel as a subdirectory of zcache in drivers/staging, called ramster.
+(Zcache can be built with or without ramster functionality.)  If enabled
+and properly configured, ramster allows memory capacity load balancing
+across multiple machines in a cluster.  Further, the ramster code serves
+as an example of asynchronous access for zcache (as well as cleancache and
+frontswap) that may prove useful for future transcendent memory
+implementations, such as KVM and NVRAM.  While ramster works today on
+any network connection that supports kernel sockets, its features may
+become more interesting on future high-speed fabrics/interconnects.
+
+Ramster requires both kernel and userland support.  The userland support,
+called ramster-tools, is known to work with EL6-based distros, but is a
+set of poorly-hacked slightly-modified cluster tools based on ocfs2, which
+includes an init file, a config file, and a userland binary that interfaces
+to the kernel.  This state of userland support reflects the abysmal userland
+skills of this suitably-embarrassed author; any help/patches to turn
+ramster-tools into more distributable rpms/debs useful for a wider range
+of distros would be appreciated.  The source RPM that can be used as a
+starting point is available at:
+    http://oss.oracle.com/projects/tmem/files/RAMster/
+
+As a result of this author's ignorance, userland setup described in this
+HOWTO assumes an EL6 distro and is described in EL6 syntax.  Apologies
+if this offends anyone!
+
+Kernel support has only been tested on x86_64.  Systems with an active
+ocfs2 filesystem should work, but since ramster leverages a lot of
+code from ocfs2, there may be latent issues.  A kernel configuration that
+includes CONFIG_OCFS2_FS should build OK, and should certainly run OK
+if no ocfs2 filesystem is mounted.
+
+This HOWTO demonstrates memory capacity load balancing for a two-node
+cluster, where one node called the local node becomes overcommitted
+and the other node called the remote node provides additional RAM
+capacity for use by the local node.  Ramster is capable of more complex
+topologies; see the last section titled ADVANCED RAMSTER TOPOLOGIES.
+
+If you find any terms in this HOWTO unfamiliar or don't understand the
+motivation for ramster, the following LWN reading is recommended:
+-- Transcendent Memory in a Nutshell (lwn.net/Articles/454795)
+-- The future calculus of memory management (lwn.net/Articles/475681)
+And since ramster is built on top of zcache, this article may be helpful:
+-- In-kernel memory compression (lwn.net/Articles/545244)
+
+Now that you've memorized the contents of those articles, let's get started!
+
+A. PRELIMINARY
+
+1) Install two x86_64 Linux systems that are known to work when
+   upgraded to a recent upstream Linux kernel version.
+
+On each system:
+
+2) Configure, build and install, then boot Linux, just to ensure it
+   can be done with an unmodified upstream kernel.  Confirm you booted
+   the upstream kernel with uname -a.
+
+3) If you plan to do any performance testing or unless you plan to
+   test only swapping, the WasActive patch is also highly recommended.
+   (Search lkml.org for WasActive, apply the patch, rebuild your kernel.)
+   For a demo or simple testing, the patch can be ignored.
+
+4) Install ramster-tools as root.  An x86_64 rpm for EL6-based systems
+   can be found at:
+    http://oss.oracle.com/projects/tmem
Bye bye Mr tmem guy
Hi Linux kernel folks and Xen folks --

Effective July 5, I will be resigning from Oracle and retiring for a minimum of 12-18 months and probably/hopefully much longer.  Between now and July 5, I will be tying up loose ends related to my patches but also using up accrued vacation days.  If you have a loose end you'd like to see tied, please let me know ASAP and I will do my best.

After July 5, any email to me via first DOT last AT oracle DOT com will go undelivered and may bounce.  Please send email related to my open source patches and contributions to Konrad Wilk and/or Bob Liu.  Personal email directed to me can be sent to first AT last DOT com.

Thanks much to everybody for the many educational opportunities, the technical and political jousting, and the great times at conferences and summits!  I wish you all the best of luck!  Or to quote Douglas Adams: So long and thanks for all the fish!

Cheers,
Dan Magenheimer
The Transcendent Memory (tmem) guy

Tmem-related historical webography:
http://lwn.net/Articles/454795/
http://lwn.net/Articles/475681/
http://lwn.net/Articles/545244/
https://oss.oracle.com/projects/tmem/
http://www.linux-kvm.org/wiki/images/d/d7/TmemNotVirt-Linuxcon2011-Final.pdf
http://lwn.net/Articles/465317/
http://lwn.net/Articles/340080/
http://lwn.net/Articles/386090/
http://www.xen.org/files/xensummit_oracle09/xensummit_transmemory.pdf
https://oss.oracle.com/projects/tmem/dist/documentation/presentations/TranscendentMemoryXenSummit2010.pdf
https://blogs.oracle.com/wim/entry/example_of_transcendent_memory_and
https://blogs.oracle.com/wim/entry/another_feature_hit_mainline_linux
https://blogs.oracle.com/wim/entry/from_the_research_department_ramster
http://streaming.oracle.com/ebn/podcasts/media/11663326_VM_Linux_042512.mp3
https://oss.oracle.com/projects/tmem/dist/documentation/papers/overcommit.pdf
http://static.usenix.org/event/wiov08/tech/full_papers/magenheimer/magenheimer_html/

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
RE: [PATCHv11 3/4] zswap: add to mm/
> From: Rik van Riel [mailto:r...@redhat.com]
> Sent: Wednesday, May 15, 2013 4:15 PM
> To: Dan Magenheimer
> Cc: Seth Jennings; Andrew Morton; Greg Kroah-Hartman; Nitin Gupta; Minchan Kim; Konrad Wilk; Robert Jennings; Jenifer Hopper; Mel Gorman; Johannes Weiner; Larry Woodman; Benjamin Herrenschmidt; Dave Hansen; Joe Perches; Joonsoo Kim; Cody P Schafer; Hugh Dickens; Paul Mackerras; linux...@kvack.org; linux-kernel@vger.kernel.org; de...@driverdev.osuosl.org
> Subject: Re: [PATCHv11 3/4] zswap: add to mm/
>
> On 05/14/2013 04:18 PM, Dan Magenheimer wrote:
>
> > It's unfortunate that my proposed topic for LSFMM was pre-empted
> > by the zsmalloc vs zbud discussion and zswap vs zcache, because
> > I think the real challenge of zswap (or zcache) and the value to
> > distros and end users requires us to get this right BEFORE users
> > start filing bugs about performance weirdness.  After which most
> > users and distros will simply default to 0% (i.e. turn zswap off)
> > because zswap unpredictably sometimes sucks.
>
> I'm not sure we can get it right before people actually start
> using it for real world setups, instead of just running benchmarks
> on it.
>
> The sooner we get the code out there, where users can play with
> it (even if it is disabled by default and needs a sysfs or
> sysctl config option to enable it), the sooner we will know how
> well it works, and what needs to be changed.

/me sets stage of first Star Wars (1977)
/me envisions self as Obi-Wan Kenobi, old and tired of fighting, in lightsaber battle with protege Darth Vader / Anakin Skywalker
/me sadly turns off lightsaber, holds useless handle at waist, takes a deep breath, and promptly gets sliced into oblivion.

Time for A New Hope(tm).

(/me cc's Jon Corbet for a longshot last chance of making LWN's Kernel Development Quotes of the Week.)
RE: [PATCHv11 3/4] zswap: add to mm/
> From: Rik van Riel [mailto:r...@redhat.com]
> Subject: Re: [PATCHv11 3/4] zswap: add to mm/
>
> On 05/15/2013 03:35 PM, Dan Magenheimer wrote:
> >> From: Konrad Rzeszutek Wilk
> >> Subject: Re: [PATCHv11 3/4] zswap: add to mm/
> >>
> >>> Sorry, but I don't think that's appropriate for a patch in the MM
> >>> subsystem.
> >>
> >> I am heading to the airport shortly so this email is a bit hastily typed.
> >>
> >> Perhaps a compromise can be reached where this code is merged as a driver
> >> not a core mm component.  There is a high bar to be in the MM - it has to
> >> work with many many different configurations.
> >>
> >> And drivers don't have such a high bar.  They just need to work on a specific
> >> issue and that is it.  If zswap ended up in say, drivers/mm that would make
> >> it more palpable I think.
> >>
> >> Thoughts?
> >
> > Hmmm...
> >
> > To me, that sounds like a really good compromise.
>
> Come on, we all know that is nonsense.
>
> Sure, the zswap and zbud code may not be in their final state yet,
> but they belong in the mm/ directory, together with the cleancache
> code and all the other related bits of code.
>
> Lets put them in their final destination, and hope the code attracts
> attention by as many MM developers as can spare the time to help
> improve it.

Hi Rik --

Seth has been hell-bent on getting SOME code into the kernel for over a year, since he found out that enabling zcache, a staging driver, resulted in a tainted kernel.  First it was promoting zcache+zsmalloc out of staging.  Then it was zswap+zsmalloc without writeback, then zswap+zsmalloc with writeback, and now zswap+zbud with writeback but without a sane policy for writeback.

All of that time, I've been arguing and trying to integrate compression more deeply and sensibly into MM, rather than just enabling compression as a toy that happens to speed up a few benchmarks.  (This, in a nutshell, was the feedback I got at LSFMM12 from Andrea and Mel... and I think also from you.)  Seth has resisted every step of the way, then integrated the functionality in question, adapted my code (or Nitin's), and called it his own.

If you disagree with any of my arguments earlier in this thread, please say so.  Else, please reinforce that the MM subsystem needs to dynamically adapt to a broad range of workloads, which zswap does not (yet) do.  Zswap is not simple, it is simplistic*.  IMHO, it may be OK for a driver to be ham-handed in its memory use, but that's not OK for something in mm/.

So I think merging zswap as a driver is a perfectly sensible compromise which lets Seth get his code upstream, allows users (and leading-edge distros) to experiment with compression, avoids these endless arguments, and allows those who care to move forward on how to deeply integrate compression into MM.

Dan

* simplistic, n., The tendency to oversimplify an issue or a problem by ignoring complexities or complications.
RE: [PATCHv11 3/4] zswap: add to mm/
> From: Dave Hansen [mailto:d...@sr71.net]
> Sent: Wednesday, May 15, 2013 2:24 PM
> To: Seth Jennings
> Cc: Konrad Rzeszutek Wilk; Dan Magenheimer; Andrew Morton; Greg Kroah-Hartman; Nitin Gupta; Minchan Kim; Robert Jennings; Jenifer Hopper; Mel Gorman; Johannes Weiner; Rik van Riel; Larry Woodman; Benjamin Herrenschmidt; Joe Perches; Joonsoo Kim; Cody P Schafer; Hugh Dickens; Paul Mackerras; linux-m...@kvack.org; linux-kernel@vger.kernel.org; de...@driverdev.osuosl.org
> Subject: Re: [PATCHv11 3/4] zswap: add to mm/
>
> On 05/15/2013 01:09 PM, Seth Jennings wrote:
> > On Wed, May 15, 2013 at 02:55:06PM -0400, Konrad Rzeszutek Wilk wrote:
> >>> Sorry, but I don't think that's appropriate for a patch in the MM
> >>> subsystem.
> >>
> >> Perhaps a compromise can be reached where this code is merged as a driver
> >> not a core mm component.  There is a high bar to be in the MM - it has to
> >> work with many many different configurations.
> >>
> >> And drivers don't have such a high bar.  They just need to work on a specific
> >> issue and that is it.  If zswap ended up in say, drivers/mm that would make
> >> it more palpable I think.
>
> The issue is not whether it is a loadable module or a driver.  Nobody
> here is stupid enough to say, "hey, now it's a driver/module, all of the
> complex VM interactions are finally fixed!"
>
> If folks don't want this in their system, there's a way to turn it off,
> today, with the sysfs tunables.  We don't need _another_ way to turn it
> off at runtime (unloading the module/driver).

The issue is we KNOW the complex VM interactions are NOT fixed and there has been very very little breadth testing (i.e. across a wide range of workloads), nor any attempts to show how much harm can come from enabling it.  That's (at least borderline) acceptable in a driver that can be unloaded, but not in MM code IMHO.
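For reference, the runtime off-switch Dave alludes to is exposed through zswap's module parameters; a sketch of toggling it, assuming the usual /sys/module/<name>/parameters layout for module_param entries (check your kernel's paths before relying on this):

```shell
# Sketch: inspect and flip zswap's runtime tunables (requires root and a
# kernel with zswap built in; paths follow the module_param convention).
grep . /sys/module/zswap/parameters/*            # show current settings
echo 0 > /sys/module/zswap/parameters/enabled    # stop compressing new pages
```

Note that disabling it this way stops new pages from entering the pool but does not drain pages already compressed.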
RE: [PATCHv11 3/4] zswap: add to mm/
> From: Seth Jennings [mailto:sjenn...@linux.vnet.ibm.com]
> Sent: Wednesday, May 15, 2013 2:10 PM
> To: Konrad Rzeszutek Wilk
> Cc: Dan Magenheimer; Andrew Morton; Greg Kroah-Hartman; Nitin Gupta; Minchan Kim; Robert Jennings; Jenifer Hopper; Mel Gorman; Johannes Weiner; Rik van Riel; Larry Woodman; Benjamin Herrenschmidt; Dave Hansen; Joe Perches; Joonsoo Kim; Cody P Schafer; Hugh Dickens; Paul Mackerras; linux...@kvack.org; linux-kernel@vger.kernel.org; de...@driverdev.osuosl.org
> Subject: Re: [PATCHv11 3/4] zswap: add to mm/
>
> On Wed, May 15, 2013 at 02:55:06PM -0400, Konrad Rzeszutek Wilk wrote:
> > > Sorry, but I don't think that's appropriate for a patch in the MM
> > > subsystem.
> >
> > I am heading to the airport shortly so this email is a bit hastily typed.
> >
> > Perhaps a compromise can be reached where this code is merged as a driver
> > not a core mm component.  There is a high bar to be in the MM - it has to
> > work with many many different configurations.
> >
> > And drivers don't have such a high bar.  They just need to work on a specific
> > issue and that is it.  If zswap ended up in say, drivers/mm that would make
> > it more palpable I think.
> >
> > Thoughts?
>
> zswap, the writeback code particularly, depends on a number of non-exported
> kernel symbols, namely:
>
>   swapcache_free
>   __swap_writepage
>   __add_to_swap_cache
>   swapcache_prepare
>   swapper_spaces
>
> So it can't currently be built as a module and I'm not sure what the MM
> folks would think about exporting them and making them part of the KABI.

It can be built as a module if writeback is disabled (or ifdef'd by a CONFIG_ZSWAP_WRITEBACK which depends on CONFIG_ZSWAP=y).  The folks at LSFMM who were planning to use zswap will be turning off writeback anyway, so an alternative is to pull writeback out of zswap completely for now, since you don't really have a good policy to manage it yet anyway.
RE: [PATCHv11 3/4] zswap: add to mm/
> From: Konrad Rzeszutek Wilk
> Subject: Re: [PATCHv11 3/4] zswap: add to mm/
>
> > Sorry, but I don't think that's appropriate for a patch in the MM subsystem.
>
> I am heading to the airport shortly so this email is a bit hastily typed.
>
> Perhaps a compromise can be reached where this code is merged as a driver
> not a core mm component.  There is a high bar to be in the MM - it has to
> work with many many different configurations.
>
> And drivers don't have such a high bar.  They just need to work on a specific
> issue and that is it.  If zswap ended up in say, drivers/mm that would make
> it more palpable I think.
>
> Thoughts?

Hmmm...

To me, that sounds like a really good compromise.  Then anyone who wants to experiment with compressed swap pages can do so by enabling the zswap driver.  And the harder problem of deeply integrating compression into the MM subsystem can proceed in parallel by leveraging and building on the best of zswap and zcache and zram.

Seth, if you want to re-post zswap as a driver... even a previous zswap version with zsmalloc and without writeback... I would be willing to ack it.  If I correctly understand Mel's concerns, I suspect he might feel the same.
RE: [PATCH] Fixes, cleanups, compile warning fixes, and documentation update for Xen tmem driver (v2).
> From: Konrad Rzeszutek [mailto:ketuzs...@gmail.com] On Behalf Of Konrad Rzeszutek Wilk
> Sent: Tuesday, May 14, 2013 12:09 PM
> To: bob@oracle.com; dan.magenhei...@oracle.com; linux-kernel@vger.kernel.org; akpm@linux-foundation.org; linux...@kvack.org; xen-de...@lists.xensource.com
> Subject: [PATCH] Fixes, cleanups, compile warning fixes, and documentation update for Xen tmem driver (v2).
>
> Heya,
>
> These nine patches fix the tmem driver to:
>  - not emit a compile warning anymore (reported by 0 day test compile tool)
>  - remove the various nofrontswap, nocleancache, noselfshrinking,
>    noselfballooning, selfballooning, selfshrinking bootup options.
>  - said options are now folded in the tmem driver as module options and are
>    much shorter (and also there are only four of them now).
>  - add documentation to explain these parameters in kernel-parameters.txt
>  - And lastly add some logic to not enable selfshrinking and selfballooning
>    if frontswap functionality is off.
>
> That is it.  Tested and ready to go.  If nobody objects will put on my queue
> for Linus on Monday.

FWIW, I've scanned all of these and they look sane and good.  So consider all:

Acked-by: Dan Magenheimer

>  Documentation/kernel-parameters.txt |   21
>  drivers/xen/Kconfig                 |    7 +--
>  drivers/xen/tmem.c                  |   87 ---
>  drivers/xen/xen-selfballoon.c       |   47 ++
>  4 files changed, 69 insertions(+), 93 deletions(-)
>
> (oh nice, more deletions!)
>
> Konrad Rzeszutek Wilk (9):
>   xen/tmem: Cleanup.  Remove the parts that say temporary.
>   xen/tmem: Move all of the boot and module parameters to the top of the file.
>   xen/tmem: Split out the different module/boot options.
>   xen/tmem: Fix compile warning.
>   xen/tmem: s/disable_// and change the logic.
>   xen/tmem: Remove the boot options and fold them in the tmem.X parameters.
>   xen/tmem: Remove the usage of 'noselfshrink' and use 'tmem.selfshrink' bool instead.
>   xen/tmem: Remove the usage of '[no|]selfballoon' and use 'tmem.selfballooning' bool instead.
>   xen/tmem: Don't use self[ballooning|shrinking] if frontswap is off.
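Based only on the patch titles above, the consolidated tmem.X options would be set on the kernel command line roughly like the fragment below; the exact parameter spellings are inferred from the series and should be checked against the updated kernel-parameters.txt:

```shell
# Hypothetical /etc/default/grub fragment: keep frontswap/cleancache
# active but turn off selfballooning and selfshrinking via the new
# consolidated tmem module parameters (names assumed from patch titles).
GRUB_CMDLINE_LINUX="tmem tmem.selfballooning=0 tmem.selfshrink=0"
```

After editing, regenerate the grub config and reboot for the boot parameters to take effect.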
RE: [PATCHv11 3/4] zswap: add to mm/
> From: Seth Jennings [mailto:sjenn...@linux.vnet.ibm.com]
> Subject: Re: [PATCHv11 3/4] zswap: add to mm/
>
> On Tue, May 14, 2013 at 01:18:48PM -0700, Dan Magenheimer wrote:
> > > From: Seth Jennings [mailto:sjenn...@linux.vnet.ibm.com]
> > > Subject: Re: [PATCHv11 3/4] zswap: add to mm/
> > >
> > > > > +/* The maximum percentage of memory that the compressed pool can occupy */
> > > > > +static unsigned int zswap_max_pool_percent = 20;
> > > > > +module_param_named(max_pool_percent,
> > > > > +			zswap_max_pool_percent, uint, 0644);
> > > >
> > > > This limit, along with the code that enforces it (by calling reclaim
> > > > when the limit is reached), is IMHO questionable.  Is there any
> > > > other kernel memory allocation that is constrained by a percentage
> > > > of total memory rather than dynamically according to current
> > > > system conditions?  As Mel pointed out (approx.), if this limit
> > > > is reached by a zswap-storm and filled with pages of long-running,
> > > > rarely-used processes, 20% of RAM (by default here) becomes forever
> > > > clogged.
> > >
> > > So there are two comments here: 1) dynamic pool limit and 2) writeback
> > > of pages in zswap that won't be faulted in or forced out by pressure.
> > >
> > > Comment 1 feeds from the point of view that compressed pages should just be
> > > another type of memory managed by the core MM.  While ideal, very hard to
> > > implement in practice.  We are starting to realize that even the policy
> > > governing the active vs inactive list is very hard to get right.  Then
> > > shrinkers add more complexity to the policy problem.  Throwing another type
> > > in the mix would be just that much more complex and hard to get right
> > > (assuming there even _is_ a "right" policy for everyone in such a complex
> > > system).
> > >
> > > This max_pool_percent policy is simple, works well, and provides a
> > > deterministic policy that users can understand.  Users can be assured that a
> > > dynamic policy heuristic won't go nuts and allow the compressed pool to grow
> > > unbounded or be so aggressively reclaimed that it offers no value.
> >
> > Hi Seth --
> >
> > Hmmm... I'm not sure how to politely say "bullshit". :-)
> >
> > The default 20% was randomly pulled out of the air long ago for zcache
> > experiments.  If you can explain why 20% is better than 19% or 21%, or
> > better than 10% or 30% or even 50%, that would be a start.  Then please try
> > to explain -- in terms an average sysadmin can understand -- under
> > what circumstances this number should be higher or lower, that would
> > be even better.  In fact if you can explain it in even very broadbrush
> > terms like "higher for embedded" and "lower for server" that would be
> > useful.  If the top Linux experts in compression can't answer these
> > questions (and the default is a random number, which it is), I don't
> > know how we can expect users to be "assured".
>
> 20% is a default maximum.  There really isn't a particular reason for the
> selection other than to supply a reasonable default to a tunable.  20% is enough
> to show the benefit while assuring the user zswap won't eat more than that
> amount of memory under any circumstance.  The point is to make it a tunable,
> not to launch an incredibly in-depth study on what the default should be.

My point is that a tunable is worthless -- and essentially the same as a fixed value -- unless you can clearly instruct target users how to change it to match their needs.

> As guidance on how to tune it, switching to zbud actually made the math simpler
> by bounding the best case to 2 and the expected density to very near 2.  I have
> two methods, one based on calculation and another based on experimentation.
>
> Yes, I understand that there are many things to consider, but for the sake of
> those that honestly care about the answer to the question, I'll answer it.
>
> Method 1:
>
> If you have a workload running on a machine with x GB of RAM and an anonymous
> working set of y GB of pages where x < y, a good starting point for
> max_pool_percent is ((y/x)-1)*100.
>
> For example, if you have 10GB of RAM and 12GB anon working set, (12/10-1)*100 =
> 20.  During operation there would be 8GB in uncompressed memory, and 4GB worth
> of compressed memory occupying 2GB (i.e.
> 20%) of RAM.  This will reduce swap I/O to near zero assuming the pages
> compress 50% on average.  Bear in mind that this formula provides a lower
> bound on max_pool_percent if you want to avoid swap I/O.  Setting
> max_pool_percent to 20 would produce the same situation.

OK, let's try to apply your method.  You personally have undoubtedly compiled the kernel hundreds, maybe thousands of times in the last year.  In the restricted environment where you and I have run benchmarks, this is a fairly stable and reproducible workload == stable and reproducible are somewhat rare in the real world.  Can you tell me what the anon working set is of compiling the kernel?  Have you, one of the top experts in Linux compression technology, ever even once changed the max_pool_percent in your
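Method 1 above reduces to simple arithmetic; a quick sanity-check sketch using the thread's own example numbers (10GB RAM, 12GB anonymous working set):

```shell
# Worked example of "Method 1" from the thread: the lower bound for
# zswap's max_pool_percent is ((y/x)-1)*100, where x = GB of RAM and
# y = GB of anonymous working set, with y > x.  Integer arithmetic:
x=10                           # GB of RAM (example value from the thread)
y=12                           # GB of anonymous working set
pct=$(( (y * 100 / x) - 100 )) # ((y/x)-1)*100 without floating point
echo "$pct"                    # prints 20
```

With that setting, 8GB of the working set stays uncompressed and the remaining 4GB compresses (at roughly 2:1 with zbud) into 2GB, i.e. 20% of RAM, matching the thread's figures.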
RE: [PATCHv11 3/4] zswap: add to mm/
> From: Konrad Rzeszutek Wilk
> Subject: Re: [PATCHv11 3/4] zswap: add to mm/
>
> Sorry, but I don't think that's appropriate for a patch in the MM
> subsystem.
>
> I am heading to the airport shortly so this email is a bit hastily
> typed. Perhaps a compromise can be reached where this code is merged
> as a driver, not a core mm component. There is a high bar to be in the
> MM - it has to work with many many different configurations. And
> drivers don't have such a high bar. They just need to work on a
> specific issue and that is it. If zswap ended up in say, drivers/mm
> that would make it more palatable I think. Thoughts?

Hmmm... To me, that sounds like a really good compromise. Then anyone
who wants to experiment with compressed swap pages can do so by
enabling the zswap driver. And the harder problem of deeply
integrating compression into the MM subsystem can proceed in parallel
by leveraging and building on the best of zswap and zcache and zram.

Seth, if you want to re-post zswap as a driver... even a previous
zswap version with zsmalloc and without writeback... I would be
willing to ack it. If I correctly understand Mel's concerns, I suspect
he might feel the same.
RE: [PATCHv11 3/4] zswap: add to mm/
From: Seth Jennings [mailto:sjenn...@linux.vnet.ibm.com]
Sent: Wednesday, May 15, 2013 2:10 PM
To: Konrad Rzeszutek Wilk
Cc: Dan Magenheimer; Andrew Morton; Greg Kroah-Hartman; Nitin Gupta; Minchan Kim; Robert Jennings; Jenifer Hopper; Mel Gorman; Johannes Weiner; Rik van Riel; Larry Woodman; Benjamin Herrenschmidt; Dave Hansen; Joe Perches; Joonsoo Kim; Cody P Schafer; Hugh Dickens; Paul Mackerras; linux...@kvack.org; linux-kernel@vger.kernel.org; de...@driverdev.osuosl.org
Subject: Re: [PATCHv11 3/4] zswap: add to mm/

On Wed, May 15, 2013 at 02:55:06PM -0400, Konrad Rzeszutek Wilk wrote:
> Sorry, but I don't think that's appropriate for a patch in the MM
> subsystem.
>
> I am heading to the airport shortly so this email is a bit hastily
> typed. Perhaps a compromise can be reached where this code is merged
> as a driver, not a core mm component. There is a high bar to be in the
> MM - it has to work with many many different configurations. And
> drivers don't have such a high bar. They just need to work on a
> specific issue and that is it. If zswap ended up in say, drivers/mm
> that would make it more palatable I think. Thoughts?

zswap, the writeback code particularly, depends on a number of
non-exported kernel symbols, namely:

  swapcache_free
  __swap_writepage
  __add_to_swap_cache
  swapcache_prepare
  swapper_spaces

So it can't currently be built as a module and I'm not sure what the
MM folks would think about exporting them and making them part of the
KABI.

It can be built as a module if writeback is disabled (or ifdef'd by a
CONFIG_ZSWAP_WRITEBACK which depends on CONFIG_ZSWAP=y). The folks at
LSFMM who were planning to use zswap will be turning off writeback
anyway, so an alternative is to pull writeback out of zswap completely
for now, since you don't really have a good policy to manage it yet
anyway.
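(Seth's suggested split could be sketched as the Kconfig fragment below. CONFIG_ZSWAP_WRITEBACK is his hypothetical option, not one in the patch as posted; the point is that writeback uses unexported swap symbols, so it can only exist when zswap is built in, while the non-writeback core could be tristate.)

```kconfig
config ZSWAP
	tristate "In-kernel swap page compression"
	depends on FRONTSWAP && CRYPTO
	select CRYPTO_LZO
	select ZBUD

# Hypothetical option per the suggestion above: the writeback path
# calls unexported symbols (__swap_writepage etc.), so it forces the
# built-in case.
config ZSWAP_WRITEBACK
	bool "Evict compressed pages back to the swap device"
	depends on ZSWAP=y
```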
RE: [PATCHv11 3/4] zswap: add to mm/
From: Dave Hansen [mailto:d...@sr71.net]
Sent: Wednesday, May 15, 2013 2:24 PM
To: Seth Jennings
Cc: Konrad Rzeszutek Wilk; Dan Magenheimer; Andrew Morton; Greg Kroah-Hartman; Nitin Gupta; Minchan Kim; Robert Jennings; Jenifer Hopper; Mel Gorman; Johannes Weiner; Rik van Riel; Larry Woodman; Benjamin Herrenschmidt; Joe Perches; Joonsoo Kim; Cody P Schafer; Hugh Dickens; Paul Mackerras; linux-m...@kvack.org; linux-kernel@vger.kernel.org; de...@driverdev.osuosl.org
Subject: Re: [PATCHv11 3/4] zswap: add to mm/

On 05/15/2013 01:09 PM, Seth Jennings wrote:
> On Wed, May 15, 2013 at 02:55:06PM -0400, Konrad Rzeszutek Wilk wrote:
> > Sorry, but I don't think that's appropriate for a patch in the MM
> > subsystem. Perhaps a compromise can be reached where this code is
> > merged as a driver, not a core mm component. There is a high bar to
> > be in the MM - it has to work with many many different
> > configurations. And drivers don't have such a high bar. They just
> > need to work on a specific issue and that is it. If zswap ended up
> > in say, drivers/mm that would make it more palatable I think.

The issue is not whether it is a loadable module or a driver. Nobody
here is stupid enough to say, hey, now it's a driver/module, all of
the complex VM interactions are finally fixed! If folks don't want
this in their system, there's a way to turn it off, today, with the
sysfs tunables. We don't need _another_ way to turn it off at runtime
(unloading the module/driver).

The issue is we KNOW the complex VM interactions are NOT fixed, and
there has been very very little breadth testing (i.e. across a wide
range of workloads, and any attempts to show how much harm can come
from enabling it). That's (at least borderline) acceptable in a
driver that can be unloaded, but not in MM code IMHO.
RE: [PATCHv11 3/4] zswap: add to mm/
> From: Rik van Riel [mailto:r...@redhat.com]
> Subject: Re: [PATCHv11 3/4] zswap: add to mm/
>
> On 05/15/2013 03:35 PM, Dan Magenheimer wrote:
> > > From: Konrad Rzeszutek Wilk
> > > Subject: Re: [PATCHv11 3/4] zswap: add to mm/
> > >
> > > Sorry, but I don't think that's appropriate for a patch in the MM
> > > subsystem. I am heading to the airport shortly so this email is a
> > > bit hastily typed. Perhaps a compromise can be reached where this
> > > code is merged as a driver, not a core mm component. There is a
> > > high bar to be in the MM - it has to work with many many different
> > > configurations. And drivers don't have such a high bar. They just
> > > need to work on a specific issue and that is it. If zswap ended up
> > > in say, drivers/mm that would make it more palatable I think.
> > > Thoughts?
> >
> > Hmmm... To me, that sounds like a really good compromise.
>
> Come on, we all know that is nonsense. Sure, the zswap and zbud code
> may not be in their final state yet, but they belong in the mm/
> directory, together with the cleancache code and all the other
> related bits of code. Let's put them in their final destination, and
> hope the code attracts attention by as many MM developers as can
> spare the time to help improve it.

Hi Rik --

Seth has been hell-bent on getting SOME code into the kernel for over
a year, since he found out that enabling zcache, a staging driver,
resulted in a tainted kernel. First it was promoting zcache+zsmalloc
out of staging. Then it was zswap+zsmalloc without writeback, then
zswap+zsmalloc with writeback, and now zswap+zbud with writeback but
without a sane policy for writeback. All of that time, I've been
arguing and trying to integrate compression more deeply and sensibly
into MM, rather than just enabling compression as a toy that happens
to speed up a few benchmarks. (This, in a nutshell, was the feedback
I got at LSFMM12 from Andrea and Mel... and I think also from you.)
Seth has resisted every step of the way, then integrated the
functionality in question, adapted my code (or Nitin's), and called
it his own.
If you disagree with any of my arguments earlier in this thread,
please say so. Else, please reinforce that the MM subsystem needs to
dynamically adapt to a broad range of workloads, which zswap does not
(yet) do. Zswap is not simple, it is simplistic*. IMHO, it may be OK
for a driver to be ham-handed in its memory use, but that's not OK
for something in mm/.

So I think merging zswap as a driver is a perfectly sensible
compromise which lets Seth get his code upstream, allows users (and
leading-edge distros) to experiment with compression, avoids these
endless arguments, and allows those who care to move forward on how
to deeply integrate compression into MM.

Dan

* simplistic, n., The tendency to oversimplify an issue or a problem
  by ignoring complexities or complications.
RE: [PATCHv11 3/4] zswap: add to mm/
> From: Seth Jennings [mailto:sjenn...@linux.vnet.ibm.com]
> Subject: Re: [PATCHv11 3/4] zswap: add to mm/
>
> On Tue, May 14, 2013 at 09:37:08AM -0700, Dan Magenheimer wrote:
> > > From: Seth Jennings [mailto:sjenn...@linux.vnet.ibm.com]
> > > Subject: Re: [PATCHv11 3/4] zswap: add to mm/
> > >
> > > On Tue, May 14, 2013 at 05:19:19PM +0800, Bob Liu wrote:
> > > > Hi Seth,
> > >
> > > Hi Bob, thanks for the review!
> > >
> > > > > + /* reclaim space if needed */
> > > > > + if (zswap_is_full()) {
> > > > > +	zswap_pool_limit_hit++;
> > > > > +	if (zbud_reclaim_page(tree->pool, 8)) {
> > > >
> > > > My idea is to wake up a kernel thread here to do the reclaim.
> > > > Once zswap is full (20% of total mem currently), the kernel
> > > > thread should reclaim pages from it. Not only reclaim one page;
> > > > it should depend on the current memory pressure. And then the
> > > > API in zbud may look like this:
> > > > zbud_reclaim_page(pool, nr_pages_to_reclaim, nr_retry);
> > >
> > > So kswapd for zswap. I'm not opposed to the idea if a case can be
> > > made for the complexity. I must say, I don't see that case though.
> > >
> > > The policy can evolve as deficiencies are demonstrated and
> > > solutions are found.
> >
> > Hmmm... it is fairly easy to demonstrate the deficiency if one
> > tries. I actually first saw it occur on a real (though early) EL6
> > system which started some graphics-related service that caused a
> > very brief swapstorm that was invisible during normal boot but
> > clogged up RAM with compressed pages, which later caused weird
> > reductions in benchmarking performance.
>
> Without any specifics, I'm not sure what I can do with this.

Well, I think it's customary for the author of a patch to know the
limitations of the patch. I suggest you synthesize a workload that
attempts to measure the worst case.
That's exactly what I did a year ago, and it led me to the realization
that zcache needed to solve some issues before it was ready to promote
out of staging.

> I'm hearing you say that the source of the benchmark degradation
> is the idle pages in zswap. In that case, the periodic writeback
> patches I have in the wings should address this.
>
> I think we are on the same page without realizing it. Right now
> zswap supports a kind of "direct reclaim" model at allocation time.
> The periodic writeback patches will handle the proactive writeback
> part to free up the zswap pool when it has idle pages in it.

I don't think we are on the same page, though maybe you are heading in
the same direction now. I won't repeat the comments from the previous
email.

> > I think Mel's unpredictability concern applies equally here...
> > this may be a "long-term source of bugs and strange memory
> > management behavior."
> >
> > > Can I get your ack on this pending the other changes?
> >
> > I'd like to hear Mel's feedback about this, but perhaps a
> > compromise to allow for zswap merging would be to add something
> > like the following to zswap's Kconfig comment:
> >
> > "Zswap reclaim policy is still primitive. Until it improves,
> > zswap should be considered experimental and is not recommended
> > for production use."
>
> Just for the record, an "experimental" tag in the Kconfig won't
> work for me.
>
> The reclaim policy for zswap is not primitive, it's simple. There
> is a difference. Plus zswap is already runtime disabled by default.
> If distros/customers enabled it, it is because they purposely
> enabled it.

Hmmm... I think you are proposing the following use model to
users/distros: "If zswap works for you, turn it on. If it sucks, turn
it off. I can't tell you in advance whether it will work or suck for
your distro/workload, but it will probably work so please try it."

That sounds awfully experimental to me. The problem is not simple.
Your solution is simple because you are simply pretending that the
harder parts of the problem don't exist.

Dan
RE: [PATCHv11 3/4] zswap: add to mm/
> From: Seth Jennings [mailto:sjenn...@linux.vnet.ibm.com]
> Subject: Re: [PATCHv11 3/4] zswap: add to mm/
>
> > > +/* The maximum percentage of memory that the compressed pool can occupy */
> > > +static unsigned int zswap_max_pool_percent = 20;
> > > +module_param_named(max_pool_percent,
> > > +		zswap_max_pool_percent, uint, 0644);
> >
> > This limit, along with the code that enforces it (by calling
> > reclaim when the limit is reached), is IMHO questionable. Is there
> > any other kernel memory allocation that is constrained by a
> > percentage of total memory rather than dynamically according to
> > current system conditions? As Mel pointed out (approx.), if this
> > limit is reached by a zswap-storm and filled with pages of
> > long-running, rarely-used processes, 20% of RAM (by default here)
> > becomes forever clogged.
>
> So there are two comments here: 1) dynamic pool limit and
> 2) writeback of pages in zswap that won't be faulted in or forced
> out by pressure.
>
> Comment 1 feeds from the point of view that compressed pages should
> just be another type of memory managed by the core MM. While ideal,
> that is very hard to implement in practice. We are starting to
> realize that even the policy governing the active vs inactive list
> is very hard to get right. Then shrinkers add more complexity to
> the policy problem. Throwing another type into the mix would be
> just that much more complex and hard to get right (assuming there
> even _is_ a "right" policy for everyone in such a complex system).
>
> This max_pool_percent policy is simple, works well, and provides a
> deterministic policy that users can understand. Users can be
> assured that a dynamic policy heuristic won't go nuts and allow the
> compressed pool to grow unbounded or be so aggressively reclaimed
> that it offers no value.

Hi Seth --

Hmmm... I'm not sure how to politely say "bullshit". :-)

The default 20% was randomly pulled out of the air long ago for
zcache experiments.
If you can explain why 20% is better than 19% or 21%, or better than
10% or 30% or even 50%, that would be a start. Then please try to
explain -- in terms an average sysadmin can understand -- under what
circumstances this number should be higher or lower; that would be
even better. In fact, if you can explain it even in very broad-brush
terms like "higher for embedded" and "lower for server", that would
be useful.

If the top Linux experts in compression can't answer these questions
(and the default is a random number, which it is), I don't know how
we can expect users to be "assured".

What you mean is "works well"... on the two benchmarks you've tried
it on. You say it's too hard to do dynamically... even though every
other significant RAM user in the kernel has to do it dynamically.
Workloads are dynamic and heavy users of RAM need to deal with that.
You don't see a limit on the number of anonymous pages in the MM
subsystem, and you don't see a limit on the number of inodes in
btrfs. Linus would rightfully barf all over those limits and (if he
was paying attention to this discussion) he would barf on this limit
too.

It's unfortunate that my proposed topic for LSFMM was pre-empted by
the zsmalloc vs zbud discussion and zswap vs zcache, because I think
the real challenge of zswap (or zcache), and the value to distros and
end users, requires us to get this right BEFORE users start filing
bugs about performance weirdness. After which most users and distros
will simply default to 0% (i.e. turn zswap off) because zswap
unpredictably sometimes sucks.

Sorry...

> Comment 2 I agree is an issue. I already have patches for a
> "periodic writeback" functionality that starts to shrink the zswap
> pool via writeback if zswap goes idle for a period of time. This
> addresses the issue with long-lived, never-accessed pages getting
> stuck in zswap forever.

Pulling the call out of zswap_frontswap_store() (and ensuring there
still aren't any new races) would be a good start.
But this is just a mechanism; you haven't said anything about the
policy or how you intend to enforce the policy. Which just gets us
back to Comment 1...

So Comment 1 and Comment 2 are really the same: How do we
appropriately manage the number of pages in the system that are used
for storing compressed pages?
RE: [PATCHv11 3/4] zswap: add to mm/
> From: Seth Jennings [mailto:sjenn...@linux.vnet.ibm.com]
> Subject: Re: [PATCHv11 3/4] zswap: add to mm/
>
> On Tue, May 14, 2013 at 05:19:19PM +0800, Bob Liu wrote:
> > Hi Seth,
>
> Hi Bob, thanks for the review!
>
> > > + /* reclaim space if needed */
> > > + if (zswap_is_full()) {
> > > +	zswap_pool_limit_hit++;
> > > +	if (zbud_reclaim_page(tree->pool, 8)) {
> >
> > My idea is to wake up a kernel thread here to do the reclaim.
> > Once zswap is full (20% of total mem currently), the kernel thread
> > should reclaim pages from it. Not only reclaim one page; it should
> > depend on the current memory pressure. And then the API in zbud may
> > look like this:
> > zbud_reclaim_page(pool, nr_pages_to_reclaim, nr_retry);
>
> So kswapd for zswap. I'm not opposed to the idea if a case can be
> made for the complexity. I must say, I don't see that case though.
>
> The policy can evolve as deficiencies are demonstrated and solutions
> are found.

Hmmm... it is fairly easy to demonstrate the deficiency if one tries.
I actually first saw it occur on a real (though early) EL6 system
which started some graphics-related service that caused a very brief
swapstorm that was invisible during normal boot but clogged up RAM
with compressed pages, which later caused weird reductions in
benchmarking performance.

I think Mel's unpredictability concern applies equally here... this
may be a "long-term source of bugs and strange memory management
behavior."

> Can I get your ack on this pending the other changes?

I'd like to hear Mel's feedback about this, but perhaps a compromise
to allow for zswap merging would be to add something like the
following to zswap's Kconfig comment:

"Zswap reclaim policy is still primitive. Until it improves, zswap
should be considered experimental and is not recommended for
production use."

If Mel agrees with the unpredictability and also agrees with the
Kconfig compromise, I am willing to ack.
RE: [PATCHv11 3/4] zswap: add to mm/
> From: Seth Jennings [mailto:sjenn...@linux.vnet.ibm.com]
> Subject: [PATCHv11 3/4] zswap: add to mm/
>
> zswap is a thin compression backend for frontswap. It receives pages
> from frontswap and attempts to store them in a compressed memory
> pool, resulting in an effective partial memory reclaim and
> dramatically reduced swap device I/O.
>
> Additionally, in most cases, pages can be retrieved from this
> compressed store much more quickly than reading from traditional swap
> devices, resulting in faster performance for many workloads.
>
> It also has support for evicting swap pages that are currently
> compressed in zswap to the swap device on an LRU(ish) basis. This
> functionality is very important and makes zswap a true cache in that,
> once the cache is full or can't grow due to memory pressure, the
> oldest pages can be moved out of zswap to the swap device so newer
> pages can be compressed and stored in zswap.
>
> This patch adds the zswap driver to mm/
>
> Signed-off-by: Seth Jennings

A couple of comments below...

> ---
>  mm/Kconfig  |  15 +
>  mm/Makefile |   1 +
>  mm/zswap.c  | 952 +++
>  3 files changed, 968 insertions(+)
>  create mode 100644 mm/zswap.c
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 908f41b..4042e07 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -487,3 +487,18 @@ config ZBUD
> 	  While this design limits storage density, it has simple and
> 	  deterministic reclaim properties that make it preferable to a higher
> 	  density approach when reclaim will be used.
> +
> +config ZSWAP
> +	bool "In-kernel swap page compression"
> +	depends on FRONTSWAP && CRYPTO
> +	select CRYPTO_LZO
> +	select ZBUD
> +	default n
> +	help
> +	  Zswap is a backend for the frontswap mechanism in the VMM.
> +	  It receives pages from frontswap and attempts to store them
> +	  in a compressed memory pool, resulting in an effective
> +	  partial memory reclaim.
> +	  In addition, pages can be retrieved
> +	  from this compressed store much faster than most traditional
> +	  swap devices, resulting in reduced I/O and faster performance
> +	  for many workloads.
> diff --git a/mm/Makefile b/mm/Makefile
> index 95f0197..f008033 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -32,6 +32,7 @@ obj-$(CONFIG_HAVE_MEMBLOCK) += memblock.o
>  obj-$(CONFIG_BOUNCE) += bounce.o
>  obj-$(CONFIG_SWAP) += page_io.o swap_state.o swapfile.o
>  obj-$(CONFIG_FRONTSWAP) += frontswap.o
> +obj-$(CONFIG_ZSWAP) += zswap.o
>  obj-$(CONFIG_HAS_DMA) += dmapool.o
>  obj-$(CONFIG_HUGETLBFS) += hugetlb.o
>  obj-$(CONFIG_NUMA) += mempolicy.o
> diff --git a/mm/zswap.c b/mm/zswap.c
> new file mode 100644
> index 000..b1070ca
> --- /dev/null
> +++ b/mm/zswap.c
> @@ -0,0 +1,952 @@
> +/*
> + * zswap.c - zswap driver file
> + *
> + * zswap is a backend for frontswap that takes pages that are in the
> + * process of being swapped out and attempts to compress them and store
> + * them in a RAM-based memory pool. This results in a significant I/O
> + * reduction on the real swap device and, in the case of a slow swap
> + * device, can also improve workload performance.
> + *
> + * Copyright (C) 2012 Seth Jennings
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; either version 2
> + * of the License, or (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + */
> +
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +
> +#include <linux/module.h>
> +#include <linux/cpu.h>
> +#include <linux/highmem.h>
> +#include <linux/slab.h>
> +#include <linux/spinlock.h>
> +#include <linux/types.h>
> +#include <linux/atomic.h>
> +#include <linux/frontswap.h>
> +#include <linux/rbtree.h>
> +#include <linux/swap.h>
> +#include <linux/crypto.h>
> +#include <linux/mempool.h>
> +#include <linux/zbud.h>
> +
> +#include <linux/mm_types.h>
> +#include <linux/page-flags.h>
> +#include <linux/swapops.h>
> +#include <linux/writeback.h>
> +#include <linux/pagemap.h>
> +
> +/*
> + * statistics
> + */
> +/* Number of memory pages used by the compressed pool */
> +static atomic_t zswap_pool_pages = ATOMIC_INIT(0);
> +/* The number of compressed pages currently stored in zswap */
> +static atomic_t zswap_stored_pages = ATOMIC_INIT(0);
> +
> +/*
> + * The statistics below are not protected from concurrent access for
> + * performance reasons so they may not be 100% accurate. However,
> + * they do provide useful information on roughly how many times a
> + * certain event is occurring.
> + */
> +static u64 zswap_pool_limit_hit;
> +static u64 zswap_written_back_pages;
> +static u64 zswap_reject_reclaim_fail;
> +static u64 zswap_reject_compress_poor;
> +static u64 zswap_reject_alloc_fail;
> +static u64 zswap_reject_kmemcache_fail;
> +static u64
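The reject counters above correspond to decision points in the store path; for instance, a page whose compressed form is not meaningfully smaller than a page is refused rather than stored. A minimal userspace sketch of that bookkeeping (the function name and the simple `compressed length >= page size` policy are illustrative assumptions, not zswap's exact logic):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define POOL_PAGE_SIZE 4096 /* stand-in for PAGE_SIZE; illustrative */

/* Counters mirroring the statistics above (sketch only). */
static unsigned long long reject_compress_poor;
static unsigned long long stored_pages;

/*
 * A page whose compressed form is no smaller than a page is not worth
 * keeping in the pool; count it as a poor-compression reject.
 */
static bool zswap_accept(size_t compressed_len)
{
	if (compressed_len >= POOL_PAGE_SIZE) {
		reject_compress_poor++;
		return false;
	}
	stored_pages++;
	return true;
}
```

In the real driver the cutoff is effectively enforced by the allocator as well, since storing an incompressible page can only waste pool space.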
RE: [PATCHv11 2/4] zbud: add to mm/
> From: Seth Jennings [mailto:sjenn...@linux.vnet.ibm.com]
> Sent: Monday, May 13, 2013 6:40 AM
> Subject: [PATCHv11 2/4] zbud: add to mm/

One comment about a questionable algorithm change (vs my original zbud code) below... I'll leave the detailed code review to others.

Dan

> zbud is a special-purpose allocator for storing compressed pages. It is
> designed to store up to two compressed pages per physical page. While this
> design limits storage density, it has simple and deterministic reclaim
> properties that make it preferable to a higher density approach when reclaim
> will be used.
>
> zbud works by storing compressed pages, or "zpages", together in pairs in a
> single memory page called a "zbud page". The first buddy is "left
> justified" at the beginning of the zbud page, and the last buddy is "right
> justified" at the end of the zbud page. The benefit is that if either
> buddy is freed, the freed buddy space, coalesced with whatever slack space
> existed between the buddies, results in the largest possible free region
> within the zbud page.
>
> zbud also provides an attractive lower bound on density. The ratio of zpages
> to zbud pages cannot be less than 1. This ensures that zbud can never "do
> harm" by using more pages to store zpages than the uncompressed zpages would
> have used on their own.
>
> This patch adds zbud to mm/ for later use by zswap.
> > Signed-off-by: Seth Jennings > --- > include/linux/zbud.h | 22 ++ > mm/Kconfig | 10 + > mm/Makefile |1 + > mm/zbud.c| 564 > ++ > 4 files changed, 597 insertions(+) > create mode 100644 include/linux/zbud.h > create mode 100644 mm/zbud.c > > diff --git a/include/linux/zbud.h b/include/linux/zbud.h > new file mode 100644 > index 000..954252b > --- /dev/null > +++ b/include/linux/zbud.h > @@ -0,0 +1,22 @@ > +#ifndef _ZBUD_H_ > +#define _ZBUD_H_ > + > +#include > + > +struct zbud_pool; > + > +struct zbud_ops { > + int (*evict)(struct zbud_pool *pool, unsigned long handle); > +}; > + > +struct zbud_pool *zbud_create_pool(gfp_t gfp, struct zbud_ops *ops); > +void zbud_destroy_pool(struct zbud_pool *pool); > +int zbud_alloc(struct zbud_pool *pool, int size, gfp_t gfp, > + unsigned long *handle); > +void zbud_free(struct zbud_pool *pool, unsigned long handle); > +int zbud_reclaim_page(struct zbud_pool *pool, unsigned int retries); > +void *zbud_map(struct zbud_pool *pool, unsigned long handle); > +void zbud_unmap(struct zbud_pool *pool, unsigned long handle); > +int zbud_get_pool_size(struct zbud_pool *pool); > + > +#endif /* _ZBUD_H_ */ > diff --git a/mm/Kconfig b/mm/Kconfig > index e742d06..908f41b 100644 > --- a/mm/Kconfig > +++ b/mm/Kconfig > @@ -477,3 +477,13 @@ config FRONTSWAP > and swap data is stored as normal on the matching swap device. > > If unsure, say Y to enable frontswap. > + > +config ZBUD > + tristate "Buddy allocator for compressed pages" > + default n > + help > + zbud is an special purpose allocator for storing compressed pages. > + It is designed to store up to two compressed pages per physical page. > + While this design limits storage density, it has simple and > + deterministic reclaim properties that make it preferable to a higher > + density approach when reclaim will be used. 
> diff --git a/mm/Makefile b/mm/Makefile > index 72c5acb..95f0197 100644 > --- a/mm/Makefile > +++ b/mm/Makefile > @@ -58,3 +58,4 @@ obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o > obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o > obj-$(CONFIG_CLEANCACHE) += cleancache.o > obj-$(CONFIG_MEMORY_ISOLATION) += page_isolation.o > +obj-$(CONFIG_ZBUD) += zbud.o > diff --git a/mm/zbud.c b/mm/zbud.c > new file mode 100644 > index 000..e5bd0e6 > --- /dev/null > +++ b/mm/zbud.c > @@ -0,0 +1,564 @@ > +/* > + * zbud.c - Buddy Allocator for Compressed Pages > + * > + * Copyright (C) 2013, Seth Jennings, IBM > + * > + * Concepts based on zcache internal zbud allocator by Dan Magenheimer. > + * > + * zbud is an special purpose allocator for storing compressed pages. It is > + * designed to store up to two compressed pages per physical page. While > this > + * design limits storage density, it has simple and deterministic reclaim > + * properties that make it preferable to a higher density approach when > reclaim > + * will be used. > + * > + * zbud works by storing compressed pages, or "zpages", together in pairs in > a > + * single memory page called a "zbud
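The left/right-justified layout described in the commit message reduces to simple offset arithmetic. A userspace sketch (the 4096-byte page size is an illustrative assumption, and the real zbud also rounds sizes up to chunk boundaries, which this ignores):

```c
#include <assert.h>
#include <stddef.h>

#define POOL_PAGE_SIZE 4096 /* stand-in for PAGE_SIZE; illustrative */

/* The right buddy is placed so that it ends exactly at the page end. */
static size_t right_buddy_offset(size_t right_size)
{
	return POOL_PAGE_SIZE - right_size;
}

/* Slack left between the two buddies while both are allocated. */
static size_t slack_between(size_t left_size, size_t right_size)
{
	return POOL_PAGE_SIZE - left_size - right_size;
}

/*
 * Freeing the right buddy coalesces its space with the slack, leaving
 * the largest possible contiguous free region after the left buddy.
 */
static size_t free_after_right_freed(size_t left_size)
{
	return POOL_PAGE_SIZE - left_size;
}
```

Because a zbud page holds at most two zpages and a lone zpage never costs more than the single page it would have occupied uncompressed, the zpage-to-page ratio stays at or above 1, which is the "do no harm" bound the commit message refers to.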
RE: zsmalloc zbud hybrid design discussion?
> From: Seth Jennings [mailto:sjenn...@linux.vnet.ibm.com] > Subject: Re: zsmalloc zbud hybrid design discussion? > > On Thu, Apr 11, 2013 at 04:28:19PM -0700, Dan Magenheimer wrote: > > (Bob Liu added) > > > > > From: Seth Jennings [mailto:sjenn...@linux.vnet.ibm.com] > > > Subject: Re: zsmalloc zbud hybrid design discussion? > > > > > > On Wed, Mar 27, 2013 at 01:04:25PM -0700, Dan Magenheimer wrote: > > > > Seth and all zproject folks -- > > > > > > > > I've been giving some deep thought as to how a zpage > > > > allocator might be designed that would incorporate the > > > > best of both zsmalloc and zbud. > > > > > > > > Rather than dive into coding, it occurs to me that the > > > > best chance of success would be if all interested parties > > > > could first discuss (on-list) and converge on a design > > > > that we can all agree on. If we achieve that, I don't > > > > care who writes the code and/or gets the credit or > > > > chooses the name. If we can't achieve consensus, at > > > > least it will be much clearer where our differences lie. > > > > > > > > Any thoughts? > > > > Hi Seth! > > > > > I'll put some thoughts, keeping in mind that I'm not throwing zsmalloc > > > under > > > the bus here. Just what I would do starting from scratch given all that > > > has > > > happened. > > > > Excellent. Good food for thought. I'll add some of my thinking > > too and we can talk more next week. > > > > BTW, I'm not throwing zsmalloc under the bus either. I'm OK with > > using zsmalloc as a "base" for an improved hybrid, and even calling > > the result "zsmalloc". I *am* however willing to throw the > > "generic" nature of zsmalloc away... I think the combined requirements > > of the zprojects are complex enough and the likelihood of zsmalloc > > being appropriate for future "users" is low enough, that we should > > accept that zsmalloc is highly tuned for zprojects and modify it > > as required. I.e. 
the API to zsmalloc need
> > not be exposed to and documented for the rest of the kernel.
> >
> > > Simplicity - the simpler the better
> >
> > Generally I agree. But only if the simplicity addresses the
> > whole problem. I'm specifically very concerned that we have
> > an allocator that works well across a wide variety of zsize distributions,
> > even if it adds complexity to the allocator.
> >
> > > High density - LZO best case is ~40 bytes. That's around 1/100th of a
> > > page. I'd say it should support up to at least 64 objects per page in
> > > the best case. (see Reclaim effectiveness before responding here)
> >
> > Hmmm... if you pre-check for zero pages, I would guess the percentage
> > of pages with zsize less than 64 is actually quite small. But 64 size
> > classes may be a good place to start as long as it doesn't overly
> > complicate or restrict other design points.
> >
> > > No slab - the slab approach limits LRU and swap slot locality within
> > > the pool pages. Also swap slots have a tendency to be freed in
> > > clusters. If we improve locality within each pool page, it is more
> > > likely that page will be freed sooner as the zpages it contains will
> > > likely be invalidated all together.
> >
> > "Pool page" =?= "pageframe used by zsmalloc"
>
> Yes.
>
> > Isn't it true that there is no correlation between whether a
> > page is in the same cluster and the zsize (and thus size class) of
> > the zpage? So every zpage may end up in a different pool page
> > and this theory wouldn't work. Or am I misunderstanding?
>
> I think so. I didn't say this outright and should have: I'm thinking along
> the lines of a first-fit type method. So you just stack zpages up in a page
> until the page is full, then allocate a new one. Searching for free slots
> would ideally be done in reverse LRU so that you put new zpages in the most
> recently allocated page that has room. I'm still thinking how to do that
> efficiently.

OK I see.
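The first-fit, reverse-LRU idea sketched in that paragraph — put each new zpage in the most recently allocated pool page that still has room, else grab a fresh page — can be prototyped in a few lines of userspace C. The fixed-size pool and byte-granular packing are simplifying assumptions for illustration:

```c
#include <assert.h>
#include <stddef.h>

#define POOL_PAGE_SIZE 4096 /* stand-in for PAGE_SIZE; illustrative */
#define MAX_PAGES 8

/* Bytes used in each pool page; index npages-1 is the newest page. */
static size_t used[MAX_PAGES];
static int npages;

/*
 * First-fit sketch: scan pool pages newest-first ("reverse LRU") and
 * place the zpage in the most recently allocated page with room,
 * falling back to a fresh page.  Returns the pool page index, or -1
 * when the pool is exhausted.
 */
static int alloc_zpage(size_t size)
{
	for (int i = npages - 1; i >= 0; i--) {
		if (used[i] + size <= POOL_PAGE_SIZE) {
			used[i] += size;
			return i;
		}
	}
	if (npages == MAX_PAGES)
		return -1;
	used[npages] = size;
	return npages++;
}
```

The open question raised above — finding a free slot efficiently without a linear scan — is exactly what the loop here glosses over.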
You probably know that the xvmalloc allocator did something like that. I didn't study that code much but Nitin thought zsmalloc was much superior to xvmalloc. > > > Also, take a note out of the zbud playbook at track LRU based on pool > > > pages, > > > not zpages. One would fill allocation requests from
> > the most recently used pool page. Yes, I'm also thinking that should be
> > in any hybrid solution. A "global LRU queue" (like in zbud) could also be
> > applicable to entire zspages; this is similar to pageframe-reclaim except
> > all the pageframes in a zspage would be claimed at the same time.
>
> This brings up another thing that I left out that might be the stickiest
> part: eviction and reclaim. We first have to figure out if eviction is going
> to be initiated by the user or by the allocator. If we do it in the
> allocator, then I think we are going to muck up the API, because you'll have
> to register an eviction notification function that the allocator can call,
> once for each zpage in the page frame the allocator is trying to
> reclaim/free. The locking might get hairy in that case (user - allocator -
> user). Additionally, the user would have to maintain a different lookup
> system for zpages by address/handle. Alternatively, you
RE: zsmalloc zbud hybrid design discussion?
(Bob Liu added) > From: Seth Jennings [mailto:sjenn...@linux.vnet.ibm.com] > Subject: Re: zsmalloc zbud hybrid design discussion? > > On Wed, Mar 27, 2013 at 01:04:25PM -0700, Dan Magenheimer wrote: > > Seth and all zproject folks -- > > > > I've been giving some deep thought as to how a zpage > > allocator might be designed that would incorporate the > > best of both zsmalloc and zbud. > > > > Rather than dive into coding, it occurs to me that the > > best chance of success would be if all interested parties > > could first discuss (on-list) and converge on a design > > that we can all agree on. If we achieve that, I don't > > care who writes the code and/or gets the credit or > > chooses the name. If we can't achieve consensus, at > > least it will be much clearer where our differences lie. > > > > Any thoughts? Hi Seth! > I'll put some thoughts, keeping in mind that I'm not throwing zsmalloc under > the bus here. Just what I would do starting from scratch given all that has > happened. Excellent. Good food for thought. I'll add some of my thinking too and we can talk more next week. BTW, I'm not throwing zsmalloc under the bus either. I'm OK with using zsmalloc as a "base" for an improved hybrid, and even calling the result "zsmalloc". I *am* however willing to throw the "generic" nature of zsmalloc away... I think the combined requirements of the zprojects are complex enough and the likelihood of zsmalloc being appropriate for future "users" is low enough, that we should accept that zsmalloc is highly tuned for zprojects and modify it as required. I.e. the API to zsmalloc need not be exposed to and documented for the rest of the kernel. > Simplicity - the simpler the better Generally I agree. But only if the simplicity addresses the whole problem. I'm specifically very concerned that we have an allocator that works well across a wide variety of zsize distributions, even if it adds complexity to the allocator. > High density - LZO best case is ~40 bytes. 
That's around 1/100th of a page.
> I'd say it should support up to at least 64 objects per page in the best
> case. (see Reclaim effectiveness before responding here)

Hmmm... if you pre-check for zero pages, I would guess the percentage of pages with zsize less than 64 is actually quite small. But 64 size classes may be a good place to start as long as it doesn't overly complicate or restrict other design points.

> No slab - the slab approach limits LRU and swap slot locality within the
> pool pages. Also swap slots have a tendency to be freed in clusters. If we
> improve locality within each pool page, it is more likely that page will be
> freed sooner as the zpages it contains will likely be invalidated all
> together.

"Pool page" =?= "pageframe used by zsmalloc"

Isn't it true that there is no correlation between whether a page is in the same cluster and the zsize (and thus size class) of the zpage? So every zpage may end up in a different pool page and this theory wouldn't work. Or am I misunderstanding?

> Also, take a note out of the zbud playbook and track LRU based on pool
> pages, not zpages. One would fill allocation requests from the most recently
> used pool page.

Yes, I'm also thinking that should be in any hybrid solution. A "global LRU queue" (like in zbud) could also be applicable to entire zspages; this is similar to pageframe-reclaim except all the pageframes in a zspage would be claimed at the same time.

> Reclaim effectiveness - conflicts with density. As the number of zpages per
> page increases, the odds decrease that all of those objects will be
> invalidated, which is necessary to free up the underlying page, since moving
> objects out of sparsely used pages would involve compaction (see next). One
> solution is to lower the density, but I think that is self-defeating, as we
> lose much of the compression benefit through fragmentation.
I think the better solution > is to improve the likelihood that the zpages in the page are likely to be > freed > together through increased locality. I do think we should seriously reconsider ZS_MAX_ZSPAGE_ORDER==2. The value vs ZS_MAX_ZSPAGE_ORDER==0 is enough for most cases and 1 is enough for the rest. If get_pages_per_zspage were "flexible", there might be a better tradeoff of density vs reclaim effectiveness. I've some ideas along the lines of a hybrid adaptively combining buddying and slab which might make it rarely necessary to have pages_per_zspage exceed 2. That also might make it much easier to have "variable sized" zspages (size is always one or two). > Not a requirement: > > Compaction - compaction would basically involve creating a virtual address > space
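For a sense of scale on the density point above: 4096 / 40 ≈ 102, which is where the "around 1/100th of a page" figure comes from. A trivial sketch of that upper bound (the 4096-byte page size is an illustrative assumption, and per-object metadata is ignored, which favors density):

```c
#include <assert.h>
#include <stddef.h>

#define POOL_PAGE_SIZE 4096 /* stand-in for PAGE_SIZE; illustrative */

/* Upper bound on how many zpages of a given size fit in one pool page. */
static size_t max_objects_per_page(size_t zsize)
{
	return POOL_PAGE_SIZE / zsize;
}
```

So "at least 64 objects per page in the best case" sits comfortably below the theoretical ~102 for 40-byte LZO output.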
RE: zsmalloc defrag (Was: [PATCH] mm: remove compressed copy from zram in-memory)
> From: Minchan Kim [mailto:minc...@kernel.org] > Subject: Re: zsmalloc defrag (Was: [PATCH] mm: remove compressed copy from > zram in-memory) > > Hi Seth, > > On Tue, Apr 09, 2013 at 03:52:36PM -0500, Seth Jennings wrote: > > On 04/08/2013 08:36 PM, Minchan Kim wrote: > > > On Tue, Apr 09, 2013 at 10:27:19AM +0900, Minchan Kim wrote: > > >> Hi Dan, > > >> > > >> On Mon, Apr 08, 2013 at 09:32:38AM -0700, Dan Magenheimer wrote: > > >>>> From: Minchan Kim [mailto:minc...@kernel.org] > > >>>> Sent: Monday, April 08, 2013 12:01 AM > > >>>> Subject: [PATCH] mm: remove compressed copy from zram in-memory > > >>> > > >>> (patch removed) > > >>> > > >>>> Fragment ratio is almost same but memory consumption and compile time > > >>>> is better. I am working to add defragment function of zsmalloc. > > >>> > > >>> Hi Minchan -- > > >>> > > >>> I would be very interested in your design thoughts on > > >>> how you plan to add defragmentation for zsmalloc. In > > >> > > >> What I can say now about is only just a word "Compaction". > > >> As you know, zsmalloc has a transparent handle so we can do whatever > > >> under user. Of course, there is a tradeoff between performance > > >> and memory efficiency. I'm biased to latter for embedded usecase. > > >> > > >> And I might post it because as you know well, zsmalloc > > > > > > Incomplete sentense, > > > > > > I might not post it until promoting zsmalloc because as you know well, > > > zsmalloc/zram's all new stuffs are blocked into staging tree. > > > Even if we could add it into staging, as you know well, staging is where > > > every mm guys ignore so we end up needing another round to promote it. > > > sigh. > > > > Yes. The lack of compaction/defragmentation support in zsmalloc has not > > been raised as an obstacle to mainline acceptance so I think we should > > wait to add new features to a yet-to-be accepted codebase. 
> > > > Also, I think this feature is more important to zram than it is to > > zswap/zcache as they can do writeback to free zpages. In other words, > > the fragmentation is a transient issue for zswap/zcache since writeback > > to the swap device is possible. > > Other benefit derived from compaction work is that we can pick a zpage > from zspage and move it into somewhere. It means core mm could control > pages in zsmalloc freely. I'm not sure I understand which is why I'd like to learn more about your proposed design. Are you suggesting that core mm would periodically call zsmalloc-compaction and see what pages get freed? I'm hoping for more control than that. More good discussion for next week! Dan -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: zsmalloc defrag (Was: [PATCH] mm: remove compressed copy from zram in-memory)
> From: Minchan Kim [mailto:minc...@kernel.org] > Subject: Re: zsmalloc defrag (Was: [PATCH] mm: remove compressed copy from > zram in-memory) > > On Tue, Apr 09, 2013 at 01:37:47PM -0700, Dan Magenheimer wrote: > > > From: Minchan Kim [mailto:minc...@kernel.org] > > > Subject: Re: zsmalloc defrag (Was: [PATCH] mm: remove compressed copy > > > from zram in-memory) > > > > > > On Tue, Apr 09, 2013 at 10:27:19AM +0900, Minchan Kim wrote: > > > > Hi Dan, > > > > > > > > On Mon, Apr 08, 2013 at 09:32:38AM -0700, Dan Magenheimer wrote: > > > > > > From: Minchan Kim [mailto:minc...@kernel.org] > > > > > > Sent: Monday, April 08, 2013 12:01 AM > > > > > > Subject: [PATCH] mm: remove compressed copy from zram in-memory > > > > > > > > > > (patch removed) > > > > > > > > > > > Fragment ratio is almost same but memory consumption and compile > > > > > > time > > > > > > is better. I am working to add defragment function of zsmalloc. > > > > > > > > > > Hi Minchan -- > > > > > > > > > > I would be very interested in your design thoughts on > > > > > how you plan to add defragmentation for zsmalloc. In > > > > > > > > What I can say now about is only just a word "Compaction". > > > > As you know, zsmalloc has a transparent handle so we can do whatever > > > > under user. Of course, there is a tradeoff between performance > > > > and memory efficiency. I'm biased to latter for embedded usecase. > > > > > > > > And I might post it because as you know well, zsmalloc > > > > > > Incomplete sentense, > > > > > > I might not post it until promoting zsmalloc because as you know well, > > > zsmalloc/zram's all new stuffs are blocked into staging tree. > > > Even if we could add it into staging, as you know well, staging is where > > > every mm guys ignore so we end up needing another round to promote it. > > > sigh. > > > > > > I hope it gets better after LSF/MM. > > > > If zsmalloc is moving in the direction of supporting only zram, > > why should it be promoted into mm, or even lib? 
Why not promote > > zram into drivers and put zsmalloc.c in the same directory? > > I don't want to make zsmalloc zram specific and will do best effort > to generalize it to all z* family. I'm glad to hear that. You may not know/remember that the split between "old zcache" and "new zcache" (and the fork to zswap) was started because some people refused to accept changes to zsmalloc to support a broader set of requirements. > If it is hard to reach out > agreement, yes, forking could be an easy solution like other embedded > product company but I don't want it. I don't want it either, so I think it is wise for us all to understand each others' objectives to see if we can avoid a fork. Or if the objectives are too different, then we have data to explain to other kernel developers why a fork is necessary. Thanks! Dan -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: zsmalloc defrag (Was: [PATCH] mm: remove compressed copy from zram in-memory)
> From: Minchan Kim [mailto:minc...@kernel.org] > Subject: Re: zsmalloc defrag (Was: [PATCH] mm: remove compressed copy from > zram in-memory) > > On Tue, Apr 09, 2013 at 01:25:45PM -0700, Dan Magenheimer wrote: > > > From: Minchan Kim [mailto:minc...@kernel.org] > > > Subject: Re: zsmalloc defrag (Was: [PATCH] mm: remove compressed copy > > > from zram in-memory) > > > > > > Hi Dan, > > > > > > On Mon, Apr 08, 2013 at 09:32:38AM -0700, Dan Magenheimer wrote: > > > > > From: Minchan Kim [mailto:minc...@kernel.org] > > > > > Sent: Monday, April 08, 2013 12:01 AM > > > > > Subject: [PATCH] mm: remove compressed copy from zram in-memory > > > > > > > > (patch removed) > > > > > > > > > Fragment ratio is almost same but memory consumption and compile time > > > > > is better. I am working to add defragment function of zsmalloc. > > > > > > > > Hi Minchan -- > > > > > > > > I would be very interested in your design thoughts on > > > > how you plan to add defragmentation for zsmalloc. In > > > > > > What I can say now about is only just a word "Compaction". > > > As you know, zsmalloc has a transparent handle so we can do whatever > > > under user. Of course, there is a tradeoff between performance > > > and memory efficiency. I'm biased to latter for embedded usecase. > > > > Have you designed or implemented this yet? I have a couple > > of concerns: > > Not yet implemented but just had a time to think about it, simply. > So surely, there are some obstacle so I want to uncase the code and > number after I make a prototype/test the performance. > Of course, if it has a severe problem, will drop it without wasting > many guys's time. OK. I have some ideas that may be similar or may be very different than yours. Likely different, since I am coming at it from the angle of zcache which has some different requirements. So I'm hoping that by discussing design we can incorporate some of the zcache requirements before coding. 
> > 1) The handle is transparent to the "user", but it is still a form > >of a "pointer" to a zpage. Are you planning on walking zram's > >tables and changing those pointers? That may be OK for zram > >but for more complex data structures than tables (as in zswap > >and zcache) it may not be as easy, due to races, or as efficient > >because you will have to walk potentially very large trees. > > Rough concept is following as. > > I'm considering for zsmalloc to return transparent fake handle > but we have to maintain it with real one. > It could be done in zsmalloc internal so there isn't any race we should > consider. That sounds very difficult because I think you will need an extra level of indirection to translate every fake handle to every real handle/pointer (like virtual-to-physical page tables). Or do you have some more clever idea? > > 2) Compaction in the kernel is heavily dependent on page migration > >and page migration is dependent on using flags in the struct page. > >There's a lot of code in those two code modules and there > >are going to be a lot of implementation differences between > >compacting pages vs compacting zpages. > > Compaction of kernel is never related to zsmalloc's one. OK. Compaction has certain meaning in the kernel. Defrag is usually used I think for what we are discussing here. So I thought you might be planning on doing exactly what the kernel does that it calls compaction. > > I'm also wondering if you will be implementing "variable length > > zspages". Without that, I'm not sure compaction will help > > enough. (And that is a good example of the difference between > > Why do you think so? > variable lengh zspage could be further step to improve but it's not > only a solution to solve fragmentation. In my partial-design-in-my-head, they are related, but I think I understand what you mean. You are planning to move zpages across zspage boundaries, and I am not. 
So I think your solution will result in better density but may be harder to implement. > > > > particular, I am wondering if your design will also > > > > handle the requirements for zcache (especially for > > > > cleancache pages) and perhaps also for ramster. > > > > > > I don't know requirements for cleancache pages but compaction is > > > general as you know well so I expect you can get a benefit from it > > > if you are concern on memory efficiency but not sure it's valuable > >
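The "extra level of indirection" discussed above can be sketched as a small handle table: the user-visible ("fake") handle is just an index, and compaction updates only the table entry when it moves an object. Everything here (the names, the fixed-size table, the two-field location) is hypothetical illustration, not zsmalloc's actual implementation:

```c
#include <assert.h>

/* Hypothetical handle table: the "fake" handle handed to zram/zswap/
 * zcache is just an index; the entry holds the real location. */
#define MAX_HANDLES 64

struct zloc {
    int page;     /* which pool page the compressed object lives in */
    int offset;   /* byte offset within that page */
};

struct zloc handle_tab[MAX_HANDLES];
int nr_handles;

/* Allocation returns an opaque handle, never a pointer. */
int fake_alloc(int page, int offset)
{
    handle_tab[nr_handles].page = page;
    handle_tab[nr_handles].offset = offset;
    return nr_handles++;
}

/* Every access translates handle -> real location through the table. */
struct zloc *fake_map(int handle)
{
    return &handle_tab[handle];
}

/* The compactor moves an object and fixes up only the table entry;
 * handles held by users stay valid across the move, so no user data
 * structures need to be walked. */
void compact_move(int handle, int new_page, int new_offset)
{
    handle_tab[handle].page = new_page;
    handle_tab[handle].offset = new_offset;
}
```

Dan's cost concern is also visible here: every map goes through one more memory reference, much like a virtual-to-physical page-table walk.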
RE: zsmalloc defrag (Was: [PATCH] mm: remove compressed copy from zram in-memory)
> From: Seth Jennings [mailto:sjenn...@linux.vnet.ibm.com] > Subject: Re: zsmalloc defrag (Was: [PATCH] mm: remove compressed copy from > zram in-memory) > > On 04/08/2013 08:36 PM, Minchan Kim wrote: > > On Tue, Apr 09, 2013 at 10:27:19AM +0900, Minchan Kim wrote: > >> Hi Dan, > >> > >> On Mon, Apr 08, 2013 at 09:32:38AM -0700, Dan Magenheimer wrote: > >>>> From: Minchan Kim [mailto:minc...@kernel.org] > >>>> Sent: Monday, April 08, 2013 12:01 AM > >>>> Subject: [PATCH] mm: remove compressed copy from zram in-memory > >>> > >>> (patch removed) > >>> > >>>> Fragment ratio is almost same but memory consumption and compile time > >>>> is better. I am working to add defragment function of zsmalloc. > >>> > >>> Hi Minchan -- > >>> > >>> I would be very interested in your design thoughts on > >>> how you plan to add defragmentation for zsmalloc. In > >> > >> What I can say now about is only just a word "Compaction". > >> As you know, zsmalloc has a transparent handle so we can do whatever > >> under user. Of course, there is a tradeoff between performance > >> and memory efficiency. I'm biased to latter for embedded usecase. > >> > >> And I might post it because as you know well, zsmalloc > > > > Incomplete sentense, > > > > I might not post it until promoting zsmalloc because as you know well, > > zsmalloc/zram's all new stuffs are blocked into staging tree. > > Even if we could add it into staging, as you know well, staging is where > > every mm guys ignore so we end up needing another round to promote it. sigh. > > Yes. The lack of compaction/defragmentation support in zsmalloc has not > been raised as an obstacle to mainline acceptance so I think we should > wait to add new features to a yet-to-be accepted codebase. Um, I explicitly raised as an obstacle the greatly reduced density for zsmalloc on active workloads and on zsize distributions that skew fat. 
Understanding that more deeply and hopefully fixing it is an issue, and compaction/defragmentation is a step in that direction. > Also, I think this feature is more important to zram than it is to > zswap/zcache as they can do writeback to free zpages. In other words, > the fragmentation is a transient issue for zswap/zcache since writeback > to the swap device is possible. Actually, I think I demonstrated that the zpage-based writeback in zswap makes fragmentation worse. Zcache doesn't use zsmalloc in part because it doesn't support pageframe writeback. If zsmalloc can fix this (and it may be easier to fix depending on the design and implementation of compaction/defrag, which is why I'm asking lots of questions), zcache may be able to make use of zsmalloc. Lots of good discussion fodder for next week! Dan
RE: [PATCH 00/10] staging: zcache/ramster: fix and ramster/debugfs improvement
> From: Wanpeng Li [mailto:liw...@linux.vnet.ibm.com] > Sent: Tuesday, April 09, 2013 6:26 PM > To: Greg Kroah-Hartman > Cc: Dan Magenheimer; Seth Jennings; Konrad Rzeszutek Wilk; Minchan Kim; > linux...@kvack.org; linux- > ker...@vger.kernel.org; Andrew Morton; Bob Liu; Wanpeng Li > Subject: [PATCH 00/10] staging: zcache/ramster: fix and ramster/debugfs > improvement > > Fix bugs in zcache and rips out the debug counters out of ramster.c and > sticks them in a debug.c file. Introduce accessory functions for counters > increase/decrease, they are available when config RAMSTER_DEBUG, otherwise > they are empty non-debug functions. Using an array to initialize/use debugfs > attributes to make them neater. Dan Magenheimer confirm these works > are needed. http://marc.info/?l=linux-mm&m=136535713106882&w=2 > > Patch 1~2 fix bugs in zcache > > Patch 3~8 rips out the debug counters out of ramster.c and sticks them > in a debug.c file > > Patch 9 fix coding style issue introduced in zcache2 cleanups > (s/int/bool + debugfs movement) patchset > > Patch 10 add how-to for ramster Note my preference to not apply patch 2of10 (which GregKH may choose to override), but for all, please add my: Acked-by: Dan Magenheimer <dan.magenhei...@oracle.com>
RE: [PATCH 02/10] staging: zcache: remove zcache_freeze
> From: Wanpeng Li [mailto:liw...@linux.vnet.ibm.com] > Subject: [PATCH 02/10] staging: zcache: remove zcache_freeze > > The default value of zcache_freeze is false and it won't be modified by > other codes. Remove zcache_freeze since no routine can disable zcache > during system running. > > Signed-off-by: Wanpeng Li <liw...@linux.vnet.ibm.com> I'd prefer to leave this code in place as it may be very useful if/when zcache becomes more tightly integrated into the MM subsystem and the rest of the kernel. And the subtleties for temporarily disabling zcache (which is what zcache_freeze does) are non-obvious and may cause data loss so if someone wants to add this functionality back in later and doesn't have this piece of code, it may take a lot of pain to get it working. Usage example: All CPUs are fully saturated so it is questionable whether spending CPU cycles for compression is wise. Kernel could disable zcache using zcache_freeze. (Yes, a new entry point would need to be added to enable/disable zcache_freeze.) My two cents... others are welcome to override. > --- > drivers/staging/zcache/zcache-main.c | 55 > +++--- > 1 file changed, 18 insertions(+), 37 deletions(-) > > diff --git a/drivers/staging/zcache/zcache-main.c > b/drivers/staging/zcache/zcache-main.c > index e23d814..fe6801a 100644 > --- a/drivers/staging/zcache/zcache-main.c > +++ b/drivers/staging/zcache/zcache-main.c > @@ -1118,15 +1118,6 @@ free_and_out: > #endif /* CONFIG_ZCACHE_WRITEBACK */ > > /* > - * When zcache is disabled ("frozen"), pools can be created and destroyed, > - * but all puts (and thus all other operations that require memory allocation) > - * must fail. If zcache is unfrozen, accepts puts, then frozen again, > - * data consistency requires all puts while frozen to be converted into > - * flushes. 
> - */ > -static bool zcache_freeze; > - > -/* > * This zcache shrinker interface reduces the number of ephemeral pageframes > * used by zcache to approximately the same as the total number of LRU_FILE > * pageframes in use, and now also reduces the number of persistent pageframes > @@ -1221,44 +1212,34 @@ int zcache_put_page(int cli_id, int pool_id, struct > tmem_oid *oidp, > { > struct tmem_pool *pool; > struct tmem_handle th; > - int ret = -1; > + int ret = 0; > void *pampd = NULL; > > BUG_ON(!irqs_disabled()); > pool = zcache_get_pool_by_id(cli_id, pool_id); > if (unlikely(pool == NULL)) > goto out; > - if (!zcache_freeze) { > - ret = 0; > - th.client_id = cli_id; > - th.pool_id = pool_id; > - th.oid = *oidp; > - th.index = index; > - pampd = zcache_pampd_create((char *)page, size, raw, > - ephemeral, &th); > - if (pampd == NULL) { > - ret = -ENOMEM; > - if (ephemeral) > - inc_zcache_failed_eph_puts(); > - else > - inc_zcache_failed_pers_puts(); > - } else { > - if (ramster_enabled) > - ramster_do_preload_flnode(pool); > - ret = tmem_put(pool, oidp, index, 0, pampd); > - if (ret < 0) > - BUG(); > - } > - zcache_put_pool(pool); > + > + th.client_id = cli_id; > + th.pool_id = pool_id; > + th.oid = *oidp; > + th.index = index; > + pampd = zcache_pampd_create((char *)page, size, raw, > + ephemeral, &th); > + if (pampd == NULL) { > + ret = -ENOMEM; > + if (ephemeral) > + inc_zcache_failed_eph_puts(); > + else > + inc_zcache_failed_pers_puts(); > } else { > - inc_zcache_put_to_flush(); > if (ramster_enabled) > ramster_do_preload_flnode(pool); > - if (atomic_read(&pool->obj_count) > 0) > - /* the put fails whether the flush succeeds or not */ > - (void)tmem_flush_page(pool, oidp, index); > - zcache_put_pool(pool); > + ret = tmem_put(pool, oidp, index, 0, pampd); > + if (ret < 0) > + BUG(); > } > + zcache_put_pool(pool); > out: > return ret; > } > -- > 1.7.10.4
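The semantics of the branch the patch deletes boil down to one invariant: while frozen, a put must both fail and flush any existing copy, so a later get can never return stale data. A toy model of that invariant (the names and the single-slot "pool" standing in for tmem are invented for illustration; this is not the zcache code):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy single-slot "pool" standing in for tmem; illustration only. */
bool frozen;        /* plays the role of zcache_freeze */
bool slot_valid;
int slot_data;

void set_frozen(bool f)
{
    frozen = f;
}

/* Put: store when unfrozen. While frozen, the put fails AND the old
 * copy is flushed; otherwise a later get could return stale data. */
int toy_put(int data)
{
    if (frozen) {
        slot_valid = false;   /* put-while-frozen becomes a flush */
        return -1;
    }
    slot_data = data;
    slot_valid = true;
    return 0;
}

/* Get: succeeds only if a valid copy is present. */
int toy_get(int *out)
{
    if (!slot_valid)
        return -1;
    *out = slot_data;
    return 0;
}
```

The non-obvious subtlety Dan mentions is exactly the flush: if a frozen put merely failed without invalidating the slot, a later get would hand back the earlier, now-stale copy.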
RE: zsmalloc defrag (Was: [PATCH] mm: remove compressed copy from zram in-memory)
> From: Minchan Kim [mailto:minc...@kernel.org] > Subject: Re: zsmalloc defrag (Was: [PATCH] mm: remove compressed copy from > zram in-memory) > > > > I don't know requirements for cleancache pages but compaction is > > > general as you know well so I expect you can get a benefit from it > > > if you are concern on memory efficiency but not sure it's valuable > > > to compact cleancache pages for getting more slot in RAM. > > > Sometime, just discarding would be much better, IMHO. > > > > Zcache has page reclaim. Zswap has zpage reclaim. I am concerned that > > these continue to work in the presence of compaction. With no reclaim at > > all, zram is a simpler use case but if you implement compaction in a way > > that can't be used by either zcache or zswap, then zsmalloc is > > essentially forking. > > Don't go too far. If it's really problem for zswap and zcache, maybe, > we could add it optionally. Good, I think it should be possible to do it optionally too. In https://lkml.org/lkml/2013/3/27/501 I suggested it would be good to work together on a common design, but you didn't reply. Are you thinking
RE: zsmalloc defrag (Was: [PATCH] mm: remove compressed copy from zram in-memory)
> From: Minchan Kim [mailto:minc...@kernel.org] > Subject: Re: zsmalloc defrag (Was: [PATCH] mm: remove compressed copy from > zram in-memory) > > Hi Seth, > > On Tue, Apr 09, 2013 at 03:52:36PM -0500, Seth Jennings wrote: > > Also, I think this feature is more important to zram than it is to > > zswap/zcache as they can do writeback to free zpages. In other words, > > the fragmentation is a transient issue for zswap/zcache since writeback > > to the swap device is possible. > > Other benefit derived from compaction work is that we can pick a zpage > from zspage and move it into somewhere. It means core mm could control > pages in zsmalloc freely. I'm not sure I understand which is why I'd like to learn more about your proposed design. Are you suggesting that core mm would periodically call zsmalloc-compaction and see what pages get freed? I'm hoping for more control than that. More good discussion for next week! Dan
RE: zsmalloc zbud hybrid design discussion?
(Bob Liu added)
> From: Seth Jennings [mailto:sjenn...@linux.vnet.ibm.com] > Subject: Re: zsmalloc zbud hybrid design discussion? > On Wed, Mar 27, 2013 at 01:04:25PM -0700, Dan Magenheimer wrote: > > Seth and all zproject folks -- I've been giving some deep thought as to how a zpage allocator might be designed that would incorporate the best of both zsmalloc and zbud. Rather than dive into coding, it occurs to me that the best chance of success would be if all interested parties could first discuss (on-list) and converge on a design that we can all agree on. If we achieve that, I don't care who writes the code and/or gets the credit or chooses the name. If we can't achieve consensus, at least it will be much clearer where our differences lie. Any thoughts?
Hi Seth!
> I'll put some thoughts, keeping in mind that I'm not throwing zsmalloc under the bus here. Just what I would do starting from scratch given all that has happened.
Excellent. Good food for thought. I'll add some of my thinking too and we can talk more next week. BTW, I'm not throwing zsmalloc under the bus either. I'm OK with using zsmalloc as a base for an improved hybrid, and even calling the result zsmalloc. I *am* however willing to throw the generic nature of zsmalloc away... I think the combined requirements of the zprojects are complex enough, and the likelihood of zsmalloc being appropriate for future users is low enough, that we should accept that zsmalloc is highly tuned for zprojects and modify it as required. I.e. the API to zsmalloc need not be exposed to and documented for the rest of the kernel.
> Simplicity - the simpler the better
Generally I agree. But only if the simplicity addresses the whole problem. I'm specifically very concerned that we have an allocator that works well across a wide variety of zsize distributions, even if it adds complexity to the allocator.
> High density - LZO best case is ~40 bytes. That's around 1/100th of a page. I'd say it should support up to at least 64 objects per page in the best case. (see Reclaim effectiveness before responding here)
Hmmm... if you pre-check for zero pages, I would guess the percentage of pages with zsize less than 64 is actually quite small. But 64 size classes may be a good place to start as long as it doesn't overly complicate or restrict other design points.
> No slab - the slab approach limits LRU and swap slot locality within the pool pages. Also swap slots have a tendency to be freed in clusters. If we improve locality within each pool page, it is more likely that page will be freed sooner as the zpages it contains will likely be invalidated all together.
Pool page =?= pageframe used by zsmalloc. Isn't it true that there is no correlation between whether a page is in the same cluster and the zsize (and thus size class) of the zpage? So every zpage may end up in a different pool page and this theory wouldn't work. Or am I misunderstanding?
> Also, take a note out of the zbud playbook and track LRU based on pool pages, not zpages. One would fill allocation requests from the most recently used pool page.
Yes, I'm also thinking that should be in any hybrid solution. A global LRU queue (like in zbud) could also be applicable to entire zspages; this is similar to pageframe-reclaim except all the pageframes in a zspage would be reclaimed at the same time.
> Reclaim effectiveness - conflicts with density. As the number of zpages per page increases, the odds decrease that all of those objects will be invalidated, which is necessary to free up the underlying page, since moving objects out of sparsely used pages would involve compaction (see next). One solution is to lower the density, but I think that is self-defeating as we lose much of the compression benefit through fragmentation. I think the better solution is to improve the likelihood that the zpages in the page are likely to be freed together through increased locality.
I do think we should seriously reconsider ZS_MAX_ZSPAGE_ORDER==2. The value ZS_MAX_ZSPAGE_ORDER==0 is enough for most cases and 1 is enough for the rest. If get_pages_per_zspage were flexible, there might be a better tradeoff of density vs reclaim effectiveness. I've some ideas along the lines of a hybrid adaptively combining buddying and slab which might make it rarely necessary to have pages_per_zspage exceed 2. That also might make it much easier to have variable sized zspages (size is always one or two).
> Not a requirement: Compaction - compaction would basically involve creating a virtual address space of sorts, which zsmalloc is capable of through its API with handles, not pointers. However, as Dan points out this requires a structure to maintain the mappings and adds to complexity. Additionally, the need for compaction diminishes as the allocations are short-lived with frontswap backends doing writeback and cleancache backends shrinking.
I have an idea
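The density-vs-zspage-size tradeoff around get_pages_per_zspage can be sketched numerically. Below is a toy Python model of the idea: for a given object size class, pick the zspage size (in pages) that wastes the least space. The constants and the exact selection rule are illustrative approximations of what zsmalloc does, not the kernel code:

```python
PAGE_SIZE = 4096
MAX_PAGES_PER_ZSPAGE = 4  # ZS_MAX_ZSPAGE_ORDER == 2 allows up to 4 pages

def pages_per_zspage(class_size):
    """Pick the zspage size (in pages) with the least internal waste
    for this size class (hypothetical model of get_pages_per_zspage)."""
    best_pages, best_used_pct = 1, 0
    for pages in range(1, MAX_PAGES_PER_ZSPAGE + 1):
        zspage_bytes = pages * PAGE_SIZE
        waste = zspage_bytes % class_size          # leftover tail bytes
        used_pct = (zspage_bytes - waste) * 100 // zspage_bytes
        if used_pct > best_used_pct:
            best_pages, best_used_pct = pages, used_pct
    return best_pages

# Size classes that divide a page evenly need only 1 page; awkward
# classes (e.g. ~2/3 or ~4/5 of a page) want larger zspages.
for size in (64, 2720, 3264, 4096):
    print(size, "->", pages_per_zspage(size), "page(s) per zspage")
```

This also shows why "flexible" pages_per_zspage matters for reclaim: the classes that benefit most from multi-page zspages are exactly the ones where freeing the underlying pageframes requires all co-resident zpages to die together.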
RE: zsmalloc defrag (Was: [PATCH] mm: remove compressed copy from zram in-memory)
> From: Minchan Kim [mailto:minc...@kernel.org] > Subject: Re: zsmalloc defrag (Was: [PATCH] mm: remove compressed copy from > zram in-memory) > > On Tue, Apr 09, 2013 at 10:27:19AM +0900, Minchan Kim wrote: > > Hi Dan, > > > > On Mon, Apr 08, 2013 at 09:32:38AM -0700, Dan Magenheimer wrote: > > > > From: Minchan Kim [mailto:minc...@kernel.org] > > > > Sent: Monday, April 08, 2013 12:01 AM > > > > Subject: [PATCH] mm: remove compressed copy from zram in-memory > > > > > > (patch removed) > > > > > > > Fragment ratio is almost same but memory consumption and compile time > > > > is better. I am working to add defragment function of zsmalloc. > > > > > > Hi Minchan -- > > > > > > I would be very interested in your design thoughts on > > > how you plan to add defragmentation for zsmalloc. In > > > > What I can say now about is only just a word "Compaction". > > As you know, zsmalloc has a transparent handle so we can do whatever > > under user. Of course, there is a tradeoff between performance > > and memory efficiency. I'm biased to latter for embedded usecase. > > > > And I might post it because as you know well, zsmalloc > > Incomplete sentense, > > I might not post it until promoting zsmalloc because as you know well, > zsmalloc/zram's all new stuffs are blocked into staging tree. > Even if we could add it into staging, as you know well, staging is where > every mm guys ignore so we end up needing another round to promote it. sigh. > > I hope it gets better after LSF/MM. If zsmalloc is moving in the direction of supporting only zram, why should it be promoted into mm, or even lib? Why not promote zram into drivers and put zsmalloc.c in the same directory?
RE: zsmalloc defrag (Was: [PATCH] mm: remove compressed copy from zram in-memory)
> From: Minchan Kim [mailto:minc...@kernel.org] > Subject: Re: zsmalloc defrag (Was: [PATCH] mm: remove compressed copy from > zram in-memory) > > Hi Dan, > > On Mon, Apr 08, 2013 at 09:32:38AM -0700, Dan Magenheimer wrote: > > > From: Minchan Kim [mailto:minc...@kernel.org] > > > Sent: Monday, April 08, 2013 12:01 AM > > > Subject: [PATCH] mm: remove compressed copy from zram in-memory > > > > (patch removed) > > > > > Fragment ratio is almost same but memory consumption and compile time > > > is better. I am working to add defragment function of zsmalloc. > > > > Hi Minchan -- > > > > I would be very interested in your design thoughts on > > how you plan to add defragmentation for zsmalloc. In > > What I can say now about is only just a word "Compaction". > As you know, zsmalloc has a transparent handle so we can do whatever > under user. Of course, there is a tradeoff between performance > and memory efficiency. I'm biased to latter for embedded usecase. Have you designed or implemented this yet? I have a couple of concerns: 1) The handle is transparent to the "user", but it is still a form of a "pointer" to a zpage. Are you planning on walking zram's tables and changing those pointers? That may be OK for zram but for more complex data structures than tables (as in zswap and zcache) it may not be as easy, due to races, or as efficient because you will have to walk potentially very large trees. 2) Compaction in the kernel is heavily dependent on page migration and page migration is dependent on using flags in the struct page. There's a lot of code in those two code modules and there are going to be a lot of implementation differences between compacting pages vs compacting zpages. I'm also wondering if you will be implementing "variable length zspages". Without that, I'm not sure compaction will help enough. (And that is a good example of the difference between the kernel page compaction design/code and zspage compaction.) 
> > particular, I am wondering if your design will also > > handle the requirements for zcache (especially for > > cleancache pages) and perhaps also for ramster. > > I don't know requirements for cleancache pages but compaction is > general as you know well so I expect you can get a benefit from it > if you are concern on memory efficiency but not sure it's valuable > to compact cleancache pages for getting more slot in RAM. > Sometime, just discarding would be much better, IMHO. Zcache has page reclaim. Zswap has zpage reclaim. I am concerned that these continue to work in the presence of compaction. With no reclaim at all, zram is a simpler use case but if you implement compaction in a way that can't be used by either zcache or zswap, then zsmalloc is essentially forking. > > In https://lkml.org/lkml/2013/3/27/501 I suggested it > > would be good to work together on a common design, but > > you didn't reply. Are you thinking that zsmalloc > > I saw the thread but explicit agreement is really matter? > I believe everybody want it although they didn't reply. :) > > You can make the design/post it or prototyping/post it. > If there are some conflit with something in my brain, > I will be happy to feedback. :) > > Anyway, I think my above statement "COMPACTION" would be enough to > express my current thought to avoid duplicated work and you can catch up. > > I will get around to it after LSF/MM. > > > improvements should focus only on zram, in which case > > Just focusing zsmalloc. Right. Again, I am asking if you are changing zsmalloc in a way that helps zram but hurts zswap and makes it impossible for zcache to ever use the improvements to zsmalloc. If so, that's fine, but please make it clear that is your goal. > > we may -- and possibly should -- end up with a different > > allocator for frontswap-based/cleancache-based compression > > in zcache (and possibly zswap)? 
> > > I'm just trying to determine if I should proceed separately > > with my design (with Bob Liu, who expressed interest) or if > > it would be beneficial to work together. > > Just posting and if it affects zsmalloc/zram/zswap and goes the way > I don't want, I will involve the discussion because our product uses > zram heavily and consider zswap, too. > > I really appreciate your enthusiastic collaboration model to find > optimal solution! My goal is to have compression be an integral part of Linux memory management. It may be tied to a config option, but the goal is that distros turn it on by default. I don't think zsmalloc meets that objective yet, but it may be fine for your needs. If so it would be good to understand exactly why it doesn't meet the other zproject needs.
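The "transparent handle" point in the exchange above can be made concrete with a toy model: because callers hold opaque handles rather than raw pointers, the allocator can relocate objects during compaction by updating only its own internal handle table, with no need to walk the caller's data structures. This is a hypothetical Python sketch of that indirection, not zsmalloc's actual layout or API:

```python
class HandleAllocator:
    """Toy allocator with transparent handles: users keep opaque
    integer handles; only the internal table maps a handle to its
    current storage slot, so compaction may move objects freely."""

    def __init__(self):
        self._next_handle = 0
        self._table = {}   # handle -> slot index (the only "pointer")
        self._slots = []   # backing storage; frees leave holes

    def alloc(self, data):
        handle = self._next_handle
        self._next_handle += 1
        self._slots.append(data)
        self._table[handle] = len(self._slots) - 1
        return handle

    def free(self, handle):
        self._slots[self._table.pop(handle)] = None  # leave a hole

    def read(self, handle):
        return self._slots[self._table[handle]]

    def compact(self):
        """Slide live objects down and rewrite only the handle table;
        user-held handles stay valid across the move."""
        live = sorted(self._table.items(), key=lambda kv: kv[1])
        new_slots = []
        for handle, idx in live:
            self._table[handle] = len(new_slots)
            new_slots.append(self._slots[idx])
        self._slots = new_slots

a = HandleAllocator()
h1, h2, h3 = a.alloc(b"x"), a.alloc(b"y"), a.alloc(b"z")
a.free(h2)
a.compact()
print(a.read(h1), a.read(h3), len(a._slots))  # handles survive compaction
```

Dan's concern maps onto the same model: if a backend had cached raw slot indices instead of handles, compact() would silently invalidate them, which is why compaction support depends on every user going through the handle indirection.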
zsmalloc defrag (Was: [PATCH] mm: remove compressed copy from zram in-memory)
> From: Minchan Kim [mailto:minc...@kernel.org] > Sent: Monday, April 08, 2013 12:01 AM > Subject: [PATCH] mm: remove compressed copy from zram in-memory (patch removed) > Fragment ratio is almost same but memory consumption and compile time > is better. I am working to add defragment function of zsmalloc. Hi Minchan -- I would be very interested in your design thoughts on how you plan to add defragmentation for zsmalloc. In particular, I am wondering if your design will also handle the requirements for zcache (especially for cleancache pages) and perhaps also for ramster. In https://lkml.org/lkml/2013/3/27/501 I suggested it would be good to work together on a common design, but you didn't reply. Are you thinking that zsmalloc improvements should focus only on zram, in which case we may -- and possibly should -- end up with a different allocator for frontswap-based/cleancache-based compression in zcache (and possibly zswap)? I'm just trying to determine if I should proceed separately with my design (with Bob Liu, who expressed interest) or if it would be beneficial to work together. Thanks, Dan
RE: [PATCH part2 v6 0/3] staging: zcache: Support zero-filled pages more efficiently
> From: Dan Magenheimer > Subject: RE: [PATCH part2 v6 0/3] staging: zcache: Support zero-filled pages > more efficiently > > > From: Wanpeng Li [mailto:liw...@linux.vnet.ibm.com] > > Subject: Re: [PATCH part2 v6 0/3] staging: zcache: Support zero-filled > > pages more efficiently > > > > Hi Dan, > > > > Some issues against Ramster: > > > > Sure! I am concerned about Konrad's patches adding debug.c as they > add many global variables. They are only required when ZCACHE_DEBUG > is enabled so they may be ok. If not, adding ramster variables > to debug.c may make the problem worse. Oops, I just noticed/remembered that ramster uses BOTH debugfs and sysfs. The sysfs variables are all currently required, i.e. for configuration so should not be tied to debugfs or a DEBUG config option. However, if there is a more acceptable way to implement the function of those sysfs variables, that would be fine. Thanks, Dan
RE: [PATCH part2 v6 0/3] staging: zcache: Support zero-filled pages more efficiently
> From: Wanpeng Li [mailto:liw...@linux.vnet.ibm.com] > Subject: Re: [PATCH part2 v6 0/3] staging: zcache: Support zero-filled pages > more efficiently > > Hi Dan, > > Some issues against Ramster: > > - Ramster who takes advantage of zcache also should support zero-filled > pages more efficiently, correct? It doesn't handle zero-filled pages well > currently. When you first posted your patchset I took a quick look at ramster and it looked like your patchset should work for ramster also. However I didn't actually run ramster to try it so there may be a bug. If it doesn't work, I would very much appreciate a patch. > - Ramster DebugFS counters are exported in /sys/kernel/mm/, but > zcache/frontswap/cleancache > all are exported in /sys/kernel/debug/, should we unify them? That would be great. > - If ramster also should move DebugFS counters to a single file like > zcache do? Sure! I am concerned about Konrad's patches adding debug.c as they add many global variables. They are only required when ZCACHE_DEBUG is enabled so they may be ok. If not, adding ramster variables to debug.c may make the problem worse. > If you confirm these issues are make sense to fix, I will start coding. ;-) That would be great. Note that I have a how-to for ramster here: https://oss.oracle.com/projects/tmem/dist/files/RAMster/HOWTO-120817 If when you are testing you find that this how-to has mistakes, please let me know. Or feel free to add the (corrected) how-to file as a patch in your patchset. Thanks very much, Wanpeng, for your great contributions! (Ric, since you have expressed interest in ramster, if you try it and find corrections to the how-to file above, your input would be very much appreciated also!) Dan
RE: [PATCHv8 5/8] mm: break up swap_writepage() for frontswap backends
> From: Seth Jennings [mailto:sjenn...@linux.vnet.ibm.com] > Subject: Re: [PATCHv8 5/8] mm: break up swap_writepage() for frontswap > backends > > On 04/04/2013 05:10 PM, Seth Jennings wrote: > > swap_writepage() is currently where frontswap hooks into the swap > > write path to capture pages with the frontswap_store() function. > > However, if a frontswap backend wants to "resume" the writeback of > > a page to the swap device, it can't call swap_writepage() as > > the page will simply reenter the backend. > > > > This patch separates swap_writepage() into a top and bottom half, the > > bottom half named __swap_writepage() to allow a frontswap backend, > > like zswap, to resume writeback beyond the frontswap_store() hook. > > > > __add_to_swap_cache() is also made non-static so that the page for > > which writeback is to be resumed can be added to the swap cache. > > > > Acked-by: Minchan Kim > > Signed-off-by: Seth Jennings > > Adding Cc Bob Liu. > > I just remembered that Bob had done a repost of the 5 and 6 patches, > outside the zswap thread, with a small change to avoid a checkpatch > warning. I didn't pull that change into my version, but I should have. > > It doesn't make a functional difference, so this patch can still go > forward and the checkpatch warning can be cleaned up in a subsequent > patch. If another revision of the patchset is needed for other > reasons, I'll pull this change into the next version. > > I think Dan and Bob would be ok with their tags being applied to 5 and 6: > > Acked-by: Bob Liu > Reviewed-by: Dan Magenheimer > > That ok? OK with me. I do support these two MM patches as candidates for the 3.10 window since both zswap AND in-tree zcache depend on them, but the silence from Andrew was a bit deafening. Seth, perhaps you could add a #ifdef CONFIG_ZSWAP_WRITEBACK to the zswap code and Kconfig (as zcache has done) and then these two patches in your patchset can be reviewed separately? 
RE: [PATCHv8 0/8] zswap: compressed swap caching
> From: Seth Jennings [mailto:sjenn...@linux.vnet.ibm.com] > Subject: [PATCHv8 0/8] zswap: compressed swap caching > > ... I am submitting this as a > candidate for merging in the v3.10 window... > : > I'll be attending the LSF/MM summit where there (hopefully) will be a > discussion this patchset and memory compression in general. IMHO it would be good to first have the discussion at LSF/MM.
RE: [PATCHv8 0/8] zswap: compressed swap caching
From: Seth Jennings [mailto:sjenn...@linux.vnet.ibm.com] Subject: [PATCHv8 0/8] zswap: compressed swap caching ... I am submitting this as a candidate for merging in the v3.10 window... : I'll be attending the LSF/MM summit where there (hopefully) will be a discussion this patchset and memory compression in general. IMHO it would be good to first have the discussion at LSF/MM. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCHv8 5/8] mm: break up swap_writepage() for frontswap backends
From: Seth Jennings [mailto:sjenn...@linux.vnet.ibm.com] Subject: Re: [PATCHv8 5/8] mm: break up swap_writepage() for frontswap backends On 04/04/2013 05:10 PM, Seth Jennings wrote: swap_writepage() is currently where frontswap hooks into the swap write path to capture pages with the frontswap_store() function. However, if a frontswap backend wants to resume the writeback of a page to the swap device, it can't call swap_writepage() as the page will simply reenter the backend. This patch separates swap_writepage() into a top and bottom half, the bottom half named __swap_writepage() to allow a frontswap backend, like zswap, to resume writeback beyond the frontswap_store() hook. __add_to_swap_cache() is also made non-static so that the page for which writeback is to be resumed can be added to the swap cache. Acked-by: Minchan Kim minc...@kernel.org Signed-off-by: Seth Jennings sjenn...@linux.vnet.ibm.com Adding Cc Bob Liu. I just remembered that Bob had done a repost of the 5 and 6 patches, outside the zswap thread, with a small change to avoid a checkpatch warning. I didn't pull that change into my version, but I should have. It doesn't make a functional difference, so this patch can still go forward and the checkpatch warning can be cleaned up in a subsequent patch. If another revision of the patchset is needed for other reasons, I'll pull this change into the next version. I think Dan and Bob would be ok with their tags being applied to 5 and 6: Acked-by: Bob Liu bob@oracle.com Reviewed-by: Dan Magenheimer dan.magenhei...@oracle.com That ok? OK with me. I do support these two MM patches as candidates for the 3.10 window since both zswap AND in-tree zcache depend on them, but the silence from Andrew was a bit deafening. Seth, perhaps you could add a #ifdef CONFIG_ZSWAP_WRITEBACK to the zswap code and Kconfig (as zcache has done) and then these two patches in your patchset can be reviewed separately? 
-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: zsmalloc/lzo compressibility vs entropy
> From: Dan Magenheimer
> Sent: Wednesday, March 27, 2013 3:42 PM
> To: Seth Jennings; Konrad Wilk; Minchan Kim; Bob Liu; Robert Jennings; Nitin Gupta; Wanpeng Li; Andrew Morton; Mel Gorman
> Cc: linux...@kvack.org; linux-kernel@vger.kernel.org
> Subject: zsmalloc/lzo compressibility vs entropy
>
> This might be obvious to those of you who are better
> mathematicians than I, but I ran some experiments
> to confirm the relationship between entropy and compressibility
> and thought I should report the results to the list.

A few new observations worth mentioning:

Since Seth long ago mentioned that the text of Moby Dick resulted in poor (but not horribly poor) compression, I thought I'd look at some ASCII data. I used the first sentence of the Gettysburg Address (91 characters) and repeated it to fill a page. Interestingly, LZO apparently discovered the repetition... the page compressed to 118 bytes even though the result had 15618 one-bits (fairly high entropy).

I used the full Gettysburg Address (1459 characters), again repeated to fill a page. LZO compressed this to 1070 bytes. (14568 one-bits.)

To fill a page with text, I added part of the Declaration of Independence. No repeating text now. This only compressed to 2754 bytes (which, I assume, is close to Seth's observations on Moby Dick). 14819 one-bits.

Last (for swap), to see if random ASCII would compress better than binary, I masked off the MSB in each byte of a random page. The mean zsize was 4116 bytes (larger than a page) with a stddev of 51. The one-bit mean was 14336 (7/16 of a page).

On a completely different track, I thought it would be relevant to look at the difference between frontswap (anonymous) page zsize distribution and cleancache (file) page zsize distribution. Running kernbench, zsize mean was 1974 (stddev 895). For a different benchmark, I did:

# find / | grep3

where grep3 is a simple bash script that does three separate greps on the first argument.
Since this fills the page cache and causes reclaiming, and reclaims are captured by cleancache and fed to zcache, this data page stream approximates random pages on the disk. This "benchmark" generated a zsize mean of 2265 with stddev 1008.

Also of note: only a fraction of a percent of cleancache pages are zero-filled, so Wanpeng's zcache patch to handle zero-filled pages more efficiently is very good for frontswap pages but may have little benefit for cleancache pages.

Bottom line conclusions:
(1) Entropy is probably less a factor for LZO-compressibility than data repetition.
(2) Cleancache data pages may have a very different zsize distribution than frontswap data pages, anecdotally skewed to much higher zsize.
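The repetition effect reported above is easy to reproduce in userspace. A rough sketch, with Python's zlib standing in for the kernel's LZO (so absolute sizes differ, but the ordering — repeated text compresses to a tiny fraction of a page, non-repeating bytes in the same ASCII range barely compress — holds):

```python
import random
import zlib

# zlib stands in for LZO here; the sentence is just any ~90-byte ASCII
# sentence in the spirit of the Gettysburg Address experiment.
PAGE_SIZE = 4096

sentence = (b"Four score and seven years ago our fathers brought forth "
            b"on this continent a new nation")
repeated_page = (sentence * (PAGE_SIZE // len(sentence) + 1))[:PAGE_SIZE]

random.seed(0)
# "Text-like" but non-repeating: printable ASCII bytes with no structure.
nonrepeating_page = bytes(random.randrange(32, 127) for _ in range(PAGE_SIZE))

z_repeated = len(zlib.compress(repeated_page))
z_nonrepeating = len(zlib.compress(nonrepeating_page))
one_bits = sum(bin(b).count("1") for b in repeated_page)

# The repeated page still has a high one-bit count (high bit-level
# "entropy") yet compresses to a tiny fraction of a page, while the
# non-repeating page stays close to its entropy limit.
```

This matches the observation that the dictionary-style match-finding in an LZ-family compressor cares about byte repetition far more than about the balance of one-bits.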
RE: [RFC] mm: remove swapcache page early
> From: Minchan Kim [mailto:minc...@kernel.org]
> Subject: Re: [RFC] mm: remove swapcache page early
>
> Hi Dan,
>
> On Wed, Mar 27, 2013 at 03:24:00PM -0700, Dan Magenheimer wrote:
> > > From: Hugh Dickins [mailto:hu...@google.com]
> > > Subject: Re: [RFC] mm: remove swapcache page early
> > >
> > > I believe the answer is for frontswap/zmem to invalidate the frontswap
> > > copy of the page (to free up the compressed memory when possible) and
> > > SetPageDirty on the PageUptodate PageSwapCache page when swapping in
> > > (setting page dirty so nothing will later go to read it from the
> > > unfreed location on backing swap disk, which was never written).
> >
> > There are two duplication issues: (1) When can the page be removed
> > from the swap cache after a call to frontswap_store; and (2) When
> > can the page be removed from the frontswap storage after it
> > has been brought back into memory via frontswap_load.
> >
> > This patch from Minchan addresses (1). The issue you are raising
>
> No. I am addressing (2).
>
> > here is (2). You may not know that (2) has recently been solved
> > in frontswap, at least for zcache. See frontswap_exclusive_gets_enabled.
> > If this is enabled (and it is for zcache but not yet for zswap),
> > what you suggest (SetPageDirty) is what happens.
>
> I am blind on zcache so I didn't see it. Anyway, I'd like to address it
> on zram and zswap.

Zswap can enable it trivially by adding a function call in init_zswap. (Note that it is not enabled by default for all frontswap backends because it is another complicated tradeoff of cpu time vs memory space that needs more study on a broad set of workloads.)

I wonder if something like this would have a similar result for zram? (Completely untested... snippet stolen from swap_entry_free with SetPageDirty added... doesn't compile yet, but should give you the idea.)
diff --git a/mm/page_io.c b/mm/page_io.c
index 56276fe..2d10988 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -81,7 +81,17 @@ void end_swap_bio_read(struct bio *bio, int err)
 			iminor(bio->bi_bdev->bd_inode),
 			(unsigned long long)bio->bi_sector);
 	} else {
+		struct swap_info_struct *sis;
+
 		SetPageUptodate(page);
+		sis = page_swap_info(page);
+		if (sis->flags & SWP_BLKDEV) {
+			struct gendisk *disk = sis->bdev->bd_disk;
+			if (disk->fops->swap_slot_free_notify) {
+				SetPageDirty(page);
+				disk->fops->swap_slot_free_notify(sis->bdev,
+							offset);
+			}
+		}
 	}
 	unlock_page(page);
 	bio_put(bio);
RE: [RFC] mm: remove swapcache page early
> From: Hugh Dickins [mailto:hu...@google.com]
> Subject: RE: [RFC] mm: remove swapcache page early
>
> On Wed, 27 Mar 2013, Dan Magenheimer wrote:
> > > From: Hugh Dickins [mailto:hu...@google.com]
> > > Subject: Re: [RFC] mm: remove swapcache page early
> >
> > The issue you are raising here is (2). You may not know that (2) has
> > recently been solved in frontswap, at least for zcache. See
> > frontswap_exclusive_gets_enabled. If this is enabled (and it is for
> > zcache but not yet for zswap), what you suggest (SetPageDirty) is
> > what happens.
>
> Ah, and I have a dim, perhaps mistaken, memory that I gave you
> input on that before, suggesting the SetPageDirty. Good, sounds
> like the solution is already in place, if not actually activated.
>
> Thanks, must dash,
> Hugh

Hi Hugh --

Credit where it is due... Yes, I do recall now that the idea was originally yours. It went on a to-do list where I eventually tried it and it worked... I'm sorry I had forgotten and neglected to give you credit! (BTW, it is activated for zcache in 3.9.)

Thanks,
Dan
RE: [RFC] mm: remove swapcache page early
> From: Hugh Dickins [mailto:hu...@google.com]
> Subject: Re: [RFC] mm: remove swapcache page early
>
> On Wed, 27 Mar 2013, Minchan Kim wrote:
> > Swap subsystem does lazy swap slot free with expecting the page
> > would be swapped out again so we can't avoid unnecessary write.
>
> so we can avoid unnecessary write.
>
> > But the problem in in-memory swap is that it consumes memory space
> > until vm_swap_full(ie, used half of all of swap device) condition
> > meet. It could be bad if we use multiple swap device, small in-memory swap
> > and big storage swap or in-memory swap alone.
>
> That is a very good realization: it's surprising that none of us
> thought of it before - no disrespect to you, well done, thank you.

Yes, my compliments also, Minchan. This problem has been thought of before, but this patch is the first to identify a possible solution.

> And I guess swap readahead is utterly unhelpful in this case too.

Yes... as is any "swap writeahead". Excuse my ignorance, but I think this is not done in the swap subsystem; instead the kernel assumes write-coalescing will be done in the block I/O subsystem, which means swap writeahead would affect zram but not zcache/zswap (since frontswap subverts the block I/O subsystem). However, I think a swap-readahead solution would be helpful to zram as well as zcache/zswap.

> > This patch changes vm_swap_full logic slightly so it could free
> > swap slot early if the backed device is really fast.
> > For it, I used SWP_SOLIDSTATE but It might be controversial.
>
> But I strongly disagree with almost everything in your patch :)
> I disagree with addressing it in vm_swap_full(), I disagree that
> it can be addressed by device, I disagree that it has anything to
> do with SWP_SOLIDSTATE.
>
> This is not a problem with swapping to /dev/ram0 or to /dev/zram0,
> is it? In those cases, a fixed amount of memory has been set aside
> for swap, and it works out just like with disk block devices. The
> memory set aside may be wasted, but that is accepted upfront.

It is (I believe) also a problem with swapping to ram. Two copies of the same page are kept in memory in different places, right? Fixed vs variable size is irrelevant, I think. Or am I misunderstanding something about swap-to-ram?

> Similarly, this is not a problem with swapping to SSD. There might
> or might not be other reasons for adjusting the vm_swap_full() logic
> for SSD or generally, but those have nothing to do with this issue.

I think it is at least highly related. The key issue is the tradeoff of the likelihood that the page will soon be read/written again while it is in swap cache vs the time/resource-usage necessary to "reconstitute" the page into swap cache. Reconstituting from disk requires a LOT of elapsed time. Reconstituting from an SSD likely takes much less time. Reconstituting from zcache/zram takes thousands of CPU cycles.

> The problem here is peculiar to frontswap, and the variably sized
> memory behind it, isn't it? We are accustomed to using swap to free
> up memory by transferring its data to some other, cheaper but slower
> resource.

Frontswap does make the problem more complex because some pages are in "fairly fast" storage (zcache, needs decompression) and some are on the actual (usually) rotating media. Fortunately, differentiating between these two cases is just a table lookup (see frontswap_test).

> But in the case of frontswap and zmem (I'll say that to avoid thinking
> through which backends are actually involved), it is not a cheaper and
> slower resource, but the very same memory we are trying to save: swap
> is stolen from the memory under reclaim, so any duplication becomes
> counter-productive (if we ignore cpu compression/decompression costs:
> I have no idea how fair it is to do so, but anyone who chooses zmem
> is prepared to pay some cpu price for that).

Exactly. There is some "robbing of Peter to pay Paul" and other complex resource tradeoffs. Presumably, though, it is not "the very same memory we are trying to save" but a fraction of it, saving the same page of data more efficiently in memory, using less than a page, at some CPU cost.

> And because it's a frontswap thing, we cannot decide this by device:
> frontswap may or may not stand in front of each device. There is no
> problem with swapcache duplicated on disk (until that area approaches
> being full or fragmented), but at the higher level we cannot see what
> is in zmem and what is on disk: we only want to free up the zmem dup.

I *think* frontswap_test(page) resolves this problem, as long as we have a specific page available to use as a parameter.

> I believe the answer is for frontswap/zmem to invalidate the frontswap
> copy of the page (to free up the compressed memory when possible) and
> SetPageDirty on the PageUptodate PageSwapCache page when swapping in
> (setting page dirty so nothing will later go to read it from the
> unfreed location on backing swap disk, which was never written).
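Hugh's invalidate-plus-SetPageDirty scheme can be sketched as a small toy model. This is illustrative Python, not kernel code; the class and method names only mirror the discussion: on swap-in, the in-memory backend hands back the data, drops its own copy, and the page is marked dirty so a later eviction re-stores it instead of reading a never-written slot on the backing swap device.

```python
# Toy model of "exclusive gets": frontswap_load invalidates the zmem
# copy and dirties the page. Names are illustrative, not kernel APIs.

class Page:
    def __init__(self, data=None):
        self.data = data
        self.dirty = False

class ZmemBackend:
    def __init__(self):
        self.slots = {}                 # swap offset -> stored data

    def frontswap_store(self, offset, page):
        self.slots[offset] = page.data

    def frontswap_load_exclusive(self, offset, page):
        page.data = self.slots.pop(offset)  # invalidate: free the zmem copy
        page.dirty = True                   # SetPageDirty: forces a re-store
                                            # rather than a read of the
                                            # never-written backing slot
```

The memory win is that only one copy (compressed or uncompressed) exists at a time; the cost is that a page which is evicted again must be recompressed.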
zsmalloc/lzo compressibility vs entropy
This might be obvious to those of you who are better mathematicians than I, but I ran some experiments to confirm the relationship between entropy and compressibility and thought I should report the results to the list.

Using the LZO code in the kernel via zsmalloc and some hacks in zswap, I measured the compression of pages generated by get_random_bytes() and then of pages where half the page is generated by get_random_bytes() and the other half-page is zero-filled.

For a fully random page, one would expect the number of zeroes and ones generated to be equal (highest entropy), and that proved true: the mean number of one-bits in the fully random page was 16384 (x86, so PAGE_SIZE=4096 * 8 bits/byte, i.e. half of the page's 32768 bits) with a stddev of 93 (sample size > 50). For this sample of pages, zsize had a mean of 4116 and a stddev of 16. So for fully random pages, LZO compression results in "negative" compression... the size of the compressed page is slightly larger than a page.

For a "half random" page -- a fully random page with the first half of the page overwritten with zeroes -- zsize mean is 2077 with a stddev of 6. So a half-random page compresses by about a factor of 2. (Just to be sure, I reran the experiment with the first half of the page overwritten with ones instead of zeroes, and the result was approximately the same.) For extra credit, I ran a "quarter random" page... zsize mean is 1052 with a stddev of 45.

For more extra credit, I tried a fully-random page with every OTHER byte forced to zero, so half the bytes are random and half are zero. The result: mean zsize is 3841 with a stddev of 33. Then I tried a fully-random page with every other PAIR of bytes forced to zero. The result: zsize mean is 4029 with a stddev of 67. (Worse!)

So LZO page compression works better when there are many more zeroes than ones in a page (or vice-versa), but works best when a long sequence of bits (bytes?) are the same.
All this still leaves open the question of what the page-entropy (and zsize distribution) will be over a large set of pages and over a large set of workloads AND across different classes of data (e.g. frontswap pages vs cleancache pages), but at least we have some theory to guide us.
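The zero-placement experiments above are straightforward to reproduce in userspace. A rough sketch with Python's zlib standing in for the kernel's LZO (absolute zsizes differ, but the relative ordering — half-zero page ≪ alternating-zero page < fully random page — matches the reported means of 2077, 3841, and 4116):

```python
import os
import zlib

# zlib stands in for LZO; os.urandom stands in for get_random_bytes().
PAGE_SIZE = 4096
rand = os.urandom(PAGE_SIZE)

full_random = rand                                         # reported mean zsize 4116
half_zero = bytes(PAGE_SIZE // 2) + rand[PAGE_SIZE // 2:]  # reported mean zsize 2077
every_other_zero = bytes(b if i % 2 else 0                 # reported mean zsize 3841
                         for i, b in enumerate(rand))

z_full = len(zlib.compress(full_random))
z_half = len(zlib.compress(half_zero))
z_alt = len(zlib.compress(every_other_zero))

# One long zero run compresses far better than the same number of zero
# bytes interleaved with random ones, echoing the LZO findings.
```

This supports the conclusion that an LZ-family compressor rewards long runs of identical bytes, not merely a skewed zero/one bit ratio.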
RE: [RFC] mm: remove swapcache page early
> From: Minchan Kim [mailto:minc...@kernel.org]
> Subject: [RFC] mm: remove swapcache page early
>
> Swap subsystem does lazy swap slot free with expecting the page
> would be swapped out again so we can't avoid unnecessary write.
>
> But the problem in in-memory swap is that it consumes memory space
> until vm_swap_full(ie, used half of all of swap device) condition
> meet. It could be bad if we use multiple swap device, small in-memory swap
> and big storage swap or in-memory swap alone.
>
> This patch changes vm_swap_full logic slightly so it could free
> swap slot early if the backed device is really fast.
> For it, I used SWP_SOLIDSTATE but It might be controversial.
> So let's add Ccing Shaohua and Hugh.
> If it's a problem for SSD, I'd like to create new type SWP_INMEMORY
> or something for z* family.
>
> Other problem is zram is block device so that it can set SWP_INMEMORY
> or SWP_SOLIDSTATE easily(ie, actually, zram is already done) but
> I have no idea to use it for frontswap.
>
> Any idea?
>
> Other optimize point is we remove it unconditionally when we
> found it's exclusive when swap in happen.
> It could help frontswap family, too.

By passing a struct page * to vm_swap_full() you can then call frontswap_test()... if it returns true, then vm_swap_full() can return true. Note that this precisely checks whether the page is in zcache/zswap or not, so Seth's concern that some pages may be in-memory and some may be in rotating storage is no longer an issue.

> What do you think about it?

By removing the page from swapcache, you are now increasing the risk that pages will "thrash" between uncompressed state (in swapcache) and compressed state (in z*). I think this is a better tradeoff, though, than keeping a copy of both the compressed page AND the uncompressed page in memory.

You should probably rename vm_swap_full() because you are now overloading it with other meanings. Maybe vm_swap_reclaimable()? Do you have any measurements? I think you are correct that it may help a LOT.

Thanks,
Dan
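Dan's suggestion amounts to a small policy change. A hedged sketch in Python (the kernel code would be C; vm_swap_reclaimable is only the name floated above, and the boolean argument stands in for a frontswap_test(page) call):

```python
def vm_swap_reclaimable(page_in_frontswap: bool,
                        swap_used: int, swap_total: int) -> bool:
    """Sketch of the proposed policy: free the swap slot (and its
    duplicate copy) immediately if the page's data is held by an
    in-memory frontswap backend; otherwise keep the classic
    vm_swap_full() heuristic of 'swap more than half full'."""
    if page_in_frontswap:          # stands in for frontswap_test(page)
        return True
    return swap_used * 2 > swap_total
```

Because the check is per-page rather than per-device, a pages-on-disk/pages-in-zmem mix on the same swap device is handled naturally.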
RE: [PATCH v2 1/4] introduce zero filled pages handler
> From: Konrad Rzeszutek Wilk [mailto:kon...@darnok.org] > Sent: Tuesday, March 19, 2013 10:44 AM > To: Dan Magenheimer > Cc: Wanpeng Li; Greg Kroah-Hartman; Andrew Morton; Seth Jennings; Minchan > Kim; linux...@kvack.org; > linux-kernel@vger.kernel.org > Subject: Re: [PATCH v2 1/4] introduce zero filled pages handler > > On Sat, Mar 16, 2013 at 2:24 PM, Dan Magenheimer > wrote: > >> From: Konrad Rzeszutek Wilk [mailto:kon...@darnok.org] > >> Subject: Re: [PATCH v2 1/4] introduce zero filled pages handler > >> > >> > + > >> > + for (pos = 0; pos < PAGE_SIZE / sizeof(*page); pos++) { > >> > + if (page[pos]) > >> > + return false; > >> > >> Perhaps allocate a static page filled with zeros and just do memcmp? > > > > That seems like a bad idea. Why compare two different > > memory locations when comparing one memory location > > to a register will do? > > Good point. I was hoping there was a fast memcmp that would > do fancy SSE registers. But it is memory against memory instead of > registers. > > Perhaps a cunning trick would be to check (as a shortcircuit) > against 'empty_zero_page', and if that check fails, then try > to do the check for each byte in the code? Curious about this, I added some code to check for this case. In my test run, the conditional "if (page == ZERO_PAGE(0))" was never true, for >20 pages passed through frontswap that were zero-filled. My test run is certainly not conclusive, but perhaps some other code in the swap subsystem disqualifies ZERO_PAGE as a candidate for swapping? Or maybe it is accessed frequently enough that it never falls out of the active-anonymous page queue? Dan P.S. In arch/x86/include/asm/pgtable.h: #define ZERO_PAGE(vaddr) (virt_to_page(empty_zero_page)) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH v2 1/4] introduce zero filled pages handler
> From: Konrad Rzeszutek Wilk [mailto:kon...@darnok.org] > Subject: Re: [PATCH v2 1/4] introduce zero filled pages handler > > > + > > + for (pos = 0; pos < PAGE_SIZE / sizeof(*page); pos++) { > > + if (page[pos]) > > + return false; > > Perhaps allocate a static page filled with zeros and just do memcmp? That seems like a bad idea. Why compare two different memory locations when comparing one memory location to a register will do?
RE: zsmalloc limitations and related topics
> From: Seth Jennings [mailto:sjenn...@linux.vnet.ibm.com] > Subject: Re: zsmalloc limitations and related topics > > On 03/14/2013 01:54 PM, Dan Magenheimer wrote: > >> From: Robert Jennings [mailto:r...@linux.vnet.ibm.com] > >> Subject: Re: zsmalloc limitations and related topics > >> > >> * Bob (bob@oracle.com) wrote: > >>> On 03/14/2013 06:59 AM, Seth Jennings wrote: > >>>> On 03/13/2013 03:02 PM, Dan Magenheimer wrote: > >>>>>> From: Robert Jennings [mailto:r...@linux.vnet.ibm.com] > >>>>>> Subject: Re: zsmalloc limitations and related topics > >>>>> > >> > >>>>> Yes. And add pageframe-reclaim to this list of things that > >>>>> zsmalloc should do but currently cannot do. > >>>> > >>>> The real question is why is pageframe-reclaim a requirement? What > >>>> operation needs this feature? > >>>> > >>>> AFAICT, the pageframe-reclaim requirements is derived from the > >>>> assumption that some external control path should be able to tell > >>>> zswap/zcache to evacuate a page, like the shrinker interface. But this > >>>> introduces a new and complex problem in designing a policy that doesn't > >>>> shrink the zpage pool so aggressively that it is useless. > >>>> > >>>> Unless there is another reason for this functionality I'm missing. > >>>>. > >>> > >>> Perhaps it's needed if the user want to enable/disable the memory > >>> compression feature dynamically. > >>> Eg, use it as a module instead of recompile the kernel or even > >>> reboot the system. > > > > It's worth thinking about: Under what circumstances would a user want > > to turn off compression? While unloading a compression module should > > certainly be allowed if it makes a user comfortable, in my opinion, > > if a user wants to do that, we have done our job poorly (or there > > is a bug). > > > >> To unload zswap all that is needed is to perform writeback on the pages > >> held in the cache, this can be done by extending the existing writeback > >> code. 
> > > > Actually, frontswap supports this directly. See frontswap_shrink. > > frontswap_shrink() is a best-effort attempt to fault in all the pages > stored in the backend. However, if there is not enough RAM to hold all > the pages, then it can not completely evacuate the backend. > > Module exit functions must return void, so there is no way to fail a > module unload. If you implement an exit function for your module, you > must ensure that it can always complete successfully. For this reason > frontswap_shrink() is unsuitable for module unloading. You'd need to > use a mechanism like writeback that could surely evacuate the backend > (barring I/O failures). A single call to frontswap_shrink may be unsuitable... multiple calls (do while zcache/zswap is not empty) may work fine. Writeback-until-empty should also work fine. In any case, it's a good point that module exit must succeed, and that if there is already heavy memory pressure when zcache/zswap module exit is invoked, module exit may be very very slow and cause many many swap disk writes, so the system may become unresponsive (and may even OOM). So if someone implements zcache/zswap module unload, a thorough test plan would be good.
RE: zsmalloc limitations and related topics
> From: Dan Magenheimer > Subject: RE: zsmalloc limitations and related topics > > > > I would welcome ideas on how to evaluate workloads for > > > "representativeness". Personally I don't believe we should > > > be making decisions about selecting the "best" algorithms > > > or merging code without an agreement on workloads. > > > > I'd argue that there is no such thing as a "representative workload". > > Instead, we try different workloads to validate the design and illustrate > > the performance characteristics and impacts. > > Sorry for repeatedly hammering my point in the above, but > there have been many design choices driven by what was presumed > to be representative (kernbench and now SPECjbb) workload > that may be entirely wrong for a different workload (as > Seth once pointed out using the text of Moby Dick as a source > data stream). > > Further, the value of different designs can't be measured here just > by the workload because the pages chosen to swap may be completely > independent of the intended workload-driver... i.e. if you track > the pid of the pages intended for swap, the pages can be mostly > pages from long-running or periodic system services, not pages > generated by kernbench or SPECjbb. So it is the workload PLUS the > environment that is being measured and evaluated. That makes > the problem especially tough. > > Just to clarify, I'm not suggesting that there is any single > workload that can be called representative, just that we may > need both a broad set of workloads (not silly benchmarks) AND > some theoretical analysis to drive design decisions. And, without > this, arguing about whether zsmalloc is better than zbud or not > is silly. Both zbud and zsmalloc have strengths and weaknesses. 
> > That said, it should also be pointed out that the stream of > pages-to-compress from cleancache ("file pages") may be dramatically > different than for frontswap ("anonymous pages"), so unless you > and Seth are going to argue upfront that cleancache pages should > NEVER be candidates for compression, the evaluation criteria > to drive design decisions needs to encompass both anonymous > and file pages. It is currently impossible to evaluate that > with zswap. Sorry to reply to myself here, but I realized last night that I left off another related important point: We have a tendency to run benchmarks on a "cold" system so that the results are reproducible. For compression however, this may unnaturally skew the entropy of data-pages-to-be-compressed and so also the density measurements. I can't prove it, but I suspect that soon after boot the number of anonymous pages containing all (or nearly all) zeroes is large, i.e. entropy is low. As the length of time grows since the system booted, more anonymous pages will be written with non-zero data, thus increasing entropy and decreasing compressibility. So, over time, the distribution of zsize may slowly skew right (toward PAGE_SIZE). If so, this effect may be very real but very hard to observe. Dan
RE: zsmalloc limitations and related topics
> From: Robert Jennings [mailto:r...@linux.vnet.ibm.com] > Sent: Thursday, March 14, 2013 7:21 AM > To: Bob > Cc: Seth Jennings; Dan Magenheimer; minc...@kernel.org; Nitin Gupta; Konrad > Wilk; linux...@kvack.org; > linux-kernel@vger.kernel.org; Bob Liu; Luigi Semenzato; Mel Gorman > Subject: Re: zsmalloc limitations and related topics > > * Bob (bob@oracle.com) wrote: > > On 03/14/2013 06:59 AM, Seth Jennings wrote: > > >On 03/13/2013 03:02 PM, Dan Magenheimer wrote: > > >>>From: Robert Jennings [mailto:r...@linux.vnet.ibm.com] > > >>>Subject: Re: zsmalloc limitations and related topics > > >> > > > >>Yes. And add pageframe-reclaim to this list of things that > > >>zsmalloc should do but currently cannot do. > > > > > >The real question is why is pageframe-reclaim a requirement? What > > >operation needs this feature? > > > > > >AFAICT, the pageframe-reclaim requirements is derived from the > > >assumption that some external control path should be able to tell > > >zswap/zcache to evacuate a page, like the shrinker interface. But this > > >introduces a new and complex problem in designing a policy that doesn't > > >shrink the zpage pool so aggressively that it is useless. > > > > > >Unless there is another reason for this functionality I'm missing. > > > > > > > Perhaps it's needed if the user want to enable/disable the memory > > compression feature dynamically. > > Eg, use it as a module instead of recompile the kernel or even > > reboot the system. It's worth thinking about: Under what circumstances would a user want to turn off compression? While unloading a compression module should certainly be allowed if it makes a user comfortable, in my opinion, if a user wants to do that, we have done our job poorly (or there is a bug). > To unload zswap all that is needed is to perform writeback on the pages > held in the cache, this can be done by extending the existing writeback > code. Actually, frontswap supports this directly. See frontswap_shrink. 
RE: zsmalloc limitations and related topics
> From: Seth Jennings [mailto:sjenn...@linux.vnet.ibm.com] > Subject: Re: zsmalloc limitations and related topics Hi Seth -- Thanks for the reply. I think it is very important to be having these conversations. > >>> 2) When not full and especially when nearly-empty _after_ > >>>being full, density may fall below 1.0 as a result of > >>>fragmentation. > >> > >> True and there are several ways to address this including > >> defragmentation, fewer class sizes in zsmalloc, aging, and/or writeback > >> of zpages in sparse zspages to free pageframes during normal writeback. > > > > Yes. And add pageframe-reclaim to this list of things that > > zsmalloc should do but currently cannot do. > > The real question is why is pageframe-reclaim a requirement? It is because pageframes are the currency of the MM subsystem. See more below. > What operation needs this feature? > AFAICT, the pageframe-reclaim requirement is derived from the > assumption that some external control path should be able to tell > zswap/zcache to evacuate a page, like the shrinker interface. But this > introduces a new and complex problem in designing a policy that doesn't > shrink the zpage pool so aggressively that it is useless. > > Unless there is another reason for this functionality I'm missing. That's the reason. IMHO, it is precisely this "new and complex" problem that we must solve. Otherwise, compression is just a cool toy that may (or may not) help your workload if you turn it on. Zcache already does implement "a policy that doesn't shrink the zpage pool so aggressively that it is useless". While I won't claim the policy is the right one, it is a policy, it is not particularly complex, and it is definitely not useless. And it depends on pageframe-reclaim. > >>> 3) Zsmalloc has a density of exactly 1.0 for any number of > >>>zpages with zsize >= 0.8. > >> > >> For this reason zswap does not cache pages which fall in this range.
> >> It is not enforced in the allocator because some users may be forced to > >> store these pages; users like zram. > > > > Again, without a "representative" workload, we don't know whether > > or not it is important to manage pages with zsize >= 0.8. You are > > simply dismissing it as unnecessary because zsmalloc can't handle > > them and because they don't appear at any measurable frequency > > in kernbench or SPECjbb. (Zbud _can_ efficiently handle these larger > > pages under many circumstances... but without a "representative" workload, > > we don't know whether or not those circumstances will occur.) > > The real question is not whether any workload would operate on pages > that don't compress to 80%. Any workload that operates on pages of > already compressed or encrypted data would do this. The question is, is > it worth it to store those pages in the compressed cache since the > effective reclaim efficiency approaches 0. You are letting the implementation of zsmalloc color your thinking. Zbud can quite efficiently store pages that compress up to zsize = ((63 * PAGE_SIZE) / 64) because it buddies highly compressible pages with poorly compressible pages. This is also, of course, very zsize-distribution-dependent. (These are not just already-compressed or encrypted data, although those are good examples. Compressibility is related to entropy, and there may be many anonymous pages that have high entropy. We really just don't know.) > >>> 4) Zsmalloc contains several compile-time parameters; > >>>the best value of these parameters may be very workload > >>>dependent. > >> > >> The parameters fall into two major areas, handle computation and class > >> size. The handle can be abstracted away, eliminating the compile-time > >> parameters. The class-size tunable could be changed to a default value > >> with the option for specifying an alternate value from the user during > >> pool creation. 
> > > > Perhaps my point here wasn't clear so let me be more blunt: > > There's no way in hell that even a very sophisticated user > > will know how to set these values. I think we need to > > ensure either that they are "always right" (which without > > a "representative workload"...) or, preferably, have some way > > so that they can dynamically adapt at runtime. > > I think you made the point that if this "representative workload" is > completely undefined, then having tunables for zsmalloc that are "always > right" is also not possible. The best we can hope for is "mostly right" > which, of course, is difficult to get everyone to agree on and will be > based on usage. I agree "always right" is impossible and, as I said, would prefer adaptable. I think zsmalloc and zbud address very different zsize-distributions so some combination may be better than either by itself. > >>> If density == 1.0, that means we are paying the overhead of > >>> compression+decompression for no space advantage. If > >>> density < 1.0, that means using zsmalloc is detrimental, > >>>
RE: [PATCH 4/4] zcache: add pageframes count once compress zero-filled pages twice
> From: Wanpeng Li [mailto:liw...@linux.vnet.ibm.com] > Sent: Wednesday, March 13, 2013 6:21 PM > To: Dan Magenheimer > Cc: Andrew Morton; Greg Kroah-Hartman; Dan Magenheimer; Seth Jennings; Konrad > Rzeszutek Wilk; Minchan > Kim; linux...@kvack.org; linux-kernel@vger.kernel.org > Subject: Re: [PATCH 4/4] zcache: add pageframes count once compress > zero-filled pages twice > > On Wed, Mar 13, 2013 at 09:42:16AM -0700, Dan Magenheimer wrote: > >> From: Wanpeng Li [mailto:liw...@linux.vnet.ibm.com] > >> Sent: Wednesday, March 13, 2013 1:05 AM > >> To: Andrew Morton > >> Cc: Greg Kroah-Hartman; Dan Magenheimer; Seth Jennings; Konrad Rzeszutek > >> Wilk; Minchan Kim; linux- > >> m...@kvack.org; linux-kernel@vger.kernel.org; Wanpeng Li > >> Subject: [PATCH 4/4] zcache: add pageframes count once compress > >> zero-filled pages twice > > > >Hi Wanpeng -- > > > >Thanks for taking on this task from the drivers/staging/zcache TODO list! > > > >> Since zbudpage consist of two zpages, two zero-filled pages compression > >> contribute to one [eph|pers]pageframe count accumulated. > > > > Hi Dan, > > >I'm not sure why this is necessary. The [eph|pers]pageframe count > >is supposed to be counting actual pageframes used by zcache. Since > >your patch eliminates the need to store zero pages, no pageframes > >are needed at all to store zero pages, so it's not necessary > >to increment zcache_[eph|pers]_pageframes when storing zero > >pages. > > > > Great point! It seems that we also don't need to caculate > zcache_[eph|pers]_zpages for zero-filled pages. I will fix > it in next version. :-) Hi Wanpeng -- I think we DO need to increment/decrement zcache_[eph|pers]_zpages for zero-filled pages. The main point of the counters for zpages and pageframes is to be able to calculate density == zpages/pageframes. A zero-filled page becomes a zpage that "compresses" to zero bytes and, as a result, requires zero pageframes for storage. 
So the zpages counter should be increased but the pageframes counter should not. If you are changing the patch anyway, I do like better the use of "zero_filled_page" rather than just "zero" or "zero page". So it might be good to change: handle_zero_page -> handle_zero_filled_page pages_zero -> zero_filled_pages zcache_pages_zero -> zcache_zero_filled_pages and maybe page_zero_filled -> page_is_zero_filled Thanks, Dan
RE: [PATCH 4/4] zcache: add pageframes count once compress zero-filled pages twice
From: Wanpeng Li [mailto:liw...@linux.vnet.ibm.com] Sent: Wednesday, March 13, 2013 6:21 PM To: Dan Magenheimer Cc: Andrew Morton; Greg Kroah-Hartman; Dan Magenheimer; Seth Jennings; Konrad Rzeszutek Wilk; Minchan Kim; linux...@kvack.org; linux-kernel@vger.kernel.org Subject: Re: [PATCH 4/4] zcache: add pageframes count once compress zero-filled pages twice On Wed, Mar 13, 2013 at 09:42:16AM -0700, Dan Magenheimer wrote: From: Wanpeng Li [mailto:liw...@linux.vnet.ibm.com] Sent: Wednesday, March 13, 2013 1:05 AM To: Andrew Morton Cc: Greg Kroah-Hartman; Dan Magenheimer; Seth Jennings; Konrad Rzeszutek Wilk; Minchan Kim; linux- m...@kvack.org; linux-kernel@vger.kernel.org; Wanpeng Li Subject: [PATCH 4/4] zcache: add pageframes count once compress zero-filled pages twice Hi Wanpeng -- Thanks for taking on this task from the drivers/staging/zcache TODO list! Since zbudpage consist of two zpages, two zero-filled pages compression contribute to one [eph|pers]pageframe count accumulated. Hi Dan, I'm not sure why this is necessary. The [eph|pers]pageframe count is supposed to be counting actual pageframes used by zcache. Since your patch eliminates the need to store zero pages, no pageframes are needed at all to store zero pages, so it's not necessary to increment zcache_[eph|pers]_pageframes when storing zero pages. Great point! It seems that we also don't need to caculate zcache_[eph|pers]_zpages for zero-filled pages. I will fix it in next version. :-) Hi Wanpeng -- I think we DO need to increment/decrement zcache_[eph|pers]_zpages for zero-filled pages. The main point of the counters for zpages and pageframes is to be able to calculate density == zpages/pageframes. A zero-filled page becomes a zpage that compresses to zero bytes and, as a result, requires zero pageframes for storage. So the zpages counter should be increased but the pageframes counter should not. 
If you are changing the patch anyway, I do like better the use of zero_filled_page rather than just zero or zero page. So it might be good to change: handle_zero_page - handle_zero_filled_page pages_zero - zero_filled_pages zcache_pages_zero - zcache_zero_filled_pages and maybe page_zero_filled - page_is_zero_filled Thanks, Dan -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: zsmalloc limitations and related topics
From: Seth Jennings [mailto:sjenn...@linux.vnet.ibm.com] Subject: Re: zsmalloc limitations and related topics Hi Seth -- Thanks for the reply. I think it is very important to be having these conversations. 2) When not full and especially when nearly-empty _after_ being full, density may fall below 1.0 as a result of fragmentation. True and there are several ways to address this including defragmentation, fewer class sizes in zsmalloc, aging, and/or writeback of zpages in sparse zspages to free pageframes during normal writeback. Yes. And add pageframe-reclaim to this list of things that zsmalloc should do but currently cannot do. The real question is why is pageframe-reclaim a requirement? It is because pageframes are the currency of the MM subsystem. See more below. What operation needs this feature? AFAICT, the pageframe-reclaim requirements is derived from the assumption that some external control path should be able to tell zswap/zcache to evacuate a page, like the shrinker interface. But this introduces a new and complex problem in designing a policy that doesn't shrink the zpage pool so aggressively that it is useless. Unless there is another reason for this functionality I'm missing. That's the reason. IMHO, it is precisely this new and complex problem that we must solve. Otherwise, compression is just a cool toy that may (or may not) help your workload if you turn it on. Zcache already does implement a policy that doesn't shrink the zpage pool so aggressively that it is useless. While I won't claim the policy is the right one, it is a policy, it is not particularly complex, and it is definitely not useless. And it depends on pageframe-reclaim. 3) Zsmalloc has a density of exactly 1.0 for any number of zpages with zsize = 0.8. For this reason zswap does not cache pages which in this range. It is not enforced in the allocator because some users may be forced to store these pages; users like zram. 
Again, without a representative workload, we don't know whether or not it is important to manage pages with zsize = 0.8. You are simply dismissing it as unnecessary because zsmalloc can't handle them and because they don't appear at any measurable frequency in kernbench or SPECjbb. (Zbud _can_ efficiently handle these larger pages under many circumstances... but without a representative workload, we don't know whether or not those circumstances will occur.) The real question is not whether any workload would operate on pages that don't compress to 80%. Any workload that operates on pages of already compressed or encrypted data would do this. The question is, is it worth it to store those pages in the compressed cache since the effective reclaim efficiency approaches 0. You are letting the implementation of zsmalloc color your thinking. Zbud can quite efficiently store pages that compress up to zsize = ((63 * PAGE_SIZE) / 64) because it buddies highly compressible pages with poorly compressible pages. This is also, of course, very zsize-distribution-dependent. (These are not just already-compressed or encrypted data, although those are good examples. Compressibility is related to entropy, and there may be many anonymous pages that have high entropy. We really just don't know.) 4) Zsmalloc contains several compile-time parameters; the best value of these parameters may be very workload dependent. The parameters fall into two major areas, handle computation and class size. The handle can be abstracted away, eliminating the compile-time parameters. The class-size tunable could be changed to a default value with the option for specifying an alternate value from the user during pool creation. Perhaps my point here wasn't clear so let me be more blunt: There's no way in hell that even a very sophisticated user will know how to set these values. I think we need to ensure either that they are always right (which without a representative workload...) 
or, preferably, have some way so that they can dynamically adapt at runtime.

> I think you made the point that if this representative workload is completely undefined, then having tunables for zsmalloc that are always right is also not possible. The best we can hope for is mostly right which, of course, is difficult to get everyone to agree on and will be based on usage.

I agree "always right" is impossible and, as I said, would prefer adaptable. I think zsmalloc and zbud address very different zsize-distributions so some combination may be better than either by itself.

> > If density == 1.0, that means we are paying the overhead of compression+decompression for no space advantage. If density < 1.0, that means using zsmalloc is detrimental, resulting in worse memory pressure than if it were not used.
> >
> > WORKLOAD ANALYSIS
> >
> > These limitations emphasize that the workload used to evaluate zsmalloc is very important.
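The zbud buddying argument above is easy to illustrate with a toy model. This is not the kernel code; the greedy first-fit pairing and the example zsize values are illustrative assumptions only:

```python
# Toy model of zbud-style buddying: each pageframe holds at most two
# zpages whose compressed sizes (zsize) fit together in one page.
# It shows how pairing highly compressible pages with poorly
# compressible ones can keep density at or above 1.0 even when many
# pages compress badly.
PAGE_SIZE = 4096

def zbud_pageframes(zsizes):
    """Greedy first-fit pairing; returns the number of pageframes used."""
    unbuddied = []   # zsizes still waiting for a buddy
    frames = 0
    for z in sorted(zsizes, reverse=True):
        for i, partner in enumerate(unbuddied):
            if partner + z <= PAGE_SIZE:
                unbuddied.pop(i)   # buddy into the existing frame
                break
        else:
            frames += 1            # open a new frame
            unbuddied.append(z)
    return frames

# 50 poorly compressible pages (zsize ~3500) buddied with 50 highly
# compressible ones (zsize ~500):
zsizes = [3500] * 50 + [500] * 50
frames = zbud_pageframes(zsizes)
print(len(zsizes) / frames)   # density = zpages / pageframes
```

For this sample every 3500-byte zpage pairs with a 500-byte one (3500 + 500 <= 4096), so 100 zpages land in 50 frames: density 2.0. The outcome is, as the thread notes, entirely zsize-distribution-dependent.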
RE: zsmalloc limitations and related topics
From: Robert Jennings [mailto:r...@linux.vnet.ibm.com] Sent: Thursday, March 14, 2013 7:21 AM To: Bob Cc: Seth Jennings; Dan Magenheimer; minc...@kernel.org; Nitin Gupta; Konrad Wilk; linux...@kvack.org; linux-kernel@vger.kernel.org; Bob Liu; Luigi Semenzato; Mel Gorman Subject: Re: zsmalloc limitations and related topics

* Bob (bob@oracle.com) wrote: On 03/14/2013 06:59 AM, Seth Jennings wrote: On 03/13/2013 03:02 PM, Dan Magenheimer wrote: From: Robert Jennings [mailto:r...@linux.vnet.ibm.com] Subject: Re: zsmalloc limitations and related topics [snip]

Yes. And add pageframe-reclaim to this list of things that zsmalloc should do but currently cannot do. The real question is why is pageframe-reclaim a requirement? What operation needs this feature? AFAICT, the pageframe-reclaim requirement is derived from the assumption that some external control path should be able to tell zswap/zcache to evacuate a page, like the shrinker interface. But this introduces a new and complex problem in designing a policy that doesn't shrink the zpage pool so aggressively that it is useless. Unless there is another reason for this functionality I'm missing.

Perhaps it's needed if the user wants to enable/disable the memory compression feature dynamically. Eg, use it as a module instead of recompiling the kernel or even rebooting the system.

It's worth thinking about: Under what circumstances would a user want to turn off compression? While unloading a compression module should certainly be allowed if it makes a user comfortable, in my opinion, if a user wants to do that, we have done our job poorly (or there is a bug).

To unload zswap all that is needed is to perform writeback on the pages held in the cache; this can be done by extending the existing writeback code. Actually, frontswap supports this directly. See frontswap_shrink.
-- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: zsmalloc limitations and related topics
From: Dan Magenheimer Subject: RE: zsmalloc limitations and related topics

> > I would welcome ideas on how to evaluate workloads for representativeness. Personally I don't believe we should be making decisions about selecting the best algorithms or merging code without an agreement on workloads.
>
> I'd argue that there is no such thing as a representative workload. Instead, we try different workloads to validate the design and illustrate the performance characteristics and impacts.

Sorry for repeatedly hammering my point in the above, but there have been many design choices driven by what was presumed to be a representative workload (kernbench and now SPECjbb) that may be entirely wrong for a different workload (as Seth once pointed out using the text of Moby Dick as a source data stream).

Further, the value of different designs can't be measured here just by the workload because the pages chosen to swap may be completely independent of the intended workload-driver... i.e. if you track the pid of the pages intended for swap, the pages can be mostly pages from long-running or periodic system services, not pages generated by kernbench or SPECjbb. So it is the workload PLUS the environment that is being measured and evaluated. That makes the problem especially tough.

Just to clarify, I'm not suggesting that there is any single workload that can be called representative, just that we may need both a broad set of workloads (not silly benchmarks) AND some theoretical analysis to drive design decisions. And, without this, arguing about whether zsmalloc is better than zbud or not is silly. Both zbud and zsmalloc have strengths and weaknesses.
That said, it should also be pointed out that the stream of pages-to-compress from cleancache (file pages) may be dramatically different than for frontswap (anonymous pages), so unless you and Seth are going to argue upfront that cleancache pages should NEVER be candidates for compression, the evaluation criteria to drive design decisions needs to encompass both anonymous and file pages. It is currently impossible to evaluate that with zswap.

Sorry to reply to myself here, but I realized last night that I left off another related important point: We have a tendency to run benchmarks on a cold system so that the results are reproducible. For compression, however, this may unnaturally skew the entropy of data-pages-to-be-compressed and so also the density measurements. I can't prove it, but I suspect that soon after boot the number of anonymous pages containing all (or nearly all) zeroes is large, i.e. entropy is low. As the length of time grows since the system booted, more anonymous pages will be written with non-zero data, thus increasing entropy and decreasing compressibility. So, over time, the distribution of zsize may slowly skew right (toward PAGE_SIZE). If so, this effect may be very real but very hard to observe.

Dan
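The entropy point above is easy to demonstrate in a few lines. This sketch uses Python's zlib as a stand-in for the kernel's lzo1x-1 (an assumption; absolute zsize numbers will differ between the two compressors), comparing a zero-filled page, a regular-text page, and a random page:

```python
import os
import zlib

PAGE_SIZE = 4096

zero_page = bytes(PAGE_SIZE)                           # minimal entropy
text_page = (b"Call me Ishmael. " * 256)[:PAGE_SIZE]   # highly regular text
random_page = os.urandom(PAGE_SIZE)                    # maximal entropy

for name, page in [("zero", zero_page), ("text", text_page),
                   ("random", random_page)]:
    zsize = len(zlib.compress(page))
    print(f"{name:6s} zsize={zsize:5d} ratio={PAGE_SIZE / zsize:7.1f}")
```

The zero page compresses to a handful of bytes, the regular text to well under 10% of PAGE_SIZE, and the random page does not compress at all (its "zsize" slightly exceeds PAGE_SIZE) -- which is exactly why benchmarks that ignore data content say little about compressed-memory behavior.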
RE: zsmalloc limitations and related topics
> From: Robert Jennings [mailto:r...@linux.vnet.ibm.com] > Subject: Re: zsmalloc limitations and related topics Hi Robert -- Thanks for the well-considered reply! > * Dan Magenheimer (dan.magenhei...@oracle.com) wrote: > > Hi all -- > > > > I've been doing some experimentation on zsmalloc in preparation > > for my topic proposed for LSFMM13 and have run across some > > perplexing limitations. Those familiar with the intimate details > > of zsmalloc might be well aware of these limitations, but they > > aren't documented or immediately obvious, so I thought it would > > be worthwhile to air them publicly. I've also included some > > measurements from the experimentation and some related thoughts. > > > > (Some of the terms here are unusual and may be used inconsistently > > by different developers so a glossary of definitions of the terms > > used here is appended.) > > > > ZSMALLOC LIMITATIONS > > > > Zsmalloc is used for two zprojects: zram and the out-of-tree > > zswap. Zsmalloc can achieve high density when "full". But: > > > > 1) Zsmalloc has a worst-case density of 0.25 (one zpage per > >four pageframes). > > The design of the allocator results in a trade-off between best case > density and the worst-case which is true for any allocator. For zsmalloc, > the best case density with a 4K page size is 32.0, or 177.0 for a 64K page > size, based on storing a set of zero-filled pages compressed by lzo1x-1. Right. Without a "representative workload", we have no idea whether either my worst-case or your best-case will be relevant. (As an aside, I'm measuring zsize=28 bytes for a zero page... Seth has repeatedly said 103 bytes and I think this is reflected in your computation above. Maybe it is 103 for your hardware compression engine? Else, I'm not sure why our numbers would be different.) > > 2) When not full and especially when nearly-empty _after_ > >being full, density may fall below 1.0 as a result of > >fragmentation. 
> True and there are several ways to address this including
> defragmentation, fewer class sizes in zsmalloc, aging, and/or writeback
> of zpages in sparse zspages to free pageframes during normal writeback.

Yes. And add pageframe-reclaim to this list of things that zsmalloc should do but currently cannot do.

> > 3) Zsmalloc has a density of exactly 1.0 for any number of
> >    zpages with zsize >= 0.8.
>
> For this reason zswap does not cache pages which fall in this range.
> It is not enforced in the allocator because some users may be forced to
> store these pages; users like zram.

Again, without a "representative" workload, we don't know whether or not it is important to manage pages with zsize >= 0.8. You are simply dismissing it as unnecessary because zsmalloc can't handle them and because they don't appear at any measurable frequency in kernbench or SPECjbb. (Zbud _can_ efficiently handle these larger pages under many circumstances... but without a "representative" workload, we don't know whether or not those circumstances will occur.)

> > 4) Zsmalloc contains several compile-time parameters;
> >    the best value of these parameters may be very workload
> >    dependent.
>
> The parameters fall into two major areas, handle computation and class
> size. The handle can be abstracted away, eliminating the compile-time
> parameters. The class-size tunable could be changed to a default value
> with the option for specifying an alternate value from the user during
> pool creation.

Perhaps my point here wasn't clear so let me be more blunt: There's no way in hell that even a very sophisticated user will know how to set these values. I think we need to ensure either that they are "always right" (which without a "representative workload"...) or, preferably, have some way so that they can dynamically adapt at runtime.

> > If density == 1.0, that means we are paying the overhead of
> > compression+decompression for no space advantage.
> > If density < 1.0, that means using zsmalloc is detrimental,
> > resulting in worse memory pressure than if it were not used.
> >
> > WORKLOAD ANALYSIS
> >
> > These limitations emphasize that the workload used to evaluate
> > zsmalloc is very important. Benchmarks that measure data
> > throughput or CPU utilization are of questionable value because
> > it is the _content_ of the data that is particularly relevant
> > for compression. Even more precisely, it is the "entropy"
> > of the data that is relevant, because the amount of
> > compressibility in the data is related to the entropy:
> > I.e. an entirely random pagefull of bits will compress poorly
> > and a highly-regular pagefull of bits will compress well.
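For reference, the worst-case and best-case densities quoted in this exchange reduce to simple arithmetic. The 128-byte class size below is an illustrative assumption, not a value taken from the zsmalloc source:

```python
PAGE_SIZE = 4096

# Worst case from the thread: a zspage spanning four pageframes ends up
# holding a single zpage.
worst_density = 1 / 4            # 0.25 zpages per pageframe

# Best case from the thread: zero-filled pages compress (per Robert's
# lzo1x-1 figure, ~103 bytes) into a small size class; assuming a
# 128-byte class, one pageframe holds PAGE_SIZE / 128 zpages.
best_density = PAGE_SIZE / 128   # 32.0 zpages per pageframe

print(worst_density, best_density)
```

The two-orders-of-magnitude spread between these bounds is exactly why the zsize distribution of the workload, not throughput numbers, determines whether zsmalloc helps or hurts.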
RE: [PATCH 2/4] zcache: zero-filled pages awareness
> From: Wanpeng Li [mailto:liw...@linux.vnet.ibm.com] > Subject: [PATCH 2/4] zcache: zero-filled pages awareness > > Compression of zero-filled pages can unneccessarily cause internal > fragmentation, and thus waste memory. This special case can be > optimized. > > This patch captures zero-filled pages, and marks their corresponding > zcache backing page entry as zero-filled. Whenever such zero-filled > page is retrieved, we fill the page frame with zero. > > Signed-off-by: Wanpeng Li > --- > drivers/staging/zcache/tmem.c|4 +- > drivers/staging/zcache/tmem.h|5 ++ > drivers/staging/zcache/zcache-main.c | 87 > ++ > 3 files changed, 85 insertions(+), 11 deletions(-) > > diff --git a/drivers/staging/zcache/tmem.c b/drivers/staging/zcache/tmem.c > index a2b7e03..62468ea 100644 > --- a/drivers/staging/zcache/tmem.c > +++ b/drivers/staging/zcache/tmem.c > @@ -597,7 +597,9 @@ int tmem_put(struct tmem_pool *pool, struct tmem_oid > *oidp, uint32_t index, > if (unlikely(ret == -ENOMEM)) > /* may have partially built objnode tree ("stump") */ > goto delete_and_free; > - (*tmem_pamops.create_finish)(pampd, is_ephemeral(pool)); > + if (pampd != (void *)ZERO_FILLED) > + (*tmem_pamops.create_finish)(pampd, is_ephemeral(pool)); > + > goto out; > > delete_and_free: > diff --git a/drivers/staging/zcache/tmem.h b/drivers/staging/zcache/tmem.h > index adbe5a8..6719dbd 100644 > --- a/drivers/staging/zcache/tmem.h > +++ b/drivers/staging/zcache/tmem.h > @@ -204,6 +204,11 @@ struct tmem_handle { > uint16_t client_id; > }; > > +/* > + * mark pampd to special vaule in order that later > + * retrieve will identify zero-filled pages > + */ > +#define ZERO_FILLED 0x2 You can avoid changing tmem.[ch] entirely by moving this definition into zcache-main.c and by moving the check comparing pampd against ZERO_FILLED into zcache_pampd_create_finish() I think that would be cleaner... 
If you change this and make the pageframe counter fix for PATCH 4/4, please add my ack for the next version:

Acked-by: Dan Magenheimer
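The zero-filled-page special case under discussion can be sketched outside the kernel like this. This is illustrative Python, not the zcache code; the "ZERO_FILLED" marker string stands in for the patch's special pampd value:

```python
PAGE_SIZE = 4096

def page_is_zero_filled(page: bytes) -> bool:
    # Analogous to scanning the mapped page before compressing it, as the
    # patch does: a zero-filled page is recorded with just a marker (no
    # zpage is allocated or compressed at all).
    return page == bytes(len(page))

def store(page: bytes):
    if page_is_zero_filled(page):
        return "ZERO_FILLED"              # marker only; nothing stored
    raise NotImplementedError("compress and store a real zpage here")

def retrieve(entry) -> bytes:
    if entry == "ZERO_FILLED":
        return bytes(PAGE_SIZE)           # refill the pageframe with zeroes
    raise NotImplementedError("decompress the stored zpage here")

entry = store(bytes(PAGE_SIZE))
print(retrieve(entry) == bytes(PAGE_SIZE))
```

This also shows why the special case saves memory: a zero page costs one marker rather than a compressed zpage plus allocator overhead.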
RE: [PATCH 4/4] zcache: add pageframes count once compress zero-filled pages twice
> From: Wanpeng Li [mailto:liw...@linux.vnet.ibm.com]
> Sent: Wednesday, March 13, 2013 1:05 AM
> To: Andrew Morton
> Cc: Greg Kroah-Hartman; Dan Magenheimer; Seth Jennings; Konrad Rzeszutek Wilk; Minchan Kim; linux-m...@kvack.org; linux-kernel@vger.kernel.org; Wanpeng Li
> Subject: [PATCH 4/4] zcache: add pageframes count once compress zero-filled pages twice

Hi Wanpeng --

Thanks for taking on this task from the drivers/staging/zcache TODO list!

> Since zbudpage consist of two zpages, two zero-filled pages compression
> contribute to one [eph|pers]pageframe count accumulated.

I'm not sure why this is necessary. The [eph|pers]pageframe count is supposed to be counting actual pageframes used by zcache. Since your patch eliminates the need to store zero pages, no pageframes are needed at all to store zero pages, so it's not necessary to increment zcache_[eph|pers]_pageframes when storing zero pages. Or am I misunderstanding your intent?

Thanks,
Dan

> Signed-off-by: Wanpeng Li
> ---
>  drivers/staging/zcache/zcache-main.c | 25 +++--
>  1 files changed, 23 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/staging/zcache/zcache-main.c b/drivers/staging/zcache/zcache-main.c
> index dd52975..7860ff0 100644
> --- a/drivers/staging/zcache/zcache-main.c
> +++ b/drivers/staging/zcache/zcache-main.c
> @@ -544,6 +544,8 @@ static struct page *zcache_evict_eph_pageframe(void);
>  static void *zcache_pampd_eph_create(char *data, size_t size, bool raw,
>  				struct tmem_handle *th)
>  {
> +	static ssize_t second_eph_zero_page;
> +	static atomic_t second_eph_zero_page_atomic = ATOMIC_INIT(0);
>  	void *pampd = NULL, *cdata = data;
>  	unsigned clen = size;
>  	bool zero_filled = false;
> @@ -561,7 +563,14 @@ static void *zcache_pampd_eph_create(char *data, size_t size, bool raw,
>  		clen = 0;
>  		zero_filled = true;
>  		zcache_pages_zero++;
> -		goto got_pampd;
> +		second_eph_zero_page = atomic_inc_return(
> +			&second_eph_zero_page_atomic);
> +		if (second_eph_zero_page % 2 == 1)
> +			goto got_pampd;
> +		else {
> +			atomic_sub(2, &second_eph_zero_page_atomic);
> +			goto count_zero_page;
> +		}
>  	}
>  	kunmap_atomic(user_mem);
>
> @@ -597,6 +606,7 @@ static void *zcache_pampd_eph_create(char *data, size_t size, bool raw,
>  create_in_new_page:
>  	pampd = (void *)zbud_create_prep(th, true, cdata, clen, newpage);
>  	BUG_ON(pampd == NULL);
> +count_zero_page:
>  	zcache_eph_pageframes =
>  		atomic_inc_return(&zcache_eph_pageframes_atomic);
>  	if (zcache_eph_pageframes > zcache_eph_pageframes_max)
> @@ -621,6 +631,8 @@ out:
>  static void *zcache_pampd_pers_create(char *data, size_t size, bool raw,
>  				struct tmem_handle *th)
>  {
> +	static ssize_t second_pers_zero_page;
> +	static atomic_t second_pers_zero_page_atomic = ATOMIC_INIT(0);
>  	void *pampd = NULL, *cdata = data;
>  	unsigned clen = size, zero_filled = 0;
>  	struct page *page = (struct page *)(data), *newpage;
> @@ -644,7 +656,15 @@ static void *zcache_pampd_pers_create(char *data, size_t size, bool raw,
>  		clen = 0;
>  		zero_filled = 1;
>  		zcache_pages_zero++;
> -		goto got_pampd;
> +		second_pers_zero_page = atomic_inc_return(
> +			&second_pers_zero_page_atomic);
> +		if (second_pers_zero_page % 2 == 1)
> +			goto got_pampd;
> +		else {
> +			atomic_sub(2, &second_pers_zero_page_atomic);
> +			goto count_zero_page;
> +		}
> +
>  	}
>  	kunmap_atomic(user_mem);
>
> @@ -698,6 +718,7 @@ create_pampd:
>  create_in_new_page:
>  	pampd = (void *)zbud_create_prep(th, false, cdata, clen, newpage);
>  	BUG_ON(pampd == NULL);
> +count_zero_page:
>  	zcache_pers_pageframes =
>  		atomic_inc_return(&zcache_pers_pageframes_atomic);
>  	if (zcache_pers_pageframes > zcache_pers_pageframes_max)
> --
> 1.7.7.6
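As I read the patch, the accounting it introduces can be modeled in a few lines. This is a toy model, not the kernel code, and it also illustrates Dan's objection that the increment is unnecessary:

```python
# The patch bumps the pageframe counter on every second zero-filled page,
# on the reasoning that a zbud pageframe holds two zpages.  Dan's point:
# zero-filled pages are stored only as a flag, so they occupy no zbud
# pageframe at all and the counter should not move for them.

def frames_counted_by_patch(n_zero_pages: int) -> int:
    frames = 0
    for i in range(1, n_zero_pages + 1):
        if i % 2 == 0:     # the patch's "else" branch: every even page
            frames += 1
    return frames

def frames_actually_used(n_zero_pages: int) -> int:
    return 0               # zero pages consume no pageframe in zcache

print(frames_counted_by_patch(10), frames_actually_used(10))
```

For ten zero-filled pages the patch's scheme reports five pageframes while zero are actually consumed, which is the discrepancy Dan's review points at.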
RE: zsmalloc limitations and related topics
From: Robert Jennings [mailto:r...@linux.vnet.ibm.com] Subject: Re: zsmalloc limitations and related topics Hi Robert -- Thanks for the well-considered reply! * Dan Magenheimer (dan.magenhei...@oracle.com) wrote: Hi all -- I've been doing some experimentation on zsmalloc in preparation for my topic proposed for LSFMM13 and have run across some perplexing limitations. Those familiar with the intimate details of zsmalloc might be well aware of these limitations, but they aren't documented or immediately obvious, so I thought it would be worthwhile to air them publicly. I've also included some measurements from the experimentation and some related thoughts. (Some of the terms here are unusual and may be used inconsistently by different developers so a glossary of definitions of the terms used here is appended.) ZSMALLOC LIMITATIONS Zsmalloc is used for two zprojects: zram and the out-of-tree zswap. Zsmalloc can achieve high density when full. But: 1) Zsmalloc has a worst-case density of 0.25 (one zpage per four pageframes). The design of the allocator results in a trade-off between best case density and the worst-case which is true for any allocator. For zsmalloc, the best case density with a 4K page size is 32.0, or 177.0 for a 64K page size, based on storing a set of zero-filled pages compressed by lzo1x-1. Right. Without a representative workload, we have no idea whether either my worst-case or your best-case will be relevant. (As an aside, I'm measuring zsize=28 bytes for a zero page... Seth has repeatedly said 103 bytes and I think this is reflected in your computation above. Maybe it is 103 for your hardware compression engine? Else, I'm not sure why our numbers would be different.) 2) When not full and especially when nearly-empty _after_ being full, density may fall below 1.0 as a result of fragmentation. 
True and there are several ways to address this including defragmentation, fewer class sizes in zsmalloc, aging, and/or writeback of zpages in sparse zspages to free pageframes during normal writeback. Yes. And add pageframe-reclaim to this list of things that zsmalloc should do but currently cannot do. 3) Zsmalloc has a density of exactly 1.0 for any number of zpages with zsize >= 0.8. For this reason zswap does not cache pages which fall in this range. It is not enforced in the allocator because some users may be forced to store these pages; users like zram. Again, without a representative workload, we don't know whether or not it is important to manage pages with zsize >= 0.8. You are simply dismissing it as unnecessary because zsmalloc can't handle them and because they don't appear at any measurable frequency in kernbench or SPECjbb. (Zbud _can_ efficiently handle these larger pages under many circumstances... but without a representative workload, we don't know whether or not those circumstances will occur.) 4) Zsmalloc contains several compile-time parameters; the best value of these parameters may be very workload dependent. The parameters fall into two major areas, handle computation and class size. The handle can be abstracted away, eliminating the compile-time parameters. The class-size tunable could be changed to a default value with the option for specifying an alternate value from the user during pool creation. Perhaps my point here wasn't clear so let me be more blunt: There's no way in hell that even a very sophisticated user will know how to set these values. I think we need to ensure either that they are always right (which without a representative workload...) or, preferably, have some way so that they can dynamically adapt at runtime. If density == 1.0, that means we are paying the overhead of compression+decompression for no space advantage.
If density < 1.0, that means using zsmalloc is detrimental, resulting in worse memory pressure than if it were not used.

WORKLOAD ANALYSIS

These limitations emphasize that the workload used to evaluate zsmalloc is very important. Benchmarks that measure data throughput or CPU utilization are of questionable value because it is the _content_ of the data that is particularly relevant for compression. Even more precisely, it is the entropy of the data that is relevant, because the amount of compressibility in the data is related to the entropy: I.e. an entirely random pagefull of bits will compress poorly and a highly-regular pagefull of bits will compress well. Since the zprojects manage a large number of zpages, both the mean and distribution of zsize of the workload should be representative. The workload most widely used to publish results for the various zprojects is a kernel-compile using make -jN where N is artificially increased to impose memory pressure. By adding some debug code to zswap, I was able to analyze this workload and found the following: 1
RE: [PATCHv7 4/8] zswap: add to mm/
> From: Seth Jennings [mailto:sjenn...@linux.vnet.ibm.com] > To: Dave Hansen > Subject: Re: [PATCHv7 4/8] zswap: add to mm/ > > On 03/07/2013 01:00 PM, Dave Hansen wrote: > > On 03/06/2013 07:52 AM, Seth Jennings wrote: > > ... > >> +**/ > >> +/* attempts to compress and store an single page */ > >> +static int zswap_frontswap_store(unsigned type, pgoff_t offset, > >> + struct page *page) > >> +{ > > ... > >> + /* store */ > >> + handle = zs_malloc(tree->pool, dlen, > >> + __GFP_NORETRY | __GFP_HIGHMEM | __GFP_NOMEMALLOC | > >> + __GFP_NOWARN); > >> + if (!handle) { > >> + zswap_reject_zsmalloc_fail++; > >> + ret = -ENOMEM; > >> + goto putcpu; > >> + } > >> + > > > > I think there needs to at least be some strong comments in here about > > why you're doing this kind of allocation. From some IRC discussion, it > > seems like you found some pathological case where zswap wasn't helping > > make reclaim progress and ended up draining the reserve pools and you > > did this to avoid draining the reserve pools. > > I'm currently doing some tests with fewer zsmalloc class sizes and > removing __GFP_NOMEMALLOC to see the effect. Zswap/zcache/frontswap are greedy, at times almost violently so. Using emergency reserves seems like a sure way to OOM depending on the workload (and luck). I did some class size experiments too without seeing much advantage. But without a range of "representative" data streams, it's very hard to claim any experiment is successful. I've got some ideas on combining the best of zsmalloc and zbud but they are still a little raw. > > I think the lack of progress doing reclaim is really the root cause you > > should be going after here instead of just working around the symptom. Dave, agreed. 
See http://marc.info/?l=linux-mm&m=136147977602561&w=2 and the PAGEFRAME EVACUATION subsection of http://marc.info/?l=linux-mm&m=136200745931284&w=2
RE: zsmalloc limitations and related topics
> From: Ric Mason [mailto:ric.mas...@gmail.com]
> Subject: Re: zsmalloc limitations and related topics
>
> On 02/28/2013 07:24 AM, Dan Magenheimer wrote:
> > Hi all --
> >
> > I've been doing some experimentation on zsmalloc in preparation
> > for my topic proposed for LSFMM13 and have run across some
> > perplexing limitations.  Those familiar with the intimate details
> > of zsmalloc might be well aware of these limitations, but they
> > aren't documented or immediately obvious, so I thought it would
> > be worthwhile to air them publicly.  I've also included some
> > measurements from the experimentation and some related thoughts.
> >
> > (Some of the terms here are unusual and may be used inconsistently
> > by different developers, so a glossary of definitions of the terms
> > used here is appended.)
> >
> > ZSMALLOC LIMITATIONS
> >
> > Zsmalloc is used for two zprojects: zram and the out-of-tree
> > zswap.  Zsmalloc can achieve high density when "full".  But:
> >
> > 1) Zsmalloc has a worst-case density of 0.25 (one zpage per
> >    four pageframes).
> > 2) When not full, and especially when nearly-empty _after_
> >    being full, density may fall below 1.0 as a result of
> >    fragmentation.
>
> What's the meaning of nearly-empty _after_ being full?

Step 1: Add a few (N) pages to zsmalloc.  It is "nearly empty".
Step 2: Now add many more pages to zsmalloc until allocation limits
are reached.  It is "full".
Step 3: Now remove many pages from zsmalloc until there are N pages
remaining.  It is now "nearly empty after being full".

Fragmentation characteristics are different comparing after Step 1
and after Step 3 even though, in both cases, zsmalloc contains N pages.

> > 3) Zsmalloc has a density of exactly 1.0 for any number of
> >    zpages with zsize >= 0.8 (i.e. a compressed size of at least
> >    80% of a page).
> > 4) Zsmalloc contains several compile-time parameters; the best
> >    value of these parameters may be very workload dependent.
> > If density == 1.0, that means we are paying the overhead of
> > compression+decompression for no space advantage.  If
> > density < 1.0, that means using zsmalloc is detrimental,
> > resulting in worse memory pressure than if it were not used.
> >
> > WORKLOAD ANALYSIS
> >
> > These limitations emphasize that the workload used to evaluate
> > zsmalloc is very important.  Benchmarks that measure data
>
> Could you share your benchmark, so that others can take advantage
> of it?

As Seth does, I just used "make" of a kernel.  I run it on a full
graphical installation of EL6.  To ensure there is memory pressure,
I limit physical memory to 1GB and use "make -j20".

> > throughput or CPU utilization are of questionable value because
> > it is the _content_ of the data that is particularly relevant
> > for compression.  Even more precisely, it is the "entropy"
> > of the data that is relevant, because the amount of
> > compressibility in the data is related to the entropy:
> > i.e. an entirely random pagefull of bits will compress poorly
> > and a highly-regular pagefull of bits will compress well.
> > Since the zprojects manage a large number of zpages, both
> > the mean and distribution of zsize of the workload should
> > be "representative".
> >
> > The workload most widely used to publish results for
> > the various zprojects is a kernel-compile using "make -jN",
> > where N is artificially increased to impose memory pressure.
> > By adding some debug code to zswap, I was able to analyze
> > this workload and found the following:
> >
> > 1) The average page compressed by almost a factor of six
> >    (mean zsize == 694, stddev == 474)
>
> stddev is what?

Standard deviation.
See: http://en.wikipedia.org/wiki/Standard_deviation