Re: mm/zswap: change to writethrough
On 11/12/2013 03:12 AM, Dan Streetman wrote:
> Seth, have you (or anyone else) considered making zswap a writethrough
> cache instead of writeback? I think it would significantly help the
> case where zswap fills up and starts writing back its oldest pages to
> disk: all the decompression work would be avoided, since zswap could
> simply evict old pages and forget about them. When zswap is full is
> probably the worst time to add extra work/delay, while adding extra
> disk IO (presumably using DMA) before zswap is full doesn't seem to me
> like it would have much impact, except in the case where zswap isn't
> full but there is so little free memory that new allocations are
> waiting on swap-out.
>
> Besides the additional disk IO that obviously comes with making zswap
> writethrough (additional only before zswap fills up), are there any
> other disadvantages? Is it a common situation for there to be no
> memory left, with get_free_page actively waiting on swap-out, before
> zswap fills up?
>
> Making it writethrough could also open up other possible improvements,
> such as making the compression and storage of new swap-out pages
> asynchronous, so the compression doesn't delay the write-out to disk.

I like this idea and those benefits. The only question I'm not sure
about is whether this feature would be too complicated to implement. It
sounds like we would need to reimplement something like the swap cache
to handle zswap writethrough.

-- 
Regards,
-Bob
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 0/8] mm: thrash detection-based file cache sizing v5
Hi Johannes,

On Fri, Oct 11, 2013 at 5:46 AM, Johannes Weiner <han...@cmpxchg.org> wrote:
> Future
>
> Right now we have a fixed ratio (50:50) between inactive and active
> list but we already have complaints about working sets exceeding half
> of memory being pushed out of the cache by simple used-once streaming
> in the background. Ultimately, we want to adjust this ratio and allow
> for a much smaller inactive list. These patches are an essential step
> in this direction because they decouple the VM's ability to detect
> working set changes from the inactive list size. This would allow us
> to base the inactive list size on something more sensible, like the
> combined readahead window size for example.

I found that this patchset has a purpose similar to zcache
(http://lwn.net/Articles/562254/) in some ways. Zcache uses the
cleancache API to compress clean file pages so as to keep them in
memory, and only pages which used to be on the active file list are
considered. With zcache we can hold more active pages in memory and
are not affected by streaming workloads.

Perhaps you can take a look at the zcache approach; your workloads are
very likely to benefit from it. Zcache also doesn't need as many
changes to the core VM/VFS subsystems because it's based on the
cleancache API, which I think makes it more acceptable.

-- 
Regards,
--Bob
Re: [PATCH] staging: zsmalloc: Ensure handle is never 0 on success
On 11/07/2013 03:04 PM, Minchan Kim wrote:
> On Wed, Nov 06, 2013 at 07:05:11PM -0800, Greg KH wrote:
>> On Wed, Nov 06, 2013 at 03:46:19PM -0800, Nitin Gupta wrote:
>>>> I'm getting really tired of them hanging around in here for many
>>>> years now...
>>>
>>> Minchan has tried many times to promote zram out of staging. This
>>> was his most recent attempt:
>>>
>>> https://lkml.org/lkml/2013/8/21/54
>>>
>>> There he provided arguments for zram inclusion, how it can help in
>>> situations where zswap can't, and why generalizing /dev/ramX would
>>> not be a great idea. So I cannot say why it wasn't picked up for
>>> inclusion at that time.
>>>
>>>> Should I just remove them if no one is working on getting them
>>>> merged "properly"?
>>>
>>> Please refer to the mail thread (link above) and see Minchan's
>>> justifications for zram. If they don't sound convincing enough then
>>> please remove zram+zsmalloc from staging.
>>
>> You don't need to be convincing me, you need to be convincing the
>> maintainers of the area of the kernel you are working with.
>>
>> And since the last time you all tried to get this merged was back in
>> August, I'm feeling that you all have given up, so it needs to be
>> deleted. I'll go do that for 3.14, and if someone wants to pick it up
>> and merge it properly, they can easily revert it.
>
> I'm guilty and I have been busy with other stuff. Sorry for that.
> Fortunately, I discussed this issue with Hugh at this LinuxCon for a
> long time (thanks Hugh!). He felt zram's block device abstraction is
> a better design than the frontswap backend approach, although where
> we put zsmalloc is still a question. I will CC Hugh because many of
> these things are related to the swap subsystem and his opinion is
> really important. And I discussed it with Rik, and he feels positive
> about zram.
>
> The last impression Andrew gave me by private mail is that he wants
> to merge zram's functionality into zswap or vice versa. If I
> misunderstood, please correct me.
>
> I understand his concern, but I guess he didn't have time to read my
> long description due to a ton of work at that time. So, I will try one
> more time. I hope to get feedback rather than *silence* so that we
> can move forward rather than stall.
>
> Recently, Bob tried to move zsmalloc under the mm directory to unify
> zram and zswap by adding a pseudo block device to zswap (it's very
> weird to me; I think it's a horrible monster lying between mm and
> block from a layering POV), but he was ignoring zram's block device
> (a.k.a. zram-blk) feature and considered only the swap usecase of
> zram, which in turn loses zram's good concept. I already covered the
> other topics Bob raised in this thread[1], and why I think zram is
> better, in the thread.

I have no objections to zram, and I also think it would be good for
zswap to support zsmalloc and a fake swap device. At least users would
have more options, just like slab/slub/slob.

-- 
Regards,
-Bob
Re: [Patch 3.11.7 1/1]mm: remove and free expired data in time in zswap
On 11/08/2013 05:50 PM, changkun.li wrote:
> In zswap, we store page A to zbud if the compression ratio is high,
> and insert its entry into the rbtree. If there is an entry B with the
> same offset in the rbtree, we remove and free B before inserting the
> entry for A.
>
> Case:
> If the compression ratio of page A is not high, we return without
> checking for an entry with the same offset in the rbtree.
>
> If there is an entry B with the same offset in the rbtree, we now
> know B is invalid or expired, but the entry and compressed memory of
> B are not freed in time.
>
> Because zswap keeps this stale data in memory, it makes the
> utilization of memory lower, and the other valid data in zbud is more
> likely to be written back to the swap device when zswap is full.
>
> So if we know an entry is expired, free it in time.
>
> Signed-off-by: changkun.li
> ---
>  mm/zswap.c |    5 ++++-
>  1 files changed, 4 insertions(+), 1 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index cbd9578..90a2813 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -596,6 +596,7 @@ fail:
>  	return ret;
>  }
>
> +static void zswap_frontswap_invalidate_page(unsigned type, pgoff_t offset);
>  /*
>  * frontswap hooks
>  **/
> @@ -614,7 +615,7 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
>
>  	if (!tree) {
>  		ret = -ENODEV;
> -		goto reject;
> +		goto nodev;
>  	}
>
>  	/* reclaim space if needed */
> @@ -695,6 +696,8 @@ freepage:
>  	put_cpu_var(zswap_dstmem);
>  	zswap_entry_cache_free(entry);
>  reject:
> +	zswap_frontswap_invalidate_page(type, offset);

I'm afraid that when we arrive here, zswap_rb_search(offset) will
always return a NULL entry. So most of the time it's just wasted time
to call zswap_frontswap_invalidate_page() to search the rbtree.

-- 
Regards,
-Bob
Re: [Xen-devel] [PATCH] xen/balloon: Set balloon's initial state to number of existing RAM pages
On 11/07/2013 04:37 AM, Boris Ostrovsky wrote:
> Currently balloon's initial value is set to max_pfn, which includes
> non-RAM ranges such as the MMIO hole. As a result, the initial memory
> target (specified by the guest's configuration file) will appear
> smaller than what the balloon driver perceives to be the current
> number of available pages. Thus it will balloon down "extra" pages,
> decreasing the amount of available memory for no good reason.

This fixes the strange behavior I mentioned yesterday, where the
balloon driver was triggered unreasonably every time after the guest
started.

> Signed-off-by: Boris Ostrovsky
> ---
>  drivers/xen/balloon.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
> index b232908..1b62304 100644
> --- a/drivers/xen/balloon.c
> +++ b/drivers/xen/balloon.c
> @@ -641,7 +641,7 @@ static int __init balloon_init(void)
>
>  	balloon_stats.current_pages = xen_pv_domain()
>  		? min(xen_start_info->nr_pages - xen_released_pages, max_pfn)
> -		: max_pfn;
> +		: get_num_physpages();

By the way, should the other places using max_pfn also be changed to
get_num_physpages()?

>  	balloon_stats.target_pages  = balloon_stats.current_pages;
>  	balloon_stats.balloon_low   = 0;
>  	balloon_stats.balloon_high  = 0;

-- 
Regards,
-Bob
Re: [Xen-devel] [PATCH] drivers: xen-selfballoon: consider slab pages
On 11/05/2013 01:22 AM, David Vrabel wrote:
> On 04/11/13 12:39, Bob Liu wrote:
>> Currently the goal_pages in xen-selfballoon doesn't consider much
>> about pages used in kernel space. A typical case is slab pages;
>> without considering slab pages, the goal_pages result may be too
>> rough and put extra memory pressure on the guest OS.
>
> Can you provide some real world figures where the calculation got it
> wrong? What was the resultant behavior? Swap death? OOM killer?

Sorry, I didn't run any testing; I just think it's unreasonable from
reading the source code. vm_memory_committed() only counts pages which
are mapped into process address space, but the kernel itself (e.g. the
block, fs and network subsystems) may occupy some memory, and it's
possible that those subsystems occupy a significant amount of memory
in some situations. I'm afraid that if we don't consider that kernel
memory while calculating goal_pages, guest memory will be set lower
than the guest really needs.

>> Signed-off-by: Bob Liu
>> ---
>>  drivers/xen/xen-selfballoon.c |    2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/drivers/xen/xen-selfballoon.c b/drivers/xen/xen-selfballoon.c
>> index 21e18c1..4814759 100644
>> --- a/drivers/xen/xen-selfballoon.c
>> +++ b/drivers/xen/xen-selfballoon.c
>> @@ -191,6 +191,8 @@ static void selfballoon_process(struct work_struct *work)
>>  		tgt_pages = cur_pages; /* default is no change */
>>  		goal_pages = vm_memory_committed() +
>>  				totalreserve_pages +
>> +				global_page_state(NR_SLAB_RECLAIMABLE) +
>
> Does SLAB_RECLAIMABLE want to be included here? Unless I'm
> misunderstanding here, SLAB_RECLAIMABLE is effectively free.

SLAB_RECLAIMABLE isn't effectively free; it means the slab page is in
use but can be reclaimed (freed) under memory pressure.

>> +				global_page_state(NR_SLAB_UNRECLAIMABLE) +
>
> This bit looks fine to me.
>
>>  				MB2PAGES(selfballoon_reserved_mb);
>>  #ifdef CONFIG_FRONTSWAP
>>  		/* allow space for frontswap pages to be repatriated */
>
> David

Thanks for your review.

-- 
Regards,
-Bob
[PATCH] drivers: xen-selfballoon: consider slab pages
Currently the goal_pages in xen-selfballoon doesn't consider much about
pages used in kernel space. A typical case is slab pages; without
considering slab pages, the goal_pages result may be too rough and put
extra memory pressure on the guest OS.

Signed-off-by: Bob Liu
---
 drivers/xen/xen-selfballoon.c |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/xen/xen-selfballoon.c b/drivers/xen/xen-selfballoon.c
index 21e18c1..4814759 100644
--- a/drivers/xen/xen-selfballoon.c
+++ b/drivers/xen/xen-selfballoon.c
@@ -191,6 +191,8 @@ static void selfballoon_process(struct work_struct *work)
 		tgt_pages = cur_pages; /* default is no change */
 		goal_pages = vm_memory_committed() +
 				totalreserve_pages +
+				global_page_state(NR_SLAB_RECLAIMABLE) +
+				global_page_state(NR_SLAB_UNRECLAIMABLE) +
 				MB2PAGES(selfballoon_reserved_mb);
 #ifdef CONFIG_FRONTSWAP
 		/* allow space for frontswap pages to be repatriated */
-- 
1.7.10.4
Re: zram/zsmalloc issues in very low memory conditions
Hi Olav,

On 11/02/2013 08:59 AM, Olav Haugan wrote:
> I tried the above suggestion but it does not seem to have any
> noticeable impact. The system is still trying to swap out at a very
> high rate after zram reported failure to swap out. The error logging
> is actually so much that my system crashed due to excessive logging
> (we have a watchdog that is not getting pet because the kernel is
> busy logging kernel messages).

I have a question: why didn't the low memory killer get triggered in
this situation? Is it possible to make the LMK a bit more aggressive?

> There isn't anything that can be set to tell the fs layer to back off
> completely for a while (congestion control)?

The other way I think might fix your issue is the same as you mentioned
in your previous email: set the congested bit for the swap device as
well. Like:

diff --git a/drivers/staging/zram/zram_drv.c b/drivers/staging/zram/zram_drv.c
index 91d94b5..c4fc63e 100644
--- a/drivers/staging/zram/zram_drv.c
+++ b/drivers/staging/zram/zram_drv.c
@@ -474,6 +474,7 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
 	if (!handle) {
 		pr_info("Error allocating memory for compressed page: %u, size=%zu\n",
 			index, clen);
+		blk_set_queue_congested(zram->disk->queue, BLK_RW_ASYNC);
 		ret = -ENOMEM;
 		goto out;
 	}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8ed1b77..1c790ee 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -394,8 +394,6 @@ static inline int is_page_cache_freeable(struct page *page)
 static int may_write_to_queue(struct backing_dev_info *bdi,
 			      struct scan_control *sc)
 {
-	if (current->flags & PF_SWAPWRITE)
-		return 1;
--

As for clearing zram's congested state, I think you could clear it from
user space, e.g. after the LMK has been triggered and reclaimed some
memory. Of course this depends on the zram driver exporting a sysfs
node such as "/sys/block/zram0/clear_congested".

-- 
Regards,
-Bob
Re: zram/zsmalloc issues in very low memory conditions
Hi Olav,

On 11/01/2013 07:34 AM, Olav Haugan wrote:
> Hi Luigi,
>
> On 10/24/2013 6:12 PM, Luigi Semenzato wrote:
>> On Thu, Oct 24, 2013 at 5:35 PM, Olav Haugan wrote:
>>> Hi Bob, Luigi,
>>>
>>> On 10/23/2013 5:55 PM, Bob Liu wrote:
>>>> On 10/24/2013 05:51 AM, Olav Haugan wrote:
>>>>
>>>> By the way, could you take a try with zswap? It can write pages to
>>>> a real swap device if the compressed pool is full.
>>>
>>> zswap might not be feasible in all cases if you only have flash as
>>> backing storage.
>>
>> Zswap can be configured to run without a backing storage.
>
> I was under the impression that zswap requires a backing storage. Can
> you elaborate on how to configure zswap to not need a backing storage?

AFAIK, zswap currently can't be used without a backing storage. Perhaps
you could try creating a swap device backed by a file on storage.

-- 
Regards,
-Bob
Re: [BUG] 3.12.0-rcX IPv6 panic
On Mon, Oct 21, 2013 at 06:40:41PM -0500, Bob Tracy wrote: > On Mon, Oct 21, 2013 at 05:52:52PM +0200, Hannes Frederic Sowa wrote: > > On Mon, Oct 21, 2013 at 08:18:46AM -0500, Bob Tracy wrote: > > > Actually, a regression: the 3.11 kernel is rock-solid stable on my > > > Alpha. > > > > > > Beginning with 3.12.0-rc1, I can reliably trigger a kernel panic by > > > executing the gogo6.net "gw6c" IPv6 client program. If the networking > > > layer is active, an "Oops" will eventually (within a day) occur regardless > > > of whether I attempt to run "gw6c". 3.12.0-rcX is stable as long as I > > > leave networking completely disabled. The error has persisted up through > > > -rc6. Apologies for not mentioning this earlier, but the state of my > > > PWS-433au has been questionable, and I wanted to make sure I had a > > > legitimate bug sighting. > > > > > > I'll have to transcribe the panic backtrace by hand: nothing makes it > > > into any of the system logs :-(. I *can* recall that every backtrace > > > I've seen thus far has included one of the skb_copy() variants near the > > > top of the list (first or second function). > > > > Try to capture the panic via serial console. Otherwise a picture > > would give us a first hint. Please watch out for lines like > > skb_(over|under)_panic. > > I'll get a screen capture of some kind for you within the next day or > so. > > > gw6c is a tunnel client? Can you post ip -6 tunnel ls? > > Assuming you meant "show [NAME]" (no "ls" option for the tunnel object), > that yields the following with "gw6c" running on a 3.11.0 kernel: > > smirkin:~# ip -6 tunnel show sit1 > sit1: any/ipv6 remote 4056:5874:: local 4500:0:0:4000:4029:: encaplimit 0 > hoplimit 0 tclass 0x00 flowlabel 0x0 (flowinfo 0x) > > I'm running the gw6c client in gateway mode: the Alpha is my IPv6 > gateway/firewall. Update: no significant change for 3.12.0-rc7. Can still reliably panic the system by running "gw6c" to set up the IPv6 tunnel. 
Here's a hand-transcribed backtrace: I hope it's sufficient... I would have included the stack/register dump, but about half of it had scrolled off the top of the screen.

skb_copy_and_csum_bits+0x88/0x380
icmp_glue_bits+0x48/0xce0
__ip_append_data+0x8f4/0xb40
ip_append_data+0xb0/0x130
icmp_glue_bits+0x0/0xe0
icmp_glue_bits+0x0/0xe0 (yes: repeated once)
icmp_push_reply+0x6c/0x190
icmp_send+0x3fc/0x4c0
ip_local_deliver_finish+0x20c/0x2e0
ip_rcv_finish+0x1d8/0x3d0
nf_ct_attach+0x32/0x40
ip_rcv_finish+0x148/0x3d0
__netif_receive_skb_core+0x27c/0x890
process_backlog+0xb8/0x1a0
net_rx_action+0xc8/0x210
__do_softirq+0x1a0/0x230
do_softirq+0x5c/0x70
irq_exit+0x68/0x80
handle_irq+0x90/0xf0
miata_srm_device_interrupt+0x30/0x50
do_entInt+0x1cc/0x1f0
__do_fault+0x3e0/0x5e0
ret_from_sys_call+0x0/0x10
entMM+0x9c/0xc0
do_page_fault+0x0/0x500
do_page_fault+0x48/0x500
entMM+0x9c/0xc0
filp_close+0x6c/0xe0
filp_close+0x98/0xe0
filp_close+0x6c/0xe0
filp_close+0x98/0xe0
__close_fd+0xb8/0xe0

Code: 44640003 e4600010 2032 a77de680 b67e0010 47f10410 b034 47ff0411

--Bob -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: zram/zsmalloc issues in very low memory conditions
On 10/25/2013 08:35 AM, Olav Haugan wrote: > Hi Bob, Luigi, > > On 10/23/2013 5:55 PM, Bob Liu wrote: >> >> On 10/24/2013 05:51 AM, Olav Haugan wrote: >>> I am trying to use zram in very low memory conditions and I am having >>> some issues. zram is in the reclaim path. So if the system is very low >>> on memory the system is trying to reclaim pages by swapping out (in this >>> case to zram). However, since we are very low on memory zram fails to >>> get a page from zsmalloc and thus zram fails to store the page. We get >>> into a cycle where the system is low on memory so it tries to swap out >>> to get more memory but swap out fails because there is not enough memory >>> in the system! The major problem I am seeing is that there does not seem >>> to be a way for zram to tell the upper layers to stop swapping out >>> because the swap device is essentially "full" (since there is no more >>> memory available for zram pages). Has anyone thought about this issue >>> already and have ideas how to solve this or am I missing something and I >>> should not be seeing this issue? >>> >> >> The same question as Luigi "What do you want the system to do at this >> point?" >> >> If swap fails then OOM killer will be triggered, I don't think this will >> be an issue. > > I definitely don't want OOM killer to run since OOM killer can kill > critical processes (this is on Android so we have Android LMK to handle > the killing in a more "safe" way). However, what I am seeing is that > when I run low on memory zram fails to swap out and returns error but > the swap subsystem just continues to try to swap out even when this > error occurs (it tries over and over again very rapidly causing the > kernel to be filled with error messages [at least two error messages per > failure btw]). > A simple way to disable the error messages is to delete the printk line in the zram source code. 
> What I expected to happen is for the swap subsystem to stop trying to > swap out until memory is available to swap out. I guess this could be In my opinion, the system has already entered a heavy memory-pressure state when zram fails to allocate a page. In this case, the OOM killer or Low Memory Killer should be woken up to kill some processes in order to free some memory pages. After this happens, the system's free memory may rise above the watermark and no swap will happen. Even if swap happens again, it's unlikely that zram will fail to alloc a page. If it fails again, the OOM killer or LMK should be triggered once more. > handled several ways. Either 1) the swap subsystem, upon encountering an > error to swap out, backs off from trying to swap out for some time or 2) > zram informs the swap subsystem that the swap device is full. > > Could this be handled by congestion control? However, I found the > following comment in the code in vmscan.c: > > * If the page is swapcache, write it back even if that would > * block, for some throttling. This happens by accident, because > * swap_backing_dev_info is bust: it doesn't reflect the > * congestion state of the swapdevs. Easy to fix, if needed. > > However, how would one update the congested state of zram when it > becomes un-congested? > >> By the way, could you take a try with zswap? Which can write pages to >> real swap device if compressed pool is full. > > zswap might not be feasible in all cases if you only have flash as > backing storage. > Yes, that's still a problem with zswap. Perhaps you can create a swap file on the backing storage. I'll try to add a fake swap device for zswap, so that users can have one more choice besides zram. >>> I am also seeing a couple other issues that I was wondering whether >>> folks have already thought about: >>> >>> 2) zsmalloc fails when the page allocated is at physical address 0 (pfn >> >> AFAIK, this will never happen. 
> > I can easily get this to happen since I have memory starting at physical > address 0. > Could you confirm this? AFAIR, physical memory starting from 0 is usually reserved for some special usage. I ran 'cat /proc/zoneinfo' on x86 and ARM, and on neither was 'start_pfn' 0. -- Regards, -Bob -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
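Olav's option (1) above — backing off rather than retrying the failed swap-out immediately — can be modeled in a few lines. This is a toy userspace sketch; every name in it is invented for illustration, and nothing here maps onto real vmscan or zram code:

```c
#include <assert.h>

static int zram_full;   /* pretend zsmalloc has no pages left */

/* Stand-in for the swap-out attempt; returns -1 (think -ENOMEM)
 * while the compressed pool is exhausted. */
static int try_swap_out(void)
{
    return zram_full ? -1 : 0;
}

/* Option (1): exponential back-off instead of a tight retry loop.
 * Returns the number of attempts made before success or giving up. */
static int swap_out_with_backoff(int max_delay_ms)
{
    int delay = 1, attempts = 0;

    while (delay <= max_delay_ms) {
        attempts++;
        if (try_swap_out() == 0)
            return attempts;
        delay *= 2;     /* real code would sleep 'delay' ms here */
    }
    return attempts;    /* gave up; a real caller might wake the LMK/OOM killer */
}
```

With the pool full and an 8 ms cap, the loop tries at delays of 1, 2, 4 and 8 ms and then gives up after four attempts, instead of flooding the log with two error messages per failure.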
Re: zram/zsmalloc issues in very low memory conditions
On 10/24/2013 05:51 AM, Olav Haugan wrote: > I am trying to use zram in very low memory conditions and I am having > some issues. zram is in the reclaim path. So if the system is very low > on memory the system is trying to reclaim pages by swapping out (in this > case to zram). However, since we are very low on memory zram fails to > get a page from zsmalloc and thus zram fails to store the page. We get > into a cycle where the system is low on memory so it tries to swap out > to get more memory but swap out fails because there is not enough memory > in the system! The major problem I am seeing is that there does not seem > to be a way for zram to tell the upper layers to stop swapping out > because the swap device is essentially "full" (since there is no more > memory available for zram pages). Has anyone thought about this issue > already and have ideas how to solve this or am I missing something and I > should not be seeing this issue? > The same question as Luigi: "What do you want the system to do at this point?" If swap fails then the OOM killer will be triggered, so I don't think this will be an issue. By the way, could you give zswap a try? It can write pages to a real swap device if the compressed pool is full. > I am also seeing a couple other issues that I was wondering whether > folks have already thought about: > > 1) The size of a swap device is statically computed when the swap device > is turned on (nr_swap_pages). The size of zram swap device is dynamic > since we are compressing the pages and thus the swap subsystem thinks > that the zram swap device is full when it is not really full. Any > plans/thoughts about the possibility of being able to update the size > and/or the # of available pages in a swap device on the fly? > > 2) zsmalloc fails when the page allocated is at physical address 0 (pfn AFAIK, this will never happen. 
> = 0) since the handle returned from zsmalloc is encoded as (pfn, > obj_idx) and thus the resulting handle will be 0 (since obj_idx starts > at 0). zs_malloc returns the handle but does not distinguish between a > valid handle of 0 and a failure to allocate. A possible solution to this > would be to start the obj_idx at 1. Is this feasible? > > Thanks, > > Olav Haugan > -- Regards, -Bob -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
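The pfn-0 ambiguity Olav describes is easy to see if the handle packing is written out. This is a self-contained sketch, not zsmalloc's real code — OBJ_INDEX_BITS and the function names are invented for illustration:

```c
#include <assert.h>

#define OBJ_INDEX_BITS 12   /* made-up index width, just for the sketch */

/* Current scheme: handle = (pfn << bits) | obj_idx, with obj_idx from 0.
 * A page at pfn 0 holding object 0 encodes to handle 0 -- the same value
 * zs_malloc() returns on allocation failure, so the two are ambiguous. */
static unsigned long encode_handle(unsigned long pfn, unsigned long obj_idx)
{
    return (pfn << OBJ_INDEX_BITS) | obj_idx;
}

/* Olav's proposed fix: bias obj_idx by 1 so a valid handle is never 0. */
static unsigned long encode_handle_biased(unsigned long pfn, unsigned long obj_idx)
{
    return (pfn << OBJ_INDEX_BITS) | (obj_idx + 1);
}

/* Decoding subtracts the bias back out. */
static unsigned long decode_obj_idx(unsigned long handle)
{
    return (handle & ((1UL << OBJ_INDEX_BITS) - 1)) - 1;
}
```

With the bias, handle 0 is reserved as an unambiguous error value while every (pfn, obj_idx) pair, including (0, 0), still round-trips correctly.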
Re: [BUG] 3.12.0-rcX IPv6 panic
On Mon, Oct 21, 2013 at 05:52:52PM +0200, Hannes Frederic Sowa wrote: > On Mon, Oct 21, 2013 at 08:18:46AM -0500, Bob Tracy wrote: > > Actually, a regression: the 3.11 kernel is rock-solid stable on my > > Alpha. > > > > Beginning with 3.12.0-rc1, I can reliably trigger a kernel panic by > > executing the gogo6.net "gw6c" IPv6 client program. If the networking > > layer is active, an "Oops" will eventually (within a day) occur regardless > > of whether I attempt to run "gw6c". 3.12.0-rcX is stable as long as I > > leave networking completely disabled. The error has persisted up through > > -rc6. Apologies for not mentioning this earlier, but the state of my > > PWS-433au has been questionable, and I wanted to make sure I had a > > legitimate bug sighting. > > > > I'll have to transcribe the panic backtrace by hand: nothing makes it > > into any of the system logs :-(. I *can* recall that every backtrace > > I've seen thus far has included one of the skb_copy() variants near the > > top of the list (first or second function). > > Try to capture the panic via serial console. Otherwise a picture > would give us a first hint. Please watch out for lines like > skb_(over|under)_panic. I'll get a screen capture of some kind for you within the next day or so. > gw6c is a tunnel client? Can you post ip -6 tunnel ls? Assuming you meant "show [NAME]" (no "ls" option for the tunnel object), that yields the following with "gw6c" running on a 3.11.0 kernel: smirkin:~# ip -6 tunnel show sit1 sit1: any/ipv6 remote 4056:5874:: local 4500:0:0:4000:4029:: encaplimit 0 hoplimit 0 tclass 0x00 flowlabel 0x0 (flowinfo 0x) I'm running the gw6c client in gateway mode: the Alpha is my IPv6 gateway/firewall. --Bob -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[BUG] 3.12.0-rcX IPv6 panic
Actually, a regression: the 3.11 kernel is rock-solid stable on my Alpha. Beginning with 3.12.0-rc1, I can reliably trigger a kernel panic by executing the gogo6.net "gw6c" IPv6 client program. If the networking layer is active, an "Oops" will eventually (within a day) occur regardless of whether I attempt to run "gw6c". 3.12.0-rcX is stable as long as I leave networking completely disabled. The error has persisted up through -rc6. Apologies for not mentioning this earlier, but the state of my PWS-433au has been questionable, and I wanted to make sure I had a legitimate bug sighting. I'll have to transcribe the panic backtrace by hand: nothing makes it into any of the system logs :-(. I *can* recall that every backtrace I've seen thus far has included one of the skb_copy() variants near the top of the list (first or second function). --Bob -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: trace-cmd problem with FC19
On Thu, Oct 17, 2013 at 04:56:56PM -0400, Steven Rostedt wrote: > Hmm, are you sure? > > You may want to do both: > > sudo trace-cmd -v > which trace-cmd > > to see which version it is. To clarify - I ran into the referenced issue using an older, self-compiled version with a recent kernel. The FC19 distro version may be fine, for all I know. Arend says he used the latest version from the repo which should not be a problem, but just throwing that out there since the symptoms are similar. -- Bob Copeland %% www.bobcopeland.com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: trace-cmd problem with FC19
[sorry, resent for the archives, I failed email] On Thu, Oct 17, 2013 at 4:20 PM, Steven Rostedt wrote: > On Thu, 17 Oct 2013 22:07:15 +0200 > "Arend van Spriel" wrote: > > >> > Does recording other traces work? Or is it only specific to this module? >> >> It seems a generic issue: >> >> $ sudo trace-cmd record -e ext4:* ls >> /sys/kernel/debug/tracing/events/ext4/*/filter >> systemd-private-6pVB5Lsystemd-private-KdpFqS trace.dat.cpu0 >> trace.dat.cpu2 >> systemd-private-9hedRDtrace.dat trace.dat.cpu1 >> trace.dat.cpu3 >> trace-cmd: Interrupted system call >>recorder error in splice input >> trace-cmd: Interrupted system call >>recorder error in splice input >> trace-cmd: Interrupted system call >>recorder error in splice input Perhaps this is an instance of this bug? https://lkml.org/lkml/2013/5/31/426 tl;dr try with latest trace-cmd? I hit the above myself. -- Bob Copeland %% www.bobcopeland.com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: BUG: mm, numa: test segfaults, only when NUMA balancing is on
Hi Alex, On Wed, Oct 16, 2013 at 11:54 PM, Alex Thorlton wrote: > Hi guys, > > I ran into a bug a week or so ago, that I believe has something to do > with NUMA balancing, but I'm having a tough time tracking down exactly > what is causing it. When running with the following configuration > options set: > > CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y > CONFIG_NUMA_BALANCING_DEFAULT_ENABLED=y > CONFIG_NUMA_BALANCING=y > # CONFIG_HUGETLBFS is not set > # CONFIG_HUGETLB_PAGE is not set > What's your kernel version? And did you enable CONFIG_TRANSPARENT_HUGEPAGE ? > I get intermittent segfaults when running the memscale test that we've > been using to test some of the THP changes. Here's a link to the test: > > ftp://shell.sgi.com/collect/memscale/ > > I typically run the test with a line similar to this: > > ./thp_memscale -C 0 -m 0 -c <cores> -b <memory> > > Where <cores> is the number of cores to spawn threads on, and <memory> > is the amount of memory to reserve from each core. The <memory> field > can accept values like 512m or 1g, etc. I typically run 256 cores and > 512m, though I think the problem should be reproducible on anything with > 128+ cores. > > The test never seems to have any problems when running with hugetlbfs > on and NUMA balancing off, but it segfaults every once in a while with > the config options above. It seems to occur more frequently, the more > cores you run on. It segfaults on about 50% of the runs at 256 cores, > and on almost every run at 512 cores. The fewest number of cores I've > seen a segfault on has been 128, though it seems to be rare on this many > cores. > Could you please attach some logs? > At this point, I'm not familiar enough with NUMA balancing code to know > what could be causing this, and we don't typically run with NUMA > balancing on, so I don't see this in my everyday testing, but I felt > that it was definitely worth bringing up. > > If anybody has any ideas of where I could poke around to find a > solution, please let me know. 
> -- Regards, --Bob -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 2/3] mm/zswap: bugfix: memory leak when invalidate and reclaim occur concurrently
On Sat, Oct 12, 2013 at 5:14 PM, Weijie Yang wrote: > On Sat, Oct 12, 2013 at 10:50 AM, Bob Liu wrote: >> On Fri, Oct 11, 2013 at 3:13 PM, Minchan Kim wrote: >>> On Thu, Sep 26, 2013 at 11:42:17AM +0800, Weijie Yang wrote: >>>> On Tue, Sep 24, 2013 at 9:03 AM, Minchan Kim wrote: >>>> > On Mon, Sep 23, 2013 at 04:21:49PM +0800, Weijie Yang wrote: >>>> > > >>>> > > Modify: >>>> > > - check the refcount in fail path, free memory if it is not >>>> > > referenced. >>>> > >>>> > Hmm, I don't like this because the zswap refcount routine is already a mess >>>> > for me. >>>> > I'm not sure why it was designed this way from the beginning. I hope we >>>> > fix that first. >>>> > >>>> > 1. zswap_rb_search could include zswap_entry_get semantics if it finds an >>>> > entry in >>>> > the tree. Of course, we should rename it to find_get_zswap_entry like >>>> > find_get_page. >>>> > 2. zswap_entry_put could hide the resource free function, >>>> > zswap_free_entry, so that >>>> > all callers can easily use the following pattern. >>>> > >>>> > find_get_zswap_entry >>>> > ... >>>> > ... >>>> > zswap_entry_put >>>> > >>>> > Of course, zswap_entry_put has to check whether the entry is in the tree or not, >>>> > so if someone already removed it from the tree, it avoids a double >>>> > remove. >>>> > >>>> > One concern I can think of is that this approach extends the critical section, >>>> > but I think it would be no problem because the bigger bottleneck would be the >>>> > [de]compress >>>> > functions. If it were really a problem, we could mitigate it by >>>> > moving >>>> > unnecessary work out of zswap_free_entry, because that seems to be >>>> > rather >>>> > over-engineering. >>>> >>>> I refactored the zswap refcount routine according to Minchan's idea. >>>> Here is the new patch; any suggestion is welcome. >>>> >>>> To Seth and Bob, would you please review it again? >>> >>> Yeah, Seth, Bob.
You guys are the right persons to review this because this >>> scheme was suggested by me, who is biased, so it couldn't be fair. ;-) >>> But anyway, I will review the code itself. >>> >>>> >>>> mm/zswap.c | 116 >>>> >>>> 1 file changed, 52 insertions(+), 64 deletions(-) >>>> >>>> diff --git a/mm/zswap.c b/mm/zswap.c >>>> old mode 100644 >>>> new mode 100755 >>>> index deda2b6..bd04910 >>>> --- a/mm/zswap.c >>>> +++ b/mm/zswap.c >>>> @@ -217,6 +217,7 @@ static struct zswap_entry >>>> *zswap_entry_cache_alloc(gfp_t gfp) >>>> if (!entry) >>>> return NULL; >>>> entry->refcount = 1; >>>> + RB_CLEAR_NODE(&entry->rbnode); >>>> return entry; >>>> } >>>> >>>> @@ -232,10 +233,20 @@ static void zswap_entry_get(struct zswap_entry >>>> *entry) >>>> } >>>> >>>> /* caller must hold the tree lock */ >>>> -static int zswap_entry_put(struct zswap_entry *entry) >>>> +static int zswap_entry_put(struct zswap_tree *tree, struct zswap_entry >>>> *entry) >>> >>> Why should we have a return value? If we really need it, it undermines >>> the whole point of the get/put semantics, so I'd like to just return void. >>> >>> Let me see. >>> >>>> { >>>> - entry->refcount--; >>>> - return entry->refcount; >>>> + int refcount = --entry->refcount; >>>> + >>>> + if (refcount <= 0) { >>> >>> Hmm, I don't like a negative refcount, really. >>> I hope we could do the following: >>> >>> BUG_ON(refcount < 0); >>> if (refcount == 0) { >>> ... >>> } >>> >>> >>> >>>> + if (!RB_EMPTY_NODE(&entry->rbnode)) { >>>> + rb_erase(&entry->rbnode, &tree->rbroot); >>>> + RB_CLEAR_NODE(&entry->rbnode);
Re: [PATCH v3 2/3] mm/zswap: bugfix: memory leak when invalidate and reclaim occur concurrently
On Fri, Oct 11, 2013 at 3:13 PM, Minchan Kim wrote: > On Thu, Sep 26, 2013 at 11:42:17AM +0800, Weijie Yang wrote: >> On Tue, Sep 24, 2013 at 9:03 AM, Minchan Kim wrote: >> > On Mon, Sep 23, 2013 at 04:21:49PM +0800, Weijie Yang wrote: >> > > >> > > Modify: >> > > - check the refcount in fail path, free memory if it is not referenced. >> > >> > Hmm, I don't like this because the zswap refcount routine is already a mess for >> > me. >> > I'm not sure why it was designed this way from the beginning. I hope we fix >> > that first. >> > >> > 1. zswap_rb_search could include zswap_entry_get semantics if it finds an >> > entry in >> > the tree. Of course, we should rename it to find_get_zswap_entry like >> > find_get_page. >> > 2. zswap_entry_put could hide the resource free function, zswap_free_entry, >> > so that >> > all callers can easily use the following pattern. >> > >> > find_get_zswap_entry >> > ... >> > ... >> > zswap_entry_put >> > >> > Of course, zswap_entry_put has to check whether the entry is in the tree or not, >> > so if someone already removed it from the tree, it avoids a double >> > remove. >> > >> > One concern I can think of is that this approach extends the critical section, >> > but I think it would be no problem because the bigger bottleneck would be the >> > [de]compress >> > functions. If it were really a problem, we could mitigate it by moving >> > unnecessary work out of zswap_free_entry, because that seems to be rather >> > over-engineering. >> >> I refactored the zswap refcount routine according to Minchan's idea. >> Here is the new patch; any suggestion is welcome. >> >> To Seth and Bob, would you please review it again? > > Yeah, Seth, Bob. You guys are the right persons to review this because this > scheme was suggested by me, who is biased, so it couldn't be fair. ;-) > But anyway, I will review the code itself.
> >> >> mm/zswap.c | 116 >> >> 1 file changed, 52 insertions(+), 64 deletions(-) >> >> diff --git a/mm/zswap.c b/mm/zswap.c >> old mode 100644 >> new mode 100755 >> index deda2b6..bd04910 >> --- a/mm/zswap.c >> +++ b/mm/zswap.c >> @@ -217,6 +217,7 @@ static struct zswap_entry *zswap_entry_cache_alloc(gfp_t >> gfp) >> if (!entry) >> return NULL; >> entry->refcount = 1; >> + RB_CLEAR_NODE(&entry->rbnode); >> return entry; >> } >> >> @@ -232,10 +233,20 @@ static void zswap_entry_get(struct zswap_entry *entry) >> } >> >> /* caller must hold the tree lock */ >> -static int zswap_entry_put(struct zswap_entry *entry) >> +static int zswap_entry_put(struct zswap_tree *tree, struct zswap_entry >> *entry) > > Why should we have a return value? If we really need it, it undermines > the whole point of the get/put semantics, so I'd like to just return void. > > Let me see. > >> { >> - entry->refcount--; >> - return entry->refcount; >> + int refcount = --entry->refcount; >> + >> + if (refcount <= 0) { > > Hmm, I don't like a negative refcount, really. > I hope we could do the following: > > BUG_ON(refcount < 0); > if (refcount == 0) { > ... > } > > > >> + if (!RB_EMPTY_NODE(&entry->rbnode)) { >> + rb_erase(&entry->rbnode, &tree->rbroot); >> + RB_CLEAR_NODE(&entry->rbnode); > > Minor, > You could make a new function zswap_rb_del or zswap_rb_remove which detaches the node > from the rb tree and clears it, because we already have zswap_rb_insert. > > >> + } >> + >> + zswap_free_entry(tree, entry); >> + } >> + >> + return refcount; >> } >> >> /* >> @@ -258,6 +269,17 @@ static struct zswap_entry *zswap_rb_search(struct >> rb_root *root, pgoff_t offset) >> return NULL; >> } >> > > Add a function description. > >> +static struct zswap_entry *zswap_entry_find_get(struct rb_root *root, >> pgoff_t offset) >> +{ >> + struct zswap_entry *entry = NULL; >> + >> + entry = zswap_rb_search(root, offset); >> + if (entry) >> + zswap_entry_get(entry); >> + >> + return entry; >> +} >> + >> /* >> * In the case that a entry with the same offset is found, a pointer to >> * the existing entry is stored in dupentry and the function returns -EEXIST >> @@ -387,7 +409,7 @@ static void zswap_free_entry(struct zswap_tree *tree, struct zswap_entry *entry) >> enum zswap_get_swap_ret { >> ZSWAP_SWAPCACHE_NEW, >> ZSWAP_SWAPCACHE_EXIST, >> - ZSWAP_SWAPCACHE_NOMEM >> + ZSWAP_SWAPCACHE_FAIL, >> }; >> >> /* >> @@ -401,9 +423,9 @@ enum zswap_get_swap_ret { >> * added to the swap cache, and returned in retpage. >> * >> * If success, the swap cache page is returned in retpage >> - * Returns 0 if page was already in the swap cache, page is not locked >> - * Returns 1 if the new page needs to be populated, page is locked >> - * Returns 0 on error >> + * Returns ZSWAP_SWAPCACHE_EXIST if page was already in the swap cache >> + * Returns ZSWAP_SWAPCACHE_NEW if the new page needs to be populated, page is locked >> + * Returns ZSWAP_SWAPCACHE_FAIL on error >> */ >> static int zswap_get_swap_cache_page(swp_entry_t entry
Re: [PATCH v3 2/3] mm/zswap: bugfix: memory leak when invalidate and reclaim occur concurrently
On Thu, Sep 26, 2013 at 11:42 AM, Weijie Yang wrote: > On Tue, Sep 24, 2013 at 9:03 AM, Minchan Kim wrote: >> On Mon, Sep 23, 2013 at 04:21:49PM +0800, Weijie Yang wrote: >> > >> > Modify: >> > - check the refcount in fail path, free memory if it is not referenced. >> >> Hmm, I don't like this because the zswap refcount routine is already a mess for me. >> I'm not sure why it was designed this way from the beginning. I hope we fix it >> first. >> >> 1. zswap_rb_search could include zswap_entry_get semantics if it finds an >> entry in >> the tree. Of course, we should rename it to find_get_zswap_entry like >> find_get_page. >> 2. zswap_entry_put could hide the resource free function, zswap_free_entry, >> so that >> all callers can easily use the following pattern. >> >> find_get_zswap_entry >> ... >> ... >> zswap_entry_put >> >> Of course, zswap_entry_put has to check whether the entry is in the tree or not, >> so if someone already removed it from the tree, it avoids a double >> remove. >> >> One concern I can think of is that this approach extends the critical section, >> but I think it would be no problem because the bigger bottleneck would be the >> [de]compress >> functions. If it were really a problem, we could mitigate it by moving >> unnecessary work out of zswap_free_entry, because that seems to be rather >> over-engineering. > > I refactored the zswap refcount routine according to Minchan's idea. > Here is the new patch; any suggestion is welcome. > > To Seth and Bob, would you please review it again? > I have nothing in addition to Minchan's review. Since the code is a bit complex, I'd suggest you split it into two patches. [1/2]: fix the memory leak [2/2]: clean up the entry_put And run some testing. Thanks, -Bob
Re: [PATCH] frontswap: enable call to invalidate area on swapoff
On 10/09/2013 10:40 PM, Seth Jennings wrote: > On Tue, Oct 08, 2013 at 01:08:53PM -0700, Andrew Morton wrote: >> On Tue, 08 Oct 2013 10:13:20 +0200 Krzysztof Kozlowski >> wrote: >> >>> On pon, 2013-10-07 at 15:03 -0700, Andrew Morton wrote: >>>> On Mon, 07 Oct 2013 17:25:41 +0200 Krzysztof Kozlowski >>>> wrote: >>>> >>>>> During swapoff the frontswap_map was NULL-ified before calling >>>>> frontswap_invalidate_area(). However the frontswap_invalidate_area() >>>>> exits early if frontswap_map is NULL. Invalidate was never called during >>>>> swapoff. >>>>> >>>>> This patch moves frontswap_map_set() in swapoff just after calling >>>>> frontswap_invalidate_area() so outside of locks >>>>> (swap_lock and swap_info_struct->lock). This shouldn't be a problem as >>>>> during swapon the frontswap_map_set() is called also outside of any >>>>> locks. >>>>> >>>> >>>> Ahem. So there's a bunch of code in __frontswap_invalidate_area() >>>> which hasn't ever been executed and nobody noticed it. So perhaps that >>>> code isn't actually needed? >>>> >>>> More seriously, this patch looks like it enables code which hasn't been >>>> used or tested before. How well tested was this? >>>> >>>> Are there any runtime-visible effects from this change? >>> >>> I tested zswap on x86 and x86-64 and there was no difference. This is >>> good as there shouldn't be visible anything because swapoff is unusing >>> all pages anyway: >>> try_to_unuse(type, false, 0); /* force all pages to be unused */ >>> >>> I haven't tested other frontswap users. >> >> So is that code in __frontswap_invalidate_area() unneeded? > > Yes, to expand on what Bob said, __frontswap_invalidate_area() is still > needed to let any frontswap backend free per-swaptype resources. > > __frontswap_invalidate_area() is _not_ for freeing structures associated > with individual swapped out pages since all of the pages should be > brought back into memory by try_to_unuse() before > __frontswap_invalidate_area() is called. 
> > The reason we never noticed this for zswap is that zswap has no > dynamically allocated per-type resources. In the expected case, > where all of the pages have been drained from zswap, > zswap_frontswap_invalidate_area() is a no-op. > Not exactly, see the bug fix "mm/zswap: bugfix: memory leak when re-swapon" from Weijie. Zswap needs invalidate_area() also. Thanks, -Bob
Re: [PATCH] frontswap: enable call to invalidate area on swapoff
On 10/09/2013 04:08 AM, Andrew Morton wrote: > On Tue, 08 Oct 2013 10:13:20 +0200 Krzysztof Kozlowski > wrote: > >> On pon, 2013-10-07 at 15:03 -0700, Andrew Morton wrote: >>> On Mon, 07 Oct 2013 17:25:41 +0200 Krzysztof Kozlowski >>> wrote: >>> >>>> During swapoff the frontswap_map was NULL-ified before calling >>>> frontswap_invalidate_area(). However the frontswap_invalidate_area() >>>> exits early if frontswap_map is NULL. Invalidate was never called during >>>> swapoff. >>>> >>>> This patch moves frontswap_map_set() in swapoff just after calling >>>> frontswap_invalidate_area() so outside of locks >>>> (swap_lock and swap_info_struct->lock). This shouldn't be a problem as >>>> during swapon the frontswap_map_set() is called also outside of any >>>> locks. >>>> >>> >>> Ahem. So there's a bunch of code in __frontswap_invalidate_area() >>> which hasn't ever been executed and nobody noticed it. So perhaps that >>> code isn't actually needed? >>> >>> More seriously, this patch looks like it enables code which hasn't been >>> used or tested before. How well tested was this? >>> >>> Are there any runtime-visible effects from this change? >> >> I tested zswap on x86 and x86-64 and there was no difference. This is >> good as there shouldn't be visible anything because swapoff is unusing >> all pages anyway: >> try_to_unuse(type, false, 0); /* force all pages to be unused */ >> >> I haven't tested other frontswap users. > > So is that code in __frontswap_invalidate_area() unneeded? > I don't think so, it's still needed; otherwise there will be a memory leak. I'm afraid nobody noticed the memory leak here before; this patch can fix it. Sorry for not paying enough attention, but please keep __frontswap_invalidate_area(). -- Regards, -Bob
[PATCH] mm: pagevec: cleanup: drop pvec->cold argument in all places
Nobody uses the pvec->cold argument of pagevec, and it is unreasonable to release pages held in a pagevec as cold pages, so drop the cold argument from pagevec. Signed-off-by: Bob Liu --- fs/9p/cache.c | 2 +- fs/afs/cache.c | 2 +- fs/afs/write.c | 4 ++-- fs/btrfs/extent_io.c | 4 ++-- fs/cachefiles/rdwr.c | 8 fs/ceph/addr.c | 6 +++--- fs/ceph/cache.c | 2 +- fs/cifs/cache.c | 2 +- fs/ext4/file.c | 2 +- fs/ext4/inode.c | 6 +++--- fs/f2fs/checkpoint.c | 2 +- fs/f2fs/node.c | 2 +- fs/fscache/page.c | 4 ++-- fs/gfs2/aops.c | 2 +- fs/hugetlbfs/inode.c | 4 ++-- fs/nfs/fscache-index.c | 2 +- fs/nilfs2/btree.c | 2 +- fs/nilfs2/page.c | 8 fs/nilfs2/segment.c | 4 ++-- fs/xfs/xfs_aops.c | 2 +- fs/xfs/xfs_file.c | 2 +- include/linux/pagevec.h | 9 + mm/filemap.c | 2 +- mm/mlock.c | 4 ++-- mm/page-writeback.c | 2 +- mm/shmem.c | 6 +++--- mm/swap.c | 8 mm/truncate.c | 6 +++--- 28 files changed, 51 insertions(+), 58 deletions(-) diff --git a/fs/9p/cache.c b/fs/9p/cache.c index a9ea73d..c17798c 100644 --- a/fs/9p/cache.c +++ b/fs/9p/cache.c @@ -158,7 +158,7 @@ static void v9fs_cache_inode_now_uncached(void *cookie_netfs_data) pgoff_t first; int loop, nr_pages; - pagevec_init(&pvec, 0); + pagevec_init(&pvec); first = 0; for (;;) { diff --git a/fs/afs/cache.c b/fs/afs/cache.c index 577763c3..ae3fb3e 100644 --- a/fs/afs/cache.c +++ b/fs/afs/cache.c @@ -377,7 +377,7 @@ static void afs_vnode_cache_now_uncached(void *cookie_netfs_data) _enter("{%x,%x,%Lx}", vnode->fid.vnode, vnode->fid.unique, vnode->status.data_version); - pagevec_init(&pvec, 0); + pagevec_init(&pvec); first = 0; for (;;) { diff --git a/fs/afs/write.c b/fs/afs/write.c index a890db4..7ace426 100644 --- a/fs/afs/write.c +++ b/fs/afs/write.c @@ -284,7 +284,7 @@ static void afs_kill_pages(struct afs_vnode *vnode, bool error, _enter("{%x:%u},%lx-%lx", vnode->fid.vid, vnode->fid.vnode, first, last); - pagevec_init(&pv, 0); + pagevec_init(&pv); do { _debug("kill %lx-%lx", first, last); @@ -582,7 +582,7 @@ void afs_pages_written_back(struct afs_vnode *vnode, struct afs_call *call) ASSERT(wb
!= NULL); - pagevec_init(&pv, 0); + pagevec_init(&pv); do { _debug("done %lx-%lx", first, last); diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 09582b8..11deea7 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -3517,7 +3517,7 @@ int btree_write_cache_pages(struct address_space *mapping, int scanned = 0; int tag; - pagevec_init(&pvec, 0); + pagevec_init(&pvec); if (wbc->range_cyclic) { index = mapping->writeback_index; /* Start from prev offset */ end = -1; @@ -3661,7 +3661,7 @@ static int extent_write_cache_pages(struct extent_io_tree *tree, if (!igrab(inode)) return 0; - pagevec_init(&pvec, 0); + pagevec_init(&pvec); if (wbc->range_cyclic) { index = mapping->writeback_index; /* Start from prev offset */ end = -1; diff --git a/fs/cachefiles/rdwr.c b/fs/cachefiles/rdwr.c index ebaff36..aa82d15 100644 --- a/fs/cachefiles/rdwr.c +++ b/fs/cachefiles/rdwr.c @@ -160,7 +160,7 @@ static void cachefiles_read_copier(struct fscache_operation *_op) _enter("{ino=%lu}", object->backer->d_inode->i_ino); - pagevec_init(&pagevec, 0); + pagevec_init(&pagevec); max = 8; spin_lock_irq(&object->work_lock); @@ -429,7 +429,7 @@ int cachefiles_read_or_alloc_page(struct fscache_retrieval *op, op->op.flags |= FSCACHE_OP_ASYNC; op->op.processor = cachefiles_read_copier; - pagevec_init(&pagevec, 0); + pagevec_init(&pagevec); /* we assume the absence or presence of the first block is a good * enough indication for the page as a whole @@ -729,7 +729,7 @@ int cachefiles_read_or_alloc_pages(struct fscache_retrieval *op, shift = PAGE_SHIFT - inode->i_sb->s_blocksize_bits; - pagevec_init(&pagevec, 0); + pagevec_init(&pagevec); op->op.flags &= FSCACHE_OP_KEEP_FLAGS; op->op.flags |= FSCACHE_OP_ASYNC; @@ -863,7 +863,7 @@ int cachefiles_allocate_pages(struct fscache_retrieval *op, ret = cachefiles_has_space(cache, 0, *nr_pages); if (ret == 0) { - pagevec_init(&pagevec, 0); + pagevec_init(&pagevec); list_for_each_entry(page, pages, lru) { if (pagevec_add(&pagevec, page) == 0) di
[PATCH] mm: pagevec: cleanup: drop pvec->cold argument in all places
Nobody uses the pvec->cold argument of pagevec, and it is also unreasonable to release the pages held in a pagevec as cold pages, so drop the cold argument from pagevec.

Signed-off-by: Bob Liu <bob@oracle.com>
---
 fs/9p/cache.c           |    2 +-
 fs/afs/cache.c          |    2 +-
 fs/afs/write.c          |    4 ++--
 fs/btrfs/extent_io.c    |    4 ++--
 fs/cachefiles/rdwr.c    |    8 ++++----
 fs/ceph/addr.c          |    6 +++---
 fs/ceph/cache.c         |    2 +-
 fs/cifs/cache.c         |    2 +-
 fs/ext4/file.c          |    2 +-
 fs/ext4/inode.c         |    6 +++---
 fs/f2fs/checkpoint.c    |    2 +-
 fs/f2fs/node.c          |    2 +-
 fs/fscache/page.c       |    4 ++--
 fs/gfs2/aops.c          |    2 +-
 fs/hugetlbfs/inode.c    |    4 ++--
 fs/nfs/fscache-index.c  |    2 +-
 fs/nilfs2/btree.c       |    2 +-
 fs/nilfs2/page.c        |    8 ++++----
 fs/nilfs2/segment.c     |    4 ++--
 fs/xfs/xfs_aops.c       |    2 +-
 fs/xfs/xfs_file.c       |    2 +-
 include/linux/pagevec.h |    9 +--------
 mm/filemap.c            |    2 +-
 mm/mlock.c              |    4 ++--
 mm/page-writeback.c     |    2 +-
 mm/shmem.c              |    6 +++---
 mm/swap.c               |    8 ++++----
 mm/truncate.c           |    6 +++---
 28 files changed, 51 insertions(+), 58 deletions(-)

diff --git a/fs/9p/cache.c b/fs/9p/cache.c
index a9ea73d..c17798c 100644
--- a/fs/9p/cache.c
+++ b/fs/9p/cache.c
@@ -158,7 +158,7 @@ static void v9fs_cache_inode_now_uncached(void *cookie_netfs_data)
 	pgoff_t first;
 	int loop, nr_pages;
 
-	pagevec_init(&pvec, 0);
+	pagevec_init(&pvec);
 	first = 0;
 
 	for (;;) {
diff --git a/fs/afs/cache.c b/fs/afs/cache.c
index 577763c3..ae3fb3e 100644
--- a/fs/afs/cache.c
+++ b/fs/afs/cache.c
@@ -377,7 +377,7 @@ static void afs_vnode_cache_now_uncached(void *cookie_netfs_data)
 	_enter("{%x,%x,%Lx}",
 	       vnode->fid.vnode, vnode->fid.unique,
 	       vnode->status.data_version);
 
-	pagevec_init(&pvec, 0);
+	pagevec_init(&pvec);
 	first = 0;
 
 	for (;;) {
diff --git a/fs/afs/write.c b/fs/afs/write.c
index a890db4..7ace426 100644
--- a/fs/afs/write.c
+++ b/fs/afs/write.c
@@ -284,7 +284,7 @@ static void afs_kill_pages(struct afs_vnode *vnode, bool error,
 	_enter("{%x:%u},%lx-%lx",
 	       vnode->fid.vid, vnode->fid.vnode, first, last);
 
-	pagevec_init(&pv, 0);
+	pagevec_init(&pv);
 
 	do {
 		_debug("kill %lx-%lx", first, last);
@@ -582,7 +582,7 @@ void afs_pages_written_back(struct afs_vnode *vnode, struct afs_call *call)
 
 	ASSERT(wb != NULL);
 
-	pagevec_init(&pv, 0);
+	pagevec_init(&pv);
 
 	do {
 		_debug("done %lx-%lx", first, last);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 09582b8..11deea7 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3517,7 +3517,7 @@ int btree_write_cache_pages(struct address_space *mapping,
 	int scanned = 0;
 	int tag;
 
-	pagevec_init(&pvec, 0);
+	pagevec_init(&pvec);
 	if (wbc->range_cyclic) {
 		index = mapping->writeback_index; /* Start from prev offset */
 		end = -1;
@@ -3661,7 +3661,7 @@ static int extent_write_cache_pages(struct extent_io_tree *tree,
 	if (!igrab(inode))
 		return 0;
 
-	pagevec_init(&pvec, 0);
+	pagevec_init(&pvec);
 	if (wbc->range_cyclic) {
 		index = mapping->writeback_index; /* Start from prev offset */
 		end = -1;
diff --git a/fs/cachefiles/rdwr.c b/fs/cachefiles/rdwr.c
index ebaff36..aa82d15 100644
--- a/fs/cachefiles/rdwr.c
+++ b/fs/cachefiles/rdwr.c
@@ -160,7 +160,7 @@ static void cachefiles_read_copier(struct fscache_operation *_op)
 	_enter("{ino=%lu}", object->backer->d_inode->i_ino);
 
-	pagevec_init(&pagevec, 0);
+	pagevec_init(&pagevec);
 
 	max = 8;
 	spin_lock_irq(&object->work_lock);
@@ -429,7 +429,7 @@ int cachefiles_read_or_alloc_page(struct fscache_retrieval *op,
 	op->op.flags |= FSCACHE_OP_ASYNC;
 	op->op.processor = cachefiles_read_copier;
 
-	pagevec_init(&pagevec, 0);
+	pagevec_init(&pagevec);
 
 	/* we assume the absence or presence of the first block is a good
 	 * enough indication for the page as a whole
@@ -729,7 +729,7 @@ int cachefiles_read_or_alloc_pages(struct fscache_retrieval *op,
 	shift = PAGE_SHIFT - inode->i_sb->s_blocksize_bits;
 
-	pagevec_init(&pagevec, 0);
+	pagevec_init(&pagevec);
 
 	op->op.flags &= FSCACHE_OP_KEEP_FLAGS;
 	op->op.flags |= FSCACHE_OP_ASYNC;
@@ -863,7 +863,7 @@ int cachefiles_allocate_pages(struct fscache_retrieval *op,
 	ret = cachefiles_has_space(cache, 0, *nr_pages);
 	if (ret == 0) {
-		pagevec_init(&pagevec, 0);
+		pagevec_init(&pagevec);
 
 		list_for_each_entry(page, pages, lru) {
 			if (pagevec_add(&pagevec, page) == 0)
diff --git a/fs/ceph
Re: [PATCH v2 0/5] mm: migrate zbud pages
On 09/28/2013 06:00 AM, Seth Jennings wrote:

On Fri, Sep 27, 2013 at 12:16:37PM +0200, Tomasz Stanislawski wrote:

On 09/25/2013 11:57 PM, Seth Jennings wrote:

On Wed, Sep 25, 2013 at 07:09:50PM +0200, Tomasz Stanislawski wrote:

I just had an idea this afternoon to potentially kill both these birds with one stone: replace the rbtree in zswap with an address_space. Each swap type would have its own page_tree to organize the compressed objects by type and offset (a radix tree is more suited for this anyway) and a_ops that could be called by shrink_page_list() (writepage) or the migration code (migratepage). Then zbud pages could be put on the normal LRU list, maybe at the beginning of the inactive LRU so they would live for another cycle through the list, then be reclaimed in the normal way with the mapping->a_ops->writepage() pointing to a zswap_writepage() function that would decompress the pages and call __swap_writepage() on them.

This might actually do away with the explicit pool size too, as the compressed pool pages wouldn't be outside the control of the MM anymore. I'm just starting to explore this but I think it has promise.

Seth

Hi Seth,

There is a problem with the proposed idea. The radix tree used in 'struct address_space' is part of a bigger data structure. The radix tree is used to translate an offset to a page. That is ok for zswap. But struct page has a field named 'index'. The MM assumes that this index is the offset in the radix tree where one can find the page. A lot is done by the MM to sustain this consistency.

Yes, this is how it is for page cache pages. However, the MM is able to work differently with anonymous pages. In the case of an anonymous page, the mapping field points to an anon_vma struct, or, if ksm is enabled and dedup'ing the page, a private ksm tracking structure. If the anonymous page is fully unmapped and resides only in the swap cache, the page mapping is NULL. So there is precedent for the fields to mean other things.
Hi Seth,

You are right that page->mapping is NULL for pages in the swap cache, but page_mapping() is not NULL in such a case. The mapping is taken from struct address_space swapper_spaces[]. It is still an address space, and it should preserve the constraints for struct address_space. The same happens for page->index and page_index().

The question is how to mark and identify zbud pages among the other page types that will be on the LRU. There are many ways. The question is what is the best and most acceptable way. If you consider hacking, I have some ideas how an address_space could be utilized for ZBUD.

One solution would be using tags in a radix tree. Every entry in a radix tree can have a few bits assigned to it. Currently 3 bits are supported, from include/linux/fs.h:

#define PAGECACHE_TAG_DIRTY	0
#define PAGECACHE_TAG_WRITEBACK	1
#define PAGECACHE_TAG_TOWRITE	2

You could add a new bit or utilize one of the existing ones.

The other idea is to use a trick from RB trees and scatter-gather lists: using the last bits of pointers to keep some metadata. Values of 'struct page *' variables are aligned to the pointer alignment, which is 4 for 32-bit CPUs and 8 for 64-bit ones (not sure). This means that one could use the last bit of a page pointer in a radix tree to track whether a swap entry refers to the lower or the higher part of a ZBUD page. I think it is serious hacking/obfuscation, but it may work with a minimal amount of changes to the MM. Adding only (x & ~3) while extracting the page pointer is probably enough.

What do you think about this idea?

I think it is a good one. I have to say that when I first came up with the idea, I was thinking the address space would be at the zswap layer and the radix slots would hold zbud handles, not struct page pointers. However, as I have discovered today, this is problematic when it comes to reclaim and migration and serializing access.
I wanted to do as much as possible in the zswap layer, since anything done in the zbud layer would need to be duplicated in any other future allocator that zswap wanted to support.

Unfortunately, zbud abstracts away the struct page, and that visibility is needed to properly do what we are talking about. So maybe it is inevitable that this will need to be in the zbud code with the radix tree slots pointing to struct pages after all.

But in this way, zswap_frontswap_load() can't find the zswap_entry. We still need the rbtree in the current zswap.

I like the idea of masking the bit into the struct page pointer to indicate which buddy maps to the offset.

I have no idea why we need this. My idea is to connect the zbud page with an address space and add the zbud page to the LRU list only, without any radix tree. The zswap_entry can still be in the rbtree, or maybe be changed to a radix tree. There is sample code in my previous email.

--
Regards,
-Bob
Re: [BUG REPORT] ZSWAP: theoretical race condition issues
On Thu, Sep 26, 2013 at 4:48 PM, Weijie Yang wrote:
> On Thu, Sep 26, 2013 at 3:57 PM, Minchan Kim wrote:
>> On Thu, Sep 26, 2013 at 03:26:33PM +0800, Weijie Yang wrote:
>>> On Thu, Sep 26, 2013 at 1:58 PM, Minchan Kim wrote:
>>> > Hello Weijie,
>>> >
>>> > On Wed, Sep 25, 2013 at 05:33:43PM +0800, Weijie Yang wrote:
>>> >> On Wed, Sep 25, 2013 at 4:31 PM, Bob Liu wrote:
>>> >> > On Wed, Sep 25, 2013 at 4:09 PM, Weijie Yang wrote:
>>> >> >> I think I find a new issue; for the integrity of this mail thread, I reply
>>> >> >> to this mail.
>>> >> >>
>>> >> >> It is a concurrency issue as well, when duplicate store and reclaim
>>> >> >> run concurrently.
>>> >> >>
>>> >> >> zswap entry x with offset A is already stored in the zswap backend.
>>> >> >> Consider the following scenario:
>>> >> >>
>>> >> >> thread 0: reclaim entry x (get refcount, but not call
>>> >> >> zswap_get_swap_cache_page)
>>> >> >>
>>> >> >> thread 1: store new page with the same offset A, alloc a new zswap
>>> >> >> entry y. store finished. shrink_page_list() call __remove_mapping(), and now
>>> >> >> it is not in swap_cache
>>> >> >>
>>> >> > But I don't think the swap layer will call zswap with the same offset A.
>>> >>
>>> >> 1. store page of offset A in zswap
>>> >> 2. some time later, pagefault occur, load page data from zswap.
>>> >> But notice that zswap entry x is still in zswap because it is not
>>> >> frontswap_tmem_exclusive_gets_enabled.
>>> >
>>> > frontswap_tmem_exclusive_gets_enabled is just an option to trade off
>>> > CPU burning by frequent swapout against the memory footprint of a duplicate
>>> > copy in swap cache and frontswap backend, so it shouldn't affect the
>>> > stability.
>>>
>>> Thanks for explaining this.
>>> I don't mean to say this option affects the stability, but that zswap
>>> only realizes one option. Maybe it's better to realize both options for
>>> different workloads.
>>
>> "zswap only realize one option"
>> What does it mean? Sorry. I couldn't parse your intention. :)
>> You mean zswap should do something special to support
>> frontswap_tmem_exclusive_gets?
>
> Yes. But I am not sure whether it is worth it.
>
>>> >> this page is with PageSwapCache(page) and page_private(page) = entry.val
>>> >> 3. change this page data, and it becomes dirty
>>> >
>>> > If a non-shared swapin page becomes redirty, it should remove the page from
>>> > the swapcache. If a shared swapin page becomes redirty, it should do CoW, so it's a
>>> > new page that doesn't live in the swap cache. That means it should have a new
>>> > offset which is different from the old one for swap out.
>>> >
>>> > What's wrong with that?
>>>
>>> It is really not a right scene for duplicate store. And I can not think one out.
>>> If duplicate store is impossible, how about deleting the handling code in zswap?
>>> If it does exist, I think there is a potential issue as I described.
>>
>> You mean "zswap_duplicate_entry"?
>> AFAIR, I already had a question to Seth when zswap was born but AFAIRC,
>> he said that he didn't know the exact reason but he saw that case during
>> an experiment, so he copied the code piece from zcache.
>>
>> Do you see the case, too?
>
> Yes, I mean duplicate store.
> I checked Documentation/vm/frontswap.txt; it mentions "duplicate stores",
> but I am still confused.
>
> I wrote a zcache variant which swaps out compressed pages to a swapfile.
> I did see that case when I tested it on an android smartphone (arm v7),

Why not test it based on zswap directly?

My suggestion is to do more and more stress testing against the current zswap; if there is really any bug triggered or an unusual performance issue, then we dig it out and fix it.

--
Regards,
--Bob
Re: [munlock] BUG: Bad page map in process killall5 pte:cf17e720 pmd:05a22067
On Thu, Sep 26, 2013 at 2:49 PM, Vlastimil Babka wrote:
> On 09/26/2013 02:40 AM, Fengguang Wu wrote:
>> Hi Vlastimil,
>>
>> FYI, this bug seems still not fixed in linux-next 20130925.
>
> Hi,
>
> I sent (including you) an RFC patch and later a reviewed patch about a week
> ago. I assumed you would test it, but I probably should have made that
> request explicit, sorry. Anyway it was added to -mm an hour before your
> mail.

Great! And please ignore my noise in this thread.

--
Regards,
--Bob
Re: [munlock] BUG: Bad page map in process killall5 pte:cf17e720 pmd:05a22067
On Thu, Sep 26, 2013 at 9:59 AM, Fengguang Wu wrote:
> Hi Bob,
>
> On Thu, Sep 26, 2013 at 09:48:08AM +0800, Bob Liu wrote:
>> Hi Fengguang,
>>
>> Would you please have a try with the attached patch?
>> It added a small fix based on Vlastimil's patch.
>
> Thanks for the quick response! I just noticed Andrew added this patch
> to the -mm tree:
>
> --
> From: Vlastimil Babka
> Subject: mm/mlock.c: prevent walking off the end of a pagetable in no-pmd
> configuration
>
> What's the git tree your v2 patch is based on? If you already had a git

It's based on v3.12-rc1. Should I send you a new one based on the latest mmotm?

Thanks,
-Bob
Re: [munlock] BUG: Bad page map in process killall5 pte:cf17e720 pmd:05a22067
Hi Fengguang,

Would you please have a try with the attached patch?
It added a small fix based on Vlastimil's patch.

Thanks,
-Bob

On 09/26/2013 08:40 AM, Fengguang Wu wrote:
> Hi Vlastimil,
>
> FYI, this bug seems still not fixed in linux-next 20130925.
>
> commit 7a8010cd36273ff5f6fea5201ef9232f30cebbd9
> Author: Vlastimil Babka <vba...@suse.cz>
> Date:   Wed Sep 11 14:22:35 2013 -0700
>
>     mm: munlock: manual pte walk in fast path instead of follow_page_mask()
>
>     Currently munlock_vma_pages_range() calls follow_page_mask() to obtain
>     each individual struct page.  This entails repeated full page table
>     translations and page table lock taken for each page separately.
>
>     This patch avoids the costly follow_page_mask() where possible, by
>     iterating over ptes within single pmd under single page table lock.  The
>     first pte is obtained by get_locked_pte() for non-THP page acquired by the
>     initial follow_page_mask().  The rest of the on-stack pagevec for munlock
>     is filled up using pte_walk as long as pte_present() and vm_normal_page()
>     are sufficient to obtain the struct page.
>
>     After this patch, a 14% speedup was measured for munlocking a 56GB large
>     memory area with THP disabled.
>
>     Signed-off-by: Vlastimil Babka <vba...@suse.cz>
>     Cc: Jörn Engel <jo...@logfs.org>
>     Cc: Mel Gorman <mgor...@suse.de>
>     Cc: Michel Lespinasse <wal...@google.com>
>     Cc: Hugh Dickins <hu...@google.com>
>     Cc: Rik van Riel <r...@redhat.com>
>     Cc: Johannes Weiner <han...@cmpxchg.org>
>     Cc: Michal Hocko <mho...@suse.cz>
>     Cc: Vlastimil Babka <vba...@suse.cz>
>     Signed-off-by: Andrew Morton <a...@linux-foundation.org>
>     Signed-off-by: Linus Torvalds <torva...@linux-foundation.org>
>
> [   89.835504] init: plymouth-upstart-bridge main process (3556) terminated with status 1
> [   89.986606] init: tty6 main process (3529) killed by TERM signal
> [   91.414086] BUG: Bad page map in process killall5  pte:cf17e720 pmd:05a22067
> [   91.416626] addr:bfc0 vm_flags:00100173 anon_vma:cf128c80 mapping: (null) index:bfff0
> [   91.419402] CPU: 0 PID: 3574 Comm: killall5 Not tainted 3.12.0-rc1-00010-g5fbc0a6 #24
> [   91.422171] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> [   91.423998]  c0199e34 c1db5db4 c0199e54 c10e72d4 000bfff0
> [   91.427933]  bfc0 000cf17e cf17e720 c0199e74 c10e7995
> [   91.431940]  bfc0 cf1ca190 bfc0 cf18 cf1ca190 c0199ee0 c10eb8cf ce6d1900
> [   91.435894] Call Trace:
> [   91.436969]  [<c1db5db4>] dump_stack+0x4b/0x66
> [   91.438503]  [<c10e72d4>] print_bad_pte+0x14b/0x162
> [   91.440204]  [<c10e7995>] vm_normal_page+0x67/0x9b
> [   91.441811]  [<c10eb8cf>] munlock_vma_pages_range+0xf9/0x176
> [   91.443633]  [<c10ede09>] exit_mmap+0x86/0xf7
> [   91.445156]  [<c10885b8>] ? lock_release+0x169/0x1ef
> [   91.446795]  [<c113e5b6>] ? rcu_read_unlock+0x17/0x23
> [   91.448465]  [<c113effe>] ? exit_aio+0x2b/0x6c
> [   91.449990]  [<c103d4b0>] mmput+0x6a/0xcb
> [   91.451508]  [<c104141a>] do_exit+0x362/0x8be
> [   91.453013]  [<c105d280>] ? hrtimer_debug_hint+0xd/0xd
> [   91.454700]  [<c10419f8>] do_group_exit+0x51/0x9e
> [   91.456296]  [<c1041a5b>] SyS_exit_group+0x16/0x16
> [   91.457901]  [<c1dc6719>] sysenter_do_call+0x12/0x33
> [   91.459553] Disabling lock debugging due to kernel taint
>
> git bisect start 272b98c6455f00884f0350f775c5342358ebb73f v3.11 --
> git bisect  good 57d730924d5cc2c3e280af16a9306587c3a511db  # 02:21  495+  Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
> git bisect  good 3bb22ec53e2bd12a241ed84359bffd591a40ab87  # 12:03  495+  staging/lustre/ptlrpc: convert to new shrinker API
> git bisect   bad a5b7c87f92076352dbff2fe0423ec255e1c9a71b  # 12:18  31-  vmscan, memcg: do softlimit reclaim also for targeted reclaim
> git bisect  good 3d94ea51c1d8db6f41268a9d2aea5f5771e9a8d3  # 15:40  495+  ocfs2: clean up dead code in ocfs2_acl_from_xattr()
> git bisect   bad d62a201f24cba74e2fbf9f6f7af86ff5f5e276fc  # 16:46  79-  checkpatch: enforce sane perl version
> git bisect  good 83467efbdb7948146581a56cbd683a22a0684bbb  # 01:29  585+  mm: migrate: check movability of hugepage in unmap_and_move_huge_page()
> git bisect   bad 2bff24a3707093c435ab3241c47dcdb5f16e432b  # 02:07  148-  memcg: fix multiple large threshold notifications
> git bisect   bad 1ecfd533f4c528b0b4cc5bc115c4c47f0b5e4828  # 02:34  64-  mm/mremap.c: call pud_free() after fail calling pmd_alloc()
> git bisect  good 0ec3b74c7f5599c8a4d2b33d430a5470af26ebf6  # 13:10  1170+  mm: putback_lru_page: remove unnecessary call to page_lru_base_type()
> git bisect  good 5b40998ae35cf64561868370e6c9f3d3e94b6bf7  # 16:52  1170+  mm: munlock: remove redundant get_page/put_page pair on the fast path
> git bisect   bad 187320932dcece9c4b93f38f56d1f888bd5c325f  # 17:11  0-  mm/sparse: introduce alloc_usemap_and_memmap
> git bisect bad
Re: [BUG REPORT] ZSWAP: theoretical race condition issues
On Wed, Sep 25, 2013 at 5:33 PM, Weijie Yang wrote:
> On Wed, Sep 25, 2013 at 4:31 PM, Bob Liu wrote:
>> On Wed, Sep 25, 2013 at 4:09 PM, Weijie Yang wrote:
>>> I think I find a new issue; for the integrity of this mail thread, I reply
>>> to this mail.
>>>
>>> It is a concurrency issue as well, when duplicate store and reclaim
>>> run concurrently.
>>>
>>> zswap entry x with offset A is already stored in the zswap backend.
>>> Consider the following scenario:
>>>
>>> thread 0: reclaim entry x (get refcount, but not call
>>> zswap_get_swap_cache_page)
>>>
>>> thread 1: store new page with the same offset A, alloc a new zswap entry y.
>>> store finished. shrink_page_list() call __remove_mapping(), and now
>>> it is not in swap_cache
>>>
>> But I don't think the swap layer will call zswap with the same offset A.
>
> 1. store page of offset A in zswap
> 2. some time later, pagefault occur, load page data from zswap.
> But notice that zswap entry x is still in zswap because it is not

Sorry, I didn't notice that zswap_frontswap_load() doesn't call rb_erase().

> frontswap_tmem_exclusive_gets_enabled.
> this page is with PageSwapCache(page) and page_private(page) = entry.val
> 3. change this page data, and it becomes dirty
> 4. some time later again, swap this page out at the same offset A.
>
> so, a duplicate store happens.

Then I think we should erase the entry from the rbtree in zswap_frontswap_load(). After the page is decompressed and loaded from zswap, still storing the compressed data in zswap is meaningless.

--
Regards,
--Bob
Re: [BUG REPORT] ZSWAP: theoretical race condition issues
On Wed, Sep 25, 2013 at 4:09 PM, Weijie Yang wrote:
> I think I find a new issue; for the integrity of this mail thread, I reply
> to this mail.
>
> It is a concurrency issue as well, when duplicate store and reclaim
> run concurrently.
>
> zswap entry x with offset A is already stored in the zswap backend.
> Consider the following scenario:
>
> thread 0: reclaim entry x (get refcount, but not call
> zswap_get_swap_cache_page)
>
> thread 1: store new page with the same offset A, alloc a new zswap entry y.
> store finished. shrink_page_list() call __remove_mapping(), and now
> it is not in swap_cache

But I don't think the swap layer will call zswap with the same offset A.

> thread 0: zswap_get_swap_cache_page called. old page data is added to
> swap_cache
>
> Now, the swap cache has old data rather than new data for offset A.
> An error will happen if do_swap_page() gets the page from swap_cache.

--
Regards,
--Bob
Re: [munlock] BUG: Bad page map in process killall5 pte:cf17e720 pmd:05a22067
On Thu, Sep 26, 2013 at 9:59 AM, Fengguang Wu fengguang...@intel.com wrote: Hi Bob, On Thu, Sep 26, 2013 at 09:48:08AM +0800, Bob Liu wrote: Hi Fengguang, Would you please have a try with the attached patch? It added a small fix based on Vlastimil's patch. Thanks for the quick response! I just noticed Andrew added this patch to the -mm tree: -- From: Vlastimil Babka vba...@suse.cz Subject: mm/mlock.c: prevent walking off the end of a pagetable in no-pmd configuration What's the git tree your v2 patch is based on? If you already had a git It's based on v3.12-rc1. Should I send you a new one based on the latest mmotm? Thanks, -Bob
Re: [PATCH v2 0/5] mm: migrate zbud pages
On Tue, Sep 24, 2013 at 5:20 PM, Krzysztof Kozlowski wrote: > Hi, > > On pon, 2013-09-23 at 17:07 -0500, Seth Jennings wrote: >> On Tue, Sep 17, 2013 at 02:59:24PM +0800, Bob Liu wrote: >> > Mel mentioned several problems about zswap/zbud in thread "[PATCH v6 >> > 0/5] zram/zsmalloc promotion". >> > >> > Like "it's clunky as hell and the layering between zswap and zbud is >> > twisty" and "I think I brought up its stalling behaviour during review >> > when it was being merged. It would have been preferable if writeback >> > could be initiated in batches and then waited on at the very least.. >> > It's worse that it uses _swap_writepage directly instead of going >> > through a writepage ops. It would have been better if zbud pages >> > existed on the LRU and written back with an address space ops and >> > properly handled asynchonous writeback." >> > >> > So I think it would be better if we can address those issues at first >> > and it would be easier to address these issues before adding more new >> > features. Welcome any ideas. >> >> I just had an idea this afternoon to potentially kill both these birds with >> one >> stone: Replace the rbtree in zswap with an address_space. >> >> Each swap type would have its own page_tree to organize the compressed >> objects >> by type and offset (radix tree is more suited for this anyway) and a_ops that >> could be called by shrink_page_list() (writepage) or the migration code >> (migratepage). >> >> Then zbud pages could be put on the normal LRU list, maybe at the beginning >> of >> the inactive LRU so they would live for another cycle through the list, then >> be >> reclaimed in the normal way with the mapping->a_ops->writepage() pointing to >> a >> zswap_writepage() function that would decompress the pages and call >> __swap_writepage() on them. > > How exactly the address space can be used here? Do you want to point to > zbud pages in address_space.page_tree? If yes then which index should be > used? 
I didn't get the point either. I think introducing an address_space is enough. 1. zbud.c: static struct address_space_operations zbud_aops = { .writepage = zswap_write_page, }; struct address_space zbud_space = { .a_ops = &zbud_aops, }; zbud_alloc() { zbud_page = alloc_page(); zbud_page->mapping = (struct address_space *)&zbud_space; set_page_private(page, (unsigned long)pool); lru_add_anon(zbud_page); } 2. zswap.c static int zswap_writepage(struct page *page, struct writeback_control *wbc) { handle = encode_handle(page_address(page), FIRST); zswap_writeback_entry(pool, handle); handle = encode_handle(page_address(page), LAST); zswap_writeback_entry(pool, handle); } Of course it may need lots of work before the core MM subsystem can maintain zbud pages. But in this way, we can get rid of the clunky reclaiming layer and integrate zswap closely with the core MM subsystem, which knows better how many zbud pages can be used and when it should trigger zbud page reclaim. -- Regards, --Bob
Re: [PATCH v2 3/4] mm/zswap: avoid unnecessary page scanning
On 09/10/2013 12:29 AM, Seth Jennings wrote: > On Fri, Sep 06, 2013 at 01:16:45PM +0800, Weijie Yang wrote: >> add SetPageReclaim before __swap_writepage so that the page can be moved to the >> tail of the inactive list, which can avoid unnecessary page scanning as this >> page was reclaimed by the swap subsystem before. >> >> Signed-off-by: Weijie Yang > > Acked-by: Seth Jennings > Below is a reply from Mel in the original thread "[PATCHv11 3/4] zswap: add to mm/" -- > + /* start writeback */ > + SetPageReclaim(page); > + __swap_writepage(page, &wbc, end_swap_bio_write); > + page_cache_release(page); > + zswap_written_back_pages++; > + SetPageReclaim? Why? If the page is under writeback then why do you not mark it as that? Do not free pages that are currently under writeback, obviously. It's likely that it was PageWriteback you wanted in zbud.c too. -- Regards, -Bob
Re: [RFC PATCH RESEND] mm: munlock: Prevent walking off the end of a pagetable in no-pmd configuration
On 09/17/2013 10:22 PM, Vlastimil Babka wrote: > The function __munlock_pagevec_fill() introduced in commit 7a8010cd3 > ("mm: munlock: manual pte walk in fast path instead of follow_page_mask()") > uses pmd_addr_end() for restricting its operation within the current page table. > This is insufficient on architectures/configurations where pmd is folded > and pmd_addr_end() just returns the end of the full range to be walked. In > this case, it allows pte++ to walk off the end of a page table, resulting in > unpredictable behaviour. > > This patch fixes the function by using pgd_addr_end() and pud_addr_end() > before pmd_addr_end(), which will yield the correct page table boundary on all > configurations. This is similar to what existing page walkers do when walking > each level of the page table. > > Additionally, the patch clarifies a comment for the get_locked_pte() call in the > function. > > Reported-by: Fengguang Wu > Cc: Jörn Engel > Cc: Mel Gorman > Cc: Michel Lespinasse > Cc: Hugh Dickins > Cc: Rik van Riel > Cc: Johannes Weiner > Cc: Michal Hocko > Cc: Vlastimil Babka > Signed-off-by: Vlastimil Babka > --- > mm/mlock.c | 8 ++-- > 1 file changed, 6 insertions(+), 2 deletions(-) > > diff --git a/mm/mlock.c b/mm/mlock.c > index d638026..758c0fc 100644 > --- a/mm/mlock.c > +++ b/mm/mlock.c > @@ -379,10 +379,14 @@ static unsigned long __munlock_pagevec_fill(struct > pagevec *pvec, > > /* >* Initialize pte walk starting at the already pinned page where we > - * are sure that there is a pte. > + * are sure that there is a pte, as it was pinned under the same > + * mmap_sem write op. >*/ > pte = get_locked_pte(vma->vm_mm, start, &ptl); > - end = min(end, pmd_addr_end(start, end)); > + /* Make sure we do not cross the page table boundary */ > + end = pgd_addr_end(start, end); > + end = pud_addr_end(start, end); > + end = pmd_addr_end(start, end); > Nitpick, how about unfolding pmd_addr_end(start, end) directly? 
Like: --- a/mm/mlock.c +++ b/mm/mlock.c @@ -376,13 +376,14 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec, { pte_t *pte; spinlock_t *ptl; + unsigned long pmd_end = (start + PMD_SIZE) & PMD_MASK; + end = (pmd_end - 1 < end - 1) ? pmd_end : end; /* * Initialize pte walk starting at the already pinned page where we * are sure that there is a pte. */ pte = get_locked_pte(vma->vm_mm, start, &ptl); - end = min(end, pmd_addr_end(start, end)); /* The page next to the pinned page is the first we will try to get */ start += PAGE_SIZE; Anyway, Reviewed-by: Bob Liu -- Regards, -Bob
Re: [PATCH v2 0/5] mm: migrate zbud pages
Hi Krzysztof, On 09/11/2013 04:58 PM, Krzysztof Kozlowski wrote: > Hi, > > Currently zbud pages are not movable and they cannot be allocated from CMA > (Contiguous Memory Allocator) region. These patches add migration of zbud > pages. > I agree that the migration of zbud pages is important so that the system will not fall into order-0 page fragmentation, and it can be helpful for page compaction/huge pages etc. But after I looked at [patch 4/5], I found it will make zbud very complicated. I'd prefer to add this migration feature later, once the current version of zswap/zbud becomes good enough and more stable. Mel mentioned several problems about zswap/zbud in thread "[PATCH v6 0/5] zram/zsmalloc promotion". Like "it's clunky as hell and the layering between zswap and zbud is twisty" and "I think I brought up its stalling behaviour during review when it was being merged. It would have been preferable if writeback could be initiated in batches and then waited on at the very least.. It's worse that it uses _swap_writepage directly instead of going through a writepage ops. It would have been better if zbud pages existed on the LRU and written back with an address space ops and properly handled asynchonous writeback." So I think it would be better if we can address those issues first, and it would be easier to address these issues before adding more new features. Welcome any ideas. -- Regards, -Bob
Re: [PATCH 0/50] Basic scheduler support for automatic NUMA balancing V7
Hi Mel, On 09/10/2013 05:31 PM, Mel Gorman wrote: > It has been a long time since V6 of this series and time for an update. Much > of this is now stabilised with the most important addition being the inclusion > of Peter and Rik's work on grouping tasks that share pages together. > > This series has a number of goals. It reduces overhead of automatic balancing > through scan rate reduction and the avoidance of TLB flushes. It selects a > preferred node and moves tasks towards their memory as well as moving memory > toward their task. It handles shared pages and groups related tasks together. > I found that sometimes NUMA balancing will be broken after khugepaged has started, because khugepaged always allocates the huge page from the node of the first scanned normal page during collapsing. A simple use case is when a user runs his application interleaving all nodes using "numactl --interleave=all ". But after khugepaged starts, most pages of his application will be located on only one specific node. I have a simple patch fixing this issue in thread: [PATCH 2/2] mm: thp: khugepaged: add policy for finding target node I think this may be related to this topic; I don't know whether this series can also fix the issue I mentioned. -- Regards, -Bob
Re: [RFC PATCH 1/4] zbud: use page ref counter for zbud pages
Hi Krzysztof, On 08/30/2013 04:42 PM, Krzysztof Kozlowski wrote: > Use page reference counter for zbud pages. The ref counter replaces > zbud_header.under_reclaim flag and ensures that zbud page won't be freed > when zbud_free() is called during reclaim. It allows implementation of > additional reclaim paths. > > The page count is incremented when: > - a handle is created and passed to zswap (in zbud_alloc()), > - user-supplied eviction callback is called (in zbud_reclaim_page()). > > Signed-off-by: Krzysztof Kozlowski > Signed-off-by: Tomasz Stanislawski > Reviewed-by: Bob Liu AFAIR, the previous version you sent out had a function called rebalance_lists() which I think is a good cleanup. But I didn't see that function anymore in this version. Thanks, -Bob > --- > mm/zbud.c | 97 > + > 1 file changed, 52 insertions(+), 45 deletions(-) > > diff --git a/mm/zbud.c b/mm/zbud.c > index ad1e781..aa9a15c 100644 > --- a/mm/zbud.c > +++ b/mm/zbud.c > @@ -109,7 +109,6 @@ struct zbud_header { > struct list_head lru; > unsigned int first_chunks; > unsigned int last_chunks; > - bool under_reclaim; > }; > > /* > @@ -138,16 +137,9 @@ static struct zbud_header *init_zbud_page(struct page > *page) > zhdr->last_chunks = 0; > INIT_LIST_HEAD(&zhdr->buddy); > INIT_LIST_HEAD(&zhdr->lru); > - zhdr->under_reclaim = 0; > return zhdr; > } > > -/* Resets the struct page fields and frees the page */ > -static void free_zbud_page(struct zbud_header *zhdr) > -{ > - __free_page(virt_to_page(zhdr)); > -} > - > /* > * Encodes the handle of a particular buddy within a zbud page > * Pool lock should be held as this function accesses first|last_chunks > @@ -188,6 +180,31 @@ static int num_free_chunks(struct zbud_header *zhdr) > return NCHUNKS - zhdr->first_chunks - zhdr->last_chunks - 1; > } > > +/* > + * Increases ref count for zbud page. 
> + */ > +static void get_zbud_page(struct zbud_header *zhdr) > +{ > + get_page(virt_to_page(zhdr)); > +} > + > +/* > + * Decreases ref count for zbud page and frees the page if it reaches 0 > + * (no external references, e.g. handles). > + * > + * Returns 1 if page was freed and 0 otherwise. > + */ > +static int put_zbud_page(struct zbud_header *zhdr) > +{ > + struct page *page = virt_to_page(zhdr); > + if (put_page_testzero(page)) { > + free_hot_cold_page(page, 0); > + return 1; > + } > + return 0; > +} > + > + /* > * API Functions > */ > @@ -250,7 +267,7 @@ void zbud_destroy_pool(struct zbud_pool *pool) > int zbud_alloc(struct zbud_pool *pool, int size, gfp_t gfp, > unsigned long *handle) > { > - int chunks, i, freechunks; > + int chunks, i; > struct zbud_header *zhdr = NULL; > enum buddy bud; > struct page *page; > @@ -273,6 +290,7 @@ int zbud_alloc(struct zbud_pool *pool, int size, gfp_t > gfp, > bud = FIRST; > else > bud = LAST; > + get_zbud_page(zhdr); > goto found; > } > } > @@ -284,6 +302,10 @@ int zbud_alloc(struct zbud_pool *pool, int size, gfp_t > gfp, > return -ENOMEM; > spin_lock(&pool->lock); > pool->pages_nr++; > + /* > + * We will be using zhdr instead of page, so > + * don't increase the page count. 
> + */ > zhdr = init_zbud_page(page); > bud = FIRST; > > @@ -295,7 +317,7 @@ found: > > if (zhdr->first_chunks == 0 || zhdr->last_chunks == 0) { > /* Add to unbuddied list */ > - freechunks = num_free_chunks(zhdr); > + int freechunks = num_free_chunks(zhdr); > list_add(&zhdr->buddy, &pool->unbuddied[freechunks]); > } else { > /* Add to buddied list */ > @@ -326,7 +348,6 @@ found: > void zbud_free(struct zbud_pool *pool, unsigned long handle) > { > struct zbud_header *zhdr; > - int freechunks; > > spin_lock(&pool->lock); > zhdr = handle_to_zbud_header(handle); > @@ -337,26 +358,19 @@ void zbud_free(struct zbud_pool *pool, unsigned long > handle) > else > zhdr->first_chunks = 0; > > - if (zhdr->under_reclaim) { > - /* zbud page is under reclaim, reclaim will free */ > - spin_unlock(&pool->lock); > - return; > - } &
Re: [PATCH v2 4/4] mm/zswap: use GFP_NOIO instead of GFP_KERNEL
On 09/06/2013 01:16 PM, Weijie Yang wrote: > To avoid zswap store and reclaim functions being called recursively, > use GFP_NOIO instead of GFP_KERNEL > The reason for using GFP_KERNEL in the writeback path is that we want to try our best to move those pages from zswap to the real swap device. I think it would be better to keep the GFP_KERNEL flag but find some other way to skip zswap/zswap_frontswap_store() if zswap writeback is in progress. What I can think of currently is adding a mutex to zswap, taking that mutex when zswap writeback happens and checking the mutex in zswap_frontswap_store(). > Signed-off-by: Weijie Yang > --- > mm/zswap.c |6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > diff --git a/mm/zswap.c b/mm/zswap.c > index cc40e6a..3d05ed8 100644 > --- a/mm/zswap.c > +++ b/mm/zswap.c > @@ -427,7 +427,7 @@ static int zswap_get_swap_cache_page(swp_entry_t entry, >* Get a new page to read into from swap. >*/ > if (!new_page) { > - new_page = alloc_page(GFP_KERNEL); > + new_page = alloc_page(GFP_NOIO); > if (!new_page) > break; /* Out of memory */ > } > @@ -435,7 +435,7 @@ static int zswap_get_swap_cache_page(swp_entry_t entry, > /* >* call radix_tree_preload() while we can wait. >*/ > - err = radix_tree_preload(GFP_KERNEL); > + err = radix_tree_preload(GFP_NOIO); > if (err) > break; > > @@ -636,7 +636,7 @@ static int zswap_frontswap_store(unsigned type, pgoff_t > offset, > } > > /* allocate entry */ > - entry = zswap_entry_cache_alloc(GFP_KERNEL); > + entry = zswap_entry_cache_alloc(GFP_NOIO); > if (!entry) { > zswap_reject_kmemcache_fail++; > ret = -ENOMEM; > -- Regards, -Bob -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 2/4] mm/zswap: bugfix: memory leak when invalidate and reclaim occur concurrently
On 09/06/2013 01:16 PM, Weijie Yang wrote: > Consider the following scenario: > thread 0: reclaim entry x (get refcount, but not call > zswap_get_swap_cache_page) > thread 1: call zswap_frontswap_invalidate_page to invalidate entry x. > finished, entry x and its zbud are not freed as its refcount != 0 > now, the swap_map[x] = 0 > thread 0: now call zswap_get_swap_cache_page > swapcache_prepare return -ENOENT because entry x is not used any more > zswap_get_swap_cache_page return ZSWAP_SWAPCACHE_NOMEM > zswap_writeback_entry do nothing except put refcount > Now, the memory of zswap_entry x and its zpage leak. > > Modify: > - check the refcount in the fail path, free memory if it is not referenced. > - use ZSWAP_SWAPCACHE_FAIL instead of ZSWAP_SWAPCACHE_NOMEM as the fail path > can be caused not only by nomem but also by invalidate. > > Signed-off-by: Weijie Yang Reviewed-by: Bob Liu > --- > mm/zswap.c | 21 + > 1 file changed, 13 insertions(+), 8 deletions(-) > > diff --git a/mm/zswap.c b/mm/zswap.c > index cbd9578..1be7b90 100644 > --- a/mm/zswap.c > +++ b/mm/zswap.c > @@ -387,7 +387,7 @@ static void zswap_free_entry(struct zswap_tree *tree, > struct zswap_entry *entry) > enum zswap_get_swap_ret { > ZSWAP_SWAPCACHE_NEW, > ZSWAP_SWAPCACHE_EXIST, > - ZSWAP_SWAPCACHE_NOMEM > + ZSWAP_SWAPCACHE_FAIL, > }; > > /* > @@ -401,9 +401,9 @@ enum zswap_get_swap_ret { > * added to the swap cache, and returned in retpage. 
> * > * If success, the swap cache page is returned in retpage > - * Returns 0 if page was already in the swap cache, page is not locked > - * Returns 1 if the new page needs to be populated, page is locked > - * Returns <0 on error > + * Returns ZSWAP_SWAPCACHE_EXIST if page was already in the swap cache > + * Returns ZSWAP_SWAPCACHE_NEW if the new page needs to be populated, page > is locked > + * Returns ZSWAP_SWAPCACHE_FAIL on error > */ > static int zswap_get_swap_cache_page(swp_entry_t entry, > struct page **retpage) > @@ -475,7 +475,7 @@ static int zswap_get_swap_cache_page(swp_entry_t entry, > if (new_page) > page_cache_release(new_page); > if (!found_page) > - return ZSWAP_SWAPCACHE_NOMEM; > + return ZSWAP_SWAPCACHE_FAIL; > *retpage = found_page; > return ZSWAP_SWAPCACHE_EXIST; > } > @@ -529,11 +529,11 @@ static int zswap_writeback_entry(struct zbud_pool > *pool, unsigned long handle) > > /* try to allocate swap cache page */ > switch (zswap_get_swap_cache_page(swpentry, &page)) { > - case ZSWAP_SWAPCACHE_NOMEM: /* no memory */ > + case ZSWAP_SWAPCACHE_FAIL: /* no memory or invalidate happened */ > ret = -ENOMEM; > goto fail; > > - case ZSWAP_SWAPCACHE_EXIST: /* page is unlocked */ > + case ZSWAP_SWAPCACHE_EXIST: > /* page is already in the swap cache, ignore for now */ > page_cache_release(page); > ret = -EEXIST; > @@ -591,7 +591,12 @@ static int zswap_writeback_entry(struct zbud_pool *pool, > unsigned long handle) > > fail: > spin_lock(&tree->lock); > - zswap_entry_put(entry); > + refcount = zswap_entry_put(entry); > + if (refcount <= 0) { > + /* invalidate happened, consider writeback as success */ > + zswap_free_entry(tree, entry); > + ret = 0; > + } > spin_unlock(&tree->lock); > return ret; > } > -- Regards, -Bob
Re: [PATCH v2 3/4] mm/zswap: avoid unnecessary page scanning
On 09/06/2013 01:16 PM, Weijie Yang wrote: > add SetPageReclaim before __swap_writepage so that the page can be moved to the > tail of the inactive list, which avoids unnecessary page scanning as this > page was already reclaimed by the swap subsystem. > > Signed-off-by: Weijie Yang Reviewed-by: Bob Liu > --- > mm/zswap.c |3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/mm/zswap.c b/mm/zswap.c > index 1be7b90..cc40e6a 100644 > --- a/mm/zswap.c > +++ b/mm/zswap.c > @@ -556,6 +556,9 @@ static int zswap_writeback_entry(struct zbud_pool *pool, > unsigned long handle) > SetPageUptodate(page); > } > > + /* move it to the tail of the inactive list after end_writeback */ > + SetPageReclaim(page); > + > /* start writeback */ > __swap_writepage(page, &wbc, end_swap_bio_write); > page_cache_release(page); > -- Regards, -Bob
Re: [PATCH v2 1/4] mm/zswap: bugfix: memory leak when re-swapon
On 09/06/2013 01:16 PM, Weijie Yang wrote: > zswap_tree is not freed on swapoff, and it gets re-kmalloc'ed on swapon, > so a memory leak occurs. > > Modify: free the memory of zswap_tree in zswap_frontswap_invalidate_area(). > > Signed-off-by: Weijie Yang Reviewed-by: Bob Liu > --- > mm/zswap.c |4 > 1 file changed, 4 insertions(+) > > diff --git a/mm/zswap.c b/mm/zswap.c > index deda2b6..cbd9578 100644 > --- a/mm/zswap.c > +++ b/mm/zswap.c > @@ -816,6 +816,10 @@ static void zswap_frontswap_invalidate_area(unsigned > type) > } > tree->rbroot = RB_ROOT; > spin_unlock(&tree->lock); > + > + zbud_destroy_pool(tree->pool); > + kfree(tree); > + zswap_trees[type] = NULL; > } > > static struct zbud_ops zswap_zbud_ops = { > -- Regards, -Bob
Re: [PATCH] x86: e820: fix memmap kernel boot parameter
On Fri, Aug 30, 2013 at 2:14 PM, Dan Aloni wrote: > On Fri, Aug 30, 2013 at 01:47:53PM +0800, Bob Liu wrote: >>[..] >> Machine2: bootcmdline in grub.cfg "memmap=0x77ff$0x88000", the >> result of >> "cat /proc/cmdline" changed to "memmap=0x77ffx88000". >> >> I didn't find the root cause; I think maybe grub treats "$0" as something special. >> Replacing '$' with '%' in the kernel boot parameter can fix this issue. > > You are correct about the root cause; however, I don't think the patch is needed. > > In order to bypass grub's variable evaluation you can simply use escaping > and replace $ with \$ in your grub config. > I see, thank you very much! -- Regards, --Bob
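Grub's variable evaluation behaves like shell parameter expansion here, so the mangling can be reproduced with bash (forcing `$0` to `/bin/bash` to match the report; this is an analogy, not grub itself):

```shell
# "$0" inside double quotes is expanded as a special parameter, so
# "800M$0x60bfdfff" becomes "800M" + $0 + "x60bfdfff".
bash -c 'echo "memmap=800M$0x60bfdfff"' /bin/bash
# prints: memmap=800M/bin/bashx60bfdfff

# Escaping the dollar sign keeps the parameter intact:
bash -c 'echo "memmap=800M\$0x60bfdfff"' /bin/bash
# prints: memmap=800M$0x60bfdfff
```

This matches Dan's advice: `\$` in grub.cfg suppresses the expansion, so no kernel-side change is strictly required.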
[PATCH] x86: e820: fix memmap kernel boot parameter
Kernel boot parameter memmap=nn[KMG]$ss[KMG] is used to mark specific memory as reserved. The region of memory to be used is from ss to ss+nn. But I found that this parameter does not behave as expected. I tried it on two machines. Machine1: bootcmdline in grub.cfg "memmap=800M$0x60bfdfff", but the result of "cat /proc/cmdline" changed to "memmap=800M/bin/bashx60bfdfff" after the system booted. Machine2: bootcmdline in grub.cfg "memmap=0x77ff$0x88000", the result of "cat /proc/cmdline" changed to "memmap=0x77ffx88000". I didn't find the root cause; I think maybe grub treats "$0" as something special. Replacing '$' with '%' in the kernel boot parameter can fix this issue. Signed-off-by: Bob Liu --- Documentation/kernel-parameters.txt |6 +++--- arch/x86/kernel/e820.c |2 +- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 7f9d4f5..a96c7b1 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -1604,13 +1604,13 @@ bytes respectively. Such letter suffixes can also be entirely omitted. [KNL,ACPI] Mark specific memory as ACPI data. Region of memory to be used, from ss to ss+nn. - memmap=nn[KMG]$ss[KMG] + memmap=nn[KMG]%ss[KMG] [KNL,ACPI] Mark specific memory as reserved. Region of memory to be used, from ss to ss+nn. 
Example: Exclude memory from 0x1869-0x1869 -memmap=64K$0x1869 +memmap=64K%0x1869 or -memmap=0x1$0x1869 +memmap=0x1%0x1869 memory_corruption_check=0/1 [X86] Some BIOSes seem to corrupt the first 64k of diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c index d32abea..8483d45 100644 --- a/arch/x86/kernel/e820.c +++ b/arch/x86/kernel/e820.c @@ -869,7 +869,7 @@ static int __init parse_memmap_one(char *p) } else if (*p == '#') { start_at = memparse(p+1, &p); e820_add_region(start_at, mem_size, E820_ACPI); - } else if (*p == '$') { + } else if (*p == '%') { start_at = memparse(p+1, &p); e820_add_region(start_at, mem_size, E820_RESERVED); } else -- 1.7.10.4
Re: [PATCH v7 0/5] zram/zsmalloc promotion
Hi Minchan, On Thu, Aug 22, 2013 at 8:42 AM, Minchan Kim wrote: > Hi Bob, > > On Wed, Aug 21, 2013 at 05:24:00PM +0800, Bob Liu wrote: >> Hi Minchan, >> >> On 08/21/2013 02:16 PM, Minchan Kim wrote: >> > It's the 7th trial of zram/zsmalloc promotion. >> > I rewrote the cover-letter totally based on previous discussion. >> > >> > The main reason to prevent zram promotion was no review of the >> > zsmalloc part while Jens, the block maintainer, already acked the >> > zram part. >> > >> > At that time, zsmalloc was used for zram, zcache and zswap so >> > everybody wanted to make it general and at last, Mel reviewed it >> > when zswap was submitted for merging to mainline a few months ago. >> > Most of the review was related to the zswap writeback mechanism which >> > can pageout compressed pages in memory to real swap storage >> > at runtime, and the conclusion was that zsmalloc isn't good for >> > zswap writeback, so zswap borrowed the zbud allocator from zcache to >> > replace zsmalloc. The zbud is bad for memory compression ratio (2) >> > but it has very predictable behavior because we can expect a zpage >> > includes two pages as maximum. Other reviews were not major. >> > http://lkml.indiana.edu/hypermail/linux/kernel/1304.1/04334.html >> > >> > Zcache doesn't use zsmalloc either so zsmalloc's only user is >> > zram now, so this patchset moves it into the zsmalloc directory. >> > Recently, Bob tried to move zsmalloc under the mm directory to unify >> > zram and zswap by adding a pseudo block device to zswap (it's >> > very weird to me) but he was simply ignoring zram's block device >> > (a.k.a zram-blk) feature and considered only the swap usecase of zram; >> > in turn, it loses zram's good concept. >> > >> >> Yes, I didn't notice the feature that zram can be used as a normal block >> device. >> >> >> > Mel raised another issue in v6, "maintenance headache". 
>> > He claimed zswap and zram have a similar goal, that is to compress >> > swap pages, so if we promote zram, a maintenance headache happens >> > sometime by diverging implementations between zswap and zram, >> > so that he wants to unify zram and zswap. For it, he wants zswap >> > to implement a pseudo block device like Bob did to emulate zram so >> > zswap can have the advantage of writeback as well as zram's benefit. >> >> If we consider zram as a swap-only device, I still think it's better to add >> a pseudo block device to zswap and just disable the writeback of zswap. > > Why do you think zswap is better? > In my opinion: 1. It's easy for zswap to do the same thing by adding a few small changes. 2. zswap won't go through the block layer, which avoids a lot of overhead. 3. zswap is transparent to current users who are using a normal block device as the swap device. > I don't know but when I read http://lwn.net/Articles/334649/, it aimed > for compressing page caches as well as swap pages but it made widespread > hooks in core (I guess that's why zcache had a birth later by Nitin and Dan) > so reviewers guided him to support anon pages only to merge it. > And at that time, it was a specific virtual block device for only supporting > swap. AFAIR, akpm suggested making it a general block device so other parties > can have a benefit. > > You can type "zram tmp" in google and will find many articles related > to using zram as tmp, and I have received some questions/reports I see. But I think if shmem is used as tmp, the pages can be reclaimed under memory pressure, get to zswap and be compressed as well. Mel also pointed out a situation where using zram as tmpfs may make things worse. > from anonymous guys by private mail. And Jorome, a Redhat guy, has > contributed to that part, like partial I/O. > I am not going to block zram being promoted, just a different voice. 
-- Regards, --Bob
Re: [PATCH v7 0/5] zram/zsmalloc promotion
On 08/21/2013 05:24 PM, Bob Liu wrote: > Hi Minchan, > > On 08/21/2013 02:16 PM, Minchan Kim wrote: >> It's the 7th trial of zram/zsmalloc promotion. >> I rewrote the cover-letter totally based on previous discussion. >> >> The main reason to prevent zram promotion was no review of the >> zsmalloc part while Jens, the block maintainer, already acked the >> zram part. >> >> At that time, zsmalloc was used for zram, zcache and zswap so >> everybody wanted to make it general and at last, Mel reviewed it >> when zswap was submitted for merging to mainline a few months ago. >> Most of the review was related to the zswap writeback mechanism which >> can pageout compressed pages in memory to real swap storage >> at runtime, and the conclusion was that zsmalloc isn't good for >> zswap writeback, so zswap borrowed the zbud allocator from zcache to >> replace zsmalloc. The zbud is bad for memory compression ratio (2) >> but it has very predictable behavior because we can expect a zpage >> includes two pages as maximum. Other reviews were not major. >> http://lkml.indiana.edu/hypermail/linux/kernel/1304.1/04334.html >> >> Zcache doesn't use zsmalloc either so zsmalloc's only user is >> zram now, so this patchset moves it into the zsmalloc directory. >> Recently, Bob tried to move zsmalloc under the mm directory to unify >> zram and zswap by adding a pseudo block device to zswap (it's >> very weird to me) but he was simply ignoring zram's block device >> (a.k.a zram-blk) feature and considered only the swap usecase of zram; >> in turn, it loses zram's good concept. >> > > Yes, I didn't notice the feature that zram can be used as a normal block > device. > > >> Mel raised another issue in v6, "maintenance headache". >> He claimed zswap and zram have a similar goal, that is to compress >> swap pages, so if we promote zram, a maintenance headache happens >> sometime by diverging implementations between zswap and zram, >> so that he wants to unify zram and zswap. 
For it, he wants zswap >> to implement a pseudo block device like Bob did to emulate zram so >> zswap can have the advantage of writeback as well as zram's benefit. > > If we consider zram as a swap-only device, I still think it's better to add > a pseudo block device to zswap and just disable the writeback of zswap. > > But I have no idea about zram's block device feature. > BTW: I think the original/main purpose for which zram was introduced is swapping. Are there any real users using zram as a normal block device instead of swap? For normal usage, maybe we can extend ramdisk with a compression feature. -- Regards, -Bob
Re: [PATCH v7 0/5] zram/zsmalloc promotion
Hi Minchan, On 08/21/2013 02:16 PM, Minchan Kim wrote: > It's the 7th trial of zram/zsmalloc promotion. > I rewrote the cover letter totally based on previous discussion. > > The main reason the zram promotion was held back was that the zsmalloc part had no review, while Jens, the block maintainer, had already acked the zram part. > > At that time, zsmalloc was used by zram, zcache and zswap, so everybody wanted to make it general, and at last Mel reviewed it when zswap was submitted for mainline merge a few months ago. > Most of the review concerned the zswap writeback mechanism, which can page out compressed pages in memory to real swap storage at runtime, and the conclusion was that zsmalloc isn't good for zswap writeback, so zswap borrowed the zbud allocator from zcache to replace zsmalloc. zbud is bad for memory compression ratio (2:1 at best) but its behavior is very predictable because a zpage includes at most two pages. Other reviews were not major. > http://lkml.indiana.edu/hypermail/linux/kernel/1304.1/04334.html > > Zcache doesn't use zsmalloc either, so zsmalloc's only user is zram now, so this patchset moves it into the zsmalloc directory. > Recently, Bob tried to move zsmalloc under the mm directory to unify zram and zswap by adding a pseudo block device to zswap (it's very weird to me), but he was simply ignoring zram's block device (a.k.a. zram-blk) feature and considered only the swap use case of zram; in turn, it loses zram's good concept. > Yes, I didn't notice the feature that zram can be used as a normal block device. > Mel raised another issue in v6, "maintenance headache". > He claimed zswap and zram have a similar goal, which is to compress swap pages, so if we promote zram, a maintenance headache happens at some point as the implementations of zswap and zram diverge, so he wants to unify zram and zswap. For that, he wants zswap > to implement a pseudo block device like Bob did to emulate zram, so > zswap can have the advantage of writeback as well as zram's benefits. 
If we consider zram as a swap device only, I still think it's better to add a pseudo block device to zswap and just disable zswap's writeback. But I have no idea about zram's block device feature. > But I wonder whether frontswap-based zswap writeback is really a good > approach from a writeback POV. I think that problem isn't only > specific to zswap. If we want to configure a multi-level swap hierarchy > with devices of various speeds such as RAM, NVRAM, SSD, eMMC, NAS etc., > it would be a general problem. So we should think of a more general > approach. At a glance, I can see two approaches. > > First, the VM could be aware of a heterogeneous swap configuration > so it could aim for being able to configure a cache hierarchy > among swap devices. It may need an indirection layer on swap, which > was already discussed, so the VM can migrate a block from > A to B easily. It could support various configurations with the VM's > hints, maybe, in future. > http://lkml.indiana.edu/hypermail/linux/kernel/1203.3/03812.html > > Second, as a more practical solution, we could use the device mapper, e.g. > dm-cache (https://lwn.net/Articles/540996/), which makes it very > flexible. Now, it supports various configurations and cache policies > (block size, writeback/writethrough, LRU, MFU, although only MQ is merged > now), so it would be a good fit for our purpose. It can even make zram > support writeback. I tested the following scenarios > in a KVM guest with 4 CPUs and 1G DRAM, with a background 800M memory hogger, which > allocates random data up to 800M. > > 1) zram swap disk 1G, untar kernel.tgz to tmpfs, build -j 4 >Failed to untar due to shortage of memory space caused by the tmpfs default size limit > > 2) zram swap disk 1G, untar kernel.tgz to ext2 on zram-blk, build -j 4 >OOM happens while building the kernel, but the untar succeeds >on ext2 based on zram-blk. The reason OOM happened is that zram cannot find >free pages in main memory to store swapped-out pages although plenty of empty >swap space remains. 
> > 3) dm-cache swap disk 1G, untar kernel.tgz to ext2 on zram-blk, build -j 4 >dm-cache consists of zram-meta 10M, zram-cache 1G and real swap storage 1G >No OOM happens and the build completes successfully. > > The above tests prove that zram can support writeback to real swap storage > so that zram-cache can always have free space. If necessary, we could > add a new plugin in dm-cache. I see it's a really flexible and well-layered > architecture, so zram-blk's concept is good for us and it has lots of > potential to be enhanced by MM/FS/block developers. > That's an exciting direction! -- Regards, -Bob
Re: [PATCH 4/4] mm: zswap: create a pseudo device /dev/zram0
On 08/20/2013 01:46 AM, Seth Jennings wrote: > On Sun, Aug 18, 2013 at 04:40:49PM +0800, Bob Liu wrote: >> This is used to replace the previous zram. >> zram users can enable this feature; then a pseudo device will be created >> automatically after kernel boot. >> Just use "mkswap /dev/zram0; swapon /dev/zram0" to use it as a swap disk. >> >> The size of this pseudo device is controlled by the zswap boot parameter >> zswap.max_pool_percent: >> disksize = (totalram_pages * zswap.max_pool_percent/100)*PAGE_SIZE. > > This /dev/zram0 will behave nothing like the block device that zram > creates. It only allows reads/writes to the first PAGE_SIZE area of the > device, for mkswap to work, and then doesn't do anything for all other > accesses. Yes, all the other data should be stored in the zswap pool and doesn't need to go through the block layer. > > I guess if you disabled zswap writeback, then... it would somewhat be > the same thing. We do need to disable zswap writeback in this case so > that zswap doesn't decompress a ton of pages into the swapcache for > writebacks that will just fail. Since zsmalloc does not yet support the > reclaim functionality, zswap writeback is implicitly disabled. > Yes, ZSWAP_PSEUDO_BLKDEV depends on zsmalloc, and if using zsmalloc as the allocator then writeback is disabled (not implemented and no requirement). > But this is really weird conceptually since zswap is a caching layer > that uses frontswap. If a frontswap store fails, it will try to send > the page to the zram0 device, which will fail the write. Then the page That's a problem. We should disable sending the page to zram0 if the frontswap store fails, and return failure just as if the swap device were full. > will be... put back on the active or inactive list? > > Also, using max_pool_percent in calculating the pseudo-device size > isn't right. Right now, the code makes the device the max size of the > _compressed_ pool, but the underlying swap device size is in > _uncompressed_ pages. 
So you'll never be able to fill zswap when sizing the > device like this, unless every page is so incompressible that > each compressed page effectively uses a full memory pool page, in > which case the user shouldn't be using memory compression. > > This also means that this hasn't been tested in the zswap pool-is-full > case, since there is no way, in this code, to hit that case. Yes, but in my understanding there is no need to trigger this path. It's the same with zram. E.g. create /dev/zram0 with a disksize (e.g. 100M); then mm-core will store at most ~100M of uncompressed pages to /dev/zram0. But the real memory spent storing those pages depends on the compression ratio. It's rare that zram will need 100M of real memory. > > In the zbud case the expected compression is 2:1, so you could just > multiply the compressed pool size by 2 and get a good pseudo-device > size. With zsmalloc the expected compression is harder to determine > since it can achieve very high effective compression ratios on highly > compressible pages. > Some users can know the compression ratio of their workloads even when using zsmalloc. -- Regards, -Bob
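Seth's sizing objection can be seen with quick arithmetic. The following is an illustrative model only (the 1G RAM figure and 20% pool limit are assumptions for the example, not values from the thread):

```python
# Back-of-the-envelope model of the pseudo-device sizing argument above.
# All numbers are illustrative assumptions, not taken from any real system.

PAGE_SIZE = 4096
totalram_pages = 262144           # assume a machine with 1 GiB of RAM
max_pool_percent = 20             # assumed zswap.max_pool_percent value

# The patch sizes the pseudo-device from the *compressed* pool limit:
pool_limit_pages = totalram_pages * max_pool_percent // 100
disksize = pool_limit_pages * PAGE_SIZE        # bytes of swap exposed

# But swap capacity is counted in *uncompressed* pages, so the device
# accepts at most pool_limit_pages uncompressed pages, while the pool
# (at zbud's ~2:1 ratio) could actually hold about twice that many:
zbud_capacity_pages = pool_limit_pages * 2

print(disksize // (1 << 20))      # 204 (MiB of pseudo-device)
print(pool_limit_pages)           # pages the device can accept
print(zbud_capacity_pages)        # pages the pool could hold at 2:1
```

This is why the pool can never fill when sized this way, and why multiplying by the expected compression ratio (2 for zbud) gives a more sensible device size.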
Re: [PATCH 3/4] mm: zswap: add supporting for zsmalloc
On 08/20/2013 12:59 AM, Seth Jennings wrote: > On Sun, Aug 18, 2013 at 04:40:48PM +0800, Bob Liu wrote: >> Make zswap able to use zsmalloc as its allocator. >> But note that zsmalloc doesn't do any mandatory reclaim of zswap pool pages; if the zswap >> pool gets full, frontswap_store will be refused unless a frontswap_get happened >> and freed some space. >> >> The reason for not implementing reclaim of zsmalloc pages from the zswap pool is that there >> is no requirement currently. >> If we want to do mandatory reclaim, we have to write those pages to real backend >> swap devices. But most current users of zsmalloc are from the embedded world; >> often there is not even a real backend swap device. >> This behavior is also the same as the previous zram! >> >> For areas where zsmalloc's unpredictable performance characteristics when >> reclaiming a single page matter, CONFIG_ZBUD is suggested. > > Looking at this patch on its own, it does show how simple it could be > for zswap to support zsmalloc. So thanks! > > However, I don't like all the ifdefs scattered everywhere. I'd like to > have an ops structure (e.g. struct zswap_alloc_ops) instead and just > switch ops based on the CONFIG flag. Or better yet, have it boot-time > selectable instead of build-time. > I don't like the ifdefs either. But I didn't find a better way to replace them since the data structures and APIs of zbud and zsmalloc are different. I can take a try using zswap_alloc_ops. -- Regards, -Bob
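The ops-table idea Seth suggests can be sketched quickly. This is a Python stand-in for the kernel C (a `struct` of function pointers); the `zbud`/`zsmalloc` backends and their alloc/free signatures below are illustrative stand-ins, not the real kernel APIs:

```python
# Toy sketch of the "struct zswap_alloc_ops" indirection: one ops table
# per allocator, selected once at boot/init time instead of compile-time
# #ifdefs scattered through the code. Backend internals are stand-ins.
from types import SimpleNamespace

def make_backend(label):
    store = {}                       # handle -> recorded allocation size
    def alloc(size):
        handle = len(store) + 1      # hand out opaque integer handles
        store[handle] = size
        return handle
    def free(handle):
        del store[handle]
    return SimpleNamespace(name=label, alloc=alloc, free=free)

# Both allocators expose the same ops surface...
backends = {"zbud": make_backend("zbud"), "zsmalloc": make_backend("zsmalloc")}

# ...so "boot-time selection" is just picking one table (no #ifdefs):
zswap_ops = backends["zsmalloc"]
h = zswap_ops.alloc(300)             # callers only ever go through the ops
zswap_ops.free(h)
print(zswap_ops.name)                # prints: zsmalloc
```

The point of the pattern is that the differing zbud/zsmalloc data structures stay hidden behind one call surface, which is exactly what the scattered ifdefs fail to do.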
Re: [PATCH v6 0/5] zram/zsmalloc promotion
Hi Luigi, On 08/19/2013 01:29 PM, Luigi Semenzato wrote: > > We are gearing up to evaluate zswap, but we have only ported kernels > up to 3.8 to our hardware, so we may be missing important patches. > > In our experience, and with all due respect, the Linux MM is a complex > beast, and it's difficult to predict how hard it will be for us to > switch to zswap. Even with the relatively simple zram, our load I think it will be easy if zswap can also create a pseudo block device (I already did a simple implementation in "[PATCH 0/4] mm: merge zram into zswap"); then it's transparent to original zram users. > triggered bugs in other parts of the MM that took a fair amount of > work to resolve. > > I may be wrong, but the in-memory compressed block device implemented > by zram seems like a simple device which uses a well-established API > to the rest of the kernel. If it is removed from the kernel, will it > be difficult for us to carry our own patch? Because we may have to do > that for a while. Of course we would prefer it if it stayed in, at > least temporarily. > > Also, could someone confirm or deny that the maximum compression ratio > in zbud is 2? Because we easily achieve a 2.6-2.8 compression ratio > with our loads using zram with zsmalloc and LZO or snappy. Losing > that memory will cause a noticeable regression, which will encourage > us to stick with zram. > > I am hoping that our load is not so unusual that we are the only Linux > users in this situation, and that zsmalloc (or another > allocator-compressor with similar characteristics) will continue to > exist, whether it is used by zram or zswap. > > Thanks! > -- Regards, -Bob
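Luigi's regression worry can be quantified with simple arithmetic. The 1 GiB figure below is an illustrative assumption; the ratios (2.8 for zsmalloc on his workload, 2.0 for zbud's best case) come from his message:

```python
# Rough arithmetic behind the compression-ratio concern above.
# For the same amount of swapped-out data, a lower compression ratio
# means more RAM is spent holding the compressed pages.

swapped_mb = 1024                     # assume 1 GiB of pages in the store

mem_at_zsmalloc = swapped_mb / 2.8    # ratio Luigi reports with zsmalloc+LZO
mem_at_zbud = swapped_mb / 2.0        # zbud's best-case 2:1

extra_mb = mem_at_zbud - mem_at_zsmalloc
print(round(extra_mb))                # 146 (MiB more RAM needed with zbud)
```

So per gigabyte swapped, the zbud ceiling costs on the order of 150 MiB of extra RAM versus the ratio Luigi measures, which is why the regression would be noticeable on memory-constrained devices.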
Re: [PATCH 0/4] mm: merge zram into zswap
Hi Minchan, On 08/19/2013 12:10 PM, Minchan Kim wrote: > On Sun, Aug 18, 2013 at 04:40:45PM +0800, Bob Liu wrote: >> Both zswap and zram are used to compress anon pages in memory so as to reduce >> swap IO operations. The main difference is that zswap uses zbud as its >> allocator >> while zram uses zsmalloc. The other difference is that zram creates a block >> device, which the user needs to mkswap and swapon. >> >> Minchan has already tried to promote zram/zsmalloc into drivers/block/, but it >> may >> increase maintenance headaches. Since the purposes of zswap and zram are >> the same, this patch series tries to merge them together, as Mel suggested: >> drop zram from staging and extend zswap with the same features as zram. >> >> zswap todo: >> Improve the writeback of zswap pool pages! >> >> Bob Liu (4): >> drivers: staging: drop zram and zsmalloc > > Bob, I feel you're very rude and I'm really upset. > > You're just dropping a subsystem you didn't do anything for, without any > consensus > from those who have been contributing lots of patches to make it work well for a long > time. I apologize for that; at least I should have added [RFC] in the patch title! -- Regards, -Bob
Re: [PATCH v6 0/5] zram/zsmalloc promotion
Hi Minchan, On 08/19/2013 11:18 AM, Minchan Kim wrote: > Hello Mel, > > On Fri, Aug 16, 2013 at 09:33:47AM +0100, Mel Gorman wrote: >> On Fri, Aug 16, 2013 at 01:26:41PM +0900, Minchan Kim wrote: >>>>>> >>>>>> If it's used for something like tmpfs then it becomes much worse. Normal >>>>>> tmpfs without swap can lock up if tmpfs is allowed to fill memory. In a >>>>>> sane configuration, lockups will be avoided and deleting a tmpfs file is >>>>>> guaranteed to free memory. When zram is used to back tmpfs, there is no >>>>>> guarantee that any memory is freed, due to fragmentation of the compressed >>>>>> pages. The only way to recover the memory may be to kill applications >>>>>> holding tmpfs files open and then delete them, which is fairly drastic >>>>>> action in a normal server environment. >>>>> >>>>> Indeed. >>>>> Actually, I had a plan to support zsmalloc compaction. zsmalloc exposes >>>>> a handle instead of a pure pointer, so it could migrate some zpages >>>>> somewhere >>>>> to pack them in. Then, it could help with the above problem and the OOM storm problem. >>>>> Anyway, it's a totally new feature and requires many changes and experiments. >>>>> Although we don't have such a feature, zram is still good for many people. >>>>> >>>> >>>> And if zsmalloc were pluggable for zswap then it would also benefit. >>> >>> But zswap isn't a pseudo block device so it couldn't be used as a block device. >> >> It would not be impossible to write one. Taking a quick look, it might even >> be doable by just providing a zbud_ops that does not have an evict handler >> and making sure the errors are handled correctly. I.e. does the following >> patch mean that zswap never writes back and instead just compresses pages >> in memory? 
>>
>> diff --git a/mm/zswap.c b/mm/zswap.c
>> index deda2b6..99e41c8 100644
>> --- a/mm/zswap.c
>> +++ b/mm/zswap.c
>> @@ -819,7 +819,6 @@ static void zswap_frontswap_invalidate_area(unsigned type)
>>  }
>>
>>  static struct zbud_ops zswap_zbud_ops = {
>> -	.evict = zswap_writeback_entry
>>  };
>>
>>  static void zswap_frontswap_init(unsigned type)
>>
>> If so, it should be doable to link that up in a sane way so it can be >> configured at runtime. >> >> Did you ever even try something like this? > > Never. Because I didn't have such a requirement for zram. > >> >>> Let's say one use case for zram-blk. >>> >>> 1) Many embedded systems don't have swap, so although tmpfs can support >>> swapout, >>> it's still pointless there; such systems should have a sane configuration to limit >>> memory space, so it's not only a zram problem. >>> >> >> If zswap was backed by a pseudo device that failed all writes, or an >> ops with no evict handler, then it would be functionally similar. >> >>> 2) Many embedded systems don't have enough memory. Let's assume a short-lived >>> file growing up to half of system memory once in a while. We don't want >>> to write it to flash because of wear-leveling issues and its slowness, so we want to keep it >>> in memory; but if we use tmpfs, it would have to evict half of the working set to cover >>> it when the size reaches its peak. zram would be the better choice. >>> >> >> Then back it by a pseudo device that fails all writes so it does not have >> to write to disk. > > You mean "make a pseudo block device, register a make_request_fn > and prevent writeback". Bah, yes, it's doable, but how is that different from > the following? > > 1) move zbud into zram > 2) implement the frontswap API in zram > 3) implement writeback in zram > > zram has been in staging for a long time waiting to be promoted and has been > maintained/deployed. Of course, I have asked for the promotion several times > over more than a year. > > Why can't zram include zswap's functions if you really want to merge them? 
> Is there any problem? I think merging zram into zswap and merging zswap into zram are the same thing; there's no difference. Either way finally results in a solution with a zram block device, the frontswap API, etc. The difference is just the name and the merge patch title, which I think is unimportant. I've implemented a series, "[PATCH 0/4] mm: merge zram into zswap"; I can change the title to "merge zswap into zram" if you want, and rename zswap to something like zhybrid. -- Regards, -Bob
Re: [BUG REPORT] ZSWAP: theoretical race condition issues
Hi Weijie, On 08/19/2013 12:14 AM, Weijie Yang wrote: > I found a few bugs in zswap when I reviewed Linux-3.11-rc5, and I also > have some questions about it, described as follows: > > BUG: > 1. A race condition when reclaiming a page > When a handle is allocated from zbud, zbud considers this handle to be > validly in use by the upper layer (zswap) and a candidate for reclaim. > But zswap still has to initialize it, such as setting the swap entry and adding > it to the rbtree, so there is a race condition, such as: > thread 0: obtain handle x from zbud_alloc > thread 1: zbud_reclaim_page is called > thread 1: callback zswap_writeback_entry to reclaim handle x > thread 1: get the swap entry from handle x (it is a random value now) > thread 1: bad things may happen > thread 0: initialize handle x with the swap entry Yes, this may potentially happen, but only in rare cases: because we have an LRU list for page frames, after thread 0 calls zbud_alloc the corresponding page is added to the head of the LRU list, while zbud_reclaim_page (called by thread 1) starts from the tail of the LRU list. > Of course, this situation almost never happens; it is a "theoretical > race condition" issue. > > 2. Polluting swapcache data by adding a pre-invalidated swap page > When a swap entry is invalidated, it will be reused by another anon > page. At the same time, zswap may be reclaiming the old page and, as a result, pollute the > swapcache of the new page, because the old page and the new page use > the same swap entry, such as: > thread 1: zswap reclaims entry x > thread 0: zswap_frontswap_invalidate_page entry x > thread 0: entry x reused by another anon page > thread 1: adds old data to the swapcache of entry x I didn't get your idea here; why would thread 1 add old data to entry x? > thread 0: swapcache of entry x is polluted > Of course, this situation almost never happens; it is another > "theoretical race condition" issue. > > 3.
Frontswap uses the frontswap_map bitmap to track pages in the "backend" > implementation; when zswap reclaims a > page, the corresponding bitmap record is not cleared. > That's true, but I don't think it's a big problem. It only wastes a little time searching the rbtree during zswap_frontswap_load(). > 4. zswap_tree is not freed at swapoff, and it gets re-kzalloc'd at > swapon, so a memory leak occurs. Nice catch! I think it should be freed in zswap_frontswap_invalidate_area(). > > Questions: > 1. How about SetPageReclaim before __swap_writepage, so that the page moves to > the tail of the inactive list? It is added to the inactive list now. > 2. zswap uses the GFP_KERNEL flag to allocate things in the store and reclaim > functions; can this lead to these functions being called recursively? Yes, that's a potential problem. > 3. To reclaim one zbud page which contains two buddies, zswap > needs to allocate two pages. Is this reclaim cost-efficient? > Yes, that's a problem too, and that's why we use zbud as the default allocator instead of zsmalloc. I think improving the writeback path of zswap is the next important step for zswap. -- Regards, -Bob
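To make the cost in question 3 concrete, here is a small userspace arithmetic sketch (illustrative only — the helper names are invented, and this is not the kernel code): zbud holds at most two compressed pages ("buddies") per physical page, and reclaiming a zbud page that holds two buddies needs two freshly allocated pages to decompress the buddies into before writeback:

```c
#include <assert.h>

/* Userspace arithmetic sketch of the zbud trade-off: density is capped
 * at two compressed objects per physical page regardless of how well
 * pages compress, and reclaiming one zbud pool page holding N buddies
 * needs N transient page allocations for decompression. */

/* pool pages zbud needs for nr_objs compressed objects (2 per page max) */
static int zbud_pool_pages(int nr_objs)
{
    return (nr_objs + 1) / 2;
}

/* transient page allocations needed to reclaim one zbud pool page */
static int zbud_reclaim_allocs(int buddies_in_page)
{
    return buddies_in_page;      /* one page per buddy being written back */
}
```

So freeing a single full zbud page costs two page allocations up front, which is exactly the cost-efficiency concern raised above.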
[PATCH 4/4] mm: zswap: create a pseudo device /dev/zram0
This is used to replace the previous zram. zram users can enable this feature, and a pseudo device will then be created automatically after kernel boot. Just use "mkswap /dev/zram0; swapon /dev/zram0" to use it as a swap disk. The size of this pseudo device is controlled by the zswap boot parameter zswap.max_pool_percent: disksize = (totalram_pages * zswap.max_pool_percent/100)*PAGE_SIZE. Signed-off-by: Bob Liu --- mm/Kconfig | 12 mm/zswap.c | 196 2 files changed, 208 insertions(+) diff --git a/mm/Kconfig b/mm/Kconfig index d80a575..3778026 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -525,6 +525,18 @@ choice be refused unless frontswap_get happened and freed some space. endchoice +config ZSWAP_PSEUDO_BLKDEV + bool "Emulate a pseudo blk-dev based on zswap (previous zram)" + depends on ZSWAP && ZSMALLOC + default n + + help + Enabling this option will emulate a pseudo block swap device /dev/zram0 + with size zswap.max_pool_percent of total RAM size. All writes to this + block device will be compressed and cached by zswap; as a result no + real disk IO operations will happen. + This feature can be used to replace drivers/staging/zram. + config MEM_SOFT_DIRTY bool "Track memory changes" depends on CHECKPOINT_RESTORE && HAVE_ARCH_SOFT_DIRTY diff --git a/mm/zswap.c b/mm/zswap.c index 8e8dc99..ae73c9d 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -38,6 +38,11 @@ #include #else #include +#ifdef CONFIG_ZSWAP_PSEUDO_BLKDEV +#include +#include +#include +#endif #endif #include #include @@ -968,6 +973,189 @@ static int __init zswap_debugfs_init(void) static void __exit zswap_debugfs_exit(void) { } #endif +#ifdef CONFIG_ZSWAP_PSEUDO_BLKDEV +#define SECTOR_SHIFT 9 +#define SECTOR_SIZE (1 << SECTOR_SHIFT) +#define SECTORS_PER_PAGE_SHIFT (PAGE_SHIFT - SECTOR_SHIFT) +#define SECTORS_PER_PAGE (1 << SECTORS_PER_PAGE_SHIFT) + +struct zram { + struct rw_semaphore lock; /* protect concurrent reads and writes */ + struct request_queue *queue; + struct gendisk *disk; + + /* +* This is the disk size for userland.
The size is controlled by +* the boot parameter zswap.max_pool_percent. +* disksize = (totalram_pages * zswap.max_pool_percent/100)*PAGE_SIZE +*/ + u64 disksize; /* bytes */ + + /* +* This page is used to store real data for /dev/zram. +* The only meaningful operations on /dev/zramX are mkswap and swapon/swapoff. +* So use one page to store the real data (written by mkswap). +*/ + struct page *metapage; +}; + +/* + * Only create /dev/zram0; this can be extended in future if there are real use cases + * needing multiple zram devices. + */ +static struct zram zram_device; +static const struct block_device_operations zram_devops = { + .owner = THIS_MODULE +}; + +static void update_position(u32 *index, int *offset, struct bio_vec *bvec) +{ + if (*offset + bvec->bv_len >= PAGE_SIZE) + (*index)++; + *offset = (*offset + bvec->bv_len) % PAGE_SIZE; +} + +static void zram_make_request(struct request_queue *queue, struct bio *bio) +{ + u32 index; + struct bio_vec *bvec; + unsigned char *src, *dst; + int offset, i, rw = bio_data_dir(bio); + struct zram *zram = queue->queuedata; + + index = bio->bi_sector >> SECTORS_PER_PAGE_SHIFT; + offset = (bio->bi_sector & (SECTORS_PER_PAGE - 1)) << SECTOR_SHIFT; + + bio_for_each_segment(bvec, bio, i) { + /* +* The only operations on the pseudo /dev/zramX are mkswap and +* swapon/swapoff, so we only need one extra page to store the +* real metadata! +*/ + BUG_ON(bvec->bv_len != PAGE_SIZE); + BUG_ON(offset); + + if (!index) { + if (rw == READ) { + down_read(&zram->lock); + dst = kmap_atomic(bvec->bv_page); + src = kmap_atomic(zram->metapage); + memcpy(dst, src, bvec->bv_len); + kunmap_atomic(dst); + kunmap_atomic(src); + flush_dcache_page(bvec->bv_page); + up_read(&zram->lock); + } else { + down_write(&zram->lock); + src = kmap_atomic(bvec->bv_page); + dst = kmap_atomic(zram->metapage); + memcpy(dst, src, bvec->bv_len); + kunmap_atomic(dst); + kunmap_atomic(src); +
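The sector arithmetic at the top of zram_make_request() above can be checked in userspace. A minimal sketch, assuming 512-byte sectors and 4K pages (the constants mirror the patch; the function names are invented for the example):

```c
#include <assert.h>
#include <stdint.h>

/* Userspace check of the sector-to-page math used in zram_make_request():
 * a 512-byte sector number is converted to a page-sized index plus a
 * byte offset within that page. Assumes 4K pages. */

#define SECTOR_SHIFT 9
#define PAGE_SHIFT_ 12                       /* assumed 4K page size */
#define SECTORS_PER_PAGE_SHIFT (PAGE_SHIFT_ - SECTOR_SHIFT)
#define SECTORS_PER_PAGE (1 << SECTORS_PER_PAGE_SHIFT)

static uint32_t sector_to_index(uint64_t sector)
{
    return (uint32_t)(sector >> SECTORS_PER_PAGE_SHIFT);
}

static int sector_to_offset(uint64_t sector)
{
    return (int)((sector & (SECTORS_PER_PAGE - 1)) << SECTOR_SHIFT);
}
```

Since the pseudo device is only ever touched by mkswap/swapon, the patch can then BUG_ON any segment that is not exactly one page-aligned page.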
[PATCH 3/4] mm: zswap: add support for zsmalloc
Make zswap able to use zsmalloc as its allocator. But note that zsmalloc does not do any mandatory reclaim of zswap pool pages: if the zswap pool gets full, frontswap_store will be refused unless a frontswap_get happened and freed some space. The reason for not implementing reclaim of zsmalloc pages from the zswap pool is that there is currently no requirement for it. If we wanted mandatory reclaim, we would have to write those pages to real backend swap devices; but most current users of zsmalloc are from the embedded world, where there is often no real backend swap device at all. This behaviour is also the same as the previous zram! Where zsmalloc's unpredictable performance characteristics when reclaiming a single page are a concern, CONFIG_ZBUD is suggested instead. Signed-off-by: Bob Liu --- include/linux/zsmalloc.h |1 + mm/Kconfig |4 +++ mm/zsmalloc.c|9 -- mm/zswap.c | 73 +++--- 4 files changed, 81 insertions(+), 6 deletions(-) diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h index fbe6bec..72fc126 100644 --- a/include/linux/zsmalloc.h +++ b/include/linux/zsmalloc.h @@ -39,5 +39,6 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle, void zs_unmap_object(struct zs_pool *pool, unsigned long handle); u64 zs_get_total_size_bytes(struct zs_pool *pool); +u64 zs_get_pool_size(struct zs_pool *pool); #endif diff --git a/mm/Kconfig b/mm/Kconfig index 48d1786..d80a575 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -519,6 +519,10 @@ choice in order to reduce fragmentation and has high compression density. However, this results in a unpredictable performance characteristics when reclaiming a single page. + + Note: With zsmalloc, there is no support for mandatory reclaiming from + the compressed memory pool. If the pool gets full, frontswap_store will + be refused unless frontswap_get happened and freed some space.
endchoice config MEM_SOFT_DIRTY diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c index 4bb275b..9df8d25 100644 --- a/mm/zsmalloc.c +++ b/mm/zsmalloc.c @@ -78,8 +78,7 @@ #include #include #include - -#include "zsmalloc.h" +#include /* * This must be power of 2 and greater than or equal to sizeof(link_free). @@ -1056,6 +1055,12 @@ u64 zs_get_total_size_bytes(struct zs_pool *pool) } EXPORT_SYMBOL_GPL(zs_get_total_size_bytes); +u64 zs_get_pool_size(struct zs_pool *pool) +{ + return zs_get_total_size_bytes(pool) >> PAGE_SHIFT; +} +EXPORT_SYMBOL_GPL(zs_get_pool_size); + module_init(zs_init); module_exit(zs_exit); diff --git a/mm/zswap.c b/mm/zswap.c index deda2b6..8e8dc99 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -34,8 +34,11 @@ #include #include #include +#ifdef CONFIG_ZBUD #include - +#else +#include +#endif #include #include #include @@ -189,7 +192,11 @@ struct zswap_header { struct zswap_tree { struct rb_root rbroot; spinlock_t lock; +#ifdef CONFIG_ZBUD struct zbud_pool *pool; +#else + struct zs_pool *pool; +#endif }; static struct zswap_tree *zswap_trees[MAX_SWAPFILES]; @@ -374,12 +381,21 @@ static bool zswap_is_full(void) */ static void zswap_free_entry(struct zswap_tree *tree, struct zswap_entry *entry) { +#ifdef CONFIG_ZBUD zbud_free(tree->pool, entry->handle); +#else + zs_free(tree->pool, entry->handle); +#endif zswap_entry_cache_free(entry); atomic_dec(&zswap_stored_pages); +#ifdef CONFIG_ZBUD zswap_pool_pages = zbud_get_pool_size(tree->pool); +#else + zswap_pool_pages = zs_get_pool_size(tree->pool); +#endif } +#ifdef CONFIG_ZBUD /* * writeback code **/ @@ -595,6 +611,7 @@ fail: spin_unlock(&tree->lock); return ret; } +#endif /* * frontswap hooks @@ -620,11 +637,22 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset, /* reclaim space if needed */ if (zswap_is_full()) { zswap_pool_limit_hit++; +#ifdef CONFIG_ZBUD if (zbud_reclaim_page(tree->pool, 8)) { zswap_reject_reclaim_fail++; ret = -ENOMEM; goto reject; } +#else + /* +* zsmalloc has unpredictable performance
+* characteristics when reclaiming, so we don't support +* mandatory reclaiming from zsmalloc +*/ + zswap_reject_reclaim_fail++; + ret = -ENOMEM; + goto reject; +#endif } /* allocate entry */ @@ -647,8 +675,9 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset, /* store */ len = dlen + sizeof(struct zswap_header); +#ifdef CONFIG_ZBUD ret = zbud_alloc(tree->pool, len, __GFP_NORETRY | __GFP_NOWARN, - &handle);
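The accounting this patch adds can be sketched in userspace: zs_get_pool_size() is just a bytes-to-pages conversion, and the pool-full check compares that page count against a percentage of total RAM. This is a model with invented names and an assumed 4K page size, not the kernel code:

```c
#include <assert.h>
#include <stdint.h>

/* Userspace sketch of the pool accounting: convert the allocator's
 * byte total to pages, then test it against the configured share of
 * total RAM, in the style of zswap's pool-limit check. */

#define PAGE_SHIFT_ 12                        /* assumed 4K page size */

static uint64_t pool_size_pages(uint64_t total_size_bytes)
{
    return total_size_bytes >> PAGE_SHIFT_;   /* bytes -> pages */
}

/* nonzero when the pool exceeds max_pool_percent of total RAM */
static int pool_is_full(uint64_t pool_pages, uint64_t totalram_pages,
                        unsigned int max_pool_percent)
{
    return totalram_pages * max_pool_percent / 100 < pool_pages;
}
```

With the zsmalloc backend there is no reclaim to fall back on, so hitting this limit means the store is refused outright, as the #ifdef branch above shows.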
[PATCH 2/4] mm: promote zsmalloc to mm/
zsmalloc is a new slab-based memory allocator for storing compressed pages. It is designed for low fragmentation and a high allocation success rate on large, but <= PAGE_SIZE, allocations. zsmalloc differs from the kernel slab allocator in two primary ways to achieve these design goals. zsmalloc never requires high-order page allocations to back slabs, or "size classes" in zsmalloc terms. Instead it allows multiple single-order pages to be stitched together into a "zspage" which backs the slab. This allows for a higher allocation success rate under memory pressure. Also, zsmalloc allows objects to span page boundaries within the zspage. This allows for lower fragmentation than could be had with the kernel slab allocator for objects between PAGE_SIZE/2 and PAGE_SIZE. With the kernel slab allocator, if a page compresses to 60% of its original size, the memory savings gained through compression are lost in fragmentation because another object of the same size can't be stored in the leftover space. This ability to span pages results in zsmalloc allocations not being directly addressable by the user. The user is given a non-dereferenceable handle in response to an allocation request. That handle must be mapped, using zs_map_object(), which returns a pointer to the mapped region that can be used. The mapping is necessary since the object data may reside in two different noncontiguous pages.
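The handle-based calling convention described above can be modeled in userspace C. This sketch only mirrors the API shape — an opaque handle that must be mapped before the memory can be touched — not the real allocator; the *_model names and the tiny table-backed implementation are inventions of the example:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Userspace model of the zsmalloc API *shape*: zs_malloc() hands back
 * an opaque handle rather than a pointer, and the object must be
 * mapped before use, since the real allocator may split the object
 * across two noncontiguous pages. No bounds checking: sketch only. */

struct zobj { void *data; size_t size; };

static struct zobj table[16];        /* handle is just an index here */
static unsigned long next_handle = 1;

static unsigned long zs_malloc_model(size_t size)
{
    unsigned long h = next_handle++;
    table[h].data = malloc(size);
    table[h].size = size;
    return h;                        /* NOT dereferenceable by the caller */
}

static void *zs_map_object_model(unsigned long handle)
{
    return table[handle].data;       /* only now is there a usable pointer */
}

static void zs_unmap_object_model(unsigned long handle)
{
    (void)handle;                    /* real code may copy back / unmap here */
}
```

The point of the indirection is exactly what the commit message says: because an object may straddle a page boundary inside a zspage, the allocator must be free to relocate or stitch pages, so callers hold handles, not pointers.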
Signed-off-by: Bob Liu --- include/linux/zsmalloc.h | 43 ++ mm/Kconfig | 35 +- mm/Makefile |1 + mm/zsmalloc.c| 1063 ++ 4 files changed, 1131 insertions(+), 11 deletions(-) create mode 100644 include/linux/zsmalloc.h create mode 100644 mm/zsmalloc.c diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h new file mode 100644 index 000..fbe6bec --- /dev/null +++ b/include/linux/zsmalloc.h @@ -0,0 +1,43 @@ +/* + * zsmalloc memory allocator + * + * Copyright (C) 2011 Nitin Gupta + * + * This code is released using a dual license strategy: BSD/GPL + * You can choose the license that better fits your requirements. + * + * Released under the terms of 3-clause BSD License + * Released under the terms of GNU General Public License Version 2.0 + */ + +#ifndef _ZS_MALLOC_H_ +#define _ZS_MALLOC_H_ + +#include + +/* + * zsmalloc mapping modes + * + * NOTE: These only make a difference when a mapped object spans pages + */ +enum zs_mapmode { + ZS_MM_RW, /* normal read-write mapping */ + ZS_MM_RO, /* read-only (no copy-out at unmap time) */ + ZS_MM_WO /* write-only (no copy-in at map time) */ +}; + +struct zs_pool; + +struct zs_pool *zs_create_pool(gfp_t flags); +void zs_destroy_pool(struct zs_pool *pool); + +unsigned long zs_malloc(struct zs_pool *pool, size_t size); +void zs_free(struct zs_pool *pool, unsigned long obj); + +void *zs_map_object(struct zs_pool *pool, unsigned long handle, + enum zs_mapmode mm); +void zs_unmap_object(struct zs_pool *pool, unsigned long handle); + +u64 zs_get_total_size_bytes(struct zs_pool *pool); + +#endif diff --git a/mm/Kconfig b/mm/Kconfig index 8028dcc..48d1786 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -478,21 +478,10 @@ config FRONTSWAP If unsure, say Y to enable frontswap. -config ZBUD - tristate - default n - help - A special purpose allocator for storing compressed pages. - It is designed to store up to two compressed pages per physical - page. 
While this design limits storage density, it has simple and - deterministic reclaim properties that make it preferable to a higher - density approach when reclaim will be used. - config ZSWAP bool "Compressed cache for swap pages (EXPERIMENTAL)" depends on FRONTSWAP && CRYPTO=y select CRYPTO_LZO - select ZBUD default n help A lightweight compressed cache for swap pages. It takes @@ -508,6 +497,30 @@ config ZSWAP they have not be fully explored on the large set of potential configurations and workloads that exist. +choice + prompt "Select memory allocator for compressed pages" + depends on ZSWAP + default ZBUD + + config ZBUD + bool "zbud" + help + A special purpose allocator for storing compressed pages. + It is designed to store up to two compressed pages per physical + page. While this design limits storage density, it has simple and + deterministic reclaim properties that make it preferable to a higher + density approach when reclaim will be used. + + config ZSMALLOC + bool "zsmalloc" + help + zsmalloc is a slab-based memory allocator designed to store + compressed RAM pages. zsmalloc uses virtual memory mapping + in order to r