Re: mm/zswap: change to writethrough

2013-11-12 Thread Bob Liu

On 11/12/2013 03:12 AM, Dan Streetman wrote:
> Seth, have you (or anyone else) considered making zswap a writethrough
> cache instead of writeback?  I think that it would significantly help
> the case where zswap fills up and starts writing back its oldest pages
> to disc - all the decompression work would be avoided since zswap
> could just evict old pages and forget about them, and it seems likely
> that when zswap is full that's probably the worst time to add extra
> work/delay, while adding extra disc IO (presumably using dma) before
> zswap is full doesn't seem to me like it would have much impact,
> except in the case where zswap isn't full but there is so little free
> memory that new allocs are waiting on swap-out.
> 
> Besides the additional disc IO that obviously comes with making zswap
> writethrough (additional only before zswap fills up), are there any
> other disadvantages?  Is it a common situation for there to be no
> memory left and get_free_page actively waiting on swap-out, but before
> zswap fills up?
> 
> Making it writethrough also could open up other possible improvements,
> like making the compression and storage of new swap-out pages async,
> so the compression doesn't delay the write out to disc.
> 

I like this idea and those benefits; the only question I'm not sure about
is whether it would be too complicated to implement. It sounds like we
would need to reimplement something like the swap cache to handle zswap
writethrough.
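
For illustration, one crude way to prototype writethrough on top of the
existing frontswap hook, without a new swap-cache-like layer, would be to
keep storing the compressed copy but still report the store as failed, so
that swap_writepage() writes the page to the swap device as well. This is
only a sketch of the idea; zswap_store_compressed() below is a hypothetical
stand-in for zswap's existing compress-and-insert path, not real code:

/* Hypothetical stand-in for the existing compress + zbud + rbtree path. */
static int zswap_store_compressed(unsigned type, pgoff_t offset,
                                  struct page *page);

static int zswap_frontswap_store_writethrough(unsigned type, pgoff_t offset,
                                              struct page *page)
{
        /* Best effort: keep a compressed copy in memory. */
        zswap_store_compressed(type, offset, page);

        /*
         * Report "not stored" so swap_writepage() continues and issues
         * the normal write to the swap device.  The on-disk copy then
         * always matches, and a full zswap pool can simply drop old
         * entries instead of decompressing and writing them back.
         */
        return -1;
}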

-- 
Regards,
-Bob


Re: [patch 0/8] mm: thrash detection-based file cache sizing v5

2013-11-12 Thread Bob Liu
Hi Johannes,

On Fri, Oct 11, 2013 at 5:46 AM, Johannes Weiner han...@cmpxchg.org wrote:
> Future
> 
> Right now we have a fixed ratio (50:50) between inactive and active
> list but we already have complaints about working sets exceeding half
> of memory being pushed out of the cache by simple used-once streaming
> in the background.  Ultimately, we want to adjust this ratio and allow
> for a much smaller inactive list.  These patches are an essential step
> in this direction because they decouple the VM's ability to detect
> working set changes from the inactive list size.  This would allow us
> to base the inactive list size on something more sensible, like the
> combined readahead window size for example.


I found that this patchset has a similar purpose to
Zcache (http://lwn.net/Articles/562254/) in some ways.

Zcache uses the cleancache API to compress clean file pages so as to
keep them in memory, and only pages which used to be on the active file
list are considered. Through Zcache we can hold more active pages in
memory and won't be affected by streaming workloads.

Perhaps you can take a look at the zcache approach; your workloads are
very likely to benefit from it.
Zcache also doesn't need so many changes to the core VM/VFS subsystems
because it's based on the cleancache API, which I think makes it more
acceptable.
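
For context, a cleancache backend hooks in roughly as in the sketch below.
This is written from memory of the 3.x-era API, so treat the exact callback
signatures as assumptions; the point is that the hooks sit behind cleancache
calls already present in the VFS, so a backend like zcache can capture clean
file pages on eviction without further core changes:

#include <linux/module.h>
#include <linux/cleancache.h>

/* Stub callbacks; a real backend would compress and store the pages. */
static int sketch_init_fs(size_t pagesize) { return 0; } /* returns a pool id */
static int sketch_get_page(int pool, struct cleancache_filekey key,
                           pgoff_t index, struct page *page) { return -1; }
static void sketch_put_page(int pool, struct cleancache_filekey key,
                            pgoff_t index, struct page *page) { }
static void sketch_invalidate_page(int pool, struct cleancache_filekey key,
                                   pgoff_t index) { }
static void sketch_invalidate_inode(int pool, struct cleancache_filekey key) { }
static void sketch_invalidate_fs(int pool) { }

static struct cleancache_ops sketch_ops = {
        .init_fs          = sketch_init_fs,          /* one pool per filesystem */
        .get_page         = sketch_get_page,         /* refill on a read miss */
        .put_page         = sketch_put_page,         /* capture a clean page on eviction */
        .invalidate_page  = sketch_invalidate_page,
        .invalidate_inode = sketch_invalidate_inode,
        .invalidate_fs    = sketch_invalidate_fs,
};

static int __init sketch_init(void)
{
        cleancache_register_ops(&sketch_ops);
        return 0;
}
module_init(sketch_init);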

-- 
Regards,
--Bob


Re: [PATCH] staging: zsmalloc: Ensure handle is never 0 on success

2013-11-08 Thread Bob Liu
On 11/07/2013 03:04 PM, Minchan Kim wrote:
> On Wed, Nov 06, 2013 at 07:05:11PM -0800, Greg KH wrote:
>> On Wed, Nov 06, 2013 at 03:46:19PM -0800, Nitin Gupta wrote:
>>>> I'm getting really tired of them hanging around in here for many years
>>>> now...
>>>>
>>>
>>> Minchan has tried many times to promote zram out of staging. This was
>>> his most recent attempt:
>>>
>>> https://lkml.org/lkml/2013/8/21/54
>>>
>>> There he provided arguments for zram inclusion, how it can help in
>>> situations where zswap can't and why generalizing /dev/ramX would
>>> not be a great idea. So, cannot say why it wasn't picked up
>>> for inclusion at that time.
>>>
>>>> Should I just remove them if no one is working on getting them merged
>>>> "properly"?
>>>>
>>>
>>> Please refer the mail thread (link above) and see Minchan's
>>> justifications for zram.
>>> If they don't sound convincing enough then please remove zram+zsmalloc
>>> from staging.
>>
>> You don't need to be convincing me, you need to be convincing the
>> maintainers of the area of the kernel you are working with.
>>
>> And since the last time you all tried to get this merged was back in
>> August, I'm feeling that you all have given up, so it needs to be
>> deleted.  I'll go do that for 3.14, and if someone wants to pick it up
>> and merge it properly, they can easily revert it.
> 
> I'm guilty and I have been busy by other stuff. Sorry for that.
> Fortunately, I discussed this issue with Hugh in this Linuxcon for a
> long time(Thanks Hugh!) he felt zram's block device abstraction is
> better design rather than frontswap backend stuff although it's a question
> where we put zsmalloc. I will CC Hugh because many of things is related
> to swap subsystem and his opinion is really important.
> And I discussed it with Rik and he feel positive about zram.
> 
> Last impression Andrw gave me by private mail is he want to merge
> zram's functionality into zswap or vise versa.
> If I misunderstood, please correct me.
> I understand his concern but I guess he didn't have a time to read
> my long description due to a ton of works at that time.
> So, I will try one more time.
> I hope I'd like to listen feedback than *silence* so that we can
> move forward than stall.
> 
> Recently, Bob tried to move zsmalloc under mm directory to unify
> zram and zswap with adding pseudo block device in zswap(It's
> very weired to me. I think it's horrible monster which is lying
> between mm and block in layering POV) but he was ignoring zram's
> block device (a.k.a zram-blk) feature and considered only swap
> usecase of zram, in turn, it lose zram's good concept. 
> I already convered other topics Bob raised in this thread[1]
> and why I think zram is better in the thread.
> 

I have no objection to zram, and I also think it would be good for zswap
to support zsmalloc and a fake swap device. At least users would have
more options, just like slab/slub/slob.

-- 
Regards,
-Bob


Re: [Patch 3.11.7 1/1]mm: remove and free expired data in time in zswap

2013-11-08 Thread Bob Liu
On 11/08/2013 05:50 PM, changkun.li wrote:
> In zswap, store page A to zbud if the compression ratio is high, insert
> its entry into rbtree. if there is a entry B which has the same offset
> in the rbtree.Remove and free B before insert the entry of A.
> 
> case:
> if the compression ratio of page A is not high, return without checking
> the same offset one in rbtree.
> 
> if there is a entry B which has the same offset in the rbtree. Now, we
> make sure B is invalid or expired. But the entry and compressed memory
> of B are not freed in time.
> 
> Because zswap spaces data in memory, it makes the utilization of memory
> lower. the other valid data in zbud is writeback to swap device more
> possibility, when zswap is full.
> 
> So if we make sure a entry is expired, free it in time.
> 
> Signed-off-by: changkun.li
> ---
>  mm/zswap.c |5 -
>  1 files changed, 4 insertions(+), 1 deletions(-)
> 
> diff --git a/mm/zswap.c b/mm/zswap.c
> index cbd9578..90a2813 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -596,6 +596,7 @@ fail:
>   return ret;
>  }
>  
> +static void zswap_frontswap_invalidate_page(unsigned type, pgoff_t
> offset);
>  /*
>  * frontswap hooks
>  **/
> @@ -614,7 +615,7 @@ static int zswap_frontswap_store(unsigned type,
> pgoff_t offset,
>  
>   if (!tree) {
>   ret = -ENODEV;
> - goto reject;
> + goto nodev;
>   }
>  
>   /* reclaim space if needed */
> @@ -695,6 +696,8 @@ freepage:
>   put_cpu_var(zswap_dstmem);
>   zswap_entry_cache_free(entry);
>  reject:
> + zswap_frontswap_invalidate_page(type, offset);

I'm afraid that by the time we arrive here, zswap_rb_search(offset) will
almost always return a NULL entry. So most of the time it's just wasted
work to call zswap_frontswap_invalidate_page() to search the rbtree.
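
For reference, the search in question is just an rbtree walk keyed by the
swap offset, roughly like the sketch below (paraphrased, not copied from
mm/zswap.c); on this path it will almost always just walk down to NULL:

#include <linux/rbtree.h>

/* Minimal stand-ins for the real types in mm/zswap.c. */
struct zswap_entry {
        struct rb_node rbnode;
        pgoff_t offset;
};

static struct zswap_entry *rb_search_sketch(struct rb_root *root, pgoff_t offset)
{
        struct rb_node *node = root->rb_node;
        struct zswap_entry *entry;

        while (node) {
                entry = rb_entry(node, struct zswap_entry, rbnode);
                if (entry->offset > offset)
                        node = node->rb_left;
                else if (entry->offset < offset)
                        node = node->rb_right;
                else
                        return entry;   /* a stale entry for this offset */
        }
        return NULL;                    /* the common case at the reject label */
}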

-- 
Regards,
-Bob


Re: [Xen-devel] [PATCH] xen/balloon: Set balloon's initial state to number of existing RAM pages

2013-11-06 Thread Bob Liu

On 11/07/2013 04:37 AM, Boris Ostrovsky wrote:
> Currently balloon's initial value is set to max_pfn which includes
> non-RAM ranges such as MMIO hole. As result, initial memory target
> (specified by guest's configuration file) will appear smaller than
> what balloon driver perceives to be the current number of available
> pages. Thus it will balloon down "extra" pages, decreasing amount of
> available memory for no good reason.
> 

This fixes the strange behavior I mentioned yesterday, where the balloon
driver was triggered unreasonably every time after the guest started.

> Signed-off-by: Boris Ostrovsky 
> ---
>  drivers/xen/balloon.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
> index b232908..1b62304 100644
> --- a/drivers/xen/balloon.c
> +++ b/drivers/xen/balloon.c
> @@ -641,7 +641,7 @@ static int __init balloon_init(void)
>  
>   balloon_stats.current_pages = xen_pv_domain()
>   ? min(xen_start_info->nr_pages - xen_released_pages, max_pfn)
> - : max_pfn;
> + : get_num_physpages();

By the way, should the other places that use max_pfn also be changed to
get_num_physpages()?

>   balloon_stats.target_pages  = balloon_stats.current_pages;
>   balloon_stats.balloon_low   = 0;
>   balloon_stats.balloon_high  = 0;
> 
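
For reference, max_pfn is the highest page frame number and so covers
non-RAM ranges such as the MMIO hole, while get_num_physpages() only sums
the RAM pages that are actually present. A purely illustrative helper (not
part of the patch) for the quantity the old code over-counted by:

/* Illustrative only: pages covered by max_pfn that are not actual RAM. */
static unsigned long non_ram_pages_below_max_pfn(void)
{
        /* This is roughly the amount that was needlessly ballooned down. */
        return max_pfn - get_num_physpages();
}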

-- 
Regards,
-Bob


Re: [Xen-devel] [PATCH] drivers: xen-selfballoon: consider slab pages

2013-11-04 Thread Bob Liu

On 11/05/2013 01:22 AM, David Vrabel wrote:
> On 04/11/13 12:39, Bob Liu wrote:
>> Currently the goal_page in xen-selfballon doesn't consider much about pages 
>> used
>> in kernel space.
>> A typical usage is slab pages, without consider slab pages the goal_page 
>> result
>> may be too rough and lead extra memory pressure to guest os.
> 
> Can you provide some real world figures where the calculatation got it
> wrong? What was the resultant behavior?  Swap death? OOM killer?
> 

Sorry, I didn't run any testing; I just thought it was unreasonable while
reading the source code.

vm_memory_committed() only counts pages mapped into process address
spaces, but the kernel itself (e.g. the block, fs and network subsystems)
may also occupy memory, and it's possible that those subsystems occupy a
significant amount of memory in some situations.

I'm afraid that if we don't account for that kernel memory while
calculating goal_pages, guest memory will be set lower than the guest
really needs.

>> Signed-off-by: Bob Liu 
>> ---
>>  drivers/xen/xen-selfballoon.c |2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/drivers/xen/xen-selfballoon.c b/drivers/xen/xen-selfballoon.c
>> index 21e18c1..4814759 100644
>> --- a/drivers/xen/xen-selfballoon.c
>> +++ b/drivers/xen/xen-selfballoon.c
>> @@ -191,6 +191,8 @@ static void selfballoon_process(struct work_struct *work)
>>  tgt_pages = cur_pages; /* default is no change */
>>  goal_pages = vm_memory_committed() +
>>  totalreserve_pages +
>> +global_page_state(NR_SLAB_RECLAIMABLE) +
> 
> Does SLAB_RECLAIMABLE want to be included here?  Unless I'm
> misunderstanding here, SLAB_RECLAIMABLE is effectively free.
> 

SLAB_RECLAIMABLE isn't effectively free; it means the slab page is in
use but can be reclaimed (freed) under memory pressure.

>> +global_page_state(NR_SLAB_UNRECLAIMABLE) +
> 
> This bit looks fine to me.
> 
>>  MB2PAGES(selfballoon_reserved_mb);
>>  #ifdef CONFIG_FRONTSWAP
>>  /* allow space for frontswap pages to be repatriated */
> 
> David
> 

Thanks for your review.

-- 
Regards,
-Bob


[PATCH] drivers: xen-selfballoon: consider slab pages

2013-11-04 Thread Bob Liu
Currently the goal_pages calculation in xen-selfballoon doesn't take pages
used in kernel space into account.
A typical example is slab pages; without considering slab pages, the
goal_pages result may be too rough and put extra memory pressure on the
guest OS.

Signed-off-by: Bob Liu 
---
 drivers/xen/xen-selfballoon.c |2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/xen/xen-selfballoon.c b/drivers/xen/xen-selfballoon.c
index 21e18c1..4814759 100644
--- a/drivers/xen/xen-selfballoon.c
+++ b/drivers/xen/xen-selfballoon.c
@@ -191,6 +191,8 @@ static void selfballoon_process(struct work_struct *work)
tgt_pages = cur_pages; /* default is no change */
goal_pages = vm_memory_committed() +
totalreserve_pages +
+   global_page_state(NR_SLAB_RECLAIMABLE) +
+   global_page_state(NR_SLAB_UNRECLAIMABLE) +
MB2PAGES(selfballoon_reserved_mb);
 #ifdef CONFIG_FRONTSWAP
/* allow space for frontswap pages to be repatriated */
-- 
1.7.10.4



Re: zram/zsmalloc issues in very low memory conditions

2013-11-01 Thread Bob Liu
Hi Olav,

On 11/02/2013 08:59 AM, Olav Haugan wrote:

> 
> I tried the above suggestion but it does not seem to have any noticeable
> impact. The system is still trying to swap out at a very high rate after
> zram reported failure to swap out. The error logging is actually so much
> that my system crashed due to excessive logging (we have a watchdog that
> is not getting pet because the kernel is busy logging kernel messages).
> 

I have a question: why didn't the low memory killer get triggered in
this situation?
Is it possible to make the LMK a bit more aggressive?

> There isn't anything that can be set to tell the fs layer to back off
> completely for a while (congestion control)?
> 

The other way that I think might fix your issue is the one you mentioned
in your previous email: set the congested bit for the swap device as
well. Like:

diff --git a/drivers/staging/zram/zram_drv.c
b/drivers/staging/zram/zram_drv.c
index 91d94b5..c4fc63e 100644
--- a/drivers/staging/zram/zram_drv.c
+++ b/drivers/staging/zram/zram_drv.c
@@ -474,6 +474,7 @@ static int zram_bvec_write(struct zram *zram, struct
bio_vec *bvec, u32 index,
if (!handle) {
pr_info("Error allocating memory for compressed page:
%u, size=%zu\n",
index, clen);
+   blk_set_queue_congested(zram->disk->queue, BLK_RW_ASYNC);
ret = -ENOMEM;
goto out;
}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8ed1b77..1c790ee 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -394,8 +394,6 @@ static inline int is_page_cache_freeable(struct page
*page)
 static int may_write_to_queue(struct backing_dev_info *bdi,
  struct scan_control *sc)
 {
-   if (current->flags & PF_SWAPWRITE)
-   return 1;

--

As for updating the congested state of zram, I think you can clear it
from user space, e.g. after the LMK has triggered and reclaimed some
memory.

Of course this depends on the zram driver exporting a sysfs node like
"/sys/block/zram0/clear_congested".

-- 
Regards,
-Bob


Re: zram/zsmalloc issues in very low memory conditions

2013-10-31 Thread Bob Liu
Hi Olav,

On 11/01/2013 07:34 AM, Olav Haugan wrote:
> Hi Luigi,
> 
> On 10/24/2013 6:12 PM, Luigi Semenzato wrote:
>> On Thu, Oct 24, 2013 at 5:35 PM, Olav Haugan  wrote:
>>> Hi Bob, Luigi,
>>>
>>> On 10/23/2013 5:55 PM, Bob Liu wrote:
>>>>
>>>> On 10/24/2013 05:51 AM, Olav Haugan wrote:
>>>
>>>> By the way, could you take a try with zswap? Which can write pages to
>>>> real swap device if compressed pool is full.
>>>
>>> zswap might not be feasible in all cases if you only have flash as
>>> backing storage.
>>
>> Zswap can be configured to run without a backing storage.
>>
> 
> I was under the impression that zswap requires a backing storage. Can
> you elaborate on how to configure zswap to not need a backing storage?
> 

AFAIK, zswap currently can't be used without backing storage.
Perhaps you could try creating a swap device backed by a file on
storage.

-- 
Regards,
-Bob


Re: [BUG] 3.12.0-rcX IPv6 panic

2013-10-28 Thread Bob Tracy
On Mon, Oct 21, 2013 at 06:40:41PM -0500, Bob Tracy wrote:
> On Mon, Oct 21, 2013 at 05:52:52PM +0200, Hannes Frederic Sowa wrote:
> > On Mon, Oct 21, 2013 at 08:18:46AM -0500, Bob Tracy wrote:
> > > Actually, a regression: the 3.11 kernel is rock-solid stable on my
> > > Alpha.
> > > 
> > > Beginning with 3.12.0-rc1, I can reliably trigger a kernel panic by
> > > executing the gogo6.net "gw6c" IPv6 client program.  If the networking
> > > layer is active, an "Oops" will eventually (within a day) occur regardless
> > > of whether I attempt to run "gw6c".  3.12.0-rcX is stable as long as I
> > > leave networking completely disabled.  The error has persisted up through
> > > -rc6.  Apologies for not mentioning this earlier, but the state of my
> > > PWS-433au has been questionable, and I wanted to make sure I had a
> > > legitimate bug sighting.
> > > 
> > > I'll have to transcribe the panic backtrace by hand: nothing makes it
> > > into any of the system logs :-(.  I *can* recall that every backtrace
> > > I've seen thus far has included one of the skb_copy() variants near the
> > > top of the list (first or second function).
> > 
> > Try to capture the panic via serial console. Otherwise a picture
> > would give us a first hint. Please watch out for lines like
> > skb_(over|under)_panic.
> 
> I'll get a screen capture of some kind for you within the next day or
> so.
> 
> > gw6c is a tunnel client? Can you post ip -6 tunnel ls?
> 
> Assuming you meant "show [NAME]" (no "ls" option for the tunnel object),
> that yields the following with "gw6c" running on a 3.11.0 kernel:
> 
> smirkin:~# ip -6 tunnel show sit1
> sit1: any/ipv6 remote 4056:5874:: local 4500:0:0:4000:4029:: encaplimit 0 
> hoplimit 0 tclass 0x00 flowlabel 0x0 (flowinfo 0x)
> 
> I'm running the gw6c client in gateway mode: the Alpha is my IPv6
> gateway/firewall.

Update: no significant change for 3.12.0-rc7.  Can still reliably panic
the system by running "gw6c" to set up the IPv6 tunnel.  Here's a hand-
transcribed backtrace: I hope it's sufficient...  I would have included
the stack/register dump, but about half of it had scrolled off the top
of the screen.

skb_copy_and_csum_bits+0x88/0x380
icmp_glue_bits+0x48/0xce0
__ip_append_data+0x8f4/0xb40
ip_append_data+0xb0/0x130
icmp_glue_bits+0x0/0xe0
icmp_glue_bits+0x0/0xe0  (yes: repeated once)
icmp_push_reply+0x6c/0x190
icmp_send+0x3fc/0x4c0
ip_local_deliver_finish+0x20c/0x2e0
ip_rcv_finish+0x1d8/0x3d0
nf_ct_attach+0x32/0x40
ip_rcv_finish+0x148/0x3d0
__netif_receive_skb_core+0x27c/0x890
process_backlog+0xb8/0x1a0
net_rx_action+0xc8/0x210
__do_softirq+0x1a0/0x230
do_softirq+0x5c/0x70
irq_exit+0x68/0x80
handle_irq+0x90/0xf0
miata_srm_device_interrupt+0x30/0x50
do_entInt+0x1cc/0x1f0
__do_fault+0x3e0/0x5e0
ret_from_sys_call+0x0/0x10
entMM+0x9c/0xc0
do_page_fault+0x0/0x500
do_page_fault+0x48/0x500
entMM+0x9c/0xc0
filp_close+0x6c/0xe0
filp_close+0x98/0xe0
filp_close+0x6c/0xe0
filp_close+0x98/0xe0
__close_fd+0xb8/0xe0

Code: 44640003 e4600010 2032 a77de680 b67e0010 47f10410  47ff0411

--Bob


Re: zram/zsmalloc issues in very low memory conditions

2013-10-24 Thread Bob Liu
On 10/25/2013 08:35 AM, Olav Haugan wrote:
> Hi Bob, Luigi,
> 
> On 10/23/2013 5:55 PM, Bob Liu wrote:
>>
>> On 10/24/2013 05:51 AM, Olav Haugan wrote:
>>> I am trying to use zram in very low memory conditions and I am having
>>> some issues. zram is in the reclaim path. So if the system is very low
>>> on memory the system is trying to reclaim pages by swapping out (in this
>>> case to zram). However, since we are very low on memory zram fails to
>>> get a page from zsmalloc and thus zram fails to store the page. We get
>>> into a cycle where the system is low on memory so it tries to swap out
>>> to get more memory but swap out fails because there is not enough memory
>>> in the system! The major problem I am seeing is that there does not seem
>>> to be a way for zram to tell the upper layers to stop swapping out
>>> because the swap device is essentially "full" (since there is no more
>>> memory available for zram pages). Has anyone thought about this issue
>>> already and have ideas how to solve this or am I missing something and I
>>> should not be seeing this issue?
>>>
>>
>> The same question as Luigi "What do you want the system to do at this
>> point?"
>>
>> If swap fails then OOM killer will be triggered, I don't think this will
>> be a issue.
> 
> I definitely don't want OOM killer to run since OOM killer can kill
> critical processes (this is on Android so we have Android LMK to handle
> the killing in a more "safe" way). However, what I am seeing is that
> when I run low on memory zram fails to swap out and returns error but
> the swap subsystem just continues to try to swap out even when this
> error occurs (it tries over and over again very rapidly causing the
> kernel to be filled with error messages [at least two error messages per
> failure btw]).
> 

A simple way to disable the error messages is to delete the printk line
in the zram source code.

> What I expected to happen is for the swap subsystem to stop trying to
> swap out until memory is available to swap out. I guess this could be

In my opinion, the system has already entered a heavy memory pressure
state when zram fails to allocate a page.

In this case, the OOM killer or Low Memory Killer should be woken up to
kill some processes in order to free some memory pages.

After that happens, the system's free memory may be above the watermark
and no swapping will happen. Even if swapping happens again, it's
unlikely that zram will fail to allocate a page. If it does fail again,
the OOM killer or LMK should be triggered once more.

> handled several ways. Either 1) the swap subsystem, upon encountering an
> error to swap out, backs off from trying to swap out for some time or 2)
> zram informs the swap subsystem that the swap device is full.
> 
> Could this be handled by congestion control? However, I found the
> following comment in the code in vmscan.c:
> 
> * If the page is swapcache, write it back even if that would
> * block, for some throttling. This happens by accident, because
> * swap_backing_dev_info is bust: it doesn't reflect the
> * congestion state of the swapdevs.  Easy to fix, if needed.
> 
> However, how would one update the congested state of zram when it
> becomes un-congested?
> 
>> By the way, could you take a try with zswap? Which can write pages to
>> real swap device if compressed pool is full.
> 
> zswap might not be feasible in all cases if you only have flash as
> backing storage.
> 

Yes, that's still a limitation of zswap. Perhaps you can create a swap
file on the backing storage.

I'll try to add a fake swap device for zswap, so that users have one
more choice besides zram.

>>> I am also seeing a couple other issues that I was wondering whether
>>> folks have already thought about:
>>>
>>> 2) zsmalloc fails when the page allocated is at physical address 0 (pfn
>>
>> AFAIK, this will never happen.
> 
> I can easily get this to happen since I have memory starting at physical
> address 0.
> 

Could you confirm this? AFAIR, physical memory starting from 0 is usually
reserved for some special usage.

I ran 'cat /proc/zoneinfo' on x86 and ARM, and neither 'start_pfn' was 0.

-- 
Regards,
-Bob


Re: zram/zsmalloc issues in very low memory conditions

2013-10-23 Thread Bob Liu

On 10/24/2013 05:51 AM, Olav Haugan wrote:
> I am trying to use zram in very low memory conditions and I am having
> some issues. zram is in the reclaim path. So if the system is very low
> on memory the system is trying to reclaim pages by swapping out (in this
> case to zram). However, since we are very low on memory zram fails to
> get a page from zsmalloc and thus zram fails to store the page. We get
> into a cycle where the system is low on memory so it tries to swap out
> to get more memory but swap out fails because there is not enough memory
> in the system! The major problem I am seeing is that there does not seem
> to be a way for zram to tell the upper layers to stop swapping out
> because the swap device is essentially "full" (since there is no more
> memory available for zram pages). Has anyone thought about this issue
> already and have ideas how to solve this or am I missing something and I
> should not be seeing this issue?
> 

The same question as Luigi's: "What do you want the system to do at this
point?"

If swap-out fails then the OOM killer will be triggered, so I don't think
this will be an issue.

By the way, could you give zswap a try? It can write pages to a real
swap device if the compressed pool is full.

> I am also seeing a couple other issues that I was wondering whether
> folks have already thought about:
> 
> 1) The size of a swap device is statically computed when the swap device
> is turned on (nr_swap_pages). The size of zram swap device is dynamic
> since we are compressing the pages and thus the swap subsystem thinks
> that the zram swap device is full when it is not really full. Any
> plans/thoughts about the possibility of being able to update the size
> and/or the # of available pages in a swap device on the fly?
> 
> 2) zsmalloc fails when the page allocated is at physical address 0 (pfn

AFAIK, this will never happen.

> = 0) since the handle returned from zsmalloc is encoded as (PFN,
> obj_idx) and thus the resulting handle will be 0 (since obj_idx starts
> at 0). zs_malloc returns the handle but does not distinguish between a
> valid handle of 0 and a failure to allocate. A possible solution to this
> would be to start the obj_idx at 1. Is this feasible?
> 
> Thanks,
> 
> Olav Haugan
> 
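
For reference, the handle packing described above is roughly the following.
This is a simplified sketch; the real zsmalloc derives OBJ_INDEX_BITS from
the zspage layout rather than the arbitrary value used here:

#define OBJ_INDEX_BITS  11      /* illustrative value only */

static unsigned long encode_handle(unsigned long pfn, unsigned long obj_idx)
{
        return (pfn << OBJ_INDEX_BITS) | obj_idx;
}

/*
 * With pfn == 0 and obj_idx == 0 the handle is 0, which callers of
 * zs_malloc() cannot tell apart from an allocation failure - hence
 * the suggestion to start obj_idx at 1.
 */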

-- 
Regards,
-Bob


Re: [BUG] 3.12.0-rcX IPv6 panic

2013-10-21 Thread Bob Tracy
On Mon, Oct 21, 2013 at 05:52:52PM +0200, Hannes Frederic Sowa wrote:
> On Mon, Oct 21, 2013 at 08:18:46AM -0500, Bob Tracy wrote:
> > Actually, a regression: the 3.11 kernel is rock-solid stable on my
> > Alpha.
> > 
> > Beginning with 3.12.0-rc1, I can reliably trigger a kernel panic by
> > executing the gogo6.net "gw6c" IPv6 client program.  If the networking
> > layer is active, an "Oops" will eventually (within a day) occur regardless
> > of whether I attempt to run "gw6c".  3.12.0-rcX is stable as long as I
> > leave networking completely disabled.  The error has persisted up through
> > -rc6.  Apologies for not mentioning this earlier, but the state of my
> > PWS-433au has been questionable, and I wanted to make sure I had a
> > legitimate bug sighting.
> > 
> > I'll have to transcribe the panic backtrace by hand: nothing makes it
> > into any of the system logs :-(.  I *can* recall that every backtrace
> > I've seen thus far has included one of the skb_copy() variants near the
> > top of the list (first or second function).
> 
> Try to capture the panic via serial console. Otherwise a picture
> would give us a first hint. Please watch out for lines like
> skb_(over|under)_panic.

I'll get a screen capture of some kind for you within the next day or
so.

> gw6c is a tunnel client? Can you post ip -6 tunnel ls?

Assuming you meant "show [NAME]" (no "ls" option for the tunnel object),
that yields the following with "gw6c" running on a 3.11.0 kernel:

smirkin:~# ip -6 tunnel show sit1
sit1: any/ipv6 remote 4056:5874:: local 4500:0:0:4000:4029:: encaplimit 0 
hoplimit 0 tclass 0x00 flowlabel 0x0 (flowinfo 0x)

I'm running the gw6c client in gateway mode: the Alpha is my IPv6
gateway/firewall.

--Bob


[BUG] 3.12.0-rcX IPv6 panic

2013-10-21 Thread Bob Tracy
Actually, a regression: the 3.11 kernel is rock-solid stable on my
Alpha.

Beginning with 3.12.0-rc1, I can reliably trigger a kernel panic by
executing the gogo6.net "gw6c" IPv6 client program.  If the networking
layer is active, an "Oops" will eventually (within a day) occur regardless
of whether I attempt to run "gw6c".  3.12.0-rcX is stable as long as I
leave networking completely disabled.  The error has persisted up through
-rc6.  Apologies for not mentioning this earlier, but the state of my
PWS-433au has been questionable, and I wanted to make sure I had a
legitimate bug sighting.

I'll have to transcribe the panic backtrace by hand: nothing makes it
into any of the system logs :-(.  I *can* recall that every backtrace
I've seen thus far has included one of the skb_copy() variants near the
top of the list (first or second function).

--Bob


Re: trace-cmd problem with FC19

2013-10-17 Thread Bob Copeland
On Thu, Oct 17, 2013 at 04:56:56PM -0400, Steven Rostedt wrote:
> Hmm, are you sure?
> 
> You may want to do both:
> 
> sudo trace-cmd -v
> which trace-cmd
> 
> to see which version it is.

To clarify - I ran into the referenced issue using an older,
self-compiled version with a recent kernel.  The FC19 distro
version may be fine, for all I know.

Arend says he used the latest version from the repo which should
not be a problem, but just throwing that out there since the
symptoms are similar.

-- 
Bob Copeland %% www.bobcopeland.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: trace-cmd problem with FC19

2013-10-17 Thread Bob Copeland
[sorry, resent for the archives, I failed email]
On Thu, Oct 17, 2013 at 4:20 PM, Steven Rostedt  wrote:
> On Thu, 17 Oct 2013 22:07:15 +0200
> "Arend van Spriel"  wrote:
>
>
>> > Does recording other traces work? Or is it only spicific to this module?
>>
>> It seems a generic issue:
>>
>> $ sudo trace-cmd record -e ext4:* ls
>> /sys/kernel/debug/tracing/events/ext4/*/filter
>> systemd-private-6pVB5Lsystemd-private-KdpFqS  trace.dat.cpu0  
>> trace.dat.cpu2
>> systemd-private-9hedRDtrace.dat   trace.dat.cpu1  
>> trace.dat.cpu3
>> trace-cmd: Interrupted system call
>>recorder error in splice input
>> trace-cmd: Interrupted system call
>>recorder error in splice input
>> trace-cmd: Interrupted system call
>>recorder error in splice input

Perhaps this is an instance of this bug?

https://lkml.org/lkml/2013/5/31/426

tl;dr try with latest trace-cmd?  I hit the above myself.

-- 
Bob Copeland %% www.bobcopeland.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: BUG: mm, numa: test segfaults, only when NUMA balancing is on

2013-10-17 Thread Bob Liu
Hi Alex,

On Wed, Oct 16, 2013 at 11:54 PM, Alex Thorlton  wrote:
> Hi guys,
>
> I ran into a bug a week or so ago, that I believe has something to do
> with NUMA balancing, but I'm having a tough time tracking down exactly
> what is causing it.  When running with the following configuration
> options set:
>
> CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
> CONFIG_NUMA_BALANCING_DEFAULT_ENABLED=y
> CONFIG_NUMA_BALANCING=y
> # CONFIG_HUGETLBFS is not set
> # CONFIG_HUGETLB_PAGE is not set
>

What's your kernel version?
And did you enable CONFIG_TRANSPARENT_HUGEPAGE?

> I get intermittent segfaults when running the memscale test that we've
> been using to test some of the THP changes.  Here's a link to the test:
>
> ftp://shell.sgi.com/collect/memscale/
>
> I typically run the test with a line similar to this:
>
> ./thp_memscale -C 0 -m 0 -c <cores> -b <memory>
>
> Where <cores> is the number of cores to spawn threads on, and <memory>
> is the amount of memory to reserve from each core.  The <memory> field
> can accept values like 512m or 1g, etc.  I typically run 256 cores and
> 512m, though I think the problem should be reproducable on anything with
> 128+ cores.
>
> The test never seems to have any problems when running with hugetlbfs
> on and NUMA balancing off, but it segfaults every once in a while with
> the config options above.  It seems to occur more frequently, the more
> cores you run on.  It segfaults on about 50% of the runs at 256 cores,
> and on almost every run at 512 cores.  The fewest number of cores I've
> seen a segfault on has been 128, though it seems to be rare on this many
> cores.
>

Could you please attach some logs?

> At this point, I'm not familiar enough with NUMA balancing code to know
> what could be causing this, and we don't typically run with NUMA
> balancing on, so I don't see this in my everyday testing, but I felt
> that it was definitely worth bringing up.
>
> If anybody has any ideas of where I could poke around to find a
> solution, please let me know.
>

-- 
Regards,
--Bob
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 2/3] mm/zswap: bugfix: memory leak when invalidate and reclaim occur concurrently

2013-10-12 Thread Bob Liu
On Sat, Oct 12, 2013 at 5:14 PM, Weijie Yang  wrote:
> On Sat, Oct 12, 2013 at 10:50 AM, Bob Liu  wrote:
>> On Fri, Oct 11, 2013 at 3:13 PM, Minchan Kim  wrote:
>>> On Thu, Sep 26, 2013 at 11:42:17AM +0800, Weijie Yang wrote:
>>>> On Tue, Sep 24, 2013 at 9:03 AM, Minchan Kim  wrote:
>>>> > On Mon, Sep 23, 2013 at 04:21:49PM +0800, Weijie Yang wrote:
>>>> > >
>>>> > > Modify:
>>>> > >  - check the refcount in fail path, free memory if it is not 
>>>> > > referenced.
>>>> >
>>>> > Hmm, I don't like this because zswap refcount routine is already mess 
>>>> > for me.
>>>> > I'm not sure why it was designed from the beginning. I hope we should 
>>>> > fix it first.
>>>> >
>>>> > 1. zswap_rb_serach could include zswap_entry_get semantic if it founds a 
>>>> > entry from
>>>> >the tree. Of course, we should ranme it as find_get_zswap_entry like 
>>>> > find_get_page.
>>>> > 2. zswap_entry_put could hide resource free function like 
>>>> > zswap_free_entry so that
>>>> >all of caller can use it easily following pattern.
>>>> >
>>>> >   find_get_zswap_entry
>>>> >   ...
>>>> >   ...
>>>> >   zswap_entry_put
>>>> >
>>>> > Of course, zswap_entry_put have to check the entry is in the tree or not
>>>> > so if someone already removes it from the tree, it should avoid double 
>>>> > remove.
>>>> >
>>>> > One of the concern I can think is that approach extends critical section
>>>> > but I think it would be no problem because more bottleneck would be 
>>>> > [de]compress
>>>> > functions. If it were really problem, we can mitigate a problem with 
>>>> > moving
>>>> > unnecessary functions out of zswap_free_entry because it seem to be 
>>>> > rather
>>>> > over-enginnering.
>>>>
>>>> I refactor the zswap refcount routine according to Minchan's idea.
>>>> Here is the new patch, Any suggestion is welcomed.
>>>>
>>>> To Seth and Bob, would you please review it again?
>>>
>>> Yeah, Seth, Bob. You guys are right persons to review this because this
>>> scheme was suggested by me who is biased so it couldn't be a fair. ;-)
>>> But anyway, I will review code itself.
>>>
>>>>
>>>> mm/zswap.c |  116
>>>> 
>>>>  1 file changed, 52 insertions(+), 64 deletions(-)
>>>>
>>>> diff --git a/mm/zswap.c b/mm/zswap.c
>>>> old mode 100644
>>>> new mode 100755
>>>> index deda2b6..bd04910
>>>> --- a/mm/zswap.c
>>>> +++ b/mm/zswap.c
>>>> @@ -217,6 +217,7 @@ static struct zswap_entry 
>>>> *zswap_entry_cache_alloc(gfp_t gfp)
>>>>   if (!entry)
>>>>   return NULL;
>>>>   entry->refcount = 1;
>>>> + RB_CLEAR_NODE(&entry->rbnode);
>>>>   return entry;
>>>>  }
>>>>
>>>> @@ -232,10 +233,20 @@ static void zswap_entry_get(struct zswap_entry 
>>>> *entry)
>>>>  }
>>>>
>>>>  /* caller must hold the tree lock */
>>>> -static int zswap_entry_put(struct zswap_entry *entry)
>>>> +static int zswap_entry_put(struct zswap_tree *tree, struct zswap_entry 
>>>> *entry)
>>>
>>> Why should we have return value? If we really need it, it mitigates
>>> get/put semantic's whole point so I'd like to just return void.
>>>
>>> Let me see.
>>>
>>>>  {
>>>> - entry->refcount--;
>>>> - return entry->refcount;
>>>> + int refcount = --entry->refcount;
>>>> +
>>>> + if (refcount <= 0) {
>>>
>>> Hmm, I don't like minus refcount, really.
>>> I hope we could do following as
>>>
>>> BUG_ON(refcount < 0);
>>> if (refcount == 0) {
>>> ...
>>> }
>>>
>>>
>>>
>>>> + if (!RB_EMPTY_NODE(&entry->rbnode)) {
>>>> + rb_erase(&entry->rbnode, &tree->rbroot);
>>>> + RB_

Re: [PATCH v3 2/3] mm/zswap: bugfix: memory leak when invalidate and reclaim occur concurrently

2013-10-12 Thread Bob Liu
On Sat, Oct 12, 2013 at 5:14 PM, Weijie Yang weijie.yang...@gmail.com wrote:
 On Sat, Oct 12, 2013 at 10:50 AM, Bob Liu lliu...@gmail.com wrote:
 On Fri, Oct 11, 2013 at 3:13 PM, Minchan Kim minc...@kernel.org wrote:
 On Thu, Sep 26, 2013 at 11:42:17AM +0800, Weijie Yang wrote:
 On Tue, Sep 24, 2013 at 9:03 AM, Minchan Kim minc...@kernel.org wrote:
  On Mon, Sep 23, 2013 at 04:21:49PM +0800, Weijie Yang wrote:
  
   Modify:
- check the refcount in fail path, free memory if it is not 
   referenced.
 
  Hmm, I don't like this because zswap refcount routine is already mess 
  for me.
  I'm not sure why it was designed from the beginning. I hope we should 
  fix it first.
 
  1. zswap_rb_serach could include zswap_entry_get semantic if it founds a 
  entry from
 the tree. Of course, we should ranme it as find_get_zswap_entry like 
  find_get_page.
  2. zswap_entry_put could hide resource free function like 
  zswap_free_entry so that
 all of caller can use it easily following pattern.
 
find_get_zswap_entry
...
...
zswap_entry_put
 
  Of course, zswap_entry_put have to check the entry is in the tree or not
  so if someone already removes it from the tree, it should avoid double 
  remove.
 
  One of the concern I can think is that approach extends critical section
  but I think it would be no problem because more bottleneck would be 
  [de]compress
  functions. If it were really problem, we can mitigate a problem with 
  moving
  unnecessary functions out of zswap_free_entry because it seem to be 
  rather
  over-enginnering.

 I refactor the zswap refcount routine according to Minchan's idea.
 Here is the new patch, Any suggestion is welcomed.

 To Seth and Bob, would you please review it again?

 Yeah, Seth, Bob. You guys are right persons to review this because this
 scheme was suggested by me who is biased so it couldn't be a fair. ;-)
 But anyway, I will review code itself.


 mm/zswap.c |  116
 
  1 file changed, 52 insertions(+), 64 deletions(-)

 diff --git a/mm/zswap.c b/mm/zswap.c
 old mode 100644
 new mode 100755
 index deda2b6..bd04910
 --- a/mm/zswap.c
 +++ b/mm/zswap.c
 @@ -217,6 +217,7 @@ static struct zswap_entry 
 *zswap_entry_cache_alloc(gfp_t gfp)
   if (!entry)
   return NULL;
   entry-refcount = 1;
 + RB_CLEAR_NODE(entry-rbnode);
   return entry;
  }

 @@ -232,10 +233,20 @@ static void zswap_entry_get(struct zswap_entry 
 *entry)
  }

  /* caller must hold the tree lock */
 -static int zswap_entry_put(struct zswap_entry *entry)
 +static int zswap_entry_put(struct zswap_tree *tree, struct zswap_entry 
 *entry)

 Why should we have return value? If we really need it, it mitigates
 get/put semantic's whole point so I'd like to just return void.

 Let me see.

  {
 - entry-refcount--;
 - return entry-refcount;
 + int refcount = --entry-refcount;
 +
 + if (refcount = 0) {

 Hmm, I don't like minus refcount, really.
 I hope we could do following as

 BUG_ON(refcount  0);
 if (refcount == 0) {
 ...
 }



 + if (!RB_EMPTY_NODE(entry-rbnode)) {
 + rb_erase(entry-rbnode, tree-rbroot);
 + RB_CLEAR_NODE(entry-rbnode);

 Minor,
 You could make new function zswap_rb_del or zswap_rb_remove which detach 
 the node
 from rb tree and clear node because we have already zswap_rb_insert.


 + }
 +
 + zswap_free_entry(tree, entry);
 + }
 +
 + return refcount;
  }

  /*
 @@ -258,6 +269,17 @@ static struct zswap_entry *zswap_rb_search(struct 
 rb_root *root, pgoff_t offset)
   return NULL;
  }


 Add function description.

 +static struct zswap_entry *zswap_entry_find_get(struct rb_root *root, 
 pgoff_t offset)
 +{
 + struct zswap_entry *entry = NULL;
 +
 + entry = zswap_rb_search(root, offset);
 + if (entry)
 + zswap_entry_get(entry);
 +
 + return entry;
 +}
 +
  /*
   * In the case that a entry with the same offset is found, a pointer to
   * the existing entry is stored in dupentry and the function returns 
 -EEXIST
 @@ -387,7 +409,7 @@ static void zswap_free_entry(struct zswap_tree *tree, 
 struct zswap_entry *entry)
  enum zswap_get_swap_ret {
   ZSWAP_SWAPCACHE_NEW,
   ZSWAP_SWAPCACHE_EXIST,
 - ZSWAP_SWAPCACHE_NOMEM
 + ZSWAP_SWAPCACHE_FAIL,
  };

  /*
 @@ -401,9 +423,9 @@ enum zswap_get_swap_ret {
   * added to the swap cache, and returned in retpage.
   *
   * If success, the swap cache page is returned in retpage
 - * Returns 0 if page was already in the swap cache, page is not locked
 - * Returns 1 if the new page needs to be populated, page is locked
 - * Returns 0 on error
 + * Returns ZSWAP_SWAPCACHE_EXIST if page was already in the swap cache
 + * Returns ZSWAP_SWAPCACHE_NEW if the new page needs to be populated

Re: [PATCH v3 2/3] mm/zswap: bugfix: memory leak when invalidate and reclaim occur concurrently

2013-10-11 Thread Bob Liu
On Fri, Oct 11, 2013 at 3:13 PM, Minchan Kim  wrote:
> On Thu, Sep 26, 2013 at 11:42:17AM +0800, Weijie Yang wrote:
>> On Tue, Sep 24, 2013 at 9:03 AM, Minchan Kim  wrote:
>> > On Mon, Sep 23, 2013 at 04:21:49PM +0800, Weijie Yang wrote:
>> > >
>> > > Modify:
>> > >  - check the refcount in fail path, free memory if it is not referenced.
>> >
>> > Hmm, I don't like this because zswap refcount routine is already mess for 
>> > me.
>> > I'm not sure why it was designed from the beginning. I hope we should fix 
>> > it first.
>> >
>> > 1. zswap_rb_serach could include zswap_entry_get semantic if it founds a 
>> > entry from
>> >the tree. Of course, we should ranme it as find_get_zswap_entry like 
>> > find_get_page.
>> > 2. zswap_entry_put could hide resource free function like zswap_free_entry 
>> > so that
>> >all of caller can use it easily following pattern.
>> >
>> >   find_get_zswap_entry
>> >   ...
>> >   ...
>> >   zswap_entry_put
>> >
>> > Of course, zswap_entry_put have to check the entry is in the tree or not
>> > so if someone already removes it from the tree, it should avoid double 
>> > remove.
>> >
>> > One of the concern I can think is that approach extends critical section
>> > but I think it would be no problem because more bottleneck would be 
>> > [de]compress
>> > functions. If it were really problem, we can mitigate a problem with moving
>> > unnecessary functions out of zswap_free_entry because it seem to be rather
>> > over-enginnering.
>>
>> I refactor the zswap refcount routine according to Minchan's idea.
>> Here is the new patch, Any suggestion is welcomed.
>>
>> To Seth and Bob, would you please review it again?
>
> Yeah, Seth, Bob. You guys are right persons to review this because this
> scheme was suggested by me who is biased so it couldn't be a fair. ;-)
> But anyway, I will review code itself.
>
>>
>> mm/zswap.c |  116
>> 
>>  1 file changed, 52 insertions(+), 64 deletions(-)
>>
>> diff --git a/mm/zswap.c b/mm/zswap.c
>> old mode 100644
>> new mode 100755
>> index deda2b6..bd04910
>> --- a/mm/zswap.c
>> +++ b/mm/zswap.c
>> @@ -217,6 +217,7 @@ static struct zswap_entry *zswap_entry_cache_alloc(gfp_t 
>> gfp)
>>   if (!entry)
>>   return NULL;
>>   entry->refcount = 1;
>> + RB_CLEAR_NODE(&entry->rbnode);
>>   return entry;
>>  }
>>
>> @@ -232,10 +233,20 @@ static void zswap_entry_get(struct zswap_entry *entry)
>>  }
>>
>>  /* caller must hold the tree lock */
>> -static int zswap_entry_put(struct zswap_entry *entry)
>> +static int zswap_entry_put(struct zswap_tree *tree, struct zswap_entry 
>> *entry)
>
> Why should we have return value? If we really need it, it mitigates
> get/put semantic's whole point so I'd like to just return void.
>
> Let me see.
>
>>  {
>> - entry->refcount--;
>> - return entry->refcount;
>> + int refcount = --entry->refcount;
>> +
>> + if (refcount <= 0) {
>
> Hmm, I don't like minus refcount, really.
> I hope we could do following as
>
> BUG_ON(refcount < 0);
> if (refcount == 0) {
> ...
> }
>
>
>
>> + if (!RB_EMPTY_NODE(&entry->rbnode)) {
>> + rb_erase(&entry->rbnode, &tree->rbroot);
>> + RB_CLEAR_NODE(&entry->rbnode);
>
> Minor,
> You could make new function zswap_rb_del or zswap_rb_remove which detach the 
> node
> from rb tree and clear node because we have already zswap_rb_insert.
>
>
>> + }
>> +
>> + zswap_free_entry(tree, entry);
>> + }
>> +
>> + return refcount;
>>  }
>>
>>  /*
>> @@ -258,6 +269,17 @@ static struct zswap_entry *zswap_rb_search(struct 
>> rb_root *root, pgoff_t offset)
>>   return NULL;
>>  }
>>
>
> Add function description.
>
>> +static struct zswap_entry *zswap_entry_find_get(struct rb_root *root, 
>> pgoff_t offset)
>> +{
>> + struct zswap_entry *entry = NULL;
>> +
>> + entry = zswap_rb_search(root, offset);
>> + if (entry)
>> + zswap_entry_get(entry);
>> +
>> + return entry;
>> +}

Re: [PATCH v3 2/3] mm/zswap: bugfix: memory leak when invalidate and reclaim occur concurrently

2013-10-11 Thread Bob Liu
On Thu, Sep 26, 2013 at 11:42 AM, Weijie Yang  wrote:
> On Tue, Sep 24, 2013 at 9:03 AM, Minchan Kim  wrote:
>> On Mon, Sep 23, 2013 at 04:21:49PM +0800, Weijie Yang wrote:
>> >
>> > Modify:
>> >  - check the refcount in fail path, free memory if it is not referenced.
>>
>> Hmm, I don't like this because zswap refcount routine is already mess for me.
>> I'm not sure why it was designed from the beginning. I hope we should fix it 
>> first.
>>
>> 1. zswap_rb_serach could include zswap_entry_get semantic if it founds a 
>> entry from
>>the tree. Of course, we should ranme it as find_get_zswap_entry like 
>> find_get_page.
>> 2. zswap_entry_put could hide resource free function like zswap_free_entry 
>> so that
>>all of caller can use it easily following pattern.
>>
>>   find_get_zswap_entry
>>   ...
>>   ...
>>   zswap_entry_put
>>
>> Of course, zswap_entry_put have to check the entry is in the tree or not
>> so if someone already removes it from the tree, it should avoid double 
>> remove.
>>
>> One of the concern I can think is that approach extends critical section
>> but I think it would be no problem because more bottleneck would be 
>> [de]compress
>> functions. If it were really problem, we can mitigate a problem with moving
>> unnecessary functions out of zswap_free_entry because it seem to be rather
>> over-enginnering.
>
> I refactor the zswap refcount routine according to Minchan's idea.
> Here is the new patch, Any suggestion is welcomed.
>
> To Seth and Bob, would you please review it again?
>

I have nothing in addition to Minchan's review.

Since the code is a bit complex, I'd suggest you split it into two patches:
[1/2]: fix the memory leak
[2/2]: clean up the entry_put

And run some testing.

Thanks,
-Bob
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
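
For reference, the lookup/put pattern being discussed boils down to something
like the sketch below, assembled from the quoted hunks and the review comments
above (names such as zswap_rb_erase are illustrative and the tree lock is
assumed to be held; this is not necessarily the final merged code):

static void zswap_rb_erase(struct rb_root *root, struct zswap_entry *entry)
{
	/* detach from the tree only if it is still linked */
	if (!RB_EMPTY_NODE(&entry->rbnode)) {
		rb_erase(&entry->rbnode, root);
		RB_CLEAR_NODE(&entry->rbnode);
	}
}

/* caller must hold the tree lock */
static void zswap_entry_put(struct zswap_tree *tree, struct zswap_entry *entry)
{
	int refcount = --entry->refcount;

	BUG_ON(refcount < 0);
	if (refcount == 0) {
		zswap_rb_erase(&tree->rbroot, entry);
		zswap_free_entry(tree, entry);
	}
}

/* caller must hold the tree lock */
static struct zswap_entry *zswap_entry_find_get(struct rb_root *root,
						pgoff_t offset)
{
	struct zswap_entry *entry;

	entry = zswap_rb_search(root, offset);
	if (entry)
		zswap_entry_get(entry);

	return entry;
}

A caller then brackets any use of an entry with zswap_entry_find_get() and
zswap_entry_put(); once the entry has been erased from the tree, the final put
is what frees it.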


Re: [PATCH v3 2/3] mm/zswap: bugfix: memory leak when invalidate and reclaim occur concurrently

2013-10-11 Thread Bob Liu
On Fri, Oct 11, 2013 at 3:13 PM, Minchan Kim minc...@kernel.org wrote:
 On Thu, Sep 26, 2013 at 11:42:17AM +0800, Weijie Yang wrote:
 On Tue, Sep 24, 2013 at 9:03 AM, Minchan Kim minc...@kernel.org wrote:
  On Mon, Sep 23, 2013 at 04:21:49PM +0800, Weijie Yang wrote:
  
   Modify:
- check the refcount in fail path, free memory if it is not referenced.
 
  Hmm, I don't like this because zswap refcount routine is already mess for 
  me.
  I'm not sure why it was designed from the beginning. I hope we should fix 
  it first.
 
  1. zswap_rb_serach could include zswap_entry_get semantic if it founds a 
  entry from
 the tree. Of course, we should ranme it as find_get_zswap_entry like 
  find_get_page.
  2. zswap_entry_put could hide resource free function like zswap_free_entry 
  so that
 all of caller can use it easily following pattern.
 
find_get_zswap_entry
...
...
zswap_entry_put
 
  Of course, zswap_entry_put have to check the entry is in the tree or not
  so if someone already removes it from the tree, it should avoid double 
  remove.
 
  One of the concern I can think is that approach extends critical section
  but I think it would be no problem because more bottleneck would be 
  [de]compress
  functions. If it were really problem, we can mitigate a problem with moving
  unnecessary functions out of zswap_free_entry because it seem to be rather
  over-enginnering.

 I refactor the zswap refcount routine according to Minchan's idea.
 Here is the new patch, Any suggestion is welcomed.

 To Seth and Bob, would you please review it again?

 Yeah, Seth, Bob. You guys are right persons to review this because this
 scheme was suggested by me who is biased so it couldn't be a fair. ;-)
 But anyway, I will review code itself.


 mm/zswap.c |  116
 
  1 file changed, 52 insertions(+), 64 deletions(-)

 diff --git a/mm/zswap.c b/mm/zswap.c
 old mode 100644
 new mode 100755
 index deda2b6..bd04910
 --- a/mm/zswap.c
 +++ b/mm/zswap.c
 @@ -217,6 +217,7 @@ static struct zswap_entry *zswap_entry_cache_alloc(gfp_t 
 gfp)
   if (!entry)
   return NULL;
   entry-refcount = 1;
 + RB_CLEAR_NODE(entry-rbnode);
   return entry;
  }

 @@ -232,10 +233,20 @@ static void zswap_entry_get(struct zswap_entry *entry)
  }

  /* caller must hold the tree lock */
 -static int zswap_entry_put(struct zswap_entry *entry)
 +static int zswap_entry_put(struct zswap_tree *tree, struct zswap_entry 
 *entry)

 Why should we have return value? If we really need it, it mitigates
 get/put semantic's whole point so I'd like to just return void.

 Let me see.

  {
 - entry-refcount--;
 - return entry-refcount;
 + int refcount = --entry-refcount;
 +
 + if (refcount = 0) {

 Hmm, I don't like minus refcount, really.
 I hope we could do following as

 BUG_ON(refcount  0);
 if (refcount == 0) {
 ...
 }



 + if (!RB_EMPTY_NODE(entry-rbnode)) {
 + rb_erase(entry-rbnode, tree-rbroot);
 + RB_CLEAR_NODE(entry-rbnode);

 Minor,
 You could make new function zswap_rb_del or zswap_rb_remove which detach the 
 node
 from rb tree and clear node because we have already zswap_rb_insert.


 + }
 +
 + zswap_free_entry(tree, entry);
 + }
 +
 + return refcount;
  }

  /*
 @@ -258,6 +269,17 @@ static struct zswap_entry *zswap_rb_search(struct 
 rb_root *root, pgoff_t offset)
   return NULL;
  }


 Add function description.

 +static struct zswap_entry *zswap_entry_find_get(struct rb_root *root, 
 pgoff_t offset)
 +{
 + struct zswap_entry *entry = NULL;
 +
 + entry = zswap_rb_search(root, offset);
 + if (entry)
 + zswap_entry_get(entry);
 +
 + return entry;
 +}
 +
  /*
   * In the case that a entry with the same offset is found, a pointer to
   * the existing entry is stored in dupentry and the function returns -EEXIST
 @@ -387,7 +409,7 @@ static void zswap_free_entry(struct zswap_tree *tree, 
 struct zswap_entry *entry)
  enum zswap_get_swap_ret {
   ZSWAP_SWAPCACHE_NEW,
   ZSWAP_SWAPCACHE_EXIST,
 - ZSWAP_SWAPCACHE_NOMEM
 + ZSWAP_SWAPCACHE_FAIL,
  };

  /*
 @@ -401,9 +423,9 @@ enum zswap_get_swap_ret {
   * added to the swap cache, and returned in retpage.
   *
   * If success, the swap cache page is returned in retpage
 - * Returns 0 if page was already in the swap cache, page is not locked
 - * Returns 1 if the new page needs to be populated, page is locked
 - * Returns 0 on error
 + * Returns ZSWAP_SWAPCACHE_EXIST if page was already in the swap cache
 + * Returns ZSWAP_SWAPCACHE_NEW if the new page needs to be populated, page 
 is locked
 + * Returns ZSWAP_SWAPCACHE_FAIL on error
   */
  static int zswap_get_swap_cache_page(swp_entry_t entry

Re: [PATCH] frontswap: enable call to invalidate area on swapoff

2013-10-09 Thread Bob Liu

On 10/09/2013 10:40 PM, Seth Jennings wrote:
> On Tue, Oct 08, 2013 at 01:08:53PM -0700, Andrew Morton wrote:
>> On Tue, 08 Oct 2013 10:13:20 +0200 Krzysztof Kozlowski 
>>  wrote:
>>
>>> On pon, 2013-10-07 at 15:03 -0700, Andrew Morton wrote:
>>>> On Mon, 07 Oct 2013 17:25:41 +0200 Krzysztof Kozlowski 
>>>>  wrote:
>>>>
>>>>> During swapoff the frontswap_map was NULL-ified before calling
>>>>> frontswap_invalidate_area(). However the frontswap_invalidate_area()
>>>>> exits early if frontswap_map is NULL. Invalidate was never called during
>>>>> swapoff.
>>>>>
>>>>> This patch moves frontswap_map_set() in swapoff just after calling
>>>>> frontswap_invalidate_area() so outside of locks
>>>>> (swap_lock and swap_info_struct->lock). This shouldn't be a problem as
>>>>> during swapon the frontswap_map_set() is called also outside of any
>>>>> locks.
>>>>>
>>>>
>>>> Ahem.  So there's a bunch of code in __frontswap_invalidate_area()
>>>> which hasn't ever been executed and nobody noticed it.  So perhaps that
>>>> code isn't actually needed?
>>>>
>>>> More seriously, this patch looks like it enables code which hasn't been
>>>> used or tested before.  How well tested was this?
>>>>
>>>> Are there any runtime-visible effects from this change?
>>>
>>> I tested zswap on x86 and x86-64 and there was no difference. This is
>>> good as there shouldn't be visible anything because swapoff is unusing
>>> all pages anyway:
>>> try_to_unuse(type, false, 0); /* force all pages to be unused */
>>>
>>> I haven't tested other frontswap users.
>>
>> So is that code in __frontswap_invalidate_area() unneeded?
> 
> Yes, to expand on what Bob said, __frontswap_invalidate_area() is still
> needed to let any frontswap backend free per-swaptype resources.
> 
> __frontswap_invalidate_area() is _not_ for freeing structures associated
> with individual swapped out pages since all of the pages should be
> brought back into memory by try_to_unuse() before
> __frontswap_invalidate_area() is called.
> 
> The reason we never noticed this for zswap is that zswap has no
> dynamically allocated per-type resources.  In the expected case,
> where all of the pages have been drained from zswap,
> zswap_frontswap_invalidate_area() is a no-op.
> 

Not exactly, see the bug fix "mm/zswap: bugfix: memory leak when
re-swapon" from Weijie.
Zswap needs invalidate_area() also.

Thanks,
-Bob

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] frontswap: enable call to invalidate area on swapoff

2013-10-09 Thread Bob Liu

On 10/09/2013 04:08 AM, Andrew Morton wrote:
> On Tue, 08 Oct 2013 10:13:20 +0200 Krzysztof Kozlowski 
>  wrote:
> 
>> On pon, 2013-10-07 at 15:03 -0700, Andrew Morton wrote:
>>> On Mon, 07 Oct 2013 17:25:41 +0200 Krzysztof Kozlowski 
>>>  wrote:
>>>
>>>> During swapoff the frontswap_map was NULL-ified before calling
>>>> frontswap_invalidate_area(). However the frontswap_invalidate_area()
>>>> exits early if frontswap_map is NULL. Invalidate was never called during
>>>> swapoff.
>>>>
>>>> This patch moves frontswap_map_set() in swapoff just after calling
>>>> frontswap_invalidate_area() so outside of locks
>>>> (swap_lock and swap_info_struct->lock). This shouldn't be a problem as
>>>> during swapon the frontswap_map_set() is called also outside of any
>>>> locks.
>>>>
>>>
>>> Ahem.  So there's a bunch of code in __frontswap_invalidate_area()
>>> which hasn't ever been executed and nobody noticed it.  So perhaps that
>>> code isn't actually needed?
>>>
>>> More seriously, this patch looks like it enables code which hasn't been
>>> used or tested before.  How well tested was this?
>>>
>>> Are there any runtime-visible effects from this change?
>>
>> I tested zswap on x86 and x86-64 and there was no difference. This is
>> good as there shouldn't be visible anything because swapoff is unusing
>> all pages anyway:
>>  try_to_unuse(type, false, 0); /* force all pages to be unused */
>>
>> I haven't tested other frontswap users.
> 
> So is that code in __frontswap_invalidate_area() unneeded?
> 

I don't think so; it's still needed, otherwise there will be a memory leak.
I'm afraid nobody noticed the memory leak here before, and this patch can
fix it. Sorry for not paying enough attention, but please keep
__frontswap_invalidate_area().

-- 
Regards,
-Bob
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
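
To make the ordering concrete, here is a rough sketch of the tail of the
swapoff path with the fix applied (abbreviated and not the literal patch; the
point is only that the invalidate now runs while frontswap_map is still set):

	frontswap_map = frontswap_map_get(p);
	spin_unlock(&p->lock);
	spin_unlock(&swap_lock);
	/* the backend (zswap, zcache, ...) still sees a non-NULL map here */
	frontswap_invalidate_area(p->type);
	/* only now forget about the map and free it */
	frontswap_map_set(p, NULL);
	mutex_unlock(&swapon_mutex);
	vfree(swap_map);
	vfree(frontswap_map);

Before the change, frontswap_map_set(p, NULL) ran first, so
__frontswap_invalidate_area() bailed out early and the backend's
invalidate_area hook was never reached.
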


[PATCH] mm: pagevec: cleanup: drop pvec->cold argument in all places

2013-09-28 Thread Bob Liu
Nobody uses the pvec->cold argument of pagevec, and it is also unreasonable to
release the pages in a pagevec as cold pages, so drop the cold argument from
pagevec.

Signed-off-by: Bob Liu 
---
 fs/9p/cache.c   |2 +-
 fs/afs/cache.c  |2 +-
 fs/afs/write.c  |4 ++--
 fs/btrfs/extent_io.c|4 ++--
 fs/cachefiles/rdwr.c|8 
 fs/ceph/addr.c  |6 +++---
 fs/ceph/cache.c |2 +-
 fs/cifs/cache.c |2 +-
 fs/ext4/file.c  |2 +-
 fs/ext4/inode.c |6 +++---
 fs/f2fs/checkpoint.c|2 +-
 fs/f2fs/node.c  |2 +-
 fs/fscache/page.c   |4 ++--
 fs/gfs2/aops.c  |2 +-
 fs/hugetlbfs/inode.c|4 ++--
 fs/nfs/fscache-index.c  |2 +-
 fs/nilfs2/btree.c   |2 +-
 fs/nilfs2/page.c|8 
 fs/nilfs2/segment.c |4 ++--
 fs/xfs/xfs_aops.c   |2 +-
 fs/xfs/xfs_file.c   |2 +-
 include/linux/pagevec.h |9 +
 mm/filemap.c|2 +-
 mm/mlock.c  |4 ++--
 mm/page-writeback.c |2 +-
 mm/shmem.c  |6 +++---
 mm/swap.c   |8 
 mm/truncate.c   |6 +++---
 28 files changed, 51 insertions(+), 58 deletions(-)

diff --git a/fs/9p/cache.c b/fs/9p/cache.c
index a9ea73d..c17798c 100644
--- a/fs/9p/cache.c
+++ b/fs/9p/cache.c
@@ -158,7 +158,7 @@ static void v9fs_cache_inode_now_uncached(void 
*cookie_netfs_data)
pgoff_t first;
int loop, nr_pages;
 
-   pagevec_init(&pvec, 0);
+   pagevec_init(&pvec);
first = 0;
 
for (;;) {
diff --git a/fs/afs/cache.c b/fs/afs/cache.c
index 577763c3..ae3fb3e 100644
--- a/fs/afs/cache.c
+++ b/fs/afs/cache.c
@@ -377,7 +377,7 @@ static void afs_vnode_cache_now_uncached(void 
*cookie_netfs_data)
_enter("{%x,%x,%Lx}",
   vnode->fid.vnode, vnode->fid.unique, vnode->status.data_version);
 
-   pagevec_init(&pvec, 0);
+   pagevec_init(&pvec);
first = 0;
 
for (;;) {
diff --git a/fs/afs/write.c b/fs/afs/write.c
index a890db4..7ace426 100644
--- a/fs/afs/write.c
+++ b/fs/afs/write.c
@@ -284,7 +284,7 @@ static void afs_kill_pages(struct afs_vnode *vnode, bool 
error,
_enter("{%x:%u},%lx-%lx",
   vnode->fid.vid, vnode->fid.vnode, first, last);
 
-   pagevec_init(&pv, 0);
+   pagevec_init(&pv);
 
do {
_debug("kill %lx-%lx", first, last);
@@ -582,7 +582,7 @@ void afs_pages_written_back(struct afs_vnode *vnode, struct 
afs_call *call)
 
ASSERT(wb != NULL);
 
-   pagevec_init(&pv, 0);
+   pagevec_init(&pv);
 
do {
_debug("done %lx-%lx", first, last);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 09582b8..11deea7 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3517,7 +3517,7 @@ int btree_write_cache_pages(struct address_space *mapping,
int scanned = 0;
int tag;
 
-   pagevec_init(&pvec, 0);
+   pagevec_init(&pvec);
if (wbc->range_cyclic) {
index = mapping->writeback_index; /* Start from prev offset */
end = -1;
@@ -3661,7 +3661,7 @@ static int extent_write_cache_pages(struct extent_io_tree 
*tree,
if (!igrab(inode))
return 0;
 
-   pagevec_init(&pvec, 0);
+   pagevec_init(&pvec);
if (wbc->range_cyclic) {
index = mapping->writeback_index; /* Start from prev offset */
end = -1;
diff --git a/fs/cachefiles/rdwr.c b/fs/cachefiles/rdwr.c
index ebaff36..aa82d15 100644
--- a/fs/cachefiles/rdwr.c
+++ b/fs/cachefiles/rdwr.c
@@ -160,7 +160,7 @@ static void cachefiles_read_copier(struct fscache_operation 
*_op)
 
_enter("{ino=%lu}", object->backer->d_inode->i_ino);
 
-   pagevec_init(&pagevec, 0);
+   pagevec_init(&pagevec);
 
max = 8;
	spin_lock_irq(&object->work_lock);
@@ -429,7 +429,7 @@ int cachefiles_read_or_alloc_page(struct fscache_retrieval 
*op,
op->op.flags |= FSCACHE_OP_ASYNC;
op->op.processor = cachefiles_read_copier;
 
-   pagevec_init(&pagevec, 0);
+   pagevec_init(&pagevec);
 
/* we assume the absence or presence of the first block is a good
 * enough indication for the page as a whole
@@ -729,7 +729,7 @@ int cachefiles_read_or_alloc_pages(struct fscache_retrieval 
*op,
 
shift = PAGE_SHIFT - inode->i_sb->s_blocksize_bits;
 
-   pagevec_init(&pagevec, 0);
+   pagevec_init(&pagevec);
 
op->op.flags &= FSCACHE_OP_KEEP_FLAGS;
op->op.flags |= FSCACHE_OP_ASYNC;
@@ -863,7 +863,7 @@ int cachefiles_allocate_pages(struct fscache_retrieval *op,
 
ret = cachefiles_has_space(cache, 0, *nr_pages);
if (ret == 0) {
-   pagevec_init(&pagevec, 0);
+   pagevec_init(&pagevec);
 
list_for_each_entry(page, pages, lru) {
 if (pagevec_add(&pagevec, page) == 0)
di
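
The include/linux/pagevec.h hunk is not visible above; roughly, the header
side of this cleanup would look like the sketch below (field names follow the
existing structure, the exact hunk may differ):

struct pagevec {
	unsigned long nr;
	/* the unused 'unsigned long cold;' member is dropped */
	struct page *pages[PAGEVEC_SIZE];
};

static inline void pagevec_init(struct pagevec *pvec)
{
	pvec->nr = 0;
}

Call sites then simply become pagevec_init(&pvec) instead of
pagevec_init(&pvec, 0), which is what the per-filesystem hunks above do.
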

Re: [PATCH v2 0/5] mm: migrate zbud pages

2013-09-27 Thread Bob Liu
 zswap layer and the radix slots would
> hold zbud handles, not struct page pointers.
> 
> However, as I have discovered today, this is problematic when it comes
> to reclaim and migration and serializing access.
> 
> I wanted to do as much as possible in the zswap layer since anything
> done in the zbud layer would need to be duplicated in any other future
> allocator that zswap wanted to support.
> 
> Unfortunately, zbud abstracts away the struct page and that visibility
> is needed to properly do what we are talking about.
> 
> So maybe it is inevitable that this will need to be in the zbud code
> with the radix tree slots pointing to struct pages after all.
> 

But in this way, zswap_frontswap_load() can't find zswap_entry. We still
need the rbtree in current zswap.

> I like the idea of masking the bit into the struct page pointer to
> indicate which buddy maps to the offset.
> 

I have no idea why we need this.
My idea is to connect the zbud page with an address space and add the zbud
page to the LRU list only, without any radix tree.

zswap_entry can still be kept in the rbtree, or maybe changed to a radix tree.
There is sample code in my previous email.

-- 
Regards,
-Bob
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
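
A rough sketch of the pointer-tagging idea referred to above (masking a bit
into the struct page pointer); the helper names are hypothetical, the point is
only that struct page pointers are at least word aligned, so the low bit of a
radix-tree slot is free to record which zbud buddy an offset maps to:

#define ZBUD_BUDDY_MASK	1UL	/* hypothetical: low bit of the slot */

static inline void *zbud_encode_slot(struct page *page, int buddy)
{
	return (void *)((unsigned long)page |
			((unsigned long)buddy & ZBUD_BUDDY_MASK));
}

static inline struct page *zbud_slot_to_page(void *slot)
{
	return (struct page *)((unsigned long)slot & ~ZBUD_BUDDY_MASK);
}

static inline int zbud_slot_to_buddy(void *slot)
{
	return (unsigned long)slot & ZBUD_BUDDY_MASK;
}

Extraction then only costs the (x & ~mask) mentioned in the thread, at the
price of stripping the tag everywhere the slot is dereferenced.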


Re: [PATCH v2 0/5] mm: migrate zbud pages

2013-09-27 Thread Bob Liu


On 09/28/2013 06:00 AM, Seth Jennings wrote:
 On Fri, Sep 27, 2013 at 12:16:37PM +0200, Tomasz Stanislawski wrote:
 On 09/25/2013 11:57 PM, Seth Jennings wrote:
 On Wed, Sep 25, 2013 at 07:09:50PM +0200, Tomasz Stanislawski wrote:
 I just had an idea this afternoon to potentially kill both these birds 
 with one
 stone: Replace the rbtree in zswap with an address_space.

 Each swap type would have its own page_tree to organize the compressed 
 objects
 by type and offset (radix tree is more suited for this anyway) and a_ops 
 that
 could be called by shrink_page_list() (writepage) or the migration code
 (migratepage).

 Then zbud pages could be put on the normal LRU list, maybe at the 
 beginning of
 the inactive LRU so they would live for another cycle through the list, 
 then be
 reclaimed in the normal way with the mapping-a_ops-writepage() pointing 
 to a
 zswap_writepage() function that would decompress the pages and call
 __swap_writepage() on them.

 This might actually do away with the explicit pool size too as the 
 compressed
 pool pages wouldn't be outside the control of the MM anymore.

 I'm just starting to explore this but I think it has promise.

 Seth


 Hi Seth,
 There is a problem with the proposed idea.
 The radix tree used 'struct address_space' is a part of
 a bigger data structure.
 The radix tree is used to translate an offset to a page.
 That is ok for zswap. But struct page has a field named 'index'.
 The MM assumes that this index is an offset in radix tree
 where one can find the page. A lot is done by MM to sustain
 this consistency.

 Yes, this is how it is for page cache pages.  However, the MM is able to
 work differently with anonymous pages.  In the case of an anonymous
 page, the mapping field points to an anon_vma struct, or, if ksm in
 enabled and dedup'ing the page, a private ksm tracking structure.  If
 the anonymous page is fully unmapped and resides only in the swap cache,
 the page mapping is NULL.  So there is precedent for the fields to mean
 other things.

 Hi Seth,
 You are right that page-mapping is NULL for pages in swap_cache but
 page_mapping() is not NULL in such a case. The mapping is taken from
 struct address_space swapper_spaces[]. It is still an address space,
 and it should preserve constraints for struct address_space.
 The same happen for page-index and page_index().


 The question is how to mark and identify zbud pages among the other page
 types that will be on the LRU.  There are many ways.  The question is
 what is the best and most acceptable way.


 If you consider hacking I have some idea how address_space could utilized 
 for ZBUD.
 One solution whould be using tags in a radix tree. Every entry in a radix 
 tree
 can have a few bits assigned to it. Currently 3 bits are supported:

 From include/linux/fs.h
 #define PAGECACHE_TAG_DIRTY  0
 #define PAGECACHE_TAG_WRITEBACK  1
 #define PAGECACHE_TAG_TOWRITE2

 You could add a new bit or utilize one of existing ones.

 The other idea is use a trick from a RB trees and scatter-gather lists.
 I mean using the last bits of pointers to keep some metadata.
 Values of 'struct page *' variables are aligned to a pointer alignment which 
 is
 4 for 32-bit CPUs and 8 for 64-bit ones (not sure). This means that one could
 could use the last bit of page pointer in a radix tree to track if a swap 
 entry
 refers to a lower or a higher part of a ZBUD page.
 I think it is a serious hacking/obfuscation but it may work with the minimal
 amount of changes to MM. Adding only (x~3) while extracting page pointer is
 probably enough.

 What do you think about this idea?
 
 I think it is a good one.
 
 I have to say that when I first came up with the idea, I was thinking
 the address space would be at the zswap layer and the radix slots would
 hold zbud handles, not struct page pointers.
 
 However, as I have discovered today, this is problematic when it comes
 to reclaim and migration and serializing access.
 
 I wanted to do as much as possible in the zswap layer since anything
 done in the zbud layer would need to be duplicated in any other future
 allocator that zswap wanted to support.
 
 Unfortunately, zbud abstracts away the struct page and that visibility
 is needed to properly do what we are talking about.
 
 So maybe it is inevitable that this will need to be in the zbud code
 with the radix tree slots pointing to struct pages after all.
 

But in this way, zswap_frontswap_load() can't find zswap_entry. We still
need the rbtree in current zswap.

 I like the idea of masking the bit into the struct page pointer to
 indicate which buddy maps to the offset.
 

I have no idea why we need this.
My idea is connect zbud page with a address space and add zbud page to
LRU list only without any radix tree.

zswap_entry can be still in rbtree or maybe changed to radix tree.
There is a sample code in my previous email.

-- 
Regards,
-Bob
--
To unsubscribe from this list: send the line unsubscribe linux

Re: [BUG REPORT] ZSWAP: theoretical race condition issues

2013-09-26 Thread Bob Liu
On Thu, Sep 26, 2013 at 4:48 PM, Weijie Yang  wrote:
> On Thu, Sep 26, 2013 at 3:57 PM, Minchan Kim  wrote:
>> On Thu, Sep 26, 2013 at 03:26:33PM +0800, Weijie Yang wrote:
>>> On Thu, Sep 26, 2013 at 1:58 PM, Minchan Kim  wrote:
>>> > Hello Weigie,
>>> >
>>> > On Wed, Sep 25, 2013 at 05:33:43PM +0800, Weijie Yang wrote:
>>> >> On Wed, Sep 25, 2013 at 4:31 PM, Bob Liu  wrote:
>>> >> > On Wed, Sep 25, 2013 at 4:09 PM, Weijie Yang 
>>> >> >  wrote:
>>> >> >> I think I find a new issue, for integrity of this mail thread, I reply
>>> >> >> to this mail.
>>> >> >>
>>> >> >> It is a concurrence issue either, when duplicate store and reclaim
>>> >> >> concurrentlly.
>>> >> >>
>>> >> >> zswap entry x with offset A is already stored in zswap backend.
>>> >> >> Consider the following scenario:
>>> >> >>
>>> >> >> thread 0: reclaim entry x (get refcount, but not call 
>>> >> >> zswap_get_swap_cache_page)
>>> >> >>
>>> >> >> thread 1: store new page with the same offset A, alloc a new zswap 
>>> >> >> entry y.
>>> >> >>   store finished. shrink_page_list() call __remove_mapping(), and now
>>> >> >> it is not in swap_cache
>>> >> >>
>>> >> >
>>> >> > But I don't think swap layer will call zswap with the same offset A.
>>> >>
>>> >> 1. store page of offset A in zswap
>>> >> 2. some time later, pagefault occur, load page data from zswap.
>>> >>   But notice that zswap entry x is still in zswap because it is not
>>> >> frontswap_tmem_exclusive_gets_enabled.
>>> >
>>> > frontswap_tmem_exclusive_gets_enabled is just option to see tradeoff
>>> > between CPU burining by frequent swapout and memory footprint by duplicate
>>> > copy in swap cache and frontswap backend so it shouldn't affect the 
>>> > stability.
>>>
>>> Thanks for explain this.
>>> I don't mean to say this option affects the stability,  but that zswap
>>> only realize
>>> one option. Maybe it's better to realize both options for different 
>>> workloads.
>>
>> "zswap only relize one option"
>> What does it mean? Sorry. I couldn't parse your intention. :)
>> You mean zswap should do something special to support 
>> frontswap_tmem_exclusive_gets?
>
> Yes. But I am not sure whether it is worth.
>
>>>
>>> >>  this page is with PageSwapCache(page) and page_private(page) = entry.val
>>> >> 3. change this page data, and it become dirty
>>> >
>>> > If non-shared swapin page become redirty, it should remove the page from
>>> > swapcache. If shared swapin page become redirty, it should do CoW so it's 
>>> > a
>>> > new page so that it doesn't live in swap cache. It means it should have 
>>> > new
>>> > offset which is different with old's one for swap out.
>>> >
>>> > What's wrong with that?
>>>
>>> It is really not a right scene for duplicate store. And I can not think out 
>>> one.
>>> If duplicate store is impossible, How about delete the handle code in zswap?
>>> If it does exist, I think there is a potential issue as I described.
>>
>> You mean "zswap_duplicate_entry"?
>> AFAIR, I already had a question to Seth when zswap was born but AFAIRC,
>> he said that he didn't know exact reason but he saw that case during
>> experiment so copied the code piece from zcache.
>>
>> Do you see the case, too?
>
> Yes, I mean duplicate store.
> I check the /Documentation/vm/frontswap.txt, it mentions "duplicate stores",
> but I am still confused.
>
> I wrote a zcache varietas which swap out compressed page to swapfile.
> I did see that case when I test it on andorid smartphone(arm v7),

Why not test it based on zswap directly?
My suggestion is to do more and more stress testing against the current
zswap; if any bug or unusual performance
issue is really triggered, then we dig it out and fix it.

-- 
Regards,
--Bob


Re: [munlock] BUG: Bad page map in process killall5 pte:cf17e720 pmd:05a22067

2013-09-26 Thread Bob Liu
On Thu, Sep 26, 2013 at 2:49 PM, Vlastimil Babka  wrote:
> On 09/26/2013 02:40 AM, Fengguang Wu wrote:
>> Hi Vlastimil,
>>
>> FYI, this bug seems still not fixed in linux-next 20130925.
>
> Hi,
>
> I sent (including you) a RFC patch and later reviewed patch about week
> ago. I assumed you would test it, but I probably should make that
> request explicit, sorry. Anyway it was added to -mm an hour before your
> mail.
>

Great! And please ignore my noise in this thread.

-- 
Regards,
--Bob


Re: [munlock] BUG: Bad page map in process killall5 pte:cf17e720 pmd:05a22067

2013-09-25 Thread Bob Liu
On Thu, Sep 26, 2013 at 9:59 AM, Fengguang Wu  wrote:
> Hi Bob,
>
> On Thu, Sep 26, 2013 at 09:48:08AM +0800, Bob Liu wrote:
>> Hi Fengguang,
>>
>> Would you please have a try with the attached patch?
>> It added a small fix based on Vlastimil's patch.
>
> Thanks for the quick response! I just noticed Andrew added this patch
> to -mm tree:
>
> --
> From: Vlastimil Babka 
> Subject: mm/mlock.c: prevent walking off the end of a pagetable in no-pmd 
> configuration
>
> What's the git tree your v2 patch based on? If you already had a git

It's based on v3.12-rc1.  Should I send you a new one based on latest mmotm?

Thanks,
-Bob


Re: [munlock] BUG: Bad page map in process killall5 pte:cf17e720 pmd:05a22067

2013-09-25 Thread Bob Liu
Hi Fengguang,

Would you please have a try with the attached patch?
It added a small fix based on Vlastimil's patch.

Thanks,
-Bob

On 09/26/2013 08:40 AM, Fengguang Wu wrote:
> Hi Vlastimil,
> 
> FYI, this bug seems still not fixed in linux-next 20130925.
> 
> commit 7a8010cd36273ff5f6fea5201ef9232f30cebbd9
> Author: Vlastimil Babka 
> Date:   Wed Sep 11 14:22:35 2013 -0700
> 
> mm: munlock: manual pte walk in fast path instead of follow_page_mask()
> 
> Currently munlock_vma_pages_range() calls follow_page_mask() to obtain
> each individual struct page.  This entails repeated full page table
> translations and page table lock taken for each page separately.
> 
> This patch avoids the costly follow_page_mask() where possible, by
> iterating over ptes within single pmd under single page table lock.  The
> first pte is obtained by get_locked_pte() for non-THP page acquired by the
> initial follow_page_mask().  The rest of the on-stack pagevec for munlock
> is filled up using pte_walk as long as pte_present() and vm_normal_page()
> are sufficient to obtain the struct page.
> 
> After this patch, a 14% speedup was measured for munlocking a 56GB large
> memory area with THP disabled.
> 
> Signed-off-by: Vlastimil Babka 
> Cc: Jörn Engel 
> Cc: Mel Gorman 
> Cc: Michel Lespinasse 
> Cc: Hugh Dickins 
> Cc: Rik van Riel 
> Cc: Johannes Weiner 
> Cc: Michal Hocko 
> Cc: Vlastimil Babka 
> Signed-off-by: Andrew Morton 
> Signed-off-by: Linus Torvalds 
> 
> 
> [   89.835504] init: plymouth-upstart-bridge main process (3556) terminated 
> with status 1
> [   89.986606] init: tty6 main process (3529) killed by TERM signal
> [   91.414086] BUG: Bad page map in process killall5  pte:cf17e720 
> pmd:05a22067
> [   91.416626] addr:bfc0 vm_flags:00100173 anon_vma:cf128c80 mapping:  
> (null) index:bfff0
> [   91.419402] CPU: 0 PID: 3574 Comm: killall5 Not tainted 
> 3.12.0-rc1-00010-g5fbc0a6 #24
> [   91.422171] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> [   91.423998]    c0199e34 c1db5db4  c0199e54 
> c10e72d4 000bfff0
> [   91.427933]   bfc0  000cf17e cf17e720 c0199e74 
> c10e7995 
> [   91.431940]  bfc0 cf1ca190 bfc0 cf18 cf1ca190 c0199ee0 
> c10eb8cf ce6d1900
> [   91.435894] Call Trace:
> [   91.436969]  [<c1db5db4>] dump_stack+0x4b/0x66
> [   91.438503]  [<c10e72d4>] print_bad_pte+0x14b/0x162
> [   91.440204]  [<c10e7995>] vm_normal_page+0x67/0x9b
> [   91.441811]  [<c10eb8cf>] munlock_vma_pages_range+0xf9/0x176
> [   91.443633]  [<c10ede09>] exit_mmap+0x86/0xf7
> [   91.445156]  [<c10885b8>] ? lock_release+0x169/0x1ef
> [   91.446795]  [<c113e5b6>] ? rcu_read_unlock+0x17/0x23
> [   91.448465]  [<c113effe>] ? exit_aio+0x2b/0x6c
> [   91.449990]  [<c103d4b0>] mmput+0x6a/0xcb
> [   91.451508]  [<c104141a>] do_exit+0x362/0x8be
> [   91.453013]  [<c105d280>] ? hrtimer_debug_hint+0xd/0xd
> [   91.454700]  [<c10419f8>] do_group_exit+0x51/0x9e
> [   91.456296]  [<c1041a5b>] SyS_exit_group+0x16/0x16
> [   91.457901]  [<c1dc6719>] sysenter_do_call+0x12/0x33
> [   91.459553] Disabling lock debugging due to kernel taint
> 
> git bisect start 272b98c6455f00884f0350f775c5342358ebb73f v3.11 --
> git bisect good 57d730924d5cc2c3e280af16a9306587c3a511db  # 02:21495+  
> Merge branch 'timers-urgent-for-linus' of 
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
> git bisect good 3bb22ec53e2bd12a241ed84359bffd591a40ab87  # 12:03495+  
> staging/lustre/ptlrpc: convert to new shrinker API
> git bisect  bad a5b7c87f92076352dbff2fe0423ec255e1c9a71b  # 12:18 31-  
> vmscan, memcg: do softlimit reclaim also for targeted reclaim
> git bisect good 3d94ea51c1d8db6f41268a9d2aea5f5771e9a8d3  # 15:40495+  
> ocfs2: clean up dead code in ocfs2_acl_from_xattr()
> git bisect  bad d62a201f24cba74e2fbf9f6f7af86ff5f5e276fc  # 16:46 79-  
> checkpatch: enforce sane perl version
> git bisect good 83467efbdb7948146581a56cbd683a22a0684bbb  # 01:29585+  
> mm: migrate: check movability of hugepage in unmap_and_move_huge_page()
> git bisect  bad 2bff24a3707093c435ab3241c47dcdb5f16e432b  # 02:07148-  
> memcg: fix multiple large threshold notifications
> git bisect  bad 1ecfd533f4c528b0b4cc5bc115c4c47f0b5e4828  # 02:34 64-  
> mm/mremap.c: call pud_free() after fail calling pmd_alloc()
> git bisect good 0ec3b74c7f5599c8a4d2b33d430a5470af26ebf6  # 13:10   1170+  
> mm: putback_lru_page: remove unnecessary call to page_lru_base_type()
> git bisect good 5b40998ae35cf64561868370e6c9f3d3e94b6bf7  # 16:52   1170+  
> mm: munlock: remove redundant get_page/put_page pair on the fast path
> git bisect  bad 187320932dcece9c4b93f38f56d1f888bd5c325f  # 17:11  0-  
mm/sparse: introduce alloc_usemap_and_memmap

Re: [BUG REPORT] ZSWAP: theoretical race condition issues

2013-09-25 Thread Bob Liu
On Wed, Sep 25, 2013 at 5:33 PM, Weijie Yang  wrote:
> On Wed, Sep 25, 2013 at 4:31 PM, Bob Liu  wrote:
>> On Wed, Sep 25, 2013 at 4:09 PM, Weijie Yang  
>> wrote:
>>> I think I find a new issue, for integrity of this mail thread, I reply
>>> to this mail.
>>>
>>> It is a concurrence issue either, when duplicate store and reclaim
>>> concurrentlly.
>>>
>>> zswap entry x with offset A is already stored in zswap backend.
>>> Consider the following scenario:
>>>
>>> thread 0: reclaim entry x (get refcount, but not call 
>>> zswap_get_swap_cache_page)
>>>
>>> thread 1: store new page with the same offset A, alloc a new zswap entry y.
>>>   store finished. shrink_page_list() call __remove_mapping(), and now
>>> it is not in swap_cache
>>>
>>
>> But I don't think swap layer will call zswap with the same offset A.
>
> 1. store page of offset A in zswap
> 2. some time later, pagefault occur, load page data from zswap.
>   But notice that zswap entry x is still in zswap because it is not

Sorry I didn't notice that zswap_frontswap_load() doesn't call rb_erase().

> frontswap_tmem_exclusive_gets_enabled.
>  this page is with PageSwapCache(page) and page_private(page) = entry.val
> 3. change this page data, and it become dirty
> 4. some time later again, swap this page on the same offset A.
>
> so, a duplicate store happens.
>

Then I think we should erase the entry from rbtree in zswap_frontswap_load().
After the page is decompressed and loaded from zswap, still storing
the compressed data in zswap is meaningless.
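
For illustration, a rough sketch of how the tail of zswap_frontswap_load()
could drop the entry, following the same pattern the invalidate path uses
(helper names as in the zswap of that time; adjust if the signatures differ.
The interaction with a concurrent writeback that still holds a reference is
exactly the open question in this thread, so treat this as a sketch, not a fix):

	/* after the data has been decompressed into the page; refcount is an int */
	spin_lock(&tree->lock);
	entry = zswap_rb_search(&tree->rbtree, offset);
	if (entry) {
		rb_erase(&entry->rbnode, &tree->rbtree);
		refcount = zswap_entry_put(entry);
	}
	spin_unlock(&tree->lock);
	if (entry && !refcount)
		zswap_free_entry(tree, entry);	/* also frees the zbud allocation */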

-- 
Regards,
--Bob


Re: [BUG REPORT] ZSWAP: theoretical race condition issues

2013-09-25 Thread Bob Liu
On Wed, Sep 25, 2013 at 4:09 PM, Weijie Yang  wrote:
> I think I find a new issue, for integrity of this mail thread, I reply
> to this mail.
>
> It is a concurrence issue either, when duplicate store and reclaim
> concurrentlly.
>
> zswap entry x with offset A is already stored in zswap backend.
> Consider the following scenario:
>
> thread 0: reclaim entry x (get refcount, but not call 
> zswap_get_swap_cache_page)
>
> thread 1: store new page with the same offset A, alloc a new zswap entry y.
>   store finished. shrink_page_list() call __remove_mapping(), and now
> it is not in swap_cache
>

But I don't think swap layer will call zswap with the same offset A.

> thread 0: zswap_get_swap_cache_page called. old page data is added to 
> swap_cache
>
> Now, swap cache has old data rather than new data for offset A.
> error will happen If do_swap_page() get page from swap_cache.
>

-- 
Regards,
--Bob


Re: [PATCH v2 0/5] mm: migrate zbud pages

2013-09-24 Thread Bob Liu
On Tue, Sep 24, 2013 at 5:20 PM, Krzysztof Kozlowski
 wrote:
> Hi,
>
> On pon, 2013-09-23 at 17:07 -0500, Seth Jennings wrote:
>> On Tue, Sep 17, 2013 at 02:59:24PM +0800, Bob Liu wrote:
>> > Mel mentioned several problems about zswap/zbud in thread "[PATCH v6
>> > 0/5] zram/zsmalloc promotion".
>> >
>> > Like "it's clunky as hell and the layering between zswap and zbud is
>> > twisty" and "I think I brought up its stalling behaviour during review
>> > when it was being merged. It would have been preferable if writeback
>> > could be initiated in batches and then waited on at the very least..
>> >  It's worse that it uses _swap_writepage directly instead of going
>> > through a writepage ops.  It would have been better if zbud pages
>> > existed on the LRU and written back with an address space ops and
>> > properly handled asynchonous writeback."
>> >
>> > So I think it would be better if we can address those issues at first
>> > and it would be easier to address these issues before adding more new
>> > features. Welcome any ideas.
>>
>> I just had an idea this afternoon to potentially kill both these birds with 
>> one
>> stone: Replace the rbtree in zswap with an address_space.
>>
>> Each swap type would have its own page_tree to organize the compressed 
>> objects
>> by type and offset (radix tree is more suited for this anyway) and a_ops that
>> could be called by shrink_page_list() (writepage) or the migration code
>> (migratepage).
>>
>> Then zbud pages could be put on the normal LRU list, maybe at the beginning 
>> of
>> the inactive LRU so they would live for another cycle through the list, then 
>> be
>> reclaimed in the normal way with the mapping->a_ops->writepage() pointing to 
>> a
>> zswap_writepage() function that would decompress the pages and call
>> __swap_writepage() on them.
>
> How exactly the address space can be used here? Do you want to point to
> zbud pages in address_space.page_tree? If yes then which index should be
> used?
>

I didn't get the point either. I think introducing an address_space is enough.
1. zbud.c:
static struct address_space_operations zbud_aops = {
.writepage= zswap_write_page,
};
struct address_space zbud_space = {
.a_ops = &zbud_aops,
};

zbud_alloc() {
zbud_page = alloc_page();
zbud_page->mapping = (struct address_space *)&zbud_space;
set_page_private(page, (unsigned long)pool);
lru_add_anon(zbud_page);
}

2. zswap.c
static int zswap_writepage(struct page *page, struct writeback_control *wbc)
{
handle = encode_handle(page_address(page), FIRST);
zswap_writeback_entry(pool, handle);

handle = encode_handle(page_address(page), LAST);
zswap_writeback_entry(pool, handle);
}

Of course it may need lots of work before the core MM subsystem can maintain
zbud pages.
But in this way, we can get rid of the clunky reclaiming layer and
integrate zswap closely with the core MM subsystem, which knows better how
many zbud pages can be used and when it should trigger zbud page
reclaim.

-- 
Regards,
--Bob


Re: [PATCH v2 3/4] mm/zswap: avoid unnecessary page scanning

2013-09-22 Thread Bob Liu

On 09/10/2013 12:29 AM, Seth Jennings wrote:
> On Fri, Sep 06, 2013 at 01:16:45PM +0800, Weijie Yang wrote:
>> add SetPageReclaim before __swap_writepage so that page can be moved to the
>> tail of the inactive list, which can avoid unnecessary page scanning as this
>> page was reclaimed by swap subsystem before.
>>
>> Signed-off-by: Weijie Yang 
> 
> Acked-by: Seth Jennings 
> 

Below is a reply from Mel in original thread "[PATCHv11 3/4] zswap: add
to mm/"
--
> + /* start writeback */
> + SetPageReclaim(page);
> + __swap_writepage(page, &wbc, end_swap_bio_write);
> + page_cache_release(page);
> + zswap_written_back_pages++;
> +

SetPageReclaim? Why?. If the page is under writeback then why do you not
mark it as that? Do not free pages that are currently under writeback
obviously. It's likely that it was PageWriteback you wanted in zbud.c too.
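
For context, what SetPageReclaim buys is the rotation at IO completion:
end_page_writeback() moves reclaim-marked pages to the tail of the inactive
LRU. Roughly (paraphrased from mm/filemap.c of that time, not quoted
verbatim):

void end_page_writeback(struct page *page)
{
	if (TestClearPageReclaim(page))
		rotate_reclaimable_page(page);	/* to the tail of the inactive LRU */

	if (!test_clear_page_writeback(page))
		BUG();

	smp_mb__after_clear_bit();
	wake_up_page(page, PG_writeback);
}

So PG_reclaim only decides where the page ends up on the LRU once the write
finishes; it is PG_writeback that actually keeps reclaim from freeing the
page while the IO is in flight, which is Mel's point.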


-- 
Regards,
-Bob


Re: [RFC PATCH RESEND] mm: munlock: Prevent walking off the end of a pagetable in no-pmd configuration

2013-09-17 Thread Bob Liu
On 09/17/2013 10:22 PM, Vlastimil Babka wrote:
> The function __munlock_pagevec_fill() introduced in commit 7a8010cd3
> ("mm: munlock: manual pte walk in fast path instead of follow_page_mask()")
> uses pmd_addr_end() for restricting its operation within current page table.
> This is insufficient on architectures/configurations where pmd is folded
> and pmd_addr_end() just returns the end of the full range to be walked. In
> this case, it allows pte++ to walk off the end of a page table resulting in
> unpredictable behaviour.
> 
> This patch fixes the function by using pgd_addr_end() and pud_addr_end()
> before pmd_addr_end(), which will yield correct page table boundary on all
> configurations. This is similar to what existing page walkers do when walking
> each level of the page table.
> 
> Additionaly, the patch clarifies a comment for get_locked_pte() call in the
> function.
> 
> Reported-by: Fengguang Wu 
> Cc: Jörn Engel 
> Cc: Mel Gorman 
> Cc: Michel Lespinasse 
> Cc: Hugh Dickins 
> Cc: Rik van Riel 
> Cc: Johannes Weiner 
> Cc: Michal Hocko 
> Cc: Vlastimil Babka 
> Signed-off-by: Vlastimil Babka 
> ---
>  mm/mlock.c | 8 ++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/mlock.c b/mm/mlock.c
> index d638026..758c0fc 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -379,10 +379,14 @@ static unsigned long __munlock_pagevec_fill(struct 
> pagevec *pvec,
>  
>   /*
>* Initialize pte walk starting at the already pinned page where we
> -  * are sure that there is a pte.
> +  * are sure that there is a pte, as it was pinned under the same
> +  * mmap_sem write op.
>*/
>   pte = get_locked_pte(vma->vm_mm, start, &ptl);
> - end = min(end, pmd_addr_end(start, end));
> + /* Make sure we do not cross the page table boundary */
> + end = pgd_addr_end(start, end);
> + end = pud_addr_end(start, end);
> + end = pmd_addr_end(start, end);
>  

Nitpick, how about unfolding pmd_addr_end(start, end) directly? Like:

--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -376,13 +376,14 @@ static unsigned long __munlock_pagevec_fill(struct
pagevec *pvec,
 {
pte_t *pte;
spinlock_t *ptl;
+   unsigned long pmd_end = (start + PMD_SIZE) & PMD_MASK;
+   end = (pmd_end - 1 < end - 1) ? pmd_end : end;

/*
 * Initialize pte walk starting at the already pinned page where we
 * are sure that there is a pte.
 */
pte = get_locked_pte(vma->vm_mm, start, &ptl);
-   end = min(end, pmd_addr_end(start, end));

/* The page next to the pinned page is the first we will try to
get */
start += PAGE_SIZE;

Anyway,
Reviewed-by: Bob Liu 

-- 
Regards,
-Bob


Re: [PATCH v2 0/5] mm: migrate zbud pages

2013-09-17 Thread Bob Liu
Hi Krzysztof,

On 09/11/2013 04:58 PM, Krzysztof Kozlowski wrote:
> Hi,
> 
> Currently zbud pages are not movable and they cannot be allocated from CMA
> (Contiguous Memory Allocator) region. These patches add migration of zbud 
> pages.
> 

I agree that the migration of zbud pages is important so that the system
will not run into order-0 page fragmentation, and it can be helpful for page
compaction/huge pages etc.

But after I looked at [patch 4/5], I found it will make zbud very
complicated.
I'd prefer to add this migration feature later, once the current version of
zswap/zbud becomes good enough and more stable.

Mel mentioned several problems about zswap/zbud in thread "[PATCH v6
0/5] zram/zsmalloc promotion".

Like "it's clunky as hell and the layering between zswap and zbud is
twisty" and "I think I brought up its stalling behaviour during review
when it was being merged. It would have been preferable if writeback
could be initiated in batches and then waited on at the very least..
 It's worse that it uses _swap_writepage directly instead of going
through a writepage ops.  It would have been better if zbud pages
existed on the LRU and written back with an address space ops and
properly handled asynchonous writeback."

So I think it would be better if we can address those issues at first
and it would be easier to address these issues before adding more new
features. Welcome any ideas.

-- 
Regards,
-Bob


Re: [PATCH 0/50] Basic scheduler support for automatic NUMA balancing V7

2013-09-13 Thread Bob Liu
Hi Mel,

On 09/10/2013 05:31 PM, Mel Gorman wrote:
> It has been a long time since V6 of this series and time for an update. Much
> of this is now stabilised with the most important addition being the inclusion
> of Peter and Rik's work on grouping tasks that share pages together.
> 
> This series has a number of goals. It reduces overhead of automatic balancing
> through scan rate reduction and the avoidance of TLB flushes. It selects a
> preferred node and moves tasks towards their memory as well as moving memory
> toward their task. It handles shared pages and groups related tasks together.
> 

I found that sometimes numa balancing will be broken after khugepaged
starts, because khugepaged always allocates the huge page from the node of
the first scanned normal page during collapsing.

A simple use case is when a user runs his application interleaved across all
nodes using "numactl --interleave=all ".
But after khugepaged starts, most pages of his application will be
located on only one specific node.

I have a simple patch fix this issue in thread:
[PATCH 2/2] mm: thp: khugepaged: add policy for finding target node
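
One natural policy, sketched here only for illustration (not the actual
patch; the function name is made up): while scanning the ptes to be
collapsed, count how many of the source pages sit on each node, then
allocate the huge page on the node holding most of them instead of on the
node of the first page.

/* sketch only: node_count[] is assumed to be MAX_NUMNODES large and filled
 * during the pte scan with node_count[page_to_nid(page)]++; */
static int find_collapse_target_node(int *node_count)
{
	int nid, max_count = 0, target = numa_node_id();

	for_each_online_node(nid) {
		if (node_count[nid] > max_count) {
			max_count = node_count[nid];
			target = nid;
		}
	}
	return target;
}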

I think this may be related to this topic; I don't know whether this
series can also fix the issue I mentioned.

-- 
Regards,
-Bob


Re: [RFC PATCH 1/4] zbud: use page ref counter for zbud pages

2013-09-08 Thread Bob Liu
Hi Krzysztof,

On 08/30/2013 04:42 PM, Krzysztof Kozlowski wrote:
> Use page reference counter for zbud pages. The ref counter replaces
> zbud_header.under_reclaim flag and ensures that zbud page won't be freed
> when zbud_free() is called during reclaim. It allows implementation of
> additional reclaim paths.
> 
> The page count is incremented when:
>  - a handle is created and passed to zswap (in zbud_alloc()),
>  - user-supplied eviction callback is called (in zbud_reclaim_page()).
> 
> Signed-off-by: Krzysztof Kozlowski 
> Signed-off-by: Tomasz Stanislawski 
> Reviewed-by: Bob Liu 

AFAIR, the previous version you sent out has a function  called
rebalance_lists() which I think is a good clean up.
But I didn't see that function any more in this version.
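
Roughly the kind of helper meant here - a sketch based only on the list
handling visible in the patch below, not the code from that earlier posting:

static void rebalance_lists(struct zbud_pool *pool, struct zbud_header *zhdr)
{
	list_del(&zhdr->buddy);
	if (zhdr->first_chunks == 0 && zhdr->last_chunks == 0) {
		/* completely empty: drop from the LRU, caller frees the page */
		list_del(&zhdr->lru);
		pool->pages_nr--;
	} else if (zhdr->first_chunks == 0 || zhdr->last_chunks == 0) {
		/* one buddy still used: file under the matching free-chunks bucket */
		int freechunks = num_free_chunks(zhdr);
		list_add(&zhdr->buddy, &pool->unbuddied[freechunks]);
	} else {
		/* both buddies used */
		list_add(&zhdr->buddy, &pool->buddied);
	}
}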

Thanks,
-Bob


> ---
>  mm/zbud.c |   97 
> +
>  1 file changed, 52 insertions(+), 45 deletions(-)
> 
> diff --git a/mm/zbud.c b/mm/zbud.c
> index ad1e781..aa9a15c 100644
> --- a/mm/zbud.c
> +++ b/mm/zbud.c
> @@ -109,7 +109,6 @@ struct zbud_header {
>   struct list_head lru;
>   unsigned int first_chunks;
>   unsigned int last_chunks;
> - bool under_reclaim;
>  };
>  
>  /*
> @@ -138,16 +137,9 @@ static struct zbud_header *init_zbud_page(struct page 
> *page)
>   zhdr->last_chunks = 0;
>   INIT_LIST_HEAD(&zhdr->buddy);
>   INIT_LIST_HEAD(&zhdr->lru);
> - zhdr->under_reclaim = 0;
>   return zhdr;
>  }
>  
> -/* Resets the struct page fields and frees the page */
> -static void free_zbud_page(struct zbud_header *zhdr)
> -{
> - __free_page(virt_to_page(zhdr));
> -}
> -
>  /*
>   * Encodes the handle of a particular buddy within a zbud page
>   * Pool lock should be held as this function accesses first|last_chunks
> @@ -188,6 +180,31 @@ static int num_free_chunks(struct zbud_header *zhdr)
>   return NCHUNKS - zhdr->first_chunks - zhdr->last_chunks - 1;
>  }
>  
> +/*
> + * Increases ref count for zbud page.
> + */
> +static void get_zbud_page(struct zbud_header *zhdr)
> +{
> + get_page(virt_to_page(zhdr));
> +}
> +
> +/*
> + * Decreases ref count for zbud page and frees the page if it reaches 0
> + * (no external references, e.g. handles).
> + *
> + * Returns 1 if page was freed and 0 otherwise.
> + */
> +static int put_zbud_page(struct zbud_header *zhdr)
> +{
> + struct page *page = virt_to_page(zhdr);
> + if (put_page_testzero(page)) {
> + free_hot_cold_page(page, 0);
> + return 1;
> + }
> + return 0;
> +}
> +
> +
>  /*
>   * API Functions
>  */
> @@ -250,7 +267,7 @@ void zbud_destroy_pool(struct zbud_pool *pool)
>  int zbud_alloc(struct zbud_pool *pool, int size, gfp_t gfp,
>   unsigned long *handle)
>  {
> - int chunks, i, freechunks;
> + int chunks, i;
>   struct zbud_header *zhdr = NULL;
>   enum buddy bud;
>   struct page *page;
> @@ -273,6 +290,7 @@ int zbud_alloc(struct zbud_pool *pool, int size, gfp_t 
> gfp,
>   bud = FIRST;
>   else
>   bud = LAST;
> + get_zbud_page(zhdr);
>   goto found;
>   }
>   }
> @@ -284,6 +302,10 @@ int zbud_alloc(struct zbud_pool *pool, int size, gfp_t 
> gfp,
>   return -ENOMEM;
>   spin_lock(&pool->lock);
>   pool->pages_nr++;
> + /*
> +  * We will be using zhdr instead of page, so
> +  * don't increase the page count.
> +  */
>   zhdr = init_zbud_page(page);
>   bud = FIRST;
>  
> @@ -295,7 +317,7 @@ found:
>  
>   if (zhdr->first_chunks == 0 || zhdr->last_chunks == 0) {
>   /* Add to unbuddied list */
> - freechunks = num_free_chunks(zhdr);
> + int freechunks = num_free_chunks(zhdr);
>   list_add(&zhdr->buddy, &pool->unbuddied[freechunks]);
>   } else {
>   /* Add to buddied list */
> @@ -326,7 +348,6 @@ found:
>  void zbud_free(struct zbud_pool *pool, unsigned long handle)
>  {
>   struct zbud_header *zhdr;
> - int freechunks;
>  
>   spin_lock(&pool->lock);
>   zhdr = handle_to_zbud_header(handle);
> @@ -337,26 +358,19 @@ void zbud_free(struct zbud_pool *pool, unsigned long 
> handle)
>   else
>   zhdr->first_chunks = 0;
>  
> - if (zhdr->under_reclaim) {
> - /* zbud page is under reclaim, reclaim will free */
> - spin_unlock(&pool->lock);
> - return;
> - }
&


Re: [PATCH v2 4/4] mm/zswap: use GFP_NOIO instead of GFP_KERNEL

2013-09-06 Thread Bob Liu

On 09/06/2013 01:16 PM, Weijie Yang wrote:
> To avoid zswap store and reclaim functions called recursively,
> use GFP_NOIO instead of GFP_KERNEL
> 

The reason for using GFP_KERNEL in the writeback path is that we want to
try our best to move those pages from zswap to the real swap device.

I think it would be better to keep the GFP_KERNEL flag but find some other
way to skip zswap_frontswap_store() while zswap writeback is in
progress.

What I can think of currently is adding a mutex to zswap: take that
mutex when zswap writeback happens and check it in
zswap_frontswap_store().
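
A rough sketch of that idea (untested; the guard below is made up just to
illustrate, it is not an existing zswap symbol):

	static DEFINE_MUTEX(zswap_writeback_lock);

	/* in zswap_writeback_entry(), around the GFP_KERNEL allocations: */
	mutex_lock(&zswap_writeback_lock);
	/* ... allocate the swap cache page and write it back ... */
	mutex_unlock(&zswap_writeback_lock);

	/* in zswap_frontswap_store(): */
	if (mutex_is_locked(&zswap_writeback_lock))
		return -ENOMEM;	/* refuse the store instead of recursing */

That way a store that races with writeback simply falls back to the real
swap device, while the writeback path can keep using GFP_KERNEL.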


> Signed-off-by: Weijie Yang 
> ---
>  mm/zswap.c |6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/zswap.c b/mm/zswap.c
> index cc40e6a..3d05ed8 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -427,7 +427,7 @@ static int zswap_get_swap_cache_page(swp_entry_t entry,
>* Get a new page to read into from swap.
>*/
>   if (!new_page) {
> - new_page = alloc_page(GFP_KERNEL);
> + new_page = alloc_page(GFP_NOIO);
>   if (!new_page)
>   break; /* Out of memory */
>   }
> @@ -435,7 +435,7 @@ static int zswap_get_swap_cache_page(swp_entry_t entry,
>   /*
>* call radix_tree_preload() while we can wait.
>*/
> - err = radix_tree_preload(GFP_KERNEL);
> + err = radix_tree_preload(GFP_NOIO);
>   if (err)
>   break;
>  
> @@ -636,7 +636,7 @@ static int zswap_frontswap_store(unsigned type, pgoff_t 
> offset,
>   }
>  
>   /* allocate entry */
> - entry = zswap_entry_cache_alloc(GFP_KERNEL);
> + entry = zswap_entry_cache_alloc(GFP_NOIO);
>       if (!entry) {
>   zswap_reject_kmemcache_fail++;
>   ret = -ENOMEM;
> 

-- 
Regards,
-Bob
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 2/4] mm/zswap: bugfix: memory leak when invalidate and reclaim occur concurrently

2013-09-06 Thread Bob Liu

On 09/06/2013 01:16 PM, Weijie Yang wrote:
> Consider the following scenario:
> thread 0: reclaim entry x (get refcount, but not call 
> zswap_get_swap_cache_page)
> thread 1: call zswap_frontswap_invalidate_page to invalidate entry x.
>   finished, entry x and its zbud is not freed as its refcount != 0
>   now, the swap_map[x] = 0
> thread 0: now call zswap_get_swap_cache_page
>   swapcache_prepare return -ENOENT because entry x is not used any more
>   zswap_get_swap_cache_page return ZSWAP_SWAPCACHE_NOMEM
>   zswap_writeback_entry do nothing except put refcount
> Now, the memory of zswap_entry x and its zpage leak.
> 
> Modify:
> - check the refcount in fail path, free memory if it is not referenced.
> - use ZSWAP_SWAPCACHE_FAIL instead of ZSWAP_SWAPCACHE_NOMEM as the fail path
> can be not only caused by nomem but also by invalidate.
> 
> Signed-off-by: Weijie Yang 

Reviewed-by: Bob Liu 

> ---
>  mm/zswap.c |   21 +
>  1 file changed, 13 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/zswap.c b/mm/zswap.c
> index cbd9578..1be7b90 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -387,7 +387,7 @@ static void zswap_free_entry(struct zswap_tree *tree, 
> struct zswap_entry *entry)
>  enum zswap_get_swap_ret {
>   ZSWAP_SWAPCACHE_NEW,
>   ZSWAP_SWAPCACHE_EXIST,
> - ZSWAP_SWAPCACHE_NOMEM
> + ZSWAP_SWAPCACHE_FAIL,
>  };
>  
>  /*
> @@ -401,9 +401,9 @@ enum zswap_get_swap_ret {
>   * added to the swap cache, and returned in retpage.
>   *
>   * If success, the swap cache page is returned in retpage
> - * Returns 0 if page was already in the swap cache, page is not locked
> - * Returns 1 if the new page needs to be populated, page is locked
> - * Returns <0 on error
> + * Returns ZSWAP_SWAPCACHE_EXIST if page was already in the swap cache
> + * Returns ZSWAP_SWAPCACHE_NEW if the new page needs to be populated, page 
> is locked
> + * Returns ZSWAP_SWAPCACHE_FAIL on error
>   */
>  static int zswap_get_swap_cache_page(swp_entry_t entry,
>   struct page **retpage)
> @@ -475,7 +475,7 @@ static int zswap_get_swap_cache_page(swp_entry_t entry,
>   if (new_page)
>   page_cache_release(new_page);
>   if (!found_page)
> - return ZSWAP_SWAPCACHE_NOMEM;
> + return ZSWAP_SWAPCACHE_FAIL;
>   *retpage = found_page;
>   return ZSWAP_SWAPCACHE_EXIST;
>  }
> @@ -529,11 +529,11 @@ static int zswap_writeback_entry(struct zbud_pool 
> *pool, unsigned long handle)
>  
>   /* try to allocate swap cache page */
>   switch (zswap_get_swap_cache_page(swpentry, &page)) {
> - case ZSWAP_SWAPCACHE_NOMEM: /* no memory */
> + case ZSWAP_SWAPCACHE_FAIL: /* no memory or invalidate happened */
>   ret = -ENOMEM;
>   goto fail;
>  
> - case ZSWAP_SWAPCACHE_EXIST: /* page is unlocked */
> + case ZSWAP_SWAPCACHE_EXIST:
>   /* page is already in the swap cache, ignore for now */
>   page_cache_release(page);
>   ret = -EEXIST;
> @@ -591,7 +591,12 @@ static int zswap_writeback_entry(struct zbud_pool *pool, 
> unsigned long handle)
>  
>  fail:
>   spin_lock(&tree->lock);
> - zswap_entry_put(entry);
> + refcount = zswap_entry_put(entry);
> +     if (refcount <= 0) {
> + /* invalidate happened, consider writeback as success */
> + zswap_free_entry(tree, entry);
> + ret = 0;
> + }
>   spin_unlock(&tree->lock);
>   return ret;
>  }
> 

-- 
Regards,
-Bob
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 3/4] mm/zswap: avoid unnecessary page scanning

2013-09-06 Thread Bob Liu

On 09/06/2013 01:16 PM, Weijie Yang wrote:
> add SetPageReclaim before __swap_writepage so that page can be moved to the
> tail of the inactive list, which can avoid unnecessary page scanning as this
> page was reclaimed by swap subsystem before.
> 
> Signed-off-by: Weijie Yang 

Reviewed-by: Bob Liu 

> ---
>  mm/zswap.c |3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 1be7b90..cc40e6a 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -556,6 +556,9 @@ static int zswap_writeback_entry(struct zbud_pool *pool, 
> unsigned long handle)
>   SetPageUptodate(page);
>   }
>  
> + /* move it to the tail of the inactive list after end_writeback */
> + SetPageReclaim(page);
> +
>   /* start writeback */
> + __swap_writepage(page, &wbc, end_swap_bio_write);
>   page_cache_release(page);
> 

-- 
Regards,
-Bob
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 1/4] mm/zswap: bugfix: memory leak when re-swapon

2013-09-06 Thread Bob Liu


On 09/06/2013 01:16 PM, Weijie Yang wrote:
> zswap_tree is not freed when swapoff, and it got re-kmalloc in swapon,
> so memory-leak occurs.
> 
> Modify: free memory of zswap_tree in zswap_frontswap_invalidate_area().
> 
> Signed-off-by: Weijie Yang 

Reviewed-by: Bob Liu 

> ---
>  mm/zswap.c |4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/mm/zswap.c b/mm/zswap.c
> index deda2b6..cbd9578 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -816,6 +816,10 @@ static void zswap_frontswap_invalidate_area(unsigned 
> type)
>   }
>   tree->rbroot = RB_ROOT;
>   spin_unlock(&tree->lock);
> +
> + zbud_destroy_pool(tree->pool);
> + kfree(tree);
> + zswap_trees[type] = NULL;
>  }
>  
>  static struct zbud_ops zswap_zbud_ops = {
> 

-- 
Regards,
-Bob
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86: e820: fix memmap kernel boot parameter

2013-08-30 Thread Bob Liu
On Fri, Aug 30, 2013 at 2:14 PM, Dan Aloni  wrote:
> On Fri, Aug 30, 2013 at 01:47:53PM +0800, Bob Liu wrote:
>>[..]
>> Machine2: bootcmdline in grub.cfg "memmap=0x77ff$0x88000", the 
>> result of
>> "cat /proc/cmdline" changed to "memmap=0x77ffx88000".
>>
>> I didn't find the root cause; I think maybe grub treats "$0" as something
>> special.
>> Replacing '$' with '%' in the kernel boot parameter can fix this issue.
>
> You are correct with the root cause, however I don't think the patch is 
> needed.
>
> In order to bypass grub's variable evaluation you can simply use escaping
> and replace $ with \$ in your grub config.
>

I see, thank you very much!
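
For reference, a minimal example of the escaping Dan suggests (grub.cfg
syntax assumed, using the Machine2 value quoted above):

    memmap=0x77ff\$0x88000

With the backslash, grub passes the '$' through to the kernel instead of
expanding it as a variable.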

-- 
Regards,
--Bob
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] x86: e820: fix memmap kernel boot parameter

2013-08-29 Thread Bob Liu
Kernel boot parameter memmap=nn[KMG]$ss[KMG] is used to mark specific memory as
reserved. Region of memory to be used is from ss to ss+nn.

But I found the action of this parameter is not as expected.
I tried on two machines.
Machine1: bootcmdline in grub.cfg "memmap=800M$0x60bfdfff", but the result of
"cat /proc/cmdline" changed to "memmap=800M/bin/bashx60bfdfff" after system
booted.

Machine2: bootcmdline in grub.cfg "memmap=0x77ff$0x88000", the result of
"cat /proc/cmdline" changed to "memmap=0x77ffx88000".

I didn't find the root cause; I think maybe grub treats "$0" as something
special.
Replacing '$' with '%' in the kernel boot parameter can fix this issue.

Signed-off-by: Bob Liu 
---
 Documentation/kernel-parameters.txt |6 +++---
 arch/x86/kernel/e820.c  |2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/Documentation/kernel-parameters.txt 
b/Documentation/kernel-parameters.txt
index 7f9d4f5..a96c7b1 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1604,13 +1604,13 @@ bytes respectively. Such letter suffixes can also be 
entirely omitted.
[KNL,ACPI] Mark specific memory as ACPI data.
Region of memory to be used, from ss to ss+nn.
 
-   memmap=nn[KMG]$ss[KMG]
+   memmap=nn[KMG]%ss[KMG]
[KNL,ACPI] Mark specific memory as reserved.
Region of memory to be used, from ss to ss+nn.
Example: Exclude memory from 0x1869-0x1869
-memmap=64K$0x1869
+memmap=64K%0x1869
 or
-memmap=0x1$0x1869
+memmap=0x1%0x1869
 
memory_corruption_check=0/1 [X86]
Some BIOSes seem to corrupt the first 64k of
diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index d32abea..8483d45 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -869,7 +869,7 @@ static int __init parse_memmap_one(char *p)
} else if (*p == '#') {
start_at = memparse(p+1, &p);
e820_add_region(start_at, mem_size, E820_ACPI);
-   } else if (*p == '$') {
+   } else if (*p == '%') {
start_at = memparse(p+1, &p);
e820_add_region(start_at, mem_size, E820_RESERVED);
} else
-- 
1.7.10.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v7 0/5] zram/zsmalloc promotion

2013-08-23 Thread Bob Liu
Hi Minchan,

On Thu, Aug 22, 2013 at 8:42 AM, Minchan Kim  wrote:
> Hi Bob,
>
> On Wed, Aug 21, 2013 at 05:24:00PM +0800, Bob Liu wrote:
>> Hi Minchan,
>>
>> On 08/21/2013 02:16 PM, Minchan Kim wrote:
>> > It's 7th trial of zram/zsmalloc promotion.
>> > I rewrote cover-letter totally based on previous discussion.
>> >
>> > The main reason to prevent zram promotion was no review of
>> > zsmalloc part while Jens, block maintainer, already acked
>> > zram part.
>> >
>> > At that time, zsmalloc was used for zram, zcache and zswap so
>> > everybody wanted to make it general and at last, Mel reviewed it
>> > when zswap was submitted to merge mainline a few month ago.
>> > Most of review was related to zswap writeback mechanism which
>> > can pageout compressed page in memory into real swap storage
>> > in runtime and the conclusion was that zsmalloc isn't good for
>> > zswap writeback so zswap borrowed zbud allocator from zcache to
>> > replace zsmalloc. The zbud is bad for memory compression ratio(2)
>> > but it's very predictable behavior because we can expect a zpage
>> > includes just two pages as maximum. Other reviews were not major.
>> > http://lkml.indiana.edu/hypermail/linux/kernel/1304.1/04334.html
>> >
>> > Zcache doesn't use zsmalloc either so zsmalloc's user is only
>> > zram now so this patchset moves it into zsmalloc directory.
>> > Recently, Bob tried to move zsmalloc under mm directory to unify
>> > zram and zswap with adding pseudo block device in zswap(It's
>> > very weired to me) but he was simple ignoring zram's block device
>> > (a.k.a zram-blk) feature and considered only swap usecase of zram,
>> > in turn, it lose zram's good concept.
>> >
>>
>> Yes, I didn't notice the feature that zram can be used as a normal block
>> device.
>>
>>
>> > Mel raised an another issue in v6, "maintainance headache".
>> > He claimed zswap and zram has a similar goal that is to compresss
>> > swap pages so if we promote zram, maintainance headache happens
>> > sometime by diverging implementaion between zswap and zram
>> > so that he want to unify zram and zswap. For it, he want zswap
>> > to implement pseudo block device like Bob did to emulate zram so
>> > zswap can have an advantage of writeback as well as zram's benefit.
>>
>> If consider zram as a swap device only, I still think it's better to add
>> a pseudo block device to zswap and just disable the writeback of zswap.
>
> Why do you think zswap is better?
>

In my opinion:
1. It's easy for zswap to do the same thing by adding a few small changes.
2. zswap won't go through the block layer, which avoids a lot of overhead.
3. zswap is transparent to current users who are using a normal block
device as the swap device.

> I don't know but when I read http://lwn.net/Articles/334649/, it aimed
> for compressing page caches as well as swap pages but it made widespread
> hooks in core (I guess that's why zcache had a birth later by Nitin and Dan)
> so reviewers guided him to support anon pages only to merge it.
> And at that time, it was a specific virtual block device for only supporting
> swap. AFAIRC, akpm suggested to make it general block device so other party
> can have a benefit.
>
> You can type "zram tmp" in google and will find many article related
> to use zram as tmp and I have been received some questions/reports

I see.
But I think if shmem is used as tmp, the pages can be reclaimed under
memory pressure and reach zswap to be compressed as well.

Mel also pointed out a situation where using zram as tmpfs may make things worse.

> from anonymous guys by private mail. And Jorome, Redhat guy, has
> contributed that part like partial I/O.
>

I am not going to block zram being promoted; this is just a different voice.

-- 
Regards,
--Bob
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v7 0/5] zram/zsmalloc promotion

2013-08-21 Thread Bob Liu
On 08/21/2013 05:24 PM, Bob Liu wrote:
> Hi Minchan,
> 
> On 08/21/2013 02:16 PM, Minchan Kim wrote:
>> It's 7th trial of zram/zsmalloc promotion.
>> I rewrote cover-letter totally based on previous discussion.
>>
>> The main reason to prevent zram promotion was no review of
>> zsmalloc part while Jens, block maintainer, already acked
>> zram part.
>>
>> At that time, zsmalloc was used for zram, zcache and zswap so
>> everybody wanted to make it general and at last, Mel reviewed it
>> when zswap was submitted to merge mainline a few month ago.
>> Most of review was related to zswap writeback mechanism which
>> can pageout compressed page in memory into real swap storage
>> in runtime and the conclusion was that zsmalloc isn't good for
>> zswap writeback so zswap borrowed zbud allocator from zcache to
>> replace zsmalloc. The zbud is bad for memory compression ratio(2)
>> but it's very predictable behavior because we can expect a zpage
>> includes just two pages as maximum. Other reviews were not major. 
>> http://lkml.indiana.edu/hypermail/linux/kernel/1304.1/04334.html
>>
>> Zcache doesn't use zsmalloc either so zsmalloc's user is only
>> zram now so this patchset moves it into zsmalloc directory.
>> Recently, Bob tried to move zsmalloc under mm directory to unify
>> zram and zswap with adding pseudo block device in zswap(It's
>> very weired to me) but he was simple ignoring zram's block device
>> (a.k.a zram-blk) feature and considered only swap usecase of zram,
>> in turn, it lose zram's good concept.
>>
> 
> Yes, I didn't notice the feature that zram can be used as a normal block
> device.
> 
> 
>> Mel raised an another issue in v6, "maintainance headache".
>> He claimed zswap and zram has a similar goal that is to compresss
>> swap pages so if we promote zram, maintainance headache happens
>> sometime by diverging implementaion between zswap and zram
>> so that he want to unify zram and zswap. For it, he want zswap
>> to implement pseudo block device like Bob did to emulate zram so
>> zswap can have an advantage of writeback as well as zram's benefit.
> 
> If consider zram as a swap device only, I still think it's better to add
> a pseudo block device to zswap and just disable the writeback of zswap.
> 
> But I have no idea of zram's block device feature.
> 

BTW: I think the original/main purpose for which zram was introduced is
swapping. Are there any real users using zram as a normal block device
instead of swap?
For normal usage, maybe we can extend the ramdisk driver with a compression feature.

-- 
Regards,
-Bob
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v7 0/5] zram/zsmalloc promotion

2013-08-21 Thread Bob Liu
Hi Minchan,

On 08/21/2013 02:16 PM, Minchan Kim wrote:
> It's 7th trial of zram/zsmalloc promotion.
> I rewrote cover-letter totally based on previous discussion.
> 
> The main reason to prevent zram promotion was no review of
> zsmalloc part while Jens, block maintainer, already acked
> zram part.
> 
> At that time, zsmalloc was used for zram, zcache and zswap so
> everybody wanted to make it general and at last, Mel reviewed it
> when zswap was submitted to merge mainline a few month ago.
> Most of review was related to zswap writeback mechanism which
> can pageout compressed page in memory into real swap storage
> in runtime and the conclusion was that zsmalloc isn't good for
> zswap writeback so zswap borrowed zbud allocator from zcache to
> replace zsmalloc. The zbud is bad for memory compression ratio(2)
> but it's very predictable behavior because we can expect a zpage
> includes just two pages as maximum. Other reviews were not major. 
> http://lkml.indiana.edu/hypermail/linux/kernel/1304.1/04334.html
> 
> Zcache doesn't use zsmalloc either so zsmalloc's user is only
> zram now so this patchset moves it into zsmalloc directory.
> Recently, Bob tried to move zsmalloc under mm directory to unify
> zram and zswap with adding pseudo block device in zswap(It's
> very weired to me) but he was simple ignoring zram's block device
> (a.k.a zram-blk) feature and considered only swap usecase of zram,
> in turn, it lose zram's good concept.
> 

Yes, I didn't notice the feature that zram can be used as a normal block
device.


> Mel raised an another issue in v6, "maintainance headache".
> He claimed zswap and zram has a similar goal that is to compresss
> swap pages so if we promote zram, maintainance headache happens
> sometime by diverging implementaion between zswap and zram
> so that he want to unify zram and zswap. For it, he want zswap
> to implement pseudo block device like Bob did to emulate zram so
> zswap can have an advantage of writeback as well as zram's benefit.

If consider zram as a swap device only, I still think it's better to add
a pseudo block device to zswap and just disable the writeback of zswap.

But I have no idea of zram's block device feature.

> But I wonder frontswap-based zswap's writeback is really good
> approach for writeback POV. I think that problem isn't only
> specific for zswap. If we want to configure multiple swap hierarchy
> with various speed device such as RAM, NVRAM, SSD, eMMC, NAS etc,
> it would be a general problem. So we should think of more general
> approach. At a glance, I can see two approach.
> 
> First, VM could be aware of heterogeneous swap configuration
> among swap devices. It may need an indirection layer on swap, which
> among swap devices. It may need indirction layer on swap, which
> was already talked about that way so VM can migrate a block from 
> A to B easily. It will support various configuration with VM's
> hints, maybe, in future.
> http://lkml.indiana.edu/hypermail/linux/kernel/1203.3/03812.html
> 
> Second, as more practical solution, we could use device mapper like
> dm-cache(https://lwn.net/Articles/540996/), which makes it very
> flexible. Now, it supports various configuration and cache policy
> (block size, writeback/writethrough, LRU, MFU although MQ is merged
> now) so it would be good fit for our purpose. Even, it can make zram
> support writeback. I tested it following as following scenario
> in KVM 4 CPU, 1G DRAM with background 800M memory hogger, which is
> allocates random data up to 800M.
> 
> 1) zram swap disk 1G, untar kernel.tgz to tmpfs, build -j 4
>Fail to untar due to shortage of memory space by tmpfs default size limit
> 
> 2) zram swap disk 1G, untar kernel.tgz to ext2 on zram-blk, build -j 4
>OOM happens while building the kernel but it untar successfully
>on ext2 based on zram-blk. The reason OOM happened is zram can not find
>free pages from main memory to store swap out pages although empty
>swap space is still enough.
> 
> 3) dm-cache swap disk 1G, untar kernel.tgz to ext2 on zram-blk, build -j 4
>dmcache consists of zram-meta 10M, zram-cache 1G and real swap storage 1G
>No OOM happens and successfully building done.
> 
> Above tests proves zram can support writeback into real swap storage
> so that zram-cache can always have a free space. If necessary, we could
> add new plugin in dm-cache. I see It's really flexible and well-layered
> architecture so zram-blk's concept is good for us and it has lots of
> potential to be enhanced by MM/FS/Block developers. 
> 

That's an exciting direction!
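
For anyone who wants to reproduce scenario 3, here is a rough sketch of the
dm-cache setup (device names and sizes are made up; see
Documentation/device-mapper/cache.txt for the exact table format):

    # /dev/zram0: ~10M metadata, /dev/zram1: 1G cache, /dev/sdb1: 1G real swap
    # the origin size below is in 512-byte sectors (1G = 2097152)
    dmsetup create swapcache --table \
        "0 2097152 cache /dev/zram0 /dev/zram1 /dev/sdb1 512 1 writeback default 0"
    mkswap /dev/mapper/swapcache
    swapon /dev/mapper/swapcache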

-- 
Regards,
-Bob
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 4/4] mm: zswap: create a pseudo device /dev/zram0

2013-08-19 Thread Bob Liu

On 08/20/2013 01:46 AM, Seth Jennings wrote:
> On Sun, Aug 18, 2013 at 04:40:49PM +0800, Bob Liu wrote:
>> This is used to replace the previous zram.
>> zram users can enable this feature, then a pseudo device will be created
>> automatically after kernel boot.
>> Just use "mkswap /dev/zram0; swapon /dev/zram0" to use it as a swap disk.
>>
>> The size of this pseudo device is controlled by the zswap boot parameter
>> zswap.max_pool_percent.
>> disksize = (totalram_pages * zswap.max_pool_percent/100)*PAGE_SIZE.
> 
> This /dev/zram0 will behave nothing like the block device that zram
> creates.  It only allows reads/writes to the first PAGE_SIZE area of the
> device, for mkswap to work, and then doesn't do anything for all other
> accesses.

Yes, all the other data should be stored in the zswap pool and doesn't need to
go through the block layer.

> 
> I guess if you disabled zswap writeback, then... it would somewhat be
> the same thing.  We do need to disable zswap writeback in this case so
> that zswap doesn't decompress a ton of pages into the swapcache for
> writebacks that will just fail.  Since zsmalloc does not yet support the
> reclaim functionality, zswap writeback is implicitly disabled.
> 

Yes, ZSWAP_PSEUDO_BLKDEV depends on zsmalloc and if using zsmalloc as
the allocator then the writeback is disabled(not implemented and no
requirement).

> But this is really weird conceptually since zswap is a caching layer
> that uses frontswap.  If a frontswap store fails, it will try to send
> the page to the zram0 device which will fail the write.  Then the page

That's a problem. We should not send the page to zram0 if the
frontswap store fails, and should return failure just as if the swap device were full.

> will be... put back on the active or inactive list?
> 
> Also, using the max_pool_percent in calculating the pseudo-device size
> isn't right.  Right now, the code makes the device the max size of the
> _compressed_ pool, but the underlying swap device size is in
> _uncompressed_ pages. So you'll never be able to fill zswap sizing the
> device like this, unless every page is highly incompressible to the
> point that each compressed page effectively uses a memory pool page, in
> which case, the user shouldn't be using memory compression.
> 
> This also means that this hasn't been tested in the zswap pool-is-full
> case since there is no way, in this code, to hit that case.

Yes, but in my understanding there is no need to trigger this path. It's
the same with zram. E.g. create /dev/zram0 with a disksize of 100M; then
the mm core will store at most ~100M of uncompressed pages to /dev/zram0. But
the real memory spent storing those pages depends on the
compression ratio. It's rare that zram will need 100M of real memory.

> 
> In the zbud case the expected compression is 2:1 so you could just
> multiply the compressed pool size by 2 and get a good pseudo-device
> size.  With zsmalloc the expected compression is harder to determine
> since it can achieve very high effective compression ratios on highly
> compressible pages.
> 

Some users know the compression ratio of their workloads even when using
zsmalloc.
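
To make the sizing concrete (rough numbers only, assuming the default
zswap.max_pool_percent=20 on a 1GB machine): the compressed pool is capped
at about 200MB, so a pseudo-device of ~400MB matches zbud's expected 2:1
ratio, while a workload known to compress around 3:1 under zsmalloc could
justify a ~600MB device.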

-- 
Regards,
-Bob
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/4] mm: zswap: add supporting for zsmalloc

2013-08-19 Thread Bob Liu

On 08/20/2013 12:59 AM, Seth Jennings wrote:
> On Sun, Aug 18, 2013 at 04:40:48PM +0800, Bob Liu wrote:
>> Make zswap able to use zsmalloc as its allocator.
>> But note that zsmalloc doesn't do any mandatory reclaim of zswap pool pages; if the zswap
>> pool gets full, frontswap_store will be refused unless frontswap_get happened
>> and freed some space.
>>
>> The reason for not implementing reclaim of zsmalloc pages from the zswap pool is 
>> there
>> is no requirement currently.
>> If we want to do mandatory reclaim, we have to write those pages to real 
>> backend
>> swap devices. But most current users of zsmalloc are from the embedded world,
>> where there is often no real backend swap device.
>> This behavior is also the same as the previous zram!
>>
>> For areas where zsmalloc's unpredictable performance characteristics when
>> reclaiming a single page are a concern, CONFIG_ZBUD is suggested.
> 
> Looking at this patch on its own, it does show how simple it could be
> for zswap to support zsmalloc.  So thanks!
> 
> However, I don't like all the ifdefs scattered everywhere.  I'd like to
> have a ops structure (e.g. struct zswap_alloc_ops) instead and just
> switch ops based on the CONFIG flag.  Or better yet, have it boot-time
> selectable instead of build-time.
> 

I don't like the ifdefs either. But I didn't find a better way to
replace them, since the data structures and APIs of zbud and zsmalloc are
different. I can take a try using zswap_alloc_ops.
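
Something along these lines, perhaps (a hypothetical sketch only, to show the
direction rather than the final interface):

	struct zswap_alloc_ops {
		void *(*create_pool)(gfp_t gfp);
		void (*destroy_pool)(void *pool);
		int (*alloc)(void *pool, size_t size, gfp_t gfp,
			     unsigned long *handle);
		void (*free)(void *pool, unsigned long handle);
		void *(*map_handle)(void *pool, unsigned long handle);
		void (*unmap_handle)(void *pool, unsigned long handle);
		u64 (*get_pool_size)(void *pool);
	};

Then zswap would call through a single ops pointer, with the zbud and zsmalloc
wrappers behind it selected at build time (or, better, at boot time) instead of
with #ifdefs sprinkled through zswap.c.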

-- 
Regards,
-Bob
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v6 0/5] zram/zsmalloc promotion

2013-08-19 Thread Bob Liu
Hi Luigi,

On 08/19/2013 01:29 PM, Luigi Semenzato wrote:
> 
> We are gearing up to evaluate zswap, but we have only ported kernels
> up to 3.8 to our hardware, so we may be missing important patches.
> 
> In our experience, and with all due respect, the linux MM is a complex
> beast, and it's difficult to predict how hard it will be for us to
> switch to zswap.  Even with the relatively simple zram, our load

I think it will be easy if zswap can also create a pseudo block device (I
already did a simple implementation in "[PATCH 0/4] mm: merge zram into
zswap"); then it's transparent for original zram users.

> triggered bugs in other parts of the MM that took a fair amount of
> work to resolve.
> 
> I may be wrong, but the in-memory compressed block device implemented
> by zram seems like a simple device which uses a well-established API
> to the rest of the kernel.  If it is removed from the kernel, will it
> be difficult for us to carry our own patch?  Because we may have to do
> that for a while.  Of course we would prefer it if it stayed in, at
> least temporarily.
> 
> Also, could someone confirm or deny that the maximum compression ratio
> in zbud is 2?  Because we easily achieve a 2.6-2.8 compression ratio
> with our loads using zram with zsmalloc and LZO or snappy.  Losing
> that memory will cause a noticeable regression, which will encourage
> us to stick with zram.
> 
> I am hoping that our load is not so unusual that we are the only Linux
> users in this situation, and that zsmalloc (or other
> allocator-compressor with similar characteristics) will continue to
> exist, whether it is used by zram or zswap.
> 
> Thanks!
> 

-- 
Regards,
-Bob
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v6 0/5] zram/zsmalloc promotion

2013-08-19 Thread Bob Liu
Hi Luigi,

On 08/19/2013 01:29 PM, Luigi Semenzato wrote:
 
 We are gearing up to evaluate zswap, but we have only ported kernels
 up to 3.8 to our hardware, so we may be missing important patches.
 
 In our experience, and with all due respect, the linux MM is a complex
 beast, and it's difficult to predict how hard it will be for us to
 switch to zswap.  Even with the relatively simple zram, our load

I think it will be easy if zswap can also create a pseudo block device(I
already done the simple implementation [PATCH 0/4] mm: merge zram into
zswap), then it's transparent for original zram users.

 triggered bugs in other parts of the MM that took a fair amount of
 work to resolve.
 
 I may be wrong, but the in-memory compressed block device implemented
 by zram seems like a simple device which uses a well-established API
 to the rest of the kernel.  If it is removed from the kernel, will it
 be difficult for us to carry our own patch?  Because we may have to do
 that for a while.  Of course we would prefer it if it stayed in, at
 least temporarily.
 
 Also, could someone confirm or deny that the maximum compression ratio
 in zbud is 2?  Because we easily achieve a 2.6-2.8 compression ratio
 with our loads using zram with zsmalloc and LZO or snappy.  Losing
 that memory will cause a noticeable regression, which will encourage
 us to stick with zram.
 
 I am hoping that our load is not so unusual that we are the only Linux
 users in this situation, and that zsmalloc (or other
 allocator-compressor with similar characteristics) will continue to
 exist, whether it is used by zram or zswap.
 
 Thanks!
 

-- 
Regards,
-Bob
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/4] mm: zswap: add supporting for zsmalloc

2013-08-19 Thread Bob Liu

On 08/20/2013 12:59 AM, Seth Jennings wrote:
 On Sun, Aug 18, 2013 at 04:40:48PM +0800, Bob Liu wrote:
 Make zswap can use zsmalloc as its allocater.
 But note that zsmalloc don't reclaim any zswap pool pages mandatory, if zswap
 pool gets full, frontswap_store will be refused unless frontswap_get happened
 and freed some space.

 The reason of don't implement reclaiming zsmalloc pages from zswap pool is 
 there
 is no requiremnet currently.
 If we want to do mandatory reclaim, we have to write those pages to real 
 backend
 swap devices. But most of current users of zsmalloc are from embeded world,
 there is even no real backend swap device.
 This action is also the same as privous zram!

 For several area, zsmalloc has unpredictable performance characteristics when
 reclaiming a single page, then CONFIG_ZBUD are suggested.
 
 Looking at this patch on its own, it does show how simple it could be
 for zswap to support zsmalloc.  So thanks!
 
 However, I don't like all the ifdefs scattered everywhere.  I'd like to
 have a ops structure (e.g. struct zswap_alloc_ops) instead and just
 switch ops based on the CONFIG flag.  Or better yet, have it boot-time
 selectable instead of build-time.
 

I don't like the ifdefs neither. But I didn't find a better way to
replace them since the data structures and API of zbud and zsmalloc are
different. I can take a try using zswap_alloc_ops.

-- 
Regards,
-Bob
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 4/4] mm: zswap: create a pseudo device /dev/zram0

2013-08-19 Thread Bob Liu

On 08/20/2013 01:46 AM, Seth Jennings wrote:
 On Sun, Aug 18, 2013 at 04:40:49PM +0800, Bob Liu wrote:
 This is used to replace previous zram.
 zram users can enable this feature, then a pseudo device will be created
 automaticlly after kernel boot.
 Just using mkswp /dev/zram0; swapon /dev/zram0 to use it as a swap disk.

 The size of this pseudeo is controlled by zswap boot parameter
 zswap.max_pool_percent.
 disksize = (totalram_pages * zswap.max_pool_percent/100)*PAGE_SIZE.
 
 This /dev/zram0 will behave nothing like the block device that zram
 creates.  It only allows reads/writes to the first PAGE_SIZE area of the
 device, for mkswap to work, and then doesn't do anything for all other
 accesses.

Yes, all the other data should be stored in the zswap pool and doesn't need
to go through the block layer.

 
 I guess if you disabled zswap writeback, then... it would somewhat be
 the same thing.  We do need to disable zswap writeback in this case so
 that zswap doesn't decompress a ton of pages into the swapcache for
 writebacks that will just fail.  Since zsmalloc does not yet support the
 reclaim functionality, zswap writeback is implicitly disabled.
 

Yes, ZSWAP_PSEUDO_BLKDEV depends on zsmalloc, and if zsmalloc is used as the
allocator then writeback is disabled (not implemented, and there is no
requirement for it).

 But this is really weird conceptually since zswap is a caching layer
 that uses frontswap.  If a frontswap store fails, it will try to send
 the page to the zram0 device which will fail the write.  Then the page

That's a problem. We should avoid sending the page to zram0 if the frontswap
store fails, and return failure just as if the swap device were full.

 will be... put back on the active or inactive list?
 
 Also, using the max_pool_percent in calculating the pseudo-device size
 isn't right.  Right now, the code makes the device the max size of the
 _compressed_ pool, but the underlying swap device size is in
 _uncompressed_ pages. So you'll never be able to fill zswap sizing the
 device like this, unless every page is highly incompressible to the
 point that each compressed page effectively uses a memory pool page, in
 which case, the user shouldn't be using memory compression.
 
 This also means that this hasn't been tested in the zswap pool-is-full
 case since there is no way, in this code, to hit that case.

Yes, but in my understanding there is no need to trigger this path. It's the
same as zram. E.g. create /dev/zram0 with a disksize of 100M; the mm core
will then store at most ~100M of uncompressed pages to /dev/zram0, but the
real memory spent storing those pages depends on the compression ratio. It's
rare that zram will actually need 100M of real memory.
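
(Rough numbers, purely for illustration: assuming a 3:1 compression ratio, a
100M disksize holds at most 100M of uncompressed pages, which take roughly
33M of real memory in the pool; only completely incompressible data would
push the pool anywhere near the full 100M.)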

 
 In the zbud case the expected compression is 2:1 so you could just
 multiply the compressed pool size by 2 and get a good pseudo-device
 size.  With zsmalloc the expected compression is harder to determine
 since it can achieve very high effective compression ratios on highly
 compressible pages.
 

Some users know the compression ratio of their workloads even when using
zsmalloc.

-- 
Regards,
-Bob


Re: [PATCH 0/4] mm: merge zram into zswap

2013-08-18 Thread Bob Liu
Hi Minchan,

On 08/19/2013 12:10 PM, Minchan Kim wrote:
> On Sun, Aug 18, 2013 at 04:40:45PM +0800, Bob Liu wrote:
>> Both zswap and zram are used to compress anon pages in memory so as to reduce
>> swap IO operations. The main difference is that zswap uses zbud as its
>> allocator while zram uses zsmalloc. The other difference is that zram creates
>> a block device, which the user needs to mkswap and swapon.
>>
>> Minchan has already tried to promote zram/zsmalloc into drivers/block/, but it
>> may increase maintenance headaches. Since the purpose of zswap and zram is
>> the same, this patch series tries to merge them together as Mel suggested.
>> Dropped zram from staging and extended zswap with the same features as zram.
>>
>> zswap todo:
>> Improve the writeback of zswap pool pages!
>>
>> Bob Liu (4):
>>   drivers: staging: drop zram and zsmalloc
> 
> Bob, I feel you're very rude and I'm really upset.
> 
> You're just dropping a subsystem you did nothing for, without any consensus
> from those who have been contributing lots of patches to make it work well
> for a long time.

I apologize for that; at the very least I should have added [RFC] to the patch title!

-- 
Regards,
-Bob


Re: [PATCH v6 0/5] zram/zsmalloc promotion

2013-08-18 Thread Bob Liu
Hi Minchan,

On 08/19/2013 11:18 AM, Minchan Kim wrote:
> Hello Mel,
> 
> On Fri, Aug 16, 2013 at 09:33:47AM +0100, Mel Gorman wrote:
>> On Fri, Aug 16, 2013 at 01:26:41PM +0900, Minchan Kim wrote:
>>>>>> 
>>>>>> If it's used for something like tmpfs then it becomes much worse. Normal
>>>>>> tmpfs without swap can lockup if tmpfs is allowed to fill memory. In a
>>>>>> sane configuration, lockups will be avoided and deleting a tmpfs file is
>>>>>> guaranteed to free memory. When zram is used to back tmpfs, there is no
>>>>>> guarantee that any memory is freed due to fragmentation of the compressed
>>>>>> pages. The only way to recover the memory may be to kill applications
>>>>>> holding tmpfs files open and then delete them which is fairly drastic
>>>>>> action in a normal server environment.
>>>>>
>>>>> Indeed.
>>>>> Actually, I had a plan to support zsmalloc compaction. The zsmalloc 
>>>>> exposes
>>>>> handle instead of pure pointer so it could migrate some zpages to 
>>>>> somewhere
>>>>> to pack in. Then, it could help above problem and OOM storm problem.
>>>>> Anyway, it's a totally new feature and requires many changes and
>>>>> experimentation.
>>>>> Although we don't have such feature, zram is still good for many people.
>>>>>
>>>>
>>>> And is zsmalloc was pluggable for zswap then it would also benefit.
>>>
>>> But zswap isn't a pseudo block device so it can't be used as a block device.
>>
>> It would not be impossible to write one. Taking a quick look it might even
>> be doable by just providing a zbud_ops that does not have an evict handler
>> and make sure the errors are handled correctly. i.e. does the following
>> patch mean that zswap never writes back and instead just compresses pages
>> in memory?
>>
>> diff --git a/mm/zswap.c b/mm/zswap.c
>> index deda2b6..99e41c8 100644
>> --- a/mm/zswap.c
>> +++ b/mm/zswap.c
>> @@ -819,7 +819,6 @@ static void zswap_frontswap_invalidate_area(unsigned 
>> type)
>>  }
>>  
>>  static struct zbud_ops zswap_zbud_ops = {
>> -.evict = zswap_writeback_entry
>>  };
>>  
>>  static void zswap_frontswap_init(unsigned type)
>>
>> If so, it should be doable to link that up in a sane way so it can be
>> configured at runtime.
>>
>> Did you ever even try something like this?
> 
> Never. Because I didn't have such requirement for zram.
> 
>>
>>> Let say one usecase for using zram-blk.
>>>
>>> 1) Many embedded systems don't have swap, so although tmpfs can support
>>> swapout it's still pointless there; such systems should have a sane
>>> configuration to limit memory usage, so it's not only a zram problem.
>>>
>>
>> If zswap was backed by a pseudo device that failed all writes or an an
>> ops with no evict handler then it would be functionally similar.
>>
>>> 2) Many embedded systems don't have enough memory. Let's assume a short-lived
>>> file that grows to half of system memory once in a while. We don't want to
>>> write it to flash because of wear-levelling and slowness, so we want to keep
>>> it in memory; but if we use tmpfs, it has to evict half of the working set
>>> when the size reaches its peak. zram would be the better choice.
>>>
>>
>> Then back it by a pseudo device that fails all writes so it does not have
>> to write to disk.
> 
> You mean "make pseudo block device and register make_request_fn
> and prevent writeback". Bah, yes, it's doable but what is it different with 
> below?
> 
> 1) move zbud into zram
> 2) implement frontswap API in zram
> 3) implement writeback in zram
> 
> zram has been in staging for a long time waiting to be promoted, and has been
> maintained/deployed. Of course, I have asked for the promotion several times
> over more than a year.
> 
> Why can't zram include zswap functions if you really want to merge them?
> Is there any problem?

I think merging zram into zswap and merging zswap into zram are the same
thing; there is no difference.
Either way finally results in a solution with a zram block device, the
frontswap API, etc.

The difference is just the name and the title of the merging patch, which I
think is unimportant.

I've implemented a series, "[PATCH 0/4] mm: merge zram into zswap"; I can
change the title to "merge zswap into zram" if you want and rename zswap to
something like zhybrid.

-- 
Regards,
-Bob


Re: [BUG REPORT] ZSWAP: theoretical race condition issues

2013-08-18 Thread Bob Liu
Hi Weijie,

On 08/19/2013 12:14 AM, Weijie Yang wrote:
> I found a few bugs in zswap when I review Linux-3.11-rc5, and I have
> also some questions about it, described as following:
> 
> BUG:
> 1. A race condition when reclaiming a page
> When a handle is allocated from zbud, zbud considers the handle validly in
> use by the upper layer (zswap) and a candidate for reclaim.
> But zswap still has to initialize it, e.g. setting the swapentry and adding
> it to the rbtree, so there is a race condition, such as:
> thread 0: obtain handle x from zbud_alloc
> thread 1: zbud_reclaim_page is called
> thread 1: callback zswap_writeback_entry to reclaim handle x
> thread 1: get swpentry from handle x (it is random value now)
> thread 1: bad thing may happen
> thread 0: initialize handle x with swapentry

Yes, this may happen in theory, but only in rare cases.
Because we have an LRU list for page frames, after thread 0 calls zbud_alloc
the corresponding page is added to the head of the LRU list, while
zbud_reclaim_page (called by thread 1) starts from the tail of the LRU list.
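
A minimal sketch of why the window is narrow (illustrative, not the exact
zbud code): the freshly allocated zbud page is placed at the head of the pool
LRU, while reclaim takes its victim from the tail, so a handle allocated a
moment ago is the last possible reclaim candidate:

	/* on alloc: newest page goes to the head (illustrative) */
	list_add(&zhdr->lru, &pool->lru);

	/* on reclaim: victim is the oldest page, taken from the tail */
	zhdr = list_entry(pool->lru.prev, struct zbud_header, lru);

So thread 1 would only see the uninitialized swpentry if the pool LRU were
almost empty at the moment of the store.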

> Of course, this situation almost never happens; it is a "theoretical
> race condition" issue.
> 
> 2. Polluting swapcache data by adding a pre-invalidated swap page
> When a swap_entry is invalidated, it can be reused by another anon
> page. If zswap is reclaiming the old page at the same time, it pollutes
> the swapcache of the new page as a result, because the old page and the
> new page use the same swap_entry, such as:
> thread 1: zswap reclaim entry x
> thread 0: zswap_frontswap_invalidate_page entry x
> thread 0: entry x reused by other anon page
> thread 1: add old data to swapcache of entry x

I didn't get your point here: why would thread 1 add old data to entry x?

> thread 0: swapcache of entry x is polluted
> Of course, this situation almost never happens, it is another
> "theoretical race condition" issue.
> 
> 3. Frontswap uses the frontswap_map bitmap to track pages in the "backend"
> implementation; when zswap reclaims a
> page, the corresponding bitmap record is not cleared.
>

That's true, but I don't think it's a big problem.
It only wastes a little time searching the rbtree during zswap_frontswap_load().

> 4. zswap_tree is not freed at swapoff and gets re-kzalloc'd at
> swapon, so a memory leak occurs.

Nice catch! I think it should be freed in zswap_frontswap_invalidate_area().
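
Something along these lines at the end of zswap_frontswap_invalidate_area()
should do it; a rough sketch against my reading of the current code, not a
tested patch:

	static void zswap_frontswap_invalidate_area(unsigned type)
	{
		struct zswap_tree *tree = zswap_trees[type];

		if (!tree)
			return;

		/* ... walk the rbtree and free all entries, as today ... */

		/* then free the tree itself so the next swapon doesn't leak it */
		zbud_destroy_pool(tree->pool);
		kfree(tree);
		zswap_trees[type] = NULL;
	}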

> 
> questions:
> 1. How about SetPageReclaim before __swap_writepage, so that the page is
> moved to the tail of the inactive list?

The page is added to the inactive list now.

> 2. zswap uses the GFP_KERNEL flag to allocate things in the store and reclaim
> functions; can this lead to these functions being called recursively?

Yes, that's a potential problem.
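
One possible mitigation, purely as an illustration (not something the current
store path does for the entry cache): avoid flags that can recurse into
reclaim/IO when allocating on the swap-out path, e.g.

	/* assumption: allocate without triggering further swap-out */
	entry = zswap_entry_cache_alloc(GFP_NOWAIT | __GFP_NOWARN);
	if (!entry) {
		zswap_reject_kmemcache_fail++;
		ret = -ENOMEM;
		goto reject;
	}

The zbud_alloc() call already passes __GFP_NORETRY | __GFP_NOWARN for a
similar reason.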

> 3. To reclaim one zbud page which contains two buddies, zswap
> needs to allocate two pages. Is this reclaim cost-efficient?
> 

Yes, that's a problem too. And that's why we use zbud as the default
allocator instead of zsmalloc.
I think improving the writeback path of zswap is the next important
step for zswap.

-- 
Regards,
-Bob


[PATCH 4/4] mm: zswap: create a pseudo device /dev/zram0

2013-08-18 Thread Bob Liu
This is used to replace the previous zram.
zram users can enable this feature; a pseudo device will then be created
automatically after kernel boot.
Just run "mkswap /dev/zram0; swapon /dev/zram0" to use it as a swap disk.

The size of this pseudo device is controlled by the zswap boot parameter
zswap.max_pool_percent:
disksize = (totalram_pages * zswap.max_pool_percent/100)*PAGE_SIZE.
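
(Worked example with assumed numbers: on a 2 GB machine with
zswap.max_pool_percent=20, disksize = 2 GB * 20/100 = 400 MB. Note that this
sizes the device from the maximum _compressed_ pool size while the device is
filled with _uncompressed_ pages, which is the sizing mismatch discussed in
Seth's review elsewhere in this thread.)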

Signed-off-by: Bob Liu 
---
 mm/Kconfig |   12 
 mm/zswap.c |  196 
 2 files changed, 208 insertions(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index d80a575..3778026 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -525,6 +525,18 @@ choice
  be refused unless frontswap_get happened and freed some space.
 endchoice
 
+config ZSWAP_PSEUDO_BLKDEV
+   bool "Emulate a pseudo blk-dev based on zswap(previous zram)"
+   depends on ZSWAP && ZSMALLOC
+   default n
+
+   help
+ Enabling this option will emulate a pseudo block swap device /dev/zram0
+ whose size is zswap.max_pool_percent of total RAM. All writes to this
+ block device will be compressed and cached by zswap; as a result no
+ real disk IO operations will happen.
+ This feature can be used to replace drivers/staging/zram.
+
 config MEM_SOFT_DIRTY
bool "Track memory changes"
depends on CHECKPOINT_RESTORE && HAVE_ARCH_SOFT_DIRTY
diff --git a/mm/zswap.c b/mm/zswap.c
index 8e8dc99..ae73c9d 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -38,6 +38,11 @@
 #include 
 #else
 #include 
+#ifdef CONFIG_ZSWAP_PSEUDO_BLKDEV
+#include 
+#include 
+#include 
+#endif
 #endif
 #include 
 #include 
@@ -968,6 +973,189 @@ static int __init zswap_debugfs_init(void)
 static void __exit zswap_debugfs_exit(void) { }
 #endif
 
+#ifdef CONFIG_ZSWAP_PSEUDO_BLKDEV
+#define SECTOR_SHIFT   9
+#define SECTOR_SIZE    (1 << SECTOR_SHIFT)
+#define SECTORS_PER_PAGE_SHIFT (PAGE_SHIFT - SECTOR_SHIFT)
+#define SECTORS_PER_PAGE   (1 << SECTORS_PER_PAGE_SHIFT)
+
+struct zram {
+   struct rw_semaphore lock; /* protect concurrent reads and writes */
+   struct request_queue *queue;
+   struct gendisk *disk;
+
+   /*
+* This is the disk size for userland. The size is controlled by
+* boot parameter zswap.max_pool_percent.
+* disksize = (totalram_pages * zswap.max_pool_percent/100)*PAGE_SIZE
+*/
+   u64 disksize;   /* bytes */
+
+   /*
+* This page is used to store real data for /dev/zram.
+* Meanful operation to /dev/zramx is only mkswp and swapon/swapoff.
+* So use one page to store the real data(written by mkswp).
+*/
+   struct page *metapage;
+};
+
+/*
+ * Only create /dev/zram0; this can be extended in the future if there are
+ * real usecases that need multiple zram devices.
+ */
+static struct zram zram_device;
+static const struct block_device_operations zram_devops = {
+   .owner = THIS_MODULE
+};
+
+static void update_position(u32 *index, int *offset, struct bio_vec *bvec)
+{
+   if (*offset + bvec->bv_len >= PAGE_SIZE)
+   (*index)++;
+   *offset = (*offset + bvec->bv_len) % PAGE_SIZE;
+}
+
+static void zram_make_request(struct request_queue *queue, struct bio *bio)
+{
+   u32 index;
+   struct bio_vec *bvec;
+   unsigned char *src, *dst;
+   int offset, i, rw = bio_data_dir(bio);
+   struct zram *zram = queue->queuedata;
+
+   index = bio->bi_sector >> SECTORS_PER_PAGE_SHIFT;
+   offset = (bio->bi_sector & (SECTORS_PER_PAGE - 1)) << SECTOR_SHIFT;
+
+   bio_for_each_segment(bvec, bio, i) {
+   /*
+* The only operations on the pseudo /dev/zramx are mkswap and
+* swapon/swapoff, so we only need one extra page to store the
+* real metadata!
+*/
+   BUG_ON(bvec->bv_len != PAGE_SIZE);
+   BUG_ON(offset);
+
+   if (!index) {
+   if (rw == READ) {
+   down_read(&zram->lock);
+   dst = kmap_atomic(bvec->bv_page);
+   src = kmap_atomic(zram->metapage);
+   memcpy(dst, src, bvec->bv_len);
+   kunmap_atomic(dst);
+   kunmap_atomic(src);
+   flush_dcache_page(bvec->bv_page);
+   up_read(&zram->lock);
+   } else {
+   down_write(&zram->lock);
+   src = kmap_atomic(bvec->bv_page);
+   dst = kmap_atomic(zram->metapage);
+   memcpy(dst, src, bvec->bv_len);
+   kunmap_atomic(dst);
+   kunmap_atomic(src);
+   

[PATCH 3/4] mm: zswap: add supporting for zsmalloc

2013-08-18 Thread Bob Liu
Make zswap able to use zsmalloc as its allocator.
But note that zsmalloc doesn't do any mandatory reclaim of zswap pool pages;
if the zswap pool gets full, frontswap_store will be refused unless a
frontswap_get happens and frees some space.

The reason for not implementing reclaim of zsmalloc pages from the zswap pool
is that there is no requirement for it currently.
If we wanted to do mandatory reclaim, we would have to write those pages to a
real backend swap device, but most current users of zsmalloc are from the
embedded world, where there is often no real backend swap device at all.
This behaviour is the same as the previous zram!

Where zsmalloc's unpredictable performance characteristics when reclaiming a
single page are a concern, CONFIG_ZBUD is suggested instead.

Signed-off-by: Bob Liu 
---
 include/linux/zsmalloc.h |1 +
 mm/Kconfig   |4 +++
 mm/zsmalloc.c|9 --
 mm/zswap.c   |   73 +++---
 4 files changed, 81 insertions(+), 6 deletions(-)

diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h
index fbe6bec..72fc126 100644
--- a/include/linux/zsmalloc.h
+++ b/include/linux/zsmalloc.h
@@ -39,5 +39,6 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
 void zs_unmap_object(struct zs_pool *pool, unsigned long handle);
 
 u64 zs_get_total_size_bytes(struct zs_pool *pool);
+u64 zs_get_pool_size(struct zs_pool *pool);
 
 #endif
diff --git a/mm/Kconfig b/mm/Kconfig
index 48d1786..d80a575 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -519,6 +519,10 @@ choice
  in order to reduce fragmentation and has high compression density.
  However, this results in a unpredictable performance characteristics
  when reclaiming a single page.
+
+ Note: when using zsmalloc, there is no support for mandatory reclaim from
+ the compressed memory pool. If the pool gets full, frontswap_store will
+ be refused unless frontswap_get happened and freed some space.
 endchoice
 
 config MEM_SOFT_DIRTY
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 4bb275b..9df8d25 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -78,8 +78,7 @@
 #include 
 #include 
 #include 
-
-#include "zsmalloc.h"
+#include 
 
 /*
  * This must be power of 2 and greater than of equal to sizeof(link_free).
@@ -1056,6 +1055,12 @@ u64 zs_get_total_size_bytes(struct zs_pool *pool)
 }
 EXPORT_SYMBOL_GPL(zs_get_total_size_bytes);
 
+u64 zs_get_pool_size(struct zs_pool *pool)
+{
+   return zs_get_total_size_bytes(pool) >> PAGE_SHIFT;
+}
+EXPORT_SYMBOL_GPL(zs_get_pool_size);
+
 module_init(zs_init);
 module_exit(zs_exit);
 
diff --git a/mm/zswap.c b/mm/zswap.c
index deda2b6..8e8dc99 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -34,8 +34,11 @@
 #include 
 #include 
 #include 
+#ifdef CONFIG_ZBUD
 #include 
-
+#else
+#include 
+#endif
 #include 
 #include 
 #include 
@@ -189,7 +192,11 @@ struct zswap_header {
 struct zswap_tree {
struct rb_root rbroot;
spinlock_t lock;
+#ifdef CONFIG_ZBUD
struct zbud_pool *pool;
+#else
+   struct zs_pool *pool;
+#endif
 };
 
 static struct zswap_tree *zswap_trees[MAX_SWAPFILES];
@@ -374,12 +381,21 @@ static bool zswap_is_full(void)
  */
 static void zswap_free_entry(struct zswap_tree *tree, struct zswap_entry *entry)
 {
+#ifdef CONFIG_ZBUD
zbud_free(tree->pool, entry->handle);
+#else
+   zs_free(tree->pool, entry->handle);
+#endif
zswap_entry_cache_free(entry);
atomic_dec(&zswap_stored_pages);
+#ifdef CONFIG_ZBUD
zswap_pool_pages = zbud_get_pool_size(tree->pool);
+#else
+   zswap_pool_pages = zs_get_pool_size(tree->pool);
+#endif
 }
 
+#ifdef CONFIG_ZBUD
 /*
 * writeback code
 **/
@@ -595,6 +611,7 @@ fail:
spin_unlock(&tree->lock);
return ret;
 }
+#endif
 
 /*
 * frontswap hooks
@@ -620,11 +637,22 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
/* reclaim space if needed */
if (zswap_is_full()) {
zswap_pool_limit_hit++;
+#ifdef CONFIG_ZBUD
if (zbud_reclaim_page(tree->pool, 8)) {
zswap_reject_reclaim_fail++;
ret = -ENOMEM;
goto reject;
}
+#else
+   /*
+* zsmalloc has unpredictable performance
+* characteristics when reclaiming, so don't support
+* mandatory reclaiming from zsmalloc
+*/
+   zswap_reject_reclaim_fail++;
+   ret = -ENOMEM;
+   goto reject;
+#endif
}
 
/* allocate entry */
@@ -647,8 +675,9 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
 
/* store */
len = dlen + sizeof(struct zswap_header);
+#ifdef CONFIG_ZBUD
ret = zbud_alloc(tree->pool, len, __GFP_NORETRY | __GFP_NOWARN,
-   );

[PATCH 2/4] mm: promote zsmalloc to mm/

2013-08-18 Thread Bob Liu
zsmalloc is a new slab-based memory allocator for storing
compressed pages.  It is designed for low fragmentation and
high allocation success rate on large, but <= PAGE_SIZE, allocations.

zsmalloc differs from the kernel slab allocator in two primary
ways to achieve these design goals.

zsmalloc never requires high order page allocations to back
slabs, or "size classes" in zsmalloc terms. Instead it allows
multiple single-order pages to be stitched together into a
"zspage" which backs the slab.  This allows for higher allocation
success rate under memory pressure.

Also, zsmalloc allows objects to span page boundaries within the
zspage.  This allows for lower fragmentation than could be had
with the kernel slab allocator for objects between PAGE_SIZE/2
and PAGE_SIZE.  With the kernel slab allocator, if a page compresses
to 60% of its original size, the memory savings gained through
compression are lost to fragmentation because another object of
the same size can't be stored in the leftover space.

This ability to span pages results in zsmalloc allocations not being
directly addressable by the user.  The user is given a
non-dereferenceable handle in response to an allocation request.
That handle must be mapped, using zs_map_object(), which returns
a pointer to the mapped region that can be used.  The mapping is
necessary since the object data may reside in two different
noncontiguous pages.
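
A minimal usage sketch of the API added below (illustrative only; error
handling is trimmed and the buffer/length names are made up):

	#include <linux/zsmalloc.h>

	static int example_store(void *compressed_buf, size_t compressed_len)
	{
		struct zs_pool *pool = zs_create_pool(GFP_KERNEL);
		unsigned long handle = zs_malloc(pool, compressed_len);
		void *dst;

		/* the handle is not a pointer: it must be mapped before use */
		dst = zs_map_object(pool, handle, ZS_MM_WO);
		memcpy(dst, compressed_buf, compressed_len);
		zs_unmap_object(pool, handle);

		/* ... later: release the object and the pool ... */
		zs_free(pool, handle);
		zs_destroy_pool(pool);
		return 0;
	}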

Signed-off-by: Bob Liu 
---
 include/linux/zsmalloc.h |   43 ++
 mm/Kconfig   |   35 +-
 mm/Makefile  |1 +
 mm/zsmalloc.c| 1063 ++
 4 files changed, 1131 insertions(+), 11 deletions(-)
 create mode 100644 include/linux/zsmalloc.h
 create mode 100644 mm/zsmalloc.c

diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h
new file mode 100644
index 000..fbe6bec
--- /dev/null
+++ b/include/linux/zsmalloc.h
@@ -0,0 +1,43 @@
+/*
+ * zsmalloc memory allocator
+ *
+ * Copyright (C) 2011  Nitin Gupta
+ *
+ * This code is released using a dual license strategy: BSD/GPL
+ * You can choose the license that better fits your requirements.
+ *
+ * Released under the terms of 3-clause BSD License
+ * Released under the terms of GNU General Public License Version 2.0
+ */
+
+#ifndef _ZS_MALLOC_H_
+#define _ZS_MALLOC_H_
+
+#include 
+
+/*
+ * zsmalloc mapping modes
+ *
+ * NOTE: These only make a difference when a mapped object spans pages
+ */
+enum zs_mapmode {
+   ZS_MM_RW, /* normal read-write mapping */
+   ZS_MM_RO, /* read-only (no copy-out at unmap time) */
+   ZS_MM_WO /* write-only (no copy-in at map time) */
+};
+
+struct zs_pool;
+
+struct zs_pool *zs_create_pool(gfp_t flags);
+void zs_destroy_pool(struct zs_pool *pool);
+
+unsigned long zs_malloc(struct zs_pool *pool, size_t size);
+void zs_free(struct zs_pool *pool, unsigned long obj);
+
+void *zs_map_object(struct zs_pool *pool, unsigned long handle,
+   enum zs_mapmode mm);
+void zs_unmap_object(struct zs_pool *pool, unsigned long handle);
+
+u64 zs_get_total_size_bytes(struct zs_pool *pool);
+
+#endif
diff --git a/mm/Kconfig b/mm/Kconfig
index 8028dcc..48d1786 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -478,21 +478,10 @@ config FRONTSWAP
 
  If unsure, say Y to enable frontswap.
 
-config ZBUD
-   tristate
-   default n
-   help
- A special purpose allocator for storing compressed pages.
- It is designed to store up to two compressed pages per physical
- page.  While this design limits storage density, it has simple and
- deterministic reclaim properties that make it preferable to a higher
- density approach when reclaim will be used.
-
 config ZSWAP
bool "Compressed cache for swap pages (EXPERIMENTAL)"
depends on FRONTSWAP && CRYPTO=y
select CRYPTO_LZO
-   select ZBUD
default n
help
  A lightweight compressed cache for swap pages.  It takes
@@ -508,6 +497,30 @@ config ZSWAP
  they have not be fully explored on the large set of potential
  configurations and workloads that exist.
 
+choice
+   prompt "Select memory allocator for compressed pages"
+   depends on ZSWAP
+   default ZBUD
+
+   config ZBUD
+   bool "zbud"
+   help
+ A special purpose allocator for storing compressed pages.
+ It is designed to store up to two compressed pages per physical
+ page.  While this design limits storage density, it has simple and
+ deterministic reclaim properties that make it preferable to a higher
+ density approach when reclaim will be used.
+
+   config ZSMALLOC
+   bool "zsmalloc"
+   help
+ zsmalloc is a slab-based memory allocator designed to store
+ compressed RAM pages.  zsmalloc uses virtual memory mapping
+ in order to r
